Robust Botnet DGA Detection: Blending XAI and OSINT for Cyber Threat Intelligence Sharing

We investigated 12 years of DNS query logs from our campus network and identified phenomena of malicious botnet domain generation algorithm (DGA) traffic. DGA-based botnets are difficult to detect using cyber threat intelligence (CTI) systems based on blocklists; artificial intelligence (AI)/machine learning (ML)-based CTI systems are required. This study (1) proposed a model to detect DGA-based traffic based on statistical features with datasets comprising 55 DGA families, (2) discussed how CTI can be expanded with the computable CTI paradigm, and (3) described how to improve the explainability of the model outputs by blending explainable AI (XAI) and open-source intelligence (OSINT) to address trust problems, serving as an antidote to skepticism toward shared models and preventing automation bias. We define the XAI-OSINT blend as the aggregation of OSINT for AI/ML model outcome validation. Experimental results show the effectiveness of our models (96.3% accuracy). Our random forest model provides better robustness against three state-of-the-art DGA adversarial attacks (CharBot, DeepDGA, MaskDGA) than character-based deep learning models (Endgame, CMU, NYU, MIT). We demonstrate the sharing mechanism and confirm that the XAI-OSINT blend improves trust for CTI sharing, providing evidence that validates our proposed computable CTI paradigm to assist security analysts in security operations centers using an automated, explainable OSINT approach (as a second opinion). Therefore, computable CTI reduces manual intervention in critical cybersecurity decision-making.

We analyzed a 12-year dataset of DNS server query logs (4,383 days from 2004 to 2015) from our campus's full cache resolver. Surprisingly, 107 DNS queries (malicious domain names) were detected as botnet traffic (Fig. 1). These results were obtained using 803,333 malicious domain names from our blocklist dataset; however, owing to the characteristics of algorithmically generated domains in a DGA [6], depending only on a domain blocklist might be insufficient. These characteristics indicate that an AI/machine learning (ML) model for CTI systems is necessary.
In this paper, we developed a botnet DGA classification model and compared it with a recent previous work [7] that used the same approach (statistical features-based) and the same ML algorithm (random forest). We also compared our model's robustness with that of previous character-based deep learning models [8] against three state-of-the-art DGA adversarial ML attacks (MaskDGA [9], CharBot [10], and DeepDGA [11]). Furthermore, using the DGA detection problem as a case study, we demonstrated our approach to extending the current CTI sharing paradigm, described in the next section.
Therefore, this research aims to fill this gap by proposing an extension of actionable CTI, namely computable CTI, a new paradigm in CTI sharing. We define computable CTI as the next level of actionable CTI, extending the European Union Agency for Cybersecurity (ENISA)'s definition of actionable CTI with AI/ML computability criteria [14]. Furthermore, the computable CTI paradigm encourages sharing the AI or ML models of CTI systems within cybersecurity communities. As marketplaces for AI models already exist, achieving the concept of extending CTI sharing with AI/ML models in a practical manner is meaningful. Challenging issues with the broad adoption of computable CTI sharing include potential decision bias, privacy preservation [15], [16], and robustness against adversarial ML attacks [17], [18]. Fig. 2 shows the conceptual design of computable CTI. Various CTI sources are available in the market and even publicly available to communities, such as open-source intelligence (OSINT) [19]. Our proposed paradigm uses the OSINT ecosystem to enhance XAI techniques by providing a second opinion from IoCs obtained from OSINT, thus preventing automation bias when using AI/ML for security automation of CTI applications. The interaction works in two ways: retrieving IoCs from OSINT for a second opinion and submitting newly confirmed IoC findings to OSINT repositories. The submission of new information to OSINT repositories must be conducted carefully, as it can be a way to inject false or poor-quality findings, intentionally or unintentionally, that cause issues for other detectors relying on such OSINT and poison any training process [20].
As highlighted in [21] and [22], trust is a critical ingredient in the CTI sharing ecosystem. The increasing popularity of OSINT, where communities can subscribe and add new IoCs of malicious malware, dangerous domain names/IP addresses, and other threat-related information, heightens concerns about trust and validity in CTI because fake/false IoC information can be quickly submitted to OSINT repositories [19].
Recently, XAI has become an important copilot assisting human users and experts in making critical decisions [23]. XAI can support serious decisions in the medical domain [24] and security decisions in dealing with cyber threats in complicated, mission-critical situations. Because explainability is mandated in the European Union General Data Protection Regulation (GDPR), it has become critical for practitioners across industries [25]. However, as stressed in [26], explanations in an XAI implementation should be tailored to the context and other considerations. Achieving trustable XAI remains one of the grand challenges pursued by researchers in this field [27], [28]. Our research objective is to propose blending XAI and OSINT to address this trust problem.

C. CONTRIBUTIONS
First, this study expands CTI with computable CTI to reduce human intervention in the cybersecurity decision-making process [23], [24]. Moreover, we improve the explainability of the AI/ML model outputs by blending the XAI and OSINT methods to enhance trust in CTI sharing. We use the DGA detection problem to demonstrate a proof of concept and validate our proposed computable CTI paradigm.
Second, this research proposes a model to detect botnet DGA-based traffic. Compared with a recent study [7], which uses a similar approach (statistical features-based with a random forest algorithm) but depends on 24 features, the seven features proposed in our paper are enough to deliver satisfactory performance. Moreover, compared with character-based deep learning models [8], our proposed model provides a better defense against three state-of-the-art DGA adversarial ML attacks (CharBot [10], DeepDGA [11], MaskDGA [9]).
Third, this study contributes to the cybersecurity literature as a gateway/direction for future research on AI/ML-based CTI sharing. We released the code and datasets of this study to the IEEE Code Ocean [29] and IEEE DataPort [30] to facilitate reproducible research.

II. RELATED WORKS
Although multiple approaches can be used to guard DNS systems against botnet malware (which hides its communications with the C&C server), the conventional DNS security approach of filtering based on blocklists is problematic because no blocklist is completely reliable [31], [32]. The most recent survey/review papers on DNS-based DGA botnet malicious domain name detection can be found in [5], [33], [34].
In a recent publication on botnet DGA detection, Hoang and Vu [7] proposed an improved random forest-based model that calculates 24 statistical features, such as character n-gram frequency distributions of a domain name, entropy values, whether the first character is a digit, and various other statistical measures. Their experimental results using a dataset of 39 DGA families show enhanced performance compared with their previous works [7].
However, one may argue that calculating 24 features increases computational complexity. Thus, we propose using only seven features: entropy, relative entropy against Alexa, the minimum relative entropy against the botnet datasets, information radius, character length, a new feature generated using a decision tree algorithm, and a domain name's reputation score. In addition, we experiment with broader coverage of DGA datasets (55 DGA families in total).
Besides the statistical features approach, the character-based classification approach, which relies on character-level embeddings of a domain name, can also be used for DGA classification. Yu et al. [8] compiled character-level deep learning models with various architectures for DGA classification. To evaluate our work, we implemented four deep learning models: Endgame [35], CMU [36], NYU [37], and MIT [38].
The current state-of-the-art DGA attacks are CharBot [10], DeepDGA [11], and MaskDGA [9], in which the authors employed various sophisticated approaches, such as adversarial ML evasion attacks/adversarial examples, to generate domain names that evade DGA classifiers. Sidi et al. [9] demonstrated that the MaskDGA attack reduces the performance of a DGA classifier, evading the detection system. Peck et al. [10] showed the effectiveness of the CharBot attack, reducing the detection rate of a classifier to as low as 1.69%; even retraining the classifiers is not a viable defense strategy.
We applied those three DGA attacks (CharBot, DeepDGA, and MaskDGA) to check the performance of our model for botnet DGA detection applications in dealing with harsh adversarial attacks.
To grasp the current state of the art in XAI, literature reviews and surveys are available in [39]-[43]. Table 1 shows our XAI method selection position, adopting the XAI taxonomy classification systems. The details of our proposed second-opinion approach using the XAI-OSINT blend are elaborated in the subsequent section. In this study, blending XAI and OSINT (as a second opinion) for the CTI system delivers a practical means to address trust in the computable CTI ecosystem, i.e., either a lack of trust or too much trust (automation bias). The relation between explanation and trust is important [44].

B. FEATURES
We analyzed the datasets by calculating entropy using Shannon's function (1) as our model's first feature. As reported in our previous publication [48], entropy fluctuations can indicate an increasing number of unique random query keywords in DNS queries, which are frequently observed during dangerous situations, such as Kaminsky-like attacks. Then, we extended a statistical measurement [49] using relative entropy (RE) via the Kullback-Leibler divergence (2),
where Q is the baseline distribution calculated on legitimate data (the Alexa Top 1M domains or the ten botnet domain datasets) and P is the target distribution (i.e., the domain in the DNS query log to be verified).
Our model's second feature is RE-Alexa, which measures the distance (or similarity) between the domain in question and the Alexa domain unigram distributions. The third feature is Min-RE-Botnets: we calculate the RE value of a suspicious domain against each botnet dataset and take the minimum value as the Min-RE-Botnets value. Inspired by Sharifnya's work [50], the fourth feature in the proposed model is the information radius (IRad) value, calculated using the Jensen-Shannon divergence function (3). This function is a generalization of the Jensen-Shannon divergence that compares more than two probability distributions. The proposed model uses this function to calculate a target domain name's distance from the botnet datasets.
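To make these feature definitions concrete, the following minimal Python sketch (not our released code; the function names, smoothing constant, and placeholder baseline strings are assumptions for illustration) computes the Entropy, RE-Alexa, Min-RE-Botnets, IRad, and CharLength values for a single domain:

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-."

def char_dist(text, eps=1e-6):
    """Smoothed unigram character distribution of a domain string."""
    counts = Counter(c for c in text.lower() if c in ALPHABET)
    total = sum(counts.values()) + eps * len(ALPHABET)
    return {c: (counts.get(c, 0) + eps) / total for c in ALPHABET}

def shannon_entropy(domain):
    """Feature 1: Shannon entropy (Eq. 1) of the domain's characters."""
    counts = Counter(domain.lower())
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def relative_entropy(p, q):
    """Kullback-Leibler divergence D(P || Q) between two distributions (Eq. 2)."""
    return sum(p[c] * math.log2(p[c] / q[c]) for c in ALPHABET)

def information_radius(p, botnet_dists):
    """Information radius (Eq. 3): generalized Jensen-Shannon divergence of P
    against the botnet distributions."""
    dists = [p] + list(botnet_dists)
    mean = {c: sum(d[c] for d in dists) / len(dists) for c in ALPHABET}
    return sum(relative_entropy(d, mean) for d in dists) / len(dists)

# Illustrative usage; alexa_dist and botnet_dists would be built from the Alexa
# Top 1M list and the per-family botnet domain lists described in the paper.
alexa_dist = char_dist("googlefacebookyoutubewikipedia")            # placeholder baseline
botnet_dists = [char_dist("xjwqkzpd3fqpmzlr"), char_dist("aaabbbcccdddeee")]

p = char_dist("xjwqkzpd3f.com")
features = {
    "Entropy": shannon_entropy("xjwqkzpd3f.com"),
    "RE-Alexa": relative_entropy(p, alexa_dist),
    "Min-RE-Botnets": min(relative_entropy(p, b) for b in botnet_dists),
    "IRad": information_radius(p, botnet_dists),
    "CharLength": len("xjwqkzpd3f.com"),
}
print(features)
```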
The next feature is the domain name character length (CharLength). This feature is suitable for botnet DGA detection because several DGA algorithms in our dataset demonstrate similar character lengths, characterizing a unique property of these randomly generated domains.
Then, a new feature is generated using a decision tree algorithm (TreeNewFeature). Here, we combined the Entropy, RE-Alexa, Min-RE-Botnets, and CharLength features to train a decision tree predictive model and used its prediction result as the new feature.
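A minimal sketch of this step, assuming a scikit-learn decision tree with default settings (the helper name and the choice of fitting on the training split only are our own), could look as follows:

```python
from sklearn.tree import DecisionTreeClassifier

def add_tree_new_feature(X_base_train, y_train, X_base):
    """Hypothetical sketch of TreeNewFeature: fit a decision tree on the four base
    features (Entropy, RE-Alexa, Min-RE-Botnets, CharLength) using the training
    split only, then use its prediction as an additional feature column."""
    tree = DecisionTreeClassifier(random_state=42)   # tree settings are an assumption
    tree.fit(X_base_train, y_train)
    return tree.predict(X_base)                      # appended as the TreeNewFeature column
```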
The last feature is the Alexa reputation score (ReputationAlexa). This approach was inspired by Zhao's work [51]. Here, we used the Alexa Top 1M domains to generate a weight matrix for calculating a domain reputation value. The procedure to generate the weight matrix begins by reading all 1M domains from Alexa, learning the vocabulary dictionary of character n-grams of length 3 to 5, and returning the term-document matrix. Note that we used the base-10 logarithm of the total n-gram matrix of all Alexa 1M domains, as shown in (4),
where W is the weight matrix used to calculate the reputation score and C_n-gram is the character n-gram frequency matrix. When calculating a target domain's reputation score, we first extract token counts from the target domain using the character n-gram vocabulary constructor. This calculation is the same as generating a document-term matrix using the Alexa Top 1M.
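A hedged sketch of this procedure, using scikit-learn's CountVectorizer with character 3- to 5-grams and a small placeholder list instead of the full Alexa Top 1M (the final weighted-sum scoring is our reading of the procedure, not a verbatim reproduction), might look as follows:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# `alexa_domains` is a placeholder for the Alexa Top 1M list used in the paper.
alexa_domains = ["google.com", "facebook.com", "youtube.com", "wikipedia.org"]

vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 5))
term_doc = vectorizer.fit_transform(alexa_domains)          # learn the n-gram vocabulary
ngram_totals = np.asarray(term_doc.sum(axis=0)).ravel()     # total frequency per n-gram
weight_matrix = np.log10(ngram_totals)                      # Eq. (4): base-10 log of totals

def reputation_alexa(domain):
    """Score a target domain by its n-gram overlap with the Alexa-derived weights."""
    counts = vectorizer.transform([domain]).toarray().ravel()
    return float(counts @ weight_matrix)

print(reputation_alexa("google.com"), reputation_alexa("xjwqkzpd3f.com"))
```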

C. EXPERIMENTS
We conducted three experiments:
1) We used multiple supervised ML algorithms and compared the results to select the algorithm with the best accuracy. Here, five algorithms (naive Bayes, logistic regression, extra tree, random forest, and ensemble learning) were computed using Scikit-Learn [52].
2) We compared our random forest model with the latest previous work [7].
3) To check the performance of our model for botnet DGA detection in dealing with harsh adversarial attacks, we conducted a robustness evaluation of our classifier against three state-of-the-art DGA attacks (CharBot [10], DeepDGA [11], MaskDGA [9]) and compared it with four deep learning models (Endgame [35], CMU [36], NYU [37], MIT [38]).
The evaluation metric is given in (5), where TP, TN, FP, and FN stand for true positive, true negative, false positive, and false negative, respectively. Table 3 summarizes the variables/parameters used in the random forest model.
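A minimal sketch of experiment 1, using a synthetic placeholder in place of our real seven-feature table and default scikit-learn settings rather than the exact parameters of Table 3, is shown below:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Placeholder seven-feature matrix standing in for the real feature table.
X, y = make_classification(n_samples=2000, n_features=7, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

base = {
    "naive Bayes": GaussianNB(),
    "logistic regression": LogisticRegression(max_iter=1000),
    "extra tree": ExtraTreesClassifier(n_estimators=100, random_state=42),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
models = dict(base)
models["ensemble learning"] = VotingClassifier(estimators=list(base.items()), voting="hard")

for name, clf in models.items():
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    # Eq. (5): accuracy = (TP + TN) / (TP + TN + FP + FN)
    print(f"{name}: accuracy = {accuracy_score(y_test, pred):.3f}")
```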

D. BLENDING XAI AND OSINT
This research applied four existing XAI techniques (ANCHORS, local interpretable model-agnostic explanations (LIME), SHapley Additive exPlanations (SHAP), and counterfactual explanation) and proposed our approach (the XAI-OSINT blend) to produce a second-opinion explanation (Table 1). We take advantage of the SHAP method to present a global explanation, delivering a game-theory-inspired feature relevance explanation [53]. SHAP is based on the game-theoretically optimal Shapley values. We focus on the model-agnostic approach to allow more freedom to use any advanced algorithm for the classification model. Therefore, we implemented the KernelExplainer, a kernel-based estimation approach for Shapley values inspired by local surrogate models, for the SHAP explanation. Moreover, we use SHAP's force plot to provide a local explanation.
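A hedged sketch of this SHAP step is shown below; it reuses the placeholder data from the experiment sketch above, fits a stand-in random forest, and assumes the legacy list-per-class output of KernelExplainer (newer SHAP releases return Explanation objects instead):

```python
import shap
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["Entropy", "RE-Alexa", "Min-RE-Botnets", "IRad",
            "CharLength", "TreeNewFeature", "ReputationAlexa"]

# Stand-in for the trained random forest, fitted on the placeholder split above.
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

background = shap.sample(X_train, 100)                       # summarize the training data
explainer = shap.KernelExplainer(model.predict_proba, background)
shap_values = explainer.shap_values(X_test[:50])             # one array per class

# Global explanation (summary plot) and local explanation (force plot).
shap.summary_plot(shap_values[1], X_test[:50], feature_names=FEATURES)
shap.force_plot(explainer.expected_value[1], shap_values[1][0, :],
                X_test[0, :], feature_names=FEATURES, matplotlib=True)
```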
The next XAI method implemented in this study is LIME [54]. LIME trains local surrogate models to explain individual predictions/classifications. It provides a local explanation, explaining an individual classification result of a black-box model. Therefore, users will understand why the CTI system classifies a suspected domain name as a legit or botnet DGA domain name.
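A minimal LIME sketch, reusing the placeholder data, FEATURES list, and fitted model from the sketches above, might look as follows:

```python
from lime.lime_tabular import LimeTabularExplainer

# LIME trains a local surrogate around one instance to explain that single decision.
lime_explainer = LimeTabularExplainer(
    X_train, feature_names=FEATURES,
    class_names=["legit", "botnet DGA"], mode="classification")

exp = lime_explainer.explain_instance(X_test[0], model.predict_proba, num_features=7)
print(exp.as_list())   # per-feature contributions behind this one classification
```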
Next, we applied ANCHORS, an improvement over LIME, to predict how a model will behave with less effort and higher precision [55]. ANCHORS is a rule-based learner that explains by simplification. We expect an explanation expressed as easy-to-understand IF-THEN rules from this method. This type of expression might be more convenient for explaining the model's behavior: why did the CTI system decide that a domain name is a botnet DGA domain, or why did the CTI system classify this suspicious-looking domain name as a legit domain name? We used Alibi [56] to implement ANCHORS in our CTI system.
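A hedged Alibi sketch (the precision threshold and rule formatting are illustrative choices, and the data/model again come from the earlier sketches) is shown below:

```python
from alibi.explainers import AnchorTabular

# Rule-based ANCHORS explanation: IF-THEN conditions that "anchor" the prediction.
anchor_explainer = AnchorTabular(model.predict, feature_names=FEATURES)
anchor_explainer.fit(X_train)

explanation = anchor_explainer.explain(X_test[0], threshold=0.95)
label = ["legit", "botnet DGA"][int(model.predict(X_test[:1])[0])]
print("IF", " AND ".join(explanation.anchor), "THEN classify as", label)
```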
Next, we applied a counterfactual explanation, adding explainability by example. We used the What-If Tool [57] to implement this functionality, thus enabling visualization that highlights the nearest counterfactual data point (if a legit domain name is selected, the nearest botnet DGA domain name will be shown, and vice versa). This tool enables cybersecurity analysts to detect the minimal changes in feature values that make the CTI system produce a different classification result. Thus, CTI systems could gain more trust from users because they understand the explanation.
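The What-If Tool itself is an interactive notebook widget, so the sketch below only illustrates the underlying nearest-counterfactual idea rather than the tool's API; the helper name and the L2 distance choice are our assumptions:

```python
import numpy as np

def nearest_counterfactual(x, X_pool, model):
    """Sketch of the nearest-counterfactual idea: return the pool instance closest
    to x whose predicted class differs from x's predicted class."""
    own_class = model.predict(x.reshape(1, -1))[0]
    candidates = X_pool[model.predict(X_pool) != own_class]
    distances = np.linalg.norm(candidates - x, axis=1)   # L2 distance in feature space
    return candidates[np.argmin(distances)]

# e.g., the closest botnet-DGA-looking neighbour of a legit test domain:
print(nearest_counterfactual(X_test[0], X_test, model))
```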
For the second opinion, we used two OSINT sources (Google Safe Browsing and OTX AlienVault) [58]. We sent application programming interface (API) queries to these sources to retrieve a comment/report on the suspected domain in question. We fused this information with our botnet DGA model's output as a second opinion. The IoCs aggregated from OSINT confirm the AI/ML model's output, and classification results can be submitted to OSINT repositories after thorough expert examination to prevent submitting false/poor-quality findings. The submission of new IoC information to OSINT repositories requires extra caution, as false or poor-quality findings can, intentionally or unintentionally, cause issues for other detectors relying on such OSINT and poison any training process. Therefore, computable CTI can advance OSINT communities with IoCs of new threats (Fig. 2).
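A hedged sketch of these second-opinion queries is shown below; the API keys are placeholders, the response handling is simplified compared with our CTI system, and the request fields follow the publicly documented OTX Python SDK and Safe Browsing v4 Lookup API at the time of writing:

```python
import requests
from OTXv2 import OTXv2, IndicatorTypes

def otx_opinion(domain, api_key="YOUR_OTX_API_KEY"):
    """Pull the OTX AlienVault report (pulses, related indicators) for a domain."""
    otx = OTXv2(api_key)
    return otx.get_indicator_details_full(IndicatorTypes.DOMAIN, domain)

def gsb_opinion(domain, api_key="YOUR_GSB_API_KEY"):
    """Ask the Google Safe Browsing v4 Lookup API whether the domain is flagged."""
    body = {
        "client": {"clientId": "cti-demo", "clientVersion": "1.0"},
        "threatInfo": {
            "threatTypes": ["MALWARE", "SOCIAL_ENGINEERING", "UNWANTED_SOFTWARE"],
            "platformTypes": ["ANY_PLATFORM"],
            "threatEntryTypes": ["URL"],
            "threatEntries": [{"url": f"http://{domain}/"}],
        },
    }
    resp = requests.post(
        "https://safebrowsing.googleapis.com/v4/threatMatches:find",
        params={"key": api_key}, json=body, timeout=10)
    return bool(resp.json().get("matches"))   # True if any threat list matches

# The two opinions are then fused with the model's own classification output.
```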

IV. RESULTS AND DISCUSSIONS
A. COMPARING THE ACCURACY OF ML ALGORITHMS
The results of our experiments are shown in Tables 4 and 5. Overall, the random forest model achieved the highest accuracy, followed by the extra tree algorithm. Note that naive Bayes always showed the lowest performance among the compared algorithms. The highest accuracy (96.2%) was obtained using random forest with all seven features. The top three essential features were CharLength, ReputationAlexa, and TreeNewFeature.
We analyzed all features using statistical tests to select the features with the strongest relationship with the output variable. Fig. 3 shows the results of the univariate selection chi-squared test. The ReputationAlexa, CharLength, and TreeNewFeature features had the strongest relationship with the class output. Then, we investigated how the features are related to each other using a correlation matrix. As shown in Fig. 4, the TreeNewFeature, CharLength, Entropy, and RE-Alexa features positively correlate with the output, whereas a negative correlation can be observed for the Min-RE-Botnets and IRad features. Furthermore, we performed a feature importance analysis to score each feature in our proposed model. As shown in Fig. 5, the TreeNewFeature, ReputationAlexa, and CharLength features obtained the highest scores, which indicates that these features are essential for the output variable.

B. TIME COMPLEXITY TO CALCULATE THE FEATURES
ML classification that uses a statistical-based approach requires computations to calculate the features. Fig. 6 shows the computational cost of our approach in terms of the time complexity of calculating the features needed in our model. Min-RE-Botnets and IRad require longer computation times than the other features, as equations (2) and (3) have O(n) linear time complexity, where n is the number of DGA families (55 families in our experiments). This becomes a disadvantage when the number of DGA families grows. However, the ReputationAlexa feature does not require heavy computations, as the preparation needs to be done only once during the model training step: reading all the domains from Alexa and then learning the vocabulary dictionary of character n-grams of length 3 to 5 to generate the weight matrix.

C. COMPARISON WITH THE PREVIOUS WORK
Table 6 provides a comparison between our proposed random forest model and the previous work (Hoang and Vu [7]), which used the same approach (statistical features-based) and the same random forest algorithm. Using the same dataset settings as in [7, pp. 7-8] for all experiments, our model gives a better detection rate (with an average of 98.9% accuracy), despite our approach using only seven features compared with their approach, which depends on 24 features. These results give clear evidence of the advantage of the seven features proposed in our paper for delivering satisfactory botnet DGA detection performance.

D. ROBUSTNESS EVALUATION
First, we examined the performance of our random forest model with seven features using ground-truth datasets consisting of Alexa and 55 DGA families' domain names (1,998,502 domain names in total). As shown in Table 7, the character-based deep learning models produce slightly higher accuracy (∼99.0%) than our model (96.3% accuracy). However, the robustness evaluation with the CharBot, MaskDGA, and DeepDGA attacks (394,000 domain names in total) gives evidence that our model provides a better defense against the three DGA attacks combined (44.2% accuracy). Evaluation against the individual DGA attacks shows that our model has better robustness against the CharBot and DeepDGA attacks, but not against the MaskDGA attack.
These results confirm the advantage of using our model for botnet DGA detection in dealing with harsh DGA attacks, in which a novel DGA attack can drop the accuracy of a DGA classifier to as low as 9.1% (in the case of the NYU model tested with the CharBot attack). This tendency is consistent with previous works [9]-[11].

E. SHARING MECHANISM IN COMPUTABLE CTI
We identified several options with the potential for implementing AI/ML model sharing: Docker containers; native serialization (dependent on the tools used, e.g., joblib for Python); the PMML/XML-based predictive model interchange format; and the Open Neural Network Exchange (ONNX), the open standard for ML interoperability. We serialized our final ML model for botnet DGA detection using both the ONNX and Python joblib approaches. Serialization and deserialization ran smoothly. Although the trained model's file size can become large when the training data are enormous, sharing a trained/ready-to-use model is very convenient for others who need to analyze botnet DGA traffic without the burden of building and training a model.
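A minimal serialization sketch, assuming the fitted seven-feature random forest from the experiment sketches and illustrative file names, is shown below:

```python
import joblib
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# Native Python route: serialize the fitted model with joblib.
joblib.dump(model, "dga_rf_model.joblib")

# Interoperable route: convert the same model to ONNX for sharing across platforms.
onnx_model = convert_sklearn(
    model, initial_types=[("features", FloatTensorType([None, 7]))])
with open("dga_rf_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

# A recipient can deserialize and score domains with onnxruntime, without our stack:
# import onnxruntime as ort
# sess = ort.InferenceSession("dga_rf_model.onnx")
# preds = sess.run(None, {"features": X_test[:5].astype("float32")})
```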

F. BLENDING XAI AND OSINT
First, our CTI system displays a global explanation of the model (Fig. 7). Our model considers character length as a key feature in recognizing botnet DGA domain names. Therefore, a domain with an excessively long character length tends to be a botnet DGA domain name, which is true based on our ground-truth dataset. To trust the model, users must understand what the model is good at and when the model could go wrong. We provide a visualization (Fig. 8) that enables a cybersecurity analyst to see where the classification is wrong, such as when a legit domain name exists but the model classifies it as a botnet DGA domain name. Even though the model has high accuracy, classifying a well-known legit domain (such as google.com) as a botnet DGA domain is unacceptable.
We provide local explanations, explaining a single decision output to present a simplified view of the logic behind the model's decision. We show an example of why the CTI system determines that a normal-looking domain name (whose character length is not excessive) is classified as a botnet DGA domain name (Fig. 8). LIME [54] shows easy-to-understand plots. Even though the short character length makes it look like a legit domain name, the reputation score calculated against the Alexa Top 1M domain names drives the opposite decision. ANCHORS provides a similar explanation, but in the IF-THEN rule format [55]. Moreover, SHAP [53] displays how each feature's value forces the decision toward a legit or botnet DGA classification result. After showing the SHAP, LIME, ANCHORS, and counterfactual explanations, we provide a second opinion (Fig. 9) by integrating API query results from two OSINT sources (Google Safe Browsing and OTX AlienVault). Therefore, we confirm that a good explanation and a second opinion (by implementing the XAI-OSINT blend) are keys to establishing trust in using the shared AI/ML model.
In our study of botnet DGA detection, automation bias refers to a situation in which a cybersecurity analyst never doubts the AI/ML model's decision outputs, whatever they are, for example, when the model falsely detects a domain name as a malicious botnet DGA domain and the cybersecurity analyst trusts it too much. We emphasize that blending XAI and OSINT can mitigate automation bias through a second opinion.
Cyber false flags are hackers' tactics to deceive or misguide attribution attempts and conceal cyberattacks [59]. By blending XAI and OSINT into AI/ML-based CTI systems, cybersecurity analysts have a handy tool to compare any information from OSINT sources with the model's results taken from CTI-sharing repositories (measure twice, cut once: using the AI/ML model to confirm OSINT information). This highlights the usefulness of our proposed XAI-OSINT blend against cyber false-flag phenomena.

G. PRACTICAL IMPLICATIONS OF COMPUTABLE CTI
Reducing human intervention in cybersecurity decision-making using AI/ML automation will help security analysts in security operations center environments win the arms race against new cyber threats. The computable CTI paradigm emphasizes a robust AI/ML model with adversarial defense techniques, while also blending XAI and OSINT to address automation bias. For example, in our botnet DGA case study, OSINT data become a second opinion (or validation) for known DGA domain names. Thus, we achieve cybersecurity decision-making automation. When no information about a suspected domain name exists in the OSINT database, security analysts can still make a fair decision by referencing the explanations produced by the XAI techniques.
The computable CTI paradigm also encourages cybersecurity communities to contribute their carefully curated CTI detection outputs to enrich the IoC data in OSINT repositories. OSINT APIs can be integrated with AI/ML models to enable submitting new threat information to the OSINT database. In our case study of botnet DGA detection, we used OTX AlienVault's DirectConnect API to demonstrate submitting newly confirmed and validated findings when no OSINT is available for the botnet DGA domain names in question. Therefore, computable CTI implies two-way interaction: gaining benefits from aggregating OSINT threat data and contributing the latest threats' IoCs to tackle new global attack vectors.
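One possible route for such a contribution, sketched below with the OTXv2 Python SDK's create_pulse helper (the pulse fields are illustrative, and the exact DirectConnect workflow we used may differ), is:

```python
from OTXv2 import OTXv2

# Hedged sketch of contributing a confirmed finding back to OTX; the indicator,
# name, and TLP values are placeholders and should follow expert review first.
otx = OTXv2("YOUR_OTX_API_KEY")
otx.create_pulse(
    name="Confirmed botnet DGA domains (campus CTI case study)",
    public=True,
    indicators=[{"indicator": "xjwqkzpd3f.com", "type": "domain"}],
    tags=["botnet", "DGA"],
    TLP="green")
```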
Recently, we have observed the emergence of public repositories/marketplaces for ready-to-use AI/ML models, such as TensorFlow Hub. Various models for common problem domains (image, text, video, and audio) are available for transfer learning; however, AI/ML models for CTI applications are scarce [60], [61]. Our study is a gateway for future AI/ML model-based CTI-sharing research. Therefore, in this section, we elaborate on the frameworks needed in computable CTI (Fig. 2) to ensure that cybersecurity communities will be encouraged to share their AI/ML models for CTI sharing.
First, regarding the interoperability of the AI/ML model, we demonstrated how we could manage interoperability when sharing the AI/ML model using ONNX. Adopting this standard removes the barrier of being locked on one AI/ML platform. Sharing CTI models in the ONNX standard will reach a wider audience of cybersecurity communities.
Second, users' privacy must be protected because model sharing takes place among users. We propose adopting privacy labels (color-coded: white, green, amber, and red) relating to privacy-related measures and compliance with privacy regulations [62] on the shared models. Various privacy-preserving techniques can be employed when the models include storing, processing, and transferring private information [15], [16].
Third, the computable CTI paradigm encourages adopting the code-signing practice to ensure the integrity and authenticity of the shared AI/ML model. Sigstore, a recently announced project of The Linux Foundation aiming to foster the adoption of cryptographic signing, might become a catalyst for the wide adoption of computable CTI in cybersecurity and open-source communities.

V. CONCLUSION
First, we showcase a novel model for botnet DGA detection. Our random forest model achieved 96.3% accuracy (tested with datasets of 55 botnet DGA families) and outperformed the previous work (see Section IV.C). Our model is also more robust against three state-of-the-art DGA adversarial attacks (MaskDGA, CharBot, and DeepDGA) than the previous works (see Section IV.D).
Second, we highlight the practicality of blending XAI and OSINT to deliver better AI explainability through a second-opinion approach, mimicking the second-opinion practice in hospital/medical settings to confirm results/findings. We advocate the XAI-OSINT blend as an antidote to skepticism toward the model's output, which might contribute to trust in the CTI system and prevent automation bias when users place too much trust in the CTI system's output. Blending XAI and OSINT also has the potential to address the false-flag problem.
Third, we underline the case study of botnet DGA detection with the XAI-OSINT blend as evidence to validate our proposed computable CTI paradigm. Improving trust might result in a paradigm shift: cybersecurity communities will leave the traditional CTI-sharing paradigm (sharing only threat indicators, such as threat domain names) and start to share AI/ML models for CTI systems. With the emergence of the computable CTI-sharing paradigm, additional collaboration among cybersecurity communities will occur to develop advanced AI/ML-based CTI systems, for instance, using transfer-learning techniques to develop new AI/ML models for new cybersecurity tasks/problems utilizing the shared models.
The limitations of our DGA detection model are the time complexity of calculating the features (Section IV.B) and the limited robustness against MaskDGA attacks (Section IV.D). Future improvements should focus on crafting better features and adversarial defense strategies. Moving target defense (MTD) [63] can potentially raise the model's robustness by combining various models to work together.