Applications of Artificial Intelligence to Detect Android Botnets: A Survey

With the growing popularity of Android smart devices, and especially with the acceleration of digital adoption and transformation brought on by the COVID-19 pandemic, protecting these devices has become increasingly important, as they carry very sensitive data. Malicious attacks target Android because it is open source and has the highest adoption rate among mobile platforms. Botnet attacks are among the most frequently overlooked types of attacks. In addition, there is a lack of review papers that clarify the state of knowledge and indicate research gaps in detecting Android botnets. Therefore, in this paper, we conduct a literature review to highlight the contributions of several studies in the domain of Android Botnet detection. This study attempts to provide a comprehensive overview of the AI applications deployed so far, for future researchers interested in performing Android Botnet detection studies. We focus on the applications of artificial intelligence and its two prominent subdomains, machine learning (ML) and deep learning (DL). The study presents available Android Botnet datasets suitable for detection using ML and DL algorithms. Moreover, this study provides an overview of the methodologies and tools utilized in APK analysis. The paper also provides a comprehensive taxonomy of Android Botnet detection methods and highlights a number of challenges encountered while analyzing Android Botnet detection techniques. The research gaps identified include an absence of hybrid analysis research in the area, as well as the lack of an up-to-date dataset and of a time-series dataset. The findings of this paper point to valuable directions for future research and development.


I. INTRODUCTION
The smartphone industry has changed people's lives dramatically. Smartphones are no longer just for communication; they have evolved into one of life's necessities. Furthermore, as a consequence of the considerable changes to daily work, schooling, and routines brought about by COVID-19-related social distancing rules and obligations, smart devices have become increasingly involved in people's everyday tasks. Android holds the largest share of the mobile operating system market worldwide, with a reported 72.21% of usage as of July 2021 [1]. Because of this popularity, its open-source nature [2], and the ease with which new applications may be added to Google's Play Store [3] thanks to the well-documented guidance for building new apps offered by the official Android website [4], Android attracts more attack attempts than any other system.
In this review, we give a broad overview of the literature on Android Botnet detection, including datasets, analysis methodologies, and detection techniques. We also go through the Machine Learning (ML) and Deep Learning (DL) approaches used in Android Botnet detection, as well as the best classifiers that have been evaluated. Table 1 summarizes some of the surveys conducted on Android Botnet detection.
The essential goal and contribution of this work are to study the current state of Android Botnet detection methods and give a comprehensive view of ML and DL techniques in this area, which has been identified as a gap that needs to be addressed.
In summary, the following are the key contributions of this paper:
1. Present available Android Botnet datasets appropriate for detection using ML and DL techniques.
2. Present an overview of the methodologies and tools utilized in APK analysis.
3. Provide a comprehensive taxonomy of Android Botnet detection approaches.
The rest of this paper is organized as follows: Section II describes the methodology used to select the reviewed papers. Section III provides an overview of Android Botnet characteristics. Section IV covers Android Botnet analysis methods. Section V reviews detection techniques for Android Botnets in the literature. Section VI analyzes the findings and provides an outline of the challenges that have been identified. Lastly, the paper is concluded in Section VII.

II. METHODOLOGY
This paper seeks to describe the various methodologies for detecting botnet attacks on Android smart devices that utilize ML and DL techniques. Hence, the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) technique was used to search for and select papers relevant to the study's scope. Figure 1 illustrates the process followed in this paper.
The databases considered in the identification phase were chosen from the best-known publishers and conferences, such as IEEE, ACM, and Elsevier. Only papers published between 2015 and 2021 were selected. To find relevant articles, different keywords were used, including Android intrusion detection, Android malware detection, Android security, Android Botnets, and Deep Learning in Android Botnet Detection. Initially, 107 papers covering the topic of Android malware detection were discovered, along with 29 articles covering Android APK analysis. There were no duplicate papers.
In terms of screening and eligibility, the abstract and methodology sections of these publications were assessed in order to select studies focusing on Android botnet attacks. Following that, the full-text articles were evaluated to identify studies that used ML and DL approaches in the Android botnet detection process. Two additional publications were included to address the dynamic analysis method in the analysis section.
As a result, a total of 20 publications were used to address the current state-of-the-art research on Android APK analysis methods and Android botnet detection strategies.

III. ANDROID BOTNET
Botnets are networks of "malware-compromised machines" [10]: host computers (or smart devices, the focus of this study) in arbitrary geographical locations, called bots, that are enslaved and controlled by one or several attackers, called botmasters, for future malicious actions. The infected devices might not act maliciously at first, but at the right time they can be remotely activated by the botmaster to carry out the desired goals, such as stealing information, incurring financial charges, spreading viruses or worms, and identity-related fraud and theft [11]. Anserverbot, Beanbot, and Geinimi are three well-known Android botnet families.

TABLE 1 Surveys conducted on Android Botnet detection
Paper | Year | Key Contribution
[5] | 2012 | Identified detailed characteristics of Android Botnets
 |  | Presented a taxonomy of mobile botnet attacks and architecture, and discussed detection approaches
 |  | Provided a chronology of Android botnets' families with their characteristics, and discussed their attack strategies
[8] | 2018 | Categorized Android datasets and described the AndroZoo, Gnome Project, and AndroVault datasets
[9] | 2018 | Discussed the origin of common Android datasets with their drawbacks, and introduced their dataset CICAndMal2017

This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication.

Historically, the traditional channel for botnets was Internet Relay Chat (IRC), a text-based communication protocol that was previously used for group discussion forums, one-to-one private communication, and file sharing [10]. Nowadays, other protocols, such as Peer-to-Peer (P2P) and HyperText Transfer Protocol (HTTP), are used for spreading bots and communicating with them. Botnet attacks start with implanting malicious code segments into devices by several means, such as Bluetooth communication, malicious email attachments, or seemingly benign applications offered for download in an app store. Figure 2 illustrates an overview of the Botnet structure.

IV. ANDROID BOTNET ANALYSIS
Several studies have been conducted to address the mobile botnet detection problems, proposing several techniques and mechanisms. In this section, we explore the journey of detecting Android Botnets, from choosing a suitable dataset to validating whether or not an application is a botnet.

A. AVAILABLE DATASETS
Various studies employ ready-to-use datasets in their experiments. These datasets, such as the Drebin dataset [25], are primarily made up of extracted features in the form of zero-and-one vectors or numerical data. Other datasets are collections of applications in the form of APK files that researchers use to conduct their own analysis.
There are a variety of ready-to-use datasets to pick from, depending on the type of dataset required. Each of these datasets is described below, in order of newest to oldest. The number of records in each dataset, as well as the research that used them, are displayed in Table 2.

1) ISCX
In the Android Botnet detection field, only one source dataset is made specifically for this type of malware, which is the ISCX dataset [15] (also known as the University of New Brunswick (UNB) dataset in certain research). This dataset contains APK files for studies that want to extract the features themselves.

2) 28-SABD
This is a new Android botnet dataset created by [12]. It is based on the ISCX dataset and includes features derived via dynamic analysis in the form of vectors of zeros and ones.

3) DREBIN
This is one of the most widely used datasets in malware detection studies and was generated by [25]. Its features were extracted using static analysis from a total of 131,611 applications, both malicious and benign, collected from August 2010 to October 2012 [25].

4) GNOME / MALGNOME
This dataset was produced by [27] and contains 1260 malware samples categorized by malware family and gathered from August 2010 to October 2011.

5) OTHERS
All research that employed various datasets from different resources, or gathered by the same study, falls under this category in the following table. As a result, datasets tailored to Android Botnets are required. Furthermore, it has been noted that a time-series dataset is required for researchers who want to run experiments using DL classifiers that rely on time-series data.
Studies that rely on extracting features from the original APK files must analyze the APK files themselves. Based on the methodology and the tools used to extract the essential characteristics, these studies may be categorized into three groups: static, dynamic, and hybrid analysis. Due to the lack of Android Botnet detection studies based on the hybrid method, only the first two methods are discussed here.

B. STATIC ANALYSIS
Static analysis examines applications without running them, mainly by looking for harmful segments in the source code based on known malicious characteristics, which is similar to a signature-based method. In static analysis, we analyze the application and extract specific features that help in building a detection model, whereas in signature-based methods, segments of code are extracted from the examined application and compared with the signatures of known malware. Several tools are essential for static analysis of an Android application. ApkTool, which is designed to reverse engineer APK files, is the most prominent. APK (Android Package) files, which are analogous to EXE files on Windows computers, are the package files of executable Android apps. AAPT (Android Asset Packaging Tool) is a similar tool that can decode and extract data from AndroidManifest.xml without requiring the whole APK file to be decompiled [33]. ApkParser is a tool that converts the Manifest file to a human-readable text format [34]. Baksmali and AXMLPrinter2 are both decompilers that extract readable bytecode from APK files [35]. AndroGuard, WALA, Soot, IACDroid, and Amandroid are among the frameworks for static analysis of Android apps available in the literature.
Several studies performed on Android Botnet detection followed the static analysis method. Both studies [26] and [23] employed ready-to-use datasets: Drebin and Gnome, respectively. These datasets are created by statically analyzing APK files. The work [23] looked at the link between permissions and used features and concluded that a certain combination of these two factors may be used to identify a Botnet application. Then, as a feature selection approach, it used the Information Gain (IG) algorithm to ensure that the most impactful features were chosen.
The studies [16], [22], [21], [19], [17], and [20] all performed their own static analysis on the ISCX dataset. Permissions and API Calls are the features most commonly used for Botnet detection by the majority of them. W. Hijawi et al. [16] used permissions and their corresponding protection levels as features for botnet detection with different ML classifiers. Each permission in Android has a protection level that ranges from normal to dangerous. If the protection level is normal, the app does not need to ask the user before using the permission; it is available right away. Dangerous permissions, on the other hand, must be explicitly granted by the user.
Both [23] and [21] utilized the Information Gain (IG) feature selection algorithm, whereas [20] used WEKA's SubSetEval to pick the most correlated features, since the features were represented using the bag-of-words approach.
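To make the Information Gain criterion concrete, the following is a minimal sketch of how binary static features could be ranked by IG. It is an illustration only and not the implementation used in [21] or [23]; the feature names, the toy rows, and the helper names (entropy, information_gain, rank_features) are assumptions introduced here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG(Y; X) = H(Y) - sum over v of P(X = v) * H(Y | X = v)."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional


def rank_features(rows, labels, names):
    """Return (name, IG) pairs sorted from most to least informative."""
    scores = [
        (name, information_gain([row[i] for row in rows], labels))
        for i, name in enumerate(names)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: 1 = permission requested, 0 = not requested.
NAMES = ["SEND_SMS", "INTERNET", "CAMERA"]
ROWS = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
    [0, 1, 1],
]
LABELS = ["botnet", "botnet", "benign", "benign"]

if __name__ == "__main__":
    for name, score in rank_features(ROWS, LABELS, NAMES):
        print(f"{name}: {score:.3f}")
```

In this toy data, a feature that every application shares (INTERNET) carries no information about the class, which is why a selection step of this kind can discard it before classification.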
The following is a summary of the most commonly used features in the static analysis method.

1) PERMISSIONS
Android uses a permission-based access control system to restrict the actions that a process can perform. This implies that every app that wants to execute a certain operation must first request permission from Android [36]. Permissions are extracted from the AndroidManifest.xml file.
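As an illustration of this extraction step, the sketch below reads the uses-permission entries from a manifest that has already been decoded to plain XML (for example by a tool such as ApkTool) and encodes them as a zero-and-one vector. The sample manifest, the package name, and the vocabulary are hypothetical, and the surveyed studies may extract permissions differently.

```python
import xml.etree.ElementTree as ET

# XML namespace used for android:* attributes in AndroidManifest.xml.
ANDROID_NS = "http://schemas.android.com/apk/res/android"

# Hypothetical, already-decoded manifest. The binary manifest stored inside
# an APK must first be decoded to text before it can be parsed like this.
SAMPLE_MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.example.sample">
    <uses-permission android:name="android.permission.INTERNET"/>
    <uses-permission android:name="android.permission.SEND_SMS"/>
    <application android:label="Sample"/>
</manifest>
"""


def extract_permissions(manifest_xml):
    """Return the sorted, de-duplicated permission names requested in the manifest."""
    root = ET.fromstring(manifest_xml)
    name_attr = f"{{{ANDROID_NS}}}name"
    names = {elem.get(name_attr) for elem in root.iter("uses-permission")}
    return sorted(name for name in names if name)


def to_binary_vector(permissions, vocabulary):
    """Encode a permission list as a 0/1 vector over a fixed vocabulary."""
    present = set(permissions)
    return [1 if permission in present else 0 for permission in vocabulary]


# Hypothetical vocabulary; a real study would fix it over the whole dataset.
VOCABULARY = [
    "android.permission.SEND_SMS",
    "android.permission.CAMERA",
    "android.permission.INTERNET",
]

if __name__ == "__main__":
    found = extract_permissions(SAMPLE_MANIFEST)
    print(found)
    print(to_binary_vector(found, VOCABULARY))
```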

2) API CALLS
API calls are the functions that an application calls in its source code. They are invoked at runtime by the application to request specific information or to perform a specific task.
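One possible way to collect API calls statically is to scan disassembled smali code (as produced by tools such as Baksmali) for invoke instructions. The sketch below is a simplified illustration: the smali fragment is a hypothetical sample, the regular expression ignores array receiver types, and restricting the result to the android/ and java/ packages is an assumption made here, not a rule taken from the surveyed studies.

```python
import re
from collections import Counter

# Matches smali lines such as:
#   invoke-virtual {v0}, Landroid/telephony/TelephonyManager;->getDeviceId()Ljava/lang/String;
INVOKE_RE = re.compile(r"^\s*invoke-[\w/]+\s+\{[^}]*\},\s*L([^;]+);->([^\s(]+)\(")

# Hypothetical smali fragment used only for illustration.
SAMPLE_SMALI = """
.method public sendReport()V
    invoke-static {}, Landroid/telephony/SmsManager;->getDefault()Landroid/telephony/SmsManager;
    move-result-object v0
    invoke-virtual/range {v0 .. v5}, Landroid/telephony/SmsManager;->sendTextMessage(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;Landroid/app/PendingIntent;Landroid/app/PendingIntent;)V
    invoke-direct {p0}, Lcom/example/sample/Helper;->log()V
    return-void
.end method
"""


def extract_api_calls(smali_text, prefixes=("android/", "java/")):
    """Count framework API calls (class.method) found in smali invoke instructions."""
    calls = Counter()
    for line in smali_text.splitlines():
        match = INVOKE_RE.match(line)
        if not match:
            continue
        class_name, method = match.groups()
        # Keep only calls into the assumed framework packages.
        if class_name.startswith(prefixes):
            calls[f"{class_name.replace('/', '.')}.{method}"] += 1
    return calls


if __name__ == "__main__":
    for name, count in sorted(extract_api_calls(SAMPLE_SMALI).items()):
        print(name, count)
```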

3) BACKGROUND SERVICES
A background service is an Android component that operates in the background. It is often launched from an Android activity and, by default, runs in the same thread as that activity.

4) RECEIVERS
A broadcast receiver is an Android component that allows an app to receive events broadcast by the Android system or by other apps. It is used to communicate between processes in an asynchronous manner.

5) INTENTS
Intents are used to notify the Android system of a certain event. Intents are used to communicate between different components of different applications. This feature is also used in the dynamic analysis approaches.
Although static analysis saves time and effort, and it appears to be the most common analytic approach for Android Botnet detection studies, it is ineffective in identifying the characteristics of unknown or obfuscated malware.

C. DYNAMIC ANALYSIS
Dynamic analysis, on the other hand, examines an application's behavior throughout execution time by analyzing the application's runtime system calls and performance. Dynamic analysis, unlike static analysis, is successful in identifying obfuscation and delivers relevant information in a short amount of time. However, dynamic analysis necessitates a large amount of computer power and may not be effective in detecting malicious activities in real-time.
Numerous tools are used to perform dynamic analysis on an Android application. The adb tool (Android Debug Bridge) is the most commonly used tool for running commands directly on connected Android devices or emulators. For this reason, it is one of the most critical tools supplied with the Android SDK platform. Several tools, such as strace for collecting system calls [37], APIMonitor for monitoring an app's API calls [38], and tcpdump/Wireshark, well-known packet analyzers [30], are used in conjunction with the adb tool to capture and record some of the app's features.
To perform dynamic analysis, the malicious applications are executed on an Android device, either a real physical device or an emulator. As a result, an emulator is a commonly used auxiliary tool in dynamic analysis. The native emulator of the Android SDK is the most often used. However, several other emulators, such as Genymotion [39] and BlueStack [40], are also utilized. The Monkey tool is an extremely valuable tool for dynamic analysis [41]. It is an event generator that uses a stateless input generation approach to produce a pseudo-random sequence of events on an emulator or a real device, so that the application's behavior can be observed throughout various phases of regular smart device operation.
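As a rough sketch of how the Monkey tool can be driven through adb from an analysis script, the helper below assembles the command line and, optionally, runs it. The package name, event count, and seed are hypothetical values chosen for illustration; running the command assumes that adb is on the PATH and that a device or emulator is connected.

```python
import subprocess


def monkey_command(package, event_count=500, seed=None, throttle_ms=None, serial=None):
    """Build the adb command that sends pseudo-random events to one package."""
    cmd = ["adb"]
    if serial:
        cmd += ["-s", serial]  # target a specific device or emulator
    cmd += ["shell", "monkey", "-p", package]
    if seed is not None:
        cmd += ["-s", str(seed)]  # a fixed seed reproduces the same event sequence
    if throttle_ms is not None:
        cmd += ["--throttle", str(throttle_ms)]  # delay between events, in milliseconds
    cmd += ["-v", str(event_count)]
    return cmd


def run_monkey(package, **options):
    """Run Monkey and return the completed process (needs adb and a connected device)."""
    return subprocess.run(
        monkey_command(package, **options),
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # Only prints the command; call run_monkey(...) to actually execute it.
    print(" ".join(monkey_command("com.example.sample", event_count=100, seed=42)))
```

Fixing the seed is a design choice that makes a dynamic-analysis run repeatable, which helps when traces from the same application need to be compared across devices or emulators.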
Several open-source frameworks/sandboxes are also used for dynamic analysis, including DroidBox, DynaLog, which is based on DroidBox and can automatically analyze hundreds of programs [42], Robotium, Café, which is based on Robotium and includes extra features, and CuckooDroid, a sandbox analyzer based on the Cuckoo platform. Unfortunately, the majority of these frameworks are no longer supported and hence cannot be utilized correctly.
The Santoku Linux environment, which is designed to provide tools and utilities for Android security analysis, is a capable but underutilized framework for security analysis of Android platforms. Santoku contains the tools that DroidBox needs to work effectively, such as adb, logcat, and Android emulator images [42]. Table 3 summarizes some of the Android analytic tools that are currently available.
Android Botnet detection has been carried out in a number of studies based on dynamic analysis. For example, in [28], the authors employed the tPacketCapture Pro tool to collect the network traffic of applications from the Gnome dataset. Another study [12] analyzed the network traffic of botnet apps drawn from the ISCX dataset using four physical machines and a BlueStack emulator. Network traffic has been shown to be the most reliable feature according to several studies, such as [18] and [29].

1) NETWORK TRAFFIC
In network traffic analysis, the analyzer keeps track of various aspects of the data transmission and reception process. These aspects might include the amount of data sent and received, the destination IP address, or any other network-related information.
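The aspects listed above can be turned into numerical features by aggregating captured packets per device. The following sketch assumes a simplified, hypothetical packet record (source, destination, size) and example IP addresses; real captures from tools such as tcpdump contain many more fields, and the surveyed studies use their own feature sets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Simplified packet record; the schema is an assumption for illustration."""
    src: str
    dst: str
    size: int  # payload size in bytes


def traffic_features(packets, device_ip):
    """Aggregate per-device traffic features from a list of packets."""
    sent = [p for p in packets if p.src == device_ip]
    received = [p for p in packets if p.dst == device_ip]
    return {
        "bytes_sent": sum(p.size for p in sent),
        "bytes_received": sum(p.size for p in received),
        "packets_sent": len(sent),
        "packets_received": len(received),
        "unique_destinations": len({p.dst for p in sent}),
    }


# Hypothetical capture for one monitored device.
DEVICE_IP = "10.0.2.15"
CAPTURE = [
    Packet(DEVICE_IP, "203.0.113.5", 120),
    Packet("203.0.113.5", DEVICE_IP, 900),
    Packet(DEVICE_IP, "198.51.100.7", 60),
    Packet(DEVICE_IP, "203.0.113.5", 80),
]

if __name__ == "__main__":
    print(traffic_features(CAPTURE, DEVICE_IP))
```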
In general, dynamic-based analysis requires high computational power and may not be efficient in detecting malicious behavior in real-time.

D. DISCUSSION
The third method in the malware detection process, in general, is hybrid analysis. A hybrid method evaluates system behavior by employing both static and dynamic analysis in parallel. It is used to benefit from the advantages of both static and dynamic analysis while overcoming the drawbacks of each.
To the best of our knowledge, there has never been any research done on Botnet detection using a hybrid analysis method. As a result, this has been identified as a knowledge gap in the literature. Figure 3 shows a taxonomy of the discussed analytical methods. Presenting this taxonomy would aid in categorizing studies based on the type of analysis method adopted, directing researchers to papers that cover certain analysis methods or utilize specific datasets, and identifying the need for greater scientific research on overlooked methodologies and features.

V. ANDROID BOTNET DETECTION
Although various works have offered an overview of using ML techniques in botnet detection in general [43], few studies leveraging AI applications for Botnet detection in Android systems have been undertaken, and most of these were attempts to generate labelled datasets of botnet samples. While [12] used ML-based classifiers to label the data in order to generate a ready-to-use Botnet dataset, the study [20] applied ML using the WEKA platform to generate its own dataset on which to perform the detection. WEKA, a basic tool for implementing ML techniques in malware detection, has been utilized in several other existing works, such as [19], [29], and [44], while cloud-based botnet detection techniques were adopted by other studies such as [30] and [44].

A. ML-BASED DETECTION
Most Android botnet detection studies utilized the static analysis method for feature extraction. For example, [17] used four filters to classify the statically collected features: MD5 (a hash, primarily used as a message integrity technique [45], that serves here as a known signature of popular malicious applications), Permissions, Broadcast Receivers, and Background Services. The study applied SVM, KNN, J48, Bagging, NB, and RF classifiers to different features after the application had passed through these four layers. On Permissions, NB produced the best results among the classifiers. Both [21] and [23] employed the Information Gain (IG) feature selection algorithm on statically extracted features, whereas [46] used and emphasized the merits of the Fuzzy SAPSO selection algorithm on dynamically generated features.
TABLE 3 Summary of some of the existing Android analytic tools
Another work that employed a feature selection technique is [20]. The authors utilized text mining on the Java source code of the apps in this study. In the first approach, they used a Natural Language Processing (NLP) technique with the number of words as the tuned parameter. In the second approach, they used the CodeAnalyzer tool to extract certain quantitative features. As a consequence, three datasets were generated. Finally, they applied WEKA's SubSetEval feature selection method to these datasets to reduce the number of features and pick the best features for classification, in order to prevent overfitted models. On these datasets, they tested Naïve Bayes (NB), Decision Tree (DT, also known as J48), K-Nearest Neighbor (KNN), Random Forest (RF), and Sequential Minimal Optimization (SMO), and found that the last three performed better than the others.
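Since several of the studies above apply Naïve Bayes to statically extracted zero-and-one features, a small Bernoulli Naïve Bayes classifier is sketched below to show what such a classifier computes. It is a generic textbook formulation with Laplace smoothing, not the WEKA implementation used in the cited works, and the training rows and feature order (SEND_SMS, INTERNET, CAMERA) are hypothetical.

```python
import math


class BernoulliNaiveBayes:
    """Naive Bayes for 0/1 feature vectors with Laplace (add-alpha) smoothing."""

    def fit(self, rows, labels, alpha=1.0):
        self.classes = sorted(set(labels))
        n_features = len(rows[0])
        self.log_prior = {}
        self.feature_prob = {}
        for c in self.classes:
            members = [row for row, y in zip(rows, labels) if y == c]
            self.log_prior[c] = math.log(len(members) / len(rows))
            # P(feature_j = 1 | class c), smoothed so it is never 0 or 1.
            self.feature_prob[c] = [
                (sum(row[j] for row in members) + alpha) / (len(members) + 2 * alpha)
                for j in range(n_features)
            ]
        return self

    def predict(self, row):
        def log_score(c):
            total = self.log_prior[c]
            for x, p in zip(row, self.feature_prob[c]):
                total += math.log(p if x else 1.0 - p)
            return total

        return max(self.classes, key=log_score)


# Hypothetical training data; columns: SEND_SMS, INTERNET, CAMERA.
ROWS = [
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
]
LABELS = ["botnet", "botnet", "botnet", "benign", "benign", "benign"]

if __name__ == "__main__":
    model = BernoulliNaiveBayes().fit(ROWS, LABELS)
    print(model.predict([1, 1, 0]))
    print(model.predict([0, 0, 1]))
```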
The features selected by [26], on the other hand, were chosen after reviewing other research and resulted in 16 Permissions and 31 API Calls being the most significant characteristics of Android botnet applications. This work also implemented NB, KNN, J48, RF, and SVM classifiers in their detection experiments, with RF achieving the highest accuracy rate of 99.4%.
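For reference, the accuracy figures quoted in this section can be computed from the confusion counts of a classifier. The sketch below also computes a detection rate, which is assumed here to mean the true positive rate on the botnet class; the surveyed papers may define their metrics differently, and the labels used are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive="botnet"):
    """Return (tp, fp, tn, fn), treating `positive` as the malicious class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def detection_rate(tp, fn):
    """True positive rate: fraction of botnet samples that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical ground truth and predictions.
Y_TRUE = ["botnet", "botnet", "botnet", "benign", "benign"]
Y_PRED = ["botnet", "botnet", "benign", "benign", "botnet"]

if __name__ == "__main__":
    tp, fp, tn, fn = confusion_counts(Y_TRUE, Y_PRED)
    print(f"accuracy={accuracy(tp, fp, tn, fn):.3f}")
    print(f"detection_rate={detection_rate(tp, fn):.3f}")
```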
The primary detecting feature of [18], [28], [29], [44], and [46] was Network Traffic. The researchers in [18] selected the top four features based on the [47] study and then added seven additional features based on the accuracy of their own assessment. They began the detection with a small set of labeled data and then used the incoming Network Traffic to train the KNN classifier incrementally. By using self-learning classification, they were able to obtain a detection rate of 90.3%.
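The incremental, self-learning use of KNN described above can be sketched as a generic self-training loop: each incoming sample is classified by its nearest labeled neighbours and, if the vote is confident enough, is added to the training set with its predicted label. This is not the exact procedure of [18]; the two features, the confidence threshold, and the sample values are assumptions made for illustration.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(train, sample, k=3):
    """Return (label, confidence) by majority vote among the k nearest neighbours."""
    neighbours = sorted(train, key=lambda item: squared_distance(item[0], sample))[:k]
    votes = Counter(label for _, label in neighbours)
    label, count = votes.most_common(1)[0]
    return label, count / len(neighbours)


def self_learning_knn(labeled, stream, k=3, min_confidence=1.0):
    """Classify a stream of samples, growing the training set with confident predictions."""
    train = list(labeled)
    predictions = []
    for sample in stream:
        label, confidence = knn_predict(train, sample, k)
        predictions.append(label)
        if confidence >= min_confidence:
            train.append((sample, label))  # self-labeled sample joins the training set
    return predictions, train


# Hypothetical features: (kilobytes sent per minute, unique destinations).
LABELED = [
    ((1.0, 1.0), "benign"),
    ((1.5, 2.0), "benign"),
    ((1.2, 1.5), "benign"),
    ((9.0, 8.0), "botnet"),
    ((8.5, 9.0), "botnet"),
    ((9.5, 8.5), "botnet"),
]
STREAM = [(8.0, 8.0), (1.1, 1.2), (7.0, 7.0)]

if __name__ == "__main__":
    predicted, grown = self_learning_knn(LABELED, STREAM)
    print(predicted)
    print(len(grown))
```

Requiring a unanimous vote before adding a sample is a conservative choice: it limits the risk that early misclassifications are fed back into the training set and reinforced.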

B. DL-BASED DETECTION
Although DL is a subset of ML, DL classifiers have only recently caught the interest of researchers in this area. DL can outperform traditional ML because it teaches itself to process and learn from the data by optimizing its parameters.
Only [22] and [24] adopted DL classifiers in the identification of Android botnets on features retrieved using a static-based analysis method. Both used the CNN classification technique, and both achieved high detection accuracy, with 98.9 and 97.2 percent, respectively. Figure 4 shows a taxonomy of the discussed detection techniques with corresponding classifiers applied in the literature. According to this taxonomy, only three papers have investigated the potential of DL techniques, indicating that DL approaches are currently fairly unexplored.

VI. DISCUSSION AND OPEN CHALLENGES
In this section, we go through the findings of the presented survey on Android botnet detection utilizing AI applications, which may be summarized as follows:
• Methods of analysis used in prior research on Android APK files.
• Tools and features associated with the analysis method used.
• Datasets for Android botnet applications that are currently available.
• Current AI applications, including the primary subfields ML and DL, that have been covered in the literature to detect Android botnets.
By doing this evaluation, we find that the majority of studies have utilized the static analysis method in Android botnet detection, as seen in Figure 5. We can also observe that Android researchers have not yet investigated how hybrid analysis could be used to detect botnets. Hybrid analysis, as previously indicated, may be more successful in extracting more relevant characteristics of a certain type of malware.
Each analysis method has its own set of tools and processes, some of which are selected by researchers because they are straightforward to use, while others are not employed because they lack adequate support. Figure 6 provides the analysis tools that were utilized in the literature for each of the analytical methods. APKtool, which is the most popular tool for reverse engineering APK files, is one of the most powerful and commonly chosen in static-based and signature-based analysis, as well as being known for its ease of use.
In the case of dynamic and hybrid analysis, which require observing the tested applications during execution, there is a shortage of suitable tools, whether due to a lack of adequate support, a lack of appropriate frameworks, or difficulties in using them, as stated previously in Table 3. Besides that, the features that are employed are influenced by the type of analysis that is conducted. Figure 7 illustrates how these features are considered according to the analytical method employed. Permissions are the most commonly used feature in static-based analysis, whereas Network Traffic is the most prominent feature in dynamic-based analysis, as shown in the figure. Intents are employed in both types of analysis, and their impact on detection accuracy requires additional research. Nevertheless, as previously mentioned, only a few researchers have used feature selection algorithms in their investigations.
In terms of datasets, as previously stated, just one dataset, the ISCX dataset, is created exclusively for Android Botnet apps.
The 28-SABD dataset is the second accessible dataset, which is a vector of zeros and ones produced from the aforementioned ISCX dataset. Figure 8 demonstrates the various usage of these datasets in each analysis method, whereas Figure 9 depicts the ISCX dataset's dominance in the majority of Android Botnet studies.
Furthermore, as demonstrated in Figure 10, ML approaches are increasingly used in botnet detection. It is noted that researchers used a variety of classifiers, with some classifiers, such as NB, J48, RF, and SVM, being preferred by the researchers (demonstrated in Figure 11). The most examined classifier in Android Botnet Detection is NB as shown in Figure 12. RF, on the other hand, has the best performance among the other classifiers, with an accuracy of 99.4 percent (see Tables 4 and 5). Only one research looked at the potential of using a time-series-based detection method. This indicates that we should pay more attention to experiments and the need for time-series datasets, which are not currently available in the literature. In terms of accuracy, Figure 13 presents a comparison of the strongest classifiers employed in the reviewed studies.
It is worth noting that DL classifiers have only been explored in three recent studies. However, it should be emphasized that the deep learning classifiers employed, CNN (Convolutional Neural Networks) and DNN (Deep Neural Networks), have achieved extremely high accuracy rates, ranging from 97.2 to 99.1 percent (see Tables 4 and 5).
Other DL classifiers are proving their power to identify malware, and their promise in this field has yet to be explored. Table 4 shows a comparison of research on Android Botnet Detection that used a static analytical method, whereas Table 5 shows a comparison of studies on Android Botnet Detection that used a dynamic analytical method.

FIGURE 7 Features Used in Botnet Detection with Each Analysis Method
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3187094

VII. CONCLUSION AND FUTURE WORK
Security flaws may affect any smartphone, although attackers are particularly interested in Android devices. This is related to various factors that have already been mentioned in the paper. This study aimed to offer a thorough overview of the implemented AI applications for future academics interested in conducting experiments on Android Botnet Detection. AI applications, including its subdomains ML and DL, have demonstrated their capacity to identify many types of malware. Botnets on Android are a relatively old subject with few investigations. In this assessment, 20 papers in this field from 2015 to the present were gathered and examined. The main analytic approaches for extracting features were addressed, including static, dynamic, and hybrid analyses. It also mentioned the lack of hybrid analysis research in the area, as well as the lack of an up-to-date dataset and a time-series dataset. In order to serve as a reference for future study, a taxonomy of features and datasets utilized in the literature was provided. In static-based analysis, Permissions was the most commonly used feature, whereas, in dynamics-based analysis, Network Traffic was the most commonly used feature. Another taxonomy was presented to distinguish between studies that utilized NB, J48, RF, SVM, KNN, SMO, Bagging, NN, and DT classifiers, as well as DL classifiers. Detailed tables were produced to provide a picture of the current work.
The best classifier among all ML classifiers is RF, while DNN is the best one among DL classifiers in this subject, according to these tables. For future work, we intend to perform a hybrid analysis on Android APK files, extract a time-series dataset and then use DL-based classification to detect Android botnets, which has been recognized as a research need. Furthermore, an in-depth examination of various Android vulnerabilities and attacks such as SMS, Email, Spying, Application Sandboxing, and Rooting attacks will be carried out.