Introduction
Mobile devices have become an integral part of most people's daily lives, and Android now runs on the vast majority of them, accounting for an average of 80% of the global market share over the past years [1]. As Android spreads to a growing range of smartphones and consumers around the world, malware targeting Android devices has increased as well. Because it is an open-source operating system, it is an attractive target: malware authors embed unwanted permissions, features, and application components in Android apps. The option to extend the platform with third-party software is also appealing, but this capability comes with the risk of malicious attacks. As the number of smartphone apps grows, so does the security problem of unnecessary access to personal resources. As a result, applications are becoming less secure, enabling theft of personal information, SMS fraud, ransomware, and more.
In contrast to static analysis methods, such as manual assessment of AndroidManifest.xml, source files and Dalvik bytecode, and in contrast to the complex dynamic analysis of a managed environment that observes how a program behaves at runtime, machine learning learns the characteristic rules and habits of benign and malicious apps directly from data. Static attributes derived from an application are widely used in machine learning methodologies, and the tedium of manual review can be relieved by extracting the static features of reverse-engineered Android applications and training models such as SVM, logistic regression, and ensemble learners to predict whether an application is malware [2].
Machine learning employs a range of methodologies for data classification. SVM is a strong learner that plots each data item as a point in an n-dimensional feature space and then finds the hyperplane that best separates the two classes.
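As a minimal sketch of this idea (not the authors' program; the feature vectors, labels, and dimensions below are illustrative), a linear SVM can be fit on binary static-feature vectors in a few lines of scikit-learn:

# Minimal sketch: training a linear SVM on binary static-feature vectors.
# The feature matrix and labels are illustrative placeholders, not the
# paper's actual dataset.
import numpy as np
from sklearn.svm import LinearSVC

# Each row is one app; each column a binary feature (permission, API call, ...).
X = np.array([
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
])
y = np.array([0, 1, 0, 1])  # 0 = malware, 1 = benign, as labelled in this paper

clf = LinearSVC()
clf.fit(X, y)
print(clf.predict([[1, 0, 1, 0]]))  # predicted class for an unseen app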
In the latest versions of the Android operating system (OS), any app that requires access privileges may ask the OS for permission, and the OS asks the user to approve or decline the request through a pop-up dialog. Many studies have examined the effectiveness of this resource management strategy. Research shows that consumers tend to grant all permission requests without deliberation [8]. Moreover, more than 70% of Android mobile applications request permissions that are not required for their functionality in the first place. A chess game that asks for access to photographs, requests SMS and phone call permissions, or loads unwanted packages is an example of such excess authorization. Assessing an app's maliciousness without understanding the app is therefore a tough challenge. Consequently, effective malicious app monitoring provides extra information to users, assisting them and defending them from information disclosure [9]. Figure 1 elaborates the Android risk framework of the Google Play platform, which is then manually configured by Android developers.
In contrast to other smartphone operating systems, such as iOS, Android allows users to install apps from untrusted outlets like file-sharing sites or third-party app stores. The malware problem has become so severe that 97% of all smartphone malware now targets Android phones. About 3.25 million new malicious Android applications are discovered each year as smartphone adoption grows, roughly equivalent to a new Android malware variant appearing every few seconds [10]. The primary aim of mobile malware is to gain access to user data saved on the device and to information used in confidential financial activities, such as banking. Infected file attachments, files received via Bluetooth, links to infected code in SMS, and MMS application links are all ways that mobile malware can propagate [11]. Several strategies exist for locating apps that request additional capabilities; using these approaches, it should be possible to assess whether applications flagged as suspicious and requesting extra authorization are malicious.
Static analysis methodologies are the most fundamental of all approaches: the manifest file and source code are examined without running the program [12]. For many machine learning tasks, such as enhancing predictive performance or simplifying complicated learning problems, ensemble learning is regarded as a state-of-the-art method. It improves on a single model's predictive performance by training several models and combining their predictions. Boosting, bagging, and random forests are common ensemble learning techniques [13]. In summary, the main contributions of our study are as follows:
We present a novel subset of features for static detection of Android malware, consisting of seven selected feature sets (Permissions, App Components, Method Tags, Intents, Packages, API Calls, and Services/Receivers) comprising around 56,000 features. We assess their stability on a collection of more than 500k benign and malicious Android applications, a larger malware sample set than any state-of-the-art approach. The results show an increase in detection accuracy to 96.24% with 0.3% false positives.
With these features, we train six classifier models and also implement a boosting ensemble learning approach (AdaBoost) with a Decision Tree base learner for binary classification to enhance our prediction rate.
Our model is trained on large, time-aware samples of malware collected in recent years, covering a later Android API level than state-of-the-art approaches.
There are major issues to be addressed to deploy our strategy. The high dimensionality of the features makes it difficult to identify malware in many real-world Android applications, and certain features overlap between benign apps and malware [14]. Moreover, the vast number of features imposes a heavy computational load. Therefore, we learn the most popular and significant features directly from those derived from Android apps. This paper implements prediction models and several ensemble learning strategies to boost accuracy and resolve this problem [15]. Feature selection is an essential step in all machine learning approaches: an optimal collection of features not only improves test outcomes but also reduces the scope of most machine learning algorithms [16].
Studies have extensively suggested three separate methods for identifying Android malware: static, dynamic (interactive), and hybrid. Static analysis techniques examine the code without ever running it, so they are slow if carried out manually and face many false positives [17]. Data obfuscation and dynamic code loading are significant pitfalls of the technique; automation therefore helps achieve reliability, accuracy, and lower time consumption [18]. We reverse engineer Android applications, extract features from them, and perform static analysis without having to execute them. This method entails examining the contents of two files inside the .apk archive: AndroidManifest.xml and classes.dex. Feature selection techniques and classification algorithms are two crucial areas of feature-based detection of fraudulent applications. Feature filtering methods reduce the dimensionality of a dataset: attributes that are not helpful to the study are omitted, and the remaining features are chosen by weighing the representational strength of all the dataset's features [19]. Parsing tools can reveal which permissions, packages, or services an application declares by analyzing the AndroidManifest.xml file; for example, the permission android.permission.CALL_PHONE allows an application to misuse calling abilities. We identify sensitive API usage by decoding the classes.dex file with the Jadx-GUI disassembler [20]. In certain cases, the combination of two permissions in a single app can signal possible malicious behavior. For example, an application with RECEIVE_SMS and WRITE_SMS permissions can hide or interfere with incoming text messages [21], and a sensitive API such as sendTextMessage can likewise be abused for fraud and theft.
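To illustrate this kind of manifest inspection, the following hedged Python sketch lists the permissions requested in an already-decoded AndroidManifest.xml and flags a few sensitive ones; the file path and watchlist are illustrative, and this is not the paper's extraction tool:

# Sketch: listing requested permissions from a decoded AndroidManifest.xml.
# Assumes the manifest has already been decoded to plain XML (e.g. by Jadx);
# the path and the watchlist below are illustrative.
import xml.etree.ElementTree as ET

ANDROID_NS = "{http://schemas.android.com/apk/res/android}"
WATCHLIST = {"android.permission.CALL_PHONE",
             "android.permission.RECEIVE_SMS",
             "android.permission.WRITE_SMS"}

root = ET.parse("AndroidManifest.xml").getroot()
requested = [p.get(f"{ANDROID_NS}name")
             for p in root.iter("uses-permission")]

for perm in requested:
    flag = "  <-- sensitive" if perm in WATCHLIST else ""
    print(perm, flag)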
This motivates the main idea of our project: Android applications pose many threats to their users because of the unnecessary components compiled into them, which is why it is necessary to automate static analysis for efficient detection of malware applications based on extracted features. The rest of the paper is organized as follows. Related works are examined in Section II. Section III presents the design details of the proposed model. Section IV elaborates the assessment findings and future threats. The experiments and results are presented in Sections V and VI. Section VII includes our research issues, recommendations, and conclusions for the future.
Related Works
Linux (the Android core) provides key aspects of the operating system's security infrastructure. Before installing an update, Android displays to the user the list of capabilities the app requests; the program installs itself on the device once the user grants access. Figure 3 shows the integrated core parts of the Android architecture: applications at the top layer, an application framework, libraries and a runtime layer, and the Linux kernel. These levels are further divided into the components that make up an Android application. The Linux kernel is the key part of Android, providing OS functionality to phones, while the Dalvik Virtual Machine (DVM) manages application execution on the device. The application layer is the Android architecture's highest layer; native and third-party apps such as contacts, email, audio, gallery, clock, and games are located in this layer. The application framework provides the classes used to develop Android apps; it also handles the user interface and device infrastructure and provides a common specification for hardware access. To facilitate Android development, the platform libraries include many C/C++ core libraries and Java-based libraries such as SSL, libc, Graphics, SQLite, WebKit, Media, Surface Manager, and OpenGL. The taxonomy helps the reader grasp, through a logical algorithmic approach, the core surfaces and functionality of the operating system.
The methods proposed in these related works contribute to key aspects such as selected features for classification and a higher predictive rate for malware detection. Some research has focused on increasing accuracy, other work on providing a larger dataset or employing various feature sets, and many studies have combined all of these to improve detection efficiency. In [22], the authors offer a system for detecting Android malware apps to aid in the organization of the Android Market. The proposed framework provides a machine-learning-based malware detection system for Android that detects malware apps and improves phone users' safety and privacy. The system monitors various permission-based characteristics and events acquired from Android apps and examines these features with machine learning classifiers to determine whether the program is goodware or malware. The paper uses two datasets with 700 malware samples and 160 features in total; both datasets achieved approximately 91% accuracy with the Random Forest (RF) algorithm. [23] examines 5,560 malware samples, detecting 94% of the malware with few false alarms, and the explanations supplied for each detection disclose key characteristics of the identified malware. Another technique [24] exceeds both static and dynamic methods that rely on system calls in terms of resilience. The researchers demonstrated the consistency of their model in attaining maximum classification performance and better accuracy than two state-of-the-art peer methods, one static and one dynamic, through three interrelated assessments over nine years of malware samples from different sources. The model consistently achieved a 97% F1-measure for identifying applications and categorizing malware. In [25], the authors present a unique Android malware detection approach dubbed Permission-based Malware Detection Systems (PMDS), based on a study of 2,950 samples of benign and malicious Android applications. In PMDS, requested permissions are viewed as behavioral markers, and a machine learning model is built on those indicators to detect new, potentially dangerous behavior in unknown apps depending on the combination of rights they require. PMDS identifies 92-94% of all previously unseen malware, with a false positive rate of 1.52-3.93%. The authors of [26] solely use the ensemble method Random Forest, a supervised classifier, on Android malware samples with 42 features. Their objective was to assess Random Forest's accuracy in classifying Android application activity as harmful or benign. Their first dataset is built on 1,330 malicious APK samples and 407 benign ones collected by the authors, based on a feature vector for each application. Using an ensemble learning approach, Congyi proposes a concept in [27] for recognizing and distinguishing Android malware. First, a static analysis of the AndroidManifest file in the Android Application Package (APK) extracts system characteristics such as permission calls, component calls, and intents. Then, to detect malicious apps, they deploy the XGBoost technique, an implementation of ensemble learning. Analyzing more than 6,000 Android apps on the Kaggle platform provided the initial data for this experiment.
They tested both benign and malicious apps based on 3 feature sets, with a testing set of 2,000 samples and the remaining data forming a training set of 6,315 samples. Additional approaches include [28], an SVM-based malware detection technique for the Android platform that incorporates both dangerous permission combinations and vulnerable API calls as elements of the SVM algorithm. The dataset includes 400 Android applications: 200 benign apps from the official Android market and 200 malicious apps from the Drebin dataset. [29] determines whether a program is dangerous and, if so, categorizes it into a malware family. It obtains up to 99.82% accuracy with zero false positives for malware detection at a fraction of the computational cost of state-of-the-art methods, but uses a minimal feature set. The results of [30] demonstrate that deep learning is adequate for classifying Android malware and is much more successful when additional training data is available. A permission-based strategy for identifying malware in Android applications is described in [31], which uses filter feature selection algorithms to pick features and applies machine learning algorithms such as Random Forest, SVM, and J48 to classify applications as malware or benign. The research in [32] provides a Genetic Algorithm (GA) feature selection approach for identifying Android malware: three alternative classifier techniques with distinct feature subsets were built with GA and compared for identifying and analyzing Android malware. Another technique achieves satisfactory accuracy, but its FPR is very high and its samples are limited [33].
One important matter that has not been considered by any of these studies is the sustainability of the model as applications advance; this issue remains a challenge for our research as well. A model's ability to classify gradually decreases over time as new features or evolved applications appear. Only [29] and [26] identify this issue, introducing it as concept drift and describing the reduced performance of their systems after some time. Our research does not address this problem either, but we suggest potential studies toward model sustainability in the research issues and challenges section. Another matter that arises when implementing machine learning algorithms is the multicollinearity problem, which we discuss in the introduction. It arises because the algorithms depend on multiple correlated variables embedded in these machine learning or deep learning models. Although it is a rising issue in the area and could be present in our study, it would be better treated as separate research. Our model already incorporates a wide range of evaluations and analyses of Android application feature sets, but this would be a great opportunity to further enhance the models for future use. Relevant studies help alleviate this challenge by detecting a model's dependencies, comparing multiple models together and then calculating which model has the greatest impact. The authors of [34], [35], [36] consider different scenarios for different machine learning models to highlight suitable measures for each model scenario.
Tables 1 and 2 elaborate on the novelty of our approach and compare state-of-the-art methodologies across several categories. Table 1 focuses on the key categories: malware samples, feature sets, the proposed method, accuracy, false-positive rate, the API level (reflecting increasingly complex application behavior), and the system environment for data processing. It also shows that our sample set and feature set are larger and achieve satisfactory accuracy with 0.3% FPR, the lowest false-positive rate other than DroidSieve. Our contribution lies in the upgraded API levels with large sample sizes and enhanced feature sets for malware detection. Table 2 takes a more in-depth approach, showing the key features present in the proposed and other approaches along with the time awareness of the collected data.
A. Reverse Engineered Applications Characteristics
Different Android apps have different functionalities. If an app is to use device resources, the corresponding permissions must be specified in the AndroidManifest format; different program types therefore declare different prior approvals [37], [38]. Static analysis likewise identifies an application as malicious or benign, using features to make rational classification choices. The article shows the taxonomy diagram for the features present in Android applications [39]. It comprises all the components present in APK files and their layout after reverse engineering with a disassembler, in our case Jadx-GUI. Fig. 4 shows the process of APK file disassembly.
1) AndroidManifest.xml
In the root folder of any reverse-engineered application there must be an AndroidManifest.xml file. The manifest file gives essential information about the mobile application, which the framework requires before executing any of the app's code. The authorization process should protect the application's key elements, which include Activities, Services, Content Providers, and Broadcast Receivers. This is mainly accomplished by affiliating these components with the relevant element in the manifest definition, so that Android dynamically enforces the restrictions in the associated contexts [28].
Fig. 5 shows the taxonomy of the Android manifest components, which contain all the requested permissions, packages, intents and features for extraction.
B. Feature Set Extraction
Feature filtering decreases the dimensionality of the data collection by deleting attributes that are not useful for the study. We chose the characteristics based on their capability to represent all data sets. Reducing the dataset size and the hours spent on classification yields an effective feature selection process. Our process does not rely on a customized Android emulator, because it is not a convenient approach and we prefer to target physical devices in the future. Jadx carries out the modification and evaluation of source code; the system concentrates on hooking byte-level API calls [40]. For our dataset, features from over 100,000 applications are extracted, yielding around 56,000 extracted features. Functions and processes of opcode API features are derived from the disassembled Smali and manifest files of an APK. The Smali file, segmented by method, yields the Dalvik opcode frequency of every method, determined by scanning Dalvik bytecodes. To verify invocation of a hazardous API, the frequency of hazardous API invocations per method can also be determined during the bytecode scan. For string features, strings are selected from the entire Smali archive without isolating individual methods [41].
We will never obtain a reliable response when the number of features in a dataset exceeds the number of instances. In other words, when we do not have enough data to train on, building a structure that captures the association between the predictor variables and the response variable becomes problematic.
The system used in this study incorporates larger feature sets for classification. Although this problem arises often in machine learning, the choice of detection or classification model strongly affects how well high-dimensional data is handled; Support Vector Machines and AdaBoost cope relatively better than other algorithms because of their high-dimensional hyperplane sectioning. Another consideration for our datasets was the extraction tool. Androguard automates parsing and analysis to break down the components of application APKs after decompiling, and encodes the data into binary form, making the relevant data easy to use for classification. It uses specific functionality to obtain useful features from the manifest files of these Android applications, reducing the acquisition of irrelevant features. Although the data in this study works well for evaluation, the datasets will need to be upgraded as applications evolve.
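The sketch below illustrates this style of Androguard-based extraction into a binary vector, assuming Androguard's AnalyzeAPK interface; the APK path and the small vocabulary stand in for the roughly 56,000-feature set used in this paper:

# Sketch: extracting manifest-level features with Androguard and encoding
# them as a binary vector. Simplified illustration; the vocabulary below
# is a placeholder, not the full feature set.
from androguard.misc import AnalyzeAPK

VOCABULARY = ["android.permission.RECEIVE_SMS",
              "android.permission.CALL_PHONE",
              "com.example.SomeService"]           # hypothetical entries

a, d, dx = AnalyzeAPK("sample.apk")                # APK, dex, and analysis objects
observed = set(a.get_permissions()) \
         | set(a.get_services()) \
         | set(a.get_receivers())

binary_vector = [1 if feat in observed else 0 for feat in VOCABULARY]
print(binary_vector)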
Other authors have presented tools and proposals to deal with high-dimensional data, such as [42], [43], introducing methods such as filtering and wrapping to enhance robustness.
The feature set of our model includes:
\begin{equation*} \text{Permissions}~(F_{1}) \rightarrow \text{API Calls}~(F_{2}) \rightarrow \text{Intents}~(F_{3}) \rightarrow \text{App Components}~(F_{4}) \rightarrow \text{Packages}~(F_{5}) \rightarrow \text{Services}~(F_{6}) \rightarrow \text{Receivers}~(F_{7}) \end{equation*}
1) Permissions
Permission is a security feature that limits access to certain information on a smartphone, with the role of preserving sensitive data and functions that might be exploited to harm the user's experience. A unique label is assigned to every permission, typically denoting a restricted operation. Google categorizes permissions into four protection levels: normal, dangerous, signature, and signatureOrSystem. Researchers take a variety of approaches to evaluating Android permissions [44]. Normal (also called safe) permissions, such as VIBRATE and SET_WALLPAPER, pose no risk, and the Android package installer does not prompt the user to approve them. Dangerous permissions present warnings to the user before installation and require the user's consent. The signature and signatureOrSystem levels cover the riskiest permissions; signature permissions are granted only to applications signed with the same certificate as the one used to sign the package declaring the permission [45]. The kernel also acts as a buffer between the hardware and the rest of the stack. A variety of C/C++ core libraries, such as libc and SSL, are used in the libraries layer, and the Dalvik virtual machine and core libraries are part of the Android Runtime. The application framework defines classes for developing Android applications, as well as a standardized structure for hardware control and management of the user experience and app properties; API libraries are used by both proprietary and third-party developers [46]. Table 3 shows some dangerous permissions that pose problems in reverse-engineered applications.
2) Intents
The messages delivered among modules such as activities, content providers, broadcast receivers, and services are known as Android Intents. An intent is commonly passed to the startActivity() function to start activities, broadcast receivers, and other components. Individual intent counts are exploited as a continuous feature in categorization. For more specificity, we divide the list of intents into further sections: intents containing the phrase android.net, which are linked to the network manager; intents containing com.android.vending, used for billing transactions; and intents addressing framework components (com.android), which can prove to be harmful elements in these apps.
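A small sketch of this prefix-based grouping follows; the category table and intent strings are illustrative, not the paper's exact lists:

# Sketch: grouping declared intent actions by prefix and counting them,
# one continuous feature per category. The intent list is illustrative.
from collections import Counter

CATEGORIES = {"com.android.vending": "billing",   # most specific prefix first
              "android.net": "network",
              "com.android": "framework"}

def categorize(intents):
    counts = Counter()
    for action in intents:
        for prefix, label in CATEGORIES.items():
            if action.startswith(prefix):
                counts[label] += 1
                break
    return counts

print(categorize(["android.net.conn.CONNECTIVITY_CHANGE",
                  "com.android.vending.INSTALL_REFERRER"]))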
3) API Calls
Safe APIs are resources made available only through the operating system: GPS, camera, SMS, Bluetooth, and network data are some examples. To use such resources, an application must declare them in its manifest [47]. Cost-sensitive APIs are those that can incur charges through their usage, such as SMS, data/network, and NFC. Each Android version includes these APIs in the OS-controlled set of protected APIs that require the device user's explicit permission. API calls that grant access to sensitive information or device resources are commonly detected in malicious code. These calls are isolated and compiled into a separate feature set because they may contribute to harmful activity. Table 4 elaborates dangerous API features.
4) App Components
A program that requires data or functionality from another application, e.g., a route from point A to point B predicated on the user's location, makes a call to that application's API, stating its data/functionality demands; the other application carries the data/functionality that the first program requested. For privacy reasons, some components must be declared and should not be used silently in these apps. These components relate to the broadcast features present in the applications.
5) Packages, Services and Receivers
The package manifest is found in the package's root and provides information about the package, such as its registered name and version number. It also specifies crucial data to convey to the user, such as a consumer-facing name for the program that displays in the user interface (UI). The package file format is .json.
Following a publish-subscribe model, Android apps can transmit and receive broadcast messages from the Android system and other Android apps. These broadcasts are sent out when a noteworthy event occurs; the Android system, for example, sends broadcasts when system events such as booting up or charging occur. Apps can register to receive specific broadcasts [48]. When a broadcast is sent, the system automatically routes it to the apps that have registered for that type of broadcast. Services, unlike activities, have no graphical user interface; they are used to build long-running background processes or a complex communications API that other programs may access. In the manifest file, all services are represented by <service> elements, and they allow the developer to alter the structure of the application without a UI.
C. Classification
The collection of chosen features forms the signature database, separated into training and test data, and is used to recognize Android malware apps with traditional machine learning techniques [49]. There are three learning frameworks: supervised learning, unsupervised learning, and reinforcement learning. This paper focuses on supervised learning, which comprises algorithms that learn a model from externally provided instances of known data and known results, so that the learned model predicts labels for new data [50]. The deployment of ensemble techniques and strong learning classifiers supports classification of our binary feature sets, resulting in correctly categorized malware and benign samples. We believe these classification mechanics produce efficient outputs because of their sorting nature. Fig. 6 explains the process of the learning model.
We perform a comparative algorithm selection for our model based on AdaBoost, Naive Bayes, Decision Tree, K-Neighbors, Gaussian NB, Random Forest, and Support Vector Machine, a relative review that gives an accurate analysis of which algorithm best predicts for our model.
1) Algorithm Characteristics Appraisal
The assessment of the suggested algorithms was carried out in Python. FPR and accuracy are used to assess our comparative algorithm trials [51]. These estimates, derived from the following basic factors, are listed below:
Accuracy: Accuracy is one criterion used to evaluate classification techniques. True Positive (TP) refers to the number of malicious apps correctly classified as malicious, while False Positive (FP) counts the benign applications misidentified as malicious. True Negative (TN) measures the correctly identified benign applications, and False Negative (FN) denotes the number of malicious apps wrongly labelled as benign [52].
False Positive Rate: Measures the model's tendency to flag benign apps as malicious, i.e., its rate of generating FPs.
\begin{align*} {(Accuracy)}_{m,b}&=\frac {{(TP)}_{m,b}+{(TN)}_{m,b}}{All~samples} \tag{1}\\ {(FPR)}_{m,b}&=\frac {{(FP)}_{m,b}}{{(FP)}_{m,b}+{(TN)}_{m,b}}\tag{2}\end{align*}
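As a quick worked illustration of Eqs. (1) and (2), the following sketch computes both metrics from raw counts; the counts are made up for illustration:

# Worked illustration of Eqs. (1) and (2); the counts are illustrative.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)     # Eq. (1): correct / all samples

def fpr(fp, tn):
    return fp / (fp + tn)                      # Eq. (2): benign flagged as malware

# e.g., 960 malware caught, 940 benign cleared, 3 benign flagged, 97 malware missed
print(accuracy(tp=960, tn=940, fp=3, fn=97))   # ~0.95
print(fpr(fp=3, tn=940))                       # ~0.003, i.e. 0.3%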
Proposed Methodology
The major goal of our research is to determine which criteria are most helpful in detecting malware on mobile phones, particularly those running Android. We train up to six machine learning algorithms, namely AdaBoost, Support Vector Machine, Decision Tree, KNN, Naive Bayes, and Random Forest, and evaluate how accurately each classifies. The methodology section is divided into two parts: Pre-Processing (explaining the prerequisite processing) and the Proposed Model (model functionalities and components).
A. Pre-Processing
APK files from numerous apps were included in the resulting datasets (containing malware and benign characteristics). The Jadx-GUI decompiler is then used to reverse engineer the APK files and extract the feature set from the Android manifest file for further processing. These stages are regarded as pre-processing before the real assessments and are essential before any testing and training of the predictive models.
Androguard, an open-source tool that extracts prioritized features from files and converts them into binary values, is used for feature extraction. For labelling an application, we employ binary labels: 1 for benign and 0 for malware. Figure 7 shows our technique's pre-processing framework and flow structure, which must be completed before the classifiers are tested.
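A hedged sketch of this loading and labelling step, assuming the extraction step wrote the binary vectors to a .csv file; the file and column names are illustrative:

# Sketch: loading the binary feature dataset produced by the extraction step.
# The file name and the label column name are illustrative placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("features.csv")          # rows: apps; columns: binary features
X = df.drop(columns=["label"])            # feature matrix (0/1 values)
y = df["label"]                           # 1 = benign, 0 = malware

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=True, random_state=0)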
The operations embedded in the rectangle must be completed beforehand to ensure efficient data collection. The main roles here belong to the decompiler and the extractor, which improve and ease the model's data classification efficiency for detecting malware applications. Although our study discusses the challenge of multicollinearity and the use of high-dimensional data, and we address handling high-dimensional data in the feature extraction section, the collinearity issue still stands and can be treated as a novel contribution in future work.
Following the extraction process and the use of efficient datasets accommodating useful features, testing and training are administered. For our model, a comparative approach is adopted based on Naive Bayes, Decision Tree, K-Neighbors, Gaussian NB, Random Forest, Support Vector Machine, and AdaBoost; the comparative evaluation provides an accurate assessment of the algorithm used for our model's predictions. The installation package is a ZIP-compressed bundle of files that includes the manifest file (AndroidManifest.xml) and classes.dex. The manifest file describes the Android application, namely the activities, services, broadcast receivers, and content providers that make up the system. The methodology and the classification were explained earlier in the related work section. The next section describes the model functionality.
B. Proposed Model
The model gathers information from many Android applications (Google Play). These reverse-engineered (decompiled through Jadx-GUI) apps are then subjected to static analysis to extract features. Our suggested approach in figure 8 uses the retrieved characteristics, parsed through Androguard, to create vector mappings for the training phase. Our contribution is indicated by the proposed feature section, which encompasses nearly 56,000 extracted features from the feature sets shown in figure 8. The collected features are then composed into a .csv dataset file, encoding the benign and malware properties as 1 or 0. After we generate the datasets, the features are ready for classification by the predictive models. We used Python to create a program that compares classification performance across machine learning algorithms, and we then employ the most accurate algorithms to train our models for malware and benign detection. The system's approach and operation are detailed in figure 8, which depicts the whole methodology of our model and the algorithm learning phase, with the training model processing for detection. Figure 9 shows the training cycle of the program: the model is first constructed and then evaluated, after which the data is cycled toward testing, i.e., fed to the trained model for further prediction analysis of the Android applications.
The next section points out future threats and predictions concerning insecure Android applications that contain unnecessary permissions and offer an attacker an easy way to steal private data or launch major attacks.
Future Threats and Prediction
By 2020, mobile applications had been installed on consumer devices over 205 billion times. Statistics from Marketing Land suggest that 57 percent of overall digital media time is spent on mobile devices. Our daily activities increasingly depend on social networking, bank transfers, business operations, and mobile managed-services applications. Juniper sources put the number of mobile banking users at over two billion individuals, almost 40% of the world's total population. The predictions and future threats below are based on data collected through an extensive survey of journals, online forums, and research articles.
Developers devote close attention to making software comfortable and seamless, and when someone enthusiastically installs a mobile application requiring personal information, the user stops thinking about the security consequences. This is why people rarely look closely at the permissions or feature updates requested by applications [61]. They simply download the application they want, overlook everything during installation, and start using the app. Many of these applications never even ask the consumer's consent, and hackers use the information without the user's knowledge. The threat grew at the end of 2020 and the beginning of 2021:
70% of Google Play Store applications request access to at least one dangerous permission or package, up from 66.6% in Q1 2020, a 5 percent rise. 69.4% of applications for children (under 13 years of age) claim at least one risky permission, up from 68.8% in 2020 (a 1 percent rise).
Out of over 2.3 million applications altogether, over 2.1 million applications for children request at least one harmful authorization.
Multiple tech reviews published in 2021 state that, according to research on 2,500 top and rising applications, over two-thirds of the most popular Android applications on Google Play request excessive user permissions and access. Among other unwanted behaviors, these allow apps to launch harmful scripts and access messages unnecessarily through unwanted inbuilt features [62]. With the increase in usage of application components and features, and the release of new Android frameworks and APIs each year, threats are likely to increase by 15% rather than 5%. The average Android user has roughly 80 applications installed, so at least one app on the phone demands excess authorization. Such excessive authorization may jeopardize user data and privacy or even allow device hacks.
Figure 11 elaborates the increase in dangerous malware up to 2020 with every newer API level. Figure 12 shows the fastest-rising apps from 2016 to 2021 and the percentage of dangerous permissions and packages these applications acquire [63]. These applications are used daily, and if they are involved in unnecessary and third-party access, there is a special need to apply countermeasures to them, as this is going to be a major threat in the future.
The figure also depicts the need to measure these threats and devise countermeasures, or at least to present models providing more secure procedures for these well-known applications. These apps provide many opportunities, but with the increase in private and intellectual property stored in them, certain remedies need to be proposed.
Experimental Results
In this section, the results of our experimentation are stated. To begin the discussion, we elaborate the basic criteria for performing our implementation, briefly describe the data collection and datasets, and then turn to the actual contribution.
A. Experiment Setup
Our environment is based on Windows 8.1 Pro with an Intel® Core™ i5-2450M CPU at 2.50 GHz. The system has 4.00 GB of installed memory (RAM) and a 64-bit operating system (OS) on an x64-based processor.
For the generated dataset, Androguard 3.3.5 (latest release) is used for decompiling and feature extraction, with the output deployed in regulated .csv files as binary vectors. We installed Python 3.8.12 on our system for the implementation and execution of the training and testing scripts of the imported machine learning models.
B. Dataset
Three different datasets are used for our implementation, mainly comprising apps from Google Play. The static features of our first two datasets, containing API calls, permissions, intents, packages, receivers, and services, were collected from MalDroid [64] and DefenseDroid [65], which together include around 14,000 malware samples. The model also uses a third, self-generated dataset of around 6,000 malware samples and 2,421 benign samples. Applications in this dataset were randomly selected from Google Play and reverse-engineered with the Jadx-GUI tool to acquire their APKs; the features present in our selected applications were then extracted by Androguard into binary data. The datasets from the different platforms are combined into a single training set, incorporating more feature sets than state-of-the-art approaches (explained in table 5), to achieve higher accuracy in malware classification. The datasets are first used to train every algorithm for comparative classification analysis. After the accuracy of the algorithms is evaluated, the dataset is again trained and tested on the higher-performing algorithms; based on the features inserted into the database, our model then forecasts the output for a given Android application's extracted features. Table 6 represents the datasets' training and testing ratios and the number of columns before and after pre-processing.
The next subsection presents and discusses the programs for our machine learning algorithms.
C. Machine Learning Algorithm and Ensemble Learning
Six models were selected for experimentation, including three strong classifiers (AdaBoost, SVM, and Random Forest). The model executes KNN, NB, Random Forest, Decision Tree, and SVM, and we also run AdaBoost with a Decision Tree base learner, reweighting the Decision Tree's data points by their weighted error. As the input parameters are not jointly optimized, AdaBoost is less prone to overfitting, and it can increase the performance of existing weak classifiers: once the wrongly classified data points are given higher weight and correctly classified, the model's accuracy improves. Figure 13 shows the functioning of the boosting technique.
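A minimal scikit-learn sketch of this boosting setup follows; synthetic data stands in for the real feature vectors, and the hyperparameters are illustrative rather than the tuned values from the paper:

# Sketch: AdaBoost over decision-tree weak learners (scikit-learn).
# Synthetic binary data stands in for the real feature vectors.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

stump = DecisionTreeClassifier(max_depth=1)          # weak learner
model = AdaBoostClassifier(estimator=stump,          # 'base_estimator' on older sklearn
                           n_estimators=100)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))                   # mean accuracy on held-out data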
Since there is a distinct boundary between the two categories, ensemble methods and SVM perform well when dealing with cleanly aligned datasets produced by adequate extraction processes. Another significant benefit of the SVM algorithm is that it can handle high-dimensional data, which is valuable for its use and application in machine learning. As seen in the diagram above, AdaBoost's reweighting of misclassified binary feature inputs helps our weak learner (Decision Trees) achieve higher accuracy and wider coverage.
D. Program Parameters
Our project is based on Python 3.9.7, and the execution is divided into two programs. The first program compares the algorithms for accuracy, covering AdaBoost, Decision Tree, KNN, SVM, Naive Bayes, and Random Forest. The program uses import and split functions to train the models and then stores the result in a variable used by the testing model. We use sklearn.model_selection to access the algorithm utilities, accuracy_score for accuracy readings, pandas to read the database, and NumPy to convert the testing data into arrays.
The parameter on the x-axis is the algorithm and on the y-axis its label (figure 19), meaning the accuracy percentage for each algorithm. The x (features) and y (labels) data of the program are split with shuffle = True using the train_test_split function, so each algorithm takes random permission values from the dataset. Figures 14 and 15 show the import modules and parameter values set in our program.
First, all the algorithms are imported into the program and trained on the given datasets. Each algorithm takes random binary values of apps from the dataset and stores its accuracy score in a variable. After training, the program passes the testing data to a predictive function. The program identifies normal and harmful permission features through the dataset's binary values (0 or 1) and reports those results via the function pred(). As shown in the code below, the program uses a fit() function, which takes the training data as arguments fitted with the x and y parameters, producing testing data for our two models (AdaBoost and SVM). The accuracy returned for each algorithm is stored in the variable acc. After executing the program, every algorithm accesses the dataset and predicts labels for the Android features. Figures 16 and 17 represent the main key functions for our AdaBoost and SVM models, discussed above.
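Since Figures 14-17 are screenshots, the following is a hedged reconstruction of the comparison loop they describe, not the authors' verbatim script; the CSV name and label column are illustrative:

# Hedged reconstruction of the comparison program (not the verbatim script).
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("features.csv")                       # as in the pre-processing sketch
X, y = df.drop(columns=["label"]), df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    shuffle=True, random_state=0)

models = {"AdaBoost": AdaBoostClassifier(),
          "DecisionTree": DecisionTreeClassifier(),
          "KNN": KNeighborsClassifier(),
          "SVM": SVC(),
          "NaiveBayes": GaussianNB(),
          "RandomForest": RandomForestClassifier()}

for name, clf in models.items():
    clf.fit(X_train, y_train)                 # train on the shuffled split
    pred = clf.predict(X_test)                # predict labels for unseen apps
    acc = accuracy_score(y_test, pred)        # accuracy stored in `acc`
    print(f"{name}: {acc:.4f}")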
Figure 17 also explains the predictive procedure of the ensemble model with 1,000 malware sample runs and the given training features for a single predictive classification output. The same procedure applies to the SVM model.
Figure 19 shows the accuracy percentages of our models; the best reaches 96.24%, and the graph displays the highest correct predictive frequency of all the algorithms, supporting the validity of the research. This graph is plotted by training the algorithms on the datasets to verify which algorithm classifies the applications' features most accurately. Program 1 (the Python script for model accuracy) imports all of the algorithms and executes them one by one on these datasets, producing the most precise values after testing. In the case of AdaBoost, we first trained a Decision Tree on the dataset and then used its classifications to train AdaBoost with higher weights: AdaBoost takes the samples and features classified by the decision trees and assigns higher weights so that previously misclassified results are corrected when training on those features again.
Model Precision Evaluation
After training the algorithms on the datasets and obtaining accuracy percentages, we developed a second program that reuses the previous code to predict an application's state from dataset input. For this program, the algorithms with the greatest prediction capabilities, AdaBoost and SVM, are imported via sklearn.svm (LinearSVC) and sklearn.ensemble (AdaBoostClassifier). The database stores input features in the rr Python variable as feed for the trained models, with 1 designating benign applications and 0 designating malware; an app that uses unnecessary features will produce the output 0, helping the user understand that it is malicious. When the program executes, the algorithms take the input from the database and categorize the features based on what the algorithms were trained upon. So, if malware applications are fed as input to the database, the trained model will predict the outcome and label the state of the application.
Following the import of the trained models, random_state = 0 and a testing fraction of 0.25 are set for the algorithms. The program imports sklearn.preprocessing's normalize function, which rescales each sample (each data-matrix row) independently of the other samples to unit norm. It also imports sklearn.feature_extraction.text, which transforms a text data array into a matrix of token counts, and finally reports the accuracy score of the algorithms using sklearn.metrics, implementing the loss, score, and utility functions that quantify classification performance on the feature sets. The parameters of this program are the same as in the previous program, except that the random state is fixed so every algorithm receives the same features.
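A hedged sketch of this prediction step follows; the variable name rr echoes the paper's description, while the stand-in training data and input vector are illustrative so the snippet runs on its own:

# Hedged sketch of the prediction program; `rr` follows the paper's naming.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.preprocessing import normalize

# Stand-in training data so the sketch runs on its own; in the real program
# the model comes pre-trained from Program 1.
X = np.random.randint(0, 2, size=(200, 5))
y = np.random.randint(0, 2, size=200)
model = AdaBoostClassifier().fit(normalize(X), y)

rr = np.array([[1, 0, 1, 0, 1]])           # binary features of one app (illustrative)
rr = normalize(rr)                          # rescale the row to unit norm

state = model.predict(rr)[0]
print("benign" if state == 1 else "malware")   # 1 = benign, 0 = malware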
The prediction results of the program are discussed next. As the code executes, the models take the features provided in the dataset for a single application. The result displayed in Figure 22 shows a benign application; when malicious permission features are fed as input, Figure 23 shows a malware verdict based on the features the trained models draw out. In the same manner, the database is fed with binary feature values and the model predicts the result as 1 or 0. Figures 16 and 17 elaborate on the predictive function that allows AdaBoost and SVM to classify applications from the fed input. Figures 22, 23, 24, and 25 are output screenshots, with 1 indicating benign and 0 indicating harmful applications, for random application features on the respective models.
A. Results
After running our models, the results show that the accuracy of our two best predictive systems is 96% and 92%. The proposed model does not reach the highest published accuracy or predictive rate, but it contributes by introducing enhanced, large feature sets (around 56,000 newly extracted features) with datasets of the latest API-level applications collected in more recent years than state-of-the-art approaches. Another reason for the lower predictive rate is the limitation of our resources and environment for processing and generating these datasets on our models. The novelty and contributions are explained in Tables 1 and 2.
Figures 26, 27, 28 and 29 show the runs performed on the datasets with our trained models. The applications in orange indicate non-harmful apps; those passing only sensitive features over the line do not pose much of a threat, but they still show the model's difficulty in indicating true negatives for apps labelled 0. The applications in black indicate harmful applications; the false positives of this category, falling on the non-harmful side of the line, amount to about 3–4 applications for AdaBoost and 6–7 for SVM over 1,000 runs, as shown in the figures above, consistent with the 96% and 92% accuracy of AdaBoost and SVM.
All four figures are plotted against a hyperplane that divides the application classifications into two sections, harmful and non-harmful. Points above the line represent the harmful apps (black and red), and applications lying below the line are non-harmful. The plotted hyperplanes help in understanding the predictions: Figs. 27 and 29 show successful classification above the line, with 3–4 apps below the line indicating misclassifications. The same holds for the non-harmful apps in orange (Figs. 26, 28), where points above the line are misclassifications, though they do not pose serious threats.
Next is the comparative review of both malicious and benign applications for our models, with the experimental results, cumulative accuracy, and FPR. The purpose of plotting a comparative malware-detection graph is to understand both parameters in relative perspective. Figure 30 represents a comparative analysis of both models on malicious and benign applications. Red triangles represent the classification and detection of AdaBoost, and squares represent SVM. The graph shows the malware section for the executed runs, with values above the hyperplane falling in the non-harmful category. The 0.7% misclassification rate of SVM and 0.3% of AdaBoost are plotted, with malware applications falling into the true positive category.
Nevertheless, the best model performs with 96.24% accuracy by correctly predicting the application categories.
We use accuracy and FPR as evaluation markers in this project. Accuracy is computed as the percentage of correctly labelled samples among those tagged by the detection system, showing the system's capacity to discriminate malware properly. The false positive rate (FPR) judges the model's performance in terms of how many benign apps it wrongly flags. The experimental results are quantified in table 8, which lists accuracy, false positive rate, and predictive measures after testing binary input for 1,000 runs on our two best predictive models, trained and tested on mixed datasets of features and malware samples. The operational speed advantage of AdaBoost is not apparent on datasets of this size; however, given AdaBoost's structural support for parallel learning, we anticipate it will perform better when computing bigger datasets. We reached the same conclusion after analyzing a much bigger dataset with over 500,000 apps.
In table 8, both models, trained on the datasets, are compared by accuracy, FPR, and the features used and selected for the composed samples. The FPR is also presented in figures 26 to 29 above, with the calculation visualized through a hyperplane. The accuracy and false positives were measured with the equations described in Section IV (algorithm characteristics) over the number of runs of the model. Results show 96.24% as the highest accuracy for the model after experimentation, with a false positive rate of 0.3% for the ensemble approach.
The related works explain the originality of our model and exhibit its novel features and sample size. In conclusion, our model still falls a few percentage points short in detection accuracy. To put this in context, table 9 presents some properties of similar studies with higher performance rates, indicating the elements behind the efficiency of those systems.
The model in [29] has exceptional computational/processing power and a much stronger environment in which to test and train its datasets. [24] has similar resources with higher processing power, but its sample size is very limited in comparison to our model. A few other studies describe similar technical advantages, leaving us to work under restrictive measures. Table 9 presents key properties of these similar systems' components.
Research Issues and Challenges
This section highlights the prevalent and crucial issues in our experiment. These hurdles arise at various stages of our work and may be gradually rectified in future work.
Features declared on the device are more durable than features specific to individual applications and can therefore usually support automated malware detection. The range of Android parameters to process is rather large, and malware is difficult to detect properly if the features are not extracted properly.
The number of apps is still increasing fast. Malware apps can potentially always be identified by combining detection with AI or machine learning methods, such as deep learning, to make detection more sophisticated and to improve and regulate the app prediction rate.
Application behaviours in the malware ecosystem encourage ever-emerging threats. Our study does not incorporate rider analysis or the behaviour of repackaged malware; it simply takes the reverse-engineered APK files, passes the given context to AndroGuard, and extracts features as binary vectors. This is a major issue and a key challenge given the advancement of Android malware. An advanced follow-up project will perform differential and behavioural analysis on reversed applications, determining the effects of these applications and their results.
Applications induce new features over time with enhanced malware abilities, which is why we would have to upgrade the system whenever the model's FPR increases after execution. The simplest explanation of how to identify model degradation on evolved features is that our datasets are binary matrices extracted from features currently implemented in applications, not features that will appear in evolved apps in coming years. With new features, we would have to reverse and extract those features again to form an updated dataset and retrain the classifiers. [66], [67], [68] and [69] discuss this key issue and propose some possible solutions, but for our model, given our resources, we have only handled current features. For future work, we will consider model sustainability, so that our system can classify malware even when its features are not yet implemented.
The introduction mentions the problem of multicollinearity, the rise of dependent variables within machine learning algorithms that distorts the interpretation of results. This field of study can be taken up as future work on testing several models for handling multicollinearity, since our model already performs high-volume detection to generate accuracy for Android application feature malware. We foresee this issue and will incorporate an efficient solution to the problem. The authors in [70], [71], [72] propose some solutions to tackle this challenge that can help address readers' queries.
A. Limitations
The technique in this paper is based on binary classification over lightweight static feature sets present in the Android manifest file. The three major limitations of our method are:
The research does not include dynamic or runtime application features. In the future we will consider the potential dynamic aspects of Android applications, including runtime permissions, API requests, and the features that can be extracted from them, and we will evaluate the behavioural traits of apps using a mixture of dynamic and static evaluation to discover harmful tendencies.
Our system lacks future-sustainable operative measures, meaning it will need to be upgraded for forthcoming API levels, new malware collections, and the new innovative features present in Android applications.
The constraint of a slow, low-powered processing environment is another reason for the lower accuracy and predictive measures of our model in comparison to a few peak detection techniques achieving higher accuracy.
Conclusion
In this research, we devised a framework that detects malicious Android applications. The proposed technique takes into account various elements of machine learning and achieves 96.24% accuracy in identifying malicious Android applications. We first define and pick features to capture and analyze Android apps' behaviour, leveraging reverse engineering and AndroGuard to extract features into binary vectors, and then use Python modules and shuffled train/test splits to train the model with benign and malicious datasets. Our experimental findings show that our suggested model has a false positive rate of 0.3% with 96% accuracy in the given environment, with enhanced and larger feature and sample sets. The study also found that ensemble and strong-learner algorithms perform comparatively better when classifying high-dimensional data. The suggested approach is restricted to static analysis, lacks sustainability measures, and does not address the key multicollinearity barrier. In the future, we will consider model resilience in terms of enhanced and dynamic features; the issue of dependent variables or high intercorrelation between features before employing machine learning algorithms is also a promising field.