An Android Malware Detection Approach Based on SIMGRU

With the rapid development of the Internet, the volume of malware has reached an unprecedented level and seriously threatens global network security. In this article, we propose a static Android malware detection approach based on SimGRU. Because similarity measures from clustering are widely used in the static analysis of Android malware, we introduce similarity into the Gated Recurrent Unit (GRU) and obtain three SimGRU structures: InputSimGRU, HiddenSimGRU, and InputHiddenSimGRU, where InputHiddenSimGRU combines InputSimGRU and HiddenSimGRU. Experiments show that InputSimGRU, HiddenSimGRU, and InputHiddenSimGRU outperform the conventional GRU model and other methods.


I. INTRODUCTION
Nowadays, a large amount of mobile software is constantly released into application markets, which provides favorable conditions for the spread of malicious software. As a result, newer and more complicated Android malware is emerging, along with a rising number of malicious Android apps [1]. Malware targeting the Android platform is continuously becoming more capable and applies new techniques to thwart analysis and avoid detection. Android malware on mobile phones is therefore active and persistently jeopardizes the security of mobile terminals.
To detect malware effectively, researchers have proposed many approaches [2]-[9]. They can be classified as static detection, dynamic detection, and hybrid detection. Static detection decompiles the application by reverse engineering and extracts critical features to judge whether it contains malicious code. Static analysis is based on matching analysis, which can detect known malware quickly and effectively. Dynamic detection identifies malicious behavior while the application runs in a simulated environment. Hybrid detection combines dynamic and static detection. This article mainly studies the static detection of Android malware.
The associate editor coordinating the review of this manuscript and approving it for publication was Diego Oliva.
Sun et al. [2] presented a feature extraction method for Java source code and applied an SVM so that the system can adapt to new malicious samples and detect both new and existing malware. Wang et al. [3] designed a novel method that extracts permission patterns from the differences between benign Android apps and malware and uses these differences to detect Android malware. Talha et al. [4] presented a permission-based Android malware detection system that applies static analysis to classify Android applications as benign or malicious. Fan et al. [5] proposed a malware family detection system in which Android malware is automatically classified based on frequent subgraphs that represent the features of the same malware family. Shang et al. [6] proposed an Android malware detection model based on improved naive Bayes classification. Allix et al. [7] devised several machine learning classifiers that rely on a set of features built from applications' CFGs. Wu et al. [8] proposed an Android malware detection system that provides accurate classification and sensitive data analysis; the study adopted a machine learning approach that uses dataflow application program interfaces (APIs) as classification features to detect Android malware. Hou et al. [9] constructed weighted directed graphs and then applied a deep learning framework based on the graph-based features to detect unknown Android malware.
Although many machine learning approaches have been proposed to detect Android malware, some problems remain. First, the effectiveness of these methods, except for deep learning approaches, depends on manually extracted features. Second, although there are many deep learning approaches for Android malware detection, most of them only apply existing deep learning techniques and do not tie the characteristics of malware detection to the deep learning model.
This article presents a static Android malware detection approach based on SimGRU. Because similarity is widely applicable in static malware detection, we introduce the similarity principle into the GRU cell and establish a new Android malware detection model, SimGRU. Furthermore, we divide the model into three types: InputSimGRU, HiddenSimGRU, and InputHiddenSimGRU. The experiment shows that the proposed approach performs better than the traditional GRU model and three other methods.
The rest of this article is organized as follows. Section II reviews work related to Android malware detection. Section III introduces the traditional GRU model. Section IV explains in detail how GRU is extended to SimGRU. Section V presents the experiments. Finally, Section VI summarizes the article.

II. RELATED WORK
With the rapid development of malware and the emergence of various polymorphic and deformation techniques, traditional detection methods have become inefficient. In recent years, growing attention to network security has accelerated research in the field, which has led to the rapid development of Android malware detection.
Junaid et al. [10] presented a static analysis framework called Dexteroid, which uses reverse-engineered life cycle models to capture the behavior of Android components accurately. Aafer et al. [11] introduced DroidAPIMiner, a lightweight classifier that captures the relevant features of malicious software. Fan et al. [12] proposed DAPASA, a method that detects malicious Android applications through sensitive subgraph analysis. DAPASA generates sensitive subgraphs and extracts five features from them to describe the different calling patterns of the application's sensitive APIs. As an essential feature of Android apps, the function call graph describes the behavior of Android applications very well. Wang et al. [13] extracted a very large number of features from each app and categorized them into two groups: app-specific features and platform-defined features. These features were fed into four classifiers (Logistic Regression, linear SVM, Decision Tree, and Random Forest) for malware detection. Li et al. [14] proposed an Android malware detection system based on feature fusion to improve the detection efficiency for malicious Android applications.
Shang et al. [15] performed malware detection using the similarity of function call graphs. Under code obfuscation, function call graphs are more robust than traditional feature codes. Xu et al. [16] proposed a method for identifying malware variants based on function call graphs. First, the malicious program is disassembled to obtain a function call graph; then a graph coloring technique is used to compute the similarity between two function call graphs; finally, the similarity is used to identify variants of known malware. Utku and Doğru [17] developed a permission-based Android malware detection system that uses machine learning methods. Song et al. [18] proposed a static detection framework consisting of four layers of filtering mechanisms: message digest values, combinations of malicious permissions, dangerous permissions, and dangerous intentions. Rastogi et al. [19] studied several repackaging detection techniques in detail.

III. GRU
GRU [20] is a special RNN model that addresses the vanishing gradient problem of the native RNN and is widely applied in classification and other fields. The GRU structure has two gates: the reset gate and the update gate. Together, the two gates learn adaptively from the current input and the hidden state output by the previous GRU unit. The main structure of the GRU cell is shown in figure 1. At time step $t$, the learning process is computed from the following quantities: $x_t$ is the input vector, $\sigma$ is the sigmoid activation function, $W_z$, $W_h$, and $W_r$ are the mapping matrices, $U_z$, $U_h$, and $U_r$ are the weight matrices, and $b$ is the bias.
VOLUME 8, 2020
As can be seen from figure 1, $x_t$ is the original input vector, and $h_t$ is the hidden state learned from the hidden state $h_{t-1}$ of the previous step and the input vector $x_t$ at time step $t$. $r_t$ is the output of the reset gate, and $z_t$ is the output of the update gate. Since the input vector and the hidden state represent different information, they jointly control the information transfer in the GRU unit. The reset gate and the update gate perform similar calculations, which allows the GRU structure to retain as much information as possible.
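For reference, the gate computations described in this section are commonly written in the standard GRU form below. This is a sketch of the widely used formulation, not a transcription of the article's own equations. The per-gate biases $b_z$, $b_r$, $b_h$ and the choice of which term $z_t$ weights in the last line are assumptions; some references swap $z_t$ and $1 - z_t$.

```latex
\begin{align}
z_t         &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) \\
r_t         &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) \\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right) \\
h_t         &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{align}
```

Here $\odot$ denotes element-wise multiplication and $\tilde{h}_t$ is the candidate state.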

IV. SIMGRU
Because other works successfully use similarity functions to detect malware, we introduce the principle of similarity into GRU to improve the performance of Android malware detection. We name the proposed model SimGRU.

A. BASIC PRINCIPLES
Android malware belonging to the same family usually calls common sensitive functions because these functions implement similar behaviors. As a result, the sequences of sensitive function calls made by malware of the same family are similar, and similarity functions are often used to quantify this similarity.
There are many similarity functions. Euclidean distance, which expresses the similarity between two objects by the distance between them, is one of the most commonly used. The cosine function, which measures the angle between two vectors, is also widely used in malware detection. For two $n$-dimensional vectors $x$ and $y$, the Euclidean distance and the cosine function are calculated as follows:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

$$\cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$

Because similarity functions perform well in malware detection, we introduce them into GRU to detect Android malware and name the resulting model SimGRU. Furthermore, we divide the model into three types: InputSimGRU, HiddenSimGRU, and InputHiddenSimGRU. The static features of both benign apps and malware are converted to the vector $x_t$, which is input to the SimGRU model. The static features include permissions, API calls, network addresses, and so on.
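The two similarity functions named in this subsection can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names, the zero-vector handling, and the sample feature vectors are assumptions made for the example and do not come from the article.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two equal-length vectors (0 means identical)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_similarity(x, y):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        # Assumption: an all-zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_x * norm_y)


# Hypothetical binary feature vectors (1 = feature present), for illustration only.
app_a = [1, 0, 1, 1, 0]
app_b = [1, 0, 1, 0, 0]
print(euclidean_distance(app_a, app_b))  # 1.0 (the vectors differ in one position)
print(cosine_similarity(app_a, app_b))
```

Note that the two measures point in opposite directions: a larger cosine value means more similar vectors, whereas a larger Euclidean distance means less similar vectors.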

B. INPUTSIMGRU
The input vector $x_t$ of GRU is a vectorized representation of the original data. The similarity between the inputs at two adjacent time steps is $s = \mathrm{sim}(x_{t-1}, x_t)$, and the corresponding similarity term is calculated as $x = (1 - s)\,x_{t-1}$. We introduce this similarity function into GRU and name the result InputSimGRU. The basic cell unit is shown in figure 2.
The position indicated by the wide arrow is the similarity computed from the input $x_t$. The similarity term of the inputs at two adjacent time steps is fed, together with the current input, into the reset gate and the update gate, and it determines whether the gates are triggered. The input $x_t$ is still used as the input to the candidate state $\tilde{h}_t$ so that the input at the current time step is memorized. The hidden state $h_t$ of InputSimGRU at time step $t$ is then calculated from these quantities.

C. HIDDENSIMGRU
The hidden state $h_t$ of GRU simulates the behavior of human neurons and serves as the internal representation of the GRU cell, which may help detect Android malware more effectively. Therefore, the similarity can be computed on hidden states instead of inputs: $s = \mathrm{sim}(h_{t-2}, h_{t-1})$, with the corresponding similarity term $h = (1 - s)\,h_{t-2}$. The resulting GRU is named HiddenSimGRU, and its cell is shown in figure 3. The position indicated by the wide arrow is the similarity computed from the hidden states. The similarity of the hidden states $h_{t-2}$ and $h_{t-1}$ is calculated at the current time step $t$. The similarity term of the two adjacent hidden states is then fed, together with the current hidden state, into the reset gate and the update gate. The other states are computed in the same way as in the InputSimGRU cell.

D. INPUTHIDDENSIMGRU
Since the input and the hidden state help detect Android malware from different perspectives, the similarity of the inputs and the similarity of the hidden states can be calculated simultaneously in the GRU cell so that more similarities are learned. We name this model InputHiddenSimGRU, which is shown in figure 4. The position indicated by the wide yellow arrow is the similarity calculated from the hidden states, and the position indicated by the wide green arrow is the similarity computed from the inputs; both are fed into the reset gate and the update gate.
In summary, the three GRU structures above can be described by a single formulation with two switches $K$ and $J$, each of which is 0 or 1. When $K = 0$ and $J = 0$, the model is the original GRU. When $K = 1$ and $J = 0$, the model is InputSimGRU, and the similarity of the inputs is calculated. When $K = 0$ and $J = 1$, the model is HiddenSimGRU, and the similarity of the hidden states is computed. When $K = 1$ and $J = 1$, the model is InputHiddenSimGRU, and the similarity of both the inputs and the hidden states is calculated in the cell.
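The role of the two switches can be illustrated with a small, dependency-free sketch of one recurrent step. It is one plausible reading of the description above, not the article's exact equations: the extra weight matrices `V*` (for the input similarity term) and `Q*` (for the hidden similarity term), the additive way the similarity terms enter the gates, the parameter dictionary layout, and the update convention in the last line are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is all zeros (e.g. initial states)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def matvec(m, v):
    return [sum(w * a for w, a in zip(row, v)) for row in m]


def add(*vectors):
    return [sum(parts) for parts in zip(*vectors)]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def simgru_step(x_prev, x_t, h_prev2, h_prev, p, K=0, J=0):
    """One step: K=0,J=0 plain GRU; K=1 adds input similarity; J=1 adds hidden similarity."""
    # Similarity terms as described in the text: (1 - s) * x_{t-1} and (1 - s) * h_{t-2}.
    x_sim = [(1.0 - cosine(x_prev, x_t)) * a for a in x_prev]
    h_sim = [(1.0 - cosine(h_prev2, h_prev)) * a for a in h_prev2]

    def gate(name):
        terms = [matvec(p["W" + name], x_t), matvec(p["U" + name], h_prev), p["b" + name]]
        if K:
            terms.append(matvec(p["V" + name], x_sim))  # hypothetical weights V
        if J:
            terms.append(matvec(p["Q" + name], h_sim))  # hypothetical weights Q
        return sigmoid(add(*terms))

    z, r = gate("z"), gate("r")
    gated_h = [ri * hi for ri, hi in zip(r, h_prev)]
    # The candidate state still uses the raw input x_t, as stated for InputSimGRU.
    cand = [math.tanh(a) for a in
            add(matvec(p["Wh"], x_t), matvec(p["Uh"], gated_h), p["bh"])]
    # Assumed convention: z_t weights the candidate state.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


# Tiny all-zero parameter set (3 input features, 2 hidden units) to exercise the step.
n_in, n_hid = 3, 2
params = {}
for g in "zrh":
    params["W" + g] = zeros(n_hid, n_in)
    params["U" + g] = zeros(n_hid, n_hid)
    params["b" + g] = [0.0] * n_hid
for g in "zr":
    params["V" + g] = zeros(n_hid, n_in)
    params["Q" + g] = zeros(n_hid, n_hid)

h_t = simgru_step([1, 0, 1], [1, 1, 0], [0.0, 0.0], [1.0, -2.0], params, K=1, J=1)
print(h_t)  # [0.5, -1.0]: with zero weights every gate is 0.5 and the candidate is 0
```

Keeping the four variants behind two flags mirrors the summary above and makes it easy to train all of them with otherwise identical settings.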

E. TRAINING
The training process is shown in algorithm 1. First, the similarity is obtained by evaluating the sim function. The similarity is then input into the reset gate and the update gate, which control the information transmission and adaptively learn how to detect Android malware. The learned result is merged by a fully connected layer and passed to a softmax layer, which predicts whether the Android application is malicious.
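The final classification stage described above, a fully connected layer followed by softmax, can be sketched as follows. This is only an illustration of that last stage under stated assumptions: the function names, the two-class label order, and the toy weights are hypothetical, and the recurrent part and the training loop of algorithm 1 are not reproduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(h_last, w_fc, b_fc, labels=("benign", "malicious")):
    """Map the last hidden state to class probabilities and pick the most likely label."""
    logits = [sum(w * h for w, h in zip(row, h_last)) + b
              for row, b in zip(w_fc, b_fc)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs


# Toy weights (hypothetical): the first hidden unit pushes towards "malicious".
label, probs = classify([1.0, 0.0], [[-1.0, 0.0], [1.0, 0.0]], [0.0, 0.0])
print(label)  # malicious
```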

V. EXPERIMENT
To evaluate the performance of Android malware detection based on SimGRU, we used the Drebin dataset [21]. Because the number of malicious samples is limited, we performed 10-fold cross-validation. The dataset was partitioned into 10 subsets: one subset was used as the test set, another as the validation set, and the remaining eight as the training set. We repeated the experiment 10 times so that each subset was selected once as the test set. We then analyzed the effectiveness of the different SimGRU structures on Android malware detection.
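The rotation of test, validation, and training subsets described above can be sketched as a small generator. The article does not say which subset serves as the validation set in each round, so taking the fold that follows the test fold is an assumption, as are the function name and the striding scheme; in practice the sample indices would normally be shuffled or stratified before splitting.

```python
def rotating_splits(n_samples, n_folds=10):
    """Yield (train, validation, test) index lists; each fold is the test set once."""
    # Assign sample i to fold i % n_folds so that fold sizes differ by at most one.
    folds = [list(range(i, n_samples, n_folds)) for i in range(n_folds)]
    for k in range(n_folds):
        val_k = (k + 1) % n_folds  # assumption: the next fold is the validation set
        train = [i for j, fold in enumerate(folds) if j not in (k, val_k) for i in fold]
        yield train, folds[val_k], folds[k]


for train, val, test in rotating_splits(100):
    print(len(train), len(val), len(test))  # 80 10 10 in every round
```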

A. DATASET
The Drebin dataset [21] contains 123,453 benign applications, collected from Google Play, Chinese markets, Russian markets, and other sources, and 5,560 malicious samples from 179 different malware families collected between 2010 and 2012. All samples in the Android Malware Genome Project [22] are included. It is one of the best-known Android malware datasets for evaluating detection approaches.
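Three metrics were selected to evaluate the performance of the algorithm: accuracy, precision, and recall. Accuracy is calculated as (19):

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{19}$$

Precision is calculated as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{20}$$

Recall is computed as follows:

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{21}$$

The true positives (TP) is the number of malicious applications classified as malicious. The true negatives (TN) is the number of benign applications classified as benign. The false positives (FP) is the number of benign applications classified as malicious, and the false negatives (FN) is the number of malicious applications classified as benign.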
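The three metrics follow directly from the four confusion-matrix counts, using the standard definitions: accuracy is (TP + TN) / (TP + TN + FP + FN), precision is TP / (TP + FP), and recall is TP / (TP + FN). The sketch below is a minimal illustration; the function name, the zero-denominator handling, and the sample counts are assumptions and are not results from the article.

```python
def detection_metrics(tp, tn, fp, fn):
    """Accuracy, precision, and recall from confusion-matrix counts (malicious = positive)."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical counts for illustration only (not taken from the experiments).
metrics = detection_metrics(tp=90, tn=880, fp=20, fn=10)
print(metrics["accuracy"])  # 0.97
print(metrics["recall"])    # 0.9
```

Because benign applications far outnumber malicious ones in a dataset such as Drebin, accuracy alone can look high even when many malicious samples are missed, which is why precision and recall are reported alongside it.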

B. PERFORMANCE OF SIMGRU
To evaluate the performance of InputSimGRU, HiddenSimGRU, and InputHiddenSimGRU for Android malware detection, we used GRU as the baseline. Apart from the cell structure, all other settings, such as the number of layers and units, are the same across models. The SimGRU models adopt the cosine function. We also tested other similarity functions, such as Euclidean distance, but do not report those results because they were similar.
As can be seen from Table 1, the accuracy of SimGRU is about 1% higher than that of GRU. InputHiddenSimGRU has the highest accuracy, and HiddenSimGRU is better than InputSimGRU. Precision follows almost the same pattern as accuracy, whereas recall differs little among the models.
The accuracy curves, which represent the learning ability of the different GRU structures for Android malware detection during training, are shown in figure 5. When epoch $\in [0, 4)$, the training accuracy increases dramatically; when epoch $\in [4, 60)$, the accuracy gradually levels off.
One example of the detection process is shown in figure 6 to illustrate the capacity of SimGRU to detect malware as the time step increases. The prediction moves from "normal" to "malicious", which shows the ability of the models to identify malware. SimGRU is faster than the traditional GRU in detecting malware. Furthermore, InputHiddenSimGRU is the fastest, and HiddenSimGRU is faster than InputSimGRU in identifying malware.
Taking all these factors into account, we conclude that SimGRU is superior to GRU. This may be due to the introduction of the sim functions, which capture the similarity of malware better than the plain GRU. InputHiddenSimGRU, the combination of InputSimGRU and HiddenSimGRU, outperforms the other two models because applying the sim function to both the input and the hidden state helps SimGRU detect malware better than applying it to either one alone. In other words, the sim function calculated in the SimGRU cell lets SimGRU focus on the similarity of malware much more than GRU does. As a result, it can capture more static features to identify malware.

C. COMPARISON WITH OTHER METHODS
To test the feasibility of the proposed model, we compare SimGRU with three other methods [23]: Support Vector Machine, Random Forest, and KNN. For a fair comparison, we ran these experiments under the same settings as the proposed models.
The results are reported in Table 2. The SimGRU model outperforms the three methods in accuracy, precision, and recall, and the GRU model also performs better than them. SimGRU is thus competitive with both GRU and the three other approaches, which indicates its advantage in detecting Android malware.

VI. CONCLUSION
We propose an Android malware detection approach based on SimGRU. The method uses static analysis to analyze Android malware. Because similarity is widely applicable in Android malware detection, we extend the traditional GRU model to SimGRU, which is based on the similarity principle, and propose three different GRU structures: InputSimGRU, HiddenSimGRU, and InputHiddenSimGRU. The experiment shows that SimGRU performs better than GRU and three other methods.
HANXUN ZHOU was born in 1981. He received the master's and Ph.D. degrees in computer application technology from Northeastern University, in 2006 and 2009, respectively. He is currently an Associate Professor with Liaoning University. His research interests include network security, especially malcode.
XINLIN YANG was born in 1996. He graduated from the Shandong University of Technology, in 2019. He is currently pursuing the master's degree with Liaoning University, where he studies network security based on machine learning under the supervision of Dr. Hanxun Zhou.
HONG PAN was born in 1979. He received the Ph.D. degree in economics from Liaoning University, in 2016. He is currently an Associate Professor with Liaoning University. His research interests include digital economy, big data management, distributed data management, and uncertain data management.
WEI GUO was born in China, in 1983. She received the B.S., M.S., and Ph.D. degrees from the Northeast University, Shenyang, Liaoning, China, in 2006, 2008, and 2011, respectively. Since her graduation, she has worked at Shenyang Aerospace University, China, where she is currently an Associate Professor. Her research interests include artificial intelligence, machine learning, and image processing.