Multifamily Classification of Android Malware With a Fuzzy Strategy to Resist Polymorphic Familial Variants

The multifamily classification of Android malware aims to identify a malicious sample as a member of one of the given malware families. This problem is believed to be much more significant than binary classification (simply identifying a sample as malicious or benign) because it can reveal the behaviour patterns of multiple malware families and bring deep insights into the working mechanism of malicious payloads. The main challenges of multifamily classification involve two aspects: recognizing the behaviour patterns of malware families, and addressing the code obfuscation and polymorphic variants that adversaries commonly use to evade rigorous detection. To address these challenges, in this article, we utilize regular expressions of callbacks to describe the behaviour patterns of malware families, and propose a two-step fuzzy processing strategy to resist potential polymorphic familial variants. The alphabet of such regular expressions consists only of security-sensitive API calls, which enables the regular expressions to resist various kinds of code obfuscation and metamorphism. The proposed fuzzy strategy, applied to the regular expressions, comprises two steps: the first step transforms an original regular expression into a fuzzy regular expression that possesses a broader meaning than the original one; the second step further relaxes the precise plaintext match between two regular expressions to a fuzzy match by introducing the notion of similarity of regular expressions. Applying this strategy raises the abstraction level of a regular expression and makes the behaviour pattern it specifies more resilient to code obfuscation and polymorphic variants. Furthermore, selecting the fuzzy regular expressions as features, we use text mining techniques to train a multifamily 1-NN classifier over 3270 samples of 65 families. The experimental results show that our approach outperforms most of the state-of-the-art approaches and tools, confirming the effectiveness of our approach.


I. INTRODUCTION
The Android system, as one of the most popular mobile platforms, still faces serious security challenges due to its open-source nature, the imperfect design of its permission system, and the absence of full certification for application publication. Malware or malicious payloads exploit vulnerabilities within the Android system to implement a variety of attacks, such as privilege escalation, remote control, financial charges and personal information stealing, severely threatening privacy protection, financial security, and even social stability [1]. Therefore, it is urgent to develop solutions for detecting and analysing potential misbehaviour whenever an application is installed and used.
The associate editor coordinating the review of this manuscript and approving it for publication was Feng Xia.

A. CHALLENGES
With a large number of malware samples being accumulated and publicly available, data mining and machine learning techniques provide an alternative perspective to detect and analyse malicious applications [2]-[10]. In this setting, malware detection can be treated as a classification problem, which can be tackled effectively by training an optimal classifier over massive malware samples.
Classification of Android malware can be binary or multiclass (called multifamily in our setting). Binary classification simply distinguishes an unknown application as malicious or benign, while multifamily classification further categorizes a malicious sample into one of multiple candidate families.
Obviously, the task of multifamily classification is tougher than that of the binary classification, because the former has to first recognize the behaviour of an unknown sample, then search for the best match of it against a set of family patterns. In spite of the presence of these difficulties, the multifamily classification is critically demanded for the obvious benefit to malware analysis, considering it is able to reveal the behaviour patterns of malware families and offer much more helpful insights into the working mechanism of malicious payload.
In general, any approach to the multifamily classification must properly address the following challenges in order to achieve a good classification result:

1) HOW TO DESCRIBE THE BEHAVIOUR PATTERN OF AN ANDROID APPLICATION AND HOW TO MAKE SUCH A DESCRIPTION MORE RESILIENT TO CODE OBFUSCATION
To represent the behaviour patterns of malware families and samples, we shall first capture the essence of the program and present a concise but comprehensive description of the application to be analysed. Some works adopt a set of typical permissions [7], [8], (bigrams of) API calls [2], [9], [11], or system broadcasts [8], [9] to characterize the behaviours of Android applications; some works employ graphical forms, such as call graphs [12], control flow graphs [9], or other kinds of graphs [13]-[15], to represent the structure and behaviour of an application. Different descriptions lead to different computational overheads and generate classifiers with different performance.
Furthermore, we demand that the behaviour descriptions be able to resist various kinds of code obfuscation and metamorphism, such as renaming user-defined functions, nested calls and control flow obfuscation, in order to combat possible evasion of detection.

2) HOW TO EXTRACT COMMON BEHAVIOUR PATTERNS OF FAMILIES AND COPE WITH POSSIBLE POLYMORPHIC VARIANTS
A malware family is usually identified by a common behaviour pattern shared by all the samples within the family. In fact, the major work of multifamily classification is to recognize the behaviour patterns of malware families. However, the common behaviour pattern of a family may be subject to some variation. This situation may be caused either by samples that do not exactly match the common behaviour pattern, or by samples that are deliberately designed by adversaries to evade rigorous detection. Therefore, we must properly deal with possible variants so as to effectively extract the common behaviour pattern of a family.

3) HOW TO IDENTIFY THE MOST DISTINGUISHABLE DISCRIMINANTS FOR FEATURE VECTORS AND HOW TO REDUCE THE DIMENSIONALITY OF FEATURE SPACE
To mitigate the ''curse of dimensionality'' in the machine learning process, we shall select the most significant attributes from a large number of candidates, so as to reduce the dimensionality of the feature vectors and improve the performance of classification.

B. APPROACH
To address the above challenges, in this article, we propose an approach to the multifamily classification of malware by using text mining and machine learning techniques. We first overview the main blocks of our approach illustrated in Fig. 1. The implementation and the related datasets of our approach have been published in Github and Baidu Cloud Disk respectively: https://github.com/dasdasdf/Codes for code and https://pan.baidu.com/s/1Mia3mCcsL6D_BMGZ02jKKA with password ''x4re'' for datasets.
Our approach consists of two phases:

1) MODELLING PHASE
This phase aims to construct feature vectors for malware families, one feature vector for each family. The behaviour of an Android application is described by a set of behaviours of callbacks; the behaviour of a callback is specified by a regular expression that can be extracted from the reduced iCFG (Interprocedural Control Flow Graph) of the callback (Section III).
To cope with the possible variants of a family, we devise a two-step fuzzy strategy to manipulate the extracted regular expressions (Section IV). The first step maps a regular expression to a fuzzy one; the second step further relaxes an exact match to a fuzzy match of regular expressions by introducing the notions of distance and similarity for regular expressions.
Next, we calculate the predominant discriminative behaviours dFBeh(F i ) for each family, and select the most significant 3211 attributes to constitute the feature vectors for both families and unknown samples. Thereafter, we leverage the TF-IDF (Term Frequency-Inverse Document Frequency) indicators, widely used in text mining discipline, to quantify the behaviour characteristics of families, and finally formulate the feature vectors for overall 65 malware families (Section V-B).

2) CLASSIFICATION PHASE
Given an unknown (or test) application, we first extract the regular expressions of the callbacks of the application, then process them into fuzzy regular expressions, and finally formulate its feature vector. To conduct the multifamily classification, we construct a 1-NN classifier that matches the application's feature vector with all familial feature vectors to find the nearest neighbor as its candidate family (Section V-D).
VOLUME 8, 2020

C. CONTRIBUTIONS
The main contributions of this work consist of the following four aspects, which respond to the aforementioned challenges respectively:
1) We select callbacks rather than class methods as the basic units to describe an application's behaviour. As callbacks act as the entry points of an application's execution, their behaviours tend to offer much more semantics than class methods. The behaviour of a callback is further specified by a regular expression; the alphabet of such an expression only includes security-related API calls, enabling the regular expression to resist code obfuscation.
2) We propose a two-step fuzzy processing strategy to tolerate the possible variations of a family. For a regular expression $r$, in the first step, we use a substitution operation $h$ to map $r$ to a new one $r'$ (i.e., $h(r) = r'$), which covers a broader range of behaviour patterns than $r$; in the second step, the precise match is relaxed to a fuzzy match by introducing the concept of similarity between regular expressions. Through both fuzzy processing steps, the abstraction level of a regular expression is raised, and the behaviour pattern specified by the regular expression becomes more resilient to code obfuscation and polymorphic variants.
3) To achieve dimensionality reduction for feature vectors, we first calculate for each family $F_i$ the predominant discriminative behaviour $dFBeh(F_i)$ (a set of behaviours that solely appear in family $F_i$), then use the RELIEF-F [16] technique to select 3000 additional attributes for the feature vectors. In this way, the dimensionality of the feature vectors is sharply reduced from 23568 to 3211, while a robust classification scheme is achieved as well.
4) With the constructed feature vectors, we trained a multifamily 1-NN classifier over 3270 samples of 65 families. The experimental results show that the average precision of our approach is about 97.8%, which outperforms most of the state-of-the-art approaches and tools, evidencing the effectiveness of our approach.
The remainder of this article is organized as follows. Section II introduces the behaviour descriptions for Android applications and families. Section III concentrates on the extraction of regular expressions from iCFGs. Section IV introduces the two-step fuzzy strategy for the regular expressions, including the substitution rules and the notions of distance and similarity for the regular expressions. Section V focuses on the formulation of feature vectors for malware families and the construction of a 1-NN classifier for the multifamily classification. We implement our approach and report the experimental results in Section VI, and finally present related work in Section VII and the conclusion in Section VIII.

II. DESCRIBE BEHAVIOURS OF MALWARE FAMILIES

A. BEHAVIOURS OF FAMILIES
Let $C = \{F_1, \ldots, F_n\}$ be a class of $n$ malware families. A family $F_i$ ($1 \le i \le n$) includes $m_i$ sample applications in a given dataset, thus a family can be denoted
$$F_i = \{a_{i,1}, \ldots, a_{i,m_i}\}$$
where $a_{i,j}$ is one of the sample applications in family $F_i$.
The behaviour of application $a_{i,j}$ can be described by a set of Callback Behaviours (or CBs for short); each CB pairs a callback with a regular expression that specifies the behaviour of the callback. Formally,
Definition 1: (Behaviour of an application) The behaviour of an application $a_{i,j}$, denoted $ABeh(a_{i,j})$, is defined as a set of CBs
$$ABeh(a_{i,j}) = \{CB_1, \ldots, CB_p\}$$
where $CB_k \in N \times \Sigma^*$, $1 \le k \le p$; $N$ is the set of all callbacks' names; $p$ is the number of callbacks in $a_{i,j}$; $\Sigma$ is the alphabet of regular expressions, namely a set of events to be observed; in our setting, the events are the generalized security-sensitive API calls [7].
In what follows, we use the form $(n, r)$ to denote the behaviour of a callback, where $n \in N$ is the callback's name and $r \in \Sigma^*$ stands for the regular expression associated with the callback.
We choose callbacks (rather than the class methods adopted in the work [17]) as the basic units to describe an application's behaviour for the following reasons:
1) Callbacks are more coarse-grained than class methods, leading to a smaller number of descriptions and a lower dimensionality of the feature vectors. This benefit improves the efficiency of the classification.
2) Callbacks are usually triggered by various user and system events, and act as the entry points of execution of an Android application; choosing them as the unit therefore allows us to fully capture the reactive behaviour of an application.
Example 1: The behaviour of the callback Activity::onClick() can be described as the CB $(\text{Activity::onClick()},\ e_1; e_2; e_3; e_4^*)$, where ''Activity::onClick'' is the name of the callback and Activity is the component type of onClick(); $e_1; e_2; e_3; e_4^*$ is a regular expression in which the events $e_i$ are concatenated with the sequential operator '';'' and the symbol ''*'' denotes a while loop. Here, $e_1$, $e_2$, $e_3$, and $e_4$ may represent the API calls getDeviceId(), getSubscriberId(), getDefault() and sendTextMessage() respectively. Thus the regular expression reveals the following malicious scenario: once onClick() of an Activity component is invoked, the application collects the information about ''DeviceId'' and ''SubscriberId'', and then sends the collected information by calling the API sendTextMessage() zero or more times.
Definition 3: (Common behaviour of a family) The common behaviour of a family $F_i$, denoted $cFBeh(F_i)$, is the set of CBs shared by all samples of the family:
$$cFBeh(F_i) = \bigcap_{j=1}^{m_i} ABeh(a_{i,j}).$$
In short, the set $cFBeh(F_i)$ contains those CBs that can be found in all samples of $F_i$. Note that $cFBeh(F_i)$ cannot be regarded as the distinguishing behaviour pattern of family $F_i$, because a common behaviour of one family may also appear in other families.
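Definition 4: (Predominant discriminative behaviour) For a class $C = \{F_1, \ldots, F_n\}$ of $n$ malware families, the predominant discriminative behaviour of $F_i$, denoted $dFBeh(F_i)$, is a set of CBs that satisfies the following conditions: 1) every CB in $dFBeh(F_i)$ belongs to the common behaviour $cFBeh(F_i)$; 2) no CB in $dFBeh(F_i)$ appears in any other family $F_j$ ($j \ne i$).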
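The two set computations can be sketched in a few lines of Python. This is only an illustration under exact matching: the family names, callback names and regular expressions are invented, and the code assumes that a behaviour is discriminative when it is common to a family and occurs in no sample of any other family.

```python
from functools import reduce

def common_behaviours(samples):
    """cFBeh: callback behaviours present in every sample of one family."""
    return reduce(set.intersection, (set(s) for s in samples))

def discriminative_behaviours(families):
    """dFBeh per family: common behaviours absent from all other families (assumed reading)."""
    result = {}
    for name, samples in families.items():
        others = set()
        for other_name, other_samples in families.items():
            if other_name != name:
                for sample in other_samples:
                    others |= set(sample)
        result[name] = common_behaviours(samples) - others
    return result

# Hypothetical toy data: a sample is a set of (callback name, regular expression) pairs.
families = {
    "FamA": [
        {("Activity::onClick()", "e1;e2"), ("Service::onCreate()", "e3")},
        {("Activity::onClick()", "e1;e2"), ("Service::onCreate()", "e3"),
         ("Receiver::onReceive()", "e4*")},
    ],
    "FamB": [
        {("Service::onCreate()", "e3"), ("Receiver::onReceive()", "e5")},
        {("Service::onCreate()", "e3"), ("Receiver::onReceive()", "e5")},
    ],
}
d_fbeh = discriminative_behaviours(families)
print(d_fbeh)
```

In this toy data the behaviour of Service::onCreate() is common to both families, so it is excluded from both discriminative sets, which mirrors the remark that a common behaviour alone does not identify a family.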

III. EXTRACT REGULAR EXPRESSIONS FROM CFGS
Each CB has two components: a callback's name and a regular expression that specifies the behaviour of the callback. In this section, we introduce how to extract the regular expression for a given callback; this process is illustrated by Fig. 2. A graph transformational approach [9] is employed to extract the regular expressions. At first, an iCFG (Interprocedural Control Flow Graph) is constructed for a callback with the help of a static program analysis tool (we use Soot [18] in this article). An iCFG is a supergraph that integrates a collection of CFGs (Control Flow Graphs) to describe the interprocedural calls of class methods. As we are only concerned with the security-sensitive API calls, in the next step, we reduce the generated iCFG to an API graph by removing irrelevant nodes and edges. Finally, we treat the generated API graph as a finite automaton, and use the automata-to-regular-expressions algorithm [19] to convert it to a regular expression.
In what follows, we only use a running example to show this process. For the details of CFGs and the reduction transformation of iCFGs, please refer to the authors' previous work [9].
Example 2: Fig. 3 shows the process of extracting the regular expression from the iCFG of the callback onClick(). Subgraphs (a) and (b) are the CFGs of onClick() and method1() respectively; they are connected through the edges ''Call edge'' and ''Return edge'' to form the iCFG of onClick(). Therein, the shaded nodes represent the security-sensitive API calls; the other nodes are either structural nodes (such as the ''start'' and ''exit'' nodes) or nodes labelled with security-irrelevant API calls or statements. The original iCFG is reduced to the API graph that only preserves the ''start'' and ''exit'' nodes of onClick() and the security-sensitive API call nodes, as illustrated by Subgraph (c). Finally, the regular expression is extracted from the API graph by using the automata-to-regular-expressions algorithm.
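The reduction step can be illustrated with a small Python sketch. It assumes that the iCFG is given as an adjacency dictionary and that two preserved nodes are connected in the API graph whenever the iCFG contains a path between them that passes only through removed nodes; the node names are invented, and the actual reduction rules of [9] may differ.

```python
def reduce_to_api_graph(icfg, keep):
    """Return adjacency sets over `keep`, bypassing every removed node."""
    reduced = {node: set() for node in keep}
    for src in keep:
        stack = list(icfg.get(src, ()))
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            if node in keep:
                reduced[src].add(node)   # stop at the first preserved node on this path
            else:
                stack.extend(icfg.get(node, ()))
    return reduced

# Hypothetical iCFG of onClick(); "log", "call_method1", "return_method1" and "loop"
# stand for security-irrelevant or structural nodes.
icfg = {
    "start": ["log"],
    "log": ["getDeviceId"],
    "getDeviceId": ["call_method1"],
    "call_method1": ["getSubscriberId"],
    "getSubscriberId": ["return_method1"],
    "return_method1": ["loop"],
    "loop": ["sendTextMessage", "exit"],
    "sendTextMessage": ["loop"],
    "exit": [],
}
keep = {"start", "exit", "getDeviceId", "getSubscriberId", "sendTextMessage"}
api_graph = reduce_to_api_graph(icfg, keep)
print(api_graph)
```

The resulting graph keeps the self-loop on sendTextMessage, which is what later yields the starred event when the graph is read as a finite automaton.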

IV. THE TWO-STEP FUZZY PROCESSING STRATEGY FOR THE REGULAR EXPRESSIONS
A malware family usually derives numerous polymorphic variants, namely samples with similar but not exactly the same behaviour patterns. If we simply used the precise match to compare two regular expressions and calculate $cFBeh(F_i)$ and $dFBeh(F_i)$, the comparison might be too stringent to find a meaningful behaviour pattern for a family.
To tackle this problem, we propose a two-step fuzzy processing strategy to deal with the possible variants of a family. The first step performs a substitution operation on an original regular expression to produce a fuzzy regular expression that covers a broader range of behaviour patterns than the original one; the second step further relaxes the exact match to a fuzzy match over the produced fuzzy regular expressions in an effort to tolerate some variation in the common behaviour pattern of a family.

A. THE FUZZY PROCESSING OF REGULAR EXPRESSIONS
Essentially, the fuzzy processing is a substitution operation on a regular expression; that is, every event (i.e., API call) in a regular expression is replaced by a more coarse-grained event according to some given rules. This processing widens the meaning of each event and enables the regular expression to cover a broader range of behaviour patterns.
In what follows, we first present the substitution rules for an event, then define the substitution operation on a regular expression.
Definition 5: (Substitution rules) Let e be an event in a regular expression r. The substitution of e is an operation h that maps e to a new symbol (or a set of symbols) according to the rules shown in Table 1.
A rule consists of two parts: LHS is the condition part of the rule, and RHS is the rule's conclusion part. The symbol $\to$ denotes a substitution relation; therefore $e \to e'$ states ''replace event $e$ with a new event $e'$''. The rule $R_1$ means that ''If $e$ is a permission-guarded API call, then it is replaced with the guarded permissions $P$ of $e$.'' The intuition behind this rule is that the guarded permissions express a more abstract and essential intention than the corresponding API call, and can cover multiple similar API calls with the same intention. For example, getSubscriberId() is guarded by the permission READ_PHONE_STATE; this permission expresses the intention that the program performs a ''read'' operation (which may be getSubscriberId() or getDeviceId()) on the resource ''phone''. Replacing this API with the permission READ_PHONE_STATE captures the essence of the API call, and enables the guarded permission to cover a number of similar API calls such as getSubscriberId(), getDeviceId(), or other APIs with the same intention.
The rule $R_2$ means that ''If $e$ is a source (or sink) API, then we use the name of the source (or sink) to replace $e$.'' A source (or sink) API is an API identified as such in the SuSi project [20]. If such an API appears in a regular expression, then it is replaced with the source-resource or sink-resource name of the API. For example, getMacAddress() is a source API, so it is replaced with the symbol ''SRC_Mac'', where ''Mac'' is the source-resource name of this API; insert() is a sink API, so it is replaced by the symbol ''SINK_CP'', where ''CP'' means ''ContentProvider'', the sink-resource name of the API.
The rule $R_3$ means that ''If $e$ is a dynamic loading API, then it is replaced with the symbol DYNLOAD.'' A dynamic-loading API is simply replaced with the constant symbol ''DYNLOAD'', meaning that the API performs dynamic loading, regardless of what it actually loads. $R_4$ is handled similarly: an API other than those covered by $R_1$ to $R_3$ is replaced with the constant symbol ''SENAPI''.
If there exists a conflict in the application of the rules R 1 ∼ R 4 , then the conflict is resolved by the priorities of the rules, which are assigned via R 5 .
The substitution operation for some typical events is illustrated in Table 2.
The domain of the substitution operation $h$ can be generalized from events to regular expressions in a straightforward fashion: given a regular expression $r$, $h(r)$ is obtained by replacing every event in $r$ according to the substitution rules while keeping the operators of $r$ unchanged. Note that, through the substitution operation $h$, a callback behaviour $CB = (n, r)$ is transformed to a fuzzy callback behaviour $CB' = (n, h(r))$.
For convenience, in what follows, we still use CB to denote the fuzzy version of the callback behaviour; unless stated otherwise, all the regular expressions mentioned hereafter refer to the fuzzy ones that have been processed through the substitution operation.
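The substitution operation can be sketched in Python as follows. The rule tables are toy stand-ins for Table 1: only the entries mentioned in the text are taken from this article, ''loadClass'' is a hypothetical example of a dynamic loading API, and the priority order (R1, then R2, R3 and R4) is an assumption, since the priorities are assigned by R5 in Table 1.

```python
import re

# Toy rule tables (illustrative only).
PERMISSION_OF = {                     # R1: permission-guarded API calls
    "getSubscriberId": "READ_PHONE_STATE",
    "getDeviceId": "READ_PHONE_STATE",
    "sendTextMessage": "SEND_SMS",
}
SOURCE_OF = {"getMacAddress": "SRC_Mac"}   # R2: source APIs
SINK_OF = {"insert": "SINK_CP"}            # R2: sink APIs
DYNAMIC_LOADING = {"loadClass"}            # R3: hypothetical dynamic loading API

def h_event(api):
    """Map one API call to its fuzzy symbol (assumed priority: R1, R2, R3, R4)."""
    if api in PERMISSION_OF:
        return PERMISSION_OF[api]
    if api in SOURCE_OF:
        return SOURCE_OF[api]
    if api in SINK_OF:
        return SINK_OF[api]
    if api in DYNAMIC_LOADING:
        return "DYNLOAD"
    return "SENAPI"                        # R4: any other security-sensitive API

def h(regex):
    """Replace every event name in the expression; operators are left untouched."""
    return re.sub(r"[A-Za-z_]\w*", lambda m: h_event(m.group(0)), regex)

print(h("getDeviceId;getSubscriberId;getDefault;sendTextMessage*"))
```

Because getDeviceId() and getSubscriberId() map to the same symbol, two variants that differ only in which identifier they read produce the same fuzzy regular expression.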

B. THE EDIT DISTANCE OF REGULAR EXPRESSIONS
The second step of the fuzzy processing strategy is concerned with the fuzzy match between the regular expressions. The fuzzy match is made by comparing the similarity between two regular expressions. The similarity is measured by the distance of the regular expressions.
The edit-distance of two regular expressions can be defined as follows.
Definition 6: (Edit-distance of regular expressions) [21] The edit-distance of two regular expressions $r_1$ and $r_2$, denoted $d(r_1, r_2)$, is defined as:
$$d(r_1, r_2) = \min\{\, d(w_1, w_2) \mid w_1 \in L(r_1),\ w_2 \in L(r_2) \,\}$$
where $L(r)$ is the language (i.e., the set of strings) of the regular expression $r$; $w_1$, $w_2$ are strings of events, and $d(w_1, w_2)$ is the edit-distance of $w_1$ and $w_2$.
The distance d(r 1 , r 2 ) can be computed by the aid of two algorithms: composition of weighted automata [22] and single-source shortest-paths of graphs [21]. The skeleton of the algorithm is presented in Algorithm 1.
At first, the regular expressions $r_1$ and $r_2$ are converted to the corresponding automata $A_1$ and $A_2$ using the traditional regular-expression-to-automaton algorithm [19] (lines 1-2); then the composition of $A_1$ and $A_2$ is computed to obtain a new automaton $A$ (line 3); finally, the distance $d$ is computed using a classical shortest-distance algorithm (line 4), such as the Bellman-Ford dynamic programming algorithm.
Let us estimate the time complexity of the algorithm. Conversion of a regular expression to an ordinary NFA (without ε-transitions) takes $O(n^3)$ time on a regular expression of size $n$. The cost of the remaining steps is expressed in terms of $|Q_m| = \max(|Q_1|, |Q_2|)$ and $|E_m| = \max(|E_1|, |E_2|)$.
Algorithm 1 Compute the Distance of the Regular Expressions
Input: Regular expressions r 1 and r 2
Output: d(r 1 , r 2 ): distance of r 1 and r 2
1 A 1 = regular-expression-to-automaton(r 1 );
2 A 2 = regular-expression-to-automaton(r 2 );
3 A = composition(A 1 , A 2 );
4 d(r 1 , r 2 ) = shortest-distance(A);
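The idea of combining two automata and searching for a shortest path can be sketched as follows. This is not the weighted-automata composition of [22]: as a simplification, the sketch explores pairs of states on the fly with Dijkstra's algorithm, assumes unit costs for insertion, deletion and substitution, and takes hand-written automata without ε-transitions as input instead of converting regular expressions.

```python
import heapq

def regex_edit_distance(a1, a2):
    """Smallest edit distance between a string accepted by a1 and one accepted by a2.

    Each automaton is (start, finals, transitions), where transitions maps a
    state to a list of (symbol, next_state) pairs.
    """
    (s1, f1, t1), (s2, f2, t2) = a1, a2
    best = {(s1, s2): 0}
    heap = [(0, s1, s2)]
    while heap:
        d, p, q = heapq.heappop(heap)
        if d > best[(p, q)]:
            continue                          # stale queue entry
        if p in f1 and q in f2:
            return d                          # first accepting pair popped is optimal
        moves = []
        for x, p2 in t1.get(p, []):
            moves.append((1, p2, q))          # delete x
            for y, q2 in t2.get(q, []):
                moves.append((0 if x == y else 1, p2, q2))   # match or substitute
        for y, q2 in t2.get(q, []):
            moves.append((1, p, q2))          # insert y
        for cost, p2, q2 in moves:
            nd = d + cost
            if nd < best.get((p2, q2), float("inf")):
                best[(p2, q2)] = nd
                heapq.heappush(heap, (nd, p2, q2))
    return float("inf")                       # one of the languages is empty

# Hand-built automata for a;b;c* and a;c (states are integers).
A1 = (0, {2}, {0: [("a", 1)], 1: [("b", 2)], 2: [("c", 2)]})
A2 = (0, {2}, {0: [("a", 1)], 1: [("c", 2)]})
print(regex_edit_distance(A1, A2))
```

For these two expressions the closest pair of strings is ''ab'' and ''ac'' (one substitution), so the distance is 1; comparing an automaton with itself gives 0.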

C. THE SIMILARITY OF REGULAR EXPRESSIONS
The similarity of regular expressions can be measured in terms of the distance defined in the previous section.
Definition 7: (Similarity of the regular expressions) The similarity of two regular expressions $r_1$ and $r_2$ is defined as
$$sim(r_1, r_2) = 1 - \frac{d(r_1, r_2)}{d(r_1, \emptyset) + d(\emptyset, r_2)}$$
where $\emptyset$ is the empty regular expression, so that $d(r_1, \emptyset) + d(\emptyset, r_2)$ equals the maximum edit cost to transform $r_1$ into $r_2$.
With the notion of similarity, we can relax the exact plaintext match between regular expressions to a fuzzy match; that is, two regular expressions $r_1$ and $r_2$ are considered to be the same if their similarity reaches an acceptable threshold, i.e., $\theta \le sim(r_1, r_2)$. The notion ''the same'' can be generalized to callback behaviours in a straightforward way.
Definition 8: (The same callback behaviours) Two callback behaviours $(n_1, r_1)$ and $(n_2, r_2)$ are considered to be the same if and only if $n_1 = n_2$ and $\theta \le sim(r_1, r_2)$, where $\theta$ is a given threshold ranging from 0 to 1.
In this article, we assume $\theta = 90\%$. If two callbacks have identical names, and the similarity of their regular expressions is at least 90%, then both are regarded as the same.
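The fuzzy match of Definition 8 can be sketched as follows. The sketch assumes that the similarity is normalized as one minus the distance divided by the maximum edit cost $d(r_1, \emptyset) + d(\emptyset, r_2)$; the distances are passed in as plain numbers, and the example values are invented.

```python
def similarity(d12, d1_empty, d2_empty):
    """Assumed normalization: 1 - d(r1, r2) / (d(r1, empty) + d(empty, r2))."""
    max_cost = d1_empty + d2_empty
    if max_cost == 0:
        return 1.0                     # both expressions are empty
    return 1.0 - d12 / max_cost

def same_behaviour(cb1, cb2, sim, theta=0.9):
    """Definition 8: identical callback names and similarity of at least theta."""
    (n1, r1), (n2, r2) = cb1, cb2
    return n1 == n2 and sim(r1, r2) >= theta

# Hypothetical distances for two fuzzy regular expressions that differ in one event.
toy_sim = lambda r1, r2: similarity(d12=1, d1_empty=10, d2_empty=10)
cb_a = ("Activity::onClick()", "r_a")
cb_b = ("Activity::onClick()", "r_b")
print(same_behaviour(cb_a, cb_b, toy_sim))
```

With these numbers the similarity is 0.95, so the two callback behaviours count as the same under the 90% threshold, whereas an exact plaintext match would treat them as different.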

D. CALCULATE THE COMMON BEHAVIOURS OF FAMILIES BASED ON THE SIMILARITY
We use the aforementioned notion of similarity to improve the calculations of common behaviours cFBeh(F i ) and predominant discriminative behaviours dFBeh(F i ) of malware families F i (see Section II-A).
To investigate the necessity and effectiveness of the similarity-based approach, we first conduct an experiment to compare this approach with the exact plaintext match approach. The experimental results are illustrated in Fig. 4.
The experiment counts the number of callback behaviours in the sets $cFBeh(F_i)$ and $dFBeh(F_i)$. Each of these sets has two versions: one is calculated by using the similarity-based approach ($\theta = 90\%$), the other by the plaintext match approach. To distinguish the two versions, we use $cFBeh(F_i, \theta)$ and $dFBeh(F_i, \theta)$ to denote the similarity-based versions, and $cFBeh(F_i)$ and $dFBeh(F_i)$ for the exact match ones. Note that $dFBeh(F_i, \theta)$ is derived only from $cFBeh(F_i, \theta)$.
The results show that the similarity-based approach can extract more callback behaviours than the plaintext match one. In particular, in some families, such as Boxer, Fake-Installer, and Zitmo, the similarity-based approach can extract common behaviours that cannot be discovered by the plaintext match approach.
Moreover, the results show that among all malware families, only 8 families have empty dFBeh(F i , θ), and the remaining 57 families have non-empty dFBeh(F i , θ). From this observation, it follows that, in most cases, the discriminative common behaviours dFBeh(F i , θ) can serve as a better candidate feature to classify an unknown sample.

V. MULTIFAMILY CLASSIFICATION BASED ON TEXT MINING
We treat the multifamily classification as a problem of text classification: the callback behaviours of an application can be viewed as ''words''; the application can be viewed as a ''text'', namely a collection of ''words''; and a malware family can be regarded as a ''class'' of ''texts''. Therefore, classifying an unknown application into a malware family amounts to classifying a ''text'' into one of the text ''classes''. By using traditional text mining techniques, we can construct a text classifier to resolve the problem of familial classification. The main challenges of this solution lie in the feature selection and the classifier construction, which are addressed in this section.

A. FEATURE SELECTION WITH THE RELIEF-F APPROACH
In principle, the sets $dFBeh(F_i, \theta)$ could be directly selected as the features to distinguish malware families for an unknown sample, but these features are somewhat fragile in nature.
Consider the case where a new malicious sample is classified into a certain family, but shares no common callback behaviours with the other samples in the family. This may be because the sample was either misclassified or deliberately designed to evade rigorous detection. In this case, both $cFBeh(F_i, \theta)$ and $dFBeh(F_i, \theta)$ turn out to be empty; that is, no discriminants can be found to identify the family.
To tolerate such fragility and select a more robust feature set for the classification, we construct the feature vectors in this way:
• First, include all callback behaviours in $dFBeh(F_i, \theta)$ of the 57 families (recall that only 57 of the 65 families have a non-empty dFBeh) as attributes of the feature vectors.
• Then, search for $k$ additional callback behaviours as extra attributes of the feature vectors.
Therefore, a feature vector consists of the attributes taken from the sets $dFBeh(F_i, \theta)$ followed by the $k$ additional attributes $f_i$ ($1 \le i \le k$).
To this end, we leverage the RELIEF-F approach [16] to select k candidate callback behaviours as the additional attributes f i (1 ≤ i ≤ k).
Following the RELIEF-F approach, we select the most contributive extra attributes of the feature vectors.

B. CONSTRUCT FEATURE VECTORS FOR MALWARE FAMILIES
For each malware family, we construct a familial feature vector to model its predominant characteristics. A familial feature vector has 3211 attributes in total: 211 attributes from the predominant discriminative behaviours together with the 3000 most contributive extra attributes. Given a training set, we calculate the weight of each attribute in a familial feature vector. We borrow the TF-IDF (Term Frequency-Inverse Document Frequency) indicator [23], widely used in the text mining discipline, as the weight of an attribute. For the feature vector $FV_i$ of the family $F_i$, the weights $t_{i,j}$ ($1 \le j \le 3211$) are calculated by
$$t_{i,j} = TF(CB_j, F_i) \times IDF(CB_j, C)$$
where $TF(\cdot)$ is the term frequency of the callback behaviour $CB_j$ over the family $F_i$, and $IDF(\cdot)$ is the inverse document frequency of $CB_j$ over the family class $C$. Both terms are calculated as follows.
Definition 9: (Term frequency) $TF(CB_j, F_i)$ represents the frequency of a callback behaviour $CB_j$ over the family $F_i$; it is calculated from $freq(CB_j, a)$, the number of times $CB_j$ appears in an application $a$ of $F_i$.
Definition 10: (Inverse document frequency) $IDF(CB_j, C)$ represents the inverse document frequency of a callback behaviour $CB_j$ over the family class $C$. When a certain callback behaviour does not appear in any family, the denominator of the formula would be zero; therefore we add 1 to the denominator to cope with this situation.
Example 3: We give an intuitive example to illustrate the calculation of TF-IDF for a callback behaviour. Suppose a family class $C$ includes 100 malware families, i.e., $C = \{F_1, \ldots, F_{100}\}$. The distribution of the callback behaviour $CB_j$ over the families is shown in Table 3. Table 3 shows that $CB_j$ has the highest weight in family $F_1$, because all applications in $F_1$ contain $CB_j$; for those families that do not contain $CB_j$, the corresponding weights are all 0.
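The weighting scheme can be sketched in Python as follows. The exact forms of TF and IDF used here are assumptions: TF is taken as the average number of occurrences of a callback behaviour per application of the family, and IDF as the natural logarithm of the number of families divided by one plus the number of families containing the behaviour. The family data are invented.

```python
import math

def term_frequency(cb, family):
    """Assumed TF: average of freq(cb, a) over the applications a of the family."""
    return sum(app.get(cb, 0) for app in family) / len(family)

def inverse_document_frequency(cb, families):
    """Assumed IDF: log(|C| / (1 + number of families that contain cb))."""
    containing = sum(
        1 for fam in families.values() if any(app.get(cb, 0) > 0 for app in fam)
    )
    return math.log(len(families) / (1 + containing))

def weight(cb, name, families):
    """t_{i,j} = TF(CB_j, F_i) * IDF(CB_j, C)."""
    return term_frequency(cb, families[name]) * inverse_document_frequency(cb, families)

# Hypothetical class of four families; an application maps each CB to its count.
families = {
    "F1": [{"cb": 1}, {"cb": 1}],      # every application contains cb
    "F2": [{"cb": 1}, {}],             # half of the applications contain cb
    "F3": [{}, {}],
    "F4": [{}, {}],
}
for name in families:
    print(name, weight("cb", name, families))
```

As in Example 3, the family whose applications all contain the behaviour receives the highest weight, and families that never contain it receive a weight of 0.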

C. THE ALGORITHM TO CALCULATE FAMILIAL FEATURE VECTORS
From the overall 3270 malware samples of 65 families, we randomly select a total of 1300 samples (20 samples for each family) to constitute the training set, and use the remaining samples as the test set.
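Such a per-family split can be sketched as follows. The function name, the fixed seed and the toy family sizes are illustrative only; how families with fewer than 20 samples are handled is not covered by this sketch, which simply raises an error in that case.

```python
import random

def split_per_family(families, per_family=20, seed=0):
    """Randomly pick `per_family` training samples from each family; the rest are test samples."""
    rng = random.Random(seed)
    train, test = {}, {}
    for name, samples in families.items():
        # random.Random.sample raises ValueError if the family is too small
        chosen = set(rng.sample(range(len(samples)), per_family))
        train[name] = [s for i, s in enumerate(samples) if i in chosen]
        test[name] = [s for i, s in enumerate(samples) if i not in chosen]
    return train, test

# Hypothetical families identified only by sample names.
families = {
    "FamA": [f"a{i}.apk" for i in range(25)],
    "FamB": [f"b{i}.apk" for i in range(30)],
}
train, test = split_per_family(families)
print({name: (len(train[name]), len(test[name])) for name in families})
```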
The algorithm for constructing the familial feature vectors is presented as Algorithm 2, which mainly consists of three steps:
1) The first step (lines 1-7) calculates the sets of callback behaviours for each sample and each family;
2) The second step (lines 8-22) calculates the TF indicator for each callback behaviour $CB_k$ that has been selected as one of the attributes in the feature vector $FV_i$. The variable appear records the number of applications in which $CB_k$ appears. Remember that, to determine whether $CB_k$ belongs to an application $a_{i,j}$ (line 14), we first compute the similarity between $CB_k$ and every callback behaviour of $a_{i,j}$ (Section IV-B), and then check whether the similarity reaches the threshold 90%. If so, $CB_k$ is considered to belong to $a_{i,j}$, and appear is increased.
3) The final step (lines 23-25) calculates the IDF indicator for each $CB_k$, where inFamily(·, ·) is a two-dimensional array that records whether a callback behaviour appears in a family.

D. CONSTRUCT A 1-NN CLASSIFIER
We construct a 1-NN classifier to assign an unknown sample to one of the predefined families. To this end, we first process the unknown sample into a feature vector, then compute the distances between this sample and every malware family, and assign the nearest one as its family. Processing an unknown sample, say $a$, is similar to processing malware families, and involves the following steps:
1) Extract the regular expressions of the callbacks of $a$, and apply the fuzzy processing rules to these expressions to obtain $ABeh(a)$;
2) For each callback behaviour $CB_k \in ABeh(a)$, if $CB_k$ is one of the selected attributes, compute its weight using the TF-IDF indicators;
3) Construct the feature vector $V(a) = [w_1, w_2, \ldots, w_{3211}]$ for application $a$.
To measure the similarity or distance between an unknown sample (i.e., the vector $V(a)$) and a family (i.e., the vector $FV_i$), we adopt the traditional cosine similarity [23] between two vectors.
The cosine similarity between the vectors d = (d 1 , d 2 , . . . , d k ) and e = (e 1 , e 2 , . . . , e k ) is defined as Then we compare these cosine similarities sim(V (a), FV i ) (1 ≤ i ≤ 65) one by one and determine a to a family F I with the largest similarity, that is,
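The nearest-family decision described above can be illustrated with a short Python sketch. The function names and the toy vectors are hypothetical, and returning 0 for an all-zero vector is a convention assumed here rather than something stated in the article.

```python
import math


def cosine_similarity(d, e):
    # sim(d, e) = (d . e) / (|d| * |e|); taken to be 0 when either vector is all zeros.
    dot = sum(x * y for x, y in zip(d, e))
    norm_d = math.sqrt(sum(x * x for x in d))
    norm_e = math.sqrt(sum(y * y for y in e))
    if norm_d == 0 or norm_e == 0:
        return 0.0
    return dot / (norm_d * norm_e)


def classify(sample_vector, family_vectors):
    # Return the name of the family whose feature vector is most similar to the sample.
    return max(
        family_vectors,
        key=lambda name: cosine_similarity(sample_vector, family_vectors[name]),
    )


# Toy three-attribute vectors (the real vectors have 3211 attributes).
family_vectors = {
    "FamilyA": [0.9, 0.1, 0.0],
    "FamilyB": [0.0, 0.2, 0.8],
}
predicted = classify([0.7, 0.3, 0.0], family_vectors)
```

Because each family is represented by a single vector FV i , the 1-NN rule reduces to one similarity computation per family.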

VI. PROTOTYPE AND EXPERIMENTS
A. IMPLEMENTATION
The prototype of our approach is based on Soot [18] and the Sklearn library [24]. Soot is a Java bytecode optimization framework originally developed by the Sable Research Group of McGill University. It provides a variety of analyses for Java programs, such as IFDS/IDE dataflow analysis, call graph construction and points-to analysis. In this article, we only use Soot to generate (interprocedural) control flow graphs for every implemented class method or callback. Sklearn is a Python-based third-party machine learning library, which helps us train the 1-NN classifier. The overview and implementation of our approach are shown in Fig. 1.

B. TRAINING DATASETS
Our work is largely based on a substantial dataset of real-world Android malware samples. We collected the dataset from the Drebin project [8], VirusShare [25], FalDroid [12] and the Android Malware Genome Project [1], which cover the majority of malware families and include a variety of infection techniques and payload functionalities. Table 4 lists the datasets adopted in our paper.
More than 90% of the malware samples are smaller than 5 MB, and approximately 3% of the samples are larger than 10 MB. For the purpose of this article, we discarded a number of encrypted samples and the malware families that contain only one sample, resulting in a final dataset of 3270 malware samples grouped into 65 families, from which a total of 23568 callback behaviours have been extracted. The dataset, implementation, and intermediate experimental results have been uploaded to GitHub and Baidu Cloud Disk (see Subsection I-B).

C. EXPERIMENTAL EVALUATION
Our evaluation intends to investigate the following research questions:
RQ1 Why do we select CBs (Callback Behaviours) as the predominant feature for the multifamily classification task? In other words, why do CBs offer a higher discrimination power for the classification?
RQ2 How many CBs does each malware family contain?
RQ3 How well does the 1-NN classifier perform when applied to the test dataset?
RQ4 How does our approach compare against other state-of-the-art approaches or tools in terms of precision?
In the sequel we address each research question in detail.

RQ1: High discrimination power of CBs
We first investigate the distribution of all 23568 CBs over the 65 malware families, as shown in Fig. 6. The result shows that 81.49% of the CBs appear in only a single malware family, indicating that if a CB appears in one family, it is unlikely to appear in the other families as well. The proportion of the CBs that appear in two families drops to 5.94%, and fewer than 5% of the CBs are spread over more than 9 families. From this distribution, it follows that callback behaviours are a strong candidate for distinguishing most of the malware families from each other.
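The kind of distribution reported in Fig. 6 can be computed with a few lines of Python. The sketch below is an assumption-laden illustration: it identifies a CB by exact string equality, whereas the article may count similar CBs together, and the toy data is hypothetical.

```python
from collections import Counter


def family_count_distribution(family_behaviours):
    """family_behaviours: {family name: set of callback behaviours}.
    Returns {n: share of CBs that appear in exactly n families}."""
    families_per_cb = Counter()
    for behaviours in family_behaviours.values():
        # Each family contributes at most one count per CB because it is a set.
        families_per_cb.update(behaviours)
    histogram = Counter(families_per_cb.values())
    total = len(families_per_cb)
    return {n: histogram[n] / total for n in sorted(histogram)}


# Toy example: five distinct CBs, one of which is shared by two families.
toy = {
    "F1": {"cb1", "cb2", "cb3"},
    "F2": {"cb3", "cb4"},
    "F3": {"cb5"},
}
distribution = family_count_distribution(toy)
```

A distribution that is heavily concentrated at n = 1, as the article reports, indicates that most CBs are specific to one family.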

RQ2: Number of CBs in each family
We next investigate how many CBs are in each malware family. This exploration will provide an intuitive insight into why we choose callback behaviours as a feature to train the classifier. The distribution of the number of CBs over the families is illustrated in Fig. 7.
It shows that the number of CBs (i.e., |FBeh(F i )|) varies widely across families, so the average number of CBs per family is not a meaningful statistic. This variation might be caused by how widespread and popular a family is. For example, families such as DroidKungFu and Plankton commonly appear in many repackaged malware samples, so the total number of CBs in these families increases sharply.
The distributions of the common CBs (i.e., |cFBeh(F i )|) and the predominant discriminative CBs (i.e., |dFBeh(F i )|) of the families were shown previously in Fig. 4 (see Section IV-D). The result also shows that the similarity-based fuzzy match extracts more common callback behaviours than the plaintext match. For example, 24 of the 65 families have no common callback behaviours under the plaintext match; in contrast, only 1 of the 65 families has none under the similarity-based approach.
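The difference between plaintext match and fuzzy match can be illustrated with a small Python sketch. It rests on several assumptions that are not taken from the article: a behaviour is treated as common when every sample of the family contains a matching behaviour (the precise definition of cFBeh is given in Section IV-D), the similarity measure is a stand-in based on `difflib.SequenceMatcher`, and the two behaviour strings are hypothetical.

```python
from difflib import SequenceMatcher


def similar(cb1, cb2, threshold=0.9):
    # Stand-in for the similarity of regular expressions (assumption).
    return SequenceMatcher(None, cb1, cb2).ratio() > threshold


def common_behaviours(samples, match):
    # Keep a behaviour of the first sample if every other sample has a matching one.
    first, rest = samples[0], samples[1:]
    return {
        cb for cb in first
        if all(any(match(cb, other) for other in sample) for sample in rest)
    }


# Two variants of one family whose behaviours differ only in the repetition operator.
family = [
    {"sendTextMessage(getDeviceId)*"},
    {"sendTextMessage(getDeviceId)+"},
]
exact = common_behaviours(family, lambda a, b: a == b)  # plaintext match
fuzzy = common_behaviours(family, similar)              # similarity-based match
```

In this toy case the plaintext match finds no common behaviour, while the fuzzy match keeps one, which mirrors the effect described above.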

RQ3: Performance of the 1-NN classifier
We select 25% of the samples in the dataset as the test set and apply the 1-NN classifier to them for family classification. The experimental results for each family are listed in Table 5. They show that the average error rate of the 1-NN classifier is only 2.23%, which is a good result in terms of classification precision.
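A hold-out evaluation of this kind can be sketched as follows. The article does not state how the 25% test set is drawn, so the per-family random split, the fixed seed, and all names below are assumptions made for illustration.

```python
import random


def split_dataset(samples, test_ratio=0.25, seed=0):
    # samples: list of (sample id, family) pairs.
    # Assumption: the test set is drawn at random within each family.
    rng = random.Random(seed)
    by_family = {}
    for sample in samples:
        by_family.setdefault(sample[1], []).append(sample)
    train, test = [], []
    for members in by_family.values():
        members = members[:]  # copy so the caller's list is not shuffled
        rng.shuffle(members)
        n_test = max(1, round(len(members) * test_ratio))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def error_rate(true_families, predicted_families):
    # Fraction of test samples assigned to the wrong family.
    wrong = sum(1 for t, p in zip(true_families, predicted_families) if t != p)
    return wrong / len(true_families)


# Toy data: 16 hypothetical samples in two families of 8.
samples = [(f"app{i}", "F1" if i < 8 else "F2") for i in range(16)]
train, test = split_dataset(samples)
rate = error_rate(["F1", "F1", "F2", "F2"], ["F1", "F2", "F2", "F2"])
```

Splitting within each family keeps small families represented in both sets, which matters when some families contain only a handful of samples.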

RQ4: Comparison with state-of-the-art approaches
To confirm the effectiveness of our approach, we compare it mainly with two recent representative works, DENDROID [17] and FalDroid [12], since both show state-of-the-art performance and report comprehensive experimental results that can readily be used for comparison. DENDROID covers 33 malware families with an average precision of about 94%; FalDroid covers 36 families with an average precision of 94.2%. Our work classifies all malware samples into 65 families with an average precision of 97.8%, a better average classification result than both. Furthermore, we compare the detection precision for each malware family. Since DENDROID, FalDroid and our work cover different malware families, we select the families common to the three works and apply our dataset to them to compare the classification results for each individual family. The comparison results are shown in Fig. 8, which indicates that our approach is generally more accurate than DENDROID and FalDroid. The reason for this advantage is that our approach takes context information into account when selecting features: the callbacks that are predefined in various components are treated as the basic behaviour units. Callbacks usually serve as the entry points of program execution, so selecting them as features facilitates identifying the behaviour patterns of families. Additionally, callbacks are more coarse-grained than class methods, which further brings the benefit of dimensionality reduction for the classification.
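For reference, per-family precision and its average over families can be computed as in the following sketch. The article does not specify how the average is formed, so the unweighted mean over families is an assumption, and the labels are toy values.

```python
from collections import Counter


def per_family_precision(true_families, predicted_families):
    # precision(F) = correct predictions of F / all predictions of F
    predicted_total = Counter(predicted_families)
    predicted_correct = Counter(
        p for t, p in zip(true_families, predicted_families) if t == p
    )
    return {fam: predicted_correct[fam] / predicted_total[fam] for fam in predicted_total}


def mean_family_precision(precisions):
    # Unweighted mean over families (assumed averaging scheme).
    return sum(precisions.values()) / len(precisions)


truth = ["F1", "F1", "F2", "F2", "F3"]
guess = ["F1", "F2", "F2", "F2", "F3"]
precisions = per_family_precision(truth, guess)
mean_precision = mean_family_precision(precisions)
```

Families that are never predicted do not appear in the result, so a comparison across tools should be restricted to the families common to all of them, as done above.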

VII. RELATED WORK
The methods and tools for automatic Android malware detection have been intensively studied in recent years. There are two basic kinds of detection techniques, static and dynamic approaches [4], [27]-[30], according to whether execution of a specimen is required in the process of detection. The static approaches can be further divided into whitebox and blackbox ones. According to this taxonomy, our work falls into the category of static blackbox approaches. In what follows, we focus on the static blackbox approaches and give an overview of their basic ideas, virtues and limitations.
A. WHITEBOX VS. BLACKBOX STATIC APPROACHES
Static analysis approaches analyse the text of an application directly, without executing it. According to whether the states of a program are taken into consideration, the static approaches are divided into two categories: whitebox [31], [32], [34] and blackbox.
The whitebox approaches investigate the structures and semantics of programs, and detect misbehaviours by using traditional program analysis techniques [33] (such as dataflow analysis) or by discovering certain semantic inconsistencies in programs. The most typical whitebox work is FlowDroid [31], a static taint analysis approach with tool support, which concentrates on discovering potential information leakages from private sources to public sinks. The major benefit of the whitebox approaches is that they are able to precisely reveal the working process of misbehaviours; on the other hand, they suffer from high complexity and poor scalability, which usually stem from the nature of static program analysis techniques.
In contrast, the blackbox methods do not deal with the details of the internal structure of a program, but focus on its external features and employ statistical methods to detect misbehaviours. With a large number of malware samples available and the rapid development of statistical machine learning techniques, the blackbox methods have proved promising for the detection of Android malware.

B. MACHINE LEARNING-BASED MULTIFAMILY CLASSIFICATION APPROACHES
From the viewpoint of machine learning techniques, malware detection can be considered as a supervised binary or multifamily classification problem. The main challenges of these solutions consist in how to select and generate representative features, and how to resist code obfuscation and familial variants. Permissions, APIs, control flow graphs and their associated contexts are frequently selected as features [2], [3], [8], [9], [12], [13], [17], [34].
Our work is mostly inspired by DENDROID [17] and FalDroid [12]. DENDROID selects the code structures (the sequences of program statements) of class methods as features, and adopts classical text mining and the standard Vector Space Model to classify malware samples into families. FalDroid [12] proposes a graph-based approach to Android malware familial classification. It constructs frequent subgraphs to represent the common behaviours of malware samples, and performs graph matching with a weighted-sensitive-API-call-based approach to resist polymorphic variants.
Our work improves on both studies in three respects. First, we select the regular expressions of callbacks as features. Since callbacks act as the entry points of an Android application, their behaviours tend to offer more semantic information than those of general class methods. Second, the proposed graph transformation of iCFGs (Section III) is able to effectively tackle certain forms of code obfuscation, such as renaming user-defined functions, nested calls and control flow obfuscation, by filtering out the security-irrelevant nodes and edges from the iCFGs. Finally, we propose the two-step fuzzy processing strategy to resist polymorphic variants. The highlight of this strategy is the substitution operation, which widens the meaning of sensitive API calls and achieves a better resilience to familial variants than DENDROID and FalDroid.
Our work is also closely related to DroidSIFT [14] and DESCRIBEME [15], both of which utilize WC-ADGs (Weighted Contextual API Dependency Graphs) as program semantics to construct feature sets. A WC-ADG contains various key semantic-level behaviour aspects of an Android malware sample, such as API dependency, context and data dependency. The main complexity of these works derives from the construction of WC-ADGs, which involves diverse static analysis techniques, including (forward and backward) dataflow analysis, constant analysis and the generation of program dependency graphs. Compared with both works, our approach only involves CFGs and API calls; the construction of CFGs and the extraction of behaviour aspects are straightforward and efficient, leading to an easier implementation.
To better fight against familial variants, a novel method called Artificial Malware-based Detection is proposed [35], which generates new malware patterns using the genetic algorithm from the patterns that have been found from existing malware, for the sake of accommodating the emergence of unseen variants.
The study [10] proposes an approach to selecting the fingerprints (i.e., the signatures) of malware families. It uses the Fisher discriminative criterion to rank the features, and devises a frequency-based feature elimination algorithm that parallels the TF-IDF algorithm in our work. This study selects suspicious API calls, permissions and hardware components as features for classification; our approach, however, takes API calls and their calling sequences as features, which hold richer semantic information than individual API calls.
EC2 [36] is a hybrid approach that considers both static and dynamic features. The static features include authors, structures and permissions, and dynamic ones include n-grams of read and write operations, system operations, network operations and SMS operations. It combines the results of supervised classification methods and unsupervised clustering methods together to classify families. The major limitation of this work is that it fails to fully address the issue of code obfuscation and familial variants, thus degrading the accuracy of classification.

VIII. CONCLUSION
This article proposes a static multifamily detection approach for Android malware based on text mining and machine learning techniques. The most significant aspects of our approach are twofold: 1) the graph reduction transformation of iCFGs (Section III), and 2) the two-step fuzzy processing strategy for the regular expressions (Section IV). The reduction transformation of iCFGs retains only the security-sensitive API nodes and their calling sequences. In this way we can tackle various forms of obfuscation, such as renaming user-defined functions, nested calls and control flow structure obfuscation. The two-step fuzzy processing strategy first performs a substitution operation on the regular expressions to raise the level of abstraction and cover a broader range of behaviour patterns, and then relaxes the plaintext match into a fuzzy match to resist potential variations of a family behaviour pattern.
At present, we have only discovered the common behaviour patterns (i.e., the regular expressions of API calls) of malware families; it is not yet clear how these patterns are triggered and how they behave to accomplish a malicious attack. This motivates us to further investigate the intentions of behaviour patterns and to explore the intricate working mechanisms of malware families in the near future. In addition, ICCs (Inter-Component Communications), especially implicit ICCs, have not been considered in this article; this limitation will drive us to improve the construction of iCFGs by fully considering implicit calling relations among components.
XI DU was born in Yongle, Zhen'an, Shangluo, Shaanxi, China. She received the bachelor's degree in IoT from the Xi'an University of Science and Technology, in 2018, where she is currently pursuing the master's degree in the major of computer system architecture. Her research interest includes static detection approaches for Android malware.
QIAN LEI was born in Zhongzhang, Liyang, Xianyang, Shaanxi, China, in 1995. She received the bachelor's degree from the Xi'an University of Science and Technology, in 2017, where she is currently pursuing the master's degree. During her graduate studies, she has been engaged in the research of static detection approaches for Android malware and has published several journal and conference papers on this topic.
KEHONG LIU was born in Xi'an, Shaanxi, China, in 1999. She is currently pursuing the bachelor's degree with the Xi'an University of Science and Technology. During her undergraduate studies, she has been engaged in the research project of Android malware detection and has published several journal and conference papers on this topic.