DroidPDF: The Obfuscation Resilient Packer Detection Framework for Android Apps

Android packing techniques were originally used to conceal important information in apps and prevent malicious developers from deconstructing the software logic. However, due to a lack of supervision, packing has become a common method for Android malware to harden apps and circumvent virus detection engines in recent years. With obfuscation and encryption techniques, packing engines can alter the code structure of malware and hide malicious code to deceive and bypass detection mechanisms such as signature matching. In packing terminology, packers are the agents created by packing engines and used to protect software. Packers therefore pose a major challenge to automated malware detection when researchers statically analyze a large collection of Android apps; it is necessary to identify packed samples in advance so that researchers can adopt different processing procedures. To address this problem, we propose an intelligent AnDroid Packer Detection Framework called DroidPDF. It adopts a concise feature set that is resilient to obfuscation techniques. It also introduces weighted entropy to improve detection effectiveness and achieves an average F1 score of 0.9870.


I. INTRODUCTION
As the most popular mobile operating system, Android currently holds a market share as large as 76.2% [1]. In 2018, 75.7 billion downloads were made through Google Play [2]. The great success of Android apps provides attackers with lucrative and profitable targets. In particular, attackers disassemble popular apps with reverse-engineering techniques, pack them after inserting modules with malicious behaviors, and then upload the repackaged apps to various markets to compromise victims' smartphones.
According to McAfee's 2018 Mobile Threat Report [3], there was a 77% increase in Android banking Trojans. Banking Trojans evolve constantly and have become a main source of revenue for cyber criminals. According to another report from the AVL Team [4], packed Android malware has grown rapidly: the number of packed samples increased from 54,971 in 2013 to 1,897,256 in 2017. With this growth in popularity, Android also attracts the attention of malware developers who use advanced code packing techniques.
(The associate editor coordinating the review of this manuscript and approving it for publication was Ana Lucila Sandoval Orozco.)
Packing engines perform a series of purposeful operations that modify the original code structure, encrypt the payloads of apps, and insert stub programs which serve as unpacking routines to release the payloads at runtime [5]. Most packing engines also adopt a number of obfuscation techniques to make the logic obscure and unreadable [6]-[10], including anti-debug, string encryption, payload encryption, dynamic code loading, control flow transformation, junk code insertion, etc. On the one hand, benign apps employ packers to protect proprietary logic and sensitive algorithms from misuse. On the other hand, malicious apps make use of packers to conceal malicious functionality and escape anti-virus engines. Anti-virus engines can easily recognize known malware by matching predefined signatures or patterns of malicious apps. However, the signatures or patterns become invisible once the apps are packed. There is a growing trend of malware taking advantage of packers to evade anti-virus engines, and it has become common for malware adopting packing techniques to pretend to be popular apps. The emergence of packed malware poses a serious challenge to anti-virus engines during static analysis of large collections of samples [11]-[13].
In light of this background, we propose a lightweight solution named DroidPDF to detect packed samples. It is able to identify whether an Android Package (APK) has been packed or not by pattern recognition [14]. It can preprocess samples to automate the identification of malware: for example, it can identify packed samples in advance so that researchers can apply additional operations to recover the source code, ensuring the anti-virus engine always has source code to analyze.
In summary, our major contributions include:
• We introduced weighted entropy to improve the detection performance. Weighted entropy is a technique based on information entropy and customized for Android packer detection. It allocates more weight to subjects with higher entropy to increase the differentiation between encrypted text and plain text.
• We trained a precise classification model with highly trusted data and derived a concise and effective feature set of 9 independent features from the raw attributes. Through comparison, DroidPDF outperforms other state-of-the-art frameworks.
• Our feature set is robust and resilient to obfuscation techniques. Although packers adopt a series of obfuscation techniques, DroidPDF is still able to achieve an average F1 score of 0.9870. When it comes to identifier-obfuscated samples, DroidPDF does not experience a dramatic performance decrease and achieves an average F1 score of 0.9819. Through a statistical test, we find no significant difference between the scores DroidPDF achieves on normal and obfuscated datasets. In other words, DroidPDF is resilient to obfuscation techniques.
The rest of the paper is organized as follows. Section II introduces related background knowledge regarding the packing mechanism and information entropy. Section III demonstrates the implementation of our framework DroidPDF, a practical approach to detecting packed Android apps among a massive collection of samples. Section IV presents the method of selecting the optimal classification model for DroidPDF and evaluates DroidPDF from various angles; the results show that DroidPDF has favorable prospects for Android packer detection. Section V discusses our work and brings up its limitations and future work. Section VI reviews related research and briefly states its advantages and disadvantages. Section VII concludes the paper.

II. BACKGROUND
Before presenting our packer detection framework, we first introduce some prerequisites concerning the packing mechanism and entropy theory.

A. PACKING MECHANISM
An Android packer usually refers to a runtime packer [15]. Runtime packing is a process that decompresses or decrypts the packed code and data during execution. A runtime packer routine usually uses native libraries to decompress or decrypt the original code dynamically, loads the released code into memory, and then hands over control from the packer routine to the released application. In the majority of cases, the packed app contains only a DEX file and several native libraries after decompression. In fact, the DEX file generally serves as a stub program of the packer to initialize and launch the packer routine. As for the native libraries, they contain the proprietary algorithms used to recover the original payload of the app. The general design is that the packer routine starts up from the stub DEX file, which loads the native libraries. These native libraries are the major unpacking agents, which perform all the necessary steps to recover the DEX file structure and form a valid DEX file that is homologous with the original one. Then the packer routine loads the recovered DEX file into memory and commences executing it, meanwhile handing over control to the newly started application.
As aforementioned, most packing engines also adopt several obfuscation techniques to make the logic difficult to understand. Through our manual analysis and other studies [6]-[10], we find that packers usually adopt the following obfuscation techniques to protect apps.
• Anti-debug. Packers usually create a sub-process to check if current app is being debugged. Packers will exit the main process once being attached by other processes.
• String Encryption. Packers usually encrypt constant strings so that some sensitive values will not leak out easily.
• Payload Encryption. Packers usually encrypt the original payload and decrypt it only at the time of execution.
• Dynamic Class Loading. Packers usually load classes with dedicated class loaders at runtime.
• Control Flow Transformation. Packers usually convert Java methods to native methods which are invoked through JNI calls. Therefore the real control flows are concealed.
• Junk Code Insertion. Packers usually insert junk code to change the work flow and make it hard to understand.

B. ENTROPY
Entropy [16], also called information entropy, is a measure of the average amount of information contained in a received message in information theory. The more random the data is, the greater the entropy will be. Therefore, high entropy is usually deemed a symbol of uncertainty rather than certainty. Lyda and Hamrock [17] introduced entropy analysis and put forward Bintropy to measure the statistical variation of executable programs. The underlying principle is that packed and encrypted samples usually generate less predictable sequences. This approach enables analysts to effectively identify packed apps, and it provides a good inspiration for Android packer detection through entropy analysis. We study 6 major packers (Ali, Baidu, Bangcle, Ijiami, Qihoo, and Tencent [18]-[23]) and find that the average entropy increases after the packing process. First, we randomly collect 80 open source Android apps and get them packed with the above 6 packers. Then we work out their average entropy and compare the packed with the non-packed apps pairwise. The average entropy changes of the Ali, Baidu, Bangcle, Ijiami, Qihoo, and Tencent packers are +2.45%, +2.4%, +2.8%, +3.1%, +2.6%, and +4%, respectively.
In this paper, we introduce weighted entropy, which is based on information entropy analysis, to measure the uncertainty of the bytes of certain decompressed files. We use Eq. (1) to calculate the information entropy of a stream of discrete random bytes:

E = − Σ_{i=1}^{M} p_i · log2(p_i)  (1)

where p_i ∈ [0, 1] is the probability of the ith character appearing in the stream of characters of the data. p_i becomes 0 when the ith character does not exist in the data stream and 1 when the data stream only contains the ith character. The special case is p_i = 0, because log2(p_i) is then an invalid expression, so we define p_i · log2(p_i) = 0 when p_i = 0. M is the number of distinct characters. Since there are 256 possible characters, the entropy value is a real number in the range 0 ≤ E ≤ 8. The entropy becomes 0 when there is only one unique character in the data stream, and 8 when all 256 characters appear with equal frequency. The corresponding entropy calculator is implemented in Algorithm 1.
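The entropy calculation of Eq. (1) can be sketched in Python; this is an illustrative stand-in for Algorithm 1, not the paper's implementation:

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte stream per Eq. (1); the result lies in [0, 8]."""
    if not data:
        return 0.0
    n = len(data)
    # Absent characters have p_i = 0, and p_i * log2(p_i) is defined as 0,
    # so they are skipped by iterating over observed counts only.
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())
```

For example, a constant stream (one unique byte) yields 0.0, and a stream containing all 256 byte values with equal frequency yields 8.0.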

III. METHODOLOGY
In this section, we describe the procedures of building up DroidPDF. A set of concise and valuable features will be used which are derived by analyzing a series of entries in APK file. By combining with the optimal classification model, DroidPDF is able to recognize both known and unknown packers.
A. ARCHITECTURE
Fig. 1 presents the architecture of DroidPDF. Since the APK file is organized in the zip file format, DroidPDF first uses the decompression tool unzip [24] to extract the compressed files. Next, it collects a set of original attributes from the decompressed content. After a series of processing steps on these raw attributes, the original data are transformed into features. Each sample is then reduced to a feature vector, which is sent to the trained classifier and examined to determine whether the sample is packed or not.

B. RAW ATTRIBUTES EXTRACTION
Before calculating the feature vector, we need to extract attributes from the APK file. The following 9 types of attributes are extracted.

1) SIZE OF APK FILE
This attribute can be acquired by counting the bytes of the APK directly.

2) SIZE OF DEX FILE
To obtain this attribute, we decompress the APK first. Then we count the bytes of the file named classes.dex, which is located in the root directory of the decompressed APK.
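A minimal sketch of extracting these first two attributes, assuming Python's standard zipfile module (the paper does not specify its tooling beyond unzip):

```python
import os
import zipfile

def apk_and_dex_sizes(apk_path: str):
    """Return (APK size in bytes, size of classes.dex or 0 if absent)."""
    apk_size = os.path.getsize(apk_path)
    with zipfile.ZipFile(apk_path) as z:
        try:
            # file_size is the uncompressed size of the archive member.
            dex_size = z.getinfo("classes.dex").file_size
        except KeyError:
            dex_size = 0
    return apk_size, dex_size
```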

3) PACKAGE NAME
This attribute is stored in the file named AndroidManifest.xml, which is located in the root directory of the decompressed APK. It is noteworthy that AndroidManifest.xml is encoded in the AXML format, so reading the package name after decompression requires an AXML format parser to decode the otherwise unreadable text.

4) APPLICATION NAME
The method of extracting the application name is the same as that for the package name. However, this attribute value may be empty in most circumstances. The reason is that a number of Android apps adopt the default android.app.Application class, which does not need to be declared in AndroidManifest.xml. By contrast, if the app implements some customized features by creating a subclass of android.app.Application, it must specify the fully-qualified name of this subclass in AndroidManifest.xml.

5) DECLARED COMPONENTS
In general, there are four typical components which are the primary elements for constructing Android apps: android.app.Activity, android.app.Service, android.content.BroadcastReceiver, and android.content.ContentProvider. If the application implements any of them and invokes them during execution, these components must be declared in AndroidManifest.xml during development; otherwise the app will crash.

6) CLASS DEFINITIONS
This attribute can be extracted from classes.dex. The classes.dex file is essentially the Dalvik executable compiled from the Android program. Following the DEX format specification, we built a custom parser to read all the class definitions.
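The paper builds a full custom DEX parser. As a minimal illustration, the count of class definitions alone can be read from the fixed-layout DEX header, whose class_defs_size field is a little-endian u4 at offset 0x60 per the DEX format specification:

```python
import struct

def dex_class_defs_count(dex_bytes: bytes) -> int:
    """Read class_defs_size from the DEX header (little-endian u4 at 0x60)."""
    # Every DEX file starts with the magic "dex\n" followed by a version.
    if dex_bytes[:4] != b"dex\n":
        raise ValueError("not a DEX file")
    (count,) = struct.unpack_from("<I", dex_bytes, 0x60)
    return count
```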

7) CLASS METHODS
This attribute is also extracted from classes.dex. The manner of reading the class methods is the same as that of reading the class definitions.

8) LIST OF DECOMPRESSED FILES
To obtain this attribute, we recursively search the root directory of the decompressed APK, find all the files, and record their respective byte sizes.
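This recursive search can be sketched with os.walk; the helper name is illustrative:

```python
import os

def list_decompressed_files(root: str) -> dict:
    """Map each file's path (relative to root) to its size in bytes."""
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            sizes[os.path.relpath(full, root)] = os.path.getsize(full)
    return sizes
```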

9) ENTROPY OF DECOMPRESSED FILES
The entropy of all the native libraries (.so extension), JAR files (.jar extension), and DEX files (.dex extension) from the last step is calculated. Notably, entropy calculation is a compute-intensive task. To achieve high performance, the logic should be written in C or another compiled language with excellent execution efficiency [25].

C. FEATURE TRANSFORMATION
With all the attributes extracted in the last phase, we transform them into features for the classification models. The feature set contains 9 different kinds of features, which are briefly described in Table 1.

1) PROPORTION OF DEX FILE
It is a numeric type which represents the proportion of the DEX file size to the entire APK size. In most situations, the new classes.dex file created by packing engines is a packer stub used to set up the packer routine. Most content of the original classes.dex file is encrypted and extracted to other files to avoid being found. In general, the stub file is smaller than the original one.

2) NON-EMPTY APPLICATION NAME
It is a boolean type and the value should be set to true if the developer designs a customized Application class. The packer needs to implement its own subclass of the android.app.Application class so that it can maintain the global application state. Moreover, the implemented Application subclass must be declared in AndroidManifest.xml and instantiated before any other classes. After releasing the original content, the packer application switches execution to the original application.

3) PACKAGE NAME IN APPLICATION NAME
It is a boolean type and the value is set to true if the package name appears in the fully-qualified application name. For original apps that implement their own Application classes, the fully-qualified application names are usually prefixed with their package names. By contrast, the packers' counterparts do not carry the original apps' package names in most cases.

4) NUMBER OF NATIVE LIBRARIES
It is a numeric type which denotes the number of .so extension files. Native libraries take the form of .so extension files in Android, which are executable and linkable. Most packers contain native libraries which incorporate the algorithms for releasing the Dalvik byte code. There are two advantages to implementing packer routines in native code. On the one hand, native code is stored as binary code, which is resilient to static analysis and makes the software difficult to deconstruct. On the other hand, native code is able to directly access and manipulate the memory storing the byte code.

5) NUMBER OF VISIBLE CLASSES
It is a numeric type which denotes the number of classes defined in classes.dex. Android packing engines encrypt most of the DEX classes and store them in other files, so usually very few classes are left. In fact, a normal app that has not been processed by packing engines typically has at least 1500 classes.

6) NUMBER OF NATIVE METHODS
It is a numeric type which represents the number of native methods defined in classes.dex. More and more sophisticated packing engines use native method technique to translate Java methods to native ones. The native methods should be invoked by JNI calls and executed by dedicated runtime. Through investigation, we find that almost all the popular packer vendors support native method technique, including Ali, Baidu, Bangcle, Ijiami, Qihoo, Tencent, etc.

7) PROPORTION OF NATIVE METHODS
It is a numeric type which represents the proportion of native methods among all methods. Packers adopting the native method technique contain more native methods, so this proportion is higher for packed apps.

8) PROPORTION OF VISIBLE COMPONENTS
It is a numeric type which represents the proportion of components implemented in the DEX file to those declared in the manifest file. Android components must be declared in AndroidManifest.xml before they are used. Since packing engines always conceal classes of classes.dex, it is highly probable that only a few components are left in the DEX file. The app may be packed if components are registered in AndroidManifest.xml but not present in classes.dex.

9) WEIGHTED ENTROPY
It is a numeric type whose value is in the range from 0 to 8. Although higher entropy implies higher suspicion of packing, simply calculating entropy over files does not produce sufficient information. It is reasonable to emphasize the weight of high-entropy subjects, since content with high entropy is more likely to carry encrypted sequences. Here we propose the weighted entropy method. Weighted entropy takes in the entropies of these subjects and yields a uniform representation of the entropy.
Before presenting the weighted entropy, we explain how to derive the weight factors acting on each subject. The factors are expressed as Eq. (2), where N represents the total number of data flows and j denotes the jth data flow. The data flows are sorted in descending order by entropy score, which means the first data flow carries the highest entropy and the Nth carries the lowest. The overall summation of the weight factors is 1.
Having established the weight factors, we can derive the weighted entropy as Eq. (3):

WE = − Σ_{j=1}^{N} w_j · Σ_{i=1}^{M} p_ij · log2(p_ij)  (3)

where i, p_i, and M are as described in Eq. (1) (here p_ij is the probability of the ith character in the jth data flow); N and j are as described in Eq. (2); and w_j is the weight factor of the jth data flow. The formula of weighted entropy is summarized as Algorithm 2.
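The computation can be sketched as follows. Since the exact form of the weight factors in Eq. (2) is not reproduced here, the linear rank weighting below is only an assumption chosen to satisfy the stated properties (weights decrease with entropy rank and sum to 1), not the paper's formula:

```python
import math
from collections import Counter

def _entropy(data: bytes) -> float:
    """Shannon entropy per Eq. (1)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

def weighted_entropy(streams) -> float:
    """Combine the entropies of N data flows into one score in [0, 8]."""
    scores = sorted((_entropy(s) for s in streams), reverse=True)
    n = len(scores)
    if n == 0:
        return 0.0
    # ASSUMED weight factors: w_j = 2(N + 1 - j) / (N(N + 1)), j = 1..N.
    # They decrease with rank j and sum to 1, as described for Eq. (2).
    weights = [2 * (n + 1 - j) / (n * (n + 1)) for j in range(1, n + 1)]
    return sum(w * e for w, e in zip(weights, scores))
```

With this weighting, a high-entropy flow contributes more than a plain average would give it, which matches the stated goal of emphasizing encrypted content.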
Algorithm 2 (Calculation of Weighted Entropy for a Specified Sample) takes as input the file path of the sample whose weighted entropy is to be calculated, and outputs the weighted entropy of the given sample.

D. CLASSIFICATION MODELING
In the modeling procedure, each app is converted to a vector-based representation to form the dataset. With the dataset, we can perform hyper-parameter tuning to work out the optimal combination of hyper-parameters for the candidate models. Besides, model selection is as important as hyper-parameter tuning. It is recommended that researchers compare the performance of different models on the given dataset to determine the final model. We detail this process in the following paragraphs.

1) FEATURES TO DATASET
Suppose each sample is defined by m features and the number of samples is n; then the whole sample set can be represented by an n × m matrix M, where x_ij denotes the jth feature of the ith sample. Each feature and its corresponding ID is listed in Table 1. In our implementation of DroidPDF, the size of the feature set is m = 9. The possible values of y_i are either integer 0 (non-packed) or 1 (packed). Given a new X_i, the classifier predicts the corresponding y_i.

2) HYPER-PARAMETER TUNING
In machine learning, hyper-parameter tuning is the process of selecting a set of optimal hyper-parameters for a classification model [26]. It is important to optimize the hyper-parameters before the model training process, as appropriate hyper-parameters may greatly influence performance. Researchers usually prepare several possible values for each hyper-parameter, then run a randomized or grid search to find the combination that yields the highest score.
It is a common practice to apply K-fold Cross Validation (K-fold CV) to evaluate different hyper-parameter settings [27], [28]. The procedure of performing K-fold CV is summarized in Algorithm 3. It randomly divides the original sample data into K disjoint subsets of the same size. One of these subsets is selected as the testing set and the other K-1 are merged to form the training set. We repeat the above process K times; in each round, every subset takes a turn serving as the testing set with the remainder as the training set.
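With scikit-learn (the library backing the paper's classical models), the K-fold CV procedure of Algorithm 3 corresponds to standard API calls; the classifier and the synthetic 9-feature data below are illustrative stand-ins, not the paper's setup:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the 9-feature packed/non-packed dataset.
X, y = make_classification(n_samples=200, n_features=9, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0),
                         X, y, cv=cv, scoring="f1")
# One F1 score per fold; their mean scores the hyper-parameter setting.
```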
We summarize this evaluation method of hyper-parameter settings in Algorithm 4. For a specific model, the optimal hyper-parameter set can be determined by the mean of K-fold CV scores after searching all the possible combinations. To boost the efficiency of hyper-parameter tuning, it is suggested to apply some dedicated hyper-parameter optimization frameworks, such as Irace [27], scikit-learn [29], and Talos [30].

3) MODEL SELECTION
Model selection is the process of selecting the optimal one from a collection of candidate classification models. In this work, 9 types of powerful models are evaluated.
All of them are frequently used models, and many previous studies have applied them to recognize the patterns of observed subjects.
It is a common method to select the best model based on the K-fold CV results. However, this method has the problem that the training sets overlap with each other, and information may thus leak into the model, which can result in over-fitting [34]-[36]. To address this challenge, we adopt Nested Cross Validation (Nested CV) [35] to evaluate these models. According to research [37], Nested CV yields a nearly unbiased estimate of the true error. Nested CV is an extension of cross validation which consists of two nested loops of cross validation: the inner loop is used to perform the hyper-parameter tuning and the outer loop is used to compute an estimate of the error. This method is summarized in Algorithm 5.
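In scikit-learn, Nested CV can be expressed by placing a tuned estimator (the inner loop) inside cross_val_score (the outer loop); the model, grid, and data below are illustrative, not the paper's settings:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=120, n_features=9, random_state=0)

# Inner loop: hyper-parameter tuning on each outer-loop training split.
inner = GridSearchCV(SVC(), {"C": [0.1, 1.0, 10.0]}, cv=2, scoring="f1")
# Outer loop: a nearly unbiased estimate of the generalization error.
outer_scores = cross_val_score(inner, X, y, cv=5, scoring="f1")
```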

IV. EVALUATION
In this section, we evaluate the performance of DroidPDF. The development of DroidPDF is done on a Thinkpad s540 laptop with 8 GB of RAM and 4 physical processors. Each processor is an Intel(R) Core(TM) i7-4500U CPU with 2 cores running at up to 1.80GHz. The operating system is Ubuntu 16.04 LTS. All the Machine Learning models are backed by scikit-learn [29], version 0.21.2. The Deep Learning models are backed by TensorFlow [38], version 2.2.0.

A. DATASET
To evaluate the predictive ability of DroidPDF, we collect two sample sets, S1 and S2. S1 consists of 1400 packed Android apps collected from VirusShare [39] (600 samples), AppChina [40] (500 samples), and our previous research (300 samples). Samples from our previous research were packed by invoking the online packing services of Ali, Bangcle, Baidu, Ijiami, Qihoo, and Tencent. Of the remaining 1100 packed samples, 1043 were identified by a signature-based method [9], [41]. The remaining 57 samples were detected manually in a manner similar to research [17]. To be specific, we use the unpacker DexX [7] to recover their source code while running these samples on physical devices. Runtime packers encrypt and conceal the original source code and only release it when the app is running, so we compare the recovered source code with the decompressed source code and mark an app as packed if the source code does not match. Fig. 2 shows the packer distribution of all the packed apps. The majority of them are from mainstream packer vendors, such as Ali, Bangcle, Baidu, Ijiami, Qihoo, and Tencent. S2 is composed of 1400 non-packed Android apps obtained from F-Droid [42]. F-Droid is a website that provides the source code of Android apps and is often mentioned by other studies [9], [17] when seeking non-packed samples. To ensure that all the samples from F-Droid are non-packed, we download all the source code from F-Droid and compare it with the corresponding samples. It turns out that all the samples are non-packed. We prepare several datasets in advance which will be used in the following experiments. Note that we follow a stratified fashion when partitioning the datasets.
In other words, the proportion of samples with different labels (packed, non-packed) will be the same in each dataset. We split S1 into two datasets, D_pk1000 and D_pk400, after shuffling; they consist of 1000 and 400 packed samples, respectively. We repeat this process and split S2 into D_npk1000 and D_npk400, which consist of 1000 and 400 non-packed samples, respectively. Datasets D_pk1000 and D_npk1000 are used for model selection and parameter tuning in Sections IV-C and IV-E. D_pk400 and D_npk400 are used for performance comparison in Sections IV-D and IV-F.

B. EVALUATION METRICS
In this work, we use the F1 score [43] as the metric to evaluate the performance of different models. In the statistical analysis of binary classification, the F1 score is a measure of the accuracy of a given model. The F1 score can be deemed a harmonic mean of the metrics Precision and Recall:

F1 = 2 · (Precision · Recall) / (Precision + Recall)  (4)

where Precision and Recall can be calculated as follows:

Precision = TP / (TP + FP)  (5)
Recall = TP / (TP + FN)  (6)

where TP, TN, FP, and FN represent True Positive, True Negative, False Positive, and False Negative, respectively. Positive indicates packed and Negative indicates non-packed; True means a correct prediction and False an incorrect one.
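These metrics can be computed directly from the confusion-matrix counts:

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 score as the harmonic mean of Precision and Recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For instance, 8 true positives with 2 false positives and 2 false negatives gives precision = recall = 0.8, hence F1 = 0.8.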

C. MODEL SELECTION FOR DroidPDF
In this section, we detail the process of searching for the optimal hyper-parameter setting and the method of choosing the optimal model for DroidPDF. Before the experiment, we combine the datasets D_npk1000 and D_pk1000 to form the dataset D_2000 for the follow-up evaluation.
In this experiment, we use 5-fold CV to tune the hyper-parameters of the models. It is necessary to point out that in K-fold CV experiments, the choice of K is a trade-off between bias and variance. Generally speaking, K is an empirical value, and researchers usually choose K=5 or K=10. The larger K is, the more data is added to the training set and the smaller the bias of the model will be. However, with larger K there will be greater correlation among training sets (considering the most extreme example, when K equals the size of the dataset, also known as Leave-One-Out Cross Validation [44], the training sets are almost the same every time). The large correlation of training sets will result in a large variance in the final result. For each model, we work out the K-fold CV scores of all the possible combinations of hyper-parameters by invoking Algorithm 3. The optimal hyper-parameter setting is determined by the mean of the K-fold CV scores. The evaluation results are presented in Table 2. The second column presents the best hyper-parameter setting of each model; the third column presents the scores achieved by the corresponding hyper-parameter settings.
As explained in Section III-D, it is recommended that researchers adopt Nested CV rather than non-nested CV to evaluate the performance of different models. We adopt the 5 × 2cv method [45], which is one of the implementations of Nested CV and recommended by MLxtend [46] for practical usage. For 5 × 2cv, the outer loop is based on 5-fold CV and the inner loop is based on 2-fold CV. Table 3 presents the performance of 9 different models after performing 5 × 2cv. As shown in Table 3, Xgboost outperforms the other models and achieves an average F1 score of 0.9870. This fact leads us to choose Xgboost as the classification model of DroidPDF. It is a common practice to fit the final model to the whole dataset after hyper-parameter optimization [29]. Therefore, with the optimal hyper-parameter setting in Table 2, we train Xgboost on D_2000 and name the trained model M_α for further use.

D. COMPARISON OF EXISTING RESEARCHES
In this section, we show that DroidPDF outperforms some typical state-of-the-art approaches. APKiD [41] and Liao et al. [47] are compared in this experiment. APKiD is a signature-based packer detection solution which identifies many types of packers, compilers, and obfuscators. Liao et al. presented a framework based on a combination of static and dynamic analysis. First, it extracts all the class names from the APK file through static analysis. Second, it modifies the class loading function Dalvik_dalvik_system_DexFile_defineClassNative and dumps all the class names through dynamic analysis. Finally, it compares the class names from static and dynamic analysis: if some class names only exist in the result of dynamic analysis, the app contains packed code.
To evaluate each packer detection framework, we adopt datasets D_pk400 and D_npk400, which consist of 400 packed and 400 non-packed samples, respectively. The results are listed in Table 4. It is obvious that our model outperforms both. Here we analyze the reasons for the failures of APKiD and Liao's framework, respectively. (1) For APKiD, although fingerprint detection methods can detect packed apps, they are less versatile in real-world environments. The limitation of fingerprint-based tools is that they cannot recognize features they have never seen. Through our research, we find that APKiD fails to detect most Tencent packers since it has no signatures for the related packers. Because packers have been evolving all the time and more and more custom-designed packers are devised, the traits of packers will change over time and fingerprint-based detection methods will become less effective.
(2) For Liao's framework, we find that many apps protected by the Ali and Ijiami packers conceal method code instead of whole classes. According to Liao's method, apps whose class names from static and dynamic analysis are identical should be classified as non-packed. But in fact, apps with identical class names may still be packed by the Ali and Ijiami packers.

E. ENTROPY ANALYSIS
In this section, we validate the effectiveness of weighted entropy. As we found in Section II-B, the entropy of apps will more or less increase after being packed by mainstream packers. This phenomenon led us to devise weighted entropy as a trait to help distinguish packed apps. We conduct a controlled experiment including a statistical test to measure whether it is statistically significant to include weighted entropy in the feature set. In other words, we examine whether there is a significant difference with and without weighted entropy.
In this experiment, we repeat the process in Section IV-C to find the optimal model on dataset D 2000 without weighted entropy. Specifically, we remove the weighted-entropy feature column from the dataset and then evaluate the models as before. Table 5 shows the F1 scores of 5 iterations. Xgboost evidently outperforms the other models; consequently, in the controlled experiment, we employ Xgboost as the classifier for the dataset without weighted entropy.
Comparing the best mean score in Table 3 (achieved by Xgboost, 0.9870) with that in Table 5 (achieved by Xgboost, 0.9779), we observe a performance decline when the weighted-entropy feature is removed. However, we need to determine whether this happens by chance. We apply an independent two-sample t-test [48] to compare the performance of these two models more comprehensively. The statistical significance of the t-test indicates whether the difference between the two models is significant. We set a significance threshold of α = 0.05 for rejecting the null hypothesis H0 that both models perform equally well.
In our scenario, statistical significance indicates a real performance difference between the two models trained on datasets with and without weighted entropy. Let x and y denote the two models to compare. The t-value is presented in Eq. (7):

t = \frac{\bar{x} - \bar{y}}{\sigma \sqrt{\frac{1}{n_x} + \frac{1}{n_y}}}    (7)

where \bar{x} and \bar{y} represent the means of the performance scores of models x and y, respectively; n_x and n_y represent the numbers of performance scores of models x and y, respectively; and σ^2 is the pooled variance of the given samples, expressed as follows:

\sigma^2 = \frac{\sum_{i=1}^{n_x} (x_i - \bar{x})^2 + \sum_{j=1}^{n_y} (y_j - \bar{y})^2}{n_x + n_y - 2}

where x_i and y_j represent the performance scores of models x and y, respectively, and n_x + n_y − 2 is the degrees of freedom (df), which is 5 + 5 − 2 = 8 in this experiment. According to Eq. (7), we obtain a t-value of 2.52. Given the t-value and the degrees of freedom df, the p-value turns out to be 0.036. Since p-value = 0.036 is less than α = 0.05, the performance difference between the two models is significant. In other words, the performance improvement from including weighted entropy in the feature set is statistically significant.

F. OBFUSCATION ANALYSIS
In this section, we validate the model's resilience to obfuscation techniques. Through investigation, we find that identifier obfuscation causes a much more drastic change (increase or decrease) in the entropy of apps' code than other obfuscation techniques. However, an app that merely adopts identifier obfuscation is not necessarily packed, since its source code is not concealed. Because our framework is partially based on entropy analysis, this change of entropy may adversely affect our framework. Fig. 3 shows the difference between the original code and the identifier-obfuscated code.
There are two major challenges for DroidPDF when facing obfuscation techniques. For packed apps, some packers may use identifier obfuscation on the source code, which may decrease the entropy; for non-packed apps, identifier obfuscation may increase the entropy. To prove that DroidPDF is obfuscation resilient, the most straightforward approach is to use model M α to classify obfuscated datasets (including packed and non-packed samples) and then demonstrate that there is no significant difference between the scores achieved on obfuscated and non-obfuscated datasets.
To generate obfuscated datasets, we use two obfuscators, ProGuard [49] and Obfuscapk [50], to perform the obfuscation on D pk400 and D npk400 . The process is detailed as follows. We randomly select 400 samples from D pk400 and D npk400 (200 samples from each dataset) and use ProGuard to obfuscate them; the remaining 400 samples from D pk400 and D npk400 are obfuscated by Obfuscapk. Finally, we produce two obfuscated datasets, D obfpk400 (packed) and D obfnpk400 (non-packed). It turns out that the entropy of 80% of the samples decreased after being obfuscated by ProGuard, while the entropy of 90.2% of the samples increased after being obfuscated by Obfuscapk. According to our study, ProGuard uses a great deal of special characters, which results in relatively lower entropy due to the high frequency of the same characters; Obfuscapk differs from ProGuard in that it uses a series of more random sequences, which results in relatively higher entropy.
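The opposite entropy shifts of the two obfuscators can be illustrated with a byte-level Shannon entropy function. The two identifier strings below are hypothetical stand-ins for ProGuard-style output (a tiny, heavily reused alphabet) and Obfuscapk-style output (near-random sequences over a larger alphabet):

```python
from collections import Counter
from math import log2


def shannon_entropy(data: bytes) -> float:
    """Byte-level Shannon entropy in bits per byte."""
    n = len(data)
    counts = Counter(data)
    return -sum(c / n * log2(c / n) for c in counts.values())


# ProGuard-style names reuse very few characters -> low entropy.
proguard_like = b"aab" * 100
# Obfuscapk-style names look like random sequences -> high entropy.
obfuscapk_like = bytes(range(33, 127)) * 3
```

For instance, a string drawn uniformly from two characters has exactly 1 bit/byte of entropy, while the 94-character alphabet above approaches log2(94) ≈ 6.55 bits/byte, which is the direction of the shift we observe on Obfuscapk output.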
We conduct an experiment that iterates 10 times. In each iteration, we regulate the proportion of obfuscated samples to study the influence of obfuscated apps on our framework. For example, to form a dataset in which the proportion of obfuscated samples is 10% (80 obfuscated samples), we randomly select 40, 40, 360, and 360 samples from D obfpk400 , D obfnpk400 , D pk400 , and D npk400 , respectively. Note that samples in D obfpk400 are from D pk400 and samples in D obfnpk400 are from D npk400 . To avoid duplication, the same sample cannot be chosen twice in each iteration. For convenience, we name these test datasets D mix10 , D mix20 , · · · , and D mix100 , respectively; the subscript in the dataset name indicates the proportion of obfuscated samples in percentage.
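The D mix construction above can be sketched as follows. This is an illustration under assumptions: samples are represented by hypothetical id lists, and picking an id into the obfuscated subset excludes its original form, which enforces the no-duplication rule:

```python
import random


def build_mix(pk_ids, npk_ids, obf_ratio, total=800, seed=0):
    """Build a test set with a given proportion of obfuscated samples,
    split evenly between packed and non-packed classes. Each id appears
    at most once, in either its obfuscated or its original form."""
    rng = random.Random(seed)
    n_obf = int(total * obf_ratio) // 2   # obfuscated samples per class
    n_plain = total // 2 - n_obf          # original samples per class
    mix = []
    for ids in (pk_ids, npk_ids):
        obf = rng.sample(ids, n_obf)                    # take obfuscated variants
        rest = [i for i in ids if i not in set(obf)]    # exclude their originals
        plain = rng.sample(rest, n_plain)
        mix += [(i, True) for i in obf] + [(i, False) for i in plain]
    return mix
```

For D mix10 (obf_ratio = 0.1) this yields 40 + 40 obfuscated and 360 + 360 original samples, matching the example in the text.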
We use model M α to predict on these test datasets. Table 6 shows the experiment results. To evaluate the performance of M α on obfuscated samples more comprehensively, we again apply the two-sample t-test of Eq. (7) to measure statistical significance. According to research [35], the nested CV method reduces bias and produces an estimate of the test error that is very close to that obtained on an independent testing set. We therefore reuse the scores of Xgboost in Table 3 as a surrogate performance estimate of M α and compare it with the performance in Table 6. Applying Eq. (7), the degrees of freedom df, t-value, and p-value turn out to be 13, 1.97, and 0.07, respectively. Since p-value = 0.07 is greater than α = 0.05, there is no significant difference between the scores M α achieves on normal and obfuscated datasets. From this statistical analysis, we conclude that DroidPDF is resilient to obfuscation techniques.

G. EFFICIENCY ANALYSIS
Processing time is another important performance indicator for a large-scale classification system. We combine S1 and S2 to form the test set. For each sample, DroidPDF performs the feature extraction 10 times and takes the average as the actual processing time. Fig. 4 shows the processing performance of DroidPDF. Fig. 4a shows the APK decompressing time against the APK file size; the average decompressing rate is around 60 MB/s, and the decompression time grows linearly with the subject size. Fig. 4b shows the DEX file parsing time against the DEX file size; the average parsing rate is around 15 MB/s. In Fig. 4b the processing time drops unexpectedly for some DEX files larger than 8 MB. Through analysis, we find that these DEX files are packed by Tencent packers, which insert invalid code chunks to pad the DEX files and make them look normal. Since such a file contains only a few valid classes, it takes little time to process. Fig. 4c shows the entropy calculating time against the subject size; the average rate is around 284 MB/s, and the calculating time again grows linearly with the subject size. Fig. 4d shows the overall processing time for a whole APK, including decompressing, DEX file parsing, and entropy calculating; the overall rate is around 14 MB/s. In Fig. 4d the processing time does not grow linearly for many samples larger than 70 MB. Investigation shows that the cause is the resource files: for most samples larger than 70 MB, the DEX file and native libraries grow little, and the actual growth comes from the resource files. Since the resource files are irrelevant to our features and thus skipped, the processing time does not increase with the sample size for these apps.
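The per-sample timing methodology (averaging 10 feature-extraction runs) can be reproduced with a small harness like the one below; `extract_features` here is a hypothetical placeholder for DroidPDF's actual extraction pipeline:

```python
import time


def average_runtime(fn, sample, repeats=10):
    """Run fn(sample) `repeats` times and return the mean wall-clock time,
    mirroring how per-sample processing time is measured."""
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(sample)
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / repeats


def extract_features(apk_path):
    """Hypothetical stand-in for decompressing, DEX parsing, and entropy
    calculation on one APK."""
    return sum(range(10_000))  # dummy workload
```

Dividing a sample's size in bytes by the averaged time then yields the MB/s rates reported in Fig. 4.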

V. DISCUSSION
In this section, we discuss some practical methods to boost the speed of DroidPDF.
The programming language used to implement the framework matters tremendously for processing time. We strongly recommend implementing the entropy calculation in C, since it is a computation-intensive task; according to our test, the C implementation is almost 40 times faster than its Python counterpart.
To improve the processing efficiency of DroidPDF in real-world production, we encourage researchers to incorporate the fingerprint technique as a filter that picks out already known packers before samples are sent to the classification engine. In turn, fingerprints can be extracted from the previously unseen packers detected by DroidPDF to enrich the fingerprint repository.

VI. RELATED WORK
In large-scale analysis of Android malicious apps, it is necessary to identify packed apps in advance so that different analysis workflows can be adopted to facilitate automatic identification. Several prior contributions on packer detection address this challenge.
APKiD [41] is an Android packer recognition tool based on fingerprint identification. Although fingerprint-based detection can identify packed apps, it is less versatile in real-world environments: fingerprint-based tools cannot recognize previously unseen features. In other words, APKiD always lags behind when new packers emerge.
Bintropy [17] is the earliest attempt to use entropy to identify packed apps in the Windows Portable Executable (PE) format. It is a good inspiration for packer analysis, but we should be aware of a fundamental difference between PE and APK: an APK is a compressed file, and calculating entropy directly on it causes a high-entropy issue.
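The high-entropy issue with compressed containers is easy to reproduce: compressing even highly repetitive data yields a near-random byte stream whose entropy is far above that of the original. A minimal sketch (using zlib as a stand-in for APK/ZIP compression):

```python
import zlib
from collections import Counter
from math import log2


def shannon_entropy(data: bytes) -> float:
    """Byte-level Shannon entropy in bits per byte."""
    n = len(data)
    counts = Counter(data)
    return -sum(c / n * log2(c / n) for c in counts.values())


plain = b"public class MainActivity { }" * 200  # repetitive source-like bytes
compressed = zlib.compress(plain, 9)            # zip-like near-random stream
```

The compressed stream's entropy is much higher than the plain text's, regardless of whether the payload is packed, which is why entropy must be computed on the decompressed DEX payload rather than on the APK archive itself.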
Chau et al. [51] also proposed an entropy analysis solution. It selects several typical resource files and calculates their entropy to form a feature vector. This method is innovative, with one limitation: the entropy analysis is not resilient to obfuscation techniques, especially identifier obfuscation. Obfuscated identifiers may carry lower- or higher-entropy sequences, which negatively impacts the results.
Liao et al. [47] presented a framework based on the combination of static analysis and dynamic analysis. Under Liao's method, apps whose class names from static and dynamic analysis are identical are classified as non-packed. As analyzed in Section IV-D, this approach fails to detect some samples packed by Ali and Ijiami packers, since these two packers encrypt the method instructions instead of the whole classes.
The reason we use multi-dimensional features is that recognizing packers by an entropy threshold is similar in principle to fingerprinting. The entropy value of a packer can change, and the entropy values of non-packed apps may also fall within this range due to changes in SDK [52], obfuscation techniques, or other factors. Hence, we introduce 8 other entropy-irrelevant features to make the classification system more robust.

VII. CONCLUSION
In this study, we proposed a packer detection framework based on a set of heuristic features to recognize packed and non-packed apps. We introduced weighted entropy to enhance the performance and achieved a high score. According to the experiment outcomes, our Xgboost-based classifier outperforms other state-of-the-art frameworks. DroidPDF shows a promising prospect as a practical framework for detecting not only known packers but also unknown ones.