Comparative Analysis of Low-Dimensional Features and Tree-Based Ensembles for Malware Detection Systems

Advances in machine learning algorithms have improved the performance of malware detection systems over the last decade. However, challenges remain, such as processing large volumes of malware, learning high-dimensional vectors, high storage usage, and poor scalability in learning. This paper proposes low-dimensional but effective features for a malware detection system and analyzes them with tree-based ensemble models. Expert knowledge and frequency analysis are adopted for relevant feature selection from the collected dataset, which contributes to fast low-dimensional feature preparation, low storage usage, and fast learning. We extract five types of malware features from binary or disassembly files. Specifically, the novel WEM (Window Entropy Map) image is designed to represent malware of variable length, and the set of frequently used APIs is analyzed to shorten processing time. To validate the effectiveness of the selected features, we compare the performance of tree-based ensemble models such as AdaBoost, XGBoost, random forest, extra trees, and rotation forest. The proposed features reduce the original feature dimensionality by several tens to hundreds of times and decrease the training time of ensemble models without degrading the malware detection rate, compared to the whole set of malware features. In both accuracy and AUC-PRC evaluation, XGBoost ranks highest.


I. INTRODUCTION
Malware (malicious software) is any program or file that is intended to damage computers, computer systems, or networks. Once implanted into a computer, malware misbehaves against the interests of users. Types of malware include viruses, worms, trojans, rootkits, spyware, ransomware, etc. Malware infection has been increasing over the years. There have been many endeavors to prevent malware attacks and the spread of infection to other computers, but it is difficult to deal with advanced malware variants involving polymorphism [1]-[4], packing [5], obfuscation [6], etc. A malware detection system utilizes known detective patterns to verify whether a new application poses a threat.
The set of detective patterns is collected by analyzing previously known malware. The fast growth of malware variants nullifies malware detection systems based on these known patterns, and commercial antivirus software has difficulty detecting new malware variants unless it is kept up to date. To resolve these difficulties, machine learning techniques have been applied in malware detection systems [2], [7]. Static analysis using machine learning algorithms is able to detect some polymorphic or obfuscated code that appears as patterns in a sequence or a 2-dimensional image [2], [8].
The general workflow of the machine learning process includes the following steps: data collection, feature extraction, model training, and model selection. The data collection step gathers malware and benign files. The feature extraction step decides a suitable representation for malware vectorization and prepares the training dataset. The model training step uses the training dataset to yield the learned models. In the model selection step, the best learned model is selected and applied to a real application.
Feature extraction is an important step of the machine learning process [7], [9]. It prepares feature vectors representing the characteristics of malware and is closely associated with the overall performance of a machine learning algorithm. Learning performance often depends on the types of extracted features. Feature preparation can be time-consuming depending on the selected feature types; however, feature vector preparation can be done in parallel.
Dynamic analysis observes malware behavior while the malware is being executed in a virtual environment, such as CWSandbox [16] or Cuckoo Sandbox [17]-[19]. In a virtual environment, a malware attack cannot cause damage to a system due to controlled execution. The virtual environment monitors changes in the network, registry, MFT (Master File Table), and the behavior of processes, and records log files. Dynamic features include the API call sequence and arguments, monitored processes, registry changes, mutex changes, etc. These kinds of features require computationally intensive operations because they need a virtual environment. However, they are able to detect obfuscated malware, to which static features are vulnerable [2], [10], [20].
Machine learning models require persistent malware analysis against the increasing number of malware variants. Malware detection systems are being developed with machine learning techniques to reduce the FPR (False Positive Rate) of signature-based detection techniques. In recent years, tree-based ensemble algorithms have become among the most important ensemble methods for prediction and classification [21], [22]. Decision trees are less affected in learning time than SVM as training data grows. However, a deep decision tree has the disadvantage of overfitting. Ensemble learning algorithms can mitigate overfitting by trading off bias and variance. Thus, tree-based ensemble algorithms are considered in this study. Through an ensemble approach, the experimental results show that the performance of malware detection systems can be improved by learning numerous classifiers and combining their outputs.
In summary, this paper makes the following contributions: • We propose updated malware features that minimize drawbacks such as variable length, high dimensionality, and high storage use. The proposed malware features show overall performance better than the original training features in terms of training time and accuracy.
• By applying expertise analysis knowledge, frequently used functions, and entropy discretization to the studied malware features, the variable lengths of malware features can turn into fixed lengths. Such features do not require selecting a fixed length and padding feature vectors.
• We perform extensive experiments with tree-based ensemble algorithms. Experimental results show that the tree-based ensemble model is effective and efficient at detecting malware in terms of training time and overall performance.
The paper is organized as follows: Section II presents related work on malware detection methods in terms of datasets, feature vectors, machine learning algorithms, and performance. Section III discusses the gram matrix, entropy feature, and API and DLL features that characterize malware. Section IV presents the experimental results and Section V concludes our work.

II. RELATED WORKS
The evaluation of malware learning models uses known datasets or self-collected datasets. Training datasets are collected from Web databases and composed of binary classification or malware family classification. Table 1 summarizes the malware datasets in terms of class size, malicious and benign size, adopted feature, learning model, and data source.
Various n-gram features of opcode have been selected to represent malware and benign files [8], [13], [23]. Learning models were tested on TF and TF-IDF (Term Frequency - Inverse Document Frequency [35]) features of 2-gram opcode [23]. Random forest achieved the highest accuracy of 0.95 for the TF feature, and the n-gram feature was optimal with n = 2. The 2-gram and weighted term frequency features were also tested for learning models [13]; an SVM with a polynomial kernel showed the best accuracy of 0.96. A major block of opcodes was chosen to avoid the high-dimensional feature that results from including all opcodes, which lowered feature extraction time [8]. The selected opcodes were transformed into a square image and tested with a CNN (Convolutional Neural Network [36]). Since the length of the selected opcode sequence differs across malware, a hash function was applied to the opcodes and the hash values were transformed into a square image. A detection accuracy of 0.87 or more was reported for all classes.
The ANN (All-Nearest-Neighbor) algorithm was introduced to detect malware based on sequential patterns from PE headers [26]; the results showed a detection rate of 96.0%. In [11], each training feature was vectorized from PE sections with byte entropy, histogram, imported functions, and metadata. The dimension of each feature was 256 and the neural network structure was 1024 × 1024 × 1024 × 1. The results ranged from 67.0 to 95.0% and the highest rate was 95.2%. The histogram similarity analysis [37] was adapted to detect malware families by comparing the peak points of the byte entropy graph [12]; the malware family detection rate was about 98.0% with threshold 0.75. When the static feature was gathered from bytecode features, disassembly code, and PE features, random forest performed best with an F1-score of 93.56% [28]. XGBoost learned multiple byte-based features from registry data, keyword frequency, n-grams, entropy, and byte images [30]; the accuracy was 0.98 for the entropy feature and 0.99 for the keyword frequency feature.
The DNA sequence matching method was applied to extract the LCS (Longest Common Subsequence) from API call sequences [15]. The LCS subsequences were collected from malware only, excluding patterns that also appeared in benign files; they reported nearly 99.9% generalization performance. Cuckoo Sandbox was used to capture API call sequences, DLL call sequences, and the presence or absence of string information [19]. IGR (Information Gain Ratio) controlled the feature dimension during the feature fusion process. Random forest achieved an AUC of 0.996 for malware detection and 0.978 for malware family classification. Under Cuckoo Sandbox, malware features were represented with resource-related data such as file access logs, registry key accesses, process executions, packet logs, CPU, and memory [17]. A self-organizing feature map (SOFM) was trained with malware features and used to predict the cluster of test data [29]. The detection accuracy of about 0.9 was best when the feature map size was 80 × 80.
Malicious behavior was encoded with resource data such as files, mutexes, registry keys, network traffic, and error messages [34]. Random forest reported a TPR (True Positive Rate) of 95.0%. The authors claimed three limitations of dynamic malware analysis. First, in the absence of observed malware behavior, dynamic analysis cannot characterize malicious behavior. Second, more and more malware has evolved anti-sandbox functionality. Lastly, due to high FPR (False Positive Rate), malware can be installed easily in the system directory without user interaction. To overcome these limitations, FPR should be lowered by preparing training data that contains malicious behavior.
MIST (malware instruction set) encoded malware behavior as a sequence of instructions captured under CWSandbox [32]. Their fusion learning model grouped training data into similar behavior and predicted malware into behavioral classes by the nearest-neighbor algorithm [16]. The results showed that F-measure was 0.94 to 0.99 when the MIST level was 1 or 2.
The malware feature of this study uses static analysis and proposes the modification of opcode, API, DLL, and entropy features. The modification method utilizes dimension reduction to represent a fixed-length malware feature, which is essential for applying machine learning algorithms. Various tree-based ensemble algorithms are chosen to model malware detection systems and their results are analyzed for practicability.

III. MALWARE DETECTION MODEL
The malware detection system consists of three steps, as in Figure 1: feature engineering, model learning, and model evaluation. The feature engineering step prepares training sets based on byte, opcode, and API data from the collected dataset; Radare2 is used to generate disassembly code. For the model learning step, tree-based ensemble models are used as our malware detection models. The model evaluation step chooses the best model through cross-validation and a confusion matrix.

A. DATA COLLECTION
Malware files were collected from Malwares. The 18 malware families are listed in Table 2; Adware is the most frequent family, and CoinMiner is the rarest. Kaspersky categorizes malware files according to malware behavior. The collected dataset is composed of 122,963 files, including 20,000 benign files. Malicious files are labeled with the positive class (+1) and benign files with the negative class (−1) for a binary classification problem.

B. FEATURE ENGINEERING
It is necessary to map training samples to feature vectors for the ensemble algorithm. A vector representation is obtained by defining a numerical measurement method that can replace samples. We focus on static features, mainly from both executable and disassembly code.

1) GRAM MATRIX
Malware M is represented as a sequence of opcodes: M = <s_1, s_2, ..., s_N> with s_i ∈ S, where N is the total length and S is the set of opcodes. An n-gram is a contiguous subsequence of n items. By sliding a window of length n over M, the n-gram at the i-th index is the subsequence (s_i, s_{i+1}, ..., s_{i+n-1}). The n-gram maps M to a high-dimensional vector space, where each dimension corresponds to a single gram. The function φ returns the n-gram feature for M: φ(M) = {(g, freq(g|M)) | g ∈ G}, where G is the set of unique n-grams and the function freq(g|M) returns the frequency, the probability, or a binary flag of g in M.
Gram features are widely used in natural language processing and DNA sequence analysis [35]. Although the n-gram feature provides an effective representation of malware, the exponential growth of the feature dimension incurs high time complexity [38]: the total number of unique byte n-grams is 2^(8n). Experimental analysis showed that the appropriate value of n was 2 for opcode features [13], [23]. A dimension reduction method for malware features was adopted by finding the frequent relevant opcodes in malware and benign files [13], [39]. Table 3 lists the selected opcodes along with their frequency rates. The number of selected opcodes is 32 (i.e., 2^5), so a 2-gram training feature is presented as a 32 × 32 matrix.
The gram matrix resolves the drawback that raw n-gram feature vectors have different lengths according to file size. It can also be stored as a sparse vector, which is beneficial for high-dimensional features [40]. Figure 2 shows an example of a gram matrix of a binary code.
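As a sketch of this construction, the following Python builds a 2-gram matrix over a fixed opcode vocabulary; the 4-opcode vocabulary and the opcode sequence are toy placeholders (the paper uses the 32 frequent opcodes of Table 3):

```python
from collections import Counter

def gram_matrix(opcodes, vocab):
    """Build a 2-gram matrix over a fixed opcode vocabulary.

    Opcodes outside the selected vocabulary are skipped, so the matrix
    stays |vocab| x |vocab| regardless of file length."""
    index = {op: i for i, op in enumerate(vocab)}
    n = len(vocab)
    # Count adjacent (opcode, next-opcode) pairs over the sequence.
    counts = Counter(
        (index[a], index[b])
        for a, b in zip(opcodes, opcodes[1:])
        if a in index and b in index
    )
    matrix = [[0] * n for _ in range(n)]
    for (i, j), freq in counts.items():
        matrix[i][j] = freq
    return matrix

# Toy example with a 4-opcode vocabulary.
vocab = ["mov", "push", "call", "ret"]
seq = ["mov", "push", "call", "mov", "push", "ret"]
m = gram_matrix(seq, vocab)
```

Because only vocabulary opcodes are indexed, the feature length is fixed by the vocabulary, not by the file, which is exactly the property the gram matrix provides.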

2) WEM (WINDOW ENTROPY MAP) FEATURE
Entropy has been used to detect malware by measuring the degree of uncertainty in binary files [8], [11], [30]. The maximum entropy appears when the probabilities of all symbols are the same; conversely, if a few bytes occur with high probability, the entropy is smaller. Considering a binary file as hexadecimal time-series data, the frequency rate of each byte is mapped to an entropy value measuring the degree of uncertainty. WEM generates a two-dimensional entropy feature from malware.
A byte entropy histogram is computed by sliding a byte window of size ω over the hexadecimal sequence with a step size of τ bytes. Let T be the total number of sliding windows in M. Then M is represented by M = <W_1, W_2, ..., W_T>, where W_k is the k-th window. The bin entropy h_{k,j} of the j-th byte value in W_k is computed by the Shannon entropy, h_{k,j} = -p_{k,j} log2 p_{k,j}, where p_{k,j} is the frequency rate of byte value j in W_k, giving H(M) = [h_{k,j}]_{T×256}. The WEM B = [b_{l,j}]_{L×256} is presented through level-wise variation of H(M), where L is a quantizing resolution on entropy values. For W_k, the level index l of b_{l,j} is based on the accumulation c_{k-1,j} of the j-th bin, where the entropy accumulation is C = [c_{k,j}]_{T×256} and c_{k,j} is the sum of the j-th bin entropy over W_1, W_2, ..., W_{k-1}. The bin entropy h_{k,j} is added to b_{l,j} after the index l is computed by quantizing c_{k,j} with a predefined step Δ. Algorithm 1 is pseudocode for building the WEM from H(M).
Algorithm 1 Build the WEM B From H(M)
1: function BUILDWEM(H(M), L, Δ)
2:   c_{k,j} = 0 for k = 1, ..., T and j = 1, ..., 256
3:   b_{l,j} = 0 for l = 1, ..., L and j = 1, ..., 256
4:   for k = 2, ..., T do                 ▷ Accumulation
5:     for j = 1, ..., 256 do
6:       c_{k,j} = c_{k-1,j} + h_{k-1,j}
7:     end for
8:   end for
9:   for k = 1, ..., T do                 ▷ Construct WEM
10:    for j = 1, ..., 256 do
11:      l = min(⌊c_{k,j}/Δ⌋ + 1, L)
12:      b_{l,j} = b_{l,j} + h_{k,j}
13:    end for
14:  end for
15:  return B
16: end function
When L = 1, WEM reduces to the byte entropy feature [11], in which b_{1,j}, j = 1, ..., 256, is the aggregation of the j-th bin entropy over M. Other studies use feature vectors derived from changes in entropy or from hash values generated over sliding windows of malware and benign files [12], [41]. WEM instead focuses on the bin entropy, which yields a fixed-size feature map with level-wise variation regardless of file size. In addition, as L increases, the quantization levels capture the scatter relation of the j-th bin among sliding windows.
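The window, accumulation, and quantization steps can be sketched in Python as follows; the window size, step size, and quantization step are illustrative values rather than the paper's settings, and the quantizer simply clamps to the top level:

```python
import math
from collections import Counter

def byte_entropy_bins(window):
    """Per-byte-value Shannon entropy contributions for one window."""
    total = len(window)
    bins = [0.0] * 256
    for value, count in Counter(window).items():
        p = count / total
        bins[value] = -p * math.log2(p)
    return bins

def wem(data, omega=16, tau=8, levels=2, delta=1.0):
    """Sketch of the WEM construction: slide a window over the bytes,
    accumulate each bin's entropy, quantize the accumulation into
    `levels` rows, and add the current bin entropy at that row."""
    windows = [data[k:k + omega]
               for k in range(0, max(len(data) - omega + 1, 1), tau)]
    acc = [0.0] * 256                       # running entropy accumulation
    B = [[0.0] * 256 for _ in range(levels)]
    for w in windows:
        h = byte_entropy_bins(w)
        for j in range(256):
            l = min(int(acc[j] / delta), levels - 1)  # quantize, clamp
            B[l][j] += h[j]
            acc[j] += h[j]
    return B

feature = wem(bytes(range(64)) * 4)
```

The output shape is fixed at levels × 256 regardless of the input length, which is the property that lets WEM feed a fixed-size learner.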

3) API FEATURES
The analysis of API call sequences provides information on how malware code will run and what it will do. Malware and benign files can be classified by simple statistics on the frequencies of the called APIs. Static analysis is applied to extract API names from PE sections and to construct API sequences. The extracted APIs belong to Microsoft Windows DLLs (Dynamic-Link Libraries) or user-defined DLLs. API sequences follow the order in which API functions appear in the disassembly code.
From the collected database, the number of unique APIs is 147,200: 49,158 for malware and 121,023 for benign. The number of APIs occurring in both malware and benign is 18,056. The API feature of a single binary file must have 18,056 dimensions if all API functions are included.
The frequency of each API is computed to decide whether it is selected as a feature. An API is selected as a final feature if the entropy of its frequency rate is more than 6 × 10^-6; accordingly, the number of selected APIs is 122 for our database. Vyas et al. selected 92 APIs extracted from the PE sections of 1,100 malicious files (i.e., Virus, Backdoor, Worm, and Trojan) and 2,600 benign files [42]. The Infosec Institute reported 131 APIs that are commonly encountered in malware analysis [43]. Table 4 shows the list of APIs frequently used in malware. From the three studies, we choose 195 unique APIs as the final API feature. The selected APIs are analyzed with the API categories as in Table 5.
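One plausible reading of this frequency-entropy criterion can be sketched as below; both the interpretation of the threshold (scoring each API by -p·log2(p) of its share of all API occurrences) and the API names are assumptions for illustration:

```python
import math
from collections import Counter

def select_apis(api_calls_per_file, threshold=6e-6):
    """Keep APIs whose frequency-rate entropy term exceeds a threshold.

    p is the share of all observed API occurrences belonging to one API,
    and each API is scored by its entropy term -p * log2(p)."""
    counts = Counter()
    for calls in api_calls_per_file:
        counts.update(calls)
    total = sum(counts.values())
    selected = [api for api, c in counts.items()
                if -(c / total) * math.log2(c / total) > threshold]
    return sorted(selected)

# Toy corpus of per-file API call lists (hypothetical API names).
files = [["CreateFileA", "WriteFile", "CreateFileA"], ["CreateFileA"]]
apis = select_apis(files, threshold=0.3)
```

Raising the threshold prunes more APIs, which is how the feature dimension is driven down from 18,056 to 122 in the paper's setting.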
In the collected dataset of 122,963 files, the number of distinct DLLs is 4,244, with 1,276 observed in malware files and 3,415 in benign files. Thus, the number of DLLs imported by malware is smaller than that imported by benign files. The high dimension of API or DLL features is reduced by selecting the features that are frequently used in malware.
In the Microsoft Windows operating system, a dynamic-link library (DLL) calls functions by importing other DLL files. For example, malicious code utilizes functions in kernel32.dll that handle memory, processes, and threads. Windows executable files do not need to call functions in ntdll.dll directly because kernel32.dll imports ntdll.dll. Malicious code often includes DLL functions implemented by developers; however, such developer-implemented DLL files are excluded because only common APIs provided by system software are considered for detecting malware. Table 6 shows the top 20 DLLs with a frequency of more than 0.2%. Based on the selected DLLs whose frequency is above 0.3%, we transform the functions defined within those DLLs into feature vectors. For the collected database, the number of DLLs is 2,054 and the number of distinct DLL APIs is 3,842; therefore, the DLL API feature has 3,842 dimensions.

C. EVALUATION METHOD
Our malware detection system was evaluated with a confusion matrix as shown in Table 7. Accuracy and error rates are used to measure the performance of malware detection systems. However, they cannot give an overview of the range of performance with varying thresholds. When a single threshold divides test examples into malware or benign class, it is not obvious how the right threshold value should be chosen. Therefore, threshold-free measures, such as Receiver Operating Characteristic (ROC) and Precision Recall Curve (PRC) plots [44], are selected when comparing the generalization performance of malware detection system.
ROC plots reveal a tradeoff between specificity and sensitivity, while PRC plots present a tradeoff between precision and recall [44], [45]. Precision is the fraction of correctly predicted malware among all examples predicted as malware, while recall is the fraction of actual malware that is correctly detected. An integral score of the area under the ROC (AUC-ROC) represents the performance of a classification model. Similarly, AUC-PRC is a useful metric for comparing classifiers in terms of precision and recall as the decision threshold of the classifier is shifted.
A class imbalance appears in the collected dataset, whose ratio of benign to malware is about 1 to 5. For class imbalance problems, an estimate of the number of wrong predictions among the positively classified instances is of great importance [46]. AUC-ROC can be misleading in an imbalanced problem with respect to the reliability of malware detection analysis, owing to the interpretation of FPR [45]. From our point of view, PRC is more appropriate than ROC because PRC plots reflect the fraction of correctly classified examples among those predicted as malware. Therefore, PRC analysis is chosen as a direct and intuitive measure for our malware detection system.
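A threshold-free AUC-PRC score of this kind can be computed by sweeping the decision threshold over sorted classifier scores. The following minimal sketch uses the step-wise (average-precision style) summation; the scores and labels are toy values:

```python
def precision_recall_points(scores, labels):
    """Precision/recall pairs at every score threshold (labels: 1 = malware)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(labels)
    tp = fp = 0
    points = []
    for i in order:                # lower the threshold one example at a time
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((tp / (tp + fp), tp / total_pos))
    return points

def auc_prc(scores, labels):
    """Area under the PRC via step-wise summation over recall increments."""
    area, prev_recall = 0.0, 0.0
    for precision, recall in precision_recall_points(scores, labels):
        area += precision * (recall - prev_recall)
        prev_recall = recall
    return area

ap = auc_prc([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 1])
```

Because every point conditions on the predicted-positive set, this score directly reflects the "fraction correct among ones predicted as malware" property argued for above.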

IV. EXPERIMENTS
The feature extraction methods and tree-based ensemble models are implemented with Python and operate as pipelined modules. Tree-based ensemble models are more flexible, and less data-sensitive, than a single tree model [47]. The chosen ensemble algorithms are random forest [48], AdaBoost [49], XGBoost [50], extra trees [51], and rotation forest [52]. The decision tree algorithm is chosen as a base classifier.
We consider the feature extraction an embarrassingly parallel problem in which each feature type can be processed independently for every file. Our malware detection system embeds the tree-based ensemble models of the scikit-learn framework [53] and the modules of the mentioned training features. The learned models are analyzed with performance measures such as accuracy, precision and recall, and AUC-PRC.

A. DATA PREPARATION
A parallel processing platform based on the Hadoop Distributed File System (HDFS) consists of 10 computers. These computers are low cost and use less memory than the latest high-end personal computers. All the nodes run Linux Mint 18.2 Sonya. The master node also works as a slave node. The tested feature vectors are as follows:
1) 2-gram is a sequence vector in hexadecimal.
2) 2-gramM is a gram matrix made of frequent opcodes from disassembly files.
3) API-DLL is prepared from frequent Windows APIs within imported DLL files.
4) API is made up of frequent Windows APIs in disassembled code.
5) WEM is a two-dimensional matrix of bin entropy values from a binary file (L = 2).
Figure 4 compares the number of distinct APIs against the number of files. The number of APIs (ALL-API) increases rapidly up to 20,000 files, where it reaches about 14,000; as the number of files increases further, the number of APIs grows slowly. From the collected dataset, the dimension of ALL-API is approximately 18,056. This high dimensionality is reduced by the frequency analysis, so the API-DLL dimension is 3,842 and the API dimension is 195. Once the number of files exceeds 100,000, the change in feature dimension for API and API-DLL is much smaller than for ALL-API. This result provides insight into how many malware and benign files should be collected for a malware detection system. Figure 5 compares the processing time with the number of workers on a log scale for the y-axis. Processing time drops quickly up to 4 workers, but changes little thereafter. The processing time of WEM, which requires many entropy calculations, is the longest, and the preparation of 2-gramM, API-DLL, and API is about 4.4 times faster. We observed that the processing time did not improve with more than 6 working nodes because communication overhead in the cluster introduces processing latency. Table 8 compares feature extraction time as the number of cluster nodes increases. The 2-gram feature has 61,952 dimensions, but its vectors are highly sparse. The proposed feature analysis decreases the dimensionality of 2-gramM, API-DLL, and API, but not of WEM. While both WEM and API have about 500 or fewer dimensions, API-DLL has about 3,800 dimensions.
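Because each file's features can be computed independently, extraction parallelizes trivially. A minimal local sketch with a worker pool follows; the pool is a stand-in for the HDFS cluster, and the byte-histogram extractor is a placeholder for the real 2-gramM/WEM/API extractors:

```python
from concurrent.futures import ThreadPoolExecutor

def extract_features(blob):
    """Placeholder per-file extractor (a 256-bin byte histogram);
    in the paper each worker would compute 2-gramM, WEM, API, etc."""
    hist = [0] * 256
    for b in blob:
        hist[b] += 1
    return hist

def extract_all(blobs, workers=4):
    # Each file is independent, so a plain parallel map suffices:
    # no shared state, no ordering constraints between files.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_features, blobs))

features = extract_all([bytes([1, 1, 2]), bytes([255])])
```

In production one would use process workers or cluster nodes instead of threads for CPU-bound extractors such as WEM; the map structure is the same.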

B. PREDICTION ANALYSIS BY PROXIMITY MEASURE
After learning an ensemble model on the training dataset, we compare the similarity or dissimilarity of instances by calculating the proximity between them [54], [55]. The proximity of two instances is the fraction of trees in the ensemble in which the two instances reach the same terminal node. As the proximity value approaches 1, the two instances are predicted at the same terminal nodes and the trained model classifies them similarly; if it is close to zero, the terminal nodes reached by the two instances are mostly different.
Proximity measurements depend on the depth and the number of trees in the ensemble. Proximity is used to analyze how instances are handled by tree-based ensembles, even when the instances are high-dimensional [55]. The proximity matrix can be visualized by projecting each instance into a d-dimensional space in which the proximity value between any pair of instances acts as their distance. The proximities of all training instances constitute a 2-dimensional matrix, and Multidimensional Scaling (MDS) visualizes the dataset by orthogonally projecting the proximity matrix onto the two eigenvectors with the highest eigenvalues [56]. Figure 7 shows an example of a proximity plot for the WEM feature. The plot was generated using the MDS implementation of scikit-learn. The dataset consists of 2,000 instances per class, and the proximity values are projected onto a 2-dimensional space after training a random forest model. Benign instances form clusters clearly concentrated in the center region, while malware instances surround the benign clusters, with some instances intermingled. Classifying benign instances in the intermingled regions requires decision trees of high depth, and higher tree depths are more likely to result in overfitting. Therefore, the ensemble learning model is more suitable than a single decision tree for this malware detection problem.
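The proximity measure described above can be sketched directly from per-tree leaf assignments (e.g., the rows returned by scikit-learn's `RandomForestClassifier.apply`); the leaf indices below are hypothetical:

```python
def proximity(leaf_ids_a, leaf_ids_b):
    """Proximity of two instances: the fraction of trees in which both
    instances fall into the same terminal node.

    leaf_ids_* hold one leaf index per tree for one instance."""
    same = sum(1 for la, lb in zip(leaf_ids_a, leaf_ids_b) if la == lb)
    return same / len(leaf_ids_a)

# Hypothetical leaf assignments for two instances across 5 trees.
a = [3, 7, 1, 4, 9]
b = [3, 2, 1, 4, 8]
p = proximity(a, b)  # same leaf in trees 1, 3, and 4
```

For the visualization step, 1 − proximity can be passed to MDS as a precomputed dissimilarity matrix.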

C. PERFORMANCE COMPARISON
The experiments were conducted with 10 separate 5-fold cross-validations. For each validation, we randomly split the data into five equally sized sets; four sets were used for training and the remaining set was held out for testing. Each configuration was thus tested 50 times and the results were averaged.
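The 10 × 5-fold protocol can be sketched as follows, yielding the 50 train/test evaluations described above:

```python
import random

def five_fold_indices(n, seed):
    """One 5-fold split: shuffle indices, cut into five equal parts
    (any remainder when n is not divisible by 5 is dropped)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    fold = n // 5
    return [idx[i * fold:(i + 1) * fold] for i in range(5)]

def repeated_cv_splits(n, repeats=10):
    """10 independent 5-fold splits -> 50 train/test evaluations."""
    splits = []
    for r in range(repeats):
        folds = five_fold_indices(n, seed=r)
        for held_out in range(5):
            test = folds[held_out]
            train = [i for f, part in enumerate(folds)
                     if f != held_out for i in part]
            splits.append((train, test))
    return splits

splits = repeated_cv_splits(100)
```

Averaging a metric over all 50 splits reduces the variance introduced by any single random partition.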
The hyperparameters of the decision tree model were tuned on 40,000 pre-selected training examples (20,000 per class). In learning a decision tree, the node-splitting measure is the Shannon entropy, the maximum depth is 15, the minimum number of instances to split an internal node is 2, and the minimum number of instances at a terminal node is 2. The number of decision trees is 50 for learning the ensemble models. Table 9 compares the performance of the selected features in terms of dimension, accuracy, and training time. The ensemble models achieve higher training and testing accuracy than the decision tree. We rank the performance of the ensemble models for each feature type and calculate the average rank, which compares the efficiency and robustness across malware features. Table 9 and Figure 8 show the training and test performance comparison. In the comparison of training accuracy, AdaBoost is the best among the ensembles (0.985), while extra trees is the lowest (0.939). The training accuracy of XGBoost is 0.978, the second highest after AdaBoost, followed in order by rotation forest, random forest, decision tree, and extra trees.
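A sketch of instantiating the base tree and several of the ensembles with the stated hyperparameters in scikit-learn follows; XGBoost and rotation forest are omitted, AdaBoost uses scikit-learn's default base learner rather than the depth-15 tree, and the tiny labeled points are stand-ins for the real feature vectors:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              RandomForestClassifier)

# Base-tree settings from the text: entropy criterion, max depth 15,
# minimum 2 samples to split a node, minimum 2 samples at a leaf.
tree_params = dict(criterion="entropy", max_depth=15,
                   min_samples_split=2, min_samples_leaf=2)

models = {
    "decision_tree": DecisionTreeClassifier(**tree_params, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, **tree_params,
                                            random_state=0),
    "extra_trees": ExtraTreesClassifier(n_estimators=50, **tree_params,
                                        random_state=0),
    "adaboost": AdaBoostClassifier(n_estimators=50, random_state=0),
}

# Tiny linearly separable stand-in, labeled -1/+1 as in the paper.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
     [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]]
y = [-1, -1, -1, 1, 1, 1]
scores = {name: m.fit(X, y).score(X, y) for name, m in models.items()}
```

Swapping the feature matrix X for one of the prepared feature sets (2-gramM, WEM, API, etc.) turns this into the training step of the pipeline.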
In the test performance comparison, XGBoost and random forest show classification rates of more than 0.925. Except for rotation forest, the test performance of all ensemble models is over 0.92. The accuracy of rotation forest is 0.9, which is similar to that of the decision tree.
WEM is the best in the performance comparison among feature types, followed by API-DLL, 2-gram, 2-gramM, and API. WEM shows the best results in both training and test performance. API and API-DLL, which consist of functions frequently used in malware, are represented by lower-dimensional vectors than 2-gram, but do not show much difference in time or performance. The test performance of the ensemble models with WEM is more than 0.967; in particular, XGBoost is the highest with a classification rate of 0.97.
Our results suggest that XGBoost is superior to the other algorithms overall. AdaBoost performs best in the API feature experiment, and the decision tree and XGBoost are almost equal for API-DLL. In the performance analysis, the difference between the original training features and the reduced features is not significant. Both 2-gram and WEM are better than 2-gramM, API, and API-DLL. The fastest ensemble algorithm is extra trees, followed by random forest, XGBoost, AdaBoost, and rotation forest, in that order.
In the comparison of training time, rotation forest takes the longest among the ensembles due to the PCA (Principal Component Analysis) computation. The training time of the ensemble models is acceptable, but they require tens of times more training time than the decision tree. The PCA computation for the 2-gram feature with more than 60,000 dimensions was not feasible, so we selected 4,201 attributes through random forest feature selection. Therefore, the 2-gram training time (176.73 sec) of rotation forest is not directly comparable with the training time of the other ensembles. In other words, when training high-dimensional features with rotation forest, it is difficult to transform them by PCA. Table 10 and Figure 9 compare precision, recall, and AUC-PRC score to assess the effect of malware features versus ensemble models. The AUC-PRC scores differ significantly between the decision tree and the ensemble algorithms. The decision tree shows AUC-PRC scores higher than about 0.9 for the API-DLL and API features, but lower scores for WEM, 2-gram, and 2-gramM. The other ensembles achieve higher AUC-PRC scores than the decision tree for all feature types.
The ensemble algorithms in Figure 9 do not differ much according to the type of malware feature. In most ensemble algorithms, the AUC-PRC score of WEM is higher than that of the other features; therefore, WEM is more suitable for classification than the other malware feature types. The ensemble models on API-DLL and API have lower precision than recall, whereas on 2-gram, 2-gramM, and WEM the recall is lower than the precision. The decision tree has similar precision across all malware features. From these results, applying an ensemble algorithm to a malware detection system requires a precision-recall tradeoff analysis.

D. COMPARISON WITH OTHER WORKS
It is difficult to compare our results with previous studies because there is not enough information about the used datasets, features, and machine learning algorithms. Also, most experimental results are not comparable because PRC-based analysis is not presented, and some studies concern deep learning approaches or ensemble applications based on both static and dynamic features.
The proposed API and WEM features are compared with previous dimension reduction studies and with the entropy representation used for malware detection. The training data consist of 40,000 instances, with 20,000 instances selected from each class. The comparison algorithms are SVM, a deep neural network (DNN), and random forest, as used in the related work.
The API reduction method is compared with those of Rushabh et al. [42] and Infosec [43], which are referred to in Table 4. The SVM used an RBF kernel (γ = 5, C = 10.0), and the random forest used the parameters adopted in Subsection IV-C. In Table 11, the proposed API feature shows high performance regardless of the learning algorithm. Random forest can learn hundreds of times faster than SVM while exhibiting high performance. However, it is difficult to compare the API dimension reduction methods for efficient malware detection, because the times of data collection and the dimensions of the feature vectors are very different.
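The comparison setup can be sketched with scikit-learn using the reported SVM hyperparameters (RBF kernel, γ = 5, C = 10.0) against a random forest. The tiny synthetic dataset is an assumption for illustration; the paper's experiments use 40,000 instances.

```python
# Sketch of the SVM-vs-random-forest comparison with the reported
# RBF-kernel hyperparameters. Synthetic data stands in for API vectors.
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((300, 64))            # stand-in for API feature vectors
y = rng.integers(0, 2, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

svm = SVC(kernel="rbf", gamma=5, C=10.0).fit(X_tr, y_tr)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print(svm.score(X_te, y_te), rf.score(X_te, y_te))
```

With large RBF γ values, the SVM fits tightly around individual training points, which is one source of the overfitting suspected later for the SVM results.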
The proposed WEM feature is compared with byte entropy [11], using SVM, DNN, and random forest as the learning algorithms. The DNN consists of 4 layers (256 × 256 × 128 × 2), including the input layer; the structure was modified from the architecture that tested the byte entropy feature. Dropout regularization was applied between layers 2 and 3 by dropping 10% of connections. The activation function of the hidden-layer units is PReLU (parametric rectified linear unit), and that of the output units is sigmoid. The learning algorithm is Adam with a learning rate of 0.001 and a total of 3,000 epochs. Table 12 presents the performance comparison between byte entropy and WEM. WEM has 512 dimensions, whereas byte entropy represents the characteristics of malicious code in 256 dimensions. The WEM feature is superior to the byte entropy feature under all learning algorithms. However, SVM and DNN require much longer training times than random forest. The SVM results are suspected of overfitting on both WEM and byte entropy, so additional experiments with various parameter settings are required. The training times of SVM and DNN are expected to grow in proportion to the amount of training data.
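The DNN described above can be sketched as follows. This is an untrained architectural sketch in PyTorch, an assumption about the framework; the 256-dimensional input matches the byte entropy feature, and for WEM the input layer would instead take 512 dimensions.

```python
# Sketch of the comparison DNN: 4 layers (256 x 256 x 128 x 2) including
# the input layer, PReLU hidden activations, 10% dropout between layers
# 2 and 3, sigmoid outputs, and an Adam optimizer with lr = 0.001.
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(256, 256), nn.PReLU(),   # layer 2 (first hidden layer)
    nn.Dropout(0.1),                   # drop 10% between layers 2 and 3
    nn.Linear(256, 128), nn.PReLU(),   # layer 3 (second hidden layer)
    nn.Linear(128, 2), nn.Sigmoid(),   # layer 4 (output units)
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# One forward pass on a dummy batch of byte entropy vectors.
out = model(torch.randn(4, 256))
print(out.shape)                       # torch.Size([4, 2])
```

Training for the stated 3,000 epochs would loop over the data calling `optimizer.step()` after each backward pass; only the architecture is shown here.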
The advantage of our approach is that the use of purely static features allows rapid analysis for a malware detection system in terms of feature reduction, fast computation, and generalization performance. Therefore, the proposed approach is well suited to EDR (Endpoint Detection & Response) systems.

V. CONCLUSION
This paper analyzed malware features for static analysis and compared tree-based ensemble algorithms. A modified malware feature representation was proposed to mitigate the drawbacks of commonly used malware feature representations, such as variable length, high-dimensional representation, and high storage usage. The proposed malware features yielded better generalization of the ensemble algorithms, in terms of training time and performance, than the original training features. The modified malware features take advantage of frequently used functions, expert knowledge, or entropy discretization. Therefore, they do not require the fixed-length selection or padding that arises when training data are prepared for feature vectorization in machine learning.
The experimental analysis indicates that the tree-based ensemble model is effective and efficient for malware classification in terms of training time and generalization performance. In addition, our approach can quickly analyze a large amount of malware owing to its low-dimensional features and fast learning. The low-dimensional feature representation using WEM, API, and API-DLL can be an alternative that guarantees high generalization performance when static analysis is applied for malware detection. Further studies are required on providing evidence for ensemble model predictions and on learning algorithms for combined malware features.

DOOSUNG HWANG received the Ph.D. degree from Wayne State University, USA. Previously, he was a Senior Researcher with the Electronics and Telecommunications Research Institute (ETRI), South Korea, where he worked on learning algorithm design and intelligent systems, such as expert systems, image recognition, speech recognition, and parallel computing. He is currently a Professor with the Department of Software Science, Dankook University, South Korea. His research interests include machine learning, high-performance computing, image processing, and inductive logic programming.