Application of Distance Metric Learning to Automated Malware Detection

Distance metric learning aims to find the most appropriate distance metric parameters to improve similarity-based models such as k-Nearest Neighbors or k-Means. In this paper, we apply distance metric learning to the problem of malware detection. We focus on two tasks: (1) to classify malware and benign files with a minimal error rate, (2) to detect as much malware as possible while maintaining a low false positive rate. We propose a malware detection system using Particle Swarm Optimization that finds the feature weights to optimize the similarity measure. We compare the performance of the approach with three state-of-the-art distance metric learning techniques. We find that metrics trained in this way lead to significant improvements in the k-Nearest Neighbors classification. We conducted and evaluated experiments with more than 150,000 Windows-based malware and benign samples. Features consisted of metadata contained in the headers of executable files in the portable executable file format. Our experimental results show that our malware detection system based on distance metric learning achieves a 1.09 % error rate at 0.74 % false positive rate (FPR) and outperforms all machine learning algorithms considered in the experiment. Considering the second task related to keeping minimal FPR, we achieved a 1.15 % error rate at only 0.13 % FPR.


I. INTRODUCTION
The term malware, or malicious software, is defined as any software that does something that causes damage to the user. Malware includes viruses, worms, trojan horses, rootkits, spyware, and any other program that exhibits malicious behavior [1]. In information security, malware attacks have been one of the main threats over the past several decades. While malware developers continuously find new exploitable vulnerabilities, create more and more sophisticated techniques to avoid detection and find new infection vectors, malware analysts and researchers continually improve their defenses. This game seems to have an infinite number of rounds.
The attacker's purpose is no longer merely to cause damage, such as disabling a computer system without any financial gain. Nowadays, malware has become a rather profitable business. Malware writers use a variety of techniques to distribute malicious programs and infect devices. They can use self-propagation mechanisms based on various vulnerabilities or use social engineering to trick the user into installing the malware. Malware writers usually employ obfuscation techniques [2] such as encryption, binary packers, or self-modifying code to evade malware classifiers. Many malware researchers have focused on data mining and machine learning (ML) algorithms to defeat these techniques and to detect unknown malware [3]. The performance of many ML algorithms, such as k-Nearest Neighbors (KNN) or k-Means, depends on the distance metric used to measure dissimilarity between samples over some input space. The distance between two samples having the same class label must be minimized while the distance between two samples of different classes must be maximized.
Distance metric learning (DML) aims to automatically learn distance metric parameters from data to improve the performance of classification and clustering algorithms. Finding the most appropriate parameters of the metric with respect to some optimization criterion is typically formulated as an optimization problem. Evolutionary algorithms, swarm algorithms, and other heuristics [4] are suitable for solving this problem. In this work, we used a biologically-motivated algorithm, Particle Swarm Optimization (PSO), to solve this problem in the context of malware detection. Most distance metric learning methods learn a Mahalanobis distance with respect to some objective function. The definition of this objective function depends on the training dataset and specific tasks, such as classification or clustering. Malware detection can be defined as a classification problem with two classes: malware and benign samples. The more challenging problem is to cluster malware into malware families [5]. This work empirically demonstrates how to apply distance metric learning to malware detection using a KNN classifier. We experimented with different distance metric learning methods and evaluated them with respect to various optimization criteria, such as error rate or its modification.
The associate editor coordinating the review of this manuscript and approving it for publication was Hao Ji.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In this work, we consider portable executables (PE) in the Windows environment. Features consist of metadata from the PE file format [6]. Our proposed detection model is based exclusively on static analysis, aiming to search for information about the file structure without running a program. While the static analysis can be evaded by anti-malware-detection techniques such as obfuscation, it still has a place in the malware detection system since it is much faster than dynamic analysis, which involves running the program.
Malware detection systems that use features originating only from the static analysis can be evaded by obfuscation techniques such as packing [7]. For this reason, we do not suggest using the proposed malware detection system as a standalone and independent application. In order to achieve the highest possible accuracy, it is necessary to use various types of features (byte sequences, API and system calls, opcodes, strings, entropy, instruction traces, ...) from both the static and the dynamic analysis. Our work can be used as one component of such a more complex malware detection system.
Note that minimizing the error rate is not the only goal of this work. Another goal is to detect as much malware as possible while maintaining a low false positive rate. From the antivirus vendors' perspective, a false positive error is considered a serious problem. For example, if some legitimate programs integrated into the operating system are detected as malware, then the system could be rendered unusable. False positives can also frustrate developers of the legitimate application that was accidentally blocked by an antimalware system. Since false positives can have serious consequences, we proposed an optimization criterion that takes the cost of false positives into account.
The main contributions of our work are:
• Architecture of a malware detection system: We propose the architecture of the malware detection model based on distance metric learning. The detection system processes the data from the PE file format where numeric features are normalized and nominal features are turned into conditional probabilities. Training samples are first used to train the distance metric and then they are used in a KNN classifier with the learned distance metric to classify the testing samples.
• Scalable optimization criterion for PSO-based model: To reflect the higher cost of a false positive, we constructed a cost function called weighted error rate which we use as a fitness function in the PSO algorithm to minimize error rate and false positive rate.
• Application of DML algorithms to malware detection: We explored the use of three state-of-the-art distance metric learning algorithms, namely Large Margin Nearest Neighbor, Neighborhood Component Analysis, and Metric Learning for Kernel Regression, for KNN classification of malware and legitimate software. We compare these models with the PSO-based model and provide practical information concerning performance, computational time and resource usage. We show that the DML-based methods might improve malware classification results even when standard methods such as feature selection or algorithm tuning had already been applied.
The rest of the paper is organized as follows: Section II reviews recent works on malware detection based on machine learning techniques. In Section III, we define the distance metric learning problem and give some theoretical background. Our proposed malware detection model is presented in Section IV. Section V provides an experimental setup. Detailed information about experiments and results is presented in Section VI. Conclusion and future work are given in Section VII.

II. RELATED WORK
The application of machine learning techniques to malware detection has been an active research area for about twenty years. Researchers have tried to apply various well-known techniques such as Neural Networks, Decision Trees, Support Vector Machines (SVM), ensemble methods and many other popular machine learning algorithms. Recent survey papers [8], [9] provide comprehensive information on malware detection techniques using machine learning algorithms.

A. RECENT WORKS
This section briefly reviews some recent works related to malware detection based on machine learning techniques. We mainly focus on works that use static analysis of Windows PE files and rely on features extracted from the PE file format.
Wadkar et al. [10] proposed a system based on the static features extracted from PE files for detecting evolutionary modifications within malware families. SVM models were trained over a sliding time window, and the differences in SVM weights were quantified using the $\chi^2$ statistic. For most of the 13 malware families considered in the experiments, the system detected significant changes.
Yang and Liu [13] proposed a detection model called TuningMalconv with two layers: a raw bytes model in the first layer and a gradient boosting classifier in the second layer. The feature set was based on static analysis and consisted of raw bytes, n-grams of byte codes, string patterns, and information in the PE header. The experimental results of the TuningMalconv detection model on the dataset with 41,065 samples showed an accuracy of 98.69 %.
Another malware detection model based on static analysis was proposed by Gao et al. [15]. The detection model is based on semi-supervised transfer learning and was deployed in the cloud as a SaaS (Software as a Service). The detection model was evaluated on Kaggle malware datasets and improved the classification accuracy from 94.72 % to 96.90 %.
Xue et al. [17] proposed a classification system, Malscore, which combines static and dynamic analysis. In static analysis, grayscale images were processed by the Convolutional Neural Network. In dynamic analysis, API call sequences were represented as n-grams and analyzed using five machine learning algorithms: Support Vector Machine, Random Forest, Adaboost, Naïve Bayes, and KNN. The authors performed experiments on more than 170,000 malware samples from 63 malware families and achieved an accuracy of 98.82 %.
Zhong and Gu [19] improved the performance of deep learning models by organizing them in a tree structure called Multiple-Level Deep Learning System. Each deep learning model focuses on a specific malware family. As a result, the Multiple-Level Deep Learning System can handle complex malware data distribution. Experimental results indicate that the proposed method outperforms the Support Vector Machine, Decision Tree, the single Deep Learning method and an ensemble-based approach.
All information on executables used in the work proposed by Raff et al. [23] came from the PE header, more specifically, from the MS-DOS, the COFF (Common Object File Format), and the Optional Header. Neural networks were trained from raw bytes which were not parsed for explicit features, and as a result, no preprocessing or feature engineering was required. More than 400,000 samples were used for training, and the Fully Connected Neural Network model achieved the highest accuracy.
Kumar et al. [26] proposed a malware detection system which uses machine learning techniques and is based exclusively on static analysis. The dataset contained 2,722 malware and 2,488 benign program samples, and the original feature set consisted of 53 PE file header fields from the DOS header, File header, and Optional header. These features were then processed, and 68 integrated features were derived.
In the experiments, six machine learning algorithms were used: Logistic Regression, Linear Discriminant Analysis, Random Forest, k-Nearest Neighbors, Decision Tree, and Gaussian Naïve Bayes. The highest classification accuracy, 98.4 %, was achieved by Random Forest.
Kolosnjaji et al. [25] proposed a neural network architecture that combines convolutional and feed-forward neural layers. The authors used only static malware analysis, where inputs to the feed-forward layers were the fields of the PE header while inputs to the convolutional layers were assembly opcode sequences. The proposed hybrid neural network achieved 93 % precision and recall. Table 1 summarizes related works and our work in terms of the number of classes, the size of the dataset, the type of analysis, features used, and the source of the dataset.

B. DISTANCE METRIC LEARNING-BASED WORKS
Surprisingly, there is a distinct lack of experimentation with distance metric learning techniques applied to large, real-world datasets from the Windows environment. In the rest of the section, we briefly mention two of our previous works on malware detection methods that rely on distance metric learning. This paper can be considered an extension of them. In [27], we applied the Particle Swarm Optimization algorithm to the problem of finding the appropriate feature weights used in the heterogeneous distance function [28] specifically defined for the PE file format to classify malware and benign files. We showed that the error rate of the KNN classifier could be decreased by 12.77 % using the weighted distance function. Our other work [5] focused on the application of three distance metric learning methods to the multiclass classification problem with seven classes: six prevalent malware families and the benign files. Using the Metric Learning for Kernel Regression method to learn the Mahalanobis distance metric, we achieved average precision and recall of 97.04 %, both using the KNN classifier.

III. PROBLEM STATEMENT AND BACKGROUND
This section provides basic information on distance metric learning and briefly discusses three selected distance metric learning methods used in our experiments.
Euclidean distance is by far the most commonly used distance. Let $x$ and $y$ be two feature vectors from the real $n$-dimensional space $\mathbb{R}^n$, and let $w_i$, $i = 1, \ldots, n$, be a non-negative real number associated with the $i$-th feature. The weighted Euclidean distance is defined as:
$$d_w(x, y) = \sqrt{\sum_{i=1}^{n} w_i (x_i - y_i)^2} \qquad (1)$$
The goal of learning the weighted Euclidean distance is to find the best weight vector $w = (w_1, \ldots, w_n)$ with respect to some optimization criterion, usually the minimal error rate. Several other distance functions have been presented [29]. In order to improve results, many weighting schemes were proposed. A review of feature weighting methods for lazy learning algorithms was proposed in [30].
Mahalanobis distance is another popular distance. It is defined for two vectors $x, y \in \mathbb{R}^n$ as
$$d_M(x, y) = \sqrt{(x - y)^\top M (x - y)} \qquad (2)$$
where $M$ is a positive semidefinite matrix. Mahalanobis distance can be considered as a generalization of Euclidean distance since the Euclidean distance can be expressed as a Mahalanobis distance where $M$ is the identity matrix. If $M$ is diagonal, then this corresponds to learning the feature weights $M_{ii} = w_i$ defined for weighted Euclidean distance in Eq. (1). The goal of learning the Mahalanobis distance is to find the best matrix $M$ with respect to some optimization criterion. Regarding the KNN classifier employed in this work, the main goal is to find a matrix $M$ which is estimated from the training set that leads to the lowest error rate of the KNN classifier. Another goal of this work is to minimize the error rate taking into account the cost of false positives. Since a positive semidefinite matrix $M$ can always be decomposed as $M = L^\top L$, the distance metric learning problem can be viewed as finding either $M$ or $L = M^{1/2}$. Therefore, Mahalanobis distance defined in Eq. (2) can be expressed in terms of the matrix $L$ as
$$d_M(x, y) = \sqrt{(x - y)^\top L^\top L (x - y)} = \|L(x - y)\|_2 \qquad (3)$$
Another application of distance metric learning is dimensionality reduction. The matrix $L$ can be used to project the original feature space into a new embedding feature space. This projection is a linear transformation defined for feature vector $x$ as
$$x' = Lx \qquad (4)$$
Mahalanobis distance of two points $x, y$ from the original space defined in Eq. (2) corresponds to the Euclidean distance between transformed points $x' = Lx$, $y' = Ly$ defined as follows:
$$d_M(x, y) = \|L(x - y)\|_2 = \|x' - y'\|_2 = \sqrt{(x' - y')^\top (x' - y')} \qquad (5)$$
This transformation is useful since the computation of Euclidean distance has lower time complexity than that of the Mahalanobis distance.
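The relationship between the Mahalanobis distance and the Euclidean distance of the projected points can be checked numerically. The following minimal sketch uses only the Python standard library; the matrix `L`, the vectors `x` and `y`, and the weights `w` are toy values chosen for illustration, and the helper names are hypothetical. It also assumes the weighted Euclidean distance places the weights $w_i$ directly under the square root, so that a diagonal $M$ with $M_{ii} = w_i$ reproduces it.

```python
import math

def mat_vec(L, x):
    """Multiply a matrix L (list of rows) by a vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in L]

def gram(L):
    """Return M = L^T L for a matrix L given as a list of rows."""
    n = len(L[0])
    return [[sum(row[a] * row[b] for row in L) for b in range(n)]
            for a in range(n)]

def mahalanobis(x, y, M):
    """d_M(x, y) = sqrt((x - y)^T M (x - y))."""
    d = [a - b for a, b in zip(x, y)]
    return math.sqrt(sum(di * mi for di, mi in zip(d, mat_vec(M, d))))

def weighted_euclidean(x, y, w):
    """d_w(x, y) = sqrt(sum_i w_i (x_i - y_i)^2), with w_i >= 0."""
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, y)))

# Toy values, chosen only for illustration.
L = [[2.0, 0.0, 1.0],
     [0.0, 1.0, 0.5]]            # maps R^3 to R^2
M = gram(L)
x, y = [1.0, 2.0, 3.0], [0.0, 4.0, 1.0]

# Distance in the original space versus distance after the projection.
d_original = mahalanobis(x, y, M)
x_emb, y_emb = mat_vec(L, x), mat_vec(L, y)
d_embedded = weighted_euclidean(x_emb, y_emb, [1.0, 1.0])  # plain Euclidean

# A diagonal M reduces to the weighted Euclidean distance.
w = [0.5, 2.0, 0.0]
M_diag = [[w[i] if i == j else 0.0 for j in range(3)] for i in range(3)]
d_diag = mahalanobis(x, y, M_diag)
d_weighted = weighted_euclidean(x, y, w)
```

Because `L` has fewer rows than columns, the same projection also reduces the dimension of the samples from three to two.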
Distance metric learning has attracted a lot of attention in the machine learning field and is still an active research area [31]. There have been many proposed methods [32]-[34]. Next, we briefly describe three state-of-the-art distance metric learning methods that we used in our experiments. Specifically, the weighted Euclidean distance was learned by the Particle Swarm Optimization algorithm, and the Mahalanobis distance was learned by the distance metric learning methods described in the rest of this section.

A. LARGE MARGIN NEAREST NEIGHBOR
Large Margin Nearest Neighbor (LMNN) [35] is one of the state-of-the-art distance metric learning algorithms used to learn a Mahalanobis distance metric for KNN classification. LMNN consists of two steps. In the first step, for each instance $x$, a set of $k$ nearest instances belonging to the same class as $x$ (referred to as target neighbors) is identified. In the second step, we adapt the Mahalanobis distance so that the target neighbors are closer to $x$ than instances from different classes, separated by a large margin. The Mahalanobis distance metric is estimated by solving the semidefinite programming problem defined as:
$$\min_{M \succeq 0} \; (1 - \mu) \sum_{i,\, j \rightsquigarrow i} d_M(x_i, x_j)^2 + \mu \sum_{i,\, j \rightsquigarrow i} \; \sum_{k:\, y_k \neq y_i} \left[ 1 + d_M(x_i, x_j)^2 - d_M(x_i, x_k)^2 \right]_+ \qquad (6)$$
The notation $j \rightsquigarrow i$ indicates that the sample $x_j$ is a target neighbor of the sample $x_i$, and $y_i$ denotes the class of $x_i$. The parameter $\mu$ defines a trade-off between the two objectives: 1) to minimize the distances between samples $x_i$ and their target neighbors $x_j$, 2) to maximize the distances between samples $x_i$ and their impostors $x_k$, which are samples that belong among the nearest neighbors of $x_i$ but have different class labels (i.e. $y_i \neq y_k$). Finally, $[x]_+$ is defined as the hinge loss, i.e. $[x]_+ = \max\{0, x\}$. In [36], LMNN was extended to multiple local metrics and the learning time of LMNN was reduced using metric ball trees.
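To make the two competing terms of LMNN concrete, the sketch below evaluates a pull term (squared distances to target neighbors) and a push term (hinge loss over differently labeled samples) for a fixed matrix $M$. It is an illustration only, not a solver: it assumes the $(1 - \mu)$ and $\mu$ weighting of the two terms, selects target neighbors by plain Euclidean distance, and uses a hypothetical four-point dataset. All function names are invented for this example.

```python
def sq_dist(x, y, M):
    """Squared Mahalanobis distance (x - y)^T M (x - y)."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return sum(d[a] * M[a][b] * d[b] for a in range(n) for b in range(n))

def identity(n):
    """n x n identity matrix as a list of rows."""
    return [[1.0 if a == b else 0.0 for b in range(n)] for a in range(n)]

def target_neighbors(X, y, k):
    """Indices of the k nearest same-class samples of each x_i (Euclidean)."""
    eye = identity(len(X[0]))
    targets = []
    for i in range(len(X)):
        same = [j for j in range(len(X)) if j != i and y[j] == y[i]]
        same.sort(key=lambda j: sq_dist(X[i], X[j], eye))
        targets.append(same[:k])
    return targets

def lmnn_loss(X, y, M, k=1, mu=0.5):
    """(1 - mu) * pull + mu * push for a fixed metric M."""
    pull, push = 0.0, 0.0
    for i, neighbors in enumerate(target_neighbors(X, y, k)):
        for j in neighbors:
            d_ij = sq_dist(X[i], X[j], M)
            pull += d_ij                      # keep target neighbors close
            for l in range(len(X)):
                if y[l] != y[i]:              # differently labeled sample
                    push += max(0.0, 1.0 + d_ij - sq_dist(X[i], X[l], M))
    return (1.0 - mu) * pull + mu * push

# Hypothetical toy data: the first coordinate separates the two classes.
X = [[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]]
y = [0, 0, 1, 1]

loss_identity = lmnn_loss(X, y, identity(2))
# A metric that shrinks the separating coordinate lets impostors invade the margin.
loss_shrunk = lmnn_loss(X, y, [[0.1, 0.0], [0.0, 1.0]])
```

On this toy data the identity metric incurs no push penalty, whereas the metric that shrinks the class-separating coordinate does, so its loss is higher.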

B. NEIGHBORHOOD COMPONENT ANALYSIS
Goldberger et al. [37] proposed the Neighborhood Component Analysis (NCA), which is a distance metric learning algorithm specially designed to improve the KNN classification.
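Let $p_{ij}$ be the probability that the sample $x_i$ selects the sample $x_j$ as its neighbor. This probability is defined as:
$$p_{ij} = \frac{\exp\left(-\|Lx_i - Lx_j\|_2^2\right)}{\sum_{l \neq i} \exp\left(-\|Lx_i - Lx_l\|_2^2\right)}, \qquad p_{ii} = 0 \qquad (7)$$
The probability that $x_i$ is correctly classified is $p_i = \sum_{j:\, j \neq i,\, y_j = y_i} p_{ij}$, i.e. the probability that $x_i$ selects a neighbor belonging to the same class as $x_i$. The goal of NCA is to find the matrix $L$ that maximizes the sum of probabilities $p_i$:
$$\max_{L} \sum_{i} p_i \qquad (8)$$
This optimization problem is solved by gradient ascent. Neither the LMNN nor the NCA algorithm makes any assumptions on the class distributions.

C. METRIC LEARNING FOR KERNEL REGRESSION
Metric Learning for Kernel Regression (MLKR) [38] is a distance metric learning algorithm which aims at training the Mahalanobis matrix by minimizing the error loss over the training samples:
$$\mathcal{L} = \sum_{i} (y_i - \hat{y}_i)^2 \qquad (9)$$
where the predicted class $\hat{y}_i$ is derived from kernel regression by calculating a weighted average of the training samples:
$$\hat{y}_i = \frac{\sum_{j \neq i} y_j K(x_i, x_j)}{\sum_{j \neq i} K(x_i, x_j)} \qquad (10)$$
MLKR can be applied to many types of kernel functions $K(x_i, x_j)$ and distance metrics $d(x_i, x_j)$.
Recall that the mentioned distance metric learning algorithms can be used as supervised dimensionality reduction algorithms. Considering the matrix $L \in \mathbb{R}^{d \times n}$ with $d < n$, the dimension of the transformed sample $x' = Lx$ is reduced to $d$.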
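The objectives of NCA and MLKR can be evaluated for a given transformation $L$ with a few lines of code. The sketch below is an illustration under stated assumptions, not the implementation used in the experiments: it assumes the softmax form of the NCA neighbor probabilities, a Gaussian kernel $K(x_i, x_j) = \exp(-\|Lx_i - Lx_j\|^2)$ for MLKR, numeric class labels 0 and 1, and a hypothetical four-point dataset. No optimization of $L$ is performed; two candidate matrices are simply compared.

```python
import math

def transform(L, x):
    """Linear projection x' = L x for a matrix L given as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in L]

def kernel_weights(Z, i):
    """Unnormalised weights exp(-||z_i - z_j||^2), with weight 0 for j = i."""
    return [0.0 if j == i else
            math.exp(-sum((a - b) ** 2 for a, b in zip(Z[i], Z[j])))
            for j in range(len(Z))]

def nca_objective(X, y, L):
    """Sum over i of p_i, the probability of picking a same-class neighbor."""
    Z = [transform(L, x) for x in X]
    total = 0.0
    for i in range(len(Z)):
        k = kernel_weights(Z, i)
        total += sum(k_ij for j, k_ij in enumerate(k) if y[j] == y[i]) / sum(k)
    return total

def mlkr_loss(X, y, L):
    """Sum of squared leave-one-out kernel regression errors."""
    Z = [transform(L, x) for x in X]
    loss = 0.0
    for i in range(len(Z)):
        k = kernel_weights(Z, i)
        y_hat = sum(k_ij * y[j] for j, k_ij in enumerate(k)) / sum(k)
        loss += (y[i] - y_hat) ** 2
    return loss

# Hypothetical toy data: the first coordinate separates the two classes.
X = [[0.0, 0.0], [0.0, 1.0], [1.5, 0.0], [1.5, 1.0]]
y = [0, 0, 1, 1]

L_identity = [[1.0, 0.0], [0.0, 1.0]]
L_stretched = [[2.0, 0.0], [0.0, 1.0]]   # emphasises the separating coordinate

nca_identity = nca_objective(X, y, L_identity)
nca_stretched = nca_objective(X, y, L_stretched)
mlkr_identity = mlkr_loss(X, y, L_identity)
mlkr_stretched = mlkr_loss(X, y, L_stretched)
```

Stretching the coordinate that separates the classes raises the NCA objective and lowers the MLKR loss, which is the direction a gradient-based learner would move in.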

IV. PROPOSED MODEL
In this section, we describe our proposed malware detection model based on distance metric learning. First, the features description and engineering are provided. Then we describe the modification of the Particle Swarm Optimization algorithm, which we used to find appropriate feature weights of weighted Euclidean distance. Finally, we complete this section by proposing the architecture of the malware detection model.

A. FEATURE DESCRIPTION
The features used in our experiments are extracted from the PE file format [6] which is the file format used for executables, DLLs, object code and other files used in the 32 and 64-bit versions of the Windows operating system. The PE file format is the most widely used file format for malware samples that run on desktop platforms. Before describing the features used in our experiments, let us first examine the short outline of the PE file format.
A PE file consists of headers and sections that encapsulate the information necessary to manage the executable code. The PE file header provides all the descriptive information concerning the locations and sizes of structures in the PE file to the loader process. The header of a PE file consists of the DOS header, the PE signature, the COFF file header, the optional header and the section headers. The optional file header is immediately followed by the section headers which provide information about sections, including their locations, sizes, and characteristics.
Sections divide the file content into code, resources and various types of data. The order of the sections is not the same for each PE file. Moreover, malware authors can change the names of the sections. Therefore, we prefer to consider only the order of sections rather than the names of the sections (such as .text, .data, .rsrc). The last section of a PE file may be of particular importance. It may contain useful information, especially for some types of malware, e.g., the file infector which typically attaches malicious code at the end of the file. To deal with the varying number of sections across the samples, we have decided to consider only the first four sections and the last section.
Based on our empirical studies and the PE format analysis, we selected a set of static features that help to distinguish malware and benign files. The features used in our experiments are of three types: nominal, numeric, and bit fields. In the following section, we describe how these three types of features were preprocessed.
Let $T = \{(x_1, c_1), \ldots, (x_m, c_m)\}$ be the training set, where $x_i$ is a feature vector and $c_i = cl(x_i)$ is the corresponding class label. In our binary classification task, we will consider two classes $\mathcal{C}$ and $\mathcal{M}$, where $\mathcal{C}$ denotes the class of benign samples and $\mathcal{M}$ denotes the class of malware. Let each sample be represented by the feature set $\{f_1, \ldots, f_n\}$. Let the feature $f_j$ be nominal and let $s$ be the feature vector corresponding to some unknown sample. Then $P(cl(s) = \mathcal{M} \mid f_j = h)$ denotes the conditional probability that the output class of $s$ is malware given that feature $f_j$ has the value $h$. Using data from the training set $T$, we estimate this probability as
$$P(cl(s) = \mathcal{M} \mid f_j = h) = \frac{n_{f_j = h,\, \mathcal{M}}}{n_{f_j = h}} \qquad (11)$$
where $n_{f_j = h,\, \mathcal{M}}$ is the number of malware samples in $T$ whose feature $f_j$ has the value $h$, and $n_{f_j = h}$ is the number of all samples in $T$ whose feature $f_j$ has the value $h$. Following this approach, for each sample $s$, we transform each value $h$ of each nominal attribute $f_j$ according to the following rule:
$$h := P(cl(s) = \mathcal{M} \mid f_j = h) \qquad (12)$$
Regarding numeric features, it is necessary to take into account their different ranges. Therefore, min-max normalization is employed on each numeric feature $f$ to rescale its original value $h$. For each feature vector $s$, we transform each value $h$ of each numeric feature $f$ according to the following rule:
$$h := \frac{h - f_{\min}}{f_{\max} - f_{\min}} \qquad (13)$$
where $f_{\min}$, resp. $f_{\max}$, is the minimal, resp. the maximal value among all known values of the feature $f$. To handle features that are bit arrays $(b_1, \ldots, b_k)$, we split up each component $b_i$ from the array and consider it as an independent feature. Finally, after preprocessing all three types of features, we apply several feature selection and extraction algorithms and select the most relevant features.
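The three preprocessing rules can be sketched as follows. This is a minimal illustration using only the standard library, with several assumptions that are not taken from the paper: the conditional probability is estimated as a relative frequency over the training set, values never seen in training fall back to a hypothetical default of 0.5, a constant numeric feature is mapped to 0.0, and bit fields are split starting from the least significant bit. The feature values in the example are invented.

```python
from collections import Counter

def fit_nominal(values, labels, malware_label="M"):
    """Estimate P(malware | feature = h) for every value h seen in training."""
    total = Counter(values)
    malicious = Counter(v for v, c in zip(values, labels) if c == malware_label)
    return {h: malicious[h] / total[h] for h in total}

def encode_nominal(h, probabilities, default=0.5):
    """Replace a nominal value by its estimated probability.

    Assumption: values unseen in training map to `default`.
    """
    return probabilities.get(h, default)

def scale_min_max(h, f_min, f_max):
    """Min-max normalisation; assumption: a constant feature maps to 0.0."""
    if f_max == f_min:
        return 0.0
    return (h - f_min) / (f_max - f_min)

def split_bit_field(value, width):
    """Split an integer bit field into `width` independent 0/1 features.

    Assumption: the least significant bit comes first.
    """
    return [(value >> i) & 1 for i in range(width)]

# Hypothetical training values of one nominal feature and their class labels.
train_values = ["gui", "gui", "console", "gui"]
train_labels = ["M", "C", "C", "M"]
probs = fit_nominal(train_values, train_labels)

encoded_seen = encode_nominal("gui", probs)
encoded_unseen = encode_nominal("driver", probs)

# Hypothetical numeric feature, with min and max taken from training data only.
train_sizes = [0, 4, 10]
scaled = scale_min_max(5, min(train_sizes), max(train_sizes))

flags = split_bit_field(0b0101, 4)
```

Fitting the probabilities and the min and max values on the training samples only, and then applying them to the testing samples, avoids leaking test information into the features.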

B. FINDING THE FEATURE WEIGHTS USING PARTICLE SWARM OPTIMIZATION
Particle Swarm Optimization (PSO) [39] is a biologically-motivated stochastic optimization algorithm based on swarm intelligence. Each particle is represented as a point in the search space, and a fitness function determines the quality of each point. Each particle updates its position, which is influenced by its current velocity, its own best previous position, and the position of the most successful particle in the swarm.
The concept and notation of the PSO elements for finding the feature weights used in the weighted Euclidean distance Eq. (1) in the KNN classification are as follows:
• A particle is represented as a vector of weights $w$. The current position of the $i$-th particle is denoted by $x_i$, and $v_i$ denotes its current velocity.
• A swarm or population is an array of all particles considered in the PSO algorithm.
• The local best position $p_i$ of the $i$-th particle is its best position among all positions visited so far, and $pbest_i$ is the corresponding value of the fitness function $f$, i.e. $pbest_i = f(p_i)$.
• The global best position $p_g$ is the position of the most successful particle in the swarm, and $gbest = f(p_g)$.
• The fitness function $f$ is an objective function used to measure the quality of a particle. In our malware detection problem, the optimization criterion can be defined as the error rate of the KNN classifier. In this work, we will also consider another optimization criterion focused on minimizing the false positive rate.
The PSO algorithm has three inputs: the fitness function $f$, a training set $T_{pso}$, and a vector $p$ of feature importance scores [40] obtained by the feature selection algorithm. The pseudocode of the modified PSO algorithm is presented in Algorithm 1.
Rand(0, ε) represents a vector of random numbers uniformly distributed in [0, ε], where ε is a small constant.

Algorithm 1 PSO Algorithm
Input: fitness function $f$, $T_{pso}$, $p$
Output: vector of weights
initialize particles using $p$ and Rand(0, ε)
repeat
    for each particle $x_i$ do
        compute fitness function $f(x_i)$
        if $f(x_i) < pbest_i$ then
            $pbest_i = f(x_i)$
            $p_i = x_i$
        end if
    end for
    select the most successful particle in swarm so far, and denote it by $p_g$
    for each particle $x_i$ do
        $v_i = \omega v_i + Rand(0, \varphi_1) \otimes (p_i - x_i) + Rand(0, \varphi_2) \otimes (p_g - x_i)$
        $x_i = x_i + v_i$
    end for
until maximum number of iterations is attained
return global best position

Operation $\otimes$ denotes a component-wise multiplication. Note that each particle can memorize its best previous position, and it also knows the best position of the whole swarm so far. Each component of velocity $v$ is kept in the range $[-V_{max}, V_{max}]$, where the parameter $V_{max}$ influences the search ability of the particles. An inertia weight $\omega$ is used to better control the search scope and reduce the importance of $V_{max}$. Higher values of $\omega$ tend to prefer the global search while lower values tend to prefer the local search. Parameters $\varphi_1$ and $\varphi_2$ represent the weights and are used to balance the global and the local search. The purpose of the initialization is to accelerate PSO: the search space is reduced by starting from the results of the feature selection algorithm.
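The structure of the PSO loop described above (inertia weight, component-wise random factors, velocity clamping, personal and global bests) can be sketched in a few lines. This is a hedged illustration rather than the authors' implementation: the swarm size, iteration count, parameter values, the initialization as `p` plus a small uniform perturbation, the clipping of weights to non-negative values, and the fitness being minimized are all assumptions. The quadratic `toy_fitness` is a placeholder; in the detection system the fitness would be an error measure of the KNN classifier on the training set.

```python
import random

def pso(fitness, p, n_particles=10, n_iter=30, omega=0.7,
        phi1=1.5, phi2=1.5, v_max=0.5, eps=0.1, seed=0):
    """Minimise `fitness` over non-negative weight vectors (n_iter >= 1)."""
    rng = random.Random(seed)
    dim = len(p)
    # Assumption: each particle starts at p plus a small uniform perturbation.
    xs = [[p[d] + rng.uniform(0.0, eps) for d in range(dim)]
          for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [x[:] for x in xs]
    best_val = [float("inf")] * n_particles
    g = 0
    for _ in range(n_iter):
        # Evaluate every particle and update its personal best.
        for i in range(n_particles):
            value = fitness(xs[i])
            if value < best_val[i]:
                best_val[i], best_pos[i] = value, xs[i][:]
        # Select the most successful particle found so far.
        g = min(range(n_particles), key=lambda i: best_val[i])
        # Update velocities and positions component by component.
        for i in range(n_particles):
            for d in range(dim):
                v = (omega * vs[i][d]
                     + rng.uniform(0.0, phi1) * (best_pos[i][d] - xs[i][d])
                     + rng.uniform(0.0, phi2) * (best_pos[g][d] - xs[i][d]))
                vs[i][d] = max(-v_max, min(v_max, v))      # clamp to [-v_max, v_max]
                xs[i][d] = max(0.0, xs[i][d] + vs[i][d])   # assumption: keep w_i >= 0
    return best_pos[g], best_val[g]

def toy_fitness(w):
    """Placeholder fitness: squared distance to a hypothetical ideal weight vector."""
    target = [0.2, 0.8, 0.0]
    return sum((a - b) ** 2 for a, b in zip(w, target))

p = [1.0, 1.0, 1.0]                      # hypothetical feature importance scores
weights, value = pso(toy_fitness, p)
_, value_after_one = pso(toy_fitness, p, n_iter=1)
```

Because the best value found can only improve from one iteration to the next, running more iterations with the same seed never yields a worse result than stopping after the first one.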
This work concerns the classification problem where the definition of the fitness function depends on the KNN classifier. The fitness function of the clustering problem can alternatively be defined using purity or silhouette coefficient.
PSO was chosen among other optimization heuristics because its convergence rate is fast and the algorithm is easy to implement and execute in parallel. The drawback of the algorithm is that it is prone to getting stuck in local minima.
In the rest of this section, we propose the optimization criterion for detecting as much malware as possible while keeping a low false positive rate. To account for the different costs of a false positive and a false negative, we adjust the loss function so that it penalizes false positives more heavily.
Since our dataset is well-balanced, we consider the error rate as the appropriate measure of performance. The error rate is defined as the percentage of incorrectly classified instances. We can rewrite the error rate in terms of the number of false positives (FP) and false negatives (FN) as
$$ERR = \frac{FP + FN}{|T_{test}|} \qquad (14)$$
where $|T_{test}|$ is the number of testing samples. We modify Eq. (14) by adding the parameter $c \geq 1$ which corresponds to the cost of a false positive. Then we define the optimization criterion called weighted error rate (WERR), which takes into account the cost of the false positive:
$$WERR = \frac{c \cdot FP + FN}{|T_{test}|} \qquad (15)$$
One interpretation of WERR is that if we changed the parameters of the classifier and achieved the same error rate (with possibly different $FP_{new}$ and $FN_{new}$ compared to the original values of $FP_{old}$ and $FN_{old}$), then these results would be better with respect to WERR if
$$c \cdot FP_{new} + FN_{new} < c \cdot FP_{old} + FN_{old} \qquad (16)$$
which, for equal error rates and $c > 1$, is equivalent to $FP_{new} < FP_{old}$. One aspect of the WERR criterion is that we can "exchange" one false positive for $c$ false negatives while keeping WERR unchanged. Note that when $c = 1$, WERR is equal to the error rate. In all experiments, we used the WERR criterion as a fitness function of the PSO algorithm. When not mentioned, the value of the parameter $c$ was set to one.
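A small numerical example shows how WERR separates two classifiers that have the same error rate. The sketch assumes that WERR multiplies the number of false positives by $c$ and normalizes by the number of testing samples, so that it coincides with the error rate for $c = 1$. The counts below are invented for illustration and are not results from the experiments.

```python
def error_rate(fp, fn, n_test):
    """Fraction of incorrectly classified test samples."""
    return (fp + fn) / n_test

def weighted_error_rate(fp, fn, n_test, c=1.0):
    """Error rate in which each false positive costs c times a false negative."""
    return (c * fp + fn) / n_test

# Two hypothetical classifiers evaluated on 1000 test samples.
n_test = 1000
old = {"fp": 10, "fn": 20}
new = {"fp": 5, "fn": 25}

err_old = error_rate(old["fp"], old["fn"], n_test)
err_new = error_rate(new["fp"], new["fn"], n_test)

werr_old = weighted_error_rate(old["fp"], old["fn"], n_test, c=5.0)
werr_new = weighted_error_rate(new["fp"], new["fn"], n_test, c=5.0)
```

Both classifiers make 30 errors, but with $c = 5$ the second one is preferred because it produces fewer false positives.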

C. ARCHITECTURE
We present the malware detection system based on distance metric learning. The system uses static analysis of PE file headers and sections. The proposed architecture is depicted in Figure 1 and outlined in the following seven basic steps:
Step 1: Splitting the data. The set of samples is randomly divided into the training set and the testing set. The training set is used for training a distance metric (see step 5) and a classifier (see step 6). The testing set is used for testing the classifier with the learned distance metric.
Step 2: Parsing binaries. For each sample from the training and the testing set, we extract and store information from the PE file format. We use the Python module pefile [41] for extracting the features. These features will be preprocessed in step 3. In step 4, only the relevant features will be selected and considered in experiments.
Step 3: Preprocessing of features. The conditional probability $P(x \text{ is malware} \mid x_i = h)$ is computed for each nominal feature $x_i$ and for each value $h$ of the feature $x_i$ that appears in the training set. Numeric features are normalized according to min-max normalization. Bit arrays are split up into single boolean features.
Step 4: Feature selection. The feature selection algorithm is used to determine the relevant features and produce the final version of the feature set.
Step 5: Learning the distance metric. The distance metric learning method is applied to the training feature vectors in order to produce the appropriate distance metric parameters. In the case of high computational complexity, only a subset of training vectors can be used to learn the distance metric.
Step 6: Classification. The distance metric learned in step 5 is used in the KNN classifier to classify samples from the testing set.
Step 7: Evaluation. Performance metrics, such as true positive rate, false positive rate and error rate, are used to measure the classification results.
The computation of the conditional probabilities for nominal features and the execution of the feature selection algorithm for all three types of features is only performed on the training samples. The corresponding conditional probabilities and selected features are applied to design both the training and testing feature vectors.
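The classification step can be illustrated with a minimal KNN classifier that uses a weighted Euclidean distance. The sketch assumes preprocessed numeric feature vectors, a learned weight vector supplied by the caller, and majority voting among the `k` nearest training samples. The four training vectors, the query, and both weight vectors are hypothetical and serve only to show that the weights can change which neighbors are selected.

```python
import math
from collections import Counter

def weighted_distance(x, y, w):
    """Weighted Euclidean distance with non-negative feature weights w."""
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, y)))

def knn_predict(train_X, train_y, x, w, k=3):
    """Majority vote among the k training samples closest to x."""
    order = sorted(range(len(train_X)),
                   key=lambda i: weighted_distance(train_X[i], x, w))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical preprocessed training vectors: only the second feature
# carries class information, the first one acts as noise.
train_X = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.2], [0.8, 0.1]]
train_y = ["M", "M", "C", "C"]
query = [0.85, 0.6]

# Uniform weights: the noisy first feature dominates the neighborhood.
pred_uniform = knn_predict(train_X, train_y, query, w=[1.0, 1.0], k=3)
# Weights that suppress the noisy feature change the nearest neighbors.
pred_weighted = knn_predict(train_X, train_y, query, w=[0.0, 1.0], k=3)
```

With uniform weights the query is assigned to class `C`, while the weight vector that ignores the first feature assigns it to class `M`.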

V. EXPERIMENTAL SETUP
In this section, we present a detailed description of the experimental setup. First, we introduce the dataset used in our experiments. Then, we describe performance metrics and present the results of feature selection.

A. DATASET AND IMPLEMENTATION
We validated our approach using datasets containing real-world data from 150,145 Windows programs in the PE file format, out of which 74,978 were malicious, and 75,167 were benign. The malicious and benign programs were obtained from the industrial partner's laboratory and from the Virusshare repository [12]. Our dataset contains both obfuscated (e.g., packed and/or polymorphic) and non-obfuscated samples. We confirm that all malicious samples considered in our experiments match known signatures from anti-virus companies. Also, none of our benign programs were detected as malware.
We used the Python module pefile [41] to extract features from the PE files. This module extracts all PE file attributes into an object, making them easily accessible. We extracted 370 features based on static information only, i.e., without running the program. The dimensionality is high because each flag of each kind of characteristic in each section was considered a separate feature.
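The following sketch illustrates how header fields and per-section flags could be turned into features. It is not the extraction code used for the 370 features: the choice of fields, the feature names, and the helper functions are assumptions, and only four of the section characteristic flags defined in the PE/COFF specification are listed. The `pefile` import is done lazily so that the sketch can be exercised on a stub object without the package installed.

```python
from types import SimpleNamespace

# Section characteristic flags from the PE/COFF specification.
SECTION_FLAGS = {
    "CNT_CODE": 0x00000020,
    "MEM_EXECUTE": 0x20000000,
    "MEM_READ": 0x40000000,
    "MEM_WRITE": 0x80000000,
}


def header_features(pe, max_sections=2):
    """Collect a few static header fields from a parsed PE object."""
    features = {
        "NumberOfSections": pe.FILE_HEADER.NumberOfSections,
        "SizeOfCode": pe.OPTIONAL_HEADER.SizeOfCode,
    }
    # Sections are addressed by their order, and every flag is its own feature.
    for index in range(max_sections):
        section = pe.sections[index] if index < len(pe.sections) else None
        for name, mask in SECTION_FLAGS.items():
            is_set = section is not None and bool(section.Characteristics & mask)
            features[f"section{index}_{name}"] = is_set
    return features


def features_from_file(path):
    """Parse a file with pefile (third-party) and extract the same fields."""
    import pefile  # imported lazily so the sketch loads without the package

    return header_features(pefile.PE(path))


# Stub with the same attribute layout as a parsed file (hypothetical values).
stub = SimpleNamespace(
    FILE_HEADER=SimpleNamespace(NumberOfSections=1),
    OPTIONAL_HEADER=SimpleNamespace(SizeOfCode=4096),
    sections=[SimpleNamespace(Characteristics=0x60000020)],
)
stub_features = header_features(stub)
```

Treating each flag of each section as a separate boolean is what makes the feature count grow quickly with the number of sections considered.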
Our implementations of the feature selection algorithms, DML algorithms, the ML classifiers and the classification metrics are based on the Scikit-learn library [42]. If not specifically mentioned, the hyperparameters of the ML classifiers and the DML methods were set to their default values in the Scikit-learn library.
Our implementation was executed on a single computer with two processors (Intel Xeon Gold 6136, 3.0GHz, 12 cores each), with 64 GB of RAM running the Ubuntu server 18.04 LTS operating system. Memory usage was not exceeded in any experiment conducted in this work.

B. PERFORMANCE METRICS
This section presents the performance metrics we used to measure the accuracy of our proposed approach. For evaluation purposes, the following classical quantities are employed:
• True Positive (TP) represents the number of malicious samples classified as malware.
• True Negative (TN) represents the number of benign samples classified as benign.
• False Positive (FP) represents the number of benign samples classified as malware.
• False Negative (FN) represents the number of malicious samples classified as benign.
The performance of our classifier on the test set is measured using three standard metrics. The most intuitive and commonly used evaluation metric in machine learning is the error rate (ERR):
$$\mathrm{ERR} = \frac{FP + FN}{TP + TN + FP + FN}$$
It is defined on a given test set as the percentage of incorrectly classified instances. An alternative to ERR is accuracy, defined as ACC = 1 − ERR. The second metric, the True Positive Rate (TPR), or detection rate, is defined as:
$$\mathrm{TPR} = \frac{TP}{TP + FN}$$
TPR is the percentage of truly malicious samples that were classified as malware. The third metric is the False Positive Rate (FPR), defined as:
$$\mathrm{FPR} = \frac{FP}{FP + TN}$$
FPR is the percentage of benign samples that were incorrectly classified as malware.
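As a small worked example, the three metrics can be computed directly from the four counts. The counts below are hypothetical and serve only to illustrate the definitions.

```python
def rates(tp, tn, fp, fn):
    """Return (ERR, TPR, FPR) as fractions computed from confusion counts."""
    err = (fp + fn) / (tp + tn + fp + fn)  # share of misclassified samples
    tpr = tp / (tp + fn)                   # share of malware that is detected
    fpr = fp / (fp + tn)                   # share of benign files flagged
    return err, tpr, fpr


# Hypothetical test set with 100 malicious and 100 benign samples.
err, tpr, fpr = rates(tp=95, tn=98, fp=2, fn=5)
acc = 1 - err
```

Multiplying each value by 100 gives the percentages used in the text.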

C. FEATURE SELECTION ALGORITHMS
To reduce the high dimension of the feature vector, we used a feature selection algorithm to select the most relevant subset of features. We applied six feature selection algorithms and evaluated them using the KNN (k = 3) classifier. Figure 2 shows that the highest accuracy was achieved with Recursive Feature Elimination (RFE) based on logistic regression for 75 selected features. The feature selection algorithms were evaluated on the whole training data, that is, 70% of all 150,145 samples. To make our results reproducible, Table 7 in Appendix A summarizes the 75 selected features used in our experiments. We kept the names of the fields in the same form as in the documentation [43] so that the reader can easily find a detailed description. In all following experiments, we used the dataset processed by the RFE logistic regression, which reduced the dimensionality from 370 to 75.
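A minimal sketch of recursive feature elimination with a logistic regression estimator in Scikit-learn is shown below. The synthetic data, the number of selected features, and the `max_iter` setting are assumptions chosen to keep the example small; they do not reproduce the configuration behind Figure 2.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training feature vectors (not the paper's data).
X_train, y_train = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Repeatedly fit the model and drop the least important feature (step=1)
# until the requested number of features remains.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5, step=1)
selector.fit(X_train, y_train)

selected = [i for i, keep in enumerate(selector.support_) if keep]
X_reduced = selector.transform(X_train)
```

The fitted selector would then be applied unchanged to the testing vectors, so that the selection itself depends only on the training samples.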

VI. EXPERIMENTAL RESULTS
A collection of experiments concerning distance metric learning techniques has been conducted. Firstly, we compared the DML techniques and performed additional experiments with the two most successful techniques. Then we focused on minimizing the false positive rate, and finally we compared our approach based on PSO to the state-of-the-art machine learning algorithms.
We first searched for the hyper-parameters of the DML methods, since appropriate hyper-parameters can have a large impact on predictive or computational performance. The hyper-parameters of the LMNN algorithm were tuned using grid search, which exhaustively considers all parameter combinations. The following search grids were explored: number of nearest neighbors $k \in \{1, 3, 5, 7, \ldots, 21\}$, maximum number of iterations of the optimization procedure $n_{\max} \in \{500, 1000, 1500, 2000, 2500, 3000\}$, and learning rate of the optimization procedure $r \in \{10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}, 10^{-6}, 10^{-7}, 10^{-8}\}$. Note that in all experiments with LMNN we used the parameter µ = 1, defined in Eq. (6) as the trade-off between the two objectives. The lowest error rate was achieved with the following LMNN hyper-parameters: number of neighbors $k = 3$, maximum number of iterations $n_{\max} = 1000$, and learning rate $r = 10^{-6}$. These hyper-parameters were used in all subsequent experiments. We left the hyper-parameters of NCA and MLKR at the default values provided in the Scikit-learn library.
Regarding the PSO algorithm, we explored the following PSO parameters: $\varphi_1, \varphi_2 \in \{0.5, 1, 1.5, 2\}$ and $V_{\max} \in \{0.5, 1, 2, 4\}$. The lowest error rate was achieved with the following hyper-parameters: $\varphi_1 = \varphi_2 = 1$ and $V_{\max} = 2$. The remaining PSO parameters were set as follows: a population size of 40 and 30 iterations. For the first iteration, the inertia weight $\omega$ is set to one, and it decreases linearly at each iteration to the value $\omega_{\min} = 0.8$. All of these PSO parameters were chosen according to the guidelines from [44]. We ran the PSO algorithm ten times. Figure 3 illustrates the mean and standard deviation of the error rate for various numbers of iterations. The PSO algorithm was run with 50 iterations; however, in each run the algorithm converged to a local minimum before reaching 30 iterations.
Note that even after the first iteration, PSO outperforms LMNN. The reason lies in the initialization step of Algorithm 1. Positions of particles in the initialization step of PSO were set according to the feature importance scores computed in the feature selection step rather than randomly. As a result, PSO was accelerated and better classification results were achieved.
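The following sketch shows a generic PSO loop that learns feature weights for a weighted Euclidean KNN classifier, using the settings named above ($\varphi_1 = \varphi_2 = 1$, $V_{\max} = 2$, inertia decreasing linearly from 1 to 0.8) and an importance-based initialization. It is not Algorithm 1 itself. The leave-one-out fitness, the small population, the jitter around the importance scores, the clamping of weights at zero, and the toy data are all assumptions made for illustration.

```python
import random


def knn_error(weights, X, y, k=3):
    """Leave-one-out error rate of KNN under a weighted Euclidean distance."""
    errors = 0
    for i, (xi, yi) in enumerate(zip(X, y)):
        dists = sorted(
            (sum(w * (a - b) ** 2 for w, a, b in zip(weights, xi, xj)), yj)
            for j, (xj, yj) in enumerate(zip(X, y))
            if j != i
        )
        votes = sum(label for _, label in dists[:k])  # neighbors labeled 1
        errors += (votes * 2 > k) != bool(yi)
    return errors / len(X)


def pso_weights(X, y, importance, particles=10, iterations=10,
                phi1=1.0, phi2=1.0, v_max=2.0, w_start=1.0, w_end=0.8, seed=0):
    """Search for feature weights that minimize the KNN error rate."""
    rng = random.Random(seed)
    dim = len(importance)
    # Particle 0 starts exactly at the importance scores, the rest nearby.
    pos = [list(importance)] + [
        [max(0.0, s + rng.uniform(-0.1, 0.1)) for s in importance]
        for _ in range(particles - 1)
    ]
    vel = [[0.0] * dim for _ in range(particles)]
    best_pos = [p[:] for p in pos]
    best_fit = [knn_error(p, X, y) for p in pos]
    g = min(range(particles), key=best_fit.__getitem__)
    g_pos, g_fit = best_pos[g][:], best_fit[g]
    for t in range(iterations):
        # Inertia weight decreases linearly from w_start to w_end.
        w = w_start - (w_start - w_end) * t / max(1, iterations - 1)
        for i in range(particles):
            for d in range(dim):
                v = (w * vel[i][d]
                     + phi1 * rng.random() * (best_pos[i][d] - pos[i][d])
                     + phi2 * rng.random() * (g_pos[d] - pos[i][d]))
                vel[i][d] = max(-v_max, min(v_max, v))      # velocity clamp
                pos[i][d] = max(0.0, pos[i][d] + vel[i][d])  # keep weights >= 0
            fit = knn_error(pos[i], X, y)
            if fit < best_fit[i]:
                best_pos[i], best_fit[i] = pos[i][:], fit
                if fit < g_fit:
                    g_pos, g_fit = pos[i][:], fit
    return g_pos, g_fit


# Toy data: feature 0 separates the classes, feature 1 is large-scale noise.
rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[label + rng.gauss(0, 0.2), rng.gauss(0, 3.0)] for label in y]
importance = [1.0, 0.5]  # hypothetical scores from the feature selection step

weights, error = pso_weights(X, y, importance)
```

Because one particle starts exactly at the importance scores, the best error found can never be worse than the error of that starting point, which is one way to read the observation that PSO is already competitive after the first iteration.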

A. COMPARISON OF DISTANCE METRIC LEARNING ALGORITHMS
Several distance metric learning algorithms such as LMNN, NCA, and MLKR were designed to improve the KNN classifier. For this reason, these three algorithms were included in our experiments. Table 2 shows the performance of the KNN classifier (k = 3) using the common Euclidean distance, the Mahalanobis distance learned by three selected DML algorithms, and the weighted Euclidean distance learned by the PSO algorithm. The KNN classifier achieved the lowest error rate for the weighted Euclidean metric learned by the PSO algorithm.
Recall that while PSO aims at learning a diagonal matrix, the goal of LMNN, NCA, and MLKR is to learn a full matrix. Due to the high computational complexity of the DML algorithms, we conducted the experiment on a randomly chosen subset of the training dataset. The distance metric learning algorithms were trained on 50,000 samples, and the KNN classifier with the learned distance was tested on 21,430 samples. These sample counts follow the 70:30 ratio between the sizes of the training and testing sets. Based on the trade-off between minimizing the error rate and execution time, we chose only LMNN and PSO for the rest of the experiments.

B. ADDITIONAL EXPERIMENTS FOR LMNN AND PSO
1) COMPARISON OF LMNN AND PSO
In the first experiment, we explored the performance of the LMNN-based model and the PSO-based model for different sizes of datasets. The experiment was conducted ten times for randomly chosen training and testing datasets keeping the 70:30 ratio of their sizes. The number of training samples, the learning method, and the average learning times, TPR, FPR, and ERR estimated on the testing set are summarized in Table 3. The number of nearest neighbors considered in both LMNN-based and PSO-based models was k = 3.
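The repeated evaluation protocol, ten random 70:30 splits with averaged results, can be sketched as follows. The helper names and the placeholder metric are assumptions; in practice `evaluate` would train the metric on the training indices and return TPR, FPR, or ERR on the testing indices.

```python
import random
from statistics import mean


def random_split(n_samples, train_fraction, rng):
    """Return shuffled index lists for a train/test split."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    cut = round(n_samples * train_fraction)
    return indices[:cut], indices[cut:]


def repeated_evaluation(n_samples, evaluate, runs=10, train_fraction=0.7, seed=0):
    """Average a metric over several independent random splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        train_idx, test_idx = random_split(n_samples, train_fraction, rng)
        scores.append(evaluate(train_idx, test_idx))
    return mean(scores)


# Placeholder metric: the fraction of samples that ended up in the test set.
test_share = repeated_evaluation(
    71430, lambda train, test: len(test) / (len(train) + len(test))
)
```

With 71,430 samples in total, a 70:30 split yields roughly 50,000 training and 21,430 testing samples.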
For smaller datasets (i.e., a few thousand samples), the LMNN-based model achieved a lower error rate with approximately the same learning time as the PSO-based model. The results indicate that, as the volume of data increases, the trade-off between computing time and error rate shifts in favor of PSO.

2) THE EFFECT OF PARAMETER k
We discuss how different settings of the parameter k (i.e., the number of neighbors) affect the performance of the KNN classifier. We explored the variation of error rates for the following three variants: Euclidean distance (without feature weights), Mahalanobis distance learned by LMNN, and weighted Euclidean distance where the weights were computed using PSO. For these three variants, the KNN was trained on 50,000 training samples, and the error rates were estimated on 21,430 testing samples. The experiment was performed ten times, and Figure 4 shows the averaged results of the KNN classifier for various values of the parameter k from the set {1, 3, 5, . . . , 21}.
In the additional experiments, we explored the relation between the number of neighbors and the learning time. Figure 5 shows that with the increasing number of neighbors the learning time of PSO increases only negligibly compared to the learning time of LMNN.

3) LMNN PROJECTION OF THE ORIGINAL FEATURE SPACE
In the next experiment, we used the weight matrix L, defined in Eq. (3) and learned by LMNN, to project the original feature space into a new embedding feature space. The goal of LMNN is that the k nearest neighbors of each instance belong to the same class, while a large margin separates instances from different classes. Recall that this projection is a linear transformation defined as $x' = Lx$. This experiment aims to illustrate the difference between the original (non-transformed) data and the LMNN-transformed data. A two-dimensional embedding of 700 samples computed with the t-SNE algorithm [45] is shown in Figure 6, where similarity plots for four scenarios are compared.
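The effect of the projection $x' = Lx$ can be checked numerically: the Euclidean distance between projected points equals the Mahalanobis distance with $M = L^{\top}L$ in the original space. The sketch below uses a hypothetical 2 × 2 matrix rather than one learned by LMNN, and the helper names are assumptions.

```python
import math


def project(L, x):
    """Apply the linear map x' = L x (L is a list of rows)."""
    return [sum(l_ij * x_j for l_ij, x_j in zip(row, x)) for row in L]


def euclidean(u, v):
    """Plain Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mahalanobis(M, u, v):
    """sqrt((u - v)^T M (u - v)) for a positive semidefinite matrix M."""
    diff = [a - b for a, b in zip(u, v)]
    return math.sqrt(sum(d * m for d, m in zip(diff, project(M, diff))))


def gram(L):
    """Compute M = L^T L."""
    cols = list(zip(*L))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


L = [[2.0, 0.0], [1.0, 1.0]]  # hypothetical transformation, not a learned one
x, y = [1.0, 2.0], [3.0, 1.0]

d_projected = euclidean(project(L, x), project(L, y))
d_mahalanobis = mahalanobis(gram(L), x, y)
```

Both distances evaluate to $\sqrt{17}$ here, so running KNN with the learned metric is equivalent to running ordinary Euclidean KNN on the projected data.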

4) LIMITATIONS OF LMNN FOR LARGER DATASETS
The result of the DML algorithms for the Mahalanobis distance metric is an n × n matrix, where n is the dimension of the feature vector. Since the number of components of the matrix grows at a quadratic rate with n and the size of the training data is fixed, we can expect that the size of the training data stops being sufficient for high values of n. Note that the PSO-based model using the weighted Euclidean distance needs only n parameters to be learned.
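The difference in the number of parameters can be made concrete with a short sketch. It assumes the weighted Euclidean distance has the squared form $\sum_i w_i (u_i - v_i)^2$, which may differ in detail from the definition used earlier in the paper, and shows that this is the Mahalanobis form with a diagonal matrix.

```python
def weighted_euclidean_sq(weights, u, v):
    """Squared weighted Euclidean distance with one weight per feature."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, u, v))


def mahalanobis_sq(M, u, v):
    """Squared Mahalanobis distance (u - v)^T M (u - v) for a full matrix M."""
    diff = [a - b for a, b in zip(u, v)]
    n = len(diff)
    return sum(diff[i] * M[i][j] * diff[j] for i in range(n) for j in range(n))


def diagonal(weights):
    """Build the diagonal matrix that carries the weights."""
    n = len(weights)
    return [[weights[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


weights = [0.5, 2.0]  # hypothetical feature weights
u, v = [1.0, 4.0], [3.0, 1.0]
d_weighted = weighted_euclidean_sq(weights, u, v)
d_diagonal = mahalanobis_sq(diagonal(weights), u, v)

# Number of values to learn for the 75 selected features.
n = 75
full_matrix_parameters = n * n
diagonal_parameters = n
```

For the 75 selected features, a full matrix has 5,625 entries, whereas the diagonal variant has 75.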
In the next experiment, we used Principal Component Analysis [46] to reduce the data's dimension and examine the learning ability of the LMNN-based model for various dimensions of the feature vectors. We defined the performance improvement, expressed in percent, as the reduction in error rate achieved by LMNN over the Euclidean baseline, where $\mathrm{ERR}_{lmnn}$ denotes the error rate of the KNN classifier (k = 3) using the Mahalanobis distance learned by LMNN, and $\mathrm{ERR}_{euclid}$ denotes the error rate for the (non-learned) Euclidean distance. Our fixed-size dataset consisted of 50,000 training samples and 21,430 testing samples. Figure 7 illustrates the performance improvement for the LMNN-based model. The result of linear regression, represented by the red dashed line, shows that the improvement in error rate declines with increasing dimension. This result may indicate that for higher dimensions, the size of our dataset may be a limiting factor.

C. MINIMIZING OF THE FALSE POSITIVE RATE
This section concerns the problem of detecting as much malware as possible while maintaining a low false positive rate. We first focus on minimizing the false positive rate using the PSO algorithm. We analyzed how the coefficient c in the WERR criterion defined in Eq. (15) influences the false positive rate and the error rate. In this experiment, we performed PSO with WERR optimization criterion for c ∈ {1, . . . , 10}. The relation between the coefficient c and the false positive rate and the error rate achieved by the KNN classifier (k = 3) is presented in Figure 8. The PSO was performed ten times for randomly chosen 50,000 training samples and 21,430 testing samples. The figure shows the mean values of FPR and ERR with the standard deviation.
As expected, with increasing coefficient c the corresponding FPR decreases. However, for c > 8, FPR does not decrease anymore since KNN using the Mahalanobis distance produces only 20 to 30 false positives. Given the size of our dataset, the lowest FPR, 0.13 %, was achieved for c = 8. Note that the corresponding ERR also increases with the coefficient c as long as c ≤ 8.
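The trade-off between FPR and ERR can be illustrated with a weighted error rate in which each false positive counts c times. The form used below is an assumption made for illustration and is not necessarily identical to the WERR criterion of Eq. (15); the two candidate outcomes are hypothetical.

```python
def error_rate(tp, tn, fp, fn):
    """Plain error rate: share of misclassified samples."""
    return (fp + fn) / (tp + tn + fp + fn)


def weighted_error_rate(tp, tn, fp, fn, c):
    """Hypothetical weighted error: each false positive counts c times."""
    return (c * fp + fn) / (tp + tn + fp + fn)


# Two hypothetical outcomes on 1,000 test samples (500 malware, 500 benign).
candidate_a = dict(tp=480, tn=490, fp=10, fn=20)  # lower ERR, more FPs
candidate_b = dict(tp=465, tn=497, fp=3, fn=35)   # higher ERR, fewer FPs


def preferred(c):
    """Name of the candidate with the lower weighted error for a given c."""
    score_a = weighted_error_rate(**candidate_a, c=c)
    score_b = weighted_error_rate(**candidate_b, c=c)
    return "a" if score_a <= score_b else "b"
```

With c = 1 the criterion prefers candidate a, which has the lower plain error rate; with c = 8 it prefers candidate b, which has fewer false positives but a higher plain error rate.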
While the 0.13 % FPR with a 1.15 % error rate achieved in our experiment seems reasonable, it can still be impractical in real-world applications. It is undesirable for an antivirus program to delete a benign sample once in every 769 scanned samples on average. However, our proposed malware detection model can be used as one component of a more complex system relying on more data types from both the static and the dynamic analysis.
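The figure of one false positive per 769 samples follows from the reciprocal of the FPR. The short calculation below also shows what the same rate would mean for a hypothetical batch of 100,000 benign files.

```python
fpr = 0.0013  # 0.13 % false positive rate

# On average, one benign sample is misclassified every 1 / FPR benign samples.
benign_samples_per_false_positive = 1 / fpr

# Expected false positives when scanning a hypothetical 100,000 benign files.
expected_false_positives = 100_000 * fpr
```

The reciprocal is about 769, and the hypothetical batch would contain about 130 false positives.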
As for minimizing FPR using LMNN, we modified Eq. (6) by adding the parameter $\eta_k$, which corresponds to the cost of a false positive. This modification aims to minimize the number of impostors belonging to the class of benign files. Let T be a training set and let $N_i$ denote the set of k target neighbors of $x_i$. The modification of LMNN focusing on minimizing the false positives, Eq. (21), weights the penalty for each impostor $x_k$ by $\eta_k$, where $\eta_k$ denotes the cost of a false positive and is defined as:
$$\eta_k = \begin{cases} 1 & \text{if } y_k \text{ is the class of benign files,} \\ c \ge 1 & \text{if } y_k \text{ is the class of malware.} \end{cases}$$
Similarly to the WERR optimization criterion, the purpose of the parameter c in the definition of η k is to set the amount of penalization for one false positive. The difference between the modification of LMNN and WERR criterion is that the modification of LMNN takes into account the distance between a sample and its impostor.
To summarize the results, Table 4 shows the performance of the modified LMNN according to Eq. (21) and of the PSO method with the WERR criterion. Note that the PSO-based method resulted in a lower false positive rate than the LMNN-based method.

D. COMPARISON TO THE STATE-OF-THE-ART MACHINE LEARNING ALGORITHMS
In the last experiment, we compared several state-of-the-art machine learning algorithms with our proposed method, that is, the KNN classifier with the weighted Euclidean distance whose weights were learned by the PSO algorithm with the WERR criterion.
A list of machine learning classifiers considered, together with implementation details, is presented in Table 5. We briefly describe the machine learning techniques applied in the experiment.
The k-Nearest Neighbors classifier [47] is one of the most popular supervised learning methods. It is a non-parametric method that assigns a class label to each tested sample by a majority vote of its k nearest neighbors.
The Support Vector Machine (SVM) method [48] is mainly defined for two-class classification problems. The core idea is to maximize the margin, which is the smallest distance between the training data and the decision boundary. The SVM method can also be applied to multiclass classification problems using a binary classifier in a one-against-all setting.
Logistic Regression [49] is a parametric binary classifier that estimates the coefficients from the training data using maximum-likelihood estimation. Similar to SVM, the one-against-all strategy can also be applied to multiclass classification problems.
The Naïve Bayes classifier [50] is a probabilistic algorithm based on Bayes' theorem that predicts the class with the highest a posteriori probability. The Naïve Bayes classifier is based on the assumption that the features are conditionally independent of one another, which is often not valid in practice.
The Decision Tree classifier [51] is represented as a tree where the internal nodes correspond to features and the leaf nodes correspond to class labels. Edges leading to child nodes correspond to the feature values. The feature vector determines the path from the root node to the leaf node.
A Deep Neural Network [52] is a feed-forward artificial neural network that consists of three types of interconnected layers of perceptrons. The input layer takes a feature vector, which is then processed in hidden layers, and finally, perceptrons in the output layer output a result.
AdaBoost [53] is one of the most popular boosting algorithms. It runs several weak classifiers and assigns them weights that are based on the corresponding error rates. These weights are then used to predict the output class.
Random Forest [54] is an ensemble learning method that combines the results made by several decision trees using a voting mechanism.
Table 6 provides the average classification results of the selected supervised machine learning algorithms compared with the results of our proposed method, defined as the KNN classifier using the weighted Euclidean distance learned by the PSO algorithm as described in Section IV.
All machine learning algorithms were run 20 times on randomly chosen training and testing sets with 50,000 samples and 21,430 samples, respectively. Our proposed method outperformed all the machine learning classifiers, achieving the lowest FPR and the lowest error rate. Deep Neural Network and AdaBoost were the only ML algorithms with a higher TPR than the PSO-based model; however, they both had a significantly higher FPR.

VII. CONCLUSION
This paper proposed a malware detection system based on the k-Nearest Neighbor classifier using the weighted Euclidean distance learned by the Particle Swarm Optimization algorithm. We empirically demonstrated that our approach achieved the lowest error rate and the lowest false positive rate among all state-of-the-art machine learning algorithms considered in our experiment. We described the architecture of the detection system based on structural information from the static analysis of Windows PE files. This approach can also be applied to executable formats of other operating systems, such as macOS or Linux.
In addition, we focused on the problem of detecting as much malware as possible while keeping a low false positive rate, because a high false positive rate is considered a serious problem in the antivirus industry. We proposed an optimization criterion based on a weighted error rate to penalize false positives. Using this criterion as the fitness function in the Particle Swarm Optimization algorithm, which was used to learn the feature weights of the weighted Euclidean distance, we achieved a 0.13 % false positive rate with an error rate of 1.15 %.
Ongoing work is focused on two directions. First, we are working on learning multiple local distance metrics for different malware families. We plan to investigate both unsupervised and supervised methods. Second, it would be interesting to experiment with other distance metric learning algorithms with various optimization criteria to achieve an even lower FPR with an acceptable error rate.
Table 7 summarizes the list of 75 features, all extracted from the PE file format. For each feature from a section header, we considered the order of the section rather than the name of the section (such as .text, .data, .rsrc). While the order of the sections turns out to be important for malware detection, this kind of information is often not mentioned in research papers. We keep the names of the fields in the same form as in the documentation [43] so that the reader can easily find a detailed description in the documentation [6].