WS-LSMR: Malicious WebShell Detection Algorithm Based on Ensemble Learning

To solve the problem that the features produced by hidden means, such as code obfuscation and compression, in encrypted malicious WebShell files are not the same as those produced by non-encrypted files, a WebShell attack detection algorithm based on ensemble learning is proposed. First, this algorithm extracted the feature vocabulary of the unigrams and 4-grams based on opcode; subsequently, the 4-gram feature word weights were obtained according to the calculated Gini coefficient of the unigram feature words and used to select the features, which will be selected again based on the Gini coefficient of the 4-gram feature words. Consequently, a feature vocabulary that can detect encrypted and unencrypted WebShell files was constructed. Second, in order to improve the adaptability and accuracy of the detection method, an ensemble detection model called WS-LSMR, consisting of a Logistic Regression, Support Vector Machine, Multi-layer Perceptron and Random Forest, was constructed. The model uses a weighted voting method to determine the WebShell classification. This experiment demonstrated that compared with the traditional single WebShell detection algorithm, the recall rate and accuracy rate improved to 99.14% and 94.28%, respectively, which proves that this method has better detection performance.


I. INTRODUCTION
With the rapid development of communication networks, Web-based applications have gradually become the main way for Internet companies to provide services to users. Meanwhile, the various types of network attacks against Web applications are also rapidly growing, which greatly threatens the security of the Internet. The 2018 China Internet network security report [1] issued by the National Internet Emergency Center (CNCERT) shows that in 2018, approximately 16,000 IP addresses at home and abroad implanted backdoors into approximately 24,000 websites in China. A network backdoor obtains system-level permissions through a WebShell, which can exist in many kinds of scripting languages. Generally, starting from the website project root folder, the attacker further extends the harm to the local area network and plants Trojans in the network to spread the virus. The attacker can use the WebShell file to access data information, such as the server database and files. For the security of a Web site, it is essential to detect the WebShell files on a server.
The associate editor coordinating the review of this manuscript and approving it for publication was Chun-Wei Tsai.
According to the scripting languages, WebShells can be classified as ASP, PHP and JSP scripting Trojans [2]. Because of the simple syntax and high development efficiency of the PHP language, it has become the first choice for developing all types of portals and Web applications [3]. Therefore, this paper mainly studies the detection method for PHP WebShells.
WebShells can be classified into three categories based on their functionality: Full Trojan, Mini Trojan and one-sentence Trojan. The Full Trojan, which is a general purpose WebShell, is malicious code with full functionality: it is interface-friendly and supports file operations, command execution and database operations through a graphical interface. The Mini Trojan contains only one function. This WebShell category can provide the file upload function using malicious database code. The one-sentence Trojan, which is a short and powerful malicious code that is difficult to detect, plays a powerful role in continuous intrusions and generally takes the form of a command execution code, such as an ''eval()'' function [4]. The features of WebShell scripts are constantly changing, which makes them increasingly difficult to detect [5]. Among the various issues, the selection of the optimal feature subset has long been a concern of researchers. In recent decades, ensemble learning algorithms have been shown to be able to efficiently solve many problems that cannot be solved by single machine learning algorithms [6]- [9]. These algorithms can address the problem of too much data or too little data. If the amount of data is too large, a single learner generally can only learn a small part [10]. Meanwhile, if the amount of data is too small, ensemble learning can sample the data set according to a certain strategy, and it consequently obtains a variety of different combinations of data to expand the data sample [11], [12]. Therefore, in the field of machine learning, researchers pay more attention to them.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
At first, the algorithms aimed to improve the accuracy of automatic decision-making systems, but at present, these methods can be applied to a variety of machine learning problems and can generate good results [13], [14]. Ensemble learning is a machine learning strategy that is independent of an algorithm. If a single classifier is compared to a decision maker, then multiple classifiers are equivalent to multiple decision makers making a decision together. To ensure that the ensemble classifier achieves a better classification effect than the single classifier, three principles need to be followed: (1) Each sub-classifier of the ensemble classifier must use different classification methods and training methods, (2) The errors produced by the sub-classifiers in the ensemble classifier must be different, and (3) The accuracy of the sub-classifiers must be greater than 0.5. Therefore, malicious WebShell detection based on ensemble learning encounters the following difficulties: (1) How to select the optimal feature subset of malicious WebShell scripts, (2) How to select basic learner in ensemble learning, and (3) How to calculate the weight parameters of the base learner in ensemble learning.

II. RELATED WORK
At present, the WebShell detection methods can be roughly classified as static feature-based detection and dynamic feature-based detection methods.

A. DETECTION BASED ON STATIC FEATURES
Static feature detection mainly matches feature values, dangerous function names and other known conditions to find WebShells. The advantages of this approach are that it is simple to deploy, has a high rate of finding known WebShells, and can be applied using a simple script; meanwhile, the disadvantage of this method is that only known WebShells can be found. Furthermore, manual cooperation is needed to find and exclude some weak features of the files. For a small web site, WebShells can be quickly located and files with weak features can be excluded using a combination of static features and manual work; for large web sites, however, the total amount of human effort is too large. Therefore, Webshell detection techniques in web applications [15] proposes a new method to identify WebShells based on the optimal threshold of malicious signatures, malicious function samples and the longest characters at the beginning and end of file labels. The malicious code in each file of the Web application is scanned and found, and then a list of suspect files and a detailed log analysis table for each suspect file are automatically provided to the administrator for further inspection.
In Training a multi-criteria decision system and application to the detection of PHP WebShells [16], signatures, fuzzy hashes, whether dangerous processes are invoked, whether there are obfuscating codes, and entropy are used as features to detect and classify WebShells using an algorithm that trains a multi-criteria decision system. CNN-Webshell: Malicious Web shell detection with convolutional neural network [17] proposed a feature extraction method based on ''word2vec''. First, the word2vec tool was used to transform each word of an HTTP request into a vector, and then it converted the vectors into a fixed sized matrix. Finally, a detection method based on the CNN model was used for classification.

B. DETECTION BASED ON DYNAMIC FEATURES
Detection based on dynamic features is a method that detects the features of the WebShell execution process. This approach is good at detecting the features generated by operations, such as code annotation and code compression; however, feature extraction and feature dimensionality reduction still represent problems. For example, in A Method of Detecting Webshell Based on Multi-layer Perception [18], the sample source code is converted into bytecode by compiling tools, and then the sample bytecode is decomposed into bytecode sequences by Bi-Gram. Next, the feature matrix of the training set is set using the word frequency matrix calculated by the TF-IDF. Finally, a multilayer neural network is used to detect and classify WebShell files. Webshell detection based on random forest-gradient boosting decision tree algorithm [19] used the features of the opcode sequences extracted from PHP source files, then used the TF-IDF vector and hash vector for feature selection, and finally used the combination of the random forest classifier and GBDT classifier for classification. Detecting webshell based on random forest with fasttext [20] first used the VLD tool in PHP to obtain the opcode sequences of PHP files. Then, this method used the FastText algorithm to train the opcode sequence model and predict the corresponding feature values and combined the predicted feature values and static features as the features of samples. Finally, the random forest was used to realize binary classification.

C. WebShell ATTACK DETECTION BASED ON ENSEMBLE LEARNING
The above two detection methods have improved the detection effect of WebShells to a certain extent, but they still have shortcomings, which are summarized in Table 1.
To overcome the shortcomings of the above dynamic detection methods and build on their advantages, this paper proposes using the ensemble learning of binary weighted voting to classify WebShells.
At present, ensemble learning can be roughly classified into three categories: Bagging, Boosting, and Stacking. Bagging repeatedly samples data from the original data set and trains a model on each sample to obtain K models. Finally, this approach uses these models to predict the data: K predicted values are obtained for each sample, and the final result is obtained by voting [21]. Because the weights of the base learners in this algorithm are the same, the base learner selection in this algorithm will directly affect the results of the ensemble learning method. Boosting mainly assembles weak classifiers into a strong classifier. For instance, AdaBoost pays more attention to the misclassified samples during learning, increases their respective weights, and then superimposes the models generated in each step to obtain the final model [22], [23]. Xgboost is a boosted tree model that integrates many tree models to form a strong classifier [24], [25]. However, the disadvantage of this algorithm is that it is easy to overfit due to noise. Stacking first trains the base learner using the initial data set, then combines the data generated by the base learner into a new data set, then inputs that new data set into the classification algorithm, and finally obtains the final prediction result [26], [27]. The disadvantage of this algorithm is as follows: if the training set of the primary learner is directly used to generate the secondary training set, there is a great risk of overfitting.
A more important classification method is the weighted voting method, which combines the above methods to adaptively allocate weights. Given multiple base classifiers, high weights are assigned to the algorithms with high accuracy, which can better reflect the roles of excellent algorithms [28].
According to the above analysis, this paper will apply the weighted voting algorithm of the binary model to the detection of WebShell files. Compared with other ordinary single machine learning algorithms, this algorithm performs better in terms of its recall rate and accuracy.

D. TYPES OF GRAPHICS
In this paper, the feature vocabulary lists of unigrams and 4-grams are extracted from the sample set according to the opcode. Then, the 4-gram feature words are calculated based on the feature weight values calculated by the Gini coefficients of the unigram feature words, and the selected features are selected again according to the Gini coefficients of the 4-gram feature words. This approach will allow one to construct a feature vocabulary to detect encrypted and unencrypted WebShell files, consequently solving the problem that it is difficult to detect the script and select the optimal feature subset. Second, in order to improve the adaptability and accuracy of the detection method, this paper constructs a differentiated ensemble detection model WS-LSMR, which is composed of a Logistic Regression, Support Vector Machine, Multi-layer Perceptron and Random Forest. The advantages and disadvantages of each algorithm are shown in Table 2.
In addition, this model uses the weighted voting method to determine the classification of WebShells.
The main contributions of this article are the following two aspects.
(1) Based on the original TF-IDF feature vectorization, the 4-gram feature words are weighted according to the feature weights calculated using the unigram feature words via the random forest, which selects features for the first time; the 4-gram feature words are then selected for the second time via the random forest. This algorithm can select the optimal feature subset that reduces the dimension and improves the efficiency of the algorithm.
(2) Based on the accuracy of the model training, the voting weights of each algorithm in the ensemble detection model are set to improve the detection effect.

III. MALICIOUS WebShell DETECTION ALGORITHM BASED ON ENSEMBLE LEARNING

A. SYSTEM ARCHITECTURE
This article uses an ensemble learning algorithm, which is called the WS-LSMR, and its structure is illustrated in Fig. 1 and Table 3. This research can be divided into three modules: preprocessing, feature selection, and model building and prediction.
Preprocessing: First, all files contained in the data set should be de-duplicated to prevent interference in the detection results. Opcode is used to preprocess the features that are obtained by splitting or encrypting dangerous functions. Second, the samples were divided into training samples and test samples at a ratio of 4:1. Finally, the training set is vectorized according to the unigrams and 4-grams, respectively, and the test set is vectorized according to the 4-grams.
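The unigram and 4-gram segmentation used in the preprocessing step can be sketched as follows. This is a minimal illustration under stated assumptions: the opcode sequence is a made-up example, `opcode_ngrams` is a hypothetical helper name, and n-grams are represented as space-joined strings, none of which is specified by the paper.

```python
from collections import Counter


def opcode_ngrams(opcodes, n):
    """Return the n-grams of an opcode sequence as space-joined strings."""
    return [" ".join(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


# Hypothetical opcode sequence for one PHP sample (illustrative values only).
sample = ["ECHO", "ASSIGN", "INCLUDE_OR_EVAL", "SEND_VAL", "DO_FCALL", "RETURN"]

unigrams = opcode_ngrams(sample, 1)      # one segment per opcode
four_grams = opcode_ngrams(sample, 4)    # sliding window of four opcodes
four_gram_counts = Counter(four_grams)   # raw counts, the input to TF-IDF
```

A sequence shorter than n simply yields no n-grams, so very short samples contribute nothing to the 4-gram vocabulary in this sketch.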
Feature selection: First, the 4-gram feature word weights are obtained according to the calculated Gini coefficients of the unigram feature words to select the features. The random forest algorithm is applied to the training samples to select the important features. The training set is sampled using SMOTE  to prevent the prediction result from being one-sided because of the large positive and negative sample gap.
Model building and prediction: Four base classifiers (Logistic Regression, Support Vector Machine, Multi-layer Perceptron, and Random Forest) are trained and verified on the training samples, which determines the accuracy of each base classifier and the prediction probability of each test sample under that classifier. Then, the accuracy rate on the verification set is used to obtain the weight value of the base classifier, and finally the weight value and the prediction probability are used to obtain the final classification probability of the test set.
The feature selection procedure is as follows:
1: Using the feature selection algorithm ''algorithm'' and one_grams_feature, calculate the importance score of the unigram feature vocabulary: one_feature_score.
2: Use one_feature_score and four_grams_feature to calculate the total score value of each 4-gram feature: four_score.
3: Implement the first dimension reduction of the features according to num1 and four_score to obtain four_score, the retained feature sequence.
4: Using the feature selection algorithm and four_score, calculate the importance score of the 4-gram feature vocabulary: four_feature_score.
5: Implement the second dimension reduction of the features with num2 and four_feature_score to obtain the remaining feature sequence: feature_list.
6: Sample the feature_list dataset with the SMOTE method to obtain the feature-processed dataset.
7: return feature_list.

B. OPCODE
Opcode is an intermediate code in the PHP language for the Zend [29] engine to execute, and it is similar to the bytecode file in Java or pycodeobject in Python. The resulting Opcode bytecode file can be directly executed; therefore, the split dangerous functions or encrypted functions in the PHP file will also be reflected in the Opcode, and when compiled they produce the same Opcode statements as the compilation result of the normal file.
At present, there are two kinds of malicious WebShells: encrypted files and non-encrypted files. The features that are directly extracted from the two types of files are different, but the features of the two PHP files after opcode encoding are completely consistent. Therefore, the first step for the data sample is to code the malicious files using opcode to obtain the feature set that can detect malicious WebShell files. In addition, WebShells can be identified using the code in Table 4.

C. FEATURE VECTORIZATION
The TF-IDF (term frequency-inverse document frequency) is a weighting technique that automatically extracts text keywords [30] and is widely used in text classification for feature vectorization [31]. The TF-IDF uses the IDF (inverse document frequency) to determine the word frequency. By calculating the frequencies of words and the inverse document, the algorithm can effectively identify those invalid words with high word frequency but no actual meaning. Therefore, this algorithm also improves the simple word frequency statistics method [32].
This paper uses the TF-IDF to carry out the feature vectorization of the PHP files, and the basic process is as follows. Each PHP file is called a sample after being encoded using opcode, and each sample is split into segments by the n-gram bag of words. TF(X) is the number of times that segment X appears in a given sample. The IDF reflects the frequency of this segment in all samples, and its calculation formula is as follows:
$$\mathrm{IDF}(X) = \log\frac{N}{N(X)}$$
where N is the total number of texts in the corpus, and N(X) is the number of texts containing the corpus segment X. The TF_IDF(X), which measures the importance of the corpus segment X, can be obtained using the TF (term frequency) and IDF (inverse document frequency) according to formula 3:
$$\mathrm{TF\_IDF}(X) = \mathrm{TF}(X) \times \mathrm{IDF}(X) \tag{3}$$
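The TF-IDF weighting described above can be sketched in a few lines. This is an illustration only: it assumes the unsmoothed form IDF(X) = log(N / N(X)) and raw counts for TF, and the function name `tf_idf` and the toy samples are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter


def tf_idf(samples):
    """Compute TF-IDF weights for tokenised samples.

    samples: list of token lists (each token is one n-gram segment).
    Returns one {segment: weight} dict per sample.
    Assumes the unsmoothed form IDF(X) = log(N / N(X)).
    """
    n_samples = len(samples)
    doc_freq = Counter()
    for tokens in samples:
        doc_freq.update(set(tokens))  # count each segment once per sample

    weights = []
    for tokens in samples:
        tf = Counter(tokens)  # raw count of each segment in this sample
        weights.append({
            seg: count * math.log(n_samples / doc_freq[seg])
            for seg, count in tf.items()
        })
    return weights


# Two toy samples (illustrative opcode names only).
samples = [
    ["ECHO", "ECHO", "RETURN"],
    ["INCLUDE_OR_EVAL", "RETURN"],
]
weights = tf_idf(samples)
```

A segment that appears in every sample, such as RETURN here, receives a weight of zero, which is how frequent but uninformative segments are suppressed. Libraries such as scikit-learn apply a smoothed IDF by default, so their values would differ from this sketch.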

D. FEATURE SELECTION
Because most opcode samples use only a small part of the words in the vocabulary, the word vectors are sparse. Therefore, it is necessary to eliminate the unimportant features to prevent the computational efficiency from decreasing due to feature explosion. At present, the main algorithm that can be selected for calculating the importance of features is the decision tree model. In it, the incremental value of the Gini coefficient is calculated for each feature according to formula 4, and the important features are finally selected based on the order of the incremental values. The Gini coefficient is a measure of the impurity of data:
$$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{M} p_i^{2} \tag{4}$$
where M represents the number of classes C_i in dataset D, and p_i represents the probability that any sample in D belongs to C_i. The Gini coefficient after splitting by attribute a is calculated as follows:
$$\mathrm{Gini}(D, a) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$
where D_1 is a non-empty proper subset of D; D_2 is the complement of D_1 in D, that is, D_1 + D_2 = D; and the minimum value is selected as the Gini coefficient of feature a. The larger the increment of a feature, the more important the feature is and the stronger its ability to distinguish black and white lists.
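As a small worked illustration of the Gini coefficient and its value after a binary split, the sketch below uses the standard definitions (one minus the sum of squared class proportions, and the size-weighted average over the two parts). The function names and toy labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_i ** 2)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def gini_split(left, right):
    """Size-weighted Gini impurity after splitting a dataset into two parts."""
    total = len(left) + len(right)
    return (len(left) / total) * gini(left) + (len(right) / total) * gini(right)


# Toy labels: 1 = WebShell, 0 = normal file (illustrative values only).
labels = [1, 1, 0, 0]
before = gini(labels)                # impurity of the mixed node
after = gini_split([1, 1], [0, 0])   # a split that separates the classes
decrease = before - after            # the increment credited to the feature
```

A feature whose split separates the two classes cleanly removes all of the impurity, so it receives the largest possible increment for this node.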
The specific steps of feature selection are as follows: (1) Calculate the importance score for the unigram feature words using the random forest; (2) The 4-gram feature words are selected for the first time according to the score values; (3) The feature importance score of the 4-grams is calculated using the features selected at the first time to carry out the second selection of the features; and (4) Finally, select the important features that meet certain important conditions. The features obtained above are the optimal feature subset, which improve the algorithmic efficiency while ensuring the accuracy and recall rate. The feature selection execution process is shown in Fig. 2, and Table 5 shows the top 10 most important features.

E. TRANSFORMATION OF THE WEIGHTS OF UNIGRAM AND 4-GRAMS FEATURE WORDS
In the process of TF-IDF vectorization, the 4-gram features are composed of unigram features, such that each 4-gram feature can be linked to its corresponding unigram features. In Fig. 3, for the 4-gram feature ''ABCD'', the weight value of this feature word is the simple sum of the unigram weight values of feature A, feature B, feature C, and feature D.
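The weight transformation and the first round of selection can be sketched as below. The unigram scores are invented numbers standing in for random forest importances, and keeping the top k features is one possible reading of the first dimension reduction; `four_gram_score` and `select_top` are hypothetical helper names.

```python
def four_gram_score(four_gram, unigram_scores):
    """Score a 4-gram as the sum of the scores of its four unigrams."""
    return sum(unigram_scores.get(op, 0.0) for op in four_gram.split())


def select_top(scores, k):
    """Keep the k highest-scoring features (first dimension reduction)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Hypothetical unigram importance scores (stand-ins for random forest output).
unigram_scores = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1, "E": 0.0}
four_grams = ["A B C D", "B C D E", "E E E E"]

scores = {fg: four_gram_score(fg, unigram_scores) for fg in four_grams}
kept = select_top(scores, 2)
```

The retained 4-grams would then be scored again directly, which corresponds to the second selection step.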

A. EXPERIMENTAL CONDITIONS
This experiment is based on Python version 3.5.4. The experimental environment uses the Windows 10 64-bit operating system, the processor is an Intel(R) Core(TM) i3-7130U CPU @ 2.70 GHz, and there is 8 GB of memory.

B. EXPERIMENTAL DATA
The malicious WebShell samples are mainly downloaded from GitHub public projects. Since this study only performs offensive detection for PHP files, the total number of malicious samples is 566. The normal PHP samples mainly come from common PHP frameworks, including PHPCMS, WordPress, Fenxiangyo, oa and yii2. There are a total of 5,379 samples. The data sources are shown in Table 6.

C. EVALUATION CRITERIA
This paper evaluates the WebShell detection method based on ensemble learning using the recall rate, accuracy rate and specificity. The model evaluation confusion matrix is shown in Table 7.
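A true positive is an outcome where the model correctly predicts the positive class (True Positive = TP).
A false negative is an outcome where the model incorrectly predicts the negative class (False Negative = FN).
A false positive is an outcome where the model incorrectly predicts the positive class (False Positive = FP).
A true negative is an outcome where the model correctly predicts the negative class (True Negative = TN).
The recall rate = TP/(TP + FN), which represents the proportion of positive samples (WebShell files) that are correctly predicted by the model.
The specificity = TN/(TN + FP), which represents the proportion of negative samples that are correctly predicted by the model and measures the recognition ability of the classifier for negative cases (normal PHP files) with respect to all negative results.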
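The evaluation criteria can be computed from the confusion matrix as in the sketch below. It assumes that label 1 denotes a WebShell and 0 a normal file, and that the accuracy rate means the overall fraction of correct predictions, which is the usual definition but an assumption here. The labels are toy values.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FN, FP, TN with 1 = WebShell (positive) and 0 = normal."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fn, fp, tn


def evaluate(y_true, y_pred):
    """Return recall, accuracy and specificity for binary predictions."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred)
    return {
        "recall": tp / (tp + fn),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),  # assumed definition
        "specificity": tn / (tn + fp),
    }


# Toy ground truth and predictions (illustrative values only).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
metrics = evaluate(y_true, y_pred)
```

With strongly unbalanced data such as this data set, recall and specificity are worth reading together, since a high overall accuracy can hide missed WebShells.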

D. FEATURE SELECTION
An opcode sequence is more representative than a single opcode. It is necessary to find the best-performing n in the range from unigrams to 6-grams and, finally, use the feature vector to conduct filtering to reduce the feature dimension. Fig. 4 and Table 8 show that during the transition from unigrams to 4-grams, the recall rate increases as n increases, and the model is better able to distinguish between black and white list samples. The recall rate reaches its maximum at 4-grams and decreases when n > 4, because longer sequences occur less often in WebShell files and therefore create invalid representation vectors. Therefore, 4-grams are optimal for the final training and testing.
The experimental data of the thresholds selected for the first time are shown in Fig. 6 and Table 9. In Table 9, ''len'' represents the number of remaining features selected according to the threshold from the original 15,000 features. Fig. 6 shows that when the threshold value is between 0.01 and 0.03, the three evaluation indexes linearly increase as the number of 4-gram features increases. The three evaluation indexes were in a state of decline between 0.03 and 0.07. At this time, the value decreased as the number of 4-gram features increased. The main reason was that some impure features were added to the feature set, which affected the final detection effect. The recall rate, accuracy rate and specificity reach their respective peaks between 0.08 and 0.09. At this time, the increase of the number of features will increase the detection index in a single direction, while the overall detection index will decrease. Therefore, the threshold of the dimension reduction selection of the first feature is 0.03, which is the best detection standard.

E. BINARY WEIGHTED VOTING MODEL
Binary weighted voting is an effective and highly direct way to deal with classification problems, primarily because an algorithm with good classification performance receives a high weight. In addition, the voting results can often exploit the complementarity between single classification models to reduce the error of a single classifier and improve the prediction performance and classification accuracy.
In this paper, to make the difference of single classifier more obvious, each classifier randomly extracts part of the data from the training set. Table 10 and Fig. 7 show that after the percentage of extracted training data reaches 80%, the recall rate and accuracy rate reach their peaks. This finding indicates that the data are saturated at this time. If the input of the training set continues to increase, the difference between the base classifiers will worsen, and the accuracy rate and recall rate will both be reduced. Therefore, the extraction of 80% of the training data is optimal for the final training effect.
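Drawing a separate random 80% subset of the training set for each base classifier can be sketched as follows. The sketch assumes sampling without replacement and uses fixed seeds for reproducibility; both choices, the helper name `subsample`, and the integer stand-ins for training samples are illustrative assumptions.

```python
import random


def subsample(train_set, fraction, seed):
    """Draw a random subset of the training set without replacement."""
    rng = random.Random(seed)
    size = round(len(train_set) * fraction)
    return rng.sample(train_set, size)


train_set = list(range(100))  # stand-in for 100 training samples
classifier_names = ["LR", "SVM", "MLP", "RF"]

# One independent 80% subset per base classifier (seeds are arbitrary).
subsets = {
    name: subsample(train_set, 0.8, seed)
    for seed, name in enumerate(classifier_names)
}
```

Because each classifier sees a slightly different subset, their errors are less likely to coincide, which is the diversity that voting relies on.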
Let T be the number of classifiers, where $h_i^j(x)$ represents the output of the classifier h_i on the category marker c_j.
Plurality voting:
$$H(x) = c_{\arg\max_j \sum_{i=1}^{T} h_i^j(x)}$$
In other words, the mark with the most votes is the predicted one, and if multiple marks receive the most votes at the same time, one of them is randomly selected.
Weighted voting (WV): if each classifier also has a weight value w_i, then
$$H(x) = c_{\arg\max_j \sum_{i=1}^{T} w_i h_i^j(x)}$$
The specific steps of ensemble binary weighted voting are as follows.
(1) Calculate the average accuracy P_avg_i of each classifier using 5-fold cross-validation and use this model to predict the prediction probability of the test set P_test_i, where i represents the classifier.
(2) Calculate the weight value of each classifier from its average accuracy, where i indexes the classifiers, T represents the number of classifiers, and n represents the gap between the good and bad algorithms for detecting WebShells.
(3) Combine the weight values and the prediction probabilities P_test_i of the classifiers to obtain the final prediction probability p_test.
(4) The final classification result is obtained by comparing the probability p_test with the threshold value 0.5.
In step 2, the range of n that increases the difference between base classifiers is generally not large. If the range is too large, the prediction probability of the final sample is likely to be greater than 1. Therefore, this experiment is conducted where n ranges from 10-1000. The experiment is divided into three steps to select the most appropriate n that increases the difference between the base classifiers.
(1) Different values of n are tested separately. In the first step of the experiment, n ranges from 0 to 1000. The specific values are 10, 100, 200, 300, 400, 500, 600, 700, 800, 900 and 1000. The experimental results are shown in Table 11 and Fig. 8.
(2) It is found that the best range for the value of n in Fig. 8 is between 0-200. In the second step of the experiment, n ranges from 0 to 200. The specific values are 10, 20, 40, 60, 80, 100, 120, 140, 160, 180 and 200. The experimental results are shown in Table 12 and Fig. 9.
(3) It is found that the best range of n in Fig. 9 is 10-90, and the third step is carried out using a range from 10-90. The specific values of n are 10, 20, 30, 40, 50, 60, 70, 80 and 90. The experimental results are shown in Table 13 and Fig. 10.
After testing the values of n that increase the difference between base classifiers, it is finally determined that n = 30 is the most appropriate. As shown in Fig. 11, when the value of n is between 0 and 30, the three evaluation indexes of the recall rate, accuracy rate and specificity increase. The main reason is that the difference between base classifiers increases as the value of n increases. When the value of n is greater than 30, the accuracy tends to decline. The main reason is that the weight n between them is too large, making the difference between base classifiers weak.
After this weighting, the superiority of the best algorithm for the data can be better reflected. The higher the accuracy of the algorithm is, the greater the weight value will be, and the final prediction effect will be higher than the correctness of any algorithm.
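The final decision step, combining the per-classifier probabilities with their weights and comparing the result with 0.5, can be sketched as below. The weights are supplied directly because the paper's weight formula is not reproduced here; normalising them so that p_test stays within [0, 1] is an assumption of this sketch, and all numbers are invented.

```python
def weighted_vote(probabilities, weights, threshold=0.5):
    """Combine per-classifier WebShell probabilities for one sample.

    probabilities: P_test_i for each base classifier i.
    weights: w_i for each base classifier; normalised here (an assumption).
    Returns (p_test, label) with label 1 = WebShell, 0 = normal.
    """
    total = sum(weights)
    p_test = sum(w * p for w, p in zip(weights, probabilities)) / total
    return p_test, int(p_test > threshold)


# Hypothetical outputs of LR, SVM, MLP and RF for a single test file.
probabilities = [0.2, 0.3, 0.4, 0.9]

equal = weighted_vote(probabilities, [1, 1, 1, 1])   # plain averaging
skewed = weighted_vote(probabilities, [1, 1, 1, 5])  # trust the last model more
```

In this toy case equal weights classify the file as normal, while giving the most confident classifier a larger weight flips the decision to WebShell, which illustrates why the choice of weights matters.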

F. COMPARISON OF THE WS-LSMR ALGORITHM AND OTHER CLASSIFICATION ALGORITHMS
The ensemble algorithm uses four base classifiers (logistic regression (LR), support vector machine (SVC), multilayer perceptron (MLP), and random forest (RF)). The algorithm adopts the 5-fold cross validation method to select the following parameters. The reciprocal of the logistic regression's regularization coefficient λ is the reciprocal of 100, and the penalty is the ''L2'' penalty term. The support vector machine's regularization coefficient λ is the reciprocal of 100, and the kernel is a linear kernel function. The activation function of the multilayer perceptron is ''tanh'', the learning rate is 0.001, and solver is the stochastic gradient descent. The Gini coefficient is used to assess the random forest, 300 decision trees are used, and the minimum sample size is 20.
To verify the performance of the ensemble learning algorithm, we downloaded several popular WebShell detectors from the Internet, namely D Shield, WEBDIR+ and SHELLPUB, and used them to scan the same test set as the WS-LSMR model. Some of the detectors, such as D Shield, showed good scanning accuracy. The experimental results are shown in Table 14. This paper also compares WS-LSMR with other classical classifiers: the k-nearest neighbours (denoted as KNN), the multi-layer perceptron (denoted as MLP), the Decision Tree and the Random Forest. The experimental comparison of all these classifiers for PCA dimension reduction is shown in Table 15. This paper further compares WS-LSMR with several popular ensemble algorithms. RandomForest, Adaboost, and Xgboost use the default parameters provided in scikit-learn. Stacking uses logistic regression, support vector machine, multilayer perceptron, and random forest with the same parameters as the four algorithms in this article, and the Meta-Learner is set to logistic regression. The experimental results are shown in Table 16 on the basis that the feature processing method is consistent with that in this paper. Although the detection time of WS-LSMR is longer than that of other methods, the method in this paper obtained a good feature subset by optimizing and selecting features, and weighted optimization was performed based on each base classifier to achieve the best performance. Our model can greatly improve the recall rate while ensuring the accuracy.

G. EXPERIMENTAL EXPANSION
The feature selection, model training and prediction methods proposed in this paper can also achieve good results in the scripting languages JSP and ASP. The data of JSP and ASP come from the open source project downloaded from the following website, and the blacklist is still downloaded using the blacklist website of this paper. The experimental results are shown in Table 17. The data set is shown in Table 18.

V. CONCLUSION
This paper proposes a feature selection algorithm based on the Gini coefficient and a weighted voting method based on WS-LSMR. The experimental results show that this method can handle unbalanced WebShell data and has a very high recall rate while ensuring the accuracy. The experiments also show that the feature selection, model training and prediction procedures proposed in this paper can be extended to the detection of other scripting languages with good detection results. Although the detection time of WS-LSMR is longer than that of other methods, the accuracy and the recall of the method proposed in this paper are much better than those of other methods. In order to further ensure the security of the website, in the future, we will continue to classify and detect different types of offensive files in terms of time performance, so as to further improve the adaptability of the detection algorithm.