Hierarchical Long Short-Term Memory Network for Cyberattack Detection

With the continuous development of network technology, cyberattack detection mechanisms play a vital role in ensuring the security of computers and network systems. However, with the rapid growth of network traffic, traditional intrusion detection systems (IDSs) are far from being able to quickly and accurately identify complex and diverse network attacks, especially low-frequency attacks. To enhance the overall security of the Internet, an IDS based on hierarchical long short-term memory (HLSTM) networks is proposed. With the introduction of HLSTM, the network can learn across multiple levels of temporal hierarchy over complex network traffic sequences. The system is evaluated on the well-known benchmark data set NSL-KDD for comparison with other existing methods. The experimental results demonstrate that, compared with existing state-of-the-art methods, our system has better detection performance for different types of cyberattacks. In addition, the low-frequency network attack types are classified with higher accuracy and a lower false detection rate.


I. INTRODUCTION
Currently, networks are increasingly integrated with people's daily lives. With the rapid development and widespread application of the Internet and the Internet of Things (IoT), network security has gradually attracted the attention of enterprises and countries. It is more necessary than ever to defend networks from cyberattacks. As a key part of network security defense, the intrusion detection system (IDS) refers to a network protection system built using certain security policies to detect intrusion behavior. Due to the large traffic volume and complex structure of network data, the processing capability of machine learning is limited. For this reason, traditional IDSs based on conventional machine learning methods generally have some shortcomings, such as a high false positive rate, poor generalization ability, and low real-time performance. Therefore, establishing an IDS that can effectively identify various complex and unknown intrusion attacks is an urgent issue.
The associate editor coordinating the review of this manuscript and approving it for publication was Anandakumar Haldorai.
In recent years, the excellent representation capability of deep learning has attracted much attention. Deep learning has achieved remarkable results in many areas, such as image recognition and natural language processing (NLP) [1]. To improve the intelligence and accuracy of network intrusion detection, many researchers have applied convolutional neural networks (CNN) to cyberattack detection, but they have not yet achieved the expected breakthrough [2], [3]. One of the main reasons is that network traffic is not in an image data format, so there are some problems in blindly applying CNN in IDS [4]. Considering that network traffic data are usually one-dimensional, an interesting attempt used a recurrent neural network (RNN) for processing. However, the few existing methods [5], [6] have shown that using a simple RNN model does not yield significant performance gains.
VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Considering the above factors, this paper proposes an intrusion detection method based on hierarchical long short-term memory (HLSTM) [7]. The main contributions of this paper are summarized as follows:
1) An HLSTM-based IDS is proposed that can learn across multiple levels of temporal hierarchy over complex network traffic sequences. Experimental results show that this method has better detection performance for various network attacks, especially for low-frequency network attacks.
2) This paper also compares the detection ability of the algorithm for different types of network attacks in detail, and provides a comprehensive analysis of the problems in the detection of low-frequency network attack types.
The rest of this paper is organized as follows. Sect. II briefly introduces the research status and existing problems of network intrusion detection. Sect. III elaborates on the HLSTM-based intrusion detection model proposed in this paper, including the data set used, data preprocessing method, HLSTM principle, and performance evaluation metrics. Sect. IV gives the specific experimental results and comparative analysis. Sect. V summarizes the paper and outlines future work.

II. RELATED WORK
Traditional intrusion detection systems typically use machine learning algorithms. An account of such research is given in [2], covering support vector machine (SVM), decision trees, k-nearest neighbor (KNN), random forests, naive Bayes networks, and others. However, the actual network operating environment is complex and changeable, and various cyber-intrusion attacks are emerging one after another. Many unknown intrusion attacks are not included in the training data set; for example, in the NSL-KDD data set, 16.6% of the attack types in the test data set do not appear in the training data set [8]. Therefore, traditional intrusion detection methods tend to have poor performance.
In recent years, deep learning has gradually become a popular research topic, and many works have applied deep learning methods to the field of cyberattack detection. Labonne et al. [9] proposed a cascade-structured meta-specialist approach based on multilayer perceptron (MLP) for classification in intrusion detection. Kasongo and Sun [10] used a feed-forward deep neural network with a filter for wireless intrusion detection. Yang et al. [11] provided an aggregation approach using the deep belief network (DBN) and modified density peak clustering algorithm (MDPCA). Jia et al. [12] designed an IDS based on a deep neural network (DNN).
With the breakthrough of CNN in the fields of image and natural language processing, a series of CNN-based intrusion detection studies [13]-[16] were proposed. Although these methods have achieved encouraging results in network intrusion detection, there are still some issues that need to be addressed. Considering that the network traffic generated in network communication is one-dimensional byte stream data rather than image data, blindly using CNN for intrusion detection inevitably affects the performance of the model.
Recently, an interesting attempt is to use RNN for intrusion detection, as in the literature [5], [6], [17]. However, all these works show that using only a simple RNN layer as the classifier does not yield significant performance gains in cyberattack detection. To further improve the performance of IDSs, an intrusion detection method based on HLSTM is proposed. Compared with the existing methods, our model has a higher detection accuracy with a lower false detection rate.

III. PROPOSED METHODOLOGIES
This section introduces the proposed HLSTM-based IDS in detail, and the system structure diagram is shown in Fig. 1. The data set used, the data preprocessing method, the proposed intrusion detection model based on HLSTM, and the performance evaluation metrics are sequentially described below.

A. NSL-KDD DATA SET
The NSL-KDD data set is the benchmark data set for intrusion detection in the network security field [8]. It is an improved version of the KDD CUP 99 data set [18], which effectively solves the problem of record redundancy, making the division of the training set and the test set more reasonable [19]. The NSL-KDD data set is widely used as an effective baseline data set that can help researchers compare different cyberattack detection methods.
The NSL-KDD data set consists of the training set KDDTrain+, and the test sets KDDTest+ and KDDTest-21, where the KDDTest-21 test set contains more unknown attack types and is, therefore, more difficult to classify than the KDDTest+ test set. The NSL-KDD data set contains five categories of network traffic data: normal, denial of service (DoS), probe, user to root (U2R), and remote to local (R2L). The number and percentage (PCT) of records for each category are shown in Table 1 [20].
Each record in the NSL-KDD data set contains 41 features and a corresponding classification label. These features are divided into four parts [21]: basic features, content features, time-based network traffic statistics features, and host-based network traffic statistics features, as shown in Table 2.

B. DATA PREPROCESSING
The data preprocessing includes data cleaning, numeralization, and normalization. These steps are briefly introduced below.

1) DATA CLEANING
Data cleaning is an important task in data mining and usually needs to be performed before model training to ensure the quality of the data. Although the NSL-KDD data set has been improved, the value of the 20th feature "num_outbound_cmds" is always 0, so it is a useless feature that needs to be removed. After data cleaning, each record in the data set contains 40 features.
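The cleaning rule above (drop a feature whose value never changes) can be sketched as follows. This is a minimal illustration, assuming records are held as a list of dictionaries keyed by feature name; the helper name `drop_constant_features` and the toy values are hypothetical and are not taken from the paper.

```python
def drop_constant_features(records):
    """Remove every feature whose value is identical across all records."""
    if not records:
        return records, []
    # A feature is constant when the set of its observed values has size 1.
    constant = [
        name for name in records[0]
        if len({rec[name] for rec in records}) == 1
    ]
    cleaned = [
        {k: v for k, v in rec.items() if k not in constant}
        for rec in records
    ]
    return cleaned, constant


# Toy records (hypothetical values); only num_outbound_cmds never varies.
toy_records = [
    {"duration": 0, "src_bytes": 491, "num_outbound_cmds": 0},
    {"duration": 2, "src_bytes": 146, "num_outbound_cmds": 0},
    {"duration": 0, "src_bytes": 232, "num_outbound_cmds": 0},
]
cleaned_records, removed = drop_constant_features(toy_records)
print(removed)  # ['num_outbound_cmds']
```

Detecting constant columns from the data, rather than hard-coding the feature name, also catches any other feature that happens to carry no information in a given split.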

3) NORMALIZATION
To reduce the impact of the numerical range of different features on model training, each feature value needs to be scaled to a reasonable range. In this paper, min-max normalization is used to scale each value to [0, 1], and the normalized value $x'$ is

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x$ is the initial value, and $x_{\max}$ and $x_{\min}$ are the maximum and minimum values of the feature, respectively. In addition, logarithmic normalization would be a better choice for the features "duration", "src_bytes", and "dst_bytes", which have a larger range of values.
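A minimal sketch of the scaling to [0, 1] described above, together with one possible form of logarithmic normalization. The paper does not specify the exact logarithmic transform, so the use of `log1p` followed by min-max scaling is an assumption, as are the function names and the sample byte counts.

```python
import math


def min_max_scale(values):
    """Scale values linearly to [0, 1] using the feature's own min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def log_min_max_scale(values):
    """Assumed variant: compress with log(1 + x) first, then min-max scale."""
    return min_max_scale([math.log1p(v) for v in values])


# Hypothetical byte counts spanning several orders of magnitude.
src_bytes = [0, 10, 1000, 1_000_000]
plain = min_max_scale(src_bytes)
logged = log_min_max_scale(src_bytes)
print(plain)
print(logged)
```

With plain min-max scaling, the small and medium values are squeezed next to 0 by the single large value; the logarithmic variant spreads them more evenly across the interval, which is the motivation for treating wide-range features separately.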

C. METHODOLOGY
This subsection begins with a quick overview of the LSTM model. Given a sequence of inputs $X = \{x_1, x_2, \ldots, x_{n_X}\}$, each input $X$ is paired with a sequence of outputs to predict, $Y = \{y_1, y_2, \ldots, y_{n_Y}\}$. An LSTM defines a distribution over outputs and sequentially predicts tokens using a softmax function:

$$P(Y \mid X) = \prod_{t=1}^{n_Y} \frac{\exp\left(f(h_{t-1}, e_{y_t})\right)}{\sum_{y'} \exp\left(f(h_{t-1}, e_{y'})\right)}$$

where $f(h_{t-1}, e_{y_t})$ is the activation function between $h_{t-1}$ and $e_{y_t}$, and $h_{t-1}$ denotes the representation output by the LSTM at time $t-1$ [20].
Hierarchical LSTM can learn across multiple levels of temporal hierarchy over a complex sequence [22]. Generally, the first recurrent layer of the HLSTM encodes a sentence (i.e., a sequence of word vectors) into a sentence vector:

$$h^w_t = \mathrm{LSTM}(e^w_t, h^w_{t-1})$$

where $\mathrm{LSTM}(\cdot)$ is defined as the LSTM operation for simplicity, and $h^w_t$ and $e^w_t$ denote the hidden vectors from the LSTM model and the embedding at the word level, respectively. The second recurrent layer then encodes a sequence of such vectors (encoded by the first layer) into a document vector:

$$h^s_t = \mathrm{LSTM}(e^s_t, h^s_{t-1})$$

where $h^s_t$ and $e^s_t$ denote the hidden vectors from the LSTM model and the embedding at the sentence level, respectively [23]. The document vector is considered to preserve both the word-level and sentence-level structure of the context.
To obtain better attack detection performance, HLSTM is introduced into our proposed intrusion detection system, as shown in Fig. 2. For the preprocessed data, each record contains 121 values, so it can be converted into an 11 × 11 pixel grayscale image. In the HLSTM-based intrusion detection model, the first LSTM layer encodes each column of pixels of shape (11, 1) into a column vector of shape (128,). The second LSTM layer then encodes these 11 column vectors, of combined shape (11, 128), into an image vector representing the whole image. Finally, a fully connected layer is added for prediction.
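The two-level encoding described above can be traced with a small forward-pass sketch. It is not the authors' implementation (which uses Keras): it relies only on the Python standard library, uses random untrained weights, and omits the final prediction layer. The class `TinyLSTM`, the function `hlstm_encode`, the row-major layout of the 11 × 11 grid, the sharing of one first-level LSTM across all columns, and the size of the second-level hidden state are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyLSTM:
    """Single-layer LSTM with random weights, forward pass only."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        cols = input_size + hidden_size
        # One weight matrix and bias per gate: input, forget, output, candidate.
        self.weights = {
            gate: [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                   for _ in range(hidden_size)]
            for gate in "ifoc"
        }
        self.biases = {gate: [0.0] * hidden_size for gate in "ifoc"}

    def _gate(self, gate, xh, squash):
        return [
            squash(sum(w * v for w, v in zip(row, xh)) + b)
            for row, b in zip(self.weights[gate], self.biases[gate])
        ]

    def encode(self, sequence):
        """Run over a sequence of input vectors; return the last hidden state."""
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        for x in sequence:
            xh = list(x) + h
            i = self._gate("i", xh, sigmoid)
            f = self._gate("f", xh, sigmoid)
            o = self._gate("o", xh, sigmoid)
            g = self._gate("c", xh, math.tanh)
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h


def hlstm_encode(record, column_lstm, image_lstm, side=11):
    """Encode a flat record as a side x side grid using two LSTM levels."""
    assert len(record) == side * side
    # Assumed row-major layout of the flat record.
    rows = [record[r * side:(r + 1) * side] for r in range(side)]
    # Level 1: each column is a sequence of `side` one-value time steps.
    column_vectors = [
        column_lstm.encode([[rows[r][col]] for r in range(side)])
        for col in range(side)
    ]
    # Level 2: the sequence of column vectors becomes one image vector.
    return column_vectors, image_lstm.encode(column_vectors)


rng = random.Random(0)
hidden = 8  # the paper uses 128; reduced so pure Python runs quickly
column_lstm = TinyLSTM(input_size=1, hidden_size=hidden, rng=rng)
image_lstm = TinyLSTM(input_size=hidden, hidden_size=hidden, rng=rng)
record = [rng.random() for _ in range(121)]  # stand-in for one preprocessed record
column_vectors, image_vector = hlstm_encode(record, column_lstm, image_lstm)
print(len(column_vectors), len(column_vectors[0]), len(image_vector))  # 11 8 8
```

Setting `hidden = 128` reproduces the shapes quoted in the text, (11, 128) for the column vectors, at the cost of a noticeably slower run in pure Python.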
In addition, considering the severe imbalance in the sample size of different categories in the NSL-KDD data set, it is necessary to rebalance the training data set to obtain better training results. There are two common methods for rebalancing classes, undersampling and oversampling, but both have some shortcomings [24]. Therefore, the solution we adopted reduces the loss weight of the normal and DoS categories and increases the loss weight of the R2L and U2R categories. By weighting the loss function, the classifier is discouraged from favoring the most heavily represented classes.
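In addition, considering the large difference in the number of records in different categories of the NSL-KDD data set, the following indicators are also used to evaluate the performance of the model [25]:

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \quad \mathrm{FPR} = \frac{FP}{FP + TN}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \text{F-score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives, respectively; the recall, also known as the true positive rate (TPR) or detection rate (DR), is the proportion of actual positives that are correctly detected; the false positive rate (FPR), also known as the false alarm rate (FAR), is the proportion of negatives that are incorrectly predicted as positive; the precision is the proportion of predicted positives that are actually positive; and the f-score is the harmonic mean of precision and recall.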
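For a multi-class problem such as this one, the indicators above are usually computed per category in a one-vs-rest fashion from the confusion matrix. The sketch below shows that calculation; the function name `per_class_metrics` and the 3-class matrix are hypothetical and do not reproduce any result from the paper.

```python
def per_class_metrics(confusion, class_index):
    """One-vs-rest metrics; confusion[i][j] counts true class i predicted as j."""
    n = len(confusion)
    tp = confusion[class_index][class_index]
    fn = sum(confusion[class_index]) - tp                        # missed positives
    fp = sum(confusion[r][class_index] for r in range(n)) - tp   # false alarms
    tn = sum(map(sum, confusion)) - tp - fn - fp
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {"recall": recall, "fpr": fpr,
            "precision": precision, "f_score": f_score}


# Hypothetical 3-class confusion matrix (not results from the paper).
toy_confusion = [
    [90, 5, 5],
    [10, 80, 10],
    [20, 10, 20],
]
metrics = per_class_metrics(toy_confusion, class_index=2)
print(metrics)
```

For the rare third class in this toy matrix, the false positive rate is low (0.075) even though recall is only 0.4, which illustrates why accuracy and FPR alone can look good for a low-frequency category while precision and recall remain poor.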

IV. EXPERIMENT RESULTS AND DISCUSSION
This section validates the effectiveness of the proposed HLSTM-IDS on the benchmark data set NSL-KDD and compares it with existing state-of-the-art methods.

A. MODEL TRAINING
All the experiments in this paper are performed using the Keras [26] framework with the TensorFlow backend. The experimental device is a personal computer with a GeForce GTX 1050 Ti GPU for acceleration. All models are trained using the KDDTrain+ data set and tested using the KDDTest+ and KDDTest-21 data sets. We use the Adam optimizer [27] with a fixed learning rate of 0.001 and train 200 times with a mini-batch size of 1024. All other configurations use default parameters.
In addition, considering the large difference in the number of samples of different attack types in the training set, we need to adjust the loss function in the model to keep the classifier from favoring the attack categories with more samples. Specifically, the loss for each category is weighted, and the weight is the reciprocal of the ratio of the number of samples in the category to the total number of samples.
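The weighting rule above amounts to weight = total samples / samples in the category. A minimal sketch follows; the function name is hypothetical, and the per-category counts are the figures commonly reported for KDDTrain+, used here as an assumption because Table 1 is not reproduced in this text. Whether the authors rescale the weights afterwards is not stated, so no rescaling is applied.

```python
def inverse_frequency_weights(class_counts):
    """Weight each class by the reciprocal of its share of the training set."""
    total = sum(class_counts.values())
    return {label: total / count for label, count in class_counts.items()}


# Assumed KDDTrain+ record counts per category (not taken from this text).
kddtrain_counts = {
    "normal": 67343,
    "dos": 45927,
    "probe": 11656,
    "r2l": 995,
    "u2r": 52,
}
weights = inverse_frequency_weights(kddtrain_counts)
for label, weight in weights.items():
    print(f"{label}: {weight:.2f}")
```

Under these assumed counts, a single U2R record contributes on the order of a thousand times more to the loss than a single normal record, which is what pushes the classifier to attend to the low-frequency categories.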

B. OVERALL CLASSIFICATION PERFORMANCE
To evaluate the performance of the proposed network model, we performed a multi-classification experiment on the NSL-KDD data set. To demonstrate the advantage of the method, some of the latest deep learning based IDSs are given for comparison. The classification accuracy of each method is shown in Table 3. Our model achieved accuracies of 83.85% and 69.73% on the KDDTest+ and KDDTest-21 test sets, respectively, which is better than the best classifier in the existing work. The main reason is that HLSTM can learn across multiple levels of temporal hierarchy over the network traffic data. In addition, the specific experimental results are provided in Table 4 in the form of a confusion matrix.

C. CLASSIFICATION PERFORMANCE FOR EACH ATTACK CATEGORY
To assess the ability of the model to detect different categories of network attacks, we provide detailed performance metrics for each category of attack, along with other existing advanced deep learning-based methods for comparison. Note that most of the literature does not provide their performance on the KDDTest-21 data set, so we only compare performance on the KDDTest+ data set.

1) DoS
The DoS attack is also known as the flood attack. By exhausting the resources of the attacked object, it prevents the target computer or network from providing normal service or resource access, making it impossible for normal users to access or use the network service. The DoS attack does not include intrusion into the target server or target network device. Such an attack is relatively straightforward to detect by carefully examining the connection statistics of the attacked object. Therefore, various intrusion detection methods usually achieve quite good classification results, as shown in Fig. 3.

2) PROBE
As an early stage of network intrusion, probing attacks are a growing cybersecurity issue. The original purpose of the probe is to understand the state of the network and to monitor or collect data about network activity. The probe attack attempts to access a target host through a known or potential vulnerability in the computer system. Similarly, the attack can be easily detected by carefully examining the connection statistics on the host. The results in Fig. 4 also show that each model achieves good classification performance.

3) R2L
An R2L attack is an attack used to gain local access to a vulnerable computer. The premise is that an attacker can send packets to the victim computer over the network. This type of attack is a low-frequency attack, and the number of samples in the training set is also small, so it is difficult to detect. As seen from the results in Fig. 5, the accuracy of each method is greatly reduced. However, the method in this paper can still achieve a better precision with a lower false positive rate.

4) U2R
A user-to-root attack is a set of attacks that intrude through a user with no or low privileges, bypass authentication or directly obtain root privileges through a vulnerability, and then log in for a series of illegal operations. This type of attack is also a low-frequency attack, and the number of samples in the training set is very small, so detection is very difficult. As seen from the results in Fig. 6, although the accuracy and FPR of each method are good, their precision is very low. Our method has a relatively balanced performance with better precision and a lower FPR.

D. DISCUSSION
From the above experimental results, we can see that the DoS and probe attacks can be easily detected by carefully checking network characteristics and connection statistics, but it is difficult to detect low-frequency attacks such as U2R and R2L. The main reasons are as follows:
1) Compared with DoS and Probe attacks, the NSL-KDD data set contains fewer R2L and U2R attack samples, and insufficiently learning these two attacks makes the classifier less suitable for detecting such attacks.
2) Nearly half of the unknown attack types in the test data set KDDTest+ do not exist in the training data set KDDTrain+, which poses a great challenge for intrusion detection, especially R2L and U2R attacks.
3) The connection statistics of low-frequency attacks are very similar to normal connections. The classifier is more inclined to regard such attacks as normal types because of the uneven distribution of data.
4) There is a certain similarity between the behavior of U2R and R2L attacks. Therefore, it is also difficult to distinguish between U2R and R2L. In fact, U2R attacks are one of the variants of the R2L attack [21].

V. CONCLUSION
In this paper, an intrusion detection method based on HLSTM is proposed, which can learn across multiple levels of temporal hierarchy over complex network traffic sequences. The detection quality of the system on the NSL-KDD data set is greatly improved. The accuracies of multi-classification on KDDTest+ and KDDTest-21 are 83.85% and 69.73%, respectively. The method has higher detection precision and a lower false alarm rate than the most advanced methods available, especially in the detection of low-frequency attacks. Note that our method can also be applied to other intrusion detection data sets. As a next step, we need to add more U2R and R2L intrusion samples to the data set to address the increasingly severe low-frequency network intrusion attacks. In summary, the use of deep learning for analysis in the field of cybersecurity remains a challenging and open task.