CNN-LSTM: Hybrid Deep Neural Network for Network Intrusion Detection System

Network security has become indispensable to our daily interactions and networks. As attackers continue to develop new types of attacks and the size of networks continues to grow, the need for an effective intrusion detection system (IDS) has become critical. Numerous studies implemented machine learning algorithms to develop an effective IDS; however, with the advent of deep learning algorithms and artificial neural networks that can generate features automatically without human intervention, researchers began to rely on deep learning. In our research, we take advantage of the Convolutional Neural Network's ability to extract spatial features and the Long Short-Term Memory network's ability to extract temporal features to create a hybrid intrusion detection model. We added batch normalization and dropout layers to the model to increase its performance. The model was trained for both binary and multiclass classification using three datasets: CIC-IDS 2017, UNSW-NB15, and WSN-DS. The system's effectiveness is evaluated with confusion-matrix criteria: accuracy, precision, detection rate, F1-score, and false alarm rate (FAR). Experimental results demonstrate the effectiveness of the proposed model, showing a high detection rate, high accuracy, and a relatively low FAR.

The associate editor coordinating the review of this manuscript and approving it for publication was Nazar Zaki.

[…] network intrusion detection systems (IDS) to provide secured networks. Intrusion detection systems intend to provide availability, confidentiality, and integrity for the data transmitted in networked computers by preventing unauthorized access to a network, protecting the information and communication systems in the network [3], and, most importantly, detecting known and unknown attacks and threats with high accuracy and a minimum false alarm rate [4].

Two approaches comprise intrusion detection: misuse detection and anomaly detection. Misuse detection, also known as signature-based detection, is the initial detection model, where detection is based on known and stored attacks and threats. This model has a low rate of false alarms and a high detection rate. With the expansion of networks […]

[…] supervised learning, such as Decision Tree, SVM, and Naïve Bayes, and unsupervised learning, such as K-means clustering and the Self-Organizing Map [4]. The primary function of machine learning algorithms is to enhance a system's detection capability: the trained data is used to detect attacks and threats. Machine learning algorithms are typically employed to solve regression, classification, and clustering problems. Most prior work on machine learning relied on the NSL-KDD, DARPA, and KDD-CUP99 datasets. Some models produced satisfactory results, but these datasets are out-of-date and contain only simple types of attacks [1], [4]. Training an IDS for the current, continuously expanding network requires a large dataset, and relying on traditional machine learning algorithms that function correctly only on small datasets will not result in an efficient model [4].
Deep learning is a subfield of machine learning that works with multi-hidden-layer artificial neural networks [4]. In addition to learning data representations, deep learning algorithms can also learn from unlabeled or unstructured data [6]. Deep learning has many performance features that make it efficient enough to develop an IDS, such as the robustness of DL algorithms, their high scalability, and their ability to deal with different types of data [7]. Deep learning algorithms were mainly developed to solve complex problems such as pattern recognition, search engines, and machine translation [8]. Algorithms such as Deep Belief Networks (DBN), Restricted Boltzmann Machines (RBM), and Autoencoders (AE) are widely used for extracting features [9]. The Multi-Layer Perceptron is used in different fields, mainly to minimize the error rate during training [10].

Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) are the most prevalent deep learning algorithms. CNN's primary advantages are its ability to automatically recognize spatial features without human intervention, to avoid overfitting by reducing the number of trainable parameters, and to improve generalization [8]. RNN is primarily used in Natural Language Processing (NLP), speech processing, and video analysis due to its ability to utilize sequential features [7], [11]. LSTM, which adds memory blocks to the RNN architecture, was developed as a solution to the RNN's vanishing gradient problem [11].

In our research, we construct an intrusion detection system using CNN-LSTM layers. The IDS model's methodology is depicted in Figure 1.

The datasets we used are publicly available. The raw traffic is captured in pcap format, and the extracted features are stored in CSV files. In this step, the Pandas package was used to read each dataset's records; after reading, each dataset was cleaned of null and duplicate values in preparation for the next step.

Normalizing the data is a preprocessing technique used to bring the features into a common range. The variance of the data read from the CSV files, which have different standard deviations and means, will impact the learning efficiency. In our model, we scaled the input data using StandardScaler from the sklearn.preprocessing library, resulting in a mean of zero and a standard deviation of one.

Feature selection, also referred to as feature reduction, is responsible for selecting a subset of features based on criteria. This process enables rapid model construction and training on the selected features, which reduces training and testing time and improves performance. In our work, we used the SelectKBest method, imported from the sklearn.feature_selection library, which selects the best features based on the highest scores. We chose the score function for classification and the number of features based on the K value. The output is an array containing each feature's name and score, and we chose our features based on that array.

Our model's datasets were divided into an 80% training set and a 20% testing set. In addition, we divided the training set into training and validation sets to tune our hyperparameters during training and improve the model's performance.
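The cleaning, scaling, feature-selection, and splitting steps described above can be sketched as follows. This is a minimal illustration, not the paper's exact code: the k value, random seed, and column names are placeholder assumptions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

def preprocess(df: pd.DataFrame, label_col: str, k: int = 20):
    # Clean the dataset of null and duplicate records.
    df = df.dropna().drop_duplicates()
    X, y = df.drop(columns=[label_col]), df[label_col]

    # Standardize features to zero mean and unit standard deviation.
    X_scaled = StandardScaler().fit_transform(X)

    # Keep the k highest-scoring features (ANOVA F-score for classification).
    X_best = SelectKBest(score_func=f_classif, k=k).fit_transform(X_scaled, y)

    # 80% training / 20% testing split, stratified on the label.
    return train_test_split(X_best, y, test_size=0.2, stratify=y,
                            random_state=42)
```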
Using the Stratified K-Fold Cross-Validation technique, the size of both sets was determined based on the factor K.
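The stratified split above can be sketched with scikit-learn's StratifiedKFold; the toy data and K = 5 here are illustrative, not the paper's settings.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(40).reshape(20, 2)      # toy feature matrix (20 samples)
y = np.array([0] * 10 + [1] * 10)     # balanced binary labels

# Each fold preserves the class ratio of y in both partitions.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for train_idx, val_idx in skf.split(X, y):
    # 16 samples train / 4 validate per fold, 2 of each class in validation.
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
```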

CNN can extract spatial features, while LSTM can extract temporal features. Because CNN can extract high-level features from large amounts of data, the model begins with CNN. The data first passes through the convolution layer, where the filters extract the most critical features to generate a feature map. This map then undergoes max pooling to preserve the most dominant features, followed by batch normalization. The output is sent to an LSTM layer to extract temporal features, followed by a dropout layer to prevent overfitting. This combination of CNN and LSTM layers is repeated three times with varying numbers of neurons and filters, followed by a fully connected layer that uses the SoftMax activation function to perform classification. Figure 2 depicts the structure of our deep learning model.

A threshold-based activation function processes the feature map to determine whether each neuron fires or not [6], [7]. In our model, we used ReLU as the activation function: ReLU(z_i) = max(0, z_i). Therefore, the feature map after the activation function becomes h = ReLU(w * x + b), where h represents the activation output, w the weights, x the input, and b the bias; only positive values remain non-zero.
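The stacked architecture described above can be sketched in Keras. This is an illustrative reconstruction under stated assumptions: the filter counts, LSTM units, kernel size, and input shape are placeholders, not the paper's actual hyperparameters.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Conv1D, MaxPooling1D, BatchNormalization,
                                     LSTM, Dropout, Dense)

def build_cnn_lstm(n_features: int = 40, n_classes: int = 5) -> Sequential:
    model = Sequential()
    # Three stacked CNN-LSTM blocks with varying filters/units (illustrative).
    for i, (filters, units) in enumerate([(64, 64), (128, 128), (256, 64)]):
        if i == 0:
            model.add(Conv1D(filters, 3, padding="same", activation="relu",
                             input_shape=(n_features, 1)))
        else:
            model.add(Conv1D(filters, 3, padding="same", activation="relu"))
        model.add(MaxPooling1D(pool_size=2))   # keep dominant features
        model.add(BatchNormalization())        # stabilize/speed up training
        # Return sequences except in the last block, so the next Conv1D
        # still receives a 3-D (time, channels) tensor.
        model.add(LSTM(units, return_sequences=(i < 2)))
        model.add(Dropout(0.2))                # reduce overfitting
    # Fully connected SoftMax layer for classification.
    model.add(Dense(n_classes, activation="softmax"))
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```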
The result of equation 2 is then processed with two learnable variables, γ and β, generating an output Ŷ; γ and β are trained during the learning process to produce a better learned representation.
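The normalize-then-scale-and-shift step can be illustrated numerically; the γ and β values below are arbitrary, chosen only to show the effect of the learnable parameters.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize the batch to zero mean and unit variance per feature...
    x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
    # ...then scale by gamma and shift by beta to produce the output Y-hat.
    return gamma * x_hat + beta

x = np.array([[1.0], [2.0], [3.0]])   # one feature over a batch of 3
y = batch_norm(x, gamma=2.0, beta=0.5)
# The output mean equals beta, since the normalized values average to zero.
```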
The central concept of LSTM is its capacity to translate and cache inputs using memory cells over time. These memory cells are controlled by gates, each of which applies its own activation function. As shown in Figure 4, LSTM consists of four gates: the forget gate, update gate, tanh gate, and output gate. In these networks, learning occurs by adjusting the weights and the values of the activation functions so that the temporal features between input and output data can be effectively produced [3], [16], [17].
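The gate mechanism can be sketched for a single time step in NumPy. This is a generic LSTM cell under the standard formulation, not the paper's implementation; the weight dictionaries W, U, b are illustrative names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b hold parameters for the forget (f), input/update (i),
    # tanh candidate (g), and output (o) gates.
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])  # what to forget
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])  # what to add
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])  # candidate values
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])  # what to expose
    c_t = f * c_prev + i * g      # new cell state: keep + update
    h_t = o * np.tanh(c_t)        # new hidden state / output
    return h_t, c_t
```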

In the LSTM network, the input and output values are vectors of the same size, set by X(t). The forget gate decides which information to keep and which to delete by combining X(t) with the previous hidden state h(t − 1); its output is generated by the sigmoid function and multiplied with the previous cell state C(t − 1). The update gate considers the input gate, which determines the information to be added when generating C(t). This generation is based on the sigmoid function and the tanh function of the tanh gate. The product of these two gates is added to the product of the forget gate and the previous cell state to generate the new cell state C(t).

The confusion matrix indicators are shown in Table 1.

We evaluated the models using two classification methods: binary and multiclass. For binary classification, the datasets were divided into two classes: benign and attack. For multiclass classification, as shown in Table 2, each record is labeled as benign or as one type of attack.

Table 3 shows the accuracy on the CIC-IDS 2017 binary dataset. The highest accuracy, achieved by CNN-LSTM structures with three layers, was 99.59%, followed by LSTM-CNN.

VOLUME 10, 2022

The results for the UNSW-NB15 binary dataset are in Table 4, and the WSN-DS results are in Table 5. After comparing the four learning algorithms, we continued our research using the CNN-LSTM hybrid structure.

The results based on the binary CIC-IDS2017 dataset are displayed in Table 6. We chose to continue using the Adam optimizer due to its superior accuracy and detection rate.

This section demonstrates the third portion of our testing, based on the number of layers, neurons, FC layers, and the dropout rate.
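The evaluation criteria used throughout these experiments follow directly from the binary confusion matrix counts. A minimal sketch, with standard definitions of the metrics named in the paper (the sample counts below are made up for illustration):

```python
def ids_metrics(tp, tn, fp, fn):
    """Compute IDS evaluation metrics from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    detection_rate = tp / (tp + fn)   # recall / true positive rate
    f1 = 2 * precision * detection_rate / (precision + detection_rate)
    far = fp / (fp + tn)              # false alarm rate
    return accuracy, precision, detection_rate, f1, far

# Illustrative counts: 90 attacks caught, 10 missed, 5 false alarms.
acc, prec, dr, f1, far = ids_metrics(tp=90, tn=95, fp=5, fn=10)
```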

The outcomes presented in Table 9 were very similar. The best performance was 99.60% for three layers with a dropout rate of 0.2 and one FC layer, followed by 99.56% for one layer with a dropout rate of 0.2 and two FC layers, and 99.55% for two layers with a dropout rate of 0.2 and two FC layers. […]

[…] with a dropout rate of 0.5 and two FC layers. Then, two CNN-LSTM layers with a dropout rate of 0.2 were followed by two FC layers in the implementation. There was a slight change in detection rates. […]

[…] Grayhole attacks. Similar detection rate values were observed for other records. Almost every K-fold yielded poor results for these attack types; we aim to enhance the model's ability to detect all attack types. Based on the previous values, we decided to continue testing with K = 8.

Figures 13 and 14 show that multiclass and binary classification performance is nearly identical. UNSW-NB15 obtained the lowest detection rate values and the highest FAR values. As shown in the figures, increasing the number of epochs did not affect CIC-IDS2017 and WSN-DS.

The confusion matrices of the three datasets are shown in Figure 15. They demonstrate that the majority of record types were classified accurately, but PortScan attacks were predicted as normal records. Figure 16 demonstrates that the most prevalent attack types were Exploits, Fuzzers, DoS, and Worms, which the model classified as Reconnaissance attacks.

Due to the model's ability to accurately classify all types of records in the dataset, as depicted in Figure 17, the majority of records of each type were accurately predicted.

As shown in the following tables, we compared the efficacy of our model to that of prior studies. The overall performance of our model surpasses that of other recent studies. […] Table 18 presents the comparison based on UNSW-NB15.

The CIC-IDS2017 dataset is utilized for another comparison; Table 19 demonstrates the robustness of our binary model. […]

Table 20 shows the performance based on the WSN-DS dataset. The accuracy achieved by our model was 99.58%, outperforming other machine learning algorithms: Logistic Regression (LR) achieved 97%, Naïve Bayes 83.1%, and Decision Tree (DT) 99.1%. CNN-LSTM also obtained the highest detection rate, 97.77%. Our results outperformed the benchmarked studies due to the structure of stacked CNN and LSTM layers followed by a DNN, cleaning the datasets, choosing the best features, adding dropout, and adding batch normalization.

[…] 99.67%, 98.14%, and 98%, respectively. The highest precision and lowest false alarm rate were also achieved: 98.86% and 0.11%, respectively. On the other hand, multiclass classification achieved the highest detection rate and F1-score at K = 8: 98.83% and 98.44%, respectively, and the highest accuracy and precision at K = 10: 98.43% and 99.12%, respectively. K = 2 had the lowest false alarm rate, 0.67%.

This study developed an intrusion detection system based on the CNN and LSTM deep learning algorithms. We stacked CNN and LSTM layers in our model, taking advantage of CNN's ability to extract spatial features and LSTM's ability to extract temporal features. We implemented batch normalization, dropout layers, and standardization to improve our model. The model was evaluated using the UNSW-NB15, CIC-IDS2017, and WSN-DS datasets, all of which contain benign and attack records. As a first step, we tested the behavior of these datasets with CNN, LSTM, CNN-LSTM, and LSTM-CNN models. The results indicated that the CNN-LSTM hybrid model provided the highest detection rate and accuracy. Based on this, we evaluated the hybrid model under binary and multiclass classification scenarios. With 5 epochs, we obtained 99.64%, 94.53%, and 99.67% accuracy for binary classification using the CIC-IDS2017, UNSW-NB15, and WSN-DS datasets, respectively. Although the model was unable to provide a high detection rate for certain types of attacks, such as web attacks in CIC-IDS2017 and Worms, Backdoors, and Analysis in UNSW-NB15, the detection rate and FAR results are encouraging. The effects of K-fold cross-validation and of increasing the number of epochs were examined, and the results indicated that performance initially improves before becoming stable. In the future, we intend to improve the model's performance with respect to the low detection rate and high FAR resulting from the datasets' imbalanced records.