Flexible and Robust Real-Time Intrusion Detection Systems to Network Dynamics

Deep learning-based intrusion detection systems have advanced the development of effective network intrusion detection systems (NIDS) through innovations such as high accuracy, automation, and scalability. However, most previous research has focused on model generation through intensive feature engineering rather than on real environments, which limits the application of those methods to detecting real-time attacks in a real network. In this paper, we propose a new flexible and robust NIDS based on a Recurrent Neural Network (RNN) with a multi-classifier that generates a detection model in real time. The proposed system adaptively and intelligently adjusts the generated model with given system parameters, which can also serve as security parameters to defend against an attacker's obfuscation techniques in real time. In the experimental results, the proposed system detects network attacks with high accuracy and high-speed model upgrades in real time while remaining robust under attack.

Learning-based NIDS is one of the most important defense methods to automatically monitor network behavior and to detect abnormal behavior based on built-in attack models through automatic feature engineering. Many deep learning-based IDSes (DL-IDS) have been proposed over the past decade to improve attack detection techniques, due to advantages such as automatic feature generation, effectiveness, and scalability [1], [2], [3], [4], [5], [7], [8], [9], [10], [17], [23]. Many deep learning methods, such as CNN, GAN, and Autoencoder, have been popularly utilized for the development of NIDS [21], [22]. Rigaki et al. used a GAN to improve malware detection, because adversarial learning like GAN enhances the robustness of IDS [23].
The rest of the paper is organized as follows: Section II discusses previous NIDS based on machine learning and deep learning. Section III presents our proposed system, and Section IV shows our datasets and experimental results. Finally, we conclude our work in Section VI after discussing our methods from different angles in Section V.

Machine learning (ML) and deep learning (DL) techniques have been popularly adopted to develop intrusion detection systems (IDS) because of their high accuracy, automation, and lack of a prior-knowledge requirement [1], [2], [3], [4], [7], [8], [9], [10], [17]. IDS can be deployed from a single computer, as a host-based intrusion detection system (HIDS), to many networks, as a network-based intrusion detection system (NIDS) [15], [16]. IDS can also be categorized by detection method: signature-based detection and anomaly-based detection [15]. There are various kinds of machine learning-based IDS, since machine learning can be applied to packet-based attack detection in IDS. Mayhew et al. trained a convolutional autoencoder model to extract payload features [22]. As adversarial learning enhances the robustness of IDS, Rigaki et al. used a GAN to improve the malware detection effect [23].

Machine learning can be applied to a feature engineering-based detection method in which common features are packet length, the proportion of TCP flags, and source bytes [17].

The proposed system first performs data processing and data classification based on the multi-classifier, as shown in Figure 1.
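As a hypothetical illustration of this kind of feature engineering (the record layout with `len`, `flags`, and `src` fields is an assumption for the sketch, not from the paper), the three common features named above could be derived from one flow like this:

```python
from collections import Counter

def flow_features(packets):
    """Derive packet length, TCP-flag proportion, and source bytes
    from one flow. Each packet is a dict with hypothetical keys:
      'len'   - packet length in bytes
      'flags' - set of TCP flag names, e.g. {'SYN'}
      'src'   - True if the packet was sent by the flow's source host
    """
    total = len(packets)
    mean_len = sum(p['len'] for p in packets) / total
    flag_counts = Counter(f for p in packets for f in p['flags'])
    syn_ratio = flag_counts['SYN'] / total       # proportion of SYN flags
    src_bytes = sum(p['len'] for p in packets if p['src'])
    return {'mean_len': mean_len, 'syn_ratio': syn_ratio,
            'src_bytes': src_bytes}

# A toy three-packet flow (handshake plus one data packet):
demo = [
    {'len': 60,   'flags': {'SYN'},        'src': True},
    {'len': 60,   'flags': {'SYN', 'ACK'}, 'src': False},
    {'len': 1500, 'flags': {'ACK'},        'src': True},
]
feats = flow_features(demo)
```

Feature vectors of this shape are what a feature engineering-based detector would then classify.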

The data processing first performs data cleansing, using archived historical data for model generation and real-time incoming data to update the generated model in real time.

In other words, the data processing stage is where we load datasets, clean them, and balance them in terms of the binary dependent variable. After data cleaning and sampling, the proposed system utilizes the Random Forest algorithm as a multi-classifier to select the best-quality data and to evaluate feature importance. Random Forest consists of multiple decision trees combined into an ensemble classifier through bagging, which counts and averages the votes from each decision tree [44]. Bagging, also called bootstrap aggregation, reduces the variance, which is a proxy for consistency. The vote can be mathematically expressed as

    C(x) = majority vote { Ĉ_i(x) }, i = 1, ..., n    (1)

where we have a total of n trees and Ĉ_i(x) is the classification of the i-th random forest tree [6].
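The majority vote over the per-tree classifications Ĉ_i(x) can be sketched in a few lines (a toy illustration, not the paper's implementation):

```python
from collections import Counter

def forest_vote(tree_predictions):
    """Majority vote over the per-tree classifications C_i(x):
    the forest's label for x is the class receiving the most votes."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Votes from a hypothetical 5-tree forest for one network trace:
votes = ['attack', 'normal', 'attack', 'attack', 'normal']
label = forest_vote(votes)  # → 'attack' (3 of 5 votes)
```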

As a byproduct, Random Forest generates feature importance: a list of features and how quantitatively important each is in the Random Forest decision making. It is calculated using a normalization of the Gini impurity,

    Gini = 1 - Σ_{i=1}^{C} f_i²    (2)

where f_i is the frequency of label i at a node and C is the number of unique labels. When a tree sprouts a branch, the improvement in the split criterion is the importance measure attributed to the splitting variable, and it is accumulated over all the trees in the forest separately for each variable [6].
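The Gini impurity of a node follows directly from the label frequencies, as this small sketch shows (illustrative only):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum_i f_i^2, where f_i is the frequency of
    label i among the samples at a node and the sum runs over the
    C unique labels present."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

pure  = gini_impurity(['normal'] * 8)                   # one class: 0.0
mixed = gini_impurity(['normal'] * 4 + ['attack'] * 4)  # 1 - (0.5² + 0.5²) = 0.5
```

A split's importance is the decrease in this impurity it achieves, accumulated per feature over all trees.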

Based on the results of the multi-classifier using the two equations (1) and (2), the system collects the most promising datasets to be used as input for the RNN in the next step. It also selects feature sets based on the outcomes of the feature importance, as we demonstrate in the evaluation section. Our previous work proposed a multi-classifier exploiting various machine learning techniques to exclude ambiguous data from the training data for high accuracy [3], [4].

As discussed in this Section III-A, the multi-classifier excels at data classification by detecting outliers, which would otherwise decrease system performance, based on our previous research outcomes [3], [4]. In addition, the multi-classifier performed data classification faster than deep learning algorithms, which are slowed by their many hidden layers.
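One minimal way to realize this ambiguous-data exclusion is to keep only samples on which the trees largely agree; the 80% agreement cutoff below is an assumed value for illustration, not taken from the paper:

```python
from collections import Counter

def filter_ambiguous(samples, tree_votes, min_agreement=0.8):
    """Drop samples the multi-classifier is unsure about: keep a sample
    only when at least `min_agreement` of the trees agree on its label."""
    kept = []
    for sample, votes in zip(samples, tree_votes):
        top_share = Counter(votes).most_common(1)[0][1] / len(votes)
        if top_share >= min_agreement:
            kept.append(sample)
    return kept

samples = ['trace_a', 'trace_b']
votes = [['attack'] * 9 + ['normal'],        # 90% agreement: kept
         ['attack'] * 5 + ['normal'] * 5]    # 50% agreement: dropped as ambiguous
clean = filter_ambiguous(samples, votes)     # → ['trace_a']
```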

As shown in Figure 1, the data processing selects and cleans historical data or real-time data through data sampling to balance datasets; the multi-classifier then operates on these results. Throughout the LSTM equations, y is an activation function of the input [10].

The net output and the activation of out_j on the j-th memory cell are

    net_{out_j}(t) = Σ_u w_{out_j, u} y^u(t-1),    y^{out_j}(t) = f_{out_j}(net_{out_j}(t))

In this paper, we propose a new network intrusion detection system by utilizing the RNN model with the multi-classifier.

The proposed system has different system parameters, such as a random time (Δt), a window size (ω), and a block size (β), to build a model in real time. The proposed system collects data at a randomly selected time (Δt). The collected data size is determined by the two system parameters: the window size (ω) and the block size (β). The window size is the amount of data used to generate a model, and the block size is the amount of data used for model upgrades in real time.
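A minimal sketch of how the three parameters (random start time, window size, block size) could drive data collection; the paper publishes no code, so all names here are illustrative:

```python
import random

def collect(stream, window, block, seed=None):
    """Collect data per the three system parameters: start at a randomly
    selected time, take `window` traces to build the model, then stage
    the next `block` traces for the real-time model upgrade."""
    rng = random.Random(seed)
    start = rng.randrange(0, len(stream) - window - block)  # random time
    model_data   = stream[start : start + window]           # window size
    upgrade_data = stream[start + window : start + window + block]  # block size
    return model_data, upgrade_data

stream = list(range(1000))                  # stand-in for a trace stream
w, b = collect(stream, window=500, block=200, seed=7)
```

The upgrade block begins exactly where the training window ends, so each model refresh consumes contiguous new traffic.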

To improve the RNN models in real time, this paper proposes two approaches: (1) the best-effort approach and (2) the adaptive feature-engineering approach, as described in the following.

Given a random time (Δt), a window size (ω), and a block size (β), the proposed system keeps improving the current model whenever the system achieves better performance (m), measured as accuracy. For example, at a random time, the system processes a set of data of the window size to generate the first model. The system then updates the current model with a new model by regenerating it from the original data (i.e., the amount of the window size) plus additional data according to the block size (β). Based on the result of the multi-classifier under the given system parameter values, the input data are provided to the input gates of the RNN model as in Eqs. (5) and (6), where the j-th memory cell has an input gate in_j and an output gate out_j, and the input gate's and output gate's activations at time t are y^{in_j}(t) and y^{out_j}(t), respectively [10].
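The best-effort retrain-and-compare loop described above can be sketched as follows; `train` and `evaluate` are toy stand-ins for the paper's RNN training and accuracy measurement:

```python
def best_effort_update(train, evaluate, window_data, blocks):
    """Best-effort approach: regenerate the model from the window data
    plus each newly arrived block, and keep the new model only when its
    performance metric m (here, accuracy) beats the best seen so far."""
    data = list(window_data)
    model = train(data)
    best_m = evaluate(model)
    for block in blocks:
        data += block                    # window + accumulated blocks
        candidate = train(data)
        m = evaluate(candidate)
        if m > best_m:                   # better accuracy: adopt the model
            model, best_m = candidate, m
    return model, best_m

# Toy stand-ins: the "model" is just the training-set size, and accuracy
# grows with data, so both upgrades (β = 20 and 40 traces) are adopted.
model, acc = best_effort_update(
    train=lambda d: len(d),
    evaluate=lambda m: m / 200.0,
    window_data=[0] * 100,
    blocks=[[0] * 20, [0] * 40],
)
```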

Unlike the standard LSTM, our LSTM has a threshold value δ against which a metric m is compared. For example, if we choose the metric m to be accuracy, then the threshold value δ is the best-so-far accuracy. This modifies our LSTM equations for the input gate and output gate as follows.
In this way, our model can be selectively updated based on the threshold δ.

The adaptive feature-engineering approach generates a new model based on updated feature sets while considering the same system parameters used for the best-effort approach. In other words, the adaptive feature-engineering approach changes the feature sets on top of the best-effort approach. In detail, based on the aforementioned threshold δ, our LSTM adaptively updates the features, seen as f in Equations (7) and (8).

As we discussed in Section III-A, the multi-classifier can improve system accuracy with quick data processing time compared to deep learning techniques while deleting ambiguous data from the collected data. Figures 3, 4, 5, and 6 show the ROC AUC from different machine learning models: Logistic Regression, Decision Tree, KNN, Random Forest, Multilayer Perceptron, Gaussian Naïve Bayes, and Gradient Boost.
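A minimal sketch of the threshold-gated, adaptive feature-engineering step: when the retrained model does not beat the best-so-far metric δ, feature selection is re-run and the model retrained on the new feature set. The helpers `train`, `evaluate`, and `select_features` are hypothetical stand-ins, not the paper's code:

```python
def adaptive_update(train, evaluate, select_features, data, features, delta):
    """If the model trained on the current features fails to beat the
    best-so-far metric delta, change the feature set and train again."""
    model = train(data, features)
    m = evaluate(model)
    if m <= delta:                        # metric did not improve
        features = select_features(data)  # adapt the feature engineering
        model = train(data, features)
        m = evaluate(model)
    return model, features, m

# Toy stand-ins: accuracy grows with the number of selected features, so
# the initial 3-feature model misses delta and triggers re-selection.
model, new_feats, m = adaptive_update(
    train=lambda d, f: len(f),
    evaluate=lambda mdl: mdl / 10.0,
    select_features=lambda d: ['f%d' % i for i in range(9)],
    data=[], features=['a', 'b', 'c'], delta=0.5,
)
```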

The proposed system utilized the Random Forest algorithm to achieve our data classification goal since it showed the best accuracy for the four different datasets, as shown in the benchmark results of this experiment. Note that the proposed system can also utilize more than one machine learning algorithm to create an ensemble method for the data classification problem, as presented in our previous work. In addition, the Random Forest algorithm generates feature importance, as explained in Section III-A and shown in Figure 1.

The results of hyperparameter tuning are shown in Table 2. To achieve the highest accuracy, we tested the learning rate and dropout rate. Based on those experiments, and as seen in Table 2, we conclude that the dropout rate is optimal at 0.15 for all four datasets, and the learning rate is optimal around 0.1 or 0.05, depending on the dataset. Overall, the learning rate carries more weight than the dropout rate; in other words, the learning rate is metric-elastic, whereas the dropout rate is metric-inelastic.
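The two-parameter sweep can be sketched as a simple grid search. The scoring function below is a synthetic stand-in that peaks at the reported optimum (learning rate 0.1, dropout 0.15) purely for illustration; the paper's actual scores come from training runs:

```python
import itertools

def grid_search(score, learning_rates, dropout_rates):
    """Sweep the two tuned hyperparameters and return the pair with the
    highest accuracy as reported by `score`."""
    return max(itertools.product(learning_rates, dropout_rates),
               key=lambda pair: score(*pair))

# Hypothetical accuracy surface peaking at (lr=0.1, dropout=0.15):
score = lambda lr, dr: 1.0 - abs(lr - 0.1) - abs(dr - 0.15)
best_lr, best_dr = grid_search(score, [0.2, 0.1, 0.05], [0.10, 0.15, 0.20])
```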

This paper differentiated the training size (i.e., window size ω) into three categories for the experiments: 50K traces for RNN1 (ω = 50K), 100K traces for RNN2 (ω = 100K), and 150K traces for RNN3 (ω = 150K). In other words, the proposed system generated three different models based on the three different window sizes. After generating the first model for each, given a time (Δt), the proposed system updates the generated model with two different block sizes, β = 20K or 40K traces. The block size is the amount of new real-time data that we feed into the model.

In the Kyoto 2006 dataset, the RNN2 case performs relatively best for the 20K block size, whereas the RNN1 case performs relatively best for the 40K block size. Under the 20K block size, RNN2 showed a 95.103% True Positive Rate, whereas RNN1 and RNN3 showed 91.545% and 92.908% True Positive Rates, respectively.

In the CIDDS dataset, the difference among the three cases is minuscule. In terms of block size, the 20K block size performs better than the 40K block size. True Positive Rates are at least 99.900% for both block sizes, but the rate for the 20K block size is relatively higher.

These experiments demonstrated that the data size for model building does not significantly impact system performance. Sophisticated system settings in the algorithm are the most important factor for generating the best model in real time with a small amount of data. Note that the experimental results in this section correspond to the best-effort approach; the adaptive feature-engineering approach showed similar results.

E. ATTACK IMPACT

The proposed system keeps updating the generated model depending on the system parameters, in order to select the best subset used as input for the subsequent RNN technique, as shown in Figure 1. The best subset included only high-quality data after deleting ambiguous data. Since the window sizes used for these experiments are much smaller than 500MB, the system time of the proposed system was less than 2-3 minutes. In detail, the 50K window size was 2MB, the 100K window size was 4MB, and the 150K window size was 6.1MB. With a higher-specification machine than ours, we expect that the system time to generate a model would drop significantly, to several milliseconds.

This paper first proposed a real-time NIDS based on the combination of an RNN and Random Forest with a reasonable data size. The goal of the proposed system is to continuously improve the generated models by reflecting network dynamics in real time while considering the system parameters and feature sets. This section discusses the advantages and disadvantages of the proposed system along with future work.

A. REAL-TIME MODEL BUILDING

To build a model in real time and to achieve the highest accuracy, the proposed system utilizes a machine learning technique first, to reduce data processing time, and then applies a deep learning technique to generate an accurate attack model from the well-classified selected data. As we discussed in the evaluation section, most deep learning techniques require substantial processing and model-building time while providing advantages over other methods, such as automation, scalability, and effectiveness. With no prior knowledge, most deep learning methods automatically create an attack model through multi-layer processing over a large data size (i.e., more than 1TB). However, such features are not useful in a real network environment, since network behavior changes dynamically over time. A pre-built model cannot continuously monitor ever-changing network behavior. In addition, it is not practical to build the attack model with such large data sizes due to time constraints. Thus, to build or update an attack model in real time, the system must create an accurate model from a small amount of high-quality data. To achieve this goal, the combination of the multi-classifier and deep learning solved two important issues: data classification and intelligent attack model generation in real time.

The training data size is important for building an accurate attack model. However, selecting the right high-quality data is the most significant task before model building. To build a real-time NIDS, the proposed system established three important system parameters: a window size (ω), a block size (β), and a random time (Δt). Based on these system parameters, the proposed system collects and selects training data in real time.
When we consider current network capacity (5G or 6G), the proposed system can collect enough data within milliseconds. Since a 10 Gbps network can transmit 1.25 gigabytes per second, the proposed system easily collects 50K to 100K traces (2MB to 10MB) in real time within a few milliseconds, as we discussed in the evaluation section. The Random Forest algorithm then performs data classification to identify ambiguous data that reduce system performance. Through the experiments, this paper recommends that the window size (ω) be from 50K to 100K; our experiments showed that data sizes larger than 100K did not provide the highest accuracy.
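A quick sanity check of the link-speed arithmetic, assuming the full 10 Gbps is available for collection (an idealization; real capture overhead would add to these figures):

```python
def collection_time_ms(data_bytes, link_gbps=10):
    """Time to receive `data_bytes` on a `link_gbps` gigabit link.
    10 Gbps = 1.25e9 bytes/s, so a few MB arrives in a few milliseconds."""
    bytes_per_sec = link_gbps * 1e9 / 8      # 10 Gbps -> 1.25 GB/s
    return data_bytes / bytes_per_sec * 1e3  # seconds -> milliseconds

t_small = collection_time_ms(2 * 10**6)   # 2 MB window (~50K traces)
t_large = collection_time_ms(10 * 10**6)  # 10 MB window (~100K traces)
```

Under these assumptions the 2MB window arrives in roughly 1.6 ms and the 10MB window in roughly 8 ms, i.e., on the order of milliseconds for the window sizes used here.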