Network Intrusion Detection via Flow-to-Image Conversion and Vision Transformer Classification

In recent years, computer networks have become an indispensable part of our lives, and these networks are vulnerable to various types of network attacks, compromising the security of our data and the freedom of our communications. In this paper, we propose a new intrusion detection method that converts network data flows into RGB images that can be classified using advanced deep learning models. In this method, we use the decision tree algorithm to identify the important features, and a windowing and overlapping mechanism to convert the varying input size to a standard-size image for the classifier. We then use a Vision Transformer (ViT) classifier to classify the resulting image. Our experimental results show that we can achieve 98.5% accuracy in binary classification on the CIC IDS2017 dataset, and 96.3% on the UNSW-NB15 dataset, which is 8.09% higher than the next best algorithm, the Deep Belief Network with Improved Kernel-Based Extreme Learning (DBN-KELM) method. For multi-class classification, our proposed method achieves a testing accuracy of 96.4%, which is 5.6% higher than the next best method, the DBN-KELM.

… the most accurate classification between the malicious and benign network flows so that we can detect new types of attacks?

To improve the performance and accuracy of our NIDS, we propose a method that converts a network flow pattern within a specific time interval into a two-dimensional image. One dimension of the image represents the various measurements of the network flow (i.e., features), and the other represents the values of these measurements over time. We then use image processing techniques to classify these images as malicious or benign. This approach allows all, or most of, the available information (i.e., measurements or features that may not be directly related to the attacks) to be used in the classification to reduce false positives.

In addition, some of the advances in image classification techniques can also help to improve processing speed. For example, if we use convolutional neural networks (CNNs), we can adjust the size of the mask and the stride of the convolution to achieve better computational efficiency. Similarly, using the multi-head attention mechanism in the Vision Transformer (ViT), we can focus our computational effort only on regions of interest.

B. CONTRIBUTIONS

The novelty of this work is that our proposed method selects the most important features in the network flows based on the most common types of network attacks and converts them into an RGB image representation. We then use state-of-the-art image processing techniques to classify these images, which use regional information to improve computational efficiency and to reduce false positives. The contributions of this paper are:

• Feature Selection. We propose an approach that uses decision trees to rank the various network flow measurements (i.e., features) for the different kinds of attacks and to select the most important ones for our classification algorithm.

• Flow-to-Image Conversion. We propose an approach to convert the network flows into an image by selecting the most important measurements and converting them into an RGB image representation.

• Classification using ViT. We investigated two approaches to classify the resulting image, CNN and ViT. We show that ViT gives the best results, surpassing other state-of-the-art algorithms.
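The feature-selection step in the first contribution can be illustrated with a minimal sketch, assuming scikit-learn and synthetic flow data; the data, labels, and the use of Gini importance for ranking are illustrative assumptions, not the authors' exact pipeline.

```python
# Illustrative sketch of ranking flow features with a decision tree and
# keeping the 24 most important ones (synthetic data, not the real datasets).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((500, 78))            # 500 flows x 78 flow measurements
y = rng.integers(0, 2, 500)          # 0 = benign, 1 = malicious (synthetic)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Rank features by impurity-based importance and keep the top 24.
top24 = np.argsort(tree.feature_importances_)[::-1][:24]
print(len(top24))                    # 24 selected feature indices
```

In a real run, `X` would hold the labeled training flows and the selected indices would be stored for use by the Data Pre-processing module.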

In this research, detection and mitigation of security attacks will be our main focus.

… network and classified as to whether it is a probing attack or not. This process is then repeated for the R2L attack and the U2R attack. However, this approach has the shortcoming that it should test for the most probable types of attack first; otherwise, the misclassification rate can be significant. The results of this approach have been shown to be comparable to the traditional multi-class approach, but it takes less time to run.

A more recent approach is to convert the network features into a 2D image and then apply CNNs to perform the classification [25]. In this approach, the authors converted each request flow to a 2D matrix, where the size of the matrix corresponds to the total number of data features to be examined (for example, a request flow of 121 features is converted into an 11 × 11 matrix). The 2D matrix is then converted to a 2D image. After the 2D image is formed, the authors apply CNN techniques to train on it and perform classification. Another approach [26] converts each request to an Alpha Red Green Blue (ARGB) image on the UNSW-NB15 and BOUN DDoS datasets. Although their results are superior to their predecessors [27], it is difficult for their method to provide outstanding performance over pre-existing single-request processing techniques.
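The per-request reshaping in [25] amounts to a single array operation; a minimal sketch, with illustrative feature values:

```python
# One request flow with 121 features reshaped into an 11 x 11 matrix,
# which prior work then renders as a 2D grayscale image.
import numpy as np

flow = np.arange(121, dtype=np.float32)   # illustrative feature values
img = flow.reshape(11, 11)                # 2D matrix -> 2D image
print(img.shape)                          # (11, 11)
```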

Despite the emergence of many new deep learning models, almost all of them still process data on a request-by-request basis, which cannot take advantage of the correlation between data requests in neighboring time frames. The combination and co-processing of these data within a certain time frame is the key to increasing classification accuracy and decreasing the false alarm rate.

A. OVERALL ARCHITECTURE

Figure 1 shows the overall architecture of our system. Our proposed solution consists of three main modules: the Data Pre-processing module, the Flow to Image Converter module, and the Classifier module. During the training phase of the system, we apply the decision tree classification process to the training data to find the importance of all the features, so that we can choose the most important features to be used in the conversion of the data flow to an image. During normal execution, incoming data first goes through the Data Pre-processing module, where we extract the important features found during the training phase. The pre-processed output then goes through the Flow to Image Converter module and is converted to a series of RGB images. These images are sent to the Classifier module, where the deep learning classifier determines whether they contain benign or malicious flows.

After determining the important features needed for our algorithm, we can use the Pre-processor module for intrusion detection. During the execution of the intrusion detection process, our Data Pre-processor module reads the input from a CSV file, where the rows of the CSV file represent the network flows and the columns represent the 78 features. First, the number of rows in the CSV file is checked to verify that there are enough rows to generate a square image (this is because the final module in our intrusion detection process, the Classifier module, requires a square image). Since we have identified 24 important features for the attacks that we are interested in (i.e., 24 columns), we need to make sure that the input CSV file has at least 24 rows. If not, new rows with all features set to 0 are appended to the CSV file.
Next, we remove all columns in the CSV file except the 24 that we have identified as important features for our intrusion detection system. This new CSV file becomes the input to the next module, the Flow to Image Converter module.

… (Red, Green, Blue). As N will be different in every CSV file, we propose to reshape the image from N × 24 × 3 to a square 24 × 24 × 3 image. This makes the design of the classifier in the next module easier, as the dimensions will be fixed. We use a windowing and overlapping mechanism to reshape the images, as shown in Figure 2.
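The two pre-processing steps above (pad to at least 24 rows, then keep only the 24 important columns) can be sketched as follows, assuming pandas; the `feat_i` column names are placeholders, not the real CIC IDS2017 feature names.

```python
# Sketch of the Data Pre-processor: pad short inputs with all-zero rows,
# then keep only the 24 important-feature columns (names are hypothetical).
import numpy as np
import pandas as pd

IMPORTANT = [f"feat_{i}" for i in range(24)]           # placeholder names
df = pd.DataFrame(np.ones((10, 78)),                   # 10 flows, 78 features
                  columns=[f"feat_{i}" for i in range(78)])

# Pad with zero rows until there are at least 24 rows (one per column).
if len(df) < 24:
    pad = pd.DataFrame(0.0, index=range(24 - len(df)), columns=df.columns)
    df = pd.concat([df, pad], ignore_index=True)

df = df[IMPORTANT]                                     # keep the 24 features
print(df.shape)                                        # (24, 24)
```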

As an example, Figure 2 shows how we take the raw dataset from the CSV file (left side) and slice it into 7 different 24 × 24 × 3 images with an overlap of 12 rows (i.e., half of the square 24 × 24 × 3 image).
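The windowing and overlapping mechanism can be sketched as a simple sliding window over the rows of the flow image; the `slice_windows` helper and the choice of N = 96 rows below are illustrative.

```python
# Sketch of the windowing/overlap reshape: an N x 24 x 3 flow image is
# sliced into 24 x 24 x 3 squares, stepping 12 rows (half-window overlap).
import numpy as np

def slice_windows(img, height=24, stride=12):
    """Return square height-row crops of `img`, stepping `stride` rows."""
    return [img[s:s + height]
            for s in range(0, img.shape[0] - height + 1, stride)]

flow_img = np.zeros((96, 24, 3), dtype=np.uint8)   # e.g. N = 96 rows
windows = slice_windows(flow_img)
print(len(windows), windows[0].shape)              # 7 windows of (24, 24, 3)
```

With 96 rows this yields exactly the 7 overlapping squares shown in the Figure 2 example.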

The first square image will be extracted from the first row of …

… field of image processing. ViT has been shown to compare somewhat favorably with the convolutional models. The input image is split into fixed-size patches; each patch is linearly embedded, combined with position embeddings, and the resulting vector sequence is fed to a standard Transformer Encoder. A learnable ''classification token'' is added to the sequence and then given to the classifier to classify the images. On the ImageNet ReaL dataset, the ViT model produces the highest result of 90.72%, compared to 90.55% for EfficientNet L2. The application of the ViT model in our NIDS is shown in Figure 4. We train our ViT in the same way as the linear ViT model of A. Dosovitskiy et al. [31]. Our ViT is first pre-trained on ImageNet-21K and then fine-tuned by stochastic gradient descent (SGD) with a momentum of 0.9. We then train all the ViT models using the AdamW optimizer with a learning rate of 2e-5.

… [37]. Table 3 shows the comparison between the different datasets for a deep-learning-based NIDS.
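The patch-splitting front end of ViT, applied to our 24 × 24 × 3 flow images, can be sketched with plain array operations; the 4 × 4 patch size below is an assumption for illustration, not necessarily the setting used in the paper.

```python
# Sketch of the ViT input stage: split a 24 x 24 x 3 image into fixed-size
# patches and flatten each one for the linear embedding (patch size assumed).
import numpy as np

img = np.zeros((24, 24, 3), dtype=np.float32)
P = 4                                            # assumed patch size
patches = (img.reshape(24 // P, P, 24 // P, P, 3)
              .transpose(0, 2, 1, 3, 4)          # group rows/cols per patch
              .reshape(-1, P * P * 3))           # one flat vector per patch
print(patches.shape)                             # (36, 48)
```

Each of the 36 flattened patches would then be linearly projected, given a position embedding, and prepended with the classification token before entering the Transformer Encoder.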

In this research, the two datasets that we used are CIC IDS2017 and UNSW-NB15. We chose these two datasets not only because of the variety of attack types in them, but also because they have all the features related to server identification and arrival times that we need.

… to train and evaluate our algorithm. Table 4 summarizes the binary class distribution of the images generated from both the CIC IDS2017 and UNSW-NB15 datasets.

The contribution of each type of attack to the images generated from the CIC IDS2017 dataset is shown in Figure 5. We run our model on a server with 2x Intel Xeon Silver 4214 CPUs and 8x NVIDIA RTX 5000 GPUs for both classifier architectures (i.e., CNN and ViT). The TensorFlow 2.3.0 framework is used to construct the classifiers. We experiment with binary classification on the RGB image dataset generated from both the CIC IDS2017 and UNSW-NB15 datasets, and multi-class classification on the images generated from only the CIC IDS2017 dataset.

We perform experiments with binary classification based on two labels: Benign and Malicious. We separated the dataset into three parts: training (70%), testing (20%), and validation (10%), and ran the training for 50 epochs on both the CNN and ViT architectures. The final accuracy and false positive rate of our proposed model on both the CIC IDS2017 and UNSW-NB15 datasets with an overlap size of 12 are shown in Table 5. Figure 6 shows the confusion matrices of both the CNN and ViT architectures for binary classification on the CIC IDS2017 dataset, and Figure 7 shows the same confusion matrices for the UNSW-NB15 dataset. The Receiver Operating Characteristic (ROC) curves for both the CNN and ViT classifiers are shown in Figure 8. … The comparison of our proposed method with these other methods on the binary classification task is shown in Table 6.
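The 70/20/10 split can be sketched with two successive calls to a standard splitter, assuming scikit-learn; the synthetic image arrays stand in for the generated RGB dataset.

```python
# Sketch of the 70% train / 20% test / 10% validation split (synthetic data).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.zeros((1000, 24, 24, 3), dtype=np.float32)   # stand-in flow images
y = np.zeros(1000, dtype=np.int64)                  # stand-in binary labels

# Carve off 70% for training, then split the remaining 30% as 2:1 test:val.
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X, y, test_size=0.30, random_state=0)
X_te, X_va, y_te, y_va = train_test_split(
    X_rest, y_rest, test_size=1/3, random_state=0)
print(len(X_tr), len(X_te), len(X_va))              # 700 200 100
```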

A graphical comparison of the binary classification accuracy is shown in Figure 9.

From Table 6, we observe that our method with the ViT classifier outperforms all the other methods in terms of precision and accuracy on both datasets (UNSW-NB15 and CIC IDS2017), except for the accuracy on the CIC IDS2017 …

The performance of multi-class classification on 20,000 test samples using our proposed method with the ViT classifier on the CIC IDS2017 dataset is shown in Table 7.

The confusion matrix for multi-class classification on CIC IDS2017 is shown in Figure 10.

… Patator attack and DDoS attack on the CIC IDS2017 dataset. A graphical comparison of the multi-class classification accuracy is shown in Figure 11. The accuracy of our proposed …

VOLUME 10, 2022

For the multi-class classification problem, our proposed method also outperformed the other state-of-the-art methods. Our method has an accuracy of 96.4%, which is 5.6% higher than the next best method, the DBN-KELM [37]. Also, our proposed method achieved 100% precision in two of the classes, Patator and DDoS.

However, the data imbalance in the dataset makes the results obtained from the multi-class classification less reliable. Some types of attacks in the dataset occur quite infrequently, so we do not have enough data to train the classifier correctly. Besides, the feature selection step in our method also depends on the data size of each category. If the data size of a category is small, the features that differentiate it from the other categories will have a lower information gain and thus be considered less significant. As a result, these features will not be selected by our algorithm, and the performance of our network intrusion detection method will degrade for attacks in these categories. Future work will focus on improving the proposed solution for datasets that are imbalanced across categories.