Visualizing Realistic Benchmarked IDS Dataset: CIRA-CIC-DoHBrw-2020

An Intrusion Detection System (IDS) dataset is crucial for detecting the lateral movement of cyber-attacks. It is used to train the IDS classifier model to achieve the earliest possible detection. A good, near-realistic public dataset is essential to the development of advanced IDS classifier models. However, the available public IDS datasets have long been under scrutiny because they fail to reflect real low-footprint cyber threats, fail to render real-time network scenarios, fail to reflect recent malware attacks over the newly developed DoH protocol, disregard layer 3 information and, finally, yield contradictory classification and analysis results between studies, which makes the results non-reproducible and non-shareable. This problem can be addressed by carefully visualizing a new realistic, real-time, low-footprint and up-to-date benchmarked dataset. Visualization helps to detect data deformation before an optimized and highly accurate classifier model is designed. Therefore, this study aims to review a new realistic benchmarked IDS dataset and apply sophisticated techniques to visualize it. The review starts by carefully examining production network features. These are then compared with various well-established public IDS datasets. Many of these datasets consist of static, unrealistic meta-features and disregard source and destination Internet Protocol (IP) information; the CIRA-CIC-DoHBrw-2020 dataset is the exception. The study then applies the Eigen Centrality (EC) technique from graph theory to visualize this layer 3 (L3) information. Finally, using visualization techniques such as Principal Component Analysis (PCA) and Gaussian Mixture Model (GMM), the study further analyzes and visualizes the data. Results show that CIRA-CIC-DoHBrw-2020 simulates recent malware attacks and is a very imbalanced dataset, which reflects realistic low-footprint cyber-attacks. The centrality graph clearly visualizes the IPs that are compromised by recent DoH attacks in real time, and the study concludes decisively that a smaller packet length of 1000 to 2000 bytes fits an attack trait.


I. INTRODUCTION
Intrusion Detection Systems (IDS) are constantly challenged by interconnected, ever-changing zero-day cyber-attacks. These stealthy attacks are almost undetectable by conventional IDS technology and firewalls [1]. Hence it is critical to develop an advanced monitoring system and irreplaceable solution to detect unknown malware [2], [3].

The associate editor coordinating the review of this manuscript and approving it for publication was Vicente Alarcon-Aquino.

From the literature as shown in Table 1, most of the major works focus on developing IDS classifiers, and it is apparent that fewer works have been done in the area of IDS dataset review and visualization. Hence, preliminary systematic reviews of the public IDS datasets were conducted at the early stage of this research. Several notable public datasets have been surveyed which expose several advantages

It is worth highlighting that a few studies have shown contradictory results from their classification models. For instance, some studies have shown that RF achieved the highest accuracy, as in [10], in contrast to [11], which reported that NB achieved excellent performance. The irony is that both use the same CICIDS2017 standard dataset. Some contradictory analyses were also spotted. For instance, the NSL-KDD dataset is used to detect low-frequency attacks, although it is well known that NSL-KDD is not an inclusive depiction of a contemporary low-footprint attack, as stated in [8].

This suggests that this domain might have non-reproducible and non-shareable results, as suggested by [4]. It also indicates that various classification models are event-specific and have to be handled case by case. However, the good news is that this leaves plenty of room and opportunities for future improvements, particularly in the area of data pre-processing and data visualization. This issue serves as one of the reasons why this study is conducted.

In this study, since a real attack is made up of multiple frames and network packets, visualization through statistical analysis and a machine learning approach is introduced. This will reduce the misclassification and contradictory analysis issues highlighted in problem number 5.
The discussion on the visualization approach of this study is expounded from here on. Visualization in essence helps to dissect these complex network datasets into a visual format. This assists the training process of IDS classifiers and eventually the development of an advanced classifier that applies state-of-the-art machine learning techniques. Visualizing a dataset also helps to detect data deformation before the dataset is used to train the classifier, so that an optimized, highly accurate model can be achieved. From the literature, a few pre-processing techniques were applied, such as PCA, t-SNE, k-Means, ADASYN, SMOTE, the min-max method, shrunken centroid and a few others. These techniques were applied for various reasons, for instance to resolve issues in imbalanced datasets, for feature reduction and also for visualization.

Since the lack of layer 3 information is apparent in the previous studies, as highlighted in problem number 4, this feature will be visualized in this study. Layer 3, or the network layer, is an essential feature in networking. Many underlying patterns can be revealed from this feature. For that reason, the Eigen Centrality (EC) visualization concept from graph theory will be applied. The outcome of this analysis contributes to the discovery of the degrees of centrality. This centrality pattern is drawn from the interaction of the IP addresses. It eventually reveals the source of lateral movement or the attack vector.

Several other approaches are utilized in this study to enhance the visualization analysis. These include PCA and GMM, a probabilistic clustering method related to k-Means. Pre-processing techniques for visualization like PCA and GMM are crucial to address data deformation problems that might exist prior to testing the dataset against the classifier model [6].
In addition, various visualization techniques such as bar plots, skewness and outlier distributions were also applied. As stated before, since a production network has no labels, this approach helps to highlight a few coefficient values that can serve as a target feature in an unsupervised IDS classifier model. In this study, meta-data features and raw-data features are treated equally.

VOLUME 10, 2022

There is no differentiation between processed flow information and raw labels.

On problem number 1: deficiencies in reflecting modern network threats, problem number 2: lack of datasets that reflect real-time or real networks, and problem number 3: lack of study on malicious attacks over DoH, protocol involved in

layer 3 information is mandatory in any cyber-attack analysis and network intrusion studies.

and how it contributes to the contradictory analysis issues. Subsequently, state the urgency of having a realistic real-time IDS dataset.

This section systematically reviews various related studies on IDS datasets. It contains three subsections. Section A explains the gaps in the studies. It summarizes the related works and describes the gaps as an extended problem statement. Then, Section B clarifies the benchmarked dataset of CIRA-CIC-DoHBrw-2020. Finally, Section C explains the layer 2 frame, which clarifies the differentiation between benchmarked dataset and ground-truth dataset column labels.

The authors in [6] visualized the UNSW-NB15 security dataset on malicious DoS attacks. They applied several pre-processing algorithms such as PCA, t-SNE, k-Means distance clustering, shrunken centroid, the Elastic Net algorithm and Mahalanobis distance to examine the IDS dataset. They discovered two main issues: 1) an imbalanced dataset and 2) overlapped labels. This information was crucial to address problems that might exist prior to testing the dataset against the developed classification model. However, the study did not process data that were specific to network infrastructure, such as IP addresses.

The authors in [12] proposed an intrusion detection model that integrates deep learning techniques. The NSL-KDD and CICIDS2017 datasets were used to train and test the model. Both have been adopted by many studies during the evaluation process. Adaptive Synthetic Sampling (ADASYN) was applied to resolve the imbalanced dataset issue. Other pre-processing steps that were applied include k-Means and t-SNE. The classifier was modelled using a Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM) and Random Forest (RF) for binary classification. The main objective of that study was an IDS classification model.

Similarly, the authors in [13] introduced a classification model that applies an improved CNN, known as Split Module CNN (SPCCNN), and ADASYN, which is used to augment

excellent performance from NB classifier.

This shows that this domain has non-reproducible and non-shareable results, as suggested by [4]. It also indicates that various classification models are event-specific and have to be handled case by case. Hence, there is still plenty of room for future improvements in this domain, particularly on IDS datasets and visualization.

In [9], the authors applied a data generation model named Synthetic Minority Oversampling Technique (SMOTE) to increase the efficiency of the IDS model. Data from the minority class were oversampled to increase the average data size. This method essentially uses the k-NN algorithm to generate new data. The final machine learning model, with a few fixed hyper-parameters, was then tested on the CSE-CIC-IDS2018 dataset. There was clearly an imbalanced data size in each class.

The authors in [14] worked on an intrusion detection machine learning model over imbalanced datasets. They proposed a Difficult Set Sampling Technique (DSSTE) algorithm to separate an imbalanced dataset into a difficult set and an easy set. The algorithm used the "edited" Nearest Neighbor and subsequently applied k-NN to compress the majority samples. This compressed majority was then combined with the easy set to produce a whole new dataset. To verify the performance of the classifier, CSE-CIC-IDS2018 and NSL-KDD were used to train the model. The authors used t-SNE to visualize these datasets.
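The oversampling idea behind SMOTE, as summarized above for [9], can be illustrated with a minimal sketch. It assumes Euclidean distance, a brute-force nearest-neighbor search and a toy two-feature minority class; the function name `smote_oversample` and the sample values are hypothetical and are not taken from [9].

```python
import random


def smote_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbors (SMOTE-style)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Brute-force k-NN search among the other minority samples.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])
        # The new point lies on the segment between base and neighbor.
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic


# Hypothetical minority-class samples (two scaled flow features each).
minority_samples = [(0.10, 0.20), (0.15, 0.25), (0.20, 0.22), (0.12, 0.30)]
new_points = smote_oversample(minority_samples, n_new=6)
print(len(new_points))
```

Because every synthetic point is an interpolation between two existing minority samples, the new data stay inside the region already occupied by the minority class, which is the property the oversampling relies on.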

The authors in [15] proposed a detection model called SAVAER-DNN, which applied an auto-encoder with a regularization technique to detect low-frequency attacks. The model was evaluated against benchmarked datasets from NSL-KDD variants and UNSW-NB15. The work in [15] then applied the Uniform Manifold Approximation and Projection (UMAP) technique to visualize the spatial distribution of original and synthetic samples. A few pre-processing techniques for data scaling and one-hot data encoding were performed.

The authors in [16] proposed an intrusion detection approach known as Intrusion Detection Based on Feature Graph (IDBFG). It started by generating filtered normal connections using grid partitions and subsequently recorded those patterns with a graph structure. The behavioral pattern arising from the graph indicates intrusion traits. The model was evaluated against the KDD-Cup 99 dataset, the older version of NSL-KDD. The result was compared against Support Vector Machine (SVM) and Decision Tree (DT). However, NSL-KDD is not an inclusive depiction of a contemporary low-footprint attack environment [8].

The process included three phases, which were 1) Data normalization using min-max method, 2) Feature and 3) Attacks'

The authors in [19] offered a cloud network intrusion model based on Bi-LSTM and an attention mechanism. This was claimed to be an effective measure to address the problem of learning attack patterns, particularly attacks in massive and high-dimensional data. Such massive data with high dimensionality can be found in the complex and variable nature of production network traffic. In [19], the public dataset KDDCup99 was used to analyze the efficacy of the IDS classifier. Data were first normalized using the min-max method. However, according to [8], KDDCup99 suffers from redundant records in its training set.

Table 1 summarizes the important characteristics of the past related works. The discussion is available in the following section, which establishes the study gaps.

From Table 1, most of the major works were done on developing IDS classifiers, and fewer works have been seen in the area of IDS dataset review and visualization. Those classifier models manipulate various IDS datasets, which is discussed in the next paragraph. A few pre-processing techniques were applied in previous works, such as PCA, t-SNE, k-Means, ADASYN, SMOTE, the min-max method and a few others. These techniques were applied for

similar representation applies to CSE-CIC-IDS2018 benchmarked dataset.

There seem to be no clear-cut similarities or differences between the popularly cited benchmarked datasets and PCAP features. Hence the comparison requires some expert judgment and field experience. These benchmarked datasets are claimed to closely resemble real-world network data, similar to a ground-truth dataset. However, none of them includes the network layer information (L3) of source and destination IP, except for CIRA-CIC-DoHBrw-2020. This is highlighted in Table 2.

In conclusion, a few problems were identified from the previous related studies. The first problem is that the majority of the works emphasized classifier development.

In contrast, less effort has been put into data preprocessing,

utilizing NSL-KDD to train on detection model of modern low-frequency attack. As mentioned in [8], NSL-KDD is not a depiction of a low-footprint attack. This suggests that this domain has non-reproducible and non-shareable results, as suggested by [4]. It also indicates that various classification models are event-specific and have to be handled case by case.

B. CIRA-CIC-DoHBrw-2020 DATASET
Domain Name System (DNS) has several security loopholes and has been a great concern for cybersecurity researchers. More sophisticated exploits have been introduced to compromise DNS servers over the years. To counter some of the issues related to DNS vulnerabilities, DNS over HTTPS (DoH) was introduced by the IETF in 2018. It works by encrypting DNS queries and sending them over a covert tunnel. This DoH transaction has been replicated in CIRA-CIC-DoHBrw-2020. CIRA-CIC-DoHBrw-2020 is a synthetic dataset that aims to evaluate DoH traffic in a network environment.

The network topology implements a two-layered approach, which is used to generate normal and attack DoH traffic along with non-DoH traffic. DoH traffic is generated by accessing the top 10,000 Alexa websites. The traffic is subdivided into non-DoH, benign-DoH and malicious-DoH. Non-DoH traffic is generated through the HTTPS protocol. Benign-DoH is non-malicious DoH traffic that is also generated through HTTPS and is accessed by clients that use the Mozilla Firefox and Google Chrome web browsers. These two browsers support the DoH protocol. Finally, malicious-DoH is generated using tools like dns2tcp, DNSCat2 and Iodine.

Fig. 1 shows the network diagram that is used to capture the DoH traffic. In the first layer, traffic with normal web browsing activity that involves benign DoH is generated through the web browsers. This generates non-DoH HTTPS and benign DoH traffic, which was then captured by a few web servers. In the second layer, malicious DoH was generated by a mixture of tools to be captured by the malicious DNS server and the DoH server. The generated traffic was then captured for the pre-processing phase. The web browsers utilized various public DoH resolvers. To utilize these resolvers and various capturing tools, the Firefox web browser was connected to GeckoDriver and the Chrome web browser to ChromeDriver. The generated traffic was captured by tcpdump. A Python script that uses Scapy was developed as a DoH traffic flow generator and analyzer. A tool named DoH Data Collector was then mounted to simulate different sets of DoH tunneling incidents.

The DoH server infrastructure is implemented using the Adguard, Cloudflare, Google and Quad9 platforms. For non-DoH and benign DoH, the generated packets amounted to 48952 Kbytes. On the other hand, the malicious packets that were generated amounted to 219458 Kbytes of traffic. The transmission rate is set randomly between 100bps and 1100bps. The dataset document provides lists of IP addresses used to generate non-DoH, normal DoH

Table 2 show column time: the time at which the frames were captured. Time here, however, measures delta time, with microsecond resolution, from the sequence of completed handshake network transactions. Then there are the source and destination IP address columns. These are valuable network layer information (L3). They show the communication between the packet originator and the intended recipient.

The next column is protocol, the set of rules used in network communication. The frame length column is the size on the wire, in bytes, of a particular transaction. Finally, there is the info column, which is not included in Table 2. This column provides a further description of a particular packet in text form. It is usually difficult to process this column in a classification machine learning (ML) training program, hence it is safe to drop it. PCAP features carry no attack or normal labels, which calls for unsupervised ML training. Here, extra features like source port and destination port (L4 information) and a few more from the frame field information can be added as columns. They are added as additional filters and sometimes through careful examination and deep packet inspection. There are more vital OSI layer components that need to be added.
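To make the column layout concrete, the sketch below parses one row of a Wireshark-style CSV export into a record and drops the free-text info column. The column names, the `PacketRecord` class and the sample row are assumptions for illustration only; they are not taken from the dataset documentation.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class PacketRecord:
    time: float        # delta time in seconds
    source: str        # source IP address (L3)
    destination: str   # destination IP address (L3)
    protocol: str
    length: int        # frame length in bytes


def parse_packets(csv_text):
    """Parse CSV rows into PacketRecord objects, ignoring the Info column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        records.append(PacketRecord(
            time=float(row["Time"]),
            source=row["Source"],
            destination=row["Destination"],
            protocol=row["Protocol"],
            length=int(row["Length"]),
        ))
    return records


# Hypothetical export with the columns discussed above.
sample = (
    "Time,Source,Destination,Protocol,Length,Info\n"
    "0.000123,192.168.20.144,9.9.9.11,TLSv1.2,1514,Application Data\n"
)
packets = parse_packets(sample)
print(packets[0])
```

Additional fields such as source and destination port could be added to the record in the same way once they are extracted from the frame.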

These vital OSI layer components reside in the data link layer, which encapsulates most of the information from the upper layers and provides the function to transfer Protocol Data Units (PDU) between nodes. It serves requests from the network layer and directs them to the physical layer. During this transmission, data can be successfully received and acknowledged. However, the transfer can sometimes become unreliable. In those cases, protocols above the data link layer perform error checking, acknowledgments and retransmission. The frame includes application layer protocol information, transport layer protocol number (either TCP or UDP) information, source and destination IP (layer 3) information, source and destination Media Access Control (MAC) address information, source and destination port numbers and, finally, a checksum.

Firstly, the dataset will be processed through EC, a centrality density method that is applied in Graph and Network theory.

vertex $v = v_1, \ldots, v_n$, where the matrix $A$ is a square $n \times n$ matrix with elements $A_{iw}$. $A_{iw}$ is the element for an edge from $v_i$ to $v_w$; it is denoted as 0 if there is no edge. Consequently, all the diagonal elements of this matrix are zero, since a vertex is not connected to itself (there is no loop).
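As a rough illustration of the graph step, the sketch below builds the adjacency matrix A from a few source and destination IP pairs and then estimates degree centrality and eigenvector centrality, the measures defined in this section. The edge list is invented for illustration, the graph is treated as undirected, and power iteration on A + I is one common way to approximate the leading eigenvector; none of this is claimed to be the study's actual implementation.

```python
import math


def build_adjacency(edges):
    """Return (nodes, A) for an undirected graph given (source, destination) pairs."""
    nodes = sorted({ip for edge in edges for ip in edge})
    index = {ip: i for i, ip in enumerate(nodes)}
    n = len(nodes)
    A = [[0] * n for _ in range(n)]
    for src, dst in edges:
        if src != dst:  # no self-loops, so the diagonal stays zero
            A[index[src]][index[dst]] = 1
            A[index[dst]][index[src]] = 1
    return nodes, A


def degree_centrality(A):
    """Degree of each node divided by the maximum possible degree n - 1."""
    n = len(A)
    return [sum(row) / (n - 1) for row in A]


def eigenvector_centrality(A, n_iter=200):
    """Approximate the leading eigenvector of A by power iteration on A + I."""
    n = len(A)
    x = [1.0 / n] * n
    for _ in range(n_iter):
        # Multiplying by A + I keeps the eigenvectors of A but avoids
        # oscillation on bipartite graphs such as a star.
        y = [x[i] + sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return x


# Hypothetical flows: one client talking to three resolvers.
edges = [
    ("192.168.20.144", "1.1.1.1"),
    ("192.168.20.144", "8.8.4.4"),
    ("192.168.20.144", "9.9.9.11"),
]
nodes, A = build_adjacency(edges)
deg = dict(zip(nodes, degree_centrality(A)))
ec = dict(zip(nodes, eigenvector_centrality(A)))
for ip in nodes:
    print(ip, round(deg[ip], 3), round(ec[ip], 3))
```

In this toy star graph the client node receives the highest score on both measures, which is the kind of signal the centrality graph is meant to surface for heavily connected IPs.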

The next step is to calculate the degree of the graph, $d$. This is calculated by counting the number of edges that are connected to a particular vertex. It is denoted by (3), where $v$ is the number of vertices or nodes (IP addresses) and $e$ is the number of edges (links between source and destination IP).

The next step is to calculate the degree centrality. The degree centrality of a node $v$ is the fraction of nodes it is connected to. The values are normalized by a factor $s$, obtained by dividing by the maximum possible degree in the graph, $n-1$, as shown in (4), where $n$ is the number of nodes $v$ in the graph $G$. Hence the degree centrality is calculated by (5).

Next, Eigenvector Centrality (EC) is computed. EC computes the centrality of a node according to the centrality of its neighbors, and so measures the influence of a node in a network. For the given graph $G = (v, e)$, where $v$ are the vertices and $e$ are the edges, let the adjacency matrix be $A = (a_{v,w})$, where $v$ and $w$ are two different vertices. When $a_{v,w} = 1$, $v$ and $w$ are connected to each other, and when $a_{v,w} = 0$, they are disconnected. The relative centrality $x_v$ of a node or vertex $v$ is denoted as in (6), where $M(v)$ is the set of all neighbors of node $v$, $\lambda$ is a constant, and the sum over the relative centralities $x_w$ of the neighbors $w$ of $v$ is written as $x_v = \frac{1}{\lambda} \sum_{w \in V} a_{v,w} x_w$. This can be simplified into the vector notation of the eigenvector equation, as denoted in (7).

The next step is to calculate the Shortest Path or Betweenness Centrality (BC). The BC of a node $v$ is computed by summing the fractions of all-pairs shortest paths that pass through $v$. It is expressed in (8).

It can be done iteratively to find the Expectation step (E step) and the Maximization step (M step). The E step, $Q(\theta \mid \theta^{(t)})$, is computed by (11), where $E$ is the expected value, $Z \mid X, \theta^{(t)}$ is the distribution of $Z$ given $X$ and the current estimate of the parameters $\theta^{(t)}$, and $\log L(\theta; X, Z)$ is the log-likelihood function of the parameter $\theta$ with respect to these. The M step is denoted by (13):

$$\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)}) \qquad (13)$$

which finds the parameters that maximize this quantity.
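The E step and M step can be illustrated with a minimal expectation-maximization loop for a one-dimensional mixture of two Gaussians. This is a simplified sketch rather than the study's model: the two-component setup, the initialization at the minimum and maximum of the data, the variance floor and the toy packet-length values are all assumptions.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_gaussians(data, n_iter=50, var_floor=1e-6):
    """Fit a two-component 1-D Gaussian mixture with plain EM."""
    n = len(data)
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    weights = [0.5, 0.5]
    means = [min(data), max(data)]  # assumed initialization
    variances = [overall_var, overall_var]
    for _ in range(n_iter):
        # E step: responsibility of each component for each point,
        # computed with the current parameters theta^(t).
        resp = []
        for x in data:
            joint = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M step: closed-form updates giving theta^(t+1).
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            variances[k] = max(
                sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk,
                var_floor,
            )
    return weights, means, variances


# Hypothetical packet-length means (bytes) forming two groups.
lengths = [90, 100, 110, 95, 105, 1400, 1500, 1600, 1450, 1550]
weights, means, variances = fit_two_gaussians(lengths)
print([round(m, 1) for m in means], [round(w, 2) for w in weights])
```

The responsibilities computed in the E step play the role of the soft cluster assignments, and the M step re-estimates the weights, means and variances from them.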

DNS over HTTPS (DoH) is a relatively new protocol that was introduced in 2018. It aims to reinforce the security and privacy of DNS requests by sending them over an HTTPS channel. Many trusted web browsers such as Firefox, Safari, Chrome and Edge have adopted DoH. DoH combats DNS data manipulation, Man-in-the-Middle (MitM) attacks and eavesdropping.

Despite that, it also suffers from other security breaches such as spoofing. Spoofing leads to data exfiltration and C&C attacks through malware proliferation. The CIRA-CIC-DoHBrw-2020 DoH dataset demonstrates security flaws in DNS such as DNS tunneling and DNS-based malware. These flaws can bypass firewalls. Hence detecting DoH threats is crucial. Dataset features here are defined as flow information or processed meta-data. Table 4 below shows the output of data.info() for the CIRA-CIC-DoHBrw-2020 dataset.

From Table 4, there are 35 columns altogether (indexed from 0 to 34), with entries indexed from 0 to 167516. There is one column of boolean datatype (dtype), 26 columns of float64, five columns of int64 and three columns of object datatype. The memory usage to process these 167k entries is about 44 Mbytes.
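The reported memory footprint can be cross-checked with simple arithmetic. The sketch assumes that each float64, int64 and object-pointer value occupies 8 bytes and each boolean 1 byte, which is how a shallow per-column estimate is typically computed; the index overhead and the actual string contents of the object columns are ignored.

```python
n_rows = 167_517  # entries indexed 0 to 167516

# Column counts by datatype, as listed above.
columns = {"bool": 1, "float64": 26, "int64": 5, "object": 3}

# Assumed bytes per value (object columns counted as 8-byte pointers).
bytes_per_value = {"bool": 1, "float64": 8, "int64": 8, "object": 8}

total_columns = sum(columns.values())
total_bytes = n_rows * sum(columns[t] * bytes_per_value[t] for t in columns)
total_mib = total_bytes / 1024 ** 2

print(total_columns)        # 35
print(total_bytes)          # 45732141
print(round(total_mib, 1))  # 43.6
```

Under these assumptions the estimate is roughly 43.6 MiB, in line with the figure of about 44 Mbytes quoted above.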

The source and destination IP have so far not been found in any benchmarked dataset except CIRA-CIC-DoHBrw-2020. Fig. 3 shows its description. Most of the features' mean values lie at the floor level, except for the PacketLength information. These are attributed to values ranging from the minimum to 50%. A back wall can also clearly be seen, containing vertical values that range from 70% of sizes to the maximum. Those features come from FlowBytes and PacketLengthVariance. Most of these back wall features come from the raw features and, in contrast, most of the features below the floor level are the processed features or meta-data.

as shown in Fig. 4(b) has a bimodal shape that shows the

Similarly, Fig. 7 shows the evolution of three packet types, namely PacketLengthMedian, PacketLengthVariance and PacketLengthMean. At the 06th hour of both dates (31 March 2020 and 01 April 2020), those packet types show increases in their sizes. They reach up to $1000 \times 10^6$ bytes (1000 Mbytes). This is another indication of how normal traffic behaves. Again, it is demonstrated here that normal DoH traffic has an outlier distribution, which is usually off the mean and reaches its maximum sizes.

Fig. 8

Graph and network models are used to understand this DoH network as well as the relationships among its IPs. This will better assist in visualizing the dataset and subsequently in performing clustering and classification tasks. Fig. 9

Fig. 11 shows the count of Source IP addresses against the DoH label to further support the graph given in Fig. 9. Previously, Fig. 9 showed that most of these IPs were the source of DoH attack traffic. From Fig. 11, it is known that the most attack traffic was generated from IP 192.168.20.144. The very least attacks generator is from the host IP

FIGURE 11. SourceIP count vs DoH (label).

Fig. 12 shows the count of DestinationIP against the DoH label. The destination host most heavily compromised by the DoH attack is 9.9.9.11. This host is also marked as an attack generator. It has the characteristics of a Command and Control (C&C) server, which can transmit and serve exploit traffic concurrently. A compromised host with a C&C exploit is also known as a Zombie.

Fig. 13 shows the generated graph model of the CIRA-CIC-DoHBrw-2020 dataset. Graph G, which was introduced in Section III (a), is shown in Fig. 13(a). It shows all the nodes, v, and their edges, e. To understand the centrality information of the graph G, the degree of the nodes was measured. Fig. 14

nodes ['1.1.1.1', '9.9.9.11', '176.103.130.130', '8.8.4.4', '151.101.2.49'] define the most important features, i.e. the most traffic travels in and out of these nodes. Again, these are all IPs that have been described as attack generators and destination nodes.

In Fig. 17d), PacketLengthMean of size 1000 to 2000 bytes fills the exact same spot as the attack traffic characterized in Fig. 17c). Hence, a smaller packet length seems to fit attack traffic. This is vital information, as it assists in designing an unsupervised IDS classifier model.

Then, Fig. 17a) and b) show the PC graph for the DoH dataset with hue information from SourcePort and DestinationPort. Apparently those two figures do not demonstrate traits similar to those shown in Fig. 17d). However, they reveal a few important attributes. For instance, the small cluster is entirely populated by the DestinationPort, which has been labelled as a malicious destination. Hence, the destination has mostly been compromised by malware.
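For readers unfamiliar with how such PC graphs are produced, the following sketch performs PCA on two features in closed form. It is a toy illustration under stated assumptions: the feature pairs are invented, only two dimensions are handled, and the study's actual feature set and tooling are not reproduced here.

```python
import math


def pca_2d(points):
    """Return (first principal axis, projected scores, explained variance ratio)
    for a list of (x, y) pairs, using the 2x2 sample covariance matrix."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # Sample covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    mid = (a + c) / 2.0
    spread = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    lam1, lam2 = mid + spread, mid - spread
    # Eigenvector for the largest eigenvalue.
    if abs(b) > 1e-12:
        v = (b, lam1 - a)
    else:
        v = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    pc1 = (v[0] / norm, v[1] / norm)
    pc2 = (-pc1[1], pc1[0])  # orthogonal second axis
    scores = [(x * pc1[0] + y * pc1[1], x * pc2[0] + y * pc2[1]) for x, y in centered]
    total = lam1 + lam2
    ratio = lam1 / total if total > 0 else 0.0
    return pc1, scores, ratio


# Hypothetical, perfectly correlated feature pairs (y = 2x).
features = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
pc1, scores, ratio = pca_2d(features)
print(pc1, round(ratio, 3))
```

Plotting the two score columns against each other and coloring the points by a feature such as SourcePort would give a scatter of the same kind as the PC graphs discussed here.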

Meanwhile, the west region of the biggest cluster is entirely populated by the SourcePort, as shown in Fig. 17a), which is also the source for benign hosts. In this region, the majority of the hosts have been infected by malware.

This model, on the other hand, has unearthed three clusters, as depicted by the graphs in Fig. 18. These 2D graphs show dataset features with GMM values against the DoH attack and benign labels. Cluster 0 has a count of 94877 points. Cluster 2 is the second highest with a count of 39915, and finally cluster 1 has a count of 32725. The counts from these three clusters sum to 167,517, which is the total number of entries.

0 and cluster 1 have noticeable plots, whilst cluster 2 in some regions has the least plots. However, it is still recognizable.

Fig. 18d) shows PacketLengthMean clustered in 1, 2 and 3.

Fig. 19 shows the boxplot graph for SourcePort, which is also clustered into three classes. Most of these clusters have a SourcePort mean ranging from port 40000 to port 50000. In Fig. 18a), cluster 1 populates both attack and benign traffic. This is also coherent with the finding shown in Fig. 17a), where some of the SourcePort values are safe and originate from benign traffic. On the other hand, the majority of the cluster 1 ports populate attack DoH traffic, as can be seen in Fig. 17b). Only one port from this range, based on the GMM model, is considered benign.

The advancement in security mechanisms revolves around protection and detection systems. The Intrusion Detection System is still an important security technology in the network and identity perimeter. It is used to detect classical and zero-day attacks in corporate networks. It also provides just-in-time reporting during the investigation and response process. However, the available public IDS datasets are impractical: they fail to reflect real cyber threats, to render real-time network scenarios and to reflect recent malware attacks, they disregard layer 3 information and they yield contradictory results. This problem can be addressed by carefully visualizing a new realistic, real-time, low-footprint and up-to-date benchmarked dataset. Visualization helps to detect data deformation before an optimized and highly accurate classifier model is designed. This study aims to review a new realistic benchmarked IDS dataset and apply sophisticated techniques to visualize it. The study then applies the Eigen Centrality (EC) technique from graph theory to visualize the layer 3 (L3) information. Finally, it uses various visualization techniques such as Principal Component Analysis (PCA) and Gaussian Mixture Model (GMM). Results show that the centrality graph clearly visualizes the IPs that are compromised by recent attacks in real time, and the study concludes decisively that a smaller packet length of 1000 to 2000 bytes fits an attack trait.
The authors declare that there are no conflicts of interest to report regarding the present study.