Anomaly Detection Based on CNN and Regularization Techniques Against Zero-Day Attacks in IoT Networks

The fast expansion of the Internet of Things (IoT) in the technology and communication industries necessitates a continuously updated cyber-security mechanism to keep protecting systems' users from any attack that might target their data and privacy. Botnets pose a severe risk to the IoT: they use malicious nodes to compromise other nodes inside the network and launch several types of attacks that cause service disruption. Examples of these attacks are Denial of Service (DoS), Distributed Denial of Service (DDoS), Service Scan, and OS Fingerprint. DoS and DDoS attacks are the most severe botnet-launched attacks in IoT; the botnet commands one or more previously compromised nodes in the network to direct traffic towards a specific node or service. This drains computational power, energy, or network bandwidth, causing specific services to shut down or behave unexpectedly. In this paper, we aim to verify the reliability of the detection approach when it encounters an attack that it was not trained on before. Therefore, we evaluate the performance of a Convolutional Neural Network (CNN) classifier in detecting malicious attack traffic, especially attacks never observed in the network before, i.e., zero-day attacks. Different regularization techniques, i.e., L1 and L2, have been used to address the problem of overfitting and to control the complexity of the classifier. The experimental results show that using the regularization methods yields higher performance on all evaluation metrics compared to the standard CNN model. In addition, the enhanced CNN technique improves the capability of IDSs to detect unseen intrusion events.


I. INTRODUCTION
IoT applications have expanded worldwide, offering users many services that make their personal lives easier. Even though IoT applications are powerful and give users and their data high connectivity over regular apps and systems, they are still vulnerable to a wide variety of attacks and threats [1]. These vulnerabilities stem from the architecture of the IoT ecosystem, which comprises heterogeneous layers of communication. In addition, power-consumption constraints impose low computational power that cannot handle proper cryptographic calculations between the network's nodes [2]. The elasticity of an IoT network (i.e., the continuous joining and leaving of unknown nodes) can also introduce vulnerabilities when securing an IoT network. Therefore, the security and privacy of users' data have become the primary concern in avoiding catastrophic data breaches. Figure 1 shows the ecosystem structure of a typical IoT application.

from communicating with each other to prevent launching any attack. Another alternative is to analyze the network traffic in a time-shifted manner, i.e., after an attack has started, and, depending on the analysis, block the suspicious bots participating in the attack. Moreover, some approaches rely on analyzing the power consumption of each IoT device in the network to check for malicious activity. While the real-time approach seems faster and offers higher protection, it is not always reliable for detecting unknown attacks. The time-shifted approach has a better chance, from a time perspective, to learn about traffic patterns, their behaviours, and their results.
In the last decade, several Machine Learning (ML) and Deep Learning (DL) techniques have been proposed to overcome the challenges of developing an effective intrusion detection system [10], [11]. ML algorithms such as Decision Tree (DT), Logistic Regression (LR), and Naive Bayes (NB) can provide acceptable results when the training and testing data have the same distribution. However, when they are tested on new data distributions (e.g., zero-day attack scenarios), those classifiers fail to provide the expected prediction performance [5], [12]. This is because classical ML-based methods have a low capability to learn the non-linear relationships between the various attacks, especially attacks that have a high degree of similarity with normal traffic. Furthermore, classical ML techniques mainly rely on feature engineering to select the best features for the attack classes. However, the best features can vary from one attack to another, and a feature that works for one attack class is not necessarily suitable for another. This situation can cause a high false alarm rate and low overall detection performance. On the other hand, Deep Learning (DL) has been widely used in different application domains (e.g., speech recognition, image processing). It has the capability to extract intensive features from raw data automatically, without prior knowledge [13]. The better performance of DL has led many enterprises, such as Google and Facebook, to use it in various applications. Its potential to obtain hierarchical representations of input data in many applications encourages many researchers to use it in cybersecurity tasks such as anomaly detection.
DL can learn the complex, non-linear structure of the input data, in contrast to shallow learners, which require hand-crafted features as input. As a result, we no longer spend time on feature engineering to select appropriate feature sets. However, few studies have investigated the applicability of DL for anomaly detection in IoT networks [14].

Our proposed IDS utilizes the CNN model for effective and early detection of SDN network threats, motivated by the success of CNNs in solving several difficult classification problems. In addition, CNN provides the concept of parameter sharing, which helps significantly to decrease the number of parameters of the detection model. Although

In general, regardless of the botnet's family or type, a typical botnet performs its attacks in systematic steps [21]. It starts with developing the malicious software, when the botmaster tries to build unwanted software that may be a virus, worm, spyware, trojan, or any other already-existing option if the desired malicious activity is implemented inside that option. In the second stage, the botmaster injects the malware into a victim's device through plenty of available options: spam emails, non-trusted websites, phishing applications, fake cracked versions of expensive software, etc. All these options take advantage of the victim's ignorance when he trusts a spam email sender or website publisher. After the malware has been injected, the botmaster controls the compromised machines by establishing communication sessions with Command-and-Control servers; the victim count can reach thousands, which makes individual control over all those victims difficult. The botmaster can exploit information from the victims' devices (infected bots) to start malicious activity such as online bank theft, blackmailing, or a group attack on a new victim, such as a DDoS attack. A botmaster must keep the communication with the bots going for as long as possible without being revealed to the infected device's owner, to keep the device compromised and achieve the malicious target for as long as possible.

values [25]. As a result, the big weights become near zero.
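The weight-shrinking effect of regularization described here can be made concrete with a small sketch. This is our own illustration, not the paper's code, assuming the standard L1 and L2 penalty forms; the function names are ours.

```python
import numpy as np

def l2_regularized_loss(base_loss, weights, lam, m):
    """Add the L2 penalty (lambda / 2m) * ||w||^2 to a base loss value.

    base_loss : mean of the per-sample losses L(y_hat, y)
    weights   : list of the model's weight matrices
    lam       : regularization strength (lambda)
    m         : number of training samples
    """
    penalty = sum(np.sum(w ** 2) for w in weights)
    return base_loss + (lam / (2 * m)) * penalty

def l1_regularized_loss(base_loss, weights, lam, m):
    """L1 variant: penalize the sum of absolute weight values instead."""
    penalty = sum(np.sum(np.abs(w)) for w in weights)
    return base_loss + (lam / m) * penalty
```

Because the optimizer minimizes the total loss, a larger lambda makes large weights more expensive, driving them towards zero.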
We aim to minimize the following cost function during the training process:

J(w, b) = \frac{1}{m} \sum_{i=1}^{m} L\big(\hat{y}^{(i)}, y^{(i)}\big)

where L is the loss function, w is the weight, and b is the bias. Using L2 regularization, the cost function becomes:

J(w, b) = \frac{1}{m} \sum_{i=1}^{m} L\big(\hat{y}^{(i)}, y^{(i)}\big) + \frac{\lambda}{2m} \lVert w \rVert_2^2

where λ is a parameter that can be tuned to control the regularization effect. With a large λ, the weight penalty will be large; similarly, a small λ will reduce the effect of regularization. This follows from the fact that the cost function must be minimized: by adding the squared norm of the weight matrix multiplied by λ, large weights are driven down in order to minimize the cost function.

In general, ML classifiers are widely used in attack detection techniques, as they have the ability to learn the patterns behind the data, which increases prediction quality [26]. In this section, a brief introduction of popular related classical ML and CNN classifiers is presented.

1) AdaBoost: a classification algorithm that is based on DTs and is similar to Random Forest (RF) [27]. AdaBoost creates multiple DTs with a predefined maximum depth, whereas RF creates DTs randomly without unifying the depth of the trees. The order of trees does not matter in RF, while it makes a difference in AdaBoost, since each learner (stump, or single DT) takes the decision of the previous stump into account.

2) Logistic Regression (LR): LR is a binary classification algorithm [28] that belongs to the ML family. It depends on fitting an s-shaped curve to a specific feature value. This s-shaped curve differentiates between the two binary classes of data; it also helps to assess whether a feature is useful for the prediction process or not.

3) Naïve Bayes (NB): NB [29] is an ML classification algorithm. It mainly depends on calculating the probabilities (also known as likelihoods for discrete data) of the training data features, as well as each class's probability in the training data. Through the training process, it determines the probability for each feature in the data, ready for the testing phase. In the testing phase, the classifier multiplies the calculated probabilities of the features present in the record to be classified by the probability ρ(A) once, and by ρ(B

and tested over the N-BaIoT dataset, which has normal and attack network traffic logs that were recorded using the Mirai and Bashlite botnets. They reduced the features from 115 to only two using the PCA reduction method. The algorithm depends on creating a separate E2G model for each MCU-based IoT device in the system to make the detection more resource-friendly. RF and DT showed the best results among the others, with results close to 100%. The disadvantage of this algorithm is that the model should be updated frequently when necessary, after being trained with data from the newly developed type of malware action. The update mechanism suggested by the authors, Over-The-Air (OTA) delivery, adds difficulties to the deployment process.

In [37], the authors studied an implementation of a new forensic mechanism using ML techniques to detect malware activity of a botnet in an IoT network. The study first explains how the existing solutions at that time were efficient but had a high false alarm rate. Their proposed scheme is a forensic mechanism that first collects the traffic from the network through the tcpdump tool. Then, from the collected traffic, a suitable feature set is extracted using the Bro and Argus tools.
Afterwards, to start the classification process, the data with the extracted features is exposed to four main ML algorithms: Association Rule Mining (ARM) [38], ANN [39], NB [40]

Datasets are important for verifying ML-based detection studies and approaches. For botnets, several datasets are available, but they are not usually sophisticated enough to be used. The limitations of these datasets vary: a low diversity of available attacks, traffic generated virtually rather than captured from real traffic, duplication of some records, unlabeled data, or a low number of features. Given these limitations, we chose the Bot-IoT dataset [17] to train, test, and verify our classifiers in the previous scenarios. This dataset contains data for multiple attacks: Service Scanning, Data Theft, Key Logging, OS Fingerprinting, DoS, and DDoS. All attack data are available over several communication protocols: UDP, HTTP, and TCP.

The dataset also contains real and simulated traffic, which is generated using real IoT services and attacking Virtual Machines (VMs) that are all connected using LAN and WAN networks. The IoT traces are also available in the dataset, and it provides 32 features. The generation testbed was designed and implemented in the Research Cyber Range lab of UNSW Canberra. The dataset is available in multiple formats: PCAP and CSV. The Bot-IoT dataset contains around 72 million records ordered in 74 files with the full set of 35 features.

The authors of the dataset applied the Information Gain algorithm [50] to select the 10 most effective features of the dataset and wrap them into additional filtered CSV files. However, in this work we only used 9 features among the selected top 10. Those features are presented in Table 2. Before executing the experiment, the data has to be prepared to be compatible with the chosen evaluation classifiers. The Bot-IoT dataset [17] also provides the data in pre-processed form. This form of the data is cleaned of records that do not have accepted values, and the non-numerical values are standardized into numerical values, all after the features have been reduced from 32 to the 10 best features using the IG algorithm. This file is available in CSV format, is around 520 MB in size, and contains around 3.7 million records of normal traffic and 6 other types of attack traffic. In our case, this file represents the starting point of the data preparation procedure. The target of this procedure is to cover the data needs of the three experimental scenarios, which are four CSV files with the following specifications:

1) Each file has 9 features.

2) The training .CSV file has DoS attack data and normal traffic data.

3) The testing .CSV contains DDoS attack data with normal traffic data as well.

4) The testing .CSV contains Service Scan attack data with normal traffic data.

2) Extract the 9 features specified in Table 2, in addition to the results column, which is the attack label.
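The per-scenario file split described above amounts to filtering the preprocessed CSV by attack category. The sketch below illustrates the idea on a toy frame; the feature column names (`f1`, `f2`) and the label column name are hypothetical stand-ins, not the actual Table 2 features.

```python
import pandas as pd

# Toy records standing in for the preprocessed Bot-IoT CSV; the feature
# columns ("f1", "f2") and the "category" label column are hypothetical.
df = pd.DataFrame({
    "f1": [0.1, 0.9, 0.4, 0.8, 0.2, 0.7],
    "f2": [1.2, 3.4, 0.9, 2.8, 1.1, 2.5],
    "category": ["Normal", "DoS", "Normal", "DDoS", "Normal", "DDoS"],
})

def split_for_scenario(frame, attack):
    """Keep normal traffic plus one attack class, per the file specs above."""
    return frame[frame["category"].isin(["Normal", attack])].reset_index(drop=True)

train = split_for_scenario(df, "DoS")    # training file: DoS + normal
test_a = split_for_scenario(df, "DDoS")  # scenario A testing file: DDoS + normal
```

In practice each resulting frame would be written out with `to_csv` to produce the four experiment files.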

In order for any classifier to work properly, it needs to be provided with sufficient data containing a reasonable count of each class (attack and normal), at least in the training phase.

Looking into the data distribution of the previous two files in

An unbalanced dataset biases the model toward the class that has the higher number of records in the training data, which significantly impacts the prediction quality for the minor classes. The reason is that in the learning phase the classifier has a bigger chance to learn about the attack data, while it cannot learn enough about the normal data. This makes the classifier tend to predict any given record as an attack record, even if it is a normal record.
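The bias described above is easy to demonstrate: a degenerate classifier that always predicts the majority class scores high accuracy while never detecting the minority class. This is a toy illustration of the pitfall, not code from the paper.

```python
# Toy labels: 90 attack records (1) and 10 normal records (0),
# mimicking a heavily unbalanced training file.
labels = [1] * 90 + [0] * 10

# A degenerate classifier that always predicts the majority (attack) class.
predictions = [1] * len(labels)

# Accuracy looks deceptively good on unbalanced data.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall on the minority (normal) class: fraction of normal records found.
normal_found = sum(1 for p, y in zip(predictions, labels) if y == 0 and p == 0)
minority_recall = normal_found / labels.count(0)
```

Here accuracy is 90% even though not a single normal record is ever identified, which is exactly why accuracy alone is misleading on unbalanced data.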

Therefore, an over-sampling tool is required to increase the number of normal records to be reasonably compatible with the attack data in each file. One solution might be replicating the normal data records in the file until reaching a balanced state. However, this solution raises another problem, over-fitting [51], because in this case the classifier trains over the same records multiple times, which results in memorization of the records instead of learning and understanding. The Synthetic Minority Oversampling Technique (SMOTE) [52] was used to over-sample the minor class (normal data) in our files.

SMOTE increases the number of minority-class records without replicating existing ones. It does so by treating the minority-class records as points in the feature space and identifying the feature vector of each one. Afterwards, the nearest neighbour of each point is found, and a new point is generated that lies on the line connecting the original point and its nearest neighbour. The new point's position depends on a random number between 0 and 1 representing a fraction of the connecting line's length.
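The interpolation step just described can be sketched directly. This is a minimal, single-neighbour illustration of the SMOTE idea (our own simplification; the actual technique considers k nearest neighbours):

```python
import numpy as np

def synthesize(minority, idx, rng):
    """Generate one synthetic sample for minority[idx], as SMOTE does:
    find the nearest neighbour and interpolate a random fraction along
    the line segment between the two points."""
    point = minority[idx]
    dists = np.linalg.norm(minority - point, axis=1)
    dists[idx] = np.inf                      # exclude the point itself
    neighbour = minority[np.argmin(dists)]
    r = rng.random()                         # fraction of the connecting line
    return point + r * (neighbour - point)
```

Each synthetic record therefore falls somewhere between an existing minority record and its nearest neighbour, rather than duplicating either one.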

In our case, we used the SMOTE method to increase the normal class record count in our files to achieve a ratio of 4:10 normal:attack, as shown in Table 4.

function. Different regularization techniques, i.e., L1 and L2, have been used to solve the problem of overfitting and to enhance the model performance in zero-day attack detection. We also compared the performance of the CNN with three ML techniques, namely LR, NB, and AdaBoost. Additionally, two dropout layers were used, one before the flatten layer and one before the fully connected layer, to further reduce the likelihood of overfitting. The trained model was later tested on the new portion of the data to show how it performs with data that has not been observed.
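The dropout layers mentioned above work by randomly zeroing a fraction of the activations during training. A minimal NumPy sketch of the mechanism (inverted dropout, a standard formulation; this illustrates the concept rather than reproducing the paper's CNN layers):

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a random fraction `rate` of units during
    training and rescale the survivors by 1/(1-rate) so the expected
    activation stays the same; at inference time the layer is a no-op."""
    if not training or rate == 0.0:
        return activations
    keep = 1.0 - rate
    mask = rng.random(activations.shape) < keep
    return activations * mask / keep
```

Because each forward pass sees a different random subset of units, the network cannot rely on any single activation, which discourages memorization of the training records.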

In this section, we define the metrics and measures to be calculated for each classifier based on its results. These metrics help in the process of evaluating and comparing the classifiers.

by plotting the true positive rate (TPR) against the false positive rate (FPR).
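The standard metrics used in such comparisons can all be computed from the confusion-matrix counts. A self-contained sketch (function and key names are ours):

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard evaluation metrics from confusion-matrix counts:
    true/false positives (tp, fp) and true/false negatives (tn, fn)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0   # also the TPR
    fpr = fp / (fp + tn) if fp + tn else 0.0      # x-axis of the ROC curve
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "fpr": fpr, "f1": f1}
```

Sweeping the decision threshold and recording (FPR, TPR) pairs at each setting traces out the ROC curve whose area gives the AUC.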

In this work, we executed all experiments on a workstation machine with the following specifications: Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz-2.40 GHz, Windows 10 Pro 64-bit operating system, and 8.0 GB of RAM. We used the Python programming language v3.8 from Anaconda with various libraries; the libraries used in our experimental work are presented in Table 7. Anaconda also includes the Python interpreter, many useful libraries, and the Spyder IDE. The machine specifications are also illustrated in Table 6.

After defining the statistical parameters and the proper evaluation metrics, the detection algorithms associated with the above scenarios were run to demonstrate and compare the performance of each one. In the following subsections, we present the experimental results of the three different scenarios.

For the ROC curves and AUC of the CNN models, Figure 5 demonstrates that CNN L2 also achieved the best AUC, with 98.94%, followed by AdaBoost with 97.91%. Figure 6 visually compares the accuracy values and ROC scores of all classifiers. It confirms that CNN L2 was the best classifier with respect to accuracy and ROC score.

For the last scenario, i.e., C, the testing data contains the Service Scan attack. In this scenario, CNN L2 kept the best rank on the classification metrics among CNN L1 and the other classifiers as well, as presented in Table 10. However, unlike scenarios A and B, AdaBoost significantly dropped its

close to 94%, while the second best classifier was CNN L1 with a value of 91%. On the other hand, NB had the lowest ROC score, with only 70%.

Finally, a visual comparison of the classifiers' accuracy values and ROC scores is presented in Figure 8. It clearly shows that the L2 regularization method boosted the performance of the CNN L2 classifier above the other CNN classifiers and the ML-based classifiers.

The classifiers were tested on zero-day attacks in stages. They were first trained on DoS attack data, then tested on DDoS attack data, which has a high degree of similarity with the training distribution. As expected, in this scenario A, both classical ML-based methods and CNN methods performed well in the testing phase, except for NB. When we started to test on attacks that have less similarity with the training data (scenarios B and C), prediction quality dropped in general compared to scenario A. AdaBoost performed well in scenario B, but in scenario C it had a significant drop in performance. All in all, the regularized CNN models performed the best among the other classifiers used.