In-Depth Feature Selection for the Statistical Machine Learning-Based Botnet Detection in IoT Networks

Attackers compromise insecure IoT devices to expand their botnets in order to launch more influential attacks against their victims. In various studies, machine learning has been used to detect IoT botnet attacks. In this paper, we focus on the minimization of feature sets for machine learning tasks that are formulated as six different binary and multiclass classification problems based on the stages of the botnet life cycle. More specifically, we applied filter and wrapper methods with selected machine learning methods and derived optimal feature sets for each classification problem. The experimental results show that it is possible to achieve very high detection rates with a very limited number of features. Some wrapper methods guarantee an optimal feature set regardless of the problem formulation, but filter methods do not achieve that in all cases. The feature selection methods prefer channel-based features for detection at post-attack, communication, and control stages, while host-based features are more influential in identifying attacks originating from bots.

more opportunities to exploit network vulnerabilities [3], resulting in various IoT-based botnet attacks [4], [5], [6]. The botnet, a large set of compromised machines controlled by attackers, is one of the strongest threats on the Internet and is used to perpetrate cybercrimes, such as launching DDoS attacks [4], stealing sensitive data [7], or distributing malicious spam [8]. As a result, botnets act as a source of malicious activity and usually threaten the availability of networks, in addition to causing other significant security consequences. It is therefore important to develop security countermeasures against botnet threats.

The associate editor coordinating the review of this manuscript and approving it for publication was Chin-Feng Lai.

A typical botnet life cycle has four phases: formation, command and control (C&C), attack, and post-attack [9]. Attackers spread malware that helps them recruit new bots (that is, members of botnets) during the formation phase. The C&C phase enables them to establish continuous communication with bots to control them for future actions. In the attack phase, attackers carry out malicious operations using bots. The post-attack phase covers activities related to the spread of IoT malware with the purpose of expanding the botnet. IoT networks constitute a lucrative target for botnet owners, as they can recruit large numbers of IoT devices, which are usually shipped with various security vulnerabilities.

One of the effective security countermeasures against botnets is to establish security monitoring systems to detect mali-

stages of the botnet life cycle. More specifically, the set of features that is effective in detecting malicious traffic at one stage may not be instrumental at another stage. Furthermore, the performance of models that use different feature selection methods can vary according to the classification formulation.

The crux of this paper is to find the optimal subset of features with the help of filter and wrapper feature selection methods for various classification formulations that can be applied to IoT botnet attack detection.
For this purpose, we have induced ML classifiers using four methods: extra trees classifier, random forest, decision tree, and k-nearest neighbor. The optimal feature sets are derived from filter and wrapper methods using 10-fold cross-validation with these classifiers. In this research, we applied the feature selection methods to two datasets, namely N-BaIoT [22] and MedBIoT [23], which include network activities belonging to different steps of the botnet life cycle in IoT networks. Based on the phases of the botnet life cycle given in [9], we can deduce that N-BaIoT has instances related to the attack phase, while MedBIoT covers the post-attack and C&C phases.

In addition to a binary classification, such as discriminating malicious traffic from benign traffic, it is possible to formulate various multiclass classification problems from these datasets. One such formulation may focus on the detection of the malware type that induces the malicious traffic (e.g., Mirai, Bashlite), which is applicable to both datasets, whereas a second one may deal with the attack type that is conducted by the corresponding malware. For the latter case, N-BaIoT provides labels on the types of attacks that originated from infected devices (e.g., UDP flooding, spam), and MedBIoT has labels on whether the activity belongs to the C&C or post-attack phase. Depending on the situation, security administrators may be interested in different aspects of detection to make more informed operational decisions. For example, identifying the type of malware on the infected device would be necessary to apply the correct malware removal procedures.
On the other hand, identifying the type of attack rather than the type of malware would be more essential for organizations that receive botnet attacks, as they need to develop defensive countermeasures to block or redirect network traffic accordingly. In our study, we investigate which feature sets are optimal for each binary and multiclass classification formulation and analyze whether there exist variations in the optimal feature set that may impact the design considerations of intrusion detection in such different contexts. This contribution is unique because, to our knowledge, there is no study that provides a deeper analysis of the variations in feature sets that are effective in intrusion detection at different stages of the botnet life cycle.

The structure of this research work is as follows. Section II presents background work and a review of the literature related to botnet detection and feature selection. Section III describes the feature selection methods and experiments. Our results are presented in Section IV. Section V discusses the main findings of this research work. Conclusions are drawn in Section VI.
the LAE and BLSTM classifier that achieved 100% precision, 93.17% MCC (Matthews correlation coefficient).

Alauthman et al. [34] have proposed a traffic reduction mechanism that integrates the reinforcement learning technique in three datasets. The first dataset is information security and objects technology (ISOT) that contains Storm Bot, Waledac Bot, and normal traffic. The second dataset comprises four legitimate P2P applications (Vuze, uTorrent, Frostwire and eMule) and three P2P botnets (Zeus, Storm and Waledac) [35], and the third is the ISCX dataset [32], which contains benign traffic. The authors have used real-world network traffic to evaluate their proposed approach and achieved a detection rate of 98.3% and a false positive rate of 0.012%.

Singh et al. [36] have developed a quasi-real-time intrusion detection system using open-source tools such as Hadoop, Hive, and Mahout to provide scalability for the identification of peer-to-peer botnet attacks. For this, the authors have built a packet capture module to process high data bandwidth in quasi-real time (within a 5-30 s delay) and developed a distributed dynamic feature extraction framework to illustrate network traffic statistics of packet captures. The parallel processing power of Mahout (that is, a machine learning library built on top of Hadoop) was used to build the Random Forest model that achieved a detection performance of 99% precision and recall.

Feature selection aims to find the best subsets of features from input data to achieve better prediction results by eliminating unnecessary features [37]. Feature selection methods are classified mainly into three categories: filter, wrapper, and embedded [14]. Filter methods utilize statistical methods to rank features according to their discriminatory power. They are usually applied in an initial step before inducing the models. In contrast, wrapper methods use a machine learning model to evaluate the merits of a given set of features in terms of model performance to identify the optimal set. Embedded methods blend the advantages of both the filter and wrapper methods, so that feature selection and training of the ML algorithm are performed in parallel. In this case, feature selection is an integral part of the classification or regression model.

Many feature selection approaches have been applied to evaluate the importance of features in the context of botnet detection. Entropy, impurity, RelieF and principal component analysis (PCA) [38] were used with the neural network classification algorithm. A 99.20% detection rate was achieved with the top 10 features, ranked by entropy, out of a total of 29 features in two botnet datasets, ISOT [39] and ISCX [32].

Velasco-Mata et al. [40] have tested feature sets of 5, 6, and 7 features with two filter methods, Information Gain and Gini Importance, over Decision Tree, Random Forest, and k-NN classifiers for multiclass botnet detection. The set of five features produced an 85% detection rate with a decision tree classifier induced for the QB-CTU13 [41] and EQB-CTU13 [41] datasets.

Random forest feature selection produced the highest detection rate, 99%, among all these feature selection methods.

The studies proposing feature selection do not create and compare the optimal sets that can be obtained for different multiclass problem formulations. In this paper, we address this gap by inducing various learning models for two datasets as explained in detail in Section II-C.

Table 1. More specifically, the features that are defined for each data point reflect the aggregated statistics of the raw streams of the network in five time windows (100 ms, 500 ms, 1.5 s, 10 s, and 1 min), which are coded L5, L3, L1, L0.1 and L0.01, respectively. There are five main feature categories: host-IP (traffic originated from a specific IP address, coded as H), host-MAC and IP (traffic originated from the same MAC and IP, coded as MI), channel (traffic between specific hosts, coded as HH), socket (traffic between specific hosts, including ports, coded as HpHp), and network jitter (time interval between packets in channel communication, coded as HH_jit). For each major category, the packet count and the mean and variance of packet sizes are calculated. Additional statistical values, namely the correlation coefficient (PCC) of packet size, radius, covariance, and magnitude, are derived for the Channel and Socket categories along with the packet count, mean, and variance.

In this paper, we used a specific notation to name the features. The feature name is the concatenation of three keywords. The first one represents the category type (e.g., MI, HH), the second one shows the time window, and the third one indicates the statistical measurement function. For instance, ''HH_L0.01_mean'' denotes the mean statistic of the channel category computed over the 1-min window.
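The naming convention above can be illustrated with a small parser. This is a minimal sketch, not part of the original tooling: the function name `parse_feature_name` and the dictionary layout are assumptions introduced here, while the window codes and category prefixes are those listed in the text.

```python
# Window codes and the time spans they denote, as listed in the text.
WINDOW_CODES = {
    "L5": "100 ms",
    "L3": "500 ms",
    "L1": "1.5 s",
    "L0.1": "10 s",
    "L0.01": "1 min",
}


def parse_feature_name(name):
    """Split a feature name into category, time window, and statistic.

    Hypothetical helper for illustration only.
    """
    tokens = name.split("_")
    # The window code is the first token matching a known code; everything
    # before it forms the category, everything after it the statistic.
    for i, token in enumerate(tokens):
        if token in WINDOW_CODES:
            return {
                "category": "_".join(tokens[:i]),
                "window": WINDOW_CODES[token],
                "statistic": "_".join(tokens[i + 1:]),
            }
    raise ValueError(f"no known time-window code in {name!r}")


print(parse_feature_name("HH_L0.01_mean"))
# {'category': 'HH', 'window': '1 min', 'statistic': 'mean'}
```

Locating the window code first, instead of splitting on fixed positions, keeps multi-token categories such as HH_jit intact.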

In this study, we have developed six different ML classification problems using these two datasets, as detailed in Table 2. The N-BaIoT dataset is used for three classification problems, namely, binary, 3-class, and 9-class. Binary classification discriminates malicious traffic from benign traffic. The 3-class formulation provides greater scrutiny of the malware type by classifying data points into the categories mirai, gafgyt, and benign. For the 9-class classification, the data points have been classified into different attack types: ack, benign, compact, junk, scan, syn, tcp, udp, and udpplain. These

The former identifies the types of malware, which can be instrumental in detecting infected hosts in an organizational setting.

The latter aims to discriminate between attacks carried out by bots, which better informs organizations that are targeted by such attacks.

MedBIoT is used for three classification formulations: binary, 3-class, and 4-class. As this dataset is collected at the C&C or formation phases, such formulations reveal which features are important in those phases. More specifically, 3-class addresses the identification of the phase (i.e., classes are benign, C&C and Spread), whereas 4-class aims to detect the malware category (i.e., classes are benign, Bashlite, Mirai, and Torii).

In this work, we have experimented with 20,000 samples of each class label for the addressed classification type. For example, if the classification problem contains two classes, we randomly selected 40,000 samples from the source dataset.
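The per-class sampling described above can be sketched as follows. This is an illustrative sketch only: the function `balanced_sample`, its parameters, and the fixed seed are assumptions made here, and the paper does not state how its random selection was implemented.

```python
import random
from collections import defaultdict


def balanced_sample(rows, label_of, per_class, seed=0):
    """Draw the same number of rows at random from every class.

    rows      -- any sequence of records
    label_of  -- function returning the class label of a record
    per_class -- rows to draw per class (20,000 in the text)
    """
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    sample = []
    for label in sorted(by_class):
        members = by_class[label]
        if len(members) < per_class:
            raise ValueError(f"class {label!r} has only {len(members)} rows")
        sample.extend(rng.sample(members, per_class))
    rng.shuffle(sample)
    return sample


# Toy data: 300 records, two classes; draw 50 per class (100 in total).
toy_rows = [(i, "mirai" if i % 3 == 0 else "benign") for i in range(300)]
subset = balanced_sample(toy_rows, label_of=lambda r: r[1], per_class=50)
print(len(subset))
```

With two classes and `per_class=20000`, the same call would return the 40,000 samples mentioned in the text.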

Within the framework of the present investigation, two types of feature selection methods are considered. The first is called the filter model, which evaluates a feature or a subset of features using a class-sensitive discriminating criterion [44]. These techniques do not depend on the particular classification algorithm. The second type of technique is the wrapper model. Techniques of this type use the characteristics of the specific classification algorithm to choose the feature set.
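To make the wrapper idea concrete, the sketch below shows a greedy sequential forward selection (SFS), one of the wrapper methods used later in this paper. It is a simplified illustration under stated assumptions: the `score` callback stands in for a model-based evaluation (for example, a cross-validated F1 score of a classifier trained on the candidate subset), and `TOY_WEIGHTS` and `toy_score` are hypothetical stand-ins, not the scoring used in the experiments.

```python
def sequential_forward_selection(features, score, k):
    """Greedily grow a feature subset of size k.

    features -- candidate feature names
    score    -- function mapping a list of features to a number
                (higher is better); in a wrapper method this would
                train and evaluate a classifier on that subset
    k        -- number of features to select
    """
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        # Try adding each remaining feature and keep the best one.
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical additive score used only to exercise the function.
TOY_WEIGHTS = {"f_a": 0.5, "f_b": 0.3, "f_c": 0.1}


def toy_score(subset):
    return sum(TOY_WEIGHTS[f] for f in subset)


print(sequential_forward_selection(list(TOY_WEIGHTS), toy_score, k=2))
# ['f_a', 'f_b']
```

Sequential backward selection (SBS) follows the same pattern in reverse: it starts from the full set and repeatedly removes the feature whose removal hurts the score least.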

In the domain of numeric feature sets, there are four main types of techniques. The first utilizes the linear correlations between the features. The second is based on the relationship between the inter-class and intra-class separation. The third uses entropy, and the fourth is based on the analysis of variance.

Based on Pearson's correlation coefficient (see (1)), this technique computes the collinearity matrix for the entire set of features to find redundant features. Pearson's correlation measures the linear relationship between two variables, and pairwise correlations between features are analyzed to find redundancy. The correlation coefficient P ranges between −1 and 1. Two features have a perfect positive correlation if P = 1, no linear correlation if P = 0, and a perfect negative correlation if P = −1. The formula for the Pearson correlation

In (1)

Fisher score [44] is designed for numeric features and measures the ratio of the average inter-class separation to the average intra-class separation. It is also referred to as Fisher's ratio [45]. Formally defined in (2) and denoted as F_s (not to be confused with the F1 score), its numerator calculates the average inter-class separation and its denominator calculates the average intra-class separation.
where µ_ij and σ_ij are the mean and standard deviation of the

Here, p(x, y) is the joint probability density function (PDF) of

In (5), p(x, y) denotes the joint probability mass function, and p(x) and p(y) are the marginal probabilities.
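Returning to the first two filter criteria, the sketch below computes a Pearson correlation between two features and a Fisher score for one feature. It is a hedged illustration: the Fisher score is written in one common form (class-weighted squared distance of class means from the overall mean, divided by the class-weighted variance), which is an assumption here and may differ in detail from Eq. (2); the function names are likewise illustrative.

```python
from math import sqrt
from statistics import fmean, pvariance


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences.

    Undefined (division by zero) if either sequence is constant.
    """
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def fisher_score(values, labels):
    """Inter-class over intra-class separation for a single feature.

    Assumed form: sum_j p_j (mu_j - mu)^2 / sum_j p_j sigma_j^2,
    where p_j is the fraction of samples in class j.
    Undefined if every class has zero variance.
    """
    overall_mean = fmean(values)
    n = len(values)
    between = 0.0
    within = 0.0
    for label in set(labels):
        group = [v for v, l in zip(values, labels) if l == label]
        weight = len(group) / n
        between += weight * (fmean(group) - overall_mean) ** 2
        within += weight * pvariance(group)
    return between / within


print(pearson([1, 2, 3], [2, 4, 6]))
print(fisher_score([0, 1, 10, 11], ["a", "a", "b", "b"]))
```

A feature pair with correlation close to 1 or −1 is a candidate for redundancy removal, while a high Fisher score indicates that the class means are far apart relative to the spread within each class.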

Mutual information values fall in the interval given below.
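As a hedged illustration for discrete variables, where mutual information is known to lie between 0 (independent variables) and the smaller of the two marginal entropies, the sketch below estimates both quantities from paired samples. The function names are assumptions made here, and the paper's features are continuous, so this is a simplified discrete analogue and not the estimator used in the experiments.

```python
from collections import Counter
from math import log2


def entropy(xs):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())


def mutual_information(xs, ys):
    """Mutual information (in bits) estimated from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
identical = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
print(independent, identical, entropy([0, 0, 1, 1]))
```

In the two toy cases, the independent pair sits at the lower end of the range and the identical pair reaches the marginal entropy.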
To make this paper self-sufficient, the main steps of the

3) To select the first feature, find f_i such that

For the computational experiments, the classical machine learning workflow was used. The initial datasets are large enough to provide samples that can be balanced with respect to all characteristics of the dataset: malware type, attack type, and device type. In the preprocessing step, balanced samples were drawn from the dataset of interest. Then, the data were divided into training and testing subsets in an 80/20 proportion. Initial experiments were run with the k-nearest neighbors classifier (kNN), decision tree classifier (DT), random forest classifier (RF), extremely randomized trees classifier (ET), logistic regression, support vector machine, and AdaBoost classifier; the last three demonstrated much lower performance and were excluded from further investigation. For each remaining classifier and feature selection technique, a ten-fold cross-validation was performed. To ensure the best configuration for each classification algorithm, a randomized search was used to find the optimal hyperparameters for each classifier. The range of hyperparameters is described in Table 3.

We use three criteria to evaluate the distinct subsets of features in both datasets. First, the F1 score is used to evaluate the set of features. Second, computational time is the total time it takes a computer with a particular processor to complete a task. Third, performance is computed as the ratio between the F1 score and the computational time. Intrusion detection systems must respond as quickly as possible without sacrificing accuracy. Response time is essential when thwarting the threat in its early stages would limit the degree of losses. For this reason, time must be considered along with the model metrics when evaluating any detection model. The F1 score (see Eq. (7)) is defined as the harmonic mean of precision (P) and recall (R) [51]. In this research work, precision is the fraction of correctly identified botnet samples among all samples identified as botnet.

In our experiments, we used the computational time to calculate the computational cost of classifying a sample. We did not consider the training time of the ML algorithms. We ran all tasks on the same CPU.
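The F1 score and the performance ratio described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the use of raw true-positive, false-positive, and false-negative counts, and the choice of seconds per sample as the time unit are introduced here and are not specified in the text.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from raw counts.

    tp -- botnet samples correctly identified as botnet
    fp -- benign samples wrongly identified as botnet
    fn -- botnet samples that were missed
    """
    if tp + fp == 0 or tp + fn == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def performance(f1, seconds_per_sample):
    """Ratio of F1 score to classification time (assumed unit: seconds)."""
    return f1 / seconds_per_sample


score = f1_score(tp=90, fp=10, fn=10)
print(score, performance(score, seconds_per_sample=0.001))
```

Because the ratio grows when the F1 score rises or the per-sample time falls, two models with similar F1 scores are separated mainly by how quickly they classify a sample.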

This section gives the experimental results of the learning models induced for the six classification problems listed in Table 2. We analyze the importance of the features obtained by filter

we select the best features based on their scores. Furthermore, we induce models with feature sets of increasing size to understand how many features are enough to exceed a 99% F1 score. Finally, we select the best 3, 5, and 3 features for the ANOVA, Fisher score, and mutual information methods, respectively (see Table 4). On the other hand, the wrapper methods usually select three features (for example, DT selects three features in each method), as presented in Table 5.

Almost all classifier and feature selection pairs produce a detection rate above 99%, as shown in Table 6. Based on the minimal feature set and computational performance, we selected three pairs and report more detailed performance results (accuracy, precision, recall, and F1 score) in Table 7. These pairs are DT with mutual information (three features), Fisher score (five features), and SBS (three features). DT with SBS achieves the highest performance metric, as shown in Fig. 3.

It is important to note that we computed the computational time of the models (i.e., the testing-time performance) after selecting the features in all filter and wrapper methods. Thus, the time required for feature selection is not reported in this paper: testing time is more significant than training time, because training is done less frequently and, when needed, ample resources can be assigned to it. In this sense, the calculated time can be affected by the number of features and the characteristics of the corresponding learning model. However, in our experiments, as expected, we observed that

TABLE 6. F1 scores for binary classification models using feature subsets (presented in Tables 4 and 5) of feature selection algorithms in the N-BaIoT dataset.

FIGURE 2. Computational time required to classify a sample by binary classification models on the N-BaIoT dataset using feature sets (see Tables 4 and 5) of feature selection methods.
detection, there is no clear increasing or decreasing pattern regarding the time duration, as the shortest duration, 100 milliseconds, also plays a significant role in the model performance.

In the N-BaIoT dataset, Mirai and Gafgyt malware are used to infect IoT devices. In this part, we report the findings of the three-class classification models that discriminate network traffic as Mirai, Gafgyt, and legitimate. We evaluated the pairs of feature selection methods and learning models according to the same performance metric we used for binary classification and present the F1 scores in Table 8. All pairs, except some KNN models, provide F1 scores above 99%. Pearson correlation still found 33 features. We identified six, three, and five features by using the filter methods Fisher score, mutual information, and ANOVA, respectively, as shown in Table 4. The wrapper methods mostly selected three features (see Table 5).

Table 9 shows the detailed performance metrics for DT and three feature selection methods: Fisher score, mutual information, and SBS. The detection performance is higher than 99% for all metrics.

FIGURE 5. Computational time required to classify a sample using 3-class classification models in the N-BaIoT dataset using feature sets (see Tables 4 and 5) of feature selection methods.

The optimal feature set selected by the mutual informa-

respect to filter methods, as the learning models with these selection methods require a very high number of features to achieve an F1 score greater than 99%. More specifically, 68, 28 and 59 features should be fed into the models when the Fisher score, mutual information, and ANOVA methods are used, respectively. However, 33 features are identified as not highly correlated by the Pearson correlation method. Wrapper methods show very interesting results. Although RFE provides high detection results using 20-28 features depending on the type of learning model, SFS and SBS achieved high detection with only three features.

Table 10 shows the F1 scores achieved by the feature sets in the 9-class classification. Except for KNN, all other models achieve more than 99% in all selection methods.

FIGURE 8. Computational time required to classify a sample using 9-class classification models on the N-BaIoT dataset using feature sets (see Tables 4 and 5) of feature selection methods.

FIGURE 9. Performance achieved by 9-class classification models in the N-BaIoT dataset using feature sets (see Tables 4 and 5) of feature selection methods.

TABLE 11. Accuracy, precision, recall, and F1 summary of classification results for mutual information and SBS features (DT with the 28-feature set and the 3-feature set, respectively) for 9-class classification over the N-BaIoT dataset.

The result of the overall performance metric indicates that SBS and DT are the best pair in the 9-class classification (see Fig. 9). Among the filter methods, DT with mutual information emerges as the leading performer.

The frequency analysis of the feature categories shows that the host-based features are still the most important category for the 9-class classification (see Fig. 10). However, more features of the channel category are selected compared to the binary and 3-class formulations. The contribution of the network jitter category is also more important in this classification task. This means that learning models need to resort to other features, which provide statistics about network activities between hosts and time intervals between network packets, to differentiate attack types. When many types of attack are considered, including various denial-of-service attacks, such features are instrumental in making a distinction between them. Time window analysis provides a similar distribution, except that shorter time intervals (i.e., 1.5 seconds, 500 milliseconds, and 100 milliseconds) have closer distributions to each other.

In this part, our objective was to discover a feature set that provides high performance for all classification models induced with the N-BaIoT dataset. Here, we do not claim to obtain the feature set that has been proven to be the best for all formulations, but we show that a working set is possible. For this purpose, we have tested the best feature sets of each classification in the other classification tasks. The best feature set obtained from the 9-class classification provided high detection rates for the remaining binary and 3-class classification tasks.
However, we were unable to obtain such high results in the reverse situation where binary or 3-class classification features are applied to a 9-class formulation. More specifically, the feature set {MI_dir_L0.01_mean, HH_L0.01_std, HH_jit_L0.01_mean} that is determined by the SBS and DT pair for the 9-class classification is utilized to induce models for all classification types, and we obtained the results given in Table 12. Except for the Junk and UDP classes in the 9-class formulation, all results are equal to or greater than 99%, demonstrating the effectiveness of this common set in all classification types.

Table 2 for the details of classification formulations).

Table 5). We present the F1 scores for all model and feature selection pairs in Fig. 11.

Although not all pairs exceed 98%, at least one learning model achieved this threshold for each feature selection method. In this dataset and in this formulation of the problem, SBS still provides the best performance metric, as shown in Fig. 13. The results presented in Table 13 indicate that SBS achieves a score greater than 99% in all performance metrics.

FIGURE 12. Computational time required to classify a sample by binary classification models over the MedBIoT dataset using feature sets (see Tables 4 and 5) of feature selection methods.
is compared to the selected feature sets in N-BaIoT, it is observed that the channel category is the dominant category instead of the host-based one. As MedBIoT covers malicious activities regarding the C&C and formation phases of the botnet life cycle, the features that characterize host-to-host communications become more important. In contrast, N-BaIoT, which covers the attack phase, can discriminate malicious activities based on host-based features.

Similar to N-BaIoT, MedBIoT does not show any specific pattern on time periods indicating whether longer or shorter periods are preferred. Although the longest period, 1 minute, provides the most discriminative features, the second-best category is 100 milliseconds, which is the shortest one.

FIGURE 13. Performance achieved by binary classification models over the MedBIoT dataset using feature sets (see Tables 4 and 5) of feature selection methods.

FIGURE 16. Computational time required to classify a sample using 3-class classification models on the MedBIoT dataset using feature sets (see Tables 4 and 5) of feature selection methods.

(see Table 4). SBS and SFS methods with any learning

of the selected feature set of one classification on the other classification problem. We identified that the feature set of the 4-class classification also performs well in all other classifications, as shown in Table 15.

In this study, it is shown that all the machine learning problem formulations realized for the detection of IoT botnet attacks in two datasets, N-BaIoT and MedBIoT, achieved a high detection performance of more than 99% with a limited number of features (i.e., 3 and 7 features).

In our experiments, we used various filter and wrapper methods for feature selection, in addition to four main machine learning methods to induce the models. In the case where we use filter methods, the results of feature selection are fed into the models. In wrapper methods, models are used directly for the assessment of feature subset alternatives. Performance evaluation was carried out based on the relationship between the F1 score and the computational time required to classify a sample. The wrapper method SBS with the DT model has achieved the most satisfactory trade-off between detection capacity and computational cost, exceeding the other feature selection and learning model pairs.

Using feature selection approaches, tree-based models (DT, ET, and RF) achieved the best results in all classification types for both datasets, especially in multiclass classification types. The k-NN classifier was not suitable for multiclass classification and also took the longest computational time to classify a sample compared to tree-based models.

Figure 23, the use of all host features achieves a perfect model with a 1.00 F1 score, while network jitter would be helpful for higher rates for the N-BaIoT dataset. However, the features of the channel category achieve 99% rates, and the host and network jitter categories would also be helpful for MedBIoT, as demonstrated in Figure 24.

Our results send a significant message to experts who design intrusion detection systems. The attacks originating from the bots (i.e., as simulated in the N-BaIoT dataset) can be easily detected by sensors that track the incoming and outgoing packet statistics without considering the destination of the traffic. However, the post-attack and C&C stages require the sensors to follow the sources and targets of traffic flows. Although some feature selection methods utilize the features of the socket category, the overall picture shows that the identification of receiving parties would be enough without using the source and destination ports.

Our comparison of the feature categories from the time interval perspective shows that the longest interval value, 1 min, contributes more features with a higher discrimination property to the selected sets.

We also compared our proposal with recent methods from the literature, and the results are summarized in Table 16. For the N-BaIoT dataset, the 9-class classification achieved better results with smaller subsets of features than

hyperparameters for each classification formulation and is summarized in Table 17.

Botnet attacks vary in shape and volume to deplete target resources across the entire IoT network. Therefore, to mitigate their critical impact, machine learning-based intrusion detection systems are developed to accurately classify botnet attacks.

In this work, we propose a reduced set of features to detect and classify malicious activities of popular IoT botnet malware. We identified six different binary or multiclass classification problems using two datasets, N-BaIoT and MedBIoT. We applied various filter and wrapper methods with four machine learning methods to these datasets. Finally, we derived an optimal set of features for each classification problem. To our knowledge, no detailed comparison between the optimal feature sets required for different classification problems of IoT botnet detection, which can vary depending on the stage of the botnet life cycle, has been done before.

We obtained very high detection rates for each classification problem with fewer features. The decision tree-based SBS takes less time to classify the samples with the highest detection rate. The wrapper methods SFS and SBS were effective in finding the optimal feature sets in each classification.

He is currently a Senior Research Scientist at the Department of Software Science, TalTech. He has published more than 100 articles in scientific. His research interests include human-machine interaction, analysis of human motions, and applications of AI to the problems of cybersecurity and geoscience.

HAYRETDIN BAHSI received the M.Sc. degree in computer engineering from Bilkent University and the Ph.D. degree in computer engineering from Sabanci University. He is currently a Research Professor at the Center for Digital Forensics and Cyber Security, Tallinn University of Technology, Tallinn, Estonia. His research interests include cyberphysical system security and the application of machine learning methods to cybersecurity problems.