Enhancing Intrusion Detection in IoT Communications Through ML Model Generalization With a New Dataset (IDSAI)

One of the fields where Artificial Intelligence (AI) must continue to innovate is computer security. The integration of Wireless Sensor Networks (WSN) with the Internet of Things (IoT) creates ecosystems with attractive attack surfaces that are vulnerable to multiple, simultaneous attacks. This research evaluates the performance of supervised machine learning (ML) techniques for detecting intrusions based on network traffic captures. This work presents a new balanced dataset (IDSAI) with intrusions generated in attack environments in a real scenario. The dataset is provided in order to contrast model generalization across different datasets. The results show that, for intrusion detection, the best supervised algorithms are XGBoost, Gradient Boosting, Decision Tree, Random Forest, and Extra Trees. When trained and tested on ten specific intrusions (such as ARP spoofing, ICMP echo request Flood, TCP Null, and others), they reach up to 94% accuracy in binary form (intrusion and non-intrusion) and up to 92% accuracy in multiclass form (ten different intrusions and non-intrusion). In addition, models trained with the IDSAI dataset achieve up to 90% accuracy when predicting on the Bot-IoT dataset.


I. INTRODUCTION
The interconnection of the Internet of Things (IoT) and Wireless Sensor Networks (WSN) has gained relevance thanks to its contribution to the development of smart cities in domains such as transportation, mobility, economy, industry, and health, among others [1]. Most of these domains require processing capabilities closer to where the data originates. According to IoT Analytics, there are expected to be more than 30 billion IoT connections by 2025, corresponding to four IoT devices per person [2]. Also, with the exponential growth of IoT technology solutions connected to the cloud, new security and privacy threats related to data and services make them an attractive surface for intrusions. In the same way, network security can be threatened by limited resources such as storage capacity, processing speed, memory limitations, the power of end devices, and the use of wireless communications by hosts (which are vulnerable due to their ease of access) [3], [4].
The associate editor coordinating the review of this manuscript and approving it for publication was Shaohua Wan.
Wireless networks are more vulnerable to attacks due to their transmission medium, which poses a challenge to existing security mechanisms in their attempt to mitigate emerging threats. For this reason, a number of different solutions have been proposed in the academic literature [5]. For example, efficient autonomous defense systems have been proposed that use machine learning techniques in devices at the perimeter of the network [6]. Similarly, creating an intelligent cybersecurity support architecture has been explored [7], as well as using Machine Learning-based resource management techniques in fog computing platforms [8]. Other approaches include intrusion detection and prevention systems applied to new trends and applications in IoTs and related areas like WSNs, Mobile Ad Hoc Network (MANET), and Connection Point Services (CPS) [4]. Intrusion detection and prevention systems are considered a second line of defense. However, as new attack techniques emerge, it is necessary to develop systems with optimal performance and low resource consumption [9], [10].
Researchers have used anomaly-based network intrusion detection models with Deep Learning (DL) in airports [11], intrusion detection models for cyber security in Agriculture 4.0 [12], and an edge intelligence framework to prevent DoS attacks in WSN [13]. They have also identified malicious traffic with anomaly detection techniques and DL detection systems with auto-encoders [14].
Some difficulties arise from the high complexity and high consumption of computational resources of detection systems deployed in networks [9], and also from the low reliability in the quality and accuracy of collected data and the loss of services and information [15].
Unauthorized incursions into the system are called intrusions or attacks. A user can intrude internally or externally. In an internal attack, a user with privileged access obtains restricted information and gains control of the system or network. An external intruder seeks permission to access the system or network arbitrarily in order to steal vital information from a company and gain control of the system or network [16]. The main functions to be performed by an Intrusion Detection System (IDS) include: i) identifying an intruder, ii) notifying the location of an attacker, iii) logging abnormal movements, iv) minimizing or interrupting malicious actions, v) alerting the administrator of the security intrusions, and vi) detecting the type of intrusion [10].
In view of the issues mentioned above, the design and implementation of anomaly-based IDSs employing ML continue to be addressed and to evolve. The pipeline must include a network configuration established in an environment where particular attacks can be sent in a controlled way. The network traffic data is captured to create a dataset. The dataset is divided into training and testing sets, with which ML models are trained and validated. At this point, it is possible to detect and report intrusions; according to the defined control mechanisms, decisions are then made and alarms are generated.
Some advances in the field come from automatic classification techniques that are increasingly accurate in identifying abnormal patterns or anomalies in IDS modeling to reduce the false alarm rate [17]. Other advances include the development of datasets (Bot-IoT) for network forensics [18], building ML models to identify IoT network attacks [19], IDSs and the comparison of ML classifiers [20], intrusion detection models with supervised and unsupervised algorithms [21], and ML models in anomaly-based IDS using the CICIDS2017 and NSL-KDD datasets [22], [23].
This study contributes significant advancements to the field of cybersecurity in IoT networks from various perspectives. Firstly, it introduces a new dataset called Intrusion Detection System Artificial Intelligence (IDSAI), obtained in a real and balanced attack environment. Secondly, it compares the classification capabilities of an IDS based on machine learning for detecting attacks in an IoT system, evaluating the performance of eight machine learning algorithms, including Extreme Gradient Boosting or XGBoost (XGB), Gradient Boosting (GB), Decision Tree (DT), Random Forest (RF), and Extra Trees (ET). These algorithms are experimented with in three different scenarios to select the most effective ones. Also, essential feature selection techniques are employed to enhance the classification process. Thirdly, explanatory artificial intelligence is utilized to provide insights into the Machine Learning (ML) classification models and identify the most relevant features for each intrusion class. Finally, the proposed system is evaluated through cross-validation of datasets.
The main contributions of this research can be summarized as follows:
• Generation of a novel and balanced dataset (IDSAI) obtained from a real-based attack setting. The IDSAI dataset includes ten different types of intrusions and non-intrusion data, providing a valuable resource for intrusion detection research in IoT networks and enabling an accurate evaluation of intrusion detection algorithms. In addition, the IDSAI dataset encourages research and comparison of results between different studies, thus contributing to the security and protection of IoT networks in the real world.
• Comparison of the classification performance of our proposed Intrusion Detection System across binary and multiclass scenarios using eight machine learning algorithms. This analysis enables the evaluation and comparison of machine learning models' effectiveness in detecting and classifying attacks on IoT systems. By identifying the most efficient algorithms, the response capacity and protection of IoT systems against threats can be enhanced, achieving accurate detection of attacks.
• Study of essential features for binary classification and by attacks (multiclass classification) using machine learning techniques. By studying these features, we gain insights into the key factors that contribute to accurate intrusion detection, enhancing our understanding of the underlying patterns and characteristics of different types of attacks.
VOLUME 11, 2023
• Evaluation of the generalization power of the models through a cross-validation strategy that shows the effectiveness of attack detection models in IoT networks. We assess the performance of the trained models by making predictions on the Bot-IoT dataset, achieving an unbiased evaluation greater than 90% in scenario 3 when using models trained with the IDSAI dataset. This validation approach supports the robustness and effectiveness of our proposal in detecting attacks in real-world IoT networks, with a significant impact on improving the security and protection of IoT networks against threats and attacks.
This study addresses the problem of intrusion detection in IoT network traffic by introducing the IDSAI dataset, conducting comparative analysis of IDS classification performance, exploring essential features, and evaluating model generalization. Our findings contribute to the field of cybersecurity in IoT networks and have practical implications for enhancing network security.
The remaining sections of this paper are organized as follows: Section II describes related work, which mainly uses public datasets. Section III explains the proposed dataset, methodology, models, and metrics used in this work. Section IV presents and discusses the main results obtained for classifying intrusions. Finally, Section V presents the conclusions and future work.

II. RELATED WORKS
The following studies apply anomaly-based detection techniques using ML to identify security intrusions in IoT networks. Table 1 shows a comparison of the related works.
Reference [17] implemented techniques such as K-Nearest Neighbors (KNN) using the Decision Tree Method (DTM) and K-Means to reduce the false alarm rate in the IDS, with the KDD'99 dataset. KNN achieved an accuracy of 96.55% and a recall of 93.67%. The K-Means model obtained an accuracy of 92.30% and a recall of 91.58%.
Reference [20] compared Logistic Regression (LR), Multinomial Naive Bayes (MultinomialNB), Gaussian Naive Bayes (GNB), KNN, DT, RF, MLP, and GB classifiers. The metrics to validate the binary and multiclass scenarios are accuracy, precision, and F1-score. The dataset used in the experiment was UNSW-NB15. The results showed that the Random Forest classifier outperforms the other models in terms of accuracy at 87%, precision at 98%, and F1-score at 84%.
Reference [21] compared supervised learning models with the NSL-KDD and CICIDS2017 datasets. On the NSL-KDD dataset, the RF and KNN algorithms generated the best performances, with accuracy, recall, F1-score, and precision up to approximately 76%, 96%, 77%, and 65%, respectively. On the CICIDS2017 dataset, RF achieved an accuracy of up to 93%, and recall, F1-score, and precision of up to 84%.
Reference [22] carefully reviewed research on IDS with AI and employed supervised ML algorithms, which included Artificial Neural Network (ANN), DT, KNN, Naive Bayes, RF, SVM, Convolutional Neural Network (CNN), K-Means, Expectation-Maximization (EM), and Self Organizing Map (SOM) algorithms. For the experiment, they used the highly imbalanced multiclass CICIDS2017 dataset. As a result, they obtained that KNN, DT, and Naive Bayes models are the best for intrusion detection for the CICIDS2017 dataset (99% of accuracy). It is possible to detect all web attacks using a single algorithm.
Reference [23] presents the application of ML models such as SGD, Ridge Classifier (RC), DT, RF, and ET. The goal is the prediction of DoS, Probe, R2L, and U2R attacks using the NSL-KDD dataset. A feature selection process is carried out, and the DT is determined to be a good alternative for identifying new attacks. The experiments with the ET and RF models achieve an accuracy of 99.83% using multiclass classification to detect U2R and DoS attacks, respectively.
Reference [24] proposed ML models to detect intrusion anomalies in IoT network traffic using the BoT-IoT dataset. They selected algorithms such as DT, GNB, and RF. The GNB algorithm is effective for the detection of intrusions.
Reference [25] proposed an IDS based on a big data platform that can differentiate between the types of network traffic flow generated by IoT devices. This work compared ML algorithms on the Apache Spark platform and found that ML algorithms outperform DL algorithms, with higher accuracy and less training time for the model. The experimentation was carried out using the BoT-IoT real-world network traffic dataset.
Reference [26] designed a framework to gather data from the publicly available CUPID dataset, which had been annotated with human pentesting activity on the network. This framework facilitated the distinction between automatically generated attacks and those initiated by humans, at the feature level. The types of attacks generated included Webcrawling, Recorded live user interaction, ARP, nmap, Dig, DNSMap, DNSTracer, nslookup, SQLi, Directory Traversal, Password brute forcing, Delivery of reverse Meterpreter shell, STP, and DHCP attacks. For their analysis, the researchers employed supervised algorithms such as RF, KNN, MLP, and others.
Reference [27] proposed a novel method to detect injection attacks in IoT applications by leveraging feature selection and machine learning techniques. The researchers used the public AWID dataset and applied two feature selection techniques: constant deletion and recursive deletion. They used three machine learning algorithms for their analysis: SVM, Random Forest, and Decision Tree. This work suggests that appropriate feature selection can significantly enhance the accuracy of a model's attack detection capabilities in IoT applications.
Reference [28] proposed a comprehensive method encompassing preprocessing steps, SMOTE oversampling, feature extraction, feature selection, and a voting classifier. The chosen features were then classified using AB, B, and Voting.
To provide a comprehensive overview of ML applications in stroke management, our research aligns with relevant studies in the field. For example, the article [29] presents an explainable AI model that utilizes ML techniques to predict acute strokes using EEG signals. Similarly, [30] introduces a cyber-physical system that utilizes ECG data to classify stroke patients with altered cardiac activity, facilitating real-time data processing and utilization for stroke identification and post-stroke treatment management. Additionally, [31] focuses on the utilization of a portable EEG device for real-time health monitoring and providing early prognostic information for stroke management. These studies collectively illustrate the wide-ranging applications of ML in stroke prediction, cardiac monitoring, and real-time health monitoring, complementing our research in IoT network cybersecurity.
These works implement IDS with highly unbalanced datasets. The imbalance is caused by the nature of the problem, since some attacks are less common than others and, in general, there are more samples without attacks. Since ML models interpret the complexity and heterogeneity of the data, their search for patterns will be biased with unbalanced databases. This makes it necessary to release a balanced dataset with real attacks rather than synthetic ones obtained through repetition or approximation-based balancing techniques.

III. MATERIALS AND METHODS
In this section, the materials and methods employed in the study are described. The IDS architecture, which includes the physical system and its general structure, is outlined. The data used in the analysis, including the features and the different intrusions or classes, is presented. Additionally, the Bot-IoT dataset is utilized for testing purposes. The models employed and their training process, including hyperparameter tuning, are explained. The performance of the IDS is assessed, and the importance of features is determined. Finally, the resources utilized in the study are disclosed.

A. IDS ARCHITECTURE

1) PHYSICAL SYSTEM
The articulated system consists of devices, a network, and a cloud. The hardware elements used are: sensors, Arduino Nano V3.0 A, XBee-Pro S2C 2.4GHz Serie2 63mW (18dBm) communication devices, which achieve a data transmission rate of 250Kbps and comply with the 802.15.4 ZigBee standard, and finally, the Raspberry Pi 3 Model B+ integration platform, which supports the Raspbian OS (Operating System).
The architecture includes a WSN, which contains sensors that transmit environmental measurements (temperature, humidity, carbon monoxide, and ultraviolet intensity) to a node (Raspberry Pi 3). Using the Python programming language and Application Programming Interfaces (APIs), the information from the sensors is sent to the cloud for statistics, analysis, and visualization.

2) GENERAL STRUCTURE
The design of the anomaly-based IDS addressed in this research follows a methodology for the development of data science and ML projects, with the following functions: data collection, data preparation, ML model evaluation, anomaly detection, control mechanisms, alarm, and report.
For data collection, the traffic is captured as a .pcap file to be exported as Comma Separated Values (CSV). The process is described with the following steps:
1) Initialize the system: The IDS system is initialized to start the data collection process.
2) Configure and perform a network traffic analysis: The network traffic is analyzed by configuring the necessary parameters for each case.
3) Determine whether or not to tag the traffic: A decision is made on whether to tag the captured traffic for further analysis.
4) Import the PCAP file with the captured traffic: The captured network traffic is imported as a PCAP file for further processing.
5) Decide whether or not to save the report as CSV: An option is given to save the generated report in CSV format.
The data preparation step converts the input data into patterns the ML models can process. The data are cleaned and unnecessary information is removed, which creates a new CSV file. The ML models are trained with the dataset in different scenarios using 80% of the data for training and 20% for testing. Also, 10-fold cross-validation is used to confirm the results. Testing is supported by a set of metrics such as accuracy, F1-score, recall, and precision.
The trained ML models can now detect intrusions or unauthorized access to the network. The IDS displays real-time alarms when an intrusion is detected, and the report is saved in a database to keep a record for later visualization.
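The hold-out split and the 10-fold cross-validation mentioned above can be illustrated with a minimal standard-library sketch. The function names, the fixed seed, the toy sample count, and the choice to build the folds from the training indices are assumptions made for illustration; the paper does not state how its splits were implemented.

```python
import random


def hold_out_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into train and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the sketch reproducible
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices, k=10):
    """Yield (train, validation) index lists for k roughly equal folds."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Toy run with 1,000 sample indices (hypothetical size, not the dataset size).
train_idx, test_idx = hold_out_split(1000)
folds = list(k_fold_indices(train_idx, k=10))
```

Each fold serves once as validation data while the other nine are used for training, so every training sample is validated exactly once.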

B. DATA
The dataset (IDSAI) contains a total of 1,000,000 samples. Initially, it included 24 features. Initial preprocessing (deleting features such as IP addresses and ports, because attackers can easily change them) reduces the dataset to 19 variables and the two label columns (1,000,000 × 21). Half of the data are non-intrusion samples; the other 500,000 are intrusions. Each intrusion class includes a total of 50,000 data samples. In total, the database contains ten types of intrusions.
The IDSAI dataset presented here addresses the challenge of data set imbalance in network traffic analysis. To ensure balance and mitigate bias, the dataset has been meticulously designed by capturing an equal number of samples for each intrusion type. This balanced dataset is crucial in overcoming the inherent bias present in imbalanced data, where certain attack types are less common than others and normal network traffic dominates. The IDSAI dataset serves as a valuable resource for training and evaluating machine learning models in network intrusion detection, offering real attacks from diverse sources and a balanced representation of intrusion classes. By providing this balanced dataset, researchers can develop and evaluate machine learning models more effectively, leading to accurate and reliable intrusion detection in real-world scenarios.

1) FEATURES
The features of the proposed dataset are frequently used in other studies related to intrusion detection systems with approaches based on signatures and anomalies. The researchers selected them after studying the entire data and its structure. In the same way, some of them were defined in previous studies by other works [32], [33], [34]. Below are the dataset features' names, the data type, and a brief description.
• ip_ttl (int64): IP time to live. The value 0 indicates that this instance is not an IP protocol.
• delta_time (float64): Time delta of the captured frame with respect to the previous one.
• icmp_type (int64): ICMP type. The value 19 indicates that this instance is of an invalid ICMP type.
• tos (int64): Type of service, which labels the quality of service requested by the IP datagram.
• ip_flags_rb (int64): IP reserved bit flag. The value 2 indicates that this instance is unknown, that is, not an IP protocol.
• ip_flags_df (int64): IP Don't Fragment flag. The value 2 indicates that this instance is unknown.
• ip_flags_mf (int64): IP More Fragments flag. The value 2 indicates that this instance is unknown, that is, not an IP protocol.
• tcp_flags_res (int64): TCP reserved flag. The value 2 indicates that this instance is not a TCP protocol.
• tcp_flags_ns (int64): TCP Nonce flag. The value 2 indicates that this instance is not a TCP protocol.
• tcp_flags_cwr (int64): TCP Congestion Window Reduced (CWR) flag. The value 2 indicates that this instance is not a TCP protocol.
• tcp_flags_ecn (int64): TCP ECN-Echo flag. The value is 0 for inactive and 1 for active; the value 2 indicates that this instance is not a TCP protocol.
• tcp_flags_urg (int64): TCP Urgent flag. The value is 0 for inactive and 1 for active; the value 2 indicates that this instance is not a TCP protocol.
• tcp_flags_ack (int64): TCP Acknowledgment flag, which indicates whether the segment carries a valid acknowledgment number. The value 2 indicates that this instance is not a TCP protocol.
• tcp_flags_push (int64): TCP Push flag, which indicates whether the data should be transferred immediately to the application. The value 2 indicates that this instance is not a TCP protocol.
• tcp_flags_reset (int64): TCP Reset flag. It can take three values: 0 for inactive, 1 for active, and 2 if it is not TCP protocol.
• tcp_flags_syn (int64): TCP Synchronization flag. It can take three values: 0 for inactive, 1 for active, and 2 if it is not TCP protocol.
• tcp_flags_fin (int64): TCP Finish flag, used to finish the connection. The value is 2 if it is not TCP protocol.
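The three-valued encoding described for the TCP flag features (0 inactive, 1 active, 2 not TCP) can be sketched as follows. The packet representation (a dict with an `is_tcp` field and a set of flag names) and the helper names are hypothetical, and applying the same rule to every listed flag is an assumption; the sketch only illustrates the sentinel-value idea, not the authors' extraction code.

```python
NOT_APPLICABLE = 2  # sentinel from the feature list: the packet is not TCP

# Suffixes taken from the feature names listed above.
TCP_FLAG_NAMES = ["urg", "ack", "push", "reset", "syn", "fin"]


def encode_flag(is_protocol, flag_set):
    """Map a flag to 0 (inactive), 1 (active), or 2 (protocol not present)."""
    if not is_protocol:
        return NOT_APPLICABLE
    return 1 if flag_set else 0


def encode_tcp_flags(packet):
    """Encode a hypothetical packet dict into tcp_flags_* feature values."""
    active = packet.get("tcp_flags", set())
    return {
        f"tcp_flags_{name}": encode_flag(packet["is_tcp"], name in active)
        for name in TCP_FLAG_NAMES
    }


syn_packet = encode_tcp_flags({"is_tcp": True, "tcp_flags": {"syn"}})
arp_packet = encode_tcp_flags({"is_tcp": False})
```

Keeping a separate sentinel value lets a model distinguish "flag not set" from "flag does not exist for this protocol", which a plain 0/1 encoding would merge.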

2) INTRUSIONS OR CLASSES
In total, the database contains ten types of intrusions. Table 2 shows the category and subcategory of attacks, the impacted protocols, and the attack tool used. The traffic capture tool was Wireshark in all cases, and the attacked device was a Raspberry Pi in all cases.
Class distribution. The IDSAI dataset is balanced across the ten intrusions and, in binary form, between intrusion (the ten intrusions joined) and non-intrusion (Normal) data. Figure 1 shows the data distribution by classes. There are 50,000 data samples for each intrusion (500,000 samples for intrusions) and 500,000 non-intrusion or Normal samples.
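The class counts stated above can be checked with a short arithmetic sketch. The class names are taken from the descriptions in this section; the exact label strings stored in the dataset files are not given in the text, so the strings used here are assumptions.

```python
# Class names as described in this section (label strings are assumed).
INTRUSION_CLASSES = [
    "ICMP echo request Flood", "SYN/ACK and RST Flooding", "SYN/ACK Flooding",
    "SYN Flooding faster", "ARP spoofing", "DDoS MAC Flood",
    "IP Fragmentation", "Brute Force SSH", "UDP port scan", "TCP Null",
]
SAMPLES_PER_INTRUSION = 50_000
NORMAL_SAMPLES = 500_000

# Multiclass view: one count per intrusion plus the Normal class.
multiclass_counts = {name: SAMPLES_PER_INTRUSION for name in INTRUSION_CLASSES}
multiclass_counts["Normal"] = NORMAL_SAMPLES

# Binary view: the ten intrusion classes are joined into a single label.
binary_counts = {
    "intrusion": SAMPLES_PER_INTRUSION * len(INTRUSION_CLASSES),
    "non-intrusion": NORMAL_SAMPLES,
}
total_samples = sum(multiclass_counts.values())
```

The binary view is split evenly, and the ten intrusion classes are equal in size among themselves.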
Below are the names of the classes or intrusions with a brief description.
• ICMP echo request Flood / Ping Flood: the attacker uses a botnet to send large numbers of ICMP packets to the victim host to exhaust available bandwidth and prevent the victim host from being accessible to legitimate users of the system. It generates approximately 5,368 packets per second [35].
• SYN/ACK and RST Flooding: this attack causes the victim host to receive a large volume of fake RST packets that do not belong to any session registered in the victim's database. Computational resources are exhausted when the host tries to match the large number of packets received, which causes total system failure or reduces system performance to a minimum [36].
• SYN/ACK Flooding: this attack generates a denial of service or a total collapse of the system through an excess of SYN + ACK response packets at a victim host. The host consumes all its memory, CPU, and other resources trying to mitigate the attack, but it cannot cope with the congestion formed by the response packets. This attack generates approximately 7,500 packets per second [36].
• SYN Flooding faster: in this attack, the intruder sends a high volume of TCP/SYN packets originating from one or several false addresses to a victim system. The system tries to reply with SYN-ACK packets to a group of false IPs, which will never acknowledge receipt. This leaves half-open connections that exhaust the system's memory resources and, consequently, degrade the system's performance or, even worse, cause a total crash [36], [37].
• ARP spoofing: it is a malicious attack that transmits false ARP messages over the local network by linking the MAC address of an intruder with the IP of a legitimate host, causing the interception, modification and even the denial of network data frame traffic [38].
• DDoS MAC Flood: the primary purpose of the MAC flood attack is to delete the contents of the MAC table. An intruder connected to a switch port floods many frames onto the Ethernet interface. Using false source MAC addresses, the attacker fills the memory of the switch, where the MAC table is stored, which causes legitimate users to be removed from the table [39].
• IP Fragmentation: these attacks are common denial of service attacks. The intruder abuses IP fragmentation to confuse the operating system when it recomposes the original datagram and thus crash the target system. In addition, the attack intends to modify the information so that inconsistencies appear once the original datagram has been reconstructed; another harm is flooding the IP stack of the victim host [40].
• Brute Force SSH: this attack is used to gain unauthorized access to a host, server, or other protected information by illegally logging in with authentic usernames and passwords, which are guessed by systematically trying all the usernames and passwords of an organization. This type of threat can be prevented through an intrusion detection and prevention mechanism (IDS/IPS) that controls the number of access attempts [41].
• UDP port scan: a prevalent security threat in which intruders look for open ports on a victim and learn the operating system and the services that allow illegitimate logins. By sending packets to specific ports on a host and receiving the replies, they find faults in the system, monitor the response of a network host, and inquire about the status of a port and its ease of access [37].
• TCP Null: in this attack, the victim receives TCP packets with null values in the flag area of the TCP header, because none of the six TCP flags (URG, ACK, PSH, RST, SYN, FIN) has been set. If the port is open on the victim, the NULL packet is ignored; if it is closed, the attacker receives an RST packet. The attacker uses this behavior to scan the victim's ports and prepare a larger attack [42].
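As an illustration of the TCP Null description, the following sketch checks whether a record has none of the six TCP flags set, using the flag feature names listed in the FEATURES subsection. It is a simple rule written for explanation only and is not the ML-based detection evaluated in this work; the dict-based record format is an assumption.

```python
# Feature names for the six flags named in the TCP Null description.
TCP_FLAG_FEATURES = [
    "tcp_flags_urg", "tcp_flags_ack", "tcp_flags_push",
    "tcp_flags_reset", "tcp_flags_syn", "tcp_flags_fin",
]


def is_tcp_null(record):
    """Return True when the record is TCP and none of the six flags is set.

    A value of 2 means "not TCP" in the dataset encoding, so such records
    are never reported as TCP Null.
    """
    return all(record[name] == 0 for name in TCP_FLAG_FEATURES)


null_packet = {name: 0 for name in TCP_FLAG_FEATURES}
syn_packet = dict(null_packet, tcp_flags_syn=1)
non_tcp_packet = {name: 2 for name in TCP_FLAG_FEATURES}
```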

3) BOT-IoT DATASET FOR TESTING
The unbiased evaluation of the models is performed using the Bot-IoT dataset [18].
• DDoS_TCP: consists of flooding a victim with a large volume of malicious packets sent through botnets in order to exhaust its computational resources or absorb its bandwidth. Because the attack can be spread across multiple machines, it is challenging to differentiate between legitimate users and intruders [43], [44].
• DDoS_UDP: this attack is performed by flooding User Datagram Protocol packets. The intruder floods a random port on the device with UDP packets, forcing the victim to check constantly whether the affected port is being used or listened on by the system. The affected devices send massive numbers of no-access and ICMP error messages. This action depletes the victim's resources, making the system unavailable to legitimate users [36].
• DoS_TCP: This is a typical DoS attack, where an attacker dispatches TCP connection requests to clog existing ports on the system, making it impossible to accept real connections from authenticated users [45].
• DoS_UDP: sends many corrupted UDP packets to exposed ports of a victim host (UDP does not need a communication link like TCP). When a UDP packet is received on a specific host port, the host determines whether any application is active on that port; if not, it sends an ICMP target unreachable message to the spoofed source address. If a host receives a large number of UDP packets, the system's performance suffers until it becomes unavailable [37], [44].
• Reconnaissance_OS_Fingerprint: is a technique used in ethical hacking that allows identifying the operating system that runs on a remote host susceptible to attack and what vulnerabilities the host systems present that facilitate the subsequent phase of an attack [42], [46].
• Reconnaissance_Service_Scan: the attack consists of an address analysis to find out the weaknesses of the active services in a network of hosts. Intruders often perform address study in the first phase, then carry out cyberattacks such as DoS and progress to devastating attacks such as DDoS attacks [46].
• XGB is an ensemble technique developed based on Gradient Boosting. It is trained using a set of regression trees, whose result is the sum of the scores of each tree [47]. Reference [55] added some improvements to it in 2016 and named it XGB. This algorithm builds on the idea of boosting, overcomes limits on computation speed and accuracy, and sorts each feature in blocks. This allows parallelizing the computation when searching for the best split point, which significantly accelerates the calculation speed [50].
• GB is used for solving regression and classification problems. It is similar to the AdaBoost algorithm, combining weak classification models that are generally DT models. Reference [48] explains that the general idea is to train predictors sequentially, each of which attempts to correct its predecessor.
• DT is used to solve classification and regression problems. From a dataset, it builds diagrams of logical structures with which it represents and categorizes a series of conditions evaluated consecutively to solve a problem [49]. It comprises a tree scheme with decision nodes and leaves (the result of the decision).
• RF trains many DTs, each using a random subset of samples and features. RF achieves increased tree diversity and gives better outcomes [50].
• ET are DT ensembles. This approach adds randomization to the model training process by employing random decision thresholds for each feature rather than seeking the best feasible one [51].
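The sequential idea behind GB described above, in which each predictor attempts to correct its predecessor, can be sketched with one-split regression stumps fitted to residuals. This is a simplified regression example under squared error, chosen for brevity; it is not the classifier configuration used in this work, and the function names, toy data, number of rounds, and learning rate are assumptions.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals by least squares."""
    best = None
    # Every distinct x except the largest is a candidate threshold,
    # so both sides of the split are always non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Train stumps sequentially; each one fits the current residuals."""
    base = sum(ys) / len(ys)  # initial prediction: the mean target
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, learning_rate, stumps


def boosted_predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
model = gradient_boost(xs, ys)
```

Each round shrinks the remaining residual by the learning rate, so the summed prediction approaches the targets step by step rather than in a single fit.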

D. MODEL TRAINING
The proposed experimentation process is divided into three scenarios (see Figure 2). The first two scenarios correspond to the experimentation with the proposed IDSAI dataset (see Figure 2 (A)). In scenario 1, binary classification is performed, predicting whether there is an intrusion. In scenario 2, multiclass intrusion identification is performed; in this case, the aim is to state, given that there is an intrusion, which one it could be and with what certainty. Scenario 3 seeks to perform an external validation of the dataset proposed for intrusion detection (see Figure 2 (B)). The training data is the IDSAI dataset presented in this work, and the testing data is the Bot-IoT dataset. To test Bot-IoT data, the network traffic capture was structured with the same features with which the models were trained using the IDSAI dataset. This set of experiments is intended to verify that the models work with different datasets, even ones that contain different types of intrusions.
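For scenario 3, the external records must be arranged with the same features used in training. A minimal sketch of such an alignment step is shown below. The short feature list is an illustrative subset of the names in the FEATURES subsection, and the record format, the function name, and the extra field are hypothetical; the actual procedure used to restructure the Bot-IoT captures is not detailed in the text.

```python
# Illustrative subset of the training feature names, in a fixed column order.
TRAINING_FEATURES = [
    "ip_ttl", "delta_time", "icmp_type", "tos",
    "tcp_flags_syn", "tcp_flags_ack",
]


def align_to_training_features(record, feature_names=TRAINING_FEATURES):
    """Return the record's values in training column order.

    Extra fields are dropped; a missing feature raises an error, because a
    model cannot predict on an incomplete feature vector.
    """
    missing = [name for name in feature_names if name not in record]
    if missing:
        raise KeyError(f"record lacks features: {missing}")
    return [record[name] for name in feature_names]


# Hypothetical external record with a different field order and an extra field.
external_record = {
    "tcp_flags_ack": 1, "ip_ttl": 64, "extra_field": "ignored",
    "tos": 0, "delta_time": 0.002, "icmp_type": 19, "tcp_flags_syn": 0,
}
row = align_to_training_features(external_record)
```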
Experiments for scenarios 1 and 2 are performed using ML with Hold-Out by splitting data with 80% for training and the remaining for testing (see Figure 2 (C)). A total of 8 ML algorithms were evaluated in all scenarios, and the best five were selected (see Figure 2 (D)). The best ML models are saved and sent to the IDS environment, which performs the predictions in real-time directly on a device (see Figure 2 (E)).
The Bot-IoT database (see [18] for more detailed information) has data without intrusions (200,000 samples) and the following intrusions: DDoS TCP, DDoS UDP, DoS TCP, DoS UDP, Reconnaissance OS Fingerprint, and Reconnaissance Service Scan, with 20,000 samples each (see Figure 2 (F)). In scenarios 1 and 2, the IDSAI dataset proposed in this research is used. In scenario 3, predictions are made on Bot-IoT to verify the effectiveness of training ML algorithms using IDSAI data.
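The hold-out split and the external validation described above can be sketched as follows. This is only an illustrative sketch, not the authors' pipeline: the data are synthetic stand-ins generated with scikit-learn (the "internal" rows play the role of the training dataset and the "external" rows the role of a second dataset already expressed in the same feature columns), and the Decision Tree and all variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 8 features, binary labels.
X, y = make_classification(n_samples=1300, n_features=8, random_state=0)
X_internal, y_internal = X[:1000], y[:1000]   # role of the training dataset
X_external, y_external = X[1000:], y[1000:]   # role of the external dataset

# Hold-out: 80% for training, 20% for testing, keeping class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X_internal, y_internal, test_size=0.2, stratify=y_internal, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Internal evaluation on the held-out 20%.
holdout_accuracy = accuracy_score(y_test, model.predict(X_test))
# External evaluation: the model never saw any row of this set.
external_accuracy = accuracy_score(y_external, model.predict(X_external))

print(f"hold-out accuracy: {holdout_accuracy:.3f}")
print(f"external accuracy: {external_accuracy:.3f}")
```

In a real external validation the second dataset comes from a different capture, so its columns must first be mapped to the exact feature order used in training.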

1) HYPERPARAMETER TUNING
Hyperparameter tuning for the ML models in this work (see Figure 3) follows these steps: a set of candidate hyperparameters is selected; accuracy is established as the metric; only training data is used to select hyperparameters; and, following the grid of hyperparameters, the Grid Search tool trains the models using 3-fold cross-validation to optimize the hyperparameter settings. This exhaustive search for the best hyperparameter values is applied in all scenarios. After hyperparameter optimization, the best values were chosen to train the best ML algorithms and predict on testing data (never seen in tuning or training). The hyperparameters of the top-performing ML algorithms in each classification scenario are shown in Table 3. Detailed explanations of these hyperparameters can be found on the scikit-learn website [56], [57].
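A minimal sketch of this tuning procedure with scikit-learn's GridSearchCV is given below. The data are synthetic and the hyperparameter grid is a hypothetical example chosen for illustration; it is not the grid reported in Table 3.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical grid of candidate hyperparameters.
param_grid = {"max_depth": [3, 5, None], "criterion": ["gini", "entropy"]}

# Exhaustive search scored by accuracy with 3-fold cross-validation.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X_train, y_train)  # only training data is used for tuning

# The best configuration is refit on the full training set by default,
# then evaluated once on the untouched test data.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```

Keeping the test split out of `search.fit` is what guarantees that the reported test accuracy comes from data never seen in tuning or training.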

E. PERFORMANCE ASSESSMENT
This research utilizes eight metrics, which are detailed in [58], [59], and [60]. The four measures used to calculate these metrics are: True Positive (TP) for correctly predicted intrusions, True Negative (TN) for correctly predicted non-intrusions, False Positive (FP) for incorrectly predicted intrusions, and False Negative (FN) for incorrectly predicted non-intrusions. For evaluation, this study considers measures such as accuracy, precision, recall, F1-score, ROC curves, the Area Under the ROC Curve (AUC ROC), cross-validation and execution time.

1) ACCURACY
Accuracy is a metric used to evaluate how well a classification model performs in making predictions. It is obtained by dividing the total number of correct predictions made by the model by the total number of predictions made [60]. Equation (1) represents the percentage of instances correctly classified out of the total [58], [59]:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

2) PRECISION
The precision metric measures the proportion of positive cases that are correctly identified by a model among all the cases identified as positive, including true positives and false positives. Equation (2) shows that precision is calculated as the number of true positives divided by the sum of true positives and false positives [58], [59], [60]:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$

3) RECALL
Recall is also known as the true positive rate [58], [59], [60]. It is the percentage of positive cases correctly detected by the model (see Equation (3)):

$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$

4) F1-SCORE
F1 is a metric used to evaluate the model's ability to accurately identify both positive and negative cases, especially when the data is imbalanced and the positive class is rare [58], [59]. It is calculated as the harmonic mean of Precision and Recall, and is particularly useful for uneven classes [60]. Equation (4) combines both into a single score:

$$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$
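As a worked illustration of the accuracy, precision, recall, and F1 metrics described above, the following sketch computes them directly from the TP, TN, FP, and FN counts. The function name and the counts are hypothetical and chosen only for illustration; they are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F1 from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts (assumption): 100 intrusions and 100 non-intrusions.
metrics = classification_metrics(tp=90, tn=85, fp=15, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, accuracy is 175/200 = 0.875, precision is 90/105, recall is 90/100, and F1 is 180/205.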

5) AUC ROC
The ROC curve shows the relationship between true positive rate (TPR) and false positive rate (FPR) at different decision thresholds, useful for comparing classification models and finding the best threshold. AUC is a numerical measure summarizing the model's overall performance, with values closer to 1 indicating better performance. A model with high sensitivity and specificity is represented by an ideal curve that reaches the upper-left corner, and AUC ROC is associated with this curve [58], [59], [60].
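To make the relationship between decision thresholds, the ROC curve, and the AUC concrete, the sketch below builds the (FPR, TPR) points by sweeping the threshold over the predicted scores and integrates them with the trapezoidal rule. The labels and scores are a small hypothetical example, and the helper names are assumptions; the code assumes both classes are present.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by lowering the decision threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    # Highest scores first: each step lowers the threshold past one score value.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels (1 = intrusion) and predicted intrusion scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
curve = roc_points(y_true, scores)
area = auc(curve)
print(curve, area)
```

A perfect ranking would place every intrusion above every non-intrusion, giving a curve through the upper-left corner and an area of 1.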

6) CONFUSION MATRIX
The confusion matrix is a table that summarizes the relationship between a model's predictions and the true labels of the data. It includes TP, TN, FP, and FN. This matrix is useful for visualizing a model's performance in terms of its successes and errors. Each row of the matrix corresponds to the actual class, while each column corresponds to the predicted class, so each cell counts the predictions made for that pair of classes. It also helps to identify when one class is confused with another [58], [59], [60].
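The following sketch shows how such a matrix can be built for a multiclass problem, with rows indexed by the actual class and columns by the predicted class. The class names are taken from the intrusion types discussed in this work, but the label sequences are hypothetical and serve only to illustrate how confusions between classes appear off the diagonal.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count predictions per (actual, predicted) pair; rows are actual classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix


labels = ["non-intrusion", "SYN Flooding faster", "SYN/ACK Flooding"]

# Hypothetical true labels and predictions (assumption, not paper results).
y_true = ["non-intrusion", "non-intrusion",
          "SYN Flooding faster", "SYN Flooding faster",
          "SYN/ACK Flooding", "SYN/ACK Flooding"]
y_pred = ["non-intrusion", "non-intrusion",
          "SYN Flooding faster", "SYN/ACK Flooding",
          "SYN/ACK Flooding", "SYN Flooding faster"]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(f"{label:>20}: {row}")
```

The diagonal holds the correct predictions; here the two flooding classes are confused with each other once in each direction.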

7) CROSS VALIDATION (CV)
A 10-fold CV is used in all experiments [58], [59]. It consists of repeating the evaluation on different partitions of the data and calculating the arithmetic mean of the resulting evaluation measures. This makes the results less dependent on any particular split between training and validation data. It is regularly used in environments where the main objective is to predict and to estimate the accuracy of a model to be put into production.
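A minimal sketch of a 10-fold CV with scikit-learn follows, reporting the mean accuracy and its standard deviation across folds. The data are synthetic and the Decision Tree is an assumed stand-in for any of the evaluated models.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

model = DecisionTreeClassifier(random_state=0)

# One accuracy value per fold: each fold is used once for validation
# while the other nine are used for training.
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")

cv_mean = scores.mean()
cv_std = scores.std()
print(f"10-fold CV accuracy: {cv_mean:.4f} ± {cv_std:.4f}")
```

Reporting the standard deviation alongside the mean indicates how much the score varies from one partition to another.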

8) RUN TIME
Run time (RT) indicates the time an ML model takes to train and predict [58], [59].

F. FEATURE IMPORTANCE
Feature importance is calculated using supervised ML algorithms for binary and multiclass classification. Some feature importance methods are built into scikit-learn for multiple ML models (the feature_importances_ and coef_ attributes). The DT algorithm, for example, exposes feature_importances_, which is available in decision tree models and tree ensembles and reflects how much each feature is used across the trees. For models that expose coef_, the coefficients with the largest values are the most relevant since they lend more weight to the predictions [56]. The Yellowbrick tool obtains the feature importance from the models (internally, it uses feature_importances_ and coef_). This tool can also stack feature importances for the top and bottom features [61]. It makes it possible to learn which factors have the most influence in each scenario. This research uses the DT algorithm since it is very efficient in training and prediction times while maintaining good detection ability. It also obtains good results by using features efficiently.
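As an illustration of the feature_importances_ attribute, the sketch below trains a Decision Tree on synthetic data and ranks the features by relative importance. The feature names are hypothetical placeholders, not the IDSAI features, and the plotting step performed with Yellowbrick in this work is omitted.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical feature names for a synthetic dataset (assumption).
feature_names = [f"feature_{i}" for i in range(6)]
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# feature_importances_ holds one non-negative value per feature,
# normalized so that the values sum to 1.
ranking = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, importance in ranking:
    print(f"{name}: {importance:.1%}")
```

Because the values are normalized, each one can be read as the relative share of that feature, which is how the percentages in an importance plot are usually interpreted.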

G. RESOURCES
Python 3.8 is used to develop and execute the algorithms presented in this study. The computer runs Windows 11 (64-bit) and has an Intel(R) Core(TM) i9-10980HK CPU @ 2.40 GHz 3.10 GHz processor, 32 GB of RAM, and an NVIDIA GeForce RTX 2070 Super GPU (8 GB). The code and data are available at https://github.com/BioAITeam/Intrusion-Detection-System-using-Machine-Learning.

IV. EXPERIMENTAL RESULTS AND ANALYSIS
A total of three possible scenarios have been defined that cover all perspectives of intrusion classification using supervised ML. The first two scenarios are based on the proposed IDSAI dataset. The last is for external validation using the Bot-IoT dataset. The experimentation results are framed in binary and multiclass classification for the IDSAI dataset. The results include tables covering many metrics to measure the performance, such as Accuracy, F1-score, Recall, Precision, ROC AUC, CV, and Times. In addition, this work presents confusion matrices and ROC Curves with the AUC and Confidence Intervals. Table 4 shows the results obtained with the best five ML models for scenarios 1 and 2. The algorithms are ordered from the highest to the lowest CV value (recommended metric before using ML models in production).

A. IDSAI DATASET FOR INTRUSION DETECTION
The performance of the classifier was evaluated using ROC curves with confidence intervals, as illustrated in Figure 4. The area under the ROC curve served as a metric to measure the classifier's ability to discriminate between classes. Similarly, ROC curves with confidence intervals were generated for multiclass classification (scenario 2), as presented in Figure 5. These curves provide valuable insights into the classifier's performance in distinguishing between different classes. The results of this study highlight the effectiveness of the XGBoost classifier in both binary and multiclass classification tasks on the IDSAI dataset. For scenario 1, which corresponds to binary classification (intrusion, non-intrusion), the best algorithm is XGB, obtaining an accuracy of 94.97%. Regarding training time, the best algorithm is DT, needing only 3.8859 seconds while maintaining an accuracy of over 94%. The slowest algorithm is GB, needing 369.9567 seconds; because of this, GB is less desirable in a production environment despite its good performance (94.97 ± 0.06).
In scenario 2, multiclass classification (non-intrusion, ICMP echo request Flood / Ping Flood, SYN/ACK & RST Flooding, SYN/ACK Flooding, SYN Flooding faster, ARP spoofing, DDoS MAC Flood, IP Fragmentation, Brute Force SSH, UDP port scan, TCP Null), the XGB algorithm once again has the best accuracy (92.64%). DT is again the most time-efficient algorithm, requiring the least training time (7.1727 seconds) while maintaining good performance (92.51 ± 0.06). The GB algorithm, which was second in scenario 1, is now fourth and the least time-efficient, needing 1,006.6022 seconds to train. Figure 6 shows the confusion matrix for the identification of ten intrusions and non-intrusion data (scenario 2).
According to Figure 6, the models are likely to get confused and predict SYN/ACK Flooding or SYN/ACK & RST Flooding when the attack is SYN Flooding faster. Likewise, SYN/ACK & RST Flooding may be predicted as SYN Flooding faster or SYN/ACK Flooding. The models also tend to identify ARP spoofing and Brute Force SSH attacks as non-intrusion data. Therefore, these two attacks are harder to detect and are confused with non-intrusions.

B. FEATURE IMPORTANCE
ML algorithms tolerate the complexity and heterogeneity of data structures, making it possible to find underlying patterns in the data. Figure 7 graphically shows features' relative importance for binary and multiclass classification, scenarios 1 and 2. This analysis is performed using the IDSAI dataset. The SHAP (SHapley Additive exPlanations) [62] method was used to analyze the relative importance of features in a binary classification scenario using the IDSAI dataset. The results, shown in Figure 8, revealed that Feature frame_len had the highest importance in scenario 1. Furthermore, it was observed that the results obtained with SHAP were similar to those obtained using other interpretability techniques (see Figure 7). This consistency strengthens confidence in the results and provides valuable insights for feature selection and data analysis.
This work (see Figure 7 (A)) shows that the binary classification can be done using the following 11 features (of which only the first three each have a relative importance greater than 20%):
One of the most critical problems in model training is generalization to data of a different nature. Reference [63] proposed an alternative approach to assess the generalizability of a model, using two distinct but related datasets instead of a single one. The first dataset is used for training and validation, and the second is for unbiased evaluation, as developed in this study. Table 5 shows the results on the Bot-IoT dataset when the algorithms are trained on the IDSAI dataset. The XGB algorithm maintains an accuracy of over 94%, again being the best for intrusion detection. The GB, DT, and RF algorithms ...
Figure 9 shows the confusion matrices of scenarios 1 and 3, corresponding to binary classification. The results achieved in binary form reach an accuracy of about 94% (see Figure 9 (A)). The confusion matrix shows that intrusions in the Bot-IoT dataset are being correctly classified (see Figure 9 (B)). The XGB algorithm is used, as it is the best in all cases.
Additionally, this paper presents the confusion matrix for the 20% of testing data extracted from the Bot-IoT dataset (the confusion matrix for multiclass classification on the Bot-IoT dataset is included in Figure 10). The training process was conducted using the remaining 80% of the Bot-IoT data, employing all ML models and fine-tuning their hyperparameters. The XGB model was chosen as the best-performing one. The results reveal that although the Bot-IoT dataset is challenging to classify by classes, binary prediction achieves a high level of accuracy, as shown in scenario 3. Regarding multiclass classification, the performance metrics are as follows: 26.15 seconds for training time, 0.11 seconds for prediction time, 84.14% for accuracy, 84.05% for F1-score, 84.14% for recall, 85.41% for precision, 97.64% for ROC AUC, and 0.5180 for MSE. Furthermore, cross-validation was conducted, which took 152.44 seconds and achieved an accuracy of 84.19% with a standard deviation of 0.16%.
The advent of the Internet of Things and its incorporation into smart cities, while advantageous, has ushered in an era of new security and privacy complications, making IoT networks a preferred target for malefactors [1]. These networks' susceptibility is heightened by the inherent constraints of IoT devices such as limited storage, processing capabilities, and memory, as well as the utilization of insecure wireless communications [3], [4]. Existing security methods struggle with these evolving threats due to issues like complexity, resource usage, data quality, and service disruptions [5], [6], [7], [8]. This research contributes to the realm of IoT network cybersecurity by proposing an innovative Intrusion Detection System, demonstrating the power of Artificial Intelligence in intrusion detection, and providing an understanding of the intrinsic causes of different attacks. By doing so, we not only enhance the comprehension of IoT security challenges but also contribute to fortifying IoT networks against emerging threats.

D. LIMITATIONS OF THE STUDY
A possible limitation of the current work stems from the nature of the data since attackers are constantly looking to design new ways to attack. Although the IDSAI dataset allowed generalization for prediction on the Bot-IoT dataset, updating the database with new intrusions is advisable to make the ML algorithms more robust. Table 6 shows a comparison of advantages and disadvantages for various works related to intrusion detection systems. Each author's approach is summarized in terms of the advantages it offers and the corresponding disadvantages or limitations.

V. CONCLUSION
With the growth of the IoT ecosystem, the cybersecurity attack surface has increased. This work presents a sustainable intrusion detection system through supervised ML algorithms. Performance was evaluated using many metrics such as Accuracy, Precision, Recall, F1-score, ROC Curves, ROC AUC, Confusion Matrix, Cross Validation, and Times.
A new dataset (IDSAI) is presented with 1,000,000 data samples and 19 features, with intrusions generated in real attack environments. The IDSAI dataset has data without intrusions and a total of ten intrusions: ARP spoofing, Brute Force SSH, DDoS MAC Flood, ICMP echo request Flood, IP Fragmentation, SYN Flooding faster, SYN/ACK Flooding, SYN/ACK & RST Flooding, TCP Null, and UDP port scan. IDSAI is a balanced dataset with an equal number of attacks in each category.
The best ML algorithms for intrusion detection are XGBoost, Gradient Boosting, Decision Tree, Random Forest, and Extra Trees. These ML algorithms can predict the ten specific intrusions achieving an accuracy of over 92%, and in a binary way (intrusion and non-intrusion), achieving an accuracy of over 94%.
On the Bot-IoT dataset, an accuracy of over 90% is obtained. This shows that the models correctly learn to detect intrusions once trained on the IDSAI dataset.
In future work, it is recommended to prioritize research on developing a novel intrusion detection system that leverages unsupervised machine learning and anomaly detection techniques. The aim is to compare and evaluate the performance of these techniques using internal and external metrics, with a focus on achieving better performance results while optimizing computational resource consumption. Additionally, efforts can be directed towards refining the IDSAI dataset for model comparison, which involves categorizing new intrusions and their variations to enhance the dataset's effectiveness in evaluating and comparing intrusion detection models.