ShieldRNN: A Distributed Flow-Based DDoS Detection Solution for IoT Using Sequence Majority Voting

The Distributed Denial of Service (DDoS) attack is considered one of the most critical threats on the Internet because it blocks legitimate users from accessing online services. Botnets have exploited insecure IoT devices and used them to launch DDoS attacks. Giving IoT devices the ability to detect DDoS attacks helps prevent them from becoming contributors to these attacks. This paper presents an efficient solution to defend IoT devices against such attacks. The proposed solution consists of two parts: an IoT node detector and a server detector. The IoT node detector is a lightweight classifier that monitors egress traffic. The server detector is a more accurate classifier that the IoT node consults when it suspects that it is contributing to a DDoS attack. To develop an accurate server detector, this paper proposes ShieldRNN, a novel training and prediction approach for RNN/LSTM models. We compare ShieldRNN with other supervised and unsupervised models on the CIC-IDS2017 dataset and show that it outperforms them. We also set baseline results for DDoS detection on the CIC IoT 2022 dataset.


I. INTRODUCTION
The Denial of Service (DoS) attack is one of the most dangerous threats an organization may face. The attack is defined as an attempt to overload the capabilities of the victim's machine and make it unavailable to legitimate users and devices. In this type of attack, the attacker uses a single machine to launch the attack. A variant, the Distributed Denial of Service (DDoS) attack, involves multiple machines that are controlled by the attacker and that attack the victim's machine at the same time [1]. One of the best-known DDoS attacks was the Mirai attack of October 2016, which affected many popular websites including Twitter, Netflix, Amazon, and GitHub [2]. This attack was carried out using hundreds of thousands of Internet of Things (IoT) devices [3] that were infected by the Mirai botnet. A botnet is defined as a network of infected machines, called zombies, that are controlled by a botmaster [1]. Figure 1 shows the botnet architecture in general. Experts nowadays consider IoT botnets the new norm of DDoS attacks [3].

Different techniques have been proposed to detect DDoS attacks. They can be categorized as payload-based techniques and flow-based techniques. A flow can be defined as a stream of packets that share common network characteristics such as the network protocol, source IP and destination port [4]. Flow-based methods inspect the packet header only, while payload-based methods analyze the information inside the packet. Flow-based analysis is not as accurate as payload-based analysis because it inspects only the packet header and cannot detect attacks hidden inside the packet payload [5], [6]. In DDoS attacks, the number of packets received at the victim's machine is huge and the packets have source IP addresses that are nearly random [4]. Moreover, the timing between received packets can play an important role in detecting incoming or outgoing DDoS attacks [2]. In this paper, we focus on DDoS attack detection based on the traffic flow.

The associate editor coordinating the review of this manuscript and approving it for publication was Yu-Da Lin.

II. RELATED WORKS
Intrusion Detection Systems (IDS) and Intrusion Prevention Systems (IPS) can be classified mainly into rule-based and machine-learning-based systems. Most of the available detection systems rely on the signature of the attacks. These systems match incoming network traffic against a predefined set of rules to detect attack patterns [7]. One of the most popular rule-based IDSs is Snort, which was developed in 1998 by Martin Roesch [8]. Another well-known rule-based system is Suricata [9]. Shah et al. [10] investigated the performance of Snort and Suricata in terms of the utilization of computational resources as well as the processing efficiency. They showed that Snort is lighter, i.e., it utilizes fewer computational resources than Suricata. On the other hand, Suricata achieved a higher processing rate of more than 82K packets/second, compared to about 60K packets/second for Snort, with slightly higher memory usage. The memory usage of both systems was more than 3 GBytes, which indicates that they are not suitable for low-resource IoT devices. Zitta et al. [11] implemented Suricata on a Raspberry Pi 3 with some rules to detect port scanning activities, but they provided no data regarding the required processing time and the hardware utilization. Researchers in [12] proposed early DDoS detection in Software-Defined Networking (SDN) using Snort.

The machine-learning-based approaches include classical machine-learning algorithms such as Support Vector Machine (SVM), Artificial Neural Network (ANN), and Naïve Bayes as well as deep-learning algorithms such as Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and Long Short-Term Memory (LSTM). Machine-learning-based algorithms can detect known attacks and similar unknown attacks even if they were not seen during the training phase [7]. Shafiq et al. [13] explored five well-known machine learning algorithms and developed a framework for automatic effective algorithm selection. They found that the Naïve Bayes classifier achieved the best performance in anomaly and intrusion detection in IoT networks. Doshi et al. [2] tested five different machine learning algorithms after they collected their own dataset and hand-engineered a set of useful features based on the traffic flow. Another group of researchers at the University of Chinese Academy of Sciences [14] proposed a solution based on the Random Forest algorithm that uses NetFlow data to detect DDoS attacks. Hodo et al. [15] developed an ANN model for DDoS detection on a simulated IoT network. Khater et al. [16] implemented lightweight intrusion detection using the ANN algorithm on a Raspberry Pi 3 and reported the energy consumption and CPU utilization while running their system. Using deep learning algorithms, the papers [17], [18], [19] showed that the RNN algorithm can improve DDoS detection compared to the classical machine learning algorithms. The CNN algorithm has also been applied to anomaly detection tasks such as DDoS detection [20], [21], [22].

Different from the above works, we propose a machine-learning-based system that is implemented in two stages: an IoT node detector and a server detector. The IoT node detector is a lightweight classifier that monitors egress traffic. The server detector is an accurate deep learning classifier that the IoT node consults when it detects or suspects that it is taking part in a DDoS attack. Our proposed system aims to provide DDoS detection functionality on the IoT node and to minimize the workload of processing all traffic from all connected IoT nodes. It helps IoT nodes recognize whether they are zombies (bots) of a botnet DDoS attack by providing them with intelligent decision-making capabilities. Figure 3 shows the proposed system architecture.
This system architecture can be beneficial for organizations that have multiple IoT devices connected to the Internet to analyze the traffic of each IoT node inside the node itself instead of analyzing the whole traffic in a central server. For example, it can be used by an organization that has surveillance cameras which are connected to the Internet where each camera analyzes its own traffic to detect abnormal activities and sends the suspected packets to the accurate detector server to get the final decision.
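As an illustration of the two-stage decision flow described above, the following Python sketch shows how a node might escalate a window of egress packets to the server only when its lightweight classifier becomes suspicious. It is a minimal sketch under stated assumptions, not the implementation used in this work: the names node_check, light_classifier, and ask_server and the suspicion_ratio threshold are all hypothetical, and the text does not specify the rule the node uses to decide when to escalate.

```python
from typing import Callable, List, Sequence

Packet = Sequence[float]  # one feature vector per packet (hypothetical layout)


def node_check(
    window: List[Packet],
    light_classifier: Callable[[Packet], bool],
    ask_server: Callable[[List[Packet]], bool],
    suspicion_ratio: float = 0.5,  # assumed escalation threshold, not from the paper
) -> bool:
    """Return True if the window of egress packets is finally judged an attack."""
    if not window:
        return False
    # Stage 1: the lightweight on-node classifier flags individual packets.
    flagged = sum(1 for packet in window if light_classifier(packet))
    if flagged / len(window) < suspicion_ratio:
        return False  # nothing suspicious, so no server round trip is needed
    # Stage 2: the suspected packets go to the accurate server detector,
    # whose answer is taken as the final decision.
    return ask_server(window)
```

With this structure, the server only receives windows that the node already finds suspicious, which reflects the workload reduction that the architecture aims for.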

III. EXPERIMENT
We set up our experiment to perform and detect four well-known DDoS attacks: TCP SYN flood, UDP flood, TCP PSH+ACK flood, and ICMP flood [23]. We developed a tool that launches these attacks randomly with a random number of packets per attack, random source and destination ports, and a randomly generated (spoofed) source IP address, using the Scapy library for packet manipulation. The source code of the implementation of this paper is available on GitHub.

FIGURE 3. The proposed system architecture. At the beginning, the DDoS worm compromises the IoT device. Then the attacker sends a command to the worm to launch the DDoS attack. After that, the lightweight classifier in the IoT node detects an abnormal behaviour and the IoT device sends the suspected packets to the ShieldRNN server. Finally, the ShieldRNN server analyzes the suspected packets and sends the final decision to the IoT node.

A. DATA COLLECTION
To collect data, we started packet sniffing using Wireshark for nearly 24 minutes to collect normal traffic including web browsing, YouTube watching, and WhatsApp chatting in the web version. In this first phase, we collected 65,314 packets of normal traffic. After that, we launched random DoS attacks with a random number of packets per attack on a victim server using our own Python script. We collected a mix of normal and attack packets for more than 102 minutes and then stopped the attack script. Finally, we collected about 17 minutes of normal traffic again and stopped the Wireshark traffic sniffing. Figure 2 shows the distribution of packets in the collected dataset.

B. FEATURE EXTRACTION AND PACKET LABELLING
We used a combination of the features mentioned in [2] and [24]. The features were extracted from the frame header, IP packet header, TCP segment header, and UDP datagram header. Table 1 shows the features extracted from each packet. In the case of UDP packets, the features tcp.srcport, tcp.dstport, and tcp.len are replaced with the corresponding UDP header fields. We selected the top 5 most frequent protocols from the feature protocol and one-hot encoded them as "is_TCP", "is_UDP", "is_SSL", "is_ICMP", and "is_DNS". All other protocols were encoded as "is_OTHER" [2].
For packet labelling, a packet is considered an attack packet if its destination IP address is the IP address of the victim's server; otherwise, it is considered a normal packet.
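The protocol encoding and the labelling rule above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names encode_protocol and label_packet, the dictionary layout, and the 0/1 label values are assumptions, and the address in the usage note is a documentation address rather than the victim address used in the experiment.

```python
from typing import Dict

# The five most frequent protocols named in the text; everything else is "OTHER".
TOP_PROTOCOLS = ["TCP", "UDP", "SSL", "ICMP", "DNS"]


def encode_protocol(protocol: str) -> Dict[str, int]:
    """One-hot encode a protocol name into is_TCP ... is_DNS plus is_OTHER."""
    columns = {f"is_{name}": 0 for name in TOP_PROTOCOLS}
    columns["is_OTHER"] = 0
    key = f"is_{protocol.upper()}"
    columns[key if key in columns else "is_OTHER"] = 1
    return columns


def label_packet(dst_ip: str, victim_ip: str) -> int:
    """Return 1 (attack) if the packet targets the victim server, else 0 (normal)."""
    return 1 if dst_ip == victim_ip else 0
```

For example, encode_protocol("ARP") sets only "is_OTHER" to 1, and label_packet("192.0.2.10", "192.0.2.10") returns 1.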

C. DATA PREPROCESSING
As the first task, we kept the dataset in the same order in which it was captured. Then, the whole 2D dataset matrix was converted into a 3D tensor such that each seq_len consecutive packets are considered a single example to be fed into a sequence model such as an RNN or LSTM. For example, seq_len = 10 means that every 10 consecutive packets form a single training example and each packet represents the input vector for a specific time step t. The final 3D tensor size is num_examples × seq_len × num_features instead of the original 2D matrix size total_num_packets × num_features, where num_examples is the number of examples, seq_len is the sequence length, i.e., the number of packets per example, num_features is the number of features, which represents the input size of the model, and total_num_packets = num_examples×seq_len is the total number of packets in the dataset. Before splitting the dataset into a training set and a testing set, we needed to ensure that the following two conditions are satisfied:
1) The packets must be in the same order as they were captured to allow the sequence model to learn the correct sequence.
2) The data must be shuffled before it is split into a training set and a testing set so that different normal scenarios and different attack scenarios can be seen in both sets.
These conditions contradict each other since we cannot preserve the right order while randomly shuffling the data before splitting it. To solve this problem, we developed an improved version of the train_test_split algorithm. It takes a list of seq_len values (5, 10, 20, 50, 100, 250, 500, and 1000 in our experiments), converts the 2D data matrix into a 3D tensor using the largest seq_len in the list, shuffles and splits the data into training and testing data, and converts the result back into a 2D matrix so that the training set can be used in feature normalization and feature selection.
The idea behind splitting the data into training and testing sets after creating the 3D tensor is to preserve the order in which the packets were captured and to shuffle by examples instead of shuffling by packets. Selecting the largest seq_len to convert the 2D matrix into a 3D tensor before data splitting minimizes the number of examples that contain out-of-order packets. This is useful later when smaller values of seq_len are used during training. Algorithm 1 and Figure 4 explain these steps in detail.
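The splitting procedure described above can be sketched in plain Python as follows. This is an illustrative sketch rather than a reproduction of Algorithm 1: the function signature, the seed argument, and the choice to drop trailing packets that do not fill a complete block are assumptions.

```python
import random
from typing import List, Sequence, Tuple

Row = List[float]


def improved_train_test_split(
    X: Sequence[Row],
    y: Sequence[int],
    seq_lens: Sequence[int],
    train_fraction: float = 0.9,
    seed: int = 0,  # hypothetical parameter, added for reproducibility
) -> Tuple[List[Row], List[int], List[Row], List[int]]:
    """Shuffle and split by blocks of max(seq_lens) consecutive packets."""
    block = max(seq_lens)
    num_blocks = len(X) // block  # assumption: an incomplete tail block is dropped
    # "3D tensor" step: group consecutive packets (and labels) into blocks.
    blocks = [
        (X[i * block:(i + 1) * block], y[i * block:(i + 1) * block])
        for i in range(num_blocks)
    ]
    # Shuffle whole blocks, so packet order inside each block is preserved.
    random.Random(seed).shuffle(blocks)
    cut = int(train_fraction * num_blocks)

    def flatten(chunks):
        # Back to 2D: concatenate the blocks row by row.
        rows = [row for xs, _ in chunks for row in xs]
        labels = [label for _, ys in chunks for label in ys]
        return rows, labels

    X_train, y_train = flatten(blocks[:cut])
    X_test, y_test = flatten(blocks[cut:])
    return X_train, y_train, X_test, y_test
```

Because the shuffle operates on blocks of the largest sequence length, every run of that many packets in the returned training matrix is still in capture order.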
Algorithm 1 improved_train_test_split
Input: S: a list of predefined sequence lengths. X: a matrix of size m × n that contains the packets. y: a vector of size m × 1 that contains the label of each packet. α: a scalar between 0 and 1 that represents the fraction of the dataset used for training. Here m is the number of packets and n is the number of features.
Output: X_train: a matrix of size p × n. y_train: a vector of size p × 1. X_test: a matrix of size q × n. y_test: a vector of size q × 1.

We split the dataset into a training set (90%) and a testing set (10%). We chose this split ratio because the dataset has a large number of examples: most of the data can be used for training while the testing set still contains a large number of examples on which to evaluate the model.

After data splitting, we normalized the training set feature-wise across all packets in the training set to have zero mean and unit variance. One-hot encoded features and bit features, e.g., TCP flag bits, were not normalized. Feature normalization was crucial for the models to converge during training. We applied the same normalization to the testing set using the means and variances that were calculated on the training set. In a production environment, we propose to use exponentially weighted moving average (EWMA) and variance (EWMV) [25] estimation to cope with any changes in DDoS attack trends using equation 1 and equation 2, where x_t is the current packet and A_{t−1} and V_{t−1} are the previous EWMA and EWMV, respectively. α and β are controllable parameters that specify how much we depend on the previous EWMA, the previous EWMV, and the current packet x_t to calculate the current EWMA (A_t) and EWMV (V_t), respectively. We set A_0 to the average and V_0 to the variance calculated on the training set.
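To make the running normalization concrete, the following Python sketch updates the moving estimates for a single feature and then standardizes the current value. It assumes one common form of the exponentially weighted updates, A_t = αA_{t−1} + (1 − α)x_t and V_t = βV_{t−1} + (1 − β)(x_t − A_t)^2, which may differ from the exact form of equations 1 and 2. The function names, the default values of alpha and beta, and the eps guard are hypothetical.

```python
from typing import Tuple


def ewma_update(
    x_t: float,
    a_prev: float,
    v_prev: float,
    alpha: float = 0.99,  # assumed weight on the previous EWMA
    beta: float = 0.99,   # assumed weight on the previous EWMV
) -> Tuple[float, float]:
    """Return the updated (EWMA, EWMV) for one feature value x_t."""
    a_t = alpha * a_prev + (1.0 - alpha) * x_t
    v_t = beta * v_prev + (1.0 - beta) * (x_t - a_t) ** 2
    return a_t, v_t


def normalize(x_t: float, a_t: float, v_t: float, eps: float = 1e-8) -> float:
    """Standardize x_t with the running mean and variance estimates."""
    return (x_t - a_t) / (v_t + eps) ** 0.5
```

In this sketch, A_0 and V_0 would be initialized to the training-set mean and variance of the feature, as stated above, and values of alpha and beta close to 1 make the estimates change slowly.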

D. FEATURE SELECTION
We trained a logistic regression classifier with LASSO regularization and applied a grid search with 5-fold cross-validation to obtain the best estimate of the regularization coefficient. Then, we trained another classifier with the best estimated regularization coefficient and extracted the features selected by the algorithm. The best estimated regularization     Figure 6.

F. TRAINING THE ACCURATE DETECTOR FOR THE SERVER
We trained multiple RNN and LSTM models using two techniques that we developed for training and prediction. The first technique trains the sequence model to predict a label for each packet in the training example, i.e., sequence-to-sequence training. Prediction for this technique is done by majority voting: if most of the packets in the sequence are predicted as attack packets, the final prediction for the whole sequence is attack; otherwise, the final prediction is normal. The training procedure is explained in Algorithm 5. The majority voting prediction is detailed in Algorithm 4 and shown in Figure 5a. The second technique trains the sequence model to predict the last output of a given training example as attack if the majority of its ground-truth labels are attack labels. The idea is to compute the majority of the ground-truth labels and force the model to learn to output that majority label at the last output of the given sequence, as shown in Figure 5b.

We trained both LSTM and RNN models for different values of seq_len to see how the sequence length affects the prediction accuracy. In our experiments, we set seq_len to 5, 10, 20, 50, 100, 250, 500, and 1000 and trained RNN and LSTM models for each sequence length. Moreover, we developed Algorithm 2 to train RNN and LSTM models on a mix of sequence lengths. To train on a single sequence length, we simply provide the algorithm with a set of sequence lengths S that contains only that sequence length. Figure 9 in Appendix B shows the F1-scores calculated for 18 RNN/LSTM models, where the suffix "_XXX" represents the sequence length that the model was trained on. The suffix "_mix" means that the model was trained on a mix of different sequence lengths as described by Algorithm 2. We tested these models on a randomly generated list of sequence lengths and found that the models trained using the proposed algorithm can generalize better to other sequence lengths.

TABLE 2. The values of true negatives (TN), false positives (FP), false negatives (FN), and true positives (TP) when we evaluated our models on the testing set of CIC-IDS2017-Friday [26]. The suffix "_mix_MV" in the model's name means that the model was trained using Algorithms 2, 4, and 5, where "MV" stands for Majority Voting and "mix" stands for training on a mix of sequence lengths. The suffix "_T" indicates that the model evaluation is done on the list of training sequence lengths and the suffix "_R" indicates that the evaluation is done on a list of randomly generated sequence lengths.

Algorithm 3 Predict_Last_output
Input: M: a pretrained vanilla RNN model or one of its variants. X: a matrix of size T × n that represents a single example. T: the sequence length that is considered when converting the matrix into a tensor, where n is the number of features per packet.
Output: ŷ_final: the prediction of the trained model: attack or normal.
    if ŷ_t = "attack" then
      count ← count + 1
    end
  end
  // 50% or more is considered the majority
  if (count / T) ≥ 0.5 then
    ŷ_final ← "attack"
  else
    ŷ_final ← "normal"
  end
  return ŷ_final
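The two decision rules used in this section can be sketched in Python as follows. The first function applies the majority vote to the per-packet predictions of a sequence-to-sequence model, using the "50% or more" threshold stated in the pseudocode. The second function derives the single target label used by the last-output technique from the ground-truth labels. The function names and the string labels are illustrative, and applying the same 50% tie rule to the ground-truth labels is an assumption.

```python
from typing import Sequence

ATTACK, NORMAL = "attack", "normal"


def majority_vote(per_packet_predictions: Sequence[str]) -> str:
    """Final label for a sequence from its per-packet predictions."""
    if not per_packet_predictions:
        raise ValueError("the sequence must contain at least one packet")
    count = sum(1 for p in per_packet_predictions if p == ATTACK)
    # 50% or more attack predictions is considered the majority.
    return ATTACK if count / len(per_packet_predictions) >= 0.5 else NORMAL


def last_output_target(ground_truth_labels: Sequence[str]) -> str:
    """Training target for the last output: the majority ground-truth label.

    Assumption: ties are resolved with the same 50% rule as majority_vote.
    """
    return majority_vote(ground_truth_labels)
```

For a sequence of length 4 with predictions attack, normal, attack, normal, the vote yields attack because exactly half of the packets are predicted as attack packets.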

Algorithm 5 Train_Seq_to_Seq
We found that our proposed training and prediction algorithms (Algorithms 2, 4, and 5) achieved the best F1-score, with 100% correct predictions for all examples in the randomly generated list of sequence lengths 77, 329, 597, 643, and 877. The hyperparameters of all models were fixed. Each model had a single bidirectional hidden layer of size 20 and was trained for 3000 epochs with a batch size of 64, a learning rate of 0.001, and a dropout of 50%. We compared the performance of our algorithms with LUCID [22]. We trained LUCID on our dataset with time window = 10 and packet flow length = 10. We found that our models achieved higher scores than LUCID on the test set. Figure 7 shows the results of our models compared with LUCID [22].

VOLUME 10, 2022

IV. STATE-OF-THE-ART COMPARISON
For a fair comparison between our solution and the state-of-the-art, we focus our comparison on solutions that used the CIC-IDS2017 [26] dataset. We prepared the data and extracted the features as described in the previous sections. Then, we trained different LSTM and RNN models using the algorithms improved_train_test_split (Algorithm 1), Train_Seq_to_Seq (Algorithm 5), Predict_Majority_Voting (Algorithm 4), and Train_RNN_Mix_Seq_lens (Algorithm 2) with settings similar to those described for the accurate classifier in Section III-F. We used the traffic trace of Friday (CIC-IDS2017-Friday) as it was used by [22], [32], and [33]. The total number of packets in the traffic trace was 9,997,874, of which 926,978 were attack packets and 9,070,896 were normal packets.

In the testing phase, we computed True Negatives (TN), False Positives (FP), False Negatives (FN), and True Positives (TP) for each sequence length in the training list of sequence lengths as well as for the list of randomly generated sequence lengths, as shown in Table 2. After that, we calculated the metrics F1-score, Accuracy, Precision, and Recall using the total TN, FP, FN, and TP from Table 2. We used the totals because we wanted to evaluate the overall performance of our models against the state-of-the-art solutions. Table 5 shows the performance of our models against the state-of-the-art models.

We repeated the same process to train and evaluate the model LSTM_mix_MV_T on the traffic trace of Wednesday (CIC-IDS2017-Wednesday). The suffix "_mix_MV_T" in the model's name means that the model was trained using Algorithms 2, 4, and 5 and that it was evaluated using the list of training sequence lengths. See Table 4. Also, we compared ShieldRNN (LSTM_mix_MV_T) with unsupervised techniques on the CIC-IDS2017-Wednesday dataset and found that ShieldRNN achieved the best results among all models. Table 6 shows the results of the unsupervised learning models compared to the results of ShieldRNN.
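The metric computation from pooled confusion counts described in this section can be sketched in Python as follows. The sketch sums TN, FP, FN, and TP over all evaluated sequence lengths and then applies the standard definitions of Accuracy, Precision, Recall, and F1-score. The function names and the dictionary layout are illustrative assumptions, and the counts in the usage note are made-up numbers, not values from Table 2.

```python
from typing import Dict, Iterable


def pool_counts(per_length: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Sum the confusion counts obtained for each sequence length."""
    totals = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
    for counts in per_length:
        for key in totals:
            totals[key] += counts[key]
    return totals


def metrics(tn: int, fp: int, fn: int, tp: int) -> Dict[str, float]:
    """Standard classification metrics from total confusion counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

For example, metrics(**pool_counts([{"tn": 30, "fp": 6, "fn": 2, "tp": 12}, {"tn": 20, "fp": 4, "fn": 3, "tp": 23}])) evaluates the pooled counts TN = 50, FP = 10, FN = 5, and TP = 35.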
Moreover, we evaluated our techniques on the recently released CIC-IoT2022 [27] dataset, which contains the normal traffic of multiple IoT devices in different scenarios that simulate the real-life usage of IoT devices, such as the traffic generated when the IoT device is powered on, while it is idle, and during human interaction with it. The dataset also contains the traffic of simulated DoS attacks performed on IoT devices. The total number of packets used in building and evaluating the models is 55,541,583, of which 27,571,720 are normal packets and 27,969,863 are attack packets. We prepared the data and trained the ShieldRNN models following the same steps as in the previous experiments. Our results set a baseline for DDoS detection on IoT devices using the CIC-IoT2022 [27] dataset. Table 3 shows the details of the results of our models on the CIC-IoT2022 [27] dataset. Table 7 summarizes the performance of our models on all of the CIC datasets that were used in this work.

FIGURE 9. The F1-score results of 18 models on the testing set after using our training techniques. The values on the X-axis represent the sequence lengths that each model was tested on. Figures 9a and 9b show the results of the models when they were tested on a randomly generated list of sequence lengths. Figures 9c and 9d show the results of the same models when they were tested on the same sequence lengths that they were trained on.

V. CONCLUSION
This paper proposes a system architecture to better detect DDoS attacks that originate from IoT devices infected and controlled by a botnet. The system is composed of two types of machine learning classifiers: a lightweight classifier that is deployed on the IoT node, and a more accurate classifier that is deployed on the server to perform further analysis. For the IoT lightweight classifier, we trained 12 classifiers and found that a simple Random Forest classifier with only 3 trees achieved 99.983% accuracy and 99.979% F1-score on the test set, which makes it possible to implement such a simple classifier on an IoT device with limited resources. For the accurate classifier, we developed new training and prediction algorithms for RNNs that we called ShieldRNN. We found that ShieldRNN, i.e., training on a mix of sequence lengths combined with seq2seq training and majority voting prediction (Algorithms 2, 4, and 5), achieved the highest F1-scores: 99.919%, 99.822%, 100%, 99.834%, 100%, 100%, 100%, and 100% when it was tested on our dataset with the randomly generated list of sequence lengths 26, 57, 77, 212, 329, 597, 643, and 877, respectively. Also, we found that using the same sequence length in both training and testing tends to achieve higher scores than testing on a sequence length different from the one used in training. Moreover, we evaluated ShieldRNN on the CIC-IDS2017 [26] dataset to compare it with other algorithms and found that it outperformed all of them. Finally, we set baseline results for DDoS detection using ShieldRNN on the CIC-IoT2022 [27] dataset.

FIGURE 10. F1-scores of all RNN/LSTM models on the test set when they were tested on a randomly generated list of sequence lengths other than the list that the models were trained on. Each row shows the name of the model and its results when it was tested on the sequence length shown at the top of the column. The model name consists of two parts separated by an underscore: the first part shows the model type and the second part shows the sequence length that the model was trained on. For example, the model name LSTM_10 means that the model was an LSTM and the sequence length was 10, i.e., each 10 consecutive packets were fed to the model as a single training example. If the model name has the suffix _mix instead of a number, the model was trained on various sequence lengths, each of which was used to train the model for a single epoch.

FIGURE 11. F1-scores of all RNN/LSTM models on the test set when they were tested on the same list of sequence lengths that they were trained on.

See Figures 10 and 11.
FARIS ALASMARY received the B.S. degree in software engineering from the King Fahd University of Petroleum and Minerals, Dhahran, in 2018. He is currently pursuing the master's degree with the College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia. He is also an employee working on developing various statistical and end-to-end speech recognition models for Arabic and English languages. His research interests include automatic speech recognition, text-to-speech, and speaker identification.
SULAIMAN ALRADDADI received the bachelor's degree in computer engineering from the Yanbu University College (YUC), in 2020. He is currently working in the cybersecurity field as a Penetration Tester interested in hardware and the IoT assessment.
SAAD AL-AHMADI (Senior Member, IEEE) is currently an Associate Professor with the Department of Computer Science, King Saud University, Saudi Arabia. He has published many articles in highly cited journals and worked as a part-time consultant in several government organizations as well as the private sector. His current research interests include the IoT security, machine learning for cybersecurity, and future generation networks.
JALAL AL-MUHTADI received the M.S. and Ph.D. degrees in computer science from the University of Illinois at Urbana-Champaign, USA. He is currently the Director and a Cybersecurity Consultant of the Center of Excellence in Information Assurance (CoEIA). He is also an Associate Professor with the Department of Computer Science, King Saud University. He has over 50 scientific publications in the areas of cybersecurity, information assurance, privacy, and the IoT security.