Optimized Deep Autoencoder Model for Internet of Things Intruder Detection

The development of an optimized deep learning intruder detection model that can be executed on IoT devices with limited hardware support has several advantages, such as reducing communication energy, lowering latency, and protecting data privacy. Motivated by these benefits, this research aims to design a lightweight deep autoencoder model that has a shallow architecture with a small number of input features and few hidden neurons. To achieve this objective, an efficient two-layers optimizer is used to evolve a lightweight deep autoencoder model by simultaneously selecting the input features, the training instances, and the number of hidden neurons. The construction of the optimized model is guided by both the accuracy of a K-nearest neighbor (KNN) classifier and the complexity of the autoencoder model. To evaluate its performance, the proposed optimized model has been applied to the N-baiot intrusion detection dataset. The reported results show that the proposed model achieved an anomaly detection accuracy of 99% with a lightweight autoencoder that uses, on average, around 30 input features and only 2 hidden output neurons. In addition, the proposed two-layers optimizer outperformed several optimizers, namely the Arithmetic Optimization Algorithm (AOA), Particle Swarm Optimization (PSO), and Reinforcement Learning-based Memetic Particle Swarm Optimization (RLMPSO).


I. INTRODUCTION
Recently, deep learning models have shown great success in anomaly detection for IoT environments. These models include convolutional neural networks (CNN) [1]-[5], long short-term memory (LSTM) [6]-[8], deep autoencoders [9]-[14], deep belief neural networks [15], [16], and hybrid deep models [17]-[20].
CNN is a deep end-to-end model that performs automatic feature extraction from raw input data. Kim et al. [1] utilized a CNN for denial-of-service attack detection in IoT networks. The basic idea of their approach is to convert the 1D traffic features to a 2D image. CNN operations are then applied to encode the 2D image, which is eventually fed to a binary classifier that labels it as attack or normal IoT traffic. Their analysis on the CSE-CIC-IDS 2018 and KDD datasets showed that the CNN is efficient in encoding traffic features, achieving an F1-score of 82%. An eight-layer CNN architecture was given by Jung et al. [3] for IoT botnet detection in a smart health network. They focused on the pattern of power consumption to distinguish normal from abnormal IoT traffic, and collected a dataset of the power consumed by several IoT devices, including a camera, a router, and a voice assistant. The power was monitored during idle time as well as during attack time. Experimental analysis showed that the CNN distinguished malicious from normal patterns with an accuracy of 90%. A multi-CNN scheme that combines several CNN models for IoT industrial attack detection was given in [5]. Basically, the authors fused two CNN models and evaluated them on the NSL-KDD dataset. The results indicated that the fused scheme outperformed the single CNN model, with a detection performance of 87%. The problem of unsupervised anomaly detection using CNN was discussed by Munir et al. [2]. They developed a time series predictor that uses a CNN to predict the next step and passes it to another deep anomaly detector, which classifies it as a normal or outlier pattern. The approach in [2] showed competitive performance on several streaming datasets used for model evaluation. A hierarchy of semi-supervised temporal convolutional network (TCN) models was studied by Cheng et al. [4]. They stacked different TCNs and trained them with a mix of labeled and unlabeled instances.
LSTM models were investigated by Shi et al. [6], Xu et al. [7], and Li et al. [8]. Shi et al. [6] studied the effectiveness of the standard LSTM model for abnormal botnet traffic detection. Specifically, they encoded traffic packets as a time series sequence that represents the behavior and characteristics of botnet attacks. Experiments on several botnet benchmarks showed that LSTM outperformed RNN and other related models. This is due to the advantage of LSTM in handling the vanishing gradient problem, which occurs when training on long sequences [21]. Xu et al. [7] introduced an improved LSTM. The key point of their scheme is that they incorporated a time factor and a smooth activation function into LSTM to enhance its performance.
To assess the improved model in [7], extensive experiments were conducted on a real dataset collected from an IoT environment. The results indicated that further accuracy improvements were achieved by embedding the two techniques mentioned above (i.e., the time factor and the smooth activation function).
Deep autoencoder models were studied by Shone et al. [9], Kim et al. [10], Gurina et al. [11], Telikani et al. [12], Lopez-Martin et al. [13], and Meidan et al. [14]. Shone et al. [9] suggested a non-symmetric autoencoder model. The main concept was to evolve only the encoding phase, independently of the decoding phase. Experimental analysis indicated that their model improved accuracy by up to 5% over the standard autoencoder. Further work was given by Kim et al. [10], who implemented a deep autoencoder model for outlier detection. Specifically, their model was built using normal IoT traffic, so that unseen traffic instances lying outside the trained pattern are classified as suspicious behavior. Flood attack detection using an autoencoder model was explored by Gurina et al. [11]. The designed model was used to detect different classes of flood attacks, such as SYN flood, TCP flood, UDP flood, ICMP flood, and HTTP flood. The results reported in [11] indicated the superiority of the implemented autoencoder model in handling and classifying all of these flood attacks. A cost-sensitive stacked autoencoder model comprising several hidden layers was discussed in [12]. The key point of this scheme is to set different class costs in order to balance minority and majority data. Another study employed a conditional variational autoencoder for intruder detection in IoT networks [13]. The key concept of variational autoencoders is that they rely on probability distributions to capture and encode traffic patterns.
A Deep Belief Neural Network (DBNN) was applied by Manimurugan et al. [15] for intruder detection in an IoT-based smart medical environment.
They investigated several kinds of attacks, such as Heartbleed, SQL injection, and infiltration. In [15], optimization algorithms were employed to tune the DBNN structure in terms of the number of layers and the number of neurons in each layer. The reported results indicate that the DBNN achieved an F1-score of more than 97% for all kinds of attacks. Further work was introduced by Balakrishnan [16], in which a DBNN was applied for the prevention of various IoT network attacks, including denial of service, overflow, brute force, DNS query, cache poisoning, malware infection, and others. The average F1-score reported in [16] was 95.3%.
Hybrid deep learning models were studied by Hwang et al. [17], Parra et al. [18], and Yin et al. [19]. Hwang et al. [17] proposed a combination of a CNN model with a deep autoencoder model. The hybrid model in [17] was built from normal IoT traffic, so that DDoS patterns are captured as outliers. The experiments indicated the success of the hybrid model, with an extremely low false-positive rate. Similarly, a CNN with a deep autoencoder model was employed by Yin et al. [19] for encoding time-series sequences. In their analysis, they used the Yahoo Webscope S5 time series dataset, and their hybrid model achieved an accuracy of 99%. A distributed deep learning model that combines CNN with LSTM was investigated by Parra et al. [18]. In their approach, the LSTM worked as a backend detector on the cloud, whereas the CNN was deployed on the IoT edge to encode attack patterns. The integrated model in [18] was evaluated using the N-baiot dataset and reported an F1-score of 94%. The integration of deep learning with metaheuristics was studied in [20]. Basically, the Whale optimizer was integrated with LSTM to perform an automatic selection of the weights and biases. The approach was evaluated on various benchmark datasets, including CIDDS-001, UNSW-NB15, and KDD.
The results in [20] showed that an accuracy above 99% was achieved on all datasets. Further deep learning-based hierarchical and ensemble models were presented in [22] and [23], respectively. All the previously discussed deep learning models are summarized in Table 1.
Nevertheless, deep learning models face the challenge of IoT resource constraints such as power, memory, and CPU support. To mitigate this challenge, optimization algorithms are considered one promising solution [20]. However, integrating the deep autoencoder model with optimization algorithms requires a powerful optimizer that can handle high-dimensional optimization. To fill this gap, this study adopts an efficient two-layers optimizer. This optimizer has several advantages: (i) it evolves with a small population size, (ii) it has one dedicated layer for fine-tuning and one layer for exploration, and (iii) it incorporates Q-learning to control the switching from exploration to exploitation. It should be noted that the two-layers optimizer was presented in our previous research as a conference paper on the problem of large-scale optimization [24]. The main contributions of this work can be summarized in the following points:
• It uses a lightweight, efficient optimizer that evolves with a micro swarm (three particles only).
• It integrates the lightweight optimizer with a deep autoencoder model to enhance the accuracy and reduce the complexity (number of input features and number of hidden neurons).
• It applies the proposed optimized model to the problem of IoT anomaly detection, and it compares the outcomes with results reported in the literature.
• It compares the performance of the employed optimizer with other well-known and recent optimization algorithms.
The remaining part of this paper is organized as follows. The details of the proposed optimized model are explained in Section II. A series of experiments conducted to evaluate the effectiveness of the proposed optimized model is presented in Section III, followed by the conclusion and future work in Section IV. Table 2 lists all abbreviations used in this study.

II. OPTIMIZED DEEP LEARNING AUTOENCODER
The main architecture of the proposed optimized model is given in Fig. 3. As can be seen, it contains four phases, namely micro swarm initialization, transition using the Q-learning algorithm [25], operations execution, and fitness evaluation. These phases are explained as follows.

A. Micro Swarm Initialization
In this phase, the micro swarm population, which consists of three particles, is initialized with a random vector X according to the search space of the optimized problem. In addition, a velocity variable V is used with X to control the size of the jump in the search space, as shown in Fig. 1. The length of the initialized vector X is equal to the length of the encoded problem. In particular, the vector X is encoded in three parts, which are the IoT feature selection F, the training instance selection I, and the number of autoencoder hidden neurons M, as given in Fig. 2. The IoT feature selection is encoded as a binary optimization problem in which each bin holds a variable F that takes a value of zero or one. If the bin is set to one, the corresponding feature is selected and will be activated during the feature extraction process. Otherwise, it will be omitted. The second part of the scheme encodes the training instances (malicious and normal). Each variable I takes a discrete value in the range of 1 to the number of instances. It is worth mentioning that half of the variables are used for selecting malicious IoT instances, and the rest are used for normal IoT instances. The length of the instance selection part was set to 200, with 100 variables for malicious and 100 for normal instances. The last part of the encoding scheme encodes the number of hidden neurons in the autoencoder. The variable M was configured to take a discrete value in the range of 2 to 10. In this study, it is assumed that the optimal embedding space is 2D, where it becomes easier to visualize the data and the KNN classifier works effectively.
In this study, the N-baiot dataset [14] is employed, which was captured from several IoT devices, namely a thermostat, a baby monitor, a webcam, and doorbells. These devices were exposed to two different types of botnet attacks, namely Gafgyt and Mirai, as shown in Fig. 4 [14].
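As a minimal sketch of the encoding scheme described above, the following Python code builds and decodes one particle. The layout (115 feature bits, then 100 malicious and 100 normal instance indices, then one hidden-neuron count) follows the text; the function names, the uniform random initialization, and the instance pool sizes are illustrative assumptions rather than the original implementation.

```python
import random

N_FEATURES = 115        # N-baiot feature count stated in the text
N_INSTANCE_SLOTS = 200  # 100 malicious + 100 normal instance variables
M_MIN, M_MAX = 2, 10    # allowed range for the number of hidden neurons


def random_particle(n_malicious, n_normal, rng):
    """Build one search vector X = [F | I | M] (hypothetical helper)."""
    half = N_INSTANCE_SLOTS // 2
    features = [rng.randint(0, 1) for _ in range(N_FEATURES)]
    malicious = [rng.randint(1, n_malicious) for _ in range(half)]
    normal = [rng.randint(1, n_normal) for _ in range(half)]
    hidden = rng.randint(M_MIN, M_MAX)
    return features + malicious + normal + [hidden]


def decode_particle(x):
    """Split X into selected feature indices, instance indices, and M."""
    half = N_INSTANCE_SLOTS // 2
    selected = [i for i, bit in enumerate(x[:N_FEATURES]) if bit == 1]
    malicious = x[N_FEATURES:N_FEATURES + half]
    normal = x[N_FEATURES + half:N_FEATURES + N_INSTANCE_SLOTS]
    hidden = x[-1]
    return selected, malicious, normal, hidden


rng = random.Random(0)
# Pool sizes of 5000 are placeholders, not values from the paper.
particle = random_particle(n_malicious=5000, n_normal=5000, rng=rng)
selected, malicious, normal, hidden = decode_particle(particle)
print(len(particle), len(selected), len(malicious), len(normal), hidden)
```

Under this layout the search vector has 115 + 200 + 1 = 316 dimensions, which illustrates why an optimizer suited to high-dimensional problems is needed.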

B. Transition using Q-learning
This layer is responsible for switching between the local search and global search modes. As described in [24], in the global search mode, the micro swarm particles explore the search space by modifying the whole search vector, whereas in the local search mode, they are allowed to modify only part of the search vector (see Fig. 8). The switching between global search and local search is performed under the control of the embedded Q-learning algorithm [25]. As indicated in Fig. 5, the Q-learning is modeled with two states, and a Q-table of size 2 x 2 is created to keep track of each state by rewarding each well-performing state and penalizing the others (i.e., by giving a value of -1). It is worth mentioning that each particle is associated with its own Q-table, which enhances the diversity of the population and enables each particle to evolve independently from the swarm [24].
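The two-state switching described above can be illustrated with a small per-particle Q-table. This is only a sketch: the 2 x 2 table and the penalty of -1 come from the text, while the reward of +1, the learning rate, the discount factor, the greedy selection rule, and the class name `ModeSwitcher` are assumptions made for illustration.

```python
GLOBAL, LOCAL = 0, 1  # the two search modes (states)


class ModeSwitcher:
    """One Q-table per particle; rows are current modes, columns next modes."""

    def __init__(self, learning_rate=0.1, discount=0.9):
        self.q = [[0.0, 0.0], [0.0, 0.0]]  # 2 x 2 Q-table
        self.state = GLOBAL
        self.lr = learning_rate            # assumed value
        self.gamma = discount              # assumed value

    def choose_next_mode(self):
        """Greedy choice; ties favour global search (an assumption)."""
        row = self.q[self.state]
        return GLOBAL if row[GLOBAL] >= row[LOCAL] else LOCAL

    def update(self, next_mode, improved):
        """Reward (+1) a move that improved fitness, penalize (-1) otherwise."""
        reward = 1.0 if improved else -1.0
        best_future = max(self.q[next_mode])
        old = self.q[self.state][next_mode]
        self.q[self.state][next_mode] = old + self.lr * (
            reward + self.gamma * best_future - old
        )
        self.state = next_mode


switcher = ModeSwitcher()
mode = switcher.choose_next_mode()     # initially global search
switcher.update(mode, improved=False)  # global search did not help
print(switcher.q, switcher.choose_next_mode())
```

After a single unsuccessful global step, the entry for staying in global search becomes negative, so the greedy rule moves the particle to local search.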

C. Operations execution
The implemented two-layers optimizer has three basic search operations, which are exploration, exploitation, and jumping search [24]. As mentioned earlier, the micro swarm has only three particles, and each particle $X_i$ is updated based on the following equations.
$$V_i^{t+1} = w V_i^{t} + c_1 r_1 \left(P_i - X_i^{t}\right) + c_2 r_2 \left(G - X_i^{t}\right)$$
$$X_i^{t+1} = X_i^{t} + V_i^{t+1}$$
where $X_i^{t+1}$ is the new location of the currently executed particle $i$, $V_i$ is the particle velocity, and $w$ is the inertia value. Parameters $c_1$ and $c_2$ are the cognitive and social acceleration coefficients, respectively. Variables $r_1$ and $r_2$ are random numbers in the range (0,1). $P_i$ is the local best position achieved by each particle and $G$ is the global best position gained by the micro swarm. When the particle performs an exploration search, it simply increases both $w$ (set to 0.9) and the value of $c_1$ (set to 2.5). At the same time, it decreases the value of $c_2$ (set to 0.5), which makes it fly away from the swarm, as can be seen in Fig. 6. On the other hand, exploitation search is done by flipping all these values, i.e., $w$ is set to 0.3, $c_1$ to 0.5, and $c_2$ to 2.5, as shown in Fig. 7.

FIGURE 7. Exploitation search operation
The jumping operation basically adds a random value to $X_i$ according to the range of the search problem, as explained in [24]. The local search operations are identical to the global search operations (i.e., exploration, exploitation, and jumping), except that they are applied to update only part of the search vector, as indicated in Fig. 8.
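The search operations in this subsection can be sketched as follows, assuming the standard PSO velocity and position update with inertia w, cognitive coefficient c1, and social coefficient c2. The parameter values are those quoted in the paragraph above (the optimizer settings listed later in the paper give w = 0.4 for exploitation instead of 0.3). The form of the jump, the clipping to the bounds, and the `dims` argument used to restrict an update to part of the vector for local search are illustrative assumptions.

```python
import random

EXPLORATION = {"w": 0.9, "c1": 2.5, "c2": 0.5}
EXPLOITATION = {"w": 0.3, "c1": 0.5, "c2": 2.5}


def pso_step(x, v, pbest, gbest, params, rng, dims=None):
    """One velocity/position update; `dims` limits it to part of the vector."""
    dims = range(len(x)) if dims is None else dims
    new_x, new_v = list(x), list(v)
    for d in dims:
        r1, r2 = rng.random(), rng.random()
        new_v[d] = (params["w"] * v[d]
                    + params["c1"] * r1 * (pbest[d] - x[d])
                    + params["c2"] * r2 * (gbest[d] - x[d]))
        new_x[d] = x[d] + new_v[d]
    return new_x, new_v


def jump(x, lower, upper, rng, dims=None):
    """Jumping search: add a random offset scaled to the range (assumed form)."""
    dims = range(len(x)) if dims is None else dims
    new_x = list(x)
    for d in dims:
        span = upper[d] - lower[d]
        candidate = x[d] + rng.uniform(-span, span)
        new_x[d] = min(max(candidate, lower[d]), upper[d])  # stay in bounds
    return new_x


rng = random.Random(1)
x, v = [0.0] * 4, [0.0] * 4
best = [1.0] * 4
# Global exploration moves every dimension; local exploitation only dims 1-2.
gx, gv = pso_step(x, v, best, best, EXPLORATION, rng)
lx, lv = pso_step(x, v, best, best, EXPLOITATION, rng, dims=[1, 2])
print(gx, lx)
```

In this sketch the only difference between global and local search is the set of dimensions passed in `dims`, which mirrors the statement that the local operations are the same operations applied to part of the vector.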

D. Fitness Evaluation
The last step of the proposed optimized autoencoder model is the fitness evaluation. Mainly it is used to assess the quality of the given solution by each particle in the micro swarm. Therefore, each particle will encode a different solution that contains three parts, namely the selected IoT features, selected training instances, and the number of hidden neurons. According to these settings, an autoencoder model will be trained. The basic idea of autoencoder models is to reduce the dimensionality of the data by encoding the input features to a new compressed space named embedding space, as shown in Fig. 9 [26]. As can be seen, the encoding stage has input neurons that receive the input features and map them to an embedding space. The decoder stage is responsible for recovering the data back from embedding space to feature space. As such, the autoencoder will be built guided by a loss function that represents the difference between input and reconstructed data.

FIGURE 9. The architecture of the autoencoder model
To minimize the computational time of fitness function, the number of training iterations of the autoencoder has been set to 100 only. Once the autoencoder has been trained, its output will be passed to a KNN classifier to classify the data and compute the accuracy rate. Here K is set to be five neighbors, as shown in Fig. 10.
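To illustrate the classification step, the sketch below applies a five-nearest-neighbor majority vote to points in a 2D embedding space. It is a plain Python illustration, not the authors' implementation: the Euclidean distance, the function names, and the toy embedding coordinates are assumptions, and the autoencoder that would produce the embeddings is not shown.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=5):
    """Label a query point by majority vote among its k nearest neighbors."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(train_points, train_labels, test_points, test_labels, k=5):
    """Fraction of test points whose predicted label matches the true label."""
    correct = sum(
        knn_predict(train_points, train_labels, q, k) == y
        for q, y in zip(test_points, test_labels)
    )
    return correct / len(test_labels)


# Toy 2D embeddings (hypothetical values): normal near (0, 0), malicious near (1, 1).
train_points = [(0.0, 0.1), (0.1, 0.0), (0.1, 0.1), (0.0, 0.0), (0.2, 0.1),
                (1.0, 0.9), (0.9, 1.0), (1.0, 1.0), (0.9, 0.9), (0.8, 1.0)]
train_labels = ["normal"] * 5 + ["malicious"] * 5
test_points = [(0.05, 0.05), (0.95, 0.95)]
test_labels = ["normal", "malicious"]

accuracy = knn_accuracy(train_points, train_labels, test_points, test_labels)
print(accuracy)
```

The returned accuracy plays the role of A in the fitness evaluation, so a well-separated embedding directly improves the fitness of the particle that produced it.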

FIGURE 10. KNN classifier operation
To evaluate the fitness function for each particle in the micro swarm ($X_i$), the following formula is used.
$$f(X_i) = -\alpha A + \beta C$$
where A is the recognition accuracy of the KNN classifier and C is the complexity of the autoencoder. Basically, C includes the ratio of the selected features with respect to the total number of features (i.e., 115 features in the N-baiot dataset). In addition, the complexity of the autoencoder output neurons is computed as the number of selected output neurons divided by 10. For illustration, when the number of selected features is 23 and the number of output neurons is 5, C will be 0.7 (23/115 + 5/10). The α and β parameters control the relative importance of A against C. Here α was set to 0.9 and β to 0.1.
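The complexity term and the weighting can be checked with a short calculation. The complexity follows the 23/115 + 5/10 example in the text. The way A and C are combined here, a minimized fitness of the form -αA + βC, is an assumption inferred from the negative fitness values (around -0.85) reported for the optimizer comparison; the function names are hypothetical.

```python
TOTAL_FEATURES = 115     # features in the N-baiot dataset
MAX_HIDDEN = 10          # upper bound on autoencoder output neurons
ALPHA, BETA = 0.9, 0.1   # weights of accuracy and complexity


def complexity(n_selected_features, n_hidden):
    """C = selected-feature ratio + output-neuron ratio."""
    return n_selected_features / TOTAL_FEATURES + n_hidden / MAX_HIDDEN


def fitness(accuracy, n_selected_features, n_hidden):
    """Assumed minimization form: lower (more negative) is better."""
    return -ALPHA * accuracy + BETA * complexity(n_selected_features, n_hidden)


print(round(complexity(23, 5), 2))        # 0.7
print(round(fitness(0.99, 30, 2), 3))     # -0.845
```

With 99% accuracy, about 30 selected features, and 2 hidden neurons, this assumed form yields a value close to the mean fitness of about -0.85 mentioned in the optimizer comparison.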

A. Dataset
In this study, the N-baiot dataset [14] is employed for the evaluation of the proposed optimized model. The details of this dataset are given in Table 3. The N-baiot dataset has 115 features, and all of them were considered in this study, as in previous works [32] and [14]. These features are computed statistically as the mean, the variance, the magnitude, etc., from monitoring IoT traffic over different time windows, namely 100 ms, 500 ms, 1.5 s, 10 s, and 1 min. From each window, a total of 23 statistical features were calculated, as described in Table 3. The dataset has been normalized so that all features are scaled to the range [0,1] using the following formula.
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
where $x'$ is the output feature after normalization, $x$ is the input feature, $x_{\min}$ is the minimum feature value, and $x_{\max}$ is the maximum feature value.
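A minimal sketch of the min-max scaling described above, applied to one feature column, is given below. The function name and the handling of a constant column (mapped to zeros) are assumptions, since the text does not discuss that case.

```python
def min_max_scale(column):
    """Scale one feature column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant feature: no spread to scale (assumed convention).
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]


print(min_max_scale([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```

In practice the minimum and maximum would be taken from the training split and reused for the test split, although the text does not state which convention was followed.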

B. Performance Measures
In this study, standard evaluation measures have been used to assess the performance of the proposed optimized model. Specifically, four different measures were implemented, namely accuracy, precision, recall, and F1-score. They are defined as follows.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
where TP is the total number of IoT instances correctly classified as malicious, TN is the total number of IoT instances correctly classified as normal traffic, FP is the total number of normal IoT instances wrongly classified as malicious, and FN is the total number of malicious instances wrongly classified as normal traffic.
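The four measures can be computed directly from the confusion counts, as in the following sketch. The standard definitions of accuracy, precision, recall, and F1-score are used; the counts in the example are hypothetical and are not results from the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts: 90 detected attacks, 10 missed, 5 false alarms.
metrics = classification_metrics(tp=90, tn=95, fp=5, fn=10)
print({name: round(value, 3) for name, value in metrics.items()})
```

Here malicious traffic is the positive class, so false alarms lower precision and missed attacks lower recall.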

C. Performance Analysis
This section compares the performance of the proposed model with that of a non-optimized autoencoder base model. The non-optimized model was trained with all input IoT features (115 features), and its number of output neurons was set to 10. In addition, the non-optimized model was trained with the whole training set, i.e., 70% of the data. Each experiment was repeated ten times, and the mean value of all measures is reported in Table 5. In terms of complexity, the optimized model uses only two hidden output neurons, and it reduced the input features to fewer than 36 in all studied cases (i.e., doorbell, thermostat, baby monitor, security camera, and webcam devices). In terms of accuracy, specificity, sensitivity, and F1-score, the outcomes confirmed the superiority of the optimized model. This is due to the compact embedding space produced by the optimized autoencoder, whose lower dimensionality helps the KNN classifier to work effectively.

D. Autoencoder Embedded Space Analysis
As a visual analysis of the outcomes of the optimized autoencoder, the testing data has been visualized in the optimized 2D embedding space, as can be seen in Fig. 11. The output of the autoencoder produces almost separable data. This helps the implemented KNN classifier to correctly classify and separate the malicious traffic from the normal traffic, as indicated in Fig. 10.

E. Selected IoT Features Analysis
Further analysis has been conducted to identify the IoT features most frequently selected by the implemented two-layers optimizer over ten independent runs. Fig. 12 displays the IoT features that were chosen in all runs. As can be seen, features related to traffic jitter (features starting with HH_jit) were selected in all runs. This implies that jitter is one of the most valuable indicators for distinguishing malicious IoT traffic. More details about these features can be found in [14].

F. Confusion Matrix Analysis
The confusion matrix has been computed in this section to measure the performance of the model in terms of true positive rate (TP), false positive rate (FP), true negative rate (TN), and false negative rate (FN). The outcome is compared with that of the non-optimized model in Fig. 13 and Fig. 14. As can be seen, the optimized model is able to eliminate most of the false alarms at a moderate cost of missing some malicious traffic (i.e., false negative rate). This is related to the advantage of the generalized and compact 2D embedding space generated by the optimized autoencoder. On the other hand, the non-optimized model uses a ten-dimensional embedding space, as explained earlier.

G. Comparison with Other Optimizers
This section compares the outcomes of the proposed model with those of other related optimizers, including PSO [27], RLMPSO [28], and AOA [29]. The settings of these algorithms are given in Table 7. Each experiment was executed ten times, with 500 fitness evaluations given to each optimizer. The mean values of accuracy, recall, precision, and F1-score are reported in Table 6. The proposed two-layers optimizer outperforms the other optimizers in all computed measures. One possible reason is the advantage of working with a micro swarm population, which requires fewer fitness evaluations. More importantly, the two-layers optimizer can switch adaptively from exploration to exploitation from the beginning of the search process. In contrast, PSO and AOA are time-dependent algorithms that start with exploration and move gradually to exploitation. The RLMPSO algorithm works with a small population size, but its incorporated local search optimizer requires a large number of fitness evaluations, as explained in [24]. This makes RLMPSO achieve the lowest results in all measures.
Fig. 15 illustrates a boxplot of the fitness values reported by each optimizer. It presents the minimum, mean, and maximum values produced by each optimizer. It can be seen that the two-layers optimizer achieves the best fitness value in all conducted experiments, including the doorbell, thermostat, baby monitor, security camera, and webcam devices. In particular, the two-layers optimizer reports a much better mean fitness value (around -0.85) with a compact boxplot. This is due to the aforementioned advantages of the implemented optimizer and to the dynamic transition from exploration to exploitation guided by the Q-learning algorithm.

I. Statistical Analysis
This section compares the outcomes of the optimized autoencoder against the non-optimized one statistically. Specifically, the Wilcoxon rank-sum test [31] is used, where the null hypothesis $H_0$ assumes that the outcomes of the compared methods have the same distribution, whereas the alternative hypothesis $H_1$ assumes the opposite. The significance level is set to 0.05, which means that the alternative hypothesis $H_1$ is accepted when the p-value is less than 0.05 (95% confidence level). The results of this test are given in Table 8. The optimized model significantly outperformed the non-optimized model in all measures, with a p-value less than 0.05. This is owing to the benefits of the generated 2D embedding space, which helps the KNN classifier to classify the data correctly. Furthermore, the fitness value obtained by the proposed two-layers optimizer has been compared statistically with those of the other optimizers, which are RLMPSO, PSO, and AOA. The results are shown in Table 9, which indicates that the p-value of the Wilcoxon rank-sum test is less than 0.05 for all compared optimizers. This implies that the two-layers optimizer significantly outperformed the other optimizers in terms of reported fitness value.
Settings of the Two-Layers optimizer [24]: population size 3; c1 = 2.5 in exploration mode and c1 = 0.5 in exploitation mode; c2 = 0.5 in exploration mode and c2 = 2.5 in exploitation mode; w = 0.9 in exploration mode and w = 0.4 in exploitation mode.
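The Wilcoxon rank-sum comparison used in this section can be sketched in plain Python with the large-sample normal approximation, which is the same statistic that SciPy's `scipy.stats.ranksums` reports. The sketch applies no tie correction, and for ten runs per method an exact test may be preferable; both points are simplifications of this illustration. The accuracy values below are hypothetical and are not taken from the paper's tables.

```python
import math


def rank_sum_test(sample_a, sample_b):
    """Two-sided Wilcoxon rank-sum test (normal approximation, no tie correction)."""
    combined = sorted([(v, 0) for v in sample_a] + [(v, 1) for v in sample_b])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        # Extend j over a run of tied values, then give them the average rank.
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n1, n2 = len(sample_a), len(sample_b)
    rank_sum_a = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    expected = n1 * (n1 + n2 + 1) / 2
    std_dev = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_a - expected) / std_dev
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical accuracies over ten runs for each model.
optimized = [0.99, 0.98, 0.99, 0.97, 0.99, 0.98, 0.99, 0.98, 0.97, 0.99]
baseline = [0.93, 0.94, 0.92, 0.95, 0.93, 0.94, 0.91, 0.95, 0.92, 0.93]
z, p_value = rank_sum_test(optimized, baseline)
print(round(z, 2), p_value < 0.05)
```

When every run of one method beats every run of the other, as in this toy data, the p-value falls well below the 0.05 threshold and the null hypothesis of identical distributions is rejected.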

J. Computational Time Analysis
In this section, the computational time of the proposed autoencoder model has been computed and compared with that of a non-optimized model. As explained earlier, the non-optimized model was trained with 115 input features and 10 output neurons. The hardware and software specifications are given in Table 12. The time required by both the optimized and the non-optimized autoencoder is given in Table 10. The optimized autoencoder reduces the computational time by 33% compared with the non-optimized model. Specifically, it needs only one microsecond to recognize one IoT input instance, whereas the non-optimized model needs around 1.5 microseconds. This confirms the benefits of the implemented two-layers optimizer.
In particular, the time needed for each phase in Figure 3, namely the initialization of the swarm, transition, search operations, and fitness evaluation, is computed and given in Table 11. As can be seen, most of the computational time is consumed in the fitness evaluation step. This is due to the time required to train and evaluate both the autoencoder model and the KNN classifier. The other steps need a negligible amount of time, as shown in Table 11.
The comparison with previously reported results is given in Table 13. The presented results showed that the proposed optimized model achieved a better F1-score in all case studies, i.e., doorbell, thermostat, baby monitor, security camera, and webcam. One reason is the advantage of using an autoencoder model to map the features to a compact and separable embedding space. It is worth mentioning that in [32], the features selected by the optimizer were fed directly to a one-class SVM classifier. In that case, the one-class SVM required a lot of computation time to find the optimal decision boundary. In contrast, this study utilizes both the ability of the two-layers optimizer to reduce the input features and the benefits of the autoencoder model to map the selected features to a lower-dimensional embedding space (2D). This speeds up the recognition time of KNN and enhances its generalization due to the reduction in model complexity.

L. Model Evaluation using IoTID20 dataset
To further validate the effectiveness of the proposed approach, the IoTID20 [33] intruder detection dataset has been used in this section. IoTID20 is a public dataset with 66 features. The data has been divided into 70% for training and 30% for testing. The outcome of the proposed approach is compared with other deep learning models, namely 1D-CNN and LSTM. These models were selected because they can work directly on 1D sequence patterns (i.e., IoTID20 features).
The accuracy of each model is shown in Table 14, and it can be seen that all models report almost the same results. Nevertheless, the proposed approach has the advantage of working with a small number of input features, as the two-layers optimizer was able to reduce the features by up to 62%. This results in a shallower model compared with 1D-CNN and LSTM.

IV. CONCLUSION, LIMITATION AND FUTURE DIRECTIONS
This work introduced a novel optimized deep learning-based autoencoder model for the problem of anomaly detection in IoT networks. The optimized model was constructed using an efficient two-layers optimizer that works with a micro swarm population of three particles. Specifically, the two-layers optimizer performed simultaneous IoT feature selection, training instance selection, and autoencoder neuron selection. The fitness function that guided the two-layers optimizer combined the accuracy of the KNN classifier, which takes the output of the autoencoder, with the complexity of the autoencoder model. The experimental results on the N-baiot dataset confirmed the superiority of the proposed optimized model over the non-optimized model. Moreover, the implemented two-layers optimizer achieved the best results in terms of fitness value compared with other well-known optimizers, including PSO, RLMPSO, and AOA. Statistically, the non-parametric Wilcoxon rank-sum test confirmed the significance of the obtained results.
Nevertheless, the proposed model needs further improvements to minimize the number of IoT input features. This will further reduce the complexity of the autoencoder model and make it able to work in a real-time IoT environment. This could be done by developing a heterogeneous optimizer that works cooperatively as a single model. Further ideas that could be investigated in the future are validating the model using other benchmark datasets and extending the model to multiclass attack recognition. Another future research avenue is the application of the proposed optimizer to the fine-tuning of the explainable artificial intelligence presented in [38].