HADIoT: A Hierarchical Anomaly Detection Framework for IoT

The Internet of Things (IoT) closely couples the Internet with the physical world. Because of their portable size, most IoT devices have limited computing and storage capabilities and are vulnerable to various malicious intrusions. Therefore, it is vital to have efficient approaches to distinguish true IoT data from fake data; we term such methods anomaly detection (AD). To detect anomalies accurately and efficiently, this article proposes HADIoT, a 3-hierarchy joint local and global anomaly detection framework, in which IoT devices generate and transmit sensory data to their local edge servers for local AD after data refinement, which includes re-framing, normalization, complexity reduction via Principal Component Analysis, and symbol mapping. High detection accuracy is achieved by jointly performing local and global AD. The local AD focuses on the data pattern consistency of individual devices via the Gated Recurrent Unit, and the processed data is then forwarded from edge servers to the cloud server for global AD. The global AD analyzes the data pattern correlations between different IoT devices using Conditional Random Fields. For the maintenance of cyber-security, HADIoT provides more accurate and faster anomaly detection for IoT applications than existing anomaly detection methods. The performance of the proposed method is empirically evaluated through simulations using a real dataset, the Information Security Center of Excellence (ISCX) 2012 dataset. Simulation results demonstrate the effectiveness of the proposed framework in terms of True Positive Rate, False Positive Rate, Precision, Accuracy and F_score, compared with three benchmark schemes.


I. INTRODUCTION
The recent advancement of the Internet of Things (IoT) has fostered new applications of IoT devices [1]. Manufacturers are racing into the IoT device market, releasing new IoT products at an exponential growth rate. Consequently, many IoT devices carry intrinsic vulnerabilities where potential risks lurk. Since IoT devices bridge the physical world and cyberspace, threats to the vulnerable devices can be aggregated and amplified in the core network. Hence, it is vital to develop efficient methods to detect and mitigate the vulnerabilities of IoT devices to ensure the cyber-security of IoT environments.
In order to protect the IoT world from anomalous intruders, intrusion detection systems (IDSs) have been developed. There are two main types of IDSs: signature-based detection systems and anomaly-based detection systems [2]. Signature-based IDSs recognize known patterns and are thus not suitable for detecting anomalous data with evolving patterns. In contrast, anomaly-based detection systems are more suitable for IoT environments. Specifically, an anomaly is a data point in data space which is significantly different from the rest of the data [3]. Since anomaly-based detection systems are able to reveal unusual behaviors, they are capable of detecting intrusions with unknown patterns. Nevertheless, Anomaly Detection (AD) for continuously changing intrusion patterns in IoT environments poses great challenges.
The associate editor coordinating the review of this manuscript and approving it for publication was A. Taufiq Asyhari.
One challenge is the heterogeneity of IoT devices. Because there are heterogeneous IoT devices on the market, the benign data patterns of different devices with dissimilar surveillance objectives vary. Therefore, it is unrealistic to train only one precise AD model for all types of IoT devices. Meanwhile, IoT devices are endowed with different quantities of resources, so not all of them are capable of performing on-device AD.
Another challenge is the detection accuracy. Anomaly detection methods can be categorized into parametric and non-parametric categories [4]. The parametric methods assume that the distributions of data are known a priori. However, network data is generated by multiple processes that reflect the functionalities of the corresponding systems, so it is unrealistic to suppose that the data follows a single normal distribution. In contrast, non-parametric methods perform well in cases where data patterns are vague. The basic idea is to find the regions with low density of data. Nevertheless, these methods usually have high computational complexity because of the large number of data features, which is commonly referred to as the curse of dimensionality. Moreover, some data features may be irrelevant to anomalous behaviors and only slow down the training. Therefore, reducing the computational complexity of datasets is also a fundamental issue.
To tackle the aforementioned challenges, an efficient AD framework for IoT needs to be designed. It should account for both detection accuracy and computational complexity. Hence, a hierarchical AD framework for IoT applications, the HADIoT, is proposed in this article. Three important issues are explored in the framework, i.e., the architecture design, accuracy improvement and reduction of computational complexity.
In the proposed framework, a 3-hierarchy architecture is designed. Specifically, IoT devices generate and transmit sensory data to local edge servers for local processing, which includes dimensionality reduction via Principal Component Analysis (PCA) and symbol mapping, and local AD via the Gated Recurrent Unit (GRU). Afterwards, large-scale data is delivered from edge servers to cloud servers for global AD, using Conditional Random Fields (CRF). High detection accuracy is jointly achieved by local and global AD via edge servers and cloud servers, respectively. Local AD focuses on the data pattern consistency of individual devices while global AD emphasizes the pattern correlations between different devices. We perform simulations of the proposed framework and demonstrate its effectiveness, using a real dataset from the Information Security Center of Excellence (ISCX).
The main contributions of this article are as follows.
• We design a novel hierarchical AD architecture for IoT. IoT devices transmit their data to local edge servers via which the processed data is then forwarded to the cloud server for further analysis.
• To the best of our knowledge, we are the first to propose a joint local and global AD framework to enhance the detection accuracy in IoT environments.
- In local edge servers, we adopt PCA to reduce the data dimensionality. We map data vectors to symbols, which reduces the computational complexity and overfitting simultaneously. We use the Gated Recurrent Unit to locally detect the anomalies of each device with reference to its individual symbol patterns.
- In the global processing, IoT data from each local server is forwarded to the cloud server for large-scale AD. We adopt CRF to detect anomalies via data correlations in the macro data space, i.e., anomalies of each device can be identified with reference to the data patterns of the other devices.
• We demonstrate the effectiveness of the proposed framework through simulations, using a real dataset from the ISCX.
The remainder of this article is organized as follows. Section II reviews the related work on AD. Section III introduces preliminaries, including the system model, the AD framework and problem definitions. Section IV presents the refinement of data. Section V details the local AD and global AD methods. Section VI presents the simulations and analysis of our proposed framework, and Section VII concludes the paper and discusses potential future work.

II. RELATED WORK
Non-parametric methods for anomaly detection are more sophisticated in multi-pattern systems. Xie et al. adopted the K-Nearest Neighbors algorithm to define a hyperplane around a data point, and a data point is identified as anomalous if there are fewer than k data points inside its hyperplane [5]. However, the computational complexity of this method is too high for high-dimensional cases. Chernogorov et al. applied the diffusion maps algorithm for dimensionality reduction of the data and then employed the K-means algorithm to detect the anomalies [6]. Lyu et al. proposed an efficient hyperellipsoidal clustering algorithm for anomaly detection [7]. In addition, Bars et al. presented a probabilistic framework to model and assess the abnormal communication volume at any single node by knowing the set of nodes involved [8].
With the development of deep learning techniques, Ezeme et al. introduced a hierarchical attention-based anomaly detection (HAbAD) model based on stacked Long Short-Term Memory (LSTM) Networks with Attention [9]. Yahyaoui et al. proposed a detection protocol that dynamically executes an on-demand Support Vector Machine (SVM) classifier in a hierarchical way whenever an intrusion is suspected [10]. Nguyen et al. proposed an autonomous self-learning distributed system, DIoT, to detect compromised IoT devices [11]. However, these works did not consider the correlations between different types of devices.
In this article, we construct a novel hierarchical AD framework for IoT. We reduce the data dimensionality with both PCA and symbol mapping. To the best of our knowledge, we are the first to jointly consider the local and global data patterns, i.e., the intra-node and inter-node behaviors, which is the main difference between our work and the existing ones.

III. PRELIMINARIES

A. THE SYSTEM MODEL
Consider a 3-hierarchy IoT architecture as shown in Fig. 1. The HADIoT consists of multiple local edge servers and one global cloud server. The local edge servers perform data pre-processing and local AD with reference to each device's own data pattern. The local edge servers then forward the processed data to the cloud server, which provides global AD services that require higher computational capacity. Specifically, the three hierarchies of the architecture are detailed as follows.
1) H1: The cloud server is equipped with higher computation capacity than edge servers. Local edge servers offload the processed data to the cloud server for global AD which takes into consideration the correlation of data patterns in the macro data space, i.e., between all the IoT devices. Global AD adopts a machine learning method, CRF, which is usually applied to natural language processing.
2) H2: The local edge servers perform data pre-processing and local AD for IoT devices. They re-frame and normalize the data from heterogeneous devices. Also, local servers use PCA to reduce the data dimensionality and then adopt symbol mapping to further decrease the computational complexity of AD. Lastly, they use a sequence prediction method to detect anomalies of each device according to its historical data pattern.
3) H3: Heterogeneous IoT devices generate and deliver data to local edge servers.

B. ANOMALY DETECTION FRAMEWORK
This subsection provides an overview of the proposed framework for AD in the context of IoT data. The flow chart is depicted in Fig. 2. There are three stages of data processing: 1) data pre-processing; 2) local AD; and 3) global AD. The detailed descriptions of these three stages are presented as follows.
Stage 1. Data pre-processing is the first stage of the proposed framework. We here use the ISCX 2012 dataset as the input of the framework. The first stage is performed on the edge server and consists of three phases, i.e., re-framing and normalizing, dimensionality reduction and symbol mapping.
i) Since IoT devices are heterogeneous, e.g., smart plugs report in frames while electronic thermometers deliver packets to routers, the IoT data needs to be re-framed into the same format. Moreover, the payloads of devices vary. Specifically, some payloads are quantitative while others are qualitative. Different payloads must be normalized in order to perform universal AD on them.
ii) Some indicators of IoT data within the same local region always remain the same, so they are redundant for the AD processing. Hence, we can discard these redundant indicators so as to reduce the computational complexity. We adopt PCA to realize dimensionality reduction of the IoT data.
iii) Because IoT devices are typically appliances with particular functions, their behavior patterns are relatively static and limited. Hence, the framework is able to capture all possible benign behaviors of IoT devices, and the benign patterns of each device can be easily calculated. The reduced data vectors from the second phase can then be classified into different classes corresponding to these behaviors. Each class is mapped to a specific data symbol $S^i_t$, where $i$ refers to the device index and $t$ denotes the time slot. Symbol mapping further reduces the complexity of multi-dimensional vector calculations, which enhances the feasibility of local and global AD in both Stage 2 and Stage 3.
Stage 2. The second stage of data processing is the local AD via the Gated Recurrent Unit (GRU), which is also performed on the edge server. The classifications of the data vectors of device $i$, denoted by the symbol sequence $(S^i_1, S^i_2, \ldots, S^i_t)$, are recorded in the local edge server, and the symbol sequence reveals the pattern of each device. We identify anomalies using the likelihood of the occurrence of symbols in the sequence of the device, because the data of IoT devices usually follows particular historical patterns. The GRU method is adopted here because of its promising capability for time-series prediction with fewer parameters to tune, compared with the well-known Long Short-Term Memory (LSTM) method.
Stage 3. The third stage is the global AD using Conditional Random Fields (CRF). Having been examined against its own sequential pattern at the local edge server, every data symbol is delivered to the cloud server for global AD, i.e., each current data symbol is tested in the macro data space. More specifically, the occurrence likelihood of each data symbol is examined via its correlations with the symbols of other devices. Hence, the CRF method, which is usually used for Natural Language Processing (NLP), is adopted here for occurrence likelihood detection in the context of the global data symbol space.
Here, $N$ is the number of IoT devices and $l_i$ denotes the length of the data vector of device $i$. The raw data $D$ is then re-framed, normalized and mapped to the symbol set $S$, in which each symbol of device $i$ takes a value in $\{1, 2, \ldots, J_i\}$ and $J_i$ indicates the number of data classes, i.e., the number of data symbols, of device $i$. Therefore, the IoT data can be indexed by time and device. The re-framing, normalization and symbol mapping of the IoT data will be introduced in Section IV.

D. PROBLEM DEFINITION
We aim to compute the anomaly score for the symbols of IoT devices and to determine whether each symbol is anomalous. An anomaly is defined locally and globally, by local edge servers in Stage 2 and by the cloud server in Stage 3, respectively.
Definition 1 (Local Anomaly): Symbol $S^i_t$ corresponding to device $i$ at time $t$ is determined as a local anomaly if its local probability of occurrence is below the local threshold $\theta_l$, as shown in (1):
$$A^i_{l,t} = \begin{cases} 1, & P^i_{l,t} < \theta_l \\ 0, & \text{otherwise} \end{cases} \tag{1}$$
where $P^i_{l,t}$ represents the likelihood of $S^i_t$ with reference to the historical pattern of device $i$, and $A^i_{l,t} = 1$ indicates that $S^i_t$ is locally anomalous. The calculation of $P^i_{l,t}$ will be introduced in detail in Section V.
Definition 2 (Global Anomaly): Symbol $S^i_t$ of device $i$ at time $t$ is determined as a global anomaly when its global likelihood of occurrence $P^i_{g,t}$ is below the global threshold $\theta_g$, as presented in (2):
$$A^i_{g,t} = \begin{cases} 1, & P^i_{g,t} < \theta_g \\ 0, & \text{otherwise} \end{cases} \tag{2}$$
where $A^i_{g,t} = 1$ indicates that $S^i_t$ is globally anomalous. The calculation of $P^i_{g,t}$ will be introduced in Section V.
Definition 3 (Overall Anomaly): Whether or not $S^i_t$ is anomalous is determined by the Boolean combination of $A^i_{l,t}$ and $A^i_{g,t}$, as shown in (3), where $A^i_t = 1$ indicates that $S^i_t$ is anomalous.
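The three definitions can be sketched as simple threshold tests. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the Boolean operator that combines the local and global flags is an assumption (logical OR by default, passed in as a parameter so that another operator can be substituted).

```python
from typing import Callable


def local_flag(p_local: float, theta_l: float) -> int:
    """Definition 1: flag 1 when the local likelihood is below theta_l."""
    return 1 if p_local < theta_l else 0


def global_flag(p_global: float, theta_g: float) -> int:
    """Definition 2: flag 1 when the global likelihood is below theta_g."""
    return 1 if p_global < theta_g else 0


def overall_flag(
    a_local: int,
    a_global: int,
    combine: Callable[[int, int], int] = lambda a, b: a or b,
) -> int:
    """Definition 3: Boolean combination of the two flags.

    The default OR is an assumption made for illustration only.
    """
    return int(bool(combine(a_local, a_global)))


# Hypothetical likelihoods and thresholds, chosen only for the example.
a_l = local_flag(p_local=0.40, theta_l=0.10)    # locally normal
a_g = global_flag(p_global=0.02, theta_g=0.05)  # globally anomalous
print(overall_flag(a_l, a_g))
```

Under the OR assumption, a symbol that looks normal against its own history is still reported when its correlation with other devices is unlikely.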

IV. DATA REFINEMENT
In this section, we introduce the refinement of IoT data. Raw data packets $D^i_t$ in the packet sequence $D^i_1, D^i_2, \ldots, D^i_t$ of device $i$ are re-framed and normalized to $C^i_t$ according to 8 features of the packet characteristics, $(c_1, c_2, \ldots, c_8)$. Then, the data is further reduced to $X^i_t$ via PCA, and finally mapped to data symbols $S^i_t$.

A. RE-FRAMING AND NORMALIZATION
The normalization of re-framed data packets according to the selected 8 features is shown in Table 1. The features are selected by the PCA method, which will be demonstrated in the following subsection, and the final selection of the optimal features will be presented in detail in Section VI. The data values of $c_1$, $c_2$, $c_3$, $c_6$, $c_7$, and $c_8$ are normalized linearly to the range from 0 to 1, with reference to the minimum and maximum values of each feature. Meanwhile, the direction and protocol name are both quantified as binary values. Hence, all the values of the 8 features are normalized between 0 and 1, which makes it feasible to perform generic processing over the packets from heterogeneous devices. The re-framed and normalized data can be depicted as in (4), where $C^i$ indicates the data sequence of device $i$ with a window of $k$ time slots, and $A^{tr}$ denotes the transpose of matrix $A$.
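The linear normalization described above can be sketched as a min-max rescaling per feature, with binary encoding for the two categorical features. The helper names and the sample values (including the `"tcp"` label) are hypothetical and serve only as an illustration.

```python
from typing import List


def min_max_normalize(values: List[float]) -> List[float]:
    """Linearly rescale one feature column to [0, 1] using its min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def binary_encode(value: str, positive: str) -> int:
    """Quantify a two-valued feature (e.g. direction, protocol) as 0 or 1."""
    return 1 if value == positive else 0


# Hypothetical payload sizes and protocol names for three packets.
print(min_max_normalize([10.0, 20.0, 30.0]))
print([binary_encode(p, positive="tcp") for p in ["tcp", "udp", "tcp"]])
```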

B. DIMENSIONALITY REDUCTION
Having re-framed and normalized the data from $D^i_t$ to $C^i_t$, we now refine the data by filtering out its redundant features via PCA. In this phase, $C^i_t$ is first linearly transformed to the zero-mean value $\tilde{C}^i_t$, as shown in (5), where $m = 1, 2, \ldots, 8$ denotes the index of the entry of $C^i_t$. Hence, the square sum of each entry denotes the variance of $\tilde{C}^i_t$. The covariance matrix of device $i$ is formulated as (6). Since $\mathrm{Cov}_i$ is a real symmetric, and hence diagonalizable, matrix, it can be diagonalized as $\Lambda_i$, as shown in (7), where $\lambda_{i,j}$ is the $j$-th largest eigenvalue of $\mathrm{Cov}_i$ and $\Lambda_i = (B_i)^{tr} \mathrm{Cov}_i B_i$. $B_i$ denotes the matrix consisting of all the eigenvectors of $\mathrm{Cov}_i$. Lastly, the larger $\lambda_{i,j}$ is, the more significant its corresponding component is. The redundant features corresponding to the smaller eigenvalues are discarded, thus reducing $\tilde{C}^i_t$ to $X^i_t$ via (8),
where the reduced matrix is constituted by the retained eigenvectors, i.e., the eigenvectors corresponding to small eigenvalues are discarded from $B_i$. Therefore, we realize the projection from $\tilde{C}^i$ to $X^i$ with the dimensionality reduced, and $\tilde{C}^i_t$ is transformed to $X^i_t$.
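The PCA steps above (centering, covariance, eigendecomposition, projection onto the leading eigenvectors) can be sketched with NumPy as follows. This is an illustrative sketch under assumptions: the function name is hypothetical, the covariance is normalized by the number of time slots, and the input is random data rather than the ISCX features.

```python
import numpy as np


def pca_reduce(C: np.ndarray, n_components: int):
    """Project the rows of C (time slots x features) onto the top components."""
    C_centered = C - C.mean(axis=0)                 # zero-mean each feature
    cov = C_centered.T @ C_centered / C.shape[0]    # assumed 1/k normalization
    eigvals, eigvecs = np.linalg.eigh(cov)          # symmetric input, ascending order
    order = np.argsort(eigvals)[::-1]               # re-sort in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    B_reduced = eigvecs[:, :n_components]           # keep leading eigenvectors only
    X = C_centered @ B_reduced                      # reduced data
    return X, eigvals


rng = np.random.default_rng(0)
C = rng.random((50, 8))          # hypothetical window: 50 time slots, 8 features
X, eigvals = pca_reduce(C, n_components=3)
print(X.shape)
```

With this normalization, the variance of each column of the reduced data equals the corresponding eigenvalue, which is why larger eigenvalues mark the more significant components.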

C. SYMBOL MAPPING
Each row vector in $X^i$, denoted by $X^i_t$, corresponds to the refined data of device $i$ at time $t$. We aim to simplify the data from vectors to symbols, i.e., symbol mapping. We adopt the K-means algorithm with the Silhouette method to cluster the data of device $i$ into $J_i$ classes. Therefore, the symbol set of device $i$ is denoted by $S^i = \{S^i_t \mid S^i_t = 1, 2, \ldots, J_i\}$, where $S^i \subset S$. $J_i$ is determined by the Silhouette method, which uses the average Silhouette coefficient to evaluate how well the data is clustered under different parameters. Having mapped the data to the symbol set, the volume of IoT data is further reduced. Meanwhile, the noise interference on the data can be smoothed, because the data in the same cluster is represented by the cluster center and the impact of data variations within the same cluster on the AD computation can be ignored.
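Symbol mapping with K-means and the Silhouette method can be sketched as below. This is a simplified illustration rather than the authors' code: the function names are hypothetical, the deterministic farthest-point initialization is an assumption made for reproducibility, and the candidate range for the number of symbols is arbitrary.

```python
import numpy as np


def init_centers(X, k):
    """Farthest-point initialization (an assumption, chosen to be deterministic)."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2)
        centers.append(X[int(d.min(axis=1).argmax())])
    return np.array(centers)


def kmeans(X, k, n_iter=50):
    """Plain K-means: returns the cluster label of each row and the centers."""
    centers = init_centers(X, k)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


def mean_silhouette(X, labels):
    """Average Silhouette coefficient over all points."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    scores = []
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() == 1:
            scores.append(0.0)                     # convention for singletons
            continue
        a = D[i, own].sum() / (own.sum() - 1)      # mean intra-cluster distance
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


def choose_num_symbols(X, candidates=range(2, 7)):
    """Pick the number of symbols J_i with the highest average Silhouette."""
    best_k, best_score, best_labels = None, -1.0, None
    for k in candidates:
        labels, _ = kmeans(X, k)
        if len(np.unique(labels)) < 2:
            continue
        score = mean_silhouette(X, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels


rng = np.random.default_rng(0)
blobs = np.vstack([rng.normal(c, 0.1, size=(20, 2))
                   for c in ([0, 0], [10, 0], [0, 10])])
J_i, symbols = choose_num_symbols(blobs)
print(J_i)
```

Each cluster index then plays the role of one data symbol, so a multi-dimensional vector is replaced by a single integer.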

V. ALGORITHM OF JOINT LOCAL AND GLOBAL ADs
In this section, we propose an algorithm for anomaly detection in IoT applications that jointly considers both local and global AD, as follows.

A. LOCAL AD
We here detail an application of the standard Gated Recurrent Unit (GRU) to perform the local AD in local edge servers. The GRU method, proposed by Cho et al., is a variant of the Long Short-Term Memory (LSTM) [12] that combines the forget gate and the input gate into a single update gate. Each recurrent unit adaptively captures dependencies of different time scales, thus enabling the GRU method to reveal correlations of symbols of any device within a given time span.
In the gated recurrent unit of device $i$ at time $t$, there are two gates, the update gate $z^i_t$ and the reset gate $r^i_t$. The update gate decides how much the unit updates its content, and is computed as in (9), where $W^i_z$ indicates the weight coefficient matrix of the update gate of device $i$, $h^i_{t-1}$ denotes the memory content propagated from the previous time slot, and $x^i_t$ represents the input symbol of device $i$ at time $t$.
The current memory content $\tilde{h}^i_t$ is calculated as in (10), where $W^i$ is the heritage coefficient matrix of device $i$, and $\sigma$, $\tanh$ and $\odot$ indicate the sigmoid function, hyperbolic tangent function and element-wise multiplication, respectively. $r^i_t$ indicates the reset gate factor, which is acquired as in (11).
The final memory content of the GRU at time $t$, i.e., $h^i_t$, is a linear interpolation between the previous memory $h^i_{t-1}$ and the current memory $\tilde{h}^i_t$, as defined in (12).
The probability distribution of the candidate symbols of device $i$ at time $t$, denoted by $\hat{y}^i_t$, is calculated as in (13), where $V^i$ denotes the projection matrix from $h^i_t$ to $\hat{y}^i_t$. GRU preserves the ability of LSTM to learn from context, and has a simpler structure which reduces the computational cost. When applied in our framework, $x^i_t$ is the current input $S^i_t$, and $\hat{y}^i_t$ denotes the likelihood of $S^i_t$. The weight coefficient matrices of the reset and update gates, $W^i_r$ and $W^i_z$, are trained on the sample data.
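One GRU time step, as described above, can be sketched in NumPy. The sketch follows the commonly used formulation of the GRU of Cho et al. [12] and may differ from the paper's exact parameterization: the concatenated-input weight layout, the omission of bias terms, the direction of the interpolation and the softmax output layer are assumptions, and the weights are random placeholders rather than trained values.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def softmax(a):
    e = np.exp(a - a.max())        # shift for numerical stability
    return e / e.sum()


def gru_step(x_onehot, h_prev, W_z, W_r, W, V):
    """One GRU step for a single device: returns new memory and symbol likelihoods."""
    concat = np.concatenate([h_prev, x_onehot])
    z = sigmoid(W_z @ concat)                                      # update gate
    r = sigmoid(W_r @ concat)                                      # reset gate
    h_cand = np.tanh(W @ np.concatenate([r * h_prev, x_onehot]))   # current memory
    h = (1.0 - z) * h_prev + z * h_cand                            # linear interpolation
    y_hat = softmax(V @ h)                                         # distribution over symbols
    return h, y_hat


rng = np.random.default_rng(0)
J, k = 4, 3                          # hypothetical: 4 symbols, memory size 3
W_z = rng.normal(size=(k, k + J))
W_r = rng.normal(size=(k, k + J))
W = rng.normal(size=(k, k + J))
V = rng.normal(size=(J, k))
x = np.eye(J)[2]                     # one-hot encoding of the current symbol
h, y_hat = gru_step(x, np.zeros(k), W_z, W_r, W, V)
print(y_hat)
```

In the local AD, the entry of the predicted distribution that corresponds to the symbol actually observed next would serve as its likelihood and be compared against the local threshold.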
In the HADIoT, we treat the sensing data of each device as a sentence. The $t$-th word of sentence $i$ is denoted by $D^i_t$. After symbol mapping, $S^i_t$ represents the $t$-th word in sentence $i$ with only one letter. We make use of a dataset of $N$ sentences, each containing $T$ words, to train the GRU language model. Each sentence is individually processed in its local edge server. For time slot $t$, the GRU unit computes the output $\hat{y}^i_t$ using the input $S^i_t$ and the previous memory content $h^i_{t-1}$, via Equation (9) to Equation (13). The dimensions of the input, output and parameters are specified in terms of $k$, the internal memory size of the GRU. $\hat{y}^i_t$ denotes the probability vector of the $J_i$ potential symbol values of sentence $i$, while the vector $y^i_t$ indicates the ground truth of sentence $i$ at time $t$.
We then calculate the cross entropy loss $L^i_t$ as follows.
where $\log$ indicates the element-wise logarithm function.
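The cross entropy loss between the ground-truth vector and the predicted probability vector can be sketched as follows. The standard form, the negative inner product of the one-hot ground truth with the element-wise logarithm of the prediction, is assumed here; the function names and the sample vectors are hypothetical.

```python
import math
from typing import List


def cross_entropy(y_true: List[float], y_hat: List[float], eps: float = 1e-12) -> float:
    """Loss for one time slot: -sum(y_true * log(y_hat)), element-wise."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, y_hat))


def sentence_loss(targets: List[List[float]], predictions: List[List[float]]) -> float:
    """Total loss of one sentence: the sum of the per-slot losses."""
    return sum(cross_entropy(y, p) for y, p in zip(targets, predictions))


# Hypothetical example with three candidate symbols over two time slots.
targets = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
predictions = [[0.2, 0.5, 0.3], [0.25, 0.5, 0.25]]
print(sentence_loss(targets, predictions))
```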
To train the GRU, we need to acquire the values of all the parameters that minimize the total cross entropy loss of sentence $i$, i.e., $L^i = \sum_{t=1}^{T} L^i_t$. We adopt the Stochastic Gradient Descent method to tune the parameters [13]. With the trained GRU model, we calculate the likelihood vector $\hat{y}^i_t$ to detect the local anomalies of each device $i$. When the likelihood of the input data is calculated, the probability threshold $\theta_l$ is automatically adjusted from 1 to 0 by the edge server. For each value of $\theta_l$, the detection results, i.e., the TPR, FPR, Accuracy, Precision and F_score, are acquired. The threshold $\theta_l$ is fixed at the value for which the corresponding Precision is maximized.
When the parameters and threshold are trained, the GRU model is applied to the local AD. However, if the nature of the network traffic changes, the GRU parameters and threshold $\theta_l$ are re-trained.
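The threshold search described above can be sketched as a sweep from 1 down to 0 that keeps the threshold with the highest Precision. The sketch follows the paper's convention that the positive class is normal data; the function names, the step count, the tie-breaking rule (the first, i.e. largest, threshold reaching the maximum is kept) and the sample data are assumptions.

```python
from typing import List, Tuple


def precision_at(likelihoods: List[float], is_normal: List[bool], theta: float) -> float:
    """Precision when data with likelihood >= theta is predicted as normal."""
    tp = sum(1 for p, n in zip(likelihoods, is_normal) if p >= theta and n)
    fp = sum(1 for p, n in zip(likelihoods, is_normal) if p >= theta and not n)
    return tp / (tp + fp) if (tp + fp) else 0.0


def select_threshold(likelihoods: List[float], is_normal: List[bool],
                     steps: int = 100) -> Tuple[float, float]:
    """Sweep theta from 1 to 0 and return (theta, precision) at the maximum."""
    best_theta, best_prec = 1.0, -1.0
    for s in range(steps, -1, -1):
        theta = s / steps
        prec = precision_at(likelihoods, is_normal, theta)
        if prec > best_prec:          # strict: keep the first maximum found
            best_theta, best_prec = theta, prec
    return best_theta, best_prec


# Hypothetical likelihoods with ground-truth labels (True = normal).
likelihoods = [0.9, 0.8, 0.7, 0.2, 0.1]
is_normal = [True, True, True, False, False]
print(select_threshold(likelihoods, is_normal))
```

Maximizing Precision alone tends to favor a high threshold, so in practice the resulting TPR and FPR would be inspected alongside it, for example on the ROC curve.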

B. GLOBAL AD
We then adopt the CRF in the global AD stage. The CRF is a machine learning model through which the conditional probability distribution of hidden states can be calculated from given observations. It was first introduced by Lafferty et al. for text sequence prediction [14], and has been applied to many problems in NLP, bioinformatics, and computer vision. In the HADIoT, the CRF model constructs the conditional probability distribution of each potential state and determines the most probable labeling of the observed data.
As mentioned in Definition 2, the global anomalous states of all devices at time $t$ can be denoted as $[A^1_{g,t}, A^2_{g,t}, \ldots, A^N_{g,t}]$, which is regarded as a Markov chain. This assumption is sound because the input at time $t$, $x_t = [S^1_t, S^2_t, \ldots, S^N_t]$, has a sequential form, as in sequence prediction for Natural Language Processing (NLP) problems and gene sequence analysis for bioinformatics. Markov conditional random fields are also suitable for modeling and classifying IoT data because the data can be intrinsically regarded as sentences generated at each time slot. Moreover, the data generated by device $i$ at time $t$ can be described as the $i$-th word in the $t$-th sentence.
The probability distribution over random variables $(x_t, A_{g,t})$ is modeled via a Markov Random Field, in which the joint probability can be factorized into a product of potential functions over intrinsic features. The probability of a potential state vector $A_{g,t} = [A^1_{g,t}, A^2_{g,t}, \ldots, A^N_{g,t}]$ given $x_t$ is calculated from a set of feature functions, where $f_w$ denotes the $w$-th feature function, along with the corresponding weight factor $\mu_w$, and $Z(x_t)$ is the normalization divisor, which is summed over all the possible values of the state vector $[A^1_{g,t}, A^2_{g,t}, \ldots, A^N_{g,t}]$ at time $t$. A feature function is an indicator which describes the contributions of each pair of adjacent states. The CRF model is fully characterized by the feature function set and the weight factors, i.e., $(f, \mu)$.
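For orientation, the textbook linear-chain CRF form that matches this description is given below. It is stated as an assumption, not as the paper's exact equations: the argument list of the feature functions $f_w$, the chain index $n$ over devices and the primed summation variable are illustrative choices.

```latex
P(A_{g,t} \mid x_t) = \frac{1}{Z(x_t)}
  \exp\!\left( \sum_{n=2}^{N} \sum_{w} \mu_w \,
  f_w\!\left(A^{n-1}_{g,t}, A^{n}_{g,t}, x_t, n\right) \right),
\qquad
Z(x_t) = \sum_{A'_{g,t}}
  \exp\!\left( \sum_{n=2}^{N} \sum_{w} \mu_w \,
  f_w\!\left(A'^{\,n-1}_{g,t}, A'^{\,n}_{g,t}, x_t, n\right) \right)
```

Here the sum in $Z(x_t)$ runs over every possible assignment of the $N$ binary anomaly states, which is what makes the expression a properly normalized probability.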
Since the CRF is also a supervised learning method, it needs to go through a training stage before being applied in practice. The training stage computes the model parameters $\mu$ according to the labeled sample data $(A_{g,t}, x_t)$. To obtain the model parameters $\mu$, we maximize the objective function. Because the log-likelihood objective is concave, gradient-based optimization is guaranteed to reach the global optimum. We adopt the well-known gradient-based quasi-Newton method, the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm, to tune the CRF parameters in HADIoT.
When the CRF model is trained, it can be applied to predict the states of the newly arrived input data $x_t$ at time $t$ via the Viterbi algorithm, which yields the prediction sequence $A^*_{g,t}$, i.e., the anomaly prediction vector of $x_t$ with the maximum likelihood. Similar to the training procedures of the local AD, when the likelihood of the input data is acquired, the probability threshold $\theta_g$ is automatically adjusted from 1 to 0 by the cloud server, and the threshold $\theta_g$ is fixed at the value for which the corresponding Precision is maximized. When the nature of the network data changes, the CRF model and $\theta_g$ need to be re-trained.
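The Viterbi decoding step can be sketched for a linear chain with two states per device (0 for normal, 1 for anomalous). This is a generic dynamic-programming sketch, not the authors' implementation: the function name is hypothetical, and the unary and pairwise scores are made-up numbers standing in for the weighted feature sums of a trained CRF.

```python
from typing import List, Tuple


def viterbi(unary: List[List[float]],
            pairwise: List[List[float]]) -> Tuple[List[int], float]:
    """Return the highest-scoring state sequence and its score.

    unary[n][s]    : score of state s at chain position n
    pairwise[p][s] : score of moving from state p to state s
    """
    n_states = len(pairwise)
    score = list(unary[0])
    back: List[List[int]] = []
    for n in range(1, len(unary)):
        new_score, pointers = [], []
        for cur in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + pairwise[p][cur])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + pairwise[best_prev][cur] + unary[n][cur])
        score = new_score
        back.append(pointers)
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):      # follow back-pointers to the start
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


# Hypothetical scores for a chain of three devices.
unary = [[2.0, 0.0], [0.0, 1.5], [1.0, 0.0]]
pairwise = [[0.5, 0.0], [0.0, 0.5]]
print(viterbi(unary, pairwise))
```

The cost grows linearly with the number of devices and quadratically with the number of states, instead of enumerating all joint assignments.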
Thus, the algorithm of the HADIoT can be presented as Algorithm 1.

VI. EXPERIMENTAL RESULTS
This section demonstrates the performance of the HADIoT framework compared with three benchmark schemes for anomaly detection. The first two stages of the proposed framework, data pre-processing and local AD, are simulated on a desktop with i7-6700u CPU @2.3GHz and 8GB RAM, while the global AD is simulated on a DELL server with Xeon(R) Silver 4116 @2.1GHz and 128GB RAM. The three stages of the HADIoT framework are all simulated via MATLAB 2016a.

A. EVALUATION METRICS
We evaluate the performance of the proposed framework in terms of the following metrics: True Positive Rate (TPR), False Positive Rate (FPR), Precision, Accuracy and F_score. The mathematical definitions of these metrics are as follows.
$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP},$$
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{F\_score} = \frac{2 \times \mathrm{Precision} \times \mathrm{TPR}}{\mathrm{Precision} + \mathrm{TPR}}$$
TP, TN, FP and FN refer to True Positive, True Negative, False Positive and False Negative, respectively. More specifically, TP indicates that normal data is identified as normal, and TN refers to the case when abnormal data is classified as abnormal. On the contrary, when data is incorrectly detected as normal or abnormal, it is described as FP and FN, respectively.
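The five metrics can be computed from the four confusion-matrix counts as sketched below. The function name and the counts in the example are hypothetical; the formulas are the standard definitions of these metrics.

```python
from typing import Dict


def detection_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute TPR, FPR, Precision, Accuracy and F_score from raw counts."""
    def ratio(num: float, den: float) -> float:
        return num / den if den else 0.0      # avoid division by zero

    tpr = ratio(tp, tp + fn)
    fpr = ratio(fp, fp + tn)
    precision = ratio(tp, tp + fp)
    accuracy = ratio(tp + tn, tp + tn + fp + fn)
    f_score = ratio(2 * precision * tpr, precision + tpr)
    return {"TPR": tpr, "FPR": fpr, "Precision": precision,
            "Accuracy": accuracy, "F_score": f_score}


# Hypothetical counts, for illustration only.
print(detection_metrics(tp=90, tn=80, fp=20, fn=10))
```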

B. DATASET
We evaluate the efficiency of the proposed framework HADIoT using the Information Security Center of Excellence (ISCX) 2012 dataset, which consists of almost 1.5 million network packets collected over seven days, with 20 features. The ISCX 2012 dataset is labelled and its total size is 84.45GB. Since the desktop acts as the edge server in the simulation, the communication overhead from the IoT devices to the edge server can be regarded as the data volume of the original ISCX 2012 dataset. We randomly select the training dataset and testing dataset with a size ratio of 7:3. Specifically, there are 1046500 and 468500 data records in the training set and testing set, respectively. Additionally, there are mainly three kinds of anomalies in the ISCX 2012 dataset, namely infiltration from inside, Denial of Service (DoS) and brute force attacks, which were generated on 13-15 and 17 June 2010. Meanwhile, the network traffic data generated on 11, 12 and 16 June 2010 is normal.

C. RESULTS
For the sake of clarity, the simulation results are illustrated in four parts corresponding to the three stages and the overall performance of the HADIoT framework. In the first part, we demonstrate the results of feature selection and symbol mapping of the ISCX 2012 dataset. Then, the performance of local AD via GRU is illustrated in comparison with three benchmark schemes, Long Short-Term Memory (LSTM), K-Nearest Neighbors (KNN) and Convolutional Neural Networks (CNN). Thirdly, the global AD via CRF simulated on a DELL server is presented. Lastly, the overall performance of the HADIoT is presented.
1) Stage 1: PCA is adopted for the feature selection from the dataset. To ascertain the optimal feature selection, the False Positive Rate (FPR) of the HADIoT corresponding to different numbers of selected features is shown in Fig. 3. It is evident that the FPR of the framework is minimized by selecting the first 8 of the 20 features ranked in decreasing order of eigenvalues. The 8 features with relatively larger eigenvalues are considered the optimal feature selection, which includes the data type, payload, source port, destination port, protocol type, source byte, destination byte, and direction. Since each data packet is reduced from 20 features to 8 features, the communication overhead from the desktop to the DELL server decreases by 60%. Therefore, the communication overhead from the edge to the cloud in the framework is reduced to 33.81GB from 84.45GB.
2) Stage 2: In the training section, the parameters of the GRU model are trained on the training set. By tuning the threshold probability $\theta_l$, we can determine the anomalies when the data likelihood is below the threshold. Specifically, when adjusting the probability threshold $\theta_l$ from 1 to 0, various pairs of TPR and FPR are acquired, depicted as the Receiver Operating Characteristic (ROC) curve in Fig. 4, and the optimal $\theta_l \in [0, 1]$ is the value at which the local Precision is maximized.
Then, the local AD is applied to the testing dataset. The data of each time slot is considered as the input of the GRU model, by which the probability of the input data is calculated. Further, the local anomaly is determined by the probability threshold. As can be seen from Fig. 5, the GRU method achieves the highest TPR and precision, compared with the benchmark schemes. Specifically, the GRU method yields an FPR as low as 3.86% while it achieves a TPR at 96.10%.
3) Stage 3: The global AD via CRF is simulated on a DELL server with Xeon Silver 4116 @2.1GHz and 128GB RAM. Similar to the training procedures of local AD, the CRF model is trained on the same training set to learn the model parameters. We compute the anomaly score of each data packet by setting the threshold $\theta_g$. Specifically, by adjusting the probability threshold $\theta_g$ from 1 to 0, the ROC curve which depicts the TPR and FPR corresponding to different $\theta_g$ is acquired, as shown in Fig. 6. The threshold $\theta_g$ is fixed at the value for which the Precision attains its peak value.
4) Overall Performance: Combining the local and global AD results as mentioned in Equation (3), the overall performance of the HADIoT is obtained. By jointly considering the time-series characteristics and the inter-correlations of various types of IoT data, the overall detection performance is largely enhanced, compared with GRU alone. This is mainly because the CRF can detect some anomalies that GRU cannot. For example, some data falls into the normal pattern of its device, but its correlation with other devices is obviously absurd. For instance, if the thermometer reads 0 °C under the standard atmospheric pressure and the humidity is 90%, it is very likely that there is an anomaly. As can be seen from Fig. 7, the comparison of the HADIoT, LSTM, CNN and KNN in terms of FPR, TPR, Precision, Accuracy and F_score demonstrates the effectiveness of our proposed method. The HADIoT achieves the highest TPR along with the lowest FPR among the four methods. Specifically, the HADIoT achieves a TPR of 98.12% and an FPR of 4.53%, much better than the other three methods. Because of the combination of GRU and CRF, the HADIoT outperforms the benchmark schemes to a large extent.
Due to the extra computational task introduced by the global AD, the total running time of the HADIoT increases by a reasonable amount, as depicted in Fig. 8, in comparison with LSTM, CNN and KNN. It can be seen that the processing time of the HADIoT is slightly shorter than that of LSTM, but longer than those of CNN and KNN.

VII. CONCLUSION
In this article, we presented HADIoT, a hierarchical anomaly detection framework for IoT devices, which provides a solution for anomaly detection using both local and global anomaly characteristics. We demonstrated the efficiency of HADIoT by testing on the Information Security Center of Excellence (ISCX) 2012 dataset, on which HADIoT achieved a True Positive Rate of 98.12% with a False Positive Rate of only 4.53%. Simulation results show that the proposed framework achieves high effectiveness and outperforms the benchmark schemes. In the future, we aim to develop an autonomous training approach for the AD framework, which trains the framework adaptively according to changes in data patterns.
JING FENG (Member, IEEE) was born in Nanjing, Jiangsu, China, in 1962. She received the Ph.D. degree from Southeast University, Nanjing, in 2000. She is currently a Professor with the National University of Defense Technology, China. Her research interest includes information system integration.
CHAOFAN DUAN lives in Nanjing, Jiangsu, China. He received the bachelor's degree from the Nanjing University of Science and Technology, Nanjing, in 2014, and the master's degree from PLAUST, in 2017. He is currently pursuing the Ph.D. degree with the National University of Defense Technology, China. His research interest includes modeling and analysis of the wireless sensor networks. VOLUME 8, 2020