Intelligent Anomaly Detection for Large Network Traffic With Optimized Deep Clustering (ODC) Algorithm

The availability of enormous amounts of unlabeled data drives anomaly detection research towards unsupervised machine learning algorithms, and deep clustering algorithms for anomaly detection are gaining significant research attention. We propose an intelligent anomaly detection method for extensive network traffic analysis based on an Optimized Deep Clustering (ODC) algorithm. First, ODC optimizes the deep AutoEncoder by tuning its hyperparameters, which reduces the AutoEncoder's reconstruction error rate. Second, ODC feeds the optimized deep AutoEncoder's latent view to the BIRCH clustering algorithm to detect known and unknown malicious network traffic without human intervention. Unlike other deep clustering algorithms, ODC does not require the number of clusters to be specified in advance. We evaluate ODC on the CoAP off-path dataset obtained from our testbed and on the MNIST dataset to compare its accuracy with state-of-the-art clustering algorithms. The evaluation results show that ODC outperforms the existing deep clustering methods for anomaly detection.


I. INTRODUCTION
The growth of network traffic is accompanied by a growth in malicious activity on the internet. IoT devices generate massive volumes of network traffic data, which creates significant challenges for detecting anomalies.
Anomaly detection in network traffic with machine learning is a rapidly growing research area [1]-[7]. Deep clustering techniques for anomaly detection typically combine variations of an AutoEncoder's latent representation with a k-means clustering algorithm. For example, Deep Embedding Clustering (DEC) [8], Improved Deep Embedding Clustering (IDEC) [9] and Deep Density-based Clustering (DDC) [10] use a dense deep AutoEncoder; Deep Convolutional Embedded Clustering (DCEC) [11] and Deep Density-based Clustering-Data Augmentation (DDC-DA) [10] use a convolutional AutoEncoder with k-means clustering; and the Gaussian mixture variational AutoEncoder (GMVAE) [12] uses a variational AutoEncoder with k-means clustering. Most of these deep clustering techniques use the k-means algorithm for the clustering step, which requires the number of clusters to be supplied manually. In a real-time situation, fixing the number of clusters when the model is trained on a new dataset may hinder the discovery of new and unknown anomalies. To overcome this major limitation of the existing works, we use BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) in our ODC deep clustering technique. BIRCH has the advantage of intelligent cluster assignment and anomaly detection without human intervention. Also, a deep AutoEncoder reduces the dimensionality of the dataset irrespective of whether the data is linear or non-linear. BIRCH has not received much attention from researchers working on deep clustering methods, although it is capable of intelligent clustering on vast datasets [13].
The associate editor coordinating the review of this manuscript and approving it for publication was Amir Masoud Rahmani.
Our contributions are summarized as follows:
• We optimize the deep AutoEncoder by tuning its hyper-parameters to achieve a reduced reconstruction error rate.
• We derive a novel unsupervised anomaly detection algorithm, "ODC", by combining the BIRCH clustering algorithm with the latent representation of the enhanced deep AutoEncoder.
• Unlike other deep clustering algorithms, ODC does not require the number of clusters to be specified in order to analyze the network traffic.
• ODC intelligently handles anomalies, including known and unknown attacks, in a huge dataset.
• We analyze how the branching factor and the threshold value of BIRCH influence the clustering accuracy and the normalized mutual information score.
We observed that our ODC clustering algorithm outperforms the existing deep clustering methods for anomaly detection. Moreover, because ODC embeds the BIRCH clustering algorithm and inherits its advantages, it is well suited to vast network traffic datasets where multiple scans of the data are not advisable. The combination of a deep AutoEncoder and the BIRCH clustering algorithm yields high clustering accuracy and normalized mutual information scores in the anomaly detection process. ODC also removes the need for domain experts to manually label large datasets and to explicitly specify the number of clusters needed for the dataset. Our proposed method differs from the state of the art [14]-[17] and [18] in that we associate BIRCH clustering with our enhanced deep AutoEncoder. To preserve the local structure of the data points, StructAE [19] learns a representation for each data point by minimizing the reconstruction error with respect to itself. ODC, in contrast, achieves a low reconstruction error rate by tuning hyperparameters such as the activation function and the regularization function. Hence, we show that ODC preserves the structure of the data points, leading to an intelligent clustering method for detecting anomalies.
The rest of the paper is organized as follows. Section II provides the background information needed to understand the ODC clustering algorithm. The working principles of a deep AutoEncoder and the BIRCH clustering algorithm are explained in Section II-A and Section II-B, respectively. Section III describes the state of the art of deep clustering algorithms. The proposed deep clustering method is explained in Section IV. Section V describes the evaluation process of the proposed deep clustering method. Finally, we discuss the possible extension of our research in Section VI.

II. BACKGROUND
Anomaly detection [20] is the process of identifying rare events or observations that raise suspicion because they differ statistically from the rest of the observations. Modern businesses are beginning to understand the importance of interconnected operations for getting a full picture of their business. Additionally, they need to respond promptly to fast-moving changes in data, particularly in the case of cybersecurity threats.
Unfortunately, there is no effective way to handle and analyze constantly growing datasets manually. With dynamic systems that have numerous components in perpetual motion, where the "normal" behavior is constantly redefined, a new proactive approach to identifying anomalous behavior is required [20].
Based on the dataset we use to train the machine learning model, anomaly detection varies in many real-world applications and academic research areas. With the emergence of sensor networks, processing data as it arrives has become a necessity [21]. Techniques have been proposed that can operate in an online fashion [22]; such techniques assign an anomaly score to a test instance as it arrives, but also incrementally update the model. Authors in [23] showcased the importance of anomaly detection in dynamic settings through a real-world application example, i.e., forest fire risk prediction. Also, they recommend redesigning the current models to be able to detect outlying patterns accurately and efficiently. More specifically, when there are many features, a set of anomalies emerge in only a subset of dimensions at a particular period. This set of anomalies may appear normal regarding a different subset of dimensions and periods.
Authors in [24] discussed the unavailability of financial data for fraud detection research and a methodology for synthetic data generation. They suggest that a universal technique in the domain of fraud detection is yet to be found because the notion of normality keeps evolving and labeled data is unavailable. According to [25], much of the research is performed on simulated data (37 out of the 65 surveyed papers); in-vehicle network data and vehicular ad hoc network (VANET) data are seldom considered together to safeguard connected vehicles (only 1 out of the 65 surveyed papers); and connected-vehicle safety research does not get the same amount of attention as cybersecurity research. The anomaly detection domain therefore has various promising research directions; in addition, many anomaly detection methods require a large amount of test data to detect anomalies [26]. The literature survey we conducted on anomaly detection motivates us to use machine learning models to determine the abnormal behavior of a legitimate user in a private network.
Anomaly detection can be performed using machine learning in the following ways:
Supervised Anomaly Detection: This approach requires a labeled dataset with normal and abnormal examples for building a predictive model. The most common supervised methods include supervised neural networks, support vector machines, k-nearest neighbors, Bayesian networks, and decision trees [27]. Supervised models are believed to give a better detection rate than unsupervised techniques because of their ability to encode interdependencies between variables, to incorporate both prior knowledge and data, and to return a confidence score with the model output [2].
Unsupervised Anomaly Detection: This approach does not require labeled training data. Unsupervised methods assume that the vast majority of network connections are normal traffic and only a small percentage is abnormal, and that malicious traffic is statistically different from normal traffic [28]. Based on these two assumptions, groups of frequent, similar instances are assumed to be normal, and infrequent data groups are categorized as anomalies. The most popular unsupervised algorithms include K-means, AutoEncoders, GMMs (Gaussian Mixture Models), and PCA (Principal Component Analysis) based analysis [29].
Deep learning is the subfield of machine learning that achieves strong performance by learning detailed features of datasets with the help of neural networks [30]. The existing deep clustering techniques for anomaly detection combine a deep learning algorithm with a clustering algorithm, usually k-means. Based on the observations from these background studies and the research gap identified in the related work of Section III, we propose ODC in Section IV for intelligent anomaly detection.

A. DEEP AUTOENCODER
An AutoEncoder with more than one hidden layer is called a deep AutoEncoder. Deep AutoEncoders learn more complex features of the dataset since they have more layers than a simple AutoEncoder. The deep AutoEncoder aims to reconstruct the input with minimum reconstruction error. The encoding part, the decoding part, and the latent representation (the compressed input) are the three essential parts of the deep AutoEncoder. The deep AutoEncoder is indispensable in network traffic analysis since it compresses a sizeable high-dimensional dataset into a low-dimensional one.
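To make the three parts concrete, the following is a minimal sketch of a forward pass through a small deep AutoEncoder and of its reconstruction error. It is not the authors' Keras model: the layer sizes, the random initialisation, the use of tanh as the activation, and all function names are illustrative assumptions, and no training is performed.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Small random weights W (n_out x n_in) and a zero bias vector b."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(layer, x):
    """One layer: sigma(W x + b), with tanh standing in for sigma."""
    weights, bias = layer
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def forward(layers, x):
    """Pass a vector through a stack of layers."""
    for layer in layers:
        x = dense(layer, x)
    return x

def reconstruction_error(inputs, outputs):
    """Mean squared error between inputs and their reconstructions."""
    return sum(sum((a - b) ** 2 for a, b in zip(x, y))
               for x, y in zip(inputs, outputs)) / len(inputs)

rng = random.Random(0)
sizes = [6, 4, 2, 4, 6]            # input, hidden, latent, hidden, output (assumed)
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
encoder, decoder = layers[:2], layers[2:]

batch = [[rng.uniform(-1.0, 1.0) for _ in range(sizes[0])] for _ in range(5)]
latent = [forward(encoder, x) for x in batch]   # compressed input
recon = [forward(decoder, h) for h in latent]   # reconstruction

print("latent dimension:", len(latent[0]))
print("reconstruction error (untrained):", reconstruction_error(batch, recon))
```

Training would adjust the weights to drive the printed error down; only the encoder output is needed afterwards for clustering.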
For a given training dataset [31] $X = \{x_1, x_2, \ldots, x_m\}$ with $m$ samples, where $x_i$ is a $d$-dimensional feature vector, the encoder maps the input vector $x_i$ to a hidden representation vector $h_i$ through a deterministic mapping $f_\theta$ as given in (1):

$$h_i = f_\theta(x_i) = \sigma(W x_i + b) \tag{1}$$

where $W$ is a $d' \times d$ matrix, $d'$ is the number of hidden units, $b$ is a bias vector, and $\theta = \{W, b\}$ is the mapping parameter set. $\sigma$ is a suitable activation function. The decoder maps the resulting hidden representation $h_i$ back to a reconstructed $d$-dimensional vector $y_i$ in input space as

$$y_i = g_{\hat{\theta}}(h_i) = \sigma(\hat{W} h_i + \hat{b}) \tag{2}$$

where $\hat{W}$ is a $d \times d'$ matrix, $\hat{b}$ is a bias vector and $\hat{\theta} = \{\hat{W}, \hat{b}\}$ [31]. The goal of training the AutoEncoder is to minimize the difference between input and output. Therefore, a loss function is calculated by the following equation:

$$L(X, Y) = \frac{1}{m} \sum_{i=1}^{m} \lVert x_i - y_i \rVert^2 \tag{3}$$

where $m$ is the total number of training samples. The main objective is to find the optimal parameters ($\theta$ and $\hat{\theta}$) which minimize the difference between the input and the reconstructed output over the whole training set:

$$\theta, \hat{\theta} = \arg\min_{\theta, \hat{\theta}} L(X, Y) \tag{4}$$

B. BIRCH CLUSTERING ALGORITHM
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) was created in 1996 by Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH is best suited for large or streaming datasets because it can find a good clustering solution with a single scan of the data. Optionally, the algorithm can scan the data further to improve clustering quality. BIRCH outperforms existing clustering methods such as K-means and DBSCAN [13] for handling large datasets. According to [13], BIRCH builds a multiway search tree whose structure resembles a B+ tree. There are three kinds of nodes in a cluster-feature (CF) tree: Leaf, NonLeaf, and MinCluster. Three parameters govern the model. The first parameter is B (branching factor), the maximum number of child nodes that a non-leaf node can hold. The second parameter is L, the maximum number of entries that a leaf node can hold. The third parameter is T (threshold), the maximum radius of a cluster. A CF is a triple of three summary values for a single cluster:
• Count (N): The number of data values in the cluster.
• Linear Sum ($\vec{LS}$): The sum of the individual coordinates of the data points. This is a measure of the location of the cluster.
• Squared Sum (SS): The sum of the squared coordinates of the data points. This is a measure of the spread of the cluster.

$$CF = (N, \vec{LS}, SS)$$
VOLUME 9, 2021
BIRCH has two phases:
• Phase 1: Building the CF tree. The network traffic data is loaded into memory by building a cluster-feature (CF) tree. The initial CF tree is compressed in this phase only when this option is chosen at training time.
• Phase 2: Global clustering. The clusters obtained from Phase 1 are optionally refined by applying an existing clustering algorithm to the leaves of the CF tree.
By the CF Additivity Theorem [13], the CF value of a parent node is the sum of the CF values of its child nodes: $CF_1 + CF_2 = (N_1 + N_2, \vec{LS}_1 + \vec{LS}_2, SS_1 + SS_2)$.
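The additivity property can be illustrated with a small sketch. The `CF` class below and its method names are hypothetical helpers written for this example, not part of any BIRCH library; the sketch only shows that a parent summary can be obtained by adding child summaries, without revisiting the raw data points.

```python
from dataclasses import dataclass
from typing import List, Sequence

@dataclass
class CF:
    """Cluster feature: count N, linear sum LS, squared sum SS."""
    n: int
    ls: List[float]
    ss: float

    @classmethod
    def from_point(cls, point: Sequence[float]) -> "CF":
        # A single point is a cluster of size one.
        return cls(1, list(point), sum(v * v for v in point))

    def __add__(self, other: "CF") -> "CF":
        # Additivity: merge two summaries component by component.
        return CF(self.n + other.n,
                  [a + b for a, b in zip(self.ls, other.ls)],
                  self.ss + other.ss)

points = [(1.0, 2.0), (3.0, 4.0), (5.0, 0.0)]
leaf_a = CF.from_point(points[0]) + CF.from_point(points[1])
leaf_b = CF.from_point(points[2])
parent = leaf_a + leaf_b          # CF of the parent node
print(parent)
```

Because each node stores only this triple, the tree can summarize a large traffic capture in a single pass with bounded memory.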

III. RELATED WORK
In DEC [8], a dense AutoEncoder is first trained by minimizing the reconstruction error. Then, as a clustering optimization stage, the method iterates between computing an auxiliary target distribution from the AutoEncoder representation and minimizing the Kullback-Leibler divergence to it. In IDEC [9], it is argued that the clustering loss of DEC corrupts the feature space; IDEC therefore jointly optimizes the clustering loss and the reconstruction loss of the AutoEncoder. Deep clustering in [32] shows that an l2 normalization of the AutoEncoder's latent representation makes the latent space more separable and compact in Euclidean space. This significantly improves the clustering accuracy when k-means clustering is applied to the latent representation. The DDC [10] clustering technique reduces the dimension of the dataset with the help of a deep convolutional AutoEncoder and the t-SNE algorithm. DDC then applies density-based clustering to the output of t-SNE (2-dimensional embedded data) without requiring the number of clusters in advance. The deep clustering algorithms in [33], [34] and DDC use t-SNE for further dimensionality reduction of the input data. The issue with t-SNE is that it preserves neither the distances nor the density between the data points. Also, the compressed data cannot be guaranteed to recreate the original input, since there are no hyper-parameters to reduce the reconstruction error between the input data and the recreated data.
Recent works on convolutional AutoEncoder clustering such as [35]-[39] are most applicable to clustering image datasets, not to analysing network traffic datasets. DCEC [11] adopts a convolutional AutoEncoder and shows that it improves on the clustering accuracy of DEC and IDEC. Anomalies in credit card transactions are handled in [15] with an AutoEncoder and the k-means clustering algorithm on the European bank transaction dataset. However, this work and the other works discussed in this section share the problem of having to predict the number of clusters after pretraining the AutoEncoder.
Our proposed algorithm ODC optimizes the pre-training process of deep AutoEncoder to reduce the reconstruction error. Furthermore, it uses BIRCH clustering to overcome the limitations of the existing deep clustering algorithms.

IV. PROPOSED DEEP CLUSTERING METHOD
ODC groups the network traffic data based on the Euclidean distance between the nodes so that we get more and more dynamic clusters as the network traffic passes on to the ODC model.

A. ENHANCED DEEP AUTOENCODER
The enhanced AutoEncoder model is constructed using a suitable combination of activation function, regularizer, and optimizer to reduce the reconstruction error. Our enhanced AutoEncoder treats every input as an independent value, thereby reducing over-fitting to the training data. The ODC training phase involves unsupervised learning and fine-tuning of the model parameters to enhance the efficiency of the model.
We used the ELU (Exponential Linear Unit) [40] activation function for all layers and the Adamax optimizer for the enhanced AutoEncoder model.
The ELU is defined as

$$f(x) = \begin{cases} x, & x > 0 \\ \alpha (e^{x} - 1), & x \le 0 \end{cases}$$

where $\alpha$ is a hyper-parameter of ELU. ELUs take negative values, which push the mean of the activations closer to zero. Mean activations close to zero allow faster learning because they bring the gradient closer to the natural gradient. For negative net inputs, ELUs saturate at a negative value. Besides, the codes learned for different concepts are less likely to interfere with one another [40]. Positively activated ELUs interact by activating the next layer of units. Thus the ELU activation function is well suited to deep network models where the vanishing gradient interferes with learning. The dropout regularizer randomly drops nodes during training, which makes each node less dependent on the others. Adopting a dropout regularizer therefore reduces the co-adaptation of features in the network.
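As a small numerical illustration of the behaviour described above, the sketch below implements the standard ELU and its derivative in plain Python. The function names and the default alpha = 1.0 are assumptions made for the example; the paper's own model uses the activation provided by Keras.

```python
import math

def elu(x, alpha=1.0):
    """Exponential Linear Unit: identity for x > 0, alpha*(exp(x)-1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def elu_grad(x, alpha=1.0):
    """Derivative of ELU: 1 for x > 0, alpha*exp(x) otherwise."""
    return 1.0 if x > 0 else alpha * math.exp(x)

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:+.1f}  elu={elu(x):+.4f}  grad={elu_grad(x):.4f}")
```

The negative branch saturates towards -alpha, while the gradient stays strictly positive for every finite input, which is the property that helps against vanishing gradients.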
Our enhanced AutoEncoder has five hidden layers, $h_i \in \{h_0, h_1, h_2, h_3, h_4\}$. Here, the latent space can be represented as

$$h_2 = f_\theta(x_2) = \sigma(W x_2 + b)$$

According to [41], the dropout function is defined as

$$r_j^{(l)} \sim \mathrm{Bernoulli}(p)$$

$$\tilde{y}^{(l)} = r^{(l)} * y^{(l)} \tag{11}$$

In equation 11, $*$ denotes an element-wise product.
For any layer $l$, $r^{(l)}$ is a vector of independent Bernoulli random variables, each of which has probability $p$ of being 1. This vector is sampled and multiplied element-wise with the outputs of that layer, $y^{(l)}$, to create the thinned outputs $\tilde{y}^{(l)}$. These thinned outputs are then used as input to the next layer. This procedure is applied at each layer. If we apply dropout to the hidden layer with a probability value of $p$, the feed-forward equations are modified as follows (at training time):

$$z_i^{(l+1)} = w_i^{(l+1)} \tilde{y}^{(l)} + b_i^{(l+1)}, \qquad y_i^{(l+1)} = f\left(z_i^{(l+1)}\right)$$

A loss function is calculated by the following equation:

$$L(X, Y) = \frac{1}{m} \sum_{i=1}^{m} \lVert x_i - y_i \rVert^2$$

where $m$ is the total number of training samples. The column named "Train RE" in Table 1 refers to the reconstruction error rate during training, and "Test RE" refers to the reconstruction error rate during testing, of our enhanced deep AutoEncoder. The values in Table 1 show how our optimized deep AutoEncoder reduces the reconstruction error rate at both training and testing time.
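The thinning step can be sketched in a few lines. This is an illustrative, assumption-laden example rather than the authors' code: the function name `dropout` and the sample values are hypothetical, and `p_keep` is the probability that a mask entry equals 1, as in [41].

```python
import random

def dropout(y, p_keep, rng):
    """Sample a Bernoulli(p_keep) mask r and return (r * y, r)."""
    r = [1 if rng.random() < p_keep else 0 for _ in y]
    thinned = [ri * yi for ri, yi in zip(r, y)]
    return thinned, r

rng = random.Random(42)
layer_output = [0.7, -0.2, 1.5, 0.3, -0.9, 0.1]
thinned, mask = dropout(layer_output, p_keep=0.8, rng=rng)
print("mask   :", mask)
print("thinned:", thinned)
```

At test time no units are dropped; how the activations are rescaled to compensate differs between implementations and is not shown here.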

B. OPTIMIZED DEEP CLUSTERING WITH BIRCH
The compressed representations of the data points ($h_2 = f_\theta(x_2) = \sigma(W x_2 + b)$), obtained from the enhanced deep AutoEncoder as explained in Section IV-A, are fed into the BIRCH clustering algorithm. Each new data point is added to the CF tree by calculating the radius of the cluster. The radius of the cluster (R) is calculated as

$$R = \sqrt{\frac{\sum_{i=1}^{N} \lVert \vec{x}_i - \vec{C} \rVert^2}{N}}$$

where $\vec{x}_i$ are the $N$ data points of the cluster and $\vec{C}$ is its centroid. The calculated R value decides where to place the new data point. If R < T, the new data point is pushed to the same leaf node. If R > T, the new data point forms a new leaf node. In this way the CF tree is built for all the data points in our training and testing data. Dividing the sum of the data points by the number of data points gives the centroid of the cluster. The centroid ($\vec{C}$) of the cluster is calculated as

$$\vec{C} = \frac{\vec{LS}}{N}$$

From the centroids, we can then calculate the distance between two clusters $CF_i$ and $CF_j$, for example as the Euclidean distance $\lVert \vec{C}_i - \vec{C}_j \rVert$.
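The radius test can be evaluated from the CF triple alone, since the mean squared distance to the centroid equals SS/N minus the squared norm of LS/N. The sketch below illustrates this under stated assumptions: a cluster is a plain `(n, ls, ss)` tuple, the helper names are hypothetical, and the boundary case R = T, which the text leaves open, is treated as "absorb".

```python
import math

def centroid(n, ls):
    """Centroid C = LS / N."""
    return [v / n for v in ls]

def radius(n, ls, ss):
    """R = sqrt(SS/N - ||LS/N||^2), clamped at 0 against rounding noise."""
    c = centroid(n, ls)
    return math.sqrt(max(ss / n - sum(v * v for v in c), 0.0))

def try_absorb(cf, point, threshold):
    """Tentatively add the point; keep it only if the new radius fits T."""
    n, ls, ss = cf
    candidate = (n + 1,
                 [a + b for a, b in zip(ls, point)],
                 ss + sum(v * v for v in point))
    if radius(*candidate) <= threshold:
        return candidate, True
    return cf, False          # caller would start a new leaf entry instead

cluster = (2, [2.0, 0.0], 4.0)          # summarizes points (0,0) and (2,0)
print(radius(*cluster))                  # 1.0
print(try_absorb(cluster, (1.0, 0.0), threshold=1.5)[1])    # True
print(try_absorb(cluster, (10.0, 0.0), threshold=1.5)[1])   # False
```

A full BIRCH implementation would additionally descend the tree to the closest leaf entry and split nodes that exceed the branching factor; those steps are omitted here.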

C. ODC OUTLIER HANDLING
We can set aside a fixed amount of disk/memory space for handling anomalies. Anomalies are leaf nodes of low density that are judged to be unimportant with respect to the overall clustering pattern. When we rebuild the CF tree by reinserting the old leaf nodes, the size of the new CF tree is reduced in two ways [13]. First, we increase the threshold value (T), thereby allowing each leaf node to absorb more points. Second, we treat some leaf nodes as potential anomalies and write them out to disk. An old leaf node is considered a potential anomaly if it has far fewer data points than the average. An increase in the T value, or a change in the distribution caused by new data, may well mean that a potential anomaly no longer qualifies as an anomaly.
A data point whose Euclidean distance to the closest seed is larger than twice the radius of that cluster is treated as an anomaly [13]. The potential anomalies are therefore re-examined to check whether they can be re-absorbed into the tree without making the tree grow in size. In Algorithm 1, steps 1 to 4 explain how the compressed form of the input dataset is produced with the help of the optimized deep AutoEncoder, and steps 5 to 22 describe the outlier handling process of the BIRCH [13] clustering algorithm. As a result, ODC handles outliers in the network traffic data better than the existing deep clustering combinations. The evaluation of the resulting ODC clusters is discussed in Section V.
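The "twice the radius" rule stated above can be sketched as a simple check. The function name `is_anomaly` and the representation of seeds as `(centroid, radius)` pairs are assumptions made for this illustration; they are not taken from the authors' implementation.

```python
import math

def is_anomaly(point, seeds):
    """Flag a point whose distance to its closest seed exceeds 2 x that seed's radius.

    seeds: list of (centroid, radius) pairs describing the current clusters.
    """
    nearest_centroid, nearest_radius = min(
        seeds, key=lambda seed: math.dist(point, seed[0]))
    return math.dist(point, nearest_centroid) > 2.0 * nearest_radius

seeds = [((0.0, 0.0), 1.0), ((10.0, 10.0), 2.0)]
for p in [(0.5, 0.5), (4.0, 4.0), (10.0, 13.0)]:
    print(p, "anomaly" if is_anomaly(p, seeds) else "normal")
```

In the rebuilt tree, points flagged this way would be held aside and re-tested later, since a larger threshold or new data may let them be absorbed.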

V. EXPERIMENTAL EVALUATION
The enhanced deep AutoEncoder is implemented in Python using Keras [42]. Experiments on our datasets are conducted on a regular laptop with an Intel Core i7 processor. To evaluate ODC, we use the CoAP off-path dataset [5] to find anomalies in IoT network traffic and the standard, publicly available MNIST [43] image dataset to compare the accuracy of ODC with existing works. We use the testbed from [5] to collect more instances of IoT traffic with a CoAP off-path attack and feed the proposed algorithm with 10,000 unlabeled instances of IoT-CoAP traffic. The CoAP off-path dataset is available on request to anyone who wants to repeat the experiment for their research. To the best of our knowledge, our work is the first to combine a deep AutoEncoder with the BIRCH clustering algorithm for anomaly detection in IoT network traffic datasets. The MNIST dataset has 70,000 digit images of 28 × 28 pixels. We use the code publicly released by the respective DEC and IDEC authors to run the corresponding algorithms on our dataset.
The encoder of our ODC contains two hidden layers and an input layer for both datasets (MNIST and CoAP off-path), as in Figure 1. The decoder contains two hidden layers and an output layer for both datasets. The encoder dimensions are set to d-1626-756-50, where d is the input data dimension. The decoder dimensions are the reverse of the encoder, that is, 50-756-1626-d.

Algorithm 1 (excerpt, steps 9 to 22)
  9  Rebuild CF tree t2 of new T from CF tree t1
 10  if leaf data point of t1 is an outlier and disk space available then
 11      Write that data point as outlier
 12  else
 13      use the data point to rebuild t2
 14  if t1 <= t2 then
 15      if Disk has space then
 16          Go to step 5 and repeat the process for the rest of the data points
 17      else
 18          Re-absorb potential outliers into t1
 19          Go to step 5 and repeat the process for the rest of the data points
 20  else
 21      Re-absorb potential outliers into t1
 22      Go to step 5 and repeat the process for the rest of the data points

The graphs in Figure. [...] reconstruction error rate, the latter combination (ELU, dropout) produces a consistently low reconstruction error rate for different iterations and different datasets. During training, the decoder is used to reduce the reconstruction error rate. Once the model is optimized to a low reconstruction error, we combine the BIRCH clustering technique with the encoder's latent representation.
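As a quick check of the layer layout given above, the sketch below lists the encoder and decoder widths for a given input dimension and counts the parameters of the corresponding fully connected layers. The helper names are hypothetical, and the parameter count assumes plain dense layers with one bias per unit, which the text does not state explicitly.

```python
def odc_layer_sizes(input_dim, hidden=(1626, 756, 50)):
    """Encoder widths d-1626-756-50 and the mirrored decoder widths."""
    encoder = [input_dim, *hidden]
    decoder = encoder[::-1]
    return encoder, decoder

def dense_parameter_count(sizes):
    """Weights plus biases for a chain of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

enc, dec = odc_layer_sizes(28 * 28)      # MNIST: 28 x 28 pixels flattened
print("encoder:", enc)
print("decoder:", dec)
print("dense parameters:", dense_parameter_count(enc) + dense_parameter_count(dec))
```

Only the encoder half is kept after training, so the clustering stage works on 50-dimensional vectors regardless of the input dimension d.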
The clustering accuracy (ACC) depends on the branching factor (B) and the threshold value (T). When training the clustering algorithm, we choose the values of B and T through several iterations. We start with B = 15 and T = 1.5 to obtain good clustering accuracy and NMI. The branching factor and the threshold value influence the ACC and NMI on the CoAP off-path dataset. We note that when the threshold value and the branching factor decrease, we obtain good ACC and NMI values, as shown in the graphs of Figure 8 and Figure 9. Hence, within the range we tested, the ACC and NMI values of a dataset are inversely related to the B and T values.
Table 2 shows that our proposed algorithm ODC has higher clustering accuracy than the state-of-the-art deep clustering methods. The method listed in Table 2 as AE (AutoEncoder) with k-means performs k-means clustering on the latent representation of the trained AutoEncoder. We use the same AutoEncoder parameters as ours (ODC) to evaluate the AE+K-means deep clustering method. We use two standard unsupervised evaluation metrics for evaluation and comparison with the benchmark methods: clustering accuracy (ACC) and Normalized Mutual Information (NMI). ACC is defined as

$$ACC = \max_{m} \frac{\sum_{i=1}^{n} \mathbb{1}\{l_i = m(c_i)\}}{n} \tag{14}$$

and NMI is defined as

$$NMI = \frac{I(l; c)}{\max\{H(l), H(c)\}} \tag{15}$$

where $\mathbb{1}\{\cdot\}$ is the indicator function, $l_i$ is the ground-truth label, $c_i$ is the cluster assignment of the $i$-th sample predicted by the algorithm, and $m$ ranges over all possible one-to-one mappings between predicted clusters and labels. Here $l = \{l_i\}_{i=1}^{n}$ and $c = \{c_i\}_{i=1}^{n}$, and $n$ is the number of samples. $I(l; c)$ denotes the mutual information between $l$ and $c$, and $H(\cdot)$ denotes their entropy. Both ACC and NMI lie in [0, 1], and higher scores imply more accurate clustering results.
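The two metrics can be computed directly from their definitions, as in the sketch below. This is an illustrative implementation, not the evaluation code used for Table 2: the function names are hypothetical, the best one-to-one mapping is found by brute force over permutations (practical only for a handful of clusters; the Hungarian algorithm is the usual choice for larger problems), and returning 1.0 when both entropies are zero is an assumed convention.

```python
import math
from collections import Counter
from itertools import permutations

def clustering_accuracy(labels, clusters):
    """ACC: best one-to-one cluster-to-label mapping, found by brute force."""
    label_ids = sorted(set(labels))
    cluster_ids = sorted(set(clusters))
    k = max(len(label_ids), len(cluster_ids))
    label_ids += [None] * (k - len(label_ids))       # pad unmatched slots
    cluster_ids += [None] * (k - len(cluster_ids))
    pairs = Counter(zip(clusters, labels))
    best = max(sum(pairs[(c, l)] for c, l in zip(cluster_ids, perm))
               for perm in permutations(label_ids))
    return best / len(labels)

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

def mutual_information(a, b):
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((nab / n) * math.log(nab * n / (ca[x] * cb[y]))
               for (x, y), nab in cab.items())

def nmi(labels, clusters):
    """NMI normalized by max(H(l), H(c)), as in the definition above."""
    denom = max(entropy(labels), entropy(clusters))
    return mutual_information(labels, clusters) / denom if denom > 0 else 1.0

truth = [0, 0, 1, 1, 2, 2]
pred = [0, 0, 0, 1, 2, 2]
print("ACC:", clustering_accuracy(truth, pred))
print("NMI:", nmi(truth, pred))
```

Cluster identifiers are arbitrary, so a prediction that merely renames the clusters still scores 1.0 on both metrics.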

VI. CONCLUSION
We proposed ODC, an intelligent anomaly detection algorithm for extensive network traffic analysis. IoT environments produce a massive amount of data, and a mechanism is needed to detect anomalies within such vast datasets. ODC optimizes the deep AutoEncoder to train the encoder. The latent version of the network traffic instances is fed into the BIRCH clustering algorithm for anomaly detection without human intervention. We demonstrated that ODC intelligently detects anomalies in vast datasets. We analyzed the influence of the B and T values on the ACC and NMI values of an input dataset. The performance of the ODC deep clustering algorithm is evaluated through our implementation, and the results presented in Table 2 show that our proposed scheme performs better than the existing schemes.
Future directions of our work include experimenting with distance metrics other than Euclidean distance in ODC for anomaly detection. Our ODC anomaly detection method can be improved further by automating the whole anomaly detection pipeline. This involves generating an alert message and sending it to the system or network administrator without delay. Also, the source of the suspected network traffic could be identified and terminated or suspended from regular network communication within a fraction of a second.