Real-Time Malicious Traffic Detection With Online Isolation Forest Over SD-WAN

Software Defined Network (SDN) has been widely used in modern network architecture. The SD-WAN is considered as a technology that has a potential to revolutionize the WAN service usage by utilizing the SDN philosophy. Attacks within SD-WAN can affect the network and block the entire services. In this paper, we propose a machine learning based anomalous traffic detection framework named OADSD over SD-WAN that can achieve task independently and has the ability of adapting to the environment. The OADSD adopts Distributed Dynamic Feature Extraction (DDFE) to extract representative features directly from the raw traffic, and proposes the On-demand Evolving Isolation Forest (OEIF) to make the system adapt to an environment. We provide a theoretical analysis of the performance of the OADSD. We also conduct comprehensive experiments to evaluate the performance of the OADSD with real world public datasets as well as a small real testbed. Our experiments under real world public datasets show that, the OADSD can accurately detect various kinds of attacks with a high performance. Compared with the state-of-the-art systems, the OADSD can achieve up to 60% accuracy improvement.

excellent performance in many service providers' global WAN (e.g., Google [12], [17], [18], Microsoft [16], [26]), the threat activities (e.g., DDoS attacks) can occur and they can severely disrupt the user experience and reduce revenues finally. Therefore, real-time network malicious traffic detection is becoming an essential step, which can allow network administrators monitor the healthy status of the infrastructure.
Network malicious traffic detection technology is now widely used in SD-WAN scenarios [16], [18], [47] to provide network security. When anomaly events are detected, alerts are generated and sent to the network administrators to prevent further destruction from attackers.
Machine learning technologies are now becoming a new powerful tool in the network anomaly detection field. According to model complexity, we can divide the machine learning based anomaly detection methods into two categories, i.e., traditional machine learning methods, and the deep learning methods [43]. Traditional machine learning technologies based anomaly detection methods can be further divided into two types, i.e., supervised learning and unsupervised learning, where the supervised learning methods always train the detection models with labeled traffic and the unsupervised methods will separate the unlabeled traffic into clusters. Deep learning methods always train a complex neural network to improve the accuracy. Although machine learning techniques have already shown benefits in network anomaly detection, they still have following limitations over the SD-WAN scenarios: Firstly, classical supervised machine learning based schemes (e.g., SVM [8], Decision Tree [39]) have a high accuracy but they need a large number of labeled traffic from experts for training. However, the internet traffic is now in an exploding growth stage, and labeling traffic is always a timeconsuming, error-prone, and even impossible work for most network administrators. In comparison, classical unsupervised machine learning algorithms (e.g., Kmeans [20]) are easy to deploy but have a low accuracy.
Secondly, most of existing mainstream schemes (e.g., [11], [24], [48]) rely on expertly designed features to detect the anomalies. To efficiently detect the malicious traffic in SD-WAN, most methods need to extract specific features (e.g., flow size) from the raw traffic, which renders it less adaptable to the changing environment. Indeed, these methods are only effective for the unchanging environment but they might have low detection accuracy for the unknown attacks in a varying environment. Also, having too many features in a feature set requires a high volume of memory and computation, and a This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ long training time, which could affect the model accuracy at the end. In reality, the emerging SD-WAN provides a builtin inherent programmatic framework, where the edge routers always have a strong programming ability [18], [22], [40], [47]. This capability can be explored a scheme for dynamic feature extraction adaptable to the environment. Thirdly, a model is always trained to gain a good performance over the training data. However, the network environment could change to a certain extent such that its operational data could be significantly different from the characteristics of the training data, leading to severe intrusion detection accuracy degradation. Therefore, the machine learning models should be smart enough to evolve with the network environment.
In addressing these challenges, we make the following contributions in this paper: Firstly, we perform a deep analysis of malicious traffic detection algorithms and show the drawbacks of state-of-theart machine learning based methods (See § II).
Secondly, we design and implement the OADSD, a system that aims for online malicious traffic detection for SD-WAN. The OADSD contains two parts, i.e., Distributed Dynamic Feature Extraction (DDFE), and On-demand Evolving Isolation Forest (OEIF), where DDFE is deployed over each edge device (e.g., VNF, router) and OEIF works at the controller and edge device. Instead of relying on domain expert extracted features, the edge devices in the DDFE take the raw network traffic as the input and adopt an autoencoder-based method to generate the optimal representation of the raw traffic. The OEIF constructs an Online Isolation Forest model to detect the malicious traffic in the on-demand update manner, where the parameter update is triggered when the observed false alarm rate is high. We analyze the theoretic performance of the proposed OADSD and prove that the performance gap between the proposed OADSD and the best static offline method tends to zero as iterations goes to infinity (See § III).
Thirdly, we conduct extensive experiments with real-world public datasets as well as a small testbed (see § IV). Our results show that our proposed OADSD can detect different types of attacks with AUC ranging between 0.85 and 0.99. Compared with domain expert features, our DDFE can extract more representative features, and could greatly help to improve the intrusion detection performance. We show that the OADSD can achieve a good performance when environment is not consistent with the training environment.
The rest of the paper is organized as follows: Section II shows the background and motivation. Section III introduces the design details of the framework. Section IV evaluates the performance of the framework. Section V introduces the related work. The last section summarizes the paper.

II. BACKGROUND AND MOTIVATION
In this section, we will introduce the background about SD-WAN and show the limitations of state-of-the-art machine learning based anomaly detection methods.
A. Background 1) SD-WAN: SD-WAN is an overlay virtual solution for existing WAN that enables enterprises to securely connect their branches using public internet resources. It incorporates the key principles of SDN into the wide area network, which can provide enhanced quality of service over unreliable links in a lower-cost way. Its primary focus is to connect an organization's different branches located in various geographic locations through the internet. As shown in Fig. 1, the SD-WAN can be used by companies to establish interconnections between their headquarters and external sites. The edge devices located at different locations are purchased and managed in a unified manner. After the edge devices are authenticated, a controller will establish an encrypted channel with the edge devices to transmit control messages, and a dedicated communication tunnel is established between the edge devices to transmit data.
2) Network Anomaly Detection Framework: Traditionally, a network anomaly detection framework usually contains three parts, i.e., traffic pre-processing, feature extraction and anomaly detection. The traffic pre-processing part will gather traffic from the network middleboxes. Then it will clean up the irrelevant network traffic (e.g., ICMP packets) which are not applicable and can affect the performance of detection. Sometimes the collected traffic might be imbalanced. Data normalization or augmentation technologies [45] can be used to improve the quality of data. The feature extraction part will derive the feature vectors from the captured traffic data. Traditionally, there are two types of features, i.e., packetlevel features and flow level features [43]. The packet-level features (e.g., payload size, packet size, and payload ratio) are often used to find application anomalies, but they are not effective in providing a clear distinction in values between malicious and legitimate traffic flows, which might affect the detection accuracy. The flow-level features (e.g., mean packet length) will focus on multiple packets sequence rather than the contents of packets. Compared with the packet-level features, the flow-level features are more accurate but they often fail to detect the application anomalies. Finally, the anomaly detection part will perform detection algorithms to check whether the traffic contains suspicious behaviors.

B. Motivation
As a promising security method, machine learning based malicious traffic detection algorithms have been proposed as complements of the traditional fixed rule based methods [36]. Table I shows a the comparison between the machine learning based anomaly detection methods and the rule based anomaly detection methods. Although machine learning based methods can better detect zero-day attacks and have a higher detection accuracy, they still have the following two limitations: Firstly, most of them fully rely on expertly designed specific features, which is costly and non-adaptable. As shown in Table I, machine learning based anomaly detection methods could be further divided into subtypes, i.e., domain expert feature based methods and automatic feature extraction based methods [6], [10], [28], [37], where the domain expert features based methods rely on pre-defined expert-designed features and they vary across attacks. In reality, more features do not necessarily imply better results. Indeed, too many features require a high volume of memory and computation and can extend the model training time which may affect the model accuracy at the end. The F1-score of different algorithms (e.g., Isolation Forest (IF) [49], One-class SVM (OCSVM) [9], Varational Autoencoder (VAE) [13] and Naive Bayes (NB) [21]) over CICIDS dataset [37] demonstrated in Fig. 2 is also consistent with this observation, where the top 17 features selected by Person score perform about 20% better than the all features provided by the CICIDS2017 dataset.
Secondly, most of them only perform well over training dataset but fail to perform well over varying testing datasets. Most machine learning algorithms always train their best parameters over the training dataset. However, the characteristics of the training dataset cannot always be the same as the testing dataset. The performance indicators over UNSW dataset [29] shown in Fig. 3 also prove this. For example, the jitter feature between the training dataset and test dataset could even be as large as 30%. The different characteristics of the training dataset and testing dataset also make the algorithms only perform well on the training dataset. We further take the Isolation Forest [49] as an example to demonstrate the performance gap between training and testing set provided by UNSW dataset. As Fig. 4 shows, the performance gap between  the training dataset and the testing dataset can even be up to 10%, which demonstrates that using a static training model built over the training dataset would be problematic over a changing environment.

III. OADSD FRAMEWORK
In this section, we will present the OADSD, a system that aims for Online Anomaly Detection over SD-WAN. The architecture shown in Fig. 5 illustrates that OADSD contains two parts, i.e., Distributed Dynamic Feature Extraction (DDFE) and On-demand Evolving Isolation Forest (OEIF), where DDFE is deployed over each edge device (e.g., VNF, router) and OEIF works at the controller and edge device. The DDFE proposes to extract features by using a machine learning model deployed over the edge devices. The OEIF is able to update the model according to the feedback from network administrators to improve the detection accuracy when the false alarm is too high. In the following section, we will introduce the procedure as well as the analysis in details. Table II shows the main notations used in this paper.

A. Distributed Dynamic Feature Extraction
For the traffic passing through the edge devices, we need to extract the features for further detection. Different from  the traditional feature extraction methods which advocate to use the expert-designed features, OADSD proposes a dynamic feature extraction. As shown in Fig. 5, the SD-WAN edge devices will perform an autoencoder based feature extraction method for the captured raw traffic to derive the feature vectors for every µ time interval. Overall, the feature extraction procedure contains the following three main steps. 1) Step 1: Traffic Collection and Traffic Cleaning: Each edge device has a continuously running monitor process (e.g., tcpdump) to capture the raw network traffic going through the edge devices continuously. At the end of each µ time (e.g., 10s), each edge device will run a packet filter process to clean the irrelevant packets (e.g. ARP packet) based on the configuration of the controller, where µ is fixed and configured by the controller according to the ratio of edge resource (e.g., disk volume) and line speed. After the traffic cleaning step, packet number as well as the detection overheads are significantly reduced. In reality, the attacks can happen over many µ time intervals. We will detect these flows each µ time interval and regard them as the malicious ones even if only one time slot contains anomaly. 2) Step 2: Flow Vector Generation: Next, the OADSD will aggregate the packets with the same identifier (i.e., source IP address, destination IP address, source port, destination port and transport layer protocol) to form a flow. In reality, each flow has a diverse packet number, and this makes our feature extraction challenging. For simplicity, OADSD will first convert the flows with different length into the same length. OADSD will fills the short flows with zero elements to a fixed length size (e.g., L). For each packet, we adopt M features (e.g., packet size, payload size, packet head, payload) to present it. Therefore, the flow vector at the i-th time slot v t can be denoted as {v 11 where v i j is the i-th feature of packet i. When generating flow vectors, we first distinguish between short flows and long flows by checking the identifiers, i.e., if a flow's identifier does not appear in the previous time slot, we will identify it as a short flow, otherwise, it is a long flow. For the short flows, we only sample packets within the current time slot. For these long flows, we first perform a sample operation within the current time slot, and then perform another sample operation for the packets sampled in previous time slots together with the the newly sampled packets to form the flow vector. Therefore, long flow vector will contain information of all time slots. 3) Step 3: Autoencoder Based Flow Feature Extraction: As discussed in the previous section, the expert based feature extraction methods can incorporate human errors. To overcome this, we design an autoencoder based feature selection method to discover the non-intuitive features. Autoencoder is a special artificial neural network which tries to reconstruct its inputs. Autoencoder often consists of an encoder and a decoder. The encoder first compresses the input data into lower dimension, then the decoder tries to reconstruct the input by the output of the encoder. As shown in Fig. 6, each element in v will pass an encoder part which is constructed by several convolution layers with many neurons. The fabric of the decoder net is similar to the encoder net and it will output a vector v. The output of the encoder part can be regarded as the flow feature vector (i.e., h). We also add skip connections to the original structure to accelerate the training speed, whose connecting strategy follows the pre-activation version of the deep residual network [15]. We train the model in an unsupervised manner and do not need any labels. We aim to minimize the Mean Square Error (MSE) between v and v: (1)

B. On-Demand Evolving Isolation Forest
After flow feature extraction, we now construct an Ondemand Evolving Isolation Forest (OEIF) to distinguish malicious traffic. The OEIF contains two parts, i.e., Isolation Forest Construction and On-demand Update, where an Isolation Forest (IF) is trained beforehand and the On-demand Parameter Update works at the operational phase.

1) Isolation Forest Construction:
We begin with the definition of Isolation Tree.
Definition 1: Isolation Tree. Let B denote a node of an isolation tree. B is either an external-node with no child, or an internal-node with one test and exactly two child nodes (B l , B r ). A test consists of an feature B q and a split value B c .
The flow feature samples used for training can be presented as is the i-th flow feature set containing Q features. We need 5 steps to obtain a random isolation tree. Fig. 7 shows an example. a) Step 1: Sampling: We need to randomly select some samples from H to train the model. The selected samples are presented as Y . All the elements in Y construct the root node. Fig. 7(a) shows a simple example, where we use a 6 sample in H to train a tree. After randomly sampling, we derive a subsample Y which contains 4 samples for the following steps. b) Step 2: Feature selection: Next, we randomly select a dimension from the flow features to split the selected samples and build the isolation tree. Take Fig. 7(b) as the example, we choose the first dimension (i.e., h 11 , h 21 , h 31 , h 41 , h 51 ) for sample splitting. c) Step 3: Threshold decision: For each tree node including the root node that contains more than one flow feature vector, we need to choose a threshold for sample splitting. We randomly choose a value between the minimum and maximum value of the corresponding feature chosen in Step 2 as the threshold. For example, in Fig. 7(c), the minimum and maximum value of the first feature among the 4 samples in Y are 2.3 and 4.5, and we choose 3.5 as the threshold. Since different device has different input data, the selected thresholds will also be different. d) Step 4: Sample splitting: We need to separate each tree nodes (i.e., external nodes) which contains more than one feature vector into two child nodes. The left node (i.e., B l ) contains the samples whose split value is equal or smaller than the threshold decided in Step 3, and the right node (i.e., B r ) contains the remaining ones. Then we set the selected feature's index to B q and set the split value to B c . For example, as Fig. 7(d) shows, the left node (B l ) contains flow feature vector h 2 and h 5 , while the right node (B r ) contains h 3 and h 6 . The feature of the node is the label of the first feature dimension and the split value is 3.5 in our example. e) Step 5: Recursively building: Finally, we will perform Step 2 to Step 4, until all tree nodes contain one flow feature vector and are unable to be split any more. Then we will return the root node of the isolation tree.
Repeating the above steps, we can obtain an IF (Isolated Forest). Let F denote the root node set and N denote the Internal-node set. Considering an arrival flow feature vector A, Z A denotes a visited-internal-node vector, where Z A B ∈ {0, 1} denotes whether an internal node B is visited (Z A B = 1) or not (Z A B = 0). In the detection stage, the IF is used to map the input feature vector A into Z A . Algorithm 1 describes how the map procedure works. The algorithm will loop each internal-node. The loop will continue until the leaf node is reached. If the split value of the internal-node is larger than Algorithm 1 Visited Internal-Node Input: Arrival flow feature vector A, root node set F. Output: Internal-node path vector Z A while B is an internal-node do 4: end while 7: end for 8: return Z A A's corresponding feature value, then we will loop its right child, otherwise, we will loop its left child (Line 5). If the B-th internal node has been visited, the B-th element in Z A will be set to 1. Therefore, B∈N Z A B can present the total path length A traverses through the forest.
2) On-Demand Parameter Update: The on-demand parameter update scheme works at the operational phase. As an environment changes, the static model weight might lead to a high false alarm rate. If users find the gap between the observed false alarm rate (e.g., ) and the pre-defined false alarm rate threshold (e.g., ) is higher than the pre-defined threshold , the parameter update process will be triggered: There are also other update signal in practice, e.g., zero-day attack. When zero-day attacks happen, it would be observed by the local site operators. Then the results are reported to the controller and trigger the parameter update process. Therefore, the zero-day vulnerabilities can also be detected by our method. To update the parameters, we consider incorporating feedback from IT administrators to tune the weight (i.e., w). Let T present the indices set of the feedback that received from operators. For each t ∈ T , w t denotes the value of weight vector after receiving the feedback t. We now give the domain S for all possible weights: Let λ represents the pre-defined threshold. Let y t denote the detection result for t. If w t · Z t > λ, we will regard it as a malicious flow (y t = −1), otherwise, we regard it as a benign one (y t = 1). And letŷ t denote the true label of t. Mathematically, we should try to seek w that minimizes the incorrectly classified examples over the total feedbacks, i.e., where θ (x) ∈ {0, 1} is the indicator function that takes the value 1 if x is satisfied and 0 otherwise. The sign(x) ∈ {−1, 1} is the sign function.ŷ t is the true label of t, whereŷ t = −1 when t is a malicious flow, otherwise (ŷ t = 1), it is a benign one. However, (4) is hard to use in practice since it is 0/1 loss and needs to access the whole feedback set. In this case, we would like to replace it with a convex loss function whose parameter is w t after receiving the feedback of t : To derive the best parameter when receiving the feedback of t, we aim to use any vector which has a minimal loss on all past rounds, i.e., where R i (w) is a regularized function (i.e., the regularizer) to stabilize the prediction. From (4), we can derive the best parameters. However, it is often hard to solve for the online application. In this case, we adopt the online mirror descent algorithm [14] and change (5) to its gradient function, i.e., where ▽ f i (w i ) is the gradient of f i at w i , which can be derived as: The regularization term R i (w) is defined as follows, where δ is a constant and η i is the learning rate. Take (7) and (9) into (6), we can derive a convex optimization problem with no constraints. According to KKT (Karush-Kuhn-Tucker) condition, we take the derivative of (6) equaling to 0 and get the optimal feature weight vector: In the area of online learning, the regret metric [14] can measure the difference between the online algorithm and the best parameters. A practical online algorithm should have a sub-linear regret upper bound. This implies that on average the algorithm is performing as the best static strategy we can have in hindsight,. i.e., where f t (w t ) is the performance loss after receiving the feedback of flow feature vector t. To analyze the performance of on-demand parameter update scheme, we first give the following lemma: Lemma 1: Compared with the best static decision w * , the regret of the online learning algorithm is bounded by: where R(·) is the regularization term in the loss equation, and w * is the optimal parameters in hindsight.
Proof: First, we consider the regret against the regularized loss function, i.e., The regularized loss can be computed as: Since, the objective of online learning is to find the w t+1 which has the minimum accumulated loss over previous t steps. Therefore, w t+1 has the smallest accumulated loss over previous t steps.
According to the definition of regret shown in (11), then taking (13) and (14) into (15), we could derive the following equation: According to (6), w t has the minimal accumulated loss of previous t − 1 steps. And our regularization term R(·) is nonnegative. Therefore, we have: (17) With (17), we have: Combine (16) and (18) we have: we can prove Lemma 1. Theorem 1: Let G be the maximal value of the weight vector w. N denotes the number of nodes in the forest. Let η t = G √ t , we have: Proof: For every u, v ∈ S. ||u − v|| ≤ ||u|| + ||v||. According to 3, we have ||w t || ≤ N G, thus, ||w * − w t || ≤ Fig. 8. The overall design of OADSD system.
This implies that with the feedback number growing the difference between the best static strategy and the model goes to zero. In a real-life application, alarms generated by the IDS will often require the inspection from the operator in any system. So, using the feedback from such alarms do not increase any extra inspection cost. Setting up a feedback triggering threshold will further reduce the involvement of the operator. Fig. 8 shows the overall architecture of the system, which contains one controller and multiple edge devices. The controller is responsible for configuration management and updating models. The edge device is responsible for malicious traffic detection. The OADSD works as follows: The edge devices capture and filter the inbound traffic from the internet. Then they will extract features and perform malicious traffic detection. The edge devices timely report the detection results to the controller. If the detection results trigger the update policies (e.g., low accuracy), the operators will try to mend the model.

C. System Design
Controller is the coordinator of the whole system which contains three main components: (1) Configuration Server. This module maintains the configurations of the OADSD (e.g., µ). After an edge device connect to the network, it will pull the configurations to edge devices; (2) Controller Portal. If the administrator decides to update the model, then the detection parameters will update and redistribute to the corresponding edge devices. In reality, the feedback from the administrator may not always be accurate. OADSD also provides a rollback mechanism which could rollback to last checkpoint. When the administrator gives incorrect feedback and lead to worse performance, the edge device could quickly rollback to its last checkpoint; (3) Communication Channel. This module is responsible for communicating with the edge devices, where we use long-lived TCP connections to avoid connection established costs.
Edge device is responsible for monitoring and detecting the inbound traffic. We first extract a set of features from each packet (e.g., packet size, payload size, packet head, payload) and form a flow vector to represents the flow. Then the sequence will be fed into an auto-encoder. The auto-encoder is 1d-CNN network with an average pooling in front of it. The average pooling will transform the input matrix into a unified lower dimension input. This allows the auto-encoder to handle network traffic with varied lengths. Then we send the traffic into the auto-encoder to get the traffic features. In details, it consists of four modules: (1) Agent. It runs a sniffer (e.g., tcpdump) to collect and filter the inbound traffic every µ time interval; (2) DDFE. This module is responsible for extracting the traffic feature from the collected packets; (3) Detector Manager. The detector manager is responsible for detecting malicious traffic based on the feature vectors extracted by DDFE. The detector will send a message to controller after detection. The messages sent by the edge device include the identifier (serial number and MAC address of the device). , the controller can differentiate messages from different edge devices. (4) Communication Channel. This module is responsible for communicate with the controller.
We introduced a feedback mechanism to improve the detection accuracy of malicious traffic using machine learning model in SD-WAN environment. By collecting the detection results of the edge devices in the controller, introducing expert feedback and training the machine learning model at the controller in an online fashion. Then sending the parameters to the edge devices for detection, the detection accuracy of machine learning model can be secured. This mechanism makes full use of SD-WAN centralized control and reduces the pressure on the training model of edge equipment.
We use a whitelist and certificate to guarantee the security of the system where the whitelist records each trusted device information (e.g., the serial number, MAC address, and authentication status). Edge devices and the controller have certificates to prove their identity. The whitelist and certificate are managed by the network operator and are preinstalled both in the controller and edge devices. When an edge device establishes control connections to SD-WAN controllers, it exchanges its device SSL certificate with the controller during the SSL handshake process for authentication. Then controller will check whether the edge device is whitelisted. If the above steps are passed, the authentication between controller and edge device is completed, and a two-way SSL connection will be established for secure communication. This will prevent malicious devices from joining SD-WAN domain and all devices are known, trusted, and authorized. Sometimes some edge devices might be compromised and become malicious, then, the network operator will remove the edge device from the whitelist in the controller. Whenever the controller receives a message from an edge device, it will check whether the edge device is whitelisted and authenticated. If it is not authenticated, the controller will terminate the SSL connection with the edge device. In this way, the controller is secured from malicious edge devices in multiple domains. Also, the controller might be compromised and the network operator will remove the controller from the whitelist in each edge device and set a backup controller. When the edge device receives each request from a controller, it will also check whether the controller is whitelisted and authenticated, and if not, the edge device will terminate the DTLS/TLS connection with the controller and switch to a backup controller.

IV. EVALUATION
In this section, we will evaluate the performance of the OADSD by using real-world attacks (e.g., DDoS, DoS) with three public real-world datasets as well as a small testbed. Our main results are as follows: • Compared with existing methods, OADSD could achieve the best performance on most of the datasets and could achieve a high detection accuracy on all attack types.
• The proposed OADSD could adapt to the environment with only a few iterations of updates. We test OADSD on a two-dimensional synthesize dataset. The result shows OADSD could adapt to environment changes. But the baseline IF could not adapt to the environment change, with a significant performance drop being observed.
• Classical Semi-supervised and unsupervised detection algorithms achieve a lower detection accuracy compared to the supervised detection algorithms. Hence, we further compare the detection accuracy with other supervised detection algorithms [19], [21]. The results demonstrate that OADSD could achieve compatible or better performance than those supervised algorithms.

A. Setup
We use the PyTorch to implement the autoencoder described in Section III and the variational autoencoder in the baseline methods. We use C++ (version 5.4.0) to implement the isolation forest model and also the online update algorithm. For all our traffic trace, we use 80% of traffic (i.e. both normal and anomaly) to train the machine learning algorithms and the rest of the 20% traffic are used to test the performance. OADSD has some parameters. L is set to 900 in default. The maximal value of weight vector (i.e., G) is set to 1. The default value of δ is 2. For a flow, when its weighted path length is larger than 0.6 (i.e., λ), we will regard it as a malicious flow. The default pre-defined false alarm rate threshold (e.g., ) is 10% and default update trigger threshold is 5% in our evaluations. For an arriving traffic, if the parameter update condition is satisfied, we will give the true label to our system to mimic the feedback of network operators.
Baselines. We compare the performance of OADSD with the following methods: • Packet-level Detection. We use the state-of-the-art machine learning based detection method, Kitsune [28], which extracts 23 features via flow state variables and feeds the features to an autoencoder based classifier.
We use the open source Kitsune implementation [28], and detect attacks with the same hardware resource as the OADSD.
• SVM with Flow-level Statistics Classification (SFSC). Flow-level statistics features have already been used by many anomaly detection works [21], [27], [37]. We choose the flow-level statistics features described in [29] and [37], including the maximum, minimum and the mean of packet size. Then we perform normalization for the flow-level statistics features. For the classification algorithm, we choose SVM, because it has been widely used and has a good detection performance [9], [19] in network anomaly detection. The implementation of SVM is shown in [5].
• Static Isolation Forest with highest correlation features (SIFD). We perform correlation analysis over all the features adopted by the previous works [21], [27], [28], [37]. Then SIFD chooses the top 10 features and uses the static isolation forest model to detect the anomalies. Compared with OADSD, SIFD detection algorithm is unable to adapt to environment. We use the open source isolation forest implementation [33] in our following evaluations.
Datasets. We test OADSD and above baseline methods with the three following three public datasets: • CICIDS 2017 Dataset [37]. thousands of colleges and campuses through CERNET backbone. Each campus network deploys a core router connected to the CERNET backbone network. CERNET is similar to SD-WAN, e.g., they both use intelligent schemes to establish new routes between campuses with tunnels, they both collect network performance indicators at the edge devices to monitor network performance, and they both switch to backup paths through centralized control to respond to failures. The CERNET dataset is generated from one core router, similar to the inbound traffic collected from edge devices. There are 875 attacks, including DoS, DDoS, and PortScan.
We extract a set of features from each raw packet (e.g., packet size, payload size, packet head, payload) in a flow and form a flow feature vector to perform online malicious traffic detection. Metrics. We use the following metrics to evaluate the detection performance: (1) the Area Under ROC Curve (AUC), the AUC score stands for the area under the ROC curve. The ROC curve is created by plotting the true positive rate against the false positive rate at different thresholds; (2) the average precision score (AP), the average precision score stands for the area under the precision recall curve (PR curve). The PR curve is created by plotting the precision score against the recall score at different thresholds;. (3) Precision, the precision for a class is the number of true positive (TP) items divided by the total number of elements labeled as the positive class. Precision score could be calculated by: Pr ecision = T P T P+F P ; (4) Recall, the recall of a class is the number of true positive items divided by the sum of true positives and false negatives, which could be calculated by: Recall = T P T P+F N ; (5) F1-score (F1), combines the precision and recall into together. The F1score could be calculated by: F1 = 2 · Pr ecision·Recall Pr ecision+Recall .
B. Performance With Real World Traffic 1) Performance Comparison: In this part, we compare the performance of OADSD with the most widely used malicious traffic detection algorithms. Fig. 9 shows the performance comparison between OADSD and SFSC, Kitsune, SIFD under different data set. We can see that OADSD has consistently high performance under all datasets. For detection accuracy, OADSD performs up to 40% higher than the other methods. Next, we select the samples that contain the specific attacks to evaluate the performance, e.g., UNSW-Fuzzers denotes that traffic only contains Fuzzer attack traffic and benign traffic. Table III shows the results measured with AUC and AP score. We find that OADSD can detect all 15 attacks with AUC ranging from 0.932 to 0.99 and AP ranging from 0.75 to 0.98, which performs up to 2×, 10×, 50% better than SFID, Kitsune, SFSC, respectively. For the comparison of the detection algorithm accuracy, the OADSD could achieve a much higher detection accuracy than SIFD, especially for attacks which rely on payload information (e.g., WebAttacks or Infiltration). We can see that unsupervised (e.g. SIFD) and  semi-supervised (e.g. Kitsune) methods achieve a very low AP score in most of the time. The reason for this is that the unsupervised and semi-supervised methods, like IF or VAE, could not utilize the label information, which makes them hard to give an unbiased classification result and leads to a higher false alarm rate. In comparison, OADSD could detect both brute force attacks like DDoS, SSH or FTP Patator and attacks highly correlated with payload information, like SQL injection, exploits and worms. The reason for this is that OADSD can efficiently capture the features of traffic and is able to adapt to network change. We further compare the performance of the supervised methods like SVM, Naive Bayes [21]. The results are shown in Fig. 10. We can see that OADSD could also performs better than those supervised methods. On CICIDS dataset, our OADSD could out perform the Bayes about 48% AP and about 10% AUC improvement over SVM.
OADSD adopts DDFE to derive the flow features dynamically without the knowledge of experts. We now evaluate the effectiveness of DDFE. We choose the CICIDS 2017 Dataset [37] with 20000 benign and 2000 malicious features and use T-SNE [41] to compress the features into two dimension. The visualization results show the center of the malicious traffic is far from the benign traffic. We also present the visualization results of SFSC and SIFD in Fig. 11(b) and Fig. 11(c). This implies DDFE could generate more separable features than other methods, which contributes much to the detection results. Fig. 11. The T-SNE [41] visualization of different feature extraction method under CICIDS data set. The green dots represents malicious traffic, the yellow dots represents normal traffic.   2) Performance Under Different Settings: In this part, we will test the performance of OADSD under different parameters settings on CICIDS dataset. L controls the length of flow used for feature extraction. Fig. 12 shows that with the increasing L, the performance also increases, since the feature extractor could use more flow data and derive better features. However, a larger L also leads to a longer training and processing time. Setting the L to 900 can ensure a good performance in most cases. Fig. 13 shows that the AUC and AP score under different δ which controls the regularization degree. A large δ will make the weight vector stay around its initial point. From the result, setting δ around 2 can produce the best performance, larger δ will greatly affect the AP score. G is defined as the maximum element of the weight vector. Fig. 14 shows the AUC and AP score under different G. The AUC and AP decrease with the increase of G. The reason for this is that a large G will lead to a large learning rate, which makes the weight vector fluctuate and converge slowly.
We add skip connections to the auto-encoder to accelerate the training speed in the DDFE (see § III-A). We will show its benefits in the following. We trained the IF on three different feature sets under CICIDS dataset: (1) feature extracted Also, compared to flow-level features, the features extracted by autoencoder without skip connections achieves lower performance, and has the largest deviation among these three feature sets. The reason is that on the one hand, autoencoder is hard to train and difficult to converge to global optimal weights; On the other hand, the training objective of autoencoder makes the model focus on the fine-grained information of the raw traffic, the patterns of different attacks are more on the high-level.
3) Adaptation to Environment Change: The OADSD can adapt to the environment change, as "concept drift", which could be defined as the joint distribution shift of input variable x and y. This is a common challenge in the traffic anomaly detection [6]. The offline training methods are not suitable for adapting to the changing environment, because of the long training time.
In this experiment, we test the performance of OADSD under a two-dimensional synthetic test dataset to evaluate the adaptability to environment change. We measure the AUC and AP of OADSD and Isolation Forest (IF), which could be seen as an offline version of our detection algorithm. The synthetic dataset is consisted by two major parts, and both of them are sampled from a two-dimensional normal distribution. Each of these parts is controlled by two main parameters, are the center of the mass and the variance of the distribution. The detailed parameters are shown in Table IV. The before and after dataset has a different distribution, which means the parameters learnt from Before Drift dataset cannot transfer smoothly to After Drift dataset. The experiment result shows in Table V also proves this. Because the distribution of the training and testing data is different, the AUC of baseline drops from 0.78 to 0.73 (i.e. 10% drops on AUC) and AP of baseline drops from 0.33 to 0.30 (i.e. 6% drops on AP), when trained the baseline on Before Drift and tested on After Drift.
After a few times of feedback, which only about 2% of the total training data, the AUC and AP score increase back to 0.82 and 0.38. Which is equivalent to the performance by training OADSD on the After Drift dataset (i.e. 0.83, 0.38). But in comparison, the IF could only achieve 0.73 AUC and 0.30 AP, which was 30% lower on AP and 21% lower on AUC compares to OADSD.
The above result shows OADSD could well adapt to the environment change. But the offline version (e.g. SIFD) could not adapt to environment change like OADSD. The reason is that OADSD could fully utilize the feedback of IT administrator and tunes its parameters based on the current context.
In our assumption, the feedback from IT administrator is always correct. However, this is not necessarily true, since humans are notoriously prone to mistakes. And this may deteriorate the performance of OEIF.
Therefore, we test the performance of OEIF under different ratio of incorrect feedback under a two-dimensional synthetic dataset. We measure the performance of OEIF by AUC score. How the synthetic dataset is constructed has been described in the previous section. We use the Before Drift dataset, mentioned in the Table IV, in this experiment to evaluate the performance.
In reality, the false feedback from operators could hurt the detection accuracy. We test the performance of OADSD under different ratio of incorrect feedback. Fig. 16 demonstrates the performance of OADSD. We can see that OADSD maintains steady when there are small number of incorrect feedbacks. Also, smaller learning rate can eliminate the side effect of incorrect feedbacks.

C. Testbed Evaluation
In this section, we will evaluate the performance of OADSD with a small testbed. The OADSD contains three parts, i.e., Distributed Dynamic Feature Extraction (DDFE), On-demand Evolving Isolation Forest (OEIF) and On-demand Update, where both the DDFE and the OEIF are deployed at the edge devices, and the On-demand Update module is deployed at the controller. We construct a small testbed with five dell servers, each is equipped with Intel i5-10210U@1.6HZ CPU (8 cores) and 8GB RAM. Three servers install software routers that realize routing functions by the X86 architecture and userdefined software. The servers connect each other with 1Gbps bottleneck link. One server performs as the controller. The controller will configure and communicate with the routers through the TCP connections. Each second, the last server will randomly choose one router and setup a TCP connection with it. For each connection, we randomly generate an integer p between 0 and 100 for each link. If p/100 is smaller than a pre-config threshold D%, the TCP connection will contain malicious traffic, otherwise, it is benign. We also setup a tester to generate background traffic to each site to emulate the internet traffic. Each connection will last x seconds, where the default value of x is 10 in our evaluations. The default pre-defined false alarm rate threshold (i.e., ) is 10% and default update trigger threshold is 5% in our evaluations. Each group of experiments will contain 1000 TCP connections and we repeat each group 10 times. Fig. 18 shows that the OADSD consistently keeps high performance under different threshold and the performance advantage is larger with D. The odds are up to 30% in our evaluations, which matches the results of the simulations. Fig. 19 demonstrates that OADSD can adjust its performance according to the feedback from the IT administrators, while other schemes have poor performance when the testing environment is different from the training environment. Fig. 20 depicts the resource utilization. We can see that both the CPU and memory utilization is below 10% even the arrival traffic approaches the line rate (e.g., 1Gbps). The small utilization indicates OADSD can work in an efficient manner. When the gap between the observed false alarm rate (e.g., ) and default pre-defined false alarm rate threshold (i.e., ) is larger than a pre-defined threshold , we will perform parameter update process to restore the parameters in our evaluations. We now show the number of feedbacks needed with different and . Fig. 21 shows that OADSD only needs less than 60 feedbacks when the gap is larger than 20%, which implies some cost for the involvement of the IT operator V. RELATED WORK

A. Feature Extraction
Feature extraction is the start point of utilizing machine learning based methods in traffic anomaly detection.    Reference [37] propose a traffic feature extractor based on the flow-level statistical features in network traffic. Reference [28] utilize the per-packet features and use a feature clustering method to evaluate the correlation between features. References [1], [3], and [35] used automatic feature compression methods on the statistical traffic features, which could preserve the most representative part of the feature set. Compared with the statistical features, these methods could achieve a better performance. Reference [32] leveraged the download and upload size interval to get the website fingerprint from network traffic. Reference [42] uses a CNN model to extract features from raw traffic data for traffic classification.

B. Machine Learning Based NIDS
Machine learning based Network Intrusion Detection Systems (NIDS) can detect zero-day attacks and achieve a higher detection accuracy, compared with traditional rule based meth-ods. References [4] and [23] adopted statistical machine learning methods, i.e., Wavelet Analysis, PCA, to detect malware traffic in an unsupervised manner. Although, these methods have a small detection overhead. They often have a low detection accuracy and is sensitive to outliers. References [21] and [27] detected malicious traffic, i.e., DDoS, Black Holes, by using flow-level statistical features and Naive Bayes Classifiers. However, these methods could not achieve real-time detection. Reference [28] proposed Kitsune which aims to reduce the detection overhead by using a lightweight deep learning models, i.e., autoencoders.

C. Malicious Traffic Detection
Online Traffic Anomaly Detection methods are widely investigated [2], [28], [30], [31], [44]. [30], [31] combine the online and offline machine learning methods to make the methods easier to train and can identify new traffic based on the prediction result of online and offline algorithms. Reference [2] proposed a fast activation function that helps the deep neural network converge faster, which is then used to detect DDoS attacks. Reference [28] provide an online schema for training the traffic detection model, which achieves a low training overhead. Reference [44] leverage deep dictionary learning to achieve online anomaly detection for encrypted traffic. However, unlike OADSD, these methods do not guarantee that the model will converge.

VI. CONCLUSION
In this paper, we developed the OADSD, an online malicious traffic detection system, which contains traffic collections, traffic feature Extraction and time evolving anomaly detection. Distributed Dynamic Feature Extraction (DDFE) can extract features directly from the raw traffic and Ondemand Evolving Isolation Forest (OEIF) utilizes expert feedback to update parameters which makes the system have the ability of adapting to the environment. We prove that the OEIF has a sub-linear regret bound, which means the performance gap between OEIF and the best hypothesis goes to zero as update iterations go to infinity. Extensive experiments show that the OADSD can achieve an AUC score greater than 0.9 in most cases. The OADSD can still achieve good performance in the environment which has a different distribution between the training dataset with the help of the expert feedback.