Performance Evaluation of a Combined Anomaly Detection Platform

The Hybrid Anomaly Detection Model (HADM) is a platform that filters network traffic and identifies malicious activity on the network. The platform applies data mining techniques to address security issues in high-load communication networks. It combines linear and learning algorithms with a protocol analyzer. The linear algorithms filter and extract distinctive attributes and features of cyber-attacks, while the learning algorithms use these attributes and features to identify new types of cyber-attacks. The protocol analyzer classifies and filters vulnerable protocols to avoid unnecessary computational load. Using linear algorithms in conjunction with learning algorithms and the protocol analyzer allows HADM to detect cyber-attacks with better accuracy and computation time than existing solutions. While the authors' previous paper evaluated HADM efficiency (accuracy and computation time) against related studies, this paper concentrates on HADM robustness and scalability. For this purpose, five datasets of various sizes and with diverse attacks are used: ISCX-2012, UNSW-NB15 Jan, UNSW-NB15 Feb, ISCX-2017, and MAWILab-2018. Different feature selection methods are applied to find the best features; the methods are selected based on the algorithms' computation time and detection rate. The best algorithms are then selected through a benchmark on the applied datasets, based on metrics such as cross-entropy loss, precision, recall, and computation time. The results show that the HADM platform is robust and scalable across datasets of different sizes and with diverse attacks.

Intrusion Detection Systems (IDSs) are well-known tools for monitoring and detecting malicious traffic in communication networks. However, an IDS relies on highly developed and complex algorithms to process large volumes of data [1]. The complexity of these algorithms results in long computation times. An IDS captures network traffic in real time and compares the received packet patterns with known patterns to detect anomalies in the network. Yet the cost and the high processing time needed to handle the traffic load remain a challenge for IDSs.
The associate editor coordinating the review of this manuscript and approving it for publication was Wei Quan.
To solve this problem, the Hybrid Anomaly Detection Model combines network traffic flow control with Data Mining (DM) techniques. The HADM platform consists of two main parts, each of which independently increases the efficiency of attack detection with respect to factors such as precision, recall, accuracy and computation time. Part 1 of the model uses several algorithms and the protocol analyzer for traffic filtering, which reduces the processing time and increases the accuracy. Part 2 applies dynamic feature selection with a genetic algorithm to classify unknown attacks and further increase the accuracy. The HADM model comprises a protocol analyzer, linear and learning algorithms, and other modules. Since some protocols, such as streaming protocols, are not vulnerable, and attackers usually target specific protocols, the protocol analyzer classifies and filters vulnerable protocols to avoid unnecessary computational load. The protocol analyzer forwards the filtered traffic either to a linear algorithm alone, for Denial of Service (DoS) detection, or to a combination of a linear and a learning algorithm, for other types of attacks. The linear algorithm first determines whether the traffic is secure or insecure, regardless of the attack type. In addition, it extracts the proper features and provides them to a learning algorithm, which classifies already known attacks and detects unknown attacks. Another countermeasure, located after the learning algorithm, extracts information about known attacks (against which the network is already protected) from other security mechanisms deployed in the network, e.g., firewall, IDS or DPI. It compares the extracted information with the attack received from the linear algorithm and drops similar attack flows. In each step, feedback is sent to a database for next-level detection. In the learning algorithm, the received attack is assigned to one of the attack clusters. In addition, the algorithm changes its structure and input weights dynamically based on the received feedback. If the attack does not belong to any of the existing clusters, it is assigned to a new cluster. This novel mechanism dynamically defines new features in order to detect new types of attacks.
The protocol analyzer in the HADM platform classifies and filters vulnerable protocols to avoid unnecessary computational load. In addition, each dataset includes hundreds of features that may degrade performance in the detection process. To overcome this problem, feature selection methods are used to select a smaller number of features and reduce the dimensions of the dataset [1]. Furthermore, the use of linear algorithms in conjunction with learning algorithms improves accuracy and reduces computation time. Linear algorithms detect attacks at a general level regardless of their type, while learning algorithms cluster the attacks into different categories. This mechanism decreases the input load of the learning algorithms, which are the most time-consuming part because of their complexity [2].
The major differences and novelty of this paper compared with related work are as follows:
• Most surveys that cover security mechanisms such as existing IDSs (i.e., signature, statistical, supervised, unsupervised, etc.) have a very limited focus on hybrid IDS techniques, while the paper [7] not only provides a comprehensive review of a hybrid platform but also applies a protocol analyzer, taking intrusion detection to the next level.
• We are the first to discuss and apply the concept of a protocol analyzer to reduce the IDS load.
• Feature extraction and selection techniques play important roles in intrusion detection because they influence the learning process. Unlike other studies that apply a specific feature selection method with a particular algorithm, this paper discusses the topic thoroughly and applies various feature selection methods along with different algorithms to find the best combination.
• None of the related studies has evaluated IDS robustness and scalability against various datasets.
Overall, the contribution of this paper is an efficient (in terms of computation time and accuracy) anomaly detection platform with several characteristics:
• We propose a novel platform that comprises several modules, such as a protocol analyzer and combined algorithms.
• We take advantage of various feature selection methods to select the best features from the data and apply different combinations of learning and linear algorithms to identify attacks efficiently.
• We integrate our algorithms with a protocol analyzer that filters vulnerable protocols, reducing the platform load.
• We extensively evaluate the robustness and scalability of our system over several metrics, different datasets and various attacks to deliver a proof of concept, which is supported by experimental results.
For this purpose, we:
• Survey different datasets and their pros and cons to select the more prominent datasets for our testing.
• Evaluate the platform efficiency based on uniform evaluation metrics, including computation time, precision and recall.
• Evaluate scalability and robustness by applying five different datasets that contain diverse attacks: ISCX-2012, UNSW-NB15 Jan, UNSW-NB15 Feb, ISCX-2017 and MAWILab-2018.
The rest of the paper is organized as follows. Section 2 provides a brief background in intrusion detection. Section 3 introduces the platform, its algorithms and metrics. Section 4 discusses the HADM implementation along with the datasets and data preprocessing. Section 5 discusses the experimental results. Finally, Section 6 draws conclusions and outlines the scope of future research.

II. RELATED WORK
Di Pietro et al. [3] apply machine learning algorithms such as k-nearest neighbor (k-NN) and Support Vector Machine (SVM) to detect anomalies. Furthermore, a Deep Packet Inspection (DPI) mechanism is used to define rules for capturing packets. However, the rules, protocols and details of the process are not explained. In addition, the authors do not discuss the scalability and robustness of their model, nor do they present any experimental results.
Vasseur et al. [4] propose a supervised learning classifier to detect DDoS attacks. This study applies a Deep Neural Network (DNN) classifier and mainly concentrates on optimizing the training process in order to provide labeled data. However, the study does not introduce a combined method, and the algorithm is used only for DDoS detection rather than for other types of attacks. In addition, the authors do not discuss the scalability and robustness of their model, nor do they present any experimental results.
Piettro et al. [5] apply a machine learning based model, built on an ANN, to compare received traffic with expected traffic. The model is trained with expected traffic and, upon receiving input data, compares the signature of the data with the expected traffic. If they differ, a signature for the attack class is generated and the model is trained with the new information. This study does not discuss any attack types, nor does it present implementation results.
Yadav et al. [6] propose a Virtual Machine (VM) based analytic model to detect anomalies within network traffic based on dynamic modeling of network behavior. They use a honeypot to collect malicious traffic. Although the model comprises unsupervised and supervised machine learning algorithms, the honeypot relies only on the attacks it receives and not on other attacks. The applied algorithms are not disclosed and no experimental results are presented. In addition, the authors do not discuss the scalability and robustness of their model.

III. HADM PLATFORM
The HADM comprises a protocol analyzer, linear and learning algorithms, validator and database as shown in Fig. 1.
These components are deployed in conjunction with one another to filter packets on communication networks, such as mobile networks, for certain network protocols that are known or considered to be vulnerable to, or used in, cyber-attacks. This allows the HADM to spend fewer processing resources on other network protocols, such as streaming protocols, that are not normally vulnerable and thus not typically targeted by cyber-attackers. The ability of the HADM to focus on vulnerable network protocols helps to avoid burdening network servers with unnecessary computational load. The protocol analyzer filters the network packets and identifies vulnerable protocols. Traffic on non-vulnerable protocols is forwarded to the feature extraction module for further processing. The feature extraction module extracts features from the incoming packets and provides them to learning algorithm I for analysis. If the output of learning algorithm I is suspicious, it is recorded in a log file. If traffic is carried on a vulnerable protocol, the counter and prioritization module forwards the suspicious traffic to the next level based on the occurrence of the protocol against a defined threshold. The validator and database component validates the output of the linear and learning algorithms. If the actual output (e.g., from the learning algorithm) differs from the expected output, the actual output is considered an error. The expected output refers to numeric values that are predefined by a user and represent safe traffic. The actual output contains numeric values assigned to the features and attributes from the output of the learning algorithm. The comparison is based on the values of these traffic features. The validator output is stored in the database component, and the attack features are provided as feedback to the protocol analyzer and to the linear and learning algorithms for use in subsequent detections. Such an arrangement allows the HADM to define new attack features dynamically in order to better identify new types of cyber-attacks.
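The routing decision described above can be illustrated with a minimal sketch. It is not the authors' implementation: the set of vulnerable protocols, the threshold value and the destination names are hypothetical placeholders, and the counter is assumed to forward traffic once a protocol has been seen at least `threshold` times.

```python
from collections import Counter

# Hypothetical values: the paper does not list the vulnerable protocols
# or the threshold used by the counter and prioritization module.
VULNERABLE_PROTOCOLS = {"http", "ftp", "ssh", "dns"}
OCCURRENCE_THRESHOLD = 3


def route_packets(packets, threshold=OCCURRENCE_THRESHOLD):
    """Return (packet, destination) pairs for an iterable of packet dicts."""
    seen = Counter()
    routed = []
    for packet in packets:
        protocol = packet["protocol"].lower()
        if protocol not in VULNERABLE_PROTOCOLS:
            # Non-vulnerable traffic goes to feature extraction,
            # which feeds learning algorithm I.
            routed.append((packet, "feature_extraction"))
        else:
            # Vulnerable traffic is counted; it is forwarded to the next
            # level once the protocol occurrence reaches the threshold.
            seen[protocol] += 1
            if seen[protocol] >= threshold:
                destination = "linear_algorithm"
            else:
                destination = "counter"
            routed.append((packet, destination))
    return routed
```

With a threshold of 3, the third packet of a vulnerable protocol is the first one passed on to the linear algorithm, while traffic on protocols outside the set bypasses the counter entirely.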

A. APPLIED ALGORITHMS
For performance testing, the selected features are applied to six different algorithms: Extreme Learning Machine (ELM), Multi-Layer Perceptron (MLP), SVM, k-Nearest Neighbor (k-NN), Decision Tree (DT) and Logistic Regression (LR). The best algorithms were selected through a benchmark on the applied datasets by comparing the results using metrics such as accuracy, False Positives (FPs), False Negatives (FNs), and training and testing time. The applied algorithms are described below.
LR: LR is a supervised learning method that maps an input $x \in \mathbb{R}^d$ to the output $y \in \{0, 1\}$ by constructing a linear function with weight vector $w = [w_1, w_2, \ldots, w_d]$ and calculating the probability of the output given the input, $P(y \mid x)$. The predicted class maximizes this parametric form. This method finds strong relationships in the input features. However, the model may not converge when the training data is small or the decision boundary between the two classes is highly non-linear [8].
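As a minimal sketch of the LR decision rule, the probability $P(y = 1 \mid x)$ can be computed with the logistic function applied to the linear score. The bias term `b` and the 0.5 decision threshold are assumptions made here for illustration; the paper does not state them.

```python
import math


def lr_probability(x, w, b=0.0):
    """P(y = 1 | x) for weight vector w and bias b (bias is an assumption)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def lr_predict(x, w, b=0.0):
    """Pick the class with the larger probability (threshold 0.5)."""
    return 1 if lr_probability(x, w, b) >= 0.5 else 0
```

For example, `lr_predict([2.0, 1.0], [1.0, -0.5])` evaluates the linear score 1.5 and therefore returns class 1.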
DT: DT is a non-parametric machine learning method based on binary splitting of features. Each leaf node in a DT represents a specific region or class with different data characteristics, as illustrated in Fig. 2. A DT is easily interpretable but might be unstable, since it is affected by the variance in the training set [8].
k-NN: k-NN is a non-parametric, instance-based learning method that requires no knowledge of the prior distribution of the data [8]. It keeps the labeled training samples and compares a new sample with them using a similarity measure. The class of the new sample is decided by the classes of its k nearest neighbors. In the experiments, k was set to 5 and every nearest neighbor contributed equally to the final decision. Euclidean distance was selected as the similarity measure.
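The voting rule with the stated settings (k = 5, equal neighbor weights, Euclidean distance) can be sketched in a few lines. This is an illustrative brute-force version, not the scikit-learn implementation used in the experiments.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=5):
    """Majority vote among the k training samples closest to the query."""
    # Pair every training sample's Euclidean distance with its label.
    distances = sorted(
        ((math.dist(x, query), label) for x, label in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )
    # Uniform weights: each of the k nearest neighbors casts one vote.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]
```

A brute-force search like this costs one distance computation per training sample for every query, which is one reason computation time matters when choosing between k-NN and other algorithms.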
SVM: SVM constructs a hyperplane in a high-dimensional space by mapping the input data to a higher-dimensional space using kernel methods in order to create a non-linear decision boundary. SVM achieves high accuracy in many applications, but its time complexity is quite high [8].
MLP: An MLP is a deep, feed-forward artificial neural network that includes more than one perceptron and several layers. It includes an input layer to receive the signal (input data), an output layer to give either a probability vector for predictions or a single prediction, and a varying number of hidden layers that represent the input vector in a more abstract form. A single perceptron in each layer calculates a weighted sum of its input and applies a non-linear activation function to this weighted sum. The output of one perceptron is fed as an input to the perceptrons of the next layer.
During training, the MLP accepts the input $x$, forwards the information from layer to layer using its parameters $\theta$ (weights and biases) and produces an output $\hat{y}$ as well as a scalar cost $J(x; y; \theta)$ between the original class $y$ and the predicted class $\hat{y}$. With a back-propagation algorithm, it calculates the partial derivatives of the cost function (the gradient) with respect to its parameters. It updates the weights and biases using the gradient values. Back-propagation is applied in each iteration (epoch) until the parameters or the test error converge. More detailed information about back-propagation is given in [9]. For the parameter update, different gradient optimization techniques can be used, such as Stochastic Gradient Descent (SGD), Momentum, RMSProp or Adam [9].
In the experiments, two different MLP architectures were constructed, one with 10 (MLP10) and the other with 50 (MLP50) perceptrons in one hidden layer. Each perceptron had a logistic activation function, cross-entropy loss was selected as the cost function and the Adam optimization technique was used in back-propagation.
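A forward pass through such a network can be sketched as follows. The layer sizes mirror the MLP10 setting with 9 input features and 3 output classes mentioned elsewhere in the paper; the softmax output layer and the uniform weight initialization are assumptions made for illustration, and training with back-propagation and Adam is omitted.

```python
import math
import random


def logistic(z):
    """Logistic activation used by the hidden perceptrons."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Turn output scores into a probability vector (assumed output layer)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases; the initialization scheme is an assumption."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, hidden, output):
    """Forward pass: input -> one logistic hidden layer -> class probabilities."""
    (w1, b1), (w2, b2) = hidden, output
    h = [logistic(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(w1, b1)]
    scores = [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(w2, b2)]
    return softmax(scores)
```

The returned vector holds one probability per class, and the predicted class is the index of the largest entry.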
ELM: ELM is a type of single-hidden-layer feed-forward neural network proposed in [10]. Fig. 4 shows that ELM has a similar architecture to MLP. However, ELM does not use back-propagation to update its parameters. ELM randomly initializes its input weights and updates only the parameters connecting the hidden layer and the output layer, which reduces the computation time while ensuring robustness [8].
Consider training data with $N$ samples $\{(x_i, y_i)\}_{1 \le i \le N}$, where $x_i \in \mathbb{R}^d$ is the input and $y_i \in \mathbb{R}^c$ is the true output. Then, ELM can approximate the mapping from the input to the true output as:
$$\sum_{j=1}^{L} \beta_j \, g(w_j \cdot x_i + b_j) = y_i, \quad i = 1, \ldots, N \tag{1}$$
where $w_j$ is the weight vector connecting the input layer to the $j$-th hidden node, $b_j$ is the bias of the $j$-th hidden node, $\beta_j$ is the weight vector connecting the $j$-th hidden node to the output layer, $g$ is the activation function and $L$ is the number of hidden nodes. In matrix form, this system reads $H\beta = Y$, where $H$ is the hidden-layer output matrix. However, $H$ is not a square, invertible matrix in most cases, since the number of hidden nodes is less than the number of training samples [8]. Therefore, ELM solves the system with the pseudoinverse $H^{\dagger}$ of $H$:
$$\hat{\beta} = H^{\dagger} Y \tag{2}$$
Once an approximate solution $\hat{\beta}$ is found, the weights connecting the hidden and output layers are updated and the training procedure is finished.
In the experiments, we used 10 and 50 neurons in the hidden layer (ELM10 and ELM50) and the sigmoid function as the activation function.
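The ELM training procedure can be sketched with NumPy as below. This is a minimal illustration under stated assumptions: the input weights and biases are drawn from a standard normal distribution (the paper does not specify the distribution), the sigmoid is used as the activation, and the output weights are obtained with `numpy.linalg.pinv`.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def elm_train(X, Y, n_hidden=10, seed=0):
    """Fit an ELM: random hidden layer, least-squares output weights."""
    rng = np.random.default_rng(seed)
    # Input weights and biases are random and never updated
    # (standard normal initialization is an assumption).
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = sigmoid(X @ W + b)        # hidden-layer output matrix, N x n_hidden
    beta = np.linalg.pinv(H) @ Y  # pseudoinverse solution for output weights
    return W, b, beta


def elm_predict(X, W, b, beta):
    """Apply the fixed hidden layer and the learned output weights."""
    return sigmoid(X @ W + b) @ beta
```

Because only `beta` is learned, and in a single linear-algebra step, training avoids the iterative updates that back-propagation requires in an MLP.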

B. FEATURE SELECTION
The datasets involve different features that are often classified into the groups below:
1) Flow features: this group includes the identifier attributes between hosts, such as client-to-server or server-to-client.
2) Basic features: this category involves the attributes that represent protocol connections.
3) Content features: this group encapsulates the attributes of TCP/IP; it also contains some attributes of HTTP services.
4) Time features: this category contains the attributes of time, for example, arrival time between packets, start or end packet time and round-trip time of the TCP protocol.
5) Additional generated features: this category can be further divided into two groups:
a. General purpose features, where each feature has its own purpose, in order to protect the service of protocols.
b. Connection features, which are built from the flow of 100 record connections based on the sequential order of the last time feature.
6) Labelled features: this group represents the label of each record [11].
However, network packets also carry a wide variety of irrelevant or redundant features. In this section, the feature characteristics of our datasets are examined to remove the unwanted features that affect the efficiency and detection rate of our algorithms. For this purpose, we apply different feature selection methods, namely Chi2, F-Score, SVMonline and RFE, to find the best features in the datasets. The feature selection methods are chosen based on the achieved efficiency, considering testing time and detection rates. The feature selection algorithms used are described below.
1) Chi2: Chi-square measures the dependency between a feature and a class by counting the occurrences of the feature with respect to the occurrences of the class. Chi2 is simple but effective if a feature with a certain distribution can be differentiated easily in normal and attack packets [12]. In this method, the features with the highest scores are selected.
2) F-Score: The best feature subset includes features that have a high linear relationship with a class. Although the calculated correlation value captures strong relationships between features and labels, it fails to detect non-linear relationships [12].
3) SVMonline: Incremental SVM calculates the loss and retrains a linear SVM on every batch using stochastic gradient descent. It assigns SVM weights to each feature and selects those with the highest absolute values as the most discriminative features. Although SVMonline relies on a linear dependency between features and labels, as F-Score does, it is more robust than F-Score, since it splits the dataset into small batches and averages the model coefficients [13].
4) RFE: This method first calculates the importance of each feature in the full feature list based on a trained estimator, which can be a simple machine learning algorithm. Then, RFE recursively removes the features with the lowest importance from the subset until the desired length of the feature list is reached. RFE was tested with logistic regression as the estimator [14].
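The idea behind Chi2-based selection can be sketched with a contingency-table chi-square statistic computed per feature, followed by a top-k selection. This is an illustrative variant for categorical features; it is an assumption that it matches the behavior of the scoring function used in the experiments, and the feature names in the usage below are hypothetical.

```python
from collections import Counter


def chi2_score(feature_values, labels):
    """Chi-square statistic between one categorical feature and the class."""
    n = len(labels)
    feature_counts = Counter(feature_values)
    label_counts = Counter(labels)
    observed = Counter(zip(feature_values, labels))
    score = 0.0
    for value, value_count in feature_counts.items():
        for label, label_count in label_counts.items():
            # Expected count if the feature and the class were independent.
            expected = value_count * label_count / n
            score += (observed[(value, label)] - expected) ** 2 / expected
    return score


def select_k_best(columns, labels, k):
    """Return the names of the k features with the highest chi-square score."""
    ranked = sorted(
        columns, key=lambda name: chi2_score(columns[name], labels), reverse=True
    )
    return ranked[:k]
```

A feature that perfectly separates the classes obtains the maximum score, while a feature that is independent of the class scores zero and is dropped first.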

C. EVALUATION METRICS
The metrics applied to evaluate the HADM detection rate, namely cross-entropy loss, accuracy score, precision, recall and F1 score, are briefly explained below. We consider four classes, normal, unknown, other attacks and DoS (−1, 0, 1, 2), for SVM, k-NN, DT and LR. On the other hand, we consider three classes, normal, unknown and other attacks (0, 1, 2), for ELM and MLP.
1) Cross-entropy loss: Entry $(i, j)$ of a confusion matrix is the number of observations that are actually in group $i$ but predicted to be in group $j$. The diagonal elements represent the number of points for which the predicted label is equal to the true label, while the off-diagonal elements are those that are mislabeled by the classifier. The higher the diagonal values of the confusion matrix the better, indicating many correct predictions.
If the actual probability is $p_i$ but the predicted probability is $q_i$, each event occurs with probability $p_i$ but its surprisal is computed from $q_i$. The weighted average surprisal, in this case, is the cross-entropy loss $c$, calculated as:
$$c = -\sum_{i} p_i \log q_i \tag{3}$$
In the case of binary classification, where there are only two classes, it is called binary cross-entropy loss and the above formula becomes:
$$c = -\left[ p \log q + (1 - p) \log (1 - q) \right] \tag{4}$$
2) Accuracy score: It computes the fraction of correct predictions over $n$ samples:
$$\mathrm{accuracy}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} 1(\hat{y}_i = y_i) \tag{5}$$
In (5), $\hat{y}_i$ refers to the predicted value of the $i$-th sample, $y_i$ refers to the corresponding true value and $1(x)$ is the indicator function.
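The cross-entropy and accuracy definitions above can be sketched directly in plain Python. The natural logarithm and the small clipping constant `eps`, which guards against taking the logarithm of zero, are assumptions of this sketch.

```python
import math


def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy between a true distribution p and a predicted one q."""
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q))


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy over samples with labels in {0, 1}."""
    total = 0.0
    for y, q in zip(y_true, y_prob):
        q = min(max(q, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(q) + (1 - y) * math.log(1.0 - q))
    return total / len(y_true)


def accuracy_score(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
```

Accuracy only checks whether the predicted label is right, whereas cross-entropy also penalizes correct predictions that were made with low confidence.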
3) Precision: It is the ability of a classifier not to label a negative sample as positive. In other words, it measures how many of the selected objects were correct. Precision is calculated with:
$$P_i = \frac{TP_i}{TP_i + FP_i} \tag{6}$$
where,
• $TP_i$ or True Positive: the number of instances with the $i$-th class being the actual class that are correctly predicted to belong to the $i$-th class. This metric represents the malicious traffic that is correctly identified as an attack.
• $FP_i$ or False Positive: the number of instances with an actual class other than the $i$-th that are wrongly predicted to belong to the $i$-th class. This metric represents the safe traffic that is incorrectly identified as an attack.
4) Recall: It refers to the ability of a classifier to find all positive samples. In other words, it measures how many of the objects that should have been selected were actually selected. Recall is calculated with:
$$R_i = \frac{TP_i}{TP_i + FN_i} \tag{7}$$
where,
• $FN_i$ or False Negative: the number of instances with the $i$-th class being the actual class that are falsely predicted to belong to another class. This metric represents the malicious traffic that is incorrectly identified as safe traffic.
5) F1 score: It is the harmonic mean of the precision and recall and is calculated with:
$$F1_i = \frac{2 \, P_i \, R_i}{P_i + R_i} \tag{8}$$
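The per-class counts and the three derived metrics can be sketched as follows, using the class encoding of the paper (for example, 2 for DoS). Returning 0.0 when a denominator is zero is a convention assumed here, not something the paper specifies.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, F1) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    # Zero denominators map to 0.0 (assumed convention).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1
```

A false negative lowers recall, which corresponds to an attack that raises no alarm, while a false positive lowers precision, which corresponds to safe traffic flagged as an attack.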

IV. IMPLEMENTATION PHASES
As Fig. 5 shows, the implementation of the HADM platform is divided into two parts. This paper concentrates on part 1, which detects all attacks at a general level and DoS attacks in particular. Part 2 will be introduced in a separate paper as an innovative method to label specific types of attacks by applying a dynamic feature selection mechanism.
[...] and applied five recent datasets that have fewer of the mentioned limitations and meet real-traffic criteria to some extent.
Table 1 shows a benchmark of the mentioned datasets. To evaluate HADM efficiency, five different datasets with diverse attacks are used: ISCX-2012, UNSW-NB15 Jan, UNSW-NB15 Feb, ISCX-2017 and MAWILab-2018. The traffic in the datasets is classified into four categories: normal, DoS attack, other attacks and unknown.
The ISCX-2012 dataset exhibits realistic network behavior and contains traffic on the following protocols: HyperText Transfer Protocol (HTTP), Simple Mail Transfer Protocol (SMTP), Secure SHell (SSH), Internet Message Access Protocol (IMAP), Post Office Protocol version 3 (POP3) and File Transfer Protocol (FTP). The dataset is labeled and includes full packet payload together with diverse intrusion scenarios such as FTP and SSH password brute force, Java-based Meterpreter, Superuser and Linux Meterpreter payload.
The UNSW-NB15 (January, February) dataset contains 100 GB of raw network traffic with a class label, 49 features and nine different modern attack types. The attacks in the UNSW-NB15 dataset are categorized into Fuzzers, Analysis, Backdoor, DoS, Exploit, Generic, Reconnaissance, Shellcode and Worms. The Analysis category covers different port scan, spam and HyperText Markup Language (HTML) file penetration attacks. The Generic category covers generic cryptographic attacks that work against all block ciphers with a given block and key size, without considering the structure of the block cipher [16].
The ISCX-2017 dataset consists of 51 GB of labeled network traffic metadata, including 80 features and full packet payload. The network traffic is provided on protocols such as HTTP, HTTPS, FTP, SSH and email. The dataset includes the most common attacks according to the 2016 McAfee report, such as Web-based, Brute force, DoS, DDoS, Infiltration, Heart-bleed, Bot and Scan [17].
The MAWILab-2018 dataset is captured at a trans-Pacific internet backbone link in Japan. The traffic is captured every day for only 15 minutes, the payload contents are removed and the captured traffic is then made available in PCAP format. In addition, the captured data is labeled using several anomaly detection classifiers. The dataset contains Sasser worm, Netbios, RPC, SMB, SYN, RST, FIN, Ping flood, FTP, SSH, HTTP and HTTPS attacks. The dataset also labels network scans, port scans and DoS attacks. We analyzed the traffic of 28th August 2018 [18].

B. DATA PREPROCESSING
The algorithms need data normalization, in which nominal attributes are transformed into numeric attributes to improve the performance of the algorithms. The IP addresses and hexadecimal Medium Access Control (MAC) addresses in the applied datasets are transformed into numeric attributes. Each numeric attribute is scaled to lie between 0 and 1 by calculating the batch mean and standard deviation, unless it already has a defined range (e.g., the IP address range).
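The address transformations can be sketched as below. The helper names are hypothetical, and the sketch deliberately uses min-max scaling for attributes without a predefined range, which is a simpler stand-in for the batch mean and standard deviation approach mentioned in the text. Splitting an IPv4 address into four octets follows the saddr1 to saddr4 style of feature naming used later in the paper.

```python
def ip_to_octets(address):
    """Split a dotted IPv4 address into four values scaled by the known range 0-255."""
    return [int(part) / 255.0 for part in address.split(".")]


def mac_to_int(mac):
    """Convert a hexadecimal MAC address string to a single integer."""
    return int(mac.replace(":", "").replace("-", ""), 16)


def min_max_scale(values):
    """Scale a batch of numeric values to [0, 1] (stand-in for the paper's scaling)."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]
```

Attributes with a fixed range, such as IP octets, are divided by that range directly, so their scaling does not depend on the contents of a particular batch.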
Although we also trained and tested our model with small batches of data, where each batch has the same number of packets from each dataset, our aim was to apply the entire data in order to verify the scalability of the model. Therefore, all the analysis in this paper refers to training and testing with the entire data.
Distribution of packets in datasets is shown in Table 2; whereas distribution of packets for testing and training is shown in Table 3.
Two thirds of the data is used for training the algorithms and one third for testing. In addition, to address the class imbalance problem, down-sampling is applied to obtain an effective class distribution of the data for training and testing the algorithms. Class imbalance is a common issue with real-life traffic and degrades the performance of conventional machine learning algorithms.
For the UNSW-NB15 Jan and UNSW-NB15 Feb datasets, we found that "DoS attacks" comprise only 0.15% (3846/2472824) and "Other attacks" only 1.8% (44956/2472824) of the total UDP packets. Since mispredicting "DoS attacks" and mispredicting "Other attacks" are equally costly, the cost was kept the same and two different sampling strategies were implemented: under-sampling and over-sampling. Under-sampling randomly eliminates some data from the majority classes, whereas over-sampling adds duplicated or artificially generated samples to the minority classes [19]. For under-sampling, subsets of the normal, unknown and other attacks classes were randomly and independently selected. This reduces the number of samples in each of these classes; the subsets are then combined with the DoS attack samples to form a new dataset. For over-sampling, random samples from the DoS attacks were duplicated and added to the new dataset.
The training and testing times in the experimental results confirm that over-sampling has a longer training time and can lead to over-fitting. On the other hand, under-sampling provides better DoS attack prediction than over-sampling, as mentioned in [19]. We also tested on a smaller dataset, and our findings likewise showed that under-sampling predicts DoS attacks better than over-sampling. Therefore, we applied under-sampling to the datasets.
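Random under-sampling of this kind can be sketched as follows. The paper does not state the size of the selected subsets, so this sketch assumes that every majority class is reduced to the size of the minority (DoS) class; the function and argument names are hypothetical.

```python
import random
from collections import defaultdict


def undersample(samples, labels, minority_label, seed=0):
    """Randomly shrink every class to the size of the minority class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    # Assumed target size: the number of minority-class samples.
    target = len(by_class[minority_label])
    new_samples, new_labels = [], []
    for label, items in by_class.items():
        # Classes are sampled independently; small classes are kept whole.
        chosen = items if len(items) <= target else rng.sample(items, target)
        new_samples.extend(chosen)
        new_labels.extend([label] * len(chosen))
    return new_samples, new_labels
```

All minority samples are retained, so no DoS information is discarded, while the resulting dataset is much smaller than an over-sampled one and therefore faster to train on.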

V. EXPERIMENTAL RESULTS
All experiments were carried out on a server with 2 x Intel(R) Xeon(R) Gold 6130 CPUs @ 2.1 GHz (16 cores per processor), 125 GB RAM and a 1.6 TB HDD. The scripts were developed in Python in a Linux environment (Ubuntu 17.10) and used the scikit-learn library [20]. All algorithms applied in the evaluation of HADM are trained once and saved for future tests. Currently the platform is tested on a single workstation, but the whole functionality can be installed on several Virtual Machines (VMs) for load balancing and decentralized monitoring in order to handle large amounts of traffic in a core network. The mechanism is explained in other papers by the authors on SDN security [21].
The proposed approach is tested, and its performance evaluated, with combinations of the following algorithms and feature selection methods: 1-ELM10, ELM50, MLP10 and MLP50. In a previous paper [22], the authors compared HADM efficiency with a scenario in which the protocol analyzer was not used (the algorithms were used standalone for attack detection) and showed that the protocol analyzer scenario performed better in terms of detection. Therefore, this paper concentrates only on the scenario in which the protocol analyzer is used. Traffic carried by a vulnerable protocol is directed to k-NN/SVM or DT/LR, and the rest of the traffic to ELM/MLP. All UDP traffic is forwarded to the k-NN/SVM algorithm and all TCP traffic to the DT/LR algorithm.
Since the protocol-related features show little variation, we used 9 fixed features for learning algorithm I in the protocol analyzer module: source IP address (saddr1, saddr2, saddr3, saddr4), destination IP address (daddr1, daddr2, daddr3, daddr4) and time to live (ip.ttl). The feature selection methods described in Section III-B were applied to linear algorithms I and II, and the best combinations of feature selection methods and algorithms were selected based on the achieved efficiency. The original dataset contains 33 features, as shown in Table 4.
Out of the 33 features, each feature selection method yields 10 selected features. As shown in Fig. 6-12, the proposed approach selects 10 of the 33 features for each algorithm, which corresponds to a dimensionality reduction of about 69.7%. This reduction will be beneficial in situations with a greater number of features, such as when payload features are extracted.
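The reduction figure follows directly from the feature counts, as the short check below illustrates; the function name is a hypothetical helper.

```python
def dimensionality_reduction(total_features, selected_features):
    """Percentage of features removed by feature selection."""
    return 100.0 * (total_features - selected_features) / total_features


# 23 of the 33 features are dropped when 10 are kept.
reduction = dimensionality_reduction(33, 10)
```

Here `reduction` evaluates to 100 * 23 / 33, which is approximately 69.7.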
The results presented in Table 6 and Fig. 13 for testing time, total accuracy score, binary cross-entropy loss and false negative score show that the MLP algorithm outperforms the ELM algorithm. The binary cross-entropy loss and the false negative score of ELM are quite high, which means that in most cases it fails to raise an alarm when an intrusion occurs. When MLP with 10 and with 50 hidden layer neurons are compared, MLP with 10 hidden layer neurons performs slightly better than the other algorithms in differentiating normal and attack traffic, and its testing time is also shorter. The overall architecture has a high time and model complexity, with many pre-processing (protocol analyzer) and post-processing (learning algorithm II) steps. We have therefore chosen MLP with 10 hidden layer neurons for this module to reduce both the overall processing time for labeling an incoming network packet and the model complexity.
Table 6 shows that every learning algorithm tested on the UNSW-NB15 dataset gives exactly the same performance. A detailed check showed that differentiating attack traffic from normal traffic in this dataset is quite simple, so even ELM10 can reach the highest possible performance. However, the cross-entropy loss of ELM is still higher than that of the MLP algorithms. This can be explained by the confidence values of ELM and MLP. We observed that, even when the accuracy of ELM and MLP is the same, the confidence (the probability of the predicted class) of ELM is generally lower than that of MLP. For example, a packet can be correctly identified as an attack by both MLP and ELM, but ELM gives a probability of 70% whereas MLP gives a probability of 90%. Therefore, the cross-entropy loss of ELM can be higher than that of MLP even if their accuracy is the same. As discussed, attacks in the UNSW-NB15 dataset have a distinct source and destination IP address that clearly differentiates them from the normal and unknown classes. Since the fixed features consist mainly of the source and destination IP addresses, this leads to an FN score of 0. UNSW-NB15 was generated in a laboratory environment, which explains why all attacks share a single IP address. To tackle this issue, we used 4 feature selection methods (Chi2, F-Score, RFE and SVMonline) to extract the 10 best features from the UNSW-NB15 datasets and then applied the ELM and MLP algorithms. However, the feature selection methods still selected mainly the source and destination IP addresses as the best features; as a result, attacks were again not misclassified and the FN score was again 0, which confirms that the UNSW-NB15 dataset is not diverse from the feature perspective. The performance evaluation of each algorithm can be seen in Fig. 13.
As shown in Tables 7 and 8, for UDP DoS detection the SVM algorithm is selected because of its lower computation time, even though the detection rate of k-NN is slightly higher in some results. For other attacks, the best performance is achieved with the DT algorithm. The results indicate that HADM showed neither a large increase in computation time nor a considerable decrease in detection metrics when datasets of different sizes and with diverse attacks were used. This means that the proposed model is scalable and robust. Fig. 13-17 show the performance evaluation of each algorithm for each dataset, using the best feature selection methods. The selected points in Fig. 14-17 (linear algorithms I and II) are chosen based on the best recall achieved, since the aim is to detect the majority of attack packets.
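The earlier observation that ELM and MLP can have identical accuracy but different cross-entropy loss can be reproduced with a small sketch. The probabilities below are illustrative values chosen to match the 70% versus 90% example in the text; they are not outputs of the actual models.

```python
import math


def binary_cross_entropy(y_true, p_attack):
    """Mean binary cross-entropy; p_attack is the predicted attack probability."""
    eps = 1e-12  # keeps log() finite for probabilities of exactly 0 or 1
    total = 0.0
    for y, p in zip(y_true, p_attack):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def accuracy(y_true, p_attack, threshold=0.5):
    """Share of packets whose thresholded prediction matches the label."""
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, p_attack))
    return hits / len(y_true)


labels = [1, 1, 0, 0]                  # 1 = attack, 0 = normal
confident = [0.9, 0.9, 0.1, 0.1]       # MLP-like: 90% confidence in the right class
hesitant = [0.7, 0.7, 0.3, 0.3]        # ELM-like: 70% confidence in the right class

loss_confident = binary_cross_entropy(labels, confident)
loss_hesitant = binary_cross_entropy(labels, hesitant)
```

Both sets of predictions classify every packet correctly, yet the less confident one incurs the larger loss (about 0.357 versus about 0.105 in natural-log units).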

VI. CONCLUSION AND FUTURE WORK
In the previous paper [22], the Hybrid Anomaly Detection Model (HADM) was proposed as an intelligent platform, and its efficiency in detecting network traffic intrusions was evaluated against available methods and algorithms. The proposed model comprises two main parts, each of which independently increases the efficiency of attack detection based on metrics such as precision, recall, accuracy and computation time. Overall, the proposed model uses the protocol analyzer and a combination of learning and linear algorithms for network traffic filtering, reducing the processing time and increasing the detection rate.
In this paper, various feature selection methods have been applied together with several algorithms to achieve the highest efficiency. Although finding reliable and publicly available datasets has been a challenge, the model has been tested with various datasets to measure its robustness and scalability beyond the previous study. For this purpose, 16 datasets that have been publicly available since 1998 are introduced and compared. The majority of these datasets are small, lack attack or traffic diversity, and are usually anonymized. Therefore, the five recent datasets that are less affected by these limitations and meet the real traffic criteria were selected for testing. From the experimental results it can be concluded that the Support Vector Machine (SVM) algorithm together with SVMonline feature selection improves User Datagram Protocol (UDP) Denial of Service (DoS) detection accuracy while reducing computation time. Similarly, the Decision Tree (DT) algorithm with the SVMonline feature selection method gives higher efficiency for other attacks. The results indicate that HADM showed neither a large increase in computation time nor a considerable decrease in detection metrics when datasets of different sizes and with diverse attacks were used. This shows that the proposed model is scalable and robust.
Future work, which is ongoing, will concentrate on the second part of the model. The study will apply a dynamic feature weight selection method together with a deep learning algorithm to dynamically label and cluster unknown and known attacks.

For certain suspected attacks, such as Denial of Service (DoS) attacks carried over User Datagram Protocol (UDP), the protocol analyzer forwards the filtered packets to the feature extraction module and linear algorithm I. For other suspected attacks carried over Transmission Control Protocol (TCP), the protocol analyzer forwards the filtered packets to the feature extraction module and linear algorithm II. Linear algorithm II first determines whether the packets are safe or unsafe regardless of the suspected attack type, then extracts the features of the suspected attack and provides them to learning algorithm II. Learning algorithm II compares the extracted features against known attack features and classifies the suspected attack as either known or unknown. Unknown attacks are labeled. The information about the attack is then shared with the validator and database component.

FIGURE 3. Multi-layer perceptron with input layer, output layer and hidden layers.

FIGURE 4. Structure of an extreme learning machine.

VOLUME 7, 2019

output weight vector connecting the $j$-th hidden node to the nodes in the output layer, and $\varphi : \mathbb{R} \to \mathbb{R}$ is the activation function. The equation can be represented in matrix form as $H\beta = Y$, where $H$ is an $N \times M$ matrix, $N$ denotes the number of training samples and $M$ denotes the number of nodes in the hidden layer. ELM uses the Least-Squares Method (LSM) and tries to minimize the squared Euclidean norm of the error matrix, $\|H\beta - Y\|^2$, in order to compute the $\beta$ matrix. The matrix $\beta$ can be calculated easily if $H$ is square and non-singular: $\beta = H^{-1}Y$.
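The least-squares step can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the activation is taken to be tanh, the hidden weights are drawn from a standard normal distribution, the toy data are random, and the function names are hypothetical. Because $H$ is generally not square, the sketch uses the Moore-Penrose pseudoinverse (`numpy.linalg.pinv`), which gives a least-squares solution and coincides with $H^{-1}Y$ when $H$ is square and non-singular.

```python
import numpy as np


def hidden_output(X, W, b):
    """Hidden-layer output matrix H (N samples x M hidden nodes)."""
    return np.tanh(X @ W + b)  # tanh is an assumed choice of activation


def train_elm(X, Y, n_hidden, seed=0):
    """Draw random hidden weights, then solve for the output weights beta."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))  # random input weights, never trained
    b = rng.normal(size=n_hidden)                # random hidden biases, never trained
    H = hidden_output(X, W, b)
    beta = np.linalg.pinv(H) @ Y                 # minimizes ||H beta - Y||^2
    return W, b, beta


# Toy data (assumed): 50 samples, 4 features, one binary target column.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
Y = (X[:, [0]] > 0).astype(float)

W, b, beta = train_elm(X, Y, n_hidden=10)
H = hidden_output(X, W, b)
squared_error = np.linalg.norm(H @ beta - Y) ** 2
```

Only `beta` is fitted; the hidden weights stay fixed after the random draw, so training reduces to a single linear solve.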

FIGURE 7. Selected features in UNSW-NB15 Jan for k-NN with SVMonline.

FIGURE 8. Selected features in UNSW-NB15 Jan for SVM with F-Score.

FIGURE 13. Performance evaluation of learning algorithm I.

TABLE 1. Comparison between different publicly available datasets.
A. DATASET
Reliable and publicly available datasets are important for IDSs. Here we briefly compare 16 datasets that have been publicly available since 1998. The majority of these datasets are small, lack attack traffic diversity and are usually anonymized. Therefore, we selected

TABLE 2. Distribution of packets in datasets.

TABLE 3. Distribution of training and testing packets in datasets.

TABLE 4. Features in datasets.

TABLE 5. Selected features for each algorithm.

TABLE 6. Learning algorithm I performance evaluation.

TABLE 7. Linear algorithm I performance evaluation.

TABLE 8. Linear algorithm II performance evaluation.

Table 5, all algorithms (SVM, k-NN, DT and LR) are tested with four feature selection methods and then, based on the best performance achieved, one feature selection method is selected for each

Table 7 and Table 8 also show the performance evaluation of the HADM model based on five metrics: FN score, precision, recall, F1 score and testing time. The precision