Introduction
Intrusion detection systems (IDSs) have been extensively recognized as a prominent technique for discovering and denying malevolent activities in a network [1]. As the number of malicious attacks is ceaselessly increasing, IDSs are obliged to cope with the pruning of such attacks before they cause widespread destruction. Moreover, the present-day escalation of Internet of Things (IoT) devices and services has remarkably transformed our daily life. A large number of applications based on advanced IoT technology have been successfully built and implemented, such as smart cities, smart health care, smart homes, and vehicular networks [2]. These systems represent a further opportunity for attackers. According to [3], security is a primary barrier to the implementation of IoT networks and services. This is because IoT uses diverse standards and protocols, forming heterogeneous networks.
As the widespread development of IoT devices amplifies, insecure information processing is likely to put IoT networks at risk. The risk of compromising information disclosure in public spaces is particularly high with the broad development of IoT applications. Security architecture in IoT can be divided into three layers, i.e., the perception layer, the transportation layer, and the application layer [4], [5]. The transportation layer includes network access security, which has an obligation to detect and prevent attacks. An IDS is a security mechanism that can be deployed in the transportation layer. It copes with security threats, e.g., DoS/DDoS attacks, wireless LAN attacks, or man-in-the-middle attacks, which might harm the transportation security of IoT.
There are two types of IDSs, i.e., signature- and anomaly-based detection IDSs. Signature-based detection deals with sniffing known attacks instantly, with a lower false positive rate. Given its nature of dealing with known attack patterns, this technique is less powerful when discovering new types of attack [6]. Anomaly-based detection, unlike signature-based detection, is able to discover novel attacks by scanning and verifying the network patterns that are significantly different from the normal network operating patterns. As such, it typically incurs a higher false positive rate. Moreover, in many cases, attackers may employ anomaly profiles disguised as normal profiles to train classification algorithms. As a result, an IDS would misclassify anomalous patterns as normal ones. In the last decade, anomaly-based detection has gained much interest in IDS research because of the rapid emergence of novel attack patterns [7], [8]. Considering the ability of network anomaly detection to discover new attack patterns, even a small detection improvement, e.g., a slightly reduced false alarm rate or a higher detection accuracy, would be extremely meaningful to prevent enterprises from incurring huge profit losses due to system performance failure and service unavailability resulting from successful attacks.
An efficient anomaly-based detection system can be built using machine learning techniques. It involves solving a binary classification problem, by training a classifier to learn whether normal or anomalous usage patterns exist in the network [9]. A classification model is built using intrusion datasets, e.g., NSL-KDD [10] or UNSW-NB15 [11], which are publicly available for benchmarking classifiers in IDS research. Various machine learning algorithms, including ensemble learning [12] and fuzzy classifiers with evolutionary algorithms [13], have been proposed to improve the performance of anomaly-based IDSs. More recently, deep learning [14] has also been considered, due to its prowess at uncovering complex structures in high-dimensional data.
Existing solutions for anomaly-based IDSs have harnessed different types of classifiers, either as individual classifiers or as ensemble (meta) classifiers. When a single classification algorithm is unable to provide acceptable results, multiple classifier systems (MCSs), or classifier ensembles, can be taken into account to offer a significant enhancement over individual classifiers. MCSs train multiple classifiers to find a solution for the same problem [15]. In contrast to classical approaches, which build a classifier model using a single learner on the training set, MCSs build a set of classifiers and blend them to predict the final output.
In the past two decades, the combination of multiple classifiers has contributed to advance research in machine learning and pattern classification. Meta classifiers have been proposed in diverse real-life application domains, such as remote sensing, information security, fraud detection, health care, and recommender systems [16]. In such applications, MCSs show a considerable performance improvement over single classifiers. However, there remain open problems in meta classifier design, such as ensuring the diversity of classifiers and choosing appropriate techniques for combining the outputs of multiple classifiers into a single one [16].
Most IDS research has focused on the utilization of long-established classification approaches, either using individual classifiers, such as naive Bayes [10], [17], decision tree [10], [18]–[20], support vector machines [10], [21], [22], and naive Bayes tree [10]; or meta classifiers, i.e., bagging [23]–[25], boosting [25], [26], voting [27]–[29], random forest [10], [30], and other ensemble approaches [31]–[34]. In this paper, we propose a two-stage meta classifier for anomaly-based IDS, which utilizes two different ensembles, i.e., rotation forest [35] and bagging [36]. We demonstrate that the use of a two-stage meta classifier, combined with hybrid feature selection, can considerably improve the accuracy of anomaly-based IDS. The rationale behind choosing to design an anomaly-based IDS using a two-stage classifier ensemble is that similar architectures have shown remarkable accuracy in other domains, such as [37], [38]. However, in the cyber-security field, this type of design has not yet been considered.
Our contributions to the cyber-security domain are the following: (i) we propose an anomaly-based IDS based on a two-stage meta classifier, rather than a single-stage ensemble learner. The two-stage ensemble is composed of a meta classifier in the first stage whose base classifier is another meta classifier; (ii) we adopt a hybrid feature selection method to obtain a precise and accurate feature representation for the IDS problem, taking into account the fact that not all features are significant, or even relevant, in detecting intrusions; (iii) we conduct an extensive experimental evaluation of the proposed method to show that it produces a significant improvement of the detection rate on two different intrusion datasets when compared to several state-of-the-art techniques; finally, (iv) we present a two-fold statistical test to demonstrate that the performance improvement shown by the proposed algorithm with respect to state-of-the-art techniques is significant.
The remainder of the paper is organized as follows. Section II explores the existing solutions in anomaly-based IDS. A brief overview of anomaly-based IDS framework is given in Section III. This is followed by Section IV, discussing the experimental results. Finally, conclusions are drawn in Section V.
Related Work
The issue of designing anomaly-based IDSs has been extensively researched in the literature. In this paper, we limit the review to approaches that have considered the NSL-KDD and the UNSW-NB15 datasets, i.e., the same recent datasets that we consider in this work, and that do not rely only on cross-validation or hold-out. The latter techniques are, in fact, not reliable enough in the context of IDSs, since training and testing are carried out using portions of the same dataset. This might lead to biased results; in some cases, accuracy can reach 99.9%. In this paper we instead use distinct testing sets, i.e., KDDTest+, KDDTest-21, and UNSW-NB15$_{test}$.
We first discuss the existing solutions considering the NSL-KDD dataset [10], i.e., an updated version of the KDD Cup 99 dataset. The work in [10] has benchmarked several individual classifiers in terms of their performance on the two test datasets, i.e., KDDTest+ and KDDTest-21. The naive Bayes (NB) tree was the best performing algorithm. A fuzzy-based classification algorithm for IDS is described in [13]. A full-feature training set (KDDTrain+) and a separate test set (KDDTest+) are used in the experiment. The fuzzy classifier improves the detection performance with respect to two performance metrics, i.e., accuracy and detection rate.
Rather than using a full feature set, Mohammadi et al. [18] propose a feature selection technique called Reduced Class-Dependent Feature Transformation (RCDFT). To evaluate the chosen feature set, several classification algorithms are used, i.e., decision tree (DT), multilayer perceptron (MLP), and a distance-based classifier. DT performs better than MLP and the distance-based classifier on the KDDTest+ dataset. In addition, the paper also evaluates other feature selection techniques, such as linear discriminant analysis (LDA), principal component analysis (PCA), and modified class-dependent feature transformation (MCDFT). Even though the false alarm rate is lowered significantly, some classifiers still perform unfavorably in terms of accuracy and detection rate.
A two-layer dimension reduction and two-tier classification (TDTC) model for IDS is presented in [48]. A dimensionality reduction module is used to reduce the high-dimensional dataset to a lower-dimensional one, with a smaller number of features. In addition, a two-tier classification module consisting of NB and a certainty-factor version of k-nearest neighbor (k-NN) is used to detect intrusions.
A two-tier classifier along with LDA feature selection for IDS is proposed in [49]. The proposed classifier consists of two individual algorithms, i.e., NB and a certainty-factor voting version of k-NN.
A combination of hybrid feature selection and a tree-based classifier ensemble for anomaly-based IDS is introduced in [6]. The proposed detection approach achieves 99.77% accuracy using a small feature set of the NSL-KDD dataset, outperforming other similar techniques. The model is validated using 10-fold cross validation.
More recently, a new classifier that applies a ramp loss function to the original one-class support vector machine for anomaly detection has been developed in [50].
An anomaly detection technique based on a deep learning model for Internet Industrial Control Systems (IICSs) is developed in [52]. The proposed detection model comprises a consecutive training process performed using a deep auto-encoder and a deep feed-forward neural network architecture. The model is evaluated using the NSL-KDD and UNSW-NB15 datasets. However, as the validation is conducted by simply dividing the dataset into a training and a testing set, the very high performance achieved may be due to overfitting. Finally, in [21], a new feature selection technique for anomaly-based IDS, called Modified Binary Grey Wolf Optimization (MBGWO), is designed. By considering a reduced feature set, the accuracy of SVM tested on KDDTest+ is improved when compared to other similar techniques, such as the grey wolf optimizer (GWO), binary GWO, and MGWO. Nevertheless, the detection accuracy still does not match that of previous works.
Framework Design
In this section, we first give an overview of the proposed framework at a conceptual level. Then, we discuss in detail the feature selection and classifier modeling that we have adopted.
A. Conceptual Framework of IDS
A conceptual model of the proposed framework is given in Figure 1. The framework comprises three tiers, i.e. feature selection, classifier modeling, and validation. The first tier refers to the process of carefully choosing a feature set as the most appropriate for the anomaly detection task at hand. This is done using a hybrid technique relying on three evolutionary search techniques, i.e. particle swarm optimization (PSO), ant colony optimization (ACO), and genetic algorithms (GA). The feature selection method is described in depth in Section III-B.
In the second tier, a two-stage meta classifier for classification is designed. This tier is responsible for building the classification model through the combination of two meta classifiers, i.e., rotation forest (RF) and bagging (BG). Since BG requires a weak base classifier, a conjunctive rule (CR) [53] classifier is chosen. In addition, other meta combinations and a single classifier are taken into consideration, i.e., bagging of CR (BG-CR), rotation forest of CR (RF-CR), and CR alone. These classifiers are further used as the basis of our classification analysis using the statistical significance tests provided in Section IV-C. The two-stage meta classifier is described in depth in Section III-C.
Lastly, in the third tier, the proposed two-stage meta classifier is evaluated. This validation is performed using 10-fold cross validation (10-fold cv), as well as separate, pre-specified test sets.
B. Tier 1: Hybrid Feature Selection
A feature selection technique can be seen as a procedure for selecting a precise, compact and accurate subset of features from a given feature set. In this work, we choose a correlation-based feature selection, which estimates the importance of features using entropy and information gain [55]. In particular, irrelevant, noisy, and redundant features have to be excluded from the dataset in this tier.
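As an illustration of the entropy-based scoring underlying correlation-based feature selection, the sketch below computes the information gain of a single discrete feature with respect to the class label. This is a minimal pure-Python illustration of the idea, not the exact merit function of the correlation-based selector used in the framework.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(Y) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(Y; X) = H(Y) - H(Y | X) for a discrete feature X."""
    n = len(labels)
    # Group labels by feature value, then average the conditional entropies.
    groups = {}
    for x, y in zip(feature_values, labels):
        groups.setdefault(x, []).append(y)
    cond = sum((len(g) / n) * entropy(g) for g in groups.values())
    return entropy(labels) - cond

# A feature that perfectly separates the classes has maximal information gain.
x = [0, 0, 1, 1]
y = ["normal", "normal", "anomaly", "anomaly"]
print(information_gain(x, y))  # 1.0
```

A feature whose values are independent of the label yields an information gain near zero, which is how irrelevant features can be filtered out.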
We consider an evolutionary approach to feature selection, using three distinct evolutionary search techniques: PSO [56], GA [57], and ACO [58]:
Particle swarm optimization. In this technique, a feature set is represented by a particle in a swarm. Several particles are placed in a hyperspace, in which each particle possesses a random location $x_i$ and velocity $v_i$. Let $\omega$ be the inertia weight constant, $c_1$ and $c_2$ the cognitive and social learning constants, respectively, $n_1$ and $n_2$ random numbers, $p_i$ the personal best location of particle $i$, and $g$ the global best location among the particles. Then, the fundamental rules for updating the position and speed of each particle are:
\begin{align*} x_{i}(t+1)&=x_{i}(t)+v_{i}(t+1)\tag{1}\\ v_{i}(t+1)&=\omega v_{i}(t)+c_{1}n_{1}(p_{i}-x_{i}(t))+c_{2}n_{2}(g-x_{i}(t))\tag{2}\end{align*}
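Equations (1) and (2) can be sketched as a single update step in pure Python. This is the continuous form of PSO; for feature selection, a binary variant is typically used in which each position component is thresholded into a 0/1 feature mask. The parameter values below are illustrative, not the settings used in the experiments.

```python
import random

def pso_step(x, v, p_best, g_best, omega=0.7, c1=1.5, c2=1.5):
    """One PSO iteration: update the velocity (Eq. 2), then the position
    (Eq. 1). All arguments are lists with one entry per search dimension."""
    n1, n2 = random.random(), random.random()  # fresh random numbers each step
    new_v = [omega * vi + c1 * n1 * (pi - xi) + c2 * n2 * (gi - xi)
             for xi, vi, pi, gi in zip(x, v, p_best, g_best)]
    new_x = [xi + vi for xi, vi in zip(x, new_v)]
    return new_x, new_v
```

Note that when a particle sits exactly at both its personal and the global best, the attraction terms vanish and only the inertia term $\omega v_i(t)$ remains.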
Genetic algorithm. In this technique, a set of features is represented by a chromosome. The presence of a particular feature in a feature set is encoded as a binary value, either 1 (present) or 0 (missing). In addition, the Goldberg method is frequently used to obtain the best feature set, while a $k$-fold cv is used by the subset evaluator to examine the input features. In the experiment, it is necessary to specify parameters such as the initial population, the mutation and crossover probabilities, and $k$.
Ant colony optimization. In this technique, adopting a graph representation, features are denoted by nodes and the selection of the best possible next feature is denoted by edges. The final feature subset is obtained through an ant search in the graph. The search stopping criterion is set to check a minimum number of visited nodes [59]. In addition, in order to evaluate which features are more informative among the currently chosen features, a probabilistic transition rule is utilized. Let $k$ be the number of ants, $J_{i}^{k}$ the set of ant $k$'s unvisited features, $\eta_{ij}$ the heuristic merit of picking feature $j$ when presently at feature $i$, and $\tau_{ij}(t)$ the amount of virtual pheromone on edge $(i,j)$; then the likelihood that an ant at feature $i$ travels to feature $j$ at time $t$ is:
\begin{equation*} p_{ij}^{k}(t) = \frac {[\tau _{ij}(t)]^{\alpha }\cdot [\eta _{ij}]^{\beta }}{\sum _{l\in J_{i}^{k}}[\tau _{il}(t)]^{\alpha }\cdot [\eta _{il}]^{\beta }}\tag{3}\end{equation*}
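The transition rule of Eq. (3) can be sketched directly. The sketch below computes the probability distribution over the unvisited features; the pheromone and heuristic values in the toy example are hypothetical.

```python
def transition_probabilities(i, unvisited, tau, eta, alpha=1.0, beta=1.0):
    """Eq. (3): probability that an ant at feature i moves to each unvisited
    feature j, given pheromone levels tau and heuristic merits eta,
    both supplied as dicts keyed by edge (i, j)."""
    weights = {j: (tau[(i, j)] ** alpha) * (eta[(i, j)] ** beta) for j in unvisited}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}

# Toy example: feature 2 carries three times the pheromone of feature 1,
# with equal heuristic merit, so it is three times as likely to be chosen.
tau = {(0, 1): 1.0, (0, 2): 3.0}
eta = {(0, 1): 1.0, (0, 2): 1.0}
print(transition_probabilities(0, [1, 2], tau, eta))  # {1: 0.25, 2: 0.75}
```

The exponents $\alpha$ and $\beta$ trade off pheromone against heuristic information; the probabilities always sum to 1 over the unvisited features.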
Several experiments have been carried out by tuning the particle size, the number of ants, and the population size for PSO, ACO, and GA, respectively. A feature set is then selected by considering the maximum classification accuracy of a REPT classifier [60]. REPT is chosen due to its simplicity and speed in generating decision trees. It reduces the size of decision trees by pruning segments of the tree that contribute only marginally to sample classification. The classification accuracy of REPT is evaluated using a subsampling (Monte-Carlo cross validation) technique. Subsampling is very similar to classical bootstrap [61], except that it draws each training set randomly without replacement and uses the remaining samples for testing; the procedure is repeated several times and the results are averaged.
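The subsampling procedure used to score the REPT classifier can be sketched as follows. The training fraction, number of repeats, and seed below are illustrative, not the settings used in the experiments.

```python
import random

def subsample_splits(n_samples, train_fraction=0.66, n_repeats=10, seed=42):
    """Monte-Carlo cross-validation (subsampling): repeatedly draw a random
    training set WITHOUT replacement and test on the held-out remainder.
    Classical bootstrap, by contrast, samples with replacement."""
    rng = random.Random(seed)
    n_train = int(n_samples * train_fraction)
    splits = []
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        splits.append((idx[:n_train], idx[n_train:]))  # (train, test) indices
    return splits
```

A classifier would be trained on each train index set and scored on the corresponding test set, with the accuracies averaged across the repeats.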
C. Tier 2: A Two-Stage Meta Classifier
A meta classifier trains multiple individual classifiers, either in a parallel or a serial manner. In order to construct a two-stage meta classifier, we employ two established ensemble methods, rotation forest and bagging, which work as follows:
Rotation forest. The goal of this meta classifier is to produce accurate and diverse classification algorithms. To create feature subset projections, rotation forest uses principal component analysis (PCA). A number of independent feature subsets are trained using the same classification algorithm, and a full feature set for each classifier is then assembled to form the ensemble [35]. Let $F$ and $L$ be the feature set and the number of subsets, respectively. Rotation forest randomly splits $F$ into $L$ subsets. PCA is then applied independently on each subset, and the new extracted features are collected by pooling all principal components. A dataset $D$ is transformed into the new feature space, from which a classifier $C_i$ is able to create a model. The independent splits of the feature set yield the diversity of the extracted features.
Bagging. In this meta classifier, several base individual classifiers are trained independently in parallel [36]. Let $D$ be the original training set, with sample size $n$. A number of $M$ bootstrap samples $D_1, D_2, \ldots, D_M$ are randomly created from $D$. Next, an individual classifier $C_i$ is trained on each bootstrap sample $D_i$. Finally, majority voting is used to predict the output for new test instances: for the final prediction $C^{\ast}$ on a test instance, bagging feeds the instance to its individual classifiers $C_1, C_2, \ldots, C_M$, collects all of the outputs, counts the votes for each label, and decides the winning label.
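The bagging procedure above can be sketched in a few lines of pure Python. The `ThresholdStump` base learner below is a hypothetical toy stand-in for a rule-based weak classifier; it is not the conjunctive rule algorithm used in the framework.

```python
import random
from collections import Counter

class ThresholdStump:
    """Toy weak learner: thresholds the first feature at the midpoint
    between the two class means."""
    def fit(self, X, y):
        pos = [x[0] for x, lab in zip(X, y) if lab == 1]
        neg = [x[0] for x, lab in zip(X, y) if lab == 0]
        if pos and neg:
            self.t = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        else:  # degenerate one-class bootstrap: always predict that class
            self.t = float("-inf") if pos else float("inf")

    def predict(self, x):
        return 1 if x[0] > self.t else 0

class Bagging:
    """Minimal bagging: train base learners on bootstrap samples of the
    training set, then predict new instances by majority vote."""
    def __init__(self, base_factory, n_estimators=10, seed=0):
        self.base_factory = base_factory
        self.n_estimators = n_estimators
        self.rng = random.Random(seed)

    def fit(self, X, y):
        n = len(X)
        self.models = []
        for _ in range(self.n_estimators):
            idx = [self.rng.randrange(n) for _ in range(n)]  # with replacement
            model = self.base_factory()
            model.fit([X[i] for i in idx], [y[i] for i in idx])
            self.models.append(model)
        return self

    def predict(self, x):
        votes = Counter(m.predict(x) for m in self.models)
        return votes.most_common(1)[0][0]

# Usage on a toy 1-D dataset (labels: 0 = normal, 1 = anomaly).
X = [[0.1], [0.15], [0.2], [0.85], [0.9], [1.0]]
y = [0, 0, 0, 1, 1, 1]
bag = Bagging(ThresholdStump, n_estimators=5, seed=0).fit(X, y)
```

Because each bootstrap sample differs, the trained stumps disagree slightly, and the majority vote smooths out individual errors.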
In this work, a new procedure for creating a two-stage meta classifier for an anomaly-based IDS is proposed. The proposed approach, unlike typical meta classifiers, which are frequently built from simple weak classifiers, considers two meta classifiers. Roughly speaking, it is a two-stage classification algorithm, where a meta classifier acts as a base model of another meta classifier. As illustrated in Figure 2, BG is chosen as a base model of RF, where another weak classifier, namely conjunctive rule (CR) [53] is chosen as a base model of BG. CR is prevalently recognized as an inductive learner, where the objective of rule induction is to generate a set of rules from the data [62].
In practice, several combinations of meta learners could be considered, since there exist a large number of meta learners in the literature. However, the combination that we propose is expected to maximize diversity, since RF and BG have different induction strategies, working on the features (vertical induction) and on the samples (horizontal induction) of a training set, respectively. In the first stage, RF creates $L$ feature subsets and projects each of them with PCA; in the second stage, BG draws bootstrap samples from the rotated training set, on which the CR base classifiers are trained.
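The nesting structure of the two-stage design can be sketched generically. For brevity, both stages below use bootstrap resampling, whereas the proposed design uses rotation forest (PCA-rotated feature subsets) in the first stage; `MajorityRule` is a hypothetical trivial weak learner standing in for the far more expressive conjunctive rule classifier.

```python
import random
from collections import Counter

def bootstrap(X, y, rng):
    """Draw a bootstrap sample (with replacement) of the training set."""
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]

class MetaClassifier:
    """Generic voting ensemble. Because `base_factory` may itself build
    another MetaClassifier, nesting one ensemble inside another yields the
    two-stage architecture."""
    def __init__(self, base_factory, n_estimators=5, seed=0):
        self.base_factory = base_factory
        self.n_estimators = n_estimators
        self.rng = random.Random(seed)

    def fit(self, X, y):
        self.models = []
        for _ in range(self.n_estimators):
            Xb, yb = bootstrap(X, y, self.rng)
            self.models.append(self.base_factory().fit(Xb, yb))
        return self

    def predict(self, x):
        return Counter(m.predict(x) for m in self.models).most_common(1)[0][0]

class MajorityRule:
    """Trivial weak learner: always predicts its majority training label."""
    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, x):
        return self.label

# Two-stage ensemble: the base model of the outer ensemble is another ensemble.
two_stage = MetaClassifier(lambda: MetaClassifier(MajorityRule, n_estimators=3))
```

The key design point is that the outer ensemble treats the inner ensemble as an opaque base classifier, exactly as RF treats BG in the proposed framework.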
Let $h_i^j(x)$ denote the output of the $i$-th base classifier for class label $c_j$ on an instance $x$, with $T$ base classifiers and $l$ class labels. The final prediction $H(x)$ is obtained by majority voting with rejection:
\begin{equation*} H(x)= \begin{cases} c_{j} &\quad \text{if } \displaystyle \sum \limits _{i=1}^{T} h_{i}^{j}(x)>\displaystyle \frac {1}{2}\sum \limits _{k=1}^{l}\sum \limits _{i=1}^{T} h_{i}^{k}(x)\\ \text {rejection}&\quad \text {otherwise} \end{cases}\tag{4}\end{equation*}
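Eq. (4) amounts to requiring an absolute majority: a label is output only if it receives more than half of all votes cast, otherwise the instance is rejected. A minimal sketch:

```python
def majority_vote_with_rejection(votes, labels):
    """Eq. (4): return class c if it receives more than half of all votes
    cast by the T classifiers, otherwise reject.
    `votes` is a list of predicted labels, one per classifier."""
    total = len(votes)
    for c in labels:
        if votes.count(c) > total / 2:
            return c
    return "rejection"

print(majority_vote_with_rejection([1, 1, 1, 0], labels=[0, 1]))        # 1
print(majority_vote_with_rejection([0, 1, 2, 2], labels=[0, 1, 2]))     # rejection
```

In the second call, label 2 has a plurality (2 of 4 votes) but not an absolute majority, so the instance is rejected.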
Experimental Results and Discussion
Datasets and experimental settings are discussed in Section IV-A, while Section IV-B and Section IV-C present the results of the experiments for feature selection and intrusion detection (including statistical tests), respectively.
A. Intrusion Detection Datasets
We consider the following publicly available intrusion detection datasets that are widely adopted in previous works:
NSL-KDD [10]. It is an improved version of the KDD Cup 99 dataset, which does not have redundant samples, thus preventing classifiers from producing biased results. It comprises 42 features and a class label attribute. We consider a 20% subset of the dataset, the so-called KDDTrain+, for model training. KDDTrain+ consists of 25,192 samples, with 13,499 anomalous and 11,743 normal samples. In addition, we take into account two separate test sets, i.e., KDDTest+ (22,544 samples) and KDDTest-21 (11,850 samples), which are provided specifically for performance benchmark analysis.
UNSW-NB15 [11]. Unlike NSL-KDD, this is an original intrusion detection dataset that has appeared more recently. The full training set (UNSW-NB15$_{train}$) is composed of 42 features, with 37,000 samples in the normal class and 45,332 samples in the anomaly class. A specialized testing set (UNSW-NB15$_{test}$) is also used in the experiment; UNSW-NB15$_{test}$ has 175,341 samples.
All experiments discussed in the remainder of this paper were run on a Linux machine with 32 GB of RAM and an Intel Xeon processor. The classifiers are implemented using an off-the-shelf machine learning toolkit.
B. Results of Feature Selection
Experiments to determine the best configuration for feature selection are carried out by varying the particle size, the number of ants, and the population size for PSO, ACO, and GA, respectively.
The results of REPT on the NSL-KDD dataset for each search technique are presented in Figure 3. PSO clearly yields the best-performing feature subset.
The results for the UNSW-NB15 dataset are visualized in Figures 5 and 6. Again, PSO yields the best-performing feature subset.
The two selected feature sets discussed above are used in the next section for evaluating the performance of the two-stage classification model in the second tier of our framework.
C. Intrusion Detection Classification Analysis
This section evaluates the performance of the proposed two-stage classifier against other classifiers, namely bagging of CR (BG-CR), rotation forest of CR (RF-CR), and CR. For each classifier, we present the average accuracy obtained using 10-fold cross-validation (10-fold cv).
In order to compare multiple classifiers, a common procedure is as follows [64]. First, omnibus tests, e.g., the Friedman rank test [65] and the Iman-Davenport test [66], are applied to determine the ranking of the classifiers and to identify whether at least one of the classifiers differs in performance from its competitors, respectively. More specifically, the goal of the Iman-Davenport test is to determine whether all the classification algorithms perform equally or, on the contrary, some of them exhibit a significant difference. Second, if such a significant difference is found, then a pair-wise test, e.g., a Friedman post-hoc test with the corresponding $p$-value correction, is applied to identify which pairs of classifiers differ significantly.
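The omnibus step of this procedure can be sketched in pure Python. This is a minimal illustration of the Friedman and Iman-Davenport statistics only; the post-hoc pairwise comparisons and the $p$-value lookup against the chi-square and F distributions are omitted, and in practice a statistics library would be used.

```python
def friedman_iman_davenport(results):
    """Omnibus statistics for comparing k classifiers over N datasets.
    results[d][a] holds the accuracy of algorithm a on dataset d (higher is
    better). Returns the mean ranks, the Friedman chi-square statistic, and
    the Iman-Davenport F statistic."""
    N, k = len(results), len(results[0])
    ranks = [[0.0] * k for _ in range(N)]
    for d, row in enumerate(results):
        order = sorted(range(k), key=lambda a: -row[a])  # rank 1 = best
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1  # tied algorithms share the average of their ranks
            avg_rank = (pos + end) / 2 + 1
            for p in range(pos, end + 1):
                ranks[d][order[p]] = avg_rank
            pos = end + 1
    # Mean rank of each algorithm across datasets.
    R = [sum(ranks[d][a] for d in range(N)) / N for a in range(k)]
    # Friedman chi-square and Iman-Davenport F (the latter is undefined
    # when chi2 reaches its maximum N*(k-1), i.e., identical rankings).
    chi2 = 12 * N / (k * (k + 1)) * (sum(r * r for r in R) - k * (k + 1) ** 2 / 4)
    f_stat = (N - 1) * chi2 / (N * (k - 1) - chi2)
    return R, chi2, f_stat
```

The resulting statistics are compared against the chi-square distribution with $k-1$ degrees of freedom and the F distribution with $(k-1, (k-1)(N-1))$ degrees of freedom, respectively, to decide whether the post-hoc step is warranted.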
The proposed classifier emerges as the clear best performer: it is associated with the lowest (i.e., best) mean rank.
To extend this benchmark, we have compared the proposed two-stage classifier with the performance achieved by previous studies that use the dataset KDDTrain+ for training and KDDTest+ and KDDTest-21 for testing. We also include the results obtained by [10], where the NSL-KDD dataset was first proposed. These results are shown in Table 3 and Table 4. Based on the experimental validation on KDDTest+, the highest detection accuracy is achieved by the proposed approach, which outperforms the most recent anomaly-based IDS techniques, i.e., SVM [21], bagging (J48) [25], and the two-tier classifier [49]. Besides having superior detection accuracy, the proposed approach also significantly outperforms other approaches in terms of sensitivity and precision. Even though our proposed classifier does not perform best in terms of the FPR metric, it is still competitive, being able to outperform GAR-Forest [34]. Moreover, according to the validation test applied on KDDTest-21, the proposed approach clearly outperforms the classifiers available in the current literature, regardless of the evaluation metric (see Table 4).
Table 5 shows the results for the UNSW-NB15 dataset.
The comparison shown in Tables 3–5 indicates that the proposed approach is highly competitive as an effective solution for the anomaly-based intrusion detection task. In addition to the performance analysis, the statistical significance tests show that the better performance of the proposed classifier is statistically significant when compared to state-of-the-art techniques. Note that statistical tests are usually not provided by the other approaches in the literature that we have considered in this paper.
Finally, Figure 7 shows the execution time of the proposed classifier. The training time is calculated as the computation time required for classification modeling. It is worth mentioning that the proposed model considerably reduces the training time when the optimal feature subset, obtained as the output of tier 1, is considered. For practical implementation, the time performance is acceptable, since the classifier has to be trained only once and can then be used as an off-line anomaly detection tool in the network.
Training time taken by the proposed model (reduced set) and original full feature set.
Conclusion
In this paper, a novel method for anomaly-based intrusion detection systems based on the combination of hybrid feature selection and a two-stage meta classifier has been proposed and discussed. Two intrusion datasets (NSL-KDD and UNSW-NB15) have been employed to evaluate the performance of the proposed approach. Based on the statistical significance tests, it can be concluded that the proposed approach outperforms state-of-the-art individual and meta classifiers, such as conjunctive rule (CR), bagging of CR (BG-CR), and rotation forest of CR (RF-CR). The proposed method yields superior results in terms of the accuracy, specificity, and precision metrics when validated against pre-specified testing sets, i.e., KDDTest+, KDDTest-21, and UNSW-NB15$_{test}$.