CICIDS-2017 Dataset Feature Analysis With Information Gain for Anomaly Detection

Feature selection (FS) is one of the important tasks of data preprocessing in data analytics. Data with a large number of features increase computational complexity, resource usage, and time consumption in data analytics. The objective of this study is to identify the relevant and significant features of large-volume network traffic in order to improve the accuracy of traffic anomaly detection and to decrease its execution time. Information Gain is the most widely used feature selection technique in Intrusion Detection System (IDS) research. This study uses Information Gain to rank the features and group them according to minimum weight values, thereby selecting relevant and significant features, and then applies the Random Forest (RF), Bayes Net (BN), Random Tree (RT), Naive Bayes (NB) and J48 classifier algorithms in experiments on the CICIDS-2017 dataset. The experimental results show that the number of relevant and significant features yielded by Information Gain significantly affects detection accuracy and execution time. Specifically, the Random Forest algorithm achieves the highest accuracy for the 22-feature subset, 99.86%, whereas the J48 classifier achieves an accuracy of 99.87% using 52 selected features, at the cost of a longer execution time.


I. INTRODUCTION
Anomaly-based intrusion detection is one of the techniques used to recognize zero-day attacks. Although various anomaly detection techniques have been developed, challenges and issues remain in the area, namely the high dimensionality of data [1], the impact on computational complexity [2], [3], and computational time [4].
One approach used by researchers to deal with the data dimensionality issue is feature selection. Feature selection eliminates features, helps in understanding data, reduces computing time, reduces ''curse of dimensionality'' effects, and improves predictive machine performance [5]. Feature selection is a part of dimensionality reduction, known as the process of selecting an optimal feature subset that represents the entire dataset [6].
The associate editor coordinating the review of this manuscript and approving it for publication was Hong-Mei Zhang.
Many research works have used feature selection techniques to improve the accuracy of anomaly detection, such as the works in [7]-[11]. Most of these works use the Network Security Laboratory-Knowledge Discovery and Data Mining (NSL-KDD) dataset, a refined version of its predecessor, the KDD Cup 99 dataset. Methods and measurements that have been shown to improve detection accuracy include Chi-Square, Information Gain, Correlation Based with Naive Bayes and Decision Table Majority Classifier [12], Support Vector Machine (SVM) [13] and Random Forest [12]. Nevertheless, those methods were not tested on a large dataset with a large number of features. As mentioned in [14], data with a large number of features can cause the learning model to overfit, which decreases performance and increases memory use and the computational cost of analytics. In fact, very few researchers consider computational time in their works, especially in anomaly detection. On the other hand, Information Gain has been widely used by researchers to analyze significant and relevant features. According to the works in [15]-[21], Information Gain is used to reduce dimensionality by selecting more relevant features through feature weight calculation. Eliminating irrelevant features may improve the performance of the detection system. Many research works apply Information Gain to datasets with a limited number of features. In this study, the CICIDS-2017 dataset with more complex features is used. The CICIDS-2017 dataset contains a high volume of traffic and a large number of features to be observed for anomaly detection.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Previous works that use the CICIDS-2017 dataset together with the Information Gain feature selection technique do not state the basis on which the score value used for feature selection is determined. Each researcher uses a different score value. In this paper, the authors investigate and analyze the ability of Information Gain to determine relevant features for network traffic classification, especially for traffic with a larger number of features. The authors distribute the features into groups based on their minimum score values. Each feature group is then used as a filter for the five classifier algorithms (Random Forest, Bayes Network, Random Tree, Naive Bayes and J48) to perform anomaly/attack detection on the dataset. The detection results are then compared in order to validate the significance and relevance of the selected feature groups. The more accurate the detection results, the more significant and relevant the feature group. Thus, the authors analyze the effect of the weighted features produced by Information Gain on anomaly/attack detection performance, and identify the most significant and relevant features for increasing that performance.
The rest of the paper is organized as follows. Section 2 presents the related research. Section 3 briefly discusses the dataset and experimental setup used in this study. Section 4 explains the experiments and findings of this study in more detail. Finally, Section 5 provides a conclusion and potential future work.

II. RELEVANT RESEARCHES
Research on feature selection has been carried out especially in network attack detection. Wang et al. [22] analyze the features of large network traffic by choosing the most significant features, using a combination of filter-based and wrapper-based algorithms. The method produces 10 significant features and achieves a detection rate of up to 99.8% with a false alarm rate of 0.34%. Ambusaidi et al. [23] propose a supervised filter-based feature selection algorithm called Flexible Mutual Information Feature Selection (FMIFS). The algorithm gives the Least-squares support-vector machine (LS-SVM) IDS better accuracy and a lower computational cost than the previous methods.
Authors in [24] propose a feature identification approach that combines filter-based and wrapper-based methods with a clustering method to provide a weight for each feature. The proposed method is able to identify features that can improve the accuracy of attack detection. Chen et al. [25] introduce a tree-seed algorithm (TSA) that is used to extract effective features. The proposed algorithm reduces the dimension of the data by eliminating redundant features, which in turn improves the accuracy of the K-Nearest Neighbor (KNN) classifier. The work in [10] discusses a Discrete Differential Evolution (DDE) technique and the C4.5 machine learning algorithm. The proposed technique produces 16 relevant features with a classification accuracy of 99.92%. Peng et al. [26] combine the Ant-Colony Optimization algorithm and feature selection in a method called FACO. The proposed work is able to produce features that improve the accuracy of the classification algorithm. Finally, researchers in [27] propose an IDS called FWP-SVM-GA, based on the genetic algorithm and SVM. The proposed algorithm increases the detection rate, accuracy, and true positive rate (TPR), and reduces the false positive rate (FPR) and SVM training time.
Having reviewed the previous works, the authors hypothesize that feature selection can improve the performance of classification algorithms by eliminating non-useful and redundant features. Even a small number of selected features may increase the detection accuracy. Up to now, researchers have mainly used the KDD Cup 99 dataset, which has only 41 features, as test data. The use of a large dataset is still rare. Therefore, the reliability of the proposed methods has not been tested on a dataset of larger dimension (with more features and more records). Table 1 summarizes feature selection research works in the intrusion detection field over the last five (5) years.
Yulianto et al. [56] combine the Synthetic Minority Oversampling Technique (SMOTE), Principal Component Analysis (PCA), and Ensemble Feature Selection (EFS) to improve the performance of AdaBoost-based IDS on the CICIDS-2017 Dataset. The authors claim that the combined method outperforms the SVM-based method with regards to accuracy, precision, recall and F1 Score.
On the other hand, although many researchers use Information Gain as a feature selection technique, there is very limited discussion on how to determine the minimum weight or rank score from the Information Gain result. This score indicates how relevant a feature is to the class label. Researchers in [18] and in [21] use features with a score above 0.4 and a score above 0.001, respectively. Meanwhile, the work in [28] considers a minimum weight score of 0.8. In contrast, researchers in [29] remove features one by one and apply the classifier algorithm to find the best accuracy. Such an approach is very time-consuming, especially with a large number of features in the dataset.

III. METHODOLOGY
This section describes the dataset, experimental configuration, feature selection technique, classification algorithms, and experimental tools.

A. DATASET
This study uses the MachineLearningCSV data, which is part of the CICIDS-2017 dataset from the ISCX Consortium. MachineLearningCSV consists of eight (8) traffic monitoring sessions, each in the form of a comma-separated values (CSV) file. These files contain normal traffic, labeled ''Benign'', and anomalous traffic, labeled ''Attacks''. The attack traffic is detailed in the second column of Table 2. The Packet Length Std feature is required to detect DDoS, DoS Hulk, DoS GoldenEye, and Heartbleed attacks. The Init Win Fwd Bytes feature is required to detect Web-Attack, SSH-Patator, and FTP-Patator attacks, whereas the Min Bwd Packet Length and Fwd Average Packet Length features are required to recognize normal traffic [58].
CICIDS-2017 has more complex types of attacks, as presented in Table 2. The rationale for choosing the CICIDS-2017 dataset is to use, in the experiments, a dataset that closely represents current real-world network traffic.

B. EXPERIMENTAL SETUP
In general, there are four stages in the experimental settings shown in Fig. 1, which can be explained as follows.
1) Only 20% of the MachineLearningCSV data from the CICIDS-2017 dataset are used in this experiment. Since the dataset has redundant features, the redundant ones are removed. A relabeling process is then performed. The 20% of MachineLearningCSV data are then split into 70% for training data and 30% for testing data.
2) Feature selection is performed on the training data using Information Gain. The selected features are then grouped according to their weights.
3) Each feature group, or feature subset, is classified using the Random Forest (RF), Bayes Net (BN), Random Tree (RT), Naive Bayes (NB), and J48 classifiers. The analysis considers the following parameters: TPR, FPR, Precision, Recall, Accuracy, percentage of incorrectly classified instances, and execution time. 10-fold cross-validation is used in this stage.
4) The TPR, FPR, Precision, Recall, Accuracy, percentage of incorrectly classified instances, and execution time of each classifier algorithm are compared and analyzed. All learning and testing steps are executed with 10-fold cross-validation. Lastly, conclusions are drawn.
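The sampling and splitting in stage 1 can be sketched as follows. This is a minimal illustration and not the authors' procedure: the helper name `stratified_split`, the fixed seed, the toy label counts, and in particular the choice to stratify by class label are all assumptions, since the paper does not state how the 20% subset or the 70:30 split was drawn.

```python
import random
from collections import defaultdict

def stratified_split(labels, fraction, seed=0):
    """Return (chosen, rest) index lists, taking `fraction` of every class.

    Stratifying per class is an assumption of this sketch; it keeps rare
    classes represented in both parts.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen, rest = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = round(len(indices) * fraction)
        chosen.extend(indices[:cut])
        rest.extend(indices[cut:])
    return sorted(chosen), sorted(rest)

# Toy label column (made-up counts, not taken from CICIDS-2017).
labels = ["Normal"] * 800 + ["DoS/DDoS"] * 150 + ["Bot"] * 50

# Step 1: keep 20% of the records.
subset, _ = stratified_split(labels, 0.20)
subset_labels = [labels[i] for i in subset]

# Step 2: split that 20% into 70% training and 30% testing positions.
train_pos, test_pos = stratified_split(subset_labels, 0.70)
```

With these toy counts the subset holds 200 records, of which 140 go to training and 60 to testing.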

C. INFORMATION GAIN
Information Gain is the most used feature selection technique. It is a filter-based feature selection method [28], [30]. Information Gain uses a simple attribute ranking, reduces the noise caused by irrelevant features, and detects the features that carry the most information about a specific class. The best feature is determined by calculating the feature's entropy. Entropy is a measure of uncertainty that can be used to infer the distribution of features in a concise form [31]. The entropy can be calculated using (1).

$$\mathrm{Entropy}(S) = -\sum_{i=1}^{c} P_i \log_2 P_i \tag{1}$$

where c is the number of values of the classification class and P_i is the proportion of samples that belong to class i. After the entropy value is obtained, the Information Gain value is calculated using (2).

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) \tag{2}$$

where S is the sample set, A is an attribute, v is a possible value of attribute A, and Values(A) is the set of possible values of A. |S_v| is the number of samples with value v, |S| is the total number of samples, and Entropy(S_v) is the entropy of the samples that have value v.
This work chooses Information Gain for feature selection because it is a filter-based technique, which provides more stable sets of selected features owing to its robustness against overfitting. Overall, the computational complexity of a filter-based technique is $O(m \cdot n^2)$, where m is the number of training samples and n is the number of attributes/features. This is lower than that of embedded and wrapper-based techniques [55]. The complex nature of wrapper-based techniques creates a high risk of overfitting. Thus, using a feature selection technique that produces fewer but significant and relevant features, with less computational complexity, reduces the execution time of the classification algorithms used in the anomaly/attack detection process.
The features are given IDs from 1 to 77. Information Gain ranks the features based on their weight values, and the minimum weight is determined manually using a trial-and-error approach. In this work, the researchers propose to rank and group the features according to minimum weight values. Thus, groups of features are obtained, each with a different number of features, as shown later in Table 6. All feature groups are then validated using the five classifier algorithms, so that we can determine which feature groups are effective enough to be used for classifying attack types.
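The entropy and Information Gain definitions in this subsection can be illustrated with a short sketch. It assumes the attribute is already discrete; continuous traffic features would first need to be discretized, which is outside this example. The function names and the four-record toy data are hypothetical and are not taken from the paper or from Weka.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on `values`."""
    total = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Toy data: two classes, four records (made up for illustration).
labels = ["Attack", "Attack", "Normal", "Normal"]
separating = ["a", "a", "b", "b"]    # splits the classes perfectly
uninformative = ["a", "b", "a", "b"]  # tells nothing about the class

gain_separating = information_gain(separating, labels)
gain_uninformative = information_gain(uninformative, labels)
```

In this toy case the class entropy is 1 bit; the attribute that separates the classes recovers all of it (gain 1), while the uninformative attribute recovers none (gain 0).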
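The grouping by minimum weight can be sketched as below. The helper `group_by_min_weight`, the feature IDs, the weights, and the thresholds are all made-up illustrations; the actual weights and cut-offs are those reported in Tables 5 and 6, which this sketch does not reproduce.

```python
def group_by_min_weight(weights, thresholds):
    """Map each minimum weight to the IDs of features scoring at least that much.

    Features inside a group are ordered from highest to lowest weight.
    """
    ranked = sorted(weights, key=weights.get, reverse=True)
    return {t: [f for f in ranked if weights[f] >= t] for t in thresholds}

# Hypothetical feature IDs and Information Gain weights.
weights = {1: 0.61, 2: 0.58, 3: 0.42, 4: 0.30, 5: 0.12, 6: 0.01}
groups = group_by_min_weight(weights, thresholds=[0.5, 0.2, 0.0])

for minimum, features in groups.items():
    print(f"min weight {minimum}: {len(features)} features -> {features}")
```

Because every group is defined by a lower bound on the weight, the groups are nested: the top-ranked features appear in every subset, and lowering the minimum weight only adds features.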

D. CLASSIFICATION ALGORITHM
The main criteria for selecting classifier algorithms in this work are good performance in terms of accuracy, learning ability, scalability, and speed. Based on previous works that support these criteria, five algorithms are selected for the experiments: Random Forest, Bayesian Network, Random Tree, Naive Bayes and J48. The work by Hadi [20] states that random forest trees are strong learners and perform well in detecting attacks based on the features produced by Information Gain feature selection. Niranjan et al. [39] reveal that the Bayesian Network outperforms other algorithms in classifying attacks. According to Sindhu et al. [57], Random Tree is a scalable and efficient algorithm. Naive Bayes is a classification algorithm that is able to identify class labels faster than other algorithms because its model has low complexity [55]. Sahu and Mehtre [15] conclude that the J48 algorithm has good accuracy in classifying attacks. Thus, the five classification algorithms are used to validate the significance of the features selected during the feature selection stage.

1) RANDOM FOREST (RF)
Random Forest is one of the ensemble classifier methods. If each classifier in an ensemble is a decision tree classifier, then the collection of classifiers is a ''forest''. Each decision tree is created through a random selection of attributes at each node to determine the split [32]. The random forest algorithm was proposed by Breiman in 2001 [33]. Some anomaly detection studies that use random forest include the research conducted in [20], [34] and [35].

2) BAYES NETWORK (BN)
Bayesian Network (BN) is a model that encodes probabilistic relationships between variables of interest. The accuracy of this method depends on assumptions which are usually based on the model behavior of the target system. So any significant deviation from the assumption will cause a decrease in detection accuracy [36]. Some anomaly detection studies that use Bayesian networks include works by Reazul et al. [37] and Ding et al. [38].

3) RANDOM TREE (RT)
Basically, Random Tree is a decision tree that is built on a random collection of attributes. A decision tree is a group of nodes and branches. A node represents a test on an attribute and the branches represent the outcomes of that test. The leaves show the final decision, in the form of a class label, taken after all attributes have been evaluated [39]. Some anomaly detection studies using this method include [40], [41] and [42].

4) NAIVE BAYES (NB)
Bayesian classification is a statistical classification that is able to predict the probability of class membership. Bayesian classification is based on the Bayes theorem [43]. The Bayesian classification is better known as the Naïve Bayes classification. Naïve Bayes assumes that the influence of attribute values on class is independent of other attribute values. Some anomaly detection studies using Naive Bayes include works by Goeschel [44], and Shakya and Sigdel [45].
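The independence assumption described above can be written compactly. The notation here is an assumption of this note rather than the paper's: $C_k$ denotes a class label and $x_1, \dots, x_n$ the attribute values of one traffic record.

```latex
P(C_k \mid x_1, \dots, x_n) \;\propto\; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k),
\qquad
\hat{y} = \arg\max_{k} \; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)
```

Each attribute contributes its own factor $P(x_i \mid C_k)$, so no joint distribution over attributes has to be estimated; this is what keeps the model simple.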

5) J48
J48, an implementation of C4.5, is a widely used machine learning algorithm that belongs to the family of decision tree algorithms. This algorithm builds a decision tree from a set of training data using the concept of entropy [43]. It differs from ID3 in that it can handle both continuous and categorical attributes when building the tree [46]. Some anomaly detection studies using this algorithm include works by Sahu and Mehtre [15] and Muniyandi et al. [47].

E. ANALYSIS TOOLS
All simulations in this experiment are executed on a computer with an Intel Core i7 processor at 2.70 GHz and 8 GB of RAM, running Windows 10 as the operating system. For the analysis, Weka 3.9 with a heap size of 3072 MB is used as the machine learning software.

IV. EXPERIMENTS, RESULTS AND ANALYSIS
This section presents the data preparation, the details of the feature selection and classification experiments, and lastly the results and discussion of the experiments.

A. DATASET PREPARATION
The eight CSV files listed in Table 2 are combined into one CSV file. Next, to process the dataset using the Weka software, this CSV file is converted into an ARFF file. The experiment uses only 20% of the MachineLearningCSV data. There are 78 regular features and one class label used in this study. The dataset contains two columns named ''Fwd Header Length'', which makes them redundant features, so one of those columns must be removed. Thus, after removing the redundant feature, 77 features are available to be analyzed. The CICIDS-2017 data are prone to high class imbalance, which leads to low detection accuracy and a high false alarm rate. By adopting the solution suggested by Karimi et al. [30] and Panigrahi and Borah [48], a new labeling of the attack traffic is introduced, as listed in Table 3. The 77 features are already of numerical data type, so no data transformation is required to feed the data into the Weka software.
After relabeling the attack classes, the 20% of MachineLearningCSV data are split into two portions of 70% and 30%. The 70% portion is used as training data and the other 30% portion is used as testing data, as tabulated in Table 4. The 70:30 data portion was used in [49]. The experimental results in [50] show that the 70:30 portion of training and testing data leads to the same level of accuracy as the portions of 80:20 and 60:40. Meanwhile, the 70:30 data portion used in the work by Abualkibash [51] results in high accuracy. Therefore, in this study the researchers divide the training and testing data with a portion of 70:30. Although the dataset is transformed to use the new attack labels, the ''Infiltration'' attacks still have a very small portion of data compared to other types of attacks. Later, the data will be analyzed by the feature selection technique.
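The removal of the repeated ''Fwd Header Length'' column described in this subsection can be sketched with the Python standard library. The helper `drop_duplicate_columns` and the one-row sample are hypothetical; the sketch assumes the duplicate is identified purely by a repeated header name and keeps the first occurrence.

```python
import csv
import io

def drop_duplicate_columns(rows):
    """Keep the first occurrence of every header name; drop later repeats.

    Header names are compared after stripping surrounding whitespace.
    """
    header = [name.strip() for name in rows[0]]
    seen, keep = set(), []
    for pos, name in enumerate(header):
        if name not in seen:
            seen.add(name)
            keep.append(pos)
    return [[row[pos] for pos in keep] for row in rows]

# Made-up miniature CSV with the column name repeated.
raw = (
    "Flow Duration,Fwd Header Length,Bwd Header Length,Fwd Header Length,Label\n"
    "10,40,20,40,BENIGN\n"
)
rows = list(csv.reader(io.StringIO(raw)))
cleaned = drop_duplicate_columns(rows)
```

After this step the header holds each feature name once, which is the precondition for counting 77 distinct features plus the class label.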

B. FEATURE SELECTION USING INFORMATION GAIN
As mentioned in Section 1, the main issue in a large dataset is dimensionality. Feature selection reduces the dimensionality of data by selecting relevant features. Information Gain evaluates the features by calculating their entropies. In this study, feature selection is implemented with the Weka software and the process is shown in Algorithm 1. Table 5 presents the feature ranking produced by Information Gain feature selection. As mentioned in sub-section 3.C, the feature selection in this experiment uses a filter-based approach. In other words, the feature selection filters ... In Table 6, there are seven groups of features, which we call the new feature subsets.

C. EXPERIMENTAL RESULT
To analyze the performance of the feature selection performed by Information Gain and of the five (5) classifier algorithms, seven (7) metrics are used: True Positive Rate (TPR), False Positive Rate (FPR), Precision, Recall, Accuracy, percentage of incorrectly classified instances, and execution time. The execution time is measured during training (from the moment the classification process starts until it stops). In the experiment, each feature subset is classified by the RF, BN, RT, NB and J48 classifiers. The overall process is shown in Algorithm 2.
To evaluate the performance of the classification algorithms, this research uses 10-fold cross-validation. The 10-fold cross-validation is used because it reduces computing time while maintaining the accuracy of the classification algorithms. The input dataset is randomly divided into 10 folds of equal size. In each round, cross-validation uses 9 folds for training and 1 fold for testing. This process is repeated 10 times, so that each fold serves as the test fold once. This cross-validation method has been widely used in IDS research, such as in [52], [53], and [54].
The performances of the classifiers using four (4) features selected by Information Gain are listed in Table 7.
The classifiers' performances with 22 selected features are listed in Table 9. The results show that RF again has the highest accuracy, 99.86%, compared to the others. Although this classifier has a good recall of 0.999 and a low FPR of 0.003, its precision is reported as NaN. On the other hand, RF cannot detect Infiltration.
The performances of the classifiers with 35 selected features are listed in Table 10. Similar to the previous results, RF has the highest accuracy of 99.83%, a recall of 0.998, and an FPR of 0.004. Nevertheless, the precision is reported as NaN. This result shows that RF cannot detect Infiltration. Surprisingly, NB achieves better performance than before, with 70.84% accuracy. Although this is still lower than the other methods, NB has a good precision of 0.923.
The performances of the classifiers with 52 selected features are tabulated in Table 11. J48 performs better than the other classifiers, with an accuracy of 99.87%, a recall of 0.999, a precision of 0.999 and a low FPR of 0.002.
The performances of the classifiers using 57 selected features are listed in Table 12. BN is able to detect all types of traffic with good TPR values.
Lastly, the performances of the classifiers using all features are tabulated in Table 13. Using all features, BN is able to detect all types of traffic with good TPR. Observation of Table 11, Table 12, and Table 13 leads to the conclusion that RF, RT, and J48 with 52, 57, and all features have a good ability to detect Normal, DoS/DDoS, Brute Force and Bot attack traffic. However, RF, RT, and J48 struggle to detect Infiltration attack traffic, whereas BN and NB have a good ability to detect it.
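The 10-fold cross-validation used throughout this subsection can be sketched as a generator of train/test index lists. This is a plain, unstratified version written for illustration; the name `k_fold_indices`, the seed, and the sample count are assumptions, and the sketch does not claim to mirror how Weka draws its folds. When the number of samples is not divisible by 10, fold sizes differ by at most one.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(105, k=10))
fold_sizes = [len(test) for _, test in splits]
```

A classifier would be trained on `train` and evaluated on `test` in each of the 10 rounds, and the metrics would be aggregated over the rounds.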

D. ANALYSIS
Implementation of the proposed Information Gain feature selection in the experiments yields ranked features according to their weight scores. Features with higher weight scores represent more relevant and significant features of an attack.
As can be observed from Table 5, the experiment yields the top four features (out of 77) together with their scores. Features with IDs 41, 13, 65, and 8 are the most relevant and significant features for detecting any attack, and they appear in every feature subset. For Infiltration attack traffic, NB is able to detect it with a TPR of 0.800 using the feature subsets of 22 and 35, and detects it perfectly (TPR of 1.000) using the feature subsets of 52, 57 and 77. The reason is that the significant features representing Infiltration attack traffic appear in the feature subsets of 52, 57, and 77. Unfortunately, the other classifiers (RF, BN, RT and J48) are unable to detect the Infiltration attack traffic well. The small amount of this type of attack traffic in the dataset may cause the poor detection performance. As mentioned in subsection 4.A, CICIDS-2017 contains imbalanced data, which is a big challenge in detecting anomalies/attacks. Similar to the case of the Infiltration attack, none of the classifiers is able to detect the Web Attack traffic well using the feature subset of 4. Only the BN and NB classifiers are able to detect the Web Attack traffic using the feature subset of 15, with TPR values of 0.993 and 0.829, respectively.
As for Bot Attack traffic detection, RF, BN, RT, and J48 are able to detect the traffic using certain features subsets, but with lower TPR values.
Furthermore, considering the Precision and Recall values, in general the five classifiers detect the traffic relatively well. Nevertheless, in some cases the classifiers produce NaN values. Those cases may happen because of the use of 10-fold cross-validation in the experiment, which divides the dataset into ten folds (data portions). As the amount of attack traffic for Infiltration, Bot and Web attacks is relatively small, some folds do not contain that traffic. This affects the ability to learn the attack during the training stage, particularly for the Infiltration attack traffic, which has a very small share of the dataset.
The experimental results show that the type and number of selected features may significantly impact the performance of the detection. Fig. 2 shows the summary of the classifiers' ... On the other hand, the proposed Information Gain improves NB's accuracy to 70.84% with 35 selected features. BN and J48 do not show any significant improvement compared with the use of all features in the analysis.
Besides the accuracy, the selected features impact the FPR, as shown in Fig. 3. This work also analyzes the effect of the selected features on execution time. Fig. 4 shows the summary of the execution time for each feature subset using RF, J48, BN, RT, and NB. Selecting relevant features has a very significant impact on RF, J48, and BN. The execution times of RT and NB are relatively very small. Overall, the more features there are to analyze, the more time is required for execution.
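The per-class metrics, and one way a NaN precision can arise, can be illustrated with a short sketch. Precision is TP / (TP + FP), so it is undefined (0/0) for a class that the classifier never predicts. The function name, the toy labels, and the choice to return `math.nan` for an empty denominator are assumptions of this example, not a description of Weka's output code.

```python
import math

def per_class_metrics(y_true, y_pred, positive):
    """One-vs-rest TPR (recall), FPR and precision for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn

    def ratio(num, den):
        # An empty denominator means the quantity is undefined.
        return num / den if den else math.nan

    return {
        "TPR": ratio(tp, tp + fn),
        "FPR": ratio(fp, fp + tn),
        "Precision": ratio(tp, tp + fp),
    }

# Made-up example: the rare class is never predicted.
y_true = ["Normal"] * 8 + ["Infiltration"] * 2
y_pred = ["Normal"] * 10

infiltration = per_class_metrics(y_true, y_pred, "Infiltration")
normal = per_class_metrics(y_true, y_pred, "Normal")
```

Here the rare class gets a TPR of 0 and an undefined precision, while the majority class still looks strong, which is the pattern to watch for with highly imbalanced traffic.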

V. CONCLUSIONS
This work has presented experiments as a proof of concept of the impact of feature selection on improving anomaly detection accuracy. Information Gain is chosen because of its ability to calculate the weight of the information carried by each feature.
The RF classifier outperforms the others in the experiments using the feature subsets of 15, 22 and 35, whilst J48 performs best using the feature subsets of 52, 57 and 77. Another finding of the experiments is that, although BN has a lower accuracy than RF and J48, it is able to detect all traffic types using the feature subsets of 52, 57 and 77. Furthermore, the experimental results show that the selected features decrease the FPR, especially for BN.
With regard to processing time, the experimental results confirm that the number of selected features affects the execution time.
The proposed Information Gain approach produces ranked features based on their weight values. However, expert intervention is still needed to determine the minimum weight value, which affects the number of features selected.
The authors plan to work on different feature selection methods to design an optimal feature selection mechanism. Analysis of how each feature subset affects each type of attack will also be carried out as future work.

RAHMAT BUDIARTO received the B.Sc. degree from the Bandung Institute of Technology, in 1986, and the M.Eng. and Dr.Eng. degrees in computer science from the Nagoya Institute of Technology, in 1995 and 1998, respectively. He is currently a Full Professor with the College of Computer Science and IT, Albaha University, Saudi Arabia. His research interests include intelligent systems, brain modeling, IPv6, network security, wireless sensor networks, and MANETs.