Increasing the Performance of Machine Learning-Based IDSs on an Imbalanced and Up-to-Date Dataset

In recent years, due to the extensive use of the Internet, the number of networked computers in our daily lives has been increasing. Weaknesses of servers enable hackers to intrude on computers by using not only known but also new attack types, which are more sophisticated and harder to detect. To protect computers from them, an Intrusion Detection System (IDS), which is trained with machine learning techniques on a pre-collected dataset, is one of the most preferred protection mechanisms. The datasets used were collected during a limited period in specific networks and generally do not contain up-to-date data. Additionally, they are imbalanced and do not hold sufficient data for all types of attacks. These imbalanced and outdated datasets decrease the efficiency of current IDSs, especially for rarely encountered attack types. In this paper, we propose six machine-learning-based IDSs using the K Nearest Neighbor, Random Forest, Gradient Boosting, Adaboost, Decision Tree, and Linear Discriminant Analysis algorithms. To implement a more realistic IDS, an up-to-date security dataset, CSE-CIC-IDS2018, is used instead of older and heavily studied datasets. The selected dataset is also imbalanced. Therefore, to increase the efficiency of the system depending on attack types and to decrease missed intrusions and false alarms, the imbalance ratio is reduced by using a synthetic data generation model called Synthetic Minority Oversampling TEchnique (SMOTE). Data generation is performed for minority classes, and their sizes are increased to the average data size via this technique. Experimental results demonstrate that the proposed approach considerably increases the detection rate for rarely encountered intrusions.


I. INTRODUCTION
Due to technological developments, most of the real-world transactions have been made available in the cyber world. Thus, many operations, such as banking, shopping, online examinations, electronic commerce, and communication are used extensively within this new environment. With the widespread use of smartphones, people can connect to this global network and perform transactions at any time and from anywhere. Although this digitalization facilitates the daily work of human beings, due to the weakness of the servers and the newly emerged intrusion techniques, networks are often attacked by the intruders who take advantage of the anonymous nature of the Internet not only to steal some information or money but also to slow down the operation of network services.
The associate editor coordinating the review of this manuscript and approving it for publication was Vicente Alarcon-Aquino.
Security administrators traditionally prefer password protection mechanisms, encryption techniques, and access controls in addition to firewalls as a means of protecting the network. However, these techniques are not sufficient for protecting the system. Therefore, many administrators prefer the use of Intrusion Detection Systems (IDSs) to detect malicious attacks by monitoring network traffic, as depicted in Fig. 1.
Intrusion can be defined as any kind of unauthorized activity that causes damage to the confidentiality, availability, or integrity of the data within an information system. IDSs are a highly preferred means of detecting this type of activity. IDSs can be categorized into three groups: Signature-based Intrusion Detection Systems (SIDS), Anomaly-based Intrusion Detection Systems (AIDS), and Hybrid Systems.
SIDSs store the signatures of malicious activities in a knowledge base and try to detect intrusions by using pattern matching techniques. Meanwhile, AIDSs try to learn the normal behaviors of the activities and classify the others as suspicious. In this type of system, there is no need to use a signature base, and the system can identify zero-day attacks that have not been encountered previously. Hybrid systems are formed by integrating SIDS and AIDS to increase the detection rate of known malicious activities while reducing the false positive rate for zero-day attacks.
Due to the advantages of AIDSs, most current IDSs either directly use an AIDS or benefit from it within a hybrid approach. These IDSs need to be trained via a machine learning model by processing a dataset. Most of the works on this topic adopted old datasets, which contain redundant information and imbalanced volumes of data types. Although some new datasets contain up-to-date data, the imbalanced size of data types is still a challenge for researchers.
The efficiency of an IDS is directly related to the selected learning model and the quality of the dataset used. A good quality dataset can be defined as one that yields better performance metrics in real-world transactions. As mentioned in [1], [2], imbalanced datasets present a problem to researchers. A dataset is said to be imbalanced when the distribution of classes is not uniform [3]. This is a common problem in many classification tasks because of the datasets used. An imbalanced dataset biases the classifier towards the majority class; however, in most of these tasks, the aim is to detect the minority class [4]. This results in a large classification error over the minority class samples, and the main targets can be missed. To increase the quality of the dataset, it should be balanced according to data types. Therefore, in this paper, we aim to use an up-to-date dataset for training the IDS to develop a realistic knowledge base for anomaly detection. To enhance the efficiency of the system, a comparative study is carried out employing six different machine learning algorithms. To increase the detection rate of the low-sampled attack types, a synthetic data generation tool is used, and the results obtained in the present work are compared with those of previous experiments.
The rest of this paper is organized as follows. In the next section, a literature review on the topic of interest is provided. Section III depicts a comparative study of previously used datasets on IDSs. The design details of the proposed system with the chosen machine learning algorithms are explained, and the experimental results are discussed in Section IV and Section V, respectively. Finally, conclusions are drawn, and directions for future works are suggested in Section VI.

II. RELATED WORKS
Intrusion Detection Systems are striking areas not only for cybersecurity research but also for academic research. Over the past several years, many papers have been published on this topic. In this section, these noteworthy pieces of research (especially related to imbalanced datasets) are discussed briefly.
In 2019, Gao et al. used the NSL-KDD dataset to test and develop an IDS by using an adaptive ensemble learning model [5]. They used four different algorithms: Decision Tree, Random Forest, K-Nearest Neighbor, and Deep Neural Networks. Also, they designed an ensemble adaptive voting algorithm. They used the NSL-KDD-Test+ file to verify their approach. The accuracy of the Decision Tree algorithm is 84.2% and the final accuracy of the adaptive algorithm is 85.2%. In the end, they compared related research papers, and they found that their ensemble adaptive model improves detection accuracy.
An online oversampling Principal Component Analysis (PCA) designed to address the anomaly detection problem is proposed in [6]. Their approach focuses on using online platforms for large-scale problems. By oversampling the minority class of the target instance, their proposed algorithm allows them to determine the anomaly of the target instance. A comparison between the PCA and other detection algorithms supported the applicability, efficiency, and accuracy of the proposed method. Also, their algorithm reduced computational costs and memory requirements.
Yueai and Junjie proposed a two-stage strategy with a load balancing model (with an online and an offline phase) to implement an IDS [7]. In the online phase, the system captured packets from the network and then detected intrusion. Meanwhile, in the offline phase, the training dataset was used to make an offline model. They used SMOTE for oversampling and made their classifications with AdaBoost and Random Forest algorithms. Their experimental results showed that SMOTE and AdaBoost did not work well.
Abdulhammed et al. (2019) used the CIDDS-001 dataset for handling imbalanced datasets to build an efficient IDS through various techniques [8]. They studied the sampling methods of CIDDS-001 and evaluated this dataset through Voting, Deep Neural Networks, Variational Autoencoder, Random Forest, and stacking machine learning algorithms. This system detected attacks with 99.99% accuracy when using an imbalanced dataset.
VOLUME 8, 2020
A hybrid approach for IDSs with the NSL-KDD dataset was studied in [9]. Their approach involved a combination of SMOTE, cluster centers, and nearest neighbors. They selected important features using the leave-one-out method. K-Fold Cross Validation (K = 10) was used for measurement purposes. Experimental results showed that the proposed method achieved acceptable accuracy and a low false alarm rate, which do not have significant differences.
In 2019, Taher et al. proposed a supervised machine learning system to classify network traffic [10]. They used the NSL-KDD dataset for testing and training because they wanted to detect whether traffic was malicious or normal. For that purpose, they used Support Vector Machine (SVM) and Artificial Neural Network (ANN) algorithms and feature selection methods. They found that the ANN with feature selection performed better than the SVM.
Tesfahun and Bhaskari applied SMOTE to the training dataset together with a feature selection method based on Information Gain on NSL-KDD [11]. This study was carried out to deal with imbalanced datasets in IDSs. The Random Forest algorithm was used as the classifier for the proposed method. Their experimental results showed that the Random Forest algorithm with SMOTE and information gain based feature selection performs well.
Chandra et al. proposed a hybrid model using the KDD Cup99 dataset in 2019 [12]. They used Filter-Based Attribute Selection to reduce the feature dimension of the dataset. K-Means and Sequential Minimal Optimization algorithms were used for detecting attacks in the dataset. Their proposed method significantly improves the accuracy rate.
In 2012, Qazi and Raza studied the effect of attribute selection on increasing the efficiency of the classification [13]. Also, they used undersampling and oversampling to decrease the imbalance ratio of the dataset. They used SMOTE for oversampling. They found that for the imbalanced datasets, the sampling technique is more accurate than SMOTE for classifying minority classes. It was also found that the Decision Tree and Naive Bayes algorithms are more accurate than other algorithms.
Al-issa et al. (2019) implemented Decision Tree (DT) and Support Vector Machine (SVM) algorithms for detecting attack signatures using a specific dataset [14]. The dataset contained regular profiles and several DoS scenarios in wireless sensor networks. The results showed that the DT achieved a lower false positive rate and higher true positive rate than SVM, as the DT has 99.86% and SVM has 99.62% true positive rate, and the DT has 0.05%, and SVM has 0.09%, false-negative rate.
In 2018, Ahmad et al. conducted a comparative study that resolved problems associated with accuracy and related metrics using Random Forest, Support Vector Machine, and Extreme Learning algorithms [15]. The NSL-KDD dataset was used, which is considered a benchmark for the evaluation of IDSs. The results show that the Extreme Learning Algorithm is better than other algorithms in terms of precision, recall, and accuracy.
A comparison of these related works is made in Table 1, which shows the datasets used, the efficiencies achieved, and the use of oversampling (OS) and undersampling (US) methodologies. The bracketed numbers are the reference numbers of the related works. The studies listed in Table 1 use old datasets such as KDD-Cup99 and NSL-KDD, or their own datasets. This makes the detection of the newest attacks challenging. In particular, systems developed with older datasets such as KDD-Cup99 and NSL-KDD are not suitable for detecting current attacks. To implement a more effective IDS, an up-to-date dataset needs to be used. Additionally, most previous IDS implementations report the overall accuracy of the system to show its efficiency. However, this value does not reflect the true performance of the system, especially on imbalanced datasets. Therefore, the average accuracy, which gives the same weight to all class types, should be accepted as the primary performance metric.

III. DATASETS
IDSs can be developed in either signature-based or anomaly-based form. To define the anomaly of a system, the system should be trained on normal and abnormal requests by using a dataset. Researchers can use either a public dataset or their own datasets. In the following subsections, the most favored datasets are described and then compared in terms of their content and properties.

A. KDD CUP99
KDD Cup99 was created in 1998 by DARPA to detect anomalous network traffic and was used in the 1999 KDD Cup Challenge to test IDSs [16], [17]. It was constructed from nine weeks of raw LAN data captured as a TCP dump. This dataset is one of the most popular datasets in the fields of data mining and machine learning. There are about 5 million records in the standard dataset. Approximately 80% of the records are attack data, while the remaining 20% are benign. There are 41 features in the dataset, which can be categorized under three headings: basic features, traffic features, and content features. Data in the dataset can be classified into five main categories, which are listed as follows:
• Normal: Non-attack data.
• DoS Attacks: These attacks are typically used to prevent users from receiving services by sending multiple connection requests to a server, exploiting security weaknesses in the TCP/IP protocol structure.
• Probe Attacks: These attacks are performed to find specific information on a server or any machine.
• R2L Attacks: These are attacks made with unauthorized access as a guest or another user.
• U2R Attacks: These are the attacks of a user who is allowed to enter the system but who is not an administrator; by using this type of attack, a user can act as an administrator and perform unauthorized operations.
There are 22 different attacks within these four main attack categories [18], and one normal data class. KDD data include some numerical and text-based information about the operations performed, and depending on the aim, they need to be processed.

B. NSL-KDD
The NSL-KDD dataset was created in 2009 to solve problems related to irregular data in the KDD Cup99 dataset [19]. The reliability of the systems developed in previous years had been questioned, as there were no accurate datasets for IDSs. The NSL-KDD dataset has important advantages over the original KDD Cup99 dataset:
• Unnecessary records in the training data have been eliminated; it contains the important records of the KDD Cup99 dataset,
• It does not have duplicate data,
• It has a more homogeneous distribution,
• The number of records in the training and test sets is proportionally distributed.
The NSL-KDD dataset contains a feature map with 42 features, which are grouped under four categories:
• General features,
• Content features,
• Server-based traffic features,
• Time-dependent traffic features.
Attacks in the NSL-KDD dataset are divided into four different categories: DoS, Probe, U2R, and R2L. In addition to these attacks, there is a single Normal/Benign category.

C. CIC-IDS2017
CIC-IDS2017 was created in 2017, and it includes the most recent real-world attacks of that year. It was created by analyzing network traffic using information from the timestamp, source and destination IPs, source and destination ports, protocols, and attacks [20]. It includes 86 network-related features that also contain IP addresses and attack types.
The criteria for establishing a reliable dataset were determined in accordance with the most recent dataset evaluation framework, published in 2016. Before the CIC-IDS2017 dataset was introduced, no intrusion detection dataset had met these criteria. The criteria are as follows:
• It contains tagged data,
• All network traffic is recorded,
• All protocols are included in the dataset,
• Common attacks are distributed proportionally.

D. CSE-CIC-IDS2018
The profile concept was used to create the CSE-CIC-IDS2018 dataset [21]. It was the most recent dataset made available in 2018/2019 by the Canadian Institute for Cybersecurity. These profiles can be used by agents or people to create events on a network and can be applied to various network protocols with different topologies. Furthermore, the dataset was enhanced by considering the standards used in the creation of CIC-IDS2017. In addition to the basic criteria, it offers the following advantages:
• The number of duplicate records is very low,
• Uncertain data is nearly absent,
• The dataset is in CSV format, so it is ready to use without processing.
It is currently one of the most recent datasets. Two profiles were classified, and five different attack methods were used in the dataset. In addition, various scenarios were created, and data were collected daily. The dataset was edited daily, and raw data were recorded. When creating the data, 80 statistical properties such as time, number of packets, number of bytes, and packet length are calculated separately in the forward and reverse directions, and a label indicating whether the traffic is an attack is added. The final dataset is published over the Internet for researchers, with approximately 5 million records in both PCAP and CSV format. The CSV format dataset should be used if Artificial Intelligence techniques are to be applied; the unprocessed PCAP data should be used if a new feature is to be extracted. The numbers of attack and benign records are shown in Table 2. Table 3 shows the number of records in the most preferred and popular datasets, categorized by class.
As can be seen, these datasets are not balanced. For an accurate calculation of the system's efficiency, this imbalanced structure needs to be quantified. The imbalance ratio, which can be calculated as in Equation 1, can be used as the metric.

$$\text{Imbalance Ratio} = \frac{\max_i \{C_i\}}{\min_i \{C_i\}} \qquad (1)$$

where $C_i$ is the data size of class $i$. In other words, the imbalance ratio can be defined as the ratio between the number of instances of the majority (max) class and that of the minority (min) class. According to this equation, the imbalance ratios of the most popular and recent datasets are listed in Table 4. There is a vast gap between the data classes, which also affects the efficiency of the system. Additionally, sophisticated hackers focus on the development of minority data types to reach their targets. Therefore, to increase the efficiency of the system, this imbalance ratio should be decreased.
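The imbalance ratio defined above, the largest class size divided by the smallest, can be sketched in a few lines of Python. The function name `imbalance_ratio` and the toy label list are illustrative assumptions, not values from any of the datasets discussed.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return (size of the largest class) / (size of the smallest class)."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    return max(counts.values()) / min(counts.values())


# Toy labels for illustration only (not taken from any real dataset).
toy_labels = ["benign"] * 900 + ["dos"] * 90 + ["sql_injection"] * 10
toy_ratio = imbalance_ratio(toy_labels)
print(toy_ratio)  # 900 / 10 = 90.0
```

A perfectly balanced dataset gives a ratio of 1, and the ratio grows as the rarest class shrinks relative to the most frequent one.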

IV. PROPOSED SYSTEM
Many IDS development studies have been conducted over the years, and detection accuracy is the most critical metric for developers. However, if the dataset is imbalanced and a specific category composes the most significant part of the dataset, then the use of accuracy as a single metric is not adequate. If there is a large gap between the data sizes of the majority and minority categories, sophisticated attackers can focus on minority attack types to increase their efficiency. Therefore, in this paper, we focus on removing the effect of asymmetry between classes in the dataset by increasing the average accuracy of the system.
As mentioned before, many current IDSs are built on anomaly detection, identifying normal data with the use of machine learning algorithms. Many helpful tools have been created for this purpose over the last few decades, and currently, the Python programming language, as one of the most popular development environments, has become very important for implementing new learning-based systems. The use of libraries such as Scikit-Learn (Sklearn) provides excellent flexibility and ease of use not only for system development but also for testing.

A. FLOWCHARTS OF THE PROPOSALS
Many works have used two well-known datasets, KDD Cup99 and NSL-KDD, which are relatively old. Therefore, in this paper, an up-to-date dataset, CSE-CIC-IDS2018, is utilized. In addition to using an up-to-date dataset, intrusion detection is performed with synthetic data generation. To obtain the results of the trained system on the original data, the implemented system is executed with the original data according to the flowchart depicted in Fig. 2.
However, like many other datasets, this dataset is also imbalanced. To remove the effect of the asymmetric categories, some data-driven techniques can be used. In this work, a data sampling model is used to decrease the imbalance ratio of the system. Through this model, new data are created for minority classes, and the system is trained with them. To observe the effect of the sampled data, the flowchart of the system is modified, as depicted in Fig. 3.

B. PYTHON AND SCIKIT-LEARN
Python is a general-purpose programming language that is easy to learn and use, and it is currently one of the most preferred application development platforms for many application areas. Its efficient structure enables it to be quickly implemented and integrated with other systems, such as desktop and web applications, data analysis and visualization programs, network programming, database applications, and machine learning-based systems. It can be run independently of the platform and does not require a compiler. It is compatible with many operating systems such as Windows, Linux, Mac, and Symbian. It has appropriate code support for processing on one or more CPUs/GPUs with additional parallel execution libraries for increasing the performance of the system.
Scikit-Learn is an open-source machine learning library that was developed as an extension of the SciPy library in Python. It allows the implementation of various machine learning algorithms for tasks such as classification and clustering. It also provides specialized modules for tasks such as feature extraction and model evaluation. Scikit-Learn is very popular among researchers because it has a lot of resources and is easy to use. Therefore, in this paper, this language and library are utilized to implement the proposals.

C. DATASET, PREPROCESSING AND SYNTHETIC MINORITY OVER-SAMPLING TECHNIQUE (SMOTE)
In this study, the most recent dataset available (CSE-CIC-IDS2018) is used. It is publicly accessible, and it provides CSV, PCAP, and log files. Detailed information about this dataset is provided in Section III. The operations performed during the preprocessing of the dataset are detailed below.
• Missing values, which are also referred to as Not a Number (NaN) values, have been converted to 0 to prevent value errors while working with machine learning models.
• Two columns ('Flow Bytess' and 'Flow Pktss') contain infinity values. These infinity values have been set to the maximum value in the column in which they are present + 1.
• [...] Neg' and 'Init Bwd Win Byts Neg') that have 1 as the value where −1 values are present in columns with a similar name and 0 as the value where non-negative values are present.
• One column ('Label') contains the identified attack names. These attack names have been changed to numerical values as in Table 6.
• As the last preprocessing step, the combined dataset is shuffled for randomness.
The number of features in the dataset increased from 80 to 83 after preprocessing was completed.
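The cleaning steps listed above (replacing missing values with 0, replacing infinities with the column maximum plus 1, encoding attack names as numbers, and shuffling) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: records are represented as dictionaries, the function name `preprocess`, the toy rows, and the toy label codes are hypothetical, and the label codes are not those of Table 6.

```python
import math
import random


def preprocess(rows, inf_columns, label_column, label_map, seed=0):
    """Clean a list of flow records (dicts) following the listed steps."""
    cleaned = [dict(row) for row in rows]  # work on copies, keep input intact

    # Step 1: replace missing values (None or NaN) with 0.
    for row in cleaned:
        for key, value in row.items():
            if value is None or (isinstance(value, float) and math.isnan(value)):
                row[key] = 0

    # Step 2: replace infinities with (largest finite value in the column) + 1.
    for col in inf_columns:
        finite = [r[col] for r in cleaned if not math.isinf(r[col])]
        replacement = (max(finite) if finite else 0) + 1
        for row in cleaned:
            if math.isinf(row[col]):
                row[col] = replacement

    # Step 3: map attack names to numerical codes.
    for row in cleaned:
        row[label_column] = label_map[row[label_column]]

    # Step 4: shuffle the records (seeded here only for reproducibility).
    random.Random(seed).shuffle(cleaned)
    return cleaned


# Toy records for illustration only.
toy_rows = [
    {"Flow Bytess": 10.0, "Flow Pktss": float("inf"), "Label": "Benign"},
    {"Flow Bytess": float("nan"), "Flow Pktss": 4.0, "Label": "Bot"},
    {"Flow Bytess": float("inf"), "Flow Pktss": 2.0, "Label": "Benign"},
]
toy_label_map = {"Benign": 0, "Bot": 1}  # hypothetical codes
toy_clean = preprocess(toy_rows, ["Flow Bytess", "Flow Pktss"], "Label", toy_label_map)
```

For a dataset of several million records, a dataframe library would normally be used instead of lists of dictionaries; the logic of each step stays the same.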
The Synthetic Minority Oversampling Technique (SMOTE) was used for the generation of synthetic sample data. This method uses the K-nearest neighbor algorithm [31] to generate new samples. The literature mentions two similar methods: ADASYN and RandomOverSampler. The first one, ADASYN, also generates sampled data using the KNN algorithm. However, ADASYN concentrates on data that are hard to classify with the nearest neighbor approach, while SMOTE makes no such distinction. Therefore, it takes a long time to produce sample data with ADASYN. In the preliminary experiments, the accuracy achieved with ADASYN was about 5% lower than with SMOTE-generated data. Additionally, the data generation time was also very long, so the ADASYN method was not preferred for this study. The second one, RandomOverSampler, selects data randomly from the dataset and reuses the same data as samples. The RandomOverSampler method was likewise not preferred in our research because it only duplicates existing data and, therefore, did not reach better accuracies, especially for minority classes.
In the end, SMOTE is selected as the synthetic data generation model because of its simplicity of implementation and interpretation, its efficiency on low-dimensional data, and its rules for generating synthetic samples that differ from replicas [6], [7], [9], [10], [13], [35]. The level of oversampling in SMOTE is directly related to the number of neighbors in the KNN algorithm, from which a neighbor is chosen randomly. The SMOTE function creates new samples by taking the difference between a feature vector and its nearest neighbor and multiplying that difference by a random number between 0 and 1. It then adds the result to the feature vector under consideration.
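The interpolation step described above (difference to a nearest neighbor, scaled by a random number between 0 and 1, added back to the feature vector) can be illustrated with a small pure-Python sketch. It is a simplified illustration and not the library implementation used in the experiments; the function name `smote_samples`, its parameters, and the toy points are assumptions.

```python
import math
import random


def smote_samples(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbours of `base` among the other minority samples
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # random number in [0, 1)
        # new point = base + gap * (neighbour - base)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class: the four corners of the unit square.
toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_samples(toy_minority, n_new=6, k=2)
```

Because every synthetic point lies on the segment between two existing minority samples, the generated data differ from plain replicas while staying inside the region already occupied by the minority class.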

D. SEPARATION OF SAMPLED DATA
The machine learning algorithms were run separately on the original dataset and on the dataset with sampled data. Some studies showed that sampled data production should be done before the training phase [8], [36]-[39]. Therefore, the sampled data generation was carried out before all other operations in the second part of the proposals. For the Brute Force, Infiltration, and SQL Injection attacks, the class size after data generation was initially set to 286,191. Data generation was performed for classes containing less than 5% of the total number of records in the dataset, and this target size was chosen as the size of the third-largest class.
Processing sampled data is a trivial issue. Therefore, an ID column has been added to the dataset to keep track of which data is sampled. Training and tests have been performed with these sampled data using a K-fold approach to reach a realistic result by eliminating the effect of randomness. After the tests, the number of correctly predicted original sample data can also be determined easily. The ID column previously added to the dataset was used for this operation. This was done to observe the accuracy of the prediction by deleting the effect of the sampled data as test samples. Table 7 shows the final data distribution of the dataset after running the SMOTE method by generating new samples for minority classes. Through this approach, the imbalance ratio was decreased from 53,887 to 9.98, which is an acceptable rate. The data sampling model increases the total dataset size by about 17%, which increases the training time of the system.
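The idea of scoring predictions only on original records, with the synthetic ones filtered out through an identifier, can be sketched as follows. The ID column is represented here by a boolean flag per record, which is an assumption made for illustration; the function name and the toy values are also hypothetical.

```python
def accuracy_on_original(y_true, y_pred, is_synthetic):
    """Accuracy computed only over records that are not synthetic."""
    pairs = [(t, p) for t, p, s in zip(y_true, y_pred, is_synthetic) if not s]
    if not pairs:
        raise ValueError("no original samples to score")
    return sum(t == p for t, p in pairs) / len(pairs)


# Toy example: the last three records are synthetic and are ignored.
toy_true = [0, 0, 1, 1, 1, 1]
toy_pred = [0, 1, 1, 1, 0, 1]
toy_flag = [False, False, False, True, True, True]
toy_accuracy = accuracy_on_original(toy_true, toy_pred, toy_flag)
print(toy_accuracy)  # two of the three original records are correct
```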

E. MACHINE LEARNING ALGORITHMS
As mentioned previously, for the anomaly-based intrusion detection systems, firstly, the normal behavior of the network flow should be determined. To accomplish this, the system needs to be trained using a learning algorithm. The literature offers many machine learning algorithms. To select the most suitable one for our purposes, we implemented six of them, which are detailed in the following part.

1) ADABOOST ALGORITHM
Adaptive Boosting (AdaBoost) is an ensemble learning boosting algorithm that is used for classification problems [40]. ''Boosting'' is the process of achieving a strong result by combining weak results from the data. It weights the data evenly in the first step and then makes a classification. Through this classification, it finds the weakest classifier and then updates the weights. It focuses on the worst result during the updating process. After a while, it brings together several weak classifiers to form a successful classifier. Its aim is to increase its success in terms of classification. The final classifier for the dataset can be written as in Equation 2.

$$F(x) = \operatorname{sign}\left(\sum_{i=1}^{M} \theta_i f_i(x)\right) \qquad (2)$$

where $f_i$ stands for the $i$th weak classifier, $\theta_i$ is the corresponding weight, and $M$ is the number of weak classifiers.
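The final combination step, a weighted vote of weak classifiers, can be sketched as below. This shows only the binary form with labels −1 and +1, which is an assumption made to keep the example short; the weak classifiers and their weights are toy values, not trained ones.

```python
def adaboost_predict(x, weak_classifiers, weights):
    """Sign of the weighted sum of weak-classifier votes (votes are -1 or +1)."""
    score = sum(theta * f(x) for f, theta in zip(weak_classifiers, weights))
    return 1 if score >= 0 else -1


# Three toy decision stumps on a single numeric feature.
toy_stumps = [
    lambda x: 1 if x > 0 else -1,
    lambda x: 1 if x > 2 else -1,
    lambda x: 1 if x > -1 else -1,
]
toy_weights = [0.5, 0.3, 0.9]  # hypothetical weights

print(adaboost_predict(1.0, toy_stumps, toy_weights))   # 0.5 - 0.3 + 0.9 > 0, so +1
print(adaboost_predict(-2.0, toy_stumps, toy_weights))  # all votes are -1, so -1
```

A stump with a large weight can outvote several weaker ones, which is how the combined classifier ends up stronger than any single member.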

2) DECISION TREE ALGORITHM
The Decision Tree (DT) is one of the supervised learning algorithms used for the classification of numerical and categorical data. It has a predefined target variable. It also has leaf nodes supported by decision-making steps to reach one of the goals in the top-down structure of the algorithm [41]. It takes advantage of its simple structure to process large amounts of data quickly. In some cases, more complex trees are needed to classify a dataset. In such cases, decision trees become more complex, and it becomes more difficult to reach any of the goals. Overfitting is another problem in decision tree algorithms. Some of the leaf nodes are pruned out of the decision tree to solve this problem. Entropy and information gain should be calculated for decision trees. Equation 3 shows how entropy is calculated.

$$H(S) = -\sum_{x \in X} p(x) \log_2 p(x) \qquad (3)$$

where $S$ is a dataset, $X$ is the set of classes in $S$, and $p(x)$ is the ratio of the number of elements in class $x$ to the number of elements in $S$. Equation 4 shows how information gain is calculated.

$$IG(S, T) = H(S) - \sum_{t \in T} p(t) H(t) \qquad (4)$$

where $T$ is the set of subsets created from the dataset $S$, and $p(t)$ is the ratio of the number of elements in subset $t$ to the number of elements in $S$.
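Entropy and information gain as described above can be computed directly from class labels. The following sketch uses the standard definitions; the function names and the toy labels are illustrative assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(labels, subsets):
    """Entropy of the parent set minus the size-weighted entropy of its subsets."""
    total = len(labels)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(labels) - weighted


toy_parent = ["attack"] * 4 + ["benign"] * 4
perfect_split = [["attack"] * 4, ["benign"] * 4]
useless_split = [["attack", "attack", "benign", "benign"]] * 2

print(entropy(toy_parent))                          # 1.0 bit for a 50/50 mix
print(information_gain(toy_parent, perfect_split))  # 1.0, the split removes all uncertainty
print(information_gain(toy_parent, useless_split))  # 0.0, the split tells nothing
```

A decision tree prefers the split with the highest information gain at each node, which is why the first split above would be chosen over the second.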

3) RANDOM FOREST ALGORITHM
The Random Forest (RF) is a type of supervised machine learning architecture that can be used for classification and regression problems [42]. It is easy to use, and it solves problems by creating a decision forest from Decision Trees. For this, it creates a random collection of trees. During the process, more than one Decision Tree is trained to yield the most accurate classification. Most of the time, even without hyperparameter tuning, it can give quite good results. It is one of the most highly preferred methods because it quickly provides accurate results even for mixed, incomplete, and noisy datasets.

4) K NEAREST NEIGHBOR ALGORITHM
The K Nearest Neighbor (KNN) algorithm is a supervised learning algorithm. Unlike other supervised learning algorithms, it does not have a training stage [43]. KNN is implemented using data from the original sample classes. To decide which class a new data point should be assigned to, the K data points closest to it are selected. The distance from the new data point to the data in the original sample classes is calculated, and the K nearest neighbors are taken. Euclidean, Manhattan, and Minkowski functions are used for distance calculations. Equations 5, 6, and 7 show how these distances are calculated.

$$d_{\text{Euclidean}}(x, y) = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2} \qquad (5)$$

$$d_{\text{Manhattan}}(x, y) = \sum_{i=1}^{N} |x_i - y_i| \qquad (6)$$

$$d_{\text{Minkowski}}(x, y) = \left(\sum_{i=1}^{N} |x_i - y_i|^k\right)^{1/k} \qquad (7)$$

where $N$ is the number of coordinates, $k$ is a positive integer, and $x_i$ and $y_i$ are the $i$th coordinates of the data. This method is simple and highly resistant to noisy training data. However, it has the disadvantage of requiring a lot of memory space because it stores all the cases for distance calculations.
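The three distance functions and the nearest-neighbor vote can be sketched together, since the Manhattan and Euclidean distances are the Minkowski distance of order 1 and 2. The function names, the parameter names `order` and `n_neighbors`, and the toy points are assumptions made for this illustration.

```python
from collections import Counter


def minkowski(x, y, order):
    """Minkowski distance of the given order between two points."""
    return sum(abs(a - b) ** order for a, b in zip(x, y)) ** (1 / order)


def manhattan(x, y):
    return minkowski(x, y, 1)


def euclidean(x, y):
    return minkowski(x, y, 2)


def knn_predict(points, labels, query, n_neighbors=3, order=2):
    """Majority label among the n_neighbors stored points closest to query."""
    ranked = sorted(zip(points, labels), key=lambda pair: minkowski(pair[0], query, order))
    votes = Counter(label for _, label in ranked[:n_neighbors])
    return votes.most_common(1)[0][0]


toy_points = [[0, 0], [0, 1], [1, 0], [5, 5], [6, 5], [5, 6]]
toy_classes = [0, 0, 0, 1, 1, 1]

print(euclidean([0, 0], [3, 4]))  # 5.0
print(manhattan([0, 0], [3, 4]))  # 7.0
print(knn_predict(toy_points, toy_classes, [0.2, 0.2]))  # 0
```

Note that every prediction sorts all stored points, which reflects the memory and computation cost mentioned above.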

5) GRADIENT BOOSTING ALGORITHM
The Gradient Boosting Algorithm (GB) is used for regression and classification problems [44]. Similar to the AdaBoost algorithm, it combines weak classification models, generally decision trees, into a single model. The estimates are updated along the gradient according to the learning rate, with the aim of reaching the minimum error values.

6) LINEAR DISCRIMINANT ANALYSIS
LDA is used to reduce the number of dimensions because it makes calculation easier, takes steps to classify data in the best way, and reduces underfitting/overfitting problems [45]. Also, LDA can be used for data preprocessing before classification. It examines the distribution of classes for classification and finds the difference between the average values so that it creates subspaces. Although LDA is similar to PCA, LDA maximizes the distance between classes, whereas PCA has no class concept and only tries to maximize the distance between data points.
The hyperparameter values of the machine learning algorithms are shown in Table 8. These are the default values in Python's Scikit-Learn library. In the literature, the hyperparameters of machine learning algorithms are commonly left at their defaults. Therefore, these values were left as default values in the study so that comparisons with other studies can be made.

V. EXPERIMENTAL RESULTS
In this study, the performance of machine learning algorithms in intrusion detection is examined. Training and tests were conducted on the most recent dataset available (CSE-CIC-IDS2018). The parameters were left at their default values in all the implemented algorithms except for KNN. In the KNN algorithm, the number of classes was set to six (one for the non-attack type and five for the attack types). To decrease the variability of the performance results due to the random generation of training and test sets, the K-Fold Cross-Validation method was used in the experiments. The chosen K value was 5, so that in each fold the data were divided into 80% for training and 20% for testing.
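The 80%/20% division follows directly from K = 5, as the sketch below illustrates. The helper name and the sample count are hypothetical, and a plain shuffled split is assumed; whether the folds in the experiments were stratified by class is not stated.

```python
import random


def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds nearly equal folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(n_samples=100, n_folds=5))
```

With 100 samples, every split holds 80 training and 20 test indices, and the reported metric would be the average over the five test folds.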
The proposed systems were implemented in the Python programming language using the Keras/TensorFlow and Scikit-learn libraries. To measure the performance metrics, the experiments were executed on a workstation with the properties shown in Table 9. The proposed systems were executed on the multicore structure of an NVIDIA GeForce GTX 1080 Ti graphics card, whose specifications are detailed in Table 10.
To measure the performance of the proposed systems, the Accuracy, Precision, Recall, F1-Score, and Error Rate values are used [46]. These metrics are calculated as follows.
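A sketch of the macro-averaged forms of precision, recall, and F-score, written with the symbols defined in the next sentence, is given below. These are the commonly used multi-class definitions and are an assumption about the exact equations intended. Accuracy and error rate additionally depend on how true negatives are counted, so they are left out of this sketch.

```latex
\begin{align}
\mathrm{Precision} &= \frac{1}{l} \sum_{i=1}^{l} \frac{TP_i}{TP_i + FP_i} \\
\mathrm{Recall} &= \frac{1}{l} \sum_{i=1}^{l} \frac{TP_i}{TP_i + FN_i} \\
F_{\beta} &= \frac{(\beta^2 + 1)\,\mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2\,\mathrm{Precision} + \mathrm{Recall}}
\end{align}
```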
where TP_i, FP_i, and FN_i are the numbers of True Positives, False Positives, and False Negatives of the i-th class, l is the number of classes, and β is the balancing factor. The most common choice for β is 1, which yields the harmonic mean of precision and recall. The definition of accuracy that is used is critical because accuracy is the most vital metric for measuring the effectiveness of prediction systems. Accuracy often refers to the complete accuracy of the system. However, Accuracy_i can also refer to the individual accuracy of class i. For an imbalanced dataset, the final definition of accuracy, which is the average of the individual accuracies, is critical for researchers.
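The difference between the complete accuracy and the average of the individual accuracies can be made concrete with a small sketch. It assumes that the individual accuracy of class i is the share of class-i samples that are labeled correctly and that precision and recall are macro-averaged; the function names and the toy labels are hypothetical.

```python
def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def macro_scores(y_true, y_pred, labels, beta=1.0):
    """Macro-averaged scores from per-class TP, FP, and FN counts."""
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative
    n_classes = len(labels)
    precision = sum(safe_div(tp[c], tp[c] + fp[c]) for c in labels) / n_classes
    recall = sum(safe_div(tp[c], tp[c] + fn[c]) for c in labels) / n_classes
    f_score = safe_div((beta ** 2 + 1) * precision * recall,
                       beta ** 2 * precision + recall)
    # Assumption: individual accuracy of class i = correctly labeled
    # class-i samples divided by all class-i samples.
    class_accuracy = {c: safe_div(tp[c], tp[c] + fn[c]) for c in labels}
    return {
        "complete_accuracy": sum(tp.values()) / len(y_true),
        "average_accuracy": sum(class_accuracy.values()) / n_classes,
        "class_accuracy": class_accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Toy imbalanced case: 8 majority samples, 2 minority samples, one missed.
y_true = ["Benign"] * 8 + ["SQL Injection"] * 2
y_pred = ["Benign"] * 8 + ["Benign", "SQL Injection"]
scores = macro_scores(y_true, y_pred, labels=["Benign", "SQL Injection"])
```

In this toy case the complete accuracy is 0.9, whereas the average of the individual accuracies is 0.75, because the single missed minority sample halves the accuracy of its class.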
In this paper, we have implemented six different machine learning algorithms, namely K Nearest Neighbor, Adaboost, Random Forest, Gradient Boosting, Decision Tree, and Linear Discriminant Analysis. The performance metrics are obtained on both the original dataset and the dataset extended with sampled data for the attack types. As the first metric, the accuracy is measured. The results of these measurements are depicted in Table 11, which also shows the execution time of the whole process.
Although accuracy is the most prominent metric used for comparing the performance of IDSs, other metrics (e.g., precision, recall, and F1-score) should also be measured. Table 12 lists these metrics according to the type of machine learning algorithm used.
As can be seen in the table, the Adaboost algorithm is the most successful algorithm with an accuracy rate of 99.69%. The Decision Tree algorithm is the second-most efficient one with an accuracy rate of 99.66%. These are followed by the other algorithms applied to both the original dataset and the sampled dataset.
However, looking only at the complete accuracy does not yield precise comparisons. Because multi-class IDSs are implemented in this paper, the accuracy related to each attack type should be examined separately. New intruders are sophisticated; therefore, the average accuracy of the different algorithms for each attack type should be considered to determine the efficiency of the system. Accuracies for all attack types are calculated for each machine learning algorithm, as depicted in Table 13. As can be seen from this table, three types of attacks (Brute Force, Infiltration, and SQL Injection) are associated with relatively low accuracy rates.
The low accuracy rates for these attack types stem from their small data size in the dataset, as shown in Table 2. Together, they make up about 3% of the total volume. To increase these rates, new data are generated synthetically, and the total number of these attacks is increased by up to 16.2%, as depicted in Table 5. Then, the proposed algorithms are executed again, and the obtained accuracy rates are depicted in Table 14.
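The idea behind the synthetic data generation can be sketched as follows: a new minority sample is placed at a random position on the line segment between an existing minority sample and one of its nearest minority neighbors. This is a simplified, SMOTE-like illustration rather than the exact procedure used in the study; the function name, parameters, and toy samples are hypothetical.

```python
import math
import random


def smote_like(minority, n_new, n_neighbors=3, seed=0):
    """Create n_new synthetic points between minority samples and neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbors of the chosen sample, excluding itself.
        others = [m for m in minority if m is not base]
        neighbors = sorted(others, key=lambda m: math.dist(m, base))[:n_neighbors]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position on the segment from base to neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Toy minority class with two features; the values are made up.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
synthetic = smote_like(minority, n_new=6)
```

Because every synthetic point is an interpolation between two real minority samples, it stays inside the region already occupied by that class, which raises the class size without copying samples verbatim.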
As seen in Table 13 and Table 14, the use of sampled data results in small enhancements for the first three majority data types (Benign, Bot, and DoS); their individual accuracies are compared in Fig. 4. This comparison covers six machine learning algorithms and three data types, that is, 18 values. For five of these values, the original dataset provides the best result; for six of them, the results are the same; and for seven of them, the sampled data provides the best accuracy rate. In the minority classes, however, there are considerable increases, as seen in Fig. 5. In these classes, the accuracy increases by 72.35% on average. Thus, although the changes for the three majority data types are small, the remaining classes show a considerable enhancement in terms of accuracy. Fig. 5.a-c depict the accuracy levels of the three minority data types (i.e., Brute Force attacks, Infiltration attacks, and SQL Injection attacks). From this figure, the improvements in accuracy can be seen clearly.
The comparisons discussed above cover the accuracy, execution time, precision, recall, and F1-score of the proposed algorithms. However, to measure the efficiency of the system, a comparison is also made between the present study and a recent work published in 2018, the results of which are depicted in Table 15.
The present study and the comparison study [15] have one machine learning algorithm in common (Random Forest). The use of sampled data leads to a considerable increase in the accuracy of the system, as a 99.34% accuracy rate is measured. A significant difference between the papers is that, instead of the NSL-KDD dataset, which is not up to date, this paper uses CSE-CIC-IDS2018 [21]. Additionally, a comparison with the other machine learning algorithms (i.e., SVM, RBF, and ELM) shows that the trained IDSs are more efficient than these algorithms.
The effect of the sampled data size is also measured by testing our proposal with different dataset sizes. First, the original dataset is used for training the system. Second, the size of the minority classes is set to 93,063 (the size of the fourth major data class). Finally, the size of the minority classes is set to 286,191 (the size of the third major data class). A comparison of the average accuracies of these datasets is depicted in Fig. 6. As seen in this figure, when the size of a minority data class is increased, the average accuracy rate also increases.

VI. CONCLUSION
In recent years, due to the extended use of the Internet, computing devices can connect to a global network at any time and from anywhere. However, the anonymous nature of the Internet leads to many security breaches in the network, which result in intrusions. Additionally, current attackers are more sophisticated, and with the help of automated production tools, they can generate new malware that exploits the weak detection capability of Intrusion Detection Systems (IDSs). IDSs are generally trained using pre-collected datasets. However, almost all these datasets are imbalanced, with imbalance ratios ranging from 648 to 112,287. Imbalanced datasets result in bias towards the majority class, and in some extraordinary situations, minority classes are ignored. However, these minority classes are generally the positive classes. Therefore, the imbalance ratio should be decreased to increase the efficiency of the system and to increase its average accuracy.
In this paper, six different machine learning models (Decision Tree, Random Forest, K Nearest Neighbor, Adaboost, Gradient Boosting, and Linear Discriminant Analysis) were implemented using a recent dataset (CSE-CIC-IDS2018). To decrease the imbalance ratio, a data sampling model was used to increase the data size of the minority groups. The experimental results showed that the implemented models have a very good accuracy level when compared with the recent literature. The use of a sampled dataset increased the average accuracy of the models by between 4.01% and 30.59%.
Nowadays, due to the efficiency of big data applications, many machine learning applications are being transferred to deep learning models. This paper is a preliminary study for examining the success of deep learning algorithms in detecting small-sample attacks in up-to-date datasets. Therefore, deep learning algorithms should be used in future work. By using a different design methodology, it is expected that the efficiency of the system will increase.