A GBDT-Paralleled Quadratic Ensemble Learning for Intrusion Detection System

The development of computer and network technology has brought convenience to daily life; however, network attacks and intrusions emerge endlessly. Intrusion Detection Systems (IDS) have been developed to confront network attacks, and IDS research has become one of the most popular fields in recent years. This paper proposes a Gradient Boosting Decision Tree (GBDT)-paralleled quadratic ensemble learning method for intrusion detection. We use GBDT to handle the spatial part of traffic data and a Gated Recurrent Unit (GRU) model, specially modified for network traffic, to handle the temporal part. Then, to combine the spatial and temporal features, we fuse the GBDT model and the GRU model into a quadratic ensemble model that serves as our final intrusion detection system. Experimental results on the CICIDS2017 dataset show that the proposed spatial-temporal intrusion detection system based on ensemble learning achieves better accuracy, recall, precision and F1 score than state-of-the-art methods. The accuracies of detecting benign, port scan, Distributed Denial of Service (DDoS), infiltration and web attack traffic are up to 99.9%, 99.9%, 99.9%, 99.9%, and 99.9%, respectively. We also apply our method to an Information-Centric Networking (ICN) dataset, and the results show that it achieves much better performance than existing methods.


I. INTRODUCTION
In recent years, Intrusion Detection Systems (IDS) have been widely used in many domains. Pan et al. [1] developed a hybrid intrusion detection system for power systems. Aloqaily et al. [2] built a special intrusion detection system for connected vehicles in smart cities. Ambusaidi et al. [3] created an IDS to study traffic problems. Hodo et al. [4] applied IDS to Internet of Things networks. Moreover, due to the explosive growth of the Internet, more and more researchers pay attention to IDS in the field of Internet security [5]-[8].
Nowadays, the continuous development of computer and network technology has made our life more convenient. Meanwhile, network-oriented crimes have become more and more frequent, and various types of attacks have emerged in an endless stream. Serious network security events have also greatly increased. In order to effectively prevent and predict network attacks and security threats, various Intrusion Detection Systems (IDS) [9] have been designed to detect intrusions by analyzing network traffic data. In order to understand the state of the whole network by fusing sensor data from distributed sources, some scholars have proposed the concept of cyberspace situational awareness (CSA) [10], which evaluates network security in time to identify and predict potential attacks. Nowadays, many types of IDS play an important role in the network security business, enhancing the security of the network and protecting users from cyberattacks. Scholars have used various methods to classify attacks, including traditional methods based on hidden Markov models [11], gray Verhulst models [12] and so on.
The associate editor coordinating the review of this manuscript and approving it for publication was Le Hoang Son.
In recent years, with the rapid development of machine learning techniques, more and more researchers have tried to apply machine learning methods to the IDS problem [13]-[16]. Many scholars have used Support Vector Machine (SVM) [17]-[19], Convolutional Neural Network (CNN) [20]-[23], Recurrent Neural Network (RNN) [24], [25], Long Short-Term Memory (LSTM) [26]-[30], Gated Recurrent Unit (GRU) [34]-[36], ensemble learning methods [31]-[33], especially Gradient Boosting Decision Tree (GBDT) [37]-[41], and many other kinds of machine learning methods in IDS. These methods have effectively improved classification accuracy. Among them, RNN (including GRU and LSTM) and ensemble learning (especially GBDT) show a strong ability to deal with the IDS problem. RNN is a class of artificial neural networks in which connections between nodes form a directed graph along a temporal sequence. Due to this structure, it can exhibit temporal dynamic behavior. Unlike traditional feedforward neural networks, RNNs can use their memory (internal state) to process sequences of inputs. An RNN can be viewed as multiple copies of the same network, each passing a message to its successor. It was first used for natural language modeling; however, due to its strong ability to deal with temporal sequences, it is now quite popular in the cyberattack field.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
GRU is a successful model for forecasting cyberattacks, but it has a serious drawback. Compared with traditional ensemble learning models, it struggles to capture spatial features, because the core idea of RNN is to deal with the temporal relationship in data. In reality, cyberattacks normally look disorganized and fragmented. Sometimes it is nearly impossible to detect a cyberattack from the temporal relationship alone, and we need to rely on other information to help identify it.
One of the essential characteristics of a cyberattack is that the attacker needs to use his own computer or a proxy server to start the attack. No matter how much effort is spent covering up the action, some spatial traces are left, and we can use the IP address, subnet information and network topology information to mine those spatial features. It is therefore natural to expect that spatial features hide very useful information about cyberattacks. Past research shows that ensemble learning models have a strong ability to deal with spatial relationships. GBDT is a classical ensemble learning method which combines weak classifiers (normally decision trees) to create a strong classifier. GBDT is good at dealing with nonlinear relationships, especially in fragmented datasets. Because the spatial relationship in network data is disorganized and fragmented, GBDT has good potential for detecting cyberattacks.
In order to deal with the spatial-temporal relationship in IDS, we create a GBDT-Paralleled Quadratic Ensemble Learning method. We use the GRU model as our basic model, and then fuse GBDT and GRU through quadratic ensemble learning to create a new model. Our quadratic ensemble learning model absorbs the spatial processing capability of GBDT and obtains better predictive performance than existing models. Compared with the methods of the most recent 3 years, our method achieves state-of-the-art results.
Also, in the majority of studies, the datasets used were outdated, including the well-known KDD Cup'99 dataset [42] and ISCXIDS2012 (Intrusion Detection Evaluation Dataset provided by the Canadian Institute for Cybersecurity in 2012) [43]. Over time, the attack patterns and attack characteristics in traditional cybersecurity datasets have come to differ from today's new types of cyberattacks, so those outdated datasets are not suitable for today's cybersecurity research.
In 2018, a new IDS dataset called CICIDS2017 (Intrusion Detection Evaluation Dataset provided by the Canadian Institute for Cybersecurity in 2017) [44] was proposed by Iman et al. from the Canadian Institute for Cybersecurity. This dataset includes not only traditional attacks such as Denial of Service (DoS), Distributed Denial of Service (DDoS), and port scanning, but also some new types of attacks and intrusions, such as the attack based on the Heartbleed bug, which was discovered only in recent years. Compared with work on outdated datasets, research on network security and intrusion based on CICIDS2017 can be better applied to today's network environment.
Recently, much research has focused on Information-Centric Networking (ICN). ICN is a kind of future network architecture which can provide users with high-quality services, for example in the field of video and music. However, even though ICN is a very promising network architecture, IDS for ICN is still a new field of research.
The Internet today was built with the narrow waist of IP. In ICN the new narrow waist is the content [45]. In recent years, ICN has attracted more and more attention from researchers, and many ICN projects have been developed, such as Data-Oriented Network Architecture [46], Network of Information [47], Content-Centric Networking (CCN) [48], the Named Data Networking project [49], which is an extension of CCN, and the Publish-Subscribe Internet Technologies [50], [51]. All these projects changed the IP narrow waist architecture to make content the primary part of communication, which gives ICN good scalability. To enhance mobility support, projects such as MobilityFirst [52] were created. The separation of identifier (the name of the content) and locator (the network address) gives ICN good mobility support. By using the methods such as Distributed Hash Table ( [54] classifies the ICN attack into four categories, i.e., naming, routing, caching, and other miscellaneous related attacks. We gathered ICN logs in our lab and compiled them into the Chinese Academy of Sciences Intrusion Detection Dataset (CAS2018). We use our CAS2018 dataset to test the model's ability to detect network attacks in ICN.
As mentioned before, the traditional GRU model has a strong ability to deal with temporal relationships and has been very successful in detecting cyberattacks. However, it has several drawbacks that can be improved. In past research, researchers used a unidirectional GRU to detect cyberattacks. In reality, however, different kinds of cyberattacks may have hidden relationships. For example, port scan attacks on a target network normally happen after an infiltration attack, so later information can also help us discriminate earlier attacks. In our model, we use Bi-GRU as a substitute for GRU. The results show that, compared with the traditional GRU, Bi-GRU has a relatively better ability to detect cyberattacks.
Also, GRU cannot deal with the spatial relationship well. Some methods try to handle the spatial-temporal relationship, but every method has drawbacks. The hierarchical spatial-temporal features-based intrusion detection system (HAST-IDS) [55] uses the original data flow as input and treats the traffic data flow as a two-dimensional image. This method uses CNN to capture spatial features and LSTM to explore the temporal relationship in a network dataset. This structure combines CNN and RNN into a single model and is quite popular in image captioning. The authors were early to apply this structure to cyberattack detection and obtained a statistically significant result. However, it ignores the topology of the network, so the spatial structure is hard to interpret. In addition, the CNN structure is a black box, so its explanatory power is limited and the model is prone to overfitting. The different characteristics of images and network flows also reduce the reliability of HAST-IDS.
In order to avoid this problem, we create a new model based on a Bi-GRU and utilize GBDT to enhance GRU. We use feature engineering to create a list of features that include topology information and other spatial features, and then perform feature selection to choose reliable spatial features.
Our model uses GBDT to enhance the ability to deal with spatial features. Past research has shown that GBDT is especially good at handling fragmented spatial information, so compared with past models, our model has a better spatial processing capacity.
Finally, we combine the two different models through ensemble learning. GRU is a deep learning method that captures the inner temporal relationship, whereas GBDT is an ensemble learning method that is good at dealing with fragmented spatial information. These two methods rest on totally different theories, so combining them can absorb their merits in two dimensions (spatial and temporal) and address the spatial-temporal problem in network data.
The rest of this paper is organized as follows. Section II describes the related work, including intrusion detection techniques, GRU model, GBDT model, CICIDS2017 dataset and CAS2018 dataset. Section III describes our special feature engineering methods for extracting the spatial-temporal feature of the data, the proposed model in detail and metrics. Section IV contains the evaluation and discussion of our experiment. Section V provides the conclusion of our work and presents future work for researchers.
The contributions of our paper are as follows:
1. We utilize spatial information in a novel way to predict cyberattacks. Nowadays, the majority of researchers focus on the temporal relationship in cyberattacks. Although several researchers consider the spatial relationship, they only use basic spatial features such as the IP address [56] or the potential spatial relationship in the original data flow [55], and do not explore the inner spatial relationship, such as the topology relationship. We use a list of spatial features, including topology and other spatial information, so that we can mine the deep spatial characteristics of the network to help detect cyberattacks.
2. We use ensemble learning to combine the GBDT model and the GRU model into one model. Past research has shown that GRU has a strong ability to capture temporal information in cyberattacks. On the other hand, GBDT is designed to deal with discrete, fragmented and non-significant information. We use ensemble learning to absorb these merits, so our new method has a strong ability to deal with the spatial-temporal relationship in cyberattacks. Also, the basic ideas of GBDT and GRU are totally different, so the ensemble method can maximize the merits of the two methods and avoid overfitting.
3. Nowadays, ICN has become a hot research topic. ICN is regarded as a kind of future network architecture because, compared with traditional networks, it can provide users with better service for video or music content. However, as Dainotti mentioned [57], the majority of institutions do not want to share their traffic datasets because of privacy requirements or other reasons. Besides, most well-known datasets are based on traditional IP traffic, and there are no public traffic datasets based on ICN. In order to test the robustness and generalization of our method, we create an ICN traffic dataset. The results show that our method has a satisfactory ability to detect cyberattacks.

II. RELATED WORK
A. INTRUSION DETECTION TECHNIQUES
Intrusion detection can be divided into signature-based detection and anomaly-based detection according to the detection approach [60]. Signature-based detection matches existing rules and recognizes existing patterns [61]. The alarm is activated once a match occurs. However, this method requires a large rule signature database and cannot detect unknown attacks. Anomaly-based detection techniques include traditional probabilistic methods and a newer set of machine learning-based methods [62]. In recent years, many scholars have used machine learning algorithms to conduct research related to intrusion detection. Wang et al. [17], Thaseen and Kumar [18] and Gu et al. [19] use SVM for intrusion detection, but they simply apply the model without taking into account the special characteristics of the network data itself, resulting in lower detection accuracy. Zhang et al. [20], Potluri et al. [21] and Li et al. [22] use CNN for intrusion detection, but they ignore the temporal features of network data. Wang et al. transform the traffic input into a two-dimensional image and use CNN and LSTM to classify traffic packets [55], but this research focuses only on the payload of the packet and ignores the spatial characteristics between traffic flows. Hao et al. [24] and Bouzar-Benlabiod et al. [25] use RNN to forecast network attacks, but a simple RNN cannot achieve a great result. Kim et al. [26], Kim and Kim [28] and Staudemeyer and Omlin [27] use an LSTM-based neural network to process network data; however, LSTM contains many parameters and has a large computational overhead, and these works also ignore the spatial characteristics. Han et al. have used a Wasserstein Generative Adversarial Network (WGAN) [56] to process traffic data; nevertheless, models based on Generative Adversarial Networks (GAN) require a very long training time and are very hard to train. Many scholars have used GBDT to deal with network traffic [37]-[41] and obtained great results. Hao et al. [34] and Xu et al. [35] used GRU to exploit the temporal relationship of network flows; however, GRU is a variant of the RNN model which focuses on the inner long-term temporal relationship in the data flow but has a relatively poor ability to capture spatial information. Therefore, we propose a method that comprehensively utilizes the spatial-temporal features of network data to perform intrusion detection with good performance.

B. GATED RECURRENT UNIT
Gated recurrent units (GRU) [63] are a special version of the recurrent neural network (RNN) [64]. RNN is an extension of a conventional feedforward neural network that handles variable-length sequences. However, RNN is easily affected by short-term memory. If a sequence is long, it is quite hard for an RNN to carry useful information from the early part of the sequence to the later part, so an RNN may ignore important information in a long sequence.
Also, during backpropagation, RNN suffers from a serious vanishing gradient problem. The gradient is used to update the weights of the network. If the gradient decays to 0, the network stops updating.
LSTM [65] is one of the most popular methods for addressing this problem, but LSTM has three gates and two memory cells, so the training time is quite long when we deal with a huge dataset. Unfortunately, most IDS datasets are huge, so the LSTM model is hard to train.
GRU is designed to have a more persistent memory, thereby making it easier for RNNs to capture long-term dependencies. Compared with LSTM, GRU has fewer gates and memory cells. Basically, it combines the forget gate and input gate into a single update gate and merges the two memory cells into one. As a result, the computation of GRU is simpler, so it is easier to train. The GRU has two gates, a reset gate and an update gate, and one memory. The following shows how a GRU uses the hidden state $h^{(t-1)}$ and the input $x^{(t)}$ to calculate the next hidden state $h^{(t)}$.
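The equations referenced in this subsection are not reproduced in this text. A common formulation of the GRU that agrees with the description of the reset signal, update signal, new memory and hidden state given here can be written as below. The weight matrices $W^{(z)}, U^{(z)}, W^{(r)}, U^{(r)}, W, U$, the omission of bias terms, and the ordering of the equations are notational assumptions rather than details taken from the original; $\sigma$ is the logistic sigmoid and $\circ$ is the element-wise product.

```latex
\begin{align}
z^{(t)} &= \sigma\left(W^{(z)} x^{(t)} + U^{(z)} h^{(t-1)}\right) \\
r^{(t)} &= \sigma\left(W^{(r)} x^{(t)} + U^{(r)} h^{(t-1)}\right) \\
\tilde{h}^{(t)} &= \tanh\left(W x^{(t)} + r^{(t)} \circ U h^{(t-1)}\right) \\
h^{(t)} &= z^{(t)} \circ h^{(t-1)} + \left(1 - z^{(t)}\right) \circ \tilde{h}^{(t)}
\end{align}
```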
As in (1), (2), (3), (4), the reset signal $r^{(t)}$ is responsible for determining how important $h^{(t-1)}$ is to the computation of the new memory $\tilde{h}^{(t)}$. It is calculated from the past hidden state $h^{(t-1)}$ and the input $x^{(t)}$. The reset gate can completely discard the past hidden state if it finds that $h^{(t-1)}$ is irrelevant to the computation of the new memory.
The update signal $z^{(t)}$ is responsible for determining how much of the past hidden state $h^{(t-1)}$ is carried over to the new hidden state $h^{(t)}$. It is also calculated from the past hidden state $h^{(t-1)}$ and the input $x^{(t)}$. The value of the update signal $z^{(t)}$ lies between 0 and 1. If $z^{(t)}$ is close to 0, the new hidden state is almost equal to the new memory $\tilde{h}^{(t)}$. If $z^{(t)}$ is close to 1, the new hidden state is almost equal to the past hidden state $h^{(t-1)}$.
The new memory $\tilde{h}^{(t)}$ summarizes the new input $x^{(t)}$ and the hidden state $h^{(t-1)}$. The reset signal $r^{(t)}$ is used as a weight to control how much of the hidden state $h^{(t-1)}$ is transferred to the new memory. The new memory $\tilde{h}^{(t)}$ therefore contains past information from $h^{(t-1)}$ as well as the input $x^{(t)}$. GRU uses tanh as the activation function, which squashes the output into the range [−1, 1] and keeps it zero-centered.
The hidden state $h^{(t)}$ is a weighted sum of the past hidden state $h^{(t-1)}$ and the new memory $\tilde{h}^{(t)}$. The update signal $z^{(t)}$ controls the weighting between the two.
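To make the roles of the two gates concrete, the following is a minimal sketch of a single GRU step in plain Python. It follows the behavior described above (an update signal near 1 keeps the past hidden state, an update signal near 0 takes the new memory). The parameter names (`Wz`, `Uz`, `Wr`, `Ur`, `W`, `U`), the absence of bias terms, and the toy weights are assumptions made for illustration, not the configuration used in the experiments.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def gru_step(x, h_prev, p):
    """One GRU step; p is a dict of weight matrices (hypothetical names)."""
    # Update signal z and reset signal r, both computed from x and h_prev.
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h_prev))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h_prev))]
    # New memory: the reset signal scales the contribution of the past state.
    uh = matvec(p["U"], h_prev)
    h_new = [math.tanh(a + r_i * b) for a, r_i, b in zip(matvec(p["W"], x), r, uh)]
    # Hidden state: z near 1 keeps h_prev, z near 0 takes the new memory.
    return [z_i * hp + (1.0 - z_i) * hn for z_i, hp, hn in zip(z, h_prev, h_new)]


# Toy weights and a made-up input sequence (2 inputs, 2 hidden units).
params = {name: [[0.1, -0.2], [0.3, 0.05]] for name in ("Wz", "Uz", "Wr", "Ur", "W", "U")}
h = [0.0, 0.0]
for x in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    h = gru_step(x, h, params)
```

Because each new hidden state is a convex combination of the previous state and a tanh output, every component of `h` stays strictly between −1 and 1.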

C. BIDIRECTIONAL GATED RECURRENT UNIT
Bidirectional gated recurrent units (Bi-GRU) are a variant of GRU that combines a normal (forward) GRU with a reverse (backward) GRU. At each moment, the output of the Bi-GRU is determined by the two GRUs running in opposite directions. It can capture influences in the data flow in both directions, so compared with a normal GRU, Bi-GRU is better at handling data in which earlier and later records are connected.
In network data, if a data flow contains cyberattacks, there are normally several of them. Later cyberattacks may indicate cyberattacks that happened earlier. Also, cyberattacks within the same data flow tend to be similar. The influence of later data on earlier data may therefore help us to discriminate cyberattacks. For this reason, we decided to use Bi-GRU in our research.
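The bidirectional mechanism described above can be sketched independently of the cell itself: one recurrent pass runs over the sequence from the start, another runs from the end, and the two states at each position are paired. In the sketch below, `toy_step` is a deliberately simple stand-in for a GRU cell (a running average), and the sequence values are made up; both are assumptions for illustration only.

```python
def run_direction(seq, step, h0):
    """Apply a recurrent step from left to right and return every hidden state."""
    states, h = [], h0
    for x in seq:
        h = step(x, h)
        states.append(h)
    return states


def bidirectional(seq, step_fwd, step_bwd, h0=0.0):
    """Pair the forward state and the backward state at each time step."""
    fwd = run_direction(seq, step_fwd, h0)
    # Run over the reversed sequence, then flip back so index t lines up.
    bwd = run_direction(seq[::-1], step_bwd, h0)[::-1]
    return list(zip(fwd, bwd))


def toy_step(x, h):
    """Stand-in for a GRU cell (assumption): a running average of the input."""
    return 0.5 * h + 0.5 * x


flows = [0.0, 0.0, 1.0, 1.0]   # made-up scores; only the later records look suspicious
states = bidirectional(flows, toy_step, toy_step)
# states[0] == (0.0, 0.1875): the forward state has seen nothing suspicious yet,
# while the backward state already carries information from the later records.
```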

D. GRADIENT BOOSTING DECISION TREE
Gradient Boosting Decision Tree (GBDT) is a machine learning method for classification and regression problems, which produces a prediction model in the form of an ensemble of basic decision trees, typically Classification And Regression Trees (CART). The idea of GBDT was first proposed by Friedman [66], and the functional gradient view of boosting has led to the development of boosting algorithms in many areas of machine learning and statistics beyond regression and classification. Compared with traditional classifiers, GBDT can produce competitive, robust, interpretable procedures for both classification and regression, and it is especially appropriate for mining less-than-clean data.
Like other boosting methods, GBDT combines decision trees into a single strong learner in an iterative fashion.
The goal of GBDT is to teach a model $F$ to predict values of the form $\hat{y} = F(x)$ by minimizing a loss function, such as the mean squared error (MSE) $\frac{1}{n}\sum_{i}(\hat{y}_i - y_i)^2$. We let $J$ denote the number of terminal nodes in the trees, which can be adjusted by hand. The parameter $J$ controls the maximum allowed level of interaction between variables in the model. With $J = 2$, no interaction between variables is allowed. With $J = 3$, the model may include effects of the interaction between up to two variables, and so on. Normally, $4 \le J \le 8$ works well, and the results are fairly insensitive to the choice of $J$ in this range. At the $m$-th step, a decision tree $h_m(x)$ is fitted; let $J_m$ be the number of its leaves. The tree partitions the input space into $J_m$ disjoint regions $R_{1m}, \ldots, R_{J_m m}$ and predicts a constant value in every region. Using the indicator notation, the output of $h_m(x)$ can be written as a sum, as in (5).
Here $b_{jm}$ is the value predicted in the region $R_{jm}$. We can multiply the coefficients $b_{jm}$ by some value $\gamma_m$, chosen to minimize the loss function, and the model is updated as in (6), (7).
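The equations referenced in this subsection are not reproduced in this text. A standard formulation of these steps, using the symbols defined above, is sketched below; the indicator function $\mathbf{1}_{R_{jm}}$, the sample index $i$, the sample count $n$, and the ordering of the equations are notational assumptions.

```latex
\begin{align}
h_m(x) &= \sum_{j=1}^{J_m} b_{jm}\,\mathbf{1}_{R_{jm}}(x) \\
F_m(x) &= F_{m-1}(x) + \nu\,\gamma_m h_m(x) \\
\gamma_m &= \arg\min_{\gamma} \sum_{i=1}^{n} L\bigl(y_i,\; F_{m-1}(x_i) + \gamma\, h_m(x_i)\bigr)
\end{align}
```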
Here $\nu$ is the shrinkage, which improves the generalization ability of the model.
Because an exact solution is difficult to obtain for a general loss function $L(y, F)$ and base learner $h(x)$, given any approximator $F_{m-1}(x)$, the term $\gamma_m h_m(x)$ can be viewed as a greedy step toward the data-based estimate of $F^*(x)$ under the ''direction'' $h_m(x)$, a member of the parameterized class of $h(x)$. It can thus be regarded as a steepest descent step under that constraint. According to the definition of steepest descent, we obtain the updated model at each step, and we repeat this process until we get the final result.
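The iterative procedure described above can be illustrated with a very small gradient boosting loop. The sketch below is an assumption-laden toy, not the model used in the experiments: it boosts two-leaf trees ($J = 2$) on one-dimensional inputs with squared-error loss, for which the negative gradient is proportional to the residual and the leaf means are already the loss-minimizing leaf values, so the separate line search for $\gamma_m$ is omitted. The function names and the data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a two-leaf regression tree (J = 2) to the residuals by exhaustive search."""
    best = None
    for threshold in sorted(set(xs))[:-1]:      # needs at least two distinct inputs
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        b_left, b_right = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - b_left) ** 2 for r in left) + sum((r - b_right) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, b_left, b_right)
    _, threshold, b_left, b_right = best
    return lambda x: b_left if x <= threshold else b_right


def fit_gbdt(xs, ys, n_trees=20, shrinkage=0.5):
    """Boost stumps on squared error and return the fitted prediction function."""
    f0 = sum(ys) / len(ys)                      # constant initial model
    trees = []

    def predict(x):
        return f0 + shrinkage * sum(tree(x) for tree in trees)

    for _ in range(n_trees):
        # For squared error, the negative gradient is proportional to the residual.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        trees.append(fit_stump(xs, residuals))
    return predict


xs = [1, 2, 3, 4, 5, 6]                         # made-up 1-D feature
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]             # made-up targets
model = fit_gbdt(xs, ys)
```

With a shrinkage of 0.5, each round in this toy example halves the remaining residual, which shows how smaller shrinkage trades slower fitting for a more conservative model.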

E. QUADRATIC ENSEMBLE
Ensemble methods are learning algorithms that construct multiple classifiers and then make a prediction by taking a vote of their individual predictions. They are very popular in classification. However, in the IDS problem, although the spatial information can be defined, the temporal information is hard to identify. In order to combine the temporal information with the spatial information, we use a quadratic ensemble to build a new model. First, we use a traditional ensemble learning method, GBDT, to make predictions based on the spatial features, and a temporal method, Bi-GRU, to capture temporal information. Then we integrate the GBDT model and the Bi-GRU model to form the quadratic ensemble. The details are shown in Fig. 2.
Compared with a normal ensemble model or other traditional classifiers, such as a neural network model, the quadratic ensemble is more flexible and can find a balance between an ensemble of multiple simple classifiers and a single complex classifier, so it can achieve better performance on the IDS problem.
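As an illustration of how the outputs of a spatial model and a temporal model might be integrated, the sketch below blends two per-class probability vectors with a weighted average and picks the class with the highest fused score. The fusion rule is not specified in this section, so weighted averaging, the weight value, the function names, and the probabilities shown are all assumptions for illustration only.

```python
def fuse_probabilities(p_gbdt, p_gru, weight_gbdt=0.5):
    """Blend two per-class probability vectors by a weighted average."""
    if len(p_gbdt) != len(p_gru):
        raise ValueError("both models must score the same classes")
    w = weight_gbdt
    return [w * a + (1.0 - w) * b for a, b in zip(p_gbdt, p_gru)]


def predict_class(p_gbdt, p_gru, classes, weight_gbdt=0.5):
    """Return the label with the highest fused probability."""
    fused = fuse_probabilities(p_gbdt, p_gru, weight_gbdt)
    return classes[max(range(len(fused)), key=fused.__getitem__)]


CLASSES = ["benign", "dos_ddos", "port_scan", "web_attack", "infiltration"]
p_spatial = [0.40, 0.10, 0.45, 0.03, 0.02]    # made-up GBDT output
p_temporal = [0.30, 0.05, 0.60, 0.03, 0.02]   # made-up Bi-GRU output
label = predict_class(p_spatial, p_temporal, CLASSES)
```

A weighted average keeps the fused vector a valid probability distribution whenever both inputs are, which makes the weight easy to tune on a validation set.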

F. CICIDS2017 DATASET
The CICIDS2017 dataset was used in our study. This dataset is a network traffic dataset for cybersecurity research provided by the Canadian Institute for Cybersecurity [35], containing normal traffic and a range of different types of attack traffic. In our study, we selected the traffic data from July 3rd to July 7th, including benign traffic, DoS/DDoS attacks, port scans, web attacks, and infiltration traffic. Each traffic record comes with a pcap traffic file and over 80 features, including source IP, source port, destination IP, destination port, number of packets, protocols, and so on. Because the CICIDS2017 dataset is highly imbalanced, the class imbalance needs to be handled before starting the experiment [58], [59]. In the CICIDS2017 dataset, there are 14 minor classes. We delete 3 uncommon classes and merge the others into 5 major classes to relieve the class imbalance, as shown in Table 1. The 3 minor classes we deleted are FTP-Patator, SSH-Patator and Bot. In real networks, FTP transmits user information in plaintext and has serious security problems, so it is normally replaced by SFTP or SCP, which are based on SSH. In SSH, for security reasons, people usually use RSA keys rather than password login. Brute force is a password-guessing attack, so it does not work against RSA keys. Even when password login is used, common tools like fail2ban can ban an IP address from connecting to the service after several wrong password attempts, which prevents brute force. Because changing IP addresses is costly and brute force requires a very large number of password attempts, this attack is easy to prevent. On the other hand, the botnet Ares is an uncommon kind of attack in datasets; in fact, several famous datasets, such as UNSW-NB15 and CIDDS-001, do not include it. We therefore decided to delete these three classes.
After merging the classes, we also resample the dataset. In the end, every class accounts for more than 1% of the dataset; the resulting statistics are shown in Table 2.
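The class deletion and merging step can be sketched as a simple label mapping followed by a check of the class shares. The mapping below is a hypothetical, partial example: the label strings and their assignment to merged classes are assumptions, and Table 1 remains the authoritative definition. The resampling method itself is not specified in this section, so the sketch only shows how the 1% share could be checked.

```python
from collections import Counter

# Hypothetical, partial mapping from raw labels to merged classes (see Table 1).
MERGE = {
    "BENIGN": "benign",
    "DoS Hulk": "dos_ddos",
    "DDoS": "dos_ddos",
    "PortScan": "port_scan",
    "Web Attack XSS": "web_attack",
    "Infiltration": "infiltration",
}
DROPPED = {"FTP-Patator", "SSH-Patator", "Bot"}   # the three deleted classes


def merge_labels(labels):
    """Drop the deleted classes and map the remaining labels to merged classes."""
    return [MERGE[label] for label in labels if label not in DROPPED]


def class_shares(labels):
    """Fraction of samples per class, e.g. to check the 1% floor after resampling."""
    counts = Counter(labels)
    return {cls: n / len(labels) for cls, n in counts.items()}


raw = ["BENIGN", "BENIGN", "DoS Hulk", "DDoS", "Bot", "PortScan", "SSH-Patator"]
merged = merge_labels(raw)
shares = class_shares(merged)
```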

G. CAS2018 DATASET
The CAS2018 dataset was created by our lab.
As Dainotti et al. [57] pointed out, many companies do not want to share their traffic datasets because of privacy and other considerations. Besides, most well-known datasets are based on traditional IP traffic, and there are no public traffic datasets based on ICN. In order to test the ability of our model to detect attacks in ICN, we created an ICN traffic dataset. Our lab hosts an experimental ICN environment for several ICN research projects. There are 14 PCs and 42 sensors linked to our network, and they run our special ICN protocol. Every day, the sensors collect data and register those data in our ICN system. As shown in Fig. 1, computers and sensors are linked to the special ICN router in our lab, and two infrastructure servers, called the Naming Server and the Naming Resolution Server (NRS), are available in this network. The Naming Server is in charge of naming all the content and devices in this network, and the Naming Resolution Server is used as a resolution database, similar to DNS in a traditional IP network.
As shown in Fig. 1, in our ICN network, an ICN router links all devices running our ICN protocol. Sensors generate ICN content all the time, including camera data and temperature data. Computers can request ICN content from the ICN network and communicate with each other as well. The Naming Server and NRS provide naming and resolution services for other devices. We collect the original data from our ICN network devices and label each record with its traffic type. As in Table 3, there are seven types of traffic in our dataset:
Normal Traffic - Traffic of normal and safe behaviors.
DDoS to Resolution Server - DDoS attack traffic to the public NRS. This type of attack can make the NRS deny service, so that resolution queries time out.
DDoS to Naming Server - DDoS attack traffic to the public naming server. This type of attack can prevent newly generated content from getting a name from the naming server.
Flooding Attack - Attack traffic which can cause network congestion.
Jamming - Attack for jamming the network.
Packet Mistreatment - Traffic of abnormal or broken packets.
Impersonation - Traffic using other devices' private keys without authorization.
The CAS2018 dataset is available at the authors' GitHub homepage (github.com/yangjun1994/CAS2018). Currently, there are 8 files in our dataset, and we may update this dataset in our future experiments. The development of the CAS2018 dataset is shown in Fig. 2.
We ran the experiment from 2nd January 2018 to 9th January 2018 and recorded the logs of our system. All computers run Ubuntu 16.04 as their operating system and are connected to the ICN router. Sensors are installed on Raspberry Pi model B boards with the official Raspbian system. The Raspberry Pi has a wireless adapter, so our sensor data can be transported to the ICN router through the wireless network. The ICN router allows devices to send packets by providing only the payload and the name of the receiver, without giving the IP address, and the resolution system is in charge of providing the current address of the receiver for packet routing. The files in our dataset are as follows.
''Attack.type'' is the definition of the mentioned normal traffic and 6 attack types.
''Sensors.na'' contains the sensors' device names and their GUIDs given by our Naming Server. In our database, the sensors include cameras, temperature sensors, and humidity sensors. The sensors were installed on many Raspberry Pi boards, and each sensor on the same Raspberry Pi was given a unique name as a unique content source.
''Devices.key'' includes the hash of keys owned by all devices. This file can be used to verify the sender of a packet. If a packet cannot be verified, it may be an attack or a mistreatment packet.
''Publicserver.na'' includes the addresses of the naming and resolution servers. This file can be used to verify the authenticity of those servers.
''NRS.log'' and ''NCS.log'' are the log files of our resolution and naming servers, including the timestamp, source address, requested content and other essential information.
''Traffic.log'' is the log file of all ICN traffic data, including timestamp, source and destination address, packet size, and other traffic features.
The feature extraction of the CAS2018 dataset is as follows. Our CAS2018 dataset is based on traffic and log files that record the timestamp, packet size, packet type, etc. We extract features from this dataset. First of all, based on the information in the packets, we build features such as the number of packets and bytes per second, the duration of the flow, and the forward and backward inter-arrival times.
In the original traffic records, the packet type is only tagged as sig (meaning signal) or data. We therefore extend the types of signal packets: if a packet is tagged as sig, we search the server logs to find out which behavior this signal packet represents, and re-tag it with the real behavior of the signal. (The behavior of every legal signal must be one of the following: register to name server, query from name server, register to resolution server, query from resolution server.) Additionally, in the logs of the name server and resolution server, there is a column for user authentication (1 means successful authentication and 0 means failed authentication), so we associate the authentication result with the traffic information to make a new feature.
Finally, the geographic locations of devices are recorded in the files named Computers.na and Sensors.na. For each packet, we add the geographic location of both the source and destination devices.
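The flow-level features mentioned above (packet and byte rates, flow duration, inter-arrival time) can be computed from timestamped packet records along the following lines. This is a minimal sketch under stated assumptions: the record layout `(timestamp_seconds, size_bytes)`, the function name and the sample flow are hypothetical, at least one packet per flow is assumed, and the sketch does not separate forward and backward directions as the actual feature set does.

```python
def flow_features(packets):
    """Summarize one flow given (timestamp_seconds, size_bytes) tuples."""
    times = sorted(t for t, _ in packets)
    duration = times[-1] - times[0]
    # Gaps between consecutive packets give the inter-arrival times.
    gaps = [b - a for a, b in zip(times, times[1:])]
    total_bytes = sum(size for _, size in packets)
    return {
        "duration": duration,
        "packets_per_second": len(packets) / duration if duration > 0 else 0.0,
        "bytes_per_second": total_bytes / duration if duration > 0 else 0.0,
        "mean_inter_arrival": sum(gaps) / len(gaps) if gaps else 0.0,
    }


flow = [(0.0, 100), (0.5, 300), (2.0, 200)]   # made-up packets of one flow
features = flow_features(flow)
```

Splitting `packets` by direction before calling the function would yield the forward and backward variants of the inter-arrival feature.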

III. PROPOSED METHODOLOGY
A. FEATURE ENGINEERING AND DATA PRE-PROCESSING
Because of the connections between network hosts and the relevance of subnets, network traffic data often has certain spatial characteristics. In past research, scholars tended to focus on optimizing the model and ignored the study of the network itself. In our research, we analyze the topological connectivity and host relationships of the network, adding multi-dimensional feature information about the subnets and the spatial location of each traffic connection.
There is also a certain temporal correlation within network data. For example, a DDoS attack consists of a series of attacks in a short period of time. Past research has shown that the RNN model has a strong ability to identify the inner temporal relationships in cyberattack datasets. Our features therefore consist of a spatial part and a temporal part. For the temporal part, we use the raw traffic flow as our temporal data $X_{tem}$.
For the spatial data, we create a list of features to help us analyze spatial information.
To begin with, we analyze the geographical location information of each traffic flow and create a list of geographic-location spatial features. The IP address may hide some very important information about cyberattacks. In a real attack, the resources a hacker has are limited, so the attack may concentrate on several specific network segments. We parse the IP addresses of each traffic flow and create a list of spatial features based on the subnet and host.
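As a minimal sketch of subnet- and host-based features, the following uses Python's standard `ipaddress` module. The /24 prefix length, the IPv4-only handling, the feature names and the function name `ip_spatial_features` are illustrative assumptions; the text does not specify which subnet granularity or which derived features were actually used.

```python
import ipaddress

def ip_spatial_features(src_ip, dst_ip, prefix_len=24):
    """Derive subnet/host features from IPv4 source and destination addresses.

    prefix_len=24 is an illustrative assumption, not a value from the paper.
    """
    src = ipaddress.ip_address(src_ip)
    dst = ipaddress.ip_address(dst_ip)
    # strict=False accepts a host address and returns its enclosing network.
    src_net = ipaddress.ip_network(f"{src_ip}/{prefix_len}", strict=False)
    dst_net = ipaddress.ip_network(f"{dst_ip}/{prefix_len}", strict=False)
    host_mask = (1 << (32 - prefix_len)) - 1  # low bits identify the host
    return {
        "src_subnet": str(src_net.network_address),
        "dst_subnet": str(dst_net.network_address),
        "src_host": int(src) & host_mask,
        "dst_host": int(dst) & host_mask,
        "same_subnet": int(src_net == dst_net),
        "src_is_private": int(src.is_private),
        "dst_is_private": int(dst.is_private),
    }

feats = ip_spatial_features("192.168.10.50", "192.168.10.3")
other = ip_spatial_features("192.168.10.50", "172.16.0.1")
```

The subnet strings are categorical and would still need to be encoded before being passed to a classifier.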
A growing body of research shows that topological relationships play an essential role in cyberattacks. For example, if device A and device B share the same switch/router or belong to the same company, we consider that there is a topological relationship between these two computers. The topological relationship reveals a potential relationship between different devices. If device A is attacked, device B and the devices that have a topological relationship with device B have a higher risk of being attacked. The CICIDS 2017 dataset provides a network topology [35], and we use this information to build our topology features. We create a list of topological spatial features based on the topological relationships between traffic flows.
We now have many types of features, including discrete and continuous, numeric and string. This variety of feature types increases the difficulty of data processing. In order to handle these features more robustly and effectively, we use the one-hot strategy and other methods to normalize our features. Although string-like features such as host addresses and protocol ports look like numbers, they clearly cannot be treated directly as numbers. One-hot encoding turns each value of a discrete feature into a separate feature, so each new feature takes only the values 0 and 1, indicating whether the original feature has that value. By using one-hot encoding, we extend the values of discrete features to the Euclidean space, where each value of the discrete feature corresponds to a certain point.
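The one-hot strategy described above can be sketched in a few lines of plain Python. The function name `one_hot_encode` and the example protocol values are hypothetical; in practice a library encoder would typically be used instead.

```python
def one_hot_encode(values):
    """Map each distinct value to a 0/1 indicator column.

    Returns (categories, rows): `categories` lists the distinct values in
    sorted order and rows[i][j] is 1 exactly when values[i] == categories[j].
    """
    categories = sorted(set(values))
    index = {cat: j for j, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

protocols = ["tcp", "udp", "tcp", "icmp"]
categories, encoded = one_hot_encode(protocols)
# categories -> ["icmp", "tcp", "udp"]
# encoded    -> [[0, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```

Each row is a point in a three-dimensional Euclidean space, and all distinct categories are equally far apart, which avoids imposing an artificial order on values such as port numbers.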
After feature engineering, we obtain 80 spatial features. However, there are strong correlations between these features, which may lead to overfitting when we train our model. Also, too many features increase the dimension of our data and the computation time of the model.
We use two methods to select our spatial features. First, we use a random forest model to calculate the importance of each feature. We randomly choose 50% of all training data and use the random forest model to calculate the feature importances. We repeat this 100 times, record the importances each time, and rank the features based on the recorded importances.
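The repeated-subsampling procedure can be sketched as follows. To keep the example dependency-free, the importance function here is a stand-in (absolute Pearson correlation with the label), not a random forest; in practice one would plug in a function that fits a random forest on the subsample and returns its feature importances. The function names, the toy data and the use of the mean to aggregate the recorded importances are all assumptions, since the text does not state how the 100 recordings are combined.

```python
import random
from statistics import mean

def abs_correlation_importance(X, y):
    """Stand-in importance: |Pearson correlation| of each column with y.

    This is NOT a random forest; it only keeps the sketch self-contained.
    """
    y_mean = mean(y)
    var_y = sum((b - y_mean) ** 2 for b in y)
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        c_mean = mean(col)
        cov = sum((a - c_mean) * (b - y_mean) for a, b in zip(col, y))
        var_c = sum((a - c_mean) ** 2 for a in col)
        denom = (var_c * var_y) ** 0.5
        scores.append(abs(cov / denom) if denom > 0 else 0.0)
    return scores

def rank_features(X, y, importance_fn, n_repeats=100, sample_frac=0.5, seed=0):
    """Average importances over repeated random subsamples, then rank."""
    rng = random.Random(seed)
    n_features = len(X[0])
    totals = [0.0] * n_features
    k = max(2, int(len(X) * sample_frac))
    for _ in range(n_repeats):
        idx = rng.sample(range(len(X)), k)  # random 50% subsample by default
        scores = importance_fn([X[i] for i in idx], [y[i] for i in idx])
        totals = [t + s for t, s in zip(totals, scores)]
    averaged = [t / n_repeats for t in totals]
    # Feature indices ordered from most to least important.
    ranking = sorted(range(n_features), key=lambda j: averaged[j], reverse=True)
    return ranking, averaged

# Toy data: column 0 equals the label, column 1 is constant, column 2 is noisy.
X = [[0, 5, 1], [1, 5, 1], [0, 5, 0], [1, 5, 1],
     [0, 5, 0], [1, 5, 0], [0, 5, 1], [1, 5, 1]]
y = [0, 1, 0, 1, 0, 1, 0, 1]
ranking, averaged = rank_features(X, y, abs_correlation_importance, n_repeats=20)
```

Averaging over many subsamples makes the ranking less sensitive to any single random draw of the training data.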
We also use the PCA method to reduce the dimension of the spatial feature space. PCA is a classical mathematical method for dimensionality reduction. The idea of PCA is to project the data onto the directions of largest variance and to discard the less important directions, which also removes some noise.
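For reference, the standard PCA formulation can be written as follows. The symbols ($X_c$ for the centered $n \times d$ spatial feature matrix, $w_i$ and $\lambda_i$ for the eigenvectors and eigenvalues of the covariance matrix, $k$ for the number of retained components) are notation introduced here for illustration and do not appear in the original text.

```latex
% X_c: centered n x d spatial feature matrix (d = 80 before reduction)
\Sigma = \frac{1}{n-1} X_c^{\top} X_c, \qquad
\Sigma\, w_i = \lambda_i w_i, \quad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0,
\qquad
Z = X_c W_k, \quad W_k = [\, w_1 \; \cdots \; w_k \,],
\qquad
\text{explained variance ratio} =
\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i}.
```

The explained variance ratio indicates how much of the total variance is kept by the first $k$ components and is a common guide for choosing $k$.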
Based on these two methods, we choose 31 features as our spatial data, which we call $X_{spa}$. Because we use the GRU as our temporal model, we use the raw temporal data flow as our temporal data $X_{tem}$.
Through the above feature engineering, our data becomes standardized or encoded data containing comprehensive information about time and space, so that model construction and training can be carried out.

B. GBDT-PARALLELED QUADRATIC ENSEMBLE LEARNING
GRU is one of the most successful temporal methods, and much research has demonstrated its ability to deal with temporal information. But as we mentioned before, its design gives it an innate drawback in dealing with spatial information. In order to deal with the spatial-temporal relationship in IDS, we create a GBDT-Paralleled Quadratic Ensemble Learning method for intrusion detection. We select the GBDT model as our spatial classifier and the Bi-GRU model as our temporal classifier. The structure of our quadratic ensemble learning is shown in Fig. 2.
First of all, we split our training data $X^{tra}$ into five folds. For every fold $m$, we take $X^{tra}_{spa} - X^{tra}_{spa,m}$, i.e., the spatial training data with fold $m$ removed, as the new spatial training dataset. We use this dataset to train a GBDT model and obtain GBDT model $m$, which is our spatial classifier $m$. We use spatial classifier $m$ to make predictions on $X^{tra}_{spa,m}$ and get the spatial training prediction $Y^{tra}_{spa,m}$. We combine these five spatial training predictions into one final spatial training dataset $Y^{tra}_{spa}$. For the temporal training data $X^{tra}_{tem}$, we do exactly the same thing but use the Bi-GRU instead of the GBDT model. Similarly, we get the final temporal training dataset $Y^{tra}_{tem}$. We combine $Y^{tra}_{spa}$ and $Y^{tra}_{tem}$ to get our final training data $Y^{tra}$.
As for our test data $X^{tes}$, just as before, we split it into five folds. For each fold $m$, we use the spatial classifier $m$ and temporal classifier $m$ obtained above to make predictions. Please note that we do not train the classifiers again but use the already trained models to make predictions. In the end, we get a spatial test prediction $Y^{tes}_{spa,m}$ and a temporal test prediction $Y^{tes}_{tem,m}$. We calculate the averages of $Y^{tes}_{spa,m}$ and $Y^{tes}_{tem,m}$ over $m$ separately and get $Y^{tes}_{spa}$ and $Y^{tes}_{tem}$. We combine the spatial and temporal datasets to get $Y^{tes}$. Now that we have a training dataset $Y^{tra}$ and a test dataset $Y^{tes}$, we train a final meta classifier on $Y^{tra}$ (here we use the LightGBM method) and use it to make our final prediction on $Y^{tes}$.
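The two-level procedure can be sketched with a generic out-of-fold stacking routine. This is a simplified illustration under several assumptions, not the authors' implementation: the base learners are toy stand-ins for GBDT and Bi-GRU, both learners see the same toy input (whereas the paper feeds spatial and temporal data separately), the `fit`/`predict` callables are a hypothetical interface, and each fold model predicts the whole test set before averaging, which is a common stacking variant and one reading of the averaging step.

```python
from statistics import mean

def out_of_fold_stack(X_train, y_train, X_test, fit, predict, n_folds=5):
    """Return (train_meta, test_meta) for one base learner.

    fit(X, y) -> model and predict(model, X) -> list of scores are
    caller-supplied callables (hypothetical interface for this sketch).
    """
    n = len(X_train)
    folds = [list(range(m, n, n_folds)) for m in range(n_folds)]
    train_meta = [None] * n
    test_preds = []
    for held_out in folds:
        held = set(held_out)
        rest = [i for i in range(n) if i not in held]
        model = fit([X_train[i] for i in rest], [y_train[i] for i in rest])
        # Predict only on the fold the model has not seen during training.
        fold_preds = predict(model, [X_train[i] for i in held_out])
        for i, p in zip(held_out, fold_preds):
            train_meta[i] = p
        # Reuse the same fold model on the test data (no retraining).
        test_preds.append(predict(model, X_test))
    # Average the per-fold test predictions element-wise.
    test_meta = [mean(col) for col in zip(*test_preds)]
    return train_meta, test_meta

# Toy base learners standing in for GBDT (spatial) and Bi-GRU (temporal).
def fit_threshold(X, y):
    pos = [x[0] for x, label in zip(X, y) if label == 1]
    neg = [x[0] for x, label in zip(X, y) if label == 0]
    return (mean(pos) + mean(neg)) / 2  # midpoint between class means

def predict_threshold(model, X):
    return [1.0 if x[0] > model else 0.0 for x in X]

def fit_mean(X, y):
    return mean(y)  # "model" = positive rate of the training part

def predict_mean(model, X):
    return [model for _ in X]

X_tra = [[float(i)] for i in range(10)]
y_tra = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
X_tes = [[1.5], [8.5]]

spa_tra, spa_tes = out_of_fold_stack(X_tra, y_tra, X_tes, fit_threshold, predict_threshold)
tem_tra, tem_tes = out_of_fold_stack(X_tra, y_tra, X_tes, fit_mean, predict_mean)

# Meta-features: one column per base learner. A meta classifier
# (LightGBM in the paper) would be trained on Y_tra and applied to Y_tes.
Y_tra = list(zip(spa_tra, tem_tra))
Y_tes = list(zip(spa_tes, tem_tes))
```

Because every training prediction comes from a model that never saw that sample, the meta classifier is trained on predictions that resemble what it will receive at test time.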
In this way, we obtain a GBDT-Paralleled Quadratic Ensemble Learning method that takes into account both the spatial and temporal features of the network data. Algorithm 1 shows the concrete steps, and the table of notation (Table 4) explains the symbols used in the algorithm. Table 5 and Table 6 show the hyperparameter search ranges for our classifiers and the optimum hyperparameter values in the final model. The confusion matrix used to evaluate the model is shown in Table 7.

Algorithm 1 GBDT-Paralleled Quadratic
The explanation is as follows:
True Positive (TP): An actual attack is classified as an attack.
True Negative (TN): Actual benign traffic is classified as benign.
False Positive (FP): Actual benign traffic is classified as an attack.
False Negative (FN): An actual attack is classified as benign.
Based on the above confusion matrix, we can calculate Accuracy, Precision, Recall and F1 Score to evaluate the performance of our model.
Accuracy: Accuracy, as in (10), is one of the most common performance metrics used in the evaluation of models. It is defined as the ratio of correctly predicted traffic to all traffic.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{10}$$
Precision: Precision, as in (11), is the ratio of correctly predicted attack traffic to all traffic predicted as attack. High precision means few false positives among the predicted attacks. It is a metric for judging the correctness of the predicted attacks.
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{11}$$
Recall: Recall, as in (12), is the ratio of correctly predicted attack traffic to all traffic in the actual attack class. It is a metric for judging the ability to find all actual attacks.
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{12}$$
F1 score: F1 score, as in (13), is the harmonic mean of the model's precision and recall, with a maximum of 1 and a minimum of 0. Therefore, it takes both false positives and false negatives into account.
$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{13}$$
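The four metrics follow directly from the confusion-matrix counts, as the short sketch below shows using their standard definitions. The function name `classification_metrics`, the convention of returning 0.0 when a denominator is zero, and the example counts are assumptions made for illustration; the counts are not results from the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts.

    Returning 0.0 for an empty denominator is a convention of this sketch.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Illustrative counts only (not results from the paper).
balanced = classification_metrics(tp=8, tn=88, fp=2, fn=2)

# A rare attack class: accuracy is 0.99 although half of the attacks are missed.
skewed = classification_metrics(tp=1, tn=98, fp=0, fn=1)
```

The second example illustrates why accuracy alone can be misleading on imbalanced traffic and why precision, recall and F1 are reported alongside it.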

IV. EVALUATION
We run our model on a workstation with an Intel Core™ i9-7900X CPU @ 3.30 GHz, 128 GB RAM, a 1 TB SSD, a 4 TB HDD, and Nvidia Titan™ Xp and Nvidia Tesla™ P-100 GPUs. The details are shown in Table 8. The runtime of our model is 85 min. To compare performance with our proposed method, we use the SVM model as a classical model; the LSTM, Bi-LSTM, GRU and Bi-GRU models as traditional temporal methods; LightGBM [67] and eXtreme Gradient Boosting (XGBoost) [68] as the best traditional ensemble models with strong classification results; and 7 state-of-the-art methods.
The 7 state-of-the-art methods are as follows:
LSTM-Attention was proposed by Lin et al. [69] in July 2019. They used an LSTM model to build a deep neural network model and added an attention mechanism to enhance the performance of the model.
HAST-IDS was proposed by Wang et al. [55]. They used a CNN model and an LSTM model to build a deep neural network model and obtained very good results.
Serpil et al. proposed Shallow Neural Network (SNN) and Deep Neural Network (DNN) models with an autoencoder to detect cyberattacks [70] in June 2019. Their DNN model with autoencoder achieved the best performance in their work.
Razan et al. proposed a feature dimensionality reduction approach for machine learning [59] in February 2019. They used a random forest (RF) model with Principal Component Analysis (PCA) and developed the Uniform Distribution Based Balancing (UDBB) method.
Attention Flow-WGAN [56], proposed by Han et al. in June 2019, is a fusion model for intrusion detection that uses WGAN and an attention mechanism to process flow data. It is among the newest research with state-of-the-art results.
Reinforcement Learning-based Intrusion Detection System (RL-IDS) is a big data-driven IDS method proposed by Otoum et al. [71]. It achieved extremely good results on big sensed data for intrusion detection.
Deep belief and Decision Tree-based Hybrid Intrusion Detection System (D2H-IDS) [2], developed by Aloqaily et al., is a combination of a Deep Belief Network and a Decision Tree. It was built as an intrusion detection system for connected vehicles in smart cities, and its performance on the NSL-KDD dataset was nearly perfect.
The performance of our method and the above methods in detecting different kinds of traffic is shown in Table 9, Table 10, Table 11, Table 12 and Table 13. Table 9 shows the experimental performance on benign traffic, which reflects the comprehensive ability to detect network attacks. Our model outperforms the other models on almost all metrics: it achieved the highest accuracy, recall and F1 score and the second highest precision among all models. Overall, the performance of our model is the best among all models. The classical SVM model got the lowest score on every metric.
In the CICIDS2017 dataset, the PortScan attack occurs after the infiltration attack: the attacker executes the PortScan from the victim's computer. According to the results in Table 10, almost all models except the DNN with autoencoder achieve satisfactory results (all scores over 0.95). The performance of our model is the best among the selected models and reaches 0.999 on all metrics. LightGBM, XGBoost and RL-IDS also get nearly perfect results, which implies that ensemble models have a great ability to deal with PortScan data. On the other hand, the performance of the models with spatial structure shows that the bidirectional structure may have a slight advantage in catching the spatial-temporal relationship.
A DDoS attack generates a huge network data flow from multiple sources to overwhelm the targeted system and make the online service unavailable.
According to Table 11, the performance of our model in classifying DDoS attacks is extremely high. The LSTM-Attention model and Attention Flow-WGAN also performed really well. The PCA-RF with UDBB model achieved good precision (0.978), but its recall was extremely low (0.078).
On the other hand, the models with temporal structure obtained more stable results than the ensemble models. The ensemble models (LightGBM and XGBoost) achieved good precision (0.931 and 0.952) but relatively low recall (0.697 and 0.745). This shows that the ensemble learning models are able to detect DDoS attacks but miss a considerable share of them: high precision with low recall means that flows flagged as DDoS are mostly correct, while many actual DDoS flows are classified as normal data flow or as other types of attacks.
The reason may be that there are inner spatial relationships in a DDoS attack. As we know, DDoS traffic arrives very frequently within a short time, and the destination IP address is the same. In this case, if we can use both the temporal and the spatial information to make predictions, the DDoS attack is not hard to identify. However, it is hard for a simple ensemble model to exploit this information, so its result is not as good as that of the models with spatial structure.
Another interesting thing to note is that the models with a bidirectional structure perform better than the models without it: the average scores of Bi-LSTM and Bi-GRU were slightly higher than those of LSTM and GRU. This may imply that the bidirectional structure has a better ability to catch the inner spatial features of DDoS traffic.
Infiltration of the network from inside normally exploits vulnerable software such as Microsoft Word. After successful exploitation, an infiltration attack leaves a backdoor on the target's computer and starts various attacks on the target's network, such as IP sweep, full port scan and service enumeration. Table 12 shows the performance in detecting infiltration traffic. The accuracy of all models is very good (higher than 0.98). However, the majority of spatial models had relatively low precision, recall and F1 scores. The reason could be that the features of an infiltration attack do not appear until the exploitation succeeds, so there is no strong spatial regularity. Also, the features of the infiltration attack are scattered but may share some general characteristics. Ensemble learning models are good at dealing with such characteristics, so it is no surprise that the LightGBM and XGBoost models have much better results than the spatial models. As for our model, its accuracy and precision are better than those of the others, although Attention Flow-WGAN achieved the best recall. The reason may be that the Attention Flow-WGAN model makes its predictions from the original data flow, which may contain some general characteristics, including hidden information about the infiltration attack.
Web attack is a common attack type. In the CICIDS 2017 dataset, the web attacks include SQL injection, Cross-Site Scripting and Brute Force over HTTP. SQL injection creates a string of SQL commands and then uses these commands to force the database to reply with information. If developers do not test their code properly to find possible script injection, Cross-Site Scripting can happen. Brute Force over HTTP tries a list of passwords to find the administrator's password. Table 13 shows the performance of the methods on web attack traffic. Our model achieved the highest accuracy, precision, recall and F1 score among all models, which shows that our model can detect web attacks effectively.
On the other hand, the bidirectional models and the ensemble models both reached relatively good accuracy and precision, but the ensemble learning models got much better recall and F1 scores. Because web attacks, like DDoS attacks, normally happen very frequently in a short time, the spatial models may confuse these two attacks, which leads to poor recall and F1 scores.
Another interesting thing is that the accuracy of the PCA-RF with UDBB model, the DNN with autoencoder and Attention Flow-WGAN also reached a very high level.
However, these models got very low precision, recall and F1 scores. This means that these models focus on detecting web attacks and misclassify a lot of normal traffic as web attacks. So their comprehensive classification ability is quite low even though their accuracy is very high.
Ensemble models show a strong ability in accuracy and precision and an acceptable ability in recall and F1 score, which shows that ensemble models rarely misclassify normal data flow as attacks. The models with spatial structure also got acceptable accuracy, although their recall scores were relatively lower than those of the ensemble models. In addition, the precision of the bidirectional models is clearly better than that of the models without a bidirectional structure.
Overall, spatial models and ensemble models each have their own advantages in dealing with different attacks. Our model fuses the merits of these two kinds of models and got the best result among these models on the CICIDS 2017 dataset.
However, the traditional dataset is somewhat obsolete for facing future challenges. To address this problem, we created an ICN dataset and used it to make predictions. As shown in Table 14, Table 15, Table 16, Table 17, Table 18, Table 19 and Table 20, we have 7 different classes in our ICN dataset: Normal Traffic, DDoS to Resolution Server, DDoS to Naming Server, Flooding Attack, Jamming, Packet Mistreatment and Impersonation. Because many methods are not suitable for the ICN dataset, we only used 7 traditional methods and our method for the comparison. The results lead to a conclusion similar to the one we drew on the CICIDS 2017 dataset.
Compared with the other models, our method reached a significantly better result on the CAS2018 ICN dataset. This implies that our method is robust and stable across different datasets and can be used in future network architectures such as ICN.

V. DISCUSSION
As shown in Fig. 4 and Fig. 5, our model achieves the best result among these models. The reason may be its advantage in catching long-term spatial-temporal relationships in the network.
On the other hand, the traditional SVM method has relatively poor performance on our dataset compared with the other methods.
The temporal methods (LSTM, Bi-LSTM, GRU and Bi-GRU) get very good results on port scan. It is interesting to note that the performance of the bidirectional structure is clearly better than that of the normal structure on web attacks. The reason may be that web attacks normally happen very frequently and several different types of attacks follow a successful web attack. Because of that, the later attacks can offer very useful information for detecting the web attack.
Attention Flow-WGAN is a WGAN model with an attention mechanism. This method shows a relatively good ability to detect all types of attacks. In fact, the performance of this model and of HAST-IDS is second only to that of our model. The DNN with autoencoder shows a good ability to catch spatial information, so it gets very good results on port scan and DDoS attacks. But its structure focuses on the spatial features and does not consider the inner temporal relations between attacks, so it is hard for it to detect web attacks as well as infiltration.
HAST-IDS combines a CNN model and an LSTM model, so this method has a good ability to deal with both spatial and temporal information. Although this model is not the best one on any particular kind of attack, its performance is very stable and acceptable.
LSTM-Attention is an LSTM model with an attention mechanism, and it has a very strong ability to deal with temporal information. However, it does not include a bidirectional structure, so its performance on web attacks was very poor.
PCA-RF with UDBB achieves good precision on DDoS attacks, but as a whole it is not a successful method on the CICIDS 2017 dataset.
D2H-IDS performs well; however, its ability to detect web attacks is relatively low. The reason may be that this structure was designed as an IDS for connected vehicles, so it may ignore some essential features of web attacks.
RL-IDS has very good performance on DDoS and port scan attacks. However, because it does not include a temporal method, this model is not the best structure for dealing with time-related attacks, such as infiltration and web attacks.
The performance of LightGBM and XGBoost was similar. Compared with the other models, their ability to detect DDoS attacks is not good. The reason is that, although their precision is good, their recall is low: they miss many DDoS flows and label them as normal data flow or as other types of attacks.
Overall, it is clear that the methods with spatial structure perform better than the others, which may imply that spatial information can help us detect cyberattacks. It is also interesting to notice that the spatial models with two directions perform slightly better than the models without two directions.
One of the most essential characteristics of network datasets is that they are huge and multifarious, which increases the danger of overfitting. Our model uses an ensemble learning method to predict cyberattacks, which has a strong ability to deal with the overfitting problem. Also, through feature selection, we limited the number of features and reduced the correlations between features, which decreases the risk of overfitting and increases the reliability of our result.
The GRU model is a kind of neural network model, and its theory is far different from that of the GBDT (whose core theory is the ensemble of weak classifiers). Compared with an ensemble of two similar models, our ensemble model can absorb the merits of both models and achieve a better result than any single model.

VI. CONCLUSION AND FUTURE WORK
In order to utilize the spatial-temporal relationships that exist in network data, we created a GBDT-Paralleled Quadratic Ensemble Learning method for intrusion detection. Past research showed that the GRU model has a memory structure and the ability to capture long-term dependencies. However, it has difficulty catching spatial features. Unlike the traditional approach, we used a GBDT model to enhance the ability to catch the spatial-temporal relationship. Owing to its strength in dealing with disorganized and fragmented data, GBDT has good potential for catching spatial features in cyberattacks. By ensembling the GBDT model and the traditional GRU model, we addressed the problem of the spatial-temporal relationship in network flows and obtained strong results on the CICIDS2017 dataset and the CAS2018 dataset. Compared with other models, our model can identify the mutual influence of spatial relationships and had the best ability to predict cyberattacks.
However, ICN traffic has some unique characteristics. In the traditional IP network architecture, the IP address serves as both identifier and locator. In the ICN architecture, by contrast, the identifier and the locator are not bound together. For example, in the traditional IP network architecture, if a user moves from A to B, the identifier and the locator change together, but in the ICN architecture only the locator changes because the identifier relates only to the user. This feature gives the ICN architecture bright prospects in application scenarios with mobility requirements, such as 5G and the Internet of Vehicles. In our CAS dataset, we gathered the ICN traffic in our experimental network under these considerations, but in reality a production network is much more complex than the experimental network in our lab. So in the future, we will evaluate and optimize our model in a more complex ICN network environment.
JUN YANG received the bachelor's degree from the Beijing University of Technology, Beijing, China, in 2016. He is currently pursuing the Ph.D. degree with the National Network New Media Engineering Research Center, Chinese Academy of Sciences, Beijing. His current research interests include network intrusion detection, network security situational awareness, and machine learning.
YIQIANG SHENG received the master's degree from Nankai University, Tianjin, China, in 2003, and the Ph.D. degree from the Tokyo Institute of Technology, Tokyo, Japan, in 2014. He is currently with the National Network New Media Engineering Research Center, Chinese Academy of Sciences, Beijing, China, as an Academic Researcher and an Associate Professor. His current research interests include smart systems, optimization algorithms, machine learning, big data, and network theory with its applications.
JINLIN WANG received the master's degree from the Institute of Acoustics, Chinese Academy of Sciences, Beijing, China, in 1989. He began to work at the Institute of Acoustics, Chinese Academy of Sciences, in 1989. He is currently the Director of the National Network New Media Engineering Research Center, Chinese Academy of Sciences. His research interests include network media, digital signal processing, source coding and channel coding in the IPTV, the technologies of media streaming-based applications in networks, and wireless communications.