T-DFNN: An Incremental Learning Algorithm for Intrusion Detection Systems

Machine learning has recently become a popular approach to building reliable intrusion detection systems (IDSs). However, most of the models are static and trained using datasets containing all targeted intrusions. If new intrusions emerge, these trained models must be retrained using old and new datasets to classify all intrusions accurately. In real-world situations, new threats continuously appear. Therefore, machine learning algorithms used for IDSs should be able to learn incrementally as these new intrusions emerge. To solve this issue, we propose T-DFNN, an algorithm capable of learning new intrusions incrementally as they emerge. A T-DFNN model is composed of multiple deep feedforward neural network (DFNN) models connected in a tree-like structure. We examined our proposed algorithm using CICIDS2017, an open and widely used network intrusion dataset covering benign traffic and the most common network intrusions. The experimental results showed that the T-DFNN algorithm can incrementally learn new intrusions and reduce the catastrophic forgetting effect. The macro average of the F1-score of the T-DFNN model was over 0.85 for every retraining process. In addition, our proposed T-DFNN model has advantages in several aspects over other models. Compared to the DFNN and Hoeffding tree models trained with a dataset containing only the latest targeted intrusions, our proposed T-DFNN model has higher F1-scores. Moreover, our proposed T-DFNN model has significantly shorter training times than a DFNN model trained using a dataset containing all targeted intrusions. Even though several factors can affect the duration of the training process, the T-DFNN algorithm shows promising results in solving the problem of ever-evolving network intrusion variants.


I. INTRODUCTION
Intrusion detection systems (IDSs) are crucial components in the current computing infrastructures to identify malicious computer network activities [1], [2]. Along with the growth of network-based applications and systems, the number of cyberthreats is increasing [3]. IDSs play a vital role in cybersecurity [4] by forewarning security administrators about malicious activities such as distributed denial-of-service (DDoS), port scan, and SQL injection attacks. Having reliable IDSs is a mandatory safeguard for protecting computing infrastructures against ever-increasing issues of intrusive activities [5].
The idea of creating reliable IDSs with improved accuracy and fewer requirements for human knowledge drives the development of machine learning-based IDSs. Machine learning algorithms such as artificial neural networks (ANNs), fuzzy logic, and support vector machines (SVMs) have become extensively used in IDS studies [6]- [8]. These machine learning algorithms can extract knowledge from datasets through complex pattern-matching processes [6]. Extracting this knowledge requires most machine learning algorithms to be trained using datasets containing all targeted intrusions [9].
The requirement of acquiring datasets containing all targeted intrusions raises an important issue. In real-world situations, security experts collect intrusion data incrementally because intrusions do not emerge at once but gradually over time. It is possible to create a new model for these new intrusions. However, training a model using a dataset containing all intrusions may take a long time. Additionally, it is difficult to modify the previously trained model to accommodate new intrusion variants because the training process is static and only performed once using datasets containing all targeted intrusions. To solve this problem, we need to develop an algorithm that can learn incrementally with a shorter training time as new intrusions emerge. However, the catastrophic forgetting problem becomes the main challenge to realizing this idea [10]- [12].
Catastrophic forgetting is a classic problem faced by many machine learning models and algorithms [12]. Assume we have trained a classification model; then, we retrain this model using a new dataset containing new classes. In this situation, most current classification models may forget how to classify the old classes. Goodfellow et al. [12] explain that when we train a machine learning model with a convex objective, it will always end with the same configuration at the end of the training process, regardless of how it was initialized. For example, a support vector machine (SVM) that is trained on two different tasks will completely forget how to perform the first task. If this retrained SVM model can correctly classify some data from the old task, it is only due to the similarity of both old and new tasks.
This research aims to solve the two problems we previously mentioned: the problem of ever-evolving network intrusion variants and the catastrophic forgetting problem. To solve these issues, we propose an incremental learning algorithm capable of learning new intrusions incrementally as they emerge. Our proposed method is composed of multiple deep feedforward neural network (DFNN) models. Each neuron in the output layer of a DFNN model is linked with another DFNN model, creating a tree structure. Hence, we named our proposed method tree deep feedforward neural networks (T-DFNN). The tree structure in T-DFNN is expandable. New nodes can be added when learning new intrusion variants.
Note that we do not intend to propose our incremental learning model to replace the current standard models, which use a dataset containing all intrusions in the training process. Instead, we intend to propose a model that works alongside existing models. When new intrusions emerge, training a new model using a dataset containing all intrusions is a prolonged process. Our research solves this issue by providing an incremental learning model with a shorter training time without sacrificing the model's performance. While a current standard model is being prepared, this incremental learning model can be used during the interim. The reason is that even though the current standard model has a slow training process, it has a relatively simpler structure, thereby having a faster classification process. This simpler structure is beneficial when used in long-term scenarios.
T-DFNN is a supervised machine learning algorithm. It needs a labeled dataset to perform the training process. In this research, we used the Canadian Institute for Cybersecurity's intrusion detection evaluation dataset 2017 (CICIDS2017) to evaluate our proposed algorithm. CICIDS2017 is a reliable and labeled network intrusion dataset that covers both benign and intrusion traffic. The intrusion traffic in this dataset consists of the most common network intrusions. The creator of this dataset [13] did not design this dataset specifically for the incremental learning problem. Therefore, we divided CICIDS2017 into several batches and then trained the models in our experiment using these batches sequentially to simulate the incremental learning process.
We should note that the incremental learning term has been used rather loosely in the literature. This term refers to several concepts, such as incremental network growing and pruning, online learning, or relearning of formerly misclassified instances [14]. The incremental learning term in this study refers to a machine learning algorithm that meets the following criteria:
1) It can learn new information, e.g., new network intrusion variants.
2) It can preserve previously acquired knowledge; in other words, it should not suffer from the catastrophic forgetting problem.
In summary, we make the following contributions in this paper:
1) We propose T-DFNN: an incremental learning algorithm for IDSs. A T-DFNN model is composed of multiple DFNN models connected in a tree-like structure. This tree structure model can be partially retrained to accommodate new intrusion variants when they emerge.
2) The T-DFNN algorithm can reduce the catastrophic forgetting effect. When a dataset of new intrusions emerges, it preserves the trained nodes while expanding the model by adding new nodes to classify the new intrusions. This mechanism reduces the catastrophic forgetting effect on the model.

3) The T-DFNN algorithm can shorten the training time by limiting the old training dataset needed in the retraining process. In the T-DFNN algorithm, the training dataset in each node is selected based on the classification results of the parent node. Only data classified as the same parent node's output label are used in each node.

The remainder of this paper is arranged as follows: Section II presents related work; Section III describes our proposed incremental learning algorithm; Section IV explains the experimental setup of this research; Section V presents a summary of the experimental results; Section VI discusses the challenges of implementing the proposed incremental algorithm in the network intrusion detection problem; and Section VII provides our conclusion and future work.
II. RELATED WORK
IDSs are security tools that identify malicious network activities on computer infrastructures. They monitor network traffic and system logs to find malicious network activities that conventional firewalls cannot filter [2], [6]. There are two main categories of IDSs based on their detection method: signature-based and anomaly-based IDSs [6], [15]. Signature-based IDSs use pattern-matching techniques to find known malicious network activities. Signature-based IDSs are also known as knowledge-based detection or misuse detection. In contrast, anomaly-based IDSs analyze network traffic to find a significant deviation between observed traffic and acknowledged traffic behavior. Anomaly-based IDSs interpret this deviation of behavior as an intrusion [2], [4], [6], [7], [15]. One approach in building anomaly-based IDSs is using machine learning algorithms [4].
Most of the machine learning models used in previous studies of IDSs are static models and trained using a dataset containing all targeted intrusions. Only a few of them raised the issue of ever-evolving network intrusion variants [16]. Studies by Constantinides et al. [16], Chen et al. [17], Yi et al. [18], Xu et al. [19], and Jiang et al. [20] are examples of those proposing an incremental learning method to solve the problem of ever-evolving network intrusion variants. Most of these studies utilized support vector machines (SVMs) in their proposed incremental learning methods. SVMs belong to the supervised machine learning algorithm category commonly used for classification problems. Despite the prominent properties of SVMs, the training complexity of SVMs is highly dependent on the size of a dataset. Thus, SVMs are not as favored for large-scale data mining as for pattern recognition [21].
Recently, many studies have preferred deep learning using artificial neural networks (ANNs) to process large-scale data [22]- [24]. Deep learning using ANNs is one of the popular algorithms for learning information from complex datasets. Deep learning using ANNs can create more complex models than traditional probabilistic machine learning techniques [22]. Therefore, they have been broadly used for IDSs [2], [6]- [8], [15].
Despite being broadly used, most deep learning studies in IDSs do not focus on incremental learning. Instead, they focus on improving the classification performance by utilizing deep learning algorithms. In contrast, incremental learning using deep learning algorithms is thriving in image processing fields. For example, Roy et al. [25] used deep convolutional neural networks (CNNs) to build a hierarchical model for incremental learning. Their proposed model organizes the incrementally available images into several superclasses based on features. In the training process, new classes of images are added to the hierarchical model as the subclasses. The retraining processes are limited in the affected superclasses to reduce the computational overhead. Sarwar et al. [26] also proposed an incremental learning algorithm using CNNs. Unlike Roy et al.'s approach, Sarwar et al. used a partial network sharing method in their incremental learning method. Inspired by transfer learning techniques, Sarwar et al.'s method splits CNN layers into shared and classification layers. The first several layers become the shared layers, and the rest become the classification layers. When the model is retrained using new image data, the classification layers are cloned to classify the new images. The result of this cloning process is a tree structure model of shared and classification layers. Both methods use different approaches to generate an incremental learning model. However, the structure of both models resembles a tree structure. Additionally, both methods limit the retraining process in the new branches of the tree structure to reduce the computational overhead. We adopted the idea of using a tree-structured model for incremental learning in our proposed method.
We used DFNN models in the T-DFNN algorithm. The DFNN model is one variant of ANNs used for deep learning. We combined a tree structure and DFNN models to build an incremental learning algorithm that can preserve knowledge from the previous training while reducing the retraining process's computational overhead. Multiple nodes of DFNN models are used in the T-DFNN model to classify the given input data. Unlike Roy et al.'s approach [25], we did not group similar intrusion classes into one superclass. Instead, we utilized a previously trained model to find the old and new classes classified as the same output label.
In our experiment, we compared our proposed method with another tree-based incremental learning algorithm. We chose a well-known incremental decision tree algorithm, namely, the Hoeffding tree algorithm [27]. The Hoeffding tree algorithm can learn from large-scale incremental data. It exploits the fact that a small portion of data is often enough to select an optimal splitting attribute of the dataset. This algorithm is commonly used to process incremental data and has been implemented in several popular machine learning libraries, such as scikit-multiflow [28] and Weka [29].
To test the performance of our proposed method, we used CICIDS2017. It is a newer IDS dataset than the KDD Cup 1999 dataset [30], which has been commonly used in previous IDS incremental learning studies [17]- [20]. We did not use the KDD Cup 1999 dataset because it has several deficiencies. One of the critical deficiencies of the KDD Cup 1999 dataset is the significant number of redundant records, which causes a bias toward the more frequent records [31]. Another unfortunate deficiency of the KDD Cup 1999 dataset is the fact that this dataset is very old [15]. It was created in 1999 for The Third International Knowledge Discovery and Data Mining Tools Competition. Hindy et al. [15] explained that depending solely on old datasets cannot help the advancement of IDSs. Thus, it is better to use newer intrusion datasets that cover recent variants of intrusions.

III. METHODS
The T-DFNN algorithm covers both training and classification processes. The training process of the T-DFNN algorithm generates an incremental learning model. This incremental learning model is then used in the classification process to classify the input data. The T-DFNN algorithm aims to classify ever-growing network intrusions efficiently. Thus, the training process in the T-DFNN algorithm is designed to preserve the knowledge learned by the previous model while reducing the quantity of old training data used in the training processes.
The T-DFNN model is a tree-structured model. It consists of a root node and may have several leaf and internal nodes. When a T-DFNN model is retrained, the trained nodes are not modified to preserve the previously learned knowledge. Instead, new nodes are created. These newly created nodes are then connected with the existing nodes to create a tree-structured model. The innovation of our proposed T-DFNN algorithm is its ability to distribute the training data to each node while limiting the quantity of old training data used to train these new nodes, thereby shortening the time needed to retrain the model. Figure 1 shows the training and classification flow of the T-DFNN algorithm. The main feature of the T-DFNN algorithm in the training process is the mechanism to reuse a previously trained model to learn new training data. Thus, a trained model is saved after each incremental training process. Except for the first training process, the saved model is loaded along with the new training data. This saved model is then retrained to classify the new input data.
One of the essential components in the T-DFNN model is the T-DFNN node. There are two important items in the T-DFNN node: a DFNN model and a map of output labels. A DFNN model processes the input data and classifies them into several output labels. These output labels can be linked with other T-DFNN nodes using a map. A map is a data structure that consists of key-value pairs. In the map of output labels, the output labels become keys, and the values of these keys are either other nodes or NULL values. A NULL value indicates that the output label is not linked with any node.
As we previously mentioned, the innovation of the T-DFNN algorithm is its mechanism to distribute the training data to several nodes and limit the quantity of old training data used in the training process. In the retraining process, several new nodes can be created. In addition to the new training data, old training data are also used to train these new nodes. However, the T-DFNN algorithm limits the quantity of old training data used to train these new nodes. Each new node only uses old training data classified as its parent node's output label. The details of this training process are described in Section III-A.
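As a concrete illustration of this node structure, the following minimal Python sketch builds a node holding a trained classifier and a map from each output label to a child node or None (the paper's NULL). The names make_node and MajorityStub are our own, and the stub merely stands in for a real DFNN:

```python
class MajorityStub:
    """Toy stand-in for a DFNN: predicts the majority training label."""

    def fit(self, X, y):
        self.label = max(set(y), key=list(y).count)

    def predict(self, X):
        return [self.label] * len(X)


def make_node(model, X, y):
    """Train the node's model and map every output label to None (NULL)."""
    model.fit(X, y)
    return {"model": model, "children": {label: None for label in set(y)}}
```

A node built this way can later have child nodes linked under specific output labels by replacing the corresponding None entries in the map.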
The classification process in the T-DFNN algorithm is performed in several steps. First, the input data are processed using the root node of the T-DFNN model. Then, the outputs of the root node's classification process are used to determine whether the classification process will continue using the child nodes or be terminated. These processes have similarities with the classification process in the decision tree. The difference is that in the T-DFNN algorithm, each node uses a DFNN model to classify the input data.

A. INCREMENTAL TRAINING ALGORITHM
After loading the first batch of training data, we process it using the incremental learning procedure shown in Pseudocode 2. In this first training process, we create a node. Consequently, this node becomes the root node of the T-DFNN model. This root node is then trained using the given training data. Last, we map the output labels of the DFNN model to NULL to indicate that the output labels are not linked with any node. We describe the initial training process in Pseudocode 2, Lines 1 to 3.
The retraining process begins by loading the training data and the trained T-DFNN model. We do not retrain existing nodes to prevent the catastrophic forgetting effect. Instead, we use the existing nodes to find suitable output labels to be linked with new nodes. We describe the process of finding suitable output labels to be linked with new nodes in Pseudocode 2, Lines 4 to 18. The process begins by classifying the new training data using the root node's DFNN model, as shown in Pseudocode 2, Line 5. The root node's DFNN model misclassifies new training classes as old classes because we did not train the root node's DFNN model to classify these new training data. We illustrate this condition in Figure 2(b). In Figure 2(b), the root node's DFNN model misclassifies class 4 training data as output label 2. It is also possible that the root node's DFNN model misclassifies a new training class as two or more old classes. For example, in Figure 2(b), the root node's DFNN model misclassifies part of the class 3 training data as output label 0 and the other part as output label 2.
There are two possible conditions when multiple classes are classified as the same output label: 1) The output label is linked with a node.
2) The output label is not linked with any node.
In the first condition, where an output label is linked with another node, we rerun the training process using the linked node as the new root. This recursive training process continues until we find an output label with no linked node. In other words, the second condition occurs. In this second condition, we create and train a new node. Only the training data classified as the same parent node's output label are used in the training process. Last, we link the parent node's output label with this newly trained node by mapping the parent node's output label to the new node. We describe these steps in Pseudocode 2, Lines 9 to 15.
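The two conditions above can be sketched as a short recursive procedure. This is only an illustration of the retraining logic in Pseudocode 2 under simplifying assumptions: the node layout is a plain dict, the NearestCentroid1D class is a toy stand-in for a DFNN, and the caller is expected to pass in the new training data together with any old training data selected by the parent node:

```python
class NearestCentroid1D:
    """Toy DFNN stand-in for 1-D inputs: nearest class centroid."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            vals = [x[0] for x, l in zip(X, y) if l == label]
            self.centroids[label] = sum(vals) / len(vals)

    def predict(self, X):
        return [min(self.centroids, key=lambda l: abs(x[0] - self.centroids[l]))
                for x in X]


def make_node(model, X, y):
    model.fit(X, y)
    return {"model": model, "children": {label: None for label in set(y)}}


def retrain(node, X, Y, make_model):
    y_pred = node["model"].predict(X)           # classify with the old model
    for label in set(y_pred):
        # Keep only data this node classified as the current output label.
        XL = [x for x, p in zip(X, y_pred) if p == label]
        YL = [y for y, p in zip(Y, y_pred) if p == label]
        if set(YL) == {label}:
            continue                            # no new classes under this label
        child = node["children"].get(label)
        if child is not None:
            retrain(child, XL, YL, make_model)  # condition 1: recurse
        else:
            # Condition 2: create, train, and link a new node.
            node["children"][label] = make_node(make_model(), XL, YL)
```

The recursion mirrors the paper's description: trained nodes are never modified, and a new node is created only once an output label with no linked node is reached.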
One key component of the T-DFNN incremental training algorithm is selecting the appropriate training data used in each node. This selection process limits the quantity of training data in each node. We describe the training data selection process in Pseudocode 3. Training data in each node are previously classified as the same parent node's output label. For example, in Figure 2(b), the training data used in the c0 node are class 0 and part of the class 3 training data because they are classified as output label 0 in the root node. Likewise, the training data used in the c2 node are class 2, part of class 3, and class 4 training data because they are classified as output label 2 in the root node. Training data of a class may be divided into several parts and trained using different nodes. For example, class 3 training data used in the c0 node were previously classified as output label 0 in the root node. In contrast, class 3 training data used in the c2 node were previously classified as output label 2 in the root node. Thus, there are no overlapping training data between those two nodes.
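The selection procedure itself is simple: keep only the (X, Y) pairs whose parent-node classification equals the target output label. The sketch below follows the variable names in the Pseudocode 3 header (X, Y, YC, L):

```python
def select_training_data(X, Y, YC, L):
    """Select the training data the parent node classified as output label L.

    X: training data; Y: true labels; YC: labels assigned by the parent
    node's classification; L: the parent node's output label of interest.
    """
    XL = [x for x, yc in zip(X, YC) if yc == L]
    YL = [y for y, yc in zip(Y, YC) if yc == L]
    return XL, YL
```

For the Figure 2(b) example, the class 0 data and the part of the class 3 data classified as output label 0 would both be selected for the c0 node.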
We do not need to use the training data of all old classes in the training process of new training data. Training data of some old classes are only needed when the old and new classes are classified as the same output label. For example, in Figure 2(b), we do not need to use class 1 training data because when we classify the new training data using the root node, there are no new classes classified as output label 1. This method reduces the quantity of old training data used for the training process of the new classes.

B. CLASSIFICATION ALGORITHM
The T-DFNN classification algorithm is a recursive process. As shown in Pseudocode 4, the first step is classifying the input data using the root node's DFNN model. Then, we check the linked node for each output label. If the output labels of the root node's DFNN model are linked with other nodes, we run the classification algorithm recursively using the linked nodes. This recursive process stops when the DFNN model's output label in the last node is not linked with any node. When the DFNN model of a node classifies the input data, there are two possible conditions regarding each output label: 1) The output label is not linked with any node.
2) The output label is linked with a node.
In the first condition, the input data classified as this output label are not processed further. Thus, this output label becomes the final classification result. We illustrate this condition with the root node's output label 1 in Figure 3(a). In Figure 3(a), the root node classifies the input data as output label 1. Because there is no node linked with output label 1, the final classification result is class 1.
In the second condition, we run the classification algorithm recursively using the node linked with the output label. Not all input data are used in this recursive process. Only input data classified as the output label of the linked node will be selected. We describe this data selection process in Pseudocode 5. This recursive classification process stops when an output label with no linked node is found. Thus, the last output label with no linked node becomes the final classification result. To propagate this result, we need to update the current node's classification result using the recursive process's output. Only the classification results of the input data used in the recursive process are updated. We describe the process of updating the classification results in Line 7 of Pseudocode 4, which is explained in more detail in Pseudocode 6.
We illustrate the recursive classification process in Figure 3(b). In Figure 3(b), the root node's DFNN model classifies input data as output label 2. Because output label 2 in the root node is linked with the c2 node, we recursively run the classification process using the c2 node. Then, the c2 node classifies the input data as output label 4. Because there is no node linked with output label 4 in the c2 node, the final classification result of the input data is class 4.
The T-DFNN model may classify a new class using two or more nodes. For example, in Figure 3(c), the root node's DFNN model classifies input data as output labels 0 and 2. Both output labels are linked with different nodes. Output labels 0 and 2 in the root node are linked with the c0 and c2 nodes, respectively. Under this condition, the input data are split into two groups. The first group is the input data classified as output label 0 by the root node's DFNN model, while the second group is the input data classified as output label 2 by the root node's DFNN model. These two groups of input data are processed further using different nodes. The first group is processed by the c0 node, while the other group is processed by the c2 node. Finally, both DFNN models of nodes c0 and c2 classify the input data as output label 3. Because there is no node linked with output label 3 in either node, the final classification result of the input data is class 3.
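The recursive classification described above (Pseudocodes 4 to 6) can be sketched as follows. The dict-based node layout and the TableStub lookup model are illustrative stand-ins of our own, not the paper's implementation:

```python
class TableStub:
    """Toy model: looks up a fixed label for each 1-D input value."""

    def __init__(self, table):
        self.table = table

    def predict(self, X):
        return [self.table[x[0]] for x in X]


def classify(node, X):
    y_pred = node["model"].predict(X)
    for label, child in node["children"].items():
        if child is None:
            continue                       # unlinked label: final result
        idx = [i for i, p in enumerate(y_pred) if p == label]
        if not idx:
            continue
        sub = [X[i] for i in idx]          # Pseudocode 5: select input data
        for i, p in zip(idx, classify(child, sub)):
            y_pred[i] = p                  # Pseudocode 6: update the results
    return y_pred
```

Only the data routed into a child node have their results updated; everything else keeps the label assigned by the current node, matching the two conditions above.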
The classification processes of input data in the T-DFNN model involve several nodes at different tree levels. Thus, the time needed to classify the input data is the accumulation of node classifications from the root to the leaf node. In detail, the classification time of a T-DFNN model is estimated by calculating the classification time of its root node using Equation 1:

T(N) = t(N), if deg(N) = 0
T(N) = t(N) + max_{c ∈ C(N)} T(c), otherwise    (1)

where T(N) is the classification time of node N, t(N) is the classification time of node N's DFNN model, deg(N) is the degree of node N, and C(N) is a set of node N's child nodes. If the degree of node N is 0, i.e., node N does not have any child node, the classification time of node N is equal to its DFNN model's classification time. Otherwise, the classification time of node N is the sum of its DFNN model's classification time and the longest classification time of its child nodes.
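The recursion in Equation 1 can be computed directly over the tree. In this sketch we assume each node records its own DFNN classification time under a "model_time" key; the field names are ours:

```python
def classification_time(node):
    """Equation 1: a node's classification time is its own model time plus
    the longest classification time among its linked child nodes."""
    children = [c for c in node["children"].values() if c is not None]
    if not children:                      # degree 0: leaf node
        return node["model_time"]
    return node["model_time"] + max(classification_time(c) for c in children)
```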

IV. EXPERIMENTAL SETUP
A. DATASET
In this experiment, we used CICIDS2017. It is a reliable and labeled publicly available network intrusion dataset [13]. CICIDS2017 contains benign and common network intrusion flows, which match the features of a reliable benchmark dataset proposed by Gharib et al. [32]. These features are anonymity, attack diversity, available protocols, complete interaction, complete capture, complete traffic, complete network configuration, labeling, feature set, heterogeneity, and metadata. Thus, this dataset has attracted researchers to develop machine learning models and algorithms [33].
CICIDS2017 consists of 84 network traffic features extracted from raw network packets using the CICFlowMeter software [34], which is publicly available on the Canadian Institute for Cybersecurity website. Similar to a previous study [35], we decided to remove six features from the dataset: Flow ID, Protocol, Timestamp, Source IP, Destination IP, and Source Port. From the network topology perspective, the values of these features differ from real-world scenarios because this dataset was generated in an isolated network. Additionally, we removed 288,602 rows with missing labels from the dataset. After the unused and unlabeled data were removed, the final dataset consisted of 2,830,743 rows and 78 features. There are fifteen classes in the CICIDS2017 dataset, one of which is benign traffic, and the others are fourteen different types of intrusion traffic. Table 1 shows the traffic distributions in this dataset.
We applied several preprocessing steps to CICIDS2017. These preprocessing steps are required for two main reasons. First, our proposed model uses DFNN models to classify data in the nodes. Several preprocessing steps, such as normalization, should be applied to this dataset before further processing. Second, CICIDS2017 was not explicitly designed for incremental learning. To simulate the incremental learning process, we divided this dataset into several batches.
The preprocessing steps used in this experiment are as follows:
1) Replace missing values on each feature using the mean value of its class.
2) Replace infinite values on each feature using the maximum value of its class.
3) Replace negative values on each feature using the minimum value of its class.
4) Normalize the features using unity-based normalization.
5) Group the dataset into several batches.
6) Split the dataset into training and evaluation groups.

Equation 2 is used to normalize each feature by restricting the range of its values between 0 and 1:

x' = (x_i − x_min) / (x_max − x_min)    (2)

x_i, x_min, and x_max in Equation 2 represent the value, the minimum value, and the maximum value of each feature, respectively. The purpose of this normalization is to prevent a feature from outweighing the other features.

To simulate the incremental learning process, we divided the dataset into several batches. Each batch contains several classes. After dividing the dataset into several batches, we split the data in each batch into training and evaluation data. The ratio between training and evaluation data is 4 to 1. Table 1 shows the batches and data distribution in each batch. In the training process, some classes from old batches were used. For example, when we created a new node that needed to classify old and new training classes, some old training data were used to train this new node. We illustrate this case in Figure 2(b). In Figure 2(b), the training data used in the c0 node are training data of class 0 and class 3 classified as output label 0 in the root node.

The last layer of the DFNN model is a classification layer. Its number of neurons depends on the number of classes classified in this node. We illustrate this behavior in Figure 2(b). In Figure 2(b), the numbers of neurons of the last layer in the c0 and c2 nodes are different; the c0 node has two, while the c2 node has three. The hyperparameters used by the DFNN model in every node of the T-DFNN model are identical.
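The unity-based normalization of Equation 2 applied to one feature column can be sketched as follows; the guard for a constant feature (zero range) is our addition, not part of the paper:

```python
def unity_normalize(values):
    """Equation 2: x' = (x - x_min) / (x_max - x_min), mapping into [0, 1]."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        return [0.0] * len(values)   # constant feature: avoid division by zero
    return [(x - x_min) / (x_max - x_min) for x in values]
```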
For the optimizer, we used the Adam optimizer with a 0.001 learning rate. By default, the training process continues for up to 1000 epochs. However, we used the early stopping method, which monitors the classification loss; if there was no improvement within 50 epochs, the training process stopped.
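The early stopping rule described above can be sketched independently of any deep learning framework. Here loss_fn is a hypothetical callable that runs one training epoch and returns its loss; the function name and signature are our own:

```python
def train_with_early_stopping(loss_fn, max_epochs=1000, patience=50):
    """Run loss_fn once per epoch; stop when the loss has not improved for
    `patience` consecutive epochs or when max_epochs is reached."""
    best = float("inf")
    wait = 0
    for epoch in range(1, max_epochs + 1):
        loss = loss_fn(epoch)
        if loss < best:
            best, wait = loss, 0        # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:
                break                   # no improvement for `patience` epochs
    return epoch, best
```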
In the experiment, we compared the T-DFNN model to two DFNN models. We named these two DFNN models DFNN-batch and DFNN-all. The DFNN-batch and DFNN-all models had an identical network structure and hyperparameters to the DFNN model used in the T-DFNN nodes. However, these two DFNN models and the T-DFNN model utilized the training data in different ways. The DFNN-batch model only used training data from the current batch, while the DFNN-all model used training data from old and current batches. Similar to the DFNN-all model, the T-DFNN model also used training data from the old and current batches. However, not all data from the old batches were used in the T-DFNN. The old training data were only used if they were needed to train new nodes, as we described in Section III-A.
The purpose of comparing the T-DFNN model with these two DFNN models was to measure the effectiveness of the T-DFNN algorithm in addressing the ever-evolving network intrusion variants and catastrophic forgetting problems. The comparison between the T-DFNN model and the DFNN-batch model demonstrated how severely the catastrophic forgetting problem affects the performance of network intrusion detection, and the effectiveness of the T-DFNN algorithm in mitigating it.

We also compared our proposed T-DFNN model with a Hoeffding tree model [27]. As discussed in Section II, the Hoeffding tree model is a well-known incremental decision tree algorithm capable of learning from large-scale incremental data. To implement the Hoeffding tree algorithm, we used the HoeffdingTreeClassifier method provided by the scikit-multiflow library [28], which is based on MOA [36]. For the hyperparameters of this Hoeffding tree model, we used the defaults provided by the scikit-multiflow library. Unlike the proposed T-DFNN model, the Hoeffding tree model only used the training data from the current batch, similar to the DFNN-batch model. The comparison with this well-established algorithm can help us understand the advantages and disadvantages of the proposed T-DFNN model.
We used precision, recall, and F1-score as the classification metrics to compare the performance of the models. We applied these metrics to each evaluation class, which allowed us to measure the classification performance of the models for each class. To measure the performance of the models on each evaluation batch, we calculated the macro and weighted averages of the precision, recall, and F1-score. Equations 3 and 4 show the formulas for the macro average (MA) and weighted average (WA) metrics, respectively.
MA = (1/C) Σ_{i=1}^{C} m_i (Equation 3)

WA = Σ_{i=1}^{C} (n_i / N) m_i (Equation 4)

where
• m is the classification metric: precision, recall, or F1-score;
• m_i is the value of classification metric m for class i;
• C is the number of classes;
• n_i is the quantity of data of class i;
• N is the quantity of data of all classes.

Finally, we ran the experiment ten times and then calculated the average value of all the metrics mentioned above.
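Equations 3 and 4 can be computed directly from per-class scores and counts; the F1-scores and sample counts below are hypothetical, chosen to show how a dominant class pulls the weighted average above the macro average:

```python
def macro_average(metric_per_class):
    """Equation 3: unweighted mean of metric m over all C classes."""
    return sum(metric_per_class) / len(metric_per_class)

def weighted_average(metric_per_class, counts_per_class):
    """Equation 4: mean of metric m weighted by each class's share n_i / N."""
    N = sum(counts_per_class)
    return sum(m * n / N for m, n in zip(metric_per_class, counts_per_class))

# Hypothetical per-class F1-scores and sample counts for three classes.
f1_scores = [0.9, 0.5, 0.7]
class_counts = [800, 100, 100]
ma = macro_average(f1_scores)                   # (0.9 + 0.5 + 0.7) / 3 = 0.7
wa = weighted_average(f1_scores, class_counts)  # 0.72 + 0.05 + 0.07 = 0.84
```

The macro average treats every class equally, which is why it exposes poor performance on minority intrusion classes that the weighted average can hide.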
The specification of the computer we used in this experiment is as follows:

Figure 4 shows the average values of our ten experimental trials. It compares two key aspects of our experiment. First, it compares the evaluation metrics, i.e., precision, recall, and F1-score, of the proposed T-DFNN model to the DFNN-batch and DFNN-all models. Second, it compares the macro and weighted averages of the evaluation metrics of each model. These comparisons show how well the models classify the data. They also help us understand the factors that affect the T-DFNN model in classifying the data.
In Figure 4, the x-axis represents the batch order. This batch order also correlates with the number of evaluation classes used in each batch. From Table 1, we can count that the numbers of training classes in batches 1, 2, 3, and 4 are 4, 4, 4, and 3, respectively. However, in Figure 4, the numbers of evaluation classes in batches 1, 2, 3, and 4 are 4, 8, 12, and 15, respectively. The number of evaluation classes increases in every batch because each batch also contains the old batches' evaluation data. Thus, in the last batch, all evaluation data were used. We included the classes from the old batches in the evaluation data to simulate the incremental learning process. As we described in Section I, a model for IDSs should be able to classify both old and new intrusion classes.
The catastrophic forgetting problem in the DFNN-batch model became apparent after the retraining process. The macro average of the DFNN-batch model's F1-score in Figure 4 declined sharply in each retraining process. As can be seen in Table 2, most of the F1-scores of the DFNN-batch model are below 0.25 after retraining. The model misclassified many old classes as new classes or vice versa. The only class with an F1-score above 0.8 after retraining was the benign class. However, we should note that the benign class is the largest in the dataset: 80% of the data in CICIDS2017 belong to the benign class. Thus, even though many benign data points were misclassified, its F1-score was not affected as severely as those of the other, minority classes.

The most straightforward approach to avoiding catastrophic forgetting is retraining the model with a dataset that contains all targeted classes. We tested this approach in our experiment using the DFNN-all model. All three evaluation metrics of the DFNN-all model in Figure 4 show consistent results: for all batches, the macro and weighted averages of precision, recall, and F1-score of the DFNN-all model are above 0.8. However, this approach has a severe drawback. As shown in Table 3, the DFNN-all model had the longest training times among the tested models.

The Hoeffding tree model performed quite well in reducing both the catastrophic forgetting problem seen in the DFNN-batch model and the long training time of the DFNN-all model. We can see in Figure 4 that the Hoeffding tree model is far less affected by the catastrophic forgetting problem than the DFNN-batch model. Additionally, the training processes of the Hoeffding tree model were shorter than those of the DFNN-batch and DFNN-all models because it only used the latest training data in its training process.
The experimental results in Table 3 show that the training times of the Hoeffding tree model are shorter than those of the DFNN-batch and DFNN-all models with a reasonably good macro average of F1-scores. However, if we look closely at the evaluation metrics of the Hoeffding tree model in Table 2, we can see that the Hoeffding tree model could not classify several minority classes correctly. The recall values of the Hoeffding tree model for web attack-XSS and web attack-brute force are below 0.45 in all batches. Moreover, as shown in Table 3, the evaluation times of the Hoeffding tree model increased significantly after each retraining process. When we implement the Hoeffding tree model in the network intrusion detection field, these issues become concerning because, in reality, some critical attacks may not have many samples to be analyzed. Additionally, the model may not be feasible for use for a long period because the classification process may become too slow.
The T-DFNN model solved both the catastrophic forgetting problem of the DFNN-batch model and the long training time of the DFNN-all model. It also had faster classification processes than the Hoeffding tree model without compromising classification performance. We can see in Figure 4 that the T-DFNN model had better F1-scores than the Hoeffding tree model. Additionally, as shown in Table 3, the T-DFNN model had faster classification times than the Hoeffding tree model in all batches.
The T-DFNN training algorithm does not use the entire set of old training data in its training process. Instead, it selects the training data based on the output labels of each node's parent node: the training data used in each node are those classified by the parent node as the corresponding output label. Splitting the training data and distributing them across several nodes speeds up the training process because it reduces the quantity of training data processed in each node. As shown in Table 3, the total quantity of training data used by the T-DFNN model is smaller than that used by the DFNN-all model, and the training times of the T-DFNN model are correspondingly shorter.
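The selection of old training data by a parent node's output label can be sketched as follows; the toy parent classifier, feature values, and class labels are illustrative placeholders, not values from the experiment:

```python
import numpy as np

def select_training_data(parent_predict, X_old, y_old, target_label):
    """Select the subset of old training data that the parent node
    classifies as `target_label`; only these samples (together with
    the new batch) are used to train the child attached to that label."""
    mask = parent_predict(X_old) == target_label
    return X_old[mask], y_old[mask]

# Toy parent node: output label 0 if the first feature is negative, else 1.
parent = lambda X: (X[:, 0] >= 0).astype(int)

X = np.array([[-1.0, 2.0], [3.0, 1.0], [-2.0, 0.5], [4.0, 4.0]])
y = np.array([0, 1, 3, 1])

# Only the rows the parent routes to output label 0 reach that child node.
X_sub, y_sub = select_training_data(parent, X, y, target_label=0)
```

Because each child sees only the slice of data its parent routes to it, no node ever retrains on the full accumulated dataset.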
The number of new nodes in each batch of the T-DFNN training process is dynamic. In our study, we ran the experiment ten times. The average numbers of new nodes in batches 1, 2, 3, and 4 of these experiments were 1, 2.6, 7.6, and 4.8, respectively. The number of new nodes generated in each batch depends on the new training data classification result in the previously trained node. Thus, the numbers of new nodes in each batch of the training processes in these experiments were different.
In Figure 5, we visualize a map of the output labels of the first experiment we conducted. For simplicity, we use the encoded version of the class labels in this visualization; the conversion from the original to the encoded class labels is listed in Table 1. We list the quantity of training data used and the training time of each node in this first experiment in Table 4. In Figure 5 and Table 4, we can observe how the training data were split and trained across several nodes. Additionally, we can observe the relationship between the quantity of training data used and the training time of each node. Even though the maps of the output labels and the numbers of nodes differed across our ten experiments, they showed a similar pattern: the training time tends to increase along with the quantity of training data in each node.
The training process of each node in the T-DFNN model is independent. Thus, the training processes were run in parallel, and the training time of each batch equals the longest training time of any node in that batch. For example, the training time of batch 4 of our first experiment, shown in Table 4, is 1,367.35 seconds because that is the longest node training time in batch 4.
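This parallel scheme can be sketched with a thread pool; `train_node` is a placeholder for one node's DFNN training run, and the node names and data sizes below are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

def train_node(node_id, data):
    """Placeholder for one node's independent DFNN training run;
    returns the node id and a stand-in for the trained model."""
    return node_id, len(data)

# Hypothetical per-node training sets created for one batch.
node_data = {
    "root": list(range(100)),
    "c0": list(range(40)),
    "c2": list(range(60)),
}

# Because the nodes share no state, they can be trained concurrently.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda kv: train_node(*kv), node_data.items()))
```

Since the nodes run concurrently, the batch's wall-clock training time is bounded by its slowest node, which is why the batch time equals the longest node training time.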

VI. DISCUSSION
The experimental results have shown that the T-DFNN algorithm has the potential to be used for incremental learning. However, we found several factors that affect the performance of the T-DFNN algorithm, as shown by the evaluation metrics. These factors are the class similarity problem, the scarcity of the data, and the computational overhead.
From the experimental results, we noticed that the class similarity problem and the scarcity of the data could affect the precision, recall, and F1-score of all tested models. We can observe this problem in the classification results of the web attack-brute force, web attack-XSS, and web attack-SQL injection classes in Table 2. These classes are similar types of intrusion. However, as we can see in Table 1, the quantity of data of these classes is severely imbalanced, and some of them have very limited data. In the first batch classification result, the F1-score of web attack-XSS using the T-DFNN model was relatively high, 0.971. However, in the second batch classification result, the F1-score dropped to 0.118 because many data points were misclassified as the web attack-brute force class, which has more data. The same problem occurred in the classification result of the web attack-SQL injection class. The model falsely classified the data of minority classes as other majority classes of the same intrusion type.
The classification results of the heartbleed and infiltration classes in Table 2 show an interesting result. These classes have scarce quantities of data: heartbleed and infiltration data constitute only 0.00039% and 0.00127% of the total data, respectively. However, the T-DFNN model's precision, recall, and F1-score for these classes were quite decent. The DFNN-all and Hoeffding tree models showed similar results. The reason these models can classify these classes correctly lies in the characteristics of the intrusions. Heartbleed attacks the Transport Layer Security (TLS) protocol through a security bug in the OpenSSL cryptography library. Infiltration intrusion scans victims from the internal network of infected clients [13]. Neither has a similar type of intrusion in the dataset. These results suggest that data scarcity does not always contribute to a reduction in the model's F1-score. Instead, the characteristics of the data have more influence on the F1-score of the model.
Another factor that affected the precision, recall, and F1-score of the proposed T-DFNN model was its additional computational overhead. Because the T-DFNN model consists of several nodes in a tree-like structure, the data might need to be classified by several nodes before obtaining the final classification result. This process creates a computational overhead that can be observed in the classification times of the T-DFNN model in Table 3. The T-DFNN model's classification times were longer in every batch because the height of the tree structure in the T-DFNN model increased.
We can estimate the classification time of the T-DFNN model using Equation 1. The estimated classification times for each batch in Table 5 are close to the actual classification times from the experimental results presented in Table 6. These classification times indicate that the computational overhead caused by the growth of the T-DFNN model's tree structure increases after each training. This overhead did not occur in the DFNN-batch or DFNN-all model; thus, the classification times of the DFNN-batch and DFNN-all models did not increase significantly. However, we should note that the computational overhead caused by the growth of the tree structure also occurred in the Hoeffding tree model. The T-DFNN algorithm manages to minimize this computational overhead: as shown in Table 3, the evaluation times of the T-DFNN model are at least 39% shorter than those of the Hoeffding tree model.
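As a rough sketch of why this overhead grows (a simplification of ours; Equation 1 itself is defined earlier in the paper and is not reproduced in this section), the classification time can be modeled as the sum of per-level node times along a sample's path, which lengthens as the tree gains levels. The per-level times below are hypothetical:

```python
def estimate_classification_time(per_level_times):
    """A sample visits one node per level of the tree, so its total
    classification time is the sum of the per-level node times.
    (Our simplified stand-in, not the paper's Equation 1 verbatim.)"""
    return sum(per_level_times)

# Hypothetical per-level times in seconds: each retraining can add a
# level, so the estimate, and thus the overhead, grows with each batch.
batch_estimates = [
    estimate_classification_time([20.0]),
    estimate_classification_time([20.0, 18.5]),
    estimate_classification_time([20.0, 18.5, 22.3]),
]
```

Under this model the estimates are monotonically increasing across batches, mirroring the growth of classification times observed in the experiments.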
Despite all the challenging factors affecting the T-DFNN model, we should note that it has several advantages: it reduces the catastrophic forgetting effect and shortens the training time. We can observe these advantages in Tables 2 and 3. The T-DFNN model was less affected by the catastrophic forgetting problem, classifying the evaluation data of every class in Table 2 in every batch. Even though the precision, recall, or F1-score of the T-DFNN model was slightly lower than that of the DFNN-all model for some classes, the training time of the T-DFNN model was much shorter than that of the DFNN-all model. Additionally, the classification times of the T-DFNN model were shorter than those of the Hoeffding tree model.

VII. CONCLUSION
Incremental learning in IDSs is a challenging problem. The main problems facing incremental learning are the ever-evolving network intrusion variants and catastrophic forgetting. We addressed both problems by proposing the T-DFNN algorithm, which combines a tree data structure and DFNN models. The experimental results showed that the model produced by the proposed T-DFNN algorithm can learn and classify network intrusions incrementally without being severely affected by the catastrophic forgetting effect. Moreover, the T-DFNN algorithm can shorten the training time. However, the T-DFNN algorithm requires more computational steps, which increase the classification time.
Other factors that affected the precision, recall, and F1-score of the model are the similarity between classes and the scarcity of the data. These factors affected not only the T-DFNN model but also the other models in general. Therefore, we suggest more comprehensive research on these factors as future work to improve the performance of the T-DFNN algorithm.