Threat Alert Prioritization Using Isolation Forest and Stacked Auto Encoder With Day-Forward-Chaining Analysis

Security Information and Event Management (SIEM) is a security management approach designed to identify possible threats within a real-time enterprise environment. The main challenge for SIEM is to find critical security incidents among a huge number of less critical alerts coming from separate security products. The continuously growing number of internet-connected devices has led to the alert fatigue problem, which is defined as the inability of security operators to investigate each incoming alert from intrusion detection systems. This fatigue can lead to human errors and leave many alerts uninvestigated. Aiming at reducing the number of less important threat alerts presented to security operators, this paper presents a new method for highlighting critical alerts with a minimal number of false negatives. The proposed method employs isolation forest to ensure unsupervised performance and adaptability to different types of networks. Furthermore, it takes advantage of day-forward-chaining analysis to ensure the detection of highly important alerts in real time. The number of false positive cases is reduced by employing an autoencoder. The proposed method achieved a recall score of 95.89% and a false positive rate of 5.86% on a dataset comprising more than half a million alerts collected in a real-world enterprise environment over ten months. This study highlights the importance of addressing the alert fatigue problem and validates the effectiveness of unsupervised learning in filtering out less important threat alerts.


I. INTRODUCTION
The rapid development of the Internet of Things (IoT) has led to a rise in cyber attacks. According to [1], 98% of IoT traffic is transmitted without encryption, which provokes serious cyber threats. Intrusion detection systems (IDSs) are often employed to monitor networks with the aim of finding malicious activities. While potentially useful, IDSs typically raise a high number of threat alerts, many of which are either false alarms or low-priority alerts. Security operators have to examine each incoming alert to judge whether it requires a highly resourced incident response. This task may become impossible to complete when the number of alerts is too high for a human to handle. At the very least, a high number of alerts can lead to the alert fatigue problem [2], which is defined as the network operator's inability to examine each alert due to a high number of alerts. The problem may induce human errors, including misjudgments and overlooking important alerts. CISCO reported that 44% of incoming alerts had been completely ignored by their operators due to a high number of alerts [3]. To reduce such risks, CISCO recommended developing automated security solutions that can support security operators by shortening the time they spend on detection, investigation, and remediation [3]. This study aims at developing such a solution to address the alert fatigue problem by automating the threat alert prioritization process using Artificial Intelligence techniques.

The associate editor coordinating the review of this manuscript and approving it for publication was Long Cheng.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

To address the alert fatigue problem, Hassan et al. [2] proposed the NoDoze method, which uses automated provenance triage to provide contextual information of incoming alerts by converting a chain of alerts into a single event.
According to this method, alerts belonging to one group are considered as one distinct event, which reduces the number of alerts. Similarly, the method presented in this paper groups several alerts into a single event based on contextual information; however, instead of host-based IDS data, the proposed method leverages network-based IDS data to ensure wider monitoring capabilities in a network. This paper extends our previous work on alert screening methods using temporal analysis [4] and real-time analysis [5] by introducing an unsupervised clustering model based on isolation forest. Isolation forest is a machine learning technique that isolates anomalies without profiling normal instances and is suitable for addressing the alert fatigue problem because it can identify positive alerts as outliers. To consider temporal patterns, the proposed method classifies incoming alerts using a model trained on data from previous days. This approach is known as day-forward-chaining analysis [25]. In our previous studies presented in [4] and [5], isolation forest clustering produced a relatively high number of false positives. In this study, we address the two main drawbacks of our previous studies. First, we group a chain of alerts as a single event. Second, we reduce the false positive rate by leveraging a stacked autoencoder (SAE). In particular, a one-class SAE is applied to reduce the number of false positives by calculating reconstruction errors. Important alerts, which are anomalies, should have higher reconstruction errors than less important alerts because the encoding step incorporates less important alerts only. Furthermore, day-forward-chaining analysis is employed to conduct time-series analysis in real time while considering the underlying patterns of previous data.
The proposed method was evaluated on a real alert dataset collected inside our organizational networks. Among the 564,561 threat alerts included in the dataset, 291 were true highly critical alerts, as classified by our security experts. In this study, false positives refer to less critical threat alerts classified as highly critical threat alerts. In the experiments, the proposed method achieved a recall score of 95.89% and a false positive rate of 5.86%. We believe that this method is effective in addressing the alert fatigue problem.
The main contributions of this study are the following:
• It proposes an automated alert prioritization method based on deep learning capable of overcoming the alert fatigue problem by reducing the number of threat alerts that should be investigated by security operators.
• The proposed method is unsupervised; it does not require any prior-labeled data.
• The proposed method works in real time owing to the employment of day-forward-chaining analysis.
• It treats threat alerts that belong to one group sharing the same contextual information as a single event. This approach reduces the complexity of alert relationships.
• The method is capable of reducing the number of false positives, which is desirable in IDSs, by leveraging reconstruction error values output by SAE.
The rest of this study is organized as follows: Previous studies on reducing threat alerts are discussed in Section II. Section III details the process of data preparation, while Section IV explains the proposed method. The experimental results are presented in Section V. Section VI concludes the paper and outlines directions for future work.

II. RELATED WORK
Among many studies on alert screening, several are focused on creating a common format of alert logs. For example, Madani et al. [6] discuss the challenges of categorizing, parsing, transferring, and filtering logs given their various formats. Existing security appliances generate logs in different formats, including the Log Event Extended Format (LEEF) [7], Common Event Format (CEF) [8], Intrusion Detection Message Exchange Format (IDMEF) [9], and Common Event Expression (CEE) [10]. The absence of a common format makes the analysis of threat alerts a challenging task. Azodi et al. [11], [12] proposed a model for reading and normalizing various log formats using named-group regular expressions (NGRE) and a knowledge base. Sapegin et al. [13] proposed a new common format that contains all the information needed from different log formats.
Valeur et al. [14] proposed a correlation-based method for reducing the number of alerts. The performance of this method depends on the characteristics of datasets. In particular, a reduction of 99.2% could be achieved for the honeypot dataset but only 53.0% for the MIT/LL 2000 dataset. Despite the convincing performance, some of the correlation components may be impractical, as they require the security operator to have access rights to the host.
Hassan et al. [2] proposed the NoDoze method that employs an automated provenance triage to overcome the alert fatigue problem. NoDoze adjusts the suspiciousness level of each event in the provenance graph based on the suspiciousness level of neighboring events in the graph. The method performs behavioral execution partitioning by distinguishing between benign and malicious behaviors. It also generates the dependency graph of true alerts (most malicious) to prevent the dependency explosion due to previous data provenance.
Sun et al. [15] used isolation forest to identify deviations from normal employee behaviors. They did not consider the temporal factor; instead, they gathered data for a period of time and then performed anomaly detection on the entire dataset. In contrast, Ding and Fei [16] employed isolation forest to analyze data arriving in the form of streams using a sliding window. Tuor et al. [17] used a recurrent neural network to reduce the number of alerts regarding employee behaviors.
Aminanto et al. employed a stacked autoencoder to output transformed features that helped improve the detection rate [18] and performance of the k-means clustering algorithm [19]. The method proposed in this paper employs SAE to reduce the number of false positives by calculating the reconstruction error values.
In summary, this study aims at automated alert log reduction rather than defining a single log format. The proposed method considers network-based IDS logs without any additional information.

III. DATASET

A. COLLECTING AND LABELLING DATA
The logs employed in this study were generated by an IDS used to monitor a class-B network connecting more than 1,000 users and 30,000 hosts, including mobile devices, servers, and personal computers. The logs contained 564,561 threat alerts generated by one security appliance over ten months, from January 1 to October 31, 2017. According to the standard process of alert management, the alerts were carefully investigated by security experts to identify key incidents, which were labeled as highly critical alerts. For the investigation purposes, the security experts examined the contents of communications captured in archived packet capture (PCAP) files stored in a database. The critical level of each alert was determined based on evidence gathered after examining the contents of accessed URLs, matching with commercial URL black lists, performing the sandbox analysis of downloaded files, etc. Among the 564,561 alerts, only 291 were labeled as positive (i.e., highly critical), while the rest were labeled as negative (i.e., less critical) based on the significance of possible impact. It should be noted that while the security experts gathered deterministic evidence for the highly critical alerts, there still could be many other significant alerts ignored by the experts and labeled as negative due to misjudgment and the alert fatigue problem itself.

B. FIELD EXTRACTION: ALERT MESSAGE DECODING
For this study, alerts were generated in the CEF format, which is an event interoperability standard introduced by ArcSight [20]. Among other details, each alert contained such information as the application ID, alert ID, URLs, port number, IP addresses, and timestamp when the alert was received. The CEF format starts with a prefix indicating the hostname and date, followed by a message. In this study, messages were processed by a parser.
In the parser, each message was first divided into a header and a body. The header comprised a prefix and the first seven CEF fields, which were obtained by splitting the message into eight fields using '|' as a delimiter. The first field in the header was further divided into the following five fields using the white space ' ' as a delimiter: month, day, time, hostname, and CEF_Version. The remaining fields contained predefined keys referring to the device product, device vendor, etc. After parsing, each alert was converted into a dictionary (i.e., a set of key/value pairs) in the JSON format.
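The parsing steps above can be sketched as follows. This is an illustrative reconstruction, not the authors' actual parser: the sample message, the field names, and the `parse_cef` helper are hypothetical, and the sketch assumes extension values contain no embedded spaces.

```python
import json

def parse_cef(line):
    # Split on '|': the first eight parts form the header, and the
    # remainder (index 7) is the key=value extension body.
    parts = line.split('|', 7)
    # The first header field splits on white space into five subfields.
    month, day, time, hostname, cef_version = parts[0].split()
    record = {
        'month': month, 'day': day, 'time': time,
        'hostname': hostname, 'CEF_Version': cef_version,
        'DeviceVendor': parts[1], 'DeviceProduct': parts[2],
        'DeviceVersion': parts[3], 'SignatureID': parts[4],
        'Name': parts[5], 'Severity': parts[6],
    }
    # The body holds space-separated key=value pairs (a naive split;
    # values with embedded spaces would need extra handling).
    for pair in parts[7].split():
        key, _, value = pair.partition('=')
        record[key] = value
    return json.dumps(record)  # each alert becomes a JSON dictionary

msg = ("Jan 01 00:00:01 ids01 CEF:0|SomeVendor|SomeIDS|1.0|100|"
       "Suspicious URL|5|src=10.0.0.5 dst=10.0.0.9 spt=51234")
parsed = json.loads(parse_cef(msg))
```

Note that `split('|', 7)` keeps the whole extension body intact in the eighth part, so field values containing '|' in the extension would survive; a production parser would also need to honor CEF's escaping rules.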

C. ALERT GROUPING
In this study, ''alert'' and ''event'' are used as general terms, while ''record'' and ''event profile'' are used as special terms. To maintain a common understanding, we define the following terms:
• Alert: a notification with a message produced by a security appliance.
• Event: an occurrence of an incident with a varying level of significance.
• Record: a single line of message in an alert that contains descriptive information about an event.
• Event profile: a formatted collection of the information presented by all the records related to an event.
Several records were grouped into an event profile based on their incident ID. One incident ID may be assigned to more than one record; the first record in each event profile was selected as the corresponding event profile representative.
An event profile was labeled as positive (i.e., highly critical record) if there was at least one positive record in the corresponding records. As a result, while the dataset included 562,270 negative and 291 positive records, it contained 98,959 negative and 156 positive event profiles. Table 1 shows the distribution of positive and negative records and event profiles in the dataset.
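The grouping and labeling rules above can be sketched as follows. This is a minimal illustration; the record keys `incident_id` and `label` are hypothetical stand-ins for the parsed alert fields.

```python
def build_event_profiles(records):
    """Group records sharing an incident ID into event profiles (sketch)."""
    groups = {}
    for rec in records:
        groups.setdefault(rec['incident_id'], []).append(rec)
    profiles = []
    for incident_id, group in groups.items():  # dicts preserve insertion order
        profiles.append({
            'incident_id': incident_id,
            'representative': group[0],  # first record represents the profile
            # A profile is positive if at least one member record is positive.
            'label': ('positive'
                      if any(r['label'] == 'positive' for r in group)
                      else 'negative'),
            'n_records': len(group),
        })
    return profiles

records = [
    {'incident_id': 'A', 'label': 'negative'},
    {'incident_id': 'A', 'label': 'positive'},
    {'incident_id': 'B', 'label': 'negative'},
]
profiles = build_event_profiles(records)
```

Here incident 'A' yields one positive event profile covering two records, and incident 'B' yields one negative profile, mirroring how 562,270 + 291 records collapse into 98,959 + 156 event profiles in the dataset.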

IV. METHODOLOGY
This section presents the proposed method for automated and real-time alert prioritization using unsupervised isolation forest, SAE, and day-forward-chaining analysis. The method includes the following three main stages: feature selection, anomaly detection using isolation forest, and false positive reduction using SAE as shown in Fig. 1.

A. FEATURE SELECTION AND DIGITIZATION
In total, 49 fields (features) were obtained after applying the message decoding procedure outlined in Section III. Table 2 lists all the fields. The fields included categorical (string), numerical, and timestamp values. The number of unique categorical values varied from fewer than ten (e.g., the Protocol field had only two values: udp and unknown) to about one million (e.g., Message ID). Fig. 2 shows all the fields and their unique value counts. Numerical fields (e.g., Impact, IncidentImpact, Severity) are in integer format, where bigger values represent more significant records. Fig. 2 indicates that fields with a large number of unique values were more signature-like, with only a few records sharing the same value. Typically, such fields contribute less to the classification task and increase the dimensionality of data. To reduce the negative impact of such fields, a rule-based approach was applied to filter them out. In particular, all numerical fields were kept, while categorical fields were selected only if the number of their unique values was lower than a pre-defined threshold.
One-hot encoding was used to transform the categorical values into numerical values. Missing values were considered as a new value for each field. All numerical fields were normalized using the min-max method.
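The selection and digitization steps can be sketched with pandas as below. The `digitize` helper, the sample fields, and the unique-value threshold are illustrative assumptions; the paper does not state the exact threshold value.

```python
import pandas as pd

def digitize(df, categorical, numerical, max_unique=50):
    """Rule-based feature selection and digitization (illustrative sketch)."""
    # Keep a categorical field only when its unique-value count
    # (missing values included) is below the threshold.
    kept = [c for c in categorical if df[c].nunique(dropna=False) < max_unique]
    # One-hot encode, treating missing values as a category of their own.
    onehot = pd.get_dummies(df[kept].fillna('<missing>'), dtype=float)
    # Min-max normalize numerical fields to [0, 1].
    num = df[numerical].astype(float)
    span = (num.max() - num.min()).replace(0, 1)  # guard constant columns
    num = (num - num.min()) / span
    return pd.concat([onehot, num], axis=1)

df = pd.DataFrame({
    'Protocol': ['udp', 'unknown', 'udp'],
    'MessageID': ['m1', 'm2', 'm3'],   # signature-like: all values unique
    'Severity': [1, 5, 3],
})
# With a threshold of 3, the signature-like MessageID field is dropped.
X = digitize(df, categorical=['Protocol', 'MessageID'],
             numerical=['Severity'], max_unique=3)
```

In this toy example the Protocol field becomes two one-hot columns, MessageID is filtered out for having as many unique values as records, and Severity is rescaled to [0, 1].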

B. ANOMALY DETECTION
Highly critical alerts were identified using an anomaly detection algorithm called isolation forest [21]. Isolation forest uses sparseness and difference properties of the tree structure to isolate anomalies instead of profiling common behaviors. It builds an ensemble of trees. In each tree, anomaly nodes are placed close to the root, while non-anomaly nodes are placed deeper, i.e., far away from the root. Among others, the two important hyperparameters of this algorithm are the sample size, which defines the number of samples to be drawn from the training set to build each tree, and the degree of contamination, which defines the percentage of samples to be considered as anomalies.
In this study, the isolation forest model was trained using instances accumulated from previous days. The model was then used to classify new alerts coming in real time.
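This train-on-past, classify-today step can be sketched with scikit-learn's `IsolationForest`. The feature vectors below are synthetic stand-ins fabricated for illustration; the sample-size and contamination values mirror the baseline reported later in Section V.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic stand-in for digitized alert features: a dense cluster of
# less-critical records plus a few far-away highly critical ones.
past_days = rng.normal(0, 1, size=(1000, 5))        # previous days' records
today_normal = rng.normal(0, 1, size=(95, 5))
today_critical = rng.normal(8, 1, size=(5, 5))      # anomalous records
today = np.vstack([today_normal, today_critical])

# Baseline hyperparameters from the experiments: sample size 1,000 and
# a 10% degree of contamination.
clf = IsolationForest(max_samples=min(1000, len(past_days)),
                      contamination=0.10, random_state=0)
clf.fit(past_days)            # train only on data accumulated from past days
pred = clf.predict(today)     # -1 = anomaly (highly critical), 1 = normal
flagged = np.flatnonzero(pred == -1)
```

Because anomalies need fewer random splits to isolate, the five injected outliers end up close to the tree roots and receive the label -1, while most of the in-distribution records pass through.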

C. FALSE POSITIVE REDUCTION
Our previously proposed anomaly detection method [5] achieved a false positive rate of 12.55%, which is quite high. In this study, the reconstruction error value obtained from the SAE was used to reduce the false positive rate. An SAE comprises several autoencoders (AEs) stacked together. Each AE is a symmetric neural network model with a particular structure, where the number of output neurons equals that of input neurons, while the number of neurons in the hidden layer is smaller than that in the input and output layers. AE is an unsupervised learning method in the sense that training is performed on non-labeled data.
In an SAE, the output of the previous AE is used as an input to the next AE, as shown in Fig. 3. In the context of anomaly detection, an SAE model is first trained on common or desired data (in this study, less critical records). The SAE model then outputs a high reconstruction error (RE) when encountering an anomaly (in this study, highly critical records). In other words, records with a high RE are considered highly critical in this study.
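The reconstruction-error idea can be illustrated with a scaled-down autoencoder. Here scikit-learn's `MLPRegressor`, trained to reproduce its own input, stands in for the paper's much larger SAE (described later as 2146-400-50), and all data is synthetic.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
# Train the autoencoder on less-critical (normal) records only; the
# encoding then captures normal structure, so highly critical records
# should reconstruct poorly.
normal_train = rng.normal(0.0, 0.1, size=(500, 8)) + 0.5
ae = MLPRegressor(hidden_layer_sizes=(4, 2, 4),  # encode-bottleneck-decode
                  activation='tanh', max_iter=2000, random_state=0)
ae.fit(normal_train, normal_train)               # learn to reproduce the input

def reconstruction_error(model, X):
    # Mean squared error per record between input and reconstruction.
    return np.mean((model.predict(X) - X) ** 2, axis=1)

re_normal = reconstruction_error(ae, rng.normal(0.0, 0.1, size=(20, 8)) + 0.5)
re_critical = reconstruction_error(ae, rng.uniform(-1.0, 2.0, size=(20, 8)))
# Records whose RE exceeds a chosen threshold are kept as highly critical.
```

Since the network only ever learned to reconstruct the narrow "normal" region, out-of-distribution records come back distorted and score a visibly larger error, which is exactly the signal used to prune false positives.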

V. EXPERIMENTAL ANALYSIS
Experiments were conducted in Python 3.6 running on a computer equipped with an Intel(R) Core(TM) i5-6400 CPU @2.70 GHz and 16 GB RAM. The performance metrics included recall, false positive rate (FPR), true negative rate (TNR), and balanced classification rate (BCR). Recall, also known as the detection rate, is the number of correctly classified highly critical records divided by the total number of highly critical records (Eq. 1). FPR is the number of less critical records classified incorrectly as highly critical records divided by the total number of less critical records (Eq. 2). TNR is the opposite of FPR; it is defined as the number of less critical records classified correctly as less critical records divided by the total number of less critical records (Eq. 3). BCR combines recall and TNR (Eq. 4). It is often used instead of accuracy and F-score on imbalanced datasets [22], which is the case in this study. Typically, the objective is to maximize recall while minimizing FPR. However, the objective of minimizing FPR is relaxed in this study because false positive records can be re-examined by security operators.

Recall = TP / (TP + FN),  (1)
FPR = FP / (FP + TN),  (2)
TNR = TN / (TN + FP),  (3)
BCR = (Recall + TNR) / 2,  (4)
where TP stands for true positive, representing the number of highly critical records correctly classified as highly critical; TN stands for true negative, representing the number of less critical records correctly classified as less critical; FN stands for false negative, representing the number of highly critical records incorrectly classified as less critical; and FP stands for false positive, representing the number of less critical records incorrectly classified as highly critical. In addition to the above four metrics, the performance of the proposed method was evaluated using the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC). Two experiments were conducted as outlined below: one tested the anomaly detection capability of the proposed method, while the other tested the ability of the method to reduce the number of false positives.
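The four metrics follow directly from the confusion-matrix counts, as in this small helper. The counts in the example are toy numbers, not results from the paper, and BCR is computed as the average of recall and TNR, the common definition assumed here.

```python
def metrics(tp, tn, fp, fn):
    """Recall (detection rate), FPR, TNR, and BCR from the four counts."""
    recall = tp / (tp + fn)      # Eq. 1: detected critical / all critical
    fpr = fp / (fp + tn)         # Eq. 2: falsely flagged / all less critical
    tnr = tn / (fp + tn)         # Eq. 3: complement of FPR
    bcr = (recall + tnr) / 2     # Eq. 4: balanced classification rate
    return recall, fpr, tnr, bcr

# Toy confusion counts: 9 of 10 critical records detected,
# 10 of 100 less-critical records falsely flagged.
recall, fpr, tnr, bcr = metrics(tp=9, tn=90, fp=10, fn=1)
```

On an imbalanced dataset such as this one, BCR is informative precisely because a classifier that labels everything less critical would score a high TNR but zero recall, and thus a BCR of only 50%.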

A. ANOMALY DETECTION
To find the best configuration for detecting anomalies using isolation forest, the following three models were tested: overall, monthly, and daily. The overall model was trained using all the data (i.e., collected over ten months), while the window was reduced to one month and one day for the monthly and daily models, respectively. The daily model included four different submodels, allowing us to investigate the impact of including the current day and of using the result of isolation forest as an input to supervised classifiers. In total, six models were considered, where the first four models used the same data for both the training and test steps, while the remaining two models used only an unseen test set for testing. Table 3 shows the experiment results for anomaly detection obtained by the three models, which are represented by prefixes from one to three.

1) OVERALL MODEL
The Overall Model is shown in Fig. 4(a). It was trained and tested on the same 564,561 records collected over ten months. To tune the two important model hyperparameters, the following values were tested: 30,000, 3,000, and 1,000 for the sample size; and 5%, 10%, and 20% for the degree of contamination. The results are shown in Table 3.
According to [21], isolation forest can maintain a high detection rate with a small sample size. This observation is consistent with the results presented in Table 3 for Model 1, where higher BCRs correspond to lower sample sizes. Table 3 indicates that the sample size of 1,000 and the contamination degree of 5% provide the best performance, with a BCR of 95.48%. However, when targeting a recall of 100%, the sample size should be 1,000 and the contamination degree should be 10%, which provide a BCR of 95.10%. Therefore, the latter values were chosen as the baseline for the next experiments.
The records were visualized by mapping them into a two-dimensional layout using t-SNE [23]. While the distance in a t-SNE mapping does not carry direct information given that t-SNE is not a distance-based mapping [24], close proximity in the mapping still suggests a strong relationship. Fig. 5 shows the mapping for the original labels, while Fig. 6 shows the mapping for the anomaly scores output by isolation forest during testing. In Fig. 5, the majority of negative (less critical) records, shown in dark blue, are clustered inside a big cluster, whereas positive (highly critical) records, shown in light blue, lie outside the big cluster. In Fig. 6, records with the highest anomaly scores belong to Group-A; in contrast, records with the smallest anomaly scores belong to Group-I and Group-J. The majority of B-C-D-E-F-G-H records are spread outside the big cluster, which means that these false records lie close to true records. Fig. 4(b) shows the Monthly Model, which was trained and tested using the data from the same one month. The goal was to verify the impact of the temporal factor during the training phase. The sample size and contamination degree were set to the baseline values, i.e., 1,000 samples and 10%, respectively. Two alternative Monthly Models were also considered. One was trained on the data from January to July and tested on the data from August to October. The other was trained on the data from January to April and tested on the data from May to October. Table 3 shows the results.

2) MONTHLY MODEL
The table indicates that the first Monthly Model achieved a BCR of 92.17%, which is lower than that of the Overall Model (95.10%). This result demonstrates that a larger time window provides a better representation in the isolation tree. As expected, the other two Monthly Models achieved even lower BCRs (77.96% and 75.87%) as they were trained and tested on different data. Fig. 4(c) shows the Daily Model that was trained and tested using the data from a single day. This model was used to verify the hypothesis that a small window size would yield a better FPR. Several degrees of contamination were tested, as shown in Table 3 for Models 3-a.

3) DAILY MODELS

a: SINGLE-DAY
In Model 3-a-1, the same hyperparameters were used as those for Models 2-1 and 1-8. The table indicates that Model 3-a-1 achieved the best FPR, which can be explained by the unique pattern of every day. From these results, we conclude that a wider time window can lead to a better recall, while a narrower time window can lead to a better FPR. Fig. 4(d) shows the Daily Model that was trained using the data from previous days and the current day, and tested on the current day's data. Fig. 7 shows the size of the training set depending on the number of previous days considered. Several degrees of contamination were tested, as shown in Table 3 for Models 3-b. Model 3-b-1 achieved a recall of 100% and an FPR of 16.08%; the latter is comparable to the FPR of Model 1-8. However, Model 1-8 was trained on the data from ten months (i.e., with a ten-month delay), while Model 3-b-1 was trained on the data with a one-day delay only. A further Daily Model variant was trained on the data accumulated from previous days only, excluding the current day, and tested on the current day's data. Several degrees of contamination were tested, as shown in Table 3 for Models 3-c. Model 3-c-2 achieved a recall of 100% and an FPR of 12.55%, which is better than the FPR of Model 3-b-1. While the current day was not included for training Model 3-c-2, the model maintained the FPR pattern captured in Figs. 8 and 9. Table 3 shows the results for Models 3-d. Among the four classifiers, Model 3-d-2 achieved the best performance with a BCR of 89.69%, which is very close to the BCR of Model 3-c-1 (89.94%). This result indicates that isolation forest predictions can be used as labels for supervised machine learning. Fig. 10 illustrates the best performing models in each group (i.e., Overall, Monthly, and Daily). The best models were taken to be the ones that achieved a recall of 100% (or the highest) and the highest BCR.
Model 3-b-1 achieved the best AUC; however, we consider Model 3-c-2 to be the best performer because it can provide correct predictions for any new record in real time while maintaining performance comparable to that of other well-performing models with a time delay, such as Models 1-8 and 3-b-1.

4) COMPARISON WITH SEVERITY-BASED FILTERING
The records used in this study were generated by a security appliance that also provided key features called ''impact'' and ''incidentImpact'' indicating the severity of each record. To evaluate the proposed Model 3-c-2, it was compared against the filtering results obtained using ''impact'' and ''incidentImpact.'' Figs. 11 and 12 show the ROC curve of Model 3-c-2 plotted against 11 different levels of these two features, respectively.
The figures indicate that the proposed model outperforms the filtering methods based on the two features. These results demonstrate the effectiveness of the proposed model in reducing the number of less critical threat records by 87.41%. Furthermore, the model achieved a recall of 100% and an FPR of 12.55% (Table 3, Model 3-c-2). With the proposed model, the security operator can focus on the remaining 12.59% of records. Note that the 0.04% difference arises because the 291 true records are excluded from the 12.55% FPR of Model 3-c-2.

B. FALSE POSITIVE REDUCTION
To further reduce the FPR, two strategies were applied: day-forward-chaining analysis and SAE. Given that the data considered in this study are time series, the conventional cross-validation method is not appropriate for training models due to temporal dependencies among data points and the arbitrary choice of the test set [25]. An alternative method that works well on time-series data is day-forward-chaining analysis [25], which originated from rolling-origin evaluation [26] and rolling-origin-recalibration evaluation [27]. The overview of day-forward-chaining analysis is depicted in Fig. 14. According to the figure, data from only the first day are used to train the model, which is then evaluated on the data from the second day. Next, data from the first and second days are used to train the model, which is then evaluated on the data from the third day; and so on.
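The split schedule of day-forward-chaining can be sketched as a simple generator (an illustrative helper, not code from the paper):

```python
def day_forward_chaining(n_days):
    """Yield (training_days, test_day) pairs: the model for day t is
    trained on all days before t and evaluated on day t itself."""
    for t in range(1, n_days):
        yield list(range(t)), t

# For a four-day dataset: train on day 0 and test on day 1, then train
# on days 0-1 and test on day 2, and so on.
splits = list(day_forward_chaining(4))
```

Unlike shuffled cross-validation, every split respects temporal order, so the model is never trained on data from the future relative to its test day.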
For reconstruction error analysis using SAE, the following network architecture was employed: two layers of AE were stacked with the configuration of input with 2146 neurons, encode 1 with 400 neurons, encode 2 with 50 neurons, decode 2 with 50 neurons, decode 1 with 400 neurons, and output with 2146 neurons. Fig. 13 shows the three models that were used to verify the proposed method. The first, one-day-delay model considered data from only one day (Fig. 13(a)). The second model considered data accumulated from previous days for both isolation forest and SAE (Fig. 13(b)). The third model considered data accumulated from previous days for isolation forest and from one previous day for SAE (Fig. 13(c)).

VI. CONCLUSION
This paper presented an unsupervised method incorporating isolation forest and SAE for addressing the alert fatigue problem. The proposed method accumulates data from previous days to train a classification model while screening the incoming records at the same time, resulting in day-forward-chaining analysis. The method was demonstrated to be effective in reducing the number of false positives, thus leaving only a small number of threat alerts for security operators to review. The method is practical as it does not require prior data labeling and can operate in real time.
This study focused on threat alerts produced by a single security appliance. In the future, the proposed method needs to be generalized to work with multiple security appliances generating logs in different formats. Furthermore, other features need to be adopted for improving anomaly detection.
MUHAMAD ERZA AMINANTO received the Ph.D. degree from the School of Computing, Korea Advanced Institute of Science and Technology (KAIST), South Korea, in 2018. He is currently a Lecturer with the University of Indonesia (UI) and a Cooperative Visiting Researcher at the Cybersecurity Research Institute, National Institute of Information and Communications Technology (NICT), Japan. His current research interests include machine learning, IDS, and cybersecurity.
TAO BAN (Member, IEEE) received the B.S. degree from the Department of Automatic Control, Xi'an Jiaotong University, in 1999, the M.E. degree from the Department of Automation, Tsinghua University, in 2003, and the Ph.D. degree from Kobe University, in 2006. He is currently a Senior Researcher with the Cybersecurity Research Institute, National Institute of Information and Communications Technology, Tokyo, Japan. His research interests include network security, malware analysis, machine learning, and data mining for security.