Dark-TRACER: Early Detection Framework for Malware Activity Based on Anomalous Spatiotemporal Patterns

As cyberattacks become increasingly prevalent globally, there is a need to identify trends in these cyberattacks and take suitable countermeasures quickly. The darknet, an unused IP address space, is relatively conducive to observing and analyzing indiscriminate cyberattacks because of the absence of legitimate communication. Indiscriminate scanning activities by malware to spread their infections often show similar spatiotemporal patterns, and such trends are also observed on the darknet. To address the problem of early detection of malware activities, we focus on anomalous synchronization of spatiotemporal patterns observed in darknet traffic data. Our previous studies proposed algorithms that automatically estimate and detect anomalous spatiotemporal patterns of darknet traffic in real time by employing three independent machine learning methods. In this study, we integrated the previously proposed methods into a single framework, which we refer to as Dark-TRACER, and conducted quantitative experiments to evaluate its ability to detect these malware activities. We used darknet traffic data from October 2018 to October 2020 observed by our large-scale darknet sensors (up to /17 subnet scales). The results demonstrate that the weaknesses of the methods complement each other, and the proposed framework achieves an overall 100% recall rate. In addition, Dark-TRACER detects malware activities an average of 153.6 days earlier than when those malware activities are revealed to the public by reputable third-party security research organizations. Finally, we evaluated the cost of human analysis to implement the proposed system and demonstrated that two analysts can perform the daily operations necessary to operate the framework in approximately 7.3 h.


I. INTRODUCTION
In recent years, an increasingly large number of indiscriminate cyberattacks have been observed on the Internet, and it is therefore becoming increasingly costly to analyze these attacks. To maintain security of the Internet, it is necessary to quickly recognize global cyberattack trends, specify their causes, devise countermeasures, and alert the world of the details of the threat. For this purpose, it is important to detect the indiscriminate scanning attack activities caused by malware at an early stage before a particular attack becomes a pandemic.
(The associate editor coordinating the review of this manuscript and approving it for publication was Kashif Saleem.)
However, it is challenging to identify malware scanning attacks among the massive amount of benign traffic in regular networks. Therefore, we adopted unused IP address spaces (darknets). The term ''darknet'' refers to observation networks, also known as ''network telescopes,'' and should not be confused with anonymous communication networks such as Tor. In the darknet, legitimate communication (noise) does not occur; therefore, indiscriminate scanning communication (signal) is observed more noticeably. Thus, the signal-to-noise ratio is high. This makes it an effective way to identify trends and tendencies in global cyberattacks.
However, the volume of traffic observed in the darknet is increasing exponentially each year. Moreover, there are many communications whose intentions are unknown, as only the initial communications are observed. For example, in a darknet, we observe numerous independent cyberattacks occurring simultaneously, as well as many communications that are unrelated to attacks, such as scanning activities that are conducted for benign investigation purposes, communications with unknown causes, and misconfigured communications. For research purposes, such noisy communications must be carefully distinguished from malicious attack communications.
Devices infected with similar malware, that is, ones which share scanning modules, tend to scan in a similar spatiotemporal pattern to compromise new infection targets [1]. Such a tendency is also observed on the darknet [2]. Here, the distributions of source hosts and destination ports for packets observed in a certain period are referred to as spatial features. The features observed in the temporal variation of these spatial features are thus referred to as spatiotemporal patterns. The hosts and destination ports that send packets with similar spatiotemporal patterns are then referred to as being synchronized. Even in case of small-scale infection activity of malware, a high degree of synchronicity is expected to occur in the associated spatiotemporal patterns, and early detection of malware activity can be realized by estimating the synchronicity and detecting anomalies.
In our previous studies, we focused on such synchronization and attempted to detect potential malware activities by estimating the group of source hosts with high synchronization in their spatiotemporal patterns on a large-scale darknet. In those studies, we adopted the following three machine learning methods to estimate the synchronization of spatiotemporal patterns from packet counts by spatial feature per unit time in darknet traffic data: Graphical Lasso [3], nonnegative matrix factorization (NMF) [4], and nonnegative Tucker decomposition (NTD) [5]. The Graphical Lasso algorithm can sparsely estimate conditionally independent variable pairs that are not synchronous from a covariance matrix. The NMF and NTD algorithms can decompose synchronous latent frequent patterns from data matrices or tensors into superpositions of multiple groups. Building on these algorithms, we previously proposed the following methods, which estimate the synchronization automatically in real time and detect the groups of source hosts that show abnormal synchronization: Dark-GLASSO [6], [7], Dark-NMF [8], and Dark-NTD [9].
In our previous studies, we confirmed that each method is capable of detecting malware activities well. However, we did not comparatively evaluate the methods and examine their early malware activity detection performance. In this study, we first modularized the three previously proposed methods and integrated common components such as feature extraction and alert issuing into a single framework. We refer to this integrated framework as Dark-TRACER. As the main challenge, we conducted two experiments on Dark-TRACER: one evaluates the quantitative detection performance, and the other evaluates the feasibility of early detection. In the first experiment, to quantitatively evaluate the detection performance of malware activity, we used the ground truth of reliable malware activity in October 2018, which was manually created, and performed parameter tuning to minimize false negatives and false positives in each module. Although we have previously presented the evaluation results of a conventional method, ChangeFinder [10], and the proposed modules Dark-GLASSO and Dark-NMF, we evaluate Dark-NTD for the first time using the same criteria. In the second experiment, we manually generated a new ground truth of events (from June 2019 to October 2020) that clearly shows the time of infection spread of malware activities and used it to evaluate the feasibility of the proposed framework for early detection.
As a result, Dark-GLASSO, Dark-NMF, and Dark-NTD achieved 97.1%, 100%, and 97.1% recall, respectively. We also identified the pros and cons of each module and found that the integration of all the proposed modules into a single framework, Dark-TRACER, complemented each individual module's weaknesses. In addition, the results of the early detection feasibility evaluation show that Dark-TRACER can detect threats 153.6 days earlier than when the threats were revealed to the public by reputable third-party security research organizations. We also assessed the human analysis cost and found that daily operation with two analysts could be performed in an average of 7.3 h, assuming that one analyst requires 15 min of analysis time per port.
In summary, this study afforded the following contributions:
• We integrated our three prior methods (modules) into a single framework, Dark-TRACER. To the best of our knowledge, our approach is the first method that focuses on the synchronization of spatiotemporal patterns of the darknet traffic. Dark-TRACER can detect malware activities that show anomalous synchronization.
• This work is also the most advanced practical study that quantitatively evaluated the detection performance of malware activities and the feasibility of early detection.
• We found that Dark-TRACER complements the weaknesses of each module, and achieves a 100% recall rate. In addition, the results demonstrate that Dark-TRACER detects threats on average 153.6 days earlier than when the threats are revealed to the public. We also demonstrated that two analysts can conduct the necessary daily operations of the framework in approximately 7.3 h.
Currently, Dark-TRACER is being implemented in real-world contexts for actual operation. It is expected to provide information on detected global malware activities to organizations such as the Computer Security Incident Response Team (CSIRT) and the Security Operation Center (SOC), and to assist in their ability to implement prompt countermeasures such as investigating the causes and conducting detailed analysis.
The remainder of the paper is organized as follows. The proposed framework, Dark-TRACER, and its three modules are presented in Section II. In Section III, we present the methodology and results from the parameter tuning and quantitative evaluation experiments on the detection performance of malware activities for each module. Section IV describes the feasibility evaluation of the proposed method for the early detection of malware activities. In Section V, we discuss the advantages of Dark-TRACER through a comprehensive comparison of each proposed module, consideration of the likely adversarial attacks, ways to reduce false-positive alerts, and practical operation methods. Section VI provides a summary of related work on darknet measurement analysis, malware activity detection, and investigative scanners, and we conclude the paper in Section VII.
The following advantages over existing malware activity detection methods can be achieved by focusing on synchronicity: 1) We can reduce the effect of benign noise communication in the darknet traffic and highlight the malicious communication. 2) In addition, malware activities that are difficult to trace by conventional manual operations, such as threats that are small-scale, orchestrated, or have no visible explicit spikes, can be captured before the malware infection becomes widespread by detecting anomalously synchronized spatial features. 3) Finally, if a malware activity is found to be synchronized with other malware activities at a time when the scale of infection is small (i.e., before it spreads in earnest), it can be detected at that early stage.
The pseudocode of the Dark-TRACER framework is presented in Algorithm 1. The parameters are described in this section and Section III-C. For more specific details of the three algorithms employed in this study, refer to the original paper reporting on each algorithm [3]-[5] or our previous works [6]-[9]. Based on Fig. 1 and Algorithm 1, the modules are described in greater detail in the following subsections.

A. DATA OBSERVATION
Dark-TRACER targets darknet traffic data for analysis. As mentioned previously, the darknet has the advantage of a high signal-to-noise ratio, because regular communication (noise) is not typically observed there, and indiscriminate scanning communication (signal) is monitored in abundance. However, not all communications that are observed in the darknet are malicious communications caused by malware. Some of the communications observed in the darknet are not related to attacks, including scanning activities for investigation purposes by services such as Shodan and Censys [11], 1 unexplained communications, and misconfigured communications. Dark-TRACER is a framework that detects intrinsic attacks and malware activities by ignoring and eliminating such noisy communications.
We have implemented a large-scale darknet observation system, the NICTER project, 2 which aims to understand global trends in indiscriminate cyberattacks. Darknet observation systems (sensors) have been installed in several countries and organizations, and approximately 300,000 IP addresses are currently being monitored. The observed data of these darknet sensors differ slightly depending on their geographical location and the scale of observation. For this reason, Dark-TRACER analyzes each sensor separately.

Algorithm 1 Dark-TRACER framework (fragment; the remaining lines were not recoverable)
/* Algorithms and Anomaly Detection (Section II-C, II-D) */
/** Dark-GLASSO **/
4: for epochs do
12: end for
/* Issuing Alerts (Section II-E) */
15: outliers ← outliers_1 | outliers_2 | outliers_3
16: Alerts ← issuing_alerts(outliers)
17: end while
Next, as a data preprocessing step, Dark-TRACER analyzes only TCP-SYN packets because TCP packets other than TCP-SYN that reach the darknet are not considered to be attack scans. In addition, the upper 16 bits of the IP address are adopted as the unit of the source host. This means that hosts are aggregated on a regional or organizational level. Finally, to highlight the observation of unknown malware activities, we excluded well-known and frequently observed threat ports.
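The preprocessing rules above can be sketched as follows. This is only a minimal illustration and not the authors' implementation (which relies on tcpdump and R): the `Packet` record, its field names, and the helpers `host_unit` and `preprocess` are hypothetical, and the excluded port list is the one given in Section III-A.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple

# Ports excluded in the experiments (listed in Section III-A).
EXCLUDED_PORTS = frozenset({22, 23, 80, 81, 445, 1433, 2323, 3389, 5555, 8080, 52869})


@dataclass(frozen=True)
class Packet:
    """Hypothetical packet record; field names are illustrative only."""
    timestamp: float   # seconds since the start of the observation
    src_ip: str        # dotted-quad IPv4 source address
    dst_port: int      # TCP destination port
    tcp_flags: str     # e.g. "S" for SYN, "SA" for SYN-ACK


def host_unit(src_ip: str) -> str:
    """Aggregate a source address to its upper 16 bits (first two octets)."""
    first, second, *_ = src_ip.split(".")
    return f"{first}.{second}"


def preprocess(packets: Iterable[Packet]) -> Iterator[Tuple[float, str, int]]:
    """Yield (timestamp, host_unit, dst_port) for pure TCP-SYN packets only."""
    for p in packets:
        if p.tcp_flags != "S":            # keep SYN-only packets
            continue
        if p.dst_port in EXCLUDED_PORTS:  # drop well-known threat ports
            continue
        yield p.timestamp, host_unit(p.src_ip), p.dst_port
```

Aggregating to the first two octets means that, for example, 192.0.2.10 and 192.0.77.1 are counted as the same source host unit.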

B. SPATIOTEMPORAL FEATURE EXTRACTION
First, we prepared darknet traffic data for a certain period ($T$ seconds). We assumed that $N_h$ unique source hosts and $N_p$ unique destination ports were observed in the darknet traffic data. Then, at a sampling interval of $T/M$ seconds, the number of packets was counted for each source host or destination port, and these counts are referred to as the spatial feature variables. Here, $M$ is a hyperparameter. From the above, three types of tensors representing spatiotemporal features were generated from the observed data: $V_h \in \mathbb{N}_0^{M \times N_h}$, $V_p \in \mathbb{N}_0^{M \times N_p}$, and $V_{hp} \in \mathbb{N}_0^{M \times N_h \times N_p}$ ($\mathbb{N}_0 = \{0, 1, 2, \cdots\}$). This feature extraction was processed in real time and sequentially every $t$ seconds.
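As an illustration of this step, the sketch below counts packets per time slot and builds the host matrix, the port matrix, and the rank-3 host-port tensor. The function name `extract_features` and the `(timestamp, host, port)` record layout are assumptions made for this example. The two matrices are obtained here by summing the rank-3 tensor over one axis, which gives the same counts as tallying them directly.

```python
import numpy as np


def extract_features(records, T, M):
    """Build V_h (M x N_h), V_p (M x N_p), and V_hp (M x N_h x N_p).

    records: iterable of (timestamp, host, port) with 0 <= timestamp < T.
    """
    records = list(records)
    hosts = sorted({h for _, h, _ in records})
    ports = sorted({p for _, _, p in records})
    host_idx = {h: i for i, h in enumerate(hosts)}
    port_idx = {p: j for j, p in enumerate(ports)}

    V_hp = np.zeros((M, len(hosts), len(ports)), dtype=np.int64)
    slot_len = T / M  # sampling interval in seconds
    for ts, h, p in records:
        m = min(int(ts // slot_len), M - 1)  # time-slot index
        V_hp[m, host_idx[h], port_idx[p]] += 1

    V_h = V_hp.sum(axis=2)  # packets per slot and source host
    V_p = V_hp.sum(axis=1)  # packets per slot and destination port
    return V_h, V_p, V_hp, hosts, ports
```

With T = 1,800 s and M = 30 (the values used in Section III-B), each row of the matrices covers a 60 s slot.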

C. APPLYING ALGORITHMS
This section briefly introduces the main characteristics of the sparse structure learning algorithm Graphical Lasso [3] and the two tensor decomposition algorithms, Nonnegative Matrix Factorization (NMF) [4] and Nonnegative Tucker Decomposition (NTD) [5]. In addition to the above algorithms, Dark-TRACER can be applied with other methods to estimate the synchronization of spatiotemporal features, but an anomaly detection method that is appropriate for the method must be considered.

1) GRAPHICAL LASSO
The Graphical Lasso (package name: glasso 3 ) algorithm is a sparse structure learning method that can calculate the ''intrinsic relationships'', rather than spurious correlations, between variables. Here, ''no intrinsic relationship'' between two variables is equivalent to the conditional independence of the two variables given the other variables. In a Gaussian graphical model, which is a structural learning model that assumes a multivariate Gaussian distribution, the above problem can be considered as the problem of estimating a precision matrix (i.e., an inverse covariance matrix). Graphical Lasso uses maximum likelihood estimation with an $\ell_1$ regularization term to obtain a sparse precision matrix, thereby introducing sparsity into the relationship between variables.
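For reference, the standard Graphical Lasso estimate can be written as below. The symbols $S$ (sample covariance matrix of the spatial feature variables) and $\Theta$ (precision matrix) are notation chosen for this sketch; $\lambda$ is the regularization coefficient, and $\lVert \Theta \rVert_1$ is the sum of the absolute values of the entries of $\Theta$.

```latex
\hat{\Theta} = \operatorname*{arg\,max}_{\Theta \succ 0}
  \left\{ \log\det\Theta - \operatorname{tr}(S\Theta) - \lambda \lVert \Theta \rVert_1 \right\}
```

Under the Gaussian assumption, an off-diagonal entry $\hat{\Theta}_{ij} = 0$ corresponds to conditional independence of variables $i$ and $j$ given the others, so a larger $\lambda$ yields a sparser graph.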
The obtained precision matrix can be represented as an undirected graph, as shown in the Graphical Lasso section of Fig. 1. The node set represents the set of variables, and the edge set represents the ''presence or absence of a relationship'' between the variables. In other words, when there is no relationship between variables, no edges are drawn between the nodes corresponding to those variables. Alternatively, if there is a relationship, an edge is drawn. Graphical Lasso has been applied to a wide range of real-world anomaly detection problems, such as outlier detection [12], [13] based on the relationship of the candidate outlier with the other variables, as in Dark-GLASSO, and the detection of changes in a graph structure [14], [15].

a: DARK-GLASSO MODULE
Dark-GLASSO uses Graphical Lasso to estimate and graph the intrinsic relationship between spatial feature variables from a spatiotemporal feature matrix (V h or V p ). This can be interpreted as a representation of the synchronization between the variables.

2) TENSOR DECOMPOSITION
Tensor decomposition is a method of decomposing latent frequent patterns from a matrix or tensor into a superposition of multiple groups. Several models have been proposed depending on the rank of the tensor and the decomposition method. Tensor decomposition has been applied in a variety of fields, such as recommendation systems in the marketing domain [16], feature extraction in electroencephalograms [17], image classification [18], and foreground filtering and activity detection in videos [19].
The tensor data generated in Section II-B does not accept negative values. To make the decomposition results realistic and interpretable, we employed the tensor decomposition methods NMF and NTD with non-negative constraints. NMF is a decomposition method for rank-2 tensors (matrices), whereas NTD is a decomposition method for rank-D tensors (in this case, D = 3). NTD can be regarded as an extension of NMF to higher dimensions. Next, the application process of the method is briefly explained for each module.

a: DARK-NMF MODULE
As shown in the NMF part of Fig. 1, NMF is a method of approximate decomposition of a non-negative matrix $V \in \mathbb{N}_0^{M \times N}$ into a product of two non-negative factor matrices $W \in \mathbb{R}^{M \times r}$, $H \in \mathbb{R}^{r \times N}$ ($V \approx WH$). Here, $r$ is the number of basis vectors, and refers to the number of patterns to be decomposed ($r < N, M$). NMF minimizes the error function $\|V - WH\|_F^2$ (s.t. $W \geq 0$, $H \geq 0$), where $\|\cdot\|_F$ denotes the Frobenius norm. Although several approximate decomposition algorithms have been proposed for NMF, we employed the most well-known multiplicative update algorithm proposed by Lee et al. [4]. In this algorithm, the initial values of $W$, $H$ are given, and the optimization is performed by alternately updating $W$, $H$ until the error function is minimized. In Dark-NMF, the values of the singular value decomposition were taken as the initial values.
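The multiplicative update rule can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the Dark-NMF implementation: the function name `nmf_multiplicative`, the fixed iteration count, the small constant `eps` that guards against division by zero, and the random initialization (Dark-NMF uses SVD-based initial values instead) are all choices made for this example.

```python
import numpy as np


def nmf_multiplicative(V, r, n_iter=200, eps=1e-9, seed=0):
    """Approximate V (M x N, non-negative) as W @ H with W: M x r, H: r x N."""
    rng = np.random.default_rng(seed)
    M, N = V.shape
    # Assumption: random positive initialization (Dark-NMF uses SVD-based values).
    W = rng.random((M, r)) + eps
    H = rng.random((r, N)) + eps
    errors = []
    for _ in range(n_iter):
        # Alternate the two multiplicative updates; both keep entries non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H, "fro") ** 2)
    return W, H, errors
```

Because each update multiplies by a non-negative ratio, the constraints $W \geq 0$, $H \geq 0$ hold without any explicit projection.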
In summary, Dark-NMF approximately decomposes a spatiotemporal feature matrix ($V_h$ or $V_p$) into two factor matrices $W$, $H$ using NMF. The decomposed matrices contain the same number of potentially synchronous groups of spatiotemporal feature variables as the number of bases.

b: DARK-NTD MODULE
As shown in the NTD part of Fig. 1, NTD decomposes a rank-D tensor into one small tensor and several matrices. Dark-NTD works with rank-3 tensors and decomposes each of them into one small tensor $G$ and three matrices $A^{(1)}$, $A^{(2)}$, $A^{(3)}$. The tensor $V \in \mathbb{R}^{I_1 \times I_2 \times I_3}$ can be decomposed as $V \approx G \times_1 A^{(1)} \times_2 A^{(2)} \times_3 A^{(3)}$, where $A^{(n)} \in \mathbb{R}^{I_n \times R_n}$ ($n \in \{1, 2, 3\}$), $G \in \mathbb{R}^{R_1 \times R_2 \times R_3}$, the scalars $I_1$, $I_2$, $I_3$ denote the length of each axis (mode), and the symbol $\times_i$ denotes the product in the mode-$i$ direction, which multiplies the tensor by the matrix in mode $i$. The factor matrix $A^{(n)}$ can be regarded as an extracted feature for mode $n$. The tensor $G$ is referred to as a core tensor and represents the weights of the basis vectors of each mode and the strength of the relationship. In addition, $R_1$, $R_2$, $R_3$ are the ranks, which determine how many basis vectors are extracted for each mode and can be interpreted as the number of frequent patterns.
The NTD algorithm minimizes the error function $\|V - G \times_1 A^{(1)} \times_2 A^{(2)} \times_3 A^{(3)}\|_F^2$ under non-negativity constraints on $G$ and $A^{(n)}$. The optimization is performed by alternately updating $G$ and $A^{(n)}$. However, when the tensor to be decomposed, $V$, becomes large, the decomposition becomes practically impossible because a massive amount of memory and computation is required to perform the exact calculation. To address this problem, Dark-NTD utilizes the fiber sampling tensor decomposition (FSTD) [20] to perform a low-rank approximation of the tensor $V$ in advance. Based on the results of this low-rank approximation, LRA-NTD [21], which efficiently approximates NTD, is applied to save memory and accelerate the decomposition process without reducing its precision. For details of this acceleration, please refer to Kanehara et al.'s previous paper on Dark-NTD [9].
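To make the mode-$n$ product concrete, the following sketch reconstructs a tensor from a core tensor and three factor matrices. It illustrates only the Tucker reconstruction that the error function compares against the observed tensor, not the NTD optimization or the FSTD/LRA-NTD acceleration; the helper names `mode_product` and `tucker_reconstruct` are hypothetical.

```python
import numpy as np


def mode_product(tensor, matrix, mode):
    """Mode-`mode` product: contract axis `mode` of `tensor` with the columns of `matrix`."""
    moved = np.moveaxis(tensor, mode, -1)   # bring the target axis last
    result = moved @ matrix.T               # (..., R_n) @ (R_n, I_n) -> (..., I_n)
    return np.moveaxis(result, -1, mode)    # restore the original axis order


def tucker_reconstruct(G, factors):
    """Compute G x_1 A1 x_2 A2 x_3 A3 for factor matrices A_n of shape (I_n, R_n)."""
    out = G
    for mode, A in enumerate(factors):
        out = mode_product(out, A, mode)
    return out
```

For a core tensor of shape (R_1, R_2, R_3) and factors of shapes (I_n, R_n), the result has shape (I_1, I_2, I_3), matching the observed tensor.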
In summary, Dark-NTD utilizes the accelerated NTD algorithm to decompose the rank-3 spatiotemporal feature tensor $V_{hp}$ into a core tensor $G$ and three matrices $A^{(1)}$, $A^{(2)}$, $A^{(3)}$. The results of the approximate decomposition contain the same number of potentially synchronous groups of spatiotemporal feature variables as the number of bases in each matrix. Each component of this decomposition result can be interpreted as follows:
• $A^{(1)}$: Temporal features. Each basis vector represents a temporal traffic pattern of a different type.
• $A^{(2)}$: Source host spatial features. The source hosts corresponding to the indices of each basis vector are assumed to be synchronized and in the same group.
• $A^{(3)}$: Destination port spatial features. The port numbers corresponding to the indices of each basis vector are assumed to be received from the same group.

D. ANOMALY DETECTION
In this section, we present a method for detecting anomalies in spatial feature variables based on the application results of each algorithm, module by module.

1) ANOMALY DETECTION IN DARK-GLASSO
From a graph of the calculated precision matrix, the degree of synchronization between the variables is quantified by the graph density $|E| / (N(N - 1))$. Here, $|E|$ is the number of elements in the edge set, and $N$ is the number of spatial feature variables. The closer the graph density is to 1, the more strongly all variables are related to each other. Graph density is also referred to as its sparsity. The graph density value is calculated from observation data every $T$ seconds in a continuous period and is recorded sequentially. Outlier detection is performed when time-series data of the graph density value are collected for a period of fixed window size ($K$). First, two variances are calculated: one when the largest element in the time-series data is excluded and the other when it is not excluded. If the ratio of the two variances exceeds a threshold θ, the largest element is considered an outlier and is deleted from the time-series data. The outliers are identified successively by the next largest element until they no longer exceed the threshold θ. If there are no outliers and the data size exceeds the fixed window size ($K$), the oldest data are deleted in chronological order. From the above, it is possible to determine the periods which have anomalous graph density values as compared to the other periods.
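The variance-ratio test described above can be sketched as follows. The direction of the ratio (variance with the largest element divided by variance without it) is an assumption, since the text does not state it, and the function names are hypothetical. Management of the sliding window of size K is omitted for brevity.

```python
from statistics import pvariance


def graph_density(num_edges, num_nodes):
    """Density |E| / (N (N - 1)) as defined in the text."""
    return num_edges / (num_nodes * (num_nodes - 1))


def detect_outliers(densities, theta):
    """Repeatedly remove the largest value while it inflates the variance by more than theta.

    Assumption: the ratio is var(with largest) / var(without largest).
    Returns (outliers, remaining_values).
    """
    remaining = list(densities)
    outliers = []
    while len(remaining) > 2:
        largest = max(remaining)
        without = list(remaining)
        without.remove(largest)           # drop one occurrence of the maximum
        var_without = pvariance(without)
        var_with = pvariance(remaining)
        if var_without == 0:
            ratio = float("inf") if var_with > 0 else 0.0
        else:
            ratio = var_with / var_without
        if ratio <= theta:                # largest value is not anomalous: stop
            break
        outliers.append(largest)
        remaining = without
    return outliers, remaining
```

A single spike in an otherwise flat density series produces a very large ratio, so it is flagged, while ordinary fluctuations leave the ratio close to 1.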

2) ANOMALY DETECTION IN DARK-NMF
Because the scales of the matrices $W$, $H$ are not unique, we first normalize them. To ensure that the sum of each column of $W$ is aligned to 1, we normalize $W$, $H$ as $W \leftarrow W\Lambda$, $H \leftarrow \Lambda^{-1}H$ using a diagonal matrix $\Lambda \in \mathbb{R}^{r \times r}$ whose diagonal components are the inverses of the column sums of $W$. The elements of the normalized $H$ are thereby scaled to actual observed packet counts. Spatial features with values of $H$ less than 1 are considered inactive features in the corresponding basis vector; the remaining features are active. Among the active spatial features, if more than β features exceed α (%) of the maximum element value (maximum number of packets), the active spatial features are judged as anomalies. In addition, when judging anomalous spatial features, a parameter $f \in \{0, 1\}$ is utilized to determine whether to treat all active spatial features or only those that are more anomalous. From the above, we can determine the anomalous spatial features of a specific period.
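A minimal sketch of the normalization and of one possible reading of the α/β rule is given below. The function names are hypothetical, the maximum is taken per basis vector (per row of the normalized H), which is an assumption, and the parameter f is not modeled. The sketch also assumes that no column of W sums to zero.

```python
import numpy as np


def normalize_factors(W, H):
    """Rescale so each column of W sums to 1 while leaving W @ H unchanged."""
    col_sums = W.sum(axis=0)              # one sum per basis vector
    W_norm = W / col_sums                 # equals W @ diag(1 / col_sums)
    H_norm = H * col_sums[:, None]        # equals diag(col_sums) @ H
    return W_norm, H_norm


def anomalous_features(H_norm, alpha, beta):
    """Per basis vector, flag active features above alpha % of the row maximum.

    Assumption: a basis is reported only if more than `beta` features qualify.
    """
    flagged = {}
    for k, row in enumerate(H_norm):
        active = row >= 1.0                         # values below 1 are inactive
        strong = active & (row > row.max() * alpha / 100.0)
        if strong.sum() > beta:
            flagged[k] = np.flatnonzero(strong).tolist()
    return flagged
```

The rescaling moves all magnitude information into H, which is why its entries can be compared against packet-count thresholds such as 1.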

3) ANOMALY DETECTION IN DARK-NTD
If there are two or more host spatial features that exceed the threshold value for A (2) , the group of hosts is considered to have synchronized activities and its IP addresses are recorded. In addition, G and A (3) are utilized to identify the destination port features through which this group of hosts communicated. From G, we identify a port group of A (3) that is linked to a group of hosts that have been determined to have synchronous activity from A (2) . In the identified port group, the destination port features that exceed the threshold are determined to be the targeted ports in the synchronized host group of A (2) . We can determine the anomalous host groups and their targeted ports in a specific period.
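The linkage between host groups and port groups through the core tensor can be sketched as follows. Several details are assumptions made only for illustration: columns of the factor matrices are scaled to a maximum of 1 before thresholding, a host group is linked to the single port group with the largest core-tensor weight summed over the temporal mode, and the function name `ntd_anomalies` is hypothetical.

```python
import numpy as np


def ntd_anomalies(G, A2, A3, th):
    """Return a list of (host_indices, port_indices) for synchronized host groups.

    Assumptions: columns of A2/A3 are max-normalized before thresholding, and a
    host group is linked to the port group with the largest core-tensor weight
    summed over the temporal mode.
    """
    A2n = A2 / np.maximum(A2.max(axis=0), 1e-12)
    A3n = A3 / np.maximum(A3.max(axis=0), 1e-12)
    link = G.sum(axis=0)                      # shape (R_2, R_3)
    results = []
    for r2 in range(A2.shape[1]):
        hosts = np.flatnonzero(A2n[:, r2] > th)
        if hosts.size < 2:                    # need two or more synchronized hosts
            continue
        r3 = int(np.argmax(link[r2]))         # most strongly linked port group
        ports = np.flatnonzero(A3n[:, r3] > th)
        results.append((hosts.tolist(), ports.tolist()))
    return results
```

The returned index lists refer to rows of the host and port factor matrices, so they must be mapped back to the actual host units and port numbers before an alert is issued.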

E. ISSUING ALERTS
The final process collects information that has been determined to be anomalous from each module and outputs an alert in a uniform format. For Dark-GLASSO, we used the entire darknet traffic data for a period that has been identified as anomalous. For Dark-NMF, we used the data for the spatial features identified as anomalous. If numerous source hosts sent many packets to a specific destination port, we aggregated the information regarding the time, destination port numbers, and source hosts and issued an alert. Finally, Dark-NTD issued an alert directly using the anomalous host group, targeted port, and time information found in the anomaly detection step.

III. EVALUATING QUANTITATIVE COMPARISON OF DETECTION PERFORMANCE
We evaluated the performance of each proposed module and describe the results of two different experiments to demonstrate the relationships between modules and their practicality. In the first experiment, described in this section, we quantitatively evaluated the ability of each module to accurately detect malware activities. In the second experiment, described in Section IV, we evaluated the feasibility of the early detection of malware activities. Darknet traffic was preprocessed using tcpdump and passed to Dark-TRACER, implemented in the R language. All experiments were conducted in a unified manner in Japan Standard Time on a machine with an AMD Ryzen Threadripper 2990WX CPU and 128 GB of memory.
In this experiment, we manually gathered TCP ports for which malware activities were clearly observed in October 2018 and generated the ground truth for a total of 35 TCP ports. This ground truth evaluation aimed to determine a hyperparameter set that minimized the number of false negatives, even if there were some false positives in each module, and evaluated the detection accuracy at that time. The conventional method, ChangeFinder, and the proposed modules, Dark-GLASSO and Dark-NMF, have already been tested and the results of those evaluations have been published [7], [8], whereas Dark-NTD was now evaluated for the first time using the same criteria. The following subsection describes the details of the dataset, the parameter tuning of Dark-NTD, and the comparison results from each module.

A. DETAILS OF DATASET
The dataset and the ground truth for evaluation were the same as those used in the previous reports for Dark-GLASSO and Dark-NMF and are publicly available. 4 Specifically, we employed data from eight darknet sensors A to H, which are located around the world and have different observation scales. The observation scale of each sensor ranges from approximately 30,000 IP addresses (/17 subnet) to approximately 2,000 IP addresses (/21 subnet), for a total of approximately 80,000 IP addresses in the darknet observation network. The data used in the experiment cover the month of October 2018. The average number of packets per day for sensor A, which has the largest observation scale, was 81.4 M, and the data size was 5,605 MB. To highlight the observation of unknown malware activities, the following 11 known and constantly observed TCP ports were excluded during preprocessing: 22, 23, 80, 81, 445, 1433, 2323, 3389, 5555, 8080, 52869. Next, regarding details of the ground truth, Table 1 shows the TCP ports where malware activities were monitored and the characteristics of the malware activities by threat type. The threat types were primarily classified into Internet of things (IoT) malware such as Mirai, Hajime, and HNS (Hide and Seek), vulnerabilities related to router manufacturers, and vulnerabilities related to other off-the-shelf service protocols.
As a fingerprint, or key identifier, of Mirai, it is commonly known that the sequence number matches the destination IP address in the SYN packet [22], [23]. A fingerprint of Hajime is that its window size is fixed at 14600, and an upper or lower 1 byte of the sequence number is 0. A common feature of router vulnerability types is that there are many cases wherein each router manufacturer's login screen was confirmed when an HTTP connection was made to the source hosts that sent the scan. Cohen et al. [24] also identified that ports 5379, 6379, and 7379 were observed during the same period. Please refer to that previous paper for more details, including time-series graphs of the unique host counts of malware activities in this ground truth.

B. PARAMETER TUNING IN DARK-NTD
In this section, we describe how to tune the following five hyperparameters in the Dark-NTD.
1) sensor: Which darknet sensor is used for the observed data.
2) $\tilde{R}_n$: Number of bases in FSTD, a low-rank approximation method for acceleration.
3) $R_n$: Number of bases in NTD.
4) epochs: How many times the calculation for the same data is repeated.
5) th: Threshold for alert determination.
The above five hyperparameters are tuned by grid search.
The search range and interval include our long-term empirical rules. For the sensor, we compared the performances of selecting one of the eight darknet sensors against the use of all eight sensors. Next, the larger the number of bases $\tilde{R}_n$ in FSTD, the better is the low-rank approximation of the original tensor. In addition, $\tilde{R}_n$ should be set to be larger than $R_n$, the number of bases in NTD. In this grid search, we worked within the range of $\tilde{R}_n \in \{25, 49, 81, 121\}$ and $R_n \in \{3, 5, 8\}$. Furthermore, because the initial values of FSTD and NTD are randomly chosen, the calculation results are not unique. Therefore, we need to know how many times the calculation must be repeated on the same data to obtain a stable and sufficient accuracy. In this tuning experiment, we iterated epochs up to 15 times. Finally, for the alert determination threshold th, we worked within a fixed range of {0.05, 0.1, 0.2, · · · , 0.9} and an adaptive method called ''Otsu's thresholding method'' [32], which is a commonly used image thresholding algorithm.
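For readers unfamiliar with Otsu's method, the following sketch computes the threshold that maximizes the between-class variance of a one-dimensional sample. It is a generic histogram-based version and not necessarily how Dark-NTD applies the method to its factor matrices; the function name, the bin count, and the choice to return the upper edge of the selected bin are assumptions made for this example.

```python
import numpy as np


def otsu_threshold(values, bins=256):
    """Return the threshold maximizing the between-class variance of a 1-D sample."""
    values = np.asarray(values, dtype=float)
    hist, edges = np.histogram(values, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2.0
    weights = hist / hist.sum()

    w0 = np.cumsum(weights)                    # class-0 probability up to each bin
    w1 = 1.0 - w0                              # class-1 probability
    cum_mean = np.cumsum(weights * centers)
    total_mean = cum_mean[-1]

    valid = (w0 > 0) & (w1 > 0)                # both classes must be non-empty
    between = np.zeros_like(w0)
    between[valid] = (total_mean * w0[valid] - cum_mean[valid]) ** 2 / (w0[valid] * w1[valid])

    idx = int(np.argmax(between))
    return float(edges[idx + 1])               # upper edge of the last class-0 bin
```

Unlike a fixed value of th, the returned threshold adapts to the distribution of the values it is given, which is the advantage noted in the tuning results.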
Here we describe the results of tuning the above five parameters. It was not practical to tune all five parameters simultaneously, because the number of combinations would be immense. As an evaluation strategy, we divided the parameters into two groups: $\tilde{R}_n$ and $R_n$, which are directly related to NTD, and sensor, epochs, and th, which are not. Because sensor, epochs, and th have a more significant impact on detection performance, we first roughly tuned the detection accuracy with these three parameters and then fine-tuned it with $\tilde{R}_n$, $R_n$. In this experiment, the spatiotemporal feature extraction in Section II-B was performed by generating and using tensors $V_{hp}$ for October 2018 with the observation time unit $T$ set to 1,800 s, the number of samples $M$ set to 30 (i.e., a sampling interval of $T/M$ = 60 s), and the online processing time unit $t$ set to 600 s.

1) TUNING EVALUATION OF SENSOR, EPOCHS, AND TH
We evaluated the tuning of sensor, epochs, and th after fixing the values to $\tilde{R}_n = 25$, $R_n = 5$, which were empirically used in an earlier study [9]. The results of the evaluation are presented in Figs. 2 and 3. The horizontal axis represents the number of epochs, and the vertical axis represents the number of true positives (TPs) and false positives (FPs) of the port numbers. The detection accuracy is excellent when the number of TPs is close to 35 and the number of FPs is low. Fig. 2 shows the results for each th when all eight sensors are utilized. It indicates that Otsu's thresholding method (otsu) yields more TPs with fewer epochs, while keeping the number of FPs relatively low, than any fixed value of th. Next, Fig. 3 shows the results for each sensor when th is determined by Otsu's thresholding method. The results demonstrate that when using only sensor A, the same number of TPs is achieved with fewer epochs, and with fewer FPs, than when using all sensors in combination or any other sensor alone. Based on the above, we conclude that the best solution is to use only A for sensor, 4 for epochs, and Otsu's thresholding method for th.
Next, we provide a brief discussion regarding each parameter. Fig. 4 shows the dispersion of TPs and #FPs for each epoch over 15 iterations. The red graph is the mean number of TPs, the blue graph is the mean number of FPs, and the error bars represent the standard deviations. From these results, we can conclude that the randomness of the initial value selection of FSTD and NTD does not dramatically affect the detection performance because a similar number of TPs and #FPs was recorded each time. In terms of the sensor, Dark-NTD recorded a good number of TPs for sensors with a large observation scale. Finally, Otsu's thresholding method achieves a similar level of accuracy to a fixed value of 0.3 but has the advantage of adaptively determining a threshold value from the data.
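To illustrate how a data-driven threshold can replace a fixed value of th, the following sketch implements Otsu's method on a one-dimensional set of scores by choosing the cut that maximizes the between-class variance. It is a minimal sketch under stated assumptions rather than the implementation used in the experiments: the function name otsu_threshold, the bin count, and the choice to apply it to an array of anomaly scores are hypothetical.

```python
import numpy as np

def otsu_threshold(scores, bins=64):
    """Return the cut that maximizes between-class variance (Otsu's method).

    `scores` is assumed to be a 1-D array of anomaly scores; values above the
    returned threshold would be treated as anomalous.
    """
    scores = np.asarray(scores, dtype=float)
    hist, edges = np.histogram(scores, bins=bins)
    prob = hist / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2.0

    # Candidate split k puts bins 0..k in the low class and the rest in the
    # high class; the last bin is not a valid split point.
    w0 = np.cumsum(prob)[:-1]
    mu_cum = np.cumsum(prob * centers)
    mu_total = mu_cum[-1]
    mu_k = mu_cum[:-1]

    denom = w0 * (1.0 - w0)
    between = np.where(
        denom > 1e-12,
        (mu_total * w0 - mu_k) ** 2 / np.maximum(denom, 1e-12),
        0.0,
    )
    k = int(np.argmax(between))
    # The threshold is the upper edge of the last low-class bin.
    return float(edges[k + 1])
```

Because the threshold is recomputed from each batch of scores, no fixed value has to be chosen in advance, which matches the adaptivity advantage noted above.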

2) TUNING EVALUATION OF R n AND R n
In this section, we set sensor, epochs, and th to the values determined above and then evaluated the tuning of the two NTD-related parameters. The results are shown in Table 2. Contrary to expectations, increasing the number of FSTD bases R n decreased the number of TPs and increased both the number of FPs and the average processing time. This result indicates that even at R n = 25, we can sample enough important information (fibers) for low-rank approximations. Therefore, we determined that R n = 25 is appropriate for the number of FSTD bases. For the other parameter R n , there were no significant differences in the average processing time for any value, and the greatest number of TPs was achieved with R n = 5, so we conclude that R n = 5 is most appropriate.

C. COMPARATIVE EVALUATION RESULTS OF DETECTION PERFORMANCE
In this section, we quantitatively compare and evaluate the detection accuracy of malware activities among the modules. The results are shown in Table 3. ChangeFinder [10] is an existing method that was applied in NICTER before each module of Dark-TRACER was proposed; it is an algorithm for detecting change points in time-series data with a low computational cost. ChangeFinder is based on the sequential discounting autoregressive (SDAR) learning algorithm, which updates an autoregressive model sequentially with each new data point while gradually discounting the influence of past data. We applied ChangeFinder to two types of time-series data: the number of packets and the number of unique source hosts per 10 min. The parameters of each module used in this experiment are described below. Here, λ in Dark-GLASSO is a regularization coefficient for Graphical Lasso. Because of the high computational complexity of Dark-GLASSO, random sampling was conducted when the number of hosts N h exceeded γ to maintain real-time performance. All other parameters are explained in Section II. Next, we explain the notation used in Table 3. The SETs in Dark-NMF differ in whether f is set to 0 or 1. CONV in Dark-NTD is the parameter setting introduced in previous research, and Tuned is the parameter setting determined by the tuning in the previous section. The difference between CONV and Tuned is that Tuned uses the tuned number of epochs and only sensor A. Note that only Tuned in Dark-NTD is restricted to sensor A, whereas the other modules utilize all eight sensors. #FNs is the number of false negatives, where #TPs + #FNs = 35. Recall is an evaluation metric calculated as #TPs / (#TPs + #FNs).
Lastly, we describe the symbol [ ] attached to SET2 and Tuned. The results of SET2 and Tuned show that the number of FPs is very high. The primary cause of these false positives is synchronized scanning by investigative scanners, such as Shodan and Censys [11]. To address this problem, at least temporarily, we attempted to exclude alerts from investigative scanners by applying a simple rule to the alert results of SET2 and Tuned: if a large number of TCP ports, or a set of sequentially numbered TCP ports, was seen concurrently from the same source hosts in the alerts, those alerts are excluded. The entries of SET2 and Tuned marked with the symbol are the results of applying this rule. For Dark-NMF, the rule halved the number of FPs while maintaining the number of TPs, whereas for Dark-NTD it had no such effect.
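The simple exclusion rule can be sketched as follows. The text does not state what counts as "a large number" or "a sequential number" of ports, so the thresholds max_ports and min_run, the function names, and the representation of alerts as (source host, destination port) pairs are all assumptions made for illustration.

```python
from collections import defaultdict

def is_scanner_like(ports, max_ports=100, min_run=10):
    """Hypothetical rule: many distinct ports, or a long run of consecutive ports."""
    ports = sorted(set(ports))
    if len(ports) >= max_ports:
        return True
    run = longest = 1 if ports else 0
    for prev, cur in zip(ports, ports[1:]):
        run = run + 1 if cur == prev + 1 else 1
        longest = max(longest, run)
    return longest >= min_run

def filter_alerts(alerts, max_ports=100, min_run=10):
    """Drop (host, port) pairs whose source host looks like an investigative scanner.

    `alerts` is assumed to be a list of (source_host, dst_port) pairs taken
    from one alert window.
    """
    ports_by_host = defaultdict(set)
    for host, port in alerts:
        ports_by_host[host].add(port)
    scanners = {
        host for host, ports in ports_by_host.items()
        if is_scanner_like(ports, max_ports, min_run)
    }
    return [(host, port) for host, port in alerts if host not in scanners]
```

A host that touches 20 consecutive ports in one window would be dropped by this sketch, while a host that touches only 5379, 6379, and 7379 would be kept.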
According to the comparative evaluation in Table 3, Dark-TRACER achieves a recall rate of 100% by integrating the results of each module, although there are some FPs. Next, we examine the characteristics of the ports that are false negatives in each module. ChangeFinder tends to perform poorly at detecting malware activities with small host sizes and short- or long-term constant malware activities. Dark-GLASSO and Dark-NMF perform poorly when detecting malware activities with small host sizes. Furthermore, SET1 tends to be weak in detecting long-term persistent activities. Dark-NTD tends to be poor at detecting short-term malware activities. Overall, the results show that integrating the three proposed modules allows them to compensate for each other's weaknesses.
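The complementary effect can be illustrated with a toy calculation. The sketch below models integration as a plain union of the ports alerted by each module, which is an assumption, and the port sets are invented for the example rather than taken from Table 3.

```python
def recall(detected, ground_truth):
    """Recall = #TPs / (#TPs + #FNs), computed over sets of port numbers."""
    return len(detected & ground_truth) / len(ground_truth)

# Hypothetical ground truth and per-module alerts (illustrative values only).
ground_truth = {23, 5379, 6379, 7379, 8080}
alerts = {
    "Dark-GLASSO": {23, 8080, 443},
    "Dark-NMF": {23, 5379, 6379, 8080},
    "Dark-NTD": {6379, 7379, 9999},
}

# Integration modeled as the union of all module alerts.
integrated = set().union(*alerts.values())
false_positives = integrated - ground_truth

per_module = {name: recall(ports, ground_truth) for name, ports in alerts.items()}
integrated_recall = recall(integrated, ground_truth)
```

In this toy case no single module reaches full recall, yet the union does, at the price of also accumulating every module's false positives.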

IV. FEASIBILITY ASSESSMENT OF EARLY DETECTION
In this section, we assess the feasibility of the early detection of malware activities. This assessment serves as a validation on time-series data: after learning the optimal parameters with past data in Section III, we verified them with future data in this section.

A. DETAILS OF THE DATASET AND EXPERIMENTAL SETUP
In this experiment, we used data from three sensors A (/17 subnet), B (/18 subnet), and D (/20 subnet), selected by observation scale, instead of all eight sensors. The data period used in the experiment was 17 months, from June 2019 to October 2020. To highlight the observation of unknown malware activity, we excluded known and routinely observed TCP ports by calculating them for each month at each sensor as a preprocessing step.
Next, we describe the details of the ground truth used in the evaluation. In the experiment described in Section III, the ground truth included many stationary threats for which the infection spread period of the malware activity was unclear; thus, it was not easy to assess early detection. In this experiment, we manually generated a new ground truth for malware activities observed from June 2019 to October 2020, which represented a set of threats with a clearly identifiable infection spread period. The newly prepared ground truth was based on reports and blog posts published by NICTER's expert operators (https://blog.nicter.jp/). Among the malware activities observed by NICTER, we selected those whose origin and characteristics were clear and for which there were references by third parties. As a result, we collected 12 types of threat events on 33 TCP ports. The breakdown of the ground truth is shown in Table 4. The following information was accurately recorded:
• the initial period when NICTER began to observe a rapid increase in the number of packets and hosts on TCP ports related to threats
• the change in the scale of the number of hosts at that time
• the period in which threats were revealed to the public due to references issued by reputable third-party security research organizations (i.e., reveal date)
• characteristics of the threats
Third-party references included recurring activities, such as BlueKeep, ShenZhen TVT, and MikroTik, which are attacks on previously known vulnerabilities.
This ground truth considers not only the type of threat but also its variations, such as the observed infected host size and the persistence/stationarity of threats. For clarity, hourly time-series graphs of the number of unique source hosts observed by NICTER are shown in Figures 5, 6, 7, and 8 for each TCP port. The solid vertical line labeled ''D'' represents the earliest period detected by Dark-TRACER, the dashed line labeled ''N'' represents the period observed by NICTER operators, and the dashed line labeled ''P'' represents the period when the threat was revealed to the public by reputable third-party security research organizations. Figures 5 and 6 are time-series graphs summarizing some of the TCP ports where Moobot-related threats were observed. In Fig. 5, port groups with synchronized fluctuations in the number of unique hosts can be identified, one group on September 19 and one group on September 21, indicating that large-scale Moobot activity was observed during this period. Figure 6 shows events where Moobot features were observed but did not show synchronized fluctuations with the ports in Fig. 5. These ports are related not only to Moobot, but also to the activities of other Mirai variants such as Fbot and Estella. As shown above, Moobot's malware activity is an orchestrated threat that combines multiple activities. Next, Figs. 7 and 8 show time-series graphs of some of the TCP ports among threats other than Moobot. Among these TCP ports, we found several threats that were difficult to detect by conventional methods or by human efforts, such as threats with no spikes, constant threats, and small-scale threats.

B. ASSESSMENT RESULTS
We now describe the processing results for each module with its best parameters, which were the same as in the experimental setup described in Section III-C. We also applied the aforementioned simple rule to exclude alerts caused by investigative scanners. Here, Dark-NMF was computed with the parameter SET1. Table 5 shows the number of ports that were detected early, detected late, or missed (false negatives), and the average number of days, relative to the initial period of NICTER observations. The results show that although each module on its own had a few overlooked ports (#FNs) and late-detected ports, all 33 TCP ports were detected at an early stage when the modules were integrated into Dark-TRACER. In addition, Dark-TRACER was able to detect threats on average 126.4 days earlier than the initial period when threats were first observed by NICTER, and 153.6 days earlier than the period when threats were announced to the public.
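The lead-time figures reported here are differences between dates, averaged over ports. The following sketch shows one way such a summary could be computed; the event records, port labels, and dates are hypothetical, and treating a lead of zero days or less as "late" is an assumption.

```python
from datetime import date
from statistics import mean

def lead_days(detected, reference):
    """Days by which detection precedes the reference date (negative if late)."""
    return (reference - detected).days

def summarize(events):
    """Summarize early/late counts and mean lead times for a list of events."""
    vs_nicter = [lead_days(e["detected"], e["nicter"]) for e in events]
    vs_public = [lead_days(e["detected"], e["public"]) for e in events]
    return {
        "early": sum(d > 0 for d in vs_nicter),
        "late": sum(d <= 0 for d in vs_nicter),
        "mean_vs_nicter": mean(vs_nicter),
        "mean_vs_public": mean(vs_public),
    }

# Hypothetical events: first alert date, first NICTER observation, public reveal.
events = [
    {"port": 10001, "detected": date(2019, 6, 1),
     "nicter": date(2019, 9, 19), "public": date(2019, 10, 1)},
    {"port": 10002, "detected": date(2020, 1, 10),
     "nicter": date(2020, 1, 5), "public": date(2020, 2, 9)},
]
summary = summarize(events)
```

The second hypothetical event is detected after the NICTER observation but still before the public reveal, which is why the two averages are reported separately.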
We also investigated how many ports were alerted for each module in this experiment. Table 6 shows the average number of unique ports per period for each module. For the entire 17-month period, Dark-GLASSO, Dark-NMF, and Dark-NTD produced alerts for 66, 2,042, and 3,969 unique ports, respectively. When the proposed modules were integrated into Dark-TRACER, the number of unique ports was 5,271. We counted the number of unique ports for each day, week, and month, and the averages are shown in Table 6. For example, Dark-TRACER issued alerts for an average of 58.49 ports per day. Assuming that one analyst requires 15 min of analysis time per port (refer to Section V-E), two analysts could perform these daily operations in approximately 7.3 h (roughly 14.6 h for a single analyst). It would require approximately 31.5 h for a week and 89.8 h for a month with two analysts.
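The workload estimate follows directly from the alert volume and the assumed 15 min per port. A small sketch of the arithmetic is given below; the function names are hypothetical, and the even split of work between analysts is an assumption.

```python
def analysis_hours(ports, minutes_per_port=15, analysts=2):
    """Wall-clock hours needed when the ports are split evenly among analysts."""
    return ports * minutes_per_port / 60 / analysts

def ports_for_hours(hours, minutes_per_port=15, analysts=2):
    """Inverse: number of ports that fit into the given wall-clock hours."""
    return hours * analysts * 60 / minutes_per_port

daily_two_analysts = analysis_hours(58.49)            # about 7.3 h
daily_one_analyst = analysis_hours(58.49, analysts=1)  # about 14.6 h
weekly_ports = ports_for_hours(31.5)                   # ports implied by 31.5 h
```

The inverse function makes it easy to see what alert volume a fixed staffing budget can absorb, which is the quantity a false-positive reduction would directly lower.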
From the above two experiments, we found that Dark-TRACER could tune the parameters of each module so that the number of FNs was almost non-existent and could also detect malware activities at a fairly early stage. As a future challenge, the cost of analysis would be lower if the number of FPs could be reduced more precisely. In addition, there is a possibility that expert analysis would disclose other unknown activities, in addition to the malware activities that were selected for the ground truth.

V. DISCUSSION
In this section, we provide a comprehensive discussion and insight into the performance of our framework. First, we demonstrate the advantages of Dark-TRACER and provide a comprehensive comparison of each proposed module. Then, we discuss the potential concerns of our approach, such as adversarial attacks and the reduction of false-positive alerts. Finally, we present guidelines for the practical application of Dark-TRACER.

A. ADVANTAGES OF DARK-TRACER
As mentioned in the introduction, by focusing on the synchronization of spatiotemporal patterns in darknet traffic, we have the following advantages.

1) TRIMMING UNSYNCHRONIZED AND NOISY COMMUNICATIONS
Distinguishing between non-attack-related and attack-related communications from darknet traffic is a difficult task. Misconfigured or unexplained communications are nuisances that interfere with the interpretation of darknet traffic analysis. In this paper, we focused on the fact that hosts infected with similar malware tend to compromise and scan in a synchronized spatiotemporal pattern. By estimating the synchronicity of spatiotemporal patterns in the darknet traffic and eliminating communications that do not show synchronization from the scope of analysis, noisy communications are expected to be scraped off, and malicious communications can be highlighted.
FIGURE 9. A 3D graph visualizing a case of anomalous synchronization of the spatiotemporal patterns detected from the experimental results in Section III. A scatter plot of partial V hp during 18:30-19:00 on October 31, 2018, is visualized. Each of the three axes is a time axis in minutes, a source-host spatial axis, and a destination-port spatial axis, and the plots represent the observed packets (element values in V hp ). For the destination-port spatial axis, there are only three points at which anomalous synchronicity was detected: 5379, 6379, and 7379. Host IPs are plotted in red if they match on multiple ports within one-minute increments and in green if they do not. The red points are considered to be synchronized communications caused by factors such as malware activities, while the green points are considered to be noise communications.
To aid understanding, a visualization of the synchronization of the spatiotemporal patterns is shown in Fig. 9. It visualizes V hp for an alert of malware activity detected by Dark-NMF at sensor A during 18:30-19:00 on October 31, 2018. The number of packets is plotted in three dimensions (time, source host, and destination port) in one-minute increments. In Fig. 9, each red dot indicates a communication from a host that reached more than one of the destination ports (5379, 6379, 7379/TCP) within the same one-minute increment. As indicated in Table 1, we observed a scanning attack on the same service, Redis, at these ports. Thus, many red dots appear when the spatiotemporal pattern has anomalous synchronization. In contrast, the green dots can be regarded as noisy communication. Dark-TRACER is assumed to detect anomalies by highlighting the red dots (e.g., malware activity) while eliminating the green dots (noise communication). Between arbitrary destination ports of ordinary darknet traffic, red dots (synchronization between spatiotemporal features) do not appear as abundantly as in Fig. 9.
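The red/green labeling used in Fig. 9 can be reproduced with a few lines of code. The sketch below assumes packets are available as (minute, source host, destination port) tuples and labels a packet red when its host reached at least two of the ports of interest within the same minute; the function name and the data layout are hypothetical.

```python
from collections import defaultdict

def label_packets(packets, ports):
    """Label packets on the ports of interest as 'red' (synchronized) or 'green'.

    `packets` is assumed to be a list of (minute, host, port) tuples. A packet
    is red if its host hit two or more of `ports` within the same minute.
    """
    ports = set(ports)
    hit = defaultdict(set)  # (minute, host) -> ports of interest seen
    for minute, host, port in packets:
        if port in ports:
            hit[(minute, host)].add(port)
    return [
        (minute, host, port, "red" if len(hit[(minute, host)]) >= 2 else "green")
        for minute, host, port in packets
        if port in ports
    ]

packets = [
    (0, "h1", 5379), (0, "h1", 6379),  # same host, same minute, two ports
    (0, "h2", 5379), (1, "h2", 7379),  # same host, different minutes
    (1, "h3", 80),                     # port outside the set of interest
]
labels = label_packets(packets, {5379, 6379, 7379})
```

Only the two packets from h1 come out red here, because h2 touches its two ports in different one-minute slots.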

2) DETECTING MALWARE ACTIVITIES THAT ARE CONVENTIONALLY DIFFICULT TO DETECT
Traditionally, malware activities have been detected based on changes in time-series data, such as the number of packets and the number of hosts, using change-point detection algorithms such as ChangeFinder, or manually by operators. Until recently, many malware activities were relatively easy for operators to detect, as they were threats that caused severe and obvious changes in time-series data, threats with clear spikes, or threats that were simple and had a large scale of infection. However, in recent years, the amount of communication observed in the darknet has increased exponentially and cyberattacks have become more diverse and sophisticated, as exemplified by the Moobot described in Section IV-A. Orchestrated threats that intertwine multiple activities, as well as small-scale threats, threats without explicit spikes, and constant threats, are malware activities that are difficult to detect manually. However, there is a significant possibility that Dark-TRACER can detect such traditionally hard-to-detect malware activities. Its feasibility is well demonstrated by the evaluation results of early detection in Section IV, wherein various types of malware activities were detected.

3) EARLY DETECTION OF MALWARE ACTIVITIES IN REAL-TIME
Dark-TRACER is not only capable of detecting traditionally hard-to-detect threats, but also of detecting them early and in real-time. Even when the scale of malware activity is small, if there is overlap in the spatial features (e.g., the distribution of hosts and ports) with another malware activity that has already been detected, and if there is synchronicity in the pattern of the number of packets, there is a high probability that they will be detected together. This implies that Dark-TRACER can capture the signs of infection before it spreads in earnest. In this way, by checking the overlapping degree of host spatial feature variables between alerts from the same period, it is possible to identify threats that at first glance appear to be different events, but are actually caused by the same malware. In fact, as shown in Figs. 5 and 6, Dark-TRACER can detect orchestrated threats such as Moobot, in which multiple activities are intertwined, at an early stage by detecting signs of infection as they spread, even at a small scale.
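Checking the degree of overlap between the host sets of alerts from the same period can be sketched as follows. The text does not name a specific overlap measure, so the Jaccard index, the min_overlap threshold, and the host identifiers are assumptions chosen for illustration.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index of two host sets (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def related_alerts(alert_hosts, min_overlap=0.3):
    """Return port pairs whose alerting host sets overlap by at least min_overlap.

    `alert_hosts` is assumed to map a destination port to the set of source
    hosts flagged for that port in the same period.
    """
    return [
        (p, q)
        for p, q in combinations(sorted(alert_hosts), 2)
        if jaccard(alert_hosts[p], alert_hosts[q]) >= min_overlap
    ]

alert_hosts = {
    5379: {"a", "b", "c", "d"},
    6379: {"c", "d", "e"},
    8080: {"x", "y"},
}
pairs = related_alerts(alert_hosts)
```

Ports whose alerts share many source hosts are candidates for being driven by the same malware, even if their time series look unrelated at first glance.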

B. COMPREHENSIVE COMPARISON OF PROPOSED MODULES
In this section, we comprehensively compare the proposed modules Dark-GLASSO, Dark-NMF, and Dark-NTD in terms of accuracy, cost, anomaly detection method, and spatial features. An overview is given in Table 7, and detailed explanations are provided in order from the top of the table.

1) ACCURACY
First, we discuss the accuracy aspect. In general, there was a trade-off between the number of FPs and FNs. Dark-GLASSO had almost no FPs, whereas Dark-NMF and Dark-NTD had almost no FNs. As for the performance of early detection, Table 5 shows that Dark-GLASSO tended to make detections slightly later, but the other modules almost always detected threats early. Next, as mentioned in Section III-C, we examined the characteristics of the ports that were false negatives in each module and considered the types of threats that each module overlooked. Dark-GLASSO tended to miss small-scale threats, Dark-NMF tended to miss small-scale and constant threats, and Dark-NTD tended to be weak at detecting short-term threats. The same tendency was confirmed by the experimental results described in Section IV. Because the modules are complementary to each other, it is recommended to use them in an integrated manner, rather than using only one in isolation. Table 5 shows that by integrating the modules into Dark-TRACER, we can avoid problems of missed or delayed detection. Finally, the accuracies of Dark-GLASSO and Dark-NTD were low unless the observation scale of the darknet sensor was large. In contrast, Dark-NMF recorded the same level of accuracy for all eight sensors of different sizes used in Section III.

2) COST
Dark-NMF is very computationally inexpensive and does not require any particular preprocessing. In contrast, Dark-GLASSO and Dark-NTD are computationally expensive and require specific preprocessing. Given the spatial feature sizes $N_h$ and $N_p$, graphical lasso has a cubic time complexity of $O(N_h^3)$, NTD has a quadratic time complexity of $O(N_h \cdot N_p)$, and NMF has a linear time complexity of $O(N_h)$ or $O(N_p)$.
In this study, Dark-TRACER was processed online sequentially at 10-min intervals. Therefore, we adjusted the parameters and applied preprocessing in each module so that processing finished within 10 min for one data slot. Dark-NMF did not require any preprocessing. Next, as described in Section III-C, Dark-GLASSO performs random sampling as preprocessing when the number of hosts N h exceeds γ . In a previous paper [7], we reported that the output alerts were quite unstable when γ was lower than the expected average value of N h . Dark-NTD, as described in Section II-C, applies FSTD [20] to preprocess the tensor V to a low-rank approximation and preliminarily samples only the essential parts. The larger the number of bases R n in FSTD, the better the low-rank approximation of the original tensor. However, as shown in Table 2, the results of tuning the number of FSTD bases demonstrated that increasing its value only worsened the accuracy. In all experiments, the processing time of each module was no longer than 10 min (CPU: AMD RYZEN TR 2990WX). For one data slot, Dark-NMF took approximately 1 min, Dark-NTD approximately 5 min, and Dark-GLASSO approximately 7 min.
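The random sampling step used to keep Dark-GLASSO within its time budget can be sketched as below. The text only states that random sampling is applied when N h exceeds γ; the uniform sampling without replacement, the function name, and the seed argument are assumptions.

```python
import random

def sample_hosts(hosts, gamma, seed=None):
    """Cap the host set at gamma entries by uniform sampling without replacement.

    Host sets no larger than gamma are returned unchanged, so the cap only
    takes effect in busy time slots.
    """
    hosts = list(hosts)
    if len(hosts) <= gamma:
        return hosts
    return random.Random(seed).sample(hosts, gamma)

capped = sample_hosts(range(1000), gamma=100, seed=0)
untouched = sample_hosts(range(50), gamma=100)
```

Since the retained hosts change from slot to slot, a cap that is small relative to the typical N h discards most hosts each time, which is consistent with the instability reported for low γ.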
An important factor in the cost of alert analysis is the number of ports that must be analyzed per unit period. As shown in Table 6, Dark-GLASSO has the lowest cost. For Dark-GLASSO, only 66 ports needed to be analyzed during the entire experiment in Section IV, whereas the other modules produced roughly 30 to 60 times as many ports to analyze.

3) ANOMALY DETECTION
Next, we discuss methodological differences in anomaly detection. Dark-NMF and Dark-NTD decompose spatiotemporal features into latent frequent patterns and then perform anomaly detection for each group of decomposed spatial features. This decomposition can be regarded as a favorable condition for detecting local events, but it is also one of the reasons why the number of detected anomalous events (alerts) becomes very large, although it should be noted that the number of alerts can be adjusted by tuning the parameters. In contrast, Dark-GLASSO detects anomalies from all spatial features without decomposing the spatiotemporal features. This is a favorable condition for detecting global events and is one of the reasons that the number of anomalous events (alerts) detected is small.
In addition, Dark-GLASSO requires the past K periods of data to perform anomaly detection. Each time we change the value of the parameters or introduce a new sensor, Dark-GLASSO has to wait for K periods to obtain the detection results. The other modules, however, do not require waiting in such cases because they do not require past data.

4) SPATIAL FEATURE
Finally, we discuss differences in the application of spatial features. In Dark-TRACER, two types of spatial features can be handled: host space and port space. For darknet traffic in a short unit time, the size of the port space N p tends to be larger than the size of the host space N h . In Dark-GLASSO, the port spatiotemporal feature matrix V p is not employed because it becomes computationally intractable when the size of the spatial features becomes large. Dark-NMF can handle not only the host spatiotemporal feature matrix V h but also the port spatiotemporal feature matrix V p due to its low computational complexity. Finally, because Dark-NTD is designed to handle a three-dimensional spatiotemporal feature tensor V hp from the beginning of the proposal, it can calculate the host/port space simultaneously.

C. CONSIDERATIONS FOR ADVERSARIAL ATTACKS
This section discusses adversarial attacks that an attacker might implement to evade detection by Dark-TRACER. Possible attempts include adding noise, distributing spatial features, and reducing the frequency of temporal features, any of which might prevent the framework from capturing spatiotemporal features when the malware performs scanning compromises.
• The case wherein dummy scans, which are unrelated to an attack, are attached to a true attack to confuse the detection framework.
In this case, more data will be observed, and synchronization of the spatiotemporal features will be captured more strongly, resulting in better detection of true attacks. However, some of the detected events may contain dummy scan information, which may be troublesome for analysis.
• The case wherein multiple true attacks from many attack groups are distributed and executed simultaneously.
The advantage of Dark-NMF and Dark-NTD is that they can be decomposed into several patterns with similar spatiotemporal features. Therefore, Dark-TRACER can detect anomalies by dividing potential attack groups into several groups, even when orchestrated attacks are conducted simultaneously.
• The case of a slow stealth scan attack.
Depending on the degree of stealthiness, if a stealth scan attack is too slow, the synchronization of the observed spatiotemporal features becomes weak, making the attack difficult to detect. However, slow stealth scans are not efficient for an attacker who wants to spread the damage of their malware as quickly as possible, because the speed of spreading the malware infection is significantly slower. Slow stealth scans are generally considered to have purposes other than the spread of the malware infection directly, and thus are not the target of detection in this study.

D. REDUCTION OF FALSE-POSITIVE ALERTS
As mentioned in Section III-C, the primary cause of false-positive alerts (#FPs) issued by Dark-TRACER is synchronized scanning by organizations for investigative purposes. Organizations such as Shodan and Censys, which deploy various cyberspace intelligence information as search engines, regularly scan the entire Internet space at a high frequency. Because such Internet-wide scans are fast and large-scale, they are observed in our darknet sensor networks and are represented as synchronized spatiotemporal patterns. Consequently, Dark-TRACER achieved a low number of FNs, whereas the number of FPs from investigative scanners is very high. We believe that if Dark-TRACER can distinguish between alerts caused by investigative scanners and alerts caused by malware activities in a secondary manner, after detecting anomalous synchronous spatiotemporal features without missing them, the inefficient situation wherein there are many FPs can be significantly improved. In this study, to temporarily solve this challenge, we applied a simple rule that excluded alerts when a large number, or a sequential number, of TCP ports were seen simultaneously from the same source hosts in the alerts. In the first experiment described in Section III, we found that Dark-NMF halved the number of FPs while maintaining a high number of TPs, whereas Dark-NTD did not. In the second experiment in Section IV, by applying our simple rule, we were able to reduce the number of unique ports from 64,103 to 5,271 for the entire period alerted by Dark-TRACER. In this way, we have demonstrated the feasibility of significantly improving malware detection by secondarily examining alerts. In future work, we would like to develop a model for classifying or clustering scanners for investigative purposes and automatically create a blacklist so that we can eliminate alerts caused by investigative scanners with better accuracy than the simple rule used in the present study.

E. TOWARD THE PRACTICAL OPERATION OF DARK-TRACER
Each of the three independent proposed modules has its own strengths and weaknesses, and they complement each other through their collaboration into a single framework Dark-TRACER. From the two experiments presented in this paper, it was found that Dark-TRACER can achieve a 100% recall rate in the detection accuracy of malware activities and can also accomplish early detection. In this section, we discuss how Dark-TRACER can be operated in a practical manner.
First, we consider each module separately. Dark-GLASSO incurs some FNs, but because it produces few FPs, its precision rate #TPs / (#TPs+#FPs) is high. When it is not possible to spend much time on the analysis of the detection results, or when the analysis of global malware activities is sufficient, it is practical to employ only the detection results of Dark-GLASSO. Next, Dark-NMF and Dark-NTD, which use nonnegative tensor decomposition methods, are beneficial for detecting local malware activities because, although they produce many FPs, they incur almost no FNs and detect threats at an early stage. In particular, Dark-NMF is effective in detecting anomalous synchronization because it does not require preprocessing, has a very low computational cost, and shows good detection accuracy, even for darknet sensors with small observation scales. In contrast, Dark-NTD has a very high potential for detecting small-scale threats that are typically considered difficult to detect with other modules and is useful for capturing fairly localized events. As described above, each module has different characteristics and can be used on its own according to the situation, or all modules can be leveraged together in an integrated framework, as in Dark-TRACER, taking advantage of their complementary relationship.
Finally, we discuss a secondary analysis method for the detection results of Dark-TRACER. The alerts issued by Dark-TRACER contain information on IP addresses, targeted ports, and the timestamps of the hosts that are identified as abnormal. However, this information alone is often not enough to accurately determine malware activity. As mentioned in Section III-A, some malware activities, such as Mirai and Hajime, are known to have fingerprints in their initial scan packets. In large-scale scans, such as those performed by malware and by scanners operating for investigative purposes, packet headers are often designed to carry fingerprints in order to scan faster [44], [45]. Previous research has also reported that scanners use fingerprints to distinguish their scan results from backscatters [46].
The question arises as to what specific information should be checked. The following steps are considered useful for secondary analysis of Dark-TRACER alerts:
1) Computing the statistics of packet headers of detected alerts and finding characteristic header information (including known fingerprints such as those of Mirai and Hajime).
2) Checking whether honeypots in an interactive observation network have observed any communication related to the detected alerts, and if so, analyzing what type of communication occurred interactively.
3) Collating and analyzing the presence of information related to the detected alert in third-party threat intelligence information (e.g., CVEs, vulnerabilities, and reports).
This is the actual workflow of the security operations center at NICTER. NICTER operations experts are expected to complete the collation process described above in approximately 15 min per port of an alert. However, this does not necessarily imply that the causes and details of all events can be clarified. To increase the number of events that can be clarified as much as possible, it is necessary to collate more abundant information. In the future, we intend to extend Dark-TRACER by considering a wide range of applications, such as mechanisms to reduce false positives, improve both recall and precision, and automatically associate threat intelligence from third parties [47].
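Step 1 of this workflow, computing header statistics and checking for known fingerprints, can be sketched as follows. The Mirai check relies on the publicly documented behavior that Mirai's scanner sets the TCP sequence number equal to the destination IPv4 address; the packet record layout (keys dst_ip, seq, window) and the function names are hypothetical, and Hajime's fingerprint is not modeled here.

```python
import ipaddress
from collections import Counter

def has_mirai_fingerprint(dst_ip, tcp_seq):
    """True if the TCP sequence number equals the destination IPv4 address."""
    return int(ipaddress.IPv4Address(dst_ip)) == tcp_seq

def header_stats(packets):
    """Summarize header fields for the packets behind one alert.

    `packets` is assumed to be an iterable of dicts with the hypothetical
    keys 'dst_ip', 'seq', and 'window'.
    """
    packets = list(packets)
    mirai = sum(has_mirai_fingerprint(p["dst_ip"], p["seq"]) for p in packets)
    windows = Counter(p["window"] for p in packets)
    return {
        "packets": len(packets),
        "mirai_ratio": mirai / len(packets) if packets else 0.0,
        "top_windows": windows.most_common(3),
    }

packets = [
    {"dst_ip": "192.0.2.1", "seq": 3221225985, "window": 14600},
    {"dst_ip": "192.0.2.2", "seq": 12345, "window": 14600},
]
stats = header_stats(packets)
```

A high fingerprint ratio or a dominant header value for an alerted port gives the analyst a concrete starting point for steps 2 and 3.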

VI. RELATED WORK
We describe related work on how darknets are leveraged in measurement analysis and malware activity detection. We also provide related studies and directions for identifying investigative scanners on darknet traffic, which will inevitably become necessary in the future.

A. DARKNET MEASUREMENT ANALYSIS
The darknet has attracted extensive attention in the field of network security, and many researchers are actively engaged in research on its development, analysis, and visualization [55]. Previous research [2], [56]- [58] has discussed the fundamentals of various darknet configurations, deployment techniques, and sensor placement techniques, and clarified the effectiveness of darknets. In addition, profiling, filtering, and classification have been intensively studied for the measurement of darknets. In the rest of this subsection, we present related work on IoT malware analysis and general darknet measurement analysis. A summary of the related studies is shown in Table 8. Fig. 10 shows that the amount of observed traffic in NICTER's darknet observation network with a total of 300,000 IP addresses has increased rapidly over the past few years. The main reason for this is the IoT malware ''Mirai'', which appeared in 2016 [22]. In contrast to conventional botnets, IoT malware scans multiple ports in parallel to form a large-scale botnet that can spread the infection faster [48]. Moreover, IoT malware variants behave competitively with each other and are repeatedly destroyed and reinfected over a short period [23]. The emergence of such diverse and sophisticated IoT malware further complicates cyber threats and makes it difficult to examine the actual current state of malware strategies. Therefore, it is essential to have a mechanism to investigate IoT botnets while they are still persistent and to rapidly and precisely detect potential threats.
Apart from malware activity detection, which is discussed in the next subsection, the following studies were conducted in other areas of darknet measurement analysis. Dainotti et al. contributed to a census-like analysis of how the IP address space is used by developing and evaluating methods to remove spoofed traffic from darknets and live networks [49]. Durumeric et al. analyzed a large-scale darknet to investigate Internet-wide scanning activities and identify patterns of extensive horizontal scanning operations [50]. Fachkha et al. devised an inference and characterization module to identify and analyze the probing activities of cyberphysical systems (CPS) by extracting various features from large amounts of darknet data and performing correlational analyses [51]. Jonker et al. introduced a framework to characterize DoS attacks based on various data sources, including darknet traffic data [52]. They found that one-third of all /24 networks on the Internet had suffered at least one DoS attack in the past two years. Shaikh et al. identified unsolicited IoT devices by collecting IP header information from darknet traffic data and classifying them using several machine learning algorithms [53]. Akiyoshi et al. proposed a method to detect emerging scanning activities and their scale by analyzing the correlation between traffic in honeypots and darknets [54]. Most measurement analysis studies using darknets aim to understand the general trend of malicious communications observed in darknets. Thus, for detailed analysis, many studies use not only darknet data but also trap-based monitoring systems such as honeypots.

B. MALWARE ACTIVITY DETECTION ON DARKNETS
A summary of the related works referred to in this subsection is provided in Table 9. According to a survey paper on darknets [55], the technique of filtering misconfigured traffic has not yet been fully explored and is an ongoing challenge that deserves more attention from the research community. We consider that our method can filter out misconfigured traffic by detecting anomalies in the synchronization of spatiotemporal patterns. Furthermore, our method is unique in that it can detect global cyber threats/malware activities in real time in a uniform format by focusing on the synchronization of anomalous spatiotemporal patterns among many indiscriminate suspicious scans that reach large-scale darknets.
Here, we present some prior research that had a similar scope to our problem and used darknet traffic but did not focus on synchronization. There are several methods to detect anomalies by detecting change points in darknet traffic, such as ChangeFinder, which was introduced as a comparison method in a previous study [10], [59]-[61]. Ahmed et al. proposed a sliding window-based adaptive cumulative sum (CUSUM) algorithm, which is a sequential analysis method for detecting drastic changes in darknet traffic [59]. Inoue et al. [60] employed the ChangeFinder algorithm [10] to detect sudden change points in darknet traffic with a low computational cost. Ban et al. proposed an abrupt-change detection algorithm that can detect botnet probe campaigns with a high detection rate by searching for temporal coincidences in botnet activities observed on the darknet [61]. The aforementioned change detection methods all share the same drawback: they cannot achieve high accuracy without focusing on specific protocol ports because they detect change points without distinguishing between many sources of noisy communications, such as misconfigured traffic. As shown in the experimental results of ChangeFinder herein, the accuracy was low when the algorithm was applied to the entire traffic without focusing on a specific port. In addition, applying the change detection method to each specific port would result in many alerts, which would require considerable computational and analysis costs.
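To make the notion of change-point detection concrete, the following is a minimal sketch of a one-sided CUSUM detector with a sliding-window baseline. It is a generic textbook-style variant written for illustration and is not the adaptive algorithm of [59] or ChangeFinder [10]; the function name and the parameters `window`, `drift`, and `threshold` are assumptions.

```python
from collections import deque
from statistics import mean, pstdev


def sliding_cusum(counts, window=5, drift=0.5, threshold=4.0):
    """Return the indices at which an upward change is flagged.

    counts    : per-interval packet counts (e.g., for one destination port)
    window    : number of past intervals used as the baseline
    drift     : slack subtracted from each standardized deviation
    threshold : alarm level for the cumulative sum
    """
    history = deque(maxlen=window)
    cumulative = 0.0
    alarms = []
    for index, value in enumerate(counts):
        if len(history) == window:
            baseline = mean(history)
            spread = pstdev(history) or 1.0  # avoid division by zero
            deviation = (value - baseline) / spread
            cumulative = max(0.0, cumulative + deviation - drift)
            if cumulative > threshold:
                alarms.append(index)
                cumulative = 0.0
                history.clear()  # re-learn the baseline after a change
        history.append(value)
    return alarms


# Hypothetical hourly packet counts with a jump at index 8.
series = [10, 11, 9, 10, 10, 10, 11, 9, 50, 52, 51, 49, 50]
detected = sliding_cusum(series)
```

Because the detector only sees a volume series, a burst of misconfigured traffic raises the same alarm as a malware campaign, which is the drawback noted above.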
Next, we review recent related works on the analysis of malware activity using darknet data (mainly clustering). As mentioned in the previous subsection on darknet measurement analysis, the volume of traffic reaching the darknet has increased considerably since 2016 owing to IoT malware. Therefore, many clustering methods targeting the analysis of IoT malware activities have been studied in recent years. Pour et al. trained convolutional neural networks to classify compromised IoT devices and non-IoT devices [62]. Using the classification training results, they extracted features embedded in unsolicited scan flows and deployed hierarchical agglomerative clustering. As a result, the authors uncovered 440,000 compromised IoT devices and 350 IoT botnets. Torabi et al. leveraged the Shodan IoT search engine and darknet traffic data to infer compromised IoT devices and to characterize and cluster the generated scanning campaigns [63]. The authors discovered newly targeted ports and emerging IoT malware/botnets, highlighting their persistence and evolutionary process. Cohen et al. proposed a method for sequential clustering of aggressive hosts having similar intentions from scalable darknet traffic by embedding a port sequence for each source host [24]. By tracking the clusters, they detected recurrent or new attacks and found several new campaigns of malicious port sequences.
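As a rough illustration of how source hosts can be grouped by the ports they target, the sketch below performs single-linkage agglomerative clustering with the Jaccard distance between port sets. It is a simplified stand-in and does not reproduce the feature extraction of [62] or the port-sequence embedding of [24]; the function names, the `max_distance` parameter, and the sample hosts are hypothetical.

```python
from itertools import combinations


def jaccard_distance(ports_a, ports_b):
    """1 - |A ∩ B| / |A ∪ B| for two sets of destination ports."""
    union = ports_a | ports_b
    if not union:
        return 0.0
    return 1.0 - len(ports_a & ports_b) / len(union)


def agglomerate(host_ports, max_distance=0.5):
    """Merge the two closest clusters until the closest pair is too far apart."""
    clusters = [[host] for host in sorted(host_ports)]

    def linkage(first, second):
        # Single linkage: distance between the closest pair of members.
        return min(
            jaccard_distance(host_ports[a], host_ports[b])
            for a in first for b in second
        )

    while len(clusters) > 1:
        (i, j), distance = min(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if distance > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i stays valid
    return clusters


# Hypothetical source hosts and the destination ports each one probed.
observed = {
    "a": {23, 2323},
    "b": {23, 2323, 80},
    "c": {445},
    "d": {445, 139},
}
groups = agglomerate(observed)
```

The output is only a partition of hosts. Nothing in it indicates which group is anomalous, so an additional criterion is still needed to decide what to analyze first.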
Such clustering methods are considered to be able to classify malware activities, investigative scans, and noisy communications (e.g., misconfigured traffic) to some extent. However, although the clustering methods can distinguish distinctive hosts or packets as a group, they cannot detect anomalies and thus cannot identify which clusters are anomalous or distinctive. Therefore, it is necessary to prioritize the clusters to be analyzed based on some criteria. In contrast, as Dark-TRACER issues alerts in real time upon anomaly detection, it is easy to know what to analyze first, thereby making the framework highly practical. In addition, clustering methods are not suitable for the early detection of unknown or small-scale malware activities because clusters are not formed unless the amount of observed features of malware activities is large. As shown in the experimental results herein, Dark-TRACER has good potential to detect small-scale malware activities in an early stage.
BotSniffer [1] and BotMiner [64] are frameworks for detecting C&C traffic and malicious activities based on spatiotemporal correlation. However, the scope of their problem setting differs from ours, because BotSniffer and BotMiner only target specific protocols of actual network traffic and therefore cannot detect global cyber threats and malware activities in the entire Internet space. To the best of our knowledge, there is no related work that focuses on synchronization in the same scope as the present study. As described in Table 9, prior papers [6] and [7] were published as prototypes of Dark-GLASSO, prior paper [8] was published as a prototype of Dark-NMF, and, finally, prior paper [9] was published as a prototype of Dark-NTD. However, as the previous methods have been considered independently, their relationship has remained unclear. In this study, we integrated and evaluated the three previous methods as Dark-TRACER and clarified that they complement each other. In addition, although the previous methods can detect malware in real time, the feasibility of early detection of malware activity was not evaluated; therefore, in this study, we evaluated the early detection performance and the analysis cost.
We also present several other related works that utilized darknet traffic to identify and detect malicious communications, based on the reports of a recent survey paper [65]. Kumar et al. proposed a model that learns from darknet data and benign traffic data to test whether it can classify malicious communications [66]. However, the model only classifies whether the traffic flow is malicious or benign, and because it learns all at once, it does not specifically identify what kind of maliciousness it has classified. Bou-Harb et al. investigated orchestrated probing campaigns by considering a clustering method for time-series traffic data [67]. However, this method does not detect anomalies and does not distinguish noisy communications. Ali et al. classified DDoS attacks using Resource Allocating Network with Locality Sensitive Hashing (RAN-LSH), which employs LSH to select data for training and achieves fast online learning by training only on the selected data [68]. However, because this method only analyzes backscatter traffic and aims to classify DDoS attacks, it is not suitable for detecting and classifying malware activities, which do not involve backscatter.

C. TOWARDS THE IDENTIFICATION OF INVESTIGATIVE SCANNERS
We conclude this section by presenting related works on distinguishing investigative scanners, which is an issue that must be considered in future studies. A summary of the related works mentioned in this subsection is presented in Table 10. Many massive Internet-wide scanners are observed on the darknet, including both public scanning activities and malware activities. Recently, high-performance scanning tools such as ZMap [44] and Masscan [69] have been deployed, and Mazel et al. profiled the utilization of such tools [45]. The results revealed that many entities openly engage in scanning activities on a large scale and on a constant basis. Because such harmless and large-scale investigative scanners perform activities with relatively synchronized spatiotemporal patterns, many of these scanners were incorrectly detected in the results of this study. Therefore, we must consider how to distinguish such harmless investigative scanners from malware activities.
As mentioned in the previous subsection, DANTE [24] embeds port sequences of darknet traffic by source host in a given time frame and performs clustering. By comparing with the previous time frame and labeling the clusters, it is possible to track campaigns and detect recurrent or new attacks. In large-scale scans using scanning tools, such as ZMap or Masscan, or large-scale scans using malware, fingerprints are often attached to packet headers to perform faster scans [46]. It has also been reported in [46] that fingerprints are provided to distinguish scan results from backscatter. Tanaka et al. proposed a method based on a genetic algorithm to automatically identify fingerprints embedded in TCP/IP headers from darknet traffic [70]. They succeeded in identifying unknown fingerprints from data corresponding to a short period. Identifying the fingerprints of investigative scanners and tracing the scanners using the DANTE mechanism may enable us to distinguish scanners from malware activities. Additionally, Wan et al. clarified that the coverage rate differs depending on the scan source and that the observed hosts differ depending on the region of the observation network [71]. Analyzing darknet observation networks located in various regions, such as different autonomous systems (ASes) and countries, is a way to obtain a more detailed and precise understanding of the actual scanning situation.
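A rule-based check for known scanner fingerprints can be sketched as follows. The two rules are the commonly reported default fingerprints of ZMap (IP identification field fixed to 54321) and Masscan (IP identification equal to the XOR of the destination address, destination port, and TCP sequence number, truncated to 16 bits). Both are assumptions about default tool behavior, operators can change them, and the packet field names and function names are hypothetical.

```python
from ipaddress import IPv4Address

ZMAP_DEFAULT_IP_ID = 54321  # assumption: default value reported for ZMap


def looks_like_zmap(packet):
    """True if the IP identification field matches the reported ZMap default."""
    return packet["ip_id"] == ZMAP_DEFAULT_IP_ID


def looks_like_masscan(packet):
    """True if ip_id == (dst_ip XOR dst_port XOR tcp_seq), kept to 16 bits.

    Assumption: this relation is the reported default Masscan fingerprint.
    """
    dst = int(IPv4Address(packet["dst_ip"]))
    expected = (dst ^ packet["dst_port"] ^ packet["tcp_seq"]) & 0xFFFF
    return packet["ip_id"] == expected


def label_scanner(packet):
    """Return a coarse label; 'unknown' means no rule matched."""
    if looks_like_zmap(packet):
        return "zmap"
    if looks_like_masscan(packet):
        return "masscan"
    return "unknown"


# Hypothetical packets observed on a darknet sensor.
zmap_packet = {"dst_ip": "192.0.2.1", "dst_port": 443,
               "tcp_seq": 42, "ip_id": 54321}
masscan_packet = {"dst_ip": "192.0.2.1", "dst_port": 23,
                  "tcp_seq": 1000, "ip_id": 510}
other_packet = {"dst_ip": "192.0.2.1", "dst_port": 23,
                "tcp_seq": 1000, "ip_id": 7}

labels = [label_scanner(p) for p in (zmap_packet, masscan_packet, other_packet)]
```

Static rules of this kind only cover fingerprints that are already known. Discovering unknown fingerprints automatically, as in [70], and tracking the labeled sources over time are the parts that such a check does not address.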

VII. CONCLUSION
In this study, we introduced three independent machine learning methods to automatically estimate the synchronization of the spatiotemporal patterns of darknet traffic in real time and to detect anomalies. Those three methods are: Dark-GLASSO, Dark-NMF, and Dark-NTD. We also proposed Dark-TRACER, which integrates all three methods into a single framework. We found that Dark-TRACER was able to complement the weaknesses of each module, achieving a 100% recall rate and detecting all malware activities in the experiment. It detected the malware on average 153.6 days earlier than the time when the threats were revealed to the public by reputable third-party security research organizations. In addition, we found that two analysts could perform the daily operations necessary to detect these threats in approximately 7.3 h.
Currently, our most serious challenge is the large number of false positives. In this study, we confirmed that even a simple rule-based approach can effectively reduce the number of false-positive alerts. As described in Sections V-D and VI-C, our future work is to reduce the number of false positives by identifying the fingerprints of investigative scanners and building a model to track them. By reducing the number of false positives, the analysis cost can be lowered. In addition, we intend to automate the secondary collation analysis mentioned in Section V-E to elucidate the causes and details of the alerts detected by Dark-TRACER. Finally, we plan to deploy Dark-TRACER in the real world and detect threats and malware activities in real time to aid rapid response.