Measuring Early Detection of Anomalies

Early detection is a matter of growing importance in multiple domains, such as network security, health conditions monitored through social network services, or weather-related disasters. It is not enough to make a good decision; it also needs to be made on time. In this paper, we define a method to evaluate the detection of anomalies in time-aware systems. To do so, we present the early detection problem from a generic perspective, examine the available evaluation metrics and propose a new metric named TaP (Time aware Precision). A set of experiments using three datasets from different fields is performed in order to compare the behaviour of the different metrics. Two different approaches are followed: first a batch evaluation is performed, followed by a streaming evaluation, which presents a more realistic picture of system behaviour. For both steps, we propose two sets of experiments: the first one using baseline models, followed by the evaluation of a set of Machine Learning algorithms. The proposed metric takes into account the number of items needed to take a decision, depending not on the specific dataset but on the nature of the problem to solve.


I. INTRODUCTION
The detection of anomalies in multiple application domains, such as health conditions, cyber-security or industrial equipment malfunction, is a key issue that has attracted, and continues to attract, a lot of attention, using multiple and different approaches, such as statistical, classification-based or cluster-based ones, among others [1].
However, another key aspect behind the detection of an anomaly is the time required for its detection, since an early detection can help reduce the negative impact of the anomaly in the system. For example, from a medical perspective, the early detection of a disease can speed up its treatment reducing the negative impact of the disease on the subject and, at the same time, reducing the economical cost of the treatment [2].
In the same way, the early detection of an anomaly in an industrial equipment can help minimise the disruption of a normal service operation [3], [4]. The early detection of an intrusion in a computer network is paramount to reduce the impact of the attack on the infrastructure and to prevent it from reaching more advanced phases, where it could be more difficult to tackle [5].
Therefore, in this work, we study the problem of early detection of anomalous behaviour in different environments, following an evaluation methodology that takes into consideration the time required to detect the anomaly. We thus aim at classifying as soon as possible, minimising the amount of information the models need to make a decision. For this purpose, we examine the main works in the state of the art that tackle the early detection problem in different environments, and we propose a formal and common structure for the early detection problem, along with a suitable evaluation metric. More specifically, the main contributions of this work can be summarised as follows:
• We present and formally define the early detection problem from a global perspective, which seeks to identify, as soon as possible, whether an anomaly is present in the system.
• We examine the evaluation metrics from the state of the art, identify their pros and cons, and propose a new metric named Time aware Precision, which covers the three states in the early detection problem (i.e. normal, anomalous, delay).
• We perform extensive experiments using three independent real-world datasets from different environments (depression disorders, computer network attacks and floods) and validate the behaviour of the proposed metric against the state-of-the-art metrics.
The remainder of this article is organised as follows. First, related studies on the early detection problem in different contexts are examined. Then, we formally define the early detection problem from a global perspective, and next we present the datasets used in our experiments, which cover different environments of the problem. A detailed experimental evaluation is described in Section V, followed by a discussion. Finally, we present the main conclusions of our article and possible future works.

II. RELATED WORKS
Throughout the literature, early detection of anomalies has been explored in different fields, although there is limited research on early detection evaluation metrics and methodologies.
The workshop on early risk prediction on the Internet (eRisk), as part of the Conference and Labs for the Evaluation Forum (CLEF) since 2017, proposed a task for the early detection of different conditions (e.g. depression, self-harm or anorexia) on social networks using a time-aware methodology and effectiveness metrics [6], [7].
In this sense, to the best of our knowledge, the eRisk workshop constitutes the main effort on early detection, with different metrics proposed, ERDE (Early Risk Detection Error) [6] and latency-weighted F1 (or F1-latency) [8] being the most used. These metrics are described in detail in the next section, although both present a major issue: they are dataset dependent, which prevents their use for performance comparison across datasets. In fact, this limitation is the main motivation behind our work.
On the other side, several attempts for early detection in different fields can be identified in the literature. For example, some works explore the early detection of cyberbullying on social networks. Samghabadi et al. [9] propose a corpus to detect cyberbullying on social networks as soon as possible, although the time-aware evaluation is limited to the use of standard metrics (i.e. F1) at specific points. A more precise study is described in [10], where the authors use ERDE and latency-weighted F1 to compare the performance of different machine learning methods in a dataset from the Vine social network.
Some efforts have also been given to the early detection of rumours and fake news on social media. For example, [11] presents different alternatives for the detection of unconfirmed information, although the time-aware evaluation is limited to measuring the amount of time required to detect a rumour. An interesting interdisciplinary study of fake news is presented in [12], although the analysis of the early detection is circumscribed to the reduction of news articles and news content information, without including a proper time-aware evaluation.
Cyber-security is another field where the time required to detect a possible threat has been of interest to researchers. In [13] the authors present a prototype for the early recognition of cyber attacks, although no time-aware performance evaluation of the proposed system is developed. Some other works, such as [14] or [15], explore early detection by detecting cyber attacks in their early stages; however, the evaluations do not actually consider the time required for the detection. There is some research on Early Warning Systems (EWS), especially to avoid malware propagation, that explores different alternatives such as Bayesian inference [16], Kalman filters [17] or sensors [18], but the evaluation is mainly focused on the identification of potential attacks in a timeline, without presenting a proper time-aware performance metric. More closely related to this work, [19] explores different methods for the early detection of cyber attacks using ERDE as the main performance metric, while in [20] the authors focus on Operating System scan attacks and include F1-latency as a time-aware evaluation metric.
Another area in which early detection is especially interesting corresponds to smart cities. The number of connected devices and the massive traffic mean that threats can have a significant impact. Xu [21] proposes the use of a software-defined network function virtualization (SDNFV) architecture and a traffic classification strategy to perform early detection of such attacks. However, the time-aware evaluation is limited to measuring the average response time between the attack and the detection. A similar approach is followed by Privalov et al. [22], measuring the average time to detect a distributed denial of service attack.
In summary, in the literature we can identify several research works that emphasise the early detection of problems or anomalies in different fields; in particular, we should highlight ERDE [6] and $F_{latency}$ [8]. However, there is no clear time-aware evaluation methodology, which leads researchers to use unconventional alternatives and metrics. In this work, we present, from a global perspective, the problem of early detection of anomalies and propose a new metric, named Time aware Precision, that is generic and suitable for any situation.

III. PROBLEM STATEMENT
In this section we formally define the problem of early detection of anomalous situations for a generic system and a non-specific type of entity.
Let $E = \{e_1, e_2, \ldots, e_{|E|}\}$ be the set of entities that may suffer anomalies, where $|E|$ denotes the number of entities. Each entity $e \in E$ is formed of a sequence of items, denoted as $I^e$, and a binary indicator $l_e$ that denotes whether the specific entity is considered anomalous ($l_e = \text{true}$) or not ($l_e = \text{false}$).
We define $E^+$ as the set of anomalous entities, i.e. $E^+ = \{e \in E : l_e = \text{true}\}$, and $E^-$ as the set of non-anomalous entities, $E^- = \{e \in E : l_e = \text{false}\}$. The sequence of items for a specific entity is supposed to change through time and is given by $I^e = (\langle I^e_1, t^e_1 \rangle, \langle I^e_2, t^e_2 \rangle, \ldots, \langle I^e_n, t^e_n \rangle)$, where the tuple $\langle I^e_k, t^e_k \rangle$, $k \in [1, n]$, represents the $k$-th item for entity $e$, and $t^e_k$ is the timestamp associated with item $I^e_k$. Timestamps $t^e_1, t^e_2, \ldots, t^e_n$ can be equally and homogeneously distributed (e.g. a sensor emitting temperature and humidity data every second) or uneven and randomly distributed (e.g. the posts written by a user).
An item, $I^e_k$, is specified by a vector of characteristics or features and, in this case, we assume that all items associated with an entity, $I^e_k$, $k \in [1, n]$, are defined by the same vector of features, whose values may, and predictably will, change through time.
Since entities are independent, each sequence of items $I^e$ may have a different length, $n$, for each entity $e \in E$. However, note that the number of features, $m$, is the same for all items.
We will drop the superscript $e$ in the specification of $I^e$, e.g. $I = (\langle I_1, t_1 \rangle, \langle I_2, t_2 \rangle, \ldots, \langle I_n, t_n \rangle)$, whenever $e$ is clear from the context. Similarly, the vector of features for a specific item $I^e_k$ will drop the superscript when it is clear from the context: $I_k = (f_{k1}, f_{k2}, \ldots, f_{km})$. Given an entity $e$, the objective is to detect whether the entity has an anomalous behaviour while scanning as few items from $I^e$ as possible.
We define the objective function as $f(l_e, I^e, k) \rightarrow \{0, 1, 2\}$, with $k \in [1, n]$. The function returns 1 (i.e. positive) if entity $e$ is considered anomalous after processing items $I_1$ to $I_k$. In case the entity $e$ is considered normal (i.e. non-anomalous or negative) after processing item $k$ and the previous ones, then $f(l_e, I^e, k) = 0$. Finally, $f(l_e, I^e, k) = 2$ if no definitive decision can be emitted on entity $e$ after reading $k$ items and more items must be processed (i.e. delay).
Therefore, outputs 0 and 1 of $f(l_e, I^e, k)$ are considered final, and items $I_{k+1}, \ldots, I_n$ do not need to be processed. On the other hand, if output 2 is provided, further items $I_{k+1}, \ldots, I_n$ must be processed, until a final output is achieved or the end of the item sequence is reached.
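To make the three-valued objective function concrete, the following minimal Python sketch models the scanning process. The names (`Decision`, `scan_entity`, `decide`) are illustrative assumptions, not part of the formal definition above.

```python
from enum import IntEnum
from typing import Callable, Sequence, Tuple

class Decision(IntEnum):
    NEGATIVE = 0  # entity considered normal (final output)
    POSITIVE = 1  # entity considered anomalous (final output)
    DELAY = 2     # no decision yet; more items must be processed

def scan_entity(decide: Callable[[Sequence], Decision],
                items: Sequence) -> Tuple[Decision, int]:
    """Feed items one by one until a final decision or exhaustion.

    `decide` maps the prefix items[:k] to a Decision, mirroring f(l_e, I^e, k).
    Returns the final decision and the number of items consumed.
    """
    decision = Decision.DELAY
    for k in range(1, len(items) + 1):
        decision = decide(items[:k])
        if decision != Decision.DELAY:  # outputs 0 and 1 are final
            return decision, k
    return decision, len(items)
```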

A. STANDARD METRICS
The main metrics identified in the state of the art for early detection problems are ERDE [6] and latency-weighted F1 [8]. In both cases, the metrics were used to measure performance on the early detection of depression in individuals based on their posts on social networks.
The ERDE metric is measured at a specific point denoted as $o$ and considers four different cases [6]:

$$\mathrm{ERDE}_o(k) = \begin{cases} c_{fp} & \text{false positive (FP)} \\ c_{fn} & \text{false negative (FN)} \\ lc_o(k) \cdot c_{tp} & \text{true positive (TP)} \\ 0 & \text{true negative (TN)} \end{cases} \qquad lc_o(k) = 1 - \frac{1}{1 + e^{k - o}}$$

where $k$ is the number of items processed before emitting the decision. In case of wrong predictions (FP and FN), as expected, the error increases, but in two different ways: false positives increase the error proportionally to the number of positive cases in the dataset ($c_{fp}$ is set to the proportion of anomalous entities), while false negatives increase the error by 1 ($c_{fn} = 1$). A true negative prediction, as expected, does not increase the error, independently of when it was produced. However, a true positive will have a negative impact if the delay required to make the prediction exceeds the measuring point $o$ (i.e. $k > o$), with the sigmoid latency cost $lc_o(k)$ introducing the penalty.
Note that the ERDE metric ranges from 0 to 1 and, being an error measure, values closer to zero are preferable.
Some variants of the ERDE metric can also be found in the literature. For example, [23] defines ERDE%$_o$, which is based on the percentage of items processed instead of their number, and [20] defines a normalised version of the ERDE metric using min-max normalisation.
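For reference, here is a per-entity sketch of the ERDE computation following the four cases above; passing $c_{fp}$ as the proportion of positive entities follows the convention in [6].

```python
import math

def erde(case: str, k: int, o: int,
         c_fp: float, c_fn: float = 1.0, c_tp: float = 1.0) -> float:
    """ERDE_o for a single entity (sketch).

    `case` is one of 'tp', 'fp', 'fn', 'tn'; k is the number of items read
    before the decision; c_fp is usually the proportion of positive entities.
    """
    if case == 'fp':
        return c_fp
    if case == 'fn':
        return c_fn
    if case == 'tp':
        # latency cost: a sigmoid that grows quickly once k exceeds o
        return (1.0 - 1.0 / (1.0 + math.exp(k - o))) * c_tp
    return 0.0  # true negative: no error regardless of when it was produced
```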
The latency-weighted F1, $F_{latency}$ or F1-latency for short, is proposed by Sadeque et al. as an alternative to the ERDE metric, combining both latency and accuracy [8]. The metric is defined as

$$F_{latency} = F1 \cdot \left( 1 - \underset{e \in E^+}{\mathrm{median}}\; penalty(k_e) \right), \qquad penalty(k) = -1 + \frac{2}{1 + e^{-p(k-1)}}$$

where $F1$ is the standard F-measure calculated from precision ($P$) and recall ($R$), $F1 = \frac{2 \cdot P \cdot R}{P + R}$, $k_e$ is the number of items processed before detecting the positive entity $e$, and $p$ is a parameter that determines how quickly the penalty should increase. Typically, this parameter is set to achieve a 50% latency penalty at the median number of items.
This metric, as it depends directly on F1, ranges from 0 to 1, and values closer to 1 represent good results.
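A compact sketch of latency-weighted F1 under our reading of [8]; it assumes the per-entity latencies of the true-positive decisions are available.

```python
import math
from statistics import median

def f_latency(f1: float, latencies: list, p: float) -> float:
    """Latency-weighted F1 (sketch).

    `latencies` holds the number of items k consumed for each true-positive
    decision; p controls how quickly the latency penalty grows.
    """
    def penalty(k: int) -> float:
        return -1.0 + 2.0 / (1.0 + math.exp(-p * (k - 1)))
    # each decision keeps a "speed" in (0, 1]; the median speed scales F1
    speeds = [1.0 - penalty(k) for k in latencies]
    return f1 * median(speeds)
```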
The main limitation both metrics present is that they are dataset dependent, which greatly limits performance comparison across different datasets. The ERDE penalisation for false positives is directly related to the number of anomalous cases in the dataset, and for $F_{latency}$ the parameter $p$ depends on the median number of items.
Moreover, some other limitations have been identified for both metrics:
• The sigmoid function employed by ERDE produces a fast increase of the penalty for late true positives. Also, a perfect system (detecting all true positive cases with just the first round of items) may achieve an error greater than 0 [20], [24].
• $F_{latency}$ is defined as a final metric and cannot be measured at different points (i.e. different batches), and it can produce unexpected values (i.e. negative values) with small item sequences, due to the use of the median to determine the penalty increase through the parameter $p$ [20].

B. TIME AWARE PRECISION
One of the main contributions of this work is a new metric named Time aware Precision, or TaP for short. This metric is designed to overcome the major issues of ERDE and $F_{latency}$. TaP ranges between +1 and −1, where values close to +1 represent correct and timely predictions, values close to −1 represent incorrect or late predictions, and values around 0 correspond to non-prediction (delay) cases.
We define Time aware Precision as a generic metric that can be applied to any early detection problem. In this sense, one of the key aspects that is problem dependent is the moment when a correct prediction is considered late. For this purpose, and following ERDE, we define a point denoted as $o$ after which correct predictions start to be penalised. Up to this point, the delay is considered acceptable for the defined problem, and the penalisation is applied after that number of items.
Therefore, TaP at $o$ for an entity $e_i$ is calculated as follows:

$$\mathrm{TaP}_o(e_i) = \begin{cases} -1 & \text{incorrect prediction (FP or FN)} \\ 1 & \text{correct prediction with } k \le o \\ pf_{o,\lambda}(k) & \text{correct prediction with } k > o \\ 0 & \text{no prediction (delay)} \end{cases}$$

In the case of an incorrect prediction (false positive or false negative), the metric weighs −1 to represent an error. If a correct prediction is made (true positive or true negative), then the time required to generate the prediction is taken into account: if the prediction was made at or before $o$ (i.e. $k \le o$), the metric achieves its maximum value (i.e. 1); otherwise (i.e. $k > o$) a penalty function, $pf_{o,\lambda}(k)$, is used to reduce the score. Lastly, if no decision has been made (i.e. delay), TaP takes the value 0.
TaP for the set of entities $E$ is calculated as the average score over the entities:

$$\mathrm{TaP} = \frac{1}{|E|} \sum_{e \in E} \mathrm{TaP}_o(e)$$

The penalty function, $pf_{o,\lambda}(k)$, is defined using a generalised logistic function, scaled to operate on the defined range, and the penalty is based on the number of items required to take the decision. In Figure 1, the X-axis represents the number of items required by the system to reach a correct prediction, assuming all previous items led to a delay; for example, for $x = 5$ the system generated the correct prediction on item $k = 5$ and, therefore, a delay was generated previously (i.e. for $x < 5$).
The parameter $\lambda$ controls how the penalty increases. Figure 1 shows how the parameter $\lambda$ affects TaP. In all cases, the correct prediction has been made, but the number of items required to obtain it varies and is represented on the X-axis. If $\lambda = 0$, no penalty is introduced for a late prediction and, independently of when the prediction was made, the maximum score is achieved. However, as $\lambda$ increases, the penalty slope becomes steeper: when $\lambda = 0.1$ the metric decreases linearly as the correct prediction is delayed, and with $\lambda = 10$ a step penalty function is obtained, reaching the minimum value with a prediction just one item late (i.e. $k = 3$).
Also, we define TaP$^+$ for the set of anomalous entities or positive cases, $E^+$, and TaP$^-$ for the non-anomalous entities, $E^-$, as the class-wise averages:

$$\mathrm{TaP}^+ = \frac{1}{|E^+|} \sum_{e \in E^+} \mathrm{TaP}_o(e), \qquad \mathrm{TaP}^- = \frac{1}{|E^-|} \sum_{e \in E^-} \mathrm{TaP}_o(e)$$

Both values are combined in a Time aware Precision variant denoted as TaP$_\alpha$ and defined as:

$$\mathrm{TaP}_\alpha = \alpha \cdot \mathrm{TaP}^+ + (1 - \alpha) \cdot \mathrm{TaP}^-$$

where the parameter $\alpha$ controls the impact of anomalous and non-anomalous cases on the final score. This allows the balance between cases, and the weight of each class in the final result, to be controlled. Again, this parameter is problem related and its value will depend on each specific situation. In this sense, $\alpha = 0.5$ balances both cases, but other settings can be considered. For example, with $\alpha = 1$ only positive (i.e. anomalous) cases are taken into account and non-anomalous cases are ignored, which could be interesting when studying the early detection of a disease where non-infected cases are unimportant.
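As an illustration, the following minimal Python sketch computes the per-entity score and the TaP$_\alpha$ combination. The exact scaled logistic used for $pf_{o,\lambda}$ is not reproduced here; the form in the code is an assumption chosen to match the behaviour described above (no penalty for $\lambda = 0$, an approximately linear decrease for $\lambda = 0.1$, and a step to $-1$ for $\lambda = 10$), so the paper's exact scaling may differ.

```python
import math

def tap_entity(case: str, k: int, o: int, lam: float) -> float:
    """TaP_o for a single entity; `case` is 'tp', 'tn', 'fp', 'fn' or 'delay'."""
    if case in ('fp', 'fn'):
        return -1.0          # incorrect predictions always score -1
    if case == 'delay':
        return 0.0           # no decision emitted for this entity
    if k <= o:
        return 1.0           # correct and in time: maximum score
    # ASSUMED penalty: a logistic centred at o, rescaled so that the score
    # is 1 at k = o and tends to -1 as the correct decision gets later.
    sigma = 1.0 / (1.0 + math.exp(-lam * (k - o)))
    return 1.0 - 4.0 * (sigma - 0.5)

def tap_alpha(tap_pos: float, tap_neg: float, alpha: float = 0.5) -> float:
    """Weighted combination of the class-wise averages TaP+ and TaP-."""
    return alpha * tap_pos + (1.0 - alpha) * tap_neg
```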

IV. EXPERIMENTAL SETTINGS

A. DATASETS
Three datasets coming from three different and independent domains are used for evaluation purposes: a dataset for the detection of depression based on social networks, a dataset used for the detection of computer network attacks and a dataset of weather observations data for the detection of floods.
The following subsections provide the details for each one of them and how the early detection problem is defined in each case.

1) DEPRESSION DATASET
This dataset was specifically gathered for the 2017 Workshop on Early Risk Prediction on the Internet (eRisk) [6] and contains public Reddit posts published by individuals who have been manually tagged as depressed or non-depressed based on self-reports of diagnosed depression. The dataset contains posts covering a period of about a year for each user.
In this case, the set of entities $E$ corresponds to the different subjects considered in the dataset, while the sequence of items for each entity is represented by the subject's posts. Table 1 shows a brief summary of the main figures associated with the datasets considered. This dataset is the smallest one, with just 887 entities and slightly more than half a million items. Interestingly, the average number of items per entity is higher than in the other datasets. In the experiments, the features defined in [25] and [26] are used, limited to those describing individual post characteristics, without taking into account aggregated features over a sequence of posts.

2) NETWORK ATTACKS DATASET
This dataset includes data traffic from a video surveillance network where different attacks are performed throughout several days and is used to test Network Intrusion Detection Systems (NIDS) [27]. In our case, we focus on the OS Scan Attack from this dataset that scans the network for hosts and their operating systems to reveal potential vulnerabilities.
In order to model this as an early detection problem, we have created bidirectional flows from the network communication data [28]. In this way, the set of entities, $E$, corresponds to data flows, where each flow is composed of a sequence of packets (i.e. items) sharing the same source/destination IP address pair, source/destination port pair and protocol. Furthermore, to improve the flow division, the timestamps of the packets are also considered, setting a threshold of 0.1 seconds for the time between two consecutive packets and of 1 second for the flow duration [20].
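A simplified sketch of this flow construction is shown below; the packet field names (`ts`, `src_ip`, `src_port`, `dst_ip`, `dst_port`, `proto`) are illustrative assumptions, while the thresholds follow the text (0.1 s between consecutive packets, 1 s of flow duration).

```python
def split_flows(packets, gap=0.1, max_duration=1.0):
    """Group packets sharing a bidirectional 5-tuple into flows (entities)."""
    flows = {}      # active flow per key
    finished = []   # closed flows
    for pkt in sorted(packets, key=lambda p: p['ts']):
        # bidirectional key: the same pair of endpoints in either direction
        ends = frozenset([(pkt['src_ip'], pkt['src_port']),
                          (pkt['dst_ip'], pkt['dst_port'])])
        key = (ends, pkt['proto'])
        flow = flows.get(key)
        if flow and (pkt['ts'] - flow[-1]['ts'] > gap
                     or pkt['ts'] - flow[0]['ts'] > max_duration):
            finished.append(flow)  # close the flow and start a new one
            flow = None
        if flow is None:
            flow = flows[key] = []
        flow.append(pkt)
    finished.extend(flows.values())
    return finished
```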
In total, more than 75 thousand entities and nearly 1.7 million items are considered (Table 1). For evaluation purposes, the features proposed in [27] are employed, which consider only individual characteristics of each packet, disregarding aggregated flow features.

3) FLOODS DATASET
Using the data provided by the National Centers for Environmental Information (NCEI) of the National Oceanic and Atmospheric Administration (NOAA) of the USA, we have constructed a floods dataset for 2018 in the United States.
Based on the Storm Events Database [29] provided by the NCEI, we have collected the details about the timing and location of all flood events recorded in 2018. Moreover, the Integrated Surface Dataset (ISD) from the NCEI provides worldwide hourly surface weather observations, including atmospheric pressure, temperature, dew point, atmospheric winds and precipitation, among others [30]. For each flood, we have identified the closest meteorological station and collected data for the whole year, including the period of time when the flood occurred.
In order to study the early detection of floods, we have converted the dataset into a series of raining sequences. A raining sequence starts when a precipitation observation is obtained and continues until the precipitation stops; a limit of up to 2 hours without any precipitation has been set for the sequence to finish. A raining sequence is then classified as a flood if a flood occurred during the raining sequence or up to 24 hours afterwards.
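The sequence construction and labelling can be sketched as follows; the record field names and list-based representation are illustrative assumptions, while the thresholds follow the text (a 2-hour gap ends a sequence, and floods within 24 hours label it positive).

```python
from datetime import timedelta

def build_sequences(observations, flood_times,
                    gap=timedelta(hours=2), horizon=timedelta(hours=24)):
    """Split precipitation observations into raining sequences and label them."""
    sequences, current = [], []
    for obs in sorted(observations, key=lambda o: o['time']):
        # a pause longer than `gap` closes the current raining sequence
        if current and obs['time'] - current[-1]['time'] > gap:
            sequences.append(current)
            current = []
        current.append(obs)
    if current:
        sequences.append(current)
    labelled = []
    for seq in sequences:
        start, end = seq[0]['time'], seq[-1]['time']
        # positive if a flood occurred during the sequence or within `horizon`
        is_flood = any(start <= t <= end + horizon for t in flood_times)
        labelled.append((seq, is_flood))
    return labelled
```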
In this case, the set of entities, $E$, corresponds to the raining sequences, and the precipitation observations constitute the sequence of items for each entity. The experiments are performed using as features the different weather observations collected by the NCEI at each meteorological station.
From Table 1 we can observe that this dataset has a relatively high number of entities, close to 70 thousand, and that the number of items is above 500 thousand. Probably the most relevant feature of this dataset is that it is highly unbalanced, with just 2.1% of entities considered anomalous (i.e. floods).

B. EVALUATION PROTOCOLS
For evaluation purposes, the set of entities, $E$, is divided into 5 folds, and each of them into two non-overlapping sets: training and testing. We present the mean result of these experiments and use a two-tailed paired t-test with a p-value < 0.05 to indicate the significance of performance differences. The literature mainly collects two types of evaluation for different early detection problems: batch or streaming [6], [24].
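As a side note, the significance test above can be computed with SciPy's paired (related-samples) t-test over per-fold scores; the fold scores below are hypothetical values for illustration only.

```python
from scipy.stats import ttest_rel

scores_a = [0.81, 0.79, 0.83, 0.80, 0.82]  # hypothetical 5-fold results, system A
scores_b = [0.76, 0.75, 0.79, 0.74, 0.78]  # hypothetical 5-fold results, system B
stat, p_value = ttest_rel(scores_a, scores_b)  # two-tailed by default
significant = p_value < 0.05
```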
In the batch evaluation, the sequence of items, $I^e$, is divided into various homogeneous and consecutive batches, and each one is processed independently. Therefore, for each entity $e$, $I^e = (\langle I^e_1, t^e_1 \rangle, \langle I^e_2, t^e_2 \rangle, \ldots, \langle I^e_n, t^e_n \rangle)$ is split into $B$ batches, where batch $j$, $B_j$, is defined as the sequence of items occurring between positions $(j-1) \cdot n/B + 1$ and $j \cdot n/B$:

$$B_j = \left( \langle I_k, t_k \rangle : k \in \left[ (j-1) \cdot \tfrac{n}{B} + 1, \; j \cdot \tfrac{n}{B} \right] \right)$$

Since entities in $E$ are independent and can have different lengths, batches will be homogeneous for each entity $e$, but they may have different sizes across entities.
Each batch is processed individually and, typically, all previous batches can be considered by the early detection model. Therefore, the function $f(l_e, I^e, \frac{j \cdot n}{B})$, $j \in [1, B]$, must be processed until a final output is obtained or until $B$ is reached.
In our experiments, following previous works [6], [26], we have considered B = 10 for the test-set, with each batch containing 10% of the items.
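A minimal sketch of the per-entity batch split: batches are homogeneous and consecutive within an entity, so their sizes differ across entities of different lengths.

```python
def split_batches(items, B=10):
    """Split an entity's item sequence into B consecutive, even batches."""
    n = len(items)
    # batch j covers positions (j-1)*n/B+1 .. j*n/B (1-indexed in the text)
    return [items[(j - 1) * n // B: j * n // B] for j in range(1, B + 1)]
```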
In the streaming evaluation, the sequence of items, $I^e$, is processed individually and sequentially. Consequently, for each entity $e$, the function $f(l_e, I^e, k)$, $k \in [1, n]$, must be processed until a final output is obtained or until $n$ is reached. As in the previous case, when processing item $k$, $I_k$, all previous items, $I_1, \ldots, I_{k-1}$, can be taken into account by the model.
It is interesting to note that, in a batch evaluation, when a final result is provided, the exact batch used is identified (e.g. $B_j$). However, the exact item within the batch that produced the decision cannot be identified and, therefore, for evaluation purposes, the last item of the batch is considered (i.e. $I_{j \cdot n / B}$). On the other hand, in the streaming evaluation, the exact item, $I_k$, used to reach a final decision is precisely determined.
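Putting the pieces together, a streaming evaluation loop can be sketched as follows; `scan_entity`, `Decision` and `tap_entity` are the illustrative helpers from the earlier sketches, not names defined in the paper.

```python
def evaluate_streaming(decide, entities, o=5, lam=0.1):
    """Run a detector item by item and average the per-entity TaP scores.

    `entities` is an iterable of (items, label) pairs, where label is True
    for anomalous entities.
    """
    scores = []
    for items, label in entities:
        decision, k = scan_entity(decide, items)
        if decision == Decision.DELAY:
            case = 'delay'                          # no final output emitted
        elif (decision == Decision.POSITIVE) == label:
            case = 'tp' if label else 'tn'          # correct final decision
        else:
            case = 'fp' if decision == Decision.POSITIVE else 'fn'
        scores.append(tap_entity(case, k, o, lam))
    return sum(scores) / len(scores)
```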

C. EVALUATION METRICS
For all experiments we report results on ERDE and $F_{latency}$, as well as TaP.
The parameter $o$, used in ERDE and TaP to set the penalty point, is set for each dataset considering two cases, a low and a high penalty point: $o = 5$ and $o = 50$ for the Depression dataset, $o = 1$ and $o = 10$ for the Network attacks dataset, and $o = 2$ and $o = 20$ for the Floods dataset. The remaining parameters of TaP, namely $\lambda$ and $\alpha$, are explored in the experiments section (Section V), where different values are considered.

D. MODELS
Initially, to test the behaviour of the different evaluation metrics, we present five synthetic baseline models that represent extreme cases in the evaluation. For each of them, labels are assigned, considering the original labels, as follows:
• Oracle$_n$: produces delays before item $n$ for all entities and then generates the correct prediction for each entity. Therefore, Oracle$_1$ represents a best-case scenario where all correct predictions are produced after processing the first item of each entity.
• Elcaro$_n$: works as an inverse Oracle, delaying the prediction before item $n$ and then providing the wrong prediction for each entity. In this case, any Elcaro represents a worst-case scenario, independently of the time required to generate the prediction.
• Positive$_n$: a delay is produced before item $n$ and then all entities are tagged as positive (i.e. anomalous) cases.
• Negative$_n$: in this case, after item $n$, all entities are predicted as negative (i.e. normal or non-anomalous).
• Random: the three possible outputs (positive, negative or delay) are generated randomly with equal probabilities. These baselines are sketched in code below.
Moreover, in the experimental evaluation we consider a set of off-the-shelf machine learning algorithms (including LinearSVC, RandomForest, ExtraTree and AdaBoost; see Section V), whose scikit-learn [31] implementations are used to generate basic models in order to measure their performance and evaluate the metrics' behaviour. This selection of models was made based on the work presented in [32]. For all algorithms, a model has been defined for each point of measure, as explained more thoroughly in the following section.
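The synthetic baselines can be sketched as simple decision rules over the number of items seen so far; `Decision` is the illustrative enum from the problem-statement sketch.

```python
import random

def oracle(n, k, label):
    """Delay before item n, then the correct prediction for the entity."""
    if k < n:
        return Decision.DELAY
    return Decision.POSITIVE if label else Decision.NEGATIVE

def elcaro(n, k, label):
    """Delay before item n, then the wrong prediction (inverse Oracle)."""
    if k < n:
        return Decision.DELAY
    return Decision.NEGATIVE if label else Decision.POSITIVE

def positive(n, k, label):
    """Delay before item n, then tag every entity as anomalous."""
    return Decision.DELAY if k < n else Decision.POSITIVE

def negative(n, k, label):
    """Delay before item n, then tag every entity as normal."""
    return Decision.DELAY if k < n else Decision.NEGATIVE

def random_baseline(k, label):
    """Positive, negative or delay with equal probabilities."""
    return random.choice(list(Decision))
```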

V. EXPERIMENTS
In this section we present the results of our experiments, which we divide into two blocks: batch and streaming evaluation. For the batch evaluation protocol, we have split each dataset into 10 batches, each one with 10% of the items for each entity. In the streaming evaluation protocol, on the contrary, the entities' items are processed individually and sequentially until a final decision is obtained.

A. BATCH EVALUATION
Firstly, we focus on the TaP metric and the parameter $\lambda$ using the Depression dataset. We compute TaP$_{o=5}$ for all baselines and test different values of $\lambda$ (i.e. 0, 0.01, 0.1, 1 and 10). In Figure 2 we present the results for $\lambda = 0.01$ and $\lambda = 10$ as representative of the parameter's operation: the first value introduces a smooth but noticeable penalty and the second a severe one, as shown in Figure 1.
In both cases, as expected, Oracle$_5$ and Elcaro$_5$ take the highest and lowest values, respectively. However, when $\lambda = 10$, Oracle$_5$ is not able to reach a perfect score. This is due to the fact that some entities (i.e. subjects, in this case) required more than 5 items to make the correct prediction, as a result of the batch distribution, and that the penalisation introduced by $\lambda = 10$ is high. Meanwhile, the smoother penalisation introduced with $\lambda = 0.01$ allows Oracle$_5$ to reach an almost perfect score. It is interesting to observe how Positive$_5$ and Negative$_5$ are closer to Oracle$_5$ and Elcaro$_5$, respectively, with $\lambda = 10$, since the penalisation for a late detection is almost equivalent to that of a wrong prediction. Also, the symmetry shown in these figures is due to the synthetic nature of the models.
In the next set of experiments, we examine the behaviour of TaP$_\alpha$. For this purpose, we fix the value of $\lambda = 0.01$, as this value introduces a slight but noticeable penalisation for all datasets in this form of evaluation, and we show how TaP$^+$ and TaP$^-$ behave independently in Figure 3.
From the figure, we observe how each measure focuses only on positive and negative cases, respectively. Therefore, on TaP$^+$ the Oracle$_5$ and Positive$_5$ models achieve equal and the highest scores, while on TaP$^-$ Oracle$_5$ and Negative$_5$ achieve the highest scores. Conversely, Negative$_5$ obtains a score of −1 on TaP$^+$, as all positive cases are predicted as negative, and an equivalent result is obtained for Positive$_5$ on TaP$^-$.
Next, we study the effect of the $\alpha$ parameter in TaP$_\alpha$. For this purpose, we calculate TaP$_{o=5}$ for all baselines using the same dataset and testing different values of $\alpha$: 0.5, 0.75, 0.9 and 0.95. Figure 4 shows the results obtained.
It is interesting to note that $\alpha$ has no effect on the performance of the Oracle$_5$ and Elcaro$_5$ models since, in the former, there are no wrong predictions and, in the latter, all positive cases are wrongly predicted. However, the effect of $\alpha$ is clearly observed on the Positive$_5$ and Negative$_5$ models: as $\alpha$ increases, more importance is given to positive predictions and, therefore, the performance of the Positive$_5$ model becomes closer to Oracle$_5$ and, correspondingly, Negative$_5$ gets closer to Elcaro$_5$.
In the remaining experiments, unless stated otherwise, the parameters for TaP are fixed to $o = 5$, $\lambda = 0.01$ and $\alpha = 0.90$.
Finally, we compare the performance of the different metrics (i.e. ERDE, $F_{latency}$ and TaP) for the baseline methods using the three datasets. We provide two outputs for the Oracle model, at 1 and 5 items, respectively. Also, we use two values of the parameter $o$ for ERDE and TaP for each dataset, to capture low and high penalty points: for the Depression dataset we consider $o = 5$ and $o = 50$, following [6]; for the Network attacks dataset we set $o = 1$ [20] and $o = 10$; and the Floods dataset uses $o = 2$ and $o = 20$. Table 2 summarises the results obtained.
Regarding the ERDE metric, we can observe the difficulty of interpretation and comparability among datasets. Oracle$_1$ and Oracle$_5$ were expected to reach almost perfect scores in all cases, which mainly holds for the Depression and Floods datasets, where ERDE$_{o=\text{high}}$ even reaches 0.0. However, on the Network attacks dataset, both models obtain relatively poor scores (0.4337 and 0.6338). This is due to the small number of items in most positive entities in this dataset, approximately 2 packets per anomalous flow, which introduces a high penalisation for correct predictions.
Focusing on $F_{latency}$, it is interesting to note that this is a final metric and, therefore, just one score is provided, since it is not possible to measure performance at different points (i.e. for different values of the parameter $o$). Although the minimum value is obtained for the Elcaro and Negative models, and Oracle$_1$ obtains the maximum value, the penalisation introduced for the Oracle$_5$ model must be noted. This happens because both models achieve a perfect F1 measure and the latency component penalises the differences in the prediction moment. In particular, in the Network attacks and Floods datasets, where the number of items per entity is smaller, the penalisation introduced is such that, even when the number of items, $k$, is 5, the measure obtains half the maximum value or less.
On the other hand, TaP provides the maximum scores, or values close to them, for the Oracle models and minimum scores for Elcaro on all datasets. Also, values are comparable, and the same models obtain similar scores on different datasets. Batch evaluation may cause, in some cases (e.g. Oracle$_1$ in the Depression dataset with $o = \text{low}$), the maximum score not to be achieved. This is because the last item of the batch is considered for the metric calculation, which may introduce a small decrement. Interestingly, this highlights the difference between Oracle$_1$ and Oracle$_5$ in the Depression dataset, where the slightly worse performance of the latter is evident from the TaP scores obtained.
Next, we present the results for standard Machine Learning models using the same three datasets and metrics. To do so, the five different algorithms have been trained with two different points of prediction, on batches 1 and 5. The same parameters are used for all three datasets and points. These results are shown in Table 3.
The variation observed between datasets shows that, with the previously selected parameters for the metrics, the difference between $o = \text{low}$ and $o = \text{high}$ is slightly higher for TaP than for ERDE when the systems perform poorly. There are bigger differences for ERDE between $o = \text{low}$ and $o = \text{high}$ on the better performing models, but the differences between models are smaller, which can make selecting the better model more difficult. This can be seen in the Network attacks dataset for ExtraTree$_1$ and AdaBoost$_1$, where ERDE achieves 0.4337 for both models while the other metrics show different values; in particular, TaP obtains 0.9858 and 0.9857, a mild improvement in terms of resolution. In the Oracle evaluation, the best results were expected with a higher $o$ (i.e. $o = \text{high}$) as the TaP parameter, since that implies a lower penalty in the evaluation, and this is what can be observed in the Machine Learning models evaluation as well.
As shown in the Oracle evaluation, better results were expected the sooner the decision was taken. This might not hold for Machine Learning models, as some systems could perform better with less information than others and, in some cases, it is even possible to get worse results with more information.
If we analyse the results for the Network attacks dataset, it can be seen that every model's performance improves the sooner the decision is taken. One exception to this behaviour can be observed in the Floods dataset results, where LinearSVC$_1$ and LinearSVC$_5$, for example, achieve almost the same values for every metric. Also, as explained in the baseline models evaluation, better results are achieved for a higher penalty point with the proposed metric.
Finally, the differences in the behaviour of the $F_{latency}$ metric and TaP between the baseline models and the Machine Learning models evaluation must be noted.

B. STREAMING EVALUATION
In the streaming evaluation, instead of batches, each item of each entity is processed and evaluated sequentially and individually.
Since the gap between two consecutive measures is, in this case, just one item, we have set $\lambda = 0.1$ to observe variations in the metric more clearly. In contrast, in the batch evaluation, where the gap between two consecutive batches (and measures) is typically quite a few items, a value of $\lambda = 0.01$ was selected.
Before delving into the experiments, we present in Figure 5 the results for the baseline models on the Depression dataset. Observe that the X-axis represents the number of items used by a baseline model to produce its prediction, and the Y-axis the score obtained. Therefore, Oracle at $x = 10$ is an Oracle$_{10}$, and the TaP score achieved is approximately 0.5.
Regarding the TaP metric for Oracle and Positive in Figure 5, we can observe how, after the penalisation point of $o = 5$, there is a linear decrease in performance, as expected, since the predictions, although correct, are generated late.
In the same Figure 5, ERDE shows very little difference between the five models, which decreases the ability to differentiate their performance.
Lastly, $F_{latency}$ starts with a close-to-perfect value at the first measurement points for the Oracle model and stabilises around 20 items. As can be seen, the Oracle model presents a big difference with respect to the rest of the models, which again cannot be easily differentiated.
Focusing more on TaP, we explore the effect of $\lambda$ in Figure 6, setting $\lambda = 10$ for TaP$_{o=5}$ and TaP$_{o=50}$, again on the Depression dataset. As expected, there is an important decrease in the performance of Oracle and Positive after the penalisation point, but there is also an impact on the final score. In this sense, Oracle$_{10}$ and subsequent models achieve values close to −1 for TaP$_{o=5}$, while for TaP$_{o=50}$ the scores of Oracle$_{100}$ and subsequent models decrease, but only to around 0.25. This is due to the effect of the logistic function in the penalty, which is stronger for lower values of $o$.
Taking into account the results of the baseline models evaluation, we proceed to show the results for the Machine Learning algorithms selected and already used in the batch evaluation section. Figures 7, 8 and 9 show the behaviour of ERDE and $F_{latency}$ compared to TaP$_{\alpha=0.90}$. For both ERDE and TaP, the $o$ value selected was the $o = \text{low}$ used previously, that is, $o = 5$ for the Depression, $o = 1$ for the Network attacks and $o = 2$ for the Floods dataset. Also, the $\lambda$ parameter is set to 0.1 in order to better show the impact of the penalisation when the difference between the numbers of items considered is small, as introduced at the beginning of this section. This happens, for example, with the first values of the graphics, as they are closer together than the rest.
Due to the performance of the algorithms and the distribution of the measurement points, the metrics' output is almost completely stable. The exceptions can be located in three different cases, with different explanations. First, in Figure 8, where the Network attacks dataset results are presented, a drastic change can be observed in all metrics, but especially in ERDE and $F_{latency}$, as the great majority of positive entities have around 2 items, which is where the penalty for $o = \text{low}$ is introduced. Secondly, we can focus on the Depression dataset, shown in Figure 7, where the RandomForest and ExtraTree models show a better performance for $F_{latency}$ and TaP, achieving their best value at point 5; in this case, ERDE is not capable of clearly differentiating between these models. Lastly, in the results for the Floods dataset, shown in Figure 9, ERDE and $F_{latency}$ are not capable of displaying the differences when the models obtain poor results, and only TaP is able to do so for the ExtraTree and RandomForest models. This shows, as already seen for the baseline models, the higher granularity provided by the TaP metric compared with ERDE and $F_{latency}$.
Lastly, if the results for the TaP metric on each dataset and the penalisation points are taken into account, changes in the output value can be observed and explained. If we take the Depression dataset as an example, an increase in the value of the metric can be seen up to point 5. This is explained by some algorithms taking better decisions after at least two items have been considered. At the same time, this does not improve the final value, as the penalty increases beyond 5 items. After the penalisation point, we can see a stabilisation of the metric value. This is related to the X-axis not showing a proportion of the total number of items for each dataset; instead, a more natural representation of the time distribution is shown.

VI. DISCUSSION
Batch evaluation provides a more straightforward and less resource-consuming approach, and it is appealing if entities are homogeneous. However, if entities are heterogeneous (i.e. have different numbers of items), the evaluation may penalise entities with smaller sequences of items. On the other hand, streaming evaluation is more resource demanding, since the whole sequence of items for each entity may have to be processed, but a better evaluation granularity can be achieved, since the exact item used for the final decision can be identified.
Moreover, some details about the tuning of the metric should be discussed, as the two parameters that have to be defined for each problem must be carefully selected. On the one hand, the parameter $o$ determines after which point the penalty is applied, and this can be derived from the problem itself and the urgency of the detection. On the other hand, the $\alpha$ parameter defines how much false positives and false negatives affect the outcome, and it should also be taken into account for better results.
Also, when parameters are discussed, we must note that metrics such as $F_{latency}$ require knowledge of the whole dataset in order to compute the parameter $p$, and that ERDE requires the proportion of anomalous entities in advance to generate the penalisation factor. In contrast, the parameters of TaP depend on the nature of the problem itself and not on the specific data of each dataset.

VII. CONCLUSION
In this paper, we presented the problem of early detection of anomalies over three datasets from different backgrounds. We can conclude that time-aware metrics are relevant to properly evaluate time-sensitive models. The chosen metrics must be easy to interpret and must represent the performance as well as the promptness of response of the system. This is achieved by the proposed TaP metric, which provides a dataset-agnostic way of measuring machine learning models in time-aware environments.
In the future, we expect to extend this research in different directions. We would like to investigate specific early detection models for some of the problems considered in this work (e.g. network attacks) and analyse their performance using TaP. Within that work, we will delve into feature extraction and generation, preprocessing and model definition. It would also be interesting to use different datasets on the same topic to compare the performance of different machine learning models in a time-aware prediction, using the methodology and metric proposed.