Improving Deep Learning Based Anomaly Detection on Multivariate Time Series Through Separated Anomaly Scoring

The importance of anomaly detection in multivariate time series has led to the development of several prominent deep learning solutions. As a part of the anomaly detection process, the scoring method has been shown to be of significant importance when separating non-anomalous points from anomalous ones. At this time, most solutions utilize an aggregated score, which means that relevant information created by the anomaly detection model might be lost. Therefore, this study set out to examine to what extent anomaly detection in multivariate time series based on deep learning can be improved if the residuals from each individual channel are considered in the anomaly score. To achieve this, an aggregated and a separated scoring method were applied with a simple denoising convolutional autoencoder, and the performance was compared with other state-of-the-art methods. The results showed that the separated approach has the potential to achieve significantly higher performance than the aggregated one. At the same time, there were some indications that aggregated scoring generalizes better when no labels are available to select the anomaly thresholds. The results should therefore serve as an encouragement to use a separated scoring approach together with a small sample of labeled anomalies to optimize the thresholds. Lastly, due to the impact of the anomaly score, the results suggest that future research within this field should apply the same anomaly scoring method when comparing the performance of deep learning algorithms.


I. INTRODUCTION
Finding anomalies in time series can be of great value for a number of different applications within the smart industry, including manufacturing, maintenance, security and server machines [1]. This can be especially true for multivariate data that has both temporal dependencies and interrelations between channels [1]. Because of the lack of real labeled data, as well as class imbalance and data heterogeneity, unsupervised and semi-supervised learning for anomaly detection has been researched extensively to perform accurate anomaly detection using few or no labels [1], [2]. Several different types of methods have been developed for multivariate anomaly detection in time series data, including statistics-based methods such as [3] and [4], but also traditional machine learning methods such as [5]. In recent times, however, the use of deep learning has become a prominent way of finding anomalies in multivariate time series data and has shown great success on benchmark datasets. The different types of architectures used include autoencoders, Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), Generative Adversarial Networks (GAN) as well as graph networks [1]. In addition, several different scoring methods have been developed. A scoring method defines, based on the output of the developed model, what should and should not be regarded as an anomaly. A recent study showed that the scoring method has a significant impact on the overall performance, which can surpass the importance of algorithm selection [6]. Currently, most state-of-the-art methods employ an aggregated score [6], meaning that the residuals for each of the individual channels are aggregated when determining the anomaly probability.
Yet, this means that relevant information about the anomalous behaviour disappears, which can, as was shown in [7], mean that anomalies caused by a single anomalous signal in a multivariate space are missed. Because of the loss of valuable information when reducing the model output, it is of interest to examine the behaviour of individual signals separately. The main research question is therefore:
• To what extent can the performance of multivariate anomaly detection on time series data based on deep learning be improved by constructing an anomaly score that considers the residuals of each individual channel separately?
To answer this question, a convolutional autoencoder is used together with two different scoring methods: one with the suggested separated scoring and one with aggregated scoring based on previous work. These are evaluated on benchmark datasets and compared with other state-of-the-art methods. The overall framework for the suggested separated approach is illustrated in Figure 1 and consists of three layers with different multivariate data inputs, D: two offline layers for training and optimization and one online layer that handles continuous data streams. The first layer trains a model in an unsupervised fashion and the second layer constructs separated thresholds based on channel-wise residuals and labels. These thresholds are then applied to the channel-wise residuals to produce the channel-wise score. The contributions of this study are that it:
• Constructs a separated scoring method for multivariate anomaly detection on time series data based on deep learning algorithms.
• Evaluates to what extent a separated scoring method for multivariate anomaly detection on time series data based on deep learning algorithms can outperform an aggregated scoring method.

II. RELATED WORK
In previous work, several different types of methods for anomaly detection in multivariate time series data have been used, both with regard to the deep learning algorithm and the scoring method. Recent work by J. Audibert et al. [8] has successfully shown the advantages of using adversarial training. In their method, called USAD, two different autoencoders with a shared encoder are trained with separate optimization goals: the first autoencoder aims to create a representation of the input that is as accurate as possible, and the second aims to create a representation that is as inaccurate as possible. Their anomaly score is based on the weighted sum of the outputs of the respective decoders, with a single anomaly threshold separating anomalous from non-anomalous points. Furthermore, in a study by D. Li et al. [9], MAD-GAN was introduced, a GAN whose generator and discriminator consist of two LSTM networks. The anomaly score in their method is based on an aggregation considering both the reconstruction of the input data and the result from the discriminator. Another commonly used architecture is the Variational Autoencoder (VAE), which, in contrast to standard autoencoders, encodes the input as Gaussian distributions. Both LSTM-VAE, described by D. Park, Y. Hoshi, and C. Kemp in [10], and OmniAnomaly, by Y. Su et al. [11], employ recurrent-based VAEs. In OmniAnomaly, Gated Recurrent Unit (GRU) cells and dense layers are used together with a planar normalizing flow that defines the latent space of the input, which is then decoded. They calculate the anomaly score at point t as an aggregation based on the contribution of each channel by considering the reconstruction probability, and then use a single threshold to separate anomalous and non-anomalous points. Similarly, LSTM-VAE is based on RNNs, but employs long short-term memory (LSTM) cells instead of GRU.
They also use a probability-based score based on the negative log-likelihood of the encoder-decoder reconstruction, together with a single threshold.
Similarly to the VAE, B. Zong et al. [12] have proposed a method called Deep Autoencoding Gaussian Mixture Model (DAGMM) that bases the anomaly detection on the estimated likelihood of a particular point in time belonging to a learned Gaussian distribution. The method uses two different networks: a standard dense autoencoder and an estimation network that uses the compressed representation from the autoencoder and learns the Gaussian mixture. The anomaly score is based on the likelihood of membership to the trained Gaussian mixture, and an anomaly is defined if the score surpasses a predefined threshold.
Furthermore, convolutional neural networks have been extensively applied. In a study by [13], a convolutional autoencoder together with a GRU neural network was applied; they use an aggregated scoring method based on the eigenvalues of each channel. In addition, a method called Multi-Scale Convolutional Recurrent Encoder-Decoder (MSCRED) was introduced by C. Zhang et al. [14], where a combination of recurrent and convolutional autoencoders is utilised. The network uses convolutions to compress the data and skip-connections with convolutional LSTM cells. In addition, they convert the multivariate time series to correlation matrices based on different lengths as input, which is unique compared to the other methods presented in this paper. The network is then used to recreate the input matrices, and a particular point in time is defined as anomalous if the count of anomalous cells, i.e. cells where the difference between the input matrix and the reconstructed output matrix is higher than a threshold, surpasses a given aggregation threshold.
Graph neural networks have also shown great results in recent works. Firstly, A. Deng and B. Hooi present GDN [15], which uses an attention-based prediction approach, meaning that future readings are predicted and compared to the actual data. Their anomaly score is defined as the maximum of the robustly standardized (based on the inter-quartile range) individual residuals for each sensor. This means that they use a single threshold to separate anomalous from non-anomalous points. Furthermore, GTA was introduced by Z. Chen et al. [16], which uses multi-scale dilated convolutions and incorporates multi-head attention in the learning process. They apply a simple aggregation of the residuals for each channel and define a point in time as anomalous if the score surpasses a certain threshold.
L. Shen, Z. Li and J. T. Kwok present a recurrent based temporal hierarchy one-class network (THOC) [17]. The method uses dilated RNN with multi-scale skip connection where the data is scaled into a Multiscale Support Vector Data Description (MSVDD) that is used to define the anomaly score at a certain point in time. A point is anomalous if the score is above a predefined threshold.
Regarding anomaly scoring methods, [6] explored different types of aggregation-based methods that showed great success when deployed together with state-of-the-art algorithms such as OmniAnomaly and USAD. They recommend the method called Gauss-D, where the score is aggregated as the cumulative negative log of the Gaussian probability distribution of each of the channels. Furthermore, they showed that the scoring method has a significant impact on the overall performance of the anomaly detection method.
While the results from these different studies show great promise, none of them apply a separate anomaly scoring approach, and considering the loss of information that an aggregated score inevitably leads to, it is of great value to examine to what extent this affects the anomaly detection performance.
In addition to deep learning methods, statistical and traditional machine learning methods have, in a recent study, shown promising results as alternatives to deep learning methods [18]. The authors argue, with their compelling results, that benchmarks for anomaly detection on multivariate time series should, to a greater extent, consider statistical and traditional machine learning approaches. Some of the statistical methods that showed promising results were Principal Component Analysis (PCA) [19] and Independent Component Analysis (ICA) [20], which can be applied as reconstruction-based methods for anomaly detection, where the input data is encoded and reconstructed to compare with the original input to find anomalies [21]. Regarding traditional machine learning solutions, One-Class Support Vector Machine (OCSVM) [22], Isolation Forest (IF) [23] and Local Outlier Factor (LOF) [24], which use different types of separation techniques for finding anomalies, showed performance comparable with that of the considered deep learning methods.

III. METHOD
A. PROBLEM FORMULATION
We are provided two sets of time series data, D_train and D_test. We assume that D_train does not include any anomalous data. The test set consists of readings x_t together with labels y_t ∈ {0, 1}, where y_t labels the corresponding x_t as either non-anomalous or anomalous. The goal is to train a model M on D_train and use this model to predict whether a particular x_t ∈ D_test is anomalous.
We assume that the model has access to the l prior readings when classifying the reading at the current time step t as anomalous or not, with the exception of the first l readings, where all l readings are assumed to be accessible. Let X_t ∈ R^{l×C} collect the last l readings row-wise. We will use the labels train and test to indicate which of the two sets X_t refers to.
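As an illustration of this setup, the window construction can be sketched as follows (a minimal sketch; the function name and the padding strategy for the first l readings are our assumptions, not from the paper):

```python
import numpy as np

def build_window(D, t, l):
    """Collect the last l readings up to and including time t, row-wise.

    D: full series of shape (T, C); returns X_t of shape (l, C).
    For the first l readings, the head of the window is padded by
    repeating the earliest reading (an assumption; the paper only
    states that all l readings are assumed accessible there).
    """
    start = max(0, t - l + 1)
    X_t = D[start:t + 1]
    if X_t.shape[0] < l:
        pad = np.repeat(X_t[:1], l - X_t.shape[0], axis=0)
        X_t = np.vstack([pad, X_t])
    return X_t
```

In an online setting, this window is rebuilt at every incoming time step and fed to the trained model.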

B. METHOD OVERVIEW
Figure 2 describes the method overview. It includes two basic steps: 1) training and 2) inference and threshold tuning. In training, an autoencoder is trained and then used to generate the channel-wise residual for each time step in the test data. These residuals are then used, together with a threshold optimization procedure, to produce the anomaly prediction based on separated anomaly scoring. The optimization is based on labels provided for the test dataset, and the scoring is constructed by normalizing the channel-wise residuals using their respective thresholds and setting the score to the normalized channel-wise residual with the highest value.

C. PART 1: TRAINING
1) MODEL TRAINING
We are interested in capturing the variations within each channel. This is the first step towards identifying which values fall outside the ''usual'' or acceptable ranges. We employ a Denoising Convolutional Autoencoder (DCAE), Figure 3, to capture data characteristics. Convolutional autoencoders have in different forms been successfully applied in the time series domain and shown great performance, as shown in for example [14] and [13]. There are limitations to the described DCAE, such as the usage of kernels with a fixed size, meaning that it can be challenging for it to capture long-term anomalies. However, we found that the performance of the DCAE is sufficient to answer our research question, which is to evaluate to what extent the performance of multivariate anomaly detection on time series data based on deep learning can be improved by constructing an anomaly score considering the residuals of each individual channel separately. The DCAE learns to reconstruct X_t^train by minimizing the residual between X_t^train and its reconstruction X̂_t^train. The denoising aspect of the network is intended to prevent over-fitting by injecting Gaussian noise, forcing it to learn the most significant features. We selected a fixed kernel size of seven, and the time series is compressed using strides of two. When the network has been trained, the model can be tested on new data, where a residual above a certain threshold deems the point X_t to be anomalous.

2) CHANNEL-WISE RESIDUALS COMPUTATION
Given the input X_t^train and its reconstruction X̂_t^train, we construct a residual measure as follows:

E_t = |X_t^train − X̂_t^train|.

Note that E_t is defined over the time window l, whereas we are interested in computing a residual measure e_t for the reading x_t at time t. We compute e_t by averaging E_t over a time window l_e as follows:

e_t = (1/l_e) k^T E_t,

where k is the smoothing kernel defined as

k = [0; 1].

Here 1 ∈ R^{l_e} is a vector of ones and 0 ∈ R^{l−l_e} is a vector of zeros. l_e represents the kernel width, and we have set it equal to 10.
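Under this reading of the (extraction-damaged) equations, the residual computation can be sketched as follows — the absolute-value form of E_t and the function name are our assumptions:

```python
import numpy as np

def channel_residual(X, X_hat, l_e=10):
    """Channel-wise residual e_t for the most recent reading.

    X, X_hat: input window and its reconstruction, shape (l, C).
    E_t is the element-wise absolute residual over the window, and
    e_t averages its last l_e rows, i.e. the averaged kernel is
    [0, ..., 0, 1, ..., 1] / l_e with l_e trailing ones.
    """
    E_t = np.abs(X - X_hat)                              # shape (l, C)
    l = E_t.shape[0]
    k = np.concatenate([np.zeros(l - l_e), np.ones(l_e)]) / l_e
    return k @ E_t                                       # shape (C,)
```

The trailing ones in the kernel mean that only the l_e most recent rows of the window contribute to e_t.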

3) SCORING FUNCTIONS
We will use the Gauss-D approach proposed in [6] as the baseline and show that our method, which computes channel-wise thresholds, improves upon the results achieved by the baseline. For the remainder of this discussion, we will refer to the Gauss-D approach as DCAE_a and to our method as DCAE_s.

4) GAUSS-D (DCAE_a)
We fit a Gaussian distribution N(μ_i, σ_i²) to the channel-wise residuals computed over a time window up to the current time t, and score each channel as

A_t^i = −log(1 − Φ((e_t^i − μ_i)/σ_i)),

where Φ is the CDF of N(0, 1). The aggregate score is then a_t = Σ_i A_t^i. An anomaly is defined if the score a_t surpasses a single threshold.
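A sketch of this aggregated scoring in Python (the tail-probability form of A_t^i follows our reconstruction of the damaged equation, so treat it as an assumption rather than the exact Gauss-D definition from [6]):

```python
import numpy as np
from math import erf, sqrt

def gauss_d_score(e_hist, e_t, eps=1e-12):
    """Aggregated Gauss-D style anomaly score for residual vector e_t.

    e_hist: past channel-wise residuals, shape (w, C), used to fit a
    per-channel Gaussian N(mu_i, sigma_i^2). Each channel is scored
    as A_i = -log(1 - Phi(z_i)) with z_i the standardized residual,
    and the aggregate score is the sum over channels.
    """
    mu = e_hist.mean(axis=0)
    sigma = e_hist.std(axis=0) + eps          # eps guards constant channels
    z = (e_t - mu) / sigma
    phi = np.array([0.5 * (1.0 + erf(zi / sqrt(2.0))) for zi in z])
    return float(np.sum(-np.log(np.clip(1.0 - phi, eps, None))))
```

A residual equal to the historical mean gives Φ(0) = 0.5 per channel, so the score grows only as residuals move into the upper tail.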

5) SEPARATED APPROACH
The suggested separated approach provides a score based on the evaluation of the residuals from each of the channels. Considering the thresholds τ = (τ_1, τ_2, ..., τ_C), the point x_t is determined to be anomalous if any e_t^i is above τ_i for i = 1, ..., C. Yet, as all τ_i are set separately, they differ in magnitude, which implies that a single value describing the anomaly score for a point x_t cannot be presented to the user directly. Therefore, e_t and τ are scaled with the same scalars so that all thresholds are normalized to 0.5, meaning that each threshold is mapped to a shared value while the relation between e_t^i and τ_i is kept:

ē_t = 0.5 ⊙ e_t ⊘ τ.

Here 0.5 ∈ R^C is a vector of 0.5, and ⊙ and ⊘ denote the Hadamard (element-wise) product and division, respectively. When this has been completed, it is possible to define the anomaly score a_t for x_t as the maximum scaled residual:

a_t = max_i ē_t^i.

Then, the predicted label y_t is defined as y_t = 1 if a_t > 0.5 and y_t = 0 otherwise.

FIGURE 2. Method overview. A model is trained using non-anomalous training data and is then used to generate channel-wise residuals. Together with labels and an optimization procedure, a combined score can be produced considering each individual channel.
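The normalization and max-scoring above can be sketched as follows (the function name is ours; the cap mirrors the display limit of 2.5 used later in the experiments):

```python
import numpy as np

def separated_score(e_t, tau, cap=2.5):
    """Separated anomaly score with thresholds normalized to 0.5.

    e_t: channel-wise residuals, shape (C,); tau: thresholds, shape (C,).
    Each residual is scaled so that its own threshold maps to 0.5
    (element-wise 0.5 * e_t / tau); the anomaly score a_t is the
    maximum scaled residual, and a point is anomalous if a_t > 0.5.
    """
    e_scaled = np.minimum(0.5 * e_t / tau, cap)
    a_t = float(e_scaled.max())
    y_t = int(a_t > 0.5)
    return a_t, y_t
```

Because each channel is scaled by its own threshold, a single channel exceeding its threshold is enough to flag the point, regardless of the other channels.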

D. PART 2: INFERENCE AND FINE-TUNING
Inference involves deciding whether a reading at time t is anomalous given the last w − 1 readings (training data is used when the last w test readings are not available). We use the following procedure to decide whether a reading x_t ∈ D_test is anomalous. First, construct X_t ⊂ D_test as discussed previously; recall that X_t ∈ R^{l×C}. Use the autoencoder to reconstruct X_t and follow the steps discussed above to compute e_t, the residual measure for the reading in question. Next, given {e_{t−(w−1)}, e_{t−(w−2)}, ..., e_t}, compute the μ_i and σ_i needed to compute the scores using the Gauss-D approach as described above.

1) UPDATING THRESHOLDS
In this paper, the thresholds that give the highest f_1 score have been used to evaluate the different methods. This is a procedure that has been used by previous research to benchmark a method against other approaches, including [8], [16] and [13]. For Gauss-D, the threshold was selected by simply testing the score for different values in a reasonable range and choosing the best one. For the separated approach, an optimization procedure, described in the next section, was applied.

2) OPTIMIZATION OF SEPARATE THRESHOLDS
Before the optimization begins, the initial thresholds τ_init are set to the maximum residual on the training data for each respective channel. Then, the optimization procedure is applied, which consists of two steps. In both of these, the optimization is carried out in a greedy fashion, where changes to each individual threshold are made iteratively until a condition is met. In the first step, the goal is to maximise the f_1 score with the optimized thresholds τ_optimized. This is achieved by optimizing τ_init based on the true labels L from the test dataset and the residuals for each of the separate channels, e, according to Equation (10). In the experiment, this was achieved by first decrementing the thresholds until the f_1 score stopped improving or the threshold became less than zero. Then, the thresholds were incremented until the f_1 score had not improved for a certain number of iterations or an overall iteration limit was reached. The change to each threshold was calculated based on the highest loss for each channel.
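A simplified sketch of the first, decrementing step of this greedy search (the per-channel loss heuristic used to size the updates is omitted; the fixed step size, the function names and the stopping rule are our assumptions):

```python
import numpy as np

def f1_score(pred, labels):
    """Point-wise f1 for boolean prediction and label arrays."""
    tp = int(np.sum(pred & labels))
    fp = int(np.sum(pred & ~labels))
    fn = int(np.sum(~pred & labels))
    return 2 * tp / max(2 * tp + fp + fn, 1)

def predict(e, tau):
    """A point is anomalous if any channel residual exceeds its threshold."""
    return (e > tau).any(axis=1)

def greedy_decrement(e, labels, tau_init, step=0.05):
    """Decrement each threshold while the f1 score does not degrade,
    stopping when it drops or the threshold would reach zero."""
    tau = tau_init.copy()
    for i in range(len(tau)):
        best = f1_score(predict(e, tau), labels)
        while tau[i] - step > 0:
            trial = tau.copy()
            trial[i] -= step
            score = f1_score(predict(e, trial), labels)
            if score < best:
                break                 # f1 dropped: keep previous threshold
            tau, best = trial, score
    return tau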
When the problem defined by Equation (10) has been solved, some thresholds might have been significantly incremented in the search for a global maximum without affecting the overall f 1 score. This means that new anomalies might be missed because of too high anomaly thresholds. Therefore, we need to minimize the difference between the initial thresholds τ init and the new thresholds τ without lowering the f 1 score based on τ optimized , which is described in Equation (11).

IV. EXPERIMENTS
A. DATASETS
In this paper, as can be seen in Table 1, five different benchmark datasets have been used for evaluation. These are Water Distribution (WADI) [25], Secure Water Treatment (SWaT) [26], Server Machine Dataset (SMD) [11], Mars Science Laboratory (MSL) [28] and Soil Moisture Active Passive (SMAP) [28].

1) WADI
WADI consists of 123 different dimensions that describe a water distribution process [25]. The data was collected during 16 days, where 14 were measured under normal conditions and 2 under attack scenarios. As there are different versions of the dataset, it is worth noting that the first version was used, as was done by other state-of-the-art methods such as [11] and [8].

3) SMD
SMD was presented by [11] and consists of 28 different entities divided between 3 different machines, all described using 38 dimensions of data. It was collected from an internet company during 5 weeks, and each of the entities has a training set as well as a test set containing different types of labeled faults.

4) MSL
Mars Science Laboratory (MSL), provided by [28], is a dataset from a spacecraft on a mission launched by NASA. It consists in total of 27 different entities with 55 dimensions. The first dimension represents a telemetry channel and the rest are one-hot encoded command signals.

5) SMAP
Soil Moisture Active Passive (SMAP) is also provided by [28] and is similar to MSL, but describes another spacecraft and mission launched by NASA. It consists of 55 entities with 25 dimensions. Just like MSL, the first signal represents the telemetry value and the rest are one-hot encoded commands.

B. ACCURACY METRICS
For the datasets mentioned above, two types of accuracy metrics are regularly used to compare the performance of different methods: the point-wise score and the point adjust score [8]. The point-wise score, in this paper denoted f_p1, is simply based on the precision and recall over all of the predicted and true points. As this can give a low accuracy even though all anomalous segments have been identified, point adjust, f_pa1, was suggested by [11]. The point adjust method counts all points within an anomalous segment as detected if any point in the segment is predicted anomalous, and treats all other points the same as the point-wise metric. The main problem with this approach is that it rewards identifying a long anomalous segment more than a short one. Because of this issue, [6] developed a new metric that to a greater extent resolves the issues with these approaches. They define it as the composite f_1 score, f_c1, displayed in Equation (12). For this metric, the recall is the overall segment recall R_e and the precision is the point-wise precision P_t, combined as f_c1 = 2 · P_t · R_e / (P_t + R_e).
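Under this definition, f_c1 can be sketched as follows (the segment handling is our implementation choice):

```python
import numpy as np

def composite_f1(pred, labels):
    """Composite f1 (f_c1): point-wise precision P_t combined with
    segment-wise recall R_e, where a ground-truth segment (a maximal
    run of anomalous labels) counts as recalled if at least one of
    its points is predicted anomalous."""
    pred = np.asarray(pred, dtype=bool)
    labels = np.asarray(labels, dtype=bool)
    tp = int(np.sum(pred & labels))
    fp = int(np.sum(pred & ~labels))
    P_t = tp / max(tp + fp, 1)
    # locate segment boundaries from the 0/1 label transitions
    edges = np.flatnonzero(np.diff(np.r_[0, labels.astype(int), 0]))
    starts, ends = edges[::2], edges[1::2]
    recalled = sum(pred[s:e].any() for s, e in zip(starts, ends))
    R_e = recalled / max(len(starts), 1)
    return 2 * P_t * R_e / max(P_t + R_e, 1e-12)
```

Unlike point adjust, detecting one point of a long segment earns the same recall credit as detecting one point of a short segment, while false positives are still penalized point-wise.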
While the most appropriate metric could differ from case to case, in this paper we argue that f_c1 is generally the most appropriate metric for anomaly detection in time series. All three metrics have been applied in this study. The point-wise and point adjust metrics were used to compare our method to other methods described in previous work. In experiments comparing the aggregated scoring method to the separated scoring method, the f_c1 score has been applied.

C. SETUP
In the experiments, a setup similar to previous work, particularly [1], [8] and [6], was used for a fair comparison. Firstly, both SWaT and WADI were down-sampled by a factor of 3 to reduce the training time. Then, the data was normalized using a min-max scaler for each individual channel, as shown in Equation (13). This was, however, not done for the datasets SMD, MSL and SMAP, which are already provided normalized.
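Since Equation (13) itself is not reproduced here, the following shows the standard per-channel min-max form (fitting the scaler on the training data and applying it to both sets is our assumption):

```python
import numpy as np

def minmax_normalize(train, test, eps=1e-9):
    """Per-channel min-max scaling fitted on the training data and
    applied to both sets; eps guards constant channels."""
    lo = train.min(axis=0)
    scale = np.maximum(train.max(axis=0) - lo, eps)
    return (train - lo) / scale, (test - lo) / scale
```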
Furthermore, to avoid extreme values and thus unnecessarily high residuals on the test data, similarly to [6], thresholds were set on the input data to cut it according to Equation (14). The purpose of doing this was to speed up the optimization procedure and values beyond these limits were shown to have insignificant effects on the overall performance in initial experiments.
Also, for a better visualization of the results, a max limit for the separated anomaly score was set to 2.5, which does not affect the selection of anomalous points. In terms of hyperparameters for the DCAE, no major optimization efforts were conducted, since the focus of this study has been the separated scoring method. This means that it is likely that higher accuracy can be achieved by optimizing hyperparameters such as the kernel size for each dataset. The window size l was set to 128 for the SWaT data, as this seemed to give the highest accuracy, and to 32 for the other datasets to reduce training time; no major reduction in accuracy could be observed by reducing l for these datasets. Furthermore, as was done in [6], the window size w in Gauss-D from Equation (5) was set to the same length as the training dataset. The batch size was set to 64, which generally seemed to give the best performance. For the training process, the maximum number of epochs was set to 70 for WADI and SWaT and to 100 for the rest. The standard deviation of the Gaussian noise was set, considering the dataset size and performance, to 0.2 for SWaT and WADI and 0.1 for SMD, MSL and SMAP. In addition, 20% of the training data was held out and used for early stopping during the training procedure. The deep learning methods compared in the experiment are USAD [8], OmniAnomaly [11], LSTM-VAE [10], DAGMM [12], MAD-GAN [9], THOC [17], GTA [16] and GDN [15]. For each of these, when possible, the result provided by the original paper was used. When this was not possible, due to for example a different threshold setting method or a lack of presented metrics, the result was retrieved from the experiments by [8] and [1].
In addition to the deep learning methods, the best performing statistical and traditional machine learning methods in the study by Audibert et al. [18] have also been used for comparison. These were implemented using the default setup provided by scikit-learn [29]. An exception was made for FastICA and PCA, where we tested different numbers of components and selected the one that gave the highest f_1 score, because, in contrast to the other methods, these showed high variations in accuracy depending on the number of components used. The methods selected were FastICA [21], [30], which is a version of ICA [20], PCA [19], [21], LOF [24], OCSVM [22], and IF [23]. These methods can be implemented so that a decision threshold applied to an anomaly score defines the anomaly prediction. To get the highest possible f_1 score, we used a similar approach as in [18], which is to normalize the anomaly score between 0 and 1 and select the threshold with the highest f_1 score.

V. RESULTS
A. COMPARISON WITH STATE-OF-THE-ART METHODS
Tables 2 and 3 present the results considering the threshold that gave the highest score for each method for SWaT and WADI with f_p1 and point adjust f_pa1, respectively. Table 4 shows the point adjust results for MSL, SMAP and SMD. As can be seen, the method suggested in this study has the highest score both with and without point adjust on SWaT and WADI. It also has the highest score on SMAP, the third highest on MSL and the highest score on SMD. Furthermore, its average accuracy with point adjust, displayed in Table 5, is significantly higher than that of the other methods. Interestingly, DCAE with the aggregated method performed surprisingly well in relation to more complex methods and received the second highest average score.

B. COMPARISON TO AGGREGATION
If we compare the results of the aggregated scoring method and the separated method, it is clear that, if optimized thresholds are used, the separated method is significantly superior. This is also true for the f_c1 score displayed in Table 6. In figures 4 and 5, the results from the evaluation of the SWaT dataset for the respective methods are displayed, and in figures 6 and 7, the same results are presented for WADI. As can be seen, it is much more difficult to separate the anomalies from non-anomalous points using the aggregated score than using the separated score. This results in the anomalies found by the separated method becoming more distinct, while the score from the aggregated method appears more random.

C. GENERALIZABILITY
To verify that we are not simply over-fitting the thresholds to the particular types of anomalies seen in the test set, the aggregated and separated scoring methods were optimized considering different sizes of the test set, simulating a more online-like scenario. The results presented in tables 7 and 8 are based on the performance on the part of the test set that was not part of the optimization, meaning for example that for optimization on 25% of the test data, the evaluation is based on the remaining 75%. As can be seen, the aggregation method is slightly better at generalizing with no or few examples for the SWaT dataset, yet once the optimization size increases, the separated approach is superior. It is also worth mentioning that the aspect that reduces the f_1 score for low optimization sizes is the precision, meaning that the separated method can still capture all the relevant anomalies, but with a higher false positive rate.

VI. DISCUSSION
Considering the main research question of this study, the results strongly suggest that using separate thresholds can significantly increase the accuracy of an anomaly detection method when some labels have been defined. This also means that a separated scoring method seems to be superior to an aggregated method when a dataset of labeled anomalies is available. Yet, to generalize this conclusion, the separated scoring method should be applied to more deep learning methods. In addition, some of the results indicate that the aggregated method could generalize better than separated thresholds. This makes sense, because using separated thresholds creates a more sensitive anomaly detection method, which is arguably both an advantage and a potential drawback of the suggested method. The advantage is that the method can be adjusted to identify anomalies visible only in one or a few signals, which can be difficult to achieve with aggregated methods. This means that a relatively high level of recall can be achieved regardless of whether the thresholds have been optimized. However, as the results have shown, the precision is expected to be fairly low without an appropriate method to adjust the thresholds based on test data. This indicates that the most appropriate scoring method might depend on the problem definition: for example, if no labels are available, an aggregation-based method might be most appropriate. Therefore, from a practical perspective, the results should act as a strong incentive for users to provide a small sample of labels, which can be used to optimize a separated scoring method, such as the one presented in this work, applied together with a deep learning model.
From a theoretical perspective, it is recommended that future work focuses on improving scoring methods based on the separation of channels residuals but also to develop effective methods that utilize expert knowledge in labeling that can be used together with a separated scoring method to make the anomaly detection capability improve over time in an online-based scenario.
Furthermore, the results support the conclusions in [6] about the significant impact of the scoring method. This implies that benchmarking a method against other state-of-the-art methods, considering both the algorithm and the scoring method as a system, can give inconclusive results regarding the potential of the scoring method or the algorithm individually. Therefore, if the task is to evaluate the potential of a new algorithm for anomaly detection, the appropriate way of evaluating it is to use a standardized scoring method that is also used by the approaches the algorithm is compared to.

VII. CONCLUSION
This study set out to examine to what extent an anomaly score based on residuals from individual channels could outperform an aggregated anomaly score for anomaly detection on multivariate time series. To evaluate this, two different anomaly detection methods were developed, both based on the same underlying algorithm, the DCAE. The aggregated score was based on previous state-of-the-art work, and the separated method was developed in this study. In addition to comparing the different approaches, other state-of-the-art methods were considered to get an understanding of how well the separated approach could perform. The evaluation on the different datasets showed that the separated approach significantly outperforms the aggregated approach. In addition, despite a relatively simple algorithm, the separated scoring outperformed all of the other considered state-of-the-art methods. There were also some indications that the aggregated method could be slightly better at generalizing when only few labels are available in the test set, which implies that an aggregated scoring method could be superior when no labels are available for threshold optimization. The results are an incentive for users to provide a small set of labeled points, as this has the potential of significantly increasing the accuracy using a separated scoring approach. Lastly, the results of this study, supported by [6], suggest that future studies concerning anomaly detection should consider separating the evaluation of the scoring method from that of the underlying algorithm, because the scoring method can have a substantial impact on the performance of the overall method, which for example can mean that a method that shows great performance is based on a suboptimal machine learning algorithm.
MATTIAS O'NILS received the B.S. degree in electrical engineering from Mid Sweden University, Sundsvall, Sweden, in 1993, and the Licentiate and Ph.D. degrees in electronic systems design from the Royal Institute of Technology, Stockholm, Sweden, in 1996 and 1999, respectively. He is currently a Professor with the Department of Electronics Design and leads the Research Group in Embedded IoT Systems, Mid Sweden University. His current research interests include design methods and implementation of embedded DNN-based systems, especially in the implementation of real-time video processing and time series processing systems. He has published five books as an editor and one as the author and over 300 peer-reviewed contributions in journals, books, and conference proceedings. He has given over 100 invited presentations at conferences, universities, and companies. His current research interests include systems on chips, self-aware cyber-physical systems, and embedded machine learning.