Robust Unsupervised Anomaly Detection with Variational Autoencoder in Multivariate Time Series Data

Accurate detection of anomalies in multivariate time series data has attracted much attention due to its importance in a wide range of applications. Since it is difficult to obtain accurately labeled data, many unsupervised anomaly detection algorithms for multivariate time series data have been developed. However, building such a system is challenging since it requires capturing temporal dependencies in each time series and must also encode the inter-correlations between different pairs of time series. To meet this challenge, we propose a Multi-Scale Convolutional Variational Autoencoder (MSCVAE) to detect anomalies in multivariate time series data. First, multi-scale attribute matrices are constructed from multivariate time series to characterize multiple levels of the system states at different time steps. Then, given the attribute matrices, a convolutional variational autoencoder is employed to generate reconstructed attribute matrices, and an attention-based ConvLSTM network is used to capture the temporal patterns. In addition, a new ERR-based threshold setting strategy is developed to optimize anomaly detection performance, instead of relying on the traditional ROC-based threshold setting strategy, which performs poorly on imbalanced datasets. Finally, the proposed framework is assessed by means of experiments on four datasets. The experimental results show that our proposed framework is superior to competing algorithms in terms of model performance and robustness, demonstrating that our model is effective in detecting anomalies in multivariate time series.


I. INTRODUCTION
In data mining, a time series [1] is a sequence of data points collected over time [2]. Such a sequence forms the basis of methods for tracking changes over time. Time series data can track changes over milliseconds, days, months, or even years, and plays an important role in virtually all areas of science, engineering, commerce, and industry. Since data points in a time series are collected at intervals, there is a relationship, whether proportional or not, between successive observations, which distinguishes time series data from other kinds of data. There are two types of time series data. One is the univariate time series, based on a single time-dependent variable (or dimension). The other is the multivariate time series, based on two or more time-dependent, interrelated variables (or dimensions).
In recent years, research on time series data has focused on detecting and analyzing anomalies [3][4][5][6]. Anomaly detection is the identification of unexpected data points, i.e., events, or items that differ significantly from what is expected. Three types of time series anomalies have been identified, namely, point anomalies, contextual anomalies, and collective anomalies [7]. Point anomalies are points that exist far outside the range of the entire data set. Contextual anomalies are values that deviate significantly from most of the data points in the same context. Collective anomalies occur when a subset of the data points deviates substantially from the entire dataset. Anomaly detection is critical in many real world applications, such as analysis of potentially fraudulent transactions, sensor network faults, abnormal equipment behavior, etc.
Great strides have been made recently in multivariate time series anomaly detection systems [8][9][10][11]. This progress is also reflected in many areas of application [12][13][14][15][16]. Multivariate time series anomaly detection refers to anomaly detection of time series data with multiple sequences. The occurrence of anomalies in multivariate time series data typically involves multiple features. Sequential analysis of individual features cannot accurately locate all the anomalies, because several variables have to be examined simultaneously when analyzing segments of data. Moreover, encoding the inter-correlations between different pairs of time series also needs to be considered. Clearly, detecting anomalous parts of multivariate time series is a challenging problem.
Machine learning techniques are increasingly being adopted to detect anomalies because they can capture different characteristics of time series and detect anomalies effectively. Various anomaly detection methods for multivariate time series data have been developed. One such method employs unsupervised anomaly detection based on a convolutional encoder-decoder with an attention-based Convolutional Long-Short Term Memory (ConvLSTM) network to detect and diagnose anomalies from signature matrices [17]. Liang et al. [18] make use of the multi-channel signature matrix transform to capture hidden non-linear anomalies. They also employ a forgetting mechanism to assign different weights, and use a novel threshold strategy to optimize performance. Another method uses a fully convolutional adversarial autoencoder to detect anomalies in video scenes and localizations [19]. Yet another method applies online unsupervised anomaly detection to industrial robot performance; this approach is based on a variational autoencoder and a convolutional neural network with a sliding window to better recognize normal patterns of data for real-time anomaly detection [20]. Despite considerable progress, VAE-based anomaly detection for imbalanced data has not received much attention.
In this paper, we propose an unsupervised anomaly detection method, i.e., Multi-Scale Convolutional Variational Autoencoder (MSCVAE), to detect anomalies in multivariate time series data. This model incorporates a new threshold setting strategy designed to fill a gap in existing methods.
The main contributions of our proposed framework are as follows.
• The novel MSCVAE framework is designed to detect anomalies in multivariate time series data. MSCVAE constructs multi-scale attribute matrices to characterize multiple levels of the system states across different time steps and then uses a convolutional variational autoencoder to extract the characteristics of the time series input. Specifically, we use an attention-based Convolutional Long-Short Term Memory (ConvLSTM) network to capture the temporal patterns and also to reconstruct the attribute matrices.
• We propose a novel threshold setting strategy based on a confusion matrix to optimize threshold selection for anomaly detection, which helps to improve model robustness under conditions of imbalance between normal and abnormal data in multivariate time series.
• Experiments have been conducted on four datasets in order to verify the effectiveness of the proposed framework and the new threshold setting strategy. The results demonstrate that our method is superior to competing models in terms of anomaly detection performance and robustness under different ratios of imbalanced datasets.
The paper is structured as follows. Section II analyzes related work. Section III describes the proposed methodology framework. The experiments are detailed in Section IV. Finally, conclusions are presented in Section V.

II. RELATED WORK
Anomaly detection is challenging, and many approaches have been taken in various applications. In past years, many classical unsupervised approaches have been developed [21][22][23][24][25][26], including Principal Component Analysis (PCA) [27], which finds a low-dimensional projection that captures most of the variance in the data. The anomaly score is the reconstruction error of this projection. It is a linear algebra technique that can automatically achieve dimension reduction. Lee et al. [28] proposed online over-sampling PCA which makes use of online platforms for large-scale problems. By over-sampling the minority class of the target instance, their proposed algorithm allows them to determine the anomaly of the target instance.
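The PCA-based scoring described above can be sketched in a few lines: project onto a low-dimensional subspace, reconstruct, and use the reconstruction error as the anomaly score. The synthetic data, component count, and train/test split are illustrative assumptions, not details from [27] or [28].

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 10))      # normal samples: 200 points, 10 features

pca = PCA(n_components=3).fit(X_train)    # low-dimensional projection

X_test = rng.normal(size=(5, 10))
X_test[0] += 8.0                          # inject one obvious outlier

rec = pca.inverse_transform(pca.transform(X_test))   # project, then reconstruct
scores = np.square(X_test - rec).sum(axis=1)         # per-sample reconstruction error

assert scores.argmax() == 0               # the injected outlier scores highest
```

The outlier's offset is largely orthogonal to the learned subspace, so it cannot be reconstructed well and receives the highest score.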
One of the latest techniques for dimensionality reduction is the Autoencoder [29], which is a popular approach for anomaly detection. An autoencoder consists of an encoder and a decoder, which together reconstruct data samples and use the reconstruction error as the anomaly score [30]. Zhou et al. [31] proposed a deep autoencoder that combines robust PCA and deep autoencoders. It splits data into two parts: one part can be reconstructed by autoencoders; the other is the noise (outliers) in the data. The Deep Autoencoding Gaussian Mixture Model (DAGMM) [32] jointly considers a deep autoencoder and a Gaussian Mixture Model to model the density distribution of multi-dimensional data.
Recently, Generative Adversarial Networks (GANs) [33] and LSTM-based approaches [34] have also shown promising performance for multivariate anomaly detection [35,36]. Multivariate Anomaly Detection for Time Series Data with Generative Adversarial Networks (MAD-GAN) [37] is an unsupervised anomaly detection method based on GANs that considers complex dependencies among different time series variables. The LSTM-based Encoder-Decoder [38] models temporal dependency in time series by means of LSTM networks and achieves better generalization capability than traditional methods. OmniAnomaly [39] is a stochastic recurrent neural network designed to avoid potential misguiding by uncertain instances; it uses stochastic variable connection and normalizing flow to obtain reconstruction probabilities to determine anomalies. Ryota et al. [40] introduced a convolutional neural network and an environment-dependent anomaly detector to detect objects, their attributes, and actions in images. An environment-specific model can identify unusual attributes likely to explain abnormal patterns. Hu et al. [41] proposed a time series anomaly detection technique using six meta-features. This technique is a One-class Support Vector Machine (OC-SVM) system designed to identify the abnormal states of a univariate or multivariate time series based on local dynamics. Recently, a novel computational approach, namely Local Recurrence Rate based Discord Search (LRRDS), was proposed to identify discords from multivariate time series. This approach reduces the dimensionality of a time series and can detect variable-length discords using the given time series as the normality reference [42].
U-Net [43] shares some of the design characteristics of the system architecture described in this paper. It is a fully convolutional neural network incorporating skip connections between encoding and decoding layers. This approach allows for integrating high- and low-level features in a way that prevents information loss and reduces the depth of sequential layers. However, U-Net is limited in that the rate of learning may be diminished in the middle layers when the network is very deep. This means the system is at risk of ignoring layers with abstract features, thus limiting extraction of some of the complex features that could help image segmentation in medical images. Moreover, U-Net requires rather substantial training time because of its large number of hyperparameters. By contrast, our system uses a more effective method consisting of the multi-scale MSCVAE model with an attention-based ConvLSTM network. This method allows for capturing salient features and key patterns to assist with anomaly detection in multivariate time series data. Some Variational Autoencoders (VAEs) [44][45][46][47][48] have taken a probabilistic approach, and autoencoders have been combined with Gaussian mixture modeling [19]. Bayer and Osendorfer [49] used variational inference to learn the underlying distribution of sequences and introduced stochastic recurrent networks. The core of these models is an RNN extended with a latent variable. Sölch et al. [50] used a Stochastic Recurrent Network (STORN) to detect robot anomalies using unimodal signals. Park et al. [51] presented a combination, LSTM-VAE, using multimodal sensory signals, where an LSTM replaces the feed-forward network in the VAE. Pereira et al. [52] proposed applying a self-attention mechanism to the VAE to improve the encoding-decoding process for energy data.
Despite the intrinsic unsupervised setting, most of these methods may still fail to detect anomalies effectively, since they cannot capture temporal dependencies across different time steps in multivariate time series data.
Our goal is to develop specific approaches for multivariate time series data, which take account of the correlation between the series. Multivariate time series data usually contain noise and mask the true anomalies. Also, we need to consider the characteristics of the data itself in addition to the correlation between series. Combining the advantages of the above models, the MSCVAE model we propose can not only capture the time pattern effectively but also has the ability to deal with noise. Although these methods use a variety of technologies and operate in different ways, common to all is the need to have indicators to judge anomalies.
In anomaly detection problems, thresholding (i.e., setting a threshold) is an effective strategy for evaluating anomaly scores. The scores output by the model are separated into outliers and normal data using the threshold. Most systems select the threshold with the aid of the receiver operating characteristic (ROC) curve. The ROC curve is a plot in a unit square of the true positive rate (TPR) versus the false positive rate (FPR). Using the ROC curve, the threshold that balances sensitivity and specificity can be identified. However, the imbalanced class distribution of a dataset poses a problem because most unsupervised learning techniques are designed for balanced class distributions [53]. To solve this problem, we propose a new threshold setting strategy based on a confusion matrix, thus selecting a better threshold and improving model performance and robustness.

III. OVERVIEW OF THE METHODOLOGY
In this section, we first introduce the problem to be tackled in section III-A and then explain how attribute matrices are generated in section III-B. Next, we describe the proposed Multi-Scale Convolutional Variational Autoencoder (MSCVAE) in detail in sections III-C through III-G. Specifically, we first show how to generate multi-scale system attribute matrices. Then, we encode the spatial information in the attribute matrices via a convolutional variational encoder and model the temporal information via an attention-based ConvLSTM. We reconstruct the attribute matrices by means of a convolutional variational decoder. Finally, we detail the threshold setting strategy in section III-H. Figure 1 presents an overview of the MSCVAE framework. The terminology and notation used in the paper are summarized in Table 1.

A. PROBLEM STATEMENT
In this work, we focus on multivariate time series. We are given the historical data of n time series X = (x_1, x_2, ..., x_n)^T of length T, and assume that there are anomalies in the data. We aim to detect anomalous events at certain time steps after T. We use only the normal dataset for training, to characterize the various time series patterns under normal conditions. For validation, we also use only normal data. Both normal and abnormal data are used for testing.

B. ATTRIBUTE MATRIX GENERATION
Given the importance of correlations between different pairs of time series for characterizing the system state [54], we generate an n × n attribute matrix M^t using the pairwise inner products of the multivariate time series within a segment. This matrix is designed to illustrate the inter-correlations between different pairs of time series in a multivariate time series segment from time t − w to t. We adopt the method for calculating the attribute matrix proposed in [17], which captures the similarity of shape and the value-scale correlations between two time series. Examples of attribute matrices are shown in Figure 1, part A. The pseudocode of the algorithm for generating the attribute matrix is introduced in Table 2.
For a multivariate time series segment X^w from time t − w to t, the correlation m_ij^t between two time series x_i and x_j can be computed as:

m_ij^t = ( Σ_{δ=0}^{w} x_i^{t−δ} · x_j^{t−δ} ) / w    (1)

where w is the sliding window size, assigned the value 10 at each time step. The interval between two segments is also set to 10.
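The attribute-matrix construction described above can be sketched as follows: for each pair of series, the windowed inner product gives one entry of the n × n matrix M^t. The window size w = 10 follows the text; the normalization by w follows the signature-matrix formulation of [17]; variable names are ours.

```python
import numpy as np

def attribute_matrix(X, t, w=10):
    """X: array of shape (n_series, T). Returns the n x n attribute matrix
    for the segment from time t - w to t."""
    seg = X[:, t - w:t]          # (n, w) segment ending at time t
    return seg @ seg.T / w       # all pairwise inner products, scaled by w

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 100))    # 4 series, 100 time steps
M = attribute_matrix(X, t=50)

assert M.shape == (4, 4)
assert np.allclose(M, M.T)       # inner products make the matrix symmetric
```

The single matrix multiply computes all n² windowed inner products at once; stacking matrices for several window sizes yields the multi-scale input.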

C. VARIATIONAL AUTOENCODER (VAE)
Autoencoder is a typical neural network that transforms inputs into outputs with the minimum possible error. It consists of three sequential layers: the encoder layer, the latent variable layer, and the decoder layer. The encoder encodes, or compresses, the input data x ∈ R^{d_x} into a latent variable z ∈ R^{d_z}, while the decoder decodes, or reconstructs, the encoded data (latent variable) z back to the input dimension. The objective of the autoencoder is to obtain a reconstruction from the latent variable that is close to the original input data. A disadvantage of the autoencoder is that its latent space may not be continuous, which poses serious challenges. Variational Autoencoder (VAE) is a type of autoencoder designed to address this issue.
VAE is an unsupervised learning network for dimension reduction, which can learn to represent complex data without supervision using a deep neural network. VAE relies on probability distributions of observations in latent variables and makes strong assumptions about the distribution of latent variables. Therefore, instead of generating an encoder that outputs a single value to describe each latent variable, it creates an encoder that operates with a probability distribution for each latent variable.
VAE has three main components, namely, the encoder, the decoder, and the loss function. The input, denoted by x, is generated from a latent variable z. VAE defines the latent variable z as a random variable distributed according to a prior p(z), defined as a multivariate unit Gaussian distribution N(0, I). The generative process samples data x from the conditional likelihood distribution p(x|z). The decoder p(x|z) thus functions as a generator model corresponding to the distribution of the decoded variable given the encoded one, while the encoder is defined by the conditional probability p(z|x), i.e., the distribution of the encoded variable given the decoded one.
The loss function of VAE has two terms: the first is a reconstruction term, which maximizes the reconstruction likelihood; the second is a regularization term, which ensures that the learned distribution q(z|x) is close to the true prior distribution p(z). The loss function is given by Eq. 2:

L(x) = −E_{q(z|x)}[ log p(x|z) ] + KL( q(z|x) || p(z) )    (2)
The regularization term of VAE is the Kullback-Leibler divergence, which measures the difference between the encoder's distribution q(z|x) and the prior p(z), and constitutes a measure of how close q is to p, as described in Figure 1, part B.(a).
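For a diagonal Gaussian encoder against the N(0, I) prior, the KL term above has a closed form, so the two-term loss can be sketched in a few lines of numpy. Using the squared error as the reconstruction term is an illustrative assumption standing in for the negative log-likelihood.

```python
import numpy as np

def vae_loss(x, x_hat, mu, log_var):
    """Two-term VAE loss: reconstruction error (squared error stands in for
    the negative log-likelihood) plus the closed-form KL(q(z|x) || N(0, I))."""
    rec = np.square(x - x_hat).sum()
    kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))
    return rec + kl

# When the encoder already matches the prior (mu = 0, log_var = 0) and the
# reconstruction is perfect, both terms vanish and the loss is zero.
x = np.ones(3)
assert vae_loss(x, x, np.zeros(2), np.zeros(2)) == 0.0
```

A larger |mu| or a variance far from 1 inflates the KL term, which is exactly the regularization pressure toward the prior described above.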

D. CONVOLUTIONAL ENCODER
We use convolutional encoders [55] to encode the spatial patterns of the system attribute matrices. Four convolutional encoder layers are applied to extract features from the attribute matrices in our framework, where P^{t,0} denotes the input of the first layer.

E. ATTENTION-BASED CONVLSTM
To capture the temporal patterns across time steps, an attention-based ConvLSTM network is applied to the feature maps produced by the encoder. Taking the input gate as an example:

i^{t,l} = σ( W_pi^l ∗ P^{t,l} + W_hi^l ∗ H^{t−1,l} + W_ci^l ∘ C^{t−1,l} + b_i^l )

where σ is the sigmoid function, W_pi^l is the filter kernel applied to the input of layer l, W_hi^l is the filter kernel applied to the hidden state of the previous time step, and W_ci^l is the filter kernel applied to the cell state of the previous time step in the input gate. The inputs, hidden states, cell states, and gate outputs are all 3D tensors. The symbol "∗" denotes the convolution operator, and "∘" denotes the Hadamard product. Figure 1, part B.(c) illustrates the temporal modeling procedure.
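The input-gate computation in this section can be sketched for a single channel: convolutions on the current input and the previous hidden state, plus a Hadamard (peephole) term on the previous cell state. Kernel sizes, 'same' padding, and the 2D shapes are illustrative assumptions.

```python
import numpy as np
from scipy.signal import convolve2d

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def input_gate(X_t, H_prev, C_prev, W_xi, W_hi, W_ci, b_i):
    """ConvLSTM input gate: convolutions ("*") on the input and previous
    hidden state, Hadamard product on the previous cell state."""
    pre = (convolve2d(X_t, W_xi, mode="same")
           + convolve2d(H_prev, W_hi, mode="same")
           + W_ci * C_prev      # elementwise (Hadamard), not a convolution
           + b_i)
    return sigmoid(pre)

rng = np.random.default_rng(0)
shape, kernel = (8, 8), (3, 3)
i_t = input_gate(rng.normal(size=shape), rng.normal(size=shape),
                 rng.normal(size=shape), rng.normal(size=kernel),
                 rng.normal(size=kernel), rng.normal(size=shape), 0.1)

assert i_t.shape == shape             # gate map keeps the input spatial size
assert ((i_t > 0) & (i_t < 1)).all()  # sigmoid keeps gate values in (0, 1)
```

The forget and output gates follow the same pattern with their own kernels; multi-channel inputs simply add a sum over channels.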

F. CONVOLUTIONAL DECODER
We can reconstruct the input data from the extracted features by means of the decoder. For the convolutional decoder, we follow the procedure discussed in [17], where ⊛ denotes the deconvolution operation, ⊕ denotes the concatenation operation, and f(·) is the activation function (the same as in the encoder).

G. LOSS FUNCTION
The loss is the reconstruction error between the input attribute matrices P^{t,0} ∈ R^{n×n} and their reconstructions. We employ the mini-batch stochastic gradient descent method and the Adam optimizer to minimize this loss. After sufficient training epochs, the learned neural network parameters are utilized to infer the reconstructed attribute matrices of the validation and test data. Finally, we perform anomaly detection and diagnosis based on the residual attribute matrices, as elaborated in the next section.
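Scoring from residual matrices can be sketched as follows. Counting residual entries above a cutoff τ follows the residual-matrix scoring used in [17]; the exact rule and the value of τ here are our assumptions, not spelled out in the text.

```python
import numpy as np

def anomaly_score(M, M_hat, tau=1.0):
    """Score a time step by the number of attribute-matrix entries whose
    reconstruction residual exceeds the cutoff tau."""
    residual = np.abs(M - M_hat)
    return int((residual > tau).sum())

M = np.zeros((4, 4))
M_hat = np.zeros((4, 4))
M_hat[0, 1] = 3.0                    # one badly reconstructed correlation

assert anomaly_score(M, M_hat) == 1  # exactly one entry exceeds tau
```

The row and column indices of the offending residual entries also identify which pair of series drives the anomaly, which supports diagnosis.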

H. THRESHOLD SETTING STRATEGY
VAEs map each group of attribute matrices to anomaly scores during the anomaly detection process. The threshold setting aims to give the best boundary to distinguish normal and abnormal samples, and then detected samples are labeled normal or abnormal by means of VAEs.
The confusion matrix is commonly used to show how each test value predicts classes compared to their actual classes including True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN).
With the ROC-based threshold setting strategy [57], the best threshold is sought as the point on the curve closest to the upper left corner, i.e., the point (0, 1). However, the ROC-based strategy is insensitive to imbalanced datasets, leading to poor anomaly detection performance. According to the false positive rate formula, when a large number of normal samples are wrongly judged as anomalies, the false positive rate (FPR) changes only a little. Therefore, anomaly detection performance may be widely divergent for two threshold values that are close together, and it is not easy to select an optimal threshold among different discrete points. Moreover, Eq. 8 is often adopted to calculate the distance between a point on the ROC curve and the point (0, 1):

d = sqrt( (1 − TPR)² + FPR² )    (8)

The threshold corresponding to the minimum distance is selected as the best threshold. However, since TN is much larger than TP, the FPR term remains small regardless of the threshold. Therefore, when choosing the minimum distance, the ROC-based strategy focuses more on the TPR term, merely choosing the smallest FN possible. Eventually, this leads to a low F1-Score.
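A minimal sketch of the ROC-based selection just described, using scikit-learn's `roc_curve` and the distance to the corner (0, 1); the synthetic imbalanced scores are an illustrative assumption.

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0, 1, 950),   # anomaly scores of normal samples
                         rng.normal(4, 1, 50)])   # scores of true anomalies (5% rate)
labels = np.concatenate([np.zeros(950), np.ones(50)])

fpr, tpr, thresholds = roc_curve(labels, scores)
dist = np.sqrt((1.0 - tpr) ** 2 + fpr ** 2)       # Eq. 8: distance to (0, 1)
best = thresholds[dist.argmin()]                  # ROC-selected threshold
```

With 950 normal samples, even dozens of false positives barely move FPR, which is precisely the insensitivity the text criticizes.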
We adopt a threshold setting strategy using a confusion matrix to avoid the problems described above. The confusion matrix can effectively reflect the anomaly detection results, even for an imbalanced dataset. Therefore, we introduce a new error rate (ERR) defined as a function of TP, FP, FN, and TN, as shown in the formula below. The aim of the new threshold setting strategy is to minimize ERR, which means fewer samples are misjudged. As a result, the optimal threshold can be selected according to the minimum ERR.

ERR = (FP + FN) / (TP + FP + FN + TN)

As indicated above, Precision, Recall, F1-Score, and ERR are utilized as evaluation metrics for anomaly detection. Moreover, the geometric mean (G-mean) [58] of sensitivity and specificity is suitable for evaluating the quality of binary (two-class) classification on balanced as well as imbalanced datasets. Therefore, G-mean is used as another evaluation metric in the following experiments.
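Taking ERR as the fraction of misjudged samples, (FP + FN) / (TP + FP + FN + TN), consistent with the text's aim of minimizing misjudged samples, the selection procedure can be sketched as follows; the data and names are illustrative.

```python
import numpy as np

def err_threshold(scores, labels):
    """Scan candidate thresholds and keep the one minimizing ERR."""
    best_thr, best_err = None, np.inf
    for thr in np.unique(scores):
        pred = scores >= thr                       # flag as anomaly
        tp = int(((pred == 1) & (labels == 1)).sum())
        fp = int(((pred == 1) & (labels == 0)).sum())
        fn = int(((pred == 0) & (labels == 1)).sum())
        tn = int(((pred == 0) & (labels == 0)).sum())
        err = (fp + fn) / (tp + fp + fn + tn)      # fraction misjudged
        if err < best_err:
            best_thr, best_err = thr, err
    return best_thr, best_err

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0, 1, 95), rng.normal(4, 1, 5)])
labels = np.concatenate([np.zeros(95), np.ones(5)])

thr, err = err_threshold(scores, labels)
sens = (scores[labels == 1] >= thr).mean()         # sensitivity at that threshold
spec = (scores[labels == 0] < thr).mean()          # specificity
g_mean = np.sqrt(sens * spec)                      # G-mean evaluation metric
```

Because ERR counts FP and FN symmetrically over all samples, a threshold that trades a few extra FN for many fewer FP is preferred, unlike the ROC-distance rule.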

IV. EXPERIMENTS

A. DATASETS DESCRIPTION
To illustrate the performance of the proposed method, we have carried out extensive experiments on four publicly available benchmark datasets, namely, Satellite, Wafer (UCR), EEG, and Opt; the details of the experiments are listed in Table 3. The Wafer dataset can be downloaded from http://www.cs.ucr.edu/∼eamonn/discords/, and the other three datasets are from the UCI public database. Details of the datasets are given below.
Satellite: This dataset was generated from data purchased from NASA by the Australian Centre for Remote Sensing. It consists of the multi-spectral values of pixels in 3×3 neighborhoods of a satellite image, together with the classification associated with the central pixel in each neighborhood. The dataset is used to predict the category of the image of the observed region of soil. The soil classes "red soil", "grey soil", "mixture class", and "very damp grey soil" constitute the normal class, while the anomaly class consists of "cotton crop", "damp grey soil", and "soil with vegetation stubble". The classification is based on multi-spectral values; the dataset consists of 36 time series and 6435 instances.
Wafer: The Wafer dataset is related to semiconductor microelectronics fabrication, using data collected from various sensors during the processing of silicon wafers for semiconductor fabrication. Each time series in this dataset contains measurements recorded by one sensor in the course of processing one wafer by one tool. The dataset contains 152 attributes, and there is a large class imbalance between normal and anomaly.
EEG: This dataset is from one continuous EEG measurement with the Emotiv EEG Neuroheadset. The duration of the measurement was 117 seconds. The eye state was detected via a camera during the EEG measurement and later added manually to the file after analyzing the video frames. This dataset consists of 14 EEG attribute values and one indicating the eye state.
Opt: The Opt dataset is based on preprocessing programs made available by NIST to extract normalized bitmaps of handwritten digits from a preprinted form, and contains 64 parameters. It is a character recognition dataset for the integers 0-9. The 32×32 bitmaps are divided into non-overlapping blocks of 4×4 each, which generates an 8×8 input matrix. The instances of digits 1-9 are treated as inliers, whereas the instances of the digit 0 are treated as outliers.
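The bitmap reduction described for Opt can be sketched with numpy's reshape trick; the random bitmap is an illustrative stand-in for a real NIST digit, and summing "on" pixels per block follows the NIST preprocessing described above.

```python
import numpy as np

bitmap = np.random.default_rng(0).integers(0, 2, size=(32, 32))  # binary 32x32

# Split each axis into 8 blocks of 4 pixels, then sum within each 4x4 block.
blocks = bitmap.reshape(8, 4, 8, 4)
reduced = blocks.sum(axis=(1, 3))     # one on-pixel count per block

assert reduced.shape == (8, 8)                       # the 8x8 input matrix
assert 0 <= reduced.min() and reduced.max() <= 16    # each block has 16 pixels
```

Each of the 64 resulting counts corresponds to one of the dataset's 64 parameters.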

B. EXPERIMENTAL SETUP
Our experiments use the open-source machine learning library Scikit-learn and the deep learning frameworks Torch, Keras, and TensorFlow to develop the baseline models and the proposed framework. Table 4 presents the details of all the algorithms.

C. PERFORMANCE EVALUATION
All anomaly detection models in the experiments were trained on the corresponding training subsets, consisting of normal samples. The models were then validated on subsets consisting of normal samples and evaluated on testing subsets containing both normal and abnormal data. The model evaluation metrics, i.e., precision, recall, F1-Score, and G-mean, are in the range of 0 to 1. Higher precision, recall, F1-Score, and G-mean, and a lower ERR, indicate better model performance. Table 5, Table 6, Table 7, and Table 8 show the confusion matrix elements and evaluation metrics of the anomaly detection models on the four datasets. The bold fonts in Table 5, Table 6, Table 7, and Table 8 indicate where MSCVAE is superior to the other models.
For the comparison experiments, 10 classical anomaly detection algorithms and 7 deep architecture models were implemented. Generally speaking, the ability of classical anomaly detection algorithms is limited when facing modeling issues for multivariate time series, and this conclusion is supported by the results for the two comprehensive indexes, F1-Score and G-mean. DAGMM reduces dimensionality and extracts features using neural networks, improving slightly over GMM. GAN achieves similar results to DAGMM on the Satellite dataset (shown in Table 5) but performs well on the other three datasets. MAD-GAN attempts to map data into the latent space and detect anomalies via the discriminant results and reconstruction errors generated from the mapping process; it achieves good results on the Opt dataset and gives slightly better results than OmniAnomaly. LSTM-AE performs better on the datasets reported in Tables 6 and 8. Although Autoencoder is an excellent sequence-to-sequence model, it struggles with temporal data here, while performing well on the Opt dataset as shown in Table 8. VAEs achieve better results than Autoencoder on the four datasets, although high-dimensional datasets remain a challenge for the training of VAEs.
MSCVAE is superior to the other algorithms discussed above, except on the EEG dataset, where a competing model gives slightly better recall, F1-Score, and G-mean values than our proposed framework, and on the Opt dataset, where a competing model gives slightly better recall values. Nevertheless, it is clear that our proposed framework can effectively improve anomaly detection performance on multivariate time series. In other words, MSCVAE is much better than the baseline methods, as it can handle both inter-sensor correlations and temporal patterns of multivariate time series effectively.
The reasons for the superior performance of our framework can be summarized as follows: (1) As a deep learning model with temporal patterns, MSCVAE can achieve good performance with no need for supervised training. (2) VAEs with an attention-based ConvLSTM network framework can effectively identify anomalies. (3) Our framework can model both inter-sensor correlations and temporal patterns of multivariate time series effectively.

D. ABLATION STUDY
We have conducted an extensive study to illustrate the impact of different components on the model results, using the two key modules of our model: the multi-scale attribute matrices and the attention-based ConvLSTM. Three variants of MSCVAE are considered in the evaluation:
• MSCVAEw: MSCVAE framework without the attention-based ConvLSTM.
• CVAEa: MSCVAE framework without the multi-scale attribute matrices.
• CVAEw: MSCVAE framework without both the multi-scale attribute matrices and the attention-based ConvLSTM.
F1-Scores and G-means on the four datasets are reported in Fig. 2. We observe that our proposed framework, MSCVAE (marked in blue), is clearly superior to the three competing variants on anomaly detection tasks, which indicates the importance of the multi-scale attribute matrices and the attention-based ConvLSTM. However, the MSCVAEw (marked in pink) and CVAEw (marked in yellow) methods obtain approximately equal results on all four datasets. Moreover, the results of CVAEa (marked in green) are much better than those of MSCVAEw and CVAEw, which indicates the importance of the attention-based ConvLSTM. Overall, we see that under all conditions, both the multi-scale attribute matrices and the attention-based ConvLSTM can effectively improve the anomaly detection performance on multivariate time series.

E. ROBUSTNESS EVALUATION
Anomaly detection often suffers from the dataset imbalance problem: there are far more normal samples than anomalous ones. To examine this problem, we used two criteria, F1-Score and G-mean, with different rates of anomalies to evaluate our model's robustness. Fig. 3 shows the F1-Score and G-mean comparisons under different rates of anomalies. The comparison results indicate that the F1-Score and G-mean of most algorithms tend to increase as the anomaly rate increases. Our proposed framework's F1-Score and G-mean values are the highest on all datasets under the different anomaly rates. Therefore, we are justified in concluding that our proposed framework has good robustness even in the face of dataset imbalance problems.
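Test sets with different anomaly rates, as used in the comparison above, can be built by keeping all normal samples and subsampling the anomalies to hit a target rate. The subsampling scheme and names are our illustrative assumptions, since the text does not specify how the rates were produced.

```python
import numpy as np

def subsample_to_rate(normal, anomalies, rate, seed=0):
    """Return (test set, number of anomalies kept) so that the fraction of
    anomalies in the test set is approximately `rate`."""
    n_anom = int(round(rate * len(normal) / (1.0 - rate)))
    idx = np.random.default_rng(seed).choice(
        len(anomalies), size=min(n_anom, len(anomalies)), replace=False)
    return np.concatenate([normal, anomalies[idx]]), len(idx)

normal = np.zeros(900)        # stand-ins for normal samples
anomalies = np.ones(100)      # stand-ins for anomalous samples

test, k = subsample_to_rate(normal, anomalies, rate=0.05)
assert abs(k / len(test) - 0.05) < 0.01   # roughly 5% anomalies
```

Sweeping `rate` and recomputing F1-Score and G-mean at each setting reproduces the kind of robustness curves shown in Fig. 3.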

F. THRESHOLD SETTING STRATEGY COMPARISON
In order to verify the effectiveness of the proposed threshold setting strategy, we compared the two threshold setting strategies on the four datasets, as shown in Table 9, Table 10, Table 11, and Table 12, which list the top three thresholds for each dataset. The rank indicates the suitability of the threshold, so the highest rank corresponds to the best threshold. As explained in formulas 14 and 15, the threshold with the lower distance or the lower ERR is ranked higher. As stated, the ROC-based strategy pays more attention to FN and ignores FP, so it selects the best threshold with a lower FN. In contrast, the ERR-based strategy also accounts for FP, and its selected threshold achieves similarly good FP performance. Consider the experimental results in Table 9. The first threshold of the ERR-based strategy has 4 FP and 6 FN, and the corresponding F1-Score and G-mean are 0.9606 and 0.9585. However, the first threshold of the ROC-based strategy has 28 FP and 11 FN, and the corresponding F1-Score and G-mean are just 0.8632 and 0.8204. The ROC-based strategy fails to achieve the best threshold because of its excessive attention to FN.
Further investigation was conducted to analyze the same proposed framework using the two threshold setting strategies under different anomaly rates, shown in Fig. 4. The ERR-based strategy achieves a better threshold than the ROC-based strategy under a low anomaly rate on four datasets. Besides, it is clear that the difference between the two strategies decreases as the anomaly rate increases. In short, the proposed threshold setting strategy is superior to the traditional ROC-based strategy and can achieve an appropriate threshold even when confronted with a dataset imbalance problem.

V. CONCLUSION
In this paper, we proposed a novel MSCVAE framework to solve the anomaly detection problem for multivariate time series data. The framework transforms multivariate time series into multi-scale (multi-resolution) system attribute matrices. This approach allows for characterizing the state of the entire system over different time segments, and adopts the convolutional variational autoencoder to generate reconstructed attribute matrices, which makes our framework more robust by taking advantage of VAEs. An attention-based Convolutional Long-Short Term Memory (ConvLSTM) network is used to capture the temporal patterns. The framework can model both inter-sensor correlations and temporal dependencies of multivariate time series. Finally, a new ERR-based threshold setting strategy is adopted, instead of the traditional ROC-based threshold setting strategy, to achieve better model performance. To verify the effectiveness of the proposed framework, experiments on four datasets were implemented. The results demonstrate that MSCVAE can outperform state-of-the-art baseline methods. The work reported here justifies the following conclusions.
(1) Multi-scale attribute matrices provide an effective preprocessing method for characterizing system states over different time segments of multivariate time series, with no need for prior knowledge.
(2) A CNN structure is embedded in the encoder and decoder of the VAE model, which is adopted to extract the characteristics of the time series. An attention-based Convolutional Long-Short Term Memory (ConvLSTM) network is used to capture the temporal patterns and reconstruct the attribute matrices, providing an effective unsupervised anomaly detection method. Combined with the proposed ERR-based threshold setting strategy, the MSCVAE-based framework can achieve excellent performance.
(3) Experiments on four datasets indicate that our framework outperforms competing models in detection accuracy and robustness under imbalanced datasets. An extensive experiment was also conducted to verify that the proposed threshold setting strategy can acquire an optimal threshold in the anomaly detection task, thus contributing to the superior anomaly detection performance of our model.