Deep Learning for Anomaly Detection in Time-Series Data: Review, Analysis, and Guidelines

As industries become automated and connectivity technologies advance, a wide range of systems continues to generate massive amounts of data. Many approaches have been proposed to extract principal indicators from this vast sea of data to represent the entire system state. Detecting anomalies with these indicators in time prevents potential accidents and economic losses. Anomaly detection in multivariate time-series data poses a particular challenge because it requires simultaneous consideration of temporal dependencies and relationships between variables. Recent deep learning-based works have made impressive progress in this field. They are highly capable of learning representations of large-scale sequences in an unsupervised manner and identifying anomalies from the data. However, most of them are highly specific to an individual use case and thus require domain knowledge for appropriate deployment. This review provides a background on anomaly detection in time-series data and reviews the latest applications in the real world. We also comparatively analyze state-of-the-art deep anomaly detection models for time series on several benchmark datasets. Finally, we offer guidelines for appropriate model selection and training strategy for deep learning-based time-series anomaly detection.


I. INTRODUCTION
EVERYTHING on the Earth is a source of signals. Humans have continuously measured and collected signals occurring in nature, such as temperature, wind speed, rainfall, and sunspot intensity, to adapt to the environment. In addition, for decades, various industrial activities have been generating large amounts of data in most fields of industry, such as business (e.g., sales and market trends), finance (e.g., stock prices), biomedicine (e.g., heart and brain activity), and manufacturing (e.g., yield). In each industrial field, data owners actively collect and leverage these data to improve products, processes, and services. In particular, with the advent of Industry 4.0, industries have started to intensively utilize numerous sensors to monitor their facilities and systems simultaneously, resulting in increased efficiency, safety, and security [1].
Among the various data types, time-series data has long been studied in academic fields such as medicine, meteorology, and economics, and is now an essential target of analysis in most practical applications. Time-series analysis refers to a range of tasks that aim to extract meaningful knowledge from time-ordered data; the extracted knowledge can be used not only to diagnose past behavior but also to predict the future. Widely known examples of time-series analysis include classification, clustering, forecasting, and anomaly detection.
Anomaly detection, the process of identifying unexpected items or events in data, has become a field of interest for many researchers and practitioners and is now one of the main tasks in data mining and quality assurance [2]. It has been studied in a variety of application domains and has experienced significant progress. Classical methods, including linear model-based methods [3], distance-based methods [4], density-based methods [5], and support vector machines [6], are still viable choices. However, as target systems become larger and more complex, those methods face limitations, namely an inability to manipulate multidimensional data or address a shortage of labeled anomalies. In particular, detecting anomalies in time-series data is challenging because the order and the causality between observations along the time axis need to be jointly considered. Recently, many approaches have been developed to address these challenges. For instance, Min et al. [7] proposed a novel computational method using a recurrence plot (RP), a square matrix consisting of the times at which a state of a dynamic system recurs. They measure the local recurrence rates (LREC) by scanning the RP with a sliding window and detect anomalies by comparing similarities between the statistics of the LREC curves.
Deep learning, a subfield of machine learning inspired by the structure and function of the brain, has been attracting attention in recent years. Deep-learning methods learn the complex dynamics in the data while making no assumptions about the underlying patterns within the data. This property makes them the most attractive choice for time-series analysis these days. For instance, Ke et al. [8] proposed combining ensembled long short-term memory (LSTM) neural networks, which memorize long-term patterns in time series, with the stationary wavelet transform (SWT) to forecast energy consumption. Their experimental results showed that the proposed deep-learning method outperforms classical computational methods.
The goal of this study is to review state-of-the-art deep learning-based anomaly detection methods for time-series data. To the best of our knowledge, previous reviews [1], [2], [9]-[14] on this subject do no more than categorize models according to their mechanisms and describe their characteristics. In this paper, in addition to classifying the models according to their methodologies, we further analyze in detail how they define interrelationships between variables, learn the temporal context, and identify anomalies in multivariate time series. We also provide guidelines to practitioners based on comparative experimental analyses using several benchmark datasets. Our analyses provide practitioners with helpful insights for choosing the best-suited method(s) for the problem(s) they are trying to solve.
The rest of the paper is organized as follows. In Section II, we provide elementary background on anomaly detection and time series. In Section III, we present various industrial use cases. In Section IV, we present notable conventional methods and discuss the underlying factors that have made them no longer sufficient for recent applications. In Section V, we review recent anomaly detection methods in depth according to how they define the inter-correlations between variables, model the temporal context, and set anomaly criteria. In Sections VI-A and VI-B, we evaluate the deep learning-based anomaly detection methods on several benchmark datasets and provide a comparative review. Finally, in Section VII, we provide general guidelines for model selection to fit given conditions and problems.

II. BACKGROUND

A. ANOMALIES IN TIME-SERIES DATA
We begin with introductory remarks on the definition of anomalies. Several attempts have been made to describe the nature of anomalous data (i.e., statistical outliers). Hawkins [15] described an outlier as an observation that deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism. In this context, we can describe an anomaly in time-series data as the data point(s) at time step(s) that show unexpected behavior differing significantly from previous time steps. Following the previous literature, we categorize the types of anomalies related to time-series data as follows.

1) POINT ANOMALY
A point anomaly is a data point or a sequence that abruptly deviates from the norm (Fig. 1(a)). Such anomalies may appear to be temporal noise and are often caused by sensor errors or abnormal system operations. For detection, operators traditionally set upper and lower control limits, commonly referred to as UCL and LCL, respectively, based on prior data. Values that fall outside those limits are regarded as point anomalies.
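The control-limit rule described above can be sketched in a few lines. This is only a minimal illustration, not a procedure taken from the cited works: the limits are assumed to be the mean of prior data plus or minus k standard deviations (k = 3 is a common convention, not something the text prescribes), and the function names and sample values are hypothetical.

```python
from statistics import mean, stdev


def control_limits(history, k=3.0):
    """Return (LCL, UCL) as mean -/+ k sample standard deviations of prior data.

    The choice of k is an assumption made for this sketch.
    """
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma


def point_anomalies(values, lcl, ucl):
    """Return the indices of values that fall outside the interval [lcl, ucl]."""
    return [i for i, v in enumerate(values) if v < lcl or v > ucl]


# Hypothetical prior measurements collected under normal operation.
history = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7]
lcl, ucl = control_limits(history)

# Hypothetical new measurements; two of them leave the control band.
new_values = [10.1, 9.9, 14.0, 10.0, 6.0]
flagged = point_anomalies(new_values, lcl, ucl)
print(flagged)
```

Because the limits are fixed from prior data, such a rule only reacts to values that leave the band; it cannot by itself account for context or for drift in the normal operating range.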
not guarantee the accuracy of detection results because few anomaly points can be obscured by the other normal variables and significantly affect the entire target system. Reducing the dimensions by extracting clear variables or features or using a model complex enough to detect various patterns can address such problems.

B. PROPERTIES OF TIME-SERIES DATA
Although time is an essential concept in nearly all tasks, working with time-sensitive data requires careful consideration. Nevertheless, if the characteristics of time-series data are well understood, anomalies can be effectively detected by utilizing the contextual information from signals. Therefore, we describe the fundamentals of time-series data in a nutshell. The factors discussed here include temporality, dimensionality, nonstationarity, and noise.

1) TEMPORALITY
A time series is generally considered to be a collection of observations indexed in time order [18]. The data are captured at equal intervals, and each successive data point in the series depends on its past values. Hence, there is an implied temporal correlation or dependence between consecutive observations [19]. The joint distribution of a sequence of observations can be expressed using the chain rule of probability as (1).

$$p(x_1, x_2, \ldots, x_T) = p(x_1) \prod_{t=2}^{T} p(x_t \mid x_1, \ldots, x_{t-1}) \tag{1}$$

where $x_t$ is a data point observed at time $t \in T \subseteq \mathbb{N}$ and each conditional probability $p(\cdot \mid \cdot)$ indicates the temporal dependence between the current state and the previous ones.

2) DIMENSIONALITY
Dimensionality refers to the number of individual data attributes captured in each observation [9]. According to the dimensionality, time-series data is largely divided into univariate and multivariate types. The dimensionality of time-series data influences computational costs and the choice of analysis method.
• Univariate: This type describes an ordered set of real-valued observations, where each data point is measured at a specific time, $t \in T \subseteq \mathbb{N}$. Then, $x_t \in \mathbb{R}$ is a data point measured at time t and is a realized value of a certain random variable, $X_t$ [2].
• Multivariate: This type describes an ordered set of multidimensional vectors, $X = \{x_t\}_{t \in T}$, each of which is recorded at a specific time, $t \in T \subseteq \mathbb{N}$, and contains real-valued observations. In practical circumstances, this can be seen as a group of univariate time-series data streams representing the state of the target system.
Anomaly detection for univariate time series only considers the relations between the current state and the previous states, i.e., temporal dependence. For a multivariate stream, however, both the temporal dependence and the correlations between observations should be considered. Despite the added difficulty, multivariate time-series data has now become a typical type of data for analyzing various behaviors created by combinations of several variables.

3) NONSTATIONARITY
A time series is said to be stationary if its statistical properties do not change over time. More explicitly, a stochastic process $x = \{x_t\}_{t \in T \subset \mathbb{N}}$ is strongly stationary if, for any $\tau \in \mathbb{N}$ and any finite set of time indices $t_1, \ldots, t_n \in T$, the following condition is satisfied, as in (2).

$$F_x(x_{t_1+\tau}, \ldots, x_{t_n+\tau}) = F_x(x_{t_1}, \ldots, x_{t_n}) \tag{2}$$

where $F_x$ denotes the joint distribution function. Ideally, we want a stationary time series for modeling, but many of the desired properties are not satisfied in real-world scenarios. Volatile features, such as seasonality, concept drift, and change points, make time-series data non-stationary.
• Seasonality: This refers to a periodic and recurrent pattern caused by factors such as weather, holidays, marketing promotions, and the behaviors of economic agents [20]. In short, it is a periodic fluctuation over a limited time scale. For example, power consumption is high during the day and low during the night. Likewise, online sales increase rapidly over the Black Friday weekend and then decrease again.
• Concept drift: The nonstationarity of many real environments may lead to changes in the underlying statistical distribution of a data stream over time. This phenomenon goes by many names in the literature, the most common of which is concept drift [21]. This is a central issue because it can derail the performance of models learned from historical data [22].
• Change points: In the manufacturing industry, the normal state of equipment often changes for several reasons. For instance, process conditions change as operations are stopped and restarted with a different setting.
Because most time-series data are nonstationary, data points that indicate spurious anomalies at certain timestamps may not be truly anomalous on a larger scale. Hence, detection methods that adapt to changes in data structures are required for long-term deployment.

4) NOISE
In signal processing, noise is a general term for unwanted changes to signals during their capture, storage, transmission, processing, or conversion [23]. It is considered a bread-and-butter issue in real-world systems. In many cases, noise is due to minor fluctuations in the sensor sensitivity and will have essentially no effect on the overall data structure. However, when the separation between noise and anomaly in a noisy system is difficult, noise seriously affects the performance of detection models [24]. Therefore, it is crucial to understand the nature of the noise and reduce noise during the preprocessing stage.
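As one possible illustration of noise reduction during preprocessing, the sketch below applies a centered moving-median filter. The filter choice, the window size, and the sample values are assumptions made for this example; the text does not prescribe a particular denoising method.

```python
from statistics import median


def median_filter(series, window=3):
    """Smooth a series with a centered moving median.

    Near the edges the window is truncated to the available samples.
    """
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        lo = max(0, i - half)
        hi = min(len(series), i + half + 1)
        smoothed.append(median(series[lo:hi]))
    return smoothed


# Hypothetical sensor readings with one impulsive disturbance at index 2.
noisy = [1.0, 1.1, 5.0, 1.2, 1.0, 0.9, 1.1]
clean = median_filter(noisy, window=3)
print(clean)
```

Note that the same filter would also suppress a genuine single-point anomaly, which is exactly the difficulty of separating noise from anomalies mentioned above; the window size therefore has to be chosen with the expected anomaly duration in mind.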
Various industries have increased their competitiveness by adapting to the changing environment using the latest digital technology. Cloud computing, big data, mobile devices, IoT, and artificial intelligence (AI) have led to the hyperconnectivity and super-intelligence of industrial sites. Combining digital components with physical-world phenomena helps reduce operating costs, increase business agility and flexibility, and create new revenue models. Anomaly detection using these technologies is particularly essential to industry because it is in high demand in real-world applications, such as fault detection in manufacturing, leak detection in gas-chemical processes, cyber intrusion detection, and structural health monitoring in infrastructures.

A. SMART MANUFACTURING
The idea of the smart factory conceptualizes a highly digitalized and connected combination of facilities and equipment that can improve productivity and quality through automation and self-optimization. In an automated manufacturing process, equipment conditions are most closely related to quality and productivity. Stable operation leads to better quality, and efficient operation reduces manufacturing time and improves productivity. Therefore, it is crucial to detect faults immediately or forecast possible anomalies in equipment.
The equipment applied in smart factories includes the production equipment, the infrastructure facility, and the logistics automation equipment (Fig. 2). The production equipment manufactures products efficiently while maintaining quality. The infrastructure facility supplies power, water, gas, and chemicals to the manufacturing process; it also purifies wastewater and chemical waste. The logistics automation equipment carries products from one place to another.
While several machine learning techniques have been utilized to detect damage, faults, and abnormalities in these types of industrial equipment [25]-[28], deep-learning models have shown great promise.

1) PRODUCTION EQUIPMENT
Data-driven models help equipment operation in large manufacturing factories because they can detect possible failures without extensive domain knowledge. Hsieh et al. [29] adopted an autoencoder (AE) based on long short-term memory (LSTM) to learn the normal state of equipment and detect anomalies in multivariate streams occurring in production equipment components. An LSTM-based AE contains an encoder and a decoder, each of which consists of LSTM networks, which are variants of recurrent neural networks (RNN).
In most manufacturing work areas, computer numerical control (CNC) is utilized to shape and machine metal and other rigid materials by cutting, boring, grinding, shearing, or other deformations. Luo et al. [30] proposed an early fault detection model for a CNC machine. They employed a stacked autoencoder (SAE) to mine sensitive fault features from large-scale vibration data during long-term operations. They used a cosine similarity function as a health indicator for predictive maintenance.
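To make the idea of a cosine-similarity health indicator concrete, the following sketch compares a feature vector from current operation against a baseline from healthy operation. It is not the pipeline of [30]: the feature vectors here are hypothetical hand-written values standing in for features that a trained SAE would produce, and the interpretation of the score is an assumption of this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical learned features (placeholders for SAE outputs).
healthy_baseline = [0.9, 0.1, 0.3]
current_ok = [0.88, 0.12, 0.31]
current_worn = [0.2, 0.9, 0.5]

# A score close to 1 means the current state resembles the healthy baseline.
score_ok = cosine_similarity(healthy_baseline, current_ok)
score_worn = cosine_similarity(healthy_baseline, current_worn)
print(round(score_ok, 3), round(score_worn, 3))
```

A monitoring rule could then raise a maintenance alert once the score drops below a threshold calibrated on historical healthy data; the threshold itself would have to be chosen per machine.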
After convolutional neural networks (CNN) revolutionized the field of computer vision [31], researchers also began to apply CNN to time-series data analysis [32]. CNN-based fault detection and diagnosis models showed their competence in handling multivariate time-series data captured from semiconductor manufacturing processes in [33]-[35].

2) INFRASTRUCTURE FACILITIES
Pumps, chillers, and scrubbers are representative infrastructure facilities for maintaining environmental conditions (e.g., temperature, purification, and pressure). In particular, industrial pumps are used for various reasons, such as sustaining a vacuum state in equipment or pipes and exhausting gases and sludge. Pumps are usually driven in parallel. Thus, even if one pump behaves abnormally, the other pump can compensate for it without the operator noticing. This scenario provides tolerance for abnormalities, but the heavily loaded pumps will inevitably wear faster. Therefore, accurate detection and prediction of anomalies are required to enhance the stability of the manufacturing process. In this regard, Lindermann et al. [36] employed a discrete wavelet transform (DWT) and LSTM-AE to detect anomalies across multiple pumps. Another method used CNN to recognize failures from images converted from the vibration signals of pumps [37].
Heating, ventilating, and air conditioning (HVAC) is a representative system that is key to providing indoor environmental comfort via temperature control, oxygen replenishment, and removal of moisture and contaminants. Recently, deep learning-based anomaly detection and diagnosis models for this system have been proposed in [38], [39].
During chemical processes, abrupt changes in the air supply or the contamination levels can significantly damage the product quality. Therefore, several anomaly detection studies have been conducted over the years. Wu et al. [40] employed a pre-trained AlexNet, a CNN model, to extract general features from data and performed transfer learning using the joint maximum mean discrepancy. The proposed model showed great generalization performance across various chemical processes. Another example [41] used LSTM for the early detection of faults via particle attrition in a chemical-looping system. Contaminant detection and treatment are essential in wastewater treatment (WWT) as well. A recent study leveraged LSTM to monitor and detect faults in the WWT process, showing remarkable performance [42].

3) LOGISTICS AUTOMATION SYSTEM
The manufacturing industry's recent interest in highly flexible production systems is related to the increasing demand for more individualized products [43]. This situation requires production flexibility, which has been enhanced by autonomous guided vehicles (AGV) that transport product components between work areas during the manufacturing process [44]. AGV reduce the cost of human intervention and allow on-demand changes regarding product types.
Despite numerous advantages, there are several crucial obstacles that must be overcome when using AGV. For example, if one of the vehicles is damaged or malfunctions, it can cause a bottleneck, and the others have to travel further, resulting in significant economic loss. To take appropriate action when such a problem occurs, the condition of vehicles must be monitored at all times. Acosta et al. [45] presented a method that estimates nonlinear vehicle dynamics based on signals in the vehicle. They employed a structure composed of an extended Kalman filter (EKF) and neural networks to predict the lateral tire forces and the road grip potential. The EKF assumes Gaussian uncertainty in a nonlinear system and estimates the state by repeating prediction and correction steps. Gräber et al. [46] proposed a side-slip angle estimator using RNN with gated recurrent units (GRU). Because RNN, especially with GRU, explicitly models long-term dependencies, it achieves excellent estimation quality while generalizing over different conditions. Although conventional approaches like EKF are still dominant in the industry, a well-designed RNN with sufficient data can be a competitive solution since it relies on fewer model assumptions, such as the underlying physical equations.
Another solution would be to monitor the route the AGV is traveling rather than the AGV itself, or to avoid congested sections. Since the early 2000s, in semiconductor manufacturing plants, tens of thousands of AGV have transported wafers along ceiling rails (i.e., the overhead hoist transport). In these systems, neural network-based methods [47], [48] have been proposed for rail condition diagnosis. They monitor the positions of the upper- and lower-rail cables and the cable holders. Another method used a decision tree [49] to detect unplanned stopping or slowing of vehicles in factories.

B. SMART ENERGY MANAGEMENT
Stable supply and efficient consumption of energy are essential to cope with rapid climate changes and resource shortages. Thus, anomaly detection in energy supply and consumption processes has become increasingly important. In terms of supply, if a power outage occurs, it causes significant losses to consumers. In terms of consumption, if energy is unnecessarily consumed, higher prices are paid and energy is wasted.
As illustrated in Fig. 3, a large amount of data is collected and reported in the smart-energy management system. This provides all involved individuals the opportunity to better understand and predict consumption patterns. Autonomous collection devices, in turn, reduce the requirement for manual meter readings [11]. Furthermore, real-time early detection of possible failures allows energy suppliers to deal with problems ahead of time instead of relying on reactionary efforts. The success of smart-energy systems in the power sector has enabled the full embodiment of the smart-grid paradigm in water and natural gas fields [50].

1) ELECTRIC POWER
Several applications in [51]-[53] have been proposed to detect anomalies in multivariate time-series data generated by power plants. They take advantage of various deep neural-network models (e.g., convolutional LSTM, CNN, and attention layers), achieving remarkable results. Aside from anomaly identification, collected metrics data are used to diagnose the severity of problems.
A wide variety of approaches has also been proposed to detect consumer-side losses, such as abnormal consumption patterns, unnecessary waste, and theft [54]-[56]. Diagnosis results are reported to the consumer using the energy-management systems to prevent problems and develop future strategies.

2) TREATED WATER
Water treatment and distribution systems determine the quality of both potable and industrial water supplies. Water-treatment facilities mainly exist in secure areas, but distribution networks are comprised of countless pipelines that span large areas. Since distribution networks are widespread and often vulnerable, the risk of physical attacks always exists. To make matters worse, a cyber intrusion poses a bigger threat, and the related damages have a significant impact. In this regard, several real-world datasets (e.g., SWaT and WADI) have been released [57], [58] so that researchers can use them without the need to collect vast amounts of data themselves.
Li et al. [59] adopted a generative adversarial network (GAN) to detect anomalies in multivariate time-series data and validated their method on the aforementioned datasets. More recently, a method using a temporal hierarchical one-class network (THOC) [60], a combined structure with several layers of dilated RNN and multiscale support vector data description (MVDD), has shown superior performance to other state-of-the-art networks.
Several tools that detect abnormalities in consumption patterns also exist. As a representative example, Vercruyssen et al. [61] exploited an active-learning strategy using constraint-based clustering and label propagation to monitor water consumption.

3) MANUFACTURED GAS
Crude oil, hard coal, and natural gas are manufactured into petroleum products and transformed into solids, liquids, and gases worldwide. Similar to the water-treatment process, the purification and refinement processes directly affect the quality of petroleum products. Inspired by a successful image segmentation network, Wen et al. presented a time-series anomaly detection model using a CNN [62] that adopted a transfer-learning framework to resolve data sparsity issues. They demonstrated its effectiveness with the gasoil plant heating-loop dataset [63], which includes cyber-attacks on utility systems as a variety of data points. Moreover, energy management systems are required to manage gas storage and transport thoroughly and constantly, not only for cost reduction but also for environmental safety. On that matter, a recent CNN-based model was proposed [64] to detect gas leaks by monitoring flow noise inside the pipes.

C. CLOUD COMPUTING SYSTEM
In cloud computing, client data are stored and managed in remote data centers by a service provider [65]. These providers are required to allocate appropriate resources to users in real time while storing sensitive information securely. As cloud services become more popular, intrusion detection has become crucial. Hence, providers now leverage logs and time-series data to monitor the states of servers and networks to detect deviations from normal patterns. Hundreds of thousands of suspicious events are continuously detected by such monitoring systems every day. Therefore, time-series anomaly detection on cloud systems, with subsequent diagnosis of the current state and tracing of the root causes, is important to maintain high service availability [66], [67].

1) SERVER MACHINE
On a server, multivariate time-series metrics, such as the processor load, the network usage, and the memory status, are made available. Su et al. [68] proposed OmniAnomaly, a variational AE (VAE) with gated recurrent units (GRU), for monitoring server machines. In the qnet, which acts as an encoder, the hidden state of the GRU, $e_t$, is combined with the stochastic variable of the previous time step, $z_{t-1}$, and the result is fed to a dense layer to sample the current stochastic variable $z_t$. This variable passes through a planar normalizing flow so that the model can learn a complex posterior well. In the pnet, which acts as a decoder, $z_t$ is connected to $z_{t-1}$ using a linear Gaussian state space model to capture temporal dependence. After that, the value $x'_t$ is sampled from the estimated distribution through the reconstruction process. For a similar purpose, hierarchical temporal memory (HTM) and Bayesian network-based approaches have been proposed [69]. Meanwhile, CNN-based approaches [70], [71] have been verified to be effective on several datasets from global cloud enterprises.

2) NETWORK AND FRAMEWORK
Moreover, as network traffic grows exponentially, it becomes ever more necessary to constantly monitor network systems and distributed processing frameworks. Audibert et al. [72] proposed an AE in which one encoder and two decoders are trained adversarially to identify network anomalies. In addition, Zhao et al. [73] recently suggested a graph attention network-based method to detect anomalies in a big-data processing framework. They explicitly modeled correlations between sensors via attention layers, captured temporal dependence with GRU, and increased performance by jointly applying forecasting and reconstruction results.

3) CYBERSECURITY
In addition to ordinary physical threats, malicious cyberattacks have become critical issues for the reliability and security of cloud systems.For this reason, numerous methods have been proposed to protect customers' sensitive information [74]-[76].

D. STRUCTURAL HEALTH MONITORING
Civil infrastructure, including buildings, bridges, levees, and pipelines, is composed of large and complex structures that carry large loads while operating in tough environments. These structures are designed to operate safely under expected loading ranges, but corrosion and damage can occur due to repeated exposure to operation over their lifespans. If damage is not detected in time, a structure becomes more vulnerable to failure, which may result in a safety accident. Structural health monitoring (SHM) evaluates the loads and responses of these structures and identifies abnormal behaviors to maintain them [77]. Some anomalies in SHM data caused by imperfect sensors and poor data transmission quality must be eliminated because they can cause false alarms and affect the structural performance assessment. However, eliminating them requires expertise and is very time-consuming.
In this respect, several approaches have been proposed recently. Bao et al. [16], imitating the recognition process of humans, transformed data into image files and fed them into stacked autoencoders (SAE) for anomaly classification. They trained each layer of the network one at a time, a training scheme referred to as greedy layer-wise training. After this phase was completed, they fine-tuned all layers to improve the results. They verified the performance of the proposed framework with real-world data from a long-span cable-stayed bridge in China.
Similarly, Tang et al. [17], taking advantage of the interpretability of visualized data, converted raw time-series data to images and split the continuous data into segments by windowing the data without overlap. Afterwards, they fed the pre-processed data into a CNN-based classification model. Each segment was decomposed into the time domain and the frequency domain with the fast Fourier transform (FFT) and fused into an image by stacking the time-response image and the frequency-response image.

IV. CHALLENGES OF CLASSICAL APPROACHES
Even before deep learning was popular, people had developed various mathematical and statistical models to analyze time-series data, applying them widely across various fields. Here, we introduce some representative methods and describe the challenges that remain to be solved.

A. CLASSICAL APPROACHES

1) TIME/FREQUENCY DOMAIN ANALYSIS
Time-series data can be analyzed in the time domain using the width and the height of measured thresholds. Another straightforward yet efficient method is to apply Fourier analysis to examine data with frequency-domain representations. According to the Fourier theorem, any periodic function, no matter how complex, can be expressed as a combination of periodic components, such as a sum of sines and/or cosines. Fourier analysis is the process of decomposing a function into those components. The discrete Fourier transform (DFT) is one of the popular methods and takes the following form:

$$X_k = \sum_{t=0}^{N-1} x_t \, e^{-i 2 \pi k t / N} \tag{3}$$

where $X_k$ is the k-th frequency value transformed from the given input data $x_t$, and N is the number of samples. Once the raw time series is transformed to a frequency spectrum, as in (3), and sorted by coefficient magnitude, the seasonal periods can be obtained by inverting the frequencies with the largest coefficients. In practice, the fast Fourier transform (FFT), a sped-up version of the DFT, is the preferred choice.
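The period-finding procedure described above can be sketched as follows. The code evaluates the standard DFT definition directly in O(N^2) time purely for readability; an FFT routine would be used in practice. The function names and the synthetic sine signal are assumptions made for this example.

```python
import cmath
import math


def dft(x):
    """Naive DFT: X_k = sum over t of x_t * exp(-2*pi*i*k*t/N)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def dominant_period(x):
    """Return the period (in samples) of the strongest non-constant component."""
    n = len(x)
    mean = sum(x) / n
    # Remove the mean so the constant (k = 0) term does not dominate.
    spectrum = dft([v - mean for v in x])
    # For real-valued input only k = 1 .. N/2 carry distinct information.
    k_best = max(range(1, n // 2 + 1), key=lambda k: abs(spectrum[k]))
    # Frequency k_best cycles per N samples corresponds to a period of N / k_best.
    return n / k_best


# Synthetic signal: four full cycles of a sine wave with a 12-sample period.
signal = [math.sin(2 * math.pi * t / 12) for t in range(48)]
period = dominant_period(signal)
print(period)
```

The estimate is exact here only because the window contains a whole number of cycles; with real data the resolution is limited to periods of the form N / k, so longer windows give finer period estimates.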

2) STATISTICAL MODEL
To mathematically analyze time-series data, we can generate a statistical model by calculating statistical measures, such as mean, variance, median, quantile, kurtosis, skewness, and many more. With the generated model, newly added time-series data can be inspected to determine whether it belongs to the normal boundary [78].
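As a small illustration of such a statistical model, the sketch below derives a normal boundary from the quartiles of known-normal data and checks new values against it. The interquartile-range rule with a factor of 1.5 is a common convention assumed here, not a method taken from [78]; the function names and readings are hypothetical.

```python
from statistics import quantiles


def fit_normal_boundary(normal_data, k=1.5):
    """Return (lower, upper) bounds from the quartiles of normal data.

    The bounds are Q1 - k*IQR and Q3 + k*IQR; k = 1.5 is an assumed convention.
    """
    q1, _median, q3 = quantiles(normal_data, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def is_within_boundary(value, boundary):
    """Check whether a newly observed value lies inside the normal boundary."""
    lower, upper = boundary
    return lower <= value <= upper


# Hypothetical readings collected during normal operation.
normal_data = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7, 20.0, 20.4]
boundary = fit_normal_boundary(normal_data)

for value in (20.2, 23.0, 19.0):
    print(value, is_within_boundary(value, boundary))
```

Quantile-based bounds are less sensitive to extreme values in the reference data than bounds built from the mean and variance, which matters when the "normal" sample is not perfectly clean.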

3) DISTANCE-BASED MODEL
Many algorithms use an explicit distance between two temporal sequences to quantify the similarity between the two. Based on the obtained similarity metric, newly obtained sequences will be flagged as anomalies if their distances from the normal one fall outside the expected range. The most common measure of distance is the Euclidean distance, as in (4), which computes the distance as the length of a segment connecting two points. For two sequences $X = (x_1, \ldots, x_n)$ and $Y = (y_1, \ldots, y_n)$ of equal length n,

$$d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{4}$$
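A minimal sketch of distance-based flagging with the Euclidean distance is given below. The reference template, the sample runs, and the rule for the expected range (twice the largest distance seen among known-normal runs) are all assumptions made for illustration.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def is_anomalous(candidate, reference, threshold):
    """Flag a sequence whose distance from the reference exceeds the threshold."""
    return euclidean(candidate, reference) > threshold


# Hypothetical normal template and two known-normal runs.
reference = [0.0, 1.0, 0.0, -1.0]
normal_runs = [[0.1, 0.9, 0.0, -1.1], [0.0, 1.1, -0.1, -0.9]]

# Assumed rule: the expected range is twice the largest normal distance.
threshold = 2 * max(euclidean(run, reference) for run in normal_runs)

# The same shape shifted by one step in time is flagged, although it is
# arguably the same pattern that merely arrives late.
shifted = [1.0, 0.0, -1.0, 0.0]
print(is_anomalous(shifted, reference, threshold))
```

The shifted example shows the main weakness of a point-by-point distance: sequences that are locally out of phase appear far apart even when their shapes match.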
Dynamic time warping (DTW) is a popular distance measure that allows nonlinear alignments between two sequences that are locally out of phase [79]. Assume that we have two sequences X and Y, whose lengths are M and N, respectively. The DTW distance between the two sequences is measured as follows:
1) Create the cost matrix C using a dynamic programming algorithm, as in (5).

$$C(i, j) = D(i, j) + \min\{C(i-1, j-1),\; C(i-1, j),\; C(i, j-1)\} \tag{5}$$

where i and j index data points of X and Y, respectively, D(i, j) is the distance between those two points, and C(i, j) is the minimum cumulative warp distance between the first i points of X and the first j points of Y.
2) Trace back from C(M, N) to C(1, 1) to get the optimal warping path $W = (w_1, w_2, \ldots, w_L)$, choosing the previous points with the lowest cumulative distance.
3) Finally, calculate the final distance using W, as in (6).
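The first two DTW steps can be sketched as follows for one-dimensional sequences. The point-wise distance D is assumed to be the absolute difference, and the function returns the accumulated cost C(M, N) together with the warping path; any further normalization of the final distance in step 3 is not reproduced here.

```python
def dtw(x, y):
    """Return (C(M, N), warping path) for two non-empty 1-D sequences.

    Path entries are 1-based (i, j) index pairs running from (1, 1) to (M, N).
    """
    m, n = len(x), len(y)
    if m == 0 or n == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # Row 0 and column 0 are padding so the recurrence needs no edge cases.
    cost = [[inf] * (n + 1) for _ in range(m + 1)]
    cost[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = abs(x[i - 1] - y[j - 1])  # assumed point-wise distance D(i, j)
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Trace back from (M, N) to (1, 1), always moving to the cheapest predecessor.
    i, j = m, n
    path = [(i, j)]
    while (i, j) != (1, 1):
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return cost[m][n], path


# The second sequence repeats its first sample, so it is locally out of phase.
a = [0, 1, 2, 1, 0]
b = [0, 0, 1, 2, 1, 0]
total_cost, warping_path = dtw(a, b)
print(total_cost, warping_path)
```

Because the warping path may align one point with several points of the other sequence, the two example sequences are matched at zero cost even though their lengths differ.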

4) PREDICTIVE MODEL
Predictive models are used to forecast future states based on the past and current states. We can deduce the anomaly according to the severity of the discrepancy between the predicted value and the real one. For example, the autoregressive integrated moving average (ARIMA) [80] is a frequently employed model to forecast time series. The ARIMA model is composed of three parts:
• The auto-regressive (AR) model is composed of a weighted sum of lagged values, and thus we can model the value of a random variable X at time step t as (7).

$$\mathrm{AR}(p):\quad X_t = \sum_{i=1}^{p} \phi_i X_{t-i} + \epsilon_t \tag{7}$$

where $\{\phi_i\}_{i=1}^{p}$ are autoregressive coefficients, $\epsilon_t$ is white noise, and p is the order of the AR model.
• The moving-average (MA) model computes the weighted sum of lagged prediction errors and is formulated as (8).

$$\mathrm{MA}(q):\quad X_t = \sum_{i=1}^{q} \theta_i \epsilon_{t-i} + \epsilon_t \tag{8}$$

where $\{\theta_i\}_{i=1}^{q}$ are moving-average coefficients, $\epsilon_t$ denotes a model prediction error at time step t, and q is the order of the MA model.
• Integrated (I) indicates that the time series is modeled using differences, and thus a data point at time step t is $\hat{X}_t = X_t - X_{t-1}$ when d = 1, where d denotes the order of differencing.
As a result, the ARIMA model with the order parameters (p, d, q) is formulated as follows:

$$\mathrm{ARIMA}(p, d, q):\quad \hat{X}_t = \mu + \sum_{i=1}^{p} \phi_i \hat{X}_{t-i} + \sum_{i=1}^{q} \theta_i \epsilon_{t-i} + \epsilon_t \tag{9}$$

where µ is a constant and $\hat{X}_t$ is the series after differencing d times. As described in (9), each value at a specific time step is affected by previous observations and prediction errors, so ARIMA models the temporality of time series. Also, the differencing process makes the time series stationary, which makes ARIMA effective for non-stationary time series. If the time-series data has a seasonal or cyclic variation, we can use a seasonal ARIMA (SARIMA) [81] model. In this case, we introduce additional parameters, P, D, and Q, which deal with the seasonality. These parameters are used in the same manner as p, d, and q. Fundamentally, ARIMA is not capable of modeling multivariate data. Instead, the autoregressive integrated moving average exogenous (ARIMAX) [82] model, which has an additional explanatory variable, or the vector autoregression (VAR) [83] model, which uses vectors to accommodate the multivariate terms, is used to replace ARIMA.
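To make the prediction-based idea concrete, the sketch below differences a series once (d = 1), fits only the AR part by ordinary least squares, and scores each point by its one-step prediction error. It is a simplified stand-in, not a full ARIMA estimator: the MA term is omitted, and the helper names, the synthetic random walk, and the injected level shift at time step 200 are all assumptions for illustration.

```python
import numpy as np

def lag_matrix(x, p):
    """Rows are [x[t-1], ..., x[t-p]] for t = p, ..., len(x) - 1."""
    return np.column_stack([x[p - i: len(x) - i] for i in range(1, p + 1)])

def fit_ar(x, p):
    """Least-squares estimate of the AR(p) coefficients (no intercept)."""
    x = np.asarray(x, dtype=float)
    phi, *_ = np.linalg.lstsq(lag_matrix(x, p), x[p:], rcond=None)
    return phi

def ar_residuals(x, phi):
    """One-step-ahead prediction errors of the fitted AR model."""
    x = np.asarray(x, dtype=float)
    p = len(phi)
    return x[p:] - lag_matrix(x, p) @ phi

rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=300))   # non-stationary random walk
x[200:] += 15.0                       # hypothetical abrupt level shift at t = 200

p = 2
diffed = np.diff(x)                   # differencing with d = 1
phi = fit_ar(diffed, p)
resid = ar_residuals(diffed, phi)
scores = np.abs(resid) / resid.std()  # standardized prediction error
# resid[j] belongs to diffed[j + p], i.e. to original time step j + p + 1.
anomaly_time = int(np.argmax(scores)) + p + 1
print(anomaly_time)
```

For production use, a maintained ARIMA implementation that also estimates the MA coefficients would normally be preferable to this hand-rolled fit.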

5) CLUSTERING MODEL
In an unsupervised setting, clustering-based methods are simple yet effective choices for grouping the data and detecting the anomalies. Once you map time-series data into a multidimensional space, clustering algorithms group them around the centroid of each cluster depending on their similarities. Models classify newly received data samples as anomalies if they are far from the pre-defined clusters or have a low probability of belonging to any of the clusters.
Popular data clustering methods include the k-means algorithm [84], one-class support vector machine (OCSVM) [85], Gaussian mixture model (GMM) [86], and density-based spatial clustering of applications with noise (DBSCAN) [87]. These methods may be insufficient when datasets have mixed attributes, such as numerical and categorical values. To resolve this issue, the k-prototypes algorithm [88], a simple combination of the k-means and k-modes algorithms, was proposed. The k-prototypes algorithm measures dissimilarity between two mixed-type objects X and Y, which are described by attributes $A^r_1, A^r_2, \ldots, A^r_p, A^c_{p+1}, \ldots, A^c_m$. The dissimilarity is measured as [88, eq. (10)].

$$d(X, Y) = \underbrace{\sum_{j=1}^{p} (x_j - y_j)^2}_{\text{numeric attributes}} + \underbrace{\gamma \sum_{j=p+1}^{m} \delta(x_j, y_j)}_{\text{categorical attributes}} \tag{10}$$

where the first term is the squared Euclidean distance between the numeric attributes and the second one is a simple matching dissimilarity between the categorical attributes, with $\delta(x_j, y_j) = 0$ if $x_j = y_j$ and 1 otherwise, and $\gamma$ is a weight that balances the two terms.
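The mixed-type dissimilarity can be illustrated with a few lines of Python. This is a sketch assuming the usual k-prototypes form (squared numeric differences plus a weighted count of categorical mismatches); the function name, the weight `gamma`, and the example records are hypothetical.

```python
def k_prototypes_dissimilarity(x_num, y_num, x_cat, y_cat, gamma=1.0):
    """Squared Euclidean part plus gamma times the number of categorical mismatches."""
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, y_num))
    categorical = sum(a != b for a, b in zip(x_cat, y_cat))
    return numeric + gamma * categorical

# Two records with two numeric and two categorical attributes each.
d = k_prototypes_dissimilarity([1.0, 2.0], [2.0, 4.0], ["pump", "on"], ["pump", "off"], gamma=0.5)
print(d)
```

The choice of `gamma` matters in practice: a value that is too small lets the numeric attributes dominate, and a value that is too large does the opposite.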
The above clustering methods are still representative benchmarks but are becoming outdated. Recently, data have grown in scale, and thus clustering algorithms are required that can deal with massive data in both sequential and parallel computing environments. To process large amounts of data effectively, we can consider two approaches: one is to increase the computational speed by reducing the size of the data, and the other is to split the data into small chunks and process them in parallel.
Structural clustering algorithm for networks (SCAN) [89] is one of the successful density-based clustering algorithms for a graph, a fundamental data structure. Several works [90]-[92] use node/edge pruning techniques to reduce the number of structural similarity comparisons, thereby boosting the efficiency of SCAN without sacrificing the clustering quality for graphs with millions or even billions of edges. These methods skip vertices that are shared between the neighbors or remove outliers before updating clusters. Similarly, Li et al. [93] improve DBSCAN, a density-based clustering algorithm for numerical data, to prevent redundant computations with a fast nearest neighbor query that exploits the triangle inequality.
The second approach is to distribute the data among several machines or processors to accelerate the processing of an extensive volume of data. MapReduce [94] is one of the most widely used parallel processing models for data-intensive applications. As illustrated in [95, Fig. 4], this model consists of two main functions: the Map and the Reduce functions. Considering k-means as an example, MapReduce tasks follow this procedure:
1) The dataset is split into multiple chunks, which are fed to the mappers in the form of <index, value>.
2) The Map functions calculate the distance of each sample from the centers and then assign the samples to the closest cluster: <index, center>.
3) The Reduce functions compute the partial summation of the samples with the same center and bind them in the form of <center, (sum, #samples)>.
4) The synchronization phase sequentially calculates the new centers by dividing the sum by #samples and updates the centers: <cluster, new center>.
5) Repeat until convergence.
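The five steps above can be mimicked in a single process to show the data flow. The following Python sketch is only an illustration: the chunks are processed in a plain loop rather than on distributed workers, and the function names, chunk count, and toy data are assumptions.

```python
import numpy as np

def map_assign(chunk, centers):
    """Map: emit (closest center index, sample) for every sample in a chunk."""
    return [(int(np.argmin(np.linalg.norm(centers - x, axis=1))), x) for x in chunk]

def reduce_partial(pairs):
    """Reduce: per-center partial sums in the form {center: (sum, #samples)}."""
    sums = {}
    for c, x in pairs:
        s, n = sums.get(c, (0.0, 0))
        sums[c] = (s + x, n + 1)
    return sums

def kmeans_mapreduce(data, centers, n_chunks=2, max_iter=100):
    centers = np.asarray(centers, dtype=float).copy()
    for _ in range(max_iter):
        chunks = np.array_split(data, n_chunks)                  # step 1: split
        partials = [reduce_partial(map_assign(ch, centers))      # steps 2 and 3
                    for ch in chunks]
        new_centers = centers.copy()
        for c in range(len(centers)):                            # step 4: synchronize
            total = sum(p[c][0] for p in partials if c in p)
            count = sum(p[c][1] for p in partials if c in p)
            if count > 0:
                new_centers[c] = total / count
        if np.allclose(new_centers, centers):                    # step 5: convergence
            break
        centers = new_centers
    return centers

data = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
final_centers = kmeans_mapreduce(data, centers=[[0.0, 0.0], [10.0, 10.0]])
print(final_centers)
```

In a real deployment, the list comprehension over chunks is the part that would be dispatched to mappers and reducers on separate machines.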
Over the past few years, variants of k-means clustering using MapReduce [96]-[98] have been introduced. Meanwhile, Scalable k-means++ [99] utilizes MapReduce at the initialization phase instead of the post-initialization phase.

B. CHALLENGING ISSUES
Although traditional approaches have made much progress in anomaly detection in time-series data, there is still room for improvement because of the following challenges.

2) COMPLEXITY OF DATA
Analyzing univariate time-series data is still a critical topic in applications that require less computation, such as edge computing. Nonetheless, as more industrial applications are automated and the complexity of control systems increases, separately monitoring individual univariate time-series data becomes impractical. With a large number of dimensions, traditional approaches generally experience a non-negligible drop in performance due to the curse of dimensionality. Moreover, correlations between variables, which cannot be inferred by univariate time-series analysis, can also be used to indicate anomalies.

V. DEEP LEARNING FOR ANOMALY DETECTION
In this paper, we focus on recent anomaly detection models that have been used to overcome the challenging issues mentioned in Section IV. Therefore, our survey works under the following assumptions.
• Semi-supervised/unsupervised learning: All training data are considered to be in the normal class for semi-supervised learning, whereas no explicit distinction between normal and abnormal classes is made in unsupervised learning. Both strategies learn the data structure to overcome the shortage of labeled data.
• Multivariate data: The models should be capable of extracting and exploiting the information entangled in multivariate time-series data.
• Deep learning: Deep-learning methods are explored to handle a complex and massive amount of data.
In this section, we analyze these methods from three perspectives: how they define inter-correlation between variables; how they model the temporal context information; and how they define anomaly scores or thresholds.

A. INTER-CORRELATION BETWEEN VARIABLES
Most deep-learning models for multivariate time-series data establish relationships among multiple variables at every time step. This spatiotemporal information considers not only the temporal context but also the correlation between variables. Table 2 shows how the correlations between variables are established in recent works.

1) DIMENSIONAL REDUCTION
The status of a large-scale system can be represented using a few significant factors. Thus, we can reduce the amount of computation by extracting the main features via dimensional reduction. Typically, a linear algebra-based method, such as principal component analysis or singular value decomposition, or a neural-network-based method, such as AE or VAE, is used. Some previous works process the individual univariate time series, while others treat the reduced representations as multivariate series. Dimensional reduction also has a drawback: identifying the cause of an anomaly becomes difficult.

2) 2D MATRIX
A 2D matrix directly captures the morphological similarity and the relative scale among individual variables. Moreover, it considers the variables jointly, making it robust to turbulence at specific points in time. Two representative definitions of the 2D matrix, $m_t \in \mathbb{R}^{n \times n}$, are formulated as in (11) and (12), where $X = \{x_1, x_2, \ldots, x_T\}$ are multivariate time-series data with n variables of length T, that is, $X \in \mathbb{R}^{n \times T}$, and each $x_t$ is an n-dimensional vector. On one hand, if the phase of the entire variable suddenly rises or falls due to an unexpected event, (11) can detect anomalies, but (12) cannot. On the other hand, when the overall phase changes by a concept drift or a change point, (12) dismisses this event as normal, while (11) raises an unnecessary alarm.
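The contrast between the two definitions can be illustrated numerically. The sketch below assumes, for illustration only, that one definition is an uncentered inner-product matrix over a window and the other is a mean-centered covariance matrix; this assumption matches the behavior described for (11) and (12), but the exact forms used in the cited works may differ. The function names and the window shape are hypothetical.

```python
import numpy as np

def inner_product_matrix(window):
    """Uncentered pairwise inner products of n variables over a window of shape (n, w)."""
    return window @ window.T / window.shape[1]

def covariance_matrix(window):
    """Mean-centered pairwise covariances over the same window."""
    centered = window - window.mean(axis=1, keepdims=True)
    return centered @ centered.T / window.shape[1]

rng = np.random.default_rng(0)
window = rng.normal(size=(4, 50))   # 4 variables, 50 time steps
shifted = window + 5.0              # every variable jumps by the same offset

# The uncentered matrix reacts to the level shift, the centered one does not.
inner_changed = not np.allclose(inner_product_matrix(window), inner_product_matrix(shifted))
cov_changed = not np.allclose(covariance_matrix(window), covariance_matrix(shifted))
print(inner_changed, cov_changed)
```

This is the trade-off the paragraph describes: sensitivity to level shifts helps with sudden events but raises alarms under benign drift.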

3) GRAPH
A graph can define an explicit topological structure and learn the causal relationship among individual variables. Recently, several approaches [73], [108], [109] that apply an attention mechanism to GNN have been proposed to improve performance in identifying root causes. A directed graph is formulated as G = (V, E), where V = {1, 2, ..., N} is the set of N nodes, and E ⊆ V × V is the set of edges. Here, $e_{ij}$ denotes the edge from node i to j. Generally, given a graph, the attention layer outputs a representation for each node as follows:

$$y_i = \sigma\left(\sum_{j=1}^{L} \alpha_{ij} v_j\right) \tag{13}$$

where $y_i$ denotes the feature representation of node i, $\sigma$ corresponds to the sigmoid activation function, $\alpha_{ij}$ to the attention score, which measures the influence of node j on node i, where j is one of the L adjacent nodes of i, and $v_j$ to the feature vector of node j. We can compute the attention score $\alpha_{ij}$ by the following equations:

$$s_{ij} = \mathrm{LeakyReLU}\left(w^{\top} (v_i \oplus v_j)\right) \tag{14}$$

$$\alpha_{ij} = \frac{\exp(s_{ij})}{\sum_{l=1}^{L} \exp(s_{il})} \tag{15}$$

where $s_{ij}$ is the unnormalized attention coefficient, ⊕ concatenates two node features, w denotes a set of learnable parameters, and LeakyReLU is a nonlinear activation function that has a gentle slope for negative values. Fig. 6 illustrates the intuition behind the graph attention.

TABLE 3. Taxonomy of models in terms of modeling the temporal context.
Category | Model | Works
RNN | Long Short-Term Memory (LSTM) | LGMAD [111], LSTM-AE [29], MAD-GAN [59], OmniAnomaly [68], SPREAD [104], LSTM-VAE [112]
RNN | Gated Recurrent Unit (GRU) | THOC [60], GGM-VAE [113], S-RNNs [114]
CNN | Convolutional Neural Network (CNN) | Choi et al. [53], MU-Net [62], BeatGAN [115]
CNN | Temporal Convolutional Network (TCN) | HS-TCN [116], TCN-GMM [117], TCN-ms [118]
Hybrid | Convolutional LSTM (ConvLSTM) | MSCRED [51], RSM-GAN [74]
Attention | Self-attention or Transformer | MTAD-GAT [73], SAnD [119], MTSM [120], GTA [109]
Others | Hierarchical Temporal Memory (HTM) | RADM [69], Wu et al. [121]
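A single graph attention layer of the kind described above can be sketched with NumPy. This is an illustrative forward pass under stated assumptions: the weight vector `w` is random rather than learned, the LeakyReLU slope of 0.2 is an arbitrary choice, the adjacency matrix is a hypothetical four-node ring with self-loops, and nodes without neighbors are simply left at zero.

```python
import numpy as np

def leaky_relu(z, slope=0.2):
    return np.where(z > 0, z, slope * z)

def graph_attention(features, adjacency, w):
    """Return (node outputs, attention scores) for one attention layer."""
    n = features.shape[0]
    outputs = np.zeros_like(features, dtype=float)
    alphas = np.zeros((n, n))
    for i in range(n):
        neighbors = np.flatnonzero(adjacency[i])
        if neighbors.size == 0:
            continue                                   # isolated node: output stays zero
        # Unnormalized coefficients from the concatenated feature pairs.
        s = np.array([leaky_relu(w @ np.concatenate([features[i], features[j]]))
                      for j in neighbors])
        s = s - s.max()                                # numerical stability for softmax
        alpha = np.exp(s) / np.exp(s).sum()            # softmax over the neighbors
        alphas[i, neighbors] = alpha
        aggregated = alpha @ features[neighbors]       # weighted sum of neighbor features
        outputs[i] = 1.0 / (1.0 + np.exp(-aggregated)) # sigmoid activation
    return outputs, alphas

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 3))                     # 4 nodes, 3 features each
adjacency = np.eye(4) + np.roll(np.eye(4), 1, axis=1)  # ring with self-loops
w = rng.normal(size=6)                                 # acts on concatenated pairs
outputs, alphas = graph_attention(features, adjacency, w)
print(alphas.round(3))
```

Inspecting a row of `alphas` shows which neighbors contribute most to a node, which is the property that root-cause analysis relies on.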

B. MODELING TEMPORAL CONTEXT
The history of a sequence contains a great deal of knowledge about its behavior and can suggest future shifts. Hence, estimating the distribution alone is limited in detecting context and collective anomalies. In time-series applications, the temporal context should be considered when modeling the normal status. Table 3 shows the taxonomy of models in terms of modeling the temporal context.

1) RNN
Several deep learning-based approaches model the temporal context. One of the most common benchmarks uses an RNN to recognize pattern sequences and predict expected values. Thus, we can determine anomalies by identifying the differences between the predicted and actual signals. RNNs have been extended with other variants, such as LSTM [122] and GRU [123]. LSTM and GRU address the vanishing or exploding gradient problem, where the gradient becomes too small or too large as the network goes deeper. There are multiple gates in an LSTM and a GRU cell, and they can learn long-term dependencies by determining how much of the previous states to keep or forget at every time step. Meanwhile, the dilated RNN, as illustrated in [124, Fig. 7], was proposed to extract multi-scale features while modeling long-term dependencies by using a skip connection between hidden states. Shen et al. [60] adopt a three-layer dilated RNN and extract features from each layer to jointly consider long- and short-term dependencies.
RNN-based approaches are generally used for anomaly detection in two ways. One is to predict future values and compare them to predefined thresholds or the observed values. This strategy is applied in [60], [110], [111], [114]. The other is to construct an AE or VAE to restore the observed values and evaluate the discrepancy between the reconstructed value and the observed one. This strategy is used in [29], [59], [68], [104], [112], [113].

2) CNN
Although the RNN is the primary option for modeling time-series data, CNN sometimes shows better performance in applications [53], [62], [115] that work with short-term data. By stacking convolutional layers, each layer learns a higher level of features, from pixels to objects. In addition, the pooling layers introduce non-linearity to the CNN, allowing it to capture the complex features in the sequences.
Instead of explicitly capturing the temporal context, CNN models learn patterns in segmented time series. Hence, one of their drawbacks is that it is not easy to comprehend behaviors appearing over a long period. As an alternative, the temporal convolutional network (TCN), a variant of CNN, was proposed in [125]. There are three distinguishing properties of TCN. First, the convolutions in the model are causal, meaning that they ensure no information leakage from the future to the past. Second, it can take a sequence of any length, just as with an RNN. Third, it can look quite far into the past to forecast the future using a combination of deep networks and dilated convolutions.
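The causal and dilated properties can be seen in a single convolution written out by hand. The following NumPy sketch is illustrative only: the function name, the two-tap kernel, and the dilation of 2 are assumptions, and a real TCN would stack many such layers with learned kernels and residual connections.

```python
import numpy as np

def causal_dilated_conv1d(x, kernel, dilation=1):
    """y[t] = sum_k kernel[k] * x[t - k * dilation], with zeros before the start."""
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    pad = (len(kernel) - 1) * dilation
    padded = np.concatenate([np.zeros(pad), x])   # left padding only: no future leakage
    y = np.zeros(len(x))
    for t in range(len(x)):
        for k, weight in enumerate(kernel):
            y[t] += weight * padded[pad + t - k * dilation]
    return y

x = np.arange(8, dtype=float)
y = causal_dilated_conv1d(x, kernel=[1.0, 1.0], dilation=2)   # y[t] = x[t] + x[t-2]

# Causality check: changing future inputs must not change past outputs.
x_future_changed = x.copy()
x_future_changed[5:] = 100.0
y_changed = causal_dilated_conv1d(x_future_changed, kernel=[1.0, 1.0], dilation=2)
print(y, y_changed[:5])
```

Stacking layers with dilations 1, 2, 4, and so on makes the receptive field grow exponentially with depth, which is how such a network can look far into the past with few layers.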

3) HYBRID
When monitoring time-series data with a sliding window, the detectable anomaly pattern varies according to the window size. For example, assume that we have three different window sizes for data from 30 sensors and define a covariance matrix for each window. Then, the shape of the data becomes (30, 30, 3) at time t, like an image. If we stack the covariance matrices from t − 4 to t along the time axis, the shape of the data becomes (5, 30, 30, 3), like a video, in which case we should consider the spatial information and temporal dependencies simultaneously.
Shi et al. [126] first proposed a ConvLSTM model to solve the spatiotemporal sequence-forecasting problem. They replace the dot products in the LSTM cell with convolution operators, and consequently, all gates and states in the cell are reshaped into 3D tensors that can capture spatiotemporal information. Moreover, the model learns state transitions with fewer parameters. In [51], [74], the overall architectures are based on AE and GAN, respectively. In their encoders, ConvLSTMs capture the spatiotemporal context from the feature maps across the previous time steps. Additionally, a temporal attention mechanism [127] adjusts the contribution of the previous feature maps to update the current one.
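The shapes in this example can be reproduced with a short NumPy sketch. The window sizes (10, 30, 60), the random sensor data, and the function name are hypothetical choices made only to show how the image-like and video-like tensors arise.

```python
import numpy as np

def covariance_stack(series, t, window_sizes):
    """Stack one covariance matrix per window size, each ending at time step t."""
    mats = [np.cov(series[:, t - w + 1: t + 1]) for w in window_sizes]
    return np.stack(mats, axis=-1)            # shape (n_sensors, n_sensors, n_windows)

rng = np.random.default_rng(0)
series = rng.normal(size=(30, 200))           # 30 sensors, 200 time steps
window_sizes = (10, 30, 60)                   # assumed window lengths

frame = covariance_stack(series, t=100, window_sizes=window_sizes)
clip = np.stack([covariance_stack(series, t, window_sizes) for t in range(96, 101)])
print(frame.shape, clip.shape)
```

A tensor shaped like `clip` is the kind of input that a ConvLSTM-style model would consume.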

4) ATTENTION
The attention mechanism was initially used as an auxiliary tool in models. However, novel approaches based on attention layers, such as Transformer [128] and bidirectional encoder representations from transformers (BERT) [129], have become mainstream in natural language processing (NLP). By paying attention to the inputs that contribute more to the output, attention-based models can capture very long-range dependencies while assigning a relative importance to each data point. The remarkable achievements in NLP have carried over to the time-series anomaly detection domain. In this regard, several works [73], [109], [120] employing Transformer have been presented recently.

5) OTHERS
Hierarchical temporal memory (HTM) is considered to be one of the most promising next-generation deep learning algorithms. It is designed to embody the structure and interaction of pyramidal neurons in the neocortex [69]. It comprises stacked cells in a tree shape, and the columns of cells are activated by the input and the previous states of connected neighbors. HTM can capture and predict sequence patterns and thus is beneficial to anomaly detection in time-series data. What makes HTM more unique is that it continuously learns temporal patterns from streaming data without backpropagation. Hence, HTM requires minimal human intervention to be trained in an unsupervised manner.

C. ANOMALY CRITERIA
The models addressed above learn the representation of the given data in an unsupervised or semi-supervised manner by minimizing a defined objective (loss) function. The objective differs according to the model architecture and is generally related to the decision criteria for abnormality.
Once the models are trained, they are applied to diagnose the state of systems and machinery. In general, diagnostic results are expressed numerically to help understand a given status. We call this numeric indicator an anomaly score. The greater it is, the more likely the state is to be abnormal. Specifically, when the score exceeds a certain threshold, the corresponding data point is determined to be an anomaly. In the past, domain experts decided this threshold empirically, but now it is decided according to the model-training result. Some models [68], [69], [110]-[112], [120], [121] employ an adaptive threshold that continuously adjusts to the changes in data over time. The schemes for deriving an anomaly score can be classified into three types, as depicted in Fig. 8: a reconstruction error, a prediction error, and a dissimilarity.

1) RECONSTRUCTION ERROR
In general, AE, VAE, GAN, and Transformers use reconstruction errors as anomaly scores. AE-based models, including [29], [51], [62], [72], [104], [114], compress the input into a latent representation and reconstruct it. VAE-based models, such as [68], [112], [113], estimate the data distribution and generate samples from it, which are very similar to the input data. GAN-based models explicitly generate samples that are as similar as possible to the input data with the generator, as in [53], [59], [74], [115]. Recently, Transformer with a stacked encoder-decoder structure, which consists only of attention mechanisms, has been employed in several works [73], [109], [119], [120]. In particular, Zhao et al. [73] consider both prediction and reconstruction errors jointly in their model. Even though these models use different training schemes and objective functions, they calculate anomaly scores similarly. They reconstruct or generate data analogous to the input data and measure the residual between the input and generated data.
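The scoring step shared by these models can be illustrated without a neural network. The sketch below deliberately swaps in a linear autoencoder obtained from a truncated SVD (equivalent to PCA) as a stand-in for the deep models; the function names, the synthetic training data, and the two test points are assumptions for illustration only.

```python
import numpy as np

def fit_linear_autoencoder(train, n_components):
    """Fit a linear encoder/decoder from the top principal directions."""
    mean = train.mean(axis=0)
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:n_components]

def reconstruction_error(x, mean, components):
    """Anomaly score: norm of the residual between input and reconstruction."""
    latent = (x - mean) @ components.T            # encode
    reconstructed = latent @ components + mean    # decode
    return np.linalg.norm(x - reconstructed, axis=1)

rng = np.random.default_rng(0)
direction = np.array([1.0, 2.0, 3.0])
# Normal data lie close to a one-dimensional subspace of a 3-D space.
train = rng.normal(size=(200, 1)) * direction + 0.01 * rng.normal(size=(200, 3))
mean, components = fit_linear_autoencoder(train, n_components=1)

test_points = np.array([[0.5, 1.0, 1.5],          # follows the normal pattern
                        [3.0, -1.0, 0.0]])        # breaks the correlation structure
scores = reconstruction_error(test_points, mean, components)
print(scores)
```

A deep AE, VAE, or GAN replaces the linear encode and decode steps, but the residual-based score is computed in the same way.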

2) PREDICTION ERROR
There are two ways to derive anomaly scores from the prediction model. One applies a binary label based on the probability of the data point being classified as normal, as proposed in [116], [119]. The prediction error indicates whether the expected label matches the ground truth. The other approach is to predict the expected value for the next time steps, as proposed in [69], [110], [111], [121]. In this case, the prediction error is the residual between the expected value and the observation. The second approach is more practical than the first because labels are scarce in the real world.

3) DISSIMILARITY
The dissimilarity-based scheme measures how far the value derived by the model lies from the distribution or cluster of the accumulated data. There are various methods for measuring the similarity, such as the Euclidean distance, the Minkowski distance, the cosine similarity, and the Mahalanobis distance.
In the temporal hierarchical one-class (THOC) network [60] and the TCN-Gaussian mixture model (TCN-GMM) [117], time-series features are extracted by a dilated RNN and a TCN, respectively. Then, they are clustered using a method similar to deep support vector data description, or their distribution is estimated using a GMM. THOC measures the similarity between features and clusters using cosine similarity, and TCN-GMM uses the Mahalanobis distance. The similarity obtained from the models is subtracted from one to obtain an anomaly score. Conversely, multi-stage TCN [118] uses a multivariate Gaussian distribution to estimate the distribution of prediction errors rather than the features of training data. Then, the anomaly score is determined by measuring the Mahalanobis distance between the current prediction error and the pre-estimated error distribution.
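The Mahalanobis-distance score on prediction errors can be sketched as follows. This is an illustration under simple assumptions: a single multivariate Gaussian is fitted to synthetic two-dimensional errors, and the function names and example values are hypothetical rather than taken from [118].

```python
import numpy as np

def fit_error_distribution(errors):
    """Estimate the mean and inverse covariance of historical prediction errors."""
    mean = errors.mean(axis=0)
    cov = np.cov(errors, rowvar=False)
    return mean, np.linalg.inv(cov)

def mahalanobis_score(error, mean, cov_inv):
    """Distance of a new error vector from the fitted error distribution."""
    diff = np.asarray(error, dtype=float) - mean
    return float(np.sqrt(diff @ cov_inv @ diff))

rng = np.random.default_rng(0)
# Historical errors: the first variable is noisy, the second is very stable.
errors = rng.normal(size=(500, 2)) * np.array([1.0, 0.1])
mean, cov_inv = fit_error_distribution(errors)

score_noisy_axis = mahalanobis_score([1.0, 0.0], mean, cov_inv)
score_stable_axis = mahalanobis_score([0.0, 1.0], mean, cov_inv)
print(score_noisy_axis, score_stable_axis)
```

Both error vectors have the same Euclidean norm, yet the deviation along the normally stable variable scores far higher, which is the reason for preferring the Mahalanobis distance here.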

VI. COMPARATIVE REVIEWS
In this section, we provide experimental performances of various methods on real-world datasets for time-series anomaly detection.

A. EXPERIMENTAL SETUP
To compare the performances of the presented methods, the following public time-series datasets are used:
• Secure Water Treatment (SWaT) [57]: multivariate time-series data collected over 11 days from a water treatment test-bed, a small-scale cyber-physical system. The last 4 days of data contain 36 attacks. The objectives and the durations of these attacks are diverse. For more information or to request the dataset, please refer to the SWaT website⁴.
• Water Distribution (WADI) [58]: multivariate time-series data from water distribution pipelines collected over 16 days. Each series includes various network traffic, sensor, and actuator measurements. Out of the 16 days, 14 days contain data under normal conditions, and two days contain data under attack scenarios. Please refer to the WADI website⁵ for more details.
• Mars Science Laboratory rover (MSL) [110]: multivariate time-series data recorded from the Mars Science Laboratory rover. The training and testing sets are separated, and the anomalies in the testing set are all labelled. The data is available at the public storage⁶.
Several previous works have reported the performances of anomaly detection methods on the datasets described in Table 4. The reported performances are used if available, and the other performances are obtained from our experiments. Detection results on SWaT [57] are available in [60], [72], and [109]; on WADI [58] in [72], [108], and [109]; and on MSL [110] in [68], [72], and [108].
For performance evaluation, we adopt three standard evaluation metrics: Precision, Recall, and F1-score. They take the following form:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where TP are the true positives, the number of correctly detected true anomalies; FP are the false positives, the samples incorrectly detected as anomalies; and FN are the false negatives, the undetected anomalies. Precision is the proportion of samples that are true anomalies among those predicted by the model as anomalies. Recall is the proportion of anomalies predicted by the model out of all anomaly samples. Therefore, the higher Recall is, the more anomalies are caught without omission. At the same time, the higher Precision is, the fewer false alarms occur. Because Precision and Recall generally trade off against each other, the threshold must be adjusted to evaluate model performance for different purposes. In many real-world scenarios, it is important for the system to detect as many actual attacks or anomalies as possible at the cost of a few false alarms. Therefore, we focus more on Recall and F1-score than on Precision in the experiments. Moreover, we report the best results of each model on all datasets for a fair comparison because different thresholds may result in different metric scores.
Anomaly detection methods for time-series data require various hyper-parameters to be tuned for optimal performance. Since the optimal values of the hyper-parameters are not the same for each method, we report the values used in Table 5. Typical hyper-parameters include the down-sampling ratio, window size, point adjustment, and learning rate. In most cases, time-series data are down-sampled prior to the experiments to model data of longer time frames under a fixed model capacity. According to [72], down-sampling speeds up learning by reducing the size of the data and also has a denoising effect. In addition, slicing each series using a window of a fixed length is a common practice. Point adjustment is a technique to boost the recall of the detection model. Typical anomalies in datasets tend to be temporally adjacent. If the model, making a decision at every time step, successfully detects any of the anomalies within a contiguous anomalous segment, the evaluation process regards the whole segment as detected.
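The metrics and the effect of point adjustment can be sketched in a few lines of Python. The function names and the toy label and prediction vectors below are assumptions for illustration; published evaluation scripts may differ in details such as how ties or empty predictions are handled.

```python
import numpy as np

def point_adjust(labels, preds):
    """Mark a whole ground-truth anomaly segment as detected if any point in it was."""
    labels = np.asarray(labels, dtype=bool)
    adjusted = np.asarray(preds, dtype=bool).copy()
    start = None
    for t in range(len(labels) + 1):
        inside = t < len(labels) and labels[t]
        if inside and start is None:
            start = t                              # a segment begins
        elif not inside and start is not None:
            if adjusted[start:t].any():            # at least one hit in the segment
                adjusted[start:t] = True
            start = None                           # the segment ends
    return adjusted

def precision_recall_f1(labels, preds):
    labels = np.asarray(labels, dtype=bool)
    preds = np.asarray(preds, dtype=bool)
    tp = int(np.sum(preds & labels))
    fp = int(np.sum(preds & ~labels))
    fn = int(np.sum(~preds & labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

labels = [0, 1, 1, 1, 0, 0, 1, 1]   # two anomalous segments
preds = [0, 0, 1, 0, 0, 0, 0, 0]    # one hit inside the first segment only
raw = precision_recall_f1(labels, preds)
adjusted = precision_recall_f1(labels, point_adjust(labels, preds))
print(raw, adjusted)
```

In this toy case recall rises from 0.2 to 0.6 after adjustment, which shows why results with and without point adjustment should not be compared directly.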

B. RESULTS AND ANALYSIS
We compare a wide range of state-of-the-art methods in multivariate time-series anomaly detection, categorized as follows:
• AE: DAGMM [103], MSCRED [51], USAD [72]
• VAE: LSTM-VAE [112], OmniAnomaly [68]
• GAN: MAD-GAN [59]
• RNN: THOC [60]
• Transformer: GTA [109]
• GNN: GTA [109], GDN [108]
Table 6 shows the anomaly detection accuracy in terms of Precision, Recall, and F1-score of the state-of-the-art methods on the benchmark datasets (SWaT, WADI, and MSL). Except for specific cases, we tried to employ the same experimental settings as much as possible to compare the performance fairly. If a comparison under the same settings was not feasible, we used the settings reported in the original paper. Each of these methods prioritizes a different metric, as the authors choose specific thresholds depending on their goal. Therefore, we use the F1-score as the primary criterion and sort the methods by their F1-score on SWaT.
The result shows no clear one-size-fits-all method for all the datasets and no notable distinction in performance depending on their structure.Therefore, we interpret the results from several perspectives.

1) MODELING TEMPORAL DEPENDENCIES
Compared to DAGMM [103], which is designed to treat multivariate data without temporal information, RNN-based models show superiority (see Fig. 9). The average F1-scores of the RNN-based models on the SWaT and MSL datasets are 1.87% and 14.90% higher than those of DAGMM, respectively. This is because they can take long sequences as input and capture the temporal dependencies.
LSTM-VAE [112] replaces the feed-forward network in a VAE with LSTM. MSCRED [51] is a CNN-based AE that reconstructs a feature map that contains both the aggregated information of observations and the inter-correlation between variables within a fixed-size sliding window. Between the encoder and the decoder, it captures the spatiotemporal dependencies from the feature maps across the previous time steps using ConvLSTMs. MAD-GAN [59] employs LSTM-RNN as both generator and discriminator to learn the temporal context in a generative adversarial training fashion and reconstructs the original time series explicitly. THOC [60] adopts multiple layers of dilated RNNs to model temporal dependencies with a wide range of lengths.
Most RNN-based methods outperform DAGMM, with the exception of LSTM-VAE on MSL. We argue that the main reason behind this phenomenon lies in the handling of the latent variables: although LSTM-VAE uses LSTM for sequence modeling, it ignores the temporal dependencies among latent variables. Meanwhile, OmniAnomaly [68] connects stochastic latent variables between the encoder and the decoder with a linear Gaussian state-space model to model the temporal dependencies with inherent stochasticity. As a result, approaches that do not model temporal dependency are not suitable for time-series anomaly detection.

2) PARALLEL PROCESSING FOR LONG SEQUENCES
Despite the powerful capability of sequence modeling, one drawback of RNN is that it restricts parallelization because it computes its output sequentially. Meanwhile, Transformer takes a whole sequence at once, so parallel processing is possible. Furthermore, it can reflect contextual information at once by computing contributions between all time steps through a self-attention mechanism. This property is significant for sequence modeling because a longer sequence can provide more information. Consequently, compared to DAGMM, GTA, which adopts Transformer, achieves overall 6.98% and 30.01% improvements in terms of the best F1-score on the SWaT and MSL datasets, respectively. GTA also shows 5.11% and 10.64% improvements over the mean F1-score of the RNN-based models on the SWaT and MSL datasets, respectively.

3) DIMENSION OF THE DATASETS
As shown in Fig. 10, the overall performances in terms of the best F1-score on the WADI dataset are significantly lower than those on the other datasets (SWaT and MSL), except for GNN-based methods. Recall that the dimension of the WADI dataset is 112, double that of SWaT and MSL, as described in Table 4. When we feed a 2D feature map that defines correlations between variables, such as a covariance matrix, to deep-neural-network-based models, the amount of feature expression and computation will be more than quadrupled compared to SWaT and MSL. In particular, in reconstruction-based models with deep layers, the amount of computation is overloaded for each layer. Consequently, the poor results on WADI are to be expected.

4) INTER-CORRELATIONS BETWEEN ATTRIBUTES
Despite several factors affecting performance, we can see that there is no remarkable difference in the results on the WADI dataset when simply comparing models that undergo dimensionality reduction in the preprocessing stage with those that do not. We argue that a possible reason is that some important features are lost during dimensionality reduction.
Meanwhile, GNN-based models (GTA and GDN) achieve a relatively higher F1-score on the WADI dataset. While GTA greatly benefited from the sequence modeling ability of the Transformer, GDN, which does not consider temporal dependencies, yielded notable results by simply learning the graph structure of the relationship between variables. We believe that the major factor lies in the dependencies between features. SWaT and WADI provide the network traffic and the measurements from sensors and actuators under several control processes. These attributes are not entirely independent of each other, and thus there exist inter-correlations between the attributes within the associated equipment and control processes. Therefore, trivial variations in one sensor or actuator can affect other associated attributes within the same group. As a result, we observe that graph structure learning with an attention mechanism is more effective on datasets in which the elements are strongly related.

VII. GUIDELINES FOR PRACTITIONERS
Most current anomaly detection methods are highly specific to certain use cases. This means that there is no one-size-fits-all approach. In this respect, we provide guidelines for model selection according to the purpose and the circumstances of each application. Intuitive visualizations of our guidelines are provided in Fig. 11. We also discuss the training techniques that should be considered.

A. DETECTION STRATEGIES
Time-series data is not very different from data in other domains. However, time series have unique properties.
• Real-time: Online business and finance require real-time anomaly detection to respond quickly to incidents [130]. Also, monitoring manufacturing equipment in real time is mandatory to reliably maintain manufacturing capacity. Recently, cyber-physical systems (CPS) [131] have integrated physical and computational capabilities to remotely control substantial systems in real time. They react immediately to dynamic changes and reduce human intervention. Generally, GRU- [60], [68], [113], [114] and CNN-based models [53], [62], [115] using reconstruction errors provide real-time anomaly detection capabilities. The inference time of each model can vary with computational complexity and computing resources. Models with high computational complexity take longer to make a decision. Conversely, models paired with extensive computation resources generally output the result faster. However, GRU is a type of RNN that processes the observations sequentially. Thus, the inference time of the reconstruction-based models using GRU will be constant regardless of the computation resources unless data parallelism is supported. CNN-based models, in contrast, process the observations in a window at once, and thus, they can process more features and longer sequences with sufficient computation resources.
• Early warning: Maintenance costs in manufacturing plants constitute a substantial portion of the total production cost. Once a severe failure has occurred in a facility, the operators lose vast amounts of time and money to unscheduled downtime for repair. In this regard, condition-driven predictive maintenance (PdM) [132] schemes have been introduced. Successful PdM requires improved time-series anomaly detection algorithms that can predict future breakdowns. Autoregressive algorithms that accumulate historical information in their model can predict possible faults. In particular, LSTM-based [110], [111] and HTM-based models [121] have been widely used to predict faults in time-series data. The main challenges in anomaly prediction include false alarms and missed anomalies [133], [134]. Therefore, selecting an optimal threshold for anomaly detection is particularly important. A higher threshold suppresses false alarms but may miss actual anomalies. Conversely, a lower threshold captures more anomalies but results in more false alarms.
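
The threshold trade-off described above can be illustrated with a minimal Python sketch. It is not taken from any of the reviewed methods: the function name `precision_recall_f1`, the anomaly scores, and the labels are hypothetical values chosen only to show how precision and recall move in opposite directions as the threshold changes.

```python
def precision_recall_f1(scores, labels, threshold):
    """Flag points whose anomaly score exceeds the threshold and evaluate the result."""
    preds = [s > threshold for s in scores]
    truth = [bool(y) for y in labels]
    tp = sum(p and y for p, y in zip(preds, truth))          # detected anomalies
    fp = sum(p and not y for p, y in zip(preds, truth))      # false alarms
    fn = sum((not p) and y for p, y in zip(preds, truth))    # missed anomalies
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical anomaly scores (e.g., prediction errors) and ground-truth labels.
scores = [0.1, 0.2, 0.15, 0.9, 0.4, 0.8, 0.3, 0.05, 0.6, 0.25]
labels = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]

for threshold in (0.2, 0.5, 0.85):
    p, r, f1 = precision_recall_f1(scores, labels, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

With these illustrative values, the lowest threshold catches every anomaly at the cost of false alarms, and the highest threshold raises no false alarms but misses anomalies. In practice, the threshold would be tuned on held-out data for the application at hand.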

2) SLIDING WINDOW VS. INCREMENTAL UPDATE
There are two approaches to inferring context from time-series data. A time-series model either processes all of the historical data points in a window or incrementally updates its outputs for the newest items. These approaches are called sliding windows and incremental updates, respectively.
• Sliding window: Some models can only process inputs of a fixed size. TCN-based [116]-[118] and CNN-based methods [53], [62], [115] fall into this category, and the size of the window determines the length of the temporal dependency modeled by the neural network. Therefore, practitioners should carefully choose an appropriate window size depending on the nature of the dataset (e.g., time lags between multivariate series and the frequency of subsequent anomalies). Excessive window sizes can cause anomalies to be overlooked, whereas insufficient window sizes can render the model incapable of capturing long-term dependencies. For example, Zhang et al. [51] compared the anomaly detection performance for varying window sizes and chose the value that yielded the best performance.
• Incremental update: Incremental models update their predictions for new data via marginal computations. They are particularly beneficial in streaming environments in which data items are supplied one by one. Moreover, the computational benefits should not be underestimated. Methods based on sliding windows must keep past data in memory for additional processing, which involves larger computations at each timestep. Autoregressive models, such as GRUs and LSTMs, are inherently incremental because they maintain a compact summary of past data in their hidden states. For instance, some of the LSTM-based methods [110], [111] support incremental updates. However, many methods [29], [51], [69], [104] require references to past data for pre- or post-processing using AEs or other networks. For these methods, the incremental features are limited.
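
The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the window score (range within the window) and the running mean and variance maintained with Welford's algorithm are simple stand-ins for a neural network and its hidden state, and the names `sliding_windows` and `IncrementalScorer` are hypothetical.

```python
from collections import deque


def sliding_windows(series, window_size):
    """Yield every fixed-size window; each window is processed in full."""
    window = deque(maxlen=window_size)
    for value in series:
        window.append(value)
        if len(window) == window_size:
            yield list(window)


class IncrementalScorer:
    """Keeps a compact running summary (mean, variance) instead of raw history."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        """Score the new point against the summary, then fold it into the summary."""
        score = 0.0
        if self.count >= 2:
            std = (self.m2 / (self.count - 1)) ** 0.5
            if std > 0:
                score = abs(value - self.mean) / std
        # Welford's update: constant time and memory per new observation.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)
        return score


series = [1.0, 1.1, 0.9, 1.0, 1.2, 5.0, 1.0, 0.95]

# Sliding window: one score per window, and the window contents must be stored.
window_scores = [max(w) - min(w) for w in sliding_windows(series, window_size=4)]

# Incremental update: one score per arriving point, with O(1) memory.
scorer = IncrementalScorer()
incremental_scores = [scorer.update(v) for v in series]
```

The sliding-window path must buffer `window_size` points and recompute over them at every step, whereas the incremental path touches only three numbers per update. A recurrent model plays the same role with a learned hidden state in place of the running statistics.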

B. TRAINING AND PREPROCESSING TECHNIQUES
In addition to the detection phase, anomaly detection methods have a wide range of design choices for training.

1) LOSS FUNCTION
Time-series anomaly detection models are trained using different types of loss functions depending on how they model the normality of the data. The types of loss functions include an adversarial loss, a reconstruction loss, a prediction loss, and a negative log-likelihood.
• Adversarial loss: Since the pioneering work of Goodfellow et al. [135], the adversarial formulation has been widely used [136], [137] to improve the modeling capability of generative modules. This technique was also adopted in previous studies [59], [74], [115] for time-series anomaly detection. The discriminator primarily serves as a helper for the generative component. After training, it can also be used to generate an anomaly score, as in [53], [59]. A typical adversarial formulation is given as the following two-player game:
$$\min_{G} \max_{D} \; \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_{z}(z)}\left[\log\left(1 - D(G(z))\right)\right] \tag{19}$$
where D and G are the discriminator and generator modules, respectively, x is a real sample, and z is a noise vector drawn from the prior p_z. Although the results generated by models trained with an adversarial loss can be remarkable, the most challenging issue is that the simultaneous training of two competing models is inherently unstable. Because of this instability, the models may fall into failure modes instead of converging to the optimum. A typical failure mode is mode collapse, in which the generator produces the same output for different inputs.
• Reconstruction loss: An AE is a preferred choice for anomaly detection, on the premise that an AE trained with normal training data reconstructs normal data well. Several methods [29], [51], [104], [114] use AEs, optionally in conjunction with other modules. They use reconstruction losses as training loss functions so that the AE is trained to capture the normality of the training data. A typical reconstruction loss takes the following form (for example, the squared L2 error):
$$\mathcal{L}_{\mathrm{rec}} = \sum_{t} \left\lVert S_t - S'_t \right\rVert_2^2 \tag{20}$$
where S_t is the observed data point at timestep t, and S′_t is the reconstructed data point at timestep t.
• Prediction loss: Prediction-based approaches detect anomalies by comparing the predictions with real observations [100], [111]. The prediction model is trained using a prediction loss so that the model is forced to produce an accurate prediction using past data or the relationships among features. The prediction loss is similar to (20), except that S′_t denotes the prediction for the real observation S_t. The same scheme is applied unchanged at inference time, which makes it well suited to early warning systems.
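• Negative log-likelihood: A group of generative models that can estimate the log-likelihood of input data commonly uses the negative log-likelihood (NLL) as a training loss. Minimizing the NLL maximizes the estimated likelihood of a dataset such that the model captures the notion of normality present in the dataset. GMMs are a type of generative model [103], [117] that includes the NLL in their loss functions. Note that the NLL is optimized differently across methods. For example, TCN-GMM [117] maximizes the log-likelihood presented in (21) using the expectation-maximization algorithm:
$$\log p(X \mid \theta) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} w_k \, \frac{\exp\left(-\frac{1}{2}(x_n - \mu_k)^{\top} \Sigma_k^{-1} (x_n - \mu_k)\right)}{\sqrt{(2\pi)^{D} \lvert \Sigma_k \rvert}} \tag{21}$$
where θ indicates the GMM parameters, $\{\Sigma_k, \mu_k, w_k\}_{k=1}^{K}$, x_n is the n-th of N feature vectors, and D is the number of dimensions in the feature vectors. In contrast, DAGMM [103] optimizes a similar loss term using gradient descent in an end-to-end fashion. The VAE, another class of generative models, is trained with an evidence lower bound (ELBO), as in (22), which is a lower bound of the log-likelihood. VAE-based methods [73], [112], [113] use the ELBO for training:
$$\mathrm{ELBO} = \mathbb{E}_{q_{\phi}(z \mid x)}\left[\log p_{\psi}(x \mid z)\right] - D_{\mathrm{KL}}\left(q_{\phi}(z \mid x) \,\Vert\, p(z)\right) \tag{22}$$
where q_φ(z|x) is the approximate posterior given by the encoder, p_ψ(x|z) is the likelihood given by the decoder, and p(z) is the prior over the latent variable z. These models do not simply generate a data instance similar to the input but also approximate the unknown underlying distribution from the training data.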
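
To make two of the loss terms above concrete, the following Python sketch computes a reconstruction loss and a GMM negative log-likelihood. It is an illustration only: the squared-error form of the reconstruction loss and the diagonal covariance matrices are simplifying assumptions, the function names are hypothetical, and all numbers are made up. Real methods would compute these quantities on tensors inside a deep learning framework.

```python
import math


def reconstruction_loss(observed, reconstructed):
    """Sum of squared errors between observed and reconstructed points."""
    return sum(
        sum((s - r) ** 2 for s, r in zip(s_t, r_t))
        for s_t, r_t in zip(observed, reconstructed)
    )


def gmm_negative_log_likelihood(data, weights, means, variances):
    """NLL of `data` under a GMM with diagonal covariances (an assumption)."""
    nll = 0.0
    for x in data:
        density = 0.0
        for w, mu, var in zip(weights, means, variances):
            # Log of the Gaussian normalization constant for a diagonal covariance.
            log_norm = -0.5 * sum(math.log(2 * math.pi * v) for v in var)
            # Log of the exponential term (negative half Mahalanobis distance).
            log_exp = -0.5 * sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mu, var))
            density += w * math.exp(log_norm + log_exp)
        nll -= math.log(density)
    return nll


# Reconstruction loss on two hypothetical 2-D timesteps: 0.25 + 1.0 = 1.25.
observed = [[1.0, 2.0], [0.0, 1.0]]
reconstructed = [[0.5, 2.0], [0.0, 2.0]]
rec = reconstruction_loss(observed, reconstructed)

# A two-component mixture; a point far from both components has a higher NLL.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [5.0, 5.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
nll_normal = gmm_negative_log_likelihood([[0.1, -0.2]], weights, means, variances)
nll_outlier = gmm_negative_log_likelihood([[2.5, -3.0]], weights, means, variances)
```

In both cases, the quantity minimized during training can be reused as an anomaly score at inference time: a poorly reconstructed or low-likelihood observation receives a high score.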

2) BATCH LEARNING VS. ONLINE UPDATE
A common challenge in time-series data is the nonstationary nature of data, as discussed in Section II-B. To follow changes in the data distribution, we suggest two types of approaches to updating the model.
• Batch learning: Deep learning typically assumes a stationary distribution of data, and deep neural network models are trained using a large batch of data sampled from the same distribution as the test distribution. Therefore, most deep learning-based methods require a new batch of training data to fine-tune the model whenever the distribution changes. This training scheme may be problematic when the system administrator cannot re-collect data after each data update.
• Online update: The above problem can be mitigated when the model supports online updates. An online update enables fine-tuning of the model with newly appended data without the need to re-train the model from scratch. HTM-based methods have such capabilities [69], [121], but online updates are rarely found in deep-learning models because the nonstationary assumption of data distribution is rather unconventional in machine learning. Among deep learning-based approaches, some methods [110], [111] adjust their thresholds for binary decision-making. We can consider continual learning as an alternative. Continual learning, however, suffers from the plasticity-stability dilemma. Neural networks are known to do well on forward transfer, and thus the parameters should be plastic enough to learn a new task. At the same time, they should be stable enough not to forget important features. However, fine-tuning on new tasks makes the parameters rapidly forget what they previously learned. This phenomenon is called catastrophic forgetting [138]. Common approaches to mitigating catastrophic forgetting include regularization-based, dynamic network architecture-based, and memory replay-based methods.
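
As one possible illustration of the memory replay idea mentioned above, the following Python sketch keeps a fixed-size buffer of past training samples using reservoir sampling and mixes them with new samples when forming a fine-tuning batch. This is a generic sketch, not the procedure of any cited method: the `ReplayBuffer` class, its methods, and the sample tuples are hypothetical, and the actual training step is omitted.

```python
import random


class ReplayBuffer:
    """Fixed-size memory of past samples, filled by reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.samples = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        """Retain every sample seen so far with equal probability."""
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(sample)
        else:
            # Replace a stored sample with probability capacity / seen.
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.samples[slot] = sample

    def mixed_batch(self, new_samples, replay_size):
        """Combine fresh samples with replayed old ones for one fine-tuning step."""
        k = min(replay_size, len(self.samples))
        return list(new_samples) + self.rng.sample(self.samples, k)


buffer = ReplayBuffer(capacity=5)
for window_id in range(100):  # hypothetical stream of earlier training windows
    buffer.add(("old", window_id))

new_windows = [("new", i) for i in range(3)]
batch = buffer.mixed_batch(new_windows, replay_size=2)
# `batch` would be passed to the model's usual training step (not shown).
```

Replaying a few old samples alongside the new ones is intended to keep the parameters from drifting entirely toward the most recent distribution, at the cost of storing a small memory of past data.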

3) DENOISING
Noise in time-series data is an inevitable factor induced by sensors. Noise, which is hardly distinguishable from anomalies, may degrade the performance of anomaly detection. Therefore, diverse techniques have been proposed to make the model effectively learn the normality of the data by removing the noise in advance.
• Smoothing: The exponentially weighted moving average is a recursive smoothing filter that assigns the largest weight to the current observation and exponentially decaying weights to observations further in the past. Although this method is effective, the level of denoising (i.e., the decay factor) must be determined in advance.
• Transformation: Signals bear representation in both the time and frequency domains. The wavelet transform and the fast Fourier transform decompose signals into multiple resolutions to extract frequency characteristics. The difference between the transformed data and the original data is regarded as noise.
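• Estimation: The Kalman filter removes noise by representing the data in a state-space model and applying probabilistic estimation [142].
• Deep learning: If the training dataset is small compared with the model capacity, the deep-learning model can memorize the dataset. Hence, the model learns the noise. In this case, it is difficult to distinguish between noise and anomalies. A denoising autoencoder is a general deep learning-based method that addresses this problem. It trains the AE to restore the original input from a version corrupted by random noise. Thus, it does not reconstruct the input as is, but instead robustly learns the representation of the features to prevent overfitting.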
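
The smoothing technique listed above can be sketched in Python as follows. This is a minimal illustration, assuming the common recursive form s_t = αx_t + (1 − α)s_{t−1} initialized with the first observation; the function name `ewma`, the parameter name `alpha`, and the sample values are hypothetical.

```python
def ewma(series, alpha):
    """Exponentially weighted moving average: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for x in series:
        previous = smoothed[-1] if smoothed else x  # initialize with the first value
        smoothed.append(alpha * x + (1.0 - alpha) * previous)
    return smoothed


noisy = [1.0, 1.4, 0.7, 1.2, 0.9, 1.3, 0.8, 1.1]
light = ewma(noisy, alpha=0.8)  # follows the raw signal closely
heavy = ewma(noisy, alpha=0.2)  # suppresses more noise but reacts more slowly
```

The choice of `alpha` is exactly the level of denoising that has to be determined: a small value removes more noise but can also flatten short anomalies, whereas a value close to 1 leaves the signal nearly unchanged.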

VIII. CONCLUSION
For many years, data-driven decisions have been made across businesses and industry to provide better products and services to a global community. Analytical techniques for extracting beneficial information from large volumes of data collected from various sources offer many opportunities. Moreover, identifying and troubleshooting unexpected events from time-series data can help prevent accidents and financial losses. Deep learning-based approaches have been attracting considerable attention because of their remarkable capability to resolve these problems.
In this paper, we discussed the characteristics of time-series data and the anomalies detected therein. We also described various applications of anomaly detection in several industries, including manufacturing, energy management, cloud infrastructure, and structural health monitoring. Because there has been a historical interest in anomaly detection in time-series data, we briefly presented some traditional approaches and described challenging issues regarding this topic. As the complexity of systems increases while the refined data and labels for analysis remain insufficient, the demand for unsupervised deep learning-based time-series anomaly detection continues to increase. In this regard, we provided a review of the latest deep learning-based anomaly detection methods for time-series data from several perspectives and reported the evaluation results on three real-world benchmark datasets. Finally, we concluded with guidelines for model selection and training techniques.

FIGURE 1. Anomaly types in time-series data.

FIGURE 2. The examples by equipment type: (a) a production equipment named etching machine in semiconductor manufacturing creates chip features by selectively removing dielectric and metal materials on a wafer; (b) an infrastructure facility called the central chemical supply system safely supplies high-purity chemicals to the semiconductor manufacturing process; and (c) a logistics automation equipment called automated guided vehicle transports product components in work areas.

FIGURE 3. Smart energy management systems collect data from energy supply and consumption processes. They provide real-time monitoring to alert possible failures (e.g., leaks, overloads, cyber intrusions), help stakeholders analyze data, and sometimes enable remote control.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2021.3107975, IEEE Access. Choi et al.: Deep Learning for Anomaly Detection in Time-Series Data: Review, Analysis, and Guidelines

FIGURE 5. A taxonomy of recent deep learning-based time-series anomaly detection methods. HTM, hierarchical temporal memory; RNN, recurrent neural networks; TCN, temporal convolutional networks; GNN, graph neural networks; GAN, generative adversarial networks; VAE, variational autoencoder. Most of the models do not use only one structure or method but combine several ones. We classify the models based on the main structural characteristics of each model and denote types of anomaly scores with colored circles. * is an exception because the roles and influences of Transformer and GNN are clearly separated.

1) LACK OF LABELS
Failure modes in most industrial circumstances are extremely rare, and therefore they are insufficient for use as labeled training data. The scarcity of failure modes makes collecting enough labeled training data time- and resource-intensive. Even when labeled data are obtained, the class imbalance between normal and abnormal data hampers model training.

FIGURE 6. The mechanism of a graph attention layer. The red circle is the final output.

FIGURE 7. An example of a three-layer dilated RNN with dilations of 1, 2, and 4. With its recurrent skip connections and its use of exponentially increasing dilation, it alleviates gradient problems and extends the range of temporal dependencies with fewer parameters.

FIGURE 8. Examples of each type of anomaly criterion: (a) a reconstruction error; (b) a prediction error; and (c) a dissimilarity.
reconstruct input data by extracting features from them. VAE-based models

FIGURE 9. Experimental results on SWaT and MSL.The RNN-based and Transformer-based models that capture temporal dependencies outperform DAGMM, the non-temporal modeling method.

FIGURE 10. Experimental results on SWaT, MSL, and WADI. The dimension of the dataset affects the performance.

FIGURE 11. Strategies for anomaly detection in time-series data: (a) real-time vs. early warning; (b) sliding windows vs. incremental update.

TABLE 3. Modeling temporal context.

TABLE 5. Hyper-parameter values used for each method. Methods marked with † are those whose papers also report the performance of some other models measured under the same environment. MSCRED [51] jointly uses sliding windows of three sizes in the original work, and we have reflected this in our experiments.
† THOC

TABLE 6. Anomaly detection accuracy in terms of Precision (%), Recall (%), and F1-score on three datasets with ground-truth anomalies. We left some results blank when comparisons are not possible, such as when there is no source code or an accredited performance evaluation.
* Methods marked with * did not apply point adjustment on the WADI dataset, which results in relatively poor Recall and F1-score.