Hybrid Anomaly Detection via Multihead Dynamic Graph Attention Networks for Multivariate Time Series

In the real world, a large number of multivariate time series data are generated by Internet of Things systems, which are composed of many connected sensing devices. Therefore, it is impractical to consider only a single univariate time series for decision-making. High-dimensional time series decrease the performance of traditional anomaly detection methods. Moreover, many previously developed methods capture temporal correlations instead of spatial correlations. Therefore, it is necessary to learn the temporal and spatial correlations between different time series and timestamps. In this paper, to achieve improved anomaly detection performance for multivariate time series, we propose a novel architecture based on a graph attention network (GAT) with multihead dynamic attention (MDA). This framework simultaneously learns the dependencies between sensors in both the temporal and spatial dimensions. To tackle the overfitting problem in autoencoder (AE)-based methods, we propose a hybrid approach that combines a novel generative adversarial network (GAN) architecture as a reconstruction model with a multilayer perceptron (MLP) as a prediction-based model to detect anomalies together. The detection framework proposed in this paper is called the HAD-multihead dynamic GAT (MDGAT). Extensive experiments on different public benchmarks demonstrate the superior performance of HAD-MDGAT over state-of-the-art methods.


I. INTRODUCTION
With the rapid development of information technology, the scales of all kinds of data continue to expand. Time series anomaly detection has become a field of interest for many researchers and practitioners [1]. Anomaly detection has been applied in various domains, such as intrusion detection in cybersecurity, medical detection, economic analysis and fault diagnosis in industry [2].
Time series can be divided into univariate time series and multivariate time series. Because a univariate time series has only one dimension of data, some anomaly detection algorithms can locate anomalies with respect to that one feature. However, in real-world scenarios, many sensors are interconnected and jointly generate large volumes of time series data, such as in cyber-physical systems (CPSs) [3]. Data from different sensors can be related in complex, nonlinear ways (for example, pressure changes affect flow rates and water levels). Similar to these CPS data, multivariate time series data have many interconnected correlations. If a single feature (such as a univariate time series) is used to detect anomalies, it may be difficult to determine whether the system of interest runs normally. Anomalies in multivariate time series tend to be determined by multiple spatial features, and the analysis of a single feature is insufficient for correctly detecting anomalies. Therefore, it is necessary to take the correlations of multiple spatial features into consideration when addressing multivariate time series.
The associate editor coordinating the review of this manuscript and approving it for publication was Sajid Ali.
Time series also often lack labeled samples [4]. Anomaly labeling requires high expert costs and does not guarantee coverage for all anomaly types. Thus, unsupervised deep learning methods are typically used to accomplish the task of anomaly detection. In recent years, many unsupervised anomaly detection approaches, including prediction-based methods and reconstruction-based methods, have been proposed. Prediction-based methods use recurrent neural networks (RNNs), such as the deep malicious insider threat detector (DeepMIT) (Sun et al. [5]) and long short-term memory (LSTM) network (Hundman et al. [6]). DeepMIT models user behaviors as sequences and predicts the probabilities of anomalies. These approaches utilize the differences between predicted and real samples to detect anomalies. However, with the increasing dimensionality and scales of time series, it is becoming more challenging for these conventional prediction-based methods to effectively capture the temporal correlations in high-dimensional multivariate time series [7]. Reconstruction-based methods, such as the autoencoder (AE) proposed by Aggarwal [8] and the generative adversarial network for multivariate anomaly detection (MAD-GAN) proposed by Li et al. [9], can reconstruct samples. The reconstruction error can be obtained by the difference between the original and reconstructed samples. These methods do not simultaneously consider the temporal and spatial dimensions between sensors. Therefore, these methods do not have high accuracy when used with multivariate time series that contain many potential interrelationships. Moreover, these methods can effectively fit data according to the obtained reconstruction errors for anomaly detection. If the data include anomalies, these methods (such as AE variants) also fit anomalies well, leading to reduced anomaly detection performance. When anomalous data are very close to normal data, they are often undetectable.
However, many methods (such as the above models) do not take spatial correlations, which are important for anomaly detection, into account. In the real world, most data are generated from non-Euclidean spaces. Many deep learning methods have poor performance in terms of handling these data. In recent years, graph neural networks (GNNs) have seen increasing popularity. They can effectively model graph-structured data [10], such as molecules, and they have made great progress in terms of capturing spatial correlations. Three main types of GNNs are available, including graph convolution networks (GCNs [11]), graph attention networks (GATs [12]), and graph AEs (GAEs). GCNs can be further classified as spectral or spatial methods. A spectral GCN method uses a spectral decomposition approach, such as Laplace matrix decomposition for a graph, to aggregate node information. When the given graph is large, the whole graph must be used, resulting in decreased performance. A spatial GCN method uses the topology of the input graph to directly aggregate its neighbor node information at each layer of the GCN. Thus, this approach has greater potential to deal with large graphs than spectral GCN methods. Attention mechanisms are widely used in different domains, such as computer vision and natural language processing. Some methods must also observe time series data to mine their useful information. GNNs are no exceptions. A GAT applies an attention mechanism to assign different weights for different neighbor nodes. However, the implementations of GATs are only static: for any query, the neighbor scores are monotonic according to the per-node scores. As a result, a GAT cannot express even simple alignment problems and capture much information between different observations.
Taking the above problems into consideration, we propose a novel architecture, a HAD-multihead dynamic GAT (MDGAT), based on a GAT. The four main contributions of this paper are as follows.
• We propose a HAD-MDGAT based on a GAT. It simultaneously learns the dependencies between sensors in both the temporal and spatial dimensions. It is also more robust.
• We introduce a multihead dynamic attention (MDA) mechanism in our architecture to capture the interrelationships between different sensors. This mechanism can deal with alignment problems and model the different correlations between different keys and different queries.
• Prediction-based and reconstruction-based methods are integrated into our model. The prediction-based model can predict the next value by utilizing spatial and temporal correlations. To solve the overfitting problem, we propose re-encoding a GAN to reconstruct data. This technique uses two generators as encoders to compute differences as parts of the reconstruction errors, improving the accuracy of anomaly detection.
• Experimental results obtained on public datasets show that the HAD-MDGAT achieves the best performance in comparison with state-of-the-art baselines.
The rest of the paper is organized as follows. Section II describes the related work, and Section III presents the details of our proposed HAD-MDGAT model and how to use it for anomaly detection. In Section IV, the HAD-MDGAT is evaluated on multiple datasets, where it achieves better performance than state-of-the-art methods. Section V concludes the paper and proposes possible future work ideas.

II. RELATED WORK
As mentioned in the introduction, time series data are applied in various domains. To date, many anomaly detection methods have been proposed for industrial applications [13]- [15]. Time series can be divided into univariate and multivariate time series. Univariate time series have one dimension. Multivariate time series have many dimensions. However, many anomaly detection methods take temporal correlations into consideration while ignoring the spatial dimension. Some unsupervised anomaly detection methods have made great progress. Even though few GNN-based methods are used for time series anomaly detection, they have recently attracted increased attention.

A. TIME SERIES ANOMALY DETECTION
Univariate time series anomaly detection methods only take the dependencies of the current timestamp and the previous timestamps, such as temporal correlations, into consideration. However, methods for multivariate time series also consider the correlations between different observations. Some methods deal with both kinds of time series anomaly detection. Among these methods, deep learning approaches have attracted the most attention from researchers. One category, unsupervised learning methods, does not need labeled samples. Classic anomaly detection methods can be divided into proximity-based methods, prediction-based methods and reconstruction-based methods [16].
Proximity-based methods, such as K-nearest neighbors (KNN) [17] and the local outlier factor [18], measure the degree to which an object deviates from its neighbors. These methods ignore the temporal correlations between observations and need prior knowledge, such as the number of anomalies that are present.
Prediction-based methods are commonly used. Their main idea is that anomalies are identified according to the differences between the predicted values and the real values; such approaches include the autoregressive integrated moving average (ARIMA) [19], gradient boosting regression tree (GBRT) [20], and LSTM [21] methods. ARIMA responds with a certain lag and is sensitive to anomalies; it also requires extensive stationarity testing and parameter estimation. The GBRT approach is applied to detect anomalies in data with stable patterns and periodic characteristics. Due to the uncertainty of single regression tree generation, its results can differ considerably. The ARIMA and GBRT techniques do not consider temporal correlations. However, deep learning methods can tackle these problems. RNNs [22] can detect anomalies by predicting time series data, and they capture the temporal correlations between different observations. However, the performance of RNNs degrades as the input time series become longer, which means that RNNs cannot capture long-term dependencies [23].
To date, reconstruction-based methods, such as AEs [24], variational AEs (VAEs) [25], the LSTM-VAE [26], unsupervised anomaly detection (USAD) [27], OmniAnomaly [28], the MAD-GAN [9], and a GAN with an attention network and bidirectional LSTM (AMBi-GAN) [29], have also been widely investigated. Such an approach learns a model to reconstruct data that are as similar as possible to the original data, and anomalies are identified by their high anomaly scores. An AE is the basic model. To improve the performance of the original AE, Chen et al. proposed a VAE. A VAE additionally uses the Kullback-Leibler divergence to measure the difference between the estimated and prior distributions. It combines the reconstruction error and the distribution error to detect anomalies, but it ignores the temporal correlations in the data. The LSTM-VAE was proposed to capture temporal correlations. USAD is an unsupervised reconstruction-based method that consists of three parts: an encoder and two decoders that share the same encoder network. It also uses LSTM to capture temporal correlations. OmniAnomaly uses a VAE with gated recurrent units (GRUs) to detect anomalies. However, OmniAnomaly does not amplify the reconstruction error. In the MAD-GAN, LSTM serves as the basic architecture of the generator and the discriminator to capture temporal correlations when processing time series data. However, LSTM exhibits gradient instability and mode collapse problems. The AMBi-GAN consists of bidirectional LSTM and an attention mechanism, and it can capture temporal correlations. Recently, Nguyen et al. [30] proposed an LSTM-based method to detect anomalies. It uses LSTM to predict time series and employs an AE-LSTM with a one-class support mechanism to reconstruct time series. Prediction-based and reconstruction-based methods have also been combined to detect anomalies. However, reconstruction-based methods can effectively fit the input data. Thus, these methods fit anomalies when they are close to normal data [31]. Therefore, the resulting overfitting problem decreases the accuracy of anomaly detection.
Even though the above methods are effective, they do not take spatial correlations into consideration.

B. ANOMALY DETECTION WITH GNNS
Deep learning has achieved great success in data representation, and the patterns of anomalies can be learned by deep learning methods. However, many deep learning methods perform poorly when handling non-Euclidean data. GNNs, which are based on deep learning, have been proposed to tackle graph-structured data. They enhance the capability of the resulting model to process graph pattern information, so anomalies can be easily identified according to the extracted representation [32]. GCNs generalize the convolutional networks used in computer vision to graph data. Wu et al. [33] proposed the multitask GNN (MTGNN) for multivariate time series forecasting problems. The MTGNN consists of a graph convolution module and a temporal convolution module to capture the spatiotemporal dependencies between time series. Weber et al. [34] proposed EvolveGCN to detect anomalies in financial transaction networks. Their approach uses a GCN as the feature extractor. The k most influential nodes represent all the information contained in the network at a certain moment. However, the overall characteristics of the network are ignored. The graph deviation network (GDN) [35] also uses the top-k method employed by the MTGNN to construct a graph. It treats each time series as a node on the graph, but the connections between the nodes are learned. GATs are also used for feature extraction. The GDN evaluates a graph deviation score as the difference between the expected value and the observed value. Wang et al. [36] showed that a GNN can effectively model multirelation data. Attention mechanisms have been widely applied to sequence-based tasks. GNNs also benefit from this concept by using an attention mechanism during aggregation, integrating the outputs of multiple models, and generating random walks that are oriented to important targets. A GAT is a spatial-based GCN. It uses an attention mechanism to determine the weights of node neighborhoods when aggregating feature information, and it considers the correlations between different time series. Huang et al. [37] proposed a hybrid-order GAT (HO-GAT) to detect anomalies in attributed networks. This network uses an HO self-attention mechanism to learn node and motif instance representations. Two encoders are used to reconstruct the attribute information. The reconstruction errors can serve as anomaly scores for detecting anomalies. Wu et al. [38] proposed Event2Graph, which uses a dynamic bipartite graph structure to capture the interdependencies between observations. Event2Graph converts the predicted event edges into anomaly scores. If an anomaly score is higher than the threshold, the corresponding event is classified as an anomaly. Fan et al. [39] proposed AnomalyDAE, which uses a GAT in its structural encoder to learn the importance levels among nodes and their neighbors. Thus, AnomalyDAE can efficiently capture structural information.
VOLUME 10, 2022

III. METHODOLOGY
Through the above description, we know that current deep anomaly detection methods only concentrate on temporal correlations while ignoring spatial correlations. In addition, some methods overfit anomalies. In this section, we first state the current problems, propose the HAD-MDGAT framework for capturing the temporal and spatial correlations between different observations, then discuss the proposed novel GAN framework, and finally compute anomaly scores for anomaly detection.

A. PROBLEM STATEMENT
In our work, we focus on anomaly detection in multivariate time series. We execute the HAD-MDGAT on real-world datasets to find anomalous samples that are apparently different from other observations. In our work, the datasets are derived from sensors at timestamp $T$; the sensor data are denoted as $X = \{x_1, x_2, \ldots, x_T, \ldots, x_n\} \in \mathbb{R}^{N \times n}$. At timestamp $i$, $x_i \in \mathbb{R}^{N}$, $i = 1, 2, \ldots, n$, is an $N$-dimensional vector determined from $N$ sensors, where $N$ is the number of features and $n$ is the length of the input data. The inputs are generated by a sliding window. The final output is a vector $y \in \mathbb{R}^{n}$, where $y_t \in \{0, 1\}$ and $y_t = 1$ indicates that the observation at time $t$ is anomalous.
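The sliding-window input generation mentioned above can be illustrated with a minimal sketch. The function name `sliding_windows`, the stride of one step, and the time-major layout (one row per timestamp, one column per sensor) are assumptions made for illustration; the source does not specify them.

```python
from typing import List

Series = List[List[float]]  # one row per timestamp, one column per sensor


def sliding_windows(series: Series, window: int) -> List[Series]:
    """Split a multivariate series into overlapping windows (stride 1 is assumed)."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [series[start:start + window]
            for start in range(len(series) - window + 1)]


# Toy series: 5 timestamps, N = 2 sensors per timestamp.
toy = [[0.0, 1.0], [0.1, 1.1], [0.2, 1.2], [0.3, 1.3], [0.4, 1.4]]
windows = sliding_windows(toy, window=3)
```

Each element of `windows` is one model input; with the window size of 100 used in the experiments, a series of length n would yield n - 99 windows under this stride assumption.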
Our work simultaneously learns the dependencies between sensors in both the temporal and spatial dimensions with two MDGATs. Then, a GRU is applied to capture the pattern features of the given time series. Next, we propose a hybrid approach that combines a prediction-based method and a reconstruction-based method to detect anomalies. With respect to the overfitting problem faced by reconstruction-based methods, we propose a novel GAN as the reconstruction-based approach.
To further understand GNNs, we provide the following foundational concepts.

1) GRAPH
A graph is denoted as $G = (V, E)$, where $V = \{v_1, v_2, \ldots, v_n\}$ denotes the nodes and $E \subseteq V \times V$ denotes the edges. Here, $v_i$ (with the same dimensionality as $x_i$) denotes the feature vector of each node, and $e_{i,j}$ represents an edge from node $v_j$ to node $v_i$. An undirected graph has bidirectional edges.

2) NODE NEIGHBORS
The neighbors of node $v_i$ are defined as $\mathcal{N}(v_i) = \{v_j \in V \mid e_{i,j} \in E\}$.

3) PEAK OVER THRESHOLD (POT)
We apply the POT approach [40] as the threshold selection method. It can automatically select the appropriate threshold for a time series. Moreover, it does not make any assumptions about the data distribution and fits the tails of a probability distribution via a generalized Pareto distribution (GPD) with parameters.
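As a rough illustration of how a POT threshold can be derived, the sketch below fits a GPD to the excesses over an initial high quantile and then extrapolates to a small target risk. It is a simplified stand-in, not the procedure of [40]: the GPD parameters are estimated with the method of moments rather than maximum likelihood, and the names `pot_threshold`, `init_quantile`, and `risk`, together with their default values, are hypothetical.

```python
import math
from typing import List, Tuple


def gpd_method_of_moments(excesses: List[float]) -> Tuple[float, float]:
    """Estimate the GPD shape (gamma) and scale (sigma) from mean and variance."""
    n = len(excesses)
    mean = sum(excesses) / n
    var = sum((e - mean) ** 2 for e in excesses) / (n - 1)
    if var == 0.0:
        raise ValueError("excesses must not all be identical")
    ratio = mean * mean / var
    gamma = 0.5 * (1.0 - ratio)
    sigma = 0.5 * mean * (1.0 + ratio)
    return gamma, sigma


def pot_threshold(scores: List[float],
                  init_quantile: float = 0.98,
                  risk: float = 1e-3) -> float:
    """Return a threshold that scores should exceed with probability about `risk`."""
    ordered = sorted(scores)
    index = min(int(init_quantile * len(ordered)), len(ordered) - 1)
    t = ordered[index]                                  # initial high threshold
    excesses = [s - t for s in ordered if s > t]        # peaks over t
    if len(excesses) < 2:
        raise ValueError("not enough peaks above the initial threshold")
    gamma, sigma = gpd_method_of_moments(excesses)
    r = risk * len(ordered) / len(excesses)
    if abs(gamma) < 1e-9:                               # exponential-tail limit
        return t - sigma * math.log(r)
    return t + (sigma / gamma) * (r ** (-gamma) - 1.0)


# Deterministic toy scores with an exponential-like tail.
scores = [-math.log(1.0 - (i + 0.5) / 1000.0) for i in range(1000)]
initial = sorted(scores)[int(0.98 * 1000)]
threshold = pot_threshold(scores)
```

The final threshold lies above the initial quantile whenever the target risk is smaller than the fraction of peaks, which is the regime of interest for anomaly detection.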

B. HAD-MDGAT FRAMEWORK
The HAD-MDGAT framework (shown in Fig. 1) simultaneously learns the dependencies between sensors in both the temporal and spatial dimensions through MDGATs. Then, the two GAT layers and the output of the 1-D convolution layer are concatenated to extract the features of the input time series. The concatenated vector is fed into a prediction-based module and a reconstruction-based module (shown in Fig. 3).
The predicted results, the reconstructed samples and the real values are used to compute anomaly scores. A threshold is set by the POT method. If an anomaly score exceeds this threshold, we can identify that the corresponding sample is anomalous.

1) DATA PROCESSING
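In multivariate time series, different variables have different scales. This affects the selected threshold and the robustness of the hybrid modules. Thus, we process the input time series by applying maximum-minimum normalization to the training and testing data: $\tilde{x} = \frac{x - \min(X)}{\max(X) - \min(X)}$, where $\min(X)$ and $\max(X)$ are the minimum and maximum values of the corresponding variable.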
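A minimal sketch of per-variable maximum-minimum normalization is given here for concreteness. Two choices are assumptions rather than details from the source: the minimum and maximum are computed on the training data and reused for the testing data, and a small `eps` is added to the denominator to avoid division by zero for constant variables.

```python
from typing import List, Tuple

Rows = List[List[float]]  # one row per timestamp, one column per variable


def fit_min_max(train_rows: Rows) -> Tuple[List[float], List[float]]:
    """Compute the per-variable minimum and maximum on the training data."""
    columns = list(zip(*train_rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def min_max_transform(rows: Rows, lows: List[float], highs: List[float],
                      eps: float = 1e-8) -> Rows:
    """Rescale every variable with the stored training statistics."""
    return [[(v - lo) / (hi - lo + eps) for v, lo, hi in zip(row, lows, highs)]
            for row in rows]


train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
lows, highs = fit_min_max(train)
scaled = min_max_transform([[5.0, 30.0]], lows, highs)
```

Reusing the training statistics keeps the testing data on the same scale as the data the model was fitted on, so a threshold chosen on training scores remains comparable.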

2) MULTIHEAD DYNAMIC ATTENTION (MDA)
Due to increases in data volumes and the number of connected sensing devices, it is difficult to achieve high accuracy in the multivariate anomaly detection task. Many deep learning methods concentrate on temporal correlations instead of spatial correlations. Therefore, we introduce a GAT with MDA to capture the temporal and spatial correlations between different observations. In the output computed by the GAT layer for each node, $h_i$ denotes the output representation of node $i$; $\alpha_{ij}$ measures the degree of correlation between $v_i$ and $v_j$; node representations are combined by concatenation; $a \in \mathbb{R}^{2N}$ and $w \in \mathbb{R}^{2N}$ are trainable parameters; a leaky rectified linear unit (LeakyReLU) is used as the activation function when computing the attention weights between node pairs $(i, j)$ for the representation of node $i$; and $j$ ranges over the adjacent neighbors of node $i$.
A multihead attention mechanism is also applied in the HAD-MDGAT. The feature vectors calculated by the $K$ attention heads are joined by vector concatenation to form the output feature vector, where $\sigma$ is the sigmoid activation function; $K$ is the number of attention heads used to calculate the attention scores; $\alpha^{k}_{ij}$ is the attention score calculated by the $k$-th attention head; and $w^{k}$ is the parameter matrix of the linear transformation that the $k$-th head applies to the input vector. Following [12], concatenation is no longer sensible when MDA is applied on the last layer of the network, where it does not achieve good results. Therefore, concatenation is used only in the intermediate layers, and the outputs of the heads are averaged in the last layer. The resulting $h_i$ is the feature vector that is input into the GRU after feature extraction is performed by the HAD-MDGAT. For multivariate time series anomaly detection, we use two kinds of graph attention layers with MDA (the MDGAT) to learn the dependencies in both the temporal and spatial dimensions.
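For reference, the quantities described above can be written in the standard forms used for graph attention. These are the well-known static GAT and dynamic (GATv2-style) formulations and the usual concatenation and averaging rules for multiple heads; that the MDA mechanism follows exactly these forms is an assumption, and $W$ and $\mathcal{N}(i)$ are notation introduced here for the weight matrix and the neighbors of node $i$.

```latex
% Static attention (original GAT): the score is monotonic in the per-node terms
e_{ij} = \mathrm{LeakyReLU}\left( a^{\top} \left[ W v_i \,\|\, W v_j \right] \right)

% Dynamic attention (assumed form): the nonlinearity is applied before the scoring vector a
e_{ij} = a^{\top} \, \mathrm{LeakyReLU}\left( W \left[ v_i \,\|\, v_j \right] \right),
\qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{l \in \mathcal{N}(i)} \exp(e_{il})}

% Intermediate layers: concatenate the K heads
h_i = \Big\Vert_{k=1}^{K} \, \sigma\Big( \sum_{j \in \mathcal{N}(i)} \alpha^{k}_{ij} \, w^{k} v_j \Big)

% Last layer: average the K heads
h_i = \sigma\Big( \frac{1}{K} \sum_{k=1}^{K} \sum_{j \in \mathcal{N}(i)} \alpha^{k}_{ij} \, w^{k} v_j \Big)
```

In the dynamic form, the ranking of neighbors can change with the query node $i$, which is the property that static attention lacks.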

a: SPATIAL LAYER
To capture spatial correlations, we view a multivariate time series as a complete graph. Every node holds the values of one feature across $n$ timestamps, and an edge represents the dependency between two nodes. $N$ denotes the number of features (nodes). Node $x_i$ is denoted as $x_i = \{x_{i,t} \mid t \in [0, n)\}$. An MDA mechanism is applied to calculate $h_{S}$ for a certain node. The spatial layer is shown in Fig. 2.

b: TEMPORAL LAYER
We also use MDA to capture the temporal correlations between different observations. Each node $x_t = \{x_{i,t} \mid i \in [0, N)\}$ denotes one timestamp with $N$ features (or sensors). The output of the spatial layer is an $N \times n$ matrix, and the output of the temporal layer ($h_{Temp}$) is an $n \times N$ matrix.
Finally, we concatenate the outputs of the two layers and the processed vector x i . This forms an n × 3N matrix containing spatial and temporal correlations.
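The shape bookkeeping of this concatenation can be checked with a small sketch. It assumes that the $N \times n$ spatial output is transposed to $n \times N$ before concatenation and that the three blocks are joined in the order (input, spatial, temporal); both choices and the function names are illustrative assumptions.

```python
from typing import List

Matrix = List[List[float]]


def transpose(m: Matrix) -> Matrix:
    """Swap the rows and columns of a matrix."""
    return [list(col) for col in zip(*m)]


def concat_features(x: Matrix, h_spatial: Matrix, h_temporal: Matrix) -> Matrix:
    """Join x (n x N), h_spatial (N x n), and h_temporal (n x N) into n x 3N."""
    h_spatial_t = transpose(h_spatial)  # now n x N, one row per timestamp
    return [x_row + s_row + t_row
            for x_row, s_row, t_row in zip(x, h_spatial_t, h_temporal)]


# Toy example with n = 3 timestamps and N = 2 features.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
h_spatial = [[10.0, 30.0, 50.0], [20.0, 40.0, 60.0]]
h_temporal = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
features = concat_features(x, h_spatial, h_temporal)
```

Each row of `features` then describes one timestamp with 3N values, which is the layout a recurrent layer such as a GRU consumes step by step.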

C. RE-ENCODING GANS
As mentioned in Section II, prediction-based methods and reconstruction-based methods each have their own advantages. Therefore, we use the two types of methods together to detect anomalies. The output of the GRU is input into the prediction-based method (a multilayer perceptron, MLP) and the reconstruction-based method (re-encoding GANs) simultaneously. However, better reconstruction performance often results in the overfitting of anomalies, which reduces the accuracy of anomaly detection. Moreover, mode collapse is common during GAN training. Therefore, we propose a novel GAN (shown in Fig. 3) as the reconstruction-based method to generate samples. To deal with mode collapse, we use the Wasserstein loss [41] to train the GAN. GRU cells are well suited to generating time series. Therefore, we apply fully connected neural networks with GRU cells to achieve improved anomaly detection performance. The two generators serve as encoders ($E_1$ and $E_2$). Each generated sample is encoded again into the latent space, and the re-encoding loss is obtained from the difference between the two encodings in the latent space. The output of $D_x$ is a probability score ranging from 0 to 1, which can be used as part of the anomaly score for detecting anomalies. $D_z$ identifies whether its input is obtained from random noise or from the encoded latent space. This enables the $z$ distribution to be as close as possible to the $X$ (original time series) distribution and helps deal with the overfitting problem.
The objective functions of $D_z$ and $D_x$ are based on the Wasserstein loss. However, during training, it is not guaranteed that the learned mapping can map each individual input $x_i$ to the desired $\hat{x}_i$ by relying on the adversarial and Wasserstein loss functions alone. When a network has a sufficiently large capacity, any random arrangement of the input data can be mapped to an output distribution that matches the target. The cycle consistency loss [42] was proposed by Zhu et al. It ensures that images in the corresponding domains have a one-to-one correspondence and prevents conflicts between the samples generated by the two generators. Therefore, the two generators can transform the generated samples back to their original states. Our GAN maps the input $x_i$ to the target $z_i$ in the latent space via $E_1$ and then generates $\hat{x}$ with a generator $G$. To reduce the size of the space of possible mapping functions, the learned function should be cycle-consistent so that the mappings $G$ and $E_1$ do not contradict each other. $E_1$, $E_2$, and $G$ are trained with the adaptive cycle consistency loss. To further improve the accuracy of anomaly detection, we use two encoders in the GAN to amplify anomalies. The re-encoding loss is used to detect anomalies according to the observed differences in the latent space. When the comparison is conducted in the latent space after encoding, anomalies can be detected more effectively. By minimizing the difference between the input features and the encoded features of the generator, the generator learns how to encode real samples into the corresponding latent space. Therefore, the generator can address the overfitting problem encountered by the reconstruction-based method. The re-encoding objective minimizes this difference in the latent space.
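To make the data flow of the re-encoding step concrete, the following sketch traces one sample through an encoder, a generator, and a second encoder. The three toy functions are hypothetical stand-ins for the trained networks $E_1$, $G$, and $E_2$; using an L2 distance for the re-encoding loss and an L1 distance for cycle consistency (as in Zhu et al.) is an assumption, since the exact objectives are not reproduced here.

```python
import math
from typing import Callable, List

Vector = List[float]
Mapping = Callable[[Vector], Vector]


def re_encoding_loss(x: Vector, e1: Mapping, g: Mapping, e2: Mapping) -> float:
    """L2 distance between the encoding of x and the re-encoding of G(E1(x))."""
    z = e1(x)            # encode the real sample
    x_hat = g(z)         # reconstruct it from the latent code
    z_hat = e2(x_hat)    # encode the reconstruction again
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z, z_hat)))


def cycle_consistency_loss(x: Vector, e1: Mapping, g: Mapping) -> float:
    """L1 distance between x and its reconstruction G(E1(x))."""
    return sum(abs(a - b) for a, b in zip(x, g(e1(x))))


# Hypothetical toy networks operating on 3-dimensional samples.
def toy_e1(x: Vector) -> Vector:
    return [x[0] + x[1], x[2]]


def toy_g(z: Vector) -> Vector:
    return [z[0] / 2.0, z[0] / 2.0, z[1]]


def toy_e2(x: Vector) -> Vector:
    return [x[0] + x[1], 0.9 * x[2]]


normal_loss = re_encoding_loss([1.0, 1.0, 2.0], toy_e1, toy_g, toy_e2)
odd_loss = re_encoding_loss([0.0, 2.0, 5.0], toy_e1, toy_g, toy_e2)
odd_cycle = cycle_consistency_loss([0.0, 2.0, 5.0], toy_e1, toy_g)
```

In this toy setting, the sample that the encoder-generator pair reproduces poorly receives the larger latent-space discrepancy, which is the effect the re-encoding loss is meant to exploit.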

D. PREDICTION-BASED METHOD
We combine prediction-based and reconstruction-based methods to detect anomalies. Fully connected layers form the basic architecture of the prediction-based method. The loss is computed between the predicted and real values of $x_n$, where $x_n$ is the observation at the next timestamp and $x_{n,i}$ denotes the value of the $i$-th feature of $x_n$.
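One common way to score a next-step prediction is the root of the summed squared errors over the $N$ features. The sketch below uses that form as an assumption; the source does not show the exact loss, and the function name is hypothetical.

```python
import math
from typing import List


def prediction_loss(predicted: List[float], actual: List[float]) -> float:
    """Root of the summed squared errors over all features of the next timestamp."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)))


loss = prediction_loss([1.0, 2.0], [4.0, 6.0])  # errors of 3 and 4
```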

E. ANOMALY SCORES

1) RECONSTRUCTION ERROR
Dynamic time warping (DTW) can identify areas with small differences over a long period of time and can address time drift issues. Each series is locally stretched or compressed along the time axis (''warped'') to achieve better alignment. The best match of a given time series is calculated to measure the similarity between local regions. Thus, we use DTW to measure the differences between real and reconstructed samples. Given two time series $X$ and $\hat{X}$, a $2l \times 2l$ matrix is used to compare them. The warping path traverses this matrix, and the $k$-th element of the warping path is denoted as $w_k = (i, j)_k$, which corresponds to the minimum distance between $x_i$ and $\hat{x}_j$. Here, the sequences ending at $x_{i+l}$ and $\hat{x}_{i+l}$ are the real and reconstructed samples for the $i$-th feature, respectively.
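The DTW comparison described above can be sketched with the classic dynamic programming recurrence. This is the textbook algorithm with an absolute-difference local cost and no window constraint; whether the proposed method uses the same local cost or a constrained warping band is not stated, so those details are assumptions.

```python
from typing import List


def dtw_distance(a: List[float], b: List[float]) -> float:
    """Accumulated cost of the cheapest warping path between two series."""
    inf = float("inf")
    # cost[i][j] is the best accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # repeat b[j-1]
                                     cost[i][j - 1],      # repeat a[i-1]
                                     cost[i - 1][j - 1])  # advance both
    return cost[len(a)][len(b)]


shifted = dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
offset = dtw_distance([0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
```

A reconstruction that is merely delayed or stretched in time (`shifted`) costs nothing, whereas a genuine level difference (`offset`) accumulates along the whole path.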

2) RE-ENCODING LOSS
The re-encoding loss is computed from the difference between the two encodings in the latent space.

3) PREDICTION LOSS
We calculate the anomaly score $as_i$ for each of the $N$ features. The final anomaly score produced by the prediction-based method is the sum of the scores of all features.
The final anomaly score $AS$ combines the prediction-based and reconstruction-based errors, where $\lambda$ is the hyperparameter that weights the two; its default value is 0.5. According to the POT technique, if $AS$ exceeds the threshold, the corresponding samples are identified as anomalies.
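A simple way to combine two error terms with a single weight is a convex combination, which the sketch below uses together with threshold-based flagging. The convex form is an assumption made for illustration, as the exact combination rule is not shown; the function names are hypothetical, and the threshold would in practice come from the POT step.

```python
from typing import List


def combine_scores(prediction_error: float, reconstruction_error: float,
                   lam: float = 0.5) -> float:
    """Weighted combination of the two errors (assumed convex form)."""
    return lam * prediction_error + (1.0 - lam) * reconstruction_error


def flag_anomalies(scores: List[float], threshold: float) -> List[int]:
    """Return 1 where the score exceeds the threshold and 0 elsewhere."""
    return [1 if s > threshold else 0 for s in scores]


pairs = [(0.2, 0.8), (0.1, 0.1), (2.0, 3.0)]  # (prediction, reconstruction)
scores = [combine_scores(p, r) for p, r in pairs]
labels = flag_anomalies(scores, threshold=1.0)
```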

IV. PERFORMANCE ANALYSIS
First, we describe the utilized experimental datasets, baseline models and evaluation metrics. Then, we conduct experiments to demonstrate the performance of our method. Finally, to illustrate the effectiveness of the proposed modules, we conduct an ablation study on five datasets to validate the GAT, the prediction-based module and the reconstruction-based module, which contribute to the performance improvement achieved by the proposed approach.

A. DATASETS
We use five real-world datasets to validate the performance of the HAD-MDGAT, namely, Secure Water Treatment (SWaT), Water Distribution (WADI), Mars Science Laboratory Rover (MSL), Soil Moisture Active Passive (SMAP), and the Server Machine Dataset (SMD). SWaT and WADI come from a water treatment test bed coordinated by Singapore's Public Utility Board and a water distribution network, respectively. MSL and SMAP contain spacecraft telemetry signals provided by NASA. The SMD is a five-week dataset obtained from a large Internet company. The dataset is derived from 28 machines, and the anomalies in the training dataset are labeled by experts. The five datasets contain different numbers of anomalies, and the location of each anomaly is known. Table 1 provides the details of each dataset (including their anomaly ratios, etc.).

B. BASELINE MODELS
We implement 7 state-of-the-art baseline models for a performance comparison with the HAD-MDGAT.

2) LSTM-VAE
LSTM serves as the basic architecture of the encoder. However, it does not consider the temporal correlations between observations.

3) LSTM-NDT
In LSTM-NDT [43], LSTM is applied to detect anomalies in multivariate time series. This approach utilizes an unsupervised, nonparametric algorithm for threshold determination.

4) OMNIANOMALY
This method learns latent representations through a GRU and a VAE. It takes dependence and stochastic factors into consideration and applies reconstruction probabilities for anomaly detection.

5) USAD
An AE is the basic architecture of USAD. USAD conducts two-phase training in an adversarial manner to reconstruct samples. The input is identified as an anomaly if its corresponding anomaly score is higher than the threshold.

6) MAD-GAN
The MAD-GAN applies LSTM to capture temporal correlations and embeds the captured dependencies into a GAN. It uses reconstruction errors to detect anomalies in multivariate time series.

7) GDN
The GDN also learns the interrelationships between variables in multivariate time series. It directly applies GATs to capture features and uses graph deviation scores to detect anomalies.

C. EVALUATION METRICS
We apply the precision (Prec), recall (Rec) and $F_1$ score ($F_1$) metrics to evaluate the anomaly detection performance of the HAD-MDGAT: $\mathrm{Prec} = \frac{TP}{TP + FP}$, $\mathrm{Rec} = \frac{TP}{TP + FN}$, and $F_1 = \frac{2 \cdot \mathrm{Prec} \cdot \mathrm{Rec}}{\mathrm{Prec} + \mathrm{Rec}}$. The true positive (TP) count is the number of samples that a detection model correctly identifies as anomalies. The false positive (FP) count is the number of normal samples that are identified as anomalies. The false negative (FN) count is the number of anomalous samples that are identified as normal. Moreover, the true negative (TN) count is the number of normal samples that are correctly recognized as normal. Higher values of the three metrics (precision, recall, and $F_1$ score) indicate a better and more robust model. To evaluate the robustness of the HAD-MDGAT, dynamic Gaussian mixture noise [44] with different signal-to-noise ratios (SNRs) is added to the original data.
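The three metrics can be computed directly from binary label vectors, as in the following sketch. It evaluates each timestamp independently; any adjustment that credits a whole anomalous segment once a single point is detected is not included, and the function name is hypothetical.

```python
from typing import List, Tuple


def precision_recall_f1(y_true: List[int], y_pred: List[int]) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for binary labels (1 = anomalous)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2.0 * precision * recall / (precision + recall)


# Two true positives, one false positive, one false negative.
prec, rec, f1 = precision_recall_f1([0, 1, 1, 0, 1], [0, 1, 0, 1, 1])
```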

D. EXPERIMENTAL SETTINGS
The experiments are implemented in Python 3.6 with PyTorch 1.9 and are performed on a PC running Ubuntu 18.04.5 LTS with an Intel(R) Xeon(R) E5-2678 v3 CPU, two RTX 3060 GPUs, and 32 GB of RAM. We use the same sliding window size $w = 100$ for all datasets. We set the kernel size $k_0 = 7$ for the 1-D convolutions. The dimensionalities of the GRU layer ($k_1$) and the fully connected layers ($k_2$) are both set to 150. Our model is trained with the Adam optimizer for 300 epochs. The initial learning rate is 0.001. In the GAN, the generator uses the tanh activation function, and the discriminator uses the sigmoid activation function. The numbers of GRUs in the generators and discriminators are all 4.
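For reproducibility, the reported settings can be gathered in a single configuration object, as sketched below. The values are taken from the text; the class name and field names are hypothetical, and `gan_gru_count` simply records the stated number of GRUs without interpreting it as layers or cells.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Hyperparameters reported in the experimental settings (field names are illustrative)."""
    window_size: int = 100          # sliding window w
    conv_kernel_size: int = 7       # k0 for the 1-D convolutions
    gru_hidden_dim: int = 150       # k1
    fc_hidden_dim: int = 150        # k2
    epochs: int = 300
    optimizer: str = "adam"
    learning_rate: float = 1e-3     # initial learning rate
    generator_activation: str = "tanh"
    discriminator_activation: str = "sigmoid"
    gan_gru_count: int = 4          # GRUs in each generator and discriminator
    score_lambda: float = 0.5       # default weight of the combined anomaly score


config = ExperimentConfig()
```

Freezing the dataclass prevents accidental changes to the settings during a run.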

E. RESULTS
We evaluate the performance of the HAD-MDGAT and compare it with that of seven other baselines on five datasets. An ablation study on different components is conducted to determine the impacts of these components on the performance of the HAD-MDGAT. Appropriate anomaly detection thresholds are used for all models, and the optimal F1 scores are obtained. The results obtained by all models on the public datasets are shown in Table 2, with the best results marked in bold. Fig. 4 shows the anomalies detected by the HAD-MDGAT on the SMAP training set; the model detects most of the anomalies. Table 2 shows that the HAD-MDGAT significantly outperforms the other state-of-the-art baselines by achieving the highest mean F1 value (0.929) across all public datasets. The second- and third-best methods in terms of overall performance are the GDN (0.881) and OmniAnomaly (0.785), respectively. The HAD-MDGAT outperforms them by 5.45% and 18.34% in relative terms, respectively. We make the following observations.
(1) The LSTM-VAE is used for classification. It is robust to imbalanced data and converges quickly. However, it does not take the temporal correlations between different observations into account. It performs poorly on the five datasets, especially on WADI (0.38). The DAGMM is suitable for balanced datasets. However, it converges slowly on unbalanced data, and its generalization ability is limited. Additionally, it performs poorly on high-dimensional datasets, such as WADI (0.201). USAD uses an AE as its basic architecture; it trains quickly and has the second-best performance on MSL (0.911). However, it achieves a lower value on WADI (0.43). The MAD-GAN has many hyperparameters, which makes it difficult to train. It also has poor generalizability. However, it performs better on SWaT (0.81) and the SMD (0.872). USAD, the LSTM-VAE, the DAGMM and the MAD-GAN are not suitable for high-dimensional datasets.
(2) LSTM-NDT performs well on SWaT (0.804). However, it does not perform well on MSL (0.564) and the SMD (0.604). This means that LSTM-NDT is sensitive to the scenario because it cannot model all cases effectively. OmniAnomaly performs well on SWaT (0.833), MSL (0.901) and the SMD (0.931). However, it does not take the spatial correlations between observations into consideration.
(3) The GDN has good performance on the SMD (0.92) and the other datasets. However, strong connections cannot be determined merely by the closeness of spatial distances. The GDN performs worse on WADI (0.815) than on the other datasets.
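The text states that thresholds are chosen so that the optimal F1 scores are obtained, but it does not describe the search procedure. One common approach, shown here purely as an assumption, is to sweep every observed anomaly score as a candidate threshold and keep the one with the highest F1. The function names and toy scores are hypothetical.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 score when every sample with score >= threshold is flagged."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)


def best_f1_threshold(scores, labels):
    """Exhaustive sweep over the distinct scores (an assumed procedure)."""
    best_f1, best_th = 0.0, None
    for th in sorted(set(scores)):
        f1 = f1_at_threshold(scores, labels, th)
        if f1 > best_f1:
            best_f1, best_th = f1, th
    return best_f1, best_th


# Toy anomaly scores and ground-truth labels (illustrative only).
scores = [0.1, 0.2, 0.15, 0.9, 0.8, 0.3, 0.7, 0.05]
labels = [0, 0, 0, 1, 1, 0, 1, 0]
best_f1, best_th = best_f1_threshold(scores, labels)
```

Because such a sweep uses the ground-truth labels, the resulting F1 is an upper bound on what a fixed, label-free threshold would achieve.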

1) PERFORMANCE COMPARISON
The HAD-MDGAT achieves the best performance on all datasets. As shown in Fig. 5, the HAD-MDGAT outperforms the other baselines and scores 65.3% higher than the MAD-GAN (0.562). It also outperforms USAD (by 20.03%), OmniAnomaly (by 18.34%) and the LSTM-VAE (by 44.48%). The reconstruction-based methods are prone to overfitting anomalies, which leads to low performance, although they perform well on low-dimensional datasets. We use an extra encoder to address the overfitting problem.
The HAD-MDGAT has a higher F1 score than the DAGMM (by 40.33%). Similar to the LSTM-VAE, the DAGMM does not consider temporal correlations. This indicates that temporal correlations are critical for anomaly detection. As introduced in Section III, we use a temporal layer to capture temporal correlations.
The GDN scores 12.23% higher than OmniAnomaly. The GDN applies a GAT to extract the temporal and spatial correlations between different observations, whereas OmniAnomaly does not consider spatial correlations. Therefore, it is essential to utilize spatial correlations when reconstructing samples for anomaly detection. However, the GDN cannot achieve high performance on high-dimensional datasets such as WADI. We use a spatial layer to learn the spatial dependencies between different observations. In our GAN, we use a re-encoder to amplify the reconstruction error and improve the efficiency of anomaly detection; OmniAnomaly has no comparable mechanism.
The HAD-MDGAT performs better than LSTM-NDT (by 37.43%). LSTM-NDT performs better on SWaT than on MSL, WADI and the SMD. The reconstruction-based methods mostly achieve better performance than the prediction-based methods (the DAGMM and LSTM-NDT) on WADI, the SMD and MSL. This means that the prediction-based and reconstruction-based methods each have distinct advantages in terms of anomaly detection. The HAD-MDGAT uses a hybrid method that combines both kinds of approaches to detect anomalies. This technique improves the performance of the HAD-MDGAT on all five datasets.
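The percentages quoted in this comparison are consistent with relative improvements in mean F1, that is, (F1_ours - F1_baseline) / F1_baseline. The short check below reproduces the figures for the baselines whose mean F1 values are stated in the text; the function and variable names are hypothetical.

```python
def relative_improvement(ours, baseline):
    """Relative gain of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


# Mean F1 values quoted in the text.
mean_f1 = {
    "HAD-MDGAT": 0.929,
    "GDN": 0.881,
    "OmniAnomaly": 0.785,
    "MAD-GAN": 0.562,
}

gains = {
    name: round(relative_improvement(mean_f1["HAD-MDGAT"], value), 2)
    for name, value in mean_f1.items()
    if name != "HAD-MDGAT"
}
gdn_over_omni = round(
    relative_improvement(mean_f1["GDN"], mean_f1["OmniAnomaly"]), 2
)
```

Under this reading, a gain of 5.45% over the GDN corresponds to an absolute difference of 0.048 in mean F1.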

2) ROBUSTNESS
Gaussian noise at specified SNRs is typically used to evaluate the robustness of models [7]. We evaluate the HAD-MDGAT under different SNR settings, and the results are shown in Table 3. Even though the F1 value decreases with increasing SNRs, the HAD-MDGAT remains more competitive than the other baselines, especially at an SNR of 10. Overall, the HAD-MDGAT is less affected by noise.
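To make the noise injection concrete, the sketch below adds zero-mean Gaussian noise to a univariate signal so that a target SNR is reached. Two assumptions are made here: the SNR is expressed in decibels, and plain Gaussian noise stands in for the dynamic Gaussian mixture noise of [44]. All names are hypothetical.

```python
import math
import random


def add_gaussian_noise(signal, snr_db, rng):
    """Add zero-mean Gaussian noise so that the SNR is about snr_db (dB)."""
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR_dB = 10 * log10(P_signal / P_noise), solved for P_noise.
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB between a clean signal and its noisy version."""
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


rng = random.Random(0)                                # fixed seed
clean = [math.sin(0.05 * t) for t in range(20000)]    # toy sensor signal
noisy = add_gaussian_noise(clean, 10.0, rng)
snr_check = measured_snr_db(clean, noisy)             # close to 10 dB
```

For multivariate data, the same scaling could be applied per sensor so that each channel is perturbed at the same SNR.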

F. ABLATION STUDY
To illustrate the effectiveness of each component of our method, we conduct an ablation study on the same five datasets. The results validate the improvements provided by the MDA mechanism, the hybrid architecture and the spatial layer. The ablated variants are denoted as follows: w/o MDA: disabling the MDA mechanism; w/o prediction-based: disabling the prediction-based method; w/o reconstruction-based: disabling the reconstruction-based method; and w/o spatial layer: preventing the GAT from learning spatial correlations (only the temporal layer remains). The results are shown in Table 4. Fig. 6 shows that, in terms of the mean F1 score, the spatial layer and the reconstruction-based method both contribute strongly to performance. The reconstruction-based method outperforms the prediction-based method. Moreover, the MDA mechanism also contributes to the anomaly detection accuracy.
The MDA mechanism can model the different correlations between different keys and different queries to assign node neighbor scores. The GAT with MDA can fit unbalanced data, and it has superior robustness. We find that the GAT with MDA can also capture the interrelationships between nonadjacent timestamps. The HAD-MDGAT scores 10.86% higher than the version without the spatial layer. As shown in Fig. 7 (which includes 13 features), the spatial correlations of an anomalous sample differ greatly from those of a normal sample. A darker block indicates a higher spatial correlation and vice versa. This means that it is critical to capture spatial correlations when conducting anomaly detection.
The prediction-based method is sensitive to random time series. In contrast, the reconstruction-based method trains a model to learn the distribution of the input data; this model is less affected by noise and other perturbations. The overfitting problem is a limitation of the reconstruction-based method. If an anomaly is very close to the normal data, it may be undetectable by the reconstruction-based method. We propose a novel GAN to reconstruct data and compare the differences between the representations of two encoders in the latent space, thereby amplifying the errors between the normal and anomalous samples. The prediction-based method can detect anomalies that are sudden time series perturbations. Hence, the hybrid method, which combines both types of methods, can achieve higher anomaly detection accuracy.
To further optimize the proposed model, the Adam gradient descent method is used in the HAD-MDGAT. A mini-batch algorithm is applied to improve the efficiency of the HAD-MDGAT by dividing the data into batches. In each gradient descent step, only one batch rather than the entire training set is used to update the parameters. The samples in a batch therefore jointly determine the direction of the gradient, which reduces randomness. As shown in Fig. 8 and Fig. 9, the hybrid method exhibits fast convergence on the training and validation sets.
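The mini-batch training procedure can be illustrated with a self-contained toy example: Adam updates applied to a one-parameter linear model, with each gradient averaged over one batch. The epoch count and learning rate follow the reported settings, whereas the batch size of 32, the toy data and all names are assumptions made for illustration; this is not the HAD-MDGAT training loop.

```python
import math
import random


def minibatches(data, batch_size, rng):
    """Shuffle the data once and yield consecutive batches."""
    order = list(range(len(data)))
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [data[i] for i in order[start:start + batch_size]]


def fit_slope_adam(data, epochs=300, batch_size=32, lr=0.001,
                   beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
    """Fit y = w * x by minimizing the mean squared error with Adam."""
    rng = random.Random(seed)
    w, m, v, step = 0.0, 0.0, 0.0, 0
    for _ in range(epochs):
        for batch in minibatches(data, batch_size, rng):
            # Gradient of the batch-averaged squared error w.r.t. w:
            # all samples in the batch jointly determine the direction.
            grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
            step += 1
            m = beta1 * m + (1.0 - beta1) * grad          # first moment
            v = beta2 * v + (1.0 - beta2) * grad * grad   # second moment
            m_hat = m / (1.0 - beta1 ** step)             # bias correction
            v_hat = v / (1.0 - beta2 ** step)
            w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


data_rng = random.Random(1)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(256)]
toy_data = [(x, 0.5 * x) for x in xs]    # noiseless toy target, slope 0.5
w_fit = fit_slope_adam(toy_data)
```

Averaging the gradient over a batch smooths out the influence of any single sample, which is the variance-reduction effect described above.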

V. CONCLUSION AND FUTURE WORK
In this paper, we propose a hybrid method based on a GAT called the HAD-MDGAT. A GAT with MDA is proposed to learn the temporal and spatial correlations between different observations. The ablation study shows that MDA contributes greatly to improving the anomaly detection accuracy. The HAD-MDGAT scores 10.86% higher than the version without the spatial layer. The prediction-based method can detect anomalies that are sudden time series perturbations, and the reconstruction-based method trains a model to learn the distribution of time series. Combining the two methods makes the HAD-MDGAT less affected by noise and other perturbations. To evaluate the robustness of the HAD-MDGAT, we test it under different SNRs; the HAD-MDGAT remains more competitive than the other baselines. We use the re-encoding loss as a portion of the final anomaly score. Two encoders control the fitting of the learned features in the reconstruction-based method so that the GAN can deal with the overfitting problem. Moreover, the ablation study shows that the HAD-MDGAT scores 6.05% higher than the version without the reconstruction-based method. Experiments show that the HAD-MDGAT achieves improved anomaly detection performance and outperforms the other seven tested baselines.
For GAN-based anomaly detection models, choosing an appropriate sliding window length is difficult. Additionally, a GAN is unstable during training. In the future, we will investigate these issues and combine other prediction-based methods with GATs.