Robust Anomaly Detection for Multivariate Data of Spacecraft Through Recurrent Neural Networks and Extreme Value Theory

Spacecraft anomaly detection, which finds anomalies in telemetry or test data in advance so that corrective measures can be taken before catastrophic failures occur, has attracted the attention of researchers in both academia and the aerospace industry. Current spacecraft anomaly detection systems require costly knowledge and human expertise to identify a true anomaly. Moreover, new challenges have emerged, such as the large volume of test data, imbalanced data distribution and the scarcity of labeled faulty samples. In this work, we propose an unsupervised anomaly detection algorithm combining a Gated Recurrent Unit (GRU) based Recurrent Neural Network (RNN) and Extreme Value Theory (EVT). First, we develop a two-layer ensemble learning based predictor framework which stacks three GRU-based networks with different architectures to learn and capture the normal behavior of multiple channels of data. Then, the prediction errors are calculated and smoothed using the Exponentially Weighted Moving Average (EWMA) algorithm. Next, we propose a detection rule that sets the anomaly threshold automatically through EVT, which does not assume any parent distribution on the prediction errors. To the best of our knowledge, this is the first time that stacked GRU-based predictors have been combined with EVT for spacecraft anomaly detection. Through extensive experiments conducted on public datasets as well as real data sampled from a launch vehicle, we show that the proposed detection algorithm is superior to other state-of-the-art anomaly detection approaches in terms of model performance and robustness.


I. INTRODUCTION
A spacecraft is a sophisticated and complicated system which consists of many subsystems and components, such as the structure system, engine, control system and telemetry system. It is essential to monitor the health status and strengthen the stability and reliability of spacecraft because they often operate in the harsh environment of outer space. Anomaly detection strives to discover patterns and data that do not conform to expected behaviors [1], [2]. Anomalies, which are unexpected instances deviating greatly from the normal behavior defined by most of a dataset, usually do not occur abruptly; they are preceded by subtle and imperceptible changes in the telemetry data of spacecraft [3]. Anomaly detection technology should detect these changes and anomalies in advance to prevent potential cascading downtime, which may result in catastrophic and unforeseen damage to spacecraft. Therefore, anomaly detection plays a significant role in managing the health status of spacecraft in real time and improving their reliability [4], [5].
The associate editor coordinating the review of this manuscript and approving it for publication was Yuan Zhuang.
At present, one of the most widely used anomaly detection methods is Out-Of-Limit (OOL) [6], [7], where a pre-defined threshold is set and an anomaly is detected once a monitored value strays outside of this threshold. Although OOL methods are simple and easy to implement, they require a large amount of domain knowledge and human expertise to set thresholds, and they have difficulty identifying contextual anomalies whose values stay within the pre-defined threshold. Therefore, in recent years many anomaly detection algorithms have been proposed which make great improvements over OOL, such as density-based approaches [8], [9], statistical methods [10], [11], k-nearest neighbor methods [12], [13] and Support Vector Machine (SVM) based methods [14], [15]. However, these methods perform well only when the size of the data is small. As spacecraft missions become more and more complex, the volume of telemetry data increases rapidly. For instance, the Synthetic Aperture Radar satellite generates nearly 85 terabytes of data per day [16]. It is therefore very difficult for the aforementioned anomaly detection methods to scale to such large telemetry data and identify anomalies in real or near real time.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Another challenge for spacecraft anomaly detection is that the telemetry data are quite imbalanced [4]: there are a large number of normal samples but few or even no labeled abnormal samples in a telemetry dataset, because spacecraft usually work in normal conditions and failures seldom occur. Labeling thousands of telemetry samples is time-consuming and labor-intensive, and the resulting lack of labels can lead traditional approaches to overfit and fail to detect anomalies accurately. Moreover, most of these anomaly detection methods are single-channel models that can only detect anomalies for each telemetry channel individually. However, a single model may not perform well for all the telemetry channels, and detecting anomalies over multivariate channels is preferable to detecting them over a univariate channel for several reasons. First, training and maintaining a model for each telemetry channel needs more hardware resources and time. Second, in real application scenarios, engineers care more about the status of the spacecraft as a whole than about each individual channel.
In this research, in order to address the challenges mentioned above, we propose an unsupervised deep learning-based anomaly detection model using Gated Recurrent Unit (GRU) [17] based Recurrent Neural Networks (RNNs) [18], [19] and Extreme Value Theory (EVT) [20], [21] to identify anomalies in the telemetry data of spacecraft. Our core idea is to forecast the time series of spacecraft with a two-layer ensemble learning based predictor framework built on GRUs, and subsequently apply an EVT rule to the prediction errors. Our approach is an unsupervised algorithm which learns and captures the normal behavior of multiple channels of data by leveraging the strong performance of GRUs. The larger a prediction error is, the more likely the corresponding observation is to be regarded as an anomaly. The main contributions of this paper are summarized as follows:
1) We propose an unsupervised anomaly detection algorithm. To the best of our knowledge, this algorithm is the first multivariate time series anomaly detection method which combines GRU-based neural networks and EVT for spacecraft.
2) We propose an ensemble learning based predictor framework which is built on diverse GRUs and is robust to multiple channels of telemetry data.
3) We introduce EVT to process the prediction errors and propose a dynamic threshold setting method that makes no strong assumptions about the prediction error distribution.
4) We conduct extensive experiments and compare our approach with other state-of-the-art anomaly detection methods on different datasets. Experimental results demonstrate that our method achieves the best performance among all the baseline methods.
The remainder of this paper is organized as follows. Related work is reviewed in Section II. Section III details the proposed anomaly detection method. Experimental results and discussions are presented in Section IV. Finally, conclusions are drawn in Section V.

II. RELATED WORK
The telemetry data of spacecraft are time series, and the anomalies can be divided into three categories: point, contextual, and collective [22]. Point anomalies are observations that differ significantly from the majority of the data; contextual anomalies are regarded as anomalous only in a specific context; and collective anomalies are sets of observations that are anomalous as a whole, although any single value in the sequence may not be anomalous by itself. Since contextual and collective anomalies are much more difficult to detect and both require temporal information, we merge them into the contextual category in this article.
Several anomaly detection approaches have been explored for spacecraft, among which expert systems [6], [7], [23], [24] are the simplest and most widely used. A notable expert system called Intelligent Satellite Control Software DOCtor (ISACS-DOC) was successfully deployed in the Hayabusa, Geotail, and Nozomi missions [24]. The Out-Of-Limits (OOL) [6], [7] method, which uses a pre-defined threshold to detect anomalies, is a simple form of expert system and is still adopted by a large number of spacecraft due to its low computational expense, ease of interpretation, and easy applicability. Envelope Learning and Monitoring using Error Relaxation (ELMER) [25] is an improved OOL method which uses a neural network to periodically set new threshold bounds. However, it is usually hard for an expert system to detect contextual anomalies, and it requires sufficient and accurate knowledge of the spacecraft [16]. A myriad of machine learning methods which automatically learn and extract knowledge from telemetry data have been introduced to spacecraft anomaly detection, such as k-nearest neighbor based methods [12], [13], cluster-based methods [26], [27], Support Vector Machine (SVM) based methods [14], [15], [28], and Artificial Neural Network (ANN) based methods [4], [16], [29], [30], among others. The k-nearest neighbor based method is used by the XMM-Newton satellite [6] as well as the Space and the International Space Station [31]. Cluster-based methods, which identify values falling outside of well-defined clusters as anomalous, are adopted by NASA. The Centre National d'Etudes Spatiales (CNES) uses a One-Class Support Vector Machine (OS-SVM) to separate anomalies from nominal data [32]. Although machine learning methods show great improvement over expert-system based approaches, each has its own shortcomings related to computational expense, model complexity, or generalizability.
Recently, given the rapid advancement of computing capacity and neural network architectures, deep learning has begun to be applied to anomaly detection in spacecraft with high-dimensional and large volumes of data. Su et al. [30] proposed a model called DAGMM which reduces the dimension of the input data to extract hidden characteristics using an Autoencoder (AE) and estimates the density of these characteristics using a Gaussian Mixture Model (GMM). Recurrent neural networks and their variants, Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), have shown a powerful capability to maintain memory of long-term dependencies and model complex nonlinear feature interactions. Therefore, LSTM and GRU are especially suitable for anomaly detection in time series [16]. Malhotra et al. [33] used an LSTM to construct a predictor that models the normal behavior of time series, and subsequently used the prediction error to detect anomalous behaviors. The LSTM-based anomaly detection method in [33] gave better detection results than other methods, but it assumed that the prediction errors fit a normal distribution. In [16], Hundman et al. proposed an LSTM-based predictor and a nonparametric anomaly thresholding approach, which does not assume a distribution for the prediction errors, to detect anomalies in telemetry data of the Soil Moisture Active Passive (SMAP) satellite and the Mars Science Laboratory (MSL) rover. The main disadvantage of that approach is that it is a single-channel model, so many models must be trained for multiple telemetry channels, which costs more training time and computation. Tariq et al. [34] proposed an anomaly detection algorithm using a multivariate convolutional LSTM with mixtures of probabilistic principal component analyzers and demonstrated the effectiveness of the method on real Korea Multi-Purpose Satellite 2 (KOMPSAT-2) data.
The methods in [4], [16] are prediction-based anomaly detection algorithms, which usually apply several rules to the prediction errors to identify anomalies. In [33], the prediction errors from the training data were assumed to fit a Gaussian distribution whose mean $\mu$ and variance $\sigma^2$ are computed using Maximum Likelihood Estimation (MLE). Then an anomaly score is calculated, and a low score indicates that the corresponding observation may be an anomaly with relatively high probability. The method in [35] models the prediction error distribution as a normal distribution and applies a threshold to the Gaussian tail probability (Q-function) to decide whether or not to declare an anomaly. However, the assumption that prediction errors follow a Gaussian distribution does not always hold. Therefore, in [16], a dynamic anomaly threshold is chosen such that removing all the observations above it leads to the largest percentage decrease in the mean and standard deviation of the errors, but this method causes many false positives.

A. OVERALL STRUCTURE
A spacecraft usually consists of many subsystems, each of which generates several channels of telemetry data. Therefore, it is very difficult to detect the root causes of anomalies due to the correlation among these channels. Moreover, a predictor trained for one channel may not perform well for other channels. Hence, in this section, we design an anomaly detection model for multivariate telemetry channels.
The overall structure of our proposed anomaly detection algorithm is illustrated in Fig. 1, and it includes two steps: model training and detection. The model training procedure is conducted offline at an acceptable frequency, such as weekly or monthly. First, in the preprocessing module, we apply Z-score normalization to the raw telemetry dataset and then divide it into training data and test data. Second, the training data are used to train the ensemble learning based prediction framework, which learns and captures the normal behaviors of multivariate time series. Third, the prediction errors are calculated and smoothed. Then, the threshold setting module automatically sets thresholds for prediction errors using EVT. Once the model training procedure is completed, the trained model is stored for the detection procedure. The test data can then be fed into the model online to produce prediction errors, and the EVT method decides whether a test observation is anomalous or not.

B. DATA PREPROCESSING
In this study, we mainly observe multivariate time series $X = \{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$, in which each element corresponds to a specific output channel of the spacecraft, and the total number of output channels is $m$. The time series is shown in Fig. 2, where each row in the matrix represents the data of a single channel.
The time series $X$ is transformed into a sequence of sliding windows, where $l_b$ denotes the length of historical values used to predict the next $l_a$ values at time $t$. The formats of the inputs and outputs are shown in Fig. 2. We apply Z-score normalization to the input data and then split it into training and test datasets.
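The preprocessing described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names `zscore_normalize` and `make_windows` are hypothetical, the data layout (one row per channel) follows the description of Fig. 2, and normalizing per channel is an assumption.

```python
import numpy as np


def zscore_normalize(x):
    """Z-score each channel (row) of an (m, n) array of telemetry values."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True)
    std = np.where(std == 0, 1.0, std)  # guard against constant channels
    return (x - mean) / std


def make_windows(x, l_b, l_a):
    """Slice an (m, n) series into model inputs and prediction targets.

    Returns inputs of shape (N, l_b, m) and targets of shape (N, l_a, m),
    where each input holds l_b historical steps and each target holds the
    l_a steps that immediately follow it.
    """
    x = np.asarray(x)
    n = x.shape[1]
    count = n - l_b - l_a + 1
    if count <= 0:
        raise ValueError("series is shorter than l_b + l_a")
    inputs = np.stack([x[:, i:i + l_b].T for i in range(count)])
    targets = np.stack([x[:, i + l_b:i + l_b + l_a].T for i in range(count)])
    return inputs, targets
```

The `(N, l_b, m)` layout is chosen because it matches the (batch, time, features) convention that recurrent layers commonly expect.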

C. GRU-BASED PREDICTOR
LSTMs, first proposed in the 1990s [18], can learn long-term dependencies and contextual reasoning in time series and overcome the vanishing gradient problem by replacing an ordinary neuron with the LSTM unit or block; hence, LSTMs have become popular in time series forecasting problems [4], [16], [21]. However, LSTMs are quite difficult to train due to their complex architecture. To address this shortcoming, the input gate and the forget gate of the LSTM are replaced by a single update gate to form the GRU [17], [36]. The performance of the GRU is as good as that of the LSTM while its structure is simpler, as shown in Fig. 3. Therefore, we apply the GRU in our predictor model to learn the temporal dependence in multivariate telemetry data.
In Fig. 3b), $x_t$ and $h_{t-1}$ are the inputs while the hidden variable $h_t$ is the output, and the GRU can be formulated as follows:
$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$$
$$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
where $r_t$ and $z_t$ represent the reset gate and the update gate, respectively, $\tilde{h}_t$ is the candidate hidden state, $\sigma$ is the logistic sigmoid, $\odot$ denotes element-wise multiplication, and $W_*$, $U_*$, $b_*$, $* \in \{h, r, z\}$, denote the parameters to learn.
In earlier applications, a specific predictor is trained for each channel and each predictor is used to forecast values for that channel. This is feasible when the number of channels is small or when the channels have analogous data distributions. However, spacecraft usually have thousands of telemetry channels, so it is impractical to train a prediction model for each channel due to computation, storage and time constraints. Therefore, inspired by ensemble learning, we propose a GRU-based stacked predictor in this section, and the framework is illustrated in Fig. 4.
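For reference, a single GRU step with a reset gate $r_t$ and an update gate $z_t$ can be sketched in NumPy as below. The sketch assumes the standard GRU formulation; the convention for which term is weighted by $z_t$ varies between references, and the parameter dictionary keys and the helper `init_params` are hypothetical names introduced only for this illustration.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def gru_step(x_t, h_prev, params):
    """One GRU step; params maps 'W_*', 'U_*', 'b_*' (* in z, r, h) to arrays."""
    # Update gate: how much of the candidate state replaces the old state.
    z_t = sigmoid(params["W_z"] @ x_t + params["U_z"] @ h_prev + params["b_z"])
    # Reset gate: how much of the old state feeds the candidate state.
    r_t = sigmoid(params["W_r"] @ x_t + params["U_r"] @ h_prev + params["b_r"])
    h_tilde = np.tanh(
        params["W_h"] @ x_t + params["U_h"] @ (r_t * h_prev) + params["b_h"]
    )
    return (1.0 - z_t) * h_prev + z_t * h_tilde


def init_params(input_dim, hidden_dim, seed=0):
    """Small random weights and zero biases (illustrative initialization)."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in ("z", "r", "h"):
        params["W_" + gate] = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
        params["U_" + gate] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        params["b_" + gate] = np.zeros(hidden_dim)
    return params
```

Compared with an LSTM cell, this step has three weight groups instead of four and no separate cell state, which is the structural simplification referred to in the text.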
The proposed stacked predictor consists of two layers: a prediction layer and a decision layer. The prediction layer contains 3 GRU-based predictors with various GRU-based neural network structures and aims to learn the normal behaviors of telemetry data. The input data are fed into three GRU-based predictors simultaneously, and the outputs of each predictor are what the values are expected to be at the next several timestamps.
The decision layer combines the predictions of the multiple GRU models in order to get a more accurate prediction for a dataset. In this paper, the decision strategy takes the mean of the predicted values produced by the individual models as the final prediction. This layer is essential because it makes the proposed predictor a generic and robust model, in the sense that it does not rely on one specific prediction model. By combining several different models, the stacked predictor improves prediction accuracy and is robust to multivariate channels of data.
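The mean-based decision strategy can be sketched as follows. The three trained GRU networks are replaced here by simple hypothetical stand-in forecasters (`last_value`, `window_mean`, `linear_trend`) so that the example runs without a deep learning framework; only `stacked_predict` corresponds to the decision layer described above.

```python
import numpy as np


def stacked_predict(predictors, window):
    """Decision layer: average the forecasts of all predictors element-wise."""
    forecasts = np.stack([np.asarray(p(window), dtype=float) for p in predictors])
    return forecasts.mean(axis=0)


# Hypothetical stand-ins for the three trained networks. Each maps an
# (l_b, m) input window to an (l_a, m) forecast.
def last_value(window, l_a=5):
    return np.repeat(window[-1:], l_a, axis=0)


def window_mean(window, l_a=5):
    return np.repeat(window.mean(axis=0, keepdims=True), l_a, axis=0)


def linear_trend(window, l_a=5):
    slope = window[-1] - window[-2]
    steps = np.arange(1, l_a + 1)[:, None]
    return window[-1] + steps * slope
```

Because the decision layer only needs callables that return arrays of the same shape, any of the stand-ins could be swapped for a trained model's prediction function without changing `stacked_predict`.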

D. AUTOMATIC PREDICTION ERROR THRESHOLD SETTING USING EVT
The stacked predictor is trained on the training dataset without any abnormal samples, so it learns and captures the normal behavior of the telemetry data. The prediction errors $e(t)$ are defined as the absolute differences between the ground truth observations $y_t$ and their corresponding predicted values $\hat{y}_t$ at time $t$:
$$e(t) = |y_t - \hat{y}_t|$$
All the prediction errors in the training dataset form a vector $e = [e(1), e(2), \ldots, e(t), \ldots, e(n)]$. Since there are abrupt changes in the output channels of spacecraft, there exist spikes in the prediction errors; therefore, it is useful to dampen these spikes using the Exponentially Weighted Moving Average (EWMA) method [37]. The smoothed prediction errors $e_s(t)$ are calculated in (5):
$$e_s(t) = \alpha\, e(t) + (1 - \alpha)\, e_s(t-1) \tag{5}$$
where $\alpha \in (0, 1]$ is the smoothing factor.
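The error computation and EWMA smoothing can be sketched as below. The smoothing factor `alpha` and its default value are assumptions made for illustration (the text does not state the value used), as is the initialization of the smoothed series with the first raw error.

```python
import numpy as np


def prediction_errors(y_true, y_pred):
    """Absolute differences between observations and predictions."""
    return np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))


def ewma(errors, alpha=0.3):
    """Smooth errors with e_s(t) = alpha * e(t) + (1 - alpha) * e_s(t - 1)."""
    errors = np.asarray(errors, dtype=float)
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    smoothed = np.empty_like(errors)
    smoothed[0] = errors[0]  # assumed initialization
    for t in range(1, len(errors)):
        smoothed[t] = alpha * errors[t] + (1.0 - alpha) * smoothed[t - 1]
    return smoothed
```

A smaller `alpha` dampens isolated spikes more strongly but also delays the response to a sustained shift in the errors, so the value trades off false alarms against detection latency.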
In order to find the anomalies, we set the anomaly threshold using EVT. The goal of EVT is to find the law of the extreme values in $e_s(t)$ rather than the data distribution of $e_s(t)$, so it makes no assumptions about the data distribution of the prediction errors when finding extreme values. These extreme laws have the following mathematical form:
$$G_\gamma : x \mapsto \exp\left(-(1 + \gamma x)^{-1/\gamma}\right), \quad \gamma \in \mathbb{R}, \quad 1 + \gamma x > 0$$
The above equation is called the Extreme Value Distribution (EVD), in which the parameter $\gamma$ is the extreme value index determined by the original distribution. In order to evaluate the probability of anomalies in telemetry data using the EVD, the parameter $\gamma$ should be estimated. The Peaks-Over-Threshold (POT) algorithm is an efficient and effective method to compute $\gamma$.
Let $F(x)$ be the cumulative distribution function of the random variable $X$ and $\bar{F} = 1 - F$. The extrema of $F$ converge in distribution to $G_\gamma$ if and only if a function $\sigma$ exists such that, for all $x \in \mathbb{R}$ s.t. $1 + \gamma x > 0$:
$$\frac{\bar{F}(\theta + \sigma(\theta)\, x)}{\bar{F}(\theta)} \xrightarrow[\theta \to \tau]{} (1 + \gamma x)^{-1/\gamma}$$
where $\theta$ is the initial threshold, selected as the 98% quantile, $\tau$ is the upper bound of the support of $F$, and $\gamma$ and $\sigma$ are the shape and scale parameters of a Generalized Pareto Distribution (GPD). This result shows that the excesses over a threshold $\theta$, denoted as $X - \theta$, are likely to follow a GPD. The Peaks-Over-Threshold (POT) approach attempts to fit a GPD to the excesses $X - \theta$. Similar to [20], the values of the parameters $\hat{\gamma}$ and $\hat{\sigma}$ can be estimated by Maximum Likelihood Estimation (MLE). Therefore, the final threshold $\varepsilon$ is computed by:
$$\varepsilon \simeq \theta + \frac{\hat{\sigma}}{\hat{\gamma}} \left( \left( \frac{q\, n}{N_\theta} \right)^{-\hat{\gamma}} - 1 \right)$$
where $q$ represents the desired probability and usually lies between $10^{-5}$ and $10^{-3}$, $n$ denotes the total number of values, and $N_\theta$ is the number of peaks, i.e., the number of $X_i$ s.t. $X_i > \theta$. By using the POT algorithm, EVT enables us to set a dynamic threshold such that $P(X > \varepsilon) < q$ without any strong assumption on the original distribution of $X$ and without any clear knowledge about this distribution. We apply this method to detect anomalies in the telemetry data of spacecraft. We compute a first threshold $\varepsilon_{tr}$ on the smoothed prediction errors from the training dataset. This training procedure is illustrated in Algorithm 1.

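[Algorithm 1: training procedure that computes the threshold $\varepsilon_{tr}$ from the smoothed training prediction errors]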
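The computation of the first threshold on the training errors can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of Algorithm 1: to stay dependency-free, the GPD parameters are estimated with the method of moments rather than the Maximum Likelihood Estimation used in the text, and the function names are hypothetical.

```python
import numpy as np


def fit_gpd_moments(excesses):
    """Method-of-moments estimates (gamma, sigma) of a GPD (stand-in for MLE).

    Uses mean = sigma / (1 - gamma) and
    var = sigma^2 / ((1 - gamma)^2 * (1 - 2 * gamma)), valid for gamma < 1/2.
    """
    excesses = np.asarray(excesses, dtype=float)
    ratio = excesses.mean() ** 2 / excesses.var()
    gamma = 0.5 * (1.0 - ratio)
    sigma = 0.5 * excesses.mean() * (1.0 + ratio)
    return gamma, sigma


def pot_threshold(errors, q=1e-3, init_quantile=0.98):
    """Return (epsilon, theta, gamma, sigma) for the smoothed training errors."""
    errors = np.asarray(errors, dtype=float)
    theta = np.quantile(errors, init_quantile)  # initial threshold
    peaks = errors[errors > theta] - theta      # excesses X - theta
    if peaks.size < 2 or peaks.var() == 0:
        raise ValueError("not enough distinct peaks to fit a GPD")
    gamma, sigma = fit_gpd_moments(peaks)
    r = q * errors.size / peaks.size            # q * n / N_theta
    if abs(gamma) < 1e-8:
        # Limit of the threshold formula as gamma tends to 0.
        epsilon = theta - sigma * np.log(r)
    else:
        epsilon = theta + (sigma / gamma) * (r ** (-gamma) - 1.0)
    return epsilon, theta, gamma, sigma
```

For exponentially distributed errors the fitted shape is close to zero and the threshold for `q = 1e-3` lands near the true 99.9% quantile, which gives a quick sanity check of the formula.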
Then, for the prediction errors of the test dataset, we determine whether each value is anomalous. If a value in $e_{s-te}$ exceeds the threshold $\varepsilon_{tr}$, we consider the corresponding input data abnormal; such anomalies are not used to update the model. Otherwise, the value either exceeds the initial threshold $\theta$ (peak case) or is a normal value (normal case). In the peak case, we update the threshold. This procedure is shown in Algorithm 2.
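The three cases of the detection procedure (anomaly, peak, normal) can be sketched as a streaming loop. As with any sketch of this kind, the details are assumptions: the GPD is refitted with the method of moments instead of MLE, the observation count is incremented in the peak and normal cases only, and the names `gpd_threshold` and `detect_stream` are hypothetical.

```python
import numpy as np


def gpd_threshold(theta, peaks, n, q):
    """Threshold from a method-of-moments GPD fit (stand-in for the MLE fit)."""
    peaks = np.asarray(peaks, dtype=float)
    ratio = peaks.mean() ** 2 / peaks.var()
    gamma = 0.5 * (1.0 - ratio)
    sigma = 0.5 * peaks.mean() * (1.0 + ratio)
    r = q * n / peaks.size
    if abs(gamma) < 1e-8:
        return theta - sigma * np.log(r)
    return theta + (sigma / gamma) * (r ** (-gamma) - 1.0)


def detect_stream(train_errors, test_errors, q=1e-3, init_quantile=0.98):
    """Flag test errors and track the threshold after each observation."""
    train_errors = np.asarray(train_errors, dtype=float)
    theta = np.quantile(train_errors, init_quantile)
    peaks = list(train_errors[train_errors > theta] - theta)
    n = train_errors.size
    epsilon = gpd_threshold(theta, peaks, n, q)
    flags, thresholds = [], []
    for e in np.asarray(test_errors, dtype=float):
        if e > epsilon:
            # Anomaly case: flag it and leave the model untouched.
            flags.append(True)
        elif e > theta:
            # Peak case: record the excess and update the threshold.
            peaks.append(e - theta)
            n += 1
            epsilon = gpd_threshold(theta, peaks, n, q)
            flags.append(False)
        else:
            # Normal case: only the observation count changes.
            n += 1
            flags.append(False)
        thresholds.append(epsilon)
    return np.array(flags), np.array(thresholds)
```

Excluding flagged values from the peak set keeps a burst of anomalies from inflating the fitted tail and raising the threshold.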

IV. EXPERIMENTS
In this work, we are the first to introduce a GRU-based stacked predictor together with extreme value theory into the anomaly detection of spacecraft telemetry data. All experiments are run with TensorFlow 2.0 and Python 3.6+.

A. DATASETS AND EVALUATION METRICS
In order to compare other anomaly detection methods with our proposed model and demonstrate its overall effectiveness, we first conduct several experiments on two public datasets released by the NASA Jet Propulsion Laboratory [16]. These two public datasets consist of labeled telemetry data from the Soil Moisture Active Passive (SMAP) satellite and the Mars Science Laboratory (MSL) rover, Curiosity. There are 55 and 27 channels in the SMAP and MSL datasets, respectively. Then we run experiments on three public datasets to show the superior prediction performance of the stacked GRU models. Finally, a dataset collected from a real launch vehicle is used to verify the effectiveness of the EVT-based detection rule. Similar to [3], [16], we select Precision, Recall and F1-Score (denoted as F1) as metrics to evaluate the performance of our proposed method and the baseline approaches. Precision, Recall and F1 are defined as:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
where $TP$ is the number of true positives, $FP$ the number of false positives, and $FN$ the number of false negatives.
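As a small worked example of these metrics, the sketch below computes them from raw counts. The function name is hypothetical, and returning 0.0 for an undefined ratio is a convention chosen here, not something the text specifies. The sample counts correspond to a detector that finds 10 of 11 anomalies with no false positives.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute (precision, recall, F1) from true/false positive and negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# 10 detected anomalies, 0 false positives, 1 missed anomaly.
precision, recall, f1 = precision_recall_f1(tp=10, fp=0, fn=1)
```

Here the precision is 1, the recall is 10/11 and the F1 score is 20/21, which shows how a single missed anomaly lowers F1 even when no false alarm is raised.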

B. EXPERIMENTAL SETTINGS
As described in Section III-C, a stacked predictor which contains 3 different architectures of GRU-based neural networks is used to forecast the telemetry time series data of spacecraft. The general architecture is deep, consisting of several GRU layers and two fully connected layers, as shown in Fig. 5. Before training, we employ Tree-structured Parzen Estimator (TPE) Bayesian Optimization [38] to select the hyperparameters of our proposed model. The GRU A network contains 2 hidden GRU layers with 32 and 64 neurons, respectively; the GRU B network contains 3 hidden GRU layers with 80, 80 and 64 neurons, respectively; and the GRU C network has 4 hidden GRU layers with 48, 80, 80 and 64 neurons, respectively. The rest of the hyperparameters are listed in Table 1.
The output of the second fully connected layer is the prediction of the next $l_a$ timestamps, and in our model we set $l_a$ to 5 to balance training efficiency and model performance. The number of units in the first fully connected layer is set to 64. Early stopping and dropout are used to avoid over-fitting. The parameters of the model are trained with the Adam optimizer [39] for stochastic gradient descent, with an initial learning rate of $10^{-3}$, to minimize the Mean Square Error (MSE) loss function.
For the POT parameters, we follow the recommendation in [20] that $q$ lies in the range $[10^{-5}, 10^{-3}]$ and that the initial threshold $\theta$ is typically the 98% quantile.

C. RESULTS ON SMAP AND MSL
1) THE OVERALL RESULTS
Our proposed anomaly detection method is a prediction-based algorithm which combines GRU-based predictors and EVT to identify anomalies. Fig. 6 and Fig. 7 illustrate the detection results for channel F-7 from the MSL dataset and channel S-1 from the SMAP dataset, respectively. The first plots in Fig. 6 and Fig. 7 contain the true data, the predicted data and the smoothed errors. The dashed lines in the second plots of Fig. 6 and Fig. 7 represent the dynamic thresholds while the blue lines denote the smoothed errors; once the smoothed error exceeds the threshold, an anomalous point is labeled. Our proposed approach detects all the 423 anomalies with 34 false positives in channel F-7 and all the 448 anomalies without any false positive in channel S-1. Therefore, the precision and recall in channel F-7 are 92.26% and 100% while the precision and recall in channel S-1 are both 100%.
In order to demonstrate the effectiveness and superiority of our proposed algorithm, we compare it with three state-of-the-art methods for time series anomaly detection: the Hierarchical Temporal Memory (HTM) model [35], the LSTM-based model with a nonparametric anomaly thresholding approach (LSTM-NAT for short) [16], and the stacked predictor with a dynamic thresholding algorithm (SP-DTA for short) [3]. The precision, recall and F1 of these baseline models and the proposed method on SMAP and MSL are shown in Table 2 and Fig. 8. The precision and F1 of our method are higher than those of all baseline methods, and its recall is lower than that of the best baseline only on the SMAP dataset. Our approach achieves the highest F1 score of 87.4% on the MSL dataset, which exceeds the best baseline approach by 3.6%. All these results show the superiority of our proposed method compared to the state-of-the-art methods. Next, we analyze the performance of these approaches in detail.
HTM [35], which operates in real time, is a theoretical framework for time series learning in the cortex. HTM outputs the value of the current input and computes the prediction error and likelihood. The likelihood of an anomaly $L$, which is defined using the tail probability $Q$, then determines whether an anomaly is identified:
$$L = 1 - Q\left(\frac{\mu_s - \mu_W}{\sigma_W}\right)$$
where $\mu_s$ denotes a short-term average, and $\mu_W$ and $\sigma_W^2$ represent the mean and variance of a time window $W$, respectively. This method imposes a strong assumption on the prediction errors, namely that they follow a normal distribution, which does not always hold in real application scenarios. As a result, the HTM method does not perform well on these two datasets, with the lowest F1 scores (56.6% for MSL) among all the methods.
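To make the Gaussian assumption behind this likelihood concrete, a minimal sketch is given below. It assumes the form $L = 1 - Q((\mu_s - \mu_W)/\sigma_W)$ with $Q$ the standard normal tail probability; the function names are hypothetical and the windowing that produces $\mu_s$, $\mu_W$ and $\sigma_W^2$ is left out.

```python
import math


def q_function(x):
    """Tail probability of the standard normal distribution, Q(x) = P(Z > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def anomaly_likelihood(mu_s, mu_w, var_w):
    """Likelihood L = 1 - Q((mu_s - mu_w) / sigma_w) under a Gaussian model."""
    return 1.0 - q_function((mu_s - mu_w) / math.sqrt(var_w))
```

The likelihood is only calibrated if the windowed errors are close to Gaussian; with heavy-tailed errors, large deviations occur more often than $Q$ predicts, which is the weakness the EVT-based rule is meant to avoid.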
Hundman et al. [16] proposed an LSTM-based network with a complementary unsupervised and nonparametric anomaly thresholding approach to detect anomalies in telemetry data from the SMAP and MSL spacecraft. One disadvantage of this method is that it cannot detect contextual anomalies well, which explains why it gets the lowest precision, 85.5%, for SMAP. Another limitation is that a specific model needs to be created for each telemetry channel, so many models must be trained for the various telemetry channels, which is impractical in real applications due to computation constraints.
Stacked predictor in [3] stacks three LSTM-based predictors with different architectures and a Support Vector Machine (SVM) based predictor to forecast the input values, then a dynamic thresholding algorithm is applied to optimally extract anomalous data. This method could enhance generalization and adaptability, so it is suitable for multivariate telemetry channels. However, this method will need more time to search the best thresholds for the prediction error sequences than other approaches and it is difficult to determine the optimal bandwidth of Gaussian kernel. This explains its worse performance than our proposed method.
Our proposed method is a prediction model which improves forecasting performance by adopting a stacked GRU-based predictor. Unlike HTM and SP-DTA, our method makes no assumption about the probability distribution of the prediction errors and employs EVT to set the threshold automatically. Compared with the LSTM, the architecture of the GRU is simpler while its performance is as good; therefore, the proposed GRU-based model is more efficient and effective than the baseline models. Moreover, our model is trained on the entire dataset instead of training a specific model for each channel, which greatly improves efficiency and reduces training time.

2) EFFECT OF GRU
Prediction-based anomaly detection algorithms usually consist of two steps. First, a well-designed prediction model is constructed to learn the normal behavior of the time series data and forecast future values. Then a specific detection rule is applied to the prediction errors to identify anomalous time series. Thus, training an effective and efficient predictive model is vital because it directly determines the prediction errors. Considering the superior performance of RNNs in processing temporal or sequential data, in this section the prediction performances of GRU and LSTM, two main variants of the RNN, are compared in two dimensions: prediction accuracy on the test set and training time on the training set. We select commonly used datasets in the field of anomaly detection for comparative analysis: the Numenta Machine Temperature dataset [40], the Power Demand dataset [41], and the ECG dataset [42]. For fairness, the GRU and LSTM based networks have the same architecture and hyperparameters. The experimental results are shown in Table 3.
From the experimental results in Table 3, we can conclude that the prediction accuracy of the prediction model implemented with GRU is slightly higher than that of LSTM. When using a predictive model for anomaly detection, it is usually necessary to compromise between prediction accuracy and detection accuracy. A high prediction accuracy on the test set often results in low detection accuracy. This is because the prediction-based anomaly detection algorithm determines whether the corresponding data is abnormal through the magnitude of the prediction error, so a predictor that also fits the anomalous segments of the test set closely produces small errors on them.
In terms of training efficiency, the training time of each epoch for the GRU model is shorter than that of the LSTM model, with a maximum reduction of 12.2% in ECG dataset. From the perspective of processing the large-scale test data of a launch vehicle, the shorter the training times are, the faster the data will be processed.
Combining the above analysis, the GRU-based prediction model trains faster. Although its prediction accuracy is slightly higher than that of LSTM, this does not impact the results of anomaly detection. Therefore, we choose GRU to construct the prediction network model.

3) EFFECT OF EVT
In addition to the prediction model, another aspect that significantly influences the final performance of anomaly detection is the detection rule. Motivated by the fact that the tail distribution of the prediction errors follows a Generalized Pareto Distribution (GPD), we propose a detection rule based on Extreme Value Theory (EVT) which is able to automatically set thresholds for prediction errors.
In order to assess the performance of our EVT-based detection rule, we compare it with a complementary unsupervised and nonparametric anomaly thresholding approach [16], a method leveraging Chebyshev's inequality [43] and a Gaussian-based thresholding rule [44]. The first three detection rules are not bound to the underlying distribution of the prediction errors, while the Gaussian-based rule assumes that the prediction errors follow a Gaussian distribution.
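To illustrate how two of these baseline rules differ, the sketch below derives a threshold from Chebyshev's inequality and from a k-sigma Gaussian rule. The exact rules used in [43] and [44] are not given in the text, so both functions, their names and their default parameters are assumptions made only for illustration.

```python
import numpy as np


def chebyshev_threshold(errors, q=1e-2):
    """Distribution-free threshold from P(|X - mu| >= k * sigma) <= 1 / k**2.

    Choosing k = 1 / sqrt(q) bounds the exceedance probability by q.
    """
    errors = np.asarray(errors, dtype=float)
    return errors.mean() + errors.std() / np.sqrt(q)


def gaussian_threshold(errors, k=3.0):
    """k-sigma threshold, meaningful only if the errors are roughly Gaussian."""
    errors = np.asarray(errors, dtype=float)
    return errors.mean() + k * errors.std()
```

The Chebyshev bound holds for any distribution with finite variance but is often loose for a given `q`, while the Gaussian rule is tighter but depends on the normality assumption.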
The experiments are conducted on a real dataset containing voltage values sampled from a power system of a launch vehicle. The dataset contains a total of 11 abnormal points, and values below 34V are defined as abnormal points and emphasized as red dots in Fig. 8. The abnormal points are caused by detonating related initiating explosive devices or controlling the swing of the nozzles during flight, which leads the voltage values to drop at the corresponding time. This dataset is fed into the GRU-based stacked predictor to get prediction errors on which these four detection rules are implemented, respectively. The results are shown in Table 4.
It can be seen from Table 4 that our EVT-based detection method identifies 10 anomalies with no false positives, which outperforms the three baseline detection rules. Although the Chebyshev's inequality method detects most of the anomalies, it also produces many false positives, which leads to a low precision. The Gaussian-based method also produces many false positives, indicating that presuming a Gaussian distribution on the prediction errors is a very strong assumption that might not hold in many application scenarios. The precision of the LSTM-NAT method is 88.9%, but its recall is 72.3%, showing that it would miss many real anomalies.

V. CONCLUSION
Spacecraft anomaly detection can help to discover and identify abnormal behaviors in advance and avoid potential cascading downtime. In this paper, to address the problems of the large amount of telemetry data and imbalanced samples in spacecraft, we propose an unsupervised deep learning-based anomaly detection methodology using stacked Gated Recurrent Unit (GRU) based recurrent neural networks and Extreme Value Theory (EVT). Inspired by ensemble learning, we propose a stacked GRU-based network which not only improves the prediction performance, but is also robust to multivariate telemetry data. Moreover, we devise an EVT-based detection rule to set dynamic thresholds for prediction errors without making any strong assumptions about their distribution. In extensive experiments, our proposed approach outperforms state-of-the-art methods on several large datasets.
To the best of our knowledge, this is the first exploration of GRU-based neural networks combined with EVT in the field of spacecraft telemetry anomaly detection. In the future, we will focus on how to improve efficiency and robustness across multiple related spacecraft and machines by using transfer learning.