Introduction
Modern microservice systems consist of thousands of services, grouped into hundreds of subsystems. Generally, each service runs in its own container, and those containers can be dynamically created or destroyed according to scaling configurations. It is challenging for an operator to detect operational issues and perform swift troubleshooting in such a complex environment.
During runtime, an operator can extract the metrics generated by nodes and services, such as CPU/memory usage, network I/O, and the number of requests per second. Combining the observations from different timestamps forms time series; by analyzing such series, an operator can decide whether the system is in an abnormal state. Predicting the future values of a time series also helps the operator discover hidden risks before they actually break the system.
Traditional time series analysis methods require intensive human effort. To analyze the performance of a certain service, an operator must observe the time series generated by that service, then set thresholds manually according to his/her experience. However, in modern microservice systems, the sheer quantity of pods and containers makes the manual method unacceptable. Operators require an unsupervised method that can automatically extract proper thresholds from previous time series with minimal human intervention.
A rich body of literature investigates unsupervised time-series anomaly detection tasks. Xu et al. proposed a model named Donut [1] based on variational auto-encoders (VAE), which achieved a 0.9 F-score in an unsupervised training setup. Ma et al. proposed an approach based on a one-class support vector machine to identify unforeseen or abnormal sections in a time series. Other unsupervised methods, such as clustering, are also being actively researched according to [2].
However, as modern microservice systems grow larger, the above methods reach a performance bottleneck. For example, training a single VAE or clustering model is not trivial; training different models (especially neural networks) for hundreds of sequences and serving their predictions simultaneously becomes unacceptable in modern production settings. Therefore, we seek a universal model for all series of the same metric type in a microservice system.
Such a model should be able to effectively distinguish the “static” parts of different time series, and be expressive enough to accommodate time series with different characteristics. A competitive candidate is the Temporal Fusion Transformer (TFT) [3], which integrates static information and LSTM layers into the transformer structure and combines them via residual layers. Starting from the TFT design, we tested and optimized it according to the requirements of production microservice environments.
The contributions of this paper can be summarized as follows.
We show that Temporal Fusion Transformers can serve as a state-of-the-art model for unsupervised anomaly detection tasks in the DevOps field, which we demonstrate by comparing against other models on real-world datasets collected from production.
We enhance the Temporal Fusion Transformer model by integrating input sequences in a probabilistic way, which improves performance.
We propose a framework called TFTOps that is suitable for near real-time, concurrent anomaly detection of hundreds of different series.
Related Work
In this section, we discuss the development of anomaly detection methods and their usage in the DevOps/AIOps domain.
A. Anomaly Detection
Anomaly detection is an active field of research in which many applications and solutions have been proposed every year over the past decades.
A natural solution to the time series anomaly detection problem is to use the “normal” time series to build a model capable of predicting the upcoming values when the system runs normally. Given such a model and an observed time series, one can feed the observation to the model and obtain a prediction of what the following values should be in the system’s “normal” state. If the observed future values violate the predicted ones, the operator should mark the timestamp as “abnormal”. Under this setup, two essential problems need resolving:
Finding a good model that can correctly predict our target sequence. Depending on the characteristics of different targets, “good” models often vary in scales, costs, and even methodologies.
Finding a method to examine whether a violation case is “abnormal”, or just a false positive/outlier. Such methods mainly refer to various thresholding techniques, including static and dynamic (or self-adaptive) thresholding.
Traditional time series analysis methods tend to model time series with simple statistical techniques. Proposed as part of the Box-Jenkins methodology, the ARIMA model [4] (“AutoRegressive Integrated Moving Average”) uses a differencing operation (the “integrated” part) to eliminate non-stationarity where possible; a seasonal-differencing technique is also introduced to deal with seasonal components. Given an input, an ARIMA model generates a scalar value, and operators often compare the prediction with a static threshold to determine whether the prediction-observation pair is abnormal.
The naïve ARIMA method experiences difficulties when dealing with complex time series: the series can remain non-stationary even after the integration process. Running a second-order differencing may remove more of the non-stationarity, but it also amplifies noise and makes the fitted model harder to interpret.
On the other hand, some efforts were made to decompose the time series into different interpretable components. Generally speaking, one can write a time series as a composition of four components: a level, a trend (T), a seasonal component (S), and an error term (E), which is the gap between the model and actual observations. Different models make different underlying mathematical assumptions about how to compute and combine those components, forming a family called ETS models. Hyndman et al. [5] provide a detailed view of the ETS family.
However, these traditional models suffer from performance degradation when the target becomes complex, or when accurate prior knowledge is unavailable. For example, consider a time series describing the number of customers at a resort. One can easily conclude that it should be busy on weekends and have fewer customers during weekdays, so the seasonal component in ARIMA/ETS should have a period of 7 days. However, this assumption breaks when Christmas comes. The input of traditional models includes only the target series itself and several hyper-parameters; this lack of conditional inputs makes the models harder to generalize to different circumstances.
Since the 2010s, there has been a growing interest in neural network-based methods for various machine learning applications, including time series prediction and anomaly detection. As the previous example suggests, time series often come with prior knowledge, and neural networks are well-suited to incorporating this information into the model. Thanks to representation learning, one can skip the intensive work of choosing feasible features from all sorts of prior knowledge and let the stacked layers of the neural network do the job automatically.
Over the past decades, researchers have proposed a variety of neural network models, and it is hard to thoroughly introduce every direction. To allow a comparison with our proposed TFTOps, we mainly focus on works that are close to our model in target, approach, and scalability potential. In other words, we introduce models that have the following properties:
Use a deep neural network (DNN) to perform time sequence prediction. This ensures that the model is generalizable enough to capture fuzzy and versatile sequences.
Integrate static information and other “known” variables with the input. This information, acting as prior knowledge, contributes to the model’s overall performance.
Directly predict the future values of the sequence (unsupervised), rather than only predicting a manually attached “anomaly” label. This ensures the applicability of the model when exact anomaly labels can hardly be retrieved.
Conditions 1) and 3) correspond to a task called multi-horizon sequence forecasting: given a time sequence as input, the DNN model should generate predictions for the possible values ahead.
In the DNN field, there are two main approaches to the multi-horizon forecasting task. The first features autoregressive models such as Seq2seq [6] and DeepAR [7]. Autoregressive models are usually based on Long Short-Term Memory (LSTM) cells [8] and their variants, such as GRU [9]. DeepAR [7] uses stacked LSTM layers to generate the parameters (mean and variance) of a Gaussian distribution, then samples from that distribution to determine the one-step-ahead output. Deep State-Space models [10] implement a similar approach, while modeling the Gaussian distribution parameters with a Variational Autoencoder (VAE) [11]. More recent research focuses on transformers; for example, [12] proposed the use of transformers in time series tasks and introduced convolutional layers to reduce memory footprints. Fan et al. [13] proposed a multi-modal attention mechanism to enhance LSTM encoders, which provides better context to a bi-LSTM decoder. These models are built to solve the “one-step-ahead” prediction problem: they accept the previous sequence as input, emit the next value, and recursively feed their predictions back as inputs to reach further horizons.
On the contrary, non-autoregressive models generate forecasts for a fixed number of horizons concurrently. A typical model still uses an encoder to generate a hidden representation of the whole input; common encoder choices include feed-forward CNNs, autoencoders [14], LSTMs, transformers, or a combination of these models [15]. After the representation vector is generated, a decoder processes it to produce the future predictions, and the whole model is end-to-end trainable. For example, the Multi-horizon Quantile Recurrent Forecaster (MQRNN) [16] proposed two different encoder structures (CNN and LSTM) and generated output from the encoded sequence via an MLP for each horizon. Temporal Fusion Transformers [3], unlike [12], directly predict the values of all future horizons in a single forward pass.
In terms of model structure, our TFTOps mainly inherits ideas from the TFT model [3] and the probabilistic input of [17].
B. Anomaly Detection in AIOps
DevOps relies on several data sources to monitor the status of a system, mainly application logs [18], distributed traces [19], and KPI metrics; a more comprehensive survey can be found in [20]. In this paper, we mainly focus on research about KPI metrics (time series).
Currently, the most widely adopted toolchains include Prometheus [21] and Grafana [22], which have user-friendly interfaces and are supported by a powerful query language (PromQL [23]). However, those platforms currently only support naïve models as built-in functions, such as linear regression and statistical tests. These approaches are suitable for most periodic data, but fail in complex cases where an assumption about the underlying data distribution is unavailable.
Over the past decades, machine learning approaches have been intensively investigated for metric prediction. Supervised methods include decision trees [24], [25], [26] and Bayesian classifiers [27], [28], [29]. While achieving high accuracy in some cases, the performance of supervised methods heavily depends on accurately labeled data. Unsupervised methods, including LOF (local outlier factor) [30], clustering [31], and PCA [32], have been developed for cases where labeled data is unavailable.
Having succeeded in other fields, such as computer vision and machine translation, neural network methods have been gaining notice in the AIOps context since 2015. For example, Monni et al. [33] built an anomaly detector based on the restricted Boltzmann machine [34]; Lin et al. used a learning-to-rank model [35] to combine the results of an LSTM and a random forest. In more recent research, attention mechanisms have also been applied to predicting long-term time series [36], [37]. Besides directly predicting the target, operators can also use neural networks to generate realistic synthetic sequences [38] to augment the dataset. However, the computational cost of neural networks is often a concern for operators.
Generally speaking, previous researchers mainly focused on predicting the trend of KPIs (Key Performance Indicators). For example, Microsoft’s Spectral Residual CNN KPI anomaly detection model [39] treats each KPI time sequence separately: an independent model is trained on each KPI sequence. Since KPI sequences can be defined quite differently, using different model parameters to predict them is a natural choice; and since the number of such KPI sequences is small, one can easily afford the cost of training and hosting several models. However, in typical microservice settings, there are hundreds of pods and services hosted for different purposes. When operators try to make fine-grained predictions, they will likely meet a dilemma: using a neural network to accurately track a pod/service’s metric, such as disk and memory usage, often consumes more resources than the pod/service itself. This dilemma keeps neural network models out of fine-grained prediction tasks.
Recently, transformer-based models have shown superior performance in neural machine translation and other text-related fields. Their power in processing sequences has also attracted the attention of researchers in the time-series field. Since 2020, transformer-based models have formed a line of research on long-term time series forecasting. For example, Informer [40] proposed an efficient ProbSparse self-attention mechanism, which addressed the performance bottleneck on long time series (output length > 50); Autoformer [41] proposed a decomposition architecture and an auto-correlation mechanism to further aggregate sub-series-level representations.
However, in DevOps settings, it is doubtful that long-term forecasting is necessary. As reported in [40], the prediction error rises quickly as the output sequence length increases, so in production environments, DevOps teams tend to shrink the output length instead. Discovering abnormal circumstances hours ahead is good enough; in this case, a shorter output (length < 50) suffices. Despite using similar transformer and self-attention mechanisms, this paper mainly explores another dimension: the “width” of transformer models in the practical DevOps field, i.e., the ability to cover a group of similar metrics with the same model structure and parameters.
Methodology
This section describes the architecture and training scheme of our proposed model TFTOps.
A. Network Architecture
The model consists of three main parts: the input embedding layer, the input LSTM layer, and the temporal self-attention layer. The architecture is shown in Fig. 1, and the components are connected according to Algorithm 1. In the following subsections, we present the structure of each component in a bottom-up order and explain the rationale behind our choices.
Fig. 1. The architecture of our TFTOps model. Blue dashed lines denote skip connections. Cells with the same color share the same weights across different timesteps.
Fig. 2. The detailed architecture of a GRN unit. Each GRN unit contains two “linear + activation” structures, a LayerNorm, and a skip connection. The contextual input (c) is optional.
Fig. 3. The architecture of the input embedding layer. GRN layers with the same color share the same group of parameters.
Fig. 4. The architecture of our probabilistic input scheme. The probabilistic feature is extracted across a sub-sampled time window. The dashed vertical line marks the end of a sub-sampled window, where the other non-probabilistic variables (including “known” and “observed” variables) are extracted. After being processed by its own GRN layer, each variable enters the variable selection layer on equal footing.
Algorithm 1: Forward pass of the TFTOps model (pseudocode listing: the input variables flow through the embedding, LSTM, and self-attention layers described below).
B. Multi-Input and Preprocessing
We begin with the basic attributes that define a model: its input and expected output.
In a microservice system, the metric time series are often strongly entangled. For example, for a service S, growing network traffic indicates an increasing load and will naturally lead to a peak in CPU usage and disk I/O. Depending on the system architecture, such inter-relationships often involve some delay, making them even more valuable for predicting future metrics.
As in [3], we divide the input into two parts: “static input” and “variable input”. The general rules of division are:
Static inputs
In Prometheus, each time series has some associated attributes (“labels”). Labels are read-only categorical variables, created upon the initialization of a series, that describe its features. When extracting metrics from a server/pod, the node exporter automatically attaches labels to the metrics; maintainers can also customize the labels. For example, the following time series from Prometheus’ node exporter has 4 pre-defined static inputs: business, fstype, instance (IP), and mountpoint.
node_filesystem_avail_bytes{business="k8s", fstype="ext4", instance="10.17.xx.xx", mountpoint="/local"}

These inputs are valuable in predicting future disk usage. The “business” label is manually assigned and represents the department to which the node belongs.
Therefore, we collect all of the time series under the same metric from a microservice system and define 4 categorical variables according to the unique values seen by the node exporter. After choosing those categorical variables, we transform them into one-hot features while maintaining a mapping of all the unique values encountered in this process; see Table 1 for an example. A minimal encoding sketch follows.
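To make the encoding concrete, the following is a minimal Python sketch of how static Prometheus labels could be mapped to one-hot features while maintaining the value mapping. The label values and helper names here are illustrative, not taken from the production system.

import numpy as np

def build_vocab(series_labels, keys=("business", "fstype", "instance", "mountpoint")):
    """Collect the unique values seen for each label key."""
    return {k: sorted({labels[k] for labels in series_labels}) for k in keys}

def one_hot_static(labels, vocab):
    """Concatenate one-hot encodings of each categorical label."""
    parts = []
    for key, values in vocab.items():
        vec = np.zeros(len(values))
        vec[values.index(labels[key])] = 1.0
        parts.append(vec)
    return np.concatenate(parts)

series = [
    {"business": "k8s", "fstype": "ext4", "instance": "10.17.0.1", "mountpoint": "/local"},
    {"business": "k8s", "fstype": "xfs",  "instance": "10.17.0.2", "mountpoint": "/data"},
]
vocab = build_vocab(series)
print(one_hot_static(series[0], vocab))  # [1. 1. 0. 1. 0. 0. 1.]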
Variable inputs
Variable inputs are the main components that build up a time series. In our monitoring system, there are two types of inputs that change over time: the values of different metrics and the timestamps at which those values are produced. Many other variables, such as the hour, the day of the week, and whether it is a holiday, can be derived from timestamps. We denote the metrics as “observed variables” and the timestamp-related inputs as “known variables”.
The main difference between these two types of inputs is whether we can “foresee” their exact future values. Since Prometheus fetches metrics from sources periodically, there is a fixed timestamp delta between neighboring observations; we can therefore compute the timestamp precisely for any future step and derive fields such as the hour or day of the week from it, as sketched below. On the other hand, the future values of the metrics remain unknown; it is our model’s job to predict them.
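Since the scrape interval is fixed, the “known” calendar variables can be computed for arbitrary future timesteps. A small pandas sketch (the 5-minute interval and column names are illustrative):

import pandas as pd

ts = pd.date_range("2022-01-01", periods=6, freq="5min")  # fixed scrape interval
known = pd.DataFrame({
    "hour": ts.hour,
    "day_of_week": ts.dayofweek,
    "is_weekend": (ts.dayofweek >= 5).astype(int),
}, index=ts)
print(known)  # computable for any future timestep as well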
Therefore, unlike traditional RNN-based models, the input dimensionalities of the transformer encoder and decoder in TFTOps are NOT the same. We resolve this with the same variable selection network as in [3], which conceptually serves as a trainable weighted average over input features.
As shown in figure 1, the variable features (yellow) are fed into two Gated Residual Networks (GRN). As in [3], the GRN block is a basic building block of our model. In this process, two types of GRNs are used: one (green) processes the observed variable inputs, which are only available over the past range $[t-k, t]$, while the other processes the known variable inputs, which are available over the full range $[t-k, t+\tau]$.
C. Gated Residual Networks
The idea behind the GRN is like a residual network: the model applies a non-linear transformation only when needed. As shown in Fig. 2, a GRN unit receives two vectors as input: a primary input $a$ and an optional context vector $c$:\begin{align*} \text {GRN}(a,c) &= \text {LayerNorm}(a+\text {GLU}(\eta _{1}))\\ \eta _{1}&=W_{1}\eta _{2}+b_{1}\\ \eta _{2}&=\text{ELU}(W_{2}a+W_{3}c+b_{2}),\end{align*}
where GLU is the gated linear unit:\[GLU(x)=\sigma (W_{g1}x+b_{g1})\odot (W_{g2}x+b_{g2})\]
The ELU activation is conceptually similar to the ReLU activation, but is smooth and differentiable:\begin{align*} \text {ELU}(x)=\begin{cases} \displaystyle x & x>0\\ \displaystyle \alpha (\exp (x)-1) & x\leq 0 \end{cases}\end{align*} A Keras-style sketch of the GLU and GRN blocks follows.
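For reference, below is a minimal TensorFlow/Keras sketch of the GLU and GRN blocks defined above. The layer layout, the dropout placement, and the always-on skip projection are our own simplifications; the actual implementation may differ in such details.

import tensorflow as tf

class GLU(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.gate = tf.keras.layers.Dense(units, activation="sigmoid")
        self.value = tf.keras.layers.Dense(units)

    def call(self, x):
        # GLU(x) = sigmoid(W_g1 x + b_g1) * (W_g2 x + b_g2)
        return self.gate(x) * self.value(x)

class GRN(tf.keras.layers.Layer):
    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        self.dense_a = tf.keras.layers.Dense(d_model)                 # W2, b2
        self.dense_c = tf.keras.layers.Dense(d_model, use_bias=False)  # W3
        self.dense_out = tf.keras.layers.Dense(d_model)               # W1, b1
        self.dropout = tf.keras.layers.Dropout(dropout)
        self.glu = GLU(d_model)
        self.norm = tf.keras.layers.LayerNormalization()
        # Projects a to d_model; an identity suffices when shapes already match.
        self.skip = tf.keras.layers.Dense(d_model)

    def call(self, a, c=None, training=False):
        eta2 = self.dense_a(a)
        if c is not None:                      # optional context input
            eta2 = eta2 + self.dense_c(c)
        eta2 = tf.nn.elu(eta2)
        eta1 = self.dropout(self.dense_out(eta2), training=training)
        # residual connection + gating + layer normalization
        return self.norm(self.skip(a) + self.glu(eta1))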
D. Input Embedding Layer
This layer is the first part of the TFT model; it directly processes the input features of various types and shapes. It consists of embedding layers and a variable selection network, as shown in Fig. 3.
1) Embedding Layers
Each input variable $X^{(i)}_{t}$ is first mapped to a $d_{model}$-dimensional embedding by a linear transformation:\begin{equation*} \boldsymbol {\xi } ^{(i)}_{t} = W^{(i)}X^{(i)}_{t}\end{equation*}
An additional GRN layer introduces a non-linear transformation, turning $\boldsymbol{\xi}^{(i)}_{t}$ into:\begin{equation*}\tilde {\boldsymbol {\xi }} ^{(i)}_{t} = \text {GRN}_{\boldsymbol {\xi } ^{(i)}}(\boldsymbol {\xi } ^{(i)}_{t})\end{equation*}
For each variable $i$, the embedding and GRN weights are shared across all timesteps, while each variable keeps its own set of parameters (see Fig. 3).
2) Variable Selection Network
After the embedding layer, the transformed features of all $m_{\chi}$ variables at timestep $t$ are concatenated into a flattened vector:\begin{equation*} \boldsymbol {\Xi }_{t} = \lbrack \boldsymbol {\xi } ^{(1)^{T}}_{t}, \boldsymbol {\xi } ^{(2)^{T}}_{t}, \ldots, \boldsymbol {\xi } ^{(m_{\chi})^{T}}_{t}\rbrack\end{equation*}
Also in this layer, a context vector $\boldsymbol{c}_{s}$, derived from the static inputs, conditions the variable selection weights:\begin{equation*} \boldsymbol {v}_{\chi t} = \text {Softmax}(\text {GRN}(\boldsymbol {\Xi }_{t}, \boldsymbol {c}_{s}))\end{equation*}
We combine the weights $\boldsymbol{v}_{\chi t}$ with the processed embeddings via a weighted sum:\begin{equation*} \tilde {\boldsymbol {\xi }}_{t} = \sum _{j=1}^{m_{\chi} }\boldsymbol {v}_{\chi t}^{(j)}\tilde {\boldsymbol {\xi }} ^{(j)}_{t}, \quad \tilde {\boldsymbol {\xi }}_{t}\in \mathbb {R}^{d_{model}}\end{equation*}
After that, each timestep, regardless of how many input features it has, is transformed into a vector of fixed dimension $d_{model}$. A sketch of this selection step is given below.
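The following sketch shows the variable selection step, reusing the GRN class from the previous sketch. The shape conventions (one embedding of size d_model per variable, processed per timestep) are assumptions for illustration.

import tensorflow as tf

class VariableSelection(tf.keras.layers.Layer):
    def __init__(self, num_vars, d_model):
        super().__init__()
        self.per_var_grns = [GRN(d_model) for _ in range(num_vars)]
        self.weight_grn = GRN(num_vars)   # outputs one weight per variable

    def call(self, xi, c_s=None):
        # xi: [batch, num_vars, d_model] embedded inputs at one timestep
        flat = tf.reshape(xi, [tf.shape(xi)[0], -1])             # Xi_t
        v = tf.nn.softmax(self.weight_grn(flat, c_s), axis=-1)   # v_{chi t}
        processed = tf.stack(
            [grn(xi[:, i]) for i, grn in enumerate(self.per_var_grns)], axis=1)
        # weighted sum over variables -> [batch, d_model]
        return tf.reduce_sum(v[..., tf.newaxis] * processed, axis=1)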
E. Probabilistic Input
The observation frequency in AIOps is much higher than in traditional time-series prediction tasks (such as grocery sales [43] and climate change). Most monitoring systems collect metric values on a minute-level basis; thus, a typical period (a day) would contain more than a thousand data points (1440 at a one-minute interval), which is too long to feed into the model directly.
To resolve this problem, we leverage the “probabilistic input” idea from [17] to enhance the previous input schema. We illustrate this probabilistic pre-processing method by a real-world task below.
One of our clients has a Prometheus monitoring system whose retrieving interval (the time difference between two neighboring data points) is far shorter than the granularity operators care about. Feeding every raw point to the model is wasteful, so we pre-process the series as follows.
Firstly, we manually determine an interval $I$ according to domain knowledge. For example, in this task we choose $I=20$, combining 20 raw inputs into ONE distributional input.

Secondly, we normalize the real-valued inputs. After normalization, the model splits the value range into $n_{\text{bins}}=10$ bins. $n_{\text{bins}}$ is a hyperparameter determined through grid search, and its typical search range is $[0.05, 0.3]\cdot I$, where $I$ is the previously mentioned aggregation interval. Let $b(t): \mathbb{R} \rightarrow \mathbb{R}^{n_{\text{bins}}}$ be the binarization function which maps the real-valued input $x(t)$ to a one-hot vector $b(t)$.

Thirdly, we cut the input sequence according to the previously mentioned interval $I$. Each timestamp $t$ now corresponds to a time window that starts from $t-I+1$ and ends with $t$ itself. We transform the values in a window into one-hot vectors, then take an average to build a probabilistic input series:\begin{equation*} P_{x}(t)=\frac {1}{I}[b(t-I+1)+b(t-I+2)+\ldots +b(t)]\end{equation*}
Similarly, for succeeding inputs $P_{x}(t+k)$, we have:\begin{equation*} P_{x}(t+k)=\frac {1}{I}[b(t+(k-1)I+1)+\ldots + b(t+kI)]\end{equation*}
After the pre-processing, each time window of length $I$ is represented by a single probabilistic vector $P_{x}(t)$, shortening the input sequence by a factor of $I$ while preserving the distribution of the raw values. A numpy sketch of this step follows.
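A minimal numpy sketch of this pre-processing, assuming the series has already been normalized to [0, 1] and using the hyperparameters from the text (I = 20, n_bins = 10); the equally spaced bin edges are an assumption.

import numpy as np

def probabilistic_input(x, interval=20, n_bins=10):
    """Turn a 1-D normalized series into per-window histograms P_x(t)."""
    n_windows = len(x) // interval
    x = x[: n_windows * interval].reshape(n_windows, interval)
    # b(t): map each value to a bin index, one-hot it, then average per window
    bins = np.clip((x * n_bins).astype(int), 0, n_bins - 1)
    one_hot = np.eye(n_bins)[bins]       # [n_windows, interval, n_bins]
    return one_hot.mean(axis=1)          # [n_windows, n_bins]; rows sum to 1

x = np.random.rand(400)                  # 400 raw points -> 20 windows
P = probabilistic_input(x)
print(P.shape, P.sum(axis=1)[:3])        # (20, 10) [1. 1. 1.]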
We apply a linear layer to transform the probabilistic vector into a $d_{model}$-dimensional embedding $\boldsymbol{\xi}^{(P_{j})}_{t}$, which is then processed by its own GRN like any other variable:\begin{equation*}\tilde {\boldsymbol {\xi }} ^{(P_{j})}_{t}=\text {GRN}_{\tilde {\boldsymbol {\xi }} ^{(P_{j})}}(\boldsymbol {\xi } ^{(P_{j})}_{t})\end{equation*}
Note that we only use the “probabilistic” pre-processing scheme for observed variables. Known variables, which are often tightly related to timestamps, are resampled with a larger interval as usual.
F. Input LSTM
As shown in figure 1, the outputs of the input embedding layer at every timestep ($\tilde{\boldsymbol{\xi}}_{t+n}$, $n \in [-k, \tau]$) are fed into an LSTM encoder-decoder: the past timesteps enter the encoder, and the future (known-variable) timesteps enter the decoder.
The LSTM output at each timestep (encoder and decoder) is denoted as $\boldsymbol{\phi}(t,n)$. A gated skip connection combines it with the corresponding embedding:\begin{equation*} \tilde {\boldsymbol {\phi }}(t,n) = \text {LayerNorm}\left ({\tilde {\boldsymbol {\xi }}_{t+n} + \text {GLU}(\boldsymbol {\phi }(t,n)) }\right)\end{equation*}
The produced output $\tilde{\boldsymbol{\phi}}(t,n)$ is then enriched with a static context vector $\boldsymbol{c}_{e}$:\begin{equation*} \boldsymbol {\theta }(t,n) = \text {GRN}(\tilde {\boldsymbol {\phi }}(t,n), \boldsymbol {c}_{e})\end{equation*}
Through the input embedding and the LSTM, we extract the local information and store it in $\boldsymbol{\theta}(t,n)$.
We concatenate the outputs of all timesteps into a single matrix:\begin{equation*} \boldsymbol {\Theta }(t) = [\boldsymbol {\theta }(t,-k)^{T}, \ldots, \boldsymbol {\theta }(t,\tau)^{T}]^{T}\end{equation*}
After this layer, the short-term relationships within the series have been extracted into the output features.
G. Multi-Head Attention
As in TFT [3], TFTOps implements a self-attention mechanism. This mechanism, which is inherited from the original Transformer model [45], helps the model learn the long-term relationships among input timesteps.
Generally speaking, the attention mechanism scales the “values” $\boldsymbol{V}$ according to the relationships between “queries” $\boldsymbol{Q}$ and “keys” $\boldsymbol{K}$:\begin{equation*} \text {Attention}(Q,K,V) = A(Q,K)V\end{equation*}
where $A(\cdot)$ is the scaled dot-product attention weight:\begin{equation*} A(\boldsymbol {Q}, \boldsymbol {K}) = \text {Softmax}\left({\boldsymbol {Q}\boldsymbol{K}^{T}} / \sqrt {d_{attn}}\right)\end{equation*}
Multi-head attention layers are built by replicating the dot-product attention several times. A single dot-product attention is called a “head”, and the number of heads is a hyperparameter $m_{H}$. Each head $h$ has its own projection matrices for queries, keys, and values:\begin{equation*} \boldsymbol {H}_{h} = \text {Attention}(\boldsymbol {QW}_{Q}^{(h)}, \boldsymbol {KW}_{K}^{(h)}, \boldsymbol {VW}_{V}^{(h)})\end{equation*}
After all $m_{H}$ heads are computed, their outputs are concatenated and linearly combined:\begin{equation*} \text {MultiHead}(\boldsymbol {Q},\boldsymbol {K},\boldsymbol {V}) = [\boldsymbol {H}_{1}, \ldots,\boldsymbol {H}_{m_{H}}] \boldsymbol {W}_{H}.\end{equation*}
We make the same adjustments to the traditional multi-head attention layer as in [3]. The core idea is to “force” all heads to share the same value projection $\boldsymbol{W}_{V}$ and to average the attention weights across heads:\begin{align*} \tilde {H} &= \tilde {A}(\boldsymbol {Q}, \boldsymbol {K}) \boldsymbol {VW}_{V}\\ &= \left \{{\frac {1}{m_{H}}\sum _{h=1}^{m_{H}} \text {A}\left ({\boldsymbol {QW}_{Q}^{(h)}, \boldsymbol {KW}_{K}^{(h)}}\right)}\right \} \boldsymbol {VW}_{V}\\ &= \frac {1}{m_{H}}\sum _{h=1}^{m_{H}} \text {Attention}\left ({\boldsymbol {QW}_{Q}^{(h)}, \boldsymbol {KW}_{K}^{(h)}, \boldsymbol {VW}_{V}}\right)\end{align*}
Thus, every head attends to the same shared value features, and the averaged attention weights $\tilde{A}(\boldsymbol{Q}, \boldsymbol{K})$ can be read directly as the importance of each input timestep, which makes the layer interpretable. A sketch follows.
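A TensorFlow sketch of the interpretable multi-head attention described above: per-head query/key projections, a single shared value projection, and head-averaged attention weights. Class and variable names are our own.

import tensorflow as tf

class InterpretableMultiHead(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super().__init__()
        d_attn = d_model // num_heads
        self.wq = [tf.keras.layers.Dense(d_attn) for _ in range(num_heads)]
        self.wk = [tf.keras.layers.Dense(d_attn) for _ in range(num_heads)]
        self.wv = tf.keras.layers.Dense(d_attn)   # W_V, shared by all heads
        self.wh = tf.keras.layers.Dense(d_model)  # final linear map W_H
        self.scale = tf.math.sqrt(float(d_attn))

    def call(self, q, k, v):
        value = self.wv(v)                        # V W_V, computed once
        heads = []
        for wq, wk in zip(self.wq, self.wk):
            logits = tf.matmul(wq(q), wk(k), transpose_b=True) / self.scale
            heads.append(tf.matmul(tf.nn.softmax(logits, axis=-1), value))
        # averaging the heads keeps one interpretable attention pattern
        return self.wh(tf.add_n(heads) / len(heads))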
H. Multi-Head Self-Attention Layer
By feeding the output of the GRNs into the aforementioned interpretable self-attention layer, we model the long-term dependencies with stacked multi-head self-attention.
We apply interpretable multi-head self-attention only once. All previous GRN results, ranging from the first timestep of the input sequence to the last timestep of the prediction (i.e., $\boldsymbol{\Theta}(t)$), serve simultaneously as queries, keys, and values:\begin{equation*} \boldsymbol {B}(t) = \text {InterpretableMultiHead}(\boldsymbol {\Theta }(t), \boldsymbol {\Theta }(t), \boldsymbol {\Theta }(t))\end{equation*}
This self-attention layer can also be stacked, as mentioned in [45]. When stacking, each subsequent layer directly processes the output of the previous self-attention layer; in that case, we do not include residual connections and/or static contexts.
Again, in order to preserve the local information learned by the RNN, we use a residual connection, marked with blue dashed lines in Fig. 1. The transformer output $\boldsymbol{\beta}(t,n)$ (the row of $\boldsymbol{B}(t)$ at position $n$) is gated, added to the residual, and fed into another GRN block (blue):\begin{equation*} \boldsymbol {\psi }(t,n) = \text {GRN}\left ({\text {LayerNorm}\left ({\boldsymbol {\theta }(t,n) + \text {GLU}(\boldsymbol {\beta }(t,n)) }\right) }\right)\end{equation*}
I. Output Layer and Quantile Loss
We provide another skip connection which bypasses the entire transformer block, directly connecting the output of the LSTM to the dense layer:\begin{equation*} \tilde {\boldsymbol {\psi }}(t,n) = \text {LayerNorm}\left ({\tilde {\boldsymbol {\phi }}(t,n) + \text {GLU}(\boldsymbol {\psi }(t,n)) }\right)\end{equation*}
This new embedding $\tilde{\boldsymbol{\psi}}(t,\tau)$ is fed into a dense layer to produce the prediction:\begin{equation*} \hat {y}(t, \tau) = \boldsymbol {W} \tilde {\boldsymbol {\psi }}(t,\tau) + b\end{equation*}
The prediction at timestep $t$ therefore covers every horizon $\tau \in [1, \tau_{\max}]$; with the quantile outputs described below, each horizon yields one value per quantile.
To better fit the requirements of anomaly detection, we choose the averaged quantile loss. We define a set of quantiles $q \in Q$, each with its own output head:\begin{align*} \hat {y}_{q}(t, \tau) &= \boldsymbol {W}_{q} \tilde {\boldsymbol {\psi }}(t,\tau) + b_{q} \tag{1}\\ QL(y, \hat {y}, q) &= q(y-\hat {y}_{q})_{+} + (1-q)(\hat {y}_{q}-y)_{+} \tag{2}\end{align*}
The quantile loss penalizes over- and under-prediction asymmetrically: for a high quantile (e.g., $q=0.9$), under-prediction costs more, pushing $\hat{y}_{q}$ toward the upper tail of the target distribution; a low quantile symmetrically tracks the lower tail.
Our final loss is the average of the quantile losses across all quantiles $q \in Q$ and all prediction horizons $\tau \in [1, \tau_{\max}]$. A minimal implementation sketch follows.
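A minimal TensorFlow sketch of the averaged quantile (pinball) loss of Eq. (2); the quantile set shown here (0.1, 0.5, 0.9) is illustrative.

import tensorflow as tf

QUANTILES = tf.constant([0.1, 0.5, 0.9])

def quantile_loss(y_true, y_pred):
    # y_true: [batch, horizons, 1]; y_pred: [batch, horizons, n_quantiles]
    err = y_true - y_pred
    # q*(y - y_hat)_+ + (1-q)*(y_hat - y)_+  ==  max(q*err, (q-1)*err)
    ql = tf.maximum(QUANTILES * err, (QUANTILES - 1.0) * err)
    return tf.reduce_mean(ql)   # average over quantiles, horizons, and batch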
J. Output Schemes and Detection Modes
There are two primary use cases of TFTOps.
Firstly, we can automatically obtain an adaptive threshold for a certain metric through the quantile loss function. As the quantile loss function suggests, the prediction for quantile $q$ estimates the value that the observation falls below with probability $q$; an upper quantile (e.g., $q=0.9$) and a lower quantile (e.g., $q=0.1$) therefore form an adaptive band around the expected behavior, replacing manually set static thresholds.
Secondly, the TFTOps model itself can also serve as a prediction model. Traditionally, such a prediction is obtained by extracting a trend from the last hours via a simple regression model (such as linear regression), then extending the extracted trend. However, this naïve method does not consider external information such as date or time, thus introducing a high chance of false alarms. In our TFTOps model, the output for the “normal” quantile ($q=0.5$) directly serves as a point forecast of the median, which already accounts for such context. A sketch of the band-based detection rule follows.
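A numpy sketch of the resulting band-based detection rule, flagging observations that fall outside the predicted lower/upper quantile band; all names and values are illustrative.

import numpy as np

def detect_anomalies(y_obs, y_low, y_high):
    """Flag timesteps whose observation leaves the [q_low, q_high] band."""
    return (y_obs < y_low) | (y_obs > y_high)

y_obs  = np.array([10.2, 11.0, 25.3, 10.8])   # observed values
y_low  = np.array([ 9.0,  9.5,  9.8, 10.0])   # q=0.1 predictions
y_high = np.array([12.0, 12.5, 12.8, 13.0])   # q=0.9 predictions
print(detect_anomalies(y_obs, y_low, y_high))  # [False False  True False]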
Experiments
A. Experiment Setup
We implemented TFTOps using the TensorFlow framework. Both training and prediction are performed inside the same Kubernetes pod, which is hosted on a node with an Intel(R) Xeon(R) Gold 6278C CPU and no GPUs.
Our dataset is retrieved from a Prometheus platform which monitors a working Kubernetes cluster with hundreds of nodes and thousands of running pods. Because Prometheus’s built-in TSDB is designed for short-term storage and does not guarantee long-term persistence, our training set is periodically retrieved from a PostgreSQL database, which serves as the persistent data storage for Prometheus.
B. Dataset Overview
1) Prometheus Node Exporter (PNE)
The Prometheus monitoring platform scrapes metrics from node_exporters in the Kubernetes system every 20 seconds.
Our first experiment chose the “node_filesystem_free_bytes” metric as the target. The metric indicates the amount of free space on each mounted device of the k8s nodes; it is tightly connected with system load and thus receives great attention from our operators. This metric naturally drops as system load increases; however, there are many possible causes for such a drop, and their severity differs greatly. For example, when the system load is reaching a local peak, a drop in node_filesystem_free_bytes is perfectly normal; when the system is experiencing a DDoS attack, or some bug occurs in the load-balancing components, a sustained drop in the same metric would eventually break the system. Discriminating between the two cases proves to be a challenge, and we try to resolve it with TFTOps.
Other metrics in node_exporter, such as node_network_receive_bytes_total, could also be useful in predicting future values. We integrated those metrics in a group of experiments to clarify whether providing extra time series helps with this task.
Generally speaking, these sequences are very likely to be monotonic within a given time window, which makes traditional methods (ARIMA, ETS) suitable for this task. However, traditional approaches require fitting a model per prediction task, greatly slowing down the prediction process. In contrast, inference through a neural network is more efficient.
The input variables are extracted directly from the Prometheus sequences, each of which carries a set of labels describing its properties. We used PromQL to extract whole sequences as lists of data points collected at different times from different servers. Each point contains a timestamp, a value (bytes), and a set of descriptive variables indicating the state of the source. We use the following variables in our experiment:
Note that, since the same timestamp will never appear in both the train and test datasets, we do NOT use the timestamp itself as a feature.
2) Redis Connect (RDC)
The second experiment aims to predict the number of active connections per Redis job in a Redis cluster. In our production Redis cluster, a crontab job retrieves the number of connections once per minute and sends the message to a Kafka topic for further analysis. Originally, the operators used a naïve threshold-based alert system, which proved to be annoying, since the alert was triggered by false positives from time to time. We directly consume the topic and record the historical values in a database. This metric is roughly proportional to the system’s KPI because it is closely related to system load and user traffic. Traditional temporal-filtering-based models fail to deal with such KPI-related metrics.
Due to the nature of Redis connections, most sequences in this dataset are oscillatory. Traditional methods either fail to converge (ARIMA) or perform poorly (ETS) when processing such sequences.
We use the following variables in our experiment:
3) Benchmarks
We compare TFTOps with different types of models for multi-horizon forecasting. For models based on neural networks, we used roughly the same number of search iterations to conduct hyperparameter optimization over a pre-defined search space centered around their default parameters (as given in their respective papers).
A brief introduction of relevant models and their source code (if available):
ARIMA [4]: A traditional statistical model for time series analysis. Note that there is no guarantee that all input sequences are stationary; for non-stationary sequences that cause the ARIMA model to fail, we use the last observed value as a substitute.
ETS: Another traditional statistical model, which decomposes time series into error, trend, and seasonal components. Neither ARIMA nor ETS requires additional features.
MQRNN [16]: A recurrent network for multi-horizon forecasting. Its encoder is the same as the seq2seq LSTM encoder (but without attention), and it uses the same hyperparameter search space.
RoughAE [14]: A neural network model based on the denoising autoencoder.
DTDL [15]: A dictionary-learning neural network model, which uses LSTM layers as the autoencoder.
DeepAR [7]: A popular time series prediction framework based on RNNs. The main difference between DeepAR and Seq2Seq is that DeepAR outputs a distribution at each timestep.
ConvTrans [12]: A model based on the transformer [45] structure with LogSparse convolutional self-attention layers; the LogSparse layer reduces the number of dot products per self-attention layer.
For the models without available open-source implementations, we replicated the model structures described in their papers to the best of our ability.
Most neural network models are based on either LSTM recurrent networks [44] or transformers [45]. For seq2seq models, the decoder input (the output of the former timestep) is concatenated with the “known” variables; the other networks already integrate static and extra sequences in their inputs.
In our experiments, we include both variations of TFTOps:
TFTOps: The original TFT model, which is similar to [3].
TFTOps(prob): The TFT model with extra binned and probabilistic inputs.
C. Evaluation
The effectiveness of the prediction model can be assessed from two different aspects. First, we want the prediction to be precise when the system runs normally. Second, we want the prediction model to raise an alert when the system is in an erroneous state.
The first requirement is evaluated by computing the average prediction error between the model’s outputs and the observed values on held-out test data, as defined below.
Per timestamp $t+\tau$, the absolute error of a prediction issued at $t_{0}$ is:\begin{equation*} MAE(t+\tau) = |\hat {y}^{(t_{0})}(t+\tau) - y(t+\tau)|\end{equation*}
The final evaluation metric is the average over timesteps and sequences:\begin{equation*} MAE = \frac {1}{n\tau _{\max }}\sum _{i=1}^{n}\sum _{\tau =1}^{\tau _{\max }}MAE_{(i)}(t+\tau)\end{equation*}
Note that different sequences can have vastly different scales, so we also report a normalized error, where the prediction is first mapped back to the original scale (via the normalization constants $A$ and $B$) and compared against the raw observation $y'$:\begin{equation*} MAE\% = \frac {\sum _{\tau =1}^{\tau _{\max }}|A(\hat {y}^{(t_{0})}(t+\tau)+B) - y'(t+\tau)|}{\sum _{\tau =1}^{\tau _{\max }}|y'(t+\tau)|}\end{equation*}
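For clarity, a numpy sketch of the MAE% computation above, where A and B are each sequence’s de-normalization constants (function and argument names are illustrative):

import numpy as np

def mae_percent(y_pred_norm, y_raw, A, B):
    """Normalized MAE over one prediction window."""
    y_pred = A * (y_pred_norm + B)        # map back to the original scale
    return np.abs(y_pred - y_raw).sum() / np.abs(y_raw).sum()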
We modified the output layers of the other neural network models to enable the quantile loss function. Different quantiles are trained in the same training loop, while each quantile keeps its own output weights (cf. Eq. (1)).
The second requirement is evaluated by investigating the erroneous states observed in daily maintenance, or by injecting exceptions into the system and examining supervised metrics such as recall and precision. We do not consider the F-score due to the great imbalance between the numbers of normal and abnormal cases: even after we injected exceptions into the system, abnormal cases were still extremely rare.
D. Implementation
This subsection describes the implementation details of the TFTOps model in a production environment.
1) Auto Update
The statistics of time series in a production system keep shifting over time. As a result, the prediction model must also have an updating routine.
In our production deployment, training and prediction run as different, independent processes. Updates are scheduled once per week: we fetch the most recent data from the databases to train the TFTOps model; after training is done, the best model is saved to the file system, and a message is sent through a RabbitMQ message queue. When the prediction process receives that message via polling, it triggers a “reload” function, discarding the old model and reloading the new model from disk.
2) Dataset Error Handling
In production environments, each model corresponds to 2 weeks/one month of training data. It is nearly impossible for the microservice system to keep a consistent state for weeks. In our cases, Kubernetes pods (or their exporters) and Redis jobs are dynamically created and destroyed over time. As a result, we must deal with “dirty” data.
Prometheus metrics: Prometheus node-exporters generate a metric ($\mathrm {kube\_{}node\_{}status\_{}condition}$) to track the state of each node. As we wish to model the system’s normal and steady state, the metrics generated while $\mathrm {kube\_{}node\_{}status\_{}condition}\,\,\{\text {condition='Ready'}, \text {status='true'}\}\neq 1$ (which means the node is not ready) are discarded from the training set.

Redis Connect: For each Redis job, rather than inferring the mean and variance directly from the training set, we observe a longer range (6 months) to obtain a better estimate; if the job was started within the last 6 months, the observation window becomes its full lifespan. Besides rescaling the input/output, we also use this observation to rule out outliers, which are clipped to $\mu \pm 3\sigma$. In our systems, the number of connections per Redis job is tracked by Kafka messages: a scheduled producer fetches the status of the Redis clusters and uploads it to the given topic. Both the fetch and upload procedures are prone to errors during network fluctuations, so we occasionally observe missing data points. When the number of consecutive missing observations is relatively small ($\leq 5$, i.e., 5 minutes), we assume the system is running normally; in this case, we fill in the missing values with the latest visible observation, so our pre-processing step can generate training data across the gap. On the contrary, when a wider gap ($>5$) appears, we assume that the system’s state is questionable and discard the windows spanning the gap (see the sketch below).
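A pandas sketch of the gap-handling rule just described: gaps of at most 5 consecutive missing observations are forward-filled, while wider gaps are left as NaN so that windows spanning them can be discarded later. The helper name is our own.

import pandas as pd

def fill_small_gaps(series: pd.Series, limit: int = 5) -> pd.Series:
    """Forward-fill runs of at most `limit` NaNs; leave wider gaps untouched."""
    isna = series.isna()
    # label each run of consecutive NaNs and measure its length
    run_id = (isna != isna.shift()).cumsum()
    run_len = isna.groupby(run_id).transform("sum")
    fillable = isna & (run_len <= limit)
    return series.where(~fillable, series.ffill())

s = pd.Series([1.0] + [None] * 3 + [5.0] + [None] * 7 + [13.0])
print(fill_small_gaps(s).tolist())
# the 3-point gap is filled with 1.0; the 7-point gap stays NaN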
E. Results: Evaluation on Test Datasets
The MAE results of the models mentioned in the previous section are shown in Table 4 and Table 5.
Generally speaking, the performance of modern models is better than that of traditional ARIMA/ETS models. It is worth noticing that neural network models are especially good at predicting quantiles: all the neural network models’ quantile losses (q=0.1 and q=0.9) are far better than those of ETS.
Among all neural network models, we can conclude that TFTOps achieves excellent MAE on both datasets, exceeding the other SOTA models, and is especially good on the oscillatory RDC dataset.
The probabilistic input scheme for TFTOps also has a positive effect on MAE, though it introduces some extra parameters and requires slightly more computation in the preprocessing and embedding extraction phases. Because the probabilistic features are vectorized before being fed into the variable selection layer, this extra cost is trivial.
F. Results: The Robustness of TFTOps Against Noise
When building a metric prediction system in the real world, developers have a variety of static or variable features to choose from. For example, when predicting CPU usage, one can easily acquire the following features from many sources, especially from the Prometheus node-exporter:
CPU model, generation and performance benchmarks (static)
The number of processes running on the same machine (variable)
The network load of the same machine (variable)
KPI of related services (variable)…
Features without impact are considered “noise” and are harmful to the model. Generally, one would run a grid search to validate the effectiveness of each feature and find the optimal feature set. However, training all sorts of deep-learning models requires a non-trivial amount of time. To mitigate this problem, the model itself should be robust against noise.
We designed a new dataset, PNENoise, based on PNE to test different models’ ability to filter noise. In PNENoise, artificial noise features are concatenated to each row. Those “noise” features are sampled from the following distributions to reflect various situations (a generation sketch follows the list):
Uniform: a uniform distribution over $[0, 1]$.
Normal: a normal distribution $N(0,1)$.
Periodic normal: a normal distribution whose centroid is a function of time $t$, reflecting a periodic component. We chose $N(5\sin (t/24), 1)$.
Random category: a randomly assigned “class” from $\{0, 1, 2, 3, 4\}$. This is a static feature.
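A numpy sketch of how such noise features could be generated; the seed and series length are illustrative.

import numpy as np

rng = np.random.default_rng(0)
T = 1000                                    # number of timesteps
t = np.arange(T)

noise_features = {
    "uniform":  rng.uniform(0.0, 1.0, T),
    "normal":   rng.normal(0.0, 1.0, T),
    # centroid follows 5*sin(t/24), giving a periodic drift
    "periodic": rng.normal(5.0 * np.sin(t / 24.0), 1.0),
    # one static class label per sequence, repeated along time
    "category": np.full(T, rng.integers(0, 5)),
}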
To evaluate robustness, we re-train the different models (except ARIMA and ETS) on the PNENoise dataset from scratch and compare their performance in terms of MAE.
Results are shown in Table 6. It can be concluded that traditional neural-network-based methods (MQRNN, DeepAR, and ConvTrans) suffer from noisy features, while RoughAE/DTDL and TFTOps are only slightly affected. RoughAE and DTDL directly address the robustness problem through design choices such as rough inputs [14] and dictionary learning [15], and thus achieve better robustness; however, their accuracy is slightly worse than that of the other deep neural networks. The TFTOps model, on the other hand, maintains robustness without sacrificing accuracy.
G. Results: Evaluation on Real Environment
As mentioned in [7], the quantile outputs naturally form a good anomaly detector: whenever the observed target violates the prediction (a value higher than the upper quantile or lower than the lower quantile), the sequence is considered to be in an abnormal state.
We performed the evaluation in the production environment which generates the RDC dataset. The TFTOps model was kept serving for a month, during which the operators recorded 17 abnormal incidents. When the TFTOps model reports an anomaly, we check whether an abnormal incident happened within a 10-minute range (true positive) or not (false positive).
The confusion matrix and evaluation results are shown in Table 7 and Table 8.
We can conclude that the model successfully catches most of the anomalies at the cost of relatively low precision.
Shifting the quantiles (q=0.1/q=0.9) towards 0/1 should improve precision while lowering recall; the selection of quantiles greatly influences the performance of the model in production. A practical solution is to predict many quantiles (for example, q = [0.05, 0.1, 0.15, …, 0.95]) and select a proper quantile as the anomaly detector according to the actual situation.
To illustrate the conclusion above, we re-trained the model using the same dataset and hyperparameters, only altering the quantiles, and compared the resulting precision and recall.
Our investigation shows that in our production setting, human operators slightly favor recall over precision. Therefore, we chose the quantile configuration with the higher recall for deployment.
More generally, the choice between recall and precision is determined by the nature of the system that generates the metrics. When it is easy to validate whether an actual error has occurred, or when every occurrence of the error tends to have critical consequences, operators favor recall over precision; when errors can be resolved automatically, operators tend to favor precision over recall to eliminate false positives.
H. Efficiency
The time cost of the prediction stage of the TFTOps model is listed in Table 11.
The column “Avg. time” is the average time to process one slice of data, which contains the metrics generated by the whole system (hundreds of series) at a single retrieval.
When a forward pass runs, the data flows through an LSTM layer and a stack of self-attention layers. The time complexity is therefore tied closely to the total sequence length of the encoder and decoder, $k+\tau$: the LSTM part is sequential in $k+\tau$, while each self-attention layer costs $O((k+\tau)^{2})$ pairwise dot products.
In the PNE dataset task, we retrieve data every 5 minutes; in the RDC dataset task, we retrieve data once per minute. According to Table 11, in both cases the efficiency of the TFTOps model meets our production requirements.
Conclusion
We proposed TFTOps, a variant of the Temporal Fusion Transformer [3] designed for unsupervised anomaly detection tasks in AIOps. We also improved TFTOps by introducing probabilistic inputs, which further boosts its accuracy in our experiments. Our findings show that the TFT model concept is well-suited for a modern multi-metric monitoring setup, satisfying requirements on both accuracy and efficiency.
The merits of TFTOps include:
Flexibility. Users can include various features (real-valued / categorical / probabilistic, variable / static) in their hypothesis. TFTOps provides an elegant way to blend these features into its training and prediction process.
Robustness. If a new feature proves to be noise, the TFTOps model shows stronger robustness than other neural network models, which shortens the time-consuming feature-selection procedure. With little effort, operators can generate valuable predictions with the off-the-shelf TFTOps model.
Future Work
In our experiments, we treat each sequence independently (except that they share some categorical static inputs). Intuitively, in k8s or other cluster settings, introducing features from “related” nodes could benefit the accuracy of our model. Besides, other possible data sources, such as embedded system logs, could be used as features. We leave both directions for future work.
Appendix
Hyperparameters
We use the following grid search scheme to find the optimal hyperparameters in the experiments on both the PNE and RDC datasets.
The parameter names, search spaces, and effects of the hyperparameters are listed below:
Dropout rate: {0.1, 0.2}. This parameter controls the dropout probability in the GLU layer, applied before the gating and layer normalization.
Hidden layer size: {4, 8, 16, 32}. This controls the input/output dimensionality of the hidden layers ($d_{model}$).
Learning rate: {0.1, 0.01, 0.001}. Fine-grained tuning can be conducted per dataset to further improve the result.
#heads: {4,6,8}. The number of heads in multi-head attention layers.
Stack size: {1,2,3,4}. The number of stacked self-attention (SA) layers.
For the PNE dataset, optimal hyperparameters are:
Dropout rate: 0.2
Learning rate: 0.001
Hidden layer size: 8
#heads: 4
Stack size: 4
For the RDC dataset, optimal hyperparameters are:
Dropout rate: 0.1
Learning rate: 0.01
Hidden layer size: 8
#heads: 4
Stack size: 2
According to our observations, neither dataset requires a large embedding/feature space; on the other hand, both need several stacked self-attention layers to achieve optimal performance. Intuitively, a higher embedding dimension means we can extract richer information from the input features (static and variable). For our datasets, however, the inter-relationships among different timesteps of the metric itself, which are mainly modeled by the LSTM and transformer layers, seem to matter more.