A Dual-Staged Attention Based Conversion-Gated Long Short Term Memory for Multivariable Time Series Prediction

In multivariate time series modeling, it is necessary to capture short-term mutation and long-term dependence information simultaneously. However, a mechanism that captures short-term changes is difficult to use for grasping long-term dependence information, and vice versa. In order to capture both kinds of information in the same model, this paper proposes a dual-staged attention mechanism based on a conversion-gated Long Short Term Memory network (DA-CG-LSTM). The hyperbolic tangent function is introduced into the input-gate and the forget-gate of the Long Short Term Memory network (LSTM), which improves the ability of the network to extract short-term mutation information. Further, a dual-staged attention mechanism, consisting of input attention and temporal attention, is added to the network. Input attention adaptively extracts the feature relations of exogenous sequences, and temporal attention selects the relevant hidden states across all time steps. Experiments on air quality and traffic flow time series data show that the proposed network reduces mean absolute error, mean absolute percentage error and root mean square error by more than 50% compared with the Dual-staged Attention Recurrent Neural Network (DA-RNN) and the Transformation-gated LSTM (TG-LSTM).


I. INTRODUCTION
Time series prediction algorithms have been widely applied in many areas, such as environmental temperature [1], [16], financial stocks [2], [17] and traffic conditions [3], [27]. Although traditional models [4]-[8] have shown their effectiveness for various real-world applications in multivariate time series prediction [9], they cannot model nonlinear relationships and do not differentiate among the exogenous terms. In order to address this issue, researchers have turned to recurrent neural networks in recent years.
Through the application of gate mechanisms, variants of the recurrent neural network (such as LSTM [10] and GRU [11]) can not only effectively address vanishing and exploding gradients, but also capture nonlinear and long-term dependence relationships. However, LSTM is unable to effectively capture short-term information that is highly correlated with the target sequence [22].
The associate editor coordinating the review of this manuscript and approving it for publication was Jon Atli Benediktsson.
In order to enhance the capability of capturing long-term dependence, the encoder-decoder, combined with a recurrent neural network, was proposed to address variable filtering issues for multivariable time series [12]. The same network has also attracted wide attention in machine translation [13], where original sentences are encoded into fixed-length vectors and then decoded to generate the corresponding translation. However, as the length of the input sequence increases, the performance of the encoder-decoder network deteriorates rapidly. For time series forecasting problems, it is usually necessary to consider a relatively long segment of the target and associated sequences. To address this issue, an encoder network based on the attention mechanism, inspired by human attention, was proposed to select hidden states across all time steps [14]. Attention-based LSTM was used to generate a translation dictionary with a relative probability distribution [15]. Double attention layers were added to LSTM to extract spatio-temporal features from exogenous sequences, which improves efficiency and performance in grasping long-term dependence information [16]. The dual-staged attention based recurrent neural network (DA-RNN) was proposed in [17]. In its first stage, input attention and an encoder network are used to select relevant driving series at each time step. In its second stage, temporal attention is combined with a decoder network to select relevant hidden states across all time steps, which effectively captures long-term dependence information of the target sequence. GeoMAN [18] introduced a multi-level attention mechanism to model dynamic spatio-temporal dependencies. DSTP-RNN [19] enhanced the spatial correlation between exogenous sequences through a multi-level attention mechanism.
Although DA-RNN and similar networks [17]-[19] are good at extracting hidden information and obtaining long-term dependence relationships, they cannot mine short-term mutation information. In these networks, the attention weights assigned to such segments of the driving sequence are small, so short-term mutation information receives little attention.
In order to effectively capture short-term mutation information, the Parametric Sigmoid Norm based CNN (PSNET) [20] was proposed, which separates the centroids of samples of different complexity by translating and scaling the Sigmoid function in order to extract relevant input sequences. However, it cannot solve the over-saturation problem of the Sigmoid function. Maxout [21], which fits convex activation functions with piecewise linear functions, extracts multi-dimensional feature information and improves the ability to capture short-term mutation information. However, it is difficult to use widely because of its large number of parameters. TG-LSTM [22], which uses the hyperbolic tangent function at the input-gate and in the cell state renewal and the anti-hyperbolic tangent function at the forget-gate of LSTM, can effectively capture short-term mutation information. However, important information in the data tends to be lost because of the decreasing value of the hyperbolic tangent function. In addition, the hyperbolic tangent function is used repeatedly in the cell state renewal, which is not conducive to capturing long-term dependence, so long-term information is actively discarded.
On the whole, networks based on the temporal attention mechanism can capture long-term dependence information but have difficulty capturing short-term mutation information. Improvements to the activation function of the neural network can capture short-term mutation information but ignore long-term dependence information. Therefore, finding a balance between the two mechanisms is a challenge.
The Dual-Staged Attention Based Conversion-gated Long Short Term Memory (DA-CG-LSTM) is proposed in this paper to achieve this balance. The anti-hyperbolic tangent and hyperbolic tangent functions are introduced into the forget-gate and input-gate of LSTM respectively, in order to capture short-term mutation information and delete redundant information. Moreover, dual-staged attention is added to handle long-term dependence information. In the first stage, input attention adaptively allocates weights to the feature and temporal dimensions of the time series, and hidden states are obtained by feeding the weighted series into the conversion-gated Long Short Term Memory (CG-LSTM). In the second stage, temporal attention is used to allocate weights across all time steps. With the new activation functions and the attention mechanism, DA-CG-LSTM can balance long-term dependence and short-term mutation information. In order to verify its performance, DA-CG-LSTM, DA-RNN, TG-LSTM and other models were tested on the Air Quality, MITV traffic flow and Appliances Energy datasets. Prediction accuracy is significantly improved. Moreover, compared with DA-RNN, the number of parameters of DA-CG-LSTM is reduced by 50%.

A. MULTIVARIATE TIME SERIES PREDICTION PROBLEM
Given an n-dimensional input series with a window of size T, the row vector $x_t = (x_t^1, x_t^2, \ldots, x_t^n) \in \mathbb{R}^n$ represents the values of the input sequence in its n dimensions at time t.
In general, given the previous values $(y_{t-T+1}, \ldots, y_{t-1})$, $y_t \in \mathbb{R}$, of the target series and the values $(x_{t-T+1}, x_{t-T+2}, \ldots, x_t)$, $x_t \in \mathbb{R}^n$, of the input sequence at the current and past moments, the time series prediction model aims to learn a nonlinear mapping to the current target value:
$$\hat{y}_t = F(y_{t-T+1}, \ldots, y_{t-1}, x_{t-T+1}, \ldots, x_t) \tag{1}$$
where F(·) is the nonlinear mapping function to be learned.
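To make the inputs of the mapping F(·) concrete, the following minimal sketch builds sliding-window samples from an exogenous series and a target series. The function name `make_windows` and the toy data are illustrative assumptions, not part of the original work; only the window layout (T exogenous rows, T−1 past targets, one current target) follows the problem statement above.

```python
from typing import List, Sequence, Tuple

Window = Tuple[List[List[float]], List[float], float]


def make_windows(x: Sequence[Sequence[float]], y: Sequence[float], T: int) -> List[Window]:
    """Build (x_window, y_history, target) samples (hypothetical helper).

    x_window  : the T exogenous rows x_{t-T+1}, ..., x_t
    y_history : the T-1 past targets y_{t-T+1}, ..., y_{t-1}
    target    : the current target y_t that F(.) should predict
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of time steps")
    if T < 2:
        raise ValueError("T must be at least 2 so that a target history exists")
    samples: List[Window] = []
    for t in range(T - 1, len(y)):  # t indexes the value to predict
        start = t - T + 1
        x_window = [list(row) for row in x[start:t + 1]]
        y_history = list(y[start:t])
        samples.append((x_window, y_history, y[t]))
    return samples


# Toy series: 6 time steps, n = 2 exogenous features (illustrative values only).
toy_x = [[float(i), 10.0 * i] for i in range(6)]
toy_y = [0.5 * i for i in range(6)]
toy_samples = make_windows(toy_x, toy_y, T=3)
print(len(toy_samples), "samples; first target =", toy_samples[0][2])
```

Each sample keeps the exogenous window one step longer than the target history, because the current exogenous values are observed while the current target is the quantity being predicted.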

B. MODEL STRUCTURE
In order to capture short-term mutation information and process long-term dependence information in the same model, this paper proposes a dual-staged attention mechanism network based on CG-LSTM. To capture short-term mutation information, CG-LSTM adds the anti-hyperbolic tangent function at the forget-gate and the hyperbolic tangent function at the input-gate. Before CG-LSTM converges, multiple backpropagation passes are needed to learn the parameters that minimize the loss function, and the partial derivative of the conversion gate makes the value range of the gradient data stream change constantly. In other words, short-term mutation information corresponds to the interval in which the range varies most significantly, so the mutation pattern is captured well. Temporal attention enhances the capability of processing long-term dependence information [17]. A dual-staged attention mechanism, which selects hidden states that are highly correlated with the target sequence in the first stage and uses classification information to filter hidden states in the second stage, achieves the best effectiveness [25]. Therefore, this paper introduces a dual-staged attention mechanism. The structural framework of DA-CG-LSTM is shown in Figure 1. In the first stage, before the input sequences enter CG-LSTM to capture short-term mutation information, input attention adaptively allocates weights to the feature and temporal dimensions of the time series; the weighted sequences then enter CG-LSTM to produce hidden states, which serve as local driving sequences. In the second stage, temporal attention uses the hidden states and cell states of CG-LSTM as classification information to adaptively assign weights to the local driving sequences across all time steps, in order to achieve the best prediction effect. Finally, DA-CG-LSTM predicts the output value $\hat{y}$, where $h_t$ and $c_t$ denote the hidden states and cell states of CG-LSTM respectively.
The backpropagation function of DA-CG-LSTM proposed in this paper is smooth and differentiable, so root mean square error is used as the error function for model training. The local driving sequences obtained by CG-LSTM adaptively select the relevant input sequences carrying short-term mutation information, while the temporal attention mechanism captures long-term dependence information.

C. CONVERSION-GATED LSTM (CG-LSTM)
1. The input-gate formula of LSTM can be expressed as:
$$i_t = \sigma(W_i [h_{t-1}; x_t] + b_i) \tag{2}$$
Considering that the hyperbolic tangent function can alleviate the over-saturation issue of the Sigmoid function [22], CG-LSTM applies the hyperbolic tangent function at the input-gate, as given in Equation 3. Before the conversion-gated LSTM converges, multiple backpropagation passes are carried out to learn the parameters that minimize the loss function, and the partial derivative of the conversion gate makes the value range of the gradient data stream change constantly. Short-term mutation information then corresponds to the interval in which the range changes most significantly, so the mutation law is captured well. After passing through the input-gate of the conversion-gated LSTM, the retained information stream is more scattered, which enables the network to consider more mutation information during iteration.
In the cell state renewal formula, TG-LSTM applies hyperbolic tangent processing to the input-gate, which makes the value range of the input-gate smaller and may cause the network to focus more on capturing long-term information and to delete redundant long-term information. At the expense of some current input information, long-term dependence and short-term mutation information are maximized.
2. The forget-gate output formula of LSTM is:
$$f_t = \sigma(W_f [h_{t-1}; x_t] + b_f) \tag{4}$$
Due to the saturation of the Sigmoid activation function, it cannot capture short-term mutation information effectively. TG-LSTM [22] disperses the data flow by introducing the anti-hyperbolic tangent function, exploiting the characteristics of its partial derivative, so as to capture short-term mutation information. The forget-gate formula of TG-LSTM is:
$$f_t = 1 - \tanh(\sigma(W_f [h_{t-1}; x_t] + b_f)) \tag{5}$$
According to the graph of 1 − tanh(·), when the independent variable lies in [0, 1] the dependent variable does not saturate [12], from which it is inferred that applying the anti-hyperbolic tangent operation at the forget-gate can solve the saturation problem of the Sigmoid function. However, the slope of this forget-gate formula has the form $1 - \tanh^2(\cdot)$.
Since the value range of this derivative is [0.25, 1], the data flow dispersion effect is not obvious. This paper therefore introduces an improved form of the forget-gate, given in Equation 6. The value range of $f_t$ is (0, 1), and it follows the same changing trend as σ(·), which solves the issue that TG-LSTM reduces the main information while retaining the advantage of TG-LSTM in capturing short-term mutation information. The slope of the CG-LSTM forget-gate is a function of σ(·) ∈ (0, 1): as σ(·) decreases, its value first increases and then decreases. As shown in Figure 2, when the domain interval is (0, 1], the value range is (0, 2.89]. Multiple backpropagation passes are required to learn the parameters that minimize the loss function before CG-LSTM converges, and the partial derivative of the conversion gate makes the value range of the gradient data stream change continuously. In other words, short-term mutation information corresponds to the interval in which the range changes most significantly, so the mutation law is captured well. Moreover, the data retained after the forget-gate of CG-LSTM is more dispersed, which leads DA-CG-LSTM to consider more mutation information during iteration; this is more effective than TG-LSTM and addresses the over-saturation issue of the Sigmoid function. In addition, experiments in this paper show that the conversion mechanism enhances the capability of capturing short-term mutation information compared with LSTM. Moreover, when σ(·) = 0.5, the output of the forget-gate is 0.005, which means that the forget-gate discards more redundant information and can make better use of the short-term mutation relationship between information.
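As a numerical illustration of the remark that the TG-LSTM forget-gate does not follow the changing trend of σ(·), the sketch below evaluates a plain Sigmoid gate next to a gate of the form 1 − tanh(σ(z)). That form is an assumption inferred from the discussion of the graph of 1 − tanh(·) on [0, 1]; the CG-LSTM gate of Equation 6 is not reproduced here, and the function names are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic Sigmoid, the standard LSTM gate activation."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_forget_gate(z: float) -> float:
    """Standard LSTM forget-gate value for pre-activation z."""
    return sigmoid(z)


def tg_style_forget_gate(z: float) -> float:
    """Assumed TG-LSTM-style gate: 1 - tanh applied to the Sigmoid output."""
    return 1.0 - math.tanh(sigmoid(z))


# The Sigmoid gate rises with z, while the 1 - tanh(sigmoid) gate falls with z
# and stays between 1 - tanh(1) and 1 because the Sigmoid output lies in (0, 1).
for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"z={z:+.1f}  sigmoid={lstm_forget_gate(z):.4f}  "
          f"1-tanh(sigmoid)={tg_style_forget_gate(z):.4f}")
```

Under this assumption, a large Sigmoid activation (information the standard gate would keep) maps to a small gate value, which matches the text's concern that main information is reduced.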
In this paper, gate mechanism of LSTM [10] is improved, and structure of CG-LSTM is shown in Figure 3.
CG-LSTM updates the cell state $c_t$ and hidden state $h_t$ using the forget-gate $f_t$, output-gate $o_t$ and input-gate $i_t$, as shown in Equations 7-11. Here, σ(·) stands for the Sigmoid activation function, and $x_t \in \mathbb{R}^n$ and $h_{t-1} \in \mathbb{R}^m$ represent the current input sequence and the previous hidden state.
In this paper, DA-CG-LSTM is proposed on the basis of CG-LSTM. The purpose is to achieve a balance between capturing long-term dependence and short-term mutation information. Because of the anti-hyperbolic tangent processing at the forget-gate, information is more dispersed after passing through it, which solves the problem that the captured information is relatively concentrated due to the saturation of the Sigmoid function, and retains historical information. In addition, the hyperbolic tangent applied at the input-gate makes short-term mutation information easier to capture. However, the improved forget-gate of CG-LSTM discards some information. Therefore, an attention mechanism is added to select the input sequence adaptively before it enters CG-LSTM. The input sequence then retains more effective information, so that more invalid information can be discarded when the data passes through the forget-gate. The weighted input sequence is processed by CG-LSTM to obtain the local driving sequence, and the temporal attention mechanism is used to address the long-term dependence issue of CG-LSTM.

D. DUAL-STAGED ATTENTION MECHANISM
1) INPUT ATTENTION MECHANISM
In the first stage, single-layer attention can only extract information along the time axis of the input sequence. However, the time axis and the feature axis of a time series influence the target variable differently, so this paper introduces double-layer attention.
Given the time series X, the local driving sequence $x_t$ is mapped to $h_t$ by a nonlinear function f(·), where $h_t \in \mathbb{R}^m$ is the hidden state of the conversion-gated LSTM at time t and o(·) represents the weight assignment function.
Firstly, input attention is used to obtain the weighted feature dimension of the input sequence, as shown in Figure 4. Here, $v_e^{\top} \in \mathbb{R}^n$ and $W_e \in \mathbb{R}^{m \times m}$ are parameters to learn, and $B_e \in \mathbb{R}^n$ is the bias term. $\alpha_t^k$ represents the attention weight of the k-th feature dimension at time t. The Softmax function is used to ensure that the attention weights sum to 1. A new input sequence $\tilde{x}$ is obtained from these attention weights.
Then the weighted input sequence is obtained by attention over the temporal dimension, as shown in Figure 5. Here, $v_s^{\top} \in \mathbb{R}^n$ and $W_s \in \mathbb{R}^{T \times T}$ are parameters to learn, $B_s \in \mathbb{R}^T$ is the bias term, and $\beta_t^k$ has the same meaning as $\alpha_t^k$; the weighted input sequence is then formed from these weights.
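The two weighting passes described above can be sketched as follows. This is only an illustration of the Softmax normalisation and re-weighting steps: the attention scores are passed in as plain numbers rather than computed from the learned parameters, and the choice to normalise the second pass over the T time steps of each feature is an assumption. The function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]  # T rows (time steps) by n columns (features)


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable Softmax; the returned weights sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weight_features(x: Matrix, feature_scores: Matrix) -> Matrix:
    """First pass: Softmax over the n features at each time step, then rescale."""
    weighted = []
    for row, scores in zip(x, feature_scores):
        alpha = softmax(scores)  # plays the role of alpha_t^k for fixed t
        weighted.append([a * v for a, v in zip(alpha, row)])
    return weighted


def weight_time_steps(x: Matrix, time_scores: Matrix) -> Matrix:
    """Second pass (assumed axis): Softmax over the T time steps of each feature."""
    n_steps, n_features = len(x), len(x[0])
    out = [[0.0] * n_features for _ in range(n_steps)]
    for k in range(n_features):
        beta = softmax([time_scores[t][k] for t in range(n_steps)])
        for t in range(n_steps):
            out[t][k] = beta[t] * x[t][k]
    return out


# Placeholder scores: in the model they would come from the attention layers.
window = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
uniform_scores = [[0.0, 0.0] for _ in window]
x_tilde = weight_features(window, uniform_scores)
x_weighted = weight_time_steps(x_tilde, uniform_scores)
print(x_tilde)
```

With all scores equal, every feature receives the same weight, so the sketch reduces to a uniform rescaling; unequal scores shift weight toward the features or time steps with larger scores.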

2) TEMPORAL ATTENTION MECHANISM
In order to capture short-term mutation information, this paper uses CG-LSTM to analyze multiple input sequences. However, as the length of the input sequence increases, the performance of capturing long-term dependence information deteriorates. To address this issue, temporal attention is used to adaptively select hidden states across all time steps. Specifically, the attention over the local driving sequences at time t is computed from the hidden state $d_{t-1} \in \mathbb{R}^l$ and cell state $c_{t-1} \in \mathbb{R}^l$ of CG-LSTM. Here, $[d_{t-1}; c_{t-1}] \in \mathbb{R}^{2l}$ is the concatenation of the hidden state and cell state of the conversion-gated LSTM; $z_{op}^{\top} \in \mathbb{R}^m$, $U_{op} \in \mathbb{R}^{m \times m}$ and $W_{op} \in \mathbb{R}^{m \times 2l}$ are parameters to learn, and $b_{op} \in \mathbb{R}^m$ is the bias term. $\eta_t^i$ represents the attention weight of the i-th driving sequence, from which the temporal attention information $c_t$ is obtained. In the update operations for the hidden state and cell state of the conversion-gated LSTM, the weights are parameters to learn, and $[c_{t-1}; y_{t-1}] \in \mathbb{R}^{m+1}$ represents the information selected by the attention mechanism together with the output at the previous moment.
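A common way to turn temporal attention weights into attention information is a weighted sum of the hidden states over all time steps; the sketch below assumes that form, which the text does not spell out. The scores are supplied directly instead of being computed from the learned parameters, and `temporal_context` is a hypothetical name.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable Softmax; the returned weights sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_context(hidden_states: List[List[float]],
                     scores: List[float]) -> Tuple[List[float], List[float]]:
    """Return (weights, context), where context = sum_i weights[i] * hidden_states[i].

    hidden_states holds one m-dimensional vector per time step (the local
    driving sequences); scores holds one unnormalised score per time step.
    """
    if len(hidden_states) != len(scores):
        raise ValueError("one score is required per time step")
    eta = softmax(scores)
    m = len(hidden_states[0])
    context = [sum(eta[i] * hidden_states[i][j] for i in range(len(eta)))
               for j in range(m)]
    return eta, context


# Illustrative hidden states for T = 3 time steps and m = 2 units.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = temporal_context(states, [0.0, 0.0, 0.0])
print(weights, context)
```

When one score dominates, the context vector approaches the hidden state of that time step, which is how the mechanism can select a distant but relevant state.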

E. TRAINING PROCESS
In this paper, the Adam optimizer is used to train the model. The backpropagation function of DA-CG-LSTM is smooth and differentiable, so RMSE is chosen as the objective function and the parameters are learned through backpropagation:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(\hat{y}_T^i - y_T^i\right)^2} \tag{27}$$
where N represents the number of training samples.
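The combination of an RMSE objective with Adam updates can be illustrated on a deliberately tiny model. In the sketch below a one-variable linear model stands in for DA-CG-LSTM, the gradient is written by hand instead of obtained by backpropagation through the network, and the learning rate and step count are arbitrary choices; only the RMSE objective and the standard Adam update rule are taken as given.

```python
import math
from typing import List, Tuple


def rmse_and_grad(xs: List[float], ys: List[float],
                  w: float, b: float) -> Tuple[float, float, float]:
    """RMSE of the toy model y_hat = w * x + b and its gradient w.r.t. (w, b)."""
    n = len(xs)
    errs = [w * x + b - y for x, y in zip(xs, ys)]
    loss = math.sqrt(sum(e * e for e in errs) / n)
    # d sqrt(u)/du = 1 / (2 sqrt(u)); the factor 2 cancels with d(e^2)/de.
    denom = max(loss, 1e-12) * n
    grad_w = sum(e * x for e, x in zip(errs, xs)) / denom
    grad_b = sum(errs) / denom
    return loss, grad_w, grad_b


def adam_fit(xs: List[float], ys: List[float], steps: int = 500, lr: float = 0.05,
             beta1: float = 0.9, beta2: float = 0.999,
             eps: float = 1e-8) -> Tuple[List[float], List[float]]:
    """Minimise RMSE with Adam; returns the parameters [w, b] and the loss history."""
    theta = [0.0, 0.0]
    m = [0.0, 0.0]  # first-moment estimates
    v = [0.0, 0.0]  # second-moment estimates
    history: List[float] = []
    for t in range(1, steps + 1):
        loss, grad_w, grad_b = rmse_and_grad(xs, ys, theta[0], theta[1])
        history.append(loss)
        for i, g in enumerate((grad_w, grad_b)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g
            v[i] = beta2 * v[i] + (1.0 - beta2) * g * g
            m_hat = m[i] / (1.0 - beta1 ** t)  # bias correction
            v_hat = v[i] / (1.0 - beta2 ** t)
            theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, history


# Toy data generated from y = 2x + 1 (illustrative only).
toy_xs = [0.0, 1.0, 2.0, 3.0, 4.0]
toy_ys = [1.0, 3.0, 5.0, 7.0, 9.0]
params, losses = adam_fit(toy_xs, toy_ys)
print(f"initial RMSE {losses[0]:.3f}, final RMSE {losses[-1]:.3f}")
```

Because the gradient of RMSE does not shrink to zero near the optimum the way the gradient of squared error does, the loss in this toy run settles into small oscillations rather than converging exactly.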

III. EXPERIMENT
A. DATA SETS
The Air Quality dataset [23] was collected from an array of five metal oxide chemical sensors embedded in an air quality chemical multi-sensor device and contains 9,358 time series instances. The device, located in a heavily polluted area of an Italian city, produced freely available on-site air quality sensor recordings from February 2004 to March 2005 on a highway. Ground truth hourly averaged concentrations of CO, non-methane hydrocarbons, benzene, total nitrogen oxides (NOx) and nitrogen dioxide (NO2) were provided by a reference certified analyzer at the same location. Missing values are marked with -200. In this experiment, the benzene content was taken as the target sequence, and 13 related exogenous sequences were selected. Considering that the degree of air pollution may be related to time, time was included among the selected exogenous sequences. The first 5979 instances were used as the training set, the next 1495 as the validation set, and the final 1869 as the test set.
The Metro Interstate Traffic Volume (MITV) dataset [29] contains 48,203 time series instances of the hourly westbound traffic volume of Interstate 94 at station 301, between Minneapolis and St. Paul. The traffic volume was recorded from October 2012 to September 2018, together with variables such as weather, time, holiday indicator and temperature. In this experiment, traffic volume was taken as the target sequence and 8 related exogenous sequences were selected. The first 30,840 instances were used as the training set, the next 7711 as the validation set, and the final 9638 as the test set.
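The splits above are chronological: the earliest instances train the model and the latest ones test it. A minimal sketch of such a split, using the Air Quality and MITV sizes quoted above, is shown below; the helper name `chronological_split` is hypothetical, and integer indices stand in for the actual records.

```python
from typing import List, Sequence, Tuple, TypeVar

Item = TypeVar("Item")


def chronological_split(series: Sequence[Item], n_train: int, n_val: int,
                        n_test: int) -> Tuple[List[Item], List[Item], List[Item]]:
    """Split a time-ordered series into consecutive train, validation and test parts."""
    needed = n_train + n_val + n_test
    if needed > len(series):
        raise ValueError(f"split needs {needed} instances but only {len(series)} exist")
    train = list(series[:n_train])
    val = list(series[n_train:n_train + n_val])
    test = list(series[n_train + n_val:needed])
    return train, val, test


# Integer indices stand in for the records of each dataset.
air_train, air_val, air_test = chronological_split(range(9358), 5979, 1495, 1869)
mitv_train, mitv_val, mitv_test = chronological_split(range(48203), 30840, 7711, 9638)
print(len(air_train), len(air_val), len(air_test))
print(len(mitv_train), len(mitv_val), len(mitv_test))
```

No shuffling is applied, so the validation and test parts always lie strictly after the training part in time.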
The Appliances Energy dataset [26] contains 19,735 records of electricity consumption for a particular house, sampled every 10 minutes over a period of 4.5 months. The collectors used a ZigBee wireless sensor network to monitor the temperature and humidity around and inside the house and used these measurements as features for the electricity consumption data. In total, the dataset has 28 features. In this experiment, the first 12611 instances were used as the training set, the next 3177 as the validation set and the final 3947 as the test set.
In order to measure the effectiveness of the various time series prediction methods, three evaluation indicators are adopted in this paper: root mean square error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE). Specifically, for the target value $y_T$ and the predicted value $\hat{y}_T$, the formula of the root mean square error is shown in Equation 27. The mean absolute error is defined as:
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{y}_T^i - y_T^i \right|$$
When comparing prediction performance across different data sets, the mean absolute percentage error is commonly used because it measures the prediction deviation as a proportion of the ground truth value:
$$\mathrm{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{\hat{y}_T^i - y_T^i}{y_T^i} \right| \times 100\%$$
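The three indicators can be computed directly from their standard definitions, as in the following sketch. The function names and the toy values are illustrative; MAPE is reported in percent and is undefined when a ground truth value is zero, which the sketch rejects explicitly.

```python
import math
from typing import Sequence


def rmse(targets: Sequence[float], preds: Sequence[float]) -> float:
    """Root mean square error."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(targets, preds)) / len(targets))


def mae(targets: Sequence[float], preds: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(p - t) for t, p in zip(targets, preds)) / len(targets)


def mape(targets: Sequence[float], preds: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent."""
    if any(t == 0 for t in targets):
        raise ValueError("MAPE is undefined when a target value is zero")
    return 100.0 * sum(abs((p - t) / t) for t, p in zip(targets, preds)) / len(targets)


# Toy values for illustration only.
y_true = [2.0, 4.0, 8.0]
y_pred = [1.0, 5.0, 10.0]
print(f"RMSE={rmse(y_true, y_pred):.4f}  MAE={mae(y_true, y_pred):.4f}  "
      f"MAPE={mape(y_true, y_pred):.2f}%")
```

RMSE and MAE are expressed in the units of the target, so they are not comparable across datasets with different scales, whereas MAPE is scale-free.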

C. RESULTS
In this paper, DA-CG-LSTM, CG-LSTM and other time series prediction models are compared on the Air Quality, MITV and Appliances Energy datasets.
(3) TG-LSTM [22] can effectively capture short-term mutation information.
(4) ATT-LSTM [15] combines the attention mechanism with LSTM to address the issue of long-term dependence.
(5) DA-RNN [17] introduces a dual-staged attention mechanism combined with LSTM for time series prediction and achieves good results.
(6) DA-LSTM [16] adds double-layer salient attention at the input stage of LSTM to improve efficiency and performance, which addresses the long-term dependence problem.
(7) CG-LSTM improves the gate mechanism of LSTM.
(8) FA+CG-LSTM, the first stage of the attention mechanism, uses two-dimensional attention in CG-LSTM.
(9) SA-CG-LSTM, the second stage of the attention mechanism, adaptively selects among all hidden states through temporal attention.
All models were compared in the two cases T = 15 and T = 30. The resulting evaluations are shown in Table 1 and Table 2. DA-CG-LSTM clearly performs best in both datasets. Moreover, CG-LSTM predicts better than TG-LSTM and LSTM: its prediction accuracy is 5% higher than that of TG-LSTM and 9% higher than that of LSTM. Comparing the performance of the different models on these data sets shows that, although LSTM has a certain ability to process long-term dependence, its performance is poor when the time step increases. DA-RNN has better prediction performance than the other baseline models on MITV and Appliance Energy. However, the prediction performance of DA-RNN is not optimal on Air Quality. It can be inferred that Air Quality is highly unstable and requires capturing short-term mutation information. Although Zoneout enhances the long-term dependence learning ability of LSTM, it does not consider in advance the different effects of the multivariate series on the target sequence, and ignores the complex associations among the variables.
ATT-LSTM was originally applied to natural language processing; its performance is significantly weaker than that of many deep learning networks because only single-level salient attention is used to calculate the scores of the multivariate series. DA-LSTM, an improvement on ATT-LSTM, adds temporal attention at the output stage. The input sequence is screened through two-layer attention, yielding the exogenous sequences that are more relevant to the target sequence, which to a certain extent solves the lack of ability to capture short-term mutation information. However, because it uses LSTM, its performance is poor as the time step increases. It performs well on Air Quality, which requires better use of short-term mutation information, but it has difficulty performing well on MITV and Appliance Energy, which require the use of long-term dependence relationships. DA-CG-LSTM is built on DA-RNN and TG-LSTM and inherits their advantages. It can be seen from Figure 6(a) that predicting Air Quality relies more on capturing short-term dependencies, which is why TG-LSTM performs better than DA-RNN there. Long-term dependence is needed in forecasting MITV and Appliance Energy: as shown in Figure 6(b) and Figure 6(c), the relative performance of DA-RNN and TG-LSTM is reversed, and DA-CG-LSTM is better than both baseline models in all cases.
As the time step increases, prediction performance degrades. When the time step changes from 15 to 30, the performance of DA-CG-LSTM decreases by 15%, 17.1% and 16.4% on the Air Quality, MITV and Appliance Energy data sets respectively, which is smaller than the decrease of the other comparison models. For example, the performance of DA-RNN decreases by 31.3%, 23.2% and 27.3% respectively. Therefore, DA-CG-LSTM has better robustness and wider applicability.
The training and validation losses of each network at each iteration are shown in Figure 7(a), Figure 7(b) and Figure 7(c). On Air Quality, the training and validation losses of DA-CG-LSTM are better than those of all baseline models at each iteration. DA-CG-LSTM essentially reaches a stable stage after about 25 iterations, converging faster than the other models. Judging from the convergence of the curve, it still has the potential to improve. On MITV, DA-CG-LSTM does not show a large advantage in training loss. However, once the number of iterations exceeds 150, it shows an advantage over the other models. Judging from the convergence of the curve, it can continue to improve in subsequent iterations. From the above analysis, DA-CG-LSTM has better generalization performance. On Appliance Energy, DA-CG-LSTM does not outperform the other baseline models on the training set. However, it outperforms them by a large margin in test set loss, which indicates that the model generalizes well.

D. PARAMETER SENSITIVITY
This paper explores the sensitivity of the networks to their parameters, that is, the impact of the selected parameters on prediction performance. Specifically, the impact of a parameter is obtained by changing only that parameter while the other parameters remain unchanged. With the number of hidden units m of CG-LSTM held fixed and batch_size = 128, the time step T is varied; the effect of different time steps on the networks is shown in Figure 8(a). With T = 15 and m = 30, batch_size is varied to obtain the prediction effect of different batch sizes, shown in Figure 8(b). Finally, the number m of network units is varied, with the result shown in Figure 8(c). As each parameter changes, the prediction effect of DA-CG-LSTM follows a concave curve, and the best effect is reached at T = 15, m = 30, batch_size = 128. The performance of DA-CG-LSTM changes little as the parameters change, so it is more robust than the other models. Therefore, it is well suited to multivariate sequence forecasting and can be widely applied.
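The one-parameter-at-a-time procedure described above can be sketched as a small configuration generator. The default values T = 15, m = 30 and batch_size = 128 are taken from the text; the candidate values in the grid and the function name are hypothetical placeholders, since the swept ranges are only shown in Figure 8.

```python
from typing import Dict, List, Tuple

Config = Dict[str, int]


def one_at_a_time(defaults: Config, grid: Dict[str, List[int]]) -> List[Tuple[str, Config]]:
    """Return (varied_parameter, config) pairs; each config changes one parameter only."""
    runs: List[Tuple[str, Config]] = []
    for name, values in grid.items():
        if name not in defaults:
            raise KeyError(f"unknown parameter: {name}")
        for value in values:
            config = dict(defaults)  # copy so the defaults stay untouched
            config[name] = value
            runs.append((name, config))
    return runs


defaults = {"T": 15, "m": 30, "batch_size": 128}
# Hypothetical candidate values, chosen only to make the sketch runnable.
grid = {
    "T": [5, 10, 15, 20, 30],
    "batch_size": [32, 64, 128, 256],
    "m": [10, 20, 30, 40],
}
for varied, config in one_at_a_time(defaults, grid):
    print(varied, config)
```

Each generated configuration would then be trained and evaluated separately, and the resulting metric plotted against the varied parameter.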

E. TIME COST ANALYSIS
When considering the performance of a model, its time cost also needs to be considered [27], [28]. The model in this paper is built on DA-RNN and CG-LSTM and optimizes the dual-staged attention mechanism of DA-RNN, so the overall number of parameters is reduced by roughly 13% to 15% compared with DA-RNN. With T = 15, m = 30 and batch_size = 128, DA-RNN requires 39004, 37804 and 44524 parameters on the Air Quality, MITV and Appliance Energy datasets respectively, whereas DA-CG-LSTM requires 33209, 32249 and 38729. Therefore, compared with DA-RNN, DA-CG-LSTM requires fewer parameters to achieve better prediction results.
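The relative reduction implied by the parameter counts quoted above can be checked with a few lines of arithmetic. The counts are copied from the text; the variable and function names are illustrative.

```python
from typing import Dict

# Parameter counts quoted in the text for T = 15, m = 30, batch_size = 128.
DA_RNN_PARAMS: Dict[str, int] = {"Air Quality": 39004, "MITV": 37804, "Appliance Energy": 44524}
DA_CG_LSTM_PARAMS: Dict[str, int] = {"Air Quality": 33209, "MITV": 32249, "Appliance Energy": 38729}


def relative_reduction(baseline: int, proposed: int) -> float:
    """Fraction of the baseline parameter count that the proposed model saves."""
    return (baseline - proposed) / baseline


reductions = {
    name: relative_reduction(DA_RNN_PARAMS[name], DA_CG_LSTM_PARAMS[name])
    for name in DA_RNN_PARAMS
}
for name, value in reductions.items():
    print(f"{name}: {value:.1%} fewer parameters than DA-RNN")
```

The reduction is computed per dataset because the parameter count depends on the number of exogenous sequences, which differs between the three datasets.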
To test the computational time consumption of DA-CG-LSTM, experiments were conducted on the Air Quality, MITV and Appliance Energy datasets. The experiments were run on a computer with an Intel Core i7-8700 3.20GHz CPU, 16GB of RAM and a GeForce GTX 1060-Ti 4G GPU. The training sets contained 5979, 30840 and 12611 instances respectively, the validation sets 1495, 7711 and 3177, and the test sets 1869, 9638 and 3947. The time costs of model training and testing are shown in Table 3. For example, on the MITV data set, the training time and testing time of DA-CG-LSTM are both less than those of DA-RNN, but both are more than those of CG-LSTM. This indicates that the dual attention mechanism increases the running time while also improving the ability of the network to extract long-term dependence information.

F. ANALYSIS
DA-CG-LSTM, based on DA-RNN [17] and CG-LSTM, combines the advantages of both. CG-LSTM is better at discarding redundant information and dispersing information. Combined with the attention mechanism, its ability to capture short-term mutation information is enhanced. The temporal attention network of DA-RNN is better at extracting hidden information, with the aim of extracting long-term dependence information. Replacing the single-layer attention at the input stage of DA-RNN, which extracts information only from the feature dimensions, with double-layer attention effectively reduces the impact of different dimensions and time steps on the target sequence. Therefore, in this section, different parameters are used to discuss and analyze DA-RNN, CG-LSTM and DA-CG-LSTM respectively.
As the time step changes, the prediction performance of the above three networks is shown in Figure 8(a). The performance under changing batch size and number of hidden layer units is shown in Figure 8(b) and Figure 8(c). It can be seen that DA-RNN and CG-LSTM each have their own advantages. The MITV and Appliances Energy datasets require networks that are good at capturing long-term dependencies for training and prediction, while the Air Quality dataset requires networks that are good at capturing short-term mutation information. DA-RNN is better at dealing with long-term dependence, while CG-LSTM has an advantage in capturing short-term mutation information. Therefore, DA-RNN generally performs better than CG-LSTM on MITV and Appliance Energy, and the reverse holds on Air Quality. DA-CG-LSTM combines the advantages of these two models and further improves prediction performance.

IV. CONCLUSION
Aiming to capture long-term dependence and short-term mutation information in the same model, this paper proposes a dual-staged attention mechanism based on CG-LSTM. Introducing the hyperbolic tangent function into the forget-gate and input-gate of LSTM improves the ability to capture short-term mutation information. The attention mechanism is combined with CG-LSTM, which adaptively captures short-term mutation information, and temporal attention is used to extract long-term dependence information. DA-CG-LSTM, based on dual-staged attention, can therefore capture both short-term mutation information and long-term dependence information. Experiments on MITV, Air Quality and Appliance Energy show that DA-CG-LSTM outperforms several strong time series prediction networks, with lower regression errors, more accurate predictions and better robustness, and that it can be applied to a wider range of multivariate time series analysis problems.
SHUFANG FENG received the B.S. degree in electrical engineering and automation from the Changzhou Institute of Technology, China, in 2019. He is currently pursuing the master's degree in control science and engineering with the School of Information Science and Engineering, East China University of Science and Technology, Shanghai, China. His research interests include chemical process modeling, time series modeling, and fault diagnosis.
YONG FENG received the B.S. degree in automation from the East China University of Science and Technology, Shanghai, China, in 2019, where he is currently pursuing the master's degree in control science and engineering with the School of Information Science and Engineering. His research interests include time series forecasting and chemical process modeling. VOLUME 10, 2022