Grid Load Forecasting Based on Dual Attention BiGRU and DILATE Loss Function

To learn the deep relationships implicit in load data and improve the accuracy of load prediction, this paper presents a new Seq2seq framework based on dual attention and a bidirectional gated recurrent unit (BiGRU). The authors also introduce the Distortion Loss including Shape and Time (DILATE) loss function. Firstly, a dual attention mechanism is added to the Seq2seq architecture. The first layer of the attention mechanism enables the encoder to output multiple intermediate states, making the decoder more targeted when calculating the predicted values at different times. The second layer is a self-attention mechanism, which reduces the possibility of errors accumulating during long-time-span prediction by calculating the internal correlation of the decoder output sequence. Secondly, the DILATE loss function is introduced to mitigate the prediction lag problem caused by the mean square error (MSE) loss function. Finally, the proposed model is tested using power load data from a region in northern China. The simulation results show that the method proposed in this paper achieves a better prediction effect than LSTM, GRU and LS-SVR.


I. INTRODUCTION
As an important part of energy management, power load forecasting plays an important role in the power system [1,2]. Accurate load forecasting makes it possible to economically and reasonably schedule the start and stop of generators inside the power grid, maintain the safety and stability of power grid operation, and reduce unnecessary spinning reserve capacity. The quality of a power enterprise's load forecasting reflects how modern and scientific its management is, which is why load forecasting is essential to power system theoretical research.
Most early power load forecasting methods were based on mathematical statistics. Models of this type [3][4][5] place strict requirements on the stationarity of the original time series. When grid conditions are normal and the climate and other factors are fairly stable, the prediction results are good. However, when random factors vary greatly or bad data interferes, the prediction results are not ideal [6,7].
With the widespread installation of data sensors in power systems, neural network algorithms supported by big data are widely used in power systems [8][9][10]. Among them, the recurrent neural network (RNN), designed for time series research, is the most widely used in load forecasting. In [11], a load prediction method is proposed based on isolation forest (iForest) and the long short-term memory (LSTM) [12] network. In [13], a load hybrid prediction method is introduced, combining convolutional neural networks (CNN) and LSTM. In [14], the Seq2seq [15] architecture is used to predict short-term loads. In [16], the authors propose an LSTM-based Seq2seq-LSTM load prediction model. However, for RNN-based prediction models, whether using LSTM or the Gated Recurrent Unit (GRU) [17], setting too long a sequence length may affect stability [18]. In [19], the effective context size of LSTM-based language models is found to be around 200 tokens on average, but only the nearest 50 tokens can be clearly distinguished, suggesting that even LSTM struggles to capture long-term dependencies. Clearly, long-term dependencies in time series data are especially important for data prediction.
The attention mechanism [20] allows the model to access any part of the historical sequence, making it more suitable for grasping the dependencies of long-term sequences. In [21], an interpretable self-attention mechanism is implemented to learn the dependencies of long-term series data, which makes it possible to predict sales data. The authors in [22] introduced an attention mechanism to predict ultra-short-term power generation and gave different weights to the hidden states of a bidirectional long short-term memory (BiLSTM) network, so as to selectively obtain more effective information. In [23], a dual attention mechanism over the feature space and the time series is built on LSTM, which enabled the creation of an interpretable offshore wind power output prediction model. The authors in [24] proposed a two-stage time attention mechanism (TAM) for short-term load forecasting of power grids, which enhanced the model's memory for long-time-series information. The article [25] forecasts three kinds of data, namely cooling, heat and electricity, and designs three attention modules, but only one module is used for power load prediction. The article [26] introduces an attention-based BiGRU model in which convolutional layers act as feature extractors, explicitly modeling channel features with SENet, a module commonly used in image recognition. Although the above articles add attention mechanisms to their models, those mechanisms capture only the relationships within the input sequence or between the input and output sequences; the output sequence itself also contains information, which needs to be fully mined to improve prediction accuracy.
The loss functions for training and evaluating models in deep learning are generally the mean absolute error (MAE), mean square error (MSE), and root mean square error (RMSE). Most loss functions are calculated by comparing pairs of time series points and their predicted values, thereby measuring the "vertical" Euclidean distance between the two series. However, when one sequence is misaligned in time relative to another, a "horizontal" or "time" distance becomes necessary; otherwise, a certain time lag will result. Therefore, a method is needed that allows elastic movement of the timeline to accommodate similar but out-of-phase sequences. DILATE (Distortion Loss including Shape and Time) [27] is a loss function that integrates shape error and temporal error. By taking shape error and temporal distortion as loss terms, temporal offset and distortion can be penalized more effectively.
Combined with the DILATE loss function, this paper proposes a Seq2seq architecture based on a dual attention mechanism and the Bidirectional Gated Recurrent Unit (BiGRU). The contributions of this research are: 1) A dual attention mechanism is applied. The first layer of the attention mechanism considers the relationship between input and output sequences, strengthening the weight of key data that affects load prediction and weakening the weight of data with low correlation; the weighted data provides more information to the decoder. The second layer is a self-attention mechanism, which mines the internal relationships of the hidden states output by the BiGRU in the decoding layer, further improving the accuracy of the prediction results.
2) The DILATE loss function is introduced to solve the lag problem caused by the Mean Squared Error (MSE) loss function in time series forecasting tasks.

II. SEQ2SEQ MODEL BASED ON DUAL ATTENTION MECHANISM
Given the input series $X = [x_1, x_2, x_3, \dots, x_m]$, which represents historical load data, where $m$ is the length of the input data, and the target series $Y = [y_1, y_2, y_3, \dots, y_n]$, which represents the load data to be predicted, where $n$ is the length of the output data, the forecasting model aims to learn a nonlinear mapping function $F(\cdot)$ from $X$ to the target series $Y$:

$$\hat{Y} = F(X)$$

In the model proposed in this paper, a BiGRU neural network is applied in both the encoder and the decoder. An attention module and a self-attention module are embedded in the encoder and decoder, respectively. The model structure is shown in Figure 1.

A. BIGRU
GRU is a simplified version of LSTM that retains similar performance with a simpler structure. GRU merges the forget gate and the input gate into a single update gate and merges the cell state and the hidden state. When calculating the hidden state at the current moment, a candidate state $\tilde{h}_t$ is first computed, and the candidate state takes into account the value of the reset gate $r_t$:

$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$$

$$\tilde{h}_t = \tanh\!\left(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\right)$$

The sigmoid function $\sigma(\cdot)$ limits the gate value to $[0, 1]$. If the reset gate is close to 0, the candidate state $\tilde{h}_t$ ignores the previous hidden state $h_{t-1}$ and is computed from the current input $x_t$ alone; this effectively lets the hidden state discard any information that turns out to be irrelevant later. After the candidate state $\tilde{h}_t$ is calculated, the update gate $z_t$ controls how much information from the previous hidden state is carried over to the current hidden state:

$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$$

Finally, the hidden state $h_t$ at the current moment can be calculated as:

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

The GRU network uses a recurrent structure to store and retrieve information, but it only considers information before the prediction point and cannot take the state at future moments into account; thus, the prediction accuracy cannot be further improved. The bidirectional GRU (BiGRU) network adds a backward layer that processes the data sequence in the opposite direction to overcome this problem. The BiGRU principle is shown in Figure 2: the network uses two hidden layers to extract information from the past and the future, and both are connected to the same output layer. In BiGRU:

$$\overrightarrow{h}_t = \mathrm{GRU}(x_t, \overrightarrow{h}_{t-1}), \qquad \overleftarrow{h}_t = \mathrm{GRU}(x_t, \overleftarrow{h}_{t+1})$$

$$h_t = u_t \overrightarrow{h}_t + v_t \overleftarrow{h}_t + b_t$$

where $\mathrm{GRU}(\cdot)$ represents the nonlinear transformation of the input data, $u_t$ and $v_t$ are the weights corresponding to the forward hidden state $\overrightarrow{h}_t$ and the backward hidden state $\overleftarrow{h}_t$, respectively, and $b_t$ is the bias of the hidden state at time $t$.
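As a minimal NumPy sketch of the GRU update and the bidirectional combination described above (all parameter names and the dictionary layout are illustrative, not the paper's implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, W, U, b):
    """One GRU step; W, U, b each hold the (r)eset, (z) update, and candidate (h) parameters."""
    r_t = sigmoid(W["r"] @ x_t + U["r"] @ h_prev + b["r"])              # reset gate
    z_t = sigmoid(W["z"] @ x_t + U["z"] @ h_prev + b["z"])              # update gate
    h_tilde = np.tanh(W["h"] @ x_t + U["h"] @ (r_t * h_prev) + b["h"])  # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde                         # new hidden state

def bigru(xs, fwd_params, bwd_params, u, v, b_out):
    """Run a forward pass (past -> future) and a backward pass (future -> past),
    then combine the two hidden states at each time step."""
    T, H = len(xs), len(b_out)
    h = np.zeros(H)
    fwd = []
    for t in range(T):
        h = gru_cell(xs[t], h, *fwd_params)
        fwd.append(h)
    h = np.zeros(H)
    bwd = [None] * T
    for t in reversed(range(T)):
        h = gru_cell(xs[t], h, *bwd_params)
        bwd[t] = h
    return [u * fwd[t] + v * bwd[t] + b_out for t in range(T)]
```

Note that when the update gate is forced to 0, the hidden state is carried over unchanged, which is the mechanism that preserves long-term information.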

B. ENCODER WITH ATTENTION MECHANISM
After the input data $X$ is processed by the BiGRU module in the encoder, the sequence of hidden states $H = [h_1, h_2, \dots, h_m]$ is obtained. $f(\cdot)$ is the BiGRU network unit in the decoder, and the output of the decoder's BiGRU module is the hidden state sequence $Y'$. The relationship between $X$ and $Y$ is represented by calculating the degree of attention between $H$ and $Y'$ in the attention module. The calculation process of the attention mechanism is shown in Figure 3.
The softmax function is used to normalize the attention scores $e_{ij}$ according to Equation 10:

$$a_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{m} \exp(e_{ik})}$$
The feature correlation coefficients $a_{ij}$ are multiplied by the corresponding hidden state values $h_j$ to obtain a weighted cell state $C_i$ that accounts for the different feature contribution rates:

$$C_i = \sum_{j=1}^{m} a_{ij} h_j$$
The hidden state sequence $Y'$ is updated according to Equation 9: $f(\cdot)$ is the BiGRU network unit in the decoder, and its input is no longer the original cell state but the weighted data that accounts for the magnitude of each input's influence. Through the attention mechanism, the encoder can adaptively extract the contribution rate of each input datum to improve the prediction accuracy. Algorithm I gives the pseudocode of the attention mechanism in this paper.
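The softmax weighting and the context computation can be sketched as follows. Dot-product scoring is an assumed, illustrative choice here; the paper does not fully specify how the scores $e_{ij}$ are produced:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

def attention_context(H_enc, s_dec):
    """Weight the encoder hidden states H_enc (m x d) by their relevance to one
    decoder state s_dec (d,). Returns the weighted context and the weights."""
    e = H_enc @ s_dec   # e_ij: alignment scores (assumed dot-product scoring)
    a = softmax(e)      # a_ij: normalized attention weights, sum to 1
    C = a @ H_enc       # weighted context C_i = sum_j a_ij * h_j
    return C, a
```

When every encoder state is equally relevant, the weights degenerate to a uniform average; with distinct states, the decoder effectively "selects" the most relevant time steps.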

C. DECODER WITH SELF-ATTENTION MECHANISM
In order to capture the interrelationships within $Y'$, we introduce a self-attention mechanism when designing the decoder. The calculation process of the self-attention mechanism is shown in Figure 4. Because the self-attention mechanism does not employ the recursive structure of a recurrent neural network, the sequence data does not carry any position or order information. This paper adopts the same absolute position embedding (APE) as the transformer to solve this problem. Let $t$ be the position in the input sequence and $d$ the embedding dimension, with $d$ divisible by 2; the positional embedding $p_t$ is:

$$p_t^{(2i)} = \sin\!\left(\frac{t}{10000^{2i/d}}\right), \qquad p_t^{(2i+1)} = \cos\!\left(\frac{t}{10000^{2i/d}}\right)$$

The positional embedding $p_t$ can thus be understood as a vector containing a pair of sine and cosine values for each frequency.
The self-attention layer takes the position-encoded data as input. By capturing the relationships and interactions within the output time series, the model can grasp the dependencies of the time series, further correcting the output of the decoder and improving the prediction ability. The calculation process of the self-attention mechanism is: 1) The position-encoded data $Y'$ is multiplied by the learnable weight matrices $W^Q$, $W^K$, and $W^V$, producing the query $Q$, key $K$, and value $V$, respectively. 2) The similarity between each query and each key is calculated by dot product, scaled by $\sqrt{d_k}$ so that it fits a standard Gaussian distribution, normalized with softmax, and then multiplied by $V$ to obtain the weighted value $Y''$. This scaled dot-product attention is the attention calculation adopted by the transformer:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

3) The position-encoded value $Y'$ and the weighted value $Y''$ are combined and passed to the linear layer to obtain the final predicted result $\hat{Y}$. The self-attention mechanism reduces the dependence on external information, can link information at different positions in the input sequence, and is better at capturing the internal correlation of data or features. Algorithm II gives the pseudocode of the self-attention mechanism.
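The positional embedding and the scaled dot-product self-attention described above can be sketched in NumPy as follows (dimensions and weight shapes are illustrative):

```python
import numpy as np

def positional_encoding(n, d):
    """Transformer-style absolute position embedding for n positions, dimension d (d even)."""
    pos = np.arange(n)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d)
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)  # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions: cosine
    return pe

def self_attention(Y, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a position-encoded sequence Y (n x d)."""
    Q, K, V = Y @ Wq, Y @ Wk, Y @ Wv
    d_k = K.shape[1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of every query to every key
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)  # row-wise softmax
    return w @ V
```

Each output position is a weighted mixture of every position in the sequence, which is what lets the decoder correct one predicted point using all the others.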

III. DILATE LOSS FUNCTION
In the process of training a predictive model, a loss function is needed to update the model parameters. In the past, most time series forecasting used loss functions based on the Euclidean distance, which calculate the loss between the predicted value and the true value through a strict one-to-one mapping of each data point, ignoring information such as the shape and timing of the series. As shown in Figure 5, the dashed box identifies a relatively obvious time dislocation: the model predicted the correct sequence values but not at the correct times. Clearly, a method is needed that allows elastic movement of the timeline to accommodate similar but out-of-phase sequences. The DILATE loss function is a sequence prediction evaluation method that integrates shape error and temporal error; by taking shape error and temporal distortion as loss terms, it minimizes the effects of temporal offset and distortion. As shown in Figure 6, the prediction in Figure 6(b) is better than that in Figure 6(a), yet the Euclidean distance loss values in the two figures are exactly the same, because the Euclidean loss ignores the shape-in-time information contained in the curves; the DILATE loss function captures this information and changes the corresponding loss value. The insensitivity of the MSE loss function to time information can lead to lag problems in time series forecasting. The main purpose of load forecasting is to support power generation planning, equipment maintenance planning, real-time economic dispatch, and long-term power grid planning [28]. A lag causes a time difference between the predicted condition and the actual load condition, which adversely affects the operation of the power grid. The DILATE loss function considers both shape loss and temporal distortion, identifying errors along two dimensions.
Its formula is:

$$\mathcal{L}_{\mathrm{DILATE}}(\hat{y}, y) = \alpha\, \mathcal{L}_{\mathrm{shape}}(\hat{y}, y) + (1 - \alpha)\, \mathcal{L}_{\mathrm{time}}(\hat{y}, y)$$

where $\alpha \in (0, 1)$ balances the weights of the shape loss $\mathcal{L}_{\mathrm{shape}}$ and the temporal loss $\mathcal{L}_{\mathrm{time}}$; $\hat{y}_i$ is the predicted value and $y_i$ is the true value. The shape loss function is based on dynamic time warping (DTW) theory. DTW allows each data point to be mapped one-to-many, so that the time series can be warped in time; the similarity in shape of the two curves is then compared to judge the difference between the series. The error produced by judging the similarity of the two curves' shapes is the shape error. The steps of the DTW algorithm are as follows: Step 1: Construct the cost matrix by computing the pairwise distances between the points of the two sequences. Step 2: Find the optimal cost path. Based on the obtained cost matrix, the optimal path is found by selecting the minimum value at each step, which minimizes the overall cost.
To improve the convergence speed of the algorithm and find the optimal dynamic path, three constraints must be satisfied in the path planning. 1) Boundary conditions: Since the order of the elements in the two compared sequences is fixed, and the elements of the two sequences have a certain correspondence, the dynamic path must begin with the start element $(x_1, y_1)$ and end with the element $(x_n, y_n)$. 2) Monotonicity: When comparing two sequences, the comparison must proceed monotonically in one direction, and repeated comparisons are not allowed.
3) Continuity: Adjacent points on the dynamic programming path must be neighboring cells, and no comparison may skip across points. According to the above steps and corresponding constraints, the Euclidean distances between all points of the two curves are calculated and filled into the corresponding cells to obtain an $n \times n$ cost matrix. The starting point $(x_1, y_1)$ and the ending point $(x_n, y_n)$ of the optimal path are fixed by the boundary conditions. Under the monotonicity constraint, the path progresses from the starting point toward the ending point, and the continuity constraint guarantees that there are no breakpoints. Among all paths so obtained, the one that accumulates the minimum Euclidean distance over its steps is the optimal path. The resulting cost matrix and optimal path are shown in Figure 7.
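The steps and constraints above correspond to a standard dynamic program; a minimal sketch, using the squared pointwise distance as an assumed cost, is:

```python
import numpy as np

def dtw(a, b):
    """Classic DTW: fill the cost matrix under the boundary, monotonicity and
    continuity constraints, then return the minimal alignment cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0  # boundary condition: the path starts at the first pair
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2  # pointwise squared distance
            # continuity + monotonicity: only the three neighboring cells may precede
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]  # boundary condition: the path ends at the last pair
```

The elastic (one-to-many) mapping is what makes DTW small for time-shifted copies of the same shape, where the pointwise Euclidean loss stays large.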

FIGURE 7. Cost Matrix and Optimal Path
To identify the optimal path in the cost matrix, define a binary matrix $A \in \mathbb{R}^{n \times n}$: if the cell linking $y_h$ and $\hat{y}_j$ in the cost matrix lies on the optimal path, the entry $a_{hj}$ of $A$ is 1; otherwise $a_{hj}$ is 0. Define $\mathcal{A}_{n \times n}$ as the set of all paths that can go from $(1, 1)$ to $(n, n)$ under the above constraints. Since DTW is discrete and non-differentiable, a smooth operator is introduced:

$$\min{}_{\gamma}(a_1, \dots, a_k) = -\gamma \log \sum_{i=1}^{k} e^{-a_i / \gamma}$$

The temporal distortion index (TDI) imposes a delay penalty on the optimal path $A^{*}$: the area between the optimal path and the main diagonal of the cost matrix represents the time error caused by temporal distortion when DTW selects the optimal path. Since TDI itself contains non-differentiable terms, introducing the smooth operator alone does not make it differentiable, and a smoothed approximation of the optimal path $A^{*}$ is used instead [27].
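A sketch of the smooth operator and the resulting differentiable (soft) DTW recursion; the smoothing parameter `gamma` and the stabilized log-sum-exp form are implementation choices, not taken from the paper:

```python
import numpy as np

def soft_min(values, gamma):
    """Smooth, differentiable replacement for min: -gamma * log(sum(exp(-v/gamma))),
    computed in a numerically stable way by factoring out the hard minimum."""
    v = np.asarray(values, dtype=float)
    m = v.min()
    if np.isinf(m):
        return m
    return m - gamma * np.log(np.sum(np.exp(-(v - m) / gamma)))

def soft_dtw(a, b, gamma=0.1):
    """DTW recursion with min replaced by soft_min, making the loss differentiable."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + soft_min([D[i - 1, j], D[i, j - 1], D[i - 1, j - 1]], gamma)
    return D[n, m]
```

As `gamma` approaches 0, `soft_min` approaches the hard minimum and soft-DTW approaches classic DTW; larger `gamma` gives smoother gradients at the cost of a looser approximation.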

IV. MODEL OVERALL STRUCTURE
Based on the forecasting model proposed above, combined with data preparation and model testing, the complete load forecasting process is obtained, as shown in Figure 8. The algorithm processing flow is as follows: Step 1: Data Processing. 1) Data preprocessing. The purpose of data preprocessing is to ensure the quality of the dataset by cleaning the historical load data. The dataset is checked for missing values and outliers to prevent their impact on prediction accuracy; outliers are removed and treated as missing values, and missing values are then repaired from the historical load data. This paper uses linear interpolation:

$$y = y_0 + (x - x_0)\,\frac{y_1 - y_0}{x_1 - x_0}$$

where $y$ represents the missing value and $x$ its abscissa; $(x_0, y_0)$ are the horizontal and vertical coordinates of the first selected point, and $(x_1, y_1)$ those of the second selected point.
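The interpolation step can be sketched as below; `fill_missing` and its NaN-marking convention are illustrative, not the paper's code:

```python
import numpy as np

def linear_interp(x, x0, y0, x1, y1):
    """Linear interpolation of a missing value y at abscissa x between (x0, y0) and (x1, y1)."""
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)

def fill_missing(series):
    """Repair missing points (marked as NaN) in a load series by linear
    interpolation between the nearest valid neighbours."""
    y = np.asarray(series, dtype=float)
    x = np.arange(len(y))
    ok = ~np.isnan(y)
    y[~ok] = np.interp(x[~ok], x[ok], y[ok])  # numpy handles the neighbour lookup
    return y
```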

2) Data normalization
The preprocessed data is divided into a training set and a test set, after which data normalization is performed, first on the training set and then on the test set. Normalization converts the data into the range 0-1, which helps extract data features, speeds up the convergence of neural network training, and improves prediction accuracy. Data normalization can be expressed as:

$$\bar{y} = \frac{y - y_{\min}}{y_{\max} - y_{\min}}$$

where $y_{\max}$ and $y_{\min}$ represent the maximum and minimum values in the data set, $y$ represents the unnormalized actual value, and $\bar{y}$ represents the normalized value.
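A minimal sketch of the scaling, fitting the bounds on the training set first to match the train-then-test order described above (the split into fit/apply/invert helpers is an illustrative design, not the paper's code):

```python
import numpy as np

def minmax_fit(train):
    """Compute scaling bounds on the training set only, avoiding test-set leakage."""
    return float(np.min(train)), float(np.max(train))

def minmax_apply(data, y_min, y_max):
    """Map data into [0, 1] using the fitted bounds."""
    return (np.asarray(data, dtype=float) - y_min) / (y_max - y_min)

def minmax_invert(scaled, y_min, y_max):
    """Map normalized predictions back to the original load scale."""
    return np.asarray(scaled, dtype=float) * (y_max - y_min) + y_min
```

The inverse transform is what converts the model's normalized predictions back into load values for evaluation.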
Step 2: Load Forecasting. The processed training sequence is input into the prediction model. The input data is processed by the BiGRU in the encoder to generate the hidden state sequence, which is weighted by the attention module before the decoder produces the predicted value $\hat{y}_t$ at each moment. After all the prediction sequences $Y'$ are obtained, they are sent to the self-attention module, and the prediction accuracy is further improved by mining the internal relationships between them.
The DILATE loss function is used to compare the predicted sequence with the actual load sequence to obtain the loss value, splitting it into shape error and temporal distortion. The model parameters are updated according to the comparison results, and training is completed after several iterations. Finally, the test set is input into the trained model, and the final prediction result is output.
Step 3: Model Validation. To evaluate the accuracy of the established load forecasting model, its fitting degree and forecasting accuracy are assessed with selected evaluation indicators. The dataset is sampled once every hour. A sliding window strategy is employed during training and testing; during training, the actual values are used to complete each training window for supervised learning. Remark 1: Two points about the dataset should be emphasized. First, the dataset is a real dataset from a large city in northern China, and no assumptions are made about it; it records electricity consumption on the user side and is not directly related to renewable energy. Second, this paper infers future data from historical real data through deep learning, without considering the influence of transmission system operators and distribution network operators.
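The sliding-window construction can be sketched as follows, using the 96-hour input and 192-hour output lengths from the comparative experiment as illustrative defaults (the function and its defaults are assumptions, not the paper's code):

```python
import numpy as np

def sliding_windows(series, in_len=96, out_len=192, step=1):
    """Cut a long hourly load series into (input, target) pairs for supervised training:
    each input window of length in_len is paired with the next out_len actual values."""
    X, Y = [], []
    for start in range(0, len(series) - in_len - out_len + 1, step):
        X.append(series[start : start + in_len])
        Y.append(series[start + in_len : start + in_len + out_len])
    return np.array(X), np.array(Y)
```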

B. EPOCH DETERMINATION
This article has been accepted for publication in IEEE Access (DOI 10.1109/ACCESS.2022); this is the author's version, which has not been fully edited, and content may change prior to final publication. This work is licensed under a Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/).

Figure 9 shows the optimization process over the training epochs. When the model starts training, the error is large, around 14%. As the number of training epochs increases, the MAPE gradually decreases. After 120 epochs, the objective MAPE reaches its minimum and then slowly rises, at which point the model can be considered to be overfitting. Therefore, this paper sets the optimal number of training epochs to 120.

C. EVALUATION INDICATORS
This paper considers four different evaluation metrics, namely MSE, RMSE, MAE and DILATE. Among them, RMSE and MAE are two scale-dependent measures. The RMSE is the square root of the MSE, which brings the error back to the scale of the data and makes the prediction results more intuitive. DILATE is split into two indicators, DTW and TDI, so that the shape and time errors of the prediction results can be examined separately.
The expressions of MAE, RMSE, and the mean absolute percentage error (MAPE) are:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| F_i - R_i \right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( F_i - R_i \right)^2}$$

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n} \left| \frac{F_i - R_i}{R_i} \right|$$

where $n$ represents the number of data points, $F_i$ represents the predicted value, and $R_i$ represents the true value.
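Minimal NumPy implementations of these three metrics (with `pred` for $F_i$ and `true` for $R_i$):

```python
import numpy as np

def mae(pred, true):
    """Mean absolute error."""
    return np.mean(np.abs(np.asarray(pred) - np.asarray(true)))

def rmse(pred, true):
    """Root mean square error."""
    return np.sqrt(np.mean((np.asarray(pred) - np.asarray(true)) ** 2))

def mape(pred, true):
    """Mean absolute percentage error, in percent; assumes no zero true values."""
    true = np.asarray(true, dtype=float)
    return 100.0 * np.mean(np.abs((np.asarray(pred) - true) / true))
```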

D. MODEL COMPARATIVE EXPERIMENT
In this section, the model proposed in this paper is compared with other baseline methods. The chosen baseline method is as follows: LSTM is an artificial neural network used in the field of artificial intelligence and deep learning, which is specially designed to solve the long-term dependence problem of general RNN.
GRU is similar to an LSTM with a forget gate, but has fewer parameters because it lacks an output gate. GRU has been found to perform similarly to LSTM on some tasks.
LSSVR [29] is an improvement of Support Vector Regression (SVR). It changes inequality constraints into equality constraints, and uses the error sum of squares (SSE) loss function as the empirical loss of the training set, which is widely used in sequence prediction.
In order to verify the validity and stability of the model, 192 hours of load data were predicted, with an input length of 96 hours. The quantitative evaluation results of each model are shown in Table 1. In terms of the average prediction results, the method in this paper achieves the highest accuracy among the compared methods and can accurately capture the periodicity and long-term variation. In addition, Figure 10 shows the training convergence process of the three neural-network-based models. GRU and LSTM reach convergence after training for 60-80 epochs. The method proposed in this paper adds a dual attention mechanism on top of BiGRU, which increases model complexity and slows convergence; it reaches convergence after about 120 epochs. As shown in Figure 11, the proposed model achieves the best performance on all metrics. Because the multi-step output is set to 192 steps, the output span is long, and the results of LSTM and GRU are not ideal compared with single-step prediction: in terms of the shape and time indicators, their predictions differ greatly from the original data. It can be seen that an RNN loses much information in long-term sequence prediction, cannot effectively capture sequence features, and yields low prediction accuracy. LSSVR is a regression algorithm that aims to minimize the distance between the regression curve and the target sample points. As shown in Figure 11(c), the LSSVR predictions broadly reflect the fluctuation trend of the sequence, and the fluctuation time points are basically correct, but the amplitude gap is very large. The model proposed in this paper still grasps the trend of the data well in long-span prediction; the introduction of the dual attention mechanism makes both the magnitude and the trend of the predicted curve basically correct.
This shows that the attention mechanism can still learn the dependencies between long-term series data well. And because of the use of the DILATE loss function, the prediction results are ideal in terms of time indicators, and there is basically no lag. This means that the present model retains better long-term robustness, which has important implications for real-world practical applications such as weather warning and long-term energy consumption planning.

E. ABLATION STUDY
In order to verify the effect of the DILATE loss function, comparative experiments were conducted for the LSTM and Dual Attention prediction algorithms using the MSE and DILATE loss functions. To show the effect of the DILATE function clearly, a single-step test was performed: after one time step is predicted from each time window of length 12, the window is advanced with the actual value. As can be seen from Figure 12, the predictions of LSTM with MSE are blurry, under-performing in the presence of data dips or spikes and exhibiting significant time lags. Dual Attention with MSE produces sharper shape predictions, but with a large temporal misalignment. In contrast, LSTM with DILATE and Dual Attention with DILATE predict series with both correct shapes and correct temporal localization. Among them, the model proposed in this paper achieves the best prediction results. The prediction evaluation is shown in Table Ⅱ.

VI. CONCLUSION
In this paper, we propose a seq2seq framework based on dual attention mechanism and BiGRU for long-term power load prediction, replacing the loss function commonly used in deep learning with DILATE. The advantages are as follows: 1) BiGRU can simultaneously consider the information and state of the past and future moments of the prediction point. Compared with ordinary GRU neural networks, BiGRU can more effectively capture the relationship between past and future data in long-term sequences.
2) The first-layer attention mechanism can obtain the relationship between the input data and the data to be predicted, providing more targeted input information for the decoding layer. The second layer uses the self-attention mechanism to fully consider the time series characteristics of the output data. By exploring the internal relationship between the sequences, the prediction error is corrected, and the accuracy of the prediction result is further improved.
3) The DILATE loss function considers the two indicators of deformation and time variation, which has advantages compared with the prediction algorithm trained by MSE, and can effectively improve the lag problem of prediction.
In the follow-up work, the impact of different factors on the load, such as the integration of renewable energy, smart appliances, and extreme weather will be considered. It is foreseeable that by adding these factors, the accuracy of load forecasting will be further improved.