Deep Concatenated Residual Network With Bidirectional LSTM for One-Hour-Ahead Wind Power Forecasting

This paper presents a deep residual network for improving time-series forecasting models, which are indispensable to reliable and economical power grid operations, especially with high shares of renewable energy sources. Motivated by the potential performance degradation due to the overfitting of the prevailing stacked bidirectional long short-term memory (Bi-LSTM) layers associated with their linear stacking, we propose a concatenated residual learning scheme that connects the multi-level residual network (MRN) and DenseNet. This method further integrates long and short Bi-LSTM networks and uses ReLU and SeLU as its activation functions. Rigorous studies demonstrate superior prediction accuracy and parameter efficiency on the widely used temperature dataset as well as on the actual wind power dataset. The peak value forecasting and generalization capability, along with the credible confidence range, demonstrate that the proposed model offers essential features of time-series forecasting, enabling a general forecasting framework in grid operations. The source code of this paper can be found at https://github.com/MinseungKo/DRNet.git.

of renewable energy resources such as solar and wind energy, and significant change in the composition of the electricity generation mix [1]- [4]. The highest annual growth of these resources can be observed all over the world and manifests the fast energy transition [5]. For example, the worldwide wind power capacity has grown from 180 GW in 2010 to 622 GW in 2019 [6]. The solar power capacity has concurrently grown from 41 GW to 585 GW [6]. Solar and wind penetrations are expected to grow further, owing to their improved economic benefits. This paper thus uses "variable renewable energy (VRE)" to represent solar and wind energy [7], [8].
The weather-dependent variability of these energy resources [9], however, may threaten the reliability and economical efficiency of power system operations, leading to significant social and economic losses [3], [10], [11]. Among the various methods to handle the supply-side variability, VRE forecasting is the most fundamental and practical front-end application. Its accuracy facilitates a secure and economical grid integration of the VRE [7], [10]. Compared to solar, it has been understood that wind power is less predictable because of its highly uncertain characteristics [7]. Besides, wind generators tend to be installed as wind power plants rather than distributed generators, unlike the solar generators [12], [13]. This geographical aggregation and smoothing help reduce operating reserve requirements, particularly beneficial to the bulk power system operations. Various studies have thus been conducted to investigate the power system impact of the aggregated wind power and methods for improving the wind power forecasting (WPF) [14]- [16].
WPF methods can be classified into three categories: physical methods, conventional statistical methods, and artificial neural network (ANN) based methods. Hybrid methods combining two or more of the methods above have also been investigated so that the methods complement each other [3], [17]-[19]. The physical method builds upon the meso-scale weather model or the numerical weather prediction (NWP) system, which is a mathematical model based on various geographical and meteorological information [20], [21]. Though this method performs well for medium-term forecasting periods of more than 3 hours, it has limitations in short-term forecasting because of the difficulty in gathering all the related geographical or meteorological data [21]-[23]. The conventional statistical method produces a linear characteristic of the wind power output based on the historical data [24]. Well-known methods such as AR or ARIMA models have been widely used to construct a linear relationship; however, the nonlinearity of the data often compromises the accuracy and generality of the model [17]. Though there are various approaches to express the nonlinearity with conventional statistical methods, these methods are still based on linear forms and are therefore limited in representing the nonlinear dynamics [25]-[27]. On the other hand, the ANN-based method can effectively represent the nonlinear and complex features of wind speed and power with a large number of parameters.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
ANN-based forecast methods have been widely used with the improvement of memories and arithmetic units. ANN-based shallow models for WPF or wind speed forecasting (WSF) are proposed with higher accuracy than physical or conventional statistical methods [28], [29]. Hybrid models, paralleling basic ANN models with other models, e.g., Kalman filters, and support vector machines, are proposed to boost the accuracy of WSF [30], [31]. The introduction of a recurrent neural network (RNN) in deep neural network (DNN) improved the accuracy of ANN models [32], [33]. Furthermore, long short-term memory (LSTM) network, which is the advanced structure of RNN, is introduced. The memory cell of LSTM helps significantly decrease the time-series forecasting error [34]. Based on the ability to keep the data for a long time, LSTM is used to extract temporal features for WSF [35]. The LSTM based WSF models outperform the ANN or ARIMA based models as demonstrated in [17], [36]. However, these previous studies mainly focused on obtaining the diverse information from various LSTM networks with few LSTM cells and the benefits from the deep learning were not fully exploited.
In general, the performance of DNN increases as the network depth grows. However, after a specific size of the network, overfitting issues can arise and negatively affect the overall DNN performance [37], [38]. Two approaches have been taken to mitigate these problems: ameliorating the layer itself and transforming the structure of DNN [38]- [42]. A representative method of the first approach is bidirectional LSTM (Bi-LSTM) [39]. Unlike LSTM training only in a forward direction, Bi-LSTM allows for bidirectional training to improve the performance of sequence learning [40]. In [43], [44], Bi-LSTM networks for WPF and WSF achieve higher forecasting accuracy than LSTM networks. However, it should be noticed that Bi-LSTM networks do not always outperform LSTM networks, which is handled in Section IV of this paper. As an example of the second approach, residual learning modifies the structure of DNN with shortcut connections and effectively trains DNN. Variants of residual learning have been proposed and reported improved performance [38], [41], [42]. However, the second approach is only limited to CNN for sequential data. Structural improvement of RNN is desired for highly complex sequential data, as the network needs to be deeper to handle the data. This paper thus proposes deep concatenated residual networks (DRNets) for RNN as detailed in Section III following the brief discussion of DNN, RNN, and (Bi-)LSTM in Section II. DRNets integrate the key concepts of DenseNet and multi-level residual network and further incorporate several new improvements, including activation functions and fused structure with short and long Bi-LSTMs. The adequate constitution of RNN layers is firstly investigated when DRNets are employed. With the constitution, the combination of ReLU and SeLU is proposed for activating the network. Finally, the fused concept, which exploits results from both short and long Bi-LSTMs, is used to enhance the peak value forecasting capability. 
In particular, the proposed model is adopted for 1-h ahead aggregated WPF, which is useful for a range of system operations, including scheduling, dispatch, and the determination of operating reserve requirements [7]. The accuracy and efficacy of the proposed forecasting model are demonstrated through rigorous case studies using the historical wind power data from ERCOT and validated on a different type of data, the Jena temperature dataset, as presented in Section IV. Finally, concluding remarks are provided in Section V.

A. Deep Neural Network (DNN) and Recurrent Neural Network (RNN)
The DNN is an improved model of ANN with multiple processing layers to learn representations of data [46]. Technological improvement in memories and arithmetic units enables DNN to express high dimensional data. DNN can be understood as a large black box function, which even trains the functional form itself. The DNN thus embraces the complex nonlinear dynamics of the wind power, without providing the functional structure for each dynamic pattern. In addition, various uncertain weather factors commonly affect the wind power plants and their power outputs to a varying degree, which cannot be adequately captured by the shallow networks [47]. Therefore, compared to ANN, DNN should be more adequate for WPF with the following two attributes: 1) Ability to learn common or shared uncertainties and 2) Ability to learn nonlinear relationships [47].
The RNN is one category of DNN suitable for sequence learning. RNN receives a sequence as an input at a time and maps the sequence to a sequential output [39], as featured in recent successful applications such as speech recognition, natural language translation, and image captioning [48], [49]. The conventional structure of RNN with N layers and its unfolded graph are shown in Fig. 1. For the data input $x^t$ at time step t, the corresponding predicted output $O^t$ is computed through the recurrent state equations of the N layers, where $g_l^t$ represents the activation function of the lth layer at t, and l = 1, 2, ..., N. $b_x$ and $b_y$ are bias terms, $U^t$, $W_l$, and $V_N$ are weight matrices, and $h_l^t$ is the shared state vector of the lth layer. For each time step, the parameters of RNN are updated to minimize the value of the loss function $L(O^t, y^t)$, where $y^t$ is the desired output. Therefore, RNN can learn temporal features owing to the sharing property of the state vector.
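The recurrence described above can be illustrated with a minimal sketch. It assumes scalar states and weights, a tanh activation for every layer, and hypothetical parameter names (`u`, `w`, `v`); it is not the paper's exact formulation, only one common form that is consistent with the symbol definitions.

```python
import math

def rnn_forward(xs, u, w, v, b_x=0.0, b_y=0.0):
    """Run a stacked scalar RNN over the sequence xs.

    u[l] : input weight of layer l (hypothetical scalar stand-in for a matrix)
    w[l] : recurrent weight of layer l acting on its own previous state
    v    : output weight applied to the state of the last layer
    """
    n_layers = len(w)
    h = [0.0] * n_layers          # shared state, carried across time steps
    outputs = []
    for x in xs:
        layer_in = x
        for l in range(n_layers):
            # new state depends on the current input and the previous state
            h[l] = math.tanh(u[l] * layer_in + w[l] * h[l] + b_x)
            layer_in = h[l]
        outputs.append(v * h[-1] + b_y)
    return outputs

# A constant input still yields changing outputs because the state is carried over.
outs = rnn_forward([1.0, 1.0], u=[1.0], w=[1.0], v=1.0)
```

Because the state `h` persists between time steps, the second output differs from the first even though the inputs are identical, which is the sharing property the text refers to.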

B. Long Short-Term Memory (LSTM) and Bidirectional LSTM (Bi-LSTM)
The LSTM is an improved architecture of conventional RNN to overcome its limitation on solving problems with long-term dependency [50]. LSTM can alleviate the vanishing gradient problem by adding a special hidden unit, known as a memory unit. This unit can accumulate or remove the new inputs. Three controlling gates determine the operation of the unit by controlling the flow of data, as illustrated in Fig. 2 [51]. The forget gate discards useless memories from the state vector and the input gate adds the necessary information from the new input and previous net output. Finally, the new output of the corresponding unit is determined by the output gate. Based on these operations, LSTM unit can keep the useful data for a long duration, so it captures the long-term dependencies better than the conventional RNN.
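The gate operations described above can be sketched for a single scalar LSTM unit. The sketch follows the widely used LSTM formulation; the parameter dictionary `p` and its gate names are hypothetical and chosen for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step; p maps a gate name to (w_x, w_h, bias)."""
    def pre(gate):
        w_x, w_h, b = p[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old memory to keep
    i = sigmoid(pre("input"))         # how much new information to admit
    g = math.tanh(pre("candidate"))   # candidate content from the new input
    o = sigmoid(pre("output"))        # how much of the memory to expose
    c = f * c_prev + i * g            # updated memory (state) value
    h = o * math.tanh(c)              # new output of the unit
    return h, c

# With the forget gate open and the input gate closed, the memory is preserved.
params = {"forget": (0.0, 0.0, 20.0), "input": (0.0, 0.0, -20.0),
          "candidate": (1.0, 0.0, 0.0), "output": (0.0, 0.0, 0.0)}
h, c = lstm_step(5.0, 0.0, 1.0, params)
```

In the usage example the memory value stays close to its previous value regardless of the new input, which illustrates how the unit can keep useful data for a long duration.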
Bidirectional learning can contribute to boosting the accuracy of conventional RNN [52]. Bidirectional learning has been adopted from the reasoning that the output is not a sole product of previous inputs, but a piece of the continuous correlation. Bidirectional RNN (BRNN) trains its parameters in both forward and reverse paths to understand the context. This training process can capture the features or patterns in bidirectional aspects, whereas RNN is trained in the forward-path only. BRNN showed higher accuracy and performance in sequence learning than conventional RNN, especially in speech processing tasks [52].
As represented in Fig. 3, Bi-LSTM incorporates the bidirectional concept into LSTM. The forward layer of Bi-LSTM updates the parameters in the unit as does the LSTM network. On the other hand, LSTM cells in the backward layer compute the derivative of the propagated errors in the forward layer. The operation of a single LSTM cell h in a backward layer follows the formulation in [39], [51]. That formulation is expressed in terms of the backpropagated error of the output of cell h at time t, where N is the set of all units, U and W denote weight matrices, and C is the set of cells. δ is the derivative of the error with respect to the gates of the cell, and y is the output of each gate, where the subscripts g, s, f, and q denote the input gate, state, forget gate, and output gate, respectively. In particular, the cell output $y_c$ is determined by the tanh function, while the other outputs are calculated using a sigmoid function. $s_h$ is the state value of h, and $E^t$ is the net output error at time t. According to the operation of the LSTM unit in the backward layer, Bi-LSTM can adjust the parameters to lessen the propagated errors in the forward layer.

A. Conventional Residual Learning
Though stacking layers enables DNN to enrich the level of features, other problems may arise [53]. At the early stage of DNN, a vanishing/exploding gradient problem may adversely affect the convergence of DNN during the training process. Degradation and overfitting are the newly exposed problems after resolving the convergence issues. The radical root of these problems is the network not adequately structured to train its parameters. Degradation is the retrogression of training related to the network depth. As the network depth increases, training error of the network saturates and then increases rapidly. On the other hand, overfitting is related to validation or test. Overfitting occurs as the parameters are updated for the training dataset only; the performance of the trained network on test dataset thus decreases.
The most fundamental way to avoid these problems is to secure more training data or to reduce the size of networks. However, there are some limitations on boosting the size of training data in reality, and reducing the size of networks can degrade the adaptability of networks to the other types of data. Therefore, various methods, such as dropout strategy and pooling strategy, have been proposed to solve the problems. Dropout, one of the most successful regularization methods, reorganizes the network only with strongly related connections by evaluating each unit with random masking during the training process [54]. In addition, pooling strategy can reduce the network features by mapping more than 2D to a single output [55]. However, the noise added on RNN layer using the dropout can be amplified with the depth of the network. It is thus advised that the dropout should be applied only for shallow RNN [56], [57]. Pooling strategy is not widely adopted for RNN, because RNN is the sequence pooling process in itself [58], [59].
Among the countermeasures, residual learning has been widely used due to its simplicity and effectiveness [60]. The basic idea of residual learning is to connect shortcuts within the network. The desired mapping of the stacked layers in conventional DNN can be represented as
$$H(x) = F(x) \tag{8}$$
where H(x) is the desired mapping of the stacked layers, F denotes the pure stacked layers, and x is the input of the stacked layers.
In the case of residual learning, the mapping differs from that of the conventional DNN:
$$H(x) = F(x) + x \tag{9}$$
The addition of x in (9) shows that residual learning can still be regarded as a feedforward network. When the whole network is composed of n stacked layers, the conventional DNN is just an expression between x and $H_n$. Therefore, the total network aims to lessen $H_n - x$. On the other hand, if residual learning is applied to each stacked layer, the ith stacked layer optimizes its weights to drive $H_i - H_{i-1}$ to zero, where i = 1, 2, ..., n. Therefore, the weights of the network with residual learning are easier to optimize than those of the conventional network, because the residual network can be regarded as dividing the optimization task. In addition, the final output of the residual network can be represented as $F_n(H_{n-1}) + \cdots + F_1(x) + x$, while the output of the conventional network is $F_n(F_{n-1}(\cdots F_1(x)))$. This shows that the residual network can maintain the input flow without getting stuck in the training details.
Accuracy can further be improved by modifying residual learning with various strategies [42], [61]-[63]. The multi-residual network incorporates additional shortcuts into ResNet, so that the identity mapping of the network can span multiple levels [61]. Another approach to reforming the residual network is the fused network, which stacks the layers in both vertical and horizontal directions and thus resembles the ensemble approach [62]. The multilevel residual network (MRN) and DenseNet are improved residual learning methods widely used in CNN. Instead of identity mapping, MRN uses 1D CNN mapping based on the hypothesis that the residual mapping of the residual network can be optimized [63]. DenseNet connects all the layers by concatenation, unlike the other structures, whose layers are connected through addition [42]. However, the additive connections of MRN are limited in improving the WPF performance compared to concatenation, whereas the size of DenseNet increases geometrically with the number of layers because it concatenates all the layers.
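The difference between plain stacking and residual stacking can be checked numerically with a small sketch. Scalar functions stand in for the stacked layers, and the function names are hypothetical; the point is only that the residual output decomposes into the input plus the sum of the per-layer residual terms.

```python
def plain_stack(x, layers):
    """Conventional stacking: F_n(F_{n-1}(... F_1(x)))."""
    h = x
    for f in layers:
        h = f(h)
    return h

def residual_stack(x, layers):
    """Residual stacking: H_i = F_i(H_{i-1}) + H_{i-1}, with H_0 = x.

    Returns the final output and the list of residual terms F_i(H_{i-1}).
    """
    h = x
    terms = []
    for f in layers:
        r = f(h)          # residual learned by this stacked layer
        terms.append(r)
        h = r + h         # shortcut keeps the incoming signal
    return h, terms

layers = [lambda h: 0.5 * h, lambda h: h + 1.0]
out, terms = residual_stack(2.0, layers)
# out equals the input plus the sum of all residual terms
```

If every layer outputs zero, the residual stack still passes the input through unchanged, whereas the plain stack loses it; this is the sense in which the residual network maintains the input flow.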

B. Proposed Residual Learning
To overcome these limitations, this paper proposes another structure of residual learning, which combines the strengths of MRN and DenseNet. The key concepts of the proposed residual networks (DRNets) are shown in Fig. 4. For MRN with stacked layers and 2n layers in total, the output is given by (10), where J(k) represents the output of the 1D CNN for input k, and Act(x) denotes the activated output for input x. In the case of DenseNet, the output of the ith stacked layer is given by (11), in which the outputs are joined by concatenation. The output of the ith stacked layer with DRNets is then expressed in (12) as the combination of (10) and (11), where $y_i$ is the pure output of the ith stacked layer. The structural formulation in (12) shows that DRNets use 1D CNN mappings and activation functions similar to MRN, but do not connect all the outputs from the residuals. At the same time, the shortcuts are gathered through concatenation, as in DenseNet. Therefore, DRNets can exploit both the effective activation of MRN and the data preservation of DenseNet. In addition, the total number of parameters remains similar to those of MRN and DenseNet because the increase in parameters caused by concatenation is offset by the smaller number of residual links. Another feature that differentiates DRNets from the others is that the inputs of the concatenation include not only $y_i$ but also the activated $y_i$, which is expressed as $Act(J(y_i))$. In turn, DRNets perform well with higher parameter efficiency than the other residual learning methods.
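To make the contrast between addition and concatenation concrete, the following sketch merges a layer output $y_i$ with its activated 1D-convolved copy $Act(J(y_i))$. It is an illustrative reading of the last feature described above, not a reproduction of (12): the kernel, the ReLU default for `act`, and all function names are assumptions, and a feature map is represented as a list of channels.

```python
def relu(v):
    return max(0.0, v)

def conv1d_same(seq, kernel):
    """Zero-padded 1D convolution over time with an odd-length kernel (stand-in for J)."""
    pad = len(kernel) // 2
    out = []
    for t in range(len(seq)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = t + k - pad
            if 0 <= j < len(seq):
                acc += w * seq[j]
        out.append(acc)
    return out

def merge_by_addition(channels, shortcut):
    """Additive shortcut: the number of channels stays the same."""
    return [[a + b for a, b in zip(ch, shortcut)] for ch in channels]

def merge_by_concatenation(channels, shortcut):
    """Concatenated shortcut: the shortcut is kept as an extra channel."""
    return channels + [shortcut]

def drnet_like_block(y, kernel, act=relu):
    """Concatenate y with Act(J(y)), so the raw output is preserved alongside it."""
    shortcut = [act(v) for v in conv1d_same(y, kernel)]
    return merge_by_concatenation([y], shortcut)

y = [1.0, -2.0, 3.0]
block_out = drnet_like_block(y, kernel=[0.0, 1.0, 0.0])
```

With addition the feature width is unchanged but the original values are overwritten, while concatenation keeps `y` intact at the cost of one extra channel per shortcut.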

C. Additional Improvements: Peak Value Forecasting and Confidence Interval
Peak load forecasting has been regarded as an essential tool to make decisions related to the power system operations [64]- [66]. Because the wind generation can be considered as negative load in the steady-state operations, predicting the peak value of wind power outputs should particularly be beneficial in estimating standby capacity of the power system and load rates with capacity factors, i.e., useful information for unit commitment or deploying quick start generators. With the increase of wind power penetration level, the importance of the accurate peak value forecasting should increase.
To further improve peak value forecasting capability, the proposed model adopts the fused concept, as illustrated in Fig. 5. The horizontally stretched size of RNN layers is related to the period during which the layers can learn best. Horizontally long RNN has strength in apprehending long-term tendencies while short RNN learns short-term tendencies well. The fused net, composed of long and short Bi-LSTM networks, can nurture both strengths. A single Bi-LSTM layer at the end of the network determines the participation of long and short Bi-LSTM layers, which is related to the impact of short and long-term uncertainties on the output. Therefore, the fused concept helps the model better analyze the sequence and improve the peak value forecasting of wind power.
Types of activation functions influence the forecasting performance. One of the most widely used activation functions is ReLU, which has significantly improved the performance of deep neural networks [67]. However, ReLU has a serious problem known as the dying ReLU problem. When ReLU outputs 0 for a large portion of the hidden units, their gradients are also 0, so gradient-based algorithms cannot update the corresponding weights. To solve this problem, leaky ReLU and eLU were proposed [68], [69]. The SeLU, or scaled exponential linear unit, has one more parameter than eLU [69]. On top of the parameter α of eLU, SeLU has a scale parameter λ, and can be represented as
$$\mathrm{SeLU}(x) = \lambda \begin{cases} x, & x > 0 \\ \alpha \left(e^{x} - 1\right), & x \le 0 \end{cases}$$
The SeLU not only avoids the dying ReLU problem but also offers a self-normalizing characteristic because the activation of a normally distributed input through SeLU converges towards the normal distribution [69], [70]. Therefore, SeLU helps train deep networks without gradient problems. The activation functions of the proposed model are divided into two categories. The first is the function used for activating the 1D CNN layer, and the other is used for activating the Dense layer. As shown in Fig. 5, the proposed activation scheme (referred to as the final ReLU) takes SeLU for the former category and ReLU for the latter.
Because forecasting errors and uncertainties always exist, quantifying the confidence interval of the predicted value is helpful for more reliable operation of the power system [71], [72]. As represented in Fig. 6, the overall process of obtaining stochastic or probabilistic intervals (PIs) consists of four steps: Classification, Gaussian Modeling, PI Estimation, and Set Update. From the training results, elements specified by the predicted values in the training set, $\hat{y}^{tr}$, and the prediction errors $e^{tr} = y^{tr} - \hat{y}^{tr}$ can be obtained. If $\hat{y}^{tr}$ is matched to the x-axis and $e^{tr}$ to the y-axis, each element can be expressed in the form $(\hat{y}^{tr}, e^{tr})$.
According to the value of $\hat{y}^{tr}$, the elements can be classified into m sets. The value of m should be selected so that each set $A_k$ has sufficient elements for assuming a Gaussian distribution. Let $B_k$ represent the elements of $A_k$ and $e_{B_k}$ the prediction errors of $B_k$, with $k \in \{1, 2, \ldots, m\}$. For a test prediction $\hat{y}_i^{te}$ in the x-axis range of $A_n$, a stochastic interval $I^{te}_{e_i}$ for the forecasting error with a $100(1-\beta)\%$ confidence level can be expressed as $I^{te}_{e_i} = [L_i^{\beta}, U_i^{\beta}]$, where the lower bound $L_i^{\beta}$ and the upper bound $U_i^{\beta}$ are calculated from the Gaussian model of $e_{B_n}$ with the standard score $z_{1-\beta/2}$. Finally, the confidence interval of the wind power forecast is obtained by shifting this error interval by the predicted value, i.e., $[\hat{y}_i^{te} + L_i^{\beta}, \hat{y}_i^{te} + U_i^{\beta}]$. These expressions indicate that the confidence interval for a specific test prediction is determined by the interval of the corresponding set, which is initially composed of the training predictions and errors. After the real value of the test set, $y_i^{te}$, is revealed, the prediction error set $A_{n,i}$ is updated with $e_i^{te} = y_i^{te} - \hat{y}_i^{te}$. The corresponding set therefore becomes $A_{n,i+1} = [A_{n,i}, (\hat{y}_i^{te}, e_i^{te})]$, while the other sets remain unchanged, i.e., $A_{k,i+1} = A_{k,i}$ for $k \in \{1, 2, \ldots, m\} \setminus \{n\}$. Then the new test confidence interval is determined based on the new test prediction value, $\hat{y}_{i+1}^{te}$, and the updated m sets.
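The four steps (Classification, Gaussian Modeling, PI Estimation, and Set Update) can be sketched as follows. This is a minimal illustration under stated assumptions: predictions are scaled to [0, 1], the m sets are equal-width bins of the predicted value, and the bounds are taken as the set mean plus or minus $z_{1-\beta/2}$ times the sample standard deviation. The binning rule and all function names are hypothetical and may differ from the paper's implementation.

```python
from statistics import NormalDist, mean, stdev

def bin_index(y_hat, m):
    """Classification: map a prediction in [0, 1] to one of m equal-width sets."""
    return max(0, min(int(y_hat * m), m - 1))

def build_sets(y_hat_tr, e_tr, m):
    """Collect the training prediction errors of each set."""
    sets = [[] for _ in range(m)]
    for y_hat, err in zip(y_hat_tr, e_tr):
        sets[bin_index(y_hat, m)].append(err)
    return sets

def prediction_interval(sets, y_hat, beta):
    """Gaussian modeling and PI estimation for one test prediction."""
    errors = sets[bin_index(y_hat, len(sets))]
    mu, sigma = mean(errors), stdev(errors)      # Gaussian model of the set
    z = NormalDist().inv_cdf(1.0 - beta / 2.0)   # standard score z_{1-beta/2}
    lower, upper = mu - z * sigma, mu + z * sigma
    # shift the error interval by the prediction, since e = y - y_hat
    return y_hat + lower, y_hat + upper

def update_sets(sets, y_hat, y_true):
    """Set update: append the revealed error to the matching set only."""
    sets[bin_index(y_hat, len(sets))].append(y_true - y_hat)

sets = build_sets([0.1, 0.1, 0.1, 0.9, 0.9], [-0.1, 0.0, 0.1, 0.2, 0.4], m=2)
lo, hi = prediction_interval(sets, y_hat=0.2, beta=0.05)
update_sets(sets, y_hat=0.2, y_true=0.25)
```

Because each set keeps its own mean and spread, predictions in different output ranges receive intervals of different widths, and each interval adapts as test errors are merged into its set.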

A. Test Settings
This paper presents 1-h ahead forecasting on the wind power dataset of ERCOT and Jena's temperature dataset. Each dataset is composed of hourly average data with a single feature and no other variables. The values of each dataset are preprocessed to fit in the range between 0 and 1, for example, by MinMaxScaler in Python [73]. Each dataset is then divided into training, validation, and test sets. The training set is used to update the parameters of the forecasting models, and the validation set is used to select the best-performing model. The test set is used to evaluate the performance of the selected model. To optimize and assess the model during the training process, three widely used metrics are employed: mean squared error (MSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). Note that all the metrics are calculated on the preprocessed data. Test settings are summarized in Table I.
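The preprocessing and the three metrics can be written out in a few lines. The sketch below uses plain Python instead of a library scaler so that it is self-contained; the function names are hypothetical, and the MAPE helper assumes that no true value is exactly zero, which must be handled separately on data scaled to [0, 1].

```python
def min_max_scale(values):
    """Scale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent; assumes no zero in y_true."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

scaled = min_max_scale([2.0, 4.0, 6.0])
errors = (mse([1.0, 2.0], [1.0, 4.0]),
          mae([1.0, 2.0], [1.0, 4.0]),
          mape([1.0, 2.0], [1.0, 4.0]))
```

Since the metrics are computed on the scaled data, MSE and MAE are unitless and are not directly comparable with errors expressed in MW.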
The overall timeline of the proposed WPF is illustrated in Fig. 7. In this paper, update time and forecasting horizon are set to be 1-h to assist, e.g., the hourly reliability unit commitment. For k-h WPF, all the corresponding data would be collected at (k − 1)-h plus data acquisition time. At the same time, k-h wind power output forecasting is performed; thus, the actual lead time  should be 1-h minus data acquisition time. The forecasting resolution is the same as the forecasting horizon based on the hourly dataset. Higher resolution or intra-hour forecasting could be achieved with the shorter update time. The forecasting horizon could also be extended for various applications in operational planning.

B. Temperature Forecasting Results
To secure the objectivity of the proposed model, we have conducted forecasting experiments on the temperature dataset of Jena in Germany, which has been widely used for RNN performance test [74]. The temperature data embeds variability and imposes forecasting uncertainty, similar to the wind power data. The experiments are based on the data of 70037 hours from 2009 to 2017 with 60%, 30%, and 10% data division for training, validation and test set. Table II represents the experimental results of the 1-h ahead temperature forecasting. DRNet-3 shows better performance than DenseNet in all metrics except the peak value forecasting. Fused DRNet-1 improves both overall performance and the peak value forecasting more than DRNet-3. Moreover, the standard deviation of errors with fused DRNet-1 is 1.1957, which is smaller than 1.2085 and 1.3447 with DRNet-3 and DenseNet, respectively. These results imply both the improved performance and the generalization capability of the fused DRNet for a time-series.
The training and validation MSEs over the epochs are shown in Fig. 8 and reveal the training processes of the networks. Because the overall training processes of the temperature and wind power data are similar, only the training curves of the temperature data are included in this paper. The validation errors of DRNet-3 and fused DRNet-1 generally converge between epochs 5 and 10. Therefore, a user-selectable early stopping strategy may be an option to expedite the training process when the validation error does not decrease for a predetermined number of epochs. However, this paper excludes the early stopping strategy to prevent premature termination and to use the generally trained model.
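The early stopping option mentioned above, which this paper does not use, can be sketched as a patience rule on the validation error. The function name and the `patience` parameter are hypothetical, and the loss values are made-up numbers for illustration, not results from the paper.

```python
def early_stop_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a patience-based early stopping rule.

    Training stops at the first epoch where the validation loss has failed to
    improve for `patience` consecutive epochs; otherwise it runs to the end.
    """
    best = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Illustrative validation losses only
stop, best = early_stop_epoch([5.0, 4.0, 3.0, 3.1, 3.2, 3.3], patience=2)
```

A larger patience value reduces the risk of stopping on a temporary plateau, which is the premature termination the paper avoids by training for the full number of epochs.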

C. Wind Power Forecasting Results
The wind power dataset is drawn from the hourly total wind output of ERCOT for 26311 hours from 2016 to 2018 [75]. The data represent the aggregated power output from all the wind generators in Texas. Total installed wind generation increased from 16246 MW to 22607 MW, and the maximum output of wind power was 19099 MW. The largest wind output percentage of the load was 54.6%, and the largest percentage change of output was 280.6%. Of the 26311 hours of data, 76%, 16%, and 8% are used for the training, validation, and test sets, respectively.
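The data division can be sketched as below. The sketch assumes that the split is chronological (training first, test last), which is a common choice for time series but is not stated explicitly in the text; the function name and the rounding rule are also assumptions.

```python
def chronological_split(series, fractions):
    """Split a series into consecutive training, validation, and test parts.

    fractions = (train, validation, test); the test part takes the remainder
    so that no sample is dropped by rounding.
    """
    n = len(series)
    n_train = int(n * fractions[0])
    n_val = int(n * fractions[1])
    train = series[:n_train]
    val = series[n_train:n_train + n_val]
    test = series[n_train + n_val:]
    return train, val, test

hours = list(range(26311))  # placeholder indices standing in for hourly samples
train, val, test = chronological_split(hours, (0.76, 0.16, 0.08))
```

Keeping the parts in time order prevents future observations from leaking into the training set.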
1) Residual Learnings: The first task of the case study is to examine the WPF capability of the proposed residual learning. Each residual learning method was applied to Bi-LSTM networks with depths of 7 and 11, and all the activation functions were set to SeLU. Overall WPF results are shown in Fig. 9 and Table III. For the 7-depth network, DRNet-4 outperforms the other methods. Among the conventional methods, only DenseNet avoids degradation in the 11-depth network. The others performed well at 7-depth, but a large increase in error can be observed in the 11-depth network. On the other hand, DRNets showed superior forecasting accuracy to the conventional methods. DRNet-4 showed the lowest error at 7-depth, though its error increases slightly at 11-depth. DRNet-2 and 3 not only performed well but also did not suffer from degradation. In particular, DRNets in the 7-depth network showed higher accuracy than the 11-depth network with DenseNet, even with fewer parameters.
The better performance of DRNet-4 at 7-depth is attributable to the fact that all the residual mappings in DRNet-4 contain spatiotemporal data, using both identity mapping and the 1D CNN layer. However, overfitting occurred in the 11-depth network with DRNet-4 as the number of parameters increases excessively. On the other hand, DRNet-3 has a relatively high validation error in the 7-depth network and the lowest error in the 11-depth network. The difference between DRNet-3 and 4 is the 1D CNN mapping of the initial input, which contributes substantially to the increase in parameters. As shown in Table III and Fig. 9, the parameter increase in DRNet-3 is much smaller than that in DRNet-4, which helps prevent overfitting or degradation. The same reasoning extends to DRNet-2 and 3 for deeper networks. In other words, the deeper the network is, the more 1D CNN mappings should be removed.
2) LSTM and Bi-LSTM Layers: The second task is to identify the impact of substituting Bi-LSTM for LSTM with DRNet-3 and 4. When the number of layers is 7, the network with 7 LSTM and one mixed with LSTM and Bi-LSTM showed the best performance each for DRNet-3 and DRNet-4, as shown in Table III. In case of 11 layers, the network with 11 Bi-LSTM surpassed the others. Fig. 10 shows the validation errors for various combinations of layers according to the network depth. 7-depth networks showed similar performance regardless of the type of the layers. However, the validation error of the 11-depth LSTM network increases with the increase of depth, while one of the pure Bi-LSTM network decreases. This is owing to the improvement in the parameter optimization of the backward layers in Bi-LSTM. As the network becomes deeper and more complex, Bi-LSTM layers can help to prevent overfitting and degradation, and to increase the accuracy of WPF.

3) CNN Layer and Activation Functions:
To verify the effectiveness of employing the 1D CNN layer in the residual connection, DRNets with and without 1D CNN layers are evaluated. In DRNets without 1D CNN layers, the residual connections remain with SeLU functions. According to the test results in Table III, the 11-depth DRNet-3 and the 7-depth DRNet-4 are used for comparison. The simulation results show that the DRNets with 1D CNN outperform the DRNets without 1D CNN, as represented in Fig. 11. All the validation and test errors except the test MAPE of DRNet-4 are lower than those of DRNets without CNN. The next step is to determine the activation functions applied to the 1D CNN and Dense layers. The WPF results of the conventional activation functions, final ReLU, and final SeLU are compared. Final ReLU uses SeLU for activating the 1D CNN layer and ReLU for the Dense layer, and final SeLU uses the opposite assignment. Fig. 12 shows the comparison of validation MSEs according to the activation functions. Among the conventional functions, ReLU has the lowest average error for both DRNet-3 and 4. However, SeLU recorded the lowest errors in specific experiments, though the deviation of its results was large. Final ReLU combines the strengths of ReLU and SeLU. The average MSEs of final ReLU were lower than those of ReLU. In addition, the best result of final ReLU was better than that of SeLU. The overall deviations of final ReLU were larger than those of ReLU, but much lower than those of SeLU.
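For reference, the two activation functions compared above can be written as scalar functions. The constants are the standard SeLU values from the self-normalizing network literature; whether the paper's implementation uses exactly these defaults is an assumption.

```python
import math

# Standard SeLU constants (assumed here; they are the commonly used defaults)
SELU_ALPHA = 1.6732632423543772
SELU_LAMBDA = 1.0507009873554805

def relu(x):
    """ReLU: passes positive inputs and outputs 0 otherwise."""
    return max(0.0, x)

def selu(x):
    """SeLU: scaled linear for positive inputs, scaled exponential otherwise."""
    if x > 0:
        return SELU_LAMBDA * x
    return SELU_LAMBDA * SELU_ALPHA * (math.exp(x) - 1.0)

# ReLU is flat for negative inputs, while SeLU still responds to them.
pairs = [(x, relu(x), selu(x)) for x in (-3.0, -1.0, 0.0, 1.0)]
```

For negative inputs ReLU returns 0 and therefore passes no gradient, whereas SeLU returns a negative value that still varies with the input, which is how it avoids the dying ReLU problem.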
4) Fused Concept: WPF results of our best single model, i.e., 11-depth Bi-LSTM networks with DRNet-3 and final ReLU, are shown in Fig. 13. DRNet-3 showed better performance in forecasting low peak value than DenseNet, but has lower accuracy in forecasting high peak value. The fused concept can help to improve peak value forecasting. Therefore, the final model fused the short and long Bi-LSTM networks with DRNet-1, as the fused net highly increases the parameters. As shown in Fig. 13, fused DRNet-1 highly improves not only the high peak value forecasting but also the overall errors. The mean of the largest 10% errors for DRNet-3 was 275.2331 MW, which was higher than 247.0299 MW of DenseNet. Fused DRNet-1 improved these values to 199.4232 MW with almost the same standard deviation. Therefore, we can conclude that the Bi-LSTM forecasting model with final ReLU and fused DRNets enhances the overall WPF performance and peak value forecasting as it has strength in adjusting the data flow so that it can adequately optimize the parameters of DNN.
To verify the efficiency of the proposed models, Diebold-Mariano (DM) tests were conducted for ResNet, DenseNet, DRNet-3, and fused DRNet-1 [76], [77]. Results of the DM tests based on the squared-error loss are shown in Table V. DM tests comparing ResNet to the other models show that the improvements made by DenseNet and DRNets are significant, since the absolute values of DM are larger than 1.96, the z score for the 5% significance level (two-sided) in the normal distribution. Comparing DenseNet and DRNets, both DRNet-3 and fused DRNet-1 have higher forecasting accuracy than DenseNet. The observed differences between DenseNet and fused DRNet-1 are significant. On the other hand, the absolute value of DM between DenseNet and DRNet-3 was 1.5051, which is less than 1.96. Therefore, the observed differences between DenseNet and DRNet-3 are not as significant as those between DenseNet and fused DRNet-1, but are still meaningful, because the value is higher than 1.282, the z score for the 10% level (one-sided). DM test results on DRNet-3 and fused DRNet-1 indicate that the relative forecasting accuracy of the two models can vary with stochastic interference. However, it should be noted that the test results represent the overall performance, not the peak value forecasting ability.
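As an illustration of the test described above, the following is a minimal sketch of a one-step-ahead DM statistic under squared-error loss. It is not the authors' implementation: the function names are hypothetical, the sign convention (a negative value means the first forecast has the lower loss) is an assumption chosen to match the negative DM values reported for the better model, and the long-run variance is simplified to the sample variance of the loss differential, which is reasonable only for a 1-h ahead horizon.

```python
import math

def diebold_mariano(actual, forecast_a, forecast_b):
    """One-step-ahead DM statistic under squared-error loss.

    Assumed convention: negative values mean forecast_a has the lower
    average squared error. With h = 1 the long-run variance is taken as
    the sample variance of the loss differential (no autocovariances).
    """
    if not (len(actual) == len(forecast_a) == len(forecast_b)):
        raise ValueError("all series must have the same length")
    # Loss differential d_t = e_a,t^2 - e_b,t^2
    d = [(y - fa) ** 2 - (y - fb) ** 2
         for y, fa, fb in zip(actual, forecast_a, forecast_b)]
    n = len(d)
    d_bar = sum(d) / n
    var_d = sum((x - d_bar) ** 2 for x in d) / n
    if var_d == 0.0:
        raise ValueError("loss differential has zero variance")
    return d_bar / math.sqrt(var_d / n)

def two_sided_p_value(dm):
    """Two-sided p-value from the standard normal approximation."""
    return math.erfc(abs(dm) / math.sqrt(2.0))
```

Under this approximation, |DM| > 1.96 corresponds to a two-sided p-value below 0.05, which is the threshold used in the comparison with ResNet.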

5) Stochastic Intervals:
The prediction error of WPF with the fused model and the corresponding PIs with 95% and 99% confidence levels are represented in Fig. 14. Initially, the training prediction errors were divided into $m$ sets in accordance with the forecasted values. The value of $m$ is selected as 8, so that each set has more than 100 elements, i.e., $n(A_k) \geq 100$. Though, by the central limit theorem, a Gaussian approximation is commonly accepted for $n(A_k) \geq 30$, $n(A_k) \geq 100$ is chosen to obtain higher forecasting accuracy for more reliable power system operation [78]. Each set was fitted with a separate Gaussian distribution function, which is continuously updated with the test data. Note that the objective of the probabilistic forecasting in this paper is to obtain the range of the prediction error for the test data. For a given WPF value of the test data from the DNN, each data point is classified into $A_k$ according to the forecasted value. The PI for the test data is determined based on the mean and variance of the corresponding set, and the set is updated by merging the test data. For example, if the forecasted value of the first test data point, $\hat{y}^{te}_1$, is included in $A_j$, the PI for the forecasting error is determined from the mean and variance of $A_{j,1}$. After the real value is revealed, the forecasting error, $e^{te}_1$, can be calculated and $A_{j,1}$ is updated to $A_{j,2}$, which contains $[\hat{y}^{te}_1, e^{te}_1]$ as a new element. Meanwhile, the sets other than $A_j$ remain unchanged.
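The set-based procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the class name, the choice of bin boundaries, and the symmetric Gaussian interval (mean plus or minus a normal quantile times the sample standard deviation) are all assumptions, and only the errors are stored per set rather than the full pairs of forecasted value and error.

```python
import math
from statistics import NormalDist

class BinnedGaussianPI:
    """Per-set Gaussian error model, updated as new errors are observed."""

    def __init__(self, edges):
        # edges: ascending interior boundaries on the forecasted value,
        # giving len(edges) + 1 sets (the paper uses m = 8 sets).
        self.edges = list(edges)
        self.errors = [[] for _ in range(len(self.edges) + 1)]

    def bin_index(self, forecast):
        # Index k of the set A_k that the forecasted value falls into
        return sum(forecast >= e for e in self.edges)

    def update(self, forecast, error):
        # Merge the new error into its set; all other sets are unchanged
        self.errors[self.bin_index(forecast)].append(error)

    def interval(self, forecast, confidence=0.95):
        # Assumed PI form: mean +/- z * sample standard deviation
        errs = self.errors[self.bin_index(forecast)]
        n = len(errs)
        if n < 2:
            raise ValueError("need at least two errors in the set")
        mu = sum(errs) / n
        sigma = math.sqrt(sum((e - mu) ** 2 for e in errs) / (n - 1))
        z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
        return mu - z * sigma, mu + z * sigma
```

In use, the sets would first be filled from the training prediction errors with `update`; for each test point, `interval` is called with the forecasted value before the real value is known, and `update` is called once the error is revealed.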
To compare the performance of the proposed probabilistic forecasting with the standard bootstrap (SB) method, PI coverage probability (PICP), PI normalized averaged width (PINAW), and coverage width-based criterion (CWC) are adopted as performance indices [79]. Additionally, the Pinball loss is calculated to evaluate the overall performance. The Pinball loss serves as a comprehensive index of probabilistic forecasting performance, since it simultaneously evaluates the reliability, sharpness, and calibration. Physical meanings and mathematical equations of PICP, PINAW, and CWC can be found in [80]. The Pinball loss used in this paper for a given $\beta$ is formulated in terms of the estimated interval width, $\hat{W}^{te}_i$, and the difference between the prediction error and the historical prediction error mean, $w^{te}_i$. Note that the Pinball loss provided in Table VI is the average of the Pinball values with $\beta = 0.01, 0.05, 0.1, 0.2, \ldots, 0.9$. The two hyperparameters in CWC are set to 50 for $\eta$ and to the confidence level, $1 - \beta$, for $\mu$. As shown in Table VI, the proposed probabilistic method shows higher PICP than the SB method for both the 95% and 99% confidence levels, which implies that the actual values lie in the proposed range with higher probability. Though the PINAW of the proposed method is higher than that of the SB method, the lower values of CWC and Pinball loss imply that the proposed method derives more valid PIs than the SB method.
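For readers who want to reproduce indices of this kind, the sketch below shows commonly used definitions of PICP, PINAW, and CWC, together with the standard quantile (pinball) loss. These are assumptions rather than the paper's exact formulas: the definitions adopted by the authors are those in [80], the CWC form here (an exponential penalty applied only when PICP falls below $\mu$) is one widely used variant, and the pinball function is the textbook quantile loss, not the interval-width formulation used in this paper.

```python
import math

def picp(actual, lower, upper):
    """Fraction of actual values that fall inside [lower, upper]."""
    hits = sum(lo <= y <= hi for y, lo, hi in zip(actual, lower, upper))
    return hits / len(actual)

def pinaw(actual, lower, upper):
    """Mean interval width, normalized by the range of the actual values."""
    value_range = max(actual) - min(actual)
    mean_width = sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)
    return mean_width / value_range

def cwc(picp_value, pinaw_value, mu, eta=50.0):
    """Assumed CWC variant: width is penalized only when PICP < mu."""
    gamma = 1.0 if picp_value < mu else 0.0
    return pinaw_value * (1.0 + gamma * math.exp(-eta * (picp_value - mu)))

def pinball(actual, quantile_forecast, tau):
    """Standard pinball loss for the tau-quantile, averaged over samples."""
    total = 0.0
    for y, q in zip(actual, quantile_forecast):
        diff = y - q
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(actual)
```

With $\eta = 50$, even a small coverage shortfall inflates CWC sharply, which is why a method with wider intervals can still obtain the lower CWC when its coverage meets the nominal level.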
The average continuous ranked probability score (CRPS) in Table VI and the reliability diagram with sharpness in Fig. 15 affirm the high reliability of the proposed method. The reliability diagram of the proposed method is closer to the perfect reliability curve than that of the SB method, especially for forecast probabilities larger than 0.8. As remarked in [81], [82], small deviations from the diagonal with small bias in the sharpness verify the reliability of the proposed method. In addition, the smaller CRPS of the proposed method compared with the SB method indicates the adequacy of the Gaussian distributions postulated by the proposed method; the CRPS calculation for the Gaussian distribution is well documented in [83].
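For a Gaussian predictive distribution, the CRPS has a well-known closed form, $\sigma\,[z(2\Phi(z) - 1) + 2\varphi(z) - 1/\sqrt{\pi}]$ with $z = (y - \mu)/\sigma$. The sketch below evaluates it; the function names are hypothetical, and how the per-set means and standard deviations are supplied is left as an assumption.

```python
import math
from statistics import NormalDist

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS of the forecast N(mu, sigma^2) for observation y."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    std = NormalDist()  # standard normal, used for its cdf and pdf
    z = (y - mu) / sigma
    return sigma * (z * (2.0 * std.cdf(z) - 1.0)
                    + 2.0 * std.pdf(z)
                    - 1.0 / math.sqrt(math.pi))

def mean_crps(mus, sigmas, observations):
    """Average CRPS over a set of Gaussian forecasts."""
    scores = [crps_gaussian(m, s, y)
              for m, s, y in zip(mus, sigmas, observations)]
    return sum(scores) / len(scores)
```

A lower average CRPS indicates that the predictive distributions are both well centered and suitably sharp, which is the sense in which the score is used in Table VI.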

6) Forecasting Performance Comparison:
The WPF performance of the proposed method is compared with other methods, including the naive algorithm, ARIMA, Gaussian process (GP), support vector machine (SVM), and a 3-layer RNN model, and is summarized in Table VII. The naive algorithm is one of the simple physical forecasting methods, and a combination of seasonal and hourly naive algorithms is used for comparison [84]. ARIMA is used as a conventional statistical method, with the structure determined to be ARIMA(2,0,1) based on the autocorrelation and partial autocorrelation plots [85]. The GP and SVM are representative examples of shallow machine learning methods. For the GP, Matérn and White kernels are combined within a Bayesian framework, and a support vector regression model is used for the SVM [86], [87]. The 3-layer RNN represents a relatively small ANN.
As shown in Table VII, the machine learning methods outperform the physical and conventional statistical methods on all three metrics. The DM values of GP against the naive algorithm and ARIMA are −12.9265 and −11.7738, respectively. Thus, the GP performs significantly better than the naive algorithm and ARIMA. At the same time, SVM and the 3-layer RNN show lower test errors than GP. The DM values of SVM and RNN against GP are −6.0173 and −8.9571, respectively, which indicate significant differences. Though RNN performs slightly better than SVM, there is no significant difference between the two models because the DM value between SVM and RNN is 0.3362. Notably, the proposed model significantly improves on the forecasting accuracy of the SVM and the 3-layer RNN: the DM values of fused DRNet-1 against SVM and RNN are −2.5514 and −3.0518, respectively. Forecasted wind power profiles using all methods are shown in Fig. 16. It is noteworthy that the profiles with lower accuracy show longer time delays or larger peak value errors than those with higher accuracy.

D. Additional Use of Wind Speed Data
To investigate the impact of explicitly incorporating weather data, for example, wind speed data as an input, additional experiments on the wind speed and the ERCOT wind power output data from 2016 to 2017 are conducted. Wind power capacities by site in ERCOT are obtained from [88], and the wind speed data is drawn from [89]. The actual wind speed input set for the aggregated WPF is obtained by calculating a weighted arithmetic mean of the hourly wind speeds, where each region is weighted by its wind power capacity relative to the total capacity. For the impact analysis, the global attention mechanism is adopted and the location-based attention scores are compared as detailed in [90]. The WPF test results based on the attention mechanism and wind speed data are shown in Table VIII. The attention scores are represented in Fig. 17, where the x-axis indicates the input vector with a length of 24 hours, and the attention scores on the y-axis are the average of the intensified weights for the vector at a specific hour. When the attention mechanism is applied to the network with only wind power data, the accuracy of the network increases, as indicated in Table VIII. This is owing to the attention weights, which increase the impact of the more relevant inputs. On the other hand, the accuracy of the network using both wind power and wind speed data decreases with attention. Though the scores of the wind power are similar to those obtained using only wind power data, the scores of the 3-hour, 6-hour, and 11-hour ahead wind speed data turned out to be much higher than the others. Therefore, the performance of the network using both data is dominated by the wind speed data at specific points, rather than by the sequence of wind power or wind speed.
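The capacity-weighted aggregation of the regional wind speeds described above can be sketched as follows. The function name, the dictionary-based inputs, and the region labels in the usage note are hypothetical and serve only to illustrate the weighted arithmetic mean.

```python
def capacity_weighted_speed(speeds_by_region, capacity_by_region):
    """Hourly wind speed averaged over regions, weighted by capacity.

    speeds_by_region: region name -> list of hourly wind speeds
    capacity_by_region: region name -> installed wind power capacity
    """
    total_capacity = sum(capacity_by_region.values())
    regions = list(capacity_by_region)
    n_hours = len(speeds_by_region[regions[0]])
    return [
        sum(capacity_by_region[r] * speeds_by_region[r][t] for r in regions)
        / total_capacity
        for t in range(n_hours)
    ]
```

For instance, with two illustrative regions of 300 MW and 100 MW, the first region contributes three quarters of the aggregated wind speed at every hour.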
The following observations are drawn from the analysis: 1) Though the raw wind speed data is obtained from a reliable source and has been preprocessed to pair adequately with the wind power data, the wind speed was not measured at the same sites as the aggregate wind power being forecast. There are also unknown (or unexplainable) dynamics on top of the underlying physics between the wind speed and output power, which require an even higher dimensional model [91]-[93]. 2) Incorporating the wind speed data thus increases the uncertainty and may degrade the overall performance. Focusing on the wind power sequence with the attention mechanism yields higher performance for the 1-h ahead WPF, as the wind power data already embodies all of the dynamics the proposed model tries to identify. 3) A high attention score confirms the significant relationship between wind power and wind speed. Instead of being used as a direct input to the forecasting model, the wind speed data may be used as an auxiliary signal or indicator to improve the WPF performance.

V. CONCLUSION
This paper proposed a deep learning model for 1-h ahead WPF, in which the basic layer is a Bi-LSTM layer. The Bi-LSTM network has been widely investigated as a powerful forecasting model, as it can improve performance by eliminating propagated errors, especially when the number of parameters increases through residual learning. The increasing depth of the DNN, however, makes the LSTM prone to overfitting, which degrades the performance of the deep learning model. The proposed DRNets effectively resolve this technical challenge by concatenating the original input, shortcuts of the residuals, and the original activated input. The proposed activation scheme, using SeLU for the 1D CNN and ReLU for the Dense layer, further improves the overall accuracy. Significant improvements in peak value forecasting have been observed in the case study by using the fused network of short and long Bi-LSTM networks with DRNets. The consistently superior and reliable performance of the proposed model on various datasets demonstrates that the proposed method provides a general framework for time-series forecasting applications, especially in power grid operations.