Two-Stage Attention Over LSTM With Bayesian Optimization for Day-Ahead Solar Power Forecasting

The penetration of photovoltaics (PVs) into the power grid is increasing steadily, as they are economical and environmentally friendly. However, due to the intrinsic intermittency of solar radiation and other meteorological factors, the power generated by PVs is uncertain and unstable. Therefore, accurate forecasting of power generation is considered one of the fundamental challenges in power systems. In this paper, a deep-learning model based on a two-stage attention mechanism over LSTM is proposed to forecast day-ahead PV power. In addition, the Bayesian optimization algorithm is applied to obtain the optimal combination of hyper-parameters for the proposed deep-learning model. Various input features that can affect PV power generation, such as solar radiation, temperature, humidity, snowfall, and albedo, are considered, and their impact on the forecasted PV power is analyzed with respect to the attention mechanisms. The input consists of data from 21 PVs installed at different geographical locations in Germany. The proposed model is compared with state-of-the-art models for day-ahead forecasting, namely LSTM-Attention, CNN-LSTM, and an Ensemble model. The model is also compared with various single attention mechanisms such as Input-attention, SNAIL, Raffel, and Hierarchical attention. The proposed model outperforms these methods in terms of accuracy, demonstrating its effectiveness. The Forecasting Skill (FS) score of the proposed model is 0.4813, whereas that of the Ensemble model, the best among the other state-of-the-art models, is 0.4427. The Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) of the proposed model are 0.0638 and 0.0324, respectively, whereas those of the Ensemble model are 0.0685 and 0.0369, respectively.


I. INTRODUCTION
Due to increasing concerns about global environmental and economic challenges, the demand for clean energy such as photovoltaic and wind power is increasing [1]. According to the International Energy Agency (IEA), photovoltaics (PV) is one of the fastest-growing and most economical renewable energy resources [2]. PV power generation mainly depends on meteorological factors such as solar radiation, clouds, humidity, pressure, and temperature. Because these factors are intrinsically intermittent, PV power is also variable and uncertain, resulting in unstable fluctuations [3].
The associate editor coordinating the review of this manuscript and approving it for publication was Padmanabh Thakur.
Consequently, the inclusion of a large proportion of PV will cause grid oscillations [4]. Therefore, to reduce these uncertainties and fluctuations, accurate forecasting of PV output power is essential.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Numerous benefits can be obtained from hours-ahead to day-ahead solar power forecasting. Accurate forecasting can help companies avoid penalties [5]. Forecasting helps optimize the scheduling of supply offers to the market and hence increases revenues [6]. Operators can avoid problems in balancing generation and demand [7], which in turn improves system stability and reduces the costs of ancillary services [8], [9]. Furthermore, decisions on unit commitment, reserve requirement, and maintenance scheduling to obtain optimal operating cost can be made with accurate forecasting. For these reasons, accurate PV forecasting has been recognized as one of the fundamental challenges in power systems [10], [11].
Solar power forecasting models can be broadly categorized into physical, statistical, and machine-learning models [12]-[14]. Physical models are well-established methods that rely on Numerical Weather Prediction (NWP) data [15] and satellite images [16]. Although these methods achieve good accuracy, they require additional information from satellite images and cloud maps, resulting in higher operating and computational costs. In addition, these models must be designed for particular locations [12]. Statistical models are based on traditional regressive mathematical models such as linear regression and Autoregressive Integrated Moving Average (ARIMA) models. Since linear regression models build a linear mapping between the inputs and the target power [17], they cannot efficiently capture the non-linear relationship between the input features and the solar power output. Furthermore, the accuracy of these models decreases as the forecasting horizon increases [12].
To overcome these issues, various machine-learning models have been proposed [18]-[20]. Support vector regressors are machine-learning models that can capture non-linear relationships. Support Vector Regression (SVR), based on various weather information such as cloud cover and sunshine duration, has been used to forecast solar power in [18]. A hybrid model combining a genetic algorithm with a Support Vector Machine (SVM) has been proposed for solar power forecasting in [19], and it improved the results compared with a simple SVM. An ensemble model is a hybrid machine-learning model that combines various non-linear regression models. The ensemble model has been reported as the best machine-learning model in [20]. However, these models depend on predefined parameters and a predefined non-linear mapping. Therefore, it is difficult for them to capture the true underlying non-linear relationship between the inputs and the target values [13], [21], [22].
Recently, deep-learning models, which are advanced forms of traditional machine-learning techniques, have become very popular in various fields such as image processing [23], text translation [24], and time-series problems [25]. Solar power forecasting is a time-series problem in which the next time steps depend sequentially on past time steps through a non-linear relationship. Recurrent Neural Networks (RNNs) are deep-learning models specifically designed for time-series data [26]. The Non-linear Autoregressive Recurrent Neural Network (NARX) has been successfully applied to solar power forecasting [27], [28]. However, conventional RNNs suffer from exploding and vanishing gradients [29] and therefore cannot capture long-term dependencies. Long Short-Term Memory (LSTM) [30], an extension of the RNN, has been proposed to overcome these limitations. LSTM and the combination of LSTM with a Convolutional Neural Network (CNN) have been used with good accuracy in solar power forecasting [31]-[33].
Encoder-decoder networks based on LSTM are becoming popular deep-learning models in time-series forecasting, specifically for sequence-to-sequence mappings [34]-[38]. Therefore, these combinations of an encoder-decoder with LSTM can be regarded as state-of-the-art. Although these models work well with short sequences, their performance degrades as the sequence length increases [37]. This is a major concern in time-series forecasting, because predictions such as day-ahead solar power forecasting usually require long temporal sequences as well as many input features.
The attention mechanism is an extension of the encoder-decoder model specifically designed to improve performance on longer sequences [39]. In [39], solar power has been forecasted using a single self-attention over LSTM to capture important temporal states. However, single temporal attention mechanisms still struggle with data containing many input features and long temporal sequences. To address these time-series forecasting challenges, a two-stage attention mechanism has been used for stock price forecasting in [40].
In this paper, inspired by [40], a two-stage attention mechanism-based deep-learning model is applied to day-ahead solar power forecasting using multiple input features. The model is optimized using the Bayesian optimization algorithm to obtain the optimal combination of hyper-parameters. The major contributions of this paper are as follows:
1. A two-stage attention-based encoder-decoder over LSTM is applied to day-ahead solar power forecasting. First, an attention layer is applied to the input to focus on the more relevant features at a particular time. It is followed by a temporal attention layer that focuses on the relevant temporal hidden states of the LSTM units. Both attentions are applied over LSTM. The input data consist of 41 different input features from 21 different PV panels installed at different geographical locations in Germany. The paper analyzes the performance of the attention mechanism with respect to some important input features such as solar radiation, temperature, and snowfall, as well as with respect to temporal values.
2. Deep-learning models have several hyper-parameters that control their performance, and the combination that produces optimal results depends on the particular problem. Therefore, the Bayesian optimization algorithm has been applied to the two-stage attention-based deep-learning model to obtain the optimal combination of hyper-parameters.
3. A comparative study of the proposed method with the persistence model [13] and state-of-the-art methods such as LSTM [31], [32], CNN-LSTM [33], LSTM-Attention [39], and the Ensemble model [20] has been carried out to show the effectiveness of the proposed method. Furthermore, single attention mechanisms can be implemented via different techniques such as Input-Attention, Raffel, Hierarchical, and SNAIL attention [41]-[43]. These single attention techniques over LSTM have also been compared with the proposed method to show the effectiveness of the two-stage attention mechanism.
The paper is organized as follows. Section II explains the two-stage attention mechanism over LSTM for day-ahead solar power forecasting. Section III describes the experiments. Results are given in Section IV. Section V gives the qualitative discussions and conclusion.

II. TWO-STAGE ATTENTION MODEL FOR SOLAR POWER FORECASTING
A. MODEL SUMMARY
Solar power is highly dependent on meteorological features such as solar radiation, temperature, humidity, and snowfall. During normal conditions, it closely follows the trend of features like solar radiation. However, during extreme conditions like snowfall and albedo, the power production is almost zero. Therefore, an attention mechanism is required at the input to focus on the features that are more relevant at a particular time. Similarly, solar power forecasting is a time-series problem in which the next time step is correlated with the outputs of past time steps. Therefore, the relevant time sequences must also be focused on with an attention mechanism. Considering these objectives, in this paper the two-stage attention-based encoder-decoder model over LSTM is applied to day-ahead solar power forecasting. First, encoder-based attention is applied to the input features to focus on the important features at a particular time. Then, decoder-based temporal attention is applied to the hidden states of the encoder's LSTM to extract the important temporal states. Finally, a linear layer is added at the output to predict day-ahead solar power. The whole model is trained with the standard backpropagation algorithm. The complete model is shown in Fig. 1.

Let $x_t = (x_t^1, x_t^2, \ldots, x_t^n) \in \mathbb{R}^n$ be a series of $n$ input data features at time $t$, and let $x^k = (x_1^k, x_2^k, \ldots, x_L^k) \in \mathbb{R}^L$ be the series of the $k$-th input feature over a window of $L$ time steps. Then, the series of $n$ input features over the window of $L$ time steps can be expressed as $(x^1, x^2, \ldots, x^n)^\top = (x_1, x_2, \ldots, x_L) \in \mathbb{R}^{n \times L}$. The input data consist of historical solar power data and weather data. The target is day-ahead solar power. Given the past values of the output $(y_1, y_2, \ldots, y_L)$ together with the input $(x_1, x_2, \ldots, x_L)$, the complete model to predict day-ahead solar power can be expressed by the following function $F$:

$$\hat{y}_{L+1}, \hat{y}_{L+2}, \ldots, \hat{y}_{L+l} = F(y_1, y_2, \ldots, y_L, x_1, x_2, \ldots, x_L) \quad (1)$$

where $\hat{y}_{L+1}, \hat{y}_{L+2}, \ldots, \hat{y}_{L+l}$ are the solar power values to be predicted $l$ time steps ahead.
Using the proposed method, forecasting can be carried out for any horizon. In this paper, eight-step-ahead forecasting at a resolution of three hours is performed to obtain day-ahead solar power.
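The windowing implied by the formulation above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' preprocessing code: the function name `make_windows`, the toy data, and the choice to slide the window one step at a time are all hypothetical, and any NWP forecasts for the target period that the model may additionally use are left out.

```python
def make_windows(features, power, lookback, horizon=8):
    """Pair each lookback window with the next `horizon` power values.

    features: list of T per-time-step feature lists (each of size n)
    power:    list of T power values
    Returns a list of (x_window, y_history, y_target) tuples.
    """
    samples = []
    last_start = len(power) - lookback - horizon
    for start in range(last_start + 1):
        mid = start + lookback
        x_window = features[start:mid]        # x_1 .. x_L
        y_history = power[start:mid]          # y_1 .. y_L
        y_target = power[mid:mid + horizon]   # y_{L+1} .. y_{L+l}
        samples.append((x_window, y_history, y_target))
    return samples


# Toy series: 3 days at 8 steps per day (3-hour resolution), 2 features per step.
T = 24
features = [[float(t), float(t % 8)] for t in range(T)]
power = [float(t) for t in range(T)]

samples = make_windows(features, power, lookback=2, horizon=8)
x_window, y_history, y_target = samples[0]
print(len(samples), len(x_window), len(y_target))
```

With a horizon of eight steps at three-hour resolution, each target vector covers exactly one day.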

C. LSTM UNIT
Both the input and the temporal attention-based encoder-decoder are applied over LSTM units. An LSTM unit is shown in Fig. 3. An LSTM unit consists of a hidden state $h_t$, which is the output of the unit, and an internal state or cell state $c_t$, which serves as the memory of the unit. It also contains three gates: the input gate $i_t$, the forget gate $f_t$, and the output gate $o_t$. The input gate controls the amount of current information to be passed. The forget gate controls which information is retained and which is forgotten, and the output gate defines the internal state information that needs to be passed on.
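Provided that $x_t$ is the input at time $t$ and $h_{t-1}$ is the previous hidden state of the LSTM, the following chain of equations can be used to obtain the current hidden state $h_t$ of the LSTM unit:

$$
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}; x_t] + b_f) \\
i_t &= \sigma(W_i [h_{t-1}; x_t] + b_i) \\
o_t &= \sigma(W_o [h_{t-1}; x_t] + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c [h_{t-1}; x_t] + b_c) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned} \quad (2)
$$

where $\sigma$ is the logistic sigmoid function, $\odot$ denotes element-wise multiplication, $[h_{t-1}; x_t]$ is the concatenation of the previous hidden state and the current input, and the $W$ and $b$ terms are the weights and biases to be trained. Using (2), an LSTM unit can be expressed by the following non-linear function $f$:

$$h_t = f(h_{t-1}, x_t) \quad (3)$$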
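To make the gate description concrete, the following is a minimal sketch of one LSTM update in plain Python, assuming the standard LSTM formulation with the gates acting on the concatenation of the previous hidden state and the current input. The parameter names (`Wf`, `bf`, and so on), the toy sizes, and the constant weights are assumptions for illustration only and are not taken from the paper.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, b, v):
    """Return W v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM update: returns (h_t, c_t) from x_t, h_{t-1}, c_{t-1}."""
    z = h_prev + x_t  # list concatenation [h_{t-1}; x_t]
    f = [sigmoid(v) for v in affine(params["Wf"], params["bf"], z)]    # forget gate
    i = [sigmoid(v) for v in affine(params["Wi"], params["bi"], z)]    # input gate
    o = [sigmoid(v) for v in affine(params["Wo"], params["bo"], z)]    # output gate
    g = [math.tanh(v) for v in affine(params["Wc"], params["bc"], z)]  # candidate state
    c_t = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h_t = [ot * math.tanh(ct) for ot, ct in zip(o, c_t)]
    return h_t, c_t

# Toy sizes: 2 input features, 3 hidden units; every weight is 0.1 for illustration.
n_in, n_hid = 2, 3
params = {}
for gate in "fioc":
    params["W" + gate] = [[0.1] * (n_hid + n_in) for _ in range(n_hid)]
    params["b" + gate] = [0.0] * n_hid

h, c = [0.0] * n_hid, [0.0] * n_hid
for x_t in ([0.5, 0.2], [0.6, 0.1]):
    h, c = lstm_step(x_t, h, c, params)
print(h, c)
```

In practice a deep-learning framework would supply this cell; the explicit version only shows how the three gates combine the previous state with the new input.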

D. INPUT ATTENTION BASED ENCODER
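To extract the relevant input features from the input series $x^k$, input attention is applied with the encoder, as shown in Fig. 1. The input attention is computed from $x^k$ and the previous hidden and cell states of the encoder's LSTM, $h_{t-1}$ and $c_{t-1}$, using (4) and (5) as follows:

$$\varepsilon_t^k = z_\varepsilon^\top \tanh\left(W_\varepsilon [h_{t-1}; c_{t-1}] + U_\varepsilon x^k\right) \quad (4)$$

$$\alpha_t^k = \frac{\exp(\varepsilon_t^k)}{\sum_{i=1}^{n} \exp(\varepsilon_t^i)} \quad (5)$$

where $z_\varepsilon$, $W_\varepsilon$, and $U_\varepsilon$ are the parameters to be trained, and $\alpha_t^k$ is an attention weight that shows the importance of the $k$-th input feature at time $t$. To keep the sum of all the attention weights equal to 1, a softmax activation is applied to $\varepsilon_t^k$. This attention mechanism gives the important features more weight rather than treating all the inputs equally. A new input series can be extracted with these attention weights using (6). This new input is fed to update the encoder's LSTM hidden state of (3), as shown by (7):

$$\tilde{x}_t = (\alpha_t^1 x_t^1, \alpha_t^2 x_t^2, \ldots, \alpha_t^n x_t^n)^\top \quad (6)$$

$$h_t = f(h_{t-1}, \tilde{x}_t) \quad (7)$$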
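The input-attention step described above can be sketched as follows, assuming the additive (tanh) scoring form commonly used for this kind of attention. The function and variable names, the matrix shapes, and the toy weights are hypothetical and chosen only so that the example runs; they are not the trained parameters of the proposed model.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def input_attention(series, x_t, h_prev, c_prev, z, W, U):
    """Return (alpha, x_tilde) for one time step.

    series: the n feature series x^k, each of length L
    x_t:    the n feature values at time t
    """
    state_term = matvec(W, h_prev + c_prev)          # W [h_{t-1}; c_{t-1}]
    scores = []
    for x_k in series:
        feat_term = matvec(U, x_k)                   # U x^k
        hidden = [math.tanh(a + b) for a, b in zip(state_term, feat_term)]
        scores.append(sum(zi * hi for zi, hi in zip(z, hidden)))
    alpha = softmax(scores)                          # attention weights sum to 1
    x_tilde = [a * x for a, x in zip(alpha, x_t)]    # re-weighted input
    return alpha, x_tilde

# Toy setup: n = 3 features, window L = 4, encoder hidden size m = 2.
L, m = 4, 2
series = [[0.1, 0.2, 0.3, 0.4], [0.9, 0.8, 0.7, 0.6], [0.0, 0.0, 0.0, 0.0]]
x_t = [0.4, 0.6, 0.0]
h_prev, c_prev = [0.1, -0.1], [0.2, 0.0]
z = [1.0] * L
W = [[0.1] * (2 * m) for _ in range(L)]
U = [[1.0 if i == j else 0.0 for j in range(L)] for i in range(L)]

alpha, x_tilde = input_attention(series, x_t, h_prev, c_prev, z, W, U)
print(alpha, x_tilde)
```

With these toy weights the second feature, which has the largest values over the window, receives the largest attention weight.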

E. TEMPORAL ATTENTION BASED DECODER
The decoder model is designed to extract important temporal hidden states and to make the final output prediction.
As the length of the input series increases, the results of the traditional encoder-decoder deteriorate. Therefore, after the attention encoder, a temporal attention-based decoder is applied to select the relevant hidden states of the encoder from all time steps, as shown in Fig. 1. The attention weight of each encoder hidden state is calculated using the previous hidden state of the decoder's LSTM, $d_{t-1}$, and its cell state, $c'_{t-1}$, as in (8) and (9):

$$\gamma_t^i = z_d^\top \tanh\left(W_d [d_{t-1}; c'_{t-1}] + U_d h_i\right) \quad (8)$$

$$\beta_t^i = \frac{\exp(\gamma_t^i)}{\sum_{j=1}^{L} \exp(\gamma_t^j)} \quad (9)$$

where $z_d$, $W_d$, and $U_d$ are the weights to be trained, and $\beta_t^i$ represents the importance of the $i$-th encoder hidden state at time $t$. Since each encoder hidden state $h_i$ is mapped to a temporal component of the input, the attention mechanism calculates the context vector $v_t$ as a weighted sum of all the hidden states of the encoder $(h_1, h_2, \ldots, h_L)$ using (10). This vector is different at each time step. The vector is then concatenated with the target values using (11). Then, using the decoder's hidden state $d_{t-1}$ and the newly concatenated value $\tilde{y}_{t-1}$, the decoder's new hidden state $d_t$ is obtained with the decoder's LSTM non-linear function $f$ based on (3), as shown in (12):

$$v_t = \sum_{i=1}^{L} \beta_t^i h_i \quad (10)$$

$$\tilde{y}_{t-1} = \tilde{w}^\top [y_{t-1}; v_{t-1}] + \tilde{b} \quad (11)$$

$$d_t = f(d_{t-1}, \tilde{y}_{t-1}) \quad (12)$$

where $\tilde{w}$ and $\tilde{b}$ are the weights and biases that map the concatenation.
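The temporal attention step can be sketched in the same style: each encoder hidden state is scored against the previous decoder state, the scores are normalized with a softmax, and the context vector is the resulting weighted sum. The additive scoring form, the names, and the toy values below are assumptions for illustration and are not the authors' implementation.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def temporal_attention(enc_states, d_prev, c_prev, z, W, U):
    """Return (beta, context) given the encoder states h_1 .. h_L."""
    state_term = matvec(W, d_prev + c_prev)          # W [d_{t-1}; c'_{t-1}]
    scores = []
    for h_i in enc_states:
        hidden = [math.tanh(a + b) for a, b in zip(state_term, matvec(U, h_i))]
        scores.append(sum(zi * hi for zi, hi in zip(z, hidden)))
    beta = softmax(scores)                           # one weight per time step
    size = len(enc_states[0])
    context = [sum(b * h[j] for b, h in zip(beta, enc_states)) for j in range(size)]
    return beta, context

# Toy setup: L = 3 encoder states of size 2, decoder hidden size 2.
enc_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
d_prev, c_prev = [0.0, 0.1], [0.2, -0.1]
z = [1.0, 1.0]
W = [[0.1] * 4 for _ in range(2)]
U = [[1.0, 0.0], [0.0, 1.0]]

beta, context = temporal_attention(enc_states, d_prev, c_prev, z, W, U)
print(beta, context)
```

Because the weights are non-negative and sum to 1, each component of the context vector lies between the smallest and largest corresponding component of the encoder states.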

F. OUTPUT AND TRAINING MECHANISM
The final output follows the decoder and consists of a linear layer that predicts day-ahead solar power. The final layer predicts $l$ time steps ahead. The complete model can be expressed as follows:

$$\hat{Y}_t = (\hat{y}_{L+1}, \hat{y}_{L+2}, \ldots, \hat{y}_{L+l}) = F(y_1, y_2, \ldots, y_L, x_1, x_2, \ldots, x_L) \quad (13)$$

where $\hat{Y}_t$ is the solar power to be predicted, and $W_y$ and $z_y$ are the weights of the final linear layer to be trained. In this paper, $l$ is taken as eight to predict day-ahead solar power at a resolution of three hours. The whole model is trained using the standard backpropagation algorithm with the objective function defined by the Mean Square Error (MSE):

$$\mathrm{MSE} = \frac{1}{N} \sum_{t=1}^{N} \left(\hat{Y}_t - Y_t\right)^2 \quad (14)$$

where $N$ is the number of training samples and $Y_t$ denotes the actual values.
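As a small illustration of the training objective, the sketch below computes the MSE over multi-step forecasts. Averaging over both the samples and the forecast steps is an assumption made here for simplicity; the function name and toy values are hypothetical.

```python
def mse_loss(predicted, actual):
    """Mean squared error over N samples, each a list of l forecast steps."""
    total, count = 0.0, 0
    for y_hat, y in zip(predicted, actual):
        for p, a in zip(y_hat, y):
            total += (p - a) ** 2
            count += 1
    return total / count

# Two samples with two forecast steps each (normalized power).
predicted = [[0.1, 0.2], [0.3, 0.4]]
actual = [[0.1, 0.0], [0.5, 0.4]]
loss = mse_loss(predicted, actual)
print(loss)
```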

III. EXPERIMENTS
A. DATA
The model is trained and tested on data from 21 different PV facilities installed at different geographical locations in Germany [44]. These facilities range from rooftop installations to fully-fledged solar farms. Each dataset consists of NWP data and historical power data at a resolution of three hours for 990 days. The nominal power of the PVs ranges between 100 kW and 8500 kW. Out of the 990 days, 490 days are used for training, 250 days for validation, and 250 days for testing. The data are normalized after splitting. All input data except the output power are normalized between 0 and 1. The output power, i.e., the target value, is normalized according to the power capacity of the respective PV facility.
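The split and normalization described above can be sketched as follows. Several details are assumptions rather than statements from the paper: the split is taken to be chronological, the min-max bounds are fitted on the training portion only (so later values may fall slightly outside [0, 1]), and all function names are hypothetical.

```python
STEPS_PER_DAY = 8  # three-hour resolution

def split_by_days(rows, train_days=490, val_days=250, test_days=250):
    """Split a time-ordered list into train, validation, and test portions."""
    a = train_days * STEPS_PER_DAY
    b = a + val_days * STEPS_PER_DAY
    c = b + test_days * STEPS_PER_DAY
    return rows[:a], rows[a:b], rows[b:c]

def fit_min_max(train_column):
    """Return the (min, max) bounds of one input feature on the training data."""
    return min(train_column), max(train_column)

def apply_min_max(column, lo, hi):
    """Scale a feature column with previously fitted bounds."""
    span = hi - lo
    if span == 0:
        return [0.0 for _ in column]
    return [(v - lo) / span for v in column]

def normalize_power(power_kw, capacity_kw):
    """Express the power output as a fraction of the facility's capacity."""
    return [p / capacity_kw for p in power_kw]

rows = list(range(990 * STEPS_PER_DAY))
train_rows, val_rows, test_rows = split_by_days(rows)

lo, hi = fit_min_max([5.0, 15.0, 25.0])
scaled = apply_min_max([5.0, 15.0, 25.0, 30.0], lo, hi)
power_norm = normalize_power([0.0, 50.0, 100.0], capacity_kw=100.0)
print(len(train_rows), len(val_rows), len(test_rows), scaled, power_norm)
```

Normalizing the target by capacity keeps the errors of facilities of very different sizes on a comparable scale.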

B. BAYESIAN OPTIMIZATION
Hyper-parameter optimization is essential for the optimal performance of the model. Traditionally, manual, grid, and random search techniques have been used for tuning hyper-parameters [45]. Manual methods are time-consuming and depend on human expertise, while the efficiency of grid search decreases as the number of hyper-parameters increases. In random search, combinations of parameters are sampled randomly from a statistical distribution given by the user, which may miss the optimal points in the search space. Bayesian optimization [45] uses past evaluations to select the hyper-parameters to evaluate next. By selecting its parameters in an informed manner, it can focus on the more promising areas of the parameter space. It has three main parts: the search space from which parameters are sampled, the objective function, and the surrogate. It builds a probability model of the objective function and uses it to select the most promising hyper-parameters at which to evaluate the true objective function.
Table 1 shows the ranges of the hyper-parameters to be optimized in the proposed model. Seven parameters of the proposed model have been optimized using the Bayesian optimization algorithm. The table also shows the optimal set of values obtained after applying the Bayesian optimization. Fig. 4 shows the convergence of the optimization algorithm, i.e., the points at which the search converges to the optimal combination of hyper-parameters. These points show the optimal values of the loss function.
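The loop described above (surrogate, informed selection, true evaluation) can be sketched on a one-dimensional toy problem. The paper does not state which surrogate or acquisition function was used, so the Gaussian-process surrogate with an RBF kernel and the expected-improvement criterion below are assumptions; `toy_loss` is a hypothetical stand-in for the validation loss of the forecasting model, and the candidate grid stands in for the search space of Table 1.

```python
import math
import random

def rbf(a, b, length=0.2):
    """Squared-exponential kernel between two scalars."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def gp_posterior(xs, ys, x_new, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    mean = sum(k * w for k, w in zip(k_star, solve(K, ys)))
    var = 1.0 - sum(k * w for k, w in zip(k_star, solve(K, k_star)))
    return mean, math.sqrt(max(var, 1e-12))

def expected_improvement(mean, std, best):
    """Expected improvement over `best` when minimizing."""
    z = (best - mean) / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best - mean) * cdf + std * pdf

def bayes_opt(objective, candidates, n_init=3, n_iter=10, seed=0):
    rng = random.Random(seed)
    xs = rng.sample(candidates, n_init)       # initial random evaluations
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        offset = sum(ys) / len(ys)            # center targets for the zero-mean GP
        centered = [y - offset for y in ys]
        best = min(centered)
        remaining = [x for x in candidates if x not in xs]

        def score(x):
            mean, std = gp_posterior(xs, centered, x)
            return expected_improvement(mean, std, best)

        x_next = max(remaining, key=score)    # most promising point under the surrogate
        xs.append(x_next)
        ys.append(objective(x_next))          # evaluate the true objective there
    i_best = min(range(len(ys)), key=ys.__getitem__)
    return xs[i_best], ys[i_best], xs

def toy_loss(x):
    """Hypothetical stand-in for a validation loss with its minimum at x = 0.3."""
    return (x - 0.3) ** 2

candidates = [i / 50 for i in range(51)]
best_x, best_y, evaluated = bayes_opt(toy_loss, candidates)
print(best_x, best_y, len(evaluated))
```

For a real model with seven hyper-parameters, an established optimization library would normally replace this hand-written surrogate; the sketch only shows how past evaluations steer the choice of the next trial.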

C. MODELS FOR COMPARISON
The persistence model has been used as the benchmark model [13]. In this model, the forecasted PV power output is assumed to remain the same at the same time of the previous or following day [13]. Various state-of-the-art models, namely LSTM, LSTM-Attention [39], CNN-LSTM [33], and the Ensemble method [20], have been implemented on the dataset for comparison with the proposed model in day-ahead forecasting. All the comparative models have been optimized to obtain their optimal sets of parameters. Table 2 shows the optimized parameters of the different models. The attention mechanism can be implemented via different techniques such as Raffel [41], Hierarchical [42], and SNAIL [43] attention. These single temporal attention techniques have been applied over the LSTM hidden states and are compared with the proposed method to assess the performance of the two-stage attention mechanism. The Input-Attention model, with attention only on the input features, has also been compared with the proposed method to emphasize the importance of the combined effect of the two stages of attention.
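The persistence benchmark can be sketched in a few lines: the forecast for each day is simply the measured profile of the previous day at the same times. The function name and the toy power values are hypothetical, and the eight steps per day follow from the three-hour resolution of the dataset.

```python
STEPS_PER_DAY = 8  # three-hour resolution

def persistence_forecasts(power):
    """Return (forecast, actual) pairs for every day after the first."""
    days = [power[i:i + STEPS_PER_DAY]
            for i in range(0, len(power) - STEPS_PER_DAY + 1, STEPS_PER_DAY)]
    # The forecast for day d is the measured profile of day d - 1.
    return [(days[d - 1], days[d]) for d in range(1, len(days))]

# Three toy days of normalized power.
day1 = [0.0, 0.0, 0.2, 0.6, 0.8, 0.5, 0.1, 0.0]
day2 = [0.0, 0.0, 0.1, 0.4, 0.5, 0.3, 0.1, 0.0]
day3 = [0.0, 0.0, 0.3, 0.7, 0.9, 0.6, 0.2, 0.0]
pairs = persistence_forecasts(day1 + day2 + day3)
print(len(pairs), pairs[0])
```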

D. EVALUATION
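In this paper, Root Mean Square Error (RMSE), Mean Absolute Error (MAE), $R^2$ score, and the correlation coefficient are used to evaluate the performance of the models. Furthermore, the Forecast Skill (FS) score is used to compare the models with the benchmark, i.e., the persistence model. The definition of the FS score differs across the literature; this paper adopts the FS score from [46]. With $N$ samples and $\bar{y}$ denoting the mean of the actual values, the metrics are defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y'_i - y_i)^2}$$

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y'_i - y_i \right|$$

$$R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - y'_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$$

$$\mathrm{FS} = 1 - \frac{\mathrm{RMSE}_{\mathrm{model}}}{\mathrm{RMSE}_{\mathrm{persistence}}}$$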
where $y'$ and $y$ are the predicted and actual values, respectively. The correlation coefficient is the Pearson correlation coefficient of the predicted and actual values.

IV. RESULTS
Table 3 and Table 4 show the comparison of the proposed model with different state-of-the-art models for day-ahead solar power forecasting. Table 3 shows the RMSE of all the models for each PV panel. From this table, it can be seen that the proposed model considerably outperforms all other models. Similarly, Table 4 shows the average values of RMSE, MAE, correlation coefficient, and $R^2$ score of all models. RMSE and MAE measure the forecasting error, and Table 4 shows that the RMSE and MAE of the proposed model are the lowest among all the models. The $R^2$ score and the correlation coefficient measure the accuracy, and both are the highest for the proposed model, as shown in Table 4.
In order to show the effectiveness of the proposed two-stage attention mechanism, the model has also been compared with various single attention mechanisms. SNAIL, Raffel, and Hierarchical attention are applied over the LSTM hidden states to focus only on temporal sequences. In the Input-Attention model, an attention layer is applied to the input features to focus only on the important features. The comparison of the proposed model with the various single attention mechanisms is given in Table 5, which shows that the combination of two stages of attention is more accurate than the single attention mechanisms for day-ahead solar power forecasting.
The FS score is the criterion used to assess the performance of the different forecasting models with respect to the persistence model. A higher FS score indicates better performance. The FS scores of all models are shown in Fig. 5. The figure shows that the FS score of the proposed model is the highest among all the models. Among the other models, the Ensemble model has the best FS score.
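The error metrics and the FS score used in this comparison can be sketched as follows. The FS form used here, one minus the ratio of the model RMSE to the persistence RMSE, is a common definition and is an assumption in this sketch, since the exact definition adopted from [46] is not reproduced here; the function names and toy values are hypothetical.

```python
import math

def rmse(pred, actual):
    """Root mean square error between predicted and actual values."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))

def mae(pred, actual):
    """Mean absolute error between predicted and actual values."""
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(actual)

def forecast_skill(rmse_model, rmse_persistence):
    """Skill relative to the persistence benchmark (assumed RMSE-ratio form)."""
    return 1.0 - rmse_model / rmse_persistence

actual = [0.0, 0.2, 0.6, 0.4]
model_pred = [0.0, 0.3, 0.5, 0.4]
persistence_pred = [0.0, 0.4, 0.3, 0.1]

fs = forecast_skill(rmse(model_pred, actual), rmse(persistence_pred, actual))
print(rmse(model_pred, actual), mae(model_pred, actual), fs)
```

Under this definition, a score of 0 means the model is no better than persistence, and a score of 1 would correspond to a perfect forecast.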
This paper considers 41 input features. The performance of the model with respect to some important features, namely the sine of the hour of the day, the sine of the month of the year, direct solar radiation, temperature, snowfall, and albedo, is shown in Fig. 6. This figure shows that the input attention gives more weight to the input features that have more influence on the target. The hour of the day correlates strongly with the target power, as can be seen from Fig. 6(a). Similarly, Fig. 6(b) shows the impact of the month of the year. Summer months have a high impact on the output power, whereas winter months have the least impact. The temperature in summer is higher than in winter, which also influences the power production, as shown in Fig. 6(d). Solar radiation is the input most correlated with the output. The output power closely follows the trend of solar radiation, as shown in Fig. 6(c), unless harsh weather conditions such as snow or albedo are encountered, as shown in Fig. 6(e) and (f). The model shows that the PV performance decreases as snowfall increases. Albedo, which accounts for the reflection from the panel, also has an effect on the output power. Fig. 6 also shows the performance of the model with respect to temporal values. The model assigns more weight when the lookback is 2 than when the lookback is 1, which means that the model learns better with a lookback of 2.

V. DISCUSSIONS AND CONCLUSION
Solar power forecasting is a time-series problem with a non-linear relationship between inputs and targets. Traditional methods either carry out a linear mapping or fail to handle long-term temporal dependencies. Although extensions of the RNN such as LSTM with auto-encoders can handle long-term dependencies, their performance deteriorates as the number of input features and the sequence length increase.
In this paper, to address these issues, day-ahead solar power forecasting has been carried out using a two-stage attention-based encoder-decoder model. The model applies two stages of attention over LSTM. In the first stage, encoder-based attention is applied to the input, focusing on the important features at a particular time. In the second stage, decoder-based temporal attention is applied to focus on the important hidden states of the encoder. The results show that this combination of two stages of attention with the encoder-decoder model addresses these time-series forecasting problems effectively.
The FS score shows the effectiveness of a forecasting model with respect to the persistence model. The FS score of the proposed model is better than that of the other models, as shown in Fig. 5. Table 3 compares the RMSE of the proposed model with that of the state-of-the-art models for each PV panel. It can be seen from this table that the proposed model outperforms the traditional methods considerably. Among the compared models, the ensemble method, which combines various machine-learning techniques, shows better results than the others due to the combined effect of its constituent models. However, the accuracy of the proposed method is higher still, because it addresses both the many input features and the long temporal sequences of the time series. This effectiveness can further be seen from Table 4, where the average values of RMSE, MAE, $R^2$ score, and correlation coefficient of the proposed method are considerably better than those of the traditional methods.
Different techniques can be used to apply attention mechanisms, such as the Raffel, SNAIL, and Hierarchical techniques. These attention mechanisms have been applied over the LSTM hidden states to provide temporal attention only. In addition, the Input-Attention model, with attention only on the input features, has been considered to provide input feature selection only. All these attention mechanisms are compared with the proposed method in Table 5 to show the effectiveness of the two-stage attention mechanism. From this table, it can be seen that the combination of input attention and temporal attention achieves higher accuracy than temporal attention or input attention alone.
The two-stage attention mechanism focuses on the more relevant input features as well as the more relevant temporal hidden states, which can be seen from Fig. 6. This figure shows that during normal weather conditions, the output power follows the trend of features like the hour of the day and solar radiation, and these features receive more attention weight accordingly. However, under extreme weather conditions like snowfall or albedo, the power production is almost zero, and these features are weighted accordingly. Similarly, it can also be seen from Fig. 6 that the model assigns more weight to the more relevant temporal values. For instance, the model assigns more temporal weight when the lookback is 2 than when the lookback is 1. This means that the model learns better with a lookback of 2 than with a lookback of 1.
The paper has some limitations to be addressed in future work. The proposed model consists of two layers of attention with encoder-decoder layers over LSTM, and therefore requires more layers and parameters to be trained than the other models. As a result, the proposed model is slower than the other models, which can be seen in Table 6. Furthermore, like all day-ahead forecasting models, this model relies on forecasted weather data, so its accuracy depends on the accuracy of the weather forecasts. The aim of the paper is to show that, under the same conditions, the proposed model performs better than the other state-of-the-art models.
Since the proposed model is accurate in forecasting, it can be applied to different multi-horizon forecasting applications, such as microgrid demand response forecasting considering market participation, and electric vehicle load and charge or discharge forecasting. Considering the ability of the model to focus on the more relevant features, it can also be applied to applications such as event management, fault identification, faulty equipment identification, intrusion detection, and power disturbance classification.