Probabilistic and Deterministic Wind Speed Prediction: Ensemble Statistical Deep Regression Network

Wind energy as one of the most promising energy alternatives brings a set of serious challenges in the operation of power systems because of the uncertain nature of wind speed. To address this problem, it is essential to establish a framework to forecast a comprehensive form of information about the wind speed. To this end, an ensemble residual regression deep network is designed to understand fully time-variant and spatial features from the historical data including wind speed and corresponding meteorological data. Then, to enhance the accuracy, a modified error-based loss function is proposed. Consequently, to provide a comprehensive form of information, a modified kernel density estimator is proposed to extract a set of probability density functions (PDFs) with a high level of accuracy and reliability. The simulation results and a comparative analysis on an actual dataset in London, U.K. demonstrate the high capability of the proposed probabilistic wind speed approach.


I. INTRODUCTION
A. Motivation and Background
Wind energy is one of the fastest growing energy resources in the 21st century, and the global capacity of wind energy plants increased by about 400 GW during the 15-year period between 2000 and 2015 [1], [2]. Operating power systems that include wind energy conversion systems brings serious risks to power system stability and to the economic integration of wind generation because of the volatility, intermittency, and dependence on meteorological factors of wind. To accommodate the challenges brought by the inherent intermittency of wind power, precise forecasting and provision of a complete form of information regarding the wind speed in the look-ahead times is a potential solution [3], [4]. Although wind speed prediction has been widely investigated in previous studies, several issues have not been comprehensively addressed, including: i) the lack of a structure that can accurately predict the wind speed based on raw data; ii) the inability to provide full statistical information with a high level of accuracy and reliability without any prior knowledge of the distribution function; and iii) the limited ability to learn the uncertainty pattern and spatial-temporal features. These issues motivate us to establish a probabilistic approach for wind speed forecasting based on a combination of advanced deep networks and the nonparametric probability density function (PDF) estimators developed in this paper.

B. Literature Review
The previous studies on wind speed prediction can be grouped based on the forecasting model and the aim of forecasting. In terms of the forecasting model, the previous studies can be grouped into persistence, mathematical model-based, statistical, artificial intelligence, and hybrid models.
1) Persistence models: The persistence models are based on a simple assumption: the value of the wind speed at the next time interval is considered equal to the wind speed at the current time. The accuracy of this model degrades significantly as the number of forecasting time steps increases, and it is only used as a simple benchmark in the previous studies.
2) Mathematical models: The mathematical models are constructed based on mathematical relationships between the wind speed at the current and previous time intervals and in the next time intervals. For instance, numerical weather prediction (NWP) is a well-known mathematical model that was used in [5] to predict the wind speed. However, finding a mathematical expression for highly nonlinear and complex time series, such as the wind speed, is too difficult and associated with a high computational burden. Therefore, mathematical model-based approaches are not well suited to short-term wind speed forecasting.
3) Statistical models: In statistical approaches, the difference between values in the past and current time intervals is used as the main principle to provide a linear model between the predicted values and the past values. The main family of the statistical methods is the autoregressive moving average (ARMA) based models. For instance, the fractional autoregressive integrated moving average (f-ARIMA) is developed in [6] for the day-ahead wind speed prediction. The autoregressive integrated moving average with exogenous variables (ARIMAX) [7] and the generalized autoregressive conditional heteroskedasticity (GARCH) [8] are also used for wind speed prediction.
Although statistical models can be easily implemented as low-cost wind speed forecasting models, they are not capable of capturing nonlinear and complex time series.
4) Artificial intelligence (AI) based models: The AI-based methods work based on an offline training process and an online working process. The AI-based methods are intelligent structures that can learn from historical data and operate on unseen data [9]. The AI-based methods are categorized into shallow and deep models. A shallow network consists of one input, one hidden, and one output layer, while the number of hidden layers is more than one in the deep models [10]. Shallow networks are widely used as the wind speed prediction model; one of the widely used shallow networks for wind speed prediction is the artificial neural network (ANN), which is studied in [11]. In [12], an optimal ANN is presented for wind speed prediction. A multistep wind speed prediction model is constructed in [13] based on the ANN. The support vector machine (SVM) is another well-known shallow method that is used in wind speed forecasting. An optimal SVM-based wind speed prediction model is investigated in [14], and an SVM-based solar irradiation prediction model is investigated in [15]. The k-nearest neighbor (kNN) is used in [16] to predict the wind speed in look-ahead times. The fuzzy logic-based wind speed forecasting model has been studied in [17]. The random forest (RF) is used in [18] and [19] to predict the wind speed in look-ahead times. The wavelet neural network (WNN) as a shallow network is also used in [20] to predict the wind speed in Edmonton, Canada. Although shallow networks have been widely investigated in wind speed prediction, they are inadequate to fully understand nonstationary and complex time series [21].
To resolve the limitations of the shallow models, there are two main groups of solutions: i) combining the shallow networks with additional techniques/models; and ii) replacing them with deep models.
5) Hybrid models: The accuracy can be improved by combining shallow models with additional techniques, e.g., a combination of feature extraction techniques, such as the wavelet transform, with the ANN [22], or a combination of the extreme learning machine (ELM) and backtracking search algorithm-based feature selection [23]. Furthermore, a combination of the ELM, Adaboost, and particle swarm optimization (PSO) is presented in [24]. A combination of the ELM, empirical mode decomposition, and grey-wolf optimization has been presented in [25], while a combination of empirical mode decomposition, differential evolution optimization, and fuzzy systems is used to predict the wind speed in [26].
The hybrid models can also be composed of different models, such as a combination of a mathematical model-based model and a statistical model in [27]. The hybrid model is a case-dependent model, and there is no guarantee that the hybrid model is a general solution for wind speed forecasting because of the hypothesis of a small number of parameters [28]. The summary of different groups of wind speed prediction models is provided in Table I.
In recent years, deep neural networks have emerged as a new concept in machine learning, and they have gained a lot of attention for their generalization capability and their strong ability to extract/select features without any additional techniques [29]. The deep networks that can be implemented in time series forecasting are divided into the deep Boltzmann machine (DBM), the deep autoencoder (DAE), recurrent neural networks (RNN), and convolutional neural networks (CNN) [30]. The DBM is used in [31] for wind speed prediction, and the DAE is used in [32]. The DBM and the DAE can handle complex features; however, they cannot fully understand spatial and temporal features. The RNN-based models, especially gated RNNs including the long short-term memory (LSTM) and the gated recurrent unit (GRU), are well-known tools in time series prediction [33]. The LSTM and the GRU are implemented in [34] and [35], respectively, for time series prediction. Although gated RNNs are strong tools to realize temporal features in long-tailed sequences, RNNs cannot fully understand spatial features [36]. In contrast with RNNs, CNNs are strong in understanding spatial features but weaker in temporal feature learning. Thus, it is essential to establish a model that can learn both spatial and temporal features in long-tailed time series, such as wind speed and the corresponding meteorological information.
The aim of the wind speed prediction can be: i) a point; ii) an interval; iii) a quantile; and iv) a PDF.
i) Point forecasting: Many of the previous studies have been carried out to assign only a point value for each time interval. Because an ideal prediction (with error = 0) is not possible and a point forecast provides only a single value, in recent years, prediction models have been developed to provide more information than a single point in the next hours. Therefore, probabilistic wind speed prediction including intervals, quantiles, and PDFs is preferred by power system operators.
ii) Prediction interval: In the interval wind speed prediction, a set of intervals is the output. Instead of a single point, minimum and maximum values are provided. Generally, data-driven models are used to forecast the intervals (the upper and lower bound) for each time interval using an error-based loss function. The prediction interval of the wind speed has already been investigated under several different methods including delta, Bayesian, bootstrap, and mean-variance. The delta method is based on a linear model, which is constructed based on an error-based objective function, discussed in [37]. The Bayesian technique is presented in [38] and can construct a set of intervals in look-ahead times based on assumed probability density functions. The bootstrap method outputs a set of realizations of historical data and, therefore, estimates the minimum, average, and maximum values [39]. The mean-variance method is a neural network-based method to predict a set of intervals based on probability distribution functions [40]. Lower upper bound estimation (LUBE), based on shallow networks in [41] and on RNNs in [42], is one of the well-known prediction interval construction methods for wind speed. A multi-objective formulation has been presented in [43] to construct the PIs based on the combination of fuzzy set theory, feature extraction, and a multi-objective salp swarm optimization algorithm.
Further, a deep mixture network can predict the interval with a high level of reliability [2].
iii) Quantile prediction: The quantile prediction can provide more information than intervals and can be implemented in a similar process to prediction intervals. For example, a combination of an optimized ELM based on a hybrid metaheuristic algorithm and quantile regression was presented in [44]. The Bayesian information criterion, linear programming, and nonlinear quantile regression are combined to produce a set of prediction intervals. In [45], a hybrid neural network is integrated into a quantile regression neural network to extract a set of quantiles in the next hours of the wind speed. A combination of ARMA, empirical mode decomposition, an ANN, and a linear regression model is introduced in [46] for quantile wind speed forecasting. A combination of graph neural networks (GNN) and quantile regression was presented in [47] to extract a set of quantiles of wind speed in look-ahead times. The prediction interval and quantile regression provide only a limited set of statistical information, whereas both the prediction intervals and a set of quantiles can be derived from the PDFs.
iv) PDF prediction: In the PDF forecasting, the output of the forecasting engines represents a set of PDFs. Although the predicted PDFs are a comprehensive form of statistical information, PDF prediction has received limited attention in previous studies. The PDF estimators can be divided into parametric and nonparametric methods. The RNN-based model in [48] is structured based on the Gaussian mixture concept as a parametric PDF estimator to predict the PDFs. The parametric methods require prior knowledge of the PDF model to estimate the PDFs, whereas a pre-assumed model is often not applicable in practice.
Among the nonparametric PDF estimators, kernel density estimation (KDE) is well known as an easily implemented method, and it can estimate the PDFs without any prior knowledge of the distribution function. Therefore, KDE has been integrated with wind speed prediction models, such as gated networks [47] and a combination of ARIMA and LSTM [49]; a review of studies on four different KDEs has been presented in [50]. There are two major problems related to KDE-based PDF predictions of wind speed: they are i) significantly influenced by the bandwidth selector; and ii) not able to perform accurately in long-tailed time series like the wind speed. This paper aims to resolve these issues by some modifications in the KDE process.
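The remark that prediction intervals and quantiles can be derived from a predictive distribution can be illustrated with a minimal sketch. It assumes the distribution is available as a set of forecast samples; the function name `summarize_samples`, the 90% level, and the synthetic data are illustrative choices, not part of the original work.

```python
import numpy as np

def summarize_samples(samples, level=0.9, quantiles=(0.1, 0.5, 0.9)):
    """Derive a central prediction interval and quantiles from forecast samples."""
    samples = np.asarray(samples, dtype=float)
    alpha = 1.0 - level
    # Central interval: cut alpha/2 of the probability mass from each tail.
    lower, upper = np.quantile(samples, [alpha / 2.0, 1.0 - alpha / 2.0])
    q_values = np.quantile(samples, quantiles)
    return (float(lower), float(upper)), dict(zip(quantiles, map(float, q_values)))

# Synthetic stand-in for one look-ahead time step (wind speed in m/s).
rng = np.random.default_rng(0)
draws = rng.normal(loc=6.0, scale=0.8, size=2000)
interval, q = summarize_samples(draws)
```

The same summary could be computed from an estimated PDF by integrating it to a CDF; working with samples simply avoids the numerical inversion.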

C. Contribution and Organization
In this paper, we propose a methodology for wind speed prediction to bridge the above-mentioned major gaps in the previous works by: (i) building a full understanding of the spatial-temporal features associated with the impacts of meteorological conditions on the wind speed; and (ii) providing full statistical information in the form of a probability density function (PDF) with a high level of accuracy and reliability. Hence, this paper establishes an accurate deep neural network structure by the following:
1. An ensemble residual neural network, called the ensemble Res-Network, is designed to fully capture temporal and spatial features. The designed ensemble Res-Network can accurately predict the wind speed by assembling multiple layers to learn spatial and temporal features from raw wind speed data and the corresponding meteorological data.
2. One of the major obstacles to accuracy in the previous works is the biased errors produced by the conventional error-based loss functions. This paper develops an error-based loss function to prevent biased error, thereby improving the performance of the designed probabilistic wind speed prediction method in terms of accuracy and reliability.
3. The full statistical information of wind speed in look-ahead times can be represented in the form of PDFs. In practice, the wind speed does not follow a predefined PDF. Thus, previous parametric methods are not practical for estimating the PDF with a high level of accuracy and reliability. The most conventional PDF estimator is the kernel density estimator (KDE). The KDE suffers from two major problems: i) it is significantly influenced by the bandwidth selector; and ii) it cannot perform accurately in long-tailed time series like the wind speed. To resolve these problems, the KDE is modified by using the general form of the KDE and by developing the bandwidth selection based on the moving sampling window technique.
The proposed method is extensively tested on a realistic dataset from London, U.K. The multiple analyses demonstrate the effectiveness and superiority of the proposed Res-Network integrated with the modified KDE. Therefore, the contributions of the paper can be summarized as:
• The ensemble Res-Network is developed to accurately predict the wind speed by assembling multiple layers to learn spatial and temporal features from raw wind speed data and the corresponding meteorological data.
• A modified version of the KDE is developed to resolve the bandwidth selection and to improve the accuracy of the prediction in long-tailed sequences.
• A combination of the modified KDE and the designed Res-Network can provide full statistical information in the form of PDF with a high level of accuracy and reliability.
• A modified error-based loss function is developed to prevent bias error, thereby improving the performance in terms of accuracy and reliability.
This paper is organized as follows. The designed Res-Network is introduced in Section II. The modified KDE is presented in Section III. Simulation results are discussed in Section IV, and concluding remarks are provided in Section V.

II. Designed Res-Network
A. Problem Formulation
Let us have $(X, Y)$, where $X = \{x_{t-\omega}, \cdots, x_{t-1}\}$ are the input historical data and $Y = \{y_t\}$ are the target values. The main aim in deterministic data-driven prediction is to establish a structure to project a set of $\tilde{Y}$ with a minimum difference from $Y$. In probabilistic prediction, the objective is to construct a network to project $P(Y|X)$. Thus, the main aim in this paper is to design a network with the ability to construct $\{y_{t+1}, y_{t+2}, \cdots, y_{t+H}\}$ and $\{P(y_{t+1}|X), \cdots, P(y_{t+H}|X)\}$, where $H$ denotes the number of look-ahead steps.

B. Input Data
The input data are historical data of wind speed and the corresponding meteorological data, including temperature, humidity, wind direction, and solar irradiation. The historical data are normalized as:
$$\bar{x} = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \quad (1)$$
where $x \in \mathbb{R}$ are historical data, while $x_{\max}$ and $x_{\min}$ are the maximum and minimum values of the historical data. To feed the designed deep structure, the data are organized as a time series with a specific window length, ω. The optimal value for the window length is highly dependent on the time series. In the present work, the window length is dependent on: (i) the seasonality of the dataset; (ii) the number of models in the designed deep ensemble structure; and (iii) prevention of overfitting. Overfitting can be caused by a low number of layers and by overlapping of the input data. By increasing the window length, the number of layers required for training is reduced. A data-driven structure with a low number of layers is more likely to be unable to perform efficiently on a long-tailed wind speed time series. Moreover, the overlapping of data samples increases the correlation between different time series and reduces the number of efficient time samples. Therefore, a large value for ω can lead to overfitting, and small values are preferred.
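The preprocessing described above can be sketched as follows. This is a minimal illustration, assuming min-max scaling per feature column and a sliding window of length ω with the wind speed in column 0; the function names, the `horizon` argument, and the toy data are hypothetical and not taken from the original implementation.

```python
import numpy as np

def min_max_normalize(x):
    """Scale each column to [0, 1]; assumes no column is constant."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(axis=0), x.max(axis=0)
    return (x - x_min) / (x_max - x_min), x_min, x_max

def make_windows(series, omega, horizon=1):
    """Build (input window, target) pairs from a (time, features) array.

    Each input holds omega consecutive time steps of all features; each
    target holds the next `horizon` values of column 0 (assumed wind speed).
    """
    series = np.asarray(series, dtype=float)
    n = len(series) - omega - horizon + 1
    X = np.stack([series[i:i + omega] for i in range(n)])
    y = np.stack([series[i + omega:i + omega + horizon, 0] for i in range(n)])
    return X, y

# Toy data: 10 time steps, 2 features.
data = np.column_stack([np.arange(10.0), np.arange(10.0)[::-1]])
scaled, x_min, x_max = min_max_normalize(data)
X, y = make_windows(scaled, omega=4, horizon=2)
```

Consecutive windows built this way overlap in ω − 1 time steps, which is the overlap the text refers to when discussing correlation between input samples.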

C. Designed Structure
As mentioned above, the designed structure is a nonlinear regression network, as shown in Fig. 1. The proposed ensemble network consists of multiple Res-Network blocks and is able to construct a sequence of time series in look-ahead times based on historical data. The residual structure helps to ensemble multiple blocks and enhances the learning ability through the residual feedback. In addition, according to [51], a large number of networks can be assembled by the residual concept. The designed ensemble Res-Network includes end-to-end residual blocks that use a recursive process throughout the input dataset. The Res-Network layers include a sequence of dense layers, each consisting of multiple hidden layers. Assume that the Res-Network includes $K$ different residual blocks, and each block consists of $m$ layers. The input dataset is fed into several dense layers, and at each layer the output is [29]:
$$O_{l,k} = \sigma\left(W_{l,k} h_{l-1} + b_{l-1}\right) \quad (2)$$
where the outputs of dense layers, weight matrices, hidden state, and biases are shown by $O_{l,k}$, $W_{l,k}$, $h_{l-1}$, and $b_{l-1}$, respectively. Furthermore, the activation function is depicted by $\sigma$. The activation function considered in this paper is the rectified linear unit (ReLU) [52].
The main difference between conventional structures and the Res-Network is the mapping function. In conventional deep/shallow networks, the input $x$ is projected directly through a mapping $F(x, \Pi)$, where $\Pi$ is the set of learning weights, while in residual networks a residual mapping $\check{F}(x, \Pi)$ is learned alongside a shortcut connection. In the ensemble Res-Network, both the forward propagation and the backpropagation of the loss function $L$ pass through these shortcut connections. It is worthwhile to note that the gradients at the output of the ensemble Res-Network can be directly back-propagated. This property reduces the possibility of vanishing gradients, even in large-scale networks [53].
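To make the residual idea concrete, the following sketch stacks a few generic residual blocks built from dense ReLU layers. It is an assumption-laden illustration of a standard identity-shortcut block, not the authors' exact architecture: the block layout, the equal input/output width, and all names (`residual_block`, `ensemble_forward`) are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, weights, biases):
    """Dense ReLU layers followed by an identity shortcut: output = x + F(x)."""
    h = x
    for W, b in zip(weights, biases):
        h = relu(h @ W + b)
    return x + h  # the shortcut lets the input (and gradients) bypass F

def ensemble_forward(x, blocks):
    """Chain several residual blocks end to end."""
    for weights, biases in blocks:
        x = residual_block(x, weights, biases)
    return x

rng = np.random.default_rng(0)
dim = 8  # hypothetical feature width, kept equal across layers for the shortcut
blocks = [
    ([rng.normal(scale=0.1, size=(dim, dim)) for _ in range(2)],
     [np.zeros(dim) for _ in range(2)])
    for _ in range(3)
]
x = rng.normal(size=(5, dim))
out = ensemble_forward(x, blocks)
```

Because each block outputs `x + F(x)`, the derivative of the output with respect to the block input always contains an identity term, which is the mechanism behind the reduced risk of vanishing gradients mentioned in the text.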
In the designed Res-Network, the input of the $k$-th module, which is composed of several dense layers, is updated from the output of the preceding module. The output of the $(k-1)$-th module and the input of the $k$-th module are denoted by $\tilde{x}_{k-1}$ and $x_k$, respectively. It is worthwhile to note that we assume $x_0 = \tilde{x}_0 = 0$. In the $l$-th dense layer of the module, the hidden state $h_{k,l}$ is obtained by applying that layer's mapping (7). The outputs of each module in the Res-Network are then obtained from the hidden state through two mapping functions with dimensions $\mathbb{R}^{w_1 \times h_1}$ and $\mathbb{R}^{w_2 \times h_2}$ (widths $w_1$, $w_2$ and heights $h_1$, $h_2$).

D. Loss Function
The mapping function is constructed based on the optimization process. The optimization process is performed on the loss function, and its output is a set of learning weights. In previous studies, the conventional mean error-based loss function has usually been used in short-term forecasting [9]. One of the most commonly used error-based loss functions is the mean absolute percentage error (MAPE). The loss function developed in this paper adds a regulation coefficient, which prevents biased errors in under/overestimation. To prevent overfitting across the many layers of the Res-Network, a proportion of the learning weights is dropped out by the dropout technique [54]. The learning weights are obtained based on Adam optimization [55], and the hyperparameter is obtained by root mean square propagation (RMSProp) [56].
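As a rough illustration of the two ideas in this subsection, the sketch below implements the standard MAPE and one possible way a coefficient could rebalance under- and overestimation. The second function is purely hypothetical: the exact form of the proposed loss is not given in this text, so `weighted_ape` and its parameter `lam` are assumptions made for illustration only.

```python
import numpy as np

def mape(y_true, y_pred):
    """Standard mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def weighted_ape(y_true, y_pred, lam=0.5):
    """Hypothetical asymmetric variant (NOT the paper's loss).

    lam weights underestimation (prediction below the actual value) and
    1 - lam weights overestimation; lam = 0.5 recovers the plain MAPE.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = (y_true - y_pred) / y_true
    weight = np.where(err > 0, lam, 1.0 - lam)
    return 100.0 * np.mean(2.0 * weight * np.abs(err))

y_true = np.array([10.0, 20.0])
y_pred = np.array([9.0, 18.0])  # both values underestimated by 10%
plain = mape(y_true, y_pred)
balanced = weighted_ape(y_true, y_pred, lam=0.5)
penalized = weighted_ape(y_true, y_pred, lam=0.8)
```

Raising `lam` above 0.5 makes underestimation more costly than overestimation, which is one simple way a training objective can counteract a systematic bias in one direction.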

III. Modified Kernel Density Estimation
From a practical point of view, the proposed method should be applied in an iterative manner. Firstly, based on different dropout values, the trained network generates a set of points. It is worth mentioning that the trained model performs fast (the average performance time is 42 ms). Then, after several iterations, the proposed nonparametric PDF estimator estimates a set of PDFs. To provide full statistical information in the form of PDFs, two sets of PDF estimation methods can be used, namely parametric and nonparametric methods. Because the nonparametric methods can estimate a set of PDFs without any prior knowledge of the PDF model, nonparametric methods are preferred. Among the nonparametric PDF estimators, kernel density estimators (KDEs) have been regarded as well-known and easy-to-implement methods. However, the kernel-based estimator is not accurate enough for long-tailed distributions. Moreover, the bandwidth selector has significant impacts on the quality of the estimated PDFs [57]. To address this problem, this paper modifies the KDE by using a varying bandwidth. The modified KDE can be implemented as:
Step 1: Initialization. In this step, the proposed Res-Network is run several times, each time with a different value assigned to the dropout. Thus, a large number of data points is generated in an extremely short time period. The running time for each sample is less than 11.2 ms. Thus, the designed network can be repeated many times in a short time period.
Step 2: KDE function. After several iterations of the Res-Network in wind speed forecasting, the generated data compose an input set for the modified KDE, $\Upsilon = \{\tilde{y}_1, \cdots, \tilde{y}_N\}$, where $N$ denotes the total number of iterations and $\tilde{y}_i$ indicates the point forecasted by the Res-Network. Then, over the $N$ iterations, the KDE function is computed using a smoothing parameter.
Step 3: Local mean integrated squared error function. To compute the optimal values for the variant bandwidth in the modified KDE, first, the local mean integrated squared error function is defined in terms of the inhomogeneous coefficient, the sampling window, and the expected values.
Step 4: Objective function for optimal variant bandwidth. Then, the general kernel function is defined, and the loss function is optimized based on the least-squares process. The optimal bandwidth is the optimal smoothing parameter scaled by $\ell^{-1}$, where $\ell$ denotes the scale parameter; the optimal sampling window is scaled by $\ell^{-1}$ in the same way. During the optimization process, the scale parameter is updated at each iteration.
Step 5: Stopping criteria. The optimization process in Step 4 is repeated $T$ times. The number of iterations is considered as the stopping criterion.
Step 6: Outputs. The final set of PDFs is extracted as a full statistical form of information for the wind speed in the next hours. The overall process of PDF estimation based on the modified KDE is summarized in Algorithm 1.
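For reference, the conventional fixed-bandwidth KDE that the steps above set out to improve can be sketched as follows. This is the textbook Gaussian-kernel estimator with a normal-reference rule-of-thumb bandwidth, not the modified KDE; the synthetic samples stand in for the point forecasts that repeated runs of the network would produce, and all names are illustrative.

```python
import numpy as np

def gaussian_kde_pdf(samples, grid, bandwidth=None):
    """Fixed-bandwidth Gaussian KDE evaluated on a grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    n = samples.size
    if bandwidth is None:
        # Normal-reference rule of thumb: h = 1.06 * sigma * n^(-1/5).
        bandwidth = 1.06 * samples.std(ddof=1) * n ** (-0.2)
    u = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return kernel.sum(axis=1) / (n * bandwidth)

# Synthetic stand-in for N point forecasts of one time step (m/s).
rng = np.random.default_rng(0)
samples = rng.normal(loc=6.0, scale=0.5, size=500)
grid = np.linspace(2.0, 10.0, 801)
pdf = gaussian_kde_pdf(samples, grid)
area = float(np.sum(pdf) * (grid[1] - grid[0]))  # should be close to 1
```

A single global bandwidth is what makes this estimator sensitive to the bandwidth selector: a value suited to the dense centre of the data tends to undersmooth the sparse tail, and vice versa.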

Algorithm 1: Modified KDE in probabilistic wind speed forecasting
Input: Modified KDE parameters and Υ.
Output: PDFs of wind speed in look-ahead times
Step 1: Set the modified KDE parameters and the input data set obtained by several iterations of the Res-Network
Step 2: Calculate the kernel function based on (11).
Step 3: Local mean integrated error function is computed as (12).
Step 4.3: update the optimal values
Step 4.4: obtain the optimal values
End
Step 5: output the final set of PDFs for wind speed
End
The proposed method can also be used for estimation of discrete [58], [59] and continuous [60] systems.
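A variable-bandwidth estimator in the spirit of Algorithm 1 can be sketched with a well-known substitute technique: the sample-point adaptive KDE with Abramson's square-root law. This is explicitly not the moving-sampling-window method of the paper, whose equations are not available here; it only illustrates how per-sample bandwidths widen the kernels in a sparse tail. All names and the log-normal test data are assumptions.

```python
import numpy as np

def _gauss(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def adaptive_kde_pdf(samples, grid, alpha=0.5):
    """Sample-point adaptive Gaussian KDE (Abramson-style bandwidths)."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    n = samples.size
    # Pilot estimate with a fixed rule-of-thumb bandwidth.
    h0 = 1.06 * samples.std(ddof=1) * n ** (-0.2)
    pilot = _gauss((samples[:, None] - samples[None, :]) / h0).sum(axis=1) / (n * h0)
    # Local bandwidths: wider where the pilot density is low (the tail).
    g = np.exp(np.mean(np.log(pilot)))  # geometric mean of the pilot density
    h_i = h0 * (pilot / g) ** (-alpha)
    u = (grid[:, None] - samples[None, :]) / h_i[None, :]
    return (_gauss(u) / h_i[None, :]).sum(axis=1) / n

# Long-tailed synthetic data standing in for repeated point forecasts.
rng = np.random.default_rng(0)
samples = rng.lognormal(mean=1.5, sigma=0.5, size=400)
grid = np.linspace(-20.0, 100.0, 6001)
pdf = adaptive_kde_pdf(samples, grid)
area = float(np.sum(pdf) * (grid[1] - grid[0]))
```

With `alpha=0` the local bandwidths all collapse to `h0` and the estimator reduces to the fixed-bandwidth KDE, which makes the effect of the adaptation easy to compare on the same data.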

IV. Numerical Experiments
The effectiveness of the probabilistic wind speed forecasting approach was tested to verify the performance of the designed deep forecasting model and the proposed probabilistic forecasting approach. The parameters of the proposed deep network are given in Table I. The proposed methodology was implemented in Python on a personal computer with a 2.67 GHz processor, 4 GB of memory, and a 32-bit OS. Note that the time resolution considered in this paper is 30 min. For comparison, several methods were considered in terms of models:
• Combination of CNN and GRU [4].
In the point-forecast metrics (16)-(19), $y_i$, $\bar{y}_i$, and $N$ represent the actual values, the forecasted mean value, and the total number of samples in the forecasted time series. Although the metrics in (16)-(19) are proper measures for evaluation of the forecasting points, these four metrics cannot assess the forecasted PDF comprehensively. Thus, the continuous ranked probability score (CRPS) and cross-entropy (CE) were also used as accuracy and reliability metrics in this paper. The CRPS assesses the predicted PDFs in terms of sharpness, calibration, and reliability. Together, the MAPE, the RMSE, the CRPS, and the CE were used to assess the results from different aspects. The MAPE and the RMSE were used to evaluate the mode of the predicted PDFs in terms of accuracy. However, although both of them are considered comprehensive metrics in time series forecasting [4], they cannot reflect the reliability of the predicted PDFs. To address this issue, the CRPS and CE were used. The CRPS can assess the sharpness and calibration. However, the CRPS is not sensitive to perturbations outside the distribution. Therefore, CE was used as an accuracy and reliability evaluation metric in the PDF prediction.
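The point and probabilistic metrics named above can be sketched as follows. The RMSE and MAPE use their standard definitions; the CRPS is computed with the common sample-based identity CRPS = E|X − y| − ½·E|X − X′|, which is an assumption about how one might evaluate it from forecast samples and may differ from the exact formulation used in the paper. The cross-entropy metric is omitted because its definition is not given in this text.

```python
import numpy as np

def rmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(100.0 * np.mean(np.abs((y_true - y_pred) / y_true)))

def crps_from_samples(samples, y_obs):
    """Sample-based CRPS for one observation (lower is better)."""
    samples = np.asarray(samples, dtype=float)
    term_obs = np.mean(np.abs(samples - y_obs))
    term_spread = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return float(term_obs - term_spread)

# A degenerate forecast (all samples equal) reduces the CRPS to the absolute error.
point_like = crps_from_samples([5.0, 5.0, 5.0], y_obs=6.0)

# Standard normal forecast evaluated at its centre.
rng = np.random.default_rng(0)
normal_case = crps_from_samples(rng.standard_normal(2000), y_obs=0.0)
```

The degenerate case shows why the CRPS is often described as a generalization of the absolute error to probabilistic forecasts: it rewards both closeness to the observation and a narrow spread.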

B. Dataset
The actual dataset used to test the performance was adapted from [65]. In this dataset, wind speed, temperature, humidity, wind direction, and solar irradiation were collected from January 1st, 2013 to January 1st, 2014 in London, U.K. To train the designed deep network, 70% of the whole-year dataset was used. The remaining data, about 30%, were used for the performance evaluation of the estimator.
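A 70/30 split of one year of 30 min data can be sketched as below. The text does not state how the split was drawn; a chronological split (training on the earlier part, evaluating on the later part) is assumed here because it avoids leaking future values into training, and the function name and the placeholder series are hypothetical.

```python
import numpy as np

def chronological_split(data, train_fraction=0.7):
    """Split a time-ordered array into an earlier training part and a later test part."""
    data = np.asarray(data)
    cut = int(round(train_fraction * len(data)))
    return data[:cut], data[cut:]

# One year at 30 min resolution: 365 days * 48 samples per day.
n_steps = 365 * 48
series = np.arange(n_steps)  # placeholder for the real measurements
train, test = chronological_split(series)
```

Note that a purely chronological split of a single year leaves some seasons entirely in the test part; sampling evaluation days from each season, as the seasonal analysis in the text suggests, would require a different splitting rule.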

C. Result Analysis: Deterministic Forecasting
The deterministic results are discussed for four sample days in spring, summer, fall, and winter. The wind speed is highly dependent on seasonal changes. Thus, variation in different seasons can be illustrative for a deterministic wind speed prediction approach.

D. Results Analysis: Probabilistic Forecasting
The full statistical information for each time interval is provided by a combination of the modified KDE and the designed Res-Network in the form of PDF. In other words, for each time interval, a nonparametric PDF is forecasted. The predicted PDFs for four different time intervals are depicted in Fig. 11. The time intervals are selected from a sample day of the spring season. As can be seen, the predicted PDFs are narrow, and the real value is very close to the peak of the PDF. Therefore, the accurate and reliable performance of the proposed method can be verified. Table III compares the proposed approach with the SKDE and the KDE in terms of CRPS, CE, MAPE, and RMSE. The MAPE and the RMSE are obtained based on predicted median values. As can be seen, the proposed method performs significantly more accurately than the SKDE and the KDE in the spring season.
The predicted PDFs in two different time intervals in a sample day in the summer season are depicted in Fig. 12. As can be seen, the real value is very close to the peak of the predicted PDFs. Furthermore, the results obtained by the modified KDE are compared with the SKDE and KDE in Table IV. Accordingly, the modified KDE improves on the SKDE and the KDE by about 37.44% and 43.15%, respectively, in terms of CE.
The MAPE values for the shallow methods are too high, which means that the shallow networks cannot perform accurately for the wind speed prediction. However, according to the results in Tables II-VI, the MAPE values obtained by the proposed method are lower than 5%. These values validate the high accuracy of the proposed probabilistic wind speed prediction method.

V. Conclusion
One of the most favorable renewable energy resources is wind energy. The key challenge in the operation of a power system is the uncertainty associated with the wind speed. This paper addressed the problem by: i) designing an ensemble residual neural network; ii) developing an error-based loss function; iii) establishing a framework that can provide a set of PDFs; and iv) presenting a modified KDE. The results were extensively analyzed based on the seasonal behavior of wind speed in London, U.K. with a 30 min time resolution; the results show at least a 25% improvement compared with several state-of-the-art deep and shallow networks. Furthermore, the proposed modified KDE shows a more than 30% improvement in terms of accuracy and reliability.