A New Data Driven Long-Term Solar Yield Analysis Model of Photovoltaic Power Plants

Historical data offer a wealth of knowledge to users. However, the data are often so large that the information cannot be extracted, synthesized, and analyzed efficiently for applications such as forecasting the output of variable generators. Moreover, the accuracy of the prediction method is vital. Therefore, a trade-off between accuracy and efficacy is required for a data-driven energy forecasting method. It has been identified that a hybrid approach may outperform the individual techniques in minimizing the error, although it is challenging to synthesize. A hybrid deep learning-based method is proposed for predicting the output of solar photovoltaic (PV) systems in Australia to obtain this trade-off between accuracy and efficacy. Historical datasets from 1990 to 2013 for Australian locations (e.g. North Queensland) are used to train the model. The model is developed by combining a multivariate long and short-term memory (LSTM) network with a convolutional neural network (CNN). The proposed hybrid deep learning method (LSTM-CNN) is compared with existing neural network ensemble (NNE), random forest, statistical analysis, and artificial neural network (ANN) based techniques to assess its performance. The proposed model could be useful for generation planning and reserve estimation in power systems with high penetration of solar PV or other renewable energy sources (RESs).


II. INTRODUCTION
The penetration of solar photovoltaic (PV) generation has been increasing for multiple consecutive years in several countries, including Australia. A significant number of PV systems in Australia are connected to either medium or low voltage networks. The growth of both large and small-scale PV penetration has economic and environmental benefits. However, it poses a range of management and control issues for grid operators due to the variability of PV outputs. The power system has become increasingly volatile and less predictable with PV systems [1]. PV systems are weather dependent and therefore hard to predict. The accuracy of the prediction is critically important for the secure operation of power systems with high penetrations of PV systems. It enables the system operator to deal with output power variability, and planning engineers to plan and design the power system for the future [2]. There are various methods for such forecasting over different time horizons, e.g. short, medium, and long-term. Physical, persistence, statistical, and combined approaches may be used to estimate the output of variable generation [3]. Meteorological data and energy forecasting are the two significant components of PV system forecasting [4]. Many procedures have been proposed in the literature to forecast meteorological information such as wind speed, cloud cover, temperature, and irradiance [5], [6]. Furthermore, physical, meteorological data-driven, and astronomical-driven methods are the common methods reported in the literature for forecasting the output power and energy of PV systems. Different parameters, such as power rating, azimuth angle, module type, tilt angle, and wind speed, are used in the physical model for energy forecasting in PV systems [7].
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Historical weather data and previous measurements of PV system outputs are used in the meteorological data-driven method for PV forecasting [8]. Statistical, persistence, and auto-regression methods are the key methods used for this purpose [8]. Recently, machine learning techniques have been widely applied in the meteorological data-driven approach to forecast PV output [9]. In the astronomical and meteorological data-driven approach, physical factors are used together with the meteorological data [9]. In the traditional data-driven statistical method, historical measured PV data are used for the forecast [10]. Auto-regression and spatial-temporal methods are the other two widely used data-driven methods for such applications [11], [12]. However, the physical information of the PV system is often limited or ignored in these methods [10]-[12]. Although different techniques have already been recognized for forecasting PV output, there is still an opportunity to improve the reliability and accuracy of long-term PV forecasting for use in power system planning. A good number of works have attempted to estimate the short-term solar yield using historical data. Most of the forecasting techniques are applied at temporal resolutions from minutes to a day for dispatching and load following, unit commitment, distributed generation operation, building energy management, and transmission scheduling. However, very few studies have investigated the data-driven long-term estimation of solar yield. In this paper, a data-driven model is proposed for reliable estimation of solar yield from historical data. Three main categories of forecasting algorithms have been reported: statistical analysis [13], machine learning [14], and hybrid [15], [16].
The Australian Renewable Energy Agency (ARENA) has reported the Australian Solar Energy Forecasting System (ASEFS), which uses statistical models such as decision tree, random forest, and persistence to produce hour-ahead predictions of solar energy in Australia. The model has a root mean square error (RMSE) of 15.80. Hence, there is still room to improve the forecasting approaches. Furthermore, several machine learning techniques have been used to forecast the minute-, hour-, and day-ahead energy outputs of large-scale PV systems [7], [8], [14], [17]. These works mainly used various neural network (NN) based forecasting techniques with short datasets. Very few studies have exhibited good forecasting performance; one example is [7], which reports a normalized root-mean-square deviation or error (nRMSE) of 0.07356. However, the algorithm proposed in [7] is not suitable for generalized forecasting due to the underlying weather classification and certain assumptions applicable to the specific region. Furthermore, hybrid techniques have been used to combine algorithms for better performance, as stated in [7]. The method proposed there combines particle swarm optimization (PSO) with a variation of NN to achieve better forecasting performance. However, the performance of that algorithm is almost the same as that of other NN based forecasting algorithms. Recently, recurrent neural network (RNN) and deep learning [16] based forecasting have received a great deal of attention due to their better prediction performance compared to traditional techniques, i.e. statistical, PSO, and NN. However, most of the deep learning-based methods are used for short-term forecasting with short datasets.
In this paper, a novel hybrid deep learning method is proposed. A number of studies have used long and short-term memory (LSTM) and convolutional neural network (CNN) individually in various applications, including the forecasting of PV output [18]. This paper proposes a method that combines LSTM and CNN to obtain a hybrid algorithm for long-term forecasting of PV output. The proposed algorithm is compared with four baseline modelling methods and demonstrates better performance than the other methods. The rest of the paper is organized as follows: Section III briefly describes the key techniques considered in this paper. The methodology is explained in Section IV. Results and discussions are presented in Section V. The conclusions and the contributions of the paper are given in Section VI.

III. OVERVIEW
The long-term forecasting of solar PV can be used for planning the reserve of power systems with high penetration of PV systems. The goal of this research is to find the PV power and energy over a long-term time horizon, i.e. a couple of years ahead. The historic Typical Meteorological Year (TMY) dataset is used for this study. The TMY dataset is obtained from Energy Partners [19]. The TMY data is used in the System Advisor Model (SAM) to prepare the required weather data and PV output data for the prediction model. The blending of two deep learning methods has been considered. Fig. 1 shows an overview of the proposed method. The key techniques used in this work are briefly described next in this section.

A. RECURRENT NEURAL NETWORK (RNN)
Recurrent neural networks (RNNs) consist of recurrent loops of networks that allow persistent information flow [16]. These loops allow the information to flow from one step of the network to the next using a chain of events within the network, which makes RNNs intimately related to sequences and lists. The concept of the recurrent neural network is the basis of deep learning techniques and algorithms that are inspired by the connections of neurons in the human brain [16], [17]. It uses recurrent learning to learn from large and complex datasets. Deep learning is used to solve complex problems that require input from diverse, unstructured, and inter-connected datasets. In this work, two of the most popular deep learning techniques, namely long and short-term memory (LSTM) and convolutional neural network (CNN), are utilized. The details of LSTM and CNN are given later in this paper.

B. LONG-AND SHORT-TERM MEMORY (LSTM)
The LSTM is a deep learning technique explicitly designed to reduce the long-term dependency problem using a chain-like structure [20]. The recurring model of LSTM uses a concurrent cell update structure. The initial update starts right after the first output of the initial LSTM block, which uses the initial state of the network and the first time step of the sequence to compute the output. At time step t, the block uses the current state of the network $(c_{t-1}, y_{t-1})$ and the next time step of the sequence to compute the output and the updated cell state $c_t$. Each layer has two states, known as the cell state and the hidden state (also known as the output state). The output of the LSTM layer at time step t is contained in the hidden state of the same time step [21]. The information learned from previous steps is contained in the cell state of the current step. At each time step, the layer adds information to or removes information from the cell state, controlled by gates. Fig. 2 illustrates a general LSTM block architecture. As illustrated in Fig. 2, there are four control gates in LSTM: forget (f), cell candidate (g), input (i), and output (o). When $c_{t-1}$ enters the LSTM unit from the previous LSTM block, it first passes through the forget gate, which drops some memory. New memories are then added by the update gate. The output is filtered through the output gate. The working mechanisms of each control gate at time step t can be expressed mathematically as in (1)-(4).
In (1)-(4), $\sigma_g$ denotes the gate activation function. The sigmoid function given by $\sigma(x) = (1 + e^{-x})^{-1}$ is used as the gate activation function in MATLAB [22]. An LSTM layer has three sets of learnable parameters: the input weights W, the recurrent weights R, and the bias b. The matrices W, R, and b are concatenated as in (5).
where i, f, g, and o represent the input gate, forget gate, cell candidate, and output gate, respectively. The cell and hidden state at time step t are expressed by (6) and (7), respectively.
where $\odot$ denotes the Hadamard product (element-wise multiplication of vectors) and $\sigma_c$ denotes the state activation function. The state activation function is computed using the hyperbolic tangent function (tanh) by the lstmLayer function in MATLAB.
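The gate and state updates described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the standard LSTM formulation with a sigmoid gate activation and a tanh state activation, and with W, R, and b stacked in the gate order i, f, g, o. The function names (`lstm_step`, `matvec`) and the symbol `h` for the hidden (output) state are choices made for this sketch, not taken from the paper.

```python
import math


def sigmoid(x):
    # Gate activation: sigma(x) = (1 + e^(-x))^(-1)
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    # Plain matrix-vector product on nested lists.
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lstm_step(x_t, h_prev, c_prev, W, R, b):
    """One LSTM time step (hypothetical helper for illustration).

    W (4H x D), R (4H x H) and b (4H) are stacked in the gate order
    [input i, forget f, cell candidate g, output o].
    """
    H = len(h_prev)
    z = [wx + rh + bias
         for wx, rh, bias in zip(matvec(W, x_t), matvec(R, h_prev), b)]
    i = [sigmoid(v) for v in z[0:H]]            # input gate
    f = [sigmoid(v) for v in z[H:2 * H]]        # forget gate
    g = [math.tanh(v) for v in z[2 * H:3 * H]]  # cell candidate
    o = [sigmoid(v) for v in z[3 * H:4 * H]]    # output gate
    # Cell state: forget part of the old memory, add the gated candidate.
    c_t = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # Hidden (output) state: output gate applied to the squashed cell state.
    h_t = [ok * math.tanh(ck) for ok, ck in zip(o, c_t)]
    return h_t, c_t
```

With all weights and biases set to zero, every sigmoid gate evaluates to 0.5 and the cell candidate to 0, so the new cell state is simply half of the previous one. This makes a convenient sanity check for the stacking order.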

C. CONVOLUTIONAL NEURAL NETWORK (CNN)
The Convolutional Neural Network (CNN) is one of the most popular deep learning algorithms [21]. It has the advantage of extracting data features effectively. Therefore, the CNN is widely used in image recognition and classification. CNNs resemble a visual cortex, with arrangements of simple and complex cells [18]. Similar to an RNN, a CNN is composed of three main components: the input layer, the output layer, and the hidden layers in between [23]. A general CNN structure is illustrated in Fig. 3. One or multiple convolutional layers may be involved in a CNN, as shown in Fig. 3. The CNN used in this paper has four 2-D convolutional layers, each with a batch normalization layer (BNL), a rectified linear unit (ReLU) layer, and an average pooling layer (APL). These are followed by one dropout layer (DL), a fully connected layer, and a regression output layer, respectively. The influential input parameters, of size m × m × n, are used as the input to the CNN (where m × m determines the size of each set and n specifies the total number of datasets). The inputs are passed to the 2-D convolutional network, which consists of neurons that connect to sub-regions of the input dataset or of the output of the previous layer. The 2-D convolutional network uses a set of weights called a filter (k) that is convolved with the input. This extracts the important features of the input dataset for accurate output prediction. Then, batch normalization is used to normalize the inputs ($m_i$) by calculating the mean ($\mu_B$) and variance ($\sigma_B^2$) over a mini-batch and over each input channel. The normalized activations can be obtained as in (8).
In (8), $\epsilon$ is the property Epsilon, which improves the numerical stability when the mini-batch variance is very small. The batch normalization layers are followed by a ReLU layer, which applies a threshold operation to its input as given in (9).
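As a rough illustration of the normalization and threshold steps just described, the sketch below normalizes a mini-batch of scalar activations using its mean, variance, and a small epsilon, then applies ReLU. It assumes the usual form of batch normalization and deliberately omits the learnable scale and offset that a full batch normalization layer would also apply. The function names and the default epsilon value are illustrative only.

```python
import math


def batch_normalize(values, epsilon=1e-5):
    # Normalize with the mini-batch mean and (population) variance.
    # epsilon keeps the division stable when the variance is tiny.
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n
    return [(v - mu) / math.sqrt(var + epsilon) for v in values]


def relu(values):
    # Threshold operation: negative inputs are set to zero.
    return [v if v > 0 else 0.0 for v in values]
```

For example, `relu(batch_normalize([1.0, 2.0, 3.0]))` zeroes the below-average activation and keeps the above-average one.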
The ReLU layer is followed by an APL, which performs down-sampling. The input is divided into rectangular pooling regions, and the average value of each region is computed. If the input (I) to the pooling layer is n × n and the pooling region size (PS) is h × h, then the pooling layer down-samples the regions by h [23]. The output (O) of a pooling layer for overlapping regions can be expressed as in (10).
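The down-sampling performed by an average pooling layer can be sketched as follows. This is an illustrative implementation without padding; the parameter names `pool` and `stride` are assumptions of this sketch. When the stride equals the pool size the regions do not overlap and an n × n input is reduced by the factor h; with a smaller stride the regions overlap and the output side length is (n − h) / stride + 1.

```python
def average_pool_2d(image, pool, stride):
    """Average pooling over square regions of a 2-D list (no padding)."""
    n_rows, n_cols = len(image), len(image[0])
    out = []
    for r in range(0, n_rows - pool + 1, stride):
        row = []
        for c in range(0, n_cols - pool + 1, stride):
            # Collect the pool x pool window and average it.
            window = [image[r + dr][c + dc]
                      for dr in range(pool) for dc in range(pool)]
            row.append(sum(window) / (pool * pool))
        out.append(row)
    return out
```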
In the final stage, one DL, a fully connected layer, and a regression layer work together to prepare the output of the CNN. The dropout layer randomly sets input elements to zero according to the dropout mask rand(size(m)) < Probability (where m is the layer input). The fully connected layer multiplies the input by a weight matrix W and adds the bias vector b. In this case, with sequential inputs, the fully connected layer acts independently on each time step. At time step t, the corresponding entry of Z is $W Y_t + b$.
The loss function of the regression layer is the half-mean-squared-error of the predicted responses for sequence-to-one regression networks, as given in (11).
where n is the number of responses, $t_i$ is the target output, and $Y_i$ is the network's prediction for response i.
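The fully connected step and the regression loss can be illustrated with the short sketch below. The exact form of (11) is not recoverable here, so whether the half squared error is divided by the number of responses n is left as an explicit option (`normalize`), which is an assumption of this sketch. The function names are also illustrative.

```python
def fully_connected(W, y, b):
    # Z = W * y + b for a single time step.
    return [sum(w * v for w, v in zip(row, y)) + bias
            for row, bias in zip(W, b)]


def half_squared_error(targets, predictions, normalize=False):
    # 0.5 * sum_i (t_i - y_i)^2, optionally divided by the number
    # of responses n (assumption: the normalization is configurable).
    total = sum((t - y) ** 2 for t, y in zip(targets, predictions))
    if normalize:
        return 0.5 * total / len(targets)
    return 0.5 * total
```

The factor of one half is a common convenience: it cancels the 2 that appears when the squared error is differentiated during training.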

IV. METHODOLOGY
The step-by-step methodology used in this paper is given in Fig. 4. The monthly dataset from 1990 to 2013, with a one-hour time interval, has been used here for the forecasting. Solar datasets for four locations in Queensland, namely Cairns, Gladstone, Rockhampton, and Townsville, are considered to validate the proposed method.
Step 1: Prepare the initial dataset. The historic Typical Meteorological Year (TMY) dataset from 1990 to 2013, with the .tm2 file extension, is used to generate the weather data for the proposed algorithm. The System Advisor Model (SAM) is used to generate the energy output of the PV system [24]. The SAM was developed by the National Renewable Energy Laboratory (NREL) to estimate the energy output of renewable energy systems, including PV generators, from a physical model of the system. The PV system in SAM has been tuned using the manufacturer data of the PV cells, inverters, AC lines, derating factors, and others. Using the specification of the physical model of the PV plant and the relevant TMY dataset, the SAM provides influential weather parameters such as global horizontal irradiance (GHI), direct normal irradiance (DNI), diffuse horizontal irradiance (DHI), wet bulb temperature, and dew point temperature at hourly and monthly resolution. Fig. 5 illustrates the output of a PV plant estimated by SAM for a representative year in Cairns. The obtained output can be exported in .csv format for use as an input to the deep learning algorithm. The hourly and monthly energy outputs of the PV plants are also calculated in the SAM using the TMY data and the physical model specification.
Step 2: Input selection. Initially, the generated weather data were analyzed using the correlation coefficient to find the positive and negative Correlation Index (CI) of the parameters associated with energy production. The CI values in this paper are calculated based on the Pearson product-moment correlation coefficient as given in (12) [25]. The influential parameters are given in Table 1. As presented in Table 1, the five major influential inputs with |CI| > 0.5 are employed. As can be seen from Table 1, GHI and DNI are positively correlated with the energy production, while DHI, wet bulb temperature, and dew point temperature are negatively correlated with it.
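The input selection in Step 2 can be sketched as a Pearson correlation followed by a threshold on its magnitude. The code below is a minimal illustration in plain Python; the helper names and the tiny series in the usage note are made up for demonstration and are not values from Table 1.

```python
import math


def pearson(x, y):
    # Pearson product-moment correlation coefficient of two series.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def select_inputs(features, energy, threshold=0.5):
    # Keep every feature whose |CI| with the energy output exceeds
    # the threshold, whether the correlation is positive or negative.
    ci = {name: pearson(series, energy) for name, series in features.items()}
    return {name: r for name, r in ci.items() if abs(r) > threshold}
```

For instance, with made-up data `energy = [1, 2, 3, 4]`, a feature `[2, 4, 6, 8]` (CI close to +1) and a feature `[8, 6, 4, 2]` (CI close to −1) are both kept, while an uncorrelated feature such as `[1, -1, -1, 1]` is dropped.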
Step 3: In this step, the datasets are prepared for the training and testing of the hybrid deep learning structure. The LSTM part of the hybrid deep learning technique is used to predict the inputs in $t$ years (these are used in Step 4 to calculate the PV output for $t_{nyears}$ using the CNN part of the hybrid structure). A brief overview of the dataset preparation and the hybrid deep learning structure is given next. Fig. 7 shows the process flow used to prepare the datasets for training and testing.

B. DATASET STANDARDIZATION
The standardization process is used to prepare the dataset for a better fit and to keep the deviation to a minimum. For the dataset matrix $M_{ij}$, the mean and standard deviation are estimated to obtain the standardized dataset S.
In (13)-(15), $M_{ij}$ is the dataset matrix, S is the standardized dataset, $\mu$ is the mean, and $\sigma$ is the standard deviation of the dataset.
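Standardization of this kind is typically a z-score transform, S = (M − µ) / σ. The sketch below assumes that form and, as a further assumption, uses the sample standard deviation over a flattened series; the paper's (13)-(15) may differ in these details. It also returns µ and σ so that predictions can be mapped back to physical units.

```python
import statistics


def standardize(values):
    # z-score: subtract the mean and divide by the standard deviation.
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample standard deviation (assumption)
    return [(v - mu) / sigma for v in values], mu, sigma


def unstandardize(scaled, mu, sigma):
    # Inverse transform, applied to network outputs after prediction.
    return [s * sigma + mu for s in scaled]
```

Keeping µ and σ from the training data and reusing them for the test data avoids leaking information from the test period into the model.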

C. PARTITION OF TRAINING AND TEST DATA
Of the available data, 90% are used for training, while the remaining 10% are used for testing. The training data size can be estimated as in (16):
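A minimal sketch of this partition is shown below. It assumes a chronological split (no shuffling), with the training size taken as the floor of 90% of the number of samples; both are assumptions of this illustration.

```python
import math


def split_train_test(series, train_fraction=0.9):
    # Chronological split: the earliest samples train the model,
    # the most recent ones are held out for testing.
    n_train = math.floor(train_fraction * len(series))
    return series[:n_train], series[n_train:]
```

With 24 samples, for example, this gives 21 training samples and 3 test samples.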

D. PREPARE PREDICTORS AND RESPONSES
The training sequences are shifted by n time steps to forecast the value at a future time. This is done to make sure that the proposed method learns to predict ahead of the input sequences. The predictors and responses for the proposed algorithm can be obtained as in (17) and (18):
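The shift between predictors and responses can be illustrated as follows. The sketch assumes that the responses are simply the training sequence shifted forward by n time steps, so that the value at step k is paired with the value at step k + n. The function name is hypothetical.

```python
def make_predictors_responses(series, n=1):
    """Pair each value with the value n time steps later."""
    if n < 1 or n >= len(series):
        raise ValueError("n must be between 1 and len(series) - 1")
    predictors = series[:-n]  # everything except the last n values
    responses = series[n:]    # the same sequence shifted forward by n
    return predictors, responses
```

For the sequence `[1, 2, 3, 4, 5]` and n = 1, the predictors are `[1, 2, 3, 4]` and the responses are `[2, 3, 4, 5]`.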

E. HYBRID DEEP LEARNING ARCHITECTURE
The hybrid deep learning (LSTM-CNN) architecture has been designed using the LSTM and CNN deep learning techniques. Due to weather variability, it is difficult to predict the PV output accurately over a longer time horizon. The CNN has an adaptive mechanism for capturing complex relationships among properties of a variable nature, which motivated the choice of CNN over other deep learning methods to predict the yearly PV output. As illustrated in Fig. 8, a deep learning network using two LSTM layers, denoted LSTM 1 and LSTM 2, with 500 and 1000 hidden units, respectively, is initially considered. These LSTM layers are then combined with the input data $I_{years}$, which is a 5 by 12 matrix as presented in (19).
The LSTM network is designed with a fully connected layer and a regression output layer to obtain the outputs $O_{nyears}$. The LSTM network was set up with the training option properties given in Table 2.
The LSTM network is designed to predict the input values at the future time $nyears$, where $nyears = years + n$. The value of n can be replaced by any number of years. The output $O_{nyears}$ of the LSTM network is a 5 by 12 matrix that gives all influential input values for n years, as presented in (20).
Step 4: The output from the LSTM network is used as the input to the CNN network, as presented in Fig. 8. The CNN network is designed with three 2-D convolutional layers followed by equal numbers of BNLs, ReLU layers, and APLs. A DL is also added to the CNN network to handle overfitting. Finally, it has a fully connected layer with 12 outputs for each year, which is followed by the regression layer. The network was set up with the training option properties given in Table 3. The CNN network is then trained using $I_{years}$, where $I_{years} = 24$, from 1990 to 2013. It is then used to predict the output energy $E_{nyears}$ for $nyears$ as given in (21) (where the value of $n_{LSTM} = n_{CNN}$).

V. RESULTS AND DISCUSSIONS

A. PREDICTION RESULTS
The forecasting performance is tested for the North Queensland locations, namely Cairns, Gladstone, Rockhampton, and Townsville. However, for brevity, only the results for Cairns are presented in this section. Historical meteorological data from 1990 to 2013 in Cairns have been used to train the model. The downloaded data files contain some low-quality and missing data, as well as format compliance issues with SAM and the proposed prediction model. To resolve these problems, data cleaning has been carried out based on the physical model. The SAM has also been used for data cleaning in this paper. For example, a PV output greater than the capacity value at very low irradiance, or any PV output at night, is flagged as bad data. Sometimes a PV output is obtained as a result of missing data; this is also flagged as bad data. Similar to [7], 5271 of the 5461 daytime hours are considered good data in this work. The bad data are excluded from the training of the proposed method. Missing data have often been filled using the measurements of the previous hour. The SAM model is later used to compare the forecast model with the baseline PV model.
After processing the monthly weather input parameters and the energy output estimated in Step 1 and Step 2 of Section IV, the historical dataset of 24 years with a list of input values is established. These have been processed later to prepare an input matrix $I_{years=24}$ as in (18) for the LSTM (see Step 3). The proposed LSTM algorithm predicts the output matrix $O_{nyears=6}$ for 6 years, as illustrated in Fig. 9.
In Fig. 9, the GHIs from 2014 to 2020 are presented. The predicted GHI values are compared with the actual measured values to validate the performance of the proposed method. [...] whereas the predicted meteorological data have been used to obtain the PV output for 2014 using the physical model of PV. From the results given in Fig. 10, it is evident that the predicted energy output closely matches the actual energy output of 2014, with errors of less than 3.3. It is worth noting that the energy predictions for May and September show the highest positive errors, while August has the highest negative error. Thereby, it is evident that the proposed method is able to forecast the long-term energy from PV systems.
The proposed method is further used to estimate the energy output of a PV system at Cairns. The yearly predicted energy outputs for 2015 to 2020 are given in Fig. 11. From the yearly predicted results, it is evident that the energy production would be high from September to January and low from February to August, which is the general trend for PV systems in North Queensland.

B. COMPARATIVE ANALYSIS
There is no standard set of performance comparison parameters for the existing forecasting techniques. Hence, it is important to cover a reasonable range of performance parameters when benchmarking the proposed method. Four well-known forecasting performance parameters, namely RMSE, nRMSE, mean absolute percent error (MAPE), and Rvalue, are used to benchmark the proposed algorithm.
The RMSE is more sensitive to large forecast errors [14], [26]. Hence, it is suitable where small errors are more tolerable than larger ones. The RMSE can be expressed as in (22) [14]. In (22), $PV_i^a$ is the actual PV output power, $PV_i^f$ is the forecasted power, and N is the number of observations. The RMSE is normalized with respect to the maximum and minimum values of $PV_i^f$ to obtain the nRMSE as given in (23). It should be noted that the lower the RMSE and nRMSE values, the better the forecasting performance of the algorithm.
The MAPE is a widely used index for determining forecast accuracy because of its scale-independence and interpretability. The MAPE and error variance can be calculated as in (24) [17], [26], [27], where FH is the forecast horizon and $PV_t^p$ is the peak output power at time t. A higher MAPE value means lower accuracy of the forecasting algorithm, whereas a lower MAPE value means higher accuracy.
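The error indices described above can be sketched in plain Python as follows. The RMSE is standard. The nRMSE follows the description in the text (normalization by the range of the forecast values). The MAPE is written here with the absolute error normalized by the peak output power and averaged over the forecast horizon, which is an assumption about the form of (24) rather than a confirmed reproduction of it. All function and argument names are illustrative.

```python
import math


def rmse(actual, forecast):
    # Root mean square error over N observations.
    n = len(actual)
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / n)


def nrmse(actual, forecast):
    # RMSE normalized by the range (max - min) of the forecast values.
    return rmse(actual, forecast) / (max(forecast) - min(forecast))


def mape(actual, forecast, peak):
    # Mean absolute percentage error over the forecast horizon, with each
    # absolute error normalized by the peak output power (assumed form).
    horizon = len(actual)
    total = sum(abs(a - f) / p for a, f, p in zip(actual, forecast, peak))
    return 100.0 * total / horizon
```

Normalizing by the peak power rather than by the actual value avoids division by near-zero outputs around sunrise and sunset, which is a common reason for this choice in PV forecasting.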
The Rvalue is the correlation between the predicted values and the observed values [27]-[29]. It gives an idea of how well the model generalizes. An Rvalue close to one means that the forecasted values lie very close to the fitted regression line and that the model can be used in more generalized cases. The Rvalue can be calculated as in (25) [27]-[29].
Table 4 shows the baseline comparison of the proposed method against four well-established methods from the literature for forecasting the long-term energy from the PV system. All four benchmarking performance indices mentioned earlier are used for the comparison. From the results given in Table 4, it is evident that the proposed method has an RMSE of 3.89, which is very low compared to the other methods. The nRMSE value of the proposed method is 0.0529, which is considerably lower than that of the others. However, this can be further improved with training. The given algorithm outperforms all other existing algorithms in MAPE, which is 2.83 for the studied location. Furthermore, the Rvalue of the proposed method is 0.9. This indicates that the predictions of the proposed method lie very close to the fitted regression line. Moreover, it is worth noting that the Rvalue of the proposed method is higher than that of statistical analysis. However, the Rvalue of the given method is slightly lower than that of random forest and NNE. From the comparative results, it is evident that the proposed forecasting algorithm is more accurate for forecasting the long-term energy output of the PV system.
Fig. 12 illustrates the RMSE and MAPE values of the proposed method for all four studied locations in North Queensland, namely Cairns, Gladstone, Rockhampton, and Townsville. From the results given in Fig. 12, it is evident that the RMSE values of the given method are lower than 15 in all studied locations. Moreover, the given algorithm has good forecast quality for various locations, with the RMSE ranging from 3.89 to 11.87.
It is worth noting that the MAPE values of the studied locations range from 2.5 to 7.8, which makes the proposed method more reliable in estimating the long-term energy output of PV.
Further analyses are conducted to evaluate the reliability of the proposed method for different datasets and numbers of LSTM layers. The RMSE, MAPE, nRMSE, and Rvalue are used as the indices to measure the reliability of the proposed method. Table 6 gives the performance of the proposed model under different lengths of training data (i.e. 5 years, 10 years, and 25 years). Table 7 shows the performance of the proposed method under different LSTM layers, together with the standard deviation of the indices in relation to the results presented in Table 4. From the results given in Tables 6 and 7, it is evident that the mean standard deviations of all the indices are low in relation to the actual values presented in Table 4. For example, the average RMSE standard deviation is only ±1.2 in the worst-case scenario and ±0.2 in the best-case scenario. Therefore, the variations in the performance indices of the given method remain low under the various factors affecting the forecasting performance. From the results given in Table 6, it is worth noting that the Rvalue is reduced significantly for the smaller training datasets. Moreover, the MAPE value for the smaller training datasets is also high. However, the average changes are low, which suggests that the proposed method is reliable.

VI. CONCLUSION
This paper proposed a new and reliable method for forecasting the long-term output of solar PV. The proposed method utilized multivariate long and short-term memory and a convolutional neural network to develop the technique for forecasting the PV output. This paper utilized twenty-four years of historical data from various locations in North Queensland, Australia, to validate the performance of the developed model. Additional meteorological parameters have been used in the proposed algorithm based on their positive and negative influences on the output of the PV system. From the given results and comparisons, it is evident that the proposed method may accurately predict the long-term output of the PV system for planning studies, with an RMSE lower than 15 for all studied locations. Moreover, the proposed method is robust compared to some well-established methods such as ANN, random forest, NNE, and others. The proposed algorithm was run in MATLAB R2018b (9.5) with a computational cost for training and prediction of 203.63 s. Therefore, it could be considered a low computational cost algorithm compared to others.
In this study, several assumptions had to be made for PV output forecasting. Therefore, a further sensitivity study in this domain will be performed in the future. This work will be further extended to forecast the long-term generation reserve in power systems with high penetration of wind and solar in Australia.
BIPLOB RAY (Member, IEEE) received the Ph.D. degree in information technology from Deakin University, Australia. He is working as a Senior Lecturer in information technology with CQUniversity, with a background of a mix of research, academic, and industry experience. His teaching and research are currently focused on networked intelligent systems, big data, security protocols, and the privacy of mHealth. His high-quality research work has been recognized by peers and cited extensively. He has more than 20 international journal and conference publications. He has appeared as a keynote/plenary speaker at a number of international conferences.
RAKIBUZZAMAN SHAH (Member, IEEE) received the Ph.D. degree from The University of Queensland, Brisbane, QLD, Australia. He is a Senior Lecturer in smart power systems engineering with the School of Engineering, Information Technology, and Physical Sciences, Federation University Australia (FedUni Australia). Prior to joining FedUni Australia, he worked with The University of Manchester, The University of Queensland, and Central Queensland University. He has experience working at and consulting with DNOs and TSOs on individual projects and collaborative work on large projects (EPSRC project on multi-terminal HVDC, Scottish, and Southern energy multi-infeed HVDC), primarily on the dynamic impact of integrating new technologies and power electronics into large systems. He is an active member of the CIGRE. He has more than 60 international journal and conference publications, including 18 journals in the IEEE and IET, and has spoken at leading power system conferences around the world. His research interests include future power grids, such as renewable energy integration and wide-area control, asynchronous grid connection through VSC-HVDC, power system stability and dynamics, the application of data mining in power systems, the application of control theory in power systems, distribution system energy management, and low-carbon energy systems.