Convergence of Photovoltaic Power Forecasting and Deep Learning: State-of-Art Review

Deep learning (DL)-based photovoltaic (PV) Power Forecasting (PVPF) has recently emerged as a promising research direction for making energy systems more intelligent. With the massive integration of smart meters, DL takes advantage of large-scale, multi-source data representations to achieve higher accuracy and PV forecastability than classical models. This review article presents a taxonomy of the mainstream DL-based PVPF methods and examines their strengths and weaknesses. First, we draw connections between PVPF and DL approaches and show how this relation might cross-fertilize or extend both directions. Then, the methods are discussed in three classes: discriminative learning, generative learning, and deep reinforcement learning. In addition, this review analyzes recent automatic architecture optimization algorithms for DL-based PVPF. Next, the notable DL technologies are described, including federated learning, deep transfer learning, incremental learning, and big data DL. After that, DL methods are classified into deterministic and probabilistic PVPF. Finally, this review concludes with research gaps, future challenges, and research directions for furthering the success of DL techniques in PVPF applications. By compiling this study, we expect to help stakeholders widen their knowledge of the potential of DL for PVPF.


I. INTRODUCTION
The rapid expansion of Distributed Energy Resources (DERs) is driven by the vast exploitation of carbon-intensive energy sources and climate change concerns that threaten human survival and social progress [1]. Among all alternative sources, solar energy, specifically, Photovoltaic (PV) solar energy, has been getting the highest interest globally in the modern electricity grid, with estimates to satisfy a quarter of electricity needs by 2050 [2]. PV power plants' deployment merits include inexhaustibility in PV systems supply, long-life span, and excellent economic viability in the mid and long term [3]. However, the discontinuity and time-varying behavior of PV power flow bring into question the reliability and efficiency of PV systems [4]. Moreover, the sudden weather changes threaten the unit commitment and affect the demand and supply balance [5]. Therefore, PV Power Forecasting (PVPF) is a crucial factor for reliable power supply as it significantly reduces the sensitivity of energy systems to weather intermittency [6]. Consequently, the futuristic Smart Grid (SG) paradigm has considerably spurred the adoption of accurate PVPF techniques.
In this context, the energy community has been focusing on developing effective forecasting techniques to meet various technical challenges [7], [8]. With advances in computer hardware and software, forecasting models take advantage of High-Performance Computing (HPC) to achieve higher effectiveness. PVPF plays a vital role in risk assessment and in risk-related decision-making for an uninterruptible energy supply. PVPF can be conducted directly, by predicting the PV Power Generation (PVPG) [9], or indirectly, by predicting the most relevant environmental factors originating from weather conditions, such as solar irradiation (Fig. 1). Solar energy is regarded as the most significant and critical parameter in determining the characteristics of the solar units [10], [11]. In the indirect approach, the predicted output is then employed to deduce the PVPG via a predetermined mathematical model. It has been reported that direct PVPF leads to more accurate results than indirect PVPF [12], [13]. Nevertheless, this review article considers both direct and indirect PVPF models.
Conceptually, methods for determining the PV Power Generation (PVPG) fall into four groups: 1) physical methods, 2) statistical methods, 3) AI methods, and 4) hybrid methods that combine them, as illustrated in Fig. 1. The physical models establish the mathematical formulas for the PVPG equipment to derive a deterministic closed-form solution for PVPF [14]. Physical models employ Numerical Weather Prediction (NWP) or ground measurement devices that meet the appropriate calibrated service facilities [15]. In [16], a physical model based on NWP was adopted for solar irradiation uncertainty forecasting. It was shown empirically that a suitable selection of the modeling window length is critical for predicting the confidence intervals. However, the proposed model has poor anti-interference capabilities, which is reflected in its unsatisfactory prediction performance [16]. On the other hand, statistical forecasting is carried out through extensive analysis of numerical patterns based on statistical theory. Statistical algorithms require an acquired data set to build their domain knowledge since they neglect the underlying physical process. Moreover, statistical and physical models have shown unsatisfactory accuracy in numerous non-trivial problems such as Renewable Energy (RE) forecasting and weather forecasting. AI, specifically Machine Learning (ML), comprises advanced approaches that acquire knowledge from data and lead to accurate results and better generalization capabilities. Although ML is a very promising domain for power systems due to the abundance of computational resources and high-resolution databases, ML techniques have received little consideration compared to statistical and physical techniques in PV systems [17].
Deep Learning (DL) is considered an evolution of ML comprising multiple cornerstone models. More broadly, DL has been given significant emphasis in academic circles for the last decade but has only recently broken into the industrial world for application-oriented research. Artificial Neural Networks (ANN) with multiple layers (hence called ''deep'') of interconnected neurons have sparked a renewed interest in the research community, resulting in a plethora of research papers. Non-deep learning methods comprise one to three operational layers, whereas DL methods stack multiple layers (more than three) of simple modules hierarchically. DL has an advantage over classical ML methods in distinguishing and learning multiple complexity levels due to three principal factors [18].
Firstly, the classical models rely heavily on hand-crafted features to track data patterns. This task necessitates manual design and feature engineering, which are often labor-intensive and ad hoc [19]. Fortunately, DL methods can learn data representations using a general learning process [18]. This eliminates the need for the domain expertise and hard-core feature extraction adopted in hand-engineered and shallow learning-based models. More specifically, DL-based feature extraction is conducted automatically and optimally configured using an end-to-end pipeline to promote faster learning without being told to do so explicitly [20]. Secondly, traditional ML techniques such as Random Forest (RF) and Decision Trees (DT) might not handle multidimensional data [19]. For such models, training becomes terribly slow for deep structures with varying level sizes [19]. Furthermore, the efficiency of these models can be unsatisfactory since the variable correlations are neglected [19]. Fortunately, DL methods can efficiently be fueled by massive amounts of data with a high level of complexity and multidimensionality to predict nonlinear behaviors accurately. Thus, DL models can achieve outstanding predictive performance without the need for pre-defined relationships. Big Data technology uses DL to process a large pool of datasets and offers a potential solution to this problem. Thirdly, DL models can hold and store more information within the neurons than the basic ANN model [18]. This allows learning distributed representations (many-to-many relationships between types of representations), enabling generalization to new combinations of values not explicitly shown in the learning data [18]. These factors have made the deployment of DL models more application-oriented than ML, strengthening their adoption by advanced manufacturers.

A. THE DIRE NECESSITY OF DL IN PVPF
High-precision PVPF can promote the grid's accommodation of PVPG by alleviating the negative impacts of uncertainties on the utility grid. However, it is quite challenging to achieve satisfactory results with the classic prediction models. Recently, DL has become a research hotspot for its excellent ability to handle nonlinear Time Series (TS) energy data [21]. Thus, the marriage of PVPF and DL gives an impetus to build more sustainable and robust energy management paradigms [20]. DL methods have been successfully used in solar irradiance and solar power production forecasting. Notably, Deep Neural Network (DNN) architectures can learn hierarchical features from the data set while providing a more efficient representation than shallow models and improving generalization potential [22]. Hence, by reducing the unpredictability factor, the research community tends to make great strides towards accurate forecasts and reliable decision-making [22].
With the widespread adoption of Advanced Metering Infrastructure (AMI), massive amounts of stored information with a variety of data types, complex relationships, and explosive growth will be continuously generated from PV stations, resulting in big or fast/real-time data streams. DL models can handle the large amounts of data generated from weather stations with Big Data Deep Learning (BDDL) to produce accurate results. DL models take advantage of the increasing computational power of Graphics Processing Units (GPUs) to process massive data streams for high-quality forecasts. As an efficient big data-driven analytic scheme, DL independently extracts features and can process low-quality data that contain noise and heterogeneity. DL models can efficiently handle the complexity, diversity, and integrity problems encountered in meteorological data integration to improve the stability and security of power dispatch [20].

B. REVIEW NOVELTY AND CONTRIBUTIONS
The rising interest in DL underpinning PVPF systems has intensified the need for a taxonomic review to summarize the most recent developments in PVPF [23]. Fig. 2 illustrates the milestones of AI development from the early attempts until the emergence of DL in 2010. This paper's primary motivation is to provide a unifying overview of the DL methods related to PVPF applications. This article seeks to foster the synergy between PVPF systems and DL methods. The main contribution is to enable further work by both industry and academia to speed up the practical adoption of DL techniques for PVPF. A bibliometric and network analysis on the PVPF topic was conducted using the Web of Science (WoS) core collection database and the VOSviewer software to organize the data in a more reader-friendly form [24]. The VOSviewer software was employed to reveal the thematic content of the article set based on the identification of the keywords. Keywords that were included by the authors of the articles and occurred more than three times in the WoS core database from 2015 to 2021 were exported into Comma-Separated Values (CSV) format and included in the final analysis. Of the 150 keywords identified by the initial search, 50 met the threshold. The keyword combinations are employed in the systematic review protocol to provide a broad overview of research trends in DL techniques for PVPF systems. Fig. 3 presents the mapping analysis of the commonly occurring terms with VOSviewer, a bibliometric analysis of the author-supplied keywords. The size of a node represents the frequency of occurrence, the connections between nodes illustrate co-occurrence in the same article, and a shorter distance between keywords indicates that they co-occur more frequently.
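The keyword-thresholding step described above can be illustrated with a minimal Python sketch. It is not the authors' actual workflow (which relied on the WoS export and VOSviewer); the function names, the lower-casing rule, and the toy records are assumptions made purely for illustration. The threshold ''more than three times'' is encoded as a minimum count of four.

```python
import csv
import io
from collections import Counter


def frequent_keywords(articles, min_count=4):
    """Count author keywords across articles and keep those meeting the threshold.

    `articles` is a list of keyword lists, one per article. Keywords are
    lower-cased so that "Deep Learning" and "deep learning" are merged
    (an assumption), and each keyword counts at most once per article.
    """
    counts = Counter()
    for keywords in articles:
        counts.update({kw.strip().lower() for kw in keywords})
    return {kw: n for kw, n in counts.items() if n >= min_count}


def to_csv(keyword_counts):
    """Serialize keyword counts to CSV text, most frequent keyword first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["keyword", "occurrences"])
    ordered = sorted(keyword_counts.items(), key=lambda item: (-item[1], item[0]))
    for kw, n in ordered:
        writer.writerow([kw, n])
    return buffer.getvalue()


# Hypothetical toy records; a real run would read the database export instead.
toy_articles = [
    ["Deep learning", "Forecasting"],
    ["deep learning", "Solar power"],
    ["Deep Learning", "forecasting", "LSTM"],
    ["deep learning", "LSTM"],
]
kept = frequent_keywords(toy_articles, min_count=4)
print(to_csv(kept))
```

In this toy set only ''deep learning'' occurs in more than three articles, so it is the only keyword that survives the threshold.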
Fig. 3 shows that the largest and most noticeable node is ''deep learning'', followed by ''prediction'' with a smaller size. From Fig. 3, the emergent research topics are classified into two core clusters: DL models and enabling technologies such as Deep Transfer Learning (DTL). These clusters are used to organize the paper content. To do so, the recent DL architectures applied to PVPF in the period 2015 to 2021 have been analyzed. Furthermore, the development of Deterministic PVPF (DPVPF) and Probabilistic PVPF (PPVPF) is presented. In this work, we focus on reviewing the current progress and pointing out potential future directions of DL for PVPF. Some existing works have studied PVPF and AI; they are listed in Table 1 with a brief description of their related topics and their differences from this review [22], [25]-[29]. In [30], a comprehensive review of RE forecasting methods was conducted with a particular emphasis on wind and solar energy. That review reports a growing interest in DL techniques for forecasting applications owing to their inherent feature extraction capabilities. However, its coverage includes both wind and solar resources, which may dilute the contributions and explanations, especially when discussing the forecasting architectures. In [6], Renewable Energy Sources (RES) forecasting methods were reviewed. In the RES context, the authors provided common understandings and promising research insights, including hierarchical forecasting, probabilistic forecasting, and forecast combination. Additionally, some helpful recommendations and common research pitfalls for publishing in high-quality journals were provided. Nonetheless, the AI techniques adapted for RES forecasting were not discussed. More importantly, these works cover the whole RES spectrum and do not focus mainly on PVPF.
Therefore, discussions of the key characteristics of PV variability and of how DL can address the PV limitations are scarce. In [31], the authors reported the typical policies related to solar forecasting for grid penetration. However, this work was not extended to investigate DL forecasting techniques intensively. In [32], the concept of overarching thinking is introduced to contradict the basic classification of predictive methods into statistical, ML, or NWP. Moreover, the main post-processing methods for PVPF were reviewed to enhance the goodness of the forecasts. However, an in-depth analysis of regression models is not provided. Paper [22] reviewed direct PVPF with a special focus on the statistical and ML models in the literature. The authors classified the data-driven techniques into persistence methods, statistical approaches, ML approaches, and hybrid techniques. Nevertheless, DL methods, their potential benefits, and their shortcomings are not discussed in detail [22]. In [25], the authors limited their review to probabilistic forecasting for electricity consumption and PVPG. This review reported that all the forecasting engines essentially depend on the extreme forecasting scenarios, leading to poor computational scalability in the existing ML methods [25]. Therefore, every model should be customized to perform a forecasting task well. However, the DL model architectures were not explored in detail. Besides, the implementation factors and key issues of DL approaches were not outlined. The authors in [29] extensively studied the BD models for PVPF. Thirty-eight papers were analyzed in depth to list the most relevant ML models. It was pointed out that Extreme Learning Machines achieved an excellent accuracy-computational time tradeoff [29]. Nonetheless, the DL paradigms and their implementation were not covered. Besides, the notable techniques for PVPF deployment were not explored explicitly.
All the related works partially cover the aims of this work. However, the related works paid less attention to results based on DL methods. In contrast, we limit this holistic review to the DL-based PVPF, leaving aside shallow ML and physical methods-based PVPF. Therefore, this study provides insights not previously fully covered or evaluated by other reviews [33].
This review gives particular emphasis to the application of DL methods to PVPF. To the best of our knowledge, and unlike the previous works, this is the first initiative to give a bird's-eye view of the applications of DL for PVPF, which is not adequately addressed in the existing literature. To fill this gap, this review focuses on the use of DL for PVPF applications. The main contributions of this review are as follows:
• First, we derive taxonomies for PVPF based on various criteria such as the forecasting horizon, forecasting models, system features, and forecasting range.
• Second, the DL model-based PVPF methods are systematically classified into learning-related categories. A comprehensive review of the different algorithms applied to PVPF is provided to give critical insights into their strengths and limitations. We aim to allow the reader to readily distinguish the efficacy gaps at a glance. Further, the role of meta-heuristics in carrying out the functions required in DL within the PVPF realm is evaluated and discussed.
• Third, the enabling technologies for DL-based PVPF, such as federated learning, transfer learning, and BDDL, are rigorously reviewed in a more comprehensive and application-oriented manner, where previous works are summarized logically.
• Fourth, pioneering works related to deterministic and probabilistic PVPF are investigated in depth.
• Fifth, a critical view of the existing research challenges is presented, and future directions in PVPF studies toward the deployment of competent, scalable, and computationally effective DL-based algorithms are discussed.

C. REVIEW STRUCTURE
The rest of this paper is structured as portrayed in Fig. 4. Concretely, Section II presents the review methodology. Section III describes the popular taxonomies of PVPF techniques. Section IV comprehensively investigates the DL methods for PVPF. Section V presents the possible enabling DL techniques for PVPF. Section VI discusses the significant applications of DL techniques for PVPF. These applications cover DPVPF and PPVPF. Section VII presents the possible future directions for empowering the PVPF performance by emphasizing the undiscovered fields. Section VIII concludes this review paper.

II. RESEARCH METHODOLOGY AND SYSTEMATIC REVIEW PROTOCOL
To benefit readers, extensive searches have been performed to retrieve the most relevant content. For instance, the time horizon is one of the essential criteria for classifying PVPF techniques. Depending on the time domain, there are four distinct forecasting horizons, as illustrated in Fig. 1: Ultra-Short-Term (USTF), from seconds to one hour [34]; Short-Term (STF), with a prediction period from hours to one day; Medium-Term, which spans up to a month ahead; and Long-Term, from a month to a year [12]. With the aim of covering the largest number of articles on the review topic, possible variations of the keywords were also employed for this selection. Hence, the search string utilized was: 'Deep learning' AND 'Photovoltaic power' AND 'Forecasting' OR 'Prediction' OR 'Solar power' OR 'Solar irradiance' OR 'deep neural networks', as shown in Fig. 5. The keyword searches on each platform were carried out on June 15, 2019, yielding 350 articles. The identification of the relevant research works was conducted according to the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) [35]. We identified the main academic research databases, including Google Scholar, IEEE Xplore, Science Direct, and Nature. The identification process of relevant papers, from the early stage of the selection to the final selected publications, is displayed in Fig. 6. To achieve a more complete and inclusive understanding, this review paper contributes to the existing research by answering the following five Research Questions (RQ): RQ1: What are the popular taxonomies for PVPF?; RQ2: What are the most up-to-date DL methods for PVPF?; RQ3: What are the DL methods for deterministic and probabilistic PVPF?; RQ4: How can big data and transfer learning enhance the PVPF accuracy?; RQ5: What are the research frontiers and future research directions?
The answers to these questions consider multiple sources of information to ensure the accuracy and objectivity of the main findings of this paper. The inclusion criteria were as follows: (1) articles published in English between January 1, 2015, and August 01, 2021; (2) articles that used a particular DL model for PVPF. Fig. 7 presents the variation over time in the frequency of use of the terms ''Forecasting'', ''Deep learning'', and ''Photovoltaics'' in scientific books, obtained from Google Books Ngram Viewer. It can be seen from Fig. 7 that the popularity of DL has significantly increased in recent years, while that of the forecasting paradigm has decreased since 1980. This result suggests that forecasting applications should take advantage of the increasing DL trend, especially for PV systems.
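The four forecasting horizons introduced earlier in this section can be encoded as a small lookup, sketched below in Python. The boundaries follow the ranges quoted in the text; treating ''one month'' as 30 days and ''one year'' as 365 days, as well as the function name and the returned labels, are assumptions made for illustration.

```python
from datetime import timedelta


def classify_horizon(lead_time):
    """Map a forecast lead time to one of the four horizon classes."""
    if lead_time <= timedelta(0):
        raise ValueError("lead time must be positive")
    if lead_time <= timedelta(hours=1):
        return "ultra-short-term"   # seconds to one hour
    if lead_time <= timedelta(days=1):
        return "short-term"         # hours to one day
    if lead_time <= timedelta(days=30):   # assumption: one month = 30 days
        return "medium-term"        # up to a month ahead
    if lead_time <= timedelta(days=365):  # assumption: one year = 365 days
        return "long-term"          # a month to a year
    raise ValueError("lead time exceeds the one-year range of the taxonomy")


for lead in (timedelta(minutes=15), timedelta(hours=6),
             timedelta(days=7), timedelta(days=180)):
    print(lead, "->", classify_horizon(lead))
```

A 15-minute-ahead forecast, for example, falls in the ultra-short-term class, whereas a day-ahead forecast sits at the upper edge of the short-term class.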

III. OVERVIEW OF PVPF AND DL
To help readers better distinguish between the emerging learning paradigms applied to PVPF, DL methods are classified into three classes: discriminative learning, generative learning, and Deep Reinforcement Learning (DRL). A brief review of these three concepts is presented. We then briefly discuss the potential of well-established, mature DL structures in PVPF. Furthermore, we briefly investigate several Hyperparameter Optimization (HO) methods used for DL in PV systems.

1) MLPs
MLP, also named Fully Connected Network (FCN), is the quintessential DNN model [36]. Conceptually, the MLP consists of three types of fully connected layers in a feed-forward architecture: an input layer, a number of intermediate (hidden) layers, and an output layer, as shown in Fig. 8(a) [8], [37]. As the network depth increases, the hidden layers computationally reveal underlying patterns of data at deep levels of abstraction. With the use of nonlinear/linear activation functions, the MLP configuration can resolve complex mappings between a set of observations and response variables. The MLP mechanism is described as follows [8]:
$$y_k = \varphi\Big(\sum_{j} w_{kj}\, x_j + b_k\Big)$$
where $x_j$, $w_{kj}$, and $b_k$ denote the neurons' input, synaptic weight, and bias term, respectively, $y_k$ is the output of neuron $k$, and $\varphi(\cdot)$ denotes the activation function. As a supervised learning algorithm, the MLP network employs backpropagation to minimize the cost function. Moreover, two common training methods are used with the MLP: Feed-forward Back-Propagation and Levenberg-Marquardt. The optimal selection of the training algorithm can enhance the convergence speed and the model accuracy.
Pioneering work is presented in [38], which shows that the typical MLP network is well suited to day-ahead PVPF. More importantly, the developed method could offer prominent enhancement of the day-ahead PVPF results, especially with the TanhAxon activation function and the Levenberg-Marquardt learning rule. However, hyperparameter tuning for a deep MLP involves considerable modeling complexity, which could be a challenging issue. Such observations are confirmed in [39], wherein the MLP performance mainly hinges on offline training and the suitability of the hyperparameters, which require significant human labor for fine-tuning. In [39], a two-state model-based MLP and a Knowledge-Based Neural Network (KBNN) were designed for offline and online on-site deployment. Here, the KBNN is used to avoid data shortage by including prior analytical prediction equations. Hence, the KBNN significantly improves on the MLP model when labeled data are insufficient. The approximate Mean Absolute Percentage Error (MAPE) is 11%. In [40], a DL-based mapping model between the concurrent sky image and surface irradiation has been introduced. Aside from the hybrid model, the sky images were clustered in the preprocessing stage using K-means clustering based on a Convolutional Autoencoder (CAE) to enhance the feature representation of high-dimensional data. Despite the high complexity cost, the mapping modeling of surface irradiance boosts the forecasting model performance. In [41], an MLP model has been proposed for solar radiation forecasting. Four types of uncertainties were presented: errors from meter measurements, stochasticity of weather data, ML uncertainties, and errors due to the forecasting range. A reliability index was introduced to assess the goodness of the forecasts [41]. Although the model generalizes to complex cases, the MLP is limited in revealing patterns among sequential samples such as PV generation data. This is mainly because this configuration does not save the previous information in an internal memory [42]. Consequently, the Time Series (TS) samples are treated independently during training, which may lead to poor accuracy [42].
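To make the MLP mechanism concrete, the following minimal Python sketch propagates one input vector through a small feed-forward network using the weighted-sum-plus-activation rule (inputs x_j, synaptic weights w_kj, bias b_k). The layer sizes, weight values, and feature names are hypothetical, and plain lists are used instead of a DL library so that the arithmetic stays visible; it illustrates only the forward pass, not the training procedures cited above.

```python
import math


def dense(inputs, weights, biases, activation):
    """One fully connected layer: y_k = activation(sum_j w_kj * x_j + b_k)."""
    return [
        activation(sum(w_kj * x_j for w_kj, x_j in zip(row, inputs)) + b_k)
        for row, b_k in zip(weights, biases)
    ]


def mlp_forward(x, layers):
    """Propagate x through a list of (weights, biases, activation) layers."""
    for weights, biases, activation in layers:
        x = dense(x, weights, biases, activation)
    return x


def identity(v):
    """Linear activation, used here for the regression output."""
    return v


# Hypothetical toy network: 3 inputs -> 2 hidden tanh units -> 1 linear output.
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1], math.tanh),
    ([[1.0, -1.0]], [0.2], identity),
]
# Hypothetical scaled inputs, e.g. irradiance, temperature, humidity.
features = [0.6, 0.4, 0.9]
prediction = mlp_forward(features, layers)
print(prediction)
```

Note that each call to `mlp_forward` depends only on the current input vector, which is exactly why the MLP cannot carry information from one time step to the next.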

2) CNNs
CNNs, popularly termed ConvNets, are a class of feed-forward networks designed to process grid-like structured data [28]. The core of the CNN is built on three principal elements: convolution, pooling computations, and Fully Connected Layers (FCL), as shown in Fig. 8(b). These elements lead to a strong feature extraction capacity and robust feature representation [43]. The convolutional layer extracts local features from contiguous data. The objective of the pooling layers is to merge semantically similar features into a single one by applying a specific function. This allows the pooling layer to reduce the feature map dimension, accelerating the system convergence [43]. Depending on the dimensionality of the convolution, CNNs are divided into one-dimensional, two-dimensional, and three-dimensional CNNs. For PVPF, the one-dimensional CNN is essentially used to process sequential data [28]. The fully connected layers are usually the last few layers and are used to summarize information.
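The three CNN building blocks can be sketched for the one-dimensional case as follows. This is a minimal illustration in plain Python under stated assumptions: a single hand-picked kernel, ReLU as the nonlinearity, non-overlapping max pooling as the pooling function, and hypothetical hourly PV readings. A practical model would learn many kernels with a DL framework.

```python
def conv1d(series, kernel, bias=0.0):
    """Valid 1-D convolution (cross-correlation, as in DL libraries) plus ReLU."""
    k = len(kernel)
    return [
        max(0.0, sum(w * series[i + j] for j, w in enumerate(kernel)) + bias)
        for i in range(len(series) - k + 1)
    ]


def max_pool1d(feature_map, size=2):
    """Non-overlapping max pooling; a trailing remainder is dropped."""
    return [
        max(feature_map[i:i + size])
        for i in range(0, len(feature_map) - size + 1, size)
    ]


def fully_connected(features, weights, bias=0.0):
    """Single-output fully connected layer that summarizes the pooled features."""
    return sum(w * f for w, f in zip(weights, features)) + bias


# Hypothetical hourly PV output in kW.
pv_power = [0.0, 0.5, 1.5, 3.0, 4.0, 3.5, 2.0, 0.5]
kernel = [-1.0, 0.0, 1.0]  # responds to rising power over a 3-step window

feature_map = conv1d(pv_power, kernel)    # local features from contiguous data
pooled = max_pool1d(feature_map, size=2)  # reduced feature-map dimension
forecast = fully_connected(pooled, weights=[0.1, 0.2, 0.3])
print(feature_map, pooled, forecast)
```

The kernel slides over contiguous samples with a fixed width, which is the constant-size receptive field that limits a plain CNN to local feature learning.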
In [44], the authors introduced a hybrid CNN-Long Short-Term Memory (LSTM) network-based PVPF model. In this hybrid paradigm, the CNN automatically filters out noise and extracts the valuable features, while the LSTM efficiently handles sequential inputs. More specifically, the authors exploited the Multiple Relevant and Target variables Prediction Pattern (MRTPP) method to optimize the distribution of the input features. This improves the forecasting engine's efficiency in capturing the nonlinear variation of PVPG for multi-step prediction. However, the convolutional layer in a CNN has a convolutional kernel of constant size and a limited receptive field, which restricts it to local feature learning. In [45], the authors proposed a specialized CNN for 15-min ahead minutely-averaged PVPF. The proposed model provides accurate predictions, represented by a 5.7% forecast skill, without intensive hand-engineered features as input. However, the major limitation of the CNN is that it is not fully suitable for capturing time dependencies.

3) RNNs
RNNs are a special type of DNN based on control theory, composed of a chain of neurons whose output is connected not only to the next layer but also fed back through a feedback connection, as shown in Fig. 8(c) [46]. By taking TS data as a sequential input vector, the RNN cell allows the underlying information to persist until it is fed back for the next prediction. Thus, the RNN provides a quicker implementation and fast training. The TS data pass through a cell as a sequential vector: at each step, the cell output value is concatenated with the next time step data and serves as input for the next time step. The process is repeated until the last time step. However, the most common drawback of the vanilla RNN model is its limited capability of handling long-term dependencies [46].
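The recurrence described above can be sketched in a few lines of Python. The snippet assumes the common Elman-style formulation h_t = tanh(W_x x_t + W_h h_{t-1} + b), which is equivalent to concatenating the previous output with the current input before a single weighted sum; the weights and the input sequence are hypothetical.

```python
import math


def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One recurrent step: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    return [
        math.tanh(
            sum(w * x for w, x in zip(w_x[k], x_t))
            + sum(u * h for u, h in zip(w_h[k], h_prev))
            + b[k]
        )
        for k in range(len(b))
    ]


def rnn_forward(sequence, w_x, w_h, b):
    """Feed a sequence step by step, carrying the hidden state forward."""
    h = [0.0] * len(b)
    states = []
    for x_t in sequence:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states


# Hypothetical toy cell: 1 input feature, 2 hidden units.
w_x = [[0.5], [-0.3]]
w_h = [[0.1, 0.0], [0.0, 0.2]]
b = [0.0, 0.0]
sequence = [[1.0], [0.5], [0.0]]  # e.g. scaled PV power at three time steps

states = rnn_forward(sequence, w_x, w_h, b)
print(states[-1])
```

Even though the last input is zero, the final hidden state is non-zero because it still carries information from earlier steps; repeated multiplication by the recurrent weights is also what makes long-range dependencies hard for the vanilla RNN.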
For TS data, RNN-based models form the core of most sophisticated TS applications, which makes them well suited to PVPF and widely represented in pioneering articles. To name a few, the authors in [47] proposed a deep RNN to predict solar irradiation accurately. Five RNN categories were rigorously described: the standard RNN, the Deep RNN, the stacked RNN, the Deep RNN with shortcut connections, and the Deep RNN with a deep output layer. Compared with other benchmarks, using realistic data from natural resources in Canada, the RNN showed better accuracy when processing high-level TS features. In [48], an LSTM-Convolutional network (LSTMC) was adopted for PVPF. The proposed approach results in a Root Mean Square Error (RMSE) of 0.621 kW for short-term PVPF. Notably, the LSTMC model outperforms the CNN-LSTM. This paper demonstrates that extracting temporal correlations first and then spatial correlations could improve the forecasting effectiveness. Despite the high accuracy of the proposed model, it is concluded that a large number of training samples is needed to achieve generalizability.
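Since the works discussed in this section are compared through error metrics such as RMSE, MAPE, and forecast skill, a short Python sketch of how these quantities are commonly computed may help. The definitions below are the standard textbook ones; in particular, forecast skill is taken here as 1 - RMSE_model / RMSE_reference, which is an assumption, since the reviewed papers may use other reference models or definitions. All numbers are made up for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean square error, in the unit of the data (e.g. kW)."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mape(actual, predicted):
    """Mean absolute percentage error in percent.

    Samples with a zero actual value (e.g. night-time PV output) are
    skipped because the ratio is undefined there.
    """
    terms = [abs((a - p) / a) for a, p in zip(actual, predicted) if a != 0]
    if not terms:
        raise ValueError("MAPE is undefined when every actual value is zero")
    return 100.0 * sum(terms) / len(terms)


def forecast_skill(actual, predicted, reference):
    """Relative RMSE improvement over a reference forecast (assumed definition)."""
    return 1.0 - rmse(actual, predicted) / rmse(actual, reference)


# Hypothetical PV power values in kW.
actual = [2.0, 4.0, 4.0]
predicted = [2.0, 3.0, 5.0]
reference = [1.0, 2.0, 6.0]  # stand-in for a baseline such as persistence

print(rmse(actual, predicted))
print(mape(actual, predicted))
print(forecast_skill(actual, predicted, reference))
```

Because RMSE is scale-dependent while MAPE is relative, values reported for different plants or datasets are not directly comparable without knowing the underlying capacity.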
The LSTM network is a special chain-like structure with memory cells proposed by Hochreiter and Schmidhuber [49]. The LSTM is a particular type of RNN that overcomes the notorious vanishing gradient phenomenon in modeling long-range temporal dependencies [50]. A typical LSTM cell mainly consists of an input gate $i_t$, a forget gate $f_t$, an output gate $o_t$, and one control gate $c_t$ to manage the information flow, as shown in Fig. 8(d). Concretely, these gates are used to control the update, maintenance, and deletion of information contained in the cell status. The operations on the memory block are managed using adaptive multiplicative gates. The LSTM gates, hidden outputs, and cell states are computed as follows [51]:
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$
where $x_t$ and $c_t$ are the input sample at time $t$ and the memory unit, respectively. $b$, $U$, and $W$ stand for the bias terms, weight vector, and input weights for each gate, respectively. The symbol $\odot$ denotes the Hadamard product. $h_{t-1}$ is the hidden-layer output of the previous timestamp, which feeds the respective gates together with $x_t$. $\sigma$ denotes the activation function, applied element-wise. The LSTM has an excellent potential to process time-based information. For instance, the authors in [52] employed an LSTM-based model to perform hourly PVPF. The particle swarm optimization (PSO) algorithm was employed to adjust the load dispatch. The simulation results show that the proposed model considerably increases the prediction accuracy, with a Mean Absolute Percentage Error (MAPE) of 15.87%. However, the computational cost is high compared to the MLP model. The Gated Recurrent Unit (GRU) is considered one of the preferred single TS forecasters [46]. By using recurrent connections, the GRU architecture permits the network to access the historical information. The GRU is a variant of the RNN that has gates that modulate the flow of information inside the unit, as shown in Fig. 8(e). A typical GRU cell is composed of only two gates, the reset gate $r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$ and the update gate $z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$ [46]. $\sigma$ is a smooth and differentiable function; $b_z$, $W_z$, and $U_z$ are the bias, the input weight of the update gate ($z$), and the weight of the previous activation, respectively. The GRU is distinguished by its gate reduction strategy, which accelerates the learning process without lowering the performance.
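The gating logic of the LSTM and GRU cells can be illustrated with a single time step written in plain Python. This is a sketch under assumptions: it follows the widely used textbook formulations (sigmoid gates, tanh candidate state, input weights W, recurrent weights U, bias b), and for the GRU it adopts the convention h_t = (1 - z_t) * h_{t-1} + z_t * candidate, which is one of two equivalent conventions found in the literature. The parameter layout and toy values are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gate(params, x, h, activation):
    """activation(W x + U h + b), element-wise, with list-of-lists matrices."""
    W, U, b = params
    return [
        activation(
            sum(w * xi for w, xi in zip(W[k], x))
            + sum(u * hi for u, hi in zip(U[k], h))
            + b[k]
        )
        for k in range(len(b))
    ]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p maps 'i', 'f', 'o', 'c' to (W, U, b) tuples."""
    i_t = gate(p["i"], x_t, h_prev, sigmoid)    # input gate
    f_t = gate(p["f"], x_t, h_prev, sigmoid)    # forget gate
    o_t = gate(p["o"], x_t, h_prev, sigmoid)    # output gate
    g_t = gate(p["c"], x_t, h_prev, math.tanh)  # candidate cell state
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def gru_step(x_t, h_prev, p):
    """One GRU step; p maps 'r', 'z', 'h' to (W, U, b) tuples."""
    r_t = gate(p["r"], x_t, h_prev, sigmoid)    # reset gate
    z_t = gate(p["z"], x_t, h_prev, sigmoid)    # update gate
    gated_h = [r * h for r, h in zip(r_t, h_prev)]
    h_cand = gate(p["h"], x_t, gated_h, math.tanh)
    return [(1.0 - z) * h + z * hc for z, h, hc in zip(z_t, h_prev, h_cand)]


# Hypothetical all-zero parameters (1 input, 1 hidden unit): every sigmoid
# gate then equals 0.5 and every candidate state equals 0.
zero = ([[0.0]], [[0.0]], [0.0])
lstm_params = {name: zero for name in "ifoc"}
gru_params = {name: zero for name in "rzh"}

h_lstm, c_lstm = lstm_step([1.0], [0.0], [1.0], lstm_params)
h_gru = gru_step([1.0], [0.8], gru_params)
print(h_lstm, c_lstm, h_gru)
```

With these toy parameters the forget gate keeps half of the previous cell state, which shows how the additive cell update lets information persist. The GRU reaches a similar effect with two gates and no separate cell state, which is the source of its lower parameter count.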
Paper [53] employed GRU for PVPF. In their study, various processing blocks were built based on the characteristics of each block to improve the accuracy of the proposed approach. Specifically, the Pearson coefficient is exploited to rank the feature inputs according to their relevance to the PVPG. Next, K-means clustering is used to group the training data according to the similarities of input patterns. These groups are utilized to generate an averaged PVPF output. The proposed GRU architecture demonstrated its ability to capture temporal dependencies. However, the average training time is 365.40 seconds, which is painfully slow compared to statistical techniques such as the Auto-Regressive Integrated Moving Average model (only 3.66 seconds). Bidirectional LSTM (BiLSTM) is an improvement on the one-way RNN in which the forward and backward hidden layers are combined to access both the preceding and succeeding data [54]. The Bidirectional Mechanism (BM) is a way of learning the information from both directions, as shown in Fig. 8(l) [54]. In a nutshell, BiLSTM can handle the sequential modeling challenge better than the conventional LSTM by acquiring the forward and reverse information from the cyclic feedback. For instance, the authors in [55] proposed a BiLSTM model to capture nonlinear time dynamics for PVPF, which helps in boosting the model performance significantly. The proposed model can accurately detect meteorological changes over time. Meanwhile, the real-world implementation of the BiLSTM model requires memory-bandwidth-bound computation, which compromises its applicability due to high computation and storage demands. To combat this challenge to a large extent, the Attention Mechanism (AM) is incorporated into RNN structures for better generalization ability by mimicking the attention of the human brain [56].
AM puts more focus on the core elements of the input sequence that affect the quality of the forecasts, so that the information in the input sequence is learned better. The principal idea of the AM in DL is to neglect the data irrelevant to the current task and to select only the information that is more critical to it, as shown in Fig. 8(f). AM can be categorized into spatial attention, channel attention, and self-attention [57]. AM allows the forecasting models to pay more attention to useful features, so AM is widely used in RNNs. The computation of AM starts from a Query ($Q$) and Keys ($K$) as $f(Q, K_i) = Q^T K_i$. The Softmax activation function is used for weight standardization as [57]:
$$a_i = \mathrm{softmax}\big(f(Q, K_i)\big) = \frac{\exp\big(f(Q, K_i)\big)}{\sum_{j=1}^{L} \exp\big(f(Q, K_j)\big)}$$
The attention value is obtained by calculating the weighted sum $\mathrm{Attention}(Q, K, V) = \sum_{i=1}^{L} a_i \cdot \mathrm{Value}_i$, where $L$ is the size of the input sequence. In [58], the authors proposed a Convolutional self-Attention Based LSTM (CA-LSTM) for PVPF. The self-AM is a special form of AM, which better captures the syntactic and semantic information from the raw TS data. The proposed model aims to fully use the features of long sequence inputs, achieving an overall MAPE of 10%. Thus, the AM successfully improves on the traditional LSTM performance, with a lower MAPE of 17%. Reference [59] proposed an LSTM-based Temporal AM (TA-LSTM) for solar generation forecasting. The proposed model employs partial autocorrelation to select the input lag. The TA-LSTM produces a Root Mean Square Error (RMSE) of 0.26 kW, which suggests high competitiveness compared to the classic LSTM in PVPF. Meanwhile, the hyperparameter list grows, which leads to extensive hyperparameter tuning. Despite the high suitability of gated RNN architectures for TS processing, their massive storage and computation requirements hinder their applicability, particularly for PVPF.
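The attention computation described above (dot-product scores, Softmax normalization, weighted sum of values) can be sketched in a few lines of NumPy. This is an illustrative sketch for a single query vector, not the implementation used in [57] or [58]; the function name `dot_product_attention` is hypothetical.

```python
import numpy as np

def dot_product_attention(q, keys, values):
    """Attention for one query.

    q:      (d,)    query vector Q
    keys:   (L, d)  one key K_i per input position
    values: (L, m)  one value per input position
    """
    scores = keys @ q                    # f(Q, K_i) = Q^T K_i for every i
    scores = scores - scores.max()       # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # Softmax weights a_i
    context = weights @ values           # sum_i a_i * Value_i
    return context, weights

# Toy usage: three time steps, two-dimensional keys, scalar values.
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([[1.0], [2.0], [3.0]])
context, weights = dot_product_attention(np.array([10.0, 0.0]), keys, values)
```

With a zero query all scores are equal, so the weights become uniform and the output reduces to the plain average of the values; a query aligned with some keys shifts the weight toward those positions.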

B. GENERATIVE LEARNING
In this section, we review the state-of-the-art generative DL architectures. These ubiquitous architectures are Auto-Encoders (AEs), Restricted Boltzmann Machines (RBMs), Deep Belief Networks (DBNs), and Generative Adversarial Networks (GANs).

1) RBMs AND DBN
The Restricted Boltzmann Machine (RBM) is an energy-based stochastic neural network, as shown in Fig. 8(j). In a nutshell, whereas the general Boltzmann Machine (BM) has node connections both within layers and between layers (Fig. 8(h)), this restricted variant keeps only the connections between layers. The RBM model has the potential to learn the input probability distribution in supervised or unsupervised learning. The RBM architecture has two levels with symmetrical connections between them: the visible layer $v$, which contains the input and output, and the hidden layer $h$ with $n$ units [18]. The joint distribution followed by the visible and hidden units can be expressed as:
$$p(v, h) = \frac{1}{Z} e^{-E(v, h)}, \quad \text{and} \quad Z = \sum_{v, h} e^{-E(v, h)}$$
where $E(v, h)$ is the energy function and $Z$ is the normalizing partition function. A Deep Belief Network (DBN) is an unsupervised greedy learning algorithm built from stacked RBM units, as shown in Fig. 8(i) [18]. The DBN performs layer-wise training to learn the probability distribution of the input vectors. The DBN processing consists of layer-by-layer unsupervised pre-training to select suitable initial parameters, followed by a supervised fine-tuning mechanism that rebuilds the training samples by tuning the parameters [60]. The RBM layers act as feature extractors that generate a high-dimensional abstraction of the inner relationships of the data.
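To make the role of the energy and the partition function concrete, the following sketch enumerates every state of a tiny binary RBM and normalizes the Boltzmann factors. The text does not give the energy function, so the common bilinear form $E(v, h) = -a^T v - b^T h - v^T W h$ is assumed; exhaustive enumeration is only feasible for a handful of units and is used here purely for illustration.

```python
import itertools
import numpy as np

def energy(v, h, W, a, b):
    """Assumed RBM energy: E(v, h) = -a.v - b.h - v^T W h."""
    return -(a @ v) - (b @ h) - (v @ W @ h)

def joint_distribution(W, a, b):
    """Return all binary states, their joint probabilities p(v, h), and Z."""
    n_v, n_h = W.shape
    states = [(np.array(v), np.array(h))
              for v in itertools.product([0, 1], repeat=n_v)
              for h in itertools.product([0, 1], repeat=n_h)]
    unnorm = np.array([np.exp(-energy(v, h, W, a, b)) for v, h in states])
    Z = unnorm.sum()                     # partition function
    return states, unnorm / Z, Z

# Toy usage: 2 visible units, 1 hidden unit, one positive coupling.
W = np.array([[1.5], [0.0]])
states, probs, Z = joint_distribution(W, a=np.zeros(2), b=np.zeros(1))
```

Note that the weight matrix only couples visible units to hidden units, which is exactly the restriction that distinguishes the RBM from the general Boltzmann Machine.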
In [61], the authors proposed integrating a gray data preprocessor with a DBN for day-ahead PVPF. These contributions are complemented in [60], where the authors proposed day-ahead global solar radiation forecasting using a functional DBN. The performance enhancement of the developed technique relies on an embedded clustering layer and knowledge functions from empirical models. These processing units operate in a sophisticated and extended iterative fashion, thus improving the model robustness for longer time dependencies [60]. Although the DBN model has demonstrated its efficiency in various forecasting tasks, the DBN architecture is prone to model structure and parameter optimization challenges.

2) AUTOENCODERS
The Autoencoder (AE) architecture is one of the most groundbreaking unsupervised learning models that learn characteristics from unlabeled data representations [9]. AEs are loosely inspired by the way the human brain works. Typical components of the approach are the encoder and the decoder [20]. The AE is trained by minimizing the reconstruction error between the input data at the encoding layer and its reconstruction at the decoding layer. There are many types of AEs, and the most commonly used ones are: Stacked Autoencoders (StAE), Denoising Autoencoders (DAE), Sparse Autoencoders (SAE), Convolutional Autoencoders (CAE), Denoising CAE (DCAE), and Variational Autoencoders (VAE) [20]. An AE-LSTM network is proposed for day-ahead PVPF at 15-min intervals [62], with a normalized RMSE of 4.56%. The proposed AE-LSTM model jointly exploits the feature extraction of the AE and the sequential TS forecasting engine of the LSTM. However, the trans-day weather volatility is poorly predicted by this predictor. The authors of [63] established an AE-driven DL model-based PVPF method to overcome the stochastic behavior of PV power output, which achieves an optimal $R^2 = 99.5\%$. Despite the ability of the proposed model to provide accurate one-step and multi-step forecasting results, the VAE shown in Fig. 8(n) is prone to the vanishing latent variable problem.

3) GAN
GAN is an unsupervised pre-trained network consisting of two competing neural networks: the generator $G(z)$ and the discriminator $D(x)$, as presented in Fig. 8(o). By learning the distribution of the real data $x$, $G$ generates realistic scenarios until they can no longer be distinguished from real data. The fake data is generated from random noise drawn from a Gaussian distribution. This operation is conducted by deliberately feeding the generator input with the noise variables $p_z(z)$. In turn, $D$ is trained to distinguish whether the input data comes from the true data $p_{data}(x)$ or from the generator. The two models are optimized simultaneously by updating the network weights in an alternating manner. The hyperparameters are tuned based on optimizing the loss and varying the randomness. The objective function is formulated as in [64]. GAN has gradually gained prominence in the PVPF domain, especially for data augmentation purposes. For instance, the authors in [64] employ Wasserstein GAN with Gradient Penalty (WGANGP) for weather classification-based PVPF. The WGANGP was utilized for data augmentation by producing synthetic data that follows the same heteroscedasticity as the original data. The newly generated data is fed to the CNN model to improve its performance by enhancing the feature representation. A series of experiments on 33 meteorological weather types was conducted and proved the effectiveness of the method by comparing it with other methods. However, the major caveat of the GAN model is that its training is relatively unstable.
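For reference, the original (non-Wasserstein) GAN is usually written as the minimax game $\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$. This form is an assumption here: the exact expression used in [64] is not reproduced in the text, and WGANGP replaces it with a Wasserstein loss plus a gradient penalty. The sketch below only evaluates this value function from discriminator outputs; `gan_value` is a hypothetical helper name.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of V(D, G) for the original GAN objective.

    d_real: discriminator outputs D(x) on real samples, values in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, values in (0, 1)
    """
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1.0 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1.0 - eps)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

# The discriminator ascends V (equivalently minimizes -V); the generator descends it.
v_confident = gan_value(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
v_confused = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

When the discriminator can no longer tell real from generated samples and outputs 0.5 everywhere, the value equals $-\log 4$, which is the equilibrium of this game; the alternating weight updates mentioned above move the two networks toward that point.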

C. DEEP REINFORCEMENT LEARNING
Recently, DRL has been introduced as a combination of Deep Learning (DL) and Reinforcement Learning (RL) to better cope with the dynamic changes of the unsteady PV environment [65]. By bridging DL and RL, DRL shows great potential in handling complex tasks and high scalability to suit complicated and unfamiliar environments, particularly in power systems [66]. As a goal-oriented method that learns from environment feedback, traditional RL extends its capacity to handle high-dimensional action and state spaces through the hierarchical feature extraction and nonlinear approximation abilities of DL architectures, as shown in Fig. 8(p). Consequently, DL models extract state information from sequential PV data [67]. In the DRL scheme, the agent continuously interacts with the environment over a series of trial-and-error processes to shape the optimal strategies [68]. Several methods for updating the Q values are provided in the literature, such as State-Action-Reward-State-Action (SARSA), Asynchronous Advantage Actor-Critic (A3C), and Deep Q-Network (DQN) [66]. In RL, the environment can be modeled as a Markov Decision Process (MDP) expressed as $(S, A, P, R, \gamma)$. Here, $S$, $A$, $P$, and $R$ denote the discrete states in the environment, a finite set of actions available to the agent, the state transition probability matrix, and the reward function, respectively. $\gamma$ is the discount factor used to weigh the importance of future rewards against present ones. Despite the ample opportunities of DRL, efforts to employ DRL in PVPF are scarce in the literature.
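As a small illustration of the MDP ingredients listed above, the sketch below shows the discounted return controlled by $\gamma$ and one tabular SARSA update of the Q values. The tabular setting, the learning rate `alpha`, and the function names are assumptions made for illustration; in DRL the table is replaced by a DNN approximator.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Sum of rewards weighted by gamma**k, with k the delay in time steps."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One on-policy SARSA step:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * Q(s', a') - Q(s, a))
    """
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Toy usage: 2 discrete states, 2 actions, all Q values initialized to zero.
Q = np.zeros((2, 2))
Q = sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=0)
```

A discount factor close to 0 makes the agent myopic, while a value close to 1 gives future rewards nearly the same weight as immediate ones.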
Using the powerful representation ability of DNNs, the value function is fitted to cope with the explosion of continuous and integer state-action spaces. Mainly, DRL handles high-dimensional inputs or large action sets to solve intractable problems using self-adjustment and optimization. However, current DRL techniques depend on massive training data and expensive computational resources, which may be unacceptable in practical PV platforms. DRL models are roughly divided into two main settings: model-based and model-free. The following subsections describe these concepts.

1) MODEL-BASED
Model-Based DRL (MDRL) expects the agent to understand the system dynamics, that is, how the system moves from one state to another and how rewards are generated. MDRL methods have been effective in terms of data efficiency, transferability, and universality [68].
However, MDRL is computationally expensive and ineffective in rapidly varying environments [68]. MDRL has been proposed to solve the optimal action-selection policy [69]. MDRL employs an internal model to approximate the environment, and the control behavior can be learned through this model. It has been reported that model-based approaches are more efficient than model-free approaches. However, MDRL needs to save the state transition matrix and to employ Dynamic Programming (DP) algorithms, leading to massive calculation requirements [70]. This approach is not always practicable, especially in complicated paradigms where the agent has limited or no knowledge about its environment. The authors in [71] applied an MDRL-based MuZero algorithm to solve the scheduling problem of a distributed microgrid, particularly for PV systems. The proposed approach combines the Monte-Carlo tree search method with a learned NN to efficiently learn a network model. However, despite the high sample efficiency of the proposed model, the model design is complicated, especially for large PV systems.

2) MODEL-FREE BASED
Model-Free DRL (MFDRL) derives the optimal policy without explicitly learning the model of the environment. It can be achieved by using three approaches: Value-Based DRL (VDRL), Policy-Based DRL (PDRL), and Actor-Critic (AC) based [78]. VDRL is a prominent learning method to deal with high-dimensional state spaces and discrete or continuous action spaces in optimization problems [79]. On the other hand, the PDRL architecture guarantees better convergence and keeps relatively high efficiency in high-dimensional or continuous action spaces [79]. However, agents require millions of time steps to learn tasks from many iterative systems. Since updates occur in small steps, agents may under-explore their environments or under-develop strategies, leading to exploration shortcomings in some cases. AC is a fusion of policy-based and value-based models that constitutes an end-to-end learning paradigm from perception to action [80]. Asynchronous Advantage Actor-Critic (A3C), Deep Deterministic Policy Gradient (DDPG), and Twin Delayed Deep Deterministic policy gradient (TD3) are the standard representatives of the AC method [79].
MFDRL is widely applied for the optimization and control of PVPG in research works. To name a few, in [81], the authors adopted a novel strategy that brings together DQN and CNN to cope with the uncertainties of an isolated microgrid. Concisely, the proposed DRL model optimizes the sum of the diesel generators' generation cost and the penalty of non-served power demand. However, the curse of dimensionality persists with the DQN. A combination of Policy Dynamics based Win or Learn Fast-Policy Hill Climbing (PDWoLF-PHC) and a Back Propagation Neural Network (BPNN) has been adopted to tackle the RES uncertainties for fast-response regulation units [82]. The proposed model optimizes the coordinated control for the source-grid-load. Despite the high efficiency of the proposed model for automatic generation control, its deployment requires high exploration costs in a multi-area interconnected grid. A multi-agent Double Deep Q Network with Action Discovery (DDQN-AD) has been proposed for distributed RES management [83]. However, the proposed model only works effectively with homogeneous agents. Paper [84] proposed an automatic generation control-based AC strategy. The proposed method relies on DQN to follow the agent-environment interaction of an isolated microgrid paradigm without the need for RE forecasts. However, the proposed system can only have deterministic policies, limiting its feasibility in practical power grids. To sum up, DRL has so far been limited to optimization and control tasks that enhance the efficiency of prediction systems. The forecasting problem itself has not yet been solved entirely via DRL, which requires more investigation.

D. HYBRID MODELS-BASED
The efficiency of stand-alone DL models can be unsatisfactory in PVPF in different case scenarios due to inappropriate HO, the bias-overfitting conundrum, and excessive complexity in both computational and spatial dimensions. To bridge that gap, the combination of two or more cross-discipline methods (a.k.a. hybrid models) is commonly proposed to forecast PVPG with better performance than single DL models [7]. This performance enhancement stems from the fact that single DL models have their own strengths and limitations, as reported in Table 2. Specifically, DL has some limitations, including the lack of interpretability of DL outputs, whose generation cannot yet be fully explained, extensive computation requirements, and the need for massive data to efficiently perform the desired task. Hybrid models are often preferred for hard PVPF problems because they eliminate or reduce the shortcomings of a single model by combining it with another model to obtain better results [85]. Fig. 9 shows a comprehensive distribution of the papers reviewed in this work according to the forecasting method.
As remarked in Fig. 9, hybrid methods are by far more widely deployed than the other DL models. For instance, a Conv-GRU model has been proposed to predict the PVPG accurately [86]. The proposed model is versatile enough to deal with nonlinear behaviors and provide an accurate PV output. A CNN-LSTM model has been proposed for PVPF [87]. However, the extraction of positional and temporal representations in the PV output requires explicit recognition of patterns and regularities in data, which is challenging due to the massive computational burden in real-life applications. An AE-LSTM model has been proposed [9]. A DBN-based Auto-Regressive model has been proposed for nonlinear TS modeling [88], which provides decent performance. However, the algorithm is fragile when faced with volatile PV behavior at different locations and is not suitable for PVPF. An innovative USTF method has been described in [34]. The authors' work consists of implementing the underlying Locality-Sensitive Hashing (LSH) algorithm. The taxonomy used considers four weather conditions: clear, cloudy, rainy, and snowy weather. LSH thoroughly investigates the coupled, correlated weather features. The methodology adopted for the LSH system classifies the PV power segments and generates a PPF output. In [34], the authors exploited a hybrid method for accurate hourly PV power prediction based on a gradient-descent Back-Propagation method (BP), the Shuffled Frog Leaping Algorithm (SFLA), and an ANN, named the BP-SFLA-ANNs model. Their BP-SFLA-ANNs model uses SFLA to mediate between the BP and ANN models. The BP model provides the values of the primary hyperparameters of the ANN so that SFLA can start from this initial selection and search further for more suitable parameters of a typical ANN. The interaction between SFLA and BP led to superior ANN accuracy and less computational burden compared to an SFLA-ANNs model without the initial BP tuning.
Further applications of hybrid models in PVPF are listed in Table 3.
However, computational complexity is one of the main weaknesses of hybrid models because they use two or more techniques. Thus, the accuracy improvement should not come at the cost of excessive computational complexity. The performance of a hybrid model also depends on the performance of each of its constituent single models.

E. HYPERPARAMETER OPTIMIZATION OF DL ARCHITECTURES
The ever-increasing complexity of the newly developed DL methods has raised an emerging resurgence of research on HO. A wide range of hyper-tuning techniques was adopted to support DL algorithms or provide an alternative for specific optimization tasks. Automated hyperparameter selection is an essential step to save the rare resources of human expertise and notorious efforts. Meta-heuristic techniques offer the adequate tools to provide an optimal or near-optimal configuration of DL models due to their efficiency and scalability for various complex applications. Meta-heuristic algorithms can be divided into four distinguished categories, specifically, evolution-based, swarm intelligence-based, physics-based, and human behavior-based, as shown in Fig. 10.

FIGURE 10. Meta-heuristic algorithms with Sine cosine algorithm [97], Find fix exploit analyse [98], Electro-search algorithm [99], Selfish herd algorithm [100], Emperor Penguins colony [101], Butterfly optimization algorithm [102], Group counseling optimization [103], Volleyball premier league algorithm [104], Jaya algorithm [105], Gaining sharing knowledge based [106], Differential search algorithm [107], Backtracking search optimization [108], Stochastic fractal search [109], Synergistic fibroblast optimization [110].
For instance, the Ant Colony Optimization algorithm has been proposed for model tuning to accurately predict the PVPG [111]. Paper [112] developed a CNN tuned with a Salp Swarm Algorithm (SSA) for PVPF. Five CNN regression models are designed, one for each type of weather. Consequently, the prediction engine is easy to implement even if knowledge of the hyperparameters is limited. Paper [54] adopted a combination of BiLSTM, the Sine Cosine Algorithm (SCA), and complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) for solar irradiance forecasting. Similarly, the authors in [113] established an improved BiLSTM model with a Genetic Algorithm (GA). However, the proposed version of the BiLSTM model is susceptible to the random weight initialization.

IV. DEEP LEARNING MODELS USED IN PVPF APPLICATIONS
In this section, the notable DL techniques for PVPF are discussed, including Deep Transfer Learning (DTL), Big Data Deep Learning (BDDL), incremental learning, online learning, and federated learning.

A. FEDERATED LEARNING BASED PVPF
DL models based on a central server suffer from critical privacy and security challenges, and these concerns hold particularly strongly when data are collected without the explicit awareness of the users. Concerns about data awareness in PV systems may lead end-users to be progressively unwilling to send their potentially private and personal data to data centers, raising the question of how DL algorithms will be trained [114]. Federated Learning (FL) is regarded as a privacy-preserving collaborative learning scheme that has gained widespread attention from academics and practitioners due to its significant contributions [115]. This emergent technology permits various parties, ideally mobile devices, to cooperatively train a DL model on their combined data sets without any participant having to reveal its local data to a centralized server.
For the energy hub, FL seems to be the missing puzzle piece in the widespread adoption of SE. For instance, the authors in [116] applied a secure FL method based on a Bayesian LSTM model to address solar irradiation prediction. The proposed model proves its efficiency in solving the problem of sparse samples by using a distributed training framework. Meanwhile, missing and faulty data can considerably hinder the training of the FL model. FL is still in the early stages of development. Much work remains to be done, and a few structural achievements in the approval process still need to happen before researchers can leverage FL technology to its full potential.
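The collaborative training idea described in this subsection can be sketched with federated averaging, a common FL aggregation scheme that is assumed here because the aggregation rule used in [116] is not stated in the text. Each client (for example, one PV site) refines the global weights on its private data, and only the weights are sent back and averaged. A linear model and synthetic data stand in for the DL model; all names are hypothetical.

```python
import numpy as np

def local_update(w_global, X, y, lr=0.1, epochs=20):
    """Gradient descent on one client's private data (mean squared error loss)."""
    w = w_global.copy()
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_round(w_global, clients, **kwargs):
    """One round: clients train locally; the server averages weights by sample count."""
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    local_ws = np.stack([local_update(w_global, X, y, **kwargs) for X, y in clients])
    return (sizes[:, None] * local_ws).sum(axis=0) / sizes.sum()

# Toy usage: three clients whose raw data never leaves the client.
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
clients = []
for n in (30, 50, 80):
    X = rng.uniform(-1.0, 1.0, size=(n, 2))
    clients.append((X, X @ w_true))

w = np.zeros(2)
for _ in range(30):
    w = federated_round(w, clients)
```

Only model weights cross the network in this scheme, which is what preserves the locality of the raw measurements; secure aggregation or encryption, as in the secure FL method of [116], would be layered on top.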

B. DEEP TRANSFER LEARNING IN PVPF
To further supplement the staggering potential of DL, Deep Transfer Learning (DTL), based on Transfer Learning (TL) and DL, has been proposed to overcome the insufficiency of actual training data sets, convergence issues, and the shortcomings of isolated learning [117]. DTL employs the pre-existing knowledge acquired by a DNN model for a particular task to solve related ones [118]. Concretely, the sensor measurements are hardly sufficient and sparsely generated [117]. Instead of learning any new task from scratch, DTL is devoted to avoiding much of the expensive data-labeling effort by using cross-domain data sets [119]. Furthermore, HO-based DL algorithms can be conducted using TL [120]. This learning framework operates by transferring the knowledge gained by a DNN model in handling a task (source problem) to solve another related task. We define a source domain $D_s$ and a target domain $D_t$ with $D_s \neq D_t$. A source task and a target learning task ($T_s$, $T_t$) with $T_s \neq T_t$ constitute a DTL framework if $\eta_t$ is a nonlinear function represented by a DNN. The source and target domains are formulated as [119]:
$$D_s = \{(x_i^s, y_i^s)\}_{i=1}^{M}, \quad x_i^s \in X_s, \qquad D_t = \{(x_i^t, y_i^t)\}_{i=1}^{N}, \quad x_i^t \in X_t$$
where $M$ and $N$ are the numbers of labeled samples in $D_s$ and $D_t$, respectively; $X_s$ and $X_t$ denote the feature spaces of the source and target domains, respectively; $x_i$ denotes a data instance and $y_i$ is the related class label. In [121], a Shared-Optimized-Layer LSTM (SOL-LSTM) network has been proposed for PVPF. The rationale of the proposed model design is to combine Sequential Model-Based Global Optimization (SMBO) with the LSTM network. DTL can solve the data insufficiency of a newly built PV plant by pre-training the hyperparameters in a similar source domain and fine-tuning in the target domain. However, as the data volume increases, the SOL-LSTM performance decreases significantly. In [120], a DNN has been proposed to tackle nonlinear weather uncertainties for RES farms.
However, the DTL model seeks repeated access and preprocessing of a potentially massive data set of source tasks to establish the necessary knowledge base for the downstream target task. This requirement may not be feasible in large PV systems due to the lack of data-intensive computing resources. Therefore, it is of great significance to merge DTL and BD solutions to solve the data problems in large-scale PV systems.
Although DTL would significantly enhance the performance of DL-based PVPF, the research work on the use of DTL for PV forecasting is relatively limited compared with other areas such as fault detection in PV systems [122]. This may be related to the essential need for data sets with a reasonable minimum of similarity.

C. BIG DATA DEEP LEARNING FOR PVPF AND CHALLENGES
As is well known, DL models are data-hungry, requiring large amounts of data to work effectively [123], [124]. Reciprocally, DL scales well with growing amounts of data [125]. However, the computational efficiency of current computing hardware is considerably limited, especially with the trend of increasing DNN size. To bridge this gap, employing Big Data technologies for model training is generally no longer optional, especially with the continuous flow of data pouring from monitoring systems and smart devices [126]. Big Data and Analytics provide the means to predict the weather conditions and PVPG so that PV systems can work at peak efficiency. Data mining can be truly beneficial for enhancing PVPF in a complex ground-level infrastructure.
In [122], the authors have proposed a deep Feed-Forward Neural Network (FFNN) trained on a big data set from Australia. An H2O package has been employed to combine non-deep and deep learning methods with the Apache Spark cluster-computing framework. The simulation results showed that the lagged information does not need to cover more than the previous 24 hours of historical data to provide accurate results. However, tuning the trained model with the grid search algorithm may produce heavy computation requirements. Consequently, an adequate selection of optimization algorithms is of utmost importance to establish cost-efficient prediction tools.
In [124], the authors introduced a big data solution that combines physical and dynamical theories with intelligent algorithms to solve big data problems. However, decent documentation of the software used (Sun4Cast™) is missing. Paper [127] proposed an annual rooftop solar irradiation analysis based on a Spark-based fuzzy partitioning LSTM model. The proposed solution employs rooftop characteristics, horizontal solar irradiation, visibility of the sky, and shading factor. The simulations show that ensemble models, represented by RF, outperform NNs, represented by extreme learning machine ensembles, with an accuracy of 92%. The authors in [128] applied a big-data-based forecasting tool to solve the PVPF problem, where the PySpark package is employed to implement the neural network model. The simulation results produce an average RMSE = 0.03 MW with fast convergence. Unfortunately, the computational complexity prevents the model from moving from proof of concept to production, and a massive amount of engineering is needed to deploy it. Big Data frameworks applied to PV systems can circumvent several limitations, such as data privacy and security, multi-source data integration, and real-time data processing, to ensure that the data clearly convey what the DL model needs to learn for real-world PV plants [129].

D. INCREMENTAL AND ONLINE LEARNING
Under the umbrella of DL, incremental and online learning have emerged as continuously evolving schemes to improve the universality of prediction engines for accurate regional forecasting. However, telling these two schemes apart can be challenging. Online Learning (OL) dynamically trains or adapts the model using each incoming data point at each time step, without storing it [130]. Thus, OL is used to handle large volumes of streaming data transmitted at high velocity. Incremental Learning (IL) provides fast remodeling by learning from batches of data at different time intervals, and can integrate new knowledge into the predefined model on the fly if the network needs to be expanded. The incremental samples can come from the available samples (SIL) or even from unseen classes (CIL).
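The per-sample adaptation that characterizes OL can be sketched as follows. A linear forecaster updated by one stochastic gradient step per incoming observation stands in for the DL model; the class name, the learning rate, and the synthetic stream are assumptions made for illustration.

```python
import numpy as np

class OnlineLinearForecaster:
    """Linear model adapted with one gradient step per incoming sample."""

    def __init__(self, n_features, lr=0.05):
        self.w = np.zeros(n_features)
        self.lr = lr

    def predict(self, x):
        return float(self.w @ x)

    def partial_fit(self, x, y):
        """Update on a single (x, y) pair; the sample is not stored afterwards."""
        error = self.predict(x) - y
        self.w -= self.lr * error * x    # gradient of 0.5 * error**2
        return error

# Toy usage: a stream whose target depends linearly on two features.
rng = np.random.default_rng(0)
model = OnlineLinearForecaster(n_features=2)
for _ in range(2000):
    x = rng.uniform(-1.0, 1.0, size=2)
    model.partial_fit(x, 2.0 * x[0] - 1.0 * x[1])
```

An incremental learner would instead call a batch update at each time interval (for example, once per day of new measurements) and could also grow the network when new patterns or classes appear.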
The authors of [131] established an online PVPF method to handle concept drift. A model-agnostic online forecasting (MAOF)-based LSTM model is used, generating a MAPE = 23.59%. However, the model suffers serious performance degradation in particular case scenarios, leading to stability issues. The work in [132] applied an incremental learning model for solar irradiance based on a Regression Enhanced Incremental Self-Organising Neural Network (RE-SOINN). The proposed model is trained incrementally as new data come in progressively. This architecture avoids the tedious retraining process of DL models.

V. APPLICATION POTENTIAL IN PV POWER FORECASTING
With the increasing spatiotemporally coupled uncertainties in PVPG, PVPF has become a pressing need to ensure grid stability and reduce the uncertainty of solar PV power, hence paving the way towards a large-scale economic deployment of RES in the electricity grid. Conceptually, PVPF can be broadly classified into two categories: point PVPF and interval PVPF.

A. POINT PV POWER FORECASTING
Point forecasting, also called deterministic forecasting, is widely regarded as essential for optimal power system management. Point forecasting models have been thoroughly researched over the years, and the trend of developing more accurate forecasts is still booming [133]. The average PVPG $P_{t+k|t}$ is the power estimated to be produced by a PV system during a specific time period, for a forecast made at time step $t$ with a look-ahead time $t + k$, as if the system operated under an equivalent constant PV power. The time horizon $T$ for which the prediction is generated defines the total length of the forecast period. Deterministic PVPF provides accurate and specific future values [134], [135]. Further, these methods are easy to use, deploy, and evaluate using score metrics such as RMSE, MAE, R, and MAPE [33]. Unsteady PVPG threatens energy generation. However, point forecasts do not include the uncertainties around the mean value. Therefore, their results can be unreliable and misleading in particular scenarios [12]. Table 4 presents several exemplary applications of deterministic DL methods for PVPF. For instance, the authors in [142] adopted an LSTM-CNN model. It produced more accurate forecasts than the single LSTM and CNN, with a forecasting skill of 37%-45%. However, its computation time is ten times longer than that of the standalone LSTM. A deep Extreme Learning Machine (ELM) has been applied in [143]. The proposed model incorporated Enhanced Colliding Bodies Optimization (ECBO), Variational Mode Decomposition (VMD), and ELM algorithms. However, the proposed model does not shed light on uncertainty abstraction and reasoning. Therefore, the forecasting engine is found inefficient in dealing with meteorological data pervaded with uncertainty.
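The score metrics named above can be computed as in the following sketch. RMSE, MAE, and MAPE follow their usual definitions; the forecast skill is assumed here to be $1 - \mathrm{RMSE}_{model} / \mathrm{RMSE}_{reference}$, a common convention that may differ from the definition used in [142]. Zero-power samples (for example, at night) are excluded from MAPE to avoid division by zero.

```python
import numpy as np

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def mae(y, y_hat):
    return float(np.mean(np.abs(y - y_hat)))

def mape(y, y_hat):
    """MAPE in percent, computed only where the measured power is non-zero."""
    mask = y != 0
    return float(100.0 * np.mean(np.abs((y[mask] - y_hat[mask]) / y[mask])))

def forecast_skill(y, y_hat, y_ref):
    """Assumed definition: relative RMSE improvement over a reference forecast."""
    return 1.0 - rmse(y, y_hat) / rmse(y, y_ref)

# Toy usage: measured power, model forecast, and a reference (baseline) forecast in kW.
y = np.array([2.0, 4.0])
y_hat = np.array([1.0, 5.0])
y_ref = np.array([0.0, 6.0])
scores = {"RMSE": rmse(y, y_hat), "MAE": mae(y, y_hat),
          "MAPE": mape(y, y_hat), "skill": forecast_skill(y, y_hat, y_ref)}
```

RMSE penalizes large errors more strongly than MAE, while MAPE is scale-free but unstable near zero output, which is why reporting more than one metric is common practice in PVPF studies.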

B. PROBABILISTIC PV POWER FORECASTING
With the increasing PV power uncertainties, Probabilistic Deep Learning (PDL) has become the de-facto solution to lessen the negative impacts of stochastic PV generation on power system reliability and economic efficiency [144]. PPVPF can provide an estimated interval in which various possible PVPG values for a specific time are generated, to quantify the intrinsic uncertainties associated with point forecasts [144], [145]. PPVPF draws considerable attention from balancing authorities for its ability to provide the prediction interval, quantile, density, or conditional probability distribution of future predicted power [146]. In the literature, the most prevalent PPVPF methods rely on conditional quantile regression (QR) and conditional expectile regression [147]. For instance, an improved quantile CNN has been proposed for indirect PVPF to compute consistent quantile estimates [148]. The simulation results indicate that the two-stage training strategy has a positive influence on forecasting accuracy. A deep QR-CNN-based Wavelet Transform (WT) has been exploited to model DPVPF and PPVPF [149]. The CNN-WT efficiently provides a wider view of a prediction. The proposed model is tested using TS data collected by Elia, Belgium's transmission system operator. The Average Coverage Error (ACE) obtained varies from −1.02 to 0.43 for a prediction horizon of 15 min.
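Two quantities that recur in this discussion, the quantile (pinball) loss minimized by QR and the ACE used to score prediction intervals, can be sketched as follows. The ACE is assumed here to be the empirical coverage of the interval minus its nominal coverage, returned as a fraction rather than a percentage; the cited works may use a different scaling. Function names are hypothetical.

```python
import numpy as np

def pinball_loss(y, q_pred, tau):
    """Quantile loss for level tau in (0, 1): asymmetric absolute error."""
    diff = y - q_pred
    return float(np.mean(np.maximum(tau * diff, (tau - 1.0) * diff)))

def average_coverage_error(y, lower, upper, nominal):
    """Empirical interval coverage minus nominal coverage (both as fractions)."""
    covered = (y >= lower) & (y <= upper)
    return float(np.mean(covered) - nominal)

# Toy usage: four measurements, a constant median forecast, and a 90% interval.
y = np.array([1.0, 2.0, 3.0, 4.0])
median_loss = pinball_loss(y, q_pred=np.full(4, 2.5), tau=0.5)
ace = average_coverage_error(y, lower=np.full(4, 1.5), upper=np.full(4, 4.5), nominal=0.9)
```

An ACE close to zero indicates a well-calibrated interval, a negative value indicates intervals that are too narrow, and a positive value indicates intervals that are wider than necessary.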
In [147], the authors propose a Robust Self-Attention Multi-horizon (RSAM) model for PPVPF using QR. However, QR generates a non-differentiable loss function that threatens the model stability and robustness. The proposed model employs a self-attention-based transformer model. However, the problem of crossing quantile curves is frequently observed, particularly when considering a dense set of quantiles or using a small data set. Furthermore, the DNN is naturally deterministic and thus limited for PPVPF. Therefore, Bayesian probability is often integrated with DL models to provide prediction intervals associated with forecasted point values [150], [151]. The literature shows that PPVPF is scarcely investigated compared to deterministic PVPF. Nevertheless, a number of DL techniques have been exploited for PPVPF. For instance, paper [152] presents a Robust Self-Attention Multi-horizon (RSAM) model, which indicates an 18.60% improvement compared to the conventional LSTM. A Traditional Encoder Single Deep Learning (TESDL) framework has been proposed, which provides a 27% improvement in the accuracy factor [153]. Reference [154] exploited an SAE and the Lower Upper Bound Estimation method for PPVPF. It was found that the wind speed, weather temperature, weather relative humidity, global horizontal radiation, and diffuse horizontal radiation can effectively predict PV energy production. Despite their importance, PPVPF methods may ignore the interdependence structure of forecast errors among look-ahead time steps, and may lose their potential for practical use in time-dependent and multi-stage decision-making processes, such as the design of trading strategies in a multi-market environment.

VI. CHALLENGES AND FUTURE RESEARCH DIRECTIONS
The last section concludes this review with a discussion of open issues and guidelines for future studies. The main findings and the research frontiers of this study are listed as follows:
• There is a pressing need to convince PV experts that DL is efficient and reliable enough to gain acceptance by operators and stakeholders. This high-tech concept needs to overcome several weaknesses to win broader acceptance and confidence from the energy sector. From the industrial perspective, the limited integration of DL can be explained by various reasons. For instance, DL generalizes poorly when learning evolving operating conditions. Given the high complexity of atmospheric conditions, designing an ideal DL method is practically impossible. To build PV industry trust, DL models have to overcome the lack of representativeness of the training sets and the threat of adversarial attacks.
• Most of the research works focus on STF-based PVPF. However, MTF and LTF are essential for energy trading, strategic planning, and degradation-rate-impacted energy potentials of PV panels. Nevertheless, forecast validity degrades over longer horizons, with larger error values than those of STF forecasts.
• Hybrid models have great potential in PV paradigms. Consequently, the number of papers that address the hybridization of DL is ever-growing. Almost half of the reviewed papers for forecasting applications incorporate hybrid models. It can be deduced from the related works that GRU and LSTM architectures have been frequently deployed in model fusion due to their suitability for time series data. However, model hybridization entails an elevated computational burden. The high accuracy of hybrid models should not come at the cost of unreasonable complexity.
• The shortage of skilled PV professionals and experts able to deploy DL techniques is a severe problem that impedes the wide deployment of these techniques. The major concern for PV practitioners is the lack of clear guidance on algorithm structure and HP tuning. Therefore, finding near-optimal solutions can be a cumbersome problem for non-qualified operators. Although DL jobs are in high demand, the overlap between the DL and PV technology landscapes is relatively limited. More importantly, it is quite hard to find personnel qualified in both domains. Acquiring sufficient professional knowledge in both DL and PV technologies requires personal initiative due to the lack of resources in these nascent subjects.
• The explainability and interpretability of the proposed PV systems are a severe challenge for their practical feasibility. DL models operate in a "black-box" fashion, impeding their widespread adoption due to the lack of explainability. The poor visibility of model performance may lead to manufacturing problems, especially given safety-critical concerns. Nevertheless, interpretable DL methods have scarcely been applied and tested in PVPF, so the transparency and understandability of the decision logic of the forecasting engine are not guaranteed.
• Data mining and big data analytics are essential for the cost-effectiveness of predictive modeling in PV systems. The data accumulated from weather stations are processed as a continuous stream with various formats, sizes, and variability. Big data analytics is a means to improve data power stability. However, most research papers do not shed light on the use of DL models in actual PV plants.
• PVPF has recently been proposed to mitigate energy uncertainties. However, the proposed methods may face several implementation issues, as reported in Table 2. In fact, most of the proposed methods are still at the proof-of-concept stage and have not progressed to real-world applications. The implementation barrier lies in the laborious DL adjustment in terms of storage, dimension, search capacity, and convergence settings for actual standalone or grid-connected PV systems.
• The available DL models were commonly validated under Standard Test Conditions (STC). Nonetheless, very little work has been done on real-world validation, where the performance accuracy decreases dramatically for real PV systems.
• Data privacy awareness is a key enabler for the integrity and reliability of forecasting systems. There is a pressing need to protect data privacy from vulnerabilities and cyber intrusions in order to promote DL deployment. For instance, false data detection tools are needed to protect the forecasting engine and achieve effective decision-making. FL-based PVPF presents a prospective direction towards a secure and resilient grid [116]. However, studies covering data-driven cybersecurity technologies for PVPF are exceedingly rare in the literature.
• The proposed techniques often operate for a specific time frame over a specific geographical area. For instance, DL models can outperform the benchmarks for the forecasting situation under scrutiny. However, the same model may not perform equally well in other PV areas with different topographies and weather patterns. In fact, DL is hitherto inefficient in regional PVPF and limited to the technical characteristics of individual PV plants. Designing general models that can transfer learning from one PV plant to another is a potential solution towards model universality.
• Online prediction tools are seldom investigated for PV plants despite their importance in real PV systems. Online prediction effectively adapts to newly incoming information, whereas offline DL methods fail under unpredictable conditions during operation. Online PVPF can obtain near-optimal predictions and promote model stability. This requires sufficient storage capacity in the infrastructure and an adequate frequency of model updates.
• Another challenge is the pre- and post-processing of data. Data preprocessing and error post-processing are a serious concern, especially given the sheer size of the data. The vast amounts of data lead to noise and imprecise knowledge problems that can be difficult to surmount. Data preprocessing usually includes data normalization, faulty and missing data filtering, data resolution adjustment, data augmentation, correlation analysis, data clustering, and graph construction. Post-processing procedures, in turn, usually include various pruning routines, rule filtering, or even knowledge integration. Paper [32] taxonomized post-processing techniques for solar forecasting into four classes: deterministic-to-deterministic, probabilistic-to-deterministic, deterministic-to-probabilistic, and probabilistic-to-probabilistic post-processing. In that paper, the authors reported that post-processing is vital for consistency, quality, and value in the PVPF context [32].
• The integration of satellite and ground-based measurements is rarely studied. This shortage may be attributed to the limited access to satellite images, which hinders data collection. Although satellite-derived irradiance datasets provide spatial diversity, the obtained accuracy is relatively low in the presence of poor datasets. Designing data-light DL systems is a potential solution to overcome the barrier of large data requirements and achieve satisfactory results.
• Standardizing forecast evaluation towards a universal functional form is necessary to facilitate the selection of the most suitable model among the alternatives. Quantifying predictability is a tiresome task due to the large number of error metrics. The diversified score criteria may inhibit reaching a statistical consensus on a model's goodness. Unifying the score metrics is essential to gain the acceptance of industry and end-users in real-world problems. Standardized criteria also reduce the costs of the prediction system and its economic impact. Although error metric standardization seems an intuitively appealing task, research works towards that goal are limited.
• Hierarchical TS Forecasting (HTSF) is deemed suitable to achieve excellent performance in PVPF through explanatory variables. However, the use of HTSF is limited. HTSF follows a hierarchical aggregation structure at different levels by reconciling incoherent forecasts according to their proximity to individual TS, so that the relationships within the hierarchy are preserved. This entails a problem of coherency at different granularity levels of the time-varying observations. Therefore, the adjustment between upper and lower levels is vital to ensure the consistency of the forecast. DL is particularly well-positioned to predict the TS data of all nodes in the hierarchy and reconcile them [155]. However, to the best of the authors' knowledge, HTSF has not yet been applied to PV systems.
• Forecasting with multimodal and multilevel information fusion is scarcely discussed. Multimodal learning allows learners to merge the information from different sources.
In the PV context, the multimodal data may include sky images and cloud motion speed records. These heterogeneous data from different modalities present complementary information from multiple sources. Information fusion from different modalities with strong end-to-end governance standards can significantly enhance the prediction accuracy, which should boost interest in modeling concepts in this area.
• Smart meter (SM) sensing may pose serious challenges to forecasting accuracy, especially given the short service life span of the meters. A meter failure can cause a plethora of problems and may stem from simple causes such as internet loss, software flaws, and hardware malfunctions. However, manually checking all SMs on a regular basis can be labor-intensive. DL can work efficiently for the early detection of inaccurate SMs, towards longer-lasting SMs. Careful early investigation of data reliability and sensor tampering helps in producing more robust predictions.
• The performance of DL methods depends on the availability of abundant, high-quality PV data to meet power quality standards. However, these models run slowly and have narrow boundaries of the frequency domain division in the production environment. DL deployment hinges on three major aspects: portability, scalability, and computational cost. Operationalizing and robustifying DL are still tedious tasks that call for further research in this direction.
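The coherency requirement raised in the HTSF item above can be illustrated with two textbook reconciliation rules for a two-level hierarchy (individual plants and their regional total): bottom-up aggregation and proportional top-down disaggregation. This is a minimal sketch under stated assumptions, not the DL-based reconciliation of [155]; the plant names, function names, and the equal-share fallback are hypothetical.

```python
from typing import Dict, List, Sequence


def bottom_up(plant_forecasts: Dict[str, Sequence[float]]) -> List[float]:
    """Regional forecast obtained by summing plant-level forecasts per time step."""
    return [sum(step) for step in zip(*plant_forecasts.values())]


def top_down_proportional(
    regional_forecast: Sequence[float],
    plant_forecasts: Dict[str, Sequence[float]],
) -> Dict[str, List[float]]:
    """Rescale plant forecasts so that they add up to the regional forecast.

    Each plant keeps its share of the summed plant-level forecast at every
    time step; if that sum is zero, the total is split equally (an assumption).
    """
    names = list(plant_forecasts)
    reconciled: Dict[str, List[float]] = {name: [] for name in names}
    for t, total in enumerate(regional_forecast):
        base_sum = sum(plant_forecasts[name][t] for name in names)
        for name in names:
            if base_sum > 0:
                share = plant_forecasts[name][t] / base_sum
            else:
                share = 1.0 / len(names)
            reconciled[name].append(total * share)
    return reconciled
```

After either rule is applied, the plant-level forecasts sum exactly to the regional forecast at every time step, which is the coherency property the hierarchy requires. Bottom-up trusts the lower level, whereas top-down trusts the upper level; more elaborate schemes weigh both.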

VII. CONCLUSION
With the inevitable emergence of the Smart Grid, DL-as-a-service plays an essential role in the bulk penetration of Photovoltaic (PV) energy through efficient PV Power Forecasting (PVPF) systems. This paper provides a comprehensive review of recent PVPF research involving DL. We took a deeper dive into the well-known architectures for PVPF. Three types of emerging learning methods are classified: discriminative learning, generative learning, and deep reinforcement learning. The DL methods have their own merits and numerous shortcomings, which may be mitigated by optimal hyperparameter tuning. Different PVPF strategies concerning time horizons have been described in the study. The vast majority of case studies from the literature demonstrated that hybridization and ensembling strengthen DL techniques, leading to better accuracy and higher robustness. It is hoped that this review paper will help researchers and practitioners to improve forecasting accuracy by moving DL models from the nascent stage to real-world applications and to come up with more precise PV energy forecasts.
HAITHAM ABU-RUB (Fellow, IEEE) received the two Ph.D. degrees. He has worked at many universities in many countries, including Poland, Palestine, USA, Germany, and Qatar. Since 2006, he has been with Texas A&M University at Qatar. For five years, he has served as the Chair for the Electrical and Computer Engineering Program at Texas A&M University at Qatar and currently serving as the Managing Director for the Smart Grid Center. He has published more than 400 journals and conference papers, five books, and six book chapters. He has supervised many research projects on smart grid, power electronics converters, and renewable energy systems. His main research interests include electric drives, power electronic converters, renewable energy, and smart grid. He was a recipient of many national and international awards and recognitions. He was a recipient of the American Fulbright Scholarship, the German Alexander von Humboldt Fellowship, and many others. He has worked in the industry for more than 12 years as the Engineering Team Leader, a Senior Electrical Engineer, and an Electrical Design Engineer on various electrical engineering projects. He is currently an Associate Research Scientist with the Department of Electrical and Computer Engineering, Texas A&M University at Qatar. He has published more than 125 journal articles and conference papers and one book. His principal work area focuses on electrical machines, power systems, smart grid, big data, energy management systems, reliability of power grids and electric machinery, fault detection, and condition monitoring and development of fault-tolerant systems. He has participated and leads several scientific projects over the last eight years. He has successfully realized many potential research projects. He is a Senior Member of the Institute of Electrical and Electronics Engineers (IEEE), a member of the Institution of Engineering and Technology (IET), a member of the Smart Grid Center-Extension in Qatar (SGC-Q). 
He has been a Visiting Professor in several universities and international and prestigious organizations. He is an Invited Professor of many institutes. He has several international publications. His research interests include materials science, energy systems, pollution, and renewable energies, with expertise in multi-component multiphase convection-diffusion problems. He is a member of several boards of directors or scientists and conference organizing committees. He has chaired many international conferences. He is a regular reviewer for journals and international research projects. VOLUME 9, 2021