Deep Learning in Smart Grid Technology: A Review of Recent Advancements and Future Prospects

The current electric power system is witnessing a significant transition toward Smart Grids (SG), a promising landscape for high grid reliability and efficient energy management. This ongoing transition involves rapid changes, requiring a plethora of advanced methodologies to process the big data generated by various units. In this context, SG is tied closely to Deep Learning (DL) as an emerging technology for creating a more decentralized and intelligent energy paradigm while integrating high intelligence into supervisory and operational decision-making. Motivated by the outstanding success of DL-based prediction methods, this article provides a thorough review, from a broad perspective, of the state-of-the-art advances of DL in SG systems. Firstly, a bibliometric analysis has been conducted to shape this review's methodology. Further, we taxonomically delve into the mechanisms behind some of the trending DL algorithms. We then showcase the DL enabling technologies in SG, such as federated learning, edge intelligence, and distributed computing. Finally, challenges and research frontiers are provided to serve as guidelines for future work in the futuristic power grid domain. This study's core objective is to foster the synergy between these two fields so that decision-makers and researchers can accelerate DL's practical deployment in SG systems.


I. INTRODUCTION
Increased concerns about the exponential growth of electricity demand and the bulk integration of Renewable Energy Sources (RES) have raised a set of new challenges regarding the adaptability of the traditional grid to this new reality [1]. Faced with an ever-growing population and energy demand, the Smart Grid (SG) has emerged as the most convenient solution, providing the necessary tools to enhance the services of the antiquated grid using Information and Communication Technologies (ICT) [2]. This ICT-enabled grid presents the most effective solution for overcoming the major issues of the outdated grid, such as poor adaptation to outliers and heterogeneous sources [2]. The next-generation Electric Power Systems (EPSs) must satisfy users' needs in terms of high flexibility to sudden events and customer behavior, Energy Management (EM) of distributed renewable energy sources, and adoption of low-cost and easy-to-deploy technical solutions [3]–[5].

A. CHALLENGES OF SMART GRID IMPLEMENTATION
Domestic electricity consumption has risen steadily in recent years due to population growth and rapid industrialization [6]. To meet this demand, renewable energy consumption in the United States is projected to climb from 11.34 quadrillion Btu to 21.51 quadrillion Btu by 2050, nearly doubling [6]. The distributed renewable energy market has flourished as the fastest-growing sector and is estimated to surpass traditional sources by 2050 [1]. For their smooth and mature penetration, the SG requires a robust and intelligent coordination platform between the different elements of EPSs. Traditional automation based on instructive tasks and conventional methods for tedious routine operations between the utility grid parts is ineffective in dealing with unexpected situations and sustainability problems [7]. With conventional deterministic programming for electrical operations, power flow issues remain difficult to control because of the heterogeneous multi-agent practitioners behind the grid's ''energy mix'' generation [2]. Furthermore, conventional automation techniques require manual monitoring, restoration, and operation regulation, leading to frequent problems and downtimes, especially with the incorporation of RES [8], [9]. The SG's underpinnings are meant to communicate automatically with different electrical components and deduce the future behavior of each section using extensive computations, in which DL has the main share of the effective deployment [10].

B. EMERGENCE OF DEEP LEARNING
Several milestones have been reached in applying Machine Learning (ML) techniques to various sub-areas of SG [11]. However, shallow neural networks and simple ML models pose many challenges that make them seldom employed for complex problems in EPSs [12], [13]. These challenges broadly lie in two facts: firstly, non-deep-learning algorithms are ineffective for high-dimensional representations, incurring unreasonable complexity [14], [15]; secondly, the accuracy of simple ML models cannot be improved with large amounts of data [16], [17]. To tackle these problems, the learning paradigm has shifted to DL as the most dazzling flagship of ML for exploiting the abundance of Big Data (BD) with hierarchical feature extraction, high efficiency, and timeliness [18]. Deep Neural Networks (DNNs) have quickly ascended to the spotlight due to improvements in computing performance and data capacity. The DL paradigm has achieved great success due to its strong potential for representation learning. The performance of DL techniques rests on multiple processing units that learn feature representations with several layers of abstraction [19]. Owing to this wide success, there is a significant proliferation in the use of DL in EPSs to exhibit complex correlations from heterogeneous data with different formats. Notably, the complexity of SG has intensified the need for DL to make use of the massive data from smart meters and Internet of Things (IoT) devices [20]. Therefore, the SG community has been encouraged to apply DL methods to solve a range of miscellaneous and critical problems, broadly including forecasting tasks, fault detection and diagnosis, cybersecurity, and prediction [21].

C. RELATED WORKS AND MOTIVATION
Despite the rising interest in DL techniques for SG, recent review articles are scattered across sub-areas of EPSs. Furthermore, the existing body of knowledge reported so far in the available published papers lacks a critical overview of the recent methodologies that tailor DL to EPSs, such as distributed DL models and edge intelligence. Pioneering relevant review articles on DL and SG applications are reported in Table 1 [1], [10], [11], [13], [21], [23]. From Table 1, it can be concluded that the reported research strategies lose sight of tracing the development line of the energy field. This paper therefore provides a systematic review of DL methods applied to SG to foster the synergy between these two research hotspots. Beyond reviewing the recent DL methods, their merits, and their limitations, this review draws escalating attention to the emerging DL enabling technologies for EPSs. These technologies include Federated Learning (FL), Distributed DL (DDL), Edge Intelligence (EI), Big Data DL (BDDL), Deep Transfer Learning (DTL), and Incremental Learning (IL). Finally, a fruitful discussion on the research frontiers that intersect advanced DL and EPSs is conducted.

D. RESEARCH METHODOLOGY AND SYSTEMATIC REVIEW PROTOCOL
Starting in September 2019, a multiple-methods approach was conducted [24]. Mainstream research papers on SG/AI were collected from Web of Science (WoS), Scopus, IEEE Xplore, Science Direct, and Google Scholar, which are among the largest databases of peer-reviewed articles. Only peer-reviewed articles written in English, providing experimental results, and having a unique identifier from the mentioned databases were taken into consideration, including reviews, research articles, patent reports, and conference proceedings. The adopted methodology for conducting this review employs a combination of keywords categorized into three main groups, specifically 'Deep Learning', 'Smart Grid', and 'Prediction'.
The search methodology focuses on recent research articles from 2015-2020 to identify the comprehensive status of AI applications in SG. The filtering process yielded 220 research papers from 600 related papers, selected based on their relevance by reading the title, abstract, conclusion, and full text. The filtered articles were tabulated and unified to facilitate comparative analysis and assessment according to the prediction horizon, applications, used data, error measures, AI classes, experimental setup, etc. The following criteria were applied: (i) SG and EPSs are considered; (ii) the feasibility analysis of the forecasting models is given high importance in the selection process; (iii) the evaluation of the forecasting models emphasizes the use of scale-free metrics; (iv) future directions and perspectives are drawn from the latest research articles to give a general standpoint on the current status of SG-based DL and future work. Fig. 1 presents the timescale variation in the frequency of use of the terms SG and DL in scientific books from the Google Books Ngram Viewer. It can be seen from Fig. 1 that the SG paradigm initially appeared in 1997, while the use of DL has grown exponentially since 2006. The high correlation of SG and DL lies in their complementarity in promoting each other's applicability in real-world policies. A bibliometric analysis based on a thesaurus file from the WoS website was conducted to shape the review structure, as illustrated in Fig. 2. The aforementioned keywords shape the review structure. Fig. 2 shows three topic clusters: DL models, SG, and enabling technologies (e.g., transfer learning). These clusters are taken into consideration to structure the review content in the next subsection.

E. STRUCTURE OF THE REVIEW
The information flow of this review is structured in a top-down manner, as illustrated in Fig. 3. Section II emphasizes the ultimate need for SG in the energy hub. Section III comprehensively presents the commonly used DL methods for SG systems. These DL methods essentially include the Multilayer Perceptron (MLP), Recurrent Neural Network (RNN), Convolutional Neural Network (CNN), Restricted Boltzmann Machine (RBM), Autoencoder (AE), Deep Reinforcement Learning (DRL), and Generative Adversarial Network (GAN). Section IV presents some insights on the DL enabling technologies underpinning the SG paradigm. Section V introduces possible future work and directions for empowering the role of DL in the SG area by emphasizing undiscovered fields. Section VI concludes this review paper.

II. PRIMER ON SMART GRID
The SG technology presents a potentially powerful concept for sustainable energy operations. This next-generation network takes advantage of customer actions and energy stakeholders to empower energy delivery in a secure, economic, and sustainable manner. A huge number of data sources and control points in the grid cope with end-user needs toward efficient decision-making actions [23]. For realizing two-way communication, SG elements require efficient integration to meet the desired functionalities of SG.
However, the widespread adoption of SG systems introduces several critical challenges. The complexity of SG and the huge amount of data require advanced automation and information management tools. The bulk penetration of RES into the electrical grid leads to unstable and volatile power generation. This volatility requires DL-based models to address the uncertainty and intermittency of renewable EPSs [25]. Furthermore, the wide development of Advanced Metering Infrastructures (AMI) and Wide Area Monitoring Systems (WAMS) intensifies the necessity of DL-based techniques to deal with the massive data produced.

III. REVIEW OF DL METHODS
Recently, DL has aroused greater attraction at a far faster pace than even a decade ago. This appears quite obvious given the massive number of technologies on which the DL fingerprint is meaningfully marked [26]. In particular, utility-scale plants have been remarkably expanding around the world during the last decade [27]. Due to the fast growth in global installed capacity, utility-scale generating facilities face multiple challenges in terms of performance monitoring, power losses, faults and failures, large complexity, and big size across hundreds of acres of land. Stakeholders and researchers tend to seek scalable solutions to address these challenges in such huge plants [27]. Traditional methods of monitoring utility-scale projects become too costly at that scale. DL techniques contribute toward filling these gaps by providing self-monitoring, self-healing, and automated diagnostics. Several studies report the increasing role of DL algorithms in revolutionizing utility-scale systems [28]. Fascinated by the enticing popularity of DL, the following subsections are devoted to the popular DL methods for SG.

A. MLP
Multilayer Perceptron (MLP) is a DNN with densely connected layers that acquires a strong fitting ability for nonlinear systems [29], [30]. Over the past decades, neural networks have undergone a significant evolution from the simple McCulloch-Pitts neuron to more complex MLP structures, as shown in Fig. 4 [31]. As Fig. 4 illustrates, neural network architectures have passed through radical transformations to enhance learning performance [32]. Operational Neural Networks (ONNs) and Self-ONNs present a diversification of the conventional MLP that embeds nonlinear patch-wise transformations for designing more compact networks with improved prediction capability [33]. The objective of these futuristic variants of MLP networks is to boost generalization potential with less network complexity and minimal training data [34]. To formalize the dynamics of an MLP with K hidden layers, let J_i denote the weight vector from the input layer to hidden node i and w_i the weight from hidden node i to the output layer. Here, φ(·) denotes the activation function, and φ(x, J_i) = φ(J_i^T x) is the output of the i-th hidden neuron. Θ = {J_1, ..., J_K, w_1, ..., w_K} denotes the set of all model parameters.
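The formalization above can be illustrated with a minimal numpy sketch of a one-hidden-layer forward pass; the tanh activation and the layer sizes are illustrative assumptions, not choices made in the reviewed papers.

```python
import numpy as np

def mlp_forward(x, J, w, phi=np.tanh):
    """Forward pass of a one-hidden-layer MLP.

    x:   input vector, shape (n,)
    J:   input-to-hidden weights, shape (k, n); row i is J_i
    w:   hidden-to-output weights, shape (k,)
    phi: activation function, applied element-wise
    """
    hidden = phi(J @ x)   # output of each hidden neuron, phi(J_i^T x)
    return w @ hidden     # weighted sum at the (linear) output node

rng = np.random.default_rng(0)
x = rng.normal(size=4)
J = rng.normal(size=(8, 4))
w = rng.normal(size=8)
y = mlp_forward(x, J, w)
```

Training would then adjust the parameter set {J_1, ..., J_K, w_1, ..., w_K} by gradient descent on a loss; the sketch shows only the forward dynamics.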
Paper [37] proposed an MLP-based Double Least Absolute Shrinkage and Selection Operator (dLASSO-MLP) model for measuring quality variables of soft sensors in industrial processes. The dLASSO-MLP model prunes redundant hidden nodes to avoid overfitting. An Iterative Residual blocks Based DNN (IRBDNN) model has been introduced to predict individual occupants' short-term energy requirements by exploring the underlying spatio-temporal correlations of customer behaviors [38]. It has been concluded from the implementation of IRBDNN that the spatial correlations between different appliances used in a household and the iterative residual blocks can boost the model performance [38]. Moreover, the data representation in communication networks can potentially increase the accuracy of the IRBDNN model [38]. The authors in [39] employed MLP for malicious attack detection in SG networks. The proposed solution provides 99% accuracy over 10000 simulations. In [40], the authors developed an MLP model for Short-Term Load Forecasting (STLF) based on smart meter data. According to the simulation results, MLP achieved better performance than conventional ML techniques [40]; however, the training time is relatively long [40]. MLP reveals several features that promote its implementation for nonlinear problems of single and multi-complex tasks, such as distributed representation and computation, mapping capabilities, powerful generalization, and high-speed information processing [41]. Despite the advantages of the MLP model, especially in higher-dimensional settings, its loopholes lie in algorithmic complexity, long training time for large MLPs, and a higher computational burden [41]. By stacking multiple layers, constructing and training this DNN model can be computationally expensive [42].

VOLUME 9, 2021

FIGURE 4. Timescale evolution of Artificial Neural Networks. ONNs: Operational Neural Networks; GOPs: Generalized Operational Perceptrons; +: under improvement. Super Networks belong to Super AI, which meets the technological singularity (these concepts are purely speculative at the time of writing this review and may not materialize) [35], [36].

B. RNNs
RNN is designed for sequential time series data, where the output of the network is fed back to the input, as illustrated in Fig. 5(a). The recursive processing of RNN relies on hidden layers with a feedback loop that provides useful information about past states [43]. For a sequence of input vectors x = (x_1, ..., x_T) ∈ R^{N×T} and output vectors y = (y_1, ..., y_T) ∈ R^{M×T}, the hidden states are h = (h_1, ..., h_T) ∈ R^{H×T}. Let h_t and h_{t−1} be the hidden states at times t and t−1, respectively. Then h_t can be written as h_t = f_h(W_xh x_t + W_hh h_{t−1} + b_h), where f_h denotes a nonlinear function such as the sigmoid σ. W_xh, W_hh, and W_hy represent the input, hidden, and output weights, respectively, while b_h and b_y denote the hidden and output biases. The network output y_t can be written as y_t = W_hy h_t + b_y.
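The recurrence just described can be sketched directly in numpy; the sequence length and layer sizes below are illustrative assumptions.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One recurrence step of a vanilla RNN with sigmoid hidden activation."""
    sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
    h_t = sigma(W_xh @ x_t + W_hh @ h_prev + b_h)  # hidden-state update
    y_t = W_hy @ h_t + b_y                         # linear readout
    return h_t, y_t

N, H, M = 3, 5, 2  # input, hidden, and output sizes
rng = np.random.default_rng(1)
W_xh = rng.normal(size=(H, N))
W_hh = rng.normal(size=(H, H))
W_hy = rng.normal(size=(M, H))
b_h, b_y = np.zeros(H), np.zeros(M)

h = np.zeros(H)                       # initial hidden state
for x_t in rng.normal(size=(4, N)):   # unroll over a length-4 sequence
    h, y = rnn_step(x_t, h, W_xh, W_hh, W_hy, b_h, b_y)
```

Because the same weights are reused at every step, gradients flow through repeated multiplications by W_hh during training, which is the source of the vanishing/exploding gradient dilemma noted below.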
In [44], an RNN-based intrusion detection system has been introduced for detecting network attacks and fraudulent transactions in a blockchain-based energy network. The SG attacks are avoided by generating blocks with short signatures and hash functions [44]. The RNN model achieved an overall accuracy rate of 98.23%. The merit of RNN lies in its internal memory. However, RNNs are prone to the vanishing and exploding gradient dilemma [45]. For real EPS applications, this shortcoming means RNN is usually replaced with two types of gated memory structures: Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) [46].

1) LSTM
The major success of the LSTM model is attributed to its excellent ability to extract temporal features from input data x_1, ..., x_T with 0 ≤ t ≤ T. The LSTM mechanism deploys three memory gates, namely the input, forget, and output gates, as shown in Fig. 5(b) [45], [47]. Here, x_t, σ, and c_t denote the input sample at time t, the activation function, and the memory unit, respectively. The gate dynamics can be written as:
i_t = σ(W_i x_t + U_i h_{t−1} + b_i)
f_t = σ(W_f x_t + U_f h_{t−1} + b_f)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_c x_t + U_c h_{t−1} + b_c)
h_t = o_t ⊙ tanh(c_t)
where b and W, U denote the bias and weight matrices for each gate, respectively, and the symbol ⊙ is the element-wise multiplication [45]. A comparative study of some popular nonlinear activation functions is presented in Table 5.
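A single step of the standard LSTM cell can be sketched as follows (layer sizes and random initialization are illustrative assumptions):

```python
import numpy as np

def lstm_cell(x_t, h_prev, c_prev, W, U, b):
    """One step of a standard LSTM cell.

    W, U, b are dicts keyed by gate: 'i' (input), 'f' (forget),
    'o' (output), and 'c' (candidate memory).
    """
    sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
    i = sigma(W['i'] @ x_t + U['i'] @ h_prev + b['i'])       # input gate
    f = sigma(W['f'] @ x_t + U['f'] @ h_prev + b['f'])       # forget gate
    o = sigma(W['o'] @ x_t + U['o'] @ h_prev + b['o'])       # output gate
    c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])
    c_t = f * c_prev + i * c_tilde   # element-wise memory update
    h_t = o * np.tanh(c_t)           # gated hidden output
    return h_t, c_t

H, N = 4, 3
rng = np.random.default_rng(2)
W = {k: rng.normal(size=(H, N)) for k in 'ifoc'}
U = {k: rng.normal(size=(H, H)) for k in 'ifoc'}
b = {k: np.zeros(H) for k in 'ifoc'}
h, c = lstm_cell(rng.normal(size=N), np.zeros(H), np.zeros(H), W, U, b)
```

The additive memory update c_t = f ⊙ c_{t−1} + i ⊙ c̃ is what lets gradients survive over long horizons, in contrast to the purely multiplicative recurrence of the vanilla RNN.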
The authors in [49] employed LSTM and an aggregation function based on the Choquet integral for solar irradiation forecasting. The proposed architecture provides clear information about the largest consistency among conflicting forecasting results by aggregating different LSTM networks [49]. The simulation results, based on six datasets collected from different regions in Finland, demonstrated the performance superiority of the proposed approach compared to four standalone LSTM models with different configurations [49]. An STLF method based on LSTM and Multivariable Linear Regression (MLR), named LSTM-MLR, is proposed to capture the time series variations of STLF using ensemble empirical-mode decomposition [50]. However, its computational complexity can limit the practical adoption of the proposed LSTM-MLR in real power systems. In [51], a parallel LSTM-CNN model has been proposed for STLF. Reference [52] introduced a hierarchical dilated LSTM model for mid-term electric load forecasting. The multilayer LSTM is equipped with dilated recurrent skip connections and a spatial shortcut path from lower layers for better forecastability and universality potential [52]. This model was based on the winning submission to the M4 forecasting competition for monthly data in 2018 [52]. In [53], a rapid islanding detection method based on an LSTM classifier has been proposed, demonstrating the applicability of LSTM to classification problems. From these studies, it can be concluded that the LSTM model has dramatically improved the State-Of-The-Art (SOTA) of RNN in time series prediction. This improvement is due to its strong ability to learn long-tailed temporal dependencies using memory units and customized gates [45]. However, the LSTM structure has limited potential in learning spatial features.

2) GRU
The GRU network has been proposed to alleviate the computational burden of the LSTM architecture by using only two gates instead [45]: a reset gate and an update gate (Fig. 5(c)) [45]. The update gate has the same functionality as the forget gate of an LSTM, deciding whether information is useful or forgotten. The reset gate decides whether information is saved or removed. Paper [54] used a Residual GRU (ResGRU) with CNN for fault diagnosis in PV arrays. The proposed ResGRU model has a strong anti-interference ability and effectively identifies single and hybrid faults with an accuracy of 98.61%.
Unfortunately, the ResGRU model's computational complexity poses the problem of massive waveform storage, especially for online fault diagnosis [54]. In [55], the Particle Swarm Optimization (PSO) method is used to tune the GRU model, which shows that GRU is very sensitive to the initial state of its parameters. A GRU-based fault type classification method has been presented, which makes use of the discrete wavelet transform for data pre-processing [56]. Despite its high flexibility, the GRU architecture is prone to poor spatial feature representation and high computational cost, making its trustworthy practical implementation questionable [31].

3) ATTENTION MECHANISM
Attention mechanisms come in two main flavors: Soft Attention (SA) and Hard Attention (HA) [57]. SA is a fully differentiable deterministic mechanism that can be plugged into an existing system. Alternatively, instead of using all the hidden states as input for the decoding, as SA does, HA samples a hidden state y_i with probability s_i. Compared to HA, SA is easier to implement [57]. Two variant attention mechanisms have been introduced: the Self-Attention Mechanism (SAM) and the Multi-Head Attention Mechanism (MHM) [58]. Self-attention is employed to capture nonlocal feature dependence along spatial dimensions, which is hard to extract with convolution kernels. Meanwhile, MHM employs multiple self-attention mechanisms in parallel. Extensive studies have found that MHM is more efficient than SAM for selective attention.
Attention-based models have been frequently used in EPSs. For instance, paper [58] employed SAM and Multi-Task Learning for Photovoltaic Power Forecasting (PVPF). Despite the fundamental role of SAM in enhancing the performance of RNN-based models, a massive number of additional learnable hyperparameters are added to the prediction system, requiring extensive tuning and making it unsatisfactory for real-world scenarios [58]. In [59], the authors used a multi-head CNN-RNN architecture for anomaly detection. From the simulation results, the types of anomalies in the learning library for fault identification are diagnosed, specifically point, context-specific, and collective anomalies [59]. In real applications, the multi-head CNN-RNN model could face other, unknown types of fault samples, in which case the proposed model may fail [59]. Paper [60] introduced a Gated Dual Attention Unit (GDAU) for predicting bearing Remaining Useful Life (RUL). The GDAU model combines the attention mechanism and GRU to improve prediction accuracy and convergence speed [60].
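As a concrete illustration of the self-attention variant discussed above, a scaled dot-product self-attention layer can be sketched as follows (the projection matrices and dimensions are illustrative assumptions):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (T, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])           # pairwise relevance
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
    return weights @ V, weights                      # attended features, attention map

T, d = 6, 4
rng = np.random.default_rng(3)
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
```

Multi-head attention simply runs several such layers with independent projections and concatenates their outputs, which is what makes MHM costlier but more expressive than a single SAM head.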

4) BIDIRECTIONAL MECHANISM
The bidirectional mechanism (BiM) extends unidirectional schemes to capture both past and future semantic information from the forward and backward directions simultaneously [61]. Bidirectional modeling uses two separately computed directions of information processing: one forward neural network scanning the sequence from left to right and one backward network reading the sequence in the other direction [61]. Assume the input at time t is the embedding w_t, the output of the forward hidden unit at time t−1 is →h_{t−1}, and the output of the backward hidden unit at time t+1 is ←h_{t+1}. The outputs of the forward and backward hidden units at time t are then →h_t = L(w_t, →h_{t−1}) and ←h_t = L(w_t, ←h_{t+1}), where L(·) denotes the hidden-layer operation of the LSTM [61]. The forward output vector →h_t ∈ R^{1×H} and the backward output vector ←h_t ∈ R^{1×H} are combined to obtain the final feature, where H denotes the number of hidden-layer cells. Owing to the advantages of BiM, it has been used for forecasting applications. The authors in [62] employed Bidirectional LSTM (BiLSTM) for wind speed interval prediction. A sequence-to-sequence mechanism based on bidirectional GRU has been introduced for type recognition and time localization of combined power quality disturbances [63]. Unfortunately, details about the computational time and network size were missing, limiting the model's generalization. A feature-attention BiGRU method for RUL prediction of electrical assets has been proposed [64]. Despite achieving SOTA performance, owing to the high complexity of the proposed model, its hyperparameter optimization may lack convergence guarantees [64].
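The two-direction scan and feature combination can be sketched as follows; for brevity a simple tanh recurrence stands in for the LSTM operation L(·), and all sizes are illustrative assumptions.

```python
import numpy as np

def bidirectional_scan(X, W_f, W_b, H):
    """Run forward and backward recurrent passes over X (shape (T, N))
    and concatenate the two hidden sequences at each time step."""
    def scan(seq, W):
        h, out = np.zeros(H), []
        for x_t in seq:
            h = np.tanh(W['x'] @ x_t + W['h'] @ h)  # stand-in for L(w_t, h)
            out.append(h)
        return out
    fwd = scan(X, W_f)               # left-to-right states ->h_t
    bwd = scan(X[::-1], W_b)[::-1]   # right-to-left states <-h_t, re-aligned
    return np.array([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])

T, N, H = 5, 3, 4
rng = np.random.default_rng(4)
mk = lambda: {'x': rng.normal(size=(H, N)), 'h': rng.normal(size=(H, H))}
feats = bidirectional_scan(rng.normal(size=(T, N)), mk(), mk(), H)
```

Each output row has dimension 2H, since the forward and backward H-dimensional states are concatenated; this is the combined feature fed to the downstream predictor.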

C. CNNs
CNN layers were initially used to capture semantic correlations of underlying spatial features between slice-wise representations via convolution operations on multi-dimensional data [65]. The feature mapping of a CNN contains k filters spatially partitioned into different channels [66]. The pooling operation is applied to shrink the width and height of the feature map. The convolutional layer output is calculated as y_i = f_a(Σ_j k_ij * x_j), where x_j, k_ij, and y_i denote the feature input, the convolutional kernel, and the hidden layer of the i-th channel, respectively, while f_a represents the activation function [61]. The max-pooling layer for position (j, k) is given as y_ijk = max_{p,q}(x_{i,j+p,k+q}), where p and q denote the vertical and horizontal indices in the local neighborhood. The fully connected layer υ is generated as υ = f_a(w_υ y + b), where b and w_υ denote the bias and weight matrix [67].
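The convolution and max-pooling operations above can be made concrete with a minimal numpy sketch (single channel, 'valid' padding, ReLU activation; all sizes are illustrative assumptions):

```python
import numpy as np

def conv2d_valid(x, k):
    """'Valid' 2-D convolution (cross-correlation) of input x with kernel k."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)  # sliding dot product
    return out

def max_pool(x, p=2):
    """Non-overlapping p x p max pooling: each output is the max of a local window."""
    H, W = x.shape
    return x[:H - H % p, :W - W % p].reshape(H // p, p, W // p, p).max(axis=(1, 3))

rng = np.random.default_rng(5)
fmap = np.maximum(conv2d_valid(rng.normal(size=(8, 8)),
                               rng.normal(size=(3, 3))), 0)  # ReLU activation
pooled = max_pool(fmap)  # shrinks the 6x6 feature map to 3x3
```

The same 3x3 kernel is reused at every spatial position, which is the weight sharing that gives CNNs their memory efficiency relative to fully connected layers.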
CNNs analyze hidden patterns using pooling layers for scaling, shared weights for memory reduction, and filters for capturing semantic correlations by convolution operations on multi-dimensional data [68]. Thus, the CNN architecture has a strong potential for understanding spatial features [69], [70]. Despite this potential, the CNN model suffers from an inability to capture temporal features [70]. The authors in [71] implemented GRU for long-term load forecasting. In [72], a Temporal Convolutional Network (TCN) was combined with a Light Gradient Boosting Machine (LGBM) to address the issue of unsatisfying accuracy in STLF. As an enhanced variant of CNN, the TCN model combines a sequence of dilated causal convolutions and residual connections for more effective stacking of deep layers [73]. A CNN with Squeeze-and-Excitation modules (CNN-SE) has been proposed for STLF [74]. In the CNN-SE architecture, SE mitigates the redundancy caused by the massive number of input channels, while the CNN model aggregates micrometeorological data from different acquisition sites [74]. The proposed model provides a SOTA multi-dimensional analysis using the squeeze-and-excitation block [74]. In [75], a CNN-based locational detection algorithm has been proposed for multi-label classification of false data injection attacks. For better accuracy, a Bad Data Detector (BDD) has been employed to refine data quality on the IEEE 14- and 118-bus systems [75]. However, the training difficulty increases with the depth and number of CNN layers [56]. Therefore, the number of hidden layers of a CNN is preferred to be small, which can decrease its performance [56], [75]. From the previous CNN applications, the CNN model contributes to the prediction system with efficient spatial feature extraction and distributed implementation. However, the downsides of CNN include a lack of temporal data modeling and long training times [76].

D. RBMs
The fully connected Boltzmann Machine (BM), shown in Fig. 6(a), is found ineffective due to its poor learning potential and high calculation complexity. Therefore, RBMs restrict BMs by removing the hidden-hidden and visible-visible connections among units of the same layer, as shown in Fig. 6(b). The RBM is the two-layer building block in the construction of a Deep Belief Network (DBN) (Fig. 6(c)).
The RBM layer employs a generative graphical model that encodes the probability density function of its input layer h_{i−1} into its latent feature vector h_i. The RBM uses an energy function E(v, h), with inputs modeled by a Gaussian function. The energy function can be written as:
E(v, h) = Σ_i (v_i − a_i)² / (2σ_i²) − Σ_j b_j h_j − Σ_{i,j} (v_i / σ_i) W_ij h_j
where W_ij denotes the weight between visible and hidden units, σ_i denotes the Gaussian standard deviation of the visible units, and a_i and b_j denote the bias terms. A DBN with multiple RBMs can be built by applying layer-wise greedy pre-training through the above procedure to each RBM.
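The Gaussian-Bernoulli energy function above translates directly into code; the layer sizes and small random weights are illustrative assumptions.

```python
import numpy as np

def gb_rbm_energy(v, h, W, a, b, sigma):
    """Energy of a Gaussian-Bernoulli RBM configuration (v, h):
    E(v,h) = sum_i (v_i - a_i)^2 / (2 sigma_i^2)
             - sum_j b_j h_j
             - sum_ij (v_i / sigma_i) W_ij h_j
    """
    quad = np.sum((v - a) ** 2 / (2 * sigma ** 2))  # Gaussian visible term
    return quad - b @ h - (v / sigma) @ W @ h       # bias and interaction terms

nv, nh = 4, 3
rng = np.random.default_rng(6)
W = 0.1 * rng.normal(size=(nv, nh))
a, b_h = np.zeros(nv), np.zeros(nh)
sigma = np.ones(nv)
v = rng.normal(size=nv)                 # real-valued visible units
h = rng.integers(0, 2, size=nh)         # binary hidden units
E = gb_rbm_energy(v, h, W, a, b_h, sigma)
```

Training (e.g., by contrastive divergence) adjusts W, a, and b so that low energy is assigned to configurations resembling the data; the sketch evaluates the energy only.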
In [77], the authors employed an improved DBN for STLF using meteorological data, load demand data, and demand-side management data. The improvement methodology consists of a variety of processing units, specifically a Hankel matrix and gray relational analysis for correlation analysis, Gauss-Bernoulli RBMs for identifying the probability density of the data, Bernoulli-Bernoulli RBMs for processing binary data, and mixed pre-training and a Genetic Algorithm (GA) for parameter optimization. The proposed model in [77] is found to outperform other benchmarks with a MAPE of 6.07%. A DL model based on the Factored Conditional RBM (FCRBM) was first introduced for STLF in [78]. The proposed FCRBM algorithm conducts dimensionality reduction and hyperparameter optimization using modified mutual information and a genetic wind-driven algorithm [78]. A Conditional DBN (CDBN) has been proposed for detecting false data injection attacks in real time [79]. According to simulation experiments using the IEEE 118-bus and 300-bus test systems, the CDBN provides a powerful capability of learning high-level representations of raw input data [79]. From the obtained results, a general conclusion can be drawn that giving higher importance to demand-side management data and electricity price data enhances the prediction accuracy of the proposed model [79]. However, the proposed methodology requires four different kinds of data to operate, a condition difficult to satisfy in practical industrial applications [79].

E. AUTOENCODERS
An Autoencoder (AE) is an auto-associative feedforward Neural Network (NN) [80]. It can learn effective representations from raw input in an unsupervised manner [61]. The AE is essentially introduced for feature extraction or dimensionality reduction using three elements: an encoder, a coding layer, and a decoder, as illustrated in Fig. 7(a). Let x = (x_1, x_2, ..., x_N)^T ∈ R^m be the normalized unlabeled input vector and x̂ = (x̂_1, x̂_2, ..., x̂_N)^T its reconstruction. The AE maps the input as y^(k) = σ(W^(k) x^(k−1) + b^(k)) and x̂^(k) = σ(Ŵ^(k) y^(k−1) + b̂^(k)), where x^(k) and x^(k−1) denote the inputs of the k-th and (k−1)-th layers, respectively, W^(k) and Ŵ^(k) denote the weights, and b^(k) and b̂^(k) denote the biases. Classic AEs have several bottlenecks in terms of overfitting, poor generalization ability, and sensitivity to noise interference. To suppress the adverse effects of the auto-encoding architecture, several models have been proposed, including the Sparse AE (SAE) (Fig. 7(b)), Stacked AE (StAE), Variational AE (VAE), Stacked Denoising AE (SDAE), Stacked Contractive AE (SCAE), and Convolutional AE (ConvAE) [81]. The Stacked AE stacks multiple AEs layer by layer to find a better encoding-decoding scheme than a single AE by mapping the input vector toward a lower-dimensional manifold. The AE is trained by minimizing the squared reconstruction error L(x, x̂) = ½||x − x̂||². However, Stacked AEs are prone to overfitting, with a high risk of simply copying the input to the output without any feature learning.
To fill this gap, the SDAE has been proposed, injecting noise corruption into feature extraction [82]. The Denoising AE (DAE) is trained in an unsupervised, bottom-up manner to reconstruct the clean input x from a corrupted version of it. The SDAE thereby acquires a more flexible mapping than the classical AE. The VAE is an NN-based generative probabilistic graphical model. The core idea of the VAE is to compute an efficient Latent Space Regularization (LSR) to enable a generative process. Using variational inference for latent representation learning, the probability density distribution of the LSR of a VAE typically matches that of the training data much more closely than that of an original AE. Suppose that Z denotes the latent variable and q(Z|X) the variational approximation of the intractable posterior. The generator network approximates the generative process p_θ(X) = ∫ p_θ(X|Z) p(Z) dZ, as illustrated in Fig. 7(c).
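The basic encode-decode-reconstruct loop shared by all these AE variants can be sketched in a few lines; the layer sizes and the sigmoid activation are illustrative assumptions.

```python
import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

def autoencode(x, W, b, W_hat, b_hat):
    """Encoder y = sigma(W x + b), decoder x_hat = sigma(W_hat y + b_hat)."""
    y = sigma(W @ x + b)              # coding layer (lower-dimensional)
    x_hat = sigma(W_hat @ y + b_hat)  # reconstruction of the input
    return y, x_hat

def reconstruction_loss(x, x_hat):
    """Squared reconstruction error L(x, x_hat) = 0.5 * ||x - x_hat||^2."""
    return 0.5 * np.sum((x - x_hat) ** 2)

m, d = 6, 2  # input dimension, code dimension (bottleneck)
rng = np.random.default_rng(7)
W, W_hat = rng.normal(size=(d, m)), rng.normal(size=(m, d))
b, b_hat = np.zeros(d), np.zeros(m)
x = rng.uniform(size=m)               # normalized input in [0, 1]
y, x_hat = autoencode(x, W, b, W_hat, b_hat)
loss = reconstruction_loss(x, x_hat)
```

A denoising variant would corrupt x (e.g., with Gaussian noise) before encoding while still computing the loss against the clean x, which is what forces the code to capture structure rather than copy the input.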
Paper [83] used a stacked AE to extract relevant features, such as land use maps, for spatial load forecasting. A data-driven bottom-up spatial and temporal STLF approach was conducted and demonstrated its generalization potential on larger areas and diverse regions [83]. A Joint Latent VAR (JLVAR)-based monitoring method for gearbox failure detection of wind turbines has been proposed and proved feasible on Supervisory Control and Data Acquisition (SCADA) systems [84]. For electricity price forecasting, the authors of [85] proposed RS-SDAE, which combines Random Sample Consensus (RANSAC), Stochastic Neighbor Embedding (SNE), and an SDAE to obtain a robust representation of feature inputs. Due to this hybridization, however, the additional hyperparameters dramatically decrease the model's learning efficiency [85]. In [86], an unsupervised DL-based cyberattack detection system for transmission protective relays has been proposed using a 1-dimensional convolutional AE. A comparative study of AE and its variants is presented in Table 3.

GAN has gained tremendous interest as a mainstream super-resolution model. GAN uses a Generator Network (GN) and a Discriminator Network (DN) to create synthetic data that follows the same distribution as the original data set, as shown in Fig. 8(a) [92].
More specifically, the GN mimics the data distribution using noise vectors to confuse the discriminator in differentiating between fake and real images. The GN is trained to boost the probability of fooling the DN by making the new data indistinguishable from the real one. The DN, in turn, is trained to distinguish the fake images created by the generator from the original images, following a two-player zero-sum game [93]. GAN is frequently employed for Data Augmentation (DA). For example, paper [50] proposed a Signals Augmented Self-Taught Learning (SASLN) network for Fault Diagnosis of Wind turbine Generation (FDWG). Here, GAN is used for signal augmentation before feeding the SLN model [50]. A fusion of CNN and GAN, resulting in a Wasserstein GAN with Gradient Penalty (WGANGP) model, has been proposed to increase the size of training data for PVPF [92]. Thirty-three meteorological weather types were reclassified into ten weather types [92]. The WGANGP was used to synthesize new, realistic training samples by simulating input samples for improved training data sets. Despite its good generalization performance, the CNN-WGANGP may misclassify some confusing samples into a particular weather type, leading to lower prediction accuracy [92].
In [94], a Conditional Wasserstein GAN with gradient penalty (CWGAN-GP) has been introduced to model the uncertainties and variations of the load. However, the CWGAN-GP suffers from unexplained network behavior, with no specific rule for determining its inner mechanism [94]. A GAN-based super-resolution reconstruction method for low-frequency electrical measurement data has been proposed [95]. The proposed architecture employs a DRN-based generator and a CNN-based discriminator to provide high reconstruction accuracy at the cost of higher computational complexity [95]. Reference [96] employs a Wasserstein GAN with gradient penalty to capture the real distribution of electricity consumption data for electricity theft detection. Reference [97] employs GAN for power loss mitigation in active distribution networks. Despite the large popularity of GANs, their training and assessment remain challenging and unstable. Moreover, the applicability of GAN is largely limited to non-discrete data, such as computer vision and image recognition applications.
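The two-player zero-sum game between the GN and DN can be sketched on one-dimensional toy data; the linear generator, logistic discriminator, and all hyperparameters below are our own illustrative assumptions and not a faithful reproduction of any surveyed architecture:

```python
import numpy as np

rng = np.random.default_rng(2)

# "Real" data: samples from N(3, 1). The linear generator g(z) = gw*z + gb
# must learn to mimic this distribution from noise z ~ N(0, 1).
def real_batch(n):
    return rng.normal(3.0, 1.0, size=n)

gw, gb = 1.0, 0.0        # generator parameters
dw, db = 0.1, 0.0        # discriminator D(x) = sigmoid(dw*x + db)
sig = lambda t: 1.0 / (1.0 + np.exp(-t))
lr, n = 0.05, 64

for _ in range(800):
    # Discriminator ascent on log D(real) + log(1 - D(fake)).
    xr, z = real_batch(n), rng.normal(size=n)
    xf = gw * z + gb
    pr, pf = sig(dw * xr + db), sig(dw * xf + db)
    dw += lr * np.mean((1 - pr) * xr - pf * xf)
    db += lr * np.mean((1 - pr) - pf)
    # Generator ascent on log D(fake) (non-saturating loss).
    z = rng.normal(size=n)
    xf = gw * z + gb
    pf = sig(dw * xf + db)
    gw += lr * np.mean((1 - pf) * dw * z)
    gb += lr * np.mean((1 - pf) * dw)

fake_mean = float(np.mean(gw * rng.normal(size=5000) + gb))
```

After alternating updates, the mean of the generated samples drifts toward the mean of the real distribution, although, as noted above, such training remains unstable and sensitive to hyperparameters.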

2) DEEP REINFORCEMENT LEARNING
DRL is the combination of DNNs and RL [98]. The basic idea behind DRL consists of assigning rewards and punishments to an agent to shape its policies. The goal of DRL is to maximize the cumulative reward function through appropriate actions (a_i), as illustrated in Fig. 8(b). The DRL mechanism generates an autonomous agent that can navigate the search space and provide an optimal policy of actions. The DNN represents a large number of states (s_i) and approximates the action values to learn the best action choices over a set of states through interaction with the environment [99]. DRL can be classified into value-based models, such as Deep Q-Learning (DQL), Double DQL, and Dueling DQL, and policy-gradient-based models, such as Deep Deterministic Policy Gradient (DDPG) and Asynchronous Advantage Actor-Critic (A3C) [100], [101].
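As a minimal illustration of the value-based mechanism, the following tabular Q-learning sketch (a toy chain MDP of our own design, not a DQL with a DNN approximator) shows how action values are shaped by rewards through interaction with the environment:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy MDP: a 5-state chain. Action 1 moves right, action 0 moves left;
# reaching the last state yields reward 1 and ends the episode.
n_states, n_actions = 5, 2
gamma, alpha, eps = 0.9, 0.5, 0.3

def step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s2 == n_states - 1 else 0.0
    return s2, r, s2 == n_states - 1

Q = np.zeros((n_states, n_actions))
for _ in range(500):
    s, done, t = 0, False, 0
    while not done and t < 10_000:
        # eps-greedy action selection over the learned action values.
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
        s2, r, done = step(s, a)
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a').
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
        s, t = s2, t + 1

policy = np.argmax(Q, axis=1)
```

In DQL, the table Q would be replaced by a DNN that approximates Q(s, a) over large state spaces, but the reward-driven update rule is the same.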
The authors in [102] proposed a Multi-Agent DRL (MADRL) model for EV charging stations with Energy Storage Systems (ESSs) and photovoltaic (PV) systems. The cooperative MADRL model based on CommNet provides active and intelligent energy management while handling real-time dynamic data [102]. Paper [103] introduced a DRL-based cyber-attack recovery strategy for SG. The proposed DRL has been employed for re-closing tripped transmission lines at the optimal re-closing time [103]. Unfortunately, the proposed DRL may suffer from parameter instability during training [103]. In [104], the authors proposed a Multi-Agent DRL-based Volt-Var Optimization (VVO) framework for unbalanced distribution power systems. The actions are distributed among different agents to mitigate the action dimension of each agent [104]. The simulation results proved that DQL can be used with continuous actions [104]. A Deep Q-Network (DQN)-based approach has been used for accurate STLF [105]. In [106], a DQN-based cyber-physical vulnerability assessment approach has been introduced, which requires sufficient acquired power system data to correctly identify contingencies. From the previous studies, DRL models exhibit some bottlenecks, such as limited compatibility with continuous action spaces and slow policy convergence.

Beyond the cited DL methods, Capsule Networks (CN) and Deep Spiking Neural Networks (SNN) are employed for SG applications. CN is an advanced form of CNN comprising a group of vector neurons, a primary capsules layer, a convolution layer, and a digit capsules layer [107]. Capsules in CN identify spatial patterns between lower-level entities and apply a dynamic routing algorithm to recognize these relationships. In [108], a weight-shared capsule network is employed to further supplement the generalization capability of the original CN. Nevertheless, the proposed model depends on the quality of labeled data to perform fault diagnosis of machinery. Such data is difficult to acquire, especially when faults happen rarely.
SNN was introduced as one of the third-generation ANNs [109]. SNNs are considered efficient models for temporal coding, where neurons interact with other nodes through excitatory or inhibitory spikes. Furthermore, SNNs can support massive parallel processing using neuron clusters, which significantly accelerates execution time [110]. The forecasting system proposed in [110] associates an SNN with a group search optimizer for automatic hyperparameter tuning to perform probabilistic forecasting with excellent performance and a short training time of 1.31 seconds.

H. HYBRID MODELS-BASED
DL algorithms have their strengths and weaknesses in terms of hyperparameter settings, data exploration, and computational burden [9]. Table 4 reports the advantages and downsides of DL models. According to Table 4, it can be remarked that the reported downsides of DL methods impede them from becoming canonical approaches in the power industry. Each DL method has characteristics that make it better tailored to a specific SG application than the other methods. To overcome these shortcomings, hybrid models built from single DL models have been extensively proposed to tackle EPSs bottlenecks, as shown in Fig. 10. To track the share of DL method usage for SG systems over the period 01/01/2015-21/01/2021, several popular DL models were coupled with ''Smart grid'' and searched on the Google Scholar search engine. As a result, a pictorial representation of the most popular DL techniques correlated to SG, with their frequency of use, is illustrated in Fig. 9.
We can see that CNN and RNN models have higher applicability and universality potential than the newly developed models. Paper [111] proposed a hybrid method for wind energy forecasting that combines DBN and Support Vector Regression (SVR). DBN is a supervised DL technique having l layers and parabolic hidden cells [111]. The proposed model outstrips the accuracy of the individual models (SVR and DBN); however, it is time-consuming [111]. The authors in [112] combine Long Short-Term Memory (LSTM) and CNN for power demand forecasting. In [92], the authors used GAN and CNN for day-ahead PVPF: a Wasserstein GAN with Gradient Penalty (WGANGP) is proposed to classify thirty-three meteorological weather types and generate synthetic training data. Authors in [113] employ CNN and GRU models to learn spatio-temporal representations for probabilistic wind power forecasting. From the loss-versus-epoch variation, it is worth noting that the model converged within the first ten iterations, which may lead to serious overfitting problems [113]. In [114], a cybersecurity diagnosis and localization method using a hybridization of AE, RNN, LSTM, and DNN has been proposed. Despite the model's universality potential to cope with other networked industrial control systems, it has been found that this hybrid design cannot detect unknown attack/fault types [114]. A novel Graph Neural Network (GNN)-based framework combining a Graph Convolutional Network (GCN) and an LSTM model has been proposed for multi-task transient stability classification [115]. The hybrid GNN design was found able to effectively analyze complex spatio-temporal patterns in the IEEE 39-bus and IEEE 300-bus systems [115]. The fly in the ointment is that GNN-based models must compute not only the input data features but also topological information, which can cause poor generalization potential [115].

IV. POTENTIAL DL ENABLING TECHNOLOGIES
The enabling technologies behind the deployment of DL are tackled in this section, including DDL, FL, EI, DTL, BDDL, and IL.

A. DISTRIBUTED DEEP LEARNING
DL has achieved a quantum leap by making use of complex algorithms to reach state-of-the-art performance. However, training large DL models may require an enormous number of iterations, heavy hyperparameter optimization, and considerable computing time and burden. With the existing non-distributed computing approach, the calculation of billions of parameters for DL models can be terribly slow and expensive.
Here, DDL comes to fill this gap using High-Performance Computing (HPC) for training large networks [128]. The HPC strategy employs parallel programming paradigms with multiple distributed data centers or edge devices to significantly alleviate the impractical computational burden and afford fast turnaround time for model training [129], [130]. Thus, distributing the heavy-duty analytics and computations makes DL more productive and able to exhibit the desired behavior whilst lessening the burden on the cloud. Multi-CPU clusters and Tensor Processing Units (TPUs) provide the required tools for scaling up training to achieve effective computing performance.
Several DL libraries that allow multi-node computing can be used, including Horovod and distributed TensorFlow. Parallel Computing (PC) takes the form of Data Parallelism (DP) (Fig. 11(a)), Model Parallelism (MP) (Fig. 11(b)), or Layer Pipelining (LP) (Fig. 11(c)). DP divides the data into partitions across nodes, while MP shares the calculations of different parts of the DL model among different computers. Compared to DP, MP remains an ambiguous concept in that the model partitioning method is not fixed on a clear basis. In [131], the authors employed a distributed deep AE for Phasor Measurement Unit (PMU) detection. Paper [132] employed distributed DRL for load scheduling in residential SG. In [133], the authors propose a cloud-based DDL framework for phishing and botnet attack detection and mitigation. The proposed solution is composed of security mechanisms working cooperatively: a Distributed CNN (DCNN) method and a cloud-based temporal LSTM network model. In [134], a Double DQL-based distributed operation scheme has been proposed for managing a community battery energy storage system. A DDL-based IoT/Fog network attack detection system has been introduced in [135]. The proposed system proves that distributed attack detection using DDL can achieve better accuracy than centralized algorithms through parameter sharing [135].
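A minimal sketch of the DP scheme (illustrative linear model and hyperparameters assumed by us) splits each batch across workers, computes local gradients, and averages them, mimicking the all-reduce step of frameworks such as Horovod:

```python
import numpy as np

rng = np.random.default_rng(4)

# Linear model y = X w with squared loss; gradients computed per "worker" shard.
X = rng.normal(size=(128, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

def local_grad(Xs, ys, w):
    """Gradient of 0.5*mean((X w - y)^2) on one worker's data shard."""
    return Xs.T @ (Xs @ w - ys) / len(Xs)

w = np.zeros(3)
n_workers, lr = 4, 0.1
for _ in range(300):
    shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
    grads = [local_grad(Xs, ys, w) for Xs, ys in shards]
    # All-reduce step: average the worker gradients, then update the shared model.
    w -= lr * np.mean(grads, axis=0)
```

With equal shard sizes, the averaged gradient equals the full-batch gradient, so the data-parallel update reproduces single-node training while spreading the computation.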

B. FEDERATED LEARNING
The FL method employs a distributed training process over end devices equipped with AI chips and siloed data centers at the edge [136]. Thus, IoT devices perform model training locally with their own data and storage capabilities instead of transferring data to central computing facilities. FL provides a suitable solution to lessen communication costs, privacy concerns, and the adverse effects of data centralization [137]. This method provides a privacy-preserving mechanism, which can be extremely beneficial, especially for highly distributed systems.
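The FL workflow can be sketched with a minimal FedAvg-style loop (toy linear task and all hyperparameters are our own assumptions): clients train locally on private data, and only model weights, never raw data, reach the server:

```python
import numpy as np

rng = np.random.default_rng(5)

# Three clients, each holding private data generated by the same linear model.
w_true = np.array([2.0, -1.0])
clients = []
for _ in range(3):
    Xc = rng.normal(size=(40, 2))
    clients.append((Xc, Xc @ w_true))

def local_update(w, Xc, yc, lr=0.05, epochs=5):
    """Client-side training on private data; only the weights leave the device."""
    w = w.copy()
    for _ in range(epochs):
        w -= lr * Xc.T @ (Xc @ w - yc) / len(Xc)
    return w

w_global = np.zeros(2)
for _ in range(50):                           # communication rounds
    local = [local_update(w_global, Xc, yc) for Xc, yc in clients]
    sizes = np.array([len(Xc) for Xc, _ in clients], dtype=float)
    # FedAvg: the server aggregates weight vectors, weighted by client data size.
    w_global = np.average(local, axis=0, weights=sizes)
```

Each communication round exchanges only the model parameters, which is precisely why the limited wireless bandwidth and client dropout discussed below become the binding constraints.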
Smart city sensing makes use of three classes of FL, namely horizontal, vertical, and transfer FL, as described in [136]. In [138], probabilistic solar irradiation forecasting based on a Bayesian LSTM-NN has been proposed with the FL scheme shown in Fig. 12. In [139], an FL mechanism based on DRL has been proposed for EM of multiple smart houses. The weakness of this study is that the proposed algorithm may overestimate the state-action values [139]. Paper [140] sheds light on DeepFed, a federated DL scheme for intrusion detection. The proposed model proved that FL-based solutions can trade off model performance against privacy concerns. Reference [141] proposed a GAN-based synthetic feeder generation mechanism to ingest power system distribution feeder models using a device-as-node representation. The disadvantage is that this model does not reduce the dimension of the data, so the computational burden of the proposed architecture is relatively high [141]. From these studies, the challenges can be summarized in two major points: 1) the limited wireless communication bandwidth of current IoT devices, which can make collaborative learning heavy; 2) the FL process relies on collaborative computing from IoT devices that must fully participate in the learning process. In the case of a sudden interruption before the learning process converges, the disconnected devices could significantly degrade the learning quality.

C. EDGE INTELLIGENCE
DL workloads have witnessed significant growth with the wide availability of voluminous data and hardware and software improvements. It is estimated that a single SG system produces 22 gigabytes of data per day for two million users [142]. On-board processing and Cloud Computing (CC) will therefore be insufficient to store and process the petabytes of data from multiple embedded power generator sensors and their communication with home sensors and appliances. To remedy these weaknesses, traditional cloud-based computing has been replaced by Edge Computing (EC) and Fog Computing (FC), leading to Edge Intelligence (EI) [143]-[145]. EI is the merger of EC and AI, pushing both data and intelligence to analytic platforms [146]. Big tech companies have put forth leading projects to demonstrate the advantages of EC in paving the last mile of DL [143]. EI coordinates a set of connected edge IoT devices used for data collection, caching, processing, and analysis based on AI [147]. DL has the required potential to automatically investigate the data from edge devices for quick, real-time predictive decision-making [148]. The authors in [149] introduced an edge-cloud integrated solution using RL; the proposed solution tracks demand response with high accuracy. Fig. 13 presents the most popular DL libraries for EC [116]. Furthermore, Table 5 provides a comparative study of the computing types most used in SG engineering research.

D. DEEP TRANSFER LEARNING
Despite the high potential of DL in representation learning, its performance dramatically degrades with small sample sizes. Furthermore, DL models have limited use as they require extensive measurement and recording efforts to create a huge number of learning examples. In particular, EPSs may not share the same dimensional space or the same distribution, which makes achieving the desired accuracy and turnaround time a challenging problem [162]. In other words, DL techniques are expected to provide scalable and generalized solutions across a diverse range of mainstream EPSs applications [163], [164]. In [15], the authors claimed that generalization ability is a severe dilemma for the employment of DL techniques in wind and solar energy prediction. Reference [165] reports that one of the common issues of prediction techniques lies in poor model universality, which can be attributed to limited data sources. For instance, the authors in [166] report that the lack of data is one of the common failure reasons in load forecasting. To bridge this gap, Deep Transfer Learning (DTL) is well positioned to alleviate expensive data-labeling efforts using cross-domain datasets [167], [168]. This learning framework operates by transferring the knowledge gained by a DNN model in handling one task (the source problem) to solve another related task, as shown in Fig. 14. Given a source domain D_s with a learning task T_s and a target domain D_t with a learning task T_t, DTL aims to achieve better learning performance of the target predictive nonlinear function r_t(·) in D_t using the knowledge in D_s and T_s, where D_s ≠ D_t or T_s ≠ T_t. DTL methods can be classified into four classes based on the source and target domains and tasks: (i) mapping transfer, (ii) instance transfer, (iii) network transfer, and (iv) adversarial transfer.
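A minimal sketch of network-transfer DTL under our own toy assumptions (two related linear tasks standing in for D_s and D_t): weights pre-trained on an abundant source task initialize fine-tuning on a scarce target task:

```python
import numpy as np

rng = np.random.default_rng(6)

# Related source and target tasks: the target weights are a small perturbation
# of the source weights (the "reasonable similarity" that DTL requires).
w_src_true = rng.normal(size=5)
w_tgt_true = w_src_true + 0.1 * rng.normal(size=5)

Xs = rng.normal(size=(1000, 5)); ys = Xs @ w_src_true   # abundant source data
Xt = rng.normal(size=(10, 5));   yt = Xt @ w_tgt_true   # scarce target data

def fit(X, y, w0, lr=0.05, steps=20):
    """Gradient descent on squared loss from a given initialization."""
    w = w0.copy()
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(X)
    return w

w_src = fit(Xs, ys, np.zeros(5), steps=500)     # pre-train on the source domain
w_transfer = fit(Xt, yt, w_src)                 # fine-tune from transferred weights
w_scratch = fit(Xt, yt, np.zeros(5))            # same budget, cold start

Xe = rng.normal(size=(500, 5))                  # held-out target evaluation
err_t = np.mean((Xe @ w_transfer - Xe @ w_tgt_true) ** 2)
err_s = np.mean((Xe @ w_scratch - Xe @ w_tgt_true) ** 2)
```

With the same fine-tuning budget, the transferred initialization yields a lower target error than training from scratch, provided the tasks share the reasonable similarity noted in the studies below.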
In [169], the authors address PVPF using DTL and an LSTM network. From the simulation results, DTL can be unnecessary or inefficient when the data in the target domain are already satisfactory [169]. In [170], a model-transfer DL has been proposed for PV systems. The limitation of this approach is that the tasks require some reasonable similarity for DTL to succeed [169]. Paper [50] employed an SLN, a particular form of DTL, for FDWG; to work effectively, DTL requires datasets with a minimum reasonable similarity [50]. Paper [166] employed DTL and meta-learning using deep neural networks. The proposed solution successfully realizes model extension in the new target domain [166]. From the authors' work, the selection of the optimal configuration for the prediction model is a complex problem that reduces the practical adoption of this solution in real-world problems with highly complex electricity-consumption data patterns [166].

E. BIG DATA DEEP LEARNING
Enormous volumes of information are rapidly accumulating in PV systems with the persistent use of sensors, Distributed Computing (DC), and advanced information and communication systems. With this sheer size of data, DL methods have become dramatically more prominent for their unparalleled potential to efficiently learn discriminant features using BD technologies. BD computing falls into two categories: (1) batch processing of massive on-disk data with no time constraints, and (2) streaming processing of in-memory data in real time or within a short period [171]. Several computing frameworks have been proposed for BD, such as Hadoop, ComMapReduce, Dryad, Piccolo, and Spark; such systems have the capability to scale up DL.
Data mining can truly be beneficial for enhancing PVPF [171]. Based on this reference, the weather information for the actual day is irrelevant for the prediction system. Paper [172] presents an accurate PVPF method using Apache Flume, Spark, Hadoop Distributed File System (HDFS), and Hive. In [173], the authors introduced Sun4Cast TM, a BD-based renewable power forecasting solution. Paper [174] proposed a Spark-based fuzzy partitioning LSTM model for PVPF. The objective of this study is to estimate the temporal variability of the PV production of Switzerland [174].

F. INCREMENTAL LEARNING
In an ever-changing environment, Incremental Learning (IL) helps in learning from data streams. IL aims to accommodate new patterns without compromising historical knowledge [175]. Typically, IL operates on new data without forgetting prior knowledge, which raises the stability-plasticity trade-off: plasticity allows the model to acquire new information continuously, whereas stability means the updated model maintains the previously acquired knowledge [176]. Paper [176] introduced an incremental deep convolutional computation model for BD; for processing large quantities of online data, the incremental knowledge can be saved into idle neurons with a large probability [176]. In [175], an adaptive incremental DL scheme has been proposed to tackle online hyperparameter tuning while maintaining highly cost-efficient processing. However, the proposed algorithm lacks parameter convergence guarantees [175].
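The stability-plasticity trade-off can be illustrated with a minimal streaming sketch of our own design: an exactly incremental running mean (plasticity) combined with a reservoir-sampled replay buffer that retains an unbiased sample of historical data (stability):

```python
import numpy as np

rng = np.random.default_rng(7)

# A running-mean model updated incrementally as load readings stream in,
# plus a small replay buffer that preserves a sample of older observations.
class IncrementalMean:
    def __init__(self, buffer_size=50):
        self.n, self.mean = 0, 0.0
        self.buffer, self.size = [], buffer_size

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n   # exact incremental mean update
        if len(self.buffer) < self.size:        # reservoir sampling keeps an
            self.buffer.append(x)               # unbiased sample of the history
        else:
            j = rng.integers(self.n)
            if j < self.size:
                self.buffer[j] = x

model = IncrementalMean()
stream = rng.normal(10.0, 1.0, size=2000)
for x in stream:
    model.update(x)
```

Each reading is processed once and discarded; the replay buffer is what an IL scheme could revisit to avoid catastrophic forgetting when the stream's distribution drifts.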

V. CHALLENGES AND FUTURE RESEARCH DIRECTIONS
The development of DL for SG currently presents some opportunities as well as some open problems. To the best of our knowledge, the following concrete points can be mentioned:

A. BETTER PERFORMANCE WITH SMALL DATA
Typically, DL models require large amounts of data for learning to be effective. For particular SG applications, such large training datasets are not publicly available, are difficult to collect, costly, and possibly problematic due to privacy regulations [177]. Although data augmentation and large synthetic training datasets can partially compensate for the lack of large labeled datasets, it remains difficult to fully satisfy training requirements with only hundreds or thousands, if not fewer, high-quality data points. Supervised training of DL models with small datasets is prone to overfitting [178]. DTL and IL can provide additional information to enhance the data representation and learning process. Paper [179] proposed a VAE-based DGM model to overcome the limited data set in bearing fault diagnosis. In a similar application, the authors in [180] proposed a stacked SAE for gearbox fault diagnosis, which achieves state-of-the-art performance with little labeled data. Few-Shot Learning (FSL) has also been proposed to tackle data scarcity in intelligent fault diagnosis [181].
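As a simple illustration of data augmentation for small SG datasets (the jitter-and-scaling scheme and the load profiles below are our own toy assumptions, not a method from the cited works):

```python
import numpy as np

rng = np.random.default_rng(8)

# A small set of daily load profiles (24 hourly readings each).
originals = rng.normal(100.0, 10.0, size=(5, 24))

def augment(profiles, n_copies, sigma=0.02, rng=rng):
    """Jitter + scaling augmentation: each copy is a noisy, rescaled profile."""
    out = []
    for _ in range(n_copies):
        scale = 1.0 + sigma * rng.normal(size=(len(profiles), 1))   # per-profile scaling
        noise = sigma * profiles.std() * rng.normal(size=profiles.shape)  # additive jitter
        out.append(profiles * scale + noise)
    return np.concatenate(out, axis=0)

augmented = augment(originals, n_copies=20)   # 5 profiles -> 100 training samples
```

Such label-preserving perturbations enlarge a scarce training set cheaply, though, as noted above, they only partially substitute for genuinely diverse data.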

B. TAILORING QUANTUM DEEP LEARNING IN SMART GRID
Quantum Deep Learning (QDL) can provide huge breakthroughs for power systems [182]. Despite its inherent potential, there is little evidence of its practical application in power systems. Paper [183] reveals that the electric power grid can significantly make use of quantum computing to tackle EPSs challenges. QDL employs big data to speed up training and make the computation process more efficient [184]. Despite substantial efforts in industry and academia, no error-corrected qubits have been built so far. Therefore, QDL is still far from actual EPSs applications.

C. DEMOCRATIZING DL THROUGH GOVERNMENTAL POLICIES
In the last few years, many countries, such as the USA and China, have prepared strategic roadmaps to speed up AI diffusion on a larger scale, especially for the energy market [185]. As a result of the extensive investments in this high technological expertise, a significant increase in the share of scientific papers in the AI-based SG landscape has been observed. Despite the recent waves of AI democratization at an international level, the adoption of such high technology remains at a low level of spread in developed countries. This is due to the high financial cost of redefining the traditional grid infrastructure and adding the necessary flexibility to be perfectly tailored for the SG paradigm.

D. PRIVACY AND SECURITY ISSUES IN DL
Cyber-physical SG systems have a massive data flow, which raises the question of how the security of the grid can be ensured in the long term. In any SG operation, the network stores new data via cloud storage. The size of the data generated from grid operations is huge and increases each day, on the order of thousands of zettabytes (10^21 bytes) [186]. In dynamic grid operations, DL methods are prone to cyber attacks resulting in wrong predictions and control failures. The attacks that threaten DL privacy fall into Model Extraction Attacks (MEA) (duplicating model parameters) and Model Inversion Attacks (MIA) (stealing sensitive information) [187]. DL's privacy and security issues against adversarial and poisoning attacks are vital bottlenecks that are scarcely studied and require deep investigation.

E. LACK OF SKILLED MANPOWER
DL in its current form is still a relatively new and complex technology. Thus, there is a severe shortage of competent Data Scientists (DS) with the required DL skill sets to turn data into actionable insights. In 2018, it was reported that there was a 50 percent shortage in DS supply versus demand [188]. According to reference [189], 11.5 million data science jobs will be created by 2026. According to the World Economic Forum (WEF), 20% of market jobs could bear the fingerprint of data science, creating 133 million jobs by 2022 [190]. Consequently, skilled workforce shortages are looming on the horizon and becoming a serious constraint, especially with the increased applicability of data science in many sectors such as the energy industry [191], [192]. The shortfall of skilled DS manpower is reflected in the attractive annual salaries, which can reach $135,776 per annum for entry-level DS in the United States [188]. According to McKinsey research, data science skills are something the job market worldwide desperately needs [188]. Data scientists often need a combination of domain experience and in-depth knowledge of science, technology, and mathematics; there is no denying that mastering each of these domains is somewhat elusive. Beyond promoting data science learning with university scholarships, free online courses, and incentives, job recruiters and top tech companies suggest job conversion from other fields towards data science to meet the increased demand [193].

Further, DL systems are intricate black boxes making decisions that are not easily interpretable from a human perspective. To cover this gap, several researchers have proposed explainable DL, i.e., an understandable internal DL architecture for human ease [194]. For example, in [195], an explainable DNN (xDNN) program has been introduced to facilitate DL interpretability and make it easy to learn for practitioners and engineers [196].
Automatic DL is also considered as a promising solution to implement DL models without complexities [197]- [199].

F. COMMUNICATION INFRASTRUCTURES, PROTOCOLS, AND INVESTMENTS
With decision-makers increasingly seeing DL as a key enabler of the futuristic SG, a fear of missing out on this huge DL potential is spreading globally due to poor communication infrastructure [200]. In the last few years, numerous nations, such as the USA and China, have prepared strategic roadmaps to speed up DL diffusion on a larger scale through investment, protocols, advanced communication infrastructure, and risk management. The use of DL in government must take into account privacy and security, compatibility with legacy systems, and evolving workloads. The global data science market is expected to grow at a compound annual growth rate of 42.2% from 2020 to 2027 to reach USD 733.7 billion [201]. As DL's applicability to the next generation of technology grows, many decision-makers worry that they will be left behind and not share in the gains. DL workloads have specific requirements from the underlying infrastructure, which can be summarized in three dimensions: scalability, portability, and timing [202]. Scalability reflects the ability to support a massive amount of data [202], [203]. Portability refers to the flexibility of the workload to be transportable across core, edge, and endpoint deployments [204]. Timing describes analyzing streaming databases in a real-time or near-real-time manner by involving advanced computing technologies such as DL accelerators [205], [206]. Despite the recent waves of DL democratization at an international level, the adoption of the required technology remains at a low level of spread in several developed countries. This is due to the high financial cost of redefining the traditional power grid and adding the necessary flexibility to the communication infrastructure to be perfectly tailored for DL techniques.

VI. CONCLUSION AND FUTURE DIRECTIONS
In the last decade, Deep Learning (DL) has become a promising dawn for Smart Grids (SG). Inspired by recent computational neuroscience discoveries, this review has comprehensively discussed the mainstream DL approaches in power system applications. Furthermore, several state-of-the-art paradigms are highlighted, including distributed DL, edge intelligence, and federated learning. DL's major applications in SG include energy forecasting, fault detection, cybersecurity awareness, prediction, and optimization to meet the technical requirements for safe and secure power system operations. This paper makes the following contributions:
• This paper has investigated the research landscape of DL approaches applied in the SG paradigm and analyzed their accuracy, feasible scenarios, and limitations.
• This review has pointed out DL's computational limits for power systems and how to lessen the computational power requirements by introducing DL enabling technologies such as distributed DL and federated learning.
• Finally, the emerging challenges and requirements and future directions of SG and DL models have been addressed.
Many of the reported research works on SG applications are still in an early stage of development. Some DL architectures, such as long short-term memory and deep convolutional neural networks, have been heavily deployed to resolve dissimilar issues across a wide range of power engineering applications, including state-of-charge prediction, energy optimization, and power grid resiliency within the SG paradigm. Despite being recently introduced, generative adversarial networks and deep reinforcement learning have been extensively included in multiple research works as efficient tools for modeling multifaceted problems considered vital to SG efficiency. Critical challenges and future research trends were depicted to draw fruitful discussions of the limitations of DL methodologies in the SG domain. In summary, DL enabling techniques highly require advanced communication and computation to speed up their practical adoption in SG systems rather than remaining at the conceptual level. The future research directions investigated here require interdisciplinary work to overcome the emerging bottlenecks for a flourishing SG market. Future work will give particular focus to Explainable DL (XDL) algorithms for SG systems to shed light on the ample opportunities of XDL for transparent and understandable learning paradigms.
He has worked at many universities in many countries including Poland, Palestine, USA, Germany, and Qatar. Since 2006, he has been with Texas A&M University at Qatar, where he has served for five years as the Chair for the Electrical and Computer Engineering Program. He has been also serving as the Managing Director for the Smart Grid Center. He has published more than 400 journal and conference papers, five books and six book chapters. His main research interests include electric drives, power electronic converters, renewable energy, and smart grid. He was the recipient of many national and international awards and recognitions, the American Fulbright Scholarship, the German Alexander von Humboldt Fellowship, and many others. He has supervised many research projects on smart grid, power electronics converters, and renewable energy systems. He has worked in industry for more than 12 years, as an Engineering Team Leader, a Senior Electrical Engineer, and an Electrical Design Engineer, on various electrical engineering projects. He is currently an Assistant Research Scientist with the Department of Electrical and Computer Engineering, Texas A&M University at Qatar. He has published more than 95 journal and conference papers. His research interests include electrical machines, power systems, smart grid, big data, energy management systems, reliability of power grids and electric machinery, fault detection, and condition monitoring and development of fault-tolerant systems. He has also participated and led several scientific projects over the last eight years. He has successfully realized many potential research projects. He is also a member of The Institution of Engineering and Technology (IET) and the Smart Grid Center-Extension in Qatar (SGC-Q).
INES CHIHI received the Ph.D. and University Habilitation degrees from the National Engineering School of Tunis, Tunisia (ENIT), in 2013 and 2019, respectively. She is currently an Associate Professor of automation and industrial computing with the National Engineering School of Bizerta, Tunisia (ENIB) and a member of the Laboratory of Energy Applications and Renewable Energy Efficiency (LAPER), University Tunis El Manar. Her research interests include intelligent modeling, control, and fault detection of complex systems with unpredictable behaviors. She has several international publications and is a member of several scientific boards and conference organizing committees. She has been a visiting professor at several universities and prestigious international organizations, has chaired many international conferences, was a regular reviewer for journals and international research projects, and has been an invited professor at many institutes.