Deep Learning-Based Short-Term Load Forecasting Approach in Smart Grid With Clustering and Consumption Pattern Recognition

Different aggregation levels of the electric grid’s big data can be helpful to develop highly accurate deep learning models for Short-term Load Forecasting (STLF) in electrical networks. Whilst different models are proposed for STLF, they are based on small historical datasets and are not scalable to process large amounts of big data as energy consumption data grow exponentially in large electric distribution networks. This paper proposes a novel hybrid clustering-based deep learning approach for STLF at the distribution transformers’ level with enhanced scalability. It investigates the gain in training time and the performance in terms of accuracy when clustering-based deep learning modeling is employed for STLF. A k-Medoid based algorithm is employed for clustering whereas the forecasting models are generated for different clusters of load profiles. The clustering of the distribution transformers is based on the similarity in energy consumption profile. This approach reduces the training time since it minimizes the number of models required for many distribution transformers. The developed deep neural network consists of six layers and employs Adam optimization using the TensorFlow framework. The STLF is a day-ahead hourly horizon forecasting. The accuracy of the proposed modeling is tested on a 1,000-transformer substation subset of the Spanish distribution electrical network data containing more than 24 million load records. The results reveal that the proposed model has superior performance when compared to the state-of-the-art STLF methodologies. The proposed approach delivers an improvement of around 44% in training time while maintaining accuracy using single-core processing as compared to non-clustering models.


I. INTRODUCTION
The technological advancement in the smart grid has the goal to optimally serve the electric power generation, transmission, and distribution [1]. For that, large amounts of data started to be collected from different grid sources with the intention of being utilized in various aspects such as energy forecasting [2], load analysis [3], asset management [4], customer segmentation [5], demand response management [6], energy efficiency analysis [7], anomaly detection [8], energy trading and marketing [9], etc. The development of information technology, two-way communication system, and customer engagements will significantly increase the amount of generated and collected data of the grid [10]. The advancement of sensors has penetrated the electrical systems leading the way for smarter grids that use smart meters [11]. The massive amount of data collected by various sensors and smart meters are of high velocity, variety, veracity, value, and volume, hence satisfy all the big data characteristics [12], [13].
The associate editor coordinating the review of this manuscript and approving it for publication was Xiaochun Cheng.
The rising number of installed smart meters allows for the collection of big data corresponding to consumers' end devices. The smart meter big data, representing the customer energy consumption behavior with the granularity of the household level, enable the electrical utilities to perform capacity planning, capacity building, and operations. The integration of the smart meters' capability with the communication infrastructure in smart grids enhances the protection, reliability, efficiency, and safety of the energy supply to the consumers. The collected big data have been aggregated to different levels to perform load forecasting. For aggregated feeder level forecasting, the bottom-up approach is usually implemented. In such a way, the household level consumption data are aggregated to the feeder level and then the training is performed at the feeder level. Similarly, the data at the feeder level can be aggregated to the level of the distribution transformers, while several distribution transformers could be aggregated to the level of substation and so on which helps in performing load forecasting at the needed level. The electric utilities rely on short-term forecasts at the distribution feeder and the transformer level to support peak planning and grid operation.
In this work, load forecasting is performed with the consumption data at the level of distribution transformers. The lead time of the short-term load forecasting is one day ahead, and the forecasting horizon is hourly. The hourly energy consumption prediction performed one day-ahead enables the utilities to plan and strategically structure their power system operations. Consequently, peak shaving can be structurally planned and achieved with the usage of energy storage systems and dynamic demand response units in place [14]. Load forecasting enables the electric energy utilities to plan ahead, identify the regions with high load demand, match the volatile energy demand by changing the generation capacity, reduce generation cost, regulate energy prices, and manage scheduling. Accurate load forecasting can also benefit the energy management systems to simplify control algorithms with forecasted energy signals as inputs [15].
The energy consumption may vary from one location to another owing to different weather and climate conditions. And for the same reason, the energy demand may vary on different days of the week and at different times of the day. Many researchers have been interested in grouping the different conditions or different locations based on the similarities between the available features of the data in order to reduce the number of forecasting models required for predictions [16], [17]. The clustering techniques intrigue researchers to improve the load forecasting methodology and to enhance accuracy. The current work uses a clustering technique for a day-ahead hourly load forecasting application with the aim that the clusters and forecasting performance do not get negatively affected by the presence of outliers in the energy consumption values.
Typical methodologies employed for energy or load forecasting include time-series [18] and machine learning modeling [19]-[21]. There are two methods of machine learning modeling, namely, supervised [22] and unsupervised modeling [23]. These methodologies have mostly been applied to small historical datasets, and it is not clear how they can be applied to the growing energy data in the era of the smart grid. The challenge is to effectively process big data from the smart grid and integrate indirect data sets, including customer information, weather data, etc., into load forecasting applications. In this work, a novel method combining a supervised deep learning model with an unsupervised machine learning technique is proposed. Firstly, to incorporate the effect of time and date on load, the past energy consumption values, termed lag hour values, are used as features. This is followed by the pre-processing and cleaning of data. Secondly, the k-Medoids clustering technique is employed to group the transformers based on similarity in the energy consumption patterns of customers at the distribution transformers level. The clustering technique is employed to enhance the scalability of the approach. Finally, deep learning models, including Deep Neural Networks (DNN) and Long Short-Term Memory (LSTM), are trained and used to generate predictions.
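As a minimal sketch of how lag hour values could be turned into features, the following assumes a plain list of hourly readings and a window of 24 lags. The function name `make_lag_features` and the window length are illustrative assumptions; the number of lags actually used is not stated here.

```python
def make_lag_features(load, n_lags=24):
    """Build (features, target) pairs from an hourly load series.

    Each feature row holds the n_lags readings that precede the target hour,
    ordered from oldest to newest. n_lags=24 is an illustrative assumption.
    """
    rows, targets = [], []
    for t in range(n_lags, len(load)):
        rows.append(load[t - n_lags:t])
        targets.append(load[t])
    return rows, targets


# Toy series: 48 hourly readings with a repeating daily shape.
series = [float(h % 24) for h in range(48)]
X, y = make_lag_features(series, n_lags=24)
```

Each row of `X` can then be fed to a supervised model together with the matching entry of `y`.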
The load forecasting at the transformer level provides a pattern of the estimated load demand at the distribution network level rather than the feeder or household level. Distribution transformer load forecasting can contribute to efficient demand response management, generation scheduling, and can help in the reduction of losses. The motivation of this paper is to contribute to addressing the problem of timely and accurate short-term load forecasting in large electrical distribution networks.
To the best of our knowledge, there is no previous work that proposed a combinational hybrid methodology utilizing k-Medoids clustering algorithm and deep learning models for load forecasting. This is the first paper that proposes the use of a clustering algorithm, insensitive to the presence of outliers and solely based on energy consumption patterns, for load forecasting application in smart grids.
The key contributions of this paper are summarized as follows.
1) A novel, highly accurate hybrid forecasting approach based on clustering techniques and deep learning models is proposed. The clustering technique is aimed at enhancing the scalability of the approach and its capability to analyze big data. First, the approach clusters the distribution transformers based on the profile of energy consumption at the aggregated level. Then, the forecasting models are developed within each cluster utilizing deep learning.
2) A pattern-based similarity metric utilizing pairwise Minkowski similarities is proposed to determine the transformers that can be clustered based on daily energy consumption patterns. This simplifies the determination of clusters that better represent the load profiles of transformers within them.
3) The number of clusters is optimally selected in such a way that the overall within-cluster error for all the clusters is minimized. The optimization of the sum of square errors on all transformers is performed within constraints to generate the deep learning models for each cluster.
4) Unlike conventional clustering methods, this work employs a robust clustering algorithm that is insensitive to the presence of outliers.
5) The multi-stage methodology reduces the number of forecasting models required for predictions of energy consumption in an electric network. Hence, the methodology can be scaled to large electric networks and big data. The proposed methodology thus addresses the large-scale problem, which is significant since real-world data are usually large-scale.
6) We investigate the performance of the proposed scheme using real-world data and show that a gain of 44% in training time is achieved over existing schemes whilst maintaining the forecasting accuracies.
The rest of the paper is organized as follows: in Section II, the paper presents an overview of the load forecasting approaches used in the literature considering the aggregation, profiling, and clustering of the energy consumers. Section III presents the different aspects of our proposed methodology and an overview of the load forecasting approach implemented in this work while Section IV summarizes the results of the proposed clustering approach. Finally, Section V presents conclusions and future work.

II. RELATED WORK
In recent years, many researchers have invested their efforts to develop highly accurate forecasting models for energy consumption. Also, many of the presented methodologies are based on clustering using different features and conditions. In this section, the review of the proposed methodologies in the literature is presented.
Reference [17] proposed a day-ahead forecasting algorithm that uses load fluctuations and feature importance to cluster different customers at the distribution level. Crow search algorithm was utilized to determine the initialization conditions to avoid local minima convergence in the K-means clustering method and finally, an ensemble random forest model was generated to realize the day-ahead forecasting. The authors reported the lowest Mean Absolute Percentage Error (MAPE) of 1.633% for the random forest model and showed that the model performs better compared to Extreme Learning Machine (ELM), Neural Networks (NN), and Support Vector Machines (SVM). Their methodology benefited from the clustering of the 24 hours of a day into different clusters based on the fluctuation of energy consumption. Although the employed clustering method solves the issues of criteria for selection and initialization in the k-means algorithm, there is a scope of improvement in the Crow search-based k-means clustering algorithm when faced with high multi-modal peaks in the data formulations.
In [24], the authors proposed a long-term energy forecasting methodology that utilizes the spatial clustering algorithm of Density-Based Spatial Clustering of Applications with Noise (DBSCAN) to predict year-ahead load values for power system planning. The density-based clustering technique benefits from its inherent ability to deal effectively with the noise in the data by eliminating the outliers. Similar sub-zones are clustered using DBSCAN based on the features of historical yearly energy consumption profiles, land use types, and geographic information. Eventually, Non-Linear Auto-Regressive (NAR) neural network models yield the values of the predicted load. They reported that their proposed model works better when compared to existing models such as exponential smoothing, grey theory, and Linear Regression (LR). However, short-term load forecasting is not addressed using this methodology.
In [25], the author proposed a hybrid model based on a Kalman filter, an artificial neural network, and wavelet transforms. The hybrid model also used clustering techniques for short-term load and renewable energy forecasting. The work provided evidence that the hybrid models involving clustering-based wavelet and artificial neural networks perform better than conventional models and other hybrid model combinations. However, in this work, the clustering was based on geographical zones, rather than the actual patterns of energy consumption.
Empirical Wavelet transformations (EWT) have been used to decompose the load data into Intrinsic Mode Functions (IMF) [26]. Along with LSTM modeling, the IMF functions are used to predict the low and medium frequency components for load predictions. Furthermore, the high-frequency components are highly varying components with uncertain characteristics, and these are clustered using Improved-DBSCAN (IDBSCAN) algorithm. The prediction results of the high, medium, and low-frequency components are aggregated to determine the total load predictions for short-term load forecasting. Their methodology has the advantage of employing different prediction methods according to the characteristics and the variance of data. However, the methodology based on IDBSCAN is not effective if the data is scaled to a large number of dimensions. Also, it is efficient only when the different clusters have varying densities.
The Autoregressive Integrated Moving Average (ARIMA) model has been utilized as a baseline method for predicting energy consumption as it is easy to implement and generalize to a wide variety of specifications [27]. Nepal et al. used k-means clustering along with ARIMA modeling for predictions of energy consumption in buildings [27]. The clustering technique is used to cluster the days with similar load characteristics during the hours of a day. In their work, the days of a year were clustered into 6 clusters. During the prediction phase, their methodology determines the cluster number of the days preceding the testing day and finally predicts the energy consumption of the testing day. The results indicate that the standalone ARIMA model can be improved with the addition of a clustering stage as in their proposed model. However, the k-means clustering utilized is sensitive to outliers if present in the data.
Fuzzy c-Shape clustering has been investigated by Fateme et al. to cluster the load data depending on the shape of energy consumption [28]. A horizontal ensemble model consisting of LSTM and XGBoost has been utilized to perform a day-ahead forecast of 30-minute granular load prediction. A novel feature of apparent temperature is used in their analysis. The apparent temperature is the equivalent weather variable as experienced by humans due to the collective influence of humidity, temperature, water vapor pressure, and wind speed values. They have suggested that the addition of novel features, such as the representative feature of weather, will improve the accuracy of predictions from cluster-based ensemble models. However, their methodology is dependent on the empirical and assumed function and formulation of equivalent apparent temperature.
LSTM models have been of interest to many researchers to perform energy forecasting [29], [30]. An ensemble of LSTM was used to perform short-term energy forecasting [31]. The different branches of the ensemble utilize different clustering algorithms in their initial phases. The employed clustering algorithms involve Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH), DBSCAN, and KMeans++. In the final phase, a dense neural network is employed to aggregate the results from the different branches of the ensemble. The ensemble and deep learning models have been shown to yield better results when compared to non-ensemble and classical models. In [32], Syed et al. proposed an averaging ensemble model of classical algorithms, including LR, and deep learning algorithms, including LSTM and DNNs. The results indicate that the averaging ensemble model overcomes the shortcomings of the individual models and provides synergy to enhance the overall accuracy. However, the ensemble models and the LSTMs are computationally expensive.
In [33], a novel fuzzy-based clustering method is employed to cluster data into different clusters using the order of feature importance. The clustered data goes through two different phases of regression. In the first phase, a Radial Basis Function Neural Network (RBFNN) is utilized. In the second phase, the output of the first phase is passed to a pooling layer followed by a convolutional layer and finally, through two fully connected neural networks. They tested their proposed method with two case studies to predict the hourly energy consumption for the next week with better results as compared to the classical energy forecasting methodologies. However, the clustering method utilizes common space where data are shared between neighboring clusters and this introduces redundancy and requires additional computations.
Clustering has also been applied at the household level. In [34], a Bayesian non-parametric clustering model has been applied to cluster the households with similar energy consumption profiles across seasons and neighborhoods. The load profile curves are obtained after the removal of phase variability with the application of elastic shape analysis. However, household-level energy consumption has high variability, and it is difficult to aggregate household-level predictions to higher levels for use in optimized utility operations.
In summary, there has been a significant research effort in the application of clustering techniques at different levels of energy distribution networks. The metric for clustering has been similarities in weather conditions, seasons, days of the week, hours of the day, etc. However, it is required that the metric for clustering should emphasize the patterns of energy consumption.
The main advantages of the proposed solution over similar works [35], [36] are as follows.
1) The clustering is solely based on the energy consumption pattern at the distribution node level in electric networks. Hence, additional granular data at the household level or other low levels of the grid are not required.
2) Unlike k-Means clustering, the adopted clustering is not sensitive to the presence of outliers in the data.
3) A trade-off between the accuracy of the predictions and the training time is achieved.

III. THE PROPOSED FORECASTING METHODOLOGY
In this section, a hybrid methodology for day-ahead hourly short-term load forecasting is proposed. Fig. 1 represents the sequence of steps performed for developing the clustering-based short-term load forecasting model. As shown in Fig. 1, the proposed methodology is carried out in four main stages: A. Data acquisition and pre-processing stage. B. Clustering stage. C. Training stage. D. Testing stage. Time-series load data have been utilized for the case study. The methodology begins with the cleaning of data. The time-series energy consumption data consist of attributes such as entry date time and date time. The date time attribute indicates the time at which the electrical energy is consumed, and the entry date time indicates the time at which the record of the energy consumption value is updated in the central distributed recording system. Irrelevant attributes that do not affect energy consumption, such as entry date time, meter codes, etc., are eliminated. Data cleaning is also performed to deal with duplicate records, missing data, etc.

A. DATA ACQUISITION AND PRE-PROCESSING STAGE
The main objective of the data acquisition process is to collect the data to evaluate the proposed methodology in terms of accuracy and training time. The pre-processing stage is crucial to remove noise, avoid redundancy, and improve consistency in the data. Basically, this stage aims to enhance data quality.
The datasets used in this work, consisting of two subsets of the Spanish distribution network, have been provided by the global energy company Iberdrola. Iberdrola has been a pioneer in the deployment of Advanced Metering Infrastructure (AMI) using Power Line Communications (PLC) and open standards. The STAR project, implemented between 2008 and 2018, has mobilized an investment of 2 billion Euros resulting in 10.8 million smart meters installed and the digitization of 90,000 transformer substations [37]. Both datasets are time-series energy consumption data at the distribution transformers' level. The major differences between the two datasets are the size of the data, and the data features. This section describes the two datasets in the following and also details the performed pre-processing steps.
1) DATA ACQUISITION
a: DATASET 1
The load forecasting data available for analysis are the energy consumption data at the distribution transformers' level of Spain. The data contain the hourly energy consumption data for 10 distribution transformers. The weather data for the location of these 10 distribution transformers are scraped online using an Application Programming Interface (API) named Darksky [38]. The data are available for 33 months from 01 January 2017 to 28 September 2019. The weather data are merged with the energy consumption data. The data contain missing values for the weather features. The missing values should be either filled, extrapolated, or deleted [39]. The missing values for numerical features can be filled with mean or median values. Otherwise, backward or forward filling methods can be utilized to fill the missing values. Mode imputation is applied for categorical or ordinal features. In this work, we have used the forward fill method to fill missing values for numerical weather features. Dataset 1 consists of features such as date time, wind speed, maximum temperature, minimum temperature, humidity, summertime, and other weather features in addition to energy consumption values. Fig. 2 presents the standard deviation and mean of the energy consumption values for different transformers in dataset 1. The boxes represent the deviation of the energy consumption values, the horizontal blue line inside the blue box represents the median of the consumption, and the black circles represent the outliers. A wider blue box indicates that the energy consumption values for that transformer are highly varying.

b: DATASET 2
The load forecasting data available for analysis are the energy consumption data at the distribution transformers' level of Spain. The data contain the hourly energy consumption data for 1000 distribution transformers. There are more than 24 million load records in the dataset. The locations of these 1000 transformers are not available currently. The data are available for the same 33 months as dataset 1. The difference for dataset 2 is that the hourly weather information at the location of the 1000 transformers is not available. Hence, dataset 2 consists of features limited to energy consumption values and season. Fig. 3 presents the standard deviation and mean of the energy consumption values for a subset of transformers in dataset 2. The wide boxes in Fig. 3 show that the energy consumption values for transformers 21, 127, and 562 are highly varying. Moreover, Fig. 3 indicates that many records have zero values for consumption. The mean and the range of energy consumption values in dataset 1 and dataset 2 are given in Table 1.
Fig. 4 and Fig. 5 represent the daily consumption of one transformer and the aggregated consumption of all the transformers from dataset 1 and dataset 2 respectively. It is evident that the overall pattern of consumption is the same irrespective of whether one distribution transformer or the aggregation of all the transformers is considered. As seen in Fig. 4, the energy consumption decreases gradually from 00:00 to 05:00 AM and increases at a constant rate from 05:00 AM to reach a peak around 03:00 PM. The consumption then declines at a constant rate till 08:00 PM and then increases with a more or less constant slope till a second peak is reached around 10:00 PM. As per Fig. 5, the energy consumption is more or less the same during the early hours of the morning from 00:00 to 05:00 AM, while a constant increase in the energy demand is witnessed until 10:00 AM. The energy consumption then fluctuates within a small range between 10:00 AM and 09:00 PM, with local peaks observed at 01:00 PM and 06:00 PM. After 09:00 PM, there is a decline in the consumption of energy.

2) DATA PRE-PROCESSING
Data pre-processing involves data cleaning, attribute scaling, attribute/feature selection, feature extraction, etc.
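As an illustration of the cleaning step, the removal of duplicate records and the forward fill applied to the numerical weather features of dataset 1 could look like the sketch below. The record layout and the field names (`transformer_id`, `timestamp`, `energy`) are hypothetical; the actual schema of the datasets is not given here.

```python
def drop_duplicates(records):
    """Keep the first record seen for each (transformer_id, timestamp) key."""
    seen, unique = set(), []
    for rec in records:
        key = (rec["transformer_id"], rec["timestamp"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def forward_fill(values):
    """Replace each None with the most recent non-missing value.

    Leading missing values have nothing before them and stay None.
    """
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


# Hypothetical records; the second one duplicates the first.
records = [
    {"transformer_id": 1, "timestamp": "2017-01-01 00:00", "energy": 4.2},
    {"transformer_id": 1, "timestamp": "2017-01-01 00:00", "energy": 4.2},
    {"transformer_id": 1, "timestamp": "2017-01-01 01:00", "energy": 3.9},
]
clean = drop_duplicates(records)
temperature = forward_fill([None, 11.5, None, None, 12.0])
```

Forward fill suits slowly varying hourly weather readings, since the last observed value is usually a reasonable stand-in for a short gap.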

a: DATA NORMALIZATION
For accurate and efficient learning of machine learning algorithms, it is required that all the attributes have the same numerical contribution and variance in the same order. If one attribute has variance much larger than another attribute, then it dominates whilst learning the objective function.
To incorporate a non-distorting scaler, the Minimum-Maximum (min-max) scaler has been utilized in this work. The min-max scaler is given by (1):
$$a'_m = \frac{a_m - \min(a)}{\max(a) - \min(a)}\,(q - p) + p \tag{1}$$
where $a'_m$ is the new attribute value at row $m$, $a_m$ is the original attribute value at row $m$, $\min(a)$ is the minimum value of the attribute, $\max(a)$ is the maximum value of the attribute, and $[p, q]$ is the scaling range decided for the attribute $a$.
For feature selection, two methods have been investigated. The feature importance scores are calculated utilizing the permutation feature importance technique. Additionally, a top-down search-based feature selection method called Sequential Backward Search (SBS) is employed; unlike the best individual feature technique, it addresses the multi-collinearity between the different features [32].
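The min-max scaling to a range [p, q] described above can be sketched in a few lines. The function name `min_max_scale` and the handling of a constant attribute (mapped to p) are assumptions made for this illustration.

```python
def min_max_scale(values, p=0.0, q=1.0):
    """Rescale one attribute linearly so that min -> p and max -> q."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant attribute carries no spread, so map it to p.
        return [p for _ in values]
    return [p + (v - lo) * (q - p) / (hi - lo) for v in values]


consumption = [10.0, 20.0, 40.0]
scaled = min_max_scale(consumption)                 # range [0, 1]
scaled_sym = min_max_scale(consumption, -1.0, 1.0)  # range [-1, 1]
```

In practice the minimum and maximum would be computed on the training data only and reused for the test data, so that no information leaks from the test period.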

B. CLUSTERING STAGE
The pre-processed data, with high quality in terms of consistency, low noise, and no redundancy, are passed to the next stage of clustering. The main objective of this stage is to group together the distribution transformers with similar patterns in daily energy consumption; the similarity metric used to group the transformers is therefore based on the daily energy consumption profile. Two factors must be kept in mind when a clustering technique is used: the number of clustered models and the accuracy. Dataset 2 contains 1000 transformers. If each transformer is modeled separately, the models capture the energy consumption patterns of the individual transformers effectively. However, the aim is to reduce the number of models to be developed from 1000 to as few as possible while still maintaining the accuracy of the models. Two clustering algorithms have been considered for the time-series forecasting of energy consumption at the transformer level. These algorithms are described in the following:

1) K-MEANS CLUSTERING ALGORITHM
The objective of the k-Means clustering algorithm [40] is to minimize the Error Sum of Squares (SSE) scoring function given by (2):
$$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{x_p \in C_i} \lVert x_p - \mu_i \rVert^2 \tag{2}$$
where $k$ represents the total number of clusters, $C_i$ represents each cluster, $x_p$ represents each point in a cluster, and $\mu_i$ is the mean of all points in cluster $C_i$. k-Means applies an iterative greedy approach to reduce the SSE until it reaches a local optimum. It starts with the selection of the number of clusters k and of k initial centroids, one per cluster. This step is followed by the cluster assignment, in which all the points are assigned to the clusters with the nearest centroids. Once all the points are assigned, the centroids are updated for each cluster as the mean of all the points in that cluster. The cluster assignment and the centroid update are repeated until the centroids no longer change between two subsequent iterations, which indicates that a local minimum has been reached.
The algorithm for the k-Means model is given in Algorithm 1. The value of k is selected at the point up to which the average distance from the points to their centroids decreases rapidly and beyond which it converges or changes only slowly.
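The assignment and update loop described above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's Algorithm 1: the first k points seed the centroids so that the run is reproducible, and the helper names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Cluster assignment: index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            for p in points]


def k_means(points, k, max_iter=100):
    # Assumption for reproducibility: the first k points seed the centroids.
    centroids = [list(map(float, p)) for p in points[:k]]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[i])  # keep the centroid of an empty cluster
        if updated == centroids:  # no centroid moved: local optimum reached
            break
        centroids = updated
    labels = assign(points, centroids)
    sse = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, sse


# Toy data: two well-separated groups of three points each.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids, sse = k_means(points, k=2)
```

Running `k_means` for increasing k and recording the returned SSE gives the curve from which the number of clusters is chosen.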

2) K-MEDOIDS CLUSTERING
It is known that the mean, as a statistic, is highly sensitive to outliers. The k-Means algorithm, which determines and utilizes the means of the data points in its calculations, is therefore particularly sensitive to outliers in the data. To overcome this, medoids are used instead of the average values in a cluster. Medoids are centrally located points in a cluster, and the technique is called k-Medoids clustering. Although k-Medoids is computationally more demanding, the resulting clusters are not particularly sensitive to the presence of outlier points, and the technique is applicable to both continuous and discrete domains of data [41]. The algorithm minimizes the sum of dissimilarities between the objects in a cluster and the reference object selected for that cluster. The input is the value of k, which represents the number of clusters defined for the data. k reference points are selected, one for each of the k clusters. Each remaining point is assigned to the cluster of a reference point such that the sum of the dissimilarities between the reference object and the points in the cluster is minimized. Different initial medoids lead to different clusters. The difference between the two algorithms is that k-Means takes the average value of a cluster as its reference point, whereas k-Medoids takes an actual data point of the cluster as its reference object. Algorithm 2 presents the sequence of steps performed in the k-Medoids algorithm.
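A minimal sketch of k-Medoids with a Minkowski dissimilarity is given below. It uses a simple alternating scheme (assign points, then re-pick each cluster's medoid), which is an assumption and may differ from the paper's Algorithm 2. The first k points seed the medoids for reproducibility, whereas the text initializes them randomly, and the toy profiles are invented for illustration.

```python
def minkowski(a, b, r=2):
    """Minkowski distance of order r (r=1: Manhattan, r=2: Euclidean)."""
    return sum(abs(x - y) ** r for x, y in zip(a, b)) ** (1.0 / r)


def nearest_medoid(p, points, medoids, r):
    """Index of the cluster whose medoid is closest to point p."""
    return min(range(len(medoids)),
               key=lambda c: minkowski(p, points[medoids[c]], r))


def k_medoids(points, k, r=2, max_iter=100):
    # Assumption: the first k points seed the medoids (stored as indices).
    medoids = list(range(k))
    for _ in range(max_iter):
        labels = [nearest_medoid(p, points, medoids, r) for p in points]
        updated = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                updated.append(medoids[c])
                continue
            # New medoid: the member with the smallest total dissimilarity
            # to all other members of its cluster.
            updated.append(min(members, key=lambda i: sum(
                minkowski(points[i], points[j], r) for j in members)))
        if updated == medoids:
            break
        medoids = updated
    labels = [nearest_medoid(p, points, medoids, r) for p in points]
    return labels, medoids


# Toy profiles: a low group, a high group, and one outlier at 60.
profiles = [[0, 0], [1, 0], [2, 0], [10, 0], [11, 0], [12, 0], [13, 0], [60, 0]]
labels, medoids = k_medoids(profiles, k=2)
medoid_profiles = [profiles[m] for m in medoids]
```

In this toy run the outlier joins the second cluster, whose medoid remains the actual profile `[12, 0]`, whereas the mean of that cluster's first coordinate would be pulled to 21.2.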

Algorithm 2. K-Medoids Algorithm
Input: k, Data S
Initialize k medoids randomly, $t_1, t_2, \ldots, t_k \in \mathbb{R}^d$

C. TRAINING STAGE
The main objective of this stage is to develop machine learning and deep learning networks and train these networks on real training data. After the similar transformers are clustered together, the pre-processed data are passed to linear regression, deep neural networks, and long short-term memory networks to train the models for STLF. These models are explained in the following:

1) LINEAR REGRESSION
The objective of LR is to choose the weights $w_0, w_1, w_2, \ldots, w_n$ so that the value of the hypothesis $h_w(x) = w_0 + w_1 x_1 + \cdots + w_n x_n$ is as close as possible to the actual value of the label $y$. This is achieved by minimizing a cost function $J(w_0, w_1, \ldots, w_n)$, expressed in terms of the model parameters, while determining $w_0, w_1, w_2, \ldots, w_n$. This cost function is essentially the sum of squared errors between the predictions and the labels, and the aim is to minimize this error while determining the weights.
LR is used as one of the prediction models to serve as a benchmark for training time: owing to its simplicity, it is expected to have the lowest training time, at the cost of coarser accuracy.
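The weight selection for LR can be sketched with batch gradient descent on the squared-error cost. The learning rate, the iteration count, and the use of gradient descent itself are assumptions made for this illustration; how the LR benchmark is actually fitted is not stated here.

```python
def predict(w, x):
    """Linear hypothesis h_w(x) = w_0 + w_1*x_1 + ... + w_n*x_n."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))


def sse_cost(w, X, y):
    """Sum of squared errors between predictions and labels."""
    return sum((predict(w, xi) - yi) ** 2 for xi, yi in zip(X, y))


def fit_linear_regression(X, y, lr=0.05, n_iter=5000):
    n_samples, n_features = len(X), len(X[0])
    w = [0.0] * (n_features + 1)
    for _ in range(n_iter):
        errors = [predict(w, xi) - yi for xi, yi in zip(X, y)]
        # Gradient of the squared-error cost, divided by the sample count;
        # the scaling changes the step size but not the minimizer.
        grad = [2.0 * sum(errors) / n_samples]
        for j in range(n_features):
            grad.append(2.0 * sum(e * xi[j] for e, xi in zip(errors, X)) / n_samples)
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


# Toy data generated from y = 2 + 3*x, so the fit should recover these weights.
X = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y = [2.0 + 3.0 * row[0] for row in X]
w = fit_linear_regression(X, y)
```

Because the cost is a convex quadratic in the weights, any sufficiently small learning rate converges to the same solution that a closed-form least-squares solve would give.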

2) DEEP NEURAL NETWORKS
If an artificial neural network has multiple hidden layers between the input layer and the output layer, it is termed a Deep Neural Network (DNN) [43]. DNNs are capable of modeling linear and non-linear relationships between the data features. Further, the tendency to overfit can be reduced by applying dropout, in which neurons are dropped in random or systematic order [44].
The non-linear function representing the data is effectively determined in the neural networks using summation and product operations. If a neuron $j$ of layer $l$ (depicted in Fig. 6) of a neural network is considered, then the input to the neuron is $S_j^{l}$ and the weight on its connection from neuron $i$ is $w_{ij}^{l}$. Let $\sigma$ be the activation function; then $x_j^{l}$ is the output of the neuron, and this output acts as input to the neurons in the next layer. Here, $i$ represents the neuron number in the previous layer and $d^{l}$ represents the number of neurons in layer $l$.
The input $S_j^{l}$ to the neuron is the weighted sum of the outputs of the previous layer, and the output $x_j^{l}$ of neuron $j$ in layer $l$ is obtained by applying $\sigma$ to this input. Written in matrix form, the same weighted sum gives the inputs to all neurons of layer $l$ at once, and this form is used in the forward propagation calculations. The input value $x$ is available initially. It is used with the pre-initialized weights $W^{(1)}$ to calculate the input $S^{(1)}$ to the neurons in hidden layer 1. Applying the activation function to this input yields the output $x^{(1)}$ of the neurons in hidden layer 1.
The algorithm for the forward propagation of the neural network is given in Algorithm 3. The aim of forward propagation is to calculate the inputs and outputs in the different layers of the network using the weights, biases, and activation functions. Backpropagation is used to determine the gradient of the error with respect to the weights of the neurons, proceeding from the last layer back to the first hidden layer, so that the error can be minimized.
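As an illustration of the forward propagation step, the following minimal sketch computes the input and output of every layer of a small fully connected network. The function and parameter names (`forward`, `weights`, `biases`), the explicit bias vector, and the use of the same activation in every layer are assumptions made for this example, not details taken from Algorithm 3.

```python
import math


def sigmoid(s):
    """Logistic activation, used here only as an example of sigma."""
    return 1.0 / (1.0 + math.exp(-s))


def forward(x, weights, biases, activation=sigmoid):
    """Propagate the input vector x through fully connected layers.

    weights[l][i][j] is the weight from neuron i of the previous layer to
    neuron j of the current layer; biases[l][j] is the bias of neuron j.
    Returns the list of layer outputs, starting with the input itself.
    """
    outputs = [list(x)]
    for W, b in zip(weights, biases):
        prev = outputs[-1]
        # Input to neuron j: weighted sum of the previous layer's outputs.
        s = [sum(W[i][j] * prev[i] for i in range(len(prev))) + b[j]
             for j in range(len(b))]
        # Output of neuron j: activation applied to its input.
        outputs.append([activation(sj) for sj in s])
    return outputs
```

For a regression task such as load forecasting, the last layer is commonly given a linear activation; the sketch keeps a single activation only for brevity.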
The error associated with the predictions is given by (9) [45]. The subsequent equations (10), (11), and (12) lead to an expression for the partial derivative of the error with respect to the neuron weights. In backpropagation, the error gradient $\delta_i^{(L)}$ is determined first ($L$ represents the last layer in the neural network), and the error gradients of the previous layers are then calculated in turn, each expressed in terms of the error gradient of the next layer. All the steps of forming a DNN are provided in Algorithm 4.
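The backward recursion can be summarized with the standard backpropagation relations. This is a sketch in the notation used above, assuming that $e$ denotes the prediction error, that $\delta_j^{(l)}$ is defined as the derivative of $e$ with respect to the input $S_j^{(l)}$, and that $S_j^{(l)} = \sum_i w_{ij}^{(l)} x_i^{(l-1)}$; the symbol $e$ and this indexing are assumptions.

```latex
\delta_j^{(l)} = \frac{\partial e}{\partial S_j^{(l)}}

\frac{\partial e}{\partial w_{ij}^{(l)}} = x_i^{(l-1)} \, \delta_j^{(l)}

\delta_i^{(l-1)} = \sigma'\big(S_i^{(l-1)}\big) \sum_{j=1}^{d^{(l)}} w_{ij}^{(l)} \, \delta_j^{(l)}
```

The last relation follows from the chain rule: the error depends on $S_i^{(l-1)}$ only through the output $x_i^{(l-1)} = \sigma\big(S_i^{(l-1)}\big)$, which feeds every neuron $j$ of layer $l$.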

3) LONG SHORT TERM MEMORY (LSTM) NETWORKS
LSTM is a type of Recurrent Neural Network (RNN) that predicts the output based not only on the current state of the hidden units but also on the previous states witnessed so far, by storing information in memory blocks [46].
LSTMs are sequential models and hence capture temporal dependencies. These models are suitable for processing time-series data such as load forecasting data. In a standard RNN, a neuron receives two inputs at time step $t$: the input of time step $t$ ($x_t$) and the output obtained at time step $t-1$ ($h_{t-1}$). The output at a time step is obtained by taking the weighted sum of $x_t$ and $h_{t-1}$ and then applying an activation function such as the Rectified Linear Unit (ReLU) or the hyperbolic tangent (tanh) to the weighted sum.
LSTM places a mini neural network inside each neuron and therefore complicates the process of training. However, it helps to improve reliability and handles long-term dependencies well by mitigating the vanishing and exploding gradient issues that usually arise with a standard RNN. The main idea of LSTM is to have two outputs and gates. One of the outputs (the hidden state) goes to the output layer and to the next time step, whereas the other output (the cell state) goes to the next time step only. Gates are element-wise multiplication operations, and there are several gates in the LSTM. The LSTM network learns the weights, and these weights are applied to the inputs through dot products.
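To make the roles of the two outputs and the gates concrete, the following minimal sketch implements one time step of the commonly used LSTM cell formulation. The names `lstm_step` and `params` and the layout of the weights are assumptions made for this example; the paper itself builds its networks with TensorFlow rather than with hand-written code.

```python
import math


def _sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def _affine(W, U, b, x, h):
    # One pre-activation per hidden unit: W[j].x + U[j].h + b[j].
    return [sum(wk * xk for wk, xk in zip(W[j], x))
            + sum(uk * hk for uk, hk in zip(U[j], h)) + b[j]
            for j in range(len(b))]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step.

    params maps each of 'f', 'i', 'g', 'o' (forget gate, input gate,
    candidate values, output gate) to a tuple (W, U, b) of input weights,
    recurrent weights, and biases.
    Returns (h, c): h goes to the output layer and to the next time step,
    c goes to the next time step only.
    """
    f = [_sigmoid(v) for v in _affine(*params['f'], x, h_prev)]
    i = [_sigmoid(v) for v in _affine(*params['i'], x, h_prev)]
    g = [math.tanh(v) for v in _affine(*params['g'], x, h_prev)]
    o = [_sigmoid(v) for v in _affine(*params['o'], x, h_prev)]
    # Gates act through element-wise multiplication.
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c
```

The additive update of the cell state `c` is what lets information, and gradients, persist over many time steps.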
An LSTM layer followed by a fully connected neural network is depicted in Fig. 7. The machine learning models utilized in the proposed clustering-based modeling and their parameters are specified in Table 2. The values for the parameters are obtained after grid search parameter optimization. The objective of the work is to develop day-ahead hourly forecasting models whilst minimizing mean squared error as the loss function.

D. TESTING STAGE
In this stage, the performance of the clustering-based deep learning models is evaluated by testing these models on the datasets of all the distribution transformers. The performance of clustering-based models is compared against the individual models developed for each transformer. The metrics of evaluation utilized for accuracy are Root Mean Square Error (RMSE) and Mean Absolute Percentage Error (MAPE). Training time and testing time are used to evaluate the performance in terms of execution time.

1) Root Mean Square Error (RMSE): RMSE is the square root of the mean of the squared differences between the actual and predicted energy consumption. RMSE is an effective performance metric for comparing the forecasting errors of different models for a single attribute, which is the case in this paper. However, it is not a recommended measure for comparing performance between attributes, as RMSE is scale dependent. RMSE is given by (15) [47].
2) Mean Absolute Percentage Error (MAPE): MAPE is the ratio (in percent) of the absolute difference between the actual and predicted values to the actual value, averaged over all records of energy consumption. It is necessary to make sure that the actual value is not zero while calculating MAPE. MAPE is given by (16) [47].
Here, $E_{actual}$ is the actual energy consumption, $E_{predicted}$ is the predicted value of energy consumption, and $N$ is the number of energy values. When the prediction is lower than the actual value (and non-negative), the absolute percentage error cannot exceed 100%. However, when the prediction is higher than the actual value, there is no upper limit to the value of MAPE. The data of each transformer are split into a training set and a testing set in the proportion 80% and 20%, respectively. Algorithm 5 details the sequence of steps designed in the proposed methodology to perform the STLF task.
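Both metrics can be computed in a few lines. The sketch below uses the standard definitions, $\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum (E_{actual} - E_{predicted})^2}$ and $\mathrm{MAPE} = \frac{100}{N}\sum \left| \frac{E_{actual} - E_{predicted}}{E_{actual}} \right|$, which are assumed to correspond to (15) and (16); the function names are illustrative.

```python
import math


def rmse(actual, predicted):
    """Root mean square error, in the units of the data (e.g. kWh)."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    n = len(actual)
    return 100.0 / n * sum(abs((a - p) / a) for a, p in zip(actual, predicted))


# Under-prediction is bounded, over-prediction is not:
bounded = mape([100.0], [0.0])      # lowest non-negative forecast
unbounded = mape([100.0], [350.0])  # a forecast far above the actual value
```

The two toy calls at the end illustrate the asymmetry noted above: a non-negative forecast that undershoots contributes at most 100%, whereas an overshoot can contribute any amount.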

IV. RESULTS OF CLUSTERING-BASED MODELS
The current work focuses on the application of clustering by energy consumption patterns at the transformer level to enhance the performance of forecasting. The methodology determines the clusters of similar transformers based on a similarity metric of aggregate daily energy consumption after which the data are sent to the machine learning models. The proposed methodology is applied to two real datasets to illustrate the performance and enhancement in the forecasting accuracy. The computations are performed on 1 core with 32 GB RAM.

A. SIMILARITY METRIC FOR TRANSFORMERS
At the outset of the clustering stage, a similarity matrix is required to define the similarities and dissimilarities between the transformers. The similarity metric is defined on the aggregated daily load so as to capture the daily load patterns of the transformers. Let $L_{s,t}$ be the load of transformer $s$ during time period $t$. The load matrix for the transformers is represented by $L_{S \times T}$ in (17), where $S$ is the total number of transformers and $T$ is the total number of days when the aggregation level is 24 hours. The size of the matrix increases with the number of transformers or with a reduction in the aggregation level of the energy consumption values. For the 1000-transformer dataset used in case study 2, the size of the load matrix is 1000 × 1001, where 1001 is the number of days from 01 January 2017 to 28 September 2019.

$$L_{S \times T} = \begin{bmatrix} L_{1,1} & L_{1,2} & \cdots & L_{1,T} \\ L_{2,1} & L_{2,2} & \cdots & L_{2,T} \\ \vdots & \vdots & \ddots & \vdots \\ L_{S,1} & L_{S,2} & \cdots & L_{S,T} \end{bmatrix} \qquad (17)$$

The similarity between any two transformers $r$, $s$ at any given time $p$ is determined based on the pairwise Minkowski similarity, which is given by (18).
where $L_r$, $L_s$ represent the row vectors of load values for transformers $r$, $s$ respectively. The optimized value of $q$ in (18) was determined to be equal to unity. Finally, the obtained pairwise similarity matrix is passed as an argument to the clustering function to obtain the clusters of transformers with similar energy consumption patterns. The adoption of Minkowski similarity enhanced the performance of clustering.
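A minimal sketch of the pairwise computation is given below. It assumes the standard Minkowski distance of order $q$, $d(r, s) = \left( \sum_{p} |L_{r,p} - L_{s,p}|^{q} \right)^{1/q}$, applied to the rows of the load matrix; with $q = 1$ this reduces to the Manhattan (sum of absolute differences) distance. The function names are illustrative.

```python
def minkowski(u, v, q=1):
    """Minkowski distance of order q between two equal-length load vectors."""
    return sum(abs(a - b) ** q for a, b in zip(u, v)) ** (1.0 / q)


def pairwise_minkowski(load_matrix, q=1):
    """Symmetric matrix of Minkowski distances between all rows.

    load_matrix[s][t] is the aggregated load of transformer s on day t.
    """
    n = len(load_matrix)
    dist = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for s in range(r + 1, n):
            dist[r][s] = dist[s][r] = minkowski(load_matrix[r], load_matrix[s], q)
    return dist
```

The resulting matrix has one row and one column per transformer and can be handed to any clustering routine that accepts precomputed dissimilarities.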

B. OPTIMIZATION OF THE NUMBER OF CLUSTERS
There are various methods to optimize the number of clusters (k) in a clustering algorithm. These include direct methods, such as the elbow curve [48] and the average silhouette [49], and statistical methods, such as the gap statistic [50]. The direct methods involve optimizing a cost function, such as the minimization of the within-cluster error. Statistical methods are those that collect evidence to support a hypothesis or to reject a null hypothesis [51]. In this work, the elbow curve is constructed to determine the optimal number of clusters, and the results are shown in the following:

1) ELBOW CURVE
In this work, the direct method of the elbow curve is utilized. The elbow curve plots the within-cluster sum of square errors (WCSSE) against the number of clusters k. Since WCSSE keeps decreasing as clusters are added, the aim is to select a low value of k at which the sum of square errors is already small and beyond which adding more clusters does not improve the clustering much. This provides a trade-off between the number of clusters and the accuracy. The elbow method is selected over other methods of determining the k-value for clustering because of its low complexity. As per existing research [52], the execution time is the lowest for the elbow method when compared to other methods, because it only requires the sum of the squared distances between the cluster points and their representative centers.
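The two ingredients of the elbow method can be sketched as follows. The WCSSE function assumes a medoid-based clustering described by a precomputed dissimilarity matrix; the automatic elbow rule (the k with the largest second difference of the curve, for equally spaced k values) is only one possible heuristic and is an assumption of this example, since the elbow can equally be read off the plotted curve.

```python
def wcsse(dist, medoids, labels):
    """Within-cluster sum of squared errors for a medoid-based clustering.

    dist[i][j] is the dissimilarity between objects i and j, medoids[c] is
    the index of the representative of cluster c, labels[i] is the cluster
    of object i.
    """
    return sum(dist[i][medoids[labels[i]]] ** 2 for i in range(len(labels)))


def elbow_k(ks, scores):
    """Return the k at the sharpest bend of the WCSSE curve.

    The bend at an interior point is the drop before it minus the drop
    after it (a discrete second difference); ks must be equally spaced.
    """
    best_k, best_bend = None, float("-inf")
    for idx in range(1, len(ks) - 1):
        bend = (scores[idx - 1] - scores[idx]) - (scores[idx] - scores[idx + 1])
        if bend > best_bend:
            best_k, best_bend = ks[idx], bend
    return best_k
```

In practice, `wcsse` would be evaluated once per candidate k after clustering, and the resulting list of scores passed to `elbow_k` or simply plotted.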
To determine the optimal number of clusters, the elbow curves are obtained for dataset 1 and dataset 2, as illustrated in Fig. 8 and Fig. 9. The independent axes in the figures indicate the number of clusters (k), and the dependent axes represent the WCSSE for the corresponding value of k. As per Fig. 8, the sharp decline in the WCSSE is observed at k = 3. Hence, the optimum number of clusters is selected as 3 for the 10-transformer dataset. The elbow in Fig. 9 suggests that the optimal number of clusters is k = 93 for the 1000-transformer dataset. Hence, the clusters are determined, and the deep learning models are developed, with the number of clusters k = 93.
Fig. 10 depicts the feature importance scores determined for the weather attributes. It illustrates that temperature, pressure, humidity, and UV index are the most crucial attributes contributing to accurate predictions of the target variable whilst learning the objective function. Fig. 11 illustrates the feature importance scores determined for the lag consumption attributes for 6-hour-ahead prediction of energy consumption values. The accuracies of the forecasting models increase after the application of feature engineering, i.e., feature selection using Sequential Backward Search (SBS). Fig. 12 illustrates the RMSE performance of the forecasting models for an increasing number of features in the data. The forecasting models are developed after the application of the proposed clustering-based deep learning approach. However, the results are shown after testing on the individual transformers within the clusters. The best testing performance is obtained when the number of features is optimal (n = 19). If the number of features used in training is increased or decreased from this value, the accuracy decreases.

D. FORECASTING RESULTS
The data obtained consist of energy consumption values over 33 months for 1000 transformers. k-Medoids clustering is utilized to cluster similar transformers together. Similarity here means that the transformers have similar patterns of aggregated daily consumption and hourly consumption. The aim of the work is to evaluate the performance of individual models for the 1000 transformers against the clustered models. Individual models means that each of the 1000 transformers has a separately trained model using its own data, i.e., there are 1000 different models. Clustered models means that the 1000 transformers are clustered into k different groups and each of these k clusters has one model trained on the transformers within that cluster. The employment of a clustering technique reduces the number of models required from 1000 to k. As described in the previous subsection, the value of k (the number of clusters) is optimized to minimize the within-cluster sum of square errors.
The performance of clustered and individual forecasting models for distribution transformers is evaluated in terms of RMSE, MAPE, training time, and testing time.
The RMSE and MAPE values for the individual models and the clustered models using DNNs are depicted in Fig. 13 and Fig. 14. Fig. 13 shows the results of the DNN models for load forecasting. Each of the subfigures shows a representative subset of the 1000 transformers. As observed from the RMSE lines, the individual models mostly form the lower of the two lines. The RMSE values range between 0 and 30 kWh. These values are very low considering the range (0 to 2,147,484 kWh) of energy consumption in the dataset. At a few points, the clustered models outperform the individual models for the respective transformers. The MAPE values for the individual models range between 4 and 16 percent, and the MAPE values for the clustered models range between 5 and 19 percent. These MAPE values indicate that the clustered models are very comparable to the individual models. A few transformers exhibit high statistical variance in energy consumption, i.e., they have either zero consumption values, very high energy consumption values, or actual energy values between 0 and 1. The MAPE values for such transformers are around 20-32%. These transformers have been found to be alternate backup transformers that are used only during periods of faults or of preventive or predictive maintenance of the main transformers.
Table 3 presents the results of the clustering and individual models on dataset 1 when the machine learning models used are LR, LSTM, and DNNs. When accuracy is considered, the best performing model is the DNN model. In the clustering-based algorithm, the models are trained on a cluster whilst the testing is performed on each transformer within the cluster. If the clustering and individual models are compared, the individual models have slightly better accuracy than the clustered models. However, the accuracy of the clustered models is highly competitive.
If the gain in training time is considered, then the clustered models are highly preferable to the individual models. When the training times of the different machine learning models are considered, LR is the best owing to its simplicity. The training times of the DNN models are ten times lower than those of the LSTM models. As a trade-off between accuracy and training time, it can be concluded that the clustering-based DNNs perform better. A similar pattern is also observed in Table 4, which presents the results of the clustering and individual models on dataset 2 when LR, LSTM, and DNNs are used for training and testing.
The comparison of a trained clustered STLF model using different machine learning algorithms is illustrated in Fig. 15. The independent axis represents the time points and the dependent axis represents the energy consumption in kWh. The results in the figure indicate that the proposed k-Medoids methodology has generated accurate clusters and that, in general, the clustered model predicts energy consumption values close to the actual values of consumption for all machine learning algorithms. Fig. 15 also indicates that the DNN forecasts follow the consumption peaks better than the LSTM and LR models. At many time points, LSTM and LR forecast peaks after the peaks have occurred.
Fig. 16 illustrates the error bars that depict the standard deviation of the predictions of the DNN- and LSTM-based clustering models for STLF. The shaded region around the blue line, which depicts the energy values predicted by the clustered DNN model, represents the error region, i.e., the deviation of the model predictions. The experiments were repeated 20 times to obtain the mean and the standard deviation of the predictions. LR-based clustering models had zero variance in their predictions and hence are not plotted. LSTM-based clustering models have variance tending to zero, and DNN-based models also have very low variance, as shown in Fig. 16. The sources of randomness are kept to a minimum whilst training the proposed models, and the trained models can be saved using deep learning serialization for future testing in industrial applications. The standard deviation of the error metrics when the forecasting models are retrained under similar initialization conditions will be negligible.

V. CONCLUSION
Electric utilities rely on load forecasting for capacity planning, power management, and operations in this era of uncertainty due to the integration of renewables. In this paper, a hybrid model of k-Medoids clustering and deep learning models was proposed for day-ahead hourly load forecasting at the level of distribution transformers. The performance and applicability of this solution were demonstrated on two real datasets, which supports the generalization ability of the work. In the larger dataset 2, 1000 distribution transformers were analyzed. These transformers were clustered based on the pairwise Minkowski similarity of the aggregated daily consumption of energy using the k-Medoids clustering method with the WCSSE measure. The elbow method determined that the optimum number of clusters for these 1000 transformers was 93. The reduction in the number of required models from 1000 to 93 reduced the constraint on the computational resources utilized for load forecasting and was a step towards real-time application. Subsequently, the deep learning models were trained and tested as clustered models. These models were also compared with the 1000 individual deep learning models in terms of RMSE, MAPE, training time, and testing time. The forecasting results indicated that the proposed methodology generated accurate clusters and saved 44% of the training time. In essence, the clustered models were highly competitive with the individual models in terms of accuracy, while their training time was significantly lower. The proposed methodology can be used with large electrical networks and big data in smart grids at any level. Future work will investigate the scalability of the proposed methodology to a large electrical distribution network of 100,000 transformers, for which the data are currently being collected.