Short-Term Load Forecasting and Associated Weather Variables Prediction Using ResNet-LSTM Based Deep Learning

Short-term load forecasting is mainly used in control centers to explore the changing patterns of consumer loads and to predict the load value at a certain time in the future. It is one of the key technologies for smart grid implementation. The load is affected by multi-dimensional factors. To fully exploit the time-series characteristics of load data and improve the accuracy of load forecasting, this paper proposes a hybrid model based on a residual neural network (ResNet) and long short-term memory (LSTM). First, the data with multiple feature parameters is reconstructed and input into the ResNet for feature extraction. Second, the extracted feature vector is used as the input of the LSTM for short-term load forecasting. Lastly, a practical example is used to compare this method with other models; the comparison verifies the feasibility and benefit of extracting features from the input parameters and shows that the proposed combined method has higher prediction accuracy. In addition, this paper carries out prediction experiments on the weather variables among the influencing factors.


I. INTRODUCTION
Load forecasting is mainly used to explore the regular patterns and influencing factors of consumer load in a smart grid, and to take the necessary control actions. Accurate short-term load forecasting provides the basis for many control center functions such as planning, dispatching, load frequency control, and economic operation, and it is of great significance for ensuring the dynamic balance and the stable and reliable operation of the smart grid [1]. With the increase in power demand, how to improve the prediction accuracy is an urgent problem to be solved [2]. Changes in various factors affect the load, such as regional differences, socio-economic activities, natural climate, price, and other factors [3]. As a result, load data is characterized by randomness, volatility, periodicity, and diversity. The key problem of load forecasting research is how to mine the internal law of load change from historical load data and find an accurate forecasting method [4].
The associate editor coordinating the review of this manuscript and approving it for publication was Emilio Barocio.
To improve the speed and accuracy of load forecasting, scholars have put forward many methods, each with its specific advantages and disadvantages. Some are suitable for linear data prediction, while others are suitable for classified prediction [5]. Short-term load forecasting techniques can be divided into four categories: statistical techniques, artificial intelligence techniques, knowledge-based expert systems, and hybrid techniques. They are listed in chronological order in Table 1.
VOLUME 11, 2023. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Statistical approaches require an explicit mathematical model that gives the relationship between load and several input factors. Classical statistical methods include multiple regression analysis, exponential smoothing, iteratively reweighted least squares, adaptive load forecasting, and stochastic time series. In [34], the peak demand of a typical growth system with genetic dynamic load characteristics is estimated. In [35], a regression-based daily peak load forecasting method with a transformation technique was presented; there is an obvious seasonal load change characterized by a nonlinear relationship between temperatures and loads. A trend removal technique based on optimal smoothing was proposed in [36].
The term computational intelligence commonly refers to the fields of fuzzy systems, artificial neural networks (ANN), evolutionary computation, and swarm intelligence. Of these fields, neural networks are most often applied in load forecasting. In [37], an ANN method is applied to forecast the short-term load for a large power system. An adaptive fuzzy clustering model based on the recursive Gustafson-Kessel algorithm and recursive weighted least squares is used to improve region division [38]. An attenuated radial basis function (RBF) neural network was trained for 24-hour power load forecasting. In [39], a modified deep residual network is formulated to improve the forecast results.
Expert systems are the result of advancements in artificial intelligence in the last two decades. These are rule-based methods, which make decisions based on the experience of experts, and they are regarded as a supplementary method. A generalized technique for short-term load forecasting was tested using data from four diverse sites [40]. In [41], a knowledge-based expert system was implemented to support the choice of the most suitable load forecasting model for medium/long-term power system planning. A rule-based method was put forward in [42], which brought prior expert knowledge of the load curve into the statistical model.
Statistical methods and traditional machine learning methods cannot account for the high volatility, uncertainty, and time correlation of load data at the same time, so their prediction accuracy is unsatisfactory and there is still room for improvement [43]. Single methods often come with several disadvantages, including low computational efficiency, high computational complexity, and high error percentage. Over the years, researchers have been building hybrid load forecasting models to obtain better accuracy with a minimum error rate [44]. A composite load model was developed for predicting hourly electric loads 1-24 h ahead [45]. In [46], a hybrid demand model that combines a state-space model and an ANN model was proposed to enhance load modeling in distribution applications. In [47], particle swarm optimization merged with fuzzy neural networks is proposed. A neural network that combines elements of a convolutional neural network (CNN) and a long short-term memory network (LSTM) was proposed in [48].
In addition, weather factors are crucial for load forecasting. Over the years, operational numerical weather prediction (NWP) models have been developed to improve the accuracy, reliability, and resolution of predictions [49]. At the same time, some scholars have proposed using deep learning models for weather forecasting [50]. In this paper, we also study the prediction of weather variables using a hybrid method based on a residual neural network (ResNet) and LSTM. When there are anomalies in the power load data or part of the weather variable data is missing, the ResNet-LSTM model can be used to predict weather variables such as dry-bulb temperature (drybt) and humidity, and the load can then be predicted.
The remainder of this paper is arranged as follows. The deep learning principles of ResNet and LSTM are introduced in Section II. In Section III, the combined model based on the ResNet-LSTM network for short-term load forecasting is presented, and the evaluation indices are discussed. Case studies are given and discussed in Section IV, and the conclusions are presented in Section V.

A. RESIDUAL NETWORK
ResNet was proposed in [51] to solve the degradation problem of deep neural networks: when shallow networks are directly stacked into deep networks, it is difficult to make full use of the powerful feature extraction ability of deep networks, and the accuracy declines. ResNet has three features: an ultra-deep network (breaking through 1000 layers), the residual block, and accelerated training with the Batch Normalization algorithm. These features solve not only the degradation problem but also the problems of vanishing and exploding gradients. A vanishing gradient occurs when the error gradient of each layer is less than 1: the deeper the network, the closer the gradient gets to 0 during backpropagation. Similarly, an exploding gradient occurs when the error gradient of each layer is greater than 1: the deeper the network, the larger the gradient.

1) RESIDUAL BLOCK
To solve the degradation problem in deep networks, ResNet introduces the residual block. The residual block is composed of multiple cascaded convolution layers (the residual mapping) and a shortcut connection (the identity mapping). Their outputs are added, and the output of the residual block is obtained through the ReLU activation function. In Figure 1, the weight layer is a convolution operation, $x$ is the input, $F(x)$ is the residual mapping, and $H(x)$ is the output. The mapping relationship between the three is:
$$H(x) = F(x) + x \tag{1}$$
In the residual network, the input $x$ is directly short-circuited to the output of the network. The network then no longer learns the optimal mapping function directly but instead learns its residual, as shown in equation (2):
$$F(x) = H(x) - x \tag{2}$$
Suppose the network is already close to optimal and is deepened further. If the residual mapping becomes 0, i.e., $F(x) = 0$, then $H(x) = x$; in theory, the network stays in the optimal state, and its performance does not decrease as the depth increases. If $F(x) \neq 0$ but $F(x)$ is close to 0, then $x$ approximates the actual mapping $H(x)$. In this way, the degradation caused by network layer stacking is solved. ResNet has five basic network structures with different numbers of layers, namely ResNet18, ResNet34, ResNet50, ResNet101, and ResNet152. ResNet18 and ResNet34 use residual blocks with two convolution layers, and ResNet50, ResNet101, and ResNet152 use residual blocks with three convolution layers. There are also some differences in the implementation of the different residual blocks, which will not be discussed here.
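The behaviour of the residual block can be illustrated with a minimal sketch. It is not the convolutional implementation used in the model: the residual mapping F is replaced by two hypothetical toy functions (`zero_residual` and `small_residual`), and the input is a plain list of non-negative values so that the ReLU does not alter the identity case.

```python
def relu(values):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, v) for v in values]


def residual_block(x, residual_fn):
    """Return ReLU(F(x) + x), where residual_fn plays the role of F."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must match the shape of x for an identity shortcut")
    return relu([f + xi for f, xi in zip(fx, x)])


def zero_residual(x):
    """Hypothetical F that outputs 0 everywhere (the F(x) = 0 case)."""
    return [0.0 for _ in x]


def small_residual(x):
    """Hypothetical F that applies a small correction to each element."""
    return [0.1 * xi for xi in x]


x = [0.5, 1.0, 2.0]
identity_out = residual_block(x, zero_residual)    # F(x) = 0, so H(x) = x
corrected_out = residual_block(x, small_residual)  # H(x) = x + 0.1 * x

print(identity_out)
print(corrected_out)
```

When F(x) = 0 the block passes its input through unchanged, which is why stacking further blocks does not, in principle, degrade an already good network.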

2) BATCH NORMALIZATION ALGORITHM
This algorithm refers to batch standardization processing, i.e., the feature maps of a batch of data are made to follow a distribution with a mean of 0 and a variance of 1. This operation is carried out between each fully connected layer and its activation function so that the variation range of the input in the hidden layer is not too large and the input value passes through the sensitive part of the activation function, which accelerates the convergence of the network and improves the accuracy. Figure 2 illustrates the calculation of the mean $\mu_B$ and variance $\sigma_B^2$ in the batch normalization algorithm with a batch size of 2. After the convolution and pooling operations on Image 1 and Image 2, the feature matrices Feature 1 and Feature 2 are obtained as in (3), where $x^{(1)}$ represents the data in Channel 1 of all features of the batch and, similarly, $x^{(2)}$ represents the data in Channel 2 of all features of the batch. According to (4) and (5), we calculate the mean $\mu_B$ and variance $\sigma_B^2$ of $x^{(1)}$ and $x^{(2)}$, respectively, to obtain two vectors:
$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i \tag{4}$$
$$\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_B\right)^2 \tag{5}$$
where $m$ is the number of values $x_i$ of one channel in the batch. Then we standardize each channel according to (6), where $\epsilon$ is a small constant that prevents the denominator from being zero:
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \tag{6}$$
There are also two parameters in (7): $\gamma$ adjusts the variance of the numerical distribution, and $\beta$ adjusts the position of the numerical mean:
$$y_i = \gamma \hat{x}_i + \beta \tag{7}$$
These two parameters are learned in the back-propagation process, which means that the neural network chooses the most suitable distribution during training. The default value of $\gamma$ is 1, and the default value of $\beta$ is 0.
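The per-channel calculation described above can be sketched in a few lines. The sketch assumes the standard batch normalization formulas (batch mean, biased batch variance, standardization with a small constant, then scaling by γ and shifting by β); the two feature maps are made-up values, not those of Figure 2.

```python
import math


def batch_norm_channel(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize all values of one channel across the batch."""
    m = len(values)
    mean = sum(values) / m
    var = sum((v - mean) ** 2 for v in values) / m  # biased batch variance
    normalized = [(v - mean) / math.sqrt(var + eps) for v in values]
    return [gamma * v + beta for v in normalized], mean, var


# Hypothetical batch of size 2; each feature map is indexed [channel][position].
feature_1 = [[1.0, 2.0], [0.0, 4.0]]
feature_2 = [[3.0, 4.0], [2.0, 2.0]]

# Pool the values of each channel over the whole batch.
channel_1 = feature_1[0] + feature_2[0]  # data of Channel 1
channel_2 = feature_1[1] + feature_2[1]  # data of Channel 2

out_1, mean_1, var_1 = batch_norm_channel(channel_1)
out_2, mean_2, var_2 = batch_norm_channel(channel_2)

print(mean_1, var_1)
print(mean_2, var_2)
```

With the default γ = 1 and β = 0, each output channel has a mean of approximately 0 and a variance of approximately 1.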

B. LSTM NETWORK
The original intention of the recurrent neural network (RNN) is to learn the long-term dependencies in time-series problems, and practice has shown that the RNN performs well on this problem. At the same time, a large number of experiments show that the standard RNN suffers from vanishing and exploding gradients during training because of its iteration. To solve this problem, Hochreiter proposed LSTM [52], which is an improved RNN. LSTM adds a long-term memory unit, which carries data information forward [20]. The basic network unit is shown in Figure 3.
The basic unit of LSTM includes the forget gate, input gate, and output gate [53]. The input $x_t$, the state memory unit $C_{t-1}$, and the intermediate output $h_{t-1}$ jointly determine the part of the state memory unit to be forgotten. The input gate determines, through the sigmoid and tanh functions, the vector retained in the state memory unit. The intermediate output $h_t$ is determined by the updated $C_t$ and the output $o_t$; the calculation is shown in (8)-(13). The forget gate selectively forgets the unit state of the previous time step and corrects the parameters, the input gate updates the state of the information, and the output gate reads, outputs, and corrects the parameters. LSTM adopts the gate structure to increase the transmission and exchange of information. This alleviates the problem of vanishing and exploding gradients in model training and allows the model to learn both long-term and short-term dependencies in time series, so it can be applied in many scenarios.
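As an illustration of the gate mechanism, the following sketch performs one step of a scalar LSTM cell. It assumes the widely used standard formulation (forget gate, input gate, candidate state, cell update, output gate, hidden output); whether it matches (8)-(13) term by term is an assumption, and the parameter dictionary `p` with its key names is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell in the standard formulation."""
    f_t = sigmoid(p["wf"] * x_t + p["uf"] * h_prev + p["bf"])      # forget gate
    i_t = sigmoid(p["wi"] * x_t + p["ui"] * h_prev + p["bi"])      # input gate
    c_hat = math.tanh(p["wc"] * x_t + p["uc"] * h_prev + p["bc"])  # candidate state
    c_t = f_t * c_prev + i_t * c_hat                               # updated cell state
    o_t = sigmoid(p["wo"] * x_t + p["uo"] * h_prev + p["bo"])      # output gate
    h_t = o_t * math.tanh(c_t)                                     # intermediate output
    return h_t, c_t


# Hypothetical parameters: forget gate almost fully open, input gate almost closed,
# so the cell state is carried forward nearly unchanged.
params = {
    "wf": 0.0, "uf": 0.0, "bf": 50.0,
    "wi": 0.0, "ui": 0.0, "bi": -50.0,
    "wc": 1.0, "uc": 1.0, "bc": 0.0,
    "wo": 1.0, "uo": 0.0, "bo": 0.0,
}

h, c = 0.0, 0.8
for x_t in [0.2, -0.4, 0.9]:
    h, c = lstm_step(x_t, h, c, params)

print(h, c)
```

With these settings the cell state stays at its initial value over the whole sequence, which shows how the gates let information persist across time steps.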

III. ResNet-LSTM MODEL

A. ResNet-LSTM STRUCTURE
The ResNet-LSTM model proposed in this paper consists of two parts: the ResNet is regarded as the pre-feature extraction unit and the LSTM is regarded as the time series feature learning unit. The model structure is shown in Figure 4.
As shown in Figure 4 (a), the ResNet18 network is used in the model. It consists of five parts: Conv1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x. Conv1 performs a convolution operation and a max-pooling operation. Each of the remaining parts consists of two residual blocks, and each residual block is composed of a two-layer convolution network. The feature matrices of the two branches of the residual block are added and then output through the ReLU activation function. We input data with the shape (None, 56, 1) into the ResNet, where 56 is the number of feature parameters and the number of channels is 1. The number of convolution kernels in the five parts is 32, 32, 64, 128, and 256, respectively, and the convolution kernel size is 3. Max pooling is used to reduce the complexity of the model; the pool size is also 3 and the stride is 2. After feature extraction, the ResNet outputs a three-dimensional tensor of shape (None, 2, 256) to the LSTM network.
In the residual block, when the shape of the input feature matrix is consistent with that of the output feature matrix, the main branch and the shortcut branch can be added directly; this is the solid-line residual structure. In this case, the stride of the convolution operation is 1, as shown in Figure 4 (b).
When the shape of the input feature matrix is inconsistent with that of the output feature matrix, the dashed-line residual structure is used. In this case, a convolution operation with a stride of 2 is performed on the shortcut branch so that the shape of its feature matrix is consistent with that of the main branch, and the two are then added, as shown in Figure 4 (c).
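The stated input shape (None, 56, 1) and output shape (None, 2, 256) can be checked with a small shape trace. The filter counts and the pooling stride come from the text; the remaining strides (2 for Conv1, 1 for Conv2_x, 2 for the first block of Conv3_x to Conv5_x) and the use of 'same' padding are assumptions that follow the usual ResNet18 layout.

```python
import math


def same_padding_length(length, stride):
    """Output length of a 1-D convolution or pooling with 'same' padding."""
    return math.ceil(length / stride)


# (stage name, number of filters, stride applied by the stage)
# Strides other than the pooling stride are assumptions, not taken from the text.
STAGES = [
    ("conv1", 32, 2),
    ("maxpool", 32, 2),   # pool size 3, stride 2
    ("conv2_x", 32, 1),   # solid-line blocks only
    ("conv3_x", 64, 2),   # first block uses the dashed-line shortcut
    ("conv4_x", 128, 2),
    ("conv5_x", 256, 2),
]


def trace_shapes(input_length=56, input_channels=1):
    """Return (name, length, channels) after each stage."""
    shapes = [("input", input_length, input_channels)]
    length = input_length
    for name, filters, stride in STAGES:
        length = same_padding_length(length, stride)
        shapes.append((name, length, filters))
    return shapes


shapes = trace_shapes()
for name, length, channels in shapes:
    print(f"{name:8s} -> (None, {length}, {channels})")
```

Under these assumptions the sequence length shrinks as 56, 28, 14, 14, 7, 4, 2, which is consistent with the (None, 2, 256) tensor passed to the LSTM network.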
Through experiments, it was found that adding an LSTM network helps to improve the prediction ability of the model. The final model includes two LSTM layers with 1,024 and 256 units, respectively. Dropout is used between the LSTM layers to prevent overfitting. Finally, the predicted load value is output through two Dense layers, with 64 neurons in the first layer and 1 neuron in the second layer.
Four indices are used to evaluate the model. These indices are calculated by (14), (15), (16), and (17), where $y_i$ is the actual load value at sampling point $i$, $\hat{y}_i$ is the forecast load value at sampling point $i$, and $N$ is the number of sampling points.
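Error indices of this kind can be computed as in the following sketch. The choice of MAE, MSE, RMSE, and MAPE is an assumption made for illustration (only MSE and APE are named elsewhere in the text), and the load values are made up.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean squared error."""
    return sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(mse(actual, predicted))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Actual values must be non-zero."""
    n = len(actual)
    return 100.0 * sum(abs((y - p) / y) for y, p in zip(actual, predicted)) / n


# Hypothetical actual and forecast load values at N = 3 sampling points.
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 190.0, 400.0]

scores = {
    "MAE": mae(y_true, y_pred),
    "MSE": mse(y_true, y_pred),
    "RMSE": rmse(y_true, y_pred),
    "MAPE": mape(y_true, y_pred),
}
print(scores)
```

For all four indices a smaller value indicates a better forecast.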

IV. EXAMPLE ANALYSIS
This paper used the publicly available online load data set published in Queensland, Australia from Jan. 1, 2006, to Dec. 31, 2010 [54]. The sampling interval is 30 minutes, including 48 sampling points every day, and the data set has a total of 87649 rows of data. It contains six feature parameters: load, dry-bulb temperature (drybt), dew point temperature (dewbt), wet-bulb temperature (wetbt), humidity, and price. The data for the first four years is the training set, the data from January 1 to November 30, 2010 is the validation set and the last month is the test set. The feature parameters are shown in Table 2.
A. DATA PROCESSING

1) CORRELATION ANALYSIS
The analysis of the data set shows that the load changes periodically over a week, and the other parameters also show this trend. Figure 5 shows how the feature parameters change over a week. The Pearson correlation coefficient was used to analyze the correlation of the feature parameters [55], and the correlation heat map is shown in Figure 6.
It can be seen from Figure 5 and Figure 6 that the load has a strong positive correlation with dry-bulb temperature, wet-bulb temperature, and price, and a strong negative correlation with dew point temperature and humidity. The work and rest habits of residents also affect the load. Therefore, information on holidays and weekdays is also used as feature parameters in training. In addition, weather factors have time-series and periodic features, and there is a strong correlation between the variables. In this paper, two weather variables, drybt and humidity, are selected, predicted, and analyzed.
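The Pearson correlation coefficient used for this analysis can be computed directly from its definition. The sketch below uses made-up series that are perfectly linearly related, so the values are not those of the Queensland data set.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Hypothetical samples: load rises with drybt and falls with humidity.
load = [5000.0, 5200.0, 5400.0, 5600.0]
drybt = [20.0, 22.0, 24.0, 26.0]
humidity = [80.0, 70.0, 60.0, 50.0]

r_load_drybt = pearson(load, drybt)
r_load_humidity = pearson(load, humidity)
print(r_load_drybt, r_load_humidity)
```

Applying `pearson` to every pair of feature parameters yields the matrix that a heat map such as Figure 6 visualizes.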

2) ABNORMAL DATA PROCESSING
Load data and meteorological data may contain abnormal values due to communication errors or data loss. In calculation and analysis, abnormal values distort the results and reduce the prediction accuracy, so they need to be eliminated. In this paper, the box plot is used to identify and correct the outliers in the data set.
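A box-plot based check can be sketched as follows. It flags values outside the whiskers, i.e., below Q1 - 1.5 IQR or above Q3 + 1.5 IQR. How the flagged values are corrected is not specified in the text; replacing them with the median is an assumption made here for simplicity, and interpolation from neighbouring sampling points may be preferable for a real load series.

```python
import statistics


def box_plot_bounds(values, k=1.5):
    """Lower and upper whisker limits of a box plot."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def correct_outliers(values, k=1.5):
    """Replace values outside the whiskers with the median (assumed correction rule)."""
    low, high = box_plot_bounds(values, k)
    median = statistics.median(values)
    return [median if (v < low or v > high) else v for v in values]


# Hypothetical load readings with one corrupted value at the end.
loads = [5000, 5100, 5200, 5300, 5400, 5500, 5600, 99999]

low, high = box_plot_bounds(loads)
cleaned = correct_outliers(loads)
print(low, high)
print(cleaned)
```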

3) RESTRUCTURING THE FEATURE PARAMETERS
To couple the characteristic information of the data and accurately mine the temporal characteristics between data points, we construct the load data at each time into time-series data with multiple feature parameters. After reconstruction, each load value has 56 feature parameters, including weather, price, holiday, weekday, sampling time, and the load values of the previous 48 sampling points. The data reconstruction process is shown in Figure 7. The size of the load vector at each sampling point is $8 + n$. For the sampling points $t_0 \sim t_{n-1}$, which have fewer than $n$ previous load values, the missing positions among the $n$ load positions are filled with 0. Starting from sampling point $t_1$, the load values of the previous sampling points are filled into the feature vector in turn. The feature vector of the load at sampling point $t_n$ is ($\mathrm{Weather}_{t_n}$, $\mathrm{Price}_{t_n}$, $\mathrm{Holiday}_{t_n}$, $\mathrm{Weekday}_{t_n}$, $t_n$, $\mathrm{Load}_{t_0}$, $\mathrm{Load}_{t_1}$, ..., $\mathrm{Load}_{t_{n-1}}$).
Similarly, the training data can be changed to retrain the hybrid ResNet-LSTM model for predicting the weather variables (drybt, dewbt, wetbt, and humidity). Taking drybt as the predicted variable, the data is reconstructed considering its time-series characteristics and combined with its historical data. The data reconstruction process is shown in Figure 8. To be consistent with the size of the load input vector, the size of the drybt vector is also $8 + n$: 8 relevant features (dewbt, wetbt, humidity, load, price, holiday, weekday, and sampling time) and the previous $n$ drybt sampling points. Blank positions of the vector can be set to 0 as needed.
In this paper, the value of $n$ is 48, and the feature vector of every sampling point is reconstructed in this way. Here, weather includes four variables: drybt, dewbt, wetbt, and humidity. For the sampling points $t_0 \sim t_{47}$, the feature vector contains zero padding, so these points do not participate in the training of the model.
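The reconstruction of the load feature vector can be sketched as below. The record layout (a list of dictionaries) and the key names are hypothetical. Instead of building zero-padded vectors for the first n sampling points and discarding them afterwards, the sketch skips those points directly, which gives the same training set.

```python
N_LAGS = 48  # n: number of previous load values in each feature vector


def build_feature_vectors(records, n_lags=N_LAGS):
    """Return (features, target) pairs with 8 + n_lags features each."""
    samples = []
    for t in range(n_lags, len(records)):  # skip t_0 ... t_{n-1}
        r = records[t]
        base = [
            r["drybt"], r["dewbt"], r["wetbt"], r["humidity"],  # weather
            r["price"], r["holiday"], r["weekday"], r["slot"],  # price, date, time
        ]
        lags = [records[k]["load"] for k in range(t - n_lags, t)]
        samples.append((base + lags, r["load"]))
    return samples


# Hypothetical records: 60 half-hourly sampling points with a rising load.
records = [
    {
        "drybt": 20.0, "dewbt": 12.0, "wetbt": 15.0, "humidity": 60.0,
        "price": 25.0, "holiday": 0, "weekday": 1, "slot": t % 48,
        "load": 1000.0 + t,
    }
    for t in range(60)
]

samples = build_feature_vectors(records)
features, target = samples[0]
print(len(samples), len(features), target)
```

For the weather variables, the same function can be adapted by exchanging the roles of `load` and the predicted variable, for example drybt.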

4) DATA NORMALIZATION
Different feature parameters have different properties and orders of magnitude. Training without normalization weakens the influence of features with lower orders of magnitude. In the experiment, Min-Max scaling is used to linearly transform the data $x$ into the interval [0, 1]:
$$x^{*} = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{18}$$
where $x^{*}$ is the value after normalization, $x_{\max}$ is the maximum value in the sample data, and $x_{\min}$ is the minimum value in the sample data.
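Min-Max scaling and its inverse, which is needed to convert predictions back to physical units, can be sketched as follows. Returning 0 for a constant feature is an assumption made to avoid a division by zero, and the sample values are made up.

```python
def min_max_scale(values):
    """Scale values linearly to [0, 1]; also return the minimum and maximum used."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # Constant feature: map every value to 0 (assumed convention).
        return [0.0 for _ in values], x_min, x_max
    scaled = [(v - x_min) / (x_max - x_min) for v in values]
    return scaled, x_min, x_max


def inverse_min_max(scaled, x_min, x_max):
    """Map scaled values back to the original range."""
    return [s * (x_max - x_min) + x_min for s in scaled]


# Hypothetical load samples.
load = [10.0, 20.0, 30.0]
scaled, x_min, x_max = min_max_scale(load)
restored = inverse_min_max(scaled, x_min, x_max)

print(scaled)
print(restored)
```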

B. EXPERIMENTAL CONFIGURATION
In the experiment, a computer workstation with an Intel(R) Core(TM) i7-9750H CPU and an NVIDIA Quadro GPU was used. The development environment combined TensorFlow-GPU 2.6.2, Keras 2.6.0, CUDA 11.6, and cuDNN 8.3.2. The software environment is the Keras framework based on TensorFlow. Keras provides a concise and consistent programming interface and is modular. It also supports the free combination of models, which helps users quickly understand the neural network architecture and reduces repeated code. The mean squared error (MSE) is used as the loss function, and the model is trained with the Adam optimizer [56].

C. ANALYSIS OF EXPERIMENTAL RESULTS
In the experiment, LSTM is used as the baseline model and the control variable method is used to optimize the parameters. To illustrate the benefit of the proposed model in short-term load forecasting, the model is compared with multiple linear regression (MLR), CNN, LSTM, CNN-LSTM, and ResNet methods used in short-term load forecasting [57].
It can be seen from Table 3, Table 4, and Table 5 that when the data volume is small, the simple and well-designed MLR method achieves evaluation indices close to those of the ResNet-LSTM model proposed in this paper. As the amount of data increases, the evaluation indices of the MLR model become larger, because linear regression is based on the assumption that the data changes linearly.
The evaluation indices of the other neural network models are larger than those of the ResNet-LSTM model. Hence, the prediction accuracy of the ResNet-LSTM model is higher than that of the other methods. Table 8 shows the load APE of the different models at each sampling point on Dec. 1, 2010. To make the data easier to read, the APE values in the table are scaled down by a factor of 100.
It can be seen that in the smooth stage of load change, the prediction of each model is accurate and there is little difference between the models. In the area with severe load fluctuation, the prediction results of ResNet are relatively accurate. The ResNet-LSTM model has a better ability to capture the load change trend than the ResNet model. Among the combined models, the predicted curve of the ResNet-LSTM model is closer to the real value than that of the CNN-LSTM model.
It can be seen from Figure 12 that the APE values of the six models for drybt forecasting at the 48 sampling points on Dec. 1, 2010, are 10.80, 4.24, 11.77, 7.54, 15.71, and 3.08, respectively. Figure 11 (a), (b), and (c) show the fitting comparison between the predicted and actual values of humidity for the different models. The predicted values of the different models deviate from the actual values to varying degrees. When the humidity fluctuates greatly, the ResNet-LSTM model can better capture the changing trend and has a good prediction effect.
Figure 12 shows the APE values of load, dry-bulb temperature, and humidity for the six models on Dec. 1, 2010. It can be seen visually that ResNet-LSTM has good accuracy in feature extraction and relationship processing of time series data.

E. COMPUTATIONAL COMPLEXITY ANALYSIS
To describe a deep learning model, in addition to accuracy, the number of floating point operations (FLOPs) and the number of parameters are normally used to illustrate the complexity.
In the ResNet-LSTM model, the number of parameters of a convolution kernel is given by (19) and, similarly, the number of parameters of a convolution layer is given by (20). The FLOPs are calculated according to (21), where $H$, $W$, and $C_{in}$ are the height, width, and number of channels of the input feature map, $K$ is the kernel width, and $C_{out}$ is the number of output channels [58]. The FLOPs and parameter counts of the different neural network models, obtained programmatically, are shown in Table 6. In addition, the running time required for load forecasting by the different models on Dec. 1, 2010, Dec. 1 to Dec. 2, 2010, and Dec. 1 to Dec. 7, 2010 is shown in Table 7; the running times in the table are in seconds.
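To make the complexity measures concrete, the sketch below counts the parameters and FLOPs of a single convolution layer. The formulas are a commonly used convention (K·K·C_in weights per kernel, one bias per output channel, and 2·H·W·(C_in·K·K + 1)·C_out FLOPs); that they coincide exactly with (19)-(21) is an assumption, and the layer sizes are made up.

```python
def conv_kernel_params(k, c_in):
    """Weights of one convolution kernel (without bias)."""
    return k * k * c_in


def conv_layer_params(k, c_in, c_out, bias=True):
    """Parameters of a convolution layer with c_out kernels."""
    per_kernel = conv_kernel_params(k, c_in) + (1 if bias else 0)
    return per_kernel * c_out


def conv_layer_flops(h, w, k, c_in, c_out):
    """FLOPs of a convolution layer, counting multiply and add separately."""
    return 2 * h * w * (c_in * k * k + 1) * c_out


# Hypothetical layer: 8 x 8 input feature map, 3 x 3 kernels, 32 -> 64 channels.
H, W, K, C_IN, C_OUT = 8, 8, 3, 32, 64

kernel_params = conv_kernel_params(K, C_IN)
layer_params = conv_layer_params(K, C_IN, C_OUT)
layer_flops = conv_layer_flops(H, W, K, C_IN, C_OUT)

print(kernel_params, layer_params, layer_flops)
```

Summing these quantities over all layers of a network gives totals of the kind reported in a table such as Table 6.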
It can be seen from Table 6 and Table 7 that although the training parameters and FLOPs of the ResNet-LSTM model are larger than those of other models, the time required for forecasting loads on different days is within 30 seconds. Considering the accuracy, the ResNet-LSTM model proposed in this paper is an effective short-term load forecasting method.

V. CONCLUSION
With the increasing accuracy requirements for short-term load forecasting in power systems, this paper proposed a combined model based on ResNet-LSTM, which uses the feature expression ability of ResNet to extract effective features and processes the temporal relationship through the LSTM network. The conclusions are as follows:
1) Considering the time-series feature of load and the characteristics of historical data, the load data of Queensland, Australia was reconstructed. The feature parameters include date factors, weather factors, economic factors, and the historical load data of the previous 48 sampling times.
2) Compared with other machine learning models, the ResNet-LSTM model proposed in this paper gives full play to the feature extraction advantages of ResNet. At the same time, it gives full play to the ability of the LSTM network to fit temporal and complex nonlinear relationships, and it further mines the latent temporal features of power load data. Experiments show that, compared with other short-term load forecasting methods, the model has better forecasting accuracy.
3) The weather variables also have time-series and periodic features. This paper also makes an experimental comparative analysis of two variables, dry-bulb temperature and humidity, which further shows the efficacy of the ResNet-LSTM model in processing time-series weather data.
To sum up, this paper not only proposes a short-term load forecasting combination model for multi-dimensional input feature parameters but also reconstructs the data, which provides ideas and references for researchers to further explore how to improve the accuracy of load forecasting for various smart grid applications.
XINFANG CHEN received the B.Eng. degree from Liaoning Petrochemical University, in 2000, and the M.S. degree in engineering from Dalian Maritime University, in 2008. He is currently an Associate Professor and the Master's Supervisor at the Institute of Disaster Prevention. He is mainly engaged in the research of big data storage and analysis, data visualization, machine learning, data mining, and emergency information processing technology.
YIQING LIU received the B.S. degree from the Hebei College of Science and Technology, in 2021. He is currently a Graduate Student with the Institute of Disaster Prevention. His research interests include big data analysis, data processing, machine learning, deep learning, distributed systems, and data visualization. He is mainly working on the application of network models based on time series analysis in different fields.
JILIN FENG received the B.S. degree from the China University of Geosciences, in 1984. He is currently a Professor and the Master's Supervisor with the Institute of Disaster Prevention. His main research interests include algorithm design, data analysis of remote sensing data, and disaster information processing technology. He is the Chairperson of the GIS Association of Institution of Disaster Prevention.