Heating Load Forecasting for Combined Heat and Power Plants Via Strand-Based LSTM

Heating load forecasting is the premise for guiding heating operation management and dispatching. It is a time series prediction problem that requires predicting the real-time heating load over the next 24 hours from available historical records and weather information. In this paper, we propose a model for short-term heating load forecasting based on a properly designed strand-based long short-term memory (LSTM) recurrent neural network. We present how the data are pre-processed and how the loss function is designed to improve the model's performance. Furthermore, an ensemble strategy is incorporated into the LSTM model to enhance its generalization and robustness. On offline (historical) testing data, the proposed model produces satisfactory predictions that meet the requirements of the local power plant. In addition to offline tests, we also deploy the model on an online system of a power plant in Shandong province, China. The model produced continuous forecasts without human intervention for four months during the heating season of 2018, and reported satisfactory online results comparable with the offline experiments on historical data.


I. INTRODUCTION
Load forecasting is essential for planning and optimizing the operation of large energy systems, as it assists in the reasonable dispatch of resources. Because of its significant economic and environmental benefits, short-term forecasting has been widely used in energy fields, such as electric load forecasting [1] and heating load forecasting [2], [3]. This paper focuses on heating load forecasting.
Heating load forecasting is the premise for accurate guidance of heating operation management and heat dispatching. During the heating season in China, which usually starts in the middle of November and ends in March of the following year, an accurate heating load forecast can significantly improve the stability and production efficiency of central heating systems. On the one hand, heating load forecasting is essential to the stability of the power system. Renewable energy, such as wind power, is severely dependent on weather and thus unstable. Therefore, power generation of other power sources (mainly thermal power) must be adjusted to maintain the stability of the power grid, and the ability of the thermal units to adjust power generation can be predicted if the heating demand is accurately forecasted. On the other hand, heating load forecasting is vital for power plants to work economically and efficiently. Based on short-term heating load forecasting, power plants can make detailed schedules to meet the heating demand, and resources such as fuel and staffing can be reasonably allocated. Accurate forecasts enable power plants to achieve better control of thermal units and reduce operating costs, which brings substantial economic benefits for heating systems.
The data studied in this research are historical records provided by the Electric Power Academy of Shandong Province. The data set is composed of weather information and the heating load, both of which are sampled every ten minutes. Our objective is to perform next-day forecasting of the heating load based on historical weather and heating load data, as well as the weather forecast for the next day.
Most simply, we can regard the problem as a static regression problem: we learn a regression model that approximates the mapping between weather information and heating load at each given time instance. Traditional machine learning methods have been applied to short-term load forecasting problems, such as linear regression [4], support vector machines (SVM) [5]-[7], artificial neural networks (ANNs) [8], [9], and modified machine learning methods such as SSA-SVR [10], [11]. However, Hippert et al. [12], after examining a collection of ANN-related papers, suggested that most of the works based on ANNs had overfitted the data. Moreover, most studies on heating load forecasting have focused on such machine learning methods. In [13], the authors found that bagging trees, boosting trees, and neural networks are efficient models that outperform multivariate linear regression. In [14], the authors proposed a hybrid model that combines multivariate adaptive regression splines (MARS) and extreme learning machines (ELM). In [15], the authors found that a 4-layer neural network outperformed ridge regression, Lasso, and gradient boosting.
Despite the popularity of the regression methods above, they do not work well for our problem, as will be demonstrated in Section IV, mainly because they neglect the temporal dependence of the data. Therefore, time series analysis models are more suitable. Popular time series models include the autoregressive moving average (ARMA) [16]-[19] and the autoregressive integrated moving average (ARIMA) [20]-[22]. However, these methods require either the time series or its difference sequence to be stationary. Furthermore, they mostly do not utilize long-term historical information in the time series data.
To resolve the aforementioned problems, we consider the Long Short-Term Memory (LSTM) network [23], which is effective on non-stationary time series data and capable of incorporating long-term dependencies in the data. LSTM is a Recurrent Neural Network (RNN) with a special structure. Since its repeating unit is composed of four gates (input gate, output gate, cell gate and forget gate), it can automatically select useful long-term information, discard irrelevant information, and incorporate short-term information. Consequently, LSTM works well on a large variety of problems, especially processing and analyzing non-stationary time series data, such as natural language processing [24], machine translation [25], [26], and speech recognition [27]. Although the LSTM model is popular in the fields above, to the best of our knowledge it is new to heating load forecasting. In [28], the authors compared the performance of a nonlinear autoregressive exogenous RNN and SVM, and found that the RNN yields higher accuracy. However, the performance of LSTM in heating load forecasting remains unexplored. This paper is focused on developing a 24-hour-ahead heating load forecasting model based on LSTM. As will be shown in Section IV, a direct application of LSTM (the base LSTM) is not ideal for our problem, and modifications have to be made. Furthermore, to make the forecasting more stable, an ensemble strategy is proposed. Meanwhile, given that the heating load exhibits large fluctuations, especially at the end of the heating season, we introduce proper smoothing and local scaling to address these issues.
The paper is organized as follows. We provide detailed descriptions of our datasets in Section II. Section III presents the proposed LSTM model. Section IV presents experimental details and results. We conclude this paper in Section V.

II. DATA DESCRIPTION

A. DATA SOURCE
The data are provided by the Electric Power Academy of Shandong Province in China and consist of historical records from City A and City B, two cities in Shandong Province. The data are composed of historical weather information and the heating load. The historical weather data include measured weather data, taken every ten minutes, and weather forecast data, given hourly. The data of the two cities cover two heating seasons: one from 20th December 2016 to 15th March 2017, the other from 15th November 2017 to 15th March 2018.

B. DATA ATTRIBUTES
The weather conditions, especially temperature, have a significant impact on the heating demand [2]. The given weather information includes temperature, pressure, humidity, and wind speed. Some statistics of the measured weather data are listed in Table 1.
Weather conditions vary between the cities, as shown in Table 1. The temperature during the heating season in City A was on average higher than in City B, but with lower variation. The average wind speed and pressure in City A were higher than in City B, while the average humidity was lower.
The relations of the weather variables with the heating load are shown in Fig. 1. In the heating seasons, the temperature has a negative correlation with the heating load, and the pressure appears to have a positive correlation. For wind speed and humidity, a correlation with the heating load is hardly visible. However, they still have some influence on the heating load: empirically, the prediction accuracy drops if these variables are excluded.
In practice, since our task is to predict the heating load of the next 24 hours, we can only use weather forecast data for future information. The weather forecast data are provided by the Meteorological Bureau of Shandong Province on an hourly basis. In order to make a heating load forecast every ten minutes for the next 24 hours, we linearly interpolate all the hourly weather forecasts. Fig. 2 compares the measured and forecast temperature data of City A. We can see that the measured and forecast data match well, which means the forecast data are reliable. Therefore, the interpolated weather forecast data are used during the training and testing of the models.
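As an illustration, the upsampling amounts to ordinary linear interpolation onto the 10-minute grid. The following is a minimal sketch using pandas, with made-up timestamps and values; it is not the production pipeline.

```python
import pandas as pd

# Hypothetical example: upsample hourly forecast temperatures to the
# 10-minute grid of the measured data via linear interpolation.
hourly = pd.DataFrame(
    {"temperature": [2.0, 3.5, 3.0]},
    index=pd.date_range("2018-01-01 00:00", periods=3, freq="h"),
)
ten_min = hourly.resample("10min").interpolate(method="linear")
print(ten_min)  # 13 rows: 00:00, 00:10, ..., 02:00
```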

C. HEATING LOAD
The heating loads of the two cities are plotted in Fig. 3 and Fig. 4, respectively. Although the general trend of the heating load is similar, its magnitude in City A is much greater than in City B: the average heating load is 1267.258 in City A versus 831.112 in City B. This indicates that heating demand can vary significantly among cities, and suggests that it may be better to train models separately for the two cities, or to train a model on one city and transfer it to the other by fine-tuning. Besides, the heating load increases and decreases drastically at the beginning and the end of the heating season, respectively. Such large fluctuations pose difficulty for the training of LSTM; therefore, data pre-processing methods, such as smoothing and scaling, are needed.

III. METHOD
This section shows how we exploit our data to build a practical model for accurate heating load forecasting. First, we introduce a base model, and then several modifications are proposed to improve its performance.

A. BASE MODEL
For time series prediction, it is natural to adopt the Long Short-Term Memory (LSTM) network, upon which we build our base model.
In [23] and [29], the structure of the LSTM unit was proposed to construct an RNN that overcomes the exploding and vanishing gradient problems. Different from a vanilla RNN unit, the LSTM unit has four computing gates working for different purposes, as illustrated in Fig. 5.
In a vanilla RNN unit, the hidden state is computed by

h_t = tanh(W_ih x_t + b_ih + W_hh h_{t-1} + b_hh),    (1)

where h_t ∈ R^h is the hidden state at time t, x_t ∈ R^d is the input, and W_ih ∈ R^{h×d}, W_hh ∈ R^{h×h}, b_ih, b_hh ∈ R^h are trainable parameters. LSTM units are more complicated and contain four times as many parameters as vanilla RNN units. Apart from the hidden state h_t, an LSTM unit also maintains a cell state c_t to represent the memory. There are many variants of LSTM units; the one we adopt is given as follows [29]:

i_t = σ(W_ii x_t + b_ii + W_hi h_{t-1} + b_hi),
f_t = σ(W_if x_t + b_if + W_hf h_{t-1} + b_hf),
g_t = tanh(W_ig x_t + b_ig + W_hg h_{t-1} + b_hg),    (2)
o_t = σ(W_io x_t + b_io + W_ho h_{t-1} + b_ho),
c_t = f_t * c_{t-1} + i_t * g_t,
h_t = o_t * tanh(c_t),

where i_t, f_t, g_t, o_t are the input, forget, cell and output gates, the W's are trainable weights, the b's are trainable biases, * is the Hadamard product, and σ is the sigmoid function.
As illustrated in Fig. 6, our base model contains two stacked LSTM layers and a linear output layer; the output of the first LSTM layer serves as the input of the second. The network is designed to have exactly T = 1152 time steps, which covers a range of 8 days since the data are sampled every 10 minutes. The first T_h = 1008 time steps, i.e., 7 days, cover the history part of the data, including the historical weather forecast data and heating load. The last T_f = 144 time steps correspond to the predicted heating load for the next day. Here, we assume that data older than one week have little effect on the prediction.
At the t-th time step, a 5-dimensional vector x_t is fed to the model. It consists of the temperature T_t, pressure P_t, wind speed W_t and humidity H_t, along with the heating load of the previous time step, L_{t-1}, as a history feature.
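A minimal PyTorch sketch of this architecture is given below. The hidden size is an assumption for illustration; the paper does not restate it here.

```python
import torch
import torch.nn as nn

class BaseModel(nn.Module):
    """Two stacked LSTM layers followed by a linear output layer.
    hidden_size = 64 is an assumed value, not taken from the paper."""
    def __init__(self, input_size=5, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size,
                            num_layers=2, batch_first=True)
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, x):                  # x: (batch, 1152, 5)
        h, _ = self.lstm(x)                # h: (batch, 1152, hidden)
        return self.linear(h).squeeze(-1)  # load estimate at every step

model = BaseModel()
x = torch.randn(2, 1152, 5)   # T = 1152 ten-minute steps, 5 features
y_hat = model(x)              # (2, 1152); the last 144 steps are the forecast
```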
To acquire a sufficient number of samples for training the model, we slice the original data {(T_t, P_t, W_t, H_t, L_t)}_{t=1}^{T_d}, where T_d is the total number of records, with a sliding window of length T = 1152, producing overlapping data strands. The data of the two heating seasons are sliced separately so that each data strand is completely contained in one of the two heating seasons.
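For concreteness, the slicing could look like the sketch below; the stride of one time step is our assumption, as the paper only specifies a sliding window.

```python
import numpy as np

def slice_strands(season, T=1152, stride=1):
    """Cut one heating season, an array of shape (T_d, 5) holding
    [temperature, pressure, wind, humidity, load] rows, into
    overlapping data strands of length T."""
    return np.stack([season[s:s + T]
                     for s in range(0, len(season) - T + 1, stride)])

season = np.random.rand(5000, 5)   # stand-in for one season's records
strands = slice_strands(season)    # shape: (3849, 1152, 5)
```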
During training, in one forward pass the whole network takes a strand X_i as input (with time length T = 1152), and the output of the future part, Ŷ_i = (L̂_{i,T_h+1}, ..., L̂_{i,T}), is compared with the ground truth to compute the loss.

During inference, the input consists of two parts. The input of the history part is {x_{i,t}}_{t=1}^{T_h}, which is the recorded historical data. The input of the future part, x_{i,t} = (T_t, P_t, W_t, H_t, L̂_{i,t-1}), includes only weather features obtained from weather forecasting together with L̂_{i,t-1}, the output of the network at the previous time step, since the heating load in the future is unknown at the current time.
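The inference procedure can be sketched as the autoregressive roll-out below. The single-step interface `model.step` is hypothetical, introduced only to make the feedback of L̂_{i,t-1} explicit; the actual network may expose this differently.

```python
import torch

@torch.no_grad()
def forecast(model, history, future_weather):
    """history:        (1, 1008, 5) recorded weather + load features.
    future_weather: (1, 144, 4)  interpolated forecast weather only.
    `model.step(x_t, state)` is a hypothetical single-step interface
    returning (predicted load, new recurrent state)."""
    state = None
    for t in range(history.size(1)):        # warm up on the history part
        load_hat, state = model.step(history[:, t], state)
    preds = []
    for t in range(future_weather.size(1)): # roll out the next 24 hours
        x_t = torch.cat([future_weather[:, t], load_hat], dim=-1)
        load_hat, state = model.step(x_t, state)
        preds.append(load_hat)
    return torch.cat(preds, dim=1)          # (1, 144) predicted loads
```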
Although LSTM performs well on non-stationary time series data, we have the following reasons to modify the base model to boost its performance:

1) IMPROVING PREDICTION AT THE END OF EACH HEATING SEASON
The heating season ends in mid-March each year. During the final month of each heating season, the heating load drops dramatically as the heating machines are gradually shut down. However, this pattern occurs only once a year and covers 1-2 weeks, which makes it hard to learn. Therefore, we reconsider the pre-processing of the data so that the model learns the importance of this drop, as detailed in Sections III.B and III.C.

2) BETTER GENERALIZATION
The model is ultimately to be deployed on an online system to produce heating load forecasts automatically and continuously with minimal human interaction. Therefore, it should run for as long as possible with as few adjustments as possible, which requires the model to generalize well on new data and be robust to input noise. For this, an ensemble strategy is adopted, as detailed in Section III.D.

B. SMOOTHING
The recorded heating load is rather noisy. Proper smoothing of the data can make it easier for the model to capture local features and the major trend.
The heating load is smoothed by a sliding window of size b: every data point is replaced by the average of the b consecutive data points up to and including itself, which ensures that no future information is required to calculate the average. To be more precise, at each t, the smoothing operation can be formulated as

ỹ_t = (1/b) Σ_{j=0}^{b-1} y_{t-j},

where y_t is the unprocessed value of the heating load at time t, and b is the size of the sliding window. We choose b = 12 (i.e., a two-hour window) in practice.
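A direct sketch of this causal moving average:

```python
import numpy as np

def causal_smooth(y, b=12):
    """Replace each point by the mean of the b most recent points
    (including itself); no future values enter the average."""
    y = np.asarray(y, dtype=float)
    out = np.empty_like(y)
    for t in range(len(y)):
        out[t] = y[max(0, t - b + 1):t + 1].mean()
    return out

load = 1000 + 100 * np.random.randn(1008)  # stand-in noisy load series
smoothed = causal_smooth(load)             # b = 12, a two-hour window
```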

C. LOCAL RESCALING
Attributes of the raw data are of different scales and distributions and need to be scaled appropriately before being fed to the model; otherwise, we often observe slow convergence of the training and poor performance of the trained model. It is natural to pre-process the raw data to remove the effect of physical units while maintaining the statistical characteristics of the data. For that, we apply the following Min-Max rescaling to the training data to transform all features into a fixed range [m, M]. For a given time series {x_i}, this can be formulated as

R(x_i) = m + s_x (x_i - x_min),    (5)

with

s_x = (M - m)/(x_max - x_min),    (6)

where x_min = min_i x_i and x_max = max_i x_i. In practice, we take m = -0.7 and M = 0.7.

The data are transformed by applying (5) to each data attribute independently, i.e.,

X̃ = R(X),    (7)

where R represents a vector function that transforms X component-wise and s_x is a component-wise scaling factor of the entire time series. Since the proposed model only takes in data strands of a week in length, we have another option: to scale each data strand locally rather than globally. For the i-th data strand, this can be formulated as

X̃_i = R_i(X_i),    (8)

where R_i represents the function fitted only on the data strand X_i. The training target L_{i,t} shares the same transformation as the component L_{i,t-1} of x_{i,t}. Note that s_{x_i} is specific to X_i and is calculated locally. In this situation, each data strand is transformed by a different scale factor s_{x_i}, but has the same range after rescaling. Moreover, the rescaling within a given data strand reflects only the variation in the underlying period and is unaffected by the absolute level of the entire time series.
During inference, the scales s_{x_i} are fitted only on the history part of each data strand and are then applied to transform both the history and the future part, so that no future information is used in the calculation.
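A sketch of this inference-time rescaling, fitting the Min-Max scales on the 1008-step history part only and applying them to the whole strand:

```python
import numpy as np

M, m = 0.7, -0.7                   # target range from Section III.C

def local_rescale(strand, T_h=1008):
    """Fit per-attribute Min-Max scales on the history part of one
    strand, then transform both its history and future parts."""
    hist = strand[:T_h]
    x_min, x_max = hist.min(axis=0), hist.max(axis=0)
    s = (M - m) / (x_max - x_min)       # per-attribute scale factors
    return m + s * (strand - x_min), s  # keep s to undo the scaling later

strand = np.random.rand(1152, 5) * 100  # stand-in data strand
scaled, s = local_rescale(strand)
```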

D. ENSEMBLE
Inspired by the EnResNet of [30], we ensemble the LSTM networks in a similar way: we train n LSTMs with different initializations in parallel and average their outputs when making predictions. This simple ensemble method improves generalization and robustness.
In addition, during the training of each model, Gaussian noise is added to the hidden states between every pair of consecutive time steps in both LSTM layers. This makes the model resistant to perturbations of the input data. During inference, we evaluate the model several times with noise injection and then average all the outputs.

Specifically, a native LSTM cell computes the hidden state by

h_t = U(h_{t-1}, x_t),

where U is given by (2). The noise injection is given by

h̃_t = U(h_{t-1}, x_t) + γε,  ε ~ N(0, I),

where γ = β √Var(U(h_{t-1}, x_t)) and β is a tunable parameter, set to 0.1 in our experiments. The noise is sampled independently at each evaluation.
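A sketch of the injection, with an `nn.LSTMCell` standing in for one unrolled layer (the hidden size is illustrative):

```python
import torch
import torch.nn as nn

def inject_noise(h, beta=0.1):
    """Perturb a hidden state by Gaussian noise with scale
    gamma = beta * sqrt(Var(h)), beta = 0.1 as in the paper."""
    gamma = beta * h.var().sqrt()
    return h + gamma * torch.randn_like(h)

cell = nn.LSTMCell(5, 64)          # one layer; hidden size 64 is assumed
x = torch.randn(4, 1152, 5)
h, c = torch.zeros(4, 64), torch.zeros(4, 64)
for t in range(x.size(1)):
    h, c = cell(x[:, t], (h, c))
    h = inject_noise(h)            # noise between consecutive time steps
# At inference, several such noisy evaluations (over the n ensembled
# models) are averaged to produce the final prediction.
```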

E. LOSS FUNCTION
It is natural to choose the following ℓ1 loss to measure the absolute error of the prediction:

L = (1/(N T_f)) Σ_{i=1}^{N} Σ_{t=T_h+1}^{T} |ŷ_{it} - y_{it}|,    (11)

where ŷ_{it} is the model's output at the t-th time step for the i-th data strand, y_{it} is the corresponding ground truth, N is the total number of samples, and t ranges over the future part of the strand. Its length is T_f = 144 in our setting, since the model predicts the heating load in 10-minute increments for the next 24 hours.

However, under local rescaling (Section III.C), the absolute errors on different data strands are not of equal value, because different rescaling functions are applied, which eventually hurts the prediction accuracy (see Table 3). To solve this issue, a weighted ℓ1 loss is introduced to correct the differences in the weights of the samples:

L_w = (1/(N T_f)) Σ_{i=1}^{N} w_i Σ_{t=T_h+1}^{T} |ŷ_{it} - y_{it}|,    (12)

with weights

w_i = 1/(s_i (ȳ_i + α)),    (13)

where ȳ_i is the mean heating load of the i-th strand on the original scale and α is a constant that regularizes the weights. As described in Section III.C, the scaling is a linear mapping: for the i-th data strand, the heating load at time step t is transformed by

y_{it} = s_i y^{orig}_{it} + d_i,

where s_i = (M - m)/(max_t y^{orig}_{it} - min_t y^{orig}_{it}) and d_i = m - s_i min_t y^{orig}_{it}. The absolute error on the original scale is therefore

|ŷ^{orig}_{it} - y^{orig}_{it}| = (1/s_i) |ŷ_{it} - y_{it}|,

which indicates that we should multiply the absolute error calculated on the scaled data by 1/s_i. Moreover, we notice that the absolute errors of different samples are still not of equal importance even on the original scale. Since the error rate is a more frequently used measure of forecast quality, we further correct the loss on each sample to

|ŷ_{it} - y_{it}| / (s_i (ȳ_i + α)).

This approximates the error rate and leads to the weighted ℓ1 loss described in (12) and (13). Therefore, the scale and level of the original sequence are taken into account when weighing the importance of each sample, which makes (12) a more consistent loss.
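A sketch of (12)-(13) on a batch of strands; the value of α is illustrative, as it is not reported here:

```python
import torch

def weighted_l1(y_hat, y, s, y_bar, alpha=1.0):
    """y_hat, y: (N, 144) rescaled predictions and targets (future part).
    s:        (N,) per-strand scale factors s_i.
    y_bar:    (N,) per-strand mean loads on the original scale."""
    w = 1.0 / (s * (y_bar + alpha))      # weights w_i of (13)
    err = (y_hat - y).abs().mean(dim=1)  # mean absolute error per strand
    return (w * err).mean()              # weighted l1 loss of (12)

y_hat, y = torch.rand(8, 144), torch.rand(8, 144)
s = 0.001 + 0.01 * torch.rand(8)         # stand-in scale factors
y_bar = 1000 + 200 * torch.rand(8)       # stand-in mean loads
loss = weighted_l1(y_hat, y, s, y_bar)
```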

IV. RESULTS AND DISCUSSION
In this section, we apply our model to the real data described in Section II to evaluate its performance. First, we explore how the smoothing, scaling, and weighted loss function described above influence the performance of the model on off-line data, and hence determine the best model settings. Then, we describe how the best model is applied to the online system and report its performance.

A. IMPLEMENTATION DETAILS
We first integrate the weather forecast data into the historical data. As mentioned above, the weather forecast data are interpolated so that they align with the historical data by timestamp. Then we smooth the heating load as described in Section III.B. If local rescaling is not adopted, the global scaling (7) is applied at this stage. The data are then sliced to generate data strands. If local rescaling is adopted, we perform the local transformation (8) on each data strand. We evaluate the model using the mean absolute percentage error (MAPE), which computes a percentage error at each forecast point and then averages over all points:

MAPE = (100%/(N T_f)) Σ_{i=1}^{N} Σ_{t=T_h+1}^{T} |ŷ_{it} - y_{it}| / y_{it}.

We scale the output of our model back to its original scale before evaluating the MAPE.
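For reference, the MAPE computation amounts to:

```python
import numpy as np

def mape(y_hat, y):
    """Percentage error at each forecast point, averaged over all
    points; inputs must be on the original (un-rescaled) scale."""
    y_hat, y = np.asarray(y_hat, float), np.asarray(y, float)
    return 100.0 * np.mean(np.abs(y_hat - y) / y)

print(mape([110.0, 95.0], [100.0, 100.0]))  # -> 7.5
```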
During training, we use the stochastic gradient descent (SGD) optimizer with an initial learning rate of 0.05, which decays to 1/10 of its value every 30 epochs. The LSTM models are trained within 100 epochs, and training takes 117 seconds per epoch on average on an NVIDIA GeForce 1080 Ti GPU. The models are implemented in PyTorch.
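These settings correspond to the following PyTorch configuration; the network and the training pass are elided, and StepLR reproduces the decay-by-10-every-30-epochs schedule:

```python
import torch

model = torch.nn.LSTM(5, 64, num_layers=2, batch_first=True)  # stand-in
opt = torch.optim.SGD(model.parameters(), lr=0.05)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=30, gamma=0.1)
for epoch in range(100):
    ...          # one training pass over the data strands (elided)
    sched.step()
```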

B. OFF-LINE TESTS
The historical data cover two heating seasons. The first 70% of the data, roughly the first heating season and half of the second, are used for training; the remaining 30% are used for testing. The data from City A and City B are split at 2018/1/21 13:10 and 2018/1/21 22:40, respectively. We train and test our model separately on the two datasets.

First, we compare the performance of the LSTM base model with several traditional statistical models to demonstrate the advantage of deep models. To account for temporal factors, we extract date and time features via one-hot encoding for the linear, random forest, and gradient boosting models. The results are shown in Table 2. The traditional models generally produce much higher MAPE on test data than on training data, indicating that they suffer from over-fitting. In contrast, our base LSTM model shows competitive results on the test sets, reducing the MAPE from around 10 to 4.80 for City A and 7.92 for City B.
To improve the base LSTM model, we proposed the two data pre-processing methods of Sections III.B and III.C. We now demonstrate their effectiveness; the results are shown in Table 3. On top of the base model, adding smoothing causes a slight drop in both training and testing errors for City A, whereas it does not benefit the model on City B. As for the local rescaling, the MAPE drops significantly for both cities, indicating that it is crucial to the model.
In particular, the forecast results and daily MAPE under different settings on the test set of City A are shown in Fig. 9. In the last several days of the test set, which correspond to March 2018, the heating load encounters a sharp decline, and the MAPE is high at this stage. Our proposed methods help reduce the MAPE of these days and thus yield better overall performance.
The local rescaling brings a notable decrease in MAPE. However, as mentioned above, the unweighted ℓ1 loss does not match this option. Experiments show that using the weighted ℓ1 loss leads to a better result: the MAPE drops by around 0.1 compared with using local rescaling alone.
We finally obtain our proposed model by combining the above methods, adopting the smoothing, local rescaling, and the weighted ℓ1 loss. In City A, our model produces the best results among all the experiments discussed above. In City B, the smoothing operation can be considered optional since it does not improve the model. Furthermore, we explore the performance of ensembling the models. There is a slight decrease in the MAPE of the ensembled models. Moreover, they may perform better than a single model when there are missing entries in the input data. In practical applications, data with noise or missing entries are inevitable due to the complex working environment of the power stations. To simulate such a situation, we randomly set each entry of the input data to 0 with probability 0 < p < 1. In City B, as p rises, the MAPE of the ensembled models is generally lower than that of the single model by around 0.2. However, the ensembled models do not show a clear improvement in City A.
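The corruption used in this robustness test can be sketched as:

```python
import numpy as np

def drop_entries(x, p, seed=0):
    """Zero out each entry of the input independently with
    probability p, simulating missing data."""
    rng = np.random.default_rng(seed)
    return np.where(rng.random(x.shape) < p, 0.0, x)

strand = np.random.rand(1152, 5)
corrupted = drop_entries(strand, p=0.1)  # roughly 10% of entries zeroed
```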
With the above results, the best model for City A is our proposed model (Index 4 in Table 3) because of its lowest MAPE on test data. For City B, we suggest the model without smoothing, preferably in its ensembled version, because the heating load of City B is noisier and has a larger variation than that of City A.

C. ONLINE TESTS
We deployed our model on the online system, whose architecture is shown in Fig. 10.
The data collection module collects and integrates the recorded historical data into a database. Every day at 8 a.m., a scheduled program fetches the most recent 1008 records along with 24 hours of weather forecast data and calls the prediction module. The prediction module first pre-processes the loaded data and evaluates the model pre-trained on the off-line data as described in the previous subsection. The results are then sent to the ''intelligent data center'' and made available to the decision-makers.
An automated model-updating mechanism was designed to re-train the model on the most recent data. To make it simple to operate, we built a neat, user-friendly interface that re-trains the model with pre-configured hyper-parameters, so users can update the model through very simple actions while monitoring its performance. In practice, the training interface can also be called by the training program to train new models on the latest data. Newly trained models are saved to the model warehouse and can be loaded by the prediction interface.
Our model, trained on historical data from December 2016 to March 2018, was submitted to the online system to forecast the heating load in City A during the heating season of 2018. Due to the limited computing resources of the online system, the ensembled version was not deployed. From November 2018 to March 2019, the model was re-trained with the automated training program three times, in December 2018, January 2019, and February 2019, and the updated model was deployed on a monthly basis. At the end of the heating season, the model reported a 4.09% MAPE for the entire season, which is comparable to the results of the off-line tests on City A. See Fig. 11 for a visual comparison between the predicted and measured heating load during the heating season of 2018 in City A.

FIGURE 11. Online test result in City A. The blue line represents the measured heating load during the heating season; the yellow line represents the one-day-ahead forecast given by our online model. In December 2018, the heating load fluctuated by a large margin due to manual operations within the power plant or mistakes in the data collection procedure; since the online model takes a week of historical data as input, this abnormal data affected the forecast performance.

V. CONCLUSION AND FUTURE WORK
We proposed a novel heating load prediction model based on LSTM. With a carefully designed network structure, data pre-processing operations, and loss function, our model achieved satisfactory accuracy on historical data provided by the Electric Power Academy of Shandong Province in China. The model was further deployed on an online system in a city of Shandong Province, where it made continuous, fully automatic forecasts and produced highly reliable results during the heating season of 2018.
For future work, we would like to adapt the model to other cities in the northern provinces of China using transfer learning techniques, as data from more power plants have been collected. In addition, the network architecture can be further improved as more complex data are introduced.