Short-Term Load Forecasting using Bi-directional Sequential Models and Feature Engineering for Small Datasets

Electricity load forecasting enables grid operators to optimally implement the smart grid's most essential features, such as demand response and energy efficiency. Electricity demand profiles can vary drastically from one region to another on diurnal, seasonal and yearly scales. Devising a load forecasting technique that yields accurate estimates on diverse datasets, especially when the training data is limited, is therefore a major challenge. This paper presents a deep learning architecture for short-term load forecasting based on bidirectional sequential models in conjunction with feature engineering that extracts hand-crafted derived features to aid the model in learning and prediction. In the proposed architecture, named Deep Derived Feature Fusion (DeepDeFF), the raw input and hand-crafted features are trained at separate levels and their respective outputs are then combined to make the final prediction. The efficacy of the proposed methodology is evaluated on datasets from five countries with completely different patterns. The results demonstrate that the proposed technique is superior to the existing state of the art.


I. INTRODUCTION
Smart grid, in simple terms, implies monitoring and control of the power system's assets in generation, transmission, distribution, and utilization, to achieve high efficiency and reliability at low operational costs. Its cardinal feature, demand response, can only be fully realized through accurate forecasting of various variables, the most important of which is the electrical load [1]. Artificial intelligence is fast becoming essential for data analytics and enhanced control of modern power systems. One of its most desirable applications in recent times is load forecasting through machine learning for predicting the trends in energy demand, so that control decisions can be proactively optimized.
Long-term [2], mid-term [3], and short-term [4] are the different types of load forecasting found in the literature based on their duration of prediction from years to minutes. Short-term load forecasting (SLF) is more difficult than the mid and long term forecasting because of the greater variance in the respective energy consumption patterns [5]. The advantage of SLF is that it provides better insight into the electricity consumption patterns and a greater degree of freedom for demand-side management. Also, SLF can be aggregated to get mid-term and long-term forecasts. Therefore this paper focuses on SLF.
One of the most challenging problems in SLF is posed by small datasets, particularly in the case of individual households, which usually exhibit wide variations in energy consumption over short intervals, making it harder for deep learning models to learn the underlying patterns [5]. Before the deep learning era, a lot of research went into the engineering of hand-crafted features, which were required as inputs to machine learning algorithms. The capability of deep learning to extract implicit features removed the need for such complicated pre-processing of raw data. However, deep learning models require large training data to extract useful features, more so for datasets with high variance. This paper presents a novel deep learning architecture that combines hand-crafted features with raw data, such that the deep learning model can work well for SLF on small datasets. The results demonstrate significant improvements in performance from the proposed architecture, especially for small datasets.

II. LITERATURE REVIEW
A significant amount of research has been carried out in recent times to develop SLF as the enabling tool for efficient monitoring and control of power systems. Before the deep learning era, hand-crafted features were fed to a machine learning model for making predictions. In [6], feature engineering was done to design a feature vector by performing entropy analysis with a specific tolerance band and auto-correlation function. The designed feature vector was then passed through an artificial neural network (ANN) for prediction. Ferreira and da Silva used a Bayesian approach to address neural network complexity and variable selection [7]. The approach is theoretically grounded but relies on various assumptions regarding the distribution of the network parameters, requires three relevance thresholds, and is computationally expensive. A phase-space embedding method was used for input variable selection, which allowed past values of the predicted quantity to be included in the input vector [8]. A neural network based approach to forecast the next 24-hour load on medium and low voltage substations was presented in [9]. The use of separate models for daily average power and for intraday variation in power improved the prediction accuracy compared to the model based on time series.
Learning the daily routine of the usage of various appliances can help in better forecasting of a household's load profile. It was shown in [10], [11] that using the consumption data of the appliances together with the aggregated data of the whole house as the input to long short-term memory (LSTM) models gave better results than using the whole-house readings alone.
Recently, recurrent neural networks (RNN) have become a popular choice for load forecasting. In [12], machine learning models were used for predicting the energy demand on the publicly available RTE dataset [13]. The performances of RNN and support vector machine (SVM) models were compared using different input features. The models were evaluated on a test set of 10 days of the year 2017. The results demonstrated that RNN performed better, with a MAPE of 3.52%, compared to SVM with a MAPE of 14.00%.
A recent study [5] demonstrated how individual household level load forecasting can be challenging because of the different energy consumption patterns of individual consumers [14]. A two-layer LSTM model was proposed and compared with other models based on backpropagation neural network (BPNN), k-nearest neighbour (KNN), extreme learning machine (ELM) and input scheme combined with a hybrid forecasting framework (IS-HF). Individual models for each household were trained and the best average MAPE of 44.06% was achieved through LSTM. It was also demonstrated that aggregating these individual forecasts resulted in nearly the same net MAPE of ∼8% that was obtained when the aggregated data of the consumers was trained and tested on a single model. This difference in the MAPE of individual versus aggregated forecasts established that individual SLF is harder than aggregated SLF; however, the advantage of individual SLF is that it provides better insight into the trend of each constituent customer, and the individual forecasts can easily be aggregated to provide the net trend.
Electricity demand is influenced by weather, holidays, time of day, etc. A time-dependent convolutional neural network (TD-CNN) and a cycle-based long short-term memory (C-LSTM) for short- and medium-term load forecasting were presented in [15]. The weekly electric load was arranged in image format, on which the TD-CNN was run. The C-LSTM helped to extract time dependencies between sequences. The models performed better than the traditional LSTM model while reducing the training time.
Another important application of SLF is in energy trading, which is a complex process due to non-periodic variations in energy consumption. Accurate forecasting of the hourly spot price is the key to making the best trading decisions, which is vital for investors and retailers in the electricity market. A hybrid model comprising ARIMA, multiple linear regression (MLR), and the Holt-Winters model was proposed in [16]. The hybrid model was tested on the Iberian electricity market dataset to forecast hourly spot prices for various numbers of days. A hybrid model based on non-linear regression and SVM, tested on ERCOT data [18], was proposed in [17]. This hybrid model achieved a MAPE of 7.30%, compared to 8.99% and 8.63% for the individual models, respectively. Improving the forecasting accuracy of a standard LSTM model by feeding it processed features rather than raw data was proposed in [19]. The power load sequence was decomposed by complementary ensemble empirical mode decomposition (CEEMD), and the approximate entropy (AE) values of the obtained subsequences were then calculated. The subsequences with similar AE values were merged into new sequences to form the inputs of the load forecasting model. This reduced the complexity of the power load sequence and improved the accuracy of load forecasting. The vanilla LSTM network was improved in [20] by cleaning and processing the raw load data using the isolation forest algorithm.
Electric load forecasting requires training of a large number of neurons in the hidden layer, which increases the size of the network and slows the overall training process. To reduce this overhead, a multi-column radial basis function (MCRF) network with an error correction algorithm, designed to reduce the number of hidden neurons, was proposed in [21]. It was shown that MCRF with only 50 neurons in the hidden layer took only 10 minutes to train and achieved a MAPE of 4.59%, whereas other models with more than 150 neurons achieved a better MAPE of 1.77% but took hours to train.
The accuracy of SLF can be improved through careful analysis of the load data to determine the effectiveness of the selected features. A feature selection technique was proposed in [22] in which the bisecting K-means algorithm was used to cluster the load data with high similarity for a forecast date. Ensemble empirical mode decomposition (EEMD) helped to combine components with similar entropy. A bidirectional recurrent neural network (BRNN) model was proposed to forecast the load of the network. The model was verified on two datasets, including a dataset from a load forecasting competition. The results showed that the BRNN model performed better than even the winner of the competition.
The literature survey therefore implies that a better load forecasting technique with reduced statistical error is a hot topic of research for modern power systems.

III. PROPOSED METHODOLOGY
Deep learning solutions, particularly sequential models such as RNN and LSTM, have recently become popular choices for load forecasting [5], [10], [15]. LSTM [23] has become a state-of-the-art tool for time series problems owing to its ability to learn temporal patterns in sequential data. This paper presents a novel architecture named deep derived feature fusion (DeepDeFF), comprising a bidirectional sequential model with feature engineering, for realizing a more accurate SLF technique.

A. Bidirectional Sequential Models
A bidirectional model trains forward and reverse nodes using, respectively: 1) the input in positive time, i.e. the given input as it is, and 2) the input in negative time, i.e. a time-reversed copy of the original input. The advantage of a bidirectional model compared to conventional ANN models is that it observes the input in both forward and reverse directions to extract more information from the input sequence. This technique of negative time and bidirectional layers was first discussed in [24]. This paper implements LSTM, RNN and gated recurrent unit (GRU) models as well as their bidirectional counterparts, BLSTM [25], BRNN and BGRU, on several datasets for a comprehensive comparison presented in the results.
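The positive-time and negative-time passes can be illustrated with a minimal NumPy sketch. It uses a plain tanh RNN cell with random, untrained weights; the helper names (`rnn_pass`, `bidirectional_pass`, `init_params`), the sequence length and the feature count are assumptions made for illustration and are not taken from the paper's implementation.

```python
import numpy as np

def rnn_pass(x, W_in, W_rec, b):
    """Run a plain tanh RNN over x (shape: time x features); return the last hidden state."""
    h = np.zeros(W_rec.shape[0])
    for x_t in x:
        h = np.tanh(W_in @ x_t + W_rec @ h + b)
    return h

def bidirectional_pass(x, fwd_params, bwd_params):
    """Concatenate the final states of a forward pass and a pass over the time-reversed input."""
    h_fwd = rnn_pass(x, *fwd_params)         # positive time: input as given
    h_bwd = rnn_pass(x[::-1], *bwd_params)   # negative time: time-reversed copy
    return np.concatenate([h_fwd, h_bwd])

def init_params(rng, n_features, n_hidden):
    """Hypothetical helper: small random weights for one direction."""
    return (rng.normal(scale=0.1, size=(n_hidden, n_features)),
            rng.normal(scale=0.1, size=(n_hidden, n_hidden)),
            np.zeros(n_hidden))

rng = np.random.default_rng(0)
K, f, hidden = 6, 3, 20                      # illustrative sizes only
x = rng.normal(size=(K, f))
fwd = init_params(rng, f, hidden)
bwd = init_params(rng, f, hidden)
features = bidirectional_pass(x, fwd, bwd)
print(features.shape)  # (40,)
```

Each direction has its own weights, so the concatenated vector carries two independent summaries of the same sequence. LSTM and GRU cells would replace the tanh update but leave the reversal and concatenation unchanged.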

B. Derived Features
The aim of derived features is to enrich the training data with useful features for more accurate predictions. A deep learning model with enough computation time and data may extract such features on its own, but this cannot be guaranteed within the constraints of time and resources. Providing these derived features explicitly as inputs can thus enable the model to learn more from the data and converge quickly, especially for small datasets. Generally, the performance of deep learning models improves as the number of relevant input features increases, until the model starts to overfit.
The basic features used for generation of derived features and as input for the DeepDeFF model are:
• Energy load consumption E.
• Time-stamp of the day T, divided into intervals of 30 minutes each. The feature is converted into one-hot encoding.
• Current day of the week W, converted into one-hot encoding.
• Holidays, represented by a binary label H. At the moment only weekends are marked as holidays, but in future work this can be expanded and synced with other public and national holidays.
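The basic feature vector for one half-hourly record could be assembled as in the sketch below. The function names, the ordering of the fields and the choice to keep H as a single binary value are assumptions for illustration; only the features themselves (E, T, W, H) and the weekend-only holiday rule come from the text.

```python
from datetime import datetime

def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

def basic_features(timestamp, load):
    """Return [E] + one-hot(T) + one-hot(W) + [H] for one half-hourly record."""
    slot = timestamp.hour * 2 + timestamp.minute // 30   # T: 48 half-hour slots per day
    weekday = timestamp.weekday()                        # W: Monday=0 ... Sunday=6
    holiday = 1 if weekday >= 5 else 0                   # H: weekends only
    return [load] + one_hot(slot, 48) + one_hot(weekday, 7) + [holiday]

row = basic_features(datetime(2013, 6, 1, 13, 30), 0.42)  # 1 June 2013 is a Saturday
print(len(row))  # 57
```

For a dataset with hourly resolution, the 48-slot encoding would shrink to 24 slots.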
Derived features are calculated for each record in the input sequence (1, K , f ), where K represents the number of past records used for creating the input sequence and f represents the basic features. Following are the derived features that are calculated and used as input to the DeepDeFF model.
• Average load consumption of K time-steps.
• Standard deviation of load consumption of K time-steps.
• Average load consumption of the time-stamp t that is to be predicted, for past K days.
• Standard deviation of load consumption of the time-stamp t that is to be predicted, for past K days.
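The four derived features listed above can be computed as in the following sketch for a single target index t. It assumes 48 readings per day (30-minute resolution) and the population standard deviation; the paper does not state which estimator is used, and the function and key names are hypothetical.

```python
from statistics import mean, pstdev

STEPS_PER_DAY = 48  # assumes 30-minute resolution

def derived_features(load, t, K):
    """Derived features for predicting load[t] from a list of half-hourly readings.

    Assumes t >= K * STEPS_PER_DAY so that K previous days are available.
    """
    recent = load[t - K:t]                                              # last K time-steps
    same_slot = [load[t - d * STEPS_PER_DAY] for d in range(1, K + 1)]  # same time-stamp, past K days
    return {
        "mean_recent": mean(recent),
        "std_recent": pstdev(recent),          # population std is an assumption
        "mean_same_slot": mean(same_slot),
        "std_same_slot": pstdev(same_slot),
    }

# Toy series: the value is the slot index plus the day index.
load = [(i % STEPS_PER_DAY) + (i // STEPS_PER_DAY) for i in range(3 * STEPS_PER_DAY)]
K, t = 2, 2 * STEPS_PER_DAY + 10
feats = derived_features(load, t, K)
print(feats)
# {'mean_recent': 10.5, 'std_recent': 0.5, 'mean_same_slot': 10.5, 'std_same_slot': 0.5}
```

The first pair summarizes the immediate past, while the second pair summarizes the same time of day on previous days, which is the information a short window of K time-steps cannot see.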

C. Proposed Architecture
This paper proposes a two-layer bidirectional sequential model architecture, DeepDeFF, which feeds the raw and derived input features into separate layers to extract learned features. The idea behind using separate input layers for basic and derived sequences is to allow the sequential layers to learn from the two input sequences independently. The goal is to exploit the relevance of basic and derived sequences to the predictions individually. The learned representations from the individual sequential layers are then merged and fed to a dense layer followed by a linear activation output layer to make the final prediction of the load at the next time interval. Fig. 1 shows the schematics of the DeepDeFF architecture. The hyperparameter settings consist of: (a) a sequential layer of 20 nodes, (b) a dropout of 0.2, (c) the Adam optimizer, (d) MAPE as the loss function, and (e) a learning rate of 0.01.
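The data flow of the two-branch fusion can be traced with a small NumPy forward pass. This is only a shape-level sketch under several assumptions: the sequential layers are stood in for by a tanh RNN with random, untrained weights, the dense layer width (16) and its ReLU activation are not specified in the text, and dropout and training with Adam on a MAPE loss are omitted. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def bi_rnn(x, n_hidden):
    """Hypothetical stand-in for one bidirectional sequential layer (tanh RNN, random weights)."""
    def run(seq):
        W_in = rng.normal(scale=0.1, size=(n_hidden, seq.shape[1]))
        W_rec = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
        h = np.zeros(n_hidden)
        for x_t in seq:
            h = np.tanh(W_in @ x_t + W_rec @ h)
        return h
    return np.concatenate([run(x), run(x[::-1])])   # forward and time-reversed passes

def deepdeff_forward(raw_seq, derived_seq, n_hidden=20, n_dense=16):
    """Trace one forward pass: two branches, merge, dense layer, linear output."""
    h_raw = bi_rnn(raw_seq, n_hidden)               # branch 1: raw features
    h_der = bi_rnn(derived_seq, n_hidden)           # branch 2: derived features
    merged = np.concatenate([h_raw, h_der])         # fusion of the two learned representations
    W_d = rng.normal(scale=0.1, size=(n_dense, merged.size))
    dense = np.maximum(0.0, W_d @ merged)           # dense layer (ReLU is an assumption)
    w_out = rng.normal(scale=0.1, size=n_dense)
    return float(w_out @ dense)                     # linear output: next-interval load

K = 2
raw_seq = rng.normal(size=(K, 57))      # e.g. E + one-hot T + one-hot W + H per time-step
derived_seq = rng.normal(size=(K, 4))   # four derived features per time-step
y_hat = deepdeff_forward(raw_seq, derived_seq)
print(type(y_hat).__name__)  # float
```

Keeping the branches separate until the merge means the derived features are not diluted by the much wider one-hot raw input before a representation has been learned for each.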

IV. THE DATASETS
The proposed methodology for SLF has been evaluated on five energy load datasets from different sources. This section provides the salient parameters of each dataset and presents the pre-processing technique adopted for each.

A. Smart Grid Smart City (SGSC) dataset
The SGSC project was initiated by the Australian Government in 2010 [14]. It gathered smart-meter data from around 78,000 customers over a period of 4 years. In [5], individual models for each customer were proposed. However, since it is not feasible to train individual models for ∼78,000 customers, 69 customers having a "hot water system" were selected. The same subset is extracted here to evaluate the DeepDeFF architecture.

B. The Almanac of Minutely Power dataset (AMPds)
AMPds [26] contains electricity, water and natural gas measurements of a single Canadian household with 19 appliances, recorded for 1 year with 1 minute resolution, which is down-sampled to 30 minutes resolution [10]. The variables for raw features used here are the same as for SGSC except that E here is assigned to the Ampere reading.

C. Réseau de Transport d'Électricité (RTE) France dataset
The RTE dataset [13] is also used here to evaluate the proposed technique. The dataset used spans from 2013 to 2016 with a sampling interval of 30 minutes. The raw inputs use the same variables as for SGSC above.

D. The Electric Reliability Council of Texas (ERCOT) dataset
The ERCOT dataset [18] provides real-time and historical statistics surrounding independent system operator (ISO) operations of the Texas region for a period of ∼5 years, recorded every 1 hour. The raw feature variables used here are the same as for SGSC except that the time T here ranges from 1 to 24 since the resolution is 1 hour.
The input data is first pre-processed to obtain the derived input features. The raw input features T, W, H are converted to one-hot encoding. The raw and derived input features are then fed to two individual bidirectional sequential layers. The information extracted from these separate layers is then merged and used as input to a dense layer followed by a final feed-forward layer with a linear activation function.

E. Pakistan Residential Electricity Consumption (PRECON)
The PRECON dataset [27] records the electricity consumption patterns in a developing country for 42 households of varying financial status, daily routine and load profile. The data is collected at 1-minute intervals from 01-06-2018 to 31-09-2019. The amount of data varies for each household due to the different number and types of appliances that are selected for monitoring. This dataset also captures the problem of power outages rampant in developing countries, which is evident from several long 0 kW data intervals. For raw features, the same variables as in SGSC are used here except that E here refers to the kW usage.

V. EXPERIMENTS & RESULTS
The proposed framework for SLF was arrived at through an evolutionary process of numerous rigorous experiments on all five datasets. This section discusses these experiments in detail and interprets the results obtained. The results from the DeepDeFF architecture are compared with the results of simple two-layer sequential models trained on basic features with MAE as the loss, as proposed in [5].

1) Train & Test Setting:
The same settings provided in [5] are used to extract the subset of SGSC data for fair comparison on the same test set. The data spanning the whole winter season of New South Wales, Australia, is divided into training, validation and test sets with a split ratio of 0.7/0.2/0.1. The training set is used to train the DeepDeFF model, the validation set is used to select the best model weights, and the test set is used for the evaluation of the DeepDeFF model. The data is sampled at 30-minute intervals, so for 69 customers the 9 days of evaluation imply the forecasting of 29,808 time points. Fig. 2 provides some insight into the diversity of customers by showing the similarity between their train and test data. Test data is plotted over training data with matching numeric dates. Fig. 2a shows the data of a customer with similar train and test data patterns, whereas Fig. 2b shows no similarity for another customer. This indeed affects the results of the DeepDeFF architecture, as reflected in their respective MAPE of 26.04% and 50.78% using a BLSTM layer; the DeepDeFF model has thus been able to learn the underlying patterns and temporal relations for the customer in Fig. 2a, but not as well for the customer in Fig. 2b.
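The test-set size quoted above follows directly from the sampling rate, and a chronological split can be sketched in a few lines. The function name and the rounding rule are assumptions for illustration; the paper only states the 0.7/0.2/0.1 ratio.

```python
def chronological_split(records, ratios=(0.7, 0.2, 0.1)):
    """Split a time-ordered list into train/validation/test without shuffling."""
    n = len(records)
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    return records[:n_train], records[n_train:n_train + n_val], records[n_train + n_val:]

# Test-set size: 69 customers, 9 days, 48 half-hour points per day.
points_per_day = 24 * 60 // 30
test_points = 69 * 9 * points_per_day
print(test_points)  # 29808

train, val, test = chronological_split(list(range(100)))
print(len(train), len(val), len(test))  # 70 20 10
```

Splitting in time order rather than at random keeps the test period strictly after the training period, which matches how a forecast would be used in practice.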
2) Results: Table-I shows the comparison of results from rigorous experiments performed on the SGSC dataset using the proposed DeepDeFF method in contrast with the implementation of the model proposed in [5]. The addition of derived features in the proposed architecture, along with MAPE as the loss function, outperforms the state of the art on the SGSC dataset, as evident from the average MAPE computed in Table-I. However, Fig. 3b shows that the model underperforms for customer 8655993 due to uncorrelated train and test data, owing to disjoint customer behavior during the training and testing days.

1) Train & Test Setting:
The AMPds data is converted from 1-minute resolution to 30 minutes, yielding 17,483 data points [10]. The data is subdivided with a split ratio of 0.7/0.  Fig. 4 shows that the train and test data have some common pattern and that there are no abrupt changes like those in Fig. 2b. Even though the training and testing data are not available for the same dates of different years, the test data follows a general trend.
2) Results: Table-II shows a comparison of the results produced by the simple two-layer sequential model and the DeepDeFF architecture with derived features. The proposed architecture beats the benchmark of 26.23% achieved in [10] for 6 time-steps. Fig. 5 shows that the DeepDeFF architecture performs well in predicting the general load but suffers in the case of outliers. This is because the model was able to learn the underlying general pattern from the training data and gave it more importance than the outliers. This problem occurred because the training data was insufficient and did not cover all the months, so the test data is from a month that was never seen during training.
6 show the subsets of training and testing data for the dates mentioned in the figures' legends. Such close resemblance in the test and train data helps the model to make accurate predictions, as evident from the results.
2) Results: It is observed from the results for the SGSC and AMPds datasets that the experiments with 2 time-steps mostly yield the best results. Henceforth, 2 time-steps are used for the experiments on the other datasets. Table-III shows the results for the RTE dataset. The proposed model with GRU and derived features performed best, with an average MAPE of 0.81%. Fig. 7 shows the prediction results against the actual system load, which further confirms the excellent performance of the DeepDeFF architecture.
Similar to RTE, ERCOT is also the accumulated load consumption data of Texas. The train and test data for ERCOT also have a close resemblance, similar to Fig. 6.
2) Results: Table-IV shows the results for the ERCOT dataset, where the DeepDeFF model with BGRU performed best, with an average MAPE of 0.91%. Fig. 8 shows the results that establish the effectiveness of the DeepDeFF architecture.

1) Train & Test Setting:
Owing to the peculiar nature of the PRECON dataset, it is pre-processed in two steps in this research. First, the data is converted from 1-minute intervals to 30-minute intervals by taking the average over 30 consecutive load readings. The second step takes care of the close-to-zero values in the data, which are mostly due to power outages. Left untreated, these values cause a divide-by-zero problem when the MAPE function is used for evaluation, resulting in unrealistically high MAPE and adversely affecting the performance of the machine learning algorithm. This is countered simply by adding a small offset of 0.1 kW to all the readings. The offset is small enough to make no significant change in the nominal values of the load and takes effect only for the near-zero data. This simple pre-processing has shown a remarkable impact on the performance of the DeepDeFF algorithm, as evident from the results.
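The two pre-processing steps and their effect on MAPE can be sketched as follows. The function names and the toy readings are assumptions for illustration; the 30-reading average and the 0.1 kW offset are taken from the text, and MAPE is written in its standard form, 100/N times the sum of |actual - predicted| / |actual|.

```python
def downsample_mean(readings, window=30):
    """Average consecutive blocks of `window` readings (1-minute -> 30-minute)."""
    full = len(readings) - len(readings) % window      # drop an incomplete trailing block
    return [sum(readings[i:i + window]) / window for i in range(0, full, window)]

def add_offset(loads, offset=0.1):
    """Shift every reading by a small constant so outage intervals are not exactly zero."""
    return [x + offset for x in loads]

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

minute_kw = [0.0] * 30 + [2.0] * 30          # an outage half hour, then a 2 kW half hour
half_hourly = add_offset(downsample_mean(minute_kw))
print(half_hourly)                            # [0.1, 2.1]
print(mape(half_hourly, [0.2, 2.0]))
```

In this toy case the same absolute error of 0.1 kW contributes roughly 100% for the outage interval and under 5% for the 2 kW interval, which shows why near-zero readings dominate the metric. Without the offset, the first term would divide by zero.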
The data splitting is done differently here because the data spans a period of only one year with no repeated data for any month. So instead of using an overall split of the data, as done for the previous datasets, a month-wise split is proposed. Here the training, validation, and testing data are taken from days 1-21, 22-26, and 27-30/31 of each month, respectively. This corresponds roughly to an overall split of 0.7/0.2/0.1.
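The month-wise assignment can be expressed as a simple rule on the day of the month. The sketch below uses a hypothetical function name and an illustrative one-year range starting on 1 June 2018; it counts how many calendar days fall into each subset under the day boundaries given above.

```python
from datetime import date, timedelta

def monthwise_subset(day):
    """Assign a calendar date to train/val/test by its day of month."""
    if day.day <= 21:
        return "train"
    if day.day <= 26:
        return "val"
    return "test"

start = date(2018, 6, 1)
days = [start + timedelta(days=i) for i in range(365)]   # illustrative one-year span
counts = {"train": 0, "val": 0, "test": 0}
for d in days:
    counts[monthwise_subset(d)] += 1
print(counts)  # {'train': 252, 'val': 60, 'test': 53}
```

Because every month contributes to all three subsets, the test data never comes from a month that was absent from training.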
2) Results: Table V shows the comparison of results obtained for PRECON dataset. DeepDeFF models have consistently outperformed the basic models on all the houses, achieving on average 8.9% lower MAPE than basic models.
The MAPE achieved by the DeepDeFF models ranged from 7.67% on House 3 to 37.61% on House 29. The graphs of predicted versus actual load for these two houses are shown in Fig. 9a and Fig. 9b respectively. The above results on five public datasets indicate that SLF for individual households (SGSC, AMPds and PRECON) is more difficult than aggregated load forecasting for a country- or state-wide dataset because of the high variance in the load consumption patterns of the former. However, the proposed DeepDeFF architecture has been able to forecast better than the previously published techniques.

VI. CONCLUSION
Load forecasting is of critical importance to optimally schedule and reliably manage the operations of power systems. This manuscript presented a deep learning architecture based on sequential layers, and a pre-processing method for introducing hand-crafted features into the end-to-end learning pipeline of the deep learning model, for short-term load forecasting. It is demonstrated through rigorous experimentation that the inclusion of hand-crafted features improves the learning and predictions of the model, especially for smaller datasets. The proposed DeepDeFF architecture has been comprehensively tested on five different datasets: two country/state-wide datasets and three household datasets. The results achieved by the proposed methodology beat the current benchmarks on these datasets for SLF.