Machine Learning Based Integrated Feature Selection Approach for Improved Electricity Demand Forecasting in Decentralized Energy Systems

Improved electricity demand forecasts can provide decentralized energy system operators, aggregators, managers, and other stakeholders with essential information for energy resource scheduling, demand response management, and energy market participation. Most previous methodologies have focused on predicting the aggregate amount of electricity demand at national or regional scale and disregarded the electricity demand of small-scale decentralized energy systems (buildings, energy communities, microgrids, local energy internets, etc.), which are emerging in the smart grid context. Furthermore, few research groups have performed attribute selection before training predictive models. This paper proposes a machine learning (ML)-based integrated feature selection approach to obtain the most relevant and nonredundant predictors for accurate short-term electricity demand forecasting in distributed energy systems. In the proposed approach, the binary genetic algorithm (BGA), an ML tool, is applied for the feature selection process, and Gaussian process regression (GPR) is used for measuring the fitness score of the features. To validate the effectiveness of the proposed approach, it is applied to various building energy systems located in the Otaniemi area of Espoo, Finland, and the findings are compared with those achieved by other feature selection techniques. The proposed approach enhances the quality and efficiency of predictor selection, achieving improved prediction accuracy with a minimal set of chosen predictors, and outperforms the other evaluated feature selection methods. Besides, a feedforward artificial neural network (FFANN) model is implemented to evaluate the forecast performance of the selected predictor subset. The model is trained with a two-year hourly dataset and tested with a further one-year hourly dataset.
The obtained results verify that the FFANN forecast model based on the BGA-GPR FS selected training feature subset has achieved an annual MAPE of 1.96%, which is a very acceptable and promising value for electricity demand forecasting in small-scale decentralized energy systems.


I. INTRODUCTION
Decentralized energy system operators, aggregators, suppliers, managers and other stakeholders face several challenges, ranging from inadequate electricity supply to increasing consumption. The electricity demand curves of decentralized energy systems (such as buildings, energy communities, microgrids, virtual power plants, local energy internets, etc.) differ from the typical electricity demand curves that represent nation- or region-wide electricity consumption. This makes the conventional techniques (developed for national or regional electricity demand forecasts) inappropriate for straightforward application in decentralized energy systems, for two clear reasons. In decentralized energy systems, not only is the total electricity demand level many times lower than regional or national demand levels, but the electricity demand profile also manifests more fluctuation and does not generally follow the same shape. Therefore, the recent deployment of decentralized energy systems calls for appropriate and applicable feature selection (FS) tools and forecasting models for economic and efficient consumption modeling.

(The associate editor coordinating the review of this manuscript and approving it for publication was Canbing Li. This work is licensed under a Creative Commons Attribution 4.0 License; for more information, see http://creativecommons.org/licenses/by/4.0/. VOLUME 7, 2019.)
Feature selection is a procedure of picking a subset of most important features (attributes, variables, or predictors) for use in predictive model development.
Basic knowledge of statistical models helps in appreciating why an effective feature selection method is important for the final performance of prediction models. However, an effective feature selection approach should be properly designed, implemented and tested for the specific application in question. In the present age of 'Big Data', datasets are packed with information gathered from millions of Internet-of-Things (IoT) devices and sensors. This makes the data high dimensional, and it has become very common to observe datasets with hundreds (even thousands) of variables.
The same is true in the power industry, since the concept of smart grids has been developed and is being implemented based on the idea of IoT and a complex interaction of data among various stakeholders.
When data is presented with very high dimensionality, prediction models generally struggle because:
1. Training, validation and testing times increase exponentially with the number of variables.
2. Prediction models face an increasing risk of overfitting as the number of predictors grows.
3. Prediction accuracy faces a cumulative threat of degradation with an increasing number of features.
Therefore, feature selection is a very important element in a data scientist's workflow. Different research groups have applied various feature selection methods to various applications and scenarios. However, very few of them have coupled and investigated feature selection tools together with forecasting models. Moreover, there is no standard, universally agreed feature selection method so far. The R&D effort to find the most effective feature selection tools is still ongoing among various independent research groups and institutions. Effective and adaptive feature selection methods must be continually designed, implemented and tested for the desired applications and scenarios, as new data sources, policies and algorithms keep emerging. These are the main reasons that motivated the development and implementation of the FS work in this paper.
Feature selection strategies offer the following main advantages:
• Reduce computation time
• Decrease data storage requirements
• Simplify models, making them user-friendly
• Improve data understandability and interpretability
• Evade the curse of dimensionality
• Enhance generalization, reducing overfitting.
The core argument when applying an FS method is that the original dataset holds some variables which are either duplicated or unimportant, and can therefore be eliminated without inducing appreciable loss of information. Several studies have shown that redundant and irrelevant features reduce the accuracy and generalization capability of predictive models.
That is why FS studies have become very popular nowadays in AI, machine learning (ML), deep learning (DL), and statistics.
Various relevant studies have demonstrated that the total energy cost (energy production, operation and purchasing costs) can be reduced significantly by applying the demand response concept in small-scale decentralized energy systems. However, techniques for electricity demand prediction in decentralized energy systems have not been properly investigated. Most applicable methods are limited to predicting the amount of electricity demand at large scale (nation- or region-wide) and disregard the specific electricity demand of smaller entities such as buildings, energy communities, microgrids, virtual power plants, local energy internets, etc., which are of equivalent significance for energy system optimization, sustainability and efficiency.
Thus, the goal of this paper is to propose and implement a feature selection approach for modeling and forecasting the fluctuating electricity demand in decentralized energy systems in general and buildings in particular. The results will assist distributed energy system stakeholders in efficiently using limited energy resources and regulating dispatchable generation and flexible demand levels. Prediction accuracy has mainly been the indispensable target of forecasting studies. It is soundly revealed in [1] and [2] that the accuracy of prediction models relies not only on the models' configurations and associated learning methods but also on the predictor domain, which is established via the initial predictor space and FS techniques. FS is mostly applied in ML implementations as a preprocessing step in which a predictor subset (of independent attributes) is found by removing predictors that carry little or irrelevant information or are highly redundant [3]. However, very few forecasting techniques perform FS before training the prediction models.
GA has gained extensive consideration due to its operability and robust searching ability. GA is an artificial intelligence (AI) probabilistic searching algorithm that has been broadly implemented for several optimization problems [7]. It was inspired by the survival-of-the-fittest principle of Charles Darwin's theory of evolution and genetics. The GA search starts from a randomly chosen set of individuals called the initial population. It then iterates to find the best individual (solution) through its three main consecutive operators: selection, crossover and mutation. GA uses a performance index called the fitness function to calculate the fitness of the individuals over iterations. BGA is a special version of GA which operates by first representing the given feature space (chromosomes or candidate solutions) as binary bit-strings. This makes the BGA better suited to FS problems than the conventional GA.
This paper proposes a machine learning based hybrid feature selection method to obtain the most relevant and nonredundant features for improved short-term forecasting of electricity demand in decentralized energy systems.In the proposed method, the Binary Genetic Algorithm (BGA) is applied for the feature selection process and Gaussian Process Regression (GPR) is used for measuring the fitness score of the features.
To the best of our knowledge, very few research works have performed feature selection before fitting or training forecasting models. Moreover, as far as we have investigated, the BGA-GPR based hybrid machine learning approach has never been applied to the feature selection problem in the domain of electricity demand forecasting.
In this study, the residual of the GPR model is chosen as the fitness evaluation measure. GPR is a powerful algorithm for regression. It is chosen for its strong capability of fitting nonlinear input-output relationships based on probabilistic distributions over functions. It has few parameters to tune and is easy to implement. It can provide a consistent estimation of its uncertainty: GPR directly captures the model (input-output or feature-target) uncertainty. For instance, it provides a distribution for the feature selection fitness measure (error) value, rather than just a single point estimate. This uncertainty is not directly captured by most other ML or AI tools. Moreover, GPR is able to incorporate prior knowledge about the behavior of the input-output (feature-target) relationship by using different kernel functions.
Generally, the paper contributions can be regarded as (1) modeling, parameterization and implementation of the BGA and GPR algorithms to suit the feature selection problem in question, and (2) establishment of seamless combination of the two algorithms to work in unison for solving the feature selection problem.
Specifically, this paper makes the following main contributions:
• Investigate and demonstrate the relevance of an effective FS approach for improved-performance (accurate) electricity demand forecasting;
• Provide an effective and efficient hybrid AI-based FS approach for electricity demand forecasting models;
• Improve electricity demand prediction accuracy through the application of FS before fitting prediction models.
The remaining sections of the paper are outlined as follows. Section II describes previous relevant works on FS. Section III describes the dataset and states the feature selection problem. Section IV presents the working principle of the binary GA. Similarly, the theory and mathematical modeling of the GPR model used for the fitness measure in the binary GA are described in Section V. Section VI presents the proposed BGA-GPR based FS strategy. The experimental results and validations are presented in Section VII. Conclusions are finally drawn in Section VIII.

II. RELATED WORKS
FS methods are classified into filter, wrapper and embedded techniques [1]. Filter techniques do not depend on any prediction model; they rank features according to statistical characteristics, utilizing a correlation score to grade a feature subset. Filter-based FS methods are generally fast. The filter FS approach includes correlation-based [8], mutual information-based [9], and principal component analysis-based methods [10]. Filters generally require less computation time than other FS techniques, but they generate a feature set which is not fitted to a particular forecast model.
Filter techniques are extensively applied in big data analysis due to their computational efficiency. Wrapper techniques evaluate predictor subsets based on their worth to a specific forecaster or classifier. Wrapper techniques treat FS as a search problem in which various combinations of predictors are prepared, assessed, and contrasted with other combinations. The common heuristic AI-based optimization methods stated in Section I are used to steer the search procedure. Compared to filter techniques, wrapper techniques show improved performance since the various predictor sets are assessed by a predictive model or fitting method in every iteration [11]. Embedded techniques merge the feature selection process into the prediction model training. For instance, regularization during model training [1] is one example of an embedded FS method. The LASSO model, which regularizes the parameters of linear models with L1 penalties to shrink uncorrelated coefficients to zero, is another example of an embedded method.
Table 1 presents recently published FS strategies in the demand prediction scope for power system and energy applications. Among the wrapper techniques outlined in Table 1, GA-based feature selection shows the best performance in removing duplicated features. Reference [15] devised an integrated strategy to predict electricity consumption; in this work, a conventional GA is used to obtain the best predictor subset, hybridized with an adaptive neuro-fuzzy inference system (ANFIS). Reference [18] devised a combination of GA and ACO for FS in electric load demand prediction; an ANN is employed as a forecaster model to measure the fitness score of the feature subsets. Reference [19] devised a GA-support vector regression (SVR) configuration to predict accommodation-booking demand in hotels, where the GA is employed for both feature selection and SVR parameter optimization. Reference [20] implemented GA-based feature selection for demand prediction problems in retail industries of China. Reference [21] proposed a modified GA for FS in the demand prediction problem in the health sector, specifically in an outpatient department (OPD); a deep neural network is employed as the predictive model to evaluate the fitness score of the feature subsets. In addition to demand prediction, GA-based feature selection is often used for knowledge exploration in other important sectors, such as financial market analysis [22], stock price forecasting [23] and financial distress forecasting [24]. Reference [25] proposed a binary GA-based FS for a classification problem, using the k-nearest neighbors (kNN) algorithm as the GA fitness evaluation measure.
Designing a suitable fitness evaluation measure is very important in FS approaches. The fitness evaluation measure is used as a performance index for evaluating the suitability of candidate features. Predictors are ranked and selected according to their evaluated values of the fitness function or measure.
The combination of predictors that yields the best value of the fitness measure is chosen at the end of the FS algorithm run. This paper uses the residual (error) of the Gaussian process regression (GPR) model as the fitness function of the BGA. The GPR model is a kernel-based probabilistic distribution over functions. It is chosen in this paper for its strong capability of fitting nonlinear input-output relationships based on assumptions of a probabilistic distribution over the given input-output data or function. Besides, it has few parameters to tune and is easy to implement.
Following a comprehensive assessment of the above-mentioned genetic algorithm based FS techniques, we find that a conventional genetic algorithm with the usual framework (conventional GA configuration) is used in most studies [15]. For instance, the initial population (initial chromosome set) is created arbitrarily, so population diversity cannot be guaranteed, and the occurrence of duplicated predictors may degrade the quality of the search procedure. Moreover, the conventional GA works with the continuous features themselves to minimize the desired fitness function (FS evaluation measure). This reduces the efficiency of the algorithm and causes computational complexity and increased total computation time.
Since an intelligent heuristic algorithm is the best option for determining the search target, a research gap exists that can be addressed by replacing the conventional GA with the BGA and hybridizing it with a robust fitness evaluation measure (GPR in this paper). The BGA first represents the features as encoded binary strings and works with these strings to minimize the GPR-based evaluation measure, obtaining the most relevant and nonredundant predictor subset at the end. The BGA is more efficient and stable than the conventional GA. It also reduces computational complexity and execution time compared to the conventional GA.

III. DATASET AND FEATURE SELECTION PROBLEM
The original feature set is constructed through a basic assessment of the characteristics of electricity consumption in decentralized energy systems such as buildings and its association with historical (prior) consumption and external agents. The external agents are electricity market price, seasonality (minute/hour, month and season), weather factors, and people's interaction (occupancy). The availability of data sources for these external agents affecting electricity consumption is another major factor in constructing the original feature space.
The complete feature set for the electricity demand predictive model in this FS work consists of lagged electricity demand, seasonal (or calendar) parameters, weather parameters, occupancy, and an economic factor (electricity price). Table 2 presents the original feature space, or initial dataset, for the FS work in this paper. The variables f_i, i = 1, 2, ..., 24, in Table 2 designate the original predictor dataset (feature space) required for the FS work in this paper. Therefore, the feature domain of the FS is an m × n real matrix, where m = 8760 is the number of samples (a one-year (2015) hourly observation of the variables) and n = 24 is the size of the feature space (original dataset). In this paper, the following optimization problem is solved to obtain the best (relevant and nonredundant) predictor subset from the original dataset given in Table 2.
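Such a feature space can be assembled programmatically from the raw hourly demand series. The sketch below (in Python with pandas, not the paper's MATLAB environment; the column names and the synthetic demand series are illustrative assumptions, not the actual dataset) shows how the calendar and lagged-demand predictors of Table 2 might be derived:

```python
import pandas as pd
import numpy as np

# Hypothetical hourly demand series for one year (8760 samples).
idx = pd.date_range("2015-01-01", periods=8760, freq="h")
demand = pd.Series(np.random.default_rng(0).uniform(50, 200, 8760), index=idx)

features = pd.DataFrame(index=idx)
features["hour"] = idx.hour                     # calendar predictors
features["day_of_week"] = idx.dayofweek
features["month"] = idx.month
features["lag_24h"] = demand.shift(24)          # 24 h lagged demand
features["lag_168h"] = demand.shift(168)        # 168 h (one week) lagged demand
features["avg_prev_24h"] = demand.rolling(24).mean().shift(1)

features = features.dropna()                    # drop rows lacking full lag history
print(features.shape)
```

Rows whose lag history is incomplete (the first 168 hours) are dropped, so the usable sample is slightly shorter than the raw year.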
FS Problem: Given the original feature space {f_i | i = 1, 2, ..., 24} of Table 2, find a feature subset such that both f_r and β are minimized, where f_r is the number of predictors in the lower-dimension (reduced) predictor subset and β is the percentage forecast error.

IV. BINARY GENETIC ALGORITHM (BGA)
GA is a population-based heuristic optimization method inspired by the survival-of-the-fittest principle of Charles Darwin's theory of evolution and genetics [26]. The GA operating mechanism involves iterative steps processing a set of chromosomes (candidate solutions) to generate a new population (offspring) via the genetic operators: selection, crossover and mutation.
The fitnesses of the candidate solutions (chromosomes) are calculated using a function generally called the objective or fitness function. That is, the objective function provides scores (numeric values) which are employed for grading the existing solutions in the population. BGA is an extended version of the standard GA.
The BGA first represents the candidate solutions as encoded binary strings (a binary search space) and works with the binary strings to minimize or maximize the fitness function. The BGA is more efficient and stable than the standard GA, and it also reduces computational complexity and execution time. Figure 1 shows the flowchart of the BGA.
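For concreteness, the binary GA loop described above can be sketched as follows. This is a minimal Python illustration (not the paper's MATLAB implementation); the toy objective, population size and operator rates are arbitrary assumptions chosen only to demonstrate the mechanics:

```python
import numpy as np

rng = np.random.default_rng(42)

def run_bga(fitness, n_genes, pop_size=24, generations=150,
            crossover_rate=0.8, mutation_rate=0.02, elite_k=2):
    """Minimal binary GA: evolves bit-strings to minimize `fitness`."""
    pop = rng.integers(0, 2, size=(pop_size, n_genes))
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        pop = pop[np.argsort(scores)]                       # sort best-first
        new_pop = [pop[i].copy() for i in range(elite_k)]   # elitism
        while len(new_pop) < pop_size:
            # tournament selection of size 2 (lower index = fitter, pop is sorted)
            p1 = pop[min(rng.integers(0, pop_size, 2))]
            p2 = pop[min(rng.integers(0, pop_size, 2))]
            child = p1.copy()
            if rng.random() < crossover_rate:               # single-point crossover
                point = rng.integers(1, n_genes)
                child[point:] = p2[point:]
            flips = rng.random(n_genes) < mutation_rate     # bit-flip mutation
            child[flips] ^= 1
            new_pop.append(child)
        pop = np.array(new_pop)
    scores = np.array([fitness(ind) for ind in pop])
    return pop[np.argmin(scores)]

# Toy objective: Hamming distance to a known 'ideal' feature mask.
target = np.zeros(10, dtype=np.int64)
target[[0, 3, 5]] = 1
best = run_bga(lambda bits: int(np.abs(bits - target).sum()), n_genes=10)
print(best)
```

With elitism in place, the best fitness found so far is never lost, so the error level is non-increasing over generations.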

V. GAUSSIAN PROCESS REGRESSION (GPR)
GPR models the nonlinear relationship between input(s) and target(s) based on probabilistic distributions over functions. The Gaussian process (GP) describes a distribution over functions based on the assumption that samples of targets obtained at any two or more points of a function follow a joint (multivariate) Gaussian distribution (GD).
Stated explicitly, a GP is defined as a collection of random variables, any finite number of which form a joint (multivariate) GD [27].
In GPR, the target y of the function f at input x is defined as:

y = f(x) + ε,    ε ∼ N(0, σ_ε²)

where ε is normally distributed with zero mean and standard deviation σ_ε. This matches the assumption made in linear regression that a sample contains an independent ''function'' part f(x) and a ''noise'' part ε. In GPR, on the other hand, the function part is also considered a random variable that follows a given distribution. The distribution reflects our uncertainty about the function. The uncertainty on f can be decreased by observing the function target at various points. The noise part ε reflects the intrinsic uncertainty in the samples, which exists regardless of the number of samples taken [27].
In GPR, the function f(x) is assumed to be distributed as a GP:

f(x) ∼ GP(m(x), k(x, x′))

A GP is described by its mean and covariance functions. The mean function m(x) gives the expected functional value at input x:

m(x) = E[f(x)]

That is, m(x) is the mean of all functions in the distribution evaluated at x. The prior mean is usually fixed at zero (i.e., m(x) = 0) to avoid costly posterior calculations, so that inference proceeds through the covariance function alone.
In practice, fixing the prior mean to zero is generally done by subtracting the (sample) mean from each sample. The covariance function k(x, x′) models the relationship between the functional values at different input points x and x′:

k(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))]

The function k is known as the kernel of the GP [28].
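The GP posterior implied by the mean and kernel definitions above can be sketched in a few lines of NumPy. This is a generic zero-mean GPR with a squared exponential kernel and a fixed noise level on synthetic one-dimensional data, not the paper's tuned model:

```python
import numpy as np

def se_kernel(a, b, sigma_f=1.0, length=1.0):
    # Squared exponential kernel: k(x, x') = sigma_f^2 exp(-(x - x')^2 / (2 l^2))
    d2 = (a[:, None] - b[None, :]) ** 2
    return sigma_f**2 * np.exp(-d2 / (2.0 * length**2))

def gpr_predict(x_train, y_train, x_test, sigma_n=0.1):
    """Zero-mean GP posterior mean and standard deviation at x_test."""
    K = se_kernel(x_train, x_train) + sigma_n**2 * np.eye(len(x_train))
    K_s = se_kernel(x_test, x_train)
    K_ss = se_kernel(x_test, x_test)
    alpha = np.linalg.solve(K, y_train)           # K^-1 y
    mean = K_s @ alpha                            # posterior mean
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)  # posterior covariance
    return mean, np.sqrt(np.clip(np.diag(cov), 0.0, None))

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 80)                        # synthetic 1-D inputs
y = np.sin(x) + rng.normal(0, 0.1, 80)            # noisy targets
mean, std = gpr_predict(x, y, np.array([5.0]))
print(mean, std)
```

Note that the prediction comes with a standard deviation: this is the direct uncertainty estimate that motivates using GPR as the fitness model.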
A suitable kernel should be selected considering factors such as the smoothness and likely patterns anticipated in the dataset.
A practical and general assumption is that the correlation between two points declines with the distance between them; that is, nearby points are likely to behave more similarly than points that are far away from each other. The GPR model in this study employs the squared exponential kernel function defined below:

k(x, x′) = σ_f² exp( −(x − x′)² / (2ℓ²) )

where σ_f is the signal standard deviation and ℓ is the length scale.

VI. PROPOSED BGA-GPR BASED FS STRATEGY
The proposed BGA-GPR FS strategy operates on the survival-of-the-fittest principle of the evolution theory. A starting population is generated and assessed using an objective function. For the binary chromosomes employed in this paper, a gene value of '1' indicates that the specific feature pointed to by the position of the '1' is chosen; otherwise (if '0'), the feature is not chosen for the fitness evaluation.
Using the position pointers of the variables indicated by the '1's, the individuals are then ranked, and according to the ranking the top k fittest individuals (elitism of size k) are chosen to pass to the succeeding generation. Once the selected individuals are moved directly to the succeeding generation, the other individuals in the present solution space undergo genetic change via the crossover and mutation operators to create crossover and mutation offspring, respectively [26]. The three offspring types, namely elite, crossover and mutation offspring, then establish the new solution space (new generation). The crossover operator is a fusion of two chromosomes to create crossover offspring, while the mutation operator introduces genetic disorder (diversity) into the genes of the chromosomes by flipping bits according to the mutation probability. Following the procedure outlined in Figure 2, the detailed operating mechanisms of the proposed BGA-GPR FS are described in the following subsections.

A. INITIAL POPULATION
The GA starting solution space used in this work is a matrix of size p × q, where p is the number of chromosomes and q is the chromosome length (called the genome length). p equals the population size and q equals the number of bits or genes in each individual. It is recommended that the number of chromosomes be at least equal to the chromosome length, so that the chromosomes in every population cover the search domain [29].
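With the sizes recommended above (population size at least equal to the genome length of 24 features), the initial population can be generated as a random binary matrix, for example:

```python
import numpy as np

genome_length = 24     # one gene per candidate feature in Table 2
pop_size = 24          # population size >= chromosome length, per the guideline

rng = np.random.default_rng(7)
initial_population = rng.integers(0, 2, size=(pop_size, genome_length))
print(initial_population.shape)
```

Each row is one chromosome; each column position corresponds to one candidate feature of Table 2.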

B. FITNESS EVALUATION
For the BGA to choose the predictor subset, an objective function (the BGA driver) should be specified to calculate the discriminative power of each predictor subset. The fitness of each chromosome in the population is assessed employing the GPR-based fitness function. In this paper, the fitness of the various feature subsets is evaluated using the MSE (mean squared error) of the GPR model's predictive residuals. The GPR model f(x) is fitted for every feature subset. Hence, the MSE between the training target and the GPR model estimate, evaluated for each feature subset in the feature search space defined in Table 2, is used as the fitness evaluation measure, defined as follows:

MSE = (1/n) Σ_{i=1}^{n} (T_i − f(x_i))²     (9)

where T is the vector of training targets (electricity demand) and n is the number of training samples or observations. The aim of the BGA is to minimize the fitness function (MSE) defined in equation (9) by choosing a subset of input features having the best fitness over subsequent iterations. In each chromosome, a gene value of '1' indicates that the specific variable pointed to by the position of the '1' is chosen; if it is '0', the predictor is not chosen for the assessment of the chromosome in question. The chromosomes representing the predictors are encoded as bit-strings. As the BGA runs, the individual chromosomes (feature subsets) in the present population are assessed, and their fitnesses are graded based on the GPR model residual or error. Chromosomes with smaller fitness (smaller residual or error) have a greater probability of passing to the next population or mating pool.
Each iteration of the BGA run decreases the error level and retains, as elite, the chromosome with the lowest (best) objective function value. The individual chromosome corresponding to the lowest error level of the fitness evaluation contains the desired most relevant features.
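The fitness evaluation of equation (9) amounts to masking the feature matrix with the chromosome's bits, fitting the model on the surviving columns, and scoring the residual MSE. The sketch below illustrates this; a linear least-squares fit stands in for the paper's GPR model, and the data and coefficients are synthetic assumptions:

```python
import numpy as np

def subset_fitness(bits, X, T):
    """MSE fitness of equation (9) for the feature subset where bits == 1.
    A linear least-squares fit stands in here for the paper's GPR model."""
    if bits.sum() == 0:
        return np.inf                                   # empty subset: invalid
    Xs = X[:, bits.astype(bool)]
    Xs = np.column_stack([np.ones(len(Xs)), Xs])        # add intercept column
    coef, *_ = np.linalg.lstsq(Xs, T, rcond=None)
    residuals = T - Xs @ coef
    return float(np.mean(residuals**2))

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))                           # synthetic feature matrix
T = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(0, 0.1, 200)

good = subset_fitness(np.array([1, 0, 1, 0, 0, 0]), X, T)   # informative subset
bad = subset_fitness(np.array([0, 1, 0, 1, 0, 0]), X, T)    # irrelevant subset
print(good, bad)
```

A subset containing the truly informative columns scores a far lower MSE than one containing only irrelevant columns, which is exactly the signal the BGA exploits.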

C. REPRODUCTION
Table 3 presents the parameters of the BGA used in this paper. From Table 3, the chromosome length equals 24, as there is an overall total of 24 predictors nominated for the FS work in this paper. The maximum number of iterations (generations) is set to 100 to avoid the BGA becoming stuck in a local optimum. Following the fitness evaluation, a new population is produced for the next generation through elitism, crossover and mutation. In BGA, three kinds of sequential offspring are formed to create the new population [29]:
1) Elite offspring: A selection mechanism must be predetermined in BGA to ensure the population continuously improves across iterations. The selection method helps the BGA discard the worst individuals and keep only the best chromosomes. Several selection methods exist for BGA; the tournament selection mechanism (with size 2) is used in this study because of its ease of use, swiftness and efficiency [25], [30].
Besides, the tournament selection method imposes a suitable selection pressure on the BGA, which results in a better convergence rate and ensures that bad candidate solutions are not moved into the succeeding generation.
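Tournament selection of size 2 simply compares two randomly drawn chromosomes and keeps the fitter one. A minimal sketch (with made-up fitness scores) shows how repeated tournaments bias the mating pool toward fitter individuals:

```python
import numpy as np

def tournament_select(scores, rng, size=2):
    """Tournament selection: draw `size` random candidates and return
    the index of the one with the lowest (best) fitness score."""
    candidates = rng.integers(0, len(scores), size=size)
    return int(candidates[np.argmin(scores[candidates])])

rng = np.random.default_rng(0)
scores = np.array([5.0, 1.0, 3.0, 0.5, 4.0])   # lower MSE = fitter chromosome
picks = [tournament_select(scores, rng) for _ in range(1000)]
counts = np.bincount(picks, minlength=5)
print(counts)   # the fittest chromosomes (indices 3 and 1) win most tournaments
```

The worst chromosome can only win a tournament against itself, so poor solutions rarely propagate into the next generation.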

D. CONVERGENCE CONDITION
The BGA terminates when it converges to the desired optimal solution. The optimal solution corresponds to the desired feature subset for the FS problem in question. The termination condition under which the BGA stops running is known as the convergence or stopping condition. The two convergence conditions used in this paper are:
1) Maximum number of generations or iterations
2) Stalled generation limit
The values used for these convergence conditions are given in Table 3.

E. FINAL FEATURE SUBSET
After the BGA attains convergence, the chromosome that resulted in the best fitness score is chosen and decoded as the final feature subset, shown in Figure 3.

VII. EXPERIMENTAL RESULTS AND VALIDATION
In this section, the case study for the proposed FS work and the results obtained are discussed. Comparative validation, evaluation of the FS results for improved forecasting, and quantitative relevance analysis of the FS results are also presented.

A. CASE STUDY
In this paper, the hybrid BGA-GPR based FS approach is developed and applied to four electricity demand datasets obtained from four building types (customer classes) in the Otaniemi area of Espoo, Finland. The buildings are Building A (residential), Building B (educational, containing classrooms and laboratories), Building C (office), and Building D (mixed-use, containing computer laboratories and a health care center). The buildings have peak (in 2015) aggregate electricity demands of 221 kW, 592 kW, 29 kW, and 86 kW, respectively. The feature space for the FS work in this paper is described in Table 2. The electricity demand datasets of the four buildings are the target variables in the proposed FS strategy. A one-year (2015) hourly sample, 8760 values, of both the feature sets and target variables is used for the FS work.

B. FS RESULTS
The devised BGA-GPR based FS algorithm is run for each of the four datasets (representing four electricity customer types) separately. The empirical results found for the four datasets are presented in Table 4. The fitness values in Table 4 are calculated using the unnormalized (original dataset format) values of the selected features. As clearly observed from the FS results in Table 4, the number of predictors chosen by the proposed FS strategy is considerably lower than the size of the feature space (the number of predictors in the original dataset given in Table 2). This can be attributed to the irrelevant and redundant information carried by most of the variables in the original feature space. The BGA finally selects the feature subset which contains the most relevant and nonredundant variables. For the purpose of consistency, and to make use of a similar set of predictors, the features selected for at least one of the buildings can be chosen to form the input dataset for short-term forecasting of electricity demand in buildings. Hence, the final selected feature subset consists of 19 features: hour of the day, day of the week, month of the year, season of the year, period of the day, holiday/weekend indicator, ambient air temperature, dew point, humidity, air pressure, wind direction, wind speed, gust speed, global solar radiation, sunshine duration, electricity price, previous 24 h average electricity demand, 24 h lagged electricity demand, and 168 h lagged electricity demand.
Besides, the average computation time of the devised integrated BGA-GPR based FS algorithm with a two-year-long sample of 24 initial features is about 30 minutes, using the MATLAB simulation environment on a research workstation with an Intel Core i7-6820HQ processor (2.70 GHz CPU) and 16 GB RAM.

C. COMPARISON WITH OTHER FS APPROACHES
To validate the BGA-GPR FS work in this paper, the feature subset obtained by the proposed BGA-GPR FS was compared with the feature subsets obtained by two other common FS approaches, namely correlation-based feature selection (C FS) and neighborhood component analysis regression-based feature selection (NCA FS). The correlation-based FS first calculates the Pearson and Spearman correlations of each feature with the target and then takes the maximum of the two correlation coefficients. A feature with a correlation value greater than a given threshold (0.5 in this paper) is deemed relevant and included in the final feature subset. The NCA FS operates based on a neighborhood component analysis (NCA) regression model fitted over the feature subsets versus the target dataset. The NCA FS obtains the predictor weights (for reduced feature subsets) using a diagonal adaptation of the NCA regression model, realizing FS by regularizing the predictor weights.
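The correlation-based baseline described above is straightforward to reproduce. The sketch below applies the max(|Pearson|, |Spearman|) > 0.5 rule; the feature names and data are hypothetical, and the Spearman coefficient is computed as the Pearson correlation of ranks:

```python
import numpy as np

def pearson(a, b):
    return np.corrcoef(a, b)[0, 1]

def spearman(a, b):
    # Spearman correlation = Pearson correlation of the ranks.
    rank = lambda v: np.argsort(np.argsort(v))
    return np.corrcoef(rank(a), rank(b))[0, 1]

def correlation_fs(X, y, names, threshold=0.5):
    """Keep features whose max(|Pearson|, |Spearman|) with the target
    exceeds the threshold (0.5, as in the paper's baseline)."""
    return [name for j, name in enumerate(names)
            if max(abs(pearson(X[:, j], y)), abs(spearman(X[:, j], y))) > threshold]

rng = np.random.default_rng(0)
relevant = rng.normal(size=500)                 # behaves like a driving predictor
noise = rng.normal(size=500)                    # behaves like an irrelevant one
y = relevant + 0.2 * rng.normal(size=500)
X = np.column_stack([relevant, noise])
selected_names = correlation_fs(X, y, ["temperature", "wind_gust"])
print(selected_names)
```

Because this filter scores each feature independently against the target, it cannot detect redundancy between features, which is one reason the wrapper-based BGA-GPR approach can outperform it.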
Table 5 provides the performance comparison of the FS results by the proposed method and the two other methods. For comparability, the same fitness function (MSE), modeled as the residual of the GPR model, is used; that is, the feature subset selected by each method is evaluated for fitness using the GPR model residual. Besides, the fitness values are calculated using the unnormalized values of the selected features.
As shown in Table 5, the proposed BGA-GPR based FS achieved the feature subset with the best fitness value (lowest MSE). Hence, the feature subset selected by the proposed FS strategy contains more relevant and nonredundant features than those of the other evaluated FS methods. This means that an electricity demand forecasting model whose input dataset constitutes the feature subset achieved by the proposed BGA-GPR FS strategy can attain accurate forecast results.

D. EVALUATION OF FS RESULTS FOR IMPROVED FORECASTING
In order to further validate the effectiveness of the obtained FS results, a feedforward artificial neural network (FFANN) based 24 h-ahead electricity demand forecast model was developed for each customer category. The 19 features selected by the devised BGA-GPR FS, presented in Section VII B, form the training input dataset for the FFANN forecast model, while the training target variable is the electricity demand of each building. A two-year (2015-2016) hourly time series of the selected features and the target variable was employed to train the FFANN model. The FFANN model parameters were found experimentally: a hidden layer of 10 neurons was used, and the conventional GA was used to find the optimal values of the FFANN weight parameters. The prediction performance of the developed FFANN forecast model was verified with one-year (2017) testing data.
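A minimal stand-in for this model is sketched below, assuming scikit-learn. The single 10-neuron hidden layer matches the paper; the weights here are trained with L-BFGS rather than the paper's GA-based weight optimization, so this is a structural sketch only.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_demand_model(X, y):
    """FFANN with a single hidden layer of 10 neurons, as in the paper.
    Note: weights are fitted by L-BFGS here, not by the paper's GA."""
    model = make_pipeline(
        StandardScaler(),  # neural nets train poorly on unscaled inputs
        MLPRegressor(hidden_layer_sizes=(10,), solver="lbfgs",
                     max_iter=5000, random_state=0),
    )
    return model.fit(X, y)
```

In the paper's setup, `X` would hold the 19 selected features for the 2015-2016 hours and `y` the building's hourly electricity demand; predicting the next 24 rows of features gives the 24 h-ahead forecast.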
To illustrate the forecast results here, the model testing result is given for four randomly chosen weekdays and four weekends/holidays representing the weekdays and weekends of the four seasons of the year: summer weekday (Wednesday, July 26, 2017), summer weekend (Sunday, July 16, 2017), fall weekday (Thursday, October 12, 2017), fall weekend (Saturday, October 28, 2017), winter weekday (Monday, January 9, 2017), winter holiday (Sunday, January 1, 2017), spring weekday (Tuesday, April 18, 2017), and spring weekend (Saturday, April 8, 2017). For the purpose of illustration, the forecast results for one of the buildings, Building B, are shown next. The forecast results for the eight randomly chosen testing days, with one-hour time resolution, are depicted in Figures 4 and 5 for the weekdays and weekends/holidays, respectively. As can be observed in Figures 4 and 5, the forecasts follow the actual electricity demand trends with small gaps (errors) between them.
This further verifies the effectiveness of the proposed FS approach in selecting the best feature subset, enabling the forecast model to achieve more accurate forecasts.
Moreover, the following criteria were used to evaluate the accuracy of the obtained forecasts:
• Error:
Error_h = P_h^a − P_h^f (10)
where P_h^a and P_h^f are the actual and forecast values of the electricity demand at hour h, respectively.
• Mean absolute error (MAE):
MAE = (1/N) Σ_{h=1}^{N} |P_h^a − P_h^f| (11)
where N is the forecasting horizon, whose value is 24 for the 24 h-ahead forecast.
• Mean absolute percentage error (MAPE):
MAPE = (100/N) Σ_{h=1}^{N} |(P_h^a − P_h^f) / P_h^a| (12)
An MAE of 7.20 kWh, a MAPE of 1.96%, and a daily peak MAPE of 2.04% are obtained for the one-year (2017) forecast using the proposed BGA-GPR FS results as the input dataset for the FFANN based forecast model of the educational building (Building B). Figure 6 shows a schematic illustration of the values of the forecast accuracy evaluation criteria used. Hence, the obtained numerical results further confirm the acceptability of the achieved forecast accuracy and the effectiveness of the implemented FS method.
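The three accuracy criteria can be computed in a few lines; the function name is illustrative.

```python
import numpy as np

def forecast_errors(actual, forecast):
    """Per-hour error, MAE and MAPE (eqs. (10)-(12)) over an
    N-hour horizon (N = 24 for the 24 h-ahead forecast)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    error = actual - forecast                        # eq. (10)
    mae = np.mean(np.abs(error))                     # eq. (11)
    mape = 100.0 * np.mean(np.abs(error / actual))   # eq. (12)
    return error, mae, mape
```

Note that the MAPE is undefined for hours with zero actual demand, which is rarely an issue for whole-building loads but worth guarding against for smaller meters.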

E. QUANTITATIVE RELEVANCE ANALYSIS OF FS RESULTS
In order to quantify the benefits and relevance of the proposed hybrid BGA-GPR based FS method and the selected features, the following metrics are used:
• Computation time reduction:
Δt_comp = (t_without_FS − t_with_FS) / t_without_FS (13)
where t_without_FS is the total computation time (including data preprocessing, forecasting model training, validation, and prediction) using the original feature space without FS, t_with_FS is the total computation time with the use of the obtained FS results, and Δt_comp is the change in total computation time due to FS. A positive value of Δt_comp indicates a reduction in the computation time requirement of the electricity demand forecasting model due to making use of the FS results.
• Dimensionality reduction:
ΔD = (n − n_r) / n (14)
where R^(m×n)_without_FS is the matrix of the feature space without FS, with m samples and n features, R^(m×n_r)_with_FS is the matrix of the reduced feature space with FS, with m samples and n_r reduced features, and ΔD is the change in data dimension due to FS. A positive value of ΔD (n > n_r) indicates a reduction of the input data dimension for the electricity demand forecasting model.
• FS fitness value improvement:
Δfit = (fit_without_FS − fit_with_FS) / fit_without_FS (15)
where fit_without_FS is the fitness value of the features without FS with respect to a predefined fitness function (the MSE of output and actual target formulated in equation (9)), fit_with_FS is the fitness value of the selected features with FS, and Δfit is the change in fitness value due to FS. A positive value of Δfit indicates an improvement in the predictive model fitness value (a reduction in the MSE value) due to FS.
• Forecasting accuracy improvement:
Δacc = (acc_with_FS − acc_without_FS) / acc_without_FS (16)
Here, acc_without_FS is the accuracy of the forecasts without making use of the FS results (using the original feature space as the training input) and acc_with_FS is the accuracy of the forecasts with the FS results (using the reduced feature space as the training input). acc_without_FS and acc_with_FS are defined as follows:
acc_without_FS = 100 − MAPE_without_FS (17)
acc_with_FS = 100 − MAPE_with_FS (18)
where MAPE_without_FS and MAPE_with_FS are the mean absolute percentage errors of the forecasts without and with FS, respectively. Δacc is the change in forecast accuracy due to FS; a positive value of Δacc indicates an improvement of the forecast accuracy due to making use of the FS results in the forecasting process.
Table 6 presents the values of the metrics defined in equations (13) to (16), which quantify the benefits achieved by the implementation of the devised FS method for short-term electricity demand forecasting.
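The four relevance metrics reduce to simple ratios; a sketch follows. The function name is illustrative, and all timing, fitness, and MAPE inputs in the usage below are hypothetical, except the Building A feature counts (n = 24 original features, n_r = 19 selected) taken from the paper.

```python
def fs_benefit_metrics(t_without, t_with, n, n_r,
                       fit_without, fit_with,
                       mape_without, mape_with):
    """Relative gains from FS per eqs. (13)-(16); positive = improvement."""
    dt = (t_without - t_with) / t_without           # eq. (13): time saved
    dD = (n - n_r) / n                              # eq. (14): dims dropped
    dfit = (fit_without - fit_with) / fit_without   # eq. (15): MSE reduced
    acc_wo = 100.0 - mape_without                   # eq. (17)
    acc_w = 100.0 - mape_with                       # eq. (18)
    dacc = (acc_w - acc_wo) / acc_wo                # eq. (16): accuracy gain
    return dt, dD, dfit, dacc
```

With n = 24 and n_r = 19, the dimensionality reduction evaluates to 5/24 ≈ 20.8%, matching the Building A figure reported below.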
As clearly shown in Table 6, the implementation of FS and its integration into the forecasting model has resulted in substantial improvements over the forecasting performance obtained with the original dataset without FS. For example, for the residential building (Building A), the enhancement in fitness value (MSE) when the variables selected by the BGA FS algorithm are used to fit the electricity demand with the GPR model is 42.9% over the original dataset (without FS). Similarly, the selected feature subsets of the educational (Building B), office (Building C) and mixed-use (Building D) building types have given enhancements of 96.5%, 96.7% and 99%, respectively, compared to the original dataset, with respect to the GPR-modeled MSE fitness function.
The reduction in data dimensionality over the original feature space is 20.8%, 29.2%, 37.5% and 37.5% for Buildings A, B, C and D, respectively. Likewise, the decrease in total computation time is 17.7%, 26.1%, 34.6% and 34.6%, respectively, for Buildings A, B, C and D.
Above all, the improvement of the forecasting accuracy is the major objective of this study. The improvement in prediction accuracy when the features selected by the BGA-GPR FS constitute the forecasting model training inputs is 38.7%, 81.2%, 81.9% and 83.0%, respectively, for Buildings A, B, C and D. Thus, the above quantifications and experimental results further demonstrate the relevance of effective FS for the improvement of electricity demand forecasting.

VIII. CONCLUSION
This paper proposed and implemented a BGA based feature selection approach for improved short-term electricity demand forecasting models. The approach uses a GPR fitness function to choose a combination of predictors from a given original predictor space. The proposed BGA-GPR FS yielded a feature subset that resulted in a better fitness (lower MSE value) than the original dataset with all the initial features. For comparison and validation, the features selected by two other feature selection methods were presented; the BGA-GPR features outperformed them with respect to the MSE fitness function defined in the GPR framework. Moreover, the proposed BGA-GPR FS was applied to four different electricity demand datasets representing four different customer types (residential, educational, office and mixed-use building energy systems). It achieved the best feature subsets, which can constitute the input datasets for accurate forecasting of electricity demands for all the electricity customer groups. Furthermore, a FFANN based 24 h-ahead electricity demand forecast model was developed to evaluate the effectiveness of the FS results. The electricity demand forecasting model developed using the obtained FS results achieved an annual accuracy improvement of 38.7%, 81.2%, 81.9% and 83.0%, respectively, for the residential, educational, office and mixed-use building types, compared to forecasting based on the original feature space without FS. Therefore, the findings verify that the combination of an effective feature selection method and forecasting models has robust forecasting power compared to forecasting with arbitrary features and no predictor selection. The work is novel and effective from the application, algorithm hybridization and performance improvement perspectives. The study contributes a new and robust feature selection tool, combining BGA and GPR, for the improved (more accurate) electricity demand prediction problem.

In size-2 tournament selection, two individuals are drawn from the solution space after withdrawal of the elite offspring, and the better of the two (based on the objective function score) is chosen. Tournament selection is carried out repeatedly until the new population is fully populated. Elite offspring are moved automatically to the following generation, as they are the fittest individuals. The number of elite offspring is limited by, and generally less than, the population size. With a proper setting of the number of elite offspring (the elite count), the BGA takes the top elite-count best individuals and moves them directly into the succeeding generation. The amount of elitism used, as given in Table 3, is two; hence, the top 2 offspring with the best fitness scores are taken directly into the following generation. Therefore, the number of elite offspring is O_1 = 2, and there are 22 (i.e., 24 − O_1) chromosomes in the population besides the elite offspring. From these remaining 22 individuals, the crossover and mutation offspring are generated.
2) Crossover offspring: The BGA crossover operator genetically fuses two chromosomes (parents) to create offspring for the following generation. The two parents required for the crossover are obtained by tournament selection. The crossover function used in this paper is of the arithmetic type, which applies a logical XOR operation on the chromosomes of the two parents, as they are represented in binary form. The portion of the following generation, excluding the elite offspring, created by the crossover operator is known as the crossover offspring. The crossover fraction, which is the ratio of the number of crossover offspring to the total number of offspring other than the elites, is 0.8 in this paper. There are no mutation offspring in the BGA when the crossover fraction is set to one. With a crossover fraction of 0.8, the number of crossover offspring is O_2 = round(22 × 0.8) = 18.
3) Mutation offspring: Mutation is a genetic perturbation of the chromosomes in the population. The BGA implemented in this study uses uniform mutation, in which the BGA creates a set of uniformly distributed random numbers whose size equals the length of the chromosome. The random values correspond to the indices of the bits in the chromosome. For every bit, the random number at that index is compared with the mutation rate: if it is smaller than the mutation probability, the gene (bit) at that index is flipped; otherwise, it remains unflipped. This continues from the leftmost bit to the rightmost bit of each individual of the mutation offspring. Mutation offspring are very important for the BGA to maintain genetic diversity in the chromosomes, which helps the BGA avoid converging to local (suboptimal) solutions due to highly similar genes in the population. The number of mutation offspring is O_3 = 24 − O_1 − O_2 = 24 − 2 − 18 = 4, which verifies O_1 + O_2 + O_3 = 24.
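The offspring bookkeeping above can be checked with a short helper; the parameter values are those given in Table 3, and the function name is illustrative.

```python
def offspring_counts(pop_size=24, elite_count=2, crossover_frac=0.8):
    """Split one BGA generation into elite, crossover, and mutation
    offspring, using the population size, elite count, and crossover
    fraction configured in Table 3."""
    o1 = elite_count                                  # copied unchanged
    o2 = round((pop_size - o1) * crossover_frac)      # XOR crossover children
    o3 = pop_size - o1 - o2                           # remainder is mutated
    return o1, o2, o3
```

Setting the crossover fraction to 1.0 makes o3 = 0, reproducing the no-mutation case noted in the text.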

FIGURE 6. Obtained values of forecast accuracy evaluation criteria.

TABLE 1. Summary of FS strategies.

TABLE 2. Feature space of the FS problem.

TABLE 5. Comparison of FS results.

TABLE 6. Quantitative relevance analysis of FS results.