A survey of preprocessing methods used for the analysis of big data originating from smart grids

In this paper, a brief survey of data preprocessing methods is presented, specifically those used in the smart grid (SG) domain. With the advent of the SG, large-scale data collection has become possible. These data are essential for electricity demand, generation and price forecasting, which play an important role in making energy efficient decisions and in long and short term predictions regarding energy generation, consumption and storage. However, forecasting accuracy decreases when data is used in raw form; hence, data preprocessing is considered essential. This paper provides an overview of data preprocessing methods and a detailed discussion of the methods used in the existing literature, along with a comparison of these methods. A survey of closely related survey papers is also presented and the papers are compared based on their contributions. Moreover, based on the discussion of the data preprocessing methods, a narrative is built with a critical analysis. Finally, future research directions are discussed to guide the readers.

Handling big data remains a crucial issue for areas like social networking, computational intelligence, machine learning and data mining [11]-[14]. Different frameworks, such as Apache Spark, have been proposed for the in-depth analysis of data; their main focus is the representation of data and its prediction [15]. Usually, velocity, variety, veracity and volume are considered the main characteristics of the data acquired from SGs, as shown in Figure 1. Analyzing a high volume of real time data requires efficient processing [16]. Collecting data from different sources (sensors and smart meters) is considered the first step in big data analytics [14]. These data may include the energy consumed or demanded from the SG, data sensed by sensors, smart meters' electricity consumption records, the history of weather forecasts, etc. However, efficient integration, storage and cleansing of data are challenging issues. Therefore, this survey paper focuses on data preprocessing techniques, which have a great impact on a prediction model's output.
Preprocessing of data is considered the most important task in transforming data into a form that is acceptable and suitable for analysis. In preprocessing, the dataset size is reduced, data is normalized and outliers present in the dataset are detected. Generally, the following four steps are used in data preprocessing [17].
• Data cleansing. Initially, data is handled in two steps: data acquisition and data integration. Firstly, data is acquired from the real world; it may be incomplete, noisy or redundant. Therefore, data cleansing is used for removing redundancy in the data to make it consistent. Missing values are filled and smoothing of the data is done using clustering, binning and regression. Secondly, data taken from multiple sources, such as files, databases, etc., is integrated into a single source. Afterwards, the data is reduced and transformed in such a way that it does not lose its identity and becomes useful for further processing. Inconsistent and noisy data is removed in preprocessing, which plays a very important role in data analysis.
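As an illustrative sketch only (the surveyed works do not prescribe an implementation, and the function name and bin size here are hypothetical), these cleansing steps — filling missing values with the mean, removing duplicates and smoothing by bin means — can be expressed as:

```python
from statistics import mean

def clean_readings(readings, bin_size=3):
    """Fill missing values with the series mean, drop duplicate values,
    then smooth by replacing each bin with its bin mean (binning-based smoothing)."""
    present = [r for r in readings if r is not None]
    fill = mean(present)
    filled = [fill if r is None else r for r in readings]
    deduped = list(dict.fromkeys(filled))          # keep first occurrence only
    smoothed = []
    for i in range(0, len(deduped), bin_size):     # smoothing by bin means
        bin_ = deduped[i:i + bin_size]
        smoothed.extend([mean(bin_)] * len(bin_))
    return smoothed
```

In practice the smoothing could equally be done by clustering or regression, as the text notes; binning is chosen here only because it is the simplest to illustrate.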
The road map of the paper is shown in Figure 2. The remainder of the paper is structured as follows. Section II presents the related work, which is further divided into two subsections: related work of survey papers and related work of technical papers. Moving ahead, Section III comprises the data preprocessing steps and the methods used for performing data preprocessing. Section IV presents the critical analysis of the works discussed in this survey paper. The future challenges are highlighted in Section V, while the paper is concluded in Section VI. Table 1 presents the list of abbreviations, while Table 2 presents the symbols used in the equations given in the paper.

II. RELATED WORK
This section presents the related work of the existing research done in the data preprocessing domain. The discussion is divided into two subsections: one dealing with the existing survey papers and the other highlighting the work done in technical papers.

A. RELATED WORK OF EXISTING SURVEY PAPERS
In [1], a broad literature review of deep learning models for power forecasting of solar panels, wind turbines and electric load forecasting is presented. In the survey, the datasets utilized for testing and training of different forecasting techniques are discussed, which allows researchers to identify suitable datasets for their studies. A comparison of numerous deep learning models is also performed in the work. The study reveals that the performance of the forecasting techniques relies on the amount of available data, where a huge data storage system and fast computing technologies are used to deal with the big data issues. A short review of forecasting techniques and optimization mechanisms for tuning hyperparameters is presented in [2]. Moreover, data preprocessing models are discussed. The forecasting models are compared based on their error methods, preprocessing methods and hyperparameters' optimization. Furthermore, the data preprocessing models and the existing optimization methods are critically analyzed and their important findings are highlighted. In the paper, the authors present a review of the previous survey studies and calculate their recency scores according to the number of recently reviewed studies in them. In conclusion, the authors discuss the future research directions in detail. In [3], the authors present an overview and a comparison of load management approaches in a SG along with their technologies and development problems. Utility and consumer concerns in the context of load management are discussed to improve the readers' intuition on the topic. In the research, dynamic pricing and incentive models, which are the major categories of load management, are compared and discussed. Moreover, a description of dynamic pricing schemes based on home energy management and related optimization methods, and their comparison, are included in the work.
It is concluded that finding an appropriate control and communication infrastructure, optimizing the consumption of energy and creating load management policies are the ongoing research domains that are related to efficient load management in the SG.
In [4], a comprehensive survey of wireless communication technologies for better implementation of a SG is presented. Many attributes of the network, including data rate, power usage, Internet protocol support, etc., are considered for comparing the technologies in the context of the SG. The mechanisms that are appropriate for home area networks, such as Z-wave, IPv6 over low-power wireless personal area networks, Wi-Fi and ZigBee, are compared and discussed in the contexts of network attributes and consumer concerns. A similar discussion, in the context of utility concerns, is presented for wireless communication mechanisms such as GSM and WiMAX based cellular standards. The challenges of SG applications and the related network problems are discussed at the end. The authors in [5] deeply discuss many big data techniques as well as big data analytics in the SG context. Moreover, the authors present the opportunities and challenges that are brought about by the emergence of machine learning techniques and big data originating from the SGs. The authors conclude that Apache Spark is more appropriate for both real time and batch processing in SGs. However, Apache Spark is a centralized storage system and therefore needs a third party to manage the storage. Moreover, the authors advise users to install Apache Spark on Hadoop, where the advantages given by Spark are fully utilized together with a parallel distributed storage system.
In [6], a comprehensive survey of the advancements made in machine learning models is provided. In the paper, a short survey of SG designs and their sources of data is given. Furthermore, data security requirements and types of false data attacks are presented. In the survey, recent machine learning detection methods are summarized and grouped into three main detection scenarios: load forecasting, state estimation and non-technical losses. Furthermore, the authors investigate the future research directions at the end of the work by focusing on the deficiencies of the current machine learning methods. In [7], a review of deep learning models for forecasting the electricity load is presented. The review elaborates the previous works performed on the basis of distributed deep learning methods and conventional deep learning methods. It is concluded in the paper that data aggregation dependencies are advantageous for reducing the computational time of electricity load forecasting. In [8], detailed analyses of the machine learning methods used in, and the big data originating from, the energy sector are presented. Big data analysis for smart energy controls, along with its impact, measurement, applications, operations and problems, is discussed in the survey. Machine learning and big data methods need to be used after carefully analyzing the energy system issues, and determining the match between the advantages of machine learning and big data for resolving the issues present in energy systems holds paramount importance. These methods help to operate and plan the SG or the conventional power grid. In the study, the basics of machine learning and big data methods are also discussed along with their applications in various domains such as customer service, e-commerce, finance, retail, web and digital media, telecommunication, health care and electrical power. The opportunities and problems of machine learning and big data are presented in this work.
In [9], a broad analysis and review of the recently proposed works for secure data analytics in the SG platform is presented. However, achieving a secure data analytic solution for intelligent grid platforms is a critical issue, and the development endeavors and existing research on secure data analytic solutions in intelligent power grid platforms are not completely explored. Moreover, the distinctive behavior of secure data analytics and its complications for the SG are discussed. A complete taxonomy abstraction for a novel process model is presented, which highlights many research problems like data communication, data security and privacy issues, load management and analysis, load prediction, secure load data processing and storage, and secure data collection and preprocessing. In conclusion, a case study is shown to illustrate the process model. The survey in [18] explores the prospects of analyzing and verifying information from various large storage systems and analyzing many socioeconomic aspects related to a criminal incident. Moreover, the survey categorizes patterns, analyzes outliers and designs better schemes for predicting crimes utilizing machine learning and data mining methods. In the survey in [19], four feature selection (FS) models are utilized for finding the important features of human resource datasets to enhance the accuracy of data classification on the employee attrition of companies. In the paper, machine learning models such as neural networks, Naive Bayes (NB), gradient boosting trees, K-nearest neighbors (KNN), etc., are utilized for assessing the FS models' performance. It is concluded that, using FS models, the accuracy of data classification for the human resource datasets can be improved.
In [20], a broad survey of the crow search algorithm along with its latest variants, which are grouped into hybridized and modified schemes, is presented. In the paper, various uses of the crow search algorithm in many areas like distributed generation, economic dispatch, scheduling, image processing and FS are discussed. Moreover, suggestions regarding research domains of interest related to the crow search algorithm's hybridization, enhancement and possible new applications are made in the paper. In [21], a contemporary survey of recent breakthroughs in utilizing, redistributing, planning and trading the energy that is harvested in upcoming non-wired communication networks, inter-operating with the intelligent power grid, is presented. This survey discusses the classical models for technologies that are related to renewable energy harvesting. The authors discuss the optimization and constrained operation of various energy harvesting platforms like multi-cell, multi-hop, multipoint-to-multipoint, multipoint-to-point and point-to-point systems. They also review information transfer and wireless energy techniques that enable unique implementations of energy harvesting wireless communication. Finally, it is shown in the work that effective redistribution and mutual energy trading can highly minimize the energy consumption bill for providers of wireless services and decrease the energy consumption, respectively. Also, a comprehensive list of research directions that need further investigation is given. The authors in [22] present a brief introduction to big data services' framework and scientific computing models that include data storage and data collection. Moreover, big data analysis and processing based on various service requirements are discussed, which provide prominent data for servicing consumers. In addition, they introduce a cloud service based big data platform that gives improved performance results for large-scale data processing, analysis and storage.
In conclusion, some big data systems' applications in various domains are summarized.
In [23], a comprehensive survey of 200 recently published articles is done to review the currently proposed works and other practices for machine learning models. Moreover, the trends in broad-spectrum applications of SG areas are discussed. In the paper, the rapid expansion and increasing interest in the machine learning models' applications for addressing the scientific problems of SGs from many perspectives are demonstrated. Moreover, it is shown that problems like the analysis for intelligent decision-making and high-performance data processing remain open and worthy of further research efforts. In addition, the future views for utilizing advanced communication and computing technologies such as 5G wireless networks, the ubiquitous Internet of things and edge computing in the intelligent grids are elaborated. The authors conclude that machine learning is among the drivers of future intelligent energy platforms and the work gives an introductory basis for further development and exploration of the associated insights and knowledge.
In [24], the authors present a survey of the intelligent power grid along with its relevant features as well as its various views on industrial energy distributions. They also discuss how the SGs' technologies have changed over time and still have prospects for evolving and strengthening the distribution platforms further. In [25], the authors provide a detailed survey of the current machine learning methods for detecting a false data injection's attack against the energy platform's state estimation technique. Moreover, data driven methods, utilizing machine learning mechanisms, are used to overcome the limitations of the conventional residual data detection methods. In [26], the authors present a detailed review of opportunities and challenges for the construction of SGs. The survey provides many challenges related to SGs' construction such as distribution of grid management, energy storage, demand response, network communications and interoperability. Moreover, a review of many global, national, regional and local opportunities related to the construction of a SG is presented.
In [27], the authors provide a systematic analysis of the techniques utilized in the literature to predict solar irradiance. The work's target is to see how the meteorological input data, time slots, optimization, sample size and preprocessing methodology affect the complexity of models and their accuracy. Various findings and important parameters are presented in the paper. The survey gives important results, based on the studied literature, for selecting an optimal model at a specific site. Also, the metrics utilized for measuring the forecasting models' efficiency are discussed. In [28], a structured review of some existing artificial intelligence (AI) models for security issues, fault detection, power grid stability assessment and load forecasting is presented. The authors also discuss the research problems that arise when AI techniques are used to realize true SG platforms. Opportunities for using AI techniques to tackle SG issues are also presented. In conclusion, AI applications can improve and enhance the resilience and reliability of intelligent power grid platforms. The researchers in [29] present a complete review of detecting and classifying power quality disturbances by elaborating AI tools and signal processing methods with their pros and cons. Moreover, automatic recognition methods are critically analyzed for different modes of operation, AI methods, FS methods, preprocessing tools and the energy input signal types (noisy/real/synthetic). Furthermore, the study gives prominent recommendations to researchers who are interested in the domain of power quality analysis and wish to explore suitable methods for future enhancements. The summarized related work of the existing survey papers is given in Table 3.

B. RELATED WORK OF EXISTING TECHNICAL PAPERS
For efficient utilization of energy, an efficient energy management system (EMS) and a robust load forecasting model are required. In the existing literature, AI is used for handling such complex problems, and different AI methods are used in [30] for short term load forecasting (STLF). In that study, the accuracy of a model depends on the input parameters and the data preprocessing methods. In the literature, several FS methods are used for dimensionality reduction; however, FS techniques face challenges in the case of unbalanced data. An FS method based on genetic programming is proposed in [31], where the most discriminative features that give optimal solutions are selected. In addition, the scores of all metrics are analyzed, using which noisy, inconsistent and less important features are filtered out. Hence, processing and memory utilization are improved.
The effectiveness of feature classification can be improved by selecting relevant features from a dataset. It is necessary to remove noisy and redundant information without affecting the useful information. Hence, the main goal is to remove the maximum number of irrelevant features from the input datasets, which reduces the complexity of a model and the chances of overfitting. For improving classification accuracy, a new wrapper method based on genetic programming is proposed in [32].
In the field of pattern recognition, the extraction of meaningful information from large datasets is considered a challenging task. Conventionally, batch processing is used for extracting features from the whole dataset; however, this is not a feasible and convenient process. Hence, different preprocessing methods are proposed by researchers for feature extraction. The main focus of [33] is on subspace feature extraction methods based on the loss function. A gray wolf optimizer (GWO) method is used for converting features into binary form, and a binary GWO (BGWO) is proposed for the selection of features from the dataset. Different initialization methods are used for FS. The results are compared with existing schemes such as the genetic algorithm (GA) and particle swarm optimization (PSO), and the proposed technique outperforms the existing methods. Furthermore, a testing dataset is used for performance validation.
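A rough sketch of a binary GWO feature selector in this spirit is given below. It is an assumption-laden illustration, not the method of [33]: the fitness function is a stand-in correlation score rather than a trained classifier, and the sigmoid transfer constant is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    # Stand-in wrapper score: correlation of the selected features' mean with
    # the target, minus a small penalty per selected feature (hypothetical).
    if mask.sum() == 0:
        return -np.inf
    pred = X[:, mask.astype(bool)].mean(axis=1)
    return abs(np.corrcoef(pred, y)[0, 1]) - 0.01 * mask.sum()

def bgwo(X, y, n_wolves=8, n_iter=30):
    """Binary grey wolf optimizer: wolves are 0/1 feature masks; positions are
    updated toward the three best wolves, then re-binarized via a sigmoid."""
    n_feat = X.shape[1]
    wolves = rng.integers(0, 2, size=(n_wolves, n_feat)).astype(float)
    for it in range(n_iter):
        scores = [fitness(w, X, y) for w in wolves]
        order = np.argsort(scores)[::-1]
        alpha, beta, delta = wolves[order[0]], wolves[order[1]], wolves[order[2]]
        a = 2.0 - 2.0 * it / n_iter                  # decreases linearly to 0
        new = np.empty_like(wolves)
        for i in range(n_wolves):
            pos = np.zeros(n_feat)
            for leader in (alpha, beta, delta):      # pull toward 3 leaders
                r1, r2 = rng.random(n_feat), rng.random(n_feat)
                A, C = 2 * a * r1 - a, 2 * r2
                pos += leader - A * np.abs(C * leader - wolves[i])
            pos /= 3.0
            prob = 1.0 / (1.0 + np.exp(-10.0 * (pos - 0.5)))  # sigmoid transfer
            new[i] = (rng.random(n_feat) < prob).astype(float)
        wolves = new
    scores = [fitness(w, X, y) for w in wolves]
    return wolves[int(np.argmax(scores))].astype(bool)
```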
For knowledge extraction, an actual sample of the data is obtained, and the original dataset is divided into smaller datasets during data preprocessing. With the help of preprocessing, the efficiency of load forecasting is improved. Sampling and FS methods, which are data reduction techniques, are applied for the preprocessing of data [34]. In another work, big data techniques are used to model the recency effect [35]. Modeling of the recency effect is done on an hourly basis. In the work, the naive model is implemented in four different ways; however, the naive model does not show optimal results at the level of aggregation. Moreover, a regression model is used while considering daily data instead of hourly data, and it shows higher accuracy in comparison to the naive model.
It is difficult to forecast the load of a single building as compared to an aggregated load [36]. Several methods, like the support vector machine (SVM), artificial neural network (ANN), etc., are used for load forecasting; however, load forecasting is not performed in an efficient way. Therefore, a multi layer architecture and deep learning methods are used for single building load forecasting while considering short, long and medium term load forecasting. A convolutional neural network (CNN) is used for forecasting, and the results, compared with existing methods, show that the CNN performs better.
Prediction of load is done on a large scale using many approaches like neural networks (NNs), regression, etc. Prediction of electricity consumption in a market is done in [37]. In this work, the authors focus on residential areas for electricity load prediction. There are a large number of homes and the electricity demand of every house varies. Therefore, for feature relevancy, an FS approach is used, which finds the correlation among features. Moreover, a cluster based aggregation forecasting (CBAF) scheme is used for load forecasting in the model. STLF is considered a challenging job [38]. Errors occur in STLF when the normal power distribution exceeds the committed power; in such cases, the power requirement is fulfilled by purchasing it from the grid, which causes a significant increase in cost. Traditional load forecasting methods show less accuracy due to their poor performance; in comparison, AI methods are more suitable. A recurrent deep NN (RDNN) and a feed forward deep NN (FDNN) are used for short term prediction of electricity.
The model proposed by the authors in [39] is compared with the existing models in terms of accuracy and error estimation. Redundancy issues are found in data, which cause difficulty in price forecasting. To meet the electricity demand, efficient and accurate electricity demand forecasting is important. Traditional classifiers such as ANN and decision tree (DT) are used; however, these classifiers have an overfitting problem. In the proposed model, cross validation methods are used for adjusting the hyperparameters of SVM. A NN consists of various layers, and changes in the parameters of the earlier layers affect the output layers, so frequent adjustments are needed to handle poor training. To solve such problems, batch normalization is used. The details of batch normalization are as follows [40]: z_j is the output value and y_j is the input value. The mini batch size is referred to as n, which is the number of inputs in every mini batch. The mean of the total input for the same batch is given by v_A and the variance of the mini batch is denoted by s²_A. The y_j values are normalized as ȳ_j, the learnable parameters are represented as β and ω, and ε and AN are the model's performance error and the batch normalization function, respectively:

v_A = (1/n) Σ_{j=1}^{n} y_j,    s²_A = (1/n) Σ_{j=1}^{n} (y_j − v_A)²,
ȳ_j = (y_j − v_A) / √(s²_A + ε),    z_j = AN(y_j) = ω ȳ_j + β.

By applying batch normalization, the efficiency of training is improved.
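Under these definitions, a minimal NumPy sketch of batch normalization can be written as follows; treating ω and β as fixed constants (rather than learned parameters) is a simplification for illustration:

```python
import numpy as np

def batch_norm(y, omega=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch to zero mean and unit variance,
    then scale by omega and shift by beta."""
    mu = y.mean()                        # mini-batch mean v_A
    var = y.var()                        # mini-batch variance s^2_A
    y_hat = (y - mu) / np.sqrt(var + eps)
    return omega * y_hat + beta          # z_j = AN(y_j)
```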
Nowadays, efficient price and load forecasting is a challenge due to the complex and irregular nature of electricity consumption. Efficient STLF is done considering hourly prices [41]. For improving the forecasting accuracy, a NN with multiple layers is considered. Data is normalized before being passed to the input layer, with the variance taken as 1 and the mean taken as 0. Firstly, the initial weights of the NN are selected from the range −1/√e to 1/√e, where e is the number of inputs of a neuron.
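A small sketch of this initialization (the function name is hypothetical), where e is the number of inputs feeding each neuron:

```python
import numpy as np

def init_weights(e, n_out, rng=np.random.default_rng(0)):
    """Draw initial weights uniformly from [-1/sqrt(e), 1/sqrt(e)],
    where e is the number of inputs of a neuron (fan-in)."""
    bound = 1.0 / np.sqrt(e)
    return rng.uniform(-bound, bound, size=(e, n_out))
```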
The predicted value is denoted as ẑ_j and the actual value as z_j; the total number of predicted data points is denoted as m and the determination coefficient is represented by DC²:

DC² = 1 − ( Σ_{j=1}^{m} (z_j − ẑ_j)² ) / ( Σ_{j=1}^{m} (z_j − z̄)² ),

where z̄ is the mean of the actual values. Hence, data normalization improves the model's performance as compared to the existing approaches. For dimensionality reduction, various methods are used in [42], like the discrete wavelet transform (DWT), symbolic aggregation approximation (SAX) and discrete fourier transform (DFT). Arbitrary values (y = y_1, y_2, ..., y_n) of the consumption data are transformed into the range [0, 1].
At time t, ȳ_j and y_j are the normalized and actual values, respectively, and the minimum and maximum consumptions are represented by y_min and y_max, respectively:

ȳ_j = (y_j − y_min) / (y_max − y_min).

For improving the potential effect of demand response, a filter method is applied. The euclidean distance (ED), which has a low bounding property, is calculated on the time series data. For reducing the data dimensionality, the SAX approach is used, which converts the numerical time series data into symbolic strings. Firstly, the load data is transformed into a piecewise aggregation approximation (PAA). Secondly, the PAA output is discretized into a string of symbols. In PAA, the amplitude values belonging to identical intervals are averaged out; with the n normalized load values reduced to m segments, the mean values are calculated as:

ȳ_j = (m/n) Σ_{k=(n/m)(j−1)+1}^{(n/m)j} y_k.

The index of the normalized load data is represented by k and the index of the transformed PAA load is denoted by j. The domain break-point at the j-th time is l_j and the average value of the j-th segment is ȳ_j. After applying this averaging, smooth values are achieved. Moreover, PAA is a pruning process based on DWT, whose computational cost is low. Discrete symbols are then obtained by applying the SAX algorithm.
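The PAA and SAX steps can be sketched as below; the four-letter alphabet and its Gaussian breakpoints are illustrative choices, not taken from [42]:

```python
import numpy as np

def paa(series, n_segments):
    """Piecewise aggregation approximation: average each segment."""
    series = np.asarray(series, dtype=float)
    chunks = np.array_split(series, n_segments)
    return np.array([c.mean() for c in chunks])

def sax(series, n_segments, alphabet="abcd"):
    """Z-normalize, reduce with PAA, then discretize segment means with
    equiprobable N(0,1) breakpoints (hard-coded here for 4 symbols)."""
    z = (series - np.mean(series)) / (np.std(series) + 1e-12)
    segments = paa(z, n_segments)
    breakpoints = np.array([-0.67, 0.0, 0.67])   # N(0,1) quartiles
    return "".join(alphabet[np.searchsorted(breakpoints, s)] for s in segments)
```

For example, a smooth load curve of 64 points reduced to 8 segments yields an 8-letter string whose symbols follow the rise and fall of the curve.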
In the classification step, a generalized normalized euclidean distance (NED) is used to find the similarity between two real-valued vectors. NED is calculated between two data samples as follows [43]:

NED_12 = √( (1/m) Σ_{j=1}^{m} ((y_1j − y_2j) / y_max)² ).
In NED_12, 1 and 2 represent sample 1 and sample 2, between which the NED is calculated; y_1j and y_2j represent the load data of sample 1 and sample 2, respectively. The dimension is represented by m and the maximum load of every dimension is denoted by y_max. Hence, the application of NED ensures better understandability of the data for further processing.
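A minimal sketch of NED, assuming y_max holds the per-dimension maximum loads:

```python
import numpy as np

def ned(y1, y2, y_max):
    """Normalized Euclidean distance between two load profiles, with each
    dimension scaled by its maximum load before averaging."""
    y1, y2, y_max = map(np.asarray, (y1, y2, y_max))
    return np.sqrt(np.mean(((y1 - y2) / y_max) ** 2))
```

Scaling by y_max keeps every dimension's contribution in [0, 1], so no single high-load dimension dominates the distance.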
In power planning, analysis and forecasting of load are very important [44]. Data integration has a significant impact on prediction. Planners record the load data and perform the analysis considering various scales of time. Load data consists of various features. By considering these features, a coordinated forecasting method is proposed. In this way, accuracy and efficiency of the proposed method are improved.
Data quality is improved after applying preprocessing methods, which helps in the analysis of the predicted data. In [45], a hybrid approach is used for load forecasting. This hybrid approach is applied to historical data taken from the real world. The proposed methodology has two steps: firstly, a relevance vector machine (RVM) is used for predicting the price at an individual level; secondly, aggregation is applied to the individual predictions, after which regression is applied. The results are compared with the individual RVM, NB and the auto regressive moving average (ARMA). The proposed methodology gives better prediction results as compared to the other techniques.
Forecasting of the electricity load for the next day is done using the proposed FS technique [46]. Data dimensionality is reduced using principal component analysis (PCA): the dimensionality of the time series data is reduced and only the distinct variables are used. The performance parameters of PCA are tuned using GA, after which the prediction accuracy is measured. The proposed model gives better accuracy in terms of load and price forecasting as compared to existing models. However, interdependencies between the load and price data are not considered. So, an approach termed multiple input and multiple output (MIMO) is proposed, in which the correlation between the electricity price and load data is calculated. Three components are considered in the proposed model. With the help of the wavelet packet transform (WPT), subsets of data are made. Moreover, DWT is used as a filter method in many scientific papers. On the other hand, load signals and electricity price data have non linear patterns. Therefore, approximation coefficient vectors are found in DWT; however, some useful information is lost. Hence, WPT is used for decomposition and finding the approximate coefficients, and the computational time of the proposed model is decreased using WPT. For selecting the best input candidates, generalized mutual information (GMI) is used. Forecasting is done using the least square support vector machine (LSSVM). For simultaneous prediction of load and price, the MIMO model is used as a base model for the proposed framework, LSSVM-MIMO [47]. In DWT, the original signal is obtained using the following equation:
f(t) = Σ_m c_{k−1}(m) ϕ(2^{k−1}t − m) + Σ_m d_{k−1}(m) Ψ(2^{k−1}t − m),

where c_{k−1}(m) and d_{k−1}(m) are the approximation and detail coefficients, the scaling function is given by ϕ(2^{k−1}t − m) and the wavelet function is given by Ψ(2^{k−1}t − m). The intermittent nature of renewable energy sources makes the management of electricity prices difficult [48]. Consequently, an imbalance occurs between electricity production and consumption, due to which the grid becomes unsteady. For removing this uncertainty and ensuring grid stability, a combination of four deep learning models is used in the proposed work for predicting the electricity load and improving the prediction accuracy.
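The PCA-based dimensionality reduction used in [46] can be sketched as follows; computing the components via SVD of the mean-centered data is one common implementation choice, not necessarily the one used in the paper:

```python
import numpy as np

def pca_reduce(X, k):
    """Project X onto its top-k principal components, obtained from the
    SVD of the mean-centered data matrix."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt = components
    return Xc @ Vt[:k].T
```

Because singular values are returned in descending order, the first projected column always carries at least as much variance as the second.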
In this survey, the main focus is on the preprocessing techniques used for big data. The data is taken from the SG using different sensors and devices. Comparisons of various preprocessing methods are made. At the end of this paper, a critical analysis is performed and future directions are provided.

III. DATA PREPROCESSING
Data preprocessing involves transforming raw data into well-formed datasets so that data analytics can be applied. Using data preprocessing methods, irrelevant and redundant data is removed from the dataset. Data preprocessing is considered the foremost step in data analytics. Data cleansing, integration, transformation and reduction are the most important steps of preprocessing. The main purposes of data preprocessing are to remove all irrelevant data and ensure consistency in the data representations. In Figures 3 and 4, the data preprocessing hierarchy and the steps involved in preprocessing big data originating from the SG are shown, respectively.

A. DATA REDUCTION
Data reduction is the transformation of digital numerical or alphabetical information derived empirically or experimentally into a corrected, ordered and simplified form [49]. Data reduction is shown with the help of Figure 5. Infrequent and large spikes are found in electricity data due to the dynamic nature of its price. For electricity price prediction, better results can be achieved by ignoring spikes during the estimation process [50]. In this way, time series data is obtained that is easy to handle and use for electricity forecasting. Similarly, log transformation is used as a preprocessing step and for parameter estimation. To further enhance model performance, the data fed to machine learning models is normalized within the [0, 1] interval. In this way, accurate modeling is done; however, statistical significance is not considered [51], [52], [53].
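These two reduction steps — log transformation to damp large price spikes, followed by scaling into [0, 1] — can be sketched as below (the function name and the small offset eps are hypothetical):

```python
import numpy as np

def preprocess_prices(prices, eps=1e-9):
    """Log transform damps large spikes, then min-max scaling
    maps the result into the [0, 1] interval."""
    logged = np.log(np.asarray(prices, dtype=float) + eps)
    lo, hi = logged.min(), logged.max()
    return (logged - lo) / (hi - lo)
```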
A filtering approach is applied on the electricity consumption data to filter it for further processing, and a two step method is proposed for the selection of relevant features [54]. In the first step, daily and weekly patterns of the electricity load data are captured. After that, using FS approaches, subsets of these features are formed. Four FS approaches are used to extract relevant features: mutual information (MI), auto correlation (AC), correlation FS (CFS) and RReliefF. Furthermore, the value of the AC function is computed and, on its basis, the correlation strength between features is identified using the following equation [54]:

r_j = c_j / c_0,    c_j = (1/N) Σ_{t=1}^{N−j} (Y_t − Ȳ)(Y_{t+j} − Ȳ).

The time series value at time t is given as Y_t, Ȳ represents the mean of all values of Y in the given time slot, r_j is the autocorrelation coefficient at lag j and c_j is the linear autocovariance. Dependencies between two features Y and Z are measured using MI: MI has a positive value when the features are dependent and is zero when they are independent. MI is a very important FS method used for electricity load forecasting, since both non-linear and linear correlations among features can be captured using this method. MI, based on KNN, is applied on the data distribution for computing the MI between two features [54].
MI(P, Q) = Ψ(n_neigh) − (1/M) Σ_{i=1}^{M} [Ψ(m_p(i)) + Ψ(m_q(i))] + Ψ(M). (13)

Here, Ψ(·) is the digamma function, n_neigh is the number of nearest neighbors, M is the number of samples and P and Q are the features. m_p(i) is the number of points p_k whose distance to p_i satisfies ||p_i − p_k|| ≤ η_p(i)/2, and m_q(i) is the number of points q_k whose distance to q_i satisfies ||q_i − q_k|| ≤ η_q(i)/2. The distance between p_i and its n_neigh-th neighbor is given by η_p(i)/2 and the distance between q_i and its n_neigh-th neighbor is given by η_q(i)/2. In CFS, an individual feature subset is produced explicitly. All features are ranked individually and final subsets of variables are formed. The selected features are not correlated with each other; rather, they are correlated with the predicted values.
merit = (r · c̄_f) / sqrt(r + r(r − 1) · c̄_ff),

where r is the number of features, c̄_f denotes the average correlation between each feature f and the predicted variable, and c̄_ff represents the average feature-to-feature pairwise correlation.
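Using the standard CFS merit formulation consistent with the symbols above (r features, average feature-to-target correlation c̄_f, average pairwise correlation c̄_ff), the score can be sketched as follows; the numeric inputs are illustrative:

```python
import math

def cfs_merit(r, avg_feature_class_corr, avg_feature_feature_corr):
    """CFS merit of a subset of r features: rewards high
    feature-to-target correlation, penalizes feature redundancy."""
    return (r * avg_feature_class_corr) / math.sqrt(
        r + r * (r - 1) * avg_feature_feature_corr)

# Two hypothetical 5-feature subsets with the same relevance (0.6)
# but different internal redundancy.
independent = cfs_merit(5, 0.6, 0.1)
redundant = cfs_merit(5, 0.6, 0.9)
```

The less redundant subset scores higher, which is exactly the behavior CFS uses when ranking candidate subsets.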
RReliefF is applicable to forecasting, selection and classification tasks [55]. It uses probability values to differentiate the values of two classes. RReliefF randomly selects an instance I and searches for its closest neighbor in the same class (the nearest hit) and in another class (the nearest miss). The nearest hit is represented as T and the nearest miss as S. The weights w_f of the features are updated through the following equation, where the difference between two instances on feature f is calculated using the diff function [55]:

w_f = w_f − diff(f, I, T)/m_relief + diff(f, I, S)/m_relief.

The values are then normalized between 0 and 1.
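A minimal sketch of one Relief-style weight update, assuming numeric features normalized by their range; the instances I, T, S and the feature values are hypothetical, and m corresponds to m_relief:

```python
def diff(f, a, b, min_f, max_f):
    """Normalized difference of two instances on feature f."""
    return abs(a[f] - b[f]) / (max_f - min_f)

def update_weight(w_f, f, I, T, S, min_f, max_f, m):
    """One Relief-style update of the weight of feature f:
    penalize distance to the nearest hit T, reward distance
    to the nearest miss S."""
    return (w_f - diff(f, I, T, min_f, max_f) / m
                + diff(f, I, S, min_f, max_f) / m)

# Feature 0 agrees with the hit but separates the miss,
# so its weight grows.
I, T, S = (0.2, 5.0), (0.2, 4.0), (0.9, 5.0)
w = update_weight(0.0, 0, I, T, S, min_f=0.0, max_f=1.0, m=1)
```

A feature that distinguishes the classes accumulates a large positive weight over many such updates, while an irrelevant feature drifts toward zero.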
Selecting m_relief instances at random from the training data increases the variance of the FS method. To reduce this variance, the m_relief random samples are replaced by all training examples, which also increases the reliability of the features' weights. RReliefF works well on irrelevant, noisy and redundant data. Its time complexity is linear, so it is efficient compared to other algorithms. To perform accurate load estimation, sufficient information regarding the electricity load and the features of the data must be available.
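As a concrete sketch of the AC-based feature scoring used in [54], the lag-j autocorrelation r_j = c_j / c_0 can be computed as follows; the load series is synthetic, with a built-in weekly pattern:

```python
def autocorrelation(y, j):
    """Lag-j autocorrelation r_j = c_j / c_0 of a time series."""
    T = len(y)
    mean = sum(y) / T
    c0 = sum((v - mean) ** 2 for v in y) / T
    cj = sum((y[t] - mean) * (y[t + j] - mean) for t in range(T - j)) / T
    return cj / c0

# A perfectly periodic weekly load pattern scores high at lag 7.
load = [1, 2, 3, 4, 5, 6, 7] * 8
r7 = autocorrelation(load, 7)
```

In a two-step FS pipeline, lags with high r_j identify the daily and weekly lagged values worth keeping as features.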
A dataset with millions of records is likely to include missing and erroneous values, along with outliers. For analysis purposes, data must be in a proper format. An outlier is an unwanted training item that has an unexpected feature value due to non-ordinary conditions or exceptions [56]. Rejecting outliers is an important task; for this purpose, an outlier rejection algorithm is proposed in [57], where a distance based outlier rejection (DBOR) method removes the outliers. If outliers are present in the data, the classifier may overfit. For further processing, data preparation is required. In [48], a clustering and incremental frequent pattern mining (IFPM) mechanism is proposed. In addition, the correlation of each data sample is calculated between appliances' instances. Moreover, in cluster analysis, classes are constructed on the basis of members' similarity and dissimilarity [48]. Using the above mentioned processes, the volume of data is reduced while the analytical results remain the same. Data can also be reduced in the following ways.

1) The number of attributes is reduced
Different FS methods are used to reduce the number of attributes and select only the most prominent and useful features.

2) The number of attribute values is reduced
Attribute values can be reduced using PCA [58], which is used for the representation of data vectors. Firstly, normalization is done so that all attribute values lie in the same range and no attribute dominates the others. Secondly, vectors are calculated on the basis of the normalized data; the computed vectors are known as principal components (PCs). Thirdly, these PCs are arranged in decreasing order of variance, through which patterns or groups are identified. Lastly, the size of the data is reduced by removing the components that have very low variance, and the original data is approximated from the remaining components.
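The four steps just listed (normalize, compute PCs, sort by variance, drop low-variance components) can be sketched with NumPy on synthetic data in which one attribute nearly duplicates another:

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 samples of 3 attributes; the third is nearly a copy of the
# first, so most variance lives in fewer than 3 directions.
a = rng.normal(size=200)
b = rng.normal(size=200)
data = np.column_stack([a, b, a + 0.01 * rng.normal(size=200)])

# 1) normalize each attribute
z = (data - data.mean(axis=0)) / data.std(axis=0)
# 2) principal components = eigenvectors of the covariance matrix
cov = np.cov(z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
# 3) sort PCs by decreasing eigenvalue (variance)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# 4) drop the low-variance component and project the data
reduced = z @ eigvecs[:, :2]
```

Here the two strongest PCs retain almost all of the variance, so the third attribute can be discarded with negligible loss.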

FIGURE 5: Data reduction
The dimensionality of data is reduced to minimize computational complexity [59]. PCA is used for feature reduction. In this method, the accuracy of clustering is improved using the covariance matrix

C = (1/L) Σ_{j=1}^{L} (y_j − ȳ)(y_j − ȳ)^T,

where y_j are the data points, j = 1, 2, ..., L, and ȳ is their mean. The eigenvectors of C are computed from the eigendecomposition CU = UΛ, where U = [µ_1, µ_2, ..., µ_{τ_i}] is the set of eigenvectors, Λ = diag(λ_1, λ_2, ..., λ_τ) is the diagonal matrix of eigenvalues and τ_i is the maximum dimension. In PCA dimensionality reduction, weights are also assigned.
w_τ represents the weight of an attribute. The objective function can be minimized by determining U and the cluster centers V.
Here, N_h represents the hidden layer nodes, Q represents the number of samples and µ_jk denotes the membership degree of sample j in cluster k. The weighted ED from sample y_j to cluster center V_k, denoted as d_jk, is defined as

d_jk = sqrt( Σ_τ w_τ (y_jτ − V_kτ)² ).

Furthermore, time series data taken on a daily basis should be preprocessed and sorted so that missing, redundant and noisy values are removed. In [60], a model comprising the moving average method and WT is proposed. The moving average method is used for data smoothing, while the preprocessing of data is done using WT, which divides the information on a frequency basis. In the model, the electricity price series is divided into sub-series and its coefficients are adjusted using the wavelet function W(.) [60], where P_t is the price value at time t and T is the length of the series. The resolution coefficient is denoted as P_WSR, where S and R are the level and position, respectively. Mother and father wavelets are based on multi-resolution methods: low frequency content is extracted using the father wavelet function while high frequency content is extracted using the mother wavelet function. Hence, A_S (S = 1, 2, ..., S*) is the approximation set and D_S (S = 1, 2, ..., S*) is the detail set, which are defined by the following equations:

A_S(t) = Σ_R p^φ_SR φ_SR(t),   D_S(t) = Σ_R p^ψ_SR ψ_SR(t),

where ψ_SR(t) is the mother wavelet function, φ_SR(t) is the father wavelet function, p^φ_SR and p^ψ_SR are the coefficients obtained from the father and mother wavelet functions, respectively, and S* is the maximum decomposition level of the data signal.
The basic time series P_t (t = 1, 2, ..., T) can then be written as the sum of its approximation and detail components:

P_t = A_{S*}(t) + D_1(t) + D_2(t) + ... + D_{S*}(t).
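A minimal sketch of one level of wavelet-style decomposition, using the Haar wavelet for simplicity (the surveyed works may use other mother wavelets): the pairwise averages act as the low-frequency approximation A and the pairwise differences as the high-frequency detail D, and the original series is exactly recoverable from the two.

```python
def haar_step(series):
    """One level of a Haar transform: pairwise averages
    (approximation, low frequency) and pairwise differences
    (detail, high frequency). Length must be even."""
    approx = [(series[i] + series[i + 1]) / 2
              for i in range(0, len(series), 2)]
    detail = [(series[i] - series[i + 1]) / 2
              for i in range(0, len(series), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Reconstruct the original series from one Haar level."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([a + d, a - d])
    return out

prices = [30.0, 31.0, 29.0, 80.0, 30.5, 31.5, 30.0, 29.0]
A1, D1 = haar_step(prices)
reconstructed = haar_inverse(A1, D1)
```

The price spike at 80.0 shows up as a large detail coefficient, which is what spike-aware preprocessing schemes inspect or threshold.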

B. DATA CLEANSING
Data cleansing is the process of detecting and correcting corrupt or inaccurate records [61]. A record may belong to a set, table or database, and cleansing identifies the incorrect, inaccurate or irrelevant parts of the data. After detection, the corrupt data is replaced, modified or deleted to leave accurate records. The main purpose of data cleansing is the transformation of raw data into useful information whose contents are readable and easy to access. Data plays a vital role in various organizations; hence, its maintenance is very important. Figure 6 shows the steps involved in data cleansing after the data is imported and before it is used. Cleansing is required at this stage because imported data is mostly raw and contains noise, outliers, redundant records, etc. Data cleansing can be done in the following ways.

1) Filling missing values
Data collected from the real world is inconsistent due to the presence of noise and missing values. Missing values occur in a dataset for many reasons, such as limited storage, failures in uploading the data, compromised input devices and, sometimes, security restrictions. Missing values adversely affect the reliability and performance of machine and deep learning models. Keeping this in view, these values need to be handled before the model is developed. Several approaches can be adopted to fill the missing values, such as taking attribute means, using probability estimates or, sometimes, ignoring the affected data rows. Filling the missing values makes the data consistent and noise free.
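A minimal sketch of the attribute-mean approach, with `None` standing in for missing sensor readings; the values are illustrative:

```python
def fill_missing_with_mean(rows):
    """Replace None entries in each column with that column's mean."""
    cols = list(zip(*rows))
    means = []
    for col in cols:
        present = [v for v in col if v is not None]
        means.append(sum(present) / len(present))
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]

# Two smart-meter attributes with one missing reading each.
readings = [[1.0, 10.0], [None, 14.0], [3.0, None]]
cleaned = fill_missing_with_mean(readings)
```

After imputation, every row is complete and can be fed to a forecasting model without dropping any records.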

2) Identifying outliers
A sample is randomly taken from the population and the distance between two values is calculated. If the distance is abnormal, the sample point is referred to as an outlier; otherwise, it is not. A data sample point that lies far from the mean position is also referred to as an outlier.
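A simple distance-from-the-mean check can flag such points; here, anything more than a chosen number of standard deviations from the mean is reported (the threshold and load values are illustrative):

```python
def z_score_outliers(values, threshold=3.0):
    """Flag points whose distance from the mean exceeds
    `threshold` standard deviations."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if abs(v - mean) > threshold * sd]

# Sixty normal meter readings plus one abnormal spike.
load = [10.0] * 30 + [11.0] * 30 + [500.0]
outliers = z_score_outliers(load)
```

Distance-based rejection schemes such as DBOR generalize this idea from distance-to-the-mean to distances between neighboring points.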

3) Methods to fill missing values and handle outliers
The following methods are used to fill missing values and handle outliers.
• Clustering: Clustering is the task of grouping objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups [62]. Clustering falls in the category of unsupervised classification and is used for the distribution and preprocessing of data. A large volume of data is divided into multiple groups on the basis of feature similarity [63]. Owing to the increase in population, the data associated with each electricity user is also increasing day by day, which brings a severe challenge of data storage and processing. For this purpose, storage optimizing hierarchical agglomerative clustering (SOHAC) is proposed to fulfill the storage requirement [64]. For space optimization, inconsistent and redundant records are removed. A MapReduce framework is used for the implementation of the proposed parallel clustering approach [65], whose design is based on the K-means clustering approach. K-means clustering does not work well on a large number of datasets as compared to the approach proposed in [66]. Cluster tendency can be measured by calculating the degree of clustering using the following equation [67]:

H = Σ_j u_d_j / (Σ_j u_d_j + Σ_j v_d_j),

where v_d_j is the distance of a_j ∈ A from its nearest neighbor in A and u_d_j is the distance of b_j ∈ B from its nearest neighbor in A. Here, A represents the set of data points in a d-dimensional space and B represents a set of uniformly randomly distributed data points.
• Regression: Regression is a set of statistical processes in which relationships among variables are estimated. It is used to predict a range of numerical or continuous values found in specific datasets. In linear regression, the relationship between two variables is estimated: a single independent variable x_j is used to model the dependent variable y_j over m data points, where α_0 and α_1 are the parameters.
So, the equation of a straight line can be written as follows [68]:

y_j = α_0 + α_1 x_j.

The equation of multiple linear regression can be written as:

y_j = α_0 + α_1 x_j1 + α_2 x_j2 + ... + α_k x_jk.

The sample linear regression model can be written as:

ŷ_j = ᾱ_0 + ᾱ_1 x_j,

where ᾱ_0 and ᾱ_1 represent the parameter estimators.
• Imputation: Two common approaches are used to handle missing values: deleting them or imputing them [69], [70]. The former approach is not suitable and is never recommended, because the entire row or column is deleted when a missing value is found, so important information might be lost. Therefore, the latter approach is commonly used to handle missing values.
Data taken from the real world comes in huge volumes and is mostly inconsistent, i.e., it contains anomalies, missing values, outliers, etc. Such data gives poor prediction and classification performance. Hence, the inconsistencies present in the data are removed before it is used for various purposes.
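The straight-line model y = α_0 + α_1 x can be fitted with ordinary least squares; a noise-free sketch, where the estimators recover the true parameters exactly:

```python
def fit_line(xs, ys):
    """Ordinary least-squares estimates of alpha_0 (intercept)
    and alpha_1 (slope) for y = alpha_0 + alpha_1 * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    a0 = my - a1 * mx
    return a0, a1

# Points generated from y = 2 + 3x without noise.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0, 5.0, 8.0, 11.0]
a0, a1 = fit_line(xs, ys)
```

Once fitted on the observed points, the same line can predict (impute) a missing y value from its known x.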

4) Removing noise
The removal of noise cleanses the data. The following methods are used for removing noise.
• Filter method
• Wrapper method
• Embedded method
These three methods belong to the class of conventional FS methods and are discussed as follows.
• Filter method: Subsets of features in a dataset are determined using filters [71]; these subsets depend on the size of the data. Learning algorithms like random forest (RF), Relief-F, etc., are applied to the feature subsets to evaluate the data. RF combines the outputs of individual decision trees, each built on a random subset of features, and generates the final output after filtering out less optimal feature subsets. Relief-F, on the other hand, calculates a score for each feature and ranks the features accordingly; the features are then filtered on the basis of rank. Moreover, proper subset selection on the basis of consistency criteria is a difficult task. Based on the nature of the problem, the cross validate filter (CRV), ensembler filter (EF) and partitioning filter (PF) are used as per requirements [72]. In CRV, the features are divided into subsets and the performance of each subset is tested; the features that give poor performance are filtered out and the best features are selected. PF, in contrast, partitions the entire dataset into chunks and selects the partition on which the model performs best. The steps of the filter method are shown in Figure 7.
FIGURE 7: Filter method
• Wrapper method: Subsets of variables are evaluated and the interaction between variables is detected, as shown in Figure 8. In wrapper methods, the FS process is based on machine learning techniques that try to fit particular datasets [71]. As a result, the overfitting risk and the computational time are increased in wrapper methods. Examples of this method are stepwise selection, backward elimination, forward selection, etc.

FIGURE 8: Wrapper method
• Embedded method: Different techniques are used for assessing data. In the training phase, the gain ratio of every attribute is automatically adjusted, incomplete records are removed and system validation is done, as shown in Figure 9. The subset's size depends on the variety and size of the samples. A verification test is applied in the case of incomplete records; hence, missing values are replaced or removed. After that, with the help of a surrogate filter, values are substituted into the primary filter. In the surrogate filter approach, the filtered values are taken from one filter and transferred to another, which is considered the primary filter in most cases. A surrogate filter works in the case of homogeneous data; however, it is hard to choose one for heterogeneous data, because interoperability issues arise when the data differs. Hence, appropriate values are checked at the global level before a surrogate filter is chosen [74]. The above mentioned techniques are implemented on samples of structured data; in the case of unstructured data, these methods are not applicable. The advantages and disadvantages of the FS categories are discussed in Table 4.
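As an illustration of the wrapper category discussed above, a greedy forward-selection loop can be sketched; the `evaluate` function here is a toy stand-in for the cross-validated model accuracy a real wrapper would compute:

```python
def forward_selection(features, evaluate, max_features):
    """Wrapper FS sketch: greedily add the feature that most
    improves a model-based score, re-evaluating at every step."""
    selected, best_score = [], float("-inf")
    while len(selected) < max_features:
        candidates = [f for f in features if f not in selected]
        scored = [(evaluate(selected + [f]), f) for f in candidates]
        score, best = max(scored)
        if score <= best_score:
            break                      # no candidate improves the score
        selected.append(best)
        best_score = score
    return selected

# Toy score: the "model" likes "load" and "temp", dislikes "noise".
utility = {"load": 3.0, "temp": 2.0, "noise": -1.0}
evaluate = lambda subset: sum(utility[f] for f in subset)
chosen = forward_selection(list(utility), evaluate, max_features=3)
```

Because the score is recomputed for every candidate subset, the cost grows quickly with the number of features, which is exactly the computational-time drawback noted for wrapper methods.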

C. DATA TRANSFORMATION
Data is converted from one structure to another structure according to volume, complexity and format. Transformed data can be simple or complex [73]. Different technologies and tools are used for data transformation, e.g., Talend, Pentaho, CloverDX, etc.

1) Normalization
In normalization, data values are rescaled and organized so that redundancy and dependency are reduced [75]; data may also be divided into smaller tables with relationships defined among them. Data normalization can be done using the following methods: decimal scaling, z-score and min-max normalization (the most commonly used method). In z-score normalization, attribute values are normalized by calculating the standard deviation (SD) and mean, as given in the following equation:

P' = (P − F̄) / σ_F,

where σ_F is the SD and F̄ is the mean of attribute F. When the minimum and maximum values, max_F and min_F, of attribute F are known, min-max normalization is used instead. A linear transformation is performed on the data using the following equation [75]:

P_trans = ((P − min_F) / (max_F − min_F)) · (Mx_F − Mn_F) + Mn_F.

P_trans is typically calculated with Mx_F = 1 and Mn_F = 0, so the data is normalized into the range 0 to 1. The relationships among data values are preserved by min-max normalization. In [76]- [79], data is normalized after aggregating it on daily and hourly bases.
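The two normalization schemes can be sketched directly from their equations; the demand values are illustrative:

```python
def z_score(values):
    """Standardize with mean and (population) SD: z-score."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

def min_max(values, new_min=0.0, new_max=1.0):
    """Linear min-max transformation onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

demand = [200.0, 250.0, 300.0, 400.0]
z = z_score(demand)
scaled = min_max(demand)
```

Z-scored data is centered at zero, while min-max output preserves the relative spacing of the values inside [0, 1].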
PC_j represents the power consumption in the j-th interval, no_inter denotes the number of intervals and PD shows the power demand. In the proposed work, the average values of all coefficients are calculated using the correlation coefficient ζ_0j(t). For determining the dominant frequency, the DFT is used:

Y(k) = Σ_{n=0}^{N−1} x(n) e^{−i2πkn/N},  k = 0, 1, ..., N − 1,

where x(n) and Y(k) are the time domain and transformed signals, respectively, and N is the length of the input signal. To normalize the data, equation (8) is used. After data normalization, the correlation among the data is computed using grey relational analysis [80], where Λ*_o(k) and Λ*_i(k) are the grey coefficients between the sequences λ*_o(k) and λ*_i(k), and the distinguishing coefficient ζ_dist is set to 0.5. Data normalization is done using equation (8). A weight w is assigned to each feature n and the normalized, weighted data is formed accordingly (equation (40)). The ED is calculated in [81] over features including the historical actual load L_h, the weight of the natural gas price index GP_h, and GP_ft and L24_t, which are the actual power generation and load observed at time t − 24 hrs, respectively.

As mentioned above, data preprocessing usually includes normalization and data cleansing. Load data has some missing and defective values, which are removed and replaced by applying an averaging method. Data is also normalized between the maximum and minimum values of electricity. If the training data has large values, the weights are adjusted according to the sigmoid activation function and the data is normalized [41]- [43], [82]. SAX is used for reducing datasets; the input values are transformed into the range [0, 1]. Moreover, there exists a ratio between the moving average and the input variable, and the correlation between data is calculated on the basis of this ratio [83].
This ratio is calculated using the capacity utilization function (CUF):

CUF(t) = RD(t) / AC(t),

where RD(t) is the residential load at time t and AC(t) is the available capacity at time t.
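The DFT-based search for a dominant frequency can be sketched with NumPy on a synthetic hourly load series that has a built-in 24-hour cycle:

```python
import numpy as np

# Two weeks of hourly load with a dominant daily cycle plus noise.
rng = np.random.default_rng(1)
t = np.arange(24 * 14)
x = 10 + 3 * np.sin(2 * np.pi * t / 24) + 0.1 * rng.normal(size=t.size)

Y = np.fft.rfft(x)                          # transformed signal Y(k)
freqs = np.fft.rfftfreq(t.size, d=1.0)      # cycles per hour
dominant = freqs[1 + np.argmax(np.abs(Y[1:]))]  # skip the DC term
period_hours = 1.0 / dominant
```

The largest non-DC magnitude lands on the daily frequency, so the recovered period is 24 hours; in preprocessing, such dominant periods guide how the data is aggregated or which lagged features are kept.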
Tables 5-8 present the details of different methods used in the literature, grouped on the basis of four data preprocessing methods: averaging, aggregation, normalization and dimensionality reduction. The first column of these tables presents the working of the methods used for preprocessing data. Different works use different methods for data preprocessing. Table 9 provides a summary of the data preprocessing techniques.

IV. CRITICAL ANALYSIS
In this section, based on the discussion of existing data preprocessing methods, a narrative is built and presented in the form of critical analysis. This section will help other researchers to draw future directions and also strengthen the performance of the existing preprocessing methods.

A. WPT ONLY SUITABLE FOR NON-CRITICAL DATA
WPT is a commonly used method in the literature for dividing data samples into groups. However, with this method, some necessary information is lost and the computational cost of data preprocessing is increased [81], [82]. Hence, it is suitable for problems with non-critical data, where the effect of data loss is not significant because similar data is present in the dataset. Moreover, for dividing data samples into groups, the spike preprocessing method is a more suitable solution than WPT.

B. SELECTION OF A METHOD DEPENDS ON THE NATURE OF PROBLEM
Missing values are filled by taking an average of the existing values. However, from the literature, it is observed that for the purpose of load prediction, the improvement in forecasting accuracy is not up to the mark [84]. Instead of averaging data, pattern mining is a promising option. In this method, when a missing value is detected, the pattern of its previous and next values is mined, and the missing value is predicted based on these patterns. However, this method increases the computational overhead. Hence, the selection of an appropriate method to fill the missing values depends on the nature of the problem. If forecasting is time-critical and accuracy is the secondary priority, then the averaging method is suitable. Otherwise, if forecasting is not time-critical and accuracy is the top priority, then pattern mining is suitable to fill the missing values.

C. FITNESS CALCULATION METHOD BETTER THAN AVERAGING METHOD
To tackle the missing values in a dataset, fitness calculation of the best and the worst values has proven to be a suitable choice [85]- [89]. It is observed that when the averaging method is replaced with fitness calculation values, better forecasting results are acquired. It implies that, before selecting a data preprocessing method, the nature of the data and problem should be investigated thoroughly, and then the best suitable method should be selected.

D. EFFICIENT TUNING OF HYPERPARAMETERS
In data preprocessing, clustering is done to support supervised algorithms. DT is the simplest of these methods and k-means is the most commonly used clustering algorithm. These algorithms have hyperparameters, which need to be tuned carefully to produce accurate results. In the literature, heuristic algorithms have gained popularity for finding suitable values for hyperparameters. However, the inclusion of these algorithms increases the execution time of the preprocessing methods [90]- [95].

E. DATA INTEGRATION MAKES DATA COMPLEX
Integration of data from several sources makes the data complex. This complexity is reduced by FS, discretization or instance selection [77]- [80]. Sometimes, this step is done manually, i.e., the useful features are selected without using any method. However, for big data or data with a large number of features, it is important to use a suitable method. This phase of data preprocessing is also known as data reduction. Some of the commonly used data reduction methods are discussed in this survey; however, there is still room for improvement because the size, shape and nature of data change over time.

F. NATURE OF DATASET TELLS WHAT STEPS TO USE
From the existing literature, it is obvious that not all of the data preprocessing steps are always necessary to improve the quality of the data. For example, an electricity load consumption dataset consisting of only four features, i.e., demand, electricity price, fuel price and temperature, does not need a data reduction step. Similarly, not every dataset needs normalization or scaling. Hence, to choose the appropriate data preprocessing steps for a dataset, it is important to understand the nature of the dataset and the problem to be solved.

V. FUTURE CHALLENGES
In light of the above discussed literature, promising future research directions are presented in this section.

A. SCALING OF DATA PREPROCESSING TECHNIQUES
For accurate load forecasting, data preprocessing is an important step. From the literature, it is observed that without preprocessing, the accuracy of a model degrades and the computational cost increases. Hence, the selection of an appropriate preprocessing method is an important step.
To reduce the data dimensionality, multiple FS methods are available for preprocessing [41]- [43], like PCA and WT.
The methods are used for selecting the relevant features. However, as the characteristics of data change over time, the existing FS models become insufficient. For example, in instance reduction, subsets of data are arranged to carry out the learning task, which does not show any significant improvement in forecasting accuracy. For obtaining subsets from a big database, a complete set of instance reduction methods is necessary. These methods have been re-adjusted to deal with large-scale data, which requires high computational capabilities. Hence, considering the importance of FS methods and the increase in data volume, new and efficient FS methods are needed for a better selection of the relevant features. Moreover, data cleansing is an important phase. A dataset may have missing values, noise, irrelevant data, etc. Missing value imputation is used to replace the missing values in a dataset; here, the best possible values are estimated on the basis of the relationships within the data. In addition, noise treatment is a complex problem in which similarity is measured among data points to identify and quantify noise. Besides, a dataset may also include erroneous values, which must be tackled to improve the forecasting accuracy. From the literature, it is observed that error detection rates are low in medicine, load and electricity forecasting, commerce, banking, education, etc. Moreover, data cleansing is a challenging task in these fields, because a huge amount of data is generated due to the increasing population, which needs to be reduced and scaled. During the scaling of big data, the main concerns are the dependency of results, the treatment of data according to the preprocessing capacity, iterative processing and the possibility of parallelization. Hence, new and efficient methods are required to tackle the aforementioned issues.

B. BIG DATA LEARNING PARADIGM
Data complexity, data security, data capture and data scaling problems arise continuously in big data. To tackle these challenges, various data preprocessing methods have been proposed in the literature. As data increases day by day, storage issues arise. Moreover, in semi-supervised learning, labeling the data is a complex task and real-time responses are required for processing large datasets. Besides, in the filter method, features are ordered on the basis of importance at a specific time; however, the decision criteria for performing filtration are not defined [96]. Additionally, on a large dataset, it is hard to employ the wrapper method because the process involves a comprehensive search.
As discussed earlier, data is collected from multiple sources; hence, a large amount of variation is found in datasets after integration. During the preprocessing phase, an appropriate data sampling rate is assigned and data cleansing is done at various levels; however, the processing cost increases. Furthermore, efficient learning and validation models are required. In addition, the existing solutions are restricted to specific and complex predictors; instead of providing results, they suddenly change direction towards the predictors used. Moreover, prior knowledge is not considered in filter based methods; hence, more time is required to find solutions compared to metaheuristic optimization methods. On the whole, data cleansing and filtering are of paramount importance. When considering large datasets, hybrid and embedded approaches increase system complexity and raise an overfitting issue; these methods perform well on small datasets, but their performance is not good on large heterogeneous datasets. As data increases day by day, cleansing techniques become difficult to apply because of scalability and computational issues. Furthermore, it is difficult to apply preprocessing methods to unstructured data. Hence, new and improved methods are required for data cleansing, filtration, reduction and transformation.

VI. CONCLUSION
In this paper, a comparison is made between various data preprocessing methods. Data preprocessing is an important step for efficient load and price forecasting. The data collected from the real world is inconsistent, incomplete and noisy; hence, data preprocessing is inevitable for obtaining accurate forecasting results. In the literature, data preprocessing is categorized into four steps: data cleansing, data integration, data transformation and data reduction. However, it is not always necessary to implement all four steps on a dataset; the nature and type of the dataset determine the required steps. Data preprocessing aims to make the data meaningful and improve its quality. The problem of noisy data is solved in the data cleansing phase. Moreover, if data is collected from different sources, the data integration step is implemented. Additionally, the data transformation and data reduction phases are used to transform data from one form to another and to reduce the data size, respectively. The above discussion concludes that each data preprocessing step has its own characteristics and significance. Moreover, although the data preprocessing steps seem to increase the computational time, they save a forecasting model from overfitting and underfitting issues and decrease the model's training time. Hence, from this detailed survey, it is concluded that data preprocessing is important for efficient and accurate forecasting.