Household Power Consumption Prediction Method Based on Selective Ensemble Learning

In the context of power big data, household power consumption data on the user side are characterized by large volume, wide distribution, and many types. Ensemble learning performs very well in analyzing and mining power data with large volumes, strong timeliness, and many influencing factors. Based on household power consumption data, this paper analyzes and predicts the power consumption of users in a city using selective ensemble learning combined with meteorological factors. The K-means algorithm is used to cluster household power consumption, and the clustering results are then combined with meteorological information. In the prediction stage, a Filtering Iterative Optimization Ensemble Strategy (FIOES) is proposed to selectively ensemble the basic learners and obtain the final prediction model. Experimental results show that the FIOES algorithm outperforms traditional ensemble learning algorithms in both time cost and prediction accuracy.


I. INTRODUCTION
In the context of the smart grid, power big data, especially residential power consumption data, contains potentially valuable information. Making use of these massive data is of great significance for promoting production, improving services, and ensuring grid security. Power information collection equipment on the demand side gathers a large amount of data, which constitutes the marketing-side big data in the smart grid. The information structure in power big data is very complicated: there are not only structured data but also many semi-structured and unstructured data. Therefore, processing power big data is currently a difficult problem.
The power system consists of five parts: generation, transmission, transformation, distribution and consumption [1]. Power load prediction [2] analyzes the historical data of power load and studies the influence of related external factors in order to establish a mathematical model and estimate the demand on the power system. It is mainly divided into short-term, medium-term and long-term load prediction [3]. User power consumption prediction is part of load prediction; its essence is to analyze and model the user's historical power consumption information and predict the power consumption in the next stage. There are many influencing factors in power load prediction, mainly including time-space factors [4], meteorological factors and so on. Because household users are widely distributed geographically, the time-space factor can also be considered in consumption prediction, distinguishing it from the centralized prediction of a few pilot measurement nodes and better adapting to the needs of customized load analysis. Because the data used in this paper come from users in a single province and city, the influence of time and space factors is not taken into account when predicting power consumption; instead, the temperature, precipitation and sunshine time of the area are taken into account. (The associate editor coordinating the review of this manuscript and approving it for publication was Mingjian Cui.)
In this paper, the prediction task is mainly medium-term load prediction. We collected the power consumption information of home users in a certain city for one year as a data source. Considering important factors that affect residents' power consumption, such as average temperature, monthly precipitation, and sunshine statistics, users are clustered based on their historical power consumption, and their power consumption is then predicted with neural networks and a selective ensemble learning method. This paper proposes a method for home users' power consumption prediction based on neural networks and selective ensemble learning [5], [6]. This method provides a necessary reference for power grid planning and transformation.

II. RELATED WORK
Power big data mining is mainly carried out through the prediction of power data [7], [8]. Machine learning is widely used in classification and prediction problems. By training on large amounts of data, we can find the rules contained in the training data, and then use the learned models to classify and predict new data. Generalization ability has always been one of the important criteria for evaluating a machine learning model: a machine learning system should be able to handle new data well using the model trained on existing data.
In power big data processing, because of the complexity of the data and the particularity of its semi-structured and unstructured parts, the effect of a single learner is not ideal. Research shows that ensemble learning can effectively improve the generalization ability of a machine learning system. Ensemble learning trains several classifiers (called basic learners) on the samples and then combines these basic learners in a certain way, so that multiple basic learners work together to solve a learning task. For the combination of basic learners, the traditional methods are generally the majority voting method [9], the weighted voting method [10], the hierarchical combination method [11], etc. These ensemble learning algorithms can effectively solve various classification and prediction problems, but they still have shortcomings. For example, although they have good generalization performance, they remain unsatisfactory in some aspects of performance and efficiency. In reference [12], when the amount of data is too large, the construction of the individual learners is relatively simple, but the ensemble process places high demands on the server's memory and computing capacity. The earliest ensemble learning combined all the basic learners, which may ensemble learners with poor performance and degrade the ensemble model. Selective ensemble learning solves this problem well: during the ensemble stage, only the learners with better performance are selected, avoiding the impact of poorly performing basic learners on the ensemble model. The process of selective ensemble learning is shown in Fig 1. The main idea of selective ensemble learning [13] is to select only a subset of the better-performing learners among many base learners, so as to obtain a better effect than ensembling all base learners [14].
In the search process for basic learners, the method used is very important and has a great impact on the ensemble results. At present, there are four methods of selective ensemble: iterative optimization, ranking, clustering and pattern mining. In reference [15], the author summarizes and analyzes the existing selective ensemble learning methods and points out that scholars first adopted selection based on accuracy. As research deepened, GASEN and improved selective ensemble learning algorithms were proposed. These algorithms combine the voting method with a genetic algorithm: the voting method is first used to discard part of the learners, and a genetic algorithm then searches for the desired basic learners.

1) RANKING METHOD
Ranking is the most common selective ensemble algorithm [16]. Its main idea is to sort the basic learners according to their performance, generally measured by accuracy. After ranking, only the top-ranked basic learners are selected for the ensemble to improve the performance of the overall model. The main advantages of the ranking method are that selection is relatively fast, the cost of the ensemble stage is relatively small, and it is suitable for small-scale ensemble learning. Common ranking-based selective ensemble learning algorithms include direction-based ranking, boundary-based ranking, and kappa-coefficient-based ranking [17], all of which show good performance in the ensemble stage.
However, most ranking algorithms consider only the performance of the prediction results when selecting the base learners, which is one-sided. Even if multiple base learners perform well individually, the ensemble model may predict well while its generalization ability is not significantly improved, because the relevance and difference between the basic learners are not considered during selection. The ranking method therefore has some defects.

2) ITERATIVE OPTIMIZATION METHOD
The main idea of the iterative optimization method [18] is to apply a specific optimization procedure when searching for the optimal solution, refining the target value according to the result of each iteration. Its advantage is that it avoids enumerating the enormous number of possible learner combinations: by introducing an optimization method, the search proceeds step by step until the optimal combination of basic learners is obtained.
At present, the commonly used iterative optimization methods include the hill climbing method, the GASEN algorithm, the EPRL algorithm and the SDP algorithm [19]-[22]. Hill climbing is the most widely used algorithm in iterative optimization. It gradually finds the optimal selection of basic learners in a process similar to climbing a hill: in the search for the optimal solution, the error gradually decreases while the search speeds up, and the search keeps approaching the optimal combination. Because of its speed and good performance, it is one of the most commonly used algorithms. The GASEN algorithm mainly uses a genetic algorithm to optimize the combination of base learners; during optimization, each basic learner is assigned a weight, and the final ensemble model has better prediction performance. The EPRL algorithm applies reinforcement learning to the iterative optimization: a decision function is introduced at the beginning of the iteration, and each basic learner is then optimized according to this function to obtain the optimal ensemble model. The SDP algorithm transforms the selection of basic learners into a quadratic integer programming problem and uses integer programming to select the basic learners.

III. ANALYSIS OF POWER USER CHARACTERISTICS
A. POWER DATA AND PREPROCESSING
The power data used in this paper are part of the power consumption information of residents of a certain city provided by the State Grid Corporation of China (SGCC). The source data mainly include fields such as Electricity Meter Box ID (EM_BoxID), User ID (UserID), Collection Time (ColTime), Power Consumption (PowCsp), Latitude and Longitude (Lat_Lon), and Weather (Weather). Power for residents in this city is managed with the meter box as a first-level unit: each box is connected to multiple users, and each user has a unique User ID. In general, collection occurs once a month, and Power Consumption is the user's power consumption for that month. The original data include 9,374 meter boxes, more than 100,000 connected users, and a total of 2.73 million power consumption records collected over the 14 months from January 2018 to February 2019.

1) MISSING VALUE PROCESSING
The main research object of this article is the monthly power consumption information of home users, so the problem of missing values in the ''Power Consumption'' field must be considered. During clustering, adding or deleting a large amount of data has a great impact on the results, so we analyze and process the data in the following three cases:
(1) If the user's monthly average power consumption is less than 30 kW·h, the household is likely to be a vacant house. This kind of data has little meaning for the clustering results, so it is eliminated.
(2) If the user's power consumption information is missing for 1-3 months, fill it with the average value.
(3) If more than 4 months of values are missing, the user is eliminated.
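The three rules above can be sketched in Python. The list-of-readings layout and the treatment of exactly four missing months (dropped here) are assumptions, since the paper only specifies filling 1-3 missing months and eliminating more than 4:

```python
def preprocess_user(monthly_kwh, vacancy_threshold=30, max_missing=3):
    """Apply the three missing-value rules to one user's readings.

    monthly_kwh: list of monthly readings, with None marking a missing
    month. Returns the cleaned list, or None if the user is eliminated.
    """
    present = [v for v in monthly_kwh if v is not None]
    if not present:
        return None
    # Rule (1): monthly average below 30 kWh -> likely vacant, eliminate.
    if sum(present) / len(present) < vacancy_threshold:
        return None
    # Rule (3): too many missing months -> eliminate. (The paper drops
    # users missing more than 4 months; exactly 4 is treated as dropped
    # here, an assumption the paper leaves unspecified.)
    if len(monthly_kwh) - len(present) > max_missing:
        return None
    # Rule (2): 1-3 missing months -> fill with the user's average.
    avg = sum(present) / len(present)
    return [avg if v is None else v for v in monthly_kwh]
```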
We use the Tableau software [23] for data processing. Through the field information, we find the user IDs of the power users with missing values. According to the analysis, there are 3,541 users and 2,532 related meter boxes whose power consumption is 0 every month; 492 users with 1-3 months of missing values and 277 related meter boxes; and 49 users with more than 4 months of missing values and 36 related meter boxes. The detailed data are shown in Table 1.

2) OUTLIER PROCESSING
Statistical discrimination is a method of dealing with outliers. This paper mainly uses the Box-plot method [24] to deal with outliers in the original data; the method is illustrated in Fig 2. Its main steps are as follows:
(1) Calculate the upper quartile, the median, and the lower quartile;
(2) Calculate the difference between the upper and lower quartiles, that is, the interquartile range;
(3) Draw the upper and lower edges of the box: the upper edge is the upper quartile and the lower edge is the lower quartile, with a horizontal line at the median position inside the box;
(4) Values more than 1.5 times the interquartile range above the upper quartile, or more than 1.5 times the interquartile range below the lower quartile, are classified as outliers;
(5) Excluding the outliers, draw horizontal lines at the two values closest to the upper and lower edges as the whiskers of the boxplot.
The boxplot method clearly identifies outliers in the data set, which helps with data cleaning, and it also shows the data distribution. First, users whose monthly power consumption is less than 30 kW·h are found with the Tableau tool, and these data are eliminated. Then we used Tableau to draw a boxplot of power consumption, setting the upper and lower boundaries to 377 and 31. Analysis found 56,339 power consumption records exceeding these limits, involving 1,682 meter boxes and 8,943 users. Among them, 743 records show power consumption more than 60% higher than in the neighboring months; these are treated as recording errors and filled with the average value of the neighboring months. The other 55,596 records with higher power consumption are not outliers.
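The box-plot rule in steps (1)-(5) can be sketched as follows. The linear-interpolation quartile convention is an assumption, since the paper does not state which convention Tableau uses:

```python
def boxplot_outliers(values):
    """Flag outliers with the box-plot (IQR) rule described above.

    Returns (outliers, (lower_fence, upper_fence))."""
    xs = sorted(values)
    n = len(xs)

    def quantile(q):
        # Linear interpolation between the closest ranks (one common
        # convention; other tools may interpolate differently).
        pos = q * (n - 1)
        lo, hi = int(pos), min(int(pos) + 1, n - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper], (lower, upper)
```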

B. POWER CONSUMPTION CLUSTERING RESULTS AND ANALYSIS
1) CLUSTERING PROCESS
In this paper, K-means algorithm [25] is used to cluster the data set. When clustering, we should first determine the initial cluster centroid and select k = 3. The setting of initial cluster center value is shown in Table 2.
Power consumption (PowCsp) represents the power used by the user for a month.
The change of power consumption (PowCsp_Chg) refers to the difference in power consumption between two consecutive months, calculated as in formula (1):

PowCsp_Chg(t) = PowCsp(t) - PowCsp(t-1)   (1)

The power consumption change rate (PowCsp_ChgRt) is the change of power consumption divided by the previous month's power consumption, as in formula (2):

PowCsp_ChgRt(t) = PowCsp_Chg(t) / PowCsp(t-1)   (2)

The clustered data are the users' power consumption values, i.e. numeric data, so the K-means algorithm is used to perform clustering analysis on the data source. The cluster centers are determined according to the square error criterion:

E = Σ_{i=1}^{k} Σ_{p∈C_i} (p - m_i)²   (3)

In formula (3), E is the summed square error of all samples in the data source, p represents the monthly power consumption, and m_i is the mean of cluster C_i. This criterion ensures that the generated clusters are as compact and well separated as possible.
In the K-means algorithm, each sample point is assigned to a class according to its distance from the class center, computed with the Euclidean distance shown in formula (4):

d(x, y) = sqrt( Σ_{i=1}^{n} (x_i - y_i)² )   (4)

In formula (4), x_i is the value of the i-th variable in the sample and y_i is the value of the i-th variable in the clustering center. The differences are squared and summed, and the square root of the sum gives the Euclidean distance.
The specific process of the K-means clustering algorithm is:
(1) Select the number of cluster centers required for the experiment;
(2) Calculate the Euclidean distance from each data sample to the cluster centers according to formula (4), and assign the sample to the nearest center;
(3) Recalculate each cluster center according to formula (3);
(4) Repeat steps (2) and (3) until the position of each cluster center no longer changes;
(5) Output the results.
Following these steps, the experiment first selected three cluster centers, whose initial settings are shown in Table 2.
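Steps (1)-(5) can be sketched in plain Python. The toy one-dimensional data and initial centers below are illustrative, not the paper's Table 2 values:

```python
import math

def euclidean(x, y):
    # Formula (4): square root of the summed squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def kmeans(points, centroids, max_iter=500):
    """Plain K-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its cluster, repeating until the centroids
    stop changing (steps (2)-(4) above)."""
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)),
                      key=lambda i: euclidean(p, centroids[i]))
            clusters[idx].append(p)
        new_centroids = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: centroids unchanged
            break
        centroids = new_centroids
    return centroids, clusters
```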
According to the steps of K-means algorithm, the clustering centroid is calculated and the users are clustered. After 474 iterations, the position of cluster centroid no longer changes, and the iteration stops. The final clustering centroid iteration results are shown in Table 3.
The clustering results are as follows: Class I users have the largest power consumption, with a total of 16,784 households and a monthly average power consumption of 563 kW·h. Class II users follow, with a total of 53,167 households and an average monthly power consumption of 277 kW·h. Class III users consume the least, with a total of 26,459 households and a monthly average power consumption of 103 kW·h.

2) CLUSTERING CATEGORY ANALYSIS
We use the boxplot to analyze the above clustering results. Class I users: the upper boundary is 714 kW·h and the lower boundary is 492 kW·h, with 328 users out of range.
Class II users: the upper boundary is 453 kW·h and the lower boundary is 237 kW·h, with 453 users out of range.
Class III users: the upper boundary is 213 kW·h and the lower boundary is 31 kW·h, with 33 users out of range.
Taken together, out-of-range users account for 1.95% of Class I users and 0.85% of Class II users, so the clustering results for these two categories are good. Out-of-range users account for only 0.12% of Class III users, giving the best clustering effect.

3) CLUSTERING RESULTS ANALYSIS
The relationship between power consumption and the change of power consumption can be obtained from the clustering results, as shown in Fig 4. Class I users have the largest range and fluctuation in the change of power consumption, with extreme changes from −54 kW·h to 31 kW·h. The range of change for Class II users is relatively modest, basically staying within ±19 kW·h with mostly positive values; this category of users is relatively stable. Class III users have the smallest change in power consumption, which stays in a relatively low range and is relatively stable.
The relationship between power consumption and the change rate of power consumption can also be obtained from the clustering results, as shown in Fig 5. Class I users have the largest power consumption and the smallest range of change rates, from −9% to 5%, and are the most stable class. The power consumption of Class II users is second, and their range of change rate is also relatively small, from −5% to 14%, which is relatively stable. The power consumption of Class III users is the least, but their range of change rate is the largest, between −17% and 22%.
Through clustering analysis, we summarize the power consumption characteristics of the three categories, as shown in Table 4. Fig 6 shows the monthly average power consumption of the three categories between January 2018 and February 2019, from which we can see that:
(1) Their overall power consumption trends are similar.
(2) They all reached the maximum power consumption in August.
(3) Although the change rate of power consumption is large, they are relatively stable overall.
(4) The temperature factor has a certain effect on the power consumption.

IV. POWER CONSUMPTION PREDICTION BASED ON SELECTIVE ENSEMBLE LEARNING
Based on the clustering of power consumption in the previous section, this paper adopts selective ensemble learning to establish a selective ensemble prediction model for the power consumption of home power users. We propose a Filtering Iterative Optimization Ensemble Strategy (FIOES), which uses the advantages of the ranking method to optimize the traditional iterative optimization method: multiple base learners are trained, and each base learner makes predictions through a neural network [26]-[28]. Experiments show that the model performs well for home users' power consumption prediction.

A. ANALYSIS OF METEOROLOGICAL FACTORS
Meteorological conditions are among the important factors affecting household power consumption and matter greatly for its accurate prediction. Among the meteorological data, the influence of temperature, precipitation and sunshine time on power consumption is very obvious, so these three aspects are the main weather factors considered in the prediction of household power consumption. Temperature has the most obvious effect. According to the data in the previous section, residents use the most power in summer each year, mainly because they run high-power electrical appliances such as air conditioning for long periods, causing power consumption to rise sharply and reach its annual peak. Because this is a northern city, residents' power demand during the winter heating period is relatively low. The impact of precipitation on power consumption varies with the month: some studies show that in certain months power consumption increases with precipitation and decreases as precipitation falls [29]. The influence of sunshine time on household power consumption is also related to the month, since sunshine intensity in different months affects household power consumption differently.
The meteorological data in this paper are from the China Weather Network. The collected data are the monthly average temperature, monthly precipitation and monthly sunshine time of the province; the processed data are shown in Table 5. Table 5 shows the average temperature, precipitation and sunshine time of the city in 2018. The average temperature for the whole year is 13.9 °C; January has the lowest average temperature and August the highest. The total precipitation for the year is 474.2 mm; July has the most precipitation and January the least. The sorted data are imported into Tableau and merged with the original household power consumption data and the clustering results, laying the data foundation for the subsequent power consumption prediction.

B. CONSTRUCTION OF BASIC LEARNERS
In the construction of the basic learner, the weather data and power data are combined, and a Multilayer Perceptron (MLP) is used to predict the power consumption. An MLP is an Artificial Neural Network (ANN) with multiple hidden layers; the number of hidden layers can be adjusted for different data sets. The structure of the network is shown in Figure 7 and contains two hidden layers. The first layer is the input layer, generally an n-dimensional vector x_1, x_2, . . . , x_7, whose components here are: clustering result, power consumption change, monthly average temperature, distance to the clustering centroid, power consumption change rate, precipitation, and sunshine time. The second and third layers are hidden layers, each with 8 neurons indexed z = {1, 2, . . . , 8}. The forward propagation extends to any layer by the rule:

u_l(j) = Σ_i ω_l(j,i) · y_{l-1}(i) + b_l(j),   y_l(j) = f(u_l(j))

where l denotes the current layer, u_l(j) is the input of node j, y_l(j) is the output of node j, W_l is the weight matrix between the two layers, ω_l(j,i) is the weight from the i-th node in the previous layer to the j-th node in the current layer, b_l is the offset of the node, and f is the activation function.
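The propagation rule can be sketched as a plain forward pass. The ReLU activation and the toy weights below are assumptions, since the paper does not state its activation function or parameter values:

```python
def forward(x, layers, f=lambda u: max(0.0, u)):
    """One forward pass through an MLP, following the rule above:
    u_l(j) = sum_i w_l[j][i] * y_{l-1}(i) + b_l(j),  y_l(j) = f(u_l(j)).

    layers: list of (W, b) pairs, with W given as a list of rows, one row
    of incoming weights per neuron. The activation f (ReLU here) is a
    placeholder choice."""
    y = x
    for W, b in layers:
        # Each neuron j: weighted sum of the previous layer plus its bias,
        # passed through the activation function.
        y = [f(sum(w_ji * y_i for w_ji, y_i in zip(row, y)) + b_j)
             for row, b_j in zip(W, b)]
    return y
```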

C. FILTERING ITERATIVE OPTIMIZATION ENSEMBLE STRATEGY
1) SELECTION OF BASIC LEARNER
FIOES uses a selective ensemble method that combines the ranking method [30], [31] with the iterative optimization method. The main steps are as follows:
Step 1: During the iterative optimization, the base learners are selected by the ranking method, and some base learners with poor performance are filtered out in advance.
Step 2: The unfiltered base learner is integrated using an iterative optimization method until the iteration is within a set threshold.
Step 3: The remaining base learners after the iteration are selected by the ranking method for the second filtering and integrated.
When filtering all the basic learners, the ranking method is used to sort them. Here the kappa coefficient method is adopted, and the kappa coefficient is used to preliminarily filter each basic learner, as in formula (8):

k_i = (p_i - p_0) / (1 - p_0)   (8)

where p_0 is the average prediction accuracy of all base learners, that is, the overall prediction accuracy, and p_i is the prediction accuracy of learner i.
The kappa coefficient usually lies between 0 and 1. The base learners are sorted according to their computed kappa coefficients, and those whose coefficient is above 0.8 are selected for the ensemble; the others are filtered out.
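As a sketch, the kappa-based pre-filtering might look like the following. Note that the exact form of formula (8) is reconstructed from its description, so the score (p_i − p_0)/(1 − p_0) is an assumption:

```python
def kappa_filter(accuracies, threshold=0.8):
    """Rank base learners by a kappa-style score comparing each learner's
    accuracy p_i against the mean accuracy p_0 of all learners, keeping
    those whose score clears the threshold.

    The score (p_i - p_0) / (1 - p_0) is one plausible reading of the
    paper's formula (8); it is not confirmed by the source text.
    Returns the indices of the kept learners, best first."""
    p0 = sum(accuracies) / len(accuracies)
    scores = [(p_i - p0) / (1 - p0) for p_i in accuracies]
    kept = [i for i, k in enumerate(scores) if k >= threshold]
    return sorted(kept, key=lambda i: scores[i], reverse=True)
```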
The main advantage of the ranking method is that it is simple and fast, and it also has advantages in generalization performance and in how the difference between learners is calculated. A common approach is to split a validation set M_val from the overall data set and use it to calculate the prediction accuracy R of each base learner C_i:

R(C_i) = (1/|M_val|) Σ_{(x,y)∈M_val} I(C_i(x) = y)

where I(·) is the indicator function. In general, the validation set M_val is used to calculate the differences among the basic learners. However, this method has two possible shortcomings: (a) it removes part of the training data, which, especially when the training data are scarce, may cause the model to underfit; (b) the optimization target is fixed to a predetermined part of the data set, which in some algorithms may bias the generated model and hinder the generation of the best model.

2) ITERATIVE OPTIMIZATION ENSEMBLE ALGORITHM
The FIOES algorithm flow is shown in Table 6.
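Since Table 6 is not reproduced here, the three-step flow of Steps 1-3 can only be sketched. The `refine` callback standing in for the iterative-optimization stage and the simple accuracy threshold are assumptions:

```python
def fioes_select(accuracies, refine, threshold=0.8):
    """Sketch of the three FIOES steps, operating on learner indices:
      Step 1: pre-filter weak base learners with the ranking method;
      Step 2: iteratively optimize the surviving subset (here delegated
              to the caller-supplied refine(indices) placeholder);
      Step 3: rank-filter the survivors again before the final ensemble.
    The real flow in the paper's Table 6 may differ in detail."""
    def rank_filter(idx):
        # Ranking stage: keep learners whose accuracy clears the
        # threshold, best first (a simplification of kappa ranking).
        return sorted((i for i in idx if accuracies[i] >= threshold),
                      key=lambda i: accuracies[i], reverse=True)

    kept = rank_filter(range(len(accuracies)))   # Step 1: pre-filter
    kept = refine(kept)                          # Step 2: iterative optimization
    return rank_filter(kept)                     # Step 3: second filtering
```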

D. EXPERIMENT AND RESULT ANALYSIS
1) EXPERIMENTAL ENVIRONMENT AND DATA SET
In this paper, the experimental environment is the Ubuntu system, the programming language is Python 3.0, the development tool is PyCharm, and the TensorFlow deep learning framework is used.
The data from January to December 2018 is used as a training set to predict the power consumption in January and February 2019.
In the extraction of training data, the Out-of-Bag method from the Bagging algorithm is used for random sampling [32], [33]. The main process is: first, the data of the 100,000 users in the entire data set are processed for missing values and outliers, eliminating 3,590 households and leaving 96,410. Then 96,410 samples are drawn with replacement, so on each draw the probability that a given record is collected is 1/96410 and the probability that it is not collected is 1 − 1/96410.
The probability that a record is never collected in 96,410 draws is therefore (1 − 1/96410)^96410 ≈ 0.3679. Since each draw is random, the samples used for each base learner are not exactly the same. We use this method to train each base learner.
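The out-of-bag probability above is easy to verify numerically:

```python
n = 96410  # households remaining after preprocessing
# Probability that a given record is never chosen in n draws with
# replacement: (1 - 1/n)^n, which approaches 1/e for large n.
p_missed = (1 - 1 / n) ** n
print(round(p_missed, 4))  # 0.3679, i.e. roughly 36.8% left out-of-bag
```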
In the construction of the base learners, to ensure that the whole data set can participate in training, a total of 15 random samplings were performed during the construction phase; that is, the number of base learners is 15. Random sampling both guarantees the differences between the base learners and ensures that almost all of the data set participates in their training. This method helps the final ensemble model achieve good generalization performance.

2) EXPERIMENTAL DESIGN
This paper uses the FIOES model to classify and predict the power consumption information of household power users. The importance of each base learner for the predictor variables is shown in Table 7; this importance is used as a weight when the neural network is trained. When analyzing the prediction results, the error between the actual value and the predicted value is used to evaluate their quality. There are two groups of comparative experiments in this paper. The specific methods are as follows: (1) Compare the prediction results of the selective ensemble framework model with those of a single base learner model to verify whether the selective ensemble framework model is better than the traditional model.
(2) Set different values of the Kappa coefficient k_i, compare the results under different k_i values, and demonstrate the impact of the Filtering Iterative Optimization Ensemble Strategy. Table 8 shows the prediction results of the FIOES model for some users in January and February 2019, mainly including the actual value, predicted value, relative error, and absolute error. After statistical analysis, the error range of the FIOES model is between −18 and 23, and the average error is 7.35. Table 9 shows the prediction results of a single base learner, here a traditional neural network model. The error range of this model is between −31 and 38, and the average error is 12.33. Compared with the FIOES model, the prediction accuracy of the single-learner model is worse.
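The error quantities reported in Tables 8 and 9 can be computed as follows; treating relative error as a fraction of the actual value is an assumption about the paper's convention:

```python
def prediction_errors(actual, predicted):
    """Per-record absolute error, relative error, and mean absolute
    error, as used to compare the models above.

    Relative error here is (predicted - actual) / actual, which is one
    common convention; the paper does not define its exact formula."""
    abs_err = [p - a for a, p in zip(actual, predicted)]
    rel_err = [(p - a) / a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in abs_err) / len(abs_err)
    return abs_err, rel_err, mae
```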

3) COMPARISON AND ANALYSIS OF PREDICTION RESULTS
Based on the above data, Fig 8 compares the prediction results. In the selective ensemble stage of the FIOES model, different Kappa coefficients k_i lead to significantly different prediction results. Table 10 shows the deviation range and average deviation between the predicted results and the actual values as the Kappa coefficient k_i is adjusted from 0.6 to 0.9. When k_i increases from 0.6 to 0.8, the deviation range gradually decreases, indicating that the model improves with increasing k_i. When k_i increases from 0.8 to 0.9, the deviation range increases again, indicating that the model's effect then declines. Therefore, when k_i is 0.8, the model performs best.
Table 10 clearly shows that the FIOES model's prediction is best when k_i = 0.8, with minimum and maximum deviations of −18 and 23 respectively. Compared with other values, the prediction deviation is the lowest, so when k_i = 0.8 the ensemble of the basic learners is most effective. The experimental results show that the ensemble strategy of the basic learners determines the effect of the model.

V. CONCLUSION
In this paper, we have proposed FIOES to improve the accuracy of power big data prediction. FIOES is effective across almost all of the power consumption prediction tasks considered, which suggests good generalization ability. To make the prediction results more accurate, we first cluster the household power users and classify them according to their power consumption characteristics, providing data support for power management and prediction. To the best of our knowledge, this is the first work in the literature to combine the ranking method and the iterative optimization method to selectively ensemble the basic learners. The experiments show that the prediction accuracy of the model can be improved by secondary filtering and iterative optimization. Moreover, different Kappa coefficients lead to different ensemble effects; only by selecting an appropriate coefficient can the optimal prediction model be obtained.