Short-Term Load Forecasting Based on Integration of SVR and Stacking

In support vector regression (SVR) applied to load forecasting, selection of the kernel function is affected by the characteristics of the power load, and an SVR with a poorly suited kernel function has low forecasting accuracy and poor generalization ability. A novel load forecasting method combining SVR and stacking is proposed in this paper. Base models are constructed from SVRs with different kernel functions, and multiple base models are merged into a base model layer via the stacking algorithm. Finally, an SVR is connected as the meta-model layer. The stacking fusion model is composed of the base model layer and the meta-model layer; it is trained with k-fold cross validation to enhance its generalization ability. An improved artificial fish swarm algorithm is employed to optimize the parameters and improve the forecasting accuracy of the stacking fusion model: speed variables are introduced to replace fixed step lengths, improving the convergence speed and search ability. The forecasting accuracy and generalization ability of the proposed method are verified by comparative analysis.


I. INTRODUCTION
The forecasting of power demand is of crucial importance for the development of modern power systems. Stable and efficient management, scheduling, and dispatch in power systems rely heavily on precise forecasting of future loads on various time horizons. In particular, short-term load forecasting (STLF) focuses on forecasting loads from several minutes up to one week into the future [1]. In the process of load forecasting, many factors such as weather and calendar rules make the load curve complicated and difficult to analyze; the effects of these uncertainties need to be explored in order to fully ensure system security [2], [3]. There is demand for innovative, highly accurate load forecasting methods to support today's electric power systems.
Kalman filters [15], [16] and support vector machines (SVM) [17] are commonly used. In [18], a random search method was used to seek optimal hyperparameters, and the Long Short-Term Memory (LSTM) model with the best generalization ability was selected to improve the load forecasting results. In [19], the weighted grey relation projection method was used to extract features, the improved particle swarm optimization algorithm was used to optimize model hyperparameters, and the support vector machine was then employed for forecasting, which improved the forecasting accuracy. In [20], a variance-adjusted gradient boosting algorithm for approximating a Gaussian process regression (VAGR) was used for nonparametric regression, which effectively avoids the cubic time complexity that hinders Gaussian process regression from scaling to large data sets. In [21], a grey model was combined with a semi-parametric regression model through time-varying weights for load forecasting, making the forecasts more accurate. In [18]-[21], different load forecasting models were compared; the results showed that fusion models and hyperparameter optimization can effectively promote load forecasting accuracy. The SVR method has been widely investigated [22]-[24]. In [22], a method that adopted load features and temperature features extracted by individual LSTM networks provided good forecasting results. In [23], a method based on EMD and SVR was used to forecast wind speed. In [24], a model based on affinity propagation SVR was used to forecast power load, with the optimal hyperparameters sought by particle swarm optimization. With regard to these recent research works, two problems arise in the use of SVR for forecasting.
First, the hyperparameter optimization algorithms used suffer from local optimal solutions and excessive iteration time. Second, the kernel function is typically selected from experience as the radial basis function (RBF) kernel, with no comparison to determine whether a more suitable kernel function exists.
To solve these problems, this paper proposes an improved artificial fish swarm algorithm and a model based on the integration of SVR and stacking, respectively. The stacking fusion model effectively improves the forecasting accuracy and generalization ability. The effectiveness of the proposed method is verified by comparison with other models.
The rest of the paper is organized as follows. Section II introduces the principles of SVR and kernel functions. In Section III, the principles of the improved artificial fish swarm algorithm and the stacking fusion model are introduced. In Section IV, we present the experimental results and discussion. Finally, the conclusions are summarized in Section V.

II. SVR PRINCIPLE BASED ON DIFFERENT KERNEL FUNCTIONS
This section introduces the principle of the SVR model and analyzes the application of different kernel functions, in Sections A and B, respectively.

A. SVR PRINCIPLE
SVR is a branch of SVM applied in the field of function fitting. The set of training samples is D = {(x_1, y_1), (x_2, y_2), . . . , (x_m, y_m)}. The trained SVR model takes the form f(x) = w^T x + b; in other words, the forecasting value f(x_i) derived from a given x_i approximates the true value y_i to the greatest extent possible, with −ε < f(x_i) − y_i < ε, where w and b are the model parameters to be determined.
The hard-margin SVR optimization problem is expressed in the following form:

max_{w,b} 1/||w||, s.t. |y_i − (w^T x_i + b)| ≤ ε, i = 1, . . . , m    (1)

Equation (1) is the objective function, that is, the maximization of the margin, the shortest distance between the samples contained in the space and the hyperplane. The constraint condition is that the distance from the farthest sample in the space to the hyperplane does not exceed ε. After simplification, the following convex quadratic optimization formula can be obtained:

min_{w,b} (1/2)||w||^2, s.t. |y_i − (w^T x_i + b)| ≤ ε, i = 1, . . . , m    (2)

Considering the case where the distance between a sample and the hyperplane exceeds ε, that is, the soft-margin SVR, a hinge loss function can be added to calculate the loss on such samples. The ε-insensitive hinge loss function is as follows:

ℓ_ε(z) = 0 if |z| ≤ ε, and ℓ_ε(z) = |z| − ε otherwise    (3)

By introducing slack variables and a penalty factor, Formula (2) can be rewritten as follows:

min_{w,b,ξ,ξ̂} (1/2)||w||^2 + C Σ_{i=1}^{m} (ξ_i + ξ̂_i)
s.t. f(x_i) − y_i ≤ ε + ξ_i, y_i − f(x_i) ≤ ε + ξ̂_i, ξ_i ≥ 0, ξ̂_i ≥ 0    (4)

where ξ_i, ξ̂_i are the slack variables and C is the penalty factor in Formula (4). Lagrange's theorem is used to transform this into an unconstrained mathematical problem, and the Lagrange function is then transformed into its dual. A mapping function is introduced here to map the input space to a high-dimensional, linearly separable Hilbert space in order to solve the nonlinear regression problem. Finally, the kernel function k(x_i, x_j) = <ϕ(x_i), ϕ(x_j)> is introduced to avoid explicit calculations in the high-dimensional space, where ϕ(·) represents the mapping function and <·,·> represents the inner product. After simplification, the following dual problem can be obtained:

max_{α,α̂} Σ_{i=1}^{m} y_i(α̂_i − α_i) − ε Σ_{i=1}^{m} (α̂_i + α_i) − (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} (α̂_i − α_i)(α̂_j − α_j) k(x_i, x_j)
s.t. Σ_{i=1}^{m} (α̂_i − α_i) = 0, 0 ≤ α_i, α̂_i ≤ C    (5)

where α^(*) = (α_1, α̂_1, . . . , α_m, α̂_m) are the Lagrange multipliers.
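As an illustration of the optimization problem above, the following minimal sketch (not the paper's code) fits an ε-SVR with scikit-learn, whose SVR estimator solves the dual problem just described; the toy data and the values of C and ε are assumptions chosen for demonstration.

```python
# Minimal sketch: epsilon-SVR on a toy 1-D regression problem.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 80)

# C is the penalty factor; epsilon is the half-width of the insensitive tube.
model = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma="scale")
model.fit(X, y)
pred = model.predict(X)
print(f"training RMSE: {np.sqrt(np.mean((pred - y) ** 2)):.3f}")
```

Only samples outside the ε-tube become support vectors, which is what keeps the fitted model sparse.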

B. KERNEL FUNCTION
The selection of the kernel function is the key to an effective SVR. Different kernel functions lead to different learning and generalization capabilities in the forecasting model [25]. The most common ones are as follows.
(1) The linear kernel function can rapidly resolve linear problems with relatively few parameters. The inner product expression of the linear kernel function is as follows:

k(x_i, x_j) = x_i^T x_j    (6)

(2) The RBF kernel function processes nonlinear problems more stably [26] and provides ideal fitting effects once accurate parameters are selected. The inner product expression of the RBF kernel function is as follows:

k(x_i, x_j) = exp(−||x_i − x_j||^2 / (2σ^2))    (7)

(3) The polynomial (Poly) kernel function solves nonlinear problems and is primarily suitable for orthogonally normalized data. The inner product expression of the Poly kernel function is as follows:

k(x_i, x_j) = (x_i^T x_j)^d    (8)

(4) The sigmoid kernel function also deals with nonlinear problems; in this case, the SVR acts as a type of multilayer perceptron neural network, and in a convex quadratic optimization problem it prevents falling into a local optimum. The inner product expression of the sigmoid kernel function is as follows:

k(x_i, x_j) = tanh(β x_i^T x_j + θ)    (9)

The SVRs with the linear, Poly, or RBF kernels are transformed into a linear problem-solving process after spatial mapping. Among them, the SVR with the linear kernel is a spatial mapping of the same dimension; nonlinear problems can be transformed into linear problems through feature extraction. The SVR with the Poly kernel is limited by the order of the polynomial and can only be mapped to a finite-dimensional space, so some nonlinear structure may remain unresolved. The SVR with the RBF kernel can be mapped to an infinite-dimensional space for solution, but improper parameter selection easily leads to over-fitting. The SVR with the sigmoid kernel is used as a multilayer perceptron neural network in this study to manage nonlinear problems directly.
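The trade-offs above can be illustrated by cross-validating SVRs with each of the four kernels on the same data. This is a sketch using scikit-learn's kernel names ("linear", "rbf", "poly", "sigmoid"); the synthetic data and hyperparameter values are assumptions, not the paper's.

```python
# Compare the four kernels discussed above on a synthetic nonlinear target.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.05, 200)

results = {}
for kernel in ("linear", "rbf", "poly", "sigmoid"):
    svr = SVR(kernel=kernel, C=10.0, epsilon=0.05)
    score = cross_val_score(svr, X, y, cv=5,
                            scoring="neg_root_mean_squared_error").mean()
    results[kernel] = -score  # mean CV RMSE; lower is better
    print(f"{kernel:8s} CV RMSE: {results[kernel]:.3f}")
```

On a target with nonlinear structure like this one, the RBF kernel typically beats the linear kernel, consistent with the mapping argument above.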

III. STACKING FUSION MODEL BASED ON IMPROVED ARTIFICIAL FISH SWARM ALGORITHM
The main contributions are described in this section. First, in Section A, the improved artificial fish swarm algorithm is proposed to address the local optimal solutions and long iteration time of the hyperparameter optimization algorithm. Second, in Section B, the load forecasting method combining SVR and stacking is proposed to overcome the difficulty of selecting an appropriate kernel function in the SVR model.

A. IMPROVED ARTIFICIAL FISH SWARM ALGORITHM
As SVRs with different kernel functions observe the data from different space angles and structure angles, their hyperparameters are inconsistent and thus difficult to adjust. An improved artificial fish swarm algorithm is therefore used in this study to optimize the hyperparameters of the SVRs with different kernel functions.
The artificial fish swarm algorithm simulates the behavior of fish swarms foraging, swarming, chasing, and moving freely in search of food, corresponding to four behaviors in the algorithm: foraging, swarming, chasing, and freedom behavior [27]. The traditional artificial fish swarm algorithm uses a fixed value for the step-size hyperparameter. If the step size is set too large, the search is prone to hovering around the optimal value, and the optimal solution cannot be obtained; if it is set too small, the convergence speed is low [28]. Seeking the optimal hyperparameters with a speed variable rather than a fixed step size accelerates convergence while still securing the global optimal solution. The optimal solution on the bulletin board is updated by iteratively comparing the food concentration at each artificial fish's position. The hyperparameters are deemed optimal when a satisfactory error bound or the upper limit on the number of iterations is reached.
The individual state of an artificial fish is defined as a vector X = (x_1, x_2, . . . , x_n), where x_i (i = 1, 2, . . . , n) is a variable to be optimized. The food concentration at the current position of the artificial fish is expressed as Y = f(X), where Y is the objective function value. The moving speed of the i-th artificial fish is defined as v_i = (v_i1, v_i2, . . . , v_in), and the speed-update hyperparameters are c_1 and c_2 (the artificial fish's knowledge of itself and of the colony, respectively). The optimal position found by the i-th artificial fish is P_i = (p_i1, p_i2, . . . , p_in), the optimal position found by the whole fish colony is G = (g_1, g_2, . . . , g_n), and δ is the crowding degree factor.
(1) The foraging behavior: the artificial fish chooses its direction of movement through visual perception of the food concentration in the water. A state vector X_j is randomly selected within the field of view. If Y_i < Y_j, the artificial fish moves in this direction at speed V_i|next; otherwise, X_j is reselected or random behavior is performed. Here ζ, η, and γ are random numbers in the range 0-1.
(2) The swarming behavior: a colony of artificial fish gathers in groups for collective foraging under two principles: moving toward the center of neighboring partners as much as possible while avoiding overcrowding. The number of artificial fish found by the i-th artificial fish in its field of view is n_f, and the state of the artificial fish at the center position is X_c. When Y_c / n_f > δY_i, the food concentration is sufficient and the artificial fish moves in that direction at speed V_i|next; otherwise, it performs foraging behavior.
(3) The chasing behavior: when one or several fish find food, the fish around them ''follow'' and swim toward the artificial fish X_j in the best state within the field of view. As in swarming behavior, they avoid overcrowding; otherwise, they perform foraging behavior.
(4) The freedom behavior: this is complementary to foraging behavior. When the state vector X_j randomly selected within the field of view never satisfies Y_i < Y_j, a state is randomly selected to swim toward, yielding a new state while effectively avoiding the local optimum.
The SVR hyperparameters are selected here as the variables x_i in X = (x_1, x_2, . . . , x_n). The artificial fish swarm performs the above four behaviors in order of priority, and a maximum number of iterations is set; the optimal hyperparameters are then obtained from the updated bulletin board. The process of the proposed improved artificial fish swarm algorithm is plotted in Fig. 3.
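The speed-based search can be sketched as follows. This is a hypothetical, simplified implementation, not the paper's: the paper's exact behavior-update equations are not reproduced here, and the inertia and speed coefficients (0.729 and 1.494) are standard constriction values assumed for stability, whereas the paper's own experiments use c_1 = c_2 = 2. Each fish's velocity is pulled toward its personal best P_i and the bulletin-board best G, weighted by random numbers, and the bulletin board is updated whenever a better food concentration is found.

```python
# Hypothetical sketch of a speed-based swarm search minimizing a toy function.
import numpy as np

def improved_afsa(f, dim=2, n_fish=10, iters=50, w=0.729, c1=1.494, c2=1.494, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(-5, 5, (n_fish, dim))   # fish positions
    V = np.zeros((n_fish, dim))             # speed variables (replace fixed step size)
    P = X.copy()                            # personal best positions P_i
    p_val = np.array([f(x) for x in X])
    g = P[p_val.argmin()].copy()            # bulletin-board (global) best G
    for _ in range(iters):
        zeta = rng.random((n_fish, dim))
        eta = rng.random((n_fish, dim))
        # Speed update: inertia plus attraction toward P_i and G.
        V = w * V + c1 * zeta * (P - X) + c2 * eta * (g - X)
        X = X + V
        vals = np.array([f(x) for x in X])
        better = vals < p_val               # lower food-concentration objective is better
        P[better] = X[better]
        p_val[better] = vals[better]
        g = P[p_val.argmin()].copy()        # update the bulletin board
    return g, float(p_val.min())

best_x, best_val = improved_afsa(lambda x: np.sum(x ** 2))  # sphere function
print(f"best value found: {best_val:.6f}")
```

In the paper's setting, f would evaluate the SVR's cross-validation error as a function of (C, γ) instead of a sphere function.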

B. STACKING FUSION MODEL
Based on the characteristics of the kernel functions (Section II), and considering the insufficient learning ability and poor generalization ability that an ill-suited kernel causes, a load forecasting method integrating SVR and stacking is developed in this study. The hypothesis space is, theoretically, a collective space composed of all features. The limited size of the samples relative to the massive size of the hypothesis space leaves the mapping from input to output incomplete, resulting in low forecasting accuracy. However, SVRs with different kernel functions produce different observation results from different data space angles and data structure angles. The stacking algorithm summarizes these different observation results to cover the entire hypothesis space (or as much of it as possible) and improve the forecasting accuracy.
A limited sample size leads to multiple hypotheses being consistent with the training set; a larger version space (the set of such hypotheses) results in weaker generalization ability. The stacking algorithm used in this study learns the version space of each SVR to minimize the version space of the overall model and improve its generalization ability. The stacking fusion model has upper and lower layers, a structure that strengthens the learning effect, reduces the redundancy and complexity of the forecasting model, ensures strong forecasting accuracy, and shortens the operation time. The upper layer is the base model layer, composed of multiple base models; the lower layer is the meta-model layer, composed of a single meta-model connected to the first layer. Together, the base model layer and the meta-model layer form the stacking fusion model.
When the model input is X_i, the n-th base model of the first layer is M_n and the forecasting model of the second layer is M. The output of the n-th base model of the first layer, M_n(X_i), is used as the input of the forecasting model of the second layer. The final forecasting result y_i is shown in equation (16):

y_i = M(M_1(X_i), M_2(X_i), . . . , M_n(X_i))    (16)

The two stages of the fusion process are analyzed separately here.
Step 1: based on the k-fold cross validation concept, the original data is divided into a training set S and a test set T. The training set is divided into k parts denoted S_1-S_k. First, S_2-S_k are used to train base model 1, and the trained base model 1 is used to forecast the validation set S_1 and the test set. Next, S_1 and S_3-S_k are used to retrain base model 1, which then forecasts the validation set S_2 and the test set. By analogy, k sets (over S_1-S_k) of forecasting values a_1, a_2, . . . , a_k are obtained; these out-of-fold forecasts, together with the test-set forecasts, are collected for each base model.

Step 2: the forecasting values output by the base models serve as input features for training the second-layer meta-model, which produces the final forecast.

The number of base models is strongly correlated with the fusion effect. Too few models cannot achieve complementary fusion of the various models, while an excessive number results in redundancy and unreasonable parameter complexity and forecasting time. The preferable number of base models is between 3 and 5 [29], [30]. In this study, four base models in the base model layer achieved the optimal fusion effect and the highest learning efficiency. Considering the different data spaces and data structures, SVRs with four different kernel functions (linear, RBF, Poly, and sigmoid) are used as the base models of the first layer of the stacking algorithm. According to the selection rules for the second layer of the stacking algorithm, a model with strong generalization ability or high forecasting accuracy is then selected as the second-layer meta-model. On the basis of Reference [31] and a comparison of training results, the SVR with the RBF kernel, which has the highest forecasting accuracy, is used as the second-layer meta-model; this constitutes the entire stacking fusion model, and the forecasting accuracy is highest in this case. Considering the inconsistency of the various models' hyperparameters and the difficulty of adjusting them, the improved artificial fish swarm algorithm is employed to determine each SVR's kernel coefficient γ and penalty parameter C.
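The two steps above map directly onto scikit-learn's StackingRegressor, which generates out-of-fold base-model forecasts over cv folds (Step 1) and trains the meta-model on them (Step 2). The sketch below is illustrative only: the synthetic data and the C and ε values are assumptions, not the paper's optimized hyperparameters.

```python
# Two-layer stacking: four SVR base models and an RBF-SVR meta-model.
import numpy as np
from sklearn.svm import SVR
from sklearn.ensemble import StackingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, (300, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(0, 0.05, 300)

# Base model layer: one SVR per kernel function.
base_models = [(k, SVR(kernel=k, C=10.0, epsilon=0.05))
               for k in ("linear", "rbf", "poly", "sigmoid")]
# Meta-model layer: RBF SVR trained on the out-of-fold base forecasts (cv=5).
stack = StackingRegressor(estimators=base_models,
                          final_estimator=SVR(kernel="rbf", C=10.0),
                          cv=5)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
stack.fit(X_tr, y_tr)
rmse = float(np.sqrt(np.mean((stack.predict(X_te) - y_te) ** 2)))
print(f"stacking test RMSE: {rmse:.3f}")
```

Using cv folds for the base-model forecasts is what prevents the meta-model from being trained on leaked in-sample predictions, which is the generalization benefit the text attributes to k-fold cross validation.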
The hyperparameter values are adjusted repeatedly until satisfactory accuracy is obtained. The process of the proposed stacking fusion model is plotted as shown in Fig. 5.

IV. SIMULATION ANALYSIS
The example data and experimental results are analyzed in this section. In Section A, two sets of data are preprocessed. The input features related to the load are listed in Section B. In Sections C and D, the effectiveness of the improved artificial fish swarm algorithm and the stacking fusion model is verified by comparison with other models.

A. EXAMPLE DATA
Real-world load data from a certain area of Guizhou in 2018 and from Spain in 2015 are preprocessed. The load data from Spain are retrieved from https://www.kaggle.com/nicholasjhana/datasets. The data horizontal comparison method of Reference [32] is employed to identify continuously missing or continuously abruptly changed data, eliminating the impact of sudden changes caused by factors such as missing information or emergencies. The data are scaled to the range [0, 1] by min-max standardization. The Guizhou load is sampled once every 30 min, for a total of 48 sampling points each day, and the Spanish load is sampled once every 60 min, for a total of 24 sampling points each day. The model is implemented in Python; the code is executed on a desktop with an Nvidia RTX 2070 Super graphics card and an i7-8700K CPU. The predictive evaluation indices are the mean absolute percentage error e_MAPE and the root mean square error e_RMSE, as shown below:

e_MAPE = (1/n) Σ_{i=1}^{n} |x(i) − y(i)| / x(i) × 100%    (17)

e_RMSE = sqrt( (1/n) Σ_{i=1}^{n} (x(i) − y(i))^2 )    (18)

where x(i) and y(i) denote the actual value and the forecasting value at time point i, respectively, and n is the sample size.
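The two evaluation indices can be written directly from their definitions; the toy actual/forecast series below is purely illustrative.

```python
# e_MAPE and e_RMSE as defined above.
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(100.0 * np.mean(np.abs(actual - forecast) / actual))

def rmse(actual, forecast):
    """Root mean square error, in the units of the load."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.sqrt(np.mean((actual - forecast) ** 2)))

actual = [100.0, 200.0, 400.0]
forecast = [110.0, 190.0, 400.0]
print(mape(actual, forecast))  # mean of (10%, 5%, 0%) -> approximately 5.0
print(rmse(actual, forecast))
```

Note that e_MAPE is undefined when an actual value x(i) is zero, which the min-max scaling to [0, 1] can produce; in practice the error is computed on the unscaled load.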

B. INPUT AND FEATURES OF THE TRAINING SET
To account for the different load patterns of different seasons, the load data of the first 24 days of January, May, and August are used as training sets, corresponding to winter, spring, and summer load conditions. The test sets are composed of the load data of the last 7 days of January, May, and August, targeting the forecasting effect of the proposed method. The input variables are determined as the base models of the first layer are trained. The optional input variables with higher correlation include historical information, weather information, and calendar rules, as shown in Table 1.
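The kind of input table Table 1 describes can be sketched with pandas. The specific lags and calendar fields below are illustrative assumptions, not the paper's exact feature list; the synthetic sinusoidal load stands in for the real Guizhou series (48 points per day at a 30-min sampling interval).

```python
# Illustrative feature table: historical load lags plus calendar rules.
import numpy as np
import pandas as pd

idx = pd.date_range("2018-01-01", periods=48 * 31, freq="30min")  # 48 points/day
load = pd.Series(1000 + 200 * np.sin(np.arange(len(idx)) * 2 * np.pi / 48), index=idx)

features = pd.DataFrame({
    "load_lag_1": load.shift(1),          # previous half-hour
    "load_lag_48": load.shift(48),        # same time yesterday
    "load_lag_336": load.shift(48 * 7),   # same time last week
    "hour": idx.hour,                     # calendar rules
    "weekday": idx.weekday,
    "is_weekend": (idx.weekday >= 5).astype(int),
}, index=idx).dropna()                    # drop rows whose lags precede the data
target = load.loc[features.index]
print(features.shape, target.shape)
```

Weather variables would be joined on the same timestamp index; dropping the first week of rows is the price of the longest lag.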

C. OPTIMIZATION OF MODEL HYPERPARAMETERS BASED ON IMPROVED ARTIFICIAL FISH SWARM ALGORITHM
As discussed in Section III, the hyperparameters optimized by the improved artificial fish swarm algorithm in the proposed method are the penalty parameter C and kernel coefficient γ of each SVR. Based on the load data with different characteristics in January, May, and August, three sets of optimized hyperparameters are obtained, as shown in Table 2. Table 2 shows significant differences among the hyperparameters of the base models. The stacking fusion model thus appears to reduce the forecasting error by combining SVRs with different kernel functions: the observation results and version spaces of the different SVRs are learned to improve the forecasting accuracy and generalization performance.

D. COMPARATIVE ANALYSIS OF FORECASTING
The improved and traditional artificial fish swarm algorithms are compared on the error curves shown in Fig. 6. Fig. 6 shows that the improved artificial fish swarm algorithm is significantly better than the original one. Introducing speed variables rather than a fixed step size accelerates the convergence: the original artificial fish swarm algorithm reached the optimum after approximately 35 iterations, while the improved algorithm did so after approximately 17. As mentioned above, too small a step size can trap the search in a local optimum, whereas the speed variable allows the search to jump out of local optima and approach the global optimum. The improved artificial fish swarm algorithm is therefore selected as the hyperparameter optimization algorithm of the stacking fusion model, rapidly finding the globally optimal hyperparameters and ensuring the best forecasting accuracy of the stacking fusion model.
The proposed method's forecasting performance is also compared with LSTM, RF-stacking, XGBoost-stacking, and the SVR with the RBF kernel. To ensure that the stacking fusion model can find the optimal hyperparameters, the hyperparameters of the improved artificial fish swarm algorithm are set as follows: the maximum number of iterations is 20, the population size is 10, the try number is 10, the crowding degree factor δ is 0.623, and the speed parameters c_1 and c_2 are both 2. Six sets of load data for January, May, and August in Guizhou and Spain are input into the stacking fusion model and the other models for training and forecasting. Six sets of weekly load forecasting curves are obtained, as shown in Fig. 7.
As shown in Fig. 7, there are significant differences among the weekly load curves of different months in both regions. The weekly load curve of January 25-January 31 is the highest in both Guizhou and Spain, which can be attributed to the load characteristics of the different seasons. The three sets of weekly load curves for Guizhou show a significant weekend increase from 18:00 on Fridays to 24:00 on Sundays compared with working days. Saturday, August 25 shows a relatively low load in Guizhou due to the Zhongyuan Festival; the load reached its lowest level during this holiday and then increased rapidly on the following Sunday. The forecasting accuracies of the five forecasting models are given in Table 3.
The calculated forecasting accuracy (Table 3) suggests that the proposed stacking fusion model outperforms the other models. Among the three sets of forecasting indicators from Guizhou, its e_MAPE reaches 2.5% at minimum and 2.93% at maximum. The fitting effect of the SVR with the RBF kernel is the most unsatisfactory, with an e_MAPE 0.67% higher than that of the stacking fusion model on average, followed by the LSTM, RF-stacking, and XGBoost-stacking, respectively. The proposed stacking fusion model shows a minimum e_RMSE of 160.82 MW and a maximum of 242.14 MW, also better than the other models' e_RMSE indicators. Among the three sets of weekly load forecasting indicators for Spain, the proposed stacking fusion model shows a minimum e_MAPE of 0.88% and a maximum of 1.9%, with e_RMSE reaching 412.79 MW at minimum and 1195.81 MW at maximum.
Each model achieves its respective ideal forecasting results on the smooth segment of the 66h-78h weekly load curve, such as at the end of May in Guizhou (Fig. 7, Table 3). When the load fluctuates drastically, however, the SVR with the RBF kernel shows insufficient learning ability and a high probability of over-fitting, which leads to considerable deviation from the actual data between the 78h and 102h time points. The LSTM properly forecasts the variation trend of the load thanks to its long-term memory, but a wide gap still exists between the forecasting values and the actual values. RF-stacking and XGBoost-stacking have the advantages of ensemble learning, and their forecasting accuracy surpasses the SVR and LSTM, but they are not as effective as the integration of stacking and SVR. The proposed stacking fusion model shows excellent predictive performance and fits the random volatility of the load better than the other models.

V. CONCLUSION
In this paper, the stacking fusion model is proposed based on the SVR concept and applied to short-term load forecasting with actual power load data. The proposed method is compared with other methods to confirm its feasibility and effectiveness. The conclusions can be summarized as follows. First, the stacking algorithm achieves higher forecasting accuracy and better generalization ability than the other models because the differing observation results and version spaces of SVRs with different kernel functions are learned. Second, the generalization ability of the stacking fusion model is improved by dividing the data set via k-fold cross validation. Third, the improved artificial fish swarm algorithm can be used to seek the optimal SVR hyperparameters, which improves the forecasting accuracy of the stacking fusion model while accelerating its convergence.