Short-Term Load Forecasting Based on PSO-KFCM Daily Load Curve Clustering and CNN-LSTM Model

Short-term load forecasting (STLF) with high precision and efficiency plays a significant role in the stable operation of the power grid and the improvement of economic benefits. In this paper, a novel model based on data mining and deep learning is proposed. Firstly, data preprocessing includes normalization of the historical load and fuzzification of the influencing factors (meteorological factors, date types and economy) based on the Pearson correlation coefficient (PCC). Secondly, the kernel fuzzy c-means algorithm modified by particle swarm optimization (PSO-KFCM) clusters the daily load curves. In the clustering experiments, the within-cluster sum of squared errors (SSE) index is used to determine the number of clusters, and the clustering validity is enhanced by 31.9% compared with the traditional FCM algorithm. Thirdly, cosine similarity measures the resemblance between the prediction date and each cluster, and the similar cluster is determined according to the principle of maximum similarity. Finally, a multivariate and multi-step hybrid model, MMCNN-LSTM, based on a convolutional neural network (CNN) and a long short-term memory (LSTM) neural network, is proposed to forecast the load in the following 24 hours, with the similar-cluster data used as the training set. To demonstrate the effectiveness of the proposed integrated technique, its accuracy is verified in three predictive experiments. The results indicate that the average mean absolute percent error (MAPE) over the entire test set is only 1.34%, a 3.02% reduction compared with a single LSTM.


I. INTRODUCTION
Power load forecasting predicts future load data from historical data [1]. The level of power load forecasting has become a notable indicator of whether the management of an electric power enterprise is modernizing, and accurate forecasting plays a significant role in the modern, scientific management of the power grid [2]. According to the forecast horizon, power load forecasting can be divided into long-term load forecasting (LTLF), medium-term load forecasting (MTLF), short-term load forecasting (STLF) and very short-term load forecasting (VSTLF). Among them, STLF refers to the prediction of the future daily or weekly load, and mainly serves power system operation and dispatching, guaranteeing the safety of the power grid and improving operational efficiency. It is therefore a substantial task for the power system from the perspective of security, economy and development [3]. The traditional forecasting models take historical load data as their sole basis. For instance, the trend extrapolation method deduces the future trend and state from the observed rate of change. Similarly, regression analysis adjusts the model parameters to fit historical data and extrapolates the prediction [4]. The time series method is one of the most common forecasting methods; its core is to establish a mathematical model by analyzing how the historical data evolve over time. The commonly adopted time series models are the autoregressive moving average (ARMA) model [5], the autoregressive integrated moving average (ARIMA) model [6], the seasonal autoregressive integrated moving average (SARIMA) model [7] and the autoregressive integrated moving average model with external input (ARIMAX) [8].

The associate editor coordinating the review of this manuscript and approving it for publication was Seyedali Mirjalili.

However, the conventional prediction methods assume that the development trend changes only slightly, and the relationship they model between historical data and forecast data is relatively simple. Their accuracy therefore degrades severely when the load trend changes sharply, which is a fundamental shortcoming.
In recent decades, with the rapid progress of machine learning, experts and scholars all over the world have conducted in-depth research on STLF and put forward numerous effective forecasting models. Despite the difficulty of identifying the optimal kernel function parameter σ and penalty factor c, Hua et al. introduced a supervised learning model called the support vector machine (SVM) into STLF [9]. Optimization algorithms were then added to SVM to alleviate this problem, such as SVM optimized by particle swarm optimization (PSO-SVM) [10], the genetic algorithm (GA-SVM) [11], the fruit fly algorithm (FF-SVM) [12] and the dragonfly algorithm (DA-SVM) [13]. Because inequality constraints are changed into equality constraints, the least squares support vector machine (LS-SVM) model applied by Yang et al. simplified the algorithm and increased the solving speed [14]. In 1991, the artificial neural network (ANN) was first applied to power system load forecasting by Park [15], using the back propagation (BP) algorithm. Afterwards, optimization by intelligent algorithms improved the accuracy of the ANN, but it still relied heavily on the quality of the training data [16]. In recent years, deep neural networks (DNN) have dominated load prediction. Shi et al. introduced the recurrent neural network (RNN), which captures temporal characteristics, into household STLF [17]. The long short-term memory (LSTM) network is a variant of RNN that overcomes the vanishing and exploding gradients of RNN; therefore, the LSTM adopted by Liu et al. performed better on long sequences [18]. Other DNN baseline models, such as the gated recurrent unit (GRU) [19] and the bidirectional recurrent neural network (Bi-RNN) [20], were trained with abundant samples; however, their results implied that they are not promising because they overfit easily.
In the development of STLF, single machine learning models had difficulty meeting accuracy requirements, so preprocessing methods were combined with them to form hybrid models. Common preprocessing methods include grey theory [21], wavelet packet analysis [22], empirical mode decomposition (EMD) [23], random forest [24] and so on.
To sum up, although STLF based on modern prediction methods has performed well in theory and application, the reasoning process is quite complex and struggles to meet the demands of practical problems. This paper proposes a novel model based on data mining and deep learning, which takes into account not only historical data but also meteorology, date type, economy and other factors. Normalizing the historical load avoids slow gradient descent, and fuzzification allows influencing factors with different weights to be fed directly into the prediction model. The proposed PSO-KFCM algorithm is applied to daily load curve clustering and overcomes the tendency of the initial clustering centers to become trapped in a local optimum. After that, the cosine similarity of the influencing factors is used to relate the prediction day to all clusters. Finally, a hybrid deep learning model CNN-LSTM with multivariate and multi-step input (MMCNN-LSTM) is proposed to forecast the load in the next 24 hours. This comprehensive technique, which combines the PSO-KFCM algorithm and the CNN-LSTM model, has not previously been applied in the field of STLF. Its effectiveness is verified by three predictive experiments, which will provide a reference for future load forecasting.
The main contributions in this paper are summarized as follows:

The remainder of this paper is organized as follows. Part II introduces the theory of the algorithms applied in this study. Part III presents the source of the dataset and the modeling. In Part IV, exploratory experiments are carried out and the experimental results are shown. Finally, Part V summarizes this study.

A. THE PROPOSED METHOD
The proposed short-term load forecasting method based on the PSO-KFCM algorithm and the CNN-LSTM model is shown in Fig.1. Firstly, historical power load data are obtained from the supervisory control and data acquisition (SCADA) system. Load forecasting should not rely on historical load data alone; influencing factors such as meteorological factors, date types and economic factors should be obtained from the local administrative departments. Load features composed of historical data and influencing factors guarantee the accuracy of the prediction and enhance robustness to interference. The historical load data and the influencing factors are then normalized and fuzzified respectively, which overcomes both the slow gradient descent caused by large values and the inability to feed factors with non-uniform units or no units directly into the prediction model. In the second step, the PSO-KFCM algorithm clusters the normalized historical load data to extract typical power consumption characteristics. Thirdly, cosine similarity measures the resemblance between the prediction day and each cluster, where the inputs are the influencing factors of the predicted day and the within-cluster average influencing factors, both after fuzzification. The similar cluster of the prediction day is identified according to the principle of maximum similarity. These three steps form the preprocessing stage, which extracts hidden features and internal laws. Finally, the multivariable and multi-step CNN-LSTM model forecasts the load 24 h ahead.

The fuzzy c-means (FCM) algorithm was developed by Bezdek et al. [25]. FCM belongs to soft clustering, which differs from traditional hard c-means (HCM) clustering in that it allows the same object to belong to several clusters. This fuzzy partition lets each data point express its degree of relevance to every cluster through a membership grade in [0,1].
Let the dataset be $X = \{x_1, x_2, \cdots, x_n\} \subset \mathbb{R}^p$, where each sample $x_j$ has $p$ features and $n$ is the number of samples in $X$. To divide the dataset into $k$ classes, the objective function of the FCM algorithm is as follows:

$$J_m(U, V) = \sum_{i=1}^{k} \sum_{j=1}^{n} u_{ij}^{m} \left\| x_j - c_i \right\|^2 \tag{1}$$

where $m$ represents the fuzzy weighting coefficient, $u_{ij}$ represents the membership grade of sample $j$ to cluster $i$, and $c_i$ represents the clustering center of cluster $i$. $U$ is a $k \times n$ matrix, representing the membership matrix, and $V$ is a $k \times p$ matrix, representing the clustering center matrix. The squared 2-norm $\| x_j - c_i \|^2$ is the squared Euclidean distance from each data point to the clustering center. The constraint condition is $\sum_{i=1}^{k} u_{ij} = 1, \ \forall j = 1, 2, \cdots, n$, that is, the sum of the membership grades of each sample to all clusters is equal to 1 [26]. More seriously, the membership grade is inversely related to the Euclidean distance, which makes FCM sensitive to noise and outliers. For data with strong interference, this shortcoming leads to poor clustering quality.
To address the poor clustering quality caused by the constraint condition, the KFCM algorithm introduces a kernel function, which maps the points of the original space to a high-dimensional feature space. Compared with the FCM algorithm, the KFCM algorithm greatly improves performance and classification quality, because the nonlinear mapping enlarges the feature differences among samples [27]. The objective function of the KFCM algorithm is as follows:

$$J_m(U, V) = \sum_{i=1}^{k} \sum_{j=1}^{n} u_{ij}^{m} \left\| \Phi(x_j) - \Phi(c_i) \right\|^2 \tag{2}$$

where $\Phi$ represents a nonlinear mapping, the Euclidean distance $\| x_j - c_i \|^2$ in the traditional FCM algorithm is rewritten as $\| \Phi(x_j) - \Phi(c_i) \|^2$, and $\Phi(x_j)$ and $\Phi(c_i)$ are the images of the sample and the clustering center mapped from the original space to the high-dimensional feature space respectively.
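The common kernels are the radial basis function (RBF) kernel, the rational quadratic (RQ) kernel, the exponential kernel, the sigmoid kernel and so on. In this study the RBF kernel function, which has the characteristics of rotational symmetry and separability, is used for the nonlinear mapping [28]. With the RBF kernel, written here in its usual Gaussian form $K(x, y) = \exp\left(-\| x - y \|^2 / (2\sigma^2)\right)$, the distance in the feature space can be decomposed as follows:

$$\left\| \Phi(x_j) - \Phi(c_i) \right\|^2 = K(x_j, x_j) - 2K(x_j, c_i) + K(c_i, c_i) \tag{3}$$

Obviously, $K(x, x) = 1$. Therefore, the objective function of KFCM can be simplified as follows:

$$J_m(U, V) = 2 \sum_{i=1}^{k} \sum_{j=1}^{n} u_{ij}^{m} \left( 1 - K(x_j, c_i) \right) \tag{4}$$

The ideal clustering result has the largest similarity within each cluster and the smallest similarity between clusters, which corresponds to minimizing the objective function.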
The Lagrange multiplier method is applied to this constrained extreme value problem. The resulting formulas for the clustering center $c_i$ and the membership grade $u_{ij}$ are shown in Eq.(5) and Eq.(6) respectively:

$$c_i^{(l+1)} = \frac{\sum_{j=1}^{n} \left( u_{ij}^{(l)} \right)^{m} K\left( x_j, c_i^{(l)} \right) x_j}{\sum_{j=1}^{n} \left( u_{ij}^{(l)} \right)^{m} K\left( x_j, c_i^{(l)} \right)} \tag{5}$$

$$u_{ij}^{(l+1)} = \frac{\left( 1 - K\left( x_j, c_i^{(l+1)} \right) \right)^{-1/(m-1)}}{\sum_{r=1}^{k} \left( 1 - K\left( x_j, c_r^{(l+1)} \right) \right)^{-1/(m-1)}} \tag{6}$$

where $l$ represents the current iteration step. The concrete steps of the KFCM algorithm are as follows:
1) Initialize the maximum number of iteration steps $M$, the number of clusters $k$, the fuzzy weighting coefficient $m$, the RBF kernel parameter $\sigma$, the termination threshold of the objective function $\delta$ and the iteration step $l = 0$.
2) Initialize the membership matrix $U^{(0)}$ with random numbers in [0,1] that satisfy the constraint condition.
3) Update the clustering center matrix $V^{(l+1)}$ and the membership matrix $U^{(l+1)}$ according to Eq.(5) and Eq.(6) respectively.
4) Compute the value of the objective function. If $J_m < \delta$ or the maximum number of iteration steps is reached, the calculation is terminated; otherwise let $l = l + 1$ and return to step (3).

The KFCM algorithm described above overcomes the poor clustering quality caused by outliers. However, extra attention should be paid to step (3): KFCM is itself an iterative descent algorithm, which makes it sensitive to the initial clustering centers and unlikely to converge to the global optimum.
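The KFCM iteration above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the Gaussian kernel scaling `2 * sigma**2`, the membership-weighted initial centers, the stopping rule on the change of the objective (instead of `J_m < δ`), the function names and the toy data are all assumptions made for the example.

```python
import numpy as np

def rbf_kernel(X, C, sigma):
    """Gaussian kernel K(x, c) for every sample/center pair, shape (n, k)."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kfcm(X, k, m=2.0, sigma=1.0, max_iter=100, delta=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Random memberships, normalised so that each sample sums to 1 over clusters.
    U = rng.random((k, n))
    U /= U.sum(axis=0, keepdims=True)
    # Assumed start: membership-weighted means, because the center update
    # itself needs K(x_j, c_i) for some existing center c_i.
    C = ((U ** m) @ X) / (U ** m).sum(axis=1, keepdims=True)
    prev_J = np.inf
    J = np.inf
    for _ in range(max_iter):
        K = rbf_kernel(X, C, sigma).T                 # shape (k, n)
        W = (U ** m) * K
        C = (W @ X) / W.sum(axis=1, keepdims=True)    # center update
        K = rbf_kernel(X, C, sigma).T
        dist = np.maximum(1.0 - K, 1e-12)             # kernel-induced distance
        inv = dist ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=0, keepdims=True)      # membership update
        J = 2.0 * ((U ** m) * dist).sum()             # simplified objective
        if abs(prev_J - J) < delta:                   # assumed stopping rule
            break
        prev_J = J
    return U, C, J

# Toy data: two synthetic groups of 2-D points (not load curves).
_rng = np.random.default_rng(1)
X_demo = np.vstack([_rng.normal(0.0, 0.1, (20, 2)),
                    _rng.normal(3.0, 0.1, (20, 2))])
U_demo, C_demo, J_demo = kfcm(X_demo, k=2, sigma=2.0)
```

For daily load curves, each row of `X` would be one normalized day with 48 half-hourly values, and a hard label can be read off as `U.argmax(axis=0)`.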

2) PSO-KFCM
Since the KFCM clustering algorithm is sensitive to the initial values and easily falls into a local optimum, the kernel fuzzy c-means algorithm optimized by particle swarm optimization (PSO-KFCM) is developed to improve robustness.
PSO is a global optimization algorithm with fast convergence and few parameters, proposed by Eberhart and Kennedy [29]. It simulates the foraging behavior of birds searching for food randomly through a large number of free particles. Particles have two important properties: velocity and position. Let the particle population size be $N$, where the position of the $i$-th particle in the $D$-dimensional space can be expressed as $x_i = (x_{i1}, x_{i2}, \cdots, x_{id}, \cdots, x_{iD})$. The velocity of the $i$-th particle is defined as the moving distance in each iteration, expressed by $v_i = (v_{i1}, v_{i2}, \cdots, v_{id}, \cdots, v_{iD})$. The best position found so far by the $i$-th particle is called the individual extremum, denoted as $p_{best} = (p_{i1}, p_{i2}, \cdots, p_{id}, \cdots, p_{iD})$. The best position found so far by the whole population is called the global extremum, denoted as $g_{best} = (p_{g1}, p_{g2}, \cdots, p_{gd}, \cdots, p_{gD})$. The formulas for the $i$-th particle to update its velocity and position in the $d$-th dimension are as follows respectively:

$$v_{id}^{t+1} = w v_{id}^{t} + c_1 r_1 \left( p_{id} - x_{id}^{t} \right) + c_2 r_2 \left( p_{gd} - x_{id}^{t} \right) \tag{7}$$

$$x_{id}^{t+1} = x_{id}^{t} + v_{id}^{t+1} \tag{8}$$

where $t$ is the iteration index, $w$ is the inertia weight, $c_1$ and $c_2$ are the acceleration constants, $r_1$ and $r_2$ are random numbers in [0,1], $p_{id}$ is the current individual best position in the $d$-th dimension, and $p_{gd}$ is the current best position of the whole population in the $d$-th dimension. The specific process of the PSO algorithm is as follows:
1) Initialize the particle swarm, including the population size $N$ and the velocity $v_i$ and position $x_i$ of each particle.
2) Calculate the fitness value of each particle.
3) According to the fitness values, search for the individual extremum $p_{best}$ and the global optimal solution $g_{best}$.
4) Update the velocity and position of each particle according to Eq.(7) and Eq.(8).
5) If the error is small enough or the maximum number of iteration steps is reached, output the optimal result; otherwise return to step (2) and continue the calculation.

The combination of PSO and KFCM is described next. The first problem to be solved is the encoding of particles. Each particle represents one candidate solution for the clustering centers, so the velocity and position of each particle are $k \times p$ matrices. The second problem is the fitness function, which evaluates the position of each particle. The smaller the KFCM objective function, the higher the clustering quality, that is, the larger the fitness value. Therefore, the fitness value is inversely related to the objective function of KFCM. The fitness function is defined as:

$$f = \frac{1}{J_m(U, V) + K_0} \tag{9}$$

where $K_0$ is an arbitrary small positive number that prevents the denominator of the fitness function from being 0; 1 is a reasonable choice. The overall flow of the PSO-KFCM algorithm is shown in Fig.2.
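The PSO loop in steps 1) to 5), with the velocity and position updates of Eq.(7) and Eq.(8) and a fitness that grows as the objective shrinks, can be sketched as follows. This is a minimal sketch under stated assumptions: the objective is a toy quadratic standing in for the KFCM objective, the fitness form `1 / (J + K0)` is reconstructed from the description of `K0`, and the parameter values (`w`, `c1`, `c2`, swarm size, search bounds) and function names are illustrative choices, not the paper's settings.

```python
import numpy as np

def fitness_from_objective(J, K0=1.0):
    """Fitness rises as the objective J falls; K0 keeps the denominator positive."""
    return 1.0 / (J + K0)

def pso_maximise(fitness, dim, n_particles=30, n_iter=100,
                 w=0.7, c1=1.5, c2=1.5, bounds=(-5.0, 5.0), seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, size=(n_particles, dim))   # positions
    v = np.zeros_like(x)                               # velocities
    p_best = x.copy()                                  # individual extrema
    p_fit = np.array([fitness(p) for p in x])
    g_best = p_best[p_fit.argmax()].copy()             # global extremum
    for _ in range(n_iter):
        r1 = rng.random(x.shape)
        r2 = rng.random(x.shape)
        v = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)  # Eq.(7)
        x = np.clip(x + v, lo, hi)                                   # Eq.(8), kept in bounds
        fit = np.array([fitness(p) for p in x])
        improved = fit > p_fit
        p_best[improved] = x[improved]
        p_fit[improved] = fit[improved]
        g_best = p_best[p_fit.argmax()].copy()
    return g_best, float(p_fit.max())

def toy_objective(p):
    """Stand-in for the clustering objective: minimum 0 at p = (1, 1)."""
    return float(((p - 1.0) ** 2).sum())

best, best_fit = pso_maximise(
    lambda p: fitness_from_objective(toy_objective(p)), dim=2)
```

In PSO-KFCM each particle would instead hold a flattened `k * p` matrix of clustering centers, and `toy_objective` would be replaced by the KFCM objective evaluated at those centers.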

C. THE PRINCIPLE OF THE COSINE SIMILARITY
Cosine similarity is an effective method to calculate the similarity between two data samples. Euclidean distance measures the absolute distance, which is directly related to the location coordinates of each point. Furthermore, it treats the differences between different attributes of samples equally, which sometimes fails to meet the actual requirements [30], [31]. In contrast to Euclidean distance, cosine similarity measures the angle between space vectors, which reflects the difference in direction rather than in position. The closer the cosine value is to 1, the closer the angle is to 0 degrees, that is, the more similar the two vectors are. In the extreme case, the two vectors coincide completely. The cosine similarity between vectors $a$ and $b$ is as follows:

$$\cos(\theta) = \frac{a \cdot b}{\| a \| \, \| b \|} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}} \tag{10}$$

Customarily, cosine similarity is applied in a multidimensional positive space. In the field of load forecasting, load features are generally multidimensional positive numbers. After preprocessing, the data lie in the first quadrant and the angle is between 0 and 90 degrees, so the cosine similarity value lies in [0,1]. Cosine similarity is therefore well suited to determining the similar cluster. The similar cluster is the one with the maximum similarity between the prediction date and each cluster, which is equivalent to solving a classification problem.
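The maximum-similarity rule can be illustrated with a short sketch. The feature vectors below are made-up fuzzified values used only for illustration, and the function names are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar_cluster(day_factors, cluster_means):
    """Return the index of the cluster with maximum cosine similarity, and all similarities."""
    sims = [cosine_similarity(day_factors, c) for c in cluster_means]
    return int(np.argmax(sims)), sims

# Hypothetical fuzzified influencing factors of the prediction day
# and the within-cluster averages of three clusters.
day = [0.8, 0.2, 0.5]
clusters = [[0.1, 0.9, 0.3],
            [0.8, 0.25, 0.45],
            [0.5, 0.5, 0.5]]
best_cluster, similarities = most_similar_cluster(day, clusters)
```

Because all features are non-negative after preprocessing, every similarity falls in [0,1], and the prediction day is assigned to the cluster whose average factor vector points in the most similar direction.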

D. THE PRINCIPLE OF THE CNN-LSTM MODEL

1) CNN MODEL
The convolutional neural network (CNN) is a feedforward neural network (FNN) with feature extraction ability, which was proposed by Fukushima [32]. CNNs can be divided into 1DCNN, 2DCNN and 3DCNN, among which 1DCNN is the most suitable for the prediction of time series. A complete 1DCNN network consists of an input layer, a convolution layer, a pooling layer, a flattening layer and a dense layer [33].
As can be seen from Fig.3, 1D does not mean that the input is 1-dimensional, but that the convolution kernel moves along a single fixed direction, where the convolution kernel is a linear weighted function. The height $H$ of the input layer represents the time steps and the width $W$ represents the features at each time step. In the convolution process, the input and the convolution kernel are multiplied element by element and summed to extract features. If there are $k$ convolution kernels of size $f$ and the step size is $s$, then the height of the convolution layer is calculated as

$$H_{cov} = \frac{H - f}{s} + 1 \tag{11}$$

The width of the convolution layer is determined by the number of convolution kernels, that is, $W_{cov} = k$. The pooling layer then condenses the data by narrowing the sampling window. The flatten layer stretches the data and connects it to the dense layer. Given this structure, the characteristics of CNN can be summarized as follows:
1) Local receptive field: In contrast to a full connection, the convolution kernel is connected only to a local part of the input, which accelerates the computation.
2) Weight sharing: All the elements on the same feature map share the same convolution kernel, that is, they use fixed weights, which reduces the number of parameters.
3) Subsampling: In order to reduce redundancy and prevent overfitting, average pooling and maximum pooling are employed to condense the data.
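As a quick worked example of the convolution-layer height, the helper below evaluates the unpadded case; the floor division for strides that do not divide evenly and the function name are assumptions for the example.

```python
def conv1d_output_height(h_in, kernel_size, stride=1):
    """Height of an unpadded 1-D convolution output: floor((H - f) / s) + 1."""
    return (h_in - kernel_size) // stride + 1

# 48 half-hourly time steps, kernel size 2, stride 1 -> 47 positions.
h_stride1 = conv1d_output_height(48, 2, 1)
# The same input with stride 2 -> 24 positions.
h_stride2 = conv1d_output_height(48, 2, 2)
```

With `k` kernels, the resulting feature map has shape `(H_cov, k)`, independent of the input width.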

2) LSTM MODEL
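In order to solve the exploding and vanishing gradients of the traditional RNN, Sepp Hochreiter and Jürgen Schmidhuber proposed the long short-term memory (LSTM) network in 1997 [34]. Compared with RNN, LSTM cell units are still calculated from the current input and the hidden layer output of the previous time step; the internal structure changes, while the external structure remains the same. As shown in Fig.4, the internal structure of an LSTM cell unit is composed of the input gate $i$, the forget gate $f$, the output gate $o$ and the internal memory unit $c$. The forget gate $f$ controls how much of the previous cell state is forgotten, based on the input $x(t)$ and the previous hidden layer output $h(t-1)$. The input gate controls how much new information from the input $x(t)$ and the previous hidden layer output $h(t-1)$ is written. The memory unit $c$ determines which new information remains in the cell and updates the cell state. Finally, the output gate $o$ controls how much information the cell outputs [35]. The calculation of each component in LSTM is summarized as Eq.(12)-Eq.(17), given here in the standard LSTM formulation.
Forget gate:

$$f(t) = \sigma\left( W_f \cdot [h(t-1), x(t)] + b_f \right) \tag{12}$$

Input gate:

$$i(t) = \sigma\left( W_i \cdot [h(t-1), x(t)] + b_i \right) \tag{13}$$

Candidate memory:

$$\tilde{c}(t) = \tanh\left( W_c \cdot [h(t-1), x(t)] + b_c \right) \tag{14}$$

Memory unit:

$$c(t) = f(t) \odot c(t-1) + i(t) \odot \tilde{c}(t) \tag{15}$$

Output gate:

$$o(t) = \sigma\left( W_o \cdot [h(t-1), x(t)] + b_o \right) \tag{16}$$

Hidden layer output:

$$h(t) = o(t) \odot \tanh\left( c(t) \right) \tag{17}$$

where $\sigma$ is the sigmoid function, $W$ and $b$ are the weight matrix and bias vector of each component, and $\odot$ denotes element-wise multiplication.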
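A single LSTM time step can be sketched in NumPy as below, assuming the standard LSTM formulation (forget, input and output gates plus a tanh candidate memory). Stacking the four weight blocks into one matrix `W` is an implementation convenience chosen for this example, and the names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4n, n + d) and acts on [h_prev, x_t]; b has shape (4n,)."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:n])                  # forget gate
    i = sigmoid(z[n:2 * n])              # input gate
    c_tilde = np.tanh(z[2 * n:3 * n])    # candidate memory
    o = sigmoid(z[3 * n:4 * n])          # output gate
    c = f * c_prev + i * c_tilde         # updated cell state
    h = o * np.tanh(c)                   # hidden layer output
    return h, c

# With zero weights every gate equals 0.5 and the candidate is 0,
# so the cell state is simply halved.
h_demo, c_demo = lstm_step(np.zeros(2), np.zeros(3), np.ones(3),
                           np.zeros((12, 5)), np.zeros(12))
```

The zero-weight call makes the gating arithmetic easy to check by hand: `c = 0.5 * 1 + 0.5 * 0` and `h = 0.5 * tanh(0.5)`.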

3) THE HYBRID CNN-LSTM MODEL
In this paper, CNN can be regarded as a ''feature extractor'' after preprocessing, that is, it extracts local features along the time steps. 1DCNN is typically employed for time series problems: the convolution kernel slides along one fixed direction only, automatically extracting the hidden features and internal laws of the data in the time direction. The extracted feature sequence is then input into the LSTM network. With a large amount of training data, the weights of the input gate, forget gate and output gate in the LSTM network are adjusted continually, so that the LSTM is able to learn the temporal dependence between the feature sequence and the output.
As demonstrated in Fig.5, the original data consists of multi-dimensional load features with time information. CNN, which has an excellent feature extraction ability, plays the role of the encoder in a Seq2Seq model. LSTM, which has a strong prediction capacity for long time series, plays the role of the decoder in a Seq2Seq model. This

III. DATA AND MODELING

A. DATA SOURCES AND PREPROCESSING
The public electrical load dataset for New South Wales (NSW), Australia, is downloaded from the official website of the Australian Energy Market Operator (AEMO), https://www.aemo.com.au/energy-systems/electricity/national-electricity-market-nem/data-nem/aggregated-data. The NSW power system is responsible for supplying electricity to almost 1.6 million users, mainly in three metropolises: Sydney, Newcastle and Wollongong. The original data is composed of the settlement date, total load demand, electricity price and period type. These are insufficient as a complete basis for load forecasting. Meteorological factors are widely acknowledged to have a strong effect on load variation; therefore, meteorological factors are supplemented from the website https://www.wunderground.com, including temperature (drybulb temperature, dewpoint temperature, wetbulb temperature), humidity, precipitation, wind speed, pressure and so on. In addition, the calendar is consulted to determine the date type, because weekdays differ greatly from weekends, and holidays from non-holidays. Every 0.5 h is a time step, so there are 48 time indexes in one day. The data span January 01, 2006 to December 31, 2010, with a total of 1826*48 rows, and the columns represent the load features (historical load and influencing factors). The distribution of the whole electrical data is presented in Fig.6. The box-whisker plot suggests that there are numerous outliers from 9:00 to 17:00, but this is not necessarily harmful: it highlights the differences between classes and the difficulty of prediction. It should also be noted that the power load values are all on the scale of thousands or tens of thousands, and such large values slow down the search for the optimal solution by the gradient descent method.
To overcome this obstacle, normalization, which converts the data to [0,1], is used to simplify the calculation. Assuming that the maximum and minimum values of data $x$ are $x_{max}$ and $x_{min}$ respectively, the normalized data $\tilde{x}$ is calculated as follows:

$$\tilde{x} = \frac{x - x_{min}}{x_{max} - x_{min}}$$

Meteorological factors are the most commonly used influencing factors in STLF, since they affect economic activities (industrial, residential, commercial and agricultural) and thereby indirectly affect electricity consumption. For instance, when the temperature drops low enough, more energy is required to raise the comfort index of human body (CIHB), which leads to a surge in load demand. In addition, agricultural electricity consumption is the most sensitive to precipitation. However, these meteorological factors, which have different weights in their impact on load variation, have diverse units, such as temperature (°C) and wind speed (mph). On weekends and holidays people engage in more recreational activities, while on weekdays and non-holidays they are mostly occupied with work, which justifies taking date types into account. It is also difficult to input date types directly into the forecasting model, as they have no units.
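The min-max normalization of the load described above, together with the inverse step needed to read forecasts back in physical units, can be sketched as follows. The load values are invented for illustration and the helper names are hypothetical.

```python
import numpy as np

def min_max_normalise(x):
    """Scale x linearly to [0, 1]; also return the extremes needed to invert the scaling."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min), x_min, x_max

def denormalise(x_norm, x_min, x_max):
    """Map normalized values back to the original physical scale."""
    return np.asarray(x_norm, dtype=float) * (x_max - x_min) + x_min

# Hypothetical half-hourly loads on the thousand scale.
load = [7000.0, 8500.0, 10000.0, 12000.0]
load_norm, lo, hi = min_max_normalise(load)
load_back = denormalise(load_norm, lo, hi)
```

In practice the extremes would be taken from the training set only, so that the test set is scaled with the same constants.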
Through fuzzification of the influencing factors, a mapping is established to overcome the obstacle of missing or differing units. The result of the fuzzy mapping is determined according to the Pearson correlation coefficient (PCC), which measures the degree of correlation between two variables [36]. The PCC between $n$-dimensional data $X$ and $Y$ is defined by Eq.(21):

$$r_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \, \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} \tag{21}$$

where $\bar{X}$ and $\bar{Y}$ are the mean values of $X$ and $Y$ respectively. PCC values range from −1 to 1, where the sign represents positive or negative correlation and the absolute value reflects the strength of the correlation. For clarity, the PCC between the load data and a selection of factors is shown in the heat map in Fig.7. In the first column, the absolute value of the PCC ranges from 0.11 to 0.73. The median absolute value of 0.42 is used as the boundary for correlation strength. In other words, if the absolute value of the PCC between the load and a factor is greater than 0.42, the factor is defined as strongly correlated, such as drybulb temperature, humidity and date type; otherwise, it is defined as weakly correlated, such as wind speed, precipitation and pressure.
Finally, the fuzzy mapping result is determined by the correlation strength and the sign of the correlation: a strongly correlated factor is mapped from its original physical scale to [0,1], and a weakly correlated factor is mapped to [0, 0.5]. For a positive correlation the mapping result is proportional to the original physical scale, and for a negative correlation it is inversely proportional. Once these two parameters are fixed, the mapping is numerically equivalent to normalization. This fuzzy mapping approach not only makes it possible to input influencing factors with different units or no units directly into the prediction model, but also, like normalization, avoids slow gradient descent. Part of the fuzzy mapping table of the influencing factors in this dataset is shown in Table.1.
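The PCC-based fuzzy mapping can be illustrated with a small sketch. It rests on several assumptions: the mapping is taken to be a continuous linear rescaling (the paper's Table.1 may use discrete segments instead), "inversely proportional" is interpreted as reversing the scale, and the function names and toy series are hypothetical.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two equally long series."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def fuzzy_map(factor, load, threshold=0.42):
    """Map a factor to [0, 1] if strongly correlated with load, else to [0, 0.5]."""
    factor = np.asarray(factor, dtype=float)
    r = pearson(factor, load)
    scaled = (factor - factor.min()) / (factor.max() - factor.min())
    if r < 0:
        scaled = 1.0 - scaled            # assumed handling of negative correlation
    upper = 1.0 if abs(r) > threshold else 0.5
    return scaled * upper, r

load_series = [1.0, 2.0, 3.0, 4.0, 5.0]
strong_pos, r_pos = fuzzy_map([10, 20, 30, 40, 50], load_series)
strong_neg, r_neg = fuzzy_map([5, 4, 3, 2, 1], load_series)
weak, r_weak = fuzzy_map([1, 3, 2, 3, 1], load_series)
```

After the mapping, a large value always means "pushes the load up", regardless of the factor's original unit or the sign of its correlation, and weakly correlated factors occupy only half of the range.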

B. DAILY LOAD CURVE CLUSTERING
Clustering technology can extract typical power consumption characteristics from enormous load data, and provides strong support for power grid companies in load forecasting and demand side management. A significant step of daily load curve clustering is to determine the number of clusters in the PSO-KFCM algorithm mentioned above. There are various methods to determine the number of clusters, for instance, within-cluster sum of squared errors (SSE), partitioning around medoids (PAM) and the gap statistic (GS) [37]. In this article, the SSE method is adopted. The formula of SSE is as follows:

$$SSE = \sum_{i=1}^{k} \sum_{x \in p_i} \left\| x - c_i \right\|^2 \tag{22}$$

where $c_i$ is the $i$-th clustering center and $p_i$ represents the set of data points in the $i$-th cluster. As the number of clusters $k$ increases, the sample partition becomes finer and the aggregation degree of each cluster gradually improves, so SSE naturally becomes smaller. Theoretically, the smaller the SSE value, the better the clustering effect. However, once $k$ increases beyond a certain point, further increases reduce SSE only slightly. Therefore, the $k$ value near the inflection point of the curve is normally the appropriate number of clusters. In this clustering experiment, the particle population size is set to $N = 100$, the maximum number of iteration steps to $M = 100$, the RBF kernel parameter to $\sigma = 150$ and the termination error to $\delta = 10^{-4}$; $k$ ranges from 1 to 10, and the SSE curve is shown in Fig.8. The most appropriate number of clusters is identified as 6, since beyond this point the curve shows no abrupt inflection and SSE approaches its minimum. After that, the algorithm clusters the preprocessed load data with 48 time steps. The daily load curves after PSO-KFCM clustering are shown in Fig.9, where the daily loads of different clusters are represented by curves of different colors.

VOLUME 9, 2021
To demonstrate the superiority of this algorithm, the FCM, KFCM and GA-KFCM algorithms are compared with the proposed method. Four internal indexes [38], [39] are adopted to evaluate the validity of clustering, as shown in Eq.(23)-Eq.(26).
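The within-cluster SSE used for choosing the number of clusters can be computed as in the sketch below. Hard labels are assumed (for a fuzzy method they could be taken as the cluster with maximum membership), and the four toy points are invented so that the values can be checked by hand.

```python
import numpy as np

def sse(X, labels, centers):
    """Sum of squared distances from each point to the center of its own cluster."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    return float(sum(((X[labels == i] - np.asarray(c, dtype=float)) ** 2).sum()
                     for i, c in enumerate(centers)))

X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# k = 1: a single center at the overall mean.
sse_k1 = sse(X_toy, [0, 0, 0, 0], [[5.0, 1.0]])
# k = 2: one center per pair of nearby points.
sse_k2 = sse(X_toy, [0, 0, 1, 1], [[0.0, 1.0], [10.0, 1.0]])
```

Going from one to two clusters drops the SSE from 104 to 4 in this toy case; plotting such values against `k` gives the curve on which the inflection point is sought.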
1) Silhouette coefficient (SC): The coefficient ranges over [-1,1], and the closer it is to 1, the better the clustering performance. $a_i$ is the average distance between sample $i$ and the other points in the same cluster, and $b_i$ is the minimum average distance from sample $i$ to the points of any other cluster.

$$SC = \frac{1}{n} \sum_{i=1}^{n} \frac{b_i - a_i}{\max(a_i, b_i)} \tag{23}$$

2) Davies-Bouldin index (DB): The DB index describes the distance between the clustering centers and the within-cluster divergence of the samples. The smaller the index, the better the clustering effect. $s_i$ represents the average within-cluster distance of the samples in cluster $i$, and $d_{ij} = \| c_i - c_j \|$ is the distance between the centers of clusters $i$ and $j$.

$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{s_i + s_j}{d_{ij}} \tag{24}$$

3) Calinski-Harabasz index (CH): The CH index is the ratio of between-cluster separation to within-cluster compactness. Thus, the larger the index, the better separated and more compact the clusters. $Tr(B_k)$ denotes the trace of the between-cluster dispersion matrix and $Tr(W_k)$ represents the trace of the within-cluster dispersion matrix.

$$CH(k) = \frac{Tr(B_k) / (k - 1)}{Tr(W_k) / (n - k)} \tag{25}$$

4) Krzanowski-Lai index (KL): The KL index can only be calculated for two or more clusters. To achieve the best clustering effect, the KL index should be as large as possible. $W_k$ is the sum of the squared distances from the points within each cluster to its clustering center, and $p$ is the number of features.

$$KL(k) = \left| \frac{DIFF(k)}{DIFF(k+1)} \right|, \quad DIFF(k) = (k-1)^{2/p} W_{k-1} - k^{2/p} W_k \tag{26}$$
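Two of the validity indexes are simple enough to sketch directly in NumPy. The sketch assumes hard cluster labels, takes cluster centers as the means of the assigned points, and interprets the within-cluster scatter in the Davies-Bouldin index as the average distance to the cluster center; the toy points and function names are invented for the example.

```python
import numpy as np

def calinski_harabasz(X, labels):
    """Ratio of between-cluster to within-cluster dispersion, each scaled by its degrees of freedom."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = X.shape[0]
    ids = np.unique(labels)
    k = len(ids)
    overall = X.mean(axis=0)
    tr_b = 0.0
    tr_w = 0.0
    for c in ids:
        pts = X[labels == c]
        center = pts.mean(axis=0)
        tr_b += len(pts) * ((center - overall) ** 2).sum()
        tr_w += ((pts - center) ** 2).sum()
    return float((tr_b / (k - 1)) / (tr_w / (n - k)))

def davies_bouldin(X, labels):
    """Average over clusters of the worst (s_i + s_j) / d_ij ratio."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    centers = np.array([X[labels == c].mean(axis=0) for c in ids])
    s = np.array([np.linalg.norm(X[labels == c] - centers[i], axis=1).mean()
                  for i, c in enumerate(ids)])
    k = len(ids)
    total = 0.0
    for i in range(k):
        total += max((s[i] + s[j]) / np.linalg.norm(centers[i] - centers[j])
                     for j in range(k) if j != i)
    return float(total / k)

X_val = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels_val = np.array([0, 0, 1, 1])
ch_val = calinski_harabasz(X_val, labels_val)
db_val = davies_bouldin(X_val, labels_val)
```

For these two tight, well-separated pairs the CH index is large and the DB index is small, which is the direction both indexes should move as clustering quality improves.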
The comparison indicators are displayed in Table.2 and Fig.10. The results show that the proposed method achieves the highest SC, CH and KL values and the lowest DB value, at 0.496, 1120.465, 1.575 and 1.001 respectively, indicating that its clustering validity is better than that of the other three methods. In addition, the clustering effect is enhanced by 31.9% compared with the traditional FCM algorithm in terms of SC. These comparison results verify the effectiveness of the proposed method.

C. PREDICTIVE MODELING
The dataset is divided into an 80% training set (January 01, 2006 to December 31, 2009, 1461 days) and a 20% test set (January 01, 2010 to December 31, 2010, 365 days). The training set is used for daily load curve clustering and for training the prediction model; the test set is used to determine the similar cluster and as the input to the trained model.
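The day counts and the split ratio quoted above can be checked with a few lines of date arithmetic; the variable names are chosen for this example only.

```python
from datetime import date

# Both end dates are inclusive, hence the + 1.
train_days = (date(2009, 12, 31) - date(2006, 1, 1)).days + 1
test_days = (date(2010, 12, 31) - date(2010, 1, 1)).days + 1
total_days = train_days + test_days

steps_per_day = 48                      # one sample every 0.5 h
total_rows = total_days * steps_per_day
train_fraction = train_days / total_days
```

The training period contains one leap year (2008), which is why four years give 1461 rather than 1460 days.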
In this paper, the default prediction model is multivariable and multi-step CNN-LSTM (MMCNN-LSTM). The overall network composition is distinctly expressed in Fig.11. Each time step has a total of 21 dimensions of load features, such as the current moment L t , the previous moment L t−1 , the first 2 moments L t−2 , the first 48 moments L t−48 of load, . . . , the current temperature T t , humidity H t , wind speed W t , . . . , electricity price P t etc. The input is a 48*21 matrix representing time steps and the number of load features. A double convolution layer with 128 convolution kernel of size 2 extracts features and the extracted time series are compressed by a max pooling layer of size 2. To avoid the model from overfitting, dropout layers are added with a probability of 0.1. Set the amount of LSTM hidden layer units to 200, since the load is only predicted one day ahead, the output displayed is 48 dimensions. These network parameters are continuously adjusted through predictive experiments before they are obtained.
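One way to express the stated hyperparameters in Keras is sketched below. It is not the authors' network: the ReLU activations, the placement of the two dropout layers, feeding the pooled sequence directly into the LSTM, and the Adam optimizer with MSE loss are all assumptions, and the function names are hypothetical. TensorFlow is imported inside the builder so that the shape helper can be used without it.

```python
def build_mmcnn_lstm(time_steps=48, n_features=21, horizon=48):
    """Assumed CNN-LSTM layout using the layer sizes reported in the text."""
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(time_steps, n_features)),
        layers.Conv1D(128, kernel_size=2, activation="relu"),
        layers.Conv1D(128, kernel_size=2, activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Dropout(0.1),
        layers.LSTM(200),
        layers.Dropout(0.1),
        layers.Dense(horizon),           # 48 half-hourly load values
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

def expected_sequence_length(time_steps=48, kernel_size=2, n_conv=2, pool_size=2):
    """Sequence length reaching the LSTM: unpadded convolutions, then non-overlapping pooling."""
    length = time_steps
    for _ in range(n_conv):
        length = length - kernel_size + 1
    return length // pool_size

lstm_input_length = expected_sequence_length()
```

Under these assumptions the 48-step input shrinks to 47 and 46 steps in the two convolutions and to 23 steps after pooling, so the LSTM reads a 23*128 feature sequence before the dense layer emits the 48 forecasts.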
To verify the superiority of the hybrid model proposed in this paper, three indexes are selected for prediction evaluation, namely root mean square error (RMSE), mean absolute error (MAE) and mean absolute percent error (MAPE) [40]. Let the original data be y and the predicted data be ŷ; the calculation formulas are given in Eq.(27)-Eq.(29), respectively.
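For reference, the three indexes can be computed as follows. The sketch uses the standard textbook definitions of RMSE, MAE and MAPE, which Eq.(27)-Eq.(29) are presumed to follow; the function names and the sample values are illustrative only.

```python
import math


def rmse(y, y_hat):
    """Root mean square error between actual y and predicted y_hat."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y, y_hat)) / len(y))


def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(y, y_hat)) / len(y)


def mape(y, y_hat):
    """Mean absolute percent error, in percent; actual values must be nonzero."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(y, y_hat)) / len(y)


# Illustrative values only, not taken from the dataset.
actual = [100.0, 200.0]
predicted = [110.0, 190.0]
print(rmse(actual, predicted), mae(actual, predicted), mape(actual, predicted))
```

RMSE and MAE carry the unit of the load, whereas MAPE is dimensionless, which is why MAPE is the index used for comparison across days and models.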

A. EXPERIMENT I
The predictive experiment platform is Jupyter Notebook, the frameworks are TensorFlow (GPU) and Keras, and the hardware is an NVIDIA Titan Xp GPU, an Intel(R) Xeon(R) E5-2620 CPU and 16 GB of RAM.
In the first experiment, the PSO-KFCM clustering model and the model without pretreatment were compared while keeping the other variables the same. Local 30-day data, a total of 30*48 time steps, were selected as observations. For the forecast without preprocessing, the observations are the last 24 days of the training set and the first 6 days of the test set. When the PSO-KFCM clustering data are fed into the prediction model, the observations are the last 24 days of a certain cluster in the training set and the first 6 days in the test set that are assigned to the similar cluster by cosine similarity. The clustering data are given in Section III(B), and the cluster labeled ''Cluster 5'' is selected in this investigation.
The two vectors used in the cosine similarity are the influencing factors of the predicted day and the average influencing factors within a cluster. The vector of the predicted day takes the form of daily maximum temperature T_max, minimum temperature T_min, average temperature T_avg, maximum humidity H_max, minimum humidity H_min, average humidity H_avg and so on; the within-cluster vector is the average over all days of the cluster in the same format. The similar cluster is determined by the maximum cosine similarity principle. The cosine similarities used to determine the similar cluster of two samples (two of the six days) are shown in Table.3. The two largest cosine similarities, 0.925 and 0.897, confirm that the two test samples belong to ''Cluster 5''.
The 6-day forecasts of the model without pretreatment and of the PSO-KFCM clustering model are depicted in Fig.12(a) and Fig.12(b), respectively. The data after PSO-KFCM clustering present a regular periodicity, which makes prediction easier. The locally enlarged views show that the prediction accuracy at peak and valley values in particular is greatly improved. The detailed RMSE, MAE and MAPE of the two methods are presented in Table.4(a) and Table.4(b), respectively. It should be noted that, within these six days, the maximum MAPE of the clustering model (0.83%) is lower than the minimum MAPE of the model without pretreatment (1.81%). On the whole, the three indexes of the model after pretreatment are 82.71, 61.86 and 0.67% on average, almost one-third of those of the unpreprocessed model. Compared with the unpreprocessed model, the average MAPE of the combined model drops by 1.51%, which verifies the superiority of the PSO-KFCM pretreatment and of determining similar clusters by cosine similarity.
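The similar-cluster determination used in this experiment can be sketched as follows: the influencing-factor vector of the predicted day is compared with the mean vector of every cluster, and the label with the largest cosine similarity is returned. The function names, the cluster labels and the factor values are hypothetical and serve only as an illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def most_similar_cluster(day_factors, cluster_means):
    """Return the label with maximum cosine similarity and all scores."""
    scores = {label: cosine_similarity(day_factors, mean)
              for label, mean in cluster_means.items()}
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Made-up fuzzified factors (e.g. T_max, T_min, T_avg, H_avg).
cluster_means = {
    "Cluster A": [0.9, 0.7, 0.8, 0.3],
    "Cluster B": [0.2, 0.1, 0.1, 0.9],
}
day = [0.8, 0.6, 0.7, 0.4]
label, scores = most_similar_cluster(day, cluster_means)
print(label, scores)
```

Because the cosine similarity depends only on the direction of the vectors, the factors should be brought to a comparable scale beforehand, which is the role of the fuzzy mapping in the preprocessing stage.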
In order to further observe the prediction accuracy, experiments II and III were carried out.

B. EXPERIMENT II
To verify the superiority of the multivariate and multi-step input approach under the same clustering logic, it is compared with three other input modes, namely univariate and single-step CNN-LSTM (USCNN-LSTM), univariate and multi-step CNN-LSTM (UMCNN-LSTM), and multivariable and single-step CNN-LSTM (MSCNN-LSTM). The univariate model relies only on the historical load data, without taking into account any influencing factors, let alone fuzzy mapping. Similarly, the single-step model predicts just one step at a time, and the insufficient outputs even result in overlapping predictions. To keep the other settings consistent, the optimizer is set to 'Adam', the learning rate to 0.001 and the maximum number of iterations to 100.
January 14, 2010 was selected as the prediction date. Using the above clustering data and the cosine similarity, the similar cluster is determined to be ''Cluster 6''. Fig.13 compares the four input modes on that day, and the detailed indexes are shown in Table.5 and Fig.14. It can be clearly observed that, with the output step size held constant, the multivariate models outperform the univariate models. In terms of MAPE, the MS model decreases by 0.59% relative to the US model, and the largest accuracy gain is in the MM model, a 0.7% decline compared with the UM model. This justifies the need to consider numerous influencing factors and fuzzy mapping. With the input variables held constant, comparing the US model with the UM model and the MS model with the MM model shows an improvement between 0.41% and 0.52% in terms of MAPE. Multi-step models are therefore preferred for their accuracy and their applicability to multi-time forecasting in the STLF domain.
Over the entire procedure, the time taken by the US model (training and prediction), the UM model (training and prediction), the MS model (clustering, training and prediction) and the MM model (clustering, training and prediction) is 20s, 22s, 30s and 33s, respectively. Although the present model spends more time on training and clustering than its counterparts, its higher accuracy better meets practical demands. The mean square error (MSE) loss curve of the MMCNN-LSTM model is shown in Fig.15. The training loss and the validation loss both decline and eventually stabilize. Together with the fitted curves, which are almost identical in magnitude and direction, this indicates good fitting ability and appropriate network parameter settings. In short, the results indicate that the proposed MMCNN-LSTM model is more capable of completing the STLF task than its counterparts in terms of both precision and adaptability.
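The difference between the single-step and multi-step modes compared in this experiment lies in how the training samples are cut from the series. The following minimal sketch, with a hypothetical helper `make_windows`, pairs the previous `n_in` values with the next `n_out` values; `n_out = 1` corresponds to the single-step mode and `n_out > 1` to the multi-step mode. A univariate series is used for brevity; in the multivariate case each element would be a feature vector instead of a scalar.

```python
def make_windows(series, n_in, n_out):
    """Pair every n_in consecutive values with the n_out values that follow."""
    samples = []
    for start in range(len(series) - n_in - n_out + 1):
        x = series[start:start + n_in]
        y = series[start + n_in:start + n_in + n_out]
        samples.append((x, y))
    return samples


series = list(range(10))  # stand-in for a load series

single_step = make_windows(series, n_in=4, n_out=1)
multi_step = make_windows(series, n_in=4, n_out=3)

print(single_step[0])  # first single-step sample
print(multi_step[0])   # first multi-step sample
```

In the setting described in this paper, the multi-step case would use 48 input steps and 48 output steps, so that one forward pass yields the whole next day, whereas a single-step model has to be applied repeatedly to cover the same horizon.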

C. EXPERIMENT III
In the first two experiments, the advantages of clustering preprocessing and the optimal input pattern have been demonstrated. In this step, the overall effect is verified by comparison with alternative unclustered DNN-based models. Multilayer perceptron (MLP), gated recurrent unit (GRU), bidirectional recurrent neural network (Bi-RNN), extreme gradient boosting (XGBoost) and conventional LSTM, all of which are influential in the domain of time series forecasting, were selected. The entire test set was predicted, so that a MAPE value is available for each day. Table.6 lists the detailed evaluation results of the six models in terms of maximum, minimum and average MAPE, and Fig.16 provides a visual representation of the MAPE distribution. Among these six groups of results, the single LSTM model and the GRU model perform poorly, with a maximum MAPE higher than 5%. The XGBoost model, which has become prevalent in recent years, reaches an average MAPE of 2.16%, a 20.9% improvement in accuracy over the traditional MLP model, and is second only to the proposed hybrid model. As can be seen, the proposed method produces better prediction results, with a decrease in average MAPE ranging from 0.82% to 3.31%. The results show that the proposed method significantly improves predictive accuracy in comparison with the unprocessed DNN baseline models. In addition, the difference between its maximum and minimum MAPE is only 1.92%, whereas those of the GRU and Bi-RNN models are 4.29% and 2.68%, respectively, which demonstrates the stability of its predictions as well.
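The statistics reported for each model (maximum, minimum and average MAPE, and the spread between maximum and minimum used above as a stability measure) can be derived from the list of daily MAPE values as sketched below. The function name and the sample values are illustrative and are not taken from Table.6.

```python
def summarize_daily_mape(daily_mape):
    """Summary statistics over a non-empty list of daily MAPE values (percent)."""
    highest = max(daily_mape)
    lowest = min(daily_mape)
    return {
        "max": highest,
        "min": lowest,
        "avg": sum(daily_mape) / len(daily_mape),
        "spread": highest - lowest,  # smaller spread means more stable forecasts
    }


# Made-up daily values for two hypothetical models.
model_a = [1.0, 1.5, 2.0]
model_b = [0.5, 2.5, 4.5]

print(summarize_daily_mape(model_a))
print(summarize_daily_mape(model_b))
```

Two models with a similar average can differ considerably in spread, which is why the average MAPE and the difference between maximum and minimum are reported side by side.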

V. CONCLUSION
Short-term load forecasting is a basic task in the daily operation of the power grid. This paper presents an STLF method based on PSO-KFCM daily load curve clustering and a CNN-LSTM model. This comprehensive technique takes historical load data and influencing factors (meteorology, date type, economy and others) into account, where the historical load data were normalized and the influencing factors were fuzzy mapped according to the Pearson correlation coefficient. The novel PSO-KFCM algorithm clustered the preprocessed daily load curves, which not only solved the problem of sensitivity to the initial clustering centers, but also greatly improved the clustering quality. The clustering experiment showed that the number of clusters was determined as 6 by the sum of squared error index and that the Silhouette coefficient improved by 31.9% over the conventional FCM algorithm. Besides the Silhouette coefficient, the Davies-Bouldin index, the Calinski-Harabasz index and the Krzanowski-Lai index were used as clustering validity indicators as well. The cosine similarity, which is mainly suited to multidimensional positive space, was selected as the bridge between the clustering labels and the prediction model. The multivariate and multi-step CNN-LSTM was used to predict the load for the next 24h in half-hourly steps, and its accuracy was verified by root mean square error, mean absolute error and mean absolute percent error. This hybrid model integrates the advantages of both networks: the feature extraction capability of CNN and the long time series processing capability of LSTM. Finally, in comparison with the model without clustering, three other input models and five DNN baseline models, the extensive results have confirmed the high precision, practicality and stability of the proposed model.
The PSO-KFCM method and the CNN-LSTM model proposed in this paper are not limited to short-term load forecasting, but can also be applied to other deep learning tasks, such as bearing fault diagnosis, signal pattern recognition and intelligent visual sorting.