Ultra-Short-Term Building Cooling Load Prediction Model Based on Feature Set Construction and Ensemble Machine Learning

As the requirements for the optimal control of building systems increase, the accuracy and speed of load predictions must also increase. However, the accuracy of load predictions is related not only to the prediction algorithm but also to the construction of the feature set. Therefore, this study develops a short-term building cooling load prediction model based on feature set construction. The impacts of four different feature set construction methods—feature extraction, correlation analysis, K-means clustering, and discrete wavelet transform (DWT)—on the prediction accuracy are compared. To ensure that the effect of the feature set construction method is universal, three different prediction algorithms are used. The influences of the sample dimension and prediction time horizon on the prediction accuracy are also analysed. The prediction model is developed based on the cubist ensemble learning algorithm, and its performance is improved when DWT is used to construct the feature set. Compared with other commonly used prediction models, the proposed model exhibits the best performance, with R-squared and CV-RMSE values of 99.8% and 1.5%, respectively.

INDEX TERMS Cooling load prediction, feature extraction, ensemble learning algorithms, discrete wavelet transform.

I. INTRODUCTION
The primary energy consumption of the construction industry accounts for 30%-40% of the total energy consumption globally [1]. With its recent rapid economic growth, building energy consumption in China has increased significantly. However, its energy efficiency remains low compared with that of most developed countries [2]. As the construction area in China increases, energy consumption and carbon dioxide emissions also continue to increase [3]. Moreover, the growth rate of building energy consumption is 3.7% [4]. Therefore, improvements in building energy efficiency have the potential to create immense energy and economic savings [5]. Reliable cooling load prediction results are the basis for optimised building operation strategies and are an effective means of improving operation efficiency [6], [7]. Moreover, embedding the load prediction model in a smart city architecture can support the development of sustainable cities [8]. Researchers have previously conducted extensive studies on building load prediction methods [9].

A. LOAD PREDICTION METHODS
Load prediction is a method of predicting the future cooling and heating loads of buildings based on the main factors that affect the building load, including information such as the outdoor environment and energy usage patterns. The commonly used cooling load prediction methods can be classified into three categories: white-box, grey-box, and black-box techniques. White-box methods calculate the building load using a detailed physical model consistent with the building performance. Grey-box methods use historical operating data to establish a simplified physical model that reflects the characteristics of the building load. White-box and grey-box models can provide accurate energy consumption and load prediction results only if detailed information describing the building is provided for the modelling. Moreover, because of the complex interactions among the model input features, the computational efficiency of grey-box models is low [10], [11]. Therefore, white-box and grey-box models are seldom used for the predictive control of building systems. Black-box models, also known as machine learning models, are more commonly used for building load prediction. These models can be run on the historical operational data of a building without taking the physical characteristics of the site into consideration [12]. The commonly used black-box models include regression analysis, artificial neural networks, and support vector machines [13].
Many researchers have studied the effectiveness of black-box models. Hanane et al. presented a comprehensive and detailed study on short-term load prediction in a district building using an artificial neural network (ANN) model; the best performance was obtained for predictions one hour ahead. Moreover, the load prediction results could reduce electricity costs and help shave the peak load in the district [14]. The ANN model can also perform well for mid-term daily peak load prediction [15]. Song et al. proposed a heating load prediction model based on a temporal convolutional network (TCN), which could effectively improve the prediction accuracy [16]. ANN and SVM models are also widely used for predicting electricity consumption [17], [18]. The long short-term memory (LSTM) model can produce more reliable predictions of the energy consumption of air-conditioning systems than the autoregressive integrated moving average (ARIMA) time series model and the back-propagation (BP) neural network model [19]. Ensemble learning is a combination of multiple learning algorithms that can achieve better predictive performance than the separate learning algorithms [20]. To illustrate the superiority of the ensemble prediction model in terms of prediction accuracy, Wang et al. compared the results of each individual prediction model (i.e., multilayer perceptron (MLP) network, ANN, and support vector regression (SVR)) with the result obtained using the combination of the three models [21]. Adeodato et al. used the median of 15 MLP predictions; as a result, the accuracy of multi-step predictions was improved compared with that of a single MLP [22]. Ngoc-Tri Ngo et al. developed an ensemble machine learning method to predict building cooling loads [23]; 4 single machine learning models and 23 ensemble machine learning models were trained and evaluated. The superiority of bagging ANNs was demonstrated, with a mean absolute percentage error (MAPE) and computing time of 6.17% and 4.98 s, respectively. These studies have demonstrated that ensemble learning algorithms are effective for load prediction. Al-Rakhami et al. proposed an ensemble learning model using the extreme gradient boosting (XGBoost) algorithm to avoid overfitting and build an efficient prediction model [24].

B. FEATURE SET CONSTRUCTION METHOD
For data-driven models, effective selection of the prediction algorithm and rational acquisition of the model input data are necessary to establish the building load prediction model [25].
Many feature selection methods have been applied in previous studies. For example, HVAC domain knowledge has been used as a basis to select easily accessible data on internal and external disturbances as input variables [26]. To improve the accuracy of load prediction, methods have been applied to enrich the input variable set of the prediction model; in some studies, the fluctuation values of variables, structural parameters, and degree days have been utilised as input variables [27], [28]. However, an excessive number of variables may increase the calculation time without improving the accuracy of the prediction results. Therefore, feature construction methods should be applied to select the variables that are the most effective for the prediction model. Sholahudin et al. applied the analysis of variance method to select variables related to the heating load, and the relevant variables were used as input features of the model [29]. Huang et al. used an associative classification (AC) algorithm to select input variables and construct an associated classifier for chiller fault diagnosis [30]. Kapetanakis et al. utilised the correlation coefficient method to analyse the relationships between the loads of different types of buildings in different regions and various indoor and outdoor variables; the results indicated that when the variables selected by the correlation coefficient method were used as input parameters, the prediction accuracy could be improved and the model complexity could be reduced [31]. Dimension reduction methods such as principal component analysis (PCA) have also been shown to improve the efficiency of load prediction models [32].
In addition, some studies have combined different feature selection methods to develop feature set construction procedures. Fan et al. employed four feature extraction methods to select model input parameters, ranging from the outdoor dry-bulb temperature to the dew-point temperature; the impact of the variables from the preceding 24 h on the load at the prediction moment was also considered [33]. Zhang et al. developed a systematic feature selection procedure for building energy forecasting consisting of three steps: raw data pre-processing, feature filtering, and a feature wrapper [34].
Although some research has been carried out on the construction of feature sets, some limitations remain to be addressed: (1) the effects of different feature selection methods on load prediction have not been compared, and (2) research on the compatibility between prediction algorithms and their input feature sets is limited. Consequently, this study develops a short-term building cooling load prediction model based on feature set construction. The proposed procedure has the following features: (1) different types of feature selection methods are used, and the effects of different combinations of these methods are compared; and (2) three typical ensemble learning algorithms are used to ensure the compatibility between the feature set construction procedure and the load prediction algorithms.

A. RESEARCH OUTLINE
The basic work flow of this study includes data collection, feature set construction, application of the prediction algorithm, and evaluation of the results. The objective of this study is the selection of an appropriate feature set that has a compatible structure with the prediction algorithm and can improve the prediction accuracy. The research outline is shown in Figure 1.
The tested building data were divided into three categories: meteorological data, historical indoor data, and historical load data. Seven data feature set construction methods were employed. To avoid the influence of the random partitioning of the dataset on the prediction output, a 10-fold cross-validation was applied to tune the parameters of the predictive model.

B. DATA DESCRIPTION
The dataset was collected from an office building in Tianjin, China. The office building is open on weekdays from 9:00 to 18:00 and has a total floor area of 8677 m². The "cooling season" test was conducted from 3 July 2017 to 18 August 2017. The data logging interval was 1 h, and the tests were performed only on weekdays.
The collected dataset can be categorised as outdoor meteorological data, indoor data, and load data. The outdoor meteorological data were recorded by weather stations, whereas the indoor data were recorded using sensors.
Outdoor meteorological data were measured by a weather station located on the rooftop of the studied building. The occupancy rate was measured using infrared counters installed at the entrance and exit of the building. The indoor air temperature and relative humidity were recorded using a data logger installed in the room. The historical load data were calculated based on the flow rate of chilled water and the temperature difference between the supply water and return water. The flow rate was measured using an ultrasonic flowmeter installed on the side pipe of the air conditioner. The water temperature was measured using a data logger installed in the water pipe. The accuracy of the instruments is summarised in Table 1.

C. FEATURE SET CONSTRUCTION METHODS
In this study, feature extraction, K-means clustering, correlation analysis, and the discrete wavelet transform were performed to collect feature information from the variables.

1) FEATURE EXTRACTION
Feature extraction is used to reduce model input data features, eliminate redundant and irrelevant information, generate new features with low dimensions, and reduce the model runtime. It can be utilised to enhance the ability of the prediction model and to obtain the nonlinear relationships in the data [35]. The performances of two feature extraction methods (PCA and t-distributed stochastic neighbour embedding (t-SNE)) were investigated in this study.
PCA can be used to reduce the dimensionality of a dataset while maximally representing the information it contains. It is a linear dimension reduction technique that uses linear combinations of the primitive multidimensional variables to form a new set of dimensionally reduced variables [36]. Assuming that m is the number of original variables, PCA yields m principal components, each corresponding to an eigenvalue λ_i. The ability of a principal component to explain the variance in the original dataset can be expressed by Eq. (1):

$$g_i = \frac{\lambda_i}{\sum_{j=1}^{m} \lambda_j} \quad (1)$$

Here, g_i represents the explained variance of principal component i, and m represents the number of principal components. To a certain extent, the eigenvalues can be regarded as impact indices of the principal components; by ranking the components according to their eigenvalues, the most influential ones can be selected. The principal components whose eigenvalues are greater than 1 should be selected as the model input, and the cumulative explained variance of all the selected components should be at least 80% [25], [37].

In contrast, t-SNE is a nonlinear dimensionality reduction technique that is extremely suitable for visualising high-dimensional data. The t-SNE algorithm is an unsupervised machine learning algorithm that implements dimensionality reduction by constructing similar probability distributions over both high-dimensional and low-dimensional objects. The algorithm consists of two phases. First, a probability distribution is constructed over pairs of high-dimensional objects such that similar objects have a higher probability of being selected. Second, a probability distribution over the objects is constructed in a low-dimensional space such that the two probability distributions are as similar as possible. A simple implementation of the t-SNE algorithm is provided in Reference [35].
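A minimal sketch of this selection rule in R (`X` is an assumed matrix of numeric input variables; all object names are illustrative, not from the paper):

```r
# PCA-based feature construction sketch: keep components with
# eigenvalue > 1, then enforce >= 80% cumulative explained variance.
pca <- prcomp(X, center = TRUE, scale. = TRUE)

eigenvalues <- pca$sdev^2                      # lambda_i
g           <- eigenvalues / sum(eigenvalues)  # g_i, as in Eq. (1)

n_keep <- max(1, sum(eigenvalues > 1))         # eigenvalue > 1 rule
while (cumsum(g)[n_keep] < 0.80) n_keep <- n_keep + 1
X_pca <- pca$x[, 1:n_keep, drop = FALSE]       # reduced model inputs

# A nonlinear t-SNE alternative via the Rtsne package:
# library(Rtsne); X_tsne <- Rtsne(X, dims = 2)$Y
```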
2) CORRELATION ANALYSIS
In this study, the Spearman rank correlation coefficient, r_s, was used to calculate the correlation between variables, as defined in Eq. (2):

$$r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \quad (2)$$

where d_i is the difference between the ranks of x_i and y_i, x_i denotes a data point in sample X, y_i denotes a data point in sample Y, and n represents the number of samples of variables X and Y. The absolute value of the correlation coefficient is at most 1, and values approaching 1 indicate a higher degree of correlation. A value of 0.2 was selected as the lower limit for the correlation between variables [36].
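A minimal sketch of this filter in R (assuming a data frame `features` of candidate input variables and a vector `load` of measured cooling loads; both names are illustrative):

```r
# Correlation-analysis (CA) filter: keep variables whose Spearman
# correlation with the load reaches the 0.2 lower limit.
r_s <- sapply(features, function(v) cor(v, load, method = "spearman"))
features_kept <- features[, abs(r_s) >= 0.2, drop = FALSE]
```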

3) K-MEANS CLUSTERING
K-means clustering was adopted to classify the input information and categorise the data by adding tags, dividing n points into k clusters, where n is the number of sample instances. K-means clustering was employed so that a large number of input values could be classified quickly.
This clustering approach has three steps:
a. The value of K corresponding to the maximum of Gap_n(k) is taken as the optimal cluster number k, and k variables are randomly selected from all the variables as the centroids of the initial k clusters [37].
b. The Euclidean distance between each of the k centroids and the other variables is determined, and each variable is associated with its nearest centroid; the centroids of the new k clusters are then calculated.
c. Step b is repeated until the sum of the squared errors (SSE), given in Eq. (3), reaches a minimum.

$$\mathrm{SSE} = \sum_{r=1}^{k} \sum_{x_i \in C_r} \lVert x_i - \mu_r \rVert^2 \quad (3)$$

$$W_k = \sum_{r=1}^{k} \frac{1}{2n_r} \sum_{x_i, x_j \in C_r} \lVert x_i - x_j \rVert^2 \quad (4)$$

In Eq. (3), C_r represents a given cluster, which contains n_r points, and μ_r is its centroid. In Eq. (4), the within-cluster dispersion W_k is used to determine the optimal number of clusters via the gap statistic, Gap_n(k) = E*_n[log W_k] - log W_k.
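A minimal sketch of steps a-c in R (assuming the `cluster` package for the gap statistic; `X` and the parameter values are illustrative):

```r
# Gap statistic selects k (step a), then k-means tags each sample.
library(cluster)   # provides clusGap()

gap <- clusGap(X, FUNcluster = kmeans, K.max = 10, B = 50, nstart = 25)
k   <- which.max(gap$Tab[, "gap"])          # step a: k maximising Gap_n(k)

km <- kmeans(X, centers = k, nstart = 25)   # steps b-c: iterate to min SSE
X_tagged <- cbind(X, cluster = km$cluster)  # cluster tag as a new feature
```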

4) DISCRETE WAVELET TRANSFORM
Wavelet transforms are powerful data processing methods and can be divided into continuous wavelet transforms (CWT) and discrete wavelet transforms (DWT) [38]. They have been successfully applied to building energy prediction in previous studies [39]. DWT methods utilise decomposition techniques to reduce the noise in the original load series, produce relatively stable and easy-to-model series, and facilitate the extraction of hierarchical features of the load data. The cooling load data were regarded as a signal sequence. Through wavelet decomposition, the load signal was decomposed into high- and low-frequency bands. In this study, the D4 wavelet basis function was selected [40]. The prediction algorithm was used to construct prediction models for the high- and low-frequency bands, and corresponding prediction values were obtained for each frequency band [41]. The predicted load signal was then reconstructed using the reverse process.
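A minimal sketch of this decompose-predict-reconstruct scheme in R (assuming the `wavelets` package, an hourly `load` vector, and hypothetical per-band prediction vectors `pred_low`, `pred_high1`, and `pred_high2`; a two-level transform yields three bands in total, as used later in the experiments):

```r
# DWT with the D4 basis: split the load series into one low-frequency
# and two high-frequency coefficient bands.
library(wavelets)

dec <- dwt(load, filter = "d4", n.levels = 2, boundary = "periodic")
low_band   <- dec@V[[2]]   # approximation (low-frequency) coefficients
high_band1 <- dec@W[[1]]   # level-1 detail (high-frequency) coefficients
high_band2 <- dec@W[[2]]   # level-2 detail coefficients

# One prediction model is fitted per band; the predicted coefficients
# then replace the originals before the inverse transform.
dec@V[[2]] <- pred_low
dec@W[[1]] <- pred_high1
dec@W[[2]] <- pred_high2
load_hat <- idwt(dec)      # inverse DWT reconstructs the predicted load
```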

D. ENSEMBLE MACHINE LEARNING METHODS
Three typical prediction algorithms were chosen to develop the predictive models: the random forest (RF), gradient boosting machine (GBM), and cubist methods. Ensemble learning algorithms are among the most accurate machine learning algorithms. At present, there are two types of ensemble learning algorithms: bagging-based and boosting-based. The RF algorithm is of the former type, whereas the GBM and cubist algorithms are based on the latter.

1) RANDOM FOREST
The basic idea of RF is the construction of several decision trees to form a forest. The RF algorithm involves two stochastic processes: 1) the training samples are randomly generated by bootstrap resampling of the original samples; and 2) when each tree is built, the splitting variable is chosen as the optimal variable among a randomly selected subset of candidate input variables [42]. Through these two stochastic processes, RF methods can largely avoid overfitting. The advantage of the random forest algorithm is its good performance on high-dimensional data; its disadvantage is that it may ignore correlations among variables.
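A minimal sketch making the two stochastic processes explicit in R (assuming the `randomForest` package and hypothetical `train_x`/`train_y`/`test_x` objects; the parameter values are illustrative, not the tuned settings of Table 3):

```r
library(randomForest)

rf <- randomForest(
  x = train_x, y = train_y,
  ntree = 500,                       # 1) bootstrap-resampled trees
  mtry  = floor(ncol(train_x) / 3)   # 2) random candidate variables
)                                    #    considered at each split
pred_rf <- predict(rf, newdata = test_x)
```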

2) GRADIENT BOOSTING MACHINE
The main idea of the GBM is to build each new base learner along the gradient descent direction of the loss function of the previously established ensemble. The method reduces the loss function of the entire model by integrating base learners, thereby continuously improving the model [43]. The advantage of the GBM is its fast calculation speed.
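A minimal sketch of this idea in R (assuming the `gbm` package and a hypothetical data frame `train_data` with a `load` column; parameter values are illustrative):

```r
# Each new tree is fitted along the gradient of the squared-error loss
# of the current ensemble.
library(gbm)

gbm_fit <- gbm(
  load ~ ., data = train_data,
  distribution = "gaussian",   # squared-error loss for regression
  n.trees = 1000,              # number of base learners added stepwise
  interaction.depth = 3,       # size of each base tree
  shrinkage = 0.05             # learning rate along the gradient
)
pred_gbm <- predict(gbm_fit, newdata = test_data, n.trees = 1000)
```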

3) CUBIST ALGORITHM
The cubist algorithm is based on the M5 decision tree algorithm. It is equivalent to adopting a piecewise multivariate linear function and makes predictions based on a series of input variables. Therefore, the choice of variables is critical in the modelling process. Increasing the number of related variables can improve the prediction accuracy, whereas eliminating redundant variables reduces the memory occupied by the program and thus improves the computational efficiency. The efficient predictive performance of this method has achieved good results in areas such as geospatial big data analysis [44]. The cubist algorithm is a regression prediction model based on rules and instances; its working principle is presented in Figure 2.
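A minimal sketch of the rule-plus-linear-model structure in R (assuming the `Cubist` package and the same hypothetical training objects; the `committees` and `neighbors` values are illustrative, not the tuned settings of Table 3):

```r
library(Cubist)

cb <- cubist(x = train_x, y = train_y, committees = 10)
summary(cb)   # prints the rules and their associated linear models
pred_cubist <- predict(cb, newdata = test_x, neighbors = 5)
```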

IV. MODEL DEVELOPMENT

A. EVALUATION METRICS
The prediction accuracy and runtime were the two types of metrics used to assess the performance of the models. To evaluate the prediction accuracy, the mean squared error (MSE), coefficient of variation of the root-mean squared error (CV-RMSE), and R-Squared metrics were employed.

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

$$\mathrm{CV\text{-}RMSE} = \frac{1}{\bar{y}}\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \times 100\%$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$

where y_i is the measured load, ŷ_i is the predicted load, ȳ is the mean of the measured loads, and n is the number of samples.
When CV-RMSE is less than 30%, the calibrated prediction model is considered to approach the actual value [45]. Runtime refers to the length of time from the start of the load prediction algorithm to obtaining the prediction result.
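A compact check of the three metrics in R (a minimal sketch; `y` and `y_hat` are assumed vectors of measured and predicted loads):

```r
# Evaluation metrics, following the definitions above.
mse     <- mean((y - y_hat)^2)
cv_rmse <- sqrt(mse) / mean(y) * 100                      # in %
r2      <- 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)
```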

B. FEATURE SET CONSTRUCTION
Among the data, 80% were used as the training set, and 20% were used as the validation set. The validation set was utilised to verify the prediction accuracy of the model and calculate the corresponding evaluation indicators.
In addition to the data collected during the test, derived variables such as historical data and fluctuation variables were also included in the feature set. The thermal inertia of the building envelope was considered, and historical data were therefore selected as derived variables. Because the data fluctuate, the derived variables FOG and MSG were proposed to represent the changes in each variable over the preceding five hours [46], [47], where F(i) denotes the value of variable F at moment i. To predict the cooling load accurately, these fluctuation values can be used to reflect the change in the cooling load, as illustrated in the sketch below. The composition of the twelve feature sets is summarised in Table 2.
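The defining equations of FOG and MSG follow Refs. [46], [47] and are not reproduced above; the R sketch below therefore uses stand-in definitions (a one-step gradient and a mean slope over the preceding five hours) that are assumptions for illustration only, with `v` denoting the hourly series of a variable F:

```r
# Hypothetical fluctuation features; replace with the definitions
# from Refs. [46], [47] in an actual implementation.
lag_k <- function(v, k) c(rep(NA, k), head(v, -k))

FOG <- v - lag_k(v, 1)        # assumed: first-order (hourly) change
MSG <- (v - lag_k(v, 5)) / 5  # assumed: mean slope over 5 h
```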

C. EXPERIMENT SETUP
To evaluate the load prediction performance of the feature set construction procedure and the prediction algorithms for different sample dimensions and time horizons, the experiments were divided into three parts: Experiment I, Experiment II, and Experiment III. Experiment I compares the performance of different feature set construction methods: FS1-FS8, which were developed from the raw data with 914 dimensions using different feature set construction methods, are used to develop prediction models, and the prediction results of the models are compared to evaluate the feature set construction methods. In Experiment II, the impacts of the sample dimension of the feature set are compared: to analyse the influence of the three feature set construction methods on the prediction performance for samples of different dimensions, the three methods were applied to both large and small samples, and their effects were compared in terms of the resulting prediction accuracy and model runtimes. In Experiment III, the prediction model was evaluated over 24 prediction horizons, from 1 to 24 h ahead, to determine the time horizon at which the prediction model performs best.

D. MODEL PARAMETER SELECTION
The caret package was used in this study. The model parameters were optimised via a 10-fold cross-validation. Thereafter, the model with the best parameters was utilised for the cooling load prediction. The parameter optimisation results are given in Table 3.
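A minimal sketch of this setup in R (the caret package is named above; the tuning-grid values and data objects are illustrative assumptions, with the actually selected parameters reported in Table 3):

```r
# 10-fold cross-validated parameter search for the cubist model.
library(caret)

ctrl <- trainControl(method = "cv", number = 10)
fit <- train(
  x = train_x, y = train_y,
  method    = "cubist",
  trControl = ctrl,
  tuneGrid  = expand.grid(committees = c(1, 10, 50),
                          neighbors  = c(0, 5, 9))
)
fit$bestTune   # parameters selected by cross-validation
```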

V. RESULTS AND DISCUSSION
To evaluate the load prediction performance of the feature set construction methods and prediction algorithms for different sample dimensions and time horizons, the experiments were divided into three parts. Experiment I compared the influence of different feature set construction methods on the prediction accuracy. Experiment II compared the ultra-short-term cooling load prediction performance between large and small sample dimensions. Experiment III compared the prediction performance at different time horizons.

A. EXPERIMENT I: INFLUENCE OF DIFFERENT FEATURE SET CONSTRUCTION METHODS ON PREDICTION ACCURACY
To compare the performance of the different feature set construction methods, feature sets FS1-FS8 were used to develop prediction models. The prediction accuracy and model runtime are two important aspects that should be considered when evaluating the quality of a model. Table 4 summarises the three prediction-accuracy metrics and the runtime for each feature set using the different prediction algorithms. The models were run on a Windows system with a 2.50 GHz Core i5-7300HQ CPU and 16 GB of RAM.
It is evident that for each prediction algorithm, the models trained using FS5 have the best prediction accuracy. FS5 was constructed using DWT based on the original feature set, FS1. Thus, it can be concluded that the performance of the prediction model can be improved substantially by using the DWT feature set construction method. In terms of the runtime, the model developed based on FS5 has the longest runtime, because FS5 contains three frequency bands and has the highest dimension.
The models using the PCA and t-SNE feature extraction methods have the worst prediction accuracy, even below that of the model without any feature set construction. However, from another aspect, because these methods reduce the dimension of the data, the model training time is substantially reduced.
Comparing the models based on the three algorithms and feature sets FS1, FS4, FS5, and FS6 shows that using CA reduced the feature set dimension; however, this was not conducive to improving the prediction accuracy. One possible reason is that the selected correlation coefficient limit (0.2) in the CA was not suitable for cases with many input variables, which may have resulted in a loss of information.
Among the feature sets, FS2 and FS3 had the shortest runtimes. The GBM was the fastest ensemble learning method, with a runtime approximately one-tenth those of the other ensemble learning algorithms; the RF and cubist algorithms were the slowest. The computation time was therefore observed to depend on the dimension of the feature set and on whether the load sequence was hierarchically decomposed.
By comparison, the DWT feature set construction method can significantly improve the prediction accuracy. DWT divides the load data into different sequences according to their frequency characteristics: the low-frequency part represents the basic load driven by the weather, whereas the high-frequency part indicates load fluctuations caused by frequently changing factors such as building occupant behaviour. In this way, a more detailed relationship between the input parameters and the cooling load is established, resulting in more accurate predictions.

The CA method eliminates input variables that are not strongly related to the load data, as these variables may have a negative effect on the prediction accuracy. However, the choice of the CA threshold also affects its effectiveness: when the threshold is high, too much useful information may be eliminated, and when it is low, too much invalid information may be retained. Therefore, CA is not an ideal method for selecting effective variables.

The PCA and t-SNE methods reorganise the input variables and select the part that has the greatest impact on the load data; these methods can select the variables that are effective for load prediction more efficiently. The K-means method divides different combinations of input variables into classes and uses the class tag as a new input variable. However, a single class-label input variable is not enough to affect the load prediction algorithm; therefore, the K-means method has little effect on the accuracy of the load prediction.
The ensemble algorithms achieved optimal performance with feature sets FS5, FS6, and FS8. A possible explanation is that these algorithms must import the hidden information behind the raw data to achieve satisfactory performance; hence, it was necessary to apply DWT, CA, or K-means clustering to reveal this hidden information. Figure 3 shows the predicted and actual cooling load curves for the cubist algorithm. As presented in Figure 3, feature sets FS5, FS6, and FS8 yielded satisfactory prediction performance and high degrees of fit to the original load sequence; moreover, their frequency characteristics were similar to those of the original load sequence. Figure 3 also shows that the absolute relative error (ARE) fluctuated considerably every 8 h. Considering the times of the fluctuations, large fluctuations usually occurred during commuting and lunch break periods. At such moments, the energy consumption behaviour of each person varies considerably; thus, there are relatively significant differences between the predicted and actual load values, which is consistent with actual situations. This demonstrates that the cooling load fluctuation is related not only to the objective environment, but also to the subjective activities of people.

B. EXPERIMENT II: INFLUENCE OF THE SAMPLE DIMENSION ON THE PREDICTION ACCURACY
To analyse the influence of the three feature set construction methods on the prediction effects of samples of different dimensions, three methods were applied to large and small samples. The effects of the feature set construction method were compared in terms of the resulting prediction accuracy and model runtimes.

1) DWT
To compare the effects of the DWT method, feature sets FS1 and FS5, which were developed based on large samples with 914 dimensions, together with feature sets FS9 and FS10, which were developed based on small samples with 32 dimensions, were used to train different prediction models. The resulting prediction accuracies and runtimes are shown in Figure 4.
The results show that the prediction accuracy varies with the prediction algorithm. When RF is used, the models trained using the small samples have better prediction accuracy than those trained using the large samples, and the runtimes for the small samples are much shorter; therefore, the small samples are more suitable for RF. For GBM, the accuracy and runtime for the large samples do not differ much from those for the small samples. For the cubist algorithm, the large samples perform better than the small samples; the accuracy of the model trained using large samples is comparable to, or even higher than, the best accuracies of the other two algorithms. Although the runtime is slightly longer than when the small samples are used, it is still within an acceptable range. To compare the effect of applying DWT to large and small samples, the increase rate (IR) is used to evaluate the degree of improvement in the prediction accuracy. The IRs for the different samples are shown in Figure 5.
The IR refers to the degree by which the prediction accuracy of the feature set constructed using the DWT method (FS5 or FS10) is improved compared with that of the feature set constructed without DWT (FS1 or FS9); a concrete formulation is sketched below. For instance, the IR of the large sample refers to the improvement in the prediction accuracy of FS5 compared with that of FS1.
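One consistent way to express the IR (an assumed formulation based on the description above, not an equation given explicitly in the paper) for an accuracy metric $M$ in which larger values are better, such as R-squared, is:

$$\mathrm{IR} = \frac{M_{\mathrm{DWT}} - M_{\mathrm{base}}}{M_{\mathrm{base}}} \times 100\%$$

For error metrics such as MSE and CV-RMSE, where smaller values are better, the numerator is reversed ($M_{\mathrm{base}} - M_{\mathrm{DWT}}$) so that a positive IR always denotes an improvement.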
The results indicate that the wavelet transform contributes to an enhancement of the prediction accuracy, and the effect of applying DWT to large and small samples varies. For GBM, the degree of improvement is greater for small samples. However, for the cubist and RF algorithms, the result is the opposite, and the degree of improvement is greater for large samples. From the perspective of the evaluation metrics, when DWT is applied to large samples, all the metrics are improved. However, the R-squared value is reduced in all the models using small samples. In general, DWT has a better effect on improving the accuracy of models with large samples.
Among the ensemble learning algorithms, cubist-DWT based on the large samples yielded the best prediction accuracy, with a CV-RMSE of 1.5% and an R-squared value of 99.8%. The runtimes of the models based on DWT were approximately three times those of the models without DWT; this is mainly attributed to the three separate prediction processes performed on the three load frequency bands produced by the DWT.

2) CA-DWT
In this section, the CA-DWT feature set construction method is discussed. FS1 and FS9 are used as raw data without the application of any feature set construction method, whereas feature sets FS6 and FS11 are constructed using CA-DWT based on the large samples in FS1 and the small samples in FS9, respectively. Figure 6 compares the results of the different evaluation metrics for the large and small samples with CA-DWT. There is no significant difference in the prediction accuracy between the large and small samples. For RF and GBM, the accuracy of the model trained using FS11 is slightly higher than that of the model trained using FS6. For the cubist algorithm, the model trained with the large samples has better prediction accuracy than the model trained with the small samples. Moreover, for all the prediction algorithms, the small samples exhibit clear advantages over the large samples in terms of the runtime. Figure 7 shows the IRs after performing CA-DWT on the large and small samples in terms of the three evaluation metrics.
It is evident that in most cases, applying CA-DWT decreases the prediction accuracy compared with using the raw data; only for the models trained on large samples with RF and GBM does applying CA-DWT improve the effectiveness of the models. Compared with the IRs for the application of DWT alone, it can be concluded that the prediction accuracy decreases when CA is used in the feature set construction. Comparing the results of feature sets FS1, FS6, FS9, and FS11, the cubist algorithm based on the large samples yielded the best accuracy, with a CV-RMSE of 6.1%. Among the ensemble learning algorithms, RF and GBM achieved their best accuracies with FS11, whereas the cubist algorithm achieved its best accuracy with FS1. In terms of the runtimes, the models based on CA-DWT were generally faster than those based on DWT. From the perspective of engineering applications, when the accuracy requirements are not extremely high, CA-DWT can be considered more suitable than DWT for reaching the target.

3) K-MEANS CLUSTERING
To compare the effect of applying K-means clustering to large and small samples, feature sets FS1, FS8, FS9, and FS12 are considered. The results for the evaluation metrics of the models trained using these feature sets are shown in Figure 8.
It can be clearly observed that for most cases, the prediction results obtained using K-means clustering are very similar to those obtained without K-means clustering in terms of the prediction accuracy and runtime. This is also confirmed by the IRs for different samples, as shown in Figure 9.
K-means clustering is a process of information addition; thus, the small samples are more sensitive to it. However, in most cases, K-means clustering may adversely influence the prediction results. For the large samples, because the increase in information from K-means clustering was not as significant as that from DWT, the accuracy IR of the large samples was lower.
Based on the effects of the three above methods on large and small samples, the following conclusion can be drawn: the DWT feature set construction method can improve the prediction accuracy for both large and small samples at the cost of increased runtimes, and this effect is more pronounced for large samples.
In summary, for all three prediction algorithms, the DWT feature set construction method can improve the prediction accuracy. The CA-DWT method can reduce the calculation time at the cost of a certain reduction in accuracy. K-means clustering has no significant effect on the prediction accuracy or runtime. Moreover, among the three prediction algorithms, the cubist algorithm provides the best overall performance with the highest prediction accuracy, whereas the GBM provides the shortest runtime.

C. EXPERIMENT III: INFLUENCE OF THE TIME HORIZON ON THE PREDICTION ACCURACY
Experiment II investigated the performance of the DWT ensemble learning models for cooling load prediction 1 h ahead. In Experiment III, the DWT ensemble learning models were evaluated over 24 prediction horizons, from 1 to 24 h ahead. Figure 10 shows the R-squared values of the three algorithms at the different time horizons; from these values, the time horizon at which the prediction accuracy drops most sharply can be identified, and the reduction ratio of the R-squared metric is calculated.
Figure 10 shows that as the prediction time horizon increased, the reduction ratio of the prediction accuracy decreased. Among the 24 horizons, the R-squared reduction ratios between the 1 h and 2 h horizons and between the 2 h and 3 h horizons were the largest; this indicates that predictions 1 h or 2 h ahead are the most reliable. (Note: the R-squared reduction ratio is the ratio of the R-squared reduction between two adjacent time points.) As shown in Figure 10, the R-squared values gradually decreased: with increasing prediction horizons, the noise disturbance caused by the multiple decomposition of the load sequence was gradually amplified, and the prediction performance was degraded accordingly. If an R-squared value of 80% is taken as the benchmark, a sharp decrease in prediction accuracy is observed for predictions beyond 2 h ahead. The fluctuations of the prediction accuracy of the other algorithms were large at short horizons; this indicates that the prediction accuracy decreased before becoming stable.
Combined with Table 4, it can be concluded that the cubist algorithm is advantageous for predictions 1 h ahead. The GBM can achieve a certain degree of prediction accuracy in the shortest time. Figure 10 also shows that RF exhibited stable performance over the 24 h prediction horizon.

D. SHORT-TERM BUILDING COOLING LOAD PREDICTION MODEL
Based on the aforementioned experiments, a short-term load prediction model is constructed according to the cubist prediction algorithm with DWT as the feature set construction method. To illustrate the performance of the proposed prediction model, the prediction results of the model are compared with those obtained using some common algorithms, including penalised linear regression (GLMNET), support vector machine (SVM), classification and regression tree (CART), and k-nearest neighbours (KNN). The feature set is constructed using DWT with large samples. The evaluation metrics for all of the prediction results are shown in Figure 11.
It can be concluded from Figure 11 that the prediction accuracy of the cubist algorithm is the highest, with an MSE of 5, a CV-RMSE of 1.5%, and an R-squared value of 99.8%. The GLMNET and CART algorithms can also meet the prediction accuracy requirements. The prediction accuracy of the SVM and KNN algorithms is low; these algorithms cannot provide accurate prediction results.

VI. CONCLUSION
This study analysed the potential of ensemble learning models for predicting the cooling load of office buildings from the perspective of the feature set construction and the selection of prediction algorithms. Based on the results, the following conclusions can be drawn.
(1) In terms of pre-processing, the results showed that the model based on a feature set constructed by applying DWT yielded a significant improvement in prediction performance. When DWT was combined with the CA method to construct the feature set, the CA-DWT feature set could meet the 30% CV-RMSE limit. The dataset constructed by combining DWT and K-means clustering achieved a higher prediction accuracy IR with small samples. CA and CA-DWT can be used in cases where the accuracy requirement is not extremely high, but a high operational speed is necessary.
(2) Among the different prediction horizons, predictions 1 h and 2 h ahead were the most reliable. As the prediction horizon increased, the fluctuation of the R-squared reduction ratios for the three algorithms did not exceed 15.5%. Reliable cooling load predictions 2 h ahead can enable the development of building operation strategies.
(3) Among the prediction methods, the ensemble learning algorithms provided the best performance, and practically all cases could provide engineering-acceptable prediction accuracies when using the feature sets constructed with DWT. The cubist algorithm had the best performance, with R-squared and CV-RMSE values of 99.8% and 1.5%, respectively.
The framework presented in this study is flexible and scalable. This study discussed only prediction models that combine some typical feature set construction methods with ensemble learning algorithms; other algorithms may provide better prediction results. Further research will be conducted to improve the data mining framework of the building automation system. In addition, more advanced prediction algorithms will be explored, and more appropriate methods will be integrated into the framework to improve the prediction performance of the model.