A Novel Multi-Model Stacking Ensemble Learning Method for Metro Traction Energy Prediction

Metro traction energy prediction is the basis of abnormal monitoring and plays an indispensable role in the planning and operation of the metro system. However, current studies rarely offer satisfactory prediction performance. To improve the prediction accuracy, a novel prediction method for metro traction energy consumption is proposed based on the Wasserstein generative adversarial network with gradient penalty (WGAN-GP) and stacking ensemble learning with multi-model integration. Firstly, to obtain sufficient training data, WGAN-GP is used to generate characteristic data of traction energy consumption. Then, several algorithms, namely BP, SVM, ELM, and XGBoost, are employed to preliminarily capture the relationship between traction energy consumption and its characteristic data via K-fold cross-validation. Thereafter, the XGBoost algorithm is used as the meta model to construct a stacking ensemble learning prediction model. Finally, the proposed method is verified with data from Guangzhou Metro Line 13, and the results substantiate the effectiveness of the prediction model.


I. INTRODUCTION
In 2022, China's rail transit mileage exceeded 9000 km, and Shanghai, Beijing, Guangzhou, Chengdu, and Wuhan ranked as the top five cities in the world in terms of subway mileage. The rapid development of metro rail transit has brought the challenge of high energy consumption. To achieve the carbon peak and neutrality targets, it is necessary to reduce the energy consumption of the metro system. In the metro system, 40%∼60% of the total energy is consumed by traction [1]. Therefore, accurate traction energy consumption prediction is conducive to reducing the total energy consumption in the metro system. Usually, the abnormal monitoring of traction energy consumption is realized by analyzing the error between the predicted and the actual value [2]. Accurate prediction of traction energy consumption therefore provides a basis for abnormal monitoring [3], [4]. In addition, the energy optimization of subway train operation also relies on the predicted traction energy consumption [5], [6], [7].
Current prediction works on metro traction energy consumption are mainly based on historical data via either the physical model [8] or the statistical model [9]. The physical model establishes the dynamics model of the metro train through force analysis. The calculation process is complex and inaccurate, especially when multiple trains run on the line [10]. The statistical model is mainly based on a machine learning regression model that can quickly evaluate the traction energy consumption [11]. For example, Lü et al. [12] proposed a prediction model based on support vector regression and random forest regression. This model can accurately describe the relationship between traction energy consumption and related influencing factors with large-scale operation data. Tang et al. [13] established the prediction model of migration energy consumption and total energy consumption. By binary linear regression fitting and support vector regression with two years' operation data, the model can achieve high accuracy on multiple lines. However, all the above studies are based on existing lines and require a large amount of historical operation data, making them difficult to apply directly to new lines with short operation times. Moreover, the existing studies are all based on a single prediction model. The learning ability of a single prediction model is limited, and the complex relationship in the feature data of traction energy consumption cannot be fully extracted [14]. This makes accurate predictions hard to achieve.
The associate editor coordinating the review of this manuscript and approving it for publication was Giovanni Pau.
To obtain a better prediction model, ensemble learning can be used to train different prediction models and combine them with an appropriate combination method [15]. Ensemble learning methods can integrate different prediction models or prediction models of the same type. The commonly used ensemble learning methods are Stacking [16], Boosting [17], and Bagging [18]. They have shown excellent performance in load prediction and other fields. For instance, Dong et al. [19] proposed a wind power prediction method based on Stacking ensemble learning and achieved higher accuracy and stability than single prediction models. Al-Hajj et al. [20] proposed a global solar short-term prediction model with integrated stacking learning, and compared and analyzed multiple stacking integrated learning structures and cycle models to achieve one-year solar radiation assessment and analysis. Zhang et al. [21] proposed four different photovoltaic prediction models of stacking ensemble learning based on random forest, extreme gradient boosting, and other base models. Experimental results showed that the prediction performance of the stacking integrated learning model was better than that of a single model. Therefore, given this prediction performance, it is promising to apply the ensemble learning method to predicting traction energy consumption.
In addition, because it is hard to obtain enough characteristic data on traction energy consumption, it is difficult to build a traction energy consumption prediction model with high accuracy. Several data modeling methods, such as transfer learning [22] and the generative adversarial network (GAN) [23], have been successfully applied in the field of wind power prediction.
The GAN can learn the distribution of the historical data, and the data generated with GAN have the same statistical properties as the historical data [24], [25]. This method can provide a novel solution for the above problems. Based on game theory, GAN trains both a generator and a discriminator. The generator uses noise to generate new data that match the mathematical distribution of the historical data, while the discriminator is employed to distinguish the generated data from the original data [26]. Since it was proposed, GAN has been widely used in many fields of power systems, such as building power demand prediction [27], daily power demand prediction [28], motor state prediction [29], and battery state prediction [30]. For example, Chen et al. [31] proposed a data-driven GAN to generate scenarios that capture the spatial and temporal correlations of renewable power plants. This method can generate realistic and diverse wind and photovoltaic power distribution maps. However, the traditional GAN also suffers from poor generation diversity and mode collapse.
To emphasize the novelty of this work, two concerns regarding the research gaps are further clarified.
1) The first concern is the data shortage issue in new subway lines. The abnormal monitoring of metro lines is based on the difference between predicted and measured values. Therefore, it requires a high prediction accuracy, which is conducive to developing a more energy-saving driving scheduling plan. However, in new subway lines, owing to data shortage, it is difficult to establish a prediction model with high accuracy. To overcome data insufficiency, the WGAN-GP model is first proposed to enhance the data samples so as to address the few-shot learning problem in new subway lines. Compared with the original GAN, the improved GAN can generate high-quality characteristic data of traction energy consumption to meet the needs.
2) The second concern is how to improve prediction accuracy. Deep learning algorithms require large amounts of data, while single shallow-layer network algorithms such as SVM, BP, and random forest may lead to a large deviation in prediction accuracy. To fully utilize their respective advantages, stacking ensemble learning is first applied to construct a traction energy consumption prediction model. The results in the experimental tests confirm the obvious advantage of our proposed ensemble learning method over the single shallow-layer network algorithms.
Therefore, in this paper, a prediction model for metro traction energy consumption is proposed based on the gradient penalty Wasserstein generative adversarial network (WGAN-GP) and Extreme Gradient Boosting (XGBoost). The main contributions are summarized as follows:
1) To address the data shortage problem, a novel WGAN-GP method for traction energy consumption is proposed. The limited data on traction energy consumption are amplified by the trained WGAN-GP. The data generated by WGAN-GP have the same statistical properties as the real data and thus can be applied in the prediction of traction energy consumption of new metro lines. The method can effectively generate characteristic data and circumvent the dilemma of data shortage in the prediction of traction energy consumption.
2) To improve prediction accuracy, a stacking ensemble learning method is used to mine the complex nonlinear relationship of metro traction energy consumption. The prediction model of traction energy consumption is built on the XGBoost ensemble learning model. The stacking ensemble learning model is formed by cascading two layers of models. The model performance can be greatly improved by integrating multiple models.
3) Eleven combinations of four base models are tested with the Stacking ensemble learning method. With single models extracting different feature information, integrated Stacking ensemble learning can improve the prediction performance. The best prediction model, which uses BP, SVM, ELM, and XGBoost as the first-layer base models and the XGBoost algorithm as the second-layer meta model, is shown to be more suitable for traction energy prediction.
The remainder of this paper is organized as follows. Section II introduces the WGAN-GP model. Section III presents the structure of stacking ensemble learning. Section IV describes the model parameter settings and the prediction process. Section V evaluates the effectiveness of WGAN-GP data generation, compares the performance of different base models and meta models, and summarizes the advantages of the proposed method through extensive case studies. Section VI concludes this paper.

II. DATA AMPLIFICATION MODEL

A. GAN MODEL
The basic architecture of GAN is shown in FIGURE 1.
Generator G learns the relationship between the random noise signal Z and the real data R through continuous training, while discriminator D is used to distinguish whether its input is real data during training.
By training the GAN, generator G produces data with the same statistical properties as the historical data in order to deceive discriminator D, while discriminator D tries its best to identify whether its input is real data or generated data. The two constantly play against each other until they finally reach a Nash equilibrium.
The generator and discriminator are trained interactively and compete with each other, so that the network is constantly optimized. Finally, the trained generator can generate high-quality new sample data that the discriminator cannot distinguish from the real data. Therefore, the loss functions of the discriminator and generator are defined as:
$$L_D = -\mathbb{E}_{x \sim P_g(x)}\left[\log D(x)\right] - \mathbb{E}_{z \sim P_g(z)}\left[\log\left(1 - D(G(z))\right)\right] \tag{1}$$
$$L_G = \mathbb{E}_{z \sim P_g(z)}\left[\log\left(1 - D(G(z))\right)\right] \tag{2}$$
where $L_D$ is the discriminator loss function, $L_G$ is the generator loss function, $G$ is the generator, $D$ is the discriminator, $P_g(x)$ is the true distribution of the original data, $z \sim P_g(z)$ is the noise data conforming to the normal distribution, $\mathbb{E}$ is the expectation, $x$ is the real data, and $z$ is the Gaussian noise. GAN is a kind of unsupervised learning neural network. It takes Gaussian noise as the input and outputs data with a distribution pattern similar to that of the real data. The training objective of GAN is defined as follows:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_g(x)}\left[\log D(x)\right] + \mathbb{E}_{\tilde{x} \sim P_Z}\left[\log\left(1 - D(\tilde{x})\right)\right] \tag{3}$$
where $P_Z$ is the distribution of the generated data $\tilde{x} = G(z)$.

B. WGAN-GP MODEL
The traditional GAN suffers from problems such as vanishing gradients and cannot learn the distribution of the characteristic data well.
To alleviate vanishing gradients and enhance training stability in the original GAN, the Wasserstein generative adversarial network (WGAN) was proposed in [32]. WGAN can effectively utilize the real data and generate new data by minimizing the Wasserstein distance, which can be expressed as:
$$W(P_R, P_Z) = \inf_{\gamma \in \Pi(P_R, P_Z)} \mathbb{E}_{(x, y) \sim \gamma}\left[\lVert x - y \rVert\right] \tag{4}$$
where $\Pi(P_R, P_Z)$ represents the set of joint distributions $\gamma$ whose marginals are the real data distribution $P_R$ and the generated data distribution $P_Z$, $(x, y) \sim \gamma$ represents sampling from such a joint distribution, and $W(P_R, P_Z)$, namely the Wasserstein distance, is the infimum of the expected distance $\lVert x - y \rVert$ over all these joint distributions.
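To make the Wasserstein distance more concrete, the following minimal sketch computes it for two one-dimensional samples of equal size, where the infimum reduces to the mean absolute difference between the sorted samples. The function name `wasserstein_1d` and the toy values are illustrative assumptions and are not taken from the paper.

```python
def wasserstein_1d(real, generated):
    """Empirical 1-D Wasserstein-1 distance between two equal-size samples.

    In one dimension with equally weighted points, the optimal coupling
    matches the i-th smallest real value to the i-th smallest generated value.
    """
    if len(real) != len(generated) or not real:
        raise ValueError("this sketch assumes non-empty samples of equal size")
    pairs = zip(sorted(real), sorted(generated))
    return sum(abs(x - y) for x, y in pairs) / len(real)


# Toy values for illustration only (not measured data).
real = [1.0, 2.0, 3.0, 4.0]
shifted = [x + 0.5 for x in real]
print(wasserstein_1d(real, shifted))  # 0.5: every point moved by 0.5
```

A distance close to zero indicates that the generated sample is distributed like the real one, which is the quantity WGAN drives down during training.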
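WGAN-GP further adds a gradient penalty term to the discriminator loss of WGAN. The loss functions of the discriminator and generator are:
$$L_D = \mathbb{E}_{\tilde{x} \sim P_Z}\left[f_w(\tilde{x})\right] - \mathbb{E}_{x \sim P_R}\left[f_w(x)\right] + \lambda\, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[\left(\lVert \nabla_{\hat{x}} f_w(\hat{x}) \rVert_2 - 1\right)^2\right] \tag{5}$$
$$L_G = -\mathbb{E}_{\tilde{x} \sim P_Z}\left[f_w(\tilde{x})\right] \tag{6}$$
where $L_D$ is the discriminator loss function, $L_G$ is the generator loss function, $f_w(\cdot)$ is the fitting function of the neural network, $\lambda$ is the penalty coefficient, $\lVert \cdot \rVert_2$ represents the 2-norm, and $P_{\hat{x}}$ is the distribution of the penalty samples $\hat{x}$, which in WGAN-GP are interpolated between real and generated samples.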
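The effect of the gradient penalty can be illustrated with a deliberately simple case. The sketch below assumes a hypothetical linear critic $f_w(x) = w \cdot x$ in place of the neural network, so the gradient with respect to the input equals $w$ everywhere and the penalty can be evaluated in closed form. The value `lam=10.0` is an illustrative default and is not a setting reported in the paper.

```python
import math


def linear_critic(w, x):
    """Hypothetical linear critic f_w(x) = w . x (stand-in for the network)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def mean(values):
    return sum(values) / len(values)


def critic_loss(w, real, generated, lam=10.0):
    """L_D = E[f_w(generated)] - E[f_w(real)] + lam * (||grad f_w||_2 - 1)^2.

    For a linear critic the input gradient is w at every interpolated point,
    so the expectation over the penalty samples is a constant.
    """
    grad_norm = math.sqrt(sum(wi * wi for wi in w))
    penalty = lam * (grad_norm - 1.0) ** 2
    return (mean([linear_critic(w, x) for x in generated])
            - mean([linear_critic(w, x) for x in real])
            + penalty)


def generator_loss(w, generated):
    """L_G = -E[f_w(generated)]."""
    return -mean([linear_critic(w, x) for x in generated])


# Toy samples for illustration only.
real = [[2.0, 0.0]]
generated = [[0.5, 0.0]]
print(critic_loss([1.0, 0.0], real, generated))  # unit gradient norm: no penalty
print(critic_loss([2.0, 0.0], real, generated))  # gradient norm 2: penalised
```

The penalty vanishes only when the critic's gradient norm equals 1, which is how WGAN-GP softly enforces the Lipschitz constraint instead of clipping weights.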

III. ENSEMBLE LEARNING METHOD

A. MACHINE LEARNING ALGORITHM
Popular deep learning methods require massive data for training [33], [34], whereas the traction energy consumption data are limited in new lines. As a result, it would be very hard to build effective deep learning models. Therefore, some excellent machine learning methods are selected to establish the prediction model instead of a single algorithm.

1) XGBOOST ALGORITHM
XGBoost is a Boosting ensemble learning algorithm based on gradient boosting. It optimizes the gradient boosting decision tree and mitigates the tendency of the traditional gradient boosting decision tree to overfit [35], [36], [37]. Its model is shown in Equation (7):
$$\hat{y}(i) = \sum_{m=1}^{M} f_m(x_i), \quad f_m \in F \tag{7}$$
where $\hat{y}(i)$ is the predicted value of the $i$-th sample, $M$ is the number of trees, $f_m(\cdot)$ is the $m$-th tree, $x_i$ is the $i$-th sample, and $F$ is the set space of trees. The objective function of XGBoost is shown in Equation (8):
$$Obj = \sum_{i} L\left[y(i), \hat{y}(i)\right] + \sum_{m=1}^{M} \Omega(f_m) \tag{8}$$
where $L[y(i), \hat{y}(i)]$ is the training error between the predicted value and the actual value, and $\Omega(f_m)$ is the tree complexity.
VOLUME 10, 2022
The complexity of the tree is calculated as shown in Equation (9):
$$\Omega(f_m) = \gamma T + \frac{1}{2} \lambda \lVert \omega \rVert^2 \tag{9}$$
where $\gamma$ is the control coefficient of the number of leaf nodes, $T$ is the number of leaf nodes, $\lambda$ is the control coefficient of the leaf scores, and $\omega$ is the vector of leaf scores. The XGBoost algorithm can compute the nodes in each layer of a tree in parallel, which is conducive to improving the training speed of the model. By inputting multiple characteristics of metro train traction energy consumption into the XGBoost model and taking the measured traction energy consumption as the target, the XGBoost prediction model for metro train traction energy consumption can be established. Through several iterations, a good fitting effect can be obtained.
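As a small worked illustration of the regularised objective, the sketch below evaluates the tree complexity term and the full objective for given leaf scores. It assumes a squared-error training loss for $L$, which the paper does not specify, and the function names and toy numbers are hypothetical.

```python
def tree_complexity(leaf_scores, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lam * sum(w^2) for a single tree."""
    n_leaves = len(leaf_scores)
    return gamma * n_leaves + 0.5 * lam * sum(w * w for w in leaf_scores)


def objective(y_true, y_pred, trees, gamma, lam):
    """Training loss plus the complexity of every tree.

    Assumption: squared error is used as the per-sample loss L.
    """
    loss = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return loss + sum(tree_complexity(t, gamma, lam) for t in trees)


# One tree with two leaves: 0.5 * 2 + 0.5 * 2.0 * (1 + 1) = 3.0
print(tree_complexity([1.0, -1.0], gamma=0.5, lam=2.0))
```

Larger values of `gamma` discourage additional leaves and larger values of `lam` shrink the leaf scores, which is the mechanism by which the objective limits overfitting.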

2) EXTREME LEARNING MACHINE
Extreme Learning Machine (ELM) is an improved model of the feedforward neural network, which has the advantages of fast training speed and simple parameter adjustment [38], [39], [40]. As a type of feedforward neural network, ELM randomly generates the input weights during training and keeps them unchanged. Only the output weights need to be solved for the network model to approximate the training samples.

3) SUPPORT VECTOR MACHINE
Support Vector Machine (SVM) is a machine learning method based on statistical learning theory. It is commonly used for two categories of tasks, namely classification and nonlinear regression [41], [42], [43]. When SVM is used in a regression task, low-dimensional samples can be mapped to a high-dimensional vector space through a nonlinear mapping function $\phi(x)$ to better solve the small-sample problem. The function relation of SVM is as follows:
$$f_{SVM}(x) = \omega_{SVM}^{\top} \phi(x) + b_{SVM} \tag{10}$$
where $f_{SVM}(x)$ is the predicted value for sample $x$, $\omega_{SVM}$ is the weight coefficient matrix of SVM, and $b_{SVM}$ is the bias coefficient matrix of SVM.

4) ERROR BACKPROPAGATION ALGORITHM
The error backpropagation (BP) neural network is a neural network that adjusts the weights and bias parameters among network layers according to the training errors [44], [45], [46], [47]. At the same time, through different activation functions, the complex nonlinear relations in the features can be further extracted, so that the BP neural network can closely approximate the real distribution of the features.

B. STACKING ENSEMBLE LEARNING
A single prediction model has limited prediction performance. Ensemble learning can integrate the advantages of multiple models. For the metro traction energy consumption feature sequence $Q = \{(y_n, x_n)\}, n = 1, \cdots, N$, $x_n$ is the feature vector of the $n$-th sample and $y_n$ is the corresponding traction energy consumption. The sequence $Q$ is randomly divided into $K$ equal subsets $S_1, S_2, \cdots, S_K$. Let $S_{-k} = Q - S_k$; $S_k$ and $S_{-k}$ are respectively defined as the test set and training set of the $k$-th fold in K-fold cross-validation. The first layer contains $K$ base learning algorithms [49]; training the $k$-th algorithm on the training set $S_{-k}$ yields the base model $M_k$, $k = 1, \ldots, K$.
For each sample $x_n$ in the test set $S_k$ of the $k$-th fold, the prediction given by the base learner $M_k$ is denoted as $z_{kn}$. After the cross-validation process is completed, the outputs of the $K$ base learners constitute a new data sample set, namely $S_{new} = \{(y_n, z_{1n}, \cdots, z_{Kn})\}, n = 1, \cdots, N$. The second-layer learner $M_{new}$ of the Stacking ensemble is then trained with $S_{new}$ as its input.
The detailed implementation of the Stacking ensemble learning algorithm is illustrated in Algorithm 1.
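The two-layer procedure described above can be sketched in plain Python as follows. This is a minimal illustration under several assumptions rather than the implementation used in the paper: the base learners (`fit_mean`, `fit_linear`, `fit_1nn`) are simple hypothetical stand-ins for BP, SVM, ELM, and XGBoost, a 1-nearest-neighbour regressor stands in for the XGBoost meta model, the number of folds is kept separate from the number of base learners, and the base learners are refitted on the full data for final prediction.

```python
import random


def fit_mean(X, y):
    """Stand-in base learner: always predicts the training mean."""
    m = sum(y) / len(y)
    return lambda x: m


def fit_linear(X, y):
    """Stand-in base learner: least-squares line on the first feature."""
    xs = [row[0] for row in X]
    mx, my = sum(xs) / len(xs), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, y)) / sxx if sxx else 0.0
    return lambda x: my + slope * (x[0] - mx)


def fit_1nn(X, y):
    """Stand-in learner: 1-nearest-neighbour regression."""
    def predict(x):
        dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in X]
        return y[dists.index(min(dists))]
    return predict


def stack(X, y, base_fitters, meta_fitter, n_folds=5, seed=0):
    """Return the out-of-fold feature set Z (S_new) and a stacked predictor."""
    n = len(X)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    Z = [[0.0] * len(base_fitters) for _ in range(n)]
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        for j, fit in enumerate(base_fitters):
            model = fit(X_tr, y_tr)          # train on S_{-k}
            for i in fold:                   # predict on S_k
                Z[i][j] = model(X[i])
    meta = meta_fitter(Z, y)                 # second layer trained on S_new
    bases = [fit(X, y) for fit in base_fitters]  # assumption: refit on all data

    def predict(x):
        return meta([b(x) for b in bases])

    return Z, predict


# Synthetic data for illustration only.
X = [[float(i)] for i in range(10)]
y = [2.0 * i for i in range(10)]
Z, predict = stack(X, y, [fit_mean, fit_linear, fit_1nn], fit_1nn)
print(len(Z), len(Z[0]))
```

Because every row of `Z` is predicted by models that never saw that sample, the meta model is trained on honest estimates of how each base learner behaves on unseen data.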

IV. MODEL PARAMETER SETTING AND PREDICTION PROCESS

A. WGAN-GP PARAMETER SETTINGS
The WGAN-GP model of the metro traction energy consumption is shown in Fig.3.
Algorithm 1 Stacking ensemble learning
Step one: The primary learners are generated by training the first-layer learning algorithms on the training dataset Q;
for t = 1 to K do
    train the t-th base learner h_t on Q;
end for
Step two: Create a new training data set S_new;
S_new = ∅;
for i = 1 to N do
    for t = 1 to K do
        z_it = h_t(x_i);
    end for
    S_new = S_new ∪ {(y_i, z_i1, · · · , z_iK)};
end for
M_new = ξ(S_new);
return M_new

In this paper, the deconvolution layer and the convolution layer are used to build the WGAN-GP model of metro traction energy consumption. The generator G is composed of 5 hidden layers, while the discriminator D is composed of 4 hidden layers. The basic parameters are shown in Table 1.
where n Dconv is the number of deconvolution layers, kernel size is the size of the convolution kernel, n up is the number of upsampling layers, n fc is the number of neurons in the fully connected layer, and n conv is the number of convolution layers.
The input of generator G is Gaussian noise, and the output is traction energy consumption data of the metro train. The input of discriminator D is a 1 × 5 tensor formed from the real and the generated metro traction energy consumption data. In the training process, both generator G and discriminator D use the Adam optimization algorithm, and the initial learning rate is 0.001. After training, generator G can generate traction energy consumption data consistent with that of the real train, so as to realize the data amplification of the traction energy consumption.

B. STACKING ENSEMBLE LEARNING PARAMETER SETTINGS
In this paper, the stacking ensemble learning model is divided into two layers. The first layer uses XGBoost, SVM, BP, and ELM models as the base learning model, and the second layer uses XGBoost as the meta-learning model.
Detailed parameter settings of the different models are shown in Table 2.

C. WGAN-GP AND STACKING PREDICTION PROCESS
The flow chart of traction energy consumption prediction is illustrated in FIGURE 4 below.
According to FIGURE 4, the prediction process mainly consists of three steps: 1) data preprocessing, 2) WGAN-GP model training and data generation, and 3) base learning model training and meta-learning model training.
1) Data preprocessing: Eliminate the influence of outliers and extreme values through data processing, and provide an excellent data source for data generation and prediction model.
2) WGAN-GP model training and data generation: The WGAN-GP data generation model suitable for metro train traction energy consumption is trained and established to provide sufficient data for the prediction model.
3) Base learning model training and meta-learning model training: Fully mine the characteristic data of traction energy consumption, establish the stacking ensemble learning model of traction energy consumption prediction, and conduct experimental verification.

V. CASE STUDIES
To verify the effectiveness of the proposed prediction model, this section employs the measured data of Guangzhou Metro. All simulation experiments are implemented with the Keras deep learning framework under Python 3.8. The simulation platform is an Intel Core i5-10210U processor running at 2.11 GHz with 16 GB of memory under the Windows 11 operating system.

A. DATA DESCRIPTION
In this paper, a total of 120 data samples are collected, including the daily train operation energy consumption, auxiliary equipment energy consumption, passenger flow, depot … The first 60 points are set as the training set and the last 60 points as the test set. All features are mapped to the interval [0,1] according to the following formula:
$$x_{std}^{i} = \frac{x^{i} - x_{min}^{i}}{x_{max}^{i} - x_{min}^{i}} \left(\mu_u - \mu_d\right) + \mu_d \tag{11}$$
where $x_{std}^{i}$ is the value after mapping the $i$-th feature, $x^{i}$ is the original value of the $i$-th feature, $x_{max}^{i}$ and $x_{min}^{i}$ are the maximum and minimum values of the $i$-th feature respectively, and $\mu_u$ and $\mu_d$ are the upper and lower bounds of normalization; in this paper $\mu_u = 1$ and $\mu_d = 0$.
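The normalization step can be sketched as a short function. The name `min_max_scale` and the handling of a constant feature (mapped to the lower bound) are assumptions made for this illustration.

```python
def min_max_scale(values, lower=0.0, upper=1.0):
    """Map one feature column to [lower, upper] by min-max normalization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature is mapped to the lower bound.
        return [lower for _ in values]
    return [(v - lo) / (hi - lo) * (upper - lower) + lower for v in values]


# Toy values for illustration only.
print(min_max_scale([10.0, 15.0, 20.0]))  # [0.0, 0.5, 1.0]
```

In practice the minimum and maximum would be taken from the training set and reused for the test set, so that no information from the test period leaks into the scaling.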
To reduce the difficulty of model training and to select suitable variables as the input of the prediction model, the Pearson product-moment correlation coefficient between each of the above 8 features and the traction energy consumption is calculated, as displayed in Table 3. It can be seen that the correlation coefficient of some variables is high, such as the actual train number and the passenger car mileage. Other variables have lower correlation coefficients, which indicates that they are less related to the traction energy consumption and should be discarded. Thus, the four variables with the largest correlation coefficients are selected as input features, viz., the actual train number, passenger car mileage, passenger flow, and energy consumption of auxiliary equipment.
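A minimal sketch of this correlation-based selection is given below. The helper names, the choice to rank by the absolute value of the coefficient, and the toy columns are assumptions for illustration and do not reproduce the data in Table 3.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation coefficient of two sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0  # assumption: a constant column is treated as uncorrelated
    return cov / (sx * sy)


def top_features(features, target, k=4):
    """Names of the k features most correlated with the target.

    Assumption: features are ranked by the absolute correlation.
    """
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy columns, not the paper's measurements.
energy = [1.0, 2.0, 3.0, 4.0]
features = {
    "train_number": [2.0, 4.0, 6.0, 8.0],
    "mileage": [4.0, 3.0, 2.0, 1.0],
    "noise": [1.0, -1.0, 1.0, -1.0],
}
print(top_features(features, energy, k=2))
```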

B. PREDICTION PERFORMANCE EVALUATION INDICES
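To evaluate the prediction performance, mean absolute error (MAE), root mean square error (RMSE), and mean absolute percentage error (MAPE) are chosen as the evaluation indices. The corresponding calculation formulas are expressed in Equations (12)-(14):
$$e_{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y(i) - \hat{y}(i) \right| \tag{12}$$
$$e_{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y(i) - \hat{y}(i) \right)^2} \tag{13}$$
$$e_{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{y(i) - \hat{y}(i)}{y(i)} \right| \times 100\% \tag{14}$$
where $N$ is the number of samples, $y(i)$ is the measured traction energy consumption value, and $\hat{y}(i)$ is the predicted traction energy consumption value. As can be seen from Equations (12)-(14), the smaller the values of $e_{MAE}$, $e_{RMSE}$, and $e_{MAPE}$, the better the prediction effect.

C. CASE 1
Fig. 5 shows the training loss curves of the generator and discriminator, and Fig. 6 presents the distribution of the generated data and real data.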
As can be seen from Fig. 5, in the multi-iteration training of the proposed WGAN-GP model, the losses of the generator and discriminator tend to be stable after 800 iterations. The final generator loss is stable at 1.3 ∼ 1.4, and the discriminator loss is stable between −0.33 and −0.15. Therefore, the iteration number of the WGAN-GP model is set to 800.
From the characteristic distribution of traction energy consumption in FIGURE 6, the distribution of the generated data is basically consistent with that of the real data, indicating that the data generation model is effective. From Table 4, the similarity between the generated features and the original features is higher than 0.7, which indicates that the generated data have a strong correlation with the original data.

D. CASE 2: EFFECTIVENESS ANALYSIS OF WGAN-GP GENERATED DATA
To further verify the validity of the data generated by WGAN-GP, four base models, i.e., XGBoost, SVM, ELM, and BP, are employed in a comparative analysis. They are fed with the original metro train feature series and the newly generated feature series, respectively. The prediction indicators of the four base models are shown in Table 5, and the predicted curves are drawn in FIGURE 7.
The following conclusions can be drawn from Table 5 and FIGURE 7:
(1) Compared with BP, SVM, and ELM, the XGBoost model achieves the best prediction performance on all three evaluation indices. In the prediction with original data, the $e_{MAE}$ index is reduced by 5.19%, 9.72%, and 29.19%, respectively, the $e_{MAPE}$ index is reduced by 3.14%, 10.05%, and 26.90%, respectively, and the $e_{RMSE}$ index is reduced by 13.50%, 4.95%, and 31.90%, respectively. Therefore, the XGBoost model is selected to build the prediction model for traction energy consumption.
(2) Compared with the original data, the data generated by GAN further improve the prediction performance in all four base models. Taking the MAE indicator as an example, the $e_{MAE}$ indices are reduced by 3.97%, 2.97%, 1.20%, and 2.30% with BP, SVM, ELM, and XGBoost, respectively. In addition, the $e_{RMSE}$ and $e_{MAPE}$ indices are also decreased with the data from GAN.
(3) Compared with the GAN-generated data, the data generated by WGAN-GP further improve the prediction performance in all four base models. Taking the MAE indicator as an example, the $e_{MAE}$ indices are reduced by 1.08%, 6.22%, 1.72%, and 2.55% with BP, SVM, ELM, and XGBoost, respectively. In addition, the $e_{RMSE}$ and $e_{MAPE}$ indices are also decreased with the data from WGAN-GP. Therefore, WGAN-GP can provide useful data for traction energy consumption and effectively improve the prediction performance.
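The three evaluation indices and the percentage reductions quoted in the comparison can be computed as in the sketch below. It assumes that a reduction is measured relative to the baseline model's index, which the paper does not state explicitly, and the numbers in the usage lines are toy values rather than results from Table 5.

```python
import math


def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def rmse(y, y_hat):
    """Root mean square error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def mape(y, y_hat):
    """Mean absolute percentage error, in percent (requires non-zero y)."""
    return sum(abs((a - b) / a) for a, b in zip(y, y_hat)) / len(y) * 100.0


def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`.

    Assumption: reductions are expressed relative to the baseline index.
    """
    return (baseline - improved) / baseline * 100.0


# Toy values for illustration only.
measured = [100.0, 200.0]
predicted = [110.0, 180.0]
print(mae(measured, predicted), rmse(measured, predicted), mape(measured, predicted))
print(relative_reduction(2.0, 1.5))  # 25.0
```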

E. CASE 3: PERFORMANCE VALIDATION OF DIFFERENT META MODELS
To verify the effect of different meta models, four different models (XGBoost, SVM, ELM, and BP) are each used as the meta model to constitute a stacking ensemble learning model, based on the generated metro train feature series. Table 6 shows the prediction index results, and FIGURE 8 shows the predicted curves.
From Table 6 and FIGURE 8, the four meta models show different performances: the XGBoost meta model performs best, and the ELM meta model performs worst. Compared with BP, SVM, and ELM, the $e_{MAE}$ index of XGBoost decreased by 8.50%, 8.23%, and 10.28%, respectively; the $e_{RMSE}$ index decreased by 8.98%, 12.63%, and 10.47%, respectively; and the $e_{MAPE}$ index decreased by 8.49%, 13.97%, and 10.02%, respectively. The experimental results show that XGBoost is more suitable as a stacking ensemble learning meta model for traction energy consumption prediction. This is because the XGBoost model selects the nodes with the largest information gain to construct the Classification and Regression Tree and reduces overfitting. In addition, the model can be trained in parallel at feature granularity, which improves the training speed.

F. CASE 4: STACKING ENSEMBLE LEARNING VALIDATION OF DIFFERENT BASE MODEL COMBINATIONS
To verify the combinations of different base models, using XGBoost as the meta model, different stacking ensemble models are built. The same data are used for analysis. The index results are shown in Table 7.
From Table 7 and FIGURE 9, the first combination, viz., BP, SVM, ELM, and XGBoost, achieves the best performance on all indices. Therefore, the stacking ensemble prediction model with BP, SVM, ELM, and XGBoost as the first-layer base models and the XGBoost algorithm as the second-layer meta model is more suitable for traction energy consumption prediction. Different single models can extract different feature information, and integrating them through Stacking ensemble learning is conducive to improving the prediction performance.
The time complexity and running time of the different models are shown in Table 8. It can be found that the time complexity of the proposed generation model and ensemble learning prediction model is $O(n^2)$, which allows the prediction results to be obtained with a relatively fast runtime.

G. CASE 5: STABILITY VERIFICATION
To further verify the stability of the prediction model, all models were run 30 times, and the standard deviation ($e_{std}$) of each evaluation index was calculated based on Equation (15).
Table 9 shows the experimental results of the different methods involved in this paper. It can be seen that the standard deviation of all models is between 0.1 and 0.3, showing good stability, which confirms the effectiveness of the proposed method.
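The repeated-run stability check can be sketched as follows. The sketch assumes the population form of the standard deviation (division by the number of runs); the paper's exact form may differ. The function `fake_run` is a hypothetical stand-in that returns synthetic scores, not results from Table 9.

```python
import random
import statistics


def repeat_and_std(run_once, n_runs=30):
    """Run an experiment n_runs times; return the mean and standard deviation.

    Assumption: the population standard deviation (pstdev) is used.
    """
    scores = [run_once(seed) for seed in range(n_runs)]
    return statistics.mean(scores), statistics.pstdev(scores)


def fake_run(seed):
    """Hypothetical stand-in for one training and evaluation run."""
    return 2.0 + random.Random(seed).uniform(-0.2, 0.2)


mean_score, std_score = repeat_and_std(fake_run)
print(mean_score, std_score)
```

In an actual experiment `run_once` would retrain the model with the given random seed and return one evaluation index, so that a small standard deviation reflects insensitivity to initialization and data shuffling.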

VI. CONCLUSION
In this paper, a novel prediction model for metro traction energy consumption is proposed. In addition to limited original data, extra data are generated by WGAN-GP using gradient penalty to alleviate the gradient disappearance and enhance training stability in the original GAN. The stacking ensemble learning with multiple models is employed to improve the prediction performance. Satisfactory predictions are obtained and verified by extensive experiments.
Conclusions are drawn as follows:
1) The WGAN-GP model can learn the distribution patterns of traction energy consumption and the characteristics of metro trains. WGAN-GP can provide the data needed for prediction and effectively improve prediction accuracy. This is particularly suitable for predicting the traction energy consumption of new lines.
2) The stacking ensemble prediction model, in which BP, SVM, ELM, and XGBoost work as the first-layer base models and the XGBoost algorithm works as the second-layer meta model, offers the best prediction performance.
3) The ensemble learning model with other single prediction models can still achieve good results, and thus can also be applied to other prediction problems, for example, high-speed railway prediction or load prediction of a new industrial park.