Electrical Energy Prediction of Combined Cycle Power Plant Using Gradient Boosted Generalized Additive Model

A combined cycle power plant (CCPP) employs gas and steam turbines to generate 50% more power from the same fuel than a conventional single-cycle plant. The performance of a CCPP under full load is affected by a variety of factors such as weather, process interactions, and coupling, which makes it challenging to operate. Therefore, a reliable assessment of the maximum output power of a CCPP is required to improve plant reliability and monetary performance. In this paper, a predictive model based on a generalized additive model (GAM) is proposed for the electrical power prediction of a CCPP at full load. In the GAM, boosted trees serve as the shape functions and gradient boosting as the learning technique for modeling the non-linear relationship between input and output attributes. Furthermore, predictive models based on linear regression (LR), Gaussian process regression (GPR), multilayer perceptron neural network (MLP), support vector regression (SVR), decision tree (DT), and bootstrap-aggregated tree (BBT) are also designed for comparison purposes. Results reveal that GAM reduces the RMSE by 74%, 68.8%, 70.3%, 54.8%, 21.2%, and 17.3% compared to LR, GPR, MLP, SVR, DT, and BBT, respectively. Furthermore, the results of the Mann-Whitney U test and rank analysis also confirm the effectiveness of GAM for energy prediction of CCPP. Finally, it can be concluded that the proposed method is effective, robust, and accurate for the assessment of the maximum output power of a CCPP to improve plant consistency and financial performance.


I. INTRODUCTION
Analyzing a thermodynamic system requires various hypotheses to compensate for the uncertainty in the solution. In real-time applications, these hypotheses are impractical for analyzing complex systems, because the analysis involves solving hundreds of nonlinear equations and therefore demands excessive computation. To circumvent this constraint, machine and deep learning techniques are gaining popularity as a way to avoid thermodynamic-based techniques, discover counter-intuitive aspects, and provide performance efficiencies beyond design variables. These advances result from the discovery of diverse and complex correlations and interconnections between important input and output attributes [1]. A combined cycle power plant (CCPP) is a well-known example of a thermodynamic system. The performance of a power plant under full load is affected by a variety of factors such as weather, process interactions, and coupling, which makes it challenging to create a reliable mathematical model for a CCPP. The CCPP uses gas and steam turbines to produce 50% more energy from the same fuel than a standard simple cycle plant. However, accurate estimation of output power at maximum load is essential to enhance plant efficiency and financial operations [2]. Reliable energy generation assessment tools can help to conserve energy and maximize returns on existing megawatt-hours (MWh), which improves power plant efficiency, especially when facing the limits of raw material conservation and high profitability. Hence, precise power generation forecasting is of great importance for enhancing the efficiency of power plants and improving environmental conditions [3]. In recent years, researchers have utilized various approaches based on machine learning (ML) algorithms to predict the output power of a CCPP at full load.

The associate editor coordinating the review of this manuscript and approving it for publication was Hiram Ponce.
Previous studies on the prediction and control of CCPP are reviewed in Table 1. In [16], the LR2,1 norm-based online sequential extreme learning algorithm (LR21-OS-ELM) is designed for different prediction problems. The performance of LR21-OS-ELM is compared with that of ELM and LR21-ELM for electrical energy prediction. Results demonstrate that the proposed ML algorithm outperforms ELM and LR21-ELM in terms of RMSE. In [17], ridge and support vector regression models are designed and implemented for the energy prediction of CCPP. The regression coefficient for SVR (0.98) is higher than that for ridge regression (0.92), which shows the higher predictive accuracy of SVR. In [18], principal component analysis (PCA)-based K-means and agglomerative clustering are used for CCPP energy prediction. The results show that the proposed algorithms have an accuracy of 80% compared to the support vector machine (SVM) and regression tree. In [19], a deep learning neural network (DNN) is designed for CCPP energy forecasting. The predictive performance of the DNN is compared with that of sequential API and functional API based ANNs. Results show the superior performance of deep learning neural networks.
Various ML techniques are used in the literature to predict the electrical energy output of CCPP. Each ML algorithm has its own pros and cons. For example, the number of neurons in each hidden layer, synaptic weights, learning rate, and bias values all have a significant impact on ANN performance. Fuzzy logic-based ML methods require an accurate estimation of the rule base, which is a time-consuming operation. On the other hand, the prediction accuracy of SVR and SVM is determined by the appropriate values of their corresponding hyper-parameters. In view of the above discussion, the aim of this work is to investigate a competent energy forecasting model for CCPP based on a boosting-based generalized additive model (GAM), which can provide energy experts with the necessary insight into CCPP energy generation. Such predictive models are simple to interpret while enhancing forecast accuracy. They also generally outperform most linear techniques, such as linear regression, while providing greater interpretability than other ML algorithms. They further allow current knowledge to be integrated during the model construction process in order to improve the prediction performance. The key contributions of the article are as follows.

1. A boosting-based generalized additive model (GAM) is proposed for the output energy prediction of CCPP.
2. The predictive performance of GAM is compared with that of linear regression (LR), Gaussian process regression (GPR), multilayer perceptron neural network (MLP), support vector regression (SVR), decision tree (DT), and bootstrap-aggregated tree (BBT).
3. The Mann-Whitney U test, violin plots, and rank analysis are all utilized to perform a detailed assessment of the models' outcomes.
The following are the remaining sections of this work: Sections II and III provide a full discussion of the dataset as well as the proposed algorithm. Sections IV and V describe the results and discussion, followed by the conclusion of the work.

II. COMBINED CYCLE POWER PLANT (CCPP) SYSTEM
A CCPP is a combination of steam and gas turbines (ST and GT) with heat recovery steam generators (HRSG). The power in a CCPP is produced using the ST and GT, which are integrated in one cycle, with heat transferred from one turbine to the other [20]. In the CCPP, the GT produces not only electric power but also hot exhaust containing NOx and COx gases. These gases are passed through the HRSG, where their heat is used to raise steam that drives the coupled ST and generator to produce electricity. Thus, the GT generator generates energy, and the remaining heat from the exhaust gas is used to make steam, which in turn generates power via the ST generator [21]. For this study, the data set is from CCPP-1 [22] with a small production capacity of 480 MW, consisting of one 160 MW ABB ST, two dual HRSGs, and two 160 MW ABB 13E2 GTs, as shown in Figure 1. The CCPP data set contains 47840 (9568 per year) data points collected between 2006 and 2011 while the plant was operating at full load. The power (PE) generated by the combination of GT and ST is primarily affected by four environmental variables: ambient temperature (AT), exhaust vacuum (V), relative humidity (RH), and ambient pressure (AP). Thus, AT, V, RH, and AP act as input attributes, while PE acts as the output attribute of the ML algorithm. A statistical description of the CCPP data set is presented in Table 2. For a better understanding of the data set, all input and output attributes are shown as histogram fits in Figure 2.

A. DATA PREPROCESSING
Data preprocessing (DP) is regarded as the most crucial phase in any data-driven investigation. It reveals the dataset's outliers, redundancies, and missing terms. Data points that diverge strongly from the rest are known as outliers and should be removed from the dataset [23]. In this work, a quartile-based outlier detection and rejection ($O_R$) method is applied (equation (1)), where $l$ is an input or output attribute that lies in $m$-dimensional space ($l \in \mathbb{R}^m$), and $Q_1$, $Q_3$, and $IQ_R$ denote the first quartile, third quartile, and interquartile range of that attribute, respectively. Further, a median-by-target method (equation (2)) is employed to fill the missing ($M_V$) terms in the attributes.
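A minimal sketch of these two preprocessing steps follows. Equations (1) and (2) are not reproduced in this text, so the sketch assumes the conventional Tukey fences $Q_1 - 1.5\,IQ_R$ and $Q_3 + 1.5\,IQ_R$ and a plain per-attribute median fill. The multiplier 1.5, the function names, and the use of `None` for missing entries are illustrative assumptions, not the authors' implementation.

```python
from statistics import median, quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR.

    k = 1.5 is the conventional Tukey multiplier (an assumption here).
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def reject_outliers(values, k=1.5):
    """Keep only the points that lie inside the quartile-based fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]


def fill_missing_with_median(values):
    """Replace None entries with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical toy readings, not taken from the CCPP data set.
readings = [10, 12, 11, 13, 12, 95]
cleaned = reject_outliers(readings)
```

A grouped variant (median computed per target bin) would follow the same pattern; the paper does not specify the grouping, so it is left out here.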
After preprocessing, the next step is to sample the dataset into training (70%), validation (15%), and testing (15%) subsets. Figure 3 shows this sampling of the CCPP dataset into the three subsets.
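The 70/15/15 sampling can be illustrated with a short sketch. It assumes a simple random shuffle of sample indices; the seed and the function name `split_indices` are hypothetical, and the paper does not state how its random selection was implemented.

```python
import random


def split_indices(n, fractions=(0.70, 0.15, 0.15), seed=0):
    """Shuffle the indices 0..n-1 and cut them into train/validation/test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(n * fractions[0])
    n_val = round(n * fractions[1])
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test subset
    return train, val, test


# 47840 is the data set size reported in Section II.
train, val, test = split_indices(47840)
print(len(train), len(val), len(test))
```

With 47840 samples, these fractions give subsets of 33488, 7176, and 7176 samples.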

III. GENERALIZED ADDITIVE MODEL (GAM)
According to the literature, various machine learning algorithms such as boosted trees, random forest (RF), MLP, SVR, and DNN have been utilized for the energy prediction of CCPP. These algorithms provide accurate and precise predictive regression models for low- and high-dimensional predictive problems. However, in several applications, what is learnt is just as essential as predictive accuracy, and the high precision of complicated models comes at the cost of interpretability, i.e., the influence of a specific input on the predicted output of a complex model is cumbersome to interpret. A generalized additive model (GAM) can address this interpretability issue of complex models [24]. GAM is an extension of LR models. A conventional LR model gives the linear relationship between input and output attributes. Let $Y$ be the output attribute, normally distributed with mean $\gamma$ and variance $\eta^2$. The linear relationship between $Y$ and the input attributes $X_j$ is given by equation (3), a weighted sum over the predictor attributes. Equation (3) can be rewritten with a link function $h$, which relates $\gamma$ to $X_j$, to give equation (4), the functional form of generalized linear models (GLMs). GAM extends GLMs by introducing non-linear forms of the predictor attributes. These non-linear predictors are linked to the predicted value of the dependent variable through an appropriate link function (equation (5)), where $f_j(i)$ is the $j$-th basis function and each basis function carries its own parameter. To improve the accuracy of the conventional GAM, pairwise interactions are added to it, and equation (5) is modified accordingly [25].
However, training the GAM depends on two critical factors: (1) the selection of the shape function and (2) the learning algorithm used to train the GAM. In this work, boosted trees are used as the shape functions and gradient boosting as the learning method to train the GAM.
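For orientation, the progression from LR to GLM to GAM described in this section can be written in its common textbook form. This is a sketch under assumptions: the symbols $\beta_0$, $\beta_j$, $p$, and $f_{ij}$ are introduced here for illustration and may not match the notation of the paper's equations (3) to (6), and any per-basis parameter is taken to be absorbed into the shape function $f_j$.

```latex
% Linear model for the mean \gamma of Y given p predictor attributes
\gamma = \beta_0 + \sum_{j=1}^{p} \beta_j X_j

% Generalized linear model: the link function h relates \gamma to the predictors
h(\gamma) = \beta_0 + \sum_{j=1}^{p} \beta_j X_j

% Generalized additive model: each linear term becomes a shape function f_j
h(\gamma) = \beta_0 + \sum_{j=1}^{p} f_j(X_j)

% GAM with pairwise interaction terms f_{ij}
h(\gamma) = \beta_0 + \sum_{j=1}^{p} f_j(X_j) + \sum_{i < j} f_{ij}(X_i, X_j)
```

Because every term depends on at most two attributes, each $f_j$ can be plotted on its own against $X_j$, which is the source of the interpretability discussed above.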

A. GRADIENT BOOSTING (GB)
GB is an iterative process that starts by estimating the function with a constant offset, which does not fit the data adequately. In each iteration, the fit is improved by fitting the base learner to the negative gradient of a pre-specified error function. GB enhances the predictive performance of the model while also performing attribute selection and model identification, which gives it significant benefits over other approaches. If GB is stopped early, before convergence, it improves predictive accuracy by shrinking regression coefficients toward 0, a strategy similar to lasso regression, ridge regression, and shrinkage smoothing. GB also achieves attribute selection by setting certain components to 0. Another advantage of GB over other regression methods is its ability to integrate nonlinear correlations and spatial impact [26], [27]. In the GB method, the optimal prediction function $f^*$ that maps the input attribute $X$ to the output attribute $Y$ is estimated as follows.

$$f^* = \arg\min_{f} E_{Y,X}\left[\rho\left(Y, f(X)\right)\right]$$
where $f^*$ minimizes the cost function $\rho$ over all possible values of the input attribute $X$; $f^*$ can be any function that minimizes $E_{Y,X}[\rho(Y, f(X))]$. The true distribution of $X$ and $Y$ is not known, so GB instead minimizes the following empirical risk (ER).

$$ER = \frac{1}{n}\sum_{j=1}^{n}\rho\left(Y_j, f(X_j)\right)$$

where ER approximates $E_{Y,X}[\rho(Y, f(X))]$ by the mean of the losses $\rho(Y_j, f(X_j))$, $j = 1, 2, \ldots, n$, over the sample sites.
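The empirical-risk view above can be made concrete with a small sketch of gradient boosting for an additive model. It is an illustration under stated assumptions, not the authors' implementation: it uses the squared-error loss (whose negative gradient is the residual), depth-one regression stumps as base learners instead of the paper's boosted trees, and hypothetical names (`fit_stump`, `boost_gam`, `predict`) and hyper-parameter values.

```python
def fit_stump(x, r):
    """Best one-split stump for residuals r on one feature x.

    Returns (gain, threshold, left_mean, right_mean), or None when the
    feature has a single distinct value. Maximizing gain is equivalent
    to minimizing the squared error of the stump.
    """
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    total, left_sum, best = sum(r), 0.0, None
    for pos in range(1, n):
        i = order[pos - 1]
        left_sum += r[i]
        if x[order[pos]] == x[i]:
            continue  # cannot split between two equal values
        left_mean = left_sum / pos
        right_mean = (total - left_sum) / (n - pos)
        gain = pos * left_mean ** 2 + (n - pos) * right_mean ** 2
        if best is None or gain > best[0]:
            best = (gain, (x[i] + x[order[pos]]) / 2, left_mean, right_mean)
    return best


def boost_gam(X, y, n_rounds=200, lr=0.1):
    """Fit offset + sum_j f_j(x_j), each f_j being a sum of stumps on feature j."""
    n, p = len(X), len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    offset = sum(y) / n  # constant initial fit
    pred = [offset] * n
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of 0.5 * (y - f)^2 with respect to f.
        resid = [yi - pi for yi, pi in zip(y, pred)]
        fits = [(fit_stump(cols[j], resid), j) for j in range(p)]
        fits = [(s, j) for s, j in fits if s is not None]
        if not fits:
            break
        (_, thr, left, right), j = max(fits, key=lambda t: t[0][0])
        stumps.append((j, thr, lr * left, lr * right))  # shrunken update
        for i in range(n):
            pred[i] += lr * (left if cols[j][i] <= thr else right)
    return offset, stumps


def predict(model, row):
    """Evaluate the additive model on one input row."""
    offset, stumps = model
    return offset + sum(l if row[j] <= thr else r for j, thr, l, r in stumps)


# Hypothetical additive toy target, not CCPP data.
X_demo = [[a / 10, b / 10] for a in range(-10, 11, 2) for b in range(0, 11, 2)]
y_demo = [row[0] ** 2 + 3 * row[1] for row in X_demo]
demo_model = boost_gam(X_demo, y_demo)
```

Each round updates only the feature whose stump best fits the current residuals, so a feature that is never selected keeps a zero shape function. Stopping after fewer rounds leaves the remaining shape functions shrunken, which corresponds to the early-stopping and attribute-selection behavior described above.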
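Steps for the implementation of the GB technique [28]-[30]:
I. Set the initial value for the n-dimensional shape function vector f [

Figure 4 shows the schematic representation of the overall methodology considered in this work. Furthermore, three performance indices, i.e., root mean square error (RMSE), mean absolute error (MAE), and R-squared ($R^2$), are considered for the performance investigation of the proposed ML algorithm towards energy prediction. Their standard definitions are

$$RMSE = \sqrt{\frac{1}{N}\sum_{k=1}^{N}\left(Y_k - \hat{Y}_k\right)^2}$$

$$MAE = \frac{1}{N}\sum_{k=1}^{N}\left|Y_k - \hat{Y}_k\right|$$

$$R^2 = 1 - \frac{\sum_{k=1}^{N}\left(Y_k - \hat{Y}_k\right)^2}{\sum_{k=1}^{N}\left(Y_k - \bar{Y}\right)^2}$$

where $\hat{Y}_k$ and $Y_k$ are the predicted and actual values for the $k$-th sample, $\bar{Y}$ is the mean of the actual values, and $N$ is the total number of samples in the CCPP dataset.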
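The three performance indices can be computed directly from paired actual and predicted values, as in the sketch below. It assumes the standard definitions of RMSE, MAE, and the coefficient of determination; the function names and the toy numbers are illustrative only.

```python
import math


def rmse(actual, predicted):
    """Root mean square error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mae(actual, predicted):
    """Mean absolute error."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical values, not CCPP measurements.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]
scores = (rmse(actual, predicted), mae(actual, predicted), r_squared(actual, predicted))
```

RMSE penalizes the single large error more heavily than MAE does, which is why the two indices are usually reported together.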

IV. RESULT AND DISCUSSION
As an illustration, CCPP uses gas and steam turbines to produce 50% more power in comparison to a single-cycle plant. Further, the development of a mathematical model for CCPP under full load is a tedious job due to its dependencies on various factors. Hence, a predictive model is required for the improvement of the plant's efficiency and financial operations. Therefore, in this work gradient boosting based GAM is proposed for the prediction of energy generation by CCPP.
Preprocessing of the data is required before designing the predictive models in order to remove outliers. Outliers introduce skewness and kurtosis, which make the algorithm overfit or underfit the predicted output values. Figure 5 shows the Pearson correlation matrix for the CCPP input and output attributes after preprocessing. The Pearson correlation coefficient measures the strength and direction of the linear relationship between two variables, which helps to examine the influential aspects of the data. A negative coefficient indicates an inverse correlation between the variables, whereas a positive value indicates a positive correlation. A coefficient of 0 means that the two variables are linearly uncorrelated. It can be observed from Figure 5 that the input attributes AT and V are negatively correlated with the output PE, whereas AP and RH have a positive correlation with PE. There is a strong positive correlation between AT and V, and both of these input attributes are negatively correlated with AP and RH. After preprocessing, the dataset is divided into three subsets: training (70%), validation (15%), and testing (15%). Thus, 33488, 7176, and 7176 samples are chosen randomly as the training, validation, and test subsets. The predictive accuracy of any ML algorithm depends greatly on the values of its hyper-parameters. A trial and error method is used to evaluate the optimal values of the GAM hyper-parameters. Linear regression (LR), support vector regression (SVR), Gaussian process regression (GPR), multilayer perceptron neural network (MLP), decision tree (DT), and bootstrap-aggregated tree (BBT) ML algorithms are also designed for comparison purposes.
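As a small illustration of how one entry of such a correlation matrix is obtained, the sketch below computes the Pearson coefficient for a pair of attribute columns. The function name and the toy columns are hypothetical; the values are made up and are not taken from the CCPP data set or from Figure 5.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Made-up columns: power falls as temperature rises.
temperature = [5.0, 10.0, 15.0, 20.0, 25.0]
power = [490.0, 480.0, 465.0, 455.0, 440.0]
r_temperature_power = pearson(temperature, power)
```

A value close to -1 for such a pair corresponds to the inverse relationship reported between AT and PE.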
As discussed previously, 7176 data points each are used for validation and testing. Figure 6 shows the regression plots for all the algorithms on the validation data set. According to the regression plots, GAM has the highest $R^2$ value of 99.54%, followed by BBT (99.32%), DT (99.21%), SVR (98.11%), GPR (95.82%), and MLP (95.30%), while LR has the lowest $R^2$ value (94%). Table 4 compares the performance of all the designed algorithms in terms of RMSE and MAE. From Table 4, it can be observed that GAM attains the lowest RMSE and MAE of all the techniques. The next step is to investigate the performance of the predictive models on the testing data subset. Figure 7 shows the individual tracking plots of all the designed ML algorithms for the testing dataset. It can be observed from Figure 7 that GAM tracks the testing dataset with the highest $R^2$ value of 99.58%, followed by BBT (99.28%), DT (99.12%), SVR (97.93%), GPR (95.65%), and MLP (95.20%). In this case also, LR attains the lowest $R^2$ value (93.70%). Finally, Figure 8 shows the tracking performance of the ML algorithms on 50 data points (6000-6050) for a closer comparison. The values predicted by GAM lie nearer to the testing data points than those of the other ML algorithms. Figure 9 shows the error distribution plot of the predictive models for CCPP electrical energy prediction. Table 5 displays the maximum and minimum error deviations for all the predictive models. The maximum and minimum deviations attained by GAM are 11.2470 and −10.9319, respectively. Hence, Table 5 reveals that the error deviation is smaller for GAM than for the other ML algorithms, which is also confirmed by visual inspection of Figure 9.
In addition, the nonparametric Mann-Whitney U (M-W) test [31] is performed to compare the probability distributions of the actual and predicted values for all the developed models.
The M-W test compares the actual and predicted outputs to investigate whether both are drawn from the same distribution or whether their median values differ. Table 6 shows the outcomes of the M-W test for all the predictive models. It can be observed that the Z value is highest for SVR (1.8437) and smallest for GAM (0.0294). The one-tailed (1t-P) and two-tailed (2t-P) p-values are homogeneous, meaning no large deviations are observed. Further, GAM has larger values of 1t-P (0.4882) and 2t-P (0.9764) than the other techniques, which shows the effectiveness of the proposed model. Furthermore, the performance of GAM relative to the other designed models and to models existing in the literature is assessed on the basis of performance index (PI) and rank analysis values. For this analysis, all the relationships evaluated using RMSE and MAE presented in Table 7 are considered. The PI is defined by the following equation [32].
where $a$ is related to every predictive model. From Table 6, it can be seen that the predictive models based on GAM (0.2514), BBT (0.2830), DT (0.2987), and SVR (0.5925) have better accuracy than the existing models. However, the predictive model developed in [11] has a PI value of 0.7677, which is better than those of LR (1), MLP (0.8581), and GPR (0.8091).
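The Mann-Whitney comparison of actual and predicted outputs can be sketched as follows. This is a minimal illustration that assumes the large-sample normal approximation without tie or continuity corrections; the function names are hypothetical, and the software the authors actually used is not stated.

```python
import math


def p_values_from_z(z):
    """One- and two-tailed p-values for a standard normal statistic |z|."""
    two_tailed = math.erfc(abs(z) / math.sqrt(2))
    return two_tailed / 2, two_tailed


def mann_whitney_u(a, b):
    """Return (U, z, one_tailed_p, two_tailed_p) for samples a and b."""
    n1, n2 = len(a), len(b)
    combined = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1  # average rank for tied values
        i = j + 1
    rank_sum_a = sum(r for r, (_, g) in zip(ranks, combined) if g == 0)
    u_a = rank_sum_a - n1 * (n1 + 1) / 2
    u = min(u_a, n1 * n2 - u_a)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = abs(u - mean_u) / sd_u
    one_tailed, two_tailed = p_values_from_z(z)
    return u, z, one_tailed, two_tailed
```

Under this approximation, `p_values_from_z(0.0294)` gives roughly 0.488 and 0.977, in line with the 1t-P and 2t-P values reported for GAM. A small Z with p-values near 0.5 and 1 means the test finds no evidence that the predicted and actual outputs come from different distributions.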

A. UNCERTAINTY ANALYSIS
The confidence ranges of the forecast errors ($CL^{\pm}$) are calculated using equation (13) to measure the uncertainty associated with all the predictive models [33], where $\xi$ and $\omega$ are the mean and standard deviation of the forecast error, respectively, and $D_\lambda$ is the standard variable at the $\lambda$% significance level. Figure 10 shows the uncertainty band bar graph of all the models for CCPP electricity generation prediction.
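A hedged sketch of such an uncertainty band is given below. Equation (13) is not reproduced in this text, so the sketch assumes the common form $CL^{\pm} = \xi \pm D_\lambda\,\omega$, with $D_\lambda$ taken as the two-sided standard normal quantile (about 1.96 at the 5% level). Both the assumed form and the function name are illustrative and may differ from the authors' definition.

```python
from statistics import NormalDist, mean, stdev


def error_confidence_band(errors, significance=0.05):
    """Return (lower, upper) = xi -/+ D * omega for the forecast errors.

    xi is the mean error, omega the sample standard deviation, and D the
    two-sided standard normal quantile (assumed form of the band).
    """
    xi = mean(errors)
    omega = stdev(errors)
    d = NormalDist().inv_cdf(1 - significance / 2)
    return xi - d * omega, xi + d * omega


# Hypothetical forecast errors, not taken from the paper.
band = error_confidence_band([-1.0, 0.0, 1.0])
```

A narrower band indicates that a model's forecast errors are more tightly concentrated around their mean.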
VOLUME 10, 2022

For every $I_i$, a higher value of $S_A(i,j)$ shows a greater dependency between that predictive variable and the target attribute ($O_j$). Figure 11 shows that the $S_A$ values for the input attributes AP (0.9994) and RH (0.9829) are higher than those for AT (0.9228) and V (0.9661). This suggests that AP and RH are the most important elements in estimating PE.

V. CONCLUSION
In this article, a gradient boosted generalized additive model (GAM) ML algorithm is proposed for the development of a predictive model for a combined cycle power plant (CCPP). Initially, the CCPP dataset is preprocessed by removing the outliers using a quartile-based method and replacing the missing values using the median method. The preprocessed dataset is then split into training, validation, and test subsets. Furthermore, optimal values of the GAM hyper-parameters are estimated using the trial and error method. In addition, predictive models based on LR, GPR, MLP, SVR, DT, and BBT are designed for comparison with GAM. The performance of the presented models has been analyzed with different statistical measures, namely RMSE, MAE, and $R^2$. A detailed comparison has been carried out among all the predictive models on the basis of violin plots and the nonparametric M-W test. Results suggest that GAM shows the best performance amongst the seven models, with RMSE = 1.1053, MAE = 0.8187, and PI = 0.2514. Finally, an uncertainty analysis was conducted for all the models, in which GAM shows the least uncertainty in predicting the electrical energy of CCPP. As a result of this study and the overall review, it can be said that the proposed model has a better ability to improve plant reliability and financial performance than the other predictive models.