The Deep Convolutional Neural Network for NOx Emission Prediction of a Coal-Fired Boiler

This paper presents a methodology for predicting NOx emissions of a coal-fired boiler using real operation data, coal properties and a CNN (Convolutional Neural Network). Two building blocks are carefully designed following practical guidelines for lightweight CNN architecture design. These building blocks are then used to develop a deep CNN-based model for NOx prediction. A comprehensive comparison among different prediction models based on DL (Deep Learning) shows that the proposed deep CNN-based prediction model outperforms the other prediction models in terms of RMSE (Root Mean Square Error). The results indicate that the developed deep CNN-based prediction model has higher accuracy and better numerical stability. In addition, the architecture design of a DL-based prediction model has a significant impact on its performance.


I. INTRODUCTION
Coal is the primary fuel used in power plants to generate electricity. However, NOx emissions from coal combustion harm human health and pollute the environment [1]. The control of NOx emissions from coal combustion is still a concern in many countries. Therefore, the clean and efficient utilization of coal in power plants has become one of today's main objectives in coal combustion research.
The combustion optimization approach can effectively reduce NOx emissions by adjusting the operational parameters [2], [3]. For this approach, it is crucial to develop an accurate prediction model for NOx emissions at the furnace exit. However, because of complex combustion dynamics, fluid mechanics and nitrogen conversion chemistry, it is difficult to build such a prediction model based on the overall dynamics of the boiler. Alternatively, advanced machine learning algorithms can be used to model the relationship between the related operational parameters and NOx emissions at the furnace exit.
Some studies have applied shallow learning algorithms (such as shallow neural networks, support vector machines, extreme learning machines, and their variants) to model NOx emissions in coal combustion processes [4]-[7]. These studies have achieved some success in predicting NOx emissions; however, shallow learning algorithms have several weaknesses. First, the operational variables used for modeling NOx emissions carry complex information that partly reflects the complex dynamics of the boiler and peak load regulation. Shallow algorithms have difficulty learning such complex nonlinear functions [8]. Second, these algorithms have some restrictions on the size of the training data set. They are prone to overfitting when using a large data set [9]. Overfitting can severely degrade algorithm performance. Third, these algorithms cannot learn distributed and hierarchical feature representations from the data. Thus, much of the actual effort in deploying these algorithms goes into the design of preprocessing pipelines [10].
The associate editor coordinating the review of this manuscript and approving it for publication was Venkateshkumar M.
Recent studies have focused on introducing deep learning (DL) algorithms for modeling NOx emissions during coal combustion. Thanks to high-performance computing systems, DL algorithms can model complex nonlinear relationships and learn internal representations from a large amount of data [8]-[10]. Also, some techniques have been proposed to alleviate the overfitting problem. Wang et al. developed a DL-based NOx prediction model based on the deep belief network [11]. However, the feature representation process is completely independent of the NOx prediction process in this model. This design approach degrades the model's performance. Tan et al. used the recurrent neural network with Long Short-Term Memory (which we will concisely refer to as LSTM) to model NOx emissions [12]. Yang et al. used two LSTMs to build a DL-based NOx prediction model [13]. Xie et al. used an LSTM variant called bidirectional LSTM as the building block of an encoder-decoder architecture to predict NOx emissions [14]. LSTM can capture long-term temporal dependencies from data by storing history information, which increases storage and computing costs. Thus, training LSTM or its variants is difficult and time-consuming [15].
Compared with LSTM, the convolutional neural network (CNN) usually has a considerably lower computational cost on certain sequence-processing problems. CNN is a type of feed-forward artificial neural network that uses the convolution operation in place of general matrix multiplication to reduce the computational burden [16]. The representations learned from data are translation invariant, which means the representations do not change even though the input of the CNN is translated by a small amount. Thus, CNN can use fewer training samples to learn representations with better generalization power. Deep CNN has become the dominant approach in computer vision since AlexNet won the ImageNet Challenge [17]-[21]. However, the application of CNN to modeling NOx emissions in coal-fired power plants is very limited.
In the present work, we propose a deep CNN-based model for predicting NOx emissions of a coal-fired boiler, aiming to develop an accurate prediction model for NOx emissions at the furnace exit for more effective emissions reduction. Two building blocks are designed to learn richer data representations with fewer parameters. The building blocks are used to build a lightweight model to predict the NOx emissions at the furnace exit of a 330 MW pulverized coal-fired utility boiler. The data samples from the distributed control system (DCS) are employed to train and test the proposed NOx emission prediction model. Furthermore, comparisons with other DL-based NOx prediction models are conducted.
The remainder of this paper is organized as follows. Section II describes the development of the building blocks and the deep CNN-based prediction model. Section III describes the detailed application to NOx emissions prediction and the model comparisons. Section IV closes with a summary and conclusion.

II. CONSTRUCTION OF THE DEEP CNN-BASED PREDICTION MODEL ARCHITECTURE
A. BRIEF DESCRIPTIONS OF THE BOILER AND DATA PREPARATION
The studied boiler is a 330 MW subcritical tangential pulverized coal-fired utility boiler manufactured by Shanghai Boiler [...] vertical direction. Five medium-speed coal pulverizers are put into operation to supply fuel for combustion. Coal-air mixtures are fed to the burners on levels A-E. Four layers of over-fire air (OFA) nozzles are fixed over the upper nozzles to replenish the air in the late stage of combustion for better combustion efficiency.
The data for modeling NOx emissions consist of three parts. First, the coal burned in the boiler is an important factor in NOx formation. The coal properties are given by industrial analysis and the analysis results are listed in Table 1. Because the power plant lacks an on-line coal analyzer, no real-time coal property data are available. Thus, the coal properties from the industrial analysis are used in building the NOx prediction model. Second, fifty-five operational variables, including boiler load (one), main steam pressure (one), total fuel flow (one), total air flow (one), coal-feeder rate (five), primary air flow (five), primary air temperature (five), main steam temperature (one), total secondary air flow (two), secondary air temperature (two), secondary air flow (twenty-four), main steam flow (one), OFA air flow (four), and oxygen concentration before the selective catalytic reduction inlet (two), have been selected based on the engineers' advice and knowledge of the tangentially coal-fired boiler. The data points covering ten days are obtained from the DCS with a time resolution of 1 second. Third, NOx emissions at side A and side B of the furnace exit (two) have also been considered.
To construct the dataset for modeling, three steps are applied in succession to the raw data. First, extreme outliers are removed to improve data quality. Second, the data are normalized to make learning easier for our prediction model. All the data are standardized by removing the mean and scaling to unit variance as follows:
$$z = \frac{x - \mu}{s}$$
where x and z are the operational variable or NOx emissions before and after scaling, µ is the sample mean, and s is the standard deviation. Third, the data samples in the dataset used for modeling NOx emissions should have the following form: . . .
In the above matrix, p denotes the row vector containing the coal properties, o(t) denotes the row vector containing the selected operational variables at time t, n(t) denotes the row vector containing NOx emissions at time t, and K_1 and K_2 are nonnegative integers. Thus, these data samples form a multivariate time series. In this study, both K_1 and K_2 are equal to 60. Modeling NOx emissions based on this dataset means using the data samples in the first 60 seconds to predict the mean of NOx emissions in the next 60 seconds.
There are two motivations for this design. First, there is a stronger correlation between adjacent data samples. Second, data samples with a larger value of K_1 contain more information.
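The windowing described above can be illustrated with a minimal sketch. The exact sample matrix is not recoverable from the text, so the layout used here (each input row concatenating p, o(t) and n(t), followed by a target equal to the mean NOx over the next K_2 seconds) is an assumption, as are the function names `standardize` and `make_samples` and the use of the population standard deviation.

```python
from statistics import mean, pstdev

def standardize(column):
    """Scale one variable to zero mean and unit variance: z = (x - mu) / s."""
    mu = mean(column)
    s = pstdev(column)  # assumption: population standard deviation
    return [(x - mu) / s for x in column]

def make_samples(p, o, n, k1, k2):
    """Build (input window, target) pairs from per-second records.

    p : list of coal properties (constant over time)
    o : list of operational-variable rows, one row per second
    n : list of NOx rows [side_A, side_B], one row per second
    """
    samples = []
    for t in range(k1, len(o) - k2 + 1):
        # Input: k1 consecutive rows, each row = p + o(i) + n(i) (assumed layout).
        window = [p + o[i] + n[i] for i in range(t - k1, t)]
        # Target: mean NOx on each side over the next k2 seconds.
        future = n[t:t + k2]
        target = [mean(row[side] for row in future) for side in range(len(n[0]))]
        samples.append((window, target))
    return samples

# Toy data with K_1 = K_2 = 2 instead of 60 to keep the example small.
p = [0.3]
o = [[float(i)] for i in range(6)]
n = [[10.0 + i, 20.0 + i] for i in range(6)]
samples = make_samples(p, o, n, k1=2, k2=2)
scaled = standardize([1.0, 2.0, 3.0])
```

With six seconds of toy data and windows of two seconds, three samples are produced; the first target is the mean of the NOx values at seconds 2 and 3.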

B. BRIEF INTRODUCTION TO CNN
Recently, there has been considerable progress in designing small and computation-efficient deep CNN architectures for mobile and embedded vision applications, such as ShuffleNetV1 [20] and ShuffleNetV2 [21]. These architectures are suitable for applications which need to be carried out in a timely fashion. The CNNs used in these architectures are designed for processing image data. This type of CNN is referred to as a 2D CNN layer. However, the multivariate time series data for modeling NOx emissions are completely different from image data. Thus, we consider a type of CNN for processing multivariate time series data, which is referred to as a 1D CNN layer. The data operated on by a CNN are called a feature map, and each column vector in the feature map is called a channel. The computation procedure of the 1D CNN layer is shown in Fig. 2 (a). Firstly, we determine the size of the convolution window. The convolution window is used to extract patches from the multivariate time series along the time axis. The extracted patches are essentially numerical matrices having the same dimensions as the convolution window. Secondly, the extracted patches are sent to a group of convolution kernels. Each convolution kernel is a weight matrix that is not predefined but is learned during the training process of the 1D CNN layer. A scalar value is obtained by taking the dot product of the patch and the convolution kernel. Such an instance is shown in Fig. 2 (b). The dot product of two matrices is defined as follows:
$$A \cdot B = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij} b_{ij}$$
where A = (a_ij)_{m×n} and B = (b_ij)_{m×n}.
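As a small illustration of the procedure just described, the following sketch slides a convolution window along the time axis and takes the matrix dot product of each patch with each kernel. It is a simplified sketch, not the layer used in the paper: there is no padding and no bias term, and the names `dot` and `conv1d` and the toy weights are hypothetical.

```python
def dot(a, b):
    """Dot product of two equally sized matrices: sum of a_ij * b_ij."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))

def conv1d(series, kernels, stride=1):
    """1D convolution along the time axis, without padding or bias.

    series  : list of T rows, each with C input channels
    kernels : list of kernels, each a (window size x C) weight matrix
    returns : list of output rows with one value per kernel (output channel)
    """
    window = len(kernels[0])
    out = []
    for start in range(0, len(series) - window + 1, stride):
        patch = series[start:start + window]  # (window size x C) patch
        out.append([dot(patch, k) for k in kernels])
    return out

series = [[1, 0], [2, 1], [3, 0], [4, 1]]  # T = 4 time steps, C = 2 channels
kernels = [
    [[1, 0], [1, 0]],   # sums channel 0 over a window of 2
    [[0, 1], [0, -1]],  # difference of channel 1 across the window
]
feature_map = conv1d(series, kernels)               # one row per window position
downsampled = conv1d(series, kernels, stride=2)     # every second window position
```

Each kernel produces one output channel, so two kernels applied to four time steps with a window of 2 give a feature map with three rows and two channels; with stride 2 only two rows remain.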

C. ARCHITECTURE DESIGN OF THE BASIC BUILDING BLOCKS
Practical guidelines for lightweight CNN architecture design are proposed in [21]. Based on these guidelines for ShuffleNetV2, two basic building blocks are designed in this study. However, the original architecture of ShuffleNetV2, which is designed for computer vision tasks, cannot process the multivariate time series data for modeling NOx emissions. In order to process the multivariate time series data feasibly and efficiently, two important modifications are made in our basic building blocks. Firstly, the first 2D CNN layer in the first dashed box in Fig. 3 (a) is replaced by a 1D CNN layer in the first dashed box in Fig. 3 (b). Secondly, the components in the second dashed box in Fig. 3 (a) are replaced by a 1D separable CNN layer in the second dashed box in Fig. 3 (b). As shown in Fig. 3 (b), there is a channel split operator at the beginning of the basic building block. The input having k channels is split into two branches with k_1 and k_2 channels, respectively. The left branch is a shortcut connection as introduced in ResNet [18]. It can be considered an identity map through which all information is always passed. The right branch consists of four components. The first component is a 1D CNN layer. The size of the convolution window of this CNN layer must be fixed to 1. It can be considered a bottleneck layer that reduces the number of input feature maps and thus improves computational efficiency. Next, the second component contains batch normalization [22] and a rectified linear unit [23]. Batch normalization maintains an exponential moving average of the batch-wise mean and variance of the data during training. It has been shown to accelerate the training process of the CNN layer. The rectified linear unit (ReLU) is defined by the activation function f(x) = max{0, x}. It is considered the most important factor in improving the performance of CNNs [24]. The third component is a 1D separable CNN layer.
The separable CNN layer, which is also referred to as depth-wise separable convolution, consists of the depth-wise convolution and the pointwise convolution [25]. First, the depth-wise convolution performs a convolution operation independently on each channel of its input. Second, the pointwise convolution creates a linear combination of the output channels of the depth-wise convolution. Some studies have demonstrated that the separable CNN layer can efficiently reduce the computation cost and learn better representations using fewer data [19]. The 1D separable CNN layer is the version of the separable CNN layer which can process multivariate time series. The fourth component has the same effect as the second component. The results of the two branches are concatenated to keep the number of channels the same as the input. At the end of the basic building block, the channel shuffle operation is used to reorder the output channels to enable information exchange between different channels. Fig. 3 (c) shows the architecture of the basic building block with stride 2. The stride is a parameter defined as the distance between two successive convolution windows. Using a stride equal to 2 means the number of rows of the input feature map is down-sampled by a factor of 2 to reduce the computational cost and the number of parameters. In addition, the risk of overfitting is limited. The basic building block with stride 2 differs from the basic building block in three ways. Firstly, the channel split operator is removed. Secondly, in the right branch, the 1D separable CNN layer is replaced by a 1D separable CNN layer with stride 2. Thirdly, in the left branch, a 1D separable CNN layer is added so that its output has the same size as the output of the right branch. BN and ReLU achieve the same effect as in the basic building block.
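The data flow of the basic building block (channel split, identity shortcut, concatenation, channel shuffle) can be sketched as follows. This is an illustrative sketch only: `right_branch` is a hypothetical stand-in for the four learned components, the shuffle uses the usual reshape-transpose-flatten form with two groups, and the weight counts ignore biases and assume a window of 3 with 64 input and output channels, values that are not taken from the paper.

```python
def channel_shuffle(row, groups=2):
    """Interleave channels across groups: reshape to (groups, n), transpose, flatten."""
    n = len(row) // groups
    return [row[g * n + i] for i in range(n) for g in range(groups)]

def basic_block(feature_map, right_branch, k1):
    """Channel split -> (identity | right branch) -> concatenate -> channel shuffle."""
    left = [row[:k1] for row in feature_map]                  # shortcut: passed through unchanged
    right = right_branch([row[k1:] for row in feature_map])  # stand-in for the learned layers
    merged = [l + r for l, r in zip(left, right)]             # same channel count as the input
    return [channel_shuffle(row) for row in merged]

def conv_weights(window, c_in, c_out):
    """Weight count of a regular 1D CNN layer (bias ignored)."""
    return window * c_in * c_out

def separable_conv_weights(window, c_in, c_out):
    """Depth-wise part (window * c_in) plus pointwise part (c_in * c_out), bias ignored."""
    return window * c_in + c_in * c_out

# Toy feature map: 2 time steps, 4 channels; the stand-in right branch negates its input.
feature_map = [[0, 1, 2, 3], [10, 11, 12, 13]]
negate = lambda fm: [[-v for v in row] for row in fm]
block_out = basic_block(feature_map, negate, k1=2)

regular = conv_weights(3, 64, 64)
separable = separable_conv_weights(3, 64, 64)
```

After the shuffle, channels from the shortcut and from the right branch alternate, so the next block's split sends a mix of both to each branch. For the assumed sizes, the separable layer needs roughly a third of the weights of the regular layer.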

D. DEEP CNN-BASED MODEL FOR NOx PREDICTION
The block diagram in Fig. 4 shows the deep CNN-based model for NOx emissions prediction. The model is a streamlined architecture based on the basic building blocks in Fig. 3 (b) and (c). The CNN layers used in the model serve two purposes: (1) they gradually increase the number of channels of the output feature map; (2) they gradually reduce the number of rows of the output feature map. This guideline makes the deep CNN-based model wider and deeper, which has been shown to increase the performance of the model [19]-[21].
The first component is a 1D CNN layer with stride 2. The size of the convolution window of this CNN layer is set to 3. The second component, consisting of BN and ReLU, has the same effect as the corresponding components in the basic building block. The following three components are stages that have the same structure but different parameters. In each stage, the basic building block with stride 2 is set at the beginning, and the basic building block is repeated three times. The data representations are further refined after these three stages. Next, the sixth component is also a 1D CNN layer. The size of its convolution window is set to 1. The seventh component is the same as the second component. After that, the data representations have two dimensions and cannot be used directly for prediction. Consequently, the global average pooling layer is introduced to reduce the dimensions of the data representations, resulting in one-dimensional vectors. These vectors go through the final component, which is a regular fully-connected layer (FC layer). This FC layer has two outputs for NOx emissions at side A and side B. The first eight components are used to extract the data representations from the multivariate time series, and the final component is used to predict NOx emissions.
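The last two components can be illustrated with a short sketch: global average pooling collapses the time axis of a two-dimensional feature map into a vector, and a fully-connected layer maps that vector to the two NOx outputs. The weights, biases and feature-map values below are made-up toy numbers, and the function names are hypothetical.

```python
def global_average_pooling(feature_map):
    """Average each channel over the time axis: (T x C) -> vector of length C."""
    steps = len(feature_map)
    return [sum(channel) / steps for channel in zip(*feature_map)]

def fully_connected(vector, weights, biases):
    """One output per weight row: weighted sum of the inputs plus a bias."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, biases)]

feature_map = [[1.0, 4.0, 0.0],
               [3.0, 0.0, 2.0]]           # T = 2 time steps, C = 3 channels
pooled = global_average_pooling(feature_map)

weights = [[1.0, 0.0, 1.0],               # toy weights for the side A output
           [0.0, 0.5, 0.0]]               # toy weights for the side B output
biases = [0.5, -1.0]
nox_a, nox_b = fully_connected(pooled, weights, biases)
```

Because pooling averages over time, the length of the vector depends only on the number of channels, not on how many rows survive the down-sampling stages.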
To evaluate our prediction model, the dataset is split into three sets. The training set consists of 60% of the data samples; the validation set consists of 30% of the data samples; and the test set consists of 10% of the data samples. The division of the data depends on the size of the dataset, which covers different operation conditions. It is stressed that a training set and a validation set containing enough data can reduce the generalization error of the prediction model. Root mean square error (RMSE), a widely used performance measure for regression problems, is introduced to evaluate the performance of the NOx prediction model. It gives an idea of how much error the model typically makes in its predictions, with a higher weight for large errors. It is defined as
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}$$
where N denotes the number of data samples, y_i denotes the measured value and ŷ_i denotes the corresponding predicted value.
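A minimal sketch of the evaluation setup follows. The RMSE function implements the standard definition; the split function assumes the samples are divided in their stored order, which the text does not specify, and both function names are hypothetical.

```python
import math

def rmse(measured, predicted):
    """Root mean square error between measured and predicted values."""
    n = len(measured)
    return math.sqrt(sum((y - y_hat) ** 2 for y, y_hat in zip(measured, predicted)) / n)

def split_dataset(samples, train=0.6, val=0.3):
    """Split into training, validation and test sets (assumption: in stored order)."""
    n_train = round(len(samples) * train)
    n_val = round(len(samples) * val)
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])

train_set, val_set, test_set = split_dataset(list(range(100)))

# Same total absolute error (4), but concentrated in one sample in the second case.
spread_out = rmse([0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0])
one_large = rmse([0.0, 0.0, 0.0, 0.0], [4.0, 0.0, 0.0, 0.0])
```

The two toy cases show the "higher weight for large errors" property: one error of 4 gives an RMSE of 2, whereas four errors of 1 give an RMSE of 1.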

A. NOx PREDICTION RESULTS
We implemented our model using the open-source deep learning library Keras with the TensorFlow back-end [26]. A single NVIDIA GeForce GTX 1080 is used. The convolution library is CUDNN 10.0 [27]. The following optimization configuration is used for our model: the optimizer is Adam [28]; the initial learning rate is 0.001 and it is decayed by a factor of 0.95 every 5 epochs.
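The learning-rate schedule stated above can be written as a small function. This sketch assumes a staircase decay, where the rate is held constant within each 5-epoch interval and epochs are counted from 0; the text does not say whether the decay is stepwise or continuous. The function name is hypothetical.

```python
def learning_rate(epoch, initial=0.001, factor=0.95, every=5):
    """Staircase decay: multiply the rate by `factor` once per `every` epochs."""
    return initial * factor ** (epoch // every)

schedule = [learning_rate(epoch) for epoch in range(12)]
```

In Keras, a function of this shape could be supplied to the `LearningRateScheduler` callback, although the paper does not state how the schedule was implemented.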
To avoid the overfitting problem, the early stopping strategy is applied to the validation set. Thus, a model checkpoint procedure should be performed either on the training set or validation set to keep the best model during the training process. The model follows most of the hyper-parameters used in [21]. There are 30 runs of our model to evaluate model reliability. The summary statistics are shown in Table 2. The mean RMSEs of the test set at side A and side B are 1.11 mg/Nm³ and 1.06 mg/Nm³, respectively. The lowest mean RMSEs show that our model has high prediction accuracy on the test set. The standard deviations of RMSEs at side A and side B are 0.68 mg/Nm³ and 0.59 mg/Nm³, respectively. The lowest standard deviations of RMSEs demonstrate the good stability of our proposed model.
For the 3rd run, RMSEs at side A and side B are 0.94 mg/Nm³ and 1.07 mg/Nm³, which are very close to the average RMSEs at side A and side B. Fig. 5 shows the predicted values at the 3rd run on the test set. The predicted values are in good agreement with the reference values. Fig. 6 shows the relative errors at the 3rd run on the test set. The maximum relative error is 1.55% at side A, and the maximum relative error is -1.6% at side B. The good prediction performance on the test set exhibits a satisfactory capability of the deep CNN-based prediction model in this study.
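Early stopping combined with a model checkpoint can be sketched as a loop over per-epoch validation losses. The patience value and the loss sequence below are hypothetical, since the paper does not report them, and the checkpoint is represented only by remembering the best epoch.

```python
def early_stopping(val_losses, patience=3):
    """Return (best_epoch, best_loss, stop_epoch) for a sequence of validation losses.

    Training stops once `patience` epochs have passed without improvement;
    the "checkpoint" is the epoch with the lowest validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch  # keep this model as the checkpoint
        elif epoch - best_epoch >= patience:
            return best_epoch, best_loss, epoch  # stop: no improvement for `patience` epochs
    return best_epoch, best_loss, len(val_losses) - 1

val_losses = [5.0, 3.0, 2.0, 2.5, 2.2, 2.1, 1.0]
best_epoch, best_loss, stop_epoch = early_stopping(val_losses, patience=3)
```

In this toy sequence training stops at epoch 5 and the model from epoch 2 is kept, so the lower loss at epoch 6 is never reached; this is the usual trade-off when choosing the patience.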
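The relative errors discussed above can be computed as in the following sketch. The sign convention (predicted minus reference, divided by reference) is an assumption, as the paper does not define it, and the values are made-up illustrations rather than data from the paper.

```python
def relative_error_percent(predicted, reference):
    """Signed relative error in percent (assumed convention: predicted minus reference)."""
    return 100.0 * (predicted - reference) / reference

def max_relative_error(predictions, references):
    """Signed relative error with the largest magnitude."""
    errors = [relative_error_percent(p, r) for p, r in zip(predictions, references)]
    return max(errors, key=abs)

references = [200.0, 250.0, 300.0]   # illustrative reference NOx values, mg/Nm³
predictions = [202.0, 246.0, 301.5]  # illustrative predicted values, mg/Nm³
worst = max_relative_error(predictions, references)
```

Selecting by magnitude while keeping the sign matches the way a negative maximum such as -1.6% can be reported.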

B. MODEL COMPARISONS AND DISCUSSIONS
In this section, we survey a variety of DL-based prediction models built on the leading building blocks and compare them with our proposed model. For a fair comparison, we do not use any data preprocessing methods other than those used in building the dataset, and all prediction models are trained in the same environment. The DL-based prediction models for comparison are as follows:
(1) VGGNet is a deep CNN architecture consisting of multiple 2D CNN layers. Its building block is a single 2D CNN layer. Following the design principle of VGGNet, the VGG-like prediction model was developed based on the [...] Fig. 7. The first two stages have the same structure, and the next three stages have the same structure. This model follows most of the hyper-parameters used in [17].
(2) ResNet is a deep CNN architecture introducing shortcut connections which can improve the training efficiency. Its building block consists of multiple 2D CNN layers and a shortcut connection as shown in Fig. 8 (a). We use 1D CNN layers to replace 2D CNN layers. The modified building blocks of ResNet-18 are shown in Fig. 8 (b) and (c). The overall architecture of the ResNet-like prediction model is shown in Fig. 8 (d). In the ResNet-like prediction model, the stage consists of a modified building block of ResNet with stride 2 and a modified building block of ResNet. This model follows most of the hyper-parameters used in [18].
(3) Xception architecture can be considered as a linear stack of 2D separable CNN layers. The original building block of Xception is shown in Fig. 9 (a). We use the 1D separable CNN layers to replace the 2D separable CNN layers in the original building block of Xception. The modified building block of Xception is shown in Fig. 9 (b). The overall architecture of the Xception-like prediction model is shown in Fig. 9 (c). In the prediction model, the stage consists of a single modified building block of Xception. This model follows most of the hyper-parameters used in [19].
(4) ShuffleNetV1 is a lightweight CNN architecture which also introduces the channel shuffle operator. Its building block is designed based on depth-wise CNN layers and group CNN layers as shown in Fig. 10 (a). The modified building blocks of ShuffleNetV1 are shown in Fig. 10 (b) and (c). Fig. 10 (d) shows the overall architecture of the ShuffleNetV1-like prediction model. In the prediction model, the stage consists of a modified building block of ShuffleNetV1 with stride 2 and three modified building blocks of ShuffleNetV1. This model follows most of the hyper-parameters used in [20].
(5) The LSTM layer has been used to build the prediction models in [12] and [13]. The detailed architecture of the LSTM layer can be found in [29]. As shown in Fig. 11 (a), the LSTM-based prediction model consists of an LSTM layer and an FC layer. Because the number of units is an important hyper-parameter of the LSTM layer, we test LSTM-based prediction models with different numbers of units. We use, for example, LSTM-100 to denote an LSTM layer with 100 units. There are 10 runs for each number of units. The detailed summary statistics of the prediction results are shown in Table 3. The average RMSE and the standard deviation of RMSE rise as the number of units increases. Among the three settings, LSTM-10 achieves the best results. Thus, we select LSTM-10 for comparison.
(6) The bidirectional LSTM (BLSTM) layer, which is a variant of the LSTM layer, consists of two LSTM layers, one processing the input sequence forwards and the other backwards. The detailed architecture of the BLSTM layer can be found in [29]. As shown in Fig. 11 (b), the BLSTM-based prediction model consists of a BLSTM layer and an FC layer. Again, there are 10 runs for each number of units. The detailed summary statistics of the prediction results are shown in Table 4. It is clear that BLSTM-10 has the best results. Thus, we select BLSTM-10 for comparison.
(7) Stacking multiple LSTM layers (or BLSTM layers) is a way to form a deeper model [30]. Based on the results in Table 3 and Table 4, we perform 10 runs for some settings, and the summary statistics of the prediction results are shown in Table 5. It is clear that the prediction results are not improved by adding more layers. Thus, we do not use this class of models for comparison.
A DL-based model has also been constructed to estimate NOx emissions of coal-fired power plants in [11]. This model has exhibited satisfactory prediction accuracy, and further details can be found in [11].
The variations of RMSEs of the eight models among 30 runs are shown in Fig. 12. The minimum RMSE at side A is 0.25 mg/Nm³, achieved by our prediction model at the 7th run, and the minimum RMSE at side B is 0.28 mg/Nm³, achieved by our prediction model at the 5th run. For our prediction model, a smooth trend of RMSEs is observed, which empirically demonstrates the good performance of our prediction model. Similar smooth trends of RMSEs can be observed for the Xception-like model and the ShuffleNetV1-like model. There are some fluctuations in the trends of RMSEs for the VGG-like model and the ResNet-like model. For LSTM-10 and BLSTM-10, significant fluctuations can be observed in the RMSEs. The maximum RMSE at side A is 57.73 mg/Nm³, reached by LSTM-10 at the 28th run, and the maximum RMSE at side B is 82.21 mg/Nm³, reached by LSTM-10 at the 21st run. These significant outliers mean that LSTM-10 and BLSTM-10 sometimes fail to learn effective data representations from the multivariate time series, which contains more information. In other words, it is difficult to obtain acceptable results from a prediction model based on a single LSTM layer or a single BLSTM layer. Combined with the results in Table 5, the prediction performance cannot be improved by simply stacking more LSTM layers (or BLSTM layers). This is mainly due to the lack of practical architecture design guidelines for organizing multiple LSTM layers or their variants. Although the model in [11] has a smooth trend of RMSEs, its performance is weaker than that of our model. Based on the above results, it is clear that all deep CNN-based prediction models have better results than the LSTM-based prediction models.
There are two reasons: (1) the building blocks in the CNN-based prediction models are carefully designed following the practical guidelines; (2) during the training process of the deep CNN-based prediction model, multiple down-sampling processes are used to reduce the complexity of the data representations learned from the multivariate time series.

IV. CONCLUSION
In this study, a novel deep CNN-based model architecture has been developed for predicting NOx emissions from a 330 MW tangentially coal-fired power plant boiler. The collected raw data are converted into multivariate time series, and the dataset for modeling NOx emissions is built. In order to efficiently process the multivariate time series samples, two basic building blocks are carefully designed based on the combination of the 1D CNN layer, the 1D separable CNN layer, the channel split operator and the channel shuffle operation. The overall prediction model architecture is developed mainly based on these two basic building blocks. The comparisons among the different prediction models suggest that our proposed model has the best performance. In particular, the minimum RMSE of the test set at side A is 0.25 mg/Nm³ and the minimum RMSE of the test set at side B is 0.28 mg/Nm³. The results also demonstrate that architecture design is important for building an accurate prediction model. Two factors affect the accuracy of the prediction model: (1) the developed deep CNN-based prediction model depends on sufficient data covering different operation conditions; (2) recent advances in modern network architectures, which are also crucial components of other state-of-the-art networks, are adopted in our prediction model. The proposed model architecture has good potential to predict NOx emissions of similar pulverized coal-fired utility boilers with adequate data.
NAN LI received the B.Eng. degree in automation from the Nanjing University of Science and Technology, Nanjing, China, in 2006, and the M.Sc. degree in systems engineering and the Ph.D. degree in control theory and control engineering from North China Electric Power University, Beijing, China, in 2011 and 2017, respectively. He is currently a Lecturer with the School of Information and Electrical Engineering, Lu Dong University. His current research interests include machine learning and digital signal/image processing.
YONG HU received the Ph.D. degree in control theory and control engineering from North China Electric Power University, Beijing, China, in 2015. He is currently holding a postdoctoral position in energy engineering with the Mechanical Engineering College, North China Electric Power University. He has been engaged in the research of intelligent power generation operation control systems, modeling and optimal control of thermal power plant for a long time.