Least Square Parallel Extreme Learning Machine for Modeling NOx Emission of a 300MW Circulating Fluidized Bed Boiler

It is very important to establish an accurate combustion characteristics model of a boiler in order to reduce NOx emission. In this paper, a novel least square parallel extreme learning machine (LSPELM) is first proposed, all of whose weights and thresholds are determined by applying the least square method twice. LSPELM is then applied to 11 classical regression problems to test its validity. The experimental results show that, compared with other methods, LSPELM achieves good generalization and stability with only a few hidden neurons. Next, using Moore-Penrose generalized inverse theory and the Woodbury formula, an online learning method for LSPELM (OLSPELM) based on sample increments is also proposed. If the samples at the current time are the same as those at the last time, the weights and thresholds of OLSPELM remain unchanged; only when the input samples at two adjacent times differ are the weights and thresholds of OLSPELM updated adaptively. Finally, LSPELM and OLSPELM are employed to successfully establish offline and online models of NOx emission concentration for a 300MW circulating fluidized bed boiler. The simulation results show that LSPELM and OLSPELM have better nonlinear generalization ability and stability than some other state-of-the-art models. Therefore, the proposed LSPELM and OLSPELM have good application value.


I. INTRODUCTION
With the increasing energy crisis and awareness of environmental protection, reducing NOx emission from coal-fired boilers is one of the urgent problems to be solved for power plants. In order to reduce NOx emission, an accurate NOx emission concentration model of a boiler first needs to be established. However, the combustion process of a boiler involves many coupled physical and chemical processes, such as turbulence, heat transfer and mass transfer. Furthermore, the multiple operating parameters of a boiler are highly coupled with each other. So, it is difficult to establish an accurate combustion characteristics model of a boiler by using traditional mechanistic modeling methods.
At present, the artificial neural network (ANN) has become a very important tool for intelligent modeling. An ANN is an abstraction, simplification and simulation of the human brain, and has good predictive ability, self-learning ability and stability. If we use an ANN to establish the combustion characteristics model of a boiler, we do not need detailed knowledge of the boiler's complex combustion mechanisms: once historical combustion data of the boiler are available, the model can be established quickly. So, establishing a mathematical model by using an ANN is simple and effective. ANNs have been successfully applied in regression approximation [1], [2], data classification [3], [4], signal processing [5], [6], intelligent control [7]-[9], speech recognition [10]-[12] and other fields.
There are many kinds of ANNs, whose core ideas of weight adjustment mostly come from the back propagation (BP) algorithm and its improvements. BP adopts the steepest descent method to obtain network parameters by iterative calculation. However, it takes a lot of training time to establish a model with the BP algorithm. In addition, BP easily falls into local optima, which leads to low generalization ability. To solve these problems, Huang et al. proposed a new type of neural network, the Extreme Learning Machine (ELM) [13], in 2006.
ELM is a single hidden layer feed-forward neural network. The input weights and hidden layer thresholds are randomly generated, and the output weights are then obtained by solving a system of linear equations. Since the output weights are obtained in one step, the training time of ELM is greatly reduced compared with the BP algorithm. The output weights of ELM are uniquely determined, so the local minimum problem is also avoided [14]. In addition, ELM avoids the complicated process of calculating partial derivatives that BP requires to obtain its network weights. More importantly, ELM has better generalization ability than the BP neural network. These advantages have made ELM widely used in data classification [15], [16], speech recognition [17], [18], wind speed prediction [19]-[21], image processing [22], [23], smart grids [24], [25], and so on. In order to further improve the generalization ability of ELM, many scholars have improved it and obtained many excellent research results [26]-[32].
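To make the one-step training just described concrete, the following minimal NumPy sketch trains a basic ELM under common assumptions (sigmoid activation, samples stored column-wise); the function names and seed are illustrative, not taken from [13].

```python
import numpy as np

def elm_train(X, y, m, seed=0):
    """Basic ELM: random hidden layer, least-squares output weights.

    X: (n, N) inputs; y: (1, N) targets; m: number of hidden neurons.
    """
    rng = np.random.default_rng(seed)
    n, N = X.shape
    W = rng.uniform(-1.0, 1.0, (m, n))       # random input weights
    b = rng.uniform(-1.0, 1.0, (m, 1))       # random hidden thresholds
    H = 1.0 / (1.0 + np.exp(-(W @ X + b)))   # sigmoid hidden outputs, (m, N)
    beta = y @ np.linalg.pinv(H)             # output weights via MP pseudoinverse
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(W @ X + b)))
    return beta @ H                          # (1, N) predictions
```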
In 2015, Tavares et al. proposed an ELM with parallel layer perceptrons (PELM) [32], which is a parallel neural network. The principle of PELM is simple, and its training process does not need iterative calculation either. For n inputs and m nonlinear neurons, PELM provides (n + 1)m linear parameters, while ELM only has m linear parameters (one for each hidden neuron) [32]. PELM therefore provides more freedom for proper adjustment than ELM. In [32], PELM was applied to 12 different regression problems and 6 classification problems. The experimental results showed that, compared with ELM, PELM built a well-conditioned hidden layer and naturally kept the norm of the linear weight parameters under control. So, PELM achieved good generalization performance at a fast speed.
However, since PELM randomly selects the input weights and thresholds of the hidden layer in the lower network, the training may become trapped in a local minimum, and the generalization ability and stability of the network may be affected. How to obtain suitable hidden layer parameters therefore becomes very important. One way to address this problem is to increase the number of hidden layer neurons, so that the randomly assigned parameters approach the effect of optimal parameters as closely as possible. However, this not only complicates the structure of the network, but may also cause over-fitting. So, another way is urgently needed.
In this study, based on Moore-Penrose (MP) generalized inverse theory and the least square method, a least square PELM (LSPELM) is first proposed. LSPELM needs very few hidden layer neurons. Furthermore, its input weights and thresholds are obtained as the least square solution of a system of linear equations, which improves generalization performance. The validity of LSPELM is then verified on 11 standard regression data sets from the UCI database. The experimental results show that LSPELM with a small number of hidden layer neurons has better generalization ability and more stable performance than other networks on most data sets. Next, for the case where data samples arrive one-by-one or chunk-by-chunk (a block of data) with fixed or varying chunk size, an online LSPELM (OLSPELM) based on sample increments is also proposed by using MP generalized inverse theory and the Woodbury formula. If the samples at the current time are the same as those at the last time, the weights and thresholds of OLSPELM remain unchanged; only when the input samples at two adjacent times differ are the weights and thresholds of OLSPELM updated adaptively. The validity of OLSPELM is also verified. Finally, based on the historical combustion data of a 300MW circulating fluidized bed boiler (CFBB) in a thermal power plant, offline and online models of NOx emission concentration are successfully established by using the proposed LSPELM and OLSPELM, respectively. The novelty and main contributions of this study are highlighted as follows.
• A novel LSPELM based on the least square method is first proposed, and then an OLSPELM based on sample increment is also proposed.
• Generalization and stability of proposed LSPELM and OLSPELM are compared with those of state-of-the-art models.
• LSPELM and OLSPELM are employed to establish offline and online models of NOx emission concentration for a boiler.
The rest of this paper is arranged as follows. PELM is reviewed in Section II. A LSPELM is proposed in Section III. An online learning algorithm OLSPELM based on sample increment is proposed in Section IV. The offline and online models of NOx emission concentration for a CFBB are established in Section V. The study is concluded in Section VI.

II. REVIEW OF PELM
The network structure of PELM is described in Fig. 1. Suppose that there is a data set with N arbitrary distinct samples $(x_l, y_l)$, $x_l \in R^n$, $y_l \in R$, $l = 1, 2, \ldots, N$, and that PELM has m hidden layer nodes. U is the matrix of input weights and thresholds of the upper network. V is the matrix of input weights and hidden thresholds of the lower network. $\beta(\cdot)$, $\gamma(\cdot)$ and $\phi(\cdot)$ are activation functions. Therefore, PELM is mathematically modeled as

$y_l = \sum_{j=1}^{m} \beta\Big(\sum_{i=0}^{n} u_{ji} x_{il}\Big)\,\gamma\Big(\phi\Big(\sum_{i=0}^{n} v_{ji} x_{il}\Big)\Big),$  (1)

where $u_{ji}$ and $v_{ji}$ are the elements of U and V, $x_{il}$ is the ith input of the lth sample with $x_{0l} = 1$ (so that $u_{j0}$ and $v_{j0}$ act as thresholds), and $y_l$ is the output of the lth sample. In PELM, the input-output mapping is made by applying the product of the functions [32]

$a_{jl} = \sum_{i=0}^{n} u_{ji} x_{il},$  (2)

$b_{jl} = \phi\Big(\sum_{i=0}^{n} v_{ji} x_{il}\Big).$  (3)

As a particular case of Eq. (1), if the activation functions $\beta(\cdot)$ and $\gamma(\cdot)$ are linear functions, then the network output $y_l$ is calculated by using Eq. (4):

$y_l = \sum_{j=1}^{m} a_{jl}\, b_{jl}.$  (4)

Replacing Eqs. (2) and (3) in Eq. (4), we can obtain Eq. (5):

$y_l = \sum_{j=1}^{m}\sum_{i=0}^{n} u_{ji}\, x_{il}\, \phi\Big(\sum_{k=0}^{n} v_{jk} x_{kl}\Big).$  (5)

The rearranged elements of U construct a row vector c as follows:

$c = [u_{10}, u_{11}, \ldots, u_{1n}, u_{20}, \ldots, u_{mn}]_{1 \times m(n+1)}.$  (6)

According to Eq. (5), we construct a new matrix H of size $m(n+1) \times N$, whose row $(j-1)(n+1)+i+1$ contains the entries

$x_{il}\,\phi\Big(\sum_{k=0}^{n} v_{jk} x_{kl}\Big), \quad l = 1, 2, \ldots, N,$  (7)

for $j = 1, \ldots, m$ and $i = 0, \ldots, n$. Then, Eq. (5) can be rewritten as

$y = cH,$  (8)

where $y = [y_1, y_2, \ldots, y_N]$.

According to MP generalized inverse theory, the minimum norm least square solution of Eq. (8) is

$\hat{c} = yH^{+},$  (9)

where $H^{+}$ is the MP generalized inverse of H. After the row vector c is calculated, the output of PELM can be calculated by using Eq. (5). The learning process of PELM is described as follows:
(1) Randomly generate $v_{ji} \in [-1, 1]$, $i = 0, 1, \ldots, n$, $j = 1, 2, \ldots, m$.
(2) Calculate the hidden layer output matrix H according to Eq. (7).
(3) Calculate the row vector c according to Eq. (9), and rearrange it into the matrix U.
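As a concrete reading of Eqs. (5)-(9), the sketch below trains a PELM in the linear β/γ special case. It assumes a sigmoid φ and the (j, i)-major row ordering of Eq. (7), and uses NumPy's pinv for the MP generalized inverse; the names are ours.

```python
import numpy as np

def pelm_train(X, y, m, seed=0):
    """PELM (linear beta/gamma): y_l = sum_j (u_j . x_l) * phi(v_j . x_l).

    X: (n, N) inputs; y: (1, N) targets. The bias is handled by the
    augmented input x_0 = 1, so each neuron has n + 1 linear parameters.
    """
    rng = np.random.default_rng(seed)
    n, N = X.shape
    Xa = np.vstack([np.ones((1, N)), X])          # (n+1, N), x_0 = 1
    V = rng.uniform(-1.0, 1.0, (m, n + 1))        # random lower-layer weights
    G = 1.0 / (1.0 + np.exp(-(V @ Xa)))           # (m, N), phi = sigmoid
    # H stacks, for each neuron j and each input i, the row x_i * G_j,
    # giving the m*(n+1) x N matrix of Eq. (7).
    H = (G[:, None, :] * Xa[None, :, :]).reshape(m * (n + 1), N)
    c = y @ np.linalg.pinv(H)                     # Eq. (9): min-norm LS solution
    return V, c
```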

III. A LEAST SQUARE PELM
From the learning process of PELM, it can be seen that the input weights and thresholds of the lower layer network are randomly generated. However, randomly generated parameters are not necessarily suitable ones. Therefore, in order to obtain better network parameters for PELM, the input weights and thresholds are related to the input samples and determined by using the least square method, yielding the proposed least square PELM (LSPELM). A detailed derivation of LSPELM is given below.

A. DETERMINE THE INPUT WEIGHTS AND THRESHOLDS OF LOWER LAYER NETWORK
Suppose that the input weight and threshold matrix V of the lower layer network and the output layer threshold d are both known. Then, Eq. (10) can be obtained by using Eq. (8):

$y = \beta\big(cH + d\,\mathbf{1}_{1\times N}\big).$  (10)

Since $\beta(\cdot)$ is a reversible function, applying $\beta^{-1}$ to both sides of Eq. (10) gives

$\beta^{-1}(y) = cH + d\,\mathbf{1}_{1\times N},$  (11)

that is,

$cH = \beta^{-1}(y) - d\,\mathbf{1}_{1\times N}.$  (12)

Regarding H as the unknown, Eq. (12) is a linear system

$cH = T, \qquad T = \beta^{-1}(y) - d\,\mathbf{1}_{1\times N}.$  (13)

Then, the output matrix H of the hidden layer can be estimated by using Eq. (13). According to MP generalized inverse theory, the minimum norm least square solution of Eq. (13) is

$\hat{H} = c^{+}T,$  (14)

where $c^{+}$ is the MP generalized inverse of the vector c.

On the other hand, combining with Eq. (7), the following derivation can be made. Each column of H is the Kronecker product of the lower-layer output and the augmented input,

$h_l = g_l \otimes x_l, \qquad g_l = \phi(Vx_l), \quad x_l = [x_{0l}, x_{1l}, \ldots, x_{nl}]^{T},$  (15)

so H can be reconstructed from the lower-layer output matrix G and the input matrix X. Denote the estimate in Eq. (14) by $H_2$. We construct a new matrix G by taking the row 1, the row n+2, the row 2n+3, ..., and the row (m-1)n+m of the matrix $H_2$, i.e., the rows that correspond to $x_{0l} = 1$:

$G = \big[G_{jl}\big]_{m\times N}, \qquad G_{jl} = \phi\Big(\sum_{k=0}^{n} v_{jk}x_{kl}\Big),$  (16)

which can be written compactly as

$G = \phi(VX),$  (17)

where the definitions of V and X are as follows:

$V = \big[v_{ji}\big]_{m\times(n+1)},$  (18)

$X = [x_1, x_2, \ldots, x_N]_{(n+1)\times N}, \qquad x_{0l} = 1.$  (19)

If $\phi$ is a reversible function, then

$\phi^{-1}(G) = VX,$  (20)

where $\phi^{-1}$ is the inverse function of $\phi$. Regarding V as the unknown, the input weight and threshold matrix V of the lower network can be estimated from the linear system in Eq. (21):

$VX = \phi^{-1}(G).$  (21)

According to MP generalized inverse theory, the minimum norm least square solution of Eq. (21) is as follows:

$\hat{V} = \phi^{-1}(G)X^{+},$  (22)

where $X^{+}$ is the MP generalized inverse of X.

Suppose that there is a matrix $Q_{m\times 1}$ which can make $\phi^{-1}(G) = Q_{m\times 1}\,y_{1\times N}$. Therefore, Eq. (22) can be rewritten as

$\hat{V} = QyX^{+}.$  (23)

Since $\phi^{-1}(G)$ is unknown before the training, the elements of the matrix Q can be randomly generated between 0 and 1.
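The row-extraction step behind Eq. (16) can be checked numerically. The snippet below, with arbitrary small dimensions, verifies that the rows 1, n+2, 2n+3, ..., (m-1)n+m of H (1-based numbering) coincide with G whenever $x_{0l} = 1$.

```python
import numpy as np

# Small random instance: n inputs, m neurons, N samples.
rng = np.random.default_rng(0)
n, m, N = 3, 4, 5
Xa = np.vstack([np.ones((1, N)), rng.normal(size=(n, N))])   # x_0 = 1 bias row
V = rng.uniform(-1.0, 1.0, (m, n + 1))
G = 1.0 / (1.0 + np.exp(-(V @ Xa)))                          # G = phi(V X), Eq. (17)
H = (G[:, None, :] * Xa[None, :, :]).reshape(m * (n + 1), N)
# 0-based indices j*(n+1) are rows 1, n+2, ..., (m-1)n+m in 1-based numbering.
rows = [j * (n + 1) for j in range(m)]
assert np.allclose(H[rows], G)                               # Eq. (16) recovered
```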

B. DETERMINE THE INPUT WEIGHT AND THRESHOLD MATRIX U OF THE UPPER LAYER NETWORK AND THE OUTPUT LAYER THRESHOLD d
After the input weight and threshold matrix $\hat{V}$ of the lower layer network is determined, the output matrix G of the hidden layer of the lower network can be obtained according to Eq. (17). Then, according to Eq. (15), the matrix $\hat{H}$ is obtained. Therefore, LSPELM can be considered a linear system. Eq. (10) can be rewritten into Eq. (24) as follows:

$y = \beta\big(c\hat{H} + d\,\mathbf{1}_{1\times N}\big).$  (24)

Let $\varsigma = [c \;\; d]$ and $K = [\hat{H};\, \mathbf{1}_{1\times N}]$, and then Eq. (24) changes into

$y = \beta(\varsigma K).$  (25)

Then, $\hat{\varsigma}$ can be obtained by using Eq. (26):

$\hat{\varsigma} = \arg\min_{\varsigma}\,\|\beta(\varsigma K) - y\|.$  (26)

Because $\beta(\cdot)$ is a reversible function, the matrix $\hat{\varsigma}$ can be further determined by using Eq. (27):

$\varsigma K = \beta^{-1}(y).$  (27)

According to MP generalized inverse theory, the minimum norm least square solution of Eq. (27) is given as follows:

$\hat{\varsigma} = \beta^{-1}(y)K^{+}.$  (28)

Therefore, the input weight c of the upper network and the output layer threshold d can be obtained by using Eq. (29):

$[\hat{c} \;\; \hat{d}] = \hat{\varsigma},$  (29)

where $\hat{c}$ consists of the first m(n+1) entries of $\hat{\varsigma}$ and $\hat{d}$ is the last entry.
Finally, the input weight vector $\hat{c}$ can be transformed back into the matrix U by using Eq. (6). From the above derivation, it can be seen that, compared with the original PELM network, LSPELM has the following three advantages:
(1) Although the matrix Q is also randomly generated, the number of random numbers is reduced from m×(n+1) to m. This reduction ensures that LSPELM owns better stability than the original PELM, which is also verified in the following simulation experiments.
(2) The input weight and threshold matrix $\hat{V}$ of the lower network of LSPELM is the minimum norm least square solution of Eq. (20). It can be seen from [33] that a network with smaller weights often has better generalization performance than the original network.
(3) The input weight and threshold matrix $\hat{V}$ of LSPELM is obtained by Eq. (23), so $\hat{V}$ is related to the input and output variables of the samples. Compared with the original PELM network, LSPELM is more closely tied to the samples. Therefore, LSPELM may have better generalization performance than PELM.

C. TRAINING PROCESS OF LSPELM
Suppose that there is a training data set $\{(x_i, y_i)\,|\,x_i \in R^n, y_i \in R\}$, $i = 1, 2, \ldots, N$, and that the number of neurons in each hidden layer is m. The detailed training process of LSPELM is described as follows (a runnable sketch of these steps is given after the list):
Step 1: Randomly generate the matrix $Q_{m\times 1}$ with elements in [0, 1].
Step 2: Calculate the input weight and threshold matrix $\hat{V}$ of the lower network according to Eq. (23).
Step 3: Calculate the matrix G according to Eq. (17), and then obtain the matrix $H_2$ according to Eq. (15).
Step 4: Construct the matrix K by appending the row $\mathbf{1}_{1\times N}$ to $H_2$.
Step 5: Calculate $\hat{\varsigma}$ according to Eq. (28).
Step 6: Obtain the matrix U of the upper layer network and the output layer threshold d by using Eq. (29).
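A minimal sketch of Steps 1-6 follows, assuming β is linear (so β^{-1} is the identity) and φ is the sigmoid, as in the experiments of Section III(D); the helper names are illustrative, not from the paper.

```python
import numpy as np

def lspelm_train(X, y, m, seed=0):
    """LSPELM training, Steps 1-6 (beta linear, phi = sigmoid).

    X: (n, N) inputs; y: (1, N) targets. Only Q (m x 1) is random;
    V, c and d all come from least-squares solutions.
    """
    rng = np.random.default_rng(seed)
    n, N = X.shape
    Xa = np.vstack([np.ones((1, N)), X])          # (n+1, N) with bias row
    Q = rng.uniform(0.0, 1.0, (m, 1))             # Step 1: random Q in [0, 1]
    V = Q @ y @ np.linalg.pinv(Xa)                # Step 2: V = Q y X^+, Eq. (23)
    G = 1.0 / (1.0 + np.exp(-(V @ Xa)))           # Step 3: G = phi(V X), Eq. (17)
    H2 = (G[:, None, :] * Xa[None, :, :]).reshape(m * (n + 1), N)  # Eq. (15)
    K = np.vstack([H2, np.ones((1, N))])          # Step 4: append the bias row
    sigma = y @ np.linalg.pinv(K)                 # Step 5: Eq. (28), beta linear
    c, d = sigma[:, :-1], sigma[0, -1]            # Step 6: split sigma, Eq. (29)
    return V, c, d

def lspelm_predict(X, V, c, d):
    Xa = np.vstack([np.ones((1, X.shape[1])), X])
    G = 1.0 / (1.0 + np.exp(-(V @ Xa)))
    H2 = (G[:, None, :] * Xa[None, :, :]).reshape(-1, X.shape[1])
    return c @ H2 + d
```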
After all of the weights and thresholds are determined, LSPELM can be used to predict unknown samples. The training process shows that the input weights and thresholds and the output threshold of LSPELM are obtained by solving linear equations in one step; they are not obtained by stepwise iteration as in the traditional BP neural network. Therefore, compared with the BP neural network, the training speed of LSPELM is very fast. According to [32], PELM has been proved to be a universal approximator. PELM randomly selects the input weights and thresholds of the hidden layer, while our proposed LSPELM obtains them by using the least square method. LSPELM does not change the model structure of the conventional PELM, but presents an initialization algorithm for PELM. Since PELM is convergent, LSPELM is also convergent.

D. PERFORMANCE TESTING OF LSPELM
To verify the performance of LSPELM, we test it on 11 benchmark regression data sets from the well-known UCI database. These data sets have been used by many scholars to verify the generalization performance of proposed models [13], [34]. Each data set is randomly divided into a training set, a testing set and a validation set, as shown in Table 1. In order to prevent under-fitting and over-fitting of the network, the number of hidden layer neurons needs to be selected during testing. As in [35], the validation set of each data set is used to determine the optimal number of neurons: the number of neurons that minimizes the Root Mean Square Error (RMSE) on the validation set is used for prediction.
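This selection protocol amounts to a simple validation loop; the sketch below reuses the lspelm_train and lspelm_predict sketches from Section III(C) and assumes validation RMSE is the only criterion.

```python
import numpy as np

def select_neurons(X_tr, y_tr, X_val, y_val, max_m=50):
    """Return the hidden-layer size with minimal RMSE on the validation set."""
    best_m, best_rmse = 1, np.inf
    for m in range(1, max_m + 1):
        V, c, d = lspelm_train(X_tr, y_tr, m)
        err = y_val - lspelm_predict(X_val, V, c, d)
        rmse = np.sqrt(np.mean(err ** 2))
        if rmse < best_rmse:
            best_m, best_rmse = m, rmse
    return best_m
```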
All the experiments are run in MATLAB 2012a under Windows 7 on a 2.5 GHz machine with 4.00 GB RAM. For comparison with LSPELM, PELM, the two-hidden-layer ELM (TELM) [36] and the linear regression (LR) method are selected. TELM also has two hidden layers, but they are connected in series, whereas the two hidden layers of PELM and LSPELM are connected in parallel. For regression problems, the activation functions of the two hidden layers of TELM follow the recommendation in [36], which gives better regression results. For PELM and LSPELM, the activation function φ(·) is set as the sigmoid function, i.e. φ(x) = 1/(1 + e^{-x}), and the function β(·) is set as a linear function. In [32], the generalization ability of PELM was compared with ELM in detail, and a large number of regression experiments showed that PELM performs better than ELM on most data sets. Therefore, the comparison between PELM and ELM is not repeated in this section.
To prevent attributes of different magnitudes from affecting the regression results, the input and output attributes are both normalized to [0.01, 1]. To mitigate the effect of random initialization, each experiment is repeated 20 and 50 times. As in [35], the maximum number of neurons in each hidden layer is set as 50. The optimal number of neurons for each network is shown in Table 2. Although these three networks are all double hidden layer feed-forward neural networks, their connection schemes and training methods are very different, so the numbers of neurons for the same data also differ considerably. The results in Table 2 show that TELM, whose layers are connected in series, requires more hidden layer neurons than PELM and LSPELM, whose layers are connected in parallel. The optimal number of neurons for LSPELM is fewer than or equal to that of PELM on almost all data sets, and much fewer than that of TELM.
In order to compare the generalization performance of different models, the mean and standard deviation (S.D.) (except for the LR method) of RMSE, the mean of Mean Absolute Percentage Error (MAPE) and the mean of the coefficient of determination R-Square (R²) over the 20 and 50 experiments are recorded in Tables 3-6. The computational formulas of RMSE, MAPE and R² are shown as Eq. (30)-Eq. (32):

$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2},$  (30)

$\mathrm{MAPE} = \frac{100\%}{N}\sum_{i=1}^{N}\left|\frac{y_i - \hat{y}_i}{y_i}\right|,$  (31)

$R^2 = 1 - \frac{\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2},$  (32)

where N represents the number of samples, $y_i$ represents the actual value of the output variable, $\hat{y}_i$ represents the predicted value of the output variable, and $\bar{y}$ represents the mean of the actual values of the output variable. The smaller the first three indicators and the larger the fourth indicator, the better the performance of the model. In addition, we also compare the training and testing time of the four models. The comparison results are shown in Table 7.

Table 3 shows the mean and S.D. of RMSE for 20 runs on the testing samples. LSPELM obtains the best predicted means on six data sets: Abalone, Auto.MPG, Bank domains, Machine CPU, Boston housing and Cancer. TELM only obtains the best predicted mean on Delta-ailerons. PELM obtains the best predicted means on Servo and Stocks domain. The LR method only obtains the best predicted means on Triazines and Price. The advantage of LSPELM is obvious. Table 5 shows the mean and S.D. of RMSE for 50 runs on the testing samples. LSPELM obtains the best predicted means on seven data sets: Abalone, Auto.MPG, Bank domains, Servo, Machine CPU, Boston housing and Cancer. TELM obtains the best predicted means on Delta-ailerons and Triazines. PELM only obtains the best predicted mean on Stocks domain. The LR method only obtains the best predicted mean on Price. These simulation results show that LSPELM has the smallest mean RMSE on most data sets, so its generalization ability is optimal, which is a big advantage of the network. In addition, the S.D. of the RMSE of LSPELM is very close to zero on almost all data sets. Compared with TELM and PELM, LSPELM obtains the smallest S.D. on all benchmark data sets. Therefore, LSPELM is least affected by random numbers and has the best stability. This computational result is consistent with the theoretical analysis in Section III(B), which shows another advantage of LSPELM.

Table 4 shows the means of MAPE and R² for 20 experiments on the testing samples. As can be seen from the MAPE, LSPELM shows the best predicted results on five data sets: Abalone, Auto.MPG, Bank domains, Machine CPU and Cancer. TELM obtains the best results on three data sets: Servo, Boston housing and Price. PELM obtains the best results on three data sets: Triazines, Stocks domain and Delta-ailerons. The LR method does not obtain the best MAPE on any data set. Table 6 shows the means of MAPE and R² for 50 experiments. LSPELM shows the best predicted results on seven data sets: Abalone, Auto.MPG, Bank domains, Servo, Triazines, Machine CPU and Cancer. The other methods obtain the best predicted results on fewer data sets than LSPELM. Furthermore, as the number of trials increases, LSPELM shows greater and greater superiority. Therefore, whether over 20 or 50 experiments, the MAPE of LSPELM is smallest on most data sets. In conclusion, the MAPE results verify that the prediction ability and generalization ability of LSPELM are optimal among these methods. In addition, the R² results in Table 4 show that LSPELM achieves the best fitting results on seven data sets: Abalone, Auto.MPG, Bank domains, Machine CPU, Boston housing, Cancer and Price. TELM obtains the best results on three data sets: Triazines, Servo and Delta-ailerons. PELM only obtains the best fitting result on Stocks domain. The LR method only obtains the best fitting result on Price. For the 50 experiments in Table 6, LSPELM shows the best fitting results on seven data sets: Abalone, Auto.MPG, Bank domains, Machine CPU, Boston housing, Servo and Cancer. The other methods obtain the best fitting results on fewer data sets than LSPELM.
The R² results show that the fitting effect of LSPELM is best on most data sets. The good prediction ability and generalization performance of LSPELM are thus verified once again.
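For reference, Eqs. (30)-(32) translate directly into code; the helper below is an illustration, not the authors' evaluation script, and assumes no actual value $y_i$ is zero.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """RMSE, MAPE (in percent) and R^2, as defined in Eqs. (30)-(32)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    mape = 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))
    r2 = 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return rmse, mape, r2
```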
Regarding the running time in Table 7, the LR method is very simple, so its computational time is shortest on most data sets, but its testing results are worse. Among the ELM-based improved models, TELM takes the least time on Triazines. PELM takes the least time on five data sets: Stocks domain, Machine CPU, Boston housing, Cancer and Price. Our proposed LSPELM takes the least time on six data sets: Abalone, Auto.MPG, Bank domains, Servo, Delta-ailerons and Price. This is attributed to the fact that LSPELM uses fewer neurons on these data sets.
In summary, through the above analysis of testing results, we can know that LSPELM obtains good prediction ability and generalization performance for different data sets with a few hidden layer neurons.

IV. AN ONLINE LEARNING METHOD FOR LSPELM (OLSPELM) BASED ON SAMPLE INCREMENT
In general, when training samples arrive at different times, it is impossible to collect all the samples in advance and train the network in a single pass. Training samples can only enter the network one by one or chunk by chunk. When new samples arrive, the current learning results are updated immediately, which makes sequential online learning more efficient. Therefore, an online learning algorithm for LSPELM based on sample increments is given below.
The basic idea of OLSPELM is to update the weights and thresholds of the network adaptively according to the difference between the samples at the current time and those at the last time. If the samples at the current time are the same as those at the last time, the weights and thresholds of the network remain unchanged; only when the input samples at two adjacent times differ are the weights and thresholds updated adaptively. At any time, OLSPELM only learns the new samples and discards them once they have been learned.

A. LEARNING PROCESS OF OLSPELM
The derivation process of LSPELM in Section III(A) shows that the input weight and threshold matrix $\hat{V}$ of the lower network can be calculated by Eq. (23), i.e., $\hat{V} = QyX^{+}$. Here Q is an m×1 matrix, y is the output variable of the training data, and $X^{+}$ is the generalized inverse of X. Suppose that the rank of X is n+1, where n is the attribute dimension of the input variables. According to MP generalized inverse theory, $X^{+}$ can be expressed as

$X^{+} = X^{T}(XX^{T})^{-1}.$  (33)

Therefore, for the initial training subset $L_0 = \{(x_l, y_l)\}_{l=1}^{N_0}$, the input weight and threshold matrix $\hat{V}_0$ of the lower network can be calculated by using Eq. (34):

$\hat{V}_0 = Qy_0X_0^{T}(X_0X_0^{T})^{-1} = Qy_0X_0^{T}M_0^{-1},$  (34)

where $X_0 = [x_1, x_2, \ldots, x_{N_0}]$, $y_0 = [y_1, y_2, \ldots, y_{N_0}]$, and

$M_0 = X_0X_0^{T}.$  (35)

According to Eq. (17), the output matrix $G_0$ of the hidden layer of the lower network is obtained, and then the matrix $\hat{H}_0$ is obtained according to Eq. (15). Let $K_0 = [\hat{H}_0;\, I_0]$, where $I_0$ is the $1\times N_0$ row vector of ones. Therefore, the input weight $c_0$ and output layer threshold $d_0$ are

$\hat{\varsigma}_0 = [c_0 \;\; d_0] = \beta^{-1}(y_0)K_0^{+}$  (36)

$= \beta^{-1}(y_0)K_0^{T}(K_0K_0^{T})^{-1}.$  (37)

Let $P_0 = K_0K_0^{T}$, and then Eq. (37) is transformed as

$\hat{\varsigma}_0 = \beta^{-1}(y_0)K_0^{T}P_0^{-1}.$  (38)

When a new sample chunk $(X_1, y_1)$ arrives, the lower-network parameters are recalculated on the combined data:

$\hat{V}_1 = Q[y_0\; y_1][X_0\; X_1]^{T}\big([X_0\; X_1][X_0\; X_1]^{T}\big)^{-1}$  (39)

$= Q(y_0X_0^{T} + y_1X_1^{T})M_1^{-1},$  (40)

where

$M_1 = M_0 + X_1X_1^{T}.$  (41)

Substituting Eqs. (35) and (41) into Eq. (40), we can obtain

$\hat{V}_1 = \hat{V}_0 + (Qy_1 - \hat{V}_0X_1)X_1^{T}M_1^{-1}.$  (42)

When the learning samples at two adjacent times are the same, i.e., $X_1 = X_0$ and $y_1 = y_0$, Eq. (42) can be rewritten as Eq. (43):

$\hat{V}_1 = \hat{V}_0 + (Qy_0X_0^{T} - \hat{V}_0M_0)M_1^{-1}.$  (43)

According to Eqs. (34) and (35), we can obtain $Qy_0X_0^{T} - \hat{V}_0M_0 = 0$, so $\hat{V}_1 = \hat{V}_0$. In this situation, the input weights are not adjusted. Therefore, only when the learning samples at adjacent times are different, can the input weights and the thresholds be adjusted adaptively according to the arrival of new samples. This update strategy improves efficiency and saves time.

Next, the updated formulas of the upper-layer input weight $c_1$ and the output layer threshold $d_1$ are derived. Let $K_1 = [\hat{H}_1;\, I_1]$, where $\hat{H}_1$ is the hidden layer output matrix on $X_1$ obtained from $\hat{V}_1$ according to Eqs. (17) and (15):

$\hat{H}_1 = [\,g_l \otimes x_l\,]_{l=N_0+1}^{N_0+N_1}, \qquad g_l = \phi(\hat{V}_1x_l).$  (44)

According to Eq. (28), we can obtain

$\hat{\varsigma}_1 = \beta^{-1}([y_0\; y_1])[K_0\; K_1]^{T}\big(P_0 + K_1K_1^{T}\big)^{-1}$  (45)

$= \hat{\varsigma}_0 + \big(\beta^{-1}(y_1) - \hat{\varsigma}_0K_1\big)K_1^{T}P_1^{-1}, \qquad P_1 = P_0 + K_1K_1^{T}.$  (46)

Then, according to Eq. (29), the updated formulas of $c_1$ and $d_1$ are obtained:

$[c_1 \;\; d_1] = \hat{\varsigma}_1.$  (47)

Generalizing Eq. (42) to the (k+1)th step,

$\hat{V}_{k+1} = \hat{V}_k + (Qy_{k+1} - \hat{V}_kX_{k+1})X_{k+1}^{T}M_{k+1}^{-1}, \qquad M_{k+1} = M_k + X_{k+1}X_{k+1}^{T}.$  (48)

Let $S = M^{-1}$, and then

$S_{k+1} = M_{k+1}^{-1}$  (49)

$= (M_k + X_{k+1}X_{k+1}^{T})^{-1}.$  (50)

According to the Woodbury formula, Eq. (50) can be transformed into

$S_{k+1} = S_k - S_kX_{k+1}\big(I + X_{k+1}^{T}S_kX_{k+1}\big)^{-1}X_{k+1}^{T}S_k.$  (51)

Therefore, Eq. (48) can be rewritten as

$\hat{V}_{k+1} = \hat{V}_k + (Qy_{k+1} - \hat{V}_kX_{k+1})X_{k+1}^{T}S_{k+1}.$  (52)

Similar to the deduction that $\hat{V}_1 = \hat{V}_0$ when the learning samples at adjacent times are the same, it can also be deduced that $\hat{V}_{k+1} = \hat{V}_k$ in that case. Therefore, the updated formula of $\hat{V}_{k+1}$ is

$\hat{V}_{k+1} = \begin{cases}\hat{V}_k, & X_{k+1} = X_k,\; y_{k+1} = y_k,\\ \hat{V}_k + (Qy_{k+1} - \hat{V}_kX_{k+1})X_{k+1}^{T}S_{k+1}, & \text{otherwise}.\end{cases}$  (53)

In addition, according to Eq. (46), the updated formula $\hat{\varsigma}_{k+1}$ of the (k+1)th step is

$\hat{\varsigma}_{k+1} = \hat{\varsigma}_k + \big(\beta^{-1}(y_{k+1}) - \hat{\varsigma}_kK_{k+1}\big)K_{k+1}^{T}P_{k+1}^{-1}, \qquad P_{k+1} = P_k + K_{k+1}K_{k+1}^{T}.$  (54)

Let $R = P^{-1}$, and then

$R_{k+1} = (P_k + K_{k+1}K_{k+1}^{T})^{-1}.$  (55)

According to the Woodbury formula, Eq. (55) can be transformed into

$R_{k+1} = R_k - R_kK_{k+1}\big(I + K_{k+1}^{T}R_kK_{k+1}\big)^{-1}K_{k+1}^{T}R_k.$  (56)

Therefore, Eq. (54) can be written as

$\hat{\varsigma}_{k+1} = \hat{\varsigma}_k + \big(\beta^{-1}(y_{k+1}) - \hat{\varsigma}_kK_{k+1}\big)K_{k+1}^{T}R_{k+1}.$  (57)

Furthermore, $c_{k+1}$ and $d_{k+1}$ are obtained as follows:

$[c_{k+1} \;\; d_{k+1}] = \hat{\varsigma}_{k+1}.$  (58)
So far, the updated formulas of all the weights and thresholds of OLSPELM have been deduced for the arrival of new samples. From the above derivation, it can be seen that OLSPELM adaptively updates all the weights and thresholds of the network based on sample increments. OLSPELM is an online form of LSPELM; since LSPELM is convergent, OLSPELM is also a universal approximator.
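The computational core of these updates is the Woodbury identity in Eqs. (51) and (56), which replaces a full matrix inversion with a small correction of the stored inverse. A minimal sketch of this shared step, assuming the current inverse is available, is:

```python
import numpy as np

def woodbury_update(S, X_new):
    """Update S = M^{-1} after M <- M + X_new X_new^T, as in Eqs. (50)-(51).

    S: (p, p) current inverse; X_new: (p, N_new) new data block.
    The same form updates R = P^{-1} in Eqs. (55)-(56).
    """
    SX = S @ X_new                                   # (p, N_new)
    inner = np.eye(X_new.shape[1]) + X_new.T @ SX    # (N_new, N_new)
    return S - SX @ np.linalg.solve(inner, SX.T)     # avoids re-inverting M
```

Only an $N_{k+1} \times N_{k+1}$ system is solved per chunk, so the cost of each update does not grow with the number of samples already learned.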

B. TRAINING STEPS OF OLSPELM
According to the learning process of OLSPELM described in Section IV(A), the training of OLSPELM can be summarized into two phases, which are described in detail as follows.
Step 1: Initialization phase. Suppose that the initial training sample set is $L_0 = \{(x_l, y_l)\}_{l=1}^{N_0}$.
(1) Randomly generate an m×1 matrix Q with elements in [0, 1].
(2) Calculate $M_0 = X_0X_0^{T}$ and $S_0 = M_0^{-1}$, and then the input weight and threshold matrix $\hat{V}_0$ according to Eq. (34).
(3) Obtain the output matrix $G_0$ of the lower network according to Eq. (17), and then obtain $\hat{H}_0$ according to Eq. (15). Combine $\hat{H}_0$ and $I_0$ into the matrix $K_0$.
(4) Calculate $P_0 = K_0K_0^{T}$ and $R_0 = P_0^{-1}$, and then $\hat{\varsigma}_0$ according to Eq. (38).
Step 2: Updating phase of the weights and thresholds. Suppose that the training sample set of the (k+1)th step is $L_{k+1} = \{(x_l, y_l)\}$. If the new samples are the same as those of the last step, keep all the weights and thresholds unchanged; otherwise:
(1) Update the input weight and threshold matrix $\hat{V}_{k+1}$ according to Eqs. (51) and (53).
(2) Obtain $G_{k+1}$ according to Eq. (17) and $\hat{H}_{k+1}$ according to Eq. (15), and combine $\hat{H}_{k+1}$ and $I_{k+1}$ into the matrix $K_{k+1}$.
(3) Update $R_{k+1}$ according to Eq. (56).
(4) Update $\hat{\varsigma}_{k+1}$ according to Eq. (57), and then obtain $c_{k+1}$ and $d_{k+1}$ according to Eq. (58).
(5) Set k = k + 1, and then return to Step 2 until the end of the learning. A runnable sketch of these two phases is given below.
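The sketch below illustrates the two phases under the same assumptions as before (β linear, φ sigmoid), plus the requirement that the initial chunk is large enough for $M_0$ and $P_0$ to be invertible; it follows Eqs. (15), (17) and (34)-(58) and reuses the woodbury_update helper from Section IV(A). The state layout and names are ours.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def build_K(V, Xa):
    """Hidden matrix from Eqs. (15) and (17), plus the bias row."""
    G = sigmoid(V @ Xa)
    m, N = G.shape
    H = (G[:, None, :] * Xa[None, :, :]).reshape(m * Xa.shape[0], N)
    return np.vstack([H, np.ones((1, N))])

def olspelm_init(X0, y0, m, seed=0):
    """Step 1 (initialization) on the initial chunk; beta is linear."""
    rng = np.random.default_rng(seed)
    Xa = np.vstack([np.ones((1, X0.shape[1])), X0])
    Q = rng.uniform(0.0, 1.0, (m, 1))
    S = np.linalg.inv(Xa @ Xa.T)                  # S_0 = M_0^{-1}, Eq. (35)
    V = Q @ y0 @ Xa.T @ S                         # Eq. (34)
    K = build_K(V, Xa)
    R = np.linalg.inv(K @ K.T)                    # R_0 = P_0^{-1}
    sigma = y0 @ K.T @ R                          # Eq. (38), beta^{-1} = identity
    return Q, V, S, R, sigma, Xa, y0

def olspelm_update(state, X1, y1):
    """Step 2 (update) for a new chunk; identical samples are skipped."""
    Q, V, S, R, sigma, Xa_prev, y_prev = state
    Xa = np.vstack([np.ones((1, X1.shape[1])), X1])
    if Xa.shape == Xa_prev.shape and np.allclose(Xa, Xa_prev) \
            and np.allclose(y1, y_prev):
        return state                              # Eq. (53), first case
    S = woodbury_update(S, Xa)                    # Eq. (51)
    V = V + (Q @ y1 - V @ Xa) @ Xa.T @ S          # Eq. (53), second case
    K = build_K(V, Xa)
    R = woodbury_update(R, K)                     # Eq. (56)
    sigma = sigma + (y1 - sigma @ K) @ K.T @ R    # Eq. (57)
    return Q, V, S, R, sigma, Xa, y1
```

After each update, c and d are read off from sigma as in Eq. (58).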

C. PERFORMANCE TESTING OF OLSPELM
To verify the performance of OLSPELM, we present simulation studies on two large data sets: Condition Based Maintenance of Naval Propulsion Plants (CBM) [37] and Elevators [38]. The data sets, their partitions and the optimal numbers of neurons are shown in Table 8.
LSPELM learns all the training samples at once, while OLSPELM divides the training set into two parts. One part consists of the initial training samples, which are learned once; the other part consists of the sequentially learned samples, which are mainly used to simulate the generation of new data. OLSPELM can learn data one-by-one or chunk-by-chunk (a block of data) with fixed or varying chunk size. A large number of simulation experiments show that the choice of online learning mode has little influence on the model, so the hundred-by-hundred learning mode is selected in this section. In addition, in order to examine the influence of the number of initial samples on OLSPELM, the number of initial training samples is set as 500 and 1000, respectively. To fairly verify the performance of the online models, OLSPELM is compared with the well-known OSELM [39] and an online PELM (OPELM). Every model is run 20 times. The computational results are given in Tables 9-10.
From every performance index in Table 9, regardless of the size of the initial sample set, OLSPELM shows the best performance on the training and testing data sets. From Table 10, since OSELM is very simple, its computational time is shortest among the three online models; however, its training and testing results are the worst. When the number of initial training samples is 500, OLSPELM shows the best performance on the training and testing data sets in everything except running time. When the number of initial training samples is 1000, OLSPELM and OPELM give similar predicted results on the training data; however, on the testing data, OLSPELM gives much better predicted results than OPELM, again except for the running time. Therefore, OLSPELM is superior to OSELM and OPELM.

V. MODELING NOx EMISSION CONCENTRATION FOR A CFBB BY USING LSPELM AND OLSPELM
NOx is an important environmental pollutant in the combustion emissions of a CFBB. With the deterioration of the global environment and people's increasing emphasis on energy conservation and environmental protection, reducing NOx emission from power station boilers has become an urgent matter. Combustion optimization technology is one of the effective measures to reduce NOx emission. In order to achieve combustion optimization of a boiler, it is first necessary to establish a combustion characteristic model of NOx emission concentration.
The modeling methods for NOx emission concentration are mainly divided into two categories. The first category is mechanistic modeling based on detailed combustion processes. However, the combustion process of a boiler is a very complicated physical and chemical process; parameters such as the coal feeder rate, air velocity, air temperature, and oxygen content of the flue gas are strongly coupled and nonlinear, so it is difficult to establish an accurate combustion characteristics model this way. The second category is artificial intelligence technology based on historical combustion data of a CFBB collected from the Distributed Control System (DCS). We do not need knowledge of the complex combustion mechanisms of a CFBB to model the combustion characteristics: once historical boiler combustion data are available, the combustion characteristics model can be established quickly. So, this method is simple and efficient. In this study, LSPELM and OLSPELM are used to establish the offline and online models of NOx emission for a CFBB.

A. EXPERIMENTAL DATA AND THE SELECTION OF OPERATIONAL PARAMETERS
Partial combustion history data are collected from a 300 MW CFBB of a power plant under normal operating conditions. Due to the limitations of field equipment and sampling techniques, 600 samples with 25 attribute parameters related to NOx emission concentration are collected, as shown in Table 11. The sampling interval between two samples is 30 seconds. To show the distribution characteristics and fluctuations of the operating parameters more clearly, their boxplots are given in Fig. 2. To verify that the 25 attribute parameters are statistically related to the NOx emission concentration, the Spearman test is performed in the SPSS 19.0 software, with the significance level set as 0.05. Since the overall distribution of the data is unknown, the Pearson test is not used. The detailed testing results are shown in Table 12 below.
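Although the test was run in SPSS 19.0, an equivalent screening can be sketched with SciPy; the function below is an assumed reading of that protocol, not the original script.

```python
import numpy as np
from scipy import stats

def spearman_screen(X, y, alpha=0.05):
    """Spearman correlation of each attribute with NOx concentration.

    X: (N, p) samples-by-attributes; y: (N,) NOx values. Returns, per
    attribute, the coefficient, the p-value and a significance flag.
    """
    results = []
    for k in range(X.shape[1]):
        rho, p = stats.spearmanr(X[:, k], y)
        results.append((k, rho, p, p < alpha))
    return results
```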
The P values in Table 12 indicate that there is a statistically significant correlation between the NOx emission concentration and the 24 attribute parameters other than PAV(B). Although no such correlation is detected between the NOx emission concentration and PAV(B), it is known from the actual operating situation that the PAV is nonlinearly related to NOx emissions. In addition, although the 25 attribute parameters may be related to and coupled with each other, LSPELM can directly deal with multivariate correlated problems and automatically reflects the correlation degree of each parameter in the network weights through the training process. Therefore, the 25 attribute parameters are used as input parameters of LSPELM, and the NOx emission concentration is used as the output parameter to establish the combustion characteristic model of the CFBB. In order to eliminate the influence of different attribute magnitudes, the input and output attributes are normalized to [0.01, 1]. After the results are calculated, these attributes are anti-normalized.
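The normalization step can be sketched as a per-attribute min-max scaling into [0.01, 1], with the inverse transform used for the anti-normalization; constant attributes are assumed absent.

```python
import numpy as np

def normalize(A, lo=0.01, hi=1.0):
    """Scale each attribute (row of A) into [lo, hi]; return scaler params."""
    a_min = A.min(axis=1, keepdims=True)
    a_max = A.max(axis=1, keepdims=True)
    scale = (hi - lo) / (a_max - a_min)           # assumes a_max > a_min
    return lo + (A - a_min) * scale, (a_min, scale)

def denormalize(B, params, lo=0.01):
    """Invert the scaling (anti-normalization) after prediction."""
    a_min, scale = params
    return a_min + (B - lo) / scale
```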

B. ESTABLISHING THE OFFLINE MODEL OF NOx EMISSION CONCENTRATION BY USING LSPELM
In this section, the proposed LSPELM is used to establish the offline model of NOx emission concentration of the boiler. The collected 600 samples are randomly divided into three parts: 300 samples are used as the training set to establish the model, 150 samples are used as the validation set to select a suitable number of neurons, and the remaining 150 samples are used as the testing set to verify generalization performance. As in Section III(D), to prevent under-fitting and over-fitting of the neural network, the appropriate number of neurons is chosen by minimizing the RMSE on the validation set. For ELM and TELM, which require relatively many neurons, the maximum number of neurons per layer is set as 50 according to experience. According to Section III(D), PELM and LSPELM require fewer neurons, so their maximum number of neurons per hidden layer is set as 5. The activation functions of the hidden layers are the same as those in Section III(D). Similarly, to prevent the effects of random initialization on the networks, each experiment is repeated 20 times. Fig. 3 shows the curves of predicted precision against the number of hidden layer neurons on the validation set.
According to the principle of error minimization, the numbers of neurons of ELM and TELM are set as 27 and 39, respectively. In addition, it can be clearly seen from Fig. 3 that after the first neuron, the RMSEs of PELM and LSPELM on the validation data do not decrease but keep rising, so over-fitting occurs. In order to prevent over-fitting, fewer neurons can be used, which also reduces the complexity of the network. So, for PELM and LSPELM, the number of neurons per hidden layer is selected as 1. After the number of hidden layer neurons is selected, the model of NOx emission concentration can be established according to Section III(C). Fig. 4 shows the NOx emission concentration predicted by the LSPELM model, and Table 13 compares the predicted precision of NOx emission concentration. As seen from Table 13, although LSPELM uses only a few neurons, its mean and S.D. of RMSE and its mean MAPE are the smallest of the four models. The maximum MAPE of LSPELM is 1.0523% by calculation, which is smaller than the mean MAPE of the other models. From the three testing indicators, we know that the generalization ability of LSPELM is much better than that of the original PELM and the other models, and its predicted results are the best of the four models. Fig. 4 and Table 13 show that LSPELM not only has higher prediction precision and generalization ability, but also has better stability.

C. ESTABLISHING THE ONLINE MODEL OF NOx EMISSION CONCENTRATION BY USING OLSPELM
The model established in Section V(B) is a static description of the combustion characteristics of the boiler, so LSPELM is an offline modeling method. However, the combustion process of a boiler is very complex and is affected by many factors, so the operating conditions of a boiler constantly change. If new conditions exhibit new features during the dynamic process, the offline model has to abandon the results of the last learning, which wastes a lot of time and space resources. In this situation, online modeling can track the combustion dynamics of the boiler. So, the proposed OLSPELM is used to establish an online model of NOx emission concentration.
The data partition is the same as for the offline modeling. In order to examine the influence of the number of initial samples, the number of initial training samples is set as 100, 200 and 250, respectively. Since the number of sequentially learned samples is small, the one-by-one learning mode is selected in this section. In addition, all model parameters are set the same as in Section V(B). For a more vivid observation, the modeling flowchart of OLSPELM is given in Fig. 6. The simulation results are shown in Table 14 and Figs. 7-9. Table 14 shows that, regardless of the number of initial samples, every performance index of OLSPELM is the smallest on the training and testing data sets. Therefore, OLSPELM is significantly superior to OSELM and OPELM. In particular, OLSPELM has an even more obvious advantage in the S.D.: the S.D. of OLSPELM is two orders of magnitude smaller than that of OPELM and one order of magnitude smaller than that of OSELM. This indicates that the stability of OLSPELM is much better than that of OPELM and OSELM. In addition, when the number of initial training samples increases from 100 to 200, all three indexes of OLSPELM improve on the testing set; when it increases from 200 to 250, the other two indicators keep improving apart from a slight increase of the S.D. Figs. 7-9 show the predicted results of OLSPELM: Fig. 7 gives the predicted result on 250 initial training samples, Fig. 8 on 50 online sequentially learned samples, and Fig. 9 on 300 testing samples. The maximum relative error on the testing samples is 1.5978% by calculation. These three figures show that OLSPELM has small prediction errors and a good prediction effect.
In summary, OLSPELM not only has better identification performance, but also stronger generalization ability and better stability, than OSELM and OPELM. Therefore, OLSPELM is very suitable for describing the dynamic combustion characteristics of the CFBB.

VI. CONCLUSION
In this paper, a novel LSPELM is first derived from the minimum norm least square solution of linear equations. Then, 11 well-known regression data sets from UCI are used to verify the performance of LSPELM. The experimental results show that LSPELM with fewer hidden layer neurons has better generalization ability and stability than PELM and other methods. Furthermore, using MP generalized inverse theory and the Woodbury formula, OLSPELM based on sample increments is also given. Next, based on the historical combustion data of a 300MW CFBB from a thermal power plant, an offline model of NOx emission concentration is established by using LSPELM. The simulation results show that LSPELM can represent the functional relationship between NOx emission concentration and boiler-related operating parameters better than PELM, ELM and other models. Finally, an online model of NOx emission concentration is also established by using OLSPELM. The simulation results show that OLSPELM has better nonlinear identification ability, generalization ability and stability than OSELM and OPELM, and can effectively identify the complex dynamic combustion process of the boiler. In short, the proposed LSPELM and OLSPELM are effective and useful modeling tools, and they can be extended to classification, image processing and speech recognition problems.
The input weight and threshold matrix of LSPELM is related to the samples, so, compared with the randomly initialized PELM network, LSPELM has better generalization performance than the original PELM. On the other hand, the reduction of the number of random numbers ensures that LSPELM owns better stability than PELM. However, due to the limitations of field equipment and sampling technology, the amount of historical combustion data is small; how to collect a larger number of samples that represent the combustion operation of the boiler needs further research. In addition, we focus more on the incremental learning of OLSPELM, and decremental learning is not considered in the model. In order to further improve the performance of the model, mechanisms that lower the importance of old samples deserve more attention. These issues will be considered in future work.