Growing and Pruning Selective Ensemble Regression for Nonlinear and Nonstationary Systems

For a selective ensemble regression (SER) scheme to be effective in online modeling of fast-arriving nonlinear and nonstationary data, it must not only be capable of maintaining an up-to-date and diverse base model set but also be able to forget old knowledge that is no longer relevant. Based on these two important principles, in this paper we propose a novel growing and pruning SER (GAP-SER) for time-varying nonlinear data. Specifically, during online operation, each newly emerging process state is automatically identified and a local linear model is fitted to it. This adaptive growing strategy therefore maintains an up-to-date and diverse local model set. The online prediction model is then constructed as a selective ensemble from the local linear model set based on a probability metric. Moreover, a pruning strategy is derived to remove 'unwanted' out-of-date local linear models in order to achieve low online computational complexity without sacrificing online modeling accuracy. A chaotic time series prediction task and two real-world data sets are used to demonstrate the superior online modeling performance of the proposed GAP-SER over a range of benchmark schemes for nonlinear and nonstationary systems, in terms of online prediction accuracy and computational complexity.


I. INTRODUCTION
With the growing number of real-world applications based on fast-arriving data streams [1]-[8], online learning has become a paramount issue for regression models. The data captured by sensor networks, industrial machinery and the like usually exhibit both nonlinear and nonstationary characteristics. The root cause of this time-varying nature may be sensor drift and/or underlying process drift. Sensor drift is a temporal shift of sensor readings due to ageing or environment changes [9]. Process drift results from changes of operating conditions, catalyst deactivation, mechanical abrasion, external climatic variations, etc. [10]. Such drifts, which can be either gradual or abrupt, degrade the performance of real-time or online predictive models.
To cope with the drifting effects of data, predictive models with adaptive capability are necessary. A commonly used simple method is to employ adaptive recursive estimators, such as recursive least squares (RLS) [11]-[13] or the online sequential extreme learning machine (OS-ELM) [14]-[16], to update the predictive model's weights in real time. Another way of online adaptation is to selectively record important data patterns to update the predictive model's structure. One popular representative is the resource allocating network (RAN). Starting from an empty set of radial basis function (RBF) nodes, the RAN adds RBF nodes for arriving input data based on their significance [17], [18]. Hence, the RAN can only grow the model size, which usually means that, after long online adaptive operation, it ends up with a very large model and high online prediction complexity. By contrast, starting from an initial set of RBF nodes, the fast tunable RBF [19] adaptively replaces an 'insignificant' RBF node with a new node online. Although its model size is fixed, the fast tunable RBF can better track 'local characteristics' in a nonstationary environment. The experimental results of [19] show that this fast tunable RBF method typically outperforms the RAN and the OS-ELM.
Compared with the above single global modeling approaches, ensemble learning, which employs multiple models to separately model data subspaces, has proven popular in online learning [6], [20]-[22]. In the area of selective ensemble learning, most research focuses on adaptive classification [23]-[25], and the regression problem is rarely discussed. One important issue in ensemble learning is the diversity between models [26]. Maintaining a highly diverse ensemble enables covering a wide dynamical range of the data space. This is of utmost importance during online modeling of a nonstationary process, where the process dynamics can change significantly and often give rise to newly emerging multiple operating regions. The ensemble of OS-ELMs (EOS-ELM) [27], however, does not have this capability, as all the neural network base models are trained on the same data set and the base model structures are not updated in real-time operation. The multiple model learning framework [28] suffers from the same drawback, as all the local RBF models are trained initially and the local RBF model structures are not updated during online operation.
To ensure diversity, base models must be statistically different. If base models represent different local regions of the data, then they are statistically different and the diversity of the model set is guaranteed. For soft sensing applications, the concept of local learning is introduced in [29], where the offline training data is partitioned into multiple local regions, each covered by a local model. The works [30], [31] further extend this localization strategy to online operation, growing new local linear models adaptively on newly emerging process states and thus producing adaptive online modeling with a selective ensemble drawn from the diverse set of local linear models. Motivated by the works [30], [31], which target the different application of soft sensor design, our recent work [32] proposes a selective ensemble based multiple local model learning (SEMLM) scheme for nonlinear and nonstationary systems. The SEMLM adaptively identifies newly emerging characteristics of the underlying system and grows the local linear model set online accordingly. Online modeling is then carried out by a selective ensemble of a subset of local linear models from the model candidates. The results obtained in [32] show that this SEMLM is capable of producing more accurate online predictive modeling than the fast tunable RBF of [19]. A potential drawback of the SEMLM, in comparison with the fast tunable RBF, is that it may impose a higher average computation time per sample (ACTpS). This is because, for a highly nonstationary system and over a long period of online adaptation, the number of local linear models can grow very large, which may impose high computational complexity in constructing a selective ensemble model.
The main motivation of our current work is to improve the online computational complexity of the SEMLM, while retaining its capability of maintaining a highly diverse local linear model set and producing highly accurate adaptive selective ensemble modeling. Clearly, for effective online learning of fast-arriving and time-varying data, learning models must not only be able to capture newly occurring knowledge as fast as possible but also be able to forget previously accumulated concepts that are no longer relevant [7], [33]. Therefore, in order to reduce online computational complexity, it is desirable to remove some of the 'oldest' local linear models from the model set. However, this is not as simple as it appears. Every local linear model in the model set represents some local process state or knowledge that has actually appeared. The fact that a local model is not used in the most recent selective ensemble does not imply that it will not be needed in the future. Our main contribution is to derive a reliable mechanism for removing 'unwanted' local models online, without sacrificing the diversity and accuracy of selective ensemble regression.
Specifically, in this paper, a growing and pruning selective ensemble regression (GAP-SER) is proposed for time-varying nonlinear data. During online operation, each newly emerging process state is automatically identified from the incoming data and a local linear model is fitted to it. This growing strategy is identical to that of our previous SEMLM and maintains an up-to-date and diverse local model set. However, unlike our previous work [32], which constructs the selective ensemble based on the mean square error (MSE) metric, we build the selective ensemble based on a probability metric, which achieves excellent online modeling performance with very few local models selected and hence helps to maintain low online computational complexity. Most importantly, an ensemble pruning strategy is performed to reliably remove the 'unwanted' local linear models and, therefore, achieve a lower ACTpS without sacrificing online modeling accuracy. These two improvements make the proposed ensemble regression particularly suitable for online modeling of nonlinear and nonstationary systems. Three case studies, 1) chaotic time series prediction, 2) online identification of a real-world industrial system, and 3) EEG data modeling, are used to demonstrate the effectiveness of the proposed GAP-SER, in comparison with a range of benchmark schemes for modeling and identification of nonlinear and nonstationary systems.

II. GROWING AND PRUNING SELECTIVE ENSEMBLE REGRESSION
To achieve the ultimate goal of our GAP-SER, which is to produce accurate online prediction or modeling while imposing low computational complexity, we rely on two fundamental principles: the ability to maintain an up-to-date and diverse local linear model set, and the capacity to reliably remove unwanted out-of-date local linear models. The three components of our GAP-SER, namely, the growing strategy, the new pruning strategy and the new selective ensemble prediction, are now detailed.

A. GROWING STRATEGY
Given the data sample set {x(t), y(t)}_{t=1}^N, where x(t) ∈ R^m and y(t) ∈ R are the system's input vector and output, respectively, our task is to construct the local linear models {f_l}_{l=1}^L that are valid in their corresponding process states, represented by their respective sub-datasets {X_l, y_l}_{l=1}^L. Without loss of generality, let a local window W_ini = {X_ini, y_ini} of size W_G be initially set, and a local linear model f_ini be built on it as

ŷ_ini = X̄_ini β, with X̄_ini = [1_{W_G}  X_ini],   (1)

where 1_{W_G} ∈ R^{W_G} is the vector whose elements are all one, and the model parameter vector β ∈ R^{1+m} is given by the least squares (LS) estimate

β = (X̄_ini^T X̄_ini)^{−1} X̄_ini^T y_ini.   (2)

The prediction error or residual vector of this local model is

e_ini = y_ini − X̄_ini β.   (3)

By shifting the data window one sample ahead, a new window W_sft = {X_sft, y_sft} is obtained, which contains the samples {x(t), y(t)}_{t=t_ini+1}^{t_ini+W_G}. If the two local data regions W_ini and W_sft are not significantly different, the data within W_sft can be considered to follow the same distribution as in W_ini, and the window continues to be shifted forward. Otherwise, W_sft is considered to represent a new process state different from the one for W_ini, and a new local linear model f_new should be developed based on W_sft. Let the estimation error vector produced by f_ini on W_sft be denoted as

e_sft = y_sft − X̄_sft β.   (4)

Testing whether the two local data regions W_ini and W_sft are similar can then be turned into the equivalent test of whether e_ini and e_sft are significantly different.
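This least-squares fit of a local linear model can be sketched in a few lines of numpy; `X` and `y` are placeholders for the contents of one data window {X_l, y_l}:

```python
import numpy as np

def fit_local_linear(X, y):
    """Fit a local linear model y ~ [1 X] beta by least squares.

    X: (W_G, m) inputs of one data window; y: (W_G,) outputs.
    Returns beta (shape (1+m,)) and the residual vector e = y - [1 X] beta.
    """
    Xb = np.column_stack([np.ones(len(X)), X])     # prepend the all-ones bias column
    beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)  # LS estimate of beta
    e = y - Xb @ beta                              # residuals on the window
    return beta, e
```

For noiseless linear data the residual vector is numerically zero, which is the unbiasedness property the hypothesis tests below rely on.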
Since f_ini is a linear model, e_ini and e_sft are considered not significantly different when both their means, µ_ini and µ_sft, and their variances, σ²_ini and σ²_sft, are the same. Therefore, the two null hypotheses can be set to

H_0^µ: µ_ini = µ_sft,   (5)
H_0^{σ²}: σ²_ini = σ²_sft.   (6)

The mean µ_ini and variance σ²_ini are estimated based on e_ini, while µ_sft and σ²_sft are estimated based on e_sft. Since f_ini is an unbiased estimator, µ_ini = 0 and σ²_ini = e_ini^T e_ini / (W_G − 1). Assuming that e_ini and e_sft follow normal distributions, the T and χ² statistics can be constructed as

T_0 = µ_sft / (σ_ini / √W_G),   (7)
χ²_0 = (W_G − 1) σ²_sft / σ²_ini.   (8)

According to statistical theory, if the hypotheses H_0^µ and H_0^{σ²} are both valid, the T_0 statistic (7) and the χ²_0 statistic (8) follow the t distribution and the χ² distribution with W_G − 1 degrees of freedom, respectively. Thus, the t-test and χ²-test can be utilized to test the above two hypotheses. Specifically, the conditions for accepting H_0^µ and H_0^{σ²} are

|T_0| < λ_t and χ²_0 < λ_χ,   (9)

where λ_t is the threshold of the T statistic for the given significance level α_t, which satisfies Pr{|T| < λ_t} = 1 − α_t, while λ_χ is the threshold of the χ² statistic for the given significance level α_χ, which satisfies Pr{χ² < λ_χ} = 1 − α_χ. Let the local model set contain L > 1 independent local linear models {f_l}_{l=1}^L, with f_ini = f_L. When one or both conditions of (9) are violated, W_ini and W_sft are significantly different, and the new local linear model f_new = f_sft is different from f_L. We still need to test whether f_new differs from {f_l}_{l=1}^{L−1}. This task can also be fulfilled by hypothesis testing. Let the prediction errors on W_new = {X_sft, y_sft} based on f_new and f_l be defined respectively by

e_new = y_sft − X̄_sft β_new,   (10)
e_l = y_sft − X̄_sft β_l.   (11)

With the assumption that e_new and e_l follow normal distributions, the T and χ² statistics are constructed as

T_l = (µ_l − µ_new) / (σ_new / √W_G),   (12)
χ²_l = (W_G − 1) σ²_l / σ²_new,   (13)

where µ_new and σ²_new are the estimated mean and variance of e_new, while µ_l and σ²_l are the estimated mean and variance of e_l.
If the null hypotheses

H_0^µ: µ_new = µ_l,   (14)
H_0^{σ²}: σ²_new = σ²_l,   (15)

are both valid, the T_l statistic (12) and the χ²_l statistic (13) follow the t distribution and the χ² distribution with W_G − 1 degrees of freedom, respectively. Therefore, if there exists an l ∈ {1, 2, · · · , L − 1} such that the hypotheses (14) and (15) are both accepted, f_new is considered statistically identical to the existing model f_l and is not added to the model set; otherwise, f_new is added as a new base model. The significance levels in the statistical tests are typically set to α_t = 0.05 and α_χ = 0.05. This growing strategy is summarized in Algorithm 1. Clearly, the growing model window size W_G is the only algorithmic parameter to be set.
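The exact expressions of the test statistics are elided in the extracted text; the sketch below uses the standard one-sample t and χ² forms implied by the surrounding discussion (µ_ini = 0 and σ²_ini = e_ini^T e_ini / (W_G − 1) for an unbiased f_ini). The thresholds `lam_t` and `lam_chi` are assumed precomputed from the inverse t and χ² CDFs at the chosen significance levels:

```python
import math

def window_test_stats(e_ini, e_sft):
    """T and chi-square statistics in the spirit of (7)-(8): f_ini is unbiased
    on its own window, so mu_ini = 0 and sigma2_ini = e'e / (W_G - 1)."""
    W_G = len(e_ini)
    sigma2_ini = sum(v * v for v in e_ini) / (W_G - 1)
    mu_sft = sum(e_sft) / W_G
    sigma2_sft = sum((v - mu_sft) ** 2 for v in e_sft) / (W_G - 1)
    T0 = mu_sft / math.sqrt(sigma2_ini / W_G)      # t-distributed, dof W_G - 1
    chi2_0 = (W_G - 1) * sigma2_sft / sigma2_ini   # chi2-distributed, dof W_G - 1
    return T0, chi2_0

def is_new_state(e_ini, e_sft, lam_t, lam_chi):
    """Both null hypotheses are accepted iff |T0| < lam_t and chi2_0 < lam_chi;
    a new process state is flagged when either hypothesis is rejected."""
    T0, chi2_0 = window_test_stats(e_ini, e_sft)
    return not (abs(T0) < lam_t and chi2_0 < lam_chi)
```

The same pair of tests can be reused for the f_new versus f_l comparison of (12)-(15) by passing e_new in place of e_ini and e_l in place of e_sft.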
Remark 1: This local learning strategy is identical to the one given in our previous work [32], and it can operate both offline and online. During online operation, when the newest data sample {x(t_next), y(t_next)} becomes available, the data window shifts one sample ahead and the corresponding learning procedure is carried out. It can be seen that this growing strategy is capable of identifying every newly occurring process state and, moreover, all the local linear models in the base model set are statistically different. Therefore, our local learning strategy maintains the maximum diversity of the base model set.

B. NEW PRUNING STRATEGY
The essence of selective ensemble regression (SER) is to accurately capture the current local characteristics, rather than to model the overall system dynamics. The base model set {f_l}_{l=1}^L identified by the growing strategy of Algorithm 1 represents all the local process states that have occurred so far. Some of these local linear models are likely to be far away from the current data range and are not needed in modeling the current process dynamics. Indeed, a SER constructs an online prediction model by selecting a subset of relevant local models. For fast-arriving, highly nonstationary data, over a long period of online operation the base model set is likely to become very large, and this imposes high online computational complexity in constructing the SER prediction. This issue is related to the so-called stability-plasticity dilemma [34].
Algorithm 1 Growing Strategy (fragment as recovered)
Step 1: New local model detection
6: When a new data sample is available, shift W_sft one sample ahead.
7: Calculate e_sft, and estimate µ_sft and σ²_sft.
8: Construct the T and χ² statistics using (7) and (8).
9: If both conditions of (9) are satisfied
10: Go to Step 1.
Step 2: Model comparison (for each existing local model f_l)
Compute e_l, and estimate µ_l and σ²_l.
[test the hypotheses (14) and (15)]
21: End if
22: End for
23: Step 3: Add new local model
24: [build f_new on W_sft and append it to the local model set]

An ensemble learner should not only have the ability to retain acquired knowledge (stability) but also adapt to new concepts with fast recovery (plasticity). On the one hand, stability implies that a learner retains the acquired knowledge needed to maintain diversity. On the other hand, plasticity requires a learner to forget part or all of its previous knowledge in order to capture new knowledge from upcoming data as fast as possible.
To achieve plasticity, it is desirable to remove models that do not contribute to the selective ensemble's performance, based on some ensemble pruning strategy. Two major issues in ensemble pruning are: 1) deciding which model should be removed, and 2) deciding when and how frequently to remove models. In the existing literature, various ensemble pruning strategies have been proposed. For example, the removal of models can occur when the number of models exceeds a threshold [35]-[37] or when the memory usage exceeds a threshold [38]. The excluded model can be the oldest model [35], [36] or the model with the worst performance [37], [39]. None of these schemes is sufficiently reliable. In highly nonstationary environments, performing ensemble pruning reliably is very challenging, particularly for data with seasonality and periodicity. Removing the oldest or worst-performing model, for example, runs the risk that the removed model may actually become important in the future [20].
In order to improve the reliability of ensemble pruning, we propose to remove a local linear model only if it does not contribute to the SER prediction over a pruning model window of size W_P > 1; that is, a model can be removed only if it is not selected by the SER for W_P consecutive samples. If a model is consistently not needed for the current W_P prediction samples, the probability of it being selected for prediction in the near future is extremely small. Therefore, removing such a model will not affect the prediction performance in the near future. It should be emphasized again that any pruning strategy runs the risk that a removed model may become important in the future, because the local process state represented by the removed local linear model may re-appear. Fortunately, our growing strategy is capable of re-discovering it when the corresponding process state re-appears in the data stream.
Since our pruning strategy is linked to the SER construction, it is necessary to start the discussion from how the SER prediction is constructed. Assume that after the online operation at sample t, Algorithm 1 has produced the local model set {f_l}_{l=1}^L. The p latest labeled samples {x(t − i), y(t − i)}_{i=0}^{p−1} are used in constructing the SER prediction, i.e., in deciding which subset of local linear models is selected. Let e_l(t) = [e_l(t) e_l(t − 1) · · · e_l(t − p + 1)]^T be the modeling error vector of the lth local linear model f_l over the prediction window, which is given by

e_l(t − i) = y(t − i) − f_l(x(t − i)), 0 ≤ i ≤ p − 1.   (17)

The performance metric of the lth local model is defined as the MSE

J_l(t) = e_l^T(t) e_l(t) / p.   (18)

The MSE J_l(t) can be conveniently transformed into a probability metric. Specifically, J_l(t) is first converted to a similarity measure [40] ranging from 0 to 1 as

s_l(t) = exp(−J_l(t)).   (19)

The probability metric Pr_l(t) of the lth model is then computed as the normalized similarity measure

Pr_l(t) = s_l(t) / Σ_{j=1}^L s_j(t).   (20)

Pr_l(t) can be used to quantify the contribution of the lth local linear model to the SER, since a large value of Pr_l(t) indicates that the lth local model is a good regressor for the SER, and vice versa. Arrange all the L local models according to their probability values in descending order as

Pr_{l_1}(t) ≥ Pr_{l_2}(t) ≥ · · · ≥ Pr_{l_L}(t).   (21)

We select the first M best local models for constructing the SER, where M is the largest integer satisfying the criterion

Pr_{l_M}(t) ≥ ε Pr_{l_1}(t),   (22)

where 0 < ε ≤ 1 is a desired tolerance. These selected M local linear models are then used to construct the SER prediction for the next sample t_next = t + 1, which is detailed in the next subsection. Note that this SER prediction construction is based on the probability metric (20), which is different from the normalized MSE metric used in our previous SEMLM [32]. The above discussion also suggests a pruning strategy. Specifically, the local models to be removed should be the l_{M+1}th to l_Lth models, which are not selected to form the SER for the prediction at t_next. However, pruning a model based on its 'one-sample' prediction performance may not be sufficiently reliable.
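A sketch of the probability-metric selection, under two assumptions flagged in the comments: the MSE-to-similarity map is taken as s_l = exp(−J_l), and the selection criterion is read as keeping every model whose probability is at least ε times the best one, which is consistent with ε = 1 selecting a single model and larger ε selecting fewer models:

```python
import math

def select_models(mse_list, eps):
    """Rank local models by the probability metric and select the SER subset.

    Assumes s_l = exp(-J_l) as the similarity measure (one common choice,
    mapping MSE to (0, 1]) and keeps every model with Pr_l >= eps * max Pr
    (our reading of the termination criterion). Returns the indices of the
    selected models, best first.
    """
    s = [math.exp(-J) for J in mse_list]
    total = sum(s)
    pr = [v / total for v in s]                           # probability metric Pr_l(t)
    order = sorted(range(len(pr)), key=lambda l: -pr[l])  # descending probability
    best = pr[order[0]]
    return [l for l in order if pr[l] >= eps * best]
```

With a moderate ε, models whose recent MSE is close to the best model's are kept; raising ε toward 1 shrinks the ensemble toward the single best local model.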
Note that it is always desirable to introduce a 'memory depth' for an ensemble learner. In our growing strategy, a local linear model is constructed on a data window of size W_G, within which the process is considered stationary. Similarly, we introduce a data window for pruning, with window size W_P. If a local model is never selected over W_P consecutive prediction samples, then it can be removed with high confidence.
Our pruning strategy is listed in Algorithm 2. Since p and ε are the algorithmic parameters of the selective ensemble prediction, the only algorithmic parameter of our pruning strategy is W P . We can conveniently set W P = W G .
Algorithm 2 Pruning Strategy (fragment as recovered)
Step 1: Counter update
7: If f_l is not selected at the current sample t
8: count_l = count_l + 1.
9: End if
10: End for
11: Set t = t + 1 and go to Step 1.
[At the end of the pruning window, for each local model f_l whose counter has reached W_P:]
15: Add l to the pruning model index set, and set index = index + 1.
16: End if
17: End for
18: Delete f_l for all l in the pruning model index set, and set L = L − index.
19: End if
20: Step 2: Pruning model window update
21: Clear the counters of all local models, set t_ini = t and index = 0, and go to Step 1.

Remark 2: Since the newest local linear model f_L represents the latest data, it is highly desirable to retain it. Therefore, we slightly modify the selection procedure by always retaining f_L. Consequently, count_L in Algorithm 2 is always set to zero. To ensure the diversity and hence the prediction accuracy of the ensemble, a minimal number of local models L_min should be guaranteed. Accordingly, the pruning in Algorithm 2 can be modified so that at most the (L − L_min) oldest models can be removed. More specifically, if the number of removal candidates (given by the index counter in Algorithm 2) is larger than L − L_min, only L − L_min of them are removed. The appropriate value of L_min is closely linked to the growing window size W_G. If W_G is small, each data window only represents a small region of local characteristics and, therefore, the size of the model library, i.e., L_min, should be sufficiently large to cover the entire data range. By contrast, if W_G is large, a small model library is sufficient to cover all the process states. In most cases, there exists a training data set for discovering the local process characteristics and identifying the local models that represent these local process states. In this case, we can set the minimum size of the model library L_min to the size of the local model set identified during training.
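The counter-based pruning can be sketched as follows. This is one reading of Algorithm 2 together with Remark 2's safeguards: the newest model's counter is never incremented, at least L_min models are always retained, and removing the oldest candidates first is an assumed tie-break not stated in the text:

```python
def mark_sample(counts, selected, newest):
    """Step 1: increment the counter of every model not selected at this
    sample; the newest model (id == newest) is always treated as selected."""
    for mid in counts:
        if mid not in selected and mid != newest:
            counts[mid] += 1

def prune_window(counts, W_P, L_min):
    """Step 2 (end of the pruning window): remove models never selected in
    the window (counter reached W_P), keep at least L_min models, then reset
    all counters for the next window. Returns the list of pruned ids."""
    candidates = sorted(m for m, c in counts.items() if c >= W_P)
    allowed = max(0, len(counts) - L_min)   # never shrink below L_min models
    pruned = candidates[:allowed]           # oldest candidates first (assumed)
    for mid in pruned:
        del counts[mid]
    for mid in counts:
        counts[mid] = 0
    return pruned
```

Here model ids are assumed to grow with creation time, so sorting candidate ids ascending removes the oldest ones first.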

C. NEW ADAPTIVE SELECTIVE ENSEMBLE PREDICTION
After the online operations at sample t, the set of local linear models {f_l}_{l=1}^L has been produced. At the next sample t_next = t + 1, the task of online modeling is to produce the model prediction ŷ(t_next) for the process's true output y(t_next), given the process input x(t_next) and the available local model set {f_l}_{l=1}^L. We adopt an ensemble of the M local linear models selected from the model library {f_l}_{l=1}^L based on the p latest labeled data {x(t − i), y(t − i)}_{i=0}^{p−1}. Recalling (17) to (22), the M selected local linear models yield the M model outputs

ŷ_m(t − i) = f_{l_m}(x(t − i)), 1 ≤ m ≤ M,   (23)

for 0 ≤ i ≤ p − 1. The estimate ŷ(t − i) of the process output y(t − i) is given as the weighted summation of the M selected subset models, which is computed by

ŷ(t − i) = Σ_{m=1}^M θ_m(t) ŷ_m(t − i),   (24)

where the nonnegative θ_m(t) is the combining coefficient of the mth selected local model, and the combining coefficients must satisfy the constraint

Σ_{m=1}^M θ_m(t) = 1.   (25)

The estimation errors are utilized to determine the combining coefficients. Specifically, the optimal combining coefficients can be obtained by minimizing the LS cost function

J(t) = Σ_{i=0}^{p−1} (y(t − i) − ŷ(t − i))²,   (26)

subject to the constraint (25). Because Σ_{m=1}^M θ_m(t) = 1, the cost function (26) can be rewritten as

J(t) = θ^T(t) Ē(t) θ(t),   (27)

where θ(t) = [θ_1(t) · · · θ_M(t)]^T, the stacked error vector is

e(t − i) = [e_{l_1}(t − i) · · · e_{l_M}(t − i)]^T,   (28)

and Ē(t) is the estimated error covariance matrix given by

Ē(t) = (1/p) Σ_{i=0}^{p−1} e(t − i) e^T(t − i).   (29)

The problem of determining the optimal θ(t) can then be formulated as the optimization

min_θ θ^T(t) Ē(t) θ(t) s.t. 1_M^T θ(t) = 1.   (30)

The Lagrangian function for the optimization (30) is given by

L(θ, γ) = θ^T(t) Ē(t) θ(t) − γ (1_M^T θ(t) − 1),   (31)

where γ > 0 is a Lagrange multiplier. Setting the gradient of (31) with respect to θ(t) to zero yields

2 Ē(t) θ(t) − γ 1_M = 0_M,   (32)

where 0_M = [0 · · · 0]^T ∈ R^M. This suggests that the optimal combining vector θ(t) can be obtained as follows. First, calculate

θ̃(t) = Ē^{−1}(t) 1_M,   (33)

which is followed by the normalization

θ(t) = θ̃(t) / (1_M^T θ̃(t)).   (34)

The prediction ŷ(t_next) for the process's true output y(t_next) is produced as the selective ensemble

ŷ(t_next) = Σ_{m=1}^M θ_m(t) f_{l_m}(x(t_next)).   (35)

Algorithm 3 summarizes the adaptive selective ensemble based prediction using our GAP-SER, where the prediction horizon p and the desired threshold ε are the two algorithmic parameters of the selective ensemble prediction construction.
Algorithm 3 Adaptive Selective Ensemble Prediction of GAP-SER (as recovered)
1: Initialization: If a training data set is available, construct the initial local model set {f_l}_{l=1}^L using Algorithm 1;
2: otherwise, give the value of L_min.
3: Give W_G, p and ε, and set W_P = W_G.
4: Step 1: Online prediction
5: Give input x(t_next) at the new sample time t_next = t + 1.
6: Calculate the probability Pr_l(t) of each local model using (20) for 1 ≤ l ≤ L.
7: Select the M subset models until the termination criterion (22) is satisfied.
8: Calculate the error covariance matrix Ē(t) using (29).
9: Calculate the optimal combining coefficients θ(t) using (33) and (34).
10: Predict the true system output y(t_next) with the selective ensemble prediction (35).
11: Step 2: Online pruning
12: Perform the relevant pruning operations of Algorithm 2.
13: Step 3: Online growing
14: When y(t_next) is available, add {x(t_next), y(t_next)} to the dataset with t = t + 1.
15: Carry out the relevant growing operations of Algorithm 1 to adapt the local model set.
16: Set t_next = t_next + 1, and go to Step 1.
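Steps 8-10 of Algorithm 3, i.e., the closed-form combining coefficients (33)-(34) and the ensemble prediction (35), can be sketched as follows. The small ridge term guarding against a singular Ē(t) is an implementation choice of this sketch, not part of the paper's derivation:

```python
import numpy as np

def combining_coefficients(E):
    """Closed-form combining vector of (33)-(34).

    E: (M, p) matrix whose rows are the selected models' error vectors over
    the prediction window, so Ebar = E E^T / p estimates the error covariance.
    A tiny ridge term is added as a practical safeguard against a singular
    Ebar (an implementation choice). Returns theta with 1^T theta = 1.
    """
    M, p = E.shape
    Ebar = (E @ E.T) / p + 1e-10 * np.eye(M)   # estimated error covariance
    theta = np.linalg.solve(Ebar, np.ones(M))  # theta_tilde = Ebar^{-1} 1_M
    return theta / theta.sum()                 # normalize so 1^T theta = 1

def ser_predict(theta, model_outputs):
    """Selective ensemble prediction (35): weighted sum of subset model outputs."""
    return float(np.dot(theta, model_outputs))
```

As expected from the derivation, a model with a smaller recent error receives a larger combining weight; note the closed form does not explicitly enforce nonnegativity of the weights.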

Remark 3:
The adaptive selective ensemble prediction of Algorithm 3 is very different from that given in our previous work [32]. First, the probability metric used in the present work is different from the selection metric of [32]. More importantly, unlike the scheme of [32], which only performs adaptive local modeling by growing the local linear model set, we not only perform adaptive model set growing but also carry out reliable adaptive local model set pruning. This significantly reduces the online computational complexity of the adaptive selective ensemble prediction, without sacrificing the prediction accuracy.

III. EXPERIMENTAL RESULTS
Experiments involving a chaotic time series prediction and two real-world data sets are performed to evaluate the proposed GAP-SER, and the results are compared with existing online modeling approaches, which include the single global nonlinear modeling schemes of the RAN [17], the OS-ELM with RBF nodes [14], [15] and the fast tunable RBF [19], as well as the ensemble modeling schemes of the EOS-ELM with RBF nodes [27] and our recent SEMLM [32]. The MSE metric

MSE = (1/N_test) Σ_i (y(i) − ŷ(i))²,

is utilized to evaluate the online prediction performance, where ŷ(i) denotes the model prediction for y(i) and N_test is the number of test samples. The online computational complexity of an adaptive modeling method is quantified by its online ACTpS. The experiments are carried out in Matlab 2017a, running on a PC with an i7-3770 3.40 GHz 4-core processor and 16 GB of RAM.

A. ONLINE CHAOTIC TIME SERIES PREDICTION 1) DATA DESCRIPTION
The Lorenz chaotic time series [41] is governed by the three differential equations

dX/dt = a (Y − X),
dY/dt = X (c − Z) − Y,
dZ/dt = X Y − b Z,

where a, b and c are the parameters that control the behaviour of the Lorenz system. In our experiments, three cases are considered: 1) the Lorenz series with fixed parameters (LSF); 2) the Lorenz series with time-varying parameters (LSTV); and 3) the Lorenz series with a time-varying drift (LSTD), in which a drift is imposed on y(t) to produce the drifted series ỹ(t), which is used in prediction rather than the original y(t). The time series ỹ(t) is even more nonstationary than the {y(t)} of 1) and 2). In particular, the dynamic range of ỹ(t) changes from [−20, 20] initially to [−2000, 2000] in the end. The fourth-order Runge-Kutta method with a step size of 0.01 is used to generate the samples, and only the Y-dimension samples {y(t)} are used for time-series prediction. The 60-step ahead prediction is considered, which predicts y(t) from past samples. In the LSTD case, y(t) is replaced by ỹ(t). In all the simulations, after a long initial period, 4000 samples are generated.
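As a sketch of this data generation, the following integrates the Lorenz equations with fourth-order Runge-Kutta at the paper's step size of 0.01 and keeps only the Y-dimension samples. The classical parameter values a = 10, b = 8/3, c = 28, the initial state and the warm-up length are illustrative choices; the paper's parameter variations and drift for the LSTV/LSTD cases are not reproduced:

```python
def lorenz_series(n, a=10.0, b=8.0 / 3.0, c=28.0, h=0.01, warmup=5000):
    """Generate n Y-dimension samples of the Lorenz system via RK4."""
    def f(s):
        x, y, z = s
        return (a * (y - x), x * (c - z) - y, x * y - b * z)

    def rk4_step(s):
        k1 = f(s)
        k2 = f(tuple(s[i] + 0.5 * h * k1[i] for i in range(3)))
        k3 = f(tuple(s[i] + 0.5 * h * k2[i] for i in range(3)))
        k4 = f(tuple(s[i] + h * k3[i] for i in range(3)))
        return tuple(s[i] + h / 6.0 * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i])
                     for i in range(3))

    s = (1.0, 1.0, 1.0)
    for _ in range(warmup):   # discard the initial transient ("long initial period")
        s = rk4_step(s)
    ys = []
    for _ in range(n):
        s = rk4_step(s)
        ys.append(s[1])       # keep only the Y-dimension samples
    return ys
```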
The first 1000 samples are used for initial training and the last 3000 samples are employed for online prediction. Note that the RAN, the fast tunable RBF, our previous SEMLM and our proposed GAP-SER do not really need such a large number of training samples, but the OS-ELM and EOS-ELM do, as the ELM model must contain a large number of hidden nodes to cover the entire input space. For each time series, 100 independent realizations are generated. The performance of each method is presented as the mean and standard deviation (STD) of the test MSE and ACTpS over the 100 realizations.

2) INITIAL TRAINING
The RAN does not really need training and may be applied directly to prediction. Since we have training data, we can apply the RAN to the training data and obtain an initial RAN model. In the following, 'RAN' denotes the RAN without initial training and 'RANini' denotes the RAN with initial training. Similarly, our GAP-SER may be applied directly to online prediction from scratch, as Algorithm 1 will gradually build up a base model set. However, the online prediction accuracy may be poor during this build-up period, as there may be an insufficient number of base local models. In real life, a prediction model can only be deployed for online prediction if it is known to at least match the underlying process's past dynamics well. Therefore, in practice, initial training is somewhat necessary and always desirable. The same discussion also applies to the SEMLM. For the OS-ELM, we randomly select a large number of training input data points as its centers to cover the input space and determine its weights by the LS estimate. For the EOS-ELM, we use a 5-model ensemble and train each base model in the same way as the OS-ELM. For the fast tunable RBF, the training is done by the orthogonal least squares algorithm [42], which constructs a small RBF model. For the SEMLM and GAP-SER, the initial local linear model set is obtained by Algorithm 1. Fig. 1 shows the influence of the window size W_G on the number of local linear models obtained during training.

3) ONLINE PREDICTION PERFORMANCE COMPARISON
Three algorithmic parameters of the GAP-SER, namely the growing window size W_G, the prediction horizon p and the threshold ε for constructing the selective ensemble, need to be set. Note that the pruning window size is set to W_P = W_G. The impacts of W_G and p on adaptive modeling and online prediction are similar to those for the SEMLM, which is detailed in our previous work [32]. Generally, a model built on a large window is more stable and accurate but may not respond quickly to data changes [43]. Therefore, selecting an appropriate W_G is a trade-off between stability and adaptive capability. W_G = 38, 53 and 36 are selected empirically for the LSF, LSTV and LSTD, respectively. The choice of p typically trades off computational complexity and robustness against noise [32]. Also, since a smaller p can better track local characteristics, which is beneficial in fast time-varying environments, we empirically choose a small p = 5. Note that ε is particularly important, as it not only influences the number of subset models for the selective ensemble but also determines how many models are removed. Given W_G and p, the impact of ε is investigated. It can be seen from Fig. 2(a) that the test MSE increases with ε. The reason is twofold. 1) The number of subset models selected decreases as ε increases, as can be seen from Fig. 2(b); when ε = 1, only one local model is used for online prediction. 2) As ε increases, more models are removed from the model set and hence the number of local models L decreases, until L reaches the minimum L_min, as can be seen from Fig. 2(c). Basically, decreasing ε enhances the prediction accuracy at the cost of higher online computational complexity, as clearly illustrated by Fig. 2(a) and Fig. 2(d).
Taking into account both prediction accuracy and computational complexity, ε = 0.5, 0.5 and 0.6 are chosen for the LSF, LSTV and LSTD, respectively. The algorithmic parameters of the SEMLM are set in a similar way. For the fast tunable RBF [19], the node replacement threshold and the number of latest data points used for weight adaptation are empirically chosen as 10^{−6} and 5, respectively, while the step size and the maximum number of iterations for its iterative search procedure are empirically set to 0.01 and 5, respectively. For the RAN [18], its algorithmic parameters for online modeling, namely the maximum and minimum center distance thresholds, the error threshold and the decay constant, are carefully tuned for each time series.
Figs. 3 to 5 show the learning curves of various schemes for the LSF, LSTV and LSTD test datasets, respectively. Table 1 further compares the prediction accuracy and online computational complexity of all the schemes. Not surprisingly, the OS-ELM and EOS-ELM perform poorly, in terms of both test MSE and ACTpS. As expected, the EOS-ELM imposes significantly higher ACTpS than the OS-ELM, and yet it does not necessarily achieve better online prediction accuracy than the latter. This is because the EOS-ELM of [27] lacks the necessary diversity capability, as all its base RBF models are trained on the same data. The results show that the RAN outperforms the OS-ELM and EOS-ELM considerably, in terms of both test MSE and ACTpS. Interestingly, the RANini does not always achieve better prediction accuracy than the RAN. For the LSF and LSTV prediction, for instance, the test MSE of the RANini is poorer than that of the RAN. This may be because, for learning in a nonstationary environment, a model with a small memory depth is better at capturing the current signal dynamics. Also, the RANini does not always impose higher online computational complexity than the RAN. In the LSF and LSTD predictions, the RANini actually has lower ACTpS than the RAN.
The fast tunable RBF, the SEMLM and the proposed GAP-SER are in a completely different league, and they dramatically outperform the OS-ELM, EOS-ELM and RAN, in terms of both prediction accuracy and online computational complexity. Specifically, the fast tunable RBF of [19], with its fast adaptive capability and small model size, is capable of achieving excellent prediction accuracy while imposing the lowest ACTpS. Owing to its fast adaptation capability and maximum diversity property, the SEMLM of [32] outperforms the fast tunable RBF in terms of test MSE. The SEMLM, however, imposes higher ACTpS than the fast tunable RBF. The reason is that the SEMLM only grows the local linear model set online. During long online operation, the size of the local linear model set may become large, which causes high computational complexity in constructing the selective ensemble prediction. By contrast, our GAP-SER, with its reliable pruning strategy, is capable of removing 'out-of-date' models from the local linear model set, and consequently it imposes considerably smaller computational complexity in constructing the selective ensemble prediction, without sacrificing prediction accuracy, in comparison with the SEMLM. In fact, the GAP-SER may even outperform the SEMLM in terms of test MSE, particularly for highly nonstationary data, such as the LSTV and LSTD predictions. Observe that the GAP-SER achieves ACTpS comparable with the very efficient fast tunable RBF.

B. ONLINE IDENTIFICATION OF MICROWAVE HEATING SYSTEM
1) SYSTEM DESCRIPTION
The microwave heating process (MHP) is a typical nonlinear and time-varying thermal process [44]-[47]. Since the MHP involves the coupling of multiple physical fields and its inner electromagnetic field distribution is normally unknown [44], data-driven modeling offers a practical means of MHP identification [48]-[50]. Temperature is a crucial measurement during the operation of the MHP, as thermal runaway often occurs due to the time-varying physicochemical properties of the material [46]. As the medium temperature increases, its dielectric loss increases dramatically, which in turn provides positive feedback to the temperature increase. Therefore, accurate online temperature prediction is vital for detecting thermal runaway in advance. Fig. 6 illustrates a real-world industrial microwave heating system [48], [51], which consists of five microwave generators and waveguides, temperature measurement sensors and the control system hosted in a programmable logic controller (PLC). The microwave generated by each microwave generator is transmitted through the corresponding waveguide, fed into the cavity and absorbed by the heated material. Each microwave generator has a maximum power supply of 3 kW at 2.45 GHz. The material is continuously transported through the cavity by the conveyor belt, whose speed can be adjusted by a motor driver. Three fiber optical sensors (FOSs), denoted FOS1 to FOS3, are placed at three different key locations using microwave-transparent tapes to record multiple points of temperature online. During the real-time operation of this MHP, the control center receives the measured temperature values from the FOSs and sends control commands, which include the five microwave powers u_{p_i}(t), 1 ≤ i ≤ 5, for the five microwave generators as well as the conveyor speed v(t), to the cavity. Thus, the control inputs to this MHP are given by u(t) = [u_{p_1}(t) ··· u_{p_5}(t) v(t)]^T. Each FOS measures the temperature y_{s_j}(t) at the FOS's location, where 1 ≤ j ≤ 3.
Because of the near-instantaneous response of the MHP, the temperature y_{s_j}(t) at the jth FOS's location can be adequately represented by [48], [51] y_{s_j}(t) = f_{nl-ns,j}(x_j(t); t), where f_{nl-ns,j}(·; t) denotes the corresponding unknown nonlinear and time-varying system mapping with the input vector x_j(t). From the large amount of data collected from this industrial microwave heating system, we use three datasets from the three FOSs, each containing 3000 data samples. We first normalize the five microwave power inputs and the temperature measurements; for the temperatures, ȳ_{s_j}(t) = (y_{s_j}(t) − y_{min,s_j})/(y_{max,s_j} − y_{min,s_j}), where y_{min,s_j} and y_{max,s_j} are the minimum and maximum temperature measurements of the jth FOS, respectively. For each FOS's dataset, we use the first 1000 samples for training and the last 2000 samples for online prediction. Table 2 compares the online identification performance of all the modeling methods, in terms of prediction accuracy and online computational complexity. From these results, the same observations as for the Lorenz chaotic time series can be drawn. In particular, our GAP-SER imposes the lowest ACTpS, while achieving a test MSE as good as that of the SEMLM.
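The min-max normalisation applied to each FOS channel can be sketched as follows (variable names are illustrative):

```python
def minmax_normalise(y, y_min, y_max):
    """Map a channel's measurements into [0, 1] by min-max scaling.

    y_min and y_max are the minimum and maximum measurements of the
    channel, so the smallest sample maps to 0 and the largest to 1.
    """
    return [(v - y_min) / (y_max - y_min) for v in y]
```

For example, a temperature channel spanning 10 to 30 degrees maps 20 degrees to 0.5. The same scaling is applied independently to each of the five microwave power inputs and each FOS's temperature measurements.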
C. EEG DATA MODELING
1) DATA DESCRIPTION
Analysis of electroencephalographic (EEG) time series is a practical example of nonlinear and nonstationary system identification [52], [53]. We use a real EEG time series {y(t)} available from the University of Bonn [54]. The sampling rate is 173.61 Hz, and we construct a dataset D_N = {x(t), y(t)}, t = 1, ..., N, from 10 seconds of the signal, a total of 1730 samples, where x(t) = [y(t−1) y(t−2) y(t−3) y(t−4)]^T. The first 5 seconds of observations, with 865 data pairs, are used for training, and the remaining 5 seconds, also with 865 data pairs, are used for testing.
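The construction of the delay-embedded dataset, with x(t) built from the four previous samples, can be sketched as follows (a minimal sketch of the stated embedding, not the paper's code):

```python
def build_dataset(y, order=4):
    """Build regression pairs (x(t), y(t)) from a scalar time series.

    x(t) = [y(t-1), y(t-2), ..., y(t-order)] and the target is y(t),
    so the first usable target is y(order).
    """
    X, Y = [], []
    for t in range(order, len(y)):
        X.append([y[t - k] for k in range(1, order + 1)])
        Y.append(y[t])
    return X, Y
```

Applied to the 1730-sample EEG series with order 4, this yields 1726 input-target pairs, which are then split into the 865-pair training and test segments described above.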

2) ONLINE MODELING PERFORMANCE COMPARISON
Only the RAN, fast tunable RBF, SEMLM and GAP-SER are compared, as the online performance of the OS-ELM and EOS-ELM is poor. The algorithmic parameters of the GAP-SER are empirically chosen as W_G = 30, p = 30 and ε = 0.1. The threshold of the SEMLM is ε = 1. For the fast tunable RBF, the node replacement threshold is 10^-5, the weight update innovation length is 3, the step size is 0.01, and the number of iterations is 5. For the RAN, the maximum and minimum center distance thresholds, the error threshold and the decay constant are also set empirically. The online MSE learning curves of the four methods are shown in Fig. 10. Additionally, the online prediction and adaptive modeling performance are compared in Table 3. Clearly, the results obtained demonstrate that our proposed GAP-SER is capable of capturing the nonstationary dynamics of the EEG signal accurately, while imposing very low online computational complexity. The signal recovered by the GAP-SER and the original EEG signal are shown in Fig. 11.

IV. CONCLUSION
This work has proposed a new growing and pruning selective ensemble regression learner for online adaptive modeling of nonlinear and nonstationary data. Our growing strategy automatically identifies every newly emerging process state from fast-arriving data and fits a local linear model to it, ensuring maximum diversity of the base model set. On the other hand, our new pruning strategy is capable of removing out-of-date local linear models reliably and, therefore, significantly enhances the plasticity of our selective ensemble learner. A direct consequence of this reliable pruning is that the online computational complexity is significantly reduced, which is vital for adaptive modeling of fast-arriving data. Based on a probability metric, our new selective ensemble learner selects a small number of best subset linear models from the local linear model set and optimally combines them to produce accurate online predictions. Extensive experiments, including a chaotic time series prediction, online identification of a real-world industrial microwave heating system and EEG data modeling, have been conducted. The results obtained demonstrate that our GAP-SER compares very favourably with existing state-of-the-art adaptive modeling schemes for nonlinear and nonstationary systems. Specifically, it has been shown that our GAP-SER is capable of producing the most accurate prediction while imposing the lowest online computational complexity for highly nonlinear and nonstationary cases. Our further efforts will be devoted to applying this modeling technique to real-world industrial control systems.