Robust Online Learning Method Based on Dynamical Linear Quadratic Regulator

In this paper, a novel algorithm is proposed for inferring online learning tasks efficiently. By a carefully designed scheme, the online learning problem is first formulated as a state feedback control problem for a series of finite-dimensional systems. Then, the online linear quadratic regulator (OLQR) learning algorithm is developed to obtain the optimal parameter updating. Solid mathematical analysis on the convergence and rationality of our method is also provided. Compared with the conventional learning methods, our learning framework represents a completely different approach with optimal control techniques, but does not introduce any assumption on the characteristics of noise or learning rate. The proposed method not only guarantees the fast and robust convergence but also achieves better performance in learning efficiency and accuracy, especially for the data streams with complex noise disturbances. In addition, under the proposed framework, new robust algorithms can be potentially developed for various machine learning tasks by using the powerful optimal control techniques. Numerical results on benchmark datasets and practical applications confirm the advantages of our new method.


I. INTRODUCTION
As an important subtopic of machine learning, online learning has attracted increasing attention during the past decade due to its extensive applications to realistic modeling problems, for instance, online advertising, financial quantitative transaction, and mechanical damage detection [1]- [4].In the online learning, data become available in a stream, and predictive models are required to be updated in a real time manner.Therefore, two key issues need to be addressed in developing new algorithms.The first issue is to absorb new information from the incoming data flow and incrementally update the predictive model.The other one is to remove the old and useless information to maintain The associate editor coordinating the review of this article and approving it for publication was Shagufta Henna.
the parsimony of the model and restrain the computation load within a certain lever.
A wide variety of online learning methods have been proposed during the last decade.A major approach is the gradient based method, in which the update is often given by solving an empirical error minimization problem [5]- [8].The gradient descent principle is introduced to ensure that the computation load will not increase substantially when the data sequentially come into the learning.The convergence rates and approximation error have also been studied in literature [9], [10].Another major approach is the online version of batch learning algorithms [11]- [13].In these algorithms, the learning model is updated by repeatedly solving the corresponding regularized error minimization problem when the new instances are subsequently added.The sample selection techniques such as the moving window strategy, fast leave one out, and pruning error minimization, have been applied to reduce the computation complexity, meanwhile improve the sparsity and generalization performance of the learning model [14]- [19].Among them, passive aggressive (PA) algorithms are a family of margin based online learning methods [20]- [25].Instead of penalizing the complexity of the model, PA algorithms penalize the increments of learning model and update the model when the predictive error exceeds a predetermined threshold value.Compared with the gradient based methods, PA algorithms have shown advantages in robustness due to their less sensitivity to the parameter setting and adaptive learning rate.
Although the existing methods are very useful for many real-world applications, there still exist several challenges when these methods are applied to complex data streams.A major limitation is the deficiency of coping with noise effects.Regarding the batch learning methods, when complex random disturbances are encountered, the optimization in the least square form needs to be modified to improve the prediction efficiency.For example, for data with heterogeneous noise, the weighted least square regression is often utilized.However, in online learning cases, it is impossible to compute the weighted parameters or obtain a stable structure of noise effects.Thus, the sample selection techniques and modification strategies cannot be well incorporated into the learning simultaneously.In the gradient based methods and PA methods, the updates rely on the feedback of error signals.In many realistic applications, the error signals may be inevitably corrupted by noise and thus provide incorrect gradient information for updates.This problem will be even worse if the learning is conducted in the environments of high intensity noise.The updates will be continuously misguided, and the convergence rate will be seriously affected.Another limitation is that some key learning parameters, such as the length of the window in the moving window regression [26] and the learning rate in the gradient based methods [27], are difficult to be adjusted.It is noted that in online learning, the system under studies often experiences abrupt changes and the objective model is usually time-varying.Therefore, a satisfactory modeling performance cannot always be guaranteed with constant learning parameters.Unfortunately, in the majority of the literature, these parameters are predetermined based on heuristic knowledge, which also brings substantial challenges to the online learning problems.
In this paper, we introduce the state feedback control theory into the modeling of data streams and propose a novel online learning approach.By a carefully designed numerical scheme, the online learning problem is reasonably transformed into the state feedback control problems for a series of finite-dimensional, controllable, and completely observable systems.Dynamical linear quadratic regulator is introduced to solve the corresponding optimal control problem.Two algorithms, named as the online linear quadratic regulator learning algorithm (OLQR) and online kernel linear quadratic regulator learning algorithm (OKLQR) are developed for the learning problems in the linear space and reproducing kernel Hilbert space respectively.Since a completely different approach is explored in our framework, there is no need to introduce any online adjustment to the learning parameters, complex data window or pruning techniques, which enables our method to overcome the aforementioned limitations of the existing methods.Compared with the conventional methods, the proposed algorithms can achieve better performance in both learning efficiency and prediction accuracy regardless of the characteristics of noise disturbances.Although a number of pioneering approaches have been proposed for developing control-based learning algorithms [28]- [30], the learning problems were transformed into the output feedback control problems in these works rather than the state feedback control problems proposed in our method.Our approach not only bring better control performance for learning but also leads to strict mathematical analysis on the convergence, all of which rationalize our proposed method for establishing a solid learning framework.
Notation: The real field is denoted by R, while R M denotes the set of real vectors of size M .The superscript (•) T and (•) −1 denote the transpose and inverse of a matrix, respectively.< •, • > and || • || denote the inner product and norm of a Hilbert space respectively.Vectors are denoted by bold letters and matrix by capital letters in Spencerian fonts i.e., s and S respectively.I denotes the identity matrix, and finally 0 ≺ S ≺ I means that the symmetric matrix S and I − S are both positive definite.

II. BENCHMARK ONLINE LEARNING METHODS
We start with a brief review on some benchmark online learning methods.The issues discussed in this section for the linear system can be naturally extended to the nonlinear cases.The gradient based methods are the most popular algorithms in optimization and so far the most common approach in the online learning strategies.Assume that there is a sequence of random samples (x(n), y(n)) (n = 1, 2, . ..) generated by the model where Here β(n) denotes the estimate of β at time slot n, and the prediction error is defined as e In the stochastic gradient methods [5], [6], the learning law is given by minimizing the following instantaneous risk The update is given by where λ > 0 is the regularization parameter and η is the learning rate.Some gradient algorithms are also proposed VOLUME 7, 2019 without an explicit regularization, i.e. λ = 0, given by The learning algorithms displayed above are also referred as stochastic approximation methods or least mean square methods.A major limitation of the gradient based methods is the deficiency of coping with noise.The error signal e(n) is used as the distance between β(n) and β .If e(n) = 0, the algorithm considers that β is accurately estimated, and β(n) remains steady without any updates.Otherwise, β(n) will be updated.Noticed that the update of β(n) becomes The error signal e(n) is inevitably corrupted by noise ε(n), and may provide false information for the online estimation.
Another difficulty is the determination of learning rate η.
A large learning rate may lead to quick convergence, but also raise the risk of instability of the algorithm.However, a small enough learning rate can guarantee the asymptotic convergence but often lead to slow learning and inefficiency for the fast-changing systems.Although the majority of the gradient based methods usually prefer a relatively small rate to ensure the convergence, the range of optimal learning rates is still distinct for online learning problems with time-varying objective functions.Although the adaptability of learning rates has been extensively studied [8]- [10], heuristic knowledge or additional cost is still needed.
In the PA learning methods [20], [23], suppose at time slot n, the algorithm receives instance (x(n), y(n)) and makes a prediction ŷ(n) = x(n)β(n).For the true target value y, the algorithm defines a loss function.For example, the υ-insensitive hinge loss function is defined as where υ is a positive parameter which controls the sensitivity of the updates.The algorithm sets the law as the solution of following optimization The update given in (8) has a closed form solution as where τ n = L υ (β(n); x(n), y(n)).From ( 8) and ( 9), it can be seen that the error signal e(n) along with learning rate τ (n) are both corrupted by random disturbances, and the updates can be misguided for the similar reason that we have discussed for the gradient methods.The learning parameters are usually heuristically chosen to obtain a satisfactory modeling performance.
For online batch learning methods [11]- [13], suppose that at the beginning, we have data (x(n), y(n)), n = 1, 2, . . ., N , the optimization problem with regularization is given by min or by a more general formula, which can include the nonlinear case where f is a nonlinear approximator.This method cannot be directly applied to online cases since all the samples will be added into the learning.The estimated model will be a lack of sparsity and become computationally infeasible.Some sparsification techniques have been developed.For example, the moving window method was proposed in [26], which assumes that the earliest data in a window contains the least information.The pruning error minimization method was also developed in [14].This method selects the sample that brings the smallest error after it has been pruned.It is noticed that realistic random disturbances are usually heterogeneous, temporal correlated and even non-Gaussian.Some modifications must be made to the optimization methods [31]- [33].For the case of heterogeneous noise, the weighted least square optimization is usually employed where σ n is the weight parameter.However, for the time varying data stream, it is extremely cumbersome or impossible to obtain stable and efficient estimates for σ n in a real time manner.

III. A BRIEF INTRODUCTION OF OPTIMAL CONTROL
To demonstrate our motivations, we give a brief introduction of optimal control for discrete linear systems.Optimal control is a mature mathematical discipline with a wide range of applications in science and engineering.Consider the following linear control system where A and B are coefficient matrices, Z(n) is the state vector series and θ (n) is the control input vector.The objective of optimal control is to find a control policy that stabilizes systems, and optimizes a specific performance index for systems.With controller gain F and control input θ (n) = FZ(n), the closed loop system Z(n+1) = (A+BF)Z(n) is stable, i.e.A + BF is contractive.In optimal control, an efficient F can be obtained by minimizing following performance criterion including a cost function that penalizing the state and control input simultaneously where S 0 , Q 0 and R 0 are constant symmetric positive matrices, and N is a positive constant.Despite the accumulated optimal control models and methods, there are still limited approaches for addressing the online learning problems from the perspective of optimal control.As discussed in the previous section, the prediction error is used in online learning to infer the distances between the model parameters and objective parameters, and the online updating of model parameters ( ) is supposed to minimize the prediction error as much as possible.This process is similar to that in the stabilization problem of control theory, for example, obtaining a series of control inputs θ (n)s to stabilize (13) and making Z(n) converges to the original point, which sheds light on the motivations of the technical methods to be developed in this paper.In the following sections, we develop numerical schemes using the prediction errors as state variables of control system and the parameter updates as control input.Meanwhile, we also elaborate the connections between the optimal control and online learning.

IV. ONLINE LEARNING FRAMEWORK
In this section, we establish a novel online learning framework from the perspective of state feedback control.Consider an adaline machine with M input variables.Suppose there is a data stream (x(1), y(1)), . . ., (x(k), y(k)), . . .generated by where β = (β 1 , β 2 , . . ., β M ) T is the objective parameter vector.Let β(n) be the estimated parameter vector of β at time slot n.Assume that at time slot n, we already have an estimated β(n), which will be updated by Let ê(n − l) be the projected prediction error by β(n + 1), i.e.
for l = 0, 1, . . ., M − 1.Let e(n − l) be the prediction error by β(n), for l = 0, 1, . . ., M − 1.For all l, with we rewrite (18) as To investigate the learning problem from optimal control perspective, we further denote (19) as where U(n) = β(n) is a feedback control input waited to be determined.This is a linear, discrete and finite dimensional error dynamical system with parameter vector B(n) and control input U(n).Based on (20), the update for the learning of linear model is transformed into a typical control problem of a discrete dynamical system with finite dimensional control input U(n) [34].Generally, to realize online learning, U(n) is the feedback to the observations, i.e.
where F(n) is the controller gain to be determined from the optimal control problem.It is noted that β(n) is assumed to be known at time slot n.Therefore, e(n − l) can be observed and the problem can be regarded as a state feedback control problem.The objective of the learning is to obtain an efficient F(n) to stabilize the error system (20) and make the state E(•) be contractive to the origin point.This implies that after the update, the prediction error by β(n + 1) will be smaller than that by β(n) on the given sample set, from the point view of algebraic mapping.As n grows, a series of time invariant control systems can also be formulated.If we sequentially stabilize these systems, a series of control inputs . .) can be obtained as well and In the following, we show that the distance between β(n + k) and β will be restricted in a compact set as k → ∞.Moreover, in literatures, ε(n) is usually set to be normal to facilitate the statistical analysis.However, random perturbations are always show the characteristics of boundness.For example, the return rate of the securities in stock market is usually assumed to be normal in econometrics models, but in fact it is bounded [35].Therefore, ε(n) is assumed to be bounded where I + B(n)F(n) is a state transition matrix for the control system.Denote Proof: See the appendix A. It is well known that in linear regression analysis, if the variables are perfectly multiple collinear, i.e. one independent variable is an exact linear combination of the others, there is no method to obtain a unique and promising learning model.Therefore, B(n) is assumed to be full rank and invertible.This also guarantees that system (20) is completely controllable, which greatly facilitates the development of our new method.Thus the estimation error of β(n) will be convergent to a compact set including the origin as n → ∞ with a series of carefully designed control inputs U(n), which means it is possible to develop efficient learning algorithms from the perspective of state feedback control theory.When the intensity of e(n) is zero (D = 0), the compact set can be extremely small in this noise-free case.

V. ROBUST ONLINE LEARNING METHOD BASED ON LQR
The theory of optimal control is concerned with operating a dynamic system at minimum cost.The case where the system dynamics are described by a set of linear equations and the cost is described by a quadratic function is called the linear quadratic (LQ) problem [34], [36], [37].One of the main results for the LQ problem is the linear quadratic regulator, which is a well known method that provides optimally controlled feedback gains to enable the closed loop stable and high performance design of systems.For systems controlled by LQR, control inputs and plant responses are predicted using the system state space model and optimized over a family of piecewise constant intervals with respect to a cost function with weighting factors including penalties on the system states and control inputs.Once the optimization problem is solved, only the control input of the current time slot is implemented.This optimization procedure is then repeated in the next time slot to continuously generate a series of efficient control inputs.
In this paper, the infinite horizon LQR is utilized to obtain the optimal state feedback control inputs to establish our online learning method.For system (20), we construct a virtual time invariant dynamical control system as follows where 24) is regarded as a time invariant (static) linear system.It is easy to verify that ( 24) is a completely controllable and observable as long as B n is full rank, which implies there exists an optimal control that can stabilize (24) according to the control theory [37], [38].To develop the learning method by LQR, the infinite time horizon optimization problem is constructed as where F n is the controller gain to be determined.R and Q are systematic and semi-definite matrices.These matrices can be chosen to obtain a desirable closed loop response.Here Q and R are set to unit matrices I and γ I , respectively.The first term in V measures the output deviation, and the second term penalizes the intensity of control input.In addition, γ > 0 is a tradeoff parameter to weight the two goals of the optimization.Once the solution of ( 25) is obtained, the control input in (20) and parameter update for the learning model is given by applying only the first control input U n (1) as 24) is updated to B n+1 accordingly.The corresponding LQR problem (25) is solved again to obtain the update law β(n + 1).This procedure will be repeated to update the model in a real time manner as the observation data are continuously added into the learning process, which is also named as the dynamical LQR in this study.
The techniques for obtaining the solution of linear quadratic optimization problem are also utilized to obtain the following schemes of our algorithm [37], [39].Theorem 2: The solution of optimization problem (25) can be given by solving the following matrix equation for P n , given by In addition, the optimal controller gain and parameter update are given as The estimation error by applying the update law is convergent, i.e. there exists a constant Proof: See the appendix B. It can be seen from the proof of Theorem 2 that with the update law obtained by the infinite horizon LQR, the distance between β(n) and β will be exponentially convergent to a 117784 VOLUME 7, 2019 compact set including the origin as n → ∞, which leads to an efficient and fast convergent online learning method named OLQR in this paper.In addition, the compact set and the estimation error can be extremely small in the noise free cases.Therefore, the dynamical LQR and the concept of state feedback control can be well introduced into the machine learning of data streams, which is one of the main contributions of this paper.Algorithm 1 shows the detailed steps of the proposed method for the learning of adaline system (14).Regarding the time horizon N in the index (25), from the perspective of optimal control, the solution obtained from LQR over a large time horizon can usually stabilize the closed loop system.However, the system may be still unstable with a time horizon that is not long enough.This problem can be entirely avoided if the controlled performance is evaluated over an infinite prediction horizon i.e.N = ∞, which not only ensure the convergence but also lead to an explicit, convenient and consistent computational scheme for the online learning [37], [39].The algorithm is named as OLQR (for system (14)) and summarized as follows 1) Initialization: the number of variables M .
2) Let the system run for M steps, and β(M ) = 0.

VI. THE ONLINE LEARNING IN KERNEL SPACES
In this section, we consider the online learning problem in the reproducing kernel Hilbert space H K associated with inner product < •, • > H and feature mapping φ.The kernel K for H K has the reproducing property [8], [9], i.e.
Here H K is the closure of the span of all K (x, •), and is the Gaussian kernel, and σ is the bandwidth.In this study, it is assumed that X is a compact subset.Suppose there exists a data stream (x(1), y(1)), . . ., (x(n), y(n)), . . .generated by y(n) = f (x(n)) + ε(n), where x(•) ∈ R M 1 , f and ε are unknown nonlinear function and random disturbance respectively.Suppose there is an objective w in H K such that and we have a learning model in x(•) is assumed to be in the compact set X ⊂ R M 1 .At time slot n, we have w(n), which will be updated to w(n + 1) by w(n).
for l = 0, 1, 2 . ... It follows that Let l = 0, 1, . . ., M 0 − 1, where M 0 will be specified later.Similar to ( 18) and ( 19), with ), e(n − 1), . . ., e(n − M 0 + 1)] T , Equation ( 33) can be written as where (n) = (φ(x(n)) T , φ(x(n − 1)) T , . . ., φ(x(n − M 0 + 1)) T ) T .For any n, (34) can be considered as a control system with control w(n) defined in the kernel space.With the similar techniques developed in (24), for system (34), a virtual system can be designed as where n = (n), w n (1) = w(n) and E n (1) = E(n).The optimization of LQR with (35) in H K is designed as It is noticed that w(n) in the kernel space may be infinite dimensional.To reasonably transform (36) into a finite dimensional one, we find optimal solution in a subspace H s K rather than in H K itself [12], [13].Let u i (i = 1, 2, . . ., M ) be different vectors in X .H s K is the linear subspace of H K spanned by basis vectors φ(u i ), i = 1, . . ., M .For ∀n, w(n) and objective vector w are then represented in H s K as Then, the prediction errors on the dataset (x(n − l), y(n − l)), l = 0, . . ., M − 1 by w(n + 1) (denoted as ê(•)) and w(n) (denoted as e(•)) can be reduced into difference equations with finite parameters  39) is denoted as It is also noted that in the subspace where G = [K (u i , u j )] i,j=1,2,...,M is the kernel Gram matrix.Therefore, the index of LQR ( 35) can be reduced accordingly to a formula with finite control inputs By the same method presented in Theorem 2, the optimal update law can be obtained where P n is given by the following matrix equation The computation for solving this matrix equation can be implemented by using the optimization toolbox in Matlab.
For φ, we have the following results [15], [40].Lemma 3: The feature mapping φ is a compact mapping, i.e. φ maps any bounded set onto a relatively compact set in H K .
By the properties of relatively compact set [41], it can be concluded that there exists a finite open coverage for the range of φ in H K , which implies that for a given degree of accuracy, the range can be approximated by the linear combination of the vectors in H s K .Regarding the selection of basis vectors B S = {φ(u 1 ), φ(u 2 ), . . ., φ(u M )}, the approximate linear dependence (ALD) method is utilized here [15], [17].Suppose at time n we have already identified w(n) = M i=1 α i (n)φ(u i ).For the incoming data (x(n+1), y(n+1)), let (45) and ζ (n) can be easily expanded as ) cannot be linearly represented very well.The approximation ability of the space spanned by {φ(u 1 ), φ(u 2 ), . . ., φ(u M )} is weaker than the space spanned by {φ(u 1 ), φ(u 2 ), . . ., φ(u M )} ∪ {φ(x(n + 1))}, which means the model obtained from the former space may be inadequate for the learning.If ζ (n) is very small, there is negligible difference between the spaces, and φ(x(n + 1)) can be redundant to be a basis vector.A predetermined constant ν can be chosen as a threshold for the update of We propose that if , where u M +1 = x(n + 1), and K n is updated as E(n + 1) and E(n) are expanded as (ê(n), ê(n − 1), . . ., ê(n − M )) T and (e(n), e(n − 1), . . ., e(n − M )) T , respectively.α(n) is updated as (α 1 (n), . . ., α M (n), 0).Then, the learning can be conducted by ( 43) and ( 44) accordingly.Our algorithm is summarized as online kernel linear quadratic regulator algorithm (OKLQR) and presented as follows 1) Initialization: the number of variables M , the kernel bandwidth σ , the threshold value ν, the basis vector set B S = ∅.Let the system run for at least M steps.For If ζ (n) ≤ ν, B S remains unchanged, and α(n) is updated as α(n + 1) according to (43) and (44).
is updated as α(n + 1) by solving (42) based on ( 46), ( 43) and ( 44). 3) If data flow ends, stop; otherwise n=n+1, go to step 2. Remark 4: Based on the theory of LQR, the updates of β(n) and w(n) are more likely to be restricted by a large γ [38], [39].This may increase the robustness of the method but also make the learning model adapt the new dynamics at a relatively low speed.On the contrary, a small γ brings relatively fast learning and makes the model trance the changes of the data more efficiently due to the fewer restrictions on the update size, but may also result in the over-fitting problem.Therefore, the second terms in ( 25) and ( 42) can be treated as regularization for the learning.By borrowing the concepts of ''passive'' and ''aggressive'' that defined in [20], [23], the method can be described to be more passive with a larger γ , but more aggressive with a smaller γ .Therefore, a moderate γ is apt to be chosen for most of the learning tasks.We also remark that in gradient based method, the learning parameters such as learning rate should be carefully chosen to avoid divergence.Unlike the gradient based methods, γ is the only parameter that needs to be adjusted in OLQR and OKLQR.More importantly, no matter what γ is finally selected, the methods are still exponential convergence and stable as long as γ > 0. This is the main advantage of our method.Also, the new method can also be extended for the leaning problems with polynomial kernels.
Remark 5: As discussed in section 4 and 5, in the gradient based learning methods, β(n) and w(n) are supposed to be updated on feature plane (Euclid plane for linear model, hyperplane for kernel model) by decreasing the quadratic risk index.In the best cases, we wish that β(n) and w(n) are updated in the ''perfectly correct'' direction, i.e. β(n) and w(n) directly converge to β and w .However, ''perfectly correct'' cannot be achieved since the feature plane will be no longer smooth due to the noise effects.This situation can be even worse if the noise disturbances are complex.The derivative in the gradient algorithms may indicate incorrect information for the distance between β(n) and β , and provide a wrong update direction.In our method, the learning problem is solved through a completely different approach with optimal control techniques.By employing the control input obtained by LQR, the prediction errors and distance between β(n) and β will always be exponentially convergent to the neighborhood of origin point regardless of the characteristics of the noise effects.It means that despite ''perfectly correct'' cannot be achieved due to the noise effects, β(n) and w(n) are still updated in a ''generally correct'' direction.Therefore, the new method is able to provide more robust modeling performance on both convergence speed and prediction accuracy in the case of complex noise disturbances.This is another main advantage of our method.

VII. NUMERICAL EXAMPLES
This section provides numerical results obtained by our method using simulation data and realistic data.We also compare our results with those from the existing benchmark online methods.

A. ONLINE REGRESSION ANALYSIS
Consider following discrete, adaline system The input vector for this system is (z(n − 1), z(n − 2), z(n − 1)z(n − 2), z(n − 1) 3 ).Let B be the back-shift operator.The random term ε(n) = (0.5 , is temporal correlated and heterogenous, where The estimation errors by different online learning methods for system (47).
is uniformly distributed on [−0.5, 0.5], ς 2 (n) ∼ N (0, 1), and ς 3 (t) is uniformly distributed on [−1, 1].This system is perturbed by high intensity, temporal correlated, and nonstationary random disturbances, which means the learning is conducted in a complex noise environment.Learning results of four methods are presented in Fig. 1 and Fig. 2. The hyper-parameters of each method are chosen to give the optimal performance.In OLQR, γ is set as 1.5.The learning rate η is 0.2 in LMS.In OPA, υ is chosen to be 0.05, and the window length is 8 in WMLSR.Consider following linear system with classical online learning setting For system (48), z 1 (n) is uniformly distributed on [−1, 1], z 2 (n) is generated by N (0, 1), ε(n) is a Gaussian noise with variance of 0.05.Fig. 3 gives the online estimation results obtained by OLQR, least mean square algorithm (LMS) [6], online passive aggressive algorithm (OPA) [20] and least square regression with moving window algorithm (MWLSR) [26].The estimation errors are compared in Fig. 4. In MWLSR, the window length is chosen to be 5.In OPA, the value of υ for hinge loss function is 0.01.The learning rate of LMS is 0.15.The regularization γ in OLQR is set to be 5.The instances and variables are set to be independent in (48).This also illustrates that although this paper focuses on the data streams with complex noise disturbances, the proposed method can also be well applied to the learning problems with the classical online learning setting.
Although the system undergoes significant changes during the learning process, i.e. the objective parameter vector (β 1 , β 2 , β 3 , β 4 ) in (47) shifts from (1, 0.5, 0.25, −0.3) to (2, −0.5, 0.25, −0.3), and (β 1 , β 2 ) shifts from (1, −1) to (−1.5, 2.5) in (48), the changes of the systems are immediately detected and traced, and the objective parameters are online estimated very well.The estimations by OLQR converge quickly to the true values as shown in figures, suggesting that the proposed new method provides better performance in terms of both learning accuracy and convergence rate for this test system.

B. NONLINEAR SYSTEM IDENTIFICATION
In this example, we compare the proposed method with some benchmark algorithms based on their performances on nonlinear system identification.The nonlinear system with heterogenous noise term is given as follows where z(n) ∼ N (0, 2).This system is perturbed by white noise or heterogenous noise.For the former case ε(n) = 0.05ς 1 (n), where ς 1 (n) ∼ N (0, 1).Two series of datasets are generated.For the latter case   where n = 1, . . ., 900, and ŷn (j) is the prediction output of the learning model for y(j) at time slot n.
The comparison results are shown by comparing the proposed algorithm with some existing online kernel learning algorithms including the kernel least mean square algorithm (KLMS) [6], online kernel passive aggressive algorithm (OKPA) [20], and online independent reduced least square support vector regression algorithm (OIRLSSVR) [13].The Gaussian kernel is applied in this example and the optimal kernel bandwidth is chosen as 1 for all the algorithms.In both OKPA and OIRLSSVR, υ is chosen to be 0.001.The learning rate of KLMS is 0.9.The regularization parameter γ in OLQR is 1.
Fig. 5a and 5b demonstrate the testing MSE of the 900 training steps for homogeneous and heterogenous cases, respectively.The results are the average performance calculated from 50 independent Monte Carlo simulations.It can be  seen that all MSEs obtained by different algorithms converge to a steady state after a number of learning iterations.OLQR outperforms the other algorithms for its smaller testing error and fast convergence rate in both cases.

C. NONLINEAR TIME SERIES ANALYSIS
In this section, a widely used benchmark system [43], is considered to examine the efficiency of our new method.
The system is given by where u(t) = 0.4 sin(πt/125) + 0.6 sin(π t/25).The initial values are y(1) = y(2) = y(3) = 1.The noise term ε(t) is temporal correlated and governed by ε(t) = 0.05(1 − 0.8B) −1 ς (t), where ς (t) ∼ N (0, 1).The input vector of the learning model consists of u(t) and y(t) as well as their delayed terms, i.e. (y(t − 1), y(t − 2), y(t − 3), u(t)), and various kernel methods are applied for the learning.The system undergoes significant changes at point 1000 and a sequence of 2000 samples is generated.The actual data of the system and prediction outputs by OKLQR with bandwidth σ = 1, penalty γ = 10 and ALD threshold parameter ν = 0.5 are shown in Fig. 7a.The comparison results with the prediction errors obtained by OKPA, KLMS, and OIRLSSVR are presented in Fig. 7b.To achieve the optimal performance, the regularization parameter γ in OLQR is set to be 1, the learning rate of KLMS is 0.9, For both OKPA and OIRLSSVR, υ is 0.05.
It shows that after a short period of transient effects, the system dynamics can be well predicted by OKLQR in one step ahead manner.In this example, the mean square error is defined as   illustrate the advantages of our method.It can be seen that OKLQR can achieve better learning results with faster convergence and smaller prediction error than the other learning algorithms.

D. REALISTIC DATA: BOX-JENKINS GAS FURNACE PROBLEM
In this section, we demonstrate our method on a real world data set.The Box-Jenkins gas furnace problem is a common benchmark to test learning methods [43], [44].The data consists of 296 pair of samples and are measured from a gas furnace with the CO 2 concentration y(t) and the gas flow rate u(t).The input vector is chosen to be (y(t − 1), y(t − 2), y(t − 3), u(t − 1), u(t − 2), u(t − 3)).
The prediction outputs by OKLQR with bandwidth σ = 10, penalty γ = 0.2 and ALD threshold parameter ν = 0.001 are shown in Fig. 6a.The comparison results with the prediction errors obtained by OKPA, KLMS, and  OIRLSSVR are shown in Fig. 6b.To achieve the optimal performance, the regularization parameter γ in OLQR is set to be 1.The learning rate of KLMS is set as 1.5.For both OKPA and OIRLSSVR, υ is 0.01.The MSE for this example is defined as  economic operation of a power plant.Electrical power output can be expressed as either linear or nonlinear function of environmental variables such as temperature and humidity recorded by sensors.This problem has been studied in many literature using other machine learning tools [45]- [47].Consider a combined cycle power plant (CCPP) with two gas turbines, one steam turbine, and two heat recovery systems [45], [47], those types of equipment are sensitive to four main variables: ambient temperature (AT), atmospheric pressure (AP), relative humidity (RH) and exhaust steam pressure (or vacuum, V).Thus, the full load electrical power output (P E ) is affected by the above four variables.The dataset in [45] 3.
To further illustrate our methods, the whole dataset is shuffled 50 times, the training and testing sets are constructed using the same strategy for 50 times.The Mento Carlo simulation results using different algorithms are shown in terms of boxplot in Fig. 9. OLQR and OKLQR algorithms can achieve better predicting accuracy with less fluctuation.

VIII. CONCLUSION AND DISCUSSION
In this paper, we have proposed a new learning method named the ''online linear quadratic regulator'' learning algorithm.By using a carefully designed scheme, the online learning problem is transformed into a state feedback control problem of a group of controllable, observable and time-varying systems.Two dynamical linear quadratic regulator based numerical algorithms are developed to give the model update law for the online learning of adaline models and reproducing kernel Hilbert space models.This method provides a novel online learning approach from the perspective of state feedback optimal control with solid theoretical basis.Compared with the existing online learning methods, the proposed method has better convergence rate and prediction accuracy for data streams with complex noise.Some key limitations of the existing methods such as robustness to noise effects, restrictions of learning parameters et.al.can be therefore overcome.The effectiveness and efficiency of our theory are also demonstrated by the encouraging experimental results.Our future work will focus on the further extensions and potential applications of the novel optimal control based online learning algorithms.(65) It follows 66) It is obvious that the eigenvalues of B −1 n A n B n and A n are the same, which implies that B −1 n A n B n is also contractive.By the same techniques presented in (52), (53) and (54), the theorem can be completed.

FIGURE 4 .
FIGURE 4. The estimation errors by different online learning methods for system (48).
ε(n) = 0.1ς 1 (n)ς 2 (n), where ς 2 (n) is uniformly distributed on interval [0, 1].In both cases, the data at the first 900 points are chosen as the training data and the subsequent 100 ones are used for evaluation.The testing mean square error (Testing MSE) is defined as

FIGURE 5 .
FIGURE 5.The testing MSEs for system (49) with white noise and heterogenous noise.

FIGURE 6 .
FIGURE 6.The prediction output by OKLQR and the prediction errors by different online learning methods.

FIGURE 7 .
FIGURE 7. The actural data, predictions by OKLQR and prediction errors by different learning methods.

FIGURE 8 .
FIGURE 8.The estimation errors by different online learning methods.

FIGURE 9 .
FIGURE 9. Evaluation of final testing MSEs in terms of boxplot by Monte Carlo simulations.

2 ,
is utilized for modeling, where the four variables and one target variable are recorded by different sensors hourly over six years(2006)(2007)(2008)(2009)(2010)(2011).Thus 9568 data points are included.The proposed algorithms (OLQR and OKLQR) and five benchmark online learning algorithms are employed to predict in the context of linear and kernel regression model.In the experiments, after shuffling the whole dataset, first 8000 samples are used as training set and the rest 1568 are for testing.For this example, the MSE at learning step n is defined as followsMSE(n)where ŷn (j) is the predicting output of jth test sample.To remove the transient effects, the prediction results obtained from the first 400 samples are abandoned and average MSE (AMSE) aiming at measuring the average predicting error is defined as Using the linear model, the testing MSEs obtained by OLQR, LMS and OPA are shown in Fig.8(a).We can observe that LMS and PA methods are seriously affected by chaotic noise perturbing, and OLQR show advantage on prediction accuracy with fewer peaks in the figure.Using kernel models, the testing MSEs obtained by OKLQR, KLMS, OKPA and OIRLSSVR are presented in Fig.8(b).Our control-based methods also show advantages in robustness, convergence speed, and average predicting accuracy.The learning parameters of each algorithm are chosen for the best learning performance.Details of the hyperparameter setting and AMSE are given in Table Let A n denote the transit matrix of closed loop system.With the optimal control input U n (1) = −(γI + B T n P n B n ) −1 B T n P n E n (1), we have E(n + 1) = A n E(n) = (I + B n F n )E(n) = (I − B n (R + B T n P n B n ) −1 B T n P n )E(n) = (I − B n (γ I + B T n P n B n ) −1 B T n P n )E(n) = (I − (γ (B n B T n P n ) −1 + I ) −1 )E(n) = γ (B n B T n P n ) −1 (γ (B n B T n P n ) −1 + I ) −1 E(n) E(n + 1) = A n E(n) = (I − B n (R + B T n P n B n ) −1 B T n P n )E(n) = (I − B n (γ I + B T n P n B n ) −1 B T n P n )E(n) = (I − P −1 n )E(n).(64) It is noted that γ (B n B T n P n ) −1 and P −1 n are positive definite, therefore γ (B n B T n P n ) −1 (γ (B n B T n P n ) −1 + I ) −1 0 is also positive definite and I − P −1 n ≺ I .It can be obtained that 0 ≺ A n ≺ I , and A n is contractive.Denote β(n) = (β 1 (n) − β 1 , β 2 (n) − β 2 , . . ., β M (n) − β M ) T .Noticed that E(n + 1) = B n β(n + 1) − d(n) and E(n) = B n β(n) − d(n), we have B n β(n + 1) − d(n) = A n (B n β(n) − d(n)).

2 ,
by removing the initial transient response effects, where ŷ(t) is the predicted output for y(t).The average MSEs obtained by different algorithms calculated from 100 independent Monte Carlo simulations are shown in Table1to further

TABLE 1 .
Learning results for nonlinear time series prediction.

TABLE 2 .
Learning results for real world data prediction.

TABLE 3 .
Learning results for real world data prediction.