Competitive Normalized Least-Squares Regression

Abstract—Online learning has witnessed increasing interest over the recent past due to its low computational requirements and its relevance to a broad range of streaming applications. In this brief, we focus on online regularized regression. We propose a novel efficient online regression algorithm, called online normalized least-squares (ONLS). We perform a theoretical analysis by comparing the total loss of ONLS against that of the normalized gradient descent (NGD) algorithm and the best off-line LS predictor. We show, in particular, that ONLS allows for a better bias-variance tradeoff than state-of-the-art gradient-descent-based LS algorithms, as well as better control over the level of shrinkage of the features toward zero. Finally, we conduct an empirical study to demonstrate the strong performance of ONLS against some state-of-the-art algorithms using real-world data.


I. INTRODUCTION
Sequential (online) learning is about sequentially predicting an output upon the presentation, at each trial, of an input from a sequence (stream). We consider the following model:
y_t = ⟨w_{t−1}, x_t⟩ + ε_t, (1)
where w_{t−1} ∈ R^n is some weight vector, x_t ∈ R^n is the input, y_t ∈ R is the output, and ε_t ∈ R denotes the noise. The goal is to obtain an estimate ŷ_t ∈ R for y_t upon maintaining a weight vector w_{t−1} ∈ R^n for each trial t = 1, 2, . . .. Applications of model (1) include a priori filtering, a posteriori filtering, and online regression. A priori filtering is used to recover the uncorrupted output ⟨u, x_t⟩ before receiving the output y_t. The overall discrepancy (error) after T steps is given by the cumulative sum Σ_{t=1}^T (⟨u, x_t⟩ − ⟨w_{t−1}, x_t⟩)². In a posteriori filtering, we filter out the noise using the output y_t; the error is the cumulative sum Σ_{t=1}^T (⟨u, x_t⟩ − ⟨w_t, x_t⟩)². Notice that in a posteriori filtering, we are able to use the most recent weight vector w_t to measure the quality of the filter (error). In contrast, a priori filtering uses w_{t−1}, because the output y_t is not yet available, which resembles the online learning setting. However, in filtering, the goal is not to estimate the output; instead, we want to recover the output under the assumption that it is corrupted by some noise. In the following, ⟨u, x_t⟩ will refer to the prediction of the off-line algorithm.
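The difference between the two filtering errors can be made concrete with a small numerical sketch. The learner below is an illustrative full-step normalized update (not the ONLS rule of this brief); the data, noise level, and dimensions are arbitrary assumptions chosen only to show how the a priori error uses w_{t−1} while the a posteriori error uses w_t.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 3, 50
u = rng.normal(size=n)                 # off-line comparator weight vector
X = rng.normal(size=(T, n))            # inputs x_1, ..., x_T
y = X @ u + 0.1 * rng.normal(size=T)   # noisy outputs

w = np.zeros(n)
apriori_err, aposteriori_err = 0.0, 0.0
for t in range(T):
    x_t, y_t = X[t], y[t]
    w_prev = w.copy()                          # this is w_{t-1}
    # illustrative normalized full-correction step to produce w_t
    w = w + (y_t - w @ x_t) / (x_t @ x_t) * x_t
    # a priori: compare <u, x_t> against <w_{t-1}, x_t>
    apriori_err += (u @ x_t - w_prev @ x_t) ** 2
    # a posteriori: compare <u, x_t> against the freshly updated <w_t, x_t>
    aposteriori_err += (u @ x_t - w @ x_t) ** 2
```

With this full-correction step, the a posteriori discrepancy at trial t reduces to the noise term, which is why a posteriori filtering typically reports a smaller cumulative error.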
Interested in online regression, we suggest a novel update rule inspired by the well-known NLS method from the filtering literature (see [7]). The proposed technique directly affects how neural networks learn. In this work, it is shown that the proposed technique has an advantage over the state of the art due to the inclusion of a well-known ridge penalty term; for details on the advantages of the ridge penalty, see [9]. The least-squares algorithm was analyzed, without normalization, in the late 1990s. Its main drawback is its sensitivity to the scaling of the input, which makes it very hard to choose a learning rate that guarantees the stability of the algorithm [13]. This work fills the gap in the literature by giving an analysis of a normalized algorithm.
To answer the question of how well an online learning algorithm predicts, competitive analyses [22] are often performed. An algorithm is said to be competitive if it satisfies
L_T ≤ c L*_T + R_T, (2)
where L_T is the cumulative loss up to time T, and L*_T = inf_w ‖Y − Xw‖², with w ∈ R^n, X ∈ R^{T×n}, and Y ∈ R^T. R_T denotes the regret, and c denotes a constant. An interesting feature of such a definition of competitiveness is that it requires an algorithm to perform well for both "hard" and "easy" inputs. This is a stronger notion than the worst-case analysis [11], where the performance of the algorithm is only measured for "hard" inputs.
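The quantities in the competitiveness guarantee can be computed directly on a data stream. The sketch below measures the cumulative loss of an illustrative online learner (a simple normalized gradient step, not the ONLS rule) against the best off-line comparator loss L*_T obtained by ordinary least squares; the data and the value c = 2.25 (the gradient-based constant mentioned below) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
T, n = 100, 4
X = rng.normal(size=(T, n))
w_true = rng.normal(size=n)
Y = X @ w_true + 0.1 * rng.normal(size=T)

# Cumulative loss L_T of an online learner (illustrative normalized step).
w = np.zeros(n)
L_T = 0.0
for t in range(T):
    y_hat = w @ X[t]
    L_T += (Y[t] - y_hat) ** 2
    w = w + 0.5 * (Y[t] - y_hat) / (X[t] @ X[t]) * X[t]

# Best off-line loss L*_T = inf_w ||Y - Xw||^2 via batch least squares.
w_star, *_ = np.linalg.lstsq(X, Y, rcond=None)
L_star = float(np.sum((Y - X @ w_star) ** 2))

# Competitiveness (2): L_T <= c * L*_T + R_T; the implied regret for a given c.
c = 2.25
R_T = L_T - c * L_star
```

A small or bounded R_T for a fixed constant c is exactly what the guarantees in this brief establish.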
In the literature of learning theory, regression algorithms learn by updating their weight (parameter) vectors at each trial. Such an update requires adjusting the inverse of the covariance matrix at each trial t. Alternatively, one can use gradient descent to approximate the inverse of the covariance matrix, which is computationally more efficient. In both approaches, the constant c in the performance guarantee (2) is greater than or equal to unity. For example, the covariance-based aggregating algorithm for regression (AAR) and ridge regression (RR) [25], recursive least-squares (RLS) [17], and adaptive regularization of weights (AROW) [6] for model (1) [18] have c = 1, whereas the gradient-based algorithms [5] have c = 2.25. In the completely online setup, the regret of the gradient-based algorithms is O(1), whereas, for covariance-based algorithms (for example, AAR), it is possible to have a logarithmic regret. Also, the bound of the gradient-based algorithms is better for noise-free data. However, such a bound is weak when the true regression function includes the noise term ε_t, since the regret is only bounded by O(1).
In this brief, we make no assumptions on the incoming data. Our work is comparable to that of [5], where a performance guarantee on the cumulative normalized square loss is obtained using generalized gradient descent. To obtain that guarantee, first, a lower bound on the progress (see [5, Lemma IV.4]) is computed assuming that η = α/‖x_t‖² (see [5, Th. IV.2]), where 0 < α < 2. The chosen α ensures that the performance guarantee on the normalized cumulative loss holds. We build further on the discussion by Cesa-Bianchi et al. [5] and do not impose a similar restriction on η when bounding the progress of online normalized least-squares (ONLS). We bound ‖x_t‖ to provide a performance guarantee on the square loss and relax this condition for the guarantee on the normalized squared loss. Consequently, the proposed algorithm's guarantees have the tuning parameter next to the ridge penalty, indicating better bias-variance tradeoff properties than the generalized GD update rule. In summary, the major contributions of this brief are as follows:
1) derivation of the ONLS algorithm for regression;
2) development of a competitive analysis of ONLS;
3) an empirical and comparative study using real-world data.
The structure of this brief reflects these contributions in the subsequent sections, after the related work is presented in Section II.

II. MOTIVATION AND RELATED WORK
RLS is a popular algorithm in the area of linear regression and computes the weights as follows. In each iteration t, the prediction ŷ_t = ⟨w_{t−1}, x_t⟩ is made, and after receiving the true output y_t, the weights are updated with r > 0 and A_0 = I ∈ R^{n×n}. There exist many similar algorithms, such as AAR, RR, and AROWR. In particular, the weight update rule of AROWR is the same as that of RLS; the only difference lies in how the covariance matrix is updated.
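A minimal sketch of one such second-order recursion follows. The exact display equations are not reproduced here, so the recursion below is a standard RLS form assumed for illustration, keeping only what the text states: a parameter r > 0, the initialization A_0 = I, and a prediction made with w_{t−1} before each update.

```python
import numpy as np

def rls_step(w, A, x, y):
    """One standard RLS-style iteration (assumed form):
    predict with w_{t-1}, grow the covariance, then correct the weights."""
    y_hat = w @ x                       # prediction y_hat_t = <w_{t-1}, x_t>
    A = A + np.outer(x, x)              # covariance accumulation
    w = w + np.linalg.solve(A, x) * (y - y_hat)   # second-order correction
    return w, A

n = 3
w, A = np.zeros(n), np.eye(n)           # A_0 = I, w_0 = 0
rng = np.random.default_rng(2)
u = np.array([1.0, -2.0, 0.5])          # target weights (illustrative)
for _ in range(200):
    x = rng.normal(size=n)
    w, A = rls_step(w, A, x, u @ x)     # noise-free stream for clarity
```

This recursion is algebraically the regularized batch solution w_t = (I + Σ x_s x_sᵀ)⁻¹ Σ x_s y_s computed incrementally, which is why each step costs O(n²), as noted below.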

AAR's and RR's update rules can be obtained by setting
Another difference between AAR and the other algorithms is that AAR divides its prediction ⟨w_{t−1}, x_t⟩ by 1 + x_tᵀ A_{t−1} x_t, whereas RR, AROWR, and RLS do not. Moreover, AAR is the only algorithm among the four that is able to perform shrinkage. It is worth noting that all of these algorithms perform regression under the assumption that the target is stationary. Sometimes, they are used as building blocks to develop second-order nonstationary algorithms [18] by adding a penalty term that handles drift of the target (nonstationarity).
The implementation of (3) has time complexity O(n²), which is significant for high-dimensional data. Often, LS [26] is considered a less demanding solution, since its time complexity is O(n).
LS replaces the inverse of the covariance matrix by the learning rate η > 0, yielding the following update rule:
w_t = w_{t−1} + η (y_t − ⟨w_{t−1}, x_t⟩) x_t. (5)
The LS algorithm not only has a better time complexity but is also simpler to implement. In contrast, in the case of RLS, the right-hand side of (5) is replaced by (4); for further details on the matter, see [12]. In practice, normalized LS often performs better than LS because NLS is not sensitive to the scale of the input [2], [3]. The existing work on NLS applies a normalized square loss to derive the update of the weights [5], [15] for η > 0. When x_t = 0 or η = 0, with the convention that 0/0 = 0, the rules (6) and (7) output w_t = w_{t−1}.
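The contrast between the gradient-based LS step and the normalized NLS step can be sketched as follows. The NLS form below, which divides the correction by ‖x_t‖², is the standard one from the adaptive-filtering literature and is assumed here for illustration; the data, scaling, and learning rates are arbitrary.

```python
import numpy as np

def ls_update(w, x, y, eta):
    """Gradient-descent LS (LMS) step: O(n) per trial, scale-sensitive."""
    return w + eta * (y - w @ x) * x

def nls_update(w, x, y, eta):
    """Normalized LS step: invariant to input scale, with 0/0 := 0."""
    nx2 = x @ x
    if nx2 == 0.0 or eta == 0.0:
        return w                        # convention 0/0 = 0 leaves w unchanged
    return w + eta * (y - w @ x) / nx2 * x

rng = np.random.default_rng(3)
u = np.array([0.5, -1.0])
w_ls, w_nls = np.zeros(2), np.zeros(2)
for _ in range(500):
    x = 100.0 * rng.normal(size=2)      # badly scaled inputs
    y = u @ x                           # noise-free stream
    w_ls = ls_update(w_ls, x, y, eta=1e-5)    # eta must shrink with the scale
    w_nls = nls_update(w_nls, x, y, eta=0.5)  # eta independent of the scale
```

Note that the stable η for LS had to be hand-tuned to the input scale (here 1e-5), while the NLS step rate 0.5 works regardless of scale, which is exactly the sensitivity issue mentioned in the Introduction.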
Up until now, we have briefly reviewed some popular existing approaches for regression using the squared loss and the normalized square loss. The squared loss is a renowned loss function despite not being robust. In particular, in the presence of substantial outliers, the squared loss is not the preferred choice, since it penalizes mistakes more severely (the difference between the actual and predicted values is squared) than some other loss functions, such as the absolute loss and the normalized square loss. For this reason, in the next section, we study a tunable loss function that (loosely speaking) incorporates the features of both the absolute and squared loss functions.

III. DERIVATION AND ANALYSIS OF ONLS
ONLS is an online regression algorithm, and so, it observes the following protocol.
Protocol 1 (Online regression): FOR t = 1, 2, . . .: 1) receive the input x_t ∈ R^n; 2) output the prediction ŷ_t ∈ R; 3) receive the true output y_t ∈ R; 4) incur the loss. END FOR
In Protocol 1, it is assumed that the prediction is given by ⟨w_{t−1}, x_t⟩. Thus, the problem at hand is to design the update rule, which leads us to the following lemma.
Lemma 1: The minimization problem min_{w_t} ‖w_t − w_{t−1}‖² with the constraint y_t = ⟨w_t, x_t⟩ has the following solution:
w_t = w_{t−1} + ((y_t − ⟨w_{t−1}, x_t⟩)/‖x_t‖²) x_t.
Proof: The proof is given in Appendix A.
Remark 1: We perform the analysis of the following update rule:
w_t = w_{t−1} + ((y_t − ⟨w_{t−1}, x_t⟩)/(η + ‖x_t‖²)) x_t (8)
for η > −‖x_t‖². The obvious advantage of using (8) is that we do not require any convention for the case when ‖x_t‖ → 0. Later, we show that the addition of η in the denominator also results in a better performance guarantee. ONLS is presented in Algorithm 1, where the weight vector is initially set to 0 ∈ R^n.
Algorithm 1 ONLS: initialize w_0 = 0 ∈ R^n; FOR t = 1, 2, . . .: 1) receive x_t; 2) predict ŷ_t = ⟨w_{t−1}, x_t⟩; 3) receive y_t; 4) update w_t using (8); END FOR.
We now analyze ONLS using the technique (difference of sums of squares) suggested by Duda et al. [8] for convergence analysis, and we start with the following theorem, which bounds ONLS's performance.
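Algorithm 1 can be sketched in a few lines. The update below assumes (8) takes the regularized-normalized form with the learning rate η added to the normalizer, so that no 0/0 convention is needed for η > −‖x_t‖²; the data stream is synthetic and purely illustrative.

```python
import numpy as np

def onls_update(w, x, y, eta):
    """ONLS step: eta sits in the denominator next to ||x_t||^2
    (assumed form of (8)), so the denominator never vanishes for
    eta > -||x_t||^2 with eta != -||x_t||^2."""
    return w + (y - w @ x) / (eta + x @ x) * x

def onls_run(X, Y, eta):
    """Algorithm 1: w_0 = 0; predict <w_{t-1}, x_t>, receive y_t, update."""
    w = np.zeros(X.shape[1])
    cumulative_loss = 0.0
    for x, y in zip(X, Y):
        cumulative_loss += (y - w @ x) ** 2   # loss of the a priori prediction
        w = onls_update(w, x, y, eta)
    return w, float(cumulative_loss)

rng = np.random.default_rng(4)
T, n = 300, 3
u = np.array([1.0, 0.0, -0.5])          # comparator weights (illustrative)
X = rng.normal(size=(T, n))
Y = X @ u                               # noise-free stream
w, L_T = onls_run(X, Y, eta=0.1)
```

On this noise-free stream the iterates converge to the comparator u, and L_T remains bounded, consistent with the O(1) regret discussed above.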
Proof: The proof is given in Appendix B.
The result obtained in Theorem 1 fulfills (2) with c = 1/(β(1 − β)), L*_T = inf_u ((η + ‖X‖²)‖u‖² + L_T(u)), and R_T = O(1). For η = 0 and β = 1/2, Theorem 1 yields (9), the performance guarantee for the algorithm derived in Lemma 1, where we bound the input by its Euclidean norm. It is clear that the addition of η in the update rule of Lemma 1 is advantageous: in Theorem 1, the addition of η decreases the dependence on the size of the data. Also, notice that, when β = 1/2 and b → −1, we have inf_u ((b + 1)‖X‖²‖u‖² + L_T(u)) → inf_u L_T(u), so that L_T ≤ 4 inf_u L_T(u); that is, ONLS is at most four times worse than the true regression function.
The following theorem presents the performance guarantee on the normalized squared loss.
Proof: Notice that η + ‖x_t‖² = (1 + b)‖x_t‖² for η = b‖x_t‖². Thus, from (14) (see Appendix B), the result follows. In Theorem 2, the guarantee does not depend on the size of the input, and we do not bound the input. Moreover, the performance guarantee makes no assumptions on the input, the output, or the weights.

Remark 2:
Similarly, for the case of L_T, NGD is outperformed as well. We now present a guarantee that includes the learning rate η in the cumulative loss, which we will refer to as the tunable loss function.
To conclude this theoretical analysis, we state the following: the addition of the learning rate is advantageous in the ONLS algorithm. As the ridge penalty ‖u‖² → ∞, the ONLS algorithm has a better guarantee than NGD's when −1 < b ≤ 0.5625‖X‖² and 0 < a ≤ 0.5625 for the squared and normalized squared losses, respectively. The presence of a > 0 and b > −1 next to the ridge penalty in the guarantees and update rule of ONLS implies better control over the bias-variance tradeoff than in the NGD case, where a = b = 1. Fig. 1 compares some renowned loss functions with the behavior of the loss function studied in Theorem 3. Notice that when the learning rate η = 0, the tunable loss coincides with the normalized squared loss. When η = −0.9‖x‖², the tunable loss penalty is in a similar range as the absolute loss but with the shape of the squared loss. Also, the tunable square loss is differentiable at every point for all values of η > −‖x‖². The same statement does not hold for the absolute loss. Thus, the suggested tunable loss function has the robustness of the absolute loss but the shape of the squared loss.
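The behavior described for Fig. 1 can be reproduced numerically. The tunable form below, residual squared over (η + ‖x‖²), is an assumption consistent with the two facts stated in the text: η = 0 recovers the normalized squared loss, and η = −0.9‖x‖² inflates the penalty while preserving the quadratic shape; all numbers are illustrative.

```python
import numpy as np

def squared(r):
    return r ** 2

def absolute(r):
    return np.abs(r)

def normalized_squared(r, x_norm2):
    return r ** 2 / x_norm2

def tunable(r, x_norm2, eta):
    # Assumed tunable form: the learning rate enters the normalizer;
    # differentiable in r for every eta > -x_norm2.
    return r ** 2 / (eta + x_norm2)

r = np.linspace(-3.0, 3.0, 7)     # a grid of residuals
x_norm2 = 1.0                     # illustrative input norm
base = tunable(r, x_norm2, 0.0)           # eta = 0: normalized squared loss
heavy = tunable(r, x_norm2, -0.9 * x_norm2)  # eta = -0.9||x||^2: inflated penalty
```

With η = −0.9‖x‖², the penalty is ten times the normalized squared loss at every residual, yet remains smooth everywhere, unlike the absolute loss at zero.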

IV. EMPIRICAL STUDY
The primary goal of this study is to compare ONLS against NGD but, for the sake of completeness, we also compare it against the most theoretically studied variant of the gradient method, known as Adam [14], [21], as well as against the algorithms of [5], ORR [17], and ONS [19]. The objective is to be as close as possible to the off-line solution P*_T = Xw*, where w* = argmin_w ‖Y − Xw‖². The P*_T solution considers the entire data X ∈ R^{T×n} and Y ∈ R^T. Table I contains the minimum (min), maximum (max), and median (med) Cook's distances for outliers, and the mean and variance (var) for the level of noise.
Data sets used in this study are Gaze [20], NO2 [24], ISE (Istanbul Stock Exchange) [1], F-16 [23], Friedman [10], and Weather [4]. Gaze data consist of 450 observations of 12 features, estimating the positions of the eyes of a subject looking at a monitor. NO2 data consist of 500 observations from a road air pollution study collected by the Norwegian Public Roads Administration. The ISE data have 536 observations with eight attributes. Ailerons (F-16) data consist of 13 750 observations with a total of 40 attributes that describe the status of the F-16 aircraft. Friedman data are a synthetic data set with ten features and 40 768 rows. The Weather data are also included; see Table I for the characteristics of these data sets. Table II compares the root-mean-squared error (RMSE), the coefficient of determination (R²), and the mean absolute error (MAE) of the algorithms. Overall, ONLS performs best, and Adam performs worst, on these data sets in terms of RMSE, R², and MAE. ONLS is a good choice when the true regression function is not corrupted by noise (see Table II). Importantly, ONLS provides a significant improvement over NGD in all the studied scenarios.
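For reference, the three evaluation metrics of Table II have the following standard definitions; the toy vectors are illustrative and not taken from the study.

```python
import numpy as np

def rmse(y, y_hat):
    """Root-mean-squared error."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def r2(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - ss_res / ss_tot)

def mae(y, y_hat):
    """Mean absolute error."""
    return float(np.mean(np.abs(y - y_hat)))

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])
```

Lower RMSE and MAE and higher R² indicate a predictor closer to the off-line solution.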

V. CONCLUSION
We presented an exact formulation of the proposed algorithm, ONLS, along with its performance guarantee. We compared it against state-of-the-art LS-regression algorithms, showing that it allows for a better bias-variance tradeoff while providing feature shrinkage. In the future, we will study the tightness of the bounds of ONLS and, potentially, those of the state-of-the-art algorithms used in this study.

APPENDIX A
PROOF OF LEMMA 1
Introducing the Lagrangian multipliers α_t, t = 1, 2, . . . , T, instead of solving the primal optimization problem mentioned earlier, we find the saddle point of the corresponding Lagrangian. In accordance with the Kuhn-Tucker theorem [16], there exist values of the Lagrangian multipliers for which solving the primal problem is equivalent to finding the saddle point. Substituting the value of w_t obtained in (11) into the constraint yields (12); substituting α_t from (12) back into (11) gives the solution of Lemma 1. In order to avoid the scenario w_t → ∞ as ‖x_t‖² → 0, we use the convention 0/0 = 0.
Proof: The inequality is equivalent to the following observation: the left-hand side can be written as ((2a − 2aβ) − b)²/(4(1 − β)) for 0 < β < 1; thus, the inequality holds. We now prove the lower bound of Lemma 2 using the inequality proven in Lemma 3. This can be interpreted as a lower bound on the progress per trial of Algorithm 1.
Proof: The claim follows from Lemma 2; the last inequality holds due to Lemma 4.
The weights are initialized to 0, and ‖·‖ is nonnegative; thus, the stated bound follows.