On the Optimal Tradeoff Between Computational Efficiency and Generalizability of Oja’s Algorithm

Oja's algorithm is widely applied for computing principal eigenvectors in practice, and understanding the theoretical relationships between its learning rate, convergence rate, and generalization error under noisy samples is of practical importance. In this paper, we show that, under mild assumptions on the sampling noise, both the generalization error and the convergence rate depend linearly on the learning rate in the large sample size and small learning rate regime. In addition, when the algorithm nearly converges, we provide a refined characterization of the generalization error, which suggests an optimal design of the learning rate for noisy data. Moreover, we investigate the minibatch variant of Oja's algorithm and demonstrate that minibatch training effectively decays the learning rate by a factor equal to the batch size, which provides theoretical insights and guidance for choosing the learning rate in minibatch training. Finally, our theoretical results are validated by experiments on both synthetic data and the MNIST dataset.


I. INTRODUCTION
Understanding the fundamental relationships between the learning rate, convergence rate, and generalization error of machine learning algorithms is an important issue in designing effective and efficient algorithms for real applications [1]. In this paper, we investigate this issue for the problem of computing the top eigenvector of a positive-semidefinite matrix $A$. Such problems appear in many machine learning scenarios, including streaming principal component analysis (PCA) [2], canonical correlation analysis (CCA) [3], and, more recently, the HGR maximal correlation problem [4], [5]. While the top eigenvector of a matrix can be efficiently computed by the well-known power iteration algorithm [6], in practice the objective matrix is typically estimated from a small batch of data samples, which introduces estimation noise at each
iteration. In such cases, the eigenvector is often computed via Oja's algorithm, shown in Algorithm 1, which introduces a learning rate $\eta$ into the traditional power iteration to enhance the robustness of the algorithm against the noise terms $\zeta_n$. In particular, the learning rate $\eta$ in Oja's algorithm controls the magnitude of the update applied to the estimated eigenvector $\phi_n$ at each iteration, which essentially improves the generalizability of the algorithm at the cost of its convergence rate. Therefore, understanding the fundamental structure of this tradeoff is critical for designing efficient algorithms.
To address this problem, previous works mostly relied on presumed probabilistic models for the noises $\zeta_n$ in Algorithm 1. For example, the relationship between the generalization error and the learning rate was studied in [8]–[11] under a strict boundedness assumption on the estimation noise. Similar results were obtained in [12] under a sub-Gaussian assumption on the data. In addition, the authors in [13] approximated Oja's algorithm by ordinary differential equations and derived the relationship between the learning rate and the generalization error for the continuous counterpart. Meanwhile, the crucial roles of hyperparameters in general machine learning problems, such as the learning rate and the batch size, have also been illustrated by empirical studies [14], [15], while the theoretical understanding of the relationships between these hyperparameters has mainly focused on strongly convex problems, such as least squares regression [16], [17].

Algorithm 1 Oja's Algorithm [7] With Noise Terms $\zeta_n$
1: Input: $\phi_0$.
2: for $n = 1$ to $N$ do
3:   $\psi_n \leftarrow A\phi_{n-1} + \zeta_n$
4:   $\phi_n \leftarrow \phi_{n-1} + \eta\psi_n$
5: end for
6: Output: $\phi_N / \|\phi_N\|$.
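For concreteness, the following is a minimal NumPy sketch of Algorithm 1. The Gaussian noise matrices and the `noise_std` parameter are an illustrative choice satisfying the finite-variance assumption of Section II, not part of the algorithm itself.

```python
import numpy as np

def oja(A, phi0, eta, N, noise_std=0.1, seed=None):
    """Noisy power iteration (Algorithm 1) with constant learning rate eta.

    The noise zeta_n = Z_n @ phi_{n-1} uses i.i.d. zero-mean Gaussian Z_n,
    an assumed model consistent with the finite-variance setting of (1).
    """
    rng = np.random.default_rng(seed)
    d = A.shape[0]
    phi = phi0.astype(float).copy()
    for _ in range(N):
        Z = noise_std * rng.standard_normal((d, d))  # zero-mean noise matrix Z_n
        psi = A @ phi + Z @ phi                      # psi_n = A phi_{n-1} + zeta_n
        phi = phi + eta * psi                        # phi_n = phi_{n-1} + eta psi_n
        # Rescale for numerical stability: the update is linear in phi,
        # so normalizing does not change the direction sequence.
        phi /= np.linalg.norm(phi)
    return phi                                       # normalized estimate of v_1
```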
In this paper, we study the asymptotic tradeoff between the learning rate, generalization errors, and convergence rate of Oja's algorithm in the large sample size and small learning rate regime, requiring only that the variance of the estimation noise be finite. For this general model, we show that both the generalization errors and the convergence rate depend linearly on the learning rate asymptotically. Moreover, for a sufficiently large number of iterations $N$, such that the algorithm nearly converges, we provide a tight characterization of the generalization errors, which essentially suggests the optimal design of the learning rate. Under this optimal design, we further provide a near-optimal bound on the generalization error. In addition, we consider the minibatch variant of Oja's algorithm, where a small batch of data samples is used in each iteration. Specifically, we demonstrate that the learning rate under minibatch training is effectively decayed by a factor equal to the batch size, which provides a theoretical interpretation of the linear scaling rule widely applied in training deep learning models [18]–[20]. Our characterizations of these hyperparameters provide theoretical interpretations of previous empirical studies [14], [15], and also generalize the results for convex problems [16], [17] to the non-convex eigenproblem. Finally, experiments on both synthetic data and the MNIST dataset are provided to validate the theoretical results.

II. PRELIMINARIES AND DEFINITIONS
Suppose that $A \in \mathbb{R}^{d \times d}$ is a positive-semidefinite (PSD) matrix with the eigendecomposition $A = V \Lambda V^{\mathsf T}$, where $d$ is the dimensionality of $A$, and $V = [v_1, \dots, v_d]$ and $\Lambda = \operatorname{diag}(\sigma_1, \dots, \sigma_d)$ with $\sigma_1 \ge \cdots \ge \sigma_d$ collect the eigenvectors and eigenvalues of $A$, respectively. Then, Oja's algorithm for computing the top eigenvector of $A$ is given in Algorithm 1, where $\eta > 0$ is the learning rate, and $\zeta_n$ represents the noise in estimating $A\phi_{n-1}$ from data at the $n$-th iteration. Specifically, we assume that the learning rate $\eta$ is a constant that does not change with $n$. In addition, suppose that each iteration uses a new data sample, so that the number of iterations $N$ equals the sample size. Furthermore, we adopt the assumption that the noise $\zeta_n$ takes the form
$$\zeta_n = Z_n \phi_{n-1}, \tag{1}$$
where the $Z_n$ are $d \times d$ matrices representing independent and identically distributed (i.i.d.) noises from the data samples. In particular, we assume that each $Z_n$ has the same distribution as a matrix $Z$, each entry of which is zero-mean with finite variance. As a motivating example, the streaming PCA problem can be regarded as a special case of this model.
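As a concrete instance of the noise model (1), consider streaming PCA with i.i.d. samples $x_n$ satisfying $\mathbb{E}[x_n x_n^{\mathsf T}] = A$: replacing $A$ by the one-sample estimate $x_n x_n^{\mathsf T}$ in the power iteration gives
$$\psi_n = x_n x_n^{\mathsf T} \phi_{n-1} = A\phi_{n-1} + \underbrace{\bigl(x_n x_n^{\mathsf T} - A\bigr)}_{Z_n} \phi_{n-1},$$
so that $Z_n = x_n x_n^{\mathsf T} - A$ is zero-mean, and its entries have finite variance whenever $x_n$ has finite fourth moments (cf. Remark 2).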
To begin, note that the updating steps of Algorithm 1 can be equivalently expressed as
$$\phi_n = \bigl(I + \eta(A + Z_n)\bigr)\phi_{n-1}, \tag{2}$$
which is more convenient for our theoretical analysis. Then, with a given initialization vector $\phi_0$, our goal is to investigate how the vectors $\phi_n$ computed through (2), for $n = 1, \dots, N$, approach the top eigenvector $v_1$ of $A$.
To measure the accuracy of the computed $\phi_n$, we define
$$\nu(\phi_n) \triangleq \frac{\langle \phi_n, v_1\rangle^2}{\|\phi_n\|^2}, \qquad \rho(\phi_n) \triangleq \frac{\phi_n^{\mathsf T} A\, \phi_n}{\|\phi_n\|^2}, \tag{3}$$
where $\nu(\phi_n) = \cos^2 \angle(\phi_n, v_1)$ measures the angle between $\phi_n$ and $v_1$, and $\rho(\phi_n)$ is the Rayleigh quotient [21]. As $v_1$ is the true top eigenvector in machine learning problems, $\nu(\phi_n)$ and $\rho(\phi_n)$ can be viewed as measures of the generalizability of Algorithm 1. Consequently, we define the generalization errors of Algorithm 1 as $\nu^c(\phi_n) \triangleq 1 - \nu(\phi_n)$ and $\rho^c(\phi_n) \triangleq \sigma_1 - \rho(\phi_n)$. Moreover, given an initialization vector $\phi_0$, we define
$$\bar\nu_n(\phi_0) \triangleq \mathbb{E}[\nu(\phi_n) \mid \phi_0], \qquad \bar\rho_n(\phi_0) \triangleq \mathbb{E}[\rho(\phi_n) \mid \phi_0], \tag{4}$$
which can be interpreted as the expected accuracies of the computed top eigenvector of $A$ at the $n$-th iteration of Algorithm 1. Similarly, the expected generalization errors are defined as
$$\bar\nu^c_n(\phi_0) \triangleq 1 - \bar\nu_n(\phi_0), \qquad \bar\rho^c_n(\phi_0) \triangleq \sigma_1 - \bar\rho_n(\phi_0). \tag{5}$$
Throughout this paper, we let $\tilde d \triangleq d(d+1)/2$, and use $\tilde I_{\tilde d}$ to denote the $\tilde d \times \tilde d$ identity matrix, with $e_{ij}$ denoting its $[i + (j-1)j/2]$-th column. Furthermore, we define the $\tilde d \times \tilde d$ matrix $G$ as
$$G \triangleq \tilde I_{\tilde d} + \eta \Lambda_1 + \eta^2 (\Lambda_2 + L), \tag{6}$$
where $\Lambda_1$ and $\Lambda_2$ are $\tilde d \times \tilde d$ diagonal matrices whose $[i + (j-1)j/2]$-th diagonal entries are $\sigma_i + \sigma_j$ and $\sigma_i \sigma_j$, respectively, and where $L$ is a $\tilde d \times \tilde d$ matrix whose entry at the $[i + (j-1)j/2]$-th row and $[i' + (j'-1)j'/2]$-th column is given by (7), where we have defined, for all $i \le j \le d$, the corresponding second-order statistics of $Z$ entering (7). Furthermore, in (9) we define the noise parameters $\tau_1, \dots, \tau_d$ that appear in our results below.
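For reference, the accuracy measures (3) and the errors (5) translate directly into code; the function and variable names below are ours, and $v_1$ is assumed to be unit-norm.

```python
import numpy as np

def nu(phi, v1):
    """Squared cosine of the angle between phi and the (unit-norm) top eigenvector v1."""
    return (phi @ v1) ** 2 / (phi @ phi)

def rho(phi, A):
    """Rayleigh quotient of phi with respect to A."""
    return (phi @ A @ phi) / (phi @ phi)

def generalization_errors(phi, A, v1, sigma1):
    """Return (nu^c, rho^c) = (1 - nu, sigma_1 - rho), cf. (5)."""
    return 1.0 - nu(phi, v1), sigma1 - rho(phi, A)
```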

III. MAIN RESULTS
The main results are summarized in this section. In the following, we assume that the noise statistics satisfy a mild regularity condition, required to hold for all $i < j$ and $i' < j'$. Moreover, we assume that the input $\phi_0$ of Oja's algorithm satisfies $\langle v_1, \phi_0 \rangle^2 > 0$, to avoid the trivial case. We start by characterizing the expected generalizabilities $\bar\nu_n(\phi_0)$ and $\bar\rho_n(\phi_0)$ defined in (4). Note that we can rewrite (4) as
$$\bar\nu_n(\phi_0) = \pi^{(1)}_n(\phi_0), \qquad \bar\rho_n(\phi_0) = \sum_{i=1}^d \sigma_i\, \pi^{(i)}_n(\phi_0),$$
with
$$\pi^{(i)}_n(\phi_0) \triangleq \mathbb{E}\!\left[\frac{\langle \phi_n, v_i\rangle^2}{\|\phi_n\|^2} \,\Big|\, \phi_0\right].$$
Then, the following proposition characterizes $\pi^{(i)}_n(\phi_0)$; a proof is provided in Appendix A.
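The decomposition above is a direct consequence of expanding the Rayleigh quotient in the eigenbasis of $A$:
$$\rho(\phi_n) = \frac{\phi_n^{\mathsf T} A \phi_n}{\|\phi_n\|^2} = \frac{\phi_n^{\mathsf T}\bigl(\sum_{i=1}^d \sigma_i v_i v_i^{\mathsf T}\bigr)\phi_n}{\|\phi_n\|^2} = \sum_{i=1}^d \sigma_i\,\frac{\langle \phi_n, v_i\rangle^2}{\|\phi_n\|^2},$$
and taking conditional expectations given $\phi_0$ yields $\bar\rho_n(\phi_0) = \sum_i \sigma_i \pi^{(i)}_n(\phi_0)$; the expression for $\bar\nu_n(\phi_0)$ is the $i = 1$ term alone.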
Proposition 1: Given an initial vector $\phi_0$, after $n$ iterations of (2), we have the explicit expression (14) of $\pi^{(i)}_n(\phi_0)$ in terms of $e_{ii}$, $G^n$, and $\theta_0$, where $e_{ii}$ is the $[i(i+1)/2]$-th column of $\tilde I_{\tilde d}$, $G$ is as defined in (6), and $\theta_0$ is the $\tilde d$-dimensional vector whose $[i + (j-1)j/2]$-th entry is determined by $\phi_0$ (see Appendix A).

Note that from Proposition 1 we can immediately establish analytical expressions of the generalization errors $\bar\nu^c_n(\phi_0)$ and $\bar\rho^c_n(\phi_0)$ (or, equivalently, of the generalizabilities $\bar\nu_n(\phi_0)$ and $\bar\rho_n(\phi_0)$) for general $n$. In the following, we focus on the theoretical properties of the generalization errors $\bar\nu^c_N(\phi_0)$ and $\bar\rho^c_N(\phi_0)$, which characterize the accuracy of the computed $\phi_N$ after all $N$ iterations. In particular, we restrict our attention to the asymptotic regime where the learning rate $\eta$ is small, i.e., far less than 1, while the sample size $N$ is large such that $N\eta$ is also large; we refer to this as the large sample size and small learning rate regime. Specifically, the large sample size assumption naturally reflects the large amounts of data available in practice, and empirical studies have shown that adopting a small learning rate contributes to low generalization errors and avoids oscillations during training [1], [14], [15]. In addition, as we will show, this regime provides theoretical guarantees of small generalization errors, and is thus of considerable practical interest.
To begin, we consider a special case of the large sample size and small learning rate regime, in which we have an infinite number of samples, i.e., $N = \infty$, and a small learning rate $\eta$. This special case essentially characterizes the convergence behavior of the generalization errors. In particular, the following result on $\pi^{(i)}_n(\phi_0)$ will be useful; a proof is provided in Appendix B.
Lemma 1: Given a small $\eta > 0$, the eigenvalues of $G$ satisfy
$$\lambda_{ij}(G) = 1 + \eta(\sigma_i + \sigma_j) + o(\eta), \qquad 1 \le i \le j \le d, \tag{13}$$
and, for $i = 1, \dots, d$, $\pi^{(i)}_n(\phi_0)$ admits an asymptotic expansion in terms of the ratios $\gamma_{ij}(\eta) \triangleq \lambda_{ij}(G)/\lambda_{11}(G)$, where $L_{ii,i'j'}$ is as defined in (7).

Using Lemma 1, the asymptotic behavior of the generalization errors $\bar\nu^c_n(\phi_0)$ and $\bar\rho^c_n(\phi_0)$ can be established as follows.

Theorem 1: Given a small $\eta > 0$, the generalization errors $\bar\nu^c_n(\phi_0)$ and $\bar\rho^c_n(\phi_0)$ converge as $n \to \infty$, and their limits are each proportional to $\eta$ up to $o(\eta)$ terms, with a common convergence rate
$$r = 1 - \eta(\sigma_1 - \sigma_2) + o(\eta).$$

Proof: For sufficiently small $\eta$, it follows from (13) that $\lambda_{ij}(G) < \lambda_{11}(G)$ for all $(i, j) \neq (1, 1)$, which implies that $\gamma_{ij}(\eta) < 1$ and $\lim_{n\to\infty} \gamma^n_{ij}(\eta) = 0$. Then, taking the limit of (14) yields the limiting value of $\pi^{(i)}_n(\phi_0)$, from which the limit of $\bar\nu^c_n(\phi_0)$ follows; via similar derivations, we can obtain the corresponding result for $\bar\rho^c_n(\phi_0)$. Finally, the convergence rate is the ratio between the second largest eigenvalue and the largest eigenvalue of $G$, i.e., $\gamma_{12}(\eta)$. As indicated by (13), we have
$$r = \gamma_{12}(\eta) = \frac{\lambda_{12}(G)}{\lambda_{11}(G)} = 1 - \eta(\sigma_1 - \sigma_2) + o(\eta).$$

From Theorem 1, the converged generalization errors are proportional to the learning rate $\eta$, which verifies the necessity of using a small $\eta$. However, a small learning rate also causes slow convergence, and the tradeoff between generalization errors and convergence rate can be made explicit by eliminating $\eta$ between the two linear laws, as sketched at the end of this discussion. While this tradeoff is established under the assumption that $\eta$ is a constant, it also provides theoretical insight into practical learning rate schedules, which are typically more complex. For example, in the step decay schedule [22], the learning rate is first set to some initial value and then cut by a fixed factor every constant number of epochs. In light of the tradeoff above, such a schedule accelerates convergence in the early phase of training and finally achieves a low generalization error with the decayed learning rate. For more general discussions of generalization error bounds under the step decay schedule, see, e.g., [23]. In contrast, our characterization of the generalization errors in the small learning rate regime is precise, and can provide useful guidance for designing learning rate schedules.
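To make the tradeoff concrete, the two linear laws of Theorem 1 can be combined by eliminating $\eta$; in the sketch below, $c$ denotes the noise-dependent proportionality constant of the converged error (a placeholder of ours, not notation from the paper):
$$\lim_{n\to\infty} \bar\nu^c_n(\phi_0) = c\,\eta + o(\eta), \quad 1 - r = \eta(\sigma_1 - \sigma_2) + o(\eta) \;\Longrightarrow\; \lim_{n\to\infty} \bar\nu^c_n(\phi_0) \approx \frac{c}{\sigma_1 - \sigma_2}\,(1 - r).$$
Hence the achievable pairs of asymptotic error and convergence gap lie on a line through the origin: halving the asymptotic error requires halving $1 - r$, i.e., roughly doubling the number of iterations needed to approach convergence.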
In practice, since the number of available samples $N$ is typically limited, it is more useful to characterize the generalization errors after finitely many iterations. To establish the generalization errors in this general case of the large sample size and small learning rate regime, we first introduce the following definition, which specifies the small quantities in this regime.
Definition 1: Given a function $f$ of $\eta$ and $N$, we write $f(\eta, N) = \hat o(1)$ if there exist functions $g(t)$ and $h(t)$, with
$$\lim_{t \to 0} g(t) = 0, \qquad \lim_{t \to \infty} h(t) = 0,$$
such that, for all $\eta > 0$ and $N > 0$,
$$|f(\eta, N)| \le g(\eta) + h(N\eta).$$
Then, we have the following tight characterization of the generalization errors; a proof is provided in Appendix C.
Theorem 2: In the large sample size and small learning rate regime, the generalization errors after $N$ iterations admit the representations (19) and (20).

From Theorem 2, in the large sample size and small learning rate regime, the generalization errors are rather small and the algorithm nearly converges. Moreover, both $\bar\nu^c_N(\phi_0)$ and $\bar\rho^c_N(\phi_0)$ can be interpreted as the superposition of the corresponding generalization error in the noiseless case, which decays exponentially in $N\eta$, and the asymptotic generalization error, which is linear in $\eta$. In addition, the theoretical generalization errors $\bar\nu^c_N(\phi_0)$ and $\bar\rho^c_N(\phi_0)$ follow U-shaped curves when plotted as functions of the learning rate $\eta$, as shown in Figure 1. Our characterization thus provides theoretical insights into this typical relationship between the learning rate and the training error [1] for the eigenproblem, and generalizes previous results developed for convex optimization problems [14], [16]. Furthermore, it can be verified that the optimal learning rate minimizing $\bar\nu^c_N(\phi_0)$ is
$$\eta^*_\nu = \frac{\log(C_\nu N)}{2(\sigma_1 - \sigma_2)N}, \tag{21}$$
where $C_\nu$ is a positive constant determined by the noise parameters and $\phi_0$. Similarly, the optimal learning rate for $\bar\rho^c_N(\phi_0)$ is
$$\eta^*_\rho = \frac{\log(C_\rho N)}{2(\sigma_1 - \sigma_2)N},$$
for a corresponding positive constant $C_\rho$.

Remark 1: It is worth noting that both optimal learning rates $\eta^*_\nu$ and $\eta^*_\rho$ take the form $\frac{\log(CN)}{2(\sigma_1 - \sigma_2)N}$, with the positive constant $C$ set to $C_\nu$ and $C_\rho$, respectively. In fact, it can be verified that the value of $C$ does not affect the asymptotic behavior of the generalization errors. Therefore, we may simply let $C = 1$ and take
$$\eta = \frac{\log N}{2(\sigma_1 - \sigma_2)N}$$
as the optimal learning rate.
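In practice, Remark 1 reduces the learning-rate choice to the eigengap and the sample size. A minimal helper (the function name is ours):

```python
import math

def optimal_learning_rate(sigma1: float, sigma2: float, N: int) -> float:
    """Optimal constant learning rate log(N) / (2 (sigma1 - sigma2) N), per Remark 1 with C = 1."""
    assert sigma1 > sigma2 and N > 1
    return math.log(N) / (2.0 * (sigma1 - sigma2) * N)

# Example: eigengap 0.5 and one million samples.
print(optimal_learning_rate(1.0, 0.5, 10**6))  # about 1.38e-5
```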
In addition to the characterization of the expected generalization errors, we can also obtain probabilistic characterizations of the convergence of the generalization errors, as illustrated in the following theorem. A proof is provided in Appendix D.
Theorem 3: Given an initial vector $\phi_0$ and a sufficiently large $N$, with the learning rate $\eta = \frac{\log N}{2(\sigma_1 - \sigma_2)N}$, we have, for all $\delta \in (0, 1)$, a bound on the generalization error of order $\frac{\log N}{N}$ holding with probability at least $1 - \delta$, with $C_0$ defined as
$$C_0 \triangleq \frac{\tau_1}{2(\sigma_1 - \sigma_2)^2}.$$

Remark 2: Theorem 3 provides a generalization error of order $O\!\left(\frac{\log N}{N}\right)$, which matches the minimax information-theoretic lower bound $1/N$ developed in [24] up to a polylogarithmic factor. While the lower bound $1/N$ is established under a sub-Gaussian assumption on $x_n$, our analysis relies only on the finite second moment of $Z_n$, or equivalently the finite fourth moment of $x_n$ in streaming PCA.

IV. OJA'S ALGORITHM WITH MINIBATCH TRAINING
In practical training of learning algorithms, each parameter update is often computed from a small batch of data, which is referred to as minibatch training [1]. Minibatch training greatly improves computational efficiency, owing to the parallel processing of the samples within a minibatch, and can also reduce the generalization error when the minibatch size is properly chosen [14], [25]. In the following, we characterize the effect of minibatch training on the generalization error by considering the minibatch variant of Oja's algorithm. In particular, suppose that the total number of samples $N$ is fixed, representing the sample size in practice (see, e.g., Example 1). In addition, assume that $N = km$, with $m$ denoting the batch size and $k$ the number of iterations; the minibatch variant of Oja's algorithm is then summarized as Algorithm 2. During each updating step, the $m$ estimates of $A\phi_{n-1}$ in a minibatch are averaged to compute $\phi_n$.

Algorithm 2 The Minibatch Variant of Oja's Algorithm With Noise Terms $\zeta^{(l)}_n$
1: Input: $\phi_0$, batch size $m$.
2: for $n = 1$ to $k$ do
3:   for $l = 1$ to $m$ do
4:     $\psi^{(l)}_n \leftarrow A\phi_{n-1} + \zeta^{(l)}_n$
5:   end for
6:   $\psi_n \leftarrow \frac{1}{m}\sum_{l=1}^{m} \psi^{(l)}_n$
7:   $\phi_n \leftarrow \phi_{n-1} + \eta\psi_n$
8: end for
9: Output: $\phi_k / \|\phi_k\|$.
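The following NumPy sketch mirrors Algorithm 2 for the streaming-PCA special case, where each sample $x$ provides the estimate $(x x^{\mathsf T})\phi$ of $A\phi$; the data layout and function name are assumptions of ours.

```python
import numpy as np

def minibatch_oja(samples, phi0, eta, m):
    """Minibatch Oja (Algorithm 2): average m noisy estimates of A @ phi per update.

    `samples` has shape (N, d) with N = k * m. Averaging within a batch
    reduces the noise variance by a factor of m, while the number of
    updates drops from N to k = N / m.
    """
    N, d = samples.shape
    k = N // m
    phi = phi0.astype(float).copy()
    for n in range(k):
        batch = samples[n * m:(n + 1) * m]                     # m fresh samples
        psi = np.mean([x * (x @ phi) for x in batch], axis=0)  # averaged psi_n
        phi = phi + eta * psi                                  # single update per batch
        phi /= np.linalg.norm(phi)                             # rescale; direction unchanged
    return phi
```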
Note that Algorithm 1 can be regarded as a special case of minibatch training with batch size $m = 1$. Compared with Algorithm 1, minibatch training has two opposing effects on the generalization error: first, since the noises are averaged within a batch, each update is more accurate; second, the number of updates is decreased by a factor of the batch size $m$. To characterize the overall impact of minibatch training, we first use $\bar\nu^c_n(\phi_0; \eta, m)$ and $\bar\rho^c_n(\phi_0; \eta, m)$ to denote the expected generalization errors of $\phi_n$ in minibatch training [cf. (4)], respectively, with $\eta$ representing the corresponding learning rate. Then, the following result demonstrates that the overall effect of minibatch training on the generalization error is equivalent to decaying the learning rate by a factor of the batch size.
Theorem 4: In the large sample size and small learning rate regime, suppose the batch size $m$ is a given constant. Then we have
$$\bar\nu^c_{N/m}(\phi_0; \eta, m) = \bar\nu^c_N(\phi_0; \eta_{\mathrm{eq}}, 1)\,\bigl(1 + \hat o(1)\bigr), \tag{25}$$
together with the analogous relation (26) for $\bar\rho^c$, where the equivalent learning rate $\eta_{\mathrm{eq}}$ is defined as
$$\eta_{\mathrm{eq}} \triangleq \frac{\eta}{m}, \tag{27}$$
and where $\hat o(1)$ is as defined in Definition 1.

Proof: First, note that from the updating steps of Algorithm 2, the averaged update can be written in the form (28). Hence, (28) is equivalent to $k$ iterations of Oja's algorithm with the averaged noise, and it follows from (9) and (29) that the corresponding noise parameters are $\tilde\tau_i = \tau_i/m$, for $i = 1, \dots, d$. Then, applying Theorem 2 with these parameters yields (25).
Via similar derivations, we can obtain (26).

Therefore, minibatch training with batch size $m$ and learning rate $\eta$ is equivalent to decaying the learning rate $\eta$ to $\eta_{\mathrm{eq}} = \eta/m$. As a consequence, simultaneously scaling the learning rate $\eta$ and the batch size $m$ by the same factor does not affect the generalization errors. In practical learning applications, such scaling can provide a speedup proportional to the batch size, due to the parallel computation of minibatch training, without increasing the generalization errors. This property is sometimes referred to as the linear scaling rule, which is useful in accelerating the training of deep neural networks [18], [20], [26].

V. EXPERIMENTS
To validate the above theoretical results, we conduct numerical experiments on both synthetic data and the MNIST dataset [27].

A. SYNTHETIC DATA
In this experiment, we set $d = 10$ and randomly generate a PSD matrix $A$. In addition, the noise in Oja's algorithm is generated according to (1), with the $Z_n$ being i.i.d. zero-mean Gaussian matrices. For a given initial vector $\phi_0$, we repeatedly run Algorithm 1 for 20 trials, each with $N = 10^6$ iterations. The expected generalization errors are then computed via (5), with the expectations estimated by the corresponding empirical means over the repeated trials.
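A condensed sketch of this Monte Carlo procedure (all parameter values, including the noise scale, are illustrative; reduce `N` for a quick run):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, eta, trials = 10, 10**6, 1e-4, 20

B = rng.standard_normal((d, d))
A = B @ B.T / d                                  # random PSD matrix
eigvals, V = np.linalg.eigh(A)
v1 = V[:, -1]                                    # ground-truth top eigenvector

phi0 = rng.standard_normal(d)
errors = []
for _ in range(trials):                          # 20 repeated runs of Algorithm 1
    phi = phi0.copy()
    for _ in range(N):
        Z = 0.1 * rng.standard_normal((d, d))    # i.i.d. zero-mean Gaussian Z_n
        phi = phi + eta * (A @ phi + Z @ phi)
        phi /= np.linalg.norm(phi)               # rescale; update is linear in phi
    errors.append(1.0 - (phi @ v1) ** 2)         # nu^c of the final iterate
print(np.mean(errors))                           # empirical expected generalization error
```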
We first investigate the generalization errors under different settings of the learning rate $\eta$. In particular, the values of $\bar\nu^c_N(\phi_0)$ and $\bar\rho^c_N(\phi_0)$ computed from the 20 repeated runs of Algorithm 1 are shown in Figure 2, where the solid lines represent the corresponding theoretical results of Theorem 2. As indicated by the figure, the simulated results match our theoretical predictions well, especially when the learning rate $\eta$ is relatively small.
In addition, we consider the convergence of the generalization errors over iterations. Specifically, we set the learning rate to $\eta = 2^i \eta^*_\nu$ for $i = -1, 0, 1, 2, 3$, with $\eta^*_\nu$ as defined in (21), and investigate the behavior of the generalization error $\bar\nu^c_n(\phi_0)$ over $n$. The corresponding results are shown in Figure 3, where the theoretical generalization error $\bar\nu^c_n(\phi_0)$ is computed via (19). The results again demonstrate the agreement between the simulation and our theory. Moreover, Figure 3 also illustrates how the choice of learning rate affects the tradeoff between the generalization error $\bar\nu^c_n(\phi_0)$ and the convergence rate. Similar results are obtained when the generalization error is measured by the Rayleigh quotient, i.e., by $\bar\rho^c_n(\phi_0)$. Furthermore, we study the generalization errors in minibatch training. In particular, we set the batch size to $m = 1, 8, 32$, respectively. Figure 4 shows the generalization errors $\bar\nu^c_{N/m}(\phi_0; \eta, m)$ and $\bar\rho^c_{N/m}(\phi_0; \eta, m)$ as functions of the equivalent learning rate $\eta_{\mathrm{eq}} = \eta/m$, which validates our discussion in Section IV.

B. MNIST
To validate our results on real data, we also conduct an experiment on the MNIST dataset of handwritten digits, which has a training set of 60,000 samples and a test set of 10,000 samples, with each sample being a 28×28 grayscale image. In particular, we run Algorithm 2 to compute the principal components of the training set, where each sample is vectorized into a vector of $28^2 = 784$ dimensions. Then, to measure the generalization errors, we compute the 784-dimensional covariance matrix $A$ and the principal components $v_1, \dots, v_d$ from the test samples. Note that since the parameters $\tau_i$ for the MNIST dataset cannot be accurately estimated from samples, it is hard to establish the precise theoretical performance for comparison. Instead, we focus on verifying the linear scaling rule in minibatch training. To reduce oscillations during training, we adopt relatively large minibatch sizes. Specifically, with a fixed input $\phi_0$, we set the batch size to $m = 64, 128, 256$, respectively. The resulting generalization errors for each setting are shown in Figure 5, which again validates our discussion in Section IV.

VI. CONCLUSION
In this work, we investigated the asymptotic tradeoff between the learning rate, generalization errors, and convergence rate of Oja's algorithm in the large sample size and small learning rate regime. Our results suggest an optimal design for the learning rate and demonstrate the impact of the batch size on the generalization errors, which can provide theoretical insights and guidance for designing the hyperparameters of machine learning algorithms.

APPENDIX A PROOF OF PROPOSITION 1
Throughout this proof, we use $\otimes$ to denote the Kronecker product between vectors or matrices, and adopt the notations $\hat e_{ij}$ and $V_{ij}$ (for $i \le j$), with $e_i$ denoting the $i$-th standard basis vector of $\mathbb{R}^d$. Then, let $E \in \mathbb{R}^{d^2 \times \tilde d}$ be the matrix composed of all $\hat e_{ij}$ ($i \le j$), such that its $[i + (j-1)j/2]$-th column is $\hat e_{ij}$. Then, it can be verified that $\theta_n = E^{\mathsf T} \tilde\phi^{\otimes 2}_n$, where $\theta_n \in \mathbb{R}^{\tilde d}$ is defined such that, for all $i \le j \le d$, the $[i + (j-1)j/2]$-th entry of $\theta_n$ is $\phi^{\mathsf T}_n V_{ij} \phi_n$. Moreover, the following property of $E$ will be useful.
Lemma 2: For all $u \in \mathbb{R}^d$, the identity (30) holds.
Proof: Suppose $u = [u_1, \dots, u_d]^{\mathsf T}$. Then a direct computation gives (31), and, in addition, expanding $u^{\otimes 2}$ yields (32). Combining (31) and (32), we obtain (30).
In addition, the following result characterizing $L$ will also be useful.
Lemma 3: The matrix $L$ as defined in (7) can be represented as (33).
Proof: For all $i \le j$ and $i' \le j'$, the corresponding entry of $L$ can be computed through the chain of equalities leading to (38), where to obtain (36) we have used the fact that $\operatorname{vec}(V_{ij}) = V^{\otimes 2}\hat e_{ij}$ for all $i \le j$, and to obtain (38) we have used the fact that $\operatorname{vec}^{\mathsf T}(M_1)\operatorname{vec}(M_2) = \operatorname{tr}(M^{\mathsf T}_1 M_2)$ for matrices $M_1$ and $M_2$.
Then, Proposition 1 can be proved as follows.
Proof: [Proof of Proposition 1] To begin, let $\tilde\phi_n \triangleq V^{\mathsf T}\phi_n$ and $\tilde\zeta_n \triangleq V^{\mathsf T}\zeta_n$. The updating steps of (2) can then be rewritten in the eigenbasis as (40), and thus (41) expresses $\tilde\phi^{\otimes 2}_n$ in terms of $\tilde\phi^{\otimes 2}_{n-1}$. We then investigate the expectation of the right-hand side of (41) conditional on $\phi_0$. Since the noises $Z_n$ are independent, we have the Markov relation $\phi_0 - \phi_1 - \cdots - \phi_n$. Therefore, (42) holds, where to obtain the last equality we have exploited the fact that the $Z_n$ are zero-mean. In addition, since $\tilde\zeta_n = V^{\mathsf T} Z_n \phi_{n-1}$, we obtain the conditional expectation of the noise term, where to obtain the second equality we have again exploited the Markov relation $\phi_0 - \phi_1 - \cdots - \phi_n$. Then, taking the expectation of (41) conditional on $\phi_0$ yields the recursion (43)–(46), where (43) follows from the identity
$$E^{\mathsf T}(I + \eta\Lambda)^{\otimes 2} = \bigl(\tilde I_{\tilde d} + \eta\Lambda_1 + \eta^2\Lambda_2\bigr)E^{\mathsf T},$$
(44) follows from Lemma 2, (45) follows from Lemma 3, and (46) follows from the definition of $G$.
As a result, we obtain the iterated form of the conditional expectation, where the last equality follows from (48). Taking the inner product with $\theta_0$ then yields the main result (14).

APPENDIX C PROOF OF THEOREM 2
To characterize the asymptotic regime of the large sample size and small learning rate, we use $f(\eta) = o^{(\eta)}(1)$ to denote quantities that vanish as $\eta \to 0$, and $f(N\eta) = o^{(N\eta)}(1)$ to denote quantities that vanish as $N\eta \to \infty$. It then follows immediately that $\hat o(1) = o^{(N\eta)}(1) + o^{(\eta)}(1)$, and the following lemma will be useful in our derivation.

Lemma 4: For all $i > 1$, we have $\gamma^N_{ii}(\eta) < \exp\bigl(N\eta(\sigma_i - \sigma_1)\bigr)$.

Proof: According to (13), we have
$$\gamma_{ii}(\eta) = \frac{\lambda_{ii}(G)}{\lambda_{11}(G)} = 1 + 2\eta(\sigma_i - \sigma_1) + o(\eta).$$
Hence, for $i > 1$ and a sufficiently small $\eta$, we have $\gamma_{ii}(\eta) \le 1 + \eta(\sigma_i - \sigma_1)$, and thus
$$\gamma^N_{ii}(\eta) \le \bigl(1 + \eta(\sigma_i - \sigma_1)\bigr)^N < \exp\bigl(N\eta(\sigma_i - \sigma_1)\bigr), \tag{51c}$$
where to obtain (51c) we have used the fact that $1 + x \le e^x$.