Low-Rank Gradient Descent

Several recent empirical studies demonstrate that important machine learning tasks, such as training deep neural networks, exhibit a low-rank structure, where most of the variation in the loss function occurs only in a few directions of the input space. In this article, we leverage such low-rank structure to reduce the high computational cost of canonical gradient-based methods such as gradient descent (<monospace>GD</monospace>). Our proposed <italic>Low-Rank Gradient Descent</italic> (<monospace>LRGD</monospace>) algorithm finds an <inline-formula><tex-math notation="LaTeX">$\epsilon$</tex-math></inline-formula>-approximate stationary point of a <inline-formula><tex-math notation="LaTeX">$p$</tex-math></inline-formula>-dimensional function by first identifying <inline-formula><tex-math notation="LaTeX">$r \leq p$</tex-math></inline-formula> significant directions, and then estimating the true <inline-formula><tex-math notation="LaTeX">$p$</tex-math></inline-formula>-dimensional gradient at every iteration by computing directional derivatives only along those <inline-formula><tex-math notation="LaTeX">$r$</tex-math></inline-formula> directions. We establish that the “directional oracle complexities” of <monospace>LRGD</monospace> for strongly convex and non-convex objective functions are <inline-formula><tex-math notation="LaTeX">${\mathcal {O}}(r \log (1/\epsilon) + rp)$</tex-math></inline-formula> and <inline-formula><tex-math notation="LaTeX">${\mathcal {O}}(r/\epsilon ^{2} + rp)$</tex-math></inline-formula>, respectively.
Therefore, when <inline-formula><tex-math notation="LaTeX">$r \ll p$</tex-math></inline-formula>, <monospace>LRGD</monospace> provides significant improvement over the known complexities of <inline-formula><tex-math notation="LaTeX">${\mathcal {O}}(p \log (1/\epsilon))$</tex-math></inline-formula> and <inline-formula><tex-math notation="LaTeX">${\mathcal {O}}(p/\epsilon ^{2})$</tex-math></inline-formula> of <monospace>GD</monospace> in the strongly convex and non-convex settings, respectively. Furthermore, we formally characterize the classes of exactly and approximately low-rank functions. Empirically, using real and synthetic data, <monospace>LRGD</monospace> provides significant gains over <monospace>GD</monospace> when the data has low-rank structure, and in the absence of such structure, <monospace>LRGD</monospace> does not degrade performance compared to <monospace>GD</monospace>. This suggests that <monospace>LRGD</monospace> could be used in practice in any setting in place of <monospace>GD</monospace>.


I. INTRODUCTION
First order optimization methods such as Gradient Descent (GD) and its variants have become the cornerstone of training large-scale supervised learning models due to their simplicity and ease of implementation. In general, the optimization problem of interest is min_{θ∈R^p} f(θ), where the objective function f : R^p → R captures the dependence on data as well as underlying model constraints (e.g., regularization). Moreover, the computational model of interest is known as oracle complexity, which measures the total number of gradient computations until termination. Subsequently, within the optimization for machine learning literature, there has been significant interest in reducing the oracle complexity of first order methods [1], [2]. Formally, we will measure oracle complexity as the total number of directional derivative computations of f needed to achieve a certain predefined accuracy level ε > 0. Under this definition, the canonical GD method requires an oracle complexity of O(p log(1/ε)) to find an ε-minimizer of a strongly convex objective function f [3]. This complexity grows linearly with the parameter dimension p (note that each iteration of GD costs p directional derivatives, as it requires a full gradient computation), which can be prohibitive, e.g., when training deep neural networks. However, it has recently been observed that many empirical risk minimization problems have objective functions with low-rank structure in their gradients [4], [5], [6], [7], [8], [9], [10]. In the sequel, we will refer to such functions as low-rank functions. We next list some motivating scenarios in which such functions naturally appear.
Motivation 1. Low-rank Hessians: Low-rank structures have been found in large-scale deep learning scenarios irrespective of the architecture, training methods, and tasks [4], [5], [6], [9], [11], where the Hessians exhibit a sharp decay in their eigenvalues. For instance, in classification tasks with k classes, during the course of training a deep neural network model, the gradient lives in the subspace spanned by the k eigenvectors of the Hessian matrix with largest eigenvalues [4]. As another example, [7] empirically demonstrates that some layers of a deep neural network model (VGG-19) trained on the CIFAR-10 dataset show exponential decay in the eigenvalues of the gradient covariance matrix. These practical scenarios suggest that we may be able to leverage such low-rank structures to mitigate training computation cost. This connection is further detailed in Appendix C. Furthermore, a simple simulation based on the MNIST dataset [12] that illustrates low-rank structure in gradients of neural networks is provided in Appendix D.
Motivation 2. Relation to line search: Line search is an iterative strategy to find the optimum stepsize in GD-type methods [13]. More precisely, given a current position θ, the line search algorithm minimizes the objective f restricted to the rank-1 subspace (i.e., a line) passing through θ in the direction of ∇f(θ), that is, θ ← arg min_{θ′ ∈ {θ + α∇f(θ) : α ∈ R}} f(θ′). Now consider a similar search problem where the objective f is to be minimized restricted to a rank-r subspace rather than a line. This naturally yields a minimization (sub-)problem with low-rank structure.
Motivation 3. Ridge functions: Ridge functions are generally defined as functions that only vary on a given low-dimensional subspace of the ambient space [14]. There are many standard examples of ridge function losses in machine learning, e.g., least-squares regression, logistic regression, one hidden layer neural networks, etc. [15], [16]. The natural model for Principal Component Regression (PCR) is another such example [17]. Moreover, they have been exploited in the development of projection pursuit methods in statistics, cf. [18] and the references therein.
Given all of the above motivations, we will consider a real-valued function as low-rank if its gradients live close to a low-dimensional subspace. Such low-rank structure has recently been exploited to theoretically improve the running times of federated optimization algorithms in [10]. Yet, canonical GD methods do not exploit this additional structure. Hence, the goal of this work is to address the following question: Can we leverage low-rank structure to design optimization algorithms with running times that have better dependence on the dimension?
Main contributions: Our main contribution is to propose Low-Rank Gradient Descent (LRGD) methods that leverage such low-rank structure in the objective to reduce oracle complexity. Specifically, we restrict the gradient of the function to the low-rank subspace defined by some significant directions (the active subspace) and perform descent iterations. More precisely, LRGD identifies a fairly accurate proxy for such an active subspace. Then, in each descent iteration, it approximates the true gradient vector by computing only r (the dimension of the active subspace) directional derivatives of the objective along the active subspace. Note that each iteration of LRGD costs only r directional derivative computations. Intuitively, this is far less than canonical methods such as GD, whose iteration cost grows linearly with p, as noted earlier (see Table 1 for details). It is worth noting that LRGD methods continue to re-evaluate their active subspace when the restricted descent steps stop being useful. It is this adaptation that makes LRGD perform better than GD even when the underlying function may not be low-rank. In summary, we list the key contributions of this work:
• We propose variants of LRGD (Algorithms 2 and 3) that adapt the active subspace for restricting gradient computation. This makes LRGD suitable even for full-rank functions.
• We numerically evaluate the performance of LRGD on different objectives with both real and synthetic datasets. We find that LRGD is able to reduce oracle complexity significantly for the low-rank setting. In addition, for the full-rank setting, it is able to perform as well as, if not better than, GD, primarily due to its ability to adapt the active subspace for gradient computation. This suggests that, practically, it always makes sense to utilize the (iterated or adaptive versions of) LRGD in place of GD.
Related work: Low-rank structures have been exploited extensively in various disciplines. For example, in machine learning, matrix estimation (or completion) methods typically rely on low-rank assumptions to recover missing entries [19]. In classical statistics, projection pursuit methods rely on approximating functions using ridge functions [14], [18], which are precisely low-rank functions (see Proposition 2.3). In a similar vein, in scientific computing, approximation methods have been developed to identify influential input directions of a function for uncertainty quantification [20].
In contrast to these settings, we seek to exploit low-rank structure in optimization algorithms. Recently, the authors of [21] developed a local polynomial interpolation based GD algorithm for empirical risk minimization that learns gradients at every iteration using smoothness of loss functions in data. The work in [10] extended these developments to a federated learning context (cf. [22], [23] and the references therein), where low-rank matrix estimation ideas were used to exploit such smoothness of loss functions. Following this line of reasoning, this article utilizes the structure of low-rank objective functions to improve iterative gradient-based algorithms in the canonical optimization setting. In a very different direction, low-rank structure has also been used to solve semidefinite programs in [24].
Several works have recently highlighted different forms of low-rank structure in large-scale training problems. For example, deep neural networks seem to have loss functions with low-rank Hessians [4], [5], [6], [9], which partly motivates our work here (also see Appendix D). Moreover, deep neural networks have been shown to exhibit low-rank weight matrices [25], [26] and neural collapse [27], [28], [29], and low-rank approximations of gradient weight matrices have been used for their training [8].
Our work falls within the broader effort of speeding up first order optimization algorithms, which has been widely studied in the literature. In this literature, the running time is measured using first order oracle complexity (i.e., the number of full gradient evaluations until convergence) [3], [30]. The first order oracle complexity of GD for, e.g., strongly convex functions, is analyzed in the standard text [3]. Similar analyses for stochastic versions of GD that are popular in large-scale empirical risk minimization problems, such as (mini-batch) stochastic GD, can be found in [31], [32]. There are several standard approaches to theoretically or practically improving the running times of these basic algorithms, e.g., momentum [33], acceleration [34], variance reduction [35], [36], [37], [38], and adaptive learning rates [39], [40], [41]. More related to our problem, random coordinate descent-type methods such as stochastic subspace descent [42] have been studied. In addition, various other results, pertaining to GD with inexact oracles (see [43] and the references therein), fundamental lower bounds on oracle complexity [44], [45], etc., have also been established in the literature. We refer readers to the surveys in [1], [2] for details and related references.
As noted earlier, the contributions in this article are complementary to the aforementioned approaches to speed up first order optimization methods, which do not use low-rank structure in objective functions. Indeed, ideas like acceleration, variance reduction, and adaptive learning rates could be used in conjunction with our proposed LRGD algorithm, although we leave such developments for future work. Furthermore, we only consider GD-like methods rather than more prevalent stochastic GD methods in this work, because our current focus is to show simple theoretical improvements in running time by exploiting low-rank structure. Hence, we also leave the development of stochastic optimization algorithms that exploit low-rank structure for future work.
Outline: We briefly outline the remainder of the article. We state our assumptions, the computational model, and definitions and properties of low-rank functions in Section II. We describe the LRGD algorithm and its theoretical guarantees in Section III. Then, we present illustrative simulations in Section IV. Finally, proofs of our theoretical guarantees and various properties of low-rank functions can be found in Sections V and VI, respectively.

II. FORMAL SETUP AND LOW-RANK STRUCTURE
Notation: We outline some key notation used in this work. For any vectors x, y ∈ R^p, we let ⟨x, y⟩ denote the Euclidean inner product between x and y, and ‖x‖ denote the Euclidean ℓ²-norm. For a given r > 0, we let B(x, r) = {y ∈ R^p : ‖x − y‖ ≤ r} denote the closed ball of radius r with center x. For any linear subspace H ⊆ R^p, we use H^⊥ to represent its orthogonal complement subspace. We define Π_H : R^p → H as the orthogonal projection operator onto H, such that for all x ∈ R^p, Π_H(x) := arg min_{y∈H} ‖x − y‖. Note that for any orthonormal basis {u_1, …, u_r} of H, we have Π_H(x) = Σ_{i=1}^{r} ⟨x, u_i⟩ u_i. For any matrix A ∈ R^{p×q}, we let A^T be the transpose of A, ‖A‖_F be the Frobenius norm of A, ‖A‖_op := max_{x∈R^q : ‖x‖=1} ‖Ax‖ be the operator norm of A, and A^{−1} be the inverse of A (when it exists and p = q). For any collection of vectors {x_1, …, x_k} ⊆ R^p, we let span(x_1, …, x_k) denote the linear span of the vectors. Moreover, for any matrix A = [a_1, …, a_q] ∈ R^{p×q}, where a_1, …, a_q ∈ R^p are the columns of A, we let span(A) be the linear span of the columns of A, i.e., the column space of A.
For any sufficiently differentiable real-valued function f : R^p → R, we let ∇f : R^p → R^p denote its gradient and ∇²f : R^p → R^{p×p} denote its Hessian. Further, for any unit vector u ∈ R^p, we let ∂_u f : R^p → R denote the directional derivative of f in the direction u, ∂_u f(x) := ⟨∇f(x), u⟩.
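As a concrete illustration of this notation, the snippet below evaluates a directional derivative and checks it against a central finite difference. The quadratic objective and all variable names here are our own illustrative choices, not part of the article.

```python
import numpy as np

def directional_derivative(grad_f, x, u):
    """One oracle call: <grad f(x), u> for a unit vector u."""
    return float(np.dot(grad_f(x), u))

# Illustrative quadratic: f(x) = 0.5 * ||A x||^2, so grad f(x) = A^T A x.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
f = lambda x: 0.5 * np.dot(A @ x, A @ x)
grad_f = lambda x: A.T @ (A @ x)

x = rng.standard_normal(3)
u = rng.standard_normal(3)
u /= np.linalg.norm(u)  # normalize to a unit direction

exact = directional_derivative(grad_f, x, u)
h = 1e-6
fd = (f(x + h * u) - f(x - h * u)) / (2 * h)  # central finite difference
```

For a quadratic, the central difference agrees with the exact directional derivative up to floating-point error.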

A. PRELIMINARIES
We consider a real-valued differentiable function f : Θ → R, where Θ = R^p for some positive integer p, and denote its gradient as ∇f : Θ → R^p. We assume that f is L-smooth, i.e., its gradient is L-Lipschitz continuous.
Assumption 2.1 (L-smoothness): The function f is L-smooth for a given parameter L > 0 if for any θ, θ′ ∈ Θ, ‖∇f(θ) − ∇f(θ′)‖ ≤ L‖θ − θ′‖.
Non-convex setting: Under no further assumptions on the function f, the general goal we pursue is to find an ε-stationary point of f, i.e., for a given approximation accuracy ε > 0, we aim to find θ ∈ Θ such that
‖∇f(θ)‖ ≤ ε. (1)
Note that a solution to (1) exists when f is lower bounded by some f* = inf_{θ∈Θ} f(θ) ∈ R, which we assume throughout this article.
Strongly convex setting: We also study a similar problem under the additional assumption that f is strongly convex.

Assumption 2.2 (Strong convexity):
The function f is μ-strongly convex for a given parameter μ > 0 if for any θ, θ′ ∈ Θ, f(θ′) ≥ f(θ) + ⟨∇f(θ), θ′ − θ⟩ + (μ/2)‖θ′ − θ‖². Under this particular setting, a solution to (1) can be interpreted as an approximate minimizer of f through the Polyak-Łojasiewicz inequality (see, e.g., [13]): ‖∇f(θ)‖² ≥ 2μ(f(θ) − f*). Hence, the goal we pursue in this setting is to find an ε-minimizer of f, which is defined as θ ∈ Θ satisfying
f(θ) − f* ≤ ε, (2)
where ε > 0 is a predefined approximation accuracy (as before). Note that it suffices to find an ε′-stationary point of f with ε′ = √(2με) to get an ε-minimizer of f, which explicitly connects the goals in both settings.
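The conversion between the two goals takes one line of algebra; the following display is a sketch of the standard argument, using only the Polyak-Łojasiewicz inequality quoted above.

```latex
% From the PL inequality to the stationarity-to-optimality conversion:
\[
  \|\nabla f(\theta)\|^{2} \;\ge\; 2\mu\,\bigl(f(\theta) - f^{*}\bigr)
  \quad\Longrightarrow\quad
  f(\theta) - f^{*} \;\le\; \frac{\|\nabla f(\theta)\|^{2}}{2\mu},
\]
so if $\theta$ is an $\epsilon'$-stationary point with
$\epsilon' = \sqrt{2\mu\epsilon}$, then
$f(\theta) - f^{*} \le (\epsilon')^{2}/(2\mu) = \epsilon$,
i.e., $\theta$ is an $\epsilon$-minimizer.
```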
In this article, we focus on the computational complexity of gradient descent methods, which we concretely measure using the following computational model.
Computational model: To characterize the running time of a GD-type algorithm for solving problems (1) and (2), we employ a variant of the well-established notion of oracle complexity [3], [30]. In particular, we tailor this notion to count the number of oracle calls to directional derivatives of the objective, where the directional derivative of f along a unit-norm vector u ∈ R^p is the scalar ∂_u f(θ) := ⟨∇f(θ), u⟩. This computational model, which differs from most of the literature on first order methods that typically assumes the existence of a full-gradient oracle ∇f(θ), will allow us to illustrate the utility of a low-rank optimization method.
Definition 2.1 (Oracle complexity): Given an algorithm ALG, for a predefined accuracy ε > 0 and function class F, we denote by C_ALG the maximum number of oracle calls to directional derivatives required by ALG to reach an ε-approximate solution in the sense of (1) or (2) for any function in F.
Note that computing a full-length gradient vector of dimension p requires p calls to the directional derivative oracle. As a consequence, for the class of smooth and strongly convex functions, the oracle complexity of vanilla GD to find an ε-minimizer of f is C_GD = O(p log(1/ε)) [3].
To reduce the oracle complexity, LRGD first identifies a low-rank subspace H, known as the active subspace, that (approximately) contains the gradients of the objective f. Let {u_1, …, u_r} denote an orthonormal basis for H. In each iteration with current model parameter θ, LRGD computes r directional derivatives of f, i.e., ∂_{u_1} f(θ), …, ∂_{u_r} f(θ), uses ∇̃f(θ) := Σ_{j=1}^{r} ∂_{u_j} f(θ) u_j as an approximation to the true full gradient ∇f(θ), and updates the model parameter as θ ← θ − α ∇̃f(θ) with an appropriate step-size α. We provide the description of LRGD and its guarantees for different function classes in Section III.
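The gradient proxy described above can be sketched in a few lines. Here the directional-derivative oracle is simulated with a full gradient for checking purposes, and the exactly rank-r gradient field is an illustrative construction of ours.

```python
import numpy as np

def approx_gradient(grad_f, theta, U):
    """Assemble the rank-r gradient proxy from r directional derivatives."""
    # Each term <grad f(theta), u_j> stands for one call to the
    # directional-derivative oracle (simulated here via the full gradient).
    coeffs = np.array([np.dot(grad_f(theta), U[:, j])
                       for j in range(U.shape[1])])
    return U @ coeffs

# Toy check: if grad f lies exactly in span(U), the proxy is exact.
p, r = 5, 2
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((p, r)))  # orthonormal basis of H
c = rng.standard_normal(r)
grad_f = lambda theta: U @ c  # an exactly rank-r gradient field
theta = rng.standard_normal(p)
g_hat = approx_gradient(grad_f, theta, U)
```

The proxy costs r oracle calls per iteration instead of the p calls needed for a full gradient.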

B. APPROXIMATELY LOW-RANK FUNCTIONS
In this section, we formally define our notion of (approximately) low-rank functions. Intuitively, a low-rank function maintains its gradient vectors in a low-dimensional subspace; this is inspired by the motivating examples described in Section I. Our notion also allows low-dimensional subspaces to only approximately satisfy this condition, as made explicit in Definition 2.2. In the rest of this section, we also provide more details about the properties of (approximately) low-rank functions.

Definition 2.2 (Approximate rank of a function):
The function f is (η, ε)-approximately rank-r if there exist η ∈ [0, 1), ε ≥ 0, and a subspace H ⊆ R^p of dimension r such that for all θ ∈ Θ,
‖∇f(θ) − Π_H(∇f(θ))‖ ≤ η‖∇f(θ)‖ + ε, (3)
where Π_H denotes the orthogonal projection operator onto H. Note that this definition has two error terms, represented by the two constants η (multiplicative) and ε (additive). Both of these constants are required for the generality of the definition of approximate low-rankness. In particular, the role of the additive term is emphasized in the following Proposition 2.1, whereas the role of the multiplicative term is emphasized in Proposition 2.2 for the particular case of strongly convex functions.
The proof in Section VI is a direct application of the inverse function theorem to ∇f at θ*. This result shows that in most settings, the approximate rank requires an affine approximation error (i.e., ε > 0 is necessary). Note that this does not produce a uniform bound on ε that would hold for any function f; such a bound would only hold under additional assumptions on the third derivatives of f.

Proposition 2.2 (Strong convexity and approximate low-rankness):
If f is μ-strongly convex, then for any unit-norm vector u ∈ R^p and any ε > 0, there exists some θ ∈ R^p with ‖θ − θ*‖ < ε whose gradient has a nonzero component along u. As a consequence, if f is (η, ε)-approximately rank r < p on the ball of radius ε around θ*, the constants μ, η, and ε cannot all be chosen freely. The proof in Section VI shows that the gradient vectors of strongly convex functions typically span all directions. As a consequence, this result shows that there is a trade-off between the strong convexity constant μ and the approximate low-rankness constants (η, ε). For a given strongly convex function f, it is however still very reasonable to assume that it is approximately low-rank on bounded portions of the parameter space of the form B(θ*, ε) with limited ε, rather than the full space Θ = R^p. This turns out to be sufficient in most practical cases.
Exactly low-rank functions: As noted earlier, the notion of approximately low-rank functions in Definition 2.2 subsumes the well-known class of ridge functions, which have η = ε = 0. We refer to functions that satisfy condition (3) with η = ε = 0 as exactly low-rank functions. It is worth noting that such functions have already been widely studied in the literature under different equivalent guises (see Proposition 2.3). However, the class of such functions is fairly restricted, e.g., it does not contain any strongly convex function. Next, we define the rank of a function for this particular case of exactly low-rank functions, and present several equivalent forms of such functions.

Definition 2.3 (Rank of function):
The function f has rank r if r is the smallest integer such that there exists a subspace H of dimension r with ∇f(θ) ∈ H for all θ ∈ Θ.

Proposition 2.3 (Characterizations of low-rank functions):
The following are equivalent:
1) There exists a subspace H of dimension r such that ∇f(θ) ∈ H for all θ ∈ Θ.
2) There exists a subspace H of dimension r such that f(θ) = f(θ + θ′) for all θ ∈ Θ and θ′ ∈ H^⊥.
3) There exists a matrix A ∈ R^{r×p} and a map σ : R^r → R such that f(θ) = σ(Aθ) for all θ ∈ Θ.
We defer the proof to Section VI. The proposition above relates our definition of exactly low-rank functions in part 1 to the notion in part 2, which is ordinarily referred to as H^⊥-invariance (see, e.g., [20]), and the notion in part 3, which is typically the definition of ridge functions [14] or multi-ridge functions [46]. In what follows, we will use the terminology of exactly low-rank functions and ridge functions interchangeably.
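The implication from part 3 to part 1 follows from the chain rule: if f(θ) = σ(Aθ), then ∇f(θ) = A^T ∇σ(Aθ) ∈ span(A^T). A minimal numerical check, with an illustrative choice σ(z) = sin(z₁) + z₂² of our own:

```python
import numpy as np

r, p = 2, 6
rng = np.random.default_rng(2)
A = rng.standard_normal((r, p))

def grad_f(theta):
    """Gradient of the ridge function f(theta) = sin(z1) + z2^2, z = A theta."""
    z = A @ theta
    grad_sigma = np.array([np.cos(z[0]), 2.0 * z[1]])  # grad of sigma at z
    return A.T @ grad_sigma  # always lies in span(A^T) by the chain rule

theta = rng.standard_normal(p)
g = grad_f(theta)
# Residual of g after projecting onto span(A^T) should vanish.
Q, _ = np.linalg.qr(A.T)
residual = g - Q @ (Q.T @ g)
```

The residual is zero up to floating-point error, confirming that the gradient never leaves the r-dimensional row space of A.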
We expand on some properties of exactly low-rank functions in Appendix B. Furthermore, we illustrate via an example in that appendix that optimization of a high-rank function can sometimes be equivalently represented as optimization of a low-rank function using an appropriately chosen non-linear transformation. Note, however, that for our purposes, the class of ridge functions has some limitations. In particular, any μ-strongly convex function is of (full) rank p, because at its minimizer θ* it satisfies f(θ* + θ′) ≥ f(θ*) + (μ/2)‖θ′‖² > f(θ*) for all θ′ ≠ 0, which together with part 2 in Proposition 2.3 yields that H has full dimension r = p. This is why we focus our analysis on the much richer class of approximately low-rank functions.

III. ALGORITHMS AND THEORETICAL GUARANTEES
In this section, we state the precise description of the proposed LRGD algorithm, provide its oracle complexity guarantees, and conclude with a discussion on practical considerations.

A. LRGD ALGORITHM
The proposed algorithm, LRGD, is an iterative gradient-based method. It starts by identifying a subspace H of rank r ≤ p that is a good candidate to match our definition of an approximately low-rank function for the objective f, i.e., Definition 2.2. More precisely, we pick r random models θ_1, …, θ_r and construct the matrix G = [g_1, …, g_r] ∈ R^{p×r}, where we let g_j := ∇f(θ_j) for j ∈ [r]. Then, assuming that G is full-rank, the active subspace H will be the space induced by the r dominant left singular vectors of G. In other words, if we denote by G = UΣV^T the singular value decomposition (SVD) of G, we pick H = span(U) = span(u_1, …, u_r).
Having set the active subspace H, LRGD updates the model parameters θ_t in each iteration t by θ_{t+1} = θ_t − α ∇̃f(θ_t) for an appropriate stepsize α. Note that ∇̃f(θ_t) here denotes the projection of the true gradient ∇f(θ_t) onto the active subspace H, which LRGD uses as a low-rank proxy for ∇f(θ_t).
Computing each approximate gradient ∇̃f requires only r (and not p) calls to the directional derivative oracle, as we have ∇̃f(θ_t) = Σ_{j=1}^{r} ∂_{u_j} f(θ_t) u_j. Algorithm 1 provides the details of LRGD.
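A minimal sketch of the two phases of Algorithm 1, assuming access to a full-gradient oracle during subspace identification; the variable names and the toy quadratic below are ours, not the paper's.

```python
import numpy as np

def lrgd(grad_f, theta0, r, alpha, n_iters, sample_points):
    """Two-phase sketch of LRGD: subspace identification, then descent."""
    # Phase 1: sample r gradients and keep the r dominant left singular
    # vectors as an orthonormal basis of the active subspace H.
    G = np.column_stack([grad_f(t) for t in sample_points])  # p x r
    U, _, _ = np.linalg.svd(G, full_matrices=False)
    U = U[:, :r]
    # Phase 2: descend with the rank-r proxy (r oracle calls per iteration).
    theta = theta0
    for _ in range(n_iters):
        g_hat = U @ (U.T @ grad_f(theta))  # projection of grad f onto H
        theta = theta - alpha * g_hat
    return theta

# Toy run: a quadratic whose gradients live in a 2-dimensional subspace.
p, r = 6, 2
rng = np.random.default_rng(3)
B, _ = np.linalg.qr(rng.standard_normal((p, r)))
H = B @ np.diag([2.0, 1.0]) @ B.T  # PSD Hessian of rank 2
grad_f = lambda th: H @ th         # f(th) = 0.5 * th^T H th
samples = [rng.standard_normal(p) for _ in range(r)]
theta_hat = lrgd(grad_f, rng.standard_normal(p), r, alpha=0.4,
                 n_iters=200, sample_points=samples)
```

Since the sampled gradients span the rank-2 column space of H exactly, the restricted descent drives the gradient to zero while never spending more than r oracle calls per iteration.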
Remark 3.1: We would like to highlight that in each iteration of the coordinate descent algorithm, one of the p standard basis directions e_1, …, e_p is (randomly or deterministically) picked, and the parameter is updated only along the picked direction. This procedure is irrespective of the objective function and its directions of significant variation. In contrast, LRGD first identifies such significant directions (through the matrix G defined in Algorithm 1) and then updates the parameters along those directions, which are generally different from the standard basis directions.

B. ORACLE COMPLEXITY IN STRONGLY CONVEX SETTING
Next, we characterize the oracle complexity of the LRGD method for strongly convex objectives and discuss its improvement over the benchmark GD method. Let us begin with the more general case of approximately low-rank functions.

Algorithm 1: Low-Rank Gradient Descent (LRGD).
We provide an oracle complexity guarantee for such functions under a strong convexity assumption.
Theorem 3.1 (Oracle complexity in approximately low-rank and strongly convex case): Let the objective function f be L-smooth and μ-strongly convex with condition number κ := L/μ, i.e., Assumptions 2.1 and 2.2 hold. We also assume that f is (η, √(με)/5)-approximately rank-r according to Definition 2.2, and that the multiplicative constant η is sufficiently small relative to σ_r, where σ_r is the smallest singular value of the matrix G = [g_1/‖g_1‖, …, g_r/‖g_r‖] with g_j := ∇f(θ_j) and ‖g_j‖ > ε for all j ∈ [r]. Then, for stepsize α = 1/(8L), the proposed LRGD method in Algorithm 1 reaches an ε-minimizer of f with oracle complexity C_LRGD = O(κr log(1/ε) + pr). We defer the proof to Section V and remark on the following points.
Remark 3.2: The condition required by Theorem 3.1 relates the low-rank characteristics of the objective and the corresponding matrix G to the desired accuracy. More precisely, the smallest singular value σ_r of G determines how well the columns of G span the active subspace. Intuitively, for the regime of our interest with constant r and large p, it is (statistically) likely to end up with a matrix G whose σ_r is a constant (not decaying with p). If the smallest singular value σ_r of G is too small, it means that the sampled gradients are too correlated to yield a good approximation of the active subspace, and therefore one should re-sample the initial points θ_1, …, θ_r until σ_r is adequately lower-bounded.
As elaborated in Section II and in Proposition 2.3, strongly convex functions cannot be exactly low-rank. However, an exactly low-rank function may be strongly convex when restricted to its active subspace. The next result characterizes the oracle complexity of LRGD in such scenarios.
Proposition 3.1 (Oracle complexity in the exactly low-rank and strongly convex case): Let the objective function f be L-smooth (Assumption 2.1) and exactly rank-r according to Definition 2.3, where ∇f(θ) ∈ H for all θ. Moreover, assume that f restricted to H is μ-strongly convex (Assumption 2.2) with condition number κ := L/μ. Then, the proposed LRGD method in Algorithm 1 with stepsize α = 1/L reaches an ε-minimizer of f with oracle complexity C_LRGD = O(κr log(Δ_0/ε) + pr), where Δ_0 = f(θ_0) − f* is the suboptimality of the initialization θ_0.
We defer the proof to Section V. Theorem 3.1 and Proposition 3.1 show that in the strongly convex setting, for approximately low-rank functions with sufficiently small parameters, LRGD is able to reduce the oracle complexity of GD, which is C_GD = O(κp log(1/ε)), to C_LRGD = O(κr log(1/ε) + pr). This gain is particularly significant as the term depending on the accuracy ε does not scale with the parameter dimension p, and rather scales with the rank r, which can be much smaller than p. Finally, it is worth noting that the final accuracy attained by LRGD (Algorithm 1) depends on the low-rank properties of the function f, as captured by the assumptions in Theorem 3.1.

C. ORACLE COMPLEXITY IN NON-CONVEX SETTING
We now turn our focus to non-convex and smooth objectives and discuss the benefits of LRGD on the overall oracle complexity for such functions. As before, let us begin with approximately low-rank objectives.
Theorem 3.2 (Oracle complexity in approximately low-rank and non-convex case): Let the objective function f be L-smooth, i.e., Assumption 2.1 holds. Moreover, assume that f is (η, ε/3)-approximately rank-r according to Definition 2.2, and that the multiplicative constant η is sufficiently small relative to σ_r, where σ_r is the smallest singular value of the matrix G = [g_1/‖g_1‖, …, g_r/‖g_r‖] with g_j := ∇f(θ_j) and ‖g_j‖ > ε for all j ∈ [r]. Then, for stepsize α = 1/(8L), the proposed LRGD method in Algorithm 1 reaches an ε-stationary point of f with oracle complexity C_LRGD = O(LΔ_0 r/ε² + pr), where Δ_0 = f(θ_0) − f* is the suboptimality of the initialization θ_0. We defer the proof to Section V. We next state the oracle complexity result for exactly low-rank objectives.
Proposition 3.2 (Oracle complexity in exactly low-rank and non-convex case): Let the objective function f be L-smooth, i.e., Assumption 2.1 holds. Moreover, assume that the gradient of f is exactly rank-r according to Definition 2.3. Then, the proposed LRGD method in Algorithm 1 with stepsize α = 1/L reaches an ε-stationary point of f with oracle complexity C_LRGD = O(LΔ_0 r/ε² + pr), where Δ_0 = f(θ_0) − f* is the suboptimality of the initialization θ_0. We defer the proof to Section V. Theorem 3.2 and Proposition 3.2 further highlight the advantage of LRGD in the non-convex scenario, where for approximately rank-r objectives with sufficiently small parameters (see Definition 2.2), LRGD is able to reduce the oracle complexity of GD from O(p/ε²) to O(r/ε² + pr). This is particularly significant for low-rank functions with rank r ≪ p. Let us also note that although LRGD's guarantees stated in the previous theorems require the objective to be sufficiently low-rank on the entire model parameter space, in practice the proposed low-rank approach works with far less restrictive assumptions on the objective. This is our motivation to build on LRGD and propose other variants in the following section. These variants can work well even with full-rank functions.

D. VARIANTS OF LRGD
Algorithm 1 (LRGD) can be extended to broader settings, especially in order to make it less sensitive to low-rank assumptions on the function f, since such assumptions are rarely known a priori before the optimization task is defined. In this section, we present two such extensions:
1) Algorithm 2 provides a variant of LRGD where the active subspace is updated as soon as the low-rank descent converges. Note that this guarantees the termination of the optimization algorithm on any function, not just on functions satisfying the assumptions of the theoretical results. This algorithm is particularly adapted to situations where the function is locally approximately rank-r, as the active subspace is updated through a local sampling (which is also used to accelerate the descent with r iterations of GD). This algorithm still takes the parameter r as an input. It is used for the empirical tests in Section IV.
2) Algorithm 3 (in Appendix A) provides a variant of LRGD that is adapted to situations where the rank r is unknown. In this algorithm, the active subspace is made larger when the algorithm reaches convergence on the corresponding subspace. As a result, the rank r is gradually increased from 1 to p. This progression can be arithmetic, i.e., r ← r + 1, or geometric, i.e., r ← 2r. This rank progression is repeated until either the desired accuracy is achieved or all p dimensions spanning the parameter space are included. We leave the analysis of this adaptive method for future work.
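The rank-progression idea in item 2 can be sketched as follows. This is a hedged reconstruction under our own assumptions: the sampling offsets, stopping rules, and the helper name adaptive_lrgd are all hypothetical, since the precise Algorithm 3 is deferred to Appendix A.

```python
import numpy as np

def adaptive_lrgd(grad_f, theta0, p, alpha, eps, max_inner=500):
    """Rank-doubling sketch: descend on a subspace, grow it when stalled."""
    theta, r = theta0, 1
    while True:
        # Re-estimate an r-dimensional active subspace from r gradients
        # sampled near the current iterate (the 0.1 offset is arbitrary).
        G = np.column_stack(
            [grad_f(theta + 0.1 * np.eye(p)[:, j]) for j in range(r)])
        U, _, _ = np.linalg.svd(G, full_matrices=False)
        U = U[:, :r]
        for _ in range(max_inner):
            g_hat = U @ (U.T @ grad_f(theta))
            if np.linalg.norm(g_hat) < eps:  # converged on this subspace
                break
            theta = theta - alpha * g_hat
        if np.linalg.norm(grad_f(theta)) < eps or r == p:
            return theta
        r = min(2 * r, p)  # geometric rank progression

# Toy run: full-rank strongly convex quadratic, minimizer at the origin.
H = np.diag([3.0, 1.0, 0.5])
grad_f = lambda th: H @ th
theta_hat = adaptive_lrgd(grad_f, np.array([1.0, -2.0, 0.5]),
                          p=3, alpha=0.25, eps=1e-6)
```

On this full-rank example the rank grows 1 → 2 → 3, at which point the subspace spans the whole space and the descent terminates at an ε-stationary point.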

IV. NUMERICAL SIMULATIONS
In this section, we demonstrate the utility of the LRGD method (specifically, the iterated variant described in Algorithm 2) through numerical simulations. We start with experiments for linear and logistic regression on real and synthetic datasets to demonstrate the effectiveness of the LRGD method compared to vanilla GD. Then, we consider a simple quadratic case to provide further intuition on the algorithm and provide illustrations.

A. EMPIRICAL RISK MINIMIZATION
In the following experiments, the goal is to minimize the empirical risk f(θ) = Σ_{i∈[n]} ℓ(x_i, y_i; θ) for a loss function ℓ over n data samples {(x_i, y_i) : i = 1, ..., n}. Here, X and Y denote the matrices of features and labels, respectively.
In the least squares regression problem, the loss function is ℓ(x_i, y_i; θ) = (x_i^⊤θ − y_i)² with real-valued features and labels. In the logistic regression problem, we pick the loss function as ℓ(x_i, y_i; θ) = 1(y_i = 1) log(1/(1 + e^{−x_i^⊤θ})) + 1(y_i = 0) log(e^{−x_i^⊤θ}/(1 + e^{−x_i^⊤θ})) with real-valued features and binary {0, 1}-valued labels. In both cases, a convex regularization term may be added to the empirical risk without additional complexity; in general, the corresponding optimization problems do not admit closed-form solutions. We consider these regression tasks for three different datasets.
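As a reference point, a minimal Python sketch of these objectives is below. One assumption on our part: the paper's displayed logistic loss is a log-likelihood, so the sketch implements the equivalent negative log-likelihood that is minimized in practice; the helper names are ours.

```python
import numpy as np

def squared_loss(x, y, theta):
    # least squares loss: (x^T theta - y)^2
    return (x @ theta - y) ** 2

def logistic_loss(x, y, theta):
    # negative log-likelihood of y in {0, 1} under p(y=1|x) = 1/(1+e^{-x^T theta}),
    # computed stably with logaddexp
    z = x @ theta
    return np.logaddexp(0.0, -z) if y == 1 else np.logaddexp(0.0, z)

def empirical_risk(X, Y, theta, loss):
    # f(theta) = sum over the n samples of the per-sample loss
    return sum(loss(x, y, theta) for x, y in zip(X, Y))
```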
Digits recognition dataset: The UCI handwritten digits dataset [47] contains n = 2000 images of p = 8 × 8 pixels of handwritten digits along with their labels. We consider two different binary classification tasks on this dataset. The first task is binary classification of the digits 0 and 1. In the second task, the goal is to detect loops in the digits, that is, to classify digits into either of the two groups {0, 6, 8, 9} or {1, 2, 3, 4, 5, 7}.
Fig. 1(a) and (b) demonstrate the decay of the training loss versus the number of (directional) oracle calls for LRGD with several values of the rank. Note that in these plots, the curve with rank r = p = 64 is identical to vanilla GD. For instance, Fig. 1(a) demonstrates that the same level of accuracy is attained by the LRGD method with 10 times fewer oracle calls than the GD method. Furthermore, the better performance of LRGD at intermediate ranks further reveals inherent low-rank structure in the dataset.

Synthetic data with full rank: In this dataset, the features are generated independently from a centered normal distribution with identity covariance, i.e., x_i ∼ N(0, I_p) for i ∈ [n]. The response variable y_i is a noisy linear measurement of a ground-truth model θ*, i.e., y_i = ⟨x_i, θ*⟩ + n_i with i.i.d. noise n_i ∼ N(0, 0.1). By symmetry, the particular choice of θ* is inconsequential, and we set it to the all-ones vector, i.e., θ* = (1, ..., 1). For the logistic regression experiments, the quantity y_i is turned into a binary value by randomly sampling from a Bernoulli distribution with parameter 1/(1 + e^{−y_i}).
Synthetic data with low rank: This dataset differs from the previous one solely in the way the features x_i, i ∈ [n], are sampled. To guarantee an approximate low-rank structure, we first sample matrices U ∈ R^{n×r} and V ∈ R^{p×r}, and set the feature matrix as X = UV^⊤. The entries of U and V are i.i.d. standard normal, i.e., U_{i,j} ∼ N(0, 1) and V_{i,j} ∼ N(0, 1). Fig. 2 demonstrates the training loss for the full-rank and low-rank datasets described above. Interestingly, LRGD's performance is superior to GD's even in the full-rank setting. Moreover, the LRGD method is particularly effective on the low-rank dataset. The convergence curves of LRGD have several successive drops in the function value, much like a "waterfall"; this is particularly noticeable when the rank parameter of the algorithm is small. We will further elaborate on this observation in the next series of experiments. Moreover, Table 2 quantifies the total oracle complexity of iterated LRGD for different ranks r on the linear and logistic regression problems, where the case r = p coincides with GD. For the same setting and problems described above, Table 3 demonstrates the total oracle complexity for different values of the stepsize α.
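A minimal sketch of both synthetic data generators (the function names and seeds are our own choices; the paper does not prescribe an implementation) could look as follows:

```python
import numpy as np

def make_data(n, p, r=None, noise=0.1, seed=0):
    """Synthetic regression data: full-rank Gaussian features when r is None,
    otherwise features X = U V^T constrained to an r-dimensional subspace."""
    rng = np.random.default_rng(seed)
    if r is None:
        X = rng.standard_normal((n, p))              # full rank almost surely
    else:
        U = rng.standard_normal((n, r))
        V = rng.standard_normal((p, r))
        X = U @ V.T                                  # rank at most r
    theta_star = np.ones(p)                          # ground-truth model: all ones
    y = X @ theta_star + noise * rng.standard_normal(n)
    return X, y

def binarize(y, seed=0):
    # logistic variant: y_i -> Bernoulli(1 / (1 + e^{-y_i}))
    rng = np.random.default_rng(seed)
    return (rng.random(y.size) < 1.0 / (1.0 + np.exp(-y))).astype(int)
```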

V. PROOFS OF ORACLE COMPLEXITY
In this section, we provide a rigorous analysis of the oracle complexity of LRGD in different settings as described in Section III.

A. STRONGLY CONVEX SETTING
We first prove Theorem 3.1 and Proposition 3.1.

1) PROOF OF THEOREM 3.1
The proof consists of two main parts.First, we characterize the gradient inexactness induced by the gradient approximation step of LRGD which will be used in the second part to derive convergence rates for inexact gradient descent type methods.
Lemma 5.1 (Gradient approximation): Assume that f is (η, ε)-approximately rank-r according to Definition 2.2 and that condition (4) holds, where σ_r is the smallest singular value of the matrix G = [g_1/‖g_1‖, ..., g_r/‖g_r‖] with g_j := ∇f(θ_j) and ‖g_j‖ > ε for all j ∈ [r]. Then, the gradient approximation error of LRGD is bounded for every θ ∈ R^p as ‖∇̂f(θ) − ∇f(θ)‖ ≤ (η + 2r(η + ε̃)/σ_r)‖∇f(θ)‖ + ε̃. Since f is (η, ε)-approximately rank-r according to Definition 2.2, we can bound the operator norm of G − G_H as in (5), where in the last inequality we used the assumption that ‖g_j‖ > ε for all j ∈ [r]. The condition stated in the lemma ensures that G is full-rank (since σ_r > 0). Next, we show that projecting the columns of G onto H does not degenerate its rank, that is, G_H is full-rank as well. To see this, we employ Weyl's inequality [48, Corollary 7.3.5(a)] to write σ_r(G_H) ≥ σ_r(G) − σ_1(G − G_H) > 0, where σ_r(G_H) and σ_1(G − G_H) respectively denote the smallest and largest singular values of G_H and G − G_H; here, (a) follows from the bound in (5), and (b) is an immediate consequence of condition (4). Since H is a subspace of rank r and the columns of G_H all lie in H, we have H = span(U_H). Next, we employ Wedin's sin Θ theorem [49] to bound the angle between span(U) and span(U_H). This yields that for all θ ∈ R^p we can uniformly bound the LRGD gradient approximation error as stated above. Next, we characterize the convergence of inexact gradient descent and use LRGD's gradient approximation error bound derived in Lemma 5.1 to complete the proof. Let us return to the setting of Theorem 3.1. From smoothness of f in Assumption 2.1 and the update rule of LRGD, i.e.,
θ_{t+1} = θ_t − α∇̂f(θ_t), we can write the descent bound (6), where e_t := ∇̂f(θ_t) − ∇f(θ_t) denotes the gradient approximation error at iteration t. Next, we employ the gradient approximation error bound derived in Lemma 5.1, that is, ‖e_t‖ ≤ η̄‖∇f(θ_t)‖ + ε̃, where we denote ε̃ := √(με)/5 and η̄ := η + 2r(η + ε̃)/σ_r for notational simplicity. Together with simple algebra from (6), we obtain (7). Next, we use the gradient dominance property of strongly convex losses, ‖∇f(θ)‖² ≥ 2μ(f(θ) − f*) for all θ, which together with (7) yields a contraction with factor 1 − 2μᾱ, for the shorthand ᾱ := α(1 − 2αL − 2η̄²(1 + 2αL)). Now we pick the stepsize α = 1/(8L) and use the condition η̄ ≤ 1/√10 to conclude (8). The contraction bound in (8) implies that after T iterations of LRGD the final suboptimality is bounded by (1 − 1/(16κ))^T Δ_0 plus an additive term of at most ε/2. Finally, in order to reach an ε-minimizer of f, it suffices to satisfy (1 − 1/(16κ))^T Δ_0 ≤ ε/2, i.e., LRGD runs for T = 16κ log(2Δ_0/ε) iterations. Thus, the total oracle complexity of LRGD is rT + pr = 16rκ log(2Δ_0/ε) + pr, as claimed in Theorem 3.1.
2) PROOF OF PROPOSITION 3.1 The proof is essentially based on the convergence of GD for strongly convex losses. More precisely, when the objective function is exactly low-rank (Definition 2.3), the iterates of LRGD are identical to those of GD. However, LRGD requires only r oracle calls to directional derivatives of f at each iteration. Therefore, we employ the well-known convergence of GD for strongly convex objectives to find the total number of iterations required to reach an ε-minimizer of the objective. However, as stated in the proposition's assumptions, f itself cannot be strongly convex (as elaborated in Section II). Rather, we assume that f restricted to H = span(U) is μ-strongly convex. Let us make this precise by introducing some notation. We denote by F : R^r → R the restriction of f to H, that is, F(ω) := f(Uω). Next, we relate the iterates of LRGD on f to those of GD on F. Assume that LRGD is initialized with θ_0 = 0 (the proof easily generalizes to arbitrary initialization). For t = 0, 1, 2, ... and stepsize α, the iterates of LRGD are θ_{t+1} = θ_t − α∇f(θ_t), which are identical to the iterates of GD on f due to the exact low-rankness of f. Note that since f is not strongly convex, we may not directly invoke linear convergence results for f. Nevertheless, we show in the following that such linear convergence still holds via the restricted function F defined above. Define the GD iterates {ω_t} on F by ω_{t+1} = ω_t − α∇F(ω_t). Next, we show that the iterates {θ_t} and {ω_t} are related by Uω_t = θ_t for all t. To verify this, note that Uω_t = UU^⊤θ_t; on the other hand, one can verify that all the iterates {θ_t} lie in H.
Thus, the projection of θ_t onto H equals θ_t itself, i.e., Uω_t = UU^⊤θ_t = θ_t. Next, we show that the minimum values of the two objectives f and F are identical. Intuitively, F is the restriction of f to H, and the value of f does not change along the directions of H^⊥; therefore, f and F attain the same minimum. To show this concretely, first note that F* := min_{ω∈R^r} F(ω) ≥ f* := min_{θ∈R^p} f(θ). Let θ* be a minimizer of f, i.e., f(θ*) = f*. Next, we have F(U^⊤θ*) = f(UU^⊤θ*) = f(θ* − U_⊥U_⊥^⊤θ*) = f(θ*) = f*, which yields F* = f*. Here, U_⊥U_⊥^⊤ denotes the projection onto H^⊥, and we used the fact that f is invariant along H^⊥. Now, we are ready to derive the convergence of {θ_t} on f. Since F is μ-strongly convex and L-smooth, the contraction of Lemma 5.2 below holds for the GD iterates {ω_t} on F with stepsize α = 1/L. Therefore, LRGD takes T = κ log(Δ_0/ε) iterations to reach an ε-minimizer of f, where Δ_0 = f(θ_0) − f*. Each iteration of LRGD requires r calls to directional derivatives of f. Together with the oracle complexity of the initial r full gradient computations (hence pr directional derivatives), the total oracle complexity sums up to rκ log(Δ_0/ε) + pr. In the above, we used the well-known convergence result for strongly convex functions, which we reproduce here for the reader's convenience [3]. Lemma 5.2 (Convergence in strongly convex case [3]): Let F be L-smooth and μ-strongly convex according to Assumptions 2.1 and 2.2, with condition number κ = L/μ. Then, for stepsize α = 1/L, the iterates of GD converge to the unique global minimizer as F(ω_T) − F* ≤ (1 − 1/κ)^T (F(ω_0) − F*).
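The identities Uω_t = θ_t and F* = f* above can be checked numerically. Below is a small self-contained sketch; the construction (an exactly rank-r quadratic that is strongly convex only on H) is our own illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
p, r = 6, 2
U, _ = np.linalg.qr(rng.standard_normal((p, r)))       # orthonormal basis of H
M = rng.standard_normal((r, r))
S = M @ M.T + np.eye(r)                                # positive definite on H
b = rng.standard_normal(r)

# f(theta) = 0.5 <U^T theta, S U^T theta> - <b, U^T theta> is exactly rank-r:
# its gradient always lies in H, and f is strongly convex only restricted to H.
grad_f = lambda th: U @ (S @ (U.T @ th) - b)
# Restriction F(omega) = f(U omega) has gradient S omega - b.
grad_F = lambda om: S @ om - b

alpha = 1.0 / np.linalg.eigvalsh(S).max()              # stepsize 1/L
theta, omega = np.zeros(p), np.zeros(r)
for _ in range(300):
    theta = theta - alpha * grad_f(theta)              # LRGD (= GD) iterates on f
    omega = omega - alpha * grad_F(omega)              # GD iterates on F
```

After the loop, the lifted iterate U ω coincides with θ, and ω has converged to the minimizer of the restricted problem.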

B. NON-CONVEX SETTING
We next prove Theorem 3.2 and Proposition 3.2.

1) PROOF OF THEOREM 3.2
We first invoke Lemma 5.1, which bounds the gradient approximation error of LRGD. Note that this lemma does not require any convexity assumptions; hence we use it in the proofs of both the strongly convex (Theorem 3.1) and non-convex (Theorem 3.2) settings.
First, note that from Lemma 5.1 and the condition stated in Theorem 3.2, the gradient approximation error (inexactness) is upper-bounded as ‖e_t‖ ≤ η̄‖∇f(θ_t)‖ + ε̃ for η̄ := η + 2r(η + ε̃)/σ_r and ε̃ := ε/3. Secondly, we characterize the convergence of inexact GD (as which the LRGD updates can be viewed) to derive LRGD's oracle complexity to reach an ε-stationary point.
Let us borrow the contraction bound (7) from the proof of Theorem 3.1. Note that we used only the smoothness of f (and not strong convexity) to obtain this bound; hence, we can safely employ it here for smooth, non-convex objectives. That is, for stepsize α, the iterates of LRGD satisfy the contraction (7). Picking the stepsize α = 1/(8L) and using the condition η̄ ≤ 1/√10, the inequality simplifies accordingly. Averaging the resulting progression over t = 0, 1, ..., T − 1 yields a bound on min_{0≤t≤T−1} ‖∇f(θ_t)‖². Therefore, for T = 72LΔ_0/ε², we conclude that there exists an iteration t ∈ {0, ..., T − 1} for which ‖∇f(θ_t)‖ ≤ ε, which yields that the total oracle complexity of LRGD is rT + pr = 72rLΔ_0/ε² + pr.

2) PROOF OF PROPOSITION 3.2 We first note that when f is exactly rank-r, the iterates of LRGD are identical to those of (exact) GD. Therefore, the number of iterations required by LRGD to reach an ε-stationary point is determined by the convergence rate of GD for non-convex losses. Though this is well known in the optimization community, it is rarely stated explicitly due to its simplicity, so we provide such a convergence result below for the reader's convenience; the proof is deferred to Appendix E. Lemma 5.3 (Convergence in non-convex case): Let f be L-smooth according to Assumption 2.1. Moreover, assume that f is exactly rank-r according to Definition 2.3. Then, for stepsize α = 1/L and after T iterations of LRGD in Algorithm 1, there exists an iterate 0 ≤ t ≤ T − 1 for which ‖∇f(θ_t)‖² ≤ 2LΔ_0/T. Returning to Proposition 3.2 and using Lemma 5.3, it is straightforward to see that LRGD requires T = 2LΔ_0/ε² iterations to reach an ε-stationary point of f, which yields a total oracle complexity of rT + pr = 2rLΔ_0/ε² + pr, where pr is the (directional) oracle cost associated with the r gradient vectors g_1, ..., g_r required to construct the matrix G in LRGD (Algorithm 1).
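The Lemma 5.3-type guarantee is easy to verify numerically for plain GD (which coincides with LRGD in the exactly low-rank case). The test function and constants below are our own illustrative choices:

```python
import numpy as np

# f(x) = sum_i cos(x_i) is 1-smooth (Hessian entries bounded by 1), f* = -p
L = 1.0
f = lambda x: np.sum(np.cos(x))
grad = lambda x: -np.sin(x)

rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=5)
delta0 = f(x) - (-5.0)                       # suboptimality of the initialization
T = 100
min_sq = np.inf
for _ in range(T):
    g = grad(x)
    min_sq = min(min_sq, np.dot(g, g))       # track min_t ||grad f(x_t)||^2
    x = x - (1.0 / L) * g                    # GD with stepsize 1/L
```

By the descent lemma, the squared gradient norms telescope against the total function decrease, so the smallest one is at most 2LΔ_0/T, matching the non-convex rate.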

VI. PROOFS OF PROPERTIES OF LOW-RANK FUNCTIONS
In this section, we provide proofs for the results in Section II.

A. PROOF OF PROPOSITION 2.1
We write Taylor's expansion around θ*, namely ∇f(θ* + tu) = t∇²f(θ*)u + o(t), where ∇²f(θ*) is invertible and the little-o notation applies entry-wise. For any given subspace H, we can consider u = ∇²f(θ*)^{-1}v, where v ∈ H^⊥, and note that ∇f(θ* + tu) = tv + o(t). As a consequence, as t → 0, the gradient at θ* + tu becomes asymptotically orthogonal to H. We remark that by examining the rate of convergence in the equation above (which requires third derivatives), we could obtain a sharper lower bound for ε.

B. PROOF OF PROPOSITION 2.2
The result is a consequence of the following lemma, whose proof is relegated to Appendix E. Lemma 6.1 (Gradients on level sets): For any 0 < λ < με²/2 and any unit-norm vector u, there exists θ with ‖θ − θ*‖ < ε such that ∇f(θ)/‖∇f(θ)‖ = u. Using Lemma 6.1, we proceed to the proof of Proposition 2.2. Let H be a vector space of dimension r < p such that the low-rank approximation ∇̂f is defined as the projection onto this vector space, ∇̂f(θ) = Π_H(∇f(θ)). Take any direction u ∈ H^⊥ and apply Lemma 6.1 with λ = με²/2, which gives θ_u satisfying ∇f(θ_u)/‖∇f(θ_u)‖ = u. By orthogonality, Π_H(∇f(θ_u)) = 0, and therefore ‖∇̂f(θ_u) − ∇f(θ_u)‖ = ‖∇f(θ_u)‖. Recall from [13] that for all θ, 2μ(f(θ) − f*) ≤ ‖∇f(θ)‖². In particular, we have 2μλ ≤ ‖∇f(θ_u)‖² and με ≤ ‖∇f(θ_u)‖. 2) ⇒ 3): Take e_1, ..., e_r an orthonormal basis of H, take A = [e_1, ..., e_r]^⊤ ∈ R^{r×p}, and for all x ∈ R^r set σ(x) = f(A^⊤x). Then A is a projection onto H, and σ is the restriction of f to H.

VII. CONCLUSION
In this work, we focused on optimization problems with low-rank objectives, whose gradients live (approximately) in a lower-dimensional subspace. As elaborated, many machine learning and deep learning applications demonstrate such low-rank structure along their training trajectories. We proposed the LRGD algorithm, which leverages the low-rankness of the objective function to reduce the total directional oracle complexity of gradient-based methods. We analyzed this method for two classes of loss functions, strongly convex and non-convex objectives. Our analysis showed that for sufficiently low-rank functions, LRGD is able to mitigate the cost of benchmarks such as GD, which grows linearly with the dimension of the model parameters.
We believe our LRGD approach could be applicable to stochastic or momentum-based methods, such as SGD, if the directional derivative oracle provides an ε/√p-approximation of the exact value, since the approximation of an (η, ε)-approximately low-rank function is then (η, 2ε)-approximately low-rank. On another note, our studied notion of function rank is closely related to the function's second-order characteristics, such as the Hessian. Incorporating the underlying low-rankness and its connection to the Hessian matrix could yield more efficient second-order and quasi-Newton optimization methods.

APPENDIX A ADAPTIVE LOW-RANK GRADIENT DESCENT ALGORITHM
In this appendix, we include the pseudocode for an adaptive variant of the low-rank gradient descent algorithm, which uses LRGD as a basic building block.

APPENDIX B CALCULUS OF LOW-RANK FUNCTIONS
In this appendix, we describe the effect of standard algebraic operations, e.g., scalar multiplication, addition, pointwise product, and composition, on exactly low-rank functions.We then close the section with a simple example of how one might monotonically transform functions to reduce their rank.

A. ALGEBRA OF LOW-RANK FUNCTIONS
Recall that a differentiable function f : Θ → R for Θ ⊆ R^p is said to have rank r ≤ p if there exists a subspace H ⊆ R^p of dimension r such that ∇f(θ) ∈ H for all θ ∈ Θ (and no subspace of smaller dimension satisfies this condition). The ensuing proposition presents how the rank is affected by various algebraic operations.
using the product rule. As before, the dimension of the subspace containing the gradients bounds the rank. In the case of the exponential function, g′(f_1(θ)) = exp(f_1(θ)) > 0, which means that the rank will not decrease. This completes the proof. We remark that the first three properties demonstrate how rank behaves under the usual algebra of functions. Moreover, the above operations can be applied to simple ridge functions to construct more complex exactly low-rank functions.

B. EXAMPLE: TRANSFORMATION TO A LOW-RANK FUNCTION
Consider the function f : (0, ∞)^p → R defined by f(θ) = Π_{i∈[p]} θ_i, where θ = (θ_1, ..., θ_p). Objective functions of this form are often optimized in isoperimetric problems, e.g., when volume is extremized with constrained surface area. Observe that ∂_j f(θ) = Π_{i≠j} θ_i, where ∂_j denotes the partial derivative with respect to the jth coordinate for j ∈ [p]. Consequently, for any i ∈ [p], ∇f(1 − e_i) = e_i, where 1 = (1, ..., 1) ∈ R^p and e_i ∈ R^p is the ith standard basis vector. Therefore, the function f has rank p by Definition 2.3. Now, using the change of variables φ_i = log(θ_i) for every i ∈ [p], define the transformed function g : R^p → R as g(φ) = exp(Σ_{i∈[p]} φ_i), where φ = (φ_1, ..., φ_p). The function g has rank 1 by Proposition 2.3.
Since log(·) is a monotonic transformation, minimizing (or maximizing) f(θ) is equivalent to minimizing (or maximizing) g(φ). However, g has rank 1 while f has rank p. Hence, this example demonstrates that it is sometimes possible to transform optimization problems where the objective function has high rank into equivalent problems where the objective function has low rank. We note that this example is inspired by the classical transformation of a geometric program into a convex program [13, Sec. 4.5].
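A few lines of Python confirm the rank computations in this example (the symbols match the text; the finite example size p = 4 is our choice):

```python
import numpy as np

p = 4
f = lambda th: np.prod(th)                   # f(theta) = prod_i theta_i on (0, inf)^p
grad_f = lambda th: np.array([np.prod(np.delete(th, j)) for j in range(th.size)])
g = lambda phi: np.exp(np.sum(phi))          # g(phi) = f(e^phi) = exp(sum_i phi_i)
grad_g = lambda phi: np.exp(np.sum(phi)) * np.ones(phi.size)  # always in span(1)

# gradients of f at 1 - e_i recover every standard basis vector, hence rank p
for i in range(p):
    theta = np.ones(p)
    theta[i] = 0.0
    assert np.allclose(grad_f(theta), np.eye(p)[i])
```

The gradient of g is always a positive multiple of the all-ones vector, so g is exactly rank-1, while the gradients of f span all of R^p.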

APPENDIX C DISCUSSION ON LOW-RANK HESSIANS
As stated in the introduction, it has been noted in the literature that many functions of interest, such as losses of neural networks, appear to have approximately low-rank Hessians, cf. [4], [5], [6], [9] and the references therein. In this appendix, we argue why this observation makes the approximate low-rank assumption realistic at a local scale. In particular, we aim to show the following result. Proposition C.1: Given a function f : Θ → R with bounded third derivatives and some θ̂ ∈ Θ, denote by σ_r the r-th singular value of ∇²f(θ̂). There exists a linear subspace H ⊆ R^p of dimension r that satisfies ‖∇f(θ) − Π_H(∇f(θ))‖ ≤ M‖θ − θ̂‖² + σ_r‖θ − θ̂‖ for some constant M ≥ 0 that depends only on the bound on the third derivatives of f. Proof: By Taylor's expansion for a function with bounded third derivatives, there exists M ≥ 0 such that for every θ ∈ Θ, ‖∇f(θ) − ∇f(θ̂) − ∇²f(θ̂)(θ − θ̂)‖ ≤ M‖θ − θ̂‖². If we denote by σ_1, ..., σ_p (u_1, ..., u_p and v_1, ..., v_p, respectively) the singular values (left and right singular vectors, respectively) obtained from the decomposition of ∇²f(θ̂), we take H = span(∇f(θ̂), u_1, ..., u_{r−1}). Indeed, first note that ∇f(θ̂) ∈ H, and the component of ∇²f(θ̂)(θ − θ̂) orthogonal to H has norm at most σ_r‖θ − θ̂‖. As a consequence, we can combine inequalities (9) and (10) by the triangle inequality into the claimed bound, using the non-expansiveness of the projection operator. The expression obtained above is very similar to the definition of approximate low-rankness. In fact, for a given η > 0, the η-approximately low-rank assumption can hold as long as M‖θ − θ̂‖² + σ_r‖θ − θ̂‖ ≤ η‖∇f(θ)‖. Typically, this happens when the following conditions are met: (1) ∇²f(θ̂) is approximately low-rank, i.e., σ_r is small; (2) ‖∇f(θ)‖ is much greater than ‖θ − θ̂‖; (3) the third-derivative constant M is very small.
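A hypothetical numerical sketch of the construction in Proposition C.1: we build the subspace H from the anchor gradient and the leading Hessian eigendirections, and check that nearby gradients of a quadratic with a rank-(r−1) Hessian (so that M = 0 and σ_r = 0) project onto H with negligible residual. The function names are ours:

```python
import numpy as np

def hessian_subspace(grad_hat, hess_hat, r):
    """Orthonormal basis for H = span(gradient at the anchor, top r-1
    eigendirections of the Hessian there), as in Proposition C.1."""
    w, V = np.linalg.eigh(hess_hat)
    top = V[:, np.argsort(-np.abs(w))[: r - 1]]    # leading eigendirections
    Q, _ = np.linalg.qr(np.column_stack([grad_hat, top]))
    return Q

def projection_residual(g, Q):
    # relative error of projecting a gradient onto span(Q)
    return np.linalg.norm(g - Q @ (Q.T @ g)) / np.linalg.norm(g)
```

For a quadratic f(θ) = ½θ^⊤Hθ + c^⊤θ with rank-(r−1) Hessian H, every gradient H θ + c lies exactly in span(∇f(θ̂)) + range(H), which the basis above captures.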
In practice, the approximately low-rank assumption must hold for multiple iterates of LRGD of the form θ' = θ − α∇̂f(θ), where α is the stepsize. It is thus natural to expect ‖θ − θ̂‖ ∼ αN‖∇f(θ)‖, where N is the number of iterations for which the approximate low-rank condition holds. This yields the condition αMN‖∇f(θ)‖ + σ_r ≤ η; note that this condition may be satisfied for very small stepsizes α.

test dataset is typically over 98%.) Let θ* ∈ R^p be the optimal network weights obtained from this training. For a fixed standard deviation parameter τ = 0.2, we construct n = 100 independent random perturbations θ_1, ..., θ_n ∈ R^p of the optimal network weights θ* as θ_i = θ* + τZ_i, where Z_1, ..., Z_n are independent and identically distributed Gaussian random vectors with zero mean and identity covariance. Note that τ is chosen to be of the same order of magnitude as the "non-zero" (i.e., larger) entries of θ*; τ provides "soft" control over the size of the local neighborhood considered around the optimum θ*. Then, for any randomly chosen batch of 200 test data samples, we construct the matrix of gradients G = [g_1, ..., g_n] ∈ R^{p×n}, whose ith column is given by g_i = ∇f(θ_i), where ∇f : R^p → R^p denotes the gradient of the empirical cross-entropy risk defined by the 200 test samples with respect to the network weights (i.e., ∇f is the sum of the gradients corresponding to the 200 test samples). The gradients in G are computed using standard automatic differentiation procedures in Keras. We remark that small perturbations of τ, increasing n, increasing the batch size of 200 (which defines the empirical cross-entropy risk), and changing the hyper-parameters of training (e.g., number of neurons in the hidden layer, choice of optimization algorithm, activation functions, etc.) do not seem to affect the qualitative nature of the plot in Fig.
3. We refer readers to [4], [5], [6], [9] for more thorough empirical studies of such low-rank structure. (As discussed in Appendix C, the low-rank Hessian structure analyzed in these works yields low-rank gradient structure.)

B. PROOF OF LEMMA 6.1
Consider 0 < λ < με²/2, and note that the set Θ_λ = {θ ∈ Θ : f(θ) ≤ f(θ*) + λ} is a closed, convex subset of R^p and is strictly included in B(θ*, ε), because by strong convexity f(θ) ≥ f(θ*) + (μ/2)‖θ − θ*‖². For a given unit-norm vector u ∈ R^p, consider the function g(θ) = ⟨θ, u⟩ ∈ R. This function can be maximized on Θ_λ, and the maximum is attained at some θ_u ∈ Θ_λ. Note that θ_u belongs to the boundary of Θ_λ, because it results from the optimization of a linear function over a convex set. As a consequence, define v = ∇f(θ_u)/‖∇f(θ_u)‖, assume by contradiction that v ≠ u, and set w = u − v. By Taylor's expansion, f(θ_u + tw) = f(θ_u) + t⟨∇f(θ_u), w⟩ + o(t) and g(θ_u + tw) = g(θ_u) + t⟨u, w⟩.
Note that by definition ⟨∇f(θ_u), w⟩ < 0 while ⟨u, w⟩ > 0. As a consequence, the maximality of θ_u is contradicted. Indeed, there exists t > 0 and h = tw such that θ_u + h ∈ Θ_λ and g(θ_u + h) > g(θ_u).

FIGURE 1. Training error for digits recognition. The large values taken by f are explained by the absence of a normalizing term in the loss function and by the application of the stopping criterion ‖∇f(θ)‖ ≤ √(2ε).

FIGURE 2. Training error for the synthetic datasets with full-rank and low-rank features.

FIGURE 3. Plot of squared singular values of the matrix of gradients G, normalized by ‖G‖²_F. Note that the ith largest normalized squared singular value (σ_i(G)/‖G‖_F)² is displayed at order index i ∈ [100].

Finally, let σ_1(G) ≥ σ_2(G) ≥ ··· ≥ σ_n(G) ≥ 0 denote the ordered singular values of the matrix of gradients G, where ‖G‖²_F = Σ_{i=1}^n σ_i(G)². Fig. 3 plots the normalized squared singular values (σ_1(G)/‖G‖_F)² ≥ (σ_2(G)/‖G‖_F)² ≥ ··· ≥ (σ_n(G)/‖G‖_F)² ≥ 0 against the order indices [n]. It shows that G is approximately low-rank with approximate rank around 10, since the majority of the "energy" in G is captured by the 10 largest singular values. So, despite there being 100 different gradients in G in a neighborhood of θ*, they are all approximately captured by the span of 10 vectors in R^p. Therefore, this simulation illustrates that the empirical cross-entropy risk (objective function) corresponding to the MNIST neural network training task is approximately low-rank in a neighborhood of θ*.
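The "fraction of energy" criterion used here can be phrased as a small helper (our own function; the 0.9 default threshold is illustrative):

```python
import numpy as np

def approx_rank(G, energy=0.9):
    """Smallest k such that the top-k singular values capture at least the
    given fraction of ||G||_F^2 = sum_i sigma_i(G)^2."""
    s2 = np.linalg.svd(G, compute_uv=False) ** 2
    cum = np.cumsum(s2) / s2.sum()
    return int(np.searchsorted(cum, energy) + 1)
```

Applied to a matrix of sampled gradients stacked as columns, this returns the approximate rank in the sense of the plot: an exactly low-rank matrix yields its true rank, while an isotropic matrix yields the full dimension.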