Adaptive Localized Cayley Parametrization for Optimization Over Stiefel Manifold and Its Convergence Rate Analysis

The Adaptive Localized Cayley Parametrization (ALCP) strategy for orthogonality constrained optimization has been proposed as a scheme to utilize Euclidean optimization algorithms on an adaptive parametrization of the Stiefel manifold via a tunable generalized left-localized Cayley parametrization. Thanks to the adaptive parametrization, the ALCP strategy enjoys stable convergence of the estimate sequence of a minimizer by avoiding a certain singular-point issue. In this paper, we present a convergence rate analysis for the ALCP strategy using Euclidean optimization algorithms such as Nesterov-type accelerated gradient methods. Numerical experiments demonstrate the effectiveness of the ALCP strategy.

The orthogonality constraint set defined by St(p, N) := {U ∈ R^{N×p} | U^T U = I_p} (p ≤ N) is called the Stiefel manifold. In this paper, we consider the following optimization problem over St(p, N):

Problem 1.1: For (N, p) ∈ N^2 satisfying p ≤ N and a given differentiable function f : R^{N×p} → R, find U⋆ ∈ argmin_{U ∈ St(p,N)} f(U).

Because Problem 1.1 is a nonconvex optimization problem in general, a realistic goal for Problem 1.1 is to find a stationary point U⋆ ∈ St(p, N), which satisfies a necessary condition for a local minimizer of f over St(p, N) [25], [26], [27], namely grad f(U⋆) := P_{T_{U⋆}St(p,N)}(∇f(U⋆)) = 0 ∈ T_{U⋆}St(p,N), with the orthogonal projection P_{T_U St(p,N)}(X) := (I − UU^T)X − U Skew(X^T U) onto the tangent space T_U St(p,N) := {UΩ + U_⊥K ∈ R^{N×p} | Ω ∈ Q_{p,p}, K ∈ R^{(N−p)×p}} (see, e.g., [25, Example 3.6.2]), where U_⊥ ∈ St(N−p, N) satisfies U^T U_⊥ = 0. Problem 1.1 is challenging mainly because of the severe nonlinearity of St(p, N).
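As a concrete illustration of this stationarity measure, the following minimal sketch computes the projection P_{T_U St(p,N)} and the resulting Riemannian gradient from a Euclidean gradient; it assumes only NumPy, and the function names are illustrative.

import numpy as np

def skew(X):
    """Skew(X) := (X - X^T)/2, the skew-symmetric component of a square matrix."""
    return (X - X.T) / 2

def proj_tangent(U, X):
    """Orthogonal projection of X onto the tangent space T_U St(p, N):
    P(X) = (I - U U^T) X - U Skew(X^T U)."""
    return X - U @ (U.T @ X) - U @ skew(X.T @ U)

rng = np.random.default_rng(0)
N, p = 8, 3
U, _ = np.linalg.qr(rng.standard_normal((N, p)))   # a point U on St(p, N)
G = rng.standard_normal((N, p))                    # stand-in for nabla f(U)
D = proj_tangent(U, G)                             # candidate grad f(U)
print(np.allclose(U.T @ D + D.T @ U, 0))           # True: D is a tangent vector

A stationary point of Problem 1.1 is exactly a point where proj_tangent(U, ∇f(U)) vanishes.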
However, in a case where U⋆ is close to the singular-point set E_{N,p}(S), the NCP (naive Cayley parametrization) strategy may suffer from slow convergence of the update sequence (V_n)_{n=0}^∞ [28], [33]. We call this performance degradation a singular-point issue.
To suppress the singular-point issue in the NCP strategy, we recently proposed an Adaptive Localized Cayley Parametrization (ALCP) strategy [42], which adaptively adjusts the location of singular points, for Problem 1.3 (stated for a given differentiable function f : R^{N×p} → R and ϵ > 0). In the ALCP strategy, until the singular-point issue is detected by a computable restarting condition, we use the same S for updating V_n ∈ Q_{N,p}(S) by Euclidean optimization algorithms for Problem 1.2 with S, in analogy with the NCP strategy (see also the later part [''Related works and primitive goal of this paper''] in this section). After the detection of the singular-point issue at the tentative estimate, the center point is updated so that the singular points move away from the current estimate.^1 ^2 This restarting technique in the ALCP strategy computationally realizes an adaptively parametrized space and improves the numerical performance of the NCP strategy without suffering from impacts of the singular-point issue.
^1 For any U ∈ St(p, N), there exists a convergent sequence (U_n)_{n=0}^∞ ⊂ St(p, N) \ E_{N,p}(S) to U.
^2 For every ϵ > 0, the existence of a minimizer of Problem 1.3 is guaranteed [28], [33], while the existence of a minimizer of f ∘ Φ_S^{-1} is not guaranteed.
Convergence rates, in terms of gradient sequence, of Euclidean optimization algorithms have been of great interest in order to estimate the decay speed of the gradient to zero. Indeed, many Euclidean optimization algorithms, e.g., the Nesterov-type accelerated gradient methods [39], [40], [41] and the adaptive momentum optimization algorithms [43], [44], [45], have been carefully designed with guaranteed convergence rates. We call a class of such Euclidean optimization algorithms Nonconvex Optimization over the Euclidean space with convergence Rates (NOE-r) (see Definition 2.4 for the formal definition of NOE-r).
So far, no convergence rate analysis for the ALCP strategy has been established.^3 Here, we encounter the following fundamental question: can we derive a convergence rate of the ALCP strategy, in terms of gradient sequence, by incorporating NOE-r? To establish a convergence rate for the ALCP strategy, we introduce a modification of the restarting condition [33] of the ALCP strategy with a polynomial lower bound on the usage count of the same center point. Thanks to NOE-r and the proposed modification, we successfully derive a convergence rate of the ALCP strategy in Theorem 3.3. Numerical experiments in Section IV demonstrate the effectiveness of the ALCP strategy for Problem 1.1 compared with the standard strategy (see below), known as the retraction-based strategy [25], [46], [47].
Related Works and Primitive Goal of This Paper: The so-called retraction-based strategy [25], [46], [47] is known as the standard generic strategy for optimization over Riemannian manifolds, which include St(p, N) as a special instance. In order to extend ideas established for optimization over the Euclidean space to optimization over St(p, N), the retraction-based strategy approximates St(p, N) by the tangent space to St(p, N) locally around the current iterate U_n ∈ St(p, N) with a retraction mapping, where the tangent space varies according to the tangent point U_n. However, the retraction-based strategy encounters certain difficulties in utilizing past estimates of a solution for updating the current estimate, mainly because the past estimates are expressed on tangent spaces different from the current one. Such difficulties prevent designers of algorithms from extending powerful optimization algorithms over the Euclidean space to new algorithms over St(p, N), because such advanced algorithms, e.g., [39], [40], [41], [43], [44], and [45], effectively exploit the past estimates of a solution for updating the current estimate. The major goal of this paper is to resolve such difficulties in the retraction-based strategy by utilizing an especially beautiful nature of St(p, N), i.e., the mappings Φ_S : St(p, N) \ E_{N,p}(S) → Q_{N,p}(S) and Φ_S^{-1} : Q_{N,p}(S) → St(p, N) \ E_{N,p}(S). A partial preliminary short version of this paper was presented in [31].
Notation: N, N_0, and R denote respectively the set of all positive integers, the set of all nonnegative integers, and the set of all real numbers. For general n ∈ N, I_n stands for the identity matrix in R^{n×n}, but the identity matrix in R^{N×N} is denoted simply by I.
For p ≤ N, I_{N×p} ∈ R^{N×p} denotes the matrix of the first p columns of I. For p < N, the matrices U_up ∈ R^{p×p} and U_lo ∈ R^{(N−p)×p} denote respectively the upper and the lower block matrices of U ∈ R^{N×p}, i.e., U = [U_up^T U_lo^T]^T. The matrices S_le ∈ R^{N×p} and S_ri ∈ R^{N×(N−p)} denote respectively the left and right block matrices of S ∈ R^{N×N}, i.e., S = [S_le S_ri]. For X ∈ R^{m×n}, X_{ij} denotes the (i, j) entry of X and X^T denotes the transpose of X. For X ∈ R^{n×n}, Skew(X) := (X − X^T)/2 is the skew-symmetric component of X. For square matrices X_1, X_2, ..., X_k, diag(X_1, X_2, ..., X_k) denotes the block diagonal matrix with diagonal blocks X_1, X_2, ..., X_k. For a given matrix X ∈ R^{m×n}, ∥X∥_2 and ∥X∥_F denote respectively the spectral norm and the Frobenius norm. For a subset N of N_0, |N| denotes the cardinality of N. To distinguish from the symbol for the orthogonal group O(N), the symbol o(·) is used in place of the standard big O notation. ⌊·⌋ : R → Z : t ↦ max{n ∈ Z | n ≤ t} stands for the floor function, where Z is the set of all integers. For a differentiable function J : X → R defined over the Euclidean space X with an inner product ⟨·, ·⟩, ∇J(x) is the gradient of J at x ∈ X satisfying lim_{t→0, t≠0, t∈R} (J(x + tv) − J(x))/t = ⟨∇J(x), v⟩ for all v ∈ X.

II. PRELIMINARIES
A. ADAPTIVE LOCALIZED CAYLEY PARAMETRIZATION FOR OPTIMIZATION OVER St(p, N)
The generalized left-localized Cayley transform Φ_S [33] is an extension of the classical Cayley transform [48] for the parametrization of St(p, N). For p, N ∈ N satisfying p ≤ N, Φ_S with S ∈ O(N) is defined by (2), where (i) the set E_{N,p}(S) in (3) is called the singular-point set of Φ_S, and (ii) Q_{N,p}(S) is clearly a vector space [33] (see (1)). For S ∈ O(N), the inverse mapping Φ_S^{-1} : Q_{N,p}(S) → St(p, N) \ E_{N,p}(S) is given by (4). We call the tunable parameter S a center point of Φ_S.

Fact 2.1 (Optimality Condition [33]): Let S ∈ O(N) and U ∈ St(p, N) \ E_{N,p}(S). Then, the following hold: (a) U is a local minimizer of Problem 1.1 if and only if Φ_S(U) is a local minimizer of f_S := f ∘ Φ_S^{-1} over Q_{N,p}(S); (b) U is a stationary point of Problem 1.1 if and only if ∇f_S(Φ_S(U)) = 0.

By using Φ_S and Φ_S^{-1}, we illustrate the adaptive localized Cayley parametrization (ALCP) strategy [42] in Algorithm 1 for Problem 1.1 via Problem 1.3, where center points S^[l] are adaptively updated to suppress the singular-point issue (see Remark 2.2 (b) and (c)). Basically, while keeping the l-th center point S^[l], we update V_n ∈ Q_{N,p}(S^[l]) by a single update of a Euclidean optimization algorithm A⟨l⟩ for the minimization of f_{S^[l]} (see Section II-B for A⟨l⟩). In a case where the singular-point issue is detected by checking a certain restarting condition (see Remark 2.2 (d)), we update the center point from S^[l] to S^[l+1] ∈ O(N), and the next iterate is given accordingly. Just after updating the center point from S^[l] to S^[l+1], we restart the minimization of f_{S^[l+1]} by applying a Euclidean optimization algorithm A⟨l+1⟩ with the initial guess V_{n+1} = Φ_{S^[l+1]}(U_{n+1}).
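The defining formulas (1)-(4) of Φ_S are given in [33] and are not reproduced in this excerpt. To make the parametrization idea concrete, the following is a minimal sketch of the classical Cayley parametrization that Φ_S generalizes; the left-multiplied role of the center point S is an assumed convention chosen for illustration only.

import numpy as np

def cayley(V):
    """Classical Cayley transform of a skew-symmetric V in R^{NxN}:
    Cay(V) = (I - V)^{-1} (I + V), an orthogonal matrix."""
    I = np.eye(V.shape[0])
    return np.linalg.solve(I - V, I + V)

def inv_param(V, p, S):
    """Illustrative left-localized parametrization of St(p, N):
    a skew-symmetric V is mapped to U = S Cay(V) I_{Nxp}, where the
    center point S in O(N) relocates the region covered by V."""
    return (S @ cayley(V))[:, :p]

rng = np.random.default_rng(0)
N, p = 6, 2
A = rng.standard_normal((N, N))
V = (A - A.T) / 2                       # Skew(A): a skew-symmetric matrix
U = inv_param(V, p, np.eye(N))
print(np.allclose(U.T @ U, np.eye(p)))  # True: U lies on St(p, N)

Because the parameter space of V is a vector space, any Euclidean optimization algorithm can be run on it directly; this is the key property exploited by the NCP and ALCP strategies.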

B. NONCONVEX OPTIMIZATION OVER THE EUCLIDEAN SPACE WITH CONVERGENCE RATES
Many Euclidean optimization algorithms have been designed to achieve better convergence. Especially in the machine learning area, convergence rates in terms of the gradient of the cost function are of great interest, e.g., [38], [39], [40], [41], [43], [44], and [45]. Here, we focus on the class of Nonconvex Optimization over the Euclidean space with convergence Rates (NOE-r) defined below (see also Example 2.5).
Definition 2.4 (NOE-r): Assume that the cost function J : X → R defined over the Euclidean space X satisfies the condition (7). We say that an algorithm A belongs to NOE-r of o(cn^{−α}) with α > 0 for the minimization of J if the following hold:
(a) (Update) Let N := N_0. Choose an initial guess x_{min(N)} (= x_0) ∈ X, of a stationary point of J, arbitrarily. Estimates (x_n)_{n∈N} ⊂ X of a stationary point of J are generated by (n ∈ N) A : (x_n, R_{[min(N),n]}) ↦ x_{n+1} ∈ X with a certain available record^8 R_{[min(N),n]} (see Example 2.5). For simplicity, we also use the notation A(x_n) for a single update from x_n.
(b) (Convergence rate in terms of gradient sequence) The algorithm A has a convergence rate o(cn^{−α}) in terms of gradient sequence, which means in this paper (see also [38, p. 28]) that, for any cost function J satisfying (7), the algorithm A satisfies (∃α > 0, ∃c > 0, ∀n ∈ N) min_{k∈N, k≤n} ∥∇J(x_k)∥ ≤ c(n − min(N) + 1)^{−α}.
Example 2.5 (Selected Algorithms in NOE-r): Many Euclidean optimization algorithms belong to NOE-r of o(cn^{−1/2}) for some c > 0, e.g., the gradient descent method [38], Nesterov-type accelerated gradient methods [39], [40], [41], and adaptive momentum optimization algorithms [43], [44], [45]. Here, we introduce some methods below.
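As a minimal concrete member of NOE-r, the following sketches gradient descent with step size 1/L on an L-smooth cost bounded below; the classical analysis [38] gives min_{k≤n} ∥∇J(x_k)∥ = O(n^{−1/2}). Function names are illustrative; Nesterov-type members of NOE-r additionally exploit the record R_{[min(N),n]} of past iterates via momentum.

import numpy as np

def gradient_descent(grad, x0, lipschitz, n_iters):
    """Gradient descent x_{n+1} = x_n - (1/L) grad J(x_n).
    For an L-smooth J bounded below, min_{k<=n} ||grad J(x_k)|| decays
    as O(n^{-1/2}), so this algorithm belongs to NOE-r of o(c n^{-1/2})."""
    x, best_gnorm = x0, np.inf
    for _ in range(n_iters):
        g = grad(x)
        best_gnorm = min(best_gnorm, np.linalg.norm(g))
        x = x - g / lipschitz
    return x, best_gnorm

# Toy usage on J(x) = ||x||^2 / 2, so grad J(x) = x and L = 1.
x_final, best_gnorm = gradient_descent(lambda x: x, np.ones(4), 1.0, 200)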

III. CONVERGENCE RATE ANALYSIS
Recall that (i) S^[l] ∈ O(N) is employed as the l-th center point of Φ_{S^[l]}^{-1} in Algorithm 1, (ii) N_l ⊂ N_0 is the index interval over which V_n ∈ Q_{N,p}(S^[l]) (n ∈ N_l) are produced (see also (5)), and (iii) A⟨l⟩ is a Euclidean optimization algorithm, which updates V_n within Q_{N,p}(S^[l]). In this section, we present a convergence rate analysis of Algorithm 1 under the following condition (Condition 3.1). More precisely, for Algorithm 1, we first present a center point-wise convergence analysis in Lemma 3.2, and then an overall convergence analysis in Theorem 3.3.

Condition 3.1 (For Convergence Analysis of Algorithm 1):
In Algorithm 1, ∥∇f_{S^[l]}(V_n)∥_F does not achieve zero for any n ∈ N_0, implying that either (i) or (ii) below holds true.
(i) For some l_max ∈ N_0, N_{l_max} = [min(N_{l_max}), ∞) ∩ N_0 and N_l = ∅ (∀l > l_max). In this case, we use the index set Λ_c := {0, 1, ..., l_max} of chosen center points during the process of Algorithm 1.
(ii) For all l ∈ N_0, N_l = [min(N_l), max(N_l)] ∩ N_0 and max(N_l) + 1 = min(N_{l+1}). In this case, we use the index set Λ_c := N_0 of chosen center points during the process of Algorithm 1.
Note that, in both cases, a wide-sense monotone increasing function ℓ : N_0 → N_0 : n ↦ l s.t. n ∈ N_l is well-defined, and thus V_n ∈ Q_{N,p}(S^[ℓ(n)]) and U_n = Φ_{S^[ℓ(n)]}^{-1}(V_n) are well-defined for all n ∈ N_0.
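To illustrate the bookkeeping in Condition 3.1 (ii), the following hypothetical helper realizes the map ℓ : n ↦ l with n ∈ N_l from the start indices min(N_l) of the consecutive index intervals; it is a sketch for intuition only.

from bisect import bisect_right

def make_ell(block_starts):
    """Given min(N_0) = 0 < min(N_1) < ... for consecutive intervals N_l
    (Condition 3.1 (ii)), return the wide-sense monotone increasing map
    ell(n) = l such that n is in N_l."""
    def ell(n):
        return bisect_right(block_starts, n) - 1
    return ell

ell = make_ell([0, 30, 75])               # N_0=[0,29], N_1=[30,74], N_2=[75,...)
print(ell(0), ell(29), ell(30), ell(100)) # -> 0 0 1 2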
Lemma 3.2: Assume that (i) f : R^{N×p} → R is differentiable, and (ii) ∇f is Lipschitz continuous with a Lipschitz constant L_∇f > 0 (see (7)). Consider the situation where Condition 3.1 holds true. Then, the following hold: (a) For all l ∈ Λ_c, the condition (7) is achieved when J and X are given respectively by f_{S^[l]} and Q_{N,p}(S^[l]). (b) (Center point-wise convergence analysis) Suppose that, for some c > 0 and all l ∈ Λ_c, A⟨l⟩ belongs to NOE-r of o(cn^{−1/2}) in the sense of Definition 2.4 (b). Let (V_n)_{n ∈ ∪_{l∈Λ_c} N_l} be generated by Algorithm 1 incorporating A⟨l⟩ (l ∈ Λ_c). Then, (13) and (14) hold.

Proof: (a) The image of St(p, N) under f, say f(St(p, N)) ⊂ R, is bounded below by the compactness of St(p, N) and the continuity of f. Regarding the condition (7), see Fact 1.1 for the differentiability of f_S and the Lipschitz continuity of ∇f_{S^[l]}.

(Proof of (14)) Clearly, for any n ∈ N_0, we have the identity (15), where the last equality is verified by (5). Moreover, (13) yields upper bounds on the quantities appearing in (15). Finally, by substituting these relations into the LHS of (15), we obtain (14).

From (14) in Lemma 3.2 (b), to evaluate the speed of convergence of min_{k=0,1,...,n} ∥∇f_{S^[ℓ(k)]}(V_k)∥_F to zero, it is sufficient to show that the RHS of (14) has an upper bound of the form (16) with some α > 0 and some c > 0. To guarantee the existence of such α > 0 and c > 0, we modify^10 the restarting condition (6) by newly introducing a polynomial lower bound η(l) on the usage count of S^[l] as in (17), where (i) T > 0 is a predetermined threshold for the detection of the singular-point issue (see Remark 2.2 (d)), (ii) θ ∈ N is a tunable polynomial order of η(·), and (iii) τ ∈ N is a tunable invariant period of η(·). Thanks to the latter condition in (17), we ensure (18).
^10 Without any modification, it is impossible to show the existence of α > 0 and c > 0 in (16). Indeed, by considering a special situation where the center point S^[ℓ(n)] is updated at every iteration, i.e., ℓ(n) = n, the LHS of (16) becomes 1, which is not bounded above by the RHS of (16) for any pair of α > 0 and c > 0.
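The following is a minimal sketch of the modified restarting test. Since the display (17) is not reproduced in this excerpt, the concrete form η(l) = ⌊l/τ⌋^θ is an assumption, chosen to be constant on blocks of length τ (the invariant period) with polynomial order θ, consistent with the surrounding discussion and the proof of (27).

import math

def eta(l, tau, theta):
    """Assumed polynomial lower bound on the usage count of S^[l]:
    eta(l) = floor(l / tau) ** theta, constant on blocks of length tau."""
    return math.floor(l / tau) ** theta

def should_restart(v_next_norm2, n, n_min_l, l, T, tau, theta):
    """Sketch of the modified restarting condition (17): restart only if
    (i) ||V_{n+1}||_2 > T (the singular-point issue is detected), and
    (ii) the center point S^[l] has been used for at least eta(l) iterations."""
    return (v_next_norm2 > T) and (n - n_min_l + 1 >= eta(l, tau, theta))

The second condition is exactly what rules out the pathological case ℓ(n) = n described in footnote 10: a center point cannot be discarded before it has been used sufficiently often.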
Finally, we present the following overall convergence analysis of Algorithm 1.

Theorem 3.3 (Overall Convergence Analysis): Assume that (i) f : R^{N×p} → R is differentiable, and (ii) ∇f is Lipschitz continuous with a Lipschitz constant L_∇f > 0. Consider the situation where Condition 3.1 holds true. Suppose that, for some c > 0 and all l ∈ Λ_c, A⟨l⟩ belongs to NOE-r of o(cn^{−1/2}) in the sense of Definition 2.4 (b). Let (V_n)_{n ∈ ∪_{l∈Λ_c} N_l} be generated by Algorithm 1 incorporating A⟨l⟩ (l ∈ Λ_c) and employing (17), with some T > 0 and (θ, τ) ∈ N^2, as the restarting condition in line 8 of Algorithm 1. Then, (19) and (20) hold.

Proof: See Appendix B.

Remark 3.4 (Interpretation of Theorem 3.3): (a) As stated just after Problem 1.1, a realistic goal of Problem 1.1 is to find its stationary point U⋆ ∈ St(p, N). More precisely, according to Fact 2.1 (b), it is desired to approximate (V⋆, S⋆) ∈ Q_{N,p}(S⋆) × O(N) satisfying ∇f_{S⋆}(V⋆) = 0, e.g., by Algorithm 1. Clearly, the RHS in (19) converges to zero as n → ∞. This upper bound on min_{k=0,1,...,n} ∥∇f_{S^[ℓ(k)]}(V_k)∥_F tells us how fast the gradient approaches zero. (b) The evaluations shown in (19) and (20) also apply to the NCP strategy (incorporating an algorithm of NOE-r as A⟨0⟩) as a special case of the ALCP strategy (Algorithm 1), by setting a sufficiently large T > 0 in (17).
IV. NUMERICAL EXPERIMENTS
(i) Linear eigenvalue problem: For a given symmetric matrix A ∈ R^{N×N}, we considered

minimize_{U ∈ St(p,N)} f_1(U) := −Tr(U^T AU).  (22)

Any global minimizer U⋆ of the problem (22) is an orthonormal eigenbasis associated with the p largest eigenvalues of A [25]. In our experiment, we used A := Ā^T Ā ∈ R^{N×N} with a randomly chosen Ā ∈ R^{N×N}, of which each entry is sampled from the standard normal distribution.
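A minimal sketch of this setup: construct A = Ā^T Ā and evaluate f_1 and its Euclidean gradient, which is ∇f_1(U) = −2AU since A is symmetric. Function names are illustrative.

import numpy as np

rng = np.random.default_rng(0)
N, p = 1000, 10
A_bar = rng.standard_normal((N, N))
A = A_bar.T @ A_bar                      # symmetric test matrix

def f1(U):
    """f_1(U) = -Tr(U^T A U), the cost of the linear eigenvalue problem (22)."""
    return -np.trace(U.T @ A @ U)

def grad_f1(U):
    """Euclidean gradient: nabla f_1(U) = -2 A U (A is symmetric)."""
    return -2.0 * (A @ U)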
(ii) Nonlinear eigenvalue problem: By following [19], [20], and [21], we considered the nonlinear eigenvalue problem (23) with some diagonal matrix D ∈ R^{N×N}. The solution of (23) can be characterized as a global minimizer of the problem (24) (see, e.g., [20]). In our experiment, we used ζ = 3 and L as the one-dimensional discrete Laplacian with 2 on the diagonal and −1 on the sub- and super-diagonals, by following [21].

Tables 1 and 2 summarize respectively the average results for problems (22) and (24) over 10 trials with N ∈ {1000, 2000} and p ∈ {1, 10}, where, for each output U⋄ ∈ St(p, N) of these algorithms, 'fval-optimal' means the value f(U⋄) − f(U⋆), 'fval' means the value f(U⋄), 'feasi' means the feasibility ∥I_p − U⋄^T U⋄∥_F, 'nrmg' means the norm ∥grad f(U⋄)∥_F, 'itr' means the number of iterations, 'time' means the CPU time (s), and 'restart' means the number of center-point updates for rNAG+ALCP. Figures 1 and 2 demonstrate the convergence histories for problems (22) and (24), respectively, where the plots show CPU time on the horizontal axis versus the value of the cost function on the vertical axis. Remark 4.1 below summarizes the main points found in Tables 1 and 2 and Figures 1 and 2.
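For concreteness, the sketch below builds the one-dimensional discrete Laplacian L described above and evaluates a total-energy cost of the form commonly used for this problem in [20], [21], namely f_2(U) = (1/2)Tr(U^T L U) + (ζ/4)ρ(U)^T L^{−1}ρ(U) with ρ(U) := diag(UU^T). This concrete form is an assumption, since the display (24) is not reproduced in this excerpt.

import numpy as np

N, p, zeta = 1000, 10, 3
# 1-D discrete Laplacian: 2 on the diagonal, -1 on the sub-/super-diagonals.
L = 2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)

def f2(U):
    """Assumed total-energy form of the cost in (24):
    f_2(U) = 1/2 Tr(U^T L U) + zeta/4 * rho(U)^T L^{-1} rho(U),
    where rho(U) = diag(U U^T)."""
    rho = np.sum(U * U, axis=1)          # diag(U U^T), computed without U U^T
    return 0.5 * np.trace(U.T @ L @ U) + 0.25 * zeta * (rho @ np.linalg.solve(L, rho))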

V. CONCLUSION
We presented a convergence rate analysis of the adaptive localized Cayley parametrization (ALCP) strategy for optimization over the Stiefel manifold with a new restarting condition. In the ALCP strategy, an optimization problem over the Stiefel manifold is translated into optimization problems over a Euclidean space via the tunable generalized left-localized Cayley transform. The proposed analysis provides a convergence rate, in terms of gradient sequence, for the ALCP strategy. The numerical experiments demonstrate the effectiveness of the ALCP strategy compared with the retraction-based strategy and the naive Cayley parametrization strategy.

APPENDIX A
GRADIENT OF FUNCTION AFTER THE CAYLEY PARAMETRIZATION
We briefly review the gradient of f ∘ Φ_S^{-1} in the case where the standard inner product ⟨V_1, V_2⟩ := Tr(V_1^T V_2) is employed for Q_{N,p}(S).
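Since the closed-form expression of ∇(f ∘ Φ_S^{-1}) from [33] is not reproduced in this excerpt, the following generic finite-difference check (a hypothetical helper) can be used to validate any implementation of such a gradient against the definition of ∇J in the Notation paragraph.

import numpy as np

def numerical_grad(J, V, h=1e-6):
    """Central-difference approximation of the Euclidean gradient of J at V,
    with respect to the inner product <V1, V2> = Tr(V1^T V2); useful for
    validating a closed-form gradient of f o Phi_S^{-1} entry by entry."""
    G = np.zeros_like(V)
    it = np.nditer(V, flags=["multi_index"])
    for _ in it:
        idx = it.multi_index
        E = np.zeros_like(V)
        E[idx] = h
        G[idx] = (J(V + E) - J(V - E)) / (2 * h)
    return G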

APPENDIX B
PROOF OF THEOREM 3.3
The following claim is required for our analysis.
Since (i) the RHS of (27) is invariant for ϒ ∈ [κτ, (κ + 1)τ − 1] with every κ ∈ N_0 and (ii) the LHS of (27) is monotonically increasing in ϒ, it is sufficient to verify the inequality (27) for the case ϒ = κτ. In this case, the LHS of (27) can be evaluated directly, which establishes (27).

FIGURE 1. Convergence histories of each algorithm applied to the problem (22), showing the value f(U) − f(U⋆) against CPU time for each problem size, where U⋆ ∈ St(p, N) was obtained by the eigenvalue decomposition of A. Markers are placed at every 15 iterations.

FIGURE 2. Convergence histories of each algorithm applied to the problem (24), showing the value f(U) against CPU time for each problem size. Markers are placed at every 15 iterations.

Remark 4.1 (Main Findings From Numerical Experiments): (a) (Comparison of rNAG+ALCP and rNAG+NCP) As expected from the design concept of the ALCP strategy (see the paragraphs just before and after Problem 1.3, and Remark 2.2 (c)), Figures 1 and 2 show that rNAG+ALCP dramatically improves the convergence speed of rNAG+NCP. (b) (Comparison of rNAG+Retr and rNAG+ALCP) In view of the number of iterations in Tables 1 and 2, the performance of rNAG+ALCP is comparable to that of rNAG+Retr. From Figures 1 and 2, we found that the proposed rNAG+ALCP for all τ, θ ∈ {1, 10} outperforms rNAG+Retr in view of CPU time. (c) (Robustness against choices of τ, θ used in (17)) From Figures 1 and 2, the numerical performance of rNAG+ALCP is robust against the chosen τ, θ. Moreover, from Tables 1 and 2, for rNAG+ALCP with (τ, θ) ≠ (1, 10), we found that all results are the same except for 'time'. This is mainly because, for rNAG+ALCP with (τ, θ) ≠ (1, 10), (i) η(l) ≤ 1 holds during the process of the algorithm due to the small number of restarts (see (17) and 'restart'); (ii) the latter condition n − min(N_l) + 1 ≥ η(l) in (17) always holds true; and (iii) every center point was updated at the same iteration.
Remark 2.2 (d) (Detection of Singular-Point Issue [42]): We can detect the singular-point issue, in line 8 of Algorithm 1, by checking whether ∥V_{n+1}∥_2 exceeds a predetermined threshold T > 0 [42], where V_{n+1} = A⟨l⟩(V_n) is a tentative update used to judge whether the parametric expression V_{n+1} of U_{n+1} is too influenced by E_{N,p}(S^[l]) to keep S^[l] or not. We can also use V_{n+1}^{block} := ...