A Momentum-Guided Frank-Wolfe Algorithm

Abstract—With the well-documented popularity of Frank Wolfe (FW) algorithms in machine learning tasks, the present paper establishes links between FW subproblems and the notion of momentum emerging in accelerated gradient methods (AGMs). On the one hand, these links reveal why momentum is unlikely to be effective for FW-type algorithms on general problems. On the other hand, it is established that momentum accelerates FW on a class of signal processing and machine learning applications. Specifically, it is proved that a momentum variant of FW, here termed accelerated Frank Wolfe (AFW), converges at a faster rate O(1/k²) on such a family of problems, despite the O(1/k) rate of FW on general cases. Distinct from existing fast-convergent FW variants, the faster rates here rely on parameter-free step sizes. Numerical experiments on benchmark machine learning tasks corroborate the theoretical findings.

Bingcong Li, Mario Coutiño, Georgios B. Giannakis, and Geert Leus
Index Terms—Frank Wolfe method, conditional gradient method, momentum, accelerated method, smooth convex optimization

I. INTRODUCTION
We consider efficient means of solving the following optimization problem

min_{x ∈ X} f(x),   (1)

where f is a smooth convex function. The constraint set X ⊂ R^d is assumed to be convex and compact, and d is the dimension of the variable x. We denote by x* ∈ X a minimizer of (1). Among problems across signal processing, machine learning, and other areas, the constraint set X can be structured but difficult or expensive to project onto. Examples include the nuclear norm ball constraint for matrix completion in recommender systems [1] and the total-variation norm ball adopted in image reconstruction tasks [2]. The computational inefficiency of the projection, especially for a large d, impairs the applicability of projected gradient descent (GD) [3] and the projected accelerated gradient method (AGM) [4], [5].
An alternative to GD for solving (1) is the Frank Wolfe (FW) method [6]-[8], also known as the conditional gradient approach. FW circumvents the projection in GD by first minimizing an affine function, which is the supporting hyperplane of f(x) at x_k, over X to obtain v_{k+1}, and then updating x_{k+1} as a convex combination of x_k and v_{k+1}. When dealing with structural constraints such as nuclear norm balls and total-variation norm balls, an efficient implementation or even a closed-form solution for computing v_{k+1} is available [7], [9], resulting in reduced computational complexity compared with projection steps. In addition, when initialized well, FW directly promotes low-rank (sparse) solutions when the constraint set is a nuclear norm (ℓ1 norm) ball [1]. Given its ease of implementation and its ability to produce structured solutions, FW is of interest in various applications. Besides those mentioned earlier, other examples encompass structural SVM [10], video colocation [11], particle filtering [12], traffic assignment [13], optimal transport [14], electric vehicle charging [15], [16], and submodular optimization [17].
Although FW has well-documented merits in several applications, it exhibits slower convergence when compared to AGM. Specifically, FW satisfies f(x_k) − f(x*) = O(1/k). This convergence slowdown is confirmed by the lower bound, which indicates that the number of FW subproblems to solve in order to ensure f(x_k) − f(x*) ≤ ε is no fewer than O(1/ε) [7], [18]. Thus, FW is a lower-bound-matching algorithm in general. However, improved FW-type algorithms with faster rates are possible for certain subclasses of problems.

A. Related works
There are three common approaches to select step sizes for FW and its variants: i) line search [7]; ii) minimizing a one-dimensional quadratic function over [0, 1], i.e., smooth step sizes [9], [19]; and iii) parameter-free step sizes, that is, δ_k = O(1/k) [7]. Most of the fast-converging FW variants rely on choices i) or ii), which require either the smoothness parameter or the function value of f. Step size i) is 'clumsy' when it is costly to access function values, e.g., in the big data regime. Concerns with choice ii) arise with how well the smoothness parameter is estimated. In addition, it is challenging to select the smoothness-inducing norm, and each norm can result in a considerably different smoothness parameter [20]. The need thus arises for FW variants relying on parameter-free step sizes, especially those enabling faster convergence. To this end, we first briefly recap existing results on faster rates.
Line search. Jointly leveraging line search and 'away steps,' FW-type algorithms converge linearly for strongly convex problems when X is a polytope [8], [23]; see also [24], [25], and [21], where the memory efficiency of away steps is also improved.
Smooth step sizes. If X is strongly convex and the optimal solution is at the boundary of X, it is known that FW converges linearly [19]. For uniformly (and thus strongly) convex sets, faster rates are attained when the optimal solution is at the boundary of X [26]. When both f and X are strongly convex, FW with the smooth step size converges at a rate of O(1/k²), regardless of where the optimal solution resides [9]. A variant of the smooth step size, along with modifications of FW, jointly enables faster rates for a strongly convex f and a gauge set X [27], at the expense of requiring extra parameters besides the smoothness constant.
Parameter-free step sizes. With no parameter involved here, there is no concern about the quality of parameter estimation, which saves time and effort because there is no need to tune step sizes. Although implementation efficiency is ensured, theoretical guarantees are challenging to obtain. This is because f(x_{k+1}) ≤ f(x_k) cannot be guaranteed without line search or smooth step sizes. Faster rates for parameter-free FW are rather limited in number. In a recent work [22], the behavior of FW when k is large and X is a polytope is investigated under the strong assumptions that f(x) is twice differentiable and locally strongly convex around x*. Hence, the analysis does not hold for, e.g., the Huber loss, which is widely used in robust regression but is only once differentiable. The faster rates, along with the assumptions on f and X, are summarized in Table I for comparison. To establish faster rates, our solution connects the FW subproblem with Nesterov's momentum, which is recapped next.
Nesterov momentum. After the O(1/k²) convergence rate was established in [3], [28], the efficiency of Nesterov momentum has proven almost universal; see e.g., the accelerated proximal gradient method [5], [29] and projected AGM [4], [5] for problems with constraints, accelerated mirror descent [4], [5], [30], and accelerated variance reduction for problems with finite-sum structure [31], [32]. Parallel to these works, AGM has also been investigated from an ordinary differential equation (ODE) perspective [30], [33]-[35]. However, the efficiency of Nesterov momentum for FW-type algorithms is cast in doubt by the lower bound on the number of subproblems [7], [18]. A means of bringing momentum into FW is to adopt conditional gradient sliding (CGS) [36], where the projection subproblem in the original AGM is replaced by gradient sliding, which solves a sequence of FW subproblems. The faster rate O(1/k²) is obtained at the price of: i) requiring up to O(k) FW subproblems at the kth iteration; and ii) an inefficient implementation (e.g., the AGM subproblem has to be solved to a certain accuracy, and it relies on parameters that are not necessary in FW, such as the diameter of X).
Although parameter-free FW is undoubtedly attractive in several applications, there are two main challenges in establishing faster rates for such step sizes: i) even AGM and most of its variants are not parameter-free, since they involve the smoothness parameter; and ii) parameter-free FW in general cannot ensure per-step descent, which is essential for faster rates. To overcome these challenges, we first unveil the links between the notion of momentum and the FW subproblem. Then, we leverage these connections to provide provable constraint-dependent faster rates.

B. Our contributions
In succinct form, our contributions are as follows.
• We observe that the momentum update in AGM plays a similar role to the subproblem in FW, both intuitively and analytically. Hence, the FW subproblem can be leveraged to play the role of Nesterov's momentum, thus enabling faster rates on a useful family of problems.
• We prove that a momentum-guided FW, termed accelerated Frank Wolfe (AFW), achieves a faster rate Õ(1/k²) on active ℓp norm ball constraints without knowledge of the smoothness parameter or the function value. We also establish that AFW converges no slower than FW on general problems.
• We corroborate the numerical efficiency of AFW on two benchmark tasks. We validate the faster AFW rates on binary classification problems with different constraint sets. We further demonstrate that, for matrix completion, AFW finds low-rank solutions with small optimality error more rapidly than FW.

Notation. Bold lowercase letters denote column vectors; ‖x‖ stands for the ℓ2 norm of a vector x; and ⟨x, y⟩ denotes the inner product between vectors x and y. All missing proofs can be found in the Appendix.

II. PRELIMINARY
This section briefly reviews FW, starting with the assumptions that clarify the class of problems we focus on.

Assumption 1. f is L-smooth; that is, ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖, ∀x, y ∈ X.

Assumption 2. f is convex.

Assumption 3. X is convex and compact, with diameter D; that is, ‖x − y‖ ≤ D, ∀x, y ∈ X.

Algorithm 1 FW
1: Initialize: x_0 ∈ X
2: for k = 0, 1, . . . , K − 1 do
3:    v_{k+1} = arg min_{x ∈ X} ⟨∇f(x_k), x⟩
4:    x_{k+1} = (1 − δ_k)x_k + δ_k v_{k+1}
5: end for
6: Return: x_K

Assumptions 1-3 are standard for FW-type algorithms, and they are assumed to hold true throughout.
FW is summarized in Alg. 1. A subproblem with a linear loss needs to be solved to obtain v_{k+1} per iteration. This subproblem is also referred to as an FW step, and it admits a geometrical explanation. In particular, v_{k+1} can be rewritten as

v_{k+1} ∈ arg min_{x ∈ X} f(x_k) + ⟨∇f(x_k), x − x_k⟩.   (2)

Noticing that the RHS of (2) is the supporting hyperplane of f(x) at x_k, it is thus clear that v_{k+1} is a minimizer of this supporting hyperplane over X. Note also that the supporting hyperplane in (2) is a global lower bound of f(x) due to the convexity of f, i.e., f(x) ≥ f(x_k) + ⟨∇f(x_k), x − x_k⟩, ∀x. Upon minimizing this lower bound in (2) to obtain v_{k+1}, x_{k+1} is updated as a convex combination of v_{k+1} and x_k to eliminate the projection. Next, we briefly recap the step sizes for FW to gain insight into why the parameter-free FW is challenging to analyze.
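To make the recursion concrete, the following is a minimal sketch of Alg. 1 for an ℓ2 norm ball constraint, where the FW step admits a closed form. The quadratic objective, radius, and iteration budget are illustrative choices, not taken from the paper.

```python
import numpy as np

def lmo_l2_ball(grad, radius):
    """FW step (linear minimization oracle) over {x : ||x||_2 <= radius}:
    argmin_{||v|| <= radius} <grad, v> = -radius * grad / ||grad||."""
    nrm = np.linalg.norm(grad)
    return -radius * grad / nrm if nrm > 0 else np.zeros_like(grad)

def frank_wolfe(grad_f, x0, radius, K=2000):
    """Vanilla FW (Alg. 1) with the parameter-free step size delta_k = 2/(k+2)."""
    x = x0.copy()
    for k in range(K):
        v = lmo_l2_ball(grad_f(x), radius)   # minimize supporting hyperplane over X
        delta = 2.0 / (k + 2)                # parameter-free step size
        x = (1 - delta) * x + delta * v      # convex combination keeps x feasible
    return x

# toy example: minimize 0.5*||x - b||^2 over the unit l2 ball (b is illustrative);
# since ||b|| > 1, the solution is the boundary point b / ||b||
b = np.array([3.0, -4.0])
x_star = frank_wolfe(lambda x: x - b, np.zeros(2), radius=1.0)
```

With the standard FW guarantee f(x_k) − f(x*) ≤ 2LD²/(k + 2), two thousand iterations suffice to place x_star well within 0.1 of the true minimizer b/‖b‖.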
Smooth step size. At the kth iteration, the step size δ_k in Alg. 1 is obtained by minimizing over δ ∈ [0, 1] the quadratic upper bound implied by Assumption 1,

f((1 − δ)x_k + δv_{k+1}) ≤ f(x_k) + δ⟨∇f(x_k), v_{k+1} − x_k⟩ + (Lδ²/2)‖v_{k+1} − x_k‖²,   (3)

which yields δ_k = min{⟨∇f(x_k), x_k − v_{k+1}⟩ / (L‖x_k − v_{k+1}‖²), 1}. Clearly, it is imperative to estimate L accurately, because this estimate markedly influences the performance. It has also been argued that algorithms relying on a guess of L are not robust [37]. Tuning to find the 'best' L is employed in practice to optimize the performance empirically. On the other hand, smooth step sizes ensure descent per iteration, which is analytically attractive. Indeed, Assumption 1 implies that f(x_{k+1}) ≤ f(x_k) + δ_k⟨∇f(x_k), v_{k+1} − x_k⟩ + (Lδ_k²/2)‖v_{k+1} − x_k‖² ≤ f(x_k), where the first inequality is (3) at δ = δ_k, and the second holds because δ_k minimizes the RHS of (3) over [0, 1] (with δ = 0 feasible).

Line search. An alternative to tuning for the best L is to employ line search to determine the local smoothness parameter. In particular, the step size is chosen as δ_k = arg min_{δ ∈ [0,1]} f((1 − δ)x_k + δv_{k+1}). However, the price paid is the need to compute f(x), which is inefficient when function evaluation is costly (e.g., in big-data regimes). Note that f(x_{k+1}) ≤ f(x_k) is automatically ensured by line search.
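The two step-size rules above can be contrasted in a few lines. The quadratic test function is a placeholder, and the grid-based line search stands in for an exact one-dimensional minimization.

```python
import numpy as np

def smooth_step(grad_xk, xk, v, L):
    """Smooth step size: minimizer over [0, 1] of the quadratic upper bound
    f(x_k) + delta*<grad, v - x_k> + (L*delta^2/2)*||v - x_k||^2."""
    gap = grad_xk @ (xk - v)              # FW gap, nonnegative at an FW step
    denom = L * np.dot(xk - v, xk - v)
    return min(gap / denom, 1.0) if denom > 0 else 1.0

def line_search_step(f, xk, v, grid=np.linspace(0.0, 1.0, 1001)):
    """Grid-based line search (requires function evaluations, unlike smooth_step)."""
    vals = [f((1 - d) * xk + d * v) for d in grid]
    return grid[int(np.argmin(vals))]
```

For f(x) = 0.5‖x − b‖² (so L = 1), both rules recover the same optimal fraction of a full FW step, but only the first needs L and only the second needs f.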
Parameter-free step size. This type of step size relies neither on L nor on other parameters, and hence it is extremely easy to implement; a common choice is δ_k = 2/(k+2) [7]. However, these step sizes do not guarantee descent per iteration, which becomes the bottleneck for establishing faster rates on specific constraint sets. Our insight to overcome this comes from the observation that the FW step is similar to the momentum in AGM for convex problems. Hence, the FW step itself can be used as an approximate momentum.

Algorithm 2 AGM [3]
1: Initialize: x_0, v_0 = x_0, µ_0
2: for k = 0, 1, . . . , K − 1 do
3:    y_k = (1 − δ_k)x_k + δ_k v_k
4:    x_{k+1} = y_k − (1/L)∇f(y_k)
5:    v_{k+1} = v_k − (δ_k/µ_{k+1})∇f(y_k)
6: end for
7: Return: x_K

III. CONNECTING MOMENTUM WITH FW
To build intuition on how momentum can be helpful for FW-type algorithms, we first recap AGM for unconstrained convex problems, i.e., X = R^d. Note that the unconstrained problem is discussed here only for simplicity of exposition; one can extend the arguments to constrained cases straightforwardly. AGM [3], [4], [28] is summarized in Alg. 2. We start this section by characterizing the behavior of {x_k}, {y_k} and {v_k} in the next theorem.
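A sketch of the recap above, under the assumption (ours, consistent with the increasing step size δ_k/µ_{k+1} = O(k/L) discussed after Theorem 1) that the weights follow the estimate-sequence recursion µ_{k+1} = (1 − δ_k)µ_k with µ_0 = 2L:

```python
import numpy as np

def agm(grad_f, x0, L, K=500):
    """Sketch of Alg. 2 (AGM). The recursion mu_{k+1} = (1 - delta_k) * mu_k with
    mu_0 = 2L is an assumption consistent with delta_k / mu_{k+1} = O(k / L)."""
    x, v = x0.copy(), x0.copy()
    mu = 2.0 * L
    for k in range(K):
        delta = 2.0 / (k + 3)
        y = (1 - delta) * x + delta * v   # momentum-averaged query point
        g = grad_f(y)
        x = y - g / L                     # gradient step from y (Line 4)
        mu = (1 - delta) * mu             # estimate-sequence weight
        v = v - (delta / mu) * g          # momentum update (Line 5)
    return x

# toy example: for f(x) = 0.5*||x - b||^2 (L = 1), the 1/L step is exact
x_agm = agm(lambda x: x - np.array([1.0, 2.0]), np.zeros(2), L=1.0)
```

Replacing Line 5 by v_{k+1} = x_{k+1} in this sketch indeed collapses y_k to x_k and recovers plain GD, matching the observation made below.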
Theorem 1. Under Assumptions 1 and 2, with δ_k = 2/(k+3) and µ_0 = 2L, the iterates of Alg. 2 satisfy ‖∇f(y_k)‖² = O(1/k²); in addition, v_k stays in a ball centered at x* with radius depending on both x* and x_0.

Theorem 1 shows that ‖∇f(y_k)‖² = O(1/k²), which implies that y_k also converges to a minimizer as k → ∞. Through the increasing step size δ_k/µ_{k+1} = O(k/L), the update of v_k stays in the ball centered at x* with radius depending on both x* and x_0.
One observation about AGM is that by substituting Line 5 in Alg. 2 with v_{k+1} = x_{k+1}, the modified algorithm boils down to GD. Hence, it is clear that the key behind AGM's acceleration is v_k and the way it is updated. We contend that v_{k+1} is obtained by minimizing an approximated lower bound of f(x), formed as the sum of a supporting hyperplane at y_k and a regularizer. To see this, one can rewrite Line 5 of AGM as

v_{k+1} = arg min_{x ∈ R^d} f(y_k) + ⟨∇f(y_k), x − y_k⟩ + (µ_{k+1}/(2δ_k))‖x − v_k‖²,   (4)

where the linear part is the supporting hyperplane. As k increases, the impact of the regularizer (µ_{k+1}/(2δ_k))‖x − v_k‖² in (4) becomes limited. Thus the RHS can be viewed as an approximated lower bound of f(x). Regarding the reasons to place a regularizer after the supporting hyperplane: first, it guarantees that a minimizer exists, since directly minimizing the supporting hyperplane over R^d yields no solution. In addition, v_{k+1} is ensured to be unique, because the RHS of (4) is strongly convex thanks to the regularizer. Since v_{k+1} minimizes an approximated lower bound of f(x), it can be used to estimate f(x*). We explain in Theorem 4 in Appendix B that f(y_k) + ⟨∇f(y_k), v_{k+1} − y_k⟩ serves this purpose. Consequently, one can obtain an estimated suboptimality gap.

Momentum v_k update as an FW step. It is observed that v_{k+1} in both FW and AGM (cf. (2) and (4)) is obtained by minimizing an (approximated) lower bound of f(x), where the only difference lies in whether a regularizer with decreasing weight is utilized. The similarity between the RHS of (2) and (4) is amplified when k is large; see Fig. 1 for a graphical illustration of how (4) approaches an affine function. In other words, the momentum update in (4) becomes similar to an FW step for large k. In addition, there are several other connections.
Connection 1. The v_{k+1} update via (4) is equivalent to

v_{k+1} = arg min_{x ∈ X_k} ⟨∇f(y_k), x⟩, with X_k := {x | ‖x − v_k‖ ≤ r_k},   (5)

for a radius r_k = (δ_k/µ_{k+1})‖∇f(y_k)‖ that remains bounded according to Theorem 1. By rewriting (4) in its constrained form (5), it can be readily recognized that for unconstrained problems, Nesterov momentum can be obtained via FW steps with time-varying constraint sets.
Connection 2. Recall that in AGM, v_{k+1} obtained via (4) is used to construct an approximation of f(x*), namely f(y_k) + ⟨∇f(y_k), v_{k+1} − y_k⟩. When a compact X is present, directly minimizing the supporting hyperplane f(y_k) + ⟨∇f(y_k), x − y_k⟩ over X also yields an estimate of f(x*). Note that the latter is exactly an FW step. In addition, the FW step in Alg. 1 also results in a suboptimality gap (known as the FW gap; see e.g., [7]), which is in line with the role of v_k in AGM. In a nutshell, both the FW step and the momentum update in AGM result in an estimated suboptimality gap.
Connections between momentum and FW go beyond convexity. We discuss in Appendix C that AGM for strongly convex problems updates its momentum using exactly the same idea as FW; that is, both obtain a minimizer of a lower bound of f(x), and then perform an update through a convex combination.
These links and similarities between momentum and FW naturally lead us to explore their connections, and see how momentum influences FW.

IV. MOMENTUM-GUIDED FW
In this section we show that momentum is beneficial for FW by proving that it is effective at least on certain constraint sets. Specifically, we focus on the accelerated Frank Wolfe (AFW) algorithm summarized in Alg. 3, and analyze its convergence rate. Since we will see later that δ_k = 2/(k+3) ∈ (0, 1), ∀k, so that y_k, v_k and x_k lie in X for all k, AFW is projection free. Although it happens rarely, it is safe to choose v_{k+1} = v_k and proceed when θ_{k+1} = 0. Note that the x_{k+1} update in AFW is slightly different from that of AGM. This is because AGM guarantees f(x_{k+1}) ≤ f(y_k), ∀k, taking advantage of the known L. However, the same guarantee is difficult to replicate in a parameter-free algorithm.
The key to AFW is the v_{k+1} update, which plays the role of momentum. To see this, if one unrolls θ_{k+1} (cf. (22) in the Appendix) and plugs it into Line 5 of Alg. 3, v_{k+1} can be equivalently rewritten as

v_{k+1} ∈ arg min_{x ∈ X} Σ_{τ=0}^{k} w_τ [f(y_τ) + ⟨∇f(y_τ), x − y_τ⟩],   (6)

where w_τ ≥ 0 (the exact value of the sum Σ_τ w_τ depends on the choice of δ_τ). Note that f(y_τ) + ⟨∇f(y_τ), x − y_τ⟩ is a supporting hyperplane of f(x) at y_τ; hence the right-hand side (RHS) of (6) is a lower bound for f(x) constructed through a weighted average of supporting hyperplanes at {y_τ}. In other words, v_{k+1} is a minimizer of a lower bound of f(x), hence it is in line with the role of momentum. However, the momentum in AFW differs from that of AGM in two aspects. First, instead of relying on ∇f(y_k) alone, the update of v_{k+1} utilizes the coefficient θ_{k+1}, which is (roughly) a weighted average of past gradients {∇f(y_τ)}_{τ=1}^{k}, with more weight placed on recent ones. The second difference in the v_{k+1} update relative to AGM is whether a regularizer is used. As a consequence of the non-regularized lower bound (6), its minimizer is not guaranteed to be unique. A simple example is to consider the ith entry [θ_{k+1}]_i = 0. The ith entry [v_{k+1}]_i can then be chosen arbitrarily, as long as v_{k+1} ∈ X. This subtle difference leads to a significant gap between the performance of AFW and AGM; that is, AFW cannot achieve acceleration on general problems, as will be illustrated shortly. However, we confirm that momentum is still helpful, since it is effective on a class of problems.
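Putting the pieces together, a sketch of AFW on an ℓ2 ball might read as follows. Since Alg. 3 itself is not reproduced here, the recursion θ_{k+1} = (1 − δ_k)θ_k + δ_k∇f(y_k) is our reading of the description of θ_{k+1} as a weighted average of past gradients favoring recent ones.

```python
import numpy as np

def lmo_l2(theta, radius):
    """argmin over {v : ||v||_2 <= radius} of <theta, v>."""
    nrm = np.linalg.norm(theta)
    return -radius * theta / nrm if nrm > 0 else np.zeros_like(theta)

def afw(grad_f, x0, radius, K=2000):
    """Sketch of AFW (Alg. 3) on an l2 ball; theta recursion is our assumption."""
    x, v = x0.copy(), x0.copy()
    theta = np.zeros_like(x0)                 # theta_0 = 0 as in Theorems 2-3
    for k in range(K):
        delta = 2.0 / (k + 3)
        y = (1 - delta) * x + delta * v       # momentum-averaged point
        theta = (1 - delta) * theta + delta * grad_f(y)   # averaged gradients
        v = lmo_l2(theta, radius)             # FW step on the averaged gradient
        x = (1 - delta) * x + delta * v       # convex combination: stays in X
    return x

# active-constraint example (Assumption 4 territory): ||b|| > radius,
# so the solution of min 0.5*||x - b||^2 over the unit ball is b / ||b||
b = np.array([3.0, -4.0])
x_afw = afw(lambda x: x - b, np.zeros(2), radius=1.0)
```

On this toy problem the averaged gradient immediately aligns with ∇f(x*), so the FW step is unique and the iterates home in on the boundary solution, which is the mechanism Section IV-B exploits.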

A. AFW convergence for general problems
The analysis of AFW relies on a tool known as estimate sequence (ES), introduced by [3]. ES is commonly adopted to analyze projection-based algorithms, see e.g., [31], [32], [38], [39], but is seldom used for FW. Formally, an ES is defined as follows.

Definition 1. A tuple ({Φ_k(x)}_{k=0}^∞, {λ_k}_{k=0}^∞), with λ_k ≥ 0 and lim_{k→∞} λ_k = 0, is called an estimate sequence of f if Φ_k(x) − f(x) ≤ λ_k [Φ_0(x) − f(x)], ∀x, ∀k.
An ES is generally not unique, and different constructions can be used to design different algorithms. To highlight our analysis technique, recall that quadratic surrogate functions {Φ_k(x)} are used for the derivation of AGM [3] (or see (12) in the Appendix). Different from AGM, and taking advantage of the compact constraint set, here we consider linear surrogate functions for AFW:

Φ_0(x) = f(x_0),   (7a)
Φ_{k+1}(x) = (1 − δ_k)Φ_k(x) + δ_k [f(y_k) + ⟨∇f(y_k), x − y_k⟩].   (7b)

As evidenced by the term in the bracket of (7b), which is a supporting hyperplane of f(x), Φ_{k+1}(x) is an approximated lower bound of f(x) constructed by weighting the supporting hyperplanes at {y_τ}_{τ=0}^{k}. Next, we show that (7), together with proper {λ_k}, forms an ES for f. Through the ES-based proof, it is also revealed that the link between the momentum in AGM and the FW step carries over to the technical level of the proofs.
Using properties of the functions in (7) (cf. Lemma 4 in Appendix E), the following lemma holds for AFW.
Leveraging Lemma 2, the convergence rate of AFW for general problems can be established.
Theorem 2. When Assumptions 1, 2 and 3 are satisfied, upon choosing δ_k = 2/(k+3) and θ_0 = 0, AFW guarantees f(x_k) − f(x*) = O(LD²/k).

Theorem 2 asserts that the convergence rate of AFW is O(LD²/k), coinciding with that of FW [7]. Notwithstanding, AFW is tight in terms of the number of FW steps required. To see this, note that the convergence rate in Theorem 2 translates to requiring O(LD²/ε) FW steps to guarantee f(x_k) − f(x*) ≤ ε. This matches the lower bound [7], [40]. Similar to other FW variants, acceleration for AFW cannot be claimed for general problems. AFW, however, is attractive numerically, because it can alleviate the zig-zag behavior of FW, as we will see in Section V.
Why can acceleration not be achieved in general? Recall from Lemma 2 that critical to acceleration is ensuring a small ξ_k, which in turn requires v_{k+1} and v_k to stay sufficiently close. This is difficult in general, because the non-uniqueness of v_k prevents one from ensuring a small upper bound on ‖v_{k+1} − v_k‖. The ineffectiveness of momentum in AFW in turn signifies the importance of the regularizer added in the AGM momentum update (4).

B. AFW acceleration for a class of problems
In this subsection, we provide constraint-dependent accelerated rates for AFW when X is a ball induced by some norm. Even for projection-based algorithms, most accelerated rates are obtained with L-dependent step sizes [41]. Thus, faster rates for parameter-free algorithms are challenging to establish. An extra assumption is needed in this subsection.

Assumption 4. The constraint is active at the optimum; that is, ‖∇f(x*)‖ ≥ G for some constant G > 0.
To analyze the convergence of FW iterations, it is reasonable to rely on the position of the optimal solution, which justifies why this assumption is also adopted in [19], [26], [42], [43]. For a number of signal processing and machine learning tasks, Assumption 4 is rather mild. Relying on Lagrangian duality, it can be seen that problem (1) with a norm ball constraint is equivalent to the regularized formulation min_x f(x) + γg(x), where γ ≥ 0 is the Lagrange multiplier and g(x) denotes some norm. In view of this, Assumption 4 simply requires γ > 0 in the equivalent regularized formulation; that is, the norm ball constraint plays the role of a regularizer. Given the prevalence of regularized formulations, it is worth investigating their equivalent constrained form (1) under Assumption 4. Next, we use the ℓ2 norm ball constraint to illustrate the intuition behind the acceleration.
ℓ2 norm ball constraint. Consider X := {x | ‖x‖₂ ≤ D/2}. In this case, v_{k+1} admits the closed-form solution

v_{k+1} = −(D/2) θ_{k+1}/‖θ_{k+1}‖₂.   (8)

The uniqueness of v_{k+1} is ensured by its closed-form solution, wiping out the obstacle to a faster rate. In addition, through (8) it becomes possible to guarantee that v_{k+1} and v_k are close whenever θ_k is close to θ_{k+1}.
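The closed form (8) is one line of code; the radius and test vector below are arbitrary.

```python
import numpy as np

def lmo_l2(theta, r):
    """Closed-form FW step over {x : ||x||_2 <= r}: v = -r * theta / ||theta||_2,
    unique whenever theta is nonzero."""
    return -r * theta / np.linalg.norm(theta)

v = lmo_l2(np.array([3.0, 4.0]), 2.0)  # a boundary point opposing theta
```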
Theorem 3. If Assumptions 1, 2, 3 and 4 are satisfied, and X is an ℓ2 norm ball, then choosing δ_k = 2/(k+3) and θ_0 = 0, AFW guarantees an accelerated convergence rate f(x_k) − f(x*) = O(C/k²) for k ≥ T, where C and T are constants depending on L, D and G.
Theorem 3 demonstrates that momentum improves the convergence of FW by providing a faster rate. Roughly speaking, when the iteration number satisfies k ≥ T, the rate of AFW dominates that of FW. We note that this matches our intuition, namely that the momentum in AGM (4) only behaves like an affine function when k is large (so that the weight on the regularizer is small). In addition, the rate in Theorem 3 can be written compactly in an Õ(1/k²) form, ∀k; hence AFW achieves acceleration, albeit with a worse dependence on D compared with vanilla FW. Note that the choices of δ_k and θ_0 remain the same as those used for general problems, leading to an implementation identical to the non-accelerated case. Compared with CGS, AFW sacrifices the D dependence in the convergence rate to trade for i) not requiring knowledge of L and D, and ii) solving only one FW subproblem per iteration (whereas up to O(k) subproblems are needed in CGS).
ℓ1 norm ball constraint. For the sparsity-promoting constraint X := {x | ‖x‖₁ ≤ R}, the FW steps can be solved in closed form. Taking v_{k+1} as an example, we have

v_{k+1} = −R sign([θ_{k+1}]_{i*}) e_{i*}, with i* = arg max_j |[θ_{k+1}]_j|,

where e_{i*} denotes the i*th canonical basis vector. We show in the Appendix (Theorem 5) that when Assumption 4 holds and the set arg max_j |[∇f(x*)]_j| has cardinality 1, a faster rate O(T₁LD²/k²) can be obtained. The additional assumption here is known as strict complementarity, and it has also been adopted in, e.g., [44], [45] for analysis.
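A short sketch of this ℓ1 FW step (variable names are ours); it returns a scaled signed coordinate vector, which is what makes FW iterates sparse on ℓ1 balls.

```python
import numpy as np

def lmo_l1(theta, R):
    """FW step over the l1 ball {x : ||x||_1 <= R}: a signed, scaled canonical
    basis vector at the entry of theta with largest magnitude."""
    i = int(np.argmax(np.abs(theta)))
    v = np.zeros_like(theta)
    v[i] = -R * np.sign(theta[i])
    return v
```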
ℓp norm ball constraint. Consider an active ℓp norm ball constraint X := {x | ‖x‖_p ≤ R}, where p ∈ (1, +∞) and p ≠ 2. The ith entry of v_{k+1} is found in closed form as

[v_{k+1}]_i = −R sign([θ_{k+1}]_i) |[θ_{k+1}]_i|^{q−1} / ‖θ_{k+1}‖_q^{q−1},

where 1/p + 1/q = 1. We discuss in Appendix K that faster rates are possible under mild conditions. Though not covering all cases, this still showcases that momentum is at least partially helpful for parameter-free FW algorithms.
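The same closed form coded for a general ℓp ball; for p = 2 it reduces to the ℓ2 expression, which serves as a quick sanity check, and the returned vector always lies on the boundary ‖v‖_p = R.

```python
import numpy as np

def lmo_lp(theta, R, p):
    """FW step over {x : ||x||_p <= R}, p in (1, inf):
    v_i = -R * sign(theta_i) * |theta_i|^(q-1) / ||theta||_q^(q-1), 1/p + 1/q = 1."""
    q = p / (p - 1)                           # Hoelder conjugate
    a = np.abs(theta) ** (q - 1)
    return -R * np.sign(theta) * a / (np.linalg.norm(theta, q) ** (q - 1))
```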
Beyond ℓp norm balls. In general, when a specific structure of x* (e.g., sparsity) is promoted by X (so that x* is likely to live on the boundary), and one can ensure the uniqueness of v_k through either a closed-form solution or a specific implementation, acceleration can be effected. A direct extension of the results in this subsection to the matrix space arises when the constraint is a Schatten-p norm ball. This is because ‖X‖_p := ‖(σ₁(X), σ₂(X), . . . , σ_r(X))‖_p, where σ_i(X) denotes the ith singular value of X. Our numerical results confirm the acceleration in Section V-B.

V. NUMERICAL TESTS
We validate our theoretical findings, as well as the efficiency of AFW, on two benchmark machine learning problems: binary classification and matrix completion. All numerical experiments are performed using Python 3.7 on a desktop equipped with an Intel i7-4790 CPU @ 3.60 GHz and 32 GB of RAM. Additional numerical tests using other loss functions and constraints can be found in Appendix L.

A. Binary classification
Logistic regression for binary classification is adopted to test AFW. The objective function is

f(x) = (1/n) Σ_{i=1}^{n} log(1 + exp(−b_i⟨a_i, x⟩)),

where (a_i, b_i) is the (feature, label) pair of datum i, and n is the total number of data samples. Datasets from LIBSVM² are used in the numerical tests presented. Details regarding the datasets are summarized in Table II, where d is the dimension of x, n is the number of data, and 'nonzeros' refers to the percentage of nonzero entries in {a_i}_{i=1}^{n}, reflecting the sparsity of the dataset. The constraint sets considered include ℓ1 and ℓ2 norm balls. As benchmarks, the chosen algorithms are: projected GD with the standard step size 1/L; parameter-free FW with step size 2/(k+2) [7]; and projected AGM with parameters set according to [4]. The step size of AFW is δ_k = 2/(k+3), according to Theorems 2 and 3. Note that neither GD nor AGM is parameter-free.
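For concreteness, the logistic objective and its gradient can be coded as follows; the numerically stable logaddexp evaluation is an implementation choice on our part.

```python
import numpy as np

def logistic_loss(x, A, b):
    """f(x) = (1/n) * sum_i log(1 + exp(-b_i * <a_i, x>)), with rows of A as a_i
    and labels b_i in {-1, +1}; log(1 + e^z) evaluated stably as logaddexp(0, z)."""
    z = -b * (A @ x)
    return np.mean(np.logaddexp(0.0, z))

def logistic_grad(x, A, b):
    """Gradient: (1/n) * sum_i -b_i * sigmoid(-b_i * <a_i, x>) * a_i."""
    z = -b * (A @ x)
    s = 1.0 / (1.0 + np.exp(-z))      # sigmoid(z)
    return A.T @ (-b * s) / len(b)
```

This gradient oracle is all that FW, AFW, GD, and AGM need; only the latter two additionally require L.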
We first let X be an ℓ2 norm ball with a large enough radius so that ‖∇f(x*)‖ ≈ 10⁻⁴. This case maps to our result in Theorem 2, where the convergence rate of AFW is O(1/k). The performance of AFW is shown in Fig. 2. On dataset a9a, AFW slightly outperforms GD and FW, but is slower than AGM. Evidently, AFW is much more stable than FW, as one can see from the shaded areas that illustrate the zig-zag range.
Next, we consider active ℓ2 norm ball constraints, where the diameter of X is chosen to optimize the generalization performance on the validation dataset. In this case, our result in Theorem 3 applies, and AFW achieves an Õ(1/k²) convergence rate. The performance of AFW is listed in the first row of Fig. 3. In all tested datasets, AFW significantly improves over FW, while on datasets other than covtype, AFW also outperforms AGM, especially on mushroom.

² https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
When the constraint set is an ℓ1 norm ball, the performance of AFW is depicted in the second row of Fig. 3. It can be seen that on datasets such as covtype and mnist, AFW exhibits performance similar to AGM, which is significantly faster than FW, while on dataset mushroom, AFW converges even faster than AGM. Note that comparing AFW with AGM is not entirely fair, since each FW step requires at most d operations, while projection onto an ℓ1 norm ball as in [46] takes cd operations for some c > 1. This means that for the same running time, AFW will run more iterations than AGM. We keep this unfavorable comparison to highlight how the optimality errors of AFW and AGM evolve with k.

B. Matrix completion
We then consider matrix completion problems, which are ubiquitous in recommender systems. Consider a matrix A ∈ R^{m×n} with partially observed entries; that is, entries A_{ij} for (i, j) ∈ K are known, where K ⊂ {1, . . . , m} × {1, . . . , n}. Note that the observed entries can also be contaminated by noise. The task is to predict the unobserved entries of A. Although this problem can be approached in several ways, within the scope of recommender systems a commonly adopted empirical observation is that A is low rank [47]-[49]. Hence the problem to be solved is

min_{‖X‖_* ≤ R} Σ_{(i,j)∈K} (X_{ij} − A_{ij})²,   (11)

where ‖X‖_* denotes the nuclear norm of X, which is leveraged to promote a low-rank solution. Problem (11) is difficult to solve via GD or AGM, because projection onto a nuclear norm ball is expensive. On the contrary, FW and its variants are more suitable for (11), given that the FW step can be solved easily and the update promotes low-rank solutions directly [1].
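The FW step for (11) only needs the top singular pair of the gradient: the minimizer over the nuclear norm ball of ⟨G, V⟩ is −R u₁v₁ᵀ. A sketch using a dense SVD for clarity (at scale, a truncated solver such as power iteration would be used instead):

```python
import numpy as np

def lmo_nuclear(G, R):
    """FW step over the nuclear norm ball {V : ||V||_* <= R}:
    returns -R * u1 v1^T, with (u1, v1) the top singular vectors of G."""
    U, s, Vt = np.linalg.svd(G)
    return -R * np.outer(U[:, 0], Vt[0, :])
```

Each FW step therefore adds one rank-one term, which is why FW iterates remain low rank when initialized well.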
We test AFW and FW on the widely used dataset MovieLens100K, where 1682 movies are rated by 943 users, with 6.30% of the ratings observed. The initialization and data processing are the same as those used in [1]. The numerical performance can be found in Fig. 4. In subfigures (a) and (b), we plot the optimality error and the rank versus k, choosing R = 3. The choice of R is based on the number of different movie categories. It is observed that AFW exhibits improvement in terms of both the optimality error and the rank of the solution. In particular, AFW roughly achieves a 1.4x improvement over FW in terms of optimality error, and finds solutions with much lower rank.

VI. CONCLUSIONS
We built links between the momentum in AGM and the FW step by observing that both minimize an (approximated) lower bound of the objective function. Exploring this link, we showed how momentum benefits parameter-free FW. In particular, a momentum variant of FW, which we term AFW, was proved to achieve a faster rate on active ℓp norm ball constraints while maintaining the same convergence rate as FW on general problems. AFW thus performs no worse than FW, while offering the possibility of acceleration. Numerical experiments validate our theoretical findings, and suggest that AFW is promising for binary classification and matrix completion.

A. Proof of Theorem 1
The convergence of x_k is given in [41], and hence is not repeated here. Next we show the behavior of y_k and v_k.
We use the same surrogate functions as those in [41], i.e., (12). In [41], it is shown that these form an estimate sequence with λ_0 = 1, and that Φ_{k+1}(x) can be rewritten in canonical form. We use these conclusions directly. Rearranging terms, using that (a) Φ_k(x) − f(x) ≤ λ_k [Φ_0(x) − f(x)] by Definition 1 and that f(x_k) ≤ Φ*_k as shown in [3], and then choosing x as x*, we obtain the claimed bound on v_k. Hence the behavior of v_k in Theorem 1 is proved.
To prove the convergence of y_k, the corresponding inequality holds as a result of (13). Next, we link ∇f(y_k) with this bound; rearranging terms, we obtain the convergence of ‖∇f(y_k)‖², that is, ‖∇f(y_k)‖² = O(1/k²).

Theorem 4. Suppose Assumptions 1 and 2 hold, and choose δ_k = 2/(k+2); per iteration k, let w denote the corresponding averaging weights. Then the resulting weighted combination provides an estimate of f(x*).

Proof. It is easy to verify the claim, where (a) follows from the convexity of f. Combining with (14) and summing over k (recalling v_0 = x_0), the definition of w completes the proof.

C. AGM Links with FW in strongly convex case
We showcase the connection between the momentum update of AGM in the strongly convex case and FW. We first formally define strong convexity, which is used in this subsection only.

Assumption 5. (Strong convexity.) The function f is µ-strongly convex; that is, f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (µ/2)‖y − x‖², ∀x, y.
Under Assumptions 1 and 5, the condition number of f is κ := L/µ. To cope with strongly convex problems, Lines 4-6 in AGM (Alg. 2) should be modified to (16) as in [3], where δ = 1/√κ. Here v_{k+1} in (16c) denotes the momentum, and it thus plays the critical role for acceleration. To see how v_{k+1} is linked with FW, we rewrite v_{k+1} in terms of z_{k+1}. Notice that z_{k+1} is the minimizer of a lower bound of f(x) (due to strong convexity). Therefore, the v_{k+1} update is similar to FW in the sense that it first minimizes a lower bound of f(x), and then updates through a convex combination (cf. Alg. 1). This demonstrates that the momentum update in AGM shares the same idea as the FW update. A few basic lemmas for all the proofs in Section IV are provided below.

D. Proof of Lemma 1.
Proof. We show this by induction. Because …, where (a) is because of the convexity of f, and the last equation is by the definition of λ_{k+1}. Together with the fact that lim_{k→∞} λ_k = 0, the tuple ({Φ_k(x)}_{k=0}^∞, {λ_k}_{k=0}^∞) satisfies the definition of an estimate sequence.
where the last inequality is because of Definition 1. Subtracting f(x*) on both sides, we arrive at … (7) can be rewritten as … with …

Proof. We prove this lemma by induction. First, … Clearly, since Φ_{k+1}(x) is linear in x, the slope is … In addition, because v_{k+1} is defined as the minimizer of Φ_{k+1}(x) over X, from (20) we have v_{k+1} = argmin_{x∈X} ⟨x, θ_{k+1}⟩. Then, since Φ*_{k+1} is defined as in (20), we have … The proof is thus completed.
F. Proof of Lemma 2.
Proof. We prove this lemma by induction. First, by definition, … Using (19c), we have …, and plugging in the definition of ξ_{k+1}, the proof is completed.

G. Proof of Theorem 2
Proof. Since Lemma 2 holds, one can directly apply Lemma 3 to obtain …, where ξ_k is defined in Lemma 2. Clearly, ξ_k ≥ 0, ∀k, and we can find an upper bound for it in the following manner.

H. Supporting lemmas for Theorem 3
The basic idea is to show that, under Assumptions 1, 2, 3, and 4, ‖v_k − v_{k+1}‖_2 is small enough when k is large. To this end, we will make use of the following lemmas. We first show that the value of ∇f(x*) is unique.
Lemma 6. If both x*_1 and x*_2 minimize f(x) over X, then we have ∇f(x*_1) = ∇f(x*_2).

Proof. From Lemma 5, we have … = 0, where (a) is by the optimality condition, that is, ⟨∇f(x*_1), x − x*_1⟩ ≥ 0, ∀x ∈ X. Hence we can only have ∇f(x*_2) = ∇f(x*_1). This means that the value of ∇f(x*) is unique regardless of the uniqueness of x*.

Lemma 7. Choose δ_k = 2/(k+3) and let …, where …

Proof. By convexity, …, where (a) is by Theorem 2. Next, using Lemma 5, we have … The proof is thus completed.
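Lemma 6 can be illustrated numerically; the box constraint and objective below are our own toy example, not from the text:

```python
import numpy as np

# f(x) = (x1 - 2)^2 over the box {x : ||x||_inf <= 1} has many minimizers
# (x1 = 1, x2 arbitrary in [-1, 1]), yet grad f coincides at all of them.
def grad_f(x):
    return np.array([2.0 * (x[0] - 2.0), 0.0])

x_star_1 = np.array([1.0, -0.5])
x_star_2 = np.array([1.0, 0.7])
g1, g2 = grad_f(x_star_1), grad_f(x_star_2)
# g1 == g2 == [-2, 0]: the gradient at the optimum is unique even though
# the minimizer is not
```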
Lemma 8. In addition, there exists a constant C_1 such that …

Proof. First we have … Noticing that …, where (a) follows from Lemma 7 and Assumption 4.
Then, to find C_2, we have …, where in (b) we use k + 3 ≥ 3 and …, ∀k ≥ T.

Lemma 9. In addition, it is guaranteed to have … for any k ≥ T + 1.

Proof. Consider a specific k with ‖θ_{k+1}‖ < G/2 satisfied. In this case we have … From Lemma 8, we have … From this inequality we can observe that ‖θ_{k+1}‖ can be less than … Hence, the first part of this lemma is proved.
For the upper bound of ‖v_{k+1} − v_k‖, we only consider the case where θ_{k+1} ≠ 0, since otherwise v_{k+1} = v_k and the lemma holds automatically. For any k ≥ T + 1, from (8), one can rewrite … Using this relation, the RHS of (23) becomes …
Plugging back into (23), the proof is completed.
I. Proof of Theorem 3.
Proof. We first consider the constraint set being an ℓ2 norm ball. From Lemma 2, we can write …, where in (a) T is defined in Lemma 9; (b) is by Lemma 9 and Assumption 4; and in the last equation constants are hidden in the big-O notation.
Finally, applying Lemma 3, we have … Plugging ξ_k in, the proof is completed. When the constraint set is an ℓ1 norm ball, the basic proof idea is similar to the ℓ2 norm ball case, i.e., after T iterations v_k and v_{k+1} are close to each other. The only difference is that a regularization condition should be satisfied to ensure the uniqueness of v_k (needed only for the proof, not for the implementation). There are multiple kinds of regularization schemes, for example, [∇f(x*)]_i − [∇f(x*)]_j = c > 0, where i and j index the largest and second largest entries of ∇f(x*), respectively. In this case, we only need to modify the T in Lemma 9 to a c-dependent constant, and all the other proofs follow.

J. ℓ1 norm ball
In this subsection we focus on the convergence of AFW for an ℓ1 norm ball constraint under the assumption that argmax_j [∇f(x*)]_j has cardinality 1 (which naturally implies that the constraint is active). Note that in this case Lemma 6 still holds, hence the value of ∇f(x*) is unique regardless of the uniqueness of x*. This assumption directly leads to argmax_j … When X = {x | ‖x‖_1 ≤ R}, the FW steps of AFW can be solved in closed form. We have v_{k+1} = [0, …, 0, −sgn([θ_{k+1}]_i) R, 0, …, 0]^⊤, i.e., only the i-th entry is nonzero, with i = argmax_j |[θ_{k+1}]_j|.

Lemma 10. There exists a constant T (which is independent of k) such that, whenever k ≥ T, it is guaranteed that …

Proof. In this proof, we denote i = argmax_j |[∇f(x*)]_j| for convenience. It can be seen that Lemma 8 still holds.
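The closed-form ℓ1-ball FW step above can be sketched as follows (function and variable names are ours):

```python
import numpy as np

def fw_step_l1(theta, R):
    """Closed-form FW step over the l1 ball {x : ||x||_1 <= R}:
    argmin_{||v||_1 <= R} <v, theta> puts all mass R on the entry of
    theta with the largest magnitude, with the opposite sign."""
    i = np.argmax(np.abs(theta))
    v = np.zeros_like(theta, dtype=float)
    v[i] = -np.sign(theta[i]) * R
    return v

theta = np.array([0.5, -2.0, 1.0])
v = fw_step_l1(theta, R=3.0)
# only the entry with the largest |theta_j| (here j = 1) is nonzero
```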
We show that there exists T = (3C_2/λ + 1)² − 3 such that for all k ≥ T we have argmax_j |[θ_{k+1}]_j| = i, which further implies that only the i-th entry of v_{k+1} is nonzero. Since Lemma 8 holds, one can see that whenever k ≥ T, it is guaranteed that …

Proof. The proof for k < T + 1 is similar to that of Lemma 2, hence it is omitted here. For k ≥ T + 1, using a similar argument as in Lemma 2, we have …, where the last equation is because of Lemma 10.
Theorem 5. Consider X being an ℓ1 norm ball. If argmax_j [∇f(x*)]_j has cardinality 1 and Assumptions 1 – 3 are satisfied, AFW guarantees that …

Proof. Let T be defined the same as in Lemma 10. For convenience, denote ξ … Finally, applying Lemma 3, we have … Plugging ξ_k in completes the proof.

K. ℓp norm ball
In this subsection we focus on AFW with an active ℓp norm ball constraint X := {x | ‖x‖_p ≤ R}, where p ∈ (1, +∞) and p ≠ 2. We show that if the magnitude of every entry of ∇f(x*) is bounded away from 0, i.e., |[∇f(x*)]_i| ≥ λ > 0, ∀i, then AFW converges at rate O(1/k²). In such cases, the FW step in AFW can be solved in closed form; that is, the i-th entry of v_{k+1} can be obtained via [v_{k+1}]_i = −sgn(θ^i_{k+1}) |θ^i_{k+1}|^{q−1} / ‖θ_{k+1}‖_q^{q−1} · R, where 1/p + 1/q = 1. For simplicity, we emphasize the k dependence only and use O notation in this subsection. We also write θ^i_k in place of [θ_k]_i for notational simplicity; in other words, θ^i_k denotes the i-th entry of θ_k.

First, according to Lemma 8 and the equivalence of norms, we have ‖θ_k − ∇f(x*)‖_q = O(1/√k). Hence, there must exist T_1 such that ‖θ_k‖_q ≤ 2G, ∀k ≥ T_1. Next, using similar arguments as in the first part of Lemma 9, there must exist T_2 such that ‖θ_k‖_q ≥ G/2, ∀k ≥ T_2. In addition, again using similar arguments as in the first part of Lemma 9, there must exist T_3 such that |θ^i_k| > λ/2, ∀k ≥ T_3. Let T := max{T_1, T_2, T_3}.

Next we show that ‖v_{k+1} − v_k‖_2 = O(1/k), ∀k ≥ T. To start, using (25), one can obtain … Next, using G/2 ≤ ‖θ_{k+1}‖_q ≤ 2G, ∀k ≥ T, and |θ^i_{k+1}| ≤ ‖θ_{k+1}‖_q, we have … = O(‖θ_{k+1}‖_q^{q−1} − ‖θ_k‖_q^{q−1}) …

We first bound the first term on the RHS of (26). Let h(x) = x^{q−1}. Then, by the mean value theorem, we have h(y) = h(x) + h′(x)(y − x) + (1/2) h″(z)(y − x)², where z = (1 − α)x + αy for some α ∈ [0, 1]. Taking x = ‖θ_k‖_q and y = ‖θ_{k+1}‖_q, and using the fact that G/2 ≤ ‖θ_k‖_q ≤ 2G for k ≥ T, we have … = ‖θ_k‖_q^{q−1} + O(|‖θ_k‖_q − ‖θ_{k+1}‖_q| + |‖θ_k‖_q − ‖θ_{k+1}‖_q|²) = ‖θ_k‖_q^{q−1} + …, so the first term on the RHS of (26) is bounded by O(1/√k). Next we focus on the second term of (26) by considering whether θ^i_k and θ^i_{k+1} have different signs. Case 1: θ^i_k and θ^i_{k+1} have the same sign. Then we have …, where the last inequality uses the same mean-value-theorem argument as (27). Applying the same argument as in the proof of Theorem 3, we
have that when k ≥ T, ξ_{k+1} = Õ(1/k²). This further implies f(x_k) − f(x*) = Õ(1/k²) as well.
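The closed-form ℓp-ball step used in this subsection can be checked numerically; a minimal sketch (our names), verifying that the step lies on the boundary and attains Hölder equality:

```python
import numpy as np

def fw_step_lp(theta, R, p):
    """Closed-form FW step over the lp ball {x : ||x||_p <= R}, p in (1, inf):
    [v]_i = -sgn(theta_i) * |theta_i|^(q-1) / ||theta||_q^(q-1) * R,
    where 1/p + 1/q = 1."""
    q = p / (p - 1.0)
    signed = np.sign(theta) * np.abs(theta) ** (q - 1.0)
    return -R * signed / np.linalg.norm(theta, q) ** (q - 1.0)

theta = np.array([1.0, -2.0, 0.5])
v = fw_step_lp(theta, R=1.0, p=3.0)
# ||v||_p = R and <v, theta> = -R * ||theta||_q (Holder equality),
# so v indeed minimizes the linear subproblem over the lp ball
```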

L. Additional numerical tests
AFW is tested on other loss functions and constraints to demonstrate its efficiency.
n-support norm ball constraint. We first consider logistic regression over an n-support norm ball [50]. This is challenging due to the constraint X = conv{x | ‖x‖_0 ≤ n, ‖x‖_2 ≤ R}, where conv{·} denotes the convex hull. GD and AGM are expensive for such a constraint set since no efficient projection is known, while the FW subproblem can be solved easily [51]. For this reason, we only compare FW with AFW, and the numerical results depicted in Fig. 5 demonstrate that AFW outperforms FW.

Log-sum-exp loss. We also test AFW using the log-sum-exp loss function, that is, … We set n = 1,000 and d = 500, and draw a_i from a standardized normal distribution. The ℓ2 norm ball and n-support norm balls are used as constraints. The results in Fig. 6 corroborate that AFW outperforms FW.
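The exact log-sum-exp loss is elided above; assuming the common form f(x) = log Σ_i exp(a_i^⊤ x), a numerically stable evaluation can be sketched as:

```python
import numpy as np

def log_sum_exp_loss(x, A):
    """Stable evaluation of f(x) = log(sum_i exp(a_i^T x)); this standard
    form is an assumption for illustration, not necessarily the paper's."""
    z = A @ x          # inner products a_i^T x
    m = z.max()        # shift by the max for numerical stability
    return m + np.log(np.exp(z - m).sum())

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 500))   # n = 1000, d = 500 as in the text
x0 = np.zeros(500)
# at x0 = 0 all inner products vanish, so f(x0) = log(n)
```

The max-shift keeps the exponentials bounded by 1, which matters once the inner products grow during optimization.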

Fig. 2. Performance of AFW when the optimal solution is in the interior.
∀j. Hence, we have argmax_j |[θ_{k+1}]_j| = i. Then one can use the closed-form solution of the FW step to see that when k ≥ T, we have v_{k+1} − v_{k+2} = 0. The proof is thus completed.

Lemma 11. Let ξ_0 = 0 and let T be defined the same as in Lemma 10. Denote Φ*_k …

TABLE I: A COMPARISON OF FW VARIANTS WITH FASTER RATES