Fully Projection-Free Proximal Stochastic Gradient Method With Optimal Convergence Rates

Proximal stochastic gradient methods play an important role in large-scale machine learning and big data analysis. They iteratively update models within a feasible set until convergence. The computational cost is usually high due to the projection over the feasible set. To reduce this complexity, many projection-free methods such as Frank-Wolfe methods have been proposed. However, those projection-free methods have to solve a linear programming problem for every model update, which still incurs a high computational cost for a complex feasible set and can be unbearable in practical scenarios. Motivated by this problem, we propose a fully projection-free proximal stochastic gradient method, which has two advantages over previous methods. First, it enjoys high efficiency. The proposed method does not conduct projection directly but finds an approximately correct projection point at a very low computational cost. Second, it achieves tight and optimal convergence rates. Our theoretical analysis shows that the proposed method achieves convergence rates of O(1/√T) and O(log T / T) for convex and strongly convex functions, respectively. These convergence rates match the known lower bounds. Therefore, in this paper, we provide a valuable insight: some loss of accuracy in the projection can improve the efficiency significantly without impairing the convergence rates. Finally, empirical studies show that the proposed method achieves more than 5× speedup over previous methods.


I. INTRODUCTION
The proximal stochastic gradient method is widely used to solve large-scale machine learning problems such as support vector machines [1], [2] and logistic regression [3]. Generally, it iteratively finds a descent direction and then updates the model within a feasible set by following that direction until convergence. Finding such a descent direction requires solving a constrained convex programming problem. Specifically, the standard projected proximal gradient method usually needs to solve a Quadratic Programming (QP) problem with convex constraints to find the descent direction, which amounts to applying a projection operation over the feasible set. However, for high-dimensional or complex learning problems, solving the QP problem can be impractical, since the projection step is time consuming and can become the computational bottleneck.
To reduce this computational cost, many projection-free methods such as conditional gradient methods [4] and variants of Frank-Wolfe methods [5]-[7] have been proposed. These methods solve a Linear Programming (LP) problem instead of a QP problem to find the descent direction. In this regard, they enjoy higher efficiency than the projected methods. However, solving the LP problem can still lead to a high computational cost when the feasible set is complex. Moreover, since there are usually hundreds or even thousands of iterations, the total computational cost can be extremely high, and thus unbearable in practical scenarios. Therefore, we are motivated by the following questions:
• Is there a fully projection-free proximal gradient method, which does not have to solve an LP problem to update models?
• Does such a projection-free proximal method achieve tight and optimal convergence rates?
In this paper, we give positive answers to these motivating questions. First, we propose a new projection-free proximal method. Inspired by [8], our basic idea is to use an approximate projection point to update models, instead of finding the exact projection point. The insight is to find an approximately exact direction, and then to search for a boundary point along that direction as the final ''projection'' point. The accuracy of the approximate direction is essential, and is significantly impacted by the curvature of the set. The method thus does not have to solve an LP problem, which improves the efficiency significantly.
Second, our theoretical analysis shows that the proposed method achieves tight and optimal convergence rates, namely O(1/√T) and O(log T / T) for convex and strongly convex functions, respectively. We prove that although the approximate projection leads to some loss of accuracy, it does not impair the convergence theoretically. Additionally, extensive empirical studies show that the proposed method has almost the best convergence performance and the highest efficiency over the previous methods. In particular, compared with previous methods, the proposed method can achieve more than 5× speedup. Our major contributions are highlighted as follows:
• We propose a new fully projection-free proximal stochastic gradient method, which enjoys high efficiency.
• The proposed method achieves an O(1/√T) convergence rate for convex functions, which matches the known lower bound and is thus tight and optimal.
• The proposed method achieves an O(log T / T) convergence rate for strongly convex functions, which is likewise tight and optimal.
The rest of the paper is organized as follows: Section II discusses the related work. Section III sets up the problem formally. Our new projection-free algorithm and its detailed convergence analysis are presented in Section IV and Section V, followed by experimental details in Section VI. Finally, Section VII concludes the paper along with future research directions.

II. RELATED WORK
Recent years have witnessed a proliferation of studies on projection-free learning and optimization. Here, we briefly review the previous studies and divide the related work into the following broad categories.

A. PROJECTION-FREE OFFLINE OPTIMIZATION
The conditional gradient descent method (a.k.a. the Frank-Wolfe algorithm) was originally proposed by Frank and Wolfe [9] for offline smooth optimization over polyhedral sets. It was further extended to semi-definite programming by Hazan [10] and to general convex optimization by Jaggi [5]. Braun et al. [11] proposed a blended conditional gradient approach by combining the conditional gradient with gradient-based steps. Although these projection-free techniques have been effective in the offline setting, they have yet to be as extensively applied in the online setting.

B. PROJECTION-FREE ONLINE OPTIMIZATION
The online variant of conditional gradient descent was proposed by Hazan and Kale [12], which achieves a suboptimal O(T^{3/4}) regret. An improved online conditional gradient descent method was designed by Garber and Hazan [13] for smooth and strongly convex optimization, achieving the optimal O(√T) regret. In addition, many other methods have been proposed to handle cumulative constraints [14]-[16]. Recently, Hazan and Minasyan [17] proposed a randomized projection-free algorithm for general online convex optimization with smooth loss functions, guaranteeing an O(T^{2/3}) regret. However, these algorithms can perform arbitrarily poorly if supplied with stochastic gradient estimates.

C. PROJECTION-FREE STOCHASTIC OPTIMIZATION
Johnson and Zhang [18] introduced the variance reduction technique for stochastic gradient descent. It was extended by Xiao and Zhang [19] to the proximal setting. Nitanda [20] applied this technique to accelerate proximal stochastic gradient in the mini-batch setting. However, each of these methods requires an expensive projection step at each iteration. Mahdavi et al. [21] introduced primal-dual machinery for a nearly projection-free algorithm that needs only one projection at the last iteration. This method was later developed in [22] for a complex inequality constraint. Hazan and Luo [23] devised a conditional gradient descent algorithm based on the variance reduction technique. Lan and Zhou [24] proposed a stochastic conditional gradient sliding algorithm using Nesterov's acceleration method. It was further improved in [4] by introducing the lazy conditional gradient. Xie et al. [25] introduced the recursive variance reduction technique to reduce the noise of stochastic gradient estimates. Although these methods have enjoyed success in reducing the projection cost, they require either at least one projection or a linear optimization step, both of which can be prohibitive in large-scale learning problems.
Our work is closely related to the very recent fast approximate projection method proposed in [8]. It introduced the interior point technique into online gradient descent, achieving O(√T) and O(log T) regret for convex and strongly convex online optimization, respectively. Our proposed method is inspired by [8], but extends it to the proximal stochastic setting.

III. PROBLEM SETUP

A. NOTATIONS
Before presenting our work, we introduce some important notations. X represents the feasible set, and bound(X ) represents its boundary. ∇ and ∂ represent the gradient and sub-gradient operators, respectively. ‖a‖_q represents the ℓ_q norm of a with q ∈ {1, 2, ∞}, and ‖a‖ represents the ℓ_2 norm of a by default. E represents the mathematical expectation. The symbol ≲ represents ''less than up to a constant''; for example, a ≲ b means there exists a constant c > 0 such that a ≤ bc.
We consider the following optimization problem:

min_{x∈X} F(x) := f(x) + r(x), where f(x) = (1/n) Σ_{i=1}^{n} f_i(x). (1)

Here, f(x) is (strongly) convex, but may be non-smooth. Similarly, r(x) is convex, but may be non-smooth. The feasible set X is denoted by

X := {x : h(x) ≤ 0}, (2)

where h(x) is convex and smooth with respect to x. Many machine learning problems can be formulated in this form; we now give a typical example.

Example 1: Let us take risk-averse learning as an example. Generally, risk-averse learning aims to learn the optimal model while preventing its performance from varying vastly across different instances. It usually carries a constraint that bounds the variance of the per-instance losses by ι, where ι is a given positive constant. Portfolio management is one such task.
Suppose an investor has d assets in the market. We let r_t ∈ R^d_+ denote their rewards observed at time t ∈ [n]. Portfolio management aims to maximize the mean of total rewards with bounded risk. In this task, f_t(x) = −⟨r_t, x⟩, and the constraint bounds the variance. Here, r(x) = λ‖x‖_1, and λ is a given constant.
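To make the formulation concrete, the following Python sketch instantiates F(x) = f(x) + r(x) and a variance-type constraint h(x) for the portfolio task. The exact functional form of h and the function names are illustrative assumptions of ours, not definitions from the paper.

```python
import numpy as np

def portfolio_objective(x, R, lam):
    """F(x) = f(x) + r(x) for the portfolio task.

    f(x) = (1/n) * sum_t f_t(x) with f_t(x) = -<r_t, x>,
    r(x) = lam * ||x||_1.
    R is the n-by-d matrix of observed rewards (one row per round t).
    """
    f = -(R @ x).mean()           # average negative reward
    r = lam * np.abs(x).sum()     # l1 regularizer
    return f + r

def variance_constraint(x, R, iota):
    """h(x) = Var_t(f_t(x)) - iota <= 0 bounds the reward variance.

    This concrete form of h is an illustrative assumption; the paper
    only states that the feasible set bounds the variance by iota.
    """
    losses = -(R @ x)
    return losses.var() - iota
```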

B. PROXIMAL PROJECTED GRADIENT DESCENT
The proximal projected gradient method is often used to solve the optimization problem (1). It performs the (t+1)-th model update by

x_{t+1} = argmin_x { (1/2)‖x − (x_t − η_t g_t)‖^2 + η_t r(x) + I_X(x) }, (3)

where η_t is the learning rate, g_t is an unbiased estimate of a sub-gradient f_t ∈ ∂f(x_t), and I_X(x) = 0 if x ∈ X , and ∞ otherwise. Thus, at each iteration t = 0, 1, 2, . . ., the proximal gradient method moves the model in the opposite direction of the gradient, and then projects onto X . A stochastic variant of proximal projected gradient is stochastic proximal gradient descent. In this setting, we randomly pick i_t from {1, 2, . . . , n} at each iteration t = 1, 2, . . ., play x_t, query a stochastic gradient g_{t,i_t} ∈ ∂f_{i_t}(x_t), and update the model by

x_{t+1} = argmin_x { (1/2)‖x − (x_t − η_t g_{t,i_t})‖^2 + η_t r(x) + I_X(x) }. (4)

Stochastic proximal gradient descent only needs to compute a single stochastic gradient g_{t,i_t} at each iteration; in contrast, proximal projected gradient requires the computation of n gradients. Although the per-iteration computational cost of stochastic proximal gradient is 1/n that of the proximal projected gradient, the projection step is still computationally expensive, making stochastic proximal gradient impractical for large-scale optimization problems.
Combining (3) and (4), we can write the model update rule as

x_{t+1} = argmin_{x∈X} { (1/2)‖x − (x_t − η_t g_t)‖^2 + η_t r(x) }, (5)

where r(x) is a regularizer, e.g., r(x) = ‖x‖_1. When r(x) is convex with respect to x, (5) is a constrained QP problem. Solving (5) usually incurs a very high computational cost. Notice that there are typically hundreds of model updates or more, and every update has to conduct one projection over X . Therefore, the total computational cost is still extremely high for large-scale optimization problems.
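When r(x) = λ‖x‖_1, the proximal part of update (5) without the constraint X has a well-known closed form, the soft-thresholding operator; only the projection onto X is expensive. A minimal Python sketch (the function names are ours):

```python
import numpy as np

def soft_threshold(v, tau):
    """Closed-form prox of tau*||.||_1:
    argmin_x 0.5*||x - v||^2 + tau*||x||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_grad_step(x, g, eta, lam):
    """One unconstrained proximal stochastic gradient step:
    x_{t+1/2} = prox_{eta*lam*||.||_1}(x_t - eta*g_t).
    The remaining (expensive) part of update (5) is projecting
    x_{t+1/2} onto X, which FAP replaces by an approximate projection.
    """
    return soft_threshold(x - eta * g, eta * lam)
```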

IV. FULLY PROJECTION-FREE PROXIMAL STOCHASTIC GRADIENT
In this section, to efficiently solve the large-scale optimization problem, we propose a fully projection-free proximal stochastic gradient (FAP) method that largely reduces the computational cost. We first discuss the algorithm design and then analyze its implementation.
A. ALGORITHM DESIGN

The detailed procedure of our proposed FAP algorithm is shown in Algorithm 1. At each iteration t = 0, 1, 2, . . ., a stochastic gradient g_t of f_t, with f_t ∈ ∂f(x_t) and E[g_t] = f_t, is first computed. Based on the stochastic gradient g_t, we calculate the proximal gradient descent update x_{t+1/2} in (6). After this, a projection step is performed to project x_{t+1/2} onto X . In particular, two scenarios may happen: (i) x_{t+1/2} ∈ X ; in this case, the projection step directly returns x_{t+1/2}; (ii) x_{t+1/2} ∉ X ; we then need to project x_{t+1/2} to the closest point in X , which is very time-consuming.
The main idea behind reducing the computational complexity of the projection step is to replace it by the approximate projection (line 6). It was first proposed in [8] for online learning in the non-proximal setting. Here we extend it to our proximal setting. Our insight is to find an approximately exact direction, and then to search for a boundary point along that direction as the final ''projection'' point. The accuracy of the approximate direction is essential, and is significantly impacted by the curvature of the set. Specifically, consider a ∈ X and b ∉ X ; we use Figure 1 to illustrate how to find the approximate projection of b over X := {x : h(x) ≤ 0} based on a. We first find z, the intersection of the segment from a to b with the boundary of X (i.e., bound(X )). The approximate direction is the normal vector n at z, where the normal vector is defined as follows.

Definition 1 (Normal vector): Given a closed convex set X and z ∈ bound(X ), n is a normal vector to X at z if n · (y − z) ≤ 0, ∀y ∈ X . (8) Specifically, if ‖n‖ = 1, n is a unit normal vector.

Hence, the normal vector n at z can be calculated as n = ∇h(z)/‖∇h(z)‖. We then denote s = b − z. The projection of s onto the linear subspace orthogonal to n can be calculated as s − ⟨n, s⟩n. Therefore, the approximate projection is the intersection of the ray {b − ⟨n, s⟩n − θn : θ ≥ 0} with the boundary of X , i.e., the point AProj_X(a, b) in Figure 1. The algorithmic details are presented in Algorithm 2.
Recall that the full projection operation P_X(b) finds the closest point to b in X . Since n is the normal vector to X at z, b − ⟨n, s⟩n is intuitively closer to b than any other point in X . Notice that b − ⟨n, s⟩n may not be in X . In this case, we set b − ⟨n, s⟩n as the starting point, then search along the direction −n and find the closest feasible point to b − ⟨n, s⟩n in X .

Algorithm 2 AProj_X(a, b) [8]: Approximate Projection of b Over X := {x : h(x) ≤ 0} Based on a
Require: a ∈ X and b ∉ X .
1: Find z ∈ (a, b] ∩ bound(X ).
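The construction above can be sketched in Python as follows, assuming a differentiable h whose values and gradients we can query. The bisection tolerance and the expansion heuristic for θ are our own choices, not part of Algorithm 2.

```python
import numpy as np

def bisect_boundary(h, inside, outside, tol=1e-10):
    """Bisection on the segment from `inside` (h <= 0) to `outside`
    (h > 0): returns a point z close to the boundary with h(z) <= 0."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(inside + mid * (outside - inside)) <= 0:
            lo = mid
        else:
            hi = mid
    return inside + lo * (outside - inside)

def approx_projection(h, grad_h, a, b):
    """Sketch of AProj_X(a, b) for X = {x : h(x) <= 0}, with a in X
    and b outside X, following the construction described above."""
    z = bisect_boundary(h, a, b)       # boundary point on segment a-b
    n = grad_h(z)
    n = n / np.linalg.norm(n)          # unit normal vector at z
    s = b - z
    cand = b - np.dot(n, s) * n        # remove the normal component of s
    if h(cand) <= 0:
        return cand                    # already feasible
    # Otherwise search along -n for the first feasible point on the ray
    # {cand - theta * n : theta >= 0}; this assumes the ray enters X,
    # as in the geometry of Figure 1.
    theta = 1e-6
    while h(cand - theta * n) > 0:
        theta *= 2.0
    return bisect_boundary(h, cand - theta * n, cand)
```

For the unit ball h(x) = ‖x‖² − 1 with a at the origin, the approximate projection of an outside point coincides with the exact Euclidean projection onto the sphere.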

B. IMPLEMENTATION ANALYSIS
The implementation of Algorithm 1 is highly efficient. The most expensive step is finding the approximate projection point, that is, Algorithm 2. Specifically, Algorithm 2 needs to calculate z and θ*. Since h(a) ≤ 0 and h(b) > 0, we can use bisection search to find z, which requires O(log(1/ε)) queries of h(·) to achieve at most ε error, i.e., |h(z) − h(z_exact)| ≤ ε. Now suppose there is a point y ∈ X that also lies on the ray {b − ⟨n, s⟩n − θn : θ ≥ 0}. In this setting, we have h(y) ≤ 0 and h(b − ⟨n, s⟩n) > 0. Thus, we can again use bisection search to find the minimum θ*, which also requires O(log(1/ε)) queries of h(·) for ε error.
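To make the O(log(1/ε)) query count concrete, here is a small sketch of the bisection step with an explicit query counter (the function is illustrative, not the paper's implementation):

```python
def bisect_queries(h, lo, hi, eps):
    """Bisection for the boundary of {x : h(x) <= 0} on [lo, hi],
    assuming h(lo) <= 0 < h(hi).  Returns (estimate, query count)."""
    queries = 0
    while hi - lo > eps:
        mid = 0.5 * (lo + hi)
        queries += 1                 # one query of h per halving
        if h(mid) <= 0:
            lo = mid
        else:
            hi = mid
    # Each halving shrinks the interval by a factor of 2, so reaching
    # accuracy eps takes about log2((hi - lo)/eps) queries of h.
    return lo, queries
```

For example, on the unit interval with eps = 1e-6, bisection needs exactly 20 queries, matching log2(10^6) ≈ 19.9.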

V. CONVERGENCE ANALYSIS
In this section, we present the convergence analysis of the proposed FAP algorithm. Before presenting the convergence rates, we first state some assumptions that will be used in the theoretical analysis.
Assumption 1: We make some basic assumptions as follows.
• We assume both f (x) and r(x) are convex but non-smooth with respect to x.
• We assume ‖f‖ ≤ G and ‖e‖ ≤ H, where f ∈ ∂f(x) and e ∈ ∂r(x) for any x ∈ X .
• We assume g is an unbiased estimate of f with f ∈ ∂f(x), that is, E[g] = f, and E‖g − f‖ ≤ σ for any x ∈ X .
• We assume ‖x − y‖ ≤ R for any x ∈ X and y ∈ X .

Assumption 2: We assume f(x) is α-strongly convex with respect to x, that is,

f(y) ≥ f(x) + ⟨f', y − x⟩ + (α/2)‖y − x‖^2, where f' ∈ ∂f(x),

holds for any x ∈ X and y ∈ X .
It is trivial to verify that a general convex function is also a 0-strongly convex function. In the theoretical analysis, we measure the distance between the model yielded by Algorithm 1 and the optimal model x*, where x* = argmin_{x∈X} F(x).
Definition 2 (Smoothness): h(·) is L_h-smooth if ‖∇h(x) − ∇h(y)‖ ≤ L_h‖x − y‖ for any x ∈ X and y ∈ X .

Definition 3 (Global curvature): The global curvature µ of X measures how strongly bound(X ), the boundary of X , bends; its formal definition is given in [8].

Lemma 1 (Theorem 3.1 in [8]): Suppose h(·) is convex and smooth, and the feasible set X is defined by X := {u : h(u) ≤ 0} with global curvature µ. Given any a ∈ X , b ∉ X , and the approximate projection AProj_X(a, b) yielded by Algorithm 2, the following inequality holds for any y ∈ X . Lemma 1 is a guarantee of the approximate projection; we refer the reader to [8] for proof details.
Lemma 2: If η_t > 0, we have
Proof: According to the first-order optimality condition of (6), there is e ∈ ∂r(x_{t+1/2}) satisfying
Thus, we have
This completes the first part of the proof. According to Lemma 1, we have

This completes the proof.
Lemma 3: Given any vectors g, u_t ∈ R^d, u* ∈ R^d, and a constant scalar η > 0, if
Proof: It is trivial to obtain this by using Lemma 25 in [26].
Proof: According to Lemma 1, we have
Thus, we have
Step ① holds due to Lemma 3, step ② holds due to (9), and step ③ holds due to Lemma 2. Besides, we have
where step ① holds due to Lemma 2. Therefore, we have
This completes the proof. We next provide a meta analysis of the convergence rates (Theorem 1), which will be used to produce specific rates under additional assumptions on f.
Proof: We have
where f_t ∈ ∂f(x_t) and E[g_t] = f_t. According to Lemma 4, we have
Let us now bound I_2. We have
The last inequality holds due to the α-strong convexity of f(x).
The last inequality holds due to Lemma 2. Substituting these bounds, we complete the proof. According to Theorem 1, a larger µ leads to slower convergence. Recall that µ is the global curvature of the feasible set X , and a large µ means that X looks more like a ''circle''. Theorem 1 implies the following insight.
Insight 1: Theorem 1 shows that the convergence of Algorithm 1 is highly impacted by the geometry of the feasible set X .
This matches intuition. Consider a special case: h(x) is linear with respect to x. First, in this case, L_h = 0, and thus µ = 0. Theorem 1 is then exactly the same as the known result for the proximal projected gradient method. Second, note that the approximate projection is the exact projection in the case of linear h(x). Therefore, Algorithm 1 reduces to the traditional proximal projected gradient method, and their convergence rates are exactly the same for linear h(x).
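The linear case can be checked numerically: for a halfspace X = {x : c·x − d ≤ 0}, removing the normal component of s = b − z lands exactly on the Euclidean projection of b onto X. A small Python sketch (the specific c, d, a, b values are arbitrary):

```python
import numpy as np

# Halfspace X = {x : c.x - d <= 0}, i.e., a linear h.
c = np.array([1.0, 2.0])
d = 1.0
h = lambda x: c @ x - d

a = np.zeros(2)                      # a point in X (h(a) = -1 <= 0)
b = np.array([3.0, 1.0])             # a point outside X (h(b) = 4 > 0)

# Exact Euclidean projection of b onto the halfspace:
p_exact = b - (h(b) / (c @ c)) * c

# Approximate projection: z is where the segment a-b crosses the
# boundary; since h is affine along the segment, z is available in
# closed form.  The unit normal at z is n = c / ||c||, and removing
# the normal component of s = b - z recovers p_exact.
t = -h(a) / (h(b) - h(a))
z = a + t * (b - a)
n = c / np.linalg.norm(c)
s = b - z
p_approx = b - np.dot(n, s) * n
```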
Finally, under specific assumptions on f(x), we obtain the following convergence rates.
Corollary 1: Under Assumption 1, Algorithm 1 achieves O(1/√T) and O(log T / T) convergence rates for convex and strongly convex f(x), respectively. These rates match the known lower bounds; therefore, our convergence rates are tight and optimal. This implies the following insight.
Insight 2: Some loss of accuracy due to the approximate projection helps to improve the efficiency, and is not harmful to the convergence rates.

VI. EXPERIMENTS
In this section, we describe the learning tasks and experimental settings, and conduct a comprehensive evaluation.

A. LEARNING TASKS
As shown in Section III, we conduct empirical studies on two risk-averse learning tasks: Portfolio Management (PM) and Logistic Regression (LR). The feasible set is denoted by (2).
• Portfolio Management. Suppose an investor has d assets in the market, and let r_t ∈ R^d_+ denote their rewards observed at time t ∈ [n]. In this task, f_t(x) = −⟨r_t, x⟩, r(x) = λ‖x‖_1, and λ is a given constant.
• Logistic regression. Suppose the data matrix A ∈ R^{n×d} consists of n instances, each with d features. The label vector is y ∈ {1, −1}^n, and y_i ∈ {1, −1} is the label of the i-th instance A_i ∈ R^{1×d}. In this task, we let f_i(x) = log(1 + exp(−y_i A_i x)) and r(x) = λ‖x‖_1. Again, λ is a given constant.
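For reference, the per-instance logistic loss and its stochastic gradient can be sketched as follows (the function names are ours; a numerically safer implementation would guard the exponential for large margins):

```python
import numpy as np

def logistic_loss(x, A_i, y_i):
    """f_i(x) = log(1 + exp(-y_i * <A_i, x>))."""
    return np.log1p(np.exp(-y_i * (A_i @ x)))

def stochastic_grad(x, A, y, rng):
    """Unbiased estimate of a gradient of f: pick i uniformly at random
    and return grad f_i(x) = -y_i * sigmoid(-y_i * <A_i, x>) * A_i."""
    i = rng.integers(len(y))
    margin = -y[i] * (A[i] @ x)
    sigma = 1.0 / (1.0 + np.exp(-margin))   # sigmoid of the margin
    return -y[i] * sigma * A[i]
```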

B. EXPERIMENTAL SETTINGS
We collect eight widely used datasets from the internet. For the PM task, we conduct experiments on the FF49Industries, DowJones, FTSE100, and NASDAQComp datasets. For the LR task, we conduct experiments on the ijcnn1, a9a, w8a, and covtype datasets. Details of these datasets are presented in Table 1.
We compare our proposed method, FAP, with two benchmarks: 1) FWM, a strong projection-free optimization method [11], and 2) PGD, an efficient projection-based optimization method [27]. The optimal model is the same for all methods, and we use the empirical loss F(x) as the metric to measure their performance. The learning rate for all methods is set to η_t = 1/√t, and we set λ = 10^{−3} in all empirical studies. At every iteration, we use a mini-batch of instances to update models, with mini-batch size n/10, where n is the number of instances in a dataset (the LR datasets are available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html). Finally, we implement all methods in Matlab 2016b. All code is run on a server with an Intel i7 CPU, 16 GB memory, and 1 TB SSD storage.

Figures 2 and 3 evaluate the convergence performance: Figure 2 plots the results for the PM task and Figure 3 plots the results for the LR task. Figure 2 shows that FAP significantly outperforms both FWM and PGD with a large performance gap across all datasets. As the number of rounds increases, the performance gap remains stable in Figure 2, while the empirical losses of FWM and PGD stay at the same level as the rounds proceed. Figure 3 also shows that FAP always achieves the lowest empirical loss compared with the other methods. In particular, FAP is slightly better than PGD, and outperforms FWM by a large margin. The empirical loss of FAP is much lower during the initial training rounds than that of the other two methods. In addition, the performance gap is more pronounced on the largest dataset, covtype, which has 581,012 instances and 54 features. In summary, the proposed FAP method always achieves the best convergence performance among the compared methods, which validates our theoretical analysis.
The efficiency gap is due to the differences in the model updating approaches: PGD has to solve a quadratic programming problem to update models, and FWM has to solve a linear programming problem to complete the update. Compared with those two updating approaches, FAP updates the model in a fully projection-free manner by using proximal gradients, which greatly enhances the computational efficiency.

VII. CONCLUSION AND FUTURE WORK
We propose a new fully projection-free method, which updates models by using proximal gradients. Our theoretical analysis shows that the proposed method achieves tight and optimal convergence rates. The superiority of the proposed method has been validated by extensive empirical studies.
In the future, the proposed method can be extended to handle non-convex loss functions, or to work in decentralized or asynchronous settings.