Mini-Batch Semi-Stochastic Gradient Descent in the Proximal Setting

We propose mS2GD: a method incorporating a mini-batching scheme for improving the theoretical complexity and practical performance of semi-stochastic gradient descent (S2GD). We consider the problem of minimizing a strongly convex function represented as the sum of an average of a large number of smooth convex functions, and a simple nonsmooth convex regularizer. Our method first performs a deterministic step (computation of the gradient of the objective function at the starting point), followed by a large number of stochastic steps. The process is repeated a few times with the last iterate becoming the new starting point. The novelty of our method is in introduction of mini-batching into the computation of stochastic steps. In each step, instead of choosing a single function, we sample b functions, compute their gradients, and compute the direction based on this. We analyze the complexity of the method and show that it benefits from two speedup effects. First, we prove that as long as b is below a certain threshold, we can reach any predefined accuracy with less overall work than without mini-batching. Second, our mini-batching scheme admits a simple parallel implementation, and hence is suitable for further acceleration by parallelization.


I. INTRODUCTION
T HE problem we are interested in is to minimize a sum of two convex functions, where R is a separable, convex, possibly nonsmooth function and F is the average of a large number of smooth convex functions f i , i.e., We further make the following assumptions: Hence, the gradient of F is also Lipschitz continuous with the same constant L. Assumption 2. P is strongly convex with parameter µ > 0, i.e., ∀x, y ∈ dom(R), where ∂P (x) is the subdifferential of P at x.
Let µ F ≥ 0 and µ R ≥ 0 be the strong convexity parameters of F and R, respectively (we allow both of them to be equal to 0, so this is not an additional assumption).We assume that we have lower bounds available (ν F ∈ [0, µ F ] and ν R ∈ [0, µ R ]).

A. Our contributions
In this work, we combine the variance reduction ideas for stochastic gradient methods with mini-batching.In particular, we develop and analyze he complexity of mS2GD (Algorithm 1) -a mini-batch proximal variant of S2GD [1].We show that the method enjoys twofold benefit compared to previous methods.Apart from admitting a parallel implementation (and hence speedup in clocktime in an HPC environment), our results show that in order attain a specified accuracy ǫ, our mini-batching scheme can get by with fewer gradient evaluations.This is formalized in Theorem 2, which predicts more than linear speedup up to some b -the size of the mini-batches.Another advantage, compared to [2], is that we do not need to average the t k points x in each loop, but we instead simply continue from the last one (this is the approach employed in S2GD [1]).

B. Related work
There has been intensive interest and activity in solving problems of the structure (1) in the past years.One of the most practical methods for this problem is accelerated proximal gradient descent of Nesterov, with its most successful variant being FISTA [3].However, this method is not very efficient in large-scale settings (big n) as it needs to process all n functions in each iteration.Two classes of methods address this issue -randomized coordinate descent methods [4], [5], [6], [7], [8], [9], [10], [11], [12], [13] and stochastic gradient methods [14], [15], [16], [17], [18].This paper is closely related to the works on stochastic gradient methods with a technique of explicit variance reduction of stochastic approximation of the gradient.In particular, our method is a mini-batch variant of S2GD [1]; the proximal setting was motivated by SVRG [19], [2].Moreover, an accelerated version of proximal SVRG with mini-batches has been proposed by Nitanda as Acc-Prox-SVRG [20]; however, the acceleration of Acc-Prox-SVRG depends largely on the mini-batch size.
A typical stochastic gradient descent (SGD) method will randomly sample i th function and then update the variable x using ∇f i (x) -a stochastic estimate of ∇F (x).Important limitation of SGD is that it is inherently sequential.In order to enable parallelism, the idea of mini-batching-using estimating the gradient using multiple random i in each iteration-is often employed [21], [22], [23], [24], [18], [25].

II. PROXIMAL ALGORITHMS
A popular proximal gradient approach to solving (1) is to form a sequence {y k } via Note that in view of Assumption 1, U t is an upper bound on P whenever h > 0 is a stepsize parameter satisfying 1/h ≥ L. This procedure can be equivalently written using the proximal operator as follows: In a large-scale setting it is more efficient to instead consider the stochastic proximal gradient approach, in which the proximal operator is applied to a stochastic gradient step: where G t is a stochastic estimate of the gradient ∇f (y t ).Of particular relevance to our work are the the SVRG [19], S2GD [1] and Prox-SVRG [2] methods where the stochastic estimate of ∇F (y t ) is of the form where x is an "old" reference point for which the gradient ∇f (x) was already computed in the past, and i t ∈ [n] def = {1, 2, . . ., n} is a random index equal to i with probability q i > 0. Notice that G t is an unbiased estimate of the gradient of F at y t : Methods such as S2GD [1], SVRG [19] and Prox-SVRG [2] update the points y t in an inner loop, and the reference point x in an outer loop indexed by k.With this new out iteration counter, we will have x k instead of x, y k,t instead of y t and G k,t instead of G t .This is the notation we will use in the description of our algorithm in the next section.The outer loop ensures that the squared norm of G k,t approaches zero as k, t → ∞ (it is easy to see that this is equivalent to saying that the stochastic estimate G k,t has a diminishing variance), which ultimately leads to extremely fast convergence.Indeed, semi-stochastic methods enjoy the complexity bound O((n + κ) log(1/ǫ)), where κ = L/µ is a condition number.This should be contrasted with proximal gradient descent, with complexity O(nκ log(1/ǫ)) or FISTA, with complexity O( √ nκ log(1/ǫ)).It is clear that semi-stochastic methods always beat gradient descent, and even outperform FISTA in specific regimes for κ and n.However, unlike FISTA, this is achieved without the use of momentum.

Algorithm 1 mS2GD
1: Input: m (max # of stochastic steps per epoch); h > 0 (stepsize); Compute and store Initialize the inner loop: with probability q t given by (6) 6: uniformly at random 8: Compute a stochastic estimate of end for 11: Set x k+1 ← y k,t k 12: end for The algorithm includes an outer loop, indexed by epoch counter k, and an inner loop, indexed by t.Each epoch is started by computing g k , which is the (full) gradient of F at x k .It then immediately proceeds to the inner loop.The inner loop is run for t k iterations, where t k is chosen randomly by setting t k = t ∈ {1, 2, . . ., m} with probability q t , where , where ) Subsequently, we run t k iterations in the inner loop -the main step of our method (Step 8).Each new iterate is given by the proximal update (4), however with the stochastic estimate of the gradient G k,t in (5)

IV. COMPLEXITY RESULT
In this section, we state our main complexity result and comment on how to optimally choose the parameters of the method.
Theorem 1.Let Assumptions 1 and 2 be satisfied and let x * def = arg min x P (x).In addition, assume that the stepsize satisfies and that m is sufficiently large so that . Then mS2GD has linear convergence in expectation: Remark 1.If we consider the special case ν F = 0, ν R = 0 (i.e., the case that we do not have any good estimate for µ F and µ R ), we obtain In the special case when b = 1 we get α(b) = 1, and the rate given by (8) exactly recovers the rate achieved by Prox-SVRG [2] (in the case when the Lipschitz constants of ∇f i are all equal).The rest of this section focuses on post-analysis of the above result.In particular, we show what the optimal choice of parameters is, and how it translates to the overall complexity result.

A. Mini-batch Speedup
To explore the speedup from applying mini-batch strategy, we need to fix some parameters to avoid too many parameters in the complexity result (Theorem 1).For simplicity, we are using ν F = 0 and ν R = 0 in (7) so that we can analyze (8), in which case we can still analyze a strongly convex problem even without any explicit knowledge of its modulus.
Moreover, due to the fact that mini-batching is only employed in inner loops, it would be reasonable for us to fix the target decrease ρ for each epoch.Consequently, to achieve ǫ-accuracy, which should guarantee the number of epochs should be at least ⌈log(1/ǫ)⌉.Now let us focus only on inner loops and fix target decrease as ρ * in a single epoch.For any 1 ≤ b ≤ n, define (h b * , m b * ) to be the optimal pair in the sense that the stepsize h = h b * minimizes the computational effort -m is minimized subject to ρ ≤ ρ * with ρ defined in (8).Under these definitions, b = 1 recovers the optimal choice of parameters without minibatching.If m b * < m 1 * /b for some b > 1, then mini-batching can help us reach the target decrease ρ * with fewer component gradient evaluations.
The following theorem presents the formulas of h b * and m b * .Equation (10) suggests that as long as the condition hb ≤ 1 L holds, m b * is decreasing at a rate roughly faster than 1/b.Hence, we can attain the same decrease with less work, compared to the case when b = 1.
Theorem 2. Fix target decrease ρ = ρ * , where ρ is given by (8) and ρ * ∈ (0, 1).If we consider the mini-batch size b to be fixed and define the following quantity, then the choice of stepsize h b * and size of inner loops m b * , which minimizes the work done -the number of gradients evaluated -while having ρ ≤ ρ * , is given by the following statements.
If hb ≤ 1 L , then h b * = hb and where

B. Convergence Rate
In this section we provide a practical bound on number of component gradient evaluations ∇f i (x) needed to achieve a predefined ǫ-accuracy in k iterations (9).In particular, we show that the efficient speedup from mini-batching is obtainable only for b roughly up to 29.
We set the number of outer iterations to be k = ⌈log(1/ǫ)⌉.Fix the target decrease in Theorem 2 to satisfy ρ ≤ ǫ 1/k , which gives the corresponding optimal choice of parameters h b and m b , then this yields the total complexity of gradient evaluaitons to get (9).
From Theorem 2, the following equivalence holds, .
Hence, it follows that if b < ⌈b 0 ⌉, then h b = hb and m b is defined in (10); otherwise, h b = 1 L and m b is defined in (11).Obviously, with n large, we have b 0 ≥ 8, which, together with the above two cases, is able to demonstrate a total complexity of O ((n + bκ) log(1/ǫ)) .

Corollary 3. By setting the number of outer iterations
and h as in (13) and m as in (12), the total complexity of mS2GD is The complexity is measured in terms of component gradient evaluations, with the goal to achieve target accuracy ǫ in (9).

C. Comparison with Acc-Prox-SVRG
One of the most related method, which applies mini-batch scheme to stochastic gradient variance-reduced methods, is the Acc-Prox-SVRG [20].Acc-Prox-SVRG incorporates both mini-batch scheme and Nesterov's acceleration method [27], [28]; however, the author claims that when b < ⌈b 0 ⌉, with the threshold b 0 defined as This suggests that acceleration will only be realized when the mini-batch size is large, while for small b, Acc-Prox-SVRG achieves the same overall complexity of as mS2GD.
Taking a close look into the theoretical results given by Acc-Prox-SVRG and mS2GD, for each ǫ ∈ (0, 1), we are numerically minimizing the total work done -total number of component gradient evaluations given by over ρ ∈ (0, 1) and h, to compare their complexity. 2ig. 1 illustrates situations for both ill-conditioned and wellconditioned data in theory.With a small mini-batch size b, mS2GD is advantageous over Acc-Prox-SVRG; however, for a large b, the situation reverses because of the acceleration in Acc-Prox-SVRG. 3 Plots with b = 64 illustrate the cases where we cannot observe any differences in both methods.

V. EXPERIMENTS
In this section we perform numerical experiments to illustrate the properties and performance of our algorithm.In Section V-A, we impose an efficient way to apply mS2GD algorithm to sparse datasets.In Section V-B, we introduce numerical characteristics of mS2GD.In the last section, we compare our mS2GD with some relevant algorithms.
Although our mS2GD in proximal setting applies to both the popular regularizers with L1 norm and L2 norm.We focus our experiments in Sections V-B and V-C on the problem with L2-regularizer, i.e., and in Sections V-B and V-C, we conduct experiments on logistic regression for which we have in problem (14), with a set of training data points {(a i , b i )} n i=1 , where a i ∈ R d and b i ∈ {+1, −1} for binary classification problems.
We performed experiments on four publicly available binary classification datasets, namely rcv1, news20, covtype 4 and astro-ph 5 .In Section V-B, we investigate speedup from mini-batches and parallelism in practice on the rcv1 and astro-ph datasets.Section V-C, focuses on comparison of the performances of other relevant algorithms on all four datasets.
In the logistic regression problem, the Lipschitz constant of function f i can be derived as L i = a i 2 /4.Our analysis assumes (Assumption 1) the same constant L for all functions.Hence, we get the constant as L = max i∈[n] L i .We set the regularization parameter λ = 1 n in our experiments, resulting in the problem having the condition number, evaluated as

A. Implementation for Data with Sparse Structure
A natural question one might want to ask is whether the mS2GD can take advantage of sparse data.In the case of SGD, if the i th data point depends only on few coordinates, say ω, computing the gradient of the i th function can cost6 O(ω) operations, and the resulting gradient will have just ω nonzero elements.Thus, updating the test point will require O(ω) operations.This is not the case for mS2GD, since the full gradient in the update (5) is, in general, fully dense even for sparse data.However, taking in to account that we do not need all coordinates to compute the stochastic estimate of the gradient due to sparse data, we can adapt the algorithm, to make update in respective coordinates only before they are needed to evaluate a stochastic gradient, which induces our application of proximal "lazy" updates into Algorithm 1.For example, in Algorithm 1, for inner iteration t at epoch k, we would only update G k,t with coordinates in A kt for the full gradient g k .
In order to illustrate our mS2GD algorithm with sparse data in Algorithm 2, we define the "lazy" update operator as "prox l [•]", which is a lazy update for the i th coordinate.This operator is able to fulfill the cases for both L1 and L2 regularizers, and details about this operator will be explained after introducing the algorithm.
In Algorithm 2, we assume the functions f i have the form for all i = 1, . . ., n, which covers the cases of linear and logistic regression.We denote the s th coordinate of a vector y by (y) s .
with probability q t given by (6) end for 17: for s = 1 to d do 18: end for 20: Set x k+1 ← y k,t k 21: end for This algorithm follows the scheme of Algorithm 1 while the only difference is the application of lazy updates for proximal operators.Problems with various regularizers can be performed with efficient updates by using "prox l " operator for sparse data.The most popular regularizers in machine learning research are L1 and L2 regularizers.The following lemmas include details about proximal lazy updates with L2 and L1 regularizers, respectively.Lemma 1 (Proximal Lazy Updates with L2 Regularizer).For the problem (14), which has L2 regularizer in (1) (λ = 0) , our mS2GD algorithm can efficiently perform proximal lazy updates for sparse data by using the following operator in Algorithm 2.
where s corresponds to the s th coordinate and

Lemma 2 (Proximal Lazy Updates with L1 Regularizer).
Consider problem (1) with L1 regularizer R(x) = x 1 .Our mS2GD algorithm can efficiently update the iterates for sparse data by using the proximal lazy update operator in Algorithm 2. Define M and m as follows, then the forms of the operator are distinct in the following three situations, according to the value of (g k ) s .
1) If (g k ) s ≥ λ, then by letting 2) If −λ < (g k ) s < λ, then the operator can be defined as 3) If (g k ) s ≤ −λ, then by letting q def = (y k,t )s m − 1, the operator can be defined as The proofs of Lemmas 1 and 2 are available in AP-PENDIX C.

B. Speedup of mS2GD
With mini-batches, mS2GD can be accelerated in the benefit of simple parallelism.In Section IV-A, we have shown in theory that up to some threshold of mini-batch size, increasing mini-batch size does not hurt the performance of mS2GD.
Fig. 2 compares the best performances of mS2GD with different mini-batch sizes on datasets rcv1 and astro-ph.Each effective pass is considered as n evaluations in component gradients and each full gradient evaluation counts as one effective pass.In both cases, by increasing the mini-batch size, mS2GD with b = 2, 4, 8 are comparable or sometimes even better than S2GD (b = 1) without any parallelism.
Although for larger mini-batch sizes, mS2GD would be obviously worse, the results are still promising with parallelism.In Fig. 3, we show the ideal speedup by parallelism, which would be achievable if and only if we could always evaluate the b gradients efficiently in parallel 7 .

C. Comparison with Related Algorithms
In this part, we implemented the following algorithms to conduct a comparison. 1) SGDcon: the proximal stochastic gradient descent method with the constant step-size which gave the best performance in hindsight.
2) SGD+: the proximal stochastic gradient descent with adaptive step-size h = h 0 /(k +1), where k is the number of effective passes and h 0 is some initial constant stepsize.We used h 0 which gave the best performance in hindsight.
3) FISTA: fast iterative shrinkage-thresholding algorithm proposed in [3].This is considered as the full gradient descent method in our experiments.
4) SAG: a proximal version of the stochastic average gradient algorithm [26].Instead of using h = 1/16L, which is analyzed in the reference, we used the constant step-size, which provided the best performance in practice, instead of using h = 1/16L, which is analyzed in the reference.
5) S2GD: semi-stochastic gradient descent method proposed in [1].We applied proximal setting to the algorithm and used constant step-size, which gave the best performance in hindsight.
6) mS2GD: the mS2GD algorithm with mini-batch size b = 8.Although a safe step-size is given in our theoretical analyses in Theorem 1, we ignored the bound, experimented with various step-sizes and used the constant step-size that gave the best performance.

VI. CONCLUSION
In this paper, we have proposed a mini-batch proximal algorithm, with variance reduction technique on stochastic gradients, for minimizing a strongly convex composite function, which is the sum of a smooth convex function and a possibly nonsmooth regularizer.This kind of unconstrained optimization problems arise in inverse problems in signal processing  and statistics.Our mS2GD algorithm enjoys speedup from both mini-batch and variance reduction of stochastic gradients, where the former admits a simple parallelism.Comparisons to state-of-the-art algorithms suggest mS2GD, with a small minibatch size, is competitive in theory and faster in practice even without parallelism.Possible implementation in parallel and adaptiveness for sparse data imply its potential in industry.

APPENDIX A TECHNICAL RESULTS
Note that the above contractiveness of proximal operator is a standard result in optimization literature.

Lemma 4. Let {ξ
Following from the proof of Corollary 3 in [2], by applying Lemma 4 with ξ i := ∇f i (y k,t ) − ∇f i (x k ), we have the bound for variance as follows.
Theorem 4 (Bounding Variance).Considering the definition of G k,t in Algorithm 1, conditioned on y k,t , we have E[G k,t ] = ∇F (y k,t ) and the variance satisfies,

A. Proof of Lemma 4
Note that i,j ξ T i ξ j , we can thus write: where in the last step we have used the bound

B. Proof of Theorem 1
The proof is following the steps in [2].For convenience, let us define the stochastic gradient mapping then the iterate update can be written as Let us estimate the change of y k,t+1 − x * .It holds that Apply Lemma 3 in [2] with Therefore, After moving hµ R y k,t+1 − x * 2 to the LHS, we obtain and dividing both sides by (1 + hµ F ) gives us In order to bound −∆ T k,t (y k,t+1 − x * ), let us define the proximal full gradient update as 8   ȳk,t+1 = prox hR (y k,t − h∇F (y k,t )), with which, by using Cauchy-Schwartz inequality and Lemma 3, we can conclude that Letting δ def = 1−hµF 1+hµR , we thus get By taking expectation, conditioned on y k,t 9 we obtain (21),( 20) where we have used that and hence E[−∆ T k,t (ȳ k,t+1 − x * )] = 0 10 .Now, if we put (17) into (22) and decrease index t by 1, we obtain (23) where . Now recall that we assume that we have lower-bounds ν F ≥ 0 and ν R ≥ 0 on the true strong convexity parameter µ F and µ R available.Letting δ ′ def = 1−hνF 1+hνR , we obtain from (23) that 8 Note that this quantity is never computed during the algorithm.We can use it in the analysis nevertheless. 9For simplicity, we omit the E[• | y k,t ] notation in further analysis 10 ȳk,t+1 is constant, conditioned on y k,t which is equivalent to Now, by the definition of x k we have that where γ = m t=1 (δ ′ ) m−t .By summing (24), multiplied by (δ ′ ) m−t for t = 1, . . ., m, we obtain on the left hand side and for the right hands side we have: Combining ( 26) and ( 27) and using the fact that LHS ≤ RHS, we have Now, using (25), we obtain Strong convexity (3) and optimality of x * imply that 0 ∈ ∂P (x * ), and hence for all x ∈ R d we have Since E y k,m − x * 2 ≥ 0 and y k,0 = x k , by combining ( 29) and ( 28) we get This is equivalent to and when ρ is defined as The above statement, together with assumptions of Lemma 3 in [2], implies Applying the above linear convergence relation recursively with chained expectations, we have

C. Proof of Theorem 2
Clearly, if we choose some value of h then the value of m will be determined from (8) (i.e.we need to choose m such that we will get desired rate).Therefore, m as a function of h obtained from ( 8) is Observe that m ′ (h) is defined and continuous for any h ∈ I h .Therefore there have to be some stationary points (and in case that there is just on I h ) it will be the global minimum on I h .The FOC gave us that If this hb ∈ I h and also hb ≤ 1 L then this is the optimal choice and plugging (31) into (30) gives us (10).
a) Claim #1: It always holds that hb ∈ I h .We just need to verify that Only one thing which needs to be verified is that the denominator of ( 11) is positive (or equivalently we want to show that ρ > 4α(b)(1 + ρ).To see that we just need to realize that in that case we have

APPENDIX C PROXIMAL LAZY UPDATES FOR L1 AND L2 REGULARIZERS A. Proof of Lemma 1
By applying the updates with proximal operator (4) and L2 regularizer, from Algorithm 1 we have y k+1,t = prox hR (y k,t − hG k,t ) = arg min For Case (3), when (y k,t ) s ≥ −[λ − (g k ) s ]h, we can conclude that (y k+τ,t ) s = max{(y k,t ) s , λ + (g k ) s } − τ [λ + (g k ) s ]h, and in addition, the following equivalences hold when (g k ) s ≤ −λ, which can summarize the situation as The proof can be completed by letting M = [λ + (g k ) s ]h and m = −[λ − (g k ) s ]h.
, which is formed by using a minibatch of examples A kt ⊂ [n] of size |A kt | = b.Each inner iteration takes 2b component gradient evaluations 1 .

Fig. 4
Fig.4demonstrates superiority of mS2GD over state-ofthe-art algorithms on the four datasets.For mS2GD, the best choices of parameters with b = 8 are given in TABLE II.

x∈R d 1 2
x − (y k,t − hG k,t ) 2 + (y k,t − hG k,t ) = β(y k,t − hG k,t ), which is typically the most ill-conditioned problem considered in practice.TABLEIgives a summary of the four datasets, including the sizes n, dimensions d, their sparsity as proportion of nonzero elements and Lipschitz constants L.

TABLE I :
Summary of datasets used for experiments.

TABLE II :
Best choices of parameters in mS2GD.