A Flexible Stochastic Multi-Agent ADMM Method for Large-Scale Distributed Optimization

While stochastic alternating direction method of multipliers (ADMM) methods have shown enormous potential in distributed applications, improving their algorithmic flexibility can bring huge benefits. In this paper, we propose a novel stochastic optimization method based on the distributed ADMM framework, called Flex-SADMM. Specifically, we incorporate variance-reduced first-order information and approximated second-order information for solving the ADMM subproblem, which targets stable convergence and improved accuracy of the search direction. Moreover, unlike most ADMM-based methods that require each computation node to perform an update in every iteration, we only require each computation node to update within a bounded iteration interval, which significantly improves flexibility. We further provide theoretical results that guarantee the convergence of Flex-SADMM for nonconvex optimization problems. These results show that our proposed method can successfully overcome the above challenges while keeping the computational complexity low. In our empirical study, we have verified the effectiveness and the improved flexibility of the proposed method.


I. INTRODUCTION
Mathematical optimization has been widely applied in many areas, e.g., computational biology [1], [2], wireless communication [3]-[7] and machine learning [8]-[10]. Stochastic optimization builds on the theory of deterministic optimization. It can be traced back to the epochal work [11], and its main ingredient is to use a mini-batch of data points, randomly sampled from the full dataset, to approximate the exact gradient for the search direction, which is known as stochastic gradient descent (SGD).
However, SGD suffers from large variance, which leads to slow convergence [12]. Thus, several methods target controlling and reducing the variance during mini-batch optimization. In particular, variance reduction techniques [13] have been developed to address this issue and greatly accelerate the convergence of SGD. Exemplar algorithms are the stochastic variance reduced gradient method (SVRG) [14], the stochastic average gradient method (SAG) [15] and its extension SAGA [16]. In particular, SAG and SAGA utilize the full gradient without computing it directly; instead, it is updated with the newest information at each iteration. They achieve fast convergence at a per-iteration computational cost comparable to SGD. Furthermore, utilizing second-order information, e.g., the Hessian matrix, yields more accurate search directions and can lead to faster convergence. To be precise, various recursive update schemes for the approximated Hessian matrix have been proposed, including the symmetric rank-one (SR1) update [17]-[19] and rank-two variants such as the Davidon-Fletcher-Powell (DFP) scheme [20], [21] and the BFGS update scheme [22], [23].
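The table-based estimators mentioned above (SAG/SAGA) can be sketched in a few lines. The following Python snippet illustrates a SAGA-style gradient estimate; the function and variable names are our own illustration, not the cited implementations.

```python
import numpy as np

def saga_gradient(grad_fn, x, grad_table, i):
    """One SAGA-style variance-reduced gradient estimate for sample i.

    grad_fn(x, i) returns the gradient of the i-th loss term at x;
    grad_table stores the most recently seen gradient of every term.
    Illustrative sketch; names are ours, not from the cited works.
    """
    g_new = grad_fn(x, i)
    # New sample gradient minus its stored copy, plus the running
    # average of all stored gradients (an unbiased estimator).
    v = g_new - grad_table[i] + grad_table.mean(axis=0)
    grad_table[i] = g_new        # refresh the table entry in place
    return v
```

Note that the full gradient is never recomputed: only one per-sample gradient is evaluated per step, while the table average supplies the rest.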
In particular, [24] has studied online BFGS and online limited-memory BFGS (oLBFGS) under the stochastic regime. The main ingredient is adapting deterministic BFGS and limited-memory BFGS (LBFGS) to the stochastic setting using the stochastic gradient. The convergence properties of oLBFGS have been further studied under the strong convexity assumption in [25]. [26] has proposed a novel stochastic quasi-Newton method for solving support vector machine (SVM) problems. It aims to approximate the diagonal elements of the rescaling matrix via term-by-term equalities. Since it only involves scalar multiplications with the approximated diagonal elements, this method is quite computationally efficient. However, directly applying the stochastic gradient in the Hessian approximation may introduce noise, which can affect the robustness of convergence. Hence, controlling the quality of the curvature approximations in stochastic optimization remains a challenge.
The above optimization methods are implemented on a single computational node. However, in the big data era, solving a large-scale machine learning optimization problem on a single computation node is generally prohibitive, because the quantities of realistic data for training a machine learning model can be very large, typically ranging from 1TB to 1PB. As a result, distributed optimization has become a crucial technique, and it has become important to develop algorithms that are both efficient and scalable enough to process huge datasets in a parallelized or fully decentralized fashion. The alternating direction method of multipliers (ADMM) is a natural fit for distributed convex optimization, and in particular for large-scale machine learning problems.
The ADMM method is simple, and its main ingredient is to decompose the problem into subproblems that can be solved individually in parallel; it thus turns out to be a natural fit for a wide class of large-scale applications, e.g., machine learning with large-scale data-distributed systems [27], distributed compressive sensing [28], [29], and massive MIMO wireless communication systems [30], [31]. Moreover, since it solves optimization problems via the augmented Lagrangian function, which penalizes the equality constraint with a quadratic term, it is advantageous in terms of numerical stability, and hence it has attracted a considerable amount of attention in both practical and theoretical aspects over the decades [32]-[36].
Specifically, [32] has provided an extensive review of the ADMM framework, including a convergence analysis for a class of strongly convex problems with a linear equality constraint. The comprehensive work [36] has further studied the convergence rate of the ADMM method with multiple separable blocks of variables for strongly convex optimization problems. It has established a global R-linear convergence rate under a sufficiently small dual stepsize and the assumption that a certain error bound condition holds. In [33], a decentralized ADMM (DADMM) method is proposed for further computational efficiency. The main ingredient is to take the network topology of the agents into full consideration, so that each agent can individually update its variable while communicating with its neighbors only. [35] has further proposed a linearized version of DADMM (DLM). It combines the computational efficiency of the distributed gradient method and the rapid convergence rate of ADMM. Under the same strong convexity assumption on the objective function, the linear convergence rate of both DADMM and DLM has been demonstrated. In [34], the ADMM method has been investigated for solving a class of nonconvex problems with multiple separable blocks of variables, where the penalty parameter is assumed to be sufficiently large such that the subproblem is strongly convex and can be conveniently solved.
The above ADMM methods have been studied in the deterministic regime and assume full accessibility to the whole distributed dataset. However, in realistic applications, the computational workload may still be heavy when the distributed dataset is large. A more natural way of handling this is to use a sampled mini-batch instead of the full distributed dataset to obtain an unbiased estimate of the gradient, which leads to the incorporation of stochastic settings into ADMM. In [37], a stochastic ADMM method has been proposed for solving a class of nonsmooth convex problems; moreover, the convergence rates for both convex and strongly convex functions have been studied under specific assumptions. [38] has investigated the well-known SVRG strategy applied to stochastic ADMM. [39] has extensively studied stochastic ADMM combined with various variance reduction strategies for nonconvex problems. Specifically, SVRG, the stochastic average gradient (SAG) method and its extension SAGA have been incorporated into the ADMM method, resulting in SVRG-ADMM, SAG-ADMM and SAGA-ADMM, respectively. All the resultant ADMM methods have been shown to converge to an ε-stationary point in expectation. In [38], SVRG-ADMM has been further investigated for solving convex optimization problems with composite objective functions, achieving convergence rates of O(log S/S) for strongly convex and Lipschitz-smooth objective functions, and O(1/√S) for convex objective functions without Lipschitz smoothness. In [40], a novel stochastic ADMM has been proposed; its main ingredient is to combine classical stochastic ADMM with a gradient-free, variance-reduced strategy.
However, none of the above methods can be directly applied in a system with multiple agents in which only a subset of the agents are available for updating variables while the others remain silent at each iteration, e.g., in federated learning applications. Although [36] has improved the flexibility of ADMM, the subproblem may exhibit large variance when the stochastic gradient is applied. Therefore, we propose a novel ADMM method that aims to improve the flexibility of the classical stochastic ADMM method. Precisely, the novel method only requires each agent to update its variables at least once per period. While the flexibility is improved, we also incorporate a variance reduction technique. Thus, our proposed method divides the procedure of the classical stochastic ADMM method into two stages: the variance-reduced stochastic gradient is computed in the first stage and then applied in the iterations of the second stage. It should be noted that when the period is set to one, i.e., all agents update their corresponding variables at each iteration, our proposed method reduces to SVRG-ADMM. Hence, our proposed method generalizes SVRG-ADMM. To summarize, our contributions are as follows: • A novel stochastic ADMM method is proposed. As we do not assume convexity of the objective function f, our proposed method can potentially be applied to a specific class of nonconvex problems. Moreover, as the subproblem may be nonconvex, we incorporate Sd-REG-LBFGS to ensure the positive definiteness of the Hessian approximation. Furthermore, each agent is not required to update its variable at each iteration; instead, we require that each agent updates its variable at least once per period.
• We incorporate the SVRG strategy into our proposed stochastic ADMM method. Thus, our proposed ADMM method is also divided into two stages, as in the SVRG method. Although SVRG has been widely used, its application to the stochastic ADMM framework above is not straightforward, since a direct combination may not ensure convergence. Considering that each agent updates its variable at least once per period, we set the period to be the iteration number of the second stage of the SVRG method.
• We have established the convergence of the novel method. We construct a real-valued sequence that is closely related to the augmented Lagrangian function (it may be viewed as a further regularized version of the augmented Lagrangian function). Instead of showing that the augmented Lagrangian function decreases at each iteration, we show that the constructed sequence is convergent. The argument is threefold: (1) we show that the sequence is monotonically decreasing at each iteration within the second stage; (2) we show that the sequence decreases at the start of the next second stage; (3) we show that the sequence is bounded below. Thus the sequence is convergent. With the help of this convergence result, we further show that the proposed method converges to a KKT stationary point.

II. THE PROPOSED ALGORITHM
In this section, we start with a review of the basic theory of ADMM; this part is based on [32]. Then, we introduce the application of the Sd-REG-LBFGS scheme in the subproblem of stochastic ADMM, which is quite straightforward. Specifically, suppose that there are K agents working in parallel (here, each agent corresponds to a computation node [42]), and the large dataset D is divided into K subsets D_k, where D_k is allocated to the kth agent. Consider the following consensus optimization problem with an equality constraint:

min_{x_k, k=0,...,K} Σ_{k=1}^{K} f_k(x_k) + g(x_0),  s.t.  x_k = x_0,  k = 1, ..., K,   (1)

where x_k ∈ R^d, f_k : R^d → R is a possibly nonconvex function associated with the kth agent, and g : R^d → R is assumed to be convex in x_0. Moreover, the loss of a single data point d_i ∈ D_k is denoted f_{k,i}(x_k; d_i), so that f_k(x_k; D_k) = Σ_{d_i ∈ D_k} f_{k,i}(x_k; d_i). Here, we write f_k(x_k) for f_k(x_k; D_k) for simplicity, i.e., f_k(x_k) := f_k(x_k; D_k). Problem (1) can be solved through the augmented Lagrangian function

L({x_k}, {λ_k}, x_0) = Σ_{k=1}^{K} [ f_k(x_k) + λ_k^T (x_k − x_0) + (ρ_k/2) ||x_k − x_0||^2 ] + g(x_0),   (2)

where λ_k is the dual variable and ρ_k > 0 is the penalty parameter of the kth agent. In this section, we consider the large-dataset case, where each subset D_k is also large. Hence, the traditional stochastic ADMM method suffers from the following three problems: (1) In linearized ADMM, the subproblem for x_k^{t+1} may be solved more accurately by incorporating a Hessian approximation B_k, e.g.,

x_k^{t+1} = argmin_x  ∇f_k(x_k^t)^T (x − x_k^t) + (β/2)(x − x_k^t)^T B_k (x − x_k^t) + λ_k^T (x − x_0^t) + (ρ_k/2) ||x − x_0^t||^2,

where β is the approximation parameter, t is the iteration number and ρ_k is the ADMM regularization parameter. However, if f_k(·) is nonconvex, the subproblem can be difficult to solve, since a reduction in the objective function of the subproblem may not be ensured at each iteration. (2) Moreover, directly sampling a small subset of D_k to form the stochastic gradient may lead to large variance, which may further slow down the convergence. (3) When all K agents are available to implement the update of each x_k, k = 1, ..., K, in parallel, the traditional stochastic ADMM method can be employed.
However, in the more general case, fewer than K agents are available to implement the update at each iteration, and thus the traditional stochastic ADMM cannot be directly applied.
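To make the consensus structure above concrete, here is a minimal deterministic consensus-ADMM loop with quadratic local losses (so each subproblem has a closed form) and g = 0. It is a sketch of the classical scheme reviewed here, not the paper's stochastic variant; all names are ours.

```python
import numpy as np

def consensus_admm(A_list, b_list, rho=1.0, iters=100):
    """Deterministic consensus ADMM for min_x sum_k 0.5*||A_k x - b_k||^2.

    Each agent k holds (A_k, b_k), a local copy x[k] and a dual lam[k];
    x0 is the consensus variable (g = 0 for simplicity).
    """
    K, d = len(A_list), A_list[0].shape[1]
    x = [np.zeros(d) for _ in range(K)]
    lam = [np.zeros(d) for _ in range(K)]
    x0 = np.zeros(d)
    for _ in range(iters):
        for k in range(K):
            # x_k-update: minimize the local augmented Lagrangian term
            #   0.5||A_k x - b_k||^2 + lam_k.(x - x0) + (rho/2)||x - x0||^2
            H = A_list[k].T @ A_list[k] + rho * np.eye(d)
            x[k] = np.linalg.solve(H, A_list[k].T @ b_list[k] - lam[k] + rho * x0)
        # x0-update (g = 0): average of the shifted local copies
        x0 = sum(xk + lk / rho for xk, lk in zip(x, lam)) / K
        for k in range(K):
            # dual ascent on the consensus constraint x_k = x0
            lam[k] = lam[k] + rho * (x[k] - x0)
    return x0
```

The inner x_k loop is exactly the part that can be executed in parallel across agents, which is the property the distributed methods above exploit.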

A. SVRG STRATEGY
The variance of the stochastic gradient can be reduced by a two-stage optimization scheme. In the first stage, the full gradient is evaluated at the newly updated variable; it is then stored and used in the calculation of the stochastic gradient in the next stage. Specifically, denote the iteration counter as (s + 1, t + 1), where s + 1 is the first-stage iteration number and t + 1 is the second-stage iteration number. The stochastic gradient with respect to the kth variable x_k proposed in [14] is then given by

v_{k,t}^{s+1} = ∇f_{k,D̃}(x_{k,t}^{s+1}) − ∇f_{k,D̃}(x_k^s) + ∇f_k(x_k^s),

where D̃_k^t is a randomly sampled subset of data points with batch size b_k^t ≪ N_k and ∇f_{k,D̃} denotes the mini-batch gradient over D̃_k^t. It can easily be verified that the stochastic gradient is unbiased, i.e., E[v_{k,t}^{s+1} | x_{k,t}^{s+1}] = ∇f_k(x_{k,t}^{s+1}). As the full gradient ∇f_k(x_k^s) is only calculated in the first stage and maintained for the second stage, the SVRG strategy is computationally efficient. Moreover, Lemma 1 verifies that as the algorithm converges, the variance of v_{k,t}^{s+1} is progressively reduced to zero. SVRG was first proposed in [14], was soon widely applied [38]-[40], and has shown effectiveness and fast convergence. Considering the above benefits, we incorporate the SVRG strategy in developing our proposed stochastic ADMM method.
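The two-stage estimator above can be sketched directly. The following Python function computes the SVRG gradient from a snapshot iterate and its stored full gradient; the naming is ours, not the paper's code.

```python
import numpy as np

def svrg_gradient(grad_fn, x_curr, x_snap, full_grad_snap, batch):
    """SVRG estimator  v = g_B(x_curr) - g_B(x_snap) + full_grad_snap.

    grad_fn(x, i) is the gradient of the i-th component function,
    x_snap is the first-stage (snapshot) iterate at which the full
    gradient full_grad_snap was computed, and batch is the sampled
    index set.  Illustrative sketch under our own naming.
    """
    g_curr = np.mean([grad_fn(x_curr, i) for i in batch], axis=0)
    g_snap = np.mean([grad_fn(x_snap, i) for i in batch], axis=0)
    # Unbiased: E[v] equals the full gradient at x_curr, and the
    # variance vanishes as x_curr approaches x_snap.
    return g_curr - g_snap + full_grad_snap
```

Note that the two mini-batch gradients are evaluated on the same sampled batch; this coupling is what makes the correction term cancel the sampling noise as the iterates converge.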
As not all agents update their corresponding variables at each iteration, developing an ADMM-based algorithm with guaranteed convergence can be difficult. However, by requiring in the convergence analysis that each agent updates its variable at least once every period of T iterations, the stochastic algorithm can be expected to converge. Specifically, at each iteration t + 1, define an index set I^{t+1} ⊆ {0, 1, ..., K}; if an agent index k ∈ I^{t+1}, then the agent is required to update the variable x_k; otherwise, the agent keeps x_k at its previous iterate. For the convergence requirement that each variable must be updated at least once every T iterations, we have

∪_{i=1}^{T} I^{t+i} = {0, 1, ..., K}  for all t.
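This bounded-update requirement can be checked programmatically. The following sketch (function name ours) verifies that every agent index appears in each window of T consecutive index sets:

```python
def check_participation(index_sets, K, T):
    """Check the bounded-update condition stated above.

    index_sets[t] is the set I of agents updating at iteration t;
    agents are indexed 0..K (index 0 corresponds to x_0).  Returns
    True iff every agent appears in each window of T consecutive
    iterations.  Illustrative sketch under our own naming.
    """
    required = set(range(K + 1))
    for start in range(len(index_sets) - T + 1):
        window = set().union(*index_sets[start:start + T])
        if not required <= window:
            return False
    return True
```

In practice, a scheduler that cycles through disjoint blocks of agents (as in the experiments later in the paper) satisfies this check by construction.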

B. HESSIAN APPROXIMATION SCHEME
To deal with problem (1), let us consider a quadratic approximation of the objective function in the ADMM subproblem. Popular recursive update schemes for the approximated Hessian matrix may be used, including the symmetric rank-one (SR1) update [17]-[19] and rank-two variants such as the Davidon-Fletcher-Powell (DFP) scheme [20], [21] and the Broyden-Fletcher-Goldfarb-Shanno (BFGS) update scheme [22], [23]. In this subsection, we mainly consider the BFGS method, as it is one of the most popular quasi-Newton algorithms:

B_{k+1} = B_k − (B_k s_k s_k^T B_k)/(s_k^T B_k s_k) + (y_k y_k^T)/(y_k^T s_k),

where the correction pairs are s_k = x_{k+1} − x_k and y_k = ∇f(x_{k+1}) − ∇f(x_k), respectively. However, there remains the problem of numerical stability. Especially for nonconvex objective functions, the positive definiteness of the Hessian update is difficult to maintain. Moreover, for small datasets, the update may suffer from ill-conditioning, which harms convergence. To address these concerns, we adopt the Sd-REG-LBFGS update scheme. The main ingredient is to incorporate both a regularization scheme and a damping parameter, where the former ensures numerical stability and the latter keeps the Hessian approximation matrix positive definite. Specifically, Sd-REG-LBFGS updates the Hessian approximation matrix via the BFGS recursion above with the modified gradient difference

ŷ_k = θ_k ȳ_k + (1 − θ_k) B_k s_k,  ȳ_k = y_k + δ s_k,

where δ is a given positive constant that satisfies a specific condition (see Lemma 1), and the damping parameter is given as

θ_k = 1 if s_k^T ȳ_k ≥ 0.2 s_k^T B_k s_k,  and  θ_k = (0.8 s_k^T B_k s_k)/(s_k^T B_k s_k − s_k^T ȳ_k) otherwise.

Moreover, the extensions to the stochastic regime and the limited-memory version are straightforward.
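A dense version of the damped, regularized update can be sketched as follows. We use the classical Powell-damping constants (0.2/0.8), which match the reconstruction above but may differ from the exact Sd-REG-LBFGS scheme in the cited reference; all names are ours.

```python
import numpy as np

def damped_reg_bfgs_update(B, s, y, delta=1e-3):
    """One damped, delta-regularized BFGS update (dense sketch).

    Regularization shifts the gradient difference, y <- y + delta*s,
    and Powell-style damping blends it with B s so that s'y_hat stays
    positive, preserving positive definiteness of the update.
    """
    y = y + delta * s                 # regularized gradient difference
    Bs = B @ s
    sBs = s @ Bs
    sy = s @ y
    # Damping keeps the curvature product bounded away from zero.
    theta = 1.0 if sy >= 0.2 * sBs else 0.8 * sBs / (sBs - sy)
    y_hat = theta * y + (1.0 - theta) * Bs
    # Standard BFGS recursion with the modified pair (s, y_hat).
    return B - np.outer(Bs, Bs) / sBs + np.outer(y_hat, y_hat) / (s @ y_hat)
```

Even when the raw pair has negative curvature (s^T y < 0, as happens for nonconvex f), the damped pair satisfies s^T ŷ ≥ 0.2 s^T B s > 0, so the updated matrix remains positive definite.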
C. FLEX-SADMM
According to the above discussions, we propose a novel stochastic ADMM method that only requires each agent to update its corresponding variable at least once every T iterations. To incorporate SVRG, we divide the ADMM procedure into two stages: the full gradient is calculated in the first stage, and the number of iterations in the second stage is set exactly to T, so that each agent can update its variable at least once during the second stage. We summarize the proposed method in Algorithm 1, whose structure is as follows: for each first-stage iteration s, compute the full gradients ∇f_k(x_k^s) for all agents; then, for each second-stage iteration t = 0, ..., T − 1, draw the index set I of available agents; if 0 ∈ I, update x_0; and for each k ∈ I with k ≥ 1, compute v_{k,t}^{s+1}, update x_k via (5) and update λ_k.
In Algorithm 1, since all agents must update their corresponding variables at least once within the T iterations of the second stage, the full gradient ∇f_k(x_k^s) is calculated in step 2 for all agents k = 1, ..., K. Then v_{k,t}^{s+1} is computed for k ∈ I and used for updating the corresponding variable in step 10. It should be noted that steps 2-14 can be carried out in parallel; specifically, the chosen agents in the index set I can update the variables x_k and λ_k in a distributed manner. In step 10, the Hessian approximation produced by Sd-REG-LBFGS is kept positive definite. As mentioned above, this strategy avoids the ill-conditioning problem, and thus our proposed algorithm performs robustly with small samples. Moreover, the subproblem in step 10 is a convex quadratic problem and can be solved by directly setting the first-order derivative to zero.
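Because the step-10 subproblem is a convex quadratic, its solution is available in closed form. The following sketch solves it by setting the gradient to zero; the symbol names are our own rendering of the quadratic model described above.

```python
import numpy as np

def solve_quadratic_subproblem(v, B, x_t, lam, x0, rho):
    """Closed-form solution of the convex quadratic x_k-subproblem.

    Minimizes  v.(x - x_t) + 0.5 (x - x_t)' B (x - x_t)
             + lam.(x - x0) + (rho/2) ||x - x0||^2
    by nulling the gradient:  (B + rho I) x = B x_t - v - lam + rho x0.
    Here v is the (variance-reduced) gradient estimate and B the
    positive definite Hessian approximation.
    """
    d = len(v)
    return np.linalg.solve(B + rho * np.eye(d),
                           B @ x_t - v - lam + rho * x0)
```

Since B is kept positive definite and ρ > 0, the system matrix B + ρI is always invertible and well-conditioned, which is precisely why the positive-definiteness guarantee of the update scheme matters here.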
Remark: Our proposed stochastic ADMM method can potentially be applied in a parameter-server setting. For the widely used distributed gradient descent method, there has been extensive research on both asynchronous and synchronous algorithms. Synchronous algorithms require that the server waits for all workers to complete their updates, while asynchronous algorithms collect information from only part of the workers. Distributed gradient descent (DGD) has relatively low fault tolerance, while the ADMM method is able to increase it; thus, the ADMM method can be adapted to the asynchronous setting. Since our proposed method only requires that each worker updates its variables at least once every T iterations, it provides a framework for adaptation to asynchronous algorithms.

III. CONVERGENCE ANALYSIS
For notational simplicity, let us denote x_{k,t}^{s+1} as x_k^t; the same convention is used for the other variables. The following lemma shows that the variance of the stochastic gradient is progressively reduced to zero as the algorithm converges.
Lemma 1: The variance of the stochastic gradient is bounded above by the distance between the current iterate x_k^t and the first-stage snapshot iterate x_k^s.
Proof: See Appendix A.
Next, we focus on the change of the augmented Lagrangian function for the update of each variable, from which the function change at each iteration follows. As mentioned above, not only must the change of the augmented Lagrangian function within the second stage be considered; the change between first stages must also be evaluated. Note that the function L({x_k}, {λ_k}, x_0) is strongly convex with respect to x_0 with modulus γ > 0, i.e.,

L({x_k}, {λ_k}, x_0) ≥ L({x_k}, {λ_k}, x_0') + ∇_{x_0} L({x_k}, {λ_k}, x_0')^T (x_0 − x_0') + (γ/2) ||x_0 − x_0'||^2.

In particular, since x_0^* is a minimizer of L({x_k}, {λ_k}, x_0), we have

L({x_k}, {λ_k}, x_0) − L({x_k}, {λ_k}, x_0^*) ≥ (γ/2) ||x_0 − x_0^*||^2.

Based on this, the following lemma shows that each update of x_0 is expected to lead to a reduction in the augmented Lagrangian function.
Lemma 2: If 0 ∈ I_{t+1}^{s+1}, then the update of x_0 yields a guaranteed expected decrease of the augmented Lagrangian function.
Proof: See Appendix B.
Next, for notational convenience, let us define the function

L_k(x_k, λ_k, x_0) = f_k(x_k) + λ_k^T (x_k − x_0) + (ρ_k/2) ||x_k − x_0||^2,

which can be viewed intuitively as the individual augmented Lagrangian function corresponding to the kth agent. Then, we have the following useful lemma, which evaluates the change of L_k at the start of a new second stage, i.e., for t = 0.
Lemma 3: For t = 0, the stated bound holds. Proof: See Appendix C.
We continue by considering the update of the variable x_k at agent k.
Lemma 4: If k ∈ I, the update of x_k satisfies the stated bound on the change of L_k. Proof: See Appendix D.
Moreover, with a sufficiently large parameter ρ_k, it can be seen that L_k is decreased by each update of x_k. We are now ready to construct the following sequence, which is helpful for establishing the convergence result.
Lemma 5: Define the positive sequence {μ^s}, and furthermore define the sequence {ζ_t^s}, where ζ_t^s = Σ_{k=1}^K ζ_{k,t}^s and ζ_{k,t}^s is given accordingly; then (19) holds.
Proof: See Appendix E.
It can be seen from (19) that the sequence {ζ_t^s} is monotonically decreasing. In fact, we will show in the following theorem that the sequence is bounded below; hence, the sequence is convergent. This is useful for establishing the convergence of our proposed algorithm to a KKT stationary point in expectation, which is described in detail in the following theorem.
Theorem 1: For each k ∈ {1, ..., K}, the sequence {x_{k,t}^s, λ_{k,t}^s, x_{0,t}^s} generated by the proposed Algorithm 1 is expected to converge to a limit point {x_k^*, λ_k^*, x_0^*} that satisfies the KKT conditions of problem (1), where the metric d(x, Y) denotes the minimal distance between the vector x and the set Y, i.e., d(x, Y) = min_{y∈Y} ||x − y||. Moreover, given any sufficiently small positive value ε > 0, the iteration number S needed to achieve an ε-stationary point is of order O(1/ε), with a hidden constant involving positive constants η > 0 and τ > 0.
Proof: See Appendix F.

IV. NUMERICAL EXPERIMENTS
We have studied the theoretical convergence properties of our proposed Flex-SADMM. Theorem 1 shows that the convergence speed is closely related to the first-stage iteration number S; it is, however, also related to the second-stage iteration number T. Let us illustrate this briefly. The convergence speed O(1/S) in Theorem 1 is derived under the assumption that each agent k updates its variables at least once within the second stage. In fact, if each agent k is required to update its variables r times within the second stage, it can be straightforwardly derived that the convergence speed becomes O(1/(r·S)). Hence, in this section, we study the effect of the number of updates r for each agent in the second stage. Five ADMM-based scenarios are used to study the effectiveness of our proposed Hessian approximation scheme: 1) Flex-SADMM, implemented according to Algorithm 1; 2) Flex-SADMM without BFGS: the stochasticity introduced in real applications also affects the BFGS matrix, which may counteract or even degrade the benefit that the BFGS matrix brings, so this ablation checks the effect of BFGS; 3) Flex-SADMM without SVRG, implemented to show the importance and efficiency of SVRG; 4) Flex-SADMM without SVRG or BFGS: as will be seen, SVRG brings a performance improvement so large that it even overwhelms that of BFGS, so this variant is also implemented; 5) Fast ADMM, the momentum-based algorithm [43]. For the dataset, we choose the scene dataset [41] (available at http://mulan.sourceforge.net/datasets-mlc.html), since it is simple to study the effect of the parameters on our proposed algorithm.
With the scene data, we form a predictor of scene labels from image features. The dataset contains 1,784 images in the training set and 446 images in the test set. There are 294 image features and up to 6 scene labels per image. We have regrouped the scene dataset to contain only two labels, with the label = 1 group containing 875 training images and the rest forming the label = 0 group; hence, this is a balanced binary classification problem. In relation to problem (1), we apply the above five scenarios to logistic regression for the binary classification problem, due to the simplicity of logistic regression. For the performance evaluation, we choose the norm of the gradient (NOG), which reflects how close the iterate is to the optimal point:

NOG(θ) = || (1/N) Σ_{n=1}^{N} (σ(θ^T x_n) − z_n) x_n ||,

where θ is the optimization variable, σ(·) is the sigmoid function, z_n is the class label and x_n is the feature vector.
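The NOG metric above is simply the norm of the full logistic-regression gradient, which can be evaluated in a few lines; this is a sketch of the metric under our own naming (the reconstructed gradient is the standard cross-entropy gradient).

```python
import numpy as np

def norm_of_gradient(theta, X, z):
    """Norm of the full logistic-regression gradient (the NOG metric).

    X is the N x d feature matrix and z the labels in {0, 1}.  The
    gradient of the average cross-entropy loss is
        (1/N) * X' (sigmoid(X theta) - z).
    """
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))   # predicted probabilities
    return float(np.linalg.norm(X.T @ (p - z) / len(z)))
```

A NOG near zero indicates a near-stationary point, which is why it is used below to compare the convergence of the five scenarios.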

A. THE EFFECT OF THE BATCH SIZE
We first show the effect of the batch size in our proposed method. Specifically, we consider batch sizes m ∈ {1, 10, 50, 100}. The total iteration number is set to 1,000 so that all algorithms have sufficient iterations to converge. To be precise, the first-stage iteration number is set to S = 50, so that the second-stage iteration number is T = 20. In particular, we arrange the available agents into blocks with no common agents; for simplicity, the blocks are chosen one by one at each iteration while the unchosen agents remain silent, as Figure 1 illustrates. We assume there are K = 10 agents with AG = 5 available agents at each iteration; consequently, each agent will update its variables 10 times, which, according to Theorem 1, is sufficient to ensure convergence. For the Hessian approximation scheme Sd-REG-LBFGS, the regularization parameters are set to γ = 10^{-3} and δ = 1.25γ + 10^{-4} to satisfy the condition 0.8δ > γ. We first consider batch size m = 1. It can be seen from Figure 2 that our proposed method performs nearly the same as scenario 2). It may seem that omitting the Hessian approximation has lower computational cost while the performance is slightly better. However, without the Hessian approximation, scenario 4) performs worse than scenario 3). Hence, the performance improvement is brought largely by SVRG, and it overwhelms that of the Hessian approximation, which can also bring a significant improvement. It should be noted, however, that there is also large variance in the Hessian approximation matrix, which may counteract the potential performance improvement brought by the Hessian matrix to some extent. For larger batch sizes, our proposed method shows stable performance compared to small batch sizes.
Hence, we suggest choosing a sufficiently small batch size, so that the computational cost is greatly reduced while the performance remains nearly the same.

B. THE IMPACT OF FIRST STAGE
Recall that under the assumption that each agent updates its variables at least once within the second stage, the convergence speed becomes O(1/S). Moreover, the number of iterations in the first stage determines how many times the full gradient is calculated. With the above discussions, we therefore study the effect of the first-stage iteration number S. Here, the total iteration number is set to 1,000 to ensure a fair comparison; hence, the second-stage iteration number is 1000/S. We choose S ∈ {10, 20, 50, 100}. Moreover, for simplicity, we again arrange the available agents into blocks with no common agents, set the number of available agents to AG = 5, and divide the dataset into K = 10 parts. Consequently, each agent will update its variables r = 50, 25, 10, 5 times, respectively. Figure 3 shows the results for the different first-stage iteration numbers. It can be seen that the convergence speed is nearly the same in all four scenarios of r = 50, 25, 10, 5; since the convergence speed is O(1/(r·S)), as discussed above, the four scenarios share nearly the same convergence speed, which matches the theory well. In particular, the zigzags along the curves show a regular pattern, which is an interesting phenomenon. To be specific, taking Figure 3(a) as an example, there are in general 10 large zigzags, each with smaller zigzags oscillating within it; we call the larger ones global zigzags and the smaller ones local zigzags. In Figure 3(a), the number of global zigzags equals the first-stage iteration number S = 10. This is because the full gradient is calculated at the start of each first stage and then passed to the second stage; hence, with the newly calculated full gradient, the curve drops more markedly at the start of the second stage. The local zigzags clearly lie within the second stage.
The curve oscillates because only one block of agents updates its variables at each iteration while the others remain silent, and then the next block is chosen for updating. Hence, a larger S leads to denser zigzags, which matches well with Figure 3.

C. THE IMPACT OF AVAILABLE AGENTS
According to the above experimental results, the number of updates of each agent within the second stage has a significant impact on the performance of our proposed method. Hence, we continue by studying the impact of the number of available agents at each iteration, which determines the number of updates of each agent. Specifically, the total iteration number is set to 1,000, and the first-stage iteration number is set to S = 50, so that the number of updates of each agent can be varied more flexibly. The dataset is divided into K = 20 parts. Furthermore, we set AG ∈ {1, 3, 5, 7, 9, 20}; consequently, the corresponding numbers of updates for each agent are 1, 3, 5, 7, 9 and 20, respectively.
It should be noted that when AG = 20, it means all agents update their variables at each iteration.
As Figure 4 illustrates, a larger number of updates per agent generally leads to a better NOG performance and faster convergence speed, which matches the convergence theory O(1/(r·S)) well. Note that the curves in Figure 4(a) and (f) are much smoother. For Figure 4(f), the reason is that all agents update their variables, so there are no zigzags. In Figure 4(a), it can be seen that at the start of each second stage the curve drops more sharply than during the iterations within the second stage. However, compared with the other cases, in which multiple agents update their variables at each iteration, this decrease is much slighter, because only one agent works to decrease the NOG. For the other cases, with multiple agents available at each iteration, the performance improvement becomes smaller as the number of available agents increases. Hence, our proposed method has strong fault tolerance, which also means that one can choose fewer available agents at each iteration for the update, because the very slight performance sacrifice is worth trading for the saved hardware resources.

D. THE STUDY OF THE CONVERGENCE
Next, we study what happens when the condition that each agent updates its variables at least once is not satisfied. To construct this scenario, the dataset is divided into K = 60 parts. The total iteration number is set to 1,000 with S = 50. Hence, under the available-agent scenario of Figure 1, a block of at least 3 agents should update the variables at each iteration. To probe this condition, we set AG = [1, 2, 3, 4], which forms three groups: AG = [1, 2] does not satisfy the convergence condition, AG = 3 just meets it, and AG = 4 just exceeds it.
It can be seen from Figure 5 that when the condition that each agent should update its variables at least once (here, AG ≥ 3) is not satisfied, all algorithms perform poorly. AG = 3 is a watershed: at AG ≥ 3 a satisfactory performance can be reached, while below AG = 3 the performance is poor. Note that when AG = 4, the performance of all algorithms is substantially improved compared with AG ≤ 3. Therefore, the results match Theorem 1 well.
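The AG ≥ 3 threshold can be checked mechanically: with K = 60 agents and T = 20 second-stage iterations, AG agents per iteration can cover every agent only if T · AG ≥ K, i.e., AG ≥ 3. A minimal sketch under a round-robin availability assumption (ours, for illustration; not the paper's code):

```python
def every_agent_updates(K=60, T=20, AG=3):
    """True iff every one of the K agents updates at least once within
    a second stage of T iterations, with AG agents available per
    iteration under an assumed round-robin schedule."""
    updated = set()
    for t in range(T):
        for j in range(AG):
            updated.add((t * AG + j) % K)
    return len(updated) == K
```

Under this schedule, AG = 1 and AG = 2 leave some agents without any update, while AG = 3 and AG = 4 cover all 60 agents, mirroring the three groups studied above.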

V. CONCLUSION
As classical stochastic ADMM methods suffer from several issues, we have proposed a novel stochastic ADMM method to address these concerns. Specifically, we incorporate the SVRG strategy and divide the ADMM procedure into two stages. In the second stage, the agents update their corresponding variables in parallel; however, we only require each of them to update its variable at least once within the stage. Furthermore, the convergence properties of our proposed method are comprehensively studied. Since we do not assume convexity of the objective function, our proposed method can potentially be applied to nonconvex problems. In comparisons with other methods, our proposed method has been shown to be advantageous, especially in flexibility.

A. PROOF OF LEMMA 1
Proof: Suppose that at the current iteration $t$ we sample a subset of data points $\mathcal{D}_k^t$ with batch size $b_k^t \le N_k$ for the stochastic gradient: Here, we have set $c_k^t = L_k/b_k^t$ and used the following result: The proof of Lemma 1 ends.
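The mini-batch estimator whose variance Lemma 1 bounds rests on the SVRG-style correction introduced in Section I. A minimal sketch (the function name and the scalar toy problem are ours, for illustration):

```python
def svrg_gradient(grad_i, batch, x, x_snap, full_grad_snap):
    """SVRG-style variance-reduced gradient estimate at x:
    g = (1/b) * sum_{i in batch}[grad_i(x) - grad_i(x_snap)] + full_grad_snap.
    The correction term has expectation zero, so g is unbiased for the
    full gradient, and its variance shrinks as x approaches x_snap."""
    b = len(batch)
    correction = sum(grad_i(i, x) - grad_i(i, x_snap) for i in batch) / b
    return full_grad_snap + correction

# Toy check with f_i(x) = 0.5 * (x - y_i)^2, so grad_i(x) = x - y_i:
y = [1.0, 2.0, 3.0, 4.0]
grad_i = lambda i, x: x - y[i]
full_grad = lambda x: x - sum(y) / len(y)
g = svrg_gradient(grad_i, range(len(y)), 0.7, -1.3, full_grad(-1.3))
# With the full batch, the estimate recovers the exact gradient at x.
assert abs(g - full_grad(0.7)) < 1e-12
```

Averaging the estimator over all singleton batches likewise recovers the full gradient, which is the unbiasedness property the variance bound builds on.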

B. PROOF OF LEMMA 2
Proof: Using the fact that $\mathcal{L}(\{x_k\},\{\lambda_k\},x_0)$ is strongly convex with respect to $x_0$, we have (B1). Note that $x_{0,t+1}^{s+1}$ is the minimizer of the update in step 5 for $0 \in I_{t+1}^{s+1}$; hence, by setting $x_0^* = x_{0,t+1}^{s+1}$, we obtain the desired result in Lemma 2, and the proof of Lemma 2 ends.
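Inequality (B1) presumably instantiates the standard strong-convexity inequality below; the modulus $\gamma$ is our notation, as the paper's constant is not reproduced here.

```latex
% Strong convexity of L in x_0 with assumed modulus gamma, for any x_0^*:
\mathcal{L}(\{x_k\},\{\lambda_k\},x_0^{*}) \;\ge\; \mathcal{L}(\{x_k\},\{\lambda_k\},x_0)
+ \big\langle \nabla_{x_0}\mathcal{L}(\{x_k\},\{\lambda_k\},x_0),\; x_0^{*}-x_0 \big\rangle
+ \frac{\gamma}{2}\,\big\|x_0^{*}-x_0\big\|^{2}.
```

Evaluating this at the minimizer of the $x_0$-subproblem makes the inner-product term vanish, which is what yields the quadratic decrease in Lemma 2.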

C. PROOF OF LEMMA 3
Proof: Let us first consider the case $t = 1, \ldots, T-1$. For notational simplicity, we omit the notation $s$. Specifically, we have: Considering the subproblem in step 10 for the update of $x_k$ and nulling the derivative of its objective function, we obtain (C1). Furthermore, according to step 13 for the update of $\lambda_k$ and substituting $\lambda_k^{t+1}$, we obtain (C3). Thus, substituting (C3) into (C1) yields: Thus, we have obtained the desired result. As it contains the variable $x_{k,t-1}^{s+1}$, it does not apply to the case $t = 0$; hence, we continue to prove the result for $t = 0$. First, note that the last iterate $x_{k,T}^s$ of the second stage is reserved as the initial value of the next second stage, i.e., $x_{k,0}^{s+1} = \tilde{x}_k^s = x_{k,T}^s$. Thus, we have: Thus, we have obtained the desired result in Lemma 3.

D. PROOF OF LEMMA 4
Proof: For notational convenience, we omit the iteration number $s$ of the first stage. For $k \in I_{t+1}$, we have: Here, we have used the assumption that the gradient of $f_k$ is Lipschitz continuous with constant $L_k$. Recall that the update for $x_k$ in step 10 results in (C1); this guides the further derivation by adding and subtracting terms. To start, we first add and subtract the term involving $\rho_k$, where we have used (C1) for the third equality. Next, conditioned on $x_k^t$, the expectation $\mathbb{E}[(x_k^{t+1} - x_k^t)\,|\,x_k^t]$ is computed as: Substituting (D3) further yields: Thus, we have obtained the desired result in Lemma 4.
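The dual update in step 13 referenced here is not reproduced in this text; the standard consensus-ADMM form, which the substitution argument is consistent with, reads as follows (the exact form in the paper's algorithm may differ, so this is an assumed sketch, with $\rho_k$ the penalty parameter and $x_0$ the consensus variable):

```latex
% Assumed standard consensus-ADMM multiplier update (step 13):
\lambda_{k}^{t+1} \;=\; \lambda_{k}^{t} \;+\; \rho_k\left(x_{k}^{t+1} - x_{0}^{t+1}\right).
```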

E. PROOF OF LEMMA 5
Proof: Let us first evaluate the change of the individual augmented Lagrangian function for each update of every variable at agent $k$. Specifically, for $t = 1, \ldots, T-1$ (we omit the corresponding notation), we have (E1). According to (E1), we can add and subtract corresponding terms to construct the sequence $\zeta_{k,t}^s$; subsequently, the difference $\zeta_k^{t+1} - \zeta_k^t$ can be bounded from above based on (E1): Moreover, for $t = 0$, we have: Based on this, we further have: Subsequently, from the definitions of the sequences $\{\mu_{t,k}^s\}$ and $\{M_{t,k}^s\}$ and combining the above, we obtain a bound on $\zeta_{k,t+1}^{s+1} - \zeta_{k,t}^{s+1}$. By taking the summation, we can obtain the desired result.

F. PROOF OF THEOREM 1
Proof: We first show that the sequence $\{\zeta_t^s\}$ is bounded from below. First, we have: Here, $t(k)$ denotes the last iteration at which the $k$th agent updates its variables. Taking the summation over the above equation, we have: Thus, the sequence $\{\zeta_t^s\}$ must be bounded from below. Moreover, as it is monotonically decreasing, $\{\zeta_t^s\}$ is convergent, i.e., $\lim_{s\to\infty} \zeta_t^s = \zeta^*$. Next, we use this result to show that our proposed stochastic ADMM method is convergent. Specifically, for $s = 1, \ldots, S$ and $t = 0, \ldots, T-1$, taking the summation yields: Therefore, by letting $S$ go to infinity and using the fact that each agent updates its variables at least once every $T$ iterations, we obtain: Subsequently, we have $\lim_{s\to\infty} e_t^s = 0$, which implies that $x_k \to x^*$ and $x_0 \to x^*$ with probability 1. Hence, for a given sufficiently small value $\varepsilon > 0$, we can always find an $S_1$ such that $e_{k,t(s)}^s < \varepsilon$ for all $s \ge S_1$. Therefore, we further obtain: Here, $t(s)$ means that at the $s$th iteration of the first stage, $x_0$ is updated at the $t(s)$th iteration of the second stage. With a straightforward derivation, we can obtain $\mathbb{E}\|\nabla f_k(x_{k,t}^s) + \lambda_{k,t}^s\| \le \eta_1 \varepsilon$. Hence, we can always choose a sufficiently small $\varepsilon > 0$ such that for all $s \ge S$ the desired results hold with $S \ge \eta(\zeta_0^2 - \zeta^*)/(\tau\varepsilon)$, where $\eta := \max(\eta_1, \eta_2, \eta_3)$. The proof of Theorem 1 ends.
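The summation step above is a telescoping argument; writing it out makes the limit explicit. Here $\tau$ is the descent constant from the theorem's final estimate, and the per-iteration error term $e_t^s$ follows the paper's definition (not reproduced here):

```latex
% Monotone decrease plus lower boundedness gives summability of the errors:
\tau \sum_{s=1}^{S} e_t^{s} \;\le\; \sum_{s=1}^{S}\big(\zeta_t^{s} - \zeta_t^{s+1}\big)
\;=\; \zeta_t^{1} - \zeta_t^{S+1} \;\le\; \zeta_t^{1} - \zeta^{*} \;<\; \infty,
\qquad\text{so}\quad \lim_{s\to\infty} e_t^{s} = 0 .
```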
LIN WU is currently pursuing the Ph.D. degree with the School of Computer and Cyberspace Security, Communication University of China, where he is also a Senior Engineer and Master's Tutor. He has been in charge of multiple projects of the Science and Technology Commission of Chaoyang, Beijing, as well as multiple university-level projects, developing several media application platforms. He has participated in and completed sub-topics of the 863 Program and of the National Natural Science Foundation of China CNGI project. He has published eight articles (including six EI articles) and holds eight software copyright certificates. His main research interests include the theory and development of application software, including intelligent big data analysis, machine learning, and machine translation.

TUO SHI was born in Beijing, China. She received the Ph.D. degree in signal and information processing from the Communication University of China, Beijing. She is currently an Associate Professor with Beijing Police College, Beijing, and a Postdoctoral Researcher with the China CITIC Institute, Beijing. Her research interests include machine learning, data mining, deep learning, and their applications in the field of crime.

VOLUME 10, 2022