A General Framework to Distribute Iterative Algorithms With Localized Information Over Networks

Emerging applications in the Internet of Things (IoT) and edge computing/learning have sparked massive renewed interest in developing distributed versions of existing (centralized) iterative algorithms often used for optimization or machine learning purposes. While existing work in the literature exhibits similarities, for the tasks of both algorithm design and theoretical analysis, there is still no unified method or framework for accomplishing these tasks. This article develops such a general framework for distributing the execution of (centralized) iterative algorithms over networks in which the required information or data is partitioned between the nodes in the network. This article furthermore shows that the distributed iterative algorithm, which results from the proposed framework, retains the convergence properties (rate) of the original (centralized) iterative algorithm. In addition, this article applies the proposed general framework to several interesting example applications, obtaining results comparable to the state of the art for each such example, while greatly simplifying and generalizing their convergence analysis. These example applications reveal new results for distributed proximal versions of gradient descent, the heavy ball method, and Newton's method. For example, these results show that the dependence on the condition number for the convergence rate of this distributed heavy ball method is at least as good as that of centralized gradient descent.


Thomas Ohlson Timoudas, Silun Zhang, Sindri Magnússon, Member, IEEE, and Carlo Fischione, Member, IEEE
The idea of investigating the convergence properties of distributed iterative algorithms by drawing on the massive wealth of existing work in the nondistributed (also referred to as centralized) setting, and combining it with a network consensus protocol, appears to be quite fruitful. Yet, the development of such distributed algorithms from existing centralized ones still requires significant work, both for the algorithm design and to establish their theoretical convergence properties, even though many examples share many characteristics in both respects [19], [20]. There is an urgent need to standardize this process and the methods to transform centralized iterative algorithms into distributed ones, and to simplify the development of new distributed algorithms for future networked applications, while taking advantage of the significant existing work on centralized algorithms.
Such a unified framework would have additional benefits. For instance, many emerging IoT applications have very different requirements compared to traditional distributed settings, e.g., relying on massive wireless networks and/or (provisional) mesh networks without a central coordinator, with intermittent and/or limited communication in terms of bandwidth and data rates [21], [22]. Additionally, the data needed for many applications is generated by several different nodes with different ownership. As a result, these applications tend to emphasize higher degrees of decentralization and autonomy, data privacy, energy conservation, and interference mitigation. These network and communication considerations add another layer to the distributed algorithm design. A unified framework for distributing (centralized) iterative algorithms could also facilitate the analysis and integration of such an added communication layer, independently of the specifics of the "base" algorithm that is being distributed.
We shall now define the class of algorithms that we consider in this article. In many applications, a (the) solution x* ∈ R^d to a given problem can be found as a fixed point of an iterative algorithm

x(t + 1) = Φ(x(t)) = F(x(t), Dx(t))     (1)

given by a mapping Φ, where F : R^d × R^m → R^d and D : R^d → R^m are some operators. Here, x(t) is an iterate that is updated according to the algorithm F(·), and Dx(t) is some additional information, possibly depending on x(t), required to execute the algorithm, e.g., a gradient or a Hessian matrix. We shall refer to the operator D as the "data" operator. However, D does not (necessarily) represent a concrete data set, but rather some additional information derived from such a data set. Examples of algorithms that can be expressed in this form include vanilla/proximal gradient methods, Newton's method, and dynamic programming [23], [24]. Now that the class of algorithms under consideration has been made clear, we shall clarify the network setting, i.e., how the data are distributed throughout the network. In many of the emerging IoT applications, the data will not be located at a single node, but rather distributed between multiple nodes in the network. For instance, in networks where all the nodes exploit data in the same manner, Dx(t) is often the average of local data operators [25]

Dx(t) = (1/n) Σ_{i=1}^{n} D_i x(t)     (2)

where D_i(x) is the local data (operator) of node i, e.g., D_i(x) is the local gradient or Hessian matrix of node i, evaluated at the point x. In highly dense and/or mesh networks, it may not be feasible to transmit all the local data D_i x(t) to a central coordinator, e.g., due to power/interference constraints, or the need to route data through many nodes. In such cases, especially in the absence of a central coordinator, it may therefore be necessary to compute the solution given by the algorithm in (1) in a distributed fashion, without access to the full data (operator) D, using instead local estimates of D based on communication between network neighbors (i.e., consensus/gossip). The goal of this article is to develop a general method for distributing the execution of such iterative (centralized) algorithms in networks.
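To make the abstract form (1) concrete, the following minimal sketch (in Python, with an illustrative least-squares problem and names of our own choosing, not from the article) casts gradient descent as x(t + 1) = F(x(t), Dx(t)), where the "data" operator D returns the gradient:

```python
import numpy as np

# Hypothetical illustration: gradient descent cast in the form (1),
# x(t+1) = F(x(t), D x(t)), with F(x, y) = x - eta * y and D = grad f.
# Here f(x) = 0.5 * ||A x - b||^2, so D x = A^T (A x - b).

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

eta = 1.0 / np.linalg.norm(A.T @ A, 2)   # step size <= 1/L (L-smoothness)

def D(x):
    """'Data' operator: the gradient of f at x."""
    return A.T @ (A @ x - b)

def F(x, y):
    """Algorithm operator: a gradient step using the supplied data y."""
    return x - eta * y

x = np.zeros(5)
for _ in range(500):
    x = F(x, D(x))                       # the fixed-point iteration (1)

# The fixed point is the least-squares solution.
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x, x_star, atol=1e-6))
```

The same skeleton accommodates Newton's method (D returning the Hessian and gradient) or proximal methods (F applying a proximal map after the step).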

A. Related Work
In recent years, the study of distributed learning and optimization has seen tremendous interest and advances. In fact, many problems in engineering, learning, and coordination of networked agents can be formulated (at least in part) as optimization problems. Early work on decentralized algorithms to solve such optimization problems includes primal-dual approaches for resource allocation in networks [26], and incremental subgradient methods, by which the nodes sequentially broadcast their local (sub)gradients to the others, followed by a common gradient update, in a round-robin fashion [27], in random sequence [28], or in a more complicated deterministic sequence [29]. These ideas were further developed to scale better in networks with many nodes in [6], allowing nodes in (decentralized) mesh networks to transmit their local gradients and perform updates in parallel (and only to their direct network neighbors), using ideas from consensus (in particular, dynamic average consensus and consensus tracking [2], [3]) that had proved successful in applications to multiagent control [10], [11], and sensor fusion [16].
One major drawback of the early distributed subgradient method introduced in [6] is that it requires decaying step sizes. While suitable (and sufficient) for nonsmooth objective functions, this requirement results in sublinear convergence rates even for strongly convex objective functions, in contrast to (nondistributed) standard gradient descent. The authors in [30] and [31] analyzed the effects of using a (sufficiently small) constant step size, and found that the (mean) squared error in fact decreases linearly, but only up to a certain nonzero error bound (which increases with the step size). Subsequent work showed that it was possible to modify this algorithm to attain linear convergence rates in the strongly convex case.
Many of these algorithms in the literature are very similar, with differences in the order of operations, in what exactly is communicated in the consensus steps, and/or differences due to the particular algorithm that is being decentralized. In particular, they all rely on ideas developed in earlier work on dynamic average consensus and consensus tracking [2], [3]. As a result, recent work has attempted to reconcile these in a unified algorithmic framework. In particular, [19], [20], [41], [47] have proposed general frameworks that unify algorithm design and convergence proofs for several different variations of distributed gradient descent, and their proximal versions, in the literature. A general separation principle for designing decentralized optimization algorithms, by combining a base optimization algorithm with consensus tracking, is proposed in [48]. Additionally, a unified convergence analysis approach is proposed there, and demonstrated on decentralized versions of gradient descent and ADMM, by regarding the gradient operators as a feedback regulator in a closed-loop dynamical system. These works provide a way to separate the optimization and average-consensus ingredients of distributed optimization algorithms, reflecting a divide-and-conquer strategy: the development of new consensus dynamics, e.g., [2], [3], can proceed independently of the optimization iterations. In this article, we continue along this path, separating the network collaboration from a general class of contraction iterations.

B. Summary of Our Contributions
We summarize our main contributions as follows.
r Given a general class of iterative algorithms designed for centralized settings, we formalize and develop a framework for distributing their execution in (mesh) networks, in which the information necessary for such an execution is spread throughout the network.
r We prove that, if the original algorithm has a linear convergence rate, then the distributed algorithm also has a linear convergence rate (with a different constant factor), for a general class of algorithms.
r As a special case, we show that our framework provides a unified method and analysis for distributing multiple different first-order gradient methods (e.g., standard gradient descent, (proximal) heavy ball, and proximal gradient descent), as well as second-order Newton methods.
r These results show, for instance, that the dependence on the condition number (the ratio between the smoothness and strong convexity parameters) of the convergence rate of our distributed (proximal) heavy ball method is no worse than for centralized gradient descent, an improvement on the existing state of the art.
r In this work, the analysis of the proximal distributed versions follows immediately from the corresponding analysis of the nonproximal case. Accordingly, we can show that the proximal distributed version of (any) such algorithm enjoys the same convergence rate as the distributed version of the nonproximal algorithm (analogous to the centralized setting).

C. Structure of this Article
Section II introduces the problem formulation, together with a list of assumptions we will use to analyze the problem and prove our theoretical results. The proposed distributed solution (algorithm) is given in Section III. Section IV contains our main theoretical results about the convergence (rate) of the proposed algorithm. The main results are applied to several examples from optimization in Section V. The results in Section V demonstrate the wide applicability of our framework, and also establish some new results for specific example algorithms. The proofs of the main theoretical results are given in Section VI.

D. Notation
In the rest of this article, unless otherwise indicated, all vectors are regarded as real-valued row vectors. The exception to this rule is that 1_k always denotes a column vector in R^{k×1}, whose components are all 1. Given n vectors v_1, ..., v_n ∈ R^d, we denote by v̄ ∈ R^d the average of the vectors, v̄ = (1/n) Σ_{i=1}^{n} v_i. Given a matrix A ∈ R^{m×n}, we denote by A_{•,i} and A_{j,•} the ith column and jth row of the matrix A, respectively. The transpose of A is denoted by A^T. For vectors, ‖·‖_2 denotes the Euclidean norm, and for matrices it denotes the induced operator norm, defined by ‖A‖_2 = max_{‖x‖_2 = 1} ‖Ax‖_2. The norm ‖·‖_F denotes the Frobenius norm of matrices, i.e., ‖A‖_F = (Σ_{i,j} A_{i,j}^2)^{1/2}. The map I refers to "the" identity map, whose domain is given by the context.

II. PROBLEM FORMULATION
As mentioned in the introduction, many interesting applications involve finding fixed-point solutions of an iterative algorithm of the form (1), which can be equivalently expressed as the (iterative) algorithm

x(t + 1) = F(x(t), y(t)),  y(t) = Dx(t).     (3)

Recall that F : R^d × R^m → R^d is some operator, given by the algorithm of the problem application, and that the data operator D : R^d → R^m models some additional input data, which may depend on the current iterate x(t), required to execute the algorithm. Note that many optimization algorithms fit the form given in (3). In Section V, the correspondence between concrete examples and the abstract algorithm formulation in (3) is demonstrated; see Section V-A for the proximal gradient descent method, Section V-B for the heavy ball method, and Section V-C for Newton's method. The premise of this article is as follows: suppose that n distinct agents (called nodes) have the common goal of finding the solution (fixed point) x* of an iterative algorithm of the form (1),

x(t + 1) = Φ(x(t)) = F(x(t), Dx(t)).
We shall assume that this iterative algorithm is linearly convergent, i.e., that Φ : R^d → R^d has a unique fixed point x* satisfying ‖Φ(x) − x*‖_2 ≤ r ‖x − x*‖_2, for some 0 ≤ r < 1 and every x ∈ R^d. The exact technical assumptions used in this article are given in Section II-A. We will furthermore assume that the data D is distributed between these n nodes in the network, such that each node i = 1, ..., n has a local data operator D_i : R^d → R^m, and that these local data operators average to

Dx = (1/n) Σ_{i=1}^{n} D_i x     (4)

for every x ∈ R^d. For example, D_i x might be the local gradient or Hessian matrix of node i at the point x. The assumption in (4) is in general not very restrictive; see Remark 4 and the discussion following it, in Section II-A. In particular, it is assumed that no single node knows the (global) data operator D(·); however, it is assumed that all of the nodes know the operator F (i.e., how to combine the local data in the algorithm).
Communication between the nodes is restricted to the edges of a fixed (constant) undirected (network) graph G = (V, E) (which may be arbitrary), in which the vertices V = {1, ..., n} represent the agents, and the edges in E represent (bidirectional) communication links. We assume that the graph is connected, but it can otherwise have arbitrary topology. Two nodes i, j ∈ V are called neighbors if and only if there is an edge (i, j) ∈ E, and we write j ∈ N(i) if there is an edge from node i to j. In this network model, two agents i, j ∈ V are able to communicate if and only if they are neighbors. Since each node has access to its own data, we assume that (i, i) ∈ E for every i ∈ V. Given a graph G with n nodes, a mixing matrix for the graph G is a doubly stochastic¹ matrix W = (w_ij) ∈ R^{n×n}, whose elements satisfy w_ij > 0 if and only if (i, j) ∈ E. Suppose that σ is the second largest eigenvalue of W (1 is the largest one). For such matrices, one can show that

‖Wx − x̄ 1_n‖_2 ≤ σ ‖x − x̄ 1_n‖_2

for any (column) vector x = (x_1, ..., x_n) ∈ R^{n×1}, where x̄ = (1/n) Σ_{i=1}^{n} x_i. We are now ready to state the main problem (inspired by a conjecture in [35]) to which this article is devoted.

Problem 1: Develop an iterative distributed (decentralized) algorithm to find the solution (fixed point) x* of (3) that maintains a linear convergence rate,²

such that communication is restricted to the edges (links) in E.
It would be interesting to pose and solve the same problem for other convergence rates, such as sublinear ones. In Section V, we further explore this problem in the context of several common optimization algorithms, and show how our results apply to them, namely: 1) standard (and proximal) gradient descent; 2) the heavy ball method; 3) Newton's method.
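The mixing-matrix machinery above can be illustrated numerically. The following sketch (our own construction, not from the article) builds Metropolis-Hastings weights for a ring graph and checks both double stochasticity and the consensus contraction governed by σ:

```python
import numpy as np

# A minimal sketch (names and topology are our own): building a
# Metropolis-Hastings mixing matrix W for a ring graph with n nodes,
# and checking the contraction ||W x - xbar 1|| <= sigma ||x - xbar 1||.

n = 8
edges = [(i, (i + 1) % n) for i in range(n)]   # ring topology

deg = np.zeros(n, dtype=int)
for i, j in edges:
    deg[i] += 1
    deg[j] += 1

W = np.zeros((n, n))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
for i in range(n):
    W[i, i] = 1.0 - W[i].sum()                 # self-loop (i, i) in E

# Doubly stochastic: every row and every column sums to one.
assert np.allclose(W.sum(axis=0), 1) and np.allclose(W.sum(axis=1), 1)

eigvals = np.sort(np.abs(np.linalg.eigvalsh(W)))
sigma = eigvals[-2]                            # second largest modulus

x = np.random.default_rng(1).standard_normal(n)
xbar = x.mean() * np.ones(n)
print(sigma < 1)                               # connected graph => sigma < 1
print(np.linalg.norm(W @ x - xbar)
      <= sigma * np.linalg.norm(x - xbar) + 1e-12)
```

Because W is symmetric and doubly stochastic, the disagreement vector x − x̄1 lies in the subspace orthogonal to 1, on which W contracts by the factor σ.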
To develop distributed solutions (algorithms) to this problem, and establish their theoretical properties, we must first pin down the structures and assumptions we impose on the original algorithm (3).

A. Assumptions
In this work, we consider algorithms that converge linearly. More formally, we make the following assumption.
Assumption 1: There is a constant r ∈ [0, 1) and a unique fixed point x* ∈ R^d, such that

‖Φ(x) − x*‖_2 ≤ r ‖x − x*‖_2     (5)

for every x ∈ R^d. Moreover, let δ ≥ 0 be a constant satisfying

‖Φ(x) − x‖_2 ≤ δ ‖x − x*‖_2     (6)

for every x ∈ R^d.

¹Every component of the matrix is nonnegative, and moreover each row's components sum to one, and each column's components sum to one.
²The iterative algorithm in (3) converges linearly since r ∈ [0, 1). The convergence rate of the distributed algorithm may have different constants.

Remark 1: Note that any contraction mapping Φ(x) satisfies (5), which is therefore weaker than assuming that Φ(x) is a contraction. Moreover, the assumption (5) also implies that the underlying algorithm Φ(·) has a linear convergence rate.
Algorithms with other, more general convergence rates are left for future study. Here, we conjecture that, for any algorithm with a sublinear convergence rate, a decentralization method similar to the one proposed in this article exists.
Remark 2: Furthermore, for any mapping Φ that satisfies (5), the assumption in (6) naturally holds with δ = 1 + r ≤ 2 (which is not sharp, in general). The reason we still state (6) explicitly is that the convergence rate we obtain for our proposed algorithm depends on δ: a smaller δ gives a better convergence rate.
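For completeness, the bound δ = 1 + r follows from (5) and the triangle inequality:

```latex
\|\Phi(x) - x\|_2 \;\le\; \|\Phi(x) - x^*\|_2 + \|x^* - x\|_2
\;\le\; (r + 1)\,\|x - x^*\|_2 .
```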
While we expect it to be possible to extend our results to algorithms that are convergent in the Lyapunov sense, i.e., with respect to a Lyapunov function, that would require a substantially more complicated proof. Therefore, we leave it as a possible future extension of our work.
We also need to impose some smoothness conditions on the algorithm and the data operator. Specifically, we assume that both of the functions F and D are Lipschitz continuous. Lipschitz continuity for F can be stated as follows.

Assumption 2: There are positive real numbers L_x and L_y, such that

‖F(x, y) − F(z, u)‖_2 ≤ L_x ‖x − z‖_2 + L_y ‖y − u‖_2

for any points x, z ∈ R^d and y, u ∈ R^m.

Additionally, we need to assume that the local data operators D_i are Lipschitz continuous, to ensure that the consensus step converges.
Assumption 3: There is a positive real number L_D, such that, for every i = 1, ..., n,

‖D_i x − D_i z‖_2 ≤ L_D ‖x − z‖_2

for any points x, z ∈ R^d. Moreover, (4) holds.
Remark 3: The Lipschitz condition in Assumption 3 could be stated in a weighted L_2-norm, to allow more flexibility when applying our framework to algorithms with complex data dependencies. We have opted for this simpler case, to keep the notation in the proofs to a minimum.
Remark 4: If D is instead given by a weighted sum D = Σ_i a_i D_i, one may use the transformed local operators D̃_i = n a_i D_i to recover the form (4), without changing the information content of D.
In many applications, it is natural to assume that (4) holds, e.g., in joint optimization problems in which the global objective function is the sum (or average) of local objective functions. This assumption is in general not very restrictive, since the local operators can in many cases be embedded into a direct product space, so that D = (1/n)(D_1, ..., D_n), and then combined inside the function F.
Lastly, we impose the following requirement on the communication graph.
Assumption 4: The network graph G is connected, and the second largest eigenvalue σ of the mixing matrix W satisfies σ < 1.

This is a standard assumption, and it is necessary to ensure that the local data can spread to every node (and that the nodes can reach consensus).

III. PROPOSED DISTRIBUTED SOLUTION
The main idea in decentralizing an (iterative) algorithm over a network is to balance the original iteration with the spreading and synchronization of data amongst the nodes. For an algorithm given by the mapping Φ(·), such a balance of two simultaneous procedures, iteration and synchronization, is often facilitated by the linear interpolation

Φ_θ = (1 − θ)I + θΦ     (7)

where I is the identity mapping, and the interpolation weight θ varies between 0 and 1, i.e., 0 < θ ≤ 1. Note that a point x is a fixed point of Φ if and only if it is also a fixed point of Φ_θ, and that Φ_1 = Φ. Essentially, θ is a "step size": decreasing it means taking a smaller "step" in the update direction determined by Φ − I, which can be used to increase the relative strength of a synchronization step (e.g., consensus) between nodes. This is also a common construction in the study of nonexpanding operators; for example, the parameter θ is the (normalized) step size used to ensure convergence of gradient descent methods. The linear interpolation for the mapping F(·,·) similarly reads

F_θ(x, y) = (1 − θ)x + θF(x, y).     (8)

Notice that we interpolate only with respect to the first variable, but not the second one (the input data variable). Using the above interpolations, the corresponding iterative algorithm is given by

x(t + 1) = Φ_θ(x(t)) = F_θ(x(t), Dx(t))

which has exactly the same solutions (fixed points) as the original one in (3) using Φ.
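The fixed-point-preserving property of the interpolation can be checked numerically. The sketch below uses an arbitrary affine contraction of our own choosing (not from the article) and verifies that x* remains a fixed point of Φ_θ for several values of θ:

```python
import numpy as np

# Sketch of the interpolated operator Phi_theta = (1 - theta) I + theta Phi:
# it has the same fixed points as Phi, and shrinks the per-step movement.
# Phi here is an arbitrary affine contraction chosen purely for illustration.

M = np.array([[0.5, 0.2], [0.1, 0.4]])
c = np.array([1.0, -1.0])

def Phi(x):
    return M @ x + c

def Phi_theta(x, theta):
    return (1 - theta) * x + theta * Phi(x)

x_star = np.linalg.solve(np.eye(2) - M, c)     # fixed point of Phi

# x* is also a fixed point of Phi_theta for every theta in (0, 1]:
for theta in (0.1, 0.5, 1.0):
    assert np.allclose(Phi_theta(x_star, theta), x_star)
print("fixed point preserved for all theta")
```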
Our proposed solution to Problem 1, given by Algorithm 1, relies only on (parallel) local execution of the algorithm, using local estimates of the global data operator D, followed by a consensus (communication and combination) step. The consensus step, which is an extension of consensus tracking and gradient tracking in the literature, essentially "diffuses" the local data operators D_i throughout the network, to compute (delay-compensated) local estimates that converge to the global operator D. This is the reason for the name of the algorithm: interleaved network consensus and local iteration compensation, or INCLIC.

Algorithm 1: (INCLIC) Interleaved network consensus and local iteration compensation.
Input: The mixing matrix W = (w_ij), and the parameter θ ∈ (0, 1] for the interpolation.
1: Initialization: Each agent i ∈ [n] initializes its local state variables x_i(0) (any initial value is allowed) and y_i(0) = D_i x_i(0).
2: At each iteration t = 0, 1, ..., each agent i performs the following process (in parallel):
Step 1: (Locally) compute the intermediate variable ξ_i(t + 1) = F_θ(x_i(t), y_i(t)).
Step 2: Send the local variable ξ_i(t + 1) to every network neighbor j ∈ N(i). Upon receiving the local variables of every network neighbor j ∈ N(i), (locally) compute the state variable x_i(t + 1) = Σ_{j∈N(i)} w_ij ξ_j(t + 1).
Step 3: (Locally) compute the intermediate variable φ_i(t + 1) = y_i(t) + D_i x_i(t + 1) − D_i x_i(t).
Step 4: Send the local variable φ_i(t + 1) to every network neighbor j ∈ N(i). Upon receiving the local variables of every network neighbor j ∈ N(i), (locally) compute the data estimate variable y_i(t + 1) = Σ_{j∈N(i)} w_ij φ_j(t + 1).
Step 5: Store the variables x_i(t + 1) and y_i(t + 1) for the next iteration.

3: END
To describe Algorithm 1 in more detail, introduce the local variables x_i(t), representing the local iterate at time t of node i ∈ V, and the local variables y_i(t), representing the local estimate at time t of node i of the data operator D. Each node i ∈ V first initializes its local variables, choosing x_i(0) ∈ R^d arbitrarily, and then setting y_i(0) = D_i x_i(0). At each new iteration t + 1, each node will first update its local variables, and then send these updated values to its neighbors in the communication graph G. First, the nodes update the x-variable, as in Steps 1 and 2. The local iterate x_i of each node i ∈ V (recall that V is the set of nodes) is updated as follows:

x_i(t + 1) = Σ_{j∈N(i)} w_ij F_θ(x_j(t), y_j(t))     (9)

using the interpolated operator F_θ given in (8), where the coefficients w_ij are the elements of the mixing matrix W = (w_ij).
Recall that each w_ij is nonnegative, and that w_ij is nonzero if and only if i and j are neighbors in G. Moreover, for each i = 1, ..., n, it holds that Σ_{j∈V} w_ij = 1.
We note that, even though we apply the consensus/combination step after evaluating the algorithm F, as in (9), our results remain valid (but with slightly different constants) if one instead uses the update rule

x_i(t + 1) = F_θ( Σ_{j∈N(i)} w_ij x_j(t), y_i(t) )

which applies the algorithm after the consensus step. These two different update rules are in fact almost equivalent, in the sense that "shifting" an orbit generated by the second update rule by W gives the same result as using the first rule, but with the initial points x_i(0) "shifted" by W.
Then, in Steps 3 and 4, the local estimates y_i of the global data operator are similarly updated as

y_i(t + 1) = Σ_{j∈N(i)} w_ij ( y_j(t) + D_j x_j(t + 1) − D_j x_j(t) ).

Introducing the difference D_j x_j(t + 1) − D_j x_j(t), or some variation thereof, in the update step is an important approach to preserving the convergence rate, e.g., as used in distributed gradient descent [34], [35].
The above steps can be expressed more compactly by introducing the stacked state (matrix) x(t) ∈ R^{n×d}, whose ith row is x_i(t), and the stacked data state estimates y(t) ∈ R^{n×m}, whose ith row is y_i(t). We notice that the spectral properties of W carry over to the stacked state matrices; specifically, for any matrix M ∈ R^{n×n'} with arbitrary n', it holds that

‖WM − 1_n M̄‖_F ≤ σ ‖M − 1_n M̄‖_F     (10)

where M̄ = (1/n) 1_n^T M denotes the (row) average of M. We note that inequality (10) implies that the consensus-based information spreading in the network has a linear convergence rate. This is a bottleneck for any decentralization method based on consensus, even if the original centralized algorithm has a superlinear convergence rate (see Remark 1 and the conjecture following it). We further extend the operators D and F_θ to the stacked state form, through the induced operators D : R^{n×d} → R^{n×m} and F_θ : R^{n×d} × R^{n×m} → R^{n×d}, given by

Dx(t) = (Dx_1(t)^T, ..., Dx_n(t)^T)^T

and

F_θ(x(t), y(t)) = (F_θ(x_1(t), y_1(t))^T, ..., F_θ(x_n(t), y_n(t))^T)^T

respectively. We also introduce the stacked local data operator

D_loc x(t) = (D_1 x_1(t)^T, ..., D_n x_n(t)^T)^T

which is simply the matrix whose rows are given by the local data operators evaluated at the local iterates, i.e., D_i x_i(t).
With the above stacked states, Algorithm 1 can then be compactly expressed as

x(t + 1) = W F_θ(x(t), y(t))
y(t + 1) = W ( y(t) + D_loc x(t + 1) − D_loc x(t) ).     (11)

In the next section, we will show that, with proper tuning of the interpolation variable θ (the step size), the distributed algorithm given in (11), that is, Algorithm 1, converges linearly to the same fixed point as the original algorithm.
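The stacked form lends itself to a compact implementation. The following runnable sketch applies it to distributed gradient descent on quadratic local objectives f_i(x) = 0.5 ‖A_i x − b_i‖²; the problem data, the ring topology, the step size, and the interpolation parameter θ are our own illustrative choices, not the article's:

```python
import numpy as np

# A runnable sketch of Algorithm 1 (INCLIC) in stacked form:
#   X(t+1) = W F_theta(X(t), Y(t)),      F_theta(x, y) = x - theta*eta*y
#   Y(t+1) = W (Y(t) + D_loc X(t+1) - D_loc X(t))
# where row i of D_loc X holds the local gradient D_i x_i = grad f_i(x_i).

rng = np.random.default_rng(2)
n, d = 8, 3
A = [rng.standard_normal((10, d)) for _ in range(n)]
b = [rng.standard_normal(10) for _ in range(n)]

def D_loc(X):
    """Stacked local data operator: row i is the local gradient at x_i."""
    return np.stack([A[i].T @ (A[i] @ X[i] - b[i]) for i in range(n)])

# Mixing matrix on a ring (each node also keeps a self-loop weight).
W = np.zeros((n, n))
for i in range(n):
    W[i, (i + 1) % n] = W[(i + 1) % n, i] = 1 / 3
    W[i, i] = 1 / 3

H = sum(A[i].T @ A[i] for i in range(n)) / n   # Hessian of the average f
eta = 1.0 / np.linalg.norm(H, 2)               # step size ~ 1/L
theta = 0.05                                   # conservative interpolation

X = np.zeros((n, d))
Y = D_loc(X)                                   # y_i(0) = D_i x_i(0)
for _ in range(20000):
    X_new = W @ (X - theta * eta * Y)          # x-update: F_theta, then W
    Y = W @ (Y + D_loc(X_new) - D_loc(X))      # y-update: consensus tracking
    X = X_new

# All rows should agree with the centralized minimizer of the average f.
x_star = np.linalg.lstsq(np.vstack(A), np.concatenate(b), rcond=None)[0]
print(np.allclose(X, np.tile(x_star, (n, 1)), atol=1e-5))
```

Note that every node converges to the minimizer of the global (average) objective, even though each node only ever evaluates its own local gradient and communicates with its two ring neighbors.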

IV. MAIN THEORETICAL RESULTS
Our main results show that, under the assumptions in Section II, and with proper tuning of the interpolation parameter θ, the proposed distributed Algorithm 1 of Section III converges exponentially (see the statements of the results for the exact definition). That is, Algorithm 1 maintains the same convergence properties as the original algorithm in (3).
We defer the proofs of the results to Section VI, to simplify the presentation in this section. These results will be further applied to important examples and applications in optimization; see Section V.
Throughout this section, we will assume that the operators F and D, which fully determine the original algorithm, are given. Moreover, we shall assume that these operators satisfy Assumptions 1-4. To state our results, we need to introduce the system matrix Γ, whose entries are given in terms of Λ = L_y L_D and the constants r, δ, L_x, L_y, L_D, and σ from Assumptions 1-4. The importance of this matrix Γ will become clear in the statements of the results and their proofs. In particular, the convergence rate of Algorithm 1 is bounded from above by the spectral radius of Γ.
We note that, in general, convergence of the distributed Algorithm 1 can only be guaranteed using some degree of interpolation, i.e., with 0 < θ < 1. This is analogous to how the step size must be tuned for gradient descent methods, and in those cases θ actually corresponds directly to the step size. Our first result, Theorem 1, establishes a sufficient condition for the distributed Algorithm 1 to converge, and in particular to achieve linear convergence, without any interpolation, i.e., with θ = 1. In fact, this same result can be applied to the interpolated algorithm F_θ, to derive sufficient conditions on the interpolation parameter that ensure convergence of the distributed Algorithm 1. That is our second result, Theorem 2.
Our third result, Theorem 3, is a specialization of Theorem 2, which lets us derive a better upper bound for the interpolation parameter θ. This is the result that will be used later in Section V, where we apply our results to different optimization algorithms.
Lastly, Theorem 4 shows that our results can be immediately applied to establish convergence for the proximal counterparts of these algorithms. This result demonstrates that convergence for the proximal versions of these algorithms can be derived directly from their nonproximal originals, with the same convergence rate, just like in the nondistributed (centralized) case. Without further ado, we present our first result.

Theorem 1 (No interpolation case): Suppose that the maps F, {D_i}_i, and W satisfy Assumptions 1-4. If the spectral radius ρ = ρ(Γ) of Γ is strictly less than one, i.e.,

ρ(Γ) < 1     (13)

then Algorithm 1 with θ = 1 converges to the unique solution x* at the exponential rate ρ. In particular, there is a positive constant C, independent of the initial value x(0), such that

‖x(t) − 1_n x*‖_F ≤ C ρ^t     (14)

for any t > 0 and initial value x(0), if y(0) = D_loc x(0).
In general, the convergence condition (13) may not hold, and (the distributed) Algorithm 1 may fail to converge. However, the next result shows that it is always possible to tune θ ∈ (0, 1] so that Algorithm 1 converges linearly. It follows from Theorem 1, by considering the interpolated operator F_θ, which satisfies the assumptions with the constants r_θ = (1 − θ) + θr, δ_θ = θδ, L_{x;θ} = (1 − θ) + θL_x, L_{y;θ} = θL_y, and Λ_θ = L_{y;θ} L_D = θΛ. See the proof in Section VI-B for details.
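These interpolated constants can be verified directly from the definition (8); for instance, for r_θ:

```latex
\Phi_\theta(x) - x^* = (1-\theta)\,(x - x^*) + \theta\,\big(\Phi(x) - x^*\big)
\;\;\Longrightarrow\;\;
\|\Phi_\theta(x) - x^*\|_2 \le \big((1-\theta) + \theta r\big)\,\|x - x^*\|_2
= r_\theta\,\|x - x^*\|_2 .
```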
Theorem 2: Under the same assumptions as in Theorem 1, Algorithm 1 converges exponentially fast for any θ ∈ (0, 1) that satisfies all of the conditions in (15). That is, for any interpolation parameter satisfying these conditions, the bound (14) in Theorem 1 holds with ρ replaced by ρ_θ = ρ(Γ(F_θ)), the spectral radius of Γ(F_θ).
We note that the conditions (15) given in Theorem 2 are in general not optimal for choosing the interpolation parameter θ, as discussed in the following remark.
Remark 5: The conditions in (15) were derived using an upper bound of the spectral radius, obtained through application of Gershgorin's circle theorem.These conditions are therefore, in general, not sharp.
With the additional assumption that 0 ≤ L_x ≤ 1, which holds in many interesting examples, it is possible to obtain significantly better convergence rates, as in the following result.
Theorem 3: Under the same assumptions as in Theorem 1, and assuming in addition that 0 ≤ L_x ≤ 1, Algorithm 1 converges exponentially fast, with a convergence rate ρ_θ, for any 0 < θ ≤ 1 satisfying a further upper bound. That is, the bound (14) in Theorem 1 holds with ρ replaced by this ρ_θ, which is an upper bound on the spectral radius of Γ(F_θ).
In many applications, such as proximal algorithms for regularization in learning, or constrained optimization, the map F is the composition of a "base" map with a proximal operator, such as a projection. In many cases, the proximal operator is a nonexpanding map, for example the projection onto a convex set. For the next result, we shall assume that the map Φ is a contraction, i.e., that it satisfies

‖Φ(x) − Φ(z)‖_2 ≤ r ‖x − z‖_2     (16)

for all x, z ∈ R^d. Given a nonexpanding operator P : R^d → R^d, set G(x, y) = P ∘ F(x, y) and consider the composition

Φ_G(x) = P ∘ Φ(x) = G(x, Dx)     (17)

which is again a contraction map with a unique fixed point x*_G (which is in general different from the fixed point x* of the original map Φ, e.g., as in projected gradient descent when the constraint set does not contain the stationary point). In this case, the following result establishes convergence of Φ_G directly from the convergence of Φ.

Theorem 4: Let operators F, D, and W be given, which satisfy Assumptions 1-4. Additionally, assume that Φ is a contraction map, as in (16). Then Theorems 1-3 all hold mutatis mutandis for the operators G and Φ_G, defined in (17), in place of F and Φ, and with the same conditions on θ and convergence rate ρ.
Moreover, the maps G and Φ G given in (17) satisfy Assumptions 1-4, with the exact same constants as the maps F and Φ, for any nonexpanding map P : R d → R d .

Remark 6:
The assumption that Φ is a contraction in Theorem 4 ensures that the composition P ∘ Φ is also a contraction with a unique fixed point. If Φ is only assumed to satisfy the weaker Assumption 1, i.e., linear convergence, the composition may have (among other complications) more than one fixed point.
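The mechanism behind Theorem 4 can be illustrated numerically. The specific contraction, contraction factor, and box constraint below are toy choices of ours, not from the paper; the sketch checks that composing a contraction with a nonexpansive projection yields a contraction with the same factor but a different fixed point.

```python
import numpy as np

r = 0.5  # contraction factor of our toy Phi

def Phi(x):
    # A simple contraction on R^2 with fixed point at the origin.
    return r * x

def P(x):
    # Euclidean projection onto the box [1, 2]^2 (nonexpansive).
    return np.clip(x, 1.0, 2.0)

rng = np.random.default_rng(0)
max_ratio = 0.0
for _ in range(1000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    den = np.linalg.norm(x - y)
    if den > 0:
        num = np.linalg.norm(P(Phi(x)) - P(Phi(y)))
        max_ratio = max(max_ratio, num / den)

# P o Phi contracts at least as fast as Phi alone.
assert max_ratio <= r + 1e-12

# Its fixed point lies inside the constraint box, away from Phi's
# fixed point (the origin), since the origin is outside the box.
z = np.ones(2)
for _ in range(100):
    z = P(Phi(z))
assert np.linalg.norm(z) > 0.5
```

Here the composed fixed point is the box corner closest to the origin, consistent with the remark above that the fixed point moves when the constraint set excludes the stationary point.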

V. APPLICATIONS TO DISTRIBUTED OPTIMIZATION AND MACHINE LEARNING
We now illustrate our theoretical results on distributed optimization algorithms. Consider a network with n agents, each with its own local differentiable objective function f_i : R^d → R. Additionally, suppose that they share a common proper, closed convex function g : R^d → R ∪ {∞}, which may be nondifferentiable. The goal of the nodes is to jointly solve the optimization problem

minimize_{x ∈ R^d} f(x) + g(x), where f(x) = (1/n) Σ_{i=1}^n f_i(x). (18)

Such problems are often solved using proximal gradient descent (or some variation thereof), i.e., the update rule

x(t + 1) = prox_{ηg}(x(t) − η∇f(x(t))).

The typical function of g is to represent regularization terms and/or problem constraints. In the constrained optimization case, the proximal operator is a projection operator. When g ≡ 0, the proximal operator is the identity operator, so this problem formulation contains (unconstrained) smooth optimization as a special case. Throughout this section, we assume that f : R^d → R is L-smooth and μ-strongly convex, i.e., for any pair x, y ∈ R^d,

μ||x − y|| ≤ ||∇f(x) − ∇f(y)|| ≤ L||x − y||.

We will also assume that each f_i is L-smooth and convex. However, they need not be μ-strongly convex individually, as long as their average f is; that is, it suffices that some, but not necessarily all, of them are strongly convex.
In this strongly convex case, it turns out that the proximal operator prox_{ηg} is nonexpansive if 0 < η ≤ 2/(μ + L); this is the only property of the proximal operator that we will use in the rest of this section. The constant κ = L/μ is known as the condition number of the problem. Under these assumptions on f, there is a unique global minimum x* (satisfying ∇f(x*) = 0), and many (centralized) algorithms for solving Problem (18) converge linearly. We will now show how several of these algorithms can be transformed into distributed algorithms, using our framework, while maintaining the essential convergence properties of the original algorithm. We will also compare our results to the state of the art for each specific example. Note in the following subsections how the proximal cases follow immediately from the nonproximal versions (directly using the nonexpansiveness property).
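As a concrete illustration of the nonexpansiveness property used throughout this section, take g = ||·||_1 (a choice of ours, not prescribed by the paper), whose proximal operator is soft-thresholding, and check the Lipschitz constant numerically:

```python
import numpy as np

def prox_l1(v, eta):
    # prox_{eta*||.||_1}(v): componentwise soft-thresholding
    return np.sign(v) * np.maximum(np.abs(v) - eta, 0.0)

rng = np.random.default_rng(1)
eta = 0.3
worst = 0.0
for _ in range(1000):
    u, v = rng.normal(size=5), rng.normal(size=5)
    den = np.linalg.norm(u - v)
    if den > 0:
        worst = max(worst,
                    np.linalg.norm(prox_l1(u, eta) - prox_l1(v, eta)) / den)

# Nonexpansive: Lipschitz constant at most 1.
assert worst <= 1.0 + 1e-12
```

The same check works for any proximal operator of a proper, closed convex function, which is what the composition argument of Theorem 4 relies on.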

A. Proximal Gradient Descent
Standard gradient descent can be represented by setting

F(x, y) = F(x, y; η) = x − ηy, and D_i(x) = ∇f_i(x) (20)

where η is the familiar step size, and D(x) = (1/n) Σ_{i=1}^n D_i(x) = ∇f(x). Note that there are also other ways of representing it.
If the step size η satisfies 0 < η ≤ 2/(μ + L), (standard) gradient descent is a contraction map with contraction factor r = 1 − ημ; see [23]. It therefore follows that F, D, and Φ satisfy Assumptions 1-3 with r = 1 − ημ, L_x = 1, L_y = η, L_D = L, and δ = 1 + r ≤ 2 (here, we simply use the worst bound for δ). Proximal gradient descent is simply the composition of F with the proximal operator prox_{ηg}, which in this case is a nonexpanding map, and the distributed version of proximal gradient descent follows accordingly. Since the gradient descent operator is a contraction (for the chosen step sizes), Theorem 4 applies. Therefore, the convergence rate for this distributed proximal gradient descent is the same as for the distributed nonproximal version, and we only need to analyze the special case of standard (nonproximal) gradient descent to obtain the convergence rate for the proximal version as well.
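The resulting scheme (with g ≡ 0, so the prox is the identity) can be sketched in a few lines. The quadratic local objectives, the ring topology, the step size, and the ordering of the consensus and local steps below are our own illustrative choices; the paper notes that the ordering of the consensus step varies between the equivalent formulations in the literature.

```python
import numpy as np

# Local least-squares objectives f_i(x) = 0.5 * ||A_i x - b_i||^2.
rng = np.random.default_rng(2)
n, d = 4, 3
A = [rng.normal(size=(6, d)) for _ in range(n)]
b = [rng.normal(size=6) for _ in range(n)]

def grad(i, x):
    # gradient of f_i at x
    return A[i].T @ (A[i] @ x - b[i])

# Doubly stochastic mixing matrix W for a ring of 4 nodes.
W = np.array([[0.5, 0.25, 0.0, 0.25],
              [0.25, 0.5, 0.25, 0.0],
              [0.0, 0.25, 0.5, 0.25],
              [0.25, 0.0, 0.25, 0.5]])

eta = 0.02
x = [np.zeros(d) for _ in range(n)]
y = [grad(i, x[i]) for i in range(n)]  # y_i(0) = D_i x_i(0)

for _ in range(5000):
    # local step x - eta*y, then consensus mixing
    x_new = [sum(W[i, j] * (x[j] - eta * y[j]) for j in range(n))
             for i in range(n)]
    # gradient tracking: mix y, then correct with the local gradient change
    y = [sum(W[i, j] * y[j] for j in range(n))
         + grad(i, x_new[i]) - grad(i, x[i]) for i in range(n)]
    x = x_new

# Compare with the centralized minimizer of (1/n) * sum_i f_i.
x_star = np.linalg.lstsq(np.vstack(A), np.concatenate(b), rcond=None)[0]
assert all(np.linalg.norm(xi - x_star) < 1e-6 for xi in x)
```

All local iterates reach the centralized minimizer, consistent with the linear convergence guaranteed by the framework for this contraction.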
As a side remark, notice that setting σ = 0 (corresponding to a fully connected network graph) recovers the centralized convergence rate. This condition is certainly met if θ = (1 − σ)²/(14(1 + κ)). In the case when g ≡ 0, i.e., smooth convex optimization, the proximal operator is the identity operator and the distributed algorithm reduces to the algorithms proposed in [33], [34], and [35] (the order of the consensus step W varies, but from a dynamical point of view they are essentially equivalent). Our results are consistent with their results, i.e., the existing state of the art, for this particular class of distributed algorithms and static network graphs with doubly stochastic W. However, it should be noted that [34] also treats time-varying graphs, and that [35] also treats the nonstrongly convex case, both of which are beyond the present study.
While [32], [33], [34], and [35] do not treat the proximal case, distributed proximal gradient descent algorithms are also developed in [20], [39], [40], and [41], but their algorithm designs differ from ours. Note that our analysis does not explicitly involve subgradients, but directly uses the nonexpansiveness property of the proximal operator.

B. Proximal Heavy Ball Method
The heavy ball method is given by the update rule

x(t + 1) = x(t) − α∇f(x(t)) + β(x(t) − x(t − 1))

where the constant β ∈ [0, 1) represents the momentum strength, and can be tuned according to the problem [49, Sec.
It can be seen that the eigenvalues of the matrix above are imaginary (nonreal) pairs, so the heavy ball iteration is a contraction for suitable α and β. Thus, Assumptions 1-3 are satisfied with r = 1 − √(αμ), L_x = 0, L_y = 1, L_D = r, and δ = 1 + r ≤ 2 (we simply use the worst bound). According to the proposed decentralized algorithm, the proximal heavy ball method then follows, where D_z(·) is defined in (22). Again, since the heavy ball method is a contraction for this choice of α and β, Theorem 4 gives that the proximal version of the distributed heavy ball method enjoys exactly the same convergence rate as the nonproximal version; therefore, we only need to analyze the nonproximal version. To the best of our knowledge, this is the first example of a distributed proximal heavy ball method with linear convergence guarantees in the literature. Theorem 3 applies to show that the distributed heavy ball method resulting from Algorithm 1 (and its proximal version) converges with a rate whose dependence on the condition number κ is the same as that of centralized (nondistributed) standard gradient descent. That is, the distributed heavy ball method is at least as good as centralized standard gradient descent. This also proves that the heavy ball method performs at least as well as standard gradient descent in the distributed setting, which, to the best of our knowledge, is the first time this has been proved analytically.
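The distributed heavy ball iteration can be sketched analogously to the gradient descent example. The problem data, the values of α and β, and the exact ordering of the momentum, local, and consensus steps are illustrative assumptions of ours, not the paper's exact operator splitting.

```python
import numpy as np

# Local least-squares objectives, as in the gradient descent example.
rng = np.random.default_rng(3)
n, d = 4, 3
A = [rng.normal(size=(6, d)) for _ in range(n)]
b = [rng.normal(size=6) for _ in range(n)]

def grad(i, x):
    return A[i].T @ (A[i] @ x - b[i])

W = np.array([[0.5, 0.25, 0.0, 0.25],
              [0.25, 0.5, 0.25, 0.0],
              [0.0, 0.25, 0.5, 0.25],
              [0.25, 0.0, 0.25, 0.5]])

alpha, beta = 0.02, 0.3
x = [np.zeros(d) for _ in range(n)]
x_prev = [xi.copy() for xi in x]
y = [grad(i, x[i]) for i in range(n)]  # gradient-tracking initialization

for _ in range(5000):
    # momentum-augmented local step, then consensus mixing
    x_new = [sum(W[i, j] * (x[j] - alpha * y[j] + beta * (x[j] - x_prev[j]))
                 for j in range(n)) for i in range(n)]
    y = [sum(W[i, j] * y[j] for j in range(n))
         + grad(i, x_new[i]) - grad(i, x[i]) for i in range(n)]
    x_prev, x = x, x_new

x_star = np.linalg.lstsq(np.vstack(A), np.concatenate(b), rcond=None)[0]
assert all(np.linalg.norm(xi - x_star) < 1e-6 for xi in x)
```

The momentum term β(x − x_prev) is the only change relative to the gradient descent sketch, mirroring how the framework reuses the same consensus and tracking machinery across base iterations.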
A similar distributed heavy ball method is introduced in [37], but it does not treat the proximal case. Since the convergence rate in [37] is given by a complicated expression, it is difficult to compare it to our work. It seems that the convergence rate in [37] is inversely proportional to the number of nodes, while ours is not. Since our convergence rate is very explicit and simple, that is another benefit of our approach. However, we should mention that [37] also treats a wider range of parameters α and β.

C. Proximal Newton's Method
Let PSD_d(μ) denote the set of positive definite matrices whose smallest eigenvalue is at least μ > 0, i.e., matrices A that satisfy x^T A x ≥ μ x^T x. Note that this set is closed under weighted averaging, and is therefore preserved by the consensus step. Define F(x, y_1, y_2) = x − η y_2^{-1} y_1, where the step size η ∈ (0, 1]. Assuming that each function f_i is also twice continuously differentiable, Newton's method can be represented by F together with the two data operators D_i(x) = ∇f_i(x) and D_i^{(2)}(x) = ∇²f_i(x). To simplify the rest of the analysis, and to ensure that Newton's method is contractive, assume that each Hessian ∇²f_i is constant; that is, ∇²f_i(x) = A_i for some (strictly) positive definite matrix A_i. Note that the matrices A_i can vary between nodes. This special case corresponds to least-squares minimization, which is a common problem in practical applications.
To fit this example into our current framework, represent (y_1, y_2) and (D_i, D_i^{(2)}) (for each i) as a single vector variable y and a single data operator D_i, by stacking the columns of the vector y_1 and the matrix y_2 on top of one another to create a d(d + 1)-dimensional vector (and analogously for the data operators). If we can show that F is Lipschitz continuous, it follows that it can be extended globally with the same Lipschitz coefficients, by Kirszbraun's theorem (see, e.g., [51, p. 201]).
The operator F is clearly 1-Lipschitz in the x-component. Since the second variable y_2 is a matrix in PSD_d(μ), F satisfies the corresponding Lipschitz conditions, where the last inequality follows since ||A||_2 ≤ ||A||_F ≤ √(rank(A)) ||A||_2 for any matrix A.
In this case, the relevant constants follow accordingly, and the distributed proximal Newton's method becomes the corresponding instance of Algorithm 1. Then, applying Theorem 3, we obtain the convergence rate for the distributed (proximal) Newton's method. We remark that, by extending our results to weighted norms, or to multivariable data, the factor d/μ can be removed and replaced by a term depending on the Lipschitz constant of the Hessian.
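The constant-Hessian (least-squares) case can be sketched by tracking both the average gradient (y1) and the average Hessian (y2), each node taking the damped Newton step x − η y2^{-1} y1. The problem data, the value of η, and the ordering of the steps are illustrative assumptions of ours.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 4, 3
A = [rng.normal(size=(6, d)) for _ in range(n)]
b = [rng.normal(size=6) for _ in range(n)]

def grad(i, x):
    return A[i].T @ (A[i] @ x - b[i])

H = [A[i].T @ A[i] for i in range(n)]  # constant local Hessians

W = np.array([[0.5, 0.25, 0.0, 0.25],
              [0.25, 0.5, 0.25, 0.0],
              [0.0, 0.25, 0.5, 0.25],
              [0.25, 0.0, 0.25, 0.5]])

eta = 0.5
x = [np.zeros(d) for _ in range(n)]
y1 = [grad(i, x[i]) for i in range(n)]  # tracks the average gradient
y2 = [H[i].copy() for i in range(n)]    # tracks the average Hessian

for _ in range(2000):
    # damped Newton step using the local estimates (y1, y2)
    step = [x[j] - eta * np.linalg.solve(y2[j], y1[j]) for j in range(n)]
    x_new = [sum(W[i, j] * step[j] for j in range(n)) for i in range(n)]
    y1 = [sum(W[i, j] * y1[j] for j in range(n))
          + grad(i, x_new[i]) - grad(i, x[i]) for i in range(n)]
    # Hessians are constant, so Hessian tracking is pure consensus;
    # averages of positive definite matrices stay positive definite.
    y2 = [sum(W[i, j] * y2[j] for j in range(n)) for i in range(n)]
    x = x_new

x_star = np.linalg.lstsq(np.vstack(A), np.concatenate(b), rcond=None)[0]
assert all(np.linalg.norm(xi - x_star) < 1e-6 for xi in x)
```

Note that every y2 iterate remains in PSD_d(μ), as observed above, so the local Newton systems are always solvable.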

VI. PROOF OF THE MAIN THEOREMS
In the rest of this section, it is assumed that the operators F, D, Φ, and W satisfy Assumptions 1-4. To simplify notation, we introduce the shorthand quantities X(t), Y(t), and Z(t), where x* is the fixed point of the mapping Φ(·). We see that global convergence of the local values x_i(t) to the solution is controlled by Z(t). The following three key lemmas bound the time evolution of the variables X, Y, and Z in terms of one another. Their proofs appear in the last three subsections of this section.
Lemma 1: Suppose that Assumptions 1-3 hold. Then, for every t ≥ 0, it holds that

Lemma 2: Suppose that Assumptions 2 and 4 hold. Then, for every t ≥ 0, it holds that

The latter can be bounded instead as follows.

Lemma 3: Suppose that Assumptions 1-4 hold. Then, for every t ≥ 0, it holds that

Recall that the main results in this article depend on the system matrix Γ given in Section IV. Lemmas 1-3 altogether imply the recurrence relation

(X(t + 1), Y(t + 1), Z(t + 1))^T ≤ Γ (X(t), Y(t), Z(t))^T (28)

where "≤" means elementwise less than or equal to. With this recurrence relation, we are now ready to prove the results in Section IV, starting with Theorem 1.

A. Proof of Theorem 1
The recurrence relation in (28) shows that the convergence of the algorithm is controlled by the spectral radius of the matrix Γ, given in (27). Since L_y and L_D are positive, the matrix Γ is an irreducible nonnegative matrix. Therefore, the spectral radius of Γ is a simple eigenvalue of Γ, by the Perron-Frobenius theorem (see, e.g., [52, Th. 8.4.4]). Thus, if ρ = ρ(Γ) is the spectral radius, then there is a constant C, independent of the initial values X(0), Y(0), and Z(0), such that X(t), Y(t), and Z(t) are all bounded by Cρ^t. Thus, they all converge to 0 exponentially fast (with rate ρ), provided that the spectral radius is strictly less than one, i.e., ρ < 1. This concludes the proof.
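This spectral-radius argument can be checked numerically. The matrix entries below are arbitrary example values, not the paper's Γ; the sketch verifies that the spectral radius of a nonnegative irreducible matrix is a real eigenvalue and that the induced recursion decays at exactly that rate.

```python
import numpy as np

# An arbitrary nonnegative irreducible example matrix.
Gamma = np.array([[0.5, 0.2, 0.1],
                  [0.1, 0.6, 0.2],
                  [0.2, 0.1, 0.4]])

eigvals = np.linalg.eigvals(Gamma)
rho = max(abs(eigvals))

# Perron-Frobenius: the spectral radius is attained by a real eigenvalue.
assert any(abs(ev.imag) < 1e-9 and abs(ev.real - rho) < 1e-9
           for ev in eigvals)
assert rho < 1

# The recursion v(t+1) = Gamma v(t) decays at the asymptotic rate rho.
v = np.ones(3)
for _ in range(200):
    v = Gamma @ v
ratio = np.linalg.norm(Gamma @ v) / np.linalg.norm(v)
assert abs(ratio - rho) < 1e-8
```

Since the example matrix has positive diagonal, it is primitive, so the per-step decay ratio converges quickly to ρ, exactly the Cρ^t behavior used in the proof.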

B. Proof of Theorem 2
Suppose that F satisfies Assumptions 1 and 2, with the constants 0 ≤ r < 1, 0 ≤ δ, 0 ≤ L_x, and 0 ≤ L_y. Consider the interpolated operator F_θ, and note that the maps {D_i}_i and the mixing matrix W (and hence also Assumptions 3 and 4) are unaffected by the interpolation. The interpolation formula directly shows that Φ_θ satisfies Assumption 1 with the constants r_θ = (1 − θ) + θr = 1 − θ(1 − r) and δ_θ = θδ. Similarly, it shows that F_θ satisfies Assumption 2 with the constants L_{x;θ} = (1 − θ) + θL_x = 1 + θ(L_x − 1) and L_{y;θ} = θL_y, respectively. Therefore, Theorem 1 also applies to F_θ, with Λ_θ = L_{y;θ} L_D = θΛ. For each θ, the system matrix Γ in (12) is then given by the corresponding matrix Γ_θ = Γ(F_θ).
While it is possible to directly compute the eigenvalues of this matrix, as the roots of a third degree polynomial, the resulting expressions are impractical.Instead, we shall bound them from above.
When L_y = 0, notice that the nondiagonal elements of the top row are both zero, and therefore that Γ_θ has the eigenvalues λ_1 = σ, λ_2 = σ(1 + θ(L_x − 1)), and λ_3 = 1 − θ(1 − r). These are all nonnegative, and since both σ and r are strictly less than one by assumption, both of the eigenvalues λ_1 and λ_3 are strictly less than one. Thus, if L_y = 0, the spectral radius is less than one if and only if λ_2 < 1, i.e., if either L_x ≤ 1 (recall that 0 ≤ L_x by assumption) or θ < (1 − σ)/(σ(L_x − 1)).

When 0 < L_y, we instead introduce the change of variables Y' = L_{y;θ} Y, and use Gershgorin's circle theorem to bound the spectral radius of the system matrix. Applying Gershgorin's circle theorem to the transformed matrix, it follows that the eigenvalues of Γ_θ (and hence also its spectral radius, since the eigenvalues are all positive) are bounded from above by the greatest of its column sums. It follows that the algorithm converges (linearly) for any 0 < θ that satisfies the resulting column-sum conditions. After some straightforward algebraic manipulations, recalling that 0 < θ ≤ 1, and thus also θ ≤ √θ, we arrive at the weaker conditions in (29). Notice that the condition θ < max(1, (1 − σ)/(L_x − 1)), for the case when L_y = 0, is also implied by the above conditions (specifically, the first one). Therefore, the conditions in (29) ensure that the spectral radius is strictly less than 1 in all cases.
The statement now follows from Theorem 1, which concludes the proof.
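The column-sum bound invoked above is a standard consequence of Gershgorin's circle theorem (equivalently, ρ(M) ≤ ||M||_1 for a nonnegative matrix M). A quick numerical check on an arbitrary nonnegative matrix of our choosing:

```python
import numpy as np

rng = np.random.default_rng(5)
M = rng.uniform(0.0, 0.3, size=(3, 3))  # arbitrary nonnegative matrix

rho = max(abs(np.linalg.eigvals(M)))
max_col_sum = M.sum(axis=0).max()

# Gershgorin applied to the columns (i.e., to M^T): the spectral radius
# is bounded by the largest column sum.
assert rho <= max_col_sum + 1e-12
```

This is exactly the inequality used to turn the entries of Γ_θ into the explicit conditions on θ.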

C. Proof of Theorem 3
The following proof involves many detailed computations, which is why we have divided it into two smaller steps.
Step 1. Bounding the spectral radius of the matrix Γ.

Lemma 4: Assume that 0 ≤ L_x ≤ 1. Then the spectral radius ρ(Γ) of the matrix Γ satisfies the stated bound.

Proof: Since the matrix is nonnegative, its spectral radius coincides with its largest real eigenvalue, which must be nonnegative. It is therefore sufficient to consider only the nonnegative real solutions of its characteristic equation. Gather the terms containing the factor λ − r and rewrite the characteristic polynomial p accordingly. Now, let K ≥ 0 be a parameter to be chosen, and define λ̄ in terms of it. Since we assume that 0 ≤ L_x ≤ 1, it follows that 0 ≤ 3Λ − L_x + 1 ≤ q(Λ). We shall now derive a condition ensuring that 0 < p(λ) whenever λ > λ̄. Since the spectral radius coincides with the largest nonnegative real root of p (see the discussion at the beginning of this proof), this bounds the spectral radius by λ̄. Since σ(1 + L_x)/2 ≤ λ̄, it holds for every λ > λ̄ that the required inequality is satisfied.
That is, 0 < p(λ), provided that the stated condition holds. Thus, the spectral radius is bounded by λ̄, which concludes the proof.

The next step is to derive conditions on the constants in Γ that ensure that the spectral radius is less than one. This means deriving conditions on the interpolation parameter θ.
Step 2. Conditions to ensure that the spectral radius of Γ is less than one.
In this step, we will replace F by its interpolated operator F_θ, replacing all relevant coefficients by their interpolated counterparts. Recall (for the details, see the beginning of the proof of Theorem 2) that the interpolated operator F_θ satisfies Assumptions 1 and 2 with L_{x;θ} = (1 − θ) + θL_x, L_{y;θ} = θL_y, r_θ = (1 − θ) + θr, and δ_θ = θδ. Assumptions 3 and 4 are unaffected by interpolation.
Using Lemma 4, we can bound the spectral radius of the system matrix Γ_θ = Γ(F_θ) for any given 0 < θ ≤ 1. Now, let us impose the condition that the spectral radius is in fact bounded from above by (1 + r_θ)/2. Using the interpolation formulas for the constants, we can rewrite q_θ(Λ_θ) accordingly; since the constants are positive real numbers and L_{x;θ} ≤ 1, the condition in (30) can be weakened. That is, the spectral radius is bounded by

(1 + r_θ)/2 = 1 − θ(1 − r)/2

under these conditions on θ. This proves Theorem 3.

D. Proof of Theorem 4
Recall that P is a nonexpanding map, and that G is given by G(x, y) = P ∘ F(x, y).
Since P is nonexpanding and Φ is a contraction, this G induces a contraction map Φ_G, i.e., an algorithm of the original form (3). Since Φ_G is a contraction, it has a unique fixed point, which in general is different from the fixed point x* of Φ.
Since P is nonexpanding, one immediately sees that Φ_G, together with its fixed point, satisfies Assumption 1 with the same constants r and δ as the original Φ (induced by F). Again, since P is nonexpanding, it also follows that G satisfies Assumption 2 with the same constants L_x and L_y as F. Assumptions 3 and 4 are unaffected by the addition of the operator P, and are therefore satisfied for G with the same constants as for F.
Therefore, the map G satisfies the conditions in Theorems 1-3 with the same constants as F. As a result, their conclusions apply mutatis mutandis to this map G.

E. Preparations for the Proofs of the Three Key Lemmas
Before proving the three key lemmas, Lemmas 1-3, we state and prove a few smaller lemmas that isolate certain computations and inequalities that appear in multiple places in their proofs.
Recall that each y_i(0) is initialized to D_i x_i(0) in the distributed Algorithm 1. This guarantees that the average of the estimates y(t) actually equals the average of the local data operators, as shown in the next lemma.
Lemma 5: Since each y_i(0) = D_i x_i(0), it holds for every t ≥ 0 that the average of the estimates y_i(t) equals the average of the local data operators D_i x_i(t).

Proof: Since W preserves averages, we see that
By induction, it follows that the claimed identity holds, since, due to the initialization y_i(0) = D_i x_i(0), the first and last terms cancel each other.
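The invariant of Lemma 5 is easy to verify numerically for a gradient-tracking instantiation; the choices of D_i (a local gradient), the mixing matrix W, and the step size below are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 4, 3
A = [rng.normal(size=(5, d)) for _ in range(n)]
b = [rng.normal(size=5) for _ in range(n)]

def D(i, x):
    # our illustrative data operator: the local gradient
    return A[i].T @ (A[i] @ x - b[i])

W = np.array([[0.5, 0.25, 0.0, 0.25],
              [0.25, 0.5, 0.25, 0.0],
              [0.0, 0.25, 0.5, 0.25],
              [0.25, 0.0, 0.25, 0.5]])

x = [rng.normal(size=d) for _ in range(n)]
y = [D(i, x[i]) for i in range(n)]  # the initialization required by Lemma 5

for _ in range(50):
    x_new = [sum(W[i, j] * (x[j] - 0.01 * y[j]) for j in range(n))
             for i in range(n)]
    y = [sum(W[i, j] * y[j] for j in range(n)) + D(i, x_new[i]) - D(i, x[i])
         for i in range(n)]
    x = x_new
    # the invariant: mean of y equals mean of the local data operators
    assert np.allclose(sum(y) / n, sum(D(i, x[i]) for i in range(n)) / n)
```

Because W is doubly stochastic, mixing does not change the average of y, and the correction terms telescope exactly as in the proof above.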
The following lemma bounds the difference between y_i(t) and the "true" (global) data operator evaluated at the local x_i(t). This essentially controls "how far" the local updates are from the original update rule (applied to the local values x_i(t)).
Lemma 6: Under Assumption 3, the stated inequality holds.

Proof: As for the last term in (31), we bound it directly; again, taking square roots gives the required inequality. The statement now follows.
The next lemma controls how far the local updates are from the original update rule (applied to the local values x i (t)).
Lemma 7: Under Assumptions 2 and 3, the stated bound holds.

Proof: Writing it out, and taking square roots, the conclusion follows from Lemma 6.
Using Assumption 3, we can further bound g(t + 1)

VII. CONCLUSION
In this article, we developed a general framework for decentralizing a contractive iteration over a network of connected nodes, which maintains the convergence rate of the original centralized iterative algorithm. Three examples (distributed proximal versions of gradient descent, the heavy ball method, and Newton's method) are given to demonstrate the proposed framework.
For future work, we will study general decentralization methods for algorithms with superlinear convergence rates, as well as for those with sublinear convergence rates.

APPENDIX A
The following lemma is a general statement about the Frobenius norms of matrices related to the averaging operator.

Lemma 8: For any matrix x = (x_1^T, ..., x_n^T)^T ∈ R^{n×d}, i.e., whose ith row is given by the vector x_i^T ∈ R^d, the stated identity holds.

Proof: The assertion follows directly from the definition of the Frobenius norm.
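Assuming the identity in question is the standard Pythagorean decomposition of the Frobenius norm around the row average (the deviation from the average is orthogonal to the average), it can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 5, 3
x = rng.normal(size=(n, d))

xbar = x.mean(axis=0)             # the row average
dev = x - np.ones((n, 1)) * xbar  # deviation of each row from the average

lhs = np.linalg.norm(x, 'fro') ** 2
rhs = np.linalg.norm(dev, 'fro') ** 2 + n * np.linalg.norm(xbar) ** 2
assert abs(lhs - rhs) < 1e-10  # ||x||_F^2 = ||x - 1*xbar^T||_F^2 + n*||xbar||^2
```

This decomposition is what lets the consensus analysis split the iterates into an average component and a disagreement component.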