A Compressed Gradient Tracking Method for Decentralized Optimization With Linear Convergence

Communication compression techniques are of growing interest for solving the decentralized optimization problem under limited communication, where the global objective is to minimize the average of local cost functions over a multiagent network using only local computation and peer-to-peer communication. In this article, we propose a novel compressed gradient tracking algorithm (C-GT) that combines the gradient tracking technique with communication compression. In particular, C-GT is compatible with a general class of compression operators that unifies both unbiased and biased compressors. We show that C-GT inherits the advantages of gradient tracking-based algorithms and achieves a linear convergence rate for strongly convex and smooth objective functions. Numerical examples complement the theoretical findings and demonstrate the efficiency and flexibility of the proposed algorithm.


I. INTRODUCTION
In this paper, we study the problem of decentralized optimization over a multi-agent network that consists of n agents. The goal is to collaboratively solve the following optimization problem:
$$\min_{x \in \mathbb{R}^p} f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad (1)$$
where $x$ is the global decision variable, and each agent $i$ has a local objective function $f_i : \mathbb{R}^p \to \mathbb{R}$. The agents are connected through a communication network and can only exchange information with their immediate neighbors in the network. Through local computation and local information exchange, they seek a consensual and optimal solution that minimizes the average of all the local cost functions. Decentralized optimization is widely applicable when central controllers or servers are not available or preferable, when centralized communication that involves a large amount of data exchange is prohibitively expensive due to limited communication resources, and when privacy preservation is desirable. Problem (1) has attracted much attention in recent years and has found a variety of applications in wireless networks, distributed control of robotic systems, and machine learning [1]-[3].

To solve (1) over a multi-agent network, early work considered the distributed subgradient descent (DGD) method with a diminishing step-size policy [4]. Under a constant step-size, EXTRA [5] first achieved a linear convergence rate for strongly convex and smooth cost functions by introducing an extra correction term to DGD. Distributed gradient tracking-based methods were later developed in [6]-[9], where the local gradient descent direction in DGD is replaced by an auxiliary variable that tracks the average gradient of the local objective functions. As a result, each agent's local iterate moves in the global descent direction and converges linearly to the optimal solution for strongly convex and smooth objective functions [8], [9].
Compared with EXTRA, gradient tracking-based methods also allow uncoordinated step-sizes [6], [10] and possibly asymmetric weight matrices while preserving linear convergence rates. Some variants were also proposed to deal with stochastic gradient information and time-varying or directed network topologies. For example, in [11], a distributed stochastic gradient tracking method was considered that exhibits comparable performance to a centralized stochastic gradient algorithm. Combining an approximate Newton-type method with gradient tracking leads to Network-DANE, which enables further computational savings by performing variance-reduced techniques [12]. Time-varying networks were considered in [8], [13]-[15], and more recent developments on directed graphs can be found in [13], [16]-[20] and the references therein.
In many application scenarios, it is vital to design communication-efficient protocols for distributed computation due to limited communication bandwidth and power constraints. Recently, in order to improve system scalability and communication efficiency, researchers have considered a variety of communication compression methods, such as sparsification and quantization [21]- [30], under the master-worker centralized architecture. Several techniques were introduced to alleviate compression errors, including compression error compensation and gradient difference compression [21], [24], [26], [27].
In the decentralized setting, the difference compression and extrapolation compression techniques were introduced to reduce model compression error in [31]. A novel algorithm with communication compression (CHOCO-SGD), which builds on DGD and preserves the model average, was presented in [32], [33]. However, the method converges only sublinearly even when the objective functions are strongly convex. In [34], a linearly convergent decentralized optimization algorithm with compression (LEAD) was introduced for strongly convex and smooth objective functions. The method is based on NIDS [35], a sibling of EXTRA. In light of an incremental primal-dual method, a linearly convergent quantized decentralized optimization algorithm was developed for unbiased randomized compressors in [36]. In [37], a black-box model was provided for distributed algorithms based on finite-bit quantizers.
In light of the advantages of gradient tracking-based methods for decentralized optimization, it is natural to consider the marriage between gradient tracking and communication compression. The first such effort was made in [38] which considered a quantized gradient tracking method based on a special quantizer. It was shown to achieve linear convergence rate for strongly convex and smooth objective functions. However, the algorithm design is rather complicated and relies on a specific quantizer. In addition, the convergence conditions are not easy to verify.
In this paper, we consider a novel gradient tracking-based method (C-GT) for decentralized optimization with communication compression. The algorithm compresses both the decision variables and the gradient trackers to provide a communication-efficient implementation. Unlike the existing methods, which are mostly based on unbiased compressors or biased but contractive compressors, C-GT is provably efficient for a general class of compressors, including those which are neither unbiased nor biased but contractive, e.g., the composition of quantization and sparsification and the norm-sign compression operators. We show that C-GT achieves linear convergence for strongly convex and smooth objective functions under such a general class of communication compression techniques, where the agents may choose different, uncoordinated step-sizes.
The main contributions of the paper are summarized as follows:
• We propose a novel compressed gradient tracking algorithm (C-GT) for decentralized optimization, which inherits the advantages of gradient tracking-based methods and saves communication costs at the same time.
• The proposed C-GT algorithm is applicable to a general class of compression operators and works under arbitrary compression precision. In particular, the general condition on the compression operators unifies the commonly considered unbiased and biased but contractive compressors and also includes other compression methods such as the composition of quantization and sparsification and the norm-sign compressors.
• C-GT provably achieves linear convergence for minimizing strongly convex and smooth objective functions under the general condition on the compression operators, where the agents may choose different, uncoordinated step-sizes.
• Simulation examples show that C-GT is efficient compared to the state-of-the-art methods and widely applies to various compressors.
The rest of this paper is organized as follows. We introduce the problem formulation and the general condition on the compression operators in Section II, and present the C-GT algorithm in Section III. In Section IV, we perform the convergence analysis for C-GT. Numerical examples are provided in Section V. Finally, concluding remarks are given in Section VI.

A. Notation
Vectors are columns unless otherwise specified in this paper. Let each agent $i$ hold a local copy $x_i \in \mathbb{R}^p$ of the decision variable and a gradient tracker (auxiliary variable) $y_i \in \mathbb{R}^p$. At the $k$-th iteration, their values are denoted by $x_i^k$ and $y_i^k$, respectively. For notational convenience, define
$$X := [x_1, x_2, \ldots, x_n]^\top \in \mathbb{R}^{n \times p}, \quad Y := [y_1, y_2, \ldots, y_n]^\top \in \mathbb{R}^{n \times p},$$
and
$$\bar{X} := \frac{1}{n}\mathbf{1}^\top X \in \mathbb{R}^{1 \times p}, \quad \bar{Y} := \frac{1}{n}\mathbf{1}^\top Y \in \mathbb{R}^{1 \times p},$$
where $\mathbf{1}$ is the column vector with each entry equal to 1. At the $k$-th iteration, their values are denoted by $X^k$, $Y^k$, $\bar{X}^k$, and $\bar{Y}^k$, respectively. Auxiliary variables of the agents (in aggregated matrix form) $H_x$, $H_y$, $Q_x$, $Q_y$, $\hat{X}$, and $\hat{Y}$ are defined similarly. Denote the aggregated gradient $\nabla F(X) := [\nabla f_1(x_1), \nabla f_2(x_2), \ldots, \nabla f_n(x_n)]^\top \in \mathbb{R}^{n \times p}$. We use $\|\cdot\|$ to denote the Frobenius norm of vectors and matrices by default. Specially, for square matrices, $\|\cdot\|$ represents the spectral norm. The spectral radius of a square matrix $M$ is denoted by $\rho(M)$.

II. PROBLEM FORMULATION
In this section, we provide the assumptions on the communication graphs and the objective functions. Then, we discuss different kinds of compression methods and provide a general description for compression operators.

A. Preliminaries
We start by introducing the conditions on the communication network/graph and the objective functions. Assume the agents are connected over an undirected graph G = (V, E), where V = {1, 2, . . . , n} is the set of vertices (nodes) and E ⊆ V × V is the set of edges. For an arbitrary agent i ∈ V, we define the set of its neighbors as N_i. Regarding the network structure, we make the following standing assumption.
Assumption 1: The undirected graph G is connected and permits a nonnegative doubly stochastic weight matrix $W = [w_{ij}] \in \mathbb{R}^{n \times n}$. That is, agent $i$ can receive information from agent $j$ if and only if $w_{ij} > 0$, and $W\mathbf{1} = \mathbf{1}$ and $\mathbf{1}^\top W = \mathbf{1}^\top$.
Remark 1: Although we assume an undirected graph G, note that the considered C-GT method also works with any balanced directed graph, where it is convenient to construct a doubly stochastic weight matrix.
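A doubly stochastic weight matrix satisfying Assumption 1 can be obtained from any connected undirected graph, e.g., via Metropolis-Hastings weights. The following sketch illustrates this construction; the ring graph and the helper name are our own illustrative choices, not part of the paper.

```python
import numpy as np

def metropolis_weights(adj):
    """Build a nonnegative doubly stochastic W from a symmetric 0/1
    adjacency matrix (no self-loops) via Metropolis-Hastings weights."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()  # put the remaining mass on the diagonal
    return W

# Example: a ring over 5 agents.
adj = np.zeros((5, 5), dtype=int)
for i in range(5):
    adj[i, (i + 1) % 5] = adj[(i + 1) % 5, i] = 1
W = metropolis_weights(adj)
# Row sums and column sums are both 1, as required by Assumption 1.
assert np.allclose(W.sum(axis=0), 1.0) and np.allclose(W.sum(axis=1), 1.0)
```

Since the graph is undirected, the resulting W is symmetric, so row stochasticity immediately gives double stochasticity.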
The assumption on the objective functions is given below.
Assumption 2: The local cost function $f_i$ is $\mu_i$-strongly convex, and its gradient is $L_i$-Lipschitz continuous, i.e., for any $x, x' \in \mathbb{R}^p$,
$$\langle \nabla f_i(x) - \nabla f_i(x'), x - x' \rangle \geq \mu_i \|x - x'\|^2, \qquad \|\nabla f_i(x) - \nabla f_i(x')\| \leq L_i \|x - x'\|.$$
From Assumption 2, the objective function $f$ is $\mu$-strongly convex and the gradient of $f$ is $L$-Lipschitz continuous, where $\mu = \frac{1}{n}\sum_{i=1}^n \mu_i$ and $L = \max_i \{L_i\}$; we denote the condition number by $\kappa = L/\mu$. Moreover, there exists a unique solution, denoted by $x^* \in \mathbb{R}^{1 \times p}$, to problem (1) under Assumption 2.
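As a minimal toy instance satisfying Assumption 2, consider quadratic local costs $f_i(x) = \frac{1}{2}\|x - c_i\|^2$, which are $\mu_i$-strongly convex with $L_i$-Lipschitz gradients for $\mu_i = L_i = 1$, so that $\kappa = 1$ and the minimizer of the average is available in closed form. The targets $c_i$ below are illustrative, not from the paper.

```python
import numpy as np

# Toy instance of problem (1): f_i(x) = 0.5 * ||x - c_i||^2, whose
# average f(x) = (1/n) sum_i f_i(x) is minimized at the mean of the c_i.
rng = np.random.default_rng(0)
n, p = 5, 3
C_loc = rng.normal(size=(n, p))             # local targets c_i (rows)

def grad_f_i(i, x):
    """Gradient of f_i(x) = 0.5 * ||x - c_i||^2."""
    return x - C_loc[i]

x_star = C_loc.mean(axis=0)                 # unique solution of problem (1)
# The *average* gradient vanishes at x_star even though individual
# local gradients do not.
avg_grad = np.mean([grad_f_i(i, x_star) for i in range(n)], axis=0)
assert np.allclose(avg_grad, 0.0)
```

This instance is convenient for sanity-checking decentralized algorithms, since the consensus error and the optimality gap can both be measured against the closed-form $x^*$.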

B. Compression Methods
In this subsection, we introduce some common assumptions on the compression operators and then present a more general and unified assumption.
1) Unbiased compression operators:
Assumption 3: The compression operator $C : \mathbb{R}^p \to \mathbb{R}^p$ is unbiased with bounded relative variance, i.e., for all $x \in \mathbb{R}^p$,
$$\mathbb{E}C(x) = x, \qquad \mathbb{E}\|C(x) - x\|^2 \leq C\|x\|^2,$$
for some constant $C > 0$.
Remark 2: The expectation is taken with respect to the random vector corresponding to the internal compression randomness of $C$. Some instances of feasible stochastic compression operators satisfying Assumption 3, such as the unbiased b-bits q-norm quantization compression method, can be found in [32]-[34] and the references therein.
2) Biased but contractive compression operators:
Assumption 4: The compression operator $C : \mathbb{R}^p \to \mathbb{R}^p$ is contractive, i.e., for all $x \in \mathbb{R}^p$,
$$\mathbb{E}\|C(x) - x\|^2 \leq (1 - \delta)\|x\|^2,$$
for some constant $\delta \in (0, 1]$. A typical example is the Top-k sparsification operator.
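The b-bits q-norm quantizer mentioned above can be sketched as follows for $q = \infty$ and $b = 2$ (the variant used in the experiments); the uniform dither makes the floor operation unbiased. The function name and the empirical check are our own illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def quantize_b_bits(x, b=2, q=np.inf):
    """Unbiased b-bit q-norm stochastic quantizer: a common instance
    satisfying Assumption 3. Uniform dither u makes E[C(x)] = x."""
    norm = np.linalg.norm(x, q)
    if norm == 0:
        return np.zeros_like(x)
    levels = 2 ** (b - 1)
    u = rng.uniform(size=x.shape)  # dithering noise
    return norm * np.sign(x) / levels * np.floor(levels * np.abs(x) / norm + u)

# Empirical unbiasedness check: averaging many independent
# quantizations of the same vector recovers that vector.
x = rng.normal(size=50)
est = np.mean([quantize_b_bits(x) for _ in range(20000)], axis=0)
assert np.max(np.abs(est - x)) < 0.05
```

Only the q-norm (a scalar) and the b-bit integer codes need to be transmitted, which is the source of the communication savings.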

3) General compression operators:
We now present a general assumption on the compression operators, which contains Assumptions 3 and 4 as special cases.
Assumption 5: The compression operator $C : \mathbb{R}^p \to \mathbb{R}^p$ satisfies
$$\mathbb{E}\|C(x) - x\|^2 \leq C\|x\|^2, \quad \forall x \in \mathbb{R}^p,$$
for some constant $C > 0$, and the $r$-scaling of $C$ satisfies
$$\mathbb{E}\left\|\frac{C(x)}{r} - x\right\|^2 \leq (1 - \delta)\|x\|^2, \quad \forall x \in \mathbb{R}^p,$$
for some constants $\delta \in (0, 1]$ and $r > 0$.
Remark 4: On one hand, if $C < 1$, Assumption 5 degenerates to Assumption 4 by setting $r = 1$ and $\delta = 1 - C$. On the other hand, if $C$ is unbiased, i.e., $\mathbb{E}C(x) = x$, then Assumption 5 degenerates to Assumption 3 by setting $r = C + 1$ and $\delta = \frac{1}{C+1}$. In short, Assumption 5 gives a unified description of unbiased and biased compression operators, and thus Assumptions 3 and 4 can be regarded as its special cases.
However, there also exist compression operators with biased $C$ and $C \geq 1$ in Assumption 5, that is, operators that satisfy neither Assumption 3 nor Assumption 4. Examples include the norm-sign compressor $C(x) = \|x\|_q \,\mathrm{sign}(x)$ and the composition of quantization and sparsification [23], [25], [39].
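The composition of quantization and sparsification (Q-T) can be checked against the contractive part of Assumption 5 numerically. The sketch below composes top-k sparsification with the stochastic quantizer and estimates the relative compression error on Gaussian inputs; empirically the average stays below 1, consistent with the $r$-scaling condition holding for $r = 1$ on such inputs. All parameter choices here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def quantize(x, b=2):
    """Unbiased b-bit inf-norm stochastic quantizer (dithered floor)."""
    norm = np.linalg.norm(x, np.inf)
    if norm == 0:
        return np.zeros_like(x)
    levels = 2 ** (b - 1)
    u = rng.uniform(size=x.shape)
    return norm * np.sign(x) / levels * np.floor(levels * np.abs(x) / norm + u)

def top_k(x, k):
    """Keep the k largest-magnitude entries, zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def qt(x, k=25, b=2):
    """Q-T: sparsify first, then quantize the surviving entries."""
    return quantize(top_k(x, k), b)

# Estimate E||C(x) - x||^2 / ||x||^2 over random inputs.
ratios = []
for _ in range(200):
    x = rng.normal(size=50)
    c = qt(x)
    ratios.append(np.sum((c - x) ** 2) / np.sum(x ** 2))
assert np.mean(ratios) < 1.0  # empirically contractive with r = 1 here
```

Note that Q-T is biased (sparsification discards coordinates deterministically), so it falls outside Assumptions 3 and 4 in general, which is precisely the case Assumption 5 is designed to cover.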
Remark 5: Although some compression operators (e.g., the composition of quantization and sparsification) can be rescaled so that the new compression operator satisfies the contractive condition in Assumption 4, applying the rescaled operator may hurt the performance of the algorithm compared with directly using the original compression operator $C$. Considering Assumption 5 thus provides us with more flexibility in choosing the most suitable compression method.

III. A COMPRESSED GRADIENT TRACKING ALGORITHM
In this section, we introduce the communication-efficient compressed gradient tracking algorithm (C-GT). We also provide some interpretations of C-GT and discuss how it connects to existing works. Denote $D := \mathrm{diag}(\eta_1, \eta_2, \ldots, \eta_n) \in \mathbb{R}^{n \times n}$, where $\eta_i$ is the step-size of agent $i$. Let Compress be the compression function, with which the compression operators are associated. The proposed compressed gradient tracking algorithm (C-GT) is presented in Algorithm 1.

[Algorithm 1: C-GT. The communication procedure COMM(Z, H, H_w) returns $\hat{Z}$, $\hat{Z}_w$, $H$, $H_w$ (Lines 15-16). In Lines 4 and 5, $\alpha_z$ in the compression function is replaced by $\alpha_x$ for decision variable difference compression and by $\alpha_y$ for gradient tracker difference compression, respectively.]
The compression and communication steps are included in the procedure COMM(Z, H, H_w). The function Compress is the compression operator that independently compresses the variables of each agent at every iteration. In Line 10, the difference between Z and the auxiliary variable H is compressed, and in Line 11 the compressed difference is added back to H to obtain the estimate $\hat{Z}$. Here H acts as a reference point: as it gradually approaches Z and the difference vanishes, the compression error on the difference also decreases to 0 under Assumption 5. The low-bit compressed value Q is transmitted in Line 12.
To control the compression error, particularly for a relatively large constant $C$ in Assumption 5, we introduce a momentum update $H = (1 - \alpha_z)H + \alpha_z \hat{Z}$ motivated by the master-worker distributed method DIANA [26] and the decentralized algorithm LEAD [34]. If $\alpha_z = 1$, the update degenerates to that in the decentralized stochastic algorithm CHOCO-SGD [32].
In Line 14, $H_w$ serves as a backup copy of the neighboring information. By introducing this auxiliary variable, there is no need to store all the neighbors' reference points $H$ [32]. Noticing that $H_w^0 = WH^0$ from the initialization in Lines 1 and 2, we have $\hat{Z}_w = W\hat{Z}$ and then $H_w = WH$ by induction. It follows that $\hat{X}_w^k = W\hat{X}^k$ and $\hat{Y}_w^k = W\hat{Y}^k$ from Lines 4 and 5. Therefore, the decision variable update in Line 6 becomes
$$X^{k+1} = X^k - \gamma(I - W)\hat{X}^k - DY^k, \qquad (6)$$
and the gradient tracker update in Line 7 is given by
$$Y^{k+1} = Y^k - \gamma(I - W)\hat{Y}^k + \nabla F(X^{k+1}) - \nabla F(X^k). \qquad (7)$$
One key property of C-GT is that gradient tracking remains exact regardless of the compression errors, i.e., for $k \geq 0$,
$$\bar{Y}^{k+1} = \bar{Y}^k + \frac{1}{n}\mathbf{1}^\top\left(\nabla F(X^{k+1}) - \nabla F(X^k)\right) = \frac{1}{n}\mathbf{1}^\top \nabla F(X^{k+1}).$$
The first equality holds because $\mathbf{1}^\top(I - W) = 0$, and the last equality is obtained by induction under the initial condition $Y^0 = \nabla F(X^0)$. Therefore, as long as $y_i^k$ reaches (approximate) consensus among all the agents, each $y_i^k$ is able to track the average gradient $\frac{1}{n}\mathbf{1}^\top \nabla F(X^k)$. Moreover, by multiplying both sides of (6) by $\mathbf{1}^\top$ and dividing by $n$, we obtain
$$\bar{X}^{k+1} = \bar{X}^k - \frac{1}{n}\mathbf{1}^\top D Y^k.$$
Hence the update of $\bar{X}^k$ does not involve any compression error from the current step.

Remark 6: If no communication compression is performed in the algorithm, i.e., $\hat{X}^k = X^k$ and $\hat{Y}^k = Y^k$, then C-GT recovers the typical distributed gradient tracking algorithm in [8], [9] for $\eta_i = \eta$. To see such a connection, substitute $\hat{X}^k = X^k$ and $\hat{Y}^k = Y^k$ in (6) and (7), respectively, so that C-GT reads
$$X^{k+1} = \tilde{W}X^k - \eta Y^k, \qquad Y^{k+1} = \tilde{W}Y^k + \nabla F(X^{k+1}) - \nabla F(X^k),$$
where $\tilde{W} := (1 - \gamma)I + \gamma W$. With this notation, C-GT takes the same form as the typical gradient tracking method.
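To make the updates concrete, here is a minimal NumPy sketch of one C-GT run using the matrix-form updates above, with difference compression and the momentum reference updates. The toy quadratic costs, ring graph, top-k compressor, and all parameter values are illustrative assumptions of ours, not the paper's tuned settings.

```python
import numpy as np

# Sketch of C-GT on toy quadratic costs f_i(x) = 0.5*||x - c_i||^2,
# so that grad F(X) = X - C_loc and x* is the mean of the targets.
rng = np.random.default_rng(3)
n, p = 10, 4
C_loc = rng.normal(size=(n, p))               # local targets c_i
grad_F = lambda X: X - C_loc                  # aggregated gradient

W = np.zeros((n, n))                          # doubly stochastic ring weights
for i in range(n):
    W[i, (i + 1) % n] = W[i, (i - 1) % n] = 1.0 / 3.0
    W[i, i] = 1.0 / 3.0
L_mat = np.eye(n) - W                         # consensus matrix I - W

def compress(D, k=2):
    """Row-wise top-k sparsification (biased, contractive)."""
    out = np.zeros_like(D)
    idx = np.argsort(np.abs(D), axis=1)[:, -k:]
    np.put_along_axis(out, idx, np.take_along_axis(D, idx, axis=1), axis=1)
    return out

gamma, eta, alpha = 0.3, 0.05, 0.5            # illustrative parameters
X = rng.normal(size=(n, p))
Y = grad_F(X)                                 # initialization Y^0 = grad F(X^0)
Hx = np.zeros((n, p))
Hy = np.zeros((n, p))

for _ in range(1000):
    Xhat = Hx + compress(X - Hx)              # difference compression
    Yhat = Hy + compress(Y - Hy)
    Hx = (1 - alpha) * Hx + alpha * Xhat      # momentum reference updates
    Hy = (1 - alpha) * Hy + alpha * Yhat
    g_old = grad_F(X)
    X = X - gamma * L_mat @ Xhat - eta * Y               # update (6), D = eta*I
    Y = Y - gamma * L_mat @ Yhat + grad_F(X) - g_old     # update (7)

x_star = C_loc.mean(axis=0)                   # minimizer of the average cost
assert np.max(np.abs(X - x_star)) < 1e-2      # all agents approach x*
```

Only the compressed differences would be transmitted in a real implementation; here the sketch works in aggregated matrix form for brevity.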
On the other hand, C-GT performs an implicit error compensation that mitigates the impact of the compression error, as can be seen from the following argument. The decision variable is updated as
$$X^{k+1} = X^k - \gamma(I - W)X^k - DY^k - \gamma(I - W)E^k, \qquad (12)$$
where $E^k := \hat{X}^k - X^k$ measures the compression error for the decision variable. The additional term $-\gamma(I - W)E^k$ implies that each agent $i$ transmits its total compression error $-\sum_{j \in N_i \cup \{i\}} w_{ji} e_i^k = -e_i^k$ to its neighboring agents and compensates this error locally by adding $e_i^k$, where $e_i^k \in \mathbb{R}^{1 \times p}$ is the $i$-th row of $E^k$. Similarly, the compression errors for the gradient trackers are also mitigated.

IV. CONVERGENCE ANALYSIS
In this section, we study the convergence properties of the proposed compressed gradient tracking algorithm for minimizing strongly convex and smooth cost functions. Our analysis relies on constructing a linear system of inequalities that relates the optimization errors (the consensus and optimality gaps of the iterates) and the compression errors.
In order to derive the main results, we introduce some useful lemmas first.
Remark 7: We introduce below the key lemma for establishing the linear convergence of the C-GT algorithm under Assumptions 1, 2 and 5.

A. Main Results
By taking concrete values for the constants in Lemma 4, we derive the main convergence result for the C-GT algorithm under Assumptions 1, 2, and 5 in the following theorem, which establishes the linear convergence of C-GT for the general class of compression operators.
Theorem 1: Suppose Assumptions 1, 2, and 5 hold, $\bar{\eta} \geq M\hat{\eta}$ for some $M > 0$, and $\alpha_x = \alpha_y = 1/r$. If the consensus step-size $\gamma$ and the maximum step-size $\hat{\eta}$ are chosen below their respective upper bounds, which depend on $\kappa = L/\mu$, the compression parameters $C$, $\delta$, $r$, and the network topology, then the iterates generated by C-GT converge linearly to the optimal solution.
Remark 8: The convergence rate of C-GT is then comparable to those of the typical gradient tracking methods; see, e.g., [9].
Remark 9: In practice, the restrictions on $\alpha_x$ and $\alpha_y$ can be relaxed to $\alpha_x, \alpha_y \in (0, 1/r]$ as in Lemma 4. The condition $\bar{\eta} \geq M\hat{\eta}$ is always satisfied for some fixed $M$, e.g., $M = 1/n$. If in addition all $\eta_i$ are equal, then we can take $M = 1$.
Remark 10: Comparing the performance of C-GT with the existing linearly convergent algorithm LEAD [34], C-GT enjoys more flexibility in the mixing matrix, the compression methods, and the step-size policy, while LEAD achieves faster convergence in theory under more restricted conditions.

V. NUMERICAL EXAMPLES
In this part, we provide some numerical examples to confirm our theoretical results. Consider the ridge regression problem
$$\min_{x \in \mathbb{R}^p} f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x), \qquad f_i(x) = (u_i^\top x - v_i)^2 + \rho\|x\|^2,$$
where $\rho > 0$ is a penalty parameter. The pair $(u_i, v_i)$ is a sample that belongs to the $i$-th agent, where $u_i \in \mathbb{R}^p$ represents the features and $v_i \in \mathbb{R}$ represents the observation or output. In the simulations, the pairs $(u_i, v_i)$ are pre-generated: the input $u_i \in [-1, 1]^p$ is uniformly distributed, and the output $v_i$ satisfies $v_i = u_i^\top \tilde{x}_i + \varepsilon_i$, where $\varepsilon_i$ are independent Gaussian noises with mean 0 and variance 25, and $\tilde{x}_i$ are predefined parameters evenly located in $[0, 1]^p$. The $i$-th agent can then calculate the gradient of its local objective function as
$$\nabla f_i(x) = 2u_i(u_i^\top x - v_i) + 2\rho x.$$
In our experimental settings, the penalty parameter is $\rho = 0.1$, the number of nodes is $n = 100$, and the dimension of the variables is $p = 500$. Meanwhile, $x_i^0$ is randomly generated in $[0, 1]^p$, and the other initial values satisfy $H_x^0 = 0$, $H_y^0 = 0$, and $Y^0 = \nabla F(X^0)$.
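The data generation and local gradient above can be sketched as follows. Reading "evenly located in $[0,1]^p$" as agent $i$'s parameter being $\frac{i-1}{n-1}$ times the all-ones vector is our own interpretation; the finite-difference check at the end verifies the gradient formula on a random direction.

```python
import numpy as np

# Data for the ridge regression experiment (one sample per agent).
rng = np.random.default_rng(4)
n, p, rho = 100, 500, 0.1
U = rng.uniform(-1.0, 1.0, size=(n, p))              # features u_i
x_tilde = np.linspace(0.0, 1.0, n)[:, None] * np.ones(p)  # assumed x~_i
eps = rng.normal(0.0, 5.0, size=n)                   # std 5 => variance 25
V = np.einsum('ij,ij->i', U, x_tilde) + eps          # v_i = u_i^T x~_i + eps_i

def f_i(i, x):
    """Local objective f_i(x) = (u_i^T x - v_i)^2 + rho * ||x||^2."""
    return (U[i] @ x - V[i]) ** 2 + rho * np.sum(x ** 2)

def grad_f_i(i, x):
    """Its gradient: 2 * u_i * (u_i^T x - v_i) + 2 * rho * x."""
    return 2.0 * (U[i] @ x - V[i]) * U[i] + 2.0 * rho * x

# Central-difference check of the gradient along a random direction
# (exact for quadratics up to rounding).
x0, d = rng.normal(size=p), rng.normal(size=p)
h = 1e-5
fd = (f_i(0, x0 + h * d) - f_i(0, x0 - h * d)) / (2.0 * h)
assert abs(fd - grad_f_i(0, x0) @ d) < 1e-4 * (1.0 + abs(fd))
```

Each $f_i$ is $2\rho$-strongly convex with Lipschitz gradient, so the setup satisfies Assumption 2.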
We compare C-GT with CHOCO-SGD [32], LB [36], LEAD [34], and the uncompressed linearly convergent methods, NIDS [35] and GT [9], for decentralized optimization over a randomly generated undirected graph. To ensure a fair comparison, all algorithms use their equivalent matrix forms. The considered compression methods are 2-bit ∞-norm quantization (Q), Top-10 sparsification (Top-10), the composition of quantization and sparsification (Q-T), and its rescaled version (Q-T-R). Note that Q-T only satisfies Assumption 5 and does not satisfy Assumptions 3 and 4. The communication bits of these compression methods are given in [32]-[34], [39]. The parameter settings of the algorithms are given in Table I; they are hand-tuned to achieve the best performance for each algorithm.
In Fig. 1, we compare the communication efficiency of C-GT, LEAD, LB, and CHOCO-SGD with the uncompressed methods GT and NIDS. For C-GT, LEAD, LB, and CHOCO-SGD, we apply the compressors that work best for each of them, respectively. C-GT and LEAD clearly outperform the other methods, with C-GT achieving the best communication efficiency.
In Fig. 2, we further present a detailed comparison between C-GT and LEAD under different types of compressors. Note that LEAD works best under the unbiased compressor Q, while C-GT is more efficient under the biased compression operator Top-10 and the composition of quantization and sparsification Q-T. In particular, the performance of C-GT under Q-T is the most favorable among all the combinations.
It can also be seen from Fig. 2 that using the rescaled compressors leads to slower convergence, which suggests that rescaling the compression operators to satisfy the typical contractive requirement (i.e., Assumption 4) may harm the algorithmic performance. Therefore, we can conclude that considering Assumption 5 provides users with more freedom in choosing the best compression method. These experimental findings demonstrate the effectiveness of C-GT.

VI. CONCLUSIONS
In this paper, we consider the problem of decentralized optimization with communication compression over a multiagent network. Specifically, we propose a compressed gradient tracking algorithm, termed C-GT, and show that the algorithm converges linearly for strongly convex and smooth objective functions. C-GT not only inherits the advantages of gradient tracking-based methods, but also works with a wide class of compression operators. Simulation examples demonstrate the effectiveness and flexibility of C-GT for undirected networks. Future work will consider equipping C-GT with acceleration techniques such as Nesterov's acceleration and momentum methods. Nonconvex objective functions are also of interest for future work.

A. Supplementary Lemmas
The following vector and matrix inequalities are often invoked.
Lemma 5: For $U, V \in \mathbb{R}^{n \times p}$ and any constant $\tau > 0$, we have the following inequality:
$$\|U + V\|^2 \leq (1 + \tau)\|U\|^2 + \left(1 + \frac{1}{\tau}\right)\|V\|^2.$$
In particular, taking $\tau = \tau' - 1$ for $\tau' > 1$, we have
$$\|U + V\|^2 \leq \tau'\|U\|^2 + \frac{\tau'}{\tau' - 1}\|V\|^2.$$
In addition, for any $U_1, U_2, U_3 \in \mathbb{R}^{n \times p}$, we have
$$\|U_1 + U_2 + U_3\|^2 \leq 3\|U_1\|^2 + 3\|U_2\|^2 + 3\|U_3\|^2.$$
Lemma 6 (Corollary 8.1.29 in [40]): Let $M \in \mathbb{R}^{l \times l}$ and $v \in \mathbb{R}^l$ be a nonnegative matrix and an element-wise positive vector, respectively. If $Mv \leq \theta v$, then $\rho(M) \leq \theta$.
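The inequalities in Lemma 5 follow from Young's inequality; the following short script (our own sanity check, not part of the proof) verifies them numerically on random matrices under the Frobenius norm.

```python
import numpy as np

# Numerical check of the Lemma 5 inequalities on random 4x3 matrices.
rng = np.random.default_rng(5)
fro2 = lambda M: float(np.sum(M ** 2))   # squared Frobenius norm

violations = 0
for _ in range(100):
    U, V = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
    tau = rng.uniform(0.1, 10.0)
    # ||U + V||^2 <= (1 + tau)||U||^2 + (1 + 1/tau)||V||^2
    if fro2(U + V) > (1 + tau) * fro2(U) + (1 + 1 / tau) * fro2(V) + 1e-9:
        violations += 1
    U1, U2, U3 = (rng.normal(size=(4, 3)) for _ in range(3))
    # ||U1 + U2 + U3||^2 <= 3(||U1||^2 + ||U2||^2 + ||U3||^2)
    if fro2(U1 + U2 + U3) > 3 * (fro2(U1) + fro2(U2) + fro2(U3)) + 1e-9:
        violations += 1

assert violations == 0
```

The small additive tolerance guards against floating-point rounding only; the inequalities themselves hold deterministically.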

B. Proof of Lemma 3
Let $\mathcal{F}^k$ be the $\sigma$-algebra generated by $\{X^0, Y^0, X^1, Y^1, \ldots, X^k, Y^k\}$, and define $\mathbb{E}[\,\cdot \mid \mathcal{F}^k]$ as the conditional expectation with respect to the compression operator given $\mathcal{F}^k$.
Before deriving the linear system of inequalities, we bound $\mathbb{E}[\|\hat{X}^k - X^k\|^2 \mid \mathcal{F}^k]$ and $\mathbb{E}[\|\hat{Y}^k - Y^k\|^2 \mid \mathcal{F}^k]$, respectively. From the COMM procedure in Algorithm 1 for the decision variable, we know
$$\mathbb{E}\left[\|\hat{X}^k - X^k\|^2 \,\middle|\, \mathcal{F}^k\right] \leq C\|X^k - H_x^k\|^2.$$
Similarly, we have
$$\mathbb{E}\left[\|\hat{Y}^k - Y^k\|^2 \,\middle|\, \mathcal{F}^k\right] \leq C\|Y^k - H_y^k\|^2.$$
1) Optimality error: According to (9), Lemmas 1 and 5, we obtain a bound on the conditional expectation of the optimality error. Taking $\tau_1 = \frac{3}{8\bar{\eta}\mu}$ and noticing that $\bar{\eta} \leq \hat{\eta} \leq \frac{1}{3\mu}$, we arrive at the desired inequality. In the first inequality, we use $\|D - \bar{\eta}I\| \leq \hat{\eta}$. The last inequality holds since $\bar{\eta} \leq \hat{\eta}$ and $\bar{\eta} \geq M\hat{\eta}$ for some $M > 0$.