Compressed Gradient Tracking for Decentralized Optimization Over General Directed Networks

In this paper, we propose two communication-efficient decentralized optimization algorithms over a general directed multi-agent network. The first algorithm, termed Compressed Push-Pull (CPP), combines the gradient tracking based Push-Pull method with communication compression. We show that CPP is applicable to a general class of unbiased compression operators and achieves a linear convergence rate for strongly convex and smooth objective functions. The second algorithm is a broadcast-like version of CPP (B-CPP), which also achieves a linear convergence rate under the same conditions on the objective functions. B-CPP can be applied in an asynchronous broadcast setting and further reduces communication costs compared to CPP. Numerical experiments complement the theoretical analysis and confirm the effectiveness of the proposed methods.


I. INTRODUCTION
In this paper, we focus on solving the decentralized optimization problem, where a system of n agents, each having access to a private function f_i(x), work collaboratively to obtain a consensual solution to the following problem:

min_{x ∈ R^p} f(x) := (1/n) Σ_{i=1}^n f_i(x),   (1)

where x is the global decision variable. The n agents are connected through a general directed network and only communicate directly with their immediate neighbors.
The problem (1) has received much attention in recent years due to its wide applications in distributed machine learning [1], [2], [3], multi-agent target seeking [4], [5], and wireless networks [6], [7], [8], among many others. For example, the rapid development of distributed machine learning involves data whose size is getting increasingly large, and such data are usually stored across multiple computing agents that are spatially distributed.
Centering large amounts of data is often undesirable due to limited communication resources and/or privacy concerns, and decentralized optimization serves as an important tool to solve such large-scale distributed learning problems due to its scalability, sparse communication, and better protection of data privacy [9]. Many methods have been proposed to solve the problem (1) under various settings on the optimization objectives, network topologies, and communication protocols. The paper [10] developed a decentralized subgradient descent method (DGD) with diminishing stepsizes to reach the optimum for convex objective functions over an undirected network topology. Subsequently, decentralized optimization methods for undirected networks, or more generally, with doubly stochastic mixing matrices, have been extensively studied in the literature; see, e.g., [11], [12], [13], [14], [15], [16]. Among these works, EXTRA [14] was the first method that achieves linear convergence for strongly convex and smooth objective functions under symmetric stochastic matrices. For directed networks, however, constructing a doubly stochastic mixing matrix usually requires a weight-balancing step, which could be costly when carried out in a distributed manner. Therefore, the push-sum technique [17] was utilized to overcome this issue. Specifically, the push-sum based subgradient method in [18] can be implemented over time-varying directed graphs, and linear convergence rates were achieved in [19], [20] for minimizing strongly convex and smooth objective functions by applying the push-sum technique to EXTRA.
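Since several of the methods above build on the push-sum technique [17], a minimal sketch may be useful. The 4-node directed graph, the specific weights, and the iteration count below are illustrative choices of ours, not taken from the cited works; the point is that each node's ratio x_i / w_i recovers the network-wide average even though the mixing matrix C is only column stochastic (not doubly stochastic).

```python
import numpy as np

# Toy strongly connected directed graph on 4 nodes.  Column j of C describes how
# node j splits its mass among its out-neighbors, so C is column stochastic
# (columns sum to 1) but NOT doubly stochastic.
n = 4
C = np.zeros((n, n))
for j in range(1, n):
    C[j, j] = 0.5            # keep half of own mass
    C[(j + 1) % n, j] = 0.5  # push half to the next node in the ring
C[0, 0] = C[1, 0] = C[2, 0] = 1.0 / 3.0  # node 0 pushes to itself, node 1, node 2

x = np.array([1.0, 5.0, 3.0, 7.0])  # private values; their average is 4.0
w = np.ones(n)                       # auxiliary push-sum weights

for _ in range(200):
    x = C @ x
    w = C @ w

ratio = x / w  # converges entrywise to the average of the initial values
```

The total mass 1^⊤x is conserved at every step because 1^⊤C = 1^⊤, which is exactly why the ratio corrects the imbalance caused by the non-uniform stationary distribution of C.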
Gradient tracking is an important technique that has been successfully applied in many decentralized optimization algorithms. Specifically, the methods proposed in [12], [21], [22], [23] employ gradient tracking to achieve linear convergence for strongly convex and smooth objective functions, where the works [21], [23], [22] particularly considered combining gradient tracking with the push-sum technique to accommodate directed graphs. The methods can also be applied to time-varying graphs [21] and asynchronous settings [22]. The Push-Pull/AB method introduced in [24], [25] modified the gradient tracking methods to deal with directed network topologies without the push-sum technique. The algorithm uses a row stochastic matrix to mix the local decision variables and a column stochastic matrix to mix the auxiliary variables that track the average gradients over the network. It can unify different network architectures, including peer-to-peer, master-slave, and leader-follower architectures [24]. For minimizing strongly convex and smooth objectives, the Push-Pull/AB method not only enjoys linear convergence over fixed graphs [24], [25], but also works well under time-varying graphs and asynchronous settings [24], [26], [27].
In decentralized optimization, efficient communication is critical for enhancing algorithm performance and system scalability. One major approach to reducing communication costs is communication compression, which is essential especially under limited communication bandwidth. Recently, several compression methods have been proposed for distributed and federated learning, including [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40]. Recent works have tried to combine communication compression methods with decentralized optimization. The existence of compression errors may result in inferior convergence performance compared to uncompressed or centralized algorithms. For example, the methods considered in [41], [42], [43], [44], [45], [46] can only guarantee convergence to a neighborhood of the desired solutions when compression errors exist. QDGD [47] achieves a vanishing mean solution error at a slower rate than the centralized method. To reduce the error from compression, some works [48], [49], [50] increase the compression accuracy as the iteration count grows to guarantee convergence. However, they still need high communication costs to obtain highly accurate solutions. Techniques to remedy these increased communication costs include gradient difference compression [34], [51], [52] and error compensation [37], [53], [54], which enjoy better performance than direct compression. In [55], the difference compression (DCD-PSGD) and extrapolation compression (ECD-PSGD) algorithms were proposed to reach the same convergence rate as the centralized schemes, with additional requirements on the compression ratio.
In [54], a quantized decentralized SGD (CHOCO-SGD) method was proposed and shown to converge to the optimal solution or a stationary point at a rate comparable to the centralized SGD method. The works [53] and [56] also achieve convergence rates comparable to the centralized scheme for solving nonconvex problems.
For strongly convex and smooth objective functions, [57] first considered a linearly convergent gradient tracking method based on a specific quantizer. More recently, the paper [52] introduced LEAD, which works with a general class of compression operators and still enjoys linear convergence. Some recent developments can be found in [58], [59], [60], where [59] particularly combined Push-Pull/AB with a special quantizer to achieve linear convergence over directed graphs.
In this paper, we consider decentralized optimization over general directed networks and propose a novel Compressed Push-Pull method (CPP) that combines Push-Pull/AB with a general class of unbiased compression operators. CPP enjoys large flexibility in both the compression method and the network topology. We show that CPP achieves a linear convergence rate for strongly convex and smooth objective functions.
Broadcast or gossip-based communication protocols are popular choices for distributed computation due to their low communication costs [61], [62], [63]. In the second part of this paper, we propose a broadcast-like CPP algorithm (B-CPP) that allows for asynchronous updates of the agents: at every iteration of the algorithm, only a subset of the agents wake up to perform prescribed updates. Thus, B-CPP is more flexible, and due to its broadcast nature, it can further save communication over CPP in certain scenarios [63]. We show that B-CPP also achieves linear convergence for minimizing strongly convex and smooth objectives.
The main contributions of this paper are summarized as follows.
• We propose CPP, a novel decentralized optimization method with communication compression. The method works under a general class of compression operators and is shown to achieve linear convergence for strongly convex and smooth objective functions over general directed graphs. To the best of our knowledge, CPP is the first method that enjoys linear convergence under such a general setting.
• We consider an asynchronous broadcast version of CPP (B-CPP). B-CPP further reduces the communicated data per iteration and is also provably linearly convergent over directed graphs for minimizing strongly convex and smooth objective functions. Numerical experiments demonstrate the advantages of B-CPP in saving communication costs.
The rest of this paper is organized as follows. We provide necessary notation and assumptions in Section II. CPP is introduced and analyzed in Section III. In Section IV, we consider the algorithm B-CPP. Numerical examples are presented in Section V, and we conclude the paper in Section VI.

II. NOTATION AND ASSUMPTIONS
Denote the set of agents as N = {1, 2, · · · , n}. At iteration k, each agent i has a local estimate x_i^k ∈ R^p of the global decision variable and an auxiliary variable y_i^k. We use lowercase bold letters to denote the local variables, and their counterpart uppercase bold letters denote the concatenation of these local variables. For instance, X^k and ∇F(X^k) are the concatenations of {x_i^k}_{i∈N} and {∇f_i(x_i^k)}_{i∈N}, respectively.

Assumption 1. Each f_i is µ-strongly convex and L-smooth, i.e., for any x, x′ ∈ R^p, ⟨∇f_i(x) − ∇f_i(x′), x − x′⟩ ≥ µ‖x − x′‖_2^2 and ‖∇f_i(x) − ∇f_i(x′)‖_2 ≤ L‖x − x′‖_2.

Since all f_i(x) are strongly convex, f(x) admits a unique minimizer x*. Denote X* = 1x*^⊤, where 1 is the all-ones column vector.
Given any nonnegative matrix M ∈ R^{n×n}, we denote by G_M the graph induced by M. The sets N_{M,i}^- = {j : M_ij > 0} and N_{M,i}^+ = {j : M_ji > 0} are called the in-neighbors and out-neighbors of agent i, respectively.
The communication between all the agents is modeled by directed graphs. Given a strongly connected graph G = (N, E) with E ⊂ N × N being the edge set, agent i can receive information from agent j if and only if (i, j) ∈ E. There are two n-by-n nonnegative matrices R and C. A spanning tree T rooted at some i ∈ N in G_R is a subgraph of G_R containing n − 1 edges such that each node except i is connected to i by a path in T. Let R_R and R_{C^⊤} denote the sets of roots of the spanning trees in G_R and G_{C^⊤}, respectively. We have the following assumption on R and C.
Assumption 2. The matrix R is supported by G, and R is a row stochastic matrix, i.e., R1 = 1. The matrix C is also supported by G, and C is a column stochastic matrix, i.e., 1^⊤C = 1^⊤.

By [24, Lemma 1], Assumption 2 implies the following facts: R has a unique left eigenvector s_R with respect to eigenvalue 1 such that s_R^⊤1 = n; C has a unique right eigenvector s_C with respect to eigenvalue 1 such that s_C^⊤1 = n. In addition, the entries of s_R and s_C are all nonnegative.

We denote the spectral radius of a matrix A as ρ(A). The inner product of two matrices is defined as ⟨X, Y⟩ = tr(X^⊤Y), and the Frobenius norm is ‖X‖_F = √⟨X, X⟩. Given a vector d, d_{a:b} is the subvector of d containing the entries indexed from a to b. Given a matrix A, the notation A_{a:b,c:d} denotes the submatrix containing the entries with row index in [a, b] and column index in [c, d]. We abbreviate "1 : n" by ":" and "i : i" by "i". In particular, A_{i,:} and A_{:,j} denote the i-th row and j-th column of A, respectively. For a vector v ∈ R^m, Diag(v) is an m-by-m diagonal matrix with Diag(v)_{ii} = v_i.

Definition 1. Given a vector norm ‖·‖_*, we define the corresponding matrix norm on an n × p matrix A as ‖A‖_* := ‖ [‖A_{:,1}‖_*, ‖A_{:,2}‖_*, . . . , ‖A_{:,p}‖_*] ‖_2.

Definition 2. Given a vector norm ‖·‖_*, we define the corresponding induced norm on an n × n matrix A as ‖A‖_* := sup_{x ≠ 0} ‖Ax‖_* / ‖x‖_*.

Lemma 1 (Lemma 5 in [24]). For any matrices A ∈ R^{n×p}, W ∈ R^{n×n}, and a vector norm ‖·‖_*, we have ‖WA‖_* ≤ ‖W‖_* ‖A‖_*.

The compression in this paper is denoted by COMPRESS(·). For any matrix X ∈ R^{n×p}, we denote by COMPRESS(X) the n-by-p matrix whose i-th row is COMPRESS(X_{i,:}).

Assumption 3. The compression is unbiased, i.e., given x ∈ R^p and x̂ = COMPRESS(x), there exists C_2 > 0 such that E[x̂ | x] = x and E[‖x̂ − x‖_2^2 | x] ≤ C_2 ‖x‖_2^2. Moreover, the random variables generated inside the procedure COMPRESS(x) depend on x only.
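As a concrete illustration of a matrix pair satisfying Assumption 2, the sketch below builds a row stochastic R and a column stochastic C supported by the same directed graph. The cycle-plus-one-link topology and the uniform weights are our own illustrative choices, not prescribed by the paper.

```python
import numpy as np

n = 5
# Directed cycle plus one extra link (agent 1 also receives from agent 3).
# A[i, j] = 1 means agent i receives from agent j; self-loops are included so
# that each agent keeps part of its own state.
edges = [(i, (i - 1) % n) for i in range(n)] + [(1, 3)]
A = np.eye(n)
for (i, j) in edges:
    A[i, j] = 1.0

# Row stochastic R: agent i averages uniformly over its in-neighbors (R @ 1 = 1).
R = A / A.sum(axis=1, keepdims=True)
# Column stochastic C: agent j splits its mass uniformly over its out-neighbors
# (1^T C = 1^T).
C = A / A.sum(axis=0, keepdims=True)

assert np.allclose(R.sum(axis=1), 1.0)
assert np.allclose(C.sum(axis=0), 1.0)
```

Since the underlying cycle is strongly connected, every node is a root of a spanning tree in both G_R and G_{C^⊤} for this particular construction.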

III. A PUSH-PULL METHOD WITH COMPRESSION
In this section, we propose a Push-Pull method with Compression (CPP) in Algorithm 1. We start by discussing the following scheme of Push-Pull/AB from the viewpoint of agent i [24], [25]:

x_i^{k+1} = Σ_{j∈N_{R,i}^-} R_ij (x_j^k − α_j y_j^k),
y_i^{k+1} = Σ_{j∈N_{C,i}^-} C_ij y_j^k + ∇f_i(x_i^{k+1}) − ∇f_i(x_i^k),

with y_i^0 = ∇f_i(x_i^0). At each iteration, agent i computes Σ_j R_ij x_j^k to average the local decision variables received from its neighbors. As each agent takes such a step, the local decision variables tend to get closer to each other and eventually reach consensus when y_i^k goes to zero. In the next "gradient tracking" step, by adding the gradient difference term ∇f_i(x_i^{k+1}) − ∇f_i(x_i^k) into y_i^{k+1} and considering that y_i^0 = ∇f_i(x_i^0) and 1^⊤C = 1^⊤, we have Σ_{i∈N} y_i^k = Σ_{i∈N} ∇f_i(x_i^k) by induction, which indicates that {y_i^k}_{i∈N} can help track the average gradients over the network.
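The Push-Pull/AB scheme just described can be sketched on a toy problem. The quadratic objectives, the ring topology, the stepsize, and the iteration count below are our own illustrative choices; the script checks the tracking invariant Σ_i y_i^k = Σ_i ∇f_i(x_i^k) at every iteration and the convergence of all agents to the common optimum.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 4, 3

# Local quadratics f_i(x) = 0.5 * ||x - b_i||^2; the minimizer of their average
# is the mean of the b_i.
B = rng.standard_normal((n, p))
x_star = B.mean(axis=0)

def grad(X):
    return X - B  # row i is grad f_i evaluated at row i of X

# Mixing matrices on a directed ring with self-loops (illustrative choice).
A = np.eye(n)
for i in range(n):
    A[i, (i - 1) % n] = 1.0
R = A / A.sum(axis=1, keepdims=True)  # row stochastic:    R @ 1 = 1
C = A / A.sum(axis=0, keepdims=True)  # column stochastic: 1^T C = 1^T

alpha = 0.01  # small constant stepsize, chosen conservatively
X = rng.standard_normal((n, p))
Y = grad(X)   # y_i^0 = grad f_i(x_i^0)

for _ in range(10000):
    X_new = R @ (X - alpha * Y)        # "pull" step: mix decision variables
    Y = C @ Y + grad(X_new) - grad(X)  # "push" step: track the average gradient
    X = X_new
    # Tracking invariant: sum_i y_i^k == sum_i grad f_i(x_i^k) at every iteration.
    assert np.allclose(Y.sum(axis=0), grad(X).sum(axis=0))

assert np.allclose(X, x_star, atol=1e-6)  # every agent reaches the optimum
```

The invariant holds because 1^⊤C = 1^⊤ makes the column-stochastic mixing preserve the sum of the y_i, exactly as the induction argument above states.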
The Compressed Push-Pull (CPP) method differs from Push-Pull/AB mainly in the averaging steps, which require communication. To alleviate the impact of compression errors, we use a "damped" averaging step and replace the exact local variables with their communication-efficient counterparts: the new iterate is a convex combination, with weights (1 − β) and β, of the current iterate and the mixture of the recovered neighboring variables. For y_i^k, we take a slightly different averaging step so that the relation Σ_{i∈N} y_i^k = Σ_{i∈N} ∇f_i(x_i^k) is preserved and the compression errors can also be bounded "safely". However, sending x_i^k itself can still be expensive. Therefore, we call the procedure PULL to save communication costs. In this procedure, the compression and the communication are applied to the difference (x_i − u_i) and its compressed version, respectively, where u_i is a local reference (momentum) variable. The compression error can be very small when the difference x_i − u_i is small. With this observation, we have x_{R,i} = Σ_{j∈N_{R,i}^-} R_ij x̂_j with x̂_j being an unbiased approximation of x_j, whose variance converges to 0.
In the PUSH procedure, since the variable y_i^k converges to zero as we will show later, we can simply compress it and estimate Σ_j C_ij y_j^k using the compressed values. Similarly, in each call to PUSH we have y_{C,i} = Σ_{j∈N_{C,i}^-} C_ij ŷ_j, with ŷ_j being the unbiased compression of y_j, whose variance converges to 0 as the input y_j^k converges to 0. Let α = max_{1≤i≤n} α_i, and define α as a diagonal matrix with α_{ii} = α_i. Then, we can rewrite Algorithm 1 in matrix form. Compared to the standard Push-Pull method in [24], the procedures PUSH and PULL save communication costs at the cost of errors in the mixing steps. From the PULL procedure, we can see that W^k is the compression error for X^k; by Assumption 3, the variance of the approximation error W^k also tends to 0. Since Y^0 = ∇F(X^0) and 1^⊤C = 1^⊤, it follows from (5d) and induction that the tracking relation is preserved. Then, as the local variables become closer to each other, Y^k becomes more and more parallel to 1^⊤∇F(X^k). As X^k approaches the optimal point, 1^⊤∇F(X^k) tends to 1^⊤∇F(X*) = 0^⊤. This indicates that Y^k also tends to 0. Then, by Assumption 3, the variance of the compression error E^k also tends to 0. As the compression errors vanish, Algorithm 1 reduces to the Push-Pull method in [24].
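The PULL idea of compressing the difference x_i − u_i can be illustrated schematically. The script below is not the exact CPP update: the drift dynamics of the iterate, the rand-k compressor (with a p/k rescaling to make it unbiased), and the 0.5 damping factor are all our own illustrative choices. It shows that keeping a shared reference u and transmitting only the compressed difference makes the reconstruction error shrink as the iterate stabilizes.

```python
import numpy as np

rng = np.random.default_rng(3)

def rand_k(d, k, rng):
    # Unbiased rand-k sparsifier applied to the *difference* d = x - u.
    p = d.size
    out = np.zeros_like(d)
    idx = rng.choice(p, size=k, replace=False)
    out[idx] = d[idx] * (p / k)  # rescale so E[out] = d
    return out

p = 10
target = rng.standard_normal(p)
x = np.zeros(p)   # a slowly moving local iterate
u = np.zeros(p)   # shared reference ("momentum") held by sender and receiver
errors = []

for _ in range(300):
    x = x + 0.1 * (target - x)   # the iterate drifts toward its limit
    q = rand_k(x - u, 3, rng)    # only the compressed difference is transmitted
    x_hat = u + q                # receiver's unbiased estimate of x
    u = u + 0.5 * q              # damped reference update keeps x - u small
    errors.append(np.linalg.norm(x_hat - x))

# The reconstruction error decays as the iterate stabilizes:
assert np.mean(errors[:20]) > 5 * np.mean(errors[-20:])
```

Because both ends apply the identical update u ← u + 0.5q, sender and receiver keep consistent references without any extra communication, which is the mechanism that lets the compression error vanish along with x_i − u_i.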
To show the convergence of the proposed compressed algorithm, we define α̂ = s_R^⊤ α s_C / n. Under Assumption 2, α̂ > 0 can be ensured by choosing the stepsizes appropriately. Note that the analysis in the rest of this section holds true for an arbitrary w ≥ α/α̂. As we will use w ≥ α/α̂ as an auxiliary parameter to help choose proper stepsizes {α_i}_{i∈N} in Theorem 7, which is the main theorem of this section, we use the same w in the analysis below.
Let F_k denote the σ-field generated by {E^j, W^j}_{j=0}^{k−1}, and let F_k^+ denote the σ-field generated by F_k and W^k. The following lemma comes from (5c)–(5e) and Assumption 3.
Lemma 2. The variables X^k, Y^k, U^k, V^k are measurable with respect to F_k, and X^k is measurable with respect to F_{k−1}^+. Moreover, we have (6).

Proof. By expanding (5c)–(5e) recursively, X^k, Y^k, U^k, and V^k can be represented by linear combinations of X^0, Y^0, U^0, V^0 and the random variables {E^j, W^j}_{j≤k−1}. Then (6) follows directly from Assumption 3.

A. Convergence Analysis for CPP
In this section, we analyze the convergence rate of Algorithm 1. To begin with, we define the averages x̄^k, ȳ^k of X^k, Y^k, as well as the two matrices Π_R and Π_C. In the analysis below, E‖x̄^k − x*‖_2^2 is employed to measure the closeness to the optimal point, while E‖Π_R X^k‖_R^2 and E‖Π_C Y^k‖_C^2 measure the consensus error and the gradient tracking error, respectively. To bound the compression errors, we use E‖X^k − U^k‖_2^2. Compared with the convergence analysis for Push-Pull, the analysis for CPP additionally requires dealing with the compression errors and establishing the relationships among the different types of error terms.

Moreover, another term
is considered to simplify the proof, and its role will be made clear in the follow-up analysis. Now, multiplying both sides of (5c) by s_R^⊤/n yields the expansion for x̄^{k+1}. Multiplying Π_R on both sides of (5c), we obtain the recursion for Π_R X^{k+1}, and multiplying Π_C on both sides of (5d), we obtain the recursion for Π_C Y^{k+1}.

Lemma 3. There are invertible matrices R̃, C̃ ∈ R^{n×n} inducing vector norms ‖·‖_R and ‖·‖_C for which the stated contraction inequalities hold; in particular, ‖·‖_* can be taken as either of these norms. The additional term above is introduced to help simplify the proof.
Proof. To begin with, we choose positive numbers such that the five entries g_i (as functions of γ) are given by the stated expressions. By (14a), for any γ ∈ (0, ξ_1), we have g_1(γ) < 0, where ξ_1 is the minimum positive root of g_1(γ).
Theorem 7. Suppose Assumptions 1–3 hold. There exists γ′ > 0 (the value of γ′ is given in Lemma 6) such that, for any γ satisfying 0 < γ < γ′, if we set β = β′γ² and choose the stepsizes accordingly, then the error quantities converge linearly at the rate ρ(A), where σ, v_1, v_2 are constants given in the proof.
Proof. Denote A = A(α, β, γ, η) for simplicity. Since A is a regular nonnegative matrix¹, by the Perron-Frobenius theorem [65], ρ(A) is an eigenvalue of A, and A has a unique right positive eigenvector v with respect to the eigenvalue ρ(A). Define the constant σ = max_{1≤i≤4} d_i^0 / v_i. Next, we prove the linear convergence by induction. If we have proved d_{1:4}^k ≤ σρ(A)^k v_{1:4}, we will show that it also holds for k + 1. Under the requirement (16) and according to Lemma 6, ρ(A) < 1. By (13a) and the inductive hypothesis, the first bound holds; combining with the inductive hypothesis, we obtain the next bound. Now, by (13b) and (17), the statement holds for k + 1. Therefore, by induction, d_{1:4}^k ≤ σρ(A)^k v_{1:4} for any k ≥ 0, which completes the proof.
Remark 8. In practice, we hand-tune γ under the relations α_i = α = α′γ³ (∀i ∈ N) and β = β′γ² to achieve the best performance.

Remark 9. We can also add momentum variables for the communication of Y^k, as we did for X^k. In this case, linear convergence can be proved similarly. However, we observed little improvement in the numerical experiments when Y^k is equipped with momentum. In addition, more momentum variables require more storage space. Therefore, we omit the details of this case here.

IV. A BROADCAST-LIKE GRADIENT TRACKING METHOD WITH COMPRESSION (B-CPP)
In this section, we consider a broadcast-like gradient tracking method with compression (B-CPP). Broadcast or gossip-based communication protocols are popular for distributed computation due to their low communication costs [61], [62], [63]. In B-CPP, at every iteration k, one agent i_k ∈ N wakes up with uniform probability. This can be easily realized, for example, if each agent wakes up according to an independent Poisson clock with the same rate. Hence the probability Pr[i_k = j] = 1/n for any j ∈ N. In addition, the random variables {i_k}_{k≥0} are mutually independent.
Briefly speaking, each iteration k of the B-CPP algorithm consists of the following steps:
1) One random agent i_k wakes up.
2) Agent i_k sends information to all its out-neighbors in R (N_{R,i_k}^+) and C (N_{C,i_k}^+).
3) The agents who receive information from i_k are awakened and update their local variables.
We remark that the iteration number k is only used for analysis purposes and does not need to be known by the agents. B-CPP can be generalized to the case where each agent wakes up with a different but known probability. The awakened agents know whether they are i_k and whether they are in the set N_{R,i_k}^+ or the set N_{C,i_k}^+, and take different actions accordingly. The B-CPP method is illustrated in Algorithm 2. To implement communication compression in a broadcast setting, a naive way is to let each agent i hold a different momentum variable u_{i,j} for each neighbor j. In this way, each time agent i receives information p from a neighboring agent j, it can restore the information sent by agent j directly by computing p + u_{i,j}. However, this procedure would require as many momentum variables as the total number of edges. Hence it could be impractical when the storage space is limited and the graph is dense. In B-CPP, we overcome this issue so that each agent only uses two momentum variables.
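The uniform wake-up mechanism can be simulated directly: with i.i.d. exponential (Poisson-clock) firing times of equal rate, the agent whose clock fires first is uniformly distributed over the n agents, matching Pr[i_k = j] = 1/n. A quick Monte Carlo check (the agent count and sample size here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, trials = 20, 50000

# Each agent has an independent Poisson clock with the same rate.  At each tick,
# the agent whose exponential firing time is smallest wakes up; with i.i.d.
# exponential times of equal rate, this winner is uniform over the n agents.
firing_times = rng.exponential(1.0, size=(trials, n))
winners = firing_times.argmin(axis=1)
freq = np.bincount(winners, minlength=n) / trials

assert np.allclose(freq, 1.0 / n, atol=0.01)
```

The same simulation with heterogeneous clock rates would give each agent a wake-up probability proportional to its rate, which corresponds to the generalization with different but known probabilities mentioned above.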
To help analyze the convergence rate of Algorithm 2, for 0 ≤ k ≤ K, let us define the final state of variable x_j before iteration k as x_j^k. These x_j^k are written compactly into an n-by-p matrix X^k, and Y^k, U^k, U_R^k, ∇F(X^k) are defined analogously. Let D_k be the σ-field generated by {i_j, p^j, q^j}_{0≤j≤k−1}. Let D_k^+ denote the σ-field generated by D_k and i_k. Let D_k^{++} denote the σ-field generated by D_k and i_k, p^k. For any event A, let χ_A denote the indicator function of A.
The differences between B-CPP and CPP mainly lie in the averaging steps. Take the averaging step for X^k as an example. First, we also have by induction that the gradient tracking relation is preserved. At iteration k, agent i_k compresses p^k = COMPRESS(x_{i_k} − u_{i_k}), sends p^k to all agents j ∈ N_{R,i_k}^+ and wakes up these agents; it also compresses q^k = COMPRESS(y_{i_k}), sends q^k to all agents in N_{C,i_k}^+ and wakes up these agents. At termination, each agent i outputs its final local decision variable x_i. Notice that the update step (18a) is implemented by agent j when agent i_k is an in-neighbor of agent j (or, equivalently, j is an out-neighbor of i_k). Taking expectation on the RHS of (18b) shows that, after taking conditional expectation, the averaging step of B-CPP reduces to that of CPP (the first term of the RHS of (5c)). By choosing a proper value for β, the variance of the RHS of (18) can be made small enough. In this way, the averaging step in B-CPP has a similar effect to that of CPP.
To show the linear convergence of B-CPP, for simplicity, we assume α = α_1 = α_2 = · · · = α_n. We remark that the analysis below can be easily generalized to the case when the stepsizes differ among the agents.
We denote by 1_i ∈ R^{n×1} the vector with 1 in the i-th component and 0 elsewhere. As U^k + P^k and Q^k are used to estimate X^k and Y^k, respectively, we define the error matrices induced by the compression accordingly. The following lemma is a direct corollary of Assumption 3.
Lemma 10. The random variable i_k is independent of D_k. Moreover, by Assumption 3, the compression errors have zero conditional mean and bounded conditional variance. We will also use the following random matrices, which are all measurable with respect to i_k. Now, Algorithm 2 can be rewritten into a more compact form. Compared to CPP, the stochastic mixing matrices induce additional errors that need to be dealt with in the convergence analysis of B-CPP.
To analyze the convergence rate of B-CPP, similar to what we did for CPP, we use four quantities to measure the closeness to the optimal point, the consensus error, the gradient tracking error, and the convergence of the momentum variables, respectively.

And we use
to help simplify the recursive relations between the above four quantities. The definition of V^k is given in (47).
Next, we show that with properly chosen parameters and stepsize, the spectral radius of the parameterized matrix A (α, β, γ, η) in (22) will be less than 1.

The above arguments hold true for arbitrary parameters satisfying (25a), (25b) and (25c). The following theorem shows the linear convergence of Algorithm 2 given proper parameters and stepsize.
Next, we prove the linear convergence by induction.

If we have proved d_{1:4}^k ≤ σρ(A)^k v_{1:4}, we will show that it also holds for k + 1. Under the requirement (26), by (23a) and the inductive hypothesis, the first bound holds; combining with the inductive hypothesis, we bound the optimality gap. By (23b) and (27), and then combining with (27), we bound the consensus error. By (23c) and (28), and then combining with (28), we bound the gradient tracking error. Using (23d) and (29), we conclude that the statement holds for k + 1. Therefore, by induction, d_{1:4}^k ≤ σρ(A)^k v_{1:4} for any k ≥ 0, which completes the proof.

Remark 14. In practice, the parameters β, γ, η and the stepsize α can be chosen in the same way as in CPP.

V. NUMERICAL EXPERIMENTS
In this section, we compare the numerical performance of CPP and B-CPP with that of the Push-Pull/AB method [24], [25]. In the experiments, we equip CPP and B-CPP with different compression operators and consider different graph topologies.
We consider a decentralized ℓ₂-regularized logistic regression problem, where (z_i, λ_i) ∈ R^p × {−1, +1} denotes a training example possessed by agent i. The data are from the QSAR biodegradation data set² [66], where each feature vector has dimension p = 41. In our experiments, we set n = 20 and µ = 0.001. We construct the directed graphs G_R and G_C by adding d directed links to the undirected cycle, respectively. In the experiments below, we consider the cases d = 5, 20, 50, representing sparse, normal and dense graphs, respectively. We consider two kinds of compression operators. The first one is the b-bit ℓ₂-norm quantization, in which v ∈ R^p is a random dithering vector picked uniformly from (0, 1)^p; we choose b = 2, 4, 6 in the experiments. The second one is Rand-k, where x̂ is obtained by randomly choosing k entries of x and setting the remaining entries to 0; we set k = 5, 10, 20, respectively. We then hand-tune the parameters to find the best configuration for each of the numerical cases below.
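The two compression operators used in the experiments can be sketched as follows. The quantizer below is QSGD-style; its exact normalization constants are our assumption about the form used, and the p/k rescaling in Rand-k is our addition to make the operator unbiased, as Assumption 3 requires. The Monte Carlo check confirms unbiasedness for both.

```python
import numpy as np

rng = np.random.default_rng(5)

def quantize_b_bit(x, b, rng):
    # QSGD-style b-bit stochastic quantization with random dithering v ~ U(0,1)^p.
    # Normalizing by ||x||_2 and using s = 2^(b-1) levels is an illustrative choice.
    s = 2 ** (b - 1)
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return np.zeros_like(x)
    v = rng.random(x.shape)
    return norm * np.sign(x) * np.floor(s * np.abs(x) / norm + v) / s

def rand_k(x, k, rng):
    # Rand-k with a p/k rescaling (our addition) so that E[rand_k(x)] = x.
    out = np.zeros_like(x)
    idx = rng.choice(x.size, size=k, replace=False)
    out[idx] = x[idx] * (x.size / k)
    return out

x = rng.standard_normal(41)  # p = 41, the feature dimension of the data set
q_mean = np.mean([quantize_b_bit(x, 4, rng) for _ in range(50000)], axis=0)
r_mean = np.mean([rand_k(x, 5, rng) for _ in range(100000)], axis=0)

# Both operators are unbiased, as required by Assumption 3.
assert np.allclose(q_mean, x, atol=0.02)
assert np.allclose(r_mean, x, atol=0.15)
```

Note the variance trade-off: smaller b or k means fewer transmitted bits per vector but a larger compression variance, which is exactly the trade-off explored in the experiments below.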
In the figures, the y-axis represents the optimality gap f(x_i) − f(x*), and the x-axis is the iteration number or the number of transmitted bits.

A. Linear convergence
The performance of Push-Pull/AB, CPP and B-CPP is illustrated in Fig. 1. To see the performance of B-CPP more clearly, we additionally plot the trajectories of B-CPP at a larger scale in Fig. 2. It can be seen from Fig. 1 and Fig. 2 that all the trajectories exhibit linear convergence, and the exact Push-Pull/AB method is faster than CPP and B-CPP under any compression method. This is reasonable, as the compression operators induce additional errors compared to the exact method, and these errors can slow down convergence. Meanwhile, as the value of b or k increases, both CPP and B-CPP speed up since the compression errors decrease. When b = 6 or k = 20, the trajectories of CPP are very close to those of exact Push-Pull/AB, which indicates that when the compression errors are small, they are no longer the bottleneck of convergence. Within the same number of iterations, CPP outperforms B-CPP, and the trajectories of CPP are smoother than those of B-CPP. These results are expected since CPP updates all the local variables in a single iteration, while in B-CPP, only one node updates together with its neighbors.
² https://archive.ics.uci.edu/ml/datasets/QSAR+biodegradation

To guarantee linear convergence, B-CPP requires smaller parameters and stepsizes. In addition, the mixing matrix at each iteration of B-CPP can be regarded as a stochastic matrix, which induces additional variance.

B. Communication efficiency
When we compare the number of transmitted bits required to reach given levels of accuracy, CPP and B-CPP show their superiority. We consider the same settings as in Section V-A. We can see from all the sub-figures of Fig. 3 that, to reach a high accuracy of about 10^{−15}, the numbers of transmitted bits required by these methods satisfy the ranking B-CPP < CPP < Push-Pull/AB. To see why CPP outperforms Push-Pull/AB, note that the vectors sent in CPP have been compressed, and hence the transmitted bits at each iteration are greatly reduced compared to Push-Pull/AB. Although the additional compression errors slow down the convergence, our design of CPP guarantees that the impact on the convergence rate is relatively small. Therefore, CPP is much more communication-efficient than the exact Push-Pull/AB method. Moreover, in all cases, B-CPP is much more communication-efficient than CPP. This is because when CPP converges, each agent receives similar, or "overlapping", vectors from different neighbors, which harms communication efficiency. By contrast, in B-CPP only one agent wakes up in each iteration, and its information can be utilized by its neighbors more efficiently.
It is worth noting that for both CPP and B-CPP, the choices b = 2 for quantization and k = 5 for Rand-k are more communication-efficient than b = 4, 6 or k = 10, 20. This indicates that increasing the compression accuracy exhibits "marginal effects". In other words, when the compression errors are not the bottleneck for convergence, trading communication costs for faster convergence reduces the overall communication efficiency.

C. Different topologies
We also examine the performance of CPP and B-CPP on communication networks with different levels of connectivity. We consider d = 5, 10, 20, respectively: as the value of d increases, the communication network becomes better connected. As before, both the quantization and the Rand-k compressors are considered, and we show the performance of CPP and B-CPP separately for better clarity.
In Fig. 4 and Fig. 5, we can see that when d increases, the convergence of both CPP/B-CPP and Push-Pull/AB speeds up. This is expected since better connectivity implies smaller consensus errors in each iteration, and the algorithms then perform closer to the (faster) centralized gradient descent algorithm.

VI. CONCLUSIONS
In this paper, we proposed two communication-efficient algorithms for decentralized optimization over a multi-agent network with general directed topology. First, we considered a novel communication-efficient gradient tracking based method, termed CPP, that combines the Push-Pull method with communication compression. CPP can be applied to a general class of unbiased compression operators and achieves linear convergence for strongly convex and smooth objective functions. Second, we considered a broadcast-like version of CPP (B-CPP), which also achieves a linear convergence rate for strongly convex and smooth objective functions. B-CPP can be applied in an asynchronous broadcast setting and further reduces communication costs compared to CPP.

Supplementary Material

A. Proof of Lemma 3

We prove the existence of R̃, θ_R, δ_{R,2} for a given R; the existence of C̃, θ_C, δ_{C,2} can be shown similarly. Under Assumption 2, all eigenvalues of R lie in the unit circle, and 1 is the only eigenvalue with modulus one. In addition, the multiplicity of the eigenvalue 1 is one. Denote the eigenvalues of R by λ_1 = 1, λ_2, . . . , λ_n.

There is an invertible matrix D such that D^{−1}RD = J, where J is the Jordan form of R. The columns of D are right eigenvectors or generalized right eigenvectors of R. We can rearrange the columns of D such that the unique 1-by-1 Jordan block of the eigenvalue 1 is at the top-left of J; then, the first column of D is parallel to the all-ones vector 1.

We denote u = s_R in this proof for simplicity. As the eigenvalue 1 has multiplicity one in the spectrum of R and u is the left eigenvector with respect to 1, all other columns of D except the first are orthogonal to u. By the choice of t, we have t‖Y‖_2 < 1, which yields the inequality (11) as well as ‖R‖_R ≤ 2 and ‖R − I‖_R ≤ 3.

B. Proof of Lemma 5
By Assumption 3, we obtain the variance bound (30). Next, we provide bounds for the quantities in turn.
Bounding E‖x̄^{k+1} − x*‖_2^2: we first analyze the terms in (8) one by one. By Lemma 10 in [12] and Assumption 1, for α ≤ 2/(µ + L), the contraction bound holds. By Assumption 1 and the fact ‖·‖_2 ≤ ‖·‖_R assumed in Lemma 3, we bound the remaining terms. Taking conditional expectation on both sides of (8) yields the estimate, where the first equality uses Lemma 2 to eliminate the cross terms; the first inequality is by (30); the second inequality follows from the preceding bounds.

Bounding E‖Π_R X^{k+1}‖_R^2: taking conditional expectation on both sides of (9), we obtain the estimate. The first equality uses Lemma 2 to eliminate cross terms; the first inequality follows by setting θ = 1/(1 − θ_R β) in Lemma 4; the last inequality is by Lemma 3 and (30).
Bounding E∥Π_C Y^{k+1}∥²_C: by (5c), the update is measurable with respect to F_k. We then bound the resulting terms, where we used a bound of 4n in the last inequality. Thus, by Lemma 2 and (30), we obtain the intermediate estimate, where we also used the fact ∥R∥²_2 ≤ n in the last inequality. Taking conditional expectation on both sides of (10) yields the desired bound: the first equality uses Lemma 2 to eliminate cross terms; the first inequality is by Lemma 4; and the last inequality follows from (30), Lemma 3, and the L-smoothness in Assumption 1. The claim then follows by substituting (37) and (38).

Bounding E∥X^k − U^k∥²_2: by (5e) and taking conditional expectation, we obtain the desired bound, where the first inequality is by Lemma 2; the second inequality is by (30) and the fact ∥R∥²_2 ≤ n∥R∥²_∞ = n; the third inequality uses Lemma 4; and the last inequality is by substituting (37).

Now, we bound the term ∥Y^k∥², which occurs frequently above. By Assumption 1 and the fact 1^⊤∇F(X*) = 0^⊤, we obtain the stated bound. The relation (13a) follows by taking expectation on both sides of (43). Combining (33), (35), (40), and (41) and taking full expectation yields (13b).
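The "eliminate the cross terms" steps above rely on the identity E∥A + W∥² = ∥A∥² + E∥W∥² when W has zero mean conditioned on the σ-algebra with respect to which A is measurable, since the cross term 2E⟨A, W⟩ vanishes. A minimal sketch with a hypothetical two-point zero-mean perturbation (not the paper's compression error W^k) verifies the identity exactly:

```python
import numpy as np

# W is uniform on {+v, -v}, hence zero mean; a is deterministic.
a = np.array([1.0, -2.0, 0.5])
v = np.array([0.3, 0.1, -0.7])

lhs = 0.5 * np.sum((a + v) ** 2) + 0.5 * np.sum((a - v) ** 2)  # E||a + W||^2
rhs = np.sum(a ** 2) + np.sum(v ** 2)                          # ||a||^2 + E||W||^2
assert np.isclose(lhs, rhs)
```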

C. Proof of Lemma 11
Define the auxiliary quantities below. It follows by Lemma 10 that the stated relation holds. Now, we can expand (21a), where in the second equality we used the fact established above. By multiplying s_R^⊤/n on both sides of (46) and defining ᾱ, by Lemma 10 in [12], for α ≤ 2/(µ + L), the contraction holds. Using (45) with the fact E[W^k | D_k^+] = 0 from Lemma 10, and taking conditional expectation with (49) to eliminate cross terms, similarly to (33) there holds the stated bound. The second inequality above is by Lemma 4; the third inequality uses (32) and (48); the last inequality uses the relation ∥Λ∥_2 < 1, which follows from the fact that Λ is a diagonal matrix with all diagonal entries in (0, 1).
To bound E[∥Π_R X^{k+1}∥_R | D_k], we first multiply Π_R on both sides of (46). Using (49) to eliminate cross terms, we obtain the stated bound, where the second inequality is by Lemma 4, and the last inequality uses Lemma 3 and Lemma 4.
Next, we proceed to bound E[∥X^{k+1} − X^k∥²_2 | D_k]. By (46) and the fact (R − I)1 = 0, again taking conditional expectation and using (49) to eliminate the cross terms, we obtain the stated bound, where we used the fact ∥R − I∥_R ≤ 3 from Lemma 3 in the last inequality.

To bound E[∥Π_C Y^{k+1}∥²_C | D_k], we first expand (21b) using Q^k = Y^k − E^k. By multiplying Π_C, we obtain the corresponding recursion. Then, taking conditional expectation on both sides of the above equation and using (20) yields the stated bound, where we also used the fact ∥Π_C∥_C = 1 from Lemma 3 in the last inequality.
It follows by the definitions that
By combining the above equations and using the L-smoothness, we obtain the stated bound. To bound E[∥U^{k+1} − X^{k+2}∥²_2 | D_k], we expand (21c) using the relation W^k = X^k − P^k − U^k. By the definition of V^k in (47), the relevant quantity is measurable with respect to D_k^+. Now, by taking conditional expectation and using (19), we obtain the bound, where we also used the fact ∥Λ^k∥²_2 = n in the last inequality. Combining the above equations, we arrive at the stated estimate. For the variable V^k defined in (47), we have the corresponding bound; here we used the fact E[W^k | D_k^+] = 0 from Lemma 10 in the first equality and (19) in the last inequality. The relation (23a) follows by taking full expectation on both sides of (43); the relation (23b) is derived by taking full expectation on both sides of (58); the relation (23c) follows by taking full expectation on both sides of (53); and the relation (23d) can be derived by combining (50), (52), (54), and (56) and taking full expectation.
X^k and Y^k, respectively. The differences between X̂^k (resp. Ŷ^k) and X^k (resp. Y^k) are induced by the compression errors. Let us denote the approximation error as w
Here, |||·|||_R and |||·|||_C refer to the norms defined in the above lemma.

Lemma 4.
For any vector norm ∥·∥_* induced by an inner product ⟨·, ·⟩_* and any θ > 1, we have ∥a + b∥²_* ≤ θ∥a∥²_* + (θ/(θ−1))∥b∥²_*. By Definition 1, it suffices to prove ∥a + b∥²_* ≤ θ∥a∥²_* + (θ/(θ−1))∥b∥²_* for any vectors a, b ∈ R^n, and this can be verified by the elementary inequality 2⟨a, b⟩_* ≤ (θ − 1)∥a∥²_* + (1/(θ−1))∥b∥²_*. The vector norms ∥·∥_R and ∥·∥_C are induced by the inner products ⟨Rx, Ry⟩_* and ⟨Cx, Cy⟩_*, respectively.
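A quick numerical sanity check of Lemma 4 (our own sketch; the matrix M below is an arbitrary symmetric positive definite example, not the paper's R- or C-induced norm):

```python
import numpy as np

# Norm induced by the inner product <x, y> = x^T M y, M symmetric positive definite.
rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
M = B @ B.T + 4 * np.eye(4)

def sqnorm(x):
    return x @ M @ x                 # squared induced norm ||x||_*^2

ok = True
for _ in range(1000):
    a, b = rng.standard_normal(4), rng.standard_normal(4)
    theta = 1.001 + rng.random() * 5.0   # theta > 1
    ok &= sqnorm(a + b) <= theta * sqnorm(a) + theta / (theta - 1) * sqnorm(b) + 1e-9
assert ok
```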

Fig. 3. Performance of Push-Pull/AB, CPP, B-CPP against the number of transmitted bits: the left column shows the results with quantization (b = 2, 4, 6) and the right column shows the results with Rand-k (k = 5, 10, 20).
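For illustration, here is a minimal sketch of one of the unbiased compressors used in these experiments, Rand-k, which keeps k uniformly random coordinates of x ∈ R^d and rescales them by d/k so that E[C(x)] = x; the implementation details are ours, not taken from the paper:

```python
import numpy as np
from itertools import combinations

def rand_k(x, k, rng):
    """Rand-k compressor: keep k random coordinates, rescale by d/k (unbiased)."""
    d = len(x)
    out = np.zeros_like(x)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = (d / k) * x[idx]
    return out

# Exact unbiasedness check: averaging C(x) over all k-subsets recovers x,
# since each coordinate survives with probability k/d and is scaled by d/k.
x = np.array([1.0, -2.0, 0.5, 3.0, -0.1])
d, k = len(x), 2
mean = np.zeros(d)
for idx in combinations(range(d), k):
    c = np.zeros(d)
    c[list(idx)] = (d / k) * x[list(idx)]
    mean += c
mean /= len(list(combinations(range(d), k)))
assert np.allclose(mean, x)

# One random realization keeps at most k coordinates.
sample = rand_k(x, k, np.random.default_rng(0))
assert np.count_nonzero(sample) <= k
```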

Fig. 4. Performance of CPP and Push-Pull/AB with different communication networks under both quantization and Rand-k compressors.
Fig. 5.

which follows by setting θ = 1/(1 − αµ) in Lemma 4; the third inequality is by (31) and (32); and we used the relation ᾱ ≥ α/w and the fact ∥Π_C Y^k∥ ≤ ∥Π_C Y^k∥_C from Lemma 3 in the last inequality.