DIGEST: Fast and Communication Efficient Decentralized Learning with Local Updates

Two widely considered decentralized learning algorithms are Gossip and random walk-based learning. Gossip algorithms (both synchronous and asynchronous versions) suffer from high communication cost, while random walk-based learning experiences increased convergence time. In this paper, we design a fast and communication-efficient asynchronous decentralized learning mechanism, DIGEST, by taking advantage of both Gossip and random-walk ideas, and focusing on stochastic gradient descent (SGD). DIGEST is an asynchronous decentralized algorithm building on local-SGD algorithms, which were originally designed for communication-efficient centralized learning. We design both single-stream and multi-stream DIGEST, where the communication overhead may increase when the number of streams increases, and there is a convergence and communication overhead trade-off which can be leveraged. We analyze the convergence of single- and multi-stream DIGEST, and prove that both algorithms approach the optimal solution asymptotically for both iid and non-iid data distributions. We evaluate the performance of single- and multi-stream DIGEST for logistic regression and a deep neural network, ResNet20. The simulation results confirm that multi-stream DIGEST has nice convergence properties; i.e., its convergence time is better than or comparable to the baselines in the iid setting, and outperforms the baselines in the non-iid setting.


I. Introduction
Emerging applications such as the Internet of Things (IoT), mobile healthcare, self-driving cars, etc. dictate that learning be performed on data predominantly originating at edge and end-user devices [1]-[3]. A growing body of research work, e.g., federated learning [4]-[9], has focused on engaging the edge in the learning process, along with the cloud, by allowing the data to be processed locally instead of being shipped to the cloud. Learning beyond the cloud can be advantageous in terms of better utilization of network resources, delay reduction, and resiliency against cloud unavailability and catastrophic failures. However, the proposed solutions, like federated learning, predominantly suffer from having a critical centralized component, referred to as the Parameter Server (PS), that organizes and aggregates the devices' computations. Decentralized learning, advocating the elimination of PSs, emerges as a promising solution to this problem.
Decentralized algorithms have been extensively studied in the literature, with Gossip algorithms receiving the lion's share of research attention [10]-[17]. In Gossip algorithms, each node (edge or end-user device) has its own locally kept model, on which it effectuates the learning by talking to its neighbors. This makes Gossip attractive from a failure-tolerance perspective. However, this comes at the expense of high network resource utilization. As shown in Fig. 1a, nodes in the synchronous Gossip algorithm use a synchronous clock to perform local model update and aggregation, where aggregation demands receiving model updates from the neighbors. Until their synchronization clocks expire, the nodes receive model updates from their neighbors. As seen, there must be data communication among all nodes after each model update, which is a significant communication overhead. Furthermore, some nodes may be a bottleneck for the synchronization, as these nodes (also called stragglers) can be delayed due to computation and/or communication delays, which increases the convergence time. This is due to the synchronous clock time being determined according to the slowest node (or a set of fastest nodes).
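To make the per-round cost of synchronous Gossip concrete, the following is a minimal sketch of one round (our own simplified toy model with uniform neighborhood averaging, not the paper's algorithm): every node takes a local SGD step and then averages its model with all of its neighbors, so each node exchanges O(deg) messages per round.

```python
import numpy as np

def gossip_round(models, neighbors, grads, lr=0.1):
    """One synchronous Gossip round: every node takes a local SGD step,
    then averages its model with all of its neighbors (O(deg) messages
    per node per round)."""
    stepped = {v: x - lr * grads[v] for v, x in models.items()}
    return {v: np.mean([stepped[u] for u in [v] + neighbors[v]], axis=0)
            for v in models}
```

With zero gradients, one round on two connected nodes holding models 0 and 2 averages both to 1, illustrating the mixing step.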
FIGURE 1: DIGEST in perspective as compared to existing decentralized learning algorithms: (a) synchronous Gossip, (b) asynchronous Gossip, (c) random walk, and (d) DIGEST. Note that "∇" represents a model update, "Xmit" represents the transmission of a model from a node to one of its neighbors, "Recv" represents the communication duration while receiving model updates from all of a node's neighbors, and "A" represents model aggregation. x_t^v shows the local model of node v at iteration t. For the random-walk algorithm, the global model iterates are denoted as x_t. We note that the absence of blue boxes means that nodes do not continue their computations, while the absence of red boxes means that there is no communication among neighboring nodes. We also note that communication ("Xmit") and computation ("∇") are parallel in DIGEST and asynchronous Gossip, but aggregation ("A") and computation are sequential. The figure shows them as parallel tasks for the sake of easier presentation, considering that the duration of aggregation ("A") is negligible compared to communication ("Xmit") and computation ("∇").

Asynchronous Gossip algorithms, where nodes communicate asynchronously and without waiting for others, are
promising to reduce idle nodes and eliminate stragglers, i.e., delayed nodes [18]-[21]. Indeed, asynchronous algorithms significantly reduce the idle times of nodes by performing model updates and model exchanges simultaneously, as illustrated in Fig. 1b. For example, node 1 can still update its model from x_t^1 to x_{t+1}^1 and x_{t+2}^1 while receiving model updates from its neighbors. When it has received from all (or some) of its neighbors, it performs model aggregation. However, nodes still rely on iterative Gossip averaging of their models, so updates propagate gradually across the network. Such delayed updates, also referred to as gradient staleness in asynchronous Gossip, may lead to high error floors [22] or require very strict assumptions to converge to the optimum solution [18]. Moreover, such methods must be implemented with caution to prevent deadlocks [19].
In both synchronous and asynchronous Gossip, models propagate over the nodes and are updated by each node gradually, as seen in Fig. 2a. This may lead to a phenomenon we name "diminishing updates", where a node's update (e.g., node 1 in Fig. 2a), even though crucial for convergence, may be averaged and mixed with other models at the next node (e.g., node 2 in Fig. 2a). Diminishing updates are more pronounced when a model passes through high-degree nodes, and are detrimental to convergence when the data distribution is heterogeneous across nodes.
If Gossip algorithms are one end of the spectrum of decentralized learning algorithms, the other end is random walk-based decentralized learning [23]-[26]. Random-walk algorithms activate one node at a time, which updates the global model with its local data, as illustrated in Fig. 1c. Then, the node randomly selects one of its neighbors and sends it the updated global model. The selected neighbor becomes the newly activated node, so it updates the global model using its local data. This continues until convergence. Random-walk algorithms reduce the communication cost as well as the computation, at the price of increased convergence time due to idle times at nodes.
The goal of this work is to take advantage of both Gossip and random-walk ideas to design a fast and communication-efficient decentralized learning mechanism. Our key intuitions are: (i) nodes do not need to communicate as much as in Gossip to update their models, i.e., a sporadic exchange of model updates is sufficient; (ii) the diminishing updates inherent to Gossip algorithms can be eliminated by employing a global model (as shown in Fig. 2b), as nodes do not average out multiple models received from multiple neighbors; they only add their updates to the global model; and (iii) nodes do not need to wait idle as in random walk.
We design a fast and communication-efficient asynchronous decentralized learning mechanism, DIGEST, by particularly focusing on stochastic gradient descent (SGD). DIGEST is an asynchronous decentralized learning algorithm building on local-SGD algorithms, which were originally designed for communication-efficient centralized learning [27]-[29]. In local-SGD, each node performs multiple model updates before sending the model to the PS. The PS aggregates the updates received from multiple nodes and transmits the updated global model back to the nodes. The sporadic communication between nodes and the PS reduces the communication overhead. We exploit this idea for asynchronous decentralized learning. The following are our contributions.
• Design of single- and multi-stream DIGEST. We design single-stream DIGEST, where a single global model stream traverses the network, and multi-stream DIGEST, illustrated in Fig. 3, where each stream operates on a smaller part of the network, so global model updates can be completed quickly. We identify the multiple streams using a rooted tree, which is determined in a decentralized manner via a distance-vector algorithm [30]. Note that two or more streams may intersect at one node, which is how the streams aggregate their global models. The communication overhead may increase when the number of streams increases, and there is a nice convergence and communication overhead trade-off which can be exploited by adjusting H.
• Convergence analysis of DIGEST. We analyze the convergence of single- and multi-stream DIGEST, and prove that both algorithms approach the optimal solution asymptotically. We show that DIGEST's approach of simultaneous global model updating and local SGD iterations does not hurt the convergence rate and does not create any convergence gap. Furthermore, DIGEST is not affected by the network topology, i.e., even high-degree nodes do not create any learning bias. We also indicate how frequently global model updates should be made, i.e., what the value of H should be to achieve linear speedup O(1/(V T)) in both iid and non-iid cases, where V is the number of nodes in the network and T is the total number of iterations.
• Evaluation of DIGEST. We evaluate the performance of single- and multi-stream DIGEST for logistic regression and the deep neural network ResNet20 [31] on the datasets w8a [32], MNIST [33], and CIFAR-10 [34]. We consider both iid and non-iid data distributions over various network topologies with different numbers of nodes. The simulation results confirm that DIGEST has nice convergence properties; i.e., its convergence time is better than or comparable to the baselines in the iid setting, and outperforms the baselines in the non-iid setting.
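As background for the local-SGD building block that DIGEST builds on, the following is a minimal centralized sketch (our own illustrative code with hypothetical names, not the paper's implementation): each worker runs H local SGD steps without communicating, then a server averages the resulting models, so communication occurs only once per H steps.

```python
import numpy as np

def local_sgd(grad, x0, workers_data, H, rounds, lr=0.1):
    """Centralized local-SGD sketch: each worker takes H local SGD
    steps on its own data, then the server averages the workers'
    models; this happens for `rounds` communication rounds."""
    x = x0.copy()
    for _ in range(rounds):
        local_models = []
        for data in workers_data:
            xv = x.copy()
            for _ in range(H):
                xv -= lr * grad(xv, data)   # H local steps, no communication
            local_models.append(xv)
        x = np.mean(local_models, axis=0)   # one communication round
    return x
```

On toy quadratic losses f_v(x) = (x − a_v)²/2, the averaged model converges to the mean of the a_v, the minimizer of the global objective.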

II. Related Work
Decentralized optimization algorithms have been widely studied in the literature, where nodes interact with their neighbors to solve an optimization problem [35]-[38]. Despite their potential, these algorithms suffer from a bias with non-iid data [39], and they require synchronization and orchestration among nodes, which is costly in terms of communication overhead. Decentralized algorithms based on Gossip involve a mixing step, where nodes compute their new models by mixing their own and their neighbors' models [16], [40], [41]. However, this is costly in terms of communication, as every node requires O(deg(G)) data exchanges for every model update on a graph G. Furthermore, model updates propagate gradually over the network due to iterative gossip averaging. Such gradual model propagation slows convergence and makes the learning mechanism very sensitive to the data distribution over nodes. Finally, Gossip algorithms tend to favor higher-degree nodes while updating models, which causes slower convergence for non-iid data [42]. Our goal in this paper is to reduce the communication cost in decentralized learning for any data distribution without hurting convergence.
Asynchronous Gossip algorithms are designed to improve on synchronous Gossip algorithms. The main focus of asynchronous Gossip is the reduction of synchronization costs in a gossip setting by utilizing non-blocking communication [18], [19], [43]. This means that nodes may receive and use stale versions of neighboring nodes' models to update their own models. Despite this modification, asynchronous Gossip algorithms continue to rely on traditional Gossip to spread information, where each node sends its model to all of its neighbors, which still introduces high communication cost. Our goal in this paper is to reduce the communication cost of decentralized learning in an asynchronous manner.
Random walk-based decentralized learning is considered in [24], which is similar to work on random-walk data sampling for stochastic gradient descent, e.g., [25], [26]. Reducing the global averaging rounds as compared to Gossip-based mechanisms is considered in [44] via one-shot averaging. However, the global averaging rounds require a long synchronization duration for large networks, which increases the convergence time. Also, [44] relies on strong assumptions and considers only iid data. Compared to Gossip and random walk-based algorithms, DIGEST provides communication-efficient decentralized learning without hurting the convergence rate for both iid and non-iid data.

III. Design of DIGEST

A. Preliminaries
Network Topology. We model the underlying network topology with a connected graph G = (V, E), where V is the set of vertices (nodes) and E is the set of edges. The vertex set contains V nodes, i.e., |V| = V, where |·| denotes the size of a set. The computing capabilities of nodes are arbitrary and heterogeneous. If node i is connected to node j through a communication link and can transmit data, then link {i, j} is in the edge set, i.e., {i, j} ∈ E. The set of nodes that node i is connected to and can transmit data to is called the neighbors of node i, denoted by N_i. We do not make any assumptions about the behavior of the communication links; there can be an arbitrary, but finite, amount of delay over the links.
Data. We consider a setup where each node has access to a subset of the data samples D. Each node v has a local dataset D_v, where D_v = |D_v| is the size of the local dataset and D = Σ_{v=1}^{V} D_v. The distribution of data across nodes is not necessarily identically and independently distributed (non-iid).
Stochastic Optimization. Assume that the nodes in the network jointly minimize a d-dimensional function f : R^d → R. The goal of the nodes is to converge on a model x*, which minimizes the empirical loss over the D samples, i.e., x* := arg min_x f(x), where f(x) = (1/D) Σ_{i=1}^{D} f_i(x) and f_i(x) : R^d → R is the loss function of x associated with data sample i. The optimum value is denoted by f*. The loss function on the local dataset of node v is f_v(x) = (1/D_v) Σ_{i∈D_v} f_i(x).
Notation. We provide our notation table in Appendix A.
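The global objective decomposes over the nodes as f(x) = Σ_v (D_v/D) f_v(x), which is why DIGEST weights each node's contribution by D_v/D. A small numerical sketch (with an illustrative squared loss of our own choosing) verifies this decomposition:

```python
import numpy as np

def f_i(x, sample):
    """Per-sample squared loss (illustrative choice, not the paper's)."""
    a, b = sample
    return 0.5 * (a * x - b) ** 2

def global_loss(x, dataset):
    """f(x) = (1/D) * sum_i f_i(x) over the whole dataset."""
    return sum(f_i(x, s) for s in dataset) / len(dataset)

def local_loss(x, local_dataset):
    """f_v(x) = (1/D_v) * sum over node v's local samples."""
    return sum(f_i(x, s) for s in local_dataset) / len(local_dataset)
```

For any split of the dataset into local datasets, global_loss equals the D_v/D-weighted sum of the local losses.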

B. Single-Stream DIGEST

1) Overview
Local Model Update. We assume that time is slotted and that at each slot/iteration a local model is updated. However, the calculation of a gradient may take more than one slot, vary over time, or not fit into slot boundaries. Thus, at each iteration t, any gradients that have been delayed up to iteration t and not yet used in previous local updates are used to update the local model. We note that time slots across nodes do not need to be synchronized in DIGEST, as each node can have its own iteration sequence and update local and global models over its own sequence. The only assumption we make is that the slot sizes are the same across nodes, which can be decided a priori.
Let L_T^v = {l_t^v}_{0≤t<T} be the set of the delayed gradient calculations at node v, where l_t^v shows that the local-SGD update of iteration t is delayed until iteration l_t^v. For instance, l_{t'}^v = t means that the local-SGD update of iteration t' lags behind and is performed in iteration t, t ≥ t'. Then, we define u_t^v = {t' | l_{t'}^v = t} to show all the updates completed at iteration t in node v. If there is no global update at node v, the local model is updated by applying all the gradients completed at iteration t, i.e., those indexed by u_t^v. The global model, with x_0 as the initial model, is updated across all nodes by taking into account all delayed gradient calculations, where the contribution of node v is weighted by the ratio D_v/D to give more weight to the gradients computed on larger datasets. Now that we have provided an overview of DIGEST, we next present the details of how the DIGEST algorithms operate. We define visited as the set of nodes that have recently been visited for global model updates; it is initialized as an empty set at node v. We define a period of time during which all the nodes in V are visited at least once as a synchronization round. During a synchronization round, all nodes update their local models with a global model, as they are visited at least once. More details regarding the visited set will be provided as part of Alg. 2.
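The delayed-gradient bookkeeping above can be sketched as follows (our own data structure for illustration: `pending` maps a completion slot l_{t'} to the gradients whose computation started at t'; the set popped at slot t plays the role of u_t^v):

```python
import numpy as np

def delayed_local_update(x, pending, t, lr=0.05):
    """Apply every gradient whose (possibly delayed) computation has
    finished by iteration t. `pending` maps completion slot -> list of
    gradients; popping slot t yields the updates u_t completed at t."""
    for g in pending.pop(t, []):   # gradients completed exactly at slot t
        x = x - lr * g
    return x
```

A gradient scheduled to finish at slot 2 leaves the model untouched at slot 1 and is applied at slot 2.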
The node from which node v receives the global model is denoted by pre_node; its initial value is set to v, as there is no previous node at the start. The set of global model update indicators, S_T^v = {s_t^v}_{0<t≤T}, is initialized as an empty set, where T is the number of slots that Alg. 1 runs. Assuming that v_0 is the node where the global model update starts, s_1^{v_0} is set to 1, i.e., s_1^{v_0} = 1.
Algorithm 1 Local and global model update of DIGEST at node v ∈ V.
At every iteration t, node v first gets one data sample from the local dataset randomly (line 3) and computes a stochastic local gradient (line 4) based on the selected data sample and the current model at node v, i.e., x_t^v. Then, node v uses all the gradients whose computations were delayed until iteration t and that have not been used in local model updates so far (line 5).
If node v receives a "message" from one of its neighbors at slot t, then it should update the global model. Each message contains the global model x̄_t, the set of visited nodes (visited), the id of the node that sends this message to node v, e.g., v', and a parameter m, which is always set to 0 in single-stream DIGEST but may take different values in multi-stream DIGEST. After the message is extracted, node v updates its global and local models. If the global model is updated at node v, i.e., if s_t^v = 1, then node v creates a message and sends it to one of its neighbors if (i) visited ≠ V: not all nodes have been visited in the current synchronization round; or (ii) mod(t, H) = 0: this indicates the start of a new synchronization round, which happens periodically every H iterations. In other words, global model synchronization continues until all nodes in V are visited. Then, the global model update is paused until a new synchronization round (satisfied by line 17), which starts every H iterations. We will describe how H should be selected later in the paper as part of our convergence analysis and evaluations. If one of the conditions in line 14 is satisfied, node v sends the global model to one of its neighbors by calling Alg. 2.
Sending the Global Model. Alg. 2 describes the logic of DIGEST at node v for sending the global model to a neighboring node. Alg. 2 implements a depth-first search (DFS) to traverse all the nodes in the network within a synchronization round. If v has not been visited before in this synchronization round, it is added to visited (line 3), and its parent node p_m^v is set to pre_node (line 4). The parent node is the node from which node v receives the global model for the first time in this synchronization round. C is the set of nodes that node v can possibly transmit to; it includes all of the neighboring nodes that are not in the visited set. If C is not empty, one of its elements v' is chosen randomly (line 8), and a message including the global model is transmitted from node v to node v'. If C is empty, i.e., all the neighbors of node v have been visited in the current synchronization round, the message is sent to the parent of node v (p_m^v) (line 11). We note that if all the nodes in the network have been visited, Alg. 1 pauses the global model update (line 17) and Alg. 2 is not called. Note that Alg. 2 and Alg. 1 operate simultaneously; one does not need to stop and wait for the other, as illustrated in Fig. 1d.
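The DFS forwarding rule can be sketched as follows (a simplified model of the Alg. 2 logic, with our own function and variable names; parent bookkeeping is done by the caller on first visit): descend to a random unvisited neighbor if one exists, otherwise backtrack to the parent, and stop once the whole network has been visited.

```python
import random

def next_hop(v, neighbors, visited, parent):
    """Sketch of the Alg. 2 send logic at node v: return the node the
    global model is forwarded to, or None when the round is complete.
    `parent[v]` is the node v first received the global model from."""
    visited.add(v)
    C = [u for u in neighbors[v] if u not in visited]
    if C:
        return random.choice(C)   # descend to an unvisited neighbor
    if parent[v] != v:
        return parent[v]          # backtrack toward the parent
    return None                   # back at the start node, all visited
```

Driving this rule on a path graph 0-1-2 starting at node 0 visits every node exactly once before backtracking to the start.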

C. Multi-Stream DIGEST

1) Tree Construction and Multiple Streams
Our goal in multi-stream DIGEST is to find a shortest-path tree so that model updates can be distributed quickly. We use a classical distance-vector routing algorithm such as Bellman-Ford to construct a shortest-path tree. The Bellman-Ford algorithm is optimal [45] in the sense that it results in a shortest-path tree when a root node is fixed. Multi-stream DIGEST finds the best tree among all shortest-path trees (constructed for each node using Bellman-Ford). The best tree is the one with the smallest range (distance from the root to the farthest leaf node). In particular, our first step is to create a rooted tree from our undirected graph G via Bellman-Ford, where each node v in graph G learns its delay distance d_{vu}^G to every node u in a decentralized manner via message passing. We define the radius of node v as R_v^G = max_u d_{vu}^G, the largest distance from node v. The root of the network is the node with the smallest radius, i.e., r = arg min_v {R_v^G}, where r is the root node. The shortest-delay tree ST_r rooted at r is constructed in a decentralized manner, as each node keeps the d_{vu}^G information.
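Given the learned delay distances, root selection reduces to a min-max computation. A minimal sketch (our own helper name; the distances would come from a decentralized Bellman-Ford in practice):

```python
def pick_root(delays):
    """Choose the root as the node with the smallest radius
    R_v = max_u d_vu, given all-pairs delay distances as a dict of
    dicts: delays[v][u] = d_vu."""
    radius = {v: max(d.values()) for v, d in delays.items()}
    return min(radius, key=radius.get)
```

On a path graph 0-1-2 with unit delays, the middle node has radius 1 while the endpoints have radius 2, so node 1 is chosen as root.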
After the tree is constructed, multiple streams are created to exchange the global model in the network. First, the root node creates a number of streams equal to the number of its children. Each of these streams has a range, which starts from the root node and ends at a child node if the child node itself has more than one child. In that case, the child node behaves exactly as a root node and creates multiple streams towards its children by following the same rule described for the root node. Eventually, there are M streams in the tree, and the set of streams that go through node v is M_v. The set of nodes in the range of stream m is V_m. The set of streams between root r and node v is defined as P_v^r.

2) Algorithm Design
Multi-stream DIGEST is summarized in Alg. 3. The main difference between Alg. 1 and Alg. 3 is that the synchronization state (e.g., visited and pre_node) is maintained separately for each stream m. Each node v has a queue to store all the messages that it receives from its neighbors, initialized as an empty queue at the start. Whenever node v receives a message from one of its neighbors, the message is added to the queue. Each node can receive up to |M_v| messages related to different streams, so the size of the queue is |M_v|. Each message carries a stream index m (line 10).
Node v extracts all the messages in its queue (lines 9-12). Then, it updates its global and local models as in Alg. 1. The global model is updated using the most recent local updates of node v and the global updates of the other streams (line 15). Global model synchronization continues until all nodes in V_m are visited for stream m. Then, the global model update is paused until a new synchronization round, which starts every H_m iterations. The policy for selecting H_m is explained in the next section.

IV. Convergence Analysis of DIGEST
We use the following assumptions for the convergence analysis of single-and multi-stream DIGEST.
Algorithm 3 DIGEST on node v ∈ V with multiple streams.
[Algorithm 3 pseudocode omitted; in outline, each node v initializes its per-stream state, starts computing the gradient ∇f_{i_t^v}(x_t^v), drains its message queue (lines 8-15), and, for each stream m ∈ M_v, updates the global model and forwards it, together with pre_node[m] and r, to a neighboring node by calling Alg. 2.]

2) Bounded variance. The variance of the stochastic gradient is bounded for all nodes, i.e., E‖∇f_{i_t^v}(x_t^v) − ∇f_v(x_t^v)‖² ≤ σ², 0 ≤ t < T.
3) Bounded diversity. The diversity of the local loss functions and the global loss function is bounded, i.e., (1/V) Σ_{v∈V} ‖∇f_v(x) − ∇f(x)‖² ≤ ζ², 0 ≤ t < T.
Theorem IV.1. Let assumptions 1-5 hold. With a constant and small enough learning rate η ≤ 1/(30LA) (potentially depending on T), single- and multi-stream DIGEST converge, where x̄_T is the averaged iterate, Õ hides constants and poly-logarithmic factors, and T represents the wall-clock time. [The exact rate expressions are given in Appendix B.] The convergence rate of single-stream DIGEST follows by setting H' = H, and the convergence rate of multi-stream DIGEST is obtained by setting H' = max_v Σ_{m∈P_v^r} H_m.
Remark IV.1.1. For the strongly-convex case with iid data distribution over nodes, i.e., ζ = 0, the convergence rate to the optimum value f* is Õ(ρ/T), given a suitable bound on H'. This bound determines how much communication is needed to achieve a linear speed-up. The remark also shows the impact of non-iid data distribution, which requires a smaller H', hence more communication, to converge and achieve linear speedup.
Corollary IV.1.2. Corollary IV.1.1 shows that linear speed-up is achieved when T = Ω(H'/ρ) and T = Ω(H'²/ρ) for iid and non-iid data, respectively. When the network is larger, single-stream DIGEST needs a longer H' (which is equal to H) to visit all the nodes, which requires a larger T (convergence time). But in multi-stream DIGEST, H', defined as H' = max_v Σ_{m∈P_v^r} H_m, can be as low as R_r^G, the radius of the root node r (i.e., the maximum delay from root node r to any node). As R_r^G does not necessarily increase with the size of the network, multi-stream DIGEST is feasible even for large networks.
Corollary IV.1.3. Let us assume that the network can be covered in H iterations using the single/multi-stream approach. DIGEST can efficiently perform synchronization while nodes are doing local-SGD, i.e., the network topology, spectral gap, and the maximum and minimum degrees of the network do not affect the convergence rate. This is one advantage of DIGEST over previous works on asynchronous decentralized learning such as [43], whose convergence rate in the non-convex setting contains the minimum degree (d_min), maximum degree (d_max), and spectral gap (λ) of the network graph, so the topology affects convergence.

Sketch of Proof of Theorem IV.1. (The details of the proof are provided in Appendix B.) We define a virtual sequence {x̄_t}_{t≥0} following a similar idea in [27]. Lemmas IV.2 and IV.3 indicate how the convergence criteria are related to E‖x̄_t − x_t^v‖², the deviation between the virtual and actual sequences, and we find an upper bound for this term in Lemma IV.4.

V. Evaluation of DIGEST
We evaluate DIGEST against the following baselines: (i) Uniform Random-Walk (URW) [24]: a random walk-based learning algorithm described in Fig. 1c; (ii) Gradient Tracking (GT) with local-SGD [46]: an algorithm developed to overcome data heterogeneity across nodes in decentralized optimization; (iii) Async-Gossip [18] with local-SGD; (iv) Sync-Gossip [18] with local-SGD. Our code is provided in [47].
We consider Erdős-Rényi network topologies with V = 10 and V = 100 nodes and connection probability 0.3. Each neighboring pair's communication delay is assumed to follow an exponential distribution, whose average is randomly chosen between 0 and 50 times the duration of one local SGD computation for the specific model.
We use two data distributions: (i) iid-balanced and (ii) non-iid-unbalanced. In the iid-balanced case, data is shuffled and divided equally across nodes. The non-iid-unbalanced case has two features: (i) non-iid, realized by sorting the data by label and distributing it in sorted order, so the data over nodes is non-iid; (ii) unbalanced, meaning each node may hold a different amount of data. We use a geometric series to realize unbalanced data across nodes: if node u has D_u = δ samples, the next nodes get δρ, δρ², etc. samples, where ρ is determined by taking into account the size of the total dataset D.
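The non-iid-unbalanced partition can be sketched as follows (our own helper, with the simplifying assumption that the geometric shares are normalized to cover the whole dataset and rounding leftovers go to the last node):

```python
import numpy as np

def noniid_unbalanced_split(labels, num_nodes, rho=0.7):
    """Sort sample indices by label (non-iid), then hand out
    geometrically shrinking shares delta, delta*rho, delta*rho^2, ...
    to the nodes (unbalanced)."""
    order = np.argsort(labels, kind="stable")          # label-sorted indices
    weights = np.array([rho ** v for v in range(num_nodes)])
    sizes = np.floor(weights / weights.sum() * len(labels)).astype(int)
    sizes[-1] += len(labels) - sizes.sum()             # absorb rounding remainder
    splits, start = [], 0
    for s in sizes:
        splits.append(order[start:start + s])
        start += s
    return splits
```

Because samples are handed out in label-sorted order, every label held by an earlier node is less than or equal to every label held by a later node.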

A. Logistic Regression
We examine the convergence performance of regularized logistic regression, i.e., f(x) = (1/D) Σ_{i=1}^{D} log(1 + exp(−b_i a_i^T x)) + (λ/2)‖x‖², where a_i ∈ R^d and b_i are the feature vector and label of data sample i. The regularization parameter is λ = 1/D. We run the optimization using a tuned constant learning rate for each algorithm. To grid-search the learning rate, we repeat the experiment while multiplying and dividing the learning rate by powers of two, trying both larger and smaller learning rates until we find the best result. We use the datasets w8a [32] and MNIST [33]. The numerical experiments were run on Ubuntu 20.04 using 36 Intel Core i9-10980XE processors. We repeat each experiment 50 times and present error bars associated with the randomness of the optimization; every figure includes the average and the standard deviation error bar.
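The regularized logistic loss above can be evaluated numerically as follows (a sketch assuming the standard ℓ2-regularized formulation; `np.logaddexp` keeps the log(1 + exp(·)) term numerically stable):

```python
import numpy as np

def logistic_loss(x, A, b, lam):
    """Regularized logistic regression objective:
    f(x) = (1/D) sum_i log(1 + exp(-b_i a_i^T x)) + (lam/2) ||x||^2,
    with rows of A as feature vectors a_i and b as +/-1 labels."""
    z = -b * (A @ x)                                   # -b_i a_i^T x per sample
    return np.mean(np.logaddexp(0.0, z)) + 0.5 * lam * (x @ x)
```

At x = 0 every sample contributes log(2), so the unregularized loss is exactly log(2) regardless of the data.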
Figs. 4 and 5 show the convergence behavior of our algorithms and the baselines for the MNIST and w8a datasets in the 10-node and 100-node topologies. URW generally underperforms the other methods because it conducts only one local-SGD operation per iteration on a single node; as a consequence, it does not achieve a linear speed-up with an increasing number of nodes. In certain situations involving non-iid data distribution, URW may outperform some other methods, as shown in Figs. 4b, 4c, 5b, 5c. This is because URW is not affected by non-iidness, as it uniformly incorporates data from all nodes. DIGEST, Sync-Gossip, and Async-Gossip have similar performance for iid data distribution in Figs. 4a, 5a. On the other hand, we observe that Gossip-based algorithms suffer from slow convergence in the non-iid setting, as shown in Figs. 4b, 4c, 5b, 5c. We also observe that GT enhances the performance of gossip-based algorithms by incorporating a mechanism to overcome non-iidness. However, this algorithm demands twice the communication overhead of Sync-Gossip, which can degrade its convergence performance in terms of wall-clock time. In comparison, DIGEST has better convergence behavior thanks to its design of spreading information uniformly in the network to handle non-iidness. It is evident that when the network is larger, single-stream DIGEST is unable to cover the entire network as quickly as required, highlighting the need for multi-stream DIGEST to overcome this limitation. This observation is supported by Figs. 4c, 5c, where all streams have the same H_m = H, m ∈ M, in multi-stream DIGEST.

B. Deep Neural Network (DNN)
In this section, we use ResNet-20 [31] as the DNN model and CIFAR-10 [34] as the dataset. We set the batch size to 36 per node, and the learning rate is decayed by a constant factor after completing 50% and 75% of the training time. The initial learning rate is tuned separately for each algorithm. We set the momentum to 0.9 and the weight decay to 10⁻⁴. We observe that in the iid setting (Fig. 6a), all algorithms except URW perform similarly. However, in the non-iid setting, where communication and model distribution across the network become crucial, DIGEST outperforms the Gossip-based algorithms (Fig. 6b).

C. Speed-up
In this section, we evaluate the speed-up performance of our DIGEST algorithms as well as the baseline, centralized parallel SGD. We consider a synthetic cost function, given in (3), whose parameters control the non-iidness and the local variances. Note that we need to increase the number of nodes to generate speed-up curves, so we need to create a non-iid data distribution over the nodes. Creating a uniform non-iid distribution using real datasets when the number of nodes increases is very difficult; thus, we use the pre-defined cost function in (3) to verify our theoretical results, following a similar approach in [48]. In particular, we employ local-SGD at node v with gradients perturbed by normal noise, i.e., the gradient of f_v plus a zero-mean Gaussian term. To create the speedup curve, we divide the expected error of single-node SGD by the expected error of each method at the last iteration T for different numbers of nodes. With linear speed-up, the error decreases linearly with the increasing number of workers, so we expect to see a straight line on the graph. The speedup curve is illustrated in Fig. 7. Centralized parallel SGD averages all nodes' updates every H steps and updates the model at all nodes. It is worth noting that centralized parallel SGD with H = 1 is the best speed-up achievable in this scenario.
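The linear speed-up effect can be reproduced on a toy problem (our own simplified model, not the paper's cost function (3)): V workers average noisy gradients of f(x) = x²/2 at every step (i.e., parallel SGD with H = 1), so the stationary error shrinks roughly like 1/V.

```python
import numpy as np

def final_error(V, T=2000, lr=0.05, sigma=5.0, seed=0):
    """Toy linear speed-up check: each step averages V noisy gradients
    of f(x) = x^2 / 2 (true gradient x plus N(0, sigma^2) noise), then
    takes one SGD step; returns the squared error at iteration T."""
    rng = np.random.default_rng(seed)
    x = 1.0
    for _ in range(T):
        g = x + rng.normal(0.0, sigma, size=V).mean()  # averaged noisy grads
        x -= lr * g
    return x * x
```

Averaging the final squared error over many seeds and dividing the single-worker error by each V's error yields an approximately straight speed-up line, mirroring the construction of Fig. 7.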
We set the learning rate to 0.001, |ζ_v| = 5 for all v ∈ V, σ = 5, and T = 10⁴. Note that in the iid setting, with its less restrictive constraint on H, a larger H can still lead to linear speed-up compared to the non-iid setting. Moreover, single-stream DIGEST exhibits linear speed-up only up to a certain limit; as the number of nodes increases and single-stream DIGEST cannot traverse the entire network fast enough, linear speed-up is not maintained. On the other hand, multi-stream DIGEST achieves linear speed-up and performs very close to the best possible scenario, centralized parallel SGD with H = 1.

VI. Conclusion
We designed a fast and communication-efficient decentralized learning mechanism, DIGEST, by particularly focusing on stochastic gradient descent (SGD). We designed single- and multi-stream DIGEST to exploit the trade-off between convergence rate and communication overhead. We analyzed the convergence of single- and multi-stream DIGEST, and proved that both algorithms approach the optimal solution asymptotically for both iid and non-iid data. The simulation results confirm that the communication cost of DIGEST is low compared to the baselines, and its convergence rate is better than or comparable to the baselines.
[Notation table fragment: the whole dataset in the network and its size; the loss of model x associated with data sample i; the global loss function f(x); the local loss function f_v(x); the binary variable s_t^v[m] indicating whether node v receives the global model at t from stream m in multi-stream DIGEST.]

Motivated by [27], a virtual sequence {x̄_t}_{t≥0} is defined as follows.
We do not need to calculate this sequence explicitly in the algorithm; it is used only for the sake of the analysis. We also define the aggregate stochastic gradient g_t and its expectation ḡ_t, where f(x) and f_v(x) are the global loss function and the local loss function at node v, respectively. Let us introduce i_t = {i_t^1, ..., i_t^V} to denote the data samples selected randomly during time slot t at all nodes. Then, observe that ḡ_t = E_{i_t}[g_t]. We have x̄_{t+1} = x̄_t − η_t g_t.
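Collecting these quantities, a restatement for readability; writing g_t as the average of the per-node stochastic gradients follows the standard local-SGD analysis of [27] and is our assumption here:

```latex
g_t = \frac{1}{V}\sum_{v=1}^{V}\nabla f_v\big(x_t^v, i_t^v\big), \qquad
\bar{g}_t = \mathbb{E}_{i_t}\!\left[g_t\right], \qquad
\bar{x}_{t+1} = \bar{x}_t - \eta_t\, g_t .
```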
First, we show in Lemmas A.1 and A.2 how the virtual sequence {x̄_t}_{t≥0} approaches the optimum. Second, we show in Lemma A.3 that the actual iterates x_t^v deviate only slightly from the virtual sequence. Finally, the convergence rate is proved.
In (20) we have used the fact stated above; (21) and (22) are due to the convexity of ∥·∥² and L-smoothness, respectively. L-smoothness and µ-strong convexity provide the bounds used below. Using L-smoothness to bound the last term in (17), and then applying (22), (25), and (26) to (17), we obtain a bound that can be rewritten using the concavity of β∥·∥² for β ≤ 0. Since we assumed η_t ≤ 1/(4L), we have (2η_t L − 1) ≤ −1/2 and (2η_t L + 1) ≤ 2, which yields the next bound. By definition, (33) is based on the fact that the variance of a sum of independent random variables equals the sum of their variances.
Taking the expectation of (31) and applying it together with (35) to (14) provides the result. Proof: Based on the definition of x̄_t and the L-smoothness of f_v(x), we take the expectation of the second term on the right-hand side of (39): (42) is based on the fact that, for any λ > 0, 2⟨a, b⟩ ≤ λ∥a∥² + (1/λ)∥b∥²; (44) is based on the convexity of ∥·∥²; and (45) is due to L-smoothness. Taking the expectation of the last term on the right-hand side of (39), (48) follows from (22) and (35). Putting everything together, we obtain the stated bound, where (63) is due to the definition of ω_t. The second term on the right-hand side of (60) is bounded similarly, and for the third term on the right-hand side of (60) we obtain an analogous bound. Using the same approach for the last term, where instead of (23) we use the fact that independent zero-mean random variables yield a tighter bound, we obtain the final estimate.
Adding up the last four inequalities, applying them in (60), and plugging the result into (56), we obtain the stated bound. We now bound the second term in (53).
where (82) is due to the convexity of ∥·∥². We can write the bound over the set of nodes that are visited after node v in the current synchronization round; (86) and (83) can be expanded exactly as in (60), and according to (80) we obtain the intermediate bound. Using (79) and (87) in (53), rearranging (88), and assuming η_t = η, we complete the proof for single-stream DIGEST. A similar argument can be made for multi-stream DIGEST with just one subtle change. Note that 2(H + E) indicates how long it takes for an SGD update performed in one node to become available at all other nodes. In multi-stream DIGEST, this is determined by the depth of the tree, i.e., the longest delay path from the root node to its farthest leaf (in terms of the delay distance), which is 2(E + max_{m∈P_r^u} H_m). So, replacing (H + E) with (E + max_{m∈P_r^u} H_m) gives the result for multi-stream DIGEST. Combining Lemma A.3 with Lemma A.1 for the convex case and Lemma A.2 for the non-convex case, we obtain a recursive description of suboptimality. We closely follow the technique described in [40] for estimating the convergence rates.
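The multi-stream delay bound above can be written as a one-line helper. This is a sketch of the quantity 2(E + max_{m∈P_r^u} H_m); the function name and arguments are ours, and the numeric values below are purely illustrative.

```python
def multi_stream_delay_bound(E, path_stream_H):
    """Bound on how long an SGD update performed at one node takes to
    become available at all other nodes in multi-stream DIGEST.

    E: delay term as in the text; path_stream_H: the synchronization
    intervals H_m of the streams on the path from node u to the root r
    (the set P_r^u in the text). The single-stream analogue replaces
    max(path_stream_H) with H.
    """
    return 2 * (E + max(path_stream_H))

# Illustrative: E = 3, three streams on the path with H_m = 2, 5, 4.
bound = multi_stream_delay_bound(3, [2, 5, 4])
```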
Convex case. Based on Lemma A.1, assuming η_t = η, multiplying both sides by ω_t/η, and summing up, then substituting the result of Lemma A.3, using (24), dividing both sides by W_T = Σ_{t=0}^{T−1} ω_t, and rearranging, we obtain (96) by the convexity of f. We now state two lemmas that help us bound the right-hand side of (96).
(104) Non-convex case. Based on Lemma A.2, assuming η_t = η, multiplying both sides by ω_t/η, and summing up, then combining the resulting inequality with Lemma A.5 and noting that η_t = η ≤ 1/(30L(H+E)), provides the stated rate.

FIGURE 2: Spread of information across a decentralized network.

There are multiple global models in different streams, i.e., x̄_t[m] corresponds to the global model in stream m out of M streams. There are |M_v| models stored in each node; x_{−1}^v[m] represents the global model corresponding to the last synchronization of stream m at node v. We define visited[m], pre_node[m], and s_t^v[m] accordingly.

FIGURE 4: Convergence results for MNIST dataset in terms of global loss over wall-clock time.

FIGURE 5: Convergence results for w8a dataset in terms of global loss over wall-clock time.

FIGURE 6: Convergence results for CIFAR-10 dataset in terms of global loss over wall-clock time.

visited: Set of nodes that are visited for the global model update in the most recent synchronization round for single-stream DIGEST
visited[m]: Set of nodes that are visited for the semi-global model update in the most recent synchronization round in stream m for multi-stream DIGEST
x̄_t^v: The global model received by node v at t in single-stream DIGEST
x̄_t^v[m]: The global model received by node v at t from stream m in multi-stream DIGEST
pre_node: The node that node v receives the global model from in single-stream DIGEST
pre_node[m]: The node that node v receives the semi-global model from in stream m for multi-stream DIGEST
p_v: The node that node v receives the global model from for the first time in the current synchronization round in stream m
d_G(u,v): The distance between u and v, i.e., the total delay of the links on the shortest path between u and v
R_G(v): The greatest distance from v, i.e., R_G(v) = max_u d_G(u,v)
r: Root, i.e., the node with the minimum R_G(v), r = argmin_v R_G(v)
ST_r: The shortest-path tree rooted at r
P_r^v: Set of streams working over the path from v to r in ST_r
M: Number of streams in multi-stream DIGEST
M_v: The set of streams passing through node v
V_m: The set of nodes in the domain of stream m

B. Proof of Theorem IV.1
r_{t+1} + bη + cη² ≤ r_0/((T + 1)η) + bη + cη².
Convex case (continued). Combining (96) with µ > 0 and Lemma A.4, and noting that η_t = η ≤ 1/(30L(H+E)), provides the stated rate. We further improve the convergence time of single-stream DIGEST by enabling multiple streams of global model updates. For example, there are 4 streams working together in Fig.
• Design of DIGEST. We design a fast and communication-efficient asynchronous decentralized learning mechanism, DIGEST, by particularly focusing on stochastic gradient descent (SGD). DIGEST works as follows. Each node keeps updating its local model all the time, as in local-SGD. Meanwhile, there is an ongoing stream of global model updates among nodes, Fig. 1d. For example, node 1 starts transmitting the global model to node 2 at time t. When node 2 receives the global model from node 1, it aggregates it with its local model. The aggregated global model is transmitted to node 3 next. We note that the exchanged models are global models, as each node adds its own local updates to the received model. A node that has the global model selects the next node randomly among its neighbors for global model transmission. After all the nodes have updated their models with a global model, DIGEST pauses the global model exchange, while local SGD computations still continue. The global model exchange is repeated every H iterations. We name this algorithm single-stream DIGEST.
• Multi-Stream DIGEST.
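The single-stream idea above can be sketched as a toy simulation. This is our own minimal sketch under simplifying assumptions, not the paper's Alg. 1: scalar models, quadratic local losses f_v(x) = 0.5(x − c_v)², additive aggregation of local progress, no pause of H iterations between rounds, and a preference for unvisited neighbors so each round terminates; all names are ours.

```python
import random

def single_stream_digest(targets, neighbors, rounds=60, local_steps=2,
                         lr=0.05, seed=0):
    """Toy single-stream DIGEST on scalar models.

    Each node v minimizes f_v(x) = 0.5 * (x - targets[v])**2 locally.
    A single global model travels the network; when a node receives it,
    the node folds in its local progress since its last synchronization,
    adopts the result, and forwards the model to a neighbor.
    """
    rng = random.Random(seed)
    V = len(targets)
    local = [0.0] * V      # local models x_t^v
    snapshot = [0.0] * V   # copy of local model at last global update
    global_model = 0.0     # the traveling global model
    holder = 0             # node currently holding the global model

    for _ in range(rounds):
        visited = {holder}
        while len(visited) < V:
            # every node keeps doing local SGD all the time
            for v in range(V):
                for _ in range(local_steps):
                    local[v] -= lr * (local[v] - targets[v])
            # holder folds its local progress into the global model
            global_model += local[holder] - snapshot[holder]
            local[holder] = global_model
            snapshot[holder] = global_model
            # forward to a random neighbor, preferring unvisited ones
            unvisited = [u for u in neighbors[holder] if u not in visited]
            holder = rng.choice(unvisited or neighbors[holder])
            visited.add(holder)
    return global_model

# Ring of 4 nodes with non-iid targets; the minimizer of the average
# loss is mean(targets) = 3.0, which the global model approaches.
nbrs = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
result = single_stream_digest([0.0, 2.0, 4.0, 6.0], nbrs)
```

Because nodes keep drifting toward their own targets between synchronizations, the global model settles in a small neighborhood of the mean rather than exactly on it, which mirrors the consensus-error terms in the analysis.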
the gradient. However, there may be global model updates at node v, i.e., node v could receive a global model update from one of its neighbors at iteration t. Such a global model reception should be reflected in the local model updates, which we discuss next.
Global Model Update and Exchange. Let x̄_t be the global model that is being transferred from one node to another at time slot t. If node v receives the global model x̄_t from one of its neighbors, the global model update indicator s_t^v is set to s_t^v = 1. Otherwise, i.e., when node v does not receive the global model from its neighbors, we set s_t^v = 0. If s_t^v = 0, then node v updates its model locally according to the update mechanism presented earlier in the "Local Model Update" section. If s_t^v = 1, i.e., when a global model is received by node v from one of its neighbors, then the global model should be incorporated in the calculations. DIGEST sets the local model to the global model when there is a global model update. Here, x̄_{t−1} is the global model received by node v at slot t − 1; the global model x̄_t is updated using x̄_{t−1} as well as the local updates of node v. We use τ_t^v to denote the last time slot up to t when node v's model was updated with the global model. The local and global model update of DIGEST is presented in Alg. 1. Every node v keeps its local model x_t^v as well as x_{−1}^v, which is a copy of the local model at the latest global model update at node v; x̄_t is the global model. All of these models are initialized with the same initial model x_0. We note that only one of the nodes, say node v_0, has the global model x̄_t at the start of the algorithm.
2) Algorithm Design. DIGEST is comprised of two algorithms: (i) local and global model update at node v, and (ii) sending a global model from a node to its neighbor. Local and Global Model Update.
7), the global model update indicator is set to 1 (line 8), and the global model is updated (lines 11−13). The global model is updated using the most recent local model of node v (line 11). The local model is updated with the global model (line 12). The current local model is stored at node v and will be used in the next global update (line 13).
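The three update lines described above can be sketched in isolation as follows. The additive aggregation of local progress is our assumption in the style of local-SGD methods; the paper's Alg. 1 defines the exact rule, and the function name is ours.

```python
def on_global_model_received(global_model, local_model, stored_local):
    """Sketch of lines 11-13 as described in the text (models are floats).

    stored_local is the copy of the local model kept since the node's
    previous global update (x_{-1}^v in the text).
    """
    # line 11: update the global model using the most recent local model
    # (here: fold in the local progress made since the last global update)
    global_model = global_model + (local_model - stored_local)
    # line 12: update the local model with the global model
    local_model = global_model
    # line 13: store the current local model for the next global update
    stored_local = local_model
    return global_model, local_model, stored_local

# A node whose local model drifted from 1.0 to 1.5 since its last sync
# contributes that progress to a global model currently at 1.0.
g, x, x_prev = on_global_model_received(1.0, 1.5, 1.0)
```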
Bounded synchronization interval. For single-stream DIGEST, we assume that the interval between two subsequent global model synchronizations is bounded, i.e., gap(S_T^v) ≤ H, 1 ≤ v ≤ V, where gap(S_T^v) is the maximum gap between two subsequent 1s in S_T^v. For multi-stream DIGEST, we assume a different bound for each stream, i.e., gap(S_T^v[m]) ≤ H_m.
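The quantity gap(S_T^v) can be computed directly from a node's synchronization schedule. A minimal sketch; the function name and the convention for fewer than two synchronizations are our choices.

```python
def gap(schedule):
    """Maximum gap between two subsequent 1s in a binary sequence S_T^v.

    schedule[t] == 1 means node v was synchronized with the global
    model at slot t; the gap is the slot distance between consecutive 1s.
    """
    ones = [t for t, s in enumerate(schedule) if s == 1]
    if len(ones) < 2:
        return 0  # assumed convention: fewer than two synchronizations
    return max(b - a for a, b in zip(ones, ones[1:]))

# 1s at slots 0, 3, and 7 give gaps of 3 and 4, so gap(S_T^v) = 4;
# the bounded-interval assumption requires this value to be at most H.
g = gap([1, 0, 0, 1, 0, 0, 0, 1])
```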

TABLE 1: Constraints on H to get linear speed-up.

This restriction in other scenarios is depicted in Table 1. □ Theorem IV.1 and Corollary IV.1.1 show a nice trade-off between convergence rate and communication overhead.