Deep Learning-Based Average Consensus

In this study, we analyze the problem of accelerating the linear average consensus algorithm for complex networks. We propose a data-driven approach to tuning the weights of temporal (i.e., time-varying) networks using deep learning techniques. Given a finite-time window, the proposed approach first unfolds the linear average consensus protocol to obtain a feedforward signal-flow graph, which is regarded as a neural network. The edge weights of the obtained neural network are then trained using standard deep learning techniques to minimize the consensus error over the given finite-time window. Through this training process, we obtain a set of optimized time-varying weights, which yields faster consensus for a complex network. We also demonstrate that the proposed approach can be extended to infinite-time window problems. Numerical experiments revealed that our approach can achieve a significantly smaller consensus error compared to baseline strategies.


I. INTRODUCTION
The distributed agreement problem for networks, which is often referred to as the consensus problem [1], is an important problem in network science and engineering with applications in load balancing [2], data fusion [3], multiagent coordination [4], distributed computing [5], distributed sensor networks [6], wireless communication systems [7], and power systems [8]. Recently, this problem has also appeared in online machine learning procedures for processing big data (see [9], [10] and the references therein).
In the average consensus problem, the nodes in a network seek to converge their states to the average of their initial states in a distributed manner. The standard solution to this problem is to use the linear average consensus algorithm [11], where each node updates its state by calculating the weighted linear average of its own state and the states of its neighbors. This algorithm generates a linear dynamical system whose state transition matrix involves the Laplacian matrix of the underlying communication network.
Designing consensus algorithms with fast convergence is of significant practical importance because such algorithms allow multi-agent systems to reach agreements in fewer iterations, meaning they consume fewer communication resources. In the context of the linear average consensus algorithm, the problem of finding the optimal edge weights that maximize the asymptotic consensus speed can be reduced to a convex optimization problem [5] under the assumption that the communication network is static and undirected. The authors of [12] recently demonstrated that optimal weights can be computed in a distributed manner via iterative computations. Zelazo et al. [13] clarified the role of cycles in the linear average consensus algorithm and proposed an approach for accelerating this algorithm by adding new edges to a network. For the case of directed networks, Hao and Barooah [14] proposed a method to accelerate the convergence rate of a linear (but not necessarily average) consensus algorithm by tuning the weights of edges in the target network. Additionally, for the case of gossip algorithms, the authors of [15] demonstrated that it is possible to tune weights in a distributed manner such that the weights converge to the optimal weights, which yields solutions for averaging problems.
A natural consequence of seeking further acceleration of consensus algorithms is the emergence of finite-time consensus algorithms [16], where edge weights are typically assumed to vary with time and designers exploit this additional flexibility to realize consensus in finite time. The finite-time consensus algorithm proposed in [17] achieves a consensus using a time-invariant weight updating law. The algorithm proposed in [18] achieves consensus using stochastic (possibly asymmetric) matrices in N(N − 1)/2 iterations, where N denotes the number of nodes in the target network. The authors of [19], [20] used graph signal processing tools (e.g., [21]) to demonstrate that by allowing non-stochasticity for state-updating matrices, one can realize finite-time consensus in at most N steps. The theoretical aspects of these works were further investigated in [22]. Recently, the authors of [23] showed that the number of steps required for consensus can be further improved to N/2 in the specific case of a ring network with an even number of nodes.
Despite the advances in consensus acceleration described above, there is still a need for an effective approach to answering the following basic question. Given a finite-time window and underlying network structure, how should one dynamically tune the edge weights in the network to achieve the most accurate consensus possible within a specific time window? If the length of a time window is insufficient for executing the aforementioned finite-time consensus algorithms, then currently available options are effectively limited to using static optimal strategies (e.g., [5]), which do not allow one to tune the weights of a network dynamically. To fill this gap, we propose a data-driven approach to tuning the weights of undirected temporal (i.e., time varying) networks using deep learning techniques. We first unfold the consensus algorithm to obtain a feedforward signal-flow graph [24], which we regard as a neural network. We then use the standard stochastic gradient descent algorithm to update the parameters in each layer of this neural network (i.e., the weights of each snapshot of the temporal network) to minimize consensus error over a finite-time window, which yields an optimized temporal network for faster consensus. We numerically confirm that our approach can significantly accelerate the convergence speed of the linear average consensus algorithm.
The remainder of this paper is organized as follows. In Section II, we define the problem of dynamically tuning edge weights to accelerate the linear average consensus algorithm and present our approach to solving this problem using standard techniques from the field of deep learning. In Section III, we evaluate the performance of the proposed method through various numerical experiments. We conclude this paper in Section IV.

II. WEIGHT OPTIMIZATION USING DEEP LEARNING TECHNIQUES
In this section, we describe our approach to tuning the edge weights of a network to accelerate the linear average consensus algorithm within a given finite-time window. We first provide a brief review of the linear average consensus algorithm and state its basic properties. We then describe our data-driven approach to tuning the weights of a network, in which we apply techniques from deep learning to the signal-flow graph obtained by unfolding the consensus algorithm.

A. LINEAR AVERAGE CONSENSUS ALGORITHM
Let G be an undirected, unweighted, and connected network with a node set V = {1, . . . , N} and an edge set E, which consists of unordered pairs of nodes in V. Each node in G represents an agent that communicates with its neighbors at each time step. In this paper, we focus on discrete-time dynamics. Let x_i(k) ∈ R denote the state of the ith node at time k ≥ 0 and let N_i denote the set of neighbors of node i. In the standard linear average consensus protocol [1], each node i updates its own state according to the following difference equation:

x_i(k+1) = x_i(k) + ∑_{j ∈ N_i} w_ij(k) (x_j(k) − x_i(k)),  x_i(0) = x_{0,i},  (1)

where w_ij(k) = w_ji(k) ≥ 0 represents the weight of the (undirected) edge {i, j} at time k and x_{0,i} denotes the initial state of node i. Although we assume the symmetry of edge weights for convenience, our framework can be easily extended to asymmetric cases. At each time k ≥ 0, we define the (i, j) element of the adjacency matrix W(k) ∈ R^{N×N} of the network as

[W(k)]_ij = w_ij(k) if {i, j} ∈ E, and [W(k)]_ij = 0 otherwise,  (2)

and the degree matrix of the network at time k as

D(k) = diag(d_1(k), . . . , d_N(k)),  d_i(k) = ∑_{j ∈ N_i} w_ij(k).  (3)

Then, by using the Laplacian matrix of the network

L(k) = D(k) − W(k),  (4)

the evolution of the state vector x(k) = [x_1(k) · · · x_N(k)]^T in the linear average consensus protocol (1) can be written as

x(k+1) = (I − L(k)) x(k),  x(0) = x_0,  (5)

where x_0 = [x_{0,1} · · · x_{0,N}]^T denotes the initial state vector. The objective of this paper is to present a framework for tuning the weights {w_ij(k)}_{k≥0, {i,j}∈E} to achieve a faster average consensus within a given finite-time window. We denote the average of the initial states of the nodes as c = (1/N) ∑_{i=1}^N x_{0,i}. Define the consensus error vector

e(k) = x(k) − c1,  (6)

where 1 denotes the all-ones N-dimensional column vector.
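To make the protocol (1) concrete, the following minimal, dependency-free Python sketch runs the update on a small graph; the 4-node ring and the constant edge weight 0.25 are our own toy choices, not taken from the paper.

```python
# Toy illustration of the linear average consensus update (1).
# The 4-node ring and the constant edge weight 0.25 are illustrative
# choices; they are not taken from the paper.

EDGES = [(0, 1), (1, 2), (2, 3), (3, 0)]  # undirected ring on 4 nodes

def consensus_step(x, w):
    """One synchronous update: x_i += sum over neighbors j of w*(x_j - x_i)."""
    nxt = list(x)
    for i, j in EDGES:
        nxt[i] += w * (x[j] - x[i])
        nxt[j] += w * (x[i] - x[j])
    return nxt

x = [1.0, -0.5, 0.25, 0.75]
avg = sum(x) / len(x)          # the consensus target c
for _ in range(50):
    x = consensus_step(x, 0.25)
# every state is now close to the initial average avg
```

After 50 iterations, the states agree with the initial average to within floating-point precision; the geometric contraction rate is governed by the spectrum of I − L, as discussed in the following subsections.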
We are now ready to formally state the problem studied in this paper.
Problem 2.1 (Consensus acceleration problem): Let G be an undirected and unweighted network containing N nodes, and let T be a given positive integer. Assume that the set of initial states follows a probability distribution X_0, i.e., x_0 ∼ X_0. Find the set of nonnegative weights {w_ij(k)}_{0≤k≤T−1, {i,j}∈E} that minimizes the average consensus error defined by

ε_T = E[ ‖e(T)‖ ],

where ‖·‖ denotes the Euclidean norm in R^N and E[·] denotes the expected value.
Because Problem 2.1 is a non-convex problem, it is difficult to compute a set {w_ij(k)}_{0≤k≤T−1, {i,j}∈E} that minimizes ε_T. This difficulty motivated us to tackle the problem with a data-driven approach that finds a suboptimal solution. In the next subsection, we describe our data-driven approach to tuning edge weights using deep learning techniques. It should be noted that although we assume some knowledge of the initial probability distribution X_0 during the optimization process, the optimized edge weights can significantly accelerate the consensus protocol even if the initial states do not follow the assumed distribution. We numerically demonstrate this universal property of our approach in Subsection III-B.

B. DATA-DRIVEN WEIGHT OPTIMIZATION
To adjust edge weights using deep learning techniques, we first unfold the recursive state-update formula (1) to obtain a signal-flow graph, as shown in Fig. 1(a). Unlike a standard deep neural network, the resulting neural network has a fixed structure and uses no activation functions. The structure between two layers of this neural network corresponds to the structure of the graph G and is the same between all layers. The neurons in the kth layer correspond to the nodes at time k, as shown in Fig. 1(b).
We then apply a standard technique from the field of deep learning to adjust the weights. We use the mean squared error ε_T^2 as a loss function and do not append any regularization terms. As in [24], we use the incremental training technique to adjust the weights during the training process. For incremental training, we first consider only the first layer (i.e., we set k = 1 in Fig. 1(a)) and attempt to minimize the loss function ε_1^2 using a number of randomly generated initial states x_0 as the training data. This first step is called the first generation. After training the first set of weights w_ij(0), we proceed to train the first two sets of edge weights by appending the second layer to the neural network and replacing the loss function with ε_2^2. For this training, we use the result from the first generation as the initial values for the first layer and train the entire neural network. We repeat this process until the weights w_ij(T − 1) between the (T − 1)st and Tth layers have been trained by minimizing ε_T^2, at which point all of the weights are optimized. Then, using the set of equations (2)-(6) with the optimized weights, we obtain the optimized weighted Laplacian matrices L(0), . . . , L(T − 1) of the network for given G and T. The proposed consensus algorithm is defined as

x(k+1) = (I − L(k)) x(k),  k = 0, . . . , T − 1.  (7)

The proposed approach for solving Problem 2.1 can be considered as supervised learning over a neural network with a specified structure and identity activation functions, with input object x_0 and desired output value c1, where c = 1^T x_0 / N.
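A dependency-free sketch of this training idea follows. Note our simplifications: we substitute finite-difference coordinate descent for the paper's PyTorch/Adam setup, omit incremental training, and use a toy 4-node graph; the unfolding of T consensus steps into layers and the squared-consensus-error loss are as described above.

```python
import random

# Sketch of data-driven weight tuning: unfold T consensus steps into
# layers and minimize the empirical squared consensus error over the
# per-step edge weights. The paper trains with PyTorch/Adam and
# incremental training; here we use finite-difference coordinate
# descent on a toy graph to keep the sketch self-contained.

EDGES = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
N, T = 4, 3

def unfolded_consensus(x0, weights):
    """weights[k][e] is the weight of EDGES[e] at time k (layer k)."""
    x = list(x0)
    for k in range(T):
        nxt = list(x)
        for (i, j), w in zip(EDGES, weights[k]):
            nxt[i] += w * (x[j] - x[i])
            nxt[j] += w * (x[i] - x[j])
        x = nxt
    return x

def mse(weights, batch):
    """Empirical mean squared consensus error at time T."""
    total = 0.0
    for x0 in batch:
        c = sum(x0) / N
        total += sum((xi - c) ** 2 for xi in unfolded_consensus(x0, weights))
    return total / len(batch)

random.seed(0)
batch = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(50)]
weights = [[0.1] * len(EDGES) for _ in range(T)]  # initialized as in Sec. III
init_loss = mse(weights, batch)

eps, lr = 1e-5, 0.05
for _ in range(120):
    for k in range(T):
        for e in range(len(EDGES)):
            w0 = weights[k][e]
            weights[k][e] = w0 + eps
            up = mse(weights, batch)
            weights[k][e] = w0 - eps
            down = mse(weights, batch)
            grad = (up - down) / (2 * eps)
            weights[k][e] = max(0.0, w0 - lr * grad)  # keep weights nonnegative

trained_loss = mse(weights, batch)  # should end well below init_loss
```

Because each weight enters the unfolded linear map linearly, the loss is quadratic in every single coordinate, so small coordinate-descent steps decrease it reliably; in practice one would of course use automatic differentiation instead of finite differences.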

C. PERIODIC CONTINUATION
For very long or infinite-time windows, where it is not necessarily feasible to tune weights using the proposed approach, we may adopt a periodic continuation of the proposed algorithm. By periodically extending the state transition matrices I − L(0), . . . , I − L(T − 1) obtained previously, the average consensus protocol for times greater than T can be defined as follows:

x(k+1) = (I − L(k mod T)) x(k),  k ≥ 0.  (8)

To assess the performance of this algorithm, we introduce the following quantity:

r_asym = sup_{x_0 ≠ c1} lim sup_{k→∞} ( ‖e(k)‖ / ‖e(0)‖ )^{1/k},  (9)

which we call the asymptotic convergence factor. The next proposition provides an explicit representation of the asymptotic convergence factor (9) for the consensus algorithm (8).

VOLUME 4, 2016
Proposition 2.2: Given G and T , let L (0), . . . , L (T − 1) denote the optimized weighted Laplacian matrices of the networks derived by our deep learning algorithm. Define a matrix and let σ (M) denote the set of eigenvalues of M. Additionally, define Assume that the eigenvalue 1 of M is simple. Then, the asymptotic convergence factor of the periodic continuation with a period T given by the consensus algorithm (8) is equal to ρ 1/T . Remark 2.3: It should be noted that is the asymptotic convergence factor of the periodic continuation with a period T given by the consensus algorithm (8).
Here, r_asym^T with T = 1 does not necessarily correspond to the static optimal solution for (9).
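Proposition 2.2 suggests a simple numerical recipe: form M, project out the consensus eigenvector 1, and estimate ρ by power iteration. The following sketch uses our own toy setup (a 4-node ring with constant weight 0.25 and T = 1, for which ρ = 1/2 exactly), not weights produced by the training procedure.

```python
import math

# Numerical sketch of Proposition 2.2: estimate rho, and hence the
# asymptotic convergence factor rho**(1/T), by power iteration on
# M = (I - L(T-1)) ... (I - L(0)) after deflating the simple
# consensus eigenvalue 1 (eigenvector: the all-ones vector).
# Toy setup: 4-node ring, constant weight 0.25, T = 1, so rho = 1/2.

EDGES = [(0, 1), (1, 2), (2, 3), (3, 0)]
T, ALPHA = 1, 0.25

def apply_period(x):
    """Multiply by M, i.e., run one period of the protocol (8)."""
    for _ in range(T):
        nxt = list(x)
        for i, j in EDGES:
            nxt[i] += ALPHA * (x[j] - x[i])
            nxt[j] += ALPHA * (x[i] - x[j])
        x = nxt
    return x

def deflate(x):
    """Remove the component along the all-ones consensus eigenvector."""
    m = sum(x) / len(x)
    return [xi - m for xi in x]

v = deflate([1.0, -0.3, 0.7, 0.2])
rho = 0.0
for _ in range(60):
    w = deflate(apply_period(v))
    nw = math.sqrt(sum(c * c for c in w))
    nv = math.sqrt(sum(c * c for c in v))
    rho = nw / nv
    v = [c / nw for c in w]

conv_factor = rho ** (1.0 / T)  # estimate of rho**(1/T)
```

For this ring, the spectrum of M = I − 0.25 L is {1, 0.5, 0.5, 0}, so the iteration settles at conv_factor = 0.5, matching Proposition 2.2.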

III. PERFORMANCE EVALUATIONS
In this section, we demonstrate the effectiveness of the proposed method based on various numerical experiments. The number of training samples per generation was set to 1000, and we performed online learning (i.e., we set the batch size to one). The weights of the edges were initialized to 0.1. For evaluation, 10 000 samples were used. Based on this setup, experiments were conducted in PyTorch [25] using Adam with a learning rate of 0.01 for training. Throughout our numerical experiments, we fixed the length of the time window for optimization to T = 10 and adopted the periodic continuation introduced in Subsection II-C.
To assess the effectiveness of the proposed approach, we compared the performance of the proposed method to that of two baseline strategies: the static optimal strategy in [5] and the finite-time distributed algorithm in [19]. These three approaches were compared on both empirical networks and random synthetic networks.

A. BASELINE STRATEGIES
Here, we introduce the static optimal strategy presented in [5] and the finite-time distributed algorithm in [19].

1) Static optimal strategy
The static optimal strategy uses a set of time-invariant weights. Thus, the problem is to find the set of time-invariant weights that minimizes the asymptotic convergence factor.
Assume that the initial state x_0 is a deterministic vector. Additionally, assume that the edge weights w_ij(k) do not depend on the time k. Under these assumptions, the authors of [5] showed that the problem of finding the static edge weights minimizing the (worst-case) asymptotic convergence factor (9) can be reduced to solving a linear matrix inequality, which can be solved globally and efficiently [26]. As a baseline strategy, we adopted the following time-invariant consensus protocol:

x_i(k+1) = x_i(k) + ∑_{j ∈ N_i} w_ij^stat (x_j(k) − x_i(k)),

where w_ij^stat are the static optimal weights obtained by solving the linear matrix inequality.
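The LMI itself requires a semidefinite-programming solver. As a rough, solver-free stand-in for intuition (our own simplification, not the method of [5]), one can restrict all edges to a single common weight α and grid-search the α minimizing an estimate of the convergence factor (9). On a 4-node ring, the true optimum of this restricted problem is α = 1/3 with factor 1/3.

```python
import math

# Solver-free stand-in for the static optimal strategy: restrict all
# edges to share one time-invariant weight alpha and grid-search the
# alpha minimizing the asymptotic convergence factor (9), estimated by
# normalized iteration on a vector with the consensus component removed.
# Toy graph: a 4-node ring, whose restricted optimum is alpha = 1/3.

EDGES = [(0, 1), (1, 2), (2, 3), (3, 0)]

def step(x, alpha):
    nxt = list(x)
    for i, j in EDGES:
        nxt[i] += alpha * (x[j] - x[i])
        nxt[j] += alpha * (x[i] - x[j])
    return nxt

def deflate(x):
    m = sum(x) / len(x)
    return [xi - m for xi in x]

def conv_factor(alpha, iters=300):
    """Estimate max{|lam| : lam in spectrum(I - alpha*L), lam != 1}."""
    v = deflate([0.9, -0.4, 0.1, 0.6])
    ratio = 0.0
    for _ in range(iters):
        w = deflate(step(v, alpha))
        nw = math.sqrt(sum(c * c for c in w))
        ratio = nw / math.sqrt(sum(c * c for c in v))
        v = [c / nw for c in w]
    return ratio

alphas = [round(0.05 + 0.01 * i, 2) for i in range(91)]  # 0.05 ... 0.95
best_alpha = min(alphas, key=conv_factor)                # close to 1/3
```

The grid search lands on best_alpha = 0.33, the grid point nearest 1/3; the full LMI formulation of [5] optimizes every edge weight individually rather than a single common α.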

2) Finite-time distributed algorithm
The finite-time distributed algorithm uses a method called successive nulling of eigenvalues. Basically, it computes diagonal weights for a given set of non-diagonal weights so as to achieve a finite-time consensus. Let W = [w_ij]_{i,j} be a symmetric N × N matrix satisfying w_ij = 0 for all {i, j} ∉ E. Let λ_1, . . . , λ_K be the distinct eigenvalues of W. Then, the finite-time distributed algorithm proposed in [19] can be written as

x(k+1) = (a(k) I − W) x(k) for k = 0, . . . , K − 2, and x(K) = a(K − 1) x(K − 1),  (12)

where a(k) = λ_{k+1} for k = 0, . . . , K − 2 and

a(K − 1) = 1 / ( (λ_1 − λ_K)(λ_2 − λ_K) · · · (λ_{K−1} − λ_K) ).  (13)

The authors of [19] showed that, under the assumption that the eigenvector corresponding to the eigenvalue λ_K is 1, the finite-time distributed algorithm (12) achieves an average consensus at time K (i.e., ε_K = 0). To satisfy this assumption on the matrix W, we let W be the Laplacian matrix of the graph G in this numerical experiment.
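As a sanity check of (12)-(13) as reconstructed above, the following sketch runs the algorithm with W chosen as the Laplacian of a 4-node ring (our own toy example), whose distinct eigenvalues are 4, 2, 0 with λ_K = 0 and eigenvector 1; the exact average is reached in K = 3 steps.

```python
# Finite-time consensus by successive nulling of eigenvalues, following
# the update (12)-(13) with W the Laplacian of a 4-node ring. Distinct
# eigenvalues: 4, 2, 0, with lambda_K = 0 and eigenvector 1, so
# consensus is exact after K = 3 steps.

W = [[ 2, -1,  0, -1],
     [-1,  2, -1,  0],
     [ 0, -1,  2, -1],
     [-1,  0, -1,  2]]
lams = [4.0, 2.0, 0.0]  # distinct eigenvalues, lambda_K last
K = len(lams)

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

x = [0.9, -0.4, 0.1, 0.6]
target = sum(x) / len(x)                     # the exact average
for k in range(K - 1):                       # k = 0, ..., K-2
    Wx = matvec(W, x)
    x = [lams[k] * xi - wxi for xi, wxi in zip(x, Wx)]  # (a(k) I - W) x
a_last = 1.0
for lam in lams[:-1]:
    a_last *= lam - lams[-1]                 # prod_j (lambda_j - lambda_K)
x = [xi / a_last for xi in x]                # final scaling a(K-1) = 1/prod
# x now equals target * [1, 1, 1, 1] up to rounding
```

Each factor (λ_{k+1} I − W) annihilates one eigenspace of W, so after K − 1 steps only the component along 1 survives, and the final scaling by a(K − 1) restores the exact average.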

B. EMPIRICAL NETWORKS
In this subsection, we consider the following six empirical and synthetic deterministic networks: Krackhardt kite graph (Krackhardt kite), Chvátal graph (Chvátal), Pappus graph (Pappus), bipartite network of Southern women and clubs (Davis), social network of a Karate club [27] (Karate), and Tutte graph (Tutte). We summarize the numbers of nodes, edges, and distinct eigenvalues of the Laplacian matrices of these networks in Table 1.

1) Effects of initial distributions
First, we trained the temporal weights of the Karate network with T = 10 based on four types of initial states sampled from 1) a log-normal distribution with parameters µ = 0 and σ = 1.5, 2) an exponential distribution with mean 1, 3) a binomial distribution with parameters n = 50 and p = 0.5, and 4) a uniform distribution. We then evaluated the performance ε_T for each of the trained temporal networks by using the aforementioned distributions to generate the initial state x_0. We present the results of this experiment in Fig. 2. One can see that the networks trained using different distributions perform differently. Although the performance is robust to the choice of the initial distribution overall, the network trained using the uniform distribution yields the best results. Therefore, from now on, we will assume that the initial state of each node independently follows a uniform distribution on the interval [−1, 1].

2) Dynamically changing edge weights
Before comparing the proposed approach to the baseline approaches, we show in Fig. 3 the optimized edge weights for the Karate network at times k = 0, . . . , 9, obtained by minimizing ε_T with T = 10. One can see that the optimized edge weights of the network change dynamically in a nontrivial manner.

3) Consensus errors for a finite-time window of length K
We empirically evaluated the average consensus error ε_K, where K denotes the number of distinct eigenvalues of the Laplacian matrix of each network. We also applied the static optimal strategy and the finite-time consensus algorithm and empirically evaluated their average consensus errors ε_K. We list these consensus errors in Table 2. The proposed method achieves small consensus errors and outperforms the static optimal strategy for all of the networks. The finite-time consensus algorithm achieves accurate consensuses for the small networks (Krackhardt kite, Chvátal, and Pappus). However, this algorithm fails to achieve a consensus for the larger networks. This can be attributed to numerical instability when computing the coefficient (13), whose magnitude tends to become significantly smaller as the network size (and the number K of distinct eigenvalues of the Laplacian matrix) increases. For computing the coefficient (13), we adopted the following straightforward procedure. We first numerically computed the eigenvalues λ_1, . . . , λ_K. We then computed the product of the differences λ_{K−1} − λ_K, . . . , λ_1 − λ_K. Finally, we calculated the inverse of this product to obtain a(K − 1).
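Written out on toy eigenvalues (the distinct Laplacian eigenvalues 4, 2, 0 of a 4-node ring, our own example), the straightforward procedure above is just a product and an inversion:

```python
# Computing the coefficient a(K-1) of (13) by the straightforward
# procedure described in the text: multiply the differences
# lambda_j - lambda_K and invert the product. Toy distinct eigenvalues
# of a 4-node ring Laplacian: 4, 2, 0 (lambda_K = 0).

lams = [4.0, 2.0, 0.0]
K = len(lams)
prod = 1.0
for j in range(K - 1):
    prod *= lams[j] - lams[K - 1]  # differences lambda_j - lambda_K
a_last = 1.0 / prod                # a(K-1) = 1/8 here; for larger networks
                                   # the product can grow enormously, making
                                   # a(K-1) vanishingly small
```

When K is large, this product of eigenvalue gaps can overwhelm floating-point range and precision, which is the numerical instability referred to in the text.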
In Fig. 4, we present the time evolutions of the consensus errors for the six empirical networks under the proposed method, the static optimal strategy, and the finite-time consensus algorithm. We terminate all three algorithms at time k = K, i.e., at the time when the finite-time distributed algorithm is theoretically supposed to achieve an exact average consensus. One can see large intermediate consensus errors during the execution of the finite-time consensus algorithm, regardless of the underlying network. In contrast, the proposed method achieves an accurate consensus for any network size without overshooting. Table 3 lists the convergence factors for the empirical networks obtained by the proposed method with periodic continuation and by the static optimal strategy. We used Proposition 2.2 to compute the convergence factors. The proposed method improves upon the static optimal strategy for all of the networks, which is consistent with our observations of consensus errors.

C. RANDOM SYNTHETIC NETWORKS
We consider the following three random synthetic network models. The first is the Erdős-Rényi (ER) network, where we set the probability for edge creation to 0.1. The second is the Barabási-Albert (BA) model [28], where we set the number of edges to attach from a new node to existing nodes to 3. The third is the Watts-Strogatz (WS) model [29], where each node is joined to its 4 nearest neighbors in a ring topology and the probability of rewiring each edge is 0.15. As in the case of the deterministic networks, we assume that the initial states of the nodes independently follow a uniform distribution on the interval [−1, 1].

1) Small network consensus errors
We first performed numerical experiments on small-scale networks. We generated 10 networks for each of the three network models with network sizes of N ∈ {10, 15, . . . , 30}. For each of the generated networks, we optimized the edge weights at the times k = 0, . . . , 9 by minimizing ε_T with T = 10 and periodically continued the obtained sequence of weighted networks. We then evaluated the empirical average of the consensus error ε_K. We present the results in Fig. 5. One can observe the same trend as that in Table 2. Specifically, although the finite-time consensus algorithm performs quite well for N = 10, its consensus error grows exponentially as the network size increases, regardless of the network model. In contrast, the accuracy of the consensus achieved by the proposed method is robust to changes in network size and is greater than that achieved by the static optimal strategy.
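For reference, the ER sampling described above amounts to a few lines; the size and seed below are our own illustrative choices (in practice one would typically use a library such as NetworkX).

```python
import random

# Minimal Erdos-Renyi G(n, p) sampler matching the described setup:
# each of the n(n-1)/2 possible undirected edges is created
# independently with probability p (p = 0.1 for the small networks).

def erdos_renyi(n, p, rng):
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < p]

edges = erdos_renyi(20, 0.1, random.Random(1))
```

Note that sampled ER networks are not guaranteed to be connected, so in experiments one would redraw (or restrict to) connected instances before running a consensus protocol.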

2) Large network consensus errors
We also conducted numerical experiments on larger networks. We used network sizes of N ∈ {100, 250} and performed experiments similar to those performed for the small networks. The finite-time consensus algorithm was excluded because it generates extremely large consensus errors due to the numerical instability discussed above. We also reduced the probability for edge creation in the ER network to 0.025; we made this change because the linear matrix inequality arising from the static optimal strategy was otherwise infeasible. We show the empirical averages of the consensus errors achieved by the proposed method and the static optimal strategy in Fig. 6. As in Fig. 5 for the small networks, one can see that the proposed method yields consensuses with smaller errors than the static optimal strategy, regardless of network size or model. We can also confirm that the proposed method achieves smaller asymptotic convergence factors. In Fig. 7, we present the asymptotic convergence factors for WS networks of various sizes.

IV. CONCLUSION
In this paper, we presented a data-driven approach to accelerating the linear average consensus algorithm for undirected temporal networks. The proposed approach first unfolds the consensus algorithm to obtain an equivalent feedforward signal-flow graph, which is regarded as a neural network. Standard deep learning techniques are then applied to train the obtained neural network, which yields a temporal network with optimized edge weights. Numerical experiments confirmed that the proposed method can significantly accelerate the average consensus algorithm for both finite- and infinite-time windows.

APPENDIX. PROOF OF PROPOSITION 2.2
In this appendix, we present the proof of Proposition 2.2. We begin by presenting a few lemmas. For a real sequence a = {a(k)}_{k=0}^∞, we define

η(a) = lim sup_{k→∞} |a(k)|^{1/k}.  (14)

It should be noted that the following relationship holds:

log η(a) = lim sup_{k→∞} (1/k) log |a(k)|.  (15)

Lemma 2.1: Let a = {a(k)}_{k=0}^∞ and b = {b(k)}_{k=0}^∞ be real sequences. Assume that there exist integers L_1, L_2 and positive constants C_1, C_2 such that

C_1 |b(k + L_1)| ≤ |a(k)| ≤ C_2 |b(k + L_2)|  (16)

for all k ≥ max(0, −L_1, −L_2). Then, η(a) = η(b).
Proof: By taking logarithms in the inequality (16) and dividing by k, we obtain

(1/k) log C_1 + (1/k) log |b(k + L_1)| ≤ (1/k) log |a(k)| ≤ (1/k) log C_2 + (1/k) log |b(k + L_2)|

for all k ≥ max(0, −L_1, −L_2). As desired, taking the limit superiors with respect to k in this inequality and using (15) shows log η(a) = log η(b).
Lemma 2.2: Let a = {a(k)}_{k=0}^∞ be a real sequence and let T be a positive integer. For a nonnegative integer k, we define

b(k) = a(⌊k/T⌋ T).  (17)

Then, η(b) = η(ā)^{1/T}, where ā(k) = a(kT) for all k ≥ 0.
Proof: Because b(ℓ) = a(kT) for any integer ℓ satisfying kT ≤ ℓ < (k + 1)T, we have

|b(ℓ)|^{1/ℓ} = ( |a(kT)|^{1/k} )^{k/ℓ}, where k/ℓ → 1/T as ℓ → ∞.  (18)

As desired, equations (17) and (18) yield η(b) = η(ā)^{1/T}.

Lemma 2.3:
Let M be an n × n real matrix with generalized eigenvectors v_1, . . . , v_n ∈ R^n corresponding to the eigenvalues λ_1, . . . , λ_n, counted according to algebraic multiplicity. Let x_0 ∈ R^n be arbitrary and assume that there exist a set of integers I ⊂ {1, . . . , n} and nonzero numbers c_i (i ∈ I) such that x_0 = ∑_{i∈I} c_i v_i. Then, the sequence a = { ‖M^k x_0‖ }_{k=0}^∞ satisfies η(a) = max_{i∈I} |λ_i|.
Proof: Let p denote the number of distinct eigenvalues of M. Without loss of generality, we can assume that the matrix M is in the Jordan canonical form with the Jordan blocks J_k = c_k I + N_k ∈ R^{n_k × n_k}, k = 1, . . . , p, with a real number c_k, a nilpotent matrix N_k, and a positive integer n_k satisfying ∑_{k=1}^p n_k = n. We can also assume the existence of a positive integer q ≤ p such that each of the generalized eigenvectors v_i (i ∈ I) corresponds to one of the first q Jordan blocks J_1, . . . , J_q. This assumption implies the following identity:

max_{i∈I} |λ_i| = max_{1≤k≤q} |c_k|,  (19)

as well as the existence of nonzero vectors ξ_k ∈ R^{n_k} (k = 1, . . . , q) such that

x_0 = [ ξ_1^T · · · ξ_q^T 0^T ]^T,

where 0 denotes the zero vector of length n − ∑_{k=1}^q n_k. Therefore, we obtain

M^k x_0 = [ (J_1^k ξ_1)^T · · · (J_q^k ξ_q)^T 0^T ]^T = [ c_1^k w_1(k)^T · · · c_q^k w_q(k)^T 0^T ]^T,

where w_1(k), . . . , w_q(k) are nonzero vectors growing at most polynomially in k. Hence, the definition of η yields η(a) = max_{1≤k≤q} |c_k|. This equation and (19) complete the proof.
We are now ready to prove Proposition 2.2. Using the notation (14), we obtain

r_asym^T = η( { ‖e(k)‖ }_{k=0}^∞ )  (20)

because ‖e(0)‖^{1/k} converges to one as k → ∞. Equation (8) shows that the error vector e satisfies

e(sT + τ) = (I − L(τ − 1)) · · · (I − L(0)) e(sT)

for all 0 ≤ τ ≤ T − 1 and s ≥ 0. Therefore, if we define

C = max_{0≤τ≤T−1} max( ‖(I − L(τ − 1)) · · · (I − L(0))‖, ‖(I − L(T − 1)) · · · (I − L(τ))‖ ),

then we can show that

C^{−1} ‖e( (⌊k/T⌋ + 1) T )‖ ≤ ‖e(k)‖ ≤ C ‖e( ⌊k/T⌋ T )‖

for all k ≥ 0. By Lemmas 2.1 and 2.2, we obtain

η( { ‖e(k)‖ }_{k=0}^∞ ) = η( { ‖ē(k)‖ }_{k=0}^∞ )^{1/T},  (21)

where ē is defined as ē(k) = e(kT) for all k ≥ 0. It should be noted that the sequence {ē(k)}_{k=0}^∞ satisfies ē(k + 1) = M ē(k) for all k ≥ 0. Additionally, based on the assumption in the proposition, the eigenvalue 1 of M corresponding to the eigenvector 1 is simple. Because ē(0) = e(0) = x_0 − c1 belongs to the space spanned by the generalized eigenvectors of M corresponding to the other eigenvalues, we have η( { ‖ē(k)‖ }_{k=0}^∞ ) ≤ ρ by Lemma 2.3. The equality is attained when x_0 − c1 equals one of the generalized eigenvectors of M corresponding to an eigenvalue having the modulus ρ. Therefore, equations (20) and (21) complete the proof.

TADASHI WADAYAMA (M'96) was born in Kyoto, Japan, on May 9, 1968. He received the B.E., M.E., and D.E. degrees from the Kyoto Institute of Technology in 1991, 1993, and 1997, respectively. In 1995, he joined the Faculty of Computer Science and System Engineering, Okayama Prefectural University, as a research associate. From April 1999 to March 2000, he was a visiting researcher at the Institute of Experimental Mathematics, Essen University, Germany. In 2004, he moved to the Nagoya Institute of Technology as an associate professor. Since 2010, he has been a full professor at the Nagoya Institute of Technology. His research interests are in coding theory, information theory, and coding and signal processing for digital communication and storage systems. He is a member of IEICE.