Distributed Adaptive Learning Under Communication Constraints

This work examines adaptive distributed learning strategies designed to operate under communication constraints. We consider a network of agents that must solve an online optimization problem from continual observation of streaming data. The agents implement a distributed cooperative strategy where each agent is allowed to perform local exchange of information with its neighbors. In order to cope with communication constraints, the exchanged information must be unavoidably compressed. We propose a diffusion strategy nicknamed as ACTC (Adapt-Compress-Then-Combine), which relies on the following steps: i) an adaptation step where each agent performs an individual stochastic-gradient update with constant step-size; ii) a compression step that leverages a recently introduced class of stochastic compression operators; and iii) a combination step where each agent combines the compressed updates received from its neighbors. The distinguishing elements of this work are as follows. First, we focus on adaptive strategies, where constant (as opposed to diminishing) step-sizes are critical to respond in real time to nonstationary variations. Second, we consider the general class of directed graphs and left-stochastic combination policies, which allow us to enhance the interplay between topology and learning. Third, in contrast with related works that assume strong convexity for all individual agents' cost functions, we require strong convexity only at a network level, a condition satisfied even if a single agent has a strongly-convex cost and the remaining agents have non-convex costs. Fourth, we focus on a diffusion (as opposed to consensus) strategy. Under the demanding setting of compressed information, we establish that the ACTC iterates fluctuate around the desired optimizer, achieving remarkable savings in terms of bits exchanged between neighboring agents.

the agents' gradient updates are compressed using the random quantizers proposed in [41]. Before feeding the quantizer, the local gradients are corrected with a compensation term that accounts for the quantization error from previous iterations. Convergence guarantees of stochastic gradient algorithms with compressed gradient updates and error compensation were provided in [45] and [46].
Let us switch to the differential quantization approach. This is a more direct way to leverage the memory present in the iterative algorithm, which aims at reducing the error variance by compressing only the difference between subsequent iterates. For a fixed budget of quantization bits, it is indeed more convenient to compress the difference (i.e., the innovation) between consecutive samples, rather than the samples themselves. This is because: i) the innovation typically exhibits a reduced range as compared to the entire sample; and ii) owing to the correlation between consecutive samples, quantizing the entire sample will waste resources by transmitting redundant information. The information-theoretic fundamental limits of (non-stochastic) gradient descent under differential quantization have been recently established in [49].
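The benefit of compressing the innovation rather than the sample can be illustrated with a small numerical sketch. Everything in it is an assumption made for illustration only: the sinusoidal test signal, the deterministic uniform quantizer `uniform_quantize`, the 3-bit budget, and the two quantization ranges are not taken from the cited works.

```python
import math

def uniform_quantize(value, lo, hi, bits):
    """Round `value` to the nearest of 2**bits equally spaced levels in [lo, hi]."""
    levels = 2 ** bits
    step = (hi - lo) / (levels - 1)
    clipped = min(max(value, lo), hi)
    index = round((clipped - lo) / step)
    return lo + index * step

def compare_direct_and_differential(bits=3, num_samples=200):
    # Slowly varying (hypothetical) signal: consecutive samples differ by at most 0.05.
    signal = [math.sin(0.05 * i) for i in range(num_samples)]

    # Direct scheme: quantize every sample over its full range [-1, 1].
    direct_error = max(abs(x - uniform_quantize(x, -1.0, 1.0, bits)) for x in signal)

    # Differential scheme: quantize only the innovation (sample minus current
    # reconstruction) over a much narrower range, then accumulate it.
    reconstruction = 0.0
    differential_error = 0.0
    for x in signal:
        reconstruction += uniform_quantize(x - reconstruction, -0.1, 0.1, bits)
        differential_error = max(differential_error, abs(x - reconstruction))
    return direct_error, differential_error

direct_error, differential_error = compare_direct_and_differential()
print(f"max error, direct: {direct_error:.4f}  differential: {differential_error:.4f}")
```

With the same number of bits, the differential scheme spreads its levels over the narrow range of the innovation, so its worst-case reconstruction error is smaller than that of the direct scheme for this signal.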
However, the aforementioned works on error compensation and differential quantization referred either to fully-connected networks (i.e., there exists a direct communication link between any two agents) or to agents communicating with a single fusion center tasked to perform a centralized gradient update. In this work, we focus instead on the more challenging setting where optimization must be fully decentralized. Under this scenario, each agent is responsible for its own inference, which is obtained by successive steps of local interaction with its neighbors. When moving to the fully distributed setting, new challenges related to the topology and the interplay between agents need to be considered, adding significant complexity to the analysis of compression for distributed optimization. The first important difference is that, since the agents are generally not fully connected, even without communication constraints they cannot compute a common gradient function, implying that exchanging only gradient updates would impair convergence [50]. Therefore, the fully distributed setting requires agents to exchange their iterates rather than plain gradient updates.
Typical strategies for fully-distributed optimization without compression constraints are consensus or diffusion strategies [19]. However, applying these strategies with compressed data is nontrivial, since without proper design of the quantizers, significant bias is introduced in the learning algorithm, which prevents plain consensus or diffusion implementations from converging to the right minimizer. One early characterization of adaptive diffusion with compressed data was provided in [51], where the compression errors were modeled as noise over the communication channel. In comparison, there are other works that address the quantization issue in fully-distributed strategies by focusing on the explicit encoder structure. Some useful results are available for the case of exact, i.e., non-stochastic gradient-type algorithms. In [52], uniform quantization of the iterates is considered for a distributed implementation of the subgradient descent algorithm. In this scheme, the agents update locally their state variables by averaging the state variables received from their neighbors, and then follow the subgradient descent direction. More recently, additional convergence results were presented in [53], where random (dithered) quantization is applied, along with a weighting scheme to give more or less importance to the analog local state and the quantized averaged state of the neighbors. In [54], the randomized quantizers proposed in [41] are considered for a distributed gradient descent implementation using consensus with compressed iterates and an update rule similar to the one adopted in [53].
All the aforementioned works on fully-distributed schemes under communication constraints differ significantly from our proposal, as they rely on availability of the exact gradient. In the present work, we focus instead on the adaptive setting where the agents collect noisy streaming data to evaluate a stochastic instantaneous approximation of the actual gradient, and must be endowed with online algorithms capable of responding in real time to drifts in the underlying conditions. Useful communication-constrained and fully-decentralized implementations that can be applied to this setting were recently proposed in [55]-[57].
We are now ready to summarize the main novel and distinguishing contributions offered in this article, in comparison to the pertinent previous works.

B. Main Contributions
-Diffusive Adaptation. The Adapt-Compress-Then-Combine (ACTC) strategy proposed in this work belongs to the family of diffusion strategies [19], while the available works on distributed optimization under communication constraints focus on consensus strategies [55]- [57]. Along with many commonalities, one fundamental difference between diffusion and consensus resides in the asynchrony of the latter strategy in the combination step (where the updated state of an agent is combined with the previous states of its neighbors) [19]. This asynchrony has an effect both in terms of stability and learning performance. In fact, it has been shown that consensus algorithms can feature smaller range of stability as compared to diffusion strategies and a slightly worse learning performance [19]. For these reasons, in this work we opt for a diffusion scheme. Starting from the traditional (uncompressed) Adapt-Then-Combine (ATC) diffusion strategy detailed in [19], we allow for local exchange of compressed variables by means of stochastic quantizers. We will see that, thanks to the diffusion mechanism, the ACTC scheme will be able to adapt and learn well, and in particular it will outperform previous distributed quantized strategies based on consensus.
We focus on a dynamic setting where the agents are called to learn by continually collecting streaming data from the environment. Under an adaptive setting, once the distributed learning algorithm starts, we want the agents to learn virtually forever, by automatically adapting their behavior in the face of nonstationary drifts in the streaming data. To this end, stochastic gradient algorithms with constant step-size are necessary. These algorithms have been shown to trade off learning and adaptation well. On the learning side, each agent resorts to some instantaneous approximation of the cost function (which is not perfectly known in practice) and tries to learn with increasing precision by leveraging the increasing information coming from the streaming data.
On the adaptation side, the constant step-size leaves a persistent amount of "noise" in the algorithm (the "gradient noise") which automatically infuses the algorithm with the ability of promptly reacting to drifts. In contrast, over diminishing step-size implementations, the gradient noise is progressively annihilated over time.
As a consequence, diminishing step-size algorithms learn infinitely better as time progresses under stationary conditions. At the same time, if the minimizer changes (e.g., because of drifts in the underlying distribution) diminishing step-size algorithms get stuck on the previously computed minimizer, exhibiting a sort of "elephant's memory", i.e., requiring a time to get out from a local minimizer that is at least proportional to the time the algorithm needed to approach that minimizer.
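The contrast between constant and diminishing step-sizes can be sketched on a scalar toy problem. The setup is an assumption made only for illustration: a single agent, the loss 0.5 (w − x)², unit-variance Gaussian data, and a minimizer that jumps from 0 to 5 at a hypothetical `drift_time`.

```python
import random

def track_drifting_minimizer(mu=0.05, drift_time=2000, total_time=2500, seed=0):
    """Run constant and 1/i step-size stochastic gradient descent on the same data."""
    rng = random.Random(seed)
    w_const, w_dimin = 0.0, 0.0
    for i in range(1, total_time + 1):
        target = 0.0 if i <= drift_time else 5.0   # the minimizer jumps at drift_time
        x = target + rng.gauss(0.0, 1.0)           # streaming observation
        # The instantaneous gradient of 0.5 * (w - x)**2 is (w - x).
        w_const -= mu * (w_const - x)              # constant step-size
        w_dimin -= (1.0 / i) * (w_dimin - x)       # diminishing step-size 1/i
    return w_const, w_dimin

w_const, w_dimin = track_drifting_minimizer()
print(f"constant step-size: {w_const:.2f}, diminishing step-size: {w_dimin:.2f}")
```

With the step-size 1/i the iterate is the running average of all the data, so after the drift it stays close to the weighted mix of the old and new minimizers, whereas the constant step-size iterate moves to a neighborhood of the new minimizer within a number of samples on the order of 1/mu.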
The existing results on distributed stochastic gradient descent with compressed data focus mainly on nonadaptive implementations with diminishing step-size. Some results for constant step-sizes are available in [57], under a setting that differs from our setting in terms of the critical features described in the next two items, namely, type of combination policy and assumptions on the local cost functions.
-Left-Stochastic combination policies. The existing works on distributed optimization under communication constraints focus on symmetric and doubly-stochastic combination policies. However, distributed optimization algorithms need not be restricted to such policies. In particular, communication between a pair of nodes can be asymmetric, meaning that node $k$ can scale the data received from a neighbor $\ell$ with a weight $a_{\ell k}$ that differs from the weight $a_{k\ell}$ used by $\ell$ to scale the data received from $k$. Moreover, we can have directed graphs where, e.g., $\ell$ and $k$ communicate only in one direction (e.g., $a_{\ell k} > 0$ while $a_{k\ell} = 0$). For this reason, in this work we consider the more general setting of left-stochastic combination policies. Left-stochastic matrices allow us to represent a significantly richer variety of distributed interactions, where the network topology plays a fundamental role.
For example, the limiting Perron eigenvector of left-stochastic matrices is in general not uniform, a property that can be exploited to compensate for non-uniform agents' behavior [19]. Moreover, by acting on the topology and/or on the left-stochastic combination matrix, one can tune the Perron eigenvector so as to explore different Pareto-optimal solutions [19]. Last but not least, differently from doubly-stochastic matrices, left-stochastic matrices can be constructed in practice without requiring any coordination across the agents. From a technical viewpoint, the fact that our combination matrices are not required to be either symmetric or doubly stochastic introduces significant additional complexity in the technical analysis.
-Global strong convexity. Convergence of the stochastic gradient iterates is typically examined under the assumption that the gradients are Lipschitz and the cost functions are strongly convex. In the distributed setting, the latter property is usually translated into assuming that all the local cost functions pertaining to the individual agents are strongly convex [55], [57]. Sometimes the additional assumption of uniform gradient boundedness is adopted (e.g., in [55]), which can however hold only approximately in the Lipschitz and strongly-convex setting. For Lipschitz gradient and strongly-convex local function, without the uniform boundedness approximation, convergence results were recently obtained for distributed primal-dual algorithms

Capital letters refer to matrices, small letters to both vectors and scalars. Sometimes we violate the latter convention; for instance, we denote the total number of network agents by $N$. All vectors are column vectors.
In particular, the symbol $\mathbb{1}_L$ denotes an $L \times 1$ vector whose entries are identically equal to 1. Likewise, the identity matrix of size $L$ is denoted by $I_L$. For two square matrices $X$ and $Y$, the notation $X \geq Y$ signifies that $X - Y$ is positive semi-definite. In comparison, for two rectangular matrices $X$ and $Y$, the notation $X \succeq Y$ signifies that the individual entries of $X - Y$ are nonnegative. For a vector $x$, the symbol $\|x\|$ denotes the $\ell_2$ norm of $x$. For a matrix $X$, the $\ell_2$ induced matrix norm is accordingly $\|X\|$. Other norms will be characterized by adding the pertinent subscript. For example, $\|x\|_1$ will denote the $\ell_1$ norm of $x$, and $\|X\|_1$ the $\ell_1$ induced matrix norm (maximum absolute column sum of $X$). The symbol $\otimes$ denotes the Kronecker product. The symbol $*$ denotes complex conjugation. $X^\top$ is the transpose of matrix $X$, whereas $X^{\mathsf{H}}$ is the Hermitian (i.e., conjugate) transpose of a complex matrix $X$. The symbol $\mathbb{E}$ denotes the expectation operator. For a nonnegative function $f(\mu)$, the notation $f(\mu) = O(\mu)$ signifies that there exist a constant $C > 0$ and a value $\mu_0$ such that $f(\mu) \leq C\mu$ for all $\mu \leq \mu_0$.

II. BACKGROUND
We consider a network of $N$ agents solving a distributed optimization problem. Each individual agent $k = 1, 2, \ldots, N$ is assigned a local cost or risk function:
$$J_k(w): \mathbb{R}^M \to \mathbb{R}. \tag{1}$$
The local cost functions are assumed to satisfy the following regularity condition.
Assumption 1 (Individual cost function smoothness). For all $w \in \mathbb{R}^M$, each cost function $J_k(w)$ is twice-differentiable and its Hessian matrix satisfies the following Lipschitz condition, for some positive constants $\{\eta_k\}$:
$$\|\nabla^2 J_k(w)\| \leq \eta_k. \tag{2}$$

In practice, it is seldom the case that the cost functions are perfectly known to the agents. In contrast, each agent usually has access to a stochastic approximation of the true cost function. For example, in the adaptation and learning theory the cost functions are often modeled as the expected value of a loss function $L_k(w; x_k)$, namely,
$$J_k(w) = \mathbb{E}\, L_k(w; x_k), \tag{3}$$
where the expectation is taken w.r.t. a random variable $x_k$ that can represent, e.g., some training data observed by agent $k$. In many scenarios of interest, the statistical characterization of $x_k$ is not available to the agents and, hence, $J_k(w)$ is not known and is rather approximated by the stochastic quantity $L_k(w; x_k)$. Moreover, if different data $x_{k,i}$ are collected over time by agent $k$, the stochastic approximation takes the form of an instantaneous approximation $L_k(w; x_{k,i})$ depending on time index $i$.
More generally, whether or not the cost function is defined through (3), in the following treatment we assume that agent $k$ at time $i$ is able to approximate the true gradient $\nabla J_k(w)$ through a stochastic instantaneous approximation $g_{k,i}(w)$ which, without loss of generality, can be written as the true gradient plus a gradient noise term $n_{k,i}(w)$, namely,
$$g_{k,i}(w) = \nabla J_k(w) + n_{k,i}(w). \tag{4}$$

A. Classical ATC Diffusion Strategy
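The Adapt-Then-Combine (ATC) diffusion strategy is a popular distributed mechanism that consists of iterated application of the following two steps, for $i = 1, 2, \ldots$
$$\begin{cases} \psi_{k,i} = w_{k,i-1} - \mu_k\, g_{k,i}(w_{k,i-1}) & \text{[adaptation step]} \\ w_{k,i} = \sum_{\ell \in \mathcal{N}_k} a_{\ell k}\, \psi_{\ell,i} & \text{[combination step]} \end{cases} \tag{5}$$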
In (5), agents $k = 1, 2, \ldots, N$ evolve over time $i$ by producing a sequence of iterates $w_{k,i} \in \mathbb{R}^M$. The adaptation step is a self-learning step, where each agent $k$ at time $i$ computes its own instantaneous stochastic approximation $g_{k,i}(\cdot)$ of the gradient of the local cost function $J_k(\cdot)$, evaluated at the previous iterate $w_{k,i-1}$. Such an approximation is scaled by a small step-size $\mu_k > 0$ and used to update the previous iterate $w_{k,i-1}$ along the (stochastic) gradient descent direction. The maximum step-size across the agents will be denoted by:
$$\mu = \max_{k = 1, 2, \ldots, N} \mu_k, \tag{6}$$
giving rise to the scaled step-sizes:
$$\alpha_k = \frac{\mu_k}{\mu}. \tag{7}$$
The combination step is a social learning step, where agent $k$ aims at realigning its descent direction with the rest of the network by combining its local update $\psi_{k,i}$ with the other agents' updates scaled by some nonnegative scalars $\{a_{\ell k}\}$, which are referred to as combination weights. The support graph of the combination matrix $A = [a_{\ell k}]$ describes the connections between agents, i.e., the topology of a network whose vertices correspond to the agents, and whose edges represent directional links between agents. According to this model, when no communication link exists between agents $\ell$ and $k$, the combination weights $a_{\ell k}$ and $a_{k\ell}$ must be equal to zero. Likewise, when information can flow only from $\ell$ to $k$, we will have $a_{\ell k} > 0$ and $a_{k\ell} = 0$. In summary, the combination process is a local process where only neighboring agents interact.
It is useful to introduce the neighborhood of agent $k$:
$$\mathcal{N}_k = \{\ell = 1, 2, \ldots, N : a_{\ell k} > 0\}, \tag{8}$$
which is a directed neighborhood that accounts for the incoming flow of information from $\ell$ to $k$ (possibly including the self-loop $\ell = k$).
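The two ATC steps can be sketched in a few lines. This is only a minimal illustration under assumptions that are not part of the article: quadratic local costs 0.5 ||w − t_k||² with hypothetical targets t_k, additive Gaussian gradient noise, and a common step-size for all agents. The entry `A[l][k]` plays the role of the weight that agent k assigns to the update received from agent l.

```python
import random

def atc_diffusion(targets, A, mu, num_iter, noise_std=0.0, seed=0):
    """Adapt-Then-Combine sketch for costs J_k(w) = 0.5 * ||w - targets[k]||^2."""
    rng = random.Random(seed)
    N, M = len(targets), len(targets[0])
    w = [[0.0] * M for _ in range(N)]
    for _ in range(num_iter):
        # Adaptation step: local stochastic-gradient update at every agent.
        psi = []
        for k in range(N):
            grad = [w[k][m] - targets[k][m] + rng.gauss(0.0, noise_std)
                    for m in range(M)]
            psi.append([w[k][m] - mu * grad[m] for m in range(M)])
        # Combination step: agent k mixes the updates of its neighbors
        # (agents l with A[l][k] > 0); the columns of A sum to one.
        w = [[sum(A[l][k] * psi[l][m] for l in range(N)) for m in range(M)]
             for k in range(N)]
    return w

# Three agents with different scalar targets and a doubly-stochastic A.
A = [[0.50, 0.25, 0.25],
     [0.25, 0.50, 0.25],
     [0.25, 0.25, 0.50]]
iterates = atc_diffusion([[0.0], [3.0], [6.0]], A, mu=0.01, num_iter=2000,
                         noise_std=0.1)
print([round(wk[0], 2) for wk in iterates])
```

In this toy setting the agents have different local minimizers (0, 3 and 6), yet all iterates settle near the minimizer of the aggregate cost, which here is the average target.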
We will work under the following standard regularity conditions on the network.
Assumption 2 (Strongly-Connected Network). The network is strongly-connected, which means that, given any pair of nodes $(\ell, k)$, a path with nonzero weights exists in both directions (i.e., from $\ell$ to $k$ and vice versa), and that at least one agent $k$ in the entire network has a self-loop ($a_{kk} > 0$).
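Assumption 3 (Stochastic combination matrix). For each agent $k = 1, \ldots, N$ the following conditions hold:
$$a_{\ell k} \geq 0, \qquad \sum_{\ell \in \mathcal{N}_k} a_{\ell k} = 1, \qquad a_{\ell k} = 0 \ \text{for } \ell \notin \mathcal{N}_k, \tag{9}$$
which imply that the combination matrix $A = [a_{\ell k}]$ is a left-stochastic matrix.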
Under Assumptions 2 and 3, the combination matrix $A$ is a primitive matrix, and thus satisfies the Perron-Frobenius theorem, which in particular implies the existence of the Perron vector $\pi = [\pi_1, \pi_2, \ldots, \pi_N]^\top$, a vector with all strictly positive entries satisfying the following relationship:
$$A\pi = \pi, \qquad \mathbb{1}_N^\top \pi = 1. \tag{10}$$
For later use, it is convenient to introduce the vector $p$ that mixes the topological information encoded in the Perron eigenvector with the scaled step-sizes $\{\alpha_k\}$, namely,
$$p = [\alpha_1 \pi_1, \alpha_2 \pi_2, \ldots, \alpha_N \pi_N]^\top. \tag{11}$$
We are now ready to introduce the global strong convexity assumption that will be required for our results to hold.
Assumption 4 (Global Strong Convexity). Let $p_k = \alpha_k \pi_k$ be the $k$-th entry of the scaled Perron eigenvector in (11). The (twice differentiable) aggregate cost function:
$$J(w) = \sum_{k=1}^{N} p_k J_k(w) \tag{12}$$
is $\nu$-strongly convex, namely, a positive constant $\nu$ exists such that, for all $w \in \mathbb{R}^M$ we have:
$$\nabla^2 J(w) = \sum_{k=1}^{N} p_k \nabla^2 J_k(w) \geq \nu I_M. \tag{13}$$

We remark that Assumption 4 can hold even when only one local cost function is strongly convex, with the other cost functions being allowed to be non-convex, provided that (13) is satisfied. In contrast, most works consider the more restrictive setting where all individual cost functions are strongly convex. In this work, we depart from this restrictive assumption, and adhere instead to the more general setting that was considered in [13], [14], [19] for the uncompressed-data setting. Therefore, Assumption 4 turns out to be a useful generalization that will allow us to cover important practical cases, such as the case where all agents but one work with models that are not locally identifiable. Under these scenarios, the agents with unidentifiable models would not be in the position of converging to a meaningful minimizer, whereas the presence of only a single good agent will enable successful collective learning for the entire network. This particular setting will be illustrated in more detail in Sec. VII. On the other hand, as we will see, working under the assumption that the local functions are not all strongly convex introduces significant complexity in the analysis and makes the derivations more demanding.
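As a purely illustrative scalar instance of Assumption 4 (the two costs and the weights below are assumptions chosen for this example, not taken from the article), consider two agents, one with a strongly-convex cost and one with a non-convex (concave) cost:

```latex
J_1(w) = w^2, \qquad J_2(w) = -\tfrac{1}{4} w^2, \qquad p_1 = p_2 = \tfrac{1}{2},
\\[4pt]
J(w) = p_1 J_1(w) + p_2 J_2(w) = \tfrac{1}{2} w^2 - \tfrac{1}{8} w^2 = \tfrac{3}{8} w^2,
\\[4pt]
\nabla^2 J(w) = p_1 \cdot 2 + p_2 \cdot \left(-\tfrac{1}{2}\right) = \tfrac{3}{4} > 0 .
```

The aggregate cost is therefore strongly convex with $\nu = 3/4$, even though $J_2$ alone has no minimizer. Both Hessians are bounded in magnitude (by $\eta_1 = 2$ and $\eta_2 = 1/2$), so the smoothness condition in Assumption 1 holds as well.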
For later use, it is useful to notice that Assumptions 1 and 4 entail a relationship between the Lipschitz constants $\{\eta_k\}$ and the strong convexity constant $\nu$. Specifically, using (2) and (13) and applying the triangle inequality we can write:
$$\nu \leq \|\nabla^2 J(w)\| \leq \sum_{k=1}^{N} p_k \|\nabla^2 J_k(w)\| \leq \sum_{k=1}^{N} p_k \eta_k. \tag{14}$$

The adaptation and learning performance of the ATC strategy has been examined in great detail in previous works [13], [14], [19]. The major conclusion stemming from these works is that, under Assumptions 1-4 (plus some classical assumptions on the gradient noise -see Assumption 6 further ahead), the ATC strategy is able to drive each agent toward a close neighborhood of the minimizer $w^\star$ of the global cost function (12). We remark that in the considered framework each cost function can be different, having a specific minimizer $w_k^o$ (or even multiple minimizers), which may not coincide with the unique global network minimizer $w^\star$ of the aggregate cost function $J(w)$.
We notice also that the structure of (12) allows us to solve different optimization problems where the objective function can be expressed as the linear combination of local cost functions, including the special case where the weights $p_k$ are all uniform, which can be obtained when the step-sizes are all equal (i.e., $\alpha_k = 1$ for all $k$) and the combination matrix is doubly-stochastic (since the Perron eigenvector of a doubly-stochastic matrix has entries $\pi_\ell = 1/N$ for all $\ell = 1, 2, \ldots, N$). In addition, and remarkably, it was shown in [19], [58] that the minimizer of (12) corresponds to a Pareto solution of the multi-objective optimization problem:
$$\min_{w} \{J_1(w), J_2(w), \ldots, J_N(w)\}, \tag{15}$$
where the choice of the weights $\{p_k\}$ drives convergence to a particular Pareto solution.
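The Perron vector and the weights p_k = α_k π_k that define the aggregate cost can be computed numerically. The sketch below uses plain power iteration; the function names, the example matrix, and the step-sizes are hypothetical choices made for illustration. The entry `A[l][k]` is the weight that agent k assigns to agent l, so each column sums to one.

```python
def perron_vector(A, num_iter=1000):
    """Power iteration for A pi = pi, with A left-stochastic and primitive."""
    N = len(A)
    pi = [1.0 / N] * N
    for _ in range(num_iter):
        pi = [sum(A[l][k] * pi[k] for k in range(N)) for l in range(N)]  # pi <- A pi
        total = sum(pi)
        pi = [v / total for v in pi]   # keep the entries summing to one
    return pi

def scaled_weights(pi, step_sizes):
    """Weights p_k = alpha_k * pi_k, with alpha_k = mu_k / max_k mu_k."""
    mu_max = max(step_sizes)
    return [(mu_k / mu_max) * pi_k for mu_k, pi_k in zip(step_sizes, pi)]

# Left-stochastic (columns sum to one) but not doubly-stochastic example.
A = [[0.5, 0.3, 0.0],
     [0.5, 0.4, 0.6],
     [0.0, 0.3, 0.4]]
pi = perron_vector(A)
p = scaled_weights(pi, step_sizes=[0.01, 0.02, 0.02])
print([round(v, 4) for v in pi], [round(v, 4) for v in p])
```

For this matrix the Perron vector is non-uniform, so the agents enter the aggregate cost with different weights. Replacing `A` with a doubly-stochastic matrix and using equal step-sizes yields uniform weights 1/N.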

III. ATC WITH COMPRESSED COMMUNICATION
The ATC diffusion scheme (5) is designed under the assumption that agents exchange over the communication channel their intermediate updates ψ k,i . However, in a realistic environment the information shared by the agents must be necessarily compressed. This necessity gives rise to at least two fundamental questions. First, is it possible to design diffusion strategies that preserve the adaptation and learning capabilities of the ATC strategy despite the presence of data compression? Assume the answer to the first question is in the affirmative. Then it is natural to ask whether there is a limit on the amount of compression, since it is obviously desirable for the agents to save as much bandwidth and energy as possible. Our analysis will give precise elements to address these important questions. To start with, we introduce a compressed version of the ATC strategy.
There exist obviously several possibilities to perform data compression. In order to select one particular strategy, we need to consider the fundamental limitations of our setting, in particular: i) lack of knowledge of the underlying statistical model; ii) correlation across subsequent iterates.
Limitation i) is classically encountered in quantization for inference [38], where the usage of standard techniques for quantizer design is not viable. This is because, when the true statistical models are unavailable, such techniques lead to severe estimation bias that eventually impairs the algorithm's convergence. An excellent tool to overcome this issue is provided by stochastic quantization, where suitable introduction of randomness makes it possible to compensate for the bias (on average, over time).
Limitation ii) is classically encountered in the quantization of random processes. An excellent tool to solve this problem is provided by differential quantization, which leverages similarity (i.e., correlation) between subsequent samples by quantizing only their difference (i.e., innovation).
Motivated by the above two observations, in the following we will rely on stochastic differential quantization, which will be plugged in the ATC recursion giving rise to the Adapt-Compress-Then-Combine (ACTC) diffusion strategy, which can be described as follows.
The time-varying variables characterizing the ACTC recursion are: an intermediate update $\psi_{k,i}$, a differentially-quantized update $q_{k,i}$, and the current iterate $w_{k,i}$ (i.e., the current estimate of the minimizer). At time $i = 0$ each agent $k$ is initialized with an arbitrary state value $q_{k,0}$ (with finite second moment). Then, agent $k$ receives the initial states $q_{\ell,0}$ from its neighbors $\ell \in \mathcal{N}_k$ (such initial sharing is performed with infinite precision, which is immaterial to our analysis since it happens only once) and computes an initial iterate $w_{k,0} = \sum_{\ell \in \mathcal{N}_k} a_{\ell k}\, q_{\ell,0}$. Then, for every $i > 0$, the agents perform the following four operations. First, each agent $k$ performs locally the same adaptation step as in the ATC strategy:
$$\psi_{k,i} = w_{k,i-1} - \mu_k\, g_{k,i}(w_{k,i-1}). \tag{16}$$
Second, each agent $k$ compresses the difference between the update $\psi_{k,i}$ and the previous quantized update $q_{k,i-1}$, through a compression function $\boldsymbol{Q}_k : \mathbb{R}^M \to \mathbb{R}^M$:
$$\boldsymbol{Q}_k(\psi_{k,i} - q_{k,i-1}). \tag{17}$$
The bold notation for the compression function highlights that randomized functions are permitted, as we will explain more carefully in Sec. III-A. Third, agent $k$ receives from its neighbors $\ell \in \mathcal{N}_k$ the compressed values $\boldsymbol{Q}_\ell(\psi_{\ell,i} - q_{\ell,i-1})$. Since the quantization operation acts on differences, the quantized states $q_{\ell,i}$ must be updated by adding the quantized difference to the previous value $q_{\ell,i-1}$. Specifically, agent $k$ updates the quantized values corresponding to all $\ell \in \mathcal{N}_k$:
$$q_{\ell,i} = q_{\ell,i-1} + \zeta\, \boldsymbol{Q}_\ell(\psi_{\ell,i} - q_{\ell,i-1}), \tag{18}$$
where $\zeta \in (0, 1)$ is a design parameter that will be useful to tune the stability of the algorithm, as we will carefully explain in due time. Finally, agent $k$ combines the updated states corresponding to its neighbors as usual:
$$w_{k,i} = \sum_{\ell \in \mathcal{N}_k} a_{\ell k}\, q_{\ell,i}. \tag{19}$$
It is important to remark that, in order to perform the update step in (18), agent $k$ must possess the variables $q_{\ell,i-1}$ from its neighbors $\ell \in \mathcal{N}_k$. This might appear problematic at first glance, since we have just seen that only the differences $\boldsymbol{Q}_\ell(\psi_{\ell,i} - q_{\ell,i-1})$ are actually received by $k$ from a neighboring agent $\ell$.
On the other hand, since at $i = 0$ agent $k$ knows the initial quantized states $\{q_{\ell,0}\}_{\ell \in \mathcal{N}_k}$, and since the quantized update in (18) depends on $\{q_{\ell,i-1}\}_{\ell \in \mathcal{N}_k}$ and the quantized innovation $\{\boldsymbol{Q}_\ell(\psi_{\ell,i} - q_{\ell,i-1})\}_{\ell \in \mathcal{N}_k}$, we conclude that sharing of the quantized differences along with the initial states $\{q_{\ell,0}\}_{\ell \in \mathcal{N}_k}$ is enough for agent $k$ to implement (18) at every instant $i$, by keeping memory only of the last neighboring variables $\{q_{\ell,i}\}_{\ell \in \mathcal{N}_k}$. In summary, the ACTC scheme can be compactly described as follows:
$$\begin{cases} \psi_{k,i} = w_{k,i-1} - \mu_k\, g_{k,i}(w_{k,i-1}) \\ q_{k,i} = q_{k,i-1} + \zeta\, \boldsymbol{Q}_k(\psi_{k,i} - q_{k,i-1}) \\ w_{k,i} = \sum_{\ell \in \mathcal{N}_k} a_{\ell k}\, q_{\ell,i} \end{cases} \tag{20}$$

Before concluding this section, we remark that the combination step of the ACTC strategy considers only quantized variables, even if, in principle, agent $k$ might also combine its analog state $\psi_{k,i}$ in place of the quantized counterpart $q_{k,i}$. As a general principle, the more we compress, the less we spend (even in terms of local processing). In this respect, showing that the ACTC strategy learns properly by combining only quantized variables has its own interest. Moreover, considering only quantized variables gives the algorithm a symmetric structure that avoids adding further complexity to the mathematical formalization that is necessary to support the analysis.
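The adapt, compress, update and combine operations described in this section can be sketched as follows. The sketch rests on assumptions that are not part of the article: quadratic local costs 0.5 ||w − t_k||² with hypothetical targets, a common step-size, zero initial states, and a simple Bernoulli sparsifier standing in for the compression operator (it is unbiased, with mean-square error proportional to the squared norm of its input). The entry `A[l][k]` is the weight that agent k assigns to agent l.

```python
import random

def bernoulli_sparsifier(x, rng, keep_prob=0.5):
    """Keep each entry with probability keep_prob and rescale it by 1/keep_prob.
    Unbiased; E||Q(x) - x||^2 = ((1 - keep_prob) / keep_prob) * ||x||^2."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in x]

def actc_diffusion(targets, A, mu, zeta, num_iter,
                   compress=bernoulli_sparsifier, noise_std=0.0, seed=0):
    """Adapt-Compress-Then-Combine sketch for J_k(w) = 0.5 * ||w - targets[k]||^2."""
    rng = random.Random(seed)
    N, M = len(targets), len(targets[0])

    def combine(states):
        # Agent k mixes the quantized states of the agents l with A[l][k] > 0.
        return [[sum(A[l][k] * states[l][m] for l in range(N)) for m in range(M)]
                for k in range(N)]

    q = [[0.0] * M for _ in range(N)]      # initial quantized states (shared once)
    w = combine(q)                         # initial iterates
    for _ in range(num_iter):
        for k in range(N):
            # Adapt: stochastic-gradient step from the previous iterate.
            psi = [w[k][m] - mu * (w[k][m] - targets[k][m] + rng.gauss(0.0, noise_std))
                   for m in range(M)]
            # Compress: only the difference psi - q is compressed and transmitted.
            delta = compress([psi[m] - q[k][m] for m in range(M)], rng)
            # Update: neighbors rebuild the quantized state from the difference.
            q[k] = [q[k][m] + zeta * delta[m] for m in range(M)]
        # Combine: uses quantized states only.
        w = combine(q)
    return w

A = [[0.50, 0.25, 0.25],
     [0.25, 0.50, 0.25],
     [0.25, 0.25, 0.50]]
iterates = actc_diffusion([[0.0], [3.0], [6.0]], A, mu=0.01, zeta=0.5,
                          num_iter=5000, noise_std=0.1)
print([round(wk[0], 2) for wk in iterates])
```

Each agent only needs to store the last quantized state of every neighbor, since the next state is rebuilt from the received compressed difference. Passing the identity as `compress` with `zeta=1` gives back the uncompressed ATC recursion.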

A. Compression operators
In order to implement the ACTC recursion, it is necessary to specify the compression operators Q k (·). We will consider the class of compression operators that fulfill the following regularity assumptions.
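Assumption 5 (Compression operators). For a positive parameter $\omega$ and an input value $x \in \mathbb{R}^M$, the compression operator $\boldsymbol{Q} : \mathbb{R}^M \to \mathbb{R}^M$ is a randomized operator satisfying the following two properties:
$$\mathbb{E}\left[\boldsymbol{Q}(x)\right] = x, \tag{21}$$
$$\mathbb{E}\left\|\boldsymbol{Q}(x) - x\right\|^2 \leq \omega \|x\|^2, \tag{22}$$
where expectations are evaluated w.r.t. the randomness of the operator only. When $x$ is random, the compression operator is statistically independent of $x$.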
By "randomized operator" we mean that, given a deterministic input x, the output Q(x) is randomly chosen (one meaningful way to perform such random choice will be illustrated in Sec. III-B). Accordingly, the expectations appearing in (21) and (22) are computed w.r.t. the randomness inherent to operator Q(·). As stated, whenever we apply the operator to a random input x, we assume that the random mechanisms governing Q(·) and x are statistically independent.
Two main observations are important to capture the meaning of Assumption 5. The first one concerns the role of parameter ω, which quantifies the amount of compression. Small values of ω correspond to small amount of compression, i.e., finely quantized data. Large values of ω are instead representative of severe compression. The second observation concerns properties (21) and (22). It will emerge from the technical analysis that property (21) enables the possibility that the quantization errors arising during the ACTC evolution average to zero as time elapses. Property (22) will be critical to guarantee that the variance of the quantization errors does not blow-up over time.
Remarkably, the class of randomized quantizers introduced in Assumption 5 is fairly general and flexible, including a broad variety of compression paradigms. Some of these paradigms are particularly tailored to our setting, where we have to compress vectors with possibly large dimensionality M . For example, one useful paradigm is the sparse compression paradigm, where a small subset of the M components of the input vector x is sent with arbitrarily large precision, while the remaining entries are set to zero [57].
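A sparse compressor of this kind can be sketched as follows. The specific variant is an assumption made for illustration: k of the M entries are selected uniformly at random and rescaled by M/k (the rescaling is what makes the operator unbiased), and the function names are hypothetical. For small M, the two moments that characterize a randomized compression operator can be checked exactly by enumerating all possible selections.

```python
import random
from itertools import combinations

def sparsify(x, kept):
    """Keep the entries indexed by `kept`, rescaled by M/k; zero out the rest."""
    M, k = len(x), len(kept)
    kept = set(kept)
    return [x[m] * M / k if m in kept else 0.0 for m in range(M)]

def rand_k(x, k, rng):
    """Random sparsifier: transmit k entries chosen uniformly at random."""
    return sparsify(x, rng.sample(range(len(x)), k))

def exact_moments(x, k):
    """Exact mean and mean-square error of rand_k, by enumerating all selections."""
    M = len(x)
    outputs = [sparsify(x, kept) for kept in combinations(range(M), k)]
    mean = [sum(out[m] for out in outputs) / len(outputs) for m in range(M)]
    mse = sum(sum((out[m] - x[m]) ** 2 for m in range(M))
              for out in outputs) / len(outputs)
    return mean, mse

x = [1.0, -2.0, 3.0, 0.5]
mean, mse = exact_moments(x, k=1)
print(mean, mse, rand_k(x, 1, random.Random(0)))
```

For this variant the mean equals x and the mean-square error equals (M/k − 1) ||x||², so the compression parameter is ω = M/k − 1: the fewer entries are transmitted, the larger ω becomes.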
Another compression scheme that meets Assumption 5 was recently popularized in [41], and will be illustrated in the next section.
B. Randomized Quantizers in [41]
• The Euclidean norm $\|x\|$ of the input vector $x$ is represented with high resolution $h$, e.g., with machine precision. Then, each entry $x_m$ of $x$ is separately quantized.
• One bit is used to encode the sign of $x_m$.
• Then, we encode the absolute value of the $m$-th entry $x_m$. Since $\|x\|$ is transmitted with high precision, we can focus on the scaled value:
$$\xi_m = \frac{|x_m|}{\|x\|} \in [0, 1]. \tag{23}$$
The interval $[0, 1]$ is partitioned into $L$ equal-size intervals -see the illustrative example in Fig. 1. The size of each interval is:
$$\vartheta = \frac{1}{L}, \tag{24}$$
such that the intervals' endpoints can be accordingly represented as:
$$y_j = j\,\vartheta, \qquad j = 0, 1, \ldots, L. \tag{25}$$
In order to avoid confusion, we stress that the quantization scheme will require to transmit one of the $L + 1$ indices corresponding to the endpoints. This differs from classical quantization schemes where the index of the interval (instead of the endpoint) is transmitted. Accordingly, the bit-rate $r$ is equal to:
$$r = \log_2(L + 1). \tag{26}$$
• In view of (25), the index of the (lower) endpoint of the interval the scaled entry $\xi_m$ belongs to, is computed as:
$$j(\xi_m) = \left\lfloor \frac{\xi_m}{\vartheta} \right\rfloor, \tag{27}$$
and the corresponding endpoint is:
$$y(\xi_m) = j(\xi_m)\,\vartheta. \tag{28}$$
Then, we randomize the quantization operation since we choose randomly to transmit the lower endpoint index $j(\xi_m)$ or the upper endpoint index $j(\xi_m) + 1$. Specifically, the probability of transmitting one endpoint index is proportional to the distance of $\xi_m$ from the opposite endpoint. In other words, the closer we are to one endpoint, the higher the probability of transmitting that endpoint will be. Formally, the random transmitted index $j^{\mathrm{tx}}(\xi_m)$ is:
$$j^{\mathrm{tx}}(\xi_m) = \begin{cases} j(\xi_m) + 1, & \text{with probability } \dfrac{\xi_m - y(\xi_m)}{\vartheta}, \\ j(\xi_m), & \text{otherwise}. \end{cases} \tag{29}$$
• Once the index $j^{\mathrm{tx}}(\xi_m)$ is received, the unquantized value $\xi_m$ is rounded to the lower or upper endpoint depending on the realization of the transmitted index $j^{\mathrm{tx}}(\xi_m)$, and then the information about the norm $\|x\|$ and the sign of $x_m$ is recovered, finally yielding the $m$-th component of the quantized vector $\boldsymbol{Q}(x)$:
$$[\boldsymbol{Q}(x)]_m = \|x\|\, \mathrm{sign}(x_m)\, \vartheta\, j^{\mathrm{tx}}(\xi_m). \tag{30}$$
• Accounting for the $h$ bits spent for representing the norm $\|x\|$ and the single bit for representing the sign of each of the $M$ entries of $x$, the total bit-rate is:
$$h + M(1 + r). \tag{31}$$
It was shown in [41] that the value of the compression factor $\omega$ can be computed as:
$$\omega = \min\left\{ \frac{M}{L^2}, \frac{\sqrt{M}}{L} \right\}. \tag{32}$$
Equation (32) provides useful insight on the practical meaning of $\omega$ for the considered type of quantizers. We see that, for fixed dimensionality $M$, the parameter $\omega$ decays exponentially fast with the number of bits ($\approx 2^{-2r}$), whereas for fixed number of bits it grows as $\sqrt{M}$.
In order to avoid misunderstanding, we stress that in the forthcoming treatment we will use interchangeably the terminology compression or quantization to indicate a general compression operator fulfilling Assumption 5.
Reference to a specific class of compression operators, such as the randomized quantizers in [41] that we have illustrated in this section, will be made when needed (e.g., in Sec. VII).
Before going ahead, it is necessary to make the following notational remark. Since in our setting each individual agent $k$ will be allowed to employ a different compression operator $Q_k(x)$, we will have possibly different compression parameters $\omega_k$, with their maximum value being denoted by $\Omega = \max_k \omega_k$.
We will show in the remainder of the article that the mean-square-error approaches O(µ) for small step-sizes, i.e., the algorithm is mean-square-error stable even in the presence of quantization errors and gradient noise. The derivations are demanding due to the nonlinear and coupled nature of the network dynamics, as is clear from the arguments in the appendices. Nevertheless, we arrive at the reassuring conclusion that the diffusion strategy is able to learn well in quantized/compressed environments.

IV. NETWORK ERROR DYNAMICS
In this section we illustrate the main formalism that will be exploited to conduct our analysis. Since we are interested in computing the deviation of the ACTC iterates from the global minimizer w , it is expedient to introduce the following centered variables: It is also convenient to rewrite the adaptation step in order to make explicit the role of the true cost functions J k (w). To this end, we must exploit the gradient noise introduced in (4), which quantifies the discrepancy between the approximate and true gradients. Exploiting (4), the first line in (20) can be rewritten as: Examining (35), we see that the gradient noise contains an additional source of randomness given by its argument w k,i−1 = ∈Nk a k q ,i−1 , whose randomness comes accordingly from the previous-step quantized iterates We assume the following standard regularity properties for the gradient noise process.
Assumption 6 (Gradient Noise). For all k = 1, 2, . . . , N and all i > 0, the gradient noise meets the following conditions: for some constants β k and σ k .
With reference to the actual gradient in (4), from the mean-value theorem one has (we recall that w k,i−1 = w k,i−1 − w ) [19]: where the integral of a matrix is intended to operate entrywise. Introducing the bias of agent k, and the Hessian matrix of agent k: Eq. (38) yields: For notational convenience, the scaled step-size α k has been embodied in the definitions of the bias and the Hessian matrix. Likewise, it is useful to introduce the scaled gradient noise vector: Using now (34), (35) and (41) in (20), the ACTC recursion can be recast in the form: We remark that in the second equation of (43), the argument of the compression function is expressed in terms of centered variables by adding and subtracting w .
We now manage to reduce (43) to a simpler form. First of all, the ACTC iterates w k,i are convex combinations of the quantized iterates { q k,i }, implying that the characterization of the mean-square behavior of q k,i will enable immediate characterization of the mean-square behavior of w k,i . It is therefore convenient to incorporate the third step of the ACTC strategy into the first step, and focus on the behavior of q k,i , obtaining: Moreover, it is convenient to introduce the difference variable: Subtracting q k,i−1 from the first equation in (44), we finally obtain:

A. Recursions in Extended Form
Since we are interested in a network-oriented analysis, it is useful to introduce a notation where the N agents' vectors of size M × 1 are stacked into the following M N × 1 vectors: Likewise, it is useful to consider the extended compression operator Q(·) that applies the compression operation to each M × 1 block of its input, and stacks the results as follows: Finally, in order to express the recursion (43) in terms of the joint evolution of the extended vectors we need to introduce the extended matrices: where ⊗ denotes the Kronecker product. It is now possible to describe compactly the ACTC strategy of the individual agents in (43) as:
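The stacking and Kronecker notation of this subsection can be checked numerically. The sketch below assumes that the combination step has the form $w_k = \sum_{\ell} a_{\ell k} q_{\ell}$ (as recalled in Sec. IV) and that the corresponding extended matrix is $A^\top \otimes I_M$; the matrix `A` and the sizes `N`, `M` are hypothetical values chosen only for illustration.

```python
import numpy as np

N, M = 3, 2
# Hypothetical left-stochastic combination matrix: every column sums to 1
A = np.array([[0.6, 0.2, 0.0],
              [0.4, 0.5, 0.3],
              [0.0, 0.3, 0.7]])

rng = np.random.default_rng(1)
q_blocks = rng.standard_normal((N, M))       # row k holds the M x 1 vector q_k

# Agent-by-agent combination: w_k = sum_l a_{lk} q_l
w_blocks = np.stack([sum(A[l, k] * q_blocks[l] for l in range(N))
                     for k in range(N)])

# Extended form: stack the agents' vectors into an MN x 1 vector and
# apply the (assumed) extended combination matrix A^T kron I_M
A_ext = np.kron(A.T, np.eye(M))
w_ext = A_ext @ q_blocks.reshape(-1)
```

Both computations give the same stacked vector, which is why the network recursion can be written compactly in terms of extended quantities.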

B. Network Coordinate Transformation
By means of a proper linear transformation of the network evolution it is possible to separate the two fundamental mechanisms that characterize the learning behavior of the ACTC strategy. The first mechanism characterizes the coordinated evolution enabled by the social learning phenomenon. This is a desired behavior that will be critical to let each individual agent agree and converge to a small neighborhood of the global minimizer w . In comparison, the second mechanism represents the departure of the agents' evolution from the coordinated evolution, which arises from the distributed nature of the system (agents need some time to reach agreement through successive local exchange of information). We will see later how the ACTC diffusion strategy blends these two mechanisms so as to achieve successful learning.
The matrix $J_{\rm tot}$ is the Jordan matrix of $A^\top$ (equivalently, of $A$, since $A$ and $A^\top$ are similar matrices [59]) arranged in canonical form, i.e., made up of the Jordan blocks corresponding to the eigenvalues of $A$ as detailed in Appendix A [59]. In particular, the single eigenvalue equal to 1 corresponds to a $1 \times 1$ block, and the other Jordan blocks can be collected into the $(N-1) \times (N-1)$ reduced Jordan matrix $J$; see (121): $J_{\rm tot} = \begin{bmatrix} 1 & 0 \\ 0 & J \end{bmatrix}$. The columns of $V^{-1}$ collect the generalized right-eigenvectors of $A^\top$ (hence, generalized left-eigenvectors of $A$) [60], [61]. Likewise, the rows of $V$ collect the generalized left-eigenvectors of $A^\top$. Recalling that $\pi$ is a right-eigenvector of $A$, and $1_N$ a left-eigenvector of $A$, the matrices $V$ and $V^{-1}$ can be conveniently block-partitioned as follows: $V = \begin{bmatrix} \pi^\top \\ V_R \end{bmatrix}$, $V^{-1} = [\,1_N \;\; V_L\,]$, where the subscript R is associated with generalized right-eigenvectors of $A$. The same applies to subscript L as regards generalized left-eigenvectors.
We are now ready to detail the network coordinate transformation relevant to our analysis. To this end, we introduce the extended matrix $\mathcal{V} = V \otimes I_M$ and the transformed extended vector $\widehat{q}_i = \mathcal{V}\,\widetilde{q}_i$, whose blocks are $\bar{q}_i$ and $\check{q}_i$. As we see, the transformed vector $\widehat{q}_i$ has been partitioned in two blocks: an $M \times 1$ vector $\bar{q}_i$ and an $M(N-1) \times 1$ vector $\check{q}_i$. These two vectors admit a useful physical interpretation. Vector $\bar{q}_i$ is a linear combination, through the Perron weights, of the $N$ vectors composing $\widetilde{q}_i$, i.e., of the vectors corresponding to all agents. As we will see, such combination reflects a "coordinated" evolution that will apply to all agents. In contrast, vector $\check{q}_i$ is representative of the departure of the individual agent's behavior from the coordinated behavior. In the following, we will sometimes refer to $\bar{q}_i$ as the coordinated-evolution component, and to $\check{q}_i$ as the network-error component. It is worth noticing that the coordinated-evolution component $\bar{q}_i$ is real-valued, whereas the network-error component $\check{q}_i$ is in general complex-valued, since the matrix $V_R$ can contain complex-valued eigenvectors.
In order to highlight the role of the aforementioned two components on the individual agents, we can apply the inverse transformation $\mathcal{V}^{-1}$ to $\widehat{q}_i$, obtaining $\widetilde{q}_{k,i} = \bar{q}_i + T_k\,\check{q}_i$, with $T_k = [V_L]_k \otimes I_M$, where $[V_L]_k$ denotes the $k$-th row of matrix $V_L$. Equation (56) reveals that agent $k$ progresses over time by combining the coordinated evolution $\bar{q}_i$ (which is equal for all agents) and a perturbation vector $T_k\,\check{q}_i$, which quantifies the specific discrepancy of agent $k$ (since matrix $T_k$ depends on $k$) from the coordinated behavior.
From the distributed optimization perspective, the goal is to reach agreement among all agents and, hence, it is necessary that the perturbation term is washed out as time elapses, letting all agents converge to the same coordinated behavior. Establishing that this is the case will be the main focus of our analysis.
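The decomposition into a coordinated component and a perturbation can be visualized with a small numerical sketch. The combination matrix `A` below is a hypothetical left-stochastic example, and the experiment applies pure combination steps only (no adaptation and no compression), so it illustrates the role of the Perron weights rather than the full ACTC dynamics.

```python
import numpy as np

# Hypothetical left-stochastic (columns sum to 1), primitive combination matrix
A = np.array([[0.6, 0.2, 0.0],
              [0.4, 0.5, 0.3],
              [0.0, 0.3, 0.7]])

def perron_vector(A):
    """Right-eigenvector of A for the eigenvalue 1, normalized to sum to 1."""
    eigvals, eigvecs = np.linalg.eig(A)
    idx = np.argmin(np.abs(eigvals - 1.0))
    pi = np.real(eigvecs[:, idx])
    return pi / pi.sum()

pi = perron_vector(A)

rng = np.random.default_rng(2)
q_blocks = rng.standard_normal((3, 2))   # row k holds the vector of agent k
q_bar = pi @ q_blocks                    # Perron-weighted (coordinated) component

# Repeated combination steps w_k <- sum_l a_{lk} w_l wash out the perturbation:
# every row converges to the same coordinated component q_bar.
mixed = np.linalg.matrix_power(A.T, 200) @ q_blocks
```

After enough combination steps all agents hold (numerically) the same vector, equal to the Perron-weighted combination of the initial vectors.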
For later use, it is convenient to introduce also the transformed versions of δ i , s i and b: The zero entry in the transformed bias vector arises from the fact that, in view of (39), we have: since the Perron-weighted sum of the gradients computed at the limit point w corresponds to the exact minimizer of the global cost function in (12).

V. MEAN-SQUARE STABILITY
In order to assess the goodness of an individual agent's estimate $w_{k,i}$ we will focus on the mean-square-deviation $\mathbb{E}\|\widetilde{w}_{k,i}\|^2 = \mathbb{E}\|w_{k,i} - w^\star\|^2$. As we explained before, it is instrumental to work in terms of the quantized iterates $\widetilde{q}_{k,i}$ and, more precisely, in terms of the transformed vectors $\widehat{q}_i = \mathcal{V}\,\widetilde{q}_i$. In order to characterize the mean-square evolution of the transformed vectors, it is particularly convenient to adopt the formalism of the energy operators introduced in [13].
Definition 1 (Average Energy Operator). Given a random vector $x = [x_1^\top, x_2^\top, \ldots, x_N^\top]^\top$, where each $x_k$, for $k = 1, 2, \ldots, N$, is an $M \times 1$ vector, we consider the operator $P[x] = [\mathbb{E}\|x_1\|^2, \mathbb{E}\|x_2\|^2, \ldots, \mathbb{E}\|x_N\|^2]^\top$. In particular, when we apply the average energy operator in (63) to one of our transformed vectors, e.g., to (55), we obtain the block decomposition $P[\widehat{q}_i] = [\mathbb{E}\|\bar{q}_i\|^2, \; P[\check{q}_i]^\top]^\top$, where $P[\check{q}_i]$ is an $(N-1) \times 1$ vector. Examining the time evolution of the energy vectors in (64) is critical because the individual agents' errors $\mathbb{E}\|\widetilde{q}_{k,i}\|^2$ can be related to these energy vectors through the inverse network transformation $\mathcal{V}^{-1}$ in (56). In particular, we will be able to show that the quantity $\mathbb{E}\|\bar{q}_i\|^2$ plays a dominant role in determining the (common) individual agents' steady-state mean-square behavior, whereas the quantity $P[\check{q}_i]$ plays the role of a network error quantity that dies out during the transient phase.
We start with three lemmas that characterize the interplay over time (in terms of energy) of three main quantities in the network transformed domain: the gradient noise s i , the quantization error δ i , and the quantized iterates q i .
The first lemma relates the gradient noise to the quantized iterates.
Lemma 1 (Gradient Noise Energy Transfer). The average energy of the transformed gradient noise extended vector s i evolves over time according to the following inequality: where the transfer matrix T s and the driving vector x s = [x s , q x s ] are defined in Table II. Proof: See Appendix B.

[Table: bounding constants at agent k for the gradient noise and the bias (where the equality holds since the bias is not random), and the network error matrix, with J = Λ + U the reduced Jordan matrix in (124).]
The second lemma relates the quantization error to the quantized iterates and to the gradient noise.
Lemma 2 (Quantization Error Energy Transfer). The average energy of the transformed quantization-error extended vector δ i evolves over time according to the following inequality: where the transfer matrix $T_\delta$ and the driving vector $x_\delta = [\bar{x}_\delta, \check{x}_\delta]$ are defined in Table II.
The third lemma relates the quantized iterates to the quantization error and the gradient noise.

Lemma 3 (Quantized Iterates Energy Transfer).
Let where ν is the global-strong-convexity constant introduced in (13) and η is the average Lipschitz constant in (14). Let where the constants∆ and q ∆ are defined in Table I. Then, the average energy of the transformed extended vector q i evolves over time according to the following inequality: where the transfer matrix T q and the driving vector x q = [x q , q x q ] are defined in Table II.
Proof: See Appendix D.
Combining Lemmas 1, 2, and 3 we arrive at a recursion on the quantized iterates q i , as stated in the next theorem. Unless otherwise specified, all matrices, vectors and constants in the statement of the theorem can be found in Tables I and II. where: and with φ being a positive scalar that embodies the constants appearing in the µ 2 -terms of the transfer matrices T s , T δ and T q in Table II. The evaluation of φ is rather cumbersome and is detailed in Appendix E. Let also where the driving vectors x s , x δ and x q are defined in Table II. Then, the average energy of the extended vector q i obeys the following inequality: By inspection of (70), we see that the transfer matrix T can be written as the sum of an upper-diagonal matrix T 0 and a rank-one perturbation of order µ 2 . In particular, the upper-diagonal structure of T 0 implies that the evolution relating the network error components P[ q q i−1 ] to P[ q q i ] takes place only through matrix E.
Accordingly, such matrix will be referred to as the network error matrix. Moreover, it is worth noticing that E is independent of the step-size µ.
It is tempting to conclude from (70) that the rank-one perturbation can be neglected as µ → 0. Were the matrix $T_0$ independent of µ, this conclusion would be obvious. However, since $T_0$ does depend upon µ, it is not obvious that, for small µ, the recursion in Theorem 1 can be examined by replacing $T$ with $T_0$. We will be able to show that this is actually the case, but to this end we need to carry out the demanding technical analysis reported in the appendices.
Nevertheless, to gain insight on how recursion (75) and the quantities in Table II are relevant to the mean-square behavior of the ACTC strategy, let us simply assume for now that we can replace $T$ with $T_0$. Under this assumption, by developing the inequality recursion (75) we would arrive at the following inequality: $P[\widehat{q}_i] \preceq T_0^{\,i}\, P[\widehat{q}_0] + \sum_{j=0}^{i-1} T_0^{\,j}\, x$. Assume that $T_0$ is stable. Then, from (76) we would have: $\limsup_{i \to \infty} P[\widehat{q}_i] \preceq (I - T_0)^{-1} x$, where we exploited the upper triangular shape of $T_0$ to evaluate the inverse $(I - T_0)^{-1}$. Examining (74) and Table II, we see that all the entries of $x$ scale as $\mu^2$. Accordingly, from (77) we obtain the following important conclusions regarding the mean-square stability of the ACTC strategy.
The term E q i 2 scales as O(µ), whereas the energy of the network error component E q q i 2 is a higher-order term scaling as O(µ 2 ).
Since all entries in $x$ scale as $\mu^2$, at the leading order in µ the behavior of $\mathbb{E}\|\bar{q}_i\|^2$ is determined by the (1, 1)-entry of $(I - T_0)^{-1}$ multiplied by $\bar{x}$, i.e., the first entry in $x$. Inspecting Table II, we see that: Equation (78) contains useful information as regards the specific interplay between the compression stage and the mean-square behavior of the ACTC strategy. In the absence of compression, the contribution due to ∆ disappears (since with unquantized data $\omega_k = 0$ for all agents), and the only relevant term would then be the first component $\bar{x}_s$ corresponding to the driving vector of the gradient noise energy transfer in Lemma 1. This behavior matches perfectly the behavior of the classical (i.e., unquantized) ATC strategy.
On the other hand, when compression comes into play, another term arises, which mixes both components of x δ and x s , owing to the ∆1 N perturbation in (78). In particular, in the additional term due to quantization we can recognize the following relevant quantities. First, we see that the gradient noise component contributes again to the ACTC error through the quantity ζ 2∆x s .
Second, and differently from the classical ATC, also the network component q x s of the gradient noise is now injected into the ACTC error by the quantization mechanism. Finally, there is a term depending on the driving vector q x δ pertaining to the quantization error examined in Lemma 2. Notably, from Table II we see that q x δ depends on the bias term P[ q b ]. This means that, in the absence of bias (e.g., when the true cost functions of all agents are minimized at the same location), the term q x δ is zero.
In summary, the steady-state error contains a classical term determined by the gradient noise, plus an additional term arising from data compression. In the latter error, backpropagation of the quantization error lets additional components of the gradient noise and the bias seep into the ACTC evolution, determining an increase of the mean-square-deviation.
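The scaling argument above can be mimicked with a toy two-dimensional recursion. The matrix and vector below are purely hypothetical stand-ins, not the quantities of Table II: the matrix is upper triangular with a (1, 1)-entry of the form $1 - O(\mu)$ and a (2, 2)-entry independent of µ, and the driving vector scales as $\mu^2$. Under these assumptions the first entry of the limit is $O(\mu)$ and the second is $O(\mu^2)$.

```python
import numpy as np

def toy_steady_state(mu, nu=1.0, rho_net=0.5):
    """Toy upper-triangular recursion y_i = T0 y_{i-1} + x (hypothetical entries)."""
    T0 = np.array([[1.0 - mu * nu, mu],
                   [0.0, rho_net]])      # (2, 2)-entry does not depend on mu
    x = mu ** 2 * np.ones(2)             # all driving entries scale as mu^2
    bound = np.linalg.solve(np.eye(2) - T0, x)   # (I - T0)^{-1} x
    return T0, x, bound

# Iterating the recursion converges to (I - T0)^{-1} x when T0 is stable
T0, x, bound = toy_steady_state(1e-2)
y = np.ones(2)
for _ in range(5000):
    y = T0 @ y + x

# Reducing mu by a factor 10 reduces the first entry by about 10 (order mu)
# and the second entry by 100 (order mu^2)
ratio = toy_steady_state(1e-2)[2] / toy_steady_state(1e-3)[2]
```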
The qualitative arguments illustrated above will be made rigorous in the appendices, providing technical guarantees for the ACTC strategy to be mean-square stable, with a steady-state mean-square-deviation on the order of O(µ), as stated in the next theorem. In order to state the theorem, it is necessary to introduce a useful function that will be critical to evaluate the mean-square stability of the ACTC strategy.
Following the canonical Jordan decomposition illustrated in Appendix A, we denote by $\lambda_n$ the eigenvalue of $A$ associated with the $n$-th Jordan block of $A$, by $L_n$ the dimension of this block, and by $B$ the number of blocks. Let and let us introduce the function: and let where µ is the positive root of the equation: with $\bar{\Delta}$ and $\check{\Delta}$ being defined in Table I, and φ is a positive scalar that embodies the constants appearing in the $\mu^2$-terms of matrices $T_s$, $T_\delta$ and $T_q$ in Table II. The evaluation of φ is rather cumbersome and is detailed in Appendix E. Then the ACTC strategy is mean-square stable, namely, Moreover, in the small step-size regime the mean-square-deviation is of order µ, namely,

A. Insights from Theorem 2
The stability analysis leading to Theorem 2 and carried out in Lemmas 6 (Appendix F) and 7 (Appendix G) is conducted in the transformed (complex) z-domain exploiting the formalism of resolvent matrices. This turns out to be a powerful approach that allows us to get necessary and sufficient conditions for stability.
In particular, the condition on ζ in (81) is a necessary and sufficient condition for the stability of matrix E, which is in turn related to the speed of decay of the network transient, namely, to how fast the network agents coordinate among themselves to converge to a small neighborhood of the (true) minimizer w . Condition (81) reveals the main utility of introducing the parameter ζ in the ACTC algorithm. Notably, this parameter is not present (i.e., ζ = 1) in the unquantized version of the diffusion algorithm, i.e., in the ATC algorithm. However, Eq. (81) shows that not necessarily the value ζ = 1 grants stability.
Examining the structure of γ(A) in (80), we see that the RHS of (81) depends on two main elements, namely, i) a constant q ∆ that contains the quantization blow-up factors ω k ; ii) the eigenstructure of A, and particularly the second largest magnitude eigenvalue λ 2 . Let us examine in detail the role played by each of these elements.
Regarding the quantization constant q ∆, we see that poorer resolutions (i.e., higher values of q ∆) go against stability, an effect that can be compensated by choosing smaller values for ζ. This means that ζ is useful to compensate for the quantization error that seeps into the recursion of the individual agent's errors.
Regarding the eigenstructure of A, the fundamental role of matrix E for mean-square stability is summarized by the function γ(A) in (80). This function provides an accurate stability threshold by capturing the full eigenstructure of A through the eigenvalues λ n , the size and the number of Jordan blocks. In this way, we are given the flexibility of providing accurate stability thresholds for different types of combination matrices.
Let us illustrate these useful features in relation to some relevant cases and existing results.
A detailed stability analysis of the classical (i.e., uncompressed) ATC diffusion strategy under the general assumptions considered in this work was originally carried out in [13], [19]. Such analysis relies on a more general Jordan decomposition (with arbitrary $\epsilon$ replacing the ones on the superdiagonal), which is instrumental to get an $\ell_1$ upper bound on the spectral radius, which in turn provides a sufficient condition for mean-square stability. One feature of this sufficient condition is that the stability range for µ scales exponentially as $\epsilon^{-N}$, with $\epsilon < 1$, a condition that becomes stringent for large-scale networks.
The analysis conducted in this work is different, and focuses on the resolvent matrices in the z-transformed complex domain. Thanks to this approach, we obtain a (necessary and sufficient) condition for stability that accounts for the entire eigenspectrum of A through the function γ(A) in (80). We will now explain how such eigenstructure plays a role for different types of combination matrices, and how the actual conditions for stability are in fact milder than the aforementioned exponential scaling.
Diagonalizable Matrices. For simplicity, let us consider the case that, excluding $\lambda_1 = 1$, all the remaining eigenvalues are equal, namely, $\lambda_n = \lambda_2$ for $n > 1$. Note that this yields $a_n = \text{const.}$ in (79). If $A$ is diagonalizable, we have $B = N$ and $L_n = 1$ for all $n > 1$, which, using (80), yields: i.e., γ(A) scales linearly with $N$ (see footnote 5).
"Very" Non-Diagonalizable Matrices. Consider the opposite case where $B = 2$ and $L_n = N - 1$, namely, apart from the first Jordan block (i.e., the one associated with the single eigenvalue $\lambda_1 = 1$), we have only another block associated with $\lambda_n = \lambda_2$. Under this setting, we see from (80) that which implies an exponential scaling with $N$.
Typical Non-Diagonalizable Matrices. The exponential scaling observed in (87) is clearly not desirable for stability, since, in light of (81), it would significantly reduce the stability range.
However, typical non-diagonalizable matrices adopted in distributed optimization applications seldom feature the extreme eigenstructure described above. As a matter of fact, if we perform the Jordan decomposition of typical combination matrices, we see that the size of the Jordan blocks is usually modest, and in any case it does not increase linearly with N . This means that the exponents in (80) would be determined by the maximum Jordan block size, and not by the network size. Moreover, taking into account the fact that for typical combination matrices many eigenvalue magnitudes are considerably smaller than the second largest magnitude, the stability thresholds obtained through (80) are significantly far from exhibiting an exponential scaling with N .
We note also that larger values of $|\lambda_n|$, and particularly of $|\lambda_2|$, go against the stability of the network error matrix E, which means that ζ is useful to regulate the stability when the network component evolves more slowly, i.e., when $|\lambda_2|$ is closer to 1.
Footnote 5: We remark that, in the case of diagonalizable A, the linear scaling is a tight estimate, since a known result about rank-one perturbations of diagonal matrices allows us to evaluate the spectral radius exactly as the sum of the spectral radius of the unperturbed matrix plus N times the size of the perturbation [62].
In summary, the network error convergence depends upon the eigenstructure of A and the quantizer's resolution. While the role of the eigenstructure of A (and, hence, of the network connectivity) is common to the standard ATC strategy, one distinguishing feature of the ACTC strategy is represented by the fact that the spectral radius of E increases due to the presence of the additional factor q ∆. This means that the agreement among agents slows down due to backpropagation of the quantization error. However, and remarkably, this slowdown does not preclude the possibility of accurate performance given sufficient time for learning. Paralleling classical coding results from Shannon's theory, we could say that the price of compression is not the impossibility of learning, rather a slowdown in the convergence.

VI. TRANSIENT ANALYSIS
The next result refines the mean-square stability result in Theorem 2 to characterize the learning dynamics of the ACTC strategy. In particular, we will characterize the transient phases of the algorithm before it converges to the steady state.
Theorem 3 (ACTC Learning Behavior). Let where ν is the global strong convexity constant in (13). Assume that let ρ(E) be the spectral radius of E (which under assumption (89) has been proved to be smaller than 1), and set $\epsilon > 0$ such that: Using the definitions of the compression factor Ω in (33), of the entries $\{\pi_\ell\}$ of the Perron eigenvector in (10), of the scaled step-sizes $\{\alpha_\ell\}$ in (7), and of the gradient-noise variances $\{\sigma_\ell^2\}$ in (37), in the small-µ regime the evolution of the mean-square-deviation of the individual agent k can be cast in the form, for all i > 0: where the Big-O terms depend in general on the particular agent k, except for the O(1) term multiplying $\rho_{\rm cen}^i$, and $c_q$ is a positive constant independent of µ and i.
which allows us to examine closely the distinct learning phases as detailed in the following remarks.
-Transient Phases. First, we notice that the network rate ρ net depends only on the stability parameter ζ, and on the network connectivity properties through the eigenspectrum of the combination matrix A. As a result, we see that for sufficiently small µ we have that ρ cen > ρ net . Accordingly, for small step-sizes µ, the transient associated with the convergence of the network solution toward the centralized solution dies out earlier (Phase I). After this initial transient, a second transient dominates (Phase II), which is relative to the slower (since ρ cen > ρ net ) process that characterizes the convergence of the centralized solution to the steady-state.
Remarkably, these two distinct phases of adaptive diffusion learning have already been identified in the context of adaptive learning over networks without communication constraints [13], [14].
-Compression Loss. After transient Phase II, the ACTC behavior is summarized in the following upper bound on the steady-state mean-square-deviation: First of all, we remark that the product µ ζ stays fixed once we set a prescribed convergence rate ρ cen . Then, for a given convergence rate, we see that the mean-square-deviation is composed of two main terms. The first term does not depend on the amount of compression, and is proportional to an average over the Perron weights {π } of the scaled gradient noise powers {α 2 σ 2 }, further divided by the global strong convexity constant ν.
The second term on the RHS of (93) is the compression loss, which is in fact an increasing function of the compression factor Ω. The limiting case Ω = 0 corresponds to the setting without compression, i.e., to the ACTC algorithm in (20) where the compression operator is the identity operator, formally: where we referred to the strategy with Ω = 0 as to the uncompressed ACTC.
In particular, if we choose as compression operators the randomized quantizers examined in Sec. III-A, we can obtain an explicit connection between the mean-square-deviation and the bit-rate. In fact, examining (32), we see that either: In summary, in both cases we can write: where we denoted by $r_{\min}$ the minimum bit-rate across agents. From (94) and (97) we conclude that the increase in mean-square-deviation from the uncompressed to the standard ACTC is given by: which reveals the following remarkable analogy with the fundamental laws of high-resolution quantization: for small step-sizes, the excess mean-square-deviation due to quantization scales exponentially with the bit-rate as $2^{-2 r_{\min}}$ [42].
-Comparison Against Classical ATC. From the ACTC algorithm in (20), it is readily seen that the uncompressed ACTC (i.e., the ACTC with compression operator equal to the identity operator) coincides with the classical ATC only for the case ζ = 1, yielding: (see footnote 6)
Footnote 6: Actually, in (99) MSD ATC is an upper bound on the exact mean-square-deviation. This upper bound would be obtained by applying the mean-square stability analysis carried out in [13]. An exact evaluation (and not a bound) for the mean-square-deviation is instead provided in [14]. However, since the present work focuses on the mean-square stability analysis for the ACTC scheme and not on a steady-state analysis, the appropriate term of comparison is (99), and not the refined value that would be obtained from [14].
However, to compare the two strategies fairly, we need to set the same desired value for the rate. Letting R be the desired rate, we have the following relationships for the uncompressed ACTC and for the ATC:
which, substituted in (99) yields: where we remark that the higher-order terms $O(\mu^{3/2})$ do not cancel out since, in general, the $O(\mu^{3/2})$ corrections are different for the uncompressed ACTC and the ATC strategies. In particular, the stabilization parameter ζ typically entails a slight increase in the mean-square-deviation that is incorporated in the $O(\mu^{3/2})$ correction. Equation (101) highlights that, in the small-µ regime, the uncompressed ACTC and classical ATC are equivalent at the leading order in µ. Therefore, joining (94), (100), and (101), we conclude that: In particular, for the randomized quantizers in Sec. III-A, from (98) we get: namely, for sufficiently small µ the excess of mean-square-deviation of the ACTC strategy w.r.t. the classical ATC strategy scales as $\approx 2^{-2 r_{\min}}$.
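The exponential law $2^{-2 r_{\min}}$ has a simple reading in decibels: each additional quantization bit lowers the excess mean-square-deviation by about 6 dB. The short Python sketch below only evaluates the scaling factor; the multiplicative constants in front of it are not modeled, and the function name is hypothetical.

```python
import math

def excess_scaling_db(r_min):
    """Scaling factor 2^(-2 r_min) of the excess mean-square-deviation, in dB."""
    return 10.0 * math.log10(2.0 ** (-2 * r_min))

# Change in dB when one bit is added (independent of the starting bit-rate)
per_bit_gain_db = excess_scaling_db(3) - excess_scaling_db(2)
```

Since $10 \log_{10} 2^{-2} \approx -6.02$, the variable `per_bit_gain_db` is approximately $-6.02$ dB.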

VII. ILLUSTRATIVE EXAMPLES
As an application of the ACTC diffusion strategy, we consider the scenario where N agents, interconnected through a network satisfying Assumptions 2 and 3, aim at solving a regression problem in a distributed way.
Each individual agent observes a flow of streaming information. Specifically, at each time i, agent k observes data $d_{k,i} \in \mathbb{R}$ and regressors $u_{k,i} \in \mathbb{R}^M$, which obey the following linear regression model: $d_{k,i} = u_{k,i}^\top w^\star + v_{k,i}$, where $w^\star \in \mathbb{R}^M$ is an unknown (deterministic) parameter vector and $v_{k,i} \in \mathbb{R}$ acts as noise. We assume that processes $\{u_{k,i}\}$ and $\{v_{k,i}\}$ have mean equal to zero, are independent both over time and space (i.e., across the agents), with second-order statistics given by, respectively: $R_{u,k} = \mathbb{E}[u_{k,i} u_{k,i}^\top]$ and $\sigma^2_{v,k} = \mathbb{E}[v_{k,i}^2]$. The goal is to estimate the unknown $w^\star$, which, by applying straightforward manipulations to (104), can be seen to obey the relationship: $R_{u,k}\, w^\star = r_{du,k}$, where $r_{du,k} = \mathbb{E}[d_{k,i} u_{k,i}]$. In principle, each agent could perform estimation of $w^\star$ by solving the optimization problem: which in turn corresponds to adopting the following quadratic loss and cost functions: There are several reasons why the agents may be interested in solving the regression problem in a cooperative fashion. First of all, it was shown in [19] that, under suitable design, cooperation is beneficial in terms of inference performance. Even more remarkably, in many cases the local regression problem (107) can be ill-posed if the agents' regressors do not contain sufficient information. This is an issue classically known as collinearity, which basically implies that the regression covariance matrix $R_{u,k}$ is singular, and many $w$ exist that solve (107). This behavior can be easily grasped by noticing that and, examining (106), we see that if $R_{u,k}$ is not invertible, $w^\star$ is a solution to the optimization problem, but not the only one. Accordingly, reliable inference about $w^\star$ is impaired by collinearity. Technically, when the regression covariance matrix of agent k is singular, the cost function $J_k(w)$ is not strongly convex, and the true $w^\star$ is one among the minimizers of (107).
However, if we now replace (107) with its global counterpart: it is readily seen that a single agent with a non-singular $R_{u,k}$ (i.e., with a strongly-convex cost function) is able to enable successful inference! Notably, such a minimal requirement is sufficient for our ACTC strategy to solve the regression problem in a distributed way and under communication constraints.
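The collinearity issue, and the way a single well-conditioned agent resolves it at the network level, can be illustrated with a two-agent toy example. All numbers below (covariance matrices, weights, parameter vector) are hypothetical, and the global problem is represented simply as a positively weighted sum of the local normal equations, which is an assumption about the form of the aggregate cost.

```python
import numpy as np

w_true = np.array([1.0, -2.0, 0.5])      # hypothetical unknown parameter vector

# Agent 1: singular regression covariance (the third regressor entry is absent)
R1 = np.diag([2.0, 1.0, 0.0])
# Agent 2: non-singular covariance, i.e., strongly convex local cost
R2 = np.diag([1.5, 3.0, 2.0])
r1, r2 = R1 @ w_true, R2 @ w_true        # cross-covariances r_du,k = R_u,k w_true

# Locally, agent 1 cannot identify w_true: a different vector solves
# the same normal equations R1 w = r1
w_alt = w_true + np.array([0.0, 0.0, 5.0])
local_ambiguity = np.allclose(R1 @ w_alt, r1)

# Globally, a positively weighted aggregate is non-singular
weights = np.array([0.5, 0.5])           # hypothetical positive weights
R_glob = weights[0] * R1 + weights[1] * R2
r_glob = weights[0] * r1 + weights[1] * r2
w_hat = np.linalg.solve(R_glob, r_glob)  # unique solution, equal to w_true
```

Here `local_ambiguity` is `True`, while the aggregate system has the unique solution `w_true`.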

A. Role of Compression Degree
In Fig. 2, we examine the learning performance of the ACTC strategy as a function of the iteration i, for different values of quantization bits. The simulations were run under the following setting. The regression problem has dimensionality M = 50. The covariance matrices $R_{u,k}$ are all diagonal, and the associated regressors' variances are drawn as independent realizations from a uniform distribution with range (1, 4). The noise variances $\sigma^2_{v,k}$ are drawn as independent realizations from a uniform distribution with range (0.25, 1). The network is made of N = 10 agents, connected through the topology displayed in Fig. 2, and with Metropolis combination policy. Under this setting, we run the ACTC algorithm with stability parameter ζ = 0.25, and with equal step-sizes $\mu_k = \mu = 4 \times 10^{-3}$. Likewise, the common number of bits employed by the agents will be denoted by r, and we will examine the ACTC behavior for different values of r, namely, ranging from 1 to 4 bits. The compression operator is the randomized quantizer described in Sec. III-B. All errors are estimated by means of $10^2$ Monte Carlo runs.
As a first performance index, we examine the network ACTC learning performance, i.e., the mean-square-deviation averaged over all agents: $\frac{1}{N} \sum_{k=1}^{N} \mathbb{E}\|\widetilde{w}_{k,i}\|^2$. The behavior observed in Fig. 2 summarizes sharply the essential characteristics of the ACTC algorithm, as captured by Theorem 3: i) for all quantizer's resolutions, the mean-square-deviation has a transient that is essentially governed by the predicted rate $\rho_{\rm cen} = (1 - \mu \zeta \nu)^2$ (dashed line); ii) some higher-order discrepancies are absorbed in an initial (much faster) transient; iii) the ACTC errors corresponding to different bits converge to different steady-state error values that, already at relatively low bit-rates, approach the performance of the ATC (i.e., unquantized) strategy.
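A reduced-size version of this experiment can be sketched as follows. The sketch is not the simulation code behind Fig. 2: it assumes that the ACTC recursion in (20) takes the form $\psi_{k,i} = w_{k,i-1} - \mu \widehat{\nabla J}_k(w_{k,i-1})$, $q_{k,i} = q_{k,i-1} + \zeta\, Q(\psi_{k,i} - q_{k,i-1})$, $w_{k,i} = \sum_\ell a_{\ell k} q_{\ell,i}$, and it uses a hypothetical 4-agent ring, unit-variance regressors, a single run instead of Monte Carlo averaging, and $L = 2^r - 1$ quantization intervals.

```python
import numpy as np

def quantize(x, num_bits, rng):
    """Randomized quantizer sketch (norm sent exactly, L = 2^r - 1 assumed)."""
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return np.zeros_like(x)
    L = 2 ** num_bits - 1
    scaled = np.abs(x) / norm * L
    lower = np.floor(scaled)
    j_tx = lower + (rng.random(x.shape) < scaled - lower)
    return norm * np.sign(x) * j_tx / L

def run_actc(num_iter=3000, N=4, M=5, mu=1e-2, zeta=0.25, num_bits=4, seed=0):
    rng = np.random.default_rng(seed)
    # Hypothetical ring topology with a doubly-stochastic combination matrix
    eye = np.eye(N)
    A = 0.5 * eye + 0.25 * (np.roll(eye, 1, axis=0) + np.roll(eye, -1, axis=0))
    w_true = np.ones(M)                   # hypothetical parameter vector
    sigma_v = 0.5                         # hypothetical noise standard deviation
    q = np.zeros((N, M))                  # quantized iterates q_k
    w = A.T @ q                           # combined iterates w_k
    msd = np.empty(num_iter)
    for i in range(num_iter):
        for k in range(N):
            u = rng.standard_normal(M)
            d = u @ w_true + sigma_v * rng.standard_normal()
            psi = w[k] + mu * u * (d - u @ w[k])                 # adapt
            q[k] = q[k] + zeta * quantize(psi - q[k], num_bits, rng)  # compress
        w = A.T @ q                                              # combine
        msd[i] = np.mean(np.sum((w - w_true) ** 2, axis=1))     # network error
    return msd

msd = run_actc()
```

Plotting `10 * np.log10(msd)` against the iteration index should show a fast initial transient followed by a slower decay toward a steady-state floor, in qualitative agreement with the behavior described above.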
With reference to the same setting of Fig. 2, but for a smaller dimensionality M = 10, in Fig. 3 we display the excess of mean-square-deviation of the ACTC strategy w.r.t. the uncompressed ACTC, see (94), and w.r.t. classical ATC, see (102). We remark that the curves in Fig. 3 do not represent mean-square-deviations, but the difference between the mean-square-deviation attained by the ACTC diffusion strategy, and the mean-square-deviation that would be attainable in the absence of data compression by the uncompressed ACTC and the classical ATC. Accordingly, such excess of error summarizes only the effect of data compression, and is therefore expected to reduce as the bit-rate increases (while the overall mean-square-deviation cannot vanish since we are in a stochastic-gradient environment). We see that the curves scale with the number of bits as: This result is in perfect accordance with the predictions of Theorem 3; see (98) and (103).
As we discussed when commenting on (102), the uncompressed ACTC strategy features a slight increase in mean-square-deviation, contained in the higher-order O(µ^{3/2}) correction. Such a small increase becomes visible only when comparable to the quantization error, i.e., when the quantization error becomes negligible. In the particular example of Fig. 3, this happens when the quantization error is ≈ −55 dB.
In Fig. 4, we continue by examining the performance of the individual agents. In accordance with our results, all agents behave similarly, both in terms of transient and steady-state behavior. As shown by (91), initial discrepancies between the agents (see the inset plot) are absorbed into a faster network transient, after which all agents act in a coordinated manner, and converge to the steady-state value.
Finally, in Fig. 5 we examine the joint role of the step-size µ and of the stabilizing parameter ζ. Again, the theoretical predictions are confirmed, since we see that by keeping the product µζ constant, all curves behave
It is useful to evaluate the saving, in terms of bits, achieved with the ACTC strategy. To this end, we must recall that the randomized quantizers in Sec. III-B compress finely (say, with machine precision 32 bits) the norm of the vectors to be quantized, send one additional bit for the sign of each entry, and then apply random quantization with r bits to each entry (see (31)). Accordingly, given a dimensionality M, a number of iterations i_max, and a number of quantization bits r, the overall bit expense of each agent is: Applying this formula to the setting in Fig. 2, we see that, for the time necessary to enter reliably the steady state (i_max ≈ 2500), and referring to the coarser scheme that uses only 2 bits, we get: This value should be compared against the expense required by the plain ATC strategy, where each entry of the vector to be transmitted is represented by 32 bits, yielding: implying a remarkable gain of about one order of magnitude in terms of bit rate. This gain should be evaluated in relation to the loss, in terms of mean-square-deviation, arising from data compression. Inspecting Fig. 2, we see that we lose ≈ 4 dB, which is definitely tolerable, especially in light of the remarkable bit-rate savings.
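The bit accounting described above can be checked with a short computation. The sketch below is an assumption-laden reading of the verbal description, not the paper's own formula: per iteration, each agent is assumed to spend 32 bits on the norm, plus one sign bit and r quantization bits for each of the M entries. The function names `actc_bits` and `atc_bits` are hypothetical, and "2 bits" is taken to mean r = 2.

```python
def actc_bits(M, i_max, r, norm_bits=32):
    """Assumed per-agent bit expense of the compressed scheme.

    Per iteration: `norm_bits` for the vector norm, then one sign bit
    and `r` quantization bits for each of the M entries.
    """
    return i_max * (norm_bits + M * (1 + r))

def atc_bits(M, i_max, entry_bits=32):
    """Per-agent bit expense when every entry is sent at full precision."""
    return i_max * M * entry_bits

M, i_max, r = 50, 2500, 2          # setting of Fig. 2, with r = 2 assumed
compressed = actc_bits(M, i_max, r)   # 455000 bits
plain = atc_bits(M, i_max)            # 4000000 bits
ratio = plain / compressed            # roughly 8.8
```

Under these assumptions the ratio is a little below 9, which is consistent with the stated gain of about one order of magnitude.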

B. Unidentifiable Problem
We now move on to consider the challenging case where only one agent (say, agent 1) has a locally identifiable regression problem.
Technically, the regressors' matrix R u,1 is invertible, whereas the remaining matrices R u,k , for k > 1, are singular. In particular, agents k = 2, 3, . . . , N solve a regression problem with two linearly dependent features.
In the following, agent 1 will be referred to as "farsighted agent", whereas the other agents as "singular agents".
Moreover, we assume that all agents have the same regressor and noise variances σ^2_u and σ^2_v, and we consider equal step-sizes at all agents, such that the vector p coincides with the Perron eigenvector.
Under this setting the local cost functions of the singular agents are not strongly convex, whereas the aggregate cost function in (111) is strongly convex, with constant ν given by: where p 1 is the entry of the Perron eigenvector corresponding to the farsighted agent.
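The claim that one farsighted agent is enough to make the aggregate cost strongly convex can be illustrated numerically. The sketch below rests on assumptions that are not taken from the paper: the singular agents are modeled by making the first two features identical, the farsighted agent has covariance σ^2_u I, the Perron entries are uniform (as for a doubly-stochastic policy), and each local Hessian is taken to be proportional to the corresponding regressor covariance. All variable names are hypothetical.

```python
import numpy as np

M, N = 4, 10          # small illustrative sizes
sigma_u2 = 1.0        # common regressor variance

# Farsighted agent: invertible covariance.
R_far = sigma_u2 * np.eye(M)

# Singular agents: first two features identical, so the covariance
# has the rank-one block [[1, 1], [1, 1]] and loses one rank.
R_sing = sigma_u2 * np.eye(M)
R_sing[:2, :2] = sigma_u2 * np.ones((2, 2))

# Uniform Perron entries, as for a doubly-stochastic combination policy.
p = np.full(N, 1.0 / N)

# Aggregate covariance: weighted sum of the local covariances.
R_agg = p[0] * R_far + p[1:].sum() * R_sing

min_eig_sing = np.linalg.eigvalsh(R_sing).min()   # zero: not strongly convex
min_eig_agg = np.linalg.eigvalsh(R_agg).min()     # strictly positive
```

In this toy model the smallest eigenvalue of the aggregate covariance equals p_1 σ^2_u: the direction that every singular agent leaves unidentified is covered only by the farsighted agent, scaled by its Perron weight.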
In Fig. 6, we consider a doubly-stochastic combination policy, namely, the Metropolis rule. Five main conclusions arise. First, agent 1 in isolation is able to learn fairly well, with a steady-state error ≈ −26 dB, while the other agents, when in isolation, are unable to learn properly, with steady-state errors ≈ 18 dB. 8 Second, when align with the farsighted one. 9 Fourth, despite the fact that N − 1 agents have a singular regressors' matrix, they contribute to accelerate convergence to the steady state. However, we see that the cooperative steady-state performance is equivalent to the individual (i.e., non-cooperative) performance of the farsighted agent. We will now show that this conclusion is not a general conclusion, and depends on the particular combination policy.
To this end, in Fig. 7 we consider the same setting of  that we are choosing the best Perron eigenvector for the ACTC strategy, but we will now see that this turns out to be a meaningful choice. In fact, we see from Fig. 7 that, with the optimized combination policy, network cooperation achieves a twofold goal. As in the case of a doubly-stochastic policy, cooperation is beneficial to the singular agents. Moreover, it is also beneficial to the farsighted agent, which is now able to improve on the steady-state performance achieved without cooperation.
In summary, the conducted experiments lead to a revealing conclusion as regards the role of topology on the learning performance. By suitable design of the combination matrix, the regularization action played by agent 1 makes the singular agents capable of contributing more fully to the optimization problem, allowing all agents to achieve a mean-square-deviation that outperforms the non-cooperative performance achievable by the farsighted agent in isolation.
Finally, the convergence behavior of the ACTC strategy is visually illustrated in Fig. 8, with reference to a simple example with dimensionality M = 2, and N = 20 agents. Let us consider first the case where the N agents are all singular (red squares). In this case, we see that, moving from the initial iterates (blue circles), the singular agents follow wrong paths converging around the wrong point (4.5, 1.5) = w . In contrast, the ACTC strategy (green circles) allows all agents to converge well to a small neighborhood of the true minimizer, after an initial transient where they need to coordinate with each other.

C. Comparison with Existing Strategies
As we have illustrated in Sec. I-A, the present work generalizes the existing works on compressed distributed implementations under several aspects, including left-stochastic combination policies, lack of local strong convexity, diffusion strategies. For this reason, the existing theoretical results cannot cover the challenging setting considered in the present work, which required instead a significant additional effort. Nevertheless, even if formulated and studied under alternative settings, some of the existing algorithms can be practically applied to our setting. In particular, we select from the existing algorithms two particular up-to-date implementations that, as far as we know, constitute the actual benchmark performance, namely, CHOCO-SGD [55] and its dual version, DUAL-SGD [57]. Notably, the latter two algorithms have more tuning parameters than our algorithm.
Even if the necessity itself of tuning more parameters might be considered a disadvantage of these strategies, in order to ensure a fair comparison we performed a fine tuning of all the parameters to guarantee best performance of CHOCO-SGD and DUAL-SGD. The shaded areas shown in Fig. 9 correspond to the range of mean-square-deviations spanned by a subset of the parameters explored during the tuning phase. Figure 9 displays the comparison involving the proposed ACTC strategy and the aforementioned two strategies.
Remarkably, for the same value of the transient time, the ACTC strategy outperforms both CHOCO-SGD and DUAL-SGD at steady state. In particular, we see that DUAL-SGD performs appreciably worse than the implementations in the primal domain. This conclusion is in perfect agreement with what was shown in [63] for the uncompressed case. In fact, the core of DUAL-SGD is a primal-dual distributed strategy of Arrow-Hurwicz type, which was shown in [63, Corollary 3] to converge, despite being a distributed cooperative strategy, at most to the non-cooperative performance. In contrast, both the ACTC and CHOCO-SGD strategies are able to exploit fully the distributed cooperation, which explains the performance improvement exhibited in Fig. 9.
We move on to examine the improvement of the proposed ACTC strategy on the existing CHOCO-SGD strategy. Also in this case, this improvement can be neatly explained in the light of known behavior observed in the uncompressed case, since the improvement matches well similar gains achievable when using diffusion (as ACTC does) as opposed to consensus (as CHOCO-SGD does). In fact, as observed in distributed optimization without compression [19], diffusion strategies can outperform consensus strategies, and, remarkably, from our experiments we observe the same behavior when these types of strategies are called to operate under communication constraints.

VIII. CONCLUSION
We considered a network of agents tasked to solve a certain distributed optimization problem from continual aggregation of streaming observations. Fundamental features of our setting are adaptation, local cooperation and data compression. By adaptation we mean that the agents must be able to react promptly to drifts in the operational conditions, so as to adapt their inferential solution quickly. In this regard, stochastic-gradient algorithms with constant step-size become critical. By local-cooperation we mean that each individual agent is allowed to implement a distributed algorithm by exchanging information with its neighbors. Finally, data compression comes from the need of communicating information at a finite rate, owing to energy/bandwidth constraints.
We introduced a novel strategy nicknamed Adapt-Compress-Then-Combine (ACTC), whose core is an adaptive diffusion strategy properly twinned with a differential stochastic compression strategy. Our analysis is conducted under the challenging setting where: i) communication is allowed to be unidirectional (i.e., over directed graphs); and ii) the cost functions at the individual agents are allowed to be non-convex, provided that a global cost function obtained as linear combination of the local cost functions is strongly convex. We obtained the following main results. First, we established that the proposed ACTC scheme is mean-square stable, and in particular that each individual agent is able to infer well the value of the parameter to be estimated, with a mean-square-deviation vanishing proportionally to the step-size. Second, we characterized the learning behavior of each individual agent, obtaining analytical solutions that highlight the existence of two main transient phases, one (faster) relative to convergence of all agents to a coordinated evolution, the other (slower) relative to convergence of the coordinated estimate to the steady-state solution. Notably, these distinct learning phases were shown to emerge in diffusion strategies without data compression [13]. Therefore, our result implies that these distinct phases are preserved despite the presence of compressed data, for any degree of compression.
Moreover, there are also distinguishing features arising from data compression, and the obtained analytical solutions are able to reflect well the role of the compression degree (e.g., quantization bits) in the final learning behavior. A remarkable conclusion stemming from our analysis is that, for sufficiently small step-sizes, small errors are achievable for any compression degree. This behavior brings an interesting analogy with classical information-theoretic results: the information-rate limitation does not preclude the possibility of learning, but involves a reduction in the speed of convergence.
Another useful parallel can be drawn with the recently introduced paradigm of exact diffusion [47], [48], where no compression is present, and the true (i.e., non stochastic) gradient is available. Under this paradigm, diffusion strategies with constant step-size µ are enriched with an error compensation step, which allows them to attain a zero, rather than O(µ), mean-square-deviation [47], [48]. Inspired by the structure of the ACTC algorithm in (20), it could be worth including in the exact diffusion algorithm the parameter ζ and a general nonlinearity Q k (·), and tuning these two quantities to speed up convergence.
There is still a lot of work to be done in the context of distributed adaptive learning under communication constraints. One advance regards a steady-state performance analysis aimed at obtaining exact formulas for the mean-square-deviation. A further useful contribution is to extend results available under non-convex environments [64]- [66] to the case of compressed data. Under such setting, the traditional difficulties arising from the lack of convexity (e.g., the evolution of the stochastic-gradient iterates, their mean-square stability and steady-state performance) will be complicated by the complexity arising from the introduction of the nonlinear compression operator. Finally, an open problem that we are currently investigating regards the trade-off between number of transmissions and quantization bits, i.e., how to perform jointly the design of the topology and the allocation of bit-rate budget to maximize the performance.
APPENDIX A

A. Jordan Representation
Let J tot be the matrix associated with the canonical Jordan decomposition of the combination matrix A, which can be represented as [59]: where B is the number of Jordan blocks. As usual, the individual blocks can have different size, and the unspecified off-diagonal terms arising after block-diagonal concatenation are automatically set to zero. For n = 1, 2, . . . , B, we denote by λ n the eigenvalue associated with block J n , and, without loss of generality, we assume that the eigenvalues are sorted in descending order of magnitude, namely, Each Jordan block takes on the form: and can be accordingly written as: where L n is the dimension of the n-th block, and U Ln is a square matrix of size L n that has all zero entries, but for the first diagonal above the main diagonal, which has entries equal to 1.
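The block structure just described can be made concrete with a small numerical sketch. The eigenvalues and block sizes below are purely hypothetical, chosen only so that the first block is the scalar 1; the helper names `jordan_block` and `block_diagonal` are not from the paper.

```python
import numpy as np

def jordan_block(lam, size):
    """Return lam * I + U, where U has ones on the first superdiagonal."""
    shift = np.eye(size, k=1)
    return lam * np.eye(size) + shift

def block_diagonal(blocks):
    """Concatenate square blocks along the diagonal; all other entries are zero."""
    total = sum(b.shape[0] for b in blocks)
    out = np.zeros((total, total))
    pos = 0
    for b in blocks:
        n = b.shape[0]
        out[pos:pos + n, pos:pos + n] = b
        pos += n
    return out

# Hypothetical (eigenvalue, block size) pairs, sorted by decreasing magnitude.
eigenvalues_and_sizes = [(1.0, 1), (0.6, 2), (-0.3, 1)]
J_tot = block_diagonal([jordan_block(lam, n) for lam, n in eigenvalues_and_sizes])

# Dropping the first (scalar) block leaves the reduced Jordan matrix.
J_reduced = J_tot[1:, 1:]
```

Here `J_reduced` collects the B − 1 blocks whose eigenvalues lie strictly inside the unit disc, so its spectral radius (0.6 in this example) is below 1.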
In view of Assumptions 2 and 3, the combination matrix A has a unique largest-magnitude eigenvalue λ_1 = 1, i.e., the first Jordan block is J_1 = 1. The remaining B − 1 Jordan blocks can be conveniently arranged in the reduced matrix: Moreover, letting and we end up with the following useful representation:

B. Energy Operators
Definition 2 (Energy Vector Operator). Let x 1 , x 2 , . . . , x N be N vectors of size M × 1, and let be the block vector of size M N × 1 obtained by concatenating these vectors. The energy vector operator, P : C M N → R N , is defined as: The stochastic counterpart of operator P[·] is the operator P[·] introduced in (63), which, for a random argument x, can be written in terms of P[·] simply as: The block-matrix counterpart of operator P[·] is defined as follows.
The operators (126) and (128) are equipped with several useful properties. We now list those properties that will be exploited in the forthcoming proofs, and refer the Reader to [13] for the proof of these properties.
then we have that: P5) Relation to Euclidean norm: P6) Linear transformations: Let Q be a K × N block matrix with M × M blocks. Applying the energy operator to the linear transformation Qx we obtain: where Γ is the following (N − 1) × (N − 1) matrix: (135)
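Two of the properties listed above lend themselves to a quick numerical check. The sketch below assumes the usual definition of the energy vector operator from the diffusion literature, namely that its k-th entry is the squared Euclidean norm of the k-th block; since the defining equation is not reproduced here, this should be read as an assumption. The function name `energy_vector` is hypothetical.

```python
import numpy as np

def energy_vector(x, N, M):
    """Map a block vector of length M*N to the N squared block norms."""
    blocks = np.asarray(x).reshape(N, M)
    return np.sum(np.abs(blocks) ** 2, axis=1)

rng = np.random.default_rng(2)     # hypothetical seed
N, M = 4, 3
x = rng.standard_normal(N * M) + 1j * rng.standard_normal(N * M)
y = rng.standard_normal(N * M) + 1j * rng.standard_normal(N * M)

# Relation to the Euclidean norm: the entries of P[x] sum to ||x||^2.
total_energy = energy_vector(x, N, M).sum()
squared_norm = np.linalg.norm(x) ** 2

# Convexity with weights (t, 1 - t): entrywise inequality between
# P[t x + (1 - t) y] and t P[x] + (1 - t) P[y].
t = 0.3
lhs = energy_vector(t * x + (1 - t) * y, N, M)
rhs = t * energy_vector(x, N, M) + (1 - t) * energy_vector(y, N, M)
```

Both checks follow directly from the convexity of the squared norm, applied block by block.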

C. Representation in Transformed Network Coordinates
Before proving all the pertinent lemmas and theorems, it is useful to write down the ACTC strategy in terms of the transformed variables introduced in Sec. IV-B. Regarding the transformed quantized vector q̂_i, by multiplying the second step in (50) by V, we readily get: We note in passing that the term V Q(V^{−1} δ̂_i) reflects well the inherent nonlinear behavior of the compression operators. In fact, the linear transformation V and the nonlinear operator Q(·) do not commute and, hence, the direct and inverse network transformations, V and V^{−1}, do not perfectly compensate each other.
Let us switch to the transformed quantization-error vector δ i , and focus accordingly on the first step in (50).
We introduce the extended Jordan matrix: which, in view of (51), allows us to write the extended combination matrix A in (49) as: Therefore, in view of (50) we can write: where we introduced the matrix: Substituting now (139) into the first step of (50) and applying the network transformation, we get: Furthermore, by using (49) and (54) in (140), the matrix G i−1 can be written as: where we used the property of the Kronecker product (X ⊗ Z)(Y ⊗ Z) = XY ⊗ Z, holding for any three matrices X, Y, Z with compatible dimensions. Exploiting now the partitioned structure of V and V −1 in (53), we can write: where in the last matrix we used the equality A 1 N = 1 N , holding since A is a left-stochastic matrix. Using (143) and (144) in (142) we obtain the following block-decomposition for G i−1 : where Combining now (141) with (145), we obtain:  We conclude this section with a lemma that will be repeatedly used in the forthcoming proofs.
where ν is the global-strong-convexity constant introduced in (13) and η is the average Lipschitz constant in (14).
Second, a positive constant σ 12 exists such that: Finally, the matricesP[G 21,i−1 ] andP[G 22,i−1 ] obtained by applying the norm matrix operator to the matrices in (148) and (149), have bounded norm, in particular we have: for some positive constants σ 21 , and σ 22 .
Proof: The proof relies basically on the properties of the Hessian matrices H k,i−1 , which arise from Assumptions 1 and 4. Let us focus on (151). Using (13) and (40) in (146) we readily obtain: which proves the lower bound in (151). The upper bound is obtained by observing that: where the first inequality is the triangle inequality, the intermediate inequality is the mean-value inequality, and the last inequality follows by (14).
We continue by proving (152) and (153). First, we note that G_{12,i−1}, G_{21,i−1}, and G_{22,i−1} have the following common structure: for a suitable choice of the matrices X and Y, having made explicit the definition of H in (49). The bound in (152) follows readily from the Lipschitz property in (2). Regarding (153), we observe that we can write: can be written as: where (a) follows by (37) Recalling that the transformed gradient noise vector ŝ_i is equal to (V ⊗ I_M) s_i, the ℓ-th block ŝ_{ℓ,i}, for ℓ = 1, 2, . . . , N, is: where v_{ℓk} is the (ℓ, k)-entry of matrix V. It is useful to examine separately the coordinated-evolution component s̄_i = ŝ_{1,i} and the remaining components ŝ_{ℓ,i}, for ℓ = 2, 3, . . . , N. To this end, we exploit the block decomposition of matrix V in (53). Regarding s̄_i, since v_{1k} = π_k, from (160) we have that: where the first inequality is Jensen's inequality with convex weights {π_k}, whereas the second inequality comes from (159). Likewise, for ℓ = 2, 3, . . . , N, from (160) we can write: where the first inequality is Jensen's inequality with uniform weights 1/N, whereas the second inequality comes from (159). Finally, by introducing the "squared" counterpart of the complex matrix V_R, whose entries, for ℓ = 1, 2, . . . , N − 1 and k = 1, 2, . . . , N are: and recalling the definition of the diagonal matrices C_α, C_β, and C_σ in Table I, from (161) and (162) it is readily seen that the claim in (65) has been in fact proved, with the characterization of matrix T_s and of the quantities x̄_s and x̌_s as given in Table II.
APPENDIX C

PROOF OF LEMMA 2
Proof: Applying the average energy operator P[·] to (150), we obtain: where the energy terms corresponding to the gradient noise are additive in view of property (36) and property P4) of the energy operator.
Let us consider the second term on the RHS of (164), for which we can write: where the first inequality is an application of Jensen's inequality, the second inequality comes from (151) and (152), and the final equality comes from property P5) of the energy operator P[·]. Taking expectations in (166) and using the result in (164) we obtain: Let us move on to examine (165). First of all, we appeal to the Jordan matrix representation in (124) to write: Then, the following chain of inequalities holds: where (a) follows by the convexity property P3) of the energy operator applied with weights 1/2; (b) follows by the same property applied with weights 1/4; (c) follows by property P6) of the energy operator, respectively in form (132) as regards the first two terms, and in form (133) as regards the remaining terms; (d) follows by observing that, due to the peculiar shape of D and U, one has the identities: and by the inequality: Finally, the inequality in (e) follows by the bounds in (153).
Taking expectations in (169) and then using (165) we get: Examining jointly (167) and (174), we see that we have in fact proved (66), with the matrix T_δ and the quantities x̄_δ and x̌_δ as given in Table II.
APPENDIX D

PROOF OF LEMMA 3
We start with an auxiliary lemma that will then be used to prove Lemma 3.

Lemma 5 (Quantized State Decomposition). Let
where v_{ℓk} is the (ℓ, k)-entry of the transformation matrix V in (53). Then, for any ζ ∈ (0, 1) we have that: Proof: By adding and subtracting ζ δ̂_i in (136) we can write: Consider now two realizations of q_{i−1} and δ_i, which means that x = x becomes deterministic and that y contains only the randomness arising from the stochastic compression operator Q(·). From the quantizer's unbiasedness property (21) we have that: or: Since the randomness in the compression operator is independent of all the other random mechanisms in the system and is independent over time, the orthogonality condition in (181) holds also when expectations are computed w.r.t. all random variables, which allows us to apply property P4) of the energy operator in (179), yielding: On the other hand, recalling that δ_i = V^{−1} δ̂_i, we can write: The expected energy of the ℓ-th block in (183) is: where (a) follows from the fact that the compression operators are independent across agents and unbiased, and (b) follows by the non blow-up property in (22). Recalling that the first row of matrix V is the (transposed) Perron eigenvector, the first entry (ℓ = 1) in (184) can be upper bounded by: where the last inequality follows by the definition of ∆̄ in (175). Likewise, the other entries (ℓ ≠ 1) in (184) can be upper bounded by: having used the definition of ∆̌ in (176). The claim of the lemma follows by joining (182), (184), (185) and (186), and using property P5) of the energy operator.
Proof of Lemma 3: The first term on the RHS in (177) can be represented in block form as follows: Let us start by examining the first block in (187). Exploiting the block decomposition in (150) we can write: First of all, using (36) we have the equality: Let us now examine the spectral radius of I_M − µζ G_{11,i−1}. Using (151) we can write: We have the following chain of equivalent relationships (see footnote 11): where the last implication is true because ν ≤ η in view of (14). Since all implications in (191) hold in both directions, we have in fact proved that: Moreover, since ν ≤ η, we also have: which, using (67), implies that 1 − µζν > 0, finally yielding, in view of (192): This upper bound will be useful in characterizing the last term in (189), which can be manipulated as follows, for 0 < t < 1: where the first inequality is an application of Jensen's inequality, whereas the second inequality follows by setting t = µζν, and by using (152). Taking expectations in (195) and using the result in (189) we get:
Footnote 11: Condition (67) is not the tightest condition one can use to guarantee stability of I_M − µζ G_{11,i−1}. Some examples of how to get a better constant can be found in [13], [19], and, with straightforward algebra, we can get the refined upper bound I_M − µζ G_{11,i−1} ≤ However, in our analysis the additional O(µ^2) term is expected to bring little information. In fact, as we will see in Lemma 7 further ahead, a number of O(µ^2) terms will be collected into a large correction constant φ that characterizes the stability analysis, with a stability threshold µ being usually smaller than the factor 2/(η + ν) that will be obtained from our characterization of
We continue by examining the second block in (187).
Using the block decomposition in (150) we can write: and, using (36) along with property P4) of the energy operator, we get: On the other hand, by the convexity property P3) of the energy operator we have: Making explicit the definition of y in (197), we can write the following chain of inequalities: where step (a) applies property P7) of the energy operator, step (b) applies property P3) with weights 1/3, and step (c) uses (153). Letting taking expectations in (200), and using the result in (199) and then in (197), we obtain: Calling upon Lemma 5 along with (196) and (202), we see that (69) holds true, with the matrix T q and the quantitiesx q and q x q as given in Table II with the driving vector x defined in (74) and with the matrix T replaced by the matrix: The claim of the theorem will be proved if we show that the matrix T tmp is upper bounded by the matrix T appearing in (70).
The term T q can be upper bounded as: where for brevity we introduced the bounding constant φ (q) .
Likewise, concerning the term involving T s we can write: where φ (s) is a suitable constant.
Regarding the term involving T δ , by exploiting Table II we have: where we used the bound: and where φ (δ) is a suitable constant that upper bounds the terms of order µ 2 .
If we now introduce the maximal constant and the rank-one perturbed version of E 0 : by using (204), (205), and (206) in (203), we end up with the following bound: where the quantities τ , τ 12 , and v µ,ζ are defined in (71) and Then, the matrix E defined in (72) has spectral radius less than 1 if, and only if: Proof: We introduce the resolvent of matrix E: which is well-posed for z distinct from the eigenvalues of E. The stability of E will be proved if we show that: In order to prove (215), we examine first the resolvent of the unperturbed matrix E 0 in Table I -see also (201), namely, Since E 0 is upper triangular and all its diagonal elements are positive values strictly less than one, we conclude that E 0 is stable, which further implies that the resolvent R E0 (z) is bounded for |z| ≥ 1.
We continue by relating the resolvent of E to the resolvent of the unperturbed matrix E_0. Exploiting the structure of E in (209), we see that E is given by E_0 plus an additive rank-one perturbation. Since we have shown that in the range of interest |z| ≥ 1 the matrix zI_{N−1} − E_0 is invertible, we can apply the Sherman-Morrison identity to zI_{N−1} − E, obtaining [59]: where the identity holds if, and only if, the denominator on the RHS of (217) is not zero. In particular, if the denominator is zero, zI_{N−1} − E is not invertible. Therefore, to prove (215) we must examine the behavior of the complex scalar function: over the entire range |z| ≥ 1. To this end, it is critical to characterize the resolvent of E_0. Exploiting (122), (123), and (124), we can represent E_0 as: where, for n = 2, 3, . . . , B, we introduced the L_n × L_n matrices: By computing the inverse of the block-diagonal matrix zI_{N−1} − E_0 as the block-diagonal matrix of the individual inverse matrix-blocks, we have that: On the other hand, for Jordan-type matrices like zI_{N−1} − E_n, the inverse is known to be in the form: Using (221) and (222) in (218), we obtain: Applying the triangle inequality in (223), and noticing that for any two complex numbers z_1 and z_2, we have |z_1 − z_2| ≥ |z_1| − |z_2|, we can write the following inequality (in the range |z| ≥ 1): Equation (224) implies that a sufficient condition for the stability of E is: We now show this is also a necessary condition by reductio ad absurdum. Assume that (225) is violated, namely that (we rule out the equality since it obviously corresponds to instability): Were (226) true, there would certainly exist one value z ∈ R, with z > 1, such that the denominator on the RHS of (217) is equal to zero. This is because the function g_E(z) is analytic over the domain |z| ≥ 1 and, in particular, it is continuous on the real axis and vanishes as z → ∞.
This implies that (225) is a necessary and sufficient condition for the stability of E.
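The resolvent argument used in this proof hinges on the Sherman-Morrison identity for a rank-one perturbation, which is easy to verify numerically. The sketch below uses a hypothetical upper-triangular matrix with diagonal entries below one in place of E_0 and hypothetical perturbation vectors `a` and `b`; none of these values come from the paper. It writes the identity in resolvent form: if E = E_0 + a b^T and R_0(z) = (zI − E_0)^{−1}, then (zI − E)^{−1} = R_0 + R_0 a b^T R_0 / (1 − b^T R_0 a), provided the scalar denominator is not zero.

```python
import numpy as np

rng = np.random.default_rng(1)     # hypothetical seed
n = 5

# Stand-in for the unperturbed matrix: upper triangular, diagonal in (0, 1).
E0 = np.triu(rng.uniform(0.0, 0.2, size=(n, n)), k=1)
E0 += np.diag(rng.uniform(0.1, 0.9, size=n))

# Rank-one perturbation a b^T with small hypothetical entries.
a = rng.uniform(0.0, 0.05, size=(n, 1))
b = rng.uniform(0.0, 0.05, size=(n, 1))
E = E0 + a @ b.T

z = 1.2 + 0.3j                     # a test point with |z| >= 1
I = np.eye(n)

R0 = np.linalg.inv(z * I - E0)     # resolvent of the unperturbed matrix
denom = 1.0 - (b.T @ R0 @ a).item()    # scalar that must not vanish

# Sherman-Morrison form of the perturbed resolvent.
R_sm = R0 + (R0 @ a @ b.T @ R0) / denom

# Direct inversion, for comparison.
R_direct = np.linalg.inv(z * I - E)
```

The scalar `denom` plays the role of the denominator discussed in the proof: the perturbed resolvent exists at z exactly when this quantity is nonzero.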
Let us now recast (225) in a more explicit form. First of all, the inner summation can be computed by standard results on geometric series, yielding: Let now we have that where γ(A) is defined in (212). In view of (217), (218), and (225) and let µ be the positive root of the equation: Then, the matrix T in (70) has spectral radius less than 1 if, and only if, µ < µ .
Proof: The matrix T 0 in (70) is stable since τ < 1 in view of the second inequality in (230), and E is stable in view of the first inequality in (230) and Lemma 6. Then, the eigenvalues of T 0 lie all strictly within the unit disc, and, hence, the resolvent R T0 (z) exists. Accordingly, considering the resolvent of matrix T : and exploiting the structure of T in (70) (i.e., T 0 plus an additive rank-one perturbation), from the Sherman-Morrison identity we have [59]: where the formula is valid if, and only if, the denominator on the RHS of (233) is not zero. Moreover, if the denominator is zero, then (zI N − T ) is not invertible. Therefore, the stability of T will be proved if we show that the denominator on the RHS of (233) is not zero over the range |z| ≥ 1. To this end, we will now examine the resolvent of the unperturbed matrix T 0 .
From (70) we see that T 0 is block upper-triangular, which implies that we can compute the inverse R T0 (z) = (zI N − T 0 ) −1 as: and, exploiting the definition of v µ,ζ in (73) we get: Using (217) and (218), we have the following identity: which, applied in (235), yields: Accordingly, by triangle inequality we have that: where we used the inequality |z 1 − z 2 | ≥ |z 1 | − |z 2 |, and the fact that, since (213) is verified by hypothesis, the denominator of the last fraction in (238) is positive in view of (224) and (229). On the other hand, we know from (224) that |g E (z)| ≤ g E (1) in the range |z| ≥ 1, which, applied in (238), allows us to write: Reasoning as done in the proof of Lemma 6, a necessary and sufficient condition for the stability of T is: To this aim, let us apply the definitions in (71) and (229) in (239), to obtain: In view of (241), inequality (240) is true if, and only if, the (positive) step-size µ is smaller than the positive root of the equation in (231), and the proof of the lemma is complete.

A. Bounds on the Powers of T
The stability established in Lemma 7 allows us to conclude that the matrix powers T^i can be uniformly bounded w.r.t. i. However, the bound would depend on the matrix T, and, in particular, would depend on the step-size µ. Since we are interested in characterizing the small-µ behavior of the ACTC mean-square-deviation, it is essential to establish how such bound behaves as µ → 0. To this end, Lemma 7 alone does not provide enough information, and we need to resort to the powerful framework of Kreiss stability [67].
Preliminarily, it is necessary to introduce the concept of Kreiss constant. Given the resolvent R X (z) associated with a matrix X, the Kreiss constant relative to X is defined as follows [67]: and it is useful to bound (from above and from below) the norm of matrix powers as follows: where e is Euler's number. In the next lemma we exploit the Kreiss constant to characterize the small-µ behavior of the powers of T .
Lemma 8 (Bound on the Powers of $T$). Let where µ is the positive root of the equation in (231). Then we have that: where $K(\mu)$ is a function of $\mu$, independent of $i$, with: Proof: Let us evaluate a bound on the Kreiss constant associated with matrix $T$. Accordingly, we will examine the behavior of the function over the range $|z| \geq 1$. We have seen in the proof of Lemma 7 (see the argument following (233)) that under (244) it is legitimate to use the representation in (233). Applying now the triangle inequality to (233) we have: where the $O(\mu^2)$ term comes from the behavior of the perturbation vector $v_{\mu,\zeta}$ in (73), whereas the bound involving the term $g_T(1)$ comes from (235) and (239). Therefore, from (247) and (248) we conclude that: Let us examine the behavior of $\|R_{T_0}(z)\|_1$. Exploiting the structure in (234) and applying the triangle inequality, we can write: First, we examine the resolvent of matrix $E$. Since by assumption Eq. (213) is verified, the spectral radius of $E$ is strictly less than 1 in view of Lemma 6. This implies that all eigenvalues of $E$ lie strictly inside the unit disc, which in turn guarantees the existence of a constant $C_E$ such that [59]: Moreover, the stability of $E$ implies that all powers of $E$ are bounded, which, in view of the lower bound in (243), implies the existence of a finite Kreiss constant $K_E$ for $E$, with the supremum taken over $z \in \mathbb{C}$ such that $|z| > 1$. We remark that both constants $C_E$ and $K_E$ are independent of $\mu$, since so is matrix $E$.
Let us focus on the second term in (249). Using (250), we can write: Let us examine the first argument of the maximum in (253). Now, the known inequality |z − τ | ≥ | |z| − τ |, turns into |z − τ | ≥ (|z| − τ ) since |z| ≥ 1 over the considered range and τ < 1 in view of (230). Therefore we can write: where the equality comes from the definition of τ in (71).
Let us switch to the analysis of the second term in (253), which, by expanding the square, yields: where the marked term is upper bounded by $C_E K_E$, see (251) and (252). Using (254) and (255) in (253), we conclude that: Reasoning along the same lines we can show that: Applying now (256) and (257) in (249), we get: which implies, in view of (242), the existence of a function $K(\mu)$ such that, under assumption (244), we are allowed to write: From the properties of $g_T(1)$ examined in Lemma 7, we know that $g_T(1) < 1$ in the range of $\mu$ permitted by (244), and $g_T(1) \to 0$ as $\mu \to 0$, which implies the claim of the lemma in (246).

APPENDIX H PROOF OF THEOREM 2
Proof: Developing the recursion in (75) we have: and by application of the triangle inequality: where the first equality comes from the definition of average energy operator in (63). In view of Lemma 7, under the assumptions of the theorem the matrix T has spectral radius strictly less than 1, which, in view of (261), implies: On the other hand, from the network coordinate transformation in (55), we have q i = V −1 q i , which, in view of (262), implies: where we used the fact that the squared norm of the extended vector q i is the sum of the norms of the N individual vectors q k,i 2 . The claim in (84) follows from the fact that w k,i is a convex combination of { q ,i } ∈Nk -see (43).
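The step from a stable recursion to a steady-state bound can be illustrated numerically. The sketch below iterates a generic affine recursion p ← T p + x with a stable nonnegative matrix and checks that the iterates approach (I − T)^{-1} x. The 2×2 matrix, the driving vector, and the initial state are hypothetical values chosen for illustration. They are not the quantities of (75).

```python
def step(T, p, x):
    """One step of p <- T p + x for a 2x2 matrix T and length-2 vectors."""
    return [T[r][0] * p[0] + T[r][1] * p[1] + x[r] for r in range(2)]

def fixed_point(T, x):
    """Solve (I - T) p = x for a 2x2 matrix T via the explicit 2x2 inverse."""
    a, b = 1.0 - T[0][0], -T[0][1]
    c, d = -T[1][0], 1.0 - T[1][1]
    det = a * d - b * c
    return [(d * x[0] - b * x[1]) / det, (-c * x[0] + a * x[1]) / det]

# Hypothetical data: nonnegative T with row sums below 1, hence spectral radius < 1.
T = [[0.8, 0.1], [0.05, 0.5]]
x = [0.01, 0.02]
p = [5.0, 3.0]  # arbitrary initial state

for _ in range(500):
    p = step(T, p, x)

p_inf = fixed_point(T, x)
print(p, p_inf)
```

Because the spectral radius is below 1, the contribution of the initial state decays geometrically and only the term driven by x survives in the limit.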
We move on to prove (85), for which we need to examine the small-µ behavior of $(I_N - T)^{-1} x$. On the other hand, specializing (234) to the case z = 1, it is immediate to see that: Therefore we can write: which further implies: Combining now (265) and (268) we obtain: On the other hand, examining the entries of vector x in (74) with the help of Table II we readily see that the entries of x are $O(\mu^2)$.
APPENDIX I PROOF OF THEOREM 3
In the following, we repeatedly use the following known equality, holding for any two nonzero scalars with $a \neq b$: Proof: In view of assumption (89), we can use Lemma 8 in (261) to conclude that, for all i ≥ 1, and for sufficiently small µ: where K(µ) is O(1) in view of (246) while the order of magnitude of $\|(I_N - T)^{-1} x\|_1$ is O(µ) in view of (271).
Let us develop the recursion in (75) by separating the role of the unperturbed matrix T 0 and the rank-one perturbation v µ,ζ 1 N in (70), yielding: where the second term on the RHS is O(µ 2 ) because so are v µ,ζ and x, while which is O(1) in view of (273). Developing the recursion in (274) we have: where the estimate of the last term comes from (265).
Let us now evaluate the i-th power of T 0 . Since T 0 is block upper-triangular, its i-th power admits the representation [13]: where we note that in the small-µ regime the matrix (τ I − E) −1 is certainly well-defined and has nonnegative entries, since as µ → 0 we have τ = 1 − µ ζ ν > ρ(E) (and since E has nonnegative entries). Therefore, in the small-µ regime we can write: Considering the evolution of the first component E q i 2 of P[ q q i ] as dictated by (276), from (277) and (278) we get: Likewise, exploiting (276), (277), and (278) to get the evolution relative to the network-error component we can write: which allows us to write [59]: where by definition ρ net = ρ(E) + ε < 1.
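To see why the matrix (τI − E)^{-1} enters the powers of a block upper-triangular matrix, consider a generic matrix with a scalar τ in the top-left corner, a row block t to its right, and a square block E below. The top-right block of its i-th power is then t(τ^i I − E^i)(τI − E)^{-1}. The sketch below checks this standard identity numerically. The block layout and all numerical values are hypothetical, and they need not coincide with the structure of T 0 in (70).

```python
def matmul(p, q):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(p[r][k] * q[k][c] for k in range(len(q))) for c in range(len(q[0]))]
            for r in range(len(p))]

def matpow(m, i):
    """i-th power of a square matrix by repeated multiplication."""
    n = len(m)
    out = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    for _ in range(i):
        out = matmul(out, m)
    return out

def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical blocks: tau exceeds the spectral radius of E (which is 0.6 here).
tau = 0.95
t = [0.3, 0.2]
E = [[0.5, 0.1], [0.2, 0.4]]
T0 = [[tau] + t, [0.0] + E[0], [0.0] + E[1]]

i = 7
T0_i = matpow(T0, i)
E_i = matpow(E, i)

# Closed form of the top-right block: t (tau^i I - E^i) (tau I - E)^{-1}.
inv_tau_E = inv2([[tau - E[0][0], -E[0][1]], [-E[1][0], tau - E[1][1]]])
diff = [[(tau ** i if r == c else 0.0) - E_i[r][c] for c in range(2)] for r in range(2)]
top_right = matmul([t], matmul(diff, inv_tau_E))[0]
print(T0_i[0][1:], top_right)
```

In this example the entries of `inv_tau_E` are nonnegative, which matches the remark that a nonnegative E with τ > ρ(E) yields a nonnegative inverse.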
We now exploit the exponential bounds in (279) and (281). Taking expectations and applying the Cauchy-Schwarz inequality for random variables, we obtain: where we applied (279) and (281), along with the inequality $\sqrt{a + b} \leq \sqrt{a} + \sqrt{b}$ for $a, b \geq 0$.
Using (283), we can rearrange the first row of the transfer matrix T q and of the vector x q appearing in Lemma 3, obtaining a new matrix T q and a new vector x q defined by: which allows us to replace the matrix T appearing in Theorem 1 with a matrix T of the following form: where we defined: with the µ 2 -term appearing in [T q ] 12 being conveniently embodied in the overall O(µ 2 ) rank-one perturbation.
Likewise, we construct a new driving vector x = x + χ i by replacing x q in (74) with x q in (284). Replacing now T and x with T and x in the recursion (75), we get: where in the last step we used (279) and (281). Developing the recursion in (287) we get: Applying now (277) (with τ 2 in place of τ , and τ 12 in place of τ 12 ), and reasoning as done to obtain the bound in (278), we conclude that the first row of matrix (T 0 ) i can be upper bounded as: Moreover, we can evaluate (I N − T 0 ) −1 through (234) applied with τ 2 in place of τ , and τ 12 in place of τ 12 , obtaining: Using (289) and (290) in (288), we can obtain an inequality recursion on the first entry of P[ q i ]: where we have ignored the term χ i , since comparing this term against the term χ i in (283), we see that χ i is dominated by χ i as µ → 0 and, hence, can be formally embodied into χ i through the Big-O notation. Applying the definition of χ i in (283) we can write: where in the last inequality we exploited the geometric summation in (272).
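The geometric summation invoked here combines two different decay rates. One standard form, assumed in this sketch and possibly indexed differently from (272), is the identity sum_{j=0}^{i-1} a^j b^{i-1-j} = (a^i − b^i)/(a − b), valid for a ≠ b. The function names and the numerical rates below are hypothetical.

```python
def two_rate_sum(a, b, i):
    """Direct evaluation of sum_{j=0}^{i-1} a^j * b^(i-1-j)."""
    return sum(a ** j * b ** (i - 1 - j) for j in range(i))

def closed_form(a, b, i):
    """Closed form (a^i - b^i) / (a - b), which requires a != b."""
    return (a ** i - b ** i) / (a - b)

# Hypothetical decay rates, chosen only to exercise the identity.
a, b = 0.9, 0.6
for i in (1, 5, 20):
    print(i, two_rate_sum(a, b, i), closed_form(a, b, i))
```

The closed form shows that the mixed sum decays at the slower of the two rates, which is how bounds of this kind produce a single exponential term.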
Now we examine the small-µ behavior of the three terms appearing on the RHS of (292). The first term is O(µ) τ i/2 , since we have that: The second term on the RHS of (292) is O(µ) τ 2i , since: Finally, the third term is O(µ 3/2 ) since: As a result, we can use (292) in (291) and substitute the estimated orders of the aforementioned three terms to obtain: It remains to examine the small-µ behavior of the last two terms in (296). First, we observe that the quantities ∆ and q ∆ defined in Table I and characterizing vector x in (74) are proportional to the maximum compression factor in (33), namely, ∆ ∝ Ω, q ∆ ∝ Ω. Then, exploiting Table II, the definition of x in (74), and (297), we obtain: where κ 1 is a suitable constant (i.e., independent of µ). Using (295), we can therefore represent the first fraction on the RHS of (296) as: Let us focus on the last term in (296). Resorting again to (297), from (286) and (74) we get: for some constants κ 2 , κ 3 , and κ 4 . Using now (295) and (300), the last term in (296) can be represented as µ ζ κ 5 Ω(1 + Ω) + O(µ 2 ), for a suitable constant κ 5 . Joining this representation with (298), we can finally represent (296) as: where c q is a constant independent of µ.
It remains to characterize the behavior of the mean-square-deviation at an individual agent k. To this end, we evaluate the individual entry of the extended vector q i through (56), which allows us to write: where we resorted again to the Cauchy-Schwarz inequality for random variables. Now, from (301) we can write E q i 2 ≤ O(1) τ 2i + O(µ), which, used along with (281), yields: Further noticing that from (281) we can write E q q i 2 ≤ O(1) ρ i/2 net + O(µ 2 ), the claim of the theorem follows by using (302) and (303) in (301).