Learning Over Multitask Graphs—Part II: Performance Analysis

Part I of this paper formulated a multitask optimization problem where agents in the network have individual objectives to meet, or individual parameter vectors to estimate, subject to a smoothness condition over the graph. A diffusion strategy was devised that responds to streaming data and employs stochastic approximations in place of actual gradient vectors, which are generally unavailable. The approach relied on minimizing a global cost consisting of the aggregate sum of individual costs regularized by a term that promotes smoothness. We examined the first-order, the second-order, and the fourth-order stability of the multitask learning algorithm. The results identified conditions on the step-size parameter, regularization strength, and data characteristics in order to ensure stability. This Part II examines steady-state performance of the strategy. The results reveal explicitly the influence of the network topology and the regularization strength on the network performance and provide insights into the design of effective multitask strategies for distributed inference over networks.


I. INTRODUCTION
As pointed out in Part I [2] of this work, most prior literature on distributed inference over networks focuses on single-task problems, where agents with separable objective functions need to agree on a common parameter vector corresponding to the minimizer of an aggregate sum of individual costs [3]- [13].In this paper, and its accompanying Part I [2], we focus instead on multitask networks where the agents may need to estimate and track multiple objectives simultaneously [14]- [26].Although agents may generally have distinct though related tasks to perform, they may still be able to capitalize on inductive transfer between them to improve their performance.
Based on the type of prior information that may be available about how the tasks are related to each other, multitask learning algorithms can be derived by translating the prior information into constraints on the parameter vectors to be inferred.
In Part I [2], we considered multitask inference problems where each agent in the network seeks to minimize an individual cost expressed as the expectation of some loss function.The minimizers of the individual costs are assumed to vary smoothly over the topology, as dictated by the graph Laplacian matrix.The smoothness property softens the transitions in the tasks among adjacent nodes and allows incorporating information about the graph structure into the solution of the inference problem.In order to exploit the smoothness prior, we formulated the inference problem as the minimization of the aggregate sum of individual costs regularized by a term promoting smoothness, known as the graph-Laplacian regularizer [27], [28].A diffusion strategy was proposed that responds to streaming data and employs stochastic approximations in place of actual gradient vectors, which are generally unavailable.
The analysis from Part I [2] revealed how the regularization strength η can steer the convergence point of the network toward many modes of operation starting from the non-cooperative mode (η = 0) where each agent converges to the minimizer of its individual cost and ending with the single-task mode (η → ∞) where all agents converge to a common parameter vector corresponding to the minimizer of the aggregate sum of individual costs.
For any values of η in the range 0 < η < ∞, the network behaves in a multitask mode where agents seek their individual models while at the same time ensuring that these models satisfy certain smoothness and closeness conditions dictated by the value of η.We carried out in Part I [2] a detailed stability analysis of the proposed strategy.We showed, under conditions on the step-size learning parameter µ, that the adaptive strategy induces a contraction mapping and that despite gradient noise, it is able to converge in the mean-square-error sense within O(µ) from the solution of the regularized problem, for sufficiently small µ.We also established the first and fourthorder moments stability of the network error process and showed that they tend asymptotically to bounded region on the order of O(µ) and O(µ 2 ), respectively.
Based on the results established in Part I [2], we shall derive in this paper a closed-form expression for the steady-state network mean-square-error relative to the minimizer of the regularized cost.This closed form expression will reveal explicitly the influence of the regularization strength, network topology (through the eigenvalues and eigenvectors of the Laplacian matrix), gradient noise, and data characteristics, on the network performance.
Additionally, a closed-form expression for the steady-state network mean-square-error relative to the minimizers of the individual costs is also derived.This expression will provide insights on the design of effective multitask strategies for distributed inference over networks.
Notation.We adopt the same notation from Part I [2].All vectors are column vectors.Random quantities are denoted in boldface.Matrices are denoted in capital letters while vectors and scalars are denoted in lower-case letters.The operator denotes an element-wise inequality; i.e., a b implies that each entry of the vector a is less than or equal to the corresponding entry of b.The symbol diag{•} forms a matrix from block arguments by placing each block immediately below and to the right of its predecessor.The operator col{•} stacks the column vector entries on top of each other.The symbols ⊗ and ⊗ b denote the Kronecker product and the block Kronecker product, respectively.The symbol vec(•) refers to the standard vectorization operator that stacks the columns of a matrix on top of each other and the symbol bvec(•) refers to the block vectorization operation that vectorizes each block and stacks the vectors on top of each other.

A. Problem formulation and adaptive strategy
Consider a connected network (or graph) G = {N , E, A}, where N is a set of N agents (nodes), E is a set of edges connecting agents with particular relations, and A is a symmetric, weighted adjacency matrix.If there is an edge connecting agents k and , then [A] k = a k > 0 reflects the strength of the relation between k and ; otherwise, [A] k = 0. We introduce the graph Laplacian L, which is a differential operator defined as L = D − A, where the degree matrix D is a diagonal matrix with k-th entry [D] kk = N =1 a k .Since L is symmetric positive semi-definite, it possesses a complete set of orthonormal eigenvectors.We denote them by {v 1 , . . ., v N }.For convenience, we order the set of real, non-negative eigenvalues of L as 0 = λ 1 < λ 2 ≤ . . .≤ λ N = λ max (L), where, since the network is connected, there is only one zero eigenvalue with corresponding eigenvector v 1 = 1 √ N 1 N [29].Thus, the Laplacian can be decomposed as: where Let w k ∈ R M denote some parameter vector at agent k and let W = col{w 1 , . . ., w N } denote the collection of parameter vectors from across the network.We associate with each agent k a risk function J k (w k ) : R M → R assumed to be strongly convex.In most learning and adaptation problems, the risk function is expressed as the expectation of a loss function Q k (•) and is written as , where x k denotes the random data.
The expectation is computed over the distribution of this data.We denote the unique minimizer of J k (w k ) by w o k .Let us recall the assumption on the risks {J k (w k )} used in Part I [2].
Assumption 1. (Strong convexity) It is assumed that the individual costs J k (w k ) are each twice differentiable and strongly convex such that the Hessian matrix function H k (w k ) = ∇ 2 wk J k (w k ) is uniformly bounded from below and above, say, as: where λ k,min > 0 for k = 1, . . ., N .
In many situations, there is prior information available about W o = col{w o 1 , . . ., w o N }.In the current Part II, and its accompanying Part I [2], the prior belief we want to enforce is that the target signal W o is smooth with respect to the underlying weighted graph.References [16], [17] provide variations for such problems for the special case of mean-square-error costs.Here we treat general convex costs.Let L = L ⊗ I M .The smoothness of W can be measured in terms of a quadratic form of the graph Laplacian [27], [28], [30]- [32]: where N k is the set of neighbors of k, i.e., the set of nodes connected to agent k by an edge.The smaller S(W) is, the smoother the signal W on the graph is.Intuitively, given that the weights are non-negative, S(W) shows that W is considered to be smooth if nodes with a large a k on the edge connecting them have similar weight values {w k , w }.Our objective is to devise and study a strategy that solves the following regularized problem: in a distributed manner where each agent is interested in estimating the k-th sub-vector of W o η = col{w o 1,η , . . ., w o N,η }.The tuning parameter η ≥ 0 controls the trade-off between the two components of the objective function.We are particularly interested in solving the problem in the stochastic setting when the distribution of the data x k in ) is generally unknown.This means that the risks J k (w k ) and their gradients ∇ wk J k (w k ) are unknown.As such, approximate gradient vectors need to be employed.A common construction in stochastic approximation theory is to employ the following approximation at iteration i: where x k,i represents the data observed at iteration i.The difference between the true gradient and its approximation is called the gradient noise s k,i (•): Let w k,i denote the estimate of w o k,η at iteration i and node k.In order to solve (4) in a fully distributed and adaptive manner, we proposed in Part I [2] the following diffusion-type algorithm: where µ > 0 is a small step-size parameter and ψ k,i is an intermediate variable.

B. Summary of main results
One key insight that followed from the analysis in Part I [2] is that the smoothing parameter η can be regarded as an effective tuning parameter that controls the nature of the learning process.The value of η can vary from η = 0 to η → ∞.We showed that at one end, when η = 0, the learning algorithm reduces to a non-cooperative mode of operation where each agent acts individually and estimates its own local model, w o k .On the other hand, when η → ∞, the learning algorithm moves to a single-mode of operation where all agents cooperate to estimate a single parameter (namely, the Pareto solution of the aggregate cost function).For any values of η in the range 0 < η < ∞, the network behaves in a multitask mode where agents seek their individual models while at the same time ensuring that these models satisfy certain smoothness and closeness conditions dictated by the value of η.
In Part I [2], we carried out a detailed stability analysis of the proposed strategy (7).We showed, under some conditions on the step-size parameter µ, that: where W o η is the solution of the regularized problem (4) and W i = col{w 1,i , . . ., w N,i } denotes the network block weight vector at iteration i.Expression (9) indicates that the mean-square error E W o η − W i 2 is on the order of µ.
However, in this Part II, we are interested in characterizing how close the W i gets to the network limit point W o η .In particular, we will be able to characterize the network mean-square deviation (MSD) (defined below in (51)) value in terms of the step-size µ, the regularization strength η, the network topology (captured by the eigenvalues λ m and eigenvectors v m of the Laplacian L), and the data characteristics (captured by the second-order properties of the costs H k,η and second-order moments of the gradient noise R s,k,η ) as follows: where [v m ] k denotes the k-th entry of the eigenvector v m .The interpretation of ( 11) is explained in more detail in Section IV where it is shown, by coupling η and µ in an appropriate manner, that the term (O(1)+O(η)) will be a strictly higher order term of µ.As we will explain later in Sections IV and V, by properly setting the parameters, expression (11) allows us to recover the mean-square-deviation of stand-alone adaptive agents (η = 0) and single-task diffusion networks (η → ∞).
Recall that the objective of the multitask strategy (7) is to exploit similarities among neighboring agents in an attempt to improve the overall network performance in approaching the collection of individual minimizer W o by means of local communications.Section V in this paper is devoted to quantify the benefit of cooperation, namely, the objective of improving the mean-square deviation around the limiting point of the algorithm.In particular, we will be able to characterize the mean-square-deviation (MSD) value relative to the multitask objective W o in terms of the MSD in (11) and the mismatch W o η − W o as follows: where "bias" is the bias of algorithm (7) relative to W o η given in future expression (40).By increasing η, the MSD in (11) is more likely to decrease.However, by increasing η, from expression (31) in Part I [2], W o η −W o 2 is more likely to increase and the size of this increase is determined by the smoothness of W o .From future Lemma 2, it turns out that the third term on the RHS in ( 12) is a function of µ, η, and the smoothness of the multitask objective W o .By increasing η, this term is more likely to increase.The key conclusion will be that, while the second and third terms on the RHS in (12) will in general increase as the regularization strength η increases, the size of this increase is determined by the smoothness of W o which is in turn function of the network topology captured by L.
The more similar the tasks at neighboring agents are, the smaller these terms will be.This implies that as long as W o is sufficiently smooth, moderate regularization strengths η in the range ]0, ∞[ exist such that MSD at these values of η will be less than MSD at η = 0 which corresponds to the non-cooperative mode of operation.The best choice for η would be the one minimizing MSD in (12).We refer the reader to Fig. 2 in [2, Section II-B] for an illustration of this concept of multitask learning benefit.This example will be considered further in the numerical experiments section.

C. Modeling Assumptions from Part I [2]
In this section, we recall the assumptions used in Part I [2] to establish the network mean-square error stability (9).Assumption 2. (Gradient noise process) The gradient noise process defined in (6) satisfies for any w ∈ F i−1 and for all k, = 1, 2, . . ., N : for some β 2 k ≥ 0, σ 2 s,k ≥ 0, and where F i−1 denotes the filtration generated by the random processes {w ,j } for all = 1, . . ., N and j ≤ i − 1.
Let us introduce the network block vector W i = col{w 1,i , . . ., w N,i }.Recall from Part I [2] that at each iteration, we can view (7) as a mapping from W i−1 to W i : We introduced the following condition on the combination matrix (I M N − µηL).Assumption 3. (Combination matrix) The symmetric combination matrix (I M N − µηL) has nonnegative entries and its spectral radius is equal to one.Since L has an eigenvalue at zero, these conditions are satisfied when the step-size µ > 0 and the regularization strength η ≥ 0 satisfy: where condition (17) ensures stability and condition (18) ensures non-negative entries.
The results in Part I [2] established that the iterates w k,i converge in the mean-square-error sense to a small O(µ)− neighborhood around the regularized solution w o k,η .In this part of the work, we will be more precise and determine the size of this neighborhood, i.e., assess the size of the constant multiplying µ in the O(µ)−term.To do so, we shall derive an accurate first-order expression for the mean-square error (9); the expression will be accurate to first-order in µ.
To arrive at the desired expression, we first need to introduce a long-term approximation model and assess how close it is to the actual model.We then derive the performance for the long-term model and use this closeness to transform this result into an accurate expression for the performance of the original learning algorithm.To derive the long-term model, we follow the approach developed in [9].The first step is to establish the asymptotic stability of the fourth-order moment of the error vector, E W o η − W i 4 , which has already been done in Part I [2].This property is needed to justify the validity of the long-term approximate model.Recall that to establish the fourthorder stability, we replaced condition (14) on the gradient noise process by the following condition on its fourth order moment: for some β 4 k ≥ 0, and σ 4 s,k ≥ 0. As explained in [9], condition (19) implies (14).
To establish the mean-stability (8), we introduced a smoothness condition on the Hessian matrices of the individual costs.This smoothness condition will be adopted in the next section when we study the long term behavior of the network.
Assumption 4. (Smoothness condition on individual cost functions).It is assumed that each J k (w k ) satisfies a smoothness condition close to w o k,η , in that the corresponding Hessian matrix is Lipchitz continuous in the proximity of w o k,η with some parameter κ d ≥ 0, i.e., for small perturbations ∆w k ≤ .

III. LONG-TERM NETWORK DYNAMICS
Let W i = W o η − W i .Subtracting the vector (I M N − µηL)W o η from both sides of recursion ( 16), and using (6), we obtain: From the mean-value theorem [33, pp. 24], [9, Appendix D], we have: where and from the optimality condition of (4), we have: Replacing ( 22) into ( 21) and using (24), we arrive at the following recursion for W i : where We move on to motivate a long-term model for the evolution of the network error dynamics, W i , after sufficient iterations, i.e., for i 1.We examine the stability property of the model, the proximity of its trajectory to that of the original network dynamics (25), and subsequently employ the model to assess network performance.

A. Long-term error model
We introduce the error matrix , which measures the deviation of H i−1 from the constant matrix: with each H k,η given by the value of the Hessian matrix at the regularized solution, namely, Then, we can write: Using (31), we can rewrite the error recursion (25) as: in terms of the random perturbation sequence: Under Assumptions 1 and 4, and for small µ, it can be shown that lim sup i→∞ E c i−1 = O(µ), and that c i−1 = O(µ) asymptotically with high probability (see Appendix A).Motivated by this result, we introduce the following approximate model, where the last term involving c i−1 in (32), which is O(µ 2 ), is removed: Obviously, the iterates that are generated by (34) are generally different from the iterates generated by the original recursion (25).To highlight this fact, we are using the prime notation for the state of the long-term model.Note that the driving process s i (W i−1 ) in ( 34) is the same gradient noise process from the original recursion (25) and is evaluated at W i−1 .In the following, we show that, after sufficient iterations i 1, the error dynamics of the network relative to the solution W o η is well-approximated by the model (34).

B. Size of Approximation Error
We start by showing that the mean-square difference between the trajectories { W i , W i } is asymptotically bounded by O(µ 2 ) and that the mean-square error performance of the long term model ( 34) is within O(µ 2 ) from the performance of the original recursion (25).Working with recursion (34) is much more tractable for performance analysis because its dynamics is driven by the constant matrix B η as opposed to the random matrix B i−1 in the original error recursion (25).Therefore, we shall work with the long-term model ( 34) and evaluate its performance, which will provide an accurate representation for the performance of the original distributed strategy (7) to first order in the step-size µ.
We shall discuss now the mean and mean-square error stability of the long-term approximate model (34).

C. Stability of First-Order Error Moment
Conditioning both sides of ( 34), invoking the conditions on the gradient noise from Assumption 2, and computing the conditional expectations, we obtain: Taking expectation again, we arrive at: The above recursion is stable if the matrix B η in (30) is stable.This matrix has a similar form to the matrix Similarly, it can be verified that B η is stable when condition ( 17) and condition are satisfied.In this case, we obtain where the RHS in the above expression is similar to the RHS in equation ( 49) [2] with H ∞ replaced by H η .
Lemma 2. (Mean stability of long-term model) Under Assumptions 1, 2, 3, and 4, and for sufficiently small µ, the steady-state bias W ∞ = lim i→∞ E W i of the long-term model (34) given by (40) satisfies: Proof.The proof is similar to the proof of Theorem 1 in Part I [2] with H ∞ replaced by H η .

D. Stability of Second-Order Error Moment
In the following, we show that the long term approximate model ( 34) is also mean-square stable in the sense that E w k,i 2 tends asymptotically to a region that is bounded by O(µ).We follow the same line of reasoning as in Part I [2, Section III-A] where we studied the mean-square stability of the original model (25).Based on the inequality: where is the steady-state bias of the long term model given by (40) and where W ∞ − W i follows the recursion: and from Theorems 1 and 2 in Part I [2] and previous Lemma 2, we can establish the mean-square stability of (34).
Let us introduce the mean-square perturbation vector (MSP ) at time i relative to W ∞ : Lemma 3. (Mean-square stability of the long-term model) Under Assumptions 1, 2, 3, and 4, the MSP at time i can be recursively bounded as: where: and MSP i is the mean-square perturbation vector at time i relative to the fixed point W ∞ = col{w k,∞ } N k=1 of algorithm (7) in the absence of gradient noise (see [2, Section III-A3]).A sufficiently small µ ensures the stability of the above recursion.It follows that and that Proof.See Appendix C.

IV. MEAN-SQUARE-ERROR PERFORMANCE
We established in Theorem 2 in Part I [2] that a network running strategy ( 7) is mean-square-error stable for sufficiently small µ.Specifically, we showed that In the following, we assess the size of the network mean-square-deviation (MSD) using the definition [9, Chapter 11]: In addition to Assumptions 1, 2, 3, 4, and condition (19) on the individual costs, J k (w k ), the gradient noise process, s k,i (w k ), and the combination matrix, I N − µηL, we introduce a smoothness condition on the noise covariance matrices.
For any w k ∈ F i−1 , we let denote the conditional second-order moment of the gradient noise process, which generally depends on i because the statistical distribution of s k,i (w k ) can be iteration-dependent, and is random since it depends on the random iterate w k .We assume that, in the limit, this covariance matrix tends to a constant value when evaluated at w o k,η and we denote the limit by: Assumption 5. (Smoothness condition on the noise covariance) It is assumed that the conditional second-order moment of the noise process is locally Lipschitz continuous in a small neighborhood around w o k,η , namely, for small perturbations ∆w k ≤ , and for some constant κ d ≥ 0 and exponent 0 < θ ≤ 4.
One useful conclusion that follows from Assumption 5 is that, after sufficient iterations, we can express the covariance matrix of the gradient noise process, s k,i (w k ), in terms of the limiting matrix R s,k,η defined in (53).
Specifically, following the same proof used to establish Lemma 11.1 in [9], we can show that under the smoothness condition and for small step-size, the covariance matrix of the gradient noise process, s k,i (w k,i−1 ), at each agent k satisfies for i 1: For clarity of presentation, we sketch the proof in Appendix D where we used results from Theorems 2 and 3 in Part I [2].
Before studying the steady-state network performance, we establish some properties of the matrix: which is defined in terms of the block Kronecker operation using blocks of size M × M .In the derivation that follows, we shall use the block Kronecker product ⊗ b operator [34] and the block vectorization operator bvec(•).
As explained in [9], these operations preserve the locality of the blocks in the original matrix arguments.Since ρ(F η ) = (ρ(B η )) 2 , the matrix F η is stable under conditions (17) and (39).This matrix plays a critical role in characterizing the performance of the distributed multitask algorithm.In our derivations, the matrix F η will also appear transformed under the orthonormal transformation: where Lemma 4. (Coefficient matrix F η ) For sufficiently small step-size, it holds that and where X is an N 2 × N 2 block diagonal matrix with each block of dimension M 2 × M 2 : with ⊕ denoting the Kronecker sum operator [35]: The matrix W is an N 2 × N 2 block matrix arising from the matrices {H mn |m = n}.Moreover, we can also write: Proof.See Appendix E.
As we shall see in Theorem 1, it turns out that the decomposition in ( 59) is very useful to highlight some important facts arising in the steady-state performance of the multitask algorithm.
Theorem 1. (Network MSD performance) Under the same conditions of Lemma 5, it holds from Lemma 4 that the steady-state network MSD defined in (51) can be written as: Proof.See Appendix G.
As the derivation in Appendix G reveals, the second term on the RHS of (68) results from the matrix W in (59) which is zero when {H mn = 0, m = n}.When the Hessian matrices are uniform across the agents: we have H mn = 0 for m = n.In this case, the network MSD in (68) simplifies to: Moreover, in the non-uniform Hessian matrices scenario, by letting η = µ − with > 0 chosen such that Assumption 3 is satisfied, we obtain: In this case, the first term on the RHS of (68) dominates the factor O(µ 1+ ) and when we evaluate the network MSD according to definition (51), the last term on the RHS of (68) disappears when computing the limit as µ → 0.
As we will see by simulations, the first term on the RHS of (68) provides a good approximation for the network MSD for any η ≥ 0.
The first term on the RHS of (68) reveals explicitly the influence of the step-size µ, regularization strength η, network topology (through the eigenvalues λ m and eigenvectors v m of the Laplacian), gradient noise (through the covariance matrices R s,k,η ), and data characteristics (through the Hessian matrices H k,η ) on the network MSD performance.Observe that this term consists of the sum of N individual terms, each associated with an eigenvalue λ m of the Laplacian matrix, and given by: where H m,η and R m,η are transformed versions of H η in (28) and S η in (67) at the m-th eigenvalue λ m , respectively: As shown in Section VI-A, under some assumptions on the data and noise characteristics, the individual terms in (72) are decaying functions of η or λ m .
Before proceeding, we note that expression (68) can be written alternatively as: where we used the fact that λ 1 = 0 and v 1 = 1 √ N 1 N .Expression (75) allows us to recover the network MSD performance of the single-task diffusion adaptation employed to estimate w given by: To see this, we recall the expression for the network MSD performance of single-task diffusion adaptation derived in [9, pp. 594]: where H k, ∇ 2 wk J k (w ) and R s,k, is the covariance of the gradient noise in (53) at w .In order to estimate w using the multitask strategy (7), a very large η needs to be chosen.In this case, we have H k,η = H k, and R s,k,η = R s,k, .Moreover, the second and third terms on the RHS of expression (75) will be O(µ/η) which are negligible for a very large η.Thus, we obtain (77).

V. MULTITASK LEARNING BENEFIT
Now that we have studied in detail the mean-square performance of the multitask strategy (7) relative to W o η , the minimizer of the regularized cost (4), we will use the results to examine the benefit of multitask learning compared to the non-cooperative solution under the smoothness assumption.
Since each cost is strongly convex, each agent k is able to estimate w o k on its own, if desired, in a non-cooperative manner by running strategy (7) with η = 0. We know from previous established results on single-agent adaptation [9, pp. 390] that the network MSD in that case will be given by: where H k,o ∇ 2 wk J k (w o k ) and R s,k,o is the covariance of the gradient noise in (53) at w o k .Note that expression (68) allows us to recover the mean-square-error performance of stand-alone adaptive agents.In particular, it can be easily verified that (68) reduces to expression µ 2 Tr(H −1 k,o R s,k,o ) for the mean-square deviation of single agent learner [9, pp.390] when the network size is set to N = 1 and the topology is removed.
When η > 0 is used, the graph Laplacian regularizer (4) induces a bias relative to W o .However, when W o is smooth, we expect that promoting the smoothness of W o through regularization can improve the network MSD performance despite the bias induced in the estimation.

A. Induced bias relative to W o
The bias of the strategy ( 7) is given by: From the triangle inequality we have: where lim sup i→∞ E W o η − W i can be replaced by W ∞ in (40).From expression (31) in Part I [2] we know that the smoother W o is, the smaller W o − W o η will be.Furthermore, we know from Theorem 4 in Part I Thus, for a smooth signal W o and a small µ, the bias in (79) will be small.

B. Network MSD relative to W o
The mean-square-error performance of the strategy (7) relative to W o is given by: and the network MSD can be expressed as: where we used the bar notation to denote the network MSD relative to W o and where MSD is given in (65) or (68).Note that lim sup i→∞ E W o η − W i can be replaced by W ∞ in (40).In order to improve the network MSD compared to the non-cooperative case (78), the regularization strength η must be chosen such that: and the optimal choice of η is the one minimizing MSD in (82) subject to η ≥ 0.

VI. SIMULATION RESULTS
We consider a connected network of N = 15 nodes and M = 5 with the topology shown in Fig. 1 (left).Each agent is subjected to streaming data {d k (i), u k,i } that are assumed to satisfy a linear regression model [9], [16]: for some unknown M × 1 vector w o k with v k (i) a measurement noise.A mean-square-error cost is associated with agent k: The processes {d k (i), u k,i , v k (i)} are assumed to represent zero-mean jointly wide-sense stationary random pro- where R u,k > 0 and the Kronecker delta δ m,n = 1 if m = n and zero otherwise; ii) Ev k (i)v (j) = σ 2 v,k δ k, δ i,j ; iii) the regression and noise processes {u ,j , v k (i)} are independent of each other.We set a k = 0.1 if ∈ N k and 0 otherwise.It turns out that the Laplacian matrix has 15 distinct eigenvalues.We generate W o according to W o = VW o = col{w o m } N m=1 with: Recall from Part I [2] that S(W) in (3) can be written as: where w m = (v m ⊗ I M )W.From (87), we observe that the larger {τ j ≥ 0} are, the smoother the signal W o is.Note that, for MSE networks, it holds that Thus, the fixed point bias given by (49) in Part I [2] is equal to the long-term model bias W ∞ in (40).Furthermore, from the definition (6), the gradient noise process at agent k is given by: Consider the covariance R s,k,η defined in (53).From (88), we have: where To evaluate (89), we need the fourth order moment of the regressors.Let us assume that the regressors are zero-mean real Gaussian.In this case, using the fact that W k,η is symmetric, we obtain [36]: Replacing the above expression in (89), we obtain:

A. Uniform data profile
In this setting, we assume uniform covariance matrices scenario where R u,k = R u ∀k.This scenario is encountered for example in distributed denoising problems in wireless sensor networks (or image denoising) [28], [37].In this case, R u is equal to one.In such problems, the N sensors in the network are observing N -dimensional signal, with each entry of the signal corresponding to one sensor.Using the prior knowledge that the signal is smooth w.r.t.
the underlying topology, the sensor task is to denoise the corresponding entry of the signal by performing local computations and cooperating with its neighbor in order to improve the error variance.
It turns out that in this scenario, the output of the network W o η can be interpreted as the output of a low-pass graph filter applied to the graph signal W o [1], [28], [38].To see this, let us first recall the notion of graph frequencies, graph Fourier transform, and graph filters [28], [37]- [39].Consider the connected graph G = {N , E, A} equipped with a Laplacian matrix L, which can be decomposed as L = V ΛV .A graph signal supported on the set N is defined as a vector x ∈ R N whose k-th component x k ∈ R represents the value of the signal at the k-th node.By analogy to the classical Fourier analysis, the eigenvectors of the Laplacian matrix L are used to define a graph Fourier basis V and the eigenvalues are considered as the graph frequencies.The graph Fourier transform (GFT) transforms a graph signal x into the graph frequency domain according to x = V x where {x 1 , . . ., x N } are called the spectrum of x.A graph filter Φ is an operator that acts upon a graph signal x by amplifying or attenuating its spectrum as: Φ = N m=1 Φ(λ m )x m v m .The frequency response of the filter Φ(λ) controls how much Φ amplifies the signal spectrum.Low frequencies correspond to small eigenvalues, and low-pass or smooth filters correspond to decaying functions Φ(λ).Since we are dealing with vectors w k ∈ R M instead of scalars x k , the graph transformation x = V x becomes W = (V ⊗ I M )W.In the uniform covariance matrices scenario, we have (24) in Part I [2] reduces to: since and zero otherwise.By applying the matrix inversion identity [40], it holds that the m-th subvector corresponding to the m-th eigenvalue (or graph frequency) of If η = 0, we are in the case of an all-pass graph filter since the frequency content of the output signal W o η , is the same as the frequency content of the input signal W o .For η > 0, we are in the case of a low-pass graph filter since the norm of the m-th frequency content of the output signal W o η , namely, w o m,η , is less than or equal to the norm of the m-th frequency content of the input signal W o , namely, w o m .For fixed η, as m increases, the ratio in (95) decreases.This validates the low-pass filter interpretation.The regularization parameter η controls the sharpness of the low-pass filter.We illustrate this in Fig. 2  We observe that a similar behavior arises when studying the network MSD for smooth signal W o .When W o is smooth, W k,η in ( 90) is small and in this case, the covariance matrix R s,k,η in (92) can be approximated by In this case, the network MSD expression in (70) can be approximated as: Since the trace of a matrix is equal to the sum of its eigenvalues, we have: The above expression shows that for a fixed λ m , as η increases, the above trace decreases.We conclude that, when η increases, the sum on the RHS of (96) decreases.By further assuming uniform noise profile, i.e., σ 2 v,k = σ 2 v ∀k, expression (96) reduces to: From (97), we conclude that, for a fixed η, as λ m increases, the corresponding trace term in (98) decreases.This case is illustrated numerically in Fig. 2 (right).In the experiment, we set R u = I M , σ 2 v = 0.1, µ = 0.005, and τ j = 7 + j in (86) so that W o is smooth and we illustrate MSD(λ m ) in (98) for different values of η .

B. Varying data profile
We assume that R u,k = σ 2 u,k I M for all k.In this case, expression (92) reduces to: The variances σ 2 u,k and σ 2 v,k are given in Fig. 1 (right).We set τ j = j in (86).In order to characterize the influence of the step-size µ and the regularization parameter η on the performance of the algorithm relative to W o η , we report in Fig. 3 (left) the squared norm of the fixed point bias W ∞ = W o η − W ∞ given by (49) in Part I [2] (which is equal to the long-term model bias W ∞ in ( 40)) as a function of η where η = µ − with ≥ −1 for µ = {10 −3 , 10 −4 , 10 −5 }.We observe that for small η, the squared norm of the bias increases 40 dB per decade (when η goes from η 1 to 10η 1 ).This means that the squared norm of the bias is on the order of η 4 for fixed µ.For large η, the bias becomes constant and we see that, when µ goes from µ 1 to 10µ 1 , it increases 20 dB.This means that the squared norm of the bias is on the order of µ 2 .Finally, we observe that for fixed η, the squared norm of the bias is on the order of µ 2 .In Fig. 3 (middle), we report the network MSD learning curves relative to W o η obtained by running strategy (7) for η = 5 and for two different values of µ.The curves are obtained by averaging the trajectories In the simulations, we use the following approximation for the network MSD expression in (68): where we use the subscript "app" to indicate that it is an approximation.Compared with (68), we see that the term 1)+O(η)) has been removed in (100).It is observed that the learning curves tend to the same MSD value predicted by the theoretical expression (65) (with O(µ 1+θm ) removed).Furthermore, we observe that the MSD predicted by expression (100) provides a good approximation for the performance of strategy (7).Finally, it can be observed that the MSD is on the order of µ.In Fig. 1 (right), we report the MSD learning curves relative to W o η obtained by running strategy (7) for µ = 10 −3 and for six different values of η.As it can be seen, by increasing η, the network MSD decreases.Furthermore, it is observed that expression (100) provides a good approximation for the network MSD for any η ≥ 0.
In Fig. 4, we characterize the influence of η on the performance of strategy (7) relative to W o with W o a smooth signal generated according to (86) with τ j = 7 + j.We set µ = 5 • 10 −3 .In Fig. 4 (left), we plot MSD for η ∈ [0, 350].To generate MSD we use (82) with MSD replaced by MSD app in (100) which has a low computational complexity.As it can be seen from this plot, η = 4 gives the best MSD.In Fig. 4 (right), we report the network It is observed that the learning curves tend to the same MSD value predicted by the theoretical expression (82) with MSD replaced by (65) (with O(µ 1+θm ) removed) or (100).

VII. CONCLUSION
In this paper, and its accompanying Part I [2], we considered multitask inference problems where agents in the network have individual parameter vectors to estimate subject to a smoothness condition over the graph.Based on diffusion adaptation, we proposed a strategy that allows the network to minimize a global cost consisting of the aggregate sum of the individual costs regularized by a term promoting smoothness.We showed that, for small step-size parameter, the network is able to approach the minimizer of the regularized problem to arbitrarily good accuracy levels.Furthermore, we showed how the regularization strength influences the limit point and the steady-state mean-square-error (MSE) performance of the algorithm.Analytical expressions illustrating these effects are derived.These expressions revealed explicitly the influence of the network topology, data settings, step-size parameter, and regularization strength on the network MSE performance and provided insights into the design of effective multitask strategies for distributed inference over networks.Illustrative examples were considered and links to spectral graph filtering were also provided.

APPENDIX A SIZE OF THE PERTURBATION SEQUENCE c i−1
The argument is similar to the one presented in [9,Sec. 10.1].For ease of reference, we provide a sketch of the proof here.Let For each agent k, we have from eq. (174) in Part I [2]: so that and, consequently, from Theorem 2 in Part I [2].
In the following, we argue that c i−1 = O(µ) asymptotically with high probability.Let us introduce the nonnegative random variable u = c i−1 and let us recall Markov's inequality which states that for any nonnegative random variable u and ξ > 0, it holds that: Let r c = nµ, for any constant integer n ≥ 1 that we are free to choose.We then conclude from ( 103) and ( 104) that for i 1: where the term O(1/n) is independent of µ.This result shows that the probability of having c i−1 bounded by r c can be made arbitrarily close to one by selecting a large enough value for n.Once the value for n has been fixed to meet a desired confidence level, then r c = O(µ).

APPENDIX B PROOF OF LEMMA 1
To simplify the notation, we introduce the difference: Subtracting recursions (32) and (34), we get: in terms of the random perturbation sequence c i−1 given in (33).For each agent k, we have from (101): It follows that and Applying Jensen's inequality [41, pp. 77] to the convex function • 2 , we obtain from (107): where C is given by: Next note that for any arbitrary positive number t ∈ (0, 1).We select t = γ k with γ k defined in (48).This gives Let us introduce the mean-square perturbation vector at time i: Replacing ( 114) into (111), and using (110), we obtain: with G = diag{γ k } N k=1 .Iterating the above recursion starting from i = 1, we obtain: Under Assumption 3 and condition (39), the matrix CG is guaranteed to be stable.Following similar arguments to the ones used to establish eq. ( 68) in Part I (Appendix E) [2], and from Theorem 3 in Part I [2], we conclude that: where we used the fact that Finally note that where we used |Ex| ≤ E|x| from Jensen's inequality and where we applied Hölder's inequality: , when 1/p + 1/q = 1.

APPENDIX C
PROOF OF LEMMA 3 From (43), we have: where C is defined in (112) and where φ k,i is given by: Applying Jensen's inequality [41, pp. 77] to the convex function • 2 , we obtain: Under Assumption 2, we have: where Σ k (I M − µH k,η ) 2 , which due to Assumption 1, can be bounded as follows: where γ k is given by (48).Taking expectation again in (125), and using the bound (136 from Part I [2], we obtain: Now, combining (127) and (124), we obtain (45).

APPENDIX D LIMITING SECOND-ORDER MOMENT OF GRADIENT NOISE
It is shown in [9,Sec. 4.1] that the conditional noise covariance matrix satisfying Assumptions 2 and 5 satisfies more globally a condition of the following form for all w k ∈ F i−1 : for some nonnegative constant κ d .By adding and subtracting the same term we have: which using definition (52) can be rewritten as: Subtracting R s,k,η defined by (53) from both sides and computing expectations, we get: It then follows from the triangle inequality of norms, and from Jensen's inequality that: Computing the limit superior of both sides and using (53) to annihilate the limit of the first term on the RHS, we conclude that: We next use the smoothness condition (129) to bound the right-most term as follows: Under expectation and in the limit, we have: where we applied Jensen's inequality to the function f (x) = x θ 4 ; this function is concave over x ≥ 0 for θ ∈ (0, 4]. Moreover, in the last step we called upon the results in Theorems 2 and 3 in Part I [2] where it is shown that the second and fourth order moments of w k,i−1 are asymptotically bounded by O(µ) and O(µ 2 ), respectively.

APPENDIX E PROOF OF LEMMA 4
Consider the matrix F η in (57).Using the block Kronecker product property: it can be verified that: where so that: where H mn is defined in (63).It can be verified that the matrix We denote by [Z mn ] pq the M 2 × M 2 (p, q)-th block of Z mn .We have: We have: where H mm ⊕ H pp is given by (62) and Thus, Before proceeding, we recall the following useful properties of the Kronecker and Kronecker sum products [35].
Let {λ i (A), i = 1, . . ., M } and {λ j (B), j = 1, . . ., M } denote the eigenvalues of any two M × M matrices A and B, respectively.Then, From ( 142), (143), and (146), it can be verified that the matrix Z can be written as: The matrix X is N 2 × N 2 block diagonal defined as: where we used the fact that for m = 1 and p = 1, we have . This is due to property (148) and the fact that For the remaining blocks of X, from (145), property (148), and the fact that 0, it can be verified that the matrix: is also positive definite when: Furthermore, in this case, we have block matrix where each block is M 2 × M 2 given by: Applying the matrix inversion identity [40], we obtain: From (150), we have: From ( 155) and (153), we have: and Applying the block inversion formula to I + X −1 Y , we obtain: Finally, from (155), (156), and (158), we conclude that: Consider now the matrix It can be verified that: APPENDIX F PROOF OF LEMMA 5 Recursion (34) for the long term model includes a constant driving term on the RHS represented by µ 2 η 2 L 2 W o η .To facilitate the variance analysis, we introduce the centered variable: Subtracting ( 38) from (34), we obtain: where the deterministic driving terms are removed.Although we are interested in evaluating the asymptotic size of , we can rely on the centered variable z i for this purpose, since, from Lemma 2, it holds for i 1: for fixed η.Furthermore, we established in Lemma 1, that the error variances E W i 2 and E W i 2 are within 2 ) from each other.Therefore, we may evaluate the mean-square error in terms of the mean-square value of the variable z i by employing the correction: We therefore continue with recursion (162) and proceed to examine how the mean-square value of z i evolves over time by relying on the energy conservation arguments.
Let Σ denote an arbitrary symmetric positive semi-definite matrix that we are free to choose.Equating the squared weighted values of both sides of (162) and taking expectations conditioned on the past history gives: Taking expectation again removes the conditioning and we get: Consider the right-most term.We have: for some non-negative constant b o = Tr(Σ) • O(µ 2+min{1, θ 2 } ).Substituting (172) into (166) we obtain for i 1: Using the sub-additivity property of the limit superior, we obtain: Grouping terms we get: We conclude that the limit superior of the error variance satisfies: In order to obtain identity as a weighting matrix on the mean-square value of z i in (176), we select Σ as the solution to the following discrete time Lyapunov equation: We know that B η is stable under conditions ( 17) and (39).Accordingly, we are guaranteed that the above Lyapunov equation has a unique solution Σ, and moreover, this solution is symmetric and non-negative definite as desired.
We can then focus on evaluating the RHS of (176).
For this purpose, we start by applying the block vectorization operation to both sides of (177) to find that: where we used a positive constant r to account for the fact that matrix norms are equivalent.
Returning to (176), and using (177), ( 180), (181), and (182), we conclude that: with θ ∈ (0, 4].But since F η is a stable matrix, we can employ the expansion: and write: This series converges to the trace value of the unique solution of the following Lyapunov equation: where Consequently, using (164), we obtain: where This vector is the unique solution to the linear system of equations: or, equivalently, by using (151): where We conclude from the above equation that X m is the unique solution to the continuous time Lyapunov equation: whose solution is given by: Using the definitions (66), (67), and applying properties bvec(ACB) = (B ⊗ b A)bvec(C), and Tr(AB) = (bvec(B )) bvec(A), (199)

2 = 1 2 = 1
(left).In the experiment, we set R u = I M and τ j = 0 in (86) so that w o m ∀m and we illustrate in Fig.2(left) the squared 2 -norm of w o m,η for different values of η from (94).In order to visualize the frequency response of the graph filter, we also plot in dashed lines the ratio for λ m ∈ [0, 1.2] with w o m ∀m.

Fig. 2 . 2 for
Fig. 2. Uniform data profile scenario.(Left) Spectral content of W o η from (94) for different η.Dashed lines correspond to the ratio w o m,η

36 Fig. 4 .
Fig. 4. Network performance relative to W o with a smooth signal W o at µ = 0.005.(Left) Network MSD as a function of the regularization strength η ∈ [0, 350].(Right) Evolution of the network MSD learning curve for three different values of η.