Learning Over Multitask Graphs—Part I: Stability Analysis

This paper formulates a multitask optimization problem where agents in the network have individual objectives to meet, or individual parameter vectors to estimate, subject to a smoothness condition over the graph. The smoothness condition softens the transition in the tasks among adjacent nodes and allows incorporating information about the graph structure into the solution of the inference problem. A diffusion strategy is devised that responds to streaming data and employs stochastic approximations in place of actual gradient vectors, which are generally unavailable. The approach relies on minimizing a global cost consisting of the aggregate sum of individual costs regularized by a term that promotes smoothness. We show in this Part I of the work, under conditions on the step-size parameter, that the adaptive strategy induces a contraction mapping and leads to small estimation errors on the order of the small step-size. The results in the accompanying Part II will reveal explicitly the influence of the network topology and the regularization strength on the network performance and will provide insights into the design of effective multitask strategies for distributed inference over networks.


I. INTRODUCTION
Distributed inference allows a collection of interconnected agents to perform parameter estimation tasks from streaming data by relying solely on local computations and interactions with immediate neighbors.Most prior literature focuses on single-task problems, where agents with separable objective functions need to agree on a common parameter vector corresponding to the minimizer of an aggregate sum of individual costs [2]- [11].Many network applications require more complex models and flexible algorithms than single-task implementations since their agents may need to estimate and track multiple objectives simultaneously [12]- [22].Networks of this kind are referred to as multitask networks.Although agents may generally have distinct though related tasks to perform, they may still be able to capitalize on inductive transfer between them to improve their performance.
Based on the type of prior information that may be available about how the tasks are related to each other, multitask learning algorithms can be derived by translating the prior information into constraints on the parameter vectors to be inferred [12]- [22].For example, in [18]- [20], distributed strategies are developed under the assumption that the parameter vectors across the agents overlap partially.A more general scenario is considered in [21] where it is assumed that the tasks across the agents are locally coupled through linear equality constraints.In [22], the parameter space is decomposed into two orthogonal subspaces, with one of the subspaces being common to all agents.There is yet another useful way to model relationships among tasks, namely, to formulate optimization problems with appropriate regularization terms encoding these relationships [13]- [17].For example, the strategy developed in [13] adds squared 2 -norm co-regularizers to the mean-square-error criterion to promote task similarities, while the strategy in [14] adds 1 -norm co-regularizers to promote piece-wise constant transitions.
In this paper, and the accompanying Part II [23], we consider multitask inference problems where each agent in the network seeks to minimize an individual cost expressed as the expectation of some loss function.The minimizers of the individual costs are assumed to vary smoothly on the topology captured by the graph Laplacian matrix.The smoothness property softens the transition in the tasks among adjacent nodes and allows incorporating information about the graph structure into the solution of the inference problem.In order to exploit the smoothness prior, we formulate the inference problem as the minimization of the aggregate sum of individual costs regularized by a term promoting smoothness, known as the graph-Laplacian regularizer [24], [25].A diffusion strategy is devised that responds to streaming data and employs stochastic approximations in place of actual gradient vectors, which are generally unavailable.We show in this Part I of the work, under conditions on the step-size learning parameter µ, that the adaptive strategy induces a contraction mapping and that despite gradient noise, it is able to converge in the mean-square-error sense within O(µ) from the solution of the regularized problem, for sufficiently small µ.The analysis in the current part also reveals how the regularization strength η can steer the convergence point of the network toward many modes starting from the non-cooperative mode where each agent converges to the minimizer of its individual cost and ending with the single-task mode where all agents converge to a common parameter vector corresponding to the minimizer of the aggregate sum of individual costs.We shall also derive in Part II [23] a closed-form expression for the steady-state network mean-square-error relative to the minimizer of the regularized cost.This closed form expression will reveal explicitly the influence of the regularization strength, network topology, gradient noise, and data characteristics, on the network performance.Additionally, a closed-form expression for the steady-state network mean-square-error relative to the minimizers of the individual costs will be also derived in Part II [23].This expression will provide insights into the design of effective multitask strategies for distributed inference over networks.
There have been many works in the literature studying distributed multitask adaptive strategies and their convergence behavior.Nevertheless, with few exceptions [20], most of these works focus on mean-square-error costs.This paper, and the accompanying Part II [23], generalize distributed multitask inference over networks and applies it to a wide class of individual costs.Furthermore, previous works in this domain tend to show the benefit of multitask learning empirically by simulations.Following some careful and demanding analysis, we establish in Part II [23], which builds on the results of this Part I, a useful expression for the network steady-state performance.This expression provides insights into the learning behavior of multitask networks and clarifies how multitask distributed learning may improve the network performance.
Notation.All vectors are column vectors.Random quantities are denoted in boldface.Matrices are denoted in capital letters while vectors and scalars are denoted in lower-case letters.The operator denotes an element-wise inequality; i.e., a b implies that each entry of the vector a is less than or equal to the corresponding entry of b.The symbol diag{•} forms a matrix from block arguments by placing each block immediately below and to the right of its predecessor.The operator col{•} stacks the column vector entries on top of each other.The symbol ⊗ denotes the Kronecker product.

A. Problem formulation and adaptive strategy
We refer to Fig. 1 and consider a connected network (or graph) G = {N , E, A}, where N is a set of N agents (nodes), E is a set of edges connecting agents with particular relations, and A is a symmetric, weighted adjacency matrix.If there is an edge connecting agents k and , then [A] k = a k > 0 reflects the strength of the relation between k and ; otherwise, [A] k = 0. We introduce the graph Laplacian L, which is a differential operator defined as L = D − A, where the degree matrix D is a diagonal matrix with k-th entry [D] kk = N =1 a k .Since L is symmetric positive semi-definite, it possesses a complete set of orthonormal eigenvectors.We denote them by {v 1 , . . ., v N }.For convenience, we order the set of real, non-negative eigenvalues of L as 0 = λ 1 < λ 2 ≤ . . .≤ λ N = λ max (L), where, since the network is connected, there is only one zero eigenvalue with corresponding [26].Thus, the Laplacian can be decomposed as: where Let w k ∈ R M denote some parameter vector at agent k and let W = col{w 1 , . . ., w N } denote the collection of parameter vectors from across the network.We associate with each agent k a risk function J k (w k ) : R M → R assumed to be strongly convex.In most learning and adaptation problems, the risk function is expressed as the expectation of a loss function Q k (•) and is written as The expectation is computed over the distribution of this data.We denote the unique minimizer of J k (w k ) by w o k .We introduce a common assumption on the risks {J k (w k )}.This condition is applicable to many situations of interest (see, e.g., [7], [10]).

Assumption 1. (Strong convexity)
It is assumed that the individual costs J k (w k ) are each twice differentiable and strongly convex such that the Hessian matrix function H k (w k ) = ∇ 2 wk J k (w k ) is uniformly bounded from below and above, say, as: where λ k,min > 0 for k = 1, . . ., N .
In many situations, there is prior information available about W o = col{w o 1 , . . ., w o N }.In the current work, the prior belief we want to enforce is that the target signal W o is smooth with respect to the underlying weighted graph.
References [13]- [15] provide variations for such problems for the special case of mean-square-error costs.Here we treat general convex costs.Let L = L ⊗ I M .The smoothness of W can be measured in terms of a quadratic form of the graph Laplacian [24], [25], [27]: where N k is the set of neighbors of k, i.e., the set of nodes connected to agent k by an edge.Figure 1 provides an illustration.The smaller S(W) is, the smoother the signal W on the graph is.Intuitively, given that the weights are non-negative, S(W) shows that W is considered to be smooth if nodes with a large a k on the edge connecting them have similar weight values {w k , w }.Our objective is to devise and study a strategy that solves the following regularized problem: in a distributed manner where each agent is interested in estimating the k-th sub-vector of W o η = col{w o 1,η , . . ., w o N,η }.The tuning parameter η ≥ 0 controls the trade-off between the two components of the objective function.Reference [1] provides a theoretical motivation for the optimization framework where it is shown that, under a Gaussian Markov random field assumption, solving problem ( 4) is equivalent to finding a maximum a posteriori (MAP) estimate for W. We are particularly interested in solving the problem in the stochastic setting when the distribution ) is generally unknown.This means that the risks J k (w k ) and their gradients ∇ wk J k (w k ) are unknown.As such, approximate gradient vectors need to be employed.A common construction in stochastic approximation theory is to employ the following approximation at iteration i: where x k,i represents the data observed at iteration i.The difference between the true gradient and its approximation is called the gradient noise s k,i (•): Each agent can employ a stochastic gradient descent update to estimate w o k,η : where µ > 0 is a small step-size parameter.In this implementation, each agent k collects from its neighbors the estimates w ,i−1 , and performs a stochastic-gradient descent update on: By introducing an auxiliary variable ψ k,i , strategy (7) can be implemented in an incremental manner: where we replaced (w k,i−1 − w ,i−1 ) in the second step by the difference (ψ k,i − ψ ,i ) since we expect ψ k,i to be an improved estimate compared to w k,i−1 .Note that if we introduce the coefficients: then recursion (9) can be written in the diffusion form [6]- [10]: where the second step is a combination step.If we collect the scalars {c k } into the matrix C = [c k ], then the entries of C are non-negative for small enough µ and its columns and rows add up to one, i.e., C is a doublystochastic matrix.We shall continue with form (9) because the second step in (9) makes the dependence on η explicit.We will show later that by varying the value of η we can make the algorithm behave in different ways from fully non-cooperative to fully single-task with many other modes in between.

B. Summary of main results
Before delving into the study of the learning capabilities of ( 9) and its performance limits, we summarize in this section, for the benefit of the reader, the main conclusions of this Part I, and its accompanying Part II [23].One key insight that will follow from the detailed analysis in this Part I is that the smoothing parameter η can be regarded as an effective tuning parameter that controls the nature of the learning process.The value of η can vary from η = 0 to η → ∞.We will show that at one end, when η = 0, the learning algorithm reduces to a non-cooperative mode of operation where each agent acts individually and estimates its own local model, w o k .On the other hand, when η → ∞, the learning algorithm moves to a single-mode of operation where all agents cooperate to estimate a single parameter (namely, the Pareto solution of the aggregate cost function).For any values of η in the range 0 < η < ∞, the network behaves in a multitask mode where agents seek their individual models while at the same time ensuring that these models satisfy certain smoothness and closeness conditions dictated by the value of η.We are not only interested in a qualitative description of the network behavior.Instead, we would like to characterize these models in a quantitative manner by deriving expressions that allow us to predict performance as a function of η and, therefore, fine tune the network to operate in different scenarios.
To begin with, recall that the objective of the multitask strategy ( 9) is to exploit similarities among neighboring agents in an attempt to improve the overall network performance in approaching the collection of individual minimizer W o by means of local communications.In light of the fact that algorithm (9) has been derived as an (incremental) gradient descent recursion for the regularized cost (4), whose minimizer W o η is in general different from W o , the limiting point of algorithm (9) will therefore be generally different from W o , the actual objective of the multitask learning problem.This mismatch is the "cost" of enforcing smoothness.The analysis in the paper will reveal that the mismatch is a function of the similarity between the individual minimizers {w o k }, of second-order properties of the individual costs, of the network topology captured by L, and of the regularization strength η.
In particular, future expression (31) will allow us to understand the interplay between these quantities which is important for the design of effective multitask strategies.The key conclusion will be that, while the bias (difference between W o η and W o ) will in general increase as the regularization strength η increases, the size of this increase is determined by the smoothness of W o which is in turn function of the network topology captured by L. The more similar the tasks at neighboring agents are, the smaller the bias will be.This result, while intuitive, is reassuring, as it implies that as long as W o is sufficiently smooth, the bias induced by regularization will remain small, even for moderate regularization strengths η.
The analysis also quantifies the benefit of cooperation, namely, the objective of improving the mean-square deviation around the limiting point of the algorithm.This analysis is challenging due to coupling among agents, and the multi-task nature of the learning process (where agents have individual targets but need to meet certain smoothness and closeness conditions with their neighbors).Section III in this Part I and Sections III and IV in Part II [23], and the supporting appendices, are devoted to carrying out this analysis in depth leading, for example to Theorem 1 in Part II [23].This theorem gives expressions for the mean-square-deviation (MSD) relative to The expressions reveal the effect of the step-size parameter µ, regularization strength η, network topology, and data characteristics (captured by the smoothness profile, second-order properties of the costs, and secondorder moments of the gradient noise) on the size of the steady-state mean-square-error performance.The results established in Theorem 1 and expression (82) in Part II [23] provide tools for characterizing the performance of multitask strategies in some great detail.
To illustrate the power of these results, consider a connected network where each agent is subjected to streaming data.The goal at each agent is to estimate a local parameter vector w o k from the observed data by minimizing a cost of the form J k (w k ) = E Q k (w k ; x k ), where x k denotes the random data.Consider network applications where the minimizers at neighboring agents tend to be similar [13], [25].Although each agent is interested in estimating its own task w o k , cooperating neighboring agents can still benefit from their interactions because of this closeness.Given the graph Laplacian and data characteristics, one problem of interest would be to determine the optimal cooperation rule, i.e., the value of η that minimizes the network mean-square-error performance.Future expression (82) in Part II [23] can be used to solve this problem since it allows us to predict the network MSD relative to W o .By using expression (82) in Part II [23], for example, we will be able to construct curves of the form shown in Fig. 2, which illustrate how performance is dependent on the smoothness parameter η and how the nature of the limiting solution varies as a function of this parameter.As it can be seen from this figure, η = 4 gives the best network steady-state mean-square performance.Note that η = 0 corresponds to the non-cooperative scenario and that a large η induces a large bias in the estimation.In the sequel we will show that as η varies from η = 0 to η → ∞, the network behavior moves from the non-cooperative mode of operation (where agents act independently) to the single-task mode of operation (where all agents focus on estimating a single parameter).For values of η in between, the network can operate in any of a multitude of multitask modes (where agents estimate their own local parameters under smoothness conditions to allow for some similarity between adjacent nodes).These limits are indicated in Fig. 2.
Finally, we would like to mention that one of the main tools used in the analysis in this work, and its accompanying Part II [23], is the linear transformation relative to the eigenspace of the graph Laplacian L from [6], which is also known as the graph Fourier transform [25], [28], [29].Under some conditions on the data and costs profile, we show in Section VI-A in Part II [23] how the diffusion type algorithm ( 9) exhibits a low-pass graph filter behavior.
Such filters are commonly used to reduce the network noise profile when the signal to be estimated is smooth with respect to the underlying topology [25], [30]- [32].Interestingly, the theoretical results established in this Part I, and its accompanying Part II [23], reveal the reasons for performance improvements under localized cooperation.

C. Network limit point and regularization strength
Before examining the behavior and performance of strategy (9) with respect to the limiting point W o η in (4), we discuss the influence of η on W o η .When η = 0, we have from (4) that W o η = W o and strategy (9) reduces to the single-agent mode of operation or the non-cooperative solution where each agent minimizes J k (w k ) locally without cooperation.When η → ∞, we have from (4) that W o η = 1 N ⊗ w where and we are in the single-task mode of operation where all agents seek to estimate a common parameter vector w corresponding to the minimizer of the aggregate sum of individual costs [6]- [10].In order to study more closely the influence of (finite) η > 0 on the network output W o η , we examine the influence of η on the transformed vector: with the m-th sub-vector w o m,η denoting the spectral content of W o η at the m-th eigenvalue λ m of the Laplacian: From ( 1), the quadratic regularization term S(W) in (3) can be written as: where w m = (v m ⊗I M )W and where we used the fact that λ 1 = 0. Intuitively, given that λ m > 0 for m = 2, . . ., N , the above expression shows that W is considered to be smooth if w m 2 corresponding to large λ m is small.As a result, for a fixed λ m > 0, and as the regularization strength η > 0 in (4) increases, one would expect w o m,η 2 to decrease.Similarly, for a fixed η ≥ 0, and as λ m > 0 increases, one would expect w o m,η 2 to decrease as well.
However, as we will see in the sequel, this behavior does not always hold.We show in Section VI-A in Part II [23] that this is valid when the Hessian matrix function , the cost J k (w k ) is quadratic in w k and is uniform across the network.For more general scenarios, this is not necessarily the case.What is useful to note, however, is that as η moves from 0 towards ∞, a variety of solution points W o η can occur ranging from the non-cooperative to the single-task solution at both extremes.
From the optimality condition of (4), we have: Using the mean value theorem [33, pp. 24], we can write: where Let . Relation (16) can then be rewritten more compactly as: Note that the inverse in (19) exists for all η ≥ 0 since the matrix L is positive semi-definite and, under Assumption 1, the matrix H o η is positive definite.Pre-multiplying both sides of the above relation by (V ⊗ I M ) gives: where Since L has a single eigenvalue at zero, Λ and V can be partitioned as follows: Lemma 1. (Limiting point) Under Assumption 1, it can be shown that W o η given by (20) satisfies: where Proof.See Appendix A where we also show that: Consider the difference between W o η and W o .It turns out that the smoother W o is, the smaller W o − W o η will be.To see this, let us subtract W o from both sides of equation (24).We obtain: The difference W o η − W o depends on [W o ] 2:N .Thus, from ( 15) and ( 31), we conclude that the smoother W o is, the smaller W o η − W o = W o η − W o will be.Lemma 1 will be useful in the sequel to establish Theorem 1 and to provide a low-pass graph filter interpretation for the uniform Hessian matrices scenario considered in Section VI-A in Part II [23].

III. NETWORK STABILITY
We examine the behavior of algorithm (9) under Assumption 2 on the gradient noise processes {s k,i (•)} defined in (6).As explained in [7], [10], these conditions are automatically satisfied in many situations of interest in learning and adaptation.Condition (32) essentially states that the gradient vector approximation should be unbiased conditioned on the past data, which is a reasonable condition to require.Condition (33) states that the second-order moment of the gradient noise process should get smaller for better estimates, since it is bounded by the squared norm of the iterate.Condition (34) states that the gradient noises across the agents are uncorrelated.
In this section, we analyze how well the multitask strategy (9) approaches the optimal solution W o η of the regularized cost (4).We examine this performance in terms of the mean-square-error measure, E w o k,η − w k,i 2 , the fourth-order moment, E w o k,η − w k,i 4 , and the mean-error process, E(w o k,η − w k,i ).To establish mean-square error stability, we extend the energy analysis framework of [6] to handle multitask distributed optimization.Then, following a similar line of reasoning as in [7,Chapter 9], we establish the stability of the first and fourth-order moments, which is necessary to arrive at an expression for the steady-state performance in Part II [23].
Let us introduce the network block vector W i = col{w 1,i , . . ., w N,i }.At each iteration, we can view (9) as a mapping from W i−1 to W i : We introduce the following condition on the combination matrix (I M N − µηL), which is necessary for studying the performance of ( 9).It can be easily verified that this requirement is always met by selecting µ and η to satisfy the bounds ( 36)- (37).

Assumption 3. (Combination matrix)
The symmetric combination matrix (I M N − µηL) has nonnegative entries and its spectral radius is equal to one.Since L has an eigenvalue at zero, these conditions are satisfied when the step-size µ > 0 and the regularization strength η ≥ 0 satisfy: where condition (36) ensures stability and condition (37) ensures non-negative entries.

A. Stability of Second-Order Error Moment
We first show that algorithm (9), in the absence of gradient noise, converges and has a unique fixed-point.Then, we analyze the distance between this point and the vectors w o k,η and w k,i in the mean-square-sense. 1) Existence and uniqueness of fixed-point: Without gradient noise, relation (35) reduces to: Let X col{x 1 , . . ., x N } denote an N × 1 block vector, where x k is M × 1.The mapping (38) is equivalent to the deterministic mapping X → Y defined as: Lemma 2. (Contractive mapping) Under Assumption 1 and condition (36), the deterministic mapping defined in (39) satisfies: with γ max 1≤k≤N {γ k } where: This mapping is contractive when µ satisfies: Proof.See Appendix B.
It then follows from Banach's fixed point theorem [34, pp. 299-303] that iteration (38) converges to a unique fixed point W ∞ = lim i→∞ W i = col{w 1,∞ , . . ., w N,∞ } at an exponential rate given by γ.Observe that this fixed point is not W o η .Since we wish to study lim sup i→∞ E W o η − W i 2 , which can be decomposed as: we shall first asses the size of W o η − W ∞ 2 and then examine the quantity 2) Fixed point bias analysis: Now we analyze how far this fixed point W ∞ is from the desired solution W o η when the step-size µ is small.We carry out the analysis in two steps.First, we derive an expression for W ∞ W o η − W ∞ and then we asses its size.Since W ∞ is the fixed point of (38), we have at convergence: Using the mean-value theorem [33, pp. 24], [7, Appendix D], we can write: where Subtracting the vector (I M N − µηL)W o η from both sides of (44) and using relation (45), we obtain: where 16), recursion (47) can be written alternatively as: so that: The inverse exists when , its spectral radius is less than one.Since the spectral radius of a matrix is upper bounded by any of its induced norms, we have: in terms of the 2−induced norm.Under condition (36) and since λ 1 (L) = 0, we have I M N − µηL = 1.From Assumption 1, we have: so that (41).We conclude that when (36) and ( 42) are satisfied, the inverse exists.
From (49), we observe that W ∞ is zero in two cases: i) when η = 0; ii) when w o k = w o ∀k, i.e., W o = 1 N ⊗ w o .In the second case, consider (24) and observe that Theorem 1. (Fixed point bias size) Under Assumption 1 and for small µ satisfying conditions (36) and (42), the steady-state bias W ∞ = W o η − W ∞ of the mapping (38) satisfies: Proof.See Appendix C.

3) Evolution of the stochastic recursion:
We now examine how close the stochastic algorithm (9) approaches W o η .First, we introduce the mean-square perturbation vector (MSP) at time i relative to W ∞ : The k-th entry of MSP i characterizes how far away the estimate w k,i at agent k and time i is from w k,∞ .
Theorem 2. (Network mean-square-error stability) Under Assumptions 1, 2, and 3, the MSP at time i can be recursively bounded as: where: A sufficient condition for the stability of the above recursion is: It follows that and Proof.See Appendix D. With regards to (59) note first that for fixed η, we have O(µ) + O(µ 2 ) = O(µ).When η and µ are coupled (η = µ − ), we obtain: .

B. Stability of Fourth-Order Error Moment
The results so far establish that the iterates w k,i converge to a small O(µ)− neighborhood around the regularized solution w o k,η .We can be more precise and determine the size of this neighborhood, i.e., assess the size of the constant multiplying µ in the O(µ)−term.To do so, we shall derive in Part II [23] an accurate first-order expression for the mean-square error (59); the expression will be accurate to first-order in µ.This expression will be useful because it will allow us to highlight several features of the limiting point of the network as a function of the parameter η.
To arrive at the desired expression, we first need to introduce a long-term approximation model and assess how close it is to the actual model.We then derive the performance for the long-term model and use this closeness to transform this result into an accurate expression for the performance of the original learning algorithm.When this argument is concluded we arrive at the desired performance expression, which we then use to comment on the behavior of the algorithm in a more informed manner.To derive the long-term model, we shall follow the approach developed in [7].The first step is to establish the asymptotic stability of the fourth-order moment of the error vector, E W o η − W i 4 .This property is needed to justify the validity of the long-term approximate model that will be introduced in Part II [23].
To establish the fourth-order stability, we replace condition (33) on the gradient noise process by the following condition on its fourth order moment: for some β 4 k ≥ 0, and σ 4 s,k ≥ 0. As explained in [7], condition (60) implies (33) and, likewise, condition (60) holds for important cases of interest.
Exploiting the convexity of the norm functions x 4 and x 2 and using Jensen's inequality, we can write: and (62) Let us introduce the mean-fourth perturbation vector at time i relative to W ∞ : Theorem 3. (Fourth-order error moment stability) Under Assumptions 1, 2, 3, and condition (60), the MFP at time i can be recursively bounded as: where A sufficiently small µ ensures the stability of the above recursion.It follows that and Proof.See Appendix E.

C. Stability of First-order Error Moment
We next need to examine the evolution of the mean-error vector E(W o η − W i ).To establish the mean-stability, we need to introduce a smoothness condition on the Hessian matrices of the individual costs.This smoothness condition will be adopted in the next Part II [23] when we study the long term behavior of the network.

Assumption 4. (Smoothness condition on individual cost functions).
It is assumed that each J k (w k ) satisfies a smoothness condition close to w o k,η , in that the corresponding Hessian matrix is Lipchitz continuous in the proximity of w o k,η with some parameter κ d ≥ 0, i.e., for small perturbations ∆w k ≤ .
From the triangle inequality, we have: Let us introduce the square-mean perturbation (SMP) vector at time i relative to W ∞ : Theorem 4. (First-order error moment stability) Under Assumptions 1, 2, 3, and 4, the SMP at time i can be recursively bounded as: where with κ d = max{κ d , λk,max−λk,min }.Under condition (60), a sufficiently small µ ensures the stability of the above recursion.It follows that and that (77) Proof.See Appendix F.
We have established so far the stability of the mean-error process, E(W o η − W i ), the mean-square-error 2 , and the fourth order moment E W o η − W i 4 .Building on these results, we will derive in Part II [23] closed form expressions for the steady-state performance of algorithm (9).Section VI in Part II [23] will provide illustration for the theoretical results in this part (Theorems 1, 2, and 4), and its accompanying Part II.

IV. SIMULATION RESULTS WITH REAL DATASET
In this section, we test algorithm (9) on a weather dataset corresponding to a collection of daily measurements (mean temperature, mean dew point, mean visibility, mean wind speed, maximum sustained wind speed, and rain or snow occurrence) taken from 2004 to 2017 at N = 139 weather stations located around the continental United States [35].We construct a representation graph G = (N , E, A) for the stations using geographical distances between sensors.Each sensor corresponds to a node k and is connected to |N k | neighbor nodes with undirected edges weighted according to a k = 1 2 (p k + p k ) with [30]: otherwise.We would like to construct a classifier that allows us to predict whether it will rain (or snow) or not based on the knowledge of the feature vector h k,i .In principle, each station could use an individual logistic regression machine [7], [36], [37], that seeks a vector w o k , such that γ k (i) = sign(h k,i w o k ) and In this application, however, it is expected that the decision rules {w o k } at neighboring stations will be similar.In the experiment, the dataset is split into a training set used to learn the decision rule w o k , and a test set from which γ k (i) are generated for performance evaluation.The first dataset comprises daily weather data recorded at the stations in the interval 2004 − 2012 (a total number of D a = 3288 days) and the training set contains data recorded in the interval 2012 − 2017 (a total number of D t = 1826 days).We set µ = 3 • 10 −4 and ρ = 10 −5 .We generate the first iterate w k,0 from the Gaussian distribution N (0, I M ) and we run strategy (9) over the training set (i = 1, . . ., D a ) for different values of η.For each value of η, we report in Table I the prediction error over the test set defined as: where N = 139 is the number of nodes, w k,∞ is the average of the last 200 iterates generated by the algorithm at agent k, and I[x] is the indicator function at x, namely, I[x] = 1 if x is true and 0 otherwise.Table I shows that through cooperation, the agents improve performance.This is due to the fact that the non-cooperative solution (η = 0) may suffer from a slow convergence rate [1, Section V-B] in which case some nodes may not be able to converge in the finite dataset scenario.By increasing η, the convergence rate improves.However, a large value of η (such as η = µ −1 ) yields a deterioration in the accuracy since in this case all agents converge approximately to the same classifier.By setting η = 45, we obtain the smallest prediction error.We show in Fig. 3 (right) the results of the prediction on July 30, 2015 across the US for η = 45.

Rainy day Dry day
Rainy day Dry day

V. CONCLUSION
In this work, we considered multitask inference problems where agents in the network have individual parameter vectors to estimate subject to a smoothness condition over the graph.Based on diffusion adaptation, we proposed a strategy that allows the network to minimize a global cost consisting of the aggregate sum of the individual costs regularized by a term promoting smoothness.We showed that, for small step-size parameter, the network is able to approach the minimizer of the regularized problem to arbitrarily good accuracy levels.Furthermore, we showed how the regularization strength can steer the convergence point of the network toward many modes starting from the non-cooperative mode and ending with the single-task mode.

APPENDIX A PROOF OF LEMMA 1
Consider the matrix inversion identity [38]: which allows us to write: for any invertible matrix U .Using (82), we write (20) alternatively as: Let Using the definitions ( 21) and ( 22), we can partition Q into blocks: with Q 11 , Q 12 , and Q 22 defined in ( 25), (26), and (27), respectively.Since which is positive definite from Assumption 1. Observe that Q is invertible since it is similar to H o η + ηL which is positive definite under Assumption 1. Now, by applying the block inversion formula to Q, we obtain: where Replacing ( 86) into (83) and using (21), we arrive at: Using definition ( 29) into (88), we conclude (24).Now, we establish (30).Let us first introduce the matrix G : Using the above definition and expressions ( 28) and ( 27), we can re-write the matrix K in (29) alternatively as: The matrix G in (89) is the Schur complement of H o η in (22) which can be partitioned as: Thus, G is positive definite since it is the Schur complement of the positive definite matrix H o η [39, pp. 651].Since G is symmetric, from Weyl's inequality [40, pp. 239] we have: Furthermore, since G is the Schur complement of the positive definite matrix H o η , we have [41,Theorem 5]: Therefore, from (92) and (93), we get: and Since the 2−induced norm of a positive definite matrix is equal to its maximum eigenvalue, we obtain: From the sub-multiplicative property of the 2−induced norm and from (90), (97), and (94), we obtain: Given any two input vectors X 1 and X 2 with corresponding updated vectors Y 1 and Y 2 , we have from (39): From the mean-value theorem [33, pp. 24], we have: Using (100) into (99), and the sub-multiplicative property of the 2−induced norm [7], we obtain: where We have Let ρ(•) denote the spectral radius of its matrix argument.Since L is symmetric, we have I N −µηL = ρ(I N −µηL).
Since L has one eigenvalue at zero, ρ(I N − µηL) is guaranteed to be equal to 1 if µη satisfies condition (36).For the block diagonal symmetric matrix D in (102), we have: Due to Assumption 1, we have: It follows that D ≤ γ where γ max 1≤k≤N {γ k } and γ k is given in (41).It holds that 0 < γ k < 1 when µ is chosen according to (42).Combining the previous results, we arrive at: for γ < 1 when ( 36) and ( 42) are satisfied and, in this case, the deterministic mapping ( 39) is a contraction.

APPENDIX C
PROOF OF THEOREM 1 From (49), we obtain the following expression for W ∞ : Pre-multiplying both sides of (107) by V = V ⊗ I M gives: where and J is given by (21).
In the following we show that W ∞ can be written as: where K is defined in (29) and: We introduce the following matrix, which appears in (108): where the blocks {P ij } are given by ( 112)-(115).Note that, under Assumption 1, P 11 in ( 112) is invertible since it can be bounded as follows: Applying the block inversion formula to P, we obtain: with T defined in (111).Replacing (118) into (108), and using ( 21) and ( 24), we conclude (110).
Our goal now is to show that for some constant c that may depend on η (the regularization strength), but not on µ (the step-size parameter).
From (110), we have: Since the Euclidean norm is continuous, we have lim µ→0 g(µ) = lim µ→0 g(µ) .In the following we show that Let us first establish (121).We have: From (30), we have For sufficiently small step-sizes, we have: Following the same line of reasoning as in (89)-(97), we can show that, when µ → 0, we have: Thus, we conclude (121).Now, we establish (122).From (117), we have P 11 = O(1) and P −1 From ( 6), (35), and (44), we have: Using the mean-value theorem (100), the above relation can be written as: where Let From the Laplacian matrix definition, it can be verified that the off-diagonal entries of the matrix C are non-negative and that its diagonal entries are non-negative under condition (37).Furthermore, since we have L1 N = 0, the entries on each row of C will add up to one.Thus, applying Jensen's inequality [39, pp. 77] to the convex function • 2 , we obtain from (128) and ( 130): where φ k,i is the k-th sub-vector of φ i given by: Squaring both sides of (133), conditioning on F i−1 , and taking expectations we obtain: where Σ k,i−1 (I M − µH k,i−1 ) 2 and where the cross term is zero because of the zero-mean condition (32).Due to Assumption 1, Σ k,i−1 can be bounded as follows: where γ k is given by (41).From Assumption 2, E[ s k,i (w k,i−1 ) 2 |F i−1 ] can be bounded as follows: Taking expectation again in (134), and using the bounds (135) and ( 136), we obtain: Now, combining (137) and (132), we obtain (54).
Iterating (54) starting from i = 1, we get: Under Assumption 3 and condition (57), the matrix CG can be guaranteed to be stable.To see this, we upper bound the spectral radius as follows: where we used the fact that, under condition (37), the matrix C is a right-stochastic matrix.We have: which is guaranteed to be less than one when: Then we conclude that the matrix CG is stable under condition (57).In this case, we have: Using the submultiplicative property of the induced infinity norm, we obtain: where we used the fact that C ∞ = 1 and where G ∞ = max 1≤k≤N γ 2 k + 3µ 2 β 2 k .From (140), we have: where Thus, Substituting into (143), we obtain: For sufficiently small µ, we have from (56) and Theorem 1 that b = O(1) + O(µ 2 η 4 )(O(1) + O(η)) −4 .We conclude that lim sup i→∞ MSP i ∞ ≤O(µ).

APPENDIX E PROOF OF THEOREM 3
Applying Jensen's inequality [39, pp. 77] to the convex function • 4 , we obtain from ( 128) and ( 130): where C and φ k,i are given by ( 131) and (133), respectively.Using the inequality [7, pp.523]: we obtain from (133) under Assumption 2 on the gradient noise: From Assumption 1, the matrices (I M − µH k,i−1 ) 2 and (I M − µH k,i−1 ) 4 can be bounded as follows: where γ k is given by (41).Thus, we obtain: Under condition (60), we have: where we applied Jensen's inequality to the function • 4 .Furthermore, from (136), the last term on the RHS of (154) can be bounded as follows: where we used the fact that for any random variable a, we have (Ea) Now, combining (157) and (149), we arrive at (64).
Iterating (64) starting from i = 1, we get: Under Assumption 3 and for sufficiently small µ, the matrix CG can be guaranteed to be stable.To see this, we upper bound its spectral radius as follows: since under condition (37), C is a right-stochastic matrix.The ∞−norm of G is given by: A sufficiently small µ ensures G ∞ < 1 and, thus, ensures the stability of CG .
We have established in Theorem 2 that, for small µ, after sufficient iterations have passed, MSP j converges to a bounded region on the order of µ.This implies that, there exists a j o large enough such that for all j ≥ j o it holds that: where C and H k,i−1 are given by ( 131) and (129), respectively.Let where Then, we can write: in terms of a deterministic perturbation sequence defined by By applying Jensen's inequality to the convex function • 2 , we obtain: for any arbitrary positive number t ∈ (0, 1).We select t = γ k where γ k is given by (41), which is guaranteed to be less than one under condition (42).From Assumption 1, we have I M − µH k,η 2 ≤ γ 2 k .Thus, we obtain: As shown in [7, Appendix E], the Hessian of a twice differentiable strongly convex function J k (w k ) satisfying Assumptions 1 and 4 is globally Lipschitz relative to w o k,η , namely, it satisfies: where κ d = max{κ d , λk,max−λk,min }.Then, for each agent k we obtain: and, hence, where we used the stochastic version of Jensen's inequality: when f (x) ∈ R is convex.Applying Jensen's inequality to the convex function • 2 and using the fact that (Ea) 2 ≤ Ea 2 for any real-valued random variable a, we obtain from (176): From (167) and using the above bound in (173), we conclude (73).
Iterating (73) starting from i = 1, we obtain: Under Assumption 3 and condition (42), the matrix CG is guaranteed to be stable.From (58), (68), and following similar arguments as the ones used to establish (68) in Appendix E, we conclude that Using (71) and since E(W ∞ − W i ) 2 = 1 N • SMP i , we conclude (77) from Theorem 1 and (76).

Fig. 1 .
Fig. 1.Agents linked by an edge can share information.The weight a k over an edge reflects the strength of the relation between w o k at node k and w o at node .

Fig. 3 .
Fig. 3. (Left) Occurrence of rain reported by 139 weather stations across the US on July 30, 2015.(Right) Prediction of rain occurrence from weather data based on logistic regression and multitask learning.

TABLE I RAIN
PREDICTION ERROR (80) IN WEATHER SENSOR NETWORKS FOR DIFFERENT VALUES OF REGULARIZATION STRENGTH η. η = 0 η = 10 η = 45 η = 100 η = 1000 η = µ −1 is the set of 4-nearest neighbors of node k and d k denotes the geodesical distance between the k-th and -sensors -see Fig. 3 (left).Let h k,i ∈ R M denote the feature vector at sensor k and day i composed of M = 5 entries corresponding to the mean temperature, mean dew point, mean visibility, mean wind speed, and maximum sustained wind speed reported at day i at sensor k.Let γ k (i) denote a binary variable associated with the occurrence of rain (or snow) at node k and day i, i.e, γ k (i) = 1 if rain (or snow) occurred and γ k 1 11 + P −1 11 P 12 T P 21 P −1 2≤ Ea 2 .Replacing (155) and (156) into (154),