Networked Exponential Families for Big Data Over Networks

The data generated in many application domains can be modeled as big data over networks, i.e., massive collections of high-dimensional local datasets related via an intrinsic network structure. Machine learning for big data over networks must jointly leverage the information contained in the local datasets and their network structure. We propose networked exponential families as a novel probabilistic modeling framework for machine learning from big data over networks. We interpret the high-dimensional local datasets as realizations of a random process distributed according to some exponential family. Networked exponential families allow us to jointly leverage the information contained in the local datasets and their network structure in order to learn a tailored model for each local dataset. We formulate the task of learning the parameters of networked exponential families as a convex optimization problem. This optimization problem is an instance of the network Lasso and enforces a data-driven pooling (or clustering) of the local datasets according to their corresponding parameters of the exponential family. We derive an upper bound on the estimation error of the network Lasso. This upper bound depends on the network structure and the information geometry of the node-wise exponential families. The insights provided by this bound can be used to determine how much data needs to be collected or observed to ensure that the network Lasso is accurate. We also provide a scalable implementation of the network Lasso as message passing between adjacent local datasets. Such message passing is appealing for federated machine learning relying on edge computing. Finally, we note that the proposed method is also privacy-preserving, because only parameter estimates, and no raw data, are shared among different nodes.


I. INTRODUCTION
The data generated in many important application domains has an intrinsic network structure. Networked data arises in the study of social networks, natural language processing and personalized medicine [3], [9], [51]. Existing network science provides powerful tools for the analysis of such data based solely on its intrinsic network structure [14], [38].
We consider networked data where each node of the network represents a local dataset or a high-dimensional data point. To study disease spread, we represent individuals by nodes in a network whose links indicate physical proximity as relevant for disease transmission [38, Ch. 17]. Existing compartment models consider all individuals to behave (structurally) similarly, somewhat like an ''i.i.d. assumption'' in statistical learning theory. However, the tendency to get infected might strongly depend on personal factors such as current health status. These personal factors can be characterized via a plethora of attributes, ranging from healthcare records up to recent travel history [35]. In natural language processing, text corpora are represented as networks of documents that are connected via co-authorship [9].
To jointly capitalize on network structure and the information conveyed by high-dimensional data points, we introduce networked exponential families. Networked exponential families couple network structure with (node-wise) local parameters of an exponential family [50]. Conceptually, they unify and considerably extend non-parametric models for learning clustered or smooth graph signals and, more generally, networked (generalized) linear models [11], [28], [32]. Another special case of networked exponential families are networked time series models. Networked time series models are useful for networks of weather observation stations (see Section VII-B).
Networked exponential families can also be applied to non-parametric density estimation [36]. Indeed, we might represent partitions of the feature space as a planar graph with nodes representing individual regions. The distribution of the data points within a particular region is estimated using an exponential family whose parameters are optimized for a specific region. Another important application of networked exponential families is to stratified models [40]. Indeed, stratified models are networked exponential families with each node in the network representing one stratum [48].
Networked exponential families are powerful statistical models for many important application domains such as personalized (high-precision) health-care [31], or natural language processing [4], [9]. In contrast to [9], which uses a probabilistic model for the network structure of text corpora, this article assumes the network structure as fixed and known.
While traditional clustering methods only use network structure, methods based on networked exponential families also use the information provided by node attributes. Joint clustering and optimization has been considered in [52] for probabilistic models of the network structure. We consider the network structure fixed and given and use a probabilistic model for the node attributes (features and labels).

A. EXISTING WORK
The closest to this work is [32] which considers regression with network cohesion (RNC). The RNC model is a special case of networked exponential families. While RNC uses a shared weight vector and a local (varying) intercept term, this article allows for arbitrarily varying weight vectors (see end of Sec. II).
Another main difference between [32] and our approach is the choice of regularizer for the networked model. While [32], similar to most existing work on semi-supervised learning [10], uses the graph Laplacian quadratic form as a smoothness measure, our approach controls the non-smooth total variation (TV) of the model parameters. TV-based regularization produces predictors that are piece-wise constant over well-connected subsets of nodes. This behaviour is useful in processing natural images, which are composed of homogeneous segments whose boundaries result in sharp edges [17].
This article substantially extends our prior work on networked linear models for regression and classification [1], [28], [29], [47]. We have recently derived conditions on the data network structure such that nLasso accurately learns a clustered graph signal [29]. The clustered graph signal model is a special case of a networked linear regression model (see Section III-A). Minimizing the Laplacian quadratic form amounts to solving a linear system. In contrast, TV minimization is intrinsically non-linear and requires more advanced techniques such as proximal methods [8], [41] (see Section VI). The higher computational cost of TV minimization affords improved accuracy when learning from a small number of observed data points (see [37] and Section VII-A).
We learn the parameters of networked exponential families with the network Lasso (nLasso). The network Lasso is a recently proposed extension of the Lasso to networked data [20], [22]. It is an instance of regularized empirical risk minimization, using total variation for regularization [18], [21]. We show how the nLasso can be implemented as a highly scalable message passing protocol over the data network structure. This method is privacy-preserving in the sense of not sharing any raw data but only estimates of the parameter vectors for the node-wise exponential families.

B. CONTRIBUTION
We now summarize our main contributions.
• We introduce networked exponential families as a novel modelling paradigm for high-dimensional data points having an intrinsic network structure (''big data over networks'').
• We present sufficient conditions on the network structure and observed data points that allow accurate learning of the parameters of an underlying networked exponential family with high probability.
• We solve the nLasso for learning the parameters of a networked exponential family using a highly scalable message passing algorithm. This algorithm is suitable for federated learning and edge computing environments, where computation is carried out by a collection of low-complexity units (e.g., IoT devices) [30], [33].
• We verify our theoretical findings using illustrative numerical experiments. The source files for these experiments are made available to ensure reproducible research.

C. OUTLINE
We introduce networked exponential families in Section II. Section III details how some recently proposed models for networked data are obtained as special cases of networked exponential families. In Section IV, we show how to learn a networked exponential family using an instance of the nLasso optimization problem. We present an analysis of the nLasso estimation error in Section V. Section VI presents the implementation of nLasso using a primal-dual method for convex optimization. The computational and statistical properties of nLasso in networked exponential families are illustrated in numerical experiments within Section VII.

D. NOTATION
The Euclidean norm of a vector x ∈ R^d is ‖x‖ := sqrt(x^T x). We denote by e_j the j-th column of the identity matrix I_d of size d × d. Given a subset A ⊆ B, we denote its complement as A^c := B \ A.

II. NETWORKED EXPONENTIAL FAMILIES
We consider networked data represented by an undirected weighted graph (the ''empirical graph'') G = (V, E, A) (see Figure 1). The nodes i ∈ V = {1, . . . , N} represent individual data points (such as social network users). Data points i, j ∈ V are connected by an undirected edge e = {i, j} ∈ E with weight A_ij > 0 if they are considered similar (e.g., befriended users). We index the edge set as E = {1, . . . , E := |E|}. The neighbourhood of a node i ∈ V is N(i) := {j : {i, j} ∈ E}.

FIGURE 1. A networked exponential family is a probabilistic model for (high-dimensional) data points z^(i) related by some intrinsic network structure. The data points z^(i) might represent individuals during a pandemic, which are related by contact networks as well as bio-medical network structures.
We assume the empirical graph G is fixed and known. The empirical graph might be obtained via physical proximity (in time or space), physical connection (communication networks) or statistical dependency (probabilistic graphical models) [15], [25], [50]. Section VIII briefly speculates on how our analysis could enable the use of network design methods for data-driven learning of the empirical graph.
Beside the network structure, datasets convey additional information via attributes z^(i) ∈ R^d of the data points i ∈ V. We model the attributes z^(i) of the data points i ∈ V as independent random variables distributed according to (a member of) some exponential family [50],

p(z; w^(i)) := exp( (w^(i))^T t^(i)(z) − Φ^(i)(w^(i)) ).   (2)

The distribution (2) is parametrized by the weight vectors w^(i) ∈ W^(i), for i ∈ V. The weight vectors are fixed but unknown, and the main focus of this article is the accurate estimation (learning) of these weight vectors. In order to ensure that (2) defines a valid probability measure (non-negative and of total measure one), we must restrict the weight vectors to some subset W^(i) ⊆ R^d (see [50, Sec. 3.2]). It is convenient to collect the weight vectors w^(i) assigned to the nodes i into a vector-valued graph signal w : V → R^d which maps a node i to the function value w^(i). The (hypothesis) space of all such vector-valued graph signals is

H := { w : V → R^d }.   (3)

We also define the related space of vector-valued signals defined on the edges E of the empirical graph G as

D := { u : E → R^d }.   (4)

Strictly speaking, (2) represents a probability density function relative to some underlying base measure ν defined on the value range of the sufficient statistic t^(i)(z^(i)). Important examples of such a base measure are the counting measure for discrete-valued t^(i) and the Lebesgue measure for continuous-valued t^(i). The distribution defined by (2) depends on z^(i) only via the sufficient statistic t^(i)(z^(i)). In what follows, we suppress the argument and write t^(i) with the implicit understanding that it is a function of the random vector z^(i).
Several important properties of the model (2) can be read off the cumulant function Φ^(i)(·) : W^(i) → R [50],

Φ^(i)(w^(i)) := log ∫ exp( (w^(i))^T t^(i)(z) ) dν(z).   (5)

The domain W^(i) ⊆ R^d of the cumulant function is given by all weight vectors w^(i) such that the integral in (5) exists and is finite. The Fisher information matrix (FIM) F^(i) for (2) is the Hessian of the cumulant function,

F^(i)(w^(i)) := ∇² Φ^(i)(w^(i)).   (6)

The structure of F^(i) determines the statistical and computational properties of (2) (see Sections V and VI). The node-wise models (2), for all nodes i ∈ V, are coupled by requiring the weight vectors w^(i) to be similar for well-connected data points. In particular, we require the weight vectors to have a small total variation (TV)

‖w‖_TV := Σ_{{i,j}∈E} A_ij ‖w^(i) − w^(j)‖.   (7)

Requiring a small TV of the weight vectors w^(i), for i ∈ V, forces them to be approximately constant over well-connected subsets (clusters) of nodes. It will be convenient to define the TV also for a subset S ⊆ E of edges:

‖w‖_S := Σ_{{i,j}∈S} A_ij ‖w^(i) − w^(j)‖.   (8)

Networked exponential families are obtained as the combination of (2) with a constraint on the TV (7) of the weights w. Networked exponential families are somewhat similar to the RNC model put forward in [32]. Let us detail some key differences between these two modelling frameworks. First, RNC considers the special case of distributions (2) whose weight vectors consist of a shared component β and a node-wise intercept α^(i). The component β is the same for all nodes i ∈ V, while the intercept α^(i) is allowed to vary over the nodes V. In contrast, we allow the entire weight vector w^(i) to vary between different nodes. Moreover, while the RNC model uses the smooth Laplacian quadratic form of the intercepts α^(i), we use the non-smooth TV (7) to ensure that the weight vectors conform with the network structure of the data.
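The TV (7) can be computed by a single pass over the edge set. The following minimal Python sketch (not from the paper's source code; the function name total_variation is our choice) evaluates (7) for a vector-valued graph signal and illustrates that a signal which is constant within two clusters only accumulates TV on the boundary edge:

```python
import numpy as np

# Sketch: total variation (7) of a vector-valued graph signal,
# ||w||_TV = sum over edges {i,j} of A_ij * ||w^(i) - w^(j)||_2.
def total_variation(edges, weights, w):
    """edges: list of (i, j) pairs; weights: dict {(i, j): A_ij};
    w: array of shape (N, d) holding the node-wise weight vectors."""
    return sum(weights[e] * np.linalg.norm(w[e[0]] - w[e[1]]) for e in edges)

# A signal that is constant on each of two clusters {0,1,2} and {3,4,5}:
w = np.vstack([np.tile([1.0, 0.0], (3, 1)), np.tile([0.0, 1.0], (3, 1))])
edges = [(0, 1), (1, 2), (3, 4), (4, 5), (2, 3)]   # (2, 3) is the boundary edge
weights = {e: 1.0 for e in edges}
print(total_variation(edges, weights, w))  # only the boundary edge contributes
```

Here, only the boundary edge {2, 3} contributes ‖(1,0) − (0,1)‖ = √2 to the TV, mirroring the clustering behaviour discussed above.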

III. SOME EXAMPLES
We now discuss important special cases of the generic exponential family (2). These special cases are obtained for specific choices of the sufficient statistic t^(i)(z) and the cumulant function Φ^(i)(·).

A. NETWORKED LINEAR REGRESSION
Consider networked data points i ∈ V with features x^(i) ∈ R^d and labels y^(i) ∈ R. Maybe the most basic (yet quite useful) model for the relation between features and labels of a data point is the linear model

y^(i) = (w^(i))^T x^(i) + ε^(i),   (9)

with Gaussian noise ε^(i) ∼ N(0, σ_i²) of known variance σ_i², which can vary for different nodes i ∈ V. The linear model (9) is parametrized by the weight vectors w^(i) for each i ∈ V.
The weight vectors are coupled by requiring a small TV (7) [28]. The model (9) is obtained from (2) using the choices z^(i) = (x^(i), y^(i)), the sufficient statistic t^(i)(z^(i)) = (y^(i)/σ_i²) x^(i) and the cumulant function Φ^(i)(w^(i)) = ((w^(i))^T x^(i))²/(2σ_i²). In some applications, we might have only a crude estimate for the labels of some data points. We can cope with varying levels of accuracy in the observed labels by using a varying noise variance σ_i² in (9). For nodes i ∈ V for which we only have a rough label estimate, we use a larger noise variance σ_i² in (9).
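A minimal sketch of sampling from the networked linear model (9), with cluster-wise constant weight vectors and a node-wise noise level (all concrete numbers below are illustrative, not from the paper's experiments):

```python
import numpy as np

# Sketch: sample labels from the networked linear model (9),
# y^(i) = w^(i)^T x^(i) + eps^(i), eps^(i) ~ N(0, sigma_i^2).
rng = np.random.default_rng(0)
N, d = 6, 2
X = rng.normal(size=(N, d))                     # features x^(i)
W = np.vstack([np.tile([1.0, -1.0], (3, 1)),    # cluster-wise constant w^(i)
               np.tile([2.0, 0.0], (3, 1))])
sigma = np.full(N, 0.1)                         # node-wise noise levels
sigma[4] = 1.0                                  # node with only a rough label estimate
y = np.einsum('nd,nd->n', W, X) + sigma * rng.normal(size=N)
```

Raising sigma at a node down-weights its (unreliable) label in the likelihood, exactly the mechanism described above.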

B. NETWORKED LOGISTIC REGRESSION
Consider networked data points i ∈ V with features x^(i) ∈ R^d and labels y^(i) ∈ {−1, 1}. Within logistic regression, we interpret the label y^(i) as the realization of a random variable with probability distribution

p(y^(i); w^(i)) := 1/(1 + exp(−y^(i) (w^(i))^T x^(i))).   (10)

The distribution (10) is parametrized by the weight vector w^(i) for each node i ∈ V. Note that (10) is the posterior distribution of the label y^(i) given the features x^(i) if the feature vector x^(i) is a Gaussian random vector conditioned on y^(i). Networked logistic regression requires the weight vectors w^(i) in (10) to have a small TV (7) [1], [47]. The logistic regression model (10) is the special case of (2) obtained for the choices z^(i) = (x^(i), y^(i)), the sufficient statistic t^(i)(z^(i)) = ((y^(i) + 1)/2) x^(i) and the cumulant function Φ^(i)(w^(i)) = log(1 + exp((w^(i))^T x^(i))).
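A short sketch of the node-wise logistic distribution (10), verifying that the probabilities of the two labels y ∈ {−1, +1} sum to one (the function name logistic_prob is our choice):

```python
import numpy as np

# Sketch of the node-wise logistic model (10):
# p(y; w) = 1 / (1 + exp(-y * w^T x)) for labels y in {-1, +1}.
def logistic_prob(y, w, x):
    return 1.0 / (1.0 + np.exp(-y * np.dot(w, x)))

w, x = np.array([2.0, -1.0]), np.array([1.0, 0.5])
p_pos, p_neg = logistic_prob(+1, w, x), logistic_prob(-1, w, x)
print(p_pos + p_neg)  # the two label probabilities sum to one
```

Since w^T x = 1.5 > 0 in this toy example, the model favours the label y = +1.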

C. NETWORKED LDA
Consider a networked dataset representing a collection of text documents (such as scientific articles). LDA is a probabilistic model for the relative frequencies of words in a document [4], [50]. Within LDA, each document is considered a blend of different topics. Each topic has a characteristic distribution of the words in the vocabulary. A simplified form of LDA represents each document i ∈ V containing N ''words'' by two sequences of multinomial random variables: word variables taking values in {1, . . . , V} and topic variables taking values in {1, . . . , T}, with V being the size of the vocabulary of elementary words and T the number of different topics. It can be shown that LDA is a special case of the exponential family (2) with particular choices for t(·) and Φ^(i)(·) (see [4], [50]).

IV. NETWORK LASSO
This article focuses on the accurate learning of the true underlying weights w^(i) (see (2)) based on the node attributes z^(i) for a small ''training set'' M = {i_1, . . . , i_M} ⊆ V. A reasonable estimate for the weight vectors is obtained by maximizing the likelihood of observing the attributes z^(i),

max_{w∈H} Π_{i∈M} p(z^(i); w^(i)).   (11)

Maximizing (11) is equivalent to minimizing the empirical risk

E(w) := (1/M) Σ_{i∈M} [ Φ^(i)(w^(i)) − (w^(i))^T t^(i) ].   (12)

The criterion (12) is not enough to learn the weights w^(i) for all i ∈ V. Indeed, (12) ignores the weights of the unobserved nodes i ∈ V \ M. We need to impose additional structure on the estimate for the weight vectors. In what follows, we require any reasonable estimate ŵ^(i) to conform, in a specific sense, with the cluster structure of the empirical graph G [38].
Networked data is often organized as clusters (or communities), which are well-connected subsets of nodes. Many supervised learning methods use a clustering assumption: nodes belonging to the same cluster represent similar data points. We implement this clustering assumption by requiring the parameter vectors w^(i) in (2) to have a small TV (7).
We are led to learning the weights w for (2) via the regularized empirical risk minimization (ERM)

ŵ ∈ argmin_{w∈H} E(w) + λ ‖w‖_TV.   (13)

The learning problem (13) is an instance of the generic nLasso problem [20]. The parameter λ in (13) allows to trade off a small TV ‖w‖_TV against a small error E(ŵ) (cf. (12)). Choosing λ can be based on validation [22] or on the error analysis in Section V. It will be convenient to reformulate (13) using the block-incidence matrix

D ∈ R^{(Ed)×(Nd)},   (14)

whose e-th block (Dw)^(e), for the edge e = {i, j}, is A_ij (w^(i) − w^(j)). This relates the TV (7) to the norm ‖u‖_{2,1} := Σ_{e∈E} ‖u^(e)‖_2 defined on D (see (4)), via ‖w‖_TV = ‖Dw‖_{2,1}. We can then reformulate the nLasso (13) as

ŵ ∈ argmin_{w∈H} h(w) + g(Dw),   (16)

with h(w) := E(w) and g(u) := λ ‖u‖_{2,1}.
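The nLasso objective (13) for the networked linear model of Sec. III-A can be evaluated directly. A minimal sketch (the function name nlasso_objective is our choice; the squared-error risk stands in for the negative log-likelihood (12) in the Gaussian case):

```python
import numpy as np

# Sketch of the nLasso objective (13) for the Gaussian linear model:
# E(w) + lambda * ||w||_TV.
def nlasso_objective(w, X, y, M, edges, A, lam):
    # empirical risk: average halved squared error over the training set M
    risk = np.mean([(y[i] - w[i] @ X[i]) ** 2 for i in M]) / 2
    # TV regularizer (7), summed over all edges of the empirical graph
    tv = sum(A[e] * np.linalg.norm(w[e[0]] - w[e[1]]) for e in edges)
    return risk + lam * tv
```

Note how the risk term only touches the training set M, while the TV term couples all nodes through the edges; this coupling is what propagates information to the unobserved nodes.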
Related to the incidence matrix (14) is the graph Laplacian

L := Λ − Ã ∈ R^{N×N},

with the (element-wise) squared weight matrix Ã (entries Ã_ij := A_ij²) and the ''degree matrix'' Λ := diag(d^(1), . . . , d^(N)), d^(i) := Σ_{j∈N(i)} Ã_ij. The (sorted) eigenvalues 0 = λ_1 ≤ λ_2 ≤ . . . of L reflect the connectivity of the graph G. A graph G is connected if and only if λ_2 > 0. Moreover, the spectral gap ρ(G) := λ_2 provides a measure of the connectivity of the graph G.
The Laplacian matrix L is closely related to the incidence matrix D (see (14)): both matrices have the same nullspace, and the non-zero spectrum of DD^T coincides with that of L. The column blocks S^(j) ∈ R^{(Nd)×d} of the pseudo-inverse D^† satisfy a norm bound that can be verified using the identity D^† = (DD^T)^† D^T and well-known vector norm inequalities (see, e.g., [23]).
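The relation between the incidence matrix and the Laplacian can be checked numerically on a tiny graph. The following sketch (scalar case, d = 1; the function name incidence is our choice) builds D for a two-edge chain and verifies that L = D^T D has a zero eigenvalue and shares its non-zero spectrum with DD^T:

```python
import numpy as np

# Sketch: incidence matrix D and Laplacian L of a small weighted graph,
# illustrating that DD^T and L = D^T D share their non-zero spectrum.
def incidence(N, edges, A):
    D = np.zeros((len(edges), N))
    for e, (i, j) in enumerate(edges):
        D[e, i], D[e, j] = A[(i, j)], -A[(i, j)]
    return D

edges = [(0, 1), (1, 2)]
A = {(0, 1): 1.0, (1, 2): 2.0}
D = incidence(3, edges, A)
L = D.T @ D                   # Laplacian with (element-wise) squared edge weights
print(np.linalg.eigvalsh(L))  # smallest eigenvalue is 0; lambda_2 > 0 iff connected
```

For this connected chain, the second-smallest eigenvalue λ_2 (the spectral gap) is strictly positive, as stated above.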

V. STATISTICAL ASPECTS
We now turn to the characterization of the statistical properties of nLasso by analysing the estimation error Δw := ŵ − w̄ incurred by a solution ŵ of the nLasso problem (13). In order to analyze this error, we assume that the true weight vectors are clustered,

w̄^(i) = Σ_{C∈P} v^(C) I_C[i].   (19)

Here, v^(C) ∈ R^d is the value of the true weight vector for all nodes in the cluster C. We also used the indicator map I_C[i] = 1 for i ∈ C and I_C[i] = 0 otherwise.
The model (19) involves a partitioning P = {C_1, . . . , C_|P|} of the nodes V into disjoint subsets (''clusters'') C_l. The model (19) is a special case of the piece-wise polynomial signal model, which allows the weight vectors to vary within each cluster [12].
In principle, our analysis applies to an arbitrary choice of the partition P. However, the analysis is most useful if the partition is such that the boundary ∂P between the clusters is small in some sense. In particular, we focus on partitions such that Σ_{e∈∂P} A_e is small. We will use the model (19) as a (''zero-order'') approximation for the true underlying weight vectors in (2). The analysis below indicates that nLasso methods are robust to model mismatch, i.e., it suffices that the true underlying weight vectors in (2) can be well approximated by (19).
Assumption 1: The node attributes z^(i) are distributed according to (2) with weight vectors w̄^(i) that are piece-wise constant over some partition P = {C_1, . . . , C_|P|} (see (19)).

We measure the clusteredness of the partition P via the spectral gap ρ_P, i.e., the smallest spectral gap λ_2 among the subgraphs induced by the clusters C_l. We emphasize that the partition underlying the model (19) is only required for the analysis of the nLasso error. For the implementation of nLasso (see Section VI), we do not need any information about the partition P.
The next assumption requires the node-wise exponential families to be well conditioned. In particular, we require the eigenvalues of the FIMs F^(i) (see (6)) to be upper and lower bounded by known constants.
Our third and final assumption involves the training set M and requires the cluster boundaries to be well connected to nodes in the training set. Conceptually, this assumption is similar to invertibility conditions such as the restricted isometry property or the compatibility condition in compressed sensing [7], [16]. This assumption ensures that no clustered weight vectors (see (19)) can be small on the entire training set M. We measure the size of the weight vectors on the training set via the norm ‖·‖_M.

Assumption 3: There are constants K > 0 and L > 1 such that, for any piece-wise constant z ∈ H (see (3) and (7)), the bound (23) holds.

Note that both Assumption 2 and Assumption 3 use the same constant L in the lower bounds (22) and (23), respectively. Our main analytic result is an upper bound on the probability that the nLasso error exceeds a given threshold η.
Theorem 1: Consider networked data G and a training set M such that Assumptions 1, 2 and 3 are satisfied with condition number (see (23)) κ := (K + 3)/(L − 3) < 1. We estimate the weight vectors w̄ using a solution ŵ of nLasso (13) with λ := η/(5κ²) for some pre-specified error level η > 0. Then, the error bound (24) holds.

The bound (24) becomes useful, i.e., it tends towards zero, for sufficiently large clusters C_l and a sufficiently large training set M. In particular, the bound is useful for massive datasets represented by a large empirical graph that is composed of a modest number of clusters or segments. One application involving large empirical graphs and a rather small number of clusters is image segmentation (see [46] and Section VII-C).
The bound (24) indicates that, for a prescribed accuracy level η, the training set size M has to scale according to κ⁴/ρ_P². Thus, the sample size required by Algorithm 1 scales with the fourth power of the condition number κ = (K + 3)/(L − 3) (see Assumption 3) and inversely with the squared spectral gap ρ_P² of the partition P.
Thus, nLasso methods (13) (such as Algorithm 1) require less training data if the condition number κ is small and the spectral gap ρ_P is large. This is reasonable, since a small condition number κ = (K + 3)/(L − 3) (see Assumption 3) typically requires the edges within clusters to have larger weights, on average, than the boundary edges.
It also makes sense that nLasso is more accurate for a larger spectral gap ρ P . Indeed, a large spectral gap ρ P indicates that the nodes within each cluster C l are well connected. A graph G consisting of well-connected clusters C l favours clustered graph signals (see (19)) as solutions of nLasso (13).

VI. COMPUTATIONAL ASPECTS
The objective function in (16) is highly structured: it is the sum of a smooth convex function h(w) and a non-smooth convex function g(Dw). Both components can be optimized efficiently when considered separately. This suggests using proximal methods to solve (16) [41].
A recently popularized instance of proximal methods is the alternating direction method of multipliers (ADMM) [5], [20]. However, we choose another type of proximal method, which is based on a dual problem to (16) [8], [42]. This primal-dual method is appealing since its analysis provides natural choices for the algorithm parameters. In contrast, tuning the ADMM parameter is non-trivial [39].

A. PRIMAL-DUAL METHOD
To develop an efficient method for solving (16), we start by reformulating (16) as the saddle-point problem

min_{w∈H} max_{u∈D} h(w) + u^T Dw − g*(u),   (25)

with the convex conjugate g* of g [8]. Any solution (ŵ, û) of (25) is characterized by [43]

−D^T û ∈ ∂h(ŵ), and Dŵ ∈ ∂g*(û).   (26)
This condition is, in turn, equivalent to

ŵ − T D^T û ∈ (I_{dN} + T ∂h)(ŵ), and û + Σ D ŵ ∈ (I_{dE} + Σ ∂g*)(û),   (27)

with positive definite matrices Σ ∈ R^{dE×dE}, T ∈ R^{dN×dN}. The matrices Σ, T are design parameters whose choice will be detailed below. The condition (27) lends itself naturally to the following coupled fixed-point iterations [42]:

w_{k+1} := (I_{dN} + T ∂h)^{-1}(w_k − T D^T u_k),   (28)
u_{k+1} := (I_{dE} + Σ ∂g*)^{-1}(u_k + Σ D(2w_{k+1} − w_k)).   (29)

If the matrices Σ and T in (28), (29) satisfy

‖Σ^{1/2} D T^{1/2}‖ < 1,   (30)

the sequence w_{k+1} (see (28), (29)) converges to a solution of (13) [42, Thm. 1]. The condition (30) is satisfied for the diagonal choices

Σ := diag(σ^(e) I_d)_{e∈E}, σ^(e) := 1/(2A_e), and T := diag(τ^(i) I_d)_{i∈V}, τ^(i) := τ/d^(i),   (31)

with d^(i) = Σ_{j≠i} A_ij and some τ < 1 [42, Lemma 2]. The update (29) involves the resolvent operator (I_{dE} + Σ ∂g*)^{-1}. The convex conjugate g* of g (see (16)) can be decomposed as g*(v) = Σ_{e=1}^{E} g*_2(v^(e)), with the convex conjugate g*_2 of the scaled ℓ_2-norm λ‖·‖, where ‖v‖ := sqrt(v^T v). Moreover, since Σ is a block-diagonal matrix, the e-th block of the resolvent operator (I_{dE} + Σ ∂g*)^{-1}(v) can be obtained via the Moreau decomposition as [41, Sec. 6.5]

((I_{dE} + Σ ∂g*)^{-1}(v))^(e) = v^(e) (1 − (1 − λ/‖v^(e)‖)_+),   (32)

where (a)_+ := max{a, 0} for a ∈ R.
The update (28) involves the resolvent operator (I + T ∂h)^{-1} of h (see (12) and (16)), which does not admit a simple closed-form solution in general. Using (31), the update (28) decomposes into independent node-wise updates

w^(i)_{k+1} := argmin_{w∈R^d} g^(i)(w) + (1/(2τ^(i))) ‖w − ŵ^(i)‖²,   (33)

with g^(i)(w) := (Φ^(i)(w) − w^T t^(i))/M for i ∈ M, and g^(i)(w) := 0 otherwise, where

ŵ^(i) := w^(i)_k − τ^(i) (D^T u_k)^(i).   (34)

The update (33) is a regularized maximum likelihood estimator for exponential families [50, Eq. 3.38]. The regularization term (1/(2τ^(i))) ‖w − ŵ^(i)‖² enforces w^(i)_{k+1} to be close to ŵ^(i). The vector ŵ^(i) is a corrected version of the previous iterate w^(i)_k (see (34)). In general, there is no closed-form solution for the update (33). However, (33) is a smooth convex optimization problem that can be solved efficiently using iterative methods such as L-BFGS [34]. We detail a computationally cheap iterative method for approximately solving (33) in Sec. VI-C.
Let us denote the approximate solution to (33) by w̃^(i)_{k+1} and assume that it is sufficiently accurate such that the approximation error bound (35) holds. We require the approximation quality (for approximating the update (33)) to increase with the iteration number k. According to [13, Thm. 3.2], the error bound (35) ensures that the sequences obtained by (28) and (29), when replacing the exact update (33) with the approximation w̃_{k+1}, still converge to a saddle-point of (25) and, in turn, to a solution of the nLasso problem (16).

Algorithm 1 Primal-Dual nLasso
Note that Algorithm 1 requires as input only the empirical graph along with the observed node attributes z (i) , for i ∈ M. Algorithm 1 does not require any specification of a partition of the empirical graph. Moreover, in contrast to the ADMM implementation of nLasso (see [20,Alg. 1]), the proposed Algorithm 1 does not involve any additional tuning parameter for solving (16).
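The overall structure of Algorithm 1 can be illustrated in a minimal Python sketch for the networked linear model of Sec. III-A. This is not the paper's reference implementation: the function name nlasso_linear, the scalar step-size rule (a conservative substitute for the diagonal preconditioning (31), still obeying (30)) and the toy example data are our choices. The primal update (33) has a closed form here because the node-wise loss is quadratic; the dual update implements the ball projection (32):

```python
import numpy as np

# Sketch of the primal-dual nLasso iterations (28)-(29) for the linear model (9).
def nlasso_linear(X, y, M, edges, A, lam, n_iter=2000):
    N, d = X.shape
    E = len(edges)

    def D_op(w):      # (Dw)^(e) = A_e (w^(i) - w^(j)), cf. (14)
        return np.stack([A[k] * (w[i] - w[j]) for k, (i, j) in enumerate(edges)])

    def Dt_op(u):     # transpose of the block-incidence operator
        out = np.zeros((N, d))
        for k, (i, j) in enumerate(edges):
            out[i] += A[k] * u[k]
            out[j] -= A[k] * u[k]
        return out

    # scalar step sizes obeying tau * sigma * ||D||^2 < 1, cf. (30)
    Dmat = np.zeros((E, N))
    for k, (i, j) in enumerate(edges):
        Dmat[k, i], Dmat[k, j] = A[k], -A[k]
    tau = sigma = 0.9 / np.linalg.norm(Dmat, 2)

    w, u = np.zeros((N, d)), np.zeros((E, d))
    for _ in range(n_iter):
        w_bar = w - tau * Dt_op(u)           # corrected iterate, cf. (34)
        w_new = w_bar.copy()                 # unobserved nodes: resolvent is identity
        for i in M:                          # exact node-wise resolvent, cf. (33)
            Ai = np.eye(d) / tau + np.outer(X[i], X[i]) / len(M)
            bi = w_bar[i] / tau + y[i] * X[i] / len(M)
            w_new[i] = np.linalg.solve(Ai, bi)
        v = u + sigma * D_op(2 * w_new - w)
        # dual resolvent: block-wise projection onto the lambda-ball, cf. (32)
        nrm = np.linalg.norm(v, axis=1, keepdims=True)
        u = v * np.minimum(1.0, lam / np.maximum(nrm, 1e-12))
        w = w_new
    return w

# toy chain with two clusters {0, 1} and {2, 3}, weakly connected at the boundary
X = np.ones((4, 1))
y = np.array([1.0, 0.0, 0.0, -1.0])   # labels are used only on the training set M
w_hat = nlasso_linear(X, y, M=[0, 3], edges=[(0, 1), (1, 2), (2, 3)],
                      A=[1.0, 0.1, 1.0], lam=1.0, n_iter=5000)
```

On this toy chain, only the endpoint labels are observed; the strong intra-cluster edges pull the unobserved nodes to the value of their cluster, while the weak boundary edge lets the two clusters keep different (slightly shrunken) weights.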

B. COMPUTATIONAL COMPLEXITY
Algorithm 1 can be implemented as message passing over the empirical graph G (see [1]). During each iteration, messages are passed over each edge {i, j} ∈ E in the empirical graph. The computation of a single message requires a constant amount of computation. The precise amount of computation, measured by the number of additions and multiplications, required for a single message depends on the particular instance of the update (33).
For a fixed number of iterations of Algorithm 1, its computational complexity scales linearly with the number of edges E. For bounded-degree graphs, such as grid or chain graphs, this implies a linear scaling of the complexity with the number of data points.
However, the overall complexity for Algorithm 1 depends crucially on the number of iterations required to achieve accurate learning. A worst-case analysis shows that, even when computing the exact updates (33), the number of iterations scales inversely with the required estimation accuracy [8]. This convergence speed is optimal for chain graphs [26].

C. APPROXIMATE PRIMAL UPDATE
We discuss a simple iterative method for approximately solving the primal update (33). A solution w^(i)_{k+1} of (33) is characterized by the zero-gradient condition [6]

∇g^(i)(w) + (1/τ^(i)) (w − ŵ^(i)) = 0.   (36)

Applying basic calculus to (36) yields a necessary and sufficient condition (37) for w to solve (33), which is the fixed-point equation

w = T(w) := ŵ^(i) − τ^(i) ∇g^(i)(w).   (38)

By the mean-value theorem [45, Thm. 9.19], the map T is Lipschitz with constant (τ^(i)/M) sup_w ‖F^(i)(w)‖, where F^(i) is the FIM (6). Thus, if we choose τ^(i) such that

(τ^(i)/M) sup_w ‖F^(i)(w)‖ < 1,   (39)

the map T in (38) is a contraction and the fixed-point iteration

w^(r+1) := T(w^(r))   (40)

converges to the solution of (33). Moreover, if (39) is satisfied, we can bound the deviation between the iterate w^(r) and the (unique) solution w^(i)_{k+1} of (33) (see [45, Proof of Thm. 9.23]). Thus, if we use the approximation w̃^(i)_{k+1} := w^(r) for the update (33), we can ensure (35) by iterating (40) a sufficient number of times (see (41)). Computing the update (40) requires the evaluation of the gradient ∇Φ^(i)(w^(r)) of the cumulant function Φ^(i)(w). According to [50, Prop. 3.1],

∇Φ^(i)(w) = E_{z∼p(z;w)}[ t^(i)(z) ].   (43)

In general, the expectation (43) cannot be computed exactly in closed form. A notable exception are exponential families p(z; w) obtained from a probabilistic graphical model defined on a triangulated graph, such as a tree. In this case, it is possible to compute (43) in closed form (see [50, Sec. 2.5.2]). Other special cases of (2) for which (43) can be evaluated in closed form are linear and logistic regression (see Sec. III).
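For the logistic model of Sec. III-B, the contraction iteration (40) can be sketched in a few lines (the function name primal_update_logistic and the toy value of the sufficient statistic t are our choices). The step size tau = 0.5 satisfies (39) here, since sup ‖F^(i)‖ ≤ ‖x‖²/4 for the logistic FIM:

```python
import numpy as np

# Sketch of the fixed-point iteration (40) for one node-wise primal update (33).
def primal_update_logistic(w_bar, x, t, tau, M, r_max=200):
    w = w_bar.copy()
    for _ in range(r_max):
        grad_phi = x / (1.0 + np.exp(-(w @ x)))   # gradient of log(1 + exp(w^T x))
        w = w_bar - (tau / M) * (grad_phi - t)    # the contraction map T in (38)
    return w

x = np.array([1.0, 0.5])
t = 0.8 * x          # toy stand-in for the observed sufficient statistic t^(i)
w_bar = np.zeros(2)  # corrected previous iterate, cf. (34)
# (tau / M) * sup ||F|| = 0.5 * ||x||^2 / 4 < 1, so (39) holds and T contracts
w_star = primal_update_logistic(w_bar, x, t, tau=0.5, M=1)
```

The returned w_star approximately satisfies the zero-gradient condition (36), i.e., it is (numerically) a fixed point of T.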

D. PARTIALLY OBSERVED MODELS
The learning Algorithm 1 can be adapted easily to cope with partially observed exponential families [50]. In particular, for the networked LDA described in Sec. III, we typically have access only to the word variables; the topic variables are not observed, since they are latent (hidden) variables. In this case, we can approximate (33) using the ''Expectation-Maximization'' (EM) principle (see [50, Sec. 6.2]). An alternative to EM methods, based on the method of moments, for learning (latent-variable) topic models has been studied in a recent line of work [2].

VII. NUMERICAL EXPERIMENTS
We report on the numerical results obtained by applying particular instances of Algorithm 1 to different datasets. The source code to reproduce these experiments can be found at https://github.com/alexjungaalto/ nLassoExpFamPDSimulations.

A. TWO-CLUSTER DATASET
This experiment constructs an empirical graph G by sparsely connecting two random graphs C_1 and C_2, each of size N/2 = 40 and with average degree 10. The nodes of G are assigned feature vectors x^(i) ∈ R², obtained as i.i.d. random vectors uniformly distributed on the unit sphere {x ∈ R² : ‖x‖ = 1}. The labels y^(i) of the nodes i ∈ V are generated according to the linear model (9) with zero noise ε^(i) = 0 and piece-wise constant weight vectors w̄^(i) = a for i ∈ C_1 and w̄^(i) = b for i ∈ C_2, with two different fixed vectors a, b ∈ R². We assume that the labels y^(i) are known for the nodes in a small training set M, which includes three data points from each cluster, |M ∩ C_1| = |M ∩ C_2| = 3.
As shown in [27], the validity of (23) in Assumption 3 depends on the connectivity of the cluster nodes with the boundary edges ∂ := {{i, j} ∈ E : i ∈ C_1, j ∈ C_2}, which connect nodes in different clusters. To quantify the connectivity of the observed nodes M with the cluster boundary, we compute, for each cluster C_l, the normalized flow value ρ^(l) between one particular observed node in the cluster C_l and the cluster boundary ∂. We normalize this flow by the boundary size |∂|. Fig. 2 depicts the normalized mean squared error (NMSE) ε := ‖w̄ − ŵ‖²_2 / ‖w̄‖²_2 incurred by Algorithm 1 (averaged over 10 i.i.d. simulation runs) for varying connectivity, as measured by the empirical average ρ̄ of ρ^(1) and ρ^(2) (which have the same distribution). According to Fig. 2, there are two regimes of connectivity levels. For connectivity ρ̄ > √2, Algorithm 1 is able to learn the piece-wise constant weights w̄^(i).

To compare the effect of using the TV (7) in Algorithm 1 instead of the graph Laplacian quadratic form (see [32]) as network regularizer, we consider a networked signal-in-noise model with Gaussian noise of zero mean and known variance σ². The signal weights are piece-wise constant with w̄^(i) = 1 for i ∈ C_1 and w̄^(i) = −1 for i ∈ C_2. The labels y^(i) are observed for the nodes M = {1, 2, 3, N−2, N−1, N}. Algorithm 1 is then used to learn the weights ŵ^(i), using a fixed number of 1000 iterations and λ = 10. The RNC estimator reduces to a single matrix inversion (see [32, Eq. 2.4]) and is computed for the choices λ ∈ {1/100, 1, 100} of the RNC regularization parameter. The resulting estimates ŵ^(i) are shown in Fig. 3.
According to Figure 3, Algorithm 1 accurately learns the piecewise constant weights w̄^(i) from only two labels y^(i), for i ∈ {1, N}. In contrast, RNC fails to leverage the network structure in order to learn the weights from a small number of labels.

B. WEATHER DATA
In this experiment, we consider networked data obtained from the Finnish Meteorological Institute. The empirical graph G of this data represents Finnish weather stations, which are initially connected by an edge to their K = 3 nearest neighbors. The feature vector x^(i) ∈ R^3 of node i ∈ V contains the local daily mean temperatures for the preceding three days. The label y^(i) ∈ R is the current day-average temperature. We apply Algorithm 1 to learn the weight vectors w^(i) of a localized linear model (9). For the sake of illustration, we focus on the weather stations in the capital region around Helsinki. These stations are represented by the nodes C = {23, 18, 22, 15, 12, 13, 9, 7, 5}, and we assume that labels y^(i) are available for all nodes outside C and for the nodes i ∈ {12, 13, 15} ⊆ C. Thus, for more than half of the nodes in C we do not know the labels y^(i) but predict them via ŷ^(i) = (w^(i))^T x^(i) with the weight vectors w^(i) obtained from Algorithm 1 (using λ = 1/7 and a fixed number of 10^4 iterations). The normalized average squared prediction error is ≈ 10^−1, only slightly larger than the prediction error incurred by fitting a single linear model to the cluster C.
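A minimal sketch of the graph construction and the node-wise linear prediction follows, using synthetic station coordinates and temperatures in place of the FMI records; the shared least-squares fit merely stands in for the nLasso weights produced by Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n_stations, K = 20, 3

# Synthetic station coordinates standing in for the FMI station locations.
coords = rng.uniform(0, 100, size=(n_stations, 2))

# Connect each station to its K = 3 nearest neighbours (symmetrized).
d = np.linalg.norm(coords[:, None] - coords[None, :], axis=2)
np.fill_diagonal(d, np.inf)
nn = np.argsort(d, axis=1)[:, :K]
A = np.zeros((n_stations, n_stations))
for i in range(n_stations):
    A[i, nn[i]] = A[nn[i], i] = 1.0

# Feature vector x^(i): daily mean temperatures of the preceding three days;
# label y^(i): the current day-average temperature (synthetic here).
temps = rng.normal(5, 8, size=(n_stations, 4))
X, y = temps[:, :3], temps[:, 3]

# Node-wise prediction y_hat^(i) = (w^(i))^T x^(i); a single shared
# least-squares weight vector stands in for the nLasso output.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ w
nmse = np.sum((y - y_hat) ** 2) / np.sum(y ** 2)
```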

C. IMAGE SEGMENTATION
We now discuss an experiment which showcases Algorithm 1 for image segmentation [19], [44]. An image can be represented by an empirical graph whose nodes i ∈ V are image pixels at coordinates (p^(i), q^(i)) ∈ {1, . . . , P} × {1, . . . , Q} (see Figure 4). Two nodes i, j ∈ V are connected by an edge if the corresponding pixels are adjacent in the pixel grid. We assign all edges {i, j} ∈ E the same weight W_{i,j} = 1. Pixels i ∈ V are characterized by feature vectors x^(i) obtained by normalizing (to zero mean and unit variance) the red, green and blue components of each pixel.
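The pixel-grid empirical graph can be sketched as follows, for a toy image and assuming grid adjacency (4-neighborhood), unit edge weights, and normalized RGB features as described above.

```python
import numpy as np

P, Q = 4, 5                                        # toy image of P x Q pixels
img = np.random.default_rng(2).random((P, Q, 3))   # RGB values in [0, 1]

# Nodes are pixels, indexed row-major; grid-adjacent pixels share an edge.
def node(p, q):
    return p * Q + q

edges = []
for p in range(P):
    for q in range(Q):
        if p + 1 < P:
            edges.append((node(p, q), node(p + 1, q)))
        if q + 1 < Q:
            edges.append((node(p, q), node(p, q + 1)))
weights = {e: 1.0 for e in edges}                  # W_ij = 1 for all edges

# Feature vectors: RGB components normalized to zero mean, unit variance.
X = img.reshape(-1, 3)
X = (X - X.mean(axis=0)) / X.std(axis=0)
```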
We then construct a training set M of labeled data points by combining a background set B ⊆ V (y^(i) = 0) and a foreground set F ⊆ V (y^(i) = 1). These sets are determined based on the normalized redness r^(i) of each pixel. We apply Algorithm 1, with λ = 100 and a fixed number of 10 iterations, to learn the weights w^(i) of a networked logistic regression model (see Section III-B). For the update (33) in Algorithm 1, we use a single Newton step. The resulting predictions (w^(i))^T x^(i) are shown on the right of Figure 4. The middle of Figure 4 depicts the hard segmentation obtained by the ''GrabCut'' method [44]. Using MATLAB version 19 on a standard laptop, Algorithm 1 is almost ten times faster than GrabCut.
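The node-wise update (33) is not reproduced here; the following sketch only illustrates what a single Newton step for the (unregularized) logistic loss looks like, on a toy set of labeled pixels whose foreground is distinguished by its red channel.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_step(w, X, y):
    """One Newton step for logistic regression with labels y in {0, 1}:
    w <- w - H^{-1} g, where g and H are the gradient and Hessian of the
    negative log-likelihood (a small ridge keeps H invertible)."""
    p = sigmoid(X @ w)
    g = X.T @ (p - y)                       # gradient
    H = (X.T * (p * (1 - p))) @ X           # Hessian
    H += 1e-6 * np.eye(len(w))
    return w - np.linalg.solve(H, g)

# Toy labeled pixels: foreground (y = 1) has a larger red component.
rng = np.random.default_rng(3)
Xf = rng.normal([1.0, 0.0, 0.0], 0.3, size=(20, 3))
Xb = rng.normal([-1.0, 0.0, 0.0], 0.3, size=(20, 3))
X = np.vstack([Xf, Xb])
y = np.concatenate([np.ones(20), np.zeros(20)])

w = np.zeros(3)
for _ in range(5):
    w = newton_step(w, X, y)
```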

VIII. CONCLUSION
We have introduced networked exponential families as a flexible statistical modeling paradigm for networked data. The error of nLasso applied to learning networked exponential families has been analyzed. An efficient implementation of nLasso has been proposed using a primal-dual method for convex optimization. Directions for future research include a more detailed analysis of the convergence of nLasso for typical network structures as well as data-driven learning of the network structure (graphical model selection). The analysis of nLasso presented in Section V might guide the design of network structure by relating Assumption 3 to network flow problems (see [29]).

IX. PROOFS
We first collect some helper results in Sec. IX-A that will be used in Section IX-B to obtain a detailed derivation of Theorem 1.

Lemma 2: For any two vector-valued graph signals u, v ∈ H (see (3)) defined on an empirical graph G, the bound (45) holds. Here, D ∈ R^{(d|E|)×(d|V|)} denotes the block-wise incidence matrix (14) of the empirical graph G.

Proof: Any graph signal u can be decomposed as u = Pu + (I − P)u, with P denoting the orthogonal projection matrix onto the nullspace of the block-wise graph Laplacian matrix L (17). For a connected graph, the nullspace K(L) is spanned by d graph signals (see [49]) built from the constant graph signal 1 ∈ R^V, which assigns all nodes the same signal value 1. The projection matrix associated with the nullspace K(L) is constructed from the matrices M^(j) := e^(j) (e^(j))^T.

The projection matrix onto the orthogonal complement of K(L) ⊆ H is I − P, which can be expressed (see [24]) in terms of the block-wise incidence matrix D (14). Combining (49) and (50) with (46) yields (51). To further develop (51), we define norms on the space D of vector-valued edge signals (see (4)). Applying the Cauchy–Schwarz inequality a^T b ≤ ‖a‖_2 ‖b‖_2 to (51) bounds the sum Σ_{i∈V} (u^(i))^T v^(i), which gives (54). The result (45) then follows from (54) by using (15).

Applying Lemma 2 to the subgraphs induced by a partition P = {C_1, . . . , C_|P|} yields the following result. Here, D_{C_l} denotes the block-wise incidence matrix of the induced subgraph C_l (see (14)). The proof of Theorem 1 (see Section IX-B) will require a large-deviation bound for weighted sums of independent random vectors z^(i) distributed according to (2).
Lemma 4: Consider M independent random vectors z^(i), for i ∈ M, distributed according to (2). For fixed unit-norm vectors m^(i), ‖m^(i)‖ = 1, denote y^(i) := (m^(i))^T t^(i)(z^(i)) and µ^(i) := E{y^(i)}. If the Hessian of the log-partition function in (2) is upper bounded by U·I (in the semidefinite order) for all i ∈ M, then the large-deviation bound (58) holds. The last equality in (58) is due to the independence of the random variables y^(i).
For controlling the probability of (83) failing to hold, we combine (18) with Lemma 4 and a union bound over all edges e ∈ E; this yields the bounds (85) and (86). Summing (85) and (86) for the choice (81), a final union bound yields (24).