Graph Learning Over Partially Observed Diffusion Networks: Role of Degree Concentration

This work examines the problem of learning the topology of a network from the samples of a diffusion process evolving at the network nodes, under the restriction that a limited fraction thereof is probed (<italic>partial observability</italic>). We provide the following main contributions. Given an estimator of the combination matrix (i.e., the matrix that quantifies the pairwise interaction between nodes), we introduce the notion of <italic>identifiability gap</italic>, a minimum separation between the entries of the estimated matrix that is critical to enable discrimination between connected and unconnected node pairs. Then we focus on the popular Granger estimator. First, we prove that this matrix estimator, followed by a <italic>universal</italic> clustering algorithm inspired by the <inline-formula><tex-math notation="LaTeX">$k$</tex-math></inline-formula>-means algorithm, learns faithfully the probed subgraph as the network size increases. This result is proved for the case where the network topology is obtained through an Erdős-Rényi <italic>random</italic> graph under <italic>statistical concentration</italic> of the node degrees, and the combination matrix is symmetric with nonzero entries bounded in terms of the reciprocal of the maximal and minimal degree. The analysis explores different connectivity regimes, including the <italic>dense</italic> regime where the probed nodes are influenced by many connections coming from the latent (hidden) part of the graph. Second, we answer a <italic>sample complexity</italic> question and establish that the number of samples for the Granger estimator scales almost <italic>quadratically</italic> with the expected graph degree. We also propose three other estimators that are proved to achieve faithful graph learning, and compare them to the Granger estimator, gaining nontrivial insights especially for the case of directed graphs.


I. INTRODUCTION
Learning the graph structure that governs the evolution of a networked dynamical system from data collected at some accessible nodes is a challenging inverse problem with applications across many domains. The objective of such inferential problems is to discover the interaction profile among the network nodes since the topology has a critical effect on system behavior [3], [4], [5], [6]. Graph learning plays a central role in many applications including, among other possibilities: estimating the longevity or the source of an epidemic [7], [8]; revealing commonalities and agent influence over social networks [9], [10], [11]; discovering the routes of clandestine information flows [12], [13]; identifying defective elements [14]; addressing the fundamental issue in neuroscience that links brain functional connectivity (i.e., a "functional" topology estimated from blood-oxygenation-level-dependent signals) to brain structural connectivity (i.e., the anatomical topology of neuron interconnections) [15], [16], [17].
Depending on the particular context, the aforementioned class of problems can be referred to in different ways, including topology inference [18], network tomography [19], [20], structure learning [21], or graph learning [22]. We adopt these terminologies almost interchangeably throughout our treatment.
This article addresses the graph learning problem under the framework of partial observability, i.e., when only a fraction of the network nodes can be probed. This setting is particularly important in large-scale networked systems, where it is not feasible to gather data from all nodes comprising the network. We solve the learning problem for distinct regimes of network wiring, including densely-connected networks, a case often overlooked in the literature.

II. OVERVIEW OF THE LEARNING PROBLEM
The structure of the learning problem addressed in this work can be summarized as follows. Streams of data originating from a certain subnetwork are collected, and the goal is to estimate the (unknown) topology linking the nodes of this subnetwork from the collected data. The graph learning protocol will involve two main steps: an estimation step, where a combination matrix (i.e., a matrix quantifying the strength of the connections among the network nodes) is estimated; and a thresholding step, where node pairs linked by a strong edge (i.e., node pairs whose corresponding estimated matrix entry lies above some threshold) are deemed connected. A structurally-consistent estimator is one that ends up assigning strong ties to interacting pairs and weak ties to non-interacting pairs. In this way, at the thresholding stage, one can correctly classify the pairs as interacting and non-interacting. Fig. 1 gives a graphical summary of the aforementioned procedure.
It is useful to illustrate the learning problem in relation to some popular networked systems. To start with, let us neglect some practical limitations; in particular, assume that: i) all nodes can be monitored (full observability); ii) there are no limitations in terms of computational power; and iii) there are no limitations on the available time samples. Then, the first inferential stage consists of finding a matrix estimator to quantify the strength of pairwise interactions in the network. One notable estimator relies on computing the (spatial) covariance matrix $R_0 = \lim_{n\to\infty} \mathbb{E}[y_n y_n^\top]$, where $y_n$ denotes the vector collecting the data from all nodes at time $n$, and where iii) justifies the limit and (under an ergodic assumption) implies that the statistical average can be learned from the data. When the matrix $R_0$ provides a consistent estimator for the connection strengths, we talk of a correlation network [18]. For these networks, interactions between two nodes are direct, and they are accordingly captured by pairwise correlations. One example of a correlation network is the ferromagnetic Ising model [23] with independent and identically distributed (i.i.d.) time samples, under certain constraints of sparsity on the network and of regularity on the interaction weights.
Another classic model for graph learning is a Gaussian graphical model. In this case, $R_0$ is no longer the proper estimator, but its inverse $R_0^{-1}$ (which is often referred to as the precision or concentration matrix) is a consistent estimator, in that its support coincides with the underlying graph of interactions. Over Gaussian graphical models, the pairwise interaction between adjacent nodes is affected by other nodes, and this latent influence is the reason why spatial correlation between measurements is no longer sufficient to capture the network structure.
For most standard graphical models, interactions across network nodes are described through a multivariate distribution that characterizes a collection of dependent random variables defined on the nodes. It is usually assumed that i.i.d. samples of these variables are available for the learning process. In other words, over graphical models the data samples do not arise from a dynamical process governing the time evolution of the nodes' outputs. In contrast, in this article we will be dealing with networked dynamical systems, where signals evolve at the nodes and are affected by the evolution of the signals at neighboring nodes as well. One relevant example is the diffusion or first-order Vector AutoRegressive (VAR) system described by (3) further ahead. For such graphs, the proper estimator for graph connectivity turns out to be the Granger estimator, $R_1 R_0^{-1}$, which combines in a suitable way information contained in the covariance matrix, $R_0$, and in the one-lag covariance matrix, $R_1 = \lim_{n\to\infty} \mathbb{E}[y_n y_{n-1}^\top]$.
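To make this concrete, the following sketch (NumPy, with illustrative parameters of our own choosing: a small 3-node symmetric matrix and Gaussian inputs) simulates a first-order VAR diffusion under full observability, forms the empirical covariance and one-lag covariance, and checks that the Granger estimator $R_1 R_0^{-1}$ recovers the combination matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# A small, stable (spectral radius < 1), symmetric combination matrix.
A = np.array([[0.5, 0.2, 0.0],
              [0.2, 0.4, 0.1],
              [0.0, 0.1, 0.3]])
N, n = A.shape[0], 200_000

# Run the diffusion y_n = A y_{n-1} + x_n with unit-variance Gaussian
# inputs, discarding a burn-in to reach approximate stationarity.
y = np.zeros(N)
samples = np.empty((n + 1, N))
for t in range(n + 1001):
    y = A @ y + rng.standard_normal(N)
    if t >= 1000:
        samples[t - 1000] = y

# Empirical covariance R0 and one-lag covariance R1.
Y_prev, Y_curr = samples[:-1], samples[1:]
R0 = Y_prev.T @ Y_prev / n          # E[y_n y_n^T]
R1 = Y_curr.T @ Y_prev / n          # E[y_n y_{n-1}^T]

A_granger = R1 @ np.linalg.inv(R0)  # Granger estimator R1 R0^{-1}
```

In stationarity $R_1 = A R_0$ exactly, so the estimator converges to $A$ as the number of samples grows; with the parameters above the entrywise error is already small.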

A. STRUCTURAL-CONSISTENCY, HARDNESS AND SAMPLE-COMPLEXITY
We are now ready to introduce three concepts that play an important role in graph learning problems.

1) STRUCTURAL CONSISTENCY
In the previous examples, the graph structure can be retrieved from a statistical descriptor related to the measurements, i.e., $R_0$ for correlation networks, $R_0^{-1}$ for Gaussian graphical models, and $R_1 R_0^{-1}$ for VAR models. Since the involved covariance matrices can be computed from the measurements, with arbitrary precision as the number of samples increases, we conclude that in the aforementioned three examples the graph can be correctly identified.
In a more general setting, consider a statistical descriptor that can be consistently estimated from the data as the number of samples goes to infinity. If, for sufficiently large networks, the statistical descriptor allows one to identify the correct graph structure, we shall say that the graph learning problem is identifiable, and that the corresponding descriptor achieves structural consistency.
In this work we focus on the case in which the measurements are available from only a limited subset of nodes, with identifiability referring here to the subgraph connecting these monitored nodes. Under this setting,¹ many interesting questions arise, such as: Does partial observability impair identifiability of the monitored subnetwork? If not, how can we design structurally-consistent estimators? We remark that the concepts of identifiability and structural consistency disregard complexity issues, since they assume that the necessary statistical quantities (e.g., the true covariance matrices) are available. For example, we assume that $R_1 R_0^{-1}$ can be computed exactly, which means that matrix inversion is possible whatever the size of the network, and that we have sufficient time to learn perfectly the true covariance matrices.

2) HARDNESS
How much computational complexity is required to evaluate the matrix estimator necessary for a particular graph learning problem? For example, if one is interested in the precision matrix, R −1 0 , the hardness is related to the complexity of the matrix inversion. Many works attempt to reduce the complexity by leveraging particular constraints such as smoothness of the node-level signals and/or sparsity of the underlying graph of interactions.
Notably, there are topology inference problems that are computationally intractable (e.g., NP-hard) [24], [25], [26], [27]. It is therefore critical to identify meaningful systems where this does not happen [28]. The present work sheds light on a relevant class of systems whereby structure learning with statistically dependent observations and under partial observability has affordable computational complexity, even for the important case of dense (thus, loopy and non-tree-like) networks. However, we remark that the concept of hardness disregards the complexity associated with the empirical estimation of the pertinent matrices from the available data samples. This fundamental element of complexity is usually referred to as sample complexity.

¹ It should be remarked, however, that certain issues may arise also in the full observability case. For instance, as the network size increases, the entries of the matrix estimators might become smaller and smaller, and, hence, it is critical to identify whether a stable threshold can be found to classify connected/unconnected node pairs.

3) SAMPLE COMPLEXITY
In practice, only a finite amount of data is available and, therefore, only approximate versions of the aforementioned matrix estimators can be computed. There exist several results about sample complexity in the context of high-dimensional graphical models, where the number of samples necessary to get some prescribed accuracy is related to the system parameters, e.g., to the network size and to the density of connections. Results on sample complexity over dynamical graph systems are comparably less mature [18]. Over these systems, the dependence among the time samples induced by the dynamical model complicates the theoretical analysis of the convergence of the empirical estimators and, hence, the analysis of sample complexity. Useful results about the sample complexity of vector autoregressive models like the one addressed in this work are available in [29], [30], [31], [32]. These results do not consider the partial observability setting and, hence, they do not apply here.² Nonetheless, in [29], [30], [31], [32] we can find useful techniques to bound the errors associated with the empirical covariance matrices over vector autoregressive models. These types of bounds will be exploited in the proof of Theorem 3, where we establish the sample complexity of the estimators proposed in this work.

III. RELATED WORK
The learning problem considered in this work lies in the broad field of signal processing over graphs [33], [34], [35], [36], [37], [38], [39], [40], [41], [42], [43], [44], [45], [46], dealing in particular with the identification of an unknown network topology from measurements gathered at the network nodes [18]. These types of inferential problems can be addressed under different settings, including the case where measurements from all network nodes are available (full observability), and the case where only a fraction of nodes is accessible (partial observability). Even if we focus on the partial observability setting, we deem it useful to start with some results pertaining to the full observability setting.

A. GRAPH LEARNING UNDER FULL OBSERVABILITY
The majority of works on graph learning over networks focuses on linear system dynamics, with nonlinear dynamics typically being tackled by variational characterizations under a small-noise assumption [47], [48], [49], or by increasing the dimensionality of the observable space [50], [51].
Topology inference for a general class of linear stochastic dynamical systems (e.g., VAR models of arbitrary order, or even non-causal linear models) is addressed in [52]. An approach based on Wiener filtering is proposed to infer the topology, which provides exact reconstruction for self-kin networks or, in general, guarantees reconstruction of the smallest self-kin network embracing the true network.
There exist works dealing with more general dynamical systems and graphs. For example, in [53] the concept of directed information graphs is advocated to discover dependencies in networks of interacting processes linked by causal dynamics, and a metric based on functional dependencies is proposed in [54] to learn causal relationships over a possibly nonlinear network.
Moving closer to our setting, among linear (or linearized) systems, special attention is devoted to autoregressive diffusion models [55], [56], [57]. For instance, in [55] causal graph processes are exploited to devise a computationally tractable algorithm for graph structure recovery with performance guarantees. Recent works exploit optimization and graph signal processing techniques to feed the graph learning algorithm with proper structural constraints. In [56], [57], it is shown how to capitalize on the fact that the weighting matrix and the covariance matrix share the same eigenvectors, and how to solve the topology inverse problem through optimization methods under sparsity constraints. An account of the methods for the full observability regime can be found in [18].
We stress that most of the aforementioned methods work in the graph spectral domain. This has to be contrasted with the methods proposed in this paper, which rely instead on the graph edge domain. Working in the edge domain allows us to obtain a transparent relationship for the matrix estimators, which is critical to establish identifiability under the challenging partial observability setting.
In summary, while in certain cases (e.g., general linear models and/or nonlinear models) identifiability can be an issue, most of the works on diffusion models focus on reducing complexity by exploiting proper structural constraints (e.g., smoothness or sparsity of the signals defined on the graph) [18]. However, all the aforementioned results pertain to the case where node measurements from the entire network are available. We focus instead on the case in which only partial observation of the network is permitted.

B. GRAPH LEARNING UNDER PARTIAL OBSERVABILITY
A fundamental challenge of our work is performing structure learning when only partial observation is allowed [58]. Under this setting, results for retrieving particular network graphs (polytrees) are available in [59] and [60]. Considering instead the case of general topologies, and with focus on VAR/diffusion models like the one considered in this work, references [61], [62] establish technical conditions for exact or partial topology identifiability. However, these identifiability conditions act at a very "microscopic" level (they are formulated in terms of precise details of the local graph structure and/or of the statistical model), and are therefore impractical over large-scale networks. In contrast, in this work we pursue a statistical approach that is genuinely tailored to the large-scale setting: an asymptotic framework is considered, where the thermodynamic limit of large networks is afforded by using random graphs, and the conditions on the network connection topology are summarized at a macroscopic level through average descriptive indicators (e.g., the probability of drawing an edge). In a similar vein, detailed asymptotic analyses with performance guarantees are available for graphical models with latent variables. For example, in [63] it is shown that, under certain conditions concerning the interactions between the observed and the unobserved network nodes, the "sparsity+low-rank" framework can be exploited to estimate the number of latent variables and to reconstruct the topology of the observable subnetwork. Likewise, in [21] the graph learning problem is tackled in the context of locally tree-like graphs, whereas in [64] a local separation criterion is imposed to deal with Gaussian graphical models.
Still in the framework of learning graphical models with latent variables, in [65] an influence-maximization metric is proposed to show that ferromagnetic restricted Boltzmann machines with bounded degree are an instance of graphical models that can be efficiently learned.
However, and as already explained in Section II, graphical models do not match the networked dynamical models considered in this work. For these models, results for graph learning under partial observability have been recently obtained in [19], [20], [66], [67], [68]. More specifically, i) in [19] the whole network graph is assumed to follow an Erdős-Rényi construction and the number of observable nodes grows with the overall network size N; whereas ii) in [20] the number of monitored nodes is held fixed (and, hence, the fraction of observable nodes vanishes in the limit of large N), the graph of the monitored nodes is left arbitrary, and the unobserved component continues to obey an Erdős-Rényi model. The present work focuses on the former model.³ It is therefore necessary to explain clearly why the present work constitutes significant progress in the context of local tomography over diffusion networks, in comparison to [19].

C. MAIN ADVANCES
The key contributions of this work in relation to [19] are as follows.
- One first advance relates to the regime of connectivity. Reference [19] addresses only the case where the network is sparsely connected, which means that the connection probability is allowed to vanish with N in a way that preserves network connectedness. In this work we examine also the dense regime, where the connection probability is not vanishing.
- We advance also with respect to the results currently available under the sparse regime. In [19] a consistency result is proved, for all sparsely connected networks, in terms of an average fraction of misclassified node pairs. In the present work, we introduce in Section V a significantly stronger notion of consistency (referred to as universal local structural consistency), which applies to all networks that fulfill a property of statistical concentration of node degrees. These networks will be shown to comprise all the dense networks, as well as most of the sparse networks. Using such a stronger notion of consistency also answers the following important question (posed, and only partially answered, in [19], where the answer was obtained under a certain approximation of independence): does the fraction of mistakes scale properly with N?
- We offer a rigorous proof that the connected and unconnected node pairs can be recovered through a universal clustering algorithm. In particular, we propose a variant of the k-means algorithm that is shown to be asymptotically consistent.
- We achieve the significant advance of ascertaining the sample complexity of the Granger estimator, i.e., we establish how the number of samples must scale with the network size to let this estimator learn the graph of probed nodes faithfully.
- Another advance relates to the topology inference algorithms. The matrix estimators available in the full-observability setting help guide the choice of a matrix estimator for the partial observability setting. For example, one can replace the Granger estimator with a version that considers only the subnetwork of observed measurements. This choice is widely adopted in causal inference from time series (when one neglects the existence of latent components), and has been adopted, e.g., in [19], [20]. In this work we characterize thoroughly the Granger estimator. Then, in Section VIII we consider three other matrix estimators, namely: the one-three-lags estimator (computing the difference between the one-lag and three-lags covariance matrices); the one-lag covariance matrix; and the covariance matrix of the residuals (i.e., of the differences between subsequent time samples). We show that all three estimators are structurally consistent.
- Finally, we examine the important case of directed graphs. Our theorems are proved under the assumption of symmetric combination matrices. However, while the three new estimators are implicitly constructed by exploiting symmetry, the construction of the Granger estimator does not assume any symmetry, and, hence, we expect that it performs well also over directed graphs. We show by numerical simulations that this is actually the case. In contrast, and remarkably, the other estimators lose their learning ability over directed graphs.

Notation: We use boldface letters to denote random variables, and normal font letters for their realizations. Matrices are denoted by capital letters, and vectors by small letters. This convention may occasionally be violated; for example, the total number of network nodes is denoted by N. The symbol $\xrightarrow{\text{p}}$ denotes convergence in probability as $N \to \infty$.
Sets and events are denoted by upper-case calligraphic letters, whereas the corresponding normal-font letter will denote the cardinality of the set. For example, the cardinality of $\mathcal{S}$ is $S$. The complement of $\mathcal{S}$ is denoted by $\mathcal{S}'$.

IV. NETWORKED DYNAMICAL SYSTEM
Let $y_i(n)$ be the output measurement produced by node $i$ at time $n$. Likewise, let $x_i(n)$ be the input source (e.g., streaming data or noise) exciting node $i$ at time $n$. It is convenient to stack the input and output variables, respectively, into the vectors:

$$x_n \triangleq [x_1(n), x_2(n), \ldots, x_N(n)]^\top, \qquad (1)$$
$$y_n \triangleq [y_1(n), y_2(n), \ldots, y_N(n)]^\top. \qquad (2)$$

The stochastic dynamical system considered in the present work is given by the following network diffusion process (a.k.a. first-order Vector AutoRegressive (VAR) model):

$$y_n = A\, y_{n-1} + \sigma\, x_n, \qquad (3)$$

where $A$ is some stable $N \times N$ matrix with nonnegative entries, and $\sigma^2$ is a variance factor. The bold notation for $A$ is used since, as explained in the next section, we will be dealing with random graphs. By rewriting (3) on an entrywise basis:

$$y_i(n) = \sum_{\ell=1}^{N} a_{i\ell}\, y_\ell(n-1) + \sigma\, x_i(n), \qquad (4)$$

we readily see that the support-graph of $A$ reflects the connections among the network nodes. Indeed, (4) shows that, at time $n$, the output of node $i$ is updated by combining the outputs of other nodes from time $n-1$. In particular, node $i$ scales the output of node $\ell$ by using a combination weight $a_{i\ell}$, which implies that the output of node $\ell$ is effectively used by node $i$ if, and only if, $a_{i\ell} \neq 0$. After the combination step, the output measurement $y_i(n)$ is adjusted by incorporating the streaming-source value, $x_i(n)$, which is locally available at node $i$ at current time $n$.
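As a minimal sanity check (all values chosen purely for illustration), one step of the recursion can be computed both in the matrix form (3) and in the entrywise form (4), confirming they coincide:

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma = 5, 0.5
A = rng.random((N, N)) * 0.1        # nonnegative entries; stand-in matrix
y_prev = rng.standard_normal(N)     # outputs at time n-1
x_now = rng.standard_normal(N)      # input sources at time n

# Matrix form (3): y_n = A y_{n-1} + sigma x_n
y_matrix = A @ y_prev + sigma * x_now

# Entrywise form (4): node i combines the past outputs of all nodes l
# through the weights a_{il}, then adds its local source.
y_entry = np.array([
    sum(A[i, l] * y_prev[l] for l in range(N)) + sigma * x_now[i]
    for i in range(N)
])
```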
In our partial observability setting, only a subset of nodes can be probed: for each node $i$ belonging to the subset of probed nodes, $\mathcal{S}$, a stream of $n$ measurements, $y_i(1), y_i(2), \ldots, y_i(n)$, is acquired. The learning task is to reconstruct the graph of interconnections corresponding to the combination (sub)matrix $A_{\mathcal{S}}$, i.e., the submatrix of $A$ obtained by retaining the rows and columns indexed by $\mathcal{S}$. Since we will study the graph learning problem in the asymptotic regime where the network size $N$ goes to infinity, it is necessary to specify how the cardinality of the probed set scales with $N$.
Definition 1 (Partial observability setting): The subnetwork of observable measurements, $\mathcal{S}$, has a cardinality $S$ scaling as:

$$\lim_{N\to\infty} \frac{S}{N} = \xi, \qquad 0 < \xi < 1, \qquad (5)$$

which means that $\xi$ is the (asymptotic) fraction of monitored nodes. Since $\xi$ is strictly less than one, condition (5) conforms to a partial observability setting.

A. RANDOM GRAPH AND COMBINATION MATRIX
In the following, we denote by $G$ the adjacency matrix of the network graph, whose entry $g_{ij}$ is equal to one if nodes $i$ and $j$ are connected, and is equal to zero otherwise. The bold notation is used because we deal with random graphs. Given a realization of the random graph, a combination matrix is obtained by applying a certain combination rule or policy to this graph. The combination policy defines how the weights $a_{ij}$ are assigned given the particular graph structure. Formally, we have that:

$$A = \pi(G), \qquad (6)$$

where $\pi: \{0,1\}^{N\times N} \to \mathbb{R}^{N\times N}$ is a deterministic policy and the randomness of $A$ arises from the randomness of graph $G$.
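As an illustration, the following sketch draws an Erdős-Rényi adjacency matrix and applies one possible symmetric policy $\pi$, which scales the adjacency by the reciprocal of the maximum degree. This particular rule is our own illustrative choice, not necessarily one of the policies analyzed later, but it produces the kind of nonnegative, symmetric, stable combination matrices the model assumes:

```python
import numpy as np

def erdos_renyi_adjacency(N, p, rng):
    """Symmetric adjacency of an Erdos-Renyi graph (no self-loops)."""
    upper = np.triu(rng.random((N, N)) < p, k=1)
    return (upper | upper.T).astype(float)

def policy_pi(G, rho=0.9):
    """Illustrative deterministic policy: a_ij = rho * g_ij / d_max.

    Symmetric and nonnegative by construction; the spectral radius is
    bounded by rho < 1 (each row sum is rho * d_i / d_max <= rho).
    """
    d_max = max(G.sum(axis=1).max(), 1.0)
    return rho * G / d_max

rng = np.random.default_rng(2)
G = erdos_renyi_adjacency(50, 0.2, rng)
A = policy_pi(G)
```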
In summary, the system in (3) contains three sources of randomness, namely, the combination matrix $A$, the initial state $y_0$, and the input source $\{x_n\}$. Let us now detail how these sources of randomness interact. Once a realization $A = A$ is given, the stochastic dynamical system in (3) evolves according to the randomness of the initial state and the input source. In the statistical physics jargon, matrix $A$ is a quenched variable: once realized, it is frozen and the process $y_n$ evolves over the same matrix for all $n$.
Conditionally on $A$, vector $y_0$ is assumed to have finite-variance entries, possibly mutually dependent and dependent on the particular matrix realization. The random variables $x_i(n)$ are independent of $y_0$ and of the matrix realization. They have zero mean and unit variance, and are independent and identically distributed (i.i.d.), both spatially (i.e., w.r.t. index $i$) and temporally (i.e., w.r.t. index $n$).
The particular distribution of $y_0$ and $x_n$ is immaterial to the results stated in Theorems 1 and 2 further ahead. In comparison, Theorem 3 is proved under the assumption of Gaussian $x_n$, and initial state $y_0$ distributed (conditionally on $A$) according to the stationary distribution of the VAR model.⁴ These two assumptions are commonly adopted in the sample-complexity analysis of graph learning models, where the available results exploit the concentration properties of the covariance matrices $R_0$ and $R_1$ under stationary Gaussian VAR models [29], [30], [31], [32].

V. CONSISTENT GRAPH LEARNING
Let us consider a stream of $n$ consecutive observations taken over the probed subset of nodes $\mathcal{S}$, and collected into the $S \times n$ matrix $Y_n$ whose $(\ell, i)$ entry is, for $\ell \in \mathcal{S}$ and $i = 1, 2, \ldots, n$:

$$[Y_n]_{\ell i} = y_\ell(i). \qquad (7)$$

A matrix estimator will be formally defined as some measurable function of the data, namely,

$$\widehat{A}_{\mathcal{S},n} = f_n(Y_n). \qquad (8)$$

We focus on the class of asymptotically stable estimators that converge as the number of samples increases. In particular, we consider the class of estimators that, for any realization of the combination matrix $A$, guarantee the following convergence in probability:

$$\widehat{A}_{\mathcal{S},n} \xrightarrow[n\to\infty]{\text{p}} h. \qquad (9)$$

We remark that the limiting estimator in (9), $h$, is a deterministic quantity, given $A = A$. However, this limit will in general be different for the $2^{N(N-1)/2}$ possible realizations of the random graph, i.e., we should write $h = h(A)$ in (9). In the following, we will use the notation

$$\widehat{A}_{\mathcal{S}} \triangleq h(A), \qquad (10)$$

where we suppressed the sample subscript $n$ to denote the limiting matrix estimator $\widehat{A}_{\mathcal{S}}$. The terminology "limiting matrix estimator" will be generally used in our achievability analysis, i.e., when we disregard sample complexity and let $n = \infty$. Using (10) in (9) we have:

$$\widehat{A}_{\mathcal{S},n} \xrightarrow[n\to\infty]{\text{p}} \widehat{A}_{\mathcal{S}}. \qquad (11)$$

Owing to the dependence on $A$, the limiting matrix estimator $\widehat{A}_{\mathcal{S}}$ is still random, and its randomness is determined solely by the randomness of the underlying graph. Our main goal is to establish that, in the commonly adopted doubly-asymptotic framework [21], [63] where the network size $N$ becomes large and the number of samples $n$ increases with $N$, it is possible to retrieve consistently (i.e., with probability converging to 1) the true graph of the probed subset of nodes. In order to achieve this goal, in the next section we start by examining the asymptotic properties of the limiting estimator $\widehat{A}_{\mathcal{S}}$ as the network size $N$ goes to infinity.
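The construction above can be sketched as follows (NumPy, with illustrative parameters of our own choosing): we assemble the $S \times n$ data matrix from the probed rows of a simulated diffusion process and evaluate one natural choice of $f_n$, namely the Granger-type estimator computed from the empirical covariances restricted to $\mathcal{S}$, in the spirit of the truncated estimator discussed in Section III:

```python
import numpy as np

rng = np.random.default_rng(3)
N, n_samples, xi = 30, 50_000, 0.5

# A stable symmetric combination matrix over an Erdos-Renyi-type support
# (the scaling by the maximum degree is an illustrative policy).
G = np.triu(rng.random((N, N)) < 0.2, k=1)
G = (G | G.T).astype(float)
A = 0.9 * G / max(G.sum(axis=1).max(), 1.0)

# Generate the full process, but retain only the probed subset S.
S = np.arange(int(xi * N))            # probed nodes (first xi*N of them)
y = np.zeros(N)
Y = np.empty((len(S), n_samples))     # data matrix Y_n: S x n
prev = np.empty_like(Y)               # probed outputs at time n-1
for t in range(n_samples + 500):      # 500-step burn-in
    y_new = A @ y + rng.standard_normal(N)
    if t >= 500:
        Y[:, t - 500] = y_new[S]
        prev[:, t - 500] = y[S]
    y = y_new

# Empirical covariances restricted to S, and the resulting estimator
# f_n(Y_n) = [R1]_S ([R0]_S)^{-1}.
R0_S = prev @ prev.T / n_samples
R1_S = Y @ prev.T / n_samples
A_S_hat = R1_S @ np.linalg.inv(R0_S)
```

Whether the entries of such an estimator actually separate connected from unconnected pairs as $N$ grows is exactly the question addressed next.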

A. ANALYSIS OF THE LIMITING ESTIMATOR AS N → ∞
In order to ascertain whether or not it is possible to discriminate interacting (i.e., connected) from non-interacting (i.e., unconnected) nodes, via observation of their output measurements, we now introduce the concepts of margins and identifiability gap.

Definition 2 (Margins): Let $\widehat{A}_{\mathcal{S}}$ be a limiting matrix estimator for $A_{\mathcal{S}}$. The lower and upper margins corresponding to the unconnected pairs are defined as, respectively:⁵

$$\delta_N \triangleq \min_{\substack{i,j \in \mathcal{S},\, i \neq j \\ (i,j)~\mathrm{unconnected}}} [\widehat{A}_{\mathcal{S}}]_{ij}, \qquad \bar{\delta}_N \triangleq \max_{\substack{i,j \in \mathcal{S},\, i \neq j \\ (i,j)~\mathrm{unconnected}}} [\widehat{A}_{\mathcal{S}}]_{ij}. \qquad (12)$$

Likewise, the lower and upper margins corresponding to the connected pairs are defined as, respectively:

$$\Delta_N \triangleq \min_{\substack{i,j \in \mathcal{S} \\ (i,j)~\mathrm{connected}}} [\widehat{A}_{\mathcal{S}}]_{ij}, \qquad \bar{\Delta}_N \triangleq \max_{\substack{i,j \in \mathcal{S} \\ (i,j)~\mathrm{connected}}} [\widehat{A}_{\mathcal{S}}]_{ij}. \qquad (13)$$

⁵ The definitions in (12) and (13) are void if the nodes in $\mathcal{S}$ are all connected or all unconnected, respectively. For these singular cases, we can formally assign arbitrary values to the margins. We will see later that, under the Erdős-Rényi model, these events are irrelevant as $N \to \infty$.

The aforementioned margins are useful to examine the achievability of structural consistency for a limiting estimator $\widehat{A}_{\mathcal{S}}$ (see Fig. 2 for an illustration), and lead to the concept of identifiability gap.

Definition 3 (Universal local structural consistency): Let $\widehat{A}_{\mathcal{S}}$ be a limiting matrix estimator for $A_{\mathcal{S}}$. If there exist a positive sequence $s_N$, a real value $\eta$, and a positive value $\Gamma$, such that, for any $\epsilon > 0$:

$$\lim_{N\to\infty} P\big(|s_N \delta_N - \eta| > \epsilon\big) = 0, \quad \lim_{N\to\infty} P\big(|s_N \bar{\delta}_N - \eta| > \epsilon\big) = 0, \quad \lim_{N\to\infty} P\big(|s_N \Delta_N - (\eta + \Gamma)| > \epsilon\big) = 0, \quad \lim_{N\to\infty} P\big(|s_N \bar{\Delta}_N - (\eta + \Gamma)| > \epsilon\big) = 0, \qquad (14)$$
we say that $\widehat{A}_{\mathcal{S}}$ achieves universal local structural consistency, with scaling sequence $s_N$, bias $\eta$, and identifiability gap $\Gamma$.

Remark 1 (Locality): We use the qualification "local" to emphasize that the structure of the subnetwork $\mathcal{S}$ must be inferred from observations gathered locally in $\mathcal{S}$, even if the nodes of $\mathcal{S}$ undergo the influence of many other nodes belonging to the larger embedding network.
Remark 2 (Bias): For the true combination matrix, the entries corresponding to unconnected pairs are zero. In contrast, (14) reveals that the scaled entries for unconnected pairs can be close to η, which results therefore in a bias. However, and remarkably, this bias does not constitute a problem for consistent classification of connected/unconnected node pairs, i.e., the bias does not affect in any manner identifiability.
Remark 3 (Identifiability gap): Since the extremal entries of $s_N \widehat{A}_{\mathcal{S}}$ coincide with the margins, i.e., we can write:

$$\min_{(i,j)~\mathrm{connected}} s_N\,[\widehat{A}_{\mathcal{S}}]_{ij} = s_N \Delta_N, \qquad \max_{\substack{(i,j)~\mathrm{unconnected} \\ i \neq j}} s_N\,[\widehat{A}_{\mathcal{S}}]_{ij} = s_N \bar{\delta}_N,$$

from the second and third formulas in (14) we conclude that, with high probability as $N$ gets large:⁶

$$\min_{(i,j)~\mathrm{connected}} s_N\,[\widehat{A}_{\mathcal{S}}]_{ij} \geq \eta + \Gamma - \epsilon, \qquad \max_{\substack{(i,j)~\mathrm{unconnected} \\ i \neq j}} s_N\,[\widehat{A}_{\mathcal{S}}]_{ij} \leq \eta + \epsilon. \qquad (18)$$

The first inequality in (18) means that the minimum entry of $s_N \widehat{A}_{\mathcal{S}}$ taken over the connected pairs essentially stays above the value $\eta + \Gamma > \eta$. Likewise, the second inequality means that the maximum entry of $s_N \widehat{A}_{\mathcal{S}}$ taken over the unconnected pairs essentially does not exceed $\eta$. Combining these two relationships, we conclude that the entries of the (limiting) estimated matrix corresponding to connected node pairs stand clearly separated from the entries corresponding to unconnected node pairs. The minimum amount of separation is quantified by the gap, $\Gamma$.
The notion of structural consistency implies the existence of a threshold, lying between $\eta$ and $\eta + \Gamma$, which correctly separates (in the limit of large $N$) the entries of the matrix estimator, in such a way that the entries corresponding to connected pairs lie above the threshold, whereas the entries corresponding to unconnected pairs lie below the threshold.
However, an accurate determination of the separation threshold requires some prior knowledge of the monitored system. For instance, to set a detection threshold one needs to know the scaling sequence $s_N$. In the problems dealt with in this work we will see that $s_N = N p_N$, where $N p_N$ represents the average number of neighbors in the network, and in several practical applications this number is unknown. As a result, the threshold setting might be a critical issue, and it would be highly desirable to have a universal (i.e., blind and nonparametric) method to set the threshold: for example, by means of machine learning tools such as a standard clustering algorithm (e.g., k-means clustering with k = 2). As discussed in the next remark, this possibility is automatically enabled by the clustering property embodied in Definition 3.

Footnote 6: Actually, the existence of a gap would be guaranteed by a weaker notion of consistency. However, this weaker notion would not be sufficient to guarantee the important clustering property that we discuss in Remark 4.
Remark 4 (Clustering): According to the notion of universal structural consistency, the pair of (scaled) margins over the unconnected pairs, $s_N \underline{\delta}_N$ and $s_N \overline{\delta}_N$, converge to one and the same value, η, which implies that all the entries of $s_N \widehat{A}_S$ corresponding to the unconnected pairs are sandwiched between these margins (see Fig. 2). A similar behavior is observed for the scaled entries over the connected pairs, which all converge to η + Γ, since they are sandwiched between $s_N \underline{\Delta}_N$ and $s_N \overline{\Delta}_N$. In summary, we conclude that the connected and unconnected node pairs cluster into well-separated classes that can be identified, e.g., by means of a universal clustering algorithm.

B. A CONSISTENT CLUSTERING ALGORITHM
The definition of universal local structural consistency implies that the unconnected node pairs cluster around η, whereas the connected node pairs cluster around the higher value η + Γ. Accordingly, for sufficiently large N, any reasonable clustering algorithm will properly identify these two clusters. For example, an asymptotically correct guess of the true clusters (i.e., of the true graph) can be obtained by simply choosing as a threshold the midpoint between the maximum and minimum matrix entries.
The simplest classification rule based on such a threshold does not try to cluster the data, but it works asymptotically since, as N → ∞, the scaled matrix entries converge to two possible values, η or η + Γ. For finite network and/or sample sizes, the scaled matrix entries exhibit a certain variability around these values, and it would be more appropriate to employ a clustering algorithm, like the popular k-means algorithm (in our case, we know that k = 2).
However, the k-means algorithm has a well-known drawback in the case of unbalanced clusters. One example of unbalanced clusters is shown in Fig. 3: the k-means algorithm (top panel) tends to be highly biased by the largest cluster, resulting in a wrong configuration. Here the two centroids estimated by the k-means algorithm are both located within the largest ensemble (circles), leading to a wrong classification. Since in our model it is actually permitted that, for large N, one cluster dominates the other (for instance, when $p_N \to 0$, the cluster of unconnected node pairs becomes predominant), we are not guaranteed that the k-means algorithm works properly as N → ∞. In summary, the k-means algorithm with k = 2 can mitigate finite-size issues, but it does not provide asymptotic guarantees. In order to solve this issue, we propose a simple modification of the k-means algorithm, detailed in the pseudo-code shown in Algorithm 1.
Let v be the L × 1 vector to be clustered, with entries arranged in ascending order. The k-means algorithm, with k = 2, attempts to minimize the following cost function:
$$J(\mathcal{C}_0, \mathcal{C}_1) = \sum_{i \in \mathcal{C}_0} (v_i - c_0)^2 + \sum_{i \in \mathcal{C}_1} (v_i - c_1)^2, \tag{19}$$
over all possible clusters $\mathcal{C}_0$ and $\mathcal{C}_1$, with $c_0$ and $c_1$ being the cluster centroids, defined as:
$$c_k = \frac{1}{|\mathcal{C}_k|} \sum_{i \in \mathcal{C}_k} v_i, \qquad k = 0, 1.$$
It is useful to recall that the minimum of the cost function in (19) must fulfill the following necessary condition: the midpoint between the two centroids is a threshold that separates the two clusters [91]. In our one-dimensional case, with k = 2, this property implies that it suffices to consider only the cluster configurations $\mathcal{C}_0(j) = \{1, 2, \ldots, j\}$ and $\mathcal{C}_1(j) = \{j+1, j+2, \ldots, L\}$, for $j = 1, 2, \ldots, L-1$. Obviously, if two points, say $v_i$ and $v_{i+1}$, are coincident, considering their possible permutations is pointless. Accordingly, we see that any possible partition is identified by an index j. Now, for large N the true clusters of connected and unconnected pairs become almost perfectly localized and, hence, two centroids belonging to the true clusters match the necessary condition (see the bottom panel in Fig. 3). However, as we have observed when examining Fig. 3, if the sizes of the true clusters are very different, the k-means algorithm can be in error. For this reason, we now introduce a simple modification of the k-means algorithm that overcomes this issue by taking into account also the distance between the centroids. First, the algorithm enumerates all admissible cluster pairs through an index j spanning the set $\{1, 2, \ldots, L-1\}$. The indices fulfilling the necessary condition of k-means are collected in the set $\mathcal{A} = \{j_1, j_2, \ldots\}$. At this stage, the classic k-means would simply select, among these admissible configurations, the one ensuring the minimum cost.
We modify this rule by selecting the index $\widehat{j} \in \mathcal{A}$ that maximizes the distance between the clusters' centroids, namely, $\widehat{j} = \arg\max_{j \in \mathcal{A}} \,[c_1(j) - c_0(j)]$, with $c_0(j)$ and $c_1(j)$ being the centroids corresponding to the clusters identified by index j. With this modified rule, we want to i) retain the good behavior exhibited by k-means in typical situations; and ii) guarantee that the algorithm achieves consistent graph learning, as we will formally establish in the next theorem.
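The modified rule can be sketched as follows. This is an illustrative Python sketch, not the authors' pseudo-code; the function name `modified_kmeans` is ours.

```python
import numpy as np

def modified_kmeans(v):
    """Two-cluster (k = 2) clustering of one-dimensional data.

    Enumerates the ordered partitions C0(j) = {v_1, ..., v_j} and
    C1(j) = {v_{j+1}, ..., v_L}, keeps the admissible ones (those fulfilling
    the k-means necessary condition that the centroid midpoint separates the
    clusters), and, instead of minimizing the k-means cost, returns the
    admissible partition maximizing the centroid distance c1(j) - c0(j).
    """
    v = np.sort(np.asarray(v, dtype=float))
    L = v.size
    admissible = []
    for j in range(1, L):
        c0, c1 = v[:j].mean(), v[j:].mean()
        midpoint = 0.5 * (c0 + c1)
        if v[j - 1] <= midpoint <= v[j]:   # necessary condition of k-means
            admissible.append((c1 - c0, j))
    _, j_best = max(admissible)            # maximize the centroid distance
    return v[:j_best], v[j_best:]
```

On strongly unbalanced data such as v = [0.10, 0.11, 0.12, 0.90], only the split isolating the outlying point fulfills the necessary condition, so the modified rule separates the small cluster instead of splitting the large one.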
Before stating the theorem, let us introduce the input-output relationship that relates the estimated combination matrix to the estimated graph through the clustering algorithm. The procedure is as follows. Once an estimated matrix $\widehat{M}$ is computed, it is vectorized and sorted in ascending order, before feeding the clustering algorithm described in the pseudo-code reported in this page. More formally, given a matrix M with index set S, let diag(M) be the diagonal matrix with the same diagonal entries as M, and let vec(·) be a one-to-one mapping that transforms the off-diagonal entries of a matrix into an L × 1 vector. We then compute:
$$u = \mathrm{vec}\big(M - \mathrm{diag}(M)\big), \qquad v = \Pi u,$$
with Π being the permutation matrix that sorts the entries of u in ascending order. The vector v is given as input to the clustering algorithm clu(v), obtaining the clusters $\mathcal{C}_0$ and $\mathcal{C}_1$, as described in the pseudo-code of the algorithm. We use the reverse permutation $\Pi^{-1}$ to cluster the entries of u from the corresponding clustering on the entries of v. Then, using the inverse mapping $\mathrm{vec}^{-1}$, we decide which cluster a particular entry $\widehat{m}_{ij}$ belongs to, and accordingly estimate an adjacency matrix $\widehat{G}$, whose main diagonal is set conventionally to 0, and whose off-diagonal entries, for all i, j ∈ S with i ≠ j, are set as:
$$[\widehat{G}]_{ij} = \mathbb{I}\big[\widehat{m}_{ij} \text{ belongs to cluster } \mathcal{C}_1\big],$$
where $\mathbb{I}[E]$ denotes the indicator function, which is equal to 1 if condition E is true, and equal to 0 otherwise. The overall mapping that leads from $\widehat{M}$ to $\widehat{G}$, passing through the algorithm clu, will be compactly denoted by graphclu.

Theorem 1 (Sample consistency of the proposed clustering algorithm): Let $\widehat{A}_{S,n}$ be a stable matrix estimator belonging to class (11), with the limiting matrix estimator $\widehat{A}_S$ achieving universal local structural consistency according to Definition 3. Let $G_S$ be the (random) support graph associated to $A_S$ and let $\widehat{G}_S$ be the subgraph estimated by the proposed clustering algorithm. If the probability that $G_S$ is fully connected or fully disconnected vanishes as N → ∞, a certain scaling law n(N) exists such that:
$$\lim_{N \to \infty} P\big[\widehat{G}_S = G_S\big] = 1.$$
Proof: See Appendix H.

Footnote 7: In principle we could have multiple maximizers, but we will see later in Lemma 8 and Theorem 1 that in our case the maximizer $\widehat{j}$ is unique with high probability. Footnote 8: The mapping vec(·) is invertible since it operates on matrices with null diagonal.

VI. GRANGER ESTIMATOR
Preliminarily, it is useful to examine the steady-state covariance matrix corresponding to the dynamics in (3).
According to our generative model, given a certain realization of the combination matrix A = A, we let the system evolve according to (3). Exploiting (3), the covariance matrix $R_{0,n} = \mathbb{E}[y_n y_n^\top]$ obeys the recursion:
$$R_{0,n} = A\, R_{0,n-1} A^\top + \sigma^2 I. \tag{27}$$
Using now the stability and symmetry of A along with (27) we get the following limiting covariance (convergence of the series is guaranteed by stability):
$$R_0 = \sigma^2 \sum_{k=0}^{\infty} A^{2k} = \sigma^2 (I - A^2)^{-1}, \tag{28}$$
where I is the N × N identity matrix. The bold notation $\mathbf{R}_0$ will be finally used to account for the randomness in A coming from the underlying Erdős-Rényi graph. Likewise, we introduce the steady-state one-lag covariance matrix:
$$R_1 = \lim_{n \to \infty} \mathbb{E}\big[y_n y_{n-1}^\top\big], \tag{30}$$
which, exploiting the dynamics in (3), can be written as:
$$R_1 = A R_0. \tag{31}$$
From (31) we obtain the following well-known relationship:
$$A = R_1 R_0^{-1}, \tag{32}$$
a quantity that is also referred to as the best one-step predictor or Granger estimator [52], [61]. Under the partial observability setting, it is tempting to adapt the structure in (32) by considering only the observable subnet S [19], [20]:
$$\widehat{A}_S^{(\mathrm{Gra})} = [R_1]_S \big([R_0]_S\big)^{-1},$$
which obviously does not allow us to incorporate well the contribution of latent nodes. In [19] it is shown that such limiting matrix estimator admits the following representation:
$$\widehat{A}_S^{(\mathrm{Gra})} = A_S + E^{(\mathrm{Gra})}, \tag{34}$$
where the error matrix $E^{(\mathrm{Gra})}$ is expressed in (35) through an auxiliary matrix H, defined in (36) in terms of the submatrices of A associated with the unobserved subset (we recall that S′ denotes the subset of unobserved nodes). For later use, it is also useful to rewrite (35) on an entrywise basis, for all i, j ∈ S, as done in (37), where the symbol $a^{(k)}_{ij}$ denotes the (i, j) entry of the k-th matrix power $A^k$.
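For concreteness, the closed forms for $R_0$ and $R_1$ and the subnetwork estimator $[R_1]_S([R_0]_S)^{-1}$ can be computed as follows. This is an illustrative Python sketch under our model assumptions; the function name is ours.

```python
import numpy as np

def limiting_granger(A, S, sigma2=1.0):
    """Limiting Granger estimator over the probed subset S.

    A: symmetric combination matrix with spectral radius < 1 (stability).
    Computes R0 = sigma^2 (I - A^2)^{-1} and R1 = A R0, i.e., the
    steady-state covariances of model (3), and then returns the
    subnetwork estimator [R1]_S ([R0]_S)^{-1}.
    """
    N = A.shape[0]
    R0 = sigma2 * np.linalg.inv(np.eye(N) - A @ A)
    R1 = A @ R0
    idx = np.ix_(S, S)
    return R1[idx] @ np.linalg.inv(R0[idx])
```

Under full observability (S equal to the whole node set) the estimator reduces to (32) and returns A exactly; under partial observability, the difference from the true submatrix $A_S$ is the latent-node error matrix of (34).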

VII. MAIN RESULT
In this section, we characterize the consistency and sample complexity of the Granger estimator. We start by introducing the assumptions on the class of random graphs and combination matrices used to prove the results.

A. ASSUMPTIONS ON THE GRAPH
In this article we address the useful case where the network graph is generated according to the Erdős-Rényi random graph model, namely, an undirected graph whose edges are drawn independently of one another through a sequence of Bernoulli experiments with identical probability of success (i.e., of connection) [88], [89]. In particular, the notation G(N, p_N) will represent an Erdős-Rényi graph over N nodes with connection probability $p_N$. Accordingly, the variables $g_{ij}$, for i = 1, 2, ..., N and j > i, are independent Bernoulli random variables with $P[g_{ij} = 1] = p_N$, and the matrix G is symmetric. As will become clear soon, the explicit dependence of the connection probability upon N is critical to examine the evolution of random graphs in the thermodynamic limit of large N.
As one fundamental graph descriptor, in this work we use the degree of a node. The degree of node i is defined as:
$$d_i = 1 + \sum_{j \neq i} g_{ij}, \tag{38}$$
namely, the cardinality of the i-th node neighborhood (including i itself). In particular, we shall use the minimal and maximal degrees, defined as, respectively:
$$d_{\min} = \min_{i=1,2,\ldots,N} d_i, \qquad d_{\max} = \max_{i=1,2,\ldots,N} d_i. \tag{39}$$
One meaningful (and classic) way to characterize the behavior of random graphs is to examine their thermodynamic limit as the network size goes to infinity. Such an asymptotic characterization is useful because it captures average behavior that emerges with high probability over large networks.
In examining the thermodynamic behavior of random graphs, the connection probability p N is generally allowed to scale with N. This degree of freedom allows representing different types of asymptotic graph behavior. For example, recalling that the average number of neighbors over an Erdős-Rényi graph scales as N p N , different graph evolutions can be obtained with different choices of p N . For example, a constant p N will let the number of neighbors grow linearly with N. In comparison, a p N scaling as (log N )/N would correspond to a number of neighbors growing logarithmically with N. In summary, different limiting regimes are determined by the way the connection probability evolves with N. It is useful for our purposes to list briefly the main regimes that are of interest for the forthcoming treatment.
- Connected regime: In this work we focus on the regime where the graph is connected with high probability. This regime prescribes that the pairwise connection probability scales as [88], [89]:
$$p_N = \frac{\log N + c_N}{N}, \qquad c_N \to \infty. \tag{40}$$
- Sparse (connected) regime: The connected regime can be obtained also when the pairwise connection probability $p_N$ vanishes as N gets large. We shall refer to this scenario, where (40) holds with $p_N \to 0$, as the sparse connected regime.
- Dense regime: We call dense the regime where the pairwise connection probability converges to a nonzero quantity, namely $p_N \to p > 0$.
The aforementioned taxonomy basically focuses on the concepts of connectedness and sparsity. These concepts have been advocated in previous works related to topology inference under partial observability. One essential element of novelty in our analysis is exploiting a different feature, namely, the concentration of graph degrees. We wish to avoid confusion here: the term "concentration" does not refer to the number of node connections. Instead, the concept of concentration is borrowed from a common terminology in statistics, where it refers to statistical quantities that concentrate around some deterministic value as N → ∞ [90]. In particular, we will focus our attention on the uniform concentration properties of the minimal and maximal degrees of random graphs.
- Uniform concentration regime: The uniform concentration regime is enabled by choosing the following pairwise connection probability:
$$p_N = \omega_N \,\frac{\log N}{N}, \qquad \omega_N \to \infty, \tag{42}$$
which is tantamount to assuming that (40) holds true with the sequence $c_N$ growing faster than log N. Under this regime, the minimal and the maximal degrees of the graph both concentrate asymptotically around the expected degree $1 + (N-1)p_N \sim N p_N$, in the following precise sense:
$$\frac{d_{\min}}{N p_N} \xrightarrow{\ \mathrm{p}\ } 1, \qquad \frac{d_{\max}}{N p_N} \xrightarrow{\ \mathrm{p}\ } 1. \tag{43}$$
The physical meaning of (43) is that both the minimal and the maximal degrees scale, asymptotically with N, as the expected degree. Indeed, (43) can be restated as $d_{\min} \sim N p_N + \underline{g}_N$ and $d_{\max} \sim N p_N + \overline{g}_N$, where $\underline{g}_N$ and $\overline{g}_N$ are sequences that are asymptotically dominated by $N p_N$. Table 1 summarizes the sparsity/concentration taxonomy arising from the previous arguments. We are now ready to extract from the above taxonomy the elements that are relevant to the forthcoming treatment.
1) Comparing (42) against (40), we see that the regime of concentration does not include all classes of connected Erdős-Rényi graphs. In fact, while in (40) $c_N$ is an arbitrary divergent sequence (e.g., we can have $c_N = \log\log N$), according to (42) the sequence $c_N$ should grow with N faster than logarithmically. The regime where the graph is connected but (42) is not fulfilled will be referred to as the very sparse regime.
2) According to (42), the regime of concentration can be either sparse or dense. In particular, the regime is dense when p > 0, and sparse when p = 0.
The aforementioned categorizations are illustrated in Fig. 4 by means of a Venn diagram.
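A quick simulation illustrates the degree concentration in (43). This is an illustrative Python sketch; the choice $\omega_N = \log\log N$ and the graph sizes are ours.

```python
import numpy as np

def er_degrees(N, pN, rng):
    """Degrees of an Erdos-Renyi graph G(N, pN), with each node counting
    itself as a neighbor, in accordance with (38)."""
    U = rng.random((N, N)) < pN
    G = np.triu(U, 1)
    G = G | G.T
    return G.sum(axis=1) + 1

rng = np.random.default_rng(0)
for N in (300, 3000):
    pN = np.log(np.log(N)) * np.log(N) / N   # uniform concentration regime (42)
    d = er_degrees(N, pN, rng)
    # d_min/(N pN) and d_max/(N pN) both drift toward 1 as N grows
    print(N, round(d.min() / (N * pN), 2), round(d.max() / (N * pN), 2))
```

The convergence in (43) is slow, so for moderate N the two ratios only loosely bracket 1; the point of the sketch is that the bracket tightens as N increases.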

B. ASSUMPTIONS ON THE COMBINATION MATRIX
The forthcoming theorems will be proved under the assumption that the combination matrix belongs to the following class.
Assumption 1 (Regular diffusion matrices): The combination matrix A is symmetric and satisfies the conditions:
$$\sum_{\ell=1}^{N} a_{i\ell} = \rho \ \text{ for all } i, \qquad \frac{\kappa}{d_{\max}} \leq a_{ij} \leq \frac{\rho}{d_{\min}} \ \text{ for all connected } i \neq j, \tag{44}$$
for some parameters ρ and κ, with 0 < κ ≤ ρ < 1.
We remark that the most common combination matrices encountered in the literature automatically satisfy Assumption 1. Some popular choices are the Laplacian and the Metropolis rules reported below, which arise naturally in many applications; for instance, they are one fundamental ingredient of adaptive networks [35]. The matrix entries corresponding to these combination rules are defined as follows, for connected i ≠ j, with 0 < ρ < 1 and 0 < λ ≤ 1:
$$a_{ij} = \rho\,\frac{\lambda}{d_{\max}} \ \text{ (Laplacian)}, \qquad a_{ij} = \frac{\rho}{\max(d_i, d_j)} \ \text{ (Metropolis)},$$
whereas the self-weights are determined by the leftmost condition in (44), yielding $a_{ii} = \rho - \sum_{\ell \neq i} a_{i\ell}$. It is immediate to verify that the Laplacian rule yields a regular diffusion matrix with κ = ρλ, whereas the Metropolis rule yields a regular diffusion matrix with κ = ρ.
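The two rules can be sketched as follows. This is illustrative Python; the function names are ours, and G is taken to be an adjacency matrix with unit diagonal so that each node counts itself as a neighbor, consistently with the degree convention above.

```python
import numpy as np

def laplacian_rule(G, rho, lam):
    """Laplacian rule: a_ij = rho * lam / d_max for connected i != j,
    with self-weights a_ii = rho - sum_{l != i} a_il (rows sum to rho)."""
    G = np.asarray(G)
    d = G.sum(axis=1)                         # degrees (node itself included)
    off = G - np.eye(len(G), dtype=G.dtype)   # off-diagonal adjacency
    A = (rho * lam / d.max()) * off
    return A + np.diag(rho - A.sum(axis=1))

def metropolis_rule(G, rho):
    """Metropolis rule: a_ij = rho / max(d_i, d_j) for connected i != j."""
    G = np.asarray(G)
    d = G.sum(axis=1)
    off = G - np.eye(len(G), dtype=G.dtype)
    A = rho * off / np.maximum.outer(d, d)
    return A + np.diag(rho - A.sum(axis=1))
```

Both outputs are symmetric with rows summing to ρ and nonnegative entries, i.e., regular diffusion matrices in the sense of Assumption 1.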

C. GRANGER ESTIMATOR: UNIVERSAL LOCAL STRUCTURAL CONSISTENCY
The next theorem establishes the fundamental consistency properties of the Granger estimator presented in Section VI.
Theorem 2 (Universal local structural consistency of the Granger estimator): Let A be a regular diffusion matrix with parameters ρ and κ, with the network graph drawn according to an Erdős-Rényi random graph model G(N, p_N) where the fraction of observable nodes, |S|/N, converges to some nonzero value ξ. Then, under the uniform concentration regime in (42), the Granger estimator achieves universal local structural consistency as detailed in Definition 3, with scaling sequence $s_N = N p_N$, and with bias η and identifiability gap Γ given by (48); in particular, the gap coincides with the constant κ in Assumption 1.
Proof: The proof of the theorem is provided in Appendix D, and relies on a number of auxiliary lemmas and theorems reported in the appendices. The core of the proof is the following. First, Theorem 6 in Appendix C constructs uniform (w.r.t. N) bounds on the entries of the matrix H in (36). These bounds are useful to characterize the error associated to the Granger estimator. Then, exploiting the asymptotic concentration property of the maximal and minimal degrees, it is possible to prove the convergence of the matrix series relevant for the computation of the errors in (37). These convergence properties are finally used to compute the bias and the identifiability gap in (48).
The main message conveyed by Theorem 2 is illustrated in Fig. 5, where we depict: i) the entries of the true combination matrix, vectorized and ordered as shown in the figure, and magnified by $N p_N$; and ii) the entries of the limiting estimated combination matrix, magnified by $N p_N$, vectorized and ordered with the same ordering used for the true combination matrix. The essential features illustrated in Section V are clearly visible in Fig. 5. Comparing the true and estimated matrices, we can appreciate the emergence of the bias and of the gap. Both phenomena are predicted by Theorem 2, as we can see from the theoretical values η and Γ, computed by using (48) and displayed in Fig. 5 with broken lines. We observe how the (scaled) estimated matrix entries corresponding to unconnected node pairs are clustered around the theoretical value η, whereas the entries corresponding to connected pairs are clustered around the theoretical value η + Γ. It is also seen how the bias does not affect separability between the groups of connected and unconnected node pairs. To gain insight into this behavior, we start by examining the useful representation of the limiting matrix estimator in (34). From this representation, we see that the existence of an identifiability gap in the limiting matrix estimator $\widehat{A}_S^{(\mathrm{Gra})}$ depends on the true matrix $A_S$, but will depend strongly also on the error matrix $E^{(\mathrm{Gra})}$. Since each entry in $E^{(\mathrm{Gra})}$ is a function of the entries in A (in general, also of the latent nodes belonging to the unobserved subset S′), a key point is to understand how the $a_{ij}$'s combine with each other to produce $E^{(\mathrm{Gra})}$. As observed before, the $a_{ij}$'s exhibit concentration (in their nonzero values). On the other hand, they also exhibit randomness (in the location of the nonzero entries, due to the random graph model). The attributes of concentration and randomness are critical to reveal the nontrivial result shown in Theorem 2.
It will be seen that the $a_{ij}$'s combine with each other so as to induce a concentration in $E^{(\mathrm{Gra})}$, which in turn determines the emergence of an identifiability gap in $\widehat{A}_S^{(\mathrm{Gra})}$. The overall influence of latent nodes is thus quantified by an error matrix whose entries converge to some deterministic quantity, equally for both connected and unconnected node pairs. In this way, the connections among the probed nodes stick out consistently from the error floor. In summary, in the limit of large networks the Granger estimator equals the ground-truth matrix plus a uniform shift of its entries and, hence, the network structure is preserved.
We now illustrate the relevance of Theorem 2 by means of numerical experiments. In Fig. 6, we display the probability of correct graph learning for the limiting Granger estimator, with reference to the dense case (with constant connection probability $p_N = p = 0.1$ for all N) for different choices of combination policies and relative parameters. In Fig. 7, we consider instead one example of uniform-and-sparse regime, with $\omega_N = \log\log N$, namely, with connection probability decreasing with N as:
$$p_N = \frac{\log N \cdot \log\log N}{N}. \tag{49}$$
In both plots, we see that the simulations match well the theoretical predictions obtained from Theorem 2, since the probability of correct graph learning approaches 1 as the network size grows.

D. GRANGER ESTIMATOR: SAMPLE COMPLEXITY
In light of Theorem 1, the universal local structural consistency of the limiting estimators $\widehat{A}_S$ implies consistency of the matrix estimators $\widehat{A}_{S,n} = f_n(Y_n)$ (i.e., of the real estimators based on the measured samples) as the sample size n grows with the network size N according to some law n(N). This section is devoted to establishing which law n(N) is sufficient to achieve consistent learning for the Granger estimator. First of all, it is necessary to introduce the following estimator:
$$\widehat{A}^{(\mathrm{Gra})}_{S,n} = [\widehat{R}_{1,n}]_S \big([\widehat{R}_{0,n}]_S\big)^{-1}, \tag{50}$$
which is nothing but the finite-sample version of (32) obtained by replacing the true covariance matrices with their empirical counterparts. The estimator in (50) is assumed to be unspecified when $[\widehat{R}_{0,n}]_S$ is singular (which must happen when n < N). In order to bypass the instability related to matrix inversion, we introduce in (51) also a regularized version of the Granger estimator, whose rows minimize the empirical one-step prediction residual over row vectors x with $\ell_1$-norm at most 1.
The $\ell_1$-constraint on the admissible row vectors x in (51) arises from the fact that the rows of our target estimator, the limiting Granger estimator, have $\ell_1$-norm bounded by 1; see (315) in Appendix I.
It is also useful to observe that, in view of (50), when the sample covariance matrix is invertible, the non-regularized Granger estimator is the only matrix that yields a zero residual in (51). As a result, whenever $\|\widehat{A}^{(\mathrm{Gra})}_{S,n}\|_\infty \leq 1$, the plain and regularized Granger estimators coincide. Now, assuming that the empirical covariance matrices converge to the true covariance matrices as n → ∞, it is easily seen that both the plain and regularized Granger estimators converge to their limiting counterparts and, hence, guarantee condition (11). Since in Theorem 2 we established that the limiting Granger estimator achieves universal local structural consistency, in view of Theorem 1 this property implies that estimators (50) and (51) are able to learn well the underlying subgraph of probed nodes, provided that the number of samples n grows as the network size N diverges. The goal of a sample complexity analysis is to establish how this number of samples should scale with N in order to grant correct graph learning. The forthcoming theorem provides an answer for the class of models (3) with Gaussian input source.
Theorem 3 (Sample complexity of the Granger estimator): Assume that model (3) holds with i.i.d. standard Gaussian source data $\{x_n\}$, and with initial state $y_0$ distributed (conditionally on A) according to the stationary distribution of the VAR process in (3). Let A be a regular diffusion matrix with parameters ρ and κ, with the network graph drawn according to an Erdős-Rényi random graph model G(N, p_N) where the fraction of observable nodes, |S|/N, converges to some nonzero value ξ. Let $\widehat{A}_{S,n}$ be the regularized Granger estimator in (51). Then, under the uniform concentration regime where:
$$p_N = \omega_N \,\frac{\log N}{N}, \qquad \omega_N \to \infty, \tag{52}$$
we have that:
$$\lim_{N \to \infty} P\big[\widehat{G}_S = G_S\big] = 1, \tag{53}$$
provided that the number of samples is on the order of (see footnote 9):
$$n(N) = \Theta\big((N p_N)^2 \log |S|\big). \tag{54}$$
Moreover, in the dense regime where p > 0, the same result holds for the non-regularized Granger estimator in (50).
Proof: See Appendix I.
We see from (54) that under the dense regime the growth is essentially quadratic in N, since $p_N$ converges to some positive constant p. In order to see what happens under the sparse regime, it is useful to apply (52) and rewrite the sample size in (54) as (recalling that |S| grows linearly with N in view of (5)):
$$n(N) = \Theta\big(\omega_N^2 (\log N)^2 \log |S|\big), \tag{55}$$
revealing that the specific sample complexity under the sparse regime depends on the specific speed of growth of the sequence $\omega_N$, which in fact regulates the sparsity of the problem. The scaling law found in (54) is significant since it matches well with the scaling laws that have been found in the literature in relation to other graphical models.
For linear dynamical systems like (3) operating under full observability, sample complexity has been examined for a fixed error in estimating the combination matrix and/or the covariance matrices. It has been shown that the sample complexity grows logarithmically with N, but proportionally to the inverse square of the error [29], [30], [31], [32]. In our case, such growth would imply the $(N p_N)^2$-scaling, since we need the error to be bounded by $\epsilon/(N p_N)$.
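This back-of-the-envelope reasoning can be written out explicitly (a sketch under the stated scaling assumptions, with generic constants):

```latex
% Max-norm error of the empirical covariances with n samples (dimension |S|):
%   err(n) \approx C \sqrt{\log|S| / n}.
% Classifiability requires err(n) \le \epsilon/(N p_N), since the smallest
% nonzero combination-matrix entries scale as 1/(N p_N):
\sqrt{\frac{\log|S|}{n}} \;\lesssim\; \frac{\epsilon}{N p_N}
\quad\Longrightarrow\quad
n \;\gtrsim\; \frac{(N p_N)^2 \log|S|}{\epsilon^2},
```

which matches the scaling law in (54) up to constants.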
Under the regime of partial observability, an algorithm is proposed in [92] whose sample complexity scales as log N when the node degree is kept fixed. When the degree grows with N (as in our setting), the sample complexity in [92] contains an additional $(N p_N)^3$ factor. However, the authors of [92] suspect that the exponent could be reduced to 2, which would then match our result. In addition, to grant graph recovery, [92] assumes a lower bound on the minimum combination-matrix entry, but this bound is not allowed to scale with N.
Remark 6 (Factors affecting sample complexity): The sample complexity found in Theorem 3 is primarily determined by the error in estimating the empirical covariance matrices, which is examined in Lemma 9. The limitations of the empirical covariance matrices in high-dimensional settings are well known. For example, the covariance matrix is singular when n < N. Even for larger n, the number of parameters to be estimated (i.e., the matrix entries) grows quadratically, thus requiring a high number of samples to get good performance. On the other hand, when focusing on the $\|\cdot\|_{\max}$ error norm, it is known that a maximum error up to an ε can be obtained when the number of samples grows logarithmically with the dimension [29], [30], [31], [32]. This effect is summarized by the log |S| factor appearing in (54). However, this is not the main factor determining the sample complexity. The main factor is instead the $N p_N$ term. This term arises because, in order to guarantee that the matrix entries are well classifiable, we need to guarantee a maximum error up to $\epsilon/(N p_N)$, which is related to the scaling law of the smallest combination-matrix entry.

Footnote 9: With the Θ(·) notation we mean that there exist a constant C > 0 and a value $N_0 > 0$ such that, for all $N \geq N_0$, any sample complexity scaling as $n(N) \geq C (N p_N)^2 \log |S|$ guarantees consistent learning.
We now complement the asymptotic result in Theorem 3 with an empirical evaluation of the sample complexity. This is a cumbersome task from a computational viewpoint, which is however critical to establish the practical relevance of Theorem 3. In Fig. 8, we consider different values of the network size N. For every N, we evaluate numerically (circles) the number of samples necessary to reach 90% of the performance attained by the limiting estimator for the same value of N. We display with broken lines the theoretical curves, which are proportional to $(N p_N)^2 \log |S|$, according to the theoretical scaling law in (54). Under both the dense and sparse regimes, the match with the theoretical curves is excellent. Under the dense regime (green) the scaling law of the sample complexity is almost quadratic in N. In comparison, under the sparse uniform concentration regime (blue) the sample complexity is mitigated, and scales as:
$$n(N) = \Theta\big((\log N)^3 (\log\log N)^2\big), \tag{56}$$
since in the considered example $p_N$ was chosen as in (49).
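As a complementary illustration, the plug-in estimator (50) can be simulated at small scale. This is an illustrative Python sketch; the network size, sample size, and Metropolis-type weights are our choices, far from the asymptotic regimes of Theorem 3.

```python
import numpy as np

def empirical_granger(Y, S):
    """Plug-in Granger estimator (50): empirical covariances over subset S.

    Y: n x N array whose rows are successive observations of model (3).
    """
    Z = Y[:, S]
    R0 = Z[1:].T @ Z[1:] / (len(Z) - 1)    # empirical [R_0]_S
    R1 = Z[1:].T @ Z[:-1] / (len(Z) - 1)   # empirical one-lag [R_1]_S
    return R1 @ np.linalg.inv(R0)

rng = np.random.default_rng(1)
N, n, rho = 20, 50000, 0.8

# Erdos-Renyi graph and a Metropolis-type combination matrix (Assumption 1)
U = rng.random((N, N)) < 0.25
G = np.triu(U, 1); G = G | G.T
d = G.sum(axis=1) + 1
A = np.where(G, rho / np.maximum.outer(d, d), 0.0)
A += np.diag(rho - A.sum(axis=1))

# simulate the VAR dynamics (3) driven by standard Gaussian input
Y = np.zeros((n, N))
for k in range(1, n):
    Y[k] = A @ Y[k - 1] + rng.standard_normal(N)

S = list(range(12))                        # probe 12 of the 20 nodes
Ahat = empirical_granger(Y, S)

# separate estimated entries over connected and unconnected probed pairs
mask = np.triu(np.ones((12, 12), dtype=bool), 1)
conn = Ahat[mask & G[:12, :12]]
unc = Ahat[mask & ~G[:12, :12]]
```

The entries over connected pairs should concentrate above those over unconnected pairs, so a clustering step such as Algorithm 1 can then recover the probed subgraph.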

VIII. OTHER ESTIMATORS
So far we have shown that, over uniformly-concentrated Erdős-Rényi graphs, the Granger estimator operating under partial observability achieves consistent graph learning. The proof has been carried out assuming symmetric combination matrices, albeit the Granger estimator construction does not require symmetry. Symmetry has been assumed for technical reasons, since it leads to a closed-form expression for the covariance matrix that helps in the proof.
In this section, we present three other estimators that provably achieve universal local structural consistency under Assumption 1. However, differently from the Granger estimator, the construction itself of these estimators relies on symmetry. For this reason, we will show later that these estimators are sensitive to the symmetry assumption and their performance deteriorates over asymmetric combination matrices, whereas the performance of the Granger estimator does not.

A. ONE-THREE-LAGS ESTIMATOR
We have already observed that (3) can be exploited to obtain the formula $R_1 = A R_0$. Actually, the formula can be generalized to covariance matrices corresponding to any lag, namely, for m = 1, 2, ...:
$$R_m = A^m R_0. \tag{57}$$
Considering now the series expansion in (28), from (57) we get:
$$R_1 - R_3 = \sigma^2 (A - A^3)(I - A^2)^{-1} = \sigma^2 A, \tag{58}$$
which motivates the one-three-lags limiting estimator proposed in [93]:
$$\widehat{A}_S = [R_1]_S - [R_3]_S = \sigma^2 A_S,$$
revealing that in the symmetric case, the desired submatrix over the probed subset S is equal, up to a scaling factor, to the difference between the one-lag and the three-lags covariance matrices.
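The one-three-lags identity $R_1 - R_3 = \sigma^2 A$ is easy to verify numerically (an illustrative Python sketch with an arbitrary regular diffusion matrix of our choosing):

```python
import numpy as np

# a symmetric combination matrix with spectral radius < 1 (rows sum to 0.7)
A = np.array([[0.5, 0.2, 0.0],
              [0.2, 0.4, 0.1],
              [0.0, 0.1, 0.6]])
sigma2, I = 2.0, np.eye(3)

R0 = sigma2 * np.linalg.inv(I - A @ A)        # limiting covariance (28)
R1 = A @ R0                                    # one-lag covariance
R3 = np.linalg.matrix_power(A, 3) @ R0         # three-lags covariance, cf. (57)

# R1 - R3 = sigma^2 (A - A^3)(I - A^2)^{-1} = sigma^2 A
assert np.allclose(R1 - R3, sigma2 * A)
```

The identity cancels the latent-path series exactly, which is why the limiting one-three-lags estimator coincides with the true submatrix up to the factor σ².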

B. LIMITING ONE-LAG ESTIMATOR
In this section, we use the one-lag covariance matrix to estimate the combination matrix. The reason behind such a choice is the following series expansion of the one-lag covariance matrix:
$$R_1 = \sigma^2 A (I - A^2)^{-1} = \sigma^2 \sum_{k=0}^{\infty} A^{2k+1}. \tag{60}$$
When applied only to the submatrix corresponding to S, (60) yields:
$$[R_1]_S = \sigma^2 A_S + \sigma^2 \sum_{k=1}^{\infty} \big[A^{2k+1}\big]_S. \tag{61}$$
It is convenient to rewrite (61) as:
$$[R_1]_S = \sigma^2 \big(A_S + E^{(1)}\big), \tag{62}$$
where:
$$E^{(1)} = \sum_{k=1}^{\infty} \big[A^{2k+1}\big]_S, \tag{63}$$
or, for all i, j ∈ S:
$$[R_1]_{ij} = \sigma^2 \Big(a_{ij} + \sum_{k=1}^{\infty} a^{(2k+1)}_{ij}\Big). \tag{64}$$

Footnote 10: When σ² is known, the relationship between $R_1 = \sigma^2 A (I - A^2)^{-1}$ and A is invertible under the full observability regime. This can be easily shown by resorting to the spectral decomposition of $R_1/\sigma^2$ and A (which share the same eigenvectors), and by finding the eigenvalues of A from those of $R_1/\sigma^2$; this inversion operation can be successfully realized because A is symmetric and has spectral radius less than one. However, we remark that σ² is assumed unknown and, more importantly, that we work under a partial observability regime; hence, the aforementioned inversion procedure does not apply.

C. LIMITING RESIDUAL ESTIMATOR
Let us introduce the residual vector that computes the (scaled) difference between consecutive time samples, defined in (65), whose steady-state correlation is evaluated in (66). Accordingly, it makes sense to introduce the limiting residual estimator in (67). The structure of (67) motivates the introduction of the matrix in (68), and Equation (68) implies an entrywise representation holding for all i, j ∈ S with i ≠ j. We are now ready to state the theorem that characterizes the consistency of the aforementioned three estimators.
Theorem 4 (Universal local structural consistency of the estimators in Section VIII): Let A be a regular diffusion matrix with parameters ρ and κ, with the network graph drawn according to an Erdős-Rényi random graph model G(N, p_N) where the fraction of observable nodes, |S|/N, converges to some nonzero value ξ. Then, under the uniform concentration regime where:
$$p_N = \omega_N \,\frac{\log N}{N}, \qquad \omega_N \to \infty,$$
the one-three-lags, one-lag, and residual estimators achieve universal local structural consistency as detailed in Definition 3, with scaling sequence $s_N = N p_N$, and with the biases and identifiability gaps listed in Table 2.
Proof: See Appendix E. It is instructive to compare the different estimators reported in Table 2. First of all, we notice that the proper scaling sequence for all estimators is s N = N p N . Consider now the identifiability gap. The gap of the Granger estimator is equal to the bounding constant κ that characterizes the regular diffusion matrix (44). The other estimators exhibit two main differences with respect to the Granger estimator. First, they depend also on the variance of the input process, σ 2 . This behavior should be expected, since in the Granger estimator the one-lag covariance matrix multiplies the inverse of the covariance matrix, and, hence, the effect of σ 2 disappears. In contrast, the other estimators do not cancel this effect out. Second, when κ = ρ (a condition that occurs, for instance, for the Laplacian rule), the term σ 2 κ multiplies a factor that is a function of ρ − κ. This factor is greater than one for the one-lag estimator, whereas it is smaller than one for the residual estimator. Note that the dependence alone upon σ 2 , as well as a magnified/reduced gap, do not imply any conclusion about the structure-learning performance of the pertinent estimators for finite network and/or sample sizes. What plays a role in this case is the spread of the matrix entries around their  limiting values, and the identifiability gap does not contain information about such spread.
Let us switch to the analysis of the bias. First, we notice that η = 0 for the one-three-lags estimator,¹¹ which is obvious since (the limiting version of) this estimator is equal to the true matrix scaled by σ². The biases of the other estimators are proportional to the limiting connection probability p and, hence, we always have η = 0 in the sparse case, where p = 0. Furthermore, we observe that only the bias of the Granger estimator depends upon the fraction of monitored nodes ξ. This finding makes sense, since the Granger estimator is based upon inversion of a partial matrix (which clearly varies with the number of latent variables), whereas the other estimators are natively determined by pairwise correlations. By contrast, only the bias of the Granger estimator does not depend upon σ², a behavior that is easily grasped in light of the explanation in the previous paragraph.

¹¹ In order to avoid misunderstandings, we point out that in our treatment we use the term "bias" to quantify the distance from zero of the estimated matrix entries corresponding to unconnected pairs. In this respect, the one-three-lags estimator is unbiased. However, it is not an unbiased estimator of the combination matrix A_S, due to the presence of the scaling factor.
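To make these quantities concrete, the following sketch (ours, not the authors' code; the uniform-weight combination rule and all parameter values are illustrative assumptions) builds a regular-diffusion-style matrix on an Erdős-Rényi graph, forms the limiting Granger estimator as the one-lag covariance times the inverse of the lag-zero covariance restricted to the probed set, and compares the scaled entries over connected and unconnected pairs:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, rho = 200, 0.1, 0.9          # network size, ER edge probability, stability parameter
S = np.arange(100)                  # probed subset (fraction xi = 0.5)

# Symmetric Erdos-Renyi adjacency (no self-edges in G; the diagonal of A is set below)
G = np.triu(rng.random((N, N)) < p, 1)
G = G | G.T

# Assumed combination matrix: uniform off-diagonal weight kappa, diagonal chosen
# so that every row sums to rho (hence spectral radius <= rho < 1)
deg = G.sum(axis=1)
kappa = rho / (2 * deg.max())       # keeps the diagonal entries positive
A = kappa * G.astype(float)
np.fill_diagonal(A, rho - kappa * deg)

# Limiting covariances of the VAR y_n = A y_{n-1} + x_n with unit-variance input:
# R0 solves R0 = A R0 A^T + I; for symmetric A this is (I - A^2)^{-1}, and R1 = A R0.
R0 = np.linalg.inv(np.eye(N) - A @ A)
R1 = A @ R0

# Limiting Granger estimator on the probed subset
A_hat = R1[np.ix_(S, S)] @ np.linalg.inv(R0[np.ix_(S, S)])

# Scaled off-diagonal entries, split by connected / unconnected pairs
scaled = N * p * A_hat
mask = ~np.eye(len(S), dtype=bool)
conn = scaled[mask & G[np.ix_(S, S)]]
unconn = scaled[mask & ~G[np.ix_(S, S)]]
print(conn.mean(), unconn.mean())
```

On typical draws the connected-pair entries sit clearly above the unconnected ones; that margin is exactly what the identifiability gap formalizes, and replacing the exact covariances with sample averages recovers the finite-sample setting of the paper.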
We complement our theoretical findings with some numerical experiments. First, we test the limiting estimators over one of the settings adopted in Fig. 6. The results are reported in Fig. 9. We see that, in agreement with Theorem 4, all the limiting estimators have probability of correct graph recovery converging to 1 as N grows. In particular, the one-three-lags estimator (which, as we know, coincides with the true matrix) has the same performance as the residual estimator. These estimators perform better than the one-lag estimator, which is in turn better than the Granger estimator. This ordering is relative to this particular example, and we have observed different behaviors in other experiments, depending on factors such as the degree of observability or the type of combination matrix. However, there is a more important comparison to be made between the Granger estimator and the other estimators, which pertains to the case of directed graphs.
To this end, we generate a directed topology by using a binomial graph, namely, a random graph where all directed edges are drawn as i.i.d. realizations of Bernoulli variables, with success corresponding to edge presence. We remark that, over a binomial graph, the event of a directed edge i → j is independent of the event of a directed edge j → i. Then, we consider the following uniform averaging combination policy, with 0 < λ ≤ 1: and a_ii = ρ − Σ_{ℓ≠i} a_iℓ. The performance of the different estimators is reported in Fig. 10. We see that all but the Granger estimator fail to learn the true graph. This happens also for the one-three-lags estimator, which was included because it reproduces (up to the scaling factor σ²) the true combination matrix. However, it must be remarked that this property relies exclusively on the symmetry assumption, and is lost in the asymmetric case. The other two estimators are also sensitive to symmetry. More information is gained by examining the matrix-entry realizations shown in Fig. 11, where we see clearly that, in this example with a directed graph, the identifiability gap is preserved by the Granger estimator, but is lost by the other proposed estimators, a behavior that clearly explains the performance observed in Fig. 10. In particular, the estimator that seems to suffer most from the lack of symmetry is the one-three-lags estimator, with a kind of reversed pattern exhibited in the second plot of Fig. 11. This behavior confirms a classical mismatched-model trade-off: the more a method is matched to a nominal scenario (here, the symmetric scenario), the more sensitive it is in a mismatched scenario.
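For readers who want to reproduce the directed setting, a minimal sketch of the binomial-graph construction follows (our illustration; the specific off-diagonal weighting is an assumed stand-in for the uniform averaging policy, whose display is only partially given above). Each ordered pair is an independent Bernoulli draw, so i → j and j → i occur independently and the resulting combination matrix is asymmetric in general, while each row still sums to ρ:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, rho, lam = 150, 0.1, 0.9, 1.0   # lam in (0, 1] as in the text

# Binomial (directed) graph: every ordered pair (i, j), i != j, is an i.i.d. Bernoulli draw
G = rng.random((N, N)) < p
np.fill_diagonal(G, False)

# Assumed uniform weights over the neighbors feeding node i, then a_ii = rho - sum_{l != i} a_il
deg_in = np.maximum(G.sum(axis=1), 1)           # neighbor counts per row (assumed orientation)
A = lam * rho * G / (2 * deg_in[:, None])       # uniform weight per existing edge (assumption)
np.fill_diagonal(A, rho - A.sum(axis=1))

asym = np.abs(A - A.T).max()
print("max |A - A^T| =", asym)                  # strictly positive: A is not symmetric
```

Feeding such an asymmetric A to the simulation of the previous sketch reproduces the qualitative effect discussed here: the correlation-based estimators inherit the symmetry of the covariance matrices, while the Granger construction does not.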

IX. CONCLUDING REMARKS AND OPEN ISSUES
This work examined the problem of graph learning when data can be collected only from a limited subset of nodes, the goal being to learn the topology of the subgraph of probed nodes. We showed that an estimator of the combination matrix known as the Granger estimator, followed by a universal clustering algorithm, is able to achieve faithful graph learning, requiring a number of samples that grows almost quadratically with the expected node degree. We explored various regimes of connectivity, including the often overlooked regime of dense connectivity. In fact, several works in the graph-learning literature assume sparsity in the graph of connections. The role of sparsity in graph learning can be (at least) twofold. On the one hand, sparsity can be leveraged to reduce the complexity associated with the estimators (we have seen this effect also in our analysis of the sample complexity). On the other hand, in the presence of latent variables, sparsity reduces the (unseen) effect of the latent unobserved nodes on the probed nodes [20], [63]. One revealing conclusion stemming from our analysis is that, under the setting considered in this work (Erdős-Rényi graphs and regular diffusion matrices), an important role is played by another structural property of the graph, namely, the statistical concentration of node degrees. Finally, we proposed three structurally consistent estimators and compared them against the Granger estimator, obtaining nontrivial insights, especially over directed graphs.
There are several open issues that may deserve attention. This work focused on the Erdős-Rényi model, and used certain regularity assumptions on the diffusion matrix. A useful extension would be to examine structural consistency for other graph models, and/or under different regularity assumptions. For example, one might have network heterogeneity (e.g., different connectivity across nodes) and/or dependence in the edge formation process, and an interesting question is whether or not consistency can be achieved under these conditions. Preliminary results along this direction are available in [94].
Another open issue concerns directed graphs, which are relevant, e.g., in the context of causal inference. In this work we have examined this issue through numerical experiments, showing that the Granger estimator seems to preserve its learning capability over directed graphs, while the other proposed estimators are much more sensitive to asymmetries in the graph construction. From the theoretical side, the graph-edge-domain approach developed in this work could be exploited and generalized to get insights about the directed graph setting.

APPENDIX A USEFUL PROPERTIES OF MAXIMAL AND MINIMAL DEGREES
We denote by B_i(N, q), with i = 1, 2, . . . , K, a sequence of K binomial random variables (not necessarily independent) with success probability q over N independent Bernoulli trials. Moreover, we denote by B_max(N, q, K) and B_min(N, q, K) the maximum and the minimum over this sequence, respectively. The following two relationships are standard inequalities arising from the application of the Chernoff bounding technique, and will be the fundamental building blocks to characterize the asymptotic behavior of several random quantities arising in our problem. The inequalities are as follows.¹² For any t > 0:

P(B_max(N, q, K) ≥ a) ≤ K exp(−ta + Nq(e^t − 1)),   (75)

P(B_min(N, q, K) ≤ a) ≤ K exp(ta + Nq(e^{−t} − 1)).   (76)

We now apply these fundamental bounds to some specific random variables that are of interest in our setting. We start by characterizing the behavior of the variables B_max(N, p_N, N) and B_min(N, p_N, N), (77), which, as we will see, are useful to characterize the behavior of the maximal and minimal degree of the graphs that we use in this work. The forthcoming lemma contains fundamental (classic) results about the asymptotic behavior of B_max(N, p_N, N) and B_min(N, p_N, N) under the different regimes for the probability p_N.

Lemma 1 (Asymptotic scaling of B_max(N, p_N, N) and B_min(N, p_N, N)): Let the probability p_N scale with N according to (42). Then:

Proof: The following inequality, holding for all ε > 0, is easily obtained from (75). Using now (42) in (79) we get: which follows because g_ε > 0 for all ε > 0, since g_0 = 0 and dg_ε/dε > 0 for all ε > 0.

¹² For any t > 0, we can write: P(B_max(N, q, K) ≥ a) ≤ Σ_{i=1}^{K} P(B_i(N, q) ≥ a) ≤ K e^{−ta} E[e^{t B_1(N, q)}],  (73) where the latter inequality is an application of Markov's inequality. Since a binomial variable of parameters N and q is the sum of N independent Bernoulli variables with success probability equal to q, we can further write: E[e^{t B_1(N, q)}] = (1 + q(e^t − 1))^N ≤ e^{Nq(e^t − 1)},  (74) where the latter inequality follows by observing that, for z > 0, one has (1 + z)^N ≤ e^{Nz}. Combining (74) with (73) yields (75). (76) is worked out with a similar technique.
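The bound above is easy to sanity-check numerically. The sketch below (ours; it takes as an assumption the standard Chernoff/union form P(B_max(N, q, K) ≥ a) ≤ K·exp(−ta + Nq(e^t − 1)), since the original display is only partially recoverable) compares the bound, optimized over t on a grid, against K times the exact binomial tail, which is itself an upper bound on the probability of the maximum:

```python
from math import comb, exp

def binom_tail(N, q, a):
    """Exact P(Binomial(N, q) >= a)."""
    return sum(comb(N, k) * q**k * (1 - q)**(N - k) for k in range(a, N + 1))

def chernoff_union(N, q, K, a, t):
    """Union + Chernoff bound on P(max of K binomials >= a), valid for any t > 0."""
    return K * exp(-t * a + N * q * (exp(t) - 1))

N, q, K, a = 100, 0.1, 10, 25
union = K * binom_tail(N, q, a)                       # upper-bounds P(B_max >= a) directly
best = min(chernoff_union(N, q, K, a, i / 100) for i in range(1, 300))
print(union, best)                                    # the Chernoff bound dominates the union term
```

Since each fixed t yields a valid upper bound on the tail, the grid minimum still dominates K times the exact tail, which is what the assertion below checks.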
As a simple corollary to Lemma 1, we can now obtain the characterization of the maximal and minimal degrees.
Corollary 1 (Behavior of d_max and d_min): If the connection probability of the Erdős-Rényi model obeys (42), then we have: (82)
Proof: The degree of a single node is equal to 1 plus (because in our setting the degree counts also the node itself) a binomial random variable with parameters N − 1 and p_N. Therefore, we have the following representation: d_max = 1 + B_max(N − 1, p_N, N), (83) and d_min = 1 + B_min(N − 1, p_N, N). (84) In order to obtain useful bounds involving d_max and d_min, let us introduce a modified sequence of binomial variables, obtained by adding one more (independent) Bernoulli trial to each binomial variable B_i(N − 1, p_N), with i = 1, 2, . . . , N.
The corresponding maximum and minimum taken over the modified sequence will be denoted by B max (N, p N , N ) and B min (N, p N , N ), respectively. Since a Bernoulli variable can be either zero or one, from (83) and (84) we get readily the following bounds: and, hence, the claims of the corollary follow readily from Lemma 1, with the factor 1 playing no role as N → ∞.
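The degree representation just used is easy to verify numerically. The sketch below (illustrative, with assumed parameter values) draws an Erdős-Rényi graph, counts each node's degree including the node itself, and checks that both extremes hover around N·p_N, as predicted under degree concentration:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 2000, 0.05

# Undirected ER graph; in this setting the degree also counts the node itself
G = np.triu(rng.random((N, N)) < p, 1)
G = G | G.T
d = 1 + G.sum(axis=1)          # d_i = 1 + Binomial(N - 1, p), as in (83)-(84)

d_max, d_min = d.max(), d.min()
print(d_min, d_max, N * p)     # both extremes concentrate around N * p
```

With N·p = 100 here, both d_max/(N·p) and d_min/(N·p) land within a narrow band around 1, which is the uniform concentration phenomenon exploited throughout the appendix.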

A. ANOTHER USEFUL CONCENTRATION RESULT
Lemma 2 (Maximum and minimum of N² binomial variables with success probability p²_N): Assume that the success probability obeys (42). Then we have that: (87)

Proof: If p_N → p > 0, we can set p′_N = p²_N, and obviously p′_N converges to p² > 0, implying that the binomial variables of parameters N and p′_N are generated under the uniform concentration regime. The result in (87) then readily follows by exploiting (75) and (76) as done in the proof of Lemma 1.
If p_N → p = 0, it suffices to prove the claim for the maximum. Applying (75) we can write: where, in the last step, we used the equality N p_N = ω_N log N that follows from (42). Moreover, since we are considering the case where p_N → 0 as N → ∞, for any ε > 0 and for sufficiently large N we will have p_N < ε, so that, asymptotically, it is legitimate to replace (88) with: Choosing t small enough so that ε(e^t − 1) < t, we finally get: which completes the proof of the lemma.

APPENDIX B BOUNDS ON THE POWERS OF A
In the following, the symbol a^{(k)}_{ij} denotes the (i, j) entry of the k-th matrix power A^k, and N denotes the set of all nodes. Table 3 lists some random variables that are necessary to state and prove the theorems.
We start with a technical lemma that will be used to prove Theorem 2.
Lemma 3 (Bounds on the entries of A^k): The entries of the combination matrix power A^k are bounded as follows: and, for i ≠ j: where, for k ≥ 2, the (random) sequences α_k, ᾱ_k, β_k, β̄_k, γ_k, and γ̄_k are determined by the following recursions: with the initialization choices: and with the initialization choices:

Proof: We start by examining the relationships pertaining to the main-diagonal terms, namely, (91). For k = 2 the claim is trivially true with the values of α_2 and ᾱ_2 in (96) and (100), because of the definitions of M_{a²,self} and m_{a²,self} reported on line 3) of Table 3. We shall therefore reason by induction to prove that (91) holds for an arbitrary k. In particular, we assume the claim verified for k, and show that it is verified for k + 1. To this aim, we start by writing the diagonal terms of the (k + 1)-th matrix power as: and, hence: where we have used the fact that Σ_{ℓ∈N} a^{(k)}_{iℓ} = ρ^k along with the definition of M_a appearing on line 1) of Table 3. In (102), we can bound the term a_ii by using the definitions on line 2) of Table 3, and the term a^{(k)}_{ii} by using (91) (which is true for k by the induction hypothesis), yielding: from which we conclude that (91) holds true for k + 1, with the sequences α_k and ᾱ_k obeying the recursions in (93) and (97), respectively. We switch to the proof of (92). In particular, we focus on the upper bound in (92), since the proof for the lower bound is similar. For all i, j ∈ N, with i ≠ j, we have: where we have applied the definitions listed on lines 1), 2) and 8) of Table 3. First, we observe that (105) implies that the right inequality in (92) holds true in the case k = 2, with the choices detailed in (96). Moreover, we have: Therefore, since (91) holds true, and assuming that (92) holds for an arbitrary k ≥ 2, we have: which shows that the right inequality in (92) holds with the sequences β_k and γ_k obeying (94) and (95), respectively, with the initialization choices in (96). □
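A step worth isolating from this proof is the identity Σ_{ℓ∈N} a^{(k)}_{iℓ} = ρ^k, which holds because every row of A sums to ρ, so that A·1 = ρ1 and hence A^k·1 = ρ^k 1. A small numerical check (our sketch, with an assumed uniform-weight matrix):

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, rho = 120, 0.1, 0.9

# Symmetric ER graph and a combination matrix whose rows sum exactly to rho
G = np.triu(rng.random((N, N)) < p, 1)
G = G | G.T
kappa = rho / (2 * max(G.sum(axis=1).max(), 1))    # assumed uniform off-diagonal weight
A = kappa * G.astype(float)
np.fill_diagonal(A, rho - kappa * G.sum(axis=1))

for k in (1, 2, 5, 10):
    row_sums = np.linalg.matrix_power(A, k).sum(axis=1)
    assert np.allclose(row_sums, rho ** k)          # A^k 1 = rho^k 1
print("row-sum identity verified")
```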
Let us introduce the following series (all the infinite summations will be shown to be bounded), defined in terms of the variables introduced in Lemma 3: We are now ready to state and prove the first theorem, which provides upper and lower bounds on sums of powers of the matrix A. These bounds will be critical to examine the concentration behavior of the one-lag and residual estimators. The main message conveyed by the theorem is that, for all i, j ∈ N, the individual (i, j) entries of sums of powers of A can be upper and lower bounded in terms of the elements a i j , and in terms of suitable bounding random variables that do not depend on the node indices i and j. These bounding random variables are m and M on line 8) of Table 3, and the " " variables appearing in (109)-(114). We remark that, in order to prove our main results in Theorem 2, we will not need the explicit expression of these " ", but only their limiting properties, detailed in (118) and (119).

Theorem 5 (Bounds on the sum of powers of A):
The combination matrix A fulfills the following bounds for all i, j ∈ N: and, for i ≠ j: If A is a regular diffusion matrix of parameters ρ and κ according to Assumption 1, then under the uniform concentration regime the bounding variables " " converge in probability, as N → ∞, to deterministic quantities, namely, where we set ζ = ρ − κ.
Proof: Calling upon Lemma 3, and using in (91) and (92) the definitions (109)-(114), we immediately get (115), (116), and (117). However, it is necessary to show that the " " random variables are proper random variables, i.e., that all the infinite summations in (109)-(114) are bounded. We will explain this conclusion with reference to the upper bounding sequences, with the case of the lower bounding sequences being dealt with similarly. The system of recursions in (93)- (95) can be solved by calculating first α k , then β k (after substituting α k ) and finally γ k (after substituting β k ). Applying Lemma 7, it can be verified that all the obtained solutions are linear combinations of geometric sequences with ratio strictly smaller than one, from which finiteness of the summations α , β and γ follows.
Next we focus on proving the convergence in probability in (118) and (119). To this end, it is instrumental to introduce the following auxiliary series: We start by showing that the following convergence takes place: First of all, it is convenient to rewrite the series in (120) in the following more explicit form: Let us consider (93). By summing over index k and using (122), we can write: where we have set = M a ρ 2 1−ρ . Likewise, operating on (94) and using (123), we get: (126) Finally, applying the above procedure to (95) and using (124), we obtain: which, using (125) and (126), after straightforward algebra yields: Now, in view of the convergences in probability proved in Lemma 6 -specifically, in view of (206), (209), (212) -it is legitimate to replace the pertinent variables in (125), (126), and (128), with their limits (that are reported in Table 3). After some lengthy, though straightforward, algebra, this replacement leads to (121). We now move on to prove (119). Using the definitions of (even) α and (odd) α in (109) and (112), and summing over k the terms in (93), we see that: yielding: where we defined = M a ρ 2 1−ρ 2 . Using now (120) and (130), we further obtain: Proceeding in a similar way, from (94) we obtain: yielding: and Finally, from (95), we can write: yielding:  Table 3.

APPENDIX C BOUNDS ON THE MATRIX H IN (36)
The following technical lemma is instrumental to prove Theorem 6. We recall that, from (36), we have the definition

Lemma 4 (Bounds on the entries of C^k):
The entries of the matrix power C^k are bounded as follows: and, for ℓ ≠ m: where, for k ≥ 1, the (random) sequences α_k, ᾱ_k, β_k, β̄_k, γ_k, and γ̄_k are determined by the following recursions: with the initialization choices: and with the initialization choices:

Proof: We start with the inequalities pertaining to the main diagonal of the matrices C^k, namely, with (138). For k = 1 we use the definitions of α_1 and ᾱ_1 in (143) and (147), respectively, to see that (138) is trivially satisfied in view of the definitions appearing on line 6) of Table 3. We shall now prove that (138) holds for an arbitrary k by induction. Assume thus that (138) is true for k; we need to show that it is true for k + 1. From the definition of the matrix C in (36), we can write its terms on the main diagonal as: In view of (106) we can write: (149) Moreover, since C = [A²]_S, the first relationship in (44) can be recursively applied to show that: Therefore, from (148), (149) and (150) we conclude that: having further used the definitions on line 6) of Table 3 to bound the term c_ℓℓ. Since we have assumed that (138) holds true for k, we can further apply (138) into (151), yielding: which reveals that (138) holds true for k + 1, with the sequences α_k and ᾱ_k obeying the recursions in (140) and (144), respectively.
Let us move on to examine the case ℓ ≠ m. For all ℓ, m ∈ S, with ℓ ≠ m, we can use (105) to conclude that: Equation (153) shows that the upper bound in (139) holds for k = 1, with the choices in (143). Now, rewriting the relationship C^{k+1} = C C^k on an entrywise basis, we have: Now, we can bound c^{(k)}_{mm} by applying (138) and c_{ℓm} by applying (153). Moreover, assuming as induction hypothesis that (139) holds true for an arbitrary k, we can bound c^{(k)}_{hm}, finally obtaining from (154): where in the last step we applied the definition of M_{c,sum} appearing on line 7) of Table 3 to bound the sum of the c_{ℓh}. Let us focus on the last summation in (155). Using the definitions on line 6) of Table 3 to bound the term c_{ℓℓ}, and (153) to bound the term c_{ℓh} for h ≠ ℓ, we get: where, in the latter inequality, we have further applied the definitions listed on lines 5) and 9) of Table 3. Using now (156) in (155) we get: which shows that the right inequality in (139) holds also for k + 1, with β_k and γ_k obeying the recursions in (141) and (142), respectively, with the initialization choices in (143). The proof of the left inequality in (139) is similar. □
Let us now introduce the following series (all the infinite summations will be shown to be bounded): The next theorem provides upper and lower bounds on the matrix H introduced in (36). These bounds will be critical to examine the concentration behavior of the Granger estimator.

Theorem 6 (Bounds on the matrix H):
The matrix H in (36) fulfills the following bounds, for all ℓ, m ∈ S: and, for ℓ ≠ m: If A is a regular diffusion matrix of parameters ρ and κ according to Assumption 1, then under the uniform concentration regime the bounding variables " " converge in probability, as N → ∞, to deterministic quantities, namely, where we set: (164)

Proof: The matrix H defined in (36) can be expressed as: Calling upon Lemma 4 and summing the inequalities in (138) over index k, from (158) we immediately get (161). Likewise, summing over k the inequalities in (139), from (159) and (160) we obtain (162). However, it is necessary to show that the " " random variables are proper random variables, i.e., that all the infinite summations in (158)-(160) are bounded. We will explain this conclusion with reference to the upper bounding sequences, with the case of the lower bounding sequences being dealt with similarly. The system of recursions in (140)-(142) can be solved by calculating first α_k, then β_k (after substituting α_k), and finally γ_k (after substituting α_k and β_k). Applying Lemma 7, it can be verified that all the obtained solutions are linear combinations of geometric sequences with ratio strictly smaller than one, from which convergence of the series in (158)-(160) follows. Next we focus on proving the convergence in probability in (163). Let us consider (140).
By summing over index k and using (158), we can write: where we have set: Likewise, from (141), summing over index k and using (159), we can write: yielding: Using now (166) into (169), we get: Finally, applying the same procedure to (142) and using (160), we obtain: which yields the solution: Now, in view of the convergences in probability proved in Lemma 6 -specifically, in view of (206), (214), (223), (224) and (230) -it is legitimate to replace the pertinent variables with their limits (that are reported in Table 3) in (166), (169), and (172), which, after some lengthy, though straightforward, algebra, leads to (163).
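Because the series involved converge geometrically, the bounds can be checked numerically. The sketch below assumes, purely for illustration, that H is the Neumann-type series Σ_{k≥1} C^k with C = [A²]_S (an assumed reading of definition (36)); for a symmetric A with spectral radius at most ρ < 1, the truncated series matches the closed form C(I − C)^{-1} to machine precision:

```python
import numpy as np

rng = np.random.default_rng(4)
N, p, rho = 120, 0.1, 0.9
S = np.arange(60)                                   # probed subset

G = np.triu(rng.random((N, N)) < p, 1)
G = G | G.T
kappa = rho / (2 * max(G.sum(axis=1).max(), 1))     # assumed uniform off-diagonal weight
A = kappa * G.astype(float)
np.fill_diagonal(A, rho - kappa * G.sum(axis=1))

C = (A @ A)[np.ix_(S, S)]                           # C = [A^2]_S, spectral radius <= rho^2 < 1
H_series = np.zeros_like(C)
Ck = np.eye(len(S))
for _ in range(300):                                # truncated Neumann series sum_{k>=1} C^k
    Ck = Ck @ C
    H_series += Ck

H_closed = C @ np.linalg.inv(np.eye(len(S)) - C)
print(np.abs(H_series - H_closed).max())            # geometric convergence makes this tiny
```

The spectral-radius control (a principal submatrix of the positive-semidefinite A² cannot exceed the spectral radius of A² itself, by eigenvalue interlacing) is what guarantees both the convergence of the series and the accuracy of the truncation.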

APPENDIX D PROOF OF THEOREM 2
We start by proving an auxiliary lemma.

Lemma 5 (Sufficient conditions for universal local structural consistency): Let the network graph be drawn according to an Erdős-Rényi random graph model, and let A be a regular diffusion matrix with parameters ρ and κ. Let S be the set of observable nodes and consider then a limiting estimator: Assume that, for all i, j ∈ S, with i ≠ j: where the quantities w_N, w̄_N, z_N and z̄_N depend on the network size, N, but do not depend on (i, j), and fulfill the following convergences: Then, under the uniform concentration regime, the limiting estimator A_S achieves universal local structural consistency, with scaling sequence s_N = N p_N, bias η = z, and identifiability gap κ(1 + w).

Proof: Using (12) and (173) we can write: and, hence, from (174) we get: Using (175), we conclude that: Let us now examine the connected pairs. From (174) we know that: which, used along with (13), gives: and In view of Assumption 1 we can write, for all pairs (i, j) where a_ij > 0: Using (182) in (180) and (181), calling upon Corollary 1, and exploiting the convergences in (175), we conclude that: (183) It remains to apply the definition of bias and identifiability gap in (14) to get the claim of the lemma. □
Proof of Theorem 2: The proof boils down to combining Theorem 6 with Lemma 5.
From (37) we can write, for i, j ∈ S, with i = j: (2) m j = ∈S a i h a (2) j + ,m∈S =m a i h m a (2) m j (2) m j (185) where the inequality is obtained by bounding the entries of the matrix H, specifically, we have that (184) follows from (161), whereas (185) and (186) We now use (105) to bound the term a (2) j in (184) and the term a (2) m j in (185), whereas we use (187) to bound (186), finally obtaining: which can be recast in the following convenient form: where we have used the definitions listed on lines 9), 10), 11), 12), 13) and 14) of Table 3. The arguments leading to this result can be repeated by replacing upper bounds with lower bounds, and maxima with minima (e.g., M replaced by m, or α replaced by α ), yielding: Now, under the uniform concentration regime we can use the pertinent convergences in probability listed in Table 3, in conjunction with the convergences in (163), which, after some tedious but straightforward algebra, lead to: where η is the bias corresponding to the Granger estimator in Table 2. Accordingly, we can conclude that the error for the Granger estimator fulfills the hypotheses of Lemma 5, with the choice w N = w N = 0, and with the quantities z N and z N defined in (189) and (190), respectively. This concludes the proof for the claim pertaining to the behavior of the limiting Granger estimator.

APPENDIX E PROOF OF THEOREM 4
Proof: Since the limiting one-three-lags estimator is equal to the true matrix A S , the proof for this estimator comes from Assumption 1 and Corollary 1.
The proof for the one-lag and residual estimators boils down to combining Theorem 5 with Lemma 5. In light of the definitions of bias and gap, it suffices to prove the claim with σ 2 = 1, and then scale the values obtained for the bias and the gap by an arbitrary σ 2 .
Using (64), the error corresponding to the one-lag estimator can be written as (for all i, j ∈ S with i ≠ j): and, hence, using the bounds in (117), we can write: where η is equal to the bias of the one-lag estimator as defined in the pertinent row of Table 2. Concerning β^(odd) and β̄^(odd), from (119) we see that: where the limit is equal to the gap of the one-lag estimator as defined in the pertinent row of Table 2 (recall that we are working with σ² = 1). It remains to apply Lemma 5, with the choices: which concludes the proof for the limiting one-lag estimator.
Let us switch to the analysis of the residual estimator. Using (70), the error corresponding to the residual estimator can be written as (for all i, j ∈ S with i ≠ j): and, hence, using the bounds in (116) and (117), we can write: and (199) Now, using the convergence results in (118), (119) and (230), simple algebraic calculations lead to: where η corresponds to the bias of the residual estimator listed in the pertinent row of Table 2. Likewise, exploiting (118), (119) and the representation of the gap of the residual estimator in Table 2, we can prove that: It remains to apply Lemma 5, with the choices: and which concludes the proof of the theorem.
is a binomial random variable with number of trials equal to S − 2, and success probability equal to p_N. In other words, we get the following representation: because maximization is carried out over all pairs ℓ, m ∈ S, with ℓ ≠ m. Moreover, since in the uniform concentration regime we have: and since S/N → 1 − ξ as N → ∞, we can regard the connection probability p_N as a connection probability scaling with respect to S, namely, where: This shows that the uniform concentration regime can be referred also to the scaling of the involved quantities w.r.t. S (in place of N). Accordingly, applying (75) with the choices K = (S − 1)S, N = S − 2, and p_S = ω_S log(S)/S, and reasoning as done in the proof of Lemma 2, we get: It remains to use (217) in (215) to get: Repeating the same reasoning with lower bounds in place of upper bounds, and with minima in place of maxima, yields the same result and, hence, (214) follows.

6) M_{c,self}: This result follows readily by repeating the same steps used to prove (209).

7) M_{c,sum}: Using the definition of C in (36), and applying the same procedure used in the previous items of this section, it is readily proved that, if j ∈ S (and, hence, the self-term is not present, because ℓ ∈ S): In view of (44), we can write: Now we see that the quantity: is a binomial random variable with number of trials equal to N − 2 and success probability equal to p²_N, since, when ℓ ≠ i, j, the product variable g_iℓ g_ℓj is a Bernoulli variable with success probability p²_N. Therefore, we are allowed to introduce the definition: which, in view of Lemma 2, yields: Now, from the definition of M in Table 3, line 8), we get: (235) Repeating the same reasoning with lower bounds in place of upper bounds, and with minima in place of maxima, we get the claim in (230). 9) The proof for the case where p = 0 comes from (230) because, from the definitions listed on lines 8) and 9) of Table 3, we see that M(S) ≤ M. The proof for the case where p > 0 is readily obtained by using the same arguments leading to (230). 10) N p_N M(S): We have that: Reasoning as done for proving (236), we can show that: Likewise, reasoning as done for proving (214), we can show that: Finally, using (239) and (240) into (238), and repeating the same reasoning with lower bounds in place of upper bounds, and with minima in place of maxima, we get (237). 11) The following list of convergences is obtained by trivial variations on the previous proofs.

APPENDIX G USEFUL LEMMA
Lemma 7: Let 0 < α < 1, 0 < ρ_ℓ < 1 and β_ℓ ∈ ℝ for all ℓ = 1, 2, . . . , L, and introduce the following recursion: Then, f_k is equal to: namely, f_k can be represented as a linear combination of geometric sequences with ratio strictly smaller than one, a structure that will be particularly convenient in the proofs of Theorems 5 and 6.

Proof: Exploiting (242), we can write: . . .
The last equation can be manipulated as follows: which corresponds to (243).
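Lemma 7 can be checked numerically. Under the assumption that the recursion reads f_{k+1} = α f_k + Σ_ℓ β_ℓ ρ_ℓ^k with a given f_1 (our reading of (242)), the closed form is the geometric combination f_k = α^{k−1} f_1 + Σ_ℓ β_ℓ (ρ_ℓ^k − ρ_ℓ α^{k−1})/(ρ_ℓ − α), which the sketch below compares against direct iteration:

```python
# Verify that the recursion f_{k+1} = alpha * f_k + sum_l beta_l * rho_l**k
# is solved by a linear combination of geometric sequences (assumed form of Lemma 7).
alpha, f1 = 0.6, 2.0
betas = [1.0, -0.5, 3.0]
rhos = [0.3, 0.8, 0.5]            # all distinct from alpha so the closed form below applies

def f_closed(k):
    out = alpha ** (k - 1) * f1
    for b, r in zip(betas, rhos):
        out += b * (r ** k - r * alpha ** (k - 1)) / (r - alpha)
    return out

f = f1
for k in range(1, 40):            # iterate the recursion and compare term by term
    assert abs(f - f_closed(k)) < 1e-10
    f = alpha * f + sum(b * r ** k for b, r in zip(betas, rhos))
print("closed form matches the recursion")
```

Since every ratio involved is strictly smaller than one, f_k decays geometrically, which is exactly the boundedness property invoked when summing these sequences in Theorems 5 and 6.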

APPENDIX H SAMPLE CONSISTENCY
We start with a useful lemma that characterizes the output of the clustering algorithm proposed in Section V-B when its input consists of two well-separated classes.

Lemma 8 (Consistency of the clustering algorithm):
Let v_1, v_2, . . . , v_L be a set of real numbers such that, for ν_0, ν_1 ∈ ℝ, ε > 0, and k ∈ {1, 2, . . . , L}: Then, for all ε > 0 such that: the clustering algorithm clu(v) described in Section V-B produces as output a unique pair of clusters corresponding to j = k, namely:

Proof: We recall that the algorithm clu(v) considers a pair of clusters as admissible if the midpoint between the centroids of the clusters separates them. Accordingly, we start by showing that the cluster configuration in (249) is admissible. In view of (247), the centroids of C_0(k) and C_1(k) fulfill the following bounds: and, hence, their midpoint fulfills the condition: In view of (247) and (251), a sufficient condition for the clusters in (249) to be admissible is the following: which amounts to: and the admissibility of the configuration in (249) follows from (248). In principle, other admissible configurations could exist. Next we show that if another admissible configuration distinct from (249) exists, then this configuration must necessarily exhibit a smaller inter-cluster distance. First, we note that, in view of (250), the distance between the centroids of the clusters in (249) fulfills the bound: Let us assume that another admissible configuration j exists, for a certain j > k. Since now the point v_j > ν_1 − ε belongs to C_0(j), and since we assumed that configuration j is admissible, v_j must be smaller than the midpoint between the centroids, yielding, in view of (247): On the other hand, in view of (247) we have: which, used along with (255), yields: Examining the bound on the distance in (254), we see that the condition: is sufficient for the configuration in (249) to maximize the inter-cluster distance. Finally, since the situation is similar for j < k, the proof is complete. □

Proof of Theorem 1: In the following, for i, j ∈ S, the (i, j) entry of the estimated matrix A_{S,n} is denoted by a_ij(n). Likewise, the (i, j) entry of the limiting estimated matrix A_S is denoted by a_ij.
Let us introduce the following events (we recall that bold notation refers to random variables): In view of Lemma 8, assuming a sufficiently small ε, the clustering algorithm proposed in Section V-B reconstructs the exact graph whenever E_0(n, N) ∩ E_1(n, N) occurs. Accordingly, the theorem will be proved if we show that, for some n(N): To this aim, let us introduce the events: and By the triangle inequality, we have the implication: yielding: (264) Since by assumption the limiting matrix estimator A_S achieves universal local structural consistency, we know that: Consider now a sequence ε_N > 0 that vanishes as N → ∞.
Since A S,n is in the class defined by (11), for any fixed N, there exists n(N ) such that, for all n ≥ n(N ) we have: or, by the sandwich theorem: Using now (265) and (267) in (264) implies (260) and, hence, the claim of the theorem.
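The admissibility rule used in this appendix translates directly into code. The sketch below (our reconstruction from the description given here; function and variable names are ours) scans every split of the sorted input, keeps the splits whose centroid midpoint separates the two clusters, and returns the admissible split with maximal inter-centroid distance:

```python
def clu(v):
    """Two-cluster split of a 1-D list, per the admissibility rule of Section V-B.

    A split is admissible when the midpoint of the two centroids separates the
    clusters; among admissible splits, the one maximizing the inter-centroid
    distance is returned as (lower_cluster, upper_cluster).
    """
    v = sorted(v)
    best, best_dist = None, -1.0
    for k in range(1, len(v)):
        c0, c1 = v[:k], v[k:]
        m0 = sum(c0) / len(c0)
        m1 = sum(c1) / len(c1)
        mid = (m0 + m1) / 2
        if max(c0) <= mid <= min(c1):          # midpoint separates the clusters
            if m1 - m0 > best_dist:
                best, best_dist = (c0, c1), m1 - m0
    return best

low, high = clu([0.0, 0.1, 0.2, 1.0, 1.1])
print(low, high)                               # well-separated classes are recovered
```

On the example shown, only the split between 0.2 and 1.0 is admissible, matching the uniqueness argument of Lemma 8 for inputs made of two well-separated classes.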

APPENDIX I SAMPLE COMPLEXITY
In the following we will make repeated use of the following inequality, holding for any two matrices A and B: Given a regular diffusion matrix fulfilling Assumption 1 with parameters ρ and κ, let ζ = ρ − κ. For a small δ, with 0 < δ < 1 − ζ^2, we introduce the auxiliary quantities: We notice that when δ is sufficiently small, ϕ_1 > ϕ_2. Next we introduce some auxiliary functions. Let b_n(x) be given by: where x is a positive quantity, n is the sample size, and S is the number of probed nodes. It is immediate to verify that b_n(x) is non-increasing in x, and that: The second auxiliary function is: We will now examine the behavior of b_N(δ) as N → ∞. We start by showing that, as N → ∞, the minimum and maximum diagonal entries of R_0 concentrate (in probability) around the same value. In view of (29) we have, for all i ∈ N: which, using (115), yields: Applying to the lower and upper bounds in (274) the convergence shown in the first line of (118), by the sandwich theorem we get (275). We note from (274) that the convergence in (275) is uniform across the index i. In other words: further implying that: Since this holds for all δ > 0, from (272) we conclude that: The following lemma characterizes the rate of convergence of the sample covariance estimators. This lemma adapts Lemmas 1 and 2 in [30] to exploit the peculiarities of our setting.
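For the reader's convenience, we recall the standard VAR(1) identities that underlie (29) and footnote 13: with unit-variance i.i.d. input and a symmetric stable combination matrix A (a standard fact, restated here under the paper's symmetry assumption), the stationary covariance solves the discrete Lyapunov equation,

```latex
R_0 = A R_0 A^{\mathsf{T}} + I
    = \sum_{m=0}^{\infty} A^{m}\bigl(A^{\mathsf{T}}\bigr)^{m}
    = (I - A^{2})^{-1},
\qquad
R_{m} = A^{m} R_0, \quad m \ge 0,
```

where the last series representation uses the symmetry of A, and the lag-m covariance R_m matches the representation R_{m−m′} = A^{|m−m′|} R_0 invoked later in the proof of Lemma 9.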
Lemma 9 (Convergence rate of the sample covariance estimators): Consider the VAR model in (3) with i.i.d. standard Gaussian input source and initial state distributed (conditionally on A) according to the stationary distribution of the VAR process. Let the combination matrix A fulfill Assumption 1 with parameters ρ and κ, and let the underlying graph fulfill the Erdős-Rényi model under the uniform concentration regime. Then, for all ε > 0 we have: and

Proof: It is worth noticing preliminarily that the assumption of a stationary initial state is adopted because some results about covariance matrices that will be used to prove our theorems are obtained under this assumption, which is classically adopted in sample-complexity analysis [29], [30], [31], [32], [95]. As observed, e.g., in [29], this assumption entails only a mild restriction, since the (stable) system in (3) converges exponentially fast to the stationary regime.
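The stationary-start VAR setup just described can be simulated directly. The following sketch illustrates the concentration asserted by Lemma 9 with an illustrative symmetric stable matrix (the paper instead draws A from an Erdős-Rényi construction; all numerical values here are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative symmetric combination matrix: random adjacency, small
# off-diagonal weights, diagonal chosen so that every row sums to 0.6 < 1
# (hence the VAR model is stable).
N = 5
G = (rng.random((N, N)) < 0.5).astype(float)
G = np.triu(G, 1); G = G + G.T                  # symmetric, no self-loops
A = 0.1 * G + np.diag(0.6 - 0.1 * G.sum(axis=1))

# Stationary covariance of model (3) for symmetric A: R_0 = (I - A^2)^{-1}.
R0_true = np.linalg.inv(np.eye(N) - A @ A)

# Draw y_0 from the stationary distribution, then run the VAR recursion
# y_n = A y_{n-1} + x_n with i.i.d. standard Gaussian input x_n.
n = 100_000
y = np.linalg.cholesky(R0_true) @ rng.standard_normal(N)
R0_hat = np.zeros((N, N))
for _ in range(n):
    y = A @ y + rng.standard_normal(N)
    R0_hat += np.outer(y, y)
R0_hat /= n

err = np.abs(R0_hat - R0_true).max()            # max-norm estimation error
```

For this small stable system the max-norm error decays roughly like 1/√n, in line with the exponential tail bounds (280) and (281).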
Let us start by examining the probability in (280) for a fixed realization of the combination matrix A = A, i.e., we consider, for i, j ∈ N, the quantity: where and where, in the considered conditional space, R_{0,n} is an empirical covariance matrix estimated over a VAR model (3) with deterministic combination matrix A. In fact, given the graph realization (i.e., given a certain A = A), the random sequence y_n evolves according to (3), with x_n being the assigned i.i.d. zero-mean, unit-variance source, and with y_0 following the stationary distribution of the VAR model corresponding to A. For these types of models, an upper bound on the probability appearing in (282) is derived in [30], relying on a concentration result for the covariance matrix entries shown in [95]. More precisely, to obtain an upper bound on the probability in (282) we now modify the proof of Lemma 1 in [30] by exploiting the peculiarities of our model. To avoid redundancy, we do not report here the complete arguments and proofs from [30]; we limit ourselves to modifying the specific part of the proof that is necessary to obtain our bounds.
Let us consider, for m, m′ = 1, 2, . . . , n, the following known relationship, obtained by exploiting (3):13 In [30] the following inequality is used: In our case we can replace (285) by the following inequality, which exploits additional constraints on A: where the last equality comes from noticing that: i) the matrix A/ρ is doubly stochastic; and ii) the off-diagonal entries of the covariance matrix are upper bounded as |[R_0]_{ij}| ≤ √([R_0]_{ii} [R_0]_{jj}) by the Cauchy-Schwarz inequality. Applying (286) in the proof of Lemma 1 in [30] (with every other detail of the proof unaltered) we get, for √n ε ϕ(R_0) > √2 and all i, j ∈ N with i ≠ j: where we introduced the definition: Using the union bound over the set of probed nodes S and the definition in (270), from (287) and (288) we get, for all n:

P[ ‖[R_{0,n}]_S − [R_0]_S‖_max > ε | A = A ] ≤ 3 b_n(ε ϕ(R_0)). (289)

We now complete the proof by taking into account the randomness of the combination matrix A. To this end, let R be the set of all possible realizations of A, and introduce the following set: where we recall that R_0 is a function of A (see (283)). We notice that T is a high-probability set, since in view of (272) and (279) we have: By the law of total probability we can write:

13 The specific representation R_{m−m′} = A^{|m−m′|} R_0 holds due to the symmetry of A.
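In compact form, the conditioning argument in (292) reads (restated here with the document's convention that bold symbols denote random quantities):

```latex
\mathbb{P}\Bigl[\bigl\|[\bm{R}_{0,n}]_{S} - [R_0]_{S}\bigr\|_{\max} > \varepsilon\Bigr]
\;\le\;
\sup_{A \in T}
\mathbb{P}\Bigl[\bigl\|[\bm{R}_{0,n}]_{S} - [R_0]_{S}\bigr\|_{\max} > \varepsilon
\,\Big|\, \bm{A} = A\Bigr]
\;+\;
\mathbb{P}[\bm{A} \notin T],
```

where the first term is controlled by the conditional bound (289) uniformly over the high-probability set T, and the second term is controlled by the concentration of the node degrees.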
Moreover, from (273) we see that: yielding: and in view of the definition of ϕ_2 in (269) we conclude that ϕ(R_0) ≥ ϕ_2 for all R_0, implying: Applying (293) and (296) in (292), we can write: and (280) follows by observing that P[A ∉ T] ≤ b_N(δ) (see (291)). In order to obtain (281), it suffices to repeat the same steps shown above on the proof of Lemma 2 in [30].

Let us comment briefly on the structure of the bounds in (280) and (281). For clarity of presentation, we focus on (280), since similar considerations apply to (281). We notice that the bound in (296) could be used for all possible realizations of A, while in (297) we used it only for A ∈ T. This is because, for A belonging to the high-probability set T, the covariance matrix R_0 exhibits a concentration property that allowed us to obtain the bound b_n(ε ϕ_1) in (293), which goes to zero with n faster than b_n(ε ϕ_2), since ϕ_1 > ϕ_2. On the other hand, examining (280) we see that the other bound b_n(ε ϕ_2) is further multiplied by the quantity b_N(δ), which is independent of the sample size n and vanishes as the network size N goes to infinity, as a consequence of the aforementioned concentration properties. As a possible extension, one could try to refine also the term b_n(ε ϕ_2) by exploring alternative bounding techniques (see, e.g., [29], [32]).

Proof: We will prove the claim with reference to (299), the proof for (300) being identical. Examining (280), we see that in order to get (299) it suffices to guarantee that: since the second term in (280) goes to zero automatically as N → ∞ due to the presence of b_N(δ), whatever law n(N) is chosen. Let us first focus on the second exponential term in (270), which can be written as: After some straightforward algebra, the exponent in (302) can be more conveniently rewritten as: Now, since S/N → ξ > 0 as N → ∞, we see that S → ∞.
Therefore, the second exponential term in (270) converges to zero if the quantity in brackets in (303) becomes asymptotically positive. Noticing that the term (√(log S))^{-1} vanishes as N → ∞, we see that this condition is verified if we have, for some ε > 0: from some N onward, which corresponds to (298). Finally, since with the found scaling law n(N) diverges faster than log S, the first exponential term in (270) converges to zero as well, and the desired claim in (301) is obtained. The next two lemmas are auxiliary to the proof of Theorem 3.
Lemma 10: Let A, R_0, R_1, Â, R̂_0, and R̂_1 be square matrices of equal size, with: Assume that R̂_0 is invertible and let: Then we have the bound in (307).

Proof: From (306) we can write: and the result in (307) follows because ‖A‖_∞ ≤ 1 in view of (305).

Lemma 11: Let A, R_0, R_1, Â, R̂_0, and R̂_1 be square matrices defined over an index set S, with: Let the i-th row of Â be a solution to the optimization problem: where x is a row vector and [R̂_1]_{i,S} is the i-th row of [R̂_1]_S. If R̂_0 is invertible, then we have the bound in (311).

Proof: We recall that for a vector the ‖·‖_∞ norm amounts to the maximum absolute value of its entries, whereas for a matrix it is the maximum absolute row sum. Accordingly, we see that the matrix A introduced in the statement of the lemma is a candidate solution to (310), since by assumption ‖A‖_∞ ≤ 1. As a result, a solution to the optimization problem in (310) fulfills: On the other hand, we can write: and the claim in (311) follows from (312).

Proof of Theorem 3: Examining the proof of Theorem 1, and in particular (262), we see that the sample complexity of A_{S,n} can be determined by finding, for a sufficiently small ε > 0, a law n(N) that ensures: Let us consider the regularized sample Granger estimator defined by (51). First we show that the limiting Granger estimator obeys the following bound: In view of (35) we can write: where we remark that the matrix F is nonnegative by construction. Exploiting (316) we have: where the last inequality follows since from (75) in [19] we know that: Equation (317) then establishes (315). We conclude that the scaling law in (298) will be sufficient for the regularized Granger estimator if we show that ‖([R_0]_S)^{-1}‖_1 is bounded by some constant. To this end, let: From one of the block-matrix representations of the inverse matrix we have [96, p. 18]: Since A is nonnegative with ‖A²‖_∞ = ρ², while F is nonnegative and fulfills (318), we conclude that: which, in view of (321) and the symmetry of R_0, yields: which completes the proof for the regularized Granger estimator.
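For intuition, the limiting relation A = R_1 R_0^{-1} exploited by the Granger estimator can be checked numerically. The sketch below uses assumed matrix values and full observation of the network; the paper's estimators (50) and (51) instead invert only the probed block [R_0]_S:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative symmetric stable combination matrix (assumed values).
A = np.array([[0.5, 0.2, 0.0],
              [0.2, 0.4, 0.2],
              [0.0, 0.2, 0.5]])
N = A.shape[0]

# Stationary covariances of model (3): R_0 = (I - A^2)^{-1}, R_1 = A R_0.
R0 = np.linalg.inv(np.eye(N) - A @ A)
R1 = A @ R0

# Sample covariances from a stationary trajectory of model (3).
n = 100_000
y = np.linalg.cholesky(R0) @ rng.standard_normal(N)
R0_hat = np.zeros((N, N)); R1_hat = np.zeros((N, N))
for _ in range(n):
    y_prev = y
    y = A @ y_prev + rng.standard_normal(N)     # i.i.d. standard Gaussian input
    R0_hat += np.outer(y, y)                    # lag-0 covariance estimate
    R1_hat += np.outer(y, y_prev)               # lag-1 covariance estimate
R0_hat /= n; R1_hat /= n

# Granger estimator in the fully observed case.
A_hat = R1_hat @ np.linalg.inv(R0_hat)
err = np.abs(A_hat - A).max()
```

The error A_hat − A has exactly the structure handled by Lemma 10: it is driven by the covariance estimation errors R̂_1 − R_1 and R̂_0 − R_0, amplified by the inverse of the (estimated) lag-0 covariance.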
Let us finally prove that, in the dense case where p_N → p > 0, the convergence in (314) with the scaling law in (54) holds for the non-regularized Granger estimator in (50). To this end, we expand [R_{0,n}]_S as: Now we observe that: where: i) in the second-to-last inequality we used (323) and the fact that, for an S × S matrix, the ‖·‖_1 norm is upper bounded by S times the ‖·‖_max norm; ii) we used the fact that S/N → ξ and p_N → p > 0 as N → ∞; and iii) in view of Corollary 2, the last inequality holds (and, hence, [R_{0,n}]_S is invertible) with probability converging to 1 as N → ∞. Finally, applying matrix inversion to (324) and upper bounding the norm of (I_S + D)^{-1}, (314) follows by using (326) and Corollary 2 in Lemma 10.
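The norm bound on (I_S + D)^{-1} invoked in this last step is the standard Neumann-series estimate, recalled here for completeness (the specific high-probability bound on ‖D‖ comes from Corollary 2): for any submultiplicative matrix norm with ‖D‖ < 1,

```latex
(I_S + D)^{-1} = \sum_{m=0}^{\infty} (-D)^{m},
\qquad
\bigl\|(I_S + D)^{-1}\bigr\|
\le \sum_{m=0}^{\infty} \|D\|^{m}
= \frac{1}{1 - \|D\|},
```

so that once ‖D‖ is bounded away from 1 with probability converging to 1, the inverse in (324) is uniformly well conditioned.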