Estimating the Topology of Preferential Attachment Graphs Under Partial Observability

This work addresses the problem of learning the topology of a network from the signals emitted by the network nodes. These signals are generated over time through a linear diffusion process, where neighboring nodes exchange messages according to the underlying network topology, and aggregate them according to a certain combination matrix. We consider the demanding setting of graph learning under partial observability, where signals are available only from a limited fraction of nodes, and we want to establish whether the topology of these nodes can be estimated faithfully, despite the presence of possibly many latent nodes. Recent results examined this problem when the network topology is generated according to an Erdős-Rényi random model. However, Erdős-Rényi graphs feature homogeneity across nodes and independence across edges, while, over several real-world networks, significant heterogeneity is observed between very connected “hubs” and scarcely connected peripheral nodes, and the edge construction process entails significant dependencies across edges. Preferential attachment random graphs were conceived primarily to fill these gaps. We tackle the problem of graph learning over preferential attachment graphs by focusing on the following setting: first-order vector autoregressive models equipped with a stable Laplacian combination matrix, and a network topology drawn according to the popular Bollobás-Riordan preferential attachment model. The main result established in this work is that a combination-matrix estimator known as Granger estimator achieves graph learning under partial observability.

I. INTRODUCTION

The graph learning problem arises in many domains: for example, in biology, when one is interested in revealing how fish in a school coordinate their movements to escape a predator [2], [3]; in social learning, to discover how spatially dispersed network agents collaboratively share information to form their own opinions [4]; in neuroscience, when the goal is to unveil the relationships between functional and structural connectivity in the brain [5].
There exist several works addressing the graph learning problem. We refer the reader to the survey articles [6], [7] for an overview. Most works assume that signals from all network nodes can be gathered to estimate the topology [8], [9], [10], [11], [12], [13], [14], [15], [16], [17]. In contrast, one distinguishing feature of this work is the challenging scenario of partial observability, where only signals from a limited fraction of nodes are available to estimate the topology of the probed subnetwork.
Useful conditions for graph learning under partial observability have been obtained in [18], [19] for graphs having a polytree shape. More general shapes are considered in [20], [21], where, however, conditions for graph learning are formulated in terms of detailed local features of the network topology, which are not particularly suited to the large-scale setting considered here. In order to circumvent dependence on specific network structures, an asymptotic approach is considered in [22], [23], in the context of high-dimensional graphical models with latent variables. In [22], provable learning guarantees are offered under an appropriate local separation criterion, whereas in [23] graph learning is shown to be achievable under the so-called sparsity plus low-rank condition, namely, when the probed subnetwork is sparsely connected, and the unobserved subnetwork has suitably bounded size.
However, none of the aforementioned works considers the time dynamics of the signals emitted by the nodes, and, hence, they are not applicable to dynamical systems like the Vector AutoRegressive (VAR) model considered here (see Eq. (2) further ahead). For such models, relevant results under full observability are obtained in [24], [25], [26], [27], while partial observability is addressed in [28], relying on the aforementioned sparsity plus low-rank assumption used in [23].
Graph learning under partial observability with the VAR model considered in this work was recently addressed in [29], [30], [31], [32], [33], for the case where the dynamical system runs on top of an Erdős-Rényi random graph [34], [35]. Under reasonable technical assumptions, it was shown that graph learning is achievable under different regimes of connectivity. Erdős-Rényi graphs are by far the most common model adopted by researchers and practitioners to get insights about network problems and to validate network algorithms. The generative model of Erdős-Rényi graphs draws the edges independently of one another, and in a completely homogeneous manner. Owing to this simple generation rule, these graphs are often unable to represent faithfully a number of phenomena emerging over real-world networks. For this reason, going beyond the Erdős-Rényi model is of critical importance.
In order to overcome the aforementioned limitations, other random graphs have been proposed in the last decades, among which a prominent role is played by preferential attachment graphs. The physical principles underlying the preferential attachment mechanism were originally discovered in [36], stimulating the research of several mathematical graph models, with one of the most popular and elegant being the Bollobás-Riordan model [37], [38].
These types of graphs were shown to capture several useful characteristics of real-world networks. For example, they are able to encompass the coexistence between hubs with many connections and peripheral, scarcely connected nodes. In contrast, as already noted, Erdős-Rényi graphs are homogeneous by nature [34], [35]. As explained in Sec. III, such enhanced descriptive power hinges basically on the statistical dependence enforced between the edges, as opposed to the edge independence implied by the Erdős-Rényi construction.
In view of this dependence, extending results valid for Erdős-Rényi graphs to preferential attachment graphs is a challenging task, which is addressed in this work through the following main contributions. First, we formulate a sufficient condition for consistent graph learning under partial observability in terms of an identifiability gap (Theorem 1). This condition requires extending existing results [6] to deal with the randomness of the identifiability gap that arises from the preferential attachment construction. Second, we ascertain that, when the network topology is a Bollobás-Riordan graph and the exchange of signals between neighboring nodes is ruled by the Laplacian matrix of the graph, a matrix estimator known as the Granger estimator achieves consistent graph learning, assuming an arbitrarily large number of samples is available (Theorem 2). Third, we establish that the sample complexity of the Granger estimator over Bollobás-Riordan graphs is almost linear in the network size (Theorem 3).
Notation. Matrices are denoted by upper-case letters, vectors by lower-case letters. We use boldface font to denote random variables, and normal font for their realizations. Sets are denoted by calligraphic letters. For a graph G, the corresponding capital letter G is used to denote its adjacency matrix, which has zero diagonal, and whose off-diagonal (k, ℓ)-entry g_kℓ is equal to 1 if a directed edge from ℓ to k exists, and is zero otherwise. The symbol ‖·‖_max denotes the maximum absolute entry of its matrix argument, whereas ‖·‖_max-off denotes the maximum absolute off-diagonal entry of its matrix argument. The symbol →_p denotes convergence in probability as the network size scales to infinity. Likewise, the symbol →_a.s. denotes almost-sure convergence.

II. GRAPH LEARNING UNDER PARTIAL OBSERVABILITY
We consider a random graph defined over N nodes and denoted by G(N) (bold notation highlights graph randomness). The qualification "random" signifies that connections between nodes are drawn according to some probabilistic mechanism. Given a certain graph, the actions of the network nodes are described by a distributed linear dynamical system. Every node k, at time i = 1, 2, . . ., is driven by a random input source x_{k,i}(N) and produces the output signal y_{k,i}(N) according to the following diffusion model, a.k.a. first-order VAR model [39]:

y_{k,i}(N) = Σ_{ℓ=1}^{N} a_{kℓ}(N) y_{ℓ,i−1}(N) + x_{k,i}(N),   (1)

which can be conveniently recast in matrix form as:

y_i(N) = A(N) y_{i−1}(N) + x_i(N),   (2)

where x_i(N) and y_i(N) stack the entries x_{k,i}(N) and y_{k,i}(N) into N × 1 column vectors, and where matrix A(N) = [a_{kℓ}(N)] collects the nonnegative combination weights a_{kℓ}(N). The eigenvalues of A(N) are assumed to lie strictly inside the unit circle to ensure that system (2) is stable.

The combination matrix A(N) reflects the interconnections dictated by graph G(N). Thus, weight a_{kℓ}(N) is strictly positive if there is an edge from ℓ to k, and is zero otherwise. In view of (2), this structure implies that node k at time i updates its state y_{k,i}(N) by incorporating only previous-time signals y_{ℓ,i−1}(N) received from nodes ℓ for which a_{kℓ}(N) > 0. In general, A(N) need not be symmetric. For example, we could have a_{kℓ}(N) > 0 and a_{ℓk}(N) = 0. Accordingly, when we talk of "connected/disconnected pairs", we refer to ordered pairs, with (k, ℓ) being distinct from (ℓ, k).

The stochastic dynamical system in (2) contains different sources of randomness. All involved random variables are assumed to lie in a common probability space (Ω, F, P). One source of randomness is given by the sequence of random graphs G(N), for N = 1, 2, . . . We need to specify which probabilistic mechanism is used to generate this sequence. In the forthcoming sections we will focus on preferential attachment random graphs, described in Sec. III-C.
The combination matrix A(N) is a deterministic function of the graph G(N). Thus, once a graph realization is fixed, the combination matrix becomes deterministic, and the system in (2) evolves according to the randomness of the input signals x_{k,i}(N), which are independent and identically distributed (i.i.d.) w.r.t. node index k, time index i, and network size N. These signals are statistically independent of the sequence of graphs, and, without loss of generality, are assumed to have zero mean and unit variance. The vectors y_0(N) that initialize the recursion (2) are assumed to be square-integrable random vectors with arbitrary distribution, independent of all input signals x_{k,i}(N). They are allowed to depend only on G(N) and, conditionally on G(N), they are independent of all other graphs in the sequence. The particular distribution of y_0(N) will be mostly immaterial for our results, since we will be dealing with the steady-state regime where the number of samples goes to infinity and the initial state does not play a role. Only when we study the sample complexity will we assume a specific distribution for the initial state (see Theorem 3).
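To fix ideas, model (2) can be simulated directly. The following sketch (with an arbitrary small stable combination matrix and Gaussian inputs, both illustration choices not taken from the paper) generates the node signals:

```python
import numpy as np

def simulate_var1(A, T, rng=None):
    """Simulate the first-order VAR model y_i = A y_{i-1} + x_i,
    with i.i.d. zero-mean, unit-variance Gaussian inputs.
    Returns a (T, N) array whose i-th row is y_i."""
    rng = rng or np.random.default_rng(0)
    N = A.shape[0]
    # Stability: eigenvalues of A strictly inside the unit circle
    assert np.max(np.abs(np.linalg.eigvals(A))) < 1
    y = np.zeros(N)                    # arbitrary initial state y_0
    Y = np.empty((T, N))
    for i in range(T):
        y = A @ y + rng.standard_normal(N)
        Y[i] = y
    return Y

# Arbitrary stable combination matrix on 3 nodes
A = np.array([[0.5, 0.2, 0.0],
              [0.2, 0.5, 0.1],
              [0.0, 0.1, 0.5]])
Y = simulate_var1(A, T=1000)
print(Y.shape)   # (1000, 3)
```

Under the stability assumption, the generated trajectory reaches a steady-state regime, which is the regime the subsequent analysis relies on.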

A. Partial Observability
The evolution of the signals y k,i (N ) is dictated by repeated interactions between neighboring nodes, and these interactions are determined by the graph topology. Thus, it is legitimate to ask whether the topology can be inferred from observing the evolution of the signals at the nodes.
However, in practice it is often impossible to collect signals from all nodes, and one can more likely observe signals emitted within a subset of nodes. Given a probed subset P available under partial observability, and an observation time window i = 1, 2, . . . , T, the collection of signals available to perform graph learning will be compactly denoted by:

y_P(T, N) = { y_{k,i}(N) : k ∈ P, i = 1, 2, . . . , T }.   (4)

Our goal is to estimate the interconnections between nodes in P, namely, the topology of the partial graph G_P(N) relative to P, starting from the signals in (4). Formally, we need to build a graph estimator:

Ĝ_P(T, N) = estimate of G_P(N) computed from the samples y_P(T, N),   (5)

where in the notation we emphasized that the properties of the graph estimator will depend on the number of samples and the network size. We will judge the goodness of a graph estimator in the classical asymptotic framework where the network size N goes to infinity, and the sample size T = T_N grows with N. The law that T_N follows will characterize the so-called sample complexity of the estimator, namely, how the number of samples must scale with the network size to provide faithful graph learning. In this analysis, we allow the probed subset to depend on N, namely, we introduce a deterministic sequence of subsets:

S_1, S_2, . . . , S_N, . . . ,   (6)

where S_N is the probed subset of the graph of size N. For ease of notation, when a graph on N nodes or an N × N matrix is evaluated over subset S_N, the subscript N will be omitted; for example, we will write G_S(N) in place of G_{S_N}(N). According to this notation, a graph estimator will be said to be consistent for the family of random graphs G(N) and for the sequence of probed subsets S_N if:

lim_{N→∞} P[ Ĝ_S(T_N, N) = G_S(N) ] = 1.   (7)

One meaningful asymptotic regime to quantify the degree of observability, studied in [29], prescribes that the cardinality of S_N scales with the network size so as to converge to an asymptotic fraction ξ of probed nodes:

|S_N| / N → ξ as N → ∞.   (8)

We will consider this specific regime later in Sec. IV.

B. Achievability of Consistent Graph Learning
In order to build a graph estimator, we start by building an estimator for the combination submatrix A_P(N):

Â_P(T, N) = estimate of A_P(N) computed from the samples y_P(T, N).   (9)

It is seen that the properties of the matrix estimator Â_P(T, N) will depend on the number of samples and the network size. We will follow this roadmap to establish whether graph learning is possible. First, we disregard sample complexity issues and consider the limiting case where an infinite number of samples is ideally available, namely, with T → ∞ and N fixed. The estimator corresponding to an infinite number of samples will be referred to as the limiting estimator Â_P(N), which is formally introduced in Definition 1 further ahead. Second, in Definition 2, we specify when a limiting estimator is successful in recovering the true graph as N → ∞. Finally, in Theorem 1 we come full circle and show that graph learning is achievable when the number of samples grows together with the network size. Let us now introduce the notion of limiting matrix estimator.

Definition 1 (Limiting Matrix Estimator):
We say that a matrix estimator Â_P(T, N) converges to a limiting estimator Â_P(N) if, for any ε > 0:

lim_{T→∞} P[ ‖ Â_P(T, N) − Â_P(N) ‖_max > ε | A(N) ] = 0.   (10)

We notice that: i) the conditional probability in (10) depends on the overall combination matrix A(N); and ii) the limiting estimator Â_P(N) inherits the randomness in A(N), but in (10) we used normal font to denote it since the realization of A(N) is fixed in the pertinent conditional space. Condition (10) is a standard consistency condition, achieved by many empirical estimators that converge as the number of samples scales to infinity.
If a limiting estimator A P (N ) exists, it makes sense to examine whether it allows us to recover faithfully the true graph. To this end, it is useful to introduce the following notion of consistency.
Definition 2 (Universal Local Structural Consistency of the Limiting Matrix Estimator): Consider a family of random graphs G(N) and a deterministic sequence of probed subsets S_N. If there exist a positive sequence c_N and a positive random variable γ defined on the probability space (Ω, F, P), such that:

‖ c_N Â_S(N) − γ G_S(N) ‖_max-off →_p 0,   (11)

then we say that the limiting estimator Â_P(N) achieves universal local structural consistency for the family of graphs G(N) and for the sequence of probed subsets S_N, with scaling sequence c_N and identifiability gap γ.
As a matter of terminology: i) the adjective universal is used because, as we will promptly show, Eq. (11) automatically enables the possibility of recovering the topology by means of unsupervised clustering, i.e., without prior knowledge as regards the network size and other system parameters; ii) the adjective local comes from the observation that the structure of the topology connecting nodes in the probed subset will be faithfully recovered by probing only these particular nodes; and iii) the adjective structural is used because we estimate only the structure (i.e., the support graph) underlying the combination matrix.
By making explicit the definition of · max-off , we see that Definition 2 is equivalent to the definition of Universal Local Structural Consistency used in [6] for the case of null bias (which is of interest for the type of graphs studied here), but for one important novelty: Definition 2 encompasses the possibility that the identifiability gap is random. Notably, we will see that randomness in the identifiability gap does not affect graph learning achievability.
In principle, the nature of the randomness of γ looks rather abstract from the definition. It is therefore useful to explain where this randomness comes from in our specific setting. As we will see in more detail later, the Bollobás-Riordan preferential attachment procedure produces a sequence of graphs of increasing size by adding edges at each step of the procedure. In this way, we obtain a sequence of random graphs defined on the same probability space. The random variable γ will arise as a specific limiting value associated to the sequence of maximum degrees of the graphs obtained during the preferential attachment procedure.
Let us now give some insight into the practical meaning of Definition 2. Since γ is positive, from (11) we can write, for ε > 0 (see footnote 4):

lim_{N→∞} P[ ‖ c_N Â_S(N) − γ G_S(N) ‖_max-off > εγ ] = 0,   (13)

which means that, with high probability as N → ∞, for all k ≠ ℓ:

c_N â_kℓ(N) ∈ ( γ(1 − ε), γ(1 + ε) )  if g_kℓ(N) = 1,
c_N â_kℓ(N) ∈ ( −εγ, εγ )  if g_kℓ(N) = 0.   (14)

We see that, for small ε, the estimated matrix entries are tightly clustered: i) around a positive value γ for connected node pairs, and ii) around zero for disconnected node pairs (see Fig. 1 for a graphical illustration). This property automatically enables the possibility of clustering the entries of matrix Â_S(N) so as to classify connected vs. disconnected pairs, ruling out the trivial case where the probed subgraph is either fully connected or fully disconnected, i.e., where we have only one cluster. It is not difficult to envisage clustering algorithms that can achieve correct classification of the node pairs under condition (14). The choice of a particular clustering algorithm could matter from a practical perspective but, from a theoretical standpoint, it suffices to know that there exists at least one algorithm achieving correct clustering. To show that such an algorithm actually exists, we first need to introduce a formal definition of correct clustering.

Footnote 4: Given a sequence of random variables e(N) vanishing in probability with N, and a strictly positive random variable z, for an arbitrary δ > 0 we have:

P[ e(N) > z ] ≤ P[ z ≤ z_{δ/2} ] + P[ e(N) > z_{δ/2} ] ≤ δ,   (12)

where z_{δ/2} > 0 is chosen such that P[z ≤ z_{δ/2}] ≤ δ/2 (a condition that can be met for any δ since P[z ≤ 0] = 0), and the last inequality holds for sufficiently large N since e(N) vanishes in probability. The arbitrariness of δ implies that lim_{N→∞} P[e(N) > z] = 0, and (13) follows from (11) by setting e(N) = ‖c_N Â_S(N) − γ G_S(N)‖_max-off and z = εγ.
Definition 3 (Correct Clustering): Let A = [α_kℓ] be an S × S matrix with nonnegative entries, and let G be the adjacency matrix corresponding to the support graph of A. The diagonal entries of G are zero by convention (since we are not interested in self-loops).
Let Â = [α̂_kℓ] be an estimated matrix fulfilling, for certain positive values ε and a, and for all k ≠ ℓ:

α̂_kℓ ∈ [ (1 − ε)a, (1 + ε)a ]  if α_kℓ > 0,
α̂_kℓ ∈ [ 0, εa ]  if α_kℓ = 0,   (15)

and let graphclu(·) be a clustering algorithm that computes an estimated adjacency matrix Ĝ = graphclu(Â). Then, algorithm graphclu(·) will be said to be correct if, when the sets of connected and disconnected pairs in G are both non-empty, for ε sufficiently small and independently of the value of a, we have Ĝ = G.
As an example of clustering that matches Definition 3, let us consider an algorithm that computes the midpoint between the maximum and minimum off-diagonal entries of the estimated matrix:

τ = ( max_{k≠ℓ} α̂_kℓ + min_{k≠ℓ} α̂_kℓ ) / 2,   (16)

and classifies pair (k, ℓ) as connected if α̂_kℓ lies above τ, and as disconnected otherwise. Using the bounds in (15) we can write:

(1 − ε)a / 2 ≤ τ ≤ (1 + 2ε)a / 2.   (17)

Accordingly, correct clustering will surely be performed if the lowest admissible value (1 − ε)a for the connected pairs lies above the threshold, namely, if

(1 − ε)a > (1 + 2ε)a / 2, i.e., ε < 1/4,   (18)

and if the highest admissible value εa for the disconnected pairs lies below the threshold, namely, if

εa < (1 − ε)a / 2, i.e., ε < 1/3.   (19)

In summary, the simple algorithm that employs an intermediate threshold to separate the clusters matches Definition 3, provided that ε < 1/4. However, it was shown in [33] that such an algorithm, even if provably consistent as the network size scales to infinity, might suffer from finite-network-size effects. For this reason, another clustering algorithm was proposed in [33], namely, a modified k-means algorithm that: i) first finds all admissible k-means (with k = 2) cluster configurations, i.e., the configurations guaranteeing that the midpoint between the cluster centroids separates the clusters [40]; and ii) then selects the configuration featuring the largest distance between centroids. It was shown in [33] that this algorithm satisfies Definition 3 for any ε < 1/6. From the above arguments, we conclude that a limiting estimator fulfilling Definition 2, paired with an algorithm graphclu(·) fulfilling Definition 3, achieves correct classification. However, in practice we need to establish that consistent graph learning is achieved by sample estimators, rather than by limiting estimators. The next theorem fills this gap, revealing that looking at the properties of the limiting estimator is in fact sufficient.
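The midpoint rule just described can be sketched as follows (a minimal illustration; the synthetic matrix entries stand in for c_N Â_S(N), clustered around γ = 1 for connected pairs and around 0 for disconnected pairs, with ε = 0.2 < 1/4):

```python
import numpy as np

def graphclu_midpoint(A_hat):
    """Classify off-diagonal entries as connected (1) vs. disconnected (0),
    using as threshold the midpoint between the largest and smallest
    off-diagonal entries of the estimated matrix."""
    S = A_hat.shape[0]
    off = ~np.eye(S, dtype=bool)                       # off-diagonal mask
    tau = (A_hat[off].max() + A_hat[off].min()) / 2    # midpoint threshold
    return ((A_hat > tau) & off).astype(int)

# Synthetic estimated entries: connected pairs near gamma = 1,
# disconnected pairs near 0, perturbations within eps*gamma = 0.2
A_hat = np.array([[0.0, 0.8, 0.1],
                  [1.2, 0.0, 0.0],
                  [0.2, 1.0, 0.0]])
print(graphclu_midpoint(A_hat))   # recovers the (directed) support
```

Note that the recovered adjacency matrix need not be symmetric, consistently with the ordered-pair convention adopted in Sec. II.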
Theorem 1 (Universal Local Structural Consistency of the Limiting Matrix Estimator Implies Consistency of the Graph Estimator): Let Â_P(T, N) be an estimator that converges to the limiting matrix estimator Â_P(N) as T → ∞ according to Definition 1. Consider a family of graphs G(N) and a sequence of probed subsets S_N such that the probability that G_S(N) is either fully connected or fully disconnected vanishes as N → ∞. Let the limiting matrix estimator achieve universal local structural consistency according to Definition 2, and let graphclu(·) be a correct clustering algorithm according to Definition 3. Then, the graph estimator:

Ĝ_P(T, N) = graphclu( Â_P(T, N) )   (20)

is consistent for the family of graphs G(N) and for the sequence of probed subsets S_N, namely, there exists a scaling law T_N that ensures:

lim_{N→∞} P[ Ĝ_S(T_N, N) = G_S(N) ] = 1.   (21)

Proof: See Appendix A.

C. Granger Estimator
Preliminarily, it is necessary to introduce the steady-state covariance matrix and the one-lag covariance matrix corresponding to model (2), which are, respectively:

R_0(N) = lim_{i→∞} E[ y_i(N) y_i(N)^T ],   (22)
R_1(N) = lim_{i→∞} E[ y_i(N) y_{i−1}(N)^T ],   (23)

where bold notation for the covariance matrices is used to encompass randomness of the underlying graph. Under model (2) (with a stable matrix A(N)), it is known that the covariance matrix is the solution to the discrete-time Lyapunov equation:

R_0(N) = A(N) R_0(N) A(N)^T + I_N.   (24)

Moreover, by exploiting (2) it is readily seen that:

R_1(N) = A(N) R_0(N),   (25)

which implies the following inversion formula:

A(N) = R_1(N) [R_0(N)]^{−1}.   (26)

Since covariance matrices can be faithfully estimated through sample covariance matrices as the number of samples increases, Eq. (26) suggests that the true matrix A_P(N) can actually be estimated from the samples, which would imply that consistent graph learning is trivially possible. However, under partial observability we can only compute the covariance matrices over the probed subset, [R_0(N)]_P and [R_1(N)]_P. As a result, computation of the inversion formula (26) is impaired by the unavailability of signals from the latent nodes. Nevertheless, the limiting estimator (which actually depends only on the covariances over the probed subset):

Â_P(N) = [R_1(N)]_P ( [R_0(N)]_P )^{−1}   (27)

is a meaningful choice that, in the context of Granger causality, is referred to as the Granger estimator or predictor, and attempts to provide the best linear prediction of the future samples from the past one-lag samples over the probed subset. On the other hand, from elementary matrix algebra we know that, in general:

[R_1(N)]_P ( [R_0(N)]_P )^{−1} ≠ [ R_1(N) (R_0(N))^{−1} ]_P = A_P(N),   (28)

which gives rise to the question of whether Â_P(N) can still be profitably used to estimate graph G_P(N). We will answer this question in Sec. IV, with reference to our setting of preferential attachment Bollobás-Riordan graphs.
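The latent-node bias behind (28) can be made concrete numerically. The following sketch (with an arbitrary stable matrix and probed set, chosen for illustration) solves the Lyapunov equation exactly via vectorization and compares the limiting Granger estimate with the true submatrix A_P:

```python
import numpy as np

# A small stable combination matrix; node 2 will act as the latent node.
# The specific values are arbitrary illustration choices.
A = np.array([[0.5, 0.2, 0.1],
              [0.0, 0.4, 0.3],
              [0.2, 0.0, 0.4]])
N = A.shape[0]

# Solve the discrete-time Lyapunov equation R0 = A R0 A^T + I
# via vectorization: (I - A kron A) vec(R0) = vec(I).
R0 = np.linalg.solve(np.eye(N * N) - np.kron(A, A),
                     np.eye(N).flatten()).reshape(N, N)
R1 = A @ R0                          # one-lag covariance: R1 = A R0

P = [0, 1]                           # probed subset; node 2 is latent
A_P = A[np.ix_(P, P)]                # true combination submatrix
granger = R1[np.ix_(P, P)] @ np.linalg.inv(R0[np.ix_(P, P)])

# The limiting Granger estimator differs from A_P: the latent node
# introduces a bias, which is the crux of partial observability.
print(np.round(A_P, 3))
print(np.round(granger, 3))
```

The question addressed in the sequel is precisely whether, despite this bias, the Granger entries still cluster so as to reveal the support of A_P.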
Finally, we introduce the sample Granger estimator:

Â_P(T, N) = [R̂_1(T, N)]_P ( [R̂_0(T, N)]_P )^{−1},   (29)

which replaces the true covariance matrices appearing in (27) with the sample covariance matrices R̂_0(T, N) and R̂_1(T, N), whose (k, ℓ)-entries are defined as:

[R̂_0(T, N)]_kℓ = (1/T) Σ_{i=1}^{T} y_{k,i}(N) y_{ℓ,i}(N),
[R̂_1(T, N)]_kℓ = (1/T) Σ_{i=1}^{T} y_{k,i}(N) y_{ℓ,i−1}(N).   (30)

By ergodicity, the sample Granger estimator converges to the limiting estimator in (27) in the sense of Definition 1.
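The sample Granger estimator can be sketched directly from the definitions of the sample covariances (a minimal illustration with arbitrary parameters, not taken from the paper; the inline VAR simulation mirrors model (2)):

```python
import numpy as np

def granger_estimator(Y, probed):
    """Sample Granger estimator [R1_hat]_P ([R0_hat]_P)^{-1}, computed
    only from the signals of the probed nodes. Y has shape (T, N)."""
    Yp = Y[:, probed]                       # observed samples only
    T = Yp.shape[0]
    R0 = Yp[1:].T @ Yp[1:] / (T - 1)        # sample covariance over P
    R1 = Yp[1:].T @ Yp[:-1] / (T - 1)       # sample one-lag covariance
    return R1 @ np.linalg.inv(R0)

# Simulate model (2) on a small stable matrix (arbitrary values)
rng = np.random.default_rng(1)
A = np.array([[0.5, 0.2, 0.1],
              [0.0, 0.4, 0.3],
              [0.2, 0.0, 0.4]])
T = 20000
y = np.zeros(3)
Y = np.empty((T, 3))
for i in range(T):
    y = A @ y + rng.standard_normal(3)
    Y[i] = y

A_hat = granger_estimator(Y, probed=[0, 1])  # partial observability
print(np.round(A_hat, 2))
```

Under full observability (probed = all nodes), the same routine converges to the true A by ergodicity; under partial observability it converges to the limiting Granger estimator in (27).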

D. Laplacian Combination Matrix
Once a network graph G(N) is constructed, it is necessary to define a policy to assign the combination matrix A(N). Combination matrices arise across several domains, including distributed optimization, adaptation and learning over networks, social learning, and network stochastic control. In all these domains, the system designer is called to run an algorithm (e.g., for optimization, learning, control) over a certain network topology. To this end, he/she devises a distributed procedure where the network nodes locally exchange information according to a certain combination matrix. One popular choice in these contexts is the Laplacian policy [42], [43], [44], [45], [46], [47]. The graph Laplacian L(N) is a matrix whose off-diagonal (k, ℓ)-entry is −1 for connected pairs (k, ℓ) and 0 otherwise, and whose k-th main diagonal entry is the degree of node k. Starting from L(N), the Laplacian combination matrix is defined as:

A(N) = ρ ( I_N − λ L(N) / μ_G(N) ),   (31)

where μ_G(N) is the maximal degree of G(N), λ ≤ 1 is a positive parameter that tunes the relative importance of the self-weights, and ρ < 1 is a positive parameter that grants stability of the dynamical system in (2) (see [41]). The Laplacian combination rule can be conveniently described in terms of the individual entries as follows, for k ≠ ℓ:

a_kℓ(N) = ρ λ / μ_G(N) if g_kℓ(N) = 1, and a_kℓ(N) = 0 otherwise,   (32)

with diagonal entries a_kk(N) = ρ ( 1 − λ d_k(N) / μ_G(N) ), where d_k(N) is the degree of node k.

III. BOLLOBÁS-RIORDAN MODEL

Preferential attachment graphs are typically obtained through an iterative process that goes as follows. Starting from a graph with a certain structure, at each subsequent iteration one node is added, along with some edges connecting this node to the graph constructed up to that iteration. The term "preferential attachment" is used because the probability that the new node is connected to an existing node is proportional to the degree of the latter.
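A minimal sketch of this construction, assuming the common instantiation A = ρ(I − λL/μ) with off-diagonal entries ρλ/μ on edges and diagonal entries ρ(1 − λ d_k/μ):

```python
import numpy as np

def laplacian_combination(G, lam=0.5, rho=0.9):
    """Build a Laplacian combination matrix A = rho * (I - lam * L / mu)
    from a simple-graph adjacency matrix G, where L = diag(d) - G is the
    graph Laplacian and mu is the maximal degree."""
    d = G.sum(axis=1)              # node degrees
    mu = d.max()                   # maximal degree mu_G(N)
    L = np.diag(d) - G             # graph Laplacian
    return rho * (np.eye(G.shape[0]) - lam * L / mu)

# Path graph on 3 nodes: 0 - 1 - 2
G = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
A = laplacian_combination(G)
print(np.round(A, 3))
print(np.max(np.abs(np.linalg.eigvals(A))) < 1)   # True: rho < 1 grants stability
```

Since the eigenvalues of the Laplacian lie in [0, 2μ], the matrix I − λL/μ has spectrum in [1 − 2λ, 1] for λ ≤ 1, and the factor ρ < 1 then pulls all eigenvalues strictly inside the unit circle, matching the stability requirement of Sec. II.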
Therefore, nodes that have already experienced a large amount of connections are favored, giving rise to a dichotomy in the network, where some nodes emerge as hubs with most of the connections, whereas the remaining nodes become peripheral and feature few connections.
The way to build a preferential attachment model is not unique. Since the pioneering work [36], several preferential attachment models have been proposed. One of the most popular variants is the Bollobás-Riordan random graph, which is the model examined in this work [37], [38]. The Bollobás-Riordan model provides an elegant mathematical formulation that allows one to capture many features of real-world networks and to obtain clean analytical results for useful graph descriptors (e.g., node degrees, minimum and maximal degrees, centrality measures). Let us delve into the mathematical description of the Bollobás-Riordan model [37], [38].
First of all, a Bollobás-Riordan graph is a multigraph, which means that multiple self-loops and multiple edges are permitted. A random multigraph of size n will be denoted by M(n) and its adjacency matrix by M (n). Matrix M (n) is the symmetric (since Bollobás-Riordan graphs are undirected) n × n matrix whose off-diagonal (k, )-entry m k (n) is the number of edges between nodes k and , and whose diagonal entry m kk (n) is the number of self-loops of node k.
Then, the Bollobás-Riordan preferential-attachment model with parameter η ∈ N generates iteratively a random sequence of multigraphs M(n), for n = 1, 2, . . . , according to the following procedure -see Fig. 2 for a graphical illustration. The initial multigraph M(1) is a deterministic multigraph with one node and η self-loops. Multigraph M(n) is constructed starting from M(n − 1) by adding a new node n and η new connections (edges or self-loops). Specifically, η steps are performed, and at each step node n is connected to a node randomly chosen from the set {1, 2, . . . , n}. The intermediate multigraph obtained at steps s = 1, 2, . . . , η, is denoted by M(n; s). Accordingly, since after η steps we obtain the updated multigraph M(n), we have the identity M(n; η) = M(n). Likewise, we have M(n; 0) = M(n − 1).
Exploiting the procedure shown in Fig. 2 we can argue that the adjacency matrices possess the following structure: . . .
with M (1) = η. In fact, when passing from M(n − 1) to M(n) we simply attach η new edges to the new node n, so that the number of edges m k (n − 1) between any pair of nodes k, < n remains unaltered. In comparison, the adjacency matrix entries relative to the fresh node n evolve according to the following rule: where I(·) is the indicator function (which is equal to 1 if its argument is true and is zero otherwise) and we denote by v(n; s) the particular node that becomes connected to n through the edge introduced at step s. For this reason, for any k ≤ ≤ n, the (k, )-entry of the adjacency matrix is actually determined only at iteration , and, hence, it makes sense to drop the dependence on n and write:

A. Node Degrees and Preferential Attachment Rule
We adopt the standard convention that the degree of node k, denoted by d_{M,k}(n), is the number of edges connected to k plus twice the number of self-loops [48] (see footnote 5):

d_{M,k}(n) = Σ_{ℓ≠k} m_kℓ(n) + 2 m_kk(n).   (36)

Likewise, we denote by d_{M,k}(n; s) the degree of node k in the intermediate multigraph M(n; s). At each step s, the degree of a node k ≠ n in the intermediate multigraph M(n; s) increases by 1 if the node picked at step s is equal to k, namely:

d_{M,k}(n; s) = d_{M,k}(n; s−1) + I( v(n; s) = k ),   k < n.   (37)

In comparison, the degree of the new node n increases by 1 if the node picked at step s is some k < n, while it increases by 2 if node n itself is picked, since each self-loop is counted twice in the degree:

d_{M,n}(n; s) = d_{M,n}(n; s−1) + 1 + I( v(n; s) = n ),   (38)

with the initialization d_{M,n}(n; 0) = 0.
The description of the multigraph construction is now completed by assigning the probability that a particular node is picked. Consider first the probability that the new node n is attached to an existing node k < n, namely:

P[ v(n; s) = k | M(n; s−1) ] = d_{M,k}(n; s−1) / ( Σ_{ℓ=1}^{n} d_{M,ℓ}(n; s−1) + 1 ),   k < n.   (39)

Let us ignore for now the term 1 appearing in the denominator. We see that the probability mass function in (39) matches well the preferential attachment paradigm, since nodes with higher degrees in M(n; s−1) are more likely to be connected to the incoming node n, and so their degrees are more likely to increase further as the multigraph construction proceeds, according to "the rich get richer" philosophy.

Footnote 5: If we sum all degrees over index k in (36), each edge is counted twice (because m_kℓ = m_ℓk), and each self-loop is counted twice (because of the factor 2). As a result, with the adopted convention, the half-sum of the degrees in the multigraph is exactly equal to the total number of edges and self-loops.
We switch to the probability that a self-loop is created on the new node n:

P[ v(n; s) = n | M(n; s−1) ] = ( d_{M,n}(n; s−1) + 1 ) / ( Σ_{ℓ=1}^{n} d_{M,ℓ}(n; s−1) + 1 ).   (40)

The term 1 in the numerator corresponds to first attaching one end of a new edge to the new node and updating the degree of that node before attaching the other end of the edge [37], [38]. Note that this term grants a nonzero probability of self-loops (we recall that d_{M,n}(n; 0) = 0) when the new node enters the system. The term 1 in the denominator is necessary to get an admissible probability mass function, i.e., to let the sum of the probabilities in (39) and (40) be equal to 1.
It is useful to provide a more explicit representation for the denominator in (39) and (40). Since we know (see footnote 5) that the half-sum of all degrees is equal to the total number of edges and self-loops, and since the Bollobás-Riordan construction adds exactly one new connection at each step (η per added node), we get the following equality:

Σ_{ℓ=1}^{n} d_{M,ℓ}(n; s−1) = 2 ( η(n−1) + s − 1 ),   (41)

which reveals that the denominator of the preferential attachment probability is a deterministic quantity. Finally, by merging (39) and (40) in a single equation, and using (41) to represent the denominator, we get, for all k = 1, 2, . . . , n, the compact representation:

P[ v(n; s) = k | M(n; s−1) ] = ( d_{M,k}(n; s−1) + δ_kn ) / ( 2η(n−1) + 2(s−1) + 1 ),   (42)

where δ_kn is the Kronecker delta.
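The construction above can be transcribed almost literally into code. The following sketch (0-indexed nodes; indexing conventions are the only departure from the description above) generates a Bollobás-Riordan multigraph by sampling from the attachment rule at each step:

```python
import numpy as np

def bollobas_riordan(N, eta, rng=None):
    """Generate a Bollobas-Riordan multigraph on N nodes with parameter eta.
    Returns the symmetric adjacency matrix M: M[k, l] counts the edges
    between k and l, and M[k, k] counts the self-loops of node k."""
    rng = rng or np.random.default_rng(0)
    M = np.zeros((N, N), dtype=int)
    deg = np.zeros(N, dtype=int)
    M[0, 0] = eta                  # M(1): a single node with eta self-loops
    deg[0] = 2 * eta               # each self-loop counts twice in the degree
    for n in range(1, N):          # attach node n with eta new connections
        for s in range(1, eta + 1):
            # Deterministic denominator of the attachment probability:
            # with 0-indexing, node n is the paper's node n+1.
            denom = 2 * eta * n + 2 * (s - 1) + 1
            w = deg[:n + 1].astype(float)
            w[n] += 1.0            # the "+1" self-loop term for the new node
            v = rng.choice(n + 1, p=w / denom)
            if v == n:             # self-loop on the new node
                M[n, n] += 1
                deg[n] += 2
            else:                  # edge between the new node and node v
                M[n, v] += 1
                M[v, n] += 1
                deg[n] += 1
                deg[v] += 1
    return M

M = bollobas_riordan(50, 2)
# Total connections (edges + self-loops) equal eta * N by construction
print((M.sum() + np.trace(M)) // 2)   # 100
```

Note that the weights sum exactly to the deterministic denominator, so no renormalization is needed: the half-sum of the degrees plus 1 is precisely 2η(n−1) + 2(s−1) + 1.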

B. Maximal Degree
One fundamental graph descriptor that will play a critical role in our analysis is the maximal degree μ_M(N). In particular, we will rely on the asymptotic growth of the maximal degree with the network size N, which, as formally stated in Appendix B, was found to be on the order of √N, in the following rigorous sense [49, Th. 8.14, p. 280]:

μ_M(N) / √N →_a.s. μ,   (43)

where →_a.s. denotes almost-sure convergence as N → ∞, and μ is a certain positive random variable.
The square-root growth of the maximal degree in a Bollobás-Riordan graph can be related to the well-known power-law or scale-free behavior of these graphs. The power-law decay refers to the average number of nodes with degree equal to d, which was shown to scale as an inverse power of d, precisely as d^{−3} for Bollobás-Riordan graphs. It was shown in [38] that such heavy-tailed behavior, as opposed, for instance, to the exponential tail corresponding to an Erdős-Rényi graph, reflects into a faster growth of the maximal degree, namely, the √N growth prescribed by (43).

C. From Multigraph M(N ) to Graph G(N )
The multigraph structure was chosen by Bollobás and Riordan because it was instrumental to prove a number of theoretical results [37], [38]. The final goal of their model, however, was to construct a standard (i.e., simple) graph, with single edges and no self-loops. Actually, the multigraphs generated according to the Bollobás-Riordan model are asymptotically close to simple graphs, since it is possible to show that the fraction of edges that are either repetitions or self-loops vanishes as N grows, as formally stated in the following lemma. Before stating the lemma, it is useful to notice that, by construction, the number of edges in M(n) is equal to ηn, since we start with η self-loops in M(1) and add η new edges at a time.

Lemma 1 (Equivalence Between Bollobás-Riordan Multigraphs and Simple Graphs):
Let M(N) be a Bollobás-Riordan multigraph of size N, and let $m_{\rm sl}(N)$ and $m_{\rm re}(N)$ be the number of self-loops and redundant edges, respectively. Then, the number of self-loops and redundant edges is asymptotically negligible w.r.t. the total number of connections ηN, namely, the ratios $m_{\rm sl}(N)/(\eta N)$ and $m_{\rm re}(N)/(\eta N)$ vanish as N → ∞.

Proof: See Appendix C.

According to Lemma 1, it makes sense to introduce the simple⁶ graph G(N) associated to a multigraph M(N), obtained by uprooting all self-loops and redundant edges from M(N). The entries of the adjacency matrix G(N) of graph G(N) are defined in (45); likewise, the degree of node k in G(n) and the corresponding maximal degree are defined in (46) and (47), respectively. The equivalence between M(N) and G(N) holds also in terms of maximal degrees, as stated in the following lemma.

Lemma 2: Let μ_M(N) and μ_G(N) be the maximal degrees of the multigraph M(N) and of the associated simple graph G(N), respectively. We have that the two maximal degrees are asymptotically equivalent, which further implies:

$$\frac{\mu_G(N)}{\sqrt{N}} \stackrel{\text{a.s.}}{\longrightarrow} \mu, \qquad (48)$$

where μ is the same limiting variable introduced in (43).
Proof: See Appendix C. In the following, we will refer to the graph G(N) as the Bollobás-Riordan simple graph, or simply as the Bollobás-Riordan graph.
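In code, passing from a multigraph edge list to the associated simple graph is a one-liner (a sketch with our own function name; the uprooting rule is exactly the one described above):

```python
def to_simple_graph(edges):
    """Collapse a multigraph edge list into a simple graph: discard
    self-loops and keep a single copy of each repeated edge."""
    return {(min(k, n), max(k, n)) for k, n in edges if k != n}
```

By Lemma 1, the discarded connections are a vanishing fraction of the ηN total, so for large N the simple graph retains essentially all of the multigraph structure.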

IV. STRUCTURAL CONSISTENCY AND SAMPLE COMPLEXITY OF THE GRANGER ESTIMATOR
In this section we collect the main results established in this work. First, we ascertain that the limiting Granger estimator achieves consistent graph learning under partial observability.
Theorem 2 (Universal Local Structural Consistency of the Limiting Granger Estimator): Let us consider the dynamical system (2), with Laplacian combination matrix as in (32), and with network graph G(N) being a simple graph obtained from a Bollobás-Riordan multigraph M(N) with step parameter η. Then, the limiting Granger estimator in (27) achieves universal local structural consistency according to Definition 2 for any sequence of probed subsets S_N, with scaling sequence √N and with the identifiability gap given in (49), where μ is given by (43).
Proof: See Appendix F.

Equation (43) reveals that the limiting (scaled) maximal degree μ is an intrinsic property of the specific Bollobás-Riordan multigraph instance. In other words, as the Bollobás-Riordan multigraph construction progresses, the maximal degree, scaled by √N, tends to become stable, and converges to a certain value μ. However, this value is random, implying that if we repeated the Bollobás-Riordan multigraph construction with the same parameters, we would obtain a different value for μ. In view of (49), this implies that the value of the identifiability gap that is critical for graph learning purposes is random as well, i.e., it depends on the particular graph sequence. This is a fundamental difference that distinguishes the behavior of Bollobás-Riordan graphs from that of Erdős-Rényi graphs, where the identifiability gap is instead deterministic and independent of the particular graph realization. However, and remarkably, we already know from Theorem 1 that randomness of the gap does not impair the possibility of consistent graph recovery.⁷

⁷ We observe that the coupling between Bollobás-Riordan graphs (which are undirected) and the Laplacian rule gives rise to a symmetric combination matrix. Under this assumption, the series for the covariance matrix in (24) can be computed as (I denotes the N × N identity matrix):

$$\sum_{i=0}^{\infty} A^{2i}(N) = \big[I - A^2(N)\big]^{-1}, \qquad (50)$$

whose structure is exploited in the proofs of our results. While symmetry is useful to develop these technical arguments, we remark that the Granger estimator is based upon relation (25), which does not rely on symmetry at all. This observation, along with the series structure in (24) that is similar to the one exploited in Appendix F for the symmetric case, suggests that the Granger estimator can work also with non-symmetric matrices, as we will show in Sec. V by examining a directed version of Bollobás-Riordan graphs.
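The series representation of the covariance matrix for a symmetric, stable combination matrix is easy to check numerically. The sketch below (our own small example; the series-summation idea follows (24)) compares a truncated series $\sum_i A^i (A^i)^\top$ against the closed form $(I - A^2)^{-1}$, to which it collapses when A is symmetric:

```python
import numpy as np

# a small symmetric combination-like matrix with spectral radius 0.4 < 1
rng = np.random.default_rng(0)
S = rng.standard_normal((5, 5))
S = S + S.T
A = 0.4 * S / np.abs(np.linalg.eigvalsh(S)).max()

# stationary covariance of y_i = A y_{i-1} + x_i with unit-variance input:
# R = sum_i A^i (A^i)^T, which for symmetric A equals (I - A^2)^{-1}
R_series = np.zeros_like(A)
P = np.eye(5)
for _ in range(200):
    R_series += P @ P.T
    P = A @ P
R_closed = np.linalg.inv(np.eye(5) - A @ A)
```

The two matrices agree to numerical precision, since the truncation error decays geometrically with the square of the spectral radius.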
In order to conclude the proof of graph learning achievability, we need to establish that the sample Granger estimator in (29) achieves consistent learning. To this end, we notice that the sample Granger estimator converges to the limiting Granger estimator by ergodicity. Then, in view of Theorem 1, the claim in Theorem 2 implies that the graph estimator in (20) is consistent for any sequence S_N such that the sequence of graphs G_S(N) does not end up being trivial (i.e., fully connected or fully disconnected) as N → ∞. As shown in Lemma 6 (Appendix B), when the cardinality of S_N scales linearly with N as in (8), these pathological situations are ruled out with high probability as N → ∞, for any nonzero fraction ξ of probed nodes.
However, the question of how the number of samples T must scale with N to achieve consistent learning is still open. To answer this question we need to focus on sample complexity. First of all, when dealing with covariance-based estimators, one source of sample complexity comes from how well-conditioned these matrices are. For this reason, it is useful to replace (29) with its regularized counterpart, namely, a matrix A P (T, N ) constructed as follows. For k ∈ P, the k-th row of A P (T, N ) is a solution to the constrained optimization problem (here x ∈ R |P| is a row vector, and, for a matrix M , the notation [M ] kP denotes the k-th row of submatrix M P ): We remark that, when the sample covariance matrix is invertible, the non-regularized Granger estimator in (29) is the only matrix that yields a zero residual in (51). As a result, when the non-regularized Granger estimator fulfills the constraint in (51), the two estimators coincide.
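To fix ideas, here is a minimal Python sketch of a covariance-based Granger-type estimator, checked on a fully observed toy VAR(1) process. This is our own simplified rendering: the exact estimator (29), its partial-observability indexing, and the regularization in (51) are specified in the paper:

```python
import numpy as np

def sample_granger(Y, probed):
    """Covariance-based estimator over the probed subset: lag-1 sample
    covariance of the probed signals times the inverse of their lag-0
    sample covariance. Y has shape (T, N); probed is a list of indices."""
    Z = Y[:, probed]
    T = Z.shape[0] - 1
    R0 = Z[:-1].T @ Z[:-1] / T      # lag-0 sample covariance
    R1 = Z[1:].T @ Z[:-1] / T       # lag-1 sample covariance
    return R1 @ np.linalg.inv(R0)

# sanity check: fully observed VAR(1) with a known diagonal matrix
rng = np.random.default_rng(0)
A_true = 0.5 * np.eye(4)
Y = np.zeros((20000, 4))
for i in range(1, 20000):
    Y[i] = A_true @ Y[i - 1] + rng.standard_normal(4)
A_hat = sample_granger(Y, [0, 1, 2, 3])
```

With all nodes probed and T large, the estimate approaches the true combination matrix; under partial observability, the theorems above characterize when the probed block can still be classified correctly.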
We are ready to state our sample complexity result. Following [24], [25], [26], [27], the analysis is performed under the following classical assumption on the initialization of system (2).
Assumption 1 (Stationary Gaussian VAR): For the sample complexity analysis, we assume that the input signals x_{k,i}(N) are standard Gaussian variables (independent w.r.t. node index k, time index i, and network size N). Under these conditions, given a realization A(N) = A of the combination matrix, the VAR process in (2) admits a Gaussian stationary distribution (which is a function of A) [24], [39]. We assume that, given A(N) = A, the initial vector y_0(N) is distributed according to the stationary distribution.
Theorem 3 (Sample Complexity of Regularized Granger Estimator): Let us consider the dynamical system (2) operating under Assumption 1, with Laplacian combination matrix as in (32), and with network graph G(N) being a simple graph obtained from a Bollobás-Riordan multigraph M(N) with step parameter η. Let the number of samples T_N grow with N as:

$$T_N = \omega_N \, N \log N, \qquad (52)$$

where ω_N can be chosen as a positive sequence diverging in an arbitrarily slow fashion. Then, the graph estimator (20) obtained by using the regularized Granger estimator in (51) is consistent for any sequence of probed subsets S_N fulfilling (8).
Proof: See Appendix G. Let us comment on the main ramifications of Theorem 3. First of all, since the sequence ω N can grow in an arbitrarily slow fashion, any sample complexity that scales slightly faster than N log N achieves consistency. Therefore, the bottom line of Theorem 3 is that the sample complexity of the proposed estimator is essentially linear. Let us now see where this linear law originates from.
According to the Laplacian matrix structure in (32), the growth of the maximal degree determines the way the nonzero entries of the combination matrix shrink as N → ∞. The smaller they are, the higher is the precision required by the sample estimators to distinguish the nonzero entries from the zero entries. For this reason, faster scaling laws of the maximal degree become more demanding in terms of number of samples. This argument can be made rigorous, and is in fact exploited in the proof of Theorem 3 to show that the sample complexity scales essentially (i.e., up to a log N factor) as μ_G²(N). As a result, the √N growth of the maximal degree over Bollobás-Riordan graphs reflects into a final sample complexity that is essentially linear in N.
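Schematically, the chain of scalings described above reads (up to constants, and with the log N factor mentioned in the text):

```latex
T_N \;\propto\; \mu_G^2(N)\,\log N \;\sim\; \big(\sqrt{N}\big)^{2} \log N \;=\; N\log N,
```

which is the essentially linear law of Theorem 3.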
In summary, from a technical viewpoint we conclude that the main factor influencing sample complexity is the maximal degree of the graph. However, it is useful to relate this behavior to more "physical" attributes of the system, to capture the factors that play a dominant role in sample complexity. One important attribute of Bollobás-Riordan graphs is sparsity. Bollobás-Riordan graphs are very sparsely connected, since, out of N(N − 1)/2 possible edges, only ηN edges are drawn, which results in a sparsity ratio (no. of drawn edges over total no. of possible edges) scaling as 1/N. The sporadic presence of connections might suggest that the nodes have small degree, which, in light of the previous discussion, would suggest a slow growth of the maximal degree. However, this conclusion is not precise. To understand why, it is useful to contrast Bollobás-Riordan graphs against Erdős-Rényi graphs. We consider in particular Erdős-Rényi graphs under the uniform concentration regime, because under this regime results on sample complexity are available [6], [33]. Erdős-Rényi graphs are built homogeneously (i.e., the presence/absence of edges is established in an i.i.d. manner). This homogeneity implies in particular that the maximal and average degrees of an Erdős-Rényi graph scale comparably, i.e., μ_G(N) ∼ Np, where p is the connection probability. Accordingly, over sparse Erdős-Rényi graphs where p ≈ (log N)/N, the sample complexity is polylogarithmic in N, whereas over dense graphs with constant p it is almost quadratic in N.
Let us see what happens over Bollobás-Riordan graphs. Notably, the latter graphs are sparser than the sparsest connected Erdős-Rényi graphs! In fact, we observed that the sparsity ratio of Bollobás-Riordan graphs is 1/N , whereas for sparse connected Erdős-Rényi graphs we have a sparsity ratio given by the connection probability ≈ (log N )/N . However, despite such increased sparsity, the maximal degree of Bollobás-Riordan graphs grows as √ N , namely, faster than the logarithmic law characterizing sparse Erdős-Rényi graphs. This difference must be ascribed to the fact that Bollobás-Riordan graphs are highly inhomogeneous and, hence, even with a small number of overall connections, there are nodes with a very large number of neighbors, inducing a faster growth of the maximal degree.

A. Synthetic Data
We start by illustrating the significance of Theorem 2 by examining the behavior of the limiting Granger estimator on synthetic data.
The two panels in Fig. 3 display the pattern exhibited by the entries of the limiting Granger estimator, A_P(N), for two realizations of the random graph G(N), with a probed subset P containing half the nodes of the entire network. For both realizations, the entries of A_P(N) are scaled by √N and, for clarity of visualization, vectorized and rearranged so that the entries corresponding to disconnected nodes come first. The vertical arrow displays the gap γ, which was estimated using (43), (47) and (49), with reference to the pertinent graph topology shown in the figure. The following notable effects are observed. First, in perfect accordance with Theorem 2, we see the emergence of an identifiability gap that clearly separates the entries corresponding to disconnected node pairs from the entries corresponding to connected node pairs. Second, clustering is definitely visible: the entries pertaining to disconnected nodes cluster around zero, whereas the entries corresponding to connected nodes cluster around γ, the limiting value displayed by the vertical arrow. Last but not least, by comparing side-by-side the panels in Fig. 3, we see that the two realizations correspond to different values of the gap γ, which confirms that this gap is in fact random.
We see from (49) that the limiting random variable μ is one fundamental ingredient of the identifiability gap. It is therefore useful to examine the statistical distribution of μ, and in particular its behavior in comparison to the finite-size (scaled) maximal degree μ_G(N)/√N (see (48)). To this end, in Fig. 4 we display: i) the empirical histograms of the scaled maximal degree, for three values of N (first three panels from the left); and ii) the empirical histogram corresponding to the limiting variable μ (rightmost panel). To obtain the latter histogram, we exploit the following result.
Theorem 17 in [37]: Let p_1, p_2, . . . be the points of a Poisson process with rate η, i.e., with rate equal to the number of new edges added at each iteration of the Bollobás-Riordan procedure (see Sec. III). The limiting random variable in (64) is then equal to the expression in (53), where the auxiliary quantities, defined for n = 0, 1, . . ., are given in (54).

According to this theorem, we simulate a Poisson process with rate η and use the expression of μ provided in (53) and (54). By comparing the different panels in Fig. 4, we see that the distribution of the scaled maximal degree approaches the distribution of μ as N increases, and that the result is already stable for the values N = 100 and N = 250. These are interesting values since, in the range [100, 250], the probability of correct graph learning is close to 1, as we can appreciate from the quantitative performance analysis reported in Fig. 5.
More specifically, in Fig. 5 we show the probability of correct graph learning, evaluated empirically over 10³ Monte Carlo runs, as a function of the network size N. Specifically, the dynamical evolution in (2) is simulated over a network of increasing size N, ranging from 50 to 250, and we consider a subset of probed nodes of cardinality ξN, with ξ = 0.15. The curves displayed with solid lines refer to the limiting Granger estimator in (27), which is obtained by using the true covariance matrices. Markers refer to the regularized Granger estimator in (51), which is instead computed from the samples. The take-away messages from Fig. 5 are that: i) for a sufficiently large number of samples, the learning curve of the empirical Granger estimator reaches the curve of the limiting Granger estimator; and ii) consistent learning is progressively achieved as N grows.
In Fig. 5, a relatively large number of samples is considered, and kept constant across all values of the network size N. Another useful analysis pertains to the effective number of samples necessary to achieve a target learning probability. In Fig. 6 we evaluate empirically the number of samples needed to reach 90% of the probability of correct learning achieved by the limiting Granger estimator. The blue curve corresponds to Bollobás-Riordan graphs, and shows a growth that matches well the almost-linear law prescribed by Theorem 3.

Fig. 5. The regularized Granger estimator in (51) is obtained by using the sample covariances (markers) evaluated over T = 3 · 10⁶ samples. The clustering algorithm applied to the Granger estimator is the modified k-means algorithm proposed in [33]. The parameters of the Laplacian matrix are ρ = 0.5 and λ = 0.75.
It is useful to compare the observed behavior against that of Erdős-Rényi graphs. The sample complexity laws for Erdős-Rényi graphs mentioned in the previous section are confirmed by the curves in Fig. 6, revealing in particular that: i) the intermediate growth rate is given by Bollobás-Riordan graphs (blue curve), with almost-linear sample complexity; ii) the highest sample complexity is quadratic, and is required by dense Erdős-Rényi graphs (green curve); and iii) the lowest sample complexity is achieved by sparse Erdős-Rényi graphs (red curve), and depends on the specific law chosen for the vanishing connection probability p.
The bottom line is that: i) sparsity of preferential attachment graphs makes them easier to learn than dense graphs; whereas ii) heterogeneity of preferential attachment graphs implies a power-law behavior that reflects into a √ N -growth of the maximal degree, making them harder to learn than sparse homogeneous graphs.
Finally, we provide some quantitative data as regards the computational complexity of the graph learning strategy in the considered examples. To this end, we report the run times measured on a Dell Inc.® XPS 7390 laptop, equipped with an Intel® i7 processor and 16 GB of RAM. The graph learning algorithm can be decoupled into two steps: i) computing the Granger estimator; and ii) performing the clustering algorithm on its entries. The cost associated with the clustering algorithm is negligible. In the first step, if we use the regularized Granger estimator, we need an optimization algorithm to solve (51) numerically. In our simulations, we employed the MATLAB® package CVX [50], [51], which exhibited a run time ranging from ≈ 3 s to ≈ 8 s as N ranges from 100 to 250, with ξ = 0.15. The run time reduces to less than 1 ms if we use instead the non-regularized Granger estimator⁸ in (29), which in the considered examples was found to coincide with its regularized counterpart (see the discussion following (51)).

Fig. 7. Experiments over real-world topologies. The parameters of the Laplacian matrix are ρ = 0.5 and λ = 0.75. Left: a simulation run of (2) over a power-grid network of N = 4941 nodes taken from the web repository [54]. The plot shows the entries of the regularized Granger estimator in (51), scaled by √N, vectorized and rearranged so that entries corresponding to disconnected nodes come first. In this run we set T = 2 · 10⁵, and consider a subset of probed nodes having cardinality N/3 = 1647. In the network topology, probed nodes are displayed in green and latent nodes in purple, with the circle radius proportional to the node degree. Right: the same general setting as in the left panel, with reference to a network of N = 100 real-world routers, whose connection topology was extracted from a bigger network available in the web repository [54]. In this run we set T = 10⁶, and consider a subset of probed nodes having cardinality N/2 = 50.

B. Real Networks and Directed Graphs
So far, we tested our results over synthetic network topologies generated according to the Bollobás-Riordan procedure described in Sec. III. Since the main motivation behind the challenging study of these graphs is their similarity to real-world graphs, in this section we examine some topologies of existing networks. The examples that we are going to illustrate should be interpreted in the following way. We are given the topology of a real-world network, such as, e.g., a power-grid network, a network of routers, or a social network, which can support the implementation of distributed learning algorithms for different useful purposes. We therefore use the given network topology to build/run on top of it a distributed algorithm, for example, an adaptive distributed detection algorithm or a social learning algorithm, which are examples matching well the considered model (see [6, Sec. V-A], [52], [53]). Then, the focus of topology inference is to solve the reverse learning problem of retrieving the network graph from partial observation of the nodes' output.
We are now ready to illustrate the tests conducted over two real-world networks provided by a popular web repository [54], corresponding to the topologies shown in Fig. 7. As a preliminary comment, we can see that these topologies exhibit a dichotomous structure, with "hubs" featuring many connections as opposed to "peripheral" nodes with few connections. Similarly shaped structures match well the heterogeneity guaranteed by Bollobás-Riordan models, while they are impossible to mimic through the independent/homogeneous Erdős-Rényi generation.
The example in the left panel of Fig. 7 refers to a power-grid network composed of N = 4941 nodes, connected according to the displayed topology. Over this topology, we let the VAR system (2) run with a Laplacian combination policy, and then applied our inference algorithm in the case where only one third of the nodes is probed, with T = 2 · 10⁵ samples. The results of the test are shown in the left plot of Fig. 7. We see that the clustering algorithm is able to correctly separate the disconnected/connected nodes, therefore providing faithful graph learning. Moreover, we see that the spread exhibited by the matrix entries of the sample estimator is larger than the spread exhibited by the limiting estimator, which makes perfect sense.

⁸ We remark that no theoretical proof on the sample complexity of the non-regularized Granger estimator is available.
The second example, illustrated in the right panel of Fig. 7, refers to a network of 100 routers connected according to the shown topology, which was extracted from a bigger network reported in the web-repository [54]. In this case, 50% of the nodes are probed, and we have T = 10 6 available samples. The results of the test are shown in the right plot in Fig. 7, where we can appreciate that the graph learning algorithm successfully classifies the node connections within the probed subnetwork. In comparison to the left panel, we see that in the right panel the spread of the sample estimators is reduced, which is a consequence of the fact that we have a larger number of samples and a smaller network size.
Finally, we test whether the Granger estimator can achieve faithful graph learning over directed graphs. While Bollobás-Riordan graphs are naturally undirected, there are of course several straightforward ways to devise directed variants thereof, see, e.g., [37]. Perhaps the simplest way is to perform, at each step of the preferential attachment construction: i) the insertion of a directed edge from the new node n to an existing node, based on an attachment probability ruled by the in-degree of the existing nodes; and ii) the insertion of a directed edge from an existing node to the new node n, based on an attachment probability ruled by the out-degree of the existing nodes. Such a construction is used in Fig. 8, where we report two realizations relative to the parameters described in the caption. Regarding the Laplacian matrix, in the directed case we use definition (32) with the maximal in-degree. We see that, even in this non-symmetric case, the limiting Granger estimator is still able to separate well the connected from the disconnected pairs.

Fig. 8. Two realizations of a directed Bollobás-Riordan graph, whose construction is detailed in the main text. We set N = 100 and η = 3. The plots show the entries of the limiting Granger estimator in (27), scaled by √N, vectorized and rearranged so that entries corresponding to disconnected nodes come first. In the shown network topologies, probed nodes are displayed in green, while latent nodes in purple, with the circle radius being proportional to the node in-degree. The probed subset has cardinality N/2 = 50, and its nodes are randomly picked from {1, 2, . . . , N} without replacement. The parameters of the Laplacian matrix are ρ = 0.5 and λ = 0.75.
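A minimal sketch of the directed variant just described follows. This is our own illustrative implementation; in particular, the "+1" smoothing that lets zero-degree nodes be picked is our assumption, since the exact offsets used in [37] are not reported here:

```python
import random

def directed_bollobas_riordan(N, eta, seed=0):
    """Directed preferential attachment: for each of the eta steps of a
    new node n, (i) add edge n -> k with k chosen proportionally to
    in-degree (+1 smoothing), and (ii) add edge j -> n with j chosen
    proportionally to out-degree (+1 smoothing)."""
    rng = random.Random(seed)
    indeg = [0] * (N + 1)
    outdeg = [0] * (N + 1)
    edges = []
    for n in range(2, N + 1):
        for _ in range(eta):
            existing = range(1, n)
            # (i) edge from the new node, ruled by in-degrees
            k = rng.choices(existing, weights=[indeg[v] + 1 for v in existing])[0]
            edges.append((n, k)); outdeg[n] += 1; indeg[k] += 1
            # (ii) edge toward the new node, ruled by out-degrees
            j = rng.choices(existing, weights=[outdeg[v] + 1 for v in existing])[0]
            edges.append((j, n)); outdeg[j] += 1; indeg[n] += 1
    return edges
```

The resulting adjacency matrix is non-symmetric, which is precisely the setting probed in Fig. 8.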

VI. CONCLUSION
We examined the problem of learning a network graph from the signals diffusing across the network according to the VAR model in (2). The distinguishing features of our work are: i) the network topology is modeled as a preferential attachment random graph; and ii) only part of the network is monitored (partial observability). We established that the Granger estimator under partial observability achieves faithful graph learning (with high probability as the network grows) for the class of Bollobás-Riordan graphs when the signals of neighboring nodes are combined according to the Laplacian matrix of the graph.
Previous results on consistent graph learning under partial observability, for diffusion models like (2), were relative to Erdős-Rényi graphs, which cannot faithfully reproduce the behavior of several useful real-world networks. In contrast, preferential attachment graphs were shown to be powerful in capturing useful real-world effects such as node heterogeneity and statistical dependence across graph edges. Accordingly, moving from Erdős-Rényi to Bollobás-Riordan graphs constitutes a useful research advance, which was rather demanding, especially because the multigraph construction relies on a preferential attachment mechanism that introduces significant dependence across the edges, thus preventing application of the simpler i.i.d. models adopted for the former graphs. Exploiting statistical concentration results for dependent processes, we were able to examine in detail the limiting properties of these graphs, and to ascertain (Theorem 2) that the entries of the Granger matrix estimator computed over the probed subnetwork split into two classes, separated by an identifiability gap. As a distinguishing feature of the Bollobás-Riordan model, this gap is a random variable, which depends on the particular instance of the multigraph generation, as opposed to the deterministic gap observed for Erdős-Rényi graphs. We proved that the emergence of the gap is critical to enable graph learning (Theorem 1), and characterized how the number of samples must grow with the network size (Theorem 3), finding that the sample complexity is slightly larger than N log N.
There are many other open questions that might deserve attention. For example, achievability is expressed in a worst-case perspective where perfect graph reconstruction is required. It would be useful to relax the criterion to allow a possible fraction of misclassified edges, and see how this impacts the performance and the required number of samples. Moreover, since classification is performed by an unsupervised clustering algorithm, it would be useful to see whether one could exploit side information to set a classification threshold, and see how this changes the fraction of misclassified edges.
We established that the identifiability gap over Bollobás-Riordan graphs is random, as opposed to the deterministic nature that was proved over Erdős-Rényi graphs. This difference stimulates an open question that concerns the connections between the nature (deterministic or random) of the identifiability gap and the generative mechanism of the graph.
Also considering that the ubiquity of scale-free networks is a debated point [55], it would be interesting to consider other useful graph models, such as the stochastic block model, Chung-Lu graphs, random dot product or random geometric graphs. In particular, it would be interesting to ascertain whether randomness of the identifiability gap is related to the scale-free property or the preferential attachment mechanism.
The conducted analysis does not depend on a specific selection rule for the subset of probed nodes. Indeed, some of our results hold for any (deterministic) subset of probed nodes. For Theorem 3, we just need to impose that the cardinality of this subset scales linearly with N, so as to avoid trivial (i.e., fully connected or fully disconnected) subgraphs as N → ∞. Regarding the choice of the probed subset, it is interesting to consider an adversarial perspective, where a malicious entity wants to select it so as to impair the topology inference algorithm. In the traditional setting, attacks on network graphs have the goal of impairing the connectivity properties of the network. One classical attack is the deletion attack, where the attacker has the freedom of deleting some nodes [56]. The goal is to reduce the connectivity of the subgraph surviving after the deletion process. In particular, when the attacker leverages knowledge of the preferential attachment construction, he/she can delete all the "oldest" nodes to minimize connectivity. In our context, connectivity of probed nodes is not of interest, but one way to impair the clustering algorithm would be to select a subset that is fully connected or fully disconnected. However, when the cardinality of the probed subset grows linearly with N, we know that both these extreme situations occur with vanishing probability as N grows. The picture changes if the attack does not rely only on the graph model, and the adversary has the power of choosing the subset based on the actual realizations of the graph and/or of the nodes' output. In this case, the subset becomes statistically dependent on other random variables, and the statistical properties of the combination matrix and of its estimated counterpart change due to the dependence introduced between the subset and the matrix entries. Carrying out the analysis under this scenario requires different tools that are not covered by the results of this work.
Devising matrix estimators different from (51), exploiting in particular other structural constraints such as sparsity or smoothness, can be useful to reduce sample complexity.
Another important aspect pertains to online graph learning algorithms: for example, one could implement an online version of (51) that learns from streaming data, encompassing the possibility that the graph topology changes due to the evolutionary mechanism of the Bollobás-Riordan graph itself.
Finally, other useful research advances concern the generalization to higher-order VAR processes, nonlinear models, or other combination matrices.

APPENDICES
The proofs reported in these appendices are organized as follows. In Appendix A we prove Theorem 1. In Appendices B and C we collect several properties on the Bollobás-Riordan model useful for our analysis. By exploiting these results, in Appendix D we show how the underlying graph structure influences the behavior of the Laplacian combination matrix. Next, in Appendix E we establish some useful properties of the limiting Granger estimator. Our main results, namely, Theorems 2 and 3, are proved in Appendices F and G, respectively. Finally, in Appendix H we collect two auxiliary results used in our analysis.

APPENDIX A PROOF OF THEOREM 1
By application of the triangle inequality we can write the decomposition in (55). Let us focus on the first term on the RHS of (55). From the definition of the limiting estimator in (10), we have, for any N ∈ ℕ and ε > 0, the convergence in (56). By definition of limit, from (56) we conclude that, for any N ∈ ℕ and any δ, ε > 0, there always exists a value T₀(N, δ, ε) such that (57) holds for all T ≥ T₀(N, δ, ε). Let now f_N and g_N be two positive sequences vanishing with N with arbitrary laws. Since the reduction of δ and/or ε is a more demanding condition, the function T₀(N, δ, ε) can always be chosen to be non-increasing w.r.t. both δ and ε, which implies (58) for sufficiently large N, further implying (59) in view of (57). In other words, if the number of samples scales with N as T_N = T₀(N, f_N, g_N), we can write (60). Plugging this result into (55), and noticing that the second term on the RHS of (55) vanishes in probability in view of Definition 2 (since by assumption the limiting estimator achieves universal local structural consistency), we conclude that (61) holds, which implies that, for any ε > 0, the configuration in (62) occurs with high probability as N → ∞ (see footnote 4). By assumption, there exists a certain ε > 0 such that the algorithm graphclu(·) achieves successful classification for all configurations fulfilling (62), provided that G_S(N) is neither fully connected nor fully disconnected. Thus, the proof is complete, because we have just shown that configurations fulfilling (62) occur with high probability as N → ∞, and by assumption the probability that G_S(N) is fully connected or fully disconnected vanishes as N → ∞.

APPENDIX B USEFUL RESULTS ON BOLLOBÁS-RIORDAN MULTIGRAPHS
We start by stating two properties of the maximal degree μ_M(N) that will be critical in our development.
Theorem 8.14 in [49, p. 280]: For N = 1, 2, . . ., let μ_M(N) be the maximal degree sequence defined over the multigraph sequence M(N). For any m ∈ N we have that: Moreover, there exists a strictly positive random variable μ such that:

μ_M(N)/√N →a.s. μ, as N → ∞, (64)

where →a.s. denotes almost-sure convergence. We continue by proving a lemma that provides a uniform upper bound on the preferential attachment probability.
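The √N scaling of the maximal degree can be observed numerically. The following sketch is our own minimal, assumed implementation of the Bollobás-Riordan growth rule (the function name and the flat endpoint-list device are illustrative choices, not taken from the paper): each new node places η half-edges, with endpoints chosen proportionally to current degrees, the new node's in-progress degree included, so self-loops are allowed.

```python
import random

def bollobas_riordan(N, eta, seed=0):
    """Grow a preferential attachment multigraph on N nodes, eta edges per
    new node. Node j appears in `endpoints` once per unit of degree, so a
    uniform draw from `endpoints` is a degree-proportional draw."""
    rng = random.Random(seed)
    endpoints = []
    deg = [0] * (N + 1)       # 1-indexed degrees
    for n in range(1, N + 1):
        for _ in range(eta):
            endpoints.append(n)   # dangling half-edge counts toward n's weight
            deg[n] += 1
            target = rng.choice(endpoints)
            endpoints.append(target)
            deg[target] += 1
    return deg

deg = bollobas_riordan(2000, eta=2)
mu_max = max(deg)
ratio = mu_max / 2000 ** 0.5   # expected to stabilize around a positive value
```

Re-running with growing N (and averaging over seeds) shows `ratio` fluctuating around a strictly positive limit, consistent with the theorem above.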
Lemma 3 (Bounds on the Preferential Attachment Probability): The preferential attachment probability defined in (42) obeys the following bound: Moreover, for any set T ⊆ {1, 2, . . . , s − 1}, we have that: Proof: First, we focus on the numerator in (42). Combining (37) and (38), we can write the degree of node k in the multigraph M(n; s − 1) as: where δ_{kn} is the Kronecker delta. Developing the recursion in (67) over the index s, we get: which implies that the numerator in (42) is upper bounded as: where the last inequality follows by observing that s ≤ η.
We switch to the analysis of the denominator in (42), which is lower bounded as: Using (69) and (70) in (42), we get (65). It remains to prove (66). By applying the law of total probability, we can write: In the summation, M spans the space of possible multigraphs M(n; s − 1) compatible with the multigraph M(n − 1) and with the collection of selected nodes {v(n; τ)}_{τ∈T}, and where: (a) comes from the fact that the preferential attachment rule is Markovian, namely, the selection at step s depends only on the previous multigraph M(n; s − 1) (regardless of any details about the earlier multigraph evolution); while (b) comes from (65).

Lemma 4 (Sum of Maximal Degree Powers):
For all m ≥ 2 we have that: Proof: First we observe that: From the Stolz-Cesàro theorem, for any two positive sequences f_N and g_N, with g_N strictly increasing and diverging to ∞ as N → ∞, we have that [57], [58]: which, applied to the last fraction in (73), yields: where the last inequality follows by (63).
Focusing on the first fraction in (73) we have that, for m > 2: where ζ(·) is the Riemann zeta function (whose value at any argument larger than one is finite), while for m = 2 the summation in (76) is the harmonic number, which diverges logarithmically as N → ∞. Accordingly, we conclude that for all m ≥ 2: The claim of the lemma now follows by applying (75) and (77).
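The dichotomy between the convergent regime and the logarithmically divergent one can be checked numerically. In the sketch below (our own illustration; `p` stands generically for the exponent of the power sum), partial sums of k^{-p} approach a Riemann zeta value for p > 1, while for p = 1 they reproduce the harmonic number H_N ~ log N + γ, mirroring the behavior invoked in the proof.

```python
import math

def partial_sum(N, p):
    """Partial sum of k^{-p} over k = 1, ..., N."""
    return sum(k ** -p for k in range(1, N + 1))

Ns = (10**3, 10**4, 10**5)
# Exponent larger than one: the sums converge (here toward zeta(1.5) ~ 2.612).
s_converging = [partial_sum(N, 1.5) for N in Ns]
# Exponent equal to one: harmonic numbers, diverging like log N.
s_harmonic = [partial_sum(N, 1.0) for N in Ns]
gaps = [h - math.log(N) for h, N in zip(s_harmonic, Ns)]   # -> Euler-Mascheroni
```

The gaps stabilize near the Euler-Mascheroni constant 0.5772, confirming the logarithmic growth, while the p = 1.5 sums saturate.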

Lemma 5 (Correlation Between Degrees):
For any 1 ≤ k < ℓ ≤ N, we have that: Proof: It is convenient to introduce the following binary random variables: for 1 ≤ k ≤ N and 1 ≤ s ≤ η. Using (34), we can write: where the last step holds true since k ≠ ℓ by assumption. Let us examine the behavior of the individual terms in (80) and focus, without loss of generality, on the case s > t. Since β_{ks}(N) and β_{ℓt}(N) are indicator variables, we have that: where the last inequality follows from (66), and the claim follows by observing that η(η − 1)/(8η²) < 1.

APPENDIX C EQUIVALENCE BETWEEN BOLLOBÁS-RIORDAN MULTIGRAPHS AND SIMPLE GRAPHS
Proof of Lemma 1: We start with the analysis of the number of self-loops in M(N ), which is equal to: Given a multigraph M(k − 1), let us consider the η steps, s = 1, 2, . . . , η, necessary to build the multigraph M(k).
Consider a sequence of nodes v_1, v_2, . . . , v_η selected during the η steps, with the prescription that exactly m out of the η nodes are equal to k, i.e., that we have m self-loops attached to the new node k. We denote by V_m the ensemble of configurations v_1, v_2, . . . , v_η that match such prescription. We note in passing that the cardinality of V_m is equal to the binomial coefficient (η choose m). According to the adopted notation, the probability of having exactly m self-loops attached to k admits the following representation: where (a) follows by the chain rule and (b) follows from (66), once we notice that d_{M,k}(k − 1) = 0. Now, by exploiting the definition of M(1), the multigraph made of a single node with η self-loops, we can write: which, in view of (83), yields: where the finite constant C(η) is implicitly defined in the last step of (85). We immediately see from (85) that: since the last summation in (85) is the harmonic number, which grows logarithmically with N. We continue by examining the expected number of redundant edges in M(N): where the number of redundant edges between two distinct nodes k and ℓ in the multigraph M(ℓ) can be conveniently represented as: We want to show that: In order to prove (89), we can call upon the Stolz-Cesàro theorem and apply (74) with the choices f_ℓ = ∑_{k=1}^{ℓ−1} E[m_{kℓ}] and g_ℓ = ℓ (which corresponds to applying the Cesàro-mean theorem), implying that it suffices to show that: Reasoning as done to obtain (83), we have the following representation: where in the last step we applied (66). Exploiting (88), from (91) we can compute the conditional expected value of m_{kℓ}, obtaining: where, in the last step, the degree of node k has been upper bounded m − 1 times by the maximal degree. Summing over the index k, from (93) we get: Using (41) and recalling that d_{M,k}(ℓ − 1) = d_{M,k}(ℓ − 1; η), we have that: which, taking expectation w.r.t. M(ℓ − 1) in (94), yields: and the claim in (90) follows by (63), which, further applying (86), completes the proof of the lemma.
Next we prove the asymptotic equivalence between the maximal degrees of the multigraph and the associated simple graph, claimed in Lemma 2.
Proof of Lemma 2: Considering the definition of the multigraph degree in (36), and separating the contribution of the redundant edges in (88) from the contribution of the simple-graph term in (45), we can write: where in (a) we adopt the convention that the second summation is equal to zero when k = 1, and that the third summation is equal to zero when k = N; and in (b) the inequality follows because the number of self-loops attached to node k at cycle k, as well as the number of edges connecting node k to a node ℓ < k, are at most equal to the number of steps η, namely, m_{kk} ≤ η and ∑_{ℓ=1}^{k−1} m_{ℓk} ≤ η. Taking the maximum over k ∈ [1, N] in (97) we can write: where the equality follows because the summation is zero when k = N. For any k ≥ 1, let us introduce the sequence: It is readily verified that the family of sequences {u_k(ℓ)}_{ℓ≥1}, indexed by the parameter k, matches the hypotheses of Lemma 13 with the choices Θ = N \ {0}, b = η, and with the filtration {F(ℓ)}_{ℓ≥1} generated by the random sequence {M(ℓ)}_{ℓ≥1}. In particular, for any ε > 0 and any N ≥ 1, by choosing T = [1, N] and u = ε√N, we can apply Lemma 13 and write: On the other hand, for 1 ≤ k ≤ N − 1 we have that: where in the second inequality we applied (92), whereas the third inequality follows from the definition of maximal degree. Applying Markov's inequality and exploiting (102) we obtain: Substituting (103) into (100) and applying Lemma 4, from (98) we obtain the convergence in (47). Then, the convergence in (48) comes directly from (43), and the proof is complete.

Lemma 6 (Subgraphs G_S(N) Are Nontrivial): Let G(N) be the simple graph associated with a Bollobás-Riordan multigraph M(N), and let G_S(N) be the subgraph over a subset sequence S_N fulfilling (8). Then: i) for sufficiently large N, the subgraph G_S(N) is not fully connected; and ii) the probability that G_S(N) is fully disconnected vanishes as N → ∞.
Proof: The sum of all degrees in the multigraph M(N) grows linearly with N. In order for the subgraph G_S(N) to be fully connected, the sum of all degrees of this subgraph should be equal to |S_N|(|S_N| − 1), which scales as N² in view of (8). Therefore, the subgraph G_S(N) cannot be fully connected as N grows.
We move on to examine the probability that G_S(N) is fully disconnected. To this end, let us introduce the following quantities: We can write: where the inequality follows by considering only the connections at step s = 1 of the preferential attachment construction, whereas the equality follows by the chain rule. Now, from (42), for any k < n we obtain the following lower bound: Accordingly, the individual term in (106) can be upper bounded as follows: Using (108) in (106), we finally get (see footnote 9): where the second inequality holds because all terms in the product are smaller than 1, while the convergence holds in view of (8).

Theorem 4 (Correlation Between Matrix Entries): Let
Then we have that: Proof: Using (32) in (111) we get: Since the term √N/(1 + μ_G(N)) converges almost surely to 1/μ, it is sufficient to show that the term t_N in (113) vanishes in probability as N → ∞. To this aim, it is expedient to work in terms of the original multigraph M(N) that originates the simple graph G(N). We have that: By construction, m_{kℓ} ∈ {1, . . . , η}, and using (36), for any k < ℓ we have: where the last inequality holds because, in the multigraph M(ℓ), node ℓ has only η edges, and so its degree d_{M,ℓ}(ℓ) is upper bounded by 2η. Applying (115) to the random sequence t_N in (114), we conclude that t_N vanishes almost surely as N → ∞. It remains to show that t_N in (114) vanishes in probability. To this end, we call upon Lemma 13, by introducing the following family of sequences, for any k, ℓ ∈ N with 1 ≤ k < ℓ: Following the notation adopted in Lemma 13, we have a family of sequences {u_{kℓ}(j)}_{j≥1} parameterized by the set: Moreover, we introduce the aggregate variable: Convergence to zero of the random variable t_N in (114) is equivalent to the following statement: Now, for any (k, ℓ) ∈ Θ: where the upper bound follows because, in view of (34) and (35), both m_{jk} and m_{jℓ} cannot exceed the number of steps η. From (120) we see that the family of sequences in (116) meets the hypotheses of Lemma 13 with the filtration {F(n)}_{n≥1} generated by the random sequence {M(n)}_{n≥1}. We conclude that the probability in (119) is upper bounded by: where: On the other hand, from Lemma 5 we have that: Therefore, applying Markov's inequality and (123) to the second term in (121), we conclude that this term is upper bounded by: which vanishes as N → ∞ in view of Lemma 4, concluding the proof of the theorem.

Footnote 9: For each multigraph M(n − 1) such that the events {v(n_k; 1) ∉ V_{k−1}} occur, we can write: Since the bound in (107) does not depend on M(n − 1), by applying the law of total probability, we can use (107) to bound also the unconditional probability.

APPENDIX E DETERMINISTIC PROPERTIES OF THE LIMITING GRANGER ESTIMATOR
In this section we obtain an upper bound on the error of the limiting Granger estimator. To this aim, we start by proving two auxiliary lemmas that hold for any N × N scaled left-stochastic matrix A = [a_{kℓ}], namely, for any matrix whose entries satisfy the conditions: In the following analysis we denote by a^{(i)}_{kℓ} the (k, ℓ)-entry of the matrix power A^i, and we use the following quantity: Lemma 7 (Bounds on Matrix Powers): Let A be an N × N scaled left-stochastic matrix as in (125). For i = 1, 2, . . ., we have that: • The main diagonal entries of A^{2i} satisfy the inequalities: where the sequence α_i is recursively defined as: • The off-diagonal entries of A^{2i} satisfy the inequalities: where β_i and γ_i are two sequences recursively defined as: Proof: Preliminarily, it is useful to observe that (see footnote 10): We start by proving (127) by induction. For i = 1, the claim follows directly from (133). We shall therefore prove that (127) holds for i + 1, assuming that it holds for i. To this aim, let us write the diagonal terms of the matrix A^{2(i+1)} as: We observe that (133) implies in particular the following inequalities: which, applied in (134), yield: Since a^{(2i)}_{kk} ≤ α_i by the induction hypothesis, from (136) we get: which corresponds to (127) for the case i + 1, and the claim for the diagonal entries is proved. We continue by proving (129) by induction. For any k, ℓ = 1, 2, . . . , N, with k ≠ ℓ, we have: where: i) in the inequality we exploited the fact that the diagonal entries of A are upper bounded by ρ in view of (125), and we used the definition of M in (126); and ii) in the last equality we applied the definitions of β_1 and γ_1 appearing in (130) and (131), respectively. We conclude from (138) that the claim in (129) holds for i = 1. Let us now show that, if the claim holds for a generic i, then it holds for i + 1.

Footnote 10: We can prove this property by induction as follows. For i = 1 the property is exactly (125), while the induction step comes from:
To this aim, we observe that: where the first inequality follows by applying the induction hypothesis to the term a^{(2i)}_{hℓ}, while in the second inequality we used (127). Let us now bound the individual terms that multiply the quantities α_i, β_i, and γ_i in (139).
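The column-sum property that drives these inductions is easy to verify numerically. In the sketch below we assume, consistently with our reading of (125), that a scaled left-stochastic matrix has nonnegative entries with columns summing to ρ < 1; the powers A^i then have columns summing exactly to ρ^i, which is the i = 1 property propagated through the induction step.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, N = 0.8, 6
# Assumed reading of (125): nonnegative entries, each column summing to rho < 1.
A = rng.random((N, N))
A = rho * A / A.sum(axis=0, keepdims=True)   # rescale every column to sum to rho

checks = []
for i in (1, 2, 5, 10):
    Ai = np.linalg.matrix_power(A, i)
    # Columns of A^i sum to rho^i: left-multiplying by the all-ones row vector
    # contracts the sum by a factor rho at every application of A.
    checks.append(np.allclose(Ai.sum(axis=0), rho ** i))
```

Since each diagonal entry of A^{2i} is dominated by its column sum, the same computation also exhibits the geometric decay exploited by the bounds of the lemma.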

Lemma 8 (Bounds on a Useful Matrix Power Series): Let
A be an N × N scaled left-stochastic matrix as in (125). Let: where we recall that P̄ = {1, 2, . . . , N} \ P. Let further: Then, for i = 1, 2, . . ., we have that: • The main diagonal entries of matrix H satisfy the inequalities: • The off-diagonal entries of matrix H satisfy the inequalities: Proof: The fact that h_{kℓ} ≥ 0 for any k and ℓ is an immediate consequence of the definition of matrix H in (144), since the entries of the matrix powers C^i are nonnegative for any i. In what follows we therefore focus on the upper bounds in (146) and (147).
As can be trivially verified by induction, we first note that for any k, ℓ ∈ P and any i = 1, 2, . . .: implying that the upper bounds provided in Lemma 7 are also valid for the matrix powers C^i. Therefore, by the definition of H in (144), we have: where the sequences α_i, β_i and γ_i are defined in (128), (130) and (131), respectively. According to (149), to establish the upper bounds in (146) and (147) it suffices to show that: To this aim, we note that the sequence α_i in (128) matches (212) in Lemma 12 with the choices: Therefore, we can apply (213) to obtain: which, recalling the series ∑_{i=1}^∞ i a^i = a/(1 − a)² (for |a| < 1), allows us to write: Thus, we have proved (146). Let us move on to prove (147). By substituting (152) in (130), we get: and therefore we see that the sequence β_i matches (212) in Lemma 12 with the choices: In view of (213), we conclude that: which implies that the series ∑_{i=1}^∞ β_i converges. Since we showed that the series ∑_{i=1}^∞ α_i is also convergent, by summing over the index i in (130) we can write: which, using β_1 = 2ρ and (153), yields: It remains to examine the behavior of the summation in (149) involving the sequence γ_i in (131). Substituting (152) and (156) in (131) we get: which shows that the sequence γ_i matches (212) in Lemma 12 with the choices: We conclude that the series ∑_{i=1}^∞ γ_i is convergent. Thus, by summing over i in (131), we can write: which, using γ_1 = M along with (153) and (159), yields: and the proof is complete.
We are now ready to apply Lemmas 7 and 8 to obtain a bound on the error of the limiting Granger estimator. By definition, the limiting Granger estimator A_P(N) is a deterministic function of the combination matrix A(N). In fact, for a realization A of A(N), the limiting Granger estimator is: with: When A is symmetric, we have (recall that I denotes the N × N identity matrix): which is instrumental in proving the following lemma.

Lemma 9 (Bound on the Error of the Limiting Granger Estimator): Let A be an N × N scaled left-stochastic matrix as in (125). If A is symmetric, then we have: where κ is a positive constant, and M is defined in (126).

Proof: Since A is symmetric, the limiting Granger estimator admits the representation in (167), which, with the notation introduced in (144), becomes: or, in terms of the individual (k, ℓ)-entry: We note that e_{kℓ} ≥ 0 since all involved matrices are nonnegative (see (146) and (147) for what concerns H). Therefore, it suffices to prove that: for some positive constant κ. To this aim, let us consider two indices k, ℓ ∈ P. Calling upon Lemma 8, we can apply (146) and (147) in (170), yielding: The first summation in (172) can be upper bounded as follows: where in the first inequality we used (138), while in the second inequality we used the following bounds: Here, we remark that the inequality on the left exploits the fact that k, ℓ ∈ P and j ∈ P̄, so we are allowed to extend the sum across j ∈ P̄ to a sum across j ∈ {1, 2, . . . , N} \ {k, ℓ}. Next we focus on the second summation in (172), which can be upper bounded as follows: where in the first inequality we used (138), while in the second inequality we used (125) and (126) to get: which proves the claim.
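To make the objects behind (166)-(167) concrete, the following sketch builds a symmetric matrix satisfying (an instance of) the conditions in (125), forms the steady-state covariances through the standard VAR(1) relations R₀ = (I − A²)⁻¹ and R₁ = A R₀ for symmetric stable A with unit-variance noise (our assumption, not a formula quoted from the paper), and compares the Granger estimator on a probed subset against the true submatrix. The probed set P and all variable names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
N, rho = 30, 0.5
# Symmetric, nonnegative matrix with column sums at most rho < 1 --
# an assumed concrete instance of the conditions in (125).
B = rng.random((N, N))
B = (B + B.T) / 2
A = rho * B / B.sum(axis=0).max()

# Steady-state VAR(1) covariances for symmetric stable A, unit-variance noise
# (assumed relations for the quantities behind (166)):
R0 = np.linalg.inv(np.eye(N) - A @ A)   # equals sum_{i>=0} A^{2i}
R1 = A @ R0

P = np.arange(10)                        # probed subset (hypothetical choice)
A_hat = R1[np.ix_(P, P)] @ np.linalg.inv(R0[np.ix_(P, P)])
err = np.abs(A_hat - A[np.ix_(P, P)]).max()   # max-entry estimation error
```

The residual `err` is nonzero, since the latent nodes in P̄ bias the estimator, but it stays small relative to the matrix entries, in the spirit of the bound of Lemma 9.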

APPENDIX F PROOF OF THEOREM 2
Since Lemma 9 holds for any symmetric matrix A fulfilling (125) and any subset P ⊆ {1, 2, . . . , N}, in view of (168) we can write: where M(N) is the random variable defined in (111). Applying Theorem 4 to (181), we conclude that: for any sequence of probed subsets S_N. Now, by applying the triangle inequality we can write: The first term on the RHS of (183) vanishes in probability in view of (182). Moreover, in view of (32) we have that: where diag(·) denotes a diagonal matrix having on its main diagonal the entries of its matrix argument. Using (184), we see that the second term on the RHS of (183) is upper bounded by: where the convergence follows from (64) and (47), since μ is strictly positive. We conclude that: and the proof is complete.

APPENDIX G SAMPLE COMPLEXITY ANALYSIS
The following lemmas characterize the rate of convergence of the sample covariance estimators. Preliminarily, it is convenient to introduce the following auxiliary function: and the error matrices, for j ∈ {0, 1}: Lemma 10 (Sample Covariance Errors): Let us consider the dynamical system (2), with Laplacian combination matrix as in (32), and with the network graph G(N) being a simple graph obtained from a Bollobás-Riordan multigraph M(N) with step parameter η. Then, under Assumption 1, there exists a constant C such that: Proof: In this proof we will often consider conditional probabilities given A(N) = A. This is tantamount to assuming that the dynamical system (3) is run with a deterministic matrix A. In such a scenario, the matrix R_0(N) becomes deterministic, since according to (24) it is a deterministic function of A(N). Accordingly, we will conveniently denote by the normal-font symbol R_0 the realization of R_0(N) corresponding to A. In contrast, the quantity R_0(T, N) in (30) remains random, since by definition it depends also on the source of randomness given by {x_i(N)}_{i=1}^T and y_0(N). The proof of (189) and (190) is a slight variation of the bounding technique used in [25, Lemma 1]. In particular, let us define, for i, j = 0, 1, . . ., with j ≤ i, the conditional cross-covariance between y_i(N) and y_j(N), namely: where the intermediate equality is a classical result on VAR models [39], while the last equality comes from the stationarity enforced by Assumption 1. Starting from (191), in [25, Lemma 1] the following bound is used: In our case we can exploit additional constraints on A to replace (192) by: where the inequality comes from the fact that, for any two matrices M_1, M_2 of compatible dimensions, we have ‖M_1 M_2‖_max ≤ ‖M_1‖_∞ ‖M_2‖_max. The equality in (193) follows by using (132) and by applying the Cauchy-Schwarz inequality to obtain: The function appearing in (195) is evaluated in the region where it is decreasing, i.e., for x > √(2/T).
Proof: We need to show that, for any δ > 0: We will prove the claim with reference to the case j = 0, the proof being identical for j = 1. Let us consider Lemma 10 with the choice ε = δ/√N. Since Eq. (202) implies that: we see that the condition on T in (189) is met for N sufficiently large. We conclude that, to prove (204), it suffices to show that, for any δ > 0: Now, the first term on the RHS of (187) converges to zero since T_N in (202) tends to +∞ faster than log |S_N|. On the other hand, the second term on the RHS of (187) can be written as: and vanishes as N → ∞ in view of (8) and the fact that: Moreover, in [33, Eq. (321)], it was shown that: If we now use (209) in (208), from Lemma 11 we conclude that: Since we know from Theorem 2 that the limiting Granger estimator fulfills (11), by application of the triangle inequality we have that: which in turn implies (see footnote 4) that (62) holds true with high probability as N → ∞ under the sample scaling law in (202). This means that the clustering algorithm graphclu is able to reconstruct correctly the subgraph of probed nodes, provided that the latter is neither fully connected nor fully disconnected. However, the probability that G_S(N) is fully connected or fully disconnected vanishes as N → ∞ in view of Lemma 6, and the proof is complete.
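The sample-complexity mechanism can be illustrated end to end with a small simulation. The sketch below is our own illustration, not the paper's algorithm: it runs a VAR(1) recursion with an assumed symmetric stable combination matrix, forms empirical lag-0 and lag-1 sample covariances in the spirit of (30), builds the Granger estimator on a hypothetical probed subset, and checks that the estimation error shrinks as the number of samples T grows.

```python
import numpy as np

rng = np.random.default_rng(2)
N, rho = 20, 0.5
B = rng.random((N, N))
B = (B + B.T) / 2
A = rho * B / B.sum(axis=0).max()   # symmetric, nonnegative, stable (assumed instance)
P = np.arange(8)                    # probed subset (hypothetical choice)

def sample_granger(T):
    """Run y_{n+1} = A y_n + x_{n+1} with unit-variance noise, then form the
    empirical Granger estimator from the sample covariances."""
    Y = np.empty((T + 1, N))
    Y[0] = 0.0
    for t in range(T):
        Y[t + 1] = A @ Y[t] + rng.standard_normal(N)
    R0 = Y[:-1].T @ Y[:-1] / T      # empirical lag-0 covariance
    R1 = Y[1:].T @ Y[:-1] / T       # empirical lag-1 covariance
    return R1[np.ix_(P, P)] @ np.linalg.inv(R0[np.ix_(P, P)])

errs = [np.abs(sample_granger(T) - A[np.ix_(P, P)]).max() for T in (200, 20000)]
```

With the larger sample size, the fluctuation term dominating the error contracts roughly like 1/√T, leaving only the small bias induced by the latent nodes.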

APPENDIX H AUXILIARY TECHNICAL RESULTS
Lemma 12: Let f i be the sequence recursively defined as: with 0 < a < 1 and b, c, d ∈ R. Then we have that: Proof: Unfolding the recursion in (212), we conclude that, for all i > 1: Thus, to obtain (213) we use the well-known results: The following lemma is an adaptation of Theorem 2.1 in [59], useful for the proofs of Lemma 2 and Theorem 4.
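The closed form in (213) ultimately rests on the two classical series identities invoked at the end of the proof of Lemma 12, ∑_{i≥1} a^i = a/(1 − a) and ∑_{i≥1} i a^i = a/(1 − a)², valid for |a| < 1. A quick numerical check (truncating at 200 terms, far beyond machine precision for the chosen a) confirms them:

```python
# Numeric check of the series identities used to evaluate the unfolded
# recursion: sum a^i = a/(1-a) and sum i*a^i = a/(1-a)^2, for |a| < 1.
a = 0.35
geom = sum(a ** i for i in range(1, 200))
arith_geom = sum(i * a ** i for i in range(1, 200))
assert abs(geom - a / (1 - a)) < 1e-12
assert abs(arith_geom - a / (1 - a) ** 2) < 1e-12
```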
Lemma 13: Let us consider a family of random sequences {u_θ(n)}_{n≥1}, indexed by the parameter θ ∈ Θ and defined on the same probability space. Assume that the following conditions are met for all θ ∈ Θ: u_θ(1) = 0 and 0 ≤ u_θ(n) ≤ b for all n > 1; and finally consider: Then, for any subset T ⊆ Θ and any u > 0 we have: Proof: For any two events E_1 and E_2, it is true that (Ē_2 being the complement of event E_2): so that we can write: Let us focus on the first term on the RHS of (223). We have the following relations: where (a) follows from the definition of Ū_θ(N) in (220); (b) is obtained by retaining only the event corresponding to a single θ in the intersection; (c) holds since in the intersection −u/2 ≤ −C_θ(N); and (d) follows by observing that, in view of (216), (217) and (218), we have χ_θ(n) ≤ b ν_θ(n), which in turn implies, from the definitions in (219) and (220), that Q_θ(N) ≤ b C_θ(N). Using (224) in the first term on the RHS of (223), and further applying the union bound, we obtain: The sequence {Ū_θ(N)}_{N≥1} is a martingale by construction, since it is a sum of random variables (i.e., u_θ(n)) minus their conditional expectations (i.e., ν_θ(n)). Moreover, from (216) we have the bound: Therefore, it can be readily checked that the scaled sequence {Ū_θ(N)/b}_{N≥1} meets the hypotheses of Theorem 2.1 in [59], and in particular satisfies the upper bound obtained by combining Eqs. (10), (11) and (15) in [59], which finally yields: and the proof is complete.
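Lemma 13 is a maximal concentration bound for bounded-increment martingales. The following sketch (our own illustration, not the exact inequality of [59]) simulates symmetric ±b random walks, which are martingales with increments bounded by b, and checks that the empirical probability of the running maximum exceeding a threshold u stays below the classical Azuma-Hoeffding-type maximal bound 2 exp(−u²/(2b²N)).

```python
import math
import random

rng = random.Random(3)
b, N, trials, u = 1.0, 400, 2000, 40.0
exceed = 0
for _ in range(trials):
    m, peak = 0.0, 0.0
    for _ in range(N):
        m += b if rng.random() < 0.5 else -b   # martingale step, |increment| <= b
        peak = max(peak, abs(m))               # running maximum of |M_n|
    if peak >= u:
        exceed += 1

frac = exceed / trials                          # empirical exceedance probability
azuma_bound = 2 * math.exp(-u ** 2 / (2 * b ** 2 * N))
```

Here the empirical fraction sits well below the bound, illustrating the kind of uniform control over the running maximum that the lemma provides across the whole family indexed by θ.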