A Universal Lossless Compression Method Applicable to Sparse Graphs and Heavy–Tailed Sparse Graphs

Graphical data arises naturally in several modern applications, including but not limited to internet graphs, social networks, genomics and proteomics. The typically large size of graphical data argues for the importance of designing universal compression methods for such data. In most applications, the graphical data is sparse, meaning that the number of edges in the graph scales more slowly than $n^2$, where $n$ denotes the number of vertices. Although in some applications the number of edges scales linearly with $n$, in others the number of edges is much smaller than $n^2$ but appears to scale superlinearly with $n$. We call the former sparse graphs and the latter heavy-tailed sparse graphs. In this paper we introduce a universal lossless compression method which is simultaneously applicable to both classes. We do this by employing the local weak convergence framework for sparse graphs and the sparse graphon framework for heavy-tailed sparse graphs.


I. INTRODUCTION
The sheer amount of graphical data in modern applications argues for finding efficient and optimal methods of compressing such data for storage and further data mining tasks. Graphical data arises in social networks, molecular and systems biology, and web graphs, as well as in several other application areas. To be concrete, an instance of graphical data arising in a web graph network would be a snapshot view of the network at a given time. Each vertex in such a graph represents a web page, and an edge represents a link between two web pages. An instance of graphical data in systems biology would be a protein-protein interaction network. Each vertex corresponds to a protein and an edge to an interaction between proteins.
Largely motivated by such applications, there has recently been an increased interest in the problem of graphical data compression. Existing works typically make assumptions regarding the properties of the graphical data of interest. One approach is to design compression schemes for specific data sources such as web graphs or social networks, with the model for the properties of the graphical data derived from a limited set of prior samples. For instance, Boldi and Vigna have proposed the webgraph framework to address the efficient compression of internet graphs [1], Boldi et al. have proposed the layer label propagation (LLP) method to compress social network graphs [2], and Liakos et al. have proposed the BV+ compression method and evaluated its performance on certain datasets such as web and social graphs [3]. In this approach, the compression method is usually based on properties of the data extracted from observed real-world samples. Therefore, such approaches usually do not come with information-theoretic guarantees of optimality.
The other approach in the literature is to assume that the input data is generated through a certain stochastic model, and the goal is to study the information content and compression of such models by employing a notion of entropy. Thus these works are less tied to a specific application. For instance, Choi and Szpankowski have studied the structural entropy of the Erdős-Rényi model and the compression of such graphs [4], Aldous and Ross have studied the asymptotic behavior of the entropy associated to some models of sparse random graphs [5], and Abbe has studied the compression of stochastic block models [6].
In contrast to these prior works, we adopt the perspective of universal compression. Namely, we study the compression of graphical data in a "pointwise" sense, which is made more precise below. In particular, we try to make as few assumptions as we can about the properties or statistical characteristics of the graphical data that we are trying to compress. Unlike some of the prior works such as [4] which focus on compressing the isomorphism class of the input graphical data (so that the identities of the vertices do not matter and one only cares about the structure of the graph), in this work we aim to compress the graph so that the decoded graph matches the input graph at the level of both the structure and the identities of the vertices. It is widely believed that real-world graphical data are "sparse". Roughly speaking, a graph with $n$ vertices is said to be sparse (in a broad sense) if its number of edges is much smaller than $n^2$. This yields a whole spectrum of regimes under which one can study sparsity. One interesting sparsity regime is where, roughly speaking, the number of edges is a constant times the number of vertices (more precisely, when the number of edges grows linearly with the number of vertices in an asymptotic regime). In recent works the authors of this paper have studied the problem of universal lossless compression [7] and distributed compression [8] for sparse graphs in this sparsity regime (the latter in a model-based framework). This was done by employing the notion of "local weak convergence", an instance of the so-called "objective method" [9], [10], [11], which, roughly speaking, allows one to think of the graphical data as a sample from a limiting stochastic object derived from the empirical characteristics of the given sample (more precisely, this is done in an asymptotic setting, and the limiting stochastic object is a probability distribution on rooted graphs; details are given in Section II-A). Moreover, the authors have built upon the work of Bordenave and Caputo [12] to introduce a notion of entropy called the marked BC entropy which turns out to be the correct information-theoretic measure of optimality on a per-edge basis (which is the same as the per-vertex basis in this sparsity regime) for the purpose of the universal compression of graphical data in this formulation of the compression problem [13]. Note that compression to the correct information-theoretic limit on a per-edge basis is a significantly deeper guarantee of information-theoretic optimality than a crude guarantee of matching the growth rate of the overall entropy of the graphical data, since the leading term in the overall entropy depends only on the average degree of the graph and is on the scale of $n \log n$, where $n$ denotes the number of vertices; see the details in Section II.
The idea behind the local weak convergence framework is to study the asymptotic behavior of the distribution of the neighborhood structure of a typical vertex in the graph. This allows one to define a limit object associated to a sequence of sparse graphs, the sparsity regime of interest being where the number of edges grows linearly with the number of vertices. From the point of view of the compression problem, it is desirable to go beyond this sparsity regime and achieve universal compression for graphs which are still sparse, but with the number of edges growing super-linearly with the number of vertices, i.e. sparse graphs with heavy-tailed degree distributions. Indeed, it is generally believed that heavy-tailed degree distributions are more representative of real-world networks. Achieving universal compression in an information-theoretically optimal sense on a per-edge basis while being able to include heavy-tailed sparse graphical data in the framework is the purpose of this paper. More precisely, we build upon the universal compression scheme of [7] to go beyond the local weak convergence framework, and we design a universal compression scheme which is capable of encoding graphs which are either consistent with the local weak convergence framework or come from a specific class of sparse graphs with heavy-tailed degree distributions (and this has to be done without knowing which regime the graphical data comes from).
In order to address graphs with heavy-tailed degree distributions, we employ a version of the graphon theory adapted for sparse graphs [14], [15], [16]. For dense graphs (graphs where the number of edges scales as $n^2$), the theory of graphons allows one to make sense of a notion of limit and provides a comprehensive framework to study the asymptotic behavior (see, for instance, [17], [18], [19], [20], [21]). There has been a recent effort to bridge the gap between the above sparse regime addressed by local weak convergence and the dense regime addressed by the graphon theory (see, for instance, [22], [14], [15]). This framework, which we call the sparse graphon framework, defines a notion of convergence for heavy-tailed sparse graphs, similar to the local weak convergence framework, but with respect to a completely different metric.
Motivated by the above discussion, the local weak convergence framework and the sparse graphon framework together yield a powerful machinery which is capable of addressing sparsity in a broad range. In particular, we use this machinery to address the problem of universal compression of sparse graphical data. More precisely, we aim to compress a graph which is consistent with either the local weak convergence framework or the sparse graphon framework. The universality condition requires that the encoder knows neither which of the two frameworks the input graph is consistent with, nor the limiting object in each of the two frameworks. Nevertheless, we want the encoder to be information-theoretically optimal, in the sense that if we appropriately normalize the codeword length associated to the input graph, it does not asymptotically exceed the entropy of the limit object on a per-edge basis, with an appropriate notion of entropy for each of the two frameworks. In order to make sense of optimality in the local weak sense, we employ the notion of BC entropy from [12] which we discussed above. On the other hand, in order to make sense of optimality in the sparse graphon sense, we introduce a notion of entropy for this framework in Section III, which can be of independent interest. The main purpose of this work is to address information-theoretic limits of compression. An important direction for future work is the design of computationally efficient coding algorithms, similar to our prior work for the local weak convergence framework [23].
The structure of this paper is as follows. In Section II, we review local weak convergence, the BC entropy, sparse graphons, and the universal lossless compression scheme introduced in [7]. Then, in Section III, we introduce our notion of entropy for the sparse graphon framework. We then rigorously define the problem of finding a universal compression scheme which addresses both the local weak convergence and the sparse graphon frameworks in Section IV and state our main results on the existence of such schemes. We explain the details of our compression scheme in Section V. Afterwards, we analyze the performance of this scheme under the local weak convergence and the sparse graphon frameworks in Sections VI and VII respectively.
We close this section by introducing some notational conventions. We write := and =: for equality by definition. $\mathbb{R}$ and $\mathbb{R}_+$ denote the sets of real numbers and nonnegative real numbers respectively. $\mathbb{Z}$ and $\mathbb{N}$ denote the set of integers and the set of positive integers respectively. We denote the set of integers $\{1, \ldots, n\}$ by $[n]$. For $x \in \mathbb{R}$, $x \geq 1$, we may write $[x]$ as a shorthand for $[\lfloor x \rfloor]$. All logarithms are to the natural base, unless otherwise stated. We write $\{0,1\}^* - \emptyset$ for the set of finite sequences of zeros and ones, excluding the empty sequence. For a sequence $x \in \{0,1\}^* - \emptyset$, we denote its length in bits by $\mathrm{bits}(x)$. Moreover, we denote the length of $x$ in nats by $\mathrm{nats}(x) = \mathrm{bits}(x) \times \log 2$. $S^{p \times q}$ denotes the set of $p \times q$ matrices with values in the set $S$. For two sequences $(a_n : n \geq 1)$ and $(b_n : n \geq 1)$ of nonnegative real numbers, we write $a_n = O(b_n)$ if there exists a constant $C > 0$ such that $a_n \leq C b_n$ for $n$ large enough. Moreover, we write $a_n = o(b_n)$ if $a_n / b_n \to 0$ as $n \to \infty$. Also, we write $a_n = \omega(b_n)$ if $a_n / b_n \to \infty$ as $n \to \infty$. For a probability distribution $P$ defined on a finite set, $H(P)$ denotes the Shannon entropy of $P$. Similarly, for a random variable $X$ with finite support, $H(X)$ denotes the Shannon entropy associated to $X$. Moreover, for $\alpha \in [0, 1]$, we define $H(\alpha)$ to be the Shannon entropy (to the natural base) of a Bernoulli random variable with parameter $\alpha$. We use the abbreviation "a.s." for the phrase "almost surely". Table I in Appendix F summarizes the main notation used in this paper.

II. PRELIMINARIES
All graphs in this document are assumed to be undirected and simple, the latter meaning that self loops and multiple edges are not allowed. Hence we may drop the term "simple" when referring to graphs. We use the terms "node" and "vertex" interchangeably. We consider graphs which may have either a finite or a countably infinite number of vertices. For a graph $G$, let $V(G)$ denote the set of vertices in $G$. Two nodes $v$ and $w$ in a graph $G$ are said to be adjacent if they are connected by an edge, and we denote this by writing $v \sim_G w$. We denote the degree of a vertex $v$ in a graph $G$ by $\deg_G(v)$. $\mathcal{G}_n$ denotes the set of simple graphs on the vertex set $[n]$. A graph $G$ is called locally finite if the degree of every vertex in the graph is finite. Given a graph $G \in \mathcal{G}_n$, we denote its adjacency matrix by $A(G)$, which we recall is the $n \times n$ matrix whose entry $(i,j)$ is one if nodes $i$ and $j$ are adjacent in $G$, and zero otherwise. The density of a graph $G$, which is denoted by $\rho(G)$, is defined to be the density of ones in its adjacency matrix. More precisely,

$$\rho(G) := \frac{2m}{n^2}, \qquad (1)$$

where $n$ and $m$ denote the number of vertices and edges in $G$ respectively. For $p \geq 1$, the $L_p$ norm of an $n \times n$ matrix $A$ is defined as

$$\|A\|_p := \Big( \frac{1}{n^2} \sum_{i,j=1}^{n} |A_{i,j}|^p \Big)^{1/p}. \qquad (2)$$

Note the normalization. Thus $\rho(G) = \|A(G)\|_1$.
A walk between two vertices $v$ and $w$ in a graph $G$ is a sequence of nodes $v = v_0, v_1, \ldots, v_k = w$ such that $v_i \sim_G v_{i+1}$ for $0 \leq i < k$. The length of such a walk is defined to be $k$. The distance between two nodes $v$ and $w$ in a graph $G$ is defined to be the minimum length among the walks connecting them, and is defined to be $\infty$ if no such walk exists.
Two graphs $G$ and $G'$ are said to be isomorphic, and we write $G \equiv G'$, if there exists a bijection between their vertex sets which preserves adjacency. To better understand this notion, let $S_n$ denote the permutation group on the set $[n]$. For a permutation $\pi \in S_n$ and a graph $G$ on the vertex set $[n]$, let $\pi G$ be the graph on the same vertex set after the permutation $\pi$ is applied on the vertices. Namely, for each edge $(v, w)$ in $G$, we place an edge between the vertices $\pi(v)$ and $\pi(w)$ in $\pi G$. Then each $\pi G$ is isomorphic to $G$, and every graph that is isomorphic to $G$ is of the form $\pi G$ for some $\pi \in S_n$.
Given a graph $G$, and a subset $S$ of its vertices, the subgraph induced by $S$ is the graph comprised of the vertices in $S$ and those edges in $G$ that have both their endpoints in $S$. The connected component of a vertex $v \in V(G)$ is the subgraph of $G$ induced by the vertices that are at a finite distance from $v$. We write $G_v$ for the connected component of $v \in V(G)$. Note that $G_v$ is a connected graph.
The focus on how a graph looks from the point of view of each of its vertices is the key conceptual ingredient in the theory of local weak convergence. For this, we introduce the notion of a rooted graph and the notion of isomorphism of rooted graphs. Roughly speaking, a rooted graph should be thought of as a graph as seen from a specific vertex in it and the notion of two rooted graphs being isomorphic as capturing the idea that the respective graphs as seen from the respective distinguished vertices look the same. Notice that it is natural to restrict attention to the connected component containing the root when making such a definition, because, roughly speaking, a vertex of the graph should only be able to see the component to which it belongs.
For a precise definition, consider a graph $G$ and a distinguished vertex $o \in V(G)$. The pair $(G, o)$ is called a rooted graph. We call two rooted graphs $(G, o)$ and $(G', o')$ isomorphic, and write $(G, o) \equiv (G', o')$, if there exists an isomorphism between the connected components $G_o$ and $G'_{o'}$ of the respective roots which maps $o$ to $o'$. We write $[G, o]$ for the equivalence class of $(G, o)$ under this notion of isomorphism.

A. The Framework of Local Weak Convergence
In this section we review the framework of local weak convergence of graphs, which is an instance of the so-called objective method. See [10], [11], [24] for more details.
This framework can also take into account marked graphs, i.e. graphs where each vertex carries a label from a set called the vertex mark set and each edge carries a label from a set called the edge mark set. However, for the purpose of this work, we only focus on simple graphs without marks.
Let $\mathcal{G}_*$ denote the set of equivalence classes $[G, o]$ of rooted, connected, locally finite graphs $(G, o)$. For $[G, o]$ and $[G', o']$ in $\mathcal{G}_*$, let $\hat{h}$ be the supremum of those integers $h \geq 0$ such that the rooted subgraphs induced by the vertices at distance at most $h$ from the respective roots are isomorphic; then $d_*([G, o], [G', o'])$ is defined to be $1/(1+\hat{h})$. One can check that $d_*$ is a metric; in particular, it satisfies the triangle inequality. Moreover, $\mathcal{G}_*$ together with this metric is a Polish space, i.e. a complete separable metric space [11]. Let $\mathcal{T}_*$ denote the subset of $\mathcal{G}_*$ comprised of the equivalence classes $[G, o]$ arising from some $(G, o)$ where the graph underlying $G$ is a tree. In the sequel we will think of $\mathcal{G}_*$ as a Polish space with the metric $d_*$ defined above, rather than just a set. Note that $\mathcal{T}_*$ is a closed subset of $\mathcal{G}_*$.
For a Polish space $\Omega$, let $\mathcal{P}(\Omega)$ denote the set of Borel probability measures on $\Omega$. We say that a sequence of measures $\mu_n$ on $\Omega$ converges weakly to $\mu \in \mathcal{P}(\Omega)$, and write $\mu_n \Rightarrow \mu$, if for any bounded continuous function $f$ on $\Omega$, we have $\int f \, d\mu_n \to \int f \, d\mu$. It can be shown that it suffices to verify this condition only for uniformly continuous and bounded functions [25]. For a Borel set $B \subset \Omega$ and $\epsilon > 0$, the $\epsilon$-extension of $B$, denoted by $B^\epsilon$, is defined as the union of the open balls with radius $\epsilon$ centered around the points in $B$. For two probability measures $\mu$ and $\nu$ in $\mathcal{P}(\Omega)$, the Lévy-Prokhorov distance $d_{\mathrm{LP}}(\mu, \nu)$ is defined to be the infimum of all $\epsilon > 0$ such that for all Borel sets $B \subset \Omega$ we have $\mu(B) \leq \nu(B^\epsilon) + \epsilon$ and $\nu(B) \leq \mu(B^\epsilon) + \epsilon$. It is known that the Lévy-Prokhorov distance metrizes the topology of weak convergence on the space of probability distributions on a Polish space (see, for instance, [25]). For $x \in \Omega$, let $\delta_x$ be the Dirac measure at $x$.
For a finite graph $G$, define $U(G) \in \mathcal{P}(\mathcal{G}_*)$ as

$$U(G) := \frac{1}{|V(G)|} \sum_{v \in V(G)} \delta_{[G_v, v]}.$$

Note that $U(G) \in \mathcal{P}(\mathcal{G}_*)$. In creating $U(G)$ from $G$, we have created a probability distribution on rooted graphs from the given graph $G$ by rooting the graph at a vertex chosen uniformly at random. Furthermore, for an integer $h \geq 1$, let $(G, o)_h$ denote the rooted subgraph induced by the vertices at distance at most $h$ from $o$, and let

$$U_h(G) := \frac{1}{|V(G)|} \sum_{v \in V(G)} \delta_{[(G_v, v)_h]}.$$

We then have $U_h(G) \in \mathcal{P}(\mathcal{G}_*)$. See Figure 1 for an example: with $G$ being the graph in panel (a), panel (b) illustrates $U_2(G)$, which is a probability distribution on rooted graphs of depth at most 2, and panel (c) depicts $U(G)$, which is a probability distribution on $\mathcal{G}_*$; in panels (b) and (c) the root is the vertex at the top. We say that a probability distribution $\mu$ on $\mathcal{G}_*$ is the local weak limit of a sequence of finite graphs $\{G_n\}_{n=1}^{\infty}$ when $U(G_n)$ converges weakly to $\mu$ (with respect to the topology on $\mathcal{P}(\mathcal{G}_*)$ induced by the metric $d_*$ on $\mathcal{G}_*$). This turns out to be equivalent to the condition that, for any finite depth $h \geq 0$, the structure of $G_n$ from the point of view of a root chosen uniformly at random, examined only to depth $h$, converges in distribution to $\mu$ truncated up to depth $h$. This description of what is being captured by the definition justifies the term "local" in local weak convergence.
In fact, U h (G) could be thought of as the "depth h empirical distribution" of the graph G. On the other hand, a probability distribution μ ∈ P(G * ) that arises as a local weak limit plays the role of a stochastic process on graphical data, and a sequence of graphs {G n } ∞ n=1 could be thought of as being asymptotically distributed like this process when μ is the local weak limit of the sequence.
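To make the "depth $h$ empirical distribution" concrete, the following sketch computes an approximation of $U_h(G)$ from an adjacency list: it extracts the depth-$h$ ball around each vertex by breadth-first search and buckets the balls by a canonical string. As a simplification, the canonical string is produced by iterated neighborhood hashing in the spirit of Weisfeiler-Leman refinement, which is only a proxy for exact rooted-graph isomorphism; all names here are our own illustration, not taken from [7].

```python
from collections import Counter, deque

def ball(adj, root, h):
    """Vertices within distance h of root, found by BFS; maps vertex -> distance."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        if dist[v] == h:
            continue
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def canonical_label(adj, root, h):
    """WL-style refinement string for the rooted depth-h ball (a proxy for
    the isomorphism class; distinct classes may in rare cases collide)."""
    dist = ball(adj, root, h)
    verts = set(dist)
    label = {v: str(dist[v]) for v in verts}  # initialize by distance to root
    for _ in range(h + 1):
        label = {v: label[v] + "|" + ",".join(sorted(label[w] for w in adj[v] if w in verts))
                 for v in verts}
    return label[root]

def U_h(adj, h):
    """Empirical distribution of depth-h neighborhood classes."""
    counts = Counter(canonical_label(adj, v, h) for v in adj)
    n = len(adj)
    return {cls: c / n for cls, c in counts.items()}

# Example: a 6-cycle; every vertex sees the same depth-2 ball.
adj = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(sorted(U_h(adj, 2).values()))  # [1.0]: a single class with probability one
```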
The degree of a probability measure $\mu \in \mathcal{P}(\mathcal{G}_*)$, denoted by $\deg(\mu)$, is defined as

$$\deg(\mu) := \int \deg_G(o) \, d\mu([G, o]),$$

which is the expected degree of the root. We next present some examples to illustrate the concepts defined so far.
1) Let $G_n$ be the finite lattice $\{-n, \ldots, n\} \times \{-n, \ldots, n\}$ in $\mathbb{Z}^2$. As $n$ goes to infinity, the local weak limit of this sequence is the distribution that gives probability one to the lattice $\mathbb{Z}^2$ rooted at the origin. The reason is that if we fix a depth $h \geq 0$ then for $n$ large almost all of the vertices in $G_n$ cannot see the borders of the lattice when they look at the graph around them up to depth $h$, so these vertices cannot locally distinguish the graph on which they live from the infinite lattice $\mathbb{Z}^2$.
2) Suppose $G_n$ is a cycle of length $n$. The local weak limit of this sequence of graphs gives probability one to an infinite 2-regular tree rooted at one of its vertices. The intuitive explanation for this is essentially identical to that for the preceding example.
3) Let $G_n$ be a realization of the sparse Erdős-Rényi graph $G(n, \alpha/n)$ where $\alpha > 0$, i.e. $G_n$ has $n$ vertices and each edge is independently present with probability $\alpha/n$ (here $n$ is assumed to be sufficiently large). One can show that if all the $G_n$ are defined on a common probability space then, almost surely, the local weak limit of the sequence is the Poisson Galton-Watson tree with mean $\alpha$, rooted at the initial vertex. To justify why this should be true without going through the details, note that the degree of a vertex in $G_n$ is the sum of $n-1$ independent Bernoulli random variables, each with parameter $\alpha/n$. For $n$ large, this approximately has a Poisson distribution with mean $\alpha$ (see the simulation sketch following this list). This argument can be repeated for any of the vertices to which the chosen vertex is connected, which play the role of the offspring of the initial vertex in the limit. The essential point is that the probability of having loops in the neighborhood of a typical vertex up to a depth $h$ is negligible whenever $h$ is fixed and $n$ goes to infinity.
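The third example can be checked numerically: in a sparse Erdős-Rényi graph the degree of a typical vertex is approximately Poisson. Below is a minimal simulation using only the standard library; the function and variable names are our own illustration.

```python
import random
from collections import Counter
from math import exp, factorial

def sparse_er_degrees(n, alpha, rng):
    """Sample G(n, alpha/n) and return the degree of every vertex."""
    p = alpha / n
    deg = [0] * n
    for v in range(n):
        for w in range(v + 1, n):
            if rng.random() < p:
                deg[v] += 1
                deg[w] += 1
    return deg

rng = random.Random(0)
n, alpha = 2000, 3.0
emp = Counter(sparse_er_degrees(n, alpha, rng))
for k in range(6):
    poisson = exp(-alpha) * alpha**k / factorial(k)  # Poisson(alpha) pmf at k
    print(k, round(emp[k] / n, 3), round(poisson, 3))
```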

B. Unimodularity
In order to get a better understanding of the nature of the results proved in this paper, it is helpful to understand what is meant by a unimodular probability distribution μ ∈ P(G * ). We give the relevant definitions and context in this section.
Since each vertex in G n has the same chance of being chosen as the root in the definition of U (G n ), this should manifest itself as some kind of stationarity property of the limit μ with respect to changes of the root. A probability distribution μ ∈ P(G * ) is called sofic if there exists a sequence of finite graphs G n with local weak limit μ. The definition of unimodularity is made in an attempt to understand what it means for a Borel probability distribution on G * to be sofic.
To define unimodularity, let $\mathcal{G}_{**}$ be the set of isomorphism classes $[G, o, v]$ where $G$ is a connected graph with two distinguished vertices $o$ and $v$ in $V(G)$ (ordered, but not necessarily distinct). Here, isomorphism is defined by an adjacency-preserving vertex bijection which also maps the two distinguished vertices of one object to the respective ones of the other. $\mathcal{G}_{**}$ can be metrized as a Polish space in a manner similar to that used to metrize $\mathcal{G}_*$. A measure $\mu \in \mathcal{P}(\mathcal{G}_*)$ is said to be unimodular if, for all measurable functions $f : \mathcal{G}_{**} \to \mathbb{R}_+$, we have

$$\int \sum_{v \sim_G o} f([G, o, v]) \, d\mu([G, o]) = \int \sum_{v \sim_G o} f([G, v, o]) \, d\mu([G, o]).$$

This is called involution invariance [11]. Let $\mathcal{P}_u(\mathcal{G}_*)$ denote the set of unimodular probability measures on $\mathcal{G}_*$. Also, since $\mathcal{T}_* \subset \mathcal{G}_*$, we can define the set of unimodular probability measures on $\mathcal{T}_*$ and denote it by $\mathcal{P}_u(\mathcal{T}_*)$. A sofic probability measure is unimodular. Whether the converse also holds is unknown.

C. The BC Entropy
In this section we review the notion of entropy introduced by Bordenave and Caputo for probability distributions on the space G * [12]. We call this notion the BC entropy. The authors of this paper have generalized this entropy to the regime where the vertices and edges in the graph also carry marks, but we omit that discussion here since we focus on unmarked graphs throughout this work [13].
For integers $n, m \in \mathbb{N}$, let $\mathcal{G}_{n,m}$ denote the set of graphs on the vertex set $[n]$ with precisely $m$ edges. An application of Stirling's formula implies that if $d > 0$ and the sequence $m_n$ is such that $m_n / n \to d/2$, then we have

$$\log |\mathcal{G}_{n, m_n}| = m_n \log n + n \, \frac{d}{2}(1 - \log d) + o(n).$$

The key idea in defining the BC entropy is to count the number of "typical" graphs. More precisely, given $\mu \in \mathcal{P}(\mathcal{G}_*)$ and $\epsilon > 0$, let $\mathcal{G}_{n,m}(\mu, \epsilon)$ denote the set of graphs $G \in \mathcal{G}_{n,m}$ such that $d_{\mathrm{LP}}(U(G), \mu) < \epsilon$, where $d_{\mathrm{LP}}$ refers to the Lévy-Prokhorov metric on $\mathcal{P}(\mathcal{G}_*)$ [25]. In fact, one can interpret $\mathcal{G}_{n,m}(\mu, \epsilon)$ as the set of $\epsilon$-typical graphs with respect to $\mu$. It turns out that, roughly speaking, the number of $\epsilon$-typical graphs scales as follows:

$$|\mathcal{G}_{n, m_n}(\mu, \epsilon)| = \exp\big( m_n \log n + n \Sigma(\mu) + o(n) \big),$$

where $\Sigma(\mu)$ is the BC entropy of $\mu$ which will be defined below. In order to make this precise, we make the following definition.

Definition 1: Assume $\mu \in \mathcal{P}(\mathcal{G}_*)$ is given, with $0 < \deg(\mu) < \infty$. Assume that $d > 0$ is fixed and a sequence $m_n$ of integers is given such that $m_n / n \to d/2$ as $n \to \infty$. With these, for $\epsilon > 0$, we define

$$\overline{\Sigma}_d(\mu, \epsilon)|_{(m_n)} := \limsup_{n \to \infty} \frac{\log |\mathcal{G}_{n, m_n}(\mu, \epsilon)| - m_n \log n}{n},$$

which we call the $\epsilon$-upper BC entropy. Since this is increasing in $\epsilon$, we can define the upper BC entropy as

$$\overline{\Sigma}_d(\mu)|_{(m_n)} := \lim_{\epsilon \downarrow 0} \overline{\Sigma}_d(\mu, \epsilon)|_{(m_n)}.$$

We may similarly define the $\epsilon$-lower BC entropy $\underline{\Sigma}_d(\mu, \epsilon)|_{(m_n)}$ as

$$\underline{\Sigma}_d(\mu, \epsilon)|_{(m_n)} := \liminf_{n \to \infty} \frac{\log |\mathcal{G}_{n, m_n}(\mu, \epsilon)| - m_n \log n}{n}.$$

Since this is increasing in $\epsilon$, we can define the lower BC entropy as $\underline{\Sigma}_d(\mu)|_{(m_n)} := \lim_{\epsilon \downarrow 0} \underline{\Sigma}_d(\mu, \epsilon)|_{(m_n)}$.

Theorem 1.2 in [12] summarizes some of the main properties of the BC entropy. For better readability, we split that theorem as Theorems 1 and 2 below. The following Theorem 1 shows that certain conditions must be met for the BC entropy to be of interest.

Theorem 1 (Theorem 1.2 in [12]): Let $\mu \in \mathcal{P}(\mathcal{G}_*)$ with $0 < \deg(\mu) < \infty$, and let $m_n$ be a sequence with $m_n / n \to d/2$ for some $d > 0$. If $\mu$ is not unimodular, or $\mu(\mathcal{T}_*) < 1$, or $d \neq \deg(\mu)$, then $\overline{\Sigma}_d(\mu)|_{(m_n)} = -\infty$.
A consequence of Theorem 1 is that the only case of interest in the discussion of BC entropy is when $\mu \in \mathcal{P}_u(\mathcal{T}_*)$, $d = \deg(\mu)$, and the sequence $m_n$ is such that $m_n / n \to \deg(\mu)/2$. Namely, the only upper and lower BC entropies of interest are $\overline{\Sigma}_{\deg(\mu)}(\mu)|_{(m_n)}$ and $\underline{\Sigma}_{\deg(\mu)}(\mu)|_{(m_n)}$ respectively.
The following Theorem 2 establishes that the upper and lower BC entropies do not depend on the choice of the defining sequence m n . Further, this theorem establishes that the upper BC entropy is always equal to the lower BC entropy.
Theorem 2: Assume that $d > 0$ is given. For any $\mu \in \mathcal{P}(\mathcal{G}_*)$ such that $0 < \deg(\mu) < \infty$, we have:
1) The values of $\overline{\Sigma}_d(\mu)|_{(m_n)}$ and $\underline{\Sigma}_d(\mu)|_{(m_n)}$ are invariant under the specific choice of the sequence $m_n$ satisfying $m_n / n \to d/2$. With this, we may simplify the notation and unambiguously write $\overline{\Sigma}_d(\mu)$ and $\underline{\Sigma}_d(\mu)$.
2) $\overline{\Sigma}_d(\mu) = \underline{\Sigma}_d(\mu)$. We may therefore unambiguously write $\Sigma_d(\mu)$ for this common value and call it the BC entropy of $\mu \in \mathcal{P}(\mathcal{G}_*)$ with respect to $d$.
We are now in a position to define the BC entropy. Definition 2: For $\mu \in \mathcal{P}(\mathcal{G}_*)$ with $0 < \deg(\mu) < \infty$, the BC entropy of $\mu$ is defined to be $\Sigma(\mu) := \Sigma_{\deg(\mu)}(\mu)$.
The reader is referred to [12] for a detailed discussion of the BC entropy and some of its additional properties. For instance, it can be shown that the BC entropy of a probability distribution $\mu \in \mathcal{P}_u(\mathcal{T}_*)$ can be approximated in terms of the finite depth truncation of $\mu$ [12, Theorem 1.3]. The reader is also referred to [13] for the generalization of this notion to the marked regime.
The following proposition is then an immediate consequence of Lemma 1.

Proposition 1: Let $\mu \in \mathcal{P}_u(\mathcal{T}_*)$ with $\deg(\mu) \in (0, \infty)$, and for $\Delta > 0$ let $\mu_\Delta$ denote the law of $[T_\Delta, o]$ when $[T, o] \sim \mu$, where $T_\Delta$ is obtained from $T$ by removing all the edges where the degree of at least one of their endpoints is strictly bigger than $\Delta$, followed by taking the connected component of the root $o$. Then $\limsup_{\Delta \to \infty} \Sigma(\mu_\Delta) \leq \Sigma(\mu)$.

D. Graphons
The theory of graphons provides a comprehensive framework to study the asymptotics of dense graphs by introducing a limit theory for such graphs (see, for instance, [17], [18], [19], [20], [21]). There has been some effort in adapting this theory for sparse graphs (see, for instance, [22], [14], [15]). Also, the problem of graphon estimation given random graph samples has been extensively studied both in the dense regime and in the sparse regime (see, for instance, [26], [27], [28], [29], [30]). In this section, we review the notion of graphons in the sparse regime. Furthermore, we review the result from [16] on graphon estimation in this regime. Here, we mainly stick to the setup and notation introduced in [16].
Assume that a probability space $(\Omega, \mathcal{F}, \pi)$ is given. A graphon on this probability space is defined to be a measurable function $W : \Omega \times \Omega \to \mathbb{R}_+$ which is symmetric, i.e. $W(x, y) = W(y, x)$ for all $x, y \in \Omega$, and is $L_1$, i.e. $\|W\|_1 < \infty$. Here, the $L_p$ norm of a function $f : \Omega \times \Omega \to \mathbb{R}$ is defined as

$$\|f\|_p := \Big( \int |f(x, y)|^p \, d\pi(x) \, d\pi(y) \Big)^{1/p}.$$

Moreover, $\|W\|_\infty$ is defined to be the essential supremum of $W$ with respect to the product measure $\pi \times \pi$. We may simply say that $W$ is a graphon on $\Omega$ when the $\sigma$-algebra $\mathcal{F}$ and the probability measure $\pi$ are clear from the context. In particular, when we refer to a graphon $W$ as being defined over $[0, 1]$, it refers to a graphon over the probability space $[0, 1]$ equipped with the standard Borel $\sigma$-algebra and the uniform distribution, unless otherwise stated. A simple graph $G$ on a finite vertex set $V$ naturally defines a graphon $W$ over the probability space $V$ equipped with the uniform distribution, defined as $W(v, w) = (A(G))_{v,w}$ for $v, w \in V$. Note that for each $p \geq 1$ the $L_p$ norm of this graphon is the same as that of the adjacency matrix of the underlying graph $G$, as defined in (2).
Assume that a symmetric $n \times n$ matrix $B$ with nonnegative entries is given together with a probability vector $\vec{p} = (p_1, \ldots, p_n)$ such that $p_i \geq 0$ for $1 \leq i \leq n$ and $\sum_{i=1}^{n} p_i = 1$. We define the block graphon $(\vec{p}, B)$ to be the graphon $W$ over the finite probability space $[n]$ equipped with the probability distribution $\vec{p}$ such that for $1 \leq i, j \leq n$ we have $W(i, j) = B_{i,j}$. This generalizes the notion in the preceding paragraph of a graphon associated to a simple graph. Now, we state the notion of equivalence for graphons (see Definition 2.5 in [16]). Given two graphons $W$ and $W'$ on probability spaces $(\Omega, \mathcal{F}, \pi)$ and $(\Omega', \mathcal{F}', \pi')$, respectively, we say that $W$ and $W'$ are equivalent if there exist a third probability space $(\Omega'', \mathcal{F}'', \pi'')$, two measure preserving maps $\phi : \Omega \to \Omega''$ and $\phi' : \Omega' \to \Omega''$, and a graphon $U$ on $(\Omega'', \mathcal{F}'', \pi'')$, such that for almost all $(x, y) \in \Omega \times \Omega$, with respect to the product measure $\pi \times \pi$, we have $W(x, y) = U(\phi(x), \phi(y))$, and similarly for almost all $(x', y') \in \Omega' \times \Omega'$, with respect to the product measure $\pi' \times \pi'$, we have $W'(x', y') = U(\phi'(x'), \phi'(y'))$.
For two $L_2$ graphons $W$ and $W'$, defined on probability spaces $(\Omega, \mathcal{F}, \pi)$ and $(\Omega', \mathcal{F}', \pi')$ respectively, we define

$$\delta_2(W, W') := \inf_\nu \Big( \int |W(x, y) - W'(x', y')|^2 \, d\nu(x, x') \, d\nu(y, y') \Big)^{1/2},$$

where the infimum is taken over all couplings $\nu$ of $\pi$ and $\pi'$, i.e. $\nu$ is a probability measure over $\Omega \times \Omega'$ with marginals $\pi$ and $\pi'$, respectively. In fact, $\delta_2$ yields a metric on the space of equivalence classes of $L_2$ graphons, with reference to the notion of equivalence described above (see [16, Theorem 2.11 and Appendix A] and [31]). Moreover, for two graphons $W$ and $W'$ on two probability spaces $(\Omega, \mathcal{F}, \pi)$ and $(\Omega', \mathcal{F}', \pi')$ respectively, we define the cut norm as

$$\delta_\Box(W, W') := \inf_\nu \sup_{S, T} \Big| \int_{S \times T} \big( W(x, y) - W'(x', y') \big) \, d\nu(x, x') \, d\nu(y, y') \Big|, \qquad (7)$$

where the infimum is taken over all couplings $\nu$ of $\pi$ and $\pi'$, similar to the above, and the supremum is over measurable subsets $S$ and $T$ of $\Omega \times \Omega'$. Note that every graphon is by definition an $L_1$ function with respect to its underlying product measure, hence the cut norm is well defined. In fact, $\delta_\Box$ yields a metric on the space of equivalence classes of graphons with reference to the notion of equivalence described above (see [16, Theorem 2.11 and Appendix A] and [31]).
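For intuition, on a common finite probability space and with the trivial identity coupling, the inner quantity in (7) reduces to the cut norm of a weighted difference matrix, which can be brute-forced for tiny examples. The sketch below is our own illustration; it ignores the optimization over couplings, so it only upper bounds $\delta_\Box$.

```python
from itertools import combinations

def cut_norm_upper_bound(D, p):
    """max over S, T of |sum_{i in S, j in T} p_i p_j D[i][j]|,
    by brute force over all subsets; exponential cost, tiny k only."""
    k = len(D)
    subsets = [list(c) for r in range(k + 1) for c in combinations(range(k), r)]
    best = 0.0
    for S in subsets:
        for T in subsets:
            val = abs(sum(p[i] * p[j] * D[i][j] for i in S for j in T))
            best = max(best, val)
    return best

# Difference of two 2-block graphons on blocks of mass (0.5, 0.5).
W1 = [[1.2, 0.8], [0.8, 1.2]]
W2 = [[1.0, 1.0], [1.0, 1.0]]
D = [[W1[i][j] - W2[i][j] for j in range(2)] for i in range(2)]
print(cut_norm_upper_bound(D, [0.5, 0.5]))  # 0.05
```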
A graphon $W$ is said to be normalized if $\|W\|_1 = 1$. Given a normalized graphon $W$ on a probability space $(\Omega, \mathcal{F}, \pi)$ and a sequence of target densities $\rho_n$, i.e. strictly positive real numbers, we define the sequence of $W$-random graphs with target density $\rho_n$ as a sequence of random graphs $G^{(n)}$, where $G^{(n)}$ is defined on the vertex set $[n]$, as follows. We first generate a sequence of i.i.d. random variables $(X_i : i \geq 1)$ with law $\pi$. Then, for each $n$ and each pair of vertices $1 \leq v < w \leq n$, we independently place an edge between $v$ and $w$ in $G^{(n)}$ with probability $\min\{1, \rho_n W(X_v, X_w)\}$. Note that the distribution of $G^{(n)}$ is dependent on the random variables $X_1, \ldots, X_n$, and the sequence $(X_i : i \geq 1)$ is generated prior to generating $G^{(n)}$. Consequently, the random graphs $G^{(n)}$ defined in this procedure are dependent. We denote the law of $G^{(n)}$ in this procedure by $\mathcal{G}(n; \rho_n W)$. A theorem from [16] shows that equivalent graphons generate identical distributions, given some conditions on the sequence $\rho_n$.
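The generation procedure above translates directly into code. The sketch below is our own illustration (all names are ours); it samples a $W$-random graph from a graphon given as a callable, here a block graphon lookup.

```python
import random
from math import comb

def sample_w_random_graph(n, rho_n, W, sample_x, rng):
    """Sample G^(n) ~ G(n; rho_n W): latent positions first, then
    independent edges with probability min(1, rho_n * W(x_v, x_w))."""
    xs = [sample_x(rng) for _ in range(n)]  # X_1, ..., X_n i.i.d. with law pi
    edges = [(v, w) for v in range(n) for w in range(v + 1, n)
             if rng.random() < min(1.0, rho_n * W(xs[v], xs[w]))]
    return xs, edges

# Example: normalized 2-block graphon on blocks of mass (0.5, 0.5).
B = [[1.6, 0.4], [0.4, 1.6]]             # sum_ij p_i p_j B_ij = 1
W = lambda x, y: B[x][y]
sample_x = lambda rng: rng.randrange(2)  # pi = uniform on {0, 1}
rng = random.Random(1)
n = 1000
rho_n = n ** -0.5                        # rho_n -> 0 while n * rho_n -> infinity
xs, edges = sample_w_random_graph(n, rho_n, W, sample_x, rng)
print(len(edges), "edges; nominal m_n =", round(comb(n, 2) * rho_n))
```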
In the dense regime, graphons are defined to take values bounded by 1 (see for instance Section 7.1 in [21]). However, in the sparse regime discussed above, this condition is relaxed, and graphons are allowed to be unbounded. Instead, the sequence of target densities $\rho_n$ is introduced, which scales the graphon in order to obtain the desired edge probability. In fact, under some conditions on the sequence $\rho_n$, if $W$ is a normalized graphon, then $W$-random graphs have a density close to $\rho_n$, justifying the term target density. Moreover, under some conditions on the sequence $\rho_n$, the sequence of $W$-random graphs converges to $W$ with respect to the cut metric defined in (7). These statements are made precise in the following theorem.
Theorem 4 (Theorem 2.14 in [16]): Let $G^{(n)} \sim \mathcal{G}(n; \rho_n W)$ be a sequence of $W$-random graphs with target density $\rho_n$, where $W$ is a normalized graphon over an arbitrary probability space, and $\rho_n$ is such that $n \rho_n \to \infty$ and either $\limsup_{n \to \infty} \rho_n \|W\|_\infty \leq 1$ or $\rho_n \to 0$. Then, almost surely, $\rho(G^{(n)}) / \rho_n \to 1$ and

$$\delta_\Box\Big( \frac{G^{(n)}}{\rho(G^{(n)})}, W \Big) \to 0.$$

Note that, as we discussed above, $G^{(n)}$ naturally defines a graphon, and $G^{(n)} / \rho(G^{(n)})$ refers to the scaled graphon corresponding to $G^{(n)}$. Recall from (1) that $\rho(G^{(n)})$ is the density of the graph $G^{(n)}$, and is defined to be $2 m^{(n)} / n^2$, where $m^{(n)}$ denotes the number of edges in $G^{(n)}$. Theorem 4 above implies that for $G^{(n)} \sim \mathcal{G}(n; \rho_n W)$, with probability one $\rho(G^{(n)}) / \rho_n \to 1$, or equivalently $2 m^{(n)} / (\rho_n n^2) \to 1$.
Recall that since we want to study sparse graphs, we want $m^{(n)}$ to scale much more slowly than $n^2$. For this to happen, since $2 m^{(n)} / (\rho_n n^2) \to 1$ almost surely, from this point forward we assume that $\rho_n \to 0$. Moreover, motivated by Theorem 4 above, from this point forward we also assume that $n \rho_n \to \infty$. Roughly speaking, the condition $n \rho_n \to \infty$ ensures that the graphs in the sequence of $W$-random graphs are not too sparse. More precisely, since $n \rho_n \approx 2 m^{(n)} / n$, the condition $n \rho_n \to \infty$ roughly means that the average degree in $G^{(n)}$ goes to infinity. Therefore, this sparse graphon framework allows us to study heavy-tailed sparse graphs, as opposed to the local weak convergence theory, which requires the existence of a well-defined limiting degree distribution at the root.
Borgs et al. have addressed the problem of estimating the graphon $W$ upon observing a sample, for each vertex size $n$, of a sequence of $W$-random graphs [16]. They study three methods for doing so, namely least squares estimation, cut norm estimation, and degree sorting. Here, we only review the least squares estimation method, and refer the reader to [16] for further reading. We will later employ this estimation method in our universal compression scheme.
1) Least Squares Algorithm: In this section, we explain the least squares algorithm for graphon estimation from [16] and state its properties. First, we need to introduce some notation. Given integers $n$ and $k$, a function $\pi : [n] \to [k]$ partitions the vertex set $[n]$ into the classes $\pi^{-1}(\{1\}), \ldots, \pi^{-1}(\{k\})$, and together with a $k \times k$ matrix $B$ it defines the $n \times n$ block matrix $B_\pi$ with $(B_\pi)_{v,w} := B_{\pi(v), \pi(w)}$.

Least Squares Algorithm: Given a graph $G$ on $n$ vertices, and a parameter $\beta$ such that $1 \leq \beta \leq n$, let

$$\min_{k, \, B, \, \pi} \ \| A(G) - B_\pi \|_2, \qquad (8)$$

where the minimization is over natural numbers $k$, $k \times k$ matrices $B$ with nonnegative entries, and functions $\pi : [n] \to [k]$ such that $|\pi^{-1}(\{i\})| \geq \lfloor n/\beta \rfloor$ for every nonempty class $\pi^{-1}(\{i\})$, and the norm is as defined in (2). In other words, we may rewrite (8) equivalently as follows

$$(\hat{\pi}, \hat{B}) \in \operatorname*{arg\,min}_{\pi : [n] \to [\beta], \, B} \ \| A(G) - B_\pi \|_2, \qquad (9)$$

where the minimization is taken over $[\beta] \times [\beta]$ matrices $B$ with nonnegative entries and functions $\pi : [n] \to [\beta]$ whose nonempty classes have size at least $\lfloor n/\beta \rfloor$. (Recall from Section I that $[\beta]$ is a shorthand for $[\lfloor \beta \rfloor]$.) Assume that we have solved the optimization in (9), and $\hat{\pi}$ and $\hat{B}$ are arbitrary optimizers. Then, we define the output of the least squares estimation algorithm to be the block graphon $(\hat{p}, \hat{B})$, where the probability vector $\hat{p} = (\hat{p}_1, \ldots, \hat{p}_{\lfloor \beta \rfloor})$ is given by $\hat{p}_i := |\hat{\pi}^{-1}(\{i\})| / n$. Note that the discrete optimization problem in (9) requires searching over all mappings $\pi : [n] \to [\beta]$. However, since the objective is an $L_2$ norm, by fixing $\pi$ the objective is minimized by choosing $B$ such that for $1 \leq i, j \leq \beta$ such that $\pi^{-1}(\{i\})$ and $\pi^{-1}(\{j\})$ are not empty we have

$$B_{i,j} = \frac{1}{|\pi^{-1}(\{i\})| \, |\pi^{-1}(\{j\})|} \sum_{v \in \pi^{-1}(\{i\})} \sum_{w \in \pi^{-1}(\{j\})} (A(G))_{v,w}.$$

Note that the choice of $B_{i,j}$ for $i$ and $j$ such that either $\pi^{-1}(\{i\}) = \emptyset$ or $\pi^{-1}(\{j\}) = \emptyset$ has no effect on the objective. In other words, for fixed $\pi$ the optimizer $B$ must take the average of the adjacency matrix over the blocks defined by $\pi$.
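For a fixed assignment $\pi$, the inner optimization is just block averaging, as the following sketch illustrates (names are our own; the hard combinatorial part, the search over assignments $\pi$, is omitted).

```python
def block_average(A, pi, k):
    """Optimal B for a fixed assignment pi: average A over each block pair."""
    classes = [[v for v in range(len(A)) if pi[v] == i] for i in range(k)]
    B = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            if classes[i] and classes[j]:
                s = sum(A[v][w] for v in classes[i] for w in classes[j])
                B[i][j] = s / (len(classes[i]) * len(classes[j]))
    return B

def l2_objective(A, B, pi):
    """Normalized L2 error ||A - B_pi||_2 as in (2)."""
    n = len(A)
    err = sum((A[v][w] - B[pi[v]][pi[w]]) ** 2 for v in range(n) for w in range(n))
    return (err / n ** 2) ** 0.5

# Example: a 4-vertex graph split into two classes {0, 1} and {2, 3}.
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
pi = [0, 0, 1, 1]
B = block_average(A, pi, 2)
print(B, round(l2_objective(A, B, pi), 4))
```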
It can be shown that with an appropriate choice of the parameter β, the above algorithm yields a consistent graphon estimation scheme, in the following sense.
Theorem 5 (Theorem 3.1 in [16]): Let $W$ be an $L_2$ graphon, normalized so that $\|W\|_1 = 1$, and let $G^{(n)}$ be a sequence of $W$-random graphs with target densities $(\rho_n : n \geq 1)$. Furthermore, let $\widehat{W} = (\hat{p}, \hat{B})$ be the output of the above least squares algorithm for $G^{(n)}$ with parameter $\beta_n$. Then, if $\rho_n$ and $\beta_n$ are such that, as $n \to \infty$, we have $\rho_n \to 0$, $n \rho_n \to \infty$, $\beta_n \to \infty$, and $\beta_n^2 \log \beta_n = o(n \rho_n)$, then, with probability one, we have

$$\lim_{n \to \infty} \delta_2\Big( \frac{\widehat{W}}{\rho(G^{(n)})}, W \Big) = 0.$$

Remark 1: In order to simplify the expressions, in the above we have stated a reparametrization of the least squares algorithm presented in [16]. We can show that Theorem 5 is a consequence of Theorem 3.1 in [16]. In [16], the least squares algorithm is explained based on the optimization problem (8), with the only difference that the constraint $|\pi^{-1}(\{i\})| \geq \lfloor n/\beta \rfloor$ for nonempty classes is replaced by the constraint $|\pi^{-1}(\{i\})| \geq \kappa n$ for a parameter $\kappa$. Given $\beta$ and the optimization problem in the above form (8), one can choose $\kappa$ to be $\lfloor n/\beta \rfloor / n$ to obtain an optimization problem in the form presented in [16]. Also, given $\kappa$ and the optimization in the form presented in [16], one can choose $\beta = n / \lceil n\kappa \rceil$ to obtain an optimization of the form (8). Furthermore, in the setup of Theorem 5 above, given a sequence $\beta_n$ such that $\beta_n \to \infty$ and $\beta_n^2 \log \beta_n = o(n \rho_n)$, the corresponding sequence $\kappa_n = \lfloor n/\beta_n \rfloor / n$ satisfies precisely the assumption required by Theorem 3.1 in [16].

E. A Universal Lossless Compression Scheme Adapted to the Local Weak Convergence Framework
In this section we review the compression scheme introduced by the authors in [7]. This scheme yields a universal lossless compression for a sequence of sparse graphs converging to a limit in the local weak sense, without knowing a priori what that limit is. The compression scheme in [7] allows for the graphs to be marked, i.e. vertices and edges in the graph can carry additional marks on top of the connectivity structure of the graph. However, since we do not include marks in our discussion here, we present the results of [7] reduced to our unmarked setting.
More precisely, we introduce a compression map $f^{\mathrm{lwc}}_n : \mathcal{G}_n \to \{0,1\}^* - \emptyset$ which assigns a codeword to each graph on the vertex set $[n]$ in a prefix-free way. Here, the superscript lwc stands for "local weak convergence", and is used to distinguish this map from the compression map we will introduce later in Section V. This compression scheme is lossless, i.e. there exists a decompression map $g^{\mathrm{lwc}}_n$ such that $g^{\mathrm{lwc}}_n \circ f^{\mathrm{lwc}}_n$ is the identity map. Moreover, the compression scheme is universal in the sense that given a sequence of graphs $G^{(n)}$ converging to a limit $\mu \in \mathcal{P}_u(\mathcal{G}_*)$ in the local weak sense, where $\deg(\mu) \in (0, \infty)$, without a priori knowledge of $\mu$, we have

$$\limsup_{n \to \infty} \frac{\mathrm{nats}(f^{\mathrm{lwc}}_n(G^{(n)})) - m^{(n)} \log n}{n} \leq \Sigma(\mu). \qquad (10)$$

Here, $m^{(n)}$ is the number of the edges in $G^{(n)}$, and the normalization is done in a way consistent with the definition of the BC entropy in Section II-C. It can be shown that the compression scheme described below satisfies the above properties.
1) Given the input graph $G^{(n)}$, define the set $Y_n$ of vertices which either have degree more than $\log \log n$ in $G^{(n)}$ or have a neighbor with degree more than $\log \log n$.
2) Let $\widetilde{G}^{(n)}$ be the graph obtained from $G^{(n)}$ by removing every edge for which at least one endpoint has degree more than $\log \log n$; thus all degrees in $\widetilde{G}^{(n)}$ are bounded by $\log \log n$.
3) We encode $\widetilde{G}^{(n)}$ as follows. a) Let $A_n$ be a set, depending only on $n$, of representatives of the possible local neighborhood structures in graphs with degrees bounded by $\log \log n$ (we refer the reader to [7] for the precise definition). b) For each element of $A_n$, we encode the number of vertices in $\widetilde{G}^{(n)}$ whose local neighborhood matches that element, using $1 + \lceil \log_2 n \rceil$ bits per element. In other words, we encode the appearance frequency of each possible local neighborhood in $\widetilde{G}^{(n)}$. c) Let $\mathcal{W}_n$ denote the set of graphs $G \in \mathcal{G}_n$ with degrees bounded by $\log \log n$ such that, for all elements of $A_n$, the appearance frequency of local structures in $G$ agrees with that in $\widetilde{G}^{(n)}$. In other words, $\mathcal{W}_n$ is the set of graphs with the same appearance frequency of local structures as in $\widetilde{G}^{(n)}$. Note that $\widetilde{G}^{(n)} \in \mathcal{W}_n$, and we can encode $\widetilde{G}^{(n)}$ by specifying it among the elements of $\mathcal{W}_n$ using $1 + \lceil \log_2 |\mathcal{W}_n| \rceil$ bits.
4) Now, it remains to encode those edges present in $G^{(n)}$ but not in $\widetilde{G}^{(n)}$, i.e. those edges which were removed during the truncation step 2 above. Let $Z_n$ denote the set of such edges. Note that, by definition, for every edge $(v, w) \in Z_n$ we have $v \in Y_n$ and $w \in Y_n$. We first encode the set $Y_n$ by encoding $|Y_n|$ using $1 + \lceil \log_2 n \rceil$ bits, and then encoding $Y_n$ among the set of all subsets of $[n]$ with the same size using $1 + \lceil \log_2 \binom{n}{|Y_n|} \rceil$ bits.
5) Let $m^{(n)}$ and $\widetilde{m}^{(n)}$ denote the number of edges in $G^{(n)}$ and $\widetilde{G}^{(n)}$ respectively. Therefore, the set $Z_n$ consists of $m^{(n)} - \widetilde{m}^{(n)}$ many edges, and both of the endpoints of each such edge are in $Y_n$. Thereby, having encoded $Y_n$ in the previous steps, we can encode the $m^{(n)} - \widetilde{m}^{(n)}$ remaining edges in $Z_n$ using $1 + \lceil \log_2 \binom{\binom{|Y_n|}{2}}{m^{(n)} - \widetilde{m}^{(n)}} \rceil$ bits.
It can be shown that the above compression scheme is indeed universal in the following sense.
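Steps 3c, 4, and 5 all rely on the same primitive: specifying an object among a finite set of candidates using about $\log_2$ of the set size in bits. For subsets this can be made explicit via the combinatorial number system, as in the following sketch (our own illustration, not the paper's implementation): a $k$-element subset of $[n]$ is mapped to its rank among all $k$-element subsets, an integer that fits in $\lceil \log_2 \binom{n}{k} \rceil$ bits.

```python
from math import comb

def subset_rank(subset, n):
    """Colex rank of a k-subset of {0,...,n-1} among all k-subsets."""
    return sum(comb(v, i + 1) for i, v in enumerate(sorted(subset)))

def subset_unrank(r, n, k):
    """Inverse of subset_rank: recover the k-subset from its rank."""
    subset = []
    for i in range(k, 0, -1):
        v = i - 1
        while comb(v + 1, i) <= r:  # largest v with comb(v, i) <= r
            v += 1
        subset.append(v)
        r -= comb(v, i)
    return sorted(subset)

n, Y = 10, [1, 4, 7]  # a 3-subset of [n], playing the role of Y_n
r = subset_rank(Y, n)
assert subset_unrank(r, n, len(Y)) == Y
print(r, "of", comb(n, len(Y)), "subsets;",
      comb(n, len(Y)).bit_length(), "bits suffice")
```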
Theorem 6 (Theorem 3 in [7]): Given any unimodular $\mu \in \mathcal{P}_u(\mathcal{T}_*)$ such that $\deg(\mu) \in (0, \infty)$, and a sequence of graphs $G^{(n)}$ converging to $\mu$ in the local weak sense, we have

$$\limsup_{n \to \infty} \frac{\mathrm{nats}(f^{\mathrm{lwc}}_n(G^{(n)})) - m^{(n)} \log n}{n} \leq \Sigma(\mu),$$

where $m^{(n)}$ denotes the number of edges in $G^{(n)}$. Together with the following converse result, this implies that the BC entropy is indeed the correct information-theoretic limit for compression on a per-edge basis in the local weak convergence framework.
Theorem 7 (Theorem 4 in [7]): Assume that a lossless compression scheme $((f_n, g_n) : n \geq 1)$ is given. Fix some unimodular $\mu \in \mathcal{P}_u(\mathcal{T}_*)$ such that $\deg(\mu) \in (0, \infty)$ and $\Sigma(\mu) > -\infty$. Then there exists a sequence of random graphs $(G^{(n)} : n \geq 1)$ defined on a joint probability space such that $G^{(n)}$ converges a.s. to $\mu$ in the local weak sense and

$$\liminf_{n \to \infty} \frac{\mathrm{nats}(f_n(G^{(n)})) - m^{(n)} \log n}{n} \geq \Sigma(\mu) \quad \text{a.s.,}$$

where $m^{(n)}$ denotes the number of edges in $G^{(n)}$.
The following bound will also be useful for our future analysis.
Lemma 2: Assume that the sequence $(G^{(n)})_{n \geq 1}$ is given where, for all $n \geq 1$, $G^{(n)} \in \mathcal{G}_n$ and all degrees in $G^{(n)}$ are bounded by $\log \log n$. Then we have the following bound on the codeword length associated to $G^{(n)}$:

$$\mathrm{nats}(f^{\mathrm{lwc}}_n(G^{(n)})) \leq \log \binom{\binom{n}{2}}{m^{(n)}} + o(n),$$

where $m^{(n)}$ denotes the number of edges in $G^{(n)}$, and the $o(n)$ term does not depend on $(G^{(n)})_{n \geq 1}$. Proof: Following the compression scheme that we discussed above, since all degrees in $G^{(n)}$ are bounded by $\log \log n$, we have $Y_n = \emptyset$ and $\widetilde{G}^{(n)} = G^{(n)}$. Therefore, the number of bits required to encode the set $Y_n$ in step 4 is bounded by $2 + \lceil \log_2 n \rceil$. Also, since $Z_n = \emptyset$, step 5 does not contribute any bits to the output codeword. Now, we find a bound on the number of bits required to encode $\widetilde{G}^{(n)} = G^{(n)}$ in step 3. Note that we use $|A_n|(1 + \lceil \log_2 n \rceil)$ bits in part 3b. But from Lemma 7 in [7], we have $|A_n| = o(n / \log n)$. Thereby, $|A_n|(1 + \lceil \log_2 n \rceil) = o(n)$. Observe that since the sequence of sets $(A_n)_{n \geq 1}$ does not depend on $(G^{(n)})_{n \geq 1}$, this $o(n)$ term also does not depend on $(G^{(n)})_{n \geq 1}$. Now, we find a bound on the size of the set $\mathcal{W}_n$ defined in step 3c. We first claim that all the graphs in $\mathcal{W}_n$ have precisely $m^{(n)}$ edges. To see this, take $G \in \mathcal{W}_n$ and note that since all the degrees in $G$ are bounded by $\log \log n$, and the degree of a vertex is determined by its local neighborhood structure, we have

$$2 |E(G)| = \sum_{v \in [n]} \deg_G(v) \stackrel{(a)}{=} \sum_{v \in [n]} \deg_{G^{(n)}}(v) = 2 m^{(n)},$$

where (a) uses the fact that, by definition, the number of vertices of $G$ whose local neighborhood matches any given structure in $A_n$ agrees with the corresponding number for $G^{(n)}$. This means that the number of edges in $G$ is precisely $m^{(n)}$. As a result, we have $|\mathcal{W}_n| \leq \binom{\binom{n}{2}}{m^{(n)}}$. Hence, the number of bits we use in step 3c in order to specify $\widetilde{G}^{(n)}$ among the graphs in $\mathcal{W}_n$ is at most $1 + \lceil \log_2 \binom{\binom{n}{2}}{m^{(n)}} \rceil$. Putting all the above together and multiplying by $\log 2$ to convert bits to nats, we get

$$\mathrm{nats}(f^{\mathrm{lwc}}_n(G^{(n)})) \leq \log \binom{\binom{n}{2}}{m^{(n)}} + O(\log n) + o(n) = \log \binom{\binom{n}{2}}{m^{(n)}} + o(n).$$

This completes the proof.
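To see how the bound of Lemma 2 relates to the $m^{(n)} \log n$ normalization in (10), one can expand the binomial coefficient using the standard estimate $\log \binom{N}{k} \leq k \log (Ne/k)$; the following display is our own derivation, not quoted from the paper:

$$\log \binom{\binom{n}{2}}{m^{(n)}} \;\leq\; m^{(n)} \log \frac{e \binom{n}{2}}{m^{(n)}} \;\leq\; m^{(n)} \log n + m^{(n)} \Big( 1 + \log \frac{n}{2 m^{(n)}} \Big).$$

In particular, in the regime $m^{(n)} = \Theta(n)$ the remainder after the $m^{(n)} \log n$ term is $O(n)$, which is consistent with dividing by $n$ in the normalizations of (10) and (15).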

III. A NOTION OF ENTROPY FOR THE SPARSE GRAPHON FRAMEWORK
In the dense regime, the asymptotic behavior of the entropy of dense graph ensembles generated by a graphon has been extensively studied in the literature (see, for instance, [31]). We first briefly review this notion before focusing on the sparse regime.
To remain consistent with the notation we defined in Section II-D, let $W : [0,1] \times [0,1] \to [0,1]$ be a graphon with values bounded by 1, but not necessarily normalized. Also, consider the sequence of $W$-random graphs $G_n \sim \mathcal{G}(n; \rho_n W)$ where the target density $\rho_n$ is set to 1 for all $n$. This yields a sequence of dense graphs. It can be shown that the entropy of this sequence of random graphs has the following asymptotic behavior:

$$H(G_n) = (1 + o(1)) \binom{n}{2} \, \mathbb{E}[h(W(X, Y))], \qquad (11)$$

where $h(\cdot)$ denotes the binary entropy function and $X, Y$ are independent and uniform on $[0, 1]$ (see [31, Theorem D.5]). In the following, we study the analog of this question in the sparse regime, i.e. we study the asymptotic behavior of the entropy of a sequence of $W$-random graphs when $W$ is not necessarily bounded, but is an $L_2$ graphon, and the sequence of target densities $\rho_n$ is such that $\rho_n \to 0$ and $n \rho_n \to \infty$. To the best of our knowledge, this is the first instance of such an analysis. As we will see, in the sparse regime the asymptotic behavior of the entropy is quite different from (11). For one thing, since $W$ no longer necessarily takes values in $[0, 1]$, the right hand side of (11) is no longer meaningful. In fact, the following definition turns out to be useful for our analysis in the sparse regime.
Definition 3: For an $L_2$ graphon $W$ over a probability space $(\Omega, \mathcal{F}, \pi)$, we define $\mathrm{Ent}(W)$ as follows:

$$\mathrm{Ent}(W) := \int_{\Omega \times \Omega} W(x, y) \log W(x, y) \, d\pi(x) \, d\pi(y).$$

Here, as usual, we have $0 \log 0 = 0$.
Viewing $W$ as a nonnegative random variable on the space $\Omega \times \Omega$ equipped with the product measure $\pi \times \pi$, we may write $\mathrm{Ent}(W) = \mathbb{E}[W \log W]$. This is in fact the so-called entropy functional associated to $W$. Before studying the asymptotics of the entropy of a sequence of $W$-random graphs, we state some properties of this notion of entropy in the following Theorem 8, whose proof is given in Appendix A.

Theorem 8: Assume that $W$ is an $L_2$ graphon on a probability space $(\Omega, \mathcal{F}, \pi)$. Then the following hold for the notion of entropy above:
1) $\mathrm{Ent}(W)$ is well defined and finite; indeed, $-1/e \leq \mathrm{Ent}(W) \leq \|W\|_2^2$.
2) If $W$ is normalized, i.e. $\|W\|_1 = 1$, then $\mathrm{Ent}(W) \geq 0$, with $\mathrm{Ent}(W) = 0$ when $W = 1$ almost everywhere; in fact, $\mathrm{Ent}(W)$ equals the Kullback-Leibler divergence between the measure with density $W$ with respect to $\pi \times \pi$ and the measure $\pi \times \pi$ itself.

In the following Proposition 2, we discuss the relation between the entropy of a normalized graphon $W$ and the asymptotic behavior of the entropy of a sequence of $W$-random graphs. The proof of Proposition 2 will be given in Appendix B.
Proposition 2: Assume that $W$ is a normalized $L_2$ graphon over $(\Omega, \mathcal{F}, \pi)$. Also, assume that $G^{(n)} \sim \mathcal{G}(n; \rho_n W)$ is a sequence of $W$-random graphs with target density $\rho_n$ such that $\rho_n \to 0$ and $n \rho_n \to \infty$. Then, with $m_n := \binom{n}{2} \rho_n$, we have

$$\lim_{n \to \infty} \frac{H(G^{(n)}) - m_n \log \frac{1}{\rho_n}}{m_n} = 1 - \mathrm{Ent}(W). \qquad (13)$$

Recall from Theorem 4 that we have $\rho(G^{(n)}) / \rho_n \to 1$ a.s.. Therefore, if $m^{(n)}$ denotes the number of edges in $G^{(n)}$, this implies that $m^{(n)} / m_n \to 1$ a.s.. We also have $\mathbb{E}[m^{(n)}] / m_n \to 1$. To see this, note that

$$\mathbb{E}[m^{(n)}] = \binom{n}{2} \, \mathbb{E}[\min\{1, \rho_n W(X_1, X_2)\}],$$

and, since $\rho_n \to 0$ as $n \to \infty$, the dominated convergence theorem yields $\mathbb{E}[\min\{1, \rho_n W(X_1, X_2)\}] / \rho_n \to \|W\|_1 = 1$. Motivated by this, we can think of $m_n$ as, roughly speaking, the "nominal" number of edges in $G^{(n)}$. Remark 2: Note that unlike the asymptotics of the dense regime in (11), in the sparse regime of Proposition 2 above the entropy of $G^{(n)}$ has a leading term $m_n \log \frac{1}{\rho_n}$, and $\mathrm{Ent}(W)$ appears in the second order term.
Remark 3: From part 2 of Theorem 8, we have $1 - \mathrm{Ent}(W) \leq 1$, with $1 - \mathrm{Ent}(W) = 1$ when $W = 1$ almost everywhere. This means that, fixing the sequence of target densities $\rho_n$, among the normalized $L_2$ graphons the constant graphon $W = 1$, which corresponds to the measure $\pi \times \pi$ on $\Omega \times \Omega$, has the maximum asymptotic entropy rate. Moreover, comparing with (13), for a normalized $L_2$ graphon $W$ the amount by which the asymptotic entropy rate deviates from this maximum value is precisely the divergence between the measure corresponding to $W$ and the product measure $\pi \times \pi$ on $\Omega \times \Omega$, which corresponds to the constant graphon with value 1.
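For a block graphon $(\vec{p}, B)$ the integral in Definition 3 reduces to the finite sum $\mathrm{Ent}(W) = \sum_{i,j} p_i p_j B_{i,j} \log B_{i,j}$, which makes the nonnegativity claim of Theorem 8 easy to check numerically. A small sketch (names are our own illustration):

```python
from math import log

def ent_block_graphon(p, B):
    """Ent(W) = sum_ij p_i p_j B_ij log B_ij for a block graphon (p, B)."""
    return sum(p[i] * p[j] * B[i][j] * log(B[i][j])
               for i in range(len(p)) for j in range(len(p))
               if B[i][j] > 0)  # convention: 0 log 0 = 0

p = [0.5, 0.5]
B_flat = [[1.0, 1.0], [1.0, 1.0]]    # constant graphon W = 1
B_skew = [[1.6, 0.4], [0.4, 1.6]]    # normalized: sum_ij p_i p_j B_ij = 1
print(ent_block_graphon(p, B_flat))  # 0.0: maximum asymptotic entropy rate
print(ent_block_graphon(p, B_skew))  # about 0.193 > 0: rate drops by the divergence
```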

IV. PROBLEM STATEMENT AND MAIN RESULTS
In this section we formalize the problem of finding an optimal universal compression scheme which is capable of compressing a sequence of sparse graphs which is either convergent in the local weak sense as we discussed in Section II-A, or is generated as a sequence of W -random graphs as we discussed in Section II-D, the compression being information-theoretically optimal on a per-edge basis.
More precisely, for each integer $n$, we want to design a compression map $f_n : \mathcal{G}_n \to \{0,1\}^* - \emptyset$ which assigns a prefix-free codeword to every graph on the vertex set $[n]$, as well as a decompression map $g_n$, such that $g_n \circ f_n$ is the identity map, i.e. we have lossless compression. In addition to this, we want this compression scheme to be information-theoretically optimal. More precisely, assume that we have a sequence of random graphs $G^{(n)}$ where either $G^{(n)}$ converges in the local weak sense to a unimodular measure $\mu \in \mathcal{P}_u(\mathcal{T}_*)$ with probability one, or $G^{(n)}$ is a sequence of $W$-random graphs with target densities $\rho_n$ for a normalized $L_2$ graphon $W$. However, the encoder does not know which of the two cases holds, nor does it know the limit objects $\mu$ or $W$ in each case, or even the sequence of target densities $\rho_n$ in the latter case. Nonetheless, we want the compression scheme to be universally optimal. Motivated by our discussion of the BC entropy in Section II-C, this means that if $G^{(n)}$ converges in the local weak sense to $\mu \in \mathcal{P}_u(\mathcal{T}_*)$ with probability one, then we want

$$\limsup_{n \to \infty} \frac{\mathrm{nats}(f_n(G^{(n)})) - m^{(n)} \log n}{n} \leq \Sigma(\mu) \quad \text{a.s..} \qquad (15)$$

Here, $m^{(n)}$ denotes the number of edges in $G^{(n)}$, and the normalization of the codeword length is performed in a way consistent with the definition of the BC entropy in Section II-C. Moreover, motivated by the discussion of the notion of entropy for the sparse graphon framework in Section III, and in particular Proposition 2 therein, if $G^{(n)} \sim \mathcal{G}(n; \rho_n W)$ for a normalized $L_2$ graphon $W$ and a sequence of target densities $\rho_n$ with $\rho_n \to 0$ and $n \rho_n \to \infty$, then we want

$$\limsup_{n \to \infty} \frac{\mathrm{nats}(f_n(G^{(n)})) - m_n \log \frac{1}{\rho_n}}{m_n} \leq 1 - \mathrm{Ent}(W) \quad \text{a.s.,} \qquad (16)$$

where $m_n := \binom{n}{2} \rho_n$. Note that in this setup the encoder only observes the graph realization $G^{(n)}$ and not the whole sequence $(G^{(n)} : n \geq 1)$. Moreover, as we discussed above, the encoder does not a priori know from which of the two ensemble types the realization $G^{(n)}$ comes, nor does it know the limit objects for each of the two sequences of ensembles.
We address this problem by introducing a compression scheme, and will further discuss a converse result. Our compression scheme employs a splitting method. More precisely, given a graph realization $G^{(n)} \in \mathcal{G}_n$ as the input to the encoder, we choose a splitting parameter $\Delta_n$, and split $G^{(n)}$ into two graphs denoted by $G^{(n)}_{\Delta_n}$ and $G^{(n)}_*$. These two graphs are both on the vertex set $[n]$, and each edge in $G^{(n)}$ appears in precisely one of them. More precisely, $G^{(n)}_{\Delta_n}$ consists of those edges $(v, w)$ from $G^{(n)}$ such that each of their endpoints has degree at most $\Delta_n$. We then define $G^{(n)}_*$ to include those edges in $G^{(n)}$ which do not appear in $G^{(n)}_{\Delta_n}$, i.e. those edges for which the degree of at least one endpoint is bigger than $\Delta_n$. We then encode each of these two graphs separately, where the details are provided in Section V. Roughly speaking, the splitting parameter is chosen so that when $G^{(n)}$ comes from a local weak convergence ensemble, $G^{(n)}_{\Delta_n}$ contains most of the edges in $G^{(n)}$, while when $G^{(n)}$ comes from a sparse graphon ensemble, $G^{(n)}_*$ contains most of the edges in $G^{(n)}$. In order to emphasize the dependence of the compression and decompression maps on the parameter $\Delta_n$, we denote these mappings by $f^{\Delta_n}_n$ and $g^{\Delta_n}_n$, respectively. We will explain the details of this compression scheme in Section V. However, in the following, we state how the choice of the parameter $\Delta_n$ affects the asymptotic normalized codeword length associated to our compression method in each of the two (local weak convergence and sparse graphon) regimes. We do this in Propositions 3 and 4 below. Although in the sequel we will mainly fix the splitting parameter $\Delta_n$ prior to observing the realization $G^{(n)}$, the analysis in Propositions 3 and 4 is general in the sense that $\Delta_n$ can be chosen after observing $G^{(n)}$. First, in Proposition 3 below, we study the local weak convergence scenario. Note that although in this setting we assume that the sequence of random graphs $G^{(n)}$ is convergent in the local weak sense, it is not necessarily the case that the sequence of truncated graphs $G^{(n)}_{\Delta_n}$ also converges to the same limit. In fact, Proposition 3 states that the asymptotic behavior of the codeword length depends on the local weak limit of the truncated sequence $G^{(n)}_{\Delta_n}$, if such a limit exists. We define $R_n$ to be the set of vertices of $G^{(n)}$ which are incident to at least one edge of $G^{(n)}_*$; the bound (17) in Proposition 3 is expressed in terms of the BC entropy $\Sigma(\nu)$ of the local weak limit $\nu$ of the truncated sequence $G^{(n)}_{\Delta_n}$, together with a correction term governed by the asymptotic fraction $|R_n|/n$. The proof of Proposition 3 is given in Section VI.
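The split itself is elementary; the following sketch (our own notation) computes $G^{(n)}_{\Delta_n}$, $G^{(n)}_*$, and the set $R_n$ from an edge list.

```python
def split_graph(n, edges, delta):
    """Split edges by the rule of Section IV: an edge stays in G_delta
    iff both of its endpoints have degree at most delta in the full graph."""
    deg = [0] * n
    for v, w in edges:
        deg[v] += 1
        deg[w] += 1
    g_delta = [(v, w) for v, w in edges if deg[v] <= delta and deg[w] <= delta]
    g_star = [(v, w) for v, w in edges if deg[v] > delta or deg[w] > delta]
    r_n = sorted({u for e in g_star for u in e})  # vertices incident to G_star
    return g_delta, g_star, r_n

# Example: a star centered at vertex 0 plus a triangle 4-5-6, with delta = 2.
edges = [(0, 1), (0, 2), (0, 3), (4, 5), (5, 6), (4, 6)]
g_delta, g_star, r_n = split_graph(7, edges, 2)
print(g_delta)  # [(4, 5), (5, 6), (4, 6)]: the triangle survives
print(g_star)   # [(0, 1), (0, 2), (0, 3)]: star edges touch a degree-3 vertex
print(r_n)      # [0, 1, 2, 3]
```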
As we will discuss later, if $\Delta_n \to \infty$ a.s. then with probability one we have $U(G^{(n)}_{\Delta_n}) \Rightarrow \mu$ and $|R_n|/n \to 0$. Therefore, the right hand side of (17) in Proposition 3 above becomes $\Sigma(\mu)$. Comparing this with (15), we realize that choosing $\Delta_n$ deterministically (i.e. in a way that depends only on $n$) so that $\Delta_n \to \infty$ as $n \to \infty$ is reasonable from the local weak convergence perspective. However, motivated by Proposition 4 below, roughly speaking, if $\Delta_n$ goes to infinity "too fast", then we may lose the optimality condition (16) in the sparse graphon scenario. In other words, there is a trade-off between the two regimes in terms of choosing the parameter $\Delta_n$. Next, we state the asymptotic behavior of the codeword length in the sparse graphon regime. We denote the number of edges in $G^{(n)}_{\Delta_n}$ by $m^{(n)}_{\Delta_n}$.

Proposition 4: Assume that $W$ is a normalized $L_2$ graphon on a probability space $(\Omega, \mathcal{F}, \pi)$. Let $G^{(n)} \sim \mathcal{G}(n; \rho_n W)$ be a sequence of $W$-random graphs with target density $\rho_n$, such that $\rho_n \to 0$ and $n \rho_n \to \infty$. Furthermore, assume that the sequence $\Delta_n$ is chosen so that we have $m^{(n)}_{\Delta_n} / m_n \to 0$ a.s., $\Delta_n \leq \log \log n$ for $n$ large enough a.s., and $\Delta_n / \sqrt{n \rho_n} \to 0$ a.s.. Then, with probability one, we have

$$\limsup_{n \to \infty} \frac{\mathrm{nats}(f^{\Delta_n}_n(G^{(n)})) - m_n \log \frac{1}{\rho_n}}{m_n} \leq 1 - \mathrm{Ent}(W),$$

where $f^{\Delta_n}_n$ refers to the compression scheme of Section V. This proposition requires that $\Delta_n$ must not grow faster than $\sqrt{n \rho_n}$. Note that the encoder does not have any a priori knowledge of the sequence $\rho_n$. In fact, it turns out that it is impossible to simultaneously satisfy the conditions imposed by both Propositions 3 and 4. In particular, we show that any general splitting mechanism, which does not even necessarily truncate the graph using the parameter $\Delta_n$ above, cannot satisfy even a subset of the conditions imposed by Propositions 3 and 4. This is the purpose of Proposition 5 below. Before that, we need to define what we mean by a general splitting mechanism.
Given an integer $n$, we define a splitting mechanism for graphs in $\mathcal{G}_n$ as a pair of functions $T_1, T_2 : \mathcal{G}_n \to \mathcal{G}_n$ such that, for all $G \in \mathcal{G}_n$, each edge of $G$ appears in precisely one of $T_1(G)$ and $T_2(G)$, so that the superposition of $T_1(G)$ and $T_2(G)$ recovers $G$. This generalizes the splitting of $G^{(n)}$ into $G^{(n)}_{\Delta_n}$ and $G^{(n)}_*$ given a splitting parameter $\Delta_n$, as we discussed above.
Definition 4: We say that a sequence of splitting mechanisms $((T^{(n)}_1, T^{(n)}_2) : n \geq 1)$ is good if the following two conditions hold:
1) For any unimodular $\mu \in \mathcal{P}_u(\mathcal{T}_*)$ with $\deg(\mu) \in (0, \infty)$, and any sequence of random graphs $G^{(n)}$ converging with probability one to $\mu$ in the local weak sense, $T^{(n)}_1(G^{(n)})$ also converges with probability one to $\mu$.
2) For any normalized $L_2$ graphon $W$, and any sequence $\rho_n$ such that $\rho_n \to 0$ and $n \rho_n \to \infty$, if $G^{(n)} \sim \mathcal{G}(n; \rho_n W)$ is the sequence of $W$-random graphs with target density $\rho_n$, with $m^{(n)}_1$ being the number of edges in $T^{(n)}_1(G^{(n)})$, then we have $m^{(n)}_1 / m_n \to 0$ a.s., where $m_n := \binom{n}{2} \rho_n$.
The first condition above is motivated by Proposition 3, so that the limit $\nu$ in that proposition is the same as $\mu$, and the second condition is motivated by Proposition 4.
Proposition 5: There does not exist a sequence of good splitting mechanisms.
We give the proof of Proposition 5 in Appendix C. (In fact, one can show that the proof of Proposition 5 still holds even if we allow for random splitting mechanisms; we have restricted the splitting mechanisms to be deterministic mainly to simplify the presentation and because this suffices for our purpose, which is to motivate the introduction of a sequence $(a_n : n \geq 1)$ for which we require that $n \rho_n \geq a_n$ for all $n \geq 1$.) Roughly speaking, the reason why there does not exist a sequence of good splitting mechanisms is that the sequence $\rho_n$ can be chosen such that $n \rho_n$ goes to infinity arbitrarily slowly, and this confuses the splitting mechanism and prevents it from being able to distinguish between the local weak convergence and the sparse graphon regimes.
Motivated by the above discussion, we restrict the sequence $\rho_n$ so that $n \rho_n$ does not go to infinity arbitrarily slowly. More precisely, we assume that $n \rho_n \geq a_n$, where $(a_n : n \geq 1)$ is a sequence known a priori to both the encoder and the decoder such that $a_n \to \infty$ as $n \to \infty$. In this case, we still assume that the encoder does not know whether the input graph $G^{(n)}$ comes from an ensemble consistent with the local weak convergence framework or the sparse graphon framework, nor does it know the limit objects in each case. However, both the encoder and the decoder know that if $G^{(n)}$ is a realization of a sparse graphon ensemble, the unknown target density $\rho_n$ is such that $n \rho_n \geq a_n$. We show that under this assumption information-theoretically optimal universal compression on a per-edge basis can be achieved by appropriately choosing the sequence of splitting parameters $\Delta_n$.
Theorem 9: Let $(a_n : n \geq 1)$ be a sequence known to both the encoder and the decoder such that $a_n \to \infty$ as $n \to \infty$. Then, there exists an appropriate choice for the sequence of splitting parameters $(\Delta_n : n \geq 1)$, with $\Delta_n$ depending only on $n$, such that our sequence of compression schemes $((f^{\Delta_n}_n, g^{\Delta_n}_n) : n \geq 1)$, which is introduced in Section V, achieves optimal universal compression in the sense discussed above. More precisely, we have:
1) If $G^{(n)}$ is a sequence of random graphs such that, almost surely, $U(G^{(n)}) \Rightarrow \mu$ for some $\mu \in \mathcal{P}_u(\mathcal{T}_*)$ with $\deg(\mu) \in (0, \infty)$, then with probability one we have

$$\limsup_{n \to \infty} \frac{\mathrm{nats}(f^{\Delta_n}_n(G^{(n)})) - m^{(n)} \log n}{n} \leq \Sigma(\mu), \qquad (18)$$

where $m^{(n)}$ denotes the number of edges in $G^{(n)}$.
2) On the other hand, if $G^{(n)} \sim \mathcal{G}(n; \rho_n W)$ is a sequence of $W$-random graphs with target densities $\rho_n$, where $W$ is a normalized $L_2$ graphon, assuming that $\rho_n \to 0$ as $n \to \infty$, $n \rho_n \to \infty$ as $n \to \infty$, and $n \rho_n \geq a_n$, then with $m_n := \binom{n}{2} \rho_n$ we have, with probability one,

$$\limsup_{n \to \infty} \frac{\mathrm{nats}(f^{\Delta_n}_n(G^{(n)})) - m_n \log \frac{1}{\rho_n}}{m_n} \leq 1 - \mathrm{Ent}(W). \qquad (19)$$
In the above setting, the encoder and the decoder only know the sequence a_n; they do not know from which of the two settings the input graph G^{(n)} is generated, nor do they know the limit objects μ or W in each setting.

Proof of Theorem 9: Let Δ_n = min{log a_n, log log n}. Since a_n → ∞ as n → ∞, we have Δ_n → ∞. Assume that G^{(n)} is such that almost surely U(G^{(n)}) ⇒ μ for some μ ∈ P_u(T_*) with deg(μ) ∈ (0, ∞). Then, since Δ_n → ∞, Lemma 6 in [7] implies that U(G^{(n)}_{Δ_n}) ⇒ μ a.s. Moreover, Lemma 8 in [7] implies that |R_n|/n → 0 a.s. Consequently, (18) follows from Proposition 3. Now, assume that G^{(n)} ∼ G(n; ρ_n W) for a normalized L² graphon W, and nρ_n ≥ a_n. We verify that the assumptions in Proposition 4 hold. Note that clearly Δ_n ≤ log log n. Also, since Δ_n ≤ log a_n, a_n → ∞, and nρ_n ≥ a_n, we have Δ_n/√(nρ_n) → 0. On the other hand, since all the degrees in G^{(n)}_{Δ_n} are at most Δ_n, we have m^{(n)}_{Δ_n} ≤ nΔ_n/2, and hence

m^{(n)}_{Δ_n} / m_n ≤ Δ_n / ((n−1)ρ_n) ≤ (n/(n−1)) · (log a_n)/a_n → 0.

Hence, all the assumptions in Proposition 4 hold, and (19) follows from Proposition 4.
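To make the choice of the splitting parameter concrete, the following Python sketch computes Δ_n = min{log a_n, log log n} as in the proof above. The particular sequence a_n = (log n)² is a hypothetical stand-in, not part of the theorem; since nρ_n ≥ a_n, the printed ratio Δ_n/√a_n upper-bounds Δ_n/√(nρ_n), which must vanish by Proposition 4.

import math

def splitting_parameter(n, a_n):
    """Delta_n = min(log a_n, log log n), the choice from the proof of Theorem 9."""
    return min(math.log(a_n), math.log(math.log(n)))

# Hypothetical a_n = (log n)^2, for illustration only; any sequence with
# a_n -> infinity that lower-bounds n*rho_n works the same way.
for n in (10**3, 10**6, 10**9, 10**12):
    a_n = math.log(n) ** 2
    d = splitting_parameter(n, a_n)
    # Since n*rho_n >= a_n, we have Delta_n/sqrt(n*rho_n) <= Delta_n/sqrt(a_n) -> 0.
    print(f"n={n:.0e}  Delta_n={d:.3f}  Delta_n/sqrt(a_n)={d / math.sqrt(a_n):.4f}")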
When a sequence of lower bounds a_n, as discussed above, is not known, we may choose the sequence Δ_n to be a constant, i.e. Δ_n = Δ for some fixed Δ > 0. Theorem 10 below suggests that if we do so, we still have the universal optimality condition (16) in the sparse graphon regime. However, the optimality condition (15) in the local weak convergence framework holds in a weaker sense, i.e. it only holds after we send Δ to infinity.
Theorem 10: If Δ_n = Δ for all n ≥ 1, where Δ > 0 is fixed, our sequence of compression schemes ((f^Δ_n, g^Δ_n) : n ≥ 1) has the following properties:

1) If G^{(n)} is a sequence of random graphs such that, almost surely, U(G^{(n)}) ⇒ μ for some μ ∈ P_u(T_*) with deg(μ) ∈ (0, ∞), then with probability one we have

lim_{Δ→∞} lim sup_{n→∞} [nats(f^Δ_n(G^{(n)})) − m^{(n)} log n] / n ≤ Σ(μ),   (20)

where m^{(n)} denotes the number of edges in G^{(n)}.

2) On the other hand, if G^{(n)} ∼ G(n; ρ_n W) is a sequence of W-random graphs with target densities ρ_n, where W is a normalized L² graphon, and ρ_n → 0 and nρ_n → ∞ as n → ∞, then with m_n := \binom{n}{2} ρ_n, for all Δ > 0 we have, with probability one,

lim sup_{n→∞} [nats(f^Δ_n(G^{(n)})) − m_n log(1/ρ_n)] / m_n ≤ 1 − Ent(W).   (21)

Proof of Theorem 10: First assume that G^{(n)} is a sequence of random graphs such that, almost surely, U(G^{(n)}) ⇒ μ for some μ ∈ P_u(T_*) with deg(μ) ∈ (0, ∞). For Δ > 0, define μ_Δ ∈ P(T_*) to be the law of [T_Δ, o] when [T, o] ∼ μ. Here, T_Δ is the tree obtained from T by removing all the edges for which the degree of at least one endpoint is strictly bigger than Δ, followed by taking the connected component of the root o. It is easy to see that |R_n|/n = ∫ F_Δ dU(G^{(n)}), where F_Δ([G, o]) is the indicator that the root o or one of its neighbors has degree strictly bigger than Δ. The function F_Δ is also continuous, because its value is determined by the depth-2 neighborhood of the root. Therefore, the assumption U(G^{(n)}) ⇒ μ a.s. implies that with probability one, |R_n|/n → ∫ F_Δ dμ =: η_Δ. Consequently, if Δ is sufficiently large so that deg(μ_Δ) > 0, using Proposition 3 we get an upper bound on lim sup_{n→∞} [nats(f^Δ_n(G^{(n)})) − m^{(n)} log n]/n in terms of Σ(μ_Δ) and η_Δ. The dominated convergence theorem implies that η_Δ → 0 as Δ → ∞. Moreover, from Proposition 1 in Section II-C, we know that lim sup_{Δ→∞} Σ(μ_Δ) ≤ Σ(μ). Therefore, we arrive at (20) by sending Δ to infinity in the above inequality. Now, assume that G^{(n)} ∼ G(n; ρ_n W) is a sequence of W-random graphs with target densities ρ_n, where W is a normalized L² graphon, and ρ_n → 0 and nρ_n → ∞ as n → ∞. We claim that with Δ_n = Δ fixed, all the assumptions in Proposition 4 are satisfied for n large enough. Indeed, Δ ≤ log log n for n large, and Δ/√(nρ_n) → 0. Furthermore, since all the degrees in G^{(n)}_Δ are at most Δ, we have m^{(n)}_Δ ≤ nΔ/2 = o(m_n), so m^{(n)}_Δ/m_n → 0. Hence (21) follows from Proposition 4.

So far, in Theorems 9 and 10, we have discussed the existence of a sequence of compression and decompression schemes ((f_n, g_n) : n ≥ 1) that almost surely achieve the asymptotic compression limits Σ(μ) and 1 − Ent(W) in the local weak convergence and the sparse graphon regimes, respectively. In the following converse result, we argue that these are indeed the smallest possible thresholds that can be achieved almost surely. The proof of the following Theorem 11 is given in Section VIII.
Theorem 11: Assume that ((f_n, g_n) : n ≥ 1) is a sequence of lossless compression/decompression maps (i.e. g_n ∘ f_n is the identity). Then:

1) For any μ ∈ P_u(T_*) with deg(μ) ∈ (0, ∞), there exists a sequence of random graphs G^{(n)} defined on a joint probability space such that U(G^{(n)}) ⇒ μ a.s. and we have

lim inf_{n→∞} [nats(f_n(G^{(n)})) − m^{(n)} log n] / n ≥ Σ(μ) a.s.,   (22)

and

lim inf_{n→∞} E[nats(f_n(G^{(n)})) − m^{(n)} log n] / n ≥ Σ(μ),   (23)

where m^{(n)} denotes the number of edges in G^{(n)}.

2) For any normalized L² graphon W and any sequence of target densities ρ_n such that ρ_n → 0 and nρ_n → ∞, if G^{(n)} ∼ G(n; ρ_n W) is the sequence of W-random graphs with target densities ρ_n, then for all t < 1 − Ent(W), we have

P( lim sup_{n→∞} [nats(f_n(G^{(n)})) − m_n log(1/ρ_n)] / m_n ≤ t ) < 1.   (24)
Furthermore, we have

lim inf_{n→∞} E[nats(f_n(G^{(n)})) − m_n log(1/ρ_n)] / m_n ≥ 1 − Ent(W).   (25)

V. CODING SCHEME

In this section, we provide the details of our compression scheme by introducing the compression and decompression maps f^{Δ_n}_n and g^{Δ_n}_n. Recall from Section IV that Δ_n is the splitting parameter which governs how we obtain the two graphs G^{(n)}_{Δ_n} and G^{(n)}_* from G^{(n)}: for each edge (v, w) of G^{(n)}, if deg_{G^{(n)}}(v) ≤ Δ_n and deg_{G^{(n)}}(w) ≤ Δ_n, we put an edge in G^{(n)}_{Δ_n} between the nodes v and w. Otherwise, if either deg_{G^{(n)}}(v) > Δ_n or deg_{G^{(n)}}(w) > Δ_n, we place an edge in G^{(n)}_* between the nodes v and w. Let R_n denote the set of vertices v ∈ [n] such that either deg_{G^{(n)}}(v) > Δ_n, or deg_{G^{(n)}}(w) > Δ_n for some w ∼_{G^{(n)}} v. Note that for every edge (v, w) in G^{(n)}_*, we have v ∈ R_n and w ∈ R_n. We first encode G^{(n)}_{Δ_n} using the compression method from [7], which we reviewed in Section II-E.³ Next, we discuss how to encode G^{(n)}_*. Overall, f^{Δ_n}_n(G^{(n)}) will be comprised of the compressed form of G^{(n)}_{Δ_n} concatenated with the compressed form of G^{(n)}_*.

In order to encode G^{(n)}_*, we first encode the set R_n. For this, we first encode |R_n| using at most 1 + log n nats, and then we encode the set R_n using at most 1 + log \binom{n}{|R_n|} nats by specifying R_n among all the subsets of [n] of the same size. Next, we encode m^{(n)}_*, the number of edges in G^{(n)}_*, using at most 1 + 2 log n nats. Additionally, we define the quantities α_n and β_n = φ(α_n) as in (26)-(28). Note that the decoder knows m^{(n)}_* at this point, and can compute the value of β_n. Next, we run the least squares algorithm from [16], which we discussed in Section II-D.1, on the input graph G^{(n)}, with the parameter β_n defined above. Let π̂_n and B̂_n be the outputs of this algorithm, i.e. the optimizers in (9). Recall from (10) that for 1 ≤ i, j ≤ β_n such that π̂_n^{−1}({i}) and π̂_n^{−1}({j}) are not empty, (B̂_n)_{i,j} is given explicitly in terms of the edge counts between the corresponding classes. Note that when π̂_n^{−1}({i}) = ∅ or π̂_n^{−1}({j}) = ∅, the value of (B̂_n)_{i,j} does not affect the objective function in (9). Therefore, without loss of generality, we may assume that (B̂_n)_{i,j} = 0 for such i, j. We emphasize that we run this algorithm on the input graph G^{(n)}, and not on G^{(n)}_*. However, we will use π̂_n and B̂_n to compress G^{(n)}_*, as we discuss below. We first need to define some notation. Let β̄_n be the number of indices 1 ≤ i ≤ β_n such that π̂_n^{−1}({i}) ≠ ∅. Note that in (9), we may reorder the vertex class labels governed by π and modify B accordingly without changing the objective. Therefore, without loss of generality, we may assume that π̂_n^{−1}({i}) ≠ ∅ for 1 ≤ i ≤ β̄_n and π̂_n^{−1}({i}) = ∅ for i > β̄_n. Let π̃_n be the restriction of π̂_n to R_n. More precisely, with R_n = {r_1, …, r_{|R_n|}} such that r_1 < r_2 < ⋯ < r_{|R_n|}, we define π̃_n : [|R_n|] → [β̄_n] such that π̃_n(i) = π̂_n(r_i).

³ Recall that, in the scheme of [7], the edges incident to vertices of high degree are separated out and placed in the set Z_n. However, as we discussed in Section IV, we have two methods for choosing Δ_n: we either set Δ_n = min{log a_n, log log n} as in Theorem 9, or Δ_n = Δ is fixed as in Theorem 10. In either case, we have Δ_n ≤ log log n for n large enough, which means that all the degrees in G^{(n)}_{Δ_n} are at most log log n.
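As a minimal sketch of the splitting step just described (in Python, with hypothetical helper names; the encoding of G^{(n)}_{Δ_n} by the method of [7] is not reproduced), the following forms G^{(n)}_{Δ_n}, G^{(n)}_*, and R_n from an edge list, and evaluates the nat counts quoted above for encoding |R_n| and the set R_n.

import math

def split_graph(n, edges, delta):
    """Return (g_delta, g_star, R_n): g_delta keeps edges whose endpoints both
    have degree <= delta in G; g_star gets the remaining edges. R_n is the set
    of vertices that either have degree > delta or have such a neighbor; these
    are exactly the endpoints of g_star edges."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    g_delta = [(u, v) for u, v in edges if deg[u] <= delta and deg[v] <= delta]
    g_star = [(u, v) for u, v in edges if deg[u] > delta or deg[v] > delta]
    r_n = sorted({w for e in g_star for w in e})
    return g_delta, g_star, r_n

def nats_for_R(n, r_size):
    """Upper bound on the nats spent on R_n: 1 + log n for |R_n|, then
    1 + log C(n, |R_n|) for specifying the subset itself."""
    return (1 + math.log(n)) + (1 + math.log(math.comb(n, r_size)))

# Toy usage: a 6-vertex graph with one high-degree hub (vertex 0).
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (4, 5)]
g_delta, g_star, r_n = split_graph(6, edges, delta=2)
print(g_delta, g_star, r_n, round(nats_for_R(6, len(r_n)), 2))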
Let β*_n be the number of indices 1 ≤ i ≤ β̄_n such that n*_i ≠ 0, where n*_i := |π̃_n^{−1}({i})|. Note that β*_n ≤ β̄_n. By a reordering argument similar to the one above for (9), without loss of generality, we may assume that n*_i ≠ 0 for 1 ≤ i ≤ β*_n and n*_i = 0 for i > β*_n. For 1 ≤ i, j ≤ β*_n, we define m*_{i,j} to be the number of edges of G^{(n)}_* with one endpoint in π̃_n^{−1}({i}) and the other in π̃_n^{−1}({j}). In other words, m*_{i,j} is the number of edges in the (i, j) block formed by π̃_n in the adjacency matrix of G^{(n)}_*. Likewise, for 1 ≤ i, j ≤ β̄_n, we define m_{i,j} to be the number of edges in the (i, j) block formed by π̂_n in the adjacency matrix of G^{(n)}, and we define m_{i,j} to be zero for i > β̄_n or j > β̄_n. Note that m*_{i,j} ≤ m_{i,j} for 1 ≤ i, j ≤ β̄_n. Having defined the above notation, we continue with encoding π̃_n. Since π̃_n(i) ≤ β̄_n for 1 ≤ i ≤ |R_n|, we may encode π̃_n using at most |R_n|(1 + log β̄_n) nats. Next, we encode A(G^{(n)}_*). We do this by separately encoding each block in A(G^{(n)}_*) formed by π̃_n. More precisely, for 1 ≤ i ≤ β*_n, we encode the block π̃_n^{−1}({i}) × π̃_n^{−1}({i}) of A(G^{(n)}_*) as follows. We first encode m*_{i,i} using at most 1 + 2 log |R_n| nats; this is possible since m*_{i,i} ≤ |R_n|². Then, we encode the positions of the m*_{i,i} ones in this block using at most 1 + log \binom{\binom{n*_i}{2}}{m*_{i,i}} nats. Similarly, for 1 ≤ i < j ≤ β*_n, we first encode m*_{i,j} using at most 1 + 2 log |R_n| nats, and then we encode the positions of the m*_{i,j} ones in the block π̃_n^{−1}({i}) × π̃_n^{−1}({j}) using at most 1 + log \binom{n*_i n*_j}{m*_{i,j}} nats.

At the decoder, we first reconstruct G^{(n)}_{Δ_n}. Next, we decode for the set R_n and m^{(n)}_*. We then find β_n from (28). Then, we decode for π̃_n, and decode each of the blocks of A(G^{(n)}_*) separately. We then reconstruct G^{(n)}_* and, by superposing it with G^{(n)}_{Δ_n}, recover G^{(n)}. The following lemma records the resulting bound on the codeword length; here, ℓ^{(n)}_* denotes the number of nats used to encode R_n, m^{(n)}_*, π̃_n, and A(G^{(n)}_*).

Lemma 3: If R_n ≠ ∅, we have

ℓ^{(n)}_* ≤ 3 + 3 log n + log \binom{n}{|R_n|} + |R_n|(1 + log β̄_n) + Σ_{i=1}^{β*_n} [2 + 2 log |R_n| + log \binom{\binom{n*_i}{2}}{m*_{i,i}}] + Σ_{1≤i<j≤β*_n} [2 + 2 log |R_n| + log \binom{n*_i n*_j}{m*_{i,j}}].

Otherwise, if R_n = ∅, we have ℓ^{(n)}_* ≤ 1 + log n.

Proof: Following the encoding procedure, we begin with encoding the set R_n using at most 2 + log n + log \binom{n}{|R_n|} nats. If R_n = ∅, we encode R_n using at most 1 + log n nats, and the encoding procedure stops at this point. Therefore, if R_n = ∅, we have ℓ^{(n)}_* ≤ 1 + log n. Now, assume that R_n ≠ ∅. In this case, after encoding R_n, we encode m^{(n)}_* using at most 1 + 2 log n nats. Then, we encode π̃_n using at most |R_n|(1 + log β̄_n) nats. Moreover, we encode each diagonal block 1 ≤ i ≤ β*_n of A(G^{(n)}_*) using at most 2 + 2 log |R_n| + log \binom{\binom{n*_i}{2}}{m*_{i,i}} nats. Also, we encode each non-diagonal block 1 ≤ i < j ≤ β*_n using at most 2 + 2 log |R_n| + log \binom{n*_i n*_j}{m*_{i,j}} nats. This completes the proof.
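The bookkeeping in Lemma 3 is straightforward to mirror in code. The following Python sketch assumes the class assignment π̃_n is already given (whereas the actual scheme derives it from the least squares estimator of [16]); it computes the class sizes n*_i, the block counts m*_{i,j}, and the nat-length bound of Lemma 3.

from math import comb, log

def lemma3_bound(n, r_n, pi_tilde, star_edges, beta_bar):
    """Evaluate the upper bound of Lemma 3 on the nats needed to encode R_n,
    m_star, pi_tilde, and A(G_star). pi_tilde maps each vertex of R_n to a
    class in {1,...,beta_bar}; star_edges are the edges of G_star. A sketch
    of the length accounting only, not an actual encoder."""
    if not r_n:
        return 1 + log(n)
    R = len(r_n)
    n_star, m_star = {}, {}
    for v in r_n:                               # class sizes n*_i
        n_star[pi_tilde[v]] = n_star.get(pi_tilde[v], 0) + 1
    for u, v in star_edges:                     # block edge counts m*_{i,j}
        i, j = sorted((pi_tilde[u], pi_tilde[v]))
        m_star[(i, j)] = m_star.get((i, j), 0) + 1
    total = 3 + 3 * log(n) + log(comb(n, R)) + R * (1 + log(beta_bar))
    classes = sorted(n_star)                    # the beta*_n nonempty classes
    for a in range(len(classes)):               # one term per block, i <= j
        for b in range(a, len(classes)):
            i, j = classes[a], classes[b]
            m = m_star.get((i, j), 0)
            cells = comb(n_star[i], 2) if i == j else n_star[i] * n_star[j]
            total += 2 + 2 * log(R) + log(comb(cells, m))
    return total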
We illustrate the coding scheme above through an example. Consider a sequence of Erdős-Rényi random graphs G^{(n)} ∼ G(n, n^{α−2}) with 1 < α < 2. The expected number of edges in G^{(n)} is Θ(n^α). We can interpret G^{(n)} as a sequence of W-random graphs with target density ρ_n = n^{α−2}, where W : [0, 1] × [0, 1] → R_+ is the constant graphon with value 1. Recall from Section IV that the encoder is assumed to know beforehand a sequence a_n such that nρ_n ≥ a_n if the sequence of graphs is in the sparse graphon regime (but it does not know the value of ρ_n). For the purposes of the example, let us assume that a_n = n^{(α−1)/2}. From the proof of Theorem 9, we know that an appropriate value for the threshold Δ_n is min{log a_n, log log n}, which is at most log a_n = ((α−1)/2) log n. It is easy to verify, using the Chernoff bound, that with probability asymptotically approaching 1, all the degrees in G^{(n)} are bigger than Δ_n. In particular, this implies that with probability asymptotically approaching 1, G^{(n)}_{Δ_n} is empty, R_n = [n], and m^{(n)}_* = m^{(n)}, where m^{(n)} is the number of edges in G^{(n)}. Due to the homogeneity in G^{(n)}, we expect that the least squares graphon estimation step returns uniform partitions, so that with high probability, π̃_n^{−1}({i}) has roughly size n_* := n/β_n for 1 ≤ i ≤ β_n, m*_{i,i} ≈ \binom{n_*}{2} ρ_n, and m*_{i,j} ≈ n_*² ρ_n for i ≠ j.

For the rest of this example, since the only purpose is to roughly illustrate how the algorithm works, let us assume that all these approximations hold with equality. Since all the π̃_n^{−1}({i}) are nonempty, we have β*_n = β̄_n = β_n. On the other hand, using Lemma 3, we realize that

ℓ^{(n)}_* ≤ 3 + n + 3 log n + n log β_n + 2β_n² + β_n (2 log n + log \binom{\binom{n_*}{2}}{\binom{n_*}{2} ρ_n}) + \binom{β_n}{2} (2 log n + log \binom{n_*²}{n_*² ρ_n}).   (32)

Note that m_n = \binom{n}{2} ρ_n = Θ(n^α). Therefore, from (31), we have β_n² log n = o(m_n). Using these together with the inequality log \binom{r}{s} ≤ s log(re/s) in (32) and simplifying, we get

ℓ^{(n)}_* ≤ (β_n \binom{n_*}{2} + \binom{β_n}{2} n_*²) ρ_n (1 + log(1/ρ_n)) + o(m_n).   (33)

Note that the adjacency matrix of G^{(n)} is divided into β_n² blocks, each of size n_* × n_*. This means that β_n \binom{n_*}{2} + \binom{β_n}{2} n_*² = \binom{n}{2}. Using this in (33), together with the fact that, since G^{(n)}_{Δ_n} is empty, we have nats(f_n(G^{(n)})) = ℓ^{(n)}_*, we get

lim sup_{n→∞} [nats(f_n(G^{(n)})) − m_n log(1/ρ_n)] / m_n ≤ 1 = 1 − Ent(W),

where the last step uses the fact that Ent(W) = 0 since W is the constant graphon with value 1. Notice that this is indeed consistent with the asymptotic bound (19) in Theorem 9.
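As a quick numerical sanity check of the claim that all degrees exceed Δ_n in this example, one can sample the Erdős-Rényi graph directly; the mean degree grows like n^{α−1}, far above the logarithmic threshold. The specific n, α, and seed below are arbitrary illustrative choices.

import math, random

def min_degree_er(n, p, rng):
    """Sample an Erdos-Renyi graph G(n, p) and return its minimum degree."""
    deg = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                deg[i] += 1
                deg[j] += 1
    return min(deg)

alpha, n = 1.5, 3000                     # any 1 < alpha < 2
rho = n ** (alpha - 2)                   # edge probability rho_n = n^(alpha-2)
delta = 0.5 * (alpha - 1) * math.log(n)  # the threshold log a_n from the example
print(f"Delta_n ~ {delta:.2f}, mean degree ~ {(n - 1) * rho:.1f}, "
      f"min degree = {min_degree_er(n, rho, random.Random(0))}")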

VI. PROOF OF PROPOSITION 3: LOCAL WEAK CONVERGENCE ANALYSIS
In this section, we prove Proposition 3. Since U(G^{(n)}_{Δ_n}) ⇒ ν a.s., the analysis of the compression scheme of [7], reviewed in Section II-E, controls the length of the first part of the codeword, which encodes G^{(n)}_{Δ_n}; this yields the bound (34). Hence, it suffices to establish the complementary bound (35) on ℓ^{(n)}_*, where m^{(n)}_* denotes the number of edges in G^{(n)}_*. Note that this together with (34) finishes the proof. From Lemma 3, if R_n = ∅, we have ℓ^{(n)}_* ≤ 1 + log n and (35) holds. Therefore, we assume that R_n ≠ ∅ for the rest of the proof. Note that when R_n ≠ ∅, there must be at least two vertices in R_n, and hence |R_n| ≥ 2. In this case, again from Lemma 3, we have an upper bound on ℓ^{(n)}_* in terms of |R_n|, β̄_n, and the block counts m*_{i,j}. Using the bound \binom{r}{s} ≤ (re/s)^s, we can rewrite this upper bound in terms of the function s(x) := x log(e/x) for x > 0, with s(0) := 0; here, step (a) uses the fact that |R_n| ≥ 2, and step (c) uses the concavity of the function s(·). Now, we consider two cases to bound this expression.

VII. PROOF OF PROPOSITION 4: GRAPHON ANALYSIS
In this section, we prove Proposition 4. Before giving the proof, we introduce some notation. Recall from Section V that π̂_n and B̂_n are the optimizers in (9) associated with A(G^{(n)}) with parameter β_n, and that n_i = |π̂_n^{−1}({i})| for 1 ≤ i ≤ β_n. Let Ŵ^{(n)} be the block graphon (p̂, B̂_n), where p̂ = (p̂_1, …, p̂_{β_n}) with p̂_i = n_i/n for 1 ≤ i ≤ β_n. More precisely, using (29) and (30) in Section V, Ŵ^{(n)} is defined on the finite probability space {1, …, β_n} equipped with the probabilities (n_1/n, …, n_{β_n}/n), with its value at (i, j) determined by the edge counts m_{i,j}, where β̄_n and m_{i,j} were defined in Section V. Moreover, we define the graphon Ŵ^{(n)}_* on the same probability space {1, …, β_n} equipped with the probabilities (n_1/n, …, n_{β_n}/n), with its value at (i, j) determined analogously by the counts m*_{i,j}, where β*_n and m*_{i,j} were defined in Section V. The following Proposition 6 discusses some useful facts regarding the asymptotic behavior of the number of edges in a sequence of sparse W-random graphs, and will be useful in the proof of Proposition 4. The proof of Proposition 6 is given in Appendix D.

Proposition 6: Assume that W is a normalized L² graphon and G^{(n)} ∼ G(n; ρ_n W) is a sequence of W-random graphs with target density ρ_n such that ρ_n → 0 and nρ_n → ∞. Then, with m^{(n)} being the number of edges in G^{(n)} and m_n := \binom{n}{2} ρ_n, the parts of the proposition describe the almost sure and expected asymptotic behavior of m^{(n)} relative to m_n; in particular, parts 3 and 4 are invoked in the proof of Theorem 11 in Section VIII.

The following lemmas will be useful in the proof of Proposition 4. The proofs of these lemmas are given in Appendix E.

Proof of Proposition 4: Using Lemma 5, we have m^{(n)}/m_n → 1 a.s. as n → ∞.
Recall from Section V that we encode G^{(n)}_{Δ_n} using the compression method discussed in [7], which we reviewed in Section II-E; we denote by ℓ^{(n)}_{Δ_n} the length, in nats, of the resulting codeword. Therefore, using Lemma 2, we have an upper bound on ℓ^{(n)}_{Δ_n} in terms of m^{(n)}_{Δ_n} and n. By assumption, we have m^{(n)}_{Δ_n}/m_n → 0 a.s. Note that since nρ_n → ∞ as n → ∞, we have m_n → ∞ as n → ∞, and from (49) we have m^{(n)}_* → ∞ a.s. as n → ∞. Thereby, with probability one, R_n ≠ ∅ for n large enough, and from Lemma 3 we have ℓ^{(n)}_* ≤ ℓ^{(n)}_{*,1} + ℓ^{(n)}_{*,2}, where

ℓ^{(n)}_{*,1} := 3 + |R_n| + 3 log n + log \binom{n}{|R_n|} + |R_n| log β̄_n,

and ℓ^{(n)}_{*,2} collects the block-encoding terms of Lemma 3. We claim that

ℓ^{(n)}_{*,1} / m_n → 0 a.s.   (52)

In order to show this, we consider two cases. If α_n ≤ e², then from (28) we have β_n = 1. Therefore, using \binom{n}{|R_n|} ≤ 2^n and |R_n| ≤ n, we have ℓ^{(n)}_{*,1} ≤ 3 + n + 3 log n + log 2^n = 3 + n + 3 log n + n log 2.
Combining the two cases, we get the following upper bound for ℓ^{(n)}_{*,1}, which holds in both cases:

ℓ^{(n)}_{*,1} ≤ 3 + n + 3 log n + n log 2 + n log β_n.   (53)

From Lemma 5 we have m^{(n)}/m_n → 1 a.s. as n → ∞. Therefore, with probability one, for n large enough, we have β_n ≤ √(nρ_n), and hence n log β_n ≤ (n/2) log(nρ_n), where we have also used nρ_n → ∞ as n → ∞.
Using this in (53), we realize that with probability one we have

lim sup_{n→∞} ℓ^{(n)}_{*,1} / m_n ≤ lim sup_{n→∞} [3 + n + 3 log n + n log 2 + (n/2) log(nρ_n)] / [(n−1)(nρ_n)/2] = 0,

which shows (52). Now, we study ℓ^{(n)}_{*,2}. Using \binom{r}{s} ≤ (re/s)^s, we can write the bound (54), where step (a) uses β*_n ≤ β̄_n and n*_i ≤ n_i, and in step (b) we have used the fact that, since |R_n| ≥ 2, we have 1 + log |R_n| ≤ 3 log |R_n|. Simplifying the bound in (54) using (55) and (56), we get an upper bound on ℓ^{(n)}_{*,2} whose first term is 6 β_n² log |R_n| and whose remaining terms are proportional to m^{(n)}_*. We claim that β_n² log |R_n| = o(m_n) a.s. To see this, note that if α_n ≤ e², then β_n = φ(α_n) = 1 and β_n² log |R_n| = log |R_n| ≤ log n. On the other hand, if α_n > e², recalling the definition of α_n in (26), since α_n ≤ m^{(n)}_*/n, we have β_n² ≤ α_n ≤ m^{(n)}_*/n. Thereby, we have β_n² log |R_n| ≤ (m^{(n)}_*/n) log n. Combining the two cases, we get the claim, where at step (*) we have used Proposition 6. This completes the proof.

VIII. PROOF OF CONVERSE (THEOREM 11)
In this section, we give the proof of our converse results, i.e. the two parts of Theorem 11.
The proof of (22) directly follows from the converse result of Theorem 7 in Section II-E. To prove (23), first note that if Σ(μ) = −∞ there is nothing to be proved. Next, note that any simple graph on n vertices can be encoded with O(n²) bits simply by indicating which edges exist. By using an additional bit to indicate whether we use this code or f_n (using the shorter of the two encodings), and then a header of fixed length O(log n), we get a lossless prefix-free code whose length is no more than O(log n) more than that of the code f_n for each graph. Now assume that Σ(μ) > −∞. Using the definition of the BC entropy, we may find a sequence ε_n → 0 together with a sequence m^{(n)} such that m^{(n)}/n → deg(μ)/2 and (61) holds. Let G^{(n)} be uniformly distributed over G_{n,m^{(n)}}(μ, ε_n). From the prefix-free and lossless nature of the code we constructed above, we conclude that E[nats(f_n(G^{(n)}))] + O(log n) ≥ H(G^{(n)}) = log |G_{n,m^{(n)}}(μ, ε_n)|. Comparing this with (61), since log n/n → 0 as n → ∞, we arrive at (23). In fact, the sequence G^{(n)} used here is the same as that in the proof of Theorem 7 (Theorem 4 in [7]). Hence, (22) and (23) hold simultaneously for the same sequence of random graphs.

Now we prove the second part. Assume that a sequence of lossless compression/decompression maps ((f_n, g_n) : n ≥ 1) is given. Similar to the above discussion for the first part, we may assume without loss of generality that f_n satisfies the prefix condition. Thereby, we have E[nats(f_n(G^{(n)}))] + O(log n) ≥ H(G^{(n)}). Therefore, from Proposition 2, we obtain the bound which establishes (25). In order to show (24), consider the lossless compression map f′_n : G_n → {0, 1}* defined as follows. Given a graph G ∈ G_n with m edges, f′_n(G) is comprised of the binary representation of m, followed by the index of G among all the graphs in G_{n,m}, i.e. the graphs which have the same number of edges m. Since m ≤ \binom{n}{2} < n², and the number of graphs with m edges is precisely \binom{\binom{n}{2}}{m}, we have

bits(f′_n(G)) ≤ 2 + 2 log₂ n + log₂ \binom{\binom{n}{2}}{m} =: l_{n,m}.

Now, we define another compression map f̄_n : G_n → {0, 1}* as follows. Assume that G ∈ G_n is given. If bits(f_n(G)) ≤ l_{n,m}, define b ∈ {0, 1}* to be obtained by concatenating the binary representation of bits(f_n(G)), using 1 + log₂(l_{n,m}) bits, followed by f_n(G). Thereby,

bits(b) ≤ 1 + log₂ l_{n,m} + bits(f_n(G)).   (62)
Then, if bits(b) < bits(f′_n(G)), we define f̄_n(G) to be a single bit with value zero followed by b. Otherwise, if either bits(f_n(G)) > l_{n,m}, or bits(f_n(G)) ≤ l_{n,m} and bits(b) ≥ bits(f′_n(G)), we define f̄_n(G) to be a single bit with value one followed by f′_n(G). Observe that f̄_n defined above satisfies the prefix condition. Additionally, since both f_n and f′_n are lossless, f̄_n is also lossless. Moreover, for all G ∈ G_n with m edges, we have

bits(f̄_n(G)) ≤ 1 + l_{n,m} = 3 + 2 log₂ n + log₂ \binom{\binom{n}{2}}{m},

or equivalently

nats(f̄_n(G)) ≤ 3 log 2 + 2 log n + log \binom{\binom{n}{2}}{m}.   (63)

In addition to this, we claim that for all G ∈ G_n having m edges we have

bits(f̄_n(G)) ≤ (1 + 1) + log₂ l_{n,m} + bits(f_n(G)),

or equivalently

nats(f̄_n(G)) ≤ (1 + 1) log 2 + log l_{n,m} + nats(f_n(G)).   (64)

To see this, observe that if bits(f_n(G)) ≤ l_{n,m} and bits(b) < bits(f′_n(G)), then using (62) we have bits(f̄_n(G)) = 1 + bits(b) ≤ 1 + (1 + log₂ l_{n,m}) + bits(f_n(G)). On the other hand, if bits(f_n(G)) ≤ l_{n,m} and bits(b) ≥ bits(f′_n(G)), we have bits(f̄_n(G)) = 1 + bits(f′_n(G)) ≤ 1 + bits(b) ≤ 1 + (1 + log₂ l_{n,m}) + bits(f_n(G)), where the last step again uses (62). Finally, if bits(f_n(G)) > l_{n,m}, we have bits(f̄_n(G)) = 1 + bits(f′_n(G)) ≤ 1 + l_{n,m} ≤ 1 + bits(f_n(G)) ≤ 1 + (1 + log₂ l_{n,m}) + bits(f_n(G)). Hence, we have verified that the claimed bound in (64) holds in all three cases. Now let W and the sequence ρ_n be as in the statement of Theorem 11, and let G^{(n)} be a sequence of W-random graphs with target densities ρ_n. Let m^{(n)} denote the number of edges in G^{(n)}. Using (64), for all t, we obtain the comparison (65) between the normalized lengths of f̄_n and f_n. Using a crude upper bound, we have

log l_{n,m^{(n)}} ≤ log(2 + 2 log₂ n + log₂ 2^{\binom{n}{2}}) ≤ log(2 + n + n²) ≤ log (n + 1)² = 2 log(n + 1).
On the other hand, we have m_n = \binom{n}{2} ρ_n and nρ_n → ∞ as n → ∞. Consequently, with probability one we have

lim_{n→∞} [2 log 2 + log l_{n,m^{(n)}}] / m_n = 0.
Comparing this with (65) above, we realize that in order to show (24), it suffices to show that (66) holds for all t < 1 − Ent(W). We fix t < 1 − Ent(W) and define the corresponding random variables. Note that, from (63), with probability one these random variables are bounded above uniformly in n. Thereby, employing Fatou's lemma, we get a bound which, combined with (67), yields (68). Note that m_n = \binom{n}{2} ρ_n = ((n−1)/2) · nρ_n and nρ_n → ∞. Hence, (3 log 2 + 2 log n)/m_n → 0 as n → ∞. Thereby, from Part 3 of Proposition 6 in Section VII, we have (69). Also, using Part 4 of Proposition 6, we have (70). Using (69) and (70) in (68), we get (71). Note that f̄_n is lossless and satisfies the prefix condition, which implies that E[nats(f̄_n(G^{(n)}))] ≥ H(G^{(n)}). Consequently, using Proposition 2 and combining with (71), we obtain the required inequality. This establishes (66) and completes the proof of Theorem 11.
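The length accounting in the construction of f̄_n above is mechanical, and can be phrased as a small routine. The following Python sketch (tracking code lengths only; the enumerative index used by f′_n is never materialized, and the helper names are hypothetical) selects between f_n and f′_n exactly as in the three-case rule, and checks the analogues of (63) and (64) on a toy instance.

import math
from math import comb

def l_nm(n, m):
    """l_{n,m} = 2 + 2 log2(n) + log2 C(C(n,2), m): the bit-length bound
    for the enumerative code f'_n (edge count, then the graph's index)."""
    return 2 + 2 * math.log2(n) + math.log2(comb(comb(n, 2), m))

def f_bar_bits(len_f, len_fprime, bound):
    """Bits used by the combined code f_bar: flag 0 + length header + f_n(G),
    or flag 1 + f'_n(G), following the three-case rule of the proof."""
    if len_f <= bound:
        len_b = 1 + math.floor(math.log2(bound)) + len_f  # header, then f_n(G)
        if len_b < len_fprime:
            return 1 + len_b        # flag bit 0, then b
    return 1 + len_fprime           # flag bit 1, then f'_n(G)

# Toy check for n = 30, m = 50; f'_n is taken at its length bound.
n, m = 30, 50
bound = l_nm(n, m)
for len_f in (40, 200, 10**6):
    bits = f_bar_bits(len_f, math.ceil(bound), bound)
    assert bits <= 1 + math.ceil(bound)            # the analogue of (63)
    assert bits <= 2 + math.log2(bound) + len_f    # the analogue of (64)
print("bounds (63) and (64) hold on the toy example")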

IX. CONCLUSION
We introduced a universal lossless compression method simultaneously applicable to both sparse graphs and heavy-tailed sparse graphs. We employed the framework of local weak convergence for sparse graphs, and the sparse graphon framework of [16] for heavy-tailed sparse graphs. The main purpose of this work is to address the information-theoretic limits of compression. An important direction for future work is to design efficient coding algorithms that can achieve these compression bounds.
Another future direction would be to investigate if there is a unifying framework for compression, together with an appropriately defined notion of entropy, covering a wider range of sparsity regimes. Of possible relevance for this, in a recent work, Backhausz and Szegedy [33] have defined a notion of graph convergence which covers L^p graphons and graphings as special cases (see [11], [33], [34], [35] for the definition and more on graphings). Furthermore, we believe that an important and challenging avenue of future research is to develop compression schemes for graphs that are not locally tree-like (such as random geometric graphs [36]). Note that the BC entropy is only interesting when the local weak limit is supported on rooted trees (see Theorem 1). Therefore, a fundamental challenge is to define an appropriate notion of entropy when the limit is not tree-like.

APPENDIX A PROOF OF THEOREM 8
In this section we give the proof of Theorem 8. Before that, we state and prove the following lemma.
Proof: The claim is obvious when x = y. Hence, assume without loss of generality that x > y. By convexity, we can bound the relevant difference; substituting this bound on the right hand side in the preceding inequality completes the proof.
Proof of Theorem 8: To see part 1, note that |x log x| ≤ 1/e + x² on [0, ∞). Thereby, we obtain the first bound. Using the definition of the δ₂ norm in (6), we can find for each n a coupling ν_n of π_n and π that realizes the distance up to a vanishing error. To simplify the notation, define μ_n := ν_n × ν_n to be the product measure on (Ω_n × Ω) × (Ω_n × Ω). Moreover, define the set B_n as in (75). Then, using (75), we can decompose the quantity of interest and bound each term separately. We start with the integral over B_n, for which we use Lemma 7.

APPENDIX B PROOF OF PROPOSITION 2
Proof of Proposition 2: Note that since ρ n → 0 we have ρ n < 1 for all n large enough. Also, since nρ n → ∞, for n large enough we have ρ n > 1/n > 0. Therefore, throughout the proof, we may assume that n is large enough so that 0 < ρ n < 1.
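Before going through the two steps in detail, a numerical illustration of the statement may help. The following Python sketch evaluates the per-pair entropy of a small step graphon; it assumes the entropy functional is Ent(W) = E[W log W] (an assumption of this sketch, consistent with Ent(W) = 0 for the constant graphon in the example of Section V). The normalized entropy approaches 1 − Ent(W) as ρ → 0.

import math

def h(x):
    """Binary entropy in nats."""
    return 0.0 if x in (0.0, 1.0) else -x * math.log(x) - (1 - x) * math.log(1 - x)

# Two-block step graphon with uniform marginals; E[W] = 1, so W is normalized.
W = [[0.5, 1.5], [1.5, 0.5]]
ent_W = sum(w * math.log(w) for row in W for w in row) / 4  # assumed Ent(W)

for rho in (1e-2, 1e-4, 1e-6):
    # H(G^(n)) = C(n,2) E[h(rho W)], so the normalization of Proposition 2 is
    # (H - m_n log(1/rho)) / m_n = E[h(rho W)]/rho - log(1/rho).
    eh = sum(h(rho * w) for row in W for w in row) / 4
    print(f"rho={rho:.0e}  normalized={eh / rho - math.log(1 / rho):.4f}  "
          f"target={1 - ent_W:.4f}")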
We prove the result in two steps. First, we show (81). Recall that in order to generate G^{(n)}, we start with an i.i.d. sequence (X_i)_{i=1}^∞ with distribution π and connect two nodes 1 ≤ i, j ≤ n, i ≠ j, with probability ρ_n W(X_i, X_j) ∧ 1. Note that, conditioned on X_{[1:n]}, the placement of edges is performed independently for each pair of vertices. Therefore, denoting by h(x) the binary entropy of x ∈ [0, 1] to the natural base, and identifying 0 log 0 ≡ 0 as usual, we may write H(G^{(n)} | X_{[1:n]}) as a sum of \binom{n}{2} identical terms of the form E[h(ρ_n W ∧ 1)], where in the last line we view W as a random variable on Ω × Ω with probability distribution π × π. We then split this expectation into two terms as in (83), and continue by bounding each term separately. For the first term in (83), since W is normalized by assumption, we have E[W] = 1. Multiplying both sides by \binom{n}{2}, then dividing by m_n, and recalling m_n = \binom{n}{2} ρ_n, we obtain the normalized form of this term. Since W is an L² graphon, we have E[W²] < ∞; moreover, since ρ_n → 0, we have ρ_n log(1/ρ_n) → 0. Thereby, we obtain (85). On the other hand, since W is L², from (72) we know that E[|W log(1/W)|] < ∞. Therefore, as 1/ρ_n → ∞, the dominated convergence theorem implies (86). Substituting (85) and (86) into (84) settles the first term. Now we turn to the second term on the right hand side of (83). Let τ_n := 1/√ρ_n ≤ 1/ρ_n. Using the Taylor remainder theorem, for x ≥ 0, we can write h(x) = x log(1/x) + x + x·η(x), where η(x) → 0 as x → 0. Using this in (88), multiplying both sides by \binom{n}{2}, and then dividing by m_n, we get (90). Note that E[W] = 1 and τ_n → ∞ as n → ∞. On the other hand, when W ≤ τ_n, we have ρ_n W ≤ ρ_n τ_n = √ρ_n. Recall that η(x) → 0 as x → 0, and √ρ_n → 0 as n → ∞; hence, there exists a sequence ε_n → 0 such that |η(ρ_n W)| ≤ ε_n when W ≤ τ_n. Substituting these estimates into (90) yields (81).

For the second step, let (Y_i : 1 ≤ i ≤ k) be a partition of [0, 1] into k intervals, each of probability 1/k, and let W_k be defined by averaging W over each cell of this partition; in other words, W_k is obtained from W by taking the average in each cell formed by the partition (Y_i : 1 ≤ i ≤ k). Lemma 5.6 in [15] implies that, since W is L², we have (95), i.e. δ₂(W_k, W) → 0 as k → ∞. Now, we fix k ≥ 1 and find an upper bound for H(G^{(n)}). Define the random variables X̃_i, 1 ≤ i ≤ n, by letting X̃_i be the index 1 ≤ j ≤ k such that X_i ∈ Y_j. Then H(G^{(n)}) ≤ H(X̃_{[1:n]}) + H(G^{(n)} | X̃_{[1:n]}) ≤ nH(X̃_1) + H(G^{(n)} | X̃_{[1:n]}), where (a) uses the fact that X̃_1 is uniformly distributed over {1, …, k} and that a joint entropy is bounded above by the sum of the corresponding marginal entropies, and in (b) we have used the symmetry in G^{(n)}. Note that conditioned on X̃_1 = i and X̃_2 = j for some 1 ≤ i, j ≤ k, X_1 and X_2 are independent and are distributed uniformly over Y_i and Y_j, respectively. For 1 ≤ i, j ≤ k and n ≥ 1, define a^{(n)}_{i,j} as in (97). Consequently, we obtain the three-term bound (96), and we now simplify each of these three terms. For 1 ≤ i ≤ k, let y_i be an arbitrary point in the interval Y_i. With a^{(n)}_{i,j} as in (97), and using the assumption that W is a normalized graphon, we obtain, when n is so large that ρ_n < 1, the bound (100). From the definition of a^{(n)}_{i,j} in (97), since 1/ρ_n → ∞, for all 1 ≤ i, j ≤ k the truncation is eventually inactive. Hence, since ‖W_k‖₁ = ‖W‖₁ = 1, the corresponding averages converge. On the other hand, using a^{(n)}_{i,j} ≤ W_k(y_i, y_j) ≤ max_{i,j} W_k(y_i, y_j) and ρ_n → 0 as n → ∞, there exists a sequence ε_n → 0 such that η(ρ_n a^{(n)}_{i,j}) ≤ ε_n for 1 ≤ i, j ≤ k. Using this together with (100), multiplying the numerator and the denominator on the left hand side by \binom{n}{2}, and recalling m_n = \binom{n}{2} ρ_n, we obtain the normalized bound, and using this in (96) we get the desired estimate, where the last line uses the fact that, since m_n = \binom{n}{2} ρ_n and nρ_n → ∞, we have n/m_n → 0. Note that this bound holds for all k ≥ 1. Moreover, from (95), δ₂(W_k, W) → 0 as k → ∞. Therefore, part 4 in Theorem 8 implies that Ent(W_k) → Ent(W). Hence, we arrive at (93) by sending k to infinity in the above bound. The proof is complete by putting (81) and (93) together.
APPENDIX C PROOF OF PROPOSITION 5

We assume that ((T^{(n)}_1, T^{(n)}_2) : n ≥ 1) is a sequence of good splitting mechanisms, and we arrive at a contradiction.
For n ≥ 1 and 1 ≤ k ≤ n, let G^{(n)}_k be an Erdős-Rényi random graph on n vertices where each edge is independently present with probability k/n. We can assume that (G^{(n)}_k : n ≥ 1, k ≤ n) live independently on a joint probability space. From this point forward, all of our probabilistic statements will refer to this joint probability space.
We know that, with μ_k ∈ P(T_*) being the law of the unimodular Galton-Watson tree with Poisson degree distribution and average degree k, we have U(G^{(n)}_k) ⇒ μ_k a.s. for each fixed k ≥ 1 as n → ∞. Therefore, the assumption that ((T^{(n)}_1, T^{(n)}_2) : n ≥ 1) is a sequence of good splitting mechanisms implies that, for each fixed k, T^{(n)}_1(G^{(n)}_k) also converges to μ_k in the local weak sense with probability one; in particular, with m^{(n)}_{k,1} denoting the number of edges in T^{(n)}_1(G^{(n)}_k), this yields the convergence stated in (106). We now define a sequence of integers (n_k : k ≥ 1) inductively as follows. Let n_0 := 0 and, for k ≥ 1, assuming that n_{k−1} is chosen, we choose n_k large enough such that the following three conditions are satisfied: 1) n_k > n_{k−1}; 2) the probability condition (107) on m^{(n_k)}_{k,1} holds; 3) (k + 1)/n_k < 1/(k + 1).
Note that condition (107) can be satisfied due to (106). We next define the sequence (G^{(n)} : n ≥ 1) of random graphs as follows. For n ≥ 1, let k(n) be the unique integer k ≥ 1 such that n_{k−1} < n ≤ n_k, and let G^{(n)} = G^{(n)}_{k(n)}. Note that since n_{k(n)−1} < n, using (108) we have

k(n)/n < k(n)/n_{k(n)−1} < 1/k(n).   (109)

In particular, this means that k(n) < n, and so the sequence G^{(n)} is well defined.
Observe that this sequence G^{(n)} can be represented in terms of a sequence of W-random graphs for the graphon W defined on the probability space [0, 1], equipped with the uniform distribution, such that W(x, y) = 1 for all x, y ∈ [0, 1]. To see this, let ρ_n := k(n)/n for n ≥ 1. The distribution of the sequence G^{(n)} = G^{(n)}_{k(n)} is then identical to the distribution of the sequence of W-random graphs with target densities ρ_n. Further, due to (109), we have lim_{n→∞} ρ_n = 0
and lim_{n→∞} nρ_n = ∞, because k(n) → ∞ as n → ∞. As a result, the assumption that ((T^{(n)}_1, T^{(n)}_2) : n ≥ 1) is a sequence of good splitting mechanisms implies that m^{(n)}_1/m_n → 0 a.s., where m_n := \binom{n}{2} ρ_n and m^{(n)}_1 is the number of edges in T^{(n)}_1(G^{(n)}); this is (112). But note that by definition we have k(n_k) = k, and thereby ρ_{n_k} = k(n_k)/n_k = k/n_k. Hence m_{n_k} = \binom{n_k}{2} · k/n_k = k(n_k − 1)/2. Therefore, using (107), along the subsequence (n_k) the number of edges in T^{(n_k)}_1(G^{(n_k)}) remains, with non-vanishing probability, a non-vanishing fraction of m_{n_k}. Since n_k → ∞ as k → ∞, this in particular means that m^{(n)}_1/m_n does not converge to 0 almost surely. But this is in contradiction with (112). Therefore, no sequence of good splitting mechanisms exists, and the proof is complete.
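The diagonal construction in this proof is easy to visualize numerically. The Python sketch below uses the hypothetical choice n_k = (k + 1)³, which guarantees n_k > n_{k−1} and (k + 1)/n_k < 1/(k + 1) but is otherwise only an illustration (the actual proof picks n_k to satisfy (106)-(108)); it shows how ρ_n = k(n)/n tends to 0 while nρ_n = k(n) tends to infinity arbitrarily slowly.

def k_of_n(n, n_of_k=lambda k: (k + 1) ** 3):
    """The unique k >= 1 with n_{k-1} < n <= n_k, for the assumed n_k."""
    k = 1
    while n_of_k(k) < n:
        k += 1
    return k

for n in (10, 10**3, 10**5, 10**7):
    k = k_of_n(n)
    print(f"n={n:<9} k(n)={k:<4} rho_n={k / n:.2e}  n*rho_n={k}")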

APPENDIX D PROOF OF PROPOSITION 6
Throughout this section, we assume that W is a normalized L² graphon and G^{(n)} ∼ G(n; ρ_n W) is a sequence of W-random graphs with target density ρ_n such that ρ_n → 0 and nρ_n → ∞. Also, m^{(n)} denotes the number of edges in G^{(n)} and m_n := \binom{n}{2} ρ_n. For better organization, we prove Proposition 6 in separate lemmas.

Proof of Lemma 9:
We pick (X_i)_{i=1}^∞ i.i.d. with distribution π on Ω, and generate G^{(n)} based on X_{[1:n]}. From Theorem 2.9 in [16], W is equivalent to a graphon over [0, 1] equipped with the uniform distribution. Therefore, without loss of generality, we may assume that W is an L² graphon over [0, 1], and that (X_i)_{i=1}^∞ is an i.i.d. sequence of random variables distributed uniformly over [0, 1].

Note that we have M_n = E[m^{(n)} | X_{[1:n]}].
With this definition, we prove the lemma in two steps, namely (113) and (114). On the other hand, if for some 1 ≤ i < j ≤ n we have W(X_i, X_j) ≤ τ_n, then ρ_n W(X_i, X_j) ≤ ρ_n τ_n = ρ_n^{7/8}. But ρ_n → 0 as n → ∞; hence, for n large enough, we have ρ_n^{7/8} < 1. This means that for n large enough, the truncation at 1 in the connection probabilities is inactive on such pairs. Putting this together with (116), we realize that the bound (118) holds for n large enough. Now, we claim that

lim_{n→∞} (Y_n − m_n) / (m_n log(1/ρ_n)) = 0 a.s.,   (119)

and

lim_{n→∞} Z_n / (m_n log(1/ρ_n)) = 0 a.s.   (120)

Note that (113) follows from (118), (119), and (120). We start with showing (119). Observe that, for 1 ≤ i ≤ n, x_1, …, x_n ∈ [0, 1], and x′_i ∈ [0, 1], since W is a symmetric function, replacing x_i by x′_i changes the value of Y_n by a uniformly bounded amount. Therefore, using the bounded difference inequality, we obtain a concentration bound for Y_n around E[Y_n]. Since Σ_{n≥1} exp(−2n^{1/4}) < ∞, the Borel-Cantelli lemma implies that, with probability one, for n large enough (where the threshold on n can be random), Y_n is within the stated deviation of E[Y_n]; the simplification of this deviation uses the facts that, as n → ∞, we have ρ_n → 0 and nρ_n → ∞. Consequently, we obtain (122). We turn to studying E[Y_n]. Recalling the definition of Y_n from (115a), and writing W for W(X, X′), with X and X′ being i.i.d. on Ω with distribution π, we may write E[Y_n] explicitly; therefore, we obtain (123), where the last step uses the fact that E[W] = 1. Note that by assumption W is an L² graphon, and so ‖W‖₂ < ∞. Also, by assumption, ρ_n → 0 as n → ∞. This together with (123) implies (124). Putting (122) and (124) together, we arrive at (119). We next focus on showing (120). Since ρ_n → 0 as n → ∞, we have ρ_n < 1 when n is large enough; consequently, for n large enough, the corresponding bound on Z_n holds, which yields (120).
Note that (113) follows from (118), (119), and (120). We start with showing (119). Observe that for 1 ≤ i ≤ n, x 1 , . . . , x n ∈ [0, 1], and x i ∈ [0, 1], since W is a symmetric function, we have Therefore, using the bounded difference inequality (see, for instance, [ Since n≥1 exp(−2n 1/4 ) < ∞, the Borel-Cantelli lemma implies that with probability one, for n large enough (where the threshold of n can be random), we have where the last equality follows from the facts that as n → ∞, we have ρ n → 0 and nρ n → ∞. Consequently, we have We turn to studying E [Y n ]. Recalling the definition of Y n from (115a), and writing W for W (X, X ) with X and X being i.i.d. on Ω with distribution π, we may write Therefore, where the last step uses the fact that E [W ] = 1. Note that by assumption W is an L 2 graphon and so W 2 < ∞. Also, by assumption, ρ n → 0 as n → ∞. This together with (123) implies that Putting (122) and (124) together, we arrive at (119). We next focus on showing (120). Note that we have Since ρ n → 0 as n → ∞, we have ρ n < 1 when n is large enough. Consequently, for n large enough, we have This together with (144) implies that (β (i) n ) 2 log β (i) n = o(nρ n ). Consequently, we have verified (143). As we discussed earlier, with probability one, for n large, we have β n ∈ {β (i) n : 1 ≤ i ≤ 3}. This together with (143) implies that with probability one, β n → ∞ and β 2 n log β n = o(nρ n ). On the other hand, if W (n) i is defined similar to W (n) based on solving the estimation problem (9) with β n replaced by β (i) n , from Theorem 3.1 in [16], with probability one, we have but we have previously shown that with probability one, we have β n ∈ {β (i) n : 1 ≤ i ≤ 3} for n large enough. This means that with probability one, for n large enough, we have W (n) ∈ { W (n) i : 1 ≤ i ≤ 3}. This together with (145) implies that with probability one, δ 2 ( W (n) /ρ n , W ) → 0 and completes the proof.
Proof of Lemma 6: Recalling the definition of δ₂ and employing the identity coupling, we get the upper bound (146). On the other hand, using the facts that for 1 ≤ i, j ≤ β_n we have λ_{i,j} ≥ λ*_{i,j} and n_i ≥ n/β_n, we obtain a complementary bound. Comparing this with (146) yields the claim of the lemma.