A Universal Low Complexity Compression Algorithm for Sparse Marked Graphs

Many modern applications involve accessing and processing graphical data, i.e. data that is naturally indexed by graphs. Examples come from internet graphs, social networks, genomics and proteomics, and other sources. The typically large size of such data motivates seeking efficient ways for its compression and decompression. The current compression methods are usually tailored to specific models, or do not provide theoretical guarantees. In this paper, we introduce a low-complexity lossless compression algorithm for sparse marked graphs, i.e. graphical data indexed by sparse graphs, which is capable of universally achieving the optimal compression rate in a precisely defined sense. In order to define universality, we employ the framework of local weak convergence, which allows one to make sense of a notion of stochastic processes for sparse graphs. Moreover, we investigate the performance of our algorithm through some experimental results on both synthetic and real-world data.

data is generated from a certain statistical model and the encoder aims to achieve the entropy of this input distribution. For instance, Choi and Szpankowski studied the structural entropy of the Erdős-Rényi model, i.e., the entropy associated to the isomorphism classes of such graphs [16]. Moreover, they proposed a compression scheme which asymptotically achieves the structural entropy. Aldous and Ross studied the asymptotics of the entropy of several models of random graphs, including the sparse Erdős-Rényi ensemble [4]. Abbe studied the asymptotic behavior of the entropy of stochastic block models, discussed the optimal compression rate for such models up to the first order term [1], and also considered the case where vertices in a stochastic block model can carry data which is conditionally independent given their community membership. Łuczak et al. studied the asymptotics of the entropy associated to the preferential attachment model, for both the labeled and unlabeled regimes, and used this to design optimal compression schemes [26]. Turowski et al. studied the information content of the duplication model [29].
A second line of research aims to compress specific types of graphical data, such as Web graphs [9], [13], social networks [12], [15], [27], or biological networks [22], [23]. These works usually take advantage of some properties specific to the data source. Results in this category usually do not come with an information-theoretic guarantee of optimality. For instance, the Web graph framework of [13] employs the locality and similarity properties existing in Web graphs to design an efficient compression algorithm tailored for such data. The reader is referred to [8] for a survey on graph compression methods.

B. Our Contribution
The key property distinguishing our approach from the existing ones is universality. More precisely, we introduce a scheme which is capable of compressing graphs which come from a certain "stochastic process" without any prior knowledge of this process, yet is able to achieve the optimal compression rate, in a precisely defined sense. Additionally, in contrast to several earlier works, we assume that the graphs are "marked", so that vertices and edges can carry additional information on top of the connectivity structure of the graph. Our focus in this paper is on sparse marked graphs, in the sense that the number of edges in the graph scales linearly with the number of vertices. The motivation for this is that real-world graphs are usually more or less sparse. In another work of the authors, we have studied the problem of graph data compression for a different regime of sparsity [21].
To make sense of the notion of a "stochastic process" for sparse marked graphs, we employ the framework of local weak convergence [2], [3], [7]. Moreover, we employ a notion of entropy called the marked BC entropy [19], which is an extension of the notion of entropy introduced by Bordenave and Caputo in [14] and serves as a counterpart of the Shannon entropy rate for this framework. The marked BC entropy is defined by analyzing the asymptotic behavior of the size of the set of typical graphs. It can be seen that the logarithm of the size of the set of typical marked graphs has a leading term proportional to n log n, n being the number of vertices in the graph, with the proportionality constant being half the average degree of the graph, and the marked BC entropy shows up at the second order term, which scales linearly with the number of vertices. In other words, the marked BC entropy captures the per-node growth rate of the logarithm of the size of the set of typical marked graphs, after carefully separating out the leading term. The authors have already introduced a universal compression scheme in [20] which shows that this notion of entropy is indeed the optimal information-theoretic threshold of compression. In [20], the encoder needs to find the index of the input graph among all graphs which have the same frequency of local structures as the input graph. Although this method is proved to universally achieve the marked BC entropy in an asymptotic sense, it is computationally intractable. The focus of this paper is to provide a compression algorithm which asymptotically achieves the optimal compression rate and also is computationally efficient.
There have been some attempts in the literature to address universality in the context of graphical data compression, for instance to compress deep neural networks [5], [6], or stochastic block models [10]. However, such attempts usually address universality in a limited fashion, i.e., the graph is generated from a class of ensembles with unknown parameters. Moreover, such attempts usually consider the ratio of the codeword length to the overall ensemble entropy, as opposed to our framework, which considers the per-node entropy after carefully separating out the leading term.

C. Notational Conventions
N denotes the set of positive integers. For a positive integer n, [n] denotes the set of integers {1, 2, . . ., n}. For integers i and j, [i : j] denotes the set {i, . . ., j} if i ≤ j, and the empty set otherwise. Throughout the paper, log refers to the logarithm in the natural base, while log2 refers to the logarithm in base 2. We write {0, 1}* − ∅ for the set of nonempty binary sequences of finite length. Given x ∈ {0, 1}* − ∅, bits(x) refers to the length of x in bits, while nats(x) denotes the length of x in nats, i.e., nats(x) = bits(x) × log 2.
We write := and =: for equality by definition. For two sequences (a_n : n ≥ 1) and (b_n : n ≥ 1) of nonnegative real numbers, we write a_n = O(b_n) if there exists a constant C > 0 such that a_n ≤ C b_n for n large enough. Moreover, we write a_n = Θ(b_n) if both a_n = O(b_n) and b_n = O(a_n). For nonnegative integers r and s, we define the binomial coefficient (r choose s) := r!/(s!(r − s)!) if r ≥ s, and (r choose s) := 0 if r < s. For nonnegative integers r and s, we define the falling factorial (r)_s to be r(r − 1)(r − 2) · · · (r − (s − 1)). In other words, if r ≥ s, we have (r)_s = r!/(r − s)!, while if r < s, we have (r)_s = 0. For an even integer k > 0, we define (k − 1)!! := k!/(2^{k/2} (k/2)!), which is the number of perfect matchings on k objects. Moreover, we define (−1)!! := 1. For a probability distribution P defined on a finite set, H(P) denotes the Shannon entropy of P. Other notation used in this document is defined on its first appearance.
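The combinatorial conventions above can be checked numerically; the following is a small sketch (function names are ours), where the stated convention for the binomial coefficient, i.e., (r choose s) = 0 when r < s, is already the behavior of Python's `math.comb`:

```python
from math import factorial

def falling_factorial(r: int, s: int) -> int:
    """(r)_s = r (r-1) ... (r-(s-1)); equals 0 when r < s for nonnegative ints,
    since one of the factors then hits zero."""
    out = 1
    for i in range(s):
        out *= (r - i)
    return out

def matchings(k: int) -> int:
    """(k-1)!! for even k >= 0: the number of perfect matchings on k objects,
    computed as k! / (2^{k/2} (k/2)!); by convention (-1)!! = 1 (the case k = 0)."""
    assert k >= 0 and k % 2 == 0
    return factorial(k) // (2 ** (k // 2) * factorial(k // 2))
```

For instance, there are 3 perfect matchings on 4 objects, matching (4 − 1)!! = 3.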

D. Structure of the Document
The structure of the document is as follows. In Section II, we review some of the tools that we use, specifically the local weak convergence framework and the marked BC entropy. In Section III, we rigorously define the problem of universal graphical data compression and state our main results in Theorem 3, which introduces our compression algorithm and discusses its main properties, i.e., information-theoretic optimality and low computational complexity. In Section IV, we give an overview of the steps of our universal compression algorithm, without going through the details. In Section V, we illustrate the performance of our compression algorithm for both synthetic and real-world graphical data. The details of our compression algorithm together with the analysis of its optimality and complexity are given in the longer version of this paper [17]. A subset of this document which only contains the algorithm details is also available [18].

A. Graphs
A simple graph G consists of a set of vertices, denoted by V(G), and a set of edges, without multiple edges or self loops. We use the terms "vertex" and "node" interchangeably. For vertices v and w in V(G), we write v ∼_G w to denote that there is an edge in G between v and w. All graphs in this document are simple.
A (simple) marked graph is a simple graph in which every vertex carries a mark coming from a finite vertex mark set Θ, and every edge carries two marks, one towards each of its endpoints, coming from a finite edge mark set Ξ. The mark of vertex v ∈ V(G) is denoted by τ_G(v), and the mark of an edge between vertices v and w towards vertex w is denoted by ξ_G(v, w). See Figure 1 for an example. Given a marked graph G, the edge mark count vector of G is defined to be the vector m_G := (m_G(x, x′) : x, x′ ∈ Ξ) such that for x, x′ ∈ Ξ, m_G(x, x′) is the number of edges in G with mark x towards one endpoint and mark x′ towards the other endpoint. Note that by definition, we have m_G(x, x′) = m_G(x′, x) for all x, x′ ∈ Ξ. Furthermore, the vertex mark count vector is defined to be the vector u_G := (u_G(θ) : θ ∈ Θ), where u_G(θ) for θ ∈ Θ is the number of vertices in G with mark θ. A marked tree is a marked graph T where the underlying graph is a tree. An unmarked graph can be thought of as a marked graph where Θ and Ξ are of cardinality 1.
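The edge and vertex mark count vectors can be computed directly from a list of marked edges and a vertex mark assignment; the following is a minimal sketch (the tuple convention for edges is ours):

```python
from collections import Counter

def edge_mark_counts(edges):
    """edges: list of (v, w, x, xp) with mark x towards v and mark xp towards w.
    Returns m_G as a Counter, with m_G[(x, xp)] == m_G[(xp, x)] by construction."""
    m = Counter()
    for _v, _w, x, xp in edges:
        m[(x, xp)] += 1
        if x != xp:
            m[(xp, x)] += 1  # keep the vector symmetric
    return m

def vertex_mark_counts(tau):
    """tau: dict vertex -> vertex mark.  Returns u_G(theta) for each theta."""
    return Counter(tau.values())
```

For instance, a single edge with marks (x, x′) = (a, b) contributes one unit to both m_G(a, b) and m_G(b, a), reflecting the symmetry noted in the text.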
In a marked graph G, a walk between two vertices v and w is a sequence of distinct vertices v = v_0, v_1, . . ., v_k = w such that v_i ∼_G v_{i−1} for all 1 ≤ i ≤ k. The length of such a walk is defined to be k. The distance between vertices v and w is defined to be the length of the shortest walk connecting them, or infinity if such a walk does not exist. The connected component of a vertex v ∈ V(G), which is denoted by G_v, is the subgraph comprised of vertices in G with a finite distance from v. Note that G_v, by definition, is connected. The degree of v, denoted by deg_G(v), is defined to be the number of edges connected to v.
A graph G is called locally finite if the degree of every vertex is finite. All graphs in this document are assumed to be locally finite.
Given a (simple, locally finite) marked graph G together with a vertex v ∈ V(G), we define the "universal cover of G at v", denoted by UC_v(G), as follows. Every vertex in UC_v(G) is in one-to-one correspondence with a non-backtracking walk starting at v, i.e., a sequence of vertices v = v_0, v_1, . . ., v_k for some k ≥ 0 such that v_i ∼_G v_{i−1} for 1 ≤ i ≤ k, and such that v_{i−1} ≠ v_{i+1} for 1 ≤ i ≤ k − 1, i.e., the walk is non-backtracking. A vertex in UC_v(G) corresponding to a non-backtracking walk (v_i : 0 ≤ i ≤ k) is given the mark τ_G(v_k). Moreover, for each non-backtracking walk v = v_0, . . ., v_k with k ≥ 1, if ṽ and w̃ denote the vertices in UC_v(G) corresponding to the walks v_0, . . ., v_k and v_0, . . ., v_{k−1}, respectively, we place an edge in UC_v(G) between the vertices ṽ and w̃, with mark ξ_G(v_{k−1}, v_k) towards ṽ and mark ξ_G(v_k, v_{k−1}) towards w̃. With an abuse of notation, we denote the vertex in UC_v(G) associated to the walk (v) of length 0 by v. Likewise, for a vertex w ∼_G v, we denote the vertex in UC_v(G) associated to the walk (v, w) by w. See Figure 2 for an example. See [28], for instance, for more discussion on universal covers.

B. The Framework of Local Weak Convergence
In this section we introduce the framework of local weak convergence, which we use in order to make sense of stochastic processes for the space of sparse marked graphs. The reader is referred to [2], [3], [7] for more details on this framework.
A rooted marked graph is defined to be a marked graph G with a distinguished vertex o ∈ V(G), and is denoted by (G, o). Two rooted marked graphs are isomorphic if there is a vertex bijection between them which preserves adjacencies, vertex and edge marks, and maps one root to the other. We write [G, o] for the isomorphism class of (G, o), and let Ḡ* denote the set of isomorphism classes [G, o] of connected, locally finite rooted marked graphs. For [G, o], [G′, o′] ∈ Ḡ*, let ĥ be the supremum of those h ≥ 0 for which the depth-h neighborhoods of the roots in (G, o) and (G′, o′) are isomorphic, and define d*([G, o], [G′, o′]) := 1/(1 + ĥ). One can check that d* is a metric; in particular, it satisfies the triangle inequality. Moreover, Ḡ* together with this metric is a Polish space, i.e., a complete separable metric space [2]. Let T̄* denote the subset of Ḡ* comprised of the equivalence classes [G, o] arising from some (G, o) where the graph underlying G is a tree. In the sequel we will think of Ḡ* as a Polish space with the metric d*, rather than just a set. Note that T̄* is a closed subset of Ḡ*.
For a Polish space Ω, let P(Ω) denote the set of Borel probability measures on Ω. We say that a sequence of measures μ_n on Ω converges weakly to μ ∈ P(Ω), and write μ_n ⇒ μ, if for any bounded continuous function f on Ω we have ∫ f dμ_n → ∫ f dμ. It can be shown that it suffices to verify this condition only for uniformly continuous and bounded functions [11]. For a Borel set B ⊂ Ω and ε > 0, the ε-extension of B, denoted by B^ε, is defined as the union of the open balls with radius ε centered around the points in B. For two probability measures μ and ν in P(Ω), the Lévy-Prokhorov distance d_LP(μ, ν) is defined to be the infimum of all ε > 0 such that for all Borel sets B ⊂ Ω we have μ(B) ≤ ν(B^ε) + ε and ν(B) ≤ μ(B^ε) + ε. It is known that the Lévy-Prokhorov distance metrizes the topology of weak convergence on the space of probability distributions on a Polish space (see, for instance, [11]). For x ∈ Ω, let δ_x be the Dirac measure at x.
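On a small finite metric space, the Lévy-Prokhorov distance can be evaluated by brute force straight from the definition (checking all subsets B and scanning ε on a grid); this is purely an illustrative sketch, not part of the compression algorithm, and the names are ours:

```python
from itertools import combinations

def lp_distance(mu, nu, d, step=1e-3):
    """Approximate Levy-Prokhorov distance between two distributions on a small
    finite metric space.  mu, nu: dicts point -> probability; d(x, y): metric.
    Scans epsilon on a grid, so the answer is accurate only up to `step`."""
    pts = sorted(set(mu) | set(nu))
    subsets = [set(c) for r in range(len(pts) + 1) for c in combinations(pts, r)]
    def mass(p, B):
        return sum(p.get(x, 0.0) for x in B)
    def ext(B, eps):  # open eps-extension of B
        return {y for y in pts if any(d(x, y) < eps for x in B)}
    eps = step
    while eps <= 1.0 + step:
        if all(mass(mu, B) <= mass(nu, ext(B, eps)) + eps and
               mass(nu, B) <= mass(mu, ext(B, eps)) + eps for B in subsets):
            return eps
        eps += step
    return 1.0
```

For two Dirac measures at points a distance r < 1 apart, this recovers d_LP ≈ r, consistent with the definition.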
For a finite marked graph G, define U(G) ∈ P(Ḡ*) as U(G) := (1/|V(G)|) Σ_{v ∈ V(G)} δ_{[G_v, v]}, where G_v denotes the connected component of v. Note that U(G) ∈ P(Ḡ*). In creating U(G) from G, we have created a probability distribution on rooted marked graphs from the given marked graph G by rooting the graph at a vertex chosen uniformly at random. Furthermore, for an integer h ≥ 1, let U_h(G) := (1/|V(G)|) Σ_{v ∈ V(G)} δ_{[G_v, v]_h}, where [G_v, v]_h denotes the truncation of [G_v, v] up to depth h from the root. We then have U_h(G) ∈ P(Ḡ*). See Figure 3 for an example. We say that a probability distribution μ on Ḡ* is the local weak limit of a sequence of finite marked graphs {G_n}_{n=1}^∞ when U(G_n) converges weakly to μ (with respect to the topology induced by the metric d*). This turns out to be equivalent to the condition that, for any finite depth h ≥ 0, the structure of G_n from the point of view of a root chosen uniformly at random, looking around it only to depth h, converges in distribution to μ truncated up to depth h. This description of what is being captured by the definition justifies the term "local weak convergence".
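The depth-h empirical distribution U_h(G) can be computed explicitly when the depth-h neighborhoods are trees (the locally tree-like case relevant here), since rooted trees admit a simple recursive canonical form; a sketch under that assumption (names ours; this is not correct around short cycles):

```python
from collections import Counter

def canon(adj, tau, root, h, parent=None):
    """Canonical form of the depth-h rooted neighborhood of `root`, assuming
    this neighborhood is a tree: the root mark plus the sorted multiset of the
    canonical forms of the depth-(h-1) subtrees at its neighbors."""
    if h == 0:
        return (tau[root],)
    kids = tuple(sorted(canon(adj, tau, u, h - 1, root)
                        for u in adj[root] if u != parent))
    return (tau[root], kids)

def U_h(adj, tau, h):
    """Fraction of vertices whose depth-h rooted neighborhood has each
    canonical form, i.e., the depth-h empirical distribution."""
    counts = Counter(canon(adj, tau, v, h) for v in adj)
    n = len(adj)
    return {form: c / n for form, c in counts.items()}
```

On a path with three vertices and identical marks, for example, U_1 puts mass 2/3 on the degree-one neighborhood and 1/3 on the degree-two one.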
In fact, U_h(G) could be thought of as the "depth-h empirical distribution" of the marked graph G. On the other hand, a probability distribution μ ∈ P(Ḡ*) that arises as a local weak limit plays the role of a stochastic process on graphical data, and a sequence of marked graphs {G_n}_{n=1}^∞ could be thought of as being asymptotically distributed like this process when μ is the local weak limit of the sequence.
The degree of a probability measure μ ∈ P(Ḡ*), denoted by deg(μ), is defined as deg(μ) := E_μ[deg_G(o)], which is the expected degree of the root. Similarly, for μ ∈ P(Ḡ*) and x, x′ ∈ Ξ, let deg_{x,x′}(μ) be the expected number, under μ, of edges connected to the root with mark x towards the root and mark x′ towards the other endpoint. We use the notation deg(μ) := (deg_{x,x′}(μ) : x, x′ ∈ Ξ). Moreover, for μ ∈ P(Ḡ*) and θ ∈ Θ, let Π_θ(μ) := μ({[G, o] : τ_G(o) = θ}) denote the probability under μ that the mark of the root is θ. We use the notation Π(μ) := (Π_θ(μ) : θ ∈ Θ).
All the preceding definitions and concepts have parallels in the case of unmarked graphs. As in the current literature, for the set of rooted isomorphism classes of unmarked graphs one uses the notation G*, with d* for the metric on it; these are just Ḡ* and the metric d* on Ḡ* in the special case where both Ξ and Θ are sets of cardinality 1.
Since each vertex in G_n has the same chance of being chosen as the root in the definition of U(G_n), this should manifest itself as some kind of stationarity property of the limit μ with respect to changes of the root. This property is called unimodularity. A probability distribution μ ∈ P(Ḡ*) is called sofic if there exists a sequence of finite graphs G_n with local weak limit μ. The definition of unimodularity is made in an attempt to understand what it means for a Borel probability distribution on Ḡ* to be sofic.
To define unimodularity, let Ḡ** be the set of isomorphism classes [G, o, v] where G is a marked connected graph with two distinguished vertices o and v in V(G) (ordered, but not necessarily distinct). Here, isomorphism is defined by an adjacency preserving vertex bijection which preserves vertex and edge marks, and also maps the two distinguished vertices of one object to the respective ones of the other. A measure μ ∈ P(Ḡ*) is said to be unimodular if, for all measurable functions f : Ḡ** → R_+, we have E_μ[Σ_{v ∈ V(G)} f(G, o, v)] = E_μ[Σ_{v ∈ V(G)} f(G, v, o)]. Here, the summation is taken over all vertices v which are in the same connected component of G as o. It can be seen that it suffices to check the above condition for functions f such that f(G, o, v) = 0 unless v ∼_G o [2]. A sofic probability measure is unimodular.
Whether the other direction also holds is unknown.See [2] for more details.

C. The Marked BC Entropy
In order to make sense of the information-theoretic optimality of our compression algorithm, we look for a notion of entropy defined for probability measures on the space of rooted marked graphs, i.e., Ḡ*, defined above. Recall that the sets of edge and vertex marks, i.e., Ξ and Θ, are fixed and finite sets. The notion of entropy we employ is a generalization to the marked framework discussed above of that defined by Bordenave and Caputo in [14]. This generalization is due to us, and the reader is referred to [19] for more details.
An edge mark count vector is defined as a vector of nonnegative integers m := (m(x, x′) : x, x′ ∈ Ξ) such that m(x, x′) = m(x′, x) for all x, x′ ∈ Ξ. A vertex mark count vector is defined as a vector of nonnegative integers u := (u(θ) : θ ∈ Θ). Since Ξ is finite, we may assume it is an ordered set. Let ‖m‖₁ := Σ_{x ≤ x′ ∈ Ξ} m(x, x′) and ‖u‖₁ := Σ_{θ ∈ Θ} u(θ).
For an integer n ∈ N and edge mark and vertex mark count vectors m and u, define G^(n)_{m,u} to be the set of marked graphs on the vertex set [n] such that m_G = m and u_G = u; in particular, this requires ‖u‖₁ = n. An average degree vector is defined to be a vector of nonnegative reals d = (d_{x,x′} : x, x′ ∈ Ξ) such that for all x, x′ ∈ Ξ we have d_{x,x′} = d_{x′,x}, and Σ_{x,x′ ∈ Ξ} d_{x,x′} > 0.
Definition 1: Given an average degree vector d and a probability distribution Q = (q_θ : θ ∈ Θ), a sequence (m^(n), u^(n)) of edge mark count vector and vertex mark count vector pairs is called adapted to (d, Q) if ‖u^(n)‖₁ = n for all n, u^(n)(θ)/n → q_θ for all θ ∈ Θ, m^(n)(x, x′)/n → d_{x,x′} for all x ≠ x′ ∈ Ξ, and m^(n)(x, x)/n → d_{x,x}/2 for all x ∈ Ξ.
The key idea in defining the BC entropy is to count the number of "typical" graphs. More precisely, given μ ∈ P(Ḡ*) such that 0 < deg(μ) < ∞, for ε > 0 and edge and vertex mark count vectors m^(n) and u^(n), respectively, define G^(n)_{m^(n), u^(n)}(μ, ε) := {G ∈ G^(n)_{m^(n), u^(n)} : d_LP(U(G), μ) < ε}, where d_LP refers to the Lévy-Prokhorov metric on P(Ḡ*). In fact, one can interpret G^(n)_{m^(n), u^(n)}(μ, ε) as the set of ε-typical graphs with respect to μ. It turns out that, roughly speaking, the number of ε-typical graphs scales as exp(‖m^(n)‖₁ log n + n Σ(μ) + o(n)), where Σ(μ) is the marked BC entropy of μ. In order to make this precise, we make the following definition. Fix an average degree vector d and a probability distribution Q = (q_θ : θ ∈ Θ), and suppose that (m^(n), u^(n)) is adapted to (d, Q). Define Σ̄(μ)|_ε := limsup_{n→∞} (log |G^(n)_{m^(n), u^(n)}(μ, ε)| − ‖m^(n)‖₁ log n)/n. Since this is increasing in ε, we can define the upper BC entropy Σ̄(μ) := lim_{ε↓0} Σ̄(μ)|_ε. Similarly, with liminf in place of limsup, define Σ(μ)|_ε. Since this is also increasing in ε, we can define the lower BC entropy Σ(μ) := lim_{ε↓0} Σ(μ)|_ε. Certain conditions must be met for the marked BC entropy to be of interest.
Theorem 1 [19, Th. 1]: Let an average degree vector d = (d_{x,x′} : x, x′ ∈ Ξ) and a probability distribution Q = (q_θ : θ ∈ Θ) be given. Suppose μ ∈ P(Ḡ*) with 0 < deg(μ) < ∞ satisfies any one of the following conditions: (1) μ is not unimodular; (2) μ is not supported on T̄*; (3) deg_{x,x′}(μ) ≠ d_{x,x′} for some x, x′ ∈ Ξ, or Π_θ(μ) ≠ q_θ for some θ ∈ Θ. Then, for any choice of the sequences m^(n) and u^(n) adapted to (d, Q), we have Σ̄(μ) = Σ(μ) = −∞. Thus the only case of interest in the discussion of marked BC entropy is when μ ∈ P(T̄*) is unimodular, deg(μ) = d, Π(μ) = Q, and (m^(n), u^(n)) is adapted to (deg(μ), Π(μ)). Namely, the only upper and lower marked BC entropies of interest are those evaluated at (deg(μ), Π(μ)). The upper and lower marked BC entropies do not depend on the choice of (m^(n), u^(n)). Further, the upper marked BC entropy is always equal to the lower marked BC entropy; we refer to this common value as the marked BC entropy Σ(μ).

III. PROBLEM STATEMENT AND MAIN RESULTS
Let Ḡ_n denote the set of simple marked graphs on the vertex set [n] with edge and vertex marks coming from the fixed and finite sets Ξ and Θ, respectively. Without loss of generality, we may assume that Ξ = {1, . . ., |Ξ|} and Θ = {1, . . ., |Θ|}. These mark sets are fixed and known to both the encoder and the decoder. Our goal is to design a lossless compression algorithm which maps simple marked graphs in Ḡ_n to {0, 1}* − ∅ in a one-to-one manner. Our algorithm uses two integers h ≥ 1 and δ ≥ 1 as hyperparameters. Therefore, for each n ≥ 1, we will introduce an encoding function f^(n)_{h,δ} : Ḡ_n → {0, 1}* − ∅ together with a decoding function g^(n)_{h,δ} such that g^(n)_{h,δ}(f^(n)_{h,δ}(G)) = G for all G ∈ Ḡ_n. We also require f^(n)_{h,δ} to satisfy the standard prefix-free condition. In addition to this, we want f^(n)_{h,δ} to be universally optimal from an information-theoretic perspective. More precisely, if a sequence of simple marked graphs G^(n) is given which converges to a limit μ ∈ P(Ḡ*) in the local weak sense described in Section II-B, i.e., U(G^(n)) ⇒ μ, then we want the asymptotic codeword length nats(f^(n)_{h,δ}(G^(n))), after normalization and sending n, δ, and h to infinity in that order, to be no more than the marked BC entropy of the limit μ, as was defined in Section II-C. Motivated by the definition of the marked BC entropy in Section II-C, the correct normalization is to consider (nats(f^(n)_{h,δ}(G^(n))) − m^(n) log n)/n, where m^(n) is the total number of edges in G^(n). More precisely, we say that such an encoding function is optimal if limsup_{h→∞} limsup_{δ→∞} limsup_{n→∞} (nats(f^(n)_{h,δ}(G^(n))) − m^(n) log n)/n ≤ Σ(μ). This notion of optimality is justified by the converse result in our earlier work [20, Th. 4]. Here, in order to address universality, we assume that the encoder does not know the limit μ, and only gets to see the simple marked graph G^(n) as its input.
We assume that a simple marked graph G^(n) is given to the encoder by representing (1) its vertex mark sequence θ^(n) = (θ^(n)_v : v ∈ [n]), where θ^(n)_v = τ_{G^(n)}(v), and (2) a list of marked edges ((v_i, w_i, x_i, x′_i) : 1 ≤ i ≤ m^(n)), where m^(n) denotes the total number of edges in G^(n), and for 1 ≤ i ≤ m^(n), the tuple (v_i, w_i, x_i, x′_i) represents an edge between the vertices v_i and w_i, with mark x_i towards v_i and mark x′_i towards w_i, i.e., ξ_{G^(n)}(w_i, v_i) = x_i and ξ_{G^(n)}(v_i, w_i) = x′_i. We assume that each marked edge is represented only once in this list, but in an arbitrary orientation, i.e., either v_i < w_i or w_i < v_i. We call this form of representing G^(n) the "edge list" representation.
Our main contribution is to introduce a compression algorithm which (1) is universally optimal from an information-theoretic perspective, and (2) has a low computational complexity for both encoding and decoding. Theorem 3 below states what this means.
Theorem 3: There exists a compression/decompression algorithm with encoding and decoding functions f^(n)_{h,δ} and g^(n)_{h,δ} as defined above, which has the following properties: 1) (Optimality) Assume that a unimodular μ ∈ P(T̄*) with 0 < deg(μ) < ∞ is given. Also, assume that a sequence G^(n) of simple marked graphs is given such that U(G^(n)) ⇒ μ and, with m^(n) being the total number of edges in G^(n), we have m^(n)/n → deg(μ)/2. Then, we have limsup_{h→∞} limsup_{δ→∞} limsup_{n→∞} (nats(f^(n)_{h,δ}(G^(n))) − m^(n) log n)/n ≤ Σ(μ). 2) (Computational Complexity) The time and memory complexities of encoding a graph G^(n) with m^(n) edges using our compression algorithm, as well as the time and memory complexities of decoding it using our decompression algorithm, are quasi-linear in m^(n) + n when h, δ, Ξ, and Θ are held fixed, i.e., O((m^(n) + n) · polylog(n)); the precise bounds, including the dependence on h, δ, |Ξ|, and |Θ|, are given in [17]. The proof of Theorem 3 is given in the longer version of this paper [17]. The following corollary, in the case that m^(n) = Θ(n) and all the other parameters are constant, helps study the effect of n, the number of vertices, on the complexity of our algorithm.
Corollary 1: If m^(n) = Θ(n) and all the other parameters are held constant, then the time and memory complexities of both encoding and decoding grow linearly in n up to logarithmic factors.
Remark 1: Since the graph is given to our algorithm in its edge list form, the time complexity of reading the graph from the input is Θ(m^(n) log n). Hence, from Corollary 1, the complexity of our algorithm in n, the number of vertices, is optimal up to logarithmic factors.
Remark 2: In [17, Th. 4] we derive upper bounds on the asymptotic normalized codeword length for fixed h and δ. If the limit is known, such bounds help determine how large h and δ should be chosen so as to approach the marked BC entropy to within a desired threshold. When the limit is not known, we treat h and δ as hyperparameters that need to be chosen heuristically.
In Section IV below, we explain the steps of our algorithm without going into the details. The details, as well as the complexity analysis and the proof of optimality, are provided in [17].

IV. STEPS OF THE UNIVERSAL COMPRESSION ALGORITHM
In this section, we give an overview of the steps of our universal compression algorithm. For the details, see [17]. A shorter document with just the algorithm details is available at [18]. We will use the graph in Figure 4 to illustrate the steps of our algorithm. Recall from Section III that we assume that Ξ = {1, . . ., |Ξ|} and Θ = {1, . . ., |Θ|}. In the graph of Figure 4, both Ξ and Θ have two elements, depicted by two distinct edge and vertex styles. For a marked graph G, on a finite or countably infinite vertex set, and adjacent vertices u and v in G, we define G(u, v) to be the pair (ξ_G(u, v), (G′, v)), where G′ is the connected component of v in the graph obtained from G by removing the edge between u and v. Similarly, for h ≥ 0, we define G_h(u, v) to be the pair (ξ_G(u, v), (G′, v)_h), i.e., with the rooted component truncated up to depth h.

A. Preprocessing
Recall from Section III that the input graph G^(n) is given in its edge list form, i.e., is represented by its vertex mark sequence θ^(n) and a list of marked edges. First, we go through a preprocessing step to convert this into the "neighbor list" representation of G^(n), which consists of the following components: 1) the vertex mark sequence θ^(n); 2) for each vertex v ∈ [n], the list of its neighbors γ^(n)_{v,1} < γ^(n)_{v,2} < · · ·, sorted in increasing order; and 3) for each vertex v and index i, the pair (x^(n)_{v,i}, x′^(n)_{v,i}) denoting the two edge marks corresponding to the edge connecting v to γ^(n)_{v,i}. For instance, for the graph in Figure 4, the neighbor list for vertex 5 is γ^(16)_5 = (13, 14), together with the corresponding pairs of edge marks.
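The preprocessing step described above can be sketched as follows (the tuple conventions are ours): each undirected marked edge in the input list contributes one entry to the neighbor list of each endpoint, and each vertex's list is then sorted by neighbor index, which costs O(m^(n) log n) overall:

```python
def to_neighbor_list(n, edge_list):
    """Convert the edge-list form (list of (v, w, x, xp), with mark x towards v
    and mark xp towards w) into the neighbor-list form: for each vertex, its
    neighbors in increasing order together with the two marks of each edge.
    The vertex mark sequence theta carries over unchanged and is kept aside."""
    nbr = {v: [] for v in range(1, n + 1)}
    for v, w, x, xp in edge_list:
        nbr[v].append((w, x, xp))   # (neighbor, mark towards v, mark towards w)
        nbr[w].append((v, xp, x))   # same edge, seen from the other endpoint
    for v in nbr:
        nbr[v].sort()               # neighbors in increasing order
    return nbr
```

Each edge thus appears twice in the output, once per endpoint, with the mark pair flipped accordingly.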

B. Definition of Edge Types
Our algorithm uses two integers h ≥ 1 and δ ≥ 1 as hyperparameters, which we assume are fixed. With this, let F^(δ,h) be the set consisting of all t ∈ Ξ × T^{h−1}_* such that in the subtree component of t, i.e., t[s], the degree of the root is strictly less than δ, and the degree of all other vertices is at most δ. Moreover, for x ∈ Ξ, let ⋆_x be fictitious distinct elements not present in F^(δ,h), and define F̃^(δ,h) := F^(δ,h) ∪ {⋆_x : x ∈ Ξ}. Note that the ⋆_x for x ∈ Ξ are auxiliary objects, and are not of the form of a pair of a mark and a rooted tree. Also, define C^(δ,h) := F^(δ,h) × F^(δ,h) and C̃^(δ,h) := F̃^(δ,h) × F̃^(δ,h).
For adjacent vertices v ∼_{G^(n)} w in G^(n), we define t^(n)_h(v, w) := G^(n)_{h−1}(v, w), which is indeed a member of Ξ × T^{h−1}_*. Moreover, define t^(n)_{h,δ}(v, w) := t^(n)_h(v, w) if t^(n)_h(v, w) ∈ F^(δ,h), deg_{G^(n)}(v) ≤ δ, and deg_{G^(n)}(w) ≤ δ; otherwise, define t^(n)_{h,δ}(v, w) := ⋆_x with x = ξ_{G^(n)}(v, w). When h > 1, membership of t^(n)_h(v, w) in F^(δ,h) already constrains the degrees involved; however, this is not the case for h = 1. This is why the two conditions deg_{G^(n)}(v) ≤ δ and deg_{G^(n)}(w) ≤ δ in the above definition are not degenerate. In fact, the degree at the root for any [T, o] ∈ T^0_* is zero, meaning that F^(δ,1) = Ξ × T^0_* for any δ ≥ 1. Observe that, by definition, t^(n)_{h,δ}(v, w) ∈ F̃^(δ,h). Furthermore, we define ψ^(n)_{h,δ}(v, w) := (t^(n)_{h,δ}(w, v), t^(n)_{h,δ}(v, w)) and call it the "type" of the edge (v, w). Note that we have ψ^(n)_{h,δ}(w, v) = ψ^(n)_{h,δ}(v, w)¯, where (a, b)¯ := (b, a). The notion of edge types plays a crucial role in our compression algorithm.

C. Finding Edge Types
Our next step is to find ψ^(n)_{h,δ}(v, w) for all adjacent vertices v and w in G^(n). We use a message passing algorithm to do this. The algorithm returns an array c = (c_{v,i} : v ∈ [n], i ∈ [d^(n)_v]), where d^(n)_v denotes the degree of vertex v and c_{v,i} is an integer pair representation of the type of the edge connecting v to its i-th neighbor γ^(n)_{v,i}. In addition to the array c, our algorithm returns the following: 1) TCount: the range of integers showing up in c, so that for v ∈ [n] and i ∈ [d^(n)_v], c_{v,i} is a pair of integers, each in the range {1, . . ., TCount}. One can show that there is a bijection from a subset T^(n) of F̃^(δ,h) to the set of integers {1, . . ., TCount}, i.e., there exists a one-to-one mapping J_n : T^(n) → {1, . . ., TCount}. It can be shown that TCount ≤ 4m^(n); 2) an array TIsStar = (TIsStar(i) : 1 ≤ i ≤ TCount), where TIsStar(i) for 1 ≤ i ≤ TCount is 1 if the member of F̃^(δ,h) corresponding to i, i.e., J_n^{−1}(i), is of the form ⋆_x for some x ∈ Ξ, and 0 otherwise; 3) an array TMark = (TMark(i) : 1 ≤ i ≤ TCount), where for 1 ≤ i ≤ TCount, with t = J_n^{−1}(i) being the member of F̃^(δ,h) corresponding to i, if t is of the form ⋆_x for some x ∈ Ξ, we have TMark(i) = x; otherwise, we have TMark(i) = t[m], i.e., the mark component associated to t ∈ F^(δ,h). We would ideally expect the set T^(n) above to consist only of t^(n)_{h,δ}(v, w) for v ∼_{G^(n)} w. However, our algorithm encounters some extra objects in the process of exploring edge types, and these are precisely t^(n)_h(v, w) for (v, w) ∈ B^(n). Figure 5 illustrates the result of running this algorithm on the graph of Figure 4. To summarize:
Proposition 1: There is an algorithm that, given a simple marked graph G^(n) with m^(n) edges, and parameters δ ≥ 1 and h ≥ 1, finds integer representations of ψ^(n)_{h,δ}(v, w) for all adjacent vertices v and w, together with TCount, TIsStar, and TMark as explained above. The algorithm runs in time quasi-linear in m^(n) + n for fixed h, δ, and mark sets; see [17] for the precise bound.
After we find the quantities TCount, c, TIsStar, and TMark, we write them to the output bit sequence so that the decoder can use them later during the decompression phase.
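The message-passing idea can be sketched as follows. This is an illustrative reimplementation under our own conventions, not the authors' code: the δ-truncation to star types ⋆_x is omitted for brevity, and depth-h types are computed in h rounds of non-backtracking message passing, with hash-consing ("interning") assigning small integer ids so that each round is linear in the number of directed edges:

```python
def edge_type_ids(nbr, tau, h):
    """For every directed edge (v, w), compute an integer id representing the
    depth-h non-backtracking type of (v, w): the mark towards w together with
    the recursively-computed types of the edges leaving w, excluding the one
    back to v.  nbr[v]: sorted list of (w, x, xp) with x towards v, xp towards w.
    Returns (ids, number_of_distinct_keys_interned)."""
    msgs, table = {}, {}
    def intern(key):                      # hash-consing: one small int per key
        if key not in table:
            table[key] = len(table) + 1
        return table[key]
    for rnd in range(h):
        new = {}
        for v in nbr:
            for (w, _x, xp) in nbr[v]:
                if rnd == 0:
                    key = (xp, tau[w])    # depth-1: edge mark plus root mark
                else:
                    kids = tuple(sorted(msgs[(w, u)] for (u, _, _) in nbr[w]
                                        if u != v))   # non-backtracking
                    key = (xp, tau[w], kids)
                new[(v, w)] = intern(key)
        msgs = new
    return msgs, len(table)
```

Because the multiset of child ids is sorted, two directed edges whose hanging subtrees are isomorphic receive the same id, which is the property the compression scheme relies on.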

D. Encoding Star Vertices
We separately encode the edges in G^(n) which have a type of the form (⋆_x, ⋆_{x′}) for some x, x′ ∈ Ξ. But before that, we first encode V^(n)_⋆, which is defined to be the set of vertices v ∈ [n] such that for at least one of their neighbors w, the type ψ^(n)_{h,δ}(v, w) is of this form. We call an edge (v, w) with type (⋆_x, ⋆_{x′}) for some x, x′ ∈ Ξ a "star edge". Likewise, we call a vertex v ∈ V^(n)_⋆ a "star vertex".
(Figure 5 caption: Due to the symmetry in the graph, only a subset of (v, i) pairs is presented. The set T^(n) together with the mapping J_n are shown; below each element of T^(n), one corresponding pair (v, w) is written, though there might be more than one pair resulting in that element. In this example, B^(n) = {(v, 1) : 2 ≤ v ≤ 6} and TCount = 6. The TMark and TIsStar arrays are also illustrated.)
In order to encode V^(n)_⋆, we first represent it using a bit sequence y = (y_i : i ∈ [n]) of length n, where y_i = 1 if i ∈ V^(n)_⋆ and y_i = 0 otherwise. Then we encode this bit sequence.
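One standard way to losslessly encode such a bit sequence is enumerative coding: store the number of ones k and the index of the sequence among all C(n, k) sequences of that weight, which costs roughly log2 C(n, k) bits plus the cost of describing k. The text does not specify the coder used, so the following is only an illustrative sketch (names ours), based on the combinatorial number system:

```python
from math import comb

def bitseq_to_index(y):
    """Map a binary sequence y to (n, k, idx), with idx in [0, C(n, k))."""
    n, k = len(y), sum(y)
    ones = [i for i, b in enumerate(y) if b]
    idx = sum(comb(p, j + 1) for j, p in enumerate(ones))
    return n, k, idx

def index_to_bitseq(n, k, idx):
    """Invert bitseq_to_index: greedily recover the one-positions, largest first."""
    y = [0] * n
    p = n - 1
    for j in reversed(range(k)):
        while comb(p, j + 1) > idx:
            p -= 1
        y[p] = 1
        idx -= comb(p, j + 1)
        p -= 1
    return y
```

When V^(n)_⋆ is small relative to n, as is typical for star vertices, this is far shorter than the raw n-bit representation.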

E. Encoding Star Edges
Next, we encode the star edges, i.e., those edges with type (⋆_x, ⋆_{x′}) for some x, x′ ∈ Ξ. Note that both endpoints of a star edge are in V^(n)_⋆. Motivated by this, for each pair of edge marks x, x′ ∈ Ξ, we go through the vertices v in V^(n)_⋆ in increasing order and encode the edges connected to v with type (⋆_x, ⋆_{x′}) by writing a bit with value 1 to the output together with the index of the other endpoint. When we finish checking all the neighbors of v, we write a bit with value 0 to the output to ensure the prefix condition.
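The star-edge pass for one fixed mark pair can be sketched as follows; this is an illustrative rendering of the scheme above (names ours), with symbols emitted as tuples rather than actual bits, and each edge recorded only the first time it is met:

```python
def encode_star_edges(star_vertices, star_edges):
    """Go through star vertices in increasing order; for each incident star
    edge (of the fixed mark pair under consideration) not yet emitted, write
    (1, other_endpoint); after finishing a vertex, write (0,) so the decoder
    knows the list for that vertex has ended (prefix condition)."""
    out = []
    done = set()
    for v in sorted(star_vertices):
        for (a, b) in sorted(star_edges):
            if (a, b) in done:
                continue                 # already emitted from the other side
            if a == v or b == v:
                w = b if a == v else a
                out.append((1, w))
                done.add((a, b))
        out.append((0,))                 # terminator for vertex v
    return out
```

A real coder would binary-encode the endpoint indices; the terminator-bit structure is what makes the stream decodable without lengths.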

F. Encoding Vertex Types
For a vertex v ∈ [n] and t, t′ ∈ F^(δ,h), let D^(n)_{t,t′}(v) denote the number of neighbors w of v such that t^(n)_{h,δ}(v, w) = t and t^(n)_{h,δ}(w, v) = t′. For instance, in the graph of Figure 4, with h = 2 and δ = 4, we have D^(16)_{t,t′}(1) = 0 for all t, t′ ∈ F^(δ,h), since all edges connected to vertex v = 1 are star edges. Since t, t′ ∈ F^(δ,h), recalling (8), if D^(n)_{t,t′}(v) > 0 for some t, t′ ∈ F^(δ,h), then we must have Σ_{t,t′ ∈ F^(δ,h)} D^(n)_{t,t′}(v) ≤ deg_{G^(n)}(v) ≤ δ. Note that if for a vertex v ∈ [n] we have D^(n)_{t,t′}(v) = 0 for all t, t′ ∈ F^(δ,h), then the above inequality automatically holds for that vertex. In particular, (13) implies that D^(n)(v) := (D^(n)_{t,t′}(v) : t, t′ ∈ F^(δ,h)) takes values in a finite set. We define the "type" of a vertex v ∈ [n] to be the pair (θ^(n)_v, D^(n)(v)). The next step in our compression algorithm is to encode vertex types, i.e., to jointly encode the sequences θ^(n) and (D^(n)(v) : v ∈ [n]). In order to do so, we construct a sequence y = (y_v : v ∈ [n]) such that y_v = y_w if and only if vertices v and w have the same type. Then, we encode the sequence y. For instance, in the graph of Figure 4, with parameters h = 2 and δ = 4, the types (θ^(16)_v, D^(16)(v)) are the same for 2 ≤ v ≤ 6, and therefore the sequence y in this example will be of the form y = (1, 2, 2, 2, 2, 2, . . .).
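Forming the vertex-type sequence y described in this subsection can be sketched as follows (an illustrative rendering under our own conventions; labels are assigned in order of first appearance):

```python
from collections import Counter

def vertex_type_sequence(nbr, tau, type_of_edge):
    """Compute y = (y_v): vertices get equal labels iff they have the same
    vertex mark and the same counts D_{t,t'}(v) of incident non-star edge type
    pairs.  type_of_edge[(v, w)]: the (t, t') pair for the directed edge, or
    None if (v, w) is a star edge (star edges do not contribute to D)."""
    labels, y = {}, []
    for v in sorted(nbr):
        D = Counter(type_of_edge[(v, w)] for (w, *_rest) in nbr[v]
                    if type_of_edge[(v, w)] is not None)
        key = (tau[v], frozenset(D.items()))   # hashable form of the type
        if key not in labels:
            labels[key] = len(labels) + 1      # new label on first appearance
        y.append(labels[key])
    return y
```

Vertices sharing a type collapse to one symbol, so y can then be compressed as a low-entropy sequence over a small alphabet.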

G. Encoding Partition Graphs
Our next step is to encode the remaining edges in the graph, i.e., those edges which are not star edges. In order to do so, we partition such edges based on their types. This results in a number of unmarked graphs which are encoded separately. More precisely, let E^(n) denote the set of all pairs of edge types that appear in the graph, excluding star edges. For (t, t') ∈ E^(n), let V^(n)_{t,t'} be the set of vertices with at least one edge of type (t, t'), and let I^(n)_{t,t'}(v) be the index of v when the elements of V^(n)_{t,t'} are sorted in increasing order. Figure 6 illustrates these sets for the graph of Figure 4, where again h = 2 and δ = 4.
We construct the "partition graph" G^(n)_{t,t'} associated to (t, t') ∈ E^(n) as follows: the left nodes of G^(n)_{t,t'} are in a one to one correspondence with the vertices in V^(n)_{t,t'}, and its right nodes are in a one to one correspondence with the vertices in V^(n)_{t',t}. For each edge in G^(n) between vertices v and w with type (t, t') as seen from v, we place an edge in G^(n)_{t,t'} between the left node I^(n)_{t,t'}(v) and the right node I^(n)_{t',t}(w). Figure 7 illustrates the partition graphs for the graph of Figure 4 with parameters h = 2 and δ = 4.

Fig. 6. The set E^(n) for the graph in Figure 4 with parameters h = 2 and δ = 4. Also, V^(n)_{t,t'} and I^(n)_{t,t'} for (t, t') ∈ E^(n) are illustrated.

Fig. 7. Partition graphs for the graph of Figure 4 with parameters h = 2 and δ = 4. See Figure 5 for edge types, and Figure 6 for V^(n)_{t,t'} and I^(n)_{t,t'}.

The next step in our compression algorithm is to encode these partition graphs. Observe that for (t, t') ∈ E^(n) with t ≠ t', the partition graphs G^(n)_{t,t'} and G^(n)_{t',t} are basically the same: one is obtained from the other by swapping the left and the right nodes (when t = t', G^(n)_{t,t} is naturally viewed as a simple unmarked graph). Motivated by this, let E^(n)_≤ be the set of (t, t') ∈ E^(n) such that, with t̄ = J_n(t) and t̄' = J_n(t') being the integer representations of t and t' respectively, we have t̄ ≤ t̄'. With the above discussion, we may only consider those partition graphs G^(n)_{t,t'} with (t, t') ∈ E^(n)_≤. Note that each edge in G^(n) which is not a star edge appears in precisely one of the partition graphs {G^(n)_{t,t'} : (t, t') ∈ E^(n)_≤}, which justifies our terminology "partition graph". As an example, for the graph of Figure 4 with parameters h = 2 and δ = 4, from Figure 5 we realize that J_n(t) = 3 < 4 = J_n(t') for the two non-star edge types appearing there, which determines E^(n)_≤ as the two type pairs shown in Figure 6. Before discussing how to encode these partition graphs, we explain how this can be helpful in reconstructing the original marked graph G^(n) at the decoder. Recall from Section IV-F
that we have already encoded θ^(n), the sequence of vertex marks. On the other hand, in Section IV-E, we encoded all the star edges. Hence, it remains to encode the remaining edges, which is precisely what is being done in this step. As we discussed above, every edge in G^(n) that is not a star edge appears in exactly one partition graph. On the other hand, for (t, t') ∈ E^(n)_≤, if there is an edge between the left node i and the right node j in G^(n)_{t,t'}, there must be a corresponding edge in G^(n) between the vertices (I^(n)_{t,t'})^{-1}(i) and (I^(n)_{t',t})^{-1}(j). Hence, we can reconstruct G^(n) given the partition graphs and the previous steps. Now we explain how we encode the partition graphs. Recall from Section IV-F above that we have transmitted the sequence D^(n) to the decoder. We claim that this is enough for the decoder to infer the degree sequences of the partition graphs. Note that, by construction, the degree of a left node 1 ≤ i ≤ N^(n)_{t,t'} in G^(n)_{t,t'} is D^(n)_{t,t'}((I^(n)_{t,t'})^{-1}(i)), and similarly for the right nodes, where N^(n)_{t,t'} := |V^(n)_{t,t'}|. Motivated by this, it suffices to design compression schemes to encode simple unmarked graphs and simple unmarked bipartite graphs with given degree sequences, as we discuss next. Note that these two schemes can be of independent interest.
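To make the construction concrete, here is a small sketch of building partition graphs from typed edges. The representation (edges given as tuples (v, w, t, t') with t the type seen from v) and all function names are our own assumptions, not the paper's interface:

```python
from collections import defaultdict

def build_partition_graphs(typed_edges):
    """Partition non-star edges by type and relabel endpoints locally.

    typed_edges: list of (v, w, t, tp), where t is the type of the edge as
    seen from v and tp as seen from w.  Returns the local index maps
    (playing the role of I_{t,t'}) and, for each type pair with t <= tp,
    the list of relabeled edges (the partition graph)."""
    members = defaultdict(set)            # (t, t') -> vertices with such an edge
    for v, w, t, tp in typed_edges:
        members[(t, tp)].add(v)
        members[(tp, t)].add(w)
    # I_{t,t'}: rank of each vertex within V_{t,t'}, sorted increasingly
    index = {k: {v: i + 1 for i, v in enumerate(sorted(vs))}
             for k, vs in members.items()}
    graphs = defaultdict(list)
    for v, w, t, tp in typed_edges:
        key = (min(t, tp), max(t, tp))    # keep only one of (t,t'), (t',t)
        if (t, tp) == key:
            graphs[key].append((index[(t, tp)][v], index[(tp, t)][w]))
        else:
            graphs[key].append((index[(tp, t)][w], index[(t, tp)][v]))
    return index, graphs
```

For example, two edges of type ('s', 't') incident to the same right-side vertex produce a bipartite partition graph with two left nodes and one right node.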
Proposition 2: There exists a compression algorithm that maps a bipartite graph G with n_l left vertices, n_r right vertices, left degree sequence a = (a_1, ..., a_{n_l}), and right degree sequence b = (b_1, ..., b_{n_r}) to an integer f(G) satisfying 0 ≤ f(G) ≤ S!/(∏_{i=1}^{n_l} a_i! ∏_{j=1}^{n_r} b_j!), where S := Σ_{i=1}^{n_l} a_i = Σ_{j=1}^{n_r} b_j. Moreover, there exists a decompression algorithm g such that for every such bipartite graph G, we have g(f(G)) = G. Further, if for some δ ≤ n_r we have a_i ≤ δ for all 1 ≤ i ≤ n_l, the time and memory complexities of the compression algorithm are O(δ ñ log^4 ñ log log ñ) and O(δ ñ log ñ), respectively, where ñ = max{n_l, n_r}. Furthermore, the decompression algorithm has the same time and memory complexities.
See [17] for the details and proof of the above proposition. The term S!/(∏_{i=1}^{n_l} a_i! ∏_{j=1}^{n_r} b_j!) is related to the number of bipartite graphs with the given left and right degree sequences. The idea is to find and encode the index of the input bipartite graph G among the set of all bipartite graphs with the same degree sequences (a, b). This index is calculated with respect to a certain lexicographic ordering of all such bipartite graphs. Note that, due to (14), all the degrees in a partition graph G^(n)_{t,t'} with (t, t') ∈ E^(n)_≤ and t ≠ t' are bounded by the parameter δ, so δ enters the complexity analysis.
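The counting idea behind Proposition 2 can be illustrated with a brute-force sketch: enumerate all simple bipartite graphs with the given degree sequences in a fixed lexicographic order, and use the position of G in that list as its code. This is exponentially slower than the near-linear algorithm of the proposition and is only meant to show the principle; note that S!/(∏ a_i! ∏ b_j!) upper-bounds the number of such graphs:

```python
from itertools import combinations

def all_bipartite(a, b):
    """All simple bipartite graphs with left degrees a and right degrees b,
    as sorted edge tuples, in a fixed lexicographic order.  Brute force:
    only usable on tiny instances."""
    n_l, n_r = len(a), len(b)
    def rec(i, rem_b, acc):
        if i == n_l:
            if all(r == 0 for r in rem_b):
                yield tuple(acc)
            return
        for nbrs in combinations(range(n_r), a[i]):
            if all(rem_b[j] >= 1 for j in nbrs):
                for j in nbrs:
                    rem_b[j] -= 1
                yield from rec(i + 1, rem_b, acc + [(i, j) for j in nbrs])
                for j in nbrs:
                    rem_b[j] += 1
    return list(rec(0, list(b), []))

def graph_index(edges, a, b):
    """Code of G: its position among all graphs with the same (a, b)."""
    return all_bipartite(a, b).index(tuple(sorted(edges)))
```

For a = (1, 1) and b = (1, 1) there are exactly 2 such graphs, matching the bound 2!/(1·1·1·1) = 2; the decoder inverts `graph_index` by indexing into the same ordered list.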
We have a similar compression algorithm for simple unmarked graphs as follows.
Proposition 3: There exists a compression algorithm that maps a simple unmarked graph G with ñ ≥ 2 vertices and degree sequence a = (a_1, ..., a_ñ) to a sequence of integers f^(ñ)_a(G) = (f^(ñ)_{a,i}(G) : 1 ≤ i ≤ 16ñ/log^2 ñ), where S := Σ_{v=1}^{ñ} a_v. Also, for 1 ≤ i ≤ 16ñ/log^2 ñ, we have 0 ≤ f^(ñ)_{a,i}(G) ≤ S. Moreover, there exists a decompression algorithm with a decompression map g^(ñ)_a such that for any such graph G, g^(ñ)_a(f^(ñ)_a(G)) = G. On the other hand, if for some δ ≤ ñ we have a_i ≤ δ for all 1 ≤ i ≤ ñ, the time and memory complexities of the compression algorithm are O(δ ñ log^4 ñ log log ñ) and O(δ ñ log ñ), respectively. Furthermore, with the same assumption, the time and memory complexities of the decompression algorithm are O(δ ñ log^5 ñ log log ñ) and O(δ ñ log ñ), respectively.
See [17] for the details and proof of the above proposition. The term (S − 1)!!/(∏_{v=1}^{ñ} a_v!) is related to the number of simple unmarked graphs with degree sequence a. The sequence f^(ñ)_a(G) ensures efficiency during the decompression phase. Due to (14), all the degrees in a partition graph G^(n)_{t,t} for (t, t) ∈ E^(n)_≤ are bounded by the parameter δ, so δ enters the complexity analysis. Combining the complexity estimates in this section gives Theorem 3.

V. EXPERIMENTS
In this section, we illustrate the performance of our algorithm. First, in Section V-A, we consider some synthetic data. Since the local weak limit is known for such data, we can compare the performance of the algorithm with what Theorem 3 predicts. Then, in Section V-B, we consider some real-world datasets which do not have many short cycles, i.e., roughly speaking, they are locally tree-like. The motivation for considering such datasets is that the optimality guarantee of Theorem 3 requires the local weak limit to be unimodular and supported on rooted trees. Finally, in Section V-C, we test the performance of our algorithm on some real-world social network datasets. Social networks often have short cycles and are not locally tree-like, so the assumptions of Theorem 3 do not hold for such datasets and the theoretical guarantee no longer exists. Nonetheless, we see that our algorithm has performance comparable to the state of the art in most cases.

A. Synthetic Data
We generate a random graph G^(n) on n vertices as follows: for each vertex v ∈ [n], we generate a Poisson random variable d_v with mean 3, pick d_v vertices uniformly at random from [n]\{v} without replacement, and connect v to each of these d_v chosen vertices. We do this for all vertices 1 ≤ v ≤ n. If for two vertices v ≠ w, v decides to connect to w and w also decides to connect to v, we treat this as a single edge between v and w. Therefore, the resulting graph is simple. We also add a mark from the vertex mark set {1, 2} to each vertex independently. Moreover, we add two independent edge marks from the edge mark set {1, 2} to each edge, one in each direction. The choice of edge and vertex marks is done independently throughout the graph, conditioned on the unmarked realization of the graph. It can be seen that the local weak limit of this model is a Poisson Galton-Watson tree with mean degree 6 and independent vertex and edge marks. Since the limit distribution is completely characterized by the depth-1 neighborhood distribution at the root, we choose h = 1 and run the algorithm with different values of δ. See Figure 8 for the behavior of l_n := (nats(f^(n)_{h,δ}(G^(n))) − m_n log n)/n, where m_n is the number of edges in G^(n). As we see, for large values of δ, l_n converges to the marked BC entropy of the limit as n gets large, which is consistent with Theorem 3. An interesting observation from Figure 8 is that when n is small, smaller values of δ might result in smaller l_n. Note that this does not contradict Theorem 3, since Theorem 3 only predicts the asymptotic behavior of l_n. When n is small, increasing δ might increase the overhead of certain parts of the compressed sequence, such as the vertex types discussed in Section IV-F, thereby increasing l_n.
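A sketch of this generation procedure (our own implementation; the Poisson sampler uses Knuth's multiplicative method, and the parameter and function names are ours):

```python
import math
import random

def generate_marked_graph(n, mean=3.0, seed=0):
    """Synthetic model sketch: every vertex proposes Poisson(mean) neighbors;
    coinciding proposals collapse to one simple edge; vertex marks and
    per-direction edge marks are i.i.d. uniform on {1, 2}."""
    rng = random.Random(seed)

    def poisson(lam):
        # Knuth's multiplicative method; fine for small lam
        limit, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= rng.random()
            if p <= limit:
                return k
            k += 1

    edges = set()
    for v in range(n):
        d = min(poisson(mean), n - 1)
        for w in rng.sample([u for u in range(n) if u != v], d):
            edges.add((min(v, w), max(v, w)))   # proposals v->w, w->v merge
    vmark = {v: rng.choice([1, 2]) for v in range(n)}
    emark = {e: (rng.choice([1, 2]), rng.choice([1, 2])) for e in edges}
    return edges, vmark, emark
```

With mean 3 proposals per vertex, the expected degree is close to 6, matching the Poisson Galton-Watson limit described above.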

B. Locally Tree-Like Data
Recall from Theorem 3 that our theoretical guarantee holds when the limit μ is supported on marked rooted trees. Motivated by this, we test our algorithm on the following two real-world locally tree-like datasets. We also compare the compression results with those reported in [25], which, to the best of our knowledge, are the best results in the literature for these datasets. These datasets are collected from [24].
• roadnet-CA: the graph of the road network of California, consisting of 1,965,206 vertices and 5,533,214 edges [25].
• roadnet-PA: the graph of the road network of Pennsylvania [25].
To measure whether these graphs are locally tree-like, we use a measure called the average clustering coefficient. For a simple undirected graph G on the vertex set [n], the clustering coefficient of a node v with deg_G(v) > 1 in G is defined as C_v := |{{u, w} : u ∼_G v, w ∼_G v, u ∼_G w}| / (deg_G(v) choose 2), which is obtained by looking at pairs of neighbors of v which can form a triangle, and calculating the fraction of them which actually form a triangle. If deg_G(v) ≤ 1, we define C_v := 0. The average clustering coefficient of G is defined to be C := (1/n) Σ_{v=1}^n C_v. Note that C ∈ [0, 1], and the larger the clustering coefficient of a graph, the more short cycles it has, and so the less tree-like it is. The average clustering coefficients of both the roadnet-CA and the roadnet-PA datasets are 0.046, which is relatively small. Following the convention in the literature, we report the compression ratios in bits per link (BPL). Table I compares the best compression ratios of our algorithm with those in [25]; ours are more than 40% better. As we can see from Table I, the value of δ in the pair (h, δ) which yields the best result is small. Comparing this to Figure 8 and the discussion in Section V-A, one possible explanation is that n is small and hence we are not yet in the asymptotic regime. To address this, in Table II, we force δ to be chosen large enough that at most 20% of the edges become star edges, and find the value of (h, δ) which achieves the best compression ratio under this constraint. As seen in Table II, even in this case, our compression ratios are better than those reported in [25].
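The average clustering coefficient described above can be computed directly from its definition; this sketch assumes an adjacency-set representation:

```python
def average_clustering(adj):
    """Average clustering coefficient of a simple undirected graph.

    adj maps each vertex to the set of its neighbors; C_v is the fraction
    of neighbor pairs of v that are themselves adjacent, with C_v = 0
    whenever deg(v) <= 1."""
    total = 0.0
    for v, nb in adj.items():
        d = len(nb)
        if d <= 1:
            continue                    # C_v = 0 by convention
        closed = sum(1 for u in nb for w in nb if u < w and w in adj[u])
        total += closed / (d * (d - 1) / 2)   # divide by (deg choose 2)
    return total / len(adj)
```

A triangle has average clustering 1, while a path (no triangles at all) has average clustering 0, consistent with "larger C means more short cycles".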

C. Social Networks
We now consider the social network graphs summarized in Table III. These are accessed via the Laboratory of Web Algorithms website 1 [12], [13]. As reported in Table III, the average clustering coefficients for these datasets are far from zero, meaning that they are not locally tree-like; hence, they do not satisfy the conditions of Theorem 3. We represent the directed edges by appropriately employing edge marks. If there is a directed edge from a node u to a node v in the original graph, but none from node v to node u, we model this by a single marked edge between nodes u and v, with edge mark 1 towards node v and edge mark 0 towards node u. Furthermore, if there is a directed edge from node u towards node v and another from node v towards node u, we model this by a single marked edge between nodes u and v with edge marks 1 towards both endpoints.
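This directed-to-marked conversion is straightforward to implement; the following sketch (our own representation: arcs as ordered pairs, marks as a pair ordered toward the smaller and larger endpoint, respectively) illustrates it:

```python
def directed_to_marked(arcs):
    """Collapse directed edges into undirected marked edges.

    arcs: set of ordered pairs (u, v), meaning a directed edge u -> v.
    Each marked edge is keyed by its sorted endpoints (a, b) and carries
    a pair of marks (toward a, toward b): 1 if the directed edge pointing
    to that endpoint exists, 0 otherwise."""
    arcs = set(arcs)
    marked = {}
    for u, v in arcs:
        a, b = min(u, v), max(u, v)
        marked[(a, b)] = (1 if (b, a) in arcs else 0,
                          1 if (a, b) in arcs else 0)
    return marked
```

A one-way arc 1 → 2 becomes the marked edge (1, 2) with marks (0, 1), while a reciprocated pair becomes a single edge with marks (1, 1), exactly as described above.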
In Table IV, we report the best compression ratios over the values of h and δ, and compare them with those in [12], which, to the best of our knowledge, are the best in the literature. We have also reported in Table IV the corresponding running times. We see that the value of δ for the (h, δ) pair that optimizes the compression ratio is small. Similar to our discussion in Section V-B, one possible explanation is that we are not yet in the asymptotic regime. In order to address this, similar to our approach in Section V-B, we look for the values of (h, δ) which result in at most 40% of the edges in the graph becoming star edges, and find the best compression ratio among such (h, δ) pairs. This is reported in Table V.
1 http://law.di.unimi.it/
Table III shows that all the social graphs have relatively large average clustering coefficients, so they are not locally tree-like. Since the method in [12] is tailored for social graphs, whereas ours is universal, we should not expect our results to outperform those of [12]. However, as seen in Tables IV and V, they are comparable in most cases, and our compression ratios even outperform those of [12] in some cases.

Fig. 2. The universal cover UC_1(G) for the simple marked graph G depicted in (a) is illustrated in (b). By definition, every vertex in UC_1(G) corresponds to a non-backtracking walk in G starting at vertex 1. The walk associated to every vertex in UC_1(G) is shown adjacent to that vertex in (b). Note that although G is a finite graph in this example, UC_1(G) is an infinite graph. Moreover, the universal cover is always a tree.

Fig. 3. With G being the graph from Figure 1, (a) illustrates U_2(G), which is a probability distribution on rooted marked graphs of depth at most 2, and (b) depicts U(G), which is a probability distribution on Ḡ_*.
and h, δ, and the cardinalities of the edge and vertex mark sets are all constants not growing with n, then the time and memory complexities of our compression algorithm are O(n log^4 n log log n) and O(n log n), respectively. Moreover, the time and memory complexities of our decompression algorithm are O(n log^5 n log log n) and O(n log n), respectively.
we call the component of g its mark component and denote it by g[m]. Moreover, we call the Ḡ_*

Fig. 5. The result of running the message passing algorithm to find edge types for the graph of Figure 4 with parameters h = 2 and δ = 4. The array c is depicted in the left table. Due to the symmetry in the graph, we have only presented a subset of (v, i) pairs; e.g., v = 1, i ∈ {2, ..., 5} is identical to v = 1, i = 1. The set T^(n) together with the mapping J_n are shown.

The partition graphs G^(n)_{t,t'} and G^(n)_{t',t} are basically the same, and one is obtained from the other by flipping the right and the left nodes.

Fig. 8. Synthetic data results. Note that for large δ the asymptotic performance converges to the actual BC entropy.
Recalling our discussion in Section IV-C, if t̄ = J_n(t) and t̄' = J_n(t') are the integer representations of t and t' respectively, then t[m] = TMark(t̄) and t'[m] = TMark(t̄'). Consequently, for (t, t') ∈ E^(n)_≤, the decoder can recover the marks of the edge between (I^(n)_{t,t'})^{-1}(i) and (I^(n)_{t',t})^{-1}(j) in the decoded graph. It is easy to see that, since we have already encoded (D^(n)(v) : v ∈ [n]) as in Section IV-F, the decoder can reconstruct I^(n)_{t,t'}.

TABLE I. COMPARING COMPRESSION RATIOS OF OUR ALGORITHM WITH THOSE IN [25] FOR ROAD NETWORKS. IN THE THIRD COLUMN, THE BEST RATIO OF OUR ALGORITHM TOGETHER WITH THE RELATIVE IMPROVEMENT OVER THE BEST RESULTS IN THE LITERATURE ARE GIVEN. HERE, BPL STANDS FOR BITS PER LINK. IN THE FOURTH COLUMN, THE CORRESPONDING VALUES OF h AND δ ARE REPORTED. IN THE FIFTH COLUMN, THE CORRESPONDING ENCODING/DECODING TIMES IN SECONDS ARE GIVEN.

TABLE II. COMPARING THE COMPRESSION RATIOS OF OUR ALGORITHM WITH THOSE IN [25], ASSUMING THAT THE VALUE OF δ IS CHOSEN SO THAT AT MOST 20% OF THE EDGES ARE ALLOWED TO BECOME STAR EDGES. THE STRUCTURE OF THIS TABLE IS SIMILAR TO THAT OF TABLE I. AS WE CAN SEE, EVEN IN THIS CASE, OUR COMPRESSION RATIOS ARE BETTER COMPARED TO THOSE IN [25].

TABLE III. SOCIAL NETWORK DATASETS AND THEIR PROPERTIES.

TABLE IV. COMPARING THE COMPRESSION RATIOS OF OUR ALGORITHM WITH THOSE IN [12] FOR SOCIAL NETWORKS. WE REPORT THE BEST COMPRESSION RATIO OF OUR ALGORITHM IN THE THIRD COLUMN, TOGETHER WITH A RELATIVE COMPARISON WITH THE RESULTS IN [12]. THE CORRESPONDING VALUE OF (h, δ) AS WELL AS THE ENCODING/DECODING TIMES ARE ALSO PROVIDED.

TABLE V. BEST COMPRESSION RATIOS OF OUR ALGORITHM SUBJECT TO CHOOSING (h, δ) SO THAT AT MOST 40% OF THE EDGES IN THE GRAPH ARE STAR EDGES. THE VALUES FROM [12] ARE ALSO PRESENTED. MOREOVER, THE VALUES OF (h, δ) ACHIEVING THE REPORTED RESULTS, TOGETHER WITH ENCODING/DECODING TIMES, ARE PRESENTED.