Partition-Merge: Distributed Inference and Modularity Optimization

This paper presents a novel meta-algorithm, Partition-Merge (PM), which takes existing centralized algorithms for graph computation and makes them distributed and faster. In a nutshell, PM divides the graph into small subgraphs using our novel randomized partitioning scheme, runs the centralized algorithm on each partition separately, and then stitches the resulting solutions to produce a global solution. We demonstrate the efficiency of the PM algorithm on two popular problems: computation of Maximum A Posteriori (MAP) assignment in an arbitrary pairwise Markov Random Field (MRF), and modularity optimization for community detection. We show that the resulting distributed algorithms for these problems essentially run in time linear in the number of nodes in the graph, and perform as well as -- or even better than -- the original centralized algorithm as long as the graph has geometric structure. Here we say a graph has geometric structure, or the polynomial growth property, when the number of nodes within distance r of any given node grows no faster than a polynomial function of r. More precisely, if the centralized algorithm is a C-factor approximation with constant C \ge 1, the resulting distributed algorithm is a (C+\delta)-factor approximation for any small \delta>0; but if the centralized algorithm is a non-constant (e.g. logarithmic) factor approximation, then the resulting distributed algorithm becomes a constant factor approximation. For general graphs, we compute explicit bounds on the loss of performance of the resulting distributed algorithm with respect to the centralized algorithm.

In a nutshell, PM divides the graph into small subgraphs, runs the centralized algorithm on each partition separately (which can be done in a distributed or parallel manner), and finally stitches the resulting solutions to produce a global solution. We apply the PM algorithm to two representative classes of problems: the MAP computation in a pairwise MRF and modularity optimization-based graph clustering.
The paper establishes that for any graph that satisfies the polynomial growth property, the resulting distributed PM-based implementation of the original centralized algorithm is a (C + δ)-approximation algorithm whenever the centralized algorithm is a C-approximation algorithm for some constant C ≥ 1. In this expression, δ is a small number that depends on a tunable parameter of the algorithm that affects the size of the induced subgraphs in the partition; the larger the subgraph size, the smaller the δ. More generally, if the centralized algorithm is an α(n)-approximation (with α(n) = o(n)) for a graph of size n, the resulting distributed algorithm becomes a constant factor approximation for graphs with geometric structure! The computational complexity of the algorithm scales linearly in n. Thus, our meta-algorithm can make centralized algorithms faster and distributed, and can even improve their performance.
The algorithm applies to any graph structure, but the strong guarantees on performance stated above require geometric structure. However, it is indeed possible to explicitly evaluate the loss of performance induced by the distributed implementation compared to the centralized algorithm, as stated in Section IV-B.
A cautionary remark is in order. This algorithm is by no means meant to solve all problems in distributed computation. In particular, for dense graphs the algorithm is likely to perform poorly, and such graph structures would require a very different approach. Our meta-algorithm requires that the underlying graph problem be decomposable, or Markovian, in a certain sense. Not all problems have this structure, and problems lacking it require a different way of thinking.
To verify the validity of the PM algorithm, we applied it both to real-world networks and to synthetic graphs of various sizes, comparing it with the original centralized algorithms. For MAP inference, we used the sequential tree-reweighted max-product message passing (TRW-S) [10] with the energy function defined in Section V-B. For modularity optimization, we selected three algorithms ranging from classical to state-of-the-art: Girvan-Newman (GN) [11], Clauset-Newman-Moore (CNM) [12], and Louvain-Method (LM) [13]. Overall, as long as the network is decomposable, like grid graphs, PM considerably reduces the running time while maintaining the performance of the original centralized algorithm. In the case of comparatively dense networks, such as the Barabási-Albert model, PM still produces a decent result, although its efficiency is relatively low.
In our experiments, PM performs particularly well when applied to well-distributed regular networks and when the centralized algorithm has high complexity. Moreover, its performance for MAP inference on graphs with high average degree is outstanding; PM achieves performance similar to the centralized algorithm in less than half the time. We also investigated which value of the partition radius offers the best efficiency for our algorithm. Using a fixed partition radius to study the performance of PM as a function of its value, we conclude that PM generally operates at its best efficiency when the partition radius is close to the average distance between a pair of nodes in the network.

B. RELATED WORK AND OUR CONTRIBUTIONS
The results of this paper, on the one hand, are related to a long list of works on designing distributed algorithms for decomposable problems. On the other hand, the applications of our method to MAP inference in pairwise MRFs and clustering relate our work to a large number of results in these two respective problems. We will only be able to discuss very closely related work here.
We start with the most closely related work on the use of graph partitioning for distributed algorithm design. Such an approach is quite old; see, e.g., [14], [15] and [16] for a detailed account of the approach until 2000. More recently, such decompositions have found a wide variety of applications, including local-property testing [17]. All such decompositions are useful for homogeneous problems, e.g., finding a maximum-size matching or independent set, rather than the heterogeneous maximum-weight variants thereof. To overcome this limitation, Jung and Shah [18] introduced a different (somewhat stronger) notion of decomposition, built upon [15], for minor-excluded graphs. All of these results effectively partition the graph into small subgraphs and then solve the problem inside each small subgraph using exact (dynamic programming) algorithms. While this results in a (1 + ε)-approximation algorithm for any ε > 0 with computation scaling essentially linearly in the graph size (n), the computation constant depends super-exponentially on 1/ε. Therefore, even with ε = 0.1, the algorithms become unmanageable in practice.
As the main contribution of this paper, we first propose a novel graph decomposition scheme for graphs with geometric, or polynomial growth, structure. We then establish that utilizing this decomposition scheme along with any centralized algorithm (instead of dynamic programming) for solving the problem inside each partition leads to performance comparable to (or better than) that of the centralized algorithm on graphs with polynomial growth. Unlike the dynamic programming approach, the resulting distributed algorithm is very fast in practice whenever the centralized algorithm used inside each partition runs fast. We verify the effectiveness of the PM algorithm through experiments, finding that this decomposition scheme produces performance similar to (in some cases better than) that of the centralized algorithm in a very short time. As mentioned above, the result is established for both MAP in pair-wise MRFs and modularity optimization-based clustering. Similar guarantees can be obtained for minor-excluded graphs using the scheme utilized in [18].
In this work, we focus our attention on the two questions mentioned above. However, the method suggested here applies broadly to generic ''optimization'' problems in which (i) the constraints are represented through a graph structure over variable nodes, and (ii) there is a notion of ''default'' assignment to variables that satisfies all the constraints. Indeed, problems such as maximum weight independent set, vertex cover, and MAP inference in a generic pair-wise Markov Random Field are instances of this, and they are NP-complete in general.

C. MAP INFERENCE
Computing the exact Maximum a Posteriori (MAP) solution in a general probabilistic model is an NP-hard problem. Several algorithmic approaches have been developed to obtain approximate solutions for these problems. Most of these methods work by making ''local updates'' to the assignment of the variables. Starting from an initial solution, the algorithms search the space of all possible local changes that can be made to the current solution (also called move space) and choose the best amongst them.
One such algorithm (which has been rediscovered multiple times) is called Iterated Conditional Modes, or ICM for short. Its local update involves selecting (randomly or deterministically) a variable of the problem. Keeping all other variables fixed, ICM assigns the selected variable the value that maximizes the conditional probability of the resulting solution. This process is repeated, selecting other variables, until the probability cannot be increased further. The local step of the algorithm can be seen as performing inference in the smallest possible decomposed subgraph.
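The ICM update described above can be sketched as follows. This is our own generic illustration, not the authors' implementation: we work with log-potentials (the objective H of Section II-B), and the names `log_phi`, `log_psi`, and the adjacency-dict graph encoding are illustrative assumptions.

```python
def icm(adj, log_phi, log_psi, labels, init, max_sweeps=100):
    """Iterated Conditional Modes: repeatedly re-assign each variable the
    value maximizing its conditional objective, all other variables fixed.
    adj: node -> list of neighbours; log_psi is keyed by ordered pair (i, j)."""
    x = dict(init)
    for _ in range(max_sweeps):
        changed = False
        for i in adj:
            # local objective of assigning label s to node i, neighbours fixed
            def local(s):
                return log_phi[i][s] + sum(log_psi[(i, j)][s][x[j]] for j in adj[i])
            best = max(labels, key=local)
            if local(best) > local(x[i]):
                x[i] = best
                changed = True
        if not changed:          # no single-variable move improves: local optimum
            break
    return x
```

Each sweep costs time linear in the number of edges, which is why ICM is often viewed as the extreme case of solving the problem on the smallest possible subgraphs.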
Another family of methods is related to max-product belief propagation (cf. [19] and [20]). In recent years, a sequence of results has suggested an intimate relationship between the max-product algorithm and a natural linear programming relaxation; see, for example, [21]-[25]. Many of these methods can be seen as making local updates to partitions of the dual problem [26], [27].
We also note that the Swendsen-Wang algorithm (SW) [28], a local flipping algorithm, has a philosophy similar to ours in that it repeats a process of randomly partitioning the graph and computing an assignment. However, the graph partitioning of SW is fundamentally different from ours, and there is no known guarantee for the error bound of SW.
In summary, all the approaches thus far with provable guarantees for the local update-based algorithm are primarily for linear or, more generally, convex optimization setup.

D. MODULARITY OPTIMIZATION FOR CLUSTERING
The notion of modularity optimization was introduced by Newman [29] to identify the communities or clusters in a network structure. Since then, it has become quite popular as a metric for finding communities or clusters in various networked data; cf. [13], [30], [31]. The major challenge has been designing an approximation algorithm for modularity optimization (which is computationally hard in general) that can operate in a distributed manner and provide performance guarantees. Such algorithms with provable performance guarantees are known only for a few cases, notably the logarithmic approximation of [32] via a centralized solution.
Our contribution in the context of modularity optimization lies in showing that indeed it is a decomposable problem and therefore admits a distributed and fast approximation algorithm through our approach.

E. ORGANIZATION
The rest of the paper is organized as follows. Section II describes the problem statement and preliminaries. Section III describes our main algorithms, and Section IV presents analyses of our algorithms. The proofs of our main theorems are given in the Appendix (Section VII-A and Section VII-B). Section V and Section VI present the setup and results of an experiment, respectively, and Section VII presents the conclusion.

II. SETUP
A. GRAPHS
Our interest is in processing networked data represented through an undirected graph G = (V, E) with n = |V| vertices and E being the edge set. Let m = |E| be the number of edges. Graphs can be classified structurally in many different ways: trees, planar, minor-excluded, geometric, expanding, and so on. We shall establish results for graphs with geometric structure or polynomial growth, which we define next. A graph G = (V, E) induces a natural ''graph metric'' on vertices V, denoted d_G : V × V → R_+, with d_G(i, j) given by the length of the shortest path between i and j, and defined as ∞ if there is no path between them.
Definition 1 (Graph With Polynomial Growth): We say that a graph G (or a collection of graphs) has polynomial growth of degree (or growth rate) ρ if, for any i ∈ V and r ∈ N,
|B(i, r)| ≤ C · r^ρ,
where B(i, r) = {j ∈ V : d_G(i, j) ≤ r} and C > 0 is a constant. Note that the interesting values of C and ρ are integers in {0, 1, . . . , n}, and they are easy to compute in O(mn) time. Therefore we will assume knowledge of C and ρ for algorithm design. A large class of graph models naturally falls into the class of graphs with polynomial growth. To begin with, the standard d-dimensional regular grid graphs have polynomial growth rate d. More generally, in recent years, in the context of computational geometry and metric embedding, graphs with finite doubling dimension have become a popular object of study [33]. It can be checked that a graph with doubling dimension ρ is also a graph with polynomial growth rate ρ.
Finally, the popular geometric graph model, where nodes are placed arbitrarily in some Euclidean space with some minimum distance separation, and two nodes have an edge between them if they are within a certain finite distance, has a finite polynomial growth rate [34].
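The growth condition above can be checked empirically by measuring ball sizes |B(i, r)| with breadth-first search. The sketch below is our own illustration (the adjacency-dict encoding is an assumption, not the paper's notation):

```python
from collections import deque

def ball_sizes(adj, src, r_max):
    """|B(src, r)| for r = 0..r_max, computed by BFS.
    adj: node -> iterable of neighbours."""
    dist = {src: 0}
    q = deque([src])
    counts = [1] + [0] * r_max        # counts[r] = #nodes at distance exactly r
    while q:
        u = q.popleft()
        if dist[u] == r_max:          # no need to expand beyond the ball
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                counts[dist[v]] += 1
                q.append(v)
    for r in range(1, r_max + 1):     # accumulate: |B(src, r)|
        counts[r] += counts[r - 1]
    return counts
```

On a 2-dimensional grid, for example, the ball sizes grow roughly quadratically in r, matching growth rate ρ = 2.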

B. PAIR-WISE GRAPHICAL MODEL AND MAP
For a pair-wise Markov Random Field (MRF) model defined on a graph G = (V, E), each vertex i ∈ V is associated with a random variable X_i, which we shall assume takes values in a finite alphabet Σ; the edge (i, j) ∈ E represents a form of ''dependence'' between X_i and X_j. More precisely, the joint distribution is given by
P[X = x] ∝ ∏_{i∈V} φ_i(x_i) ∏_{(i,j)∈E} ψ_ij(x_i, x_j),
where φ_i : Σ → R_+ and ψ_ij : Σ² → R_+ are called node and edge potential functions. The question of interest is to find the maximum a posteriori (MAP) assignment x* ∈ Σ^n, i.e.
x* ∈ arg max_{x∈Σ^n} P[X = x]. Equivalently, from the optimization point of view, we wish to find an optimal assignment of the problem
maximize H(x) ≜ Σ_{i∈V} ln φ_i(x_i) + Σ_{(i,j)∈E} ln ψ_ij(x_i, x_j) over x ∈ Σ^n.
For completeness and simplicity of exposition, we assume that the function H is finitely valued over Σ^n. However, the results of this paper extend to hard-constrained problems such as the hardcore or independent set model. We call an algorithm an α-approximation, for α ≥ 1, if it always produces an assignment x such that α H(x) ≥ H(x*).

C. SOCIAL DATA AND CLUSTERING/COMMUNITY DETECTION
Alternatively, in a social setting, vertices of graph G can represent individuals, and edges represent some form of interaction between them. For example, consider a cultural show organized by students at a university with various acts. Let there be n students in total who have participated in one or more acts. Place an edge between two students if they participated in at least one act together. Then the resulting graph represents the interaction between students in terms of acting together. Based on this observed network, the goal is to identify the set of all acts performed and its ''core'' participants. The true answer, which classifies each student/node into the acts in which s/he performed, would lead to partitions of nodes in which a node may belong to multiple partitions. Our interest is in identifying disjoint partitions, which would, in this example, roughly mean identification of ''core'' members of acts.
In general, it is not clear what the appropriate criterion is for selecting a disjoint partition of V given G. Newman [29] proposed the notion of modularity as a criterion. The intuition behind it is that a cluster or community should be as distinct as possible from being ''random''. The modularity of a partition of nodes is defined as the fraction of the edges that fall within the disjoint partitions minus the expected such fraction if edges were distributed at random with the same node degree sequence. Formally, the modularity of a subset S ⊂ V is defined as
M(S) = (1/2m) Σ_{i,j∈S} ( A_ij − d_i d_j / 2m ),
where A_ij = 1 iff (i, j) ∈ E and 0 otherwise, d_i = |{k ∈ V : (i, k) ∈ E}| is the degree of node i ∈ V, and m = |E| represents the total number of edges in G. More generally, the modularity of a partition of V, V = S_1 ∪ · · · ∪ S_ℓ for some 1 ≤ ℓ ≤ n with S_i ∩ S_j = ∅ for i ≠ j, is given by
M(S_1, . . . , S_ℓ) = Σ_{k=1}^{ℓ} M(S_k).    (3)
The modularity optimization approach [29] proposes to identify the community structure as the disjoint partition of V that maximizes the total modularity, defined as per (3), among all possible disjoint partitions of V, with ties broken arbitrarily. The resulting clustering of nodes is the desired answer.
We shall think of clustering as assigning colors to nodes. Specifically, given a coloring χ : V → {1, . . . , n}, two nodes i and j are part of the same cluster (partition) iff χ(i) = χ(j). With this notation, any clustering of V can be represented by such a coloring χ and vice versa. Therefore, modularity optimization is equivalent to finding a coloring χ that maximizes its modularity M(χ), where
M(χ) = (1/2m) Σ_{i,j∈V} ( A_ij − d_i d_j / 2m ) 1{χ(i) = χ(j)}.
Here 1{·} is the indicator function with 1{true} = 1 and 1{false} = 0. Let χ* be a clustering that maximizes the modularity. Then, as before, an algorithm is said to be α-approximate if it produces χ such that α M(χ) ≥ M(χ*).
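As a concrete reference point, the modularity of a coloring can be computed directly from its definition. The sketch below is our own O(n²) illustration, meant only to make the objective explicit (the adjacency-dict encoding is an assumption):

```python
def modularity(adj, chi):
    """M(chi) = (1/2m) * sum_{i,j} (A_ij - d_i*d_j/(2m)) * 1{chi(i) == chi(j)}.
    adj: node -> set of neighbours; chi: node -> cluster label."""
    deg = {i: len(nbrs) for i, nbrs in adj.items()}
    m2 = sum(deg.values())            # 2m = sum of all degrees
    total = 0.0
    for i in adj:
        for j in adj:
            if chi[i] == chi[j]:
                a_ij = 1.0 if j in adj[i] else 0.0
                total += a_ij - deg[i] * deg[j] / m2
    return total / m2
```

A useful sanity check: placing every node in a single cluster always yields modularity exactly 0, since the observed and expected edge fractions coincide.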

III. PARTITION-MERGE ALGORITHM
We describe a parametric meta-algorithm for solving MAP inference and modularity optimization. The meta-algorithm uses two parameters: a large constant K ≥ 1 and a small real number ε ∈ (0, 1), to produce a partition V = V_1 ∪ · · · ∪ V_p so that each part V_j, 1 ≤ j ≤ p, is small. We will specify the values of K and ε in Section IV. The meta-algorithm uses an existing centralized algorithm to solve the original problem on each of these partitioned subgraphs. The resulting assignment leads to a candidate solution for the problem on the entire graph. As we establish in Section IV, this turns out to be a rather good solution. Next, we describe the algorithm in detail.
Step 1 (Partition): We wish to create a partition V = V_1 ∪ · · · ∪ V_p, for some p, with V_i ∩ V_j = ∅ for i ≠ j, so that the number of edges crossing partitions is small. The algorithm for such partitioning is iterative. Initially, no node is part of any partition. Order the n nodes arbitrarily, say i_1, . . . , i_n. In iteration k ≤ n, choose node i_k as the pivot. If i_k belongs to ∪_{ℓ=1}^{k−1} V_ℓ, then set V_k = ∅ and move to the next iteration if k < n, or else the algorithm concludes. Otherwise, choose a radius R_k ≥ 1 at random (its distribution is governed by the parameters K and ε).
Let V_k be the set of all nodes in V that are within distance R_k of i_k but are not part of V_1 ∪ · · · ∪ V_{k−1}. Since we execute this step only if i_k ∉ ∪_{ℓ=1}^{k−1} V_ℓ and R_k ≥ 1, V_k will be non-empty. At the end of the n iterations, we have a partition of V with at most n non-empty parts. Let the non-empty parts be denoted V = V_1 ∪ · · · ∪ V_p for some p ≤ n. A caricature of an iteration is depicted in Figure 1.
Step 2 [Merge (Solving the Problem)]: Given the partition V = V_1 ∪ · · · ∪ V_p, let G_1, . . . , G_p denote the subgraphs of G induced by these parts. We apply a centralized algorithm to each of the graphs G_1, . . . , G_p separately. Specifically, let A be an algorithm for MAP or clustering: the algorithm may be exact (e.g., one solving the problem by exhaustive search over all possible options, or by dynamic programming), or it may be an approximation algorithm (e.g., α-approximate for any graph). We apply A to each subgraph separately.
• For MAP inference, this results in an assignment to all variables since in each partition each node is assigned some value and collectively all nodes are covered by the partition. Declare resulting global assignment, say x as the solution for MAP.
• For modularity optimization, nodes in each partition V_j are clustered. We declare the union of all such clusters across partitions as the global clustering. Thus two nodes in different partitions are always in different clusters; two nodes in the same partition are in different clusters if the centralized algorithm applied to that partition clusters them differently.
Computation Cost: The computation cost of the partitioning scheme scales linearly in the number of edges in the graph. The computation cost of solving the problem in each of the components G_1, . . . , G_p depends on the component sizes and on how the computation cost of algorithm A scales with size. In particular, if the maximum degree of any node in G is bounded, say by d, then each partition has at most d^K nodes. The overall cost is then O(Q(d^K) · p), where Q(ℓ) is the computation cost of A on a graph with ℓ vertices.
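Step 1 can be sketched as follows. We use a fixed radius for clarity (in the paper R_k is randomized via K and ε), and note that distance is measured in the original graph G, so the search traverses already-assigned nodes but claims only unassigned ones. The names and the adjacency-dict encoding are our own illustrative choices.

```python
import random
from collections import deque

def partition(adj, radius):
    """Step 1 of Partition-Merge with a fixed radius: pivots are visited in
    random order; an unassigned pivot claims every still-unassigned node
    within `radius` hops of it in the original graph."""
    order = list(adj)
    random.shuffle(order)
    part = {}                         # node -> partition id
    pid = 0
    for pivot in order:
        if pivot in part:             # pivot already claimed: empty iteration
            continue
        dist, q = {pivot: 0}, deque([pivot])
        while q:                      # BFS in G up to `radius` hops
            u = q.popleft()
            if u not in part:         # claim only unassigned nodes
                part[u] = pid
            if dist[u] < radius:
                for v in adj[u]:
                    if v not in dist:
                        dist[v] = dist[u] + 1
                        q.append(v)
        pid += 1
    return part
```

The Merge step then simply runs the chosen centralized algorithm A on the subgraph induced by each partition id and concatenates the per-partition solutions.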

A. GRAPHS WITH POLYNOMIAL GROWTH
We state sharp results for graphs with polynomial growth. We state results for MAP inference and for modularity optimization under the same theorem statement to avoid repetition. The proofs, however, will have some differences.

Theorem 1: Suppose G = (V, E) has polynomial growth with rate ρ, and let the parameters K and ε of the meta-algorithm be chosen as prescribed in (6). Then, the following holds for the meta-algorithm described in Section III.
(a) If A solves the problem (MAP or modularity optimization) exactly, then the solutions x and χ produced by the algorithm for MAP and modularity optimization, respectively, are (1 + δ)-approximations, where δ > 0 can be made arbitrarily small by choosing K large enough.
(b) If A is an α(n) ≥ 1 approximation algorithm for graphs with n nodes, then the resulting solutions are (α(·) + δ)-approximate; in particular, if α(n) = o(n), the resulting distributed algorithm is a constant-factor approximation.

B. GENERAL GRAPH
The theorem in the previous section was for graphs with polynomial growth. We now state results for a general graph.
Our result shows how to evaluate the ''error bound'' on solutions produced by the algorithm for any instantiation of the randomness. The result is stated below for both MAP and modularity optimization; the ''error function'' depends on the problem. Theorem 2: Given an arbitrary graph G = (V, E) and our algorithm operating on it with parameters K ≥ 1, ε ∈ (0, 1), using a known procedure A, the following holds: (a) If A solves the problem (MAP or modularity optimization) exactly, then the solutions x and χ produced by the algorithm for MAP and modularity optimization respectively satisfy explicit error bounds in which K̄ denotes the maximum number of nodes that are within K hops of any single node in V, ψ^U_ij ≜ max_{σ,σ′∈Σ} ln ψ_ij(σ, σ′), and ψ^L_ij ≜ min_{σ,σ′∈Σ} ln ψ_ij(σ, σ′).

C. DISCUSSION OF RESULTS
Here we dissect the implications of the theorems stated above. To start with, Theorem 1(a) suggests that when graphs have polynomial growth, there exists a randomized polynomial-time approximation scheme (PTAS) for MAP computation and modularity optimization whose computation time scales linearly with n.
Theorem 1(b) suggests that if we use an approximation algorithm instead of the exact procedure for each partition, the resulting solution essentially retains its approximation guarantees: if α(n) is a constant, the resulting approximation guarantee is essentially the same constant; if α(n) increases with n (e.g., log n), the resulting algorithm provides a constant factor approximation! In either case, even if the approximation algorithm has superlinear computation time in the number of nodes (e.g., semi-definite programming), our algorithm provides a way to achieve similar performance in linear time for polynomially growing graphs.
The algorithm, for general graphs, produces a solution for which we have approximation guarantees. Specifically, the error scales with the fraction of edges across partitions that are induced by our partitioning procedure. This error depends on the parameters K, ε utilized by our partitioning procedure. For a graph with polynomial growth, we provide recommendations on what the values of these parameters should be. However, for general graphs, one may try various values of K ∈ {1, . . . , n} and ε ∈ (0, 1) and then choose the best solution. Indeed, a reasonable way to implement such a procedure would be to take values of K of the form 2^k for k ∈ {0, . . . , log₂ n} and values of ε on a regular grid, with granularity the implementor is comfortable with (the smaller the granularity, the better).
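The parameter search described above can be sketched as follows; `run_pm` is a hypothetical callable returning a (solution, objective) pair for given (K, ε), and the ε grid granularity is an arbitrary choice of ours.

```python
import math

def best_over_parameters(run_pm, n, eps_grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Try K = 2^k for k = 0..log2(n) and a grid of eps values; keep the
    best-scoring (solution, objective) pair returned by run_pm(K, eps)."""
    best = None
    for k in range(int(math.log2(n)) + 1):
        for eps in eps_grid:
            sol, val = run_pm(2 ** k, eps)
            if best is None or val > best[1]:
                best = (sol, val)
    return best
```

Since the runs for different (K, ε) pairs are independent, this sweep parallelizes trivially.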
The dependence on ρ and δ in Theorem 1 is driven by the worst-case scenario. While the resulting bounds are theoretically useful, in practice even a moderately small δ may require an exorbitant amount of computation if we followed the guidelines of Theorem 1 and implemented a brute-force/dynamic-programming procedure. In practice, however, a smaller radius than that suggested by Theorem 1 can lead to better performance, as observed in our experiments.

V. EXPERIMENTAL SETUP
In Sections V and VI, we present the experimental evaluations of our algorithm regarding two questions of interest: computation of Maximum A Posteriori (MAP) inference in a pairwise Markov Random Field (MRF), and modularity optimization for community detection. For our experiments, we applied PM both on real-world networks and on synthetic networks to verify its efficiency compared to the original centralized algorithm. For the MAP inference, we employed the sequential tree-reweighted max-product message passing (TRW-S) as the centralized algorithm. For the modularity optimization, Girvan-Newman (GN), Clauset-Newman-Moore (CNM), and Louvain-Method (LM) have been used as the centralized algorithm.
In the experiments, we fixed the radius (for every iteration) to a specific value, both for simplicity and to understand the performance of our algorithm as a function of the radius. Furthermore, we investigated which value of the partition radius is appropriate for our algorithm. All experiments were carried out using a single core of an Intel i7 processor with 256GB of RAM, and PM was implemented in C++.

A. DATASETS 1) REAL NETWORKS
To cover various aspects of real-world networks, we conducted experiments on several types of real-world networks: social networks (Facebook, YouTube, LiveJournal, Twitter, Friendster), citation or collaboration networks (ArXiv, DBLP), and web or communication networks (Email-Enron, Web uk-2005, Web Webbase-2001). Most of the real-world network datasets were downloaded from SNAP, a website that offers information and statistics about the networks. The network statistics are summarized in Table 1. Note that the networks listed below the dashed line are large-scale networks with more than 10 million nodes. For directed real-world networks, we used the corresponding undirected networks in our experiments.

2) SYNTHETIC GRAPHS
We also made use of synthetically generated graphs such as grid graphs, random k-regular graphs, Watts-Strogatz networks, and Barabási-Albert networks.

a: GRID GRAPHS
We tested our algorithm with two types of grid graphs. One is a simple grid, whose nodes are connected if they are directly adjacent to each other in either the horizontal or the vertical direction; the other one is the grid graph with diagonal edges.

b: RANDOM k-REGULAR NETWORKS
A random k-regular network on n nodes, G n,k , is a random graph chosen uniformly at random from all the graphs whose nodes have degree k.

c: WATTS-STROGATZ NETWORKS
To construct a Watts-Strogatz network, we start from a simple grid of n nodes. First, each node is connected to the other nodes within a radius r, which represents the length of the shortest path between locally connected nodes. Next, we add l long-distance edges per node: one endpoint remains the node itself, while the other is chosen uniformly at random from among all nodes. This process is repeated for each node in the network.

d: BARABÁSI-ALBERT NETWORKS
We begin with an Erdős-Rényi model of m_0 nodes with edge probability p, defined as follows: given m_0 nodes, each node pair is connected with probability p, independently of every other node pair. The network then grows through an iterative process until the total number of nodes reaches n. At each discrete time step, we add a new node and connect it to m existing nodes, each chosen with probability proportional to its degree. We set m = m_0 p.
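The construction just described can be sketched as follows. This is our own illustration of the variant used here; the ''repeated nodes'' list is a standard trick for degree-proportional sampling, and all names are our own.

```python
import random

def barabasi_albert(n, m0, p, seed=0):
    """Grow a network as described above: an Erdos-Renyi seed G(m0, p),
    then preferential attachment of each new node to m = round(m0*p) nodes."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(m0)}
    for i in range(m0):                       # Erdos-Renyi seed graph
        for j in range(i + 1, m0):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    m = max(1, round(m0 * p))
    # each node appears in `repeated` once per incident edge, so a uniform
    # draw from this list is a degree-proportional draw
    repeated = [u for u, nbrs in adj.items() for _ in nbrs] or list(adj)
    for new in range(m0, n):
        pool = set(repeated)
        if len(pool) <= m:                    # fewer candidates than m: take all
            targets = pool
        else:
            targets = set()
            while len(targets) < m:
                targets.add(rng.choice(repeated))
        adj[new] = set()
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
            repeated += [new, t]
    return adj
```

The resulting networks contain hubs, which is exactly the regime where, as discussed in Section VI, the partitions produced by PM can become large even for a small radius.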

B. CENTRALIZED ALGORITHMS
To facilitate understanding of the effectiveness of our algorithm, we selected TRW-S for MAP inference and three popular community detection algorithms for modularity optimization, each of which is used as the centralized algorithm inside PM. The algorithms used in our experiments are briefly summarized as follows.

a: SEQUENTIAL TREE-REWEIGHTED MESSAGE PASSING (TRW-S)
The TRW-S algorithm is used for minimizing energy functions of discrete variables with unary and pairwise terms. It is a modified version of TRW; the original TRW algorithm is not guaranteed to increase a lower bound on the energy and does not always converge. By adding a weak tree agreement (WTA) condition, TRW-S is guaranteed to find at least a local maximum of the bound, and it has a subsequence converging to a vector satisfying WTA [10].

b: ENERGY FUNCTION
For the MAP inference, we considered an energy function with binary labels (i.e., Σ = {0, 1}) on a graph G = (V, E), consisting of unary and pairwise terms.

1) GIRVAN-NEWMAN (GN)
The GN algorithm is the first known modularity-based method for community detection. Starting from the original network, edges are iteratively removed to uncover the underlying community structure of the network. This process is based on the concept of edge betweenness, which is the number of shortest paths between pairs of nodes that pass through an edge, and it is repeated until no edges remain. The algorithm is somewhat slow, with a computational complexity of O(n³) on a sparse network [11].

2) CLAUSET-NEWMAN-MOORE (CNM)
The CNM algorithm, which utilizes efficient data structures such as a max-heap, is a fast alternative to the Girvan-Newman algorithm. Every node begins as its own community. At each step, the two communities whose merger yields the largest increase in modularity among all pairs of communities are combined. The algorithm makes it possible to analyze the community structure of large graphs, up to 10^6 nodes, with a computational complexity of O(n log² n) on a sparse network [12].

3) LOUVAIN METHOD (LM)
The LM algorithm is known as one of the best methods for detecting community structure. As a state-of-the-art algorithm, it finds good divisions in terms of modularity even on large networks, and it reveals a hierarchical community structure based on a two-step sequential process (local maximization of the modularity, followed by aggregation of communities). The algorithm is fast enough that the network size it can handle is limited by available memory rather than by computation time, with a computational complexity that is essentially linear in the network size [13].

VI. EXPERIMENTAL RESULTS
As a measure of performance, we compared (i) energy and running time for MAP inference, and (ii) modularity and running time for modularity optimization, as computed by our algorithm, against those obtained using the original centralized algorithms. In addition, we investigated what value of the partition radius is appropriate for our algorithm, as described in Subsection VI-B. We lay out the experimental results under several variations of the network factors to determine the conditions under which our algorithm performs well. For MAP inference, the simulation was carried out over 30 different samples of the same problem; the average energy was then computed over those samples for 100 trials in each case. Although the average value obtained from our setup is a negative quantity, we report its absolute value for uniformity of presentation. For modularity optimization, the average modularity is simply computed over the 100 trials for each case.
This section is organized as follows: Section VI-A presents the overall result of the experiment. Section VI-B describes how to obtain the proper partition radius. Sections VI-C and VI-D present the analyses of the experimental results on real-world networks and synthetic graphs, respectively. Section VI-E describes the results and their explanation in regard to the two problems.
Plot: To facilitate performance comparison, we use the ratios of energy/modularity and of running time, plotted as functions of the partition radius. In other words, the partition radius is plotted on the x-axis, and the ratio of the result of the PM algorithm to that of the centralized algorithm on the y-axis.

A. OVERALL RESULTS
Overall, PM provides a good trade-off between high accuracy and low complexity, with particularly good efficiency when a proper partition radius is chosen. That is, our algorithm produces a good approximation in a relatively short time for most networks. In particular, PM substantially reduces the running time of centralized algorithms with high computational complexity, while the energy/modularity remains at a similar level (even better in some cases) to that of the centralized algorithms.
To sum up, PM performs better under the following conditions: (i) when applied to well-distributed regular networks; (ii) when the centralized algorithm has high complexity; and (iii) generally, when the network is large.
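The overall PM pipeline can be sketched in a few lines. The following is a simplified illustration, not the paper's exact scheme: in particular, the paper's partitioning uses randomized radii (see the appendix), whereas this sketch carves balls of a fixed radius around random centers; the `solve` callback stands in for any centralized algorithm run per partition, and the merge step is the union of the local solutions.

```python
import random
from collections import deque

def ball(adj, center, radius):
    """Nodes within graph distance `radius` of `center`, by BFS."""
    seen = {center: 0}
    q = deque([center])
    while q:
        u = q.popleft()
        if seen[u] == radius:
            continue
        for v in adj[u]:
            if v not in seen:
                seen[v] = seen[u] + 1
                q.append(v)
    return set(seen)

def partition_merge(adj, solve, radius, seed=0):
    """Partition-Merge sketch: carve bounded-radius balls around random
    centers, run the centralized `solve` on each induced subgraph, and
    return the union of the per-partition solutions."""
    rng = random.Random(seed)
    remaining = set(adj)
    solution = {}
    while remaining:
        center = rng.choice(sorted(remaining))
        # Restrict adjacency to the not-yet-partitioned nodes before carving.
        restricted = {u: [v for v in adj[u] if v in remaining] for u in remaining}
        part = ball(restricted, center, radius)
        sub = {u: [v for v in adj[u] if v in part] for u in part}
        solution.update(solve(sub))  # merge step: union of local solutions
        remaining -= part
    return solution
```

For example, on a path graph with a toy per-partition solver that labels each node by its partition's smallest node id, `partition_merge(adj, lambda sub: {u: min(sub) for u in sub}, radius=1)` assigns every node a label; each partition is solved independently, so the per-partition calls could run in parallel.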
In some cases, PM behaves differently on the two problems, even under the same conditions, because the problems themselves differ: MAP inference is an assignment problem, while modularity optimization is a graph clustering problem. Furthermore, we empirically demonstrate that modularity optimization is indeed a decomposable problem as long as the network has geometric structure.

1) ENERGY AND MODULARITY
Energy/modularity typically increases with the partition radius. Our partitioning scheme removes the edges that are not included in any partition, and, as we proved earlier, the error scales with the fraction of edges across partitions; the experimental results confirm this. That is, as the partition radius increases, the number of edges crossing partitions decreases. Accordingly, the error is reduced, and the connectivity between more nodes is taken into account. This leads to a better assignment/division, i.e., an increase in energy/modularity.
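Since the error scales with the fraction of cut edges, that fraction is easy to measure for any concrete partition. A minimal stdlib sketch (names are ours, for illustration):

```python
def cut_edge_fraction(edges, part_of):
    """Fraction of edges whose endpoints fall in different partitions.

    edges: iterable of (u, v) pairs; part_of: dict node -> partition id.
    """
    edges = list(edges)
    cut = sum(1 for u, v in edges if part_of[u] != part_of[v])
    return cut / len(edges)

# A 4-cycle split into two partitions: 2 of its 4 edges are cut.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
part = {0: 0, 1: 0, 2: 1, 3: 1}
print(cut_edge_fraction(edges, part))  # 0.5
```

Increasing the partition radius shrinks this fraction, which is exactly the mechanism behind the trend described above.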

2) RUNNING TIME
As shown in Section III, the overall computation cost increases exponentially with the radius. However, some experimental results show that this is not the whole story. One exceptional case arises when a small partition radius generates a large number of partitions; we assume this situation can incur substantial wall-clock time. The other case arises when the algorithm is applied to networks with hubs, i.e., nodes of very high degree (e.g., Barabási-Albert networks). In such cases, each partition contains a large number of nodes even when the partition radius is small. In these exceptional cases, the performance of our algorithm is relatively poor for modularity optimization, while PM still shows excellent performance for MAP inference. In particular, for MAP inference, PM produces results similar to those of the centralized algorithms in less than half the time, with a small partition radius. In our experiments, the partitioning procedure takes only 0.1-0.2 seconds, even on networks of 10^6 nodes, which is almost negligible within the total running time. In addition, it should be noted that, even when the efficiency of PM decreases, the actual running-time gain generally increases due to the significant growth in the running time of the centralized algorithms.

B. CHOOSING THE PARTITION RADIUS
The theoretical results earlier in this work provide guidelines for choosing the partition radius, and they are worth following. In practice, we additionally find that a simple heuristic rule works well for ''regular'' enough graphs. To that end, define the average distance of a graph as the average shortest-path length between all pairs of nodes; to estimate it, one can simply sample 1,000 random node pairs and average over them. The heuristic we suggest is then: choose Avg.D or Avg.D − 1 as the partition radius.
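The sampling estimate behind this heuristic can be sketched as follows (a stdlib illustration with hypothetical function names, assuming a connected graph given as an adjacency dict; unreachable pairs are simply skipped):

```python
import random
from collections import deque

def bfs_distance(adj, src, dst):
    """Shortest-path length from src to dst by BFS (None if unreachable)."""
    if src == dst:
        return 0
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                if v == dst:
                    return dist[v]
                q.append(v)
    return None

def suggest_radius(adj, samples=1000, seed=0):
    """Heuristic partition radius: the average distance over sampled node
    pairs, rounded (choose this value, or one less for hub-heavy graphs)."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    total, count = 0, 0
    for _ in range(samples):
        u, v = rng.sample(nodes, 2)
        d = bfs_distance(adj, u, v)
        if d is not None:
            total += d
            count += 1
    return round(total / count)
```

On a 6-node cycle the true average pairwise distance is 1.8, so `suggest_radius` returns 2; on large graphs 1,000 samples keeps the estimate cheap relative to the partitioning itself.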
The intuition behind this rule is as follows. If the graph is ''regular'' enough, then the pairwise distance distribution between nodes is unimodal, in which case the average distance is precisely the radius that covers a good fraction of the interactions. It may make sense to choose a slightly smaller radius (by 1) if the graph has a few very high-degree nodes.

C. ON REAL-WORLD NETWORKS 1) ON SMALL-SCALE NETWORKS
Figures 2 and 3 show the experimental results of applying our algorithm to small-scale real-world networks. In Figures 2(a) and 2(b), on Facebook, PM finds solutions with higher energy than those obtained by the original TRW-S, within half the time, for MAP inference. For modularity optimization on the same network, PM also achieves modularity similar to the original CNM in a relatively short time, as shown in Figures 3(c) and 3(d). These particularly good results are due to the very high (and evenly distributed) average degree of the Facebook network (about 43). Note that in this case the radius Avg.D − 1, which is smaller than the generally suggested value, can be the better choice. ArXiv has an intractable size for GN. Although analyzing large partitions requires a long time, solving the problem inside small partitions yields decent results; this is possible because the network is well distributed. Figures 3(a) and 3(b) show that PM attains around 78% of the modularity of GN in about 50 minutes. Considering that GN takes about 22 hours on the entire network, this is a decent result. In Figure 3(b), the actual running time (unit: minutes) can be read off the right y-axis. On the whole, PM shows good results on small-scale real-world networks.

2) ON LARGE-SCALE NETWORKS
PM is most needed when applying algorithms to large-scale graphs. To verify our algorithm on such graphs, we carried out experiments on four real-world networks with more than 10 million nodes. We used TRW-S and LM as the centralized algorithms for MAP inference and modularity optimization, respectively. Note that we present the results centered on the partition radius suggested in Section VI-B. As shown in Figure 4, PM shows outstanding results on MAP inference. When we choose the partition radius as Avg.D − 1, PM produces the same level of results as the centralized algorithm (TRW-S) in less than half the time. We attribute these results to the high average degree of the networks (the average degrees of Web uk-2005 and Twitter are 40 and 58, respectively). Our algorithm involves erasing edges, but in this case each node still retains many edges to other nodes; a similar trend appears on the Facebook network in Figure 2(a). Moreover, the original TRW-S algorithm takes more than 24 hours to compute the results, so PM makes the application of the algorithm far more practical. For modularity optimization, our algorithm inevitably incurs some loss, since modularity is directly affected by edge removal. Nevertheless, PM shows a good trade-off between modularity and running time, as shown in Figure 5.

D. ON SYNTHETIC GRAPHS 1) ON GRID GRAPHS
PM shows tremendous performance on the types of grid graphs described in Section V-A. For MAP inference, the results are shown in Figure 7. The substantial decrease in running time underlines the effectiveness of our algorithm and strongly supports the conclusion that PM performs well on regular networks. On the other hand, an increase in the number of neighboring nodes makes networks more complex, which leads to larger errors and, thus, a decrease in the efficiency of PM. As shown in Figures 8 and 9, PM also yields good results for modularity optimization. Moreover, PM achieves better performance on larger graphs.

2) ON RANDOM K-REGULAR NETWORKS
For MAP inference, PM shows almost the same efficiency for a fixed degree, irrespective of the network size. In Figure 10, the results show that PM achieves around 80% of the energy of TRW-S in about 20% of the time in all cases. On the other hand, for modularity optimization, PM performs better as the network size increases, as shown in Figure 12. For networks of the same size, an increase in degree makes our algorithm less effective, as shown in Figures 11 and 13. These results indicate that increasing the randomness and complexity of networks has a detrimental effect on our algorithm. As in the case of grid graphs, PM achieves better efficiency on well-distributed regular networks.

3) ON WATTS-STROGATZ NETWORKS
Note that a Watts-Strogatz network has an underlying regular structure with a small amount of randomness, where the randomness increases with the number of long-distance edges. Increasing randomness can have adverse effects on our algorithm: for both problems, the experimental results are better when the number of long-distance edges is small, as shown in Figures 14 and 16. In addition, long-distance edges make the networks denser and thereby reduce the capacity for distributed processing, leading to a deterioration in the efficiency of our algorithm.
On the other hand, the two problems respond slightly differently to changes in the radius of the underlying lattice. For MAP inference, a large lattice radius is an advantage when the partition radius is small; this is attributed to the growth in the number of neighboring nodes, which is proportional to 2·rad^2. However, the performance deteriorates rapidly as the partition radius increases. In Figure 15, the results show that our algorithm is very efficient, producing nearly 81% of the energy of TRW-S within just 2% of the time for very small partition radii, while the performance degrades drastically as the partition radius increases. For modularity optimization, increasing the lattice radius vastly improves the efficiency of PM, as shown in Figure 17: proximity-based connectivity is reinforced by a larger radius, offsetting the impact of randomness and thereby promoting the efficiency of clustering. However, as with long-distance edges, an increasing radius reduces the distributed effects.

4) ON BARABÁSI-ALBERT NETWORKS
For both problems, as the number of edges added at each time step increases, the advantage of our algorithm decreases; the results for this case are shown in Figures 18 and 19. Hubs can arise from the generative process of the Barabási-Albert network, and these hubs significantly shorten the average distance between two nodes compared to regular networks such as grid graphs. For this reason, Barabási-Albert networks become more complex and dense, and thus the efficiency of our algorithm, including the distributed effects, is reduced. For a fixed number of added edges, larger networks lead to better results: in Figure 20, the results show that increasing the network size brings about more efficient distributed effects.

E. DISCUSSION FOR EACH PROBLEM 1) MAP INFERENCE a: INTERACTION PARAMETER α
We defined energy functions of discrete variables with unary and pairwise terms, where the parameter α of (11) determines the strength of the relationship between pairs of nodes. Accordingly, an increasing value of α assigns larger weights to edges, which consequently incurs more error from our partitioning scheme. Therefore, our algorithm generally yields a better approximation when α is relatively small, as shown in Figure 21.
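Equation (11) is not reproduced in this section; a standard pairwise form consistent with the description, with α scaling the pairwise strength (an assumed form, not necessarily the paper's exact parameterization), is:

```latex
H(\mathbf{x}) \;=\; \sum_{i \in V} \phi_i(x_i) \;+\; \alpha \sum_{(i,j) \in E} \psi_{ij}(x_i, x_j).
```

Under this form, a larger α makes the cut-edge terms a larger share of the objective, which is exactly why the partitioning error grows with α.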

b: AVERAGE DEGREE
MAP inference is observed to be more affected by the average degree than modularity optimization. When partitions are very small, there is a tendency to produce a better approximation as the average degree increases. In particular, when the average degree is very high, our algorithm obtains outstanding results even for tiny partitions, as in Facebook, Web uk-2005, and Twitter. However, a high average degree does not always yield good results: as seen throughout the experiments, an increase in average degree that also increases randomness can negatively influence our algorithm. Our algorithm generally achieves better efficiency when the increase in average degree improves regularity. However, as with α, when the average degree is high, relatively more time is required for a better assignment as the partition radius increases.
Taken together, the best circumstances for our algorithm with regard to MAP inference are when α is small and the average degree is high. However, our algorithm only realizes positive effects when the increase in average degree improves regularity. In other words, PM shows better results when applied to well-distributed regular networks in which the pairwise interactions between nodes are relatively weak.

2) MODULARITY OPTIMIZATION
Due to its high computational complexity, it is difficult for GN to analyze networks under several of our conditions. Accordingly, we present the experimental results for GN separately; PM shows good performance in this case, as shown in Figure 22. We also observe that the three centralized algorithms we used to extract community structure show somewhat different tendencies with respect to changes in the partition radius. Focusing on these differences, we split the analysis of the experimental results into two parts for closer examination.

a: MODULARITY
As shown in Figure 23(a), GN produces the largest modularity on small partitions, followed by CNM and then LM. This differs from the results obtained by applying the original centralized algorithms to the entire network. This observation indicates that fast approximation algorithms, which give a good trade-off between high accuracy and low complexity, do not guarantee good results in this situation. Indeed, of the three algorithms, LM finds the smallest number of communities on small partitions. By the same token, GN is likely to reduce the error most rapidly as the partition radius increases.

b: RUNNING TIME
In Figures 23(b) and 23(c), for both GN and CNM, the running-time gap between two consecutive radius steps increases dramatically with the partition radius. Compared to GN and CNM, LM shows a relatively small change in this gap. When an algorithm with high computational complexity is taken as the centralized algorithm, our algorithm yields a larger running-time gain.
Taken together, the experimental results show that GN and CNM offer better conditions for our algorithm than LM. That is, with them, PM provides good approximations of the modularity in a relatively short time. Furthermore, this brings a new advantage to our algorithm: high-complexity algorithms such as GN are limited in the size of the networks they can handle, and PM enables them to analyze networks otherwise too large to be tractable, as long as the networks are well distributed. As demonstrated in the previous section, PM actually allows them to obtain adequate results, although tackling large partitions remains challenging.

VII. CONCLUSION
In recent years, it has become increasingly important to design distributed high-performance graph computation algorithms that can deal with large-scale networked data in a cloud-like distributed computation architecture. Motivated by this, in this paper we have introduced Partition-Merge, a simple meta-algorithm that takes an existing centralized algorithm and produces a distributed implementation. When the underlying graph has the polynomial growth property, the resulting distributed implementation runs in essentially linear time and is as good as, and sometimes even better than, the centralized algorithm. The experiments demonstrate the efficiency of the PM algorithm, showing that it produces performance similar to (and in some cases better than) that of the centralized algorithm in a relatively short time.
The algorithm is applicable to any graph in general, and its computation time as well as its performance guarantees depend on the underlying graph structure; interestingly, we have evaluated the performance guarantees for arbitrary graphs. We strongly believe that such an algorithmic approach will be of great value for developing large-scale cloud-based graph computation facilities.

APPENDIX PROOFS OF THEOREMS 1 AND 2 A. MAP INFERENCE
In this appendix, we first prove Theorem 1 and Theorem 2 for MAP inference.
Bound on |E \ ∪_{k=1}^p E_k|: We first state Lemma 1, which captures the essential property of the partition scheme and is used both for MAP inference and for modularity optimization. The proof of Lemma 1 is given at the end of this section.
Lower bound on H(x*): Here we provide a lower bound on H* = H(x*) that will be useful for obtaining the multiplicative approximation property.
Lemma 2: Let H* = max_{x∈Σ^n} H(x) denote the maximum value of H for a given pairwise MRF on a graph G with maximum vertex degree d*. Proof: Assign the weight w_ij = ψ^U_ij to each edge (i, j) ∈ E. Since G has maximum vertex degree d*, by Vizing's theorem there exists an edge coloring of the graph using at most d* + 1 colors, and the edges of any one color form a matching of G. A standard application of the pigeonhole principle implies that there is a color whose total weight is at least (1/(d*+1)) Σ_{(i,j)∈E} w_ij; let M ⊂ E denote this set of edges. Now consider an assignment x^M as follows: for each edge (i, j) ∈ M, set (x^M_i, x^M_j) to a maximizer of ψ_ij; for each remaining i ∈ V, set x^M_i to an arbitrary value in Σ. Note that this assignment is well defined because M is a matching. The resulting bound on H(x^M), in which step (a) uses that ψ_ij and φ_i are non-negative functions, combined with H(x*) ≥ H(x^M) and ψ^L_ij ≥ 0 for all (i, j) ∈ E, proves Lemma 2.
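The display equations of Lemma 2 and its proof are elided above; a reconstruction consistent with the surrounding argument (a sketch under our reading, not the verbatim statement) is:

```latex
H^{*} \;\ge\; \frac{1}{d^{*}+1} \sum_{(i,j)\in E} \psi^{U}_{ij},
\qquad
\psi^{U}_{ij} \;=\; \max_{\sigma,\sigma' \in \Sigma} \ln \psi_{ij}(\sigma,\sigma'),
```

since the matching $M$ satisfies $\sum_{(i,j)\in M} w_{ij} \ge \frac{1}{d^{*}+1}\sum_{(i,j)\in E} w_{ij}$ and the assignment $x^{M}$ attains $H(x^{M}) \ge \sum_{(i,j)\in M} \ln \psi_{ij}(x^{M}_{i}, x^{M}_{j}) = \sum_{(i,j)\in M} \psi^{U}_{ij}$.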
Decomposition of H*: Here we show that, by maximizing H(·) on each partition of V separately and combining the assignments, the resulting x̂ has an H(·) value as good as that of the MAP assignment, up to a penalty in terms of the edges across partitions.
Lemma 3: For a given MRF defined on G, the partition scheme produces an output x̂ whose value H(x̂) is within an additive penalty of H* determined by the edges across partitions, where ψ^U_ij = max_{σ,σ'∈Σ} ln ψ_ij(σ, σ') and ψ^L_ij = min_{σ,σ'∈Σ} ln ψ_ij(σ, σ'). Proof: Let x* be a MAP assignment of the MRF X defined on G. Given an assignment x ∈ Σ^{|V|} defined on a graph G = (V, E) and a subgraph S = (W, E'), let an assignment x' ∈ Σ^{|W|} be called the restriction of x to S if x'(v) = x(v) for all v ∈ W. Let S_1, . . . , S_K be the connected components of G' = (V, E − B), let x*_k be the restriction of x* to the component S_k, and let X_k be the restriction of the MRF X to S_k. Let x̂ be the output of the partition scheme, and let x̂_k be the restriction of x̂ to the component S_k. Since x̂_k is a MAP assignment of H_k(·) by the definition of our algorithm, H_k(x̂_k) ≥ H_k(x*_k) for all k = 1, 2, . . . , K. The claimed bound then follows, where step (a) uses the definitions of ψ^U_ij and ψ^L_ij, and step (b) uses (15). This completes the proof of Lemma 3.
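The statement of Lemma 3 is likewise elided; a form consistent with the penalty described (our reconstruction, with the cut edges paying the gap between the best and worst pairwise values, not the verbatim statement) is:

```latex
H(\hat{x}) \;\ge\; H(x^{*}) \;-\; \sum_{(i,j)\,\in\, E \setminus \cup_{k=1}^{p} E_{k}} \left( \psi^{U}_{ij} - \psi^{L}_{ij} \right),
\qquad
\psi^{L}_{ij} \;=\; \min_{\sigma,\sigma' \in \Sigma} \ln \psi_{ij}(\sigma,\sigma').
```

Each cut edge can cost at most the spread ψ^U_ij − ψ^L_ij, which is why the error scales with the number of edges across partitions.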
Completing the Proof of Theorem 1(a): Recall that the maximum vertex degree d* of G is less than 2^ρ·C by the definition of a polynomially growing graph, and recall our definition ε = δ/(2C·2^ρ) for MAP inference. The claimed bound then follows, where (a) follows from Lemma 3, (b) from Lemma 1, (c) from Lemma 2, and (d) from the definition of ε for MAP inference. This completes the proof of Theorem 1(a) for MAP inference.
Completing the Proof of Theorem 1(b): Suppose that we use an approximation procedure A to produce an approximate MAP assignment x̂_k on each partition S_k in our algorithm. Let A be such that the assignment produced satisfies that H_k(x̂_k) is at least 1/α(n) times the maximum of H_k(·) for any graph of size n. Since A is applied to each partition separately, the approximation is within α(K̄), where K̄ = C·K^ρ is the bound on the number of nodes in each partition.
By the same proof as Lemma 3, together with (21), the claimed bound follows, where (a) follows from Lemma 1, (b) from Lemma 2, and (c) from the definition of ε for MAP inference. This completes the proof of Theorem 1(b) for MAP inference.
Completing the Proof of Theorem 2: The same arguments as in the proof of Theorem 1, together with Lemma 3, complete the proof of Theorem 2 for MAP inference.
Proof of Lemma 1: We now prove Lemma 1. First, we consider the property of the partition scheme applied to a generic metric space G = (V, d_G), where V is the set of points over which the metric d_G is defined. We state the result for any metric space (rather than only for graphs) because this is necessary to carry out the induction-based proof; note that the partition scheme can be applied to any metric space (not just graphs), as it only uses metric properties in its definition. The edge set E of the metric space G is precisely the set of pairs of vertices within distance 1 of each other. This verifies the base case of the induction (n = 1).
As the induction hypothesis, suppose that Proposition 1 is true for any graph with n nodes, for all n < N and some N ≥ 2. As the induction step, we wish to establish Proposition 1 for any G = (V, d_G) with |V| = N. For this, consider any v ∈ V, and consider the first iteration of the partition scheme applied to G, in which the algorithm picks i_1 ∈ V uniformly at random. Given e, depending on the choice of i_1, we consider three different cases (events), and show that the desired bound holds in all three. Case 1. Suppose i_1 is such that d_G(i_1, e) < K, where the distance from a point to an edge of G is defined as the minimum distance from the point to one of the edge's two endpoints. Call this event E_1; further, depending on the choice of the random number R_1, define the corresponding sub-events. Note the following subtle but crucial point: we do not change the metric d_G after removing points from the original set of points.
Case 2. Now suppose i_1 ∈ V is such that d_G(i_1, e) = K; call this event E_2, and define the sub-event E_21 accordingly. Under the event E^c_21 ∩ E_2, we have e ∈ W_1, and the remaining metric space is (W_1, d_G), which has fewer than N points. Further, the ball of radius K around e with respect to this new metric space has at most |B(e, K)| − 1 points (where B(e, K) is taken with respect to the original metric space G on N points). We can now invoke the induction hypothesis for this new metric space. From (30) and (31), we have P[e ∈ B | E_2] ≤ P_K + (1 − P_K)(ε + P_K·(|B(e, K)| − 1)) = ε(1 − P_K) + P_K|B(e, K)| + P_K^2(1 − |B(e, K)|) ≤ ε + P_K|B(e, K)|.
In the above, we have used the fact that |B(e, K)| ≥ 1 (otherwise the bound is trivial to begin with).
Case 3. Finally, let E_3 be the event that d_G(i_1, e) > K. Then, at the end of the first iteration of the algorithm, we again have a remaining metric space (W_1, d_G) with |W_1| < N. Hence, as before, by the induction hypothesis we have P[e ∈ B | E_3] ≤ ε + P_K|B(e, K)|. The three cases are exhaustive and disjoint, i.e., ∪_{i=1}^3 E_i is the whole sample space. Combining the three cases yields the desired bound.
This completes the proof of Proposition 1.
Now we use Proposition 1 to complete the proof of Lemma 1. The definition of the growth rate, together with the definition P_K = (1 − ε)^{K−1}, implies that, to establish Lemma 1, it is sufficient to show that our definition of K satisfies the inequality stated in the following lemma. Lemma 4: Our choice of K satisfies the required bound.

B. MODULARITY OPTIMIZATION
We next prove Theorems 1 and 2 for modularity optimization. In the bound of Lemma 6, the last inequality follows because the term inside the summation in (37) is positive only if A_ij = 1, i.e., (i, j) ∈ E, and negative otherwise. Therefore, for the purpose of the lower bound, we only need to consider (i, j) ∈ E such that (i, j) ∉ ∪_{k=1}^p V_k × V_k, which is precisely E \ ∪_{k=1}^p E_k. Step (a) follows because χ̄, by definition, assigns nodes in V_i and V_j, for i ≠ j, to different clusters. Step (b) follows because χ̂_k has maximum modularity in G_k and hence its modularity is at least that of χ*,k, the restriction of χ* to G_k. This completes the proof of Lemma 6, since the first term in (38) is precisely 2mM(χ*) = 2mM*.
Approximation Factor for M(χ̂): Let β = |E \ ∪_{k=1}^p E_k| / m denote the fraction of edges across partitions for a given partition V = V_1 ∪ · · · ∪ V_p. Then, from Lemmas 5 and 6, it follows that for m ≥ C_2, if 2(2C − 1)β ≤ δ, then M(χ̂) is at least M*·(1 − δ). From Lemma 1 and the linearity of expectation, we obtain the corresponding bound on E[β]. Completing the Proof of Theorem 1(a): When A produces an exact solution to the modularity optimization on each partition, the resulting solution of our algorithm is χ̂, and the claim follows from (39) and (40). Completing the Proof of Theorem 1(b): Suppose we use an approximation procedure A to produce the clustering on each partition in our algorithm. Let A be such that the clustering produced has modularity at least 1/α(n) times the optimal modularity for any graph of size n. Since A is applied to each partition separately, the approximation is within α(K̄), where K̄ = C·K^ρ is the bound on the number of nodes in each partition. Let χ̂_1, . . . , χ̂_p be the clusterings (colorings) produced by A on the graphs G_1, . . . , G_p. Then, by the approximation property of A, the overall clustering χ̂ obtained as the union of χ̂_1, . . . , χ̂_p satisfies the claimed approximation guarantee.

VINCENT BLONDEL received the degree in engineering and the degree in philosophy from the Université catholique de Louvain (UCL), Belgium, the M.Sc. degree in pure mathematics from Imperial College London, U.K., and the Ph.D. degree in applied mathematics from UCL. He also completed a master's thesis at the Institut National Polytechnique de Grenoble. He is currently a Professor of applied mathematics and the President of UCL. His major research interests include mathematical control theory, theoretical computer science, and network science.

SEUNGPIL WON received the B.S. degree in electrical and electronic engineering from Yonsei University, South Korea, in 2015. He is currently pursuing the Ph.D. degree in electrical and computer engineering with Seoul National University, South Korea. His research interests focus on areas that have benefited from generative models, including natural language processing and computer vision.

VOLUME 9, 2021