Optimizing Distance Computation in Distributed Graph Systems

Given a large graph, such as a social network or a knowledge graph, a fundamental query is how to find the distance from a source vertex to another vertex in the graph. As real graphs become very large and many distributed graph systems, such as Pregel, Pregel+, Giraph, and GraphX, are proposed, how to employ distributed graph systems to process single-source distance queries should attract more attention. In this paper, we propose a landmark-based framework to optimize the distance computation over distributed graph systems. We also use a measure called set betweenness to select the optimal set of landmarks for distance computation. Although we can prove that selecting the optimal set of landmarks is NP-hard, we propose a heuristic distributed algorithm that can guarantee the approximation ratio. Experiments on large real graphs confirm the superiority of our methods.


I. INTRODUCTION
With the rapid development of the Internet and social networks in recent years, large-scale data in graph models have gradually increased. As a general data model, the graph model is widely used in many applications. For example, the Internet can be represented as a graph. In the graph, each web page is a vertex, and a hyperlink between the web pages acts as an edge; in a social network, each person is a vertex of the graph, and the relationship between persons constitutes the edge. In addition, many information systems, such as search engines and recommender systems, also use graphs as information carrier models.
As graph models are used in an increasing number of applications, single-source shortest path length (SSSP length) queries, one of the most classic problems on graphs, have been studied for more than half a century and continue to receive attention [6], [16], [26]. Given a graph and a source vertex, an SSSP length query finds the distance from the source vertex to each vertex in the graph. The SSSP length query is widespread in many real applications. For example, the selection of travel routes in traffic networks is one of the most basic applications of SSSP length. In addition, the transmission of information flow in a computer network is most efficient when it follows the shortest router sequence.
The associate editor coordinating the review of this manuscript and approving it for publication was Liangxiu Han.
As the sizes of real graphs increase, the classic solutions for SSSP length queries become inefficient. Consequently, distributed graph systems, including Pregel [16], Pregel+ [25], Giraph [21] and GraphX [9], have emerged to manage large graphs. These distributed graph systems follow the vertex-centric BSP (bulk synchronous parallel) computing model, which divides the calculation into a series of superstep iterations. To improve the scalability of SSSP length query evaluation, we propose in this paper a landmark-based framework over distributed graph systems for computing SSSP length queries in large graphs.
The computational framework in this paper first selects a sequence of appropriate landmarks based on a measure named set betweenness, which is extended from [7], [15] and provides a landmark selection criterion by evaluating the number of shortest paths covered by the set of landmarks. Although we prove that selecting the optimal set of vertices with the maximal set betweenness as the landmarks is NP-hard, our proposed heuristic distributed algorithm can guarantee the approximation ratio. Then, we take advantage of the calculated shortest path trees of the landmarks to compute the distances from the source vertex of the SSSP length query to other vertices over distributed graph systems. In summary, the main contributions we make are as follows:
• We propose a landmark-based framework to evaluate the SSSP length query over distributed graph systems, which utilizes the characteristics of the distributed graph systems to improve the efficiency and scalability of our method.
• We formulate the measure named set betweenness to select the optimal set of landmarks. Although we prove that the problem of selecting the optimal set of landmarks is NP-hard, we propose a heuristic distributed strategy to guarantee the approximation ratio.
• Finally, we conduct extensive experiments over different kinds of real graphs on multiple distributed graph systems (Pregel+ [25], Giraph [21] and GraphX [9]) to verify the performance of the proposed techniques.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

II. PRELIMINARIES
In this section, we introduce the concepts used throughout this paper.

A. GRAPH AND PATH
In this paper, we consider an undirected, unweighted graph $G = (V, E)$, where $V$ is a set of vertices and $E \subseteq V \times V$ is a set of edges. Each edge $e = (v, u) \in E$ connects two vertices $u$ and $v$. Although we mainly consider undirected and unweighted graphs, our method can also be applied to directed and weighted graphs. Figure 1 shows a graph used as a running example in this paper.
Given two vertices $s, t \in V$, we define a sequence $p(s, t) = (s, v_1, v_2, \ldots, v_{l-1}, t)$ to denote a path between $s$ and $t$, where every two consecutive vertices in the sequence are connected by an edge in $E$ and $l$ is the length of the path. Then, the distance $d(s, t)$ between vertices $s$ and $t$ is defined as the length of the shortest path between $s$ and $t$. A shortest path tree rooted at vertex $v$ is a spanning tree $T$ of $G$ such that the distance from $v$ to any other vertex $u$ in $T$ is equal to the distance from $v$ to $u$ in $G$. We denote the shortest path tree rooted at vertex $v$ as $SPT(v)$. Figure 2(a) and Figure 2(b) show $SPT(v_6)$ and $SPT(v_7)$, which are the shortest path trees rooted at $v_6$ and $v_7$ for the example graph.
In this paper, we study the problem of answering single-source shortest path length (SSSP length) queries. Given a graph $G$ and a source vertex $s$, an SSSP length query finds the distances from $s$ to every vertex in $G$.
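The definitions of distance and shortest path tree above can be illustrated with a breadth-first search on a single machine. The following is a minimal sketch, assuming an adjacency-list dictionary; the function name `shortest_path_tree` and the toy graph are hypothetical and are not the graph of Figure 1.

```python
from collections import deque


def shortest_path_tree(adj, root):
    """Breadth-first search from `root` on an unweighted graph.

    Returns (dist, parent): dist[v] is d(root, v), and parent[v] is the
    predecessor of v in one shortest path tree rooted at `root`.
    """
    dist = {root: 0}
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:  # the first visit yields the shortest distance
                dist[w] = dist[u] + 1
                parent[w] = u
                queue.append(w)
    return dist, parent


# Hypothetical toy graph (undirected, unweighted).
adj = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}
dist, parent = shortest_path_tree(adj, "a")
```

Following `parent` from any vertex back to the root spells out one shortest path, so the pair `(dist, parent)` is one way to store $SPT(v)$.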

B. LANDMARK
The shortest path distance in a graph is a metric and satisfies the triangle inequality. That is, for any vertices $s, t, v \in V$,
$$d(s, t) \le d(s, v) + d(v, t).$$
We call the right side of the inequality the upper bound of the distance between $s$ and $t$, which is denoted as $\hat{d}(s, t)$. The upper bound equals the exact distance between $s$ and $t$ if a shortest path $p(s, t)$ passes through $v$. Thus, if we select a vertex $v$ and precompute the distance $d(v, u)$ from this vertex to each other vertex $u$ in the graph, we can obtain an estimated distance $\hat{d}(s, t) = d(s, v) + d(v, t)$ between any two vertices $s$ and $t$.
Furthermore, if we select a set $\{l_1, l_2, \ldots, l_k\}$ of $k$ vertices as landmarks, a potentially tighter upper bound can be computed as follows.
$$\hat{d}(s, t) = \min_{1 \le i \le k} \{ d(s, l_i) + d(l_i, t) \}$$
Most previous studies [10], [14], [17], [18], [22] focus on how to select landmarks for accurately estimating the distance between two vertices. In this paper, we propose a landmark-based distance computation framework to improve the performance of evaluating SSSP length queries.
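The landmark-based upper bound can be sketched in a few lines. This is an illustrative example only, assuming a connected undirected graph stored as an adjacency-list dictionary; the names `bfs_distances` and `landmark_upper_bound` and the toy graph are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Exact distances from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def landmark_upper_bound(landmark_dist, s, t):
    """Smallest value of d(s, l) + d(l, t) over all landmarks l.

    `landmark_dist` maps each landmark to its precomputed distance table.
    """
    return min(d[s] + d[t] for d in landmark_dist.values())


# Hypothetical toy graph: the path a-b-c-d-e with an extra leaf f attached to b.
adj = {
    "a": ["b"],
    "b": ["a", "c", "f"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
    "f": ["b"],
}
one_landmark = {"c": bfs_distances(adj, "c")}
two_landmarks = {l: bfs_distances(adj, l) for l in ("b", "c")}
```

With the single landmark `c`, the bound for the pair `(a, e)` is exact because `c` lies on their shortest path, while the bound for `(a, f)` overestimates; adding `b` as a second landmark tightens the latter.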

C. COMPUTATIONAL MODEL OF A DISTRIBUTED GRAPH SYSTEM
As graphs grow in scale, distributed graph systems have become the main way to process them in practice. The computational model of distributed graph systems is a vertex-centric model based on the BSP model, in which a computation is composed of a sequence of supersteps. In each superstep, a user-defined function is executed by each vertex in the graph in parallel. This function describes the operations that a vertex $v$ needs to perform in a superstep.
In particular, the computational model of distributed graph systems takes a graph $G$ as input data. Each vertex $v$ is in either the active state or the inactive state. Each vertex of $G$ has a vertex ID and a modifiable user-defined value associated with it. In addition, each edge, which records its target vertex ID, is associated with its source vertex and a modifiable user-defined value.
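To make the vertex-centric superstep model concrete, the following single-machine sketch simulates it for unweighted SSSP lengths. It is an illustration under simplifying assumptions (one process, messages kept in dictionaries, a vertex counted as active exactly when it has incoming messages) and not the API of Pregel, Pregel+, Giraph, or GraphX; the name `bsp_sssp` is hypothetical.

```python
import math


def bsp_sssp(adj, source):
    """Simulate vertex-centric supersteps for unweighted SSSP lengths."""
    dist = {v: math.inf for v in adj}
    # Messages to be delivered in the next superstep: target -> candidate distances.
    inbox = {source: [0]}
    supersteps = 0
    while inbox:  # no pending messages means that every vertex is inactive
        outbox = {}
        for v, messages in inbox.items():  # vertices with messages are active
            best = min(messages)
            if best < dist[v]:
                dist[v] = best
                for w in adj[v]:  # propagate the improved distance
                    outbox.setdefault(w, []).append(best + 1)
            # otherwise v becomes inactive without sending anything
        inbox = outbox
        supersteps += 1
    return dist, supersteps


# Hypothetical toy graph: the path a-b-c.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
dist, supersteps = bsp_sssp(adj, "a")
```

The per-vertex body of the inner loop plays the role of the user-defined function: it reads incoming messages, updates the vertex value, and sends messages along the outgoing edges.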

III. A LANDMARK-BASED DISTANCE COMPUTATION FRAMEWORK OVER DISTRIBUTED GRAPH SYSTEMS
The main contribution of this paper is to use the landmark-based method to perform an efficient computation of an SSSP length query over distributed graph systems. Our idea is to first select some landmarks and compute their shortest path trees in the offline stage. Then, when a query with source vertex $v$ is input, we employ distributed graph systems, such as Pregel [16], Pregel+ [25], Giraph [21], and GraphX [9], to perform the distance computation between $v$ and any other vertex based on the shortest path trees of the selected landmarks.
In the offline stage, we take advantage of our heuristic method to select a set of vertices as landmarks. The details of our landmark selection method are discussed in Section IV. Let $L = \{l_1, l_2, l_3, \ldots, l_k\}$ be the set of landmarks. Then, we calculate the shortest path tree from each landmark to the remaining vertices in the graph. All the shortest path trees computed at this stage are maintained and used for online query processing.
During the evaluation of a query starting from $s$, the distance $d(s, v)$ from $s$ to every vertex $v$ is computed over distributed graph systems using the following steps:
• We associate each vertex $v$ with a distance value $v.d$ and initialize it as $+\infty$. We start the traversal by setting the source vertex $s$ active in the first superstep and assigning $s.d$ as 0.
• In each superstep, the active vertices send their distances to $s$ to their neighbors, and their neighbors become active. When a vertex $v$ becomes active, if it is a landmark, we look up the shortest distances stored in its shortest path tree and, for every vertex $u$, use $v.d + d(v, u)$ as an upper bound to update $u.d$. If $v$ is not a landmark, we only update $v.d$.
• The traversal terminates when all vertices are inactive, which indicates that all vertices' distances to $s$ are computed.
In practice, as the landmark-based distance computation algorithm runs, the distances from the source to more and more vertices are computed, so the benefit of loading a landmark's shortest path tree diminishes. Near the end of the distance computation, only a few vertices' distances remain to be computed. When the traversal meets a landmark at that point, it is better to continue the traversal directly rather than load its shortest path tree. Therefore, we maintain a counter $Count_l$ of the landmark shortest path trees that have been loaded and increase it each time we use a landmark's shortest path tree. Once $Count_l$ reaches a threshold $\theta$, we stop loading shortest path trees even when the traversal meets a landmark.
We propose an algorithm as shown in Algorithm 1 to implement the above ideas over distributed graph systems. We first associate each vertex with a tag $visited$ and a value $d$, where $visited$ records whether the vertex is visited and $d$ records the distance from the source to the vertex (Lines 1-2 in Algorithm 1). We also initialize $Count_l$ as 0 (Line 3 in Algorithm 1). Then, we execute the user-defined function

Algorithm 1: Query Processing Over Distributed Graph Systems
Input: A graph G = (V, E) and a source vertex s
Output: The distances from s to other vertices
1 for each vertex v do
2 Initialize v with a tag v.visited and a value of ...
...
if v is a landmark and Count_l < θ then
18 for each vertex v in V do
19 if ...

We assume that we want to process an SSSP length query from $v_3$ in Figure 1, and $\{v_6, v_7\}$ is the set of landmarks. We start the traversal over the graph from $v_3$, and $v_3$ becomes active first. During the traversal, the first visited landmark is $v_7$. Then, for any vertex $v$, we can obtain an approximate distance to $v_3$ as $v_3.d_{v_7} + v.d_{v_7}$, and we use this approximate distance to update $v.d$. For example, the approximate distance ... The above traversal terminates when all vertices are inactive, which means that the distances from $v_3$ to all vertices are computed.
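The query processing steps of this section can be sketched as a single-machine simulation. This is not a reproduction of Algorithm 1: it is a minimal sketch that assumes sequential execution of each superstep, that a landmark's distance table is loaded at most once, and that loading stops once the counter reaches `theta`. The names `landmark_sssp` and `bfs_distances` are hypothetical.

```python
import math
from collections import deque


def bfs_distances(adj, source):
    """Exact distances from `source`, used here to precompute landmark tables."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def landmark_sssp(adj, source, landmark_dist, theta):
    """Superstep-style SSSP lengths that exploit precomputed landmark tables."""
    dist = {v: math.inf for v in adj}
    dist[source] = 0
    active = {source}
    loaded = 0      # plays the role of Count_l
    used = set()    # landmarks whose table was already loaded (assumption)
    supersteps = 0
    while active:
        supersteps += 1
        next_active = set()
        # Active landmarks push upper bounds dist[v] + d(v, u) to every vertex u.
        for v in active:
            if v in landmark_dist and v not in used and loaded < theta:
                used.add(v)
                loaded += 1
                for u, d_vu in landmark_dist[v].items():
                    if dist[v] + d_vu < dist[u]:
                        dist[u] = dist[v] + d_vu
                        next_active.add(u)
        # Every active vertex sends its distance plus one to its neighbors.
        for v in active:
            for w in adj[v]:
                if dist[v] + 1 < dist[w]:
                    dist[w] = dist[v] + 1
                    next_active.add(w)
        active = next_active
    return dist, supersteps


# Hypothetical toy graph: a path with vertices 0..9 and vertex 5 as the only landmark.
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
tables = {5: bfs_distances(adj, 5)}
plain, steps_plain = landmark_sssp(adj, 0, {}, theta=30)
fast, steps_fast = landmark_sssp(adj, 0, tables, theta=30)
```

Because every stored value is always an upper bound on the true distance and every decrease reactivates the vertex, the simulation ends with exact distances; the landmark tables only reduce the number of supersteps needed to get there.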

IV. LANDMARK SELECTION
In the algorithm description above, we optimize the distributed distance computation by considering the set of landmarks $L = \{l_1, l_2, \ldots, l_k\}$. We can freely choose the set of landmarks, and the choice is crucial for the performance of this method.
The best landmark to select is a vertex that is very central in the graph and through which many shortest paths pass. In fact, selecting the best landmark is related to finding the vertex with the highest betweenness centrality [8]. The betweenness centrality of vertex $v$ is defined as
$$C_B(v) = \sum_{s \ne v \ne t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}},$$
where $\sigma_{st}$ is the number of shortest paths from $s$ to $t$ and $\sigma_{st}(v)$ is the number of shortest paths from $s$ to $t$ that $v$ lies on.
In real graphs, although some vertices have large betweenness values, the shortest paths that they lie on highly overlap. The more the shortest paths of two vertices overlap, the fewer shortest paths the two vertices cover in total. If two vertices lie on a similar set of shortest paths in the graph, it is probably wise to include only one of them.
For example, let us consider the two vertices $v_6$ and $v_{12}$ in our example graph in Figure 1. The shortest paths passing through $v_6$ (or $v_{12}$) also pass through $v_{12}$ (or $v_6$), so the shortest paths that they lie on highly overlap. Hence, although both of them may have high betweenness values, we should select only one of them as a landmark.
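To achieve the goals above, we extend the definitions in [7], [15] and define the set betweenness of $k$ vertices: the betweenness of a single vertex is generalized to the number of shortest paths that the $k$ vertices lie on, which measures the benefit of selecting them as landmarks. Given a set of $k$ vertices $L = \{v_1, v_2, \ldots, v_k\}$, the set betweenness of $L$ is defined as follows.
$$C_B(L) = \sum_{s \ne t \in V} \frac{\sigma_{st}(L)}{\sigma_{st}} \quad (3)$$
Here, $\sigma_{st}(L)$ is the number of shortest paths from $s$ to $t$ on which at least one vertex in $L$ lies.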
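For small graphs, set betweenness can be computed directly by counting shortest paths. The sketch below is illustrative only and makes two assumptions that the text does not settle: ordered pairs $(s, t)$ are summed, and a path whose endpoint is in $L$ counts as covered. The names `count_paths` and `set_betweenness` and the toy graph are hypothetical.

```python
from collections import deque


def count_paths(adj, s, blocked=frozenset()):
    """BFS from s; sigma[t] counts the shortest s-t paths of the graph
    that avoid every vertex in `blocked`."""
    dist = {s: 0}
    sigma = {s: 0 if s in blocked else 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            # u is a predecessor of w on some shortest path from s
            if dist[w] == dist[u] + 1 and w not in blocked:
                sigma[w] += sigma[u]
    return dist, sigma


def set_betweenness(adj, landmarks):
    """Sum over ordered pairs (s, t) of the fraction of shortest s-t paths
    that contain at least one vertex of `landmarks`."""
    blocked = frozenset(landmarks)
    total = 0.0
    for s in adj:
        _, sigma_all = count_paths(adj, s)
        _, sigma_avoid = count_paths(adj, s, blocked)
        for t, n_paths in sigma_all.items():
            if t != s:
                total += (n_paths - sigma_avoid[t]) / n_paths
    return total


# Hypothetical toy graph: the 4-cycle a-b-c-d-a.
cycle = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
single = set_betweenness(cycle, {"b"})
pair = set_betweenness(cycle, {"b", "d"})
```

On this toy graph the value for `{"b", "d"}` is smaller than the sum of the two single-vertex values, because some shortest paths are covered by both vertices; this is the overlap effect that motivates the set-based measure.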
As suggested in Equation 3, we should select the set of k vertices with the largest set betweenness as landmarks to cover as many shortest paths as possible. However, we can prove that the problem of selecting the set of k vertices with the largest set betweenness is NP-hard in the following theorem.
Theorem 1: The problem of selecting the set of k vertices with the largest set betweenness as landmarks is NP-hard.
Proof: We first prove that the set betweenness function $C_B(L)$ in Equation 3 is submodular. Let $\Delta_B(v \mid L) = C_B(L \cup \{v\}) - C_B(L)$ be the discrete derivative of set betweenness at $L$ with respect to $v$. We need to prove that, for every $V_1 \subseteq V_2 \subseteq V$ and every vertex $v \notin V_2$, $\Delta_B(v \mid V_1) \ge \Delta_B(v \mid V_2)$. The shortest paths on which $v$ lies fall into three classes: the set $SP_1$ of shortest paths on which no vertex in $V_2$ lies, the set $SP_2$ of shortest paths on which some vertices in $(V_2 - V_1)$ lie but no vertex in $V_1$ lies, and the set $SP_3$ of shortest paths on which some vertices in $V_1$ lie.
The paths in $SP_1$ are covered by neither $V_1$ nor $V_2$, and the paths in $SP_3$ are already covered by both, so the marginal gains of $v$ for $V_1$ and $V_2$ over $SP_1$ and $SP_3$ are the same. The shortest paths in $SP_2$, however, are not counted when computing $C_B(V_1)$ but are counted when computing $C_B(\{v\} \cup V_1)$, so they contribute to $\Delta_B(v \mid V_1)$. In contrast, they are counted when computing both $C_B(V_2)$ and $C_B(\{v\} \cup V_2)$, so they contribute nothing to $\Delta_B(v \mid V_2)$. Hence, $\Delta_B(v \mid V_1) \ge \Delta_B(v \mid V_2)$, with strict inequality whenever $SP_2$ is not empty.
In conclusion, $\Delta_B(v \mid V_1) \ge \Delta_B(v \mid V_2)$ and the function $C_B(L)$ is submodular. Since the problem of maximizing submodular functions is NP-hard [4], the problem of selecting the $k$ vertices with the largest set betweenness is NP-hard.

A. OUR SOLUTION
As proven in Theorem 1, selecting the set $L$ of $k$ vertices with the largest set betweenness as landmarks is an NP-hard problem. We therefore propose a greedy algorithm, as outlined in Algorithm 2.
In general, for each vertex $v$, we initialize a set $SP(v)$ that denotes the set of shortest paths on which $v$ lies (Lines 1-2 in Algorithm 2). Then, we compute all vertices' shortest path trees and traverse each shortest path tree to obtain $SP(v)$ for each vertex $v$ (Lines 4-9 in Algorithm 2). Note that the shortest path trees can themselves be computed with our proposed algorithm over distributed graph systems (Algorithm 1): while computing the shortest path tree of a sample vertex $v$, we temporarily treat the sample vertices whose shortest path trees have already been computed as landmarks and then call Algorithm 1 to compute $v$'s shortest path tree (Line 6 in Algorithm 2). Afterward, we iteratively select the vertex $v_{max}$ that maximizes the marginal gain of the set betweenness of the selected landmarks until we reach the limit on the number of landmarks or cannot find a landmark that increases the set betweenness (Lines 11-15 in Algorithm 2). Finally, the algorithm outputs $L$ (Line 16 in Algorithm 2).

Algorithm 2: Landmark Selection
Input: ...
...
5 Add v in S;
6 Use S as the landmarks to call Algorithm 1 to compute the shortest path tree SPT(v) rooted at v;
...

Theorem 2: The set betweenness of the landmarks selected by Algorithm 2 is at least $\frac{1}{2}(1 - \frac{1}{e})$ of the optimal set betweenness.
Proof: In Algorithm 2, the problem of selecting the set of $k$ vertices with the largest set betweenness as landmarks is a problem of maximizing a submodular set function subject to a constraint, as discussed in Theorem 1. We directly apply the greedy algorithm in [12] to iteratively select the vertex with the largest marginal gain in set betweenness. In [12], the authors prove that the worst-case performance guarantee of the greedy algorithm is $\frac{1}{2}(1 - \frac{1}{e})$, so the set betweenness of the selected landmarks is at least $\frac{1}{2}(1 - \frac{1}{e})$ of the optimal set betweenness.
In real applications, there are often too many vertices, so computing the shortest path trees of all vertices is too costly. Hence, we can sample some vertices and compute their shortest path trees to estimate the set betweenness of vertices.
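The greedy selection step can be sketched as follows. This is a simplified single-machine illustration, not Algorithm 2 itself: it assumes that the sets $SP(v)$ have already been collected as sets of path identifiers and it counts every covered path with weight one. The name `greedy_landmarks`, the vertex labels, and the path identifiers are hypothetical.

```python
def greedy_landmarks(paths_through, k):
    """Pick up to k vertices, each time taking the vertex whose set SP(v)
    adds the most shortest paths that are not yet covered."""
    covered = set()
    selected = []
    candidates = dict(paths_through)
    while len(selected) < k:
        best, best_gain = None, 0
        for v, sp in candidates.items():
            gain = len(sp - covered)  # marginal gain of adding v
            if gain > best_gain:
                best, best_gain = v, gain
        if best is None:  # no remaining vertex increases the coverage
            break
        selected.append(best)
        covered |= candidates.pop(best)
    return selected, covered


# Hypothetical path identifiers: x and y lie on almost the same shortest paths.
sp = {
    "x": {1, 2, 3, 4},
    "y": {1, 2, 3},
    "z": {5, 6},
    "w": {4, 5},
}
selected, covered = greedy_landmarks(sp, k=2)
```

After `x` is chosen, the marginal gain of `y` drops to zero even though `y` covers three paths on its own, so the second pick goes to a vertex that covers new paths.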

B. ANALYSIS
In this section, we analyze the space and time complexities of our proposed method.
We first analyze the space complexity of our method. Each landmark needs to store its distances to all vertices. We assume that the number of selected landmarks is $k$, so the total number of entries to be stored is $O(k \times |V|)$, which indicates that the space complexity is $O(k \times |V|)$.
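As a rough illustration of this bound, consider the default of $k = 100$ landmarks used in the experiments together with a hypothetical graph of $|V| = 10^6$ vertices and 4 bytes per stored distance (the vertex count and the entry size are assumptions, not figures from the datasets):

```latex
k \times |V| = 100 \times 10^{6} = 10^{8} \text{ entries},
\qquad
10^{8} \times 4 \text{ bytes} = 4 \times 10^{8} \text{ bytes} \approx 400 \text{ MB}
```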
Next, we analyze the time complexity of precomputation. We can divide the landmark selection in precomputation (Algorithm 2) into two stages: the first stage samples a set of vertices and computes the approximate set of shortest paths on which each vertex lies, and the second stage selects the optimal set of landmarks.
For the first stage, we assume that the number of sample vertices is $s$ and the diameter (i.e., the longest shortest path) of the graph is $d$. In distributed graph systems, for each sample vertex, it takes at most $d$ supersteps to compute its shortest path tree and find the shortest paths between it and the other sample vertices. Each superstep involves at most all edges. Hence, the time complexity of the first stage is $O(s \times d \times |E|)$.
Finally, we analyze the time complexity of processing queries. In the worst case, no landmark is used to speed up the computation and we do a full traversal over the whole graph, so it takes at most $d$ supersteps and each superstep involves at most all edges. Hence, the time complexity of processing queries is $O(d \times |E|)$.

V. EXPERIMENTS
In this section, we evaluate the performance of our method on several real graphs and on different distributed graph systems.

A. SETUP
We test our method on the following real-world datasets, which can be classified into knowledge graphs, road networks and social networks.

1) DBpedia
DBpedia [3] is a crowd-sourced RDF knowledge graph derived from Wikipedia. It can also be represented by a graph, where resources are vertices and relationships between resources are edges.

2) YAGO
YAGO [19] is an RDF knowledge graph that mainly integrates data from Wikipedia, WordNet, and GeoNames. It can be represented as a graph, where entities are vertices and relationships between entities are edges.

3) Soc-Academia
Soc-Academia [20] is a social network extracted from Academia.edu, a platform for academics to share research papers. In this network, each vertex corresponds to a researcher. An undirected edge between two researchers represents a link between them.

4) Soc-YouTube
Soc-YouTube [20] is a social network extracted from YouTube, a video-sharing website that includes a social network. In this network, each vertex corresponds to a user. An undirected edge between two users represents a link between them.

5) RoadNet-TX
RoadNet-TX [13] is a dataset describing the road network of Texas. Intersections and endpoints in Texas are represented by vertices and the roads connecting these intersections or endpoints are represented by edges.

6) RoadNet-CA
RoadNet-CA [13] is a dataset describing the road network of California. Similarly, intersections and endpoints are vertices and the roads are edges.
These datasets and their properties are listed in Table 1, which reports the number of vertices $|V|$, the number of edges $|E|$, the number of triangles, the average clustering coefficient $\delta$ and the diameter $d$ (i.e., the longest shortest path) of each graph.
All experiments are conducted on a cluster with 12 physical nodes in the Alibaba Cloud. Each node has four CPUs with 32 GB memory and a 100 GB disk drive. We test our proposed techniques over multiple distributed graph systems, including Pregel+ [25], Giraph [21] and GraphX [9]. Pregel+ has the best scalability and Giraph [21] and GraphX [9] fail over road networks, so we use Pregel+ in different experiments by default.
By default, we sample 500 vertices to estimate the value of the set betweenness of each vertex and select 100 landmarks from them. When we evaluate the SSSP length query performance, we set the threshold θ of Count l discussed in Section III as 30. Even for one query, our method can speed up its performance and the response times of different queries spread out over a small range. Thus, we sample 500 vertices from each graph as the input distance queries to evaluate different methods and evaluate them one by one. Then, we report the average response time per query.

B. COMPARISON WITH DIFFERENT LANDMARK SELECTION METHODS
We compare landmark selection based on the measure of set betweenness (denoted as SB) with several classic landmark selection methods: random (denoted as RN), degree (denoted as Degree) [5], betweenness (denoted as BN) [8], and coverage (denoted as Coverage) [14]. Each method selects 100 landmarks. Random selects vertices in the graph at random as landmarks. Degree selects the vertices with the largest degrees in the graph. Betweenness selects the vertices with the highest probability of appearing on the shortest paths between pairs of vertices. Coverage iteratively chooses the vertices with the approximately largest coverage, which represents the number of vertex pairs whose shortest paths pass through the vertex. We also compare with a baseline in which a distance query is not optimized with landmarks but is still computed on the distributed graph system (denoted as Baseline).
Figure 3 summarizes the average query response time for the various landmark selection methods on all six datasets. Set betweenness yields lower distance computation times than the other measures, especially for road networks. The diameters of road networks are much larger than those of the other two kinds of graphs, so distance computation takes many more iterations in road networks than in the other kinds of graphs. Our landmark-based distance computation framework greatly reduces the number of iterations, and the improvement is largest when many iterations are needed, as in road networks.
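The two simplest strategies in this comparison, Random and Degree, can be sketched as follows. This is an illustration under assumptions, not the code used in the experiments; the function names, the fixed seed, and the tie-breaking by vertex identifier are hypothetical choices.

```python
import random


def select_random(adj, k, seed=0):
    """Random strategy: draw k distinct vertices (seed fixed for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(sorted(adj), k)


def select_by_degree(adj, k):
    """Degree strategy: take the k vertices with the largest degrees,
    breaking ties by vertex identifier."""
    return sorted(adj, key=lambda v: (-len(adj[v]), v))[:k]


# Hypothetical toy graph: a hub connected to four leaves, plus one leaf-leaf edge.
adj = {
    "hub": ["p", "q", "r", "t"],
    "p": ["hub", "q"],
    "q": ["hub", "p"],
    "r": ["hub"],
    "t": ["hub"],
}
by_degree = select_by_degree(adj, 2)
at_random = select_random(adj, 2)
```

Neither strategy looks at how the shortest paths covered by the chosen vertices overlap, which is the aspect that set betweenness is designed to capture.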

C. EFFECT OF LANDMARKS' NUMBER
In this experiment, we study the performance of our method when different numbers of landmarks are selected. Figure 4 summarizes the optimized distance computation time on the six datasets as the number of landmarks varies from 100 to 500. As the number of landmarks increases, the average query response time tends to decrease. This is because more landmarks can be selected to participate in the online distance computation. The larger the number of landmarks is, the more likely the traversal is to meet a nearby landmark, so the landmark-based distances are, in theory, closer to the real distances.

D. EFFECT OF θ
As discussed in Section III, as the landmark-based distance computation algorithm runs, the effect of the landmarks' shortest path trees becomes limited, and we define a threshold θ. We stop loading the shortest path tree of the landmark during distance computation if the number of landmarks' shortest path trees that have been loaded is more than θ. In this experiment, we evaluate the effect of θ. Figure 5 shows the experimental results. As θ gradually increases from 10 to 90, we find that the query time after optimization basically decreases first and then increases, and there is an inflection point. This indicates that the effect of landmarks' shortest path trees used in distance computation is initially large and then decreases after an inflection point.

E. PERFORMANCE OVER DIFFERENT DISTRIBUTED GRAPH SYSTEMS
In this experiment, we evaluate the performance of our proposed techniques on different distributed graph systems, including Pregel+ [25], Giraph [21] and GraphX [9]. Because the diameters of road networks are so large that Giraph [21] and GraphX [9] fail on them, we only compare these distributed graph systems on knowledge graphs and social networks. As shown in Figure 6, Pregel+ has better overall performance than Giraph and GraphX. Pregel+ uses less memory and can be 2-3 times faster than a Java implementation.

F. OFFLINE PERFORMANCE
In this experiment, we evaluate the precomputation time of the proposed method. We divide our landmark selection into two stages: sampling a set of vertices and computing the approximate set of shortest paths on which each vertex lies, and selecting the optimal set of landmarks. Table 2 shows that for small social networks such as Soc-Academia, the landmarks can be obtained within approximately 666 minutes, while the largest road network (RoadNet-CA) requires approximately 6,133 minutes of preprocessing.

VI. RELATED WORK

A. LANDMARK-BASED DISTANCE COMPUTATION
Landmark-based distance computation has been widely studied [1], [2], [10], [14], [17], [18], [22], [27]. Potamias et al. [17] first evaluated the landmark-based algorithm for approximate distance estimation in large graphs. The algorithms in [10] extended the above method by storing complete shortest paths to each landmark at each vertex. Qiao et al. [18] identified a local landmark close to the specific query vertex. Tretyakov [22] discussed how to maintain the information of landmarks incrementally under edge insertions and deletions. Valstar et al. [23] discussed how to utilize landmarks to optimize regular simple path query processing on large graphs. Li et al. [14] proposed a new measure named coverage to optimize the lower bound of landmark-based distance estimation. Akiba et al. [1], [27] and Yano et al. [27] proposed a pruned landmark labeling framework to evaluate exact distance queries, approximate distance queries and distance queries over time-evolving graphs.
Betweenness centrality and its variants have also been studied and many papers discuss how to compute them in a single machine [7], [15]. Fink and Spoerhase [7] first give a reduction of the problem of computing betweenness to the problem of budgeted maximum coverage, and then design a new algorithm with an approximation factor based on the reduction. Mahmoody et al. [15] present a randomized algorithm based on sampling shortest paths while providing theoretical guarantees. In this paper, we use the concept of set betweenness and design a heuristic distributed algorithm to compute the set betweenness of a large graph.

B. DISTRIBUTED GRAPH PROCESSING SYSTEMS
As an increasing number of real applications concern large graphs, many distributed graph systems have been proposed. The computation model of these systems is a vertex-centric model that is composed of a sequence of supersteps conforming to the bulk synchronous parallel (BSP) model. The first distributed graph processing system was Pregel [16].
After Pregel, many systems were proposed to optimize it. Giraph [21] is the open-source counterpart to Pregel; GraphX [9] is Spark's graph computing API, which also implements the Pregel operator; Pregel+ [25] proposes integration mirroring and message combining as well as a request-response mechanism; Quegel [28] extends the vertex-centric model to the query-centric model; TurboGraph++ [11] discusses how to process large graphs by exploiting external memory without compromising efficiency; GraphD [26] adopts a semistreaming model to avoid scanning the whole graph in each superstep, where only a portion of the vertex states are maintained in the main memory and edges and messages are streamed on the local disk; G-thinker [24] extends the vertex-centric model to the subgraph-centric model for compute-intensive graph mining workloads.

VII. CONCLUSION
In this paper, we propose a distributed landmark-based distance computation framework that uses distributed graph systems to compute the distances from a source to all vertices in large graphs. We use a measure called set betweenness to select a sequence of approximately optimal landmarks, which plays an important role in speeding up distance calculations. Although we prove that finding the optimal set of landmarks is NP-hard, we propose a heuristic distributed algorithm with a bounded approximation ratio. Finally, extensive experiments verify that our algorithm is effective and efficient for computing distances from a source vertex in large graphs.