Efficient Similarity Search on Quasi-Metric Graphs

Similarity search in metric spaces finds the objects similar to a given object; it has received much attention because it supports various data types and flexible similarity metrics. In real-life applications, metric spaces might be combined with graphs, resulting in geo-social networks, citation graphs, and social image graphs, to name but a few. In this paper, we introduce a new notion called the quasi-metric graph, which connects metric data using a graph, and formulate similarity search on quasi-metric graphs based on a combined similarity metric considering both the metric data similarity and the graph similarity. We propose two simple yet efficient approaches, the best-first method and the breadth-first method, which traverse the quasi-metric graph following the best-first and breadth-first paradigms, respectively, and utilize the triangle inequality to prune unnecessary evaluations. Extensive experiments with three real datasets demonstrate the effectiveness and efficiency of our proposed methods compared with several baseline methods.


I. INTRODUCTION
Given a query object q and an object set S_O, a similarity query in metric spaces finds objects from S_O similar to q under a certain similarity metric. Considering that metric spaces can support a wide range of data types and similarity metrics, metric similarity queries are useful in GIS, information retrieval, multimedia recommendation, etc. In real-life applications, metric spaces might be combined with graphs, i.e., the relationships between objects in a metric space can be modeled as a graph, resulting in geo-social networks, citation graphs, social media graphs, to name but a few. Motivated by this, we introduce a new notion called the quasi-metric graph (see Definition 1 for details) that connects metric data using a graph, and investigate similarity search (including range query and k nearest neighbor (kNN) search) on quasi-metric graphs. Here, we need to consider the metric data similarity and the graph similarity simultaneously. In the following, we give three representative examples.
Application 1. (Geo-Social Network). As illustrated in Fig. 1(a), a static geo-social network is an undirected graph where each vertex denotes a user and each edge indicates that two connected users are friends. The geo-social network allows users to capture their geographic locations and share them in the social network via an operation called check-in. Here, similarity search can help a user find candidates to take part in an event. In this case, candidates are the friends nearest to the query user, i.e., both the social and the geographic distances are considered.
Application 2. (Citation Graph). As depicted in Fig. 1(b), a static citation graph is a directed graph in which each vertex represents a publication and each edge denotes a citation from the current publication to another. Here, similarity search can help users find publications related to a specified one. In this case, both the similarity (e.g., Jaccard distance, tf-idf) between the features of the publications and the shortest path distance in the citation graph are considered for finding related publications.
Application 3. (Social Image Graph). As shown in Fig. 1(c), a static social image graph is an undirected graph that connects images via social relationships. Here, similarity search can help users find the images that they might be interested in. In this case, both the similarity (e.g., L_p-norm, SIFT) between the features of images and the social distances are considered for image recommendation.
Most existing efforts on similarity search consider only metric spaces [1]-[3] or traditional graphs [4], [5], separately. Nevertheless, they are inefficient at supporting combined quasi-metric graphs, as also demonstrated in our experiments. Recently, attributed graphs were introduced [6]-[9], which differ from quasi-metric graphs as follows: (i) Vertices in an attributed graph may have any type of attribute data, while vertices in a quasi-metric graph are only associated with metric data, where the triangle inequality can be employed to accelerate the search. (ii) Existing approaches on attributed graphs either transform attributes into parts of graphs (e.g., an edge is added if two vertices have the same attributes [8], [9]) or utilize a probability model for query processing, which cannot be used to solve our studied problem, because we utilize a combined similarity metric (i.e., considering both metric data similarity and graph similarity) with a parameter to control the weights of the two similarities. In addition, some studies aim at handling specific quasi-metric graphs, e.g., geo-social networks, citation graphs, and social image graphs. Nonetheless, the algorithms designed for those particular graphs cannot tackle generic quasi-metric graphs.
A naïve solution for similarity search on quasi-metric graphs is to compute the metric data similarity and graph similarity between every vertex in the graph and the query vertex. Unfortunately, this is inefficient due to a huge amount of superfluous metric data and graph similarity computations.
To support efficient similarity search on quasi-metric graphs, two challenges have to be addressed. The first challenge is how to reduce the number of metric data similarity computations. Distance computation is one of the most expensive operations in metric spaces that we would like to avoid. As a result, we present several filtering techniques based on the triangle inequality. The second challenge is how to reduce the number of graph similarity computations. The graph similarity can be calculated as the shortest path distance with time complexity O(|V|^2) (|V| denotes the number of vertices in the graph), which is costly. To avoid unnecessary graph similarity computations, we traverse the graph in a best-first or breadth-first paradigm, so that we can obtain all the needed graph similarities by traversing the graph only once (i.e., only one shortest path distance computation is needed). In brief, the key contributions of this paper are summarized as follows:
• We introduce the notion of the quasi-metric graph, which connects metric data using a graph, and explore similarity search on quasi-metric graphs with a simple but effective combined similarity metric.
• We propose two efficient approaches that follow the best-first and breadth-first paradigms to answer similarity search on quasi-metric graphs, in which several filtering techniques are developed based on the triangle inequality to boost the search.
• We conduct extensive experiments using three real datasets, compared with three baseline algorithms, to verify the effectiveness and efficiency of our proposed algorithms.
The rest of the paper is organized as follows. Section II reviews related work. Section III formalizes our problem and presents several pruning and validating lemmas. Section IV elaborates three baseline methods. Section V proposes two efficient approaches for supporting similarity search on quasi-metric graphs. Considerable experimental results and our findings are reported in Section VI. Finally, Section VII concludes the paper with some directions for future work.

II. RELATED WORK
In this section, we review existing work on similarity search in metric spaces and on graphs, respectively.

A. SIMILARITY SEARCH IN METRIC SPACES
Similarity search (including range query and k nearest neighbor (kNN) retrieval) in metric spaces has been well surveyed in the literature [1]-[3]. More specifically, two broad categories of metric indexes exist that aim to accelerate similarity search in metric spaces, namely, compact partitioning methods and pivot-based approaches. The former partition the space as compactly as possible and try to prune unqualified partitions during search. BST [10], [11], GHT [12], [13], SAT [14], the M-tree family [15]-[17], D-index [18], eD-index [19], the LC family [20]-[22], and BP [23] all belong to this category. Methods of the other category store precomputed distances from every object in the database to a set of pivots, and utilize these distances together with the triangle inequality to prune unqualified objects during search. BKT [24], AESA [25], [26], EP [27], FQT [28], VPT [29], [30], and the Omni-family [31] all belong to this category. Recently, hybrid methods that combine compact partitioning with the use of pivots have been presented. The PM-tree [32] utilizes cut-regions defined by pivots to improve query processing on the M-tree. The M-Index [33] generalizes the iDistance technique to metric spaces, compacting the objects by using precomputed distances to their closest pivots. The SPB-tree [34] integrates the pivot-mapping method with the space-filling curve technique to further improve efficiency.
Since the shortest path distance (i.e., the graph distance) on a graph with non-negative edge weights satisfies the triangle inequality but does not satisfy symmetry when the graph is directed, the quasi-metric graph using the combined similarity metric (containing metric data similarity and graph similarity) can be regarded as a general quasi-metric space [35]. Note that techniques in general metric spaces are usually based on the triangle inequality and thus can be used for quasi-metric graphs. However, the above similarity search algorithms designed for metric spaces still need combined similarity computations (including metric data similarity and graph similarity computations) for unpruned objects, incurring lots of unnecessary graph similarity computations, which is also confirmed by our experiments.

B. SIMILARITY SEARCH ON GRAPHS
Many measurements have been presented to define the similarity between two vertices in graphs, e.g., the shortest path distance [36]-[38], SimRank [39], [40], and Personalized PageRank (PPR) [41], [42], to name just a few. The shortest path distance is an intuitive graph distance that measures how close one vertex is to another. SimRank considers two objects to be similar if they are referenced by similar objects. PPR utilizes random walks to estimate each vertex's similarity score w.r.t. a query vertex. For the sake of simplicity, we utilize the shortest path distance to define graph similarity in this paper, while other similarity metrics will be investigated as a direction of our future work.
There are many previous studies addressing the problem of shortest path distance computation. A landmark-based method [38] for shortest path distance estimation preselects a subset of vertices as landmarks and precomputes the shortest path distances between each vertex in the graph and those landmarks; thus, the shortest path distance between a pair of vertices can be estimated by combining the precomputed distances. Pruned landmark labeling (PLL) [36] is one of the state-of-the-art exact algorithms. PLL precomputes distance labels for all vertices by performing a pruned breadth-first search from every vertex, and then a shortest path distance query for any pair of vertices can be answered exactly using the distance labels. Nevertheless, neither the aforementioned algorithms designed for shortest path distance computation nor the top-k algorithms [4], [5] designed for similarity search on traditional graphs can be directly applied to our studied problem, because the metric data similarities between vertices are ignored.
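As a concrete illustration, the landmark-based estimation described above reduces to a few lines once the landmark distances are precomputed (the function name and list-based layout are our own; [38] does not prescribe this interface):

```python
import math

def landmark_upper_bound(d_s, d_t):
    """Estimate d(s, t) from precomputed landmark distances: by the
    triangle inequality, d(s, t) <= min over landmarks l of
    d(s, l) + d(l, t). d_s[i] and d_t[i] are the precomputed
    distances from s and t to the i-th landmark."""
    return min((ds + dt for ds, dt in zip(d_s, d_t)), default=math.inf)

# with two landmarks at distances [2, 5] from s and [3, 4] from t,
# the best route through a landmark costs min(2 + 3, 5 + 4) = 5
print(landmark_upper_bound([2, 5], [3, 4]))  # 5
```

The estimate is only an upper bound; PLL-style labelings tighten it to an exact answer by choosing the labeled hubs carefully.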
Recently, attributed graphs have been proposed [6]-[9], in which every vertex has a set of attributes. They differ from quasi-metric graphs. Specifically, vertices in an attributed graph can have any type of attribute data, whereas vertices in a quasi-metric graph are associated with metric data, where the triangle inequality can be utilized to accelerate the search. Most existing efforts on attributed graphs either transform the attributes into parts of graphs (e.g., building an additional edge to connect two vertices that have the same attributes [9], and then weighting the edge between two vertices by using the attribute similarity [43]) or use a probability model [7] to model an attributed graph. However, in this paper, we utilize a combined similarity metric, and thus, existing methods for attributed graphs cannot solve our problem.
In addition, studies on specific quasi-metric graphs [8], [44]-[47], such as geo-social networks [44], [45], citation graphs [46], and social media graphs [47], have also been conducted. Nonetheless, they are designed for particular quasi-metric graphs, i.e., the techniques used to improve search exploit the characteristics of the specific quasi-metric graphs, and hence, they cannot be applied to the general case.

III. PROBLEM FORMULATION
In this section, we first present the definition of the quasi-metric graph; then, we formalize range query and kNN search based on the quasi-metric graph. Finally, we present several pruning and validating lemmas to accelerate similarity search on the quasi-metric graph. Table 1 summarizes the symbols frequently used throughout this paper.

A. QUASI-METRIC GRAPH
Before defining quasi-metric graph, we first review metric space and graph, respectively.
Metric space [35]. A metric space is denoted as a tuple (M, d_M), in which M is an object domain and d_M is a metric distance function to measure the similarity between two objects in M. The metric distance function d_M has four properties: (1) symmetry: d_M(q, o) = d_M(o, q); (2) non-negativity: d_M(q, o) ≥ 0; (3) identity: d_M(q, o) = 0 iff q = o; and (4) triangle inequality: d_M(q, o) ≤ d_M(q, p) + d_M(p, o). A quasi-metric [35] is similar to a metric; the only difference is that a quasi-metric does not possess the symmetry axiom (i.e., d(q, o) ≠ d(o, q) is allowed).
Graph. A graph is denoted as G(V, E, w), where V is a set of vertices, E is a set of edges, and w is an edge weight function that assigns a non-negative weight w(e_i) to each edge e_i ∈ E. The graph distance d_G(u, v) between two vertices u and v is defined as the shortest path distance from u to v. Note that d_G(u, v) satisfies all the properties defined in the metric space except for symmetry when the graph is directed, and hence, it is a quasi-metric [35] distance.
By combining the metric space and the graph, we introduce the quasi-metric graph as follows.
Definition 1. (Quasi-Metric Graph). A quasi-metric graph is denoted as MG(V, M_v, E, w), where V is a set of vertices, M_v is a set of vertex-specific metric data, i.e., each vertex v in V is associated with metric data v.data, E is a set of edges, and w is an edge weight function. We define a combined distance to measure the similarity between two vertices u and v in the quasi-metric graph, i.e., d(u, v) = α × d_G(u, v) + (1 − α) × d_M(u, v), where the parameter α (0 < α < 1) is employed to control the weights of the graph similarity and the metric data similarity. Obviously, the combined distance d(u, v) is a quasi-metric distance, since d_M(u, v) is a metric distance, d_G(u, v) is a quasi-metric distance, and a weighted sum of the two preserves the quasi-metric properties.
Example 1. Consider a quasi-metric graph MG(V, M_v, E, w), i.e., a geo-social network, in Fig. 2(a), where V represents the set of users, E represents a set of friendships between the users, w(e) is equal to 1 for any edge e ∈ E, and the metric data v.data associated with each vertex v denotes the corresponding location of v (e.g., v_2.data = (1, 1)). Here, the L_2-norm is used as d_M to measure the distance between the locations of users, and the shortest path distance is used as d_G to measure the relationship between users.
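To make the combined distance concrete, the following sketch evaluates d(u, v) = α × d_G(u, v) + (1 − α) × d_M(u, v) on a toy weighted graph with 2D locations (the graph and coordinates below are invented for illustration; they are not the ones in Fig. 2):

```python
import heapq
import math

def shortest_path_dist(adj, src, dst):
    """d_G: Dijkstra over an adjacency dict {u: [(v, weight), ...]}."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return math.inf

def combined_dist(adj, data, u, v, alpha):
    """d(u, v) = alpha * d_G(u, v) + (1 - alpha) * d_M(u, v),
    with d_M taken as the L2-norm between vertex locations."""
    return alpha * shortest_path_dist(adj, u, v) + \
           (1 - alpha) * math.dist(data[u], data[v])

# toy geo-social network: unit edge weights, 2D check-in locations
adj = {"a": [("b", 1)], "b": [("a", 1), ("c", 1)], "c": [("b", 1)]}
data = {"a": (0, 0), "b": (3, 4), "c": (0, 1)}
print(combined_dist(adj, data, "a", "c", 0.5))  # 0.5*2 + 0.5*1 = 1.5
```

Note that computing d_G this way traverses the graph per query pair, which is exactly the cost that the methods in Section V avoid.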

B. SIMILARITY SEARCH ON QUASI-METRIC GRAPHS
Based on the quasi-metric graph, we formally define similarity search, including range query and k nearest neighbor (kNN) query, as stated below.
Definition 2. (Range Query on Quasi-Metric Graph). Given a quasi-metric graph MG(V, M_v, E, w), a query vertex q, and a search radius r, a range query on the quasi-metric graph finds the vertices v in V that are within distance r of q, i.e., RQ(q, r) = {v | v ∈ V ∧ d(q, v) ≤ r}.
Definition 3. (kNN Query on Quasi-Metric Graph). Given a quasi-metric graph MG(V, M_v, E, w), a query vertex q, and an integer k, a kNN query on the quasi-metric graph finds the k vertices in V that are most similar to q, sorted in ascending order of their combined distances w.r.t. q, i.e., kNNQ(q, k) ⊆ V such that |kNNQ(q, k)| = k and d(q, u) ≤ d(q, v) for any u ∈ kNNQ(q, k) and v ∈ V − kNNQ(q, k).
Consider the example depicted in Fig. 2(a). Suppose α is equal to 0.5; a range query on the quasi-metric graph MG retrieves the vertices whose distances to a query vertex v_5 are within 2, i.e., RQ(v_5, 2) = {v_1, v_2}. A 2NN (k = 2) query on the quasi-metric graph MG returns the 2 vertices most similar to the query vertex v_5, i.e., kNNQ(v_5, 2) = {v_2, v_1}. It is worth noting that a kNN query can be regarded as a range query if the combined distance from a query vertex q to its k-th nearest neighbor, denoted as ND_k, is known in advance.

C. PRUNING QUASI-METRIC GRAPH
Considering that distance calculations in metric spaces are usually expensive, the pivot mapping technique is employed to avoid unnecessary distance computations.
Pivot mapping. Given a pivot set P = {p_1, p_2, ..., p_l}, the objects in a metric space can be mapped to data points in an l-dimensional vector space, i.e., an object o is mapped to the point φ(o) = ⟨d_M(o, p_1), d_M(o, p_2), ..., d_M(o, p_l)⟩. Consider the example in Fig. 2 again, where the L_2-norm is used as d_M. If P = {v_1, v_6}, the original metric space (as illustrated in Fig. 2(a)) is mapped to a two-dimensional vector space (as shown in Fig. 2(b)).
Definition 4. (Lower and Upper Bounds of d_M). Given a pivot set P, for a vertex v and a query vertex q, the lower and upper bounds of d_M(v, q) are defined as ld_M(v, q) = max_{p_i ∈ P} |d_M(v, p_i) − d_M(q, p_i)| and ud_M(v, q) = min_{p_i ∈ P} (d_M(v, p_i) + d_M(q, p_i)), which follow from the triangle inequality, i.e., ld_M(v, q) ≤ d_M(v, q) ≤ ud_M(v, q).
Based on the lower and upper bounds of d_M, the corresponding pruning and validating lemma is developed below.
Lemma 1. Given a range query with a query vertex q and a search radius r, a vertex v can be pruned if α × d_G(q, v) + (1 − α) × ld_M(q, v) > r, and v can be validated if α × d_G(q, v) + (1 − α) × ud_M(q, v) ≤ r.
Proof. Since ld_M(q, v) ≤ d_M(q, v) ≤ ud_M(q, v), if α × d_G(q, v) + (1 − α) × ld_M(q, v) > r, then d(q, v) > r, and thus, v can be pruned; if α × d_G(q, v) + (1 − α) × ud_M(q, v) ≤ r, then d(q, v) ≤ r, and hence, v can be validated. The proof completes.
Consider a range query with r = 2 and q = v_5 on the quasi-metric graph shown in Fig. 2(a), where P = {v_1, v_6} and α = 0.5. Vertices v_3 and v_6 to v_9 can be discarded by Lemma 1. Note that the distances d_M(v, p_i) from all the vertices v in the quasi-metric graph to the pivots p_i can be precomputed and stored. Consequently, if we compute the distances d_M(q, p_i) (p_i ∈ P) once, the lower and upper bound distances (i.e., ld_M(v, q) and ud_M(v, q)) from all the vertices v to the query vertex q can be obtained without any further metric distance computation. Nonetheless, for the vertices that cannot be pruned or validated using Lemma 1, we still need to compute their corresponding metric distances to the query vertex q for further verification.
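Once the pivot distances are available, the filtering of Lemma 1 reduces to a few comparisons; the sketch below (helper names are our own) returns the bounds of Definition 4 and the Lemma 1 decision:

```python
def pivot_bounds(d_v, d_q):
    """Definition 4: triangle-inequality bounds on d_M(v, q) from the
    precomputed pivot distances d_v[i] = d_M(v, p_i) and
    d_q[i] = d_M(q, p_i)."""
    ld = max(abs(dv - dq) for dv, dq in zip(d_v, d_q))
    ud = min(dv + dq for dv, dq in zip(d_v, d_q))
    return ld, ud

def lemma1_decision(d_g, ld_m, ud_m, alpha, r):
    """Lemma 1: validate, prune, or fall back to an exact d_M computation."""
    if alpha * d_g + (1 - alpha) * ud_m <= r:
        return "validated"
    if alpha * d_g + (1 - alpha) * ld_m > r:
        return "pruned"
    return "verify"
```

For example, with hypothetical pivot distances d_v = [2, 6] and d_q = [5, 1], the bounds are (5, 7); for d_G = 2, α = 0.5, and r = 2, the vertex is pruned, since 0.5 × 2 + 0.5 × 5 = 3.5 > 2.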
Since the combined distance function d = α × d_G + (1 − α) × d_M used for the quasi-metric graph is quasi-metric, it satisfies the triangle inequality. Hence, the whole quasi-metric graph can be mapped to data points in the vector space.
As an example, the quasi-metric graph in Fig. 2(a) with P = {v_1, v_6} and α = 0.5 can be mapped to data points as depicted in Fig. 2(c) (e.g., vertex v_5 is mapped to the point ⟨2, 5⟩). In addition, Definition 4 also applies after replacing d_M with d. Based on the lower and upper bounds of d, the corresponding pruning and validating lemma is presented below.
Lemma 2. Given a range query with a query vertex q and a search radius r, a vertex v can be pruned if ld(q, v) > r, and v can be validated if ud(q, v) ≤ r. Here, ld(q, v) and ud(q, v) represent the lower bound and the upper bound of the combined distance d(q, v), respectively.
Proof. Since the combined distance d(u, v) is a quasi-metric distance, it satisfies the triangle inequality. Hence, ld(q, v) ≤ d(q, v) ≤ ud(q, v). If ld(q, v) > r, then d(q, v) > r, and thus, v can be pruned. If ud(q, v) ≤ r, then d(q, v) ≤ r, and hence, v can be validated. The proof completes.
Again consider a range query with r = 2 and q = v_5 on the quasi-metric graph shown in Fig. 2(a). The difference between Lemma 1 and Lemma 2 is that both the metric distance and the graph distance are computed and stored together for Lemma 2, while the metric and graph distances are calculated separately for Lemma 1.

IV. BASELINE METHODS
To address similarity search on quasi-metric graphs, similarity query approaches on graphs or in metric spaces can be adapted accordingly. In the following, we elaborate three baseline methods, namely, the pruned landmark labeling based method, the M-tree based method, and the SPB-tree based method.

A. PRUNED LANDMARK LABELING BASED METHOD
Pruned Landmark Labeling (PLL) [36] is the state-of-the-art method to compute the shortest path distance. It provides two functions, Landmark_ub and Landmark, for computing the upper bound and the exact shortest path distance, respectively. Specifically, PLL precomputes a distance label for every vertex v, denoted as L(v), by performing a pruned breadth-first search from each vertex. L(v) is a set of pairs (u, δ_uv), where u is a vertex and δ_uv is the shortest path distance from u to v. A shortest path distance query between vertices s and t can be computed as min{δ_sw + δ_wt | (w, δ_sw) ∈ L(s), (w, δ_wt) ∈ L(t)}. For example, Table 2 shows the PLL index for the quasi-metric graph in Fig. 2(a), from which the shortest path distance between any two vertices can be computed. To address similarity search on quasi-metric graphs, a pruned landmark labeling based method is developed. It traverses every vertex in the quasi-metric graph in sequence. For every vertex v, the method first computes the upper bound of the shortest path distance between a query vertex q and the vertex v by invoking the function Landmark_ub. Next, the method validates v using the upper bound of the combined distance by Lemma 2.
If v cannot be validated, the method needs to compute the exact shortest path distance from q to v using the function Landmark, and then it prunes or validates v via Lemma 1. If v still cannot be pruned or validated, the metric distance between q and v is computed for the final verification. The pruned landmark labeling based method includes the PLL based Range query Algorithm (LRA) and the PLL based kNN query Algorithm (LNA).

Algorithm 1 PLL based Range Query Algorithm (LRA)
Input: a query vertex q, a search radius r, a quasi-metric graph MG(V, M_v, E, w), a parameter α
Output: the result set RQ(q, r) of a range query
 1: for each vertex v ∈ V do
 2:   ud_G(q, v) = Landmark_ub(q, v)
 3:   if α × ud_G(q, v) + (1 − α) × ud_M(q, v) ≤ r then // Validated by Lemma 2
 4:     insert v into RQ(q, r)
 5:   else
 6:     d_G(q, v) = Landmark(q, v)
 7:     if α × d_G(q, v) + (1 − α) × ud_M(q, v) ≤ r then // Validated by Lemma 1
 8:       insert v into RQ(q, r)
 9:     else if α × d_G(q, v) + (1 − α) × ld_M(q, v) ≤ r then // Not pruned by Lemma 1
10:       compute d_M(q, v)
11:       if d(q, v) ≤ r then
12:         insert v into RQ(q, r)
13: return RQ(q, r)
Algorithm 1 presents the pseudo-code of LRA. It takes as inputs a query vertex q, a search radius r, a parameter α, and a quasi-metric graph MG(V, M_v, E, w), and outputs the result set RQ(q, r) of a range query. LRA traverses the whole quasi-metric graph until all answer vertices are found (lines 1-12). For every vertex v in V, the algorithm first calls the Landmark_ub function to compute the upper bound ud_G(q, v) of the shortest path distance between the query vertex q and the vertex v. If α × ud_G(q, v) + (1 − α) × ud_M(q, v) ≤ r, the vertex v is inserted into the result set RQ(q, r) by Lemma 2 (lines 3-4); otherwise, LRA invokes the Landmark function to compute the exact shortest path distance d_G(q, v), validates v by Lemma 1 (lines 5-8), and, if v can be neither pruned nor validated, computes d_M(q, v) and inserts v into the result set if d(q, v) ≤ r by Lemma 1 (lines 9-12). Finally, the result set RQ(q, r) is returned (line 13).
For LNA, as ND_k is not known in advance for a kNN query, the kNN query is more complex than the range query. The differences between LNA and LRA are as follows. (i) LNA first sets the current k-th NN distance curND_k to infinity, and updates this value during the search until it reaches ND_k (i.e., the kNN vertices are found). (ii) LNA cannot validate vertices, because the search radius curND_k decreases during the search, i.e., lines 2-4 and lines 7-8 of Algorithm 1 do not work for LNA. (iii) LNA uses (1 − α) × ld_M(q, v) ≤ curND_k as a precondition before invoking the Landmark function to compute the shortest path distance.
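The label intersection that both Landmark and Landmark_ub rely on can be sketched as follows, with each label modeled as a dict {hub: distance} (this dict layout is our simplification of the pair lists in Table 2):

```python
import math

def pll_query(label_s, label_t):
    """2-hop query: d_G(s, t) = min{delta_sw + delta_wt} over hubs w
    appearing in both labels (infinity if the labels share no hub)."""
    if len(label_s) > len(label_t):      # scan the smaller label
        label_s, label_t = label_t, label_s
    best = math.inf
    for w, d_sw in label_s.items():
        d_wt = label_t.get(w)
        if d_wt is not None:
            best = min(best, d_sw + d_wt)
    return best

# hypothetical labels: the only shared hub is "v1", giving 0 + 3 = 3
print(pll_query({"v1": 0, "v2": 2}, {"v1": 3, "v4": 1}))  # 3
```

The query cost is linear in the label sizes, which is what makes the PLL based baseline cheaper per distance evaluation than re-traversing the graph.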

B. M-TREE BASED METHOD
Since the distance used for a quasi-metric graph is quasi-metric, it also satisfies the triangle inequality, and metric indexes can be directly employed to tackle similarity search on quasi-metric graphs.
Example 2. Fig. 3 shows an example of an M-tree indexing the quasi-metric graph in Fig. 2(a). An intermediate (i.e., non-leaf) entry (i.e., partition region) E in a root node (e.g., N_0) or a non-leaf node (e.g., N_1, N_2) records: (i) a center vertex E.v, which is a selected vertex in the subtree ST_E of E; (ii) a covering radius E.r, which is the maximal distance between the center vertex E.v and any vertex in its subtree ST_E; (iii) a parent distance E.PD, which equals the distance from E.v to the center vertex of its parent entry (since a root entry E (e.g., E_2) has no parent entry, E.PD = ∞); and (iv) an identifier E.ptr pointing to the root node of its subtree ST_E. In contrast, a leaf entry (i.e., vertex) v in a leaf node (e.g., N_3, N_6) records: (i) the vertex v, which stores the detailed information of v; (ii) an identifier vid representing v's identifier; and (iii) a parent distance v.PD, which equals the distance from v to the center vertex of v's parent entry.
Based on the M-tree, a new lemma is developed to prune unnecessary entries in the M-tree, as stated below.
Lemma 3. Given an M-tree and a range query with a query vertex q and a search radius r, for a non-leaf entry E in the M-tree, if d(q, E.v) > E.r + r, then no vertex u in E can be in the final result set RQ(q, r), and thus, E can be pruned safely.
Proof. For any vertex u in a non-leaf entry E, if d(q, E.v) > E.r + r, then d(q, u) ≥ d(q, E.v) − d(u, E.v) > E.r + r − d(u, E.v) due to the triangle inequality. According to the definition of the M-tree, d(u, E.v) ≤ E.r, and hence, d(q, u) > r. Therefore, no vertex u in E can be in the final result set RQ(q, r), i.e., E can be pruned away safely, which completes the proof.
Consider the example in Fig. 3. For a range query with q = v_2 and r = 2, E_5 and E_6 can be pruned by Lemma 3, because d(v_2, E_5.v) = 5 > E_5.r + 2 and d(v_2, E_6.v) = 3.8 > E_6.r + 2. To avoid unnecessary distance computations, we can further utilize the triangle inequality with the parent distances stored in the M-tree to prune unqualified entries, as stated in Lemma 4.
Lemma 4. Given an M-tree, let E_P be the parent entry of an entry E. For a range query with a query vertex q and a search radius r, if |d(q, E_P.v) − E.PD| > E.r + r, then no vertex v in the entry E can be in the final result set RQ(q, r), and hence, E can be pruned safely.

Algorithm 2 M-tree based Range Query Algorithm (MRA)
Input: a query vertex q, a search radius r, a parameter α, an M-tree M built on the quasi-metric graph MG(V, M_v, E, w)
Output: the result set RQ(q, r) of a range query
 1: push all root entries of M into a queue H
 2: while H ≠ ∅ do
 3:   pop the top entry E from H
 4:   if E points to a non-leaf node then
 5:     for each subentry E_S in E do
 6:       if E_S cannot be pruned by Lemma 3 then
 7:         if E_S cannot be pruned by Lemma 4 then
 8:           push E_S into H
 9:   else // E points to a leaf node
10:     for each subentry E_S in E do
11:       if E_S cannot be pruned by Lemma 4 then
12:         compute d(q, E_S) = α × d_G(q, E_S) + (1 − α) × d_M(q, E_S)
13:         if d(q, E_S) ≤ r then
14:           insert E_S into RQ(q, r)
15: return RQ(q, r)
Note that if E is a leaf entry, then E.r = 0. Take Fig. 3 as an example again: for a range query with q = v_2 and r = 1, Lemma 4 can discard unqualified entries using only the precomputed parent distances, without any additional distance computation.
Based on the M-tree, we propose the M-tree based Range query Algorithm (MRA) and the M-tree based kNN query Algorithm (MNA). Algorithm 2 presents the pseudo-code of MRA. It takes as inputs a query vertex q, a search radius r, an M-tree M built on the quasi-metric graph MG(V, M_v, E, w), and a parameter α, and outputs the result set RQ(q, r). Initially, MRA pushes all the root entries of M into a queue H (line 1). Thereafter, a while-loop is performed until H is empty (lines 2-14). In each iteration, the algorithm first pops the top entry E from H (line 3). Next, if E points to a non-leaf node, it inserts the subentries E_S that cannot be pruned by Lemma 3 and Lemma 4 into H (lines 4-8). Otherwise, if E points to a leaf node, it prunes subentries (i.e., vertices) E_S by Lemma 4 (lines 9-11). For the unpruned subentries E_S, MRA computes the combined distance d(q, E_S) (line 12). If d(q, E_S) ≤ r, E_S is inserted into RQ(q, r) (lines 13-14). Finally, the algorithm returns the result set RQ(q, r) (line 15).
The differences between MNA and MRA are as follows. (i) MNA uses a current k-th NN distance curND_k for pruning, and updates the corresponding value using unpruned vertices. (ii) MNA visits the entries E in the M-tree in ascending order of their minimal distances to the query vertex q (denoted as MIND(q, E)). Thus, E can be safely pruned if MIND(q, E) ≥ curND_k, i.e., for MNA, H is a priority queue in which all entries E are sorted in ascending order of MIND(q, E).
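The pruning tests of Lemmas 3 and 4 used by MRA and MNA reduce to cheap scalar comparisons; a minimal sketch (the helper names and the example covering radius are ours, since E_5.r is not stated in the text):

```python
def pruned_by_lemma3(d_q_center, cover_r, r):
    """Lemma 3: prune entry E when d(q, E.v) > E.r + r, since every
    vertex u in E satisfies d(q, u) >= d(q, E.v) - E.r > r."""
    return d_q_center > cover_r + r

def pruned_by_lemma4(d_q_parent, parent_dist, cover_r, r):
    """Lemma 4: prune using only the stored parent distance E.PD,
    i.e., when |d(q, E_P.v) - E.PD| > E.r + r; no new distance to
    E.v is computed."""
    return abs(d_q_parent - parent_dist) > cover_r + r

# Lemma 3 check for E_5 in Fig. 3 with d(v2, E_5.v) = 5,
# assuming a covering radius E_5.r = 1 for illustration:
print(pruned_by_lemma3(5, 1, 2))  # True, as 5 > 1 + 2
```

Note that Lemma 4 is tried first in the leaf case precisely because it needs no fresh distance computation, only values already stored in the tree.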

C. SPB-TREE BASED METHOD
The SPB-tree [34] is the state-of-the-art metric index among the hybrid methods, which combine compact partitioning with the use of pivots. The SPB-tree can be directly built on the quasi-metric graph to answer similarity search.
The SPB-tree utilizes a two-stage mapping, i.e., pivot mapping (as discussed in Section III-C) and space-filling curve (SFC) mapping, to map vertices in the vector space to SFC values (i.e., integers) in a one-dimensional space while maintaining spatial proximity. Then, a B+-tree with minimum bounding box (MBB) information is used to index the SFC values.
Example 3. Fig. 4 depicts an example of the SPB-tree, where Fig. 4(a) illustrates an SPB-tree indexing the quasi-metric graph in Fig. 2(a) and Fig. 4(b) shows the space-filling curve (SFC) mapping after the pivot mapping depicted in Fig. 2(c). For instance, φ(v_5) = ⟨2, 5⟩ after the pivot mapping, and SFC(φ(v_5)) = 29 after the SFC (i.e., Hilbert curve) mapping. The SPB-tree shown in Fig. 4(a) contains three parts, i.e., the pivot table that stores the selected vertices (e.g., v_1 and v_6) used to map the metric space to a vector space, the B+-tree, and the RAF, which stores the vertices in ascending order of SFC values as they appear in the B+-tree. Note that each leaf entry E in a leaf node (e.g., N_3) stores the SFC value of a vertex, while each non-leaf entry E keeps an MBB E.MBB = {[a_i, b_i] | p_i ∈ P} that bounds the mapped points in its subtree, i.e., a_i ≤ d(u, p_i) ≤ b_i for every vertex u in the subtree of E. Based on the MBB, a lower bound of the combined distance is defined (Definition 5) as ld(E, q) = max_{p_i ∈ P} max{a_i − d(q, p_i), d(q, p_i) − b_i, 0}, where a_i and b_i are obtained from E.MBB. Consider the example depicted in Fig. 4, where P = {v_1, v_6}. According to Definition 5, ld(E_6, v_5) = 4, as E_6.MBB = {[6, 6], [0, 1]}. Based on this newly defined lower bound distance, we develop Lemma 5 to prune unnecessary entries.
Lemma 5. Given an SPB-tree, a range query with a query vertex q and a search radius r, for a non-leaf entry E in the SPB-tree, if ld(E , q) > r, then E can be pruned safely.
Proof. ∀ u ∈ E, we have a_i ≤ d(u, p_i) ≤ b_i. According to Definition 4, ld(u, q) = max_{p_i ∈ P} |d(u, p_i) − d(q, p_i)| ≥ ld(E, q). Hence, if ld(E, q) > r, then d(u, q) ≥ ld(u, q) ≥ ld(E, q) > r for any u ∈ E, and thus, E can be pruned safely.

Algorithm 3 SPB-tree based Range Query Algorithm (SRA)
Input: a query vertex q, a search radius r, a parameter α, an SPB-tree S built on the quasi-metric graph MG(V, M_v, E, w)
Output: the result set RQ(q, r) of a range query
 1: push all root entries of S into a queue H
 2: while H ≠ ∅ do
 3:   pop the top entry E from H
 4:   if E points to a non-leaf node then
 5:     for each subentry E_S in E do
 6:       if ld(E_S, q) ≤ r then // Not pruned by Lemma 5
 7:         push E_S into H
 8:   else // E points to a leaf node
 9:     for each subentry E_S in E do
10:       if ud(E_S, q) ≤ r then // Validated by Lemma 2
11:         insert E_S into RQ(q, r)
12:       else if ld(E_S, q) ≤ r then // Not pruned by Lemma 2
13:         compute d(q, E_S)
14:         if d(q, E_S) ≤ r then
15:           insert E_S into RQ(q, r)
16: return RQ(q, r)
Back to the example illustrated in Fig. 4: for a range query with q = v_5 and r = 2, the non-leaf entry E_6 can be pruned by Lemma 5, because ld(E_6, v_5) = 4 > 2.
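The MBB-based lower bound can be computed per pivot dimension as the gap between the query coordinate and the interval [a_i, b_i], maximized over pivots; the sketch below (function name is ours) reproduces ld(E_6, v_5) = 4 from the example:

```python
def mbb_lower_bound(mbb, q_coords):
    """ld(E, q) = max_i max{a_i - d(q, p_i), d(q, p_i) - b_i, 0},
    where mbb[i] = (a_i, b_i) bounds the i-th pivot distance in E's
    subtree and q_coords[i] = d(q, p_i) is the query's i-th pivot
    distance."""
    return max(max(a - dq, dq - b, 0) for (a, b), dq in zip(mbb, q_coords))

# E_6.MBB = {[6, 6], [0, 1]} and phi(v_5) = <2, 5> give ld = 4 > r = 2,
# so E_6 is pruned by Lemma 5
print(mbb_lower_bound([(6, 6), (0, 1)], (2, 5)))  # 4
```

The `0` term in the inner max handles the case where the query coordinate falls inside [a_i, b_i], i.e., that pivot contributes no lower bound.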
Based on the SPB-tree, we propose the SPB-tree based Range query Algorithm (SRA) and the SPB-tree based kNN query Algorithm (SNA). Algorithm 3 depicts the pseudo-code of SRA. It takes as inputs a query vertex q, a search radius r, a parameter α, and an SPB-tree S built on the quasi-metric graph MG(V, M_v, E, w), and outputs the result set RQ(q, r) of a range query. First of all, SRA pushes all the root entries of S into a queue H (line 1). Thereafter, a while-loop is performed until H is empty (lines 2-15). In every iteration, the algorithm first pops the top entry E from H. Next, if E points to a non-leaf node, it inserts the subentries E_S that cannot be pruned by Lemma 5 into H (lines 4-7). Otherwise, if E points to a leaf node, SRA validates or prunes the subentries (i.e., vertices) E_S by Lemma 2, and inserts each unpruned subentry E_S into the result set RQ(q, r) if d(q, E_S) ≤ r (lines 8-15). Finally, the algorithm returns RQ(q, r) (line 16).
The differences between SNA and SRA are as follows. (i) SNA uses a current k-th NN distance curND_k for pruning, and updates the corresponding value using unpruned vertices. (ii) SNA visits entries E in the SPB-tree in ascending order of their lower bound distances ld(E, q) to the query vertex q, i.e., for SNA, all the entries E in the queue H are sorted in ascending order of ld(E, q). (iii) SNA cannot validate vertices, since the search radius curND_k decreases during the search, i.e., lines 10-11 of Algorithm 3 do not work for SNA.

D. DISCUSSION
In this subsection, we analyze query processing costs for all baseline methods/algorithms.
In general, shortest path distance computation and metric distance computation are the main operations in query processing, and thus, their costs dominate the query cost. The pruned landmark labeling based method needs to traverse the whole quasi-metric graph to find the final result. As analyzed in [36], each shortest path distance computation between a pair of vertices s and t can be answered in time linear in the sizes of the labels L(s) and L(t). The M-tree based method and the SPB-tree based method can prune unqualified vertices using Lemmas 2 through 5. For every unpruned vertex u, the M-tree based or SPB-tree based method first needs to traverse the M-tree or the SPB-tree to locate the vertex u, which takes O(log|V|) time. Next, the shortest path distance and the metric distance between u and a query vertex q are evaluated, with time complexities O(|E| + |V|log|V|) and f(m), respectively; the total cost is the sum of these terms over all unpruned vertices. Clearly, the M-tree based or SPB-tree based method is more expensive than the pruned landmark labeling based method, due to the high cost of computing the shortest path distances (i.e., traversing the entire quasi-metric graph to compute every shortest path distance). In addition, the SPB-tree achieves better pruning ability than the M-tree, and hence, the SPB-tree based method is more efficient, which is also verified in Section VI.

V. GRAPH TRAVERSING METHODS
In this section, we propose two simple but efficient methods for answering similarity search on quasi-metric graphs, i.e., the best-first method and the breadth-first method. These methods visit the vertices in ascending order of their shortest path distances w.r.t. a query vertex q, in order to terminate the computation earlier, and utilize the triangle inequality to filter out unnecessary verifications.
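The triangle-inequality filtering used by both methods relies on pivot-based bounds on dM (cf. Definition 4 and the pivot mapping illustrated in Fig. 2). As a minimal Python sketch, assuming each vertex's distances to a shared pivot list are precomputed in a table (the `pivot_dist` name and layout are ours, for illustration only):

```python
def ld_m(q, v, pivot_dist):
    """Lower bound on d_M(q, v) from the triangle inequality:
    d_M(q, v) >= |d_M(q, p) - d_M(v, p)| for every pivot p."""
    return max(abs(dq - dv) for dq, dv in zip(pivot_dist[q], pivot_dist[v]))

def ud_m(q, v, pivot_dist):
    """Upper bound on d_M(q, v): d_M(q, v) <= d_M(q, p) + d_M(v, p)."""
    return min(dq + dv for dq, dv in zip(pivot_dist[q], pivot_dist[v]))
```

With a precomputed pivot table such as the one in Fig. 5(b), these bounds let a query decide membership without evaluating the exact metric distance whenever the bound already settles the comparison.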

A. BEST-FIRST METHOD
To avoid traversing the whole quasi-metric graph multiple times (i.e., every shortest path distance computation needs to traverse the quasi-metric graph once) and to avoid the index construction cost, we propose a simple yet robust best-first traversal method. It visits the vertices in ascending order of their shortest path distances to a query vertex q, i.e., the smaller the shortest path distance from the query vertex q to a vertex v, the earlier v is verified as a potential answer vertex. Moreover, Lemma 1 can be employed to prune or validate vertices.

Algorithm 4 Best-First Range Query Algorithm (BeRA)
Input: a query vertex q, a search radius r, a quasi-metric graph MG(V, Mv, E, w), a parameter α
Output: the result set RQ(q, r) of a range query
1: for each vertex v ∈ V do
2:     v.flag = false; dG(q, v) = ∞
3: q.flag = true; dG(q, q) = 0
4: push q into a queue H1
5: while H1 ≠ ∅ do
6:     pop the top vertex v from H1
7:     for each adjacent vertex u of v do
8:         if u.flag = false and dG(q, v) + w(v, u) < dG(q, u) then
9:             dG(q, u) = dG(q, v) + w(v, u)
10:            push u into a priority queue H2 sorted in ascending order of the current shortest path distance dG(q, u)
11:    pop the vertex s with the minimal dG(q, s) from H2
12:    push s into H1; set s.flag = true
13:    if α × dG(q, s) ≤ r then
14:        if α × dG(q, s) + (1 − α) × udM(q, s) ≤ r then // Validated by Lemma 1
15:            insert s into RQ(q, r)
16:        else if α × dG(q, s) + (1 − α) × ldM(q, s) ≤ r then // Pruned by Lemma 1
17:            compute dM(q, s)
18:            if d(q, s) ≤ r then
19:                insert s into RQ(q, r)
20:    else
21:        return RQ(q, r)
22: return RQ(q, r)
The best-first method includes Best-first Range query Algorithm (BeRA) and Best-first kNN query Algorithm (BeNA).
Algorithm 4 presents the pseudo-code of BeRA. It takes as inputs a query vertex q, a search radius r, a parameter α, and a quasi-metric graph MG(V, Mv, E, w), and outputs the result set RQ(q, r). To begin with, for each vertex v in V, BeRA initializes the variables dG(q, v) and v.flag, which denotes whether the exact dG(q, v) has been computed, and pushes q into a queue H1 (lines 1-4). Thereafter, a while-loop is performed (lines 5-21). In each iteration, BeRA first pops the top vertex v from H1, and for every adjacent vertex u of v whose exact dG(q, u) has not been calculated, BeRA updates the current dG(q, u) and pushes u with the updated dG(q, u) into a priority queue H2, in which vertices are sorted in ascending order of their current shortest path distances (lines 6-10). Next, BeRA pops the top vertex s with the minimal dG(q, s) from H2, pushes s into H1 for further traversal, and sets s.flag to true because the exact dG(q, s) has been computed (lines 11-12). In the sequel, if α × dG(q, s) ≤ r, BeRA proceeds to verify vertex s. If α × dG(q, s) + (1 − α) × udM(q, s) ≤ r, s is added to the result set RQ(q, r) by Lemma 1 (lines 14-15). Otherwise, if α × dG(q, s) + (1 − α) × ldM(q, s) ≤ r, BeRA computes dM(q, s) and inserts s into the result set RQ(q, r) if d(q, s) ≤ r by Lemma 1 (lines 16-19). Once α × dG(q, s) > r, BeRA stops and returns the result set RQ(q, r) due to the early termination condition presented by Lemma 7 in Section V-C (lines 20-21). Finally, after the whole iteration terminates, the algorithm returns the final result set RQ(q, r) (line 22).

Algorithm 5 Breadth-First Range Query Algorithm (BrRA)
Input: a query vertex q, a search radius r, a quasi-metric graph MG(V, Mv, E, w), a parameter α
Output: the result set RQ(q, r) of a range query
1: for each vertex v ∈ V do
2:     v.visit = false; dG(q, v) = ∞
3: q.visit = true; dG(q, q) = 0
4: push q into a queue H
5: while H ≠ ∅ do
6:     pop the top vertex v from H
7:     for each adjacent vertex u of v do
8:         if u.visit = false then
9:             dG(q, u) = dG(q, v) + 1; u.visit = true
10:            push u into H
11:            if α × dG(q, u) ≤ r then
12:                if α × dG(q, u) + (1 − α) × udM(q, u) ≤ r then // Validated by Lemma 1
13:                    insert u into RQ(q, r)
14:                else if α × dG(q, u) + (1 − α) × ldM(q, u) ≤ r then // Pruned by Lemma 1
15:                    compute dM(q, u)
16:                    if d(q, u) ≤ r then
17:                        insert u into RQ(q, r)
18:            else
19:                return RQ(q, r)
20: return RQ(q, r)
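For concreteness, a compact Python sketch of the best-first traversal is given below. It assumes the combined distance has the form d(q, v) = α·dG(q, v) + (1 − α)·dM(q, v), which matches the pruning conditions of Algorithm 4; `adj`, `d_metric`, and the optional bound functions `ld_m`/`ud_m` are illustrative stand-ins for the paper's interfaces, not a definitive implementation.

```python
import heapq
from math import inf

def bera(q, r, alpha, adj, d_metric, ld_m=None, ud_m=None):
    """Best-first range query sketch (cf. Algorithm 4, BeRA).
    adj maps a vertex to (neighbor, weight) pairs; d_metric(q, s) is the
    exact metric distance; ld_m/ud_m are optional Lemma 1 style bounds."""
    d_g = {q: 0.0}          # current shortest path distances d_G(q, .)
    done = {q}              # vertices whose exact d_G is known (the flag)
    result = []
    h2 = []                 # priority queue H2 keyed by current d_G
    frontier = [q]          # plays the role of queue H1
    while True:
        for v in frontier:                        # relax v's neighbors
            for u, w in adj.get(v, []):
                nd = d_g[v] + w
                if u not in done and nd < d_g.get(u, inf):
                    d_g[u] = nd
                    heapq.heappush(h2, (nd, u))
        while h2 and h2[0][1] in done:            # drop stale heap entries
            heapq.heappop(h2)
        if not h2:
            return result                         # graph exhausted
        dg_s, s = heapq.heappop(h2)
        if alpha * dg_s > r:                      # early termination (Lemma 7)
            return result
        done.add(s)
        frontier = [s]
        # verification via Lemma 1, cheapest check first
        if ud_m is not None and alpha * dg_s + (1 - alpha) * ud_m(q, s) <= r:
            result.append(s)                      # validated without d_metric
        elif ld_m is None or alpha * dg_s + (1 - alpha) * ld_m(q, s) <= r:
            if alpha * dg_s + (1 - alpha) * d_metric(q, s) <= r:
                result.append(s)
        # else: pruned by the lower bound, d_metric never evaluated
```

The lazy deletion of stale heap entries stands in for a decrease-key operation; the traversal settles one vertex per iteration, exactly as described above.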
The differences between BeNA and BeRA are as follows: (i) BeNA uses the current k-th NN distance curND_k instead of r as the search radius for pruning, and updates the corresponding value using unpruned vertices; and (ii) BeNA cannot validate vertices, as the current k-th NN distance curND_k decreases during the search, i.e., lines 14-15 of Algorithm 4 do not work for BeNA.
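The curND_k bookkeeping that distinguishes BeNA from BeRA can be maintained with a size-k max-heap; the function name and interface below are ours, a sketch for illustration only.

```python
import heapq

def update_curnd_k(heap_k, k, dist, v):
    """Insert candidate (dist, v) and return the current k-th NN distance
    curND_k (infinity until k candidates have been seen). heap_k is a
    max-heap simulated by negating distances."""
    if len(heap_k) < k:
        heapq.heappush(heap_k, (-dist, v))
    elif dist < -heap_k[0][0]:
        heapq.heapreplace(heap_k, (-dist, v))   # evict the current worst
    return -heap_k[0][0] if len(heap_k) == k else float('inf')
```

Each unpruned vertex tightens curND_k monotonically, which is exactly why validation (point (ii) above) is unsound for the kNN variants.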
Example 4. Fig. 5 illustrates an example of the graph traversing methods on the quasi-metric graph MG(V, Mv, E, w) depicted in Fig. 5(a), where w(ei) = 1 for every edge ei (1 ≤ i ≤ 9). Suppose the pivot set P is {v7, v8}, and the metric distance function is the L1-norm. The distances dM(vi, pj) from every vertex vi in the quasi-metric graph to each pivot pj (∈ P) can be precomputed and stored, as shown in Fig. 5(b). Consider a range query with q = v1 and r = 1, and set α to 0.5. Initially, BeRA pushes the query vertex v1 with its corresponding shortest path distance into a queue H1, resulting in H1 = {⟨v1, 0⟩}. Then, BeRA performs a while-loop to traverse the quasi-metric graph from the vertex v1 in a best-first manner.
Loop 1: BeRA pops ⟨v1, 0⟩ from H1; as v2 and v3 are the two adjacent vertices of v1, it updates dG(v1, v2) = dG(v1, v3) = 1 and pushes ⟨v2, 1⟩ and ⟨v3, 1⟩ into H2. It then pops ⟨v2, 1⟩ from H2 and pushes it into H1. Because Lemma 1 can neither validate nor prune v2, BeRA computes dM(v1, v2) and, since d(v1, v2) ≤ r, inserts v2 into the result set RQ(v1, 1).

Loops 2-4: The processing is similar to Loop 1.

Loop 5: As H1 ≠ ∅, BeRA pops ⟨v5, 2⟩ from H1 and updates dG(v1, v6) = 3, since v6 is an adjacent vertex of v5. Thereafter, BeRA pushes ⟨v6, 3⟩ into H2 and then pops ⟨v6, 3⟩ from H2. As α × dG(v1, v6) = 1.5 > r, BeRA stops traversing the quasi-metric graph due to the early termination condition, and returns the final result set RQ(v1, 1) = {v2, v3}.

B. BREADTH-FIRST METHOD
The best-first method can be applied to both weighted and unweighted quasi-metric graphs. Nonetheless, for unweighted quasi-metric graphs, a more efficient way to perform similarity search is a breadth-first traversal from the query vertex. This is because the best-first method verifies only one vertex in every iteration, whereas, due to the property of the unweighted graph, the breadth-first method obtains the shortest path distances between the query vertex and all the traversed vertices, and thus all the traversed vertices can be verified in each iteration, which boosts the search.
The breadth-first method contains the Breadth-first Range query Algorithm (BrRA) and the Breadth-first kNN query Algorithm (BrNA). Algorithm 5 depicts the pseudo-code of BrRA. Initially, for each vertex v in V, it initializes dG(q, v) and v.visit, which denotes whether v has been visited (lines 1-3), and pushes the query vertex q into a queue H (line 4). Thereafter, a while-loop is performed (lines 5-19). In every iteration, BrRA first pops the top vertex v from H. Next, for every adjacent vertex u of v that has not been visited, BrRA computes the exact dG(q, u), sets u.visit to true, and pushes u into the queue H for further evaluation (lines 7-10). If α × dG(q, u) ≤ r, BrRA verifies u using Lemma 1: if α × dG(q, u) + (1 − α) × udM(q, u) ≤ r, u is validated and added to RQ(q, r) (lines 11-13); otherwise, if α × dG(q, u) + (1 − α) × ldM(q, u) ≤ r, BrRA computes dM(q, u) and inserts u into the result set RQ(q, r) if d(q, u) ≤ r (lines 14-17). Once α × dG(q, u) > r, BrRA stops and returns the result set RQ(q, r) according to Lemma 7 proposed in Section V-C (lines 18-19). Finally, BrRA returns the result set RQ(q, r) (line 20).
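A compact Python sketch of the breadth-first traversal follows. It assumes the combined distance has the form d(q, v) = α·dG(q, v) + (1 − α)·dM(q, v), matching the pruning conditions of Algorithm 5; `adj`, `d_metric`, and the optional bound functions are illustrative names, not the paper's interface.

```python
from collections import deque

def brra(q, r, alpha, adj, d_metric, ld_m=None, ud_m=None):
    """Breadth-first range query sketch (cf. Algorithm 5, BrRA) for an
    unweighted quasi-metric graph. adj maps a vertex to its neighbors;
    ld_m/ud_m are optional Lemma 1 style bounds on d_metric."""
    d_g = {q: 0}            # visited vertices and their exact d_G(q, .)
    result = []
    h = deque([q])
    while h:
        v = h.popleft()
        for u in adj.get(v, []):
            if u in d_g:
                continue                         # u.visit = true already
            d_g[u] = d_g[v] + 1                  # exact on unit edge weights
            h.append(u)
            if alpha * d_g[u] > r:               # early termination (Lemma 7)
                return result
            # verification via Lemma 1, cheapest check first
            if ud_m is not None and alpha * d_g[u] + (1 - alpha) * ud_m(q, u) <= r:
                result.append(u)                 # validated without d_metric
            elif ld_m is None or alpha * d_g[u] + (1 - alpha) * ld_m(q, u) <= r:
                if alpha * d_g[u] + (1 - alpha) * d_metric(q, u) <= r:
                    result.append(u)
            # else: pruned by the lower bound
    return result
```

Because BFS on unit weights yields exact shortest path distances the moment a vertex is first reached, no priority queue is needed, which is precisely the advantage over the best-first variant discussed above.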
The differences between BrNA and BrRA are similar to those between BeNA and BeRA, and are thus omitted.
Example 5. Returning to Example 4, we illustrate BrRA using a range query with q = v1 and r = 1. As in Example 4, BrRA first pushes v1 into H, after which H = {⟨v1, 0⟩}. Thereafter, it starts a while-loop to traverse the quasi-metric graph in a breadth-first fashion.
Loop 1: BrRA first pops the top vertex v1 from H. Then, for the two unvisited adjacent vertices v2 and v3 of v1, the algorithm computes dG(v1, v2) = 1 and dG(v1, v3) = 1, and pushes v2 and v3 into H. Since d(v1, v2) = 0.7 < r and d(v1, v3) = 0.7 < r, v2 and v3 are added to the result set RQ(v1, 1). After the loop, H = {⟨v2, 1⟩, ⟨v3, 1⟩}.

Loops 2-4: The processing is similar to Loop 1 and hence skipped.

C. DISCUSSION
In this subsection, we first clarify the advantages of the best-first and breadth-first methods compared with the baseline methods, and then analyze their correctness and time complexities.
Although the baseline algorithms construct offline indexes that can be used to prune unqualified vertices, they still need to compute the shortest path distances between all unpruned vertices and the query vertex during the search, as mentioned in Section IV-D, which is costly due to traversing the quasi-metric graph multiple times. Using the best-first and breadth-first traversal paradigms, all the shortest path distances can be computed by traversing the quasi-metric graph only once. In addition, in most cases, both the best-first method and the breadth-first method traverse only part of the graph due to the early termination condition. For instance, in Example 4 and Example 5, the search space is bounded by the red dashed rectangle in Fig. 5(c), resulting in better search performance, as verified in Section VI-B.

To prove the correctness of the best-first method and the breadth-first method, we present two lemmas, as stated below.

Lemma 6. Given a quasi-metric graph MG(V, Mv, E, w) and a query vertex q, the best-first method computes the exact dG(q, u) when u is popped from the queue H2.
Proof. Let Ps = {q, e1, t1, ..., tm−1, em, u} be the current shortest path when u is popped from H2. Then t1, ..., tm−1 must be vertices that have already been popped from H2, and u is the vertex in H2 with the minimal dG(q, u). By contradiction, assume that Ps is not the exact shortest path. Then there exists an exact shortest path Ps′ that contains vertices which have not been popped from H2. Let s be the first vertex on Ps′ = {q, e1′, t1′, ..., s, ..., u} that has not been popped from H2, and let dG′(q, u) be the corresponding exact shortest path distance. Hence, dG(q, s) ≤ dG′(q, u) < dG(q, u), indicating that s, rather than u, would be the vertex in H2 with the minimal current shortest path distance, which contradicts the fact that u is popped from H2 with the minimal dG(q, u). The proof completes.
Lemma 7. Given a quasi-metric graph MG(V, Mv, E, w), a parameter α, and a range query with a query vertex q and a search radius r, the best-first and breadth-first methods can terminate and return the exact result RQ(q, r) once α × dG(q, u) > r, where u is the most recently visited vertex.
Proof. Since vertices are visited in ascending order of their shortest path distances w.r.t. the query vertex q, we have dG(q, v) ≥ dG(q, u) for any vertex v that has not yet been visited. Thus, if α × dG(q, u) > r, then α × dG(q, v) > r. Consequently, d(q, v) ≥ α × dG(q, v) > r, i.e., none of the unvisited vertices can be in the final result set RQ(q, r), and hence the best-first and breadth-first methods can stop and return the correct final result RQ(q, r). The proof completes.
Note that Lemma 7 is also applicable to kNN search by replacing r with curND_k. Clearly, Lemma 1, Lemma 6, and Lemma 7 together guarantee the correctness of the best-first and breadth-first methods.
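The safety of the termination condition can also be spot-checked empirically: since dM(q, v) ≥ 0, every answer satisfies α × dG(q, v) ≤ d(q, v) ≤ r, so a vertex with α × dG(q, v) > r can never qualify. Below is a small self-contained check on a random unweighted graph; all names and the toy setup are ours.

```python
import random
from collections import deque

def bfs_dists(adj, q):
    """Exact shortest path distances from q on an unweighted graph."""
    d = {q: 0}
    h = deque([q])
    while h:
        v = h.popleft()
        for u in adj[v]:
            if u not in d:
                d[u] = d[v] + 1
                h.append(u)
    return d

random.seed(7)
n, alpha, r = 30, 0.5, 1.5
adj = {v: set() for v in range(n)}
for _ in range(60):                     # random undirected edges
    a, b = random.sample(range(n), 2)
    adj[a].add(b); adj[b].add(a)
d_m = {v: random.random() * 2 for v in range(n)}  # stand-in metric distances to q = 0
d_g = bfs_dists(adj, 0)

# exhaustive answer set over all reachable vertices
exact = {v for v in d_g if v != 0 and alpha * d_g[v] + (1 - alpha) * d_m[v] <= r}
# vertices that Lemma 7 allows the traversal to skip entirely
pruned_away = {v for v in d_g if alpha * d_g[v] > r}
assert exact.isdisjoint(pruned_away)    # no answer is ever skipped
```

The assertion holds for any graph and any nonnegative metric distances, mirroring the proof of Lemma 7.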
Next, we present the time complexities of the best-first method and the breadth-first method.
The best-first method maintains two queues: one is used for graph traversal, and the other is a priority queue utilized to compute the shortest path distances. The best-first method traverses the vertices in ascending order of their shortest path distances w.r.t. the query vertex, in order to take advantage of the early termination condition. In the worst case, the cost of computing the shortest path distances between all the vertices and the query vertex is O(|E| + |V|log|V|), and thus the total time complexity of the best-first method is O(|E| + |V| × (log|V| + f(m))).

The breadth-first method only maintains one queue, used for graph traversal, and the shortest path distance between a vertex v and the query vertex is obtained as soon as v is visited. Thus, in the worst case, the cost of computing the shortest path distances between all the vertices and the query vertex is O(|E| + |V|), and the total time complexity of the breadth-first method is O(|E| + |V| × (1 + f(m))).

VI. EXPERIMENTAL EVALUATION
In this section, we present a comprehensive experimental evaluation. In what follows, we first introduce the experiment settings, and then evaluate the efficiency and effectiveness of our methods.

A. EXPERIMENT SETTINGS
We employ three real datasets, viz., Gowalla¹, Flickr², and Citation³. Gowalla contains locations, for which the L2-norm is utilized to compute the metric distance; two locations are connected if they are shared by the same user. Flickr includes images, where every image is associated with 282-dimensional features, and the L2-norm is used to compare image features; if two images are tagged by the same user, an edge is added between them. Citation provides a comprehensive list of research papers, in which every paper is associated with a set of keywords, and the Jaccard distance is employed to measure the corresponding metric similarity; two papers are connected if one cites the other. Table 3 summarizes the statistics of the real datasets used in our experiments.
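For reference, the two metric distance functions used across the datasets are standard and can be sketched as follows (a plain illustration, not the authors' implementation):

```python
import math

def l2_norm(x, y):
    """L2-norm distance between coordinate/feature vectors
    (Gowalla locations, Flickr 282-dimensional features)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def jaccard_distance(a, b):
    """Jaccard distance between keyword sets (Citation dataset):
    1 - |a ∩ b| / |a ∪ b|, with the empty-sets case defined as 0."""
    if not a and not b:
        return 0.0
    inter = len(a & b)
    return 1.0 - inter / (len(a) + len(b) - inter)
```

Both are metrics, so the pivot-based triangle inequality bounds of Lemma 1 apply directly to all three datasets.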
We study the performance of the algorithms when varying the parameters shown in Table 4, where bold denotes the default values and d+ is the maximal distance between any two vertices. In every experiment, we change one parameter and set the others to their default values. The main performance metrics include the query time, the number of shortest path distance computations, and the number of metric distance computations.

FIGURE 2: Illustration of pruning quasi-metric graph

FIGURE 3: Example of M-tree

FIGURE 4: Example of SPB-tree




FIGURE 5: Example of graph traversing methods

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2930753, IEEE Access. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.

TABLE 1: Symbols and description

TABLE 3: Statistics of the datasets used