Top-k Distance Queries on Large Time-Evolving Graphs

Fast extraction of top-k distances from graph data is a primitive of paramount importance in the fields of data mining, network analytics and machine learning, where ranked distances are exploited for several purposes (e.g., link prediction or network classification). While investigation on computational methods to address this retrieval task for regularly sized, static inputs has been extensive, much less is known when managed graphs are massive, i.e., having millions of vertices/edges, and time-evolving, i.e., when their structure can grow over time, a scenario that introduces a number of scalability and effectiveness issues otherwise not arising. Since, nowadays, most real-world applications exploiting top-k distances have to handle inherently dynamic and rapidly growing graphs, in this paper we present the first dynamic indexing scheme that supports very fast queries on top-k distances when graphs are massive and incrementally time-evolving. We assess the scalability and effectiveness of our method through extensive experimentation on both real-world and artificial graph datasets.


I. INTRODUCTION
Mining path-related properties (e.g. distances, communities, or centrality measures) is considered a fundamental operation on graph data, for several reasons. Chiefly, the graph model is one of the most widely used in computing systems, due to its effectiveness in capturing the inherently networked nature of many domains, and algorithms that quickly compute such properties are indispensable tools in many prominent real-world scenarios where graph datasets have to be managed [37]. For instance, distances and centrality measures are widely exploited for artificial reasoning, machine learning, and network optimization tasks, while locally connected communities and eccentricities support many meaningful network analytics processes [3], [44], [47], [49], [59], [60]. Due to such applicability, computational problems related to the aforementioned properties have been deeply investigated in the literature, and for most of them efficient algorithms, with polynomially-bounded time complexities, have been well known for decades [28], [37], [57].
The associate editor coordinating the review of this manuscript and approving it for publication was Chong Leong Gan.
Nonetheless, a recent trend of research has been concerned with the scalability issues that most of such algorithms exhibit and, in particular, with the poor performance they show when applied to so-called massive graphs, i.e. graphs having millions of vertices and edges. In these cases, in fact, (even) polynomial-time algorithms can often yield unsustainable running times in practice, which are incompatible with the requirements of modern data-intensive information systems [4], [19], [39]. For this reason, and since massive graphs are pervading computing and data management applications, researchers and practitioners have been motivated to design innovative algorithmic strategies to achieve faster solutions to many computational problems of interest, at least from a practical viewpoint and/or for special graph classes [4], [8], [12], [19].
A particularly active area in this context is that dedicated to the so-called k shortest distances (k-sd) problem, which asks to retrieve, upon a query, the top-k distances for a pair of vertices of a graph, i.e. the lengths of the k shortest paths connecting the pair. Fast computation of top-k distances from graph data is a primitive of paramount importance in the fields of data mining, information retrieval, and machine learning, where ranked distances are exploited for diverse purposes [3], [22], including network design/optimization, speech recognition, hypertext classification, reconstruction of metabolic pathways, analysis of gene networks, similarity searching, and link prediction. In general, the problem finds applications in all those analytics tasks where classic shortest-path distances are not informative enough to characterize the structural properties of a graph dataset. A paradigmatic example is given by all those real-world scenarios where, due to the well-known small-world phenomenon, the diameter of the managed graph is small. As a result, on the one hand many pairs of vertices are at the same (shortest path) distance. On the other hand, such pairs typically do not share the same set of top-k distances (see Fig. 1 for an example). Hence, based on the distance information alone, many pairs of vertices (or vertices) could be considered as equally relevant, which is by no means a realistic assumption; on the contrary, considering ranked, top-k distances is crucial to address the mentioned analytics tasks with accuracy and effectiveness. We refer the interested reader to [22] and [44], and references therein, for a more thorough list of applications relying on effective computation of top-k distances, including the prominent setting of high-accuracy link prediction [3], [44].
FIGURE 1. Top-1 distance versus Top-k distances for the two black vertices in three different graphs [3].
The reference approach for solving the k-sd problem, in terms of worst-case time complexity, is Eppstein's algorithm [21], which takes O(n + m + k) (O(n log n + m + k), respectively) time to compute the k shortest distances for a pair of vertices on an unweighted (weighted, respectively) n-vertex, m-edge graph. However, despite the polynomial time complexity, the approach is known to exhibit unsatisfactory performance, and to scale poorly with respect to the graph size and the value of k, as it can require up to tens of seconds to answer a query on the top-k distances for a single pair of vertices in large graphs [3].
Since in the above-mentioned applications this kind of query must be interactively computed for many vertex pairs on large graphs, the authors of [3] introduce an algorithmic framework, called k-pll in what follows, to allow the extraction of top-k distances from massive graphs with practical running times. The method divides the computational effort into two steps: (i) in an offline phase, a one-time preprocessing of the input is performed, with the aim of computing a compact data structure, named k-2-Hop-Cover index (k-2hc index or simply k-2hc, for short), that stores appropriately selected lengths of paths and cycles in the graph; (ii) at runtime, upon a query on the top-k distances for a pair of vertices, an appropriate query algorithm, which takes as input only the k-2hc data structure, is executed to answer the query very quickly. Specifically, while in terms of worst-case query time and space complexities k-pll is not better than Eppstein's algorithm, in [3] it is experimentally shown that k-pll performs very well in practice and enables answering top-k distance queries within a few microseconds per vertex pair (thousands of times faster than Eppstein's algorithm), at the price of at most a few thousand seconds of preprocessing time and of storing some GBs of indexing data, even when graphs have millions of edges. Hence, k-pll is considered the state of the art for solving the k-sd problem at scale.
Unfortunately, the solution proposed in [3], like many similar methods for massive graph processing based on indexing techniques [17], [19], is not suited to scenarios where the input graph is time-evolving (also known as dynamic), i.e. when the graph topology and/or edge weights can change over time. In fact, essentially all approaches that rely on preprocessing to obtain fast query answering do not natively guarantee correctness when the managed graph grows over time since, even after a few modifications of the input, precomputed data structures might become obsolete (i.e. no longer properly reflecting the underlying graph structure) and therefore lead to incorrect query results [5], [17], [24], [63]. This is the case for the framework of [3], as it is easy to see how, even after a single update to the topology of the input graph (e.g. an edge insertion), an arbitrary number of lengths of paths and cycles stored in the index can become obsolete and therefore potentially lead to incorrect top-k distances returned by the query algorithm.
To the best of our knowledge, the only possibility to use the k-2hc index on a time-evolving network (and hence to solve the k-sd problem very quickly at scale when graphs dynamically change) is to recompute the data structure from scratch after each update to the network. The latter, however, cannot be considered a viable option in practice since the precomputation step, though effective, generally induces non-negligible time overheads, incompatible with data-intensive applications that rely on ranked distances.
Since, as well documented in the literature [9], [15], [23], [32], [54], most real-world applications exploiting ranked distances deal with inherently dynamic and rapidly growing graphs, the availability of effective dynamic algorithms, able to identify and update only the part of the data structure that is compromised by some graph change faster than the preprocessing routine, is essential to enable the retrieval of top-k distances in temporal contexts. Again to the best of our knowledge, no method of this kind exists for k-pll and hence for the k-sd problem at scale, while similar investigations have been successfully conducted for preprocessing-based methods for extracting other path-related properties from large-scale graph datasets [5], [17], [30], [33], [34].
Our Contribution. In this paper, we design dyn-kpll, an incremental dynamic algorithm that is able to keep k-2-Hop-Cover indexes updated when graphs can grow over time, i.e. when they are subject to incremental updates (vertex/edge insertions, or weight decreases). We prove its correctness, give its time complexity and present the results of extensive experimentation, involving both real and artificial time-evolving graphs, to demonstrate its scalability and effectiveness.
Specifically, we provide strong empirical evidence that dyn-kpll: (i) is able to update k-2hc indexes very quickly, running several orders of magnitude faster than recomputation from scratch, even for massive graphs; (ii) preserves the compactness of the data structure and thus its competitive performance in terms of time to answer top-k distance queries. Thus, our method can be considered the first dynamic indexing scheme that natively supports very fast answers to top-k distance queries in large time-evolving graphs.
It is worth remarking here that the focus of this work is on incremental updates only, for several reasons that are well motivated in the literature [5]. First, such updates are the most frequent types of updates that occur in the real-world domains where ranked distances are exploited (e.g. co-authorship, co-occurrence, and interaction networks are just a few examples of graphs that can only grow over time, while digital social networks are an example of graphs in which decremental updates, i.e. removals of edges or nodes or weight increases, are far rarer than incremental ones). Second, no method is known to answer top-k distance queries with the same excellent performance of k-pll under dynamic conditions without re-executing a preprocessing routine from scratch; hence, designing an effective incremental algorithm represents a first step toward using k-pll with time-evolving graphs. Third, addressing incremental updates very often represents the first natural step to understand the inherent complexity of the problem of handling generic updates, and to drive the design of techniques to effectively attack it [17], [24], [43].

A. RELATED WORKS
The problem of computing ranked, top-k distances and paths has been widely investigated in the last decades, in several variants and flavors and within various domains of computer science and engineering. Perhaps the problem that is most closely related to the k-sd problem, and that has received similar attention in the literature, is the so-called k-Simple Shortest Paths problem (or kSiSP), which asks to find, upon a query, the top-k shortest paths, in terms of length, that connect a given pair of vertices of a graph and are simple, i.e. that do not self-intersect or contain loops (differently from the k-sd problem, where loops must be taken into account, as part of the graph structure, while computing ranked distances). This version of the problem finds application in specific domains such as data routing for communication networks or journey planning in transit networks. Despite the evident similarities with the k-sd problem, the kSiSP is generally considered computationally harder. In particular, the best worst-case running time for this problem is that of Yen's algorithm [61], which requires O(kn(m + n log n)) time to compute the top-k simple shortest paths for a vertex pair of a graph having n vertices and m edges. Such an approach remains, to this day, the reference method to address the problem, even though several attempts at improving its running time have been made in the last forty years. Specifically, Gotthilf et al. have managed to improve the algorithm to run in O(kn(m + n log log n)) worst-case time [29], and there exists an algorithm by Katoh et al. that, for undirected graphs only, yields an O(k(m + n log n)) worst-case running time [38]. Nonetheless, both methods have been shown, experimentally, to exhibit a performance in practice that is similar to that of Yen's solution, with peak performance on moderate to large diameter graphs such as square grids or large road networks in the undirected variant [2], [26], [50]. Some heuristics that compute top-k simple shortest paths faster than the above-mentioned strategies, at least from a practical perspective, or for special graph classes, or under assumptions on the computational model (e.g. in distributed settings), have been proposed in the recent past and are worth mentioning to complete the overview of available solutions to the problem, see e.g. [11], [27], [36], [42], [55], [62], and [64].
However, none of them exhibits the same scalability properties, and query performance at scale, as methods based on preprocessing that have been proposed for other relevant problems of the graph mining domain, such as the method of Akiba et al. for the k-sd problem [3] or that by Delling et al. for plain shortest path distance queries [19]. In this sense, designing an algorithmic method to compute top-k simple shortest paths with small running times per query (within microseconds, compatible with interactive applications) when the managed graph is massive is still an open problem and represents a very active area of research [64].
Among strategies based on preprocessing, those that rely on computing indices in the form of vertex labelings have represented a significant portion of such progress, especially for path-related problems [63]. In particular, the hub-labeling technique, originally introduced for connectivity problems on large graphs by Cohen et al. [13], has been adapted to solve several problems on large graph data: besides the aforementioned work of Delling et al. for shortest path distances [19], it is certainly worth mentioning the studies of: (i) Wang et al. [59], in which hub-labelings are exploited to accelerate the computation of best routes on timetable graphs; (ii) Abraham et al. [1], where labelings are combined with hierarchical strategies to speed up the computation of shortest paths in road networks; (iii) Peng et al. [52], in which a precomputed labeling is exploited to answer reachability queries up to 5 orders of magnitude faster than the state of the art; (iv) Zhang et al. [63], who adapt hub-labeling techniques to efficiently retrieve the number of shortest paths (i.e. the number of paths having the same, which is also the shortest, length) between any pair of vertices.
Essentially all studies on the acceleration of algorithms for large graph mining by precomputation of suitable data structures have been followed by investigations on corresponding dynamic algorithms to update/maintain such data structures under dynamic conditions, i.e. when the given input graph is time-evolving, in order to amortize the time necessary for the preprocessing (i.e. to avoid the recomputation from scratch of the data structure) whenever the graph is subject to some modification. The latter is universally considered a far more realistic setting than that of static, non-changing graphs. Examples include the design and experimental evaluation of dynamic algorithms for shortest path trees [16], for transitive closures [30], for centrality measures [9], [58], or for graph-based timetable models [12].
Similar works have been concerned with the design and experimental evaluation of dynamic algorithms to update/maintain labeling-based indices, such as: the work of Akiba et al. to update 2-hop-cover labelings when graphs are subject to incremental modifications (edge/vertex insertions) [5]; the work by D'Angelo et al., which extended the approach of [5] to handle the fully dynamic scenario (when the managed graph can also undergo edge/vertex deletions) [17]; the work of [25], which improved the overall performance of [5] and [17] in terms of space and preprocessing/update time by considering a hybrid, landmark-based strategy that induces larger query times; and the studies in [23] and [24], which have focused on the effect of batches of updates occurring simultaneously on the mentioned hybrid labelings.
Note that a thorough survey on recent advances in the field of dynamic graph algorithms has been compiled by Hanauer et al. [31].

II. PRELIMINARIES
In this study we focus on networks that are modeled as a graph $G = (V, E)$ with a vertex set $V$ and an edge set $E$. We denote by $n = |V|$ ($m = |E|$, respectively) the number of vertices (edges, respectively) of $G$. To simplify our discussion, we first consider only undirected, unweighted graphs. Nonetheless, the method presented in this paper can be extended to weighted digraphs, as discussed in Section III-A.
A path in $G$ connecting two vertices $s$ and $t$ is a sequence of vertices $(s = v_1, v_2, \ldots, v_h = t)$ such that $\{v_i, v_{i+1}\} \in E$ for every $1 \le i < h$. We call cycle any path whose endpoints coincide, while we call simple a path with no self-intersections, i.e. without vertex repetitions. An internal vertex of a path $p$ is a vertex in $p$ different from its endpoints. The length $\ell(p)$ of a path $p$ is the number of edges in $p$; note that, for non-simple paths, the length counts each edge as many times as it occurs. A shortest path $p(s, t)$, for a pair of vertices $s, t \in V$, is a path having minimum length among all those in $G$ connecting $s$ and $t$. The distance $d(s, t)$ between $s$ and $t$ is the length of a shortest path $p(s, t)$.
We assume vertices are uniquely represented by integers, so as to enable natural comparisons for any pair $u, v \in V$ by expressions such as $u < v$ or $u \le v$. Given any two vertices $s, t \in V$, we define: (i) $P_{st}$ to be the set of paths connecting $s$ and $t$ in $G$; (ii) $P^{>v}_{st}$ to be the set of paths in $P_{st}$ whose internal vertices are all larger than $v$, for some $v \in V$; (iii) $P^{\not> v}_{st}$ to be the set of paths in $P_{st}$ such that at least one internal vertex is smaller than or equal to $v$. Furthermore, we call $p_i(s, t)$ the $i$-th shortest path between $s$ and $t$, that is the $i$-th element of $P_{st}$ sorted in non-decreasing order of path length, and use $d_i(s, t) = \ell(p_i(s, t))$ to refer to the $i$-th shortest distance for pair $s, t$, i.e. the length of the $i$-th shortest path in $P_{st}$. Similarly, we use $d^{>v}_i(s, t)$ ($d^{\ge v}_i(s, t)$ and $d^{\not> v}_i(s, t)$, respectively) to refer to the $i$-th shortest distances when paths are restricted to consider only internal vertices that are larger than (larger than or equal to, and not greater than, respectively) some $v \in V$. Likewise, we use $p^{>v}_i(s, t)$ ($p^{\ge v}_i(s, t)$, respectively) to refer to the corresponding $i$-th shortest paths, and $d^{>v}(s, t) = d^{>v}_1(s, t)$ ($p^{>v}(s, t) = p^{>v}_1(s, t)$, respectively) to refer to the distance (a shortest path inducing such distance, respectively) subject to the same restrictions on vertices. Finally, we call $\deg_v$ the degree of a vertex $v \in V$, that is the number of neighbors $\{w : (v, w) \in E\}$ of $v$. We use $\deg^{>v}_v$ to denote the number of such neighbors that are larger than $v$.
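To make the definition of the $i$-th shortest distance concrete, the following Python sketch enumerates top-k distances on a small unweighted graph by brute force. It is neither Eppstein's algorithm nor k-pll: it is a plain breadth-first search in which every vertex may be expanded up to k times, which suffices because paths need not be simple and distinct paths of equal length are counted separately. The function name `top_k_distances` and the adjacency-dictionary representation are assumptions made only for illustration.

```python
from collections import deque
from math import inf


def top_k_distances(adj, s, t, k):
    """Lengths of the k shortest (possibly non-simple) paths from s to t."""
    pops = {v: 0 for v in adj}      # how many times each vertex was expanded
    found = []
    queue = deque([(s, 0)])         # FIFO order yields non-decreasing lengths
    while queue and len(found) < k:
        v, d = queue.popleft()
        if pops[v] == k:            # the k shortest paths to v are already known
            continue
        pops[v] += 1
        if v == t:
            found.append(d)
        for w in adj[v]:
            queue.append((w, d + 1))
    # Missing distances (e.g. for disconnected pairs) are reported as infinity.
    return found + [inf] * (k - len(found))


# Path graph 0 - 1 - 2: one path of length 2 and two paths of length 4.
path = {0: [1], 1: [0, 2], 2: [1]}
print(top_k_distances(path, 0, 2, 3))   # [2, 4, 4]
```

On the path graph above, the two paths of length 4 are 0-1-0-1-2 and 0-1-2-1-2, which is why the value 4 appears twice. Each vertex is expanded at most k times, so the sketch performs O(k(n + m)) queue operations per query, which is adequate for illustration but not for massive graphs.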

A. K SHORTEST DISTANCES PROBLEM
Given a graph $G = (V, E)$, an integer $k \ge 1$, and a pair of vertices $s, t \in V$, the k shortest distances (k-sd) problem asks to compute the set $D^k_{st} = \{d_1(s, t), d_2(s, t), \ldots, d_k(s, t)\}$ of the $k$ shortest distances between $s$ and $t$ in $G$. The framework of [3] is the state-of-the-art approach to address the k-sd problem at scale. It is based on the computation of a data structure called k-2-Hop Cover index, which is a generalization of the 2-Hop Cover index, originally introduced in [4], and defined as follows.
Definition 1 (k-2-Hop Cover Index): Given a graph $G = (V, E)$, define, for each vertex $v \in V$: (i) a length label $L(v)$, containing pairs $(u, \delta_{uv})$ where $u \in V$ and $\delta_{uv}$ is the length of a path from $u$ to $v$; (ii) a loop label $C(v)$, storing a sequence of $k$ integers $(\delta_1, \delta_2, \ldots, \delta_k)$ representing lengths of cycles in $G$ that include vertex $v$. Then, the pair $I = (L, C)$, where $L = \{L(v)\}_{v \in V}$ and $C = \{C(v)\}_{v \in V}$, is called a k-2-Hop Cover index of $G$.
A k-2-Hop Cover index is often referred to as k-2-Hop Cover labeling or simply as k-2-Hop Cover (k-2hc for short); we use these notations interchangeably and refer to elements in the labels as entries. Depending on the entries it stores, a k-2hc may or may not be usable to solve the k-sd problem correctly. Specifically, this holds when the index satisfies the so-called k-cover property, which we define as follows.
Definition 2 (k-Cover Property): Given a k-2hc $I = (L, C)$ of a graph $G = (V, E)$, let QUERY$(I, s, t)$ denote a query on $I$ for a pair of vertices $s, t \in V$, that returns the smallest $k$ elements from the multiset
$$\Delta(I, s, t) = \{\delta_{sv} + \delta_{vv} + \delta_{tv} \;:\; (v, \delta_{sv}) \in L(s),\ \delta_{vv} \in C(v),\ (v, \delta_{tv}) \in L(t)\}.$$
Then, $I$ satisfies the k-cover property if and only if, for any $s, t \in V$, we have
$$\text{QUERY}(I, s, t) = D^k_{st}.$$
In other words, an index satisfying the k-cover property for a graph $G$ allows one to retrieve the $k$ shortest distances in $G$, for any pair of vertices $s, t \in V$, by a query on the index that selects the smallest sums in $\Delta(I, s, t)$, obtained by properly combining lengths of paths and cycles. Specifically, each such combination is obtained by summing the length of a path from $s$ to some vertex $v$, the length of a cycle on $v$, and the length of a path from $v$ to $t$. We say index $I$ covers $G$ whenever $I$ satisfies the k-cover property for $G$. Any vertex $v$ that forms one of the $k$ smallest combinations in $\Delta(I, s, t)$ is called a hub vertex for pair $s, t$, for any $s, t \in V$. Clearly, whenever a pair $s, t$ is disconnected in $G$, then $d_i(s, t) = \infty$ for all $i \in [1, k]$, and QUERY$(I, s, t)$ returns a single default infinity value whenever there is no vertex $v \in V$ such that $(v, \delta_{sv}) \in L(s)$, $\delta_{vv} \in C(v)$, $(v, \delta_{tv}) \in L(t)$. An example of a k-2hc $I = (L, C)$ covering a graph is shown in Figure 2. The size of the index is defined to be the total number of entries in all labels, both of length and loop type, and it can be easily shown that computing a k-2hc covering a graph and having minimum size is NP-hard: this follows from the hardness of computing a minimum-sized 2-hop cover index [13]. Moreover, computing a k-2hc having size $O(kn^2)$ can be easily achieved by, e.g., $O(n^2)$ executions of Eppstein's algorithm.
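The query described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the query routine of [3]: labels are stored as nested dictionaries (`L[v][hub]` is a list of path lengths, `C[hub]` a list of cycle lengths), the loop label of each hub is assumed to contain the trivial cycle of length 0 so that plain distances appear among the sums, and the tiny index below was built by hand for the single-edge graph {0, 1} rather than produced by a preprocessing routine.

```python
import heapq
from math import inf


def query(L, C, s, t, k):
    """Return the k smallest sums path(s, hub) + cycle(hub) + path(hub, t)."""
    sums = (
        ds + c + dt
        for hub, lengths_s in L[s].items() if hub in L[t]   # common hubs only
        for ds in lengths_s
        for dt in L[t][hub]
        for c in C[hub]
    )
    best = heapq.nsmallest(k, sums)
    return best if best else [inf]      # single default value: no common hub


# Hand-built toy index for the graph with the single edge {0, 1} and k = 3.
L = {0: {0: [0]}, 1: {0: [1], 1: [0]}}
C = {0: [0, 2, 4], 1: [0]}              # assumption: length-0 cycle included
print(query(L, C, 0, 1, 3))             # [1, 3, 5]
```

The result matches the three shortest paths between the two endpoints of a single edge, which traverse that edge one, three, and five times. The sketch enumerates all combinations for every common hub; an actual implementation would likely merge sorted labels instead.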
To the best of our knowledge, no algorithm is known for computing a k-2hc with a guarantee on the approximation of the size of the index. The method in [3], however, achieves practical performance, in terms of trade-off between preprocessing time, index size and query time, and is currently considered the most effective framework to solve the k-sd problem for large graphs. Such a method is based on precomputing a k-2hc that covers a given input graph by (i) sorting vertices according to some easy-to-compute centrality measure (e.g. degree); (ii) filling both loop and length labels progressively, by performing appropriately modified visits of the graph, each rooted at a different vertex of the graph, following the established sorting; (iii) incorporating a stopping criterion that prunes the searches whenever no length, shorter than those already stored, can be found. The preprocessing strategy to build a k-2hc is summarized in Algorithm 1 and consists of two main sub-routines, named mod-bfs and prun-ksd, given in Algorithms 2 and 3, respectively.

Algorithm 1 Algorithm k-pll
Input: [...]

Procedure mod-bfs (prun-ksd, respectively) computes loop (length, respectively) labels by performing visits of the graph, each starting from a different vertex $v_i$, following the vertex sorting, and traversing only vertices larger than or equal to (larger than, respectively) $v_i$. The construction guarantees that: (i) the lengths in $L(v)$ associated with a vertex $u$ form the sequence [...] of the $1 \le l \le k$ shortest lengths induced by paths whose internal vertices are larger than $u$ and that are shorter than [...]; (ii) the loop label $C(v)$ stores the sequence [...] of the lengths of the $k$ shortest cycles in $G$ that include $v$ and vertices larger than or equal to $v$. Properties (i) and (ii), combined, guarantee that the resulting k-2hc satisfies the k-cover property. It is easy to observe that the running time of Algorithm 1 is $O(nkl(n + m))$, if $l$ is the maximum number of entries in any label [3]. Note that, in the remainder of the paper, for the sake of brevity, we use the acronym k-pll also to refer to the preprocessing routine of the framework, i.e. Algorithm 1.

FIGURE 2. [...] of a graph. Vertex IDs are assigned in non-increasing order of degree [19].

B. TIME-EVOLVING SCENARIOS
We assume we are given an initial graph, say $G = (V, E)$, and that such graph can undergo incremental modifications (i.e. vertex/edge insertions), that is, the graph is time-evolving. We focus on the incremental k-sd problem which asks, given an incremental modification $x$ (e.g. the insertion of an edge $e \notin E$) occurring on $G$, to compute the set $D^k_{st} = \{d'_1(s, t), d'_2(s, t), \ldots, d'_k(s, t)\}$ of the $k$ shortest distances between $s$ and $t$ in $G'$, for some $s, t \in V'$, where $G' = (V', E')$ is the graph obtained by applying $x$ to $G$ (e.g. by inserting $e$ into $E$). Clearly, such a problem can be solved, with the same complexity and practical performance, by any algorithm that solves the static counterpart of the problem without relying on preprocessed data (e.g. [21]), as it suffices to execute such an algorithm on $G'$, after a change, for the given pair. However, if preprocessed data are exploited to achieve superior query performance, as in the k-pll framework, then solving the incremental k-sd problem requires updating such data in order to preserve the correctness of the approach, which translates into the definition of the following problem.
Definition 3 (Incremental k-2hc Problem): Let $G = (V, E)$ be a graph and let $I$ be a k-2hc covering $G$. Let $x$ be an incremental modification of $G$ and let $G' = (V', E')$ be the graph obtained by applying $x$ to $G$. Then, the incremental k-2hc problem asks to compute a k-2hc $I'$ that covers $G'$.
To the best of our knowledge, the only known way to address the incremental k-2hc problem is to recompute from scratch a k-2hc $I'$ covering $G'$ via k-pll. However, this induces large time overheads. Thus, in the next section we introduce a dynamic algorithm to cope with such a problem without executing the preprocessing on each updated graph. Observe that, if the change to be managed is a vertex insertion, this can be modeled and handled as a sequence of edge insertions to a newly inserted vertex [17]. Therefore, in what follows, again for simplicity of description, we focus on graphs subject to edge insertions only. In the remainder of the paper, we will use $d'_j(s, t) = \ell(p'_j(s, t))$ to denote the $j$-th shortest distance, $1 \le j \le k$, for a pair $s, t$ in a graph $G'$ whenever the meaning is clear from the context.

Algorithm 4 Algorithm dyn-kpll
Input: [...] resume-pksd$(v, x, \delta_{vy})$ [...]

III. DYNAMIC ALGORITHM
In this section, we introduce our new method, called dyn-kpll, to solve the incremental k-2hc problem.
The main routine (see Algorithm 4) takes as input a graph $G$, for which an index $I = (L, C)$ covering $G$ is available, and an incremental update (the insertion of an edge $e = \{x, y\}$) for $G$. Let $G'$ be the graph obtained by inserting $e$ into $G$; then, the algorithm updates $I$ to obtain an index $I' = (L', C')$, which covers $G'$, by separately performing the update of the loop labeling $C$ and of the length labeling $L$. Specifically, first the update of $C$ is performed. To this aim, the procedure identifies any vertex $v$ for which at least one length in $C(v)$ may be incorrect due to the change, i.e. any vertex $v$ such that possibly $C'(v) \neq C(v)$. This is done by computing a set aff-set, defined as:
$$\text{aff-set} = \{\, v \in V \;:\; \min\{d^{>v}(v, x),\, d^{>v}(v, y)\} \le k \ \wedge\ \deg^{>v}_v < k \,\}.$$
Such a set contains any vertex $v$: (i) which is connected to either $x$ or $y$ by a shortest path not longer than $k$ and whose internal vertices are greater than $v$; (ii) whose set of neighbors greater than $v$ has size smaller than $k$. For each such vertex, previously found cycle lengths are removed from the loop labels and new ones are computed via procedure mod-bfs.
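The two membership conditions of aff-set can be transcribed directly into Python, as in the unoptimized sketch below. It is an illustration only and not the routine used in Algorithm 4: the names `restricted_dist` and `aff_set` are hypothetical, the graph is an adjacency dictionary with integer vertex IDs, and one bounded search is run per vertex. Whether distances are to be measured before or after the insertion of the edge is not stated in the text above; the sketch simply uses whatever adjacency structure it is given.

```python
from collections import deque


def restricted_dist(adj, v, target, limit):
    """Shortest length, if at most `limit`, of a path from v to target whose
    internal vertices are all larger than v; None otherwise."""
    if target == v:
        return 0
    seen = {v}
    queue = deque([(v, 0)])
    while queue:
        u, d = queue.popleft()
        if d == limit:
            continue                      # do not search beyond the limit
        for w in adj[u]:
            if w == target:
                return d + 1              # endpoints are not internal vertices
            if w > v and w not in seen:   # only walk through vertices > v
                seen.add(w)
                queue.append((w, d + 1))
    return None


def aff_set(adj, x, y, k):
    """Vertices with fewer than k larger neighbors that reach x or y within
    k hops through larger vertices only."""
    result = set()
    for v in adj:
        larger_neighbors = sum(1 for w in adj[v] if w > v)
        if larger_neighbors >= k:
            continue
        if (restricted_dist(adj, v, x, k) is not None
                or restricted_dist(adj, v, y, k) is not None):
            result.add(v)
    return result


# Path graph 0 - 1 - 2 - 3 - 4 - 5, with {4, 5} as the inserted edge.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
print(sorted(aff_set(adj, 4, 5, 2)))      # [2, 3, 4, 5]
```

With k = 2, vertices 0 and 1 are excluded because they are more than two hops away from both endpoints; with k = 1, only vertex 5 remains, since every other vertex already has one larger neighbor.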
Then, the algorithm continues with the update of $L$, which is achieved by a strategy inspired by the dynamic algorithm of [5]. Essentially, the underlying idea is to resume visits of the graph, rooted at specific vertices, and to prune such visits under certain conditions, in order to update only the length labels of vertices that are affected by the edge insertion. A vertex is said to be affected by an insertion if at least one entry must be added to the corresponding length label in order to guarantee that the resulting k-2hc $I'$ covers the new graph $G'$. Such resumed visits are performed by procedure resume-pksd, shown in Algorithm 5, and mimic those performed by routine prun-ksd. More specifically, the update procedure and its pruning mechanism are based on the following observation: if any of the $k$ shortest distances between two vertices $s$ and $t$ changes, then any new distance value that becomes part of the top-k shortest distances for the pair must be induced by paths from $s$ to $t$ passing through the new edge $e$ in the new graph. Hence, the update procedure must process the graph, after the edge insertion, in order to find those vertices $s$ for which the above condition holds toward some other vertex $t$, since the length label of $s$ must be updated to store lengths of paths that induce new distances in the top-k set, and to limit the visit of the graph to such vertices only.
The above is done in two steps: first, we identify candidate pairs of vertices for which at least one value in the set of top-k shortest distances might change because of the new edge. This is achieved by scanning the length labels of the two endpoints of the newly inserted edge. Then, we start BFS-like visits, rooted at vertices that are in such length labels, from either of the two endpoints, and incorporate in such visits a pruning strategy that stops the traversal of the graph, at some vertex, once no more shortest distances induced by paths passing through said vertex can be found. Now, w.l.o.g., we describe the details of the procedure for one endpoint only, say $x$, as it is symmetric for the other. The algorithm starts by scanning the length label of $x$ and, for each pair $(v, \delta_{vx}) \in L(x)$, executes procedure resume-pksd, which takes as inputs vertices $v$ and $y$, and value $\delta_{vx}$. Such a routine "resumes" a visit, rooted at $v$, starting from vertex $y$ and extending a path to $x$ of length $\delta_{vx}$. This is done by initializing a suitable queue and by exploring the graph in a BFS fashion. Whenever a vertex $w$, together with its path length $\delta$, is dequeued, we test whether the maximum value returned by QUERY$(I, v, w)$ is larger than $\delta$, to evaluate whether any of the $k$ shortest distances to $w$ are shortened by the edge insertion. If this is the case, we add entry $(v, \delta)$ to $L(w)$, which corresponds to the shorter length induced by the new path, and continue the search towards neighbors of $w$ having order greater than the root $v$. On the other hand, if $\delta$ is not less than the values returned by QUERY, then the visit from $w$ is pruned. The procedure terminates when either all branches of the visit are pruned at some vertex or the queue becomes empty. Figure 3 shows the result of the execution of algorithm dyn-kpll on the k-2hc of Figure 2.
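The control flow of the resumed, pruned visits can be illustrated with the following Python sketch. It is a toy reading of the description above and not Algorithm 5: all function names are hypothetical, labels are nested dictionaries (`L[w][root]` is a list of lengths, `C[root]` a list of cycle lengths), loop labels are assumed to be already updated and to contain the trivial cycle of length 0, a visit is resumed only when the starting endpoint is larger than the root (our reading of the restriction to vertices of greater order), and a vertex is also labeled when the query returns fewer than k values.

```python
import heapq
from collections import deque


def query(L, C, s, t, k):
    """Up to k smallest sums path(s, hub) + cycle(hub) + path(hub, t)."""
    sums = (ds + c + dt
            for hub, lengths_s in L[s].items() if hub in L[t]
            for ds in lengths_s for dt in L[t][hub] for c in C[hub])
    return heapq.nsmallest(k, sums)


def resume_visit(adj, L, C, root, start, length, k):
    """Resume a BFS-like visit rooted at `root` from `start`, with pruning."""
    queue = deque([(start, length)])
    while queue:
        w, d = queue.popleft()
        known = query(L, C, root, w, k)
        if len(known) == k and known[-1] <= d:
            continue                        # nothing improves here: prune
        L[w].setdefault(root, []).append(d)
        for u in adj[w]:
            if u > root:                    # only vertices ordered after root
                queue.append((u, d + 1))


def update_length_labels(adj, L, C, x, y, k):
    """`adj` already contains the new edge {x, y}; C is assumed up to date."""
    # Scan the labels of both endpoints as they were before the update.
    before = {a: {r: list(ls) for r, ls in L[a].items()} for a in (x, y)}
    for a, b in ((x, y), (y, x)):
        for root, lengths in before[a].items():
            if b > root:
                for d in lengths:
                    resume_visit(adj, L, C, root, b, d + 1, k)


# Toy run: edge {0, 1} is indexed, then edge {1, 2} is inserted (k = 3).
adj = {0: [1], 1: [0, 2], 2: [1]}
L = {0: {0: [0]}, 1: {0: [1], 1: [0]}, 2: {2: [0]}}
C = {0: [0, 2, 4], 1: [0, 2, 4], 2: [0]}
update_length_labels(adj, L, C, 1, 2, 3)
print(query(L, C, 0, 2, 3))                 # [2, 4, 4]
```

In this toy run the visit rooted at vertex 0 adds the entries (0, 2) and (0, 4) to the label of vertex 2 and (0, 3) to the label of vertex 1, and is pruned when it reaches vertex 1 again with length 5; the visit rooted at vertex 1 adds (1, 1) to the label of vertex 2. The queries on the updated labels then agree with the top-3 distances of the path graph 0-1-2.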
Observe that it can be shown that algorithm dyn-kpll correctly solves the incremental k-2hc problem, i.e. it is able to update a k-2hc index I, covering a graph G, to a k-2hc index $I'$, covering graph $G'$, which is the graph obtained by inserting an edge into G. Note that $I'$ satisfying the k-cover property on $G'$ implies that $I'$ can be used to correctly answer top-k distance queries on $G'$ (i.e. to solve the incremental k-sd problem). More specifically, we can prove the following result.
and since we execute such routine for each vertex in set aff-set, in order to prove (a) it is sufficient to show that $C'(w) = C(w)$ for vertices $w \notin$ aff-set. To this end, assume by contradiction that $C'(w)$ is not correct for some vertex $w \notin$ aff-set, i.e. if we take the k shortest values in $C'(w)$, say $l_1, l_2, \ldots, l_k$, then there exists one value $l_i$, for some $i \in [1, k]$, that is longer than the length of one of the k shortest cycles in $G'$, say c, that includes w and vertices larger than or equal to w. Clearly c must include edge e, as otherwise its length would already be in C(w). Now, notice that any vertex $w \notin$ aff-set is such that either (i) both $d^{>w}(w, x) \geq k + 1$ and $d^{>w}(w, y) \geq k + 1$, for some shortest paths $p^{>w}(w, x)$ and $p^{>w}(w, y)$, resp., since any vertex $w \in$ aff-set satisfies $w \leq \min(x, y)$, or (ii) $\deg^{>w} w \geq k$ (see line 1). Consider (i): any cycle including edge e and vertex w, whose vertices are larger than or equal to w, must be at least 2k + 3 long. This length is higher than all lengths $2, 4, \ldots, 2k$ of the (at least) k cycles that are induced by the traversal, back and forth, in a BFS order and starting from w, of any sub-path of length at most k of either $p^{>w}(w, x)$ or $p^{>w}(w, y)$. Hence, we reach a contradiction. For case (ii) a similar argument can be applied. In fact, if a vertex w is such that $\deg^{>w} w \geq k$, then the k shortest cycles on w are given by paths of length 2 obtained by traversing back and forth any k edges incident to w. We have thus reached a contradiction also here, since the new edge e does not contribute to the shortest cycles on w, i.e. $C'(w) = C(w)$, and this concludes the proof of (a).
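The arithmetic behind case (i) can be spelled out as follows. This is a sketch under the assumption, implicit in the argument above, that a cycle c through w and e = (x, y) on vertices not smaller than w decomposes into a path from w to x, the edge e, and a path from y back to w; the symbol |c| for the length of c is notation introduced here.

\[
|c| \;\geq\; d^{>w}(w, x) + 1 + d^{>w}(w, y) \;\geq\; (k + 1) + 1 + (k + 1) \;=\; 2k + 3 \;>\; 2k .
\]

Since the k back-and-forth cycles of lengths $2, 4, \ldots, 2k$ are all strictly shorter than $2k + 3$, the length of c cannot be among the k shortest values at w.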
We now focus on (b) and distinguish two cases: $s \notin L(x) \cup L(y)$ or $s \in L(x) \cup L(y)$. Assume that s and t are connected in G, as otherwise shortest distances are infinite in both G and $G'$ and the claim trivially follows. In the first case, i.e. $s \notin L(x) \cup L(y)$, we have that s is not a hub vertex in G for any of the k shortest distances from s to both x and y, and that either s is not connected to x and y, or any hub vertex for such pairs, say h, is such that h < s. In the former sub-case, no path in G from s to t passes through x and y, hence the same holds for $G'$, since the only difference is the insertion of e, and the claim follows. In the latter sub-case, instead, we have that any hub vertex, say h, for pairs s, x and s, y, is such that h < s. This implies that all paths in G inducing the k shortest distances from s to x, y have an internal vertex that is smaller than s. Therefore none of the k shortest paths in G from s to t whose internal vertices are larger than s (if any) passes through x or y. Thus, the corresponding shortest distances are not changed by the insertion and the claim again holds.
We now consider the second case, i.e. $s \in L(x) \cup L(y)$, and prove the statement for sub-case $s \in L(x)$ only, as the proof is symmetric for sub-case $s \in L(y)$. Suppose by contradiction that $s \in L(x)$ but the lengths in $L'(t)$ associated to s do not form the sequence $(d'^{>s}_1(s, t), d'^{>s}_2(s, t), \ldots, d'^{>s}_l(s, t))$ of the $1 \leq l \leq k$ shortest lengths induced by paths whose internal vertices are larger than s and that are shorter than $d^{\not> s}_k(s, t)$. Let $\gamma_1, \gamma_2, \ldots, \gamma_l$ be the sequence of the first $l \geq 0$ lengths associated to s in $L'(t)$ and let $\gamma_i$ be the i-th smallest value in this sequence such that $\gamma_i \neq d'^{>s}_i(s, t)$. Specifically, observe that, since we are inserting an edge, we have $d'^{>s}_i(s, t) < d^{>s}_i(s, t)$; hence $\gamma_i > d'^{>s}_i(s, t)$ can only be an overestimation of the true value $d'^{>s}_i(s, t)$ (and clearly this holds also if $s \notin L(t)$, as in that case $d^{>s}_1(s, t) = \infty$ by hypothesis). Moreover, the path inducing $d'^{>s}_i(s, t)$, say $p'^{>s}_i(s, t)$, must contain edge e, as otherwise we would have $\gamma_i = d'^{>s}_i(s, t)$. We can thus divide the path $p'^{>s}_i(s, t)$ as $p'^{>s}_i(s, x), \{x, y\}, p'^{>s}_i(y, t)$. Since $s \in L(x)$, an execution of procedure resume-pksd is started, rooted at s, by enqueueing $(y, \delta_{sx} + 1)$ to Q for each length $\delta_{sx}$ in entries of L(x) associated to s. Now, consider the value $\delta_{sx}$ induced by the path $p'^{>s}_i(s, x)$, which is such that $(s, \delta_{sx}) \in L(x)$, since I covers G. Then, it is easy to show, by induction on the lengths of paths induced by the visit, that procedure resume-pksd rooted at s, with $(y, \delta_{sx} + 1)$ enqueued into Q as initial step, will be pruned neither in y nor in any of the vertices in $p'^{>s}_i(y, t)$, including t, leading to a contradiction. In fact, suppose that at some vertex w at distance $\delta_{sw}$ in the path $p'^{>s}_i(y, t)$ the visit is pruned. Since the path traverses only vertices whose order is higher than s, it must be the case that $\delta_{sw}$ is larger than any of the top-k distances from s to w provided by the current index. This implies that, in $G'$, there exist k distances $(d'^{\not> s}_1(s, t), d'^{\not> s}_2(s, t), \ldots, d'^{\not> s}_k(s, t))$, induced by the k shortest paths from s to w concatenated to path $p'^{>s}_i(w, t)$, whose total length is less than $d'^{>s}_i(s, t)$, which is clearly a contradiction. Therefore, it follows that no vertex on $p'^{>s}_i(y, t)$ is pruned during routine resume-pksd, which eventually adds entry $(s, d'^{>s}_i(s, t))$ to $L'(t)$. □
Concerning the time complexity of dyn-kpll, we can prove the following result, expressed in an output-bounded sense, as commonly done for dynamic algorithms in the literature [5], [17].
Theorem 2: Let I = (L, C) be a k-2hc covering a graph G and let l be the maximum number of entries in any length label of I. Given an edge insertion x on G, let $G'$ be the graph obtained by applying x to G. Then, algorithm dyn-kpll takes $O(kl^2 s + rkc)$ time to update I to a k-2hc $I'$ covering $G'$, where r denotes the cardinality of aff-set, while s and c denote the maximum size of the subgraph visited during any execution of resume-pksd and mod-bfs, respectively.
Observe that dyn-kpll invokes resume-pksd at most k times for any vertex $v \in L(x)$ ($v \in L(y)$, resp.). Notice also that, in any given execution of dyn-kpll, the number of such vertices (hence calls to resume-pksd) is $\beta = |L(x) \cup L(y)|$ and that, clearly, β is at most l. Each execution, moreover, takes O(ls) time in the worst case to perform QUERY on each visited vertex. Thus, the total running time required to update the length labeling by performing β times procedure resume-pksd is $O(\beta(kls))$. □
It is easy to notice that the worst-case time complexity of dyn-kpll is asymptotically larger than that of k-pll, as l and r are O(kn) and O(n), respectively, while s and c are O(m). However, our experimentation shows that, in practice, such values are by far smaller than the worst case. Moreover, since dyn-kpll preserves the k-cover property, one can repeatedly solve the incremental k-2hc problem for sequences of modifications of arbitrary length σ in $O(\sigma(kl^2 s + rkc))$ time, by updating the index σ times via dyn-kpll.
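As a compact recap of the counting argument, the cost of maintaining the length labels can be written as a product of the three quantities named above. The underbrace annotations are only explanatory labels added here; the cost of the loop-label updates, $r \cdot O(kc)$, is the one stated in the remaining part of the proof.

\[
\underbrace{\beta}_{\text{roots}} \cdot \underbrace{k}_{\text{calls per root}} \cdot \underbrace{O(ls)}_{\text{cost per call}} \;=\; O(\beta k l s) \;\subseteq\; O(k l^2 s) \quad \text{since } \beta = O(l),
\]
\[
\sigma \text{ insertions:} \quad \sigma \cdot O(k l^2 s + r k c) \;=\; O\big(\sigma (k l^2 s + r k c)\big).
\]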

A. GENERALIZATIONS
In what follows, we briefly discuss how both k-pll and dyn-kpll can be extended to handle general, possibly weighted, digraphs.

1) DIRECTED GRAPHS
In this case, a k-2hc stores three labels for each vertex $v \in V$, to account for edge orientations: (i) C(v), storing lengths of (now oriented) cycles; (ii) $L_{in}(v)$, containing lengths of paths that terminate in v; (iii) $L_{out}(v)$, containing lengths of paths emanating from v. The preprocessing phase is adapted to consider both directions and to run the preprocessing routine twice, once in G and once in the transpose graph of G. The visits are pruned by performing properly oriented queries that combine, for a pair s, t, lengths of paths emanating from s, cycles on hub vertices v, and lengths of paths terminating in t. Similarly, dyn-kpll can be adapted to handle directed graphs by executing prun-ksd twice, once in G and once in the transpose graph of G, and by identifying vertices v in aff-set (line 1 in Algorithm 4) as in Eq. 1 that also have a path $p^{\geq x}(v, x)$ ($p^{\geq x}(v, y)$, resp.) of length at most k.
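A properly oriented query of this kind could look as follows. This is a minimal sketch, not the paper's code: the function name `query_directed` and the representation of labels as dictionaries mapping a vertex to a list of (hub, length) pairs, and of loop labels as lists of cycle lengths that include the trivial length 0, are assumptions made for illustration.

```python
import heapq


def query_directed(L_out, L_in, C, k, s, t):
    """Return the k smallest values d(s -> h) + cycle(h) + d(h -> t)."""
    candidates = (
        d_out + c + d_in
        for hub_out, d_out in L_out.get(s, [])   # paths emanating from s
        for hub_in, d_in in L_in.get(t, [])      # paths terminating in t
        if hub_out == hub_in                     # keep common hubs only
        for c in C.get(hub_out, [])              # oriented cycles on the hub
    )
    return heapq.nsmallest(k, candidates)


# Hypothetical labels: one hub "h" reachable from "s" and reaching "t".
L_out = {"s": [("h", 1)]}
L_in = {"t": [("h", 2), ("h", 4)]}
C = {"h": [0, 3]}
result = query_directed(L_out, L_in, C, 3, "s", "t")
print(result)  # -> [3, 5, 6]
```

The sketch enumerates all label pairs for clarity; an implementation that keeps labels sorted by hub could merge them instead.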

2) WEIGHTED GRAPHS
In this case, label entries store weights of cycles and paths, rather than lengths, where the weight of a cycle or path is the sum of the weights of its edges. To build a k-2hc in this setting, one has to apply weighted versions of both mod-bfs and prun-ksd, where the exploration is performed in a Dijkstra's algorithm fashion (i.e. by using a priority queue and by assigning priorities on the basis of path/cycle weights [20]). Similarly, dyn-kpll can be adapted to handle general incremental changes to the graph, including weight decreases, by resuming weighted versions of mod-bfs and prun-ksd from, resp., vertices in aff-set (line 1 in Algorithm 4) and vertices in the length labels of the updated edge endpoints (lines 4 and 6 in Algorithm 4). The upper bound used to identify vertices in aff-set (line 3 in Algorithm 4) is replaced by kW, where W is the largest edge weight in the graph.
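The Dijkstra-style exploration order can be illustrated in isolation. The sketch below is not a weighted prun-ksd: it omits the index-based pruning and the vertex-order restriction, and only shows how a priority queue yields the k smallest weights per vertex when each vertex may be settled up to k times. It assumes non-negative edge weights and counts walks, i.e. paths that may repeat vertices, in line with the back-and-forth cycles used in the proof of Theorem 1; the function name is hypothetical.

```python
import heapq


def k_smallest_walk_weights(adj, source, k):
    """For each vertex, the k smallest weights of walks from `source`.

    `adj` maps a vertex to a list of (neighbor, weight) pairs with
    non-negative weights.
    """
    found = {v: [] for v in adj}
    heap = [(0, source)]
    while heap:
        weight, v = heapq.heappop(heap)
        if len(found[v]) == k:
            continue  # v already settled k times: nothing shorter can appear
        found[v].append(weight)
        for u, w_uv in adj[v]:
            # Pushing is useless if u has already been settled k times.
            if len(found[u]) < k:
                heapq.heappush(heap, (weight + w_uv, u))
    return found


# Undirected weighted path a -2- b -3- c, explored from "a" with k = 2.
adj = {
    "a": [("b", 2)],
    "b": [("a", 2), ("c", 3)],
    "c": [("b", 3)],
}
weights = k_smallest_walk_weights(adj, "a", 2)
print(weights)  # -> {'a': [0, 4], 'b': [2, 6], 'c': [5, 9]}
```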

IV. EXPERIMENTAL EVALUATION
In this section, we describe the experimental evaluation we conducted to assess the performance of dyn-kpll.

A. EXPERIMENTAL SETUP
We implemented both k-pll and dyn-kpll; all our code is written in C++ and compiled with GCC 9.4.0 with optimization level O3 (see footnote 1). All tests have been executed on a workstation equipped with an Intel Xeon E5-2643 CPU at 3.40 GHz and 128 GB of RAM, running Ubuntu Linux. As inputs to our experiments, inspired by other experimental studies on graph algorithms [3], [19], [63], we consider a large collection of both real-world and artificial graphs. The former were taken from publicly available repositories [41], [45], [51], [56] and include graphs representing networks of various application domains of interest (e.g. web graphs) with heterogeneous densities and topologies. The latter were produced via well-established generation models, such as Erdős-Rényi and Barabási-Albert [10]. More details on the inputs used, including number of vertices and edges, type (real-world or synthetically generated), average degree, and diameter (denoted by ), are reported in Table 1. Graphs are sorted from top to bottom according to |V|.
1 Publicly available at https://github.com/D-hash/IncrementalK2HC
VOLUME 11, 2023. Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

1) EXECUTED TESTS
For parameter k, our experimental trials use values as in [3], namely k ∈ {2, 4, 8, 16}, since (i) the range is relevant to the application domains of interest; (ii) evaluating performance indicators across doubling values of the parameter magnifies the observed changes in the algorithms' behaviors [48]. For each input graph G and value of k, we perform three types of experiments, depending on how edges to insert are selected.

2) EXPERIMENT RAND-INS
In this experiment, we first execute k-pll to compute a k-2hc index I covering G. Then, for σ > 0 times, we select uniformly at random two vertices x, y that are not adjacent in the graph, add edge (x, y), obtain a graph $G'$, and run dyn-kpll to update index I to $I'$ covering $G'$. Finally, we compute from scratch a k-2hc index $I''$ covering the last snapshot of the graph via k-pll. The purpose of this setting is to evaluate the algorithm's behavior regardless of the probability of an insertion to occur.
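The selection step of rand-ins can be sketched with rejection sampling, which draws uniformly among the currently non-adjacent pairs. The function name, the adjacency-set representation, and the fixed seed are assumptions made for illustration; the point where dyn-kpll would be invoked is only marked by a comment. The loop assumes the graph has at least σ non-adjacent pairs, otherwise it would not terminate.

```python
import random


def sample_non_adjacent_pairs(adj, sigma, seed=0):
    """Insert `sigma` random edges between non-adjacent vertices, in place."""
    rng = random.Random(seed)
    vertices = sorted(adj)
    inserted = []
    while len(inserted) < sigma:
        x, y = rng.sample(vertices, 2)  # two distinct vertices
        if y in adj[x]:
            continue                    # already adjacent: reject and retry
        adj[x].add(y)
        adj[y].add(x)
        inserted.append((x, y))
        # Here the index would be updated by the dynamic algorithm.
    return inserted


# Toy input: a cycle on 6 vertices (9 non-adjacent pairs are available).
adj = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
original = {frozenset((u, v)) for u in adj for v in adj[u]}
inserted = sample_non_adjacent_pairs(adj, 3, seed=1)
```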

3) EXPERIMENT SEMI-REAL
In this experiment, we start by removing σ > 0 edges, selected uniformly at random, from a graph G to obtain a graph $G_{init}$. We compute a k-2hc index I covering $G_{init}$ via k-pll and then re-insert the σ sampled edges one after the other, until all removed edges are re-inserted and the original graph G is restored. After each insertion we run dyn-kpll to update the index to an index $I'$ covering the graph $G'$ comprising the insertion. Finally, we execute k-pll to compute from scratch a k-2hc index $I''$ covering G. The purpose of this setting is to assess the algorithm's behavior in a semi-realistic context, where insertions are sampled to follow the distribution induced by edges that are actually in the graph at some point of its evolution.
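The construction of the initial graph and of the insertion sequence for semi-real can be sketched as follows. The function name and the edge-list representation (a list of distinct vertex pairs) are assumptions made for illustration.

```python
import random


def split_semi_real(edges, sigma, seed=0):
    """Remove `sigma` random edges; return (initial edges, edges to re-insert)."""
    rng = random.Random(seed)
    removed = rng.sample(edges, sigma)   # uniform sample without replacement
    removed_set = set(removed)
    initial = [e for e in edges if e not in removed_set]
    return initial, removed


edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
initial, to_reinsert = split_semi_real(edges, 2, seed=0)
# Re-inserting `to_reinsert` one edge at a time restores the original graph.
```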

4) EXPERIMENT TEMPORAL
In this experiment, we consider real-world graphs whose historical evolution is known in the form of timestamps, defining the order in which edges have been added to the graph. For each dataset of this kind (identified by the flag temporal in Table 1), having a total of η edges, we start by considering the graph G to be a snapshot of the dataset with η − σ > 0 edges and by computing a k-2hc I covering G via k-pll. Then, we proceed by adding the σ edges in the order dictated by the timestamps and by running dyn-kpll, after each insertion, to update index I to $I'$ covering the graph $G'$ containing the insertion. Finally, we execute k-pll to compute a k-2hc index $I''$ covering the final graph, as in the previous settings. The purpose of this setting is to evaluate the algorithm's performance in real-world scenarios.
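The snapshot/stream split of the temporal setting can be sketched as below. The function name and the (u, v, timestamp) tuple layout are assumptions made for illustration, not the format of the datasets used in the paper.

```python
def split_temporal(timestamped_edges, sigma):
    """Split edges into an initial snapshot and the last `sigma` insertions."""
    ordered = sorted(timestamped_edges, key=lambda e: e[2])  # by timestamp
    cut = len(ordered) - sigma
    if cut <= 0:
        raise ValueError("the snapshot must contain at least one edge")
    return ordered[:cut], ordered[cut:]


timestamped = [(0, 1, 30), (1, 2, 10), (2, 3, 20), (0, 3, 40)]
snapshot, stream = split_temporal(timestamped, 2)
print(stream)  # -> [(0, 1, 30), (0, 3, 40)]
```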
In all mentioned experiments, at the end of the σ insertions (each followed by an execution of dyn-kpll), and after the final execution of k-pll, we perform $10^5$ top-k distance queries on both $I'$ and $I''$ and measure average query times. We also measure the sizes of indexes $I'$ and $I''$, the preprocessing time to build $I''$ from scratch via k-pll, and the running time for obtaining each updated $I'$ via dyn-kpll. In all trials, vertex ordering is established according to non-increasing values of vertex degree, while σ is set to 10 000, since this induces significant changes to the topology of all considered input graphs and stresses boundary conditions for the algorithm under study.
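The degree-based vertex ordering can be obtained with a single sort. In the sketch below, the function name and the tie-breaking rule (smaller vertex identifier first) are assumptions; the text only prescribes non-increasing degree.

```python
def degree_order(adj):
    """Map each vertex to its rank under non-increasing degree."""
    # Ties are broken by vertex identifier (an assumption of this sketch).
    by_degree = sorted(adj, key=lambda v: (-len(adj[v]), v))
    return {v: rank for rank, v in enumerate(by_degree)}


adj = {0: {1, 2, 3}, 1: {0}, 2: {0, 3}, 3: {0, 2}}
rank = degree_order(adj)
print(rank)  # -> {0: 0, 2: 1, 3: 2, 1: 3}
```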

B. ANALYSIS
The results of our experimentation are summarized in Tables 2-3 (rand-ins), Tables 4-5 (semi-real) and Tables 6-7 (temporal). For each input graph G, we report: (i) computational time (column CT, in seconds), that is the running time of k-pll to rebuild the index and the average running time of dyn-kpll to update the index after each insertion, resp.; (ii) average speed-up, that is the average ratio of the running time of k-pll to rebuild the index to the running time of dyn-kpll to update the index, after each insertion; (iii) index size, that is the size of the index recomputed via k-pll and the size of that updated via dyn-kpll after all insertions, resp. (column IS, expressed in MBs); (iv) average query time for performing the $10^5$ queries on the index recomputed via k-pll and on that updated by dyn-kpll after each insertion, resp. (column QT, in microseconds). Rows are sorted top-to-bottom according to graph order, i.e. |V|.
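The average speed-up defined in (ii) is a mean of per-insertion ratios, which in general differs from the ratio between the rebuild time and the mean update time. A minimal sketch follows; the function name and the timings are made up for illustration and are not measurements from the paper.

```python
def average_speedup(rebuild_time, update_times):
    """Mean over insertions of rebuild_time / update_time."""
    ratios = [rebuild_time / t for t in update_times]
    return sum(ratios) / len(ratios)


# Illustrative timings in seconds (not taken from the paper's tables).
rebuild = 100.0
updates = [0.5, 0.25]
speedup = average_speedup(rebuild, updates)
print(speedup)  # -> 300.0
```

With these numbers the mean of the ratios is 300, whereas dividing the rebuild time by the mean update time (0.375 s) would give about 266.7.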

1) SPEED-UP AND SCALABILITY
Our data show that, despite the worst-case time complexity of Thm. 2, dyn-kpll is extremely fast in updating indexes even for the largest inputs and values of k, and outperforms the recomputation from scratch by k-pll in all experimental trials by orders of magnitude, regardless of graph size, density, diameter and k. In more detail, the observed speed-up of dyn-kpll is smallest when the value of k approaches the graph diameter, where dyn-kpll is, on average, more than two orders of magnitude faster than k-pll (see Tables 3-5, for k = 16), and it increases on large networks, where dyn-kpll is up to tens of thousands of times faster than k-pll (see e.g. graph ytb in Table 2, where k-pll requires ≈ 13 minutes to build the index, while dyn-kpll updates it in ≈ 5 hundredths of a second, or graph wik in Table 7, where k-pll requires ≈ 8 hours to build the index, while dyn-kpll runs for ≈ 13 seconds on average). Indeed, we observe that speed-ups increase as the graph size increases, which suggests that our approach scales well with input size (see Figures 4-5, where graphs on the x-axis are sorted left-to-right according to |V|).

2) SIZE AND QUERY TIME
Measures collected during our experimentation also provide strong evidence that indexes updated via dyn-kpll preserve the nice properties of k-2-Hop Covers in terms of compactness (i.e. index size), which is reflected in extremely small average query times, even for the largest inputs. In fact, essentially across all graphs, values of k, and settings, the sizes of indexes obtained through dyn-kpll are comparable to those of indexes recomputed via k-pll, to within a few MBs of difference. Some exceptions are observed on graph cts (e.g. when k = 16 in the rand-ins experiment), where the size of the index updated by dyn-kpll grows to become up to around 11% larger than that of the index recomputed via k-pll (see Table 3). This phenomenon is expected and most likely due to the lazy nature of the update strategy of dyn-kpll which, similarly to other dynamic algorithms for labelings [5], avoids the removal, from the index, of so-called obsolete entries (i.e. entries that, due to incremental changes, are no longer necessary to cover any pair). This choice is made to keep update times low. On the one hand, it is easy to observe that it does not affect correctness: in fact, the k-cover property is preserved since the query algorithm always selects the smallest values in the multisets (and hence any path longer than, or having the same length as, the k-th, if any, is not returned as part of a solution for any given pair). On the other hand, our data show that the above-mentioned deviations in size are in most cases negligible and in all cases do not affect average query times, which remain in the order of a few microseconds after thousands of updates, even for instances having millions of edges. We remark that this is a desirable behavior to exhibit in a time-evolving environment, since non-preprocessing-based methods can require tens of seconds to extract top-k distances from large graphs [3], while dyn-kpll allows query answering in a few microseconds at the price of a few seconds of update time, and thus makes it possible to exploit top-k distances in meaningful real-world scenarios (e.g. dynamic link prediction [3], [44]).

3) PLATFORM-INDEPENDENT METRICS
All the above considerations on both effectiveness and scalability of dyn-kpll are corroborated by the measures of values β, r, s and c (as defined in Thm. 2) we collected in our tests. In almost all cases, in fact, β and r are orders of magnitude smaller than |V|, while c and s are orders of magnitude smaller than |E| (except for large k, approaching the graph diameter; in such a case the measure of s tends to reach |V|). An excerpt of such measures is shown in Table 8 for the rand-ins experiment with k = 16. Values are averaged, and rounded to the first integer digit, over the total number of graph updates. Results for other graphs and values of k are omitted as they are similar and lead to equivalent considerations.

4) IMPACT OF K AND TYPE OF EXPERIMENTS
Concerning the impact of k on the performance of dyn-kpll, we observe that the provided speed-up tends to decrease as k increases (see e.g. column Avg. Speed-up in Tables 2-7 or trend lines in Figure 5), even if dyn-kpll remains orders of magnitude faster than k-pll. This might be due to the fact that the number of new cycles and paths, induced by a newly inserted edge and whose length is shorter than existing ones, tends to increase with k. This conclusion is supported by our measures of r, s and c, which are such that r ≪ |V| and s, c ≪ |E| for low values of k but tend to increase with k itself, more evidently when k becomes larger than the graph diameter. Other performance indicators (e.g. query time) are weakly influenced by increases of k, which indicates that our method also scales well with respect to k. Finally, it is worth noticing that performance indicators observed for dyn-kpll do not exhibit significant variation across experimental settings, which suggests that our method is robust against "adversarial" scenarios where edges to be inserted are not sampled from an empirical distribution and insertions do not follow typical network formation dynamics (e.g. preferential attachment).
To summarize, our experimentation provides strong evidence that maintaining k-2-Hop Covers via dyn-kpll is the most practical framework to deal with top-k distance queries when large graphs subject to incremental updates have to be managed, and that hence dyn-kpll improves over the state-of-the-art method in dynamic contexts.

V. CONCLUSION
We have studied methods to extract top-k distances from massively sized graphs. We have introduced dyn-kpll, a new dynamic algorithm to update k-2-Hop Covers when the managed graph is time-evolving, and assessed its effectiveness and scalability through extensive experimentation, hence delivering the first scalable algorithmic framework for fast retrieval of top-k distances from massive time-evolving graphs.
Several future research directions can be identified. Perhaps the most relevant one concerns the consolidation of the experimental evaluation presented here to include weighted digraphs. Another interesting direction might be investigating whether and how dyn-kpll can be generalized to handle any type of graph modification and hence to avoid the recomputation from scratch also when vertex/edge removals can occur, even though such modifications are much less frequent in the real-world domains where ranked distances are exploited [3], [5], [19].

Algorithm 5: Sub-routine resume-pksd of Algorithm 4
Input: Vertex $v \in V$, endpoint $u \in \{x, y\}$ of inserted edge, length $l_u$ of path from v to u induced by the insertion
Output: Updated L(w) for any affected vertex w

Theorem 1: Given a graph G and a k-2hc index I = (C, L) covering G, let $G'$ be the graph obtained by inserting an edge $e \notin E$ into G. Call $I' = (C', L')$ the updated k-2hc computed by Algorithm 4. Then, $I' = (C', L')$ satisfies the k-cover property for $G'$.
Let e = (x, y) be the edge inserted into G. The proof is divided in two parts: (a) first, we show that $C'(v)$ is correct, i.e. contains lengths $(d'^{\geq v}_1(v, v), d'^{\geq v}_2(v, v), \ldots, d'^{\geq v}_k(v, v))$ for any $v \in V$; (b) then we prove that $I' = (C', L')$ satisfies the k-cover property for $G'$ by showing that the length label $L'(t)$ of any vertex t contains the sequence $(d'^{>s}_1(s, t), d'^{>s}_2(s, t), \ldots, d'^{>s}_l(s, t))$ of the $1 \leq l \leq k$ shortest lengths induced by paths whose internal vertices are larger than s, for any $s \in V$ such that s < t, that are shorter than $d^{\not> s}_k(s, t)$. Concerning (a), observe that the only step of dyn-kpll that alters loop labels is line 3. Since we employ the mod-bfs sub-routine, which computes lengths ($d'^{\geq v}_1$
Finally, line 3 is executed at most r times, each requiring O(kc) time. Since set aff-set is computed in O(r + c) time, the total running time is $O(\beta(kls) + r(kc)) = O(kl^2 s + rkc)$, since $\beta = O(l)$.

FIGURE 4. Speed-up by dyn-kpll vs graph size in rand-ins (left) and semi-real (right) experiments. Lines show linear regressions.

TABLE 1. Overview of input graphs.

TABLE 2. Results of the rand-ins experiment for k = 2 and k = 4.

TABLE 3. Results of the rand-ins experiment for k = 8 and k = 16.

TABLE 4. Results of the semi-real experiment for k = 2 and k = 4.

TABLE 5. Results of the semi-real experiment for k = 8 and k = 16.

TABLE 6. Results of the temporal experiment for k = 2 and k = 4.

TABLE 7. Results of the temporal experiment for k = 8 and k = 16.