An Efficient Snapshot Strategy for Dynamic Graph Storage Systems to Support Historical Queries



I. INTRODUCTION
As an important part of big data, dynamic graph data are widespread in many applications such as social networks [1]-[3], communication networks [4], [5], biology and disease networks [6], [7], and coauthor networks [8], [9]. Graphs are also a useful tool in areas such as manufacturing [10]. Recently, people not only analyze the real-time state of a graph, but also pay attention to how a graph evolves in order to obtain more knowledge [11], [12]. For example, Leskovec et al. [13] studied diameter changes of an evolving social network, and Semertzidis and Pitoura [14] investigated the most durable graph pattern, i.e., the pattern that exists for the longest period of time.
(The associate editor coordinating the review of this manuscript and approving it for publication was Guangdong Tian.)
The increasing demands for evolving analysis pose great challenges to the dynamic graph storage system. The system is required to be capable of handling historical queries.
In other words, it should be able to recreate any historical states [15], [16].
There are three optional storage models for dynamic graph storage systems to support historical queries. The first is called 'sequence of snapshots'. A dynamic graph is regarded as a sequence of snapshots. Each snapshot is a static graph representing the state of the dynamic graph at a certain moment. The second is 'log file'. All the updates of a dynamic graph are saved in a log file. Therefore, any historical state of the graph can be regenerated through redoing the update operations recorded in the log file. The third is named 'set of varying instances'. A dynamic graph is regarded as a set of dynamically changing instances. Each instance contains the information about the evolving process of one single vertex, e.g., when the vertex is created, how and when the attribute of the vertex is updated, and how and when its relationships with other vertices change.
The first model enables high recreation performance at the cost of very high storage consumption, while the second one consumes the least storage space but suffers from poor recreation performance. The third one reduces storage consumption and provides high performance in recreating the historical state of a single vertex or a small sub-graph; however, it is inefficient in recreating the historical state of the whole graph [17].
The combination of the first two models, named 'snapshot plus log', has been investigated in the area of fault tolerance [18]. It provides low storage consumption as well as high performance in state recovery. In that scenario, snapshots are stored at regular intervals. Moreover, since only the latest state needs to be recovered, older snapshots can be discarded as time goes by to reduce storage consumption.
We investigate how to apply this 'snapshot plus log' model in the dynamic graph storage system to support historical queries. The main challenge lies in the snapshot strategy. The traditional one, which stores snapshots at regular intervals, is inefficient in this new situation, since historical states are not requested with equal frequency. Under the traditional snapshot strategy, some snapshots may seldom be accessed by any historical query, while some historical queries may request historical states that are not near any snapshot. The former case wastes storage space, while the latter degrades recreation performance.
The contribution of this paper includes three aspects. First, we formally define the snapshot optimization problem. It is stated as the minimization of the number of redone and undone operations in the historical state recreation process through the optimal selection of the timestamps of the snapshots. Second, a new snapshot strategy is proposed to solve the problem. The historical queries are clustered according to the timestamps of the requested historical states, and the centroids are used to determine the locations of the snapshots. The novelty lies in that the distribution of the snapshots along the time axis is not uniform but consistent with the density of the historical queries. Last but not least, we conduct an experimental analysis to validate the efficiency of the proposed strategy. The results show that with the same number of snapshots, the proposed strategy greatly improves the performance of historical state recreation; meanwhile, with the same recreation performance guarantee, it sharply reduces the storage consumption.
The paper is organized as follows. Section II summarizes the related work; Section III formally defines and analyzes the problem; Section IV elaborates upon the proposed snapshot strategy; Section V shows the experimental results; Section VI concludes the whole paper.

II. RELATED WORK
Recently, research on the storage of dynamic graphs with support for historical queries has attracted much attention [19], [20]. The fundamental challenge here is the tradeoff between recreation performance and storage consumption [21]. The storage models can be classified into the following three categories: the snapshot sequence model, the log file model and the set of varying instances model. The snapshot sequence model regards the evolution of a dynamic graph as a set of static graph snapshots. One graph snapshot represents the state of the graph at a certain moment. This approach may be suitable for small and slowly evolving graphs [22]. Both Yang et al. [23] and Ren et al. [24] adopted the snapshot sequence model for historical analysis. To improve space efficiency, Ren et al. [24] used a compression method to reduce data redundancy at the cost of increased update time. Zaki et al. [25] proposed a method for storing the sequence of snapshots in a compact manner while maintaining a very low update overhead; the main idea is to separate the parts of the data that may change in the future from the other parts. For a given set of graph snapshots that correspond to the states of an evolving graph at different time instants, Semertzidis et al. [26] discussed the Best Friends Forever (BFF) problem, i.e., how to identify the set of nodes that are the most densely connected in all snapshots. Nelson et al. [27] investigated several problems of interest on time-evolving graphs and proposed corresponding algorithms that run on compressed time-evolving graphs. The advantage of the methods based on the snapshot sequence model is that the requested historical state can be quickly obtained without any recreation cost, while the disadvantage is that the storage consumption is too high. The log file model saves all the update operations of a dynamic graph in a log file.
Theoretically, all the historical states can be recreated by redoing the update operations, and the storage consumption is the minimum. However, it is not feasible in practice, since it takes too long to recreate a historical state.
In the third storage model, a dynamic graph is regarded as a set of varying instances. Each instance contains the update history of a single vertex [17], [28]. Han et al. [29] proposed a method that divides the graph into multiple spatio-temporal chunks with each chunk covering a subset of vertices and spanning a certain time interval. The method is suitable for graph state recreation in a distributed manner. Aridhi et al. [30] also discussed the distributed processing of large scale dynamic graphs. The advantage of this model lies in the efficient historical query of a single vertex or a small sub-graph, but the disadvantage is the inconvenience for recreating the whole historical state for a given timestamp.
In order to make a tradeoff between the performance of historical state recreation and storage reduction, some researchers try to combine the first two models [31]. The combination is called 'snapshot plus log'. The main idea is to save the state of the graph as a snapshot at certain moments, and to record the update operations, such as vertex or edge addition, deletion, and attribute modification, in a log file. When a historical query arrives, the snapshot nearest to the requested historical state is retrieved, and the update operations that happened between the requested historical state and its nearest snapshot are redone or undone depending on whether the snapshot is earlier or later than the requested historical state. After redoing or undoing the update operations, the historical state is recreated. Bhattacherjee et al. [21] also discussed the tradeoff between recreation performance and storage cost. However, their method is designed for handling multiple-versioned data, not for dynamically changing data.
In the 'snapshot plus log' storage model, it is difficult to determine when to save a snapshot. The objective is to store a minimum number of snapshots while keeping the number of redone and undone operations in the recreation process small, or to minimize the number of redone and undone operations for a given number of snapshots. At present, there are two typical snapshot strategies: one based on time intervals and the other based on update operations. The first stores snapshots at regular intervals, while the second stores a snapshot whenever the number of update operations accumulates to a certain value. However, whether the snapshots are appropriate depends on whether they hit the historical queries, or are at least near the requested historical states. Snapshots far away from the requested historical states do not favor the recreation process at all while still consuming storage space.
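For illustration, the two traditional strategies can be sketched as follows (the function names are ours, not the paper's; `update_times` is the sorted list of update-operation timestamps, and both functions return the chosen snapshot timestamps):

```python
def time_interval_snapshots(update_times, n_snapshots):
    """Time interval based strategy: place snapshots at regular
    intervals over the observed time span."""
    start, end = update_times[0], update_times[-1]
    step = (end - start) / n_snapshots
    return [start + step * (k + 1) for k in range(n_snapshots)]

def update_count_snapshots(update_times, n_snapshots):
    """Update operation based strategy: place a snapshot each time a
    fixed number of update operations accumulates."""
    step = len(update_times) // n_snapshots
    return [update_times[min(step * (k + 1) - 1, len(update_times) - 1)]
            for k in range(n_snapshots)]
```

Neither function looks at the historical queries, which is precisely the weakness discussed above.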

III. PROBLEM STATEMENT
We use U to denote the set of update operations that have happened on the dynamic graph G, with each element u_i representing the ith update operation. Each u_i is a tuple consisting of sn_i, ut_i and o_i, where sn_i denotes the serial number of the update operation, ut_i represents the time when the update operation u_i happened, and o_i describes how the graph G was updated at time ut_i. Let M denote the number of elements contained in U; then there are in total M different historical states. The valid timestamps corresponding to the ith historical state can be described as an interval [ut_i, ut_{i+1}). That is, at every time instant during the interval [ut_i, ut_{i+1}), the state of the graph G is the same as that at time ut_i.
We use Q to denote the set of historical queries that are predicted to happen in the near future, with the element q_j representing the jth kind of historical query. Each q_j is a tuple consisting of ht_j and n_j, where ht_j represents the timestamp of the requested historical state of the graph G and n_j represents the frequency of the jth kind of historical query in the near future. In other words, the historical state of the graph G at time ht_j will be queried n_j times. It should be noted that Q is a predicted set, and the more exact it is, the better. A simple prediction method is to take the latest query set as the predicted value. The prediction technique itself is out of the scope of this paper; we just assume that Q is available through some prediction method.
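For concreteness, the sets U and Q defined above might be represented in Python as follows; the tuple types and the sample values are ours and purely illustrative:

```python
from collections import namedtuple

# U: update operations <sn, ut, o>; Q: historical queries <ht, n>.
# Field names mirror the paper's notation.
UpdateOp = namedtuple("UpdateOp", ["sn", "ut", "o"])  # serial no., time, operation
Query = namedtuple("Query", ["ht", "n"])              # requested timestamp, frequency

U = [UpdateOp(1, 0.5, "add vertex v1"),
     UpdateOp(2, 2.0, "add edge (v1, v2)"),
     UpdateOp(3, 3.5, "delete vertex v1")]
Q = [Query(ht=1.0, n=4), Query(ht=3.0, n=1)]
```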
We use S to denote the set of snapshots and T the set of timestamps corresponding to the snapshots in S. To serve a historical query that requests the historical state of the graph G with the timestamp ht_j, the system works as follows. First, it finds the two neighboring snapshots s_left and s_right. Let ts_left and ts_right respectively represent the timestamps of the two neighboring snapshots: ts_left is the largest snapshot timestamp among those not greater than ht_j, and ts_right is the smallest snapshot timestamp among those not smaller than ht_j. Second, the system calculates the distances from the requested historical state to the two neighboring snapshots, and the nearer one is chosen for further computation. Here the distance between a historical state and a snapshot is evaluated not by the time interval itself, but by the number of update operations that occurred during the time interval. Finally, the requested historical state is recreated by redoing or undoing operations on the nearer snapshot. If s_left is the nearer snapshot, the update operations during the interval (ts_left, ht_j] need to be redone, and the process is called forward recreation. Otherwise, the update operations during the interval (ht_j, ts_right] need to be undone, and the process is called backward recreation. The details of the update operations can be looked up in the log file that records the whole updating process of the graph. The number of update operations that are redone and undone greatly affects the performance of historical queries. Constrained by the storage cost, we assume that only N snapshots of the graph G are allowed to be created. The problem is how to determine T for the N snapshots such that the average number of redone and undone operations in serving the historical queries is minimized. Suppose that ts_k (1 ≤ k ≤ N) represents the timestamp of the kth snapshot.
The optimization problem is then stated as follows:

min over ts_1, ..., ts_N of f(ts_1, ..., ts_N) = (Σ_j n_j · O²(j)) / (Σ_j n_j),   (5)

where O(j) is the number of update operations that must be redone or undone to serve the jth kind of historical query from its nearer neighboring snapshot, i.e., O(j) = min(R(j), B(j)), with R(j) the number of update operations in (ts_left, ht_j] and B(j) the number in (ht_j, ts_right]. In Equation (5), we use O²(j) rather than O(j), because the former promises lower variation.
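As a sketch (not the paper's code), the objective can be evaluated for a candidate snapshot placement. Here both queries and snapshots are represented by update-operation serial numbers, so the distance between two of them is directly a number of operations; the function names are ours:

```python
import bisect

def recreation_cost(query_sn, snapshot_sns):
    """O(j): number of update operations to redo or undo from the
    nearer neighboring snapshot. `snapshot_sns` must be sorted."""
    i = bisect.bisect_right(snapshot_sns, query_sn)
    left = snapshot_sns[i - 1] if i > 0 else None
    right = snapshot_sns[i] if i < len(snapshot_sns) else None
    costs = []
    if left is not None:
        costs.append(query_sn - left)    # forward recreation (redo)
    if right is not None:
        costs.append(right - query_sn)   # backward recreation (undo)
    return min(costs)

def objective(query_sns, freqs, snapshot_sns):
    """f: frequency-weighted mean of the squared recreation costs."""
    total = sum(freqs)
    return sum(n * recreation_cost(sn, sorted(snapshot_sns)) ** 2
               for sn, n in zip(query_sns, freqs)) / total
```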

IV. THE PROPOSED SNAPSHOT STRATEGY A. MAIN IDEA
The two traditional snapshot strategies (i.e., the time interval based strategy and the update operation based strategy) make decisions independent of the distribution of historical queries. As a result, some snapshots are far away from all the historical states requested by the historical queries, while some historical queries request historical states that are not near to any snapshot. The former case leads to storage waste, while the latter case leads to recreation performance degradation.
Therefore, we propose a new snapshot strategy that determines the timestamps of the snapshots according to the distribution of historical queries. Firstly, a clustering method is used to aggregate the historical queries into a number of groups. The number of the groups equals the number of the snapshots to store. Secondly, the cluster centroids are used to determine the timestamps of the snapshots. Historical queries aggregated to the same group will be served by the same snapshot. Some of the historical queries are served in a forward recreation manner by redoing update operations on the snapshot, while others are served in a backward recreation manner by undoing update operations on it.
To further explain the idea of our snapshot strategy, we take the example shown in FIGURE 1. In the figure, the graph has been updated 10 times during the period from time instant 0 to 20, and the update operations are represented with u_1, ..., u_10. There are eight historical queries, i.e., h_1, ..., h_8, and the timestamps of the requested historical states are marked in the figure. Suppose that only two snapshots are allowed to be stored.
With the time interval based snapshot strategy, the timestamps of the two snapshots are 10 and 20 respectively. All of the historical states requested by the eight historical queries are recreated from Snapshot1. For serving either h_1 or h_2, four update operations, i.e., u_3, u_4, u_5 and u_6, need to be undone. For serving either h_3 or h_4, three update operations, i.e., u_4, u_5 and u_6, need to be undone. For serving either h_5 or h_6, no operation needs to be undone or redone, because the requested historical state is equal to Snapshot1. For serving either h_7 or h_8, one update operation, i.e., u_7, needs to be redone. Therefore, the average number of redone or undone operations for serving each historical query equals 2.
With the update operation based snapshot strategy, the first snapshot is created once the number of update operations reaches five, so the timestamps of the two snapshots are 6 and 20 respectively. All of the historical states requested by the eight historical queries are recreated from Snapshot1. For serving either h_1 or h_2, three update operations, i.e., u_3, u_4 and u_5, need to be undone. For serving either h_3 or h_4, two update operations, i.e., u_4 and u_5, need to be undone. For serving either h_5 or h_6, one operation, i.e., u_6, needs to be redone. For serving either h_7 or h_8, two update operations, i.e., u_6 and u_7, need to be redone. Therefore, the average number of redone or undone operations for serving each historical query also equals 2.
With our cluster based snapshot strategy, the two snapshots are located at the centroids of the historical queries, and the two timestamps are 4 and 11. For the two historical queries h_1 and h_2, the requested historical states are recreated from Snapshot1 by undoing one update operation, i.e., u_3. For the two historical queries h_3 and h_4, the requested historical states are equal to Snapshot1, without redoing or undoing any update operation. For the two historical queries h_5 and h_6, the requested historical states are recreated from Snapshot2 by undoing one update operation, i.e., u_7. For the two historical queries h_7 and h_8, the requested historical states are equal to Snapshot2, without redoing or undoing any update operation. Therefore, the average number of redone or undone operations for serving each historical query equals 0.5.
Our cluster based snapshot strategy ensures that the historical queries are served by a very near snapshot with a very high probability, and the number of redone and undone operations is minimized in recreating the requested historical states.

FIGURE 2. A scenario for explaining the distance between a historical query and a centroid.

B. ELABORATIONS
As stated in the above subsection, there are two phases in our strategy for determining the snapshot timestamps. One is to cluster the historical queries, and the other is to obtain the timestamps of the snapshots through the cluster centroids. We elaborate upon the two phases in the following.

1) PHASE 1: CLUSTERING THE HISTORICAL QUERIES
We use the K-means algorithm for clustering the historical queries. Although other clustering algorithms may also work, we verify that the simple K-means algorithm already meets our requirements in this situation, as shown in Section V.
The K-means algorithm clusters the samples iteratively. It initializes the centroids with random values. Each iteration consists of three tasks. The first is to calculate the distance between each sample and each centroid. The second is to classify the samples into groups: according to the distances calculated in the first step, the nearest centroid of each sample is determined, and samples with the same nearest centroid are classified into the same group. The third is to update each centroid according to the samples in its group, after which the next iteration starts.
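Since the experiments in Section V are implemented with scikit-learn, a minimal sketch of clustering the (one-dimensional) serial numbers with `sklearn.cluster.KMeans` might look as follows; the sample data are illustrative, not from the paper:

```python
import numpy as np
from sklearn.cluster import KMeans

# Serial numbers of the requested historical states, one entry per
# query occurrence (a query with frequency n_j appears n_j times).
serial_numbers = np.array([2, 2, 3, 3, 6, 7, 7, 7]).reshape(-1, 1)

# Two clusters -> two snapshots; multiple random restarts for stability.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(serial_numbers)
centroids = sorted(c[0] for c in kmeans.cluster_centers_)
```

Each resulting centroid is a (possibly fractional) serial number that Phase 2 converts into a snapshot timestamp.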
In the K-means algorithm, a central part is to calculate the distance between a given sample and a given centroid. In our situation, the samples are the historical queries, and the centroids correspond to the timestamps of the snapshots. A simple indicator of the distance is the length of the time interval between the timestamp of the historical state requested by the historical query and the timestamp of the snapshot. However, what actually matters is not the length of the time interval but the number of update operations that occurred during it, as this determines the cost of redoing and undoing in the historical state recreation. Therefore, we adopt the number of update operations that occurred during the time interval between the timestamp of the requested historical state and that of the snapshot as the distance criterion.
For example, in FIGURE 2, the time interval between the timestamp of the requested historical state and that of Centroid1 is 4, while the time interval between the timestamp of the requested historical state and that of Centroid2 is 6. However, the distance between the requested historical state and Centroid1 is larger than that between the requested historical state and Centroid2, because three update operations occurred during the former time interval, but only two during the latter.
For the convenience of calculating the distance defined above, we substitute the serial number of the latest update operation for the timestamp in describing a historical state. Suppose that the timestamp of the requested historical state is ht_j. Among all the update operations in U, we look for u_i such that ut_i is the maximum among those not greater than ht_j; then sn_i is the serial number of the latest update operation. In the clustering method, we use sn_i to denote the historical state requested by q_j.
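This timestamp-to-serial-number substitution can be sketched with a binary search over the sorted update times (`latest_serial_number` is our name, not the paper's):

```python
import bisect

def latest_serial_number(update_times, serial_numbers, ht):
    """Return sn_i of the latest update operation with ut_i <= ht.
    `update_times` must be sorted; `serial_numbers` aligns with it."""
    i = bisect.bisect_right(update_times, ht) - 1
    if i < 0:
        raise ValueError("query predates the first update operation")
    return serial_numbers[i]
```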
The historical query clustering algorithm is shown in Algorithm 1. The input of the algorithm includes the update operation set U, the historical query set Q, and the number of snapshots to be stored N, while the output C is the set of cluster centroids in the form of the serial numbers of the update operations. Each element c_k in C will be transformed into a snapshot timestamp in Phase 2. Here we just describe the transformation as ts_k = g(c_k).
In Lines 1 to 11, the algorithm transforms the historical query set from Q to Q′: the timestamp of each requested historical state is substituted by the serial number of the latest update operation, which, as stated above, makes the distance more convenient to calculate. In Lines 12 to 13, the cluster centroids are initialized with random values, and they are then optimized iteratively in Lines 14 to 25. In each iteration, each historical query q′_v is attributed to the nearest centroid (Lines 17 to 20), and then the cluster centroids are updated according to their member historical queries (Lines 21 to 24). Besides, the cluster centroids are saved as C_old at the beginning of each iteration (Line 15) and compared with the newly updated cluster centroids C at the end of each iteration (Line 25). If there is no difference between C and C_old, the iteration process finishes and the cluster centroids are returned.
In each iteration, each c_k is updated with the average of the serial numbers of the latest update operations with regard to the historical states requested by the historical queries belonging to the cluster. The reason is explained below.
f(ts_1, ..., ts_N) = f(g(c_1), ..., g(c_N))

Algorithm 1 Historical Query Clustering Algorithm
Input: the update operation set U = {u_i | u_i is a tuple <sn_i, ut_i, o_i>}; the historical query set Q = {q_j | q_j is a tuple <ht_j, n_j>}; the number of clusters to aggregate N
Output: the cluster centroid set C = {c_k | 1 ≤ k ≤ N}
1: Q′ = ∅
2: num = 0
3: for each q_j in Q do
4:   Traverse U to find sn_max, the maximum sn_i satisfying that ut_i is not greater than ht_j
5:   for r from 1 to n_j do
6:     num++
7:     sn_num = sn_max
8:     q′_num = <num, sn_num>
9:     Q′ = Q′ ∪ {q′_num}
10:  end for
11: end for
12: Find both the minimum and the maximum of sn_j in Q′, namely min and max respectively
13: Initialize each c_k in C with a value randomly chosen between min and max
14: repeat
15:   C_old = C
16:   for each v from 1 to num do
17:     sn_v = the serial number contained in q′_v
18:     Find k_0 satisfying that |sn_v − c_{k_0}| is the minimum among all the |sn_v − c_k| with 1 ≤ k ≤ N
19:     Attribute q′_v to Cluster_{k_0}
20:   end for
21:   for k from 1 to N do
22:     Calculate avg_k, the average of sn_j for all the q′_j in Cluster_k
23:     c_k = avg_k
24:   end for
25: until C = C_old
26: return C
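A runnable Python sketch of Algorithm 1 follows; the function name and the deterministic seed are ours (the paper initializes the centroids purely at random), and empty clusters simply keep their previous centroid:

```python
import random

def cluster_queries(U, Q, N, seed=0):
    """Algorithm 1 sketch. U: list of (sn, ut, o) tuples sorted by ut;
    Q: list of (ht, n) tuples; N: number of clusters/snapshots.
    Returns the cluster centroids as (possibly fractional) serial numbers."""
    # Lines 1-11: expand Q into Q', one serial number per query occurrence.
    sns = []
    for ht, n in Q:
        sn_max = max(sn for sn, ut, _ in U if ut <= ht)
        sns.extend([sn_max] * n)
    # Lines 12-13: random initialization within the observed range.
    rng = random.Random(seed)
    C = [rng.uniform(min(sns), max(sns)) for _ in range(N)]
    # Lines 14-25: standard K-means iterations until the centroids settle.
    while True:
        C_old = list(C)
        clusters = [[] for _ in range(N)]
        for sn in sns:                       # attribute to nearest centroid
            k0 = min(range(N), key=lambda k: abs(sn - C[k]))
            clusters[k0].append(sn)
        for k in range(N):                   # recompute centroids
            if clusters[k]:
                C[k] = sum(clusters[k]) / len(clusters[k])
        if C == C_old:
            return C
```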

2) PHASE 2: CALCULATING THE SNAPSHOT TIMESTAMPS FROM THE CLUSTER CENTROIDS
The centroids obtained in Phase 1 are described with the serial numbers of the update operations, and we need to convert them to snapshot timestamps. First, according to the serial number, the time instant when the corresponding update operation occurred can be obtained from the information contained in the set U. Second, the time instant when the following update operation occurred can also be obtained. Any time between the two time instants could be taken as the timestamp of the snapshot. Without loss of generality, we take the first time instant as the snapshot timestamp. The calculation of the snapshot timestamps based on the cluster centroids is shown in Algorithm 2.

Algorithm 2 Snapshot Timestamp Calculation Algorithm
Input: the update operation set U = {u_i | u_i is a tuple <sn_i, ut_i, o_i>}; the cluster centroid set C = {c_k | 1 ≤ k ≤ N}
Output: the snapshot timestamp set T = {ts_k | 1 ≤ k ≤ N}
1: for each c_k in C do
2:   for each u_i in U do
3:     if not sn_i < c_k then
4:       ts_k = ut_i
5:       break
6:     end if
7:   end for
8: end for
9: return T
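Algorithm 2 can be sketched in Python as follows (the function name is ours; U is assumed sorted by serial number):

```python
def centroids_to_timestamps(U, C):
    """Algorithm 2 sketch: map each centroid (a possibly fractional
    serial number) to the time of the first update operation whose
    serial number is not smaller than it."""
    T = []
    for c in C:
        for sn, ut, _ in U:
            if not sn < c:          # first update with sn >= c
                T.append(ut)
                break
    return T
```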
Since the centroids of historical queries may change as time goes by, the timestamps of the snapshots will be updated periodically. The snapshots can be created according to their timestamps.
The serving of historical queries is the same as with the traditional strategies. First, according to the timestamp of the requested historical state, the two nearest snapshots are found, i.e., the left nearest snapshot and the right nearest snapshot. Second, the number of update operations between the timestamp of the left nearest snapshot and the timestamp of the requested historical state is obtained, and so is that between the right nearest snapshot and the requested historical state. Third, according to the numbers of update operations, the truly nearest snapshot is determined. If the left one is the nearest, forward recreation is adopted, and the update operations are redone. Otherwise, backward recreation is adopted, and the update operations are undone.
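A sketch of this serving procedure (the function name is ours; it assumes sorted inputs and at least one snapshot on the relevant side of the query):

```python
import bisect

def plan_recreation(snapshot_ts, update_times, ht):
    """Decide how to serve a query for the historical state at time ht.
    Returns ('forward'|'backward', ops), where ops lists the indices of
    the update operations to redo or undo."""
    pos = bisect.bisect_right(update_times, ht)      # updates with ut <= ht
    i = bisect.bisect_right(snapshot_ts, ht)
    left = snapshot_ts[i - 1] if i > 0 else None
    right = snapshot_ts[i] if i < len(snapshot_ts) else None
    left_pos = bisect.bisect_right(update_times, left) if left is not None else None
    right_pos = bisect.bisect_right(update_times, right) if right is not None else None
    redo = pos - left_pos if left is not None else float("inf")
    undo = right_pos - pos if right is not None else float("inf")
    if redo <= undo:
        return "forward", list(range(left_pos, pos))   # redo these updates
    return "backward", list(range(pos, right_pos))     # undo these updates
```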

V. EXPERIMENTAL EVALUATION
We mainly concentrate on the performance improvement in serving historical queries and the reduction in storage consumption provided by our proposed snapshot strategy. The average number of redone and undone operations in serving each historical query is taken as the criterion for evaluating the recreation performance, while the total number of snapshots is taken as the criterion for storage consumption.

A. EXPERIMENTAL SETUP
We adopt the Monte Carlo simulation method for evaluation. The three snapshot strategies (the time interval based strategy, the update operation based strategy and our clustering based strategy) are implemented in Python 3.7 with scikit-learn. The computer is equipped with an 8-core Intel i5 processor and 8 GB of RAM.
Two kinds of datasets are required for evaluation. The first one is a sequence of arrival time instants of graph update operations, and the second one is a sequence of timestamps representing the historical states requested by the historical queries.
We use the dataset 'CollegeMsg temporal network', released by SNAP (Stanford Network Analysis Project) [32], for evaluation. The 'CollegeMsg temporal network' is a dynamic graph that was updated 59,835 times over a span of 193 days. The time instants of all the updates are available for generating the sequence of time instants of graph update operations. The statistics of the dataset are listed in TABLE 1.
In many applications, a minor part of the data tends to be frequently accessed while the majority part is rarely accessed. The Zipf distribution is widely adopted to describe the skewness of popularity in many real-world applications [33]. We also use the Zipf distribution to randomly generate the timestamps of the historical states requested by the historical queries.
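A sketch of generating Zipf-distributed query timestamps with NumPy follows; the seed, the oversampling factor and the rejection/permutation steps are our assumptions about the setup, since `numpy`'s Zipf sampler has unbounded support and the popular states should be spread over the time axis:

```python
import numpy as np

rng = np.random.default_rng(42)
n_states = 59_835          # one historical state per update operation
n_queries = 10_000

# Zipf-distributed popularity ranks; oversample, then reject ranks
# beyond the number of historical states.
ranks = rng.zipf(1.2, size=n_queries * 4)
ranks = ranks[ranks <= n_states][:n_queries]

# Map each rank to a historical-state index via a random permutation,
# so that the popular states are scattered over the time axis.
perm = rng.permutation(n_states)
state_indices = perm[ranks - 1]
```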

1) ANALYSIS OF IMPROVEMENT IN RECREATION PERFORMANCE
In this experiment, the sequence of arrival time instants of graph update operations is composed of 59,835 elements as described in the above subsection. We generate 10,000 random values in total as the timestamps of the requested historical states. The frequency with which each historical state is requested obeys the Zipf distribution with the parameter α = 1.2. The number of snapshots N varies from 10 to 200.
For each given number of snapshots, we use the three snapshot strategies to calculate the timestamps of the snapshots. After that, the average number of redone and undone operations in serving a historical query is obtained. We compare the three strategies in FIGURE 3. As shown, our proposed clustering based strategy greatly improves the recreation performance compared with the two traditional strategies.
As the number of snapshots increases, the average number of redone and undone operations decreases under all three strategies. However, the storage consumption becomes higher as well. In practice, we should trade off the recreation performance against the storage consumption. As shown in FIGURE 3, the average number of redone and undone operations decreases sharply as the number of snapshots varies between 10 and 100, but very slowly between 100 and 200. Therefore, we choose 100 as the number of snapshots for evaluation.
For N = 100, we demonstrate the distribution of the number of redone and undone operations in serving each historical query in FIGURE 4. Each point (x, y) on the curve means that the proportion of historical queries that require no more than x redone or undone operations equals y. As shown in the figure, the historical queries that require no more than 50 redone or undone operations account for nearly 77.2% with our cluster based snapshot strategy, while the proportion is 45.7% with the time-based strategy and only 8.7% with the operation-based strategy. The historical queries that require no more than 200 redone or undone operations account for nearly 100% with our cluster based snapshot strategy, while the proportions are about 90.3% and 58.2% respectively with the other two strategies.
We then set the Zipf distribution parameter α to 1.5 and keep all the other parameters unchanged. The results are shown in FIGURE 5, indicating that the superiority of our strategy is more apparent when the distribution of the historical queries is more skewed. As a comparison, we list the average number of redone and undone operations for each historical query with N = 100 in TABLE 2. The reduction achieved by our strategy is between 70.7% and 95.5% compared with the two traditional strategies.

2) ANALYSIS OF REDUCTION IN STORAGE CONSUMPTION
We try to find the minimum number of snapshots needed to satisfy the recreation performance requirement that the average number of redone and undone operations does not exceed 15 and 100, respectively. The sequence of arrival time instants of graph update operations is again composed of 59,835 elements, and we generate 10,000 random values as the timestamps of the requested historical states, with the frequency of each historical state being requested obeying the Zipf distribution with the parameter α = 1.5.
The comparison results are shown in FIGURE 6. With the same recreation performance guarantee, our strategy sharply reduces the storage consumption (by nearly 78.9% on average). It should be noted that we differentiate neither the redoing or undoing costs of different kinds of update operations nor the storage costs of different snapshots.
In summary, our proposed strategy improves recreation performance and reduces storage consumption compared with the traditional time-based and operation-based snapshot strategies. However, it must continually analyze the distribution of the historical queries and make adjustments dynamically, while the traditional strategies require little analysis and no dynamic adjustment. This analysis and adjustment process itself introduces computation overhead.

VI. CONCLUSION & FUTURE WORK
This paper studies the snapshot strategy for the 'snapshot plus log' solution to support historical queries of dynamic graphs. It points out that the inefficiency of the traditional snapshot strategies lies in the contradiction between the uniform distribution of snapshots and the skewed distribution of historical queries. In other words, the historical states are not equally requested, while the snapshots are evenly distributed under the traditional snapshot strategies. Therefore, this paper proposes a new snapshot strategy that determines the timestamps of the snapshots based on the distribution of the historical queries, such that the snapshots are near to the requested historical states with a very high probability. The experimental results show that the proposed strategy greatly improves the performance of historical state recreation and sharply reduces the storage consumption.
In the future, we will investigate how to apply this 'snapshot plus log' model in a distributed storage system of dynamic graph data. In a distributed environment, not only the snapshot strategy but also the placement strategy will affect the recreation performance. It is a big challenge to place the snapshots with variable timestamps onto distinct storage nodes to balance the historical query workload as well as to maximize the recreation performance.