Towards PageRank Update in a Streaming Graph by Incremental Random Walk

As the internet and the Internet of Things (IoT) have been widely applied in many application fields, a large number of graphs are continuously produced and change over time, which leads to difficulties in graph analysis and utilization. This paper studies a PageRank update algorithm for a streaming graph using incremental random walk. We focus on the information about the local changes of nodes and edges in the current graph, analyze the impact of such local changes on this current graph, and then use the idea behind wave propagation theory to find all affected nodes whose PageRank needs to be updated in the new graph. For new nodes, the existing nodes in the current graph that are connected with these new nodes are aggregated into a supernode, and the PageRank of the new nodes is solved in the new graph with the supernode. Finally, we conduct a series of experiments on real-world and synthetic graph datasets. Compared with the state-of-the-art incremental computing algorithm, our algorithm not only ensures the accuracy of calculating the PageRank in a large streaming graph, but also speeds up the computational process by avoiding many redundant computations.


I. INTRODUCTION
To evaluate the importance of web pages, the concept of PageRank was first introduced at Google by Page and Brin [1]. Because web pages and their hyperlinks can be treated as the nodes and edges of a directed graph, respectively, PageRank is also used on other graphs and in many application fields. For example, in e-commerce, the PageRank of products can help to recommend relevant products to clients efficiently. In intelligent transportation systems, the PageRank of an urban traffic network can be used to optimize travel paths. In social networks, people can find a user's friends by using PageRank, and if this user is a criminal, the police can easily determine the criminal's associates. In biological networks, geneticists can find some common oncogenes by using PageRank, which can help to avoid congenital diseases. Generally speaking, the above graphs are very large and change dynamically over time [2]. Moreover, a fixed PageRank in a dynamic streaming graph cannot remain valid because the graph structure changes, especially when nodes or edges are added or deleted. In this situation, PageRank needs to be updated after dynamic changes [3]. However, the change at any given moment is usually small compared with the whole streaming graph; e.g., the total number of articles added was less than 4% for English Wikipedia in 2019 [4]. If we compute the PageRank of all nodes in a graph from scratch every time, this process continuously produces a large amount of redundant computation. Computing PageRank in this way is not only inefficient but also wastes considerable computing resources. Additionally, even if this method were parallelized in a distributed environment, its execution time would still not be acceptable for real-time applications.
The associate editor coordinating the review of this manuscript and approving it for publication was Yiqi Liu.
However, PageRank in dynamic streaming graphs needs to be updated in real time, which is a nontrivial challenge. Therefore, incremental methods for calculating PageRank have emerged in some current studies [5], [6]. They make good use of the previous results and, each time, compute only the PageRank values that need to be updated [6]. However, to the best of our knowledge, these existing incremental methods still have some shortcomings in identifying the nodes that need to update their PageRank, which makes it difficult to regulate a trade-off between improving computing efficiency and minimizing the calculation error. In this paper, we use the idea of a random walk and design a novel incremental computation algorithm for updating PageRank in a dynamic streaming graph. Our main contributions can be summarized as follows:
(1) We pay particular attention to local changes of nodes and edges in a streaming graph, which may cause the PageRank of some nodes to change. We apply wave propagation theory to discover the affected local area in such graphs at the present moment and develop a novel algorithm to determine all affected nodes, which reduces both the error and the amount of calculation.
(2) The PageRank of the affected nodes is calculated in an incremental manner based on the Monte Carlo idea. We leverage the information about random walks and about changes, including the addition and deletion of nodes or edges, to identify the number of changed random walk paths and calculate the probabilities of visiting these nodes. Then, an optimized PageRank update method for the affected nodes is presented to avoid redundant computation and accelerate the PageRank update.
(3) For newly added nodes, the existing nodes in the current graph that are connected with these new nodes are aggregated into a supernode. These new nodes and the supernode form a smaller new graph in which the PageRank of the newly added nodes is calculated at a very low computational cost.
The rest of this paper is structured as follows. Section II investigates the related work. Section III presents the problem description about updating PageRank in a large dynamic streaming graph. Section IV proposes a novel approach to finding all affected nodes and conducts a deep analysis. Section V elaborates on an incremental computation of PageRank in a streaming graph. Section VI reports experiments that show the advantage of our approach over traditional techniques. Section VII concludes this paper.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

II. RELATED WORK
When we review PageRank computation, the traditional methods can be divided into two categories: power iteration methods [7] and Monte Carlo methods [8], [9]. The former converges to the PageRank by iterative operation of the graph connection matrix, and the latter uses a random walk to approximate the PageRank. Both methods can be used to calculate PageRank for static and dynamic graphs, and many extension methods based on them have been developed.
For static graphs, Desikan et al. [10] used a power iteration method to calculate the PageRank of all nodes. Kamvar et al. [11] used the quadratic extrapolation method to remove the nonprincipal eigenvectors in the current iteration and accelerate the convergence of the iterative calculation. Arnal [12] presented a parallel version of the power iteration method to reduce the execution time of PageRank computation. Following the Monte Carlo idea, Nigam [13] designed a method to approximate PageRank based on the Markov model, and verified its effectiveness. Fushimi et al. [14] proposed a p-avg approach to obtain approximate PageRank scores, which improved the speed of PageRank computation.
For a dynamic graph, the easiest way is to reuse the methods that calculate PageRank for a static graph. Christian et al. [15] turned a dynamic graph into a series of snapshots, treated these snapshots as static graphs, and calculated the nodes' PageRank on these static graphs by the Gauß-Seidel method. Gonzalez [16] regarded a dynamic graph, processed in batches over time, as a static graph and computed the nodes' PageRank. The sliding time window was introduced by Bahmani et al. [17] to deal with the dynamic streaming graph for computing PageRank. Essentially, these methods convert dynamic graphs into many static graphs, and then calculate the PageRank of all nodes again and again. If the time window is very short and too many static graphs are generated, these methods are time-consuming.
Generally, in some application fields, a dynamic graph changes locally and on a small scale: only a few nodes or edges are added to or removed from the current graph during a period of attention. For this reason, the PageRank of only a small number of nodes is affected. To speed up the solution, one only needs to recalculate the PageRank of these affected nodes rather than of all graph nodes. Inspired by this idea, some incremental computing frameworks and platforms have emerged, such as GraphIn [18], Tornado [19] and DZIG [20]. Moreover, many incremental PageRank update algorithms have been proposed. Kim [21] updated the PageRank of the affected nodes by using an incremental iteration method. Chien et al. [22] proposed a novel method for discovering the affected nodes, showing that the influence gradually decreases as it passes through its reachable nodes. Zhang [23] presented an incremental power iteration method, I-PageRank, which avoided computing PageRank from scratch. McSherry [24] proposed a power iteration method that can accelerate the convergence of updated iterations. Bahmani et al. [25] proposed a proportional probing method, which could determine the existing nodes with high PageRank changes and take a small affected portion of the graph to update the PageRank of these nodes. Yu et al. [26] designed an incremental random walk algorithm, IRWR, which mainly focused on the edges that were constantly changed and found the top-k nodes affected by these edges to update PageRank; pruning needless calculation in this way was effective. Prasanna et al. [27] treated a strongly connected component as an affected area in a graph and calculated the PageRank of the local affected nodes at low computational cost. Atish et al. [8] modified the PageRank of all nodes by adjusting the affected random walk paths in a streaming manner. Liao et al. [28] proposed an incremental PageRank algorithm, IMCPR, based on the Monte Carlo method, which achieved a lower computational complexity of O((m + n/c)/c), where m, n and c are the number of changed nodes, the number of changed edges and the reset probability, respectively. In addition, distributed computing systems are now widely used to speed up big graph processing. Mariappan et al. [29] proposed an incremental BSP processing model to calculate the PageRank in a big streaming graph.
Although the above methods for a big dynamic graph can save computational time, they may identify the affected node area inaccurately or unreasonably. Riedy [3] proposed a PageRank updating method based on iterative refinement. When a dynamic graph changed, the error of PageRank was compensated in the process of updating PageRank to ensure that the updating result was accurate. Zhang et al. [30] presented two dynamic versions of forward push and reverse push based on the power iteration method, which are more accurate than those in Reference [17]. Yoon et al. [31] proposed a fast and accurate OSP algorithm, which updated PageRank on dynamic graphs by random walk with restart and set the restart probability to balance accuracy and calculation time. Reference [28] assumed that a random walk path did not visit the same node repeatedly, which is inconsistent with how random walks actually behave. Therefore, the calculated PageRank had a large error, and the error increased continuously with the dynamic change of the graph. Zhan et al. [9] proposed the revisit probability model to reflect the random walk process faithfully. When a dynamic graph changed, the PageRank was updated based on the original random walk information.
The above works all attempted to accelerate the calculation of PageRank while ensuring high accuracy, but there is still no good trade-off between accuracy and efficiency. Therefore, this work studies an algorithm for PageRank update in a streaming graph using incremental random walk, and realizes the parallelism of this algorithm. Our goal is to achieve high precision and high efficiency at the same time.

III. CONCEPTS AND PROBLEM DESCRIPTION

A. TRADITIONAL METHOD TO CALCULATE PAGERANK
In this paper, we let $G = (V, E)$ be a directed graph, where $V$ and $E$ are the sets of nodes and edges, respectively; $|V|$ is the number of nodes and $|E|$ is the number of edges. For two nodes $u, v \in V$, if there exists a directed edge $(u, v) \in E$, then $v \in Out_u$, where $Out_u$ is the set of $u$'s outgoing neighbors. We let $A$ be the adjacency matrix related to $G$ as follows:

$$A_{ij} = \begin{cases} 1, & (v_i, v_j) \in E \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

We let $D$ be the diagonal matrix of nodes' out-degrees. Then, the transition matrix related to $G$ is denoted as $C = A^{T} D^{-1}$. The initial vector $X^{(0)}$ has $|V|$ entries, and the value of each entry is $1/|V|$. Then the PageRank of the nodes in $G$, $PR = (pr_1, pr_2, \ldots, pr_{|V|})^{T}$, can be calculated as follows:

$$X^{(k+1)} = \alpha C X^{(k)} + \frac{1-\alpha}{|V|}\, e, \qquad PR = X^{(k+1)} \ \text{once} \ \left\| X^{(k+1)} - X^{(k)} \right\| < \varepsilon, \tag{2}$$

where $\alpha$ is the probability that any node's outgoing neighbors are visited, $e$ is the $|V|$-dimensional vector whose entries are all 1, $\varepsilon$ is a threshold satisfying the convergence requirement, and $pr_i$ is the calculated PageRank of node $v_i$.
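The iteration above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `pagerank_power_iteration`, the L1 stopping norm, and the uniform redistribution of mass from nodes without out-edges are assumptions made here so that the example runs on any small graph.

```python
def pagerank_power_iteration(nodes, edges, alpha=0.85, eps=1e-10, max_iter=1000):
    """Iterate X <- alpha * C * X + (1 - alpha) / |V| until the L1 change is below eps."""
    n = len(nodes)
    out = {u: [] for u in nodes}
    for u, v in edges:
        out[u].append(v)
    x = {u: 1.0 / n for u in nodes}  # X(0): every entry equals 1/|V|
    for _ in range(max_iter):
        nxt = {u: (1.0 - alpha) / n for u in nodes}
        for u in nodes:
            if out[u]:
                share = alpha * x[u] / len(out[u])
                for v in out[u]:
                    nxt[v] += share
            else:
                # Assumption: a node without out-edges spreads its mass uniformly.
                for v in nodes:
                    nxt[v] += alpha * x[u] / n
        diff = sum(abs(nxt[u] - x[u]) for u in nodes)
        x = nxt
        if diff < eps:
            break
    return x
```

On a directed 3-cycle every node receives the same score, which is a quick sanity check for the transition step.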

B. THE MONTE CARLO IDEA TO CALCULATE PAGERANK
The Monte Carlo idea has been reported to approximate the PageRank of a graph [32]. To clarify the calculation process, we first introduce some concepts.
Definition 1 (Random Walk, RW): In a graph $G = (V, E)$, a random walk is a type of random process that proceeds in steps as follows: a walker chooses a node $v \in V$ as a starting node and, at each step, walks along a randomly chosen out-edge of the current node with a probability of $\alpha$. If the walker keeps walking, the total number of steps must not exceed $R$. Therefore, a walker may stop with a probability of $1 - \alpha$ after each step, or terminate at a node that has no out-edge. To approximate the PageRank of each node more effectively, $M$ random walks starting from each node are performed.
Definition 2 (Random Walk Path, RWP): In a graph $G = (V, E)$, a random walk path is a continuous path starting from any node to a reachable node.
For each node $v_i$, we let $s_{v_i}$ be the total number of times that all RWPs have visited $v_i$ using the random walk method in Definition 1. The PageRank of $v_i$, denoted by $pr_{v_i}$, is then approximated from $s_{v_i}$ by Formula 3.
Let $w_{v_i}$ be the number of all RWPs passing through node $v_i$. If each RWP visits $v_i$ only once, then $s_{v_i} = w_{v_i}$. However, there may exist a cycle in $G$, which means that an RWP may visit the same node many times. In this case, $s_{v_i} > w_{v_i}$. To discuss the relationship between $s_{v_i}$ and $w_{v_i}$, we take the example shown in Fig. 1. Here, we use the variable $r_{v_1 v_2}$ to represent the probability that a path starting from node $v_1$ passes through the out-edge $(v_1, v_2)$ and returns to $v_1$; its value depends on $L$, the length of this path, and on $Out_{v_i}$, the set of node $v_i$'s outgoing neighbors. From these probabilities, the probability $r_{v_1}$ that an RW starting from node $v_1$ returns to $v_1$ can be calculated, and from it the relationship among $r_{v_i}$, $w_{v_i}$ and $s_{v_i}$ is obtained.
Proposition 1: In a graph $G = (V, E)$, for each node $u \in V$, we let $pr_u$ be the PageRank of $u$ calculated by Formula 2. If $pr'_u$ is calculated by Formula 3, then $pr'_u$ approximates $pr_u$.
Proof: Assume that the true PageRank of node $u$ is denoted as $\pi_u$, and that $pr_u$, calculated by the basic method, is equal to $\pi_u$. The "random walk" that we define is based on the Monte Carlo methods. We walk randomly along the directed edges in the graph, and the probability of passing through any node $u$ tends to be stable. This probability for node $u$, denoted as $pr'_u$, is treated as the approximate PageRank of $u$ [33]. Since $pr_u = \pi_u$, $pr'_u$ approximates $pr_u$.
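To make Definitions 1 and 2 concrete, the following sketch runs $M$ walks from every node and records, for each node, the total number of visits ($s_v$) and the number of paths that pass through it ($w_v$). It is an illustration under stated assumptions, not the paper's code: the names `random_walk` and `monte_carlo_pagerank` are hypothetical, and the score is normalised by the total visit count, which is one common Monte Carlo estimator and may differ from the paper's Formula 3 by its normalising constant.

```python
import random

def random_walk(out, start, alpha, max_steps, rng):
    """One walk: continue with probability alpha; stop at a node with no out-edge or after max_steps."""
    path = [start]
    node = start
    while len(path) - 1 < max_steps and out[node] and rng.random() < alpha:
        node = rng.choice(out[node])
        path.append(node)
    return path

def monte_carlo_pagerank(out, alpha=0.85, walks_per_node=200, max_steps=100, seed=0):
    """Return (scores, visits, paths) where visits[v] is s_v and paths[v] is w_v."""
    rng = random.Random(seed)
    visits = {u: 0 for u in out}
    paths = {u: 0 for u in out}
    for u in out:
        for _ in range(walks_per_node):
            path = random_walk(out, u, alpha, max_steps, rng)
            for v in path:
                visits[v] += 1          # counts repeated visits on a cycle
            for v in set(path):
                paths[v] += 1           # counts each path at most once per node
    total = sum(visits.values())
    # Assumption: normalise by the total number of visits so that scores sum to 1.
    scores = {u: visits[u] / total for u in out}
    return scores, visits, paths
```

On a graph with a cycle, `visits[v]` exceeds `paths[v]` because a single path can return to the same node, which is exactly the gap between $s_v$ and $w_v$ discussed above.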

Definition 3 (Streaming Graph, SG): A streaming graph is a dynamic graph changing over time steps $1, 2, 3, \ldots, t, \ldots$, where $\Delta G^{t}$ is a small graph that contains all the change information about node or edge insertion or deletion at time $t$. $\Delta G^{t} = G^{t} - G^{t-1}$ can be generated and reported by an online graph computing system. More concretely, $\Delta G^{t}$ is described by the sets $V^{0,t}$, $V^{+,t}$, $V^{-,t}$, $E^{+,t}$ and $E^{-,t}$, where $V^{0,t}$ is the set of nodes with one or more edges added or deleted in $G^{t-1}$, $V^{+,t}$ is the set of all new nodes, $V^{-,t}$ is the set of all deleted nodes at time $t$, $E^{+,t}$ is the set of all new edges, and $E^{-,t}$ is the set of all deleted edges at time $t$. An example is shown in Fig. 2. A black circle represents an original node, a green rectangle indicates that a new node has been added, and a green arrow indicates that a new edge has been added. A dotted green circle indicates the deletion of a node and a dotted green arrow indicates the deletion of an edge.
Using traditional methods to calculate PageRank is time-consuming. Moreover, a large number of redundant processing operations occur when calculating PageRank from scratch as the data scale of a streaming graph keeps growing. Therefore, we need to seek new methods to improve efficiency. In practice, $\Delta G^{t}$ is much smaller than $G^{t}$, and the PageRank in $G^{t}$ may be very similar to that in $G^{t-1}$. As in Fig. 2(a), a large number of computational operations are wasted on nodes whose PageRank is almost unchanged, which means repetitive and redundant computation. As a result, we study an incremental method to calculate PageRank as shown in Fig. 2(b), where only a few red nodes need to update their PageRank.
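One possible in-memory representation of a change batch is sketched below. The class name `GraphDelta`, its field names, and the helper `apply_delta` are hypothetical; the sketch also assumes that $V^{0,t}$ is derived from the edges listed explicitly in the batch and excludes deleted nodes.

```python
from dataclasses import dataclass, field

@dataclass
class GraphDelta:
    new_nodes: set = field(default_factory=set)       # V^{+,t}
    deleted_nodes: set = field(default_factory=set)   # V^{-,t}
    new_edges: set = field(default_factory=set)       # E^{+,t}
    deleted_edges: set = field(default_factory=set)   # E^{-,t}

    def touched_nodes(self, old_nodes):
        """V^{0,t}: surviving old nodes that gain or lose an edge listed in this batch."""
        ends = set()
        for u, v in self.new_edges | self.deleted_edges:
            ends.update((u, v))
        return (ends & old_nodes) - self.deleted_nodes

def apply_delta(nodes, edges, delta):
    """Build (V^t, E^t) from (V^{t-1}, E^{t-1}) and one change batch."""
    new_nodes = (nodes | delta.new_nodes) - delta.deleted_nodes
    new_edges = (edges | delta.new_edges) - delta.deleted_edges
    # Edges incident to a deleted node disappear together with the node.
    new_edges = {(u, v) for u, v in new_edges if u in new_nodes and v in new_nodes}
    return new_nodes, new_edges
```

Keeping the five sets separate mirrors the definition above and lets later steps treat additions, deletions and new nodes independently.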

IV. AFFECTED AREA DUE TO GRAPH CHANGES
If a dynamic graph does not change, the PageRank of all nodes does not need to be recalculated. However, even if only a small number of nodes or edges are added or deleted, the PageRank of some related local nodes is affected. In this section, we discuss how to determine the affected area.

A. WAVE PROPAGATION PHENOMENON
Throughout the process of sound propagation, the sound intensity gradually decreases as sound travels farther. It has been reported that the relationship between sound intensity and propagation distance is $P_x = P_0 e^{-\beta x}$ [34], where $P_0$ is the sound intensity at the origin, $P_x$ is the sound intensity at a location at distance $x$ from the origin, and $\beta$ is the attenuation coefficient. Generally, $\beta = 0.0255$. We apply this theory to analyze the dynamic change and its local impact in a streaming graph.
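As a small numerical illustration of the attenuation law, the sketch below evaluates $P_x = P_0 e^{-\beta x}$ and inverts it to find the distance at which the intensity falls to a given fraction of $P_0$. The function names are hypothetical.

```python
import math

def attenuated_intensity(p0, distance, beta=0.0255):
    """Intensity at the given distance from the origin: P_x = P_0 * exp(-beta * x)."""
    return p0 * math.exp(-beta * distance)

def distance_for_fraction(fraction, beta=0.0255):
    """Distance x at which P_x / P_0 equals the given fraction (0 < fraction <= 1)."""
    return math.log(1.0 / fraction) / beta
```

With the default $\beta$, `distance_for_fraction(0.5)` is roughly 27 distance units, so the intensity halves only after a comparatively long path; a larger $\beta$ shrinks that range.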

As stated in Section
This means that the PageRank of all nodes in $G^{t-1}$ remains unchanged. If $V^{0,t} \neq \emptyset$, then, because all nodes in $V^{0,t}$ are also in $V^{t-1}$, $\Delta G^{t}$ may affect some nodes in $G^{t-1}$ through the nodes in $V^{0,t}$. Therefore, we treat the nodes in $V^{0,t} \cup V^{-,t}$ as the starting nodes to determine all affected nodes in $G^{t-1}$, and update the PageRank of these affected nodes.
To determine the nodes in $G^{t-1}$ affected by $\Delta G^{t}$, we consider two factors: one is the distance $dist(u, v_i)$ between any node $v_i \in V^{t-1}$ and a starting node $u \in V^{0,t} \cup V^{-,t}$, and the other is the out-degree of $v_i$, which is $|Out_{v_i}|$. Similar to wave propagation, we first initialize the impact degree on $u$ due to $\Delta G^{t}$ as $aff_u = 1$; then, the impact degree on $v_i$ is $aff_{v_i} = aff_u\, e^{-\beta\, dist(u, v_i)}$. Moreover, the larger the out-degree of $v_i$ is, the more paths the impact propagates along, which means that a smaller impact degree spreads to each of its outgoing neighbors [35].
If $|Out_{v_i}| = 0$, $v_i$ has no outgoing neighbor, and the impact stops propagating at $v_i$. The impact degree on each node $v_{i+1}$ reachable from $u$ is given by Formula 6. Let $\delta \in (0, 1)$ be the threshold to terminate impact propagation. If $aff_{v_{i+1}} < \delta$, the impact degree on $v_{i+1}$ can be ignored. In particular, if $v_{i+1}$ has no outgoing neighbor, the impact stops spreading at $v_{i+1}$. Thus, the set of all affected nodes in $G^{t-1}$, denoted as $V^{t}_{aff}$, consists of the nodes reached from the starting nodes whose impact degree is not less than $\delta$. To clearly describe the process of finding the affected nodes, we design an algorithm, and the corresponding pseudocode is as follows.
Algorithm 1 (surviving lines of the pseudocode):
    ...
        return V^t_aff;
    else
        for all v ∈ BV^t            // Propagate impact from the boundary nodes
            aff_v ← 1;              // Initialize the impact degree on a boundary node
            dist_v ← 0;
        end for
        V^t_aff ← BV^t;
        Tmp ← BV^t;
        while Tmp ≠ ∅
            for all v_i ∈ Tmp
                ...
                if |D_1| ≠ 0
                    dist_{v_i} ← dist_{v_i} + 1;
                    for all v_j ∈ D_1
                        ...

Proposition 2: We suppose the distance between two adjacent nodes in a graph is 1, the propagation attenuation coefficient is $\beta$, and the threshold to terminate impact propagation is $\delta$. $G^{t-1}$ is affected by $\Delta G^{t}$. Then, the length of the farthest path where the impact is propagated in [...]
Proof: Because $G^{t-1}$ is affected by $\Delta G^{t}$, we may assume that the impact starts to propagate from $u$, that a reachable node on the propagation path is denoted as $v_i$, and that the out-degree of $v_i$ is $|Out_{v_i}| \geq 1$. If $|Out_{v_i}| = 1$ for $i = 1, 2, \ldots$, the impact propagates along only one path, and the distance is the farthest. We suppose $v_x$ is the farthest reachable node on this path. Then, according to Formula 6 and $dist(u, v_i) = dist(u, v_{i-1}) + 1$, we have the following: [...]
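The propagation idea can be sketched as a best-first search from the starting nodes. This is a hedged illustration and not Algorithm 1 itself: it assumes that each hop multiplies the impact by $e^{-\beta}$ and divides it evenly among the outgoing neighbors, and the names `find_affected_nodes` and `max_propagation_length` are hypothetical. Under the same assumption, a path on which every node has out-degree 1 carries the impact for at most $\lfloor \ln(1/\delta)/\beta \rfloor$ hops.

```python
import heapq
import math

def find_affected_nodes(out, start_nodes, beta=0.0255, delta=0.5):
    """Return {node: impact degree} for every node whose impact is at least delta."""
    aff = {}
    heap = [(-1.0, u) for u in start_nodes]   # starting nodes have impact degree 1
    heapq.heapify(heap)
    while heap:
        neg_impact, u = heapq.heappop(heap)
        if u in aff:
            continue                          # already settled with a larger impact
        aff[u] = -neg_impact
        neighbours = out.get(u, [])
        if not neighbours:
            continue                          # impact stops at a node with no out-edge
        # Assumption: one hop attenuates by exp(-beta) and splits over the out-degree.
        passed = aff[u] * math.exp(-beta) / len(neighbours)
        if passed < delta:
            continue                          # below the threshold: ignore
        for v in neighbours:
            if v not in aff:
                heapq.heappush(heap, (-passed, v))
    return aff

def max_propagation_length(beta, delta):
    """Longest hop count on an out-degree-1 path under the same attenuation assumption."""
    return math.floor(math.log(1.0 / delta) / beta)
```

Raising `delta` shrinks the returned set, which is the accuracy versus cost knob discussed in this section.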

V. PAGERANK UPDATE BY AN INCREMENTAL RANDOM WALK
Since the PageRank of a deleted node does not need to be calculated, this section only discusses updating the PageRank of all affected nodes in the current graph and calculating the PageRank of the newly added nodes. To update the PageRank of node $v_i \in V^{t}_{aff}$, we need to calculate $w^{t}_{u}$, the number of RWPs passing through $u$, and $s^{t}_{u}$, the total number of times that RWPs have visited $u$. Following the random walk method in Definition 1, even if the total number of edges in a graph changes, as long as the total number of nodes does not change, the total number of random walk paths does not change, so $w^{t}_{u} = w^{t-1}_{u}$. The total number of times of passing through node $u$ can be calculated by Formulas 4 and 5. Since a new edge is added at node $u$, the total number of times that RWPs pass through $u$ to subsequent nodes increases. Let $a^{t}_{u}$ be the increased number of RWPs passing through $u$ due to adding $e = (u, v)$. Obviously, adding an edge $e$ may change $s^{t-1}_{v_i}$, the total number of times of passing through node $v_i \in V^{t}_{aff}$. To calculate it as accurately as possible and reduce the random error, we set a larger value of $a^{t}_{u}$, which is the number of rounds of repeated random walks. Following the random walk method in Definition 1, we perform $a^{t}_{u}$ rounds of repeated random walks from node $u$. Once an RWP that starts from $u$ and picks $v$ as the next node passes through $v_i$, $s^{t-1}_{v_i}$ is increased by 1. After finishing all the specified random walks, node $v_i$ records the updated $s^{t-1}_{v_i}$, which is $s^{t}_{v_i}$ at time $t$. Finally, the PageRank of $v_i \in V^{t}_{aff}$ can be calculated by using Formula 3.
Since a new edge $e = (u, v)$ is added, the PageRank of the affected node $v_i \in V^{t}_{aff}$ is changed. To update the PageRank of these affected nodes, we design an algorithm and its pseudocode is as follows.
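The counting step described above can be sketched as follows. This is a simplified illustration and not Algorithm 2: the number of rounds is passed in as the parameter `rounds` because the expression for $a^{t}_{u}$ is not reproduced here, the function name is hypothetical, and the sketch only adds visits made through the new edge.

```python
import random

def update_visits_after_edge_addition(out, visits, affected, u, v, rounds,
                                      alpha=0.85, max_steps=100, seed=0):
    """Replay `rounds` walks that leave u through the new edge (u, v) and
    add one visit to every affected node that such a walk passes through."""
    rng = random.Random(seed)
    updated = dict(visits)
    for _ in range(rounds):
        path = [v]                      # first step: u -> v along the new edge
        node = v
        while len(path) < max_steps and out.get(node) and rng.random() < alpha:
            node = rng.choice(out[node])
            path.append(node)
        for w in path:
            if w in affected:
                updated[w] = updated.get(w, 0) + 1
    return updated
```

Only counters of nodes in the affected set are touched, so the cost of the update is proportional to the number of replayed walks rather than to the size of the graph.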
Algorithm 2 (surviving lines of the pseudocode):
    ...
    Tmp ← Tmp + {v_i};
    while Tmp ≠ ∅
        v_j ← randomChoose(Tmp);              // Get a node randomly
        ...
    ...                                       // The increased number of RWPs passing through u
    for z = 1 to a^t_u                        // Perform a^t_u rounds of repeated random walks from node u
        path_z ← doRandomWalk(G^{t-1}, u);
        for all v_i ∈ V^t_aff
            if coverNewEdge(path_z, v_i) = true   // Pass through the new edge (u, v)
                ...

Proposition 3: If a new edge $e = (u, v) \notin E^{t-1}$ is added, the set of all nodes in $G^{t-1}$ affected by adding edge $e$ is $V^{t}_{aff}$. To update the PageRank of the nodes in $V^{t}_{aff}$, it is stipulated that $a^{t}_{u}$ random walks starting from node $u$ are performed. Compared with the random walk algorithm [9], the minimum computational complexity saved by Algorithm 2 is [...], where $\alpha$ and $M$ are two parameters set in Definition 1.
Proof: According to the random walk algorithm [9], the computational complexity of the overall random walk method is $|V^{t}_{aff}|\, a^{t}_{u} / (1 - \alpha)$. In Algorithm 2, if $e = (u, v) \notin E^{t-1}$ is added in $G^{t-1}$, we perform $a^{t}_{u}$ rounds of repeated random walks starting from node $u$ after adding edge $e$ to adjust the PageRank of the nodes in $V^{t}_{aff}$, where $a^{t}_{u} < M$. According to Algorithm 2, the computational complexity is [...]. Thus, the minimal computational complexity of Algorithm 2 can be reduced by [...]

2) AN EDGE DELETION
We assume that both nodes $u$ and $v$ are existing nodes in the current graph $G^{t-1}$, and that there is a directed edge $e = (u, v)$. If this edge is deleted, $e = (u, v)$ no longer exists in the new graph $G^{t}$. As stated in Section III-B, we can obtain $w^{t-1}_{u}$, $r^{t-1}_{u}$ and $s^{t-1}_{u}$ in $G^{t-1}$. To update the PageRank of node $v_i \in V^{t}_{aff}$, we need to calculate $w^{t}_{u}$, the number of RWPs passing through $u$, and $s^{t}_{u}$, the total number of times that RWPs have visited $u$. Similar to the discussion in Section V-A, $w^{t}_{u} = w^{t-1}_{u}$ and $s^{t}_{u} =$ [...]. Because an edge $e$ is deleted at node $u$, the total number of times that RWPs pass through $u$ decreases. We let $c^{t}_{u}$ be the number of times of passing from $u$ to its subsequent nodes that is affected by casting away this edge $e = (u, v)$. Obviously, deleting an edge $e$ may change $s^{t-1}_{v_i}$, the total number of times of passing through node $v_i \in V^{t}_{aff}$. Following the random walk method in Definition 1, we perform $c^{t}_{u}$ rounds of random walks starting from the outgoing neighbors of node $u$ or from node $v$. Once an RWP of the former kind passes through $v_i \in V^{t}_{aff}$, $s^{t-1}_{v_i}$ is increased by 1. Once an RWP of the latter kind passes through $v_i$, $s^{t-1}_{v_i}$ is decreased by 1. After finishing all the specified random walks, node $v_i$ records the updated $s^{t-1}_{v_i}$, which is $s^{t}_{v_i}$ at time $t$. Finally, the PageRank of $v_i \in V^{t}_{aff}$ is calculated by using Formula 3. Since an edge $e = (u, v)$ is deleted, the PageRank of the affected node $v_i \in V^{t}_{aff}$ is changed. To update the PageRank of these affected nodes, we design an algorithm and its pseudocode is as follows.

Algorithm 3 Updating PageRank After Casting Away an Edge (UPR_CE)
Output: the set of the affected nodes' PageRank $PR^{t}_{aff}$. // Similar to Algorithm 2, we omit the details here.
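Since the details of Algorithm 3 are omitted, the sketch below only illustrates the two-sided counting rule described in the text: walks restarted from the remaining outgoing neighbors of $u$ add visits, and walks replayed from $v$ remove visits. It is an assumption-laden illustration: the number of rounds is a parameter because the expression for $c^{t}_{u}$ is not reproduced here, counters are clamped at zero, and all names are hypothetical.

```python
import random

def _walk(out, start, alpha, max_steps, rng):
    """Random walk that continues with probability alpha and stops at nodes with no out-edge."""
    path = [start]
    while len(path) <= max_steps and out.get(path[-1]) and rng.random() < alpha:
        path.append(rng.choice(out[path[-1]]))
    return path

def update_visits_after_edge_deletion(out_new, visits, affected, u, v, rounds,
                                      alpha=0.85, max_steps=100, seed=0):
    """out_new is the adjacency of the graph after (u, v) has been removed."""
    rng = random.Random(seed)
    updated = dict(visits)
    for _ in range(rounds):
        if out_new.get(u):
            # Walks now leave u through one of its remaining out-edges: add visits.
            for node in _walk(out_new, rng.choice(out_new[u]), alpha, max_steps, rng):
                if node in affected:
                    updated[node] = updated.get(node, 0) + 1
        # Walks that used to continue through v are withdrawn: remove visits.
        for node in _walk(out_new, v, alpha, max_steps, rng):
            if node in affected:
                updated[node] = max(0, updated.get(node, 0) - 1)
    return updated
```

The function returns a new counter dictionary and leaves the input untouched, so the previous state stays available for comparison.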

B. CALCULATING THE PAGERANK OF NEW NODES IN $G^{t}$
Using the information about adding or deleting nodes or edges contained in $\Delta G^{t}$, $G^{t-1}$ can evolve and generate $G^{t}$. For the nodes affected by $\Delta G^{t}$, the PageRank can be updated by using Algorithms 1 and 2. For the newly added nodes kept in $V^{+,t}$, we next discuss the method for calculating the corresponding PageRank. The main idea is to aggregate all the nodes in $V^{0,t}$ into a supernode, denoted as $sn$, treat each edge between a node in $V^{+,t}$ and a node in $V^{0,t}$ as an edge between the node in $V^{+,t}$ and $sn$, and then construct a new, much smaller graph, denoted as $\bar{G}^{t}$. Following the random walk method in Definition 1, we can calculate the total number of times that all RWPs have visited node $v_i \in V^{+,t}$, which is $s^{t}_{v_i}$, and then obtain the PageRank of $v_i$ by using Formula 3.
As shown in Fig. 3, after $G^{t-1}$ is affected by $\Delta G^{t}$, a new graph $G^{t}$ is constructed. x represents adding a new edge between two nodes in $G^{t-1}$, which is processed by Algorithm 2. y represents deleting a node from $G^{t-1}$, which is processed by Algorithm 3. z represents adding some new nodes and edges in $G^{t-1}$. For these new nodes, we allow the nodes in the dashed blue circle to aggregate into a solid gray circle, which is regarded as a supernode.
Inspired by Reference [35], we design an algorithm to calculate the PageRank of new nodes, and its pseudocode is described as follows.
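The aggregation step can be sketched as a relabelling of edge endpoints. This is an illustration, not Algorithm 4: the function name `build_supernode_graph` and the label `"sn"` are hypothetical, and the sketch assumes that edges lying entirely inside the supernode, as well as edges touching nodes outside $V^{+,t} \cup V^{0,t}$, are dropped.

```python
def build_supernode_graph(new_nodes, touched_nodes, edges, supernode="sn"):
    """Collapse all touched old nodes into one supernode and keep the new nodes as they are."""
    def relabel(x):
        if x in new_nodes:
            return x
        if x in touched_nodes:
            return supernode
        return None                       # node is outside the small graph

    small_nodes = set(new_nodes) | {supernode}
    small_edges = set()
    for u, v in edges:
        a, b = relabel(u), relabel(v)
        if a is None or b is None:
            continue
        if a == supernode and b == supernode:
            continue                      # assumption: edges inside the supernode are dropped
        small_edges.add((a, b))
    return small_nodes, small_edges
```

The resulting graph has $|V^{+,t}| + 1$ nodes, so random walks on it are far cheaper than walks on the full graph.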
Algorithm 4 (surviving lines of the pseudocode):
    ...
    for all u ∈ V^{+,t}
        ...
        if coverSelectedNode(path_z, v) = true    // Pass through node v
            ...

Proposition 4: We suppose $G^{t-1}$ evolves into $G^{t} = (V^{t}, E^{t})$ by adding a nonisolated node $v$. By using Formula 3, or the overall random walk method, we can obtain the PageRank $pr^{t}_{v}$ of node $v$. Algorithm 4 is used to construct a new graph $\bar{G}^{t} = (\bar{V}^{t}, \bar{E}^{t})$ and calculate the PageRank $\overline{pr}^{t}_{v}$ of node $v$. Then, the relative error is calculated as follows: [...]
Proof: $pr^{t}_{v}$ can be obtained by Formula 3, and $\overline{pr}^{t}_{v}$ can be obtained by Algorithm 4; the relative error $RE$ is then calculated from these two values. According to Algorithm 4, the edges connected to $v$ are not changed during the process of aggregating a supernode. If $M$ is large enough, the probabilities of revisiting $v$ in $G^{t}$ and $\bar{G}^{t}$ are equal, namely, $r_{v} = \bar{r}_{v}$. Moreover, the number of random walk paths passing through $v$ satisfies $\bar{w}_{v} \leq w_{v}$. Thus, we obtain $RE \leq |V^{t}| / |\bar{V}^{t}| - 1$.

C. COMPREHENSIVE ALGORITHM FOR ALL NODES
Considering a dynamic streaming graph SG, if local changes occur, the PageRank of all nodes needs to be updated in time. Because $G^{t}$ evolves from $G^{t-1}$ by using $\Delta G^{t}$, it is reasonable to suppose that the PageRank of all nodes in $G^{t-1}$ has been obtained in advance. Moreover, $\Delta G^{t}$ is a small graph that records the added or deleted nodes and edges. We consider all the information about $\Delta G^{t}$ and present a comprehensive algorithm, namely an incremental algorithm, to calculate the PageRank of all nodes in $G^{t}$ by using the random walk method. The main process is as follows.
Step 1: When $\Delta G^{t}$ arrives, the set $V^{t}_{aff}$ that includes all the affected nodes in $G^{t-1} = (V^{t-1}, E^{t-1})$ can be obtained by using Algorithm 1. If there are indeed one or more affected nodes, that is, $V^{t}_{aff} \neq \emptyset$, then we move on to Step 2; otherwise, we move on to Step 3.
Step 2: We determine whether the change is an addition or a deletion. If it is an addition, we move on to Step 2.1; if it is a deletion, we move on to Step 2.2.
Step 2.1: After traversing the set $E^{+,t}$, there are three possible cases: (1) When a new edge $e$ is added between two nodes in $G^{t-1}$, the PageRank of the nodes in $V^{t}_{aff}$ can be updated by using Algorithm 2. (2) When a new edge $e$ is added between a node in $V^{+,t}$ and a node in $V^{0,t}$, the total number of nodes in $G^{t}$ is $|V^{t-1}| + 1$ and the PageRank of the nodes in $V^{t}_{aff}$ can be calculated by continuously calling Algorithm 2. (3) In general, if there are many added edges or nodes in $\Delta G^{t}$, we traverse $\Delta G^{t}$. Step 2 is repeated until the end, and then Step 3 is executed.
Step 2.2: After traversing the set $E^{-,t}$, there are also three possible cases: (1) When an edge $e$ is deleted between two nodes in $G^{t-1}$, the PageRank of the nodes in $V^{t}_{aff}$ can be updated by using Algorithm 3. (2) When an edge $e$ is deleted due to a deleted node, the total number of nodes in $G^{t}$ is $|V^{t-1}| - 1$, and the PageRank of the nodes in $V^{t}_{aff}$ can be calculated by continuously calling Algorithm 3. (3) In general, if there are many deleted edges or nodes in $\Delta G^{t}$, we traverse $\Delta G^{t}$. Step 2 is repeated until the end, and then Step 3 is executed.
Step 3: For all new nodes, we traverse $V^{+,t}$, call Algorithm 4 to construct a new small graph that includes a supernode, and calculate the PageRank of all new nodes.
Step 4: The PageRank of all nodes in $G^{t}$ has been updated, and all calculation operations are completed.
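The control flow of Steps 1 to 4 can be sketched as a small dispatcher. The four callables stand in for Algorithms 1 to 4 and are supplied by the caller; the function name, the dictionary keys and the callback signatures are hypothetical choices made for this illustration.

```python
def incremental_pagerank_step(delta, find_affected, update_added_edge,
                              update_deleted_edge, rank_new_nodes):
    """Dispatch one change batch to the per-case update routines."""
    affected = find_affected(delta)                    # Step 1: affected nodes
    if affected:                                       # Step 2: only if something is affected
        for edge in delta.get("added_edges", []):      # Step 2.1: edge additions
            update_added_edge(edge, affected)
        for edge in delta.get("deleted_edges", []):    # Step 2.2: edge deletions
            update_deleted_edge(edge, affected)
    if delta.get("new_nodes"):                         # Step 3: new nodes via the supernode graph
        rank_new_nodes(delta["new_nodes"])
    return affected                                    # Step 4: all updates are done
```

Passing the routines in as parameters keeps the dispatcher independent of how each algorithm is implemented, which also makes it easy to test with stubs.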

VI. EXPERIMENTAL EVALUATIONS

A. EXPERIMENTAL ENVIRONMENT
To evaluate the effectiveness of our proposed algorithm, we conduct a series of experiments. The hardware environment is a server cluster of 4 computing nodes connected by high-speed Ethernet. Each computing node has a 64-bit Intel(R) Xeon(TM) CPU, 32 GB of RAM and a 4 TB hard disk. The server cluster runs the Linux operating system Ubuntu 20.04 and the graph processing system GraphX [36]. GraphX is used to store and manage big graphs and to implement parallel and distributed computing. The experimental datasets are wiki-Vote [37], amazon0302 [38], Slashdot0811 [39], and 2 synthetic datasets. The first three are real-world graphs from application fields, while the last two are synthetic graphs generated by the tool P-MAT [40]. The experimental data are shown in Table 1.
The above graph data is essentially historical static data. However, our research object is the dynamic streaming graph. Therefore, we use the above graph data to simulate and generate a dynamic streaming graph. Specifically, 80% of the nodes and edges in each dataset are randomly selected for the initial graph. At each time interval t, a small number of nodes are randomly selected as the added or deleted nodes, and a small number of edges are randomly selected as the added or deleted edges.
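A simplified version of this simulation can be sketched as follows. The sketch is an assumption-based illustration: it handles edge additions only, replaying the held-out 20% of edges in fixed-size batches, whereas the experiments also select nodes and generate deletions. The name `split_into_stream` and its parameters are hypothetical.

```python
import random

def split_into_stream(edges, initial_fraction=0.8, batch_size=100, seed=0):
    """Return (initial_edges, batches): a random initial graph plus later addition batches."""
    rng = random.Random(seed)
    shuffled = sorted(edges)            # sort first so the shuffle is reproducible
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * initial_fraction)
    initial = shuffled[:cut]
    rest = shuffled[cut:]
    batches = [rest[i:i + batch_size] for i in range(0, len(rest), batch_size)]
    return initial, batches
```

Fixing the seed makes the simulated stream repeatable, which matters when different update algorithms are compared on the same sequence of changes.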

B. EXPERIMENTS AND ANALYSIS
We use four different algorithms to conduct our comparative experiments. The first is the power iteration method (PRC_PI) [41], which is a widely used traditional algorithm to calculate PageRank. This algorithm calculates the PageRank of all nodes in a streaming graph starting from the first node, and its result is treated as the true PageRank. The second is an overall random walk algorithm (PRC_RW). The third is UPR_DZIG, which is the state-of-the-art incremental PageRank update algorithm implemented in [20]. The fourth is our algorithm, which is an incremental random walk algorithm (UPR_IRW).
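To make the two non-incremental baselines concrete, the following Python sketch shows a textbook power iteration and a textbook Monte Carlo random walk estimator. It is an illustration only and not the GraphX implementation used in the experiments. The damping factor d = 0.85, the handling of dangling nodes, the reading of M as the number of walks started from each node, and the function names are all assumptions.

```python
import random

def pagerank_power(adj, d=0.85, tol=1e-10, max_iter=200):
    """Power iteration on an adjacency dict {node: [out-neighbours]}."""
    nodes = list(adj)
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Mass on dangling nodes is redistributed uniformly (assumption).
        dangling = sum(pr[v] for v in nodes if not adj[v])
        new = {v: (1.0 - d) / n + d * dangling / n for v in nodes}
        for u in nodes:
            for v in adj[u]:
                new[v] += d * pr[u] / len(adj[u])
        diff = sum(abs(new[v] - pr[v]) for v in nodes)
        pr = new
        if diff < tol:
            break
    return pr

def pagerank_random_walk(adj, M=10, d=0.85, seed=0):
    """Monte Carlo estimate: M walks per node, visit counts normalised."""
    rng = random.Random(seed)
    visits = {v: 0 for v in adj}
    for start in adj:
        for _ in range(M):
            v = start
            while True:
                visits[v] += 1
                # Stop at a dangling node or with probability 1 - d.
                if not adj[v] or rng.random() > d:
                    break
                v = rng.choice(adj[v])
    total = sum(visits.values())
    return {v: c / total for v, c in visits.items()}
```

Increasing M in this sketch reduces the sampling noise of the estimate, which mirrors the role M plays in Experiment 1.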

Experiment 1: The comparison between the overall random walk algorithm PRC_RW and the traditional power iteration algorithm PRC_PI
To evaluate the accuracy of PRC_RW, we set different numbers of rounds in random walks, M = 2, 4, 6, 8, 10, 15, 20, 25, 30, 35, and execute these two algorithms over the 5 graph datasets. The results of PRC_RW are compared with the results of PRC_PI, which are regarded as the true PageRank. The mean relative error MRE and the standard deviation SD are calculated. The experimental results are shown in Figs. 4 and 5. As we can see, the MRE over wiki-Vote becomes small and stable when M is larger than 4, and the SD tends to be stable when M = 10. Likewise, the MRE over amazon0302 becomes small and stable when M is larger than 15, and the SD tends to be small and stable when M = 6. Both MRE and SD over Slashdot0811 become small and stable when M is larger than 20. For RMAT-dg1 and RMAT-dg2, the MREs become small and stable when M is larger than 6, and the SDs tend to be stable when M = 20. Therefore, as long as we set a reasonable M, we can obtain a small and relatively stable MRE and SD, meaning that the calculated PageRank is more accurate. Moreover, as M increases, MRE decreases and converges to a smaller error range. This is consistent with the conclusion of Proposition 1 in Section III-B.

Experiment 2: The comparison between our incremental random walk algorithm UPR_IRW and the traditional power iteration algorithm PRC_PI
First, we set the threshold δ for terminating impact propagation to different values, δ = 0.1, 0.2, ..., 0.9, and run a separate experiment for each value. We observe the proportion of affected nodes among all nodes, and compare the mean relative error MRE between UPR_IRW and PRC_PI. The experimental results are shown in Fig. 6(a) and (b). In general, the larger δ is, the smaller the affected area is. As we can see in Fig. 6(b), almost all curves show an upward trend. This means that the affected nodes in the graph cannot be completely found as δ increases, and thus the MRE becomes larger. According to the above experimental results, δ = 0.5 is a reasonable choice.
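The error measures reported in these experiments can be illustrated with a short Python sketch. It assumes that MRE is the mean over nodes of |estimate - truth| / truth and that SD is the population standard deviation of those per-node relative errors. This section does not restate the definitions, so both formulas and the function names are assumptions.

```python
import math

def relative_errors(estimate, truth):
    """Per-node relative error of an estimated PageRank against a reference."""
    return [abs(estimate[v] - truth[v]) / truth[v] for v in truth if truth[v] > 0]

def mre_and_sd(estimate, truth):
    """Mean relative error and standard deviation of the relative errors."""
    errs = relative_errors(estimate, truth)
    mre = sum(errs) / len(errs)
    sd = math.sqrt(sum((e - mre) ** 2 for e in errs) / len(errs))
    return mre, sd
```

Here truth would hold the PRC_PI result and estimate the result of PRC_RW or UPR_IRW, both as dictionaries keyed by node.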
To analyze the impact of adding or deleting different numbers of nodes on updating PageRank, we randomly add or delete nodes in different proportions from each graph dataset in Table 1, ρ = 0.1%, 0.5%, 1.0%, 2.0%, 4.0%, 6.0%, 8.0%, 10.0%, 15.0% and 20.0%. We use UPR_IRW and PRC_PI to calculate the PageRank of all nodes in the graph, and further calculate the mean relative error MRE. The experimental results are shown in Fig. 7(a) and (b). As the proportion of added or deleted nodes increases, the corresponding MRE increases to a certain extent. This is because the graph may undergo an abrupt structural change, which decreases the accuracy of our algorithm UPR_IRW.

Experiment 3: The comparison between the incremental random walk algorithm UPR_IRW and the overall random walk algorithm PRC_RW
We set the sizes of new nodes to 1.0%, 2.0%, 3.0%, 4.0% and 5.0% relative to the maximum size of each graph dataset, and the threshold to terminate impact propagation is δ = 0.5. We use PRC_RW and our algorithm UPR_IRW to calculate the PageRank of new nodes, and then obtain the corresponding mean relative error MRE. The experimental results are shown in Fig. 8. As we can see, MRE is small and less than 0.01. If the proportion of new nodes is small, the MRE is close to 0. Therefore, the results of the incremental random walk algorithm UPR_IRW are very close to those of the overall random walk algorithm PRC_RW.
In addition, we treat these 5 graph datasets as the initial graph and randomly add 10 new nodes. We use UPR_IRW and PRC_RW to calculate the PageRank of these 10 new nodes respectively, and then obtain the relative error corresponding to each node. The experimental results are shown in Fig. 9. As we can see, the relative error of each new node is less than the maximum relative error, which is consistent with the conclusion of Proposition 4.

Experiment 4: The comparison between the incremental random walk algorithm UPR_IRW and the state-of-the-art incremental computing algorithm UPR_DZIG
We set the sizes of new nodes to 1.0%, 2.0%, 3.0%, 4.0% and 5.0% relative to the maximum size of each graph dataset, and the threshold to terminate impact propagation is δ = 0.5. We use UPR_DZIG and our algorithm UPR_IRW to calculate the PageRank of new nodes, and then compare each result with the true PageRank to obtain the corresponding mean relative errors. The experimental results are shown in Fig. 10. As we can see, MRE is small and less than 0.01. For the graph dataset amazon0302, the MREs of UPR_DZIG are slightly smaller than the MREs of UPR_IRW. However, UPR_IRW performs better than UPR_DZIG on these graph datasets.

Experiment 5: The speedup comparison between the incremental random walk algorithm UPR_IRW and the traditional power iteration algorithm PRC_PI
We set the sizes of new nodes to 0.1%, 0.2%, 0.5%, 1.0%, 2.0%, 3.0%, 4.0%, 5.0%, 6.0% and 8.0% relative to the maximum size of each graph dataset, and the threshold to terminate impact propagation is δ = 0.5. We use UPR_IRW and PRC_PI to calculate the PageRank of all the nodes in each graph. Suppose that the execution times of UPR_IRW and PRC_PI are T(UPR_IRW) and T(PRC_PI) respectively; the speedup of UPR_IRW over PRC_PI is then Speedup = T(PRC_PI)/T(UPR_IRW). The experimental results are shown in Fig. 11. Compared with PRC_PI, our algorithm UPR_IRW computes PageRank much faster, especially on the graph dataset RMAT-dg2, where the speedup can be as high as 99.50. As the size of new nodes increases, the speedup decreases gradually. Nevertheless, the speedup remains much greater than 1, which indicates that the incremental random walk algorithm UPR_IRW is faster than the traditional power iteration algorithm PRC_PI.

Experiment 6: The speedup comparison between the incremental random walk algorithm UPR_IRW and the overall random walk algorithm PRC_RW
We use the same experimental environment and parameters as in Experiment 5, and repeatedly execute UPR_IRW and PRC_RW to calculate the PageRank of all nodes in the graph. The speedup of UPR_IRW over PRC_RW is Speedup = T(PRC_RW)/T(UPR_IRW). The experimental results are shown in Fig. 12. Compared with PRC_RW, our algorithm UPR_IRW computes PageRank much faster, especially on the graph dataset RMAT-dg1, where the speedup can be as high as 82.62. As the size of new nodes increases, the speedup decreases gradually. The speedup is much greater than 1, which indicates that the incremental random walk algorithm UPR_IRW is faster than the overall random walk algorithm PRC_RW.
Experiment 7: The speedup comparison between the incremental random walk algorithm UPR_IRW and the state-of-the-art incremental computing algorithm UPR_DZIG
We use the same experimental environment and parameters as in Experiment 5, and repeatedly execute UPR_IRW and UPR_DZIG to calculate the PageRank of all nodes in the graph. The speedup of UPR_IRW over UPR_DZIG is Speedup = T(UPR_DZIG)/T(UPR_IRW). The experimental results are shown in Fig. 13. If the size of new nodes is smaller than 1.0% on the two graph datasets amazon0302 and Slashdot0811, the speedup may be smaller than 1. In this case, UPR_DZIG has a certain advantage in the speed of calculating PageRank. However, if the size of new nodes is greater than 1.0%, the speedup is greater than 1 on each graph dataset, especially on the graph dataset RMAT-dg1, where the speedup can be as high as 3.76. Therefore, if the number of new nodes added to a dynamic streaming graph is small, UPR_DZIG performs better than our algorithm. If more nodes are added to such a graph, the performance of our algorithm UPR_IRW is superior to that of the state-of-the-art incremental computing algorithm UPR_DZIG.
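The speedup ratio used in Experiments 5 to 7 can be measured with a small timing harness such as the Python sketch below. It is only an illustration: the helper names timed and speedup are hypothetical, and taking the best of several repeated runs is an assumption, since the text says the algorithms are executed repeatedly but does not say how the runs are aggregated.

```python
import time

def timed(fn, *args, repeats=3):
    """Best wall-clock time of fn(*args) over several runs (assumed policy)."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

def speedup(t_baseline, t_incremental):
    """Speedup = T(baseline) / T(incremental); values above 1 favour the latter."""
    if t_incremental <= 0:
        raise ValueError("incremental time must be positive")
    return t_baseline / t_incremental
```

For example, a baseline time of 9.95 s against an incremental time of 0.1 s gives a speedup of about 99.5, and a value below 1 means the baseline was faster.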

VII. CONCLUSION
PageRank update for a big dynamic streaming graph is a very important task. This paper focuses on the local change information about node or edge insertion or deletion in a streaming graph, and designs an algorithm based on the idea behind wave propagation theory to find all the nodes affected by such local changes. For the unaffected nodes, the PageRank does not need to be updated, which greatly reduces the computational overhead. For the affected nodes, we use the random walk method to calculate the change in the total number of walks, and update the PageRank of these nodes accordingly. This process not only reduces the error of calculating the PageRank of the affected nodes as much as possible, but also avoids redundant computation. For the newly added nodes, an algorithm based on the aggregation idea is designed, which can accurately calculate the PageRank of these new nodes. In summary, a comprehensive algorithm based on incremental random walks for updating the PageRank in a streaming graph is proposed, and a deep analysis is given. Finally, we use the proposed algorithm, the traditional power iteration algorithm and the overall random walk algorithm to conduct a series of comparative experiments on 3 real-world graphs and 2 synthetic graphs. The experimental results show that our algorithm is faster while maintaining accuracy. In future work, we will prove the consistency of calculating PageRank between our incremental computation method and the overall computation method through theoretical derivation. Moreover, we will port the proposed algorithm to some open-source big graph computing systems and evaluate its effectiveness on more real-world graph datasets.
ZHIPENG SUN was born in 1990. He received the M.S. degree in computer application technology from the Department of Computer Internet of Things Engineering, Jiangnan University, in 2017. He is currently pursuing the Ph.D. degree in computer software and theory with Tongji University. His research interests include big data analysis, graph computing, machine learning, and artificial intelligence.
GUOSUN ZENG (Senior Member, IEEE) received the B.S., M.S., and Ph.D. degrees in computer software and application from the Department of Computer Science and Engineering, Shanghai Jiao Tong University. He is currently working at Tongji University, as a Full Professor, and as a Ph.D. Supervisor of computer software and theory. His research interests include cloud computing, big data processing, software verification, artificial intelligence, parallel computing, and information security.
CHUNLING DING received the B.S. degree from the Huazhong University of Science and Technology and the M.S. degree from Tongji University. She is currently working as a Senior Engineer with Tongji University. Her main research interests include data analysis, computer simulation, and computer applications.