Searching Correlated Patterns From Graph Streams

Mining correlations has attracted widespread attention in the research community because of its advantages in understanding the dependencies between objects. In this paper, a correlated graph pattern searching scheme is proposed: provided with a query <inline-formula> <tex-math notation="LaTeX">$g$ </tex-math></inline-formula> as a structured pattern (<italic>i.e.</italic>, a graph), our algorithm retrieves the top-<inline-formula> <tex-math notation="LaTeX">$k$ </tex-math></inline-formula> graphs that are most likely correlated with <inline-formula> <tex-math notation="LaTeX">$g$ </tex-math></inline-formula>. Traditional methods treat graph streams as static records, which is computationally infeasible or ineffective because of the complexity of searching correlated patterns in a dynamic graph stream. In this paper, by relying on sliding windows to separate graph streams into chunks, we propose the Hoe-PGPL algorithm to handle top-<inline-formula> <tex-math notation="LaTeX">$k$ </tex-math></inline-formula> correlated pattern searching from a dynamic perspective. Our algorithm applies the Hoeffding bound and a two-level (sliding-window level and local batch level) candidate inspection to discover potential graph candidates and to determine the similarity of these candidates without double-checking the previous stream. Theoretical analysis shows that our method can guarantee the quality of the returned answers, and our experiments also show that Hoe-PGPL performs well in terms of precision, recall, runtime, and resource consumption.


I. INTRODUCTION
Correlation mining has attracted widespread attention in the research community because of its power in revealing potential correlations between different patterns. Many research works have been conducted on correlated pattern mining in various applications, such as financial transaction databases [45], [68], [70], quantitative databases [11], and time-series data streams [40]. Research interest has recently extended to graph streams, where elements are connected by structural relations [26], [28]-[30]. A structural pattern is therefore equivalent to a graph, represented as nodes interconnected by edges.
The correlation between two structural patterns (graphs) is a measure of the similarity of their occurrence distributions. Provided with a query g and a graph stream, Correlated Graph Search (CGSearch) aims to find structural patterns whose Pearson correlation coefficient with g is at least θ, a user-predefined threshold. An example is shown in Fig. 1, where θ = 1. Although this type of correlated subgraph pattern is found to be important in various graph representations, it expects the user to predefine a threshold. However, it is usually challenging to specify such a threshold, since a small value may lead to too many correlated graphs, while a large value may lead to very few results. Recently, top-k algorithms have been proposed to identify the salient subgraphs with the highest correlation with g from the database [28], [43], [48].
FIGURE 1. An example of a correlated graph query. Given a database D = {g1, g2, g3, g4} and a query graph q, a correlated graph query discovers subgraphs that have occurrence distributions similar to that of q. For instance, the answer subgraph ''E-F'' is a co-occurrent subgraph of the query subgraph ''A-B-C''.

However, these works [26], [28] are limited to static graph databases only. In reality, data streams are continuously generated or changed within and across various applications. For example, on a website such as eBay, there are millions of active users every day [5], and we can use graphs to represent their browsing records. Moreover, their activities may change dynamically and differ from those of other users, so the analysis of correlated graphs of users' browsing
patterns will enable website owners to better understand user needs and improve their services. Another example of a dynamic graph stream can be found in the biomedical domain. The associations of chemical compounds may change constantly during a chemical reaction because the structures of the chemicals change dynamically. Analysis of these compounds (which can be regarded as correlated graphs) can reveal important substructures, which may help to discover new drugs [51]. In our experiments, we also demonstrate a case study of structural pattern searching over scientific publications, which uses keywords and citation relationships to retrieve papers highly correlated with a query pattern (whereas traditional non-structural pattern search can only use keywords). All these examples demonstrate a clear need for searching structural patterns from graph streams.
Recently, Pan and Zhu proposed a CGStream-based method to query correlated graphs in a data stream scenario [42]. While CGStream can return the correlated graphs in an approximate but effective way, it expects users to define a threshold value θ, which indicates the minimum correlation score between the query g and retrieved patterns in graph streams. Notice that in real-world applications, such a threshold value is usually challenging to define and may differ significantly for different streams and query patterns. In order to make structural pattern search more applicable to real-life usage, we propose to address top-k correlated graph queries for data streams in a dynamic way. In this article, we focus on finding the top-k correlated graphs, rather than retrieving correlated graphs with correlation values above a given threshold θ. The main challenges, in this case, are as follows:
• Graph correlation queries involve NP-complete subgraph isomorphism examinations, so storing and calculating the frequency and correlation of each subgraph in a stream is intricate.
• Recalculating all correlations is very time-consuming because the correlations between graphs constantly change in streams over time.
• Streaming schemes require algorithms to return answers promptly.

Exhaustive search is a straightforward way to address our problem: handle the graph stream with a sliding-window approach, then examine the hidden correlations by searching for the top-k correlated structural patterns in each window with the CGSearch [25], [26] or TopCor [28], [30] algorithm. Although the exhaustive approach guarantees complete and accurate results, it is computationally infeasible, since repeatedly searching for correlated structural patterns in different windows is time-consuming. A compromise is to allow the system to retrieve the results in an approximate but highly reliable way, with much faster query speed than exhaustive search.
In this paper, we propose a correlated pattern searching algorithm based on the Hoeffding bound [20] for data streams. The algorithm returns the top-k most correlated graphs for the query graph in a sliding window covering multiple consecutive batches of stream records. More precisely, the Hoeffding bound [20] is applied to each batch to determine a series of possibly correlated graphs. Then a novel scheme named global-local inspection is proposed to keep track of candidate information using two different structures: Potential Local Lists (PLs) and a Potential Global List (PG). While the PG can be used to roughly estimate a candidate's real correlation value from a global perspective, the PLs enable us to estimate the correlation value from a local perspective, which is more reliable. By carefully manipulating the PG and PLs, we can predict each candidate's actual correlation with a relatively small deviation. Theoretical analysis shows that the correlation value of each candidate in the PG is close to its real correlation, so the output quality can be guaranteed and bounded in terms of the precision and recall of the queries. Our experimental results in Section 5 indicate that Hoe-PGPL is far more efficient than an exhaustive search method in terms of time and memory consumption, while achieving high precision and recall.
The rest of the paper is structured as follows. Preliminaries and the problem definition are presented in Section 2. Our Hoeffding bound based algorithm is presented in Section 3. In Section 4, we provide a bound on the expected precision and recall. Experimental results are shown in Section 5, followed by a case study of correlated graphs from DBLP graph streams in Section 6. In Sections 7 and 8, we review related work and summarize the paper.

II. PRELIMINARIES AND PROBLEM DEFINITION

A. PRELIMINARIES
We mainly consider connected, labeled, and undirected graphs in this paper, where a typical graph is defined as g = (V, E, ℓ). In this representation, V, E, and ℓ are the vertex set, the edge set, and the labeling function, respectively. The labeling function attaches labels to the edges and nodes of a graph. Given two graphs g1 = (V1, E1, ℓ1) and g2 = (V2, E2, ℓ2), g1 is a subgraph of g2 if a subgraph isomorphism exists from g1 to g2, where subgraph isomorphism is an NP-complete problem [13].
For a set of graphs G_i, the projected set in terms of a graph g is denoted as S_g^{G_i} = {g' | g ⊆ g', g' ∈ G_i}. For a set •, its cardinality is denoted as |•|. The frequency and support of g are then F_g = |S_g^{G_i}| and SP(g) = F_g / |G_i|, respectively. Given g1 and g2 in G_i, their joint frequency is the number of graphs in G_i that contain both g1 and g2, denoted as F_{g1g2} = |S_{g1}^{G_i} ∩ S_{g2}^{G_i}|. Accordingly, their joint support is SP(g1, g2) = |S_{g1}^{G_i} ∩ S_{g2}^{G_i}| / |G_i|.
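The definitions above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: each graph is represented as a frozenset of labeled edges, and the subgraph relation g ⊆ g' is simplified to edge-set containment, whereas a real system would run a proper subgraph-isomorphism test (NP-complete in general).

```python
# Sketch of projected set, frequency, and support. Graphs are frozensets
# of labeled edges; "subgraph" is simplified to edge-set containment.

def projected_set(g, batch):
    """S_g^{G_i}: graphs in the batch that contain pattern g."""
    return [h for h in batch if g <= h]

def frequency(g, batch):
    """F_g = |S_g^{G_i}|."""
    return len(projected_set(g, batch))

def support(g, batch):
    """SP(g) = F_g / |G_i|."""
    return frequency(g, batch) / len(batch)

def joint_frequency(g1, g2, batch):
    """F_{g1 g2}: number of graphs containing both g1 and g2."""
    return len([h for h in batch if g1 <= h and g2 <= h])

# toy batch of three graphs
batch = [frozenset({("A", "B"), ("B", "C")}),
         frozenset({("A", "B")}),
         frozenset({("B", "C"), ("E", "F")})]
```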

B. PROBLEM DEFINITION
Provided with a streaming graph G and a query g, we focus on determining the top-k correlated structural patterns (graphs) from G with respect to g. We assume that graph data arrive in batches, and a sliding window D = {G_{1+i−w}, G_{2+i−w}, ..., G_i} is defined to cover a contiguous region of the stream, since G describes a dynamic graph stream that changes continuously. A batch of graphs is denoted as G_j, 1 + i − w ≤ j ≤ i, and the latest batch is denoted as G_i. Our task is then to search for the top-k patterns with the highest Pearson correlation with g in the most recent w batches. Figure 2 shows a top-k correlated graph query with a query graph g and a window size w = 3.
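The batched sliding-window model can be sketched with a bounded deque: the window D always holds the most recent w batches, and appending a new batch evicts the oldest one. This is an illustrative toy, not the paper's code; the batch contents here are placeholder strings.

```python
from collections import deque

# Sliding window D over a batched stream: keep the most recent w batches.

def stream_windows(batches, w):
    window = deque(maxlen=w)      # D = {G_{1+i-w}, ..., G_i}
    for batch in batches:
        window.append(batch)      # deque with maxlen drops the oldest batch
        yield list(window)

# toy stream of four batches with window size w = 3, as in Figure 2:
# at T1 the window covers G1..G3; when G4 arrives it covers G2..G4
stream = [["G1"], ["G2"], ["G3"], ["G4"]]
windows = list(stream_windows(stream, w=3))
```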

Definition 1 (Pearson Correlation Coefficient):
Provided with graphs g1 and g2, whose individual and joint supports over F graphs are SP(g1), SP(g2), and SP(g1, g2), respectively, the Pearson correlation coefficient φ(g1, g2) is defined as [45], [61]:

φ(g1, g2) = (SP(g1, g2) − SP(g1)SP(g2)) / √(SP(g1)SP(g2)(1 − SP(g1))(1 − SP(g2)))    (1)

FIGURE 2. Searching top-k correlated patterns based on sliding windows over data streams. Batches G1, G2, and G3 are covered by the sliding window (red rectangle) at timestamp T1. G4 is a newly arrived batch at timestamp T2; as it is now the most recent batch, the sliding window moves to cover G2, G3, and G4 (solid red rectangle). We use two different lists (PG and PLs) to store the potential graphs, and results are returned from the PG in each sliding window.
φ(g1, g2) is defined to be 0 when SP(g1) or SP(g2) equals 0 or 1, and the range of φ(g1, g2) is between 0 and 1 because, in this paper, we only consider positive correlations between graphs.
Over a set of F graphs, the Pearson correlation coefficient can be rewritten in terms of frequencies [68]:

φ(g1, g2) = (F · F_{g1g2} − F_{g1} F_{g2}) / √(F_{g1} F_{g2} (F − F_{g1})(F − F_{g2}))    (2)

where F_{g1}, F_{g2}, and F_{g1g2} denote the numbers of graphs containing g1, g2, and both g1 and g2 among the F graphs, respectively. For clarity, the correlation between g1 and g2 is written as φ_D(g1, g2) when computed over the entire window D, and φ_L(g1, g2) when computed over a local batch G_j.
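The frequency form of the coefficient is a one-liner once the four counts are known. A small sketch, including the convention that φ is 0 when either support is 0 or 1:

```python
import math

# Pearson correlation from frequency counts over F graphs:
#   phi = (F*F_joint - F_g1*F_g2) / sqrt(F_g1*F_g2*(F - F_g1)*(F - F_g2))

def pearson_phi(F, f_g1, f_g2, f_joint):
    # support 0 or 1 for either graph => phi defined as 0
    if f_g1 in (0, F) or f_g2 in (0, F):
        return 0.0
    num = F * f_joint - f_g1 * f_g2
    den = math.sqrt(f_g1 * f_g2 * (F - f_g1) * (F - f_g2))
    return num / den
```

For instance, two patterns that always co-occur in half of F = 4 graphs get φ = 1, while independent patterns get φ = 0.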

III. PROPOSED METHOD
In a data stream scenario, we aim to return the top-k correlated subgraphs in each sliding window. A straightforward way is to apply the CGSearch [25], [26] or TopCor [28], [30] algorithm to each sliding window whenever a new graph batch arrives. However, this is computationally inefficient, because recomputing the correlation of each subgraph from scratch in the sliding window is expensive, especially for a large window size.
Another way is to maintain a set of possible candidates over the stream: whenever a new batch of graphs flows in, we update the frequency statistics used in Eq. 2 (F_{g1}, F_{g2}, and F_{g1g2}) for each candidate in D instead of re-mining the whole sliding window. Under such circumstances, determining what kind of candidates should be discovered, and how to maintain them, is of great significance. In this paper, the Hoeffding bound [20] is applied in each batch to elect a set of candidates, and we then rely on a global-local inspection scheme to maintain these potential graphs with the PG and PLs. In this section, we first state the Hoeffding bound [20] for candidate generation and then introduce our algorithms with two levels of candidate lists.

A. HOEFFDING BOUND FOR CANDIDATE GENERATION
In [58], Rong et al. use a Chernoff bound to discover the top-k frequent itemsets over a data stream, where they assume the presence of each item/itemset at each timestamp is a random Boolean variable (i.e., the variable is 1 if the item/itemset appears at that timestamp, and 0 otherwise). In our application, the correlation over a stream at each timestamp is a real number in the range [0, 1], so we employ the Hoeffding bound [20], which is also widely used in the literature on algorithm ranking and data stream classification, for candidate generation.
Suppose there is a series of independent observations φ_1, ..., φ_n. The Hoeffding bound [20] provides a probabilistic guarantee for the statistical estimation of the underlying data. More explicitly, let the estimated mean be r̄ = (1/n) Σ_{i=1}^{n} φ_i, and let r be the expected value of these observations. For any ε, the Hoeffding bound states that [20], [43]

Pr(|r̄ − r| > ε) ≤ 2e^{−2nε²/R²}    (3)

where n and R denote the number of accumulated observations and the range of each observation φ_i, respectively. If we let δ = 2e^{−2nε²/R²} be the right side of Eq. (3), the Hoeffding bound [20] indicates that the estimated mean r̄ deviates from r by more than ε with probability at most δ; in other words, r̄ is within ε of r with probability at least 1 − δ. From Eq. (3), we obtain [42]

ε = √(R² ln(2/δ) / (2n))    (4)

In the problem setting of this paper, we mainly focus on positive correlations and fix the range R = 1. We assume the streaming data arrive in batches and that the correlation is randomly distributed over the data stream. For each batch, we measure the estimated mean r̄ = φ_L(g1, g2); then the correlation φ_D(g1, g2) in a sliding window is no less than φ_L(g1, g2) − ε with probability 1 − δ. Generally speaking, we compute φ_L(g1, g2) in batches over the |G_i| graphs, and the window correlation should equal the average of the batch-level values φ_L(g1, g2).
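The error term of Eq. (4) is directly computable. A small sketch, showing that the bound tightens as the number of observations n grows:

```python
import math

# Hoeffding error bound, Eq. (4): epsilon = sqrt(R^2 * ln(2/delta) / (2n)).
# The estimated mean deviates from the true mean by more than epsilon with
# probability at most delta. Here R = 1, since correlations lie in [0, 1].

def hoeffding_epsilon(n, delta, R=1.0):
    return math.sqrt(R * R * math.log(2.0 / delta) / (2.0 * n))

# more observations => tighter bound
eps_100 = hoeffding_epsilon(n=100, delta=0.05)
eps_10000 = hoeffding_epsilon(n=10000, delta=0.05)
assert eps_10000 < eps_100
```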
In each batch of the data stream, a set of potential top-k correlated graphs can be mined using the Hoeffding bound [20]. Specifically, instead of querying the whole batch, we identify candidates from the projected database S_{g1}^{G_i}, the subset of the batch G_i consisting of all graphs containing the query g1. Because the size of S_{g1}^{G_i} is much smaller than the original batch, using S_{g1}^{G_i} greatly reduces the time consumption [26], [28], [30]. For each candidate graph g2, we require its correlation value to be within a small range of the k-th value, i.e., φ_L(g2, g1) > φ_L − 2ε_s, where ε_s = √(R² ln(2/δ) / (2|G_i|)) and φ_L denotes the k-th correlation value with the query graph g1 (candidates are sorted by correlation value in descending order). The method in [28], [30] or the CGSearch algorithm [25], [26] can be used to find the k-th correlation value. Table 1 summarizes the main notation related to the Hoeffding bound [20].
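The per-batch selection rule can be sketched as follows. This is an illustrative stand-in: the correlation scores would come from CGSearch/TopCor over the projected database, which is not reimplemented here; `scores` is a hypothetical mapping from candidate id to φ_L(candidate, query).

```python
import math

# Per-batch candidate generation: keep every candidate whose correlation
# exceeds the k-th correlation value minus 2 * epsilon_s.

def batch_candidates(scores, k, batch_size, delta=0.05, R=1.0):
    eps_s = math.sqrt(R * R * math.log(2.0 / delta) / (2.0 * batch_size))
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    phi_lk = ranked[min(k, len(ranked)) - 1][1]   # k-th correlation value
    return [g for g, phi in ranked if phi > phi_lk - 2 * eps_s]
```

With k = 2 and a batch of 1000 graphs, ε_s ≈ 0.043, so only candidates within about 0.086 of the 2nd-best score survive.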

B. TWO LEVELS OF LISTS IN GLOBAL-LOCAL INSPECTION SCHEME
Having found the possible top-k correlated structural patterns in different batches, we need to maintain them carefully. Here we introduce two different lists to manage these patterns (graphs).
• Potential Global List (PG): For each candidate in sliding window D, its global frequency information (F g 1 , F g 2 , and F g 1 g 2 ) is stored in the PG where graphs are sorted based on their correlation values in descending order. The top-k correlated graph query may simply return the final answers by retrieving the top-k graphs in the PG. The frequency of each candidate should be updated accordingly whenever D is changed, e.g., adding or excluding a data batch.
• Potential Local Lists (PLs): The local frequency information of each candidate (F g 1 , F g 2 , and F g 1 g 2 ) is recorded in a set of PLs where we apply a list PL j , j ∈ [1 + i − w, i] to keep the retrieved graphs from G j .
To mine a graph stream efficiently, we cannot keep track of all information about the data stream. A difficulty is caused by situational patterns: a pattern τ that is insignificant in historical observations may become significant in the future. However, τ's historical information may not have been stored, because τ was insignificant in previous observations, and re-accessing the historical stream data is prohibitive. We therefore define emerging candidate patterns as follows.
Definition 2 (Emerging Candidate Patterns): Provided with a query graph g1 and a sliding window D covering w consecutive data chunks D = {G_{i−w}, G_{1+i−w}, ..., G_{i−1}}, where G_{i−1} is the most recent data chunk, suppose a new data batch G_i arrives. An emerging candidate pattern is a pattern τ whose correlation with g1 in G_i is significant with respect to the threshold (i.e., φ_L(τ, g1) ≥ φ_L − 2ε_s), but which does not exist in D's potential global candidate list.
Because an emerging pattern τ does not exist in the PG of the previous sliding window, τ's true correlation with g1 in the new sliding window D = {G_{i−w+1}, ..., G_{i−1}, G_i} needs to be estimated, which may, however, introduce bias. In the following, we first discuss some possible solutions and then present our global-local inspection scheme to handle this problem.
• Up-to-date Batch only Estimation: A straightforward way of estimating an emerging pattern τ's correlation with the query graph g1 is to use τ's correlation in the most current batch, φ_L(τ, g1), as an estimate for the whole sliding window. As our experiments show, such a simple scheme is not accurate and may cause significant errors in the query results.
• PG-only Estimation: An alternative approach is to utilize the PG to estimate the correlation of candidates from a global perspective. Specifically, if an emerging pattern τ is not in the PG (i.e., τ ∉ PG) but is correlated in the most current batch (i.e., τ ∈ PL_i), then in previous batches φ_D(τ, g1) must have been comparatively low in the sliding window, which is why τ was not previously held in the PG. If τ_min denotes the graph with the minimum correlation value φ_D(τ_min, g1) in the PG, we know that in previous batches φ_D(τ, g1) < φ_D(τ_min, g1). Therefore, τ can be estimated using τ_min's frequency information in previous batches. Note that this estimate can be coarse: τ's correlation in previous batches is usually small (τ was not among the correlated patterns in those windows), although it may occasionally have been significant in some earlier batches.
• PLs-only Estimation: We can also employ the set of PLs to calculate an emerging candidate pattern's frequency. Specifically, if a candidate τ ∉ PG, we can search PL_{1+i−w} through PL_{i−1} to estimate its frequency in previous batches. On the one hand, the frequency information in G_j can be obtained directly if τ ∈ PL_j. By contrast, if τ ∉ PL_j, then τ's correlation in G_j must be smaller than the minimum correlation in PL_j. If τ_min denotes the graph in PL_j with the minimum correlation value, the frequency of τ in G_j can be estimated using the frequency information of τ_min. By leveraging the PLs, we can hopefully obtain a more precise correlation estimate for each emerging candidate pattern.
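The PLs-only estimation can be sketched as a walk over the previous local lists. This is an illustrative simplification, not the paper's Algorithm 1: each hypothetical PL_j is a dict mapping a graph id to a tuple (φ_L, F_g, F_joint), and the surrogate τ_min is the entry with the smallest local correlation.

```python
# PLs-only estimation for an emerging candidate tau: sum per-batch
# frequencies, using tau's own statistics when tau is in PL_j and the
# statistics of tau_min (the lowest-correlated graph in PL_j) otherwise.

def estimate_from_pls(tau, previous_pls):
    """Each PL_j maps graph -> (phi_L, F_g, F_joint)."""
    est_f, est_joint = 0, 0
    for pl in previous_pls:
        if tau in pl:
            _, f, joint = pl[tau]                               # exact stats
        else:
            _, f, joint = min(pl.values(), key=lambda t: t[0])  # tau_min
        est_f += f
        est_joint += joint
    return est_f, est_joint
```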
Global-local inspection scheme: Because the PG and PLs each have unique advantages in estimating the correlation value, from global and local perspectives respectively, our global-local inspection scheme integrates them; the detailed procedure is given in Algorithm 1.
First, the estimated frequency of each candidate τ is obtained by scanning PL_{1+i−w} through PL_{i−1} from a local perspective, as shown in steps 3 to 10 of Algorithm 1, which minimizes the gap between the correlation values in PL_j, j ∈ [1+i−w, i−1], and their actual values in G_j. Specifically, when τ ∉ PL_j, τ's statistics in batch G_j are estimated using the frequency of the corresponding τ_min, and the statistics are inserted into list(F_τ), list(F_{τg1}), and list(F_{g1}) in step 11 (the data structure of these lists is introduced in Sec. 3.3). Once we have the frequency of τ from the PLs, this information is examined from a global perspective in the PG, as shown in steps 12 to 14 of Algorithm 1. Since τ was not previously in the PG, its true window correlation must satisfy φ_D(τ, g1) < φ_D(τ_min, g1), where τ_min now denotes the graph with the minimum correlation in the PG; therefore, if the renewed correlation value φ_D(τ, g1) exceeds φ_D(τ_min, g1), a further correction of the frequency information is needed. In our implementation, we replace the frequency of τ with the frequency of τ_min in G_j whenever τ ∉ PL_j, j ∈ [1+i−w, i−1]. In this way, newly generated candidates can be added to the PG with accurate estimates of their real correlation with the query graph. Consequently, the ranking in the PG is more reliable, and the ground truth can be returned more precisely.

C. HOE-PGPL ALGORITHM
Our Hoe-PGPL algorithm for continuous correlated subgraph queries over data streams is listed in Algorithm 2. The framework takes multiple parameters as inputs: G is a continuous graph stream; k specifies the number of returned graphs; w is the sliding window size, measured in batches; and m controls the capacity of the PG.
We first initialize the PG and PLs as empty sets in step 1, and the loop in step 2 represents a stream processing cycle, which repeats as long as new graph data keep arriving. When a new chunk G_i is collected, Hoe-PGPL discards the oldest data batch and mines only G_i. After that, in each sliding window, Hoe-PGPL returns the top-k correlated graphs A_{g2}, and this process continues as long as the stream data flow in.

Algorithm 2 Hoe-PGPL Algorithm
Input: G = {G_1, ..., G_i, ...}: graph stream; g1: query graph; k: number of returned correlated graphs; w: size of the sliding window; m: controls the maximum number of candidates in the PG (m · k)
Output: A_{g2}: top-k correlated graphs in each sliding window
Step 1: initialize the PG and PLs as empty sets;
Step 2: repeat for each newly arriving batch G_i (stream processing cycle):
Steps 5-8: retrieve the top-k correlated graphs in G_i from S_{g1}^{G_i}; get the k-th correlation value φ_L and the error ε_s; store the candidates satisfying φ_L(g2, g1) ≥ φ_L − 2ε_s in PL_i;
Steps 9-14: for each g2 ∈ PG, increase list(F_{g2}), list(F_{g2g1}), and list(F_{g1}) with the statistics of G_i, and remove the statistics of the expired batch;
Step 15: add newly generated candidates into the PG;
Steps 16-22: if |PG| > m · k, get the k-th correlation value in the PG, φ_D, set T ← {g2 | φ_D(g2, g1) < φ_D − 2ε_w, g2 ∈ PG}, and update PG ← PG − T;
Step 23: output A_{g2} in the current sliding window.

The Hoe-PGPL algorithm has four parts: (i) candidate generation based on the Hoeffding bound [20] (steps 5 to 8); (ii) adjusting the frequency information in the PG (steps 9 to 14); (iii) adding newly generated candidates to the PG (step 15), a crucial subprocedure shown in Algorithm 1; and (iv) pruning the PG (steps 16 to 22). As parts (i) and (iii) have been discussed in previous subsections, we focus on the other two parts in this subsection.
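The four parts above can be arranged into a high-level skeleton of one stream-processing cycle. This is a structural sketch only: the mining and PG-maintenance subroutines are passed in as stand-ins (the paper uses CGSearch/TopCor for mining and Algorithm 1 for inserting emerging candidates), and a candidate list is simplified to a dict mapping graph id to correlation score.

```python
from collections import deque

# Skeleton of the Hoe-PGPL stream-processing cycle (Algorithm 2).

def hoe_pgpl(stream, k, w, m, mine_batch, update_pg, kth_value):
    PG, PLs = {}, deque(maxlen=w)        # step 1: initialise both levels
    for batch in stream:                 # step 2: stream processing cycle
        pl_i = mine_batch(batch, k)      # (i) candidate generation (5-8)
        PLs.append(pl_i)                 # window slide drops the oldest PL
        update_pg(PG, pl_i, PLs)         # (ii)+(iii) maintain the PG (9-15)
        if len(PG) > m * k:              # (iv) prune the PG (16-22)
            keep = kth_value(PG, k)
            for g in [g for g, phi in PG.items() if phi < keep]:
                del PG[g]
        top_k = sorted(PG.items(), key=lambda kv: kv[1], reverse=True)[:k]
        yield top_k                      # step 23: answers for this window

# toy run with trivial stand-in subroutines
stream = [{"a": 0.9, "b": 0.5}, {"b": 0.8, "c": 0.7}]
results = list(hoe_pgpl(
    stream, k=1, w=2, m=1,
    mine_batch=lambda batch, k: dict(batch),
    update_pg=lambda PG, pl, PLs: PG.update(pl),
    kth_value=lambda PG, k: sorted(PG.values(), reverse=True)[k - 1]))
```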

1) UPDATING THE FREQUENCY INFORMATION IN PG
In each batch and window D, we need to refresh the frequency information for each candidate g2 ∈ PG, i.e., F_{g1}, F_{g2}, and F_{g1g2}. Steps 9 to 14 in Algorithm 2 list the maintenance of candidates that already exist in the PG. We first introduce our data structure for recording the frequency information and then describe the increasing and decreasing procedures that maintain the PG. To facilitate the calculation of the Pearson correlation value, each candidate in the PG is represented by a five-dimensional tuple (g2, F, list(F_{g1}), list(F_{g2}), list(F_{g1g2})). In this tuple, g2 is the graph, and F is the total number of stream graphs seen after g2 was inserted into the global candidate set. list(F_{g1}) is an array list, each element of which records the frequency of g1 in one batch of the sliding window. Similarly, list(F_{g1g2}) and list(F_{g2}) store the per-batch values of F_{g1g2} and F_{g2}, respectively. The increasing and decreasing procedures are as follows. Increasing procedure: for each candidate g2 ∈ PG, if it exists in PL_i, we update its frequency information directly; otherwise, we search the projected set S_{g1}^{G_i} to get F_{g2g1} and search the whole batch to get F_{g2}. All these frequency statistics are appended to list(F_{g2}), list(F_{g2g1}), and list(F_{g1}), respectively.
Decreasing procedure: we need to decrease the frequency information of the candidates in the PG when an outdated data batch is deleted from the sliding window. Because each candidate is stored as a five-dimensional tuple, the decreasing process is fast: we only need to remove the first element of the array lists of F_{g2}, F_{g2g1}, and F_{g1} whenever a batch of graphs becomes outdated.
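The five-dimensional tuple with its per-batch array lists can be sketched as a small class; with deques, the decreasing procedure is a constant-time pop of the head entry. This is an illustrative sketch (field names are ours, not the paper's).

```python
from collections import deque

# Candidate tuple (g2, F, list(F_g1), list(F_g2), list(F_g1g2)) with
# per-batch array lists, supporting the increase/decrease procedures.

class Candidate:
    def __init__(self, g2):
        self.g2 = g2
        self.F = 0              # total stream graphs seen since insertion
        self.f_g1 = deque()     # list(F_g1): one count per window batch
        self.f_g2 = deque()     # list(F_g2)
        self.f_joint = deque()  # list(F_g1g2)

    def increase(self, n_batch, f_g1, f_g2, f_joint):
        """New batch arrives: append its frequency statistics."""
        self.F += n_batch
        self.f_g1.append(f_g1)
        self.f_g2.append(f_g2)
        self.f_joint.append(f_joint)

    def decrease(self, n_batch):
        """Oldest batch expires: drop the head of each array list."""
        self.F -= n_batch
        self.f_g1.popleft()
        self.f_g2.popleft()
        self.f_joint.popleft()

    def totals(self):
        """Window totals (F_g1, F_g2, F_g1g2) for Eq. (2)."""
        return sum(self.f_g1), sum(self.f_g2), sum(self.f_joint)
```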

2) PRUNING PG
If |PG| becomes considerably large, maintaining and searching the PG may be time-consuming, which in turn requires pruning the PG (steps 16 to 22 in Algorithm 2). If the number of candidates in the PG exceeds a predefined threshold m · k, Hoe-PGPL simply removes the candidates whose correlation value is less than φ_D − 2ε_w, so as to keep the number of candidates in the PG bounded by m · k.
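The pruning rule can be sketched as follows; `pg` is a hypothetical dict mapping each candidate to its window correlation φ_D with the query, and ε_w is the window-level Hoeffding error.

```python
# Prune the PG when it exceeds m*k entries: drop every candidate whose
# window correlation falls below phi_Dk - 2*eps_w, where phi_Dk is the
# k-th correlation value currently in the PG.

def prune_pg(pg, k, m, eps_w):
    if len(pg) <= m * k:
        return pg                                   # no pruning needed
    phi_dk = sorted(pg.values(), reverse=True)[k - 1]
    return {g: phi for g, phi in pg.items() if phi >= phi_dk - 2 * eps_w}
```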

IV. PRECISION AND RECALL BOUND ANALYSIS
Because our algorithm is an approximation-based method, in this section we study its theoretical bound in terms of query precision and recall.
Suppose the target top-k correlated graphs are denoted as T_{g2} and the graphs returned by Algorithm 2 are denoted as A_{g2}; then the precision and recall are precision = |A_{g2} ∩ T_{g2}| / |A_{g2}| and recall = |A_{g2} ∩ T_{g2}| / |T_{g2}|. Our algorithm uses two kinds of estimates: the k-th correlation values in a local batch (denoted by φ_L) and in the sliding window (denoted by φ_D), and the estimated correlation of each emerging candidate τ. Either kind of estimate may contribute to the overall error.
First, Lemma 1 bounds the possible bias between the k-th correlation value in a local batch (φ_L) and in the sliding window (φ_D), showing that φ_L is likely to be close to φ_D. Next, Lemma 2 states that a true correlated graph's correlation value in a batch, φ_L(g2, g1), will be larger than φ_D − ε_s with high probability. Combining Lemmas 1 and 2, we derive Theorem 1, which shows that our algorithm stores the true correlated graphs at a certain confidence level; in other words, a real top-k correlated graph will be stored in the PG and PL_i with a certain probability. As our global-local inspection scheme estimates each candidate accurately, we finally state our bound on precision and recall in Theorem 2, based on Theorem 1.
Lemma 1: Let the k-th correlation values in a local batch and in the sliding window be φ_L and φ_D, respectively. The probability that φ_L ≥ φ_D + ε_s is at most δ, where ε_s = √(R² ln(2/δ) / (2|G_i|)) and δ is a user-specified confidence threshold.

Proof: From the Hoeffding bound, φ_L(g2, g1) ≥ φ_D(g2, g1) + ε_s holds with probability at most δ. Suppose graph g2 is the k-th correlated graph in a local batch, i.e., φ_L(g2, g1) = φ_L; then with probability at most δ,

φ_L ≥ φ_D(g2, g1) + ε_s    (5)

Case 1: if g2 is also the k-th graph in the sliding window, then φ_D(g2, g1) = φ_D and the lemma holds.
Case 2: if g2 is not one of the real top-k correlated graphs in the sliding window, then φ_D(g2, g1) < φ_D. From Eq. (5), we know that φ_L ≥ φ_D + ε_s holds with probability p ≤ δ. Case 3: if g2 is one of the top-k correlated graphs in the sliding window, then φ_D(g2, g1) ≥ φ_D. As g2 is the k-th correlated graph in the local batch, there is a set of graphs G_A such that for each graph g2' ∈ G_A, φ_L(g2', g1) ≥ φ_L. With probability at most p_z ≤ δ,

φ_L(g2', g1) ≥ φ_D(g2', g1) + ε_s    (6)

Case 3a: if the real top-k correlated graphs in the sliding window are exactly the graphs in G_A, then there is a graph g2' in G_A that is the k-th correlated graph in the sliding window, i.e., φ_D(g2', g1) = φ_D. Since φ_L(g2', g1) ≥ φ_L and φ_D(g2', g1) = φ_D, according to Eq. (6) we know that φ_L ≥ φ_D + ε_s holds with probability p ≤ p_z ≤ δ.
Case 3b: if at least one graph g2' in G_A is not among the real top-k correlated graphs, then φ_D(g2', g1) < φ_D. Since φ_L(g2', g1) ≥ φ_L and φ_D(g2', g1) < φ_D, substituting into Eq. (6) we again have φ_L ≥ φ_D + ε_s with probability p ≤ p_z ≤ δ. Lemma 2: If g2 is one of the true top-k correlated graphs in the sliding window, the probability that φ_L(g2, g1) ≥ φ_D − ε_s is at least 1 − δ.
Proof: If g2 is the k-th correlated graph in the sliding window (φ_D(g2, g1) = φ_D), the lemma follows directly from the Hoeffding bound (φ_L(g2, g1) ≥ φ_D(g2, g1) − ε_s holds with probability at least 1 − δ). If g2 is not the k-th correlated graph, then φ_D(g2, g1) ≥ φ_D, and the bound also follows.
As the k-th correlation value in a local batch (φ_L) is close to the k-th correlation value in a sliding window (φ_D) (Lemma 1), and each true top-k graph's correlation value is greater than φ_D − ε_s with high probability (Lemma 2), combining these two inequalities guarantees that each true correlated graph will be stored by our algorithm with high probability, as shown in Theorem 1.

Theorem 1:
If g 2 is one of the real top-k correlated graphs, then PL i stores g 2 with probability higher than (1 − δ) 2 , and the PG stores g 2 with probability higher than (1 − δ) 2 .
Proof: According to Lemma 2, if g2 is one of the top-k correlated graphs in the sliding window, then φ_L(g2, g1) ≥ φ_D − ε_s with probability at least 1 − δ. From Lemma 1, the probability that φ_L ≥ φ_D + ε_s is at most δ; in other words, φ_D ≥ φ_L − ε_s with probability at least 1 − δ. Combining the two inequalities, we have φ_L(g2, g1) ≥ φ_D − ε_s ≥ φ_L − 2ε_s with probability at least (1 − δ)². Note that in each batch we retrieve the graphs whose correlation is above φ_L − 2ε_s and store them in the candidate list PL_i. Hence, if g2 is one of the top-k correlated graphs in the sliding window, it is held by the local list PL_i with probability at least (1 − δ)². As we insert the newly discovered graphs into the PG and use a similar strategy to maintain the PG, the PG also stores g2 with probability higher than (1 − δ)².
Since we design two levels of candidate lists to determine the real correlation of each newly emerging candidate, its estimated correlation approaches its actual correlation. So if a real top-k correlated graph is held in the PG, it will be returned. Our algorithm then has the following bound on the expected precision and recall, stated in Theorem 2.
Theorem 2: The expectations of recall and precision of our proposed Hoe-PGPL algorithm are both at least (1 − δ) 2 .
Proof: Because Hoe-PGPL returns the k highest correlated graphs, precision and recall are identical in our setting by definition, since |A| = |T| = k (the returned answer set and the true answer set are both of size k). Given that the correlation of each candidate g_2 with the query graph g_1 in the PG approaches the true correlation, a true answer is returned only if it is held in the PG. From Theorem 1, the probability of storing a true correlated graph is at least (1 − δ)². In other words, each true top-k correlated graph is returned with probability at least p_r = (1 − δ)². Suppose the number of true top-k correlated graphs returned by our algorithm is t; then t follows a binomial distribution over k trials, i.e., Pr(T = t) = C(k, t) p_r^t (1 − p_r)^(k−t). The expectation of this distribution is k·p_r and the variance is k·p_r(1 − p_r). Thus the expectation of the recall is k·p_r / k = p_r = (1 − δ)².
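The binomial argument in Theorem 2 can be checked with a small Monte Carlo sketch: each of the k true top-k graphs is returned independently with probability p_r = (1 − δ)², and the simulated mean recall converges to p_r. This is only an illustration of the expectation calculation, not part of the Hoe-PGPL implementation.

```python
import random

def simulate_expected_recall(k, delta, trials=20000, seed=7):
    """Simulate recall = t/k where t ~ Binomial(k, p_r), p_r = (1 - delta)^2.
    Returns the empirical mean recall and the theoretical bound p_r."""
    p_r = (1 - delta) ** 2
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # each true top-k graph is independently stored/returned w.p. p_r
        t = sum(rng.random() < p_r for _ in range(k))
        total += t / k
    return total / trials, p_r

mean_recall, p_r = simulate_expected_recall(k=20, delta=0.03)
```

With δ = 0.03 and k = 20 (the paper's defaults), p_r = 0.9409, and the simulated mean recall matches it closely.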

V. EXPERIMENTAL RESULTS
In this section, we describe our experiments and the associated results in detail. The streaming dataset we experimented with is the NCI Open Database Compounds, which contains a series of chemical compound structures used in illness-related screening. The official dataset consists of around 250,000 graphs, reduced to around 230,000 after preprocessing (e.g., deleting disconnected graphs).
We use two metrics, precision and recall, to measure the performance of our algorithm. Our algorithm is compared with an exhaustive search strategy, FixWin, in terms of memory consumption and running efficiency. FixWin is an exact method that returns all correct results with no false positives or false negatives. However, this approach is computationally infeasible because it demands a large amount of computing capacity and memory, so it is not an optimal choice for most stream-based applications. For example, in our experiments, whenever a new batch of graphs arrives, FixWin abandons its existing work and searches for the top-k correlated structural patterns from scratch.
In addition to FixWin, we also compare our algorithm with three simple solutions to justify the effectiveness of our global-local inspection scheme. The three simple solutions represent three different approaches to use PG and PLs to estimate the correlation values of newly emerging candidates.
• Estimate from the most current batch (denoted as Hoe-M): A newly emerging candidate is inserted into the PG directly, and its correlation value in the most current batch is used as an estimation of the candidate's correlation in the sliding window.
• Use PG-only (Hoe-PG): Use the PG to estimate the correlation from a global perspective.
• Use PLs-only (Hoe-PLs): Use PLs to estimate the correlation from a local perspective.
We used fifty query graphs in our experiments; each was randomly selected and has a support value between 0.02 and 0.05 in the data streams. If there are α batches of graphs in total, the average precision for a given query is calculated as Precision = (1/α) Σ_{i=1}^{α} P_i, where P_i is the precision in the sliding window D = {G_j | i − w + 1 ≤ j ≤ i}. The average recall, time, and memory consumption are calculated in the same way. Note that these metrics are then averaged again over the fifty query graphs to obtain our final metrics.
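The two-stage averaging above (first over the α sliding windows, then over the fifty queries) can be written compactly; this is a trivial sketch with invented function names, shown only to make the metric definition unambiguous.

```python
def average_metric(per_window_values):
    """Average a per-sliding-window metric over alpha windows:
    Precision = (1/alpha) * sum_i P_i. Same formula for recall, time, memory."""
    return sum(per_window_values) / len(per_window_values)

def final_metric(per_query_values):
    """Average the per-query averages over the query set
    (fifty query graphs in our experiments)."""
    return average_metric([average_metric(v) for v in per_query_values])
```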
For each batch, we rely on CGSearch [25], [26] to discover a set of top-k correlated graphs over the data stream. In the first batch, we lower the correlation threshold for CGSearch step by step to obtain the top-k answers and then insert these graphs into both PL_1 and the PG. Let φ_D be the k-th correlation value in the PG; we can then use φ_D as the correlation threshold to find the potential top-k correlated graphs in the next batch. By doing so, we can discover a series of candidates in different batches effectively. An alternative is to employ TopCor [28], [30] to find the candidates within different batches. It is worth noting that both CGSearch and TopCor can be integrated into our Hoe-PGPL algorithm, and the choice does not affect the fairness of comparison in our experiments.
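The per-batch candidate-discovery loop described above can be sketched as follows. This is a hedged outline, not the paper's code: `correlated_search` is a placeholder standing in for CGSearch (or TopCor), and the PG is modeled as a simple dict from graph id to correlation value.

```python
def process_batch(batch, query, pg, k, correlated_search):
    """One step of the candidate-discovery loop.

    `correlated_search(batch, query, threshold)` is a stand-in for
    CGSearch/TopCor: it returns {graph_id: correlation} for graphs in
    `batch` whose correlation with `query` is at least `threshold`.
    """
    # phi_D: the k-th highest correlation currently held in the PG,
    # used as the correlation threshold for the incoming batch.
    ranked = sorted(pg.values(), reverse=True)
    phi_D = ranked[k - 1] if len(ranked) >= k else 0.0
    candidates = correlated_search(batch, query, phi_D)
    pg.update(candidates)  # insert the newly discovered candidates
    return pg
```

On the first batch the PG is empty, so the threshold starts at 0 (mirroring the step-by-step threshold lowering); afterwards, the k-th PG value prunes the search in each new batch.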
The performance of our algorithm has been examined with different parameters, and the default values are δ = 0.03, k = 20, m = 4, w = 8, and |G_i| = 4000.

A. EXPERIMENTS WITH DIFFERENT K VALUES
To examine the performance of our algorithm for various k values, we vary k from 10 to 100. Fig. 3 compares the runtime and memory consumption, and Fig. 4 shows the average precision and recall for different k values.

1) SYSTEM RUNTIME AND MEMORY CONSUMPTION
With regard to memory consumption, Fig. 3 shows that FixWin consumes around three times more memory than the Hoeffding bound based methods, since FixWin has to keep the graph data of an entire window, whereas the other methods only store the latest batch of data and the candidates in PLs and/or the PG. Among the Hoeffding bound based methods, Hoe-M and Hoe-PG use slightly less memory than Hoe-PLs and Hoe-PGPL. This is because Hoe-M and Hoe-PG only employ the PG to store the candidates, while Hoe-PLs and Hoe-PGPL additionally use PLs to estimate the correlation values.
Similarly, with regard to runtime, Hoe-PGPL and the other Hoeffding bound based strategies are several times faster than the exhaustive search, as shown in Fig. 3. Meanwhile, the time consumption of the Hoeffding bound based methods is very close across methods. This result shows that including PLs (as Hoe-PLs and Hoe-PGPL do) does not significantly increase the time consumption compared to the methods using the PG only (Hoe-M and Hoe-PG). This is because the majority of the runtime for all these methods is taken by finding the top-k correlated structural patterns in batches with CGSearch [25], [26], which involves two time-consuming procedures: mining frequent subgraphs with gSpan [63] and checking candidate frequency to obtain F_{g_1}.
The results in Fig. 3 reveal that, compared to the exhaustive search method FixWin, Hoeffding bound based methods are several times more efficient in terms of runtime and memory consumption. In addition, all Hoeffding bound based methods are close to each other in memory and time consumption. Because their results largely overlap, for clarity of presentation we will mainly focus on Hoe-PGPL in the following subsections.

2) QUERY PRECISION AND RECALL
In Fig. 4, we report the effectiveness of our global-local inspection scheme in terms of query precision and recall. From Fig. 4(A), it is clear that Hoe-M has the worst performance among these methods. This is because Hoe-M simply uses the correlation value in the most current batch as an estimate of the sliding-window correlation for each emerging candidate pattern. Because an emerging pattern's correlation value in the current batch may be relatively large, this estimate introduces severe bias and affects the candidate ranking in the PG, which in turn results in inaccurate query results. In comparison, Hoe-PG, Hoe-PLs, and Hoe-PGPL integrate additional estimation techniques that take the PG and/or PLs into consideration and thus significantly improve performance. While Hoe-PG uses only the PG to estimate the correlation values roughly, Hoe-PLs makes use of a set of local candidate lists (PLs), which yields more accurate estimates than Hoe-PG. Because the PG and PLs each have their own advantage in estimating the correlation value from a global or local perspective, Hoe-PGPL combines their strengths and achieves better precision and recall. Since the correlation value estimated by Hoe-PGPL is more reliable and closer to the true value, the precision and recall are theoretically bounded according to Theorem 2. The results in Fig. 4 confirm that Hoe-PGPL is consistently better than the other methods in both precision and recall.
Note that in the experiment with increasing k values, the precision and recall values drop for all algorithms. This is because with a large k, φ_D (the k-th correlation value in the PG) will be small, and the number of candidates in the range [φ_D − ε_w, φ_D + ε_w] may increase dramatically compared to a larger φ_D. For instance, when φ_D = 0.85, there may be only ten candidates in the range [0.85 − ε_w, 0.85 + ε_w], whereas when φ_D = 0.65, there may be 235 candidates in the range [0.65 − ε_w, 0.65 + ε_w]. It is challenging to differentiate these candidates, as their ranking order in the PG may change dramatically even if only a small estimation error exists in the PG. Nevertheless, Fig. 4 shows that when Hoe-M drops dramatically in performance at k = 100, Hoe-PGPL decreases only slightly, which reveals the robustness of our Hoe-PGPL algorithm. As both the precision and recall of Hoe-PGPL approach 1, Hoe-PGPL is a highly accurate algorithm.
In summary, the above experiments and observations conclude that (i) Hoeffding bound based methods can significantly reduce memory and time consumption compared to the exhaustive search method, and (ii) the Hoe-PGPL scheme outperforms its peers in terms of precision and recall at the cost of a tiny amount of runtime and memory, as it considers not only the potential global candidates (PG) but also the potential local candidates (PLs).
B. EXPERIMENTS WITH VARIOUS BATCH SIZES |G_i|
Fig. 5 and Fig. 6 show the performance of our algorithm with the batch size varying from |G_i| = 3000, 5000, 7000, to 9000.
From the results shown in Fig. 5 and Fig. 6, as the batch size increases, the precision and recall increase for all methods. This is because increasing the batch size decreases ε_s (ε_s = √(R² ln(2/δ) / (2|G_i|))). In other words, the correlation of each candidate in the PG will be closer to its true correlation value, resulting in better performance. However, the time and memory consumption also climb, because the number of graphs that need to be processed in each batch increases. Nonetheless, Figs. 5 and 6 show that Hoe-PGPL is much more efficient than FixWin in terms of memory and time, and is also superior to the other methods in terms of precision and recall.
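The effect of the batch size on the bound can be computed directly from the formula ε_s = √(R² ln(2/δ) / (2|G_i|)); the sketch below (illustrative, assuming correlation values in [0, 1] so R = 1) shows the bound tightening as |G_i| grows through the experimented sizes.

```python
import math

def eps_s(batch_size, delta=0.03, R=1.0):
    """eps_s = sqrt(R^2 * ln(2/delta) / (2 * |G_i|)): the Hoeffding
    error bound shrinks as the batch size |G_i| grows."""
    return math.sqrt(R * R * math.log(2 / delta) / (2 * batch_size))

# larger batches -> tighter bound -> PG correlations closer to true values
bounds = {n: eps_s(n) for n in (3000, 5000, 7000, 9000)}
```

For example, ε_s drops from about 0.026 at |G_i| = 3000 to about 0.015 at |G_i| = 9000, which is why precision and recall improve with larger batches.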
C. EXPERIMENTS WITH VARIOUS SLIDING WINDOW SIZES
Fig. 7 and Fig. 8 show the performance of our algorithm with the sliding window size w varying from 5 to 20.
From Fig. 7 and Fig. 8, it is obvious that as the window size increases, all methods experience a decline in query precision and recall. This is because the larger the window size, the farther the correlation of each candidate in the PG will be from its true correlation. Intuitively, φ_L(g_2, g_1) in a batch will be much nearer to φ_D(g_2, g_1) in a sliding window if |G_i| is closer to the total number of graphs in the sliding window (i.e., w|G_i|). Fig. 8 also shows that our Hoe-PGPL significantly outperforms the exhaustive method with regard to system runtime and memory consumption. When the window size increases, the runtime taken by Hoe-PGPL remains the same, whereas the runtime of FixWin increases dramatically. For instance, with w = 20, Hoe-PGPL only needs about 50 seconds to return the answer, whereas FixWin requires about 700 seconds. In this case, Hoe-PGPL outperforms FixWin by an order of magnitude.

D. EXPERIMENTS WITH DIFFERENT POTENTIAL CANDIDATE LIST SIZE
In Fig. 9 and Fig. 10, we compare the performance of the algorithms with different sizes of the potential global candidate list (PG), varying the m value from 2 to 10 to change the size of the PG.
The results in Fig. 9 show that with a smaller m value, Hoe-PG's performance deteriorates significantly, with only about 0.92 and 0.89 in precision and recall, respectively. In this case, Hoe-PG is significantly worse than the methods involving local PLs (Hoe-PLs and Hoe-PGPL). Meanwhile, Hoe-PG is only slightly better than Hoe-M, which uses the most current batch to estimate the correlation values. This result demonstrates the power of employing local PLs in the algorithms. As the m value increases, the precision and recall of all methods improve, because a larger PG allows more candidates to be stored in the PG list. As a result, fewer candidates need their correlation values estimated, which in turn produces more accurate results. In an extreme case, if the size of the PG were unlimited, we would record information for all patterns in the graph stream. The algorithm would then become 100% accurate, yet it would require unbounded storage space (for the PG) and a significant amount of time for checking the PG list. In our experiments, when varying the value of m from 2 to 10, Hoe-PGPL consistently outperforms the others in terms of precision and recall, with no notable increase in system runtime or memory consumption.

VI. CASE STUDY ON DBLP GRAPH STREAM
In this section, we use the DBLP graph streams to analyze the correlated graphs discovered by our Hoe-PGPL algorithm.
DBLP Graph Stream: The DBLP database contains bibliographic data on the majority of computer science journals and proceedings [71]. Each instance in this database consists of several attributes, i.e., article title, authors, abstract, and a unique ID [39], [53]. In our case, we choose a series of Database and Data Mining conferences, summarized in Table 2, and use the papers published at these conferences as sources to build a graph stream.
In our case study, graphs are papers recorded in the DBLP database; keywords or unique identifiers of these papers are represented as nodes, and citation relationships across papers, or the relationships between titles, authors, and keywords, are defined as edges. In detail, we define that (i) the unique identifier of a paper is denoted as a node; (ii) every keyword in a paper is also denoted as a node; (iii) there is an edge between paper P_A and paper P_B if a citation relationship exists between P_A and P_B; and (iv) within a given paper, the unique identifier and the keywords are linked with each other; in particular, the keywords of a given paper are fully connected. Fig. 11 shows a graph example for a DBLP paper.
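The construction rules (i)-(iv) above can be sketched in a few lines. This is an illustrative sketch only (the function name and representation are invented, not the paper's code), using undirected edges modeled as frozensets.

```python
from itertools import combinations

def paper_graph(paper_id, keywords, cited_ids):
    """Build the node/edge sets for one DBLP paper following rules (i)-(iv):
    the paper id and each keyword are nodes; keywords are fully connected,
    each keyword links to the paper id, and citations link paper ids."""
    nodes = {paper_id, *keywords, *cited_ids}
    edges = set()
    edges.update(frozenset(p) for p in combinations(keywords, 2))  # (iv) keyword clique
    edges.update(frozenset((paper_id, kw)) for kw in keywords)     # (iv) id-keyword links
    edges.update(frozenset((paper_id, c)) for c in cited_ids)      # (iii) citation edges
    return nodes, edges

nodes, edges = paper_graph("P1", ["data", "streams", "mining"], ["P2"])
```

For a paper with three keywords and one citation, this yields five nodes and seven edges (three keyword-clique edges, three id-keyword edges, one citation edge).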
In our study, we use ''data-streams'' as a one-edge query graph and retrieve the top-20 correlated subgraphs from the whole graph stream. As a correlated graph query returns subgraphs with a distribution similar to the query graph, we expect this structural pattern query to yield the following meaningful results: (i) keywords that frequently appear with ''data'' and ''streams'', which may indicate data stream properties, challenges, research directions, methods, etc., and (ii) some state-of-the-art literature related to data streams that has been widely cited in the DBLP benchmark. Fig. 12 shows 10 correlated graphs returned by our Hoe-PGPL algorithm. From the correlated graphs, we can observe that many research papers related to data streams indeed consider the time-changing (g_1, g_6, and g_10), high-speed (g_2, g_3, g_4, and g_10), or concept-drift (g_8 and g_9) properties of data streams. According to the results shown in g_5 and g_7, querying over data streams has also been a popular research direction in the past years. Subgraph g_9 shows that the classifier-ensemble-based method is a popular method for handling data streams.
The results in Fig. 12 also show that some widely cited papers in data stream research have been retrieved in our query results. For instance, P1 ([16]) is the first incremental learning algorithm to address data stream classification using decision tree methods. P2 ([23]) is the follow-up work of P1 ([16]), which intends to capture the time-changing properties of data streams. P4 ([57]) is a state-of-the-art classifier ensemble learning algorithm for handling concept drift over data streams. P3 ([15]) is another state-of-the-art algorithm addressing the query task over data streams. All four papers (P1, P2, P3, and P4) represent early works on data streams. According to Google Scholar statistics on March 17, 2013, the citation counts of these papers are 905, 828, 306, and 692, respectively.
The above case study indicates that correlated structural pattern search is indeed useful for discovering interesting patterns inside streams. For academic publication streams (such as DBLP), it can retrieve research challenges, directions, methodologies, and state-of-the-art literature related to a given query. For chemical compound analysis, it may help discover fundamental properties or similar substructures of the provided query graph, which may eventually help discover new drugs.

VII. RELATED WORK
The work presented in this paper is related to correlation mining, correlated structural pattern (graph) mining, data stream/graph stream mining, and correlated graph stream query.
Correlation mining has attracted much research interest and has been widely studied in multiple domains, such as biomedicine and economics. For example, correlation mining has been extensively examined and adopted in market-basket databases [47], [59], [61], [68]. Although the methods in these works are designed to mine correlation based on the Pearson correlation coefficient, other strategies exist as well.
FIGURE 12. A query graph (''data-streams'') with one edge and part of its top-20 correlated graphs discovered from the DBLP graph streams. The values in the brackets are the correlation values. Note that the keywords should be fully connected; for clear presentation, we omit some edges. P1, P2, P3, and P4 are reference papers, i.e., P1 - [16], P2 - [23], P3 - [15], P4 - [57].
Correlation mining is also an active research topic for graph databases. For example, CGSearch mines correlated graphs by filtering for correlation above a predefined threshold [25], [26], and TopCor mines correlated graphs by searching for the graphs with the most significant correlation [28], [30]. These two algorithms handle a given query graph g_1, whereas the works in [27], [32] take a different approach: they mine all correlated structural pattern pairs in a graph database. In contrast, our algorithm is designed for dynamic graph databases rather than being limited to static ones.
For correlated graph search in data stream settings, [42] previously proposed the CGStream algorithm to query correlated graphs whose correlations are higher than a predefined value θ. CGStream treats query graphs as operators, which constantly query subgraph patterns related to themselves while passing through the graph stream. To reduce the computational cost, CGStream sets up a set of outlooks (special time stamps) over the stream, and the mining task is only triggered at the outlooks. The Hoe-PGPL algorithm differs from CGStream in two aspects: (i) CGStream is a threshold-based algorithm (i.e., it retrieves graphs with correlation values higher than θ), whereas Hoe-PGPL is a top-k based algorithm (i.e., it returns only the top-k most correlated graphs); (ii) CGStream stores all graphs in the sliding window to perform repeated mining at each outlook, while Hoe-PGPL only stores the potential candidates in two potential lists (PG and PLs) and discards all graphs in the sliding window after processing.

VIII. CONCLUSION
In this paper, we have studied searching for the top-k correlated patterns by using a sliding window approach that covers graph streams over multiple consecutive batches. In a dynamic data stream scenario, exhaustively searching for the top-k correlated patterns requires storing the entire window of graph data and repeating the query process, which is computationally infeasible and memory-intensive. In our proposed method, each candidate's correlation in the PG can be accurately estimated by applying the Hoeffding bound and a global-local inspection scheme that integrates the PLs and the PG. Theoretical analysis proves that this method can guarantee the quality of the retrieval results, and experimental results show that, in terms of time and memory consumption, our algorithm is much more efficient than an exhaustive search method while achieving good precision and recall.