Efficient Methods for Clickstream Pattern Mining on Incremental Databases


A variant of sequential patterns, called clickstream patterns, has also attracted a number of researchers in recent years. A clickstream is an ordered list of the actions that a user takes when navigating a website. Clickstream patterns are usually applied in two popular fields, e-commerce and network traffic analysis. In e-commerce, clickstream pattern mining is used to analyze, evaluate and predict online customer behaviors in order to determine the effectiveness of the site as a marketing channel, and is especially useful in the digital era. In network traffic analysis, the server tracks how many pages are served to the visitor, how long each page takes to load, how much data is transmitted before the user moves on, etc.
The associate editor coordinating the review of this manuscript and approving it for publication was Alba Amato.
Nevertheless, most existing clickstream pattern mining methods [12]-[16] focus only on static sequence databases and ignore incremental databases, even though many databases are updated incrementally. For example, customer online transaction databases in e-commerce grow daily as transactions from new or existing customers are appended, and stock price sequences likewise grow over time. There are two kinds of database updates in applications: inserting new clickstreams and appending new actions to existing clickstreams. Previous approaches are not suitable for this situation because the result mined from the old database is no longer valid on the updated database, and it is extremely inefficient to mine the updated database from scratch. The challenge is to find a solution that reduces the runtime and the number of times the original database is rescanned, and thereby the computational cost of mining clickstream patterns from incremental databases.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In this paper, we apply the pre-large concept and the B-List structure generated from an SPPC-tree to propose two effective methods for mining clickstream patterns from incremental databases. We consider only the case in which new clickstreams are inserted into the existing database. Our experiments on various real-life databases show that the proposed methods outperform the SMUB algorithm [13] in terms of runtime and memory consumption, especially on large databases with low minimum support thresholds. Our contributions can be summarized as follows:
• We apply a compact data structure named SPPC-tree to compress the original database and reduce the cost of database scans in clickstream pattern mining.
• We reduce the rescanning of the original database based on a safety threshold. If the number of inserted clickstreams is not greater than a safety threshold, then the algorithm does not need to rescan the original database. This is the first study to apply the pre-large concept to a clickstream database. A pre-large clickstream acts like a buffer and is used to reduce the movement of clickstreams directly from large to small and vice versa during the incremental mining process.
• Two effective methods for mining clickstream patterns from incremental databases are proposed. The first method inserts the large 1-clickstream patterns from the inserted database into the existing tree, then mines all frequent clickstream patterns from the updated tree. The second is a novel method that creates a new tree for the pre-large and large 1-clickstream patterns from the inserted database and updates the existing list of pre-large/large clickstream patterns mined from the original database, then selects all frequent clickstream patterns. The proposed methods do not rescan the original database until the number of newly inserted clickstreams exceeds a safety threshold.
• We evaluate the proposed algorithms on real-world and synthetic clickstream databases.
The remainder of this paper is organized as follows. Related works are briefly presented in section 2. The basic concepts are described in section 3. In section 4, we present the two proposed methods for clickstream pattern mining from incremental databases. The experimental results for showing the performance of the proposed methods are provided in section 5, and the conclusion is given in section 6.

II. RELATED WORKS
Sequential pattern mining (SPM) is an essential data mining task that has been widely applied, and many algorithms have been proposed for it. These algorithms can be classified into two groups. The first group uses a horizontal database format and includes AprioriAll [17], GSP [18], FreeSpan [19], and PrefixSpan [20]. AprioriAll is a popular algorithm that uses the ''generate and test candidate'' approach, GSP is an extension of AprioriAll, FreeSpan progressively reduces the size of the databases by using a projection process, and PrefixSpan is an improved version of FreeSpan that uses a prefix projection method.
The other group contains algorithms that use a vertical database format, such as SPADE [21], SPAM [22], PRISM [23], CM-SPADE and CM-SPAM [24]. SPADE is one of the most efficient algorithms for SPM, and it identifies all frequent sequential patterns by BFS or DFS based on lattice decomposition. SPAM uses a depth-first search strategy to generate candidate patterns and a bitmap structure to store position information and encode the database. PRISM uses primal block encoding to represent the database. CM-SPADE and CM-SPAM are improved versions of SPADE and SPAM, respectively, that integrate the CMAP structure, which stores co-occurrence information about items to prevent redundant candidates from being generated.
Many compact pattern mining approaches have been proposed instead of mining a full set of sequential patterns, such as closed sequential pattern mining [25]- [26], mining sequence pattern with constraints [27]- [28], and mining top-k frequent patterns [29]- [30].
Parallel computing approaches have also been studied [31] to speed up sequence pattern mining processing, with pDBV-SPM [32] using a dynamic bit vector structure to quickly calculate the support values, while MCM-SPADE [33] uses the CMAP (Co-occurrence MAP) structure for storing cooccurrence information, with both pDBV-SPM and MCM-SPADE being based on multi-core architecture. Recently two methods named DPCompact-SPADE and APCompact-SPADE [34] were proposed for mining frequent weighted clickstream patterns, based on hyper-threading technology. In addition, many proposed methods are based on computers with distributed memory [35]- [37].
Recently, researchers have turned their attention to clickstream patterns, a special case of sequential patterns that is used in market behavior analysis and network traffic analysis. Many methods have been proposed for mining this kind of pattern, such as SMUB [13], CM-WSPADE [15], CUP [14], and CCPC [16]. SMUB mines clickstream patterns based on the B-List structure. CM-WSPADE mines frequent weighted clickstream patterns based on the WIBList data structure and the WCMAP (Weighted Co-Occurrence Map). CUP mines clickstream patterns based on the pseudo-IDList data structure and applies the DUB (Dynamic intersection Upper Bound) constraint to reduce candidates. CCPC mines closed clickstream patterns based on the C-List structure.
In many applications, databases are always updated incrementally, such as customer online transaction databases in e-commerce that grow due to the new transactions being inserted into the existing databases. However, the previous works are not suitable for mining on incremental databases because the result mined from the original database is no longer valid on the updated database.
Many algorithms for incremental pattern mining have thus been proposed, such as PreFUSP-TREE-INS [38], IncWTP [39], SSP-Tree [40]. PreFUSP-TREE-INS mines sequential patterns based on the PreFUSP-tree structure. It does not rescan the original database until the number of newly added sequences exceeds a safety threshold. The preFUSP-tree structure is a combination of the pre-large concept and the FUSP-tree structure. IncWTP is used to detect and delete the failed sequence data in a timely manner, and SSP-Tree mines rare association rules from incremental databases.
In addition, various pattern mining approaches for incremental databases have been studied, such as Pre-FUT [41], FCIL [42], [43], MU2P-Miner [44], IWEL [45], PRE-HAUIMI [46], APHAUP [47], RHUPS [48], LIMHAUP [49], and ILDHUP [50]. Pre-FUT uses the Trie data structure and the concept of pre-large itemsets for mining incremental frequent itemsets. FCIL (Frequent Closed Itemset Lattice) builds an incremental frequent (closed) itemset lattice. MU2P-Miner mines incremental maximal frequent patterns from univariate uncertain data. IWEL mines erasable patterns in an incremental database with weighted conditions. PRE-HAUIMI mines the set of high average-utility itemsets based on the average-utility-list structure and uses the pre-large concept to reduce the number of database scans. APHAUP uses two upper bounds to reduce the number of candidates for high average-utility itemset mining (HAUIM). RHUPS mines high utility patterns without generating candidate patterns by applying a sliding window model that accounts for the insertion time of each batch in a stream database. LIMHAUP mines high average-utility patterns from dynamic databases based on a compact data structure named HAUP-Lists. ILDHUP mines high utility patterns from dynamic databases based on the HUPM approach and a damped window model that decreases the importance of older data over time.
The challenges of this problem are how to find a solution to minimize the memory usage and reduce the number of rescans of the original database to reduce the computational cost in mining clickstream pattern process on incremental databases.
In this paper, we propose two effective methods for mining clickstream patterns from an incremental database. They are based on the SPPC-tree, a compact data structure that compresses the database, and on the pre-large concept, which reduces the rescanning of the original database when an inserted database is added to the existing database. A pre-large clickstream acts like a buffer and is used to reduce the movement of clickstreams directly from large to small and vice versa during the incremental mining process.

III. BACKGROUND AND PROBLEM DEFINITION
Let $A = \{a_1, a_2, \ldots, a_n\}$ be a set of $n$ distinct actions of the same category.
Definition 4: In other words, β is also called a super-clickstream of α.
Definition 5: A clickstream, which is a sub-clickstream of at least one user clickstream in CDB, is called a clickstream pattern.
Definition 6: The support count of a clickstream $X$, denoted $\theta(X)$, is the number of user clickstreams in CDB that contain $X$, i.e., $\theta(X) = |\{C_i \in CDB \mid X \subseteq C_i\}|$, where $X \subseteq C_i$ means that $X$ is a sub-clickstream of $C_i$.
A clickstream X is a frequent clickstream pattern if and only if θ (X ) ≥ ϕ, where ϕ is the minimum support threshold defined by the user. The clickstream pattern mining task is to mine all frequent clickstream patterns in CDB.
Definition 7: An inserted database, denoted as D, is a set of new clickstreams added to an existing database. The original and inserted databases are integrated into an incremental database, denoted as D * , where D * = CDB ∪ D. TABLE 1 illustrates an original clickstream database. Given the threshold ϕ = 2, the user clickstream {c, a, b, c} has cid = 104, and {c, b, c} is a frequent clickstream pattern because θ ({c, b, c}) = 3 > ϕ.
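The support count and the frequency test can be illustrated with a minimal Python sketch. It assumes that a sub-clickstream is an order-preserving, not necessarily contiguous, subsequence, and the function names and the toy database are hypothetical, not taken from TABLE 1.

```python
from typing import List, Sequence


def is_sub_clickstream(x: Sequence[str], c: Sequence[str]) -> bool:
    """True if x occurs in c in the same order (gaps are allowed; assumption)."""
    it = iter(c)
    # `action in it` consumes the iterator, so matches must appear in order.
    return all(action in it for action in x)


def support_count(x: Sequence[str], cdb: List[Sequence[str]]) -> int:
    """theta(X): number of user clickstreams in cdb that contain x."""
    return sum(1 for c in cdb if is_sub_clickstream(x, c))


def is_frequent(x: Sequence[str], cdb: List[Sequence[str]], phi: int) -> bool:
    """X is frequent if and only if theta(X) >= phi."""
    return support_count(x, cdb) >= phi


# Hypothetical toy database, one list of actions per user clickstream.
toy_cdb = [list("cabc"), list("cbc"), list("ab"), list("bcc")]
```

On this toy database, `support_count(list("cbc"), toy_cdb)` counts the first two clickstreams only, because the last one has no `c` before its `b`.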

A. PROBLEM STATEMENT
Given a clickstream database CDB, a minimum support threshold ϕ, and an inserted clickstream database D, the problem of clickstream pattern mining on incremental databases is to mine all frequent clickstream patterns from D * = CDB ∪ D. To resolve this problem, we can modify existing algorithms for mining sequence patterns, such as CM-SPADE, SMUB, and CUP. However, the inserted database is usually very small compared to the original database, so rescanning the whole incremental database takes much time and is not needed in most cases. The problem is how to mine or update the clickstream patterns using the inserted database while avoiding or reducing rescans of the original database, in order to reduce the computational cost of the mining process.
Section 4 presents more details of the proposed methods for mining clickstream patterns from incremental databases.

IV. CLICKSTREAM PATTERN MINING ON INCREMENTAL DATABASES
In this section, we first present the pre-large concept, which is used to improve the mining process, and the SPPC-Tree structure. Based on the pre-large concept and the B-List structure generated from the SPPC-Tree structure [13], two effective clickstream pattern mining methods for incremental databases are then proposed.

A. PRE-LARGE CONCEPT
The pre-large concept [51] uses two thresholds, a lower support threshold (S L ) and an upper support threshold (S U ). S U is the same as the minimum support used in static mining algorithms. Assume that SC U and SC L are the support count of S U and S L , respectively. A clickstream pattern with the support count (SC) ≥ SC U is a large clickstream pattern (or frequent clickstream pattern). A clickstream pattern with SC < SC L is a small clickstream pattern (or infrequent clickstream pattern). On the other hand, a clickstream pattern with SC L ≤ SC < SC U is considered to be a pre-large clickstream, and it may be a large clickstream in the future. A pre-large clickstream acts like a buffer and is used to reduce the movement of clickstreams directly from large to small and vice versa during the incremental mining process.
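The three classes can be expressed as a small Python sketch. The function name `classify` and its parameters are hypothetical; exact fractions are used so that a boundary case such as SC = S U × n is not affected by floating-point rounding.

```python
from fractions import Fraction


def classify(sc: int, n: int, s_l: Fraction, s_u: Fraction) -> str:
    """Classify a pattern with support count sc in a database of n clickstreams.

    s_l and s_u are the lower and upper support thresholds as fractions of n.
    """
    if not 0 <= s_l <= s_u <= 1:
        raise ValueError("expected 0 <= S_L <= S_U <= 1")
    if sc >= s_u * n:   # SC >= SC_U
        return "large"
    if sc >= s_l * n:   # SC_L <= SC < SC_U
        return "pre-large"
    return "small"      # SC < SC_L
```

For instance, with n = 6, S L = 30% and S U = 50%, a support count of 3 is large, 2 is pre-large, and 1 is small.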
To reduce the need for rescanning the original database, a safety threshold $f$ based on the pre-large concept is applied. The safety threshold $f$ is defined as follows [45]:
$$f = \left\lfloor \frac{(S_U - S_L)\,|CDB|}{1 - S_U} \right\rfloor$$
where $S_U$ is the upper threshold, $S_L$ is the lower threshold, and $|CDB|$ is the number of clickstreams in the original sequence database CDB. The value of $f$ is rounded down to an integer. Given a clickstream $X = \{x_1, x_2, \ldots, x_k\}$, there are three cases: $X$ is a large, a pre-large, or a small clickstream pattern.
Property 1 [52]: A large clickstream pattern in D * must be large in either CDB or D (or both).
Property 2 [52]: A clickstream pattern that is small in both CDB and D is also small in D * .
Lemma 1 [51]: If |D| ≤ f , the algorithm does not need to rescan the original database.
In that case, a small clickstream in the original database that is a large clickstream in the inserted database cannot be a large clickstream for the incremental database.
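For example, consider the inserted database in TABLE 2 and assume $S_U = 50\%$ and $S_L = 30\%$. We have
$$f = \left\lfloor \frac{(0.5 - 0.3) \times 6}{1 - 0.5} \right\rfloor = \lfloor 2.4 \rfloor = 2,$$
where $|CDB| = 6$. Because the number of inserted clickstreams does not exceed $f$, the algorithm does not need to rescan the original database in TABLE 1 to confirm the small clickstreams in the original database.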
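The computation in this example can be checked with a short Python sketch. It assumes the usual pre-large form of the safety threshold, f = ⌊(S U − S L) × |CDB| / (1 − S U)⌋; the function names are hypothetical.

```python
import math
from fractions import Fraction


def safety_threshold(s_u: str, s_l: str, n_original: int) -> int:
    """Safety threshold f, rounded down (assumed pre-large formula)."""
    su, sl = Fraction(s_u), Fraction(s_l)  # exact arithmetic, e.g. "0.5"
    if not 0 <= sl <= su < 1:
        raise ValueError("expected 0 <= S_L <= S_U < 1")
    return math.floor((su - sl) * n_original / (1 - su))


def needs_rescan(n_inserted: int, f: int) -> bool:
    """The original database is rescanned only when |D| exceeds f."""
    return n_inserted > f


f = safety_threshold("0.5", "0.3", 6)  # (0.2 * 6) / 0.5 = 2.4, rounded down
```

With these values, `f` is 2, so inserting up to two clickstreams does not trigger a rescan under Lemma 1.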

B. SPPC-TREE
SPPC-tree [16] is a data structure, and each node in the tree consists of the following fields:
• action-name is the current action name.
• count is the number of actions sharing a common path from the root to the current node.
• first-child is a list of the first children that are expanded from the current node.
• first-father is the first previous node that is reached from the root node.
• right-sibling is the first next node with the same level as the current node.
• label-sibling is a list of nodes with the same action-name, and includes nodes that exist in different branches of the tree.
• pre-code is a pre-order number assigned by the pre-order traversal of the tree.
• post-code is a post-order number assigned by the post-order traversal of the tree.
The B-List structure is generated from the SPPC-Tree, and each of its entries has the form (pre-code, post-code, count). For example, with the clickstream database CDB in TABLE 1, assuming S U = 50% and S L = 30%, the SPPC-Tree is shown in FIGURE 1.
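To make the role of the pre-code and post-code fields concrete, the following Python sketch numbers a small tree and collects (pre-code, post-code, count) triples for one action. It is a simplified illustration: the class `Node` and the helpers are hypothetical, only a subset of the node fields is modelled, and numbering from 0 is an assumption. The ancestor test relies on the standard property of pre-order and post-order numbering.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    action: Optional[str]              # action-name (None for the root)
    count: int = 0
    children: List["Node"] = field(default_factory=list)
    pre: int = -1                      # pre-code
    post: int = -1                     # post-code


def assign_codes(root: Node) -> None:
    """Assign pre-order and post-order numbers in one depth-first pass."""
    pre = post = 0

    def visit(node: Node) -> None:
        nonlocal pre, post
        node.pre = pre
        pre += 1
        for child in node.children:
            visit(child)
        node.post = post
        post += 1

    visit(root)


def is_ancestor(a: Node, b: Node) -> bool:
    """a is a proper ancestor of b iff a starts earlier and finishes later."""
    return a.pre < b.pre and a.post > b.post


def b_list(root: Node, action: str) -> List[Tuple[int, int, int]]:
    """Collect (pre-code, post-code, count) for every node labelled `action`."""
    out, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.action == action:
            out.append((node.pre, node.post, node.count))
        stack.extend(reversed(node.children))  # keep pre-order
    return out


# Hypothetical tree: root -> c:2 -> b:1 and root -> a:1 -> c:1
b = Node("b", 1)
c1 = Node("c", 2, [b])
c2 = Node("c", 1)
a = Node("a", 1, [c2])
root = Node(None, 0, [c1, a])
assign_codes(root)
```

Because an ancestor test needs only two integer comparisons, entries of this form can be compared without walking the tree again.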

C. PROPOSED METHODS FOR CLICKSTREAM PATTERN MINING IN INCREMENTAL DATABASES
Static methods need to rescan the original database whenever new databases are inserted, which is ineffective and incurs a high computational cost. In this section, we propose two effective methods for mining clickstream patterns from an incremental database based on a safety threshold, which reduces both the rescanning of the original database and the runtime. Based on the pre-large concept, the two proposed methods do not rescan the original database until the number of newly inserted clickstreams exceeds the safety threshold.
A description of the two proposed algorithms is presented in the following two sub-sections. Both of them use the pre-CMUB procedure to build the SPPC-tree, as presented in Algorithm 1.
The database D is first scanned to find all 1-clickstream patterns whose support count is ≥ S L × |D|; these patterns are called promising clickstreams (line 1). The algorithm then eliminates all small 1-clickstream patterns from D (line 2). Line 3 creates an SPPC-tree T with an empty root node R. Next, each action of each clickstream in D is assigned a node and appended to the tree in the same order as in the user clickstream: the first action of the user clickstream is appended to the root, the second is appended to the first node, and so on. If the node already exists in the tree, its node information is updated by increasing the count (lines 7-8). Otherwise, a new child node is appended to the tree and becomes the first child (lines 10-13). Finally, the tree is traversed to create the pre-code and post-code values for each node (line 14).

Algorithm 1 Pre-CMUB (Building an SPPC-Tree With Lower Support S L )
Input: A clickstream database D, and threshold S L
Output: An SPPC-tree T
1. Scan D to find all 1-clickstreams with their support ≥ S L × |D|
2. Eliminate all small 1-clickstream patterns (support < S L × |D|) in D
3. Build an SPPC-tree T with a root node R = NULL
4. for each clickstream cls in D
5.     r = R
6.     for each action a in cls
7.         if r has a child node c such that c.action-name = a.action-name
8.             c.count = c.count + 1
9.         else
10.            create a new node c with c.action-name = a.action-name
11.            [...]
12.            add node c as a first-child node of r
13.        r = c
14. Traverse the tree T to create pre-code and post-code for each node after it is built.
15. return T
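The tree-building steps of Algorithm 1 can be sketched in Python as follows. This is a simplified illustration rather than the authors' implementation: children are stored in a dictionary instead of first-child/right-sibling links, a newly created node is assumed to start with count 1, and all identifiers are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional


class Node:
    """Simplified SPPC-tree node (children kept in a dict)."""

    def __init__(self, action: Optional[str] = None) -> None:
        self.action = action
        self.count = 0
        self.children: Dict[str, "Node"] = {}
        self.pre = -1
        self.post = -1


def assign_codes(root: Node) -> None:
    """Line 14: number the nodes in pre-order and post-order."""
    pre = post = 0

    def visit(node: Node) -> None:
        nonlocal pre, post
        node.pre = pre
        pre += 1
        for child in node.children.values():
            visit(child)
        node.post = post
        post += 1

    visit(root)


def pre_cmub(db: List[List[str]], s_l: float) -> Node:
    # Lines 1-2: keep only actions whose support count is >= S_L * |D|.
    support = Counter(a for cls in db for a in set(cls))
    keep = {a for a, sc in support.items() if sc >= s_l * len(db)}

    root = Node()                      # line 3
    for cls in db:                     # line 4
        r = root                       # line 5
        for a in cls:                  # line 6
            if a not in keep:
                continue               # action eliminated in line 2
            c = r.children.get(a)      # line 7
            if c is None:              # lines 9-12: create and attach
                c = Node(a)
                r.children[a] = c
            c.count += 1               # line 8 (new nodes end up at 1)
            r = c                      # line 13
    assign_codes(root)
    return root                        # line 15


# Hypothetical database: action d is below the lower threshold and is dropped.
tree = pre_cmub([["a", "b"], ["a", "b", "c"], ["a", "c"], ["d"]], 0.5)
```

In this example the three clickstreams that start with `a` share a single node `a` with count 3, which is how the tree compresses the database.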

1) AN UPDATED TREE METHOD FOR INCREMENTAL CLICKSTREAM PATTERN MINING USING B-LIST
a: INCMUB ALGORITHM
An SPPC-tree must be built in advance from the original database before new clickstreams are inserted. When new clickstreams are added, the algorithm uses the safety threshold to decide whether it needs to rescan the original database. As long as the threshold is not exceeded, small clickstreams from the original database can at most become pre-large and cannot become large, which reduces the amount of rescanning necessary.
inCMUB does not need to rescan the original database if |D| ≤ f . In that case, it only scans the inserted database to find all 1-clickstream patterns that satisfy S L and inserts them into the existing SPPC-tree built from the original database. Based on the updated tree, the algorithm mines all large k-clickstream patterns from the (k-1)-clickstream patterns and their B-Lists. This approach saves much time because it does not rescan the whole database to build the tree. Details of the inCMUB algorithm are shown in Algorithm 2.
Initially, the original database CDB is empty, so the first inserted database D becomes the original database, and the algorithm determines the safety threshold f (lines 1-2). If the number of clickstreams in the inserted database is greater than the safety threshold, the algorithm updates the safety threshold f and builds the SPPC-tree for the integrated database by executing the pre-CMUB procedure (lines 3-4).
Otherwise, the algorithm scans the inserted database once, finds the 1-clickstream patterns that satisfy S L (large or pre-large patterns), and updates the tree with them by calling the Update-Tree procedure (line 6).
Next, the algorithm creates the B-List for each large 1-clickstream pattern from the tree (line 7). Finally, the mining procedure is called to find all large clickstream patterns.

Algorithm 2 inCMUB
Input: Original clickstream database CDB with SPPC-tree T using S L , an inserted database D, S L and S U
Output: A set of all large clickstream patterns in the incremental clickstream database FCP
[...]
4.     pre-CMUB(D * , S L ) to create T
5. else
6.     Update-Tree(T , D, S L ) to update T
7. Generate the B-List L 1 for each large 1-clickstream from T
8. [...]
[...]
14.    for each action a in cls
15.        if r has a child node c such that c.action-name = a.action-name
16.            c.count = c.count + 1
17.        else
18.            create a new node c with c.action-name = a.action-name
19.            [...]
20.            add node c as a first-child node of r
21.        r = c
22. Traverse the tree T to create pre-code and post-code values for each node after it is built.
[...]
26.    Create B-List of c by joining the B-Lists of a and b that have the same (k-1) prefix as a
27.    [...]
As in the tree-building process, the Update-Tree procedure removes small 1-clickstream patterns from the inserted database (line 10). Next, it traverses the existing tree for each action of the clickstreams in D. If a node already exists, its node information is updated by increasing the count (lines 15-16). Otherwise, a new child node is appended to the tree (lines 18-21). Finally, the procedure updates the pre-code and post-code values for each node (line 22).
The final mining procedure finds all large k-clickstream patterns by joining the (k-1)-clickstream patterns that have the same prefix (lines 24-26). For each candidate pattern c, if the support count of c is no smaller than the upper support count SC U , the procedure adds c to FCP (lines 27-29). Finally, the procedure calls itself recursively to extend the clickstream patterns (line 30).

b: AN EXAMPLE
Consider the original database CDB in TABLE 1, inserted database D in TABLE 2, S U = 50% and S L = 30%.
The tree T was built from F 1 = {a, b, c, d, e}, the 1-clickstream patterns satisfying SC L = 2, and is shown in FIGURE 1; the B-List L 1 for each large 1-clickstream pattern is shown in TABLE 3. The large k-clickstream patterns are generated by joining large (k-1)-clickstream patterns. For cid = 107, seq = {c, b, c, e}. Starting from action c, node (18,21) c:1 in T is updated to (18,21) c:2. Because node (18,21) c:2 does not have any child nodes, the remaining actions b, c, e are appended under node c. A similar process is applied to cid = 108 for action e. The updated tree T , including the pre-code and post-code values, is shown in FIGURE 2.
Clickstream d is eliminated because its support count does not satisfy the upper threshold; the resulting list of large 1-clickstream patterns is shown in TABLE 5.
The clickstream patterns are extended recursively from the (k-1)-clickstream patterns, and the full set of large k-clickstream patterns that are mined is shown in TABLE 6.

2) EFFICIENT METHOD FOR UPDATED FREQUENT CLICKSTREAM PATTERNS IN INCREMENTAL DATABASES
a: EFF-INCMUB
The process of finding all pre-large 1-clickstream patterns satisfying S L in an inserted database and inserting them into the existing tree is time consuming, especially when the SPPC-tree built from the original database is large but the number of inserted clickstreams is small. Therefore, a new approach named Eff-inCMUB is proposed. Assume that we have the pre-large and large clickstream patterns (PCP) mined from the original database. When the inserted database D arrives, Eff-inCMUB scans D to build the SPPC-tree T ' and the B-List L 2 of the 1-clickstream patterns that satisfy S L . It then traverses the clickstreams in PCP and L 2 to update the support count of each clickstream in PCP. The frequent clickstream patterns (FCP) are then obtained by traversing the updated PCP to find the patterns that satisfy S U .
The description of the Eff-inCMUB algorithm is shown in Algorithm 3.
First, the original database CDB is empty, so the first inserted database D becomes the original database, and the algorithm initializes the safety threshold f (lines 1-2). When the number of clickstreams in the inserted database is greater than the safety threshold, the algorithm builds the SPPC-tree T and the B-List L 1 for the integrated database (lines 4-5), and all pre-large/large clickstreams are generated in this phase (lines 6-7). Otherwise, the Eff-inCMUB algorithm just scans the inserted database once and eliminates the small 1-clickstream patterns (line 9). It then builds the new tree T and creates the B-List L 2 of the pre-large/large 1-clickstream patterns of the inserted database D (lines 10-11). Next, the Update-List procedure is called to update the support count of each clickstream pattern in the B-List L 1 using L 2 , and it generates the updated PCP list (line 12). The full set of large clickstream patterns is then extracted from the PCP as the patterns that satisfy the upper threshold value (line 13).

Algorithm 3 Eff-inCMUB
Input: Original clickstream database CDB with pre-large/large clickstream patterns PCP, inserted database D, S L and S U
Output: Updated pre-large/large clickstreams PCP and a set of all large clickstream patterns in the incremental clickstream database FCP
[...]
4.     pre-CMUB(D * , S L )
5.     Generate the B-List L 1 for each 1-clickstream pattern in tree T .
[...]
10.    pre-CMUB(D, 1/|D|) to build the new tree T
11.    Generate the B-List L 2 for each 1-clickstream pattern from T
12.    Update-List(PCP, L 2 ) to update the support count for each clickstream pattern in PCP
[...]
18.    a.count ← a.count + b.count
19.    C ← set of child nodes in a
20.    for each c in C
21.        Assuming [...]
[...]
25.    for each item c in C
26.        TraverseAndUpdate(c, L 2 )
The Update-List procedure traverses each node in the B-List L 1 and calls the TraverseAndUpdate procedure to update its node information using L 2 (lines 14-15). If a node exists in both L 1 and L 2 , its node information is updated by summing the counts of the two nodes (lines 17-19). Next, the procedure creates a new node and its B-List by joining each of the child nodes in L 1 . If the count of the parent and child nodes satisfies the S L threshold, L 2 is updated with the new node (lines 20-24). Finally, the procedure calls itself recursively to conduct the mining process for each child node (lines 25-26). The Eff-inCMUB algorithm is a new approach that can greatly outperform inCMUB with regard to runtime because it does not need to traverse the tree that was built from the original database.
The set of clickstream patterns that meet the S L threshold and are mined from L 1 is shown in TABLE 8. The algorithm scans D once (|D| ≤ f ) to find the 1-clickstream patterns CP 1 = {a, b, c, d, e} whose support count is not less than the lower support. The new SPPC-tree T is shown in FIGURE 3, and the B-List L 2 created from T is shown in TABLE 9. The update procedure is then executed for L 1 and L 2 . Node a and its child nodes are omitted because node a does not exist in L 2 . Next, node b exists in both L 1 and L 2 , and its support count is updated to 7 in L 1 . Node b has two child nodes, {ba} and {bc}. The child node {ba} is omitted because node a does not appear in L 2 . The B-List of node {bc} is created. Because {bc}.count is 4 in L 1 and 1 in L 2 , their sum satisfies the S L threshold (4 + 1 > S L × |D * |), so the node is put into L 2 and its support count is updated to 5 in L 1 . A similar process is applied to the remaining nodes c, d, e, and the B-List L 2 after being updated is shown in TABLE 10. The unified PCP list of L 1 and L 2 is shown in TABLE 11.
The full set of large clickstream patterns extracted from the PCP is the same as that shown in TABLE 6.
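The core idea of updating stored pre-large/large patterns, without touching the original database, can be sketched in Python as follows. This is a deliberately simplified illustration and not Eff-inCMUB itself: it counts occurrences in the inserted database by direct scanning instead of joining B-Lists, it does not add patterns that first reach S L after the insertion, and all identifiers and data are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Pattern = Tuple[str, ...]


def contains(clickstream: Sequence[str], pattern: Pattern) -> bool:
    """True if pattern is an order-preserving subsequence of clickstream."""
    it = iter(clickstream)
    return all(action in it for action in pattern)


def update_pcp(
    pcp: Dict[Pattern, int],
    inserted: List[Sequence[str]],
    n_original: int,
    s_u: float,
) -> Tuple[Dict[Pattern, int], Dict[Pattern, int]]:
    """Add the support found in the inserted database to each stored pattern,
    then select the patterns that satisfy the upper threshold on D*."""
    updated = {
        p: sc + sum(contains(cls, p) for cls in inserted)
        for p, sc in pcp.items()
    }
    min_count = s_u * (n_original + len(inserted))
    fcp = {p: sc for p, sc in updated.items() if sc >= min_count}
    return updated, fcp


# Hypothetical stored patterns from an original database of 6 clickstreams.
pcp = {("b",): 4, ("b", "c"): 3, ("a",): 2}
inserted = [list("cbce"), list("eb")]
updated, fcp = update_pcp(pcp, inserted, n_original=6, s_u=0.5)
```

Only the inserted clickstreams are read here, which mirrors why the method avoids traversing the tree built from the original database when the safety threshold is not exceeded.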

V. EXPERIMENTAL EVALUATION
This section evaluates the performance of the proposed methods. The experiments are executed on a computer equipped with an Intel Core I5-5300U 2.3 GHz, 16GB main memory and the Windows 10 operating system, and are carried out on real-life datasets, with the details shown in TABLE 12. The results show that the two proposed methods outperform the SMUB algorithm in both memory usage and running time, especially on huge clickstream datasets with a lower threshold. To simulate the process for incremental databases, each experimental database is split into two parts: the first part is the original database, and the second is the incremental part. Their shares are 99.9% and 0.1%, respectively, for all databases, with the exception of the FIFA database, for which they are 98% and 2%. The second part is further split into ten equal sub-parts, and each sub-part is then inserted into the original database in turn during the mining process.
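The splitting protocol can be sketched as follows. The function name and parameters are hypothetical, and taking the inserted part from the end of the file is an assumption, since the text does not state how the clickstreams are selected.

```python
from typing import List, Sequence, Tuple


def split_for_incremental(
    db: Sequence, inserted_share: float, parts: int = 10
) -> Tuple[list, List[list]]:
    """Split db into an original part and `parts` near-equal insertion batches."""
    n_inserted = round(len(db) * inserted_share)
    cut = len(db) - n_inserted
    original, tail = list(db[:cut]), list(db[cut:])

    base, extra = divmod(len(tail), parts)
    batches, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)  # spread any remainder
        batches.append(tail[start:start + size])
        start += size
    return original, batches


# Example with the 98% / 2% split reported for the FIFA database,
# applied to 1000 placeholder records.
original, batches = split_for_incremental(list(range(1000)), 0.02)
```

Each batch is then passed to the incremental algorithm in turn, so ten insertions are measured per database.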

A. THE RUNTIME
The results in this section confirm that the runtime of the two proposed methods is always much better than that of the SMUB algorithm on all experimental databases. In particular, both of them can be executed on a large clickstream database with a very low threshold. A weakness of the proposed methods is that they are costly when a database is rescanned. However, most of the inserted databases are so small that this does not affect the overall cost. Moreover, the smaller the lower threshold, the larger the differences between the runtimes of Eff-inCMUB, inCMUB and SMUB. FIGURE 4 to FIGURE 7 compare the execution times of the SMUB, inCMUB, and Eff-inCMUB algorithms.
In FIGURE 4, we compare the runtimes of SMUB, inCMUB, and Eff-inCMUB on the FIFA database with various settings. When the lower and upper thresholds are S L = 0.08 and S U = 0.1, inCMUB is faster than SMUB while Eff-inCMUB is the fastest, and the runtimes of these methods are stable as new clickstream databases are inserted. Eff-inCMUB is approximately 20 and 30 times faster than inCMUB and SMUB, respectively. Over the ten insertions, both inCMUB and Eff-inCMUB rescan the database only once. FIGURE 5 shows the same experiment for the BMS database. The runtimes of inCMUB and Eff-inCMUB are much better than that of SMUB, and Eff-inCMUB is the best, although it rescanned the database twice. The runtimes of SMUB, inCMUB and Eff-inCMUB were also compared on the Kosarak database with various settings. When the setting is S L = 0.003 and S U = 0.004, the runtime of Eff-inCMUB does not increase, except when the database is rescanned, while the runtime of inCMUB is also better than that of SMUB.

B. THE MEMORY USAGE
This sub-section reports the memory usage of the algorithms on the databases with various settings. FIGURE 8 shows the memory usages of Eff-inCMUB, inCMUB and SMUB for the FIFA database. With the setting S L = 0.08 and S U = 0.1, the memory usages of SMUB and Eff-inCMUB are the same, and they are both better than that of inCMUB. FIGURE 9 reports the memory usages of Eff-inCMUB, inCMUB and SMUB with various settings for the BMS2 database. The memory usage of Eff-inCMUB is better than that of inCMUB in the phases without rescanning the database, but it is not better when the database is rescanned, while the memory usages of inCMUB and SMUB are the same. Overall, the total memory usage of Eff-inCMUB is better than those of inCMUB and SMUB. FIGURE 10 reports the memory usages of Eff-inCMUB, inCMUB and SMUB for the Kosarak database. With the setting S L = 0.003 and S U = 0.004, the memory usage of Eff-inCMUB is the best, while those of inCMUB and SMUB are nearly the same. The memory usages of SMUB, Eff-inCMUB and inCMUB were also measured on the MSNBC database. With the setting S L = 0.001 and S U = 0.002, the memory usage of SMUB is better than those of Eff-inCMUB and inCMUB in some of the early stages. However, when the number of inserted databases is large, the memory usage of SMUB increases significantly, while the memory usages of Eff-inCMUB and inCMUB are stable. The MSNBC database has a large average clickstream length (13.23), which leads to a large difference in the mining results between the two thresholds S L (S L = 0.001, 16641 clickstreams) and S U (S U = 0.002, 4251 clickstreams). Because SMUB runs only on the S U threshold, its memory usage in the first few iterations is better than those of inCMUB and Eff-inCMUB, both of which run on the thresholds S L and S U .

C. SCALABILITY ON LARGE DATABASE
To verify the performance and scalability on large data, we execute the proposed methods on the SUSY database that contains five million clickstreams.

1) THE RUNTIMES
The runtimes of both inCMUB and Eff-inCMUB are better than that of SMUB, and Eff-inCMUB is always the best method for mining clickstream patterns on the SUSY database (see FIGURE 12(B)). The runtimes of Eff-inCMUB and inCMUB are small, and both are better than that of SMUB. The difference in runtime between SMUB and the two proposed methods is significant. In particular, with a smaller lower threshold the runtime of the proposed methods is always better.

2) THE MEMORY USAGE
Similarly, the memory usages of Eff-inCMUB, inCMUB and SMUB for the SUSY database are shown in FIGURE 13. On this database, the memory usages of Eff-inCMUB and inCMUB are better than that of SMUB in two settings. With setting S L = 0.109 and S U = 0.11, the memory usages of inCMUB and SMUB are nearly the same, while that of Eff-inCMUB is much smaller, at just half the memory usage of inCMUB. Similar results are found for setting S L = 0.109 and S U = 0.11, as the memory usages of Eff-inCMUB and inCMUB are better than that of SMUB.

VI. CONCLUSION AND FUTURE WORK
This paper proposed two methods named inCMUB and Eff-inCMUB to mine the clickstream patterns on incremental datasets. Both inCMUB and Eff-inCMUB are very effective, especially in cases when the number of new clickstreams added is smaller than the safety threshold, as they do not need to rescan the original database. In addition, an outstanding feature of the proposed methods is that they are very efficient for large databases.
inCMUB mines large 1-clickstream patterns from the original clickstream database, then inserts the large 1-clickstream patterns from a new clickstream database into the existing tree for mining all large clickstream patterns. Eff-inCMUB first finds all clickstream patterns satisfying the lower threshold (known as pre-large clickstream patterns). Next, it updates the set of pre-large clickstream patterns by using the pre-large 1-clickstream patterns from the inserted database, then extracts the full set of frequent clickstream patterns from the updated set of pre-large clickstream patterns. This method is more cost-effective than inCMUB as it does not traverse the tree.
In the future, we will improve Eff-inCMUB to further reduce the running time spent on rescanning the database. We will also study strategies for mining clickstream patterns with constraints, or weighted clickstream patterns, on incremental databases. In addition, we will study how to use distributed architectures to improve the efficiency of the mining process.