Efficient Chain Structure for High-Utility Sequential Pattern Mining

High-utility sequential pattern mining (HUSPM) is an emerging topic in data mining that considers both utility and sequence factors to derive the set of high-utility sequential patterns (HUSPs) from quantitative databases. Several works have been presented to reduce the computational cost through variants of pruning strategies. In this paper, we present an efficient sequence-utility (SU)-Chain structure, which stores more relevant information to improve mining performance. Based on the SU-Chain structure, the existing pruning strategies can also be used to prune unpromising candidates early and obtain the HUSPs that satisfy the threshold. The designed approach is compared experimentally with state-of-the-art HUSPM algorithms, and the results show that the SU-Chain-based model improves on the existing HUSPM algorithms in terms of runtime and number of examined candidates.


I. INTRODUCTION
Pattern mining aims to find valuable relationships between items or objects in databases, and many kinds of knowledge have been investigated in different applications and domains, such as association-rule mining (ARM) [1], [10], sequential-pattern mining (SPM) [2], [9], [22], [23], and high-utility-itemset mining (HUIM) [6], [12], [13], [18], among others. SPM, which discovers frequent sequences from a sequence database, is an important research area in data mining and knowledge discovery because it reveals correlations among ordered events and can be applied in many real-life situations. For example, sequence data can be extracted from Web logs, DNA sequences, or trajectory datasets. Several algorithms [9], [22], [23] have been proposed to improve the mining efficiency of SPM, but most of them do not consider other factors or attributes in the databases (e.g., importance, weight, or interestingness). To address this limitation and provide more useful and meaningful information, high-utility sequential pattern mining (HUSPM) [17], [29], [32], [33] was presented; it considers both utility and sequence factors to reveal the set of high-utility sequential patterns (HUSPs) from the databases. It takes the quantities and unit profits of the items into account to mine the set of HUSPs as the required knowledge for decision making. In HUSPM, a sequence is considered a HUSP if its sequence utility is no less than the pre-defined minimum utility value. However, the task of HUSPM is more complex than that of traditional SPM since the sequence utility does not hold the downward closure property; thus, the search space to discover the required HUSPs becomes huge.
The associate editor coordinating the review of this manuscript and approving it for publication was Changsheng Li.
Many algorithms have been presented that design pruning strategies and new upper-bound values to reduce the search space, such as USpan [32], HUS-Span [29], ProUM [7], and HUSP-ULL [8]. USpan uses the lexicographic quantitative sequence tree for mining the HUSPs, but its upper-bound value is overestimated, so the search space to find the required information remains very large. HUS-Span uses two tighter upper bounds, named prefix extension utility (PEU) and reduced sequence utility (RSU), to establish the upper-bound values of the promising candidates. With the designed pruning strategies and the new upper bounds, HUS-Span can greatly reduce the number of unpromising candidates examined to mine the HUSPs. However, HUS-Span still has to generate many unpromising candidates to completely mine the HUSPs, and the database must be projected each time by the level-wise approach. Moreover, the PEU used in HUS-Span is not updated at each iteration, so the upper bound is still overestimated and more unpromising candidates must still be examined. ProUM [7] and HUSP-ULL [8] are the state-of-the-art approaches; they introduce a projection mechanism and efficient pruning strategies to mine the HUSPs.
The above algorithms, including the state-of-the-art HUSP-ULL, are still limited by their memory usage. We thus design an efficient sequence-utility (SU)-Chain structure that keeps more information for the later mining process. The projection approach is also used in the SU-Chain-based algorithm to speed up the generation of the promising candidates. Moreover, several pruning strategies are applied to the SU-Chain structure to identify irrelevant items early, so that those items can be removed from the projected database and the search space can be reduced. The experiments show that the developed SU-Chain structure performs better than the previous works in terms of runtime and number of examined candidates.
The rest of this paper is organized as follows. The literature review is given in Section II. Preliminaries and the problem statement are presented in Section III. The designed sequence-utility-chain-based model is developed in Section IV. Experiments are conducted and discussed in Section V. Finally, the conclusion and future work are given in Section VI.

II. LITERATURE REVIEW
High-utility itemset mining (HUIM) [6], [12], [13], [18], [28], [31] considers the utility factor of the itemsets to reveal highly profitable itemsets from the databases. Unlike association rule mining (ARM) [1], [10], conventional HUIM does not hold the downward closure property; thus, the search space of the promising candidates is huge. To efficiently mine the high-utility itemsets (HUIs) and reduce the size of the search space, Liu et al. [13] proposed the transaction-weighted utility (TWU) concept, which is used to estimate an upper bound on the itemset utility value. Tseng et al. extended the FP-tree and proposed the UP-growth [25] and UP-growth+ [26] algorithms, which exploit the nature of the tree to compress the search space. Lin et al. [14] also presented the HUP-tree, which is based on the TWU concept and the FP-tree structure [10]. The HUP-tree uses the tree structure to keep the necessary information of the promising candidates, and the HUP-growth mining algorithm was presented to discover the required HUIs based on the HUP-tree structure. Liu and Qu [15] proposed the HUI-Miner algorithm, which converts the original database into a utility-list structure and mines the HUIs efficiently from the utility-lists, avoiding the generation of a huge number of candidates. Zida et al. [35] designed a novel algorithm called EFIM, which uses two upper bounds on utility to reduce the size of the search space. Several algorithms [11], [20], [30] based on evolutionary computation have been discussed to find the HUIs within a limited time. Other research directions, including the improvement of high-utility itemset mining [19], high-utility itemset mining for uncertain IoT data [18], and mining top-k high-utility itemsets [27], are also interesting issues under active development.
High-utility sequential-pattern mining (HUSPM) has emerged as a research field in recent decades since it considers both utility and sequence factors to discover the high-utility sequential patterns (HUSPs) from a sequence dataset. HUSPM can also be applied to sequence mining of Website logs [34]. Shie et al. [24] proposed the UMSP algorithm and the UM-span algorithm for mining high-utility mobile sequences in mobile-business applications. To exploit the usefulness of web page access sequence data, Ahmed et al. [3] proposed two tree structures, called UWAS-tree and IUWAS-tree, to process static and dynamic databases, respectively. Subsequently, Ahmed et al. [4] proposed high-utility sequential pattern mining algorithms for processing general sequences, namely the layer-by-layer search algorithm UL and the pattern-extension algorithm US. However, these works gave no formal definition of high-utility sequential pattern mining. Yin et al. [32] formally defined high-utility sequential pattern mining and proposed an efficient algorithm, USpan, for mining general sequence patterns with utility values. To simplify the parameter setting, Yin et al. [33] then proposed the TUS algorithm for discovering the top-k high-utility sequential patterns. Lan et al. [16], [17] first introduced the concept of fuzziness into sequence mining and then proposed a high-utility sequential pattern mining algorithm to simplify the mining results and reduce the search space. Alkan and Karagoz [5] proposed the high-utility sequential pattern extraction (HuspExt) algorithm, which calculates the Cumulated Rest of Match (CRoM) to obtain a smaller upper bound, so that the complexity of the search space can be reduced. Wang et al. [29] subsequently proposed the HUS-Span algorithm, which reduces the unpromising candidates by introducing two utility upper bounds called PEU and RSU. However, it is still challenging to find the HUSPs in a very large dataset. In addition, Gan et al. presented a projection method called ProUM [7] and the HUSP-ULL algorithm [8] to efficiently mine the HUSPs; these are the state-of-the-art approaches based on the utility-list structure.

III. PRELIMINARIES AND PROBLEM STATEMENT
Let $I = \{i_1, i_2, \ldots, i_m\}$ be a finite set of $m$ distinct items. A quantitative item, abbreviated as q-item, is denoted as $(i_k, q_k)$ and represents the item $i_k$ together with its purchase quantity $q_k$. An itemset, denoted as $w$, is a set of several items, and a quantitative itemset, abbreviated as q-itemset and denoted as $v$, is a set of several q-items. Without loss of generality, the items (q-items) within an itemset (q-itemset) are assumed to be sorted in alphabetical order. A sequence, denoted as $t = \langle w_1, w_2, \ldots, w_m \rangle$, is an ordered list of one or more itemsets. A quantitative sequence, abbreviated as q-sequence and denoted as $s = \langle v_1, v_2, \ldots, v_n \rangle$, is an ordered list of one or more q-itemsets. A quantitative sequence database, abbreviated as q-sequence database, $D = \{s_1, s_2, \ldots, s_n\}$ is a set of q-sequences in which each q-sequence is associated with a unique identifier called sid. Table 1 shows a quantitative sequential database. It has four q-sequences and six items, denoted from a to f. Table 2 shows the unit profits of the items that appear in Table 1.
Here, several definitions regarding the HUSPM are given below.
Definition 1: The utility of an item $i_r$ in a q-itemset $v$ is denoted as $u(i_r, v)$ and defined as:
$$u(i_r, v) = q(i_r, v) \times pr(i_r),$$
where $q(i_r, v)$ is the quantity of $i_r$ in the q-itemset $v$ and $pr(i_r)$ is the unit profit of the item $i_r$.
Example 1: The utility of the item $a$ in the first q-itemset of $s_1$ in Table 1 is its quantity in that q-itemset multiplied by its unit profit $pr(a)$ from Table 2.
Definition 2: The utility of a q-itemset $v$ in a q-sequence $s$ is denoted as $u(v, s)$ and defined as:
$$u(v, s) = \sum_{i_r \in v} u(i_r, v).$$
Example 2: The utility of the q-itemset $[(c{:}6)(a{:}3)]$ in q-sequence $s_1$ is calculated as $6 \times pr(c) + 3 \times pr(a)$, with the unit profits taken from Table 2.
Definition 3: The utility of a q-sequence $s$ in a quantitative sequential database $D$ is denoted as $u(s)$ and defined as:
$$u(s) = \sum_{v \in s} u(v, s).$$
Example 3: The utility of the q-sequence $s_1$ in Table 1 is the sum of the utilities of its q-itemsets.
Definition 4: The utility of a quantitative sequential database $D$ is denoted as $u(D)$; it is the sum of the utilities of its q-sequences and defined as:
$$u(D) = \sum_{s \in D} u(s).$$
Example 4: The utility of the quantitative sequential database $D$ in Table 1 is calculated as $u(D) = u(s_1) + u(s_2) + u(s_3) + u(s_4)$.
Example 6: The q-itemset $[(c{:}6)(a{:}3)]$ in q-sequence $s_1$ in Table 1
Definition 7: Given two sequences $t_a = \langle w_{a_1}, w_{a_2}, \ldots, w_{a_m} \rangle$ and
Example 7: A sequence Table 1.
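Definitions 1 to 4 can be illustrated with a minimal Python sketch. The profit table `PROFIT` and the two-sequence database `DB` below are hypothetical toy values chosen for illustration; they are not the contents of Tables 1 and 2, and the function names are assumptions of this sketch.

```python
from typing import Dict, List, Tuple

QItemset = List[Tuple[str, int]]  # [(item, quantity), ...]
QSequence = List[QItemset]        # ordered list of q-itemsets

# Hypothetical toy data for illustration; NOT the values of Tables 1 and 2.
PROFIT: Dict[str, int] = {"a": 2, "b": 5, "c": 1}
DB: List[QSequence] = [
    [[("a", 2), ("c", 1)], [("b", 3)]],
    [[("b", 1)], [("a", 4), ("b", 2)]],
]


def item_utility(item: str, quantity: int, profit: Dict[str, int]) -> int:
    """Definition 1: quantity of the item times its unit profit."""
    return quantity * profit[item]


def qitemset_utility(v: QItemset, profit: Dict[str, int]) -> int:
    """Definition 2: sum of the utilities of the q-items in v."""
    return sum(item_utility(i, q, profit) for i, q in v)


def qsequence_utility(s: QSequence, profit: Dict[str, int]) -> int:
    """Definition 3: sum of the utilities of the q-itemsets in s."""
    return sum(qitemset_utility(v, profit) for v in s)


def database_utility(db: List[QSequence], profit: Dict[str, int]) -> int:
    """Definition 4: sum of the utilities of the q-sequences in the database."""
    return sum(qsequence_utility(s, profit) for s in db)
```

With this toy data, the first q-sequence has utility 2×2 + 1×1 + 3×5 = 20 and the whole database has utility 43.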
Definition 9: Given a q-sequence $s = \langle v_1, v_2, \ldots, v_n \rangle$ and a sequence $t = \langle w_1, w_2, \ldots, w_m \rangle$, if $n = m$ and the items in $v_i$ are the same as the items in $w_i$ for $1 \le i \le n$, then $s$ is said to match $t$, which is denoted as $t \sim s$.
Example 9: A sequence $\langle [c][e,b] \rangle$ matches the $s_1$ in Table 1. Notice that two q-itemsets may be considered different although they contain the same itemset, because of their quantities and their positions in a q-sequence. Therefore, it is possible that more than one q-subsequence of a q-sequence matches the given sequence.
Definition 10: A q-itemset containing $k$ items is called a $k$-q-itemset. A q-sequence containing $k$ items is called a $k$-q-sequence.
Example 10: The q-sequence $s_1$ is a 7-q-sequence. The first q-itemset of the q-sequence is a 2-q-itemset.
Definition 11: The utility of a sequence $t$ in a q-sequence $s$ is denoted as $u(t, s)$ and defined as:
$$u(t, s) = \max\{u(s_k) \mid t \sim s_k \wedge s_k \subseteq s\},$$
where $\sim$ denotes the matched relationship and $t \sim s_k$ represents that $s_k$ is a match of $t$.
Example 11: The utility of a sequence $\langle [a], [b] \rangle$ in the q-sequence $s_1$ of Table 1 is the maximum utility among the q-subsequences of $s_1$ that match $\langle [a], [b] \rangle$.
Definition 12: The utility of a sequence $t$ in a quantitative sequence database $D$ is denoted as $u(t)$ and defined as:
$$u(t) = \sum_{s \in D} \{u(t, s) \mid t \subseteq s\}.$$
Example 12: The utility of a sequence $\langle [a], [b] \rangle$ in Table 1 is the sum of its utilities in the q-sequences of Table 1 that contain it.
Definition 13: A sequence $t$ in a quantitative sequential database $D$ is a high-utility sequential pattern (HUSP) if it satisfies the condition:
$$u(t) \ge \delta \times u(D),$$
where $\delta$ is the minimum utility threshold and $u(D)$ is the total utility of the q-sequence database $D$.
Problem Statement: Given a quantitative sequence database and a user-defined minimum utility threshold, the task of high-utility sequential pattern mining (HUSPM) is to find the complete set of high-utility sequential patterns (HUSPs), that is, the sequences whose utility value in the quantitative database is no less than $\delta \times u(D)$.
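As a hedged illustration of Definitions 11 to 13, the sketch below computes the utility of a sequence by brute-force enumeration of all matches and then tests the HUSP condition. It assumes the usual order-preserving notion of containment (each itemset of the sequence is included in a distinct, later q-itemset of the q-sequence) and positive unit profits; `PROFIT`, `DB`, and all function names are hypothetical and do not come from Tables 1 and 2. This is not the mining algorithm of this paper, only a direct reading of the definitions.

```python
from typing import Dict, List, Tuple

QSequence = List[List[Tuple[str, int]]]  # q-itemsets of (item, quantity) pairs
Sequence = List[List[str]]               # itemsets of items

# Hypothetical toy data for illustration; NOT the values of Tables 1 and 2.
PROFIT: Dict[str, int] = {"a": 2, "b": 5, "c": 1}
DB: List[QSequence] = [
    [[("a", 2), ("c", 1)], [("b", 3)]],
    [[("b", 1)], [("a", 4), ("b", 2)]],
]


def utility_in_qsequence(t: Sequence, s: QSequence, profit: Dict[str, int]) -> int:
    """Definition 11: maximum utility over all q-subsequences of s that match t.

    Returns 0 when s contains no match of t (assumes positive profits).
    """
    best = 0

    def search(k: int, start: int, acc: int) -> None:
        nonlocal best
        if k == len(t):              # every itemset of t has been matched
            best = max(best, acc)
            return
        for j in range(start, len(s)):
            quantities = dict(s[j])
            if all(item in quantities for item in t[k]):
                gain = sum(quantities[item] * profit[item] for item in t[k])
                search(k + 1, j + 1, acc + gain)

    search(0, 0, 0)
    return best


def utility(t: Sequence, db: List[QSequence], profit: Dict[str, int]) -> int:
    """Definition 12: sum of u(t, s) over the q-sequences of the database."""
    return sum(utility_in_qsequence(t, s, profit) for s in db)


def is_husp(t: Sequence, db: List[QSequence], profit: Dict[str, int], delta: float) -> bool:
    """Definition 13: u(t) >= delta * u(D)."""
    total = sum(q * profit[i] for s in db for v in s for i, q in v)
    return utility(t, db, profit) >= delta * total
```

In the second toy q-sequence the item b occurs twice with utilities 5 and 10, so the utility of the one-item sequence containing b there is the maximum, 10. The enumeration is exponential in the worst case, which is why upper bounds and pruning are needed in practice.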

IV. DEVELOPED SEQUENCE-UTILITY (SU)-CHAIN-BASED MODEL
In this paper, we present a novel sequence-utility (SU)-Chain structure that keeps more information for the subsequent mining process. A lexicographic enumeration (LE)-tree is used to represent the search space of the promising candidates, as shown in Figure 1.
In Figure 1, the I-Concatenation and S-Concatenation operations are used in the pattern-growth mechanism [29], [32] to generate the possible and promising HUSPs. Based on I-Concatenation and S-Concatenation in the enumeration tree, all the possible and promising candidates can be produced and explored. To ensure the completeness of the mining results, items should be concatenated in a fixed order [7], [8]. Note that the definition of sequence order also applies to q-sequences. Following this order, all candidate sequences can be produced without missing any of them.
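The two concatenation operations can be sketched as follows. This is a minimal illustration under the assumption that a sequence is a list of alphabetically sorted itemsets and that I-Concatenation may only append an item that comes after the last item of the final itemset; the function names are hypothetical.

```python
from typing import List

Sequence = List[List[str]]  # ordered list of itemsets, each sorted alphabetically


def i_concatenation(t: Sequence, item: str) -> Sequence:
    """Append `item` to the last itemset of t (itemset extension)."""
    last = t[-1]
    if item <= last[-1]:
        raise ValueError("item must come after the last item of the final itemset")
    return t[:-1] + [last + [item]]


def s_concatenation(t: Sequence, item: str) -> Sequence:
    """Append `item` to t as a new one-item itemset (sequence extension)."""
    return t + [[item]]


def children(t: Sequence, items: List[str]) -> List[Sequence]:
    """Child nodes of t in the enumeration tree, I-extensions before S-extensions."""
    i_ext = [i_concatenation(t, i) for i in items if i > t[-1][-1]]
    s_ext = [s_concatenation(t, i) for i in items]
    return i_ext + s_ext
```

For the sequence consisting of the single itemset [a] and the items a and b, `children` yields the I-extension [a, b] followed by the S-extensions [a][a] and [a][b]; neither function mutates its input.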
HUS-Span [29] and ProUM [7] need to generate the projection database of a sequence $t$ from the original database. The designed sequence-utility (SU)-Chain can instead be used to produce the projection database for the sequence. While exploring the child nodes in the LE-tree, this projection database can be passed to the child nodes after being updated, which reduces time consumption. Table 3 shows the SU-Chain of a sequence $\langle a \rangle$ from Table 1. The SU-Chain is a set of projection sequences and utility-lists. Each element of a utility-list contains four fields: the concatenation position $p_i$; the maximum utility at concatenation position $p_i$; the utility of the remaining sequence $s/_{(t,p_i)}$; and a pointer to either the $(i+1)$-th concatenation position or null.
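The four fields of a utility-list element can be sketched as a small linked structure. The sketch below builds the chain for a single item in one q-sequence; it assumes that a concatenation position is identified by the index of the q-itemset and that the remaining utility is the total utility of all q-items after the matched one. The class name, field names, and toy data are hypothetical and are not the contents of Table 3.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

QSequence = List[List[Tuple[str, int]]]


@dataclass
class UtilityElement:
    position: int                            # concatenation position p_i (q-itemset index here)
    utility: int                             # maximum utility at position p_i
    remaining: int                           # utility of the remaining sequence after p_i
    next: Optional["UtilityElement"] = None  # (i+1)-th concatenation position, or None


def build_chain_for_item(item: str, s: QSequence, profit: Dict[str, int]) -> Optional[UtilityElement]:
    """Build the linked utility-list of a one-item sequence in q-sequence s."""
    # Flatten s into (itemset index, item, utility) triples in sequence order.
    flat = [(j, i, q * profit[i]) for j, v in enumerate(s) for i, q in v]
    head: Optional[UtilityElement] = None
    tail: Optional[UtilityElement] = None
    for idx, (j, i, u) in enumerate(flat):
        if i != item:
            continue
        rest = sum(x[2] for x in flat[idx + 1:])  # utility of everything after this q-item
        node = UtilityElement(position=j, utility=u, remaining=rest)
        if head is None:
            head = tail = node
        else:
            tail.next = node
            tail = node
    return head
```

For example, with hypothetical profits a = 2, b = 5, c = 1 and the q-sequence [(a:2)(c:1)], [(a:1)(b:3)], the chain for item a has two elements: position 0 with utility 4 and remaining utility 18, linked to position 1 with utility 2 and remaining utility 15.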
Based on the SU-Chain, the projection sequences can be maintained for the later generation and examination of the promising candidates. It also becomes easy to find the I-Concatenation and S-Concatenation items of a sequence. Thus, the computational cost of mining the required HUSPs can be greatly reduced. The construction of the designed SU-Chain structure is presented in Algorithm 1. The main construction process is divided into three parts: (1) find the candidate concatenation items for I-Concatenation or S-Concatenation; (2) build the new utility-list; and (3) project the required sequences.
To efficiently reduce the size of the search space for mining HUSPs, several pruning strategies [21], [32] can be incorporated into the designed SU-Chain structure to improve mining performance. Several definitions, theorems, and pruning strategies are given below.
Definition 14: $SWU(t)$ denotes the sequence-weighted utilization of $t$ in the q-sequence database $D$ and is defined as:
$$SWU(t) = \sum_{s \in D} \{u(s) \mid t \subseteq s\}.$$
Theorem 1: Given a sequence $t$, for each sequence $t'$ that can be generated from $t$ using concatenation operations, we have:
$$u(t') \le SWU(t).$$
Proof: From the above definition, it is obvious that $u(t') \le SWU(t')$ holds. Since $t \subseteq t'$, $SWU(t') = \sum_{s \in D} \{u(s) \mid t' \subseteq s\} \le \sum_{s \in D} \{u(s) \mid t \subseteq s\} = SWU(t)$.
Pruning strategy 1: According to Theorem 1, for a given sequence $t$, if $SWU(t)$ is less than the minimum utility value, the utility of any sequence that can be generated from $t$ is also less than the minimum utility value. These sequences can be safely pruned from the LE-tree without affecting the completeness of the mining results.
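Definition 14 and Pruning strategy 1 can be sketched as follows. The sketch assumes the usual order-preserving containment test for $t \subseteq s$ (implemented greedily), and the toy `PROFIT` and `DB` values and function names are hypothetical rather than taken from Tables 1 and 2.

```python
from typing import Dict, List, Tuple

QSequence = List[List[Tuple[str, int]]]
Sequence = List[List[str]]

# Hypothetical toy data for illustration; NOT the values of Tables 1 and 2.
PROFIT: Dict[str, int] = {"a": 2, "b": 5, "c": 1}
DB: List[QSequence] = [
    [[("a", 2), ("c", 1)], [("b", 3)]],
    [[("b", 1)], [("a", 4), ("b", 2)]],
]


def contains(t: Sequence, s: QSequence) -> bool:
    """Return True if t is contained in s, matching each itemset as early as possible."""
    j = 0
    for w in t:
        while j < len(s) and not set(w) <= {i for i, _ in s[j]}:
            j += 1
        if j == len(s):
            return False
        j += 1  # the next itemset of t must match a strictly later q-itemset
    return True


def swu(t: Sequence, db: List[QSequence], profit: Dict[str, int]) -> int:
    """Definition 14: sum of u(s) over the q-sequences s that contain t."""
    return sum(
        sum(q * profit[i] for v in s for i, q in v)
        for s in db
        if contains(t, s)
    )


def prune_by_swu(t: Sequence, db: List[QSequence], profit: Dict[str, int], min_util: float) -> bool:
    """Pruning strategy 1: t and all its extensions can be skipped if SWU(t) < min_util."""
    return swu(t, db, profit) < min_util
```

On the toy database, the sequence [a][b] is contained only in the first q-sequence, so its SWU is 20; with a minimum utility value of 21 the whole subtree rooted at [a][b] would be pruned.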
Definition 15: $PEU(t, s)$ denotes the prefix extension utility of $t$ in a q-sequence $s$ and is defined as:
$$PEU(t, s) = \max\{u(t, p, s) + ru(s/_{(t,p)})\},$$
where the maximum is taken over the concatenation positions $p$ with $ru(s/_{(t,p)}) > 0$; otherwise, $PEU(t, s)$ is set to 0.
Definition 16: $PEU(t)$ denotes the prefix extension utility of $t$ in the q-sequence database and is defined as:
$$PEU(t) = \sum_{s \in D \wedge t \subseteq s} PEU(t, s).$$
Theorem 2: Given a sequence $t$, for each sequence $t'$ that can be generated from $t$ using a concatenation operation, $u(t') \le PEU(t)$.
Proof: From the above definition, $u(t') \le PEU(t')$ holds. $PEU(t', s) = \max\{u(t', p', s) + ru(s/_{(t',p')})\} = \max\{u(t, p, s) + u(i_j) + ru(s/_{(t',p')})\}$, where $i_j$ is the concatenation item at the concatenation position $p'$. Since $p' \ge p$, $u(i_j) + ru(s/_{(t',p')}) \le ru(s/_{(t,p)})$. Therefore, $PEU(t', s) \le \max\{u(t, p, s) + ru(s/_{(t,p)})\} = PEU(t, s)$. Therefore, $PEU(t') = \sum_{s \in D \wedge t' \subseteq s} PEU(t', s) \le \sum_{s \in D \wedge t' \subseteq s} PEU(t, s) \le \sum_{s \in D \wedge t \subseteq s} PEU(t, s)$. Then $u(t') \le PEU(t)$.
Pruning strategy 2: According to Theorem 2, for a given sequence $t$, if $PEU(t)$ is less than the minimum utility value, the utility of any sequence that can be generated from $t$ is also less than the minimum utility value. These sequences can be safely pruned from the LE-tree without affecting the completeness of the mining results.
Furthermore, the pruning strategies used in HUSP-ULL [8] can also be incorporated into the designed SU-Chain structure as follows.
Pruning strategy 3: Given two sequences $t$ and $t'$, where $t'$ is generated from $t$ and an item $i_j$ using a concatenation operation, $i_j$ is a concatenation candidate item of the sequence $t$. If $\sum_{s \in D \wedge t' \subseteq s} PEU(t, s)$ is less than the minimum utility value, then $i_j$ is called an unpromising item and is removed from the set of concatenation candidate items, so that the sequence $t'$ is not generated. Accordingly, if $i_j$ is an I-concatenation candidate item, it is removed from the set of I-concatenation items; if $i_j$ is an S-concatenation candidate item, it is removed from the set of S-concatenation items.
Pruning strategy 4: Given an item $i_j$ and a sequence $t$, let $t_1$ be generated from $t$ and $i_j$ using I-concatenation, and let $t_2$ be generated from $t$ and $i_j$ using S-concatenation. If both $\sum_{s \in D \wedge t_1 \subseteq s} PEU(t, s)$ and $\sum_{s \in D \wedge t_2 \subseteq s} PEU(t, s)$ are less than the minimum utility value, then $i_j$ is called an irrelevant item and can be removed from the projection database of the sequence $t$, since no super-sequence generated from $t$ and the item $i_j$ can be a high-utility sequential pattern.
Using Pruning strategy 4 to remove irrelevant items from the projection database of a sequence $t$ reduces the size of the projection databases of $t$ and of its super-sequences, because these projection databases no longer need to contain the irrelevant items. At the same time, removing the irrelevant items lowers the upper-bound value $PEU(t)$.
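The PEU upper bound of Definitions 15 and 16 and the test of Pruning strategy 2 can be sketched directly on top of utility-list elements. In this minimal illustration each q-sequence that contains $t$ is represented by a list of hypothetical (utility, remaining utility) pairs, one per concatenation position; the identifiers and the numbers are assumptions of the sketch, not data from the paper.

```python
from typing import Dict, List, Tuple

# One (utility at position p, remaining utility after p) pair per concatenation position.
Elements = List[Tuple[int, int]]


def peu_in_sequence(elements: Elements) -> int:
    """Definition 15: max of utility + remaining over positions with remaining > 0, else 0."""
    return max((u + r for u, r in elements if r > 0), default=0)


def peu(chain: Dict[int, Elements]) -> int:
    """Definition 16: sum of PEU(t, s) over the q-sequences (keyed by sid) containing t."""
    return sum(peu_in_sequence(elements) for elements in chain.values())


def prune_by_peu(chain: Dict[int, Elements], min_util: float) -> bool:
    """Pruning strategy 2: extensions of t can be skipped if PEU(t) < min_util."""
    return peu(chain) < min_util


# Hypothetical utility-lists of some sequence t in three q-sequences.
EXAMPLE_CHAIN: Dict[int, Elements] = {
    1: [(4, 18), (2, 15)],  # PEU = max(22, 17) = 22
    2: [(8, 10)],           # PEU = 18
    3: [(6, 0)],            # nothing remains after the match, so PEU = 0
}
```

Here the bound is 22 + 18 + 0 = 40, so every extension of the sequence can be discarded whenever the minimum utility value exceeds 40. Removing irrelevant items, as in Pruning strategy 4, shrinks the remaining utilities and therefore tightens this bound.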

V. EXPERIMENTAL EVALUATION
In this section, several experiments are conducted to evaluate the proposed SU-Chain against the USpan [32], HUS-Span [29], and state-of-the-art HUSP-ULL [8] approaches. Six real-life datasets were used in the experiments to evaluate the performance in terms of runtime and number of generated candidates. The characteristics of the six datasets are shown in Table 4. The parameters of the datasets are as follows: #|D| is the total number of sequences; #|I| is the number of distinct items; C is the average number of itemsets per sequence; and MaxLen is the maximum number of items per sequence.

A. RUNTIME
Experiments were conducted under various minimum utility thresholds δ, and the results are shown in Figure 2. From the results, it can be observed that the designed SU-Chain-based algorithm outperforms the USpan and HUS-Span algorithms in terms of runtime. The state-of-the-art HUSP-ULL algorithm performs slightly better than the SU-Chain-based algorithm in some cases. For example, in Figure 2(d), when the threshold is set to 1%, the SU-Chain-based model requires 13.1 seconds and HUSP-ULL needs 9.6 seconds. When the threshold is set to 1.715%, the SU-Chain-based model needs 0.79 seconds and HUSP-ULL requires 0.2 seconds. When the threshold is set to 1%, the SU-Chain-based model requires 7.5 seconds while HUSP-ULL needs 4.9 seconds. However, for the databases shown in Figures 2(a), 2(b), and 2(c), the designed SU-Chain-based model needs less runtime than the HUSP-ULL algorithm; in particular, HUSP-ULL suffers from the memory leakage problem in Figures 2(a) and 2(c). Overall, the designed SU-Chain-based algorithm performs better than most of the compared HUSPM algorithms, in particular because it keeps more information for efficiency improvement.

B. NUMBER OF GENERATED CANDIDATES
To evaluate the effectiveness of the compared algorithms, the number of generated candidates and the number of discovered HUSPs under different minimum utility thresholds were measured and are shown in Table 5. #HUSPs represents the number of HUSPs, and "-" denotes that the runtime of the algorithm exceeds 10,000 seconds or that it cannot be run within the limited main memory.
From the given results, it can be observed that the designed SU-Chain-based algorithm generates fewer candidates than the previous USpan and HUS-Span algorithms. As fewer candidates need to be explored, less runtime is needed. When the minimum utility threshold increases, the number of examined candidates decreases, and vice versa. This is reasonable since fewer patterns are generated under a higher minimum utility threshold. We can also observe that USpan and HUS-Span cannot produce results on the Yoochoose-buy dataset, and HUSP-ULL cannot produce results on either the Yoochoose-buy or the MSNBC dataset. Although HUSP-ULL performs very slightly better than the SU-Chain-based algorithm in runtime (a difference of about 1-2 seconds), the numbers of generated candidates are nearly the same. Moreover, the designed SU-Chain-based algorithm handles the Yoochoose-buy and MSNBC datasets better than the state-of-the-art HUSP-ULL approach.

VI. CONCLUSION AND FUTURE WORK
In this paper, we present a Sequence-Utility (SU)-Chain structure that keeps the projection database and its utility-list structure. Based on the designed SU-Chain-based model and the adopted pruning strategies, the SU-Chain-based algorithm obtains better results than the other compared algorithms; in particular, it reduces the memory leakage problem of the state-of-the-art HUSP-ULL approach. In the future, we will address the dynamic situation to efficiently update the discovered HUSPs under transaction insertion on the Hadoop or Spark platform. How to design a better structure for use in the MapReduce framework is also an interesting topic for further study.
YUANFA LI is currently pursuing the master's degree with the School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), China. His research interests include big data analytics and cloud computing.
PHILIPPE FOURNIER-VIGER received the Ph.D. degree in computer science from the University of Quebec, Montreal, in 2010. He is currently a Full Professor and a Youth 1000 Scholar with the Harbin Institute of Technology (Shenzhen), Shenzhen, China. He is the Founder of the popular SPMF open-source data mining library, which has been cited in more than 800 research articles. His research interests include pattern mining, sequence analysis and prediction, and social network mining. He is also the Editor-in-Chief (EiC) of the Data Science and Pattern Recognition (DSPR) journal.
YOUCEF DJENOURI received the Ph.D. degree in computer engineering from the University of Science and Technology Houari Boumediene (USTHB), Algiers, Algeria, in 2014. From 2014 to 2015, he was a Permanent Teacher-Researcher with the University of Blida, Algeria, where he is currently a member of LRDSI Lab. He was granted a Postdoctoral Fellowship from Unist University, South Korea. He worked on the BPM Project supported by Unist University, in 2016. In 2017, he was a Postdoctoral Researcher with Southern Denmark University, where he worked on urban traffic data analysis. He was granted a Postdoctoral Fellowship from the European Research Consortium on Informatics and Mathematics (ERCIM). He worked with the Norwegian University of Science and Technology (NTNU), Trondheim, Norway. He is currently a Research Scientist with SINTEF Digital, Oslo, Norway. He is working on topics related to artificial intelligence and data mining, with a focus on association rules mining, frequent itemsets mining, parallel computing, swarm and evolutionary algorithms, and pruning association rules. He has published over 24 refereed conference papers, 20 international journal articles, two book chapters, and one tutorial article in the areas of data mining, parallel computing, and artificial intelligence.