An Efficient Method for Mining Closed Potential High-Utility Itemsets

High-utility itemset mining (HUIM) has become a key phase of the pattern mining process, with wide applications that involve both the quantities and the profits of items. Many algorithms have been proposed to mine high-utility itemsets (HUIs). Since these algorithms often return a large number of discovered patterns, a more compact and lossless representation, closed high-utility itemsets (CHUIs), has been proposed. The recently proposed closed high-utility itemset mining (CHUIM) algorithms were designed to work with certain (precise) databases, i.e., those without probabilities. However, real-world databases might contain items or itemsets associated with probability values. Several techniques have been developed to effectively mine frequent patterns from uncertain databases, but no method exists for mining CHUIs from this type of database. This work presents a novel and efficient method, named CPHUI-List, to mine closed potential high-utility itemsets (CPHUIs) from uncertain databases without generating candidates. The proposed algorithm is DFS-based and utilizes the downward closure property of high transaction-weighted probabilistic mining to prune non-CPHUIs. Experimental evaluations show that the proposed algorithm has better execution time and memory usage than CHUI-Miner.

The concept of HUIM was based on the problem of frequent itemset mining (FIM) and was first presented in [9]. An itemset is a HUI if its utility value is not less than a user-specified threshold. The objective of HUIM is to discover a set of patterns that yield high profit (utility). HUIM is considered a more challenging task than FIM since the downward closure property does not hold for the utility measure [1], [2]. In addition, the number of HUIs returned and the number of candidates generated during the process of HUIM are often huge, leading to high runtime and memory requirements. To solve this problem, Tseng et al. [10] proposed closed high-utility itemsets (CHUIs), which improve the performance of the mining process and allow the complete set of HUIs to be derived. An itemset is a CHUI if it has a utility value not less than a user-specified minimum utility threshold and has no supersets which have the same support.
The associate editor coordinating the review of this manuscript and approving it for publication was Choon Ki Ahn.
As mentioned, the data collected in real-world applications can be uncertain, such as locations obtained via RFID devices or GPS [11] or the habits of shoppers obtained from e-commerce websites [12], [13]. Traditional pattern mining algorithms (those for mining FIs or HUIs) either fail to work or return incorrect outcomes when applied to input data that are incomplete or contain erroneous values. There are many algorithms for discovering useful information from uncertain databases. The UApriori approach was proposed by Chui et al. [14] to mine FIs from uncertain databases; it uses a generate-and-test, breadth-first search approach. Later, the UFP-Growth algorithm presented by Leung et al. [15] used a tree structure called the UFP-tree to mine uncertain FIs without generating candidates. Lin and Hong [16] proposed a tree structure, named the CUFP-tree, to compress uncertain frequent patterns, along with an algorithm that mines uncertain FIs from the built tree. PUF-Growth [17] reduces runtime by finding only frequent patterns (i.e., no false negatives and no false positives) from uncertain data. Recently, Lin et al. [18] proposed the PU-List algorithm for mining potential HUIs (PHUIs) from uncertain databases. The present study proposes a model for mining closed potential high-utility itemsets (CPHUIs) based on the tuple uncertainty database model [19]. The proposed model is similar to the expected-support-based model [14], [20]. A method, namely CPHUI-List, is then proposed. It utilizes the PEU-List (potential extended utility) structure to mine CPHUIs. This work is an extension of the algorithm proposed in [21].
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
The main contributions of this paper can be summarized as follows: (i) It proposes a novel type of pattern named CPHUI along with a data structure named PEU-List for mining CPHUIs. (ii) A pruning strategy named Pr-Prune is proposed to prune the search space and reduce the cost of database scans by utilizing the proposed PEU-List. (iii) Based on the proposed PEU-List and the Pr-Prune strategy, an effective algorithm named CPHUI-List is developed to directly mine CPHUIs from uncertain databases.
The rest of this paper is organized as follows. Section 2 presents basic concepts and the problem statement. Section 3 reviews related work on mining HUIs, CHUIs, and FIs from uncertain databases. In Section 4, we develop the PEU-List structure and the CPHUI-List algorithm. Section 5 compares the proposed algorithm with CHUI-Miner in terms of runtime, number of discovered patterns, and memory consumption. Conclusions and ideas for future work are presented in Section 6.

II. BASIC CONCEPTS AND PROBLEM STATEMENTS

A. BASIC CONCEPTS
Let $D$ be a transactional uncertain database defined over a finite set of distinct items $I = \{i_1, i_2, \ldots, i_m\}$, in which each item $i_j$ that appears in a transaction $T_k \in D$ has a positive value $q(i_j, T_k)$, the purchase quantity of item $i_j \in I$ in transaction $T_k$. Each transaction $T_k$ has its own unique transaction ID (TID). Furthermore, each transaction $T_k$ is associated with a unique probability of existence $pe(T_k)$ based on the tuple uncertainty model [19]. A set of positive numbers called Protable, an external utility table (e.g., a unit profit table), is defined as $Protable = \{p(i_1), p(i_2), \ldots, p(i_m)\}$, where $p(i_j)$ is the profit of item $i_j$. A transaction $T_k \in D$ contains an itemset $X \subseteq I$ if $X \subseteq T_k$. Two user-specified constraints, the minimum utility threshold (MUT) and the minimum potential probability threshold (MPPT), are called $\gamma$ and $\delta$, respectively. An example of an uncertain database with six transactions is presented in Table 1, and the unit profit of each item is given in Table 2. In this example, $\gamma = 25\%$ and $\delta = 15\%$.
Definition 1: Let $X$ and $Y$ be non-empty itemsets. $X$ is a subset of $Y$, and $Y$ is a superset of $X$, if and only if $X \subseteq Y$.
Definition 2: The number of transactions containing $X$ in database $D$, called the support count of itemset $X$, is denoted as $SC(X)$. The tidset of itemset $X$ is the set of identifiers of the transactions that contain $X$, denoted as $TidSet(X)$.
Based on Definition 2, $SC(X) = |TidSet(X)|$. For example, for the database in Table 1, the itemset $BCE$ appears in transactions $T_2$, $T_3$, and $T_5$ (as the example for Definition 5 shows), so $SC(BCE) = 3$.
Definition 3: The utility of an item $i_j \in I$ in transaction $T_k$ is defined as $u(i_j, T_k) = q(i_j, T_k) \times p(i_j)$.
For example, the utility of item $A$ in transaction $T_1$ is $u(A, T_1) = q(A, T_1) \times p(A)$.
Definition 4: The existence probability of itemset $X$ in transaction $T_k$, denoted as $pe(X, T_k)$, is defined as $pe(X, T_k) = pe(T_k)$.
For example, the probability of item $A$ in $T_3$ is $pe(A, T_3) = pe(T_3)$.
Definition 5: The utility of itemset $X$ in transaction $T_k$ is defined as $u(X, T_k) = \sum_{i_j \in X \wedge X \subseteq T_k} u(i_j, T_k)$, and the utility of $X$ in uncertain database $D$ is defined as $u(X) = \sum_{X \subseteq T_k \wedge T_k \in D} u(X, T_k)$ [18]. For example, the utility of itemset $BCE$ in transaction $T_2$ is calculated as $u(BCE, T_2) = u(B, T_2) + u(C, T_2) + u(E, T_2) = 3 + 20 + 45 = 68$, and the utility of itemset $BCE$ in uncertain database $D$ is calculated as $u(BCE) = u(BCE, T_2) + u(BCE, T_3) + u(BCE, T_5) = 68 + 61 + 44 = 173$.
Definition 6: The potential probability of itemset $X$ in uncertain database $D$ [18], denoted as $Prob(X)$, is defined as follows: $Prob(X) = \sum_{X \subseteq T_k \wedge k \in TidSet(X)} pe(X, T_k)$.
For instance, considering the sample database $D$ given in Table 1, $Prob(BCE) = pe(T_2) + pe(T_3) + pe(T_5)$.
Definition 7: The utility of transaction $T_k \in D$ is defined as $TU(T_k) = \sum_{j=1}^{n} u(i_j, T_k)$, where $n$ is the number of items in $T_k$. For example, the transaction utility of transaction $T_1$ is calculated as the sum of the utilities of all items in $T_1$. For example, the total utility of $D$, shown in Table 1, is calculated as $sum(TU) = \sum_{T_k \in D} TU(T_k)$.
Definition 9 (Potential High-Utility Itemset): An itemset $X$ is defined as a PHUI if $X$ is a HUI and $X$ has a high existence probability; i.e., $X$ satisfies the following two conditions: (1) $u(X) \geq \gamma \times sum(TU)$; (2) $Prob(X) \geq \delta \times |D|$.
Definition 11 (Closed Itemset): Let $X$ be an itemset. If there does not exist a superset $Y \supset X$ in $D$ such that $SC(X) = SC(Y)$, then $X$ is called a closed itemset.
Definition 12 (Itemset's Closure): Let $X$ be an itemset and $Y \supseteq X$ be a closed itemset. If $SC(Y) = SC(X)$, then $Y$ is the closure of $X$, which is defined as follows: $closure(X) = \bigcap_{k \in TidSet(X)} T_k$ [22].
Definition 13 (Closed Potential High-Utility Itemset): An itemset $X$ is called a CPHUI if $X$ is a CHUI and $X$ has a high existence probability; i.e., $X$ is both a CHUI ($X$ is closed and $u(X) \geq \gamma \times sum(TU)$) and a closed high-potential itemset (CHPI; $X$ is closed and $Prob(X) \geq \delta \times |D|$).

B. PROBLEM STATEMENT
Given an uncertain database $D$ with total utility $sum(TU)$, an MUT $\gamma$, and an MPPT $\delta$, the task of mining CPHUIs from $D$ is to discover the complete set of closed itemsets whose utility is not less than $\gamma \times sum(TU)$ and whose potential probability is not less than $\delta \times |D|$.
Consider the database D given in Table 1, MUT γ = 25% and MPPT δ = 15%. Table 3 shows the discovered set of CPHUIs from D.
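The definitions and the problem statement above can be illustrated with a brute-force sketch. It is not the proposed algorithm: it simply enumerates every itemset and applies the definitions of utility, potential probability, closure, and the two thresholds. The toy database, the profit table, and all function names are hypothetical and are not taken from Table 1 or Table 2.

```python
from itertools import combinations

# Hypothetical toy data (NOT Table 1 of the paper):
# tid -> ({item: purchase quantity}, existence probability of the transaction)
DB = {
    1: ({"A": 2, "B": 1}, 0.9),
    2: ({"A": 1, "B": 2, "C": 1}, 0.6),
    3: ({"B": 1, "C": 3}, 0.8),
    4: ({"A": 1, "B": 1, "C": 2}, 0.5),
}
PROFIT = {"A": 5, "B": 2, "C": 3}  # hypothetical unit profits


def tidset(itemset):
    """Identifiers of the transactions that contain every item of itemset."""
    return {tid for tid, (items, _) in DB.items() if set(itemset) <= set(items)}


def utility(itemset):
    """u(X): sum of q * p over the items of X in every transaction containing X."""
    return sum(DB[t][0][i] * PROFIT[i] for t in tidset(itemset) for i in itemset)


def prob(itemset):
    """Prob(X): sum of the existence probabilities of the transactions containing X."""
    return sum(DB[t][1] for t in tidset(itemset))


def closure(itemset):
    """Intersection of all transactions containing X (X must occur at least once)."""
    rows = [frozenset(DB[t][0]) for t in tidset(itemset)]
    return frozenset.intersection(*rows)


def total_utility():
    """sum(TU): sum of the transaction utilities of the whole database."""
    return sum(q * PROFIT[i] for items, _ in DB.values() for i, q in items.items())


def cphuis(gamma, delta):
    """Brute-force CPHUI mining: closed, u(X) >= gamma*sum(TU), Prob(X) >= delta*|D|."""
    min_util = gamma * total_utility()
    min_prob = delta * len(DB)
    result = set()
    for size in range(1, len(PROFIT) + 1):
        for itemset in combinations(sorted(PROFIT), size):
            if not tidset(itemset):
                continue  # itemset never occurs, so it has no closure
            if (closure(itemset) == frozenset(itemset)
                    and utility(itemset) >= min_util
                    and prob(itemset) >= min_prob):
                result.add(frozenset(itemset))
    return result
```

With these toy values, `cphuis(0.5, 0.4)` uses the thresholds 0.5 × 48 = 24 for utility and 0.4 × 4 = 1.6 for probability. The exhaustive enumeration is exponential in the number of items, which is the cost that list-based pruning is meant to avoid.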

III. RELATED WORK

A. HIGH-UTILITY ITEMSET MINING
HUI mining algorithms may return a huge number of outputs and candidates. This can be mitigated by algorithms which incorporate a two-phase model. For example, Liu et al. [6] proposed an algorithm named Two-Phase that can effectively prune candidates to discover the complete set of HUIs. However, this algorithm requires scanning the database multiple times and still generates a huge number of candidates. The BAHUI algorithm, proposed by Song et al. [23], uses bitmaps and a divide-and-conquer strategy to mine HUIs. BAHUI effectively reduces memory usage and uses efficient bitwise operations. The UP-Growth algorithm and the utility pattern tree (UP-Tree), proposed by Tseng et al. [24], require only two database scans to discover the complete set of HUIs. The UP-Tree helps reduce the number of generated candidates and significantly reduces execution time compared with previously proposed methods. Later, Tseng et al. [5] proposed an improved version of UP-Growth, called UP-Growth+, which reduces overestimated utilities. To minimize the cost of candidate generation and utility calculation in two-phase HUIM algorithms, Qu et al. [25] proposed a novel tree structure for storing candidates and a tree-based algorithm to quickly identify HUIs. Experimental results showed that the proposed candidate tree structure and algorithm outperform two-phase algorithms. The HUI-Miner algorithm, proposed by Liu and Qu [7], uses the utility-list structure to prune the search space and thus efficiently mine all HUIs. Lan et al. [26] proposed a projection-based index approach. However, it consumes a large amount of memory and requires a long execution time. The FHM algorithm, proposed by Fournier-Viger et al. [27], introduced a structure named the Estimated Utility Co-occurrence Structure (EUCS) and a corresponding pruning strategy, Estimated Utility Co-occurrence Pruning (EUCP), to speed up the discovery of HUIs by considering item co-occurrences. Extending the HUI-Miner algorithm, Krishnamoorthy proposed the HUP-Miner algorithm [28], in which the input database is partitioned to prune the search space. In addition, it speeds up utility-list construction by utilizing a look-ahead strategy [28]. Recently, Zida et al. [8] proposed an algorithm called EFIM to effectively mine HUIs. The authors also proposed two novel and tighter upper bounds, namely sub-tree utility and local utility, and two efficient strategies called high-utility database projection (HDP) and high-utility transaction merging (HTM) to significantly prune the search space and reduce database scans.

B. CLOSED HIGH-UTILITY ITEMSET MINING
In 2015, Tseng et al. [10] introduced a more compact and lossless representation of HUIs, together with the task of mining this type of pattern. The authors also proposed an algorithm, named CHUD, to mine this new type of itemset representation. The process of mining CHUIs decreases execution time and memory usage. Wu et al. proposed an algorithm named CHUI-Miner [22] for mining CHUIs without generating candidates. The algorithm stores the utility information of itemsets in transactions by utilizing the proposed EU-List data structure. From the complete set of discovered CHUIs, all HUIs can then be derived.

C. POTENTIAL HIGH-UTILITY ITEMSET MINING
When it comes to uncertain databases, FIs take on a different meaning. In an uncertain database, the support of an itemset is replaced by its expected support. Based on this definition, an itemset is frequent if its expected support is no less than the minimum expected support threshold [20]. Chui and Kao [36] applied frequent itemset mining to datasets under the existential uncertain data model. Bernecker et al. [37] proposed a framework for efficient probabilistic frequent itemset mining. Sun et al. [12] then proposed the p-Apriori algorithm, an improvement of the Apriori algorithm [1] for uncertain databases. This algorithm considers an itemset X frequent if P(sup(X) ≥ minSup) ≥ minPro, where P is the frequency probability, minSup is the user-specified minimum support, and minPro is the probability threshold. Many methods have been presented to mine FIs from uncertain databases. The UApriori algorithm, presented by Chui and Kao [14], is a level-wise method mostly based on the Apriori algorithm. The UFP-Growth algorithm, proposed by Leung et al. [15], extended the FP-Tree structure to mine frequent patterns from uncertain databases without generating candidates. Lin and Hong [16] proposed the CUF-Growth algorithm, which utilizes a tree structure known as the compressed uncertain frequent-pattern tree to mine FIs effectively. Later, Leung and Tanbeer [17] presented the PUF-tree structure and an algorithm named PUF-Growth, which only requires three probabilistic dataset scans to mine FIs.
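The two notions of frequency mentioned above, expected support and the probabilistic condition P(sup(X) ≥ minSup) ≥ minPro, can be contrasted in a short sketch. It assumes that the transactions containing X exist independently, each with a given probability, so that the support follows a Poisson-binomial distribution. This is an illustration of the two conditions only, not an implementation of p-Apriori or UApriori, and the function names are hypothetical.

```python
def expected_support(probs):
    """Expected support: sum of the existence probabilities of X's transactions."""
    return sum(probs)


def frequentness_probability(probs, min_sup):
    """P(sup(X) >= min_sup), assuming independent transactions (an assumption).

    dp[s] is the probability that exactly s of the transactions processed so far
    exist; the last slot dp[min_sup] accumulates every count >= min_sup.
    """
    dp = [1.0] + [0.0] * min_sup
    for p in probs:
        nxt = [0.0] * (min_sup + 1)
        for count, mass in enumerate(dp):
            nxt[count] += mass * (1.0 - p)               # transaction absent
            nxt[min(count + 1, min_sup)] += mass * p     # transaction present
        dp = nxt
    return dp[min_sup]


def is_probabilistic_frequent(probs, min_sup, min_pro):
    """The condition P(sup(X) >= minSup) >= minPro from the text."""
    return frequentness_probability(probs, min_sup) >= min_pro
```

For two transactions with probability 0.5 each, the expected support is 1.0, while the probability of reaching a support of at least 1 is 0.75. The two models can therefore rank the same itemset differently for the same threshold.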
Mining high-utility itemsets from an uncertain database is a new and challenging problem, one that is closer to the situation with real-world applications. Lin et al. proposed the first algorithm to mine HUIs from uncertain databases [18] named PHUI-UP, and it is based on the PHUI mining model and very similar to the model used for probabilistic databases. The presented algorithm is based on an Apriori-like [1] approach with a probabilistic measure, and thus it still generates candidates. Furthermore, the authors also proposed the PU-List structure and integrated it into the HUI-Miner algorithm to mine HUIs. The proposed algorithm by Lin et al. is the most recent algorithm to mine HUIs from uncertain databases.

IV. PROPOSED ALGORITHM
This section presents the novel PEU-List data structure along with an improved version of the CHUI-Miner algorithm, namely the CPHUI-List algorithm. CPHUI-List is able to mine CPHUIs from uncertain databases. CPHUI-List integrates the proposed PEU-List structure to discover CPHUIs from the set-enumeration tree.

A. PROBABILITY-EXTENDED-UTILITY-LIST STRUCTURE
The Apriori property dramatically reduces the number of candidates generated in the task of mining association rules. For uncertain databases, CPHUI-List applies this property to the probability measure when mining CPHUIs.

Theorem 1 (High-Probability Itemsets' Closure Property): Let $X^{r}$ be a high-probability $r$-itemset in an uncertain database and let $X^{r-1}$ be any of its subsets of size $r-1$. The downward closure property holds: $Prob(X^{r}) \leq Prob(X^{r-1})$ [18].
Table 4.
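A short justification of Theorem 1 can be written directly from Definitions 4 and 6. The derivation below is a sketch that assumes $X^{r-1} \subset X^{r}$ and uses only the fact that existence probabilities are non-negative.

```latex
\begin{aligned}
TidSet(X^{r}) &\subseteq TidSet(X^{r-1})
  && \text{(every transaction containing } X^{r} \text{ also contains } X^{r-1}\text{)} \\
Prob(X^{r}) &= \sum_{k \in TidSet(X^{r})} pe(T_k)
  \;\le\; \sum_{k \in TidSet(X^{r-1})} pe(T_k) = Prob(X^{r-1})
  && \text{(since } pe(T_k) \ge 0\text{)}
\end{aligned}
```

Consequently, once $Prob(X) < \delta \times |D|$ holds for an itemset $X$, it also holds for every superset of $X$, which is what allows probability-based pruning.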

2) PEU-LIST STRUCTURE
To efficiently and directly mine CPHUIs from uncertain databases while reducing database scans, the information of itemsets in transactions is recorded in PEU-Lists. A PEU-List is initialized by scanning the database twice. The PEU-List structure of itemset X comprises a PU-List [18], the support of X, and two itemsets, namely PrevSet(X) and PostSet(X), which are the itemsets that precede and succeed X, respectively. The PU-List of X contains lists of items. Every list in the PU-List of X contains the utility and probability of promising items.
Definition 16 (Precede and Succeed): Consider a set of items I = {i 1 , i 2 , . . . ,i n } and the total order relation R :

Definition 17 (PU-List): The PU-List [18] of itemset $X$, denoted as $PUL(X)$, is an ordered list containing $|TidSet(X)|$ tuples of the form $\langle TID, IU, RU, PR \rangle$. Each tuple, called an element, represents a transaction $T_k$ containing $X$, where $k \in TidSet(X)$. The element for $T_k$ is denoted as $PUL(X, T_k)$ and holds $\langle TID, IU(X, T_k), RU(X, T_k), PR(X, T_k) \rangle$, where $TID$ is the identifier of $T_k$; $IU(X, T_k)$ is the utility of $X$ in $T_k$, defined as $IU(X, T_k) = \sum_{i_j \in X} u(i_j, T_k)$; $RU(X, T_k)$ is the remaining utility of $X$ in $T_k$, defined as $RU(X, T_k) = \sum_{i_j \in T_k \wedge X \prec i_j} u(i_j, T_k)$, i.e., the utility of the items of $T_k$ that succeed all items of $X$ under $\prec$; and $PR(X, T_k)$ is the existence probability of $X$ in $T_k$, $PR(X, T_k) = pe(X, T_k)$.
For instance, Figure 1 presents the PU-Lists constructed for one-member itemsets (a), two-member itemsets (b), and three-member itemsets (c). In this example, $\prec$ is the ascending order of the items' TWUs ($D \prec E \prec A \prec B \prec C$).
Definition 18: For the PU-List of $X$ in Definition 17, $sum(IU)$ denotes the sum of the utilities of $X$ in $D$, and $sum(RU)$ denotes the sum of the remaining utilities of $X$ in $D$.
Pruning strategy (Pr-Prune): If $X.sum(IU) + X.sum(RU)$ is less than minutil or $Prob(X)$ is less than minprob, where minutil $= \gamma \times sum(TU)$ and minprob $= \delta \times |D|$, then neither $X$ nor any of its extensions is a PHUI; therefore, none of them is a CPHUI.
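The PU-List elements and the Pr-Prune test can be sketched as follows. This is a minimal illustration under stated assumptions: the toy database, the profit table, and the item order `ORDER` are hypothetical (not Table 1 or Figure 1), and each list is rebuilt directly from the database rather than by the list-joining Construct routine of the paper.

```python
from collections import namedtuple

# One PU-List element <TID, IU, RU, PR>, following Definition 17.
Element = namedtuple("Element", ["tid", "iu", "ru", "pr"])

# Hypothetical toy data (not Table 1 of the paper):
# tid -> ({item: purchase quantity}, existence probability of the transaction)
DB = {
    1: ({"A": 2, "B": 1}, 0.9),
    2: ({"A": 1, "B": 2, "C": 1}, 0.6),
    3: ({"B": 1, "C": 3}, 0.8),
    4: ({"A": 1, "B": 1, "C": 2}, 0.5),
}
PROFIT = {"A": 5, "B": 2, "C": 3}
ORDER = ["C", "A", "B"]  # assumed total order: ascending TWU (36, 37, 48) for this toy data


def pu_list(itemset):
    """Build the PU-List of itemset by scanning the toy database once."""
    rank = {item: position for position, item in enumerate(ORDER)}
    last = max(rank[i] for i in itemset)  # position of the last item of X under the order
    elements = []
    for tid in sorted(DB):
        quantities, pe = DB[tid]
        if not all(i in quantities for i in itemset):
            continue  # transaction does not contain X
        iu = sum(quantities[i] * PROFIT[i] for i in itemset)
        # Remaining utility: items of the transaction that come after X in the order.
        ru = sum(q * PROFIT[j] for j, q in quantities.items() if rank[j] > last)
        elements.append(Element(tid, iu, ru, pe))
    return elements


def pr_prune(elements, min_util, min_prob):
    """True if the itemset (and its extensions) can be discarded by Pr-Prune."""
    sum_iu = sum(e.iu for e in elements)
    sum_ru = sum(e.ru for e in elements)
    prob = sum(e.pr for e in elements)
    return sum_iu + sum_ru < min_util or prob < min_prob
```

For example, `pu_list("CA")` covers transactions 2 and 4 with sum(IU) = 19 and sum(RU) = 6. With minutil = 24 and minprob = 1.6, the utility bound 25 passes, but the potential probability 1.1 does not, so the branch is pruned on probability alone.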
For example, E is subsumed by BCE because E ⊂ BCE and SC(E) = SC(BCE).
Property 3: For two itemsets Y and X , if Y ⊂ X and SC (Y ) = SC(X ), then closure (X ) = closure(Y ).

3) SET-ENUMERATION TREE
As mentioned above, the search space of the proposed algorithm can be represented logically as a set-enumeration tree, defined as follows. An itemset is represented as a node in the tree and is an extension of its parent node. The root node of the tree is ∅, and the nodes at level 1 are 1-itemsets. Nodes are arranged from left to right based on the total order relation R, where R is the ascending order of TWU of 1-itemsets.
The CPHUI-List algorithm traverses the tree using a depth-first search (DFS) strategy. Each node of the set-enumeration tree is associated with a PU-List whose elements consist of four members, namely ⟨i, IU, RU, PR⟩. The pruning strategy Pr-Prune is used to eliminate branches whose child nodes are unpromising itemsets. If a node satisfies two conditions, namely the sum of IU and RU of the processed node is no less than the MUT (γ × sum(TU)) and the PR of the node is no less than the MPPT (δ × |D|), then its extensions can be CPHUIs. The traversal is done recursively to extend the set-enumeration tree. The utility and potential probability are recorded in the PEU-List. Finally, CPHUIs can be mined without multiple database scans.

B. PROPOSED CPHUI-LIST ALGORITHM
The first step of CPHUI-List, whose pseudo-code is given in Algorithm 1, is to scan the input database D to calculate TWU(i) and Prob(i) of each 1-itemset i ∈ I. Unpromising 1-itemsets are pruned if TWU(i) < minutil or Prob(i) < minprob, and the set of remaining items is called I_p. Next, the items in I_p are sorted in ascending order of TWU. Then, the algorithm performs a second scan of D and constructs the PEU-List structure of the 1-itemsets. The set-enumeration tree can be used to represent the search space of the algorithm. Every 1-itemset i ∈ I_p and its PEU-List are recursively explored by the DFS. The PEU-List of new_gen is then constructed by using the Construct routine, which is given in Pseudo-Code 1.
The high-utility and subsumption checks of X are then carried out (lines #7 and #8). The subsumption check is shown in Pseudo-Code 4. If the generated itemset new_gen passes the subsumption check, the ComputeClosure procedure is invoked with the parameters (new_gen, PostSet(Closed_set_new), PostSet(X)) to calculate the closure of new_gen and its PEU-List by updating the PEU-List of new_gen. The algorithm for computing an itemset's closure is given in Pseudo-Code 2. After ComputeClosure returns, the algorithm outputs the closed set Closed_set_new if it satisfies Closed_set_new.sum(IU) ≥ minutil and Prob(Closed_set_new) ≥ minprob. The algorithm then executes the GEN-CPHUI routine to recursively explore the search space in the set-enumeration tree and discover CPHUIs that are extensions of Closed_set_new. After that, i is added to PrevSet(X). The algorithm terminates when all CPHUIs in the input database D have been discovered.
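The overall flow described above (first scan, pruning of unpromising items, ascending-TWU ordering, depth-first exploration with Pr-Prune) can be sketched as follows. This is a simplified illustration, not Algorithm 1: it recomputes IU, RU, and PR from the database instead of joining PEU-Lists, and it tests closedness by intersecting transactions instead of using the subsumption check and ComputeClosure. The toy data and all function names are hypothetical.

```python
# Hypothetical toy data (not Table 1 of the paper):
# tid -> ({item: purchase quantity}, existence probability of the transaction)
DB = {
    1: ({"A": 2, "B": 1}, 0.9),
    2: ({"A": 1, "B": 2, "C": 1}, 0.6),
    3: ({"B": 1, "C": 3}, 0.8),
    4: ({"A": 1, "B": 1, "C": 2}, 0.5),
}
PROFIT = {"A": 5, "B": 2, "C": 3}


def u(item, tid):
    """Utility of one item in one transaction: quantity times unit profit."""
    return DB[tid][0][item] * PROFIT[item]


def first_scan(min_util, min_prob):
    """Compute TWU(i) and Prob(i), drop unpromising items, sort by ascending TWU."""
    twu, prob = {}, {}
    for tid, (items, pe) in DB.items():
        tu = sum(u(i, tid) for i in items)
        for i in items:
            twu[i] = twu.get(i, 0) + tu
            prob[i] = prob.get(i, 0.0) + pe
    promising = [i for i in twu if twu[i] >= min_util and prob[i] >= min_prob]
    return sorted(promising, key=lambda i: (twu[i], i))


def is_closed(itemset):
    """X is closed iff the intersection of the transactions containing X equals X."""
    rows = [set(items) for items, _ in DB.values() if itemset <= set(items)]
    return bool(rows) and set.intersection(*rows) == itemset


def mine_cphuis(min_util, min_prob):
    order = first_scan(min_util, min_prob)
    rank = {item: position for position, item in enumerate(order)}
    found = set()

    def dfs(prefix, extensions):
        for position, item in enumerate(extensions):
            itemset = prefix + (item,)
            tids = [t for t, (items, _) in DB.items() if all(i in items for i in itemset)]
            iu = sum(u(i, t) for t in tids for i in itemset)
            # Remaining utility: promising items ordered after the last item of the itemset.
            ru = sum(u(j, t) for t in tids for j in DB[t][0]
                     if j in rank and rank[j] > rank[item])
            pr = sum(DB[t][1] for t in tids)
            if iu + ru < min_util or pr < min_prob:  # Pr-Prune: skip the whole branch
                continue
            if iu >= min_util and is_closed(frozenset(itemset)):
                found.add(frozenset(itemset))
            dfs(itemset, extensions[position + 1:])

    dfs((), order)
    return found
```

Calling `mine_cphuis(24, 1.6)` on the toy data explores the items in the order C, A, B and prunes the branch {C, A} because its potential probability (1.1) is below 1.6. Recomputing each list from the database keeps the sketch short; the list-based construction in the paper exists precisely to avoid these repeated scans.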

V. EXPERIMENTAL STUDIES
Experiments were carried out to evaluate the performance of the proposed CPHUI-List algorithm. They were conducted on a computer equipped with a 3rd-generation 64-bit Intel® Core™ i5 2.7 GHz processor and 16 GB of RAM, running Windows 10 Professional. The performance of CPHUI-List was compared to that of the CHUI-Miner algorithm. All algorithms were written in Java, using JDK 1.8.0. The initial and maximum heap memory sizes of the JVM were fixed at 4 GB. The source code for CHUI-Miner can be downloaded from [38].

A. DATASETS
Experiments were performed on the following datasets: Foodmart, Chainstore, Mushroom, Retail, Chess and Accident. Mushroom, Chess, Retail and Accident are available from the FIMI Repository, and Foodmart is the Microsoft Foodmart 2000 database. The characteristics of these datasets are given in Table 5. For all datasets, the external utilities of items were generated between 1 and 1,000 using a lognormal distribution. The quantities of items were randomly generated between 1 and 5. The probabilities were randomly generated to be between 0.1 and 1.0.
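The data preparation described above can be sketched as follows. The text gives only the value ranges and the use of a lognormal distribution, so the lognormal parameters `mu` and `sigma`, the clipping to [1, 1000], the uniform sampling of quantities and probabilities, and the function name `synthesize` are all assumptions made for illustration.

```python
import random


def synthesize(transactions, seed=0, mu=3.0, sigma=1.0):
    """Attach unit profits, quantities, and existence probabilities to item lists.

    transactions: list of item lists (e.g., rows of a FIMI-style dataset).
    Returns (db, profit) with db[tid] = ({item: quantity}, probability).
    """
    rng = random.Random(seed)
    items = sorted({item for row in transactions for item in row})
    # External utilities: lognormal draws, clipped to [1, 1000] (clipping is an assumption).
    profit = {
        item: min(1000, max(1, round(rng.lognormvariate(mu, sigma))))
        for item in items
    }
    db = {}
    for tid, row in enumerate(transactions, start=1):
        quantities = {item: rng.randint(1, 5) for item in row}  # quantities in 1..5
        db[tid] = (quantities, rng.uniform(0.1, 1.0))           # probability in [0.1, 1.0]
    return db, profit
```

Fixing the seed makes a generated uncertain database reproducible across runs, which matters when two algorithms are timed on the same input.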

B. RUNTIME
The runtime of the proposed CPHUI-List algorithm was compared with that of CHUI-Miner using six uncertain databases and various MUT and MPPT thresholds. Figure 2 presents the runtime of the two algorithms for various MUTs and a fixed MPPT. Conversely, Figure 3 presents the runtime of the evaluated algorithms for various MPPTs and a fixed MUT. Since CHUI-Miner is designed to operate on certain datasets, not uncertain ones, it does not handle potential probability values. In Figure 2, it can be seen that CPHUI-List has better execution time than CHUI-Miner on all tested datasets. For example, for the Chess dataset at minutil = 20%, CPHUI-List is over 20 times faster than CHUI-Miner. From Figure 3, it can be observed that CPHUI-List is also much faster than CHUI-Miner. As the MPPT increases, the runtime of CPHUI-List decreases, while that of CHUI-Miner remains unchanged. This can be attributed to the fact that, at higher MPPT values, CPHUI-List generates fewer candidates, whereas the performance of CHUI-Miner is affected only by the MUT, not the MPPT. In this case, the MUT is fixed for each dataset.

C. PATTERN ANALYSIS
Pattern analysis was also done to compare our proposed algorithm to CHUI-Miner. The number of discovered CPHUIs is always lower than that of CHUIs for a fixed MPPT and various MUTs. This can be attributed to the fact that CPHUI-List mines CPHUIs by considering both the utility and probability constraints, whereas CHUI-Miner mines CHUIs (i.e., it only checks the utility constraint). The numbers of CHUIs and CPHUIs both decrease as the MUT increases. It can also be observed that the number of CHUIs is unchanged when the MPPT is raised. The reason is that CHUI-Miner only considers the high-utility threshold for discovering CHUIs, and thus raising the MPPT does not affect the number of CHUIs. As the MPPT increases, the number of CPHUIs drops dramatically. Since the proposed algorithm utilizes two thresholds for mining CPHUIs, the number of discovered patterns is always lower than that of CHUIs.

D. MEMORY CONSUMPTION
The memory consumption of the two algorithms is shown in Figures 4 and 5. In both settings, a fixed MPPT with various MUT values and a fixed MUT with various MPPT values, CPHUI-List consumed less memory than CHUI-Miner. The reason for this result is the same as that given for Figure 4. From Figure 5, when minprob = 2% for the Mushroom dataset, the memory consumed by CPHUI-List is less than one-fifth of that consumed by CHUI-Miner.

VI. CONCLUSION AND FUTURE WORK
This study proposed an approach for mining CPHUIs from uncertain databases without generating candidates. The CPHUI-List algorithm mines closed itemsets that are both HUIs and high-probability itemsets. The algorithm is built on the PEU-List structure and the set-enumeration tree. It uses a DFS-based approach to directly mine CPHUIs. Experimental results on real databases show that CPHUI-List outperforms CHUI-Miner in terms of runtime and memory consumption.
This is the first step towards mining CPHUIs from uncertain databases. In the future, we will improve the structure of PEU-List to improve the performance of the CPHUI-List algorithm and extend this work to mine maximal-potential HUIs and top-k-potential HUIs.