A Fast Approach for Up-Scaling Frequent Itemsets

With the rapid growth of data scale and the diversification of demand, there is an urgent need to extract useful frequent itemsets from datasets of different scales. Traditional methods can certainly solve this problem, but they do not fully exploit the relationships among datasets of different scales. The fast approach proposed in this paper works as follows: given that the frequent itemsets of the small-scale datasets have already been mined, the frequent itemsets of the large-scale dataset are inferred directly from them instead of being mined from the large-scale dataset again. We conduct extensive experiments on one synthetic dataset and four UCI datasets. The experimental results show that our algorithm is significantly faster and consumes less memory than the leading algorithms.


I. INTRODUCTION
To analyze customers' buying behavior from transaction databases, Agrawal et al. first presented frequent itemset mining in 1993 [1]. It is one of the critical data mining tasks and has been widely used in many other significant data mining tasks, including mining associations and correlations, classification, clustering, etc. Since then, frequent itemset mining has been a hot field that has attracted a great deal of attention from researchers. After Apriori was proposed, several improved algorithms followed, because Apriori needs to scan the database repeatedly. These algorithms share a common feature: generating candidate itemsets, so filtering candidate itemsets is a challenging task. The FP-growth algorithm is a classic representative that does not generate candidate itemsets; it compresses the database representing frequent items into an FP-tree, which retains the itemset association information [2]. In recent years, to enhance the efficiency of mining frequent itemsets, three kinds of data structures were presented by Deng et al., named Node-list, N-list, and Nodeset. FIN, based on Nodeset, consumes less memory because the Nodeset structure requires only the pre-order (or post-order) [3]. Building on this advantage of Nodeset, two further data structures (DiffNodeset [3] and NegNodeset [4]) were proposed by Deng et al. and Aryabarzan et al., together with two algorithms, dFIN and negFIN, based on them respectively. Extensive experimental results show that dFIN and negFIN run at roughly the same speed as each other and mine frequent itemsets faster than the state-of-the-art algorithms [4].
The same problem or system can be perceived at different levels of specificity (detail), depending on the complexity of the problem, available computing resources, and particular needs to be addressed [5]. Suppose you have nationwide patient information and need to advise both city and national managers on which diseases the same person suffers from; here the data of each city is named a small-scale dataset and the data of the whole country is named the large-scale dataset. So, there is an urgent need to mine frequent itemsets on datasets of different scales. It would be wasteful to first apply a traditional mining algorithm to discover frequent itemsets from the small-scale datasets and then apply it again to the large-scale dataset; moreover, that scheme does not use the relationship between the small-scale data and the large-scale data. In this paper, a new framework (up-scaling) is proposed: the frequent itemsets of the small-scale datasets are used to directly infer the frequent itemsets of the large-scale dataset, instead of performing secondary mining on the large-scale dataset.
The contributions of this paper are listed as follows: 1) This paper presents a novel framework for mining frequent itemsets from datasets of different scales. 2) We propose a new algorithm that mines the frequent itemsets of the large-scale dataset depending on the frequent itemsets of the small-scale datasets, not on the original data. 3) Experimental results show that the framework and the algorithm are efficient, greatly reducing memory consumption with similar accuracy, especially in the case of short frequent itemsets, and that the framework outperforms the current optimal algorithms. The rest of this paper is organized as follows: Section II discusses related work on mining frequent itemsets. Section III describes our problem, and the framework is proposed in Section IV. Section V shows the experimental results. The conclusions and some future research directions are given in Section VI.

II. RELATED WORK
Frequent itemset mining is the first and foremost step of association rule mining [6]. In the literature, frequent itemset mining methods are mainly divided into two categories: 1) algorithms that take advantage of the horizontal data format, and 2) algorithms that take advantage of the vertical data format.
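The distinction between the two layouts can be made concrete with a small transaction database. A minimal sketch (the transactions and item names are invented for illustration):

```python
# Horizontal layout: each transaction ID (TID) maps to the items it contains.
horizontal = {
    1: {"bread", "milk"},
    2: {"bread", "butter"},
    3: {"milk", "butter", "bread"},
}

# Vertical layout: each item maps to the set of TIDs that contain it;
# the support of an itemset is then a TID-set intersection.
vertical = {}
for tid, items in horizontal.items():
    for item in items:
        vertical.setdefault(item, set()).add(tid)

print(vertical["bread"])                      # TIDs containing bread
print(vertical["bread"] & vertical["milk"])   # TIDs containing {bread, milk}
```

The two layouts carry the same information; the choice mainly affects how support counting is implemented.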
Among the first category, Apriori, the basic algorithm for finding frequent itemsets, was first proposed by Agrawal and Srikant [7] and has motivated many researchers to study this field. It adopts an iterative approach known as a level-wise search, where k-itemsets are used to explore (k + 1)-itemsets. The Apriori algorithm often needs multiple passes over the dataset and produces many candidate itemsets that are eventually pruned. To address the problem that Apriori generates too many candidate itemsets, Park introduces the DHP algorithm, which reduces the number of candidate itemsets and improves efficiency by using a hash function and a bit vector [8]. To reduce the number of dataset scans, Savasere proposes a partitioning algorithm [9], which can obtain all frequent itemsets by scanning the dataset twice: first, the dataset is divided into several non-overlapping partitions and the frequent itemsets of each partition are computed; then, another pass over the dataset acquires the support of the candidates, from which the frequent itemsets are discovered. Y. Djenouri et al. propose SSFIM [10], which scans the transactional database only once when discovering frequent itemsets. It has the unique feature of generating a fixed number of candidate itemsets independently of the minimum support threshold, which intuitively reduces the runtime cost for large databases. Toivonen presents the Sampling algorithm [11], which trades some degree of accuracy against efficiency: it extracts an appropriate number of samples from the original dataset and then mines frequent itemsets from the samples. All of the above algorithms need to generate candidate itemsets. The classic and basic algorithm that does not generate candidate itemsets is the FP-growth algorithm [2].
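The level-wise idea behind Apriori can be sketched in a few lines. This is an illustrative simplification of the algorithm in [7], not the optimized version; the toy transactions are invented:

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return all itemsets whose absolute support is >= min_count."""
    items = {i for t in transactions for i in t}
    # L1: frequent 1-itemsets.
    current = {frozenset([i]) for i in items
               if sum(i in t for t in transactions) >= min_count}
    frequent = set(current)
    k = 2
    while current:
        # Join step: combine frequent (k-1)-itemsets into k-candidates.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        # Prune step: drop candidates with an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        # Support counting: one pass over the data per level.
        current = {c for c in candidates
                   if sum(c <= t for t in transactions) >= min_count}
        frequent |= current
        k += 1
    return frequent

db = [frozenset(t) for t in
      [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]]
print(apriori(db, 3))
```

Note how the per-level pass over `transactions` is exactly the repeated database scanning that DHP, the partitioning algorithm, and SSFIM aim to reduce.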
It stores essential information about frequent itemsets in a tree-based data structure, namely the frequent pattern tree (FP-tree). Like the FP-growth algorithm, other algorithms [12]-[14] employ the pattern growth method to discover frequent itemsets.
In the second category, generating frequent k-itemsets by intersecting the TID sets of every pair of frequent (k-1)-itemsets is the essence of the Eclat algorithm [15] by Zaki (2000). Eclat recursively partitions large classes into smaller ones until each class can be maintained entirely in memory; each class is then processed independently in a breadth-first fashion to compute the frequent itemsets. The main problem of Eclat is that the intermediate vertical TID lists can become too large to be stored in memory. Burdick et al. propose MAFIA [16], which converts the original data into binary vectors and obtains the support through an ''and'' operation so as to improve speed. When the dataset is dense, these algorithms generate many redundant itemsets, so Pasquier et al. [17] propose computing closed frequent itemsets instead. With the introduction of diffset technology, the memory requirements of Zaki's algorithm [18] were reduced: it only keeps track of the differences between the TIDs of a candidate pattern and those of its generating frequent patterns, and the diffsets drastically cut down the memory required to store intermediate results.
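The TID-set intersection at the heart of Eclat, and the diffset refinement of [18], can be illustrated as follows (toy vertical data, not Zaki's full recursive partitioning):

```python
# Vertical database: item -> TID set (invented for illustration).
tids = {
    "a": {1, 2, 3, 5},
    "b": {1, 2, 4, 5},
    "c": {1, 3, 4},
}

# Eclat: support of {a, b} is the size of the TID-set intersection.
sup_ab = len(tids["a"] & tids["b"])

# Diffset idea: store d(ab) = t(a) - t(b) instead of t(ab);
# then sup(ab) = sup(a) - |d(ab)|, and d(ab) is usually much smaller.
diff_ab = tids["a"] - tids["b"]
assert len(tids["a"]) - len(diff_ab) == sup_ab
print(sup_ab)
```

On dense data the difference sets shrink while the intersection sets stay large, which is why diffsets reduce memory.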
Recently, to mine frequent itemsets in the presence of missing items and overcome the limitations of FT-Apriori, Shariq Bashir proposes FT-PatternGrowth [19], which adopts a divide-and-conquer technique, projects a big database into several small databases, and mines FT frequent itemsets in each small database. EAFIM [20], which uses the Apache Spark framework to achieve parallelism, is an improved version of the Apriori algorithm. Yasir, Muhammad, et al. propose HARPP [21], which adopts the concepts of power set and dictionary data structures, and D-GENE [22], which suspends the process of ITTL generation until the completion of the transaction pruning phase, for discovering frequent itemsets from sparse datasets.
The drawback of these methods is that they require excessive time, construct complex data structures [23], or dominate only in a specific scenario, so their efficiency needs to be improved. In this paper, we introduce a method (up-scaling) that computes the frequent itemsets of the large-scale dataset depending on the frequent itemsets of the small-scale datasets, not on the original data. As a result, our method is efficient and requires less memory.

III. PROBLEM DESCRIPTION
Before presenting our problem statement, let us first list some necessary notations used in this paper.
A dataset D of size |D| consists of disjoint subsets D_1, D_2, ..., D_k, whose sizes are |D_1|, |D_2|, ..., |D_k| respectively, where D is named the large-scale dataset and each D_i is named a small-scale dataset. Let T = {I_1, I_2, ..., I_m} be the set of items. Every transaction S of D is non-empty and satisfies S ⊆ T. x.supND is defined as the occurrence frequency of an itemset x in D, and support is the percentage of transactions in which x appears, that is, support = x.supND / |D|.
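In code, the notation above amounts to counting occurrences. A minimal sketch of supND and support (the toy transactions and partitions are invented for illustration):

```python
def sup_nd(x, dataset):
    """Occurrence frequency of itemset x in dataset (a list of transactions)."""
    return sum(x <= t for t in dataset)  # x <= t tests x is a subset of t

# Two small-scale datasets; the large-scale dataset is their disjoint union.
D1 = [frozenset(t) for t in [{"I1", "I2"}, {"I1"}, {"I2", "I3"}]]
D2 = [frozenset(t) for t in [{"I1", "I2"}, {"I1", "I2", "I3"}]]
D = D1 + D2

x = frozenset({"I1", "I2"})
print(sup_nd(x, D))            # x.supND
print(sup_nd(x, D) / len(D))   # support = x.supND / |D|
```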

Definition 1 (Frequent Itemset of a Small-Scale Dataset):
Let x ⊆ t, t ∈ D_i, where D_i is a small-scale dataset of D, and let Tsupport_i, computed by (1) (which comes from [11]), be the minimum support threshold on D_i; here p is an adjustable probability parameter, Tsupport is the minimum support threshold on the large-scale dataset D, and x.supND_i is the occurrence frequency of x in D_i. If x satisfies x.support ≥ Tsupport_i, then x is a frequent itemset of D_i.

Definition 2 (The Estimated Value of an Infrequent Itemset on a Small-Scale Dataset):
If an itemset x is infrequent in D_i but frequent in D_j, x's estimated occurrence frequency in D_i is computed by (2), where W_ij is the similarity weight between D_i and D_j and W_ij ≤ 1.

Definition 3 (Frequent Itemset of the Large-Scale Dataset): Let x's occurrence frequency in D be computed according to (3). If x.supND/|D| > Tsupport, then x is defined as frequent in D, and all such x form the frequent itemsets of D, notated LSFI.
According to [24], it is obvious that if an itemset x is frequent in D, then there is at least one small-scale dataset D_i in which x is also frequent.
where |A| denotes the number of elements in a set A.

Problem 1: Let LSFI be the set of frequent itemsets in D. The ultimate objective of this paper is to find the function up-scaling(PLSFI, Tsupport) which reveals all large-scale frequent itemsets in D, that is, LSFI = up-scaling(PLSFI, Tsupport).

Fig. 1 shows two methods for obtaining the frequent itemsets of the large-scale dataset. The intuitive method is to merge the small-scale datasets into the large-scale dataset and then mine the frequent itemsets from the large-scale dataset. The other method is to mine frequent itemsets from the small-scale datasets and then translate them into the final result, the frequent itemsets of the large-scale dataset. Our method is the latter. Specifically, we propose the up-scaling algorithm, which depends on the frequent itemsets that have already been mined from the small-scale datasets, not on the raw large-scale dataset.
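Since equations (1)-(3) are given elsewhere in the paper, the following sketch only illustrates the shape of the filtering step in Problem 1, under the simplifying assumption that x.supND in D is the sum of the per-partition occurrence counts (real or estimated); the function name, itemsets, and numbers are invented:

```python
def up_scaling(plsfi_counts, sizes, t_support):
    """plsfi_counts: itemset -> per-partition occurrence counts
    (estimated where the itemset is infrequent). Returns LSFI."""
    total = sum(sizes)  # |D| = |D_1| + ... + |D_k|
    return {x for x, counts in plsfi_counts.items()
            if sum(counts) / total > t_support}

counts = {
    frozenset({"I1"}): [30, 28, 31, 29],        # frequent overall
    frozenset({"I1", "I2"}): [5, 4, 6, 5],      # below the threshold
}
sizes = [100, 100, 100, 100]
print(up_scaling(counts, sizes, 0.25))
```

Note that the filter never touches the raw transactions; it consumes only the (potentially estimated) per-partition counts, which is the source of the method's speed and memory savings.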

IV. PROPOSED FRAMEWORK AND THE PROPOSED ALGORITHM

A. THE OVERALL FRAMEWORK
The ultimate objective of this paper is to reveal all frequent itemsets of the large-scale dataset based on the frequent itemsets that have been mined from the small-scale datasets. The overall framework is shown in Fig. 2. In this paper, discovering the frequent itemsets of the large-scale dataset is divided into the following five steps:
1) Mining the frequent itemsets of each small-scale dataset.
2) Calculating the similarity between small-scale datasets.
3) Constructing the potential frequent itemsets of the large-scale dataset.
4) Estimating the support value of itemsets on the small-scale datasets where they are infrequent.
5) Filtering frequent itemsets for the large-scale dataset.

B. PROPOSED ALGORITHM
Based on the three formulas in Section III and the overall framework, up-scaling frequent itemsets of the large-scale dataset is described in Algorithm 2. In particular, in this algorithm we use the similarity of the frequent itemsets of the small-scale datasets instead of the similarity of the datasets themselves. Algorithm 1 produces the frequent itemsets of the small-scale datasets using traditional data mining methods, which are the input data source of Algorithm 2.
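The paper leaves the exact similarity measure to equation (2)'s context; one plausible, purely illustrative choice consistent with the |A| notation of Section III is the Jaccard similarity between two frequent-itemset collections (the function and the toy sets below are assumptions, not the paper's definition):

```python
def similarity(fi_i, fi_j):
    """Hypothetical W_ij: Jaccard similarity of two frequent-itemset sets."""
    if not (fi_i or fi_j):
        return 0.0
    return len(fi_i & fi_j) / len(fi_i | fi_j)  # always <= 1, as required

fi_1 = {frozenset({"I1"}), frozenset({"I2"}), frozenset({"I1", "I2"})}
fi_2 = {frozenset({"I1"}), frozenset({"I2"}), frozenset({"I2", "I3"})}
print(similarity(fi_1, fi_2))  # -> 0.5
```

Comparing frequent-itemset collections rather than raw datasets keeps this step cheap, since the collections are far smaller than the partitions themselves.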
The flow diagram of up-scaling frequent itemsets can be seen in Fig. 3. It is visible that constructing supportMatrix and est_supMatrix dominates the time consumption of the up-scaling algorithm. Their time complexity is the same, O(m × k), where m is the cardinality of PLSFI and k is a constant with k ≪ m in a specific scenario. Memory consumption mainly consists of the input small-scale frequent itemsets and the supportMatrix and est_supMatrix, so the space complexity of the up-scaling algorithm is O(max(kFI, m × k)), where kFI = Σ_{i=1}^{k} |LCFI_i| and |LCFI_i| is the cardinality of the set LCFI_i.

In the third step, based on the frequent itemsets of every small-scale dataset and according to Definition 4, the potential frequent itemsets of the large-scale dataset (PLSFI) are constructed as follows: {{I1}, {I2}, {I3}, {I4}, {I1,I2}, {I1,I3}, {I1,I4}, {I2,I4}, {I2,I3}, {I3,I4}, {I1,I2,I3}, {I1,I2,I4}, {I1,I3,I4}, {I2,I3,I4}}. In the fourth step, we design a data structure which saves the support value of itemsets on every small-scale dataset, and estimate the support value of the itemsets on the small-scale datasets where they are infrequent, as shown in Table 1 and Table 2.
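Tables 1 and 2 can be pictured as an m × k matrix: one row per potential frequent itemset, one column per small-scale dataset. A toy sketch of the two matrices follows; the counts, the similarity weights, and the fill rule (scaling the donor partition's count by W_ij, standing in for equation (2)) are all illustrative assumptions:

```python
partitions = ["D1", "D2", "D3"]

# supportMatrix: observed counts, 0 where the itemset was infrequent
# in that partition (m rows x k columns).
support_matrix = {
    frozenset({"I1"}): [12, 0, 11],
    frozenset({"I1", "I2"}): [5, 4, 0],
}

W = {("D2", "D1"): 0.5, ("D3", "D1"): 0.25}  # invented similarity weights

# est_supMatrix: fill each zero cell from the partition where the itemset
# has its largest observed count, weighted by similarity (assumed rule).
est_matrix = {}
for x, row in support_matrix.items():
    est_row = list(row)
    for i, v in enumerate(row):
        if v == 0:
            j = row.index(max(row))  # donor partition with an observed count
            w = W.get((partitions[i], partitions[j]), 1.0)
            est_row[i] = w * row[j]
    est_matrix[x] = est_row
print(est_matrix)
```

The O(m × k) cost discussed above is visible directly: the nested loop touches each of the m × k cells once.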
In the last step, based on the support values for the large-scale dataset in Table 2, the frequent itemsets of the large-scale dataset are filtered according to Definition 3.

V. EXPERIMENTS
To prove the effectiveness and efficiency of the up-scaling approach, we conducted two groups of experiments.
The purpose of the first group of experiments is to compare the performance of the up-scaling algorithm against the dFIN algorithm [3] and the negFIN algorithm [4], which are currently the leading algorithms in the field of frequent itemset mining. In the second experiment, we select Apriori [1], a classic frequent itemset mining algorithm, as the baseline to verify the accuracy of the up-scaling algorithm.

A. DATA PREPROCESSING
Comparison experiments are assessed on five datasets, consisting of one synthetic dataset and four real datasets, described in Table 3. To obtain the small-scale datasets, we divide every dataset into four non-overlapping partitions in two ways. The first method partitions at equal intervals, i.e., the first partition consists of record 1, record 5, and so on, and the second partition consists of record 2, record 6, and so on; the second method directly quarters the data. Then frequent itemsets are mined from the partitions based on different minimum support thresholds and the parameter p.
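The two partitioning schemes can be written down directly; a minimal sketch (the record list and function names are invented):

```python
def interleaved(records, k=4):
    """First scheme: partition i gets records i, i+k, i+2k, ... ."""
    return [records[i::k] for i in range(k)]

def contiguous(records, k=4):
    """Second scheme: the data are directly cut into k equal chunks."""
    n = len(records) // k
    return [records[i * n:(i + 1) * n] for i in range(k)]

records = list(range(1, 9))  # records 1..8
print(interleaved(records))  # [[1, 5], [2, 6], [3, 7], [4, 8]]
print(contiguous(records))   # [[1, 2], [3, 4], [5, 6], [7, 8]]
```

Interleaving tends to give partitions with similar item distributions, while contiguous quartering preserves any ordering effects present in the original data.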

B. EXPERIMENTAL SETTINGS
We compare the performance, namely the runtime and memory consumption, of our method against the negFIN algorithm. To make a fair comparison, these algorithms have been run under the same hardware and software conditions. Our computer has the configuration of an Intel(R) Core(TM) i7 dual-core processor running at 2.8 GHz and 16 GB RAM, with the Windows 10 x64 Home operating system. All algorithms are coded in C/C++.

C. RUNTIME COMPARISON
The runtime comparison of up-scaling against negFIN and dFIN is shown in Fig. 6. In these figures, the X and Y axes are the minimum support threshold and the runtime, respectively. As expected, as the minimum support threshold increases, the execution time of all three algorithms decreases. However, except for the entree dataset, the running time of up-scaling varies very little on the other four datasets. The main reason is that the number of frequent itemsets of the small-scale datasets on these datasets is small, and the time overhead of our algorithm mainly depends on the frequent itemsets of the small-scale datasets, not on the original data. Up-scaling only faintly outperforms both algorithms on the entree dataset when the minimum support is set to 5%, because the number of frequent itemsets serving as our algorithm's input is roughly the same as the size of the raw data in that case. NegFIN runs faster than dFIN on the T40I10D100K dataset for lower minimum supports, and the two algorithms spend almost the same time on the other four datasets. The reason is that negFIN derives more NegNodesets than dFIN derives DiffNodesets on the T40I10D100K dataset, while their numbers are roughly the same on the other datasets.
As we can see in these figures, it is evident that up-scaling is more efficient than negFIN and dFIN. It should be noted that the runtime of up-scaling means the total execution time, i.e., the period between the input and the output of Algorithm 2.
Given different values of p, experiment 1 evaluates the performance of up-scaling for varying minimum support thresholds, where our method improves CPU performance by an average of 73 percent against negFIN and 75 percent against dFIN in our experiments. In addition, we also notice that for the two different data partitioning methods, the running time of up-scaling is almost the same.

D. MEMORY CONSUMPTION COMPARISON
Fig. 7 compares up-scaling with negFIN and dFIN. As we can see in this figure, the memory consumption of our algorithm is much less than that of negFIN and dFIN. This is because our algorithm mines the frequent itemsets of the large-scale dataset from the frequent itemsets of the small-scale datasets, while negFIN and dFIN mine the raw dataset directly; our algorithm only needs to construct the supportMatrix and est_supMatrix, which are the main components of up-scaling's memory consumption, whereas negFIN constructs a set_enumeration_tree and a frequent_itemset_tree [4] and dFIN constructs a PPC-tree [3]. Constructing the frequent_itemset_tree and the PPC-tree takes approximately the same space, so, as the figure shows, the memory consumption of those two algorithms is roughly the same. It is obvious that the frequent itemsets of these datasets are much smaller than the original datasets. In particular, the result for the 5% minimum support threshold on the entree dataset in Fig. 7 shows that up-scaling consumes much more memory than for the other minimum support thresholds; this is because, at 5%, the entree dataset produces many more frequent itemsets of the small-scale datasets than for the other thresholds. It should also be noted that for the two different data partitioning methods, the up-scaling algorithm consumes almost the same memory; therefore, one histogram depicts both cases in Fig. 7.

E. VALIDITY OF EFFECTIVENESS
In this part, we give the parameter p and the minimum support threshold six different values each with regard to the same dataset, as shown in Fig. 8. For each dataset and each partition method, we used up-scaling and Apriori to conduct 36 experiments, respectively. Fig. 9 gives the specific results on accidents in terms of the first partition method. The accuracy is expressed as the percentage |A|/|B|, where A and B are computed by up-scaling and Apriori, respectively. The accuracy on the other four datasets partitioned by the first method is also 100%, and those figures are omitted here. That is to say, up-scaling and Apriori discover the same frequent itemsets, which confirms that the result generated by up-scaling in our experiments is effective.
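The accuracy measure above can be computed directly; a minimal sketch with invented result sets:

```python
def accuracy(upscaling_result, apriori_result):
    """|A| / |B|: A from up-scaling, B from the Apriori baseline."""
    return len(upscaling_result) / len(apriori_result)

A = {frozenset({"I1"}), frozenset({"I2"}), frozenset({"I1", "I2"})}  # up-scaling
B = {frozenset({"I1"}), frozenset({"I2"}), frozenset({"I1", "I2"})}  # Apriori
print(accuracy(A, B))  # -> 1.0, i.e. 100% when both methods agree
```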
For the second partition method, the accuracy of our algorithm is shown in Figs. 10 to 14. It is noticeable that up-scaling is 100% accurate on two datasets, partially 100% on the other two datasets, and performs poorly on entree. We guess the reason is that the data distribution of entree is very uneven.

VI. CONCLUSION AND FUTURE RESEARCH DIRECTIONS
Based on the requirements of mining frequent itemsets from datasets of different scales, this paper proposes a new frequent itemset mining framework. It only needs to mine all frequent itemsets on the small-scale datasets and then generate the large-scale dataset's frequent itemsets from them, without searching for frequent itemsets in the large-scale dataset, thereby reducing operating costs. Because the number of frequent itemsets decreases as the minimum support threshold increases, our algorithm, whose input is the frequent itemsets of the small-scale datasets, is especially suitable for situations with a high threshold. Experimental results show that the framework is feasible and effective.
In the future, our research directions are as follows: (1) up-scaling the cluster centers in clustering tasks, (2) up-scaling function moving trends in regression analysis, and (3) solving some problems of industrial recommender systems based on collaborative filtering (CF) [25]-[27].