Efficient Discovery of Weighted Frequent Neighborhood Itemsets in Very Large Spatiotemporal Databases

Weighted Frequent Itemset (WFI) mining is an important model in data mining. It aims to discover all itemsets whose weighted sum in a transactional database is no less than a user-specified threshold value. Most previous works focused on finding WFIs in a transactional database and did not take into account the spatiotemporal characteristics of the items within the data. This paper proposes a more flexible model of Weighted Frequent Neighborhood Itemsets (WFNIs) that may exist in a spatiotemporal database. The discovered patterns can be very useful in many real-world applications. For instance, a WFNI generated from an air pollution database indicates a geographical region where people have been exposed to high levels of an air pollutant, say PM2.5. The generated WFNIs do not satisfy the anti-monotonic property. Two new measures have been presented to effectively reduce the search space and the computational cost of finding the desired patterns. A pattern-growth algorithm, called Spatial Weighted Frequent Pattern-growth, has also been presented to find all WFNIs in a spatiotemporal database. Experimental results demonstrate that the proposed algorithm is efficient. We also describe a case study in which our model has been used to find useful information in an air pollution database.


I. INTRODUCTION
Frequent Itemset Mining (FIM) is an important data mining model [1]-[3] with many real-world applications [4]. FIM aims to discover all itemsets in a transactional database that satisfy the user-specified minimum support (minSup) constraint. The minSup controls the minimum number of transactions that an itemset must cover in the data. Since only a single minSup is used for the whole data, the model implicitly assumes that all items within the data have uniform frequencies. However, this is seldom the case in real-world applications: some items appear very frequently in the data, while others appear rarely. If the frequencies of items vary a great deal, we encounter the following two problems: 1) If minSup is set too high, we miss the itemsets that involve rare items in the data. 2) To find the itemsets that include both frequent and rare items, we have to set minSup very low. However, this may cause a combinatorial explosion, producing too many itemsets, because frequent items associate with one another in all possible ways and many of the resulting itemsets are meaningless depending on the user or application requirements.
This dilemma is known as the rare item problem [5]. When confronted with this problem in real-world applications, researchers have tried to find frequent itemsets using multiple minSups [6], [7], where the minSup of an itemset is expressed in terms of the minimum item supports of its items. An open problem in this extended model is determining appropriate minimum item supports for the items.
Cai et al. [8] introduced Weighted Frequent Itemset Mining (WFIM) to address the rare item problem. WFIM takes into account the weights (or importance) of items and tries to find all Weighted Frequent Itemsets (WFIs) that satisfy the user-specified weight constraint in a transactional database. Several weight constraints (e.g., weighted sum, weighted support, and weighted average) have been discussed in the literature to determine the interestingness of an itemset in a transactional database. Selecting an appropriate weight constraint depends on the user or application requirements. Some of the practical applications of WFIM include market-basket analytics [8], spectral signature analytics in astronomical databases [9], and finding events in Twitter data [10]. This paper argues that though studies on WFIM consider the importance of items within the data, they disregard the spatiotemporal characteristics of an item. Consequently, WFIM is inadequate to find only those WFIs whose items are close (or neighbors) to one another in a spatiotemporal database. A naïve approach to tackle this problem involves discovering all WFIs from the data and pruning the WFIs whose items are not neighbors of each other. Unfortunately, this approach is inefficient due to its huge search space and computational cost. With this motivation, this paper introduces the model of Weighted Frequent Neighborhood Itemsets (WFNIs) that may exist in a spatiotemporal database. Before we describe the contributions of this paper, we discuss the usefulness of the proposed itemsets with a real-world application.
Air pollution is a significant factor in many cardiorespiratory problems found in people living in Japan. In this context, the Atmospheric Environmental Regional Observation System (AEROS), consisting of several monitoring stations, has been set up by the Ministry of Environment, Japan. The data generated by these stations represent a non-binary spatiotemporal database. A WFNI found in this pollution database provides information about a geographical region (or a set of neighboring stations) where people have been exposed to high levels of an air pollutant. This information is useful for the users of the pollution control board in devising appropriate policies to control industrial emissions.
High Utility Itemset Mining (HUIM) [11]-[13] generalizes WFIM (respectively, FIM) by taking into account the items' internal utility and external utility values. However, discovering WFIs (respectively, frequent itemsets) using a HUIM algorithm is inefficient due to the additional cost of transforming a binary spatiotemporal database into a non-binary spatiotemporal database. (This topic is further discussed in later parts of this paper.)
This paper proposes a more flexible model of WFNIs that may exist in a spatiotemporal database. An itemset in a spatiotemporal database is considered a WFNI if it satisfies the user-specified minimum weighted sum and maximum distance constraints. The generated WFNIs do not satisfy the anti-monotonic property. Two upper bound measures, called estimated weighted sum (EWS) and cumulative neighborhood weighted sum (CNWS), have been employed to reduce the search space and the computational cost of finding the desired itemsets. EWS aims to identify candidate items whose supersets may be WFNIs. CNWS seeks to identify those items that have to be projected (i.e., for which conditional pattern bases must be built) to find all WFNIs. A pattern-growth algorithm, called Spatial Weighted Frequent Pattern-growth (SWFP-growth), has also been presented to find all WFNIs in a spatiotemporal database efficiently. Experimental results demonstrate that SWFP-growth is not only memory and runtime efficient but also scalable. We also describe a case study in which we apply our model to find useful information in an air pollution database.
Reddy et al. [14] proposed the model of WFNIs by treating items as points. This paper generalizes the model of WFNIs by taking into account items of any geometric form (e.g., point, line, or polygon). We also prove the correctness of our algorithm. Furthermore, we strengthen the paper with extensive experiments and describe a real-world application of the proposed model using an air pollution database.
The remainder of this paper is organized as follows. Section 2 discusses the previous literature related to the problem. Section 3 introduces the proposed model of WFNIs that may exist in a spatiotemporal database. Section 4 describes the SWFP-growth algorithm. Experimental results are reported in Section 5. Section 6 concludes the paper with future research directions.

II. RELATED WORK

A. FREQUENT ITEMSET MINING
Frequent itemsets are an important class of regularities that exist in databases. Since it was first introduced in [2], the problem of finding these itemsets has received a great deal of attention. Several algorithms (e.g., Apriori [2], ECLAT [15] and Frequent Pattern-Growth (FP-growth) [3], [16]) have been described in the literature to find frequent itemsets. Though there exists no universally acceptable best algorithm to find frequent itemsets in any database, FP-growth is widely accepted as the best algorithm to mine frequent itemsets in real-world databases [17]. Consequently, several extensions of FP-growth using GPUs, disks and parallel processing have been discussed to find frequent itemsets efficiently.
FP-growth is a depth-first search algorithm that discovers frequent patterns using pattern-growth technique. The pattern-growth technique briefly involves the following two steps: (i) compress the database into a tree, and (ii) recursively mine the entire tree to find all frequent itemsets. We also employ a pattern-growth based algorithm to find all WFNIs in a spatiotemporal database. However, it has to be noted that the tree structure and the mining procedure of our algorithm are different from that of the FP-growth algorithm.

B. WEIGHTED ITEMSET MINING
Cai et al. [8] introduced WFIM to address the rare item problem in FIM. Two Apriori algorithms, called MinWAL(O) and MinWAL(M), have been discussed for finding WFIs in a transactional database. Unfortunately, both algorithms suffer from the performance issues involving multiple database scans and the generation of too many candidate itemsets. Yun and John [18] discussed a pattern-growth algorithm, called WFIM, to find the weighted frequent itemsets. Uday et al. [10] described an improved WFIM based on the concept of cutoff weight, which represents the maximum weight among all weighted items.
Cai et al. [9] used a variant of WFIM algorithm to find weighted frequent itemsets in an astronomical database. An entropy-based weighting function has been employed to determine the interestingness of an itemset.
In the literature, researchers have studied WFIM by taking into account other parameters. Tao et al. [19] proposed a weighted association rule model by taking into account the weight of a transaction. An Apriori-like algorithm, called WARM (Weighted Association Rule Mining), was discussed to find the itemsets. Vo et al. [20] proposed a Weighted Itemset Tidset tree (WIT-tree) for mining the itemsets and used a Diffset strategy to speed up the computation. Lin et al. [21] studied the problem of finding weighted frequent itemsets by taking into account the occurrence times of the transactions. The discovered itemsets are known as recency weighted frequent itemsets. Furthermore, Lin et al. [22] extended the basic weighted frequent itemset model [8] to handle uncertain databases. Chowdhury et al. [23] discussed a weighted frequent itemset model under the assumption that the weights of items can vary with time and proposed the AWFPM (Adaptive Weighted Frequent Pattern Mining) algorithm. Please note that though some of the above studies consider the temporal occurrence information of items within the data, they completely disregard the spatial information of the items. In contrast, the proposed study investigates the problem of finding WFNIs in spatiotemporal databases by taking into account the spatiotemporal characteristics of the items within the data.

C. HIGH UTILITY ITEMSET MINING
Yao et al. [13] introduced HUIM by taking into account the items' internal utility (i.e., the number of occurrences of an item within a transaction) and external utility (i.e., the weight of an item in the database) values. Since then, the problem of finding HUIs from the data has received a great deal of attention [11], [12], [24], [25]. As HUIM generalizes WFIM (respectively, FIM), WFIs (respectively, FIs) can be generated using a HUIM algorithm. This paper argues that such an approach to finding WFIs using HUIM algorithms is inefficient for two main reasons: 1) To employ a HUIM algorithm, we need to transform the binary transactional database into a non-binary transactional database by adding one as the internal utility of every item in a transaction. Transforming a huge binary database into a non-binary database is a costly operation in terms of both memory and runtime.
2) The size of the resultant non-binary transactional database is substantially larger (approximately 1.5 to 2 times) than that of the binary database. Consequently, HUIM algorithms have to find WFIs from much larger databases, consuming more memory and runtime.
In practice, a WFIM algorithm (respectively, FIM algorithm) is generally faster than a HUIM algorithm for mining WFIs (respectively, FIs) in a binary transactional database, because it is more optimized for that specific problem. Uday et al. [26] discussed an algorithm, called Spatial High Utility Itemset Miner (SHUIMiner), to find all spatial high utility itemsets in a non-binary spatiotemporal database. Unfortunately, finding the proposed WFNIs using SHUIMiner turns out to be costly for the above-mentioned reasons.

D. SPATIAL CO-OCCURRENCE ITEMSET MINING
The problem of finding spatiotemporal co-occurrence itemsets (or association rules) in spatiotemporal databases has received a great deal of attention [27]-[30]. These algorithms can be broadly classified into distance-based approaches [27], [28] and transaction-based approaches [29], [30]. A distance-based approach typically uses a parameter, called the prevalence, to determine how interesting the spatiotemporal co-occurrences are in the data. A transaction-based approach initially clusters the data over space and time and then applies traditional association rule mining algorithms on each cluster to find useful information. Unfortunately, all spatiotemporal co-occurrence itemset mining algorithms determine the interestingness of an itemset by taking into account only its support and disregard the internal and external utility values of an item. Moreover, most of these algorithms cannot handle numeric data. In contrast, the proposed model considers the internal and external utility values of an item and handles numeric data.
Overall, the proposed model of finding WFNIs in a spatiotemporal database is novel and distinct from current studies.

III. PROPOSED MODEL
Without loss of generality, a spatiotemporal database can be represented as a spatial database and a temporal database. For brevity, we first describe the neighborhood itemset using a spatial database. Next, we introduce weighted frequent neighborhood itemset using a temporal database and items' weight database.

A. NEIGHBORHOOD ITEMSET
Let I = {i_1, i_2, ..., i_n}, n ≥ 1, be a set of geometric (or spatial) items. Let P_{i_j} denote the set of coordinates of an item i_j ∈ I. The spatial database SD is a collection of items and their coordinates. That is, SD = {(i_1, P_{i_1}), (i_2, P_{i_2}), ..., (i_n, P_{i_n})}. This notion of a spatial database allows us to capture items of various geometric forms, such as points, lines, or polygons. Two items, i_p, i_q ∈ I, are said to be neighbors of each other if Dist(P_{i_p}, P_{i_q}) ≤ maxDist, where Dist is a distance function and maxDist is the user-specified maximum distance. An example spatial database is shown in Table 1a.
Definition 1. (Neighborhood itemset.) Let X ⊆ I be an itemset (or a pattern). If X contains k items, then it is called a k-itemset. An itemset X in SD is said to be a neighborhood itemset if the maximum distance between any two of its items is no more than the user-specified maxDist. That is, max(Dist(i_p, i_q) | ∀ i_p, i_q ∈ X) ≤ maxDist.

Example 2. The set of the items c and d, i.e., cd, is an itemset. This itemset contains two items; therefore, it is a 2-itemset. The itemset cd is also a neighborhood itemset because max(Dist(c, d)) ≤ maxDist.
Several distance functions (e.g., Euclidean distance and geodesic distance) have been described in the literature to compute the distance between items. Selecting an appropriate distance function depends on the user and/or application requirements. In our example, we have represented spatial items as points and employed the Euclidean distance for brevity. However, our model is generic and can be employed with any distance function that satisfies the commutative property (see Property 1) and the anti-monotonic property (see Property 2). We now define the weighted frequent neighborhood itemset using a temporal database and an items' weight database.
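As a concrete illustration, the neighbor test of Definition 1 for point items can be sketched as follows. This is a minimal sketch; the item names, coordinates, and maxDist value below are toy values, not the ones in Table 1a.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two point items."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def is_neighborhood_itemset(itemset, coords, max_dist, dist=euclidean):
    """Check Definition 1: every pair of items must be within max_dist."""
    items = list(itemset)
    return all(dist(coords[items[i]], coords[items[j]]) <= max_dist
               for i in range(len(items))
               for j in range(i + 1, len(items)))

# Toy spatial database with point items and maxDist = 5 (illustrative values)
coords = {'c': (0, 0), 'd': (3, -4), 'g': (40, 0)}
print(is_neighborhood_itemset({'c', 'd'}, coords, 5))  # True: Dist(c, d) = 5
print(is_neighborhood_itemset({'c', 'g'}, coords, 5))  # False: Dist(c, g) = 40
```

Any other commutative, anti-monotonic distance function (e.g., geodesic distance) can be passed via the `dist` parameter without changing the check.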
Property 2. (Anti-monotonic property). If X ⊂ Y , then the maximum distance between any two items in X will always be less than or equal to the maximum distance between any two items in Y . That is, max(Dist(i p , i q )|∀i p , i q ∈ X) ≤ max(Dist(i r , i s )|∀i r , i s ∈ Y ).

B. WEIGHTED FREQUENT NEIGHBORHOOD ITEMSET
A transaction, denoted as T_ts = (ts, Y), consists of a timestamp ts ∈ R+, which represents the transactional identifier of the transaction, and an itemset Y ⊆ I. A (binary) temporal database, denoted as TDB, is an ordered set of transactions.

Example 3. Continuing with the previous example, a temporal database generated by all items in Table 1a is shown in Table 1c. The items' weight database is shown in Table 1d. Each transaction in this database represents the measurement of an air pollutant, say PM2.5, determined by a weather station for a particular time period. The weight of the item c in the second transaction is w(c, T_2) = 30. In other words, the station c has recorded 30 µg/m³ of PM2.5 at the timestamp of 2.

The set of transactions containing an itemset X in TDB is denoted as TDB^X. For instance, the set of transactions containing cd in Table 1c is TDB^cd = {T_4, T_5}. The support of cd in Table 1c is S(cd) = |TDB^cd| = 2. The weighted sum of an itemset X, denoted as WS(X), is the sum of the weights of X's items over all transactions in TDB^X.

Definition 5. (Weighted frequent neighborhood itemset X.) A neighborhood itemset X is said to be a weighted frequent neighborhood itemset if WS(X) ≥ minWS, where minWS represents the user-specified minimum weighted sum.
Example 7. If the user-specified minWS = 150, then the neighborhood itemset cd is a weighted frequent neighborhood itemset because WS(cd) ≥ minWS. The complete set of WFNIs generated from Tables 1a, 1c, and 1d is shown in Table 1e.
Definition 6. (Problem definition.) Given a temporal database (TDB), an items' weight database (WD), and an items' spatial database (SD), the problem of Weighted Frequent Neighborhood Itemset mining involves discovering all itemsets in TDB whose weighted sum is no less than the user-specified minimum weighted sum (minWS) and in which the distance between any two items is no more than the user-specified maxDist. It is interesting to note that WFIM is a special case of WFNIM when maxDist = ∞ (or is very large).
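The weighted-sum side of this problem definition can be sketched as follows. Transactions are modeled here as per-timestamp item-to-weight maps, and all numbers are illustrative toy values, not the contents of Table 1c.

```python
def weighted_sum(itemset, tdb):
    """WS(X): sum of the weights of X's items over all transactions containing X."""
    return sum(sum(weights[i] for i in itemset)
               for weights in tdb.values()
               if itemset <= weights.keys())

# Toy temporal database: timestamp -> {item: weight} (illustrative values)
tdb = {
    4: {'c': 40, 'd': 35, 'e': 10},
    5: {'c': 45, 'd': 40},
    6: {'a': 20},
}
print(weighted_sum({'c', 'd'}, tdb))  # (40 + 35) + (45 + 40) = 160
```

With minWS = 150, the itemset cd would qualify on the weighted-sum side here; the maxDist constraint is checked separately against the spatial database.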

C. A SMALL DISCUSSION.
In our model, we have set a strict constraint that all items in a WFNI must be close (or neighbors) to one another. If we relax this constraint, then too many uninteresting itemsets, with items far away from the rest, can be generated as WFNIs. Example 8 illustrates the importance of employing a strict spatial constraint on WFNIs.
Example 8. Let l = (0, 0), m = (2, 0), n = (4, 0) and o = (6, 0) be four items located on a straight line, and let maxDist = 2. If we relax the constraint so that all items in a WFNI need not be close to each other, then we may find lmno as a WFNI. Unfortunately, this itemset may be uninteresting to the user because the items n and o are located far away from l.
To reduce the number of input parameters, the proposed model does not determine the interestingness of an itemset using the minSup constraint. However, if an application demands it, the user can employ minSup as an additional constraint to find WFNIs. Please note that no significant changes are needed in our SWFP-growth algorithm, as it inherently records the support information of an itemset.

Algorithm 1 SWFP-tree (TDB: temporal database, I: items in the database, SD: spatial database, WD: weight database, minWS: minimum weighted sum, maxDist: maximum distance)
1: Scan the spatial database SD and identify the neighbors of each item i_j in I. Let N(i_j) denote the neighbors of item i_j.
2: Scan the temporal database TDB and calculate the EWS, WS, and minimum weight values of each item i_j in I. Prune all items in I whose EWS is less than the user-specified minWS. Consider the remaining items in I as candidate items and sort them in descending order of their EWS values. Let L denote this sorted list of candidate items.
3: Create the root node of the SWFP-tree T and label it "null". Scan the temporal database TDB for the second time and update the SWFP-tree as follows. For each transaction T_ts ∈ TDB, identify the candidate items in T_ts and sort them in L order. Let the sorted candidate item list in T_ts be [p|P], where p is the first element and P is the remaining list. Call insert_tree([p|P], T), which is performed as follows. If T has a child N such that N.item-name = p.item-name, then increment the N.support value by 1, calculate the OEWS value of p in T_ts, and add this value to the existing N.oews value. If T has no such child, then create a new node N, set its support count to 1, calculate the OEWS value of p in T_ts, and set this value as N.oews. Next, N's parent link is set to T, and its node-link is linked to the nodes with the same item-name via the node-link structure. If P is non-empty, call insert_tree(P, N) recursively.
Algorithm 2 SWFP-growth
1: input: T^X: SWFP-tree, H^X: header table of T^X, X: an itemset
2: output: all candidate weighted frequent itemsets in T^X
3: for each item a_i ∈ H^X do
4: if WS(Y) + CNWS(a_i) is no less than minWS, where Y = X ∪ a_i, then construct Y's conditional pattern base, consisting of only the neighbors of a_i. Next, recalculate each node's oews value. Consider the items whose oews value is no less than minWS as candidate items in Y-CPB and put them in H^Y. Readjust the oews values of the items by removing the non-candidate items.

IV. THE SWFP-GROWTH ALGORITHM

The space of items in a database gives rise to a subset lattice. The itemset lattice is a conceptualization of the search space when mining WFNIs. The itemset lattice of the items a, b and c is shown in Figure 1. The proposed SWFP-growth performs a depth-first search on this itemset lattice to find all WFNIs in the data. The main reason for choosing the pattern-growth technique is that algorithms based on this technique can easily be extended to disk-based and parallel algorithms [31]. In this paper, we confine ourselves to the sequential memory-based pattern-growth algorithm.
In this section, we first introduce the basic idea of the SWFP-growth algorithm. Next, we describe the working of SWFP-growth using the database shown in Table 1c.

A. BASIC IDEA
The weighted sum of an ordered itemset can be greater than, less than, or equal to the weighted sum of its ordered superset (see Property 3). Consequently, the WFNIs generated from the data do not satisfy the convertible anti-monotonic, convertible monotonic, or convertible succinct properties [32]. This increases the search space, which in turn increases the computational cost of finding the WFNIs. Two upper bound measures, called optimized estimated weighted sum (OEWS) and cumulative neighborhood weighted sum (CNWS), have been presented to reduce the search space and the computational cost. These two measures aim to identify itemsets (or items) whose supersets may yield WFNIs. We now describe each of these measures.

1) Optimized estimated weighted sum
The key objective of the OEWS measure is to identify items whose supersets may yield WFNIs. Items whose OEWS value is no less than the user-specified minWS are called candidate items. Definitions 7 and 8 define the estimated weighted sum (EWS) of an itemset in a transaction and in a temporal database, respectively. Definitions 9 and 10 respectively define candidate items and candidate itemsets.

Definition 7. (Estimated weighted sum of an item in a transaction.) The estimated weighted sum (EWS) of an item i_j in a transaction T_ts, denoted as EWS(i_j, T_ts), represents the sum of the weights of i_j and its neighboring items in T_ts. That is, EWS(i_j, T_ts) = w(i_j, T_ts) + Σ_{i_k ∈ T_ts.Y ∩ N(i_j)} w(i_k, T_ts).
Example 9. Consider the item a in Table 1c. The neighbors of a are N(a) = {b, c, e} (see Table 1b). The estimated weighted sum of a in T_1 is the sum of the weights of a and its neighboring items in T_1. That is, EWS(a, T_1) = w(a, T_1) + w(b, T_1) = 20 + 15 = 35. Please note that the weights of the remaining items in T_1 (i.e., g and f) are not used in the calculation of EWS(a, T_1), because these two items are not neighbors of a. The above definition of EWS captures the maximum weighted sum of a and its neighboring items in a transaction. We now extend this definition by taking into account a set of transactions (or a temporal database).
Example 10. The EWS of a in T_1 is EWS(a, T_1) = 35. Similarly, EWS(a, T_2) = 35 and EWS(a, T_6) = 85. The EWS of a in the entire database is EWS(a) = EWS(a, T_1) + EWS(a, T_2) + EWS(a, T_6) = 35 + 35 + 85 = 155. In other words, EWS(a) provides the information that the item a, together with all its neighboring items, has resulted in a maximum weighted sum of 155 µg/m³ in the entire database. Henceforth, this value can be used as an upper-bound constraint to identify candidate items whose supersets may yield WFNIs. The above definition captures the maximum weighted sum that an item and its supersets (consisting of its neighboring items) can have in the entire spatiotemporal database. Thus, EWS acts as an upper bound on the weighted sums of the items. For an item i_j ∈ I, if EWS(i_j) < minWS, then neither i_j nor its supersets will result in WFNIs. Hence, only those items whose EWS is no less than minWS can generate WFNIs at higher orders. We call these items candidate items, as defined in Definition 9.
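The EWS computation above can be sketched as follows. This is a minimal sketch; the neighbor set of a follows Table 1b, but the transactions and weights are toy values loosely modeled on Example 9, not the full Table 1c.

```python
def ews(item, tdb, neighbors):
    """EWS(i): over all transactions containing i, sum w(i, T) plus the
    weights of i's neighbors that appear in the same transaction."""
    total = 0
    for weights in tdb.values():
        if item in weights:
            total += weights[item] + sum(w for k, w in weights.items()
                                         if k != item and k in neighbors[item])
    return total

# Toy data loosely modeled on Example 9 (not the paper's Table 1c)
neighbors = {'a': {'b', 'c', 'e'}}
tdb = {
    1: {'a': 20, 'b': 15, 'f': 20, 'g': 20},  # f, g ignored: not neighbors of a
    2: {'a': 20, 'c': 15},
    3: {'f': 30, 'g': 10},                    # a absent: contributes nothing
}
print(ews('a', tdb, neighbors))  # (20 + 15) + (20 + 15) = 70
```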
Definition 9. (Candidate item.) An item i j in T DB is said to be a candidate item if EW S(i j ) ≥ minW S.
Example 11. Continuing with the previous example, the item a in Table 1c is a candidate item because EWS(a) ≥ minWS.

We now generalize the above definition by taking into account the notion of an itemset. This generalization facilitates us to push the above pruning technique to the lower levels of the itemset lattice. Let α be an itemset and TDB^α denote its projected database. If EWS(i_j) + WS(α) ≥ minWS, then α ∪ i_j is a candidate itemset (or i_j is a candidate item in TDB^α). Otherwise, i_j is an uninteresting item that can be pruned from TDB^α. The proposed SWFP-growth employs the above definition to identify candidate itemsets whose supersets may yield WFNIs.

Example 12. Consider the first transaction T_1 in Table 1c. The lexicographically sorted order of the items in this transaction is abfg. Let us consider the item g, which is the last item in the sorted transaction. The conditional pattern base of g is g-CPB = {abf} ∩ N(g) = {abf} ∩ {f} = {f}. Therefore, the OEWS of g in T_1 is OEWS(g, T_1) = w(g, T_1) + w(f, T_1) = 20 + 20 = 40. Similarly, for the item f, f-CPB = {ab} and N(f) = {d, g}. Since none of the items in f-CPB are neighbors of f, the OEWS of f in T_1 is OEWS(f, T_1) = w(f, T_1) = 20.

Property 5. For an itemset X, EWS(X, T_k) ≥ OEWS(X, T_k). In other words, OEWS is a tighter constraint than EWS.
SWFP-growth employs the EWS measure to find candidate items. After finding the candidate items and sorting them in descending order of their EWS values, the items' OEWS values in every transaction are used to find candidate itemsets effectively.

2) Cumulative neighborhood weighted sum
The candidate items comprise both weighted frequent items and uninteresting items whose supersets may generate WFNIs. We have observed that constructing projected databases (or conditional pattern bases) for all uninteresting items is a costly operation. In this context, we exploit another weight upper bound measure, called the cumulative neighborhood weighted sum (CNWS), to identify those candidate items whose projections can yield WFNIs.

Definition 12. (Cumulative neighborhood weighted sum)
Let S = {i_1, i_2, ..., i_k} ⊆ I be an ordered list of candidate items such that EWS(i_1) ≤ EWS(i_2) ≤ ... ≤ EWS(i_k). The cumulative neighborhood weighted sum of an item i_j ∈ S, denoted as CNWS(i_j), is the sum of the weighted sums of the remaining items in the list that are neighbors of i_j. That is, CNWS(i_j) = Σ_{p=j+1}^{|S|} WS(i_p), where i_p ∈ N(i_j). For the last item in S, CNWS(i_k) = 0.
Example 13. Let us order the candidate items in increasing order of their EWS values. In this order, the candidate items are a, e, c, b and d. Let us consider the item a, which is the first item in this order. The neighbors of this item are b, c and e (see Table 1b). Thus, the item a can generate WFNIs only by combining with the items b, c and e, and the cumulative neighborhood weighted sum of a is CNWS(a) = WS(b) + WS(c) + WS(e) = 365. The CNWS of a provides the crucial information that the item a and its supersets containing only a's neighboring items can at most have a weighted sum of 365 in the entire database. This information can be used to determine whether a suffix item in the tree needs to be projected or not. If the sum of the weighted sum of a suffix itemset and its CNWS is less than the user-specified minWS, then we can skip the depth-first search (or the construction of conditional pattern bases) from that itemset, thereby significantly reducing the search space.
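Definition 12 can be sketched as follows. The item order matches Example 13, but the individual WS values and the neighbor sets of e, c, b and d are hypothetical, chosen only so that CNWS(a) works out to 365 as in the example.

```python
def cnws(order, ws, neighbors):
    """CNWS(i_j): given candidate items in increasing-EWS order, sum the WS
    of the later items in the list that are neighbors of i_j."""
    return {item: sum(ws[other] for other in order[j + 1:]
                      if other in neighbors[item])
            for j, item in enumerate(order)}

# Candidate items in increasing EWS order (Example 13); WS values and the
# neighbor sets of e, c, b, d are hypothetical illustrations.
order = ['a', 'e', 'c', 'b', 'd']
ws = {'a': 150, 'e': 120, 'c': 125, 'b': 120, 'd': 80}
neighbors = {'a': {'b', 'c', 'e'}, 'e': {'b', 'c'},
             'c': {'b', 'd'}, 'b': {'d'}, 'd': set()}
values = cnws(order, ws, neighbors)
print(values['a'])  # WS(e) + WS(c) + WS(b) = 120 + 125 + 120 = 365
print(values['d'])  # 0: the last item in the order
```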

B. SWFP-GROWTH
The proposed SWFP-growth algorithm is presented in Algorithms 1 and 2. Briefly, the SWFP-growth algorithm involves the following steps: (i) finding candidate items, (ii) constructing the Spatial Weighted Frequent Pattern-tree (SWFP-tree) by compressing the spatiotemporal database using the candidate items, (iii) recursively mining the SWFP-tree to find all candidate itemsets, and (iv) finding all WFNIs from the candidate itemsets by performing another scan on the spatiotemporal database. Before we explain each of these steps, we describe the structure of the SWFP-tree.

1) Structure of SWFP-tree
In SWFP-tree, each node N includes N.name, N.support, N.oews, N.parent, N.hlink and a set of child nodes. The details are as follows. N.name is the item name of the node. N.support represents the support of an item in node N . N.oews represents the OEW S value of an item in node N . N.parent records the parent node of the node. N.hlink is a node link which points to a node whose item name is the same as N.name.
A header table is employed to facilitate the traversal of the SWFP-tree. Each entry in this table is composed of an item name, an OEWS value, and a link. The link points to the last occurrence of a node in the SWFP-tree that has the same item name as the entry. By following the links in the header table and the node-links in the SWFP-tree, the nodes with the same item name can be traversed efficiently.
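The node structure described above can be sketched as follows. This is a minimal sketch: the field names follow the text, but the `insert_transaction` helper is illustrative and omits the hlink and header-table maintenance.

```python
class SWFPNode:
    """One node of the SWFP-tree, following the fields described above."""
    def __init__(self, name, parent=None):
        self.name = name        # item name
        self.support = 0        # support count of the item at this node
        self.oews = 0           # accumulated OEWS value at this node
        self.parent = parent    # parent node
        self.hlink = None       # node-link to the next node with the same name
        self.children = {}      # item name -> child SWFPNode

def insert_transaction(root, sorted_items):
    """Insert one sorted transaction; sorted_items = [(item, oews), ...]."""
    node = root
    for name, oews in sorted_items:
        child = node.children.get(name)
        if child is None:
            child = SWFPNode(name, parent=node)
            node.children[name] = child
        child.support += 1
        child.oews += oews
        node = child

root = SWFPNode(None)
insert_transaction(root, [('b', 15), ('a', 35)])  # transaction "1: ba"
insert_transaction(root, [('c', 30), ('a', 35)])  # transaction "2: ca"
print(root.children['b'].children['a'].oews)  # 35
```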

2) Finding candidate items
In the first database scan, we calculate the EWS, WS, and minimum weight of each item in the database TDB. The calculated EWS values for all items in Table 1c are shown in Fig. 2(a). From these items, the candidate items are generated by pruning all items whose EWS value is less than the user-specified minWS. The candidate items are then sorted in descending order of their EWS values. Let this sorted list of candidate items be denoted as L. The sorted list of candidate items generated from Table 1c for the user-specified minWS = 150 is shown in Fig. 2(b). (The above process can be repeated until no more items are pruned from the temporal database. However, for computational reasons, we recommend limiting this step to a single scan of the database.)
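The pruning and sorting step above can be sketched as follows. The EWS values below are hypothetical (only EWS(a) = 155 comes from the running example); they are chosen to be consistent with the increasing order a, e, c, b, d used in Example 13.

```python
def candidate_items(ews_values, min_ws):
    """Prune items with EWS < minWS; sort survivors by descending EWS (list L)."""
    return sorted((item for item, v in ews_values.items() if v >= min_ws),
                  key=lambda item: -ews_values[item])

# Hypothetical EWS values; only EWS(a) = 155 is taken from the running example
ews_values = {'a': 155, 'b': 210, 'c': 180, 'd': 230, 'e': 175, 'f': 90, 'g': 60}
print(candidate_items(ews_values, 150))  # ['d', 'b', 'c', 'e', 'a']; f, g pruned
```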

3) Construction of SWFP-tree
Using the generated candidate items, we scan the temporal database for the second time and generate the SWFP-tree by following a procedure similar to that of the FP-tree construction. It has to be noted that we maintain both the support and the OEWS value of an item at each node.
The sorted transactional database consisting of only candidate items is shown in Fig. 2(c). The scan of the first sorted transaction, "1: ba", generates a branch ⟨b:1:15, a:1:35⟩ (format: ⟨item : support : OEWS⟩). Fig. 3(a) shows the branch generated after scanning the first transaction.
The scan of the second sorted transaction, "2: ca", generates another branch ⟨c:1:30, a:1:35⟩ (see Fig. 3(b)). A similar process is repeated for the remaining transactions, and the SWFP-tree is updated accordingly. The tree constructed after scanning the last transaction is shown in Fig. 3(c). To facilitate tree traversal, an item header table is built so that each item points to its occurrences in the tree via a chain of node-links. The final SWFP-tree generated after scanning the entire temporal database is shown in Fig. 3(d).

4) Mining SWFP-tree

After constructing the SWFP-tree, we start with the last item in the header table. Choosing this item as a suffix itemset, we determine its CNWS. If the sum of the weighted sum of the suffix item and its CNWS value is no less than the user-specified minWS, then we construct its conditional pattern base, consisting of the neighboring items of the suffix itemset, construct its conditional SWFP-tree, and generate all candidate itemsets. Otherwise, we skip the construction of the conditional pattern base and move to the next item in the header table. A similar process is repeated for the other items in the header table.
4) Mining of SWFP-tree
Mining of the SWFP-tree is summarized in Table 3 and proceeds as follows. We first consider a, which is the last item in the SWFP-list. Item a occurs in three branches of the SWFP-tree of Fig. 3(d). Its conditional pattern base and conditional SWFP-tree are then constructed. From this conditional SWFP-tree, we generate eb, ec and e as candidate itemsets. A similar process is performed for the remaining items in the SWFP-list of Fig. 3(d) to find all candidate itemsets. The complete set of candidate itemsets generated from Fig. 3(d) is {a, eb, ec, e, c, cd, b, bd}. The correctness of finding all candidate itemsets is shown in Theorem 13.
Theorem 13. Let α be a suffix itemset in the SWFP-tree, let B be α's conditional pattern base, and let β be an item in B. If OEWS(α, β) + WS(α) ≥ minWS, then αβ is a candidate itemset.

Proof 14. According to the definitions of the conditional pattern base and the compact SWFP-tree, each subset in B occurs under the condition of the occurrence of α in the temporal database. If an item β appears in B, then β appears together with α. Thus, αβ is a candidate itemset if OEWS(α, β) + WS(α) ≥ minWS. Hence proved.
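The check of Theorem 13 translates directly into a filter over a conditional pattern base. In this sketch, the conditional pattern base is assumed to be given as a mapping from each item β to OEWS(α, β); the suffix, its WS value and the toy numbers are hypothetical:

```python
def candidate_extensions(cond_base, ws_alpha, min_ws):
    """Theorem 13 filter: keep each item beta in the conditional pattern
    base of suffix alpha for which OEWS(alpha, beta) + WS(alpha) >= minWS."""
    return {beta for beta, oews in cond_base.items()
            if oews + ws_alpha >= min_ws}

# hypothetical conditional pattern base for a suffix alpha with WS(alpha) = 60
base = {'b': 100, 'c': 95, 'd': 40}
print(candidate_extensions(base, 60, 150))  # → {'b', 'c'} (d: 40 + 60 < 150)
```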

5) Generating all WFNIs from candidate itemsets
After finding all candidate itemsets from the SWFP-tree, we perform a third scan of the database and calculate the actual weighted support of each candidate itemset. Every candidate itemset whose weighted support is no less than the user-specified minWS is generated as a WFNI. The complete set of WFNIs generated from Table 1c for the user-specified minWS of 150 is shown in Table 1e.
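The final filtering step can be sketched as follows. This sketch assumes a fixed positive weight per item; the paper's weighted support for neighborhood itemsets may instead be defined over neighborhood-constrained occurrences. The transactions and weights below are toy values, not Table 1c:

```python
# Sketch of the third database scan: compute actual weighted support
# and keep only candidates that reach minWS.

def weighted_support(itemset, transactions, weights):
    """Sum of the itemset's item weights over every transaction that
    contains the whole itemset (fixed per-item weights assumed)."""
    return sum(sum(weights[i] for i in itemset)
               for t in transactions if set(itemset) <= set(t))

def filter_wfnis(candidates, transactions, weights, min_ws):
    """Keep only candidate itemsets whose weighted support reaches minWS."""
    return [c for c in candidates
            if weighted_support(c, transactions, weights) >= min_ws]

# toy data (hypothetical)
tdb = [['b', 'a'], ['c', 'a'], ['a', 'b', 'c']]
w = {'a': 35, 'b': 15, 'c': 30}
print(filter_wfnis([['a'], ['a', 'b'], ['c']], tdb, w, 90))
# → [['a'], ['a', 'b']]  (c appears twice: 30 + 30 = 60 < 90)
```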

V. EXPERIMENTAL RESULTS
Since there exists no prior algorithm to mine WFNIs in a binary spatiotemporal database, we evaluate only the proposed algorithm on various databases. We show that our algorithm is not only memory- and runtime-efficient, but also scalable.

A. EXPERIMENTAL SETUP
The SWFP-growth algorithm was written in Java and executed on an Intel i7 1.5 GHz processor with 8 GB of memory. The experiments were conducted using synthetic (T10I4D100K) and real-world (Retail, Chess and PM2.5) databases.
The T10I4D100K database [2] is a sparse synthetic database that is widely used for evaluating pattern mining algorithms. This transactional database was converted into a temporal database by treating tids as timestamps. A spatial database for all items in T10I4D100K was generated by assigning random coordinates between (0, 0) and (100, 100). The coordinates of these items in a Cartesian coordinate system are shown in Fig. 4a. It can be observed that the items are non-uniformly spread throughout the region. The statistical details of this database are provided in Table 4.
The Retail database is a sparse real-world transactional database that is widely used for evaluating pattern mining algorithms. This database was converted into a temporal database by treating tids as timestamps. A spatial database for all items was generated by assigning random coordinates between (0, 0) and (200, 200). The coordinates of these items in a Cartesian coordinate system are shown in Fig. 4b. It can be observed that the items are non-uniformly spread throughout the region. The statistical details of this database are provided in the third row of Table 4.
AEROS consists of several air pollution measuring stations located throughout Japan. Each station measures the concentrations of several air pollutants (e.g., NO, NO2, PM2.5 and SO2) at hourly intervals. In this paper, we consider only the PM2.5 concentration. The pollution data is generated at 1-hour intervals for the 24 hours of a day. For our experiments, we use six months of air pollution data (i.e., from 01-12-2018 to 04-06-2019). The PM2.5 database contains 5,366,157 data points and 1,065 items (or station ids). UTC time is used to record the transactions. Without loss of generality, the pollution database was split into a temporal database, a spatial database and an item weight database. PM2.5 is a dense, high-dimensional database. The statistical details of this database are shown in Table 4.
The Chess database is a dense real-world transactional database that is widely used for evaluating pattern mining algorithms. This database was converted into a temporal database by treating tids as timestamps. A spatial database for all items was generated by assigning random coordinates between (0, 0) and (20, 20). The coordinates of these items in a Cartesian coordinate system are shown in Fig. 4d. It can be observed that the items are non-uniformly spread throughout the region. The statistical details of this database are provided in the fourth row of Table 4.

B. PERFORMANCE OF SWFP-GROWTH
Figs. 6a, 6b, 6c and 6d show the memory requirements of SWFP-growth (in megabytes) on the T10I4D100K, Retail, PM2.5 and Chess databases at different minWS and maxDist values, respectively. The following observations can be drawn from these figures: (i) an increase in minWS results in a decrease in memory, as relatively fewer WFNIs are generated, and (ii) an increase in maxDist results in an increase in the memory required to find WFNIs, because a larger number of WFNIs are generated at higher maxDist values.
Figs. 7a, 7b, 7c and 7d show the runtime requirements of the SWFP-growth algorithm on the T10I4D100K, Retail, PM2.5 and Chess databases at different minWS and maxDist values, respectively. The following observations can be drawn from these figures: (i) an increase in minWS results in a decrease in runtime, as fewer WFNIs are generated, and (ii) an increase in maxDist results in an increase in runtime.

C. SCALABILITY TEST OF SWFP-GROWTH
We study the scalability of the proposed algorithm, in terms of execution time and required memory, by varying the size of the T10I4D100K database. We concatenated the T10I4D100K database ten times to produce a very large database, which we call the T10I4D1000K database. Next, we divided this database into five portions of 0.2 million transactions each. We then investigated the performance of our algorithm after accumulating each portion with the previous parts, finding SWFIs each time. To find the same itemsets as SWFIs as the database size increases, minWS was scaled in proportion to the database size; minWS for the first portion was set to 40,000. Figs. 8a and 8b respectively show the memory and runtime requirements of the SWFP-growth algorithm on the T10I4D1000K database. It is clear from the graphs that as the database size increases, the memory and runtime requirements of our algorithm increase. However, SWFP-growth shows stable performance, with an approximately linear increase in runtime and memory consumption with respect to the data size. Thus, SWFP-growth can mine SWFIs over large databases with many distinct items using a reasonable amount of runtime and memory. (We could conduct the above experiment by directly generating the T10I4D1000K database using the synthetic database generator. However, such results may be misleading, as the number of generated itemsets can vary with the database size. In our scaled database, the number of patterns remains the same irrespective of the database size.)

D. CASE STUDY: AIR POLLUTION ANALYSIS
Table 5 shows the WFNIs generated from the PM2.5 database at maxDist = 5 kilometers and minWS = 10,000 µg/m³. The spatial locations of all these stations across Japan are shown in Fig. 9. The spatial locations of the sensors present in each weighted frequent neighborhood itemset are shown in Figs. 10, 11, 12 and 13. The itemsets in these figures indicate the geographical areas where people have been exposed to high levels of the PM2.5 pollutant.
It can be observed that high levels of PM2.5 have been observed at the places close to the bay areas (or harbors). This information can be found very useful in devising policies to control pollution at bay areas.
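The neighborhood constraint underlying this case study, i.e., pairs of stations within maxDist = 5 km of one another, can be sketched as follows. The station ids and coordinates below are hypothetical (not AEROS data), and great-circle (haversine) distance is assumed for geographic coordinates:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(p, q):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))  # mean Earth radius ~ 6371 km

def neighbors(stations, max_dist_km):
    """All unordered station pairs lying within max_dist_km of each other."""
    ids = sorted(stations)
    return {(i, j) for n, i in enumerate(ids) for j in ids[n + 1:]
            if haversine_km(stations[i], stations[j]) <= max_dist_km}

# hypothetical station coordinates (lat, lon)
stations = {'s1': (35.68, 139.76), 's2': (35.70, 139.78), 's3': (34.69, 135.50)}
print(neighbors(stations, 5))  # only s1 and s2 are within 5 km of each other
```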

VI. CONCLUSIONS AND FUTURE WORK
In this paper, we have introduced a flexible model of spatial weighted frequent itemsets that exist in a spatiotemporal database. Two novel measures have been introduced to reduce the search space effectively. A pattern-growth algorithm has also been presented to find all desired itemsets in a spatiotemporal database. Experimental results demonstrate that the proposed algorithm is efficient. Finally, we have demonstrated the usefulness of the proposed model with a real-world case study on air pollution data.
In this paper, we have studied the problem of finding SWFIs by taking into account positive weights for the items in a spatiotemporal database. As part of future work, we would like to investigate finding SWFIs in a spatiotemporal database using both positive and negative weights for the items. Additionally, we would like to investigate disk-based and parallel algorithms to find SWFIs.

KOJI ZETTSU is Director General of the Big Data Integration Research Center of the National Institute of Information and Communications Technology (NICT). He has been doing research and development of data analytics technology at NICT, and has been leading the Real Space Information Analytics Project since 2016 to implement a smart data platform based on data mining and AI. To promote industry-academia-government collaboration on the platform, he is also a leader of the Cross-Data Collaboration Project of the Smart IoT Acceleration Forum in Japan. He received his Ph.D. in Informatics from Kyoto University in 2005. His research interests are database systems, data mining, information retrieval and software engineering. He has served on numerous academic societies, conference committees and working groups.
MASASHI TOYODA is a professor at the Institute of Industrial Science, jointly affiliated with the Graduate School of Information Science and Technology at the University of Tokyo, Japan. He received the BS, MS, and PhD degrees in computer science from the Tokyo Institute of Technology, Japan, in 1994, 1996, and 1999, respectively. In 1999, he joined the Institute of Industrial Science, the University of Tokyo, as a research fellow, and worked as a specially appointed associate professor from 2004 to 2006, and as an associate professor from 2006 to 2018. His research interests include archiving and analysis of Web, social media, and IoT data, information visualization, visual analytics, and user interfaces.

He has delivered several invited/panel talks at reputed conferences and workshops in India and abroad. He has received several awards and recognitions. He has executed research projects with research funding of about 80 million Indian rupees. Since 2004, he has been investigating the building of efficient agricultural knowledge transfer systems by extending developments in IT. He has developed the eSagu system, an IT-based farm-specific agro-advisory system, which has been field-tested in hundreds of villages on about 50 field and horticultural crops. He has also built the eAgromet system, an IT-based agro-meteorological advisory system that provides risk mitigation information to farmers. He has conceptualized the notion of Virtual Crop Labs to improve the applied skills of extension professionals. Currently, he is investigating the building of the Crop Darpan system, a crop diagnostic tool for farmers, with funding support from the India-Japan Joint Research Laboratory Program. He has received two best paper awards. The eSagu system has received several recognitions, including the CSI-Nihilent e-Governance Project Award.