MCF tree-based Clustering Method for Very Large Mixed-type Data Set

Several clustering methods have been proposed for analyzing numerous mixed-type data sets composed of numeric and categorical attributes. However, existing clustering methods are not suitable for clustering very large mixed-type data sets because they require a high computational cost or a large memory size. We propose a novel clustering method for very large data sets using a mixed-type clustering feature (MCF) vector with summary information about a cluster. The MCF vector consists of the CF vector and a histogram to summarize the mixed-type values. Based on the MCF vector, we propose an MCF tree, along with a distance measure between the MCF vectors representing two clusters. Unlike previous studies that summarize a data set based on a fixed memory size, we estimate a small initial memory size of the data set for building the tree. Then, the memory size is adaptively increased to estimate a more accurate threshold by reflecting the size reduction in the re-built tree. Our theoretical analysis demonstrates the efficiency of the proposed approach. Experimental results on very large synthetic and real data sets demonstrate that the proposed approach clusters the data sets significantly faster than existing clustering methods while maintaining similar or better clustering accuracy.


I. INTRODUCTION
Clustering groups a data set into clusters based on the concepts of distance or similarity. It is widely used in a diverse range of fields (e.g., data science, machine learning, pattern recognition, and artificial intelligence [1]-[4], [6], [9]) as an effective technique for data analysis. The attributes of real-world data sets are commonly composed of both numeric and categorical attributes, e.g., in health, marketing, and medical data [5], [20]. Numeric attributes take real values, e.g., age and height. Categorical attributes take values from a fixed number of categories, e.g., sex and religion. A "mixed-type" data set is composed of both numeric and categorical attributes [25].
There are two types of commonly used clustering methods: partitional clustering and hierarchical clustering [26]. Partitional clustering methods partition a data set into final clusters, each of which is represented by its cluster center. In this paper, the term "final cluster" indicates "the cluster found by the clustering methods," whereas the term "true cluster" indicates "the original cluster with the ground-truth label." Hierarchical clustering methods generate a hierarchy of clusters in either a top-down or bottom-up sequence.
Both types of clustering methods for mixed-type data sets have disadvantages similar to those for non-mixed-type data sets (e.g., numeric data sets). Partitional clustering methods for mixed-type data sets [5], [10], [24], [27], [32] cannot cluster very large data sets rapidly because they repeat the assignment and update steps over the whole data set until their stopping criteria are satisfied. Specifically, if all data do not reside in the main memory, disk I/Os are required repeatedly in each iteration, which drastically increases the clustering time. Many hierarchical methods for mixed-type data sets have been proposed [21]-[23], [29], [31], [33]. However, these hierarchical clustering methods require a proximity matrix, and generating the matrix incurs a high computational cost. Further, a large amount of memory must be allocated to maintain the matrix until the clustering is completed. Hence, these hierarchical clustering methods are not suitable for clustering very large mixed-type data sets.
Sampling-based [36], dimensionality reduction-based, and summary-based clustering methods [13] are broadly used for clustering large data sets [35]. However, deriving appropriate samples from data sets for sampling-based clustering methods is a challenge [30]. Further, dimensionality reduction-based methods can lose the information that maintains the significant properties of the data sets [8], [14]. Therefore, we focus on summary-based clustering methods for the analysis of large data sets.
Summary-based clustering methods for numeric data sets have been proposed to cluster very large data sets effectively. BIRCH (balanced iterative reducing and clustering using hierarchies) and CF+-ERC (clustering features + tree-based effective multiple range queries-based clustering) are examples of this type of clustering method [35], [38], [39]. However, these clustering methods cannot cluster mixed-type data sets because they only handle numeric data sets.
A summary-based clustering method for mixed-type data sets, which is an extension of the CF tree of BIRCH, was proposed in [13]. They computed a distance between data using their distance measure based on the log-likelihood function. If the distance between data was less than a threshold, the data were summarized in the CF vector. However, their threshold estimation method is limited by a given memory size, similar to BIRCH. If a given memory size is not suitable for clustering mixed-type data sets, their threshold estimation method does not perform well, thereby giving poor final clusters.
In this paper, we propose a novel summary-based clustering method for very large mixed-type data sets by incorporating a dynamically increasing memory size for more accurate threshold estimation. For this, we first introduce the mixed-type clustering feature (MCF) vector, in which the numeric and categorical values of the mixed-type objects in a cluster are summarized using the CF vector [35] and the histogram proposed in K-histograms [19], respectively. Next, we implement the MCF tree, which is an extension of the CF+ tree [35].
We generalize the distance measure of K-histograms to compute the distance between histograms of the categorical values of two clusters. Therefore, our proposed distance measure only uses the MCF vector to compute the precise distance between two clusters. The MCF vector is also used to obtain a diameter of its corresponding cluster. Unlike previous works [13], [35], [38], [39], our method first estimates a small initial memory size for building the MCF tree by considering the size and characteristics of a given mixed-type data set. It then adaptively increases the memory size for the MCF tree to estimate a more accurate threshold. The memory size is adaptively increased by considering the size reduction of the re-built MCF tree so that the threshold is not increased too conservatively or too aggressively.
FIGURE 1: Process of our clustering method

The MCF tree provides a set of microclusters that have summary information about the overall data set. Subsequently, existing mixed-type clustering methods (e.g., convex k-means and W-K-prototypes) are used in our proposed approach as the global clustering methods to group the set of microclusters into the final clusters. Consequently, our summary-based clustering method clusters mixed-type data sets more rapidly than existing clustering methods without sacrificing accuracy compared to non-summary-based clustering methods. Furthermore, the range-query-based clustering method ERC (effective multiple range queries-based clustering), which uses the structure of the tree, can also be applied to the MCF tree. Therefore, various existing clustering methods can serve as our global clustering method.
The main contributions of our proposed method are summarized as follows:
• We propose a scheme for summarizing mixed-type data sets for clustering using an MCF vector composed of the CF vector and a histogram.
• We develop an efficient distance computation method between MCF vectors.
• We develop an adaptive memory-size-increasing method to estimate more accurate thresholds for constructing MCF trees.
• We propose a method of applying the range-query-based clustering method ERC to the MCF tree as a global clustering method.
We use Figure 1 to briefly explain the process of our clustering method. First, we build an MCF tree as follows. We set a threshold T = 0 and estimate an initial memory size. An MCF tree Tr with T is created. Each object o is inserted into Tr if it is not already in Tr. If the size of Tr becomes greater than the currently given memory size, we increase the threshold T. With this increased threshold value, we rebuild a new MCF tree Tr' from the current MCF tree Tr by inserting all of its leaf entries into Tr'. At that time, each leaf entry of Tr' can absorb more objects because of the increase in T. After the rebuilding is completed, the size of Tr' is smaller than that of Tr. We then increase the memory size using our proposed scheme and make Tr' our current MCF tree by setting Tr = Tr'. The current MCF tree Tr is now ready to receive the remaining objects of the data set. All of the above steps are repeated until every object has been inserted into the MCF tree. After building the MCF tree, existing mixed-type clustering methods group the set of microclusters into the final clusters. The detailed process of our method is discussed in Section IV.
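The insert/overflow/rebuild loop described above can be sketched in a few lines of Python. This toy version is ours, not the paper's implementation: a flat list of (count, linear-sum) pairs stands in for the tree, and the threshold grows by a fixed step rather than by the estimation scheme of Section IV. It only illustrates the control flow: when the summary exceeds its memory budget, the threshold is increased and every microcluster is re-inserted, so nearby microclusters merge.

```python
import math

def centroid(mc):
    """Centroid of a microcluster stored as (count, linear_sum)."""
    n, ls = mc
    return [x / n for x in ls]

def rebuild(micro, T):
    """Re-insert every microcluster into a fresh summary; with a larger
    threshold T, nearby microclusters are merged by additivity."""
    out = []
    for n, ls in micro:
        for i, (m, ls2) in enumerate(out):
            if math.dist(centroid((m, ls2)), centroid((n, ls))) <= T:
                out[i] = (m + n, [a + b for a, b in zip(ls2, ls)])
                break
        else:
            out.append((n, ls))
    return out

def summarize(stream, memory_limit):
    """Summarize a stream of numeric points into at most `memory_limit`
    microclusters, growing the threshold T whenever the summary overflows."""
    T, micro = 0.0, []
    for o in stream:
        micro.append((1, list(o)))        # insert as a new microcluster
        micro = rebuild(micro, T)         # absorb it if within T of another
        while len(micro) > memory_limit:  # summary exceeds memory: grow T
            T += 1.0                      # crude stand-in for Definition IV.2
            micro = rebuild(micro, T)
    return T, micro
```

With a budget of three microclusters, five points in three well-separated groups force one threshold increase, after which the two tight pairs are merged.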
The remainder of this paper is organized as follows. Section II reviews CF+-ERC and clustering methods for mixed-type data sets in detail. In Section III, the MCF vector, which consists of summary information about a cluster, is described. Section IV explains how the MCF tree identifies microclusters and how ERC is applied to our method. Theoretical and extensive experimental analyses of our method are presented in Section V. Finally, Section VI provides our concluding remarks.

II. RELATED WORKS
In this section, we briefly explain CF+-ERC [35], which is an extension of BIRCH [38], and existing clustering methods for mixed-type data sets.

A. CF+ TREE
We recall the definitions of the CF vector, distance measure, and CF+ tree [35].
Definition II.1. Given N d-dimensional objects in a cluster X : {o_i}, where i = 1, 2, . . . , N, the CF vector of X is defined as CF = (N, LS, SS), where N is the number of objects in X; LS is the linear sum of the N objects, that is, LS = Σ_{i=1..N} o_i; and SS is the square sum of the N objects, that is, SS = Σ_{i=1..N} o_i².

Definition II.2. Given N d-dimensional objects in a cluster X : {o_i}, where i = 1, 2, . . . , N, the centroid c and the average radius ar are obtained using the following formulas: c = (Σ_{i=1..N} o_i)/N = LS/N and ar = (Σ_{i=1..N} (o_i − c)² / N)^{1/2}.

Definition II.3. Given two clusters X and Y having CF vectors CF_X = (N_X, LS_X, SS_X) and CF_Y = (N_Y, LS_Y, SS_Y), respectively, the centroid Euclidean distance d_0 between X and Y is obtained by the following formula: d_0 = ((LS_X/N_X − LS_Y/N_Y)²)^{1/2}.

Definition II.4. The CF+ tree satisfies the following properties:
1) Each non-leaf node contains, at most, C entries in the form [CF_i, child_i], where i = 1, 2, . . . , C; child_i is a pointer to its i-th child node, and CF_i is the CF vector of the subcluster pointed to by child_i. An entry of a non-leaf node is called a subcluster.
2) Each leaf node contains, at most, L entries in the form [CF_i], where i = 1, 2, . . . , L. An entry of a leaf node is called a microcluster.
3) All leaf nodes are chained together for efficient sequential scanning using the "prev" and "next" pointers.
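The definitions above translate directly into code. The following Python sketch (function names are ours) computes the CF vector, the centroid, the average radius, and the centroid Euclidean distance d_0 from the CF vectors alone, using the identity Σ ||o_i − c||² = SS − N·||c||².

```python
import math

def cf(objects):
    """CF vector of a cluster: (N, LS, SS), where LS is the per-dimension
    linear sum and SS is the sum of squared coordinates of the objects."""
    n = len(objects)
    d = len(objects[0])
    ls = [sum(o[j] for o in objects) for j in range(d)]
    ss = sum(x * x for o in objects for x in o)
    return n, ls, ss

def centroid(n, ls):
    """Centroid c = LS / N (Definition II.2)."""
    return [x / n for x in ls]

def avg_radius(n, ls, ss):
    """Average radius ar = sqrt(sum ||o_i - c||^2 / N) = sqrt(SS/N - ||c||^2)."""
    c = centroid(n, ls)
    return math.sqrt(max(ss / n - sum(x * x for x in c), 0.0))

def d0(cf_x, cf_y):
    """Centroid Euclidean distance between two clusters, from CF vectors only."""
    return math.dist(centroid(cf_x[0], cf_x[1]), centroid(cf_y[0], cf_y[1]))
```

Note that none of these quantities requires revisiting the original objects; the triple (N, LS, SS) is sufficient, which is what makes the summary useful for streams.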
CF+-ERC obtains microclusters by building a CF+ tree using a threshold. The CF+ tree is built as follows. For each object o, starting from the root node, the insertion recursively descends to the closest child node of the CF+ tree using a distance measure such as d_0. For example, given a cluster X and an object o, d_0(X, o) is the Euclidean distance between the centroids of X and o. When o arrives at a leaf node n_l, if the distance between o and its closest entry (i.e., microcluster mc) of n_l is less than a given threshold, o is absorbed into mc. Otherwise, o generates a new entry. If there is space in n_l for this new entry, the new entry is inserted into n_l; otherwise, n_l must be split. Node splitting is performed by the adaptable node split scheme [35], in which all entries of n_l are divided into two entry sets as follows. First, the farthest entry pair of n_l is chosen as the seeds. Next, each seed is moved to one of the entry sets. After that, each of the remaining entries is distributed to the closest entry set. At that time, the remaining entries use the distances from the CF vectors of the entry sets rather than those of the seeds, because the CF vector of each entry set reflects all entries of the entry set.
Splitting n_l causes the insertion of a new non-leaf entry into the parent node, and then the parent node can also be split. Thus, a node split can be propagated to the parent node, possibly up to the root node. If a non-leaf node n_ln has room for the new entry created from the propagation of the node split, the propagation stops at n_ln. A split of the root node increases the height of the CF+ tree by one.
If the size of the CF+ tree exceeds a given memory size, CF+-ERC rebuilds the tree using the average value of the closest distances of all leaf nodes as the new threshold. Tree rebuilding continues until the size of the entire CF+ tree is less than the given memory size. Note that the CF+ tree aims to summarize the data set so that the size of the summary is less than the memory size. CF+-ERC also has the CF additivity theorem: given two CF vectors CF_X = (N_X, LS_X, SS_X) and CF_Y = (N_Y, LS_Y, SS_Y), CF_X + CF_Y = (N_X + N_Y, LS_X + LS_Y, SS_X + SS_Y) is the CF vector of the merged cluster.

FIGURE 2: CF+ tree constructed using 60 objects

Figure 2 shows an example of a CF+ tree constructed from 60 objects. All leaf nodes of the CF+ tree are chained together using the "prev" and "next" pointers, which are denoted by the solid arrow lines. The 60 objects are used to build the CF+ tree, and the tree generates 13 microclusters.
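The CF additivity theorem is just a component-wise sum, which is what allows whole microclusters, not only single objects, to be merged during tree rebuilding. A minimal sketch (the function name is ours):

```python
def cf_add(cf_x, cf_y):
    """CF additivity: the merged cluster's CF vector is the component-wise
    sum (N_X + N_Y, LS_X + LS_Y, SS_X + SS_Y)."""
    nx, lsx, ssx = cf_x
    ny, lsy, ssy = cf_y
    return nx + ny, [a + b for a, b in zip(lsx, lsy)], ssx + ssy
```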

B. ERC METHOD
The global clustering method ERC is conducted by utilizing the structure of the CF+ tree and the threshold T [35]. For the distance computation between two microclusters, the inter-microcluster distance (IMD) is used. The IMD between two microclusters is computed by subtracting the sum of their radii from the Euclidean distance between their centroids.
Definition II.5. Given two microclusters MC_x and MC_y having centroids c(MC_x) and c(MC_y) and average radii ar(MC_x) and ar(MC_y), respectively, the inter-microcluster distance between MC_x and MC_y is defined as IMD(MC_x, MC_y) = ||c(MC_x) − c(MC_y)|| − (ar(MC_x) + ar(MC_y)).
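Definition II.5 can be sketched in one line of Python (the function name and arguments are ours):

```python
import math

def imd(c_x, ar_x, c_y, ar_y):
    """Inter-microcluster distance: centroid Euclidean distance minus the
    sum of the two average radii (negative when the microclusters overlap)."""
    return math.dist(c_x, c_y) - (ar_x + ar_y)
```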
If the IMD between two microclusters is less than T, both microclusters reside in the same final cluster. ERC consists of two steps: partition and refinement. The partition step groups the linearly adjacent microclusters within T into a segment called a microcluster segment (MCS) using the microcluster linearization property of the CF+ tree.
In the refinement step, the effective refinement step (ERS) of ERC gathers the set of MCSs into the set of final clusters using multiple range queries. For each MCS, ERS calls an effective range query (ERQ), which uses the structure of the CF+ tree to discover the connections between the MCS and the others without redundant query processing. A connection between MCS_i and MCS_j means that there exists at least one pair MC_x ∈ MCS_i and MC_y ∈ MCS_j whose IMD is less than T. Thus, MCS_i and MCS_j are merged into the same final cluster. As a result, ERC returns a set of final clusters.
In Figure 2, IMD(MC_2, MC_4) and IMD(MC_5, MC_8) are less than T. However, these two microcluster pairs are not linearly adjacent; hence, the partition step does not group them into the two respective MCSs. The ERS instead discovers these connections using ERQs and merges the corresponding MCSs into the same final cluster.

TABLE 1: Comparison of existing clustering methods with our approach
- Ahmad [5], Modha [32], Huang [24], Ji [27], Brnawy [10]: Partitional clustering methods. Not suitable for clustering very large data sets because they have to load the entire data set into memory to cluster.
- Chiu [13]: Summary-based clustering method using a fixed memory size.
- Our approach: Summary-based clustering method using the adaptive memory-size-increasing scheme. Improves clustering accuracy and reduces clustering time significantly.

C. CLUSTERING METHODS FOR MIXED-TYPE DATA SETS
Most of the existing clustering methods cluster mixed-type data sets in partitional ways [5], [10], [24], [27], [32]. However, these methods are not suitable for clustering very large data sets because they have to load the entire data set into memory to cluster it. Distributed clustering methods based on specific distributed processing environments, i.e., the MapReduce [15] and Spark [37] frameworks, cluster very large data sets rapidly [17], [18], [28]. The existing summary-based clustering method efficiently analyzes very large data sets [13], but its clustering performance is determined by a fixed memory size. Table 1 shows the main characteristics of these methods in comparison with our proposed scheme. In Table 1, the partitional clustering methods for mixed-type data sets [5], [10], [24], [27], [32] are used as the comparative global clustering methods of our proposed MCF tree. Thus, we give a more detailed description of these partitional clustering methods in the following two paragraphs.
A new distance measure to compute the dissimilarity between mixed-type objects was proposed in [5]. The authors compute the distance between two attribute values of a categorical attribute from the data set; the distance depends upon the co-occurrence of these attribute values with the values of the other attributes. The weights of numeric attributes are also calculated in this method such that more significant attributes are given greater weights. A convex K-means for data in heterogeneous feature spaces was proposed in [32]. The authors normalized each numeric attribute and used a 1-in-q representation for each q-ary categorical attribute. To compute the distortion between a centroid and an object, they applied the squared Euclidean distance for the numeric feature and the cosine distance for the categorical feature. The distortion in each feature is multiplied by the feature weight and summed into the total distortion. W-K-prototypes was proposed in [24]. In each iteration of W-K-prototypes, the attribute weights are updated and used in the cost function. These weights are inversely proportional to the sum of the within-cluster distances. An extension of W-K-prototypes was proposed in [27]. Its new cost function reflects the significance of the cluster center and the weights of W-K-prototypes. The significance of an attribute is initially selected randomly and then updated in each iteration. The random selection of the significance of an attribute can worsen the problem of random initialization of the cluster center because it leads to different results in different runs. The K-mixed prototypes method [10] further distinguishes the categorical attributes into nominal and binary types. It increases the clustering accuracy by assigning more significance to the attributes of the binary type than to those of the nominal type.
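The partitional methods above share one basic ingredient: a per-object dissimilarity that combines a numeric part and a categorical part. A minimal K-prototypes-style sketch is shown below; it is our illustration of the general form, not the exact weighted variant of any of the cited papers, and all names and the default gamma are assumptions.

```python
def mixed_distance(x, y, numeric_idx, categorical_idx, gamma=0.5):
    """K-prototypes-style dissimilarity between two mixed-type objects:
    squared Euclidean distance over the numeric attributes plus a count of
    categorical mismatches, traded off by the weight `gamma`."""
    num = sum((x[i] - y[i]) ** 2 for i in numeric_idx)      # numeric part
    cat = sum(1 for i in categorical_idx if x[i] != y[i])   # mismatch count
    return gamma * num + (1 - gamma) * cat
```

Per-attribute weighting, as in W-K-prototypes, amounts to replacing the two uniform sums by weighted ones.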

III. MIXED-TYPE CLUSTERING FEATURES AND MEASURES
In this section, we introduce the MCF vector for summarizing clusters. Our proposed distance measure computes the distance between two clusters using their MCF vectors. For computing the distance between two clusters considering only categorical attributes, we generalize the distance measure of K-histograms to a distance measure between the histograms of the categorical attributes of two clusters.

A. NOTATIONS
Suppose a mixed-type data set with d attributes composed of d r numeric and d c categorical attributes (d = d r + d c ). Let A l = {a 1 , · · · , a m } be a set of m distinct values of the l-th categorical attribute. It should be noted that m may vary depending on the domain characteristics of the attribute.
Then, we can derive two types of sets, X^r and X^c, from a cluster X, where X^r contains the numeric parts and X^c the categorical parts of the objects. Let X^c_l = {o^c_1,l, o^c_2,l, . . . , o^c_N,l} be the set of values of the l-th categorical attribute of all the objects in X^c. Let V_l be the set of distinct values in X^c_l, and let g(a_j, X^c_l) be the frequency of the value a_j in X^c_l. The histogram of X^c_l is H_X,l = {(a_j, g(a_j, X^c_l)) | a_j ∈ V_l}. We denote H_X^c briefly by H_X. The histogram set of X^c is defined as H_X = (H_X,1, H_X,2, · · · , H_X,d_c). The symbolic notation is summarized in Table 2.

Table 3 shows an example of a database having eight objects (o_1, o_2, · · · , o_8) composed of both numeric and categorical attributes. In this example, we have two clusters, X and Y. There are two numeric attributes, age and height, and two categorical attributes, religion and sex. The values of the numeric attributes age and height are real numbers, whereas the values of the categorical attributes religion and sex are in A_1 = {Christianity, Islam, Hinduism} and A_2 = {M, F}, respectively. We use the simple notations C, I, and H to represent Christianity, Islam, and Hinduism of the categorical attribute religion, respectively.
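The histogram set H_X can be built in one pass over the categorical parts of the objects. A small sketch (the attribute values below are illustrative, not the actual contents of Table 3):

```python
from collections import Counter

def histograms(categorical_objects):
    """Build one frequency histogram per categorical attribute. Each
    histogram maps a value a_j to its frequency g(a_j, X^c_l)."""
    d_c = len(categorical_objects[0])
    return [Counter(o[l] for o in categorical_objects) for l in range(d_c)]
```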

B. MIXED-TYPE CLUSTERING FEATURES
We formally define the MCF vector, which is a quadruple maintaining summarized information about a cluster.
Definition III.1. (MCF vector): Given N d-dimensional objects in a cluster X : {o_i}, where i = 1, 2, . . . , N, the MCF vector of X is defined as MCF_X = (N_X, LS_X, SS_X, H_X), where N_X, LS_X, and SS_X are the same as N, LS, and SS in the CF vector in Definition II.1, and H_X is the histogram set of X^c of X.
The two histogram sets can be merged as follows.
Definition III.2. (Histogram set merge): Given two histogram sets H_X and H_Y, the merged histogram set H_X∪Y of H_X and H_Y is obtained by merging the histograms attribute by attribute, H_X∪Y = (H_X,1 ⊕ H_Y,1, . . . , H_X,d_c ⊕ H_Y,d_c), where H_X,l ⊕ H_Y,l assigns to each value a_j ∈ A_l the summed frequency g(a_j, X^c_l) + g(a_j, Y^c_l). For example, in Table 3, the histograms of the two clusters are merged by adding the frequencies of identical categorical values. The two clusters can also be merged using their MCF vectors.
Theorem III.1. (MCF additivity theorem): Let MCF_X = (N_X, LS_X, SS_X, H_X) and MCF_Y = (N_Y, LS_Y, SS_Y, H_Y) be the MCF vectors of two clusters X and Y. The MCF vector of the cluster obtained by merging the two clusters is MCF_X∪Y = (N_X + N_Y, LS_X + LS_Y, SS_X + SS_Y, H_X∪Y).

Proof of Theorem III.1. The proof consists of straightforward algebra.

Theorem III.1 shows that the MCF vector satisfies the MCF additivity theorem, which is an extension of the CF additivity theorem [38]. Thus, two clusters are merged exactly from their MCF vectors.
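The MCF additivity theorem can be sketched directly (the function name is ours): the numeric components add as in the CF vector, and the histogram sets merge by summing the frequencies of identical categorical values, which is exactly what `Counter` addition does.

```python
from collections import Counter

def mcf_add(mcf_x, mcf_y):
    """MCF additivity: numeric parts add component-wise; histograms merge
    per attribute by summing the frequencies of identical values."""
    nx, lsx, ssx, hx = mcf_x
    ny, lsy, ssy, hy = mcf_y
    merged_h = [a + b for a, b in zip(hx, hy)]  # Counter '+' sums counts
    return nx + ny, [a + b for a, b in zip(lsx, lsy)], ssx + ssy, merged_h
```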

C. MEASURES FOR CLUSTERS
We compute the distance between two clusters as follows.

Definition III.3. (Distance between clusters): Given d_r numeric and d_c categorical attributes, the distance between clusters X and Y is defined as d(X, Y) = γ · (1/√d_r) · d_0(X^r, Y^r) + (1 − γ) · (1/d_c) · d_H(H_X, H_Y), where d_0(X^r, Y^r) is the distance between X^r and Y^r, d_H(H_X, H_Y) is the distance between X^c and Y^c, and γ is the numeric attribute weight.
The distance between X r and Y r is obtained using the centroid Euclidean distance d 0 . d H (H X , H Y ) is the distance between X c and Y c , which will be explained in Definition III.5 of Section III-D. Thus, we can compute the distance between the two clusters. The diameter of the cluster is also derived from Definitions II.2 and III.8, which will be given in Section III-D.
Definition III.4. (Diameter of a cluster): The diameter of cluster X is defined as D(X) = γ · (1/√d_r) · ar(X^r) + (1 − γ) · (1/d_c) · ρ(H_X), where ar(X^r) is the average radius of X^r, ρ(H_X) is the diameter of X^c, and γ is the numeric attribute weight.
The numeric attribute weight γ is specified by a domain expert. It should be noted that we will demonstrate that γ = 0.5 is sufficient for our proposed distance measure. The factors 1/√d_r and 1/d_c are used to normalize the numeric and categorical parts of the distance and diameter measures, respectively.

D. CATEGORICAL DISTANCE MEASURE BASED ON HISTOGRAM
We present a distance measure based on the histogram of the MCF vector. This measure computes the distance between the categorical values of the two clusters using their histogram sets. Then, we show that this distance measure computes the same average distance as the average distance between objects of two clusters when only considering categorical attributes. In our distance measure, we use elements of the MCF vector M CF X of X , such as H X and N X .
Definition III.5. (Distance based on histograms): The distance between H_X and H_Y is obtained using the following formula: d_H(H_X, H_Y) = Σ_{l=1..d_c} ( Σ_{a_j ∈ V_l} Σ_{a_k ∈ V_l} g(a_j, X^c_l) · g(a_k, Y^c_l) · δ(a_j, a_k) ) / (N_X · N_Y), where δ(a_j, a_k) is the distance between two categorical values given in Definition III.6.

We use Figure 3 to demonstrate the distance computation of the categorical attribute religion between X^c_1 and Y^c_1. In this figure, we use the simple notations C, I, and H to represent Christianity, Islam, and Hinduism, respectively. Let A_1 = {C, I, H}. In Figure 3, two objects are connected by a solid line if they have the same value (the distance between them is zero); otherwise, two objects are connected by a dashed line (the distance between them is one). In Figure 3a, each element of the histogram H_X,1 forms a value:frequency pair, e.g., C:2. The distance between H_X,1 and H_Y,1 is 12/15. Next, we introduce the distance between two values within the same categorical attribute. Using Definition III.6, we then define the average distance between X^c and Y^c.
Definition III.6. Given two categorical values a_j, a_k ∈ A_l, the distance between a_j and a_k is obtained by the following formula: δ(a_j, a_k) = 0 if a_j = a_k, and δ(a_j, a_k) = 1 otherwise.

FIGURE 3: Distance between X^c_1 and Y^c_1 from Table 3: (a) based on the histograms; (b) based on the objects

Definition III.7. (Distance based on objects): The distance between X^c and Y^c is obtained by the following formula: d(X^c, Y^c) = Σ_{l=1..d_c} ( Σ_{o ∈ X^c} Σ_{p ∈ Y^c} δ(o^c_l, p^c_l) ) / (N_X · N_Y).

In Figure 3b, the distance between X^c_1 and Y^c_1 is (the number of dashed lines) / (the number of all lines) = 12/15, which is the same as the distance obtained from the example in Figure 3a. This is because the two distance measures in Definitions III.5 and III.7 compute the same distance. Now, we prove that the distance based on the histograms and the distance based on the objects are the same.
Theorem III.2. Given X^c and Y^c, the following equation holds: d_H(H_X, H_Y) = d(X^c, Y^c). (6)

Proof of Theorem III.2. Let β^c = (β^c_1, β^c_2, . . . , β^c_d_c) be the list of the categorical values of a mixed-type object β. The distance between X^c and β^c is defined [19] as follows: d(X^c, β^c) = Σ_{l=1..d_c} ( Σ_{a_j ∈ V_l} g(a_j, X^c_l) · δ(a_j, β^c_l) ) / N_X. (7) In other words, the sum of the frequencies in H_X,l whose categorical values are different from β^c_l is φ(H_X,l, β^c_l). Therefore, Eq. (7) can be replaced with d(X^c, β^c) = Σ_{l=1..d_c} φ(H_X,l, β^c_l) / N_X. (8) If β^c_l is replaced with the histogram H_Y,l of Y^c_l in Eq. (8), each value a_k of Y^c_l contributes φ(H_X,l, a_k) weighted by its frequency: Σ_{l=1..d_c} ( Σ_{a_k ∈ V_l} g(a_k, Y^c_l) · φ(H_X,l, a_k) ) / (N_X · N_Y). (9) Using Eq. (9), we obtain the average distance between X^c and Y^c, which is exactly d_H(H_X, H_Y) of Definition III.5 and, at the same time, the average of the object-based distances of Definition III.7; this establishes Eq. (6).

The distance based on the histograms reduces the computational cost and memory usage because it has lower time and space complexities. Given X^c and Y^c, the distance computation based on the objects takes O(|X^c| · |Y^c|) per attribute. By contrast, the distance computation based on the histograms depends only on the numbers of distinct values, which are usually much smaller than the numbers of objects. For the same reason, maintaining H_X (H_Y) takes less space than X^c (Y^c) in most cases.
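The histogram-based distance of Definition III.5 can be sketched as follows. For each attribute, the number of matched cross-cluster object pairs is computed from the frequencies, and the mismatched fraction is accumulated. The frequencies in the test are assumed for illustration (the paper's Figure 3 is not reproduced here); with clusters of sizes 3 and 5 they yield the 12/15 of the running example.

```python
def hist_distance(h_x, h_y, n_x, n_y):
    """Histogram-based categorical distance: for each attribute, the
    fraction of cross-cluster object pairs whose values differ, summed over
    attributes. Equals the object-based average distance (Theorem III.2)."""
    pairs = n_x * n_y
    total = 0.0
    for hx_l, hy_l in zip(h_x, h_y):
        # pairs whose values are equal; all other pairs have distance 1
        matched = sum(f * hy_l.get(v, 0) for v, f in hx_l.items())
        total += (pairs - matched) / pairs
    return total
```

The cost per attribute depends only on the numbers of distinct values, not on the cluster sizes.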
Because a histogram set can represent even a single object, the measure is easily applied to the distance computation between a cluster and an object or between two objects.

The average distance over all object pairs in X^c shows the closeness of all objects of X with regard to the categorical values. We refer to this as the "diameter" of X^c.
Definition III.8. (Diameter): The diameter ρ(H_X) is obtained using the following formula: ρ(H_X) = Σ_{l=1..d_c} ρ(H_X,l), where ρ(H_X,l) = ( Σ_{a_j ∈ V_l} g(a_j, X^c_l) · (N_X − g(a_j, X^c_l)) ) / (N_X · (N_X − 1)) if N_X > 1, and ρ(H_X,l) = 0 otherwise.

In the examples in Figure 4, the symbolic notations and line types are the same as those in Figure 3. The diameter of H_X,1 in Figure 4a, which is ρ(H_X,1) = (3 · 2 + 1 · 4 + 1 · 4)/(5 · 4) = 7/10, is the same as the diameter of X^c_1 in Figure 4b, which is (the number of dashed lines) / (the number of all lines) = 7/10.
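The per-attribute diameter is a one-line computation over the histogram. The sketch below (names are ours) reproduces the worked example: frequencies {3, 1, 1} with N_X = 5 give (3·2 + 1·4 + 1·4)/(5·4) = 7/10.

```python
def hist_diameter(h_x_l, n_x):
    """Diameter of one categorical attribute from its histogram: the
    fraction of ordered within-cluster object pairs whose values differ."""
    if n_x <= 1:
        return 0.0
    # each value with frequency g mismatches the other (n_x - g) objects
    mismatched = sum(g * (n_x - g) for g in h_x_l.values())
    return mismatched / (n_x * (n_x - 1))
```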

IV. MCF TREE
We describe the structure and building process of the MCF tree. We also propose a tree-rebuilding process based on an adaptive memory-size-increasing scheme to estimate a more accurate threshold for building an MCF tree. Furthermore, we explain how to apply the range-query-based clustering method ERC to the MCF tree.

A. STRUCTURE AND INSERTION PROCESS OF MCF TREE
The MCF tree is a height-balanced tree. It consists of MCF vectors that summarize a mixed-type data set. A cluster is stored and maintained in the form of an MCF vector. The distance between clusters is computed using their MCF vectors. Two clusters can be merged by the MCF additivity theorem, similar to a CF+ tree. However, the structure and building process of the MCF tree differ slightly from those of the CF+ tree because each node of the MCF tree requires a minimum number of entries. We first extend the definition of the CF+ tree to the MCF tree and explain its building process.
Definition IV.1. The MCF tree satisfies the following properties:
1) Each non-leaf node contains, at most, C entries in the form [MCF_i, child_i], where i = 1, 2, . . . , C; child_i is a pointer to its i-th child node, and MCF_i is the MCF vector of the subcluster pointed to by child_i. An entry of a non-leaf node is called a subcluster.
2) Each leaf node contains, at most, L entries in the form [MCF_i], where i = 1, 2, . . . , L. An entry of a leaf node is called a microcluster.
3) All leaf nodes are chained together for efficient sequential scanning using the "prev" and "next" pointers.
4) Each node has at least a fraction φ of its maximum number of entries (i.e., L or C).
The building process of an MCF tree is the same as that of the CF+ tree. The only difference is the adaptable node split scheme of the MCF tree. It divides all entries of the node into two entry sets such that each entry set has at least a fraction φ of the maximum number of entries of the node. Once the MCF tree is built, existing clustering methods, such as ERC and W-K-prototypes, group the set of microclusters of the MCF tree into the final clusters.

B. TREE BUILDING PROCESS BASED ON ADAPTIVE MEMORY-SIZE-INCREASING SCHEME
In CF+-ERC, when a given memory size is too small, objects from different clusters are merged into the same microcluster. This implies that the accuracy of global clustering using these microclusters is degraded. On the contrary, if a given memory size is too large and the initial threshold is too small, the CF+ tree built using the threshold estimation scheme does not summarize the data set effectively, because a small threshold rarely absorbs objects into the same microcluster. However, estimating an accurate memory size for a data set is a difficult problem because it differs depending on the characteristics of the data set. Therefore, we propose an adaptive memory-size-increasing scheme that estimates a more accurate threshold by adaptively increasing the memory size according to the amount of size reduction of the re-built MCF tree.
We introduce the tree building process and then explain our threshold estimation based on the adaptive memory-size-increasing scheme in detail. Our threshold estimation scheme has three functions: initial memory-size-estimation, threshold increase, and memory-size-increase. The MCF tree is built as follows. Let S be a data set composed of d_r numeric and d_c categorical attributes, MM be the main memory size, and M(W) be the memory size required to store W.
First, a small initial memory size µ_i is computed (the initial memory-size-estimation function). The tree-rebuilding steps are then performed repeatedly until all the objects are inserted into the MCF tree. If the memory size for the i-th MCF tree, M(Tr_i), is greater than the main memory size MM, the main memory size is used as the memory size, and the memory size does not increase thereafter. The two variables α and β store the memory sizes of the two MCF trees immediately before and immediately after rebuilding the MCF tree, respectively; they are used in the memory-size-increase function.
In the initial memory-size-estimation function, a small initial memory size µ_0 is roughly computed using the MCF tree Tr_0, consisting of π% × |S| objects, for the first tree rebuilding. All entries of the nodes in the MCF tree are represented by MCF vectors. Therefore, we need to obtain the number of entries of Tr_0 and the memory size of an MCF vector.
Let B be the branching factor of Tr_0, and let a_i denote the number of entries of all nodes at height i. The number of entries of all leaf nodes (i.e., at height 0) is the same as the number of objects, because each leaf entry of Tr_0, built using T_0 = 0, contains exactly one object. Thus, a_0 = |S| × π. In Tr_0, a non-leaf node at each level has up to B entries. Hence, the total number of entries in the MCF tree can be approximately computed as κ = Σ_{i=0..h} a_i, where a_0 = |S| × π and a_i = a_{i−1}/B for i ≥ 1. From κ and M(MCF_X), the small initial memory size is computed as µ_0 = κ × M(MCF_X).

In the threshold increase function, the i-th estimated threshold T_i is increased from T_{i−1} using the (i−1)-th rebuilt MCF tree Tr_{i−1}. We use the term "leaf minimum distance" of a leaf node to refer to the minimum distance among the microclusters belonging to the leaf node. While building Tr_{i−1}, similar microclusters tend to be assigned to the same leaf node. Therefore, we can use all the leaf minimum distances to compute a new threshold T_i, instead of using the minimum distance between all microcluster pairs of the tree, to reduce the computational cost. Because the minimum distance of a leaf node depends on the microclusters assigned to it, some leaf minimum distances might be much larger than the average of the leaf minimum distances. We exclude such leaf minimum distances to obtain the newly increased threshold, as follows.
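The initial estimate above can be sketched as a short geometric-series computation. Both the level recurrence a_i = a_{i−1}/B and the product µ_0 = κ · M(MCF) are our reading of the truncated formulas, so treat the constants as assumptions.

```python
def initial_memory_size(num_objects, pi, branching, mcf_size):
    """Rough initial memory estimate: a_0 = |S| * pi entries at the leaves,
    each level above has about 1/B as many entries; the total entry count
    kappa is multiplied by the size of one MCF vector."""
    a = num_objects * pi          # a_0: one leaf entry per sampled object
    kappa = 0.0
    while a >= 1:                 # sum levels until fewer than one entry
        kappa += a
        a /= branching
    return kappa * mcf_size
```

For example, |S| = 1000, π = 0.5, and B = 10 give κ = 500 + 50 + 5 = 555 entries.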
Definition IV.2. Threshold increase: Given the set ρ of the leaf minimum distances of all leaf nodes in the (i−1)-th rebuilt MCF tree Tr_{i−1} with the (i−1)-th increased threshold T_{i−1}, let ρ̄ and σ denote the average and standard deviation of ρ, respectively. The set ω of retained leaf minimum distances is ω = { ρ_j ∈ ρ | ρ_j < ρ̄ + σ }. Let ω̄ be the average of ω. The i-th increased threshold is then T_i = ω̄ if ω̄ ≥ T_{i−1}, and T_i = T_{i−1} otherwise.
The memory-size-increase function increases the memory size adaptively. By Definition IV.2, the estimated threshold increases monotonically as the number of tree rebuilds increases. Therefore, we estimate a more accurate threshold over the tree rebuilds through the memory-size reduction ratio (i.e., β/α). If ω̄ < T_{i−1}, the memory size is increased by the previous increment in memory size, i.e., µ_i = µ_{i−1} + (µ_{i−1} − µ_{i−2}). More generally, β/α indicates whether the i-th increased threshold T_i is sufficient for summarizing the data set, and we use 1/e^{β/α} to amplify the effect of β/α in computing the increment in memory size. A β/α close to one means that the MCF tree Tr_i with T_i summarizes only slightly more objects than the previous MCF tree Tr_{i−1} with T_{i−1} when using the same memory size. Thus, T_i is estimated conservatively for summarizing the data set. In this case, we increase the memory size only slightly to rebuild the tree more frequently, which in turn gives a more frequent estimation of the threshold increase; the current conservative threshold estimation therefore becomes more aggressive.
On the other hand, if β/α is close to zero, many more objects are summarized in Tr_i with T_i than in Tr_{i−1} with T_{i−1} when using the same memory size. This indicates that the estimation of T_i is rather aggressive. Thus, we increase the memory size significantly to reduce the number of tree rebuilds. As a result, the threshold increase is estimated less frequently, which makes the current aggressive threshold estimation more conservative. Consequently, we estimate a more accurate threshold by increasing the memory size adaptively while building the MCF tree.

Figure 5 shows an example of the adaptive memory-size-increasing scheme. In Figure 5a, the memory size M(Tr_1) of the MCF tree Tr_1 built using T_1 = 4 becomes larger than the 1st increased memory size µ_1. At that point, α is set to M(Tr_1). Because Tr_1 consists of four leaf nodes ln_1, ln_2, ln_3, and ln_4, we compute their leaf minimum distances ρ_1 = 5, ρ_2 = 6, ρ_3 = 4, and ρ_4 = 9, their average ρ̄ = 6, and the standard deviation σ ≈ 1.9. Because ρ_4 ≥ ρ̄ + σ, ρ_4 is excluded, and because ω̄ = 5 ≥ T_1, T_2 is set to 5. Next, Tr_2 is rebuilt using T_2. Note that M(Tr_2) is always smaller than M(Tr_1) when T_2 is greater than T_1. Then, β is set to M(Tr_2), as shown in Figure 5b. The memory size is adaptively increased from µ_1 to µ_2 using α and β, as shown in Figure 5c. The remaining objects of the data set are inserted into Tr_2 based on T_2 until M(Tr_2) exceeds µ_2.
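The threshold-increase rule of Definition IV.2 and the adaptive memory-size update can be sketched as follows. The exact increment formula used here, µ_i = µ_{i−1} × (1 + 1/e^{β/α}), is our reading of the scheme (small increment when β/α is near one, large increment when it is near zero), not a formula quoted from the paper:

```python
import math

def increase_threshold(leaf_min_dists, prev_threshold):
    """Definition IV.2: drop leaf minimum distances >= mean + std,
    average the rest, and keep the threshold monotonically non-decreasing."""
    n = len(leaf_min_dists)
    mean = sum(leaf_min_dists) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in leaf_min_dists) / n)
    kept = [x for x in leaf_min_dists if x < mean + std]
    estimate = sum(kept) / len(kept)
    return max(estimate, prev_threshold)

def increase_memory(mu_prev, alpha, beta):
    """Adaptive memory-size increase: beta/alpha is the tree-size reduction
    ratio after a rebuild; 1/e**(beta/alpha) amplifies its effect.
    This particular combination is an assumption for illustration."""
    return mu_prev * (1.0 + 1.0 / math.exp(beta / alpha))

# Worked example from Figure 5: four leaf minimum distances, T_1 = 4.
rho = [5, 6, 4, 9]                       # mean = 6, std ~ 1.9, so 9 is excluded
print(increase_threshold(rho, prev_threshold=4))   # -> 5.0, matching T_2 = 5
```

Note how a small β/α (aggressive threshold) yields a much larger memory increment than a β/α near one, which is exactly the conservative-versus-aggressive balancing described above.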

C. APPLICATION OF ERC TO MCF TREE
ERC has been proposed as a global clustering method [35]. It reduces the clustering time by using the CF+ tree structure. In this section, we explain how to apply the ERC method to the MCF tree. The distance measure d_0 and average distance ar of the ERC method handle only numeric values. Thus, d_0 and ar are replaced with our distance measure D and diameter measure P, respectively. The three definitions and one method (ERQ) of ERC that use d_0 and ar are modified to use D and P. Note that the ERQ method is called by the ERS method of ERC.
First, we formally extend IMD to IMD_M for mixed-type objects.
Definition IV.3. Inter-microcluster distance: Given two microclusters X and Y with their corresponding diameters P(X) and P(Y), the inter-microcluster distance IMD_M(X, Y) between X and Y is defined in terms of D(X, Y), P(X), and P(Y).

The partition step of the ERC method divides a set of microclusters into a set of MCSs using IMD_M. Additionally, the ERQ method in the refinement step of the ERC method uses IMD_M to find connections between MCSs. The radius of each MCS is then obtained accordingly.

The ERC method performs a range query using the CF+ tree structure. To perform the range query, each subcluster must have a radius that covers all of its descendant microclusters. Similar to the CF+ tree, the radius of a subcluster of the MCF tree is formally defined as follows.

Definition IV.5. Radius of subcluster: On a non-leaf node in an MCF tree, the i-th subcluster SC_i consists of m clusters C_1, C_2, ..., C_m. The radius SC_i.r is defined as SC_i.r = max_{1≤j≤m} (D(SC_i, C_j) + P(C_j)).

Because a subcluster can have child subclusters, its radius can be computed recursively. For example, when the height of the MCF tree is three and the subcluster SC_r of the root node has m child subclusters SC_i (1 ≤ i ≤ m), each of which has l microclusters C_j (1 ≤ j ≤ l), the radius of SC_r is max_{1≤i≤m, 1≤j≤l} (D(SC_r, SC_i) + D(SC_i, C_j) + P(C_j)).

The radius of a subcluster must cover all of its descendant microclusters; in other words, the distance between the subcluster and each microcluster, plus that microcluster's diameter, must not exceed the radius. Assume that the radius of SC_r is the sum of the three distances D(SC_r, SC_i), D(SC_i, C_j), and P(C_j). If D(SC_r, C_j) + P(C_j) ≤ D(SC_r, SC_i) + D(SC_i, C_j) + P(C_j), then SC_r.r covers all of its descendant microclusters. This inequality simplifies to D(SC_r, C_j) ≤ D(SC_r, SC_i) + D(SC_i, C_j), i.e., the triangle inequality. The distance measure d_0 satisfies this inequality because it uses the Euclidean distance. Hence, we only need to show that the distance measure d also satisfies it.
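The recursive radius computation of Definition IV.5 can be sketched as follows. Plain Euclidean distance stands in for the paper's mixed-type distance D, and the dictionary-based node layout is our own illustration, not the actual MCF tree structure:

```python
import math

def euclid(a, b):
    # Stand-in for the mixed-type distance D (numeric part only).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def radius(node):
    """Radius of a subcluster: it must cover every descendant microcluster,
    so take the max of (distance to child + the child's own radius),
    bottoming out at a microcluster's diameter P."""
    if node["children"] is None:          # microcluster (leaf)
        return node["P"]
    return max(euclid(node["centroid"], c["centroid"]) + radius(c)
               for c in node["children"])

# A root subcluster with one child subcluster holding two microclusters.
mc1 = {"centroid": (0.0, 0.0), "P": 1.0, "children": None}
mc2 = {"centroid": (3.0, 0.0), "P": 0.5, "children": None}
sc = {"centroid": (1.0, 0.0), "children": [mc1, mc2]}
root = {"centroid": (1.0, 1.0), "children": [sc]}
print(radius(root))   # -> 3.5 = D(root, sc) + D(sc, mc2) + P(mc2)
```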
Theorem IV.1. Triangle inequality: Given X_c, Y_c, and Z_c, the following inequality holds:

d(X_c, Z_c) ≤ d(X_c, Y_c) + d(Y_c, Z_c). (11)

Each term of inequality (11) expands into sums of Hamming distances HD(·,·) between the categorical values of the corresponding histograms. Therefore, inequality (11) is rewritten so that, for all i ∈ [1, e] and k ∈ [1, g],

Σ_{j=1}^{f} HD(x_c_i, z_c_k) ≤ Σ_{j=1}^{f} HD(x_c_i, y_c_j) + Σ_{j=1}^{f} HD(y_c_j, z_c_k),

which is divided into f inequalities of the form

HD(x_c_i, z_c_k) ≤ HD(x_c_i, y_c_j) + HD(y_c_j, z_c_k), for each j ∈ [1, f]. (13)

Because the Hamming distance holds the triangle inequality, all inequalities in (13) also hold, and therefore inequality (11) holds. Because d and d_H are the same, it is guaranteed that the radius of a subcluster of the MCF tree covers its descendant microclusters in accordance with Theorem IV.1. This indicates that a range query based on the radii of the MCF tree determines the microclusters exactly.
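The proof rests on the Hamming distance itself satisfying the triangle inequality. A small exhaustive check over a toy categorical domain (two attributes, three values each) illustrates this property:

```python
from itertools import product

def hd(u, v):
    """Hamming distance between two categorical tuples."""
    return sum(a != b for a, b in zip(u, v))

# Exhaustively verify HD(x, z) <= HD(x, y) + HD(y, z) over the domain.
domain = list(product("abc", repeat=2))
for x in domain:
    for y in domain:
        for z in domain:
            assert hd(x, z) <= hd(x, y) + hd(y, z)
print("triangle inequality holds on the sampled domain")
```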

V. PERFORMANCE ANALYSIS
In this section, we provide theoretical and experimental analyses of our proposed clustering method.

A. THEORETICAL ANALYSIS
We analyze the time complexities of our proposed clustering method. Let N be the number of data objects, B the branching factor of the MCF tree, d_r the number of numeric attributes, and d_c the number of categorical attributes. We assume that each leaf entry of the MCF tree absorbs a data objects on average; hence, let n (= N/a) be the number of microclusters.
We first analyze the time complexity of building an MCF tree. An object is inserted into the MCF tree by following the closest entry at each node, where each node has up to B entries. The time complexity of inserting one object is O(B · d · log_B n), where d = d_r + d_c and log_B n is the height of the MCF tree. Thus, the time complexity of constructing a complete MCF tree for the entire data set is O(N · B · d · log_B n).
When rebuilding the MCF tree, inserting the microclusters into the MCF tree takes O(n · B · d · log_B n). The number of rebuilds depends on the π of our adaptive memory-size-increasing scheme. Therefore, rebuilding the MCF tree takes O((n/π) · B · d · log_B n). However, because the estimated threshold monotonically increases as the number of tree rebuilds increases, more objects are merged into the same microcluster every time the MCF tree is rebuilt. This, in turn, reduces both the number of microclusters and the height of the MCF tree. As a result, an increase in the number of rebuilds of the MCF tree reduces the time complexity of our proposed method.
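As a back-of-envelope illustration of the O(N · B · d · log_B n) bound, with assumed figures (the numbers below are not taken from the paper's experiments):

```python
import math

def insertion_ops(N, B, d, a):
    """Rough operation count for building the MCF tree:
    O(N * B * d * log_B n), with n = N / a microclusters."""
    n = N / a
    height = math.log(n, B)   # approximate tree height log_B n
    return N * B * d * height

# Assumed example: one million objects, branching factor 10,
# 20 attributes, 50 objects absorbed per microcluster on average.
ops = insertion_ops(N=1_000_000, B=10, d=20, a=50)
print(f"{ops:.3g}")   # roughly 8.6e+08 elementary distance operations
```

Doubling the per-microcluster absorption a lowers n, and hence the tree height, which is why a larger threshold (more absorption per rebuild) reduces the overall cost.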

B. EXPERIMENTAL SETUP
The performance of our method is compared with five existing clustering methods using various synthetic data sets and different real data sets. All experiments were carried out on a computer with an eight-core 2.5-GHz CPU, 32 GB of RAM, and a 1-TB SSD. In addition, we set C = 10 and L = 10 for the MCF tree. In the MCF tree, we keep at least 20% of the entries in each node to distribute the entries more evenly during a node split.
Our proposed MCF tree-based clustering method uses not only ERC but also the non-summary-based clustering methods as the second-phase global clustering scheme. For example, the method that uses WKP as the global clustering method is called MCF-WKP. Because these methods require mixed-type input data, all microclusters must be converted into objects in a mixed-type form. For each microcluster, the centroid (mode) of each numeric (categorical) attribute is used as the numeric (categorical) value of the converted object. The global clustering methods are then used to group the sets of converted objects into the final clusters.
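The microcluster-to-object conversion described above can be sketched as follows; the function name and row-based data layout are our own illustration:

```python
from collections import Counter

def to_mixed_object(numeric_rows, categorical_rows):
    """Convert a microcluster into one mixed-type object: the centroid
    (mean) of each numeric attribute and the mode (most frequent value)
    of each categorical attribute."""
    centroid = [sum(col) / len(col) for col in zip(*numeric_rows)]
    modes = [Counter(col).most_common(1)[0][0] for col in zip(*categorical_rows)]
    return centroid, modes

num = [[1.0, 2.0], [3.0, 4.0], [2.0, 3.0]]
cat = [["red", "S"], ["red", "M"], ["blue", "M"]]
print(to_mixed_object(num, cat))   # -> ([2.0, 3.0], ['red', 'M'])
```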
We also compared our method to the summary-based clustering method proposed by Chiu et al. [13], which clusters mixed-type data sets using a CF tree with modified CF vectors. To distinguish this tree from the CF tree of BIRCH, we use the term "CF_M tree". As in our method, the non-summary-based clustering methods were used as the global clustering method of the CF_M-based clustering method.
In the experiments, a random centroid initialization scheme was used for all K-means extension methods. For all K-means extension methods, we set the maximum number of iterations to 100, because almost all experiments converged within 100 iterations or showed no significant performance improvement thereafter.

2) Validation Measure
The average purity, inverse purity [7], Rand index [34], and execution time of these clustering methods were used as the primary performance measures. They were obtained by repeating each experiment ten times with different seeds on the same data set. The purity and inverse purity measures are formally described in [7]. The purity of a clustering evaluates how frequently the most similar objects dominate each cluster, while the inverse purity evaluates whether similar objects are placed in the same cluster. The execution time includes the building time of the MCF and CF_M trees.
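Purity and inverse purity, as described in [7], can be computed as in this sketch; clusters are lists of object indices, and the implementation details are ours:

```python
def purity(clusters, labels):
    """Purity: for each found cluster, count the most frequent true label,
    then normalize by the total number of objects."""
    total = sum(len(c) for c in clusters)
    hit = 0
    for c in clusters:
        counts = {}
        for i in c:
            counts[labels[i]] = counts.get(labels[i], 0) + 1
        hit += max(counts.values())
    return hit / total

def inverse_purity(clusters, labels):
    """Inverse purity: purity with the roles of clusters and true
    classes swapped."""
    classes = {}
    for i, lab in enumerate(labels):
        classes.setdefault(lab, []).append(i)
    assignment = {}
    for k, c in enumerate(clusters):
        for i in c:
            assignment[i] = k
    return purity(list(classes.values()), [assignment[i] for i in range(len(labels))])

clusters = [[0, 1, 2], [3]]
labels = ["a", "a", "b", "b"]
print(purity(clusters, labels), inverse_purity(clusters, labels))  # -> 0.75 0.75
```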

3) Data Sets
Diverse synthetic and real data sets were used in our experiments. Synthetic data sets were obtained using our synthetic data generator based on the parameters shown in Table 4.
All the synthetic data sets consist of K clusters with d_r numeric and d_c categorical attributes. All numeric attributes take real values, and each categorical attribute takes one of a fixed number of distinct values. All attributes are assumed to be independent. The numeric attributes are generated from normal distributions, and the categorical attributes are generated from multinomial distributions. The cluster centers are randomly located in the mixed-type data space, and the radius of each cluster is randomly selected from r_min to r_max. The number of objects in a cluster depends on the density of the cluster. All objects in each cluster are randomly generated around the cluster center within the radius. We deliberately doubled K while halving N to vary the average cluster density across the data sets; because the other parameters were unchanged, varying N and K changed the density of every cluster. In our experiments, K was doubled from 4 to 8 to 16 ... to 512 to 1024, while N was halved from 25000 to 12500 to 6250 ... to 195 to 98. All the centroids of the data sets were randomly placed in the mixed-type data space. Nine data sets were generated by the synthetic data generator, containing 233 k, 40 k, 179 k, 133 k, 158 k, 137 k, 141 k, 149 k, and 145 k objects, respectively. The default data set used the parameter values shown in Table 5.
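A simplified sketch of such a generator is shown below. The parameter names loosely mirror the description above, but the value ranges and the normal/multinomial parameterization are assumptions for illustration, not the actual generator of Table 4:

```python
import random

def gen_cluster(n, d_r, d_c, q, r, rng):
    """Generate one synthetic cluster: numeric values from normal
    distributions around a random center within radius r, categorical
    values from a cluster-specific multinomial over q categories."""
    center = [rng.uniform(-10, 10) for _ in range(d_r)]
    cat_weights = [[rng.random() for _ in range(q)] for _ in range(d_c)]
    objects = []
    for _ in range(n):
        nums = [rng.gauss(c, r / 3) for c in center]
        cats = [rng.choices(range(q), weights=w)[0] for w in cat_weights]
        objects.append(nums + cats)
    return objects

rng = random.Random(0)
data = gen_cluster(n=100, d_r=2, d_c=2, q=4, r=1.0, rng=rng)
print(len(data), len(data[0]))   # -> 100 4
```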
The cover type data set (CovType), the CovPokElec data set (CPE), and the KDDCup99 data set (KDD99) were used; all were provided by OpenML 1 . The KDD99 data set has been used to evaluate clustering methods in previous experiments [11], [12], [16]. All numeric attributes of all data sets were normalized [5]. In the experiments using ConvexK, each q-ary categorical attribute was converted to a 1-in-q representation [32]. The weights of ConvexK and the β of WKP and JIWKP were obtained experimentally. The experimental settings are shown in Table 6.

C. EXPERIMENTAL ANALYSIS
In this section, we discuss the effects of the parameters on our proposed method, such as the π of the adaptive memory-size-increasing scheme and the numeric attribute weight γ of our distance measure. We also show the performance of our proposed method on three real data sets compared to the other clustering methods.

1) Effect of Adaptive Memory-Size-Increasing Scheme
The effect of the adaptive memory-size-increasing scheme on the performance of our clustering method was analyzed using nine synthetic mixed-type data sets. In this experiment, we increased the percentage π from 1% to 5% and the number K of clusters from 4 to 1024. Note that our threshold scheme dynamically increases the memory size. We set the numeric attribute weight γ of our distance measure to 0.5. The performance of our method was analyzed using five non-summary-based clustering methods. Because they all showed the same performance trend in terms of purity, inverse purity, and execution time, we will only discuss MCF-WKP and MCF-ERC in this study.
In Figures 6-7, as π increased for each K, the execution times of all methods increased, whereas their purities and inverse purities remained similar, except for K = 512 and K = 1024. Note that the numbers of microclusters in Figures 6c and 7c were identical because the two MCF trees were built using the same data sets and parameters (i.e., π and γ). Our proposed method found groups of similar objects and summarized each group in a microcluster; consequently, the number of microclusters was much smaller than the number of objects for each data set, as shown in Figure 6c. Therefore, the clustering time of our proposed method was reduced while its accuracy was maintained.
However, in Figures 6a-7a, the MCF tree excessively summarized the data sets with K = 512 and K = 1024. This is because our proposed method with a low π obtained a large initial threshold; hence, objects from different true clusters were summarized in the same microcluster. When each cluster had a small number of objects and the number of clusters was relatively high, our proposed method computed an overestimated initial memory size using the initial π% of objects, which were sparsely distributed in the data space. For these data sets, the MCF-based clustering methods with π = 3% show purity and inverse purity similar to those of the other methods, as shown in Figures 6-7. According to these experiments, a proper π for our adaptive memory-size-increasing scheme is proportional to K/n, where n is the number of objects and K is the expected number of clusters; we consider that experts in the domain can predict the desired number of clusters. The real data sets usually contained fewer clusters and more objects than these synthetic data sets; for them, we found that π = 0.06% was sufficient for our adaptive memory-size-increasing scheme.

Next, we compared our summary-based clustering method to the non-summary-based clustering method WKP to demonstrate the efficiency of our proposed approach using the nine synthetic mixed-type data sets. We set π = 3% for all data sets because this was sufficient to cluster all synthetic data sets, as shown in Figures 6-7. Figure 8 shows that the MCF-based clustering methods, such as MCF-WKP and MCF-ERC, clustered these data sets more rapidly than WKP owing to there being fewer microclusters than objects in each data set. Therefore, our proposed method enhances the efficiency of clustering methods while maintaining clustering accuracy in terms of purity and inverse purity.
Now, we show in Figure 9 how the memory size and threshold in our adaptive memory-size-increasing scheme increased as the rebuilds of the MCF tree progressed. The two variables α and β were set to the memory sizes of the MCF trees immediately before and immediately after a rebuild, and the variable µ denotes the increased memory size.
In Figure 9, the small initial memory sizes for DDS and KDD99 (i.e., π = 1% for DDS and π = 0.06% for KDD99) were adaptively increased every time the MCF trees were rebuilt. In Figure 9a, the threshold increased until the 3rd rebuild of the MCF tree and then remained the same afterward, because the thresholds estimated at the 3rd and later rebuilds were lower than that at the 2nd rebuild. In BIRCH, the threshold is doubled when the newly computed threshold is lower than the current threshold; if the threshold were increased repeatedly in this way, it would likely be over-estimated.
In Figure 9b, β was much smaller than α at the 1st rebuild of the MCF tree, which indicates that the threshold was increased too aggressively. Thus, the memory size µ was significantly increased to reduce the frequency of tree rebuilds. Similarly, the memory size was increased substantially in the 2nd-5th rebuilds of the MCF tree because the newly increased thresholds were sufficiently large to summarize the KDD99 data set. Consequently, our adaptive memory-size-increasing scheme estimated a proper memory size for each data set without increasing the memory size too conservatively or too aggressively.

2) Effect of Numeric Attribute Weight
The effect of the numeric attribute weight γ of our distance measure on the performance of our clustering method was analyzed using DDS and KDD99. We set π = 1% for DDS and π = 0.06% for KDD99. Figure 10 shows that the purity and inverse purity of all methods were similar when varying γ for the synthetic data set. In Figure 11, the purity and inverse purity of MCF-ERC on the real data set decreased significantly when varying γ, except for γ between 0.4 and 0.5. This indicates that a balanced attribute weight was suitable for summarizing the synthetic and real data sets: with an extreme numeric attribute weight such as γ = 0.1 (0.9), our method rarely used the numeric (categorical) values of the mixed-type data sets for clustering, which decreased the clustering accuracy, as shown in Figure 11b. Therefore, we set γ = 0.5 in all subsequent experiments.
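A generic weighted mixed-type distance of the kind γ controls might look as follows; the paper's actual measure D may combine the numeric and categorical parts differently, so this is only a sketch:

```python
def mixed_distance(x, y, gamma, num_idx, cat_idx):
    """Sketch of a weighted mixed-type distance: gamma weights the
    Euclidean part of the numeric attributes against the Hamming part
    of the categorical attributes."""
    d_num = sum((x[i] - y[i]) ** 2 for i in num_idx) ** 0.5
    d_cat = sum(x[i] != y[i] for i in cat_idx)
    return gamma * d_num + (1 - gamma) * d_cat

x = [0.0, 1.0, "red", "S"]
y = [3.0, 5.0, "red", "M"]
print(mixed_distance(x, y, gamma=0.5, num_idx=[0, 1], cat_idx=[2, 3]))
# -> 0.5 * 5.0 + 0.5 * 1 = 3.0
```

With γ = 0.1 the categorical mismatch dominates, and with γ = 0.9 the numeric gap dominates, mirroring the accuracy drop observed at extreme weights.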

3) Effects for Real Mixed-type Data Sets
The performance of our method was analyzed by comparing it with non-summary-based clustering methods and CF_M-based clustering methods using three real mixed-type data sets. The non-summary-based clustering methods include AhmadK [5], ConvexK [32], WKP [24], JIWKP [27], and KMP [10]. We set the memory size of the CF_M-based clustering methods to 64 MB, the same memory size used in the experiments of [13]. We also set the memory size of the CF_M-based clustering methods to 32 MB and 128 MB to show the effect of varying the memory size. A CF_M-based clustering method is denoted by its global clustering method and memory size; for example, CF_M-WKP (32M) is the CF_M-based clustering method that uses WKP as the global clustering method and 32 MB of memory for building the CF_M tree. We set π = 0.06% for all real data sets. The results of these experiments are shown in Table 7.

In Table 7, the comparative non-summary-based clustering methods and the MCF-tree-based clustering methods usually demonstrate similar performance in terms of purity, inverse purity, and Rand index. Because the MCF tree gathers only similar objects into the same microcluster, microclusters can be used to cluster a mixed-type data set. Because the number of input objects has a significant impact on the time complexity of most clustering methods, including the methods used in our experiments, our method can cluster more rapidly than diverse non-summary-based clustering methods. Table 7 also shows that the inverse purities of our methods were higher than those of the non-summary-based clustering methods. The reason is that the MCF tree gathered similar objects into the same microcluster, so these similar objects were not assigned to different final clusters by the global clustering methods, whereas the non-summary-based methods occasionally assigned similar objects to different final clusters, which lowered the inverse purity.
This is one of the benefits of the summary-based clustering method, as opposed to some clustering methods, such as K-means, that only consider the distance between centroids and objects, which may cause similar objects to be grouped into different final clusters.
The CPE and KDD99 data sets were relatively large compared to the CovType data set, and the execution times of our methods on these data sets were significantly smaller than those of the non-summary-based clustering methods. The main reason for the reduced clustering time was the reduced number of input objects for the clustering method. Although the computation cost of each iteration of the K-means extension methods was not large, the same computation was required in every iteration; hence, decreasing the per-iteration computation cost was an important factor in the rapid clustering. These experiments demonstrate that our method is useful for clustering very large data sets.
In Table 7, the CF_M-based clustering methods for all real data sets were slower than the non-summary-based clustering methods when using a memory size of 32 MB or more. This is because they identified a large number of microclusters: when a CF_M-based clustering method was performed with a large memory size, it obtained a large number of microclusters, and the overall clustering time was therefore not reduced. In contrast, our method reduced the clustering times in most cases because it estimated a proper memory size using the adaptive memory-size-increasing scheme for each data set.
AhmadK usually requires a large amount of execution time in most experiments because it performs an exceptionally large number of distance computations over all distinct value pairs of every attribute pair before carrying out clustering. In contrast, our method required quite a small execution time, even though it also computed the distances of all distinct value pairs for every attribute pair of the converted objects. Because the number of converted objects was smaller than the number of objects in the data sets, MCF-AhmadK took less execution time while maintaining a similar performance in terms of purity, inverse purity, and Rand index.

Table 7 also shows the higher performance of the ERC method based on the MCF tree. The ERC method, which uses the structure of the MCF tree, reduced the number of distance computations between microclusters, and this decrease reduced the clustering time.

Table 8 shows the memory size required to summarize the data sets for each summary scheme, i.e., the CF_M and MCF trees 2 . The CF_M tree used the same memory size for all data sets because it summarized each data set based on the given memory. The execution time of most CF_M-based clustering methods increased with an increase in the memory size, yet their purity, inverse purity, and Rand index were not enhanced, as shown in Table 7. Thus, CF_M-based clustering methods are inefficient when the given memory is not suitable for summarizing the data sets. Table 7 shows that our method clusters these data sets 2 to 10 times faster than the CF_M-based clustering methods, because our adaptive memory-size-increasing scheme computed a proper memory size depending on the characteristics of each data set. In Table 8, the memory sizes of the MCF trees are much smaller than those of the CF_M trees.

2 The memory size of each data set used in the non-summary-based clustering methods.
This shows that the MCF tree summarizes data sets much better than the CF_M tree. As a result, the global clustering methods of our proposed scheme cluster far fewer microclusters than those of the CF_M-based clustering methods, which in turn significantly reduces the clustering times of our method.

VI. CONCLUSIONS
We proposed a novel summary-based clustering method that efficiently clusters very large mixed-type data sets by adaptively increasing the memory size for a more accurate threshold estimation. A cluster of a mixed-type data set is summarized in our proposed MCF vector, consisting of a CF vector and a histogram. Based on the MCF vector, we build the MCF tree using our proposed distance measure. Our adaptive memory-size-increasing scheme determines a proper initial memory size using a small portion of the objects of each data set (i.e., 0.06% - 3%). We demonstrated that our distance measure holds the triangle inequality property, which allows a range-query-based clustering method, ERC, to be used as our global clustering method. The scalability and flexibility of our clustering method were demonstrated using diverse large synthetic and real data sets. Consequently, our method clustered very large mixed-type data sets more rapidly than various existing clustering methods, while maintaining similar or better clustering accuracy.
HYEONG-CHEOL RYU received his B.S. degree in computer science from Hanshin University, Korea in 2009. He received his M.S. and Ph.D. degrees in computer science and engineering from Sogang University, Korea in 2014 and 2020, respectively. Since 2020, he has been with SK hynix, Icheon, Korea, where he is currently a Senior Engineer in the NAND SE technology Group. His research interests include spatial databases and data mining.
SUNGWON JUNG received his B.S. degree in computer science from Sogang University, Seoul, Korea in 1988. He received his M.S. and Ph.D. degrees in computer science from Michigan State University, Michigan, USA in 1990 and 1995, respectively. He is currently a Professor in the Computer Science and Engineering Department at Sogang University. His research interests include mobile databases, spatial databases, data mining, and multimedia databases. VOLUME 4, 2016