MapReduce-Based Distributed Clustering Method Using CF+ Tree

Clustering exceptionally large data sets is becoming a major challenge in data analytics with the continuous increase in their size. Summary-based clustering methods and distributed computing frameworks such as MapReduce can efficiently handle this challenge. These methods include BIRCH and its extension CF+-ERC. CF+-ERC can reduce the clustering time of large data sets by utilizing the structure of a CF+ tree. However, CF+-ERC is a sequential clustering method, so it cannot be used with multiple machines to reduce the clustering time. In this study, we propose a novel MapReduce-based distributed clustering method called CF+-ERC on MapReduce (CF+ERC_MR). It builds a CF+ tree for clustering an exceptionally large data set with a given threshold and finds the final clusters using MapReduce, which significantly reduces the clustering time. Further, our method is scalable with respect to the number of machines. The efficacy of this method is validated through not only its theoretical analysis but also in-depth experimental analysis of exceptionally large synthetic and real data sets. The experimental results demonstrate that the clustering speed of our approach is far superior to that of the existing clustering methods.


I. INTRODUCTION
Owing to the rapid advancement of the Internet and online technologies, exceptionally large collections of data containing digital traces from users and devices can be generated. Such large data sets can be generated by Twitter, Google, Yahoo, and Facebook [1]. Processing and analyzing such large data can provide valuable insights on the activity patterns and preferences of users or customers, which can strongly impact decision making in data-driven businesses. Clustering is a popular technique for analyzing large data sets in which the data set is divided into clusters based on some measure of similarity or distance.
Clustering methods are widely used in a variety of fields such as data science, machine learning, pattern recognition, and artificial intelligence [1]–[4]. These methods can be classified into three main categories: sampling-based, dimensionality reduction-based, and summary-based methods [5]. However, it is difficult to derive appropriate samples with sampling-based methods. Further, dimensionality reduction-based methods cannot extract the precise dimensions that maintain the key properties of the data sets. Therefore, summary-based clustering methods are more suitable for the analysis of large data sets.
(The associate editor coordinating the review of this manuscript and approving it for publication was Shanying Zhu.)
BIRCH [6] is a popular summary-based clustering method that is used to solve real-world problems owing to its wide applicability and high-level scalability with respect to the data size [7]–[10]. In BIRCH, a clustering feature (CF) tree is built by using a threshold value, which is used to obtain a suitable data summary. When the threshold value is not given, BIRCH uses the computer's memory to estimate the threshold value, following which the existing clustering methods are used as the global clustering method. In some real-world problems, a threshold value is often given as a criterion of clustering [11]–[17]. This criterion is similar to t in a distance-t stopping condition [18], where two clusters within t are grouped into the same cluster.
Recently, CF+-ERC [5] has been proposed as an extension of BIRCH that incorporates ERC (effective multiple range queries-based clustering) as the global clustering method.
ERC reduces the number of computations by using range queries based on the threshold value, which enables rapid clustering. However, CF+-ERC is a sequential clustering method, which consumes a significant amount of computation time to cluster exceptionally large data sets. Further, it cannot be implemented in a distributed computing environment, which implies that it cannot be used with multiple machines for rapid clustering.
Distributed computing frameworks such as MapReduce are an efficient alternative for handling large data sets [19]. Distributed clustering methods based on MapReduce have proven useful in analyzing such data sets. PKMeans [20] and mrk-means [21] implement K-means in parallel using MapReduce. Such distributed clustering methods can be used as the global clustering method of BIRCH. However, these methods do not use the threshold value, which makes them unsuitable for clustering large data sets for which threshold values are given. In addition, a version of BIRCH that uses distributed clustering methods as the global clustering method cannot build the CF tree in a distributed parallel manner.
In this study, we propose a novel distributed clustering method CF+ERC_MR, which is an extension of CF+-ERC based on MapReduce. The CF+ tree is built by using a given threshold value of a data set in the distributed computing environment. Note that our proposed method assumes data sets with given threshold values. Subsequently, ERC is invoked to cluster the data summary in a parallel way. The two key advantages of CF+ERC_MR are the reduction in the clustering time and the utilization of the threshold value.
There are three main challenges for extending CF+-ERC to the MapReduce programming paradigm, as follows.
• The workload should be evenly distributed to multiple reduce tasks.
• Each reduce task should receive similar objects.
• The results of the reduce tasks should be refined.
The region centroid set is used to avoid unbalanced data distribution and to send similar data to the same reduce task. Because clustering a divided data set might not find the clusters defined by the threshold, the refinement step merges the results of the multiple reduce tasks.
CF+ERC_MR rapidly clusters the data sets compared to the existing clustering methods because it runs a MapReduce job using the data summary. Since CF+ERC_MR uses the threshold value, it appropriately clusters the data sets of domains that use their corresponding threshold values. We have compared the performance of our method with that of PKMeans, mrk-means, and BIRCH-based distributed clustering methods on MapReduce. Our method is scalable, i.e., it can work across many machines in a distributed computing environment. Further, it can analyze a large data set more rapidly and accurately than the existing clustering methods.
The remainder of the paper is organized as follows. CF+-ERC and MapReduce are briefly described in Section II. The extension of CF+-ERC to the MapReduce programming paradigm is discussed in Section III. Section IV discusses the theoretical and experimental performance analyses of our proposed method. Finally, the study is concluded in Section V.

II. RELATED WORKS
CF+-ERC [5] is a state-of-the-art clustering method for analyzing large data sets. MapReduce [19] is a suitable programming paradigm to handle large data in a distributed computing environment. In this section, we focus on a detailed description of CF+-ERC because CF+ERC_MR is an extension of CF+-ERC that deals with data sets in a distributed parallel manner.
A. CF+-ERC
CF+-ERC builds a CF+ tree and conducts multiple range queries on this tree, which drastically reduces the clustering time. A node is split if it overflows. The first step in the node split is to choose the farthest entry pair of the overflow node as the seeds. Two new entry sets are created, and the seeds are separated into these sets. The remaining entries are redistributed between the two new entry sets by considering the centroids of the current entry sets, each of which may already contain several entries, during the node split.
CF+-ERC shares the definitions of the CF vector and CF tree along with the CF additivity theorem of BIRCH [6].
Definition 1: Given a cluster C of N d-dimensional objects, the CF vector of C is defined as the triple CF = (N, LS, SS), where N is the number of objects in C, LS is the linear sum of the N data, that is, LS = X_1 + · · · + X_N, and SS is the sum of the N squared data, that is, SS = X_1^2 + · · · + X_N^2.
Definition 2: The CF+ tree exhibits the following properties:
1) Each non-leaf node contains at most C entries in the form [CF_i, child_i, r_i], where i = 1, 2, . . . , C. CF_i is the CF vector of the i-th child entry, child_i is a pointer to the i-th child node, and r_i is the radius covering all the descendant microclusters. The i-th entry of the non-leaf node is called a subcluster SC_i.
2) Each leaf node contains at most L entries in the form [CF_i], where i = 1, 2, . . . , L and CF_i is the CF vector of the i-th child entry. The i-th entry of the leaf node is called a microcluster MC_i.
3) For efficient sequential scanning, all leaf nodes are linearly chained by the ''prev'' and ''next'' pointers.
Figure 1 shows the CF+ tree constructed by using 50 objects. The CF+ tree is built starting from the root node. Each data instance p recursively descends through the CF+ tree by choosing the closest child node in accordance with the distance metric. When p reaches the leaf node, it is absorbed into the microcluster of the node if the threshold requirement is satisfied. Otherwise, a new entry is generated and p is included in that entry. If the node has room for the new entry, the new entry is inserted in the node. Otherwise, the node splits. Splitting of the leaf node causes the insertion of a new non-leaf entry into the parent node. The parent node splits if it overflows. The node split can be propagated up to the root node. When the root node splits, the height of the CF+ tree increases by one. If the propagation of the node split ceases at a non-leaf node, merging refinement is attempted to reduce the number of nodes.
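The CF vector bookkeeping above can be sketched in a few lines of Python. This is an illustrative model only (the class and method names are ours, not the paper's); it shows how N, LS, and SS support the additivity theorem and how the centroid and radius are derived from them.

```python
import math

class CF:
    """Clustering feature (CF) vector: (N, LS, SS) of a cluster."""
    def __init__(self, point=None, dim=2):
        if point is None:
            self.n, self.ls, self.ss = 0, [0.0] * dim, 0.0
        else:
            self.n = 1
            self.ls = list(point)                     # linear sum LS
            self.ss = sum(x * x for x in point)       # square sum SS

    def add(self, other):
        """CF additivity theorem: merging two clusters adds their CF vectors."""
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss += other.ss

    def centroid(self):
        return [x / self.n for x in self.ls]

    def radius(self):
        """Root-mean-square distance of members to the centroid,
        computable from (N, LS, SS) alone."""
        c = self.centroid()
        sq = self.ss / self.n - sum(x * x for x in c)
        return math.sqrt(max(sq, 0.0))
```

For example, merging the CF vectors of (0, 0) and (2, 0) yields N = 2, LS = (2, 0), SS = 4, centroid (1, 0), and radius 1.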
Subsequently, the threshold value is used as the criterion for global clustering. The inter-microcluster distance (IMD) is used to compute the distance between two microclusters. IMD between two microclusters is the Euclidean distance between their centroids minus the sum of their radii. If IMD between two microclusters is less than the threshold value, they reside in the same final cluster. Note that the term ''final cluster'' refers to the cluster obtained by the clustering methods.
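A minimal sketch of the IMD test described above, assuming a microcluster is represented by its centroid and radius (the helper names are ours):

```python
import math

def imd(c1, r1, c2, r2):
    """Inter-microcluster distance: the Euclidean distance between the
    two centroids minus the sum of the two radii (it can be negative
    when the microclusters overlap)."""
    return math.dist(c1, c2) - (r1 + r2)

def same_final_cluster(c1, r1, c2, r2, threshold):
    """Two microclusters reside in the same final cluster when their
    IMD is less than the threshold value."""
    return imd(c1, r1, c2, r2) < threshold
```

For two microclusters with centroids (0, 0) and (5, 0) and radii 1 each, the IMD is 3, so they fall into the same final cluster for T = 4 but not for T = 2.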
ERC is conducted by utilizing the structure of the CF+ tree and the threshold value T. It can be divided into two steps: the partition and refinement steps. In the partition step, the linearly adjacent microclusters within T are grouped into a segment called a microcluster segment (MCS) using the microcluster linearization property of the CF+ tree. For instance, in Figure 1, the CF+ tree has eleven microclusters (MC1, MC2, · · · , MC11) that are linearly adjacent. Because IMD(MC2, MC3) (the IMD between MC2 and MC3) is less than T, MC2 and MC3 are grouped into MCS2. Similarly, the partition step finds six microcluster segments MCS1, MCS2, · · · , MCS6.
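The partition step above can be sketched as a single pass over the linearly chained microclusters. This is a simplification under our own representation (the microclusters are given in their prev/next leaf order, and the IMD function is passed in):

```python
def partition_segments(microclusters, threshold, imd):
    """Partition step sketch: walk the linearly chained microclusters
    and group adjacent ones whose IMD is below the threshold into
    microcluster segments (MCSs)."""
    segments = []
    current = [microclusters[0]]
    for prev, cur in zip(microclusters, microclusters[1:]):
        if imd(prev, cur) < threshold:
            current.append(cur)        # extend the current segment
        else:
            segments.append(current)   # close the segment, start a new one
            current = [cur]
    segments.append(current)
    return segments

# Toy 1-D example: microclusters as scalar centroids with zero radius.
segments = partition_segments([0.0, 1.0, 2.0, 10.0, 11.0], 2.0,
                              imd=lambda a, b: abs(a - b))
# segments == [[0.0, 1.0, 2.0], [10.0, 11.0]]
```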
In the refinement step, the effective refinement step (ERS) of ERC is employed to gather a set of MCSs into the set of final clusters using multiple range queries. For each MCS, ERS employs an effective range query (ERQ) to find the connections between that MCS and other MCSs without redundant query computation by using the structure of the CF+ tree. A connection between two microcluster segments MCSi and MCSj implies that at least one IMD between MCx and MCy is less than T, where MCx ∈ MCSi and MCy ∈ MCSj. This indicates that MCSi and MCSj are in the same final cluster.
In Figure 1, IMD(MC2, MC4) and IMD(MC2, MC6) are less than T. However, neither microcluster pair is linearly adjacent, so they are not grouped into the same microcluster segment in the partition step. In the refinement step, ERS conducts an ERQ for each MCS to find the connections to other MCSs. The ERQ for MCS2 checks whether IMD(MC2, MC4) and IMD(MC2, MC6) are less than T, and then finds the two connections (MCS2, MCS3) and (MCS2, MCS4), respectively. Subsequently, ERS merges MCS2, MCS3, and MCS4 into the final cluster C2. The ERQs for the other MCSs (i.e., MCS1, MCS3, and MCS4) find no connections between them. Thus, ERC returns the four final clusters C1, C2, C3, and C4 shown at the bottom of Figure 1.

B. MAPREDUCE
MapReduce is a programming paradigm for the distributed and scalable computation of large data. For example, Facebook employs MapReduce running on Hadoop to deal with a large part of their data-processing applications [22]. Hadoop [23] is a widely used open-source implementation of MapReduce. Figure 2 shows the overall procedure of a MapReduce job. While conducting a MapReduce job, the inputs and outputs are treated as ⟨key, value⟩ pairs. The machine performing a map task is called a mapper, while the machine performing a reduce task is called a reducer. The number of mappers and reducers depends on the number of machines in the system and the user configuration. A MapReduce job consists of five phases: the input, map, sort and shuffle, reduce, and output phases. In the input phase, the input data set is divided into m splits and the splits are distributed to the mappers. In the map phase, each mapper receives several splits. For each split, the map task is performed to generate an intermediate result, which is then utilized in the sort and shuffle phase.
In the sort and shuffle phase, the intermediate results are partitioned based on their key and sent to the reducer that manages that key. In the reduce phase, each reducer receives ⟨key, list of values⟩ pairs and performs the reduce task on each pair. Finally, the results of the reduce tasks are gathered and written to the distributed file system, which constitutes the result of the MapReduce job.
The user only defines the map and reduce tasks to process the input data set in a parallel way on MapReduce. Meanwhile, the user must consider the characteristics of MapReduce. In particular, since MapReduce has a shared-nothing architecture, the map and reduce tasks cannot read the data sent to other map and reduce tasks. Thus, every map and reduce task can only use its own input data to generate its results.
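The five phases and the shared-nothing constraint can be modeled with a toy in-memory simulation. This is not Hadoop's API, only a sketch of the dataflow: map tasks see only their split, and each reduce task sees only the ⟨key, list of values⟩ pairs routed to its partition.

```python
from collections import defaultdict

def run_mapreduce(splits, map_task, reduce_task, num_reducers):
    """Toy model of a MapReduce job: map, sort/shuffle by key, reduce."""
    # Map phase: each split is processed independently.
    intermediate = []
    for split in splits:
        intermediate.extend(map_task(split))
    # Sort and shuffle phase: partition the intermediate pairs by key.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in intermediate:
        partitions[hash(key) % num_reducers][key].append(value)
    # Reduce phase: each reducer handles only the keys in its partition.
    output = {}
    for part in partitions:
        for key, values in part.items():
            output[key] = reduce_task(key, values)
    return output

# Example: count word occurrences across two splits.
splits = [["a", "b", "a"], ["b", "b"]]
counts = run_mapreduce(splits,
                       map_task=lambda s: [(w, 1) for w in s],
                       reduce_task=lambda k, vs: sum(vs),
                       num_reducers=2)
# counts == {"a": 2, "b": 3}
```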

III. CF+ERC_MR: CF+-ERC ON MapReduce
In this section, we discuss the problems encountered while conducting CF+-ERC on MapReduce, which include building the CF+ tree and performing ERC in a parallel manner, and our solutions to these problems.

A. PROCESS FLOW OF CF+ERC_MR
The proposed method, CF+ERC_MR, can be broadly divided into three steps: space-partitioning, clustering, and refining. Figure 3 shows the overall process flow of this method. CF+ERC_MR must perform the additional task of merging the results of the reduce tasks to obtain the final result in the refining step (see Section III-D). In the space-partitioning step, a region centroid set V for the map tasks is obtained in a sequential manner. V is used to guide the intermediate result of each map task to its proper reduce task. This is discussed in detail in Section III-B. In the clustering step, the local final clusters are determined using MapReduce in parallel. Note that the term ''local final clusters'' refers to the final clusters obtained by ERC in the reduce phase. Each map task receives its corresponding split data and V. In each map task, a CF+ tree is first built by using a given threshold value, and then the set of microclusters of this tree is determined. All the microclusters are sent to their proper reduce tasks through the sort and shuffle phase in accordance with V. Each reduce task receives a ⟨key, list of values⟩ pair and then assembles the local final clusters.
For example, the map task M1 in Figure 3 only reads the ⟨key, value⟩ pairs of split 1 and uses them to build a CF+ tree. The set of microclusters of the CF+ tree is the intermediate result I1, which is sent to the reduce tasks Ri (1 ≤ i ≤ r) through the sort and shuffle phase in the form of ⟨i, MC⟩ pairs. The reduce task Ri (1 ≤ i ≤ r) receives the ⟨i, list of MCs⟩ pair and then builds a CF+ tree using the list of MCs. After completely building the tree, Ri (1 ≤ i ≤ r) finds the local final clusters. The set of local final clusters represents the result Oi, which is discussed in Section III-C.
In the refining step, the local final clusters are sequentially merged into the global final clusters. Note that the term ''global final clusters'' refers to the final clusters merged from the local final clusters at the refinement phase. Because a reduce task finds local final clusters consisting only of the microclusters in that reduce task, the global final clusters based on those local final clusters must still be determined; this is explained in Section III-D.

B. SPACE PARTITIONING STEP OF CF+ERC_MR
All the reduce tasks are simultaneously conducted in isolation in the clustering step. The reducer performing the ''straggler'' reduce task may continue to run even after the other reducers have already finished their reduce tasks. In such a case, MapReduce cannot finish the reduce phase, which is a major factor that contributes to the computation time of a MapReduce job [24]. Thus, it is necessary to send an equal workload to every reduce task.
If the intermediate results of map tasks are randomly and evenly distributed to the reduce tasks, every reduce task receives the intermediate results that are spread in the entire data space. In each reduce task, the local final clusters in the entire data space are determined. This implies that the local final clusters obtained from different reduce tasks overlap with each other. Thus, assembling the global final clusters by combining all the local final clusters can be time-consuming.
However, collecting similar objects in the same reduce task can benefit multiple reduce tasks. If the entire data space is divided into exclusive regions and each reduce task oversees the microclusters in its respective region, the local final clusters of different reduce tasks do not overlap. In addition, each reduce task can find local final clusters that might directly become global final clusters. This, in turn, decreases the workload of the refining step. In other words, the intermediate results of the map tasks should be evenly distributed to every reduce task, and the input microclusters of each reduce task should be similar. We now compute the region centroid sets to satisfy these conditions.
A map task cannot read the split data sent to other map tasks. Therefore, each map task must receive the region centroid set V in advance, which is then used to distribute the microclusters evenly while keeping similar microclusters together. Distributing the microclusters to the r reduce tasks requires knowledge of the distribution of all the objects. Sampling facilitates the estimation of the entire object distribution, so only the samples obtained by applying the reservoir sampling technique [25] are used.
Algorithm 1 describes the space-partitioning function for generating a region centroid set V. The K-means++ method is conducted using the sample set D, where K is set to r. In the K-means++ approach, the centroid of each cluster is considered as a region centroid because it represents the center of that cluster. A set V of r region centroids is broadcast to all the map and reduce tasks.
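The space-partitioning step can be sketched as follows: reservoir-sample the stream, run K-means++ on the sample with K = r, and return the cluster centroids as V. This is an illustrative stand-in under our own function names and a plain Lloyd refinement, not the paper's implementation.

```python
import math, random

def reservoir_sample(stream, s, rng=random):
    """Reservoir sampling: keep s uniform samples from a stream."""
    sample = []
    for t, x in enumerate(stream):
        if t < s:
            sample.append(x)
        else:
            j = rng.randrange(t + 1)
            if j < s:
                sample[j] = x
    return sample

def region_centroids(stream, r, s=100, iters=10, seed=0):
    """Sketch of the space-partitioning step: sample the data, run
    K-means++ with K = r on the sample, and return the r cluster
    centroids as the region centroid set V."""
    rng = random.Random(seed)
    data = reservoir_sample(stream, s, rng)
    # K-means++ seeding: next center chosen proportional to squared distance.
    centers = [rng.choice(data)]
    while len(centers) < r:
        d2 = [min(math.dist(x, c) ** 2 for c in centers) for x in data]
        pick, acc = rng.uniform(0, sum(d2)), 0.0
        for x, w in zip(data, d2):
            acc += w
            if acc >= pick:
                centers.append(x)
                break
    # A few Lloyd iterations to refine the centers.
    for _ in range(iters):
        groups = [[] for _ in range(r)]
        for x in data:
            i = min(range(r), key=lambda k: math.dist(x, centers[k]))
            groups[i].append(x)
        for i, g in enumerate(groups):
            if g:
                centers[i] = tuple(sum(v) / len(g) for v in zip(*g))
    return centers
```

On two well-separated blobs with r = 2, the returned centroids land near the two blob centers, giving one exclusive region per reduce task.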
Algorithm 1 SpacePartition(D, r)
INPUT D: Sample set, r: Number of reduce tasks
OUTPUT V: A set of region centroids
1: {C1, · · · , Cr} ← K-means++(D, r)
2: V ← ∅
3: for Cj ∈ {C1, · · · , Cr} do
4: Compute the centroid vj of Cj
5: Add vj to V
6: return V

C. CLUSTERING STEP OF CF+ERC_MR
The map and reduce tasks are described in this subsection. Algorithm 2 shows the map task of the clustering step. Here, ED(MC, vj) is the Euclidean distance between a microcluster MC and a region centroid vj. Each map task receives an input split S, the set V of region centroids, and the given threshold value T of the data set. Note that each map function uses the same threshold value T, which is assumed to be given for data sets of application domains. It first creates a CF+ tree (tree) with T. Subsequently, all the objects of S are inserted into tree.
Then, each microcluster MC of tree is sent to the reduce task Ri whose region centroid vi is the closest to MC.
Algorithm 2 Map(S, V, T)
INPUT S: Input split, V: A set of region centroids, T: Threshold value
OUTPUT ⟨i, MC⟩ with i: Reduce task index, MC: A microcluster
1: Build CF+ tree tree using S and T
2: M ← A set of microclusters of tree
3: for MC ∈ M do
4: min ← ∞, i ← 0
5: for vj ∈ V do
6: if ED(MC, vj) < min then min ← ED(MC, vj), i ← j
7: Emit ⟨i, MC⟩

However, occasionally these microclusters do not satisfy the threshold requirement. In Figure 3, the entire input data set is divided into m splits, which are processed by the corresponding m map tasks. Figure 4 shows an example of the intermediate results obtained from four map tasks M1, M2, M3, and M4. All the map tasks handle objects in the same data space, so their intermediate results may overlap. If the distance between the centroids of any two microclusters (e.g., o1 and o2 in Figure 4) is less than the threshold value T, they must be merged into the same microcluster.
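The map task can be sketched in a few lines, with the CF+ tree construction abstracted behind a summarizer that returns (centroid, radius) pairs. The function names and the toy summarizer are ours, for illustration only:

```python
import math

def map_task(split, V, T, build_microclusters):
    """Map task sketch (mirroring Algorithm 2): summarize the split into
    microclusters with threshold T, then emit each microcluster keyed by
    the index of its closest region centroid.  `build_microclusters`
    stands in for the CF+ tree construction and must return
    (centroid, radius) pairs."""
    for centroid, radius in build_microclusters(split, T):
        # Route MC to the reduce task whose region centroid is closest.
        i = min(range(len(V)), key=lambda k: math.dist(centroid, V[k]))
        yield (i, (centroid, radius))

# Toy summarizer: every point becomes its own zero-radius microcluster.
naive = lambda split, T: [(p, 0.0) for p in split]

V = [(0.0, 0.0), (10.0, 10.0)]
pairs = list(map_task([(1.0, 1.0), (9.0, 9.0)], V, T=0.5,
                      build_microclusters=naive))
# pairs == [(0, ((1.0, 1.0), 0.0)), (1, ((9.0, 9.0), 0.0))]
```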
The validity of the threshold requirement among the microclusters must be rechecked in the reduce task by building a tree. While building the tree, each microcluster MC is inserted into the tree and routed to its proper leaf entry el in the tree based on the closest criteria. If MC is closer to el than T, MC is absorbed into el. Thus, the threshold requirement of all the microclusters can be rechecked when the tree is built using these microclusters. After completely building the tree, the reduce task implements ERC to find the local final clusters.
Figure 5 illustrates an example of the microclusters of the CF+ trees in three reduce tasks R1, R2, and R3. R1-R3 have built their CF+ trees after the intermediate results of four map tasks were sent to the reduce tasks based on the region centroids v1-v3. Here, each of the smallest circles represents a microcluster. The pattern of each of the smallest circles indicates the label of the global final clusters (C1, C2, · · · , C5). The microclusters in the same global final cluster are covered by a thick solid circle. A set of microclusters connected via a dotted line indicates a local final cluster determined by ERC during the reduce task. Each local final cluster is covered by a dashed circle in this figure. Each region separated by solid lines in the data space indicates the local region managed by a reduce task such as R1, R2, or R3. For the reduce task R1, four microclusters are generated and three local final clusters are obtained. Thus, there are five global final clusters (C1, C2, · · · , C5) in this example.
This figure demonstrates why the local final clusters must be refined. Because the data space is divided based on V, some global final clusters (i.e., C2 and C5) are the same as the local final clusters. Contrarily, the other global final clusters (i.e., C1, C3, and C4) cannot be established by the reduce tasks, which cannot read the microclusters sent to other reduce tasks.
Here, the local final clusters consisting of the objects in each region separated by the solid lines are subsets of the global final clusters, as shown in Figure 5. Therefore, the additional connections between the local final clusters of the different regions must be obtained to assemble the three global final clusters C1, C3, and C4 after the clustering step.
The border between the regions handled by two reduce tasks is used as the baseline to discern microclusters that are likely to merge into the same global final cluster. If a microcluster is closer to the border than T, it can be merged with microclusters in another reduce task. The region within T of the border is called the border region BR. For example, the solid line in Figure 5 represents the border, and the region between the dashed lines, including the solid line, represents BR.
Let B be the set of microclusters that overlap with BR. A scalar projection is used to determine B. Let Ri and Rj be two reduce tasks. To check whether a microcluster MC of Ri is included in B, the following approach is used. Let vi, vj ∈ V be the two region centroids of the two regions managed by Ri and Rj, respectively. dc is the sum of the average radius of MC and the scalar projection of the centroid of MC onto the line between vi and vj, while dh is the distance between vi and vj divided by two. If the difference between dc and dh is less than T, MC is included in B. We name all MCs in B border microclusters. Figure 6 shows an example of the scalar projection of MC1 onto the line between v1 and v2. If the difference between dc and dh is less than T, there is a possibility that MC1 may be connected to one or more microclusters in R2 (MC2 in this figure). Thus, MC1 becomes a border microcluster and is added to B.
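The scalar-projection test can be sketched directly from the definitions of dc and dh above. The function name is ours, and we interpret ''difference'' as the absolute difference; only centroid and radius are assumed for a microcluster:

```python
import math

def is_border(centroid, radius, vi, vj, T):
    """Border-microcluster test sketch: project MC's centroid onto the
    line from region centroid vi to vj.  The border sits at the halfway
    point dh = |vi vj| / 2; dc adds MC's radius to the projection, and
    MC is a border microcluster when dc comes within T of dh."""
    u = [b - a for a, b in zip(vi, vj)]            # direction vi -> vj
    norm = math.hypot(*u)
    proj = sum((c - a) * x for c, a, x in zip(centroid, vi, u)) / norm
    d_c = proj + radius        # farthest extent of MC along the line
    d_h = norm / 2.0           # distance from vi to the border
    return abs(d_h - d_c) < T
```

For vi = (0, 0) and vj = (10, 0), the border lies at x = 5; a microcluster centered at (4.5, 0) with radius 0.2 has dc = 4.7, so it is a border microcluster for T = 0.5, while one at (1, 0) is not.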
The function isOverlap(i, V, MC, T) returns true if MC of Ri overlaps with BR. If it returns true, MC is added to B. After all border microclusters are found, B is stored in a distributed file system (DFS) such as HDFS [26]. Subsequently, B is rechecked in the refining step to obtain the global final clusters.
Algorithm 3 shows the reduce task of the clustering step. The reduce task receives the ⟨i, M⟩ pair generated during the shuffle phase, the threshold value T, and the set V of region centroids. It builds a CF+ tree, called tree, using the microclusters of M to recheck the validity of the threshold requirement. After completely building tree, the reduce task Ri performs ERC(tree, T) to find the local final cluster set L^i by utilizing the structure of tree; the reduce tasks run in parallel. Assume that L^i consists of n local final clusters {L^i_1, L^i_2, · · · , L^i_n}, where each local final cluster L^i_l is a set of microclusters. All the reduce tasks store their border microcluster sets in DFS, and the local final cluster sets resulting from all the reduce tasks are also written to DFS. After the reduce phase of MapReduce, the refining step reads these border microcluster sets and local final cluster sets during the refinement phase, as shown in Figure 3. It merges the local final cluster sets into the global final clusters using the border microcluster sets.
D. REFINING STEP OF CF+ERC_MR
Let L^i_l be the l-th local final cluster resulting from the reduce task Ri. Suppose two border microclusters MCx ∈ L^i_l and MCy ∈ L^j_m are given. If IMD(MCx, MCy) is less than T, MCx and MCy are connected, and then the two local final clusters L^i_l and L^j_m are also connected. The global final clusters are then assembled by sequentially combining the connected local final clusters. The connected local final clusters can be efficiently obtained by calling ERC on a CF+ tree composed of the border microclusters. However, one border microcluster MCx could be absorbed into another by the threshold requirement when building such a CF+ tree, in which case the connections involving MCx could not be found.
We therefore propose a refining CF+ tree consisting of the border microclusters, in which each border microcluster maintains both its reduce task index and its local final cluster index. While building the refining CF+ tree, we do not allow any border microcluster to be absorbed into an existing leaf node entry. Thus, each border microcluster becomes a new entry of a leaf node in the tree. After building the refining CF+ tree, executing ERC over it gives the sets of connected local final clusters, which are merged into the same global final clusters. In contrast, a local final cluster that is not merged with another local final cluster becomes a global final cluster by itself.
Algorithm 4 is used to determine the global final clusters by using the set B of border microclusters. Here, the list of local final cluster index sets I = {I1, I2, · · · , Ir} and the list of border microcluster sets B = {B1, B2, · · · , Br} are read from DFS. The refining CF+ tree tree is built by inserting the border microclusters of B. The connection set of the local final clusters P = {P1, P2, · · · , Pp} is obtained by performing ERC in a sequential way.
Since Pk (1 ≤ k ≤ p) consists of connected border microclusters, the local final clusters containing the border microclusters of Pk are also connected. Thus, these local final clusters are merged into the same global final cluster. Subsequently, the remaining local final clusters are moved into the global final clusters as they stand because they are not connected to any others. Consequently, Algorithm 4 refines a set of local final clusters to obtain the desired global final clusters.

Algorithm 4 Refine(T)
INPUT T: Threshold value
OUTPUT G: A list of global final cluster index sets
1: Read I = {I1, I2, · · · , Ir} from DFS
2: Read B = {B1, B2, · · · , Br} from DFS
▷ Build a refining CF+ tree using B incrementally
3: tree ← ∅ ▷ tree: refining CF+ tree
4: for Bi ∈ B do
5: for (l, MC) ∈ Bi do ▷ MC: microcluster
6: Insert (i, l, MC) into tree
7: P ← ERC(tree, T) ▷ P = {P1, P2, · · · , Pp}; Pk consists of connected border microclusters
8: List G ← ∅
▷ Merge the indices of the connected local final clusters
9: for Pk ∈ P do ▷ find the indices of the local final clusters using Pk
10: E ← ∅
11: for (i, l, MC) ∈ Pk do ▷ l: index of the l-th local final cluster of Ri
12: if l ∈ Ii then
13: Remove l from Ii
14: Insert (i, l) into E
15: end if
16: Add E to G
▷ Add the indices of the remaining local final clusters to G
17: for Ii ∈ I do
18: for l ∈ Ii do
19: Add {(i, l)} to G
20: return G

Figure 7 shows the process flow of the refining step using the fourteen microclusters in Figure 5.
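The essence of the refining step, merging local final clusters that are connected through border microclusters, can be sketched with a union-find structure. This is a simplification: the paper finds the connections with ERC over the refining CF+ tree, whereas the all-pairs IMD scan below is a naive stand-in, and each border microcluster is tagged with its (reduce task index i, local final cluster index l).

```python
class DisjointSet:
    """Union-find used to merge connected local final clusters."""
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def refine(border_mcs, imd, T):
    """Refining-step sketch: any pair of border microclusters whose IMD
    is below T connects its two local final clusters; the connected
    components are the global final clusters."""
    ds = DisjointSet()
    for i, l, mc in border_mcs:
        ds.find((i, l))                       # register every local cluster
    for a in range(len(border_mcs)):
        for b in range(a + 1, len(border_mcs)):
            ia, la, ma = border_mcs[a]
            ib, lb, mb = border_mcs[b]
            if imd(ma, mb) < T:
                ds.union((ia, la), (ib, lb))  # connect the two local clusters
    groups = {}
    for key in ds.parent:
        groups.setdefault(ds.find(key), set()).add(key)
    return list(groups.values())
```

For instance, with 1-D border microclusters at 0.0 (from R1), 0.4 (from R2), and 5.0 (from R3) and T = 1, the first two local final clusters merge into one global final cluster while the third remains on its own.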

IV. PERFORMANCE ANALYSIS
The effectiveness of the proposed clustering method, CF + ERC_MR, was validated through theoretical and experimental analyses.

A. THEORETICAL ANALYSIS
The time complexities of our proposed clustering method are analyzed. CF+ERC_MR is divided into three steps: space partitioning, clustering, and refining. The clustering step is further divided into two tasks: map and reduce. MapReduce runs on n nodes (one NameNode and (n − 1) workers). Since each worker normally performs either map tasks or reduce tasks, the (n − 1) workers consist of µ mappers and ν reducers, thus giving n = µ + ν + 1. Table 1 summarizes the symbolic notations frequently used in this section.
We first discuss the time complexities of the CF+ tree construction and of ERC. We assume that each leaf entry of the CF+ tree absorbs α objects, on average. The time complexity of the CF+ tree construction is O(N · L · d · log_L(N/α)) [5]. Let W be the number of microcluster segments and w be the average number of microclusters resulting from the multiple range queries. The full time complexity of ERC additionally contains terms for all the radii computations of the CF+ tree and for the partition step of ERC, expressed in terms of the height h of the CF+ tree and the average number p of indices in the po lists [5]. Since all the radii computations and the partition step are performed only once, both terms are negligible; further, p is usually very small. Therefore, ERC takes O(W · w · L · d · log_L(N/α)). The space partitioning step takes O(s · d · (r + i)), where s is the number of samples and i is the number of iterations of K-means++. Let A = O(s · d · (r + i)).
The map task of the clustering step receives N/m objects, on average. The CF+ tree construction takes O((N/m) · L · d · log_L(N/(m·α))). Distributing the N/(m·α) microclusters to the r reduce tasks takes O((N/(m·α)) · d · r). Let B = O((N/m) · L · d · log_L(N/(m·α)) + (N/(m·α)) · d · r). Since µ mappers run m map tasks, the time complexity of each mapper is O((m/µ) · B). In the reduce task of the clustering step, since all the map tasks produce N/α microclusters, each reduce task receives N/(α·r) microclusters, on average. The CF+ tree construction takes O((N/(α·r)) · L · d · log_L(N/(α·β·r))). ERC takes O(κ · λ · L · d · log_L(N/(α·β·r))). Finding the border microclusters takes O((N/(α·β·r)) · d · r²). Let C = O(((N/(α·r)) + κ · λ) · L · d · log_L(N/(α·β·r)) + (N/(α·β·r)) · d · r²). Because the (n − µ − 1) reducers run r reduce tasks, the time complexity of each reducer is O((r/(n − µ − 1)) · C).

VOLUME 8, 2020
The refining step reads the border microclusters and the indices of the local final clusters. The refining CF+ tree construction takes O(η · L · d · log_L η). ERC takes O(ψ · ω · L · d · log_L η).

Additional time is then needed to merge the connected local final clusters into the global final clusters.
Combining the above time complexities, the total time complexity of our proposed clustering method is Note that an increase in the number of reduce tasks expands the border region, which can increase the number of border microclusters. This, in turn, increases the time complexity of the refinement step. Further, the entire data space was finely partitioned for the higher number of reduce tasks. Thus, it is more difficult to find the local final clusters that might become the global final clusters. Therefore, a small value of the number of reduce tasks is sufficient for our proposed method.
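The merging of local final clusters in the refinement step can be sketched as a union-find pass over the connections discovered through border microclusters. This is an illustrative sketch under our own assumptions, not the paper's implementation; the function name and input format are hypothetical.

```python
# Sketch (not the paper's code): merging local final clusters into global
# final clusters in the refinement step, modeled with union-find.
# `connected_pairs` stands for pairs of local clusters found to be
# connected through shared border microclusters.

def merge_local_clusters(local_ids, connected_pairs):
    """Union-find over local final cluster IDs; connected local clusters
    end up mapped to the same global final cluster representative."""
    parent = {c: c for c in local_ids}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for a, b in connected_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb  # union the two components

    # map every local cluster to its global cluster representative
    return {c: find(c) for c in local_ids}
```

Since each union and find is nearly constant time, the merge is indeed roughly linear in the number of local final clusters.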

B. EXPERIMENTAL SETUP
A diverse range of synthetic data sets and two real data sets were used to compare the performance of our method with that of existing clustering methods, including parallel K-means (PKMeans) [20], mrk-means [21], BIRCH-based distributed clustering methods, and distributed BIRCH. BIRCH-based distributed clustering methods build a CF tree on the NameNode and then perform distributed clustering using the set of microclusters of the tree. When PKMeans and mrk-means are used as the global clustering method of BIRCH, the resulting methods are referred to as BIRCH-PKM and BIRCH-MRK, respectively.
Distributed BIRCH is a simple extension of BIRCH to the MapReduce programming paradigm with a single reducer. Every map task locally builds a CF tree and then sends the set of microclusters of the tree to the single reduce task. The reducer builds a CF tree from the input microclusters. Since this tree covers all the microclusters generated by all the map tasks, an existing partitional clustering method, such as K-means, is applied to all the microclusters of the CF tree to obtain the final clusters. The distributed BIRCH variant whose global clustering method is K-means++ [27] is called BIRCH-MR_KM++.
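The data flow of distributed BIRCH described above can be sketched as follows. This is a simplified illustration, not the paper's code: the threshold-based microclustering stands in for a real CF tree, the farthest-first seeding stands in for K-means++, and the function names and the 2-D microcluster format (count, sum_x, sum_y) are our assumptions.

```python
# Illustrative sketch of distributed BIRCH's data flow (single reducer).
import math

def map_task(objects, threshold):
    """Absorb each 2-D object into the nearest microcluster within
    `threshold`, else open a new one. A microcluster is (n, sum_x, sum_y)."""
    mcs = []
    for x, y in objects:
        best, bd = None, threshold
        for i, (n, sx, sy) in enumerate(mcs):
            d = math.dist((sx / n, sy / n), (x, y))
            if d <= bd:
                best, bd = i, d
        if best is None:
            mcs.append((1, x, y))
        else:
            n, sx, sy = mcs[best]
            mcs[best] = (n + 1, sx + x, sy + y)
    return mcs

def reduce_task(all_mcs, k, iters=10):
    """Single reducer: weighted K-means over the microcluster centroids,
    seeded farthest-first (a deterministic stand-in for K-means++)."""
    def centroid(mc):
        n, sx, sy = mc
        return (sx / n, sy / n)
    cents = [centroid(all_mcs[0])]
    while len(cents) < k:
        far = max(all_mcs, key=lambda m: min(math.dist(centroid(m), c)
                                             for c in cents))
        cents.append(centroid(far))
    for _ in range(iters):
        sums = [[0.0, 0.0, 0.0] for _ in range(k)]
        for mc in all_mcs:
            n, sx, sy = mc
            j = min(range(k), key=lambda j: math.dist(centroid(mc), cents[j]))
            sums[j][0] += n; sums[j][1] += sx; sums[j][2] += sy
        cents = [(s[1] / s[0], s[2] / s[0]) if s[0] else cents[j]
                 for j, s in enumerate(sums)]
    return cents
```

The weighting by microcluster counts in the reducer is what lets the global K-means operate on summaries instead of raw objects.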
PKMeans, mrk-means, and K-means++ are referred to as extensions of K-means because they share its stopping criteria. The maximum number of iterations for these methods is set to 50, except for PKMeans, for which it is set to 10. The reason for this selection is discussed in a later subsection.
The split size for mrk-means is set to √(K · n), as proposed in [21], where n is the number of objects. The number of true clusters in our synthetic and real data sets is provided to the K-means extensions. Note that CF+ERC_MR does not require the number of final clusters for clustering. In addition, the same threshold value is used to build the CF and CF+ trees. Proper threshold values were obtained for all synthetic and real data sets by conducting experiments with various threshold values.
The average purity, inverse purity, and execution time of these clustering methods are used as the primary performance metrics. They were obtained by repeating each experiment five times with different seeds on the same data set. The purity and inverse purity metrics are formally described in [28]. Purity evaluates how dominant the most frequent true class is within each cluster, while inverse purity evaluates how well the objects of each true class are gathered into a single cluster. The execution times of CF+ERC_MR and the other methods include the time required to build the CF+ and CF trees.
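Purity and inverse purity (formally defined in [28]) can be computed as below. This is the standard formulation, not code from the paper.

```python
# Purity: match each found cluster to its dominant true class.
# Inverse purity: match each true class to the cluster covering it best.
from collections import Counter

def purity(clusters, classes):
    """clusters, classes: equal-length label lists, one entry per object."""
    per_cluster = Counter()
    overlap = Counter(zip(clusters, classes))
    for (cl, _), cnt in overlap.items():
        per_cluster[cl] = max(per_cluster[cl], cnt)
    return sum(per_cluster.values()) / len(clusters)

def inverse_purity(clusters, classes):
    """Swap roles: how well each true class is covered by one cluster."""
    return purity(classes, clusters)
```

Reporting both metrics guards against degenerate solutions: purity alone is maximized by many tiny clusters, inverse purity alone by one giant cluster.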
All the experiments were conducted on Amazon EMR. The MapReduce system consisted of ten nodes (one NameNode and nine workers). All the nodes used the Intel Xeon Platinum 8000 series (Skylake-SP) processor with a sustained all-core Turbo CPU clock speed of up to 3.1 GHz. The NameNode used four cores with two threads, 64 GB memory, and EBS storage. The workers used two cores with two threads, 32 GB memory, and EBS storage. The basic size of the EBS storage was 1 GiB, which could be extended up to 16 TiB if a task required additional storage. In addition, we set C = 50 and L = 50 for the CF+ and CF trees, which are the same parameter values used in [5]. We set the number of reduce tasks, r, to 4 because we experimentally found that a small value of r (i.e., 4) was sufficient for our proposed method.

C. SYNTHETIC DATA SETS
A collection of synthetic data sets was used in our experiments. These data sets were obtained by employing a synthetic data generator based on the parameters used in [6]. Table 2 shows the parameters used in our generator. All the synthetic data sets consist of K clusters of d-dimensional data. Each cluster in a data set has n_min to n_max objects within a radius between r_min and r_max. The centroids of all the clusters are placed in the data space according to the cluster patterns in Table 2. All centroids of the grid-pattern data set are placed on a √K × √K grid, where the distance between the centroids of adjacent clusters is k_g · (r_min + r_max)/2. All the centroids of the random-pattern data set are randomly placed in the data space. Further, all the centroids of the sine-pattern data set are partitioned into n_c groups, and the centroids of each group are placed along the curve of a different cycle of a sine function.
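A minimal generator for the grid-pattern (GPC) layout described above might look as follows. The parameter names mirror Table 2, but the uniform-disk sampling inside each cluster and the function name are our assumptions, not the paper's generator.

```python
# Sketch of a grid-pattern cluster generator: K centroids on a
# sqrt(K) x sqrt(K) grid with spacing k_g * (r_min + r_max) / 2,
# and n_per_cluster objects drawn uniformly inside each cluster's disk.
import math, random

def generate_gpc(K, n_per_cluster, r_min, r_max, k_g=4, seed=0):
    rng = random.Random(seed)
    side = math.isqrt(K)
    assert side * side == K, "K must be a perfect square for the grid pattern"
    spacing = k_g * (r_min + r_max) / 2
    data = []
    for g in range(K):
        cx, cy = (g % side) * spacing, (g // side) * spacing
        r = rng.uniform(r_min, r_max)  # this cluster's radius
        for _ in range(n_per_cluster):
            # uniform point in a disk of radius r around the centroid
            a = rng.uniform(0, 2 * math.pi)
            rad = r * math.sqrt(rng.random())
            data.append((cx + rad * math.cos(a), cy + rad * math.sin(a), g))
    return data
```

Each object carries its true cluster ID, which is what the purity metrics above compare against.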
The three default data sets used in our experiments were generated by the synthetic data generator with the parameter values shown in Table 3. These data sets are named with respect to the cluster pattern, i.e., grid pattern cluster (GPC), random pattern cluster (RPC), and sine pattern cluster (SPC). Each data set consisted of approximately 4,000,000 objects.

D. PARAMETRIC ANALYSIS OF CF + ERC_MR
In this subsection, we analyze the performance of CF+ERC_MR in terms of the sample size, speedup, and workload balance of multiple reduce tasks. These analyses reveal that CF+ERC_MR performs exceptionally well in a distributed computing environment.

1) EFFECT OF THE SAMPLE SIZE
The effect of the sample size on the performance of CF+ERC_MR was analyzed using GPC and RPC, with the number of clusters K set to 400. Each experiment was repeated five times while the sample size was increased from 100 to 6,400. Figure 8 shows the average execution time and its standard deviation as a function of the sample size. The purity and inverse purity were not affected by the variation in the sample size; therefore, they are not shown in this figure. A larger sample size increased the time consumed in the space-partitioning step because more objects were gathered into the NameNode from the workers on the MapReduce framework. However, CF+ERC_MR with a large sample size was occasionally faster than that with a small sample size. For example, in Figure 8a, CF+ERC_MR with 800 samples is faster than that with a smaller sample size. This is because a small number of samples does not capture the overall data distribution accurately, which can increase the time consumed in the clustering and refining steps.
CF+ERC_MR exhibits a stable execution time and low standard deviation without strong fluctuations, despite the randomness of the reservoir sampling and K-means++ in the space-partitioning step. CF+ERC_MR with 200 samples is relatively fast for both GPC and RPC compared with the other sample sizes. This experiment also indicates that a small sample size is sufficient to obtain a good region centroid set V. Thus, the sample size was set to 200 in all subsequent experiments.
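The sampling in the space-partitioning step mentioned above uses reservoir sampling, which can be sketched with the textbook Algorithm R. This sketch is ours, not the paper's code.

```python
# Reservoir sampling (Algorithm R): a uniform sample of s items from a
# stream whose length is not known in advance -- suitable for drawing the
# small sample the space-partitioning step feeds to K-means++.
import random

def reservoir_sample(stream, s, seed=None):
    rng = random.Random(seed)
    reservoir = []
    for i, x in enumerate(stream):
        if i < s:
            reservoir.append(x)       # fill the reservoir first
        else:
            j = rng.randint(0, i)     # item i kept with probability s/(i+1)
            if j < s:
                reservoir[j] = x
    return reservoir
```

One pass, O(s) memory, and no need to know the data set size up front, which is why it suits a distributed setting where objects stream in from the workers.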

2) SPEEDUP OF CF + ERC_MR
The speedup of CF+ERC_MR was measured by increasing the number of nodes on the MapReduce framework using three GPC data sets: G32M, G64M, and G128M. G32M is a GPC data set with 800 clusters; it has 32 million objects because the number of clusters is 800 and the number of objects in each cluster is 40,000. Similarly, G64M has 64 million objects and G128M has 128 million objects.
The experiments were conducted using the MapReduce framework in which the number of nodes was increased from 2 to 12 for each data set. Figure 9 shows the speedup of CF+ERC_MR. CF+ERC_MR performed more rapidly with an increase in the number of nodes because the clustering step was performed on all nodes in a distributed manner. All n nodes of the MapReduce system were composed of one NameNode, µ mappers, and ν reducers, giving n = µ + ν + 1. Each mapper takes O(m · B/µ) and each reducer takes O(r · C/(n − µ − 1)). Thus, the speedup of CF+ERC_MR increases as n increases.
However, the figure also shows that the speedup for all the data sets does not increase linearly with the number of nodes. The network cost of the method was similar across all the experiments on the same data set, but an increase in the number of nodes reduced the workload per node. Since the network cost accounts for a larger portion of the total cost when the number of nodes is higher, the speedup gain diminishes as the number of nodes increases.
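The sublinear speedup described above follows directly from per-mapper and per-reducer costs that shrink with the node count while a fixed network/refinement overhead does not. A toy evaluation of such a cost model (the constants and the even mapper/reducer split are purely illustrative, not measured values from the paper) shows the diminishing gains:

```python
# Toy cost model: T(n) ~ A + m*B/mu + r*C/nu + fixed overhead,
# with n nodes = 1 NameNode + workers split between mappers and reducers.
# All constants below are illustrative assumptions.

def total_time(n, A=5.0, mB=1000.0, rC=400.0, overhead=20.0):
    workers = n - 1
    mu = max(1, workers // 2)   # mappers (clamped for tiny clusters)
    nu = max(1, workers - mu)   # reducers
    return A + mB / mu + rC / nu + overhead

def speedup(n, baseline=2):
    """Speedup relative to the smallest runnable configuration (2 nodes)."""
    return total_time(baseline) / total_time(n)
```

As n grows, the divisible terms shrink but the fixed overhead stays, so the speedup curve flattens, matching the behavior observed in Figure 9.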
The speedup of the method on a larger data set is higher than that on a smaller data set because the larger data set carries more workload. From Figure 9, we can also infer that every reduce task deals with a similar-sized workload. If the reduce tasks dealt with workloads of different sizes, the speedup would depend on the performance of any straggler node, which can cause fluctuations in the speedup. Figure 9 shows no fluctuations, which implies that CF+ERC_MR deals with the data set in an evenly distributed manner.

We now show that the reduce tasks of the method deal with the data sets in an evenly distributed manner. Figure 10 shows the workload balance of all the reduce tasks obtained by CF+ERC_MR on the three data sets: GPC, RPC, and SPC. Here, the error bars represent the standard deviation of the average size of the microclusters (Figure 10a) and the average execution time (Figure 10b). The heights of all the error bars are small compared to the average values, so the workloads of all the reduce tasks are even. Therefore, CF+ERC_MR exhibits good speedup performance because it avoids the straggler reducer, which is considered one of the major factors that degrade the performance of MapReduce jobs.

Next, using the three GPC data sets, we compare the execution times of CF+ERC_MR with two nodes and CF+-ERC. We conducted this experiment because Figure 9 shows the speedup of CF+ERC_MR with respect to two nodes. Note that CF+ERC_MR requires at least two nodes to run, whereas CF+-ERC is a sequential clustering method. Through this experiment, we show how CF+ERC_MR with two nodes performs relative to the sequential method CF+-ERC. Table 4 shows the clustering times of CF+-ERC and CF+ERC_MR with two nodes. The purity and inverse purity are omitted because they are the same for both methods. CF+-ERC is shown to be faster than CF+ERC_MR in all the experiments summarized in this table.
This is because CF+ERC_MR included extra steps, such as the space-partitioning and refining steps, and performed additional tasks for running a MapReduce job. Considering these overheads, the difference between the clustering times is negligible. In addition, CF+ERC_MR with more nodes can reduce the clustering time, as shown in Figure 9.

E. EXPERIMENTAL ANALYSIS WITH SYNTHETIC DATA SETS
Experiments on the synthetic data sets were performed to compare the effects of the cluster patterns, number of clusters K, data set size n, and number of dimensions d on the performance of CF+ERC_MR, PKMeans, mrk-means, BIRCH-based distributed clustering methods, and distributed BIRCH.

1) EFFECT OF THE CLUSTER PATTERN OF DATA SETS
The effect of the cluster patterns on the performance of CF+ERC_MR and the other clustering methods was analyzed using the default data sets (GPC, RPC, and SPC) in Table 3. Figure 11 shows that CF+ERC_MR exhibits the best performance in terms of average purity and inverse purity for all the cluster patterns. Further, CF+ERC_MR is faster than the other methods in most cases. It may be noted that CF+ERC_MR performs additional work in the sampling step; however, only 200 objects need to be read, so the effect of this additional work on the total execution time is negligible. For all the cluster patterns, the average purity and inverse purity of CF+ERC_MR are always better than those of the other clustering methods. This indicates that CF+ERC_MR is suitable for data sets whose threshold value is given by the user. Figure 11c shows the execution times of all the clustering methods. The maximum value of the vertical axis is intentionally restricted to 100 s because the execution times of some methods were much higher than 100 s. For example, PKMeans consumed 309 s for GPC, 316 s for RPC, and 298 s for SPC.
Because the maximum number of iterations in PKMeans is less than that in BIRCH-PKM, PKMeans is faster than BIRCH-PKM. However, PKMeans is considerably slower than the other methods because it repeatedly runs MapReduce jobs in each iteration. A MapReduce job requires additional input/output (I/O) and tasks for distributed computing. Although the computational complexity of K-means is reduced in PKMeans because the distance computations and the centroid update step are implemented in parallel, clustering is slowed down by the overhead of running several MapReduce jobs. In Figure 11c, PKMeans consumed a longer time than CF+ERC_MR, while its purity and inverse purity were lower than those of CF+ERC_MR. When the maximum number of iterations was increased from 10 to 50, PKMeans consumed a much longer time (approximately 1,400 s); its purity and inverse purity increased significantly but still remained below those of CF+ERC_MR.
To analyze the performance of CF+ERC_MR and PKMeans within a comparable range of execution times, the maximum number of iterations in PKMeans was set to 10 in all experiments. The performance of the clustering methods on SPC is excluded from the subsequent experiments because it lies between the performances of the methods on GPC and RPC.

2) EFFECT OF THE NUMBER OF CLUSTERS
The effect of the number of clusters on the performance of CF+ERC_MR and the other methods was investigated by increasing the size of the data sets while maintaining the data densities of the clusters. For this experiment, GPC and RPC with K ranging from 100 to 1,600 (Table 3) were used.
The results are shown in Figures 12 and 13. BIRCH-PKM and BIRCH-MRK are not shown because they were not able to cluster the data sets with K = 1,600 owing to the shortage of main memory. The BIRCH-based distributed clustering methods build the CF tree on the NameNode, so they usually require a large amount of memory; hence, they cannot cluster exceptionally large data sets. The purity and inverse purity of CF+ERC_MR are always higher than those of the other clustering methods. CF+ERC_MR is much faster than PKMeans and mrk-means and slightly faster than the BIRCH-based distributed clustering methods and distributed BIRCH. Moreover, as the number of clusters increases, the difference in execution time between CF+ERC_MR and the other microcluster-based clustering methods also increases. This indicates that for exceptionally large data sets, ERC, which utilizes the structure of the CF+ tree through range queries, is more suitable than the other clustering methods. An increase in the number of clusters increases the number of microclusters, which in turn increases the number of distance computations for clustering. Since ERC avoids unnecessary distance computations, the clustering time for a large number of microclusters is reduced.

3) EFFECT OF THE SIZE OF DATA SETS
The effect of the size of the data sets on the performance of CF+ERC_MR and the other methods was experimentally analyzed using GPC and RPC with increasing n_min and n_max (Table 3). The size of the data sets was increased from approximately 4 M (4 × 10^6) to 20 M (20 × 10^6) objects. With K fixed at 100, increasing the size of the data sets increases the data density of all clusters. Figures 14 and 15 show that an increase in the data set size rarely affected the purities and inverse purities of the clustering methods. However, the execution times of the clustering methods increased with the size of the data sets in Figures 14c and 15c because more objects required many more distance computations.
The performance of CF+ERC_MR is the best when all the primary performance metrics are considered, although it is slightly slower than BIRCH-MR_KM++ for smaller data sets. This is because a small data set generates a small amount of summary data, whereas a large data set generates a large amount of summary data, which increases the global clustering time. BIRCH-MR_KM++, which performs its global clustering on a single node, is inefficient in clustering exceptionally large data sets. Therefore, CF+ERC_MR clusters a large data set in a much shorter time than BIRCH-MR_KM++.

4) EFFECT OF THE DATA DIMENSIONALITY
In this subsection, the effect of the dimensionality of the data sets on the performance of CF+ERC_MR and the other methods is analyzed. GPC and RPC with d increasing from 2 to 10 (Table 3) are used in this experiment.
It is clear in Figures 16 and 17 that CF+ERC_MR exhibits a high performance on all data sets even as the dimensionality increases. An increase in the number of dimensions reduced the standard deviation of the distances between all object pairs owing to the ''curse of dimensionality.'' This deteriorated the purity and inverse purity of the K-means extensions, except mrk-means and BIRCH-MRK, because similar objects were assigned to different final clusters. By contrast, in CF+ERC_MR, which uses the threshold value as the clustering criterion, object pairs within T were assigned to the same final cluster.
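The ''curse of dimensionality'' effect mentioned above can be observed directly: for uniformly random points, the ratio of the standard deviation to the mean of the pairwise distances shrinks as the dimensionality grows, so distance-based cluster assignments become less discriminative. The following self-contained check is our illustration, not the paper's experiment:

```python
# Coefficient of variation of pairwise distances for n random points in
# the unit hypercube; it decreases as the dimensionality d grows.
import math, random

def distance_contrast(d, n=200, seed=0):
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(d)] for _ in range(n)]
    dists = [math.dist(pts[i], pts[j])
             for i in range(n) for j in range(i + 1, n)]
    mean = sum(dists) / len(dists)
    var = sum((x - mean) ** 2 for x in dists) / len(dists)
    return math.sqrt(var) / mean  # std/mean: the "relative contrast"
```

A low contrast means nearest and farthest neighbors look almost equally far away, which is exactly why K-means-style assignments degrade in high dimensions while a fixed-threshold criterion does not.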
mrk-means was much slower than CF+ERC_MR, and its purity and inverse purity were lower than those of CF+ERC_MR. This is because CF+ERC_MR handles microclusters and uses the threshold value as the global clustering criterion. Figure 17b shows that the inverse purities of BIRCH-PKM and BIRCH-MRK decrease because BIRCH does not obtain appropriate microclusters when the number of dimensions increases. Thus, we can infer that CF+ERC_MR is more suitable for clustering high-dimensional data sets than BIRCH-MRK, which was often comparable to CF+ERC_MR in terms of execution time.

F. EXPERIMENTAL ANALYSIS WITH REAL DATA SETS
In this subsection, we compare the performance of CF+ERC_MR with that of PKMeans, mrk-means, BIRCH-based distributed clustering methods, and distributed BIRCH using real data sets. The Sensorless Drive Diagnosis (SDD) data set and the Amsterdam Library of Object Images (ALOI) data set were used for this experiment. SDD and ALOI were provided by the UCI Machine Learning Repository and OpenML, respectively. Kobren et al. [29] used the ALOI data set to evaluate K-means extensions and hierarchical clustering methods. In this study, WEKA, a popular data mining application, was used to normalize the data. The parameters used are shown in Table 5.

Figure 18a shows the purity, inverse purity, and execution time of all the methods used in the experiment. For the SDD data set, CF+ERC_MR outperforms the other methods in terms of purity and inverse purity. BIRCH-PKM is noticeably slower than the other methods because it runs a MapReduce job repeatedly. The clustering times of CF+ERC_MR and BIRCH-MRK are similar, but the purity and inverse purity of BIRCH-MRK are lower than those of CF+ERC_MR. This is because BIRCH-MRK cannot use the threshold value in its global clustering method. For the ALOI data set, the performance of CF+ERC_MR is much better than that of the other methods.
The clustering time of BIRCH-MR_KM++ exhibits a much stronger increase for the experiment conducted on the SDD data set than for that on the ALOI data set. This confirms that global clustering on a single machine is not suitable for clustering exceptionally large data sets, even when it deals with summary data such as microclusters.
Overall, the experiments on real data sets indicate that CF+ERC_MR provides better performance than PKMeans, mrk-means, BIRCH-based distributed clustering methods, and distributed BIRCH. Consequently, CF+ERC_MR is also well suited for clustering real data sets.

V. CONCLUSIONS
Summary-based clustering and distributed clustering are vital components of effective data analysis for exceptionally large data sets. Here, we proposed CF+ERC_MR, which is an extension of CF+-ERC to the MapReduce framework, for clustering exceptionally large data sets with a given threshold. Our method runs in parallel on multiple reducers. The CF+ trees were built on multiple reducers, and multiple range queries were issued to find the local final clusters by utilizing the structures of these CF+ trees in a parallel manner. In the refining step, the connections between these local final clusters were examined to sequentially determine the global final clusters. Consequently, the CF+ERC_MR method efficiently yields the final clusters based on MapReduce.

We first presented the theoretical analysis of CF+ERC_MR. The efficiency of CF+ERC_MR was then demonstrated by comparing its performance with that of existing clustering methods on large synthetic and real data sets. A variety of experiments indicated that the proposed approach is significantly faster and more accurate than the existing clustering methods. The experiments also showed the robustness of our approach to various cluster patterns, numbers of clusters, cluster densities, and numbers of dimensions. Overall, our method is extremely useful for clustering exceptionally large data sets with given threshold values. In the future, we aim to extend the proposed clustering method to other types of data sets, such as categorical or mixed-type data sets.