Three-Way Ensemble Clustering for Incomplete Data



I. INTRODUCTION
As a new idea in artificial intelligence in recent years, granular computing [1]-[3] is a relatively modern theory for simulating human thinking and problem solving. One of the fundamental problems of granular computing is information granulation [4]-[6]. Many strategies of information granulation [7]-[9] have been proposed to meet different users' demands, among which clustering analysis is a widely used one [10]. As an important field in data mining, clustering [11] tries to find the internal underlying structure of a data set. Generally, this means that similar objects are assigned to the same cluster while dissimilar objects are assigned to different clusters.
Though there is a large variety of clustering algorithms, most existing algorithms, like k-means [12], are only adapted to complete datasets. However, in actual scenarios, some data values are missing due to random noise, data loss, limitations of data acquisition, data misunderstanding, etc.
(The associate editor coordinating the review of this manuscript and approving it for publication was Mu-Yen Chen.) An object with missing values is generally referred to as incomplete data. Due to the existence of missing values, traditional clustering algorithms cannot be directly applied to incomplete datasets. Therefore, how to handle incomplete data sets is a problem to be solved in cluster analysis. In order to solve the problem of incomplete data clustering, Hathaway and Bezdek [13] put forward four incomplete data clustering methods based on the fuzzy c-means clustering algorithm (FCM): the Whole Data Strategy (WDS), Partial Distance Strategy (PDS), Optimal Completion Strategy (OCS) and Nearest Prototype Strategy (NPS). On the basis of FCM, Li et al. [14] studied a robust FCM clustering algorithm to deal with incomplete data. Shi et al. [15] introduced a clustering ensemble algorithm for mixed data. Yu et al. [16] proposed a three-way decision clustering algorithm for incomplete data. In this paper, we use ensemble clustering to present a new three-way clustering algorithm for incomplete datasets.
For incomplete datasets, a single clustering algorithm cannot achieve a good clustering result because of the large number of missing values. In order to overcome this problem, we use the ensemble clustering technique to combine multiple clustering results into a probably better one in this paper. The idea of ensemble clustering [17] is to integrate multiple clustering results into a single consolidated result in order to improve the quality of the results. In fact, different clustering algorithms may lead to different clustering results, and it is difficult to judge which clustering result is better than the others due to the lack of supervised information. Therefore, ensemble clustering has emerged as a powerful tool for data clustering. Compared with a single clustering algorithm, ensemble clustering can significantly improve the robustness, stability and quality of clustering results. In recent years, ensemble clustering has been receiving increasing attention and many ensemble clustering algorithms have been developed [18]-[24].
The concept of three-way decision was outlined by Yao [25]. The main idea of three-way decision is to divide the universe into the positive, negative and boundary regions, denoting the regions of acceptance, rejection and non-commitment for ternary classifications. Since the introduction of three-way decision, many developments and applications [26]-[32] have been proposed in different fields and disciplines. Inspired by the theory of three-way decision, Yu proposed the concept of three-way clustering [33]. In three-way clustering, a cluster is no longer represented by a single set with clear boundaries, but by three regions called the core region, fringe region and trivial region, which reflect the three types of relationships between an object and a cluster, namely, belonging definitely, being uncertain and not belonging definitely. For an incomplete dataset, it is hard to assign the elements with missing attribute values to one cluster because of insufficient information. Three-way clustering can be used to solve this problem.
Based on the above discussion, we present a three-way ensemble clustering algorithm for incomplete data in this paper. We first propose an imputation-based incomplete data clustering algorithm built on a hard clustering algorithm. In the proposed algorithm, we cluster the objects with non-missing values and use the mean attribute values of each cluster to fill the missing attribute values, respectively. Perturbation analysis of the cluster centroids is applied to search for the optimal imputation. As an application of the proposed imputation method, we develop a three-way ensemble clustering algorithm based on the ideas of clustering ensemble and three-way decision. The objects with the same cluster label in different clustering results are assigned to the core region of the corresponding cluster, while the objects with different cluster labels are assigned to the fringe region. Therefore, a three-way clustering is naturally formed.
The rest of this paper is organized as follows. Section 2 reviews incomplete information systems, ensemble clustering and three-way decision. Section 3 introduces the proposed three-way ensemble clustering algorithm for incomplete data. Section 4 reviews two clustering performance measurements. Experimental results are presented in Section 5 and concluding remarks are reported in Section 6.

II. PRELIMINARIES
In this section, we mainly review some related content: incomplete information systems, three-way decision and three-way clustering, and ensemble clustering.

A. AN INCOMPLETE INFORMATION SYSTEM
The information system is also known as a knowledge representation system. It can be represented as S = (U, A, V, f), where U is a non-empty finite set of objects, A is a non-empty finite set of attributes, V is the set of attribute values and f : U × A → V is an information function. When some attribute values are missing, the information system S is called an incomplete information system. Table 1 is an example of an incomplete information system, which includes 8 samples, each with 6 attributes.
The missing values are represented by * .

B. THREE-WAY DECISION AND THREE-WAY CLUSTERING
Rough set theory [34] uses three pairwise disjoint sets to approximate a set, namely the positive, boundary and negative regions. In order to interpret the three types of decision rules, Yao [35] introduced Bayesian risk decision-making into rough sets and proposed the decision-theoretic rough set model, and then proposed the theory of three-way decision [25]. Three-way decision gives a definite decision to an object with sufficient information and a delayed decision to an object with insufficient information, which extends binary decision-making in order to overcome some of its drawbacks. Three-way decision scientifically investigates a common problem-solving and information-processing practice. The idea of three-way decision is consistent with human cognition in solving real-world problems [36]. Many soft computing models for learning uncertain concepts, such as interval sets, rough sets, fuzzy sets and shadowed sets, have tri-partitioning properties and can be reinvestigated within the framework of three-way decision [25]. Recently, Yao [36] used a TAO model to describe the research on three-way decision, which is depicted in Fig. 1. Since the introduction of three-way decision, many developments and applications have been proposed in different fields and disciplines, such as information filtering [37], text classification [38], risk decision [26], government decision [39], formal concept analysis [29]-[32], [40], sequential three-way decision [41], [42], etc.
Inspired by the idea of three-way decision, Yu [33] and Yu et al. [43], [44] proposed a framework of three-way clustering analysis. In the representation of two-way clustering, a cluster is represented by a single set, where the objects in the set belong to this cluster definitely and the objects not in the set do not belong to this cluster definitely. In contrast to this crisp representation of a cluster, three-way clustering represents a three-way cluster C i as a pair of sets, C i = (Co(C i ), Fr(C i )). This representation can capture the three types of relationships between an element and a cluster, namely, belonging definitely, not belonging definitely and being uncertain. It is more appropriate than the use of a single crisp set.
Here, we summarize some basic concepts of three-way clustering. Assume that C = {C 1 , C 2 , . . . , C k } is a family of three-way clusters of the universe U . The three sets Co(C i ), Fr(C i ) and Tr(C i ) naturally form the Core Region, Fringe Region and Trivial Region, respectively, of a cluster. That is:
CoreRegion(C i ) = Co(C i ),
FringeRegion(C i ) = Fr(C i ),
TrivialRegion(C i ) = U − Co(C i ) − Fr(C i ).
These subsets are pairwise disjoint and their union is U . If Fr(C i ) = ∅, then C i = Co(C i ), which is a representation of two-way clustering. This means two-way clustering is a special case of three-way clustering in which the fringe regions are empty sets.
Under different applications, there are different requirements on Co(C i ) and Fr(C i ). In this paper, we adopt the following properties:
(1) Co(C i ) ∪ Fr(C i ) ≠ ∅, i = 1, 2, . . . , k;
(2) ∪_{i=1}^{k} (Co(C i ) ∪ Fr(C i )) = U ;
(3) Co(C i ) ∩ Co(C j ) = ∅, i ≠ j.
Property (1) demands that each cluster cannot be empty. Property (2) states that every object belongs to at least one cluster, and that it is possible for an element to belong to more than one cluster. Property (3) requires that the core regions of clusters are pairwise disjoint.
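As a concrete illustration, the pair-of-sets representation and the three properties above can be sketched in Python; the function name and the toy data are our own illustration, not code from the paper:

```python
# A minimal sketch: each three-way cluster is a (core, fringe) pair of
# object-index sets; the trivial region is everything else in the universe.
def check_three_way_properties(clusters, universe):
    """Check properties (1)-(3) for a family of (core, fringe) pairs."""
    # (1) no cluster is empty: core ∪ fringe must be non-empty
    non_empty = all(core | fringe for core, fringe in clusters)
    # (2) cores and fringes jointly cover the universe
    #     (fringes may overlap, so an object can lie in several clusters)
    covered = set().union(*[core | fringe for core, fringe in clusters]) == universe
    # (3) core regions are pairwise disjoint
    cores = [core for core, _ in clusters]
    disjoint = all(cores[i].isdisjoint(cores[j])
                   for i in range(len(cores))
                   for j in range(i + 1, len(cores)))
    return non_empty and covered and disjoint

# Example: two clusters over 6 objects; object 3 sits in both fringes.
clusters = [({0, 1}, {3}), ({4, 5}, {2, 3})]
print(check_three_way_properties(clusters, set(range(6))))  # True
```

Note that, as in property (2), object 3 belongs to the fringe regions of both clusters without violating the disjointness of the cores.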
Since the proposal of three-way clustering, several three-way clustering approaches have been developed. For example, Yu et al. [44] proposed a three-way clustering method to solve the problem of incremental overlapping clustering; Wang et al. [45] proposed a three-way k-means method by integrating k-means and three-way decision; Wang and Yao [46] proposed a three-way clustering method based on mathematical morphology; and Afridi et al. [47] presented a three-way clustering approach for handling missing data by using a game-theoretic rough set model. All of the above results enrich the theories and models of three-way clustering.

C. ENSEMBLE CLUSTERING
There are many clustering algorithms in cluster analysis, and each algorithm has its own way of discovering the underlying data structure. However, different algorithms may lead to different clustering results. It has been accepted that a single clustering algorithm cannot handle all types of data distributions effectively. Due to a lack of prior knowledge about cluster shapes, choosing a specialized clustering method is not easy. Therefore, some researchers focus on integrating multiple clustering results, which is known as ensemble clustering.
Cluster ensemble methods attempt to find a better, more robust clustering solution by fusing as much information as possible among several distinct executions of a partitioning algorithm over the data. Compared with a single clustering algorithm, the clustering ensemble can significantly improve the robustness, stability, and quality of a clustering.
The ensemble clustering problem was first proposed by Strehl and Ghosh in [17], where it is described as combining multiple clustering results of a set of objects without accessing the original features. Since the introduction of ensemble clustering, many developments have been proposed in different fields and disciplines [26]-[32]. In general, research on ensemble clustering mainly includes three aspects: ensemble generation, ensemble selection and ensemble integration. Ensemble generation is the first step of a clustering ensemble algorithm; in this step, the set of clusterings that will be combined is generated. An appropriate generation process is very important because the final result is conditioned by the initial clusterings obtained in this step. Usually, there are no constraints on how the partitions must be obtained. Therefore, we can use different clustering algorithms, or the same algorithm with different parameters or initializations, in the generation process. Next, we use k-means ensemble clustering as an example to show the process of ensemble clustering. Suppose a data set U = {x 1 , x 2 , . . . , x n } with n objects; after clustering with different clustering algorithms or different parameters or initialization values, we get clustering results X 1 , X 2 , . . . , X t , where t is the ensemble size, which indicates the number of clusterings.
The k-means algorithm was introduced by Macqueen [12] and is one of the most widely employed clustering algorithms because of its efficiency and performance. Its computational complexity is lower than that of most other clustering algorithms. Detailed steps of k-means can be found in Algorithm 1.

Algorithm 1 k-Means Clustering Algorithm
Input: Dataset: U = {x 1 , x 2 , . . . , x n } ⊂ R m , number of clusters: k;
Output: Clustering result: C = {C 1 , C 2 , . . . , C k }.
1 randomly select k objects from the dataset, where k < n, and view these objects as the initial centers z 1 , z 2 , · · · , z k .
2 assign each object to one of the k centers according to the shortest distance principle, i.e., x ∈ C i if ||x − z i || ≤ ||x − z j || for all j = 1, 2, . . . , k.
3 when all of the objects are assigned, recalculate the value of each center according to the formula: z i = (1/|C i |) Σ_{x∈C i} x.
4 repeat Step 2 and Step 3 until the centers no longer change or some stop condition is satisfied.
5 return C = {C 1 , C 2 , . . . , C k }.
The k-means algorithm has the advantages of simplicity and efficiency, and it keeps its scalability when dealing with large data sets. However, it has some inevitable shortcomings: (1) the value of k has to be given in advance; (2) it is sensitive to the initial values, and different initial centers lead to different clustering results; (3) it easily falls into a local optimum rather than the global optimum. Therefore, we perform ensemble clustering by randomly initializing the center values to obtain a relatively stable clustering result. In this paper, we use the k-means clustering algorithm to obtain a set of base clustering results X 1 , X 2 , . . . , X t (t is the ensemble size) by generating different initial centers. The procedure can be described by Fig. 2.
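The ensemble-generation step described above, i.e., running k-means t times with different random initial centers, can be sketched as follows; the toy data set, seeds and function names are our own illustration, not the paper's code:

```python
import random

def kmeans(data, k, seed, iters=100):
    """Plain k-means (Algorithm 1): random initial centers, assignment by
    nearest center, centers recomputed until they stop changing."""
    rng = random.Random(seed)
    centers = rng.sample(data, k)          # step 1: random initial centers
    for _ in range(iters):
        # step 2: assign each object to its nearest center
        labels = [min(range(k),
                      key=lambda j: sum((a - b) ** 2
                                        for a, b in zip(x, centers[j])))
                  for x in data]
        # step 3: recompute each center as the mean of its members
        new_centers = []
        for j in range(k):
            members = [x for x, l in zip(data, labels) if l == j]
            new_centers.append(tuple(sum(c) / len(members) for c in zip(*members))
                               if members else centers[j])
        if new_centers == centers:         # step 4: stop when centers settle
            break
        centers = new_centers
    return labels

# Ensemble generation: t base clusterings from different initial centers.
data = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
base_clusterings = [kmeans(data, k=2, seed=s) for s in range(3)]
```

On this well-separated toy data every run converges to the same two-cluster partition, but on harder data the base clusterings would differ, which is exactly what the ensemble exploits.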

III. THREE-WAY ENSEMBLE CLUSTERING ALGORITHM FOR INCOMPLETE DATA
Suppose that U = {x 1 , x 2 , . . . , x n } ⊂ R m is an incomplete dataset with n objects. The missing values are constrained by two conditions: each original feature vector x i retains at least one component, and each feature has at least one value present in the incomplete data set. In this section, we present a new imputation algorithm for incomplete data and a three-way ensemble clustering algorithm based on the imputation result. Firstly, the data set U is classified into two disjoint subsets: one set, U W , contains the objects with non-missing values, and the other set, U M , contains the objects with missing values, where U W ∪ U M = U . Secondly, Algorithm 1 is used to cluster the set U W , and the mean attribute values of each cluster are used to fill the objects in U M . Perturbation analysis of the cluster centroids is applied to search for the optimal imputation. Based on the proposed imputation method, a three-way ensemble clustering algorithm is proposed by integrating the ideas of clustering ensemble and three-way decision. The specific process of the three-way ensemble clustering algorithm for incomplete data is shown in Fig. 3, where Algorithm 2 and Algorithm 3 are described in the following Section A and Section B, respectively.

A. MISSING VALUE FILLING ALGORITHM
In the study of incomplete data clustering, effective filling of incomplete data is the key to improving the accuracy of the clustering result. There are many ways to fill incomplete data, for example: mean imputation, regression imputation, multiple imputation, etc. In this subsection, we present an improved mean-imputation incomplete data clustering algorithm based on a hard clustering algorithm. Firstly, we use the k-means algorithm to cluster the objects with non-missing values in the set U W , obtaining C W = {C 1 , C 2 , · · · , C k }. Secondly, for an object x i with missing values in U M , each missing value x d i = * (the d-th feature of x i , d ∈ {1, 2, · · · , m}) is filled with the corresponding attribute mean of all samples in each cluster C i (i = 1, 2, · · · , k) of U W , respectively. Thirdly, in order to search for the optimal imputation, we compare the impact of each filled object x i on the corresponding cluster centroid by adding x i |C i |/k times into C i and computing the new cluster center of C i ; that is, we recalculate the cluster centroid and compute the difference between the new and the old centers. Finally, we select the minimum value among these differences and assign x i to the corresponding cluster; the missing values are determined at the same time. Algorithm 2 describes the process of the proposed imputation method.

Algorithm 2 Missing Value Filling Algorithm
Input: Clustering result of complete data C W = {C 1 , C 2 , . . . , C k } with centers z 1 , z 2 , . . . , z k ; set of incomplete objects U M .
Output: Clustering result: C = {C 1 , C 2 , . . . , C k }.
1 if U M is not empty, then
2 repeat select an object x i from U M .
3 for i ← 1 to k, do fill each missing value x d i = * of x i with the mean of the d-th attribute over the objects in C i .
4 add x i with |C i |/k times into C i and obtain the new cluster C i .
5 recalculate the cluster centroid by z * i = (1/|C i |) Σ_{x∈C i} x and compute the difference d i = |z i − z * i | (i = 1, 2, · · · , k).
6 end for
7 select the minimum value among d i (i = 1, 2, . . . , k), and assign x i to the corresponding cluster C i ; the missing values of the object x i are determined at the same time.
8 until the set U M has no object with missing values
9 end if
10 return C = {C 1 , C 2 , . . . , C k }
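A minimal sketch of the imputation idea, under our own reading of the centroid-perturbation step (None marks a missing value; all names and data are illustrative, not the paper's code):

```python
def impute_object(x, clusters):
    """Mean imputation with centroid perturbation: fill the missing features
    of x with each cluster's attribute means, then keep the candidate whose
    insertion perturbs its cluster centroid the least."""
    best = None
    for idx, members in enumerate(clusters):
        center = [sum(col) / len(members) for col in zip(*members)]
        # candidate filling: this cluster's means for the missing features
        cand = [v if v is not None else center[d] for d, v in enumerate(x)]
        # add the candidate |C_i|/k times and recompute the centroid
        times = max(1, len(members) // len(clusters))
        grown = members + [cand] * times
        new_center = [sum(col) / len(grown) for col in zip(*grown)]
        shift = sum((a - b) ** 2 for a, b in zip(center, new_center)) ** 0.5
        if best is None or shift < best[0]:
            best = (shift, idx, cand)
    return best[1], best[2]  # chosen cluster index and the filled object

clusters = [[(0.0, 0.0), (0.2, 0.2)], [(5.0, 5.0), (5.2, 4.8)]]
print(impute_object((0.1, None), clusters))
```

Here the object (0.1, None) is filled with the mean of the first cluster's second attribute and assigned to that cluster, since that candidate barely moves the centroid, whereas the second cluster's candidate would shift its centroid substantially.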

B. THREE-WAY ENSEMBLE CLUSTERING ALGORITHM
Inspired by the idea of three-way decision in reducing the cost of decision risks and improving the performance of clustering algorithms, Yu [33] and Yu et al. [43], [44] introduced the theory of three-way clustering, which integrates three-way decision and cluster analysis. Three-way clustering represents a cluster by a pair of sets called the core region and the fringe region, C i = (Co(C i ), Fr(C i )), where an object in the core region belongs to the corresponding cluster definitely and an object in the fringe region belongs to the corresponding cluster ambiguously. In this paper, we adopt the following properties on Co(C i ) and Fr(C i ):
1) ∪_{i=1}^{k} (Co(C i ) ∪ Fr(C i )) = U ;
2) Co(C i ) ∩ Co(C j ) = ∅, i ≠ j.
In this subsection, we present a three-way ensemble clustering algorithm for incomplete system based on our proposed imputation method. This process can be divided into two major steps: cluster label matching and three-way clustering based on voting method.
Suppose that we use Algorithm 1 to obtain a set of base clustering results C W 1 , C W 2 , · · · , C Wt by selecting different initial centers and use Algorithm 2 to obtain the clustering results after imputation C 1 , C 2 , · · · , C t . The number of identical objects covered by each pair of clusters C i and C j (1 ≤ i, j ≤ k) is recorded in a k × k overlap matrix. Although we have obtained the clustering results of U , C i cannot be directly used in the next stage due to the lack of prior category information. As an example, consider the data set V = {v 1 , v 2 , v 3 , v 4 , v 5 , v 6 } and let C 1 , C 2 , C 3 be three clustering results of V , which are shown in Table 2. Although the objects are expressed in different orders, they represent the same clustering result. In order to combine the clustering results, the cluster labels must be matched to establish the correspondence between them.
In general, the number of identical objects covered by corresponding cluster labels should be the largest. Therefore, according to this heuristic, we can match the cluster labels as follows. Firstly, choose the pair of cluster labels that covers the largest number of identical objects to establish a correspondence and remove the result from the k × k overlap matrix. Then, repeat the above process until all the cluster labels have established their corresponding relationships. For convenience, we choose the clustering result C 1 as the matching criterion and match the other clustering results with C 1 during label matching. The implementation steps are shown in steps 1-6 of Algorithm 3.
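The greedy label-matching heuristic described above can be sketched as follows; the function name and data are illustrative, and ties in the overlap matrix are broken arbitrarily:

```python
def match_labels(reference, other, k):
    """Greedy label matching: repeatedly pair the reference/other cluster
    labels that share the most objects, remove them from the k x k overlap
    matrix, and relabel `other` accordingly."""
    # overlap[i][j] = number of identical objects between reference cluster i
    # and other cluster j
    overlap = [[sum(1 for r, o in zip(reference, other) if r == i and o == j)
                for j in range(k)] for i in range(k)]
    mapping, used_i, used_j = {}, set(), set()
    for _ in range(k):
        # pick the largest remaining overlap entry
        _, i, j = max((overlap[i][j], i, j)
                      for i in range(k) if i not in used_i
                      for j in range(k) if j not in used_j)
        mapping[j] = i          # other's label j corresponds to reference label i
        used_i.add(i)
        used_j.add(j)
    return [mapping[l] for l in other]

ref   = [0, 0, 1, 1, 2, 2]
other = [2, 2, 0, 0, 1, 1]   # same partition, permuted labels
print(match_labels(ref, other, 3))  # [0, 0, 1, 1, 2, 2]
```

After matching, `other` carries the same label names as the reference clustering, so the results can be combined by voting.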
Once the cluster labels are matched, updated clustering results C 1 , C 2 , · · · , C t are obtained. In the next steps, we use the voting-based method to obtain the three-way clustering result. Firstly, we compute the unions of clusters with the same label in all cluster members, i.e., C^j_1 ∪ C^j_2 ∪ · · · ∪ C^j_t (j = 1, 2, · · · , k). If the object x ∈ C^j_1 ∪ C^j_2 ∪ · · · ∪ C^j_t , we view C^j_i as one vote if x ∈ C^j_i (i = 1, 2, · · · , t) and count the votes of the object x. Finally, we obtain all the objects' votes and denote the vote count of the object x as count(x). If count(x) ≥ t/2, we assign this object to the core region of the cluster C j ; otherwise, we assign it to the fringe region of the cluster C j . The implementation steps are shown in steps 8-16 of Algorithm 3.

Algorithm 3 Three-Way Ensemble Clustering Algorithm
Input: Ensemble clustering results after imputation C 1 , C 2 , · · · , C t
Output: C final = {(Co(C 1 ), Fr(C 1 )), (Co(C 2 ), Fr(C 2 )), · · · , (Co(C k ), Fr(C k ))}
1 for l ← 2 to t, do
2 form the matrix A = (a ij ) k×k , where a ij is equal to the number of identical objects between the i-th cluster of C 1 and the j-th cluster of C l .
3 for i ← 1 to k, do
4 find the maximum value of a ij and let j = i, that is, unify the cluster labels of each clustering result.
5 end for
6 end for
7 through steps 1-6, we get the new clustering results C 1 , C 2 , · · · , C t after label matching.
8 for j ← 1 to k, do
9 compute the union C^j = C^j_1 ∪ C^j_2 ∪ · · · ∪ C^j_t .
10 for each object x ∈ C^j , count the votes of the object x: count(x).
11 if count(x) ≥ t/2, then
12 assign x to the core region of the corresponding cluster C j , i.e., x ∈ Co(C j );
13 else
14 assign x to the fringe region of the corresponding cluster C j , i.e., x ∈ Fr(C j );
15 end if
16 end for
17 return {(Co(C 1 ), Fr(C 1 )), (Co(C 2 ), Fr(C 2 )), · · · , (Co(C k ), Fr(C k ))}
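A compact sketch of the voting step (steps 8-16), assuming the cluster members have already been label-matched; the data and names are illustrative:

```python
def three_way_vote(clusterings, n, k):
    """Voting-based three-way assignment: an object voted into cluster j by
    at least half of the t members goes to the core region Co(C_j); objects
    in the union with fewer votes go to the fringe region Fr(C_j)."""
    t = len(clusterings)
    result = []
    for j in range(k):
        # union of cluster j over all label-matched members
        union = {x for labels in clusterings for x in range(n) if labels[x] == j}
        # core: objects with count(x) >= t/2 votes for cluster j
        core = {x for x in union
                if sum(labels[x] == j for labels in clusterings) >= t / 2}
        result.append((core, union - core))
    return result

# Three matched members over 5 objects; object 2 is disputed between clusters.
members = [[0, 0, 0, 1, 1], [0, 0, 1, 1, 1], [0, 0, 0, 1, 1]]
print(three_way_vote(members, n=5, k=2))
```

With t = 3, object 2 receives two votes for cluster 0 (so it enters Co(C 0)) but only one vote for cluster 1, so it falls into Fr(C 1): the disagreement among members is retained as fringe membership rather than forced into a crisp assignment.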

C. GENERATION OF THE FINAL CLUSTERING
For an incomplete system, Algorithm 1 gives the clustering results of the objects with non-missing values, and Algorithm 2 helps us get the best filling values for the incomplete data. To generate the final three-way clustering result C final , we repeat the processes of Algorithm 1 and Algorithm 2 by selecting different initial centers and obtain a set of clustering results after imputation C 1 , C 2 , · · · , C m . Based on the strategy of ensemble clustering, the three-way ensemble clustering result can then be obtained by Algorithm 3. The whole process is shown as Algorithm 4.

IV. CLUSTERING PERFORMANCE MEASUREMENT
Algorithm 4 Three-Way Ensemble Clustering Algorithm for Incomplete Data
Input: Incomplete data set: U = {x 1 , x 2 , . . . , x n }, number of clusters: k, ensemble size: t
Output: C final = {(Co(C 1 ), Fr(C 1 )), (Co(C 2 ), Fr(C 2 )), · · · , (Co(C k ), Fr(C k ))}
1 classify the data set U into two disjoint subsets: U W , in which every object has no missing values, and U M , which contains the objects with missing values.
2-4 repeat the processes of Algorithm 1 and Algorithm 2 by selecting different initial centers to obtain a set of clustering results after imputation C 1 , C 2 , · · · , C m .
5 C final ← Algorithm 3 (C 1 , C 2 , · · · , C m , m)

Generally, a clustering performance measurement is also known as a ''validity index''. It plays a role similar to the performance metrics of supervised learning. On one side, we adopt a validity index to evaluate the clustering result; on the other side, it can be viewed as an optimization goal during the process of clustering once the validity index is determined.

A. ACCURACY
The Accuracy is a frequently used evaluation index and is easy to understand. It represents the ratio between the number of correctly partitioned objects and the total number of samples. A greater value of Accuracy means that more objects are correctly divided.
Definition 1 Accuracy (ACC hereafter):
ACC = (1/n) Σ_{i=1}^{k} θ i ,
where n denotes the total number of objects in the data set, θ i represents the number of objects that are correctly divided into the i-th cluster and k represents the number of clusters.
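A sketch of how ACC might be computed, under the assumption that θ i is obtained by matching each predicted cluster to its majority true label; this matching rule is our assumption for illustration, not a step stated in the paper:

```python
def clustering_accuracy(pred, truth, k):
    """ACC in the spirit of Definition 1: theta_i is taken here as the number
    of objects in predicted cluster i that share its majority true label
    (an illustrative assumption), and ACC = sum(theta_i) / n."""
    n = len(pred)
    total = 0
    for i in range(k):
        members = [truth[x] for x in range(n) if pred[x] == i]
        if members:
            # theta_i: size of the largest agreeing group in cluster i
            total += max(members.count(l) for l in set(members))
    return total / n

pred  = [0, 0, 0, 1, 1, 1]
truth = [0, 0, 1, 1, 1, 1]
print(clustering_accuracy(pred, truth, 2))  # 5 of 6 objects -> 0.833...
```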

B. FOWLKES_MALLOWS_SCORES
The Fowlkes_Mallows_Scores index (FMI) [48] is defined as the geometric mean of the pairwise precision and recall:
FMI = TP / √((TP + FP)(TP + FN)),
where TP is the number of True Positive pairs (i.e., the number of pairs of points that belong to the same cluster in both labels_true and labels_pred), FP is the number of False Positive pairs (i.e., the number of pairs of points that belong to the same cluster in labels_true and not in labels_pred) and FN is the number of False Negative pairs (i.e., the number of pairs of points that belong to the same cluster in labels_pred and not in labels_true). FMI measures the similarity between two clustering results. The score ranges from 0 to 1, and a high value indicates a good similarity between the two clusterings.
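The FMI can be computed directly from pair counts; a small self-contained sketch with illustrative data, following the pair conventions given above:

```python
from math import sqrt
from itertools import combinations

def fmi(truth, pred):
    """FMI = TP / sqrt((TP + FP) * (TP + FN)), counted over object pairs:
    TP pairs are together in both labelings, FP pairs only in truth,
    FN pairs only in the prediction."""
    tp = fp = fn = 0
    for a, b in combinations(range(len(truth)), 2):
        same_true = truth[a] == truth[b]
        same_pred = pred[a] == pred[b]
        if same_true and same_pred:
            tp += 1
        elif same_true:          # together in labels_true only
            fp += 1
        elif same_pred:          # together in labels_pred only
            fn += 1
    return tp / sqrt((tp + fp) * (tp + fn)) if tp else 0.0

print(fmi([0, 0, 1, 1], [0, 0, 1, 1]))  # identical partitions -> 1.0
print(fmi([0, 0, 1, 1], [0, 1, 0, 1]))  # no agreeing pairs    -> 0.0
```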

V. EXPERIMENTAL ILLUSTRATION
To test the performance of our proposed algorithm, six data sets from the UCI Machine Learning Repository [49] are employed in this section. Table 3 summarizes the detailed information of these datasets. The incomplete data sets are obtained by randomly selecting some feature values to be missing. The random selection of missing values is constrained by two conditions: (1) each original feature vector retains at least one component; (2) each feature has at least one value present in the incomplete data set. In most cases, the higher the missing rate in the dataset, the lower the accuracy of the clustering results, because a higher missing rate decreases the accuracy of incomplete data filling, which directly degrades the performance of the clustering algorithm. Therefore, Algorithm 4 is only suitable for low missing rates. In this paper, the missing rates are set from 5% to 30%.
Because the evaluation indices ACC and FMI are only adapted to hard clustering algorithms, in order to assess the quality of our proposed three-way clustering, we use all the core regions to form a clustering result and compute ACC and FMI by using the core region to represent the corresponding cluster. The total number of objects excludes the objects in the fringe regions when calculating ACC. Table 4 and Table 5 list the experimental results of ACC and FMI on the UCI datasets, respectively. For comparison, the performances of OCS-FCM and NPS-FCM [13] are also presented in Table 4 and Table 5, where bold data represent the best results. We report the average value and the best value of ACC and FMI over 100 experimental runs under different missing rates.
From the experimental results recorded in Table 4 and Table 5, we can draw the following conclusions. The average and best values of ACC and FMI of Algorithm 4 are better than those of OCS-FCM and NPS-FCM. The results of ACC and FMI of Algorithm 4 are relatively stable; as the missing rate increases, the values of ACC and FMI decrease gradually. Because we only consider the core regions and exclude the fringe regions when computing ACC and FMI, the performance fluctuates somewhat for different missing rates. The values of ACC and FMI of NPS-FCM are unstable and fluctuating, which indicates that the result of filling in the missing values based on the nearest prototype varies greatly, especially on the Pendigits data set. It is not difficult to see from the analysis that the comparison methods are based on fuzzy c-means, which has difficulty obtaining good results on non-spherical data sets, whereas Algorithm 4 is based on ensemble clustering, which can significantly improve the robustness, stability and quality of clustering results.
Through the comparison and analysis of the experimental results, we can conclude that Algorithm 4 is feasible and can effectively deal with the problem of incomplete data clustering.

VI. CONCLUSION
Because of the large number of samples with missing feature values, a single clustering algorithm cannot achieve a good clustering result for incomplete datasets. In order to overcome this problem, we use the ensemble clustering technique to combine multiple clustering results into a probably better one. Three-way clustering uses a core region and a fringe region to capture the three types of relationships between an element and a cluster, which is more appropriate than the use of a single set. Integrating three-way clustering and ensemble clustering, we present a three-way ensemble clustering algorithm for incomplete data in this paper.
In the proposed algorithm, we cluster the objects with non-missing values and use the mean attribute values of each cluster to fill the missing attribute values, respectively. Perturbation analysis of the cluster centroids is applied to search for the optimal imputation. Based on the proposed imputation method, we develop a three-way ensemble clustering algorithm by integrating the ideas of clustering ensemble and three-way decision. The objects with the same cluster label in different clustering results are assigned to the core region of the corresponding cluster, while the objects with different cluster labels are assigned to the fringe region. Therefore, a three-way clustering is naturally formed. Experimental results on UCI datasets demonstrate the effectiveness of the proposed method.
In this paper, we use the k-means clustering algorithm to obtain a set of base clustering results by generating different initial centers. For the sake of simplicity, the algorithm in this paper assumes that the number of clusters k in each cluster member is the same. How to apply the algorithm when the value of k differs across cluster members is our planned future work. Another direction of our future research is to extend our algorithm to noisy data and to analyze the noise level.