Co-Clustering Ensemble Based on Bilateral K-Means Algorithm

Clustering ensemble techniques have been shown to be effective in improving the accuracy and stability of single clustering algorithms. With the development of information technology, the amount of data, such as image, text, and video, has increased rapidly. Efficiently clustering these large-scale datasets is a challenge. Clustering ensembles usually transform clustering results into a co-association matrix, and then into a graph-partition problem. These methods may suffer from information loss when computing the similarity among samples or base clusterings, because the rich information between samples and base clusterings is ignored. Moreover, the results are not discrete: they need post-processing steps to obtain the final clustering result, which can deviate greatly from the real clustering result. To address this problem, we propose a co-clustering ensemble based on bilateral k-means (CEBKM) algorithm. Our algorithm simultaneously clusters the samples and base clusterings of a dataset, to fully exploit the potential information between the samples and the base clusterings. In addition, it directly obtains the final clustering result without using other clustering algorithms. The proposed method outperformed several state-of-the-art clustering ensemble methods in experiments conducted on real-world and toy datasets.


I. INTRODUCTION
Clustering techniques are applied in various fields, such as biology [1], image retrieval [2], information retrieval [3], and image processing [4]. The overall purpose of clustering is to divide data into groups based on some criteria, so that data of the same type are placed in one group as much as possible. However, each clustering algorithm has its own bias due to the optimization of different criteria. An additional challenge for a single clustering method is that ground-truth labels are unavailable, so it is difficult to validate the clustering results [5].
To resolve some of the challenges related to clustering, the idea of a clustering ensemble, also called consensus clustering or ensemble clustering, has been proposed. The clustering ensemble technique finds the underlying structure of data by combining different base clusterings into a single consensus clustering, which can provide a more robust solution [6]-[14]. A clustering ensemble method consists of two steps: generation and consensus function. In the generation step, multiple base clusterings are obtained by running traditional clustering methods on the same dataset. Several approaches have been proposed to obtain the base clusterings, such as the same algorithm with different parameter initializations [15], [16], different clustering algorithms [17], combining multiple weak clustering algorithms [18], random projection [19], and data resampling [20], [21]. In the consensus function step, multiple base clusterings are combined to improve the accuracy of the final clustering result [22]-[26]. Many methods have been proposed, using heuristics or meta-heuristics to find approximate solutions. These include co-association matrix-based methods [27], [28], graph-based methods [29], relabeling and voting methods [30], and locally adaptive cluster-based methods [5]. Other methods use an objective function to measure the similarity between the clustering results and the consensus. These include k-means-based algorithms [18], nonnegative matrix factorization [31], the EM algorithm [32], and kernel-based algorithms [33].
(The associate editor coordinating the review of this manuscript and approving it for publication was Kashif Munir. VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.)
Among the above methods, co-association matrix-based methods have attracted the interest of many researchers [34]-[37]. The base clusterings are combined into a co-association matrix by counting how many times two samples belong to the same cluster. Researchers derive pairwise similarity from the clustering results and then transform the clustering ensemble into a graph-partition problem [34]. Graph-based methods usually transform clustering results into a co-association matrix. This matrix is very important in the clustering ensemble method, because it determines how well graph partitioning can produce the final consensus partition. However, it only captures the similarity between data points, ignoring the rich information between samples and base clusterings. The co-association matrix may therefore lead to information loss, and possibly the inability to accurately find the underlying structure of the data.
Some graph-based methods construct a bipartite graph that contains all of the samples and base clusterings as vertices [38], based on a spectral graph partitioning algorithm [39]. However, when dealing with multipartitioning problems, they need to relax the original discrete problem to a continuous one, and then use a k-means algorithm or the METIS algorithm [40] to obtain the final discrete results. This discrete-continuous-discrete transformation can cause the final result to deviate greatly from the actual result.
In light of the deficiencies of the above methods, we propose a co-clustering ensemble based on a bilateral k-means algorithm, which simultaneously clusters both the samples and the base clusterings of a dataset. Our algorithm can directly obtain the final clustering result without using a traditional clustering algorithm. Our contributions are summarized as follows:
1) Our algorithm involves constraints on indicator matrices, which store the clustering results of samples and base clusterings. It performs ensemble clustering by simultaneously utilizing the sample information and the base clustering information, making full use of the information between the samples and the base clusterings to effectively improve the clustering accuracy.
2) Our algorithm can directly obtain the final clustering result. Many existing clustering ensemble methods require post-processing steps to obtain the result; some, such as CSPA, HGPA, and MCLA, use a k-means or METIS algorithm to obtain the final clustering result.
3) We performed a large number of experiments on several datasets. The clustering evaluation criteria NMI and RI were used to compare our algorithm's performance with that of other advanced clustering algorithms. Experimental results show that our method outperforms them.
The rest of this paper is organized as follows. In Section II, we revisit related work on clustering ensembles. In Section III, we propose the co-clustering ensemble with the BKM model and introduce the clustering algorithm and its update process. We compare the performance of the proposed model with several related methods in Section IV. We end with conclusions, acknowledgements, and references.

II. RELATED WORK
Clustering ensembles have been shown to effectively improve the accuracy and stability of single clustering algorithms. They generate a set of base clusterings from the same dataset and combine them into a final clustering.
The information of a group of base clusterings is combined in a co-association matrix, which counts how often two samples are in the same cluster [34]; this matrix is used as a similarity matrix so that a graph partitioning method can be applied to obtain the clustering result. Strehl and Ghosh proposed three graph-based algorithms [17]: the cluster-based similarity partitioning algorithm (CSPA), the hypergraph partitioning algorithm (HGPA), and the meta-clustering algorithm (MCLA). In CSPA, a binary membership matrix is constructed for the clusters of the ensemble. Each column corresponds to a cluster and has an entry of 1 if the corresponding point belongs to that cluster, and 0 otherwise. The resulting similarity matrix is used to recluster the samples with a graph-partitioning approach, where the vertices correspond to the samples and the edge weights correspond to the similarities. METIS [40] is used to partition the graph for the final clustering result. In HGPA, all of the hyperedges have the same weight, and each hyperedge represents a cluster of the ensemble. HGPA looks for c unconnected components of approximately the same size by cutting the hyperedges. The hypergraph partitioning package HMETIS [40] is used to generate the final clustering result. MCLA is based on the clustering of clusters. It provides object-wise confidence estimates of cluster membership. This algorithm groups and collapses related hyperedges and assigns an object to the collapsed hyperedge in which it participates most strongly.
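To make the first step concrete, the following sketch (our own illustration, not the authors' code) builds a co-association matrix from a list of base clustering label vectors: entry (i, j) is the fraction of base clusterings that place samples i and j in the same cluster.

```python
def co_association(base_clusterings):
    """Build the co-association matrix from base clustering label vectors.

    base_clusterings: list of m label lists, each of length n.
    Returns an n x n matrix whose (i, j) entry is the fraction of base
    clusterings in which samples i and j share a cluster.
    """
    m = len(base_clusterings)
    n = len(base_clusterings[0])
    C = [[0.0] * n for _ in range(n)]
    for labels in base_clusterings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    C[i][j] += 1.0 / m
    return C
```

A graph-partitioning tool such as METIS would then treat C as edge weights; that step is omitted here.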
Hybrid bipartite graph formulation (HBGF) was proposed by Fern and Brodley [38]. A bipartite graph is constructed in which both samples and base clusterings simultaneously act as vertices. In the bipartite graph, all of the edges have weight 1; clustering vertices are only connected to sample vertices, and sample vertices are only connected to clustering vertices. A multi-way spectral graph partitioning algorithm produces the bipartite graph partitioning result, and METIS is used to obtain the final clustering result. The spectral co-clustering ensemble (SCCE) [29] also constructs a bipartite graph, converting the bipartite graph partition problem into a matrix decomposition; the final clustering result is obtained by k-means. K-means-based Consensus Clustering (KCC) [41] was proposed as a theoretical framework, which provides the sufficient and necessary condition for KCC utility functions to exactly map the ensemble clustering to a k-means problem with theoretical support.
More recently, Yazhou Ren et al. proposed three other consensus techniques that leverage weighted samples [42].
The difficulty of clustering a sample is estimated by constructing the co-association matrix and embedding the corresponding information as weights associated with samples. Three algorithms were proposed. The Weighted-Object Meta Clustering Algorithm (WOMC) is a weighted version of the Meta-Clustering Algorithm (MCLA): a meta-graph is constructed and partitioned into c meta-clusters, each representing a group of clusters, by using the similarity measure and applying the METIS [40] graph-partitioning algorithm; each object is assigned to the meta-cluster in which it participates most strongly. The Weighted-Object Similarity Partitioning algorithm (WOSP) moves centers once for each base clustering by using the original data and the objects' weight information. Whereas clustering with a boosting technique aims to generate a better single clustering result, this approach seeks a better consolidated consensus solution, constructing the corresponding graph and partitioning it into c parts with METIS, where the partition represents the final clustering result. Weighted-Object Hybrid Bipartite graph partitioning (WOHB) reduces the clustering consensus problem to a bipartite graph partitioning problem, which simultaneously partitions both clustering vertices and sample vertices: the hybrid bipartite graph is constructed, METIS partitions it into c parts, and the partition of the objects provides the final clustering result.
None of the above methods can directly obtain the final clustering result, and some cannot simultaneously cluster the samples and base clusterings of a dataset. Almost all graph partitioning methods need other clustering algorithms to get the final result. In this process, the discrete solution (the final clustering result) is relaxed to a continuous solution, which is obtained by eigendecomposition. Getting the final clustering result from the eigenvectors usually involves k-means or the METIS algorithm, which can deviate from the optimal solution. Also, some methods only consider the relationship between samples, or between base clusterings, ignoring the relationship between samples and base clusterings. In contrast, our proposed CEBKM algorithm can directly obtain the final clustering result, and it makes full use of the information relating the samples and the base clusterings to improve the accuracy of the clustering ensemble.

III. CO-CLUSTERING ENSEMBLE BILATERAL K-MEANS ALGORITHM
In this section, we first introduce some notation, then we will briefly review the hybrid bipartite graph formulation (HBGF) techniques for solving co-clustering ensemble problems, and explain the co-clustering ensemble bilateral k-means algorithm in detail.

A. NOTATION
In this section, we give the notation used in this paper. We write matrices as boldface uppercase letters and vectors as boldface lowercase letters. The transpose of a matrix Z is denoted by Z^T, and the trace of Z is denoted by Tr(Z). We denote by z_{i·} ∈ R^{1×n} the i-th row of Z, and by z_{·j} ∈ R^{n×1} the j-th column of Z.
Given a data matrix X = [x_1, x_2, …, x_n]^T ∈ R^{n×d}, the vector x_{i·} denotes the i-th row of X, and the vector x_{·j} the j-th column of X. h_c denotes a cluster of samples, and τ_c denotes a cluster of features.
F ∈ {0, 1}^{n×c} and G ∈ {0, 1}^{d×c} are two partition matrices that represent the clustering results of samples and features, respectively. If the i-th sample x_{i·} belongs to cluster h_j, then f_{ij} = 1, and if the i-th feature x_{·i} belongs to cluster τ_j, then g_{ij} = 1. Each row in F and G has one and only one nonzero element, and the others are all zero. We call such a matrix an indicator matrix, and denote the set of indicator matrices by ∅, i.e., F ∈ ∅^{n×c} and G ∈ ∅^{d×c}. Details and other notations are summarized in Table 1.
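The indicator-matrix constraint can be illustrated with a small sketch (a hypothetical helper, not from the paper): given a label vector, each row of the resulting matrix has exactly one nonzero entry.

```python
def indicator_matrix(labels, c):
    """Convert a label vector into an n x c indicator matrix F,
    where F[i][k] = 1 iff sample i belongs to cluster k."""
    F = [[0] * c for _ in labels]
    for i, k in enumerate(labels):
        F[i][k] = 1
    return F
```

For example, `indicator_matrix([0, 2, 1, 2], 3)` yields a 4 x 3 matrix whose rows each sum to 1.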

C. HYBRID BIPARTITE GRAPH FORMULATION
Hybrid bipartite graph formulation (HBGF) uses the spectral method to convert the co-clustering problem into a partition problem on a bipartite graph. HBGF combines multiple base clusterings of a dataset with the samples to generate a bipartite graph G = <V, A>, where all of the base clusterings and all of the samples are contained in V as vertices. The adjacency matrix A can be written as:

A = [ 0, M ; M^T, 0 ]    (1)

where M is the edge-weight matrix, whose rows and columns correspond to the samples and base clusterings, respectively. Base clustering vertices are only connected to sample vertices, and the edges in the graph have weight 1:

M(i, j) = 1 if sample i belongs to the j-th cluster, and 0 otherwise    (2)

The multi-way spectral graph partitioning algorithm is applied to the bipartite graph. Spectral clustering models use the eigenvalues of the Laplacian matrix of the data to perform dimensionality reduction before clustering. The Laplacian matrix L of G is defined as:

L = D − A    (3)

where D is a diagonal matrix with D(ii) = Σ_j A(ij). Spectral clustering models seek to optimize the normalized cut criterion. The objective of the normalized cut is:

min_Q Tr((Q^T D Q)^{-1} Q^T L Q)    (4)

where Q ∈ R^{n×c} is the partition matrix and c is the reduced dimension. The optimization problem in equation (4) is solved by computing the normalized weight matrix D^{-1} A and finding its c (the number of clusters) largest eigenvectors; the final clustering solution is then obtained by using the METIS algorithm.
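The construction of A, D, and L from the sample-by-cluster matrix M can be sketched as follows (our own illustration; variable names match the equations above):

```python
def bipartite_adjacency(M):
    """Build A = [[0, M], [M^T, 0]] from an n x m edge-weight matrix M."""
    n, m = len(M), len(M[0])
    A = [[0.0] * (n + m) for _ in range(n + m)]
    for i in range(n):
        for j in range(m):
            A[i][n + j] = M[i][j]  # sample vertex i -- cluster vertex j
            A[n + j][i] = M[i][j]  # symmetric entry
    return A

def laplacian(A):
    """Compute L = D - A, where D is the degree matrix D[i][i] = sum_j A[i][j]."""
    size = len(A)
    L = [[-A[i][j] for j in range(size)] for i in range(size)]
    for i in range(size):
        L[i][i] += sum(A[i])
    return L
```

Every row of L sums to zero, a standard sanity check for a graph Laplacian.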

D. REVISITING THE BILATERAL K-MEANS ALGORITHM
Given a data matrix X ∈ R^{n×d}, its i-th row is denoted as x_{i·} and its j-th column as x_{·j}. The columns represent features and the rows represent samples. Co-clustering based on diagonal co-clusters aims to group the samples into c clusters and the features into c clusters. We construct a bipartite graph G = <V, A> between the samples and features of the dataset X ∈ R^{n×d}, where V is the set of vertices corresponding to samples and features, and A is the adjacency matrix constructed from the data matrix X:

A = [ 0, X ; X^T, 0 ]    (5)

BKM aims to find the multipartitioning normalized cuts of G by solving the optimization problem

min_Y Tr((Y^T D Y)^{-1} Y^T L Y),  s.t. Y ∈ ∅^{(n+d)×c}    (6)

where D is the diagonal matrix with diagonal elements D_{ii} = Σ_{j=1}^{n+d} A_{ij}, each row of Y has only one nonzero element, and the Laplacian matrix is L = D − A. The first n rows of Y = [y_{1·}^T, y_{2·}^T, …, y_{(n+d)·}^T]^T indicate the clustering results of the samples, and the remaining rows indicate the clustering results of the features. Since D is a diagonal matrix, Y^T D Y is also a diagonal matrix, with diagonal elements (Y^T D Y)_{kk} equal to y_{·k}^T D y_{·k}; thus the optimization problem in equation (6) can be rewritten as

min_Y Σ_{k=1}^{c} (y_{·k}^T L y_{·k}) / (y_{·k}^T D y_{·k}),  s.t. Y ∈ ∅^{(n+d)×c}    (7)

where Tr(·) is the trace of a matrix. Since L = D − A and Y^T = [F^T, G^T], we can rewrite (7) as

min_{F,G} c − 2 Tr((Y^T D Y)^{-1} F^T X G)    (8)

So, the optimization problem in equation (8) is changed to

max_{F,G} Tr((Y^T D Y)^{-1} F^T X G)    (9)

Since the problem is NP-complete [43], we solve the optimization problem in equation (9) by relaxing the discrete constraint, adding the two terms Tr((Y^T D Y)^{-1} F^T F (Y^T D Y)^{-1} G^T G) and Tr(X^T X). So, the optimization problem in equation (9) can be rewritten as

min_{F,G} ||X − F (Y^T D Y)^{-1} G^T||_F^2    (10)

We simplify the solution of the optimization problem in equation (10) by replacing the diagonal matrix (Y^T D Y)^{-1} with a diagonal matrix S, taking S as a new parameter, so the optimization problem in equation (10) is simplified to:

min_{F,S,G} ||X − F S G^T||_F^2,  s.t. F ∈ ∅^{n×c}, G ∈ ∅^{d×c}, S = diag(s)    (11)

where diag(s) denotes a diagonal matrix, and F and G are the indicator matrices indicating the clustering results of the samples and of the features, respectively.
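The structure of the objective in equation (11) can be made concrete: because F and G are indicator matrices and S is diagonal, (F S G^T)_{ij} equals s_k when sample i and feature j both lie in cluster k, and 0 otherwise. A small sketch (our own; f and g are assumed to be label vectors and s the diagonal of S):

```python
def bkm_objective(X, f, g, s):
    """Evaluate ||X - F S G^T||_F^2 for row labels f, column labels g,
    and diagonal entries s; (F S G^T)[i][j] = s[f[i]] if f[i] == g[j] else 0."""
    total = 0.0
    for i, row in enumerate(X):
        for j, x in enumerate(row):
            approx = s[f[i]] if f[i] == g[j] else 0.0
            total += (x - approx) ** 2
    return total
```

For a perfectly block-diagonal X with matching labels and s equal to the block values, the objective is exactly zero.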

E. CLUSTERING ENSEMBLE
The clustering ensemble has two steps: generation and consensus function. The first step uses k-means to generate multiple clustering results, each with a random initialization of cluster centers. Each base clustering is rewritten as an indicator matrix, and these matrices are used to obtain a bipartite graph. Given an input dataset X = [x_1, x_2, …, x_n]^T ∈ R^{n×d}, we run the k-means algorithm m times to get the base clusterings {π_1, π_2, …, π_m}. Each resulting label vector is then changed into an indicator matrix. Concatenating these partial indicator matrices and treating them as new features, we obtain a new matrix, defined as the matrix R in Fig. 2.
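The construction of R can be sketched as follows (our own illustration): each base clustering's label vector becomes an n × c indicator block, and the m blocks are concatenated column-wise into an n × (c × m) matrix.

```python
def ensemble_matrix(base_clusterings, c):
    """Concatenate the indicator matrices of m base clusterings into
    an n x (c*m) matrix R."""
    n = len(base_clusterings[0])
    R = [[] for _ in range(n)]
    for labels in base_clusterings:
        for i in range(n):
            block = [0] * c          # indicator row for this base clustering
            block[labels[i]] = 1
            R[i].extend(block)
    return R
```

Each row of R sums to m, since every sample is assigned to exactly one cluster per base clustering.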
We use the new data matrix R to generate a bipartite graph. Using the bipartite graph model [44], the base clustering clusters τ_c (1 ≤ c ≤ k) and the sample clusters h_c (1 ≤ c ≤ k) are determined as follows. A given sample x_{i·} belongs to sample cluster h_m if its association with the base clustering cluster τ_m is greater than its association with any other base clustering cluster. Thus,

h_m = { x_{i·} : Σ_{j∈τ_m} R_{ij} ≥ Σ_{j∈τ_c} R_{ij}, ∀ c }    (12)

Each sample cluster is determined by the base clustering clusters. Similarly, the base clustering cluster is given by

τ_m = { r_{·j} : Σ_{i∈h_m} R_{ij} ≥ Σ_{i∈h_c} R_{ij}, ∀ c }    (13)

We can see that there is a recursive relationship between τ_c and h_c, and the relations described in Eq. (12) and Eq. (13) determine a bipartite spectral graph partition based on a diagonal co-cluster structure.
We obtain the final clustering result by solving the optimization problem in equation (11) with an alternating optimization scheme, as follows.
First, initialize the indicator matrices F and G; since they are indicator matrices, F^T F and G^T G are diagonal. To update S with F and G fixed, expand the objective function and define it as

J = Tr(R^T R) − 2 Tr(S F^T R G) + Tr(S F^T F S G^T G)    (14)

Then the constraints on S are relaxed. Setting the partial derivative of J with respect to S to zero gives the update:

S = (F^T F)^{-1} F^T R G (G^T G)^{-1}    (15)

Second, update G with F and S fixed. The optimization problem for G can be decomposed into (c × m) simple subproblems, one for each i (1 ≤ i ≤ (c × m)), because the data matrix R can likewise be decomposed column by column:

min_{g_{i·}} ||r_{·i} − W g_{i·}^T||_2^2    (16)

where W = F S. Because there is only one element equal to 1 and the remaining elements are zero in the vector g_{i·}, the solution of equation (16) is obtained from:

g_{ik} = 1, where k = arg min_k ||r_{·i} − w_{·k}||_2^2    (17)

where w_{·k} is the k-th column of W.
Finally, we update F with G and S fixed. The optimization problem for F can be decomposed into n simple subproblems, one for each i (1 ≤ i ≤ n), because R can also be decomposed row by row:

min_{f_{i·}} ||r_{i·} − f_{i·} L||_2^2    (18)

where L = S G^T. There is only one element equal to 1, and the remaining elements are zero, in the vector f_{i·}, so the solution of equation (18) is obtained from:

f_{ik} = 1, where k = arg min_k ||r_{i·} − l_{k·}||_2^2    (19)

where l_{k·} is the k-th row of L.
We summarize the main steps of our algorithm as follows.

Algorithm 1 Co-Clustering Ensemble Based on Bilateral K-Means Algorithm
Require: The data matrix X ∈ R^{n×d} and the number of clusters c.
1. Run the k-means algorithm m times to cluster the n samples into c clusters.
2. Change the result of each clustering into an indicator matrix, i.e., if sample x_i (1 ≤ i ≤ n) belongs to cluster c', then x_{ic'} = 1 and the other entries of that row are zero.
3. Combine the clustering solutions to obtain a new matrix R ∈ R^{n×(c×m)}. Initialize the indicator matrices F and G.
repeat
    Update S with F and G fixed by using Eq. (15).
    Update G with F and S fixed by using Eq. (17).
    Update F with G and S fixed by using Eq. (19).
until convergence
Ensure: The indicator matrix F gives the sample clustering.
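A minimal end-to-end sketch of the alternating updates in Algorithm 1, operating on the ensemble matrix R (our own pure-Python illustration; representing F and G as label vectors and the round-robin initialization are implementation choices, not from the paper):

```python
def cebkm(R, c, n_iter=30):
    """Alternately update S (Eq. 15), G (Eq. 17), and F (Eq. 19) on the
    ensemble matrix R; returns the sample cluster labels (the matrix F
    encoded as a label vector)."""
    n, d = len(R), len(R[0])
    f = [i % c for i in range(n)]  # row (sample) labels: F as a label vector
    g = [j % c for j in range(d)]  # column labels: G as a label vector
    s = [0.0] * c                  # diagonal of S
    for _ in range(n_iter):
        # S update: since F^T F and G^T G are diagonal, Eq. (15) reduces to
        # the mean of R over each diagonal co-cluster block.
        for k in range(c):
            rows = [i for i in range(n) if f[i] == k]
            cols = [j for j in range(d) if g[j] == k]
            if rows and cols:
                s[k] = sum(R[i][j] for i in rows for j in cols) / (len(rows) * len(cols))
        # G update (Eq. 17): assign each column of R to the nearest column of
        # W = F S, whose i-th entry is s[k] if f[i] == k and 0 otherwise.
        for j in range(d):
            g[j] = min(range(c), key=lambda k: sum(
                (R[i][j] - (s[k] if f[i] == k else 0.0)) ** 2 for i in range(n)))
        # F update (Eq. 19): assign each row of R to the nearest row of L = S G^T.
        for i in range(n):
            f[i] = min(range(c), key=lambda k: sum(
                (R[i][j] - (s[k] if g[j] == k else 0.0)) ** 2 for j in range(d)))
    return f
```

On an R built from base clusterings that agree, the returned labels separate the underlying groups without any post-processing step.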

IV. EXPERIMENTS
We conducted experiments on several datasets to evaluate the effectiveness of the proposed method. There were two aspects to the experiments. The first was to compare the NMI and RI of the proposed CEBKM with those of other clustering ensemble methods on real datasets. The second was to study how the ensemble size and the diversity of the base clusterings affect the clustering result.

A. DATASETS
We compared the performance of our method with that of several state-of-the-art clustering techniques. We conducted experiments on twelve datasets. The characteristics of all these datasets are summarized in Table 2. The real datasets (Zoo, Yeast, Pima, Heart, Ecoli, Diabetes, Crx, Australian, corel-5k, MnistData-10) are from benchmark datasets. MnistData-10 is 10% sampled from the original Mnist dataset. The number of classes ranges from 2 to 50, the number of samples from 101 to 6996, and the number of features from 8 to 1470.
The toy datasets (Twomoon and Gaosi) are shown in Fig. 3. Toy dataset 1 (Twomoon) consists of two classes of data distributed in a moon shape. Each cluster has 100 samples, and the noise percentage is set to 0.1. Toy dataset 2 (Gaosi) consists of two classes with 100 data points each. The two classes are generated according to multivariate Gaussian distributions that share the same covariance matrix [1 0; 0 2]. The mean vectors are (7, 8) and (10, 11).
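The Gaosi dataset can be reproduced with a short sketch (our own; since the covariance [1 0; 0 2] is diagonal, the two coordinates can be drawn independently with standard deviations 1 and √2):

```python
import random

def gaosi_dataset(n_per_class=100, seed=0):
    """Two Gaussian classes with shared diagonal covariance [1 0; 0 2]
    and mean vectors (7, 8) and (10, 11)."""
    rng = random.Random(seed)
    data, labels = [], []
    for label, (mx, my) in enumerate([(7.0, 8.0), (10.0, 11.0)]):
        for _ in range(n_per_class):
            # independent draws are valid here because the covariance is diagonal
            data.append((rng.gauss(mx, 1.0), rng.gauss(my, 2 ** 0.5)))
            labels.append(label)
    return data, labels
```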

1) EVALUATION CRITERIA
Since the labels of the datasets in Table 2 are known, we can use the normalized mutual information (NMI) [17] and the Rand index (RI) [45] to quantitatively validate our model.
Given a clustering result α (the clustering result obtained by running the algorithm) and the true clustering labels β, the number of clusters in α and the number of classes in β are both c. Suppose n is the number of samples, n_i is the number of samples in the i-th cluster, n_j is the number of samples in the j-th class, and n_{ij} is the number of samples in both the i-th cluster and the j-th class. The NMI between α and β is calculated as follows:

NMI(α, β) = ( Σ_{i=1}^{c} Σ_{j=1}^{c} n_{ij} log( (n · n_{ij}) / (n_i n_j) ) ) / sqrt( ( Σ_{i=1}^{c} n_i log(n_i / n) ) ( Σ_{j=1}^{c} n_j log(n_j / n) ) )

Denote by a_{11} the number of pairs of samples that are in the same cluster in α and also in the same cluster in β, and by a_{00} the number of pairs of samples that are in different clusters in α and also in different clusters in β. Let a_{01} be the number of pairs of samples that are in different clusters in α but in the same class in β, and a_{10} the number of pairs of samples that are in the same cluster in α but in different classes in β. The Rand index is:

RI = (a_{00} + a_{11}) / (a_{00} + a_{01} + a_{10} + a_{11})
Both NMI and RI range from 0 to 1, and the higher values of RI or NMI indicate better clustering performance.
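The Rand index can be computed directly from the pair counts defined above; a small sketch (our own, not the paper's evaluation code):

```python
from itertools import combinations

def rand_index(alpha, beta):
    """RI = (a00 + a11) / (total number of sample pairs): the fraction of
    pairs on which the two partitions agree."""
    agree, pairs = 0, 0
    for i, j in combinations(range(len(alpha)), 2):
        if (alpha[i] == alpha[j]) == (beta[i] == beta[j]):
            agree += 1  # counted in a11 (same-same) or a00 (diff-diff)
        pairs += 1
    return agree / pairs
```

Note that RI is invariant to label permutation, so a relabeled but otherwise identical partition scores 1.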

2) COMPARISON SETTINGS
We compared CEBKM with state-of-the-art clustering ensemble algorithms, including HBGF, HGPA, MCLA, CSPA, WOHB, WOSP, and WOMC, and with the k-means algorithm. Source codes for the compared methods were obtained from the authors' websites. The experimental setup is described as follows. To generate input for the clustering algorithms, we set c to the number of ground-truth classes. In the generation step, we ran k-means 20, 40, 60, 80, and 100 times with random initialization of cluster centers to divide the dataset into c disjoint parts. We independently ran the clustering ensemble algorithms 100 times.

B. EXPERIMENTAL RESULTS ON TOY DATASETS
We compared the performance of k-means and CEBKM on the toy datasets. We set the ensemble size to 40. RI was used to evaluate the performance of the algorithms. We independently ran the two algorithms 50 times and recorded the average RI values. The results are shown in Table 5, where the best performances that are statistically significant are highlighted in boldface. The results in Table 5 show that CEBKM significantly outperformed k-means on the toy datasets: its RI values were consistently larger than those of k-means. This demonstrates the effectiveness of ensemble techniques. Table 3 shows the normalized mutual information (NMI) of all of the algorithms on the eight real datasets. In the experiments, each method had 100 independent runs, and the average results were computed. From Table 3, we can see that the proposed CEBKM achieved better results than the other clustering ensemble methods, except on the Australian dataset, where our value was still close to the best value.

C. EXPERIMENTAL RESULTS ON REAL DATASETS
To determine the relationship between the clustering result and the ensemble size, we performed another experiment on six datasets (Zoo, Yeast, Heart, Ecoli, Crx, Australian). Each clustering result for each dataset was generated by combining 20, 40, 60, 80, or 100 base clusterings. The average clustering performance on the six datasets with the ensemble size ascending from 20 to 100 is shown in Fig. 4, from which it can be observed that the clustering result was almost constant or increased as the ensemble size grew from 20 to 100, except on the Ecoli dataset, where the clustering results of HGPA and WOMC showed large fluctuations. The overall trend was to improve with increasing ensemble size. Moreover, in some cases the clustering results remained almost constant, since there were no significant differences between the base clusterings. Fig. 4 shows that the clustering result of CEBKM was the best among the compared methods, and that its value increased with the ensemble size. This reveals that a larger ensemble size is beneficial.
Another experiment explored the relationship between the clustering result and the diversity of the base clusterings. Since the experimental effect of HGPA was not ideal, we added SCCE and KCC to the comparison. Each base clustering was produced by the k-means algorithm with the number of clusters k randomly selected in the interval [2, √N], where N is the number of objects in the dataset. In this experiment, 40 base clusterings were randomly generated for each of the datasets (Zoo, Yeast, Pima, Heart, Ecoli, Diabetes, Crx, Australian, corel_5k, MnistData_10). As seen from Table 4, the proposed CEBKM achieved better results than the other clustering ensemble methods.

V. CONCLUSION
We have proposed a clustering ensemble model named CEBKM. Unlike traditional ensemble methods that only work on the samples or the base clusterings of a dataset, CEBKM simultaneously clusters the samples and the base clusterings, and it makes full use of the relationship between them to get a better clustering result. What is more, our algorithm can efficiently obtain the final clustering result without post-processing: when the algorithm converges, the result stored in the indicator matrix is the final clustering result. Experimental results show that our method outperforms other advanced clustering algorithms. Moreover, we studied the relationship between the clustering results and the ensemble size. Our analysis shows that diversity of the base clusterings is beneficial.
HUI YANG received the M.S. and Ph.D. degrees from Northeastern University, Shenyang, China, in 1988 and 2004, respectively. He is currently a Professor with the School of Electrical and Automation Engineering, East China Jiaotong University, Nanchang, China. His current research interests include intelligent transportation systems, complex system modeling, control and optimization, and process industry integrated automation technology and applications.
HAN PENG is currently pursuing the M.S. degree with the School of Key Laboratory of Advanced Control and Optimization of Jiangxi Province, East China Jiaotong University, Nanchang, China. His research interests include machine learning and data mining.
JIANYONG ZHU received the Ph.D. degree from Central South University, China, in 2014. He is currently an Associate Professor with the School of Electrical and Automation Engineering, East China Jiaotong University, Nanchang, China. His research interests include data analysis, random distribution control, and predictive control.
FEIPING NIE received the Ph.D. degree in computer science from Tsinghua University, China, in 2009. He has published more than 100 articles in the following top journals and conferences, such as TPAMI, IJCV, TIP, TNNLS/TNN, TKDE, TKDD, Bioinformatics, ICML, NIPS, KDD, IJCAI, AAAI, ICCV, CVPR, and ACM MM. His articles have been cited more than 5000 times (Google scholar). His research interests include machine learning and its applications, such as pattern recognition, data mining, computer vision, image processing, and information retrieval. He is currently serving as an Associate Editor or a PC Member for several prestigious journals and conferences in the related fields.