Double weighted ensemble clustering for cancer subtypes analysis

The era of big data makes precision medicine possible. In principle, each patient's cancer is different, so it is necessary to make personalized treatment plans for different cancer patients. Cancer subtype analysis can be cast as a clustering problem, and ensemble clustering techniques are widely used for their ability to combine multiple base clusterings into a potentially better and more robust clustering. However, the reliability of existing ensemble clustering methods in cancer subtype analysis still needs improvement. We therefore propose a double weighted ensemble clustering method (DWEC). It first derives a similarity matrix for each base clustering using a local weighting method; this step can be regarded as the first, cluster-level weighting. The search for the final partition is then formulated as an optimization problem, and the similarity matrices of the base clusterings are weighted a second time while a block coordinate descent algorithm solves for the optimal partition. The method obtained the best experimental results on both labeled datasets and unlabeled cancer gene datasets, validating its superiority. In the subtype analysis of glioblastoma multiforme, our method did not yield statistically significant differences in the survival distributions of the subtypes; however, it performed best in the log-rank tests on the other four cancer gene datasets, and we therefore conclude that our method is more effective for cancer subtype analysis than competing methods.


I. INTRODUCTION
Cancer is among the most complex diseases faced by society today, and more than 200 types of cancer have been identified. Because cancer genomes change dynamically, genetic alterations such as somatic mutations, copy number variations, altered gene expression profiles, and epigenetic variations may be unique to each cancer. These varied molecular profiles mean that each cancer comprises several subtypes, which poses a formidable challenge to medical researchers: cancers with different molecular structures require different treatment approaches [1]. By incidence, the top three cancers are lung cancer, female breast cancer, and colorectal cancer [2]. Data clustering is an important method in data mining and machine learning; its purpose is to divide a given dataset into clusters whose members share common characteristics [3]. Discovering cancer subtypes with clustering algorithms has therefore attracted much attention, since it can help clinicians develop precise treatments by analyzing the differing molecular profiles of cancer patients and healthy subjects [4]. Over the past decades many kinds of clustering methods have been proposed [5]-[16], but a major drawback is that each works well only on datasets with particular structures and is not universally applicable. In view of this defect, cluster ensembles were proposed and quickly became a hot research topic. In ensemble clustering, each input clustering is called a base clustering, and the final clustering result is called the consensus clustering.
We use ensemble clustering to discover cancer subtypes: by combining the results of different base clusterings on the same dataset, we obtain a result that is more stable and better than the individual clusterings. A large number of ensemble clustering algorithms have been proposed [17]-[28]. The evidence accumulation clustering algorithm proposed by Fred et al. is based on the co-association matrix: a connection matrix is first obtained by recording whether two data objects belong to the same class in a base clustering result, the connection matrices of all base clusterings are then combined through a voting mechanism into the co-association matrix, and this matrix is finally used as the input of hierarchical clustering to obtain the ensemble result [29]. However, the quality of the base clusterings plays a crucial role in the consensus process, and the consensus result can be compromised by low-quality base clusterings. In recent years, research on weighting or selecting base clustering members has appeared one after another. For example, Li et al. jointly learn the partition weights and the final consensus connection matrix under the Bregman divergence framework, proposing a weighted clustering ensemble scheme that weights data vectors of different dimensions differently [4]. D. Huang et al. proposed the Normalized Crowd Agreement Index (NCAI) to assess the quality of base clusterings in an unsupervised manner, weighting the base clusterings according to their clustering validity [30]. However, these methods rest on an implicit assumption that all clusters within the same base clustering have the same reliability.
They usually treat each base clustering as a whole and assign it a single global weight, regardless of the diversity of the clusters within it. Yet, due to noise and the inherent complexity of real datasets, different clusters within the same base clustering may have different reliability; the local diversity of the ensemble must be respected and the reliability of individual clusters handled accordingly. Recently, D. Huang et al. proposed an ensemble-driven cluster validity measure and a locally weighted co-association matrix that summarizes the ensemble at the level of individual clusters; by computing cluster entropy, the method more easily escapes the influence of low-quality clusters [27]. The method in [27] uses a local weighting strategy based on ensemble-driven cluster validity to refine the co-association matrix, yielding the concept of a locally weighted co-association matrix. This matrix can be regarded as a cluster-weighted consensus function, obtained as the locally weighted average of the connection matrices of the base clusterings. However, such simple averaging carries no theoretical guarantee of optimality.
To address these issues, this paper proposes a novel ensemble clustering framework for determining the optimal clustering of cancer datasets to assist in the analysis of cancer subtypes. Following the local weighting method of [27], we integrate cluster uncertainty and validity into a local weighting scheme to improve consensus performance. A cluster can be viewed as a local region within its base clustering. The uncertainty of each cluster is estimated by an entropy criterion over the cluster labels of the entire ensemble: given a cluster, we examine how the objects inside it are grouped across the multiple base clusterings. On the basis of this uncertainty estimation, the reliability of a cluster is measured by an ensemble-driven cluster index (ECI). After obtaining the locally weighted similarity matrix of each base clustering, the integration of these matrices into the final result is treated as an optimization problem, and a new consensus function is proposed to construct the final clustering.
The main contributions of our method are summarized as follows: it not only integrates the uncertainty and validity of clusters into a local weighting scheme, but also fully accounts for the varying uncertainty of clusters within the same base clustering. In addition, a new consensus function provides theoretical support for the optimality of the final clustering result.
Multiple experiments are conducted on a large number of real data and cancer datasets, and the results demonstrate the superiority of the proposed ensemble clustering method in terms of clustering quality and efficiency.
The rest of this article is organized as follows. Related work will be presented in Section 2. The formulation of the ensemble clustering problem will be given in Section 3. The ensemble clustering method proposed in this paper will be introduced in Section 4. Experimental results are reported in Section 5. Section 6 summarizes the full text.

II. RELATED WORKS
In recent years, with the rapid development of whole-genome sequencing and bioinformatics, cluster analysis of gene expression profiles has become an important research topic in the diagnosis of cancer subtypes, helping to provide more precise medical treatment for cancer patients. For example, Ronglai Shen et al. developed iCluster, a joint latent variable model for integrative clustering that combines flexible modeling of the associations between different data types with the variance-covariance structure within data types in a single framework, while reducing the dimensionality of the dataset; likelihood inference is performed with an expectation-maximization algorithm to identify cancer subtypes characterized by DNA copy number variations and gene expression [31]. Bo Wang et al. developed Similarity Network Fusion (SNF), which constructs a network of samples (such as patients) for each available data type and then efficiently fuses these networks into one that represents the full spectrum of the underlying data, from which cancer subtypes are identified [32]. Jianqiang Li et al. proposed Bregmannian Consensus Clustering (BCC), which generalizes the loss between the consensus clustering result and the input clusterings from the traditional Euclidean distance to the general Bregman loss, and extends it to weighted and semi-supervised settings [4]. Ying Xu et al. represented multidimensional gene expression profiles as minimum spanning trees (MSTs) and proposed an MST-based clustering algorithm, evaluated on a yeast dataset, a human serum dataset and an Arabidopsis dataset [33]. Yu et al. proposed a random double-clustering ensemble framework (RDCCE) for tumor clustering based on gene expression data; RDCCE uses randomly selected clustering algorithms in the ensemble to generate a representative set of features, and then assigns the samples to clusters according to the grouping results [24]. Eric F. Lock et al.
proposed an integrative statistical model that clusters objects separately for each data source. Using an extensible Bayesian framework, both the consensus clustering and the source-specific clusterings are estimated, and the model was ultimately applied to the subtyping of breast cancer tumor samples [34].
To improve the robustness and stability of clustering methods, researchers have turned to ensemble clustering, and many ensemble clustering methods now exist. Pairwise co-occurrence-based methods typically construct a co-association (CA) matrix by counting how often two objects fall in the same cluster across the base clusterings; using the CA matrix as a similarity matrix, a traditional clustering method can then construct the final clustering result [29], [20], [27], [35]. Fred et al. first proposed the concept of the CA matrix and the Evidence Accumulation Clustering (EAC) method, whose idea is to combine the results of multiple clusterings into a single data partition, treating each clustering result as independent evidence of data organization and extracting a consistent partition from the merged evidence [29]. Wang et al. extended EAC by taking the cluster sizes of the original clusterings into account when building the association matrix, yielding a probability accumulation method [20]. Huang et al. proposed an ensemble clustering method based on ensemble-driven cluster uncertainty estimation and a local weighting strategy: the cluster labels of the whole ensemble are considered through an entropy criterion, the uncertainty of each cluster is estimated for weighting, and, exploiting the local diversity of the ensemble, two new consensus functions are proposed [27]. Lourenço et al. proposed a consensus clustering method based on the EAC paradigm that is not limited to crisp partitions; it exploits the nature of the co-association matrix to determine probabilistic assignments of data points to clusters by minimizing the Bregman divergence between the observed co-association frequencies and the corresponding co-occurrence probabilities expressed through unknown assignment functions [35].
The graph-partitioning-based approach solves the ensemble clustering problem by constructing a graph model that reflects the ensemble information. Strehl et al. formalized the clustering ensemble problem as a combinatorial optimization problem based on shared mutual information and proposed three graph-partitioning-based ensemble clustering algorithms: CSPA, HGPA and MCLA [36]. The median-partition-based approach formulates ensemble clustering as an optimization problem whose goal is to find a clustering that maximizes the similarity between itself and the multiple base clusterings. Huang et al. introduced the concept of hyper-objects, compact and adaptive representations of the ensemble data that greatly ease computation; the ensemble clustering problem is transformed into a binary linear programming problem via a probabilistic formulation, the constrained objective function is represented as a factor graph, and max-product belief propagation generates solutions that are insensitive to initialization and converge to a neighborhood maximum [37].
However, the above studies still have significant limitations in practical applications, and cluster analysis of cancer gene expression profiles still cannot achieve the desired results. To this end, we propose a new ensemble clustering framework for determining the optimal clustering of cancer datasets, fully combining ideas from pairwise co-occurrence methods and median-partition methods.

III. PROBLEM FORMULATION
A. ENSEMBLE CLUSTERING
Suppose there is a dataset X = {x_1, x_2, ..., x_N} consisting of N data objects, where x_i denotes the ith data object. The dataset X is clustered M times to obtain M partitions, each containing a certain number of clusters. Formally, the set Π of M base clusterings is written as

Π = {π^1, π^2, ..., π^M},  π^m = {c^m_1, c^m_2, ..., c^m_{n^m}},

where π^m denotes the mth base clustering in Π, c^m_i denotes the ith cluster in π^m, and n^m is the number of clusters in π^m. Each cluster is a collection of samples, and different clusters in the same base clustering do not intersect; the following conditions must hold:

c^m_i ∩ c^m_j = ∅ for i ≠ j,  and  c^m_1 ∪ c^m_2 ∪ ... ∪ c^m_{n^m} = X.

If x_i ∈ c^m_j, the data object x_i belongs to the jth cluster of the mth base clustering. For convenience, all clusters in the ensemble Π are denoted

C = {c_1, c_2, ..., c_{n_c}},

where c_i denotes the ith cluster and n_c is the total number of clusters in Π, i.e., n_c = n^1 + n^2 + ... + n^M. Each of the M partitions Π = {π^1, π^2, ..., π^M} yields a connection matrix, giving CM = {CM^1, CM^2, ..., CM^M}, defined as follows.

Definition 1. (Connection matrix) The connection matrix CM^m of partition π^m is an N×N symmetric matrix that records whether two data objects in the partition are grouped into the same cluster, so CM^m can represent the partition π^m. Its (u, v)th entry is

CM^m(u, v) = 1 if x_u and x_v belong to the same cluster in π^m, and 0 otherwise.
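As a concrete illustration of Definition 1, here is a minimal Python sketch; the function name and the label-array encoding of a partition are ours, not the paper's:

```python
import numpy as np

def connection_matrix(labels):
    """Build the N x N connection matrix CM^m of one base clustering.

    CM^m[u, v] = 1 if objects u and v share a cluster label, else 0.
    `labels` is a length-N array giving each object's cluster in pi^m.
    """
    labels = np.asarray(labels)
    # Pairwise label comparison via broadcasting; the result is symmetric
    # with an all-ones diagonal.
    return (labels[:, None] == labels[None, :]).astype(int)

# Example: 5 objects split into clusters {x_1, x_2, x_3} and {x_4, x_5}.
cm = connection_matrix([0, 0, 0, 1, 1])
```

Averaging such matrices over all base clusterings would give exactly the co-association matrix of the EAC method mentioned in Section II.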

B. INFORMATION ENTROPY
In information theory, entropy measures the uncertainty associated with a random variable, and joint entropy measures the uncertainty associated with a set of random variables.

Definition 2. (Joint entropy) For a pair of discrete random variables (X, Y), the joint entropy H(X, Y) is defined as

H(X, Y) = -∑_x ∑_y p(x, y) log_2 p(x, y),

where p(x, y) is the joint probability of (x, y). H(X, Y) = H(X) + H(Y) if and only if the two random variables X and Y are independent. Therefore, given n mutually independent random variables X_1, X_2, ..., X_n,

H(X_1, X_2, ..., X_n) = ∑_{i=1}^{n} H(X_i).
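The additivity of entropy under independence can be checked numerically; a small sketch (helper names are ours):

```python
import math
from collections import Counter

def entropy(xs):
    """H(X) = -sum_x p(x) * log2 p(x), with p estimated from counts."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def joint_entropy(xs, ys):
    """H(X, Y) = -sum_{x,y} p(x, y) * log2 p(x, y) over observed pairs."""
    n = len(xs)
    counts = Counter(zip(xs, ys))
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Two independent fair bits: H(X) = H(Y) = 1, so H(X, Y) = 2.
h_xy = joint_entropy([0, 0, 1, 1], [0, 1, 0, 1])
```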

C. EUCLIDEAN METRIC
The Euclidean metric is a commonly used definition of distance, referring to the natural length between two vectors in M-dimensional space.

Definition 3. (Euclidean metric) For two points X = (x_1, ..., x_M) and Y = (y_1, ..., y_M) in M-dimensional space, the distance ED based on the Euclidean metric is defined as

ED(X, Y) = sqrt( ∑_{i=1}^{M} (x_i - y_i)^2 ).

When X and Y are N×N matrices, the definition extends entrywise:

ED(X, Y) = sqrt( ∑_{u=1}^{N} ∑_{v=1}^{N} (X_{uv} - Y_{uv})^2 ).
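A one-function sketch of the metric, covering both the vector case and the entrywise N×N matrix case (the latter is the Frobenius norm of the difference); the function name is ours:

```python
import numpy as np

def euclidean_metric(X, Y):
    """ED(X, Y): square root of the summed squared entrywise differences.

    Works unchanged for M-dimensional vectors and for N x N matrices,
    matching the two cases of Definition 3.
    """
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    return float(np.sqrt(((X - Y) ** 2).sum()))

# Vector case: the classic 3-4-5 triangle.
d_vec = euclidean_metric([0.0, 0.0], [3.0, 4.0])
# Matrix case: distance between the 2x2 identity and the zero matrix.
d_mat = euclidean_metric(np.eye(2), np.zeros((2, 2)))
```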

IV. DOUBLE WEIGHTED ENSEMBLE CLUSTERING
This paper proposes a double weighted ensemble clustering method based on local weighting [27]. In this section, we describe each step of the method in detail.

A. LOCAL WEIGHTING METHOD
Due to the unsupervised nature of clustering algorithms, it is difficult to know in advance which similarity measure is correct and reasonable. Different clustering algorithms have their own scopes of application, and differing similarity measures lead to differing clustering results. Measuring the similarity between clusters is therefore key to obtaining reasonable clustering results. To evaluate the reliability of each cluster, an entropy-based uncertainty estimation considers the cluster labels of the whole ensemble, and the concept of the ECI is then proposed to evaluate cluster uncertainty and reliability.
As introduced in Section III-B, entropy measures the uncertainty associated with random variables. Each cluster is a set of data objects. Given two clusters c_i, c_j ∈ C that do not belong to the same base clustering, the more data objects c_i and c_j share, the smaller the uncertainty of c_i with respect to c_j. By analyzing how c_i overlaps the clusters of the other base clusterings, the entropy of c_i with respect to the ensemble Π can be computed.

Definition 4. Given an ensemble Π, the entropy of cluster c_i with respect to Π is defined as

H^Π(c_i) = -∑_{m=1}^{M} ∑_{j=1}^{n^m} p(c_i, c^m_j) log_2 p(c_i, c^m_j),  with  p(c_i, c^m_j) = |c_i ∩ c^m_j| / |c_i|,

where M is the number of base clusterings, n^m is the number of clusters in partition π^m, c^m_j is the jth cluster of the mth partition, ∩ denotes the common elements of two clusters, and |c_i| is the number of elements in cluster c_i. Since p(c_i, c^m_j) ∈ [0, 1] for all i, j and m, we have H^Π(c_i) ∈ [0, +∞). The entropy of c_i with respect to Π reflects how the objects of c_i are grouped by the other base clusterings in Π. If the objects of c_i fall into a single cluster in every base clustering, all base clusterings agree on assigning the objects of c_i to the same cluster, and the entropy of c_i with respect to Π attains its minimum value, namely 0. The larger the entropy of c_i with respect to Π, the less likely the objects of c_i are to lie in the same cluster.
After obtaining the entropy of each cluster in the ensemble, the ECI quantifies a cluster's uncertainty relative to the ensemble and supplies a weight for the data objects within it.

Definition 5. Given an ensemble Π of M base clusterings, the ECI of cluster c_i is defined as

ECI(c_i) = e^{ -H^Π(c_i) / (θ · M) },

where θ > 0 is a parameter adjusting the influence of the entropy. According to Definition 5, since H^Π(c_i) ∈ [0, +∞), we have ECI(c_i) ∈ (0, 1] for any c_i ∈ C. Clearly, the smaller the entropy of c_i with respect to Π, the larger its ECI. When the entropy attains its minimum, i.e., H^Π(c_i) = 0, the ECI attains its maximum, ECI(c_i) = 1; as the entropy of the cluster tends to infinity, its ECI tends to 0.
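Definitions 4 and 5 can be sketched directly in Python. This is a minimal sketch under stated assumptions: partitions are encoded as label arrays, and the exponential ECI form with scaling parameter `theta` follows the locally weighted scheme of [27] (the default `theta=0.5` is our assumption, not fixed by the text):

```python
import numpy as np

def cluster_entropy(cluster, partitions):
    """H^Pi(c_i) of Definition 4 for `cluster`, a set of object indices.

    `partitions` is a list of label arrays, one per base clustering;
    p(c_i, c^m_j) = |c_i intersect c^m_j| / |c_i|.
    """
    c = set(cluster)
    h = 0.0
    for labels in partitions:
        labels = np.asarray(labels)
        for j in np.unique(labels):
            p = len(c & set(np.where(labels == j)[0])) / len(c)
            if 0.0 < p < 1.0:  # p = 0 and p = 1 contribute nothing
                h -= p * np.log2(p)
    return h

def eci(cluster, partitions, theta=0.5):
    """Ensemble-driven cluster index; exponential form assumed from [27]."""
    h = cluster_entropy(cluster, partitions)
    return float(np.exp(-h / (theta * len(partitions))))

# A cluster kept intact by every base clustering has entropy 0 and ECI 1.
parts = [[0, 0, 1, 1], [0, 0, 1, 1]]
e_intact = eci([0, 1], parts)
```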
According to Definition 1, the connection matrix records whether two data objects in a partition are grouped into the same cluster. Combining this with the concept of the ECI, we propose the E-similarity matrix (ESM), which reflects the likelihood that two data objects are grouped into the same cluster in a partition.

Definition 6. (E-similarity matrix) The ESM is essentially a symmetric matrix. Given a partition π^m = {c^m_1, c^m_2, ..., c^m_{n^m}}, its ESM is computed as

ESM^m(u, v) = w^m_i if x_u, x_v ∈ c^m_i, and 0 otherwise,

where c^m_i denotes the ith cluster of the mth partition, w^m_i is the ECI value of cluster c^m_i, and ESM^m(u, v) represents the likelihood that data objects x_u and x_v are clustered together in the mth base clustering. We provide an example in Figure 1 and Table 1 showing the computation of local weights for three base clusterings. The dataset X = {x_1, x_2, ..., x_14} contains 14 data objects, which three clusterings divide into partitions π^1, π^2, π^3; the ensemble Π contains 8 clusters in total. The entropy and ECI of the 8 clusters, computed according to Definitions 4 and 5, are shown in Table 1. As Table 1 shows, the entropy values of clusters c^1_1 and c^2_3 are the smallest, meaning their certainty is the largest. H^Π(c^3_2) is the largest, indicating that cluster c^3_2 has the least certainty, i.e., the ensemble gives the least support for this cluster appearing in the final clustering result. From the ECI values in Table 1, Definition 6 yields the ESM of partition π^1, i.e., the likelihood that the 14 data objects are clustered into the same cluster in the first base clustering.
When any two data objects of the dataset X = {x_1, x_2, ..., x_14} are grouped into the same cluster by a base clustering, the corresponding entries of the ESM are assigned the ECI of that cluster. Figure 2 shows the ESM corresponding to partition π^1.
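The assignment rule of Definition 6 can be sketched as follows; the function name and the dict encoding of per-cluster ECI weights are ours:

```python
import numpy as np

def e_similarity_matrix(labels, weights):
    """ESM of one partition (Definition 6).

    Entry (u, v) is the ECI weight w^m_i of the cluster containing both
    x_u and x_v, and 0 when they lie in different clusters. `weights`
    maps each cluster label of this partition to its ECI value.
    """
    labels = np.asarray(labels)
    n = len(labels)
    esm = np.zeros((n, n))
    for lab, w in weights.items():
        idx = np.where(labels == lab)[0]
        esm[np.ix_(idx, idx)] = w  # fill the cluster's block with its weight
    return esm

# Three objects, two clusters with hypothetical ECI weights 0.8 and 0.5.
esm = e_similarity_matrix([0, 0, 1], {0: 0.8, 1: 0.5})
```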

B. EUCLIDEAN METRIC BASED ESM WEIGHTED ENSEMBLE FUNCTION
The ensemble process refers to finding the optimal consensus among the partitions generated by the base clusterings. In this study, we compute the ESM^i of each partition π^i generated by a base clustering; correspondingly, there also exists a similarity matrix ESM for the final optimal consensus partition π*. In this sense, finding an optimal ESM is the key to good clustering results. In general, different base clusterings also contribute differently to the final consensus. We therefore treat this objective as an optimization problem. According to Definition 3, it reads

min_{ESM, w} ∑_{i=1}^{M} w_i ED(ESM, ESM^i)^2  s.t.  ∑_{i=1}^{M} w_i = 1, w_i ≥ 0,  (15)

where ESM is a non-negative symmetric matrix, ED(ESM, ESM^i) denotes the Euclidean metric between ESM and the ESM^i of the ith base clustering, and w_i represents the contribution of each base clustering to the consensus process. We minimize this problem with a block coordinate descent algorithm: when one variable is fixed, optimization over the other can be viewed as a convex problem with a unique solution. To prevent the solved w_i from taking only the values 1 and 0, we add a regularization term to formula (15).

Definition 7. The optimal ESM is computed as

min_{ESM, w} J(ESM, w) = ∑_{i=1}^{M} w_i ED(ESM, ESM^i)^2 + λ ||w||^2  s.t.  ∑_{i=1}^{M} w_i = 1, w_i ≥ 0,  (16)

where λ is the regularization coefficient. As λ approaches 0, w_i takes only the values 1 and 0; as λ grows large, every w_i approaches 1/M, i.e., the plain average of all partitions. Fixing w and setting ∂J(ESM, w)/∂ESM = 0, where 0 is an N×N matrix of zeros, gives

ESM = ∑_{i=1}^{M} w_i ESM^i / ∑_{i=1}^{M} w_i = ∑_{i=1}^{M} w_i ESM^i.

Similarly, fixing ESM and setting ∂J(ESM, w)/∂w = 0 transforms the problem into a quadratic program over the simplex. We can then show that formula (16) converges: J(ESM, w) ≥ 0 for any ESM and w, and by fixing w = w_t, the minimization of J(ESM, w) is convex, ESM_{t+1} is its optimal solution, and J(ESM_t, w_t) ≥ J(ESM_{t+1}, w_t).
Similarly, by fixing ESM = ESM_{t+1}, we have J(ESM_{t+1}, w_t) ≥ J(ESM_{t+1}, w_{t+1}). Therefore we obtain a monotonically decreasing sequence J(ESM_0, w_0) ≥ J(ESM_1, w_0) ≥ J(ESM_1, w_1) ≥ ... ≥ 0, showing that formula (16) converges. After the optimal ESM is found, we cluster it with the K-means algorithm to obtain the final data object labels; the input to K-means is, for each data object, the vector of its similarities to all data objects, i.e., the corresponding column of the ESM. The double weighted ensemble clustering algorithm is described as follows:

Input:
• Π = {π^1, π^2, ..., π^M} // M base clustering results
• C = {c_1, c_2, ..., c_{n_c}} // the n_c clusters of Π
• k // number of clusters
• λ // regularization coefficient
• ε // precision

Output: labels // the final ensemble clustering result

Step 1: Compute the entropy of the clusters in C as in Definition 4.
Step2: Compute the ECI measures of the clusters in C as Definition 5.
Step3: Compute the ESM set of the partitions in Π as Definition 6.
Step4: Optimize the ESM over the ESM set as in Definition 7.
Step5: Obtain the final result by clustering the optimal ESM with the k-means algorithm.
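The alternating updates of Step 4 can be sketched in Python. This is a minimal sketch under stated assumptions: the regularized objective follows our reading of Definition 7, the w-update is solved by Euclidean projection onto the probability simplex (the closed form of the quadratic subproblem), and all names are ours:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {w : sum(w) = 1, w >= 0}
    (the standard sorting-based method)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def dwec_consensus(esms, lam=0.5, iters=100, tol=1e-9):
    """Block coordinate descent for
        min_{ESM, w} sum_i w_i * ||ESM - ESM_i||_F^2 + lam * ||w||^2
        s.t. sum_i w_i = 1, w_i >= 0.

    With w fixed, the stationarity condition gives the weighted average
    of the ESM_i; with ESM fixed, w solves a quadratic program over the
    simplex, i.e., the projection of -d / (2 * lam).
    """
    m = len(esms)
    w = np.full(m, 1.0 / m)   # start from the plain average
    prev = np.inf
    for _ in range(iters):
        esm = sum(wi * e for wi, e in zip(w, esms))            # ESM update
        d = np.array([((esm - e) ** 2).sum() for e in esms])   # squared ED
        w = project_simplex(-d / (2.0 * lam))                  # w update
        obj = float(w @ d + lam * (w ** 2).sum())
        if prev - obj < tol:   # monotone decrease; stop at the precision
            break
        prev = obj
    return esm, w

# Two identical base ESMs: the consensus is that matrix, weights stay uniform.
esm, w = dwec_consensus([np.eye(2), np.eye(2)], lam=0.5)
```

The final labels would then come from running k-means on the columns of the returned `esm`, as in Step 5.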

V. EXPERIMENTS
In this section, we conduct experiments on 10 real datasets from the UCI repository and five cancer datasets from TCGA to illustrate the advantages of the double weighted ensemble clustering method (DWEC) over state-of-the-art ensemble clustering methods. We compare DWEC with both recently proposed and classic clustering algorithms. The Locally Weighted Evidence Accumulation algorithm (LWEA) and Locally Weighted Graph Partitioning algorithm (LWGP) [27], Double Granularity Weighted Ensemble Clustering (DG-WEC) [38], and Kullback-Leibler distance Weighted Bregman Consensus Clustering (KLWBCC) and exponential-distance Weighted Bregman Consensus Clustering (eWBCC) [4] were selected as the recently proposed ensemble clustering baselines; Evidence Accumulation Clustering (EAC) [29], the Cluster-based Similarity Partitioning Algorithm (CSPA) and the Hyper-Graph Partitioning Algorithm (HGPA) [36] were selected as the classic ensemble clustering baselines; in addition, k-means [7] and Spectral Clustering (SC) [6] were used as basic comparison algorithms. All parameters are set as established in the corresponding literature, and the reported performance of the comparison algorithms is for reference only. For all ensemble clustering algorithms, we use 50 partitions generated by the K-means algorithm as base clusterings, and all evaluation metrics are averaged over 20 runs.

A. EXPERIMENTAL COMPARISON OF UCI DATASETS
In our experiments, 10 datasets from the UCI database were used, namely, Balance, Breast, Glass, Heart, Ionosphere, Iris, Sonar, Vehicle, Wine and Zoo. All datasets are available through the UCI official website http://archive.ics.uci.edu/ml/index.php. The details of the dataset are shown in Table 2.
Mutual information (MI) is a symmetric measure that quantifies the statistical information shared between two distributions; it can be seen as the amount of information one random variable contains about another, or the reduction in the uncertainty of one random variable from knowing another. Normalized mutual information (NMI), as the name suggests, scales MI into [0, 1] and is widely used to evaluate clustering quality. For two random variables (X, Y), MI and NMI are defined as

MI(X, Y) = ∑_x ∑_y p(x, y) log_2 ( p(x, y) / (p(x) p(y)) ),

NMI(X, Y) = MI(X, Y) / sqrt( H(X) H(Y) ),

where p(x, y) is the joint distribution of (X, Y), and p(x), p(y) are the marginal distributions. In essence, MI(X, Y) is the relative entropy between the joint distribution p(x, y) and the product of marginals p(x)p(y),
where H(X) is the entropy as defined in Section III-B.
Following [27], the normalized mutual information used in this paper is

NMI(π, π^G) = ∑_i ∑_j n_ij log( n · n_ij / (n_i n^G_j) ) / sqrt( (∑_i n_i log(n_i / n)) (∑_j n^G_j log(n^G_j / n)) ),

where π is the experimental clustering result, π^G is the ground-truth clustering, n is the number of data objects, n_i is the number of data objects in the ith cluster of π, n^G_j is the number of data objects in the jth cluster of π^G, and n_ij is the number of data objects shared by the ith cluster of π and the jth cluster of π^G. Table 3 reports the NMI scores of the different ensemble clustering methods with 50 runs of the K-means algorithm as base clusterings; the bold value in each row is the maximum. As the table shows, the DWEC method obtains the best NMI score on five of the datasets: Balance, Glass, Heart, Vehicle and Zoo. In addition, Fig. 2 and Fig. 3 respectively show the number of times each method places first and in the top three over the NMI scores of the 10 datasets. Our proposed DWEC method achieves 5 firsts and 7 top-three finishes, the best among the compared methods, showing that DWEC has higher accuracy and robustness.
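The count-based NMI above can be sketched compactly; this assumes the sqrt-of-entropies normalization given above (the function name is ours):

```python
import numpy as np

def nmi(pred, truth):
    """NMI between a clustering and the ground truth, in the
    contingency-count form: shared counts n_ij against cluster sizes
    n_i and n^G_j, normalized by sqrt(H(pred) * H(truth))."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    n = len(pred)
    mi, hp, ht = 0.0, 0.0, 0.0
    for i in np.unique(pred):
        ni = (pred == i).sum()
        hp -= ni / n * np.log2(ni / n)
        for j in np.unique(truth):
            nij = ((pred == i) & (truth == j)).sum()
            if nij:  # n_ij = 0 terms vanish
                mi += nij / n * np.log2(n * nij / (ni * (truth == j).sum()))
    for j in np.unique(truth):
        nj = (truth == j).sum()
        ht -= nj / n * np.log2(nj / n)
    return mi / np.sqrt(hp * ht)

# NMI is invariant to label permutation: a perfect clustering scores 1.
score = nmi([0, 0, 1, 1], [1, 1, 0, 0])
```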

B. EXPERIMENTAL COMPARISON OF THE TCGA CANCER DATASET
To make our research more meaningful, we now take real cancer gene datasets as the research object. From The Cancer Genome Atlas (TCGA) we selected five cancers with high global incidence: Glioblastoma multiforme (GBM), Breast invasive carcinoma (BIC), Kidney renal clear cell carcinoma (KRCCC), Lung squamous cell carcinoma (LSCC) and Colon adenocarcinoma (COAD). TCGA is a project overseen by the National Cancer Institute and the National Human Genome Research Institute that applies high-throughput genome analysis techniques to build a more complete understanding of cancer, thereby improving the ability to prevent, diagnose and treat it. TCGA data requires complex preprocessing; fortunately, the dataset provided by [32] meets our research conditions. For each cancer, three types of gene expression data are provided (mRNA, miRNA and methylation): mRNA, short for messenger RNA, is RNA expression measured by RNA sequencing and transcribed from DNA; microRNAs (miRNAs) are small endogenous non-coding RNA molecules whose expression is detected by microRNA sequencing; methylation refers to the degree of DNA methylation, measured by methylation chips. The data preprocessing pipeline is described in detail in [32]. Different conclusions can be drawn depending on the type of gene expression used; for example, miRNA and methylation data can profile different cancer subtypes. Therefore, this study combines the three types of gene expression for each of the five cancers, yielding five datasets. The details of the datasets are shown in Table 4.
The difference between the TCGA datasets and the UCI datasets is that the former are unlabeled, so the normalized mutual information of Section V-A cannot be used to evaluate cluster quality. Here we use −log2 p from the log-rank test as the evaluation criterion for the clustering results of the different methods, assessing the significance of the differences in survival among subtypes [39]. In this study, the log-rank test compares the survival curves of multiple groups to determine whether the differences in their survival distributions are statistically significant. Suppose there are i survival distributions, and let H0 be the null hypothesis that all of them are the same. We retain H0 when p ≥ 0.05, under which the difference among the i survival distributions is not statistically significant; H0 is rejected when p < 0.05, that is, when −log2 p > 4.32. The larger the value of −log2 p, the better the clustering effect.
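The rejection threshold follows directly from the significance level; a two-line sketch of the conversion (names are ours):

```python
import math

def neg_log2_p(p):
    """Score used to compare methods: larger values mean a more
    significant separation of the subtype survival curves in the
    log-rank test."""
    return -math.log2(p)

# H0 is rejected at the 0.05 level exactly when -log2(p) exceeds
# -log2(0.05), which is approximately 4.32.
threshold = neg_log2_p(0.05)
```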
In the study [32], based on a review of numerous publications, glioblastoma multiforme was divided into 3 subtypes, breast invasive carcinoma into 5 subtypes, kidney renal clear cell carcinoma into 3 subtypes, lung squamous cell carcinoma into 4 subtypes, and colon adenocarcinoma into 3 subtypes. This study uses these numbers as the cluster counts for the corresponding datasets. Table 5 reports the −log2 p values of the log-rank test for DWEC and the ten comparison algorithms on the TCGA cancer datasets; the bold value in each row is the maximum. As Table 5 shows, for GBM all methods except LWGP retain the H0 hypothesis, indicating no significant difference in the survival distributions among the groups of the clustering results. For the BIC, KRCCC, LSCC and COAD datasets, our proposed DWEC method is superior to the other methods, indicating that its clustering results separate the survival distributions of the groups more strongly and are thus more reliable. On the whole, the DWEC method has the best clustering effect and the SC algorithm the worst. Figures 5-9 show the survival curves of each group obtained by the DWEC method and the SC algorithm on the GBM, BIC, KRCCC, LSCC and COAD datasets, respectively. Survival curves describe the survival status of several groups of patients: the horizontal axis is the observation time, the vertical axis is generally the survival rate, and each point on a curve gives the patients' survival rate at that time point. In general, the greater the distance between the curves, the greater the difference in the prognoses of the groups, and the easier it is to establish statistical differences. As the figures show, the DWEC method has significant advantages over the SC clustering algorithm.

VI. CONCLUSIONS
This paper proposes a double weighted ensemble clustering algorithm. First, the similarity matrix of each base clustering is obtained by the local weighting method. The problem is then transformed into a convex optimization problem, and the optimal similarity matrix is obtained by a block coordinate descent algorithm. Finally, the K-means algorithm is applied to this similarity matrix to obtain the final partition. We conduct extensive experiments demonstrating the superiority of our algorithm in clustering quality. First, on the labeled UCI datasets, comparing the NMI values of the various algorithms shows that our method performs best among existing ensemble clustering methods. On the clinical side, the algorithms are applied to the clustering of cancer genes to analyze cancer subtypes; experiments on five common cancers from TCGA, comparing the log-rank test results of the clustering algorithms, verify the effectiveness of our method.
Next, we plan to apply our consensus clustering framework to large-scale data from the World Health Organization, for example to analyze potential links between countries' public health expenditure and life expectancy, or potential factors affecting national suicide rates.