GRACE: A Graph-Based Cluster Ensemble Approach for Single-Cell RNA-Seq Data Clustering

Rapid development of single cell RNA sequencing (scRNA-seq) technology has accelerated the exploration in biomedical researches. One of the focal interests in scRNA-seq data analysis is to classify cells into different types, which significantly assists in studying inter-cellular heterogeneity, such as cell types, cell states, and cell lineages, at the resolution of single cells. Although a number of tailored approaches have been developed for scRNA-seq data, their performance varies with different datasets and their clustering accuracy need to be improved. In this paper, we propose a novel ensemble clustering framework for scRNA-seq data called GRACE (GRAph-based Cluster Ensemble approach). First, we construct a highly reliable graph network for single cells by combining the clustering outcomes from five leading scRNA-seq data clustering methods. Then, we remeasure the relationships between cells by exploring the topology structure of network using random walk distance. Finally, we build a hierarchical cell-tree and obtain the clustering labels by cutting the tree structure into an appropriate number of sub-trees. Experimental results on twelve benchmark datasets show that GRACE has the higher clustering accuracy and is more robust among a variety of datasets than the state-of-the-art individual approaches. In addition, the graph structure of the network which is built upon the ensemble clusters is more reliable than the networks which are constructed according to the conventional similarity metrics.


I. INTRODUCTION
As the basic structural and functional unit of organisms, single cells store the important genetic information [1]. In the process of cell proliferation and differentiation, a number of factors, such as cell state [2], micro-environment of cells [3], and the regulation of internal procedures of cells, lead to the heterogeneity of cells [4]. Previously, the technology of large population sequencing often analyzes tens of thousands cells altogether, where the expression value of gene is the average score of all the cells. It thus usually highlights the cell types with large populations and belies the rare cell types such as stem cells and cancer cells [5], [6]. Fortunately, the single cell RNA sequencing (scRNA-seq) technology can overcome this issue and promote the study of cellular heterogeneity [6], [7]. Clustering analysis, which can group cells according to gene expression patterns, is essential in order to mining The associate editor coordinating the review of this manuscript and approving it for publication was Yongqiang Cheng . the underlying information of scRNA-seq data. Related studies on clustering analysis have been applied to many focal research interests, such as discovering cell types [8], [9], reconstructing cell development tracks, fate decisions [10], [11], and establishing spatial models of complex tissues [12].
Clustering analysis has always been a focal research interest in data mining and machine learning [13]. Up to now, the traditional classic clustering algorithms, such as k-means [14], DBSCAN (Density-Based Spatial Clustering of Application with Noise) [15], CLIQUE (Clustering In QUEst) algorithm [16], spectral clustering [17] and hierarchical clustering [18] are still wildly used. At the same time, clustering algorithms are advancing to pursue higher clustering accuracy and efficiency. Following the idea that clusters are the high density regions in the feature space separated by low density regions, a density-based method has been proposed, which bases on fast searching and finding density peaks [19]. To handle high-dimensional realistic data, some advanced subspace clustering algorithms are proposed from novel perspectives, such as eliminating the effect of the errors from the linear projection space [20], combing with deep neural networks architectures [21], [22]. In addition, many ensemble clustering approaches have been developed to achieve better clustering results and greater robustness. For example, Huang et al. have developed a serials of ensemble clustering algorithms by factor graph [23], probability trajectories [24], locally weighting base clusterings [25], exploring cluster-wise similarities via random walks [26] and integrating ultra-scalable spectral clustering [27]. In the area of scRNA-seq clustering analysis, many tailored methods have been developed to overcome the challenges posed by the inherent nature of scRNA-seq data, such as zero inflation (dropouts) [28], over-dispersion [29] and amplification bias [30], and we will briefly review some of the major approaches.
Kiselev et al. proposed a consensus clustering method named single-cell consensus clustering (SC3), which adopts three measurements to calculate the similarity between cells and two ways for feature reduction. By applying k-means clustering algorithm on each branching data, they construct a consistent matrix of cells and then use hierarchical clustering to obtain the final clustering results [31]. Hierarchical clustering is also applied by SINCERA [32], Clustering through Imputation and Dimensionality Reduction (CIDR) [33], and cellTree [34]. The main difference of these approach is the method to measure the similarity between samples, where SINCERA uses Pearson correlation coefficient, cellTree employs Chi-Square distance, and CIDR applies the square of the Euclidean distance. In addition, CIDR improves the clustering efficiency by an implicit imputation approach to alleviate the impact of dropouts in scRNA-seq data. Sun et al. proposed a probability model based method DIMM-SC, which assumes that the data is generated by k polynomial distribution whose parameters follow the Dirichlet prior distribution, and solves the parameters with maximum likelihood estimation [35]. Some researches apply bi-clustering methods to scRNA-seq data. BackSPIN splits the similarity matrix of samples and assigns genes to each sub-matrix iteratively in order to cluster cells and genes simultaneously [36]. There are some clustering algorithms specially developed to detect rare cell types, such as giniClust and RaceID, which can group the datasets with uneven population distributions [37], [38]. Recently, some deep learning methods have been proposed with the emergence of big data. Lopez et al. proposed a method named scVI for processing scRNA-seq data which applies the deep generation model like Variational Auto-Encoder to implement the low-dimensional representation of data and then clusters low-dimensional data using k-means [39].
In addition to the methods mentioned above, another import type of clustering methods is graph theory-based approaches. The purpose is to segment the graph network, trying to make the edge weights (similarities) within the sub-graphs as high as possible while the edge weights connecting different sub-graphs as low as possible. For example, clusters can be formed by splitting the smallest spanning tree found on the network or using minimum-cut algorithm to finding pre-defined sub-graphs [40], [41]. Chen et al. proposed an approach called SNNCliq, which identifies clusters by a quasi-clique-based clustering algorithm on a graph constructed based on shared nearest neighbor (SNN) [42]. Seurat is a graph modularity optimization-based clustering method. It constructs a graph network of cells with SNN similarity and then optimizes the modularity function to determine clusters [12]. Wang et al. proposed the Single-cell Interpretation via Multi-kernel LeaRning (SIMLR), which constructs the graph of cells based on the similarity learned from multiple kernels and uses spectral clustering algorithm on the graph for clustering [43].
Previous methods mostly use conventional similarity metrics (such as Euclidean distance, Pearson correlation coefficient) or the second order similarity (such as SNN that considers 1-hop neighbors) to measure the similarity of single cells. The graph division approaches which rely on these similarities to find dense sub-graphs loss the graph topology properties since they don't consider the higher-order neighboring information, such as k-hop (k 2) neighbors. Besides, the methods are often optimized for the specific dataset. Therefore, the outputs of those methods are unstable among different scRNA-seq datasets and it is hard for users to select an appropriate methods to apply. At the same time, the clustering accuracy also needs to be improved.
To address above problems, we propose a novel cluster ensemble approach called GRACE, which is a graph theory based clustering method. First, we construct a graph network using the predicting results of multiple basic clustering methods. Then, we build a tree of cells based on random walk distance on graph which can be considered as a higher order similarity that takes the high-order topological information into consideration. Finally, we obtain the clusters by cutting the tree structure according to the average distance of intraclusters. Experimental results on twelve benchmark datasets show that GRACE outperforms the state-of-the-art methods.
The rest of this paper is organized as follows: Section II and III introduce the twelve real scRNA-seq datasets and the clustering performance metrics. Section IV presents our method in detail. Section V talks about the experimental results. Section VI concludes this paper.

II. BENCHMARK DATASETS
Twelve scRNA-seq datasets are collected from publicly available platforms, such as ArrayExpress [44], Gene Expression Omnibus (GEO) [45], and Sequence Read Archive [46]. The brief information about these data are listed in Table 1, in which the header of ''#Cells'', ''#Genes'' and ''#Cluster'' indicate the number of cells (instances), genes (features), and clusters, respectively. Datasets are named by the accession numbers provided in the original publications. In addition, these datasets are collected from some representative sequencing platforms, including SMART-seq2 [47], [48], sci-RNA-seq [49], 10X Genomics [50], and Droplet-based protocols [28], [51]. All of these data have 'gold-standard' (deemed as true) cluster labels assigned to each single cell, and the different clusters indicate the diverse cell groups.
To investigate the influence of culture condition in cellular self-renewal and pluripotency state, researchers sequence mouse Embryonic stem cells (mESCs) accross three different conditions: serum, 2i, and the alternative ground state a2i. E-MTAB-2600 is a dataset of mESCs, in which the three clusters corresponds to the three different conditions and there are 704 cells in total and 30768 genes are sequenced in each cell. This data is available in ArrayExpress database (https://www.ebi.ac.uk/arrayexpress/experiments/ E-MTAB-2600/). GSE65525 is also a single cell data of mESCs which contains 2717 cells. GSE65525 reveals the population structure and the heterogeneous onset of cell differentiation after Leukemia Inhibitory Factor (LIF) withdrawal in mESCs. We downloaded the read count matrices of mESCs sample 1, mouse ES cells LIF -2 days, mouse ES cells LIF -4 days and mouse ES cells LIF -7 days, and put all cells together. Distinct clusters are sets of cells with different days after LIF withdrawal. This data is downloaded from GEO database (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi? acc=GSE65525).
The SRP073767 dataset is provided by the 10X scRNA-seq platform, which profiles the transcriptome of the peripheral blood mononuclear cells (PBMCs) from a healthy donor. The total number of cells is 4217 classified in 8 types. This data is downloaded from the website of 10X genomics (https://support.10xgenomics.com/singlecell-gene-expression/datasets/2.1.0/pbmc4k).
Since the brain function is relies on a diverse set of differentiated cell types, including neurons, glia, and vasculature. The authors of the GSE60361 data used large-scale scRNAseq to classify cells from mouse somatosensory cortex and hippocampal CA1 region. The 9 clusters indicates distinct cell types in mouse cortex. There are 3005 single cells in total and 19972 genes are sequenced. This data can be downloaded from the GEO database (https://www.ncbi.nlm.nih.gov/geo/ query/acc.cgi? acc=GSE60361).
The mouse bladder cells data is in the Mouse Cell Atlas project GSE108097, which obtains sequenced data by applying Microwell-seq, a high-throughput and low-cost scRNA-seq platform. The authors identify 16 cell types in mouse bladder tissue, which are considered to be different clusters in our experiment. In this data, there are 4186 instances and 13488 features. This data is provided by the authors (https://figshare.com/s/865e694ad06d5857db4b).
GSE98561 is a dataset of the worm neuron cells dataset which is profiled by sci-RNA-seq. The authors profiled about 4000 neural cells from the nematode Caenorhabditis elegans at the L2 larval stage and identified the cell types. After removing the cells with the ''Unclassified'' labels, we thus obtained 10 cell types (that is, the 10 clusters). This data is available in GEO database (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi) with the accession number GSE98561.
The last six data sets in Table 1 are from a super-dataset GSE84133, in which four of them (i.e. GSM2230757, GSM2230758, GSM2230759, GSM2230760) are single cell data of human pancreatic islets and two (i.e. GSM2230761, GSM2230762) are mouse pancreatic islets. Clusters in these datasets indicates different endocrine cell types, including rare ghrelin-expressing epsilon-cells, exocrine cell types, vascular cells, Schwann cells, quiescent and activated pancreatic stellate cells, and four types of immune cells. These datasets could be downloaded from GEO databbase (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc= GSE84133).

III. PERFORMANCE EVALUATION METRICS
In our experimental results, Adjusted Rand Index (ARI) is used to evaluate the clustering performance, which is widely used when the sample labels of ground truth are given [54]. ARI calculates the agreement between the ground truth and the predict clustering labels and the calculation can be defined as where n ij is the value at the i th -row and the j th -column in the contingency table, a i is the sum of the i-th row of the contingency table, b j is the sum of the j-th column of the contingency table, and n 2 denotes a binomial coefficient.
Overview of GRACE. The first step is grouping the scRNA-seq data using CIDR, SC3, t-SNE+k-means, Seurat, and SIMLR. From the five clustering outcomes, we compute a consensus matrix representing the cell-to-cell relationships. Then a graph is constructed where nodes represents the single cells and weights of edges indicating each pair cells' similarity. Based on the random walk distance, we create a hierarchical tree of single cells. Finally, we implement clustering by cutting the tree-structure into sub-trees.
We also evaluate the clustering performance according to other intuitive metrics, Homogeneity score, Completeness score and V-measure, that apply the conditional entropy analysis [55]. Given the ground truth class labels, Homogeneity score calculates the levels of whether each predicted cluster contains unique members of a single class of cells. The Completeness score indicates whether all the members of a given cell group are assigned to the same predicted cluster. V-measure is a balance metric of the Homogeneity and Completeness scores with computing the harmonic mean of them. Formally, the scores are defined as where C and K is the set of ground-truth and the predicted cell labels, respectively. H (C|K ) indicates the conditional entropy of the classes given the cluster assignments and H (C) means the entropy of the classes. Specifically, where n is the total number of cells. n c and n k are the number of cells which belong to the class c and the cluster k. n c,k is the number of cells from class c assigned to cluster k.

IV. METHODS
In this section, we show the whole picture of GRACE. First, we give a general description of GRACE and the five scRNA-seq clustering methods. Next, we describe the main steps in cluster ensemble process in detail, including the computation of the consensus matrix and random walk distance, the construction of the graph network and the hierarchical tree structure, and the estimation of the number of clusters. Finally, we summarize GRACE to the pseudo-code for a formal description.
A. OVERVIEW OF GRACE Figure 1 shows the overview of our GRACE method. Generally, GRACE is composed by five parts. First, we take a scRNA-seq data gene expression matrix as input and use five clustering methods, which are CIDR [32], SC3 [31], t-SNE+ k-means [37], Seurat [56], and SIMLR [43], to obtain five sets of clustering solutions. Second, the five individual solutions are combined into a n × n consensus matrix that representing the relationship between cells, where n represents the number of single cells. Third, basing on consensus matrix, a graph network is created where the nodes represent the cells and the weights of edges are set as the similarity value between pair of cells. Fourth, we measure the relationships between cells based on the random walk on the graph. Each node is assumed to be a cluster and the nodes are gradually grouped into a hierarchical tree under a specific merging criteria. Finally, we cut the hierarchical tree into the appropriate number of sub-trees and calculate the optimal clustering outcomes. The dropout event in scRNA-seq data is a big problem in the computational analysis. Lin et al. developed a clustering algorithm called CIDR which can be regarded as a fast principal coordinate analysis (PCoA)-like algorithm considering dropout events [32]. Since previous researches [30], [57] have shown that the possibility of a gene expression value being loss is inversely correlated with the true expression levels, CIDR reduces the dropout-induced zero inflation by imputing the zero values of dropout candidate genes which are collected from the zero peaks in the distribution of the log-transformed expression profile. Then, CIDR performs the dimension reduction approach, PCoA, on the imputed dissimilarity matrix of cells. Finally, CIDR applies the hierarchical clustering on the first few principal coordinates. Besides, CIDR estimates the number of clusters k based on the Calinski-Harabasz index (CHI), also known as the variance ratio criterion [58]. By calculating the ratio of the sum of between-clusters dispersion and inter-cluster dispersion under different k, CIDR selects the most optimal k with the highest CHI score which indicates that the clusters are dense and well separated.

2) SC3
To achieve high accuracy and robust clustering solutions for scRNA-seq data, Kiselev et al. proposed a consensus clustering approach SC3 by combining multiple clustering results [31]. In the pre-processing step, SC3 filters out less-informative genes in scRNA-seq data expression profile. SC3 adopts three metrics, the Euclidean distance, the Pearson correlation coefficient, and the Spearman correlation coefficient, to calculate the similarity between each pair of cells. After that, SC3 applies two dimensionality reduction approaches, PCA and the Laplacian transform, on the three similarity matrix with two methods. Then, SC3 uses k-means on the data matrices to get different clustering results. Finally, SC3 constructs a consensus matrix of cells which combines the clustering outcomes and applies the hierarchical clustering on it for the final clustering labels. A hybrid SC3 approach is designed for the large datasets which groups 30% of cells using SC3, trains the support vector machine (SVM) with the clustering labels, and finally assigns the labels to the remaining cells.
To estimate the number of clusters, SC3 implements a random matrix theory (RMT) based approach, where the number of clusters is determined by the number of eigenvalues that are significantly different from the Tracy-Widom distribution [59], [60].

3) SIMLR
In scRNA-seq data clustering analysis, a key issue is selecting the appropriate similarity metrics for the cell-to-cell relationships. Wang et al. proposed a framework called SIMLR which learns a distance metric by combining multiple kernels to fit the structure of a specific scRNA-seq data [43]. SIMLR uses the Gaussian kernels with various hyper-parameters for the kernel construction. Assuming that the learned similarity matrix should have an approximate block-diagonal structure, where the cells have larger similarities to other cells within the same blocks, SIMLR applies an alternating convex optimization method to solve the objective optimization function which enforces the low rank constraint on the similarity matrix. Other computational analysis tasks such as visualization, dimension reduction, gene prioritization and clustering are all conducted on the learned similarity matrix. In the clustering task, SIMLR adopts the spectral clustering algorithm on the similarity matrix [17].
The number of clusters in SIMLR is determined by a heuristic approach based on the gap statistic [61].

4) SEURAT
Seurat is developed to identify and interpret the heterogeneity of single cells and integrate the diverse types of single-cell data [56], [62]. The approach identifies sub-populations of cells through unsupervised graph-based clustering. It calculates the k-nearest neighbors for each cell and then construct a shared nearest neighbor (SNN) graph in which the nodes represent the cells and the weights of the edges are the similarities between the cells. After that, it applies the smart local moving (SLM) algorithm to detect community on the SNN graph [63]. The SLM algorithm starts with a network in which each node is assigned to its own singleton community. It improves community structure by community merging and individual node movements to construct the final solution.

5) T-SNE+k-MEANS
It has been shown that dimension reduction before clustering is helpful for the improvement of scRNA-seq data clustering accuracy [31], [32], [57]. The method of ''t-SNE+k-means'' has successfully been applied in the rare cell types identification [37]. It reduces the high dimensional scRNA-seq data into a lower dimensional subspace by t-SNE algorithm and clusters the lower-dimensional data with k-means. In addition, the number of clusters are estimated by ADPclust which calculates the local density of samples and search for the cluster centers from estimated density peaks [64].

C. GRAPH-BASED CLUSTER ENSEMBLE METHOD 1) CONSENSUS MATRIX COMPUTATION
The consensus matrix is computed with the inferred cell labels from the five individual scRNA-seq clustering methods assuming that the more approaches divided two cells into the same cluster, the more similar the two cells are. Formally, we define the consensus matrix C as a n × n matrix, where n indicates the number of cells and the element c ij in C is equal to the number of the scRNA-seq method that classifies cell i and j into the same cluster. In this case, the value of the elements in the consensus matrix ranges from 0 to 5.

2) GRAPH NETWORK CONSTRUCTION
Building a reasonable network for the single cells is a fundamental task since there is no actual network among them. The connectivity between two nodes usually depends whether the nodes are similar enough. Consequently, we construct the network of the single cells according to the consensus matrix C where the value reflects a high-level integrated similarity between the cells since it is derived from multiple high-performing clustering outcomes. The graph can be defined as G = {V , E}, where V and E is the set of nodes and edges, respectively. Here, G is an undirected weighted graph and w ij is the weight of edge between sample i and j. To build a highly reliable network, we impose a constraint that w ij = c ij only if c ij >= 3, otherwise w ij = 0.

3) RANDOM WALK DISTANCE ON GRAPH
In general, if two nodes are directly connected or have many common neighbors, the probability that they belong to the same cluster is high [42], [56]. While from the perspective of the random walk on graph, two nodes are more similar if they have similar walking paths on the network. Therefore, we measure the relationships between cells based on their walking paths using random walk algorithm [65].
A typical random walk model on a regular graph is that, at each step, the walker at current location jumps to another site according to some probability distribution. In a simple random walk approach, the walker can only jump to adjacent positions of the graph to form a walking path [66]. From the graph constructed above, we compute a transition matrix M , in which the element m ij = w ij Deg(i) , Deg(i) = n i j=1 w ij and n i is the number of neighborhoods connecting with node i.
If a walker goes from the i th cell in scRNA-seq data graph, the initial probability P 0 i. will be an n × 1 vector where only the i th value is 1 and the others are 0. For each step of the walker walking on the graph, the probability vector is updated following P t+1 = M T P t , where P t ij is the probability of which the walker goes from cell i to cell j in t steps. Previous studies have shown that if the length of steps t tends towards infinity in the random walk process, the probability of being on a vertex j only depends on the degree of vertex j and is irrelevant to the starting vertex i. Therefore, it is important to choose an appropriate length of steps. If t is too small, the data is insufficient to depict the topology of graph. On the other hand, if t is too large, the system will result in a stationary distribution. In our experiments, we set t = 4 that is empirically advised in the previous study [65].
The random walk distance between cell i and cell j can be calculated as where n is the number of cells and P t ik is the probability of a walker going from cell i to cell k in t steps.

4) HIERARCHICAL TREE-STRUCTURE OF SINGLE CELLS
Next, we construct a tree-structure of cells. As shown in equation (6), the distance between a single cell k and a cluster C is defined as the average random walk distance from each node of C to node k, and |C| is the number of nodes in cluster C. Equation (7) calculates the distance between clusters C i and C j , which means the difference of the random walk distance about nodes in these two clusters.
Initially, we divide the cells into separate groups where each group only has one cell and each cell is put in one group. To form the tree-structure, the criteria is needed for selecting two groups to be combined each time. Here, we adopt the strategies from Ward's method [67]. We define the growth of the average intra-cluster distance before and after the union of each two adjacency cluster C i and C j as equation (8), where C u = C i ∪ C j and there is at least one edge between these two groups.
Then two clusters with the smallest value of σ is selected to be merged.

5) ESTIMATION THE NUMBER OF CLUSTERS
After the tree is constructed, we need to determine the number of divisions, a.k.a., the number of clusters [15]. Here we define the average intra-cluster distance of K groups as where C k is the k th cluster. Then the growth rate η K can be calculated as and the optimal number of clusters K is that satisfies the maximum value of η K .

D. PSEUDO-CODE OF GRACE
The pseudo-code of GRACE is shown in Algorithm (1). Line 1-3 is the process to construct a weighted graph of single cells. Line 4-11 is the process to build the tree-structure of clusters. Line 12-14 finish the structure of cutting and return the cluster labels.

A. DIFFERENCES AMONG THE CLUSTERING METHODS
Generally, clustering methods exhibit very different performance across different datasets. The reason lays in the VOLUME 8, 2020 1: Cluster the scRNA-seq data D using the five clustering methods; 2: Calculate the consensus matrix from the clustering outcomes; 3: Construct the graph G of cells based on the consensus matrix; 4: Calculate the random walk distance of cells with equation (5); 5: Initialize the partition P 0 = {v 1 , v 2 , . . . , v N }; 6: while k < N do 7: Calculate d C i C j , ∀ C i , C j ∈ P k with equation (6) and (7); 8: Select the two closest clusters to merge C 3 = C 1 ∪C 2 according to equation (8); 9: Update partition P k+1 10: k := k + 1; 11: end while 12: Evaluate the optimal number of clusters K with equation (9) and (10); 13: Get the cluster label, setting L to P N −K ; 14: return L.
fact that the method is usually optimized for some specific datasets [68]. To demonstrate this, We compare the similarity between the five state-of-the-art scRNA-seq data clustering approaches by computing the ARI of their predicting clustering results on twelve scRNA-seq datasets. As shown in Figure 2, the color from white to red indicates the ARI values ranges from 0 to 1. One can observe that the predicting outcomes of the five approaches are inconsistent (i.e. a lower ARI) on most datasets. The reason is that different methods capture different aspects of information from the complex and high-dimensional scRNA-Seq data.

B. IMPROVING CLUSTERING ACCURACY FROM INDIVIDUAL METHODS
In this section, we compare the ARI between GRACE and the five individual methods on twelve published datasets. The datasets are collected from different tissues by different sequencing technologies where the numbers of cells and numbers of cell types are totally different. Table 2 shows the ARI of our method comparing with the five individual clustering methods on the twelve datasets. Among the twelve scRNA-seq datasets, GRACE produces the best results in ten datasets (GSM2230757, GSM2230758, GSM2230759, GSM2230760, GSM2230761, SRP073767, GSE60361, GSE98561, GSE108097 and E-MTAB-2600), and the second best in the other two datasets (GSM2230762 and GSE65525). We also calculate the average ARI of each method on all the datasets, which are listed in the last line of the table. The results show that GRACE outperforms all other methods.
Furthermore, we make a statistical rank of these methods across twelve datasets according to ARI. A higher ARI value corresponds a larger rank, and a larger rank represents a better clustering performance. As shown in Figure 3, our approach GRACE achieves the highest rank and performs significantly better than other five individual methods. One can observe that the individual clustering methods have an TABLE 2. Similarity between predicted outcomes and gold-standard cluster labels is measured through ARI. unstable performance on different datasets, while our method is highly robust.
We also adopt other three metrics in clustering performance comparison. Figure 4 shows the Completeness scores, Homogeneity scores and V-measure of different approaches on twelve scRNA-seq datasets. Overall, GRACE performs the best in the Completeness score and V-measure. For the Homogeneity score, although SC3 is generally better than GRACE, the performance of SC3 is less stable. Compared with ''t-SNE+k-means'', other three methods (Seurat, SIMLR, and SC3) tend to obtain higher Homogeneity scores but lower Completeness scores. Among these methods, CIDR performs the worst for all the three metrics and is unstable dealing with different datasets.
In conclusion, individual clustering approaches exhibit the unstable performance in different datasets. GRACE improves the clustering accuracy and is more robust.

C. ADVANTAGES OF HIGH QUALITY GRAPH NETWORK
The reason why the ensemble cluster algorithm GRACE can improve the clustering accuracy is closely related to the way of the network construction. In order to verify that the network constructed by GRACE has higher reliability and is more conducive to cluster structure mining, we compared GRACE with other five conventional network construction approaches using similarly metrics. To be equitable, we only replace the network construction methods while keep other steps to cluster the scRNA-seq data in GRACE.
Here we list the five conventional methods to built network in our experiment. 1) Euclidean Distance (ED); 2) Manhattan Distance (MD); 3) Pearson Correlation Coefficient (PCC); 4) Spearman Correlation Coefficient (SCC); 5) Shared Nearest neighbors (SNN). ED and MD are common distance metrics where a smaller distance between two cells indicates a greater similarity between them. The corresponding similarity can be set as 1−normalizeddistance, where the normalized distance normalizes ED and MD into [0,1]. Literally, PCC and SCC demonstrate the similarity between cells. PCC is defined as the co-variance of two samples divided by the standard deviation of the two. The formula for calculating the SCC is similar to the calculation of the PCC, but the rank is substituted for the respective values.
Different from above four primary similarity indexes, SNN is called the ''second-order similarity'', which measures the similarity of the samples according to the number neighbors. Previous study show that SNN is more stable and robust for high-dimensional sparse data than the traditional distance metrics which generally results in a small value between samples [69]. In our experiments, we use the SNN similarity VOLUME 8, 2020 mentioned in [42] and corresponding similarity between cell i and j is defined as equation (11), where k is the number of nearest neighbors of node, SN is the intersection of the k-nearest neighbors (kNN) of sample i and sample j, Rank i (SNN ) is the rank of each node of SN in kNN for sample i, and min() is a function computing the minimum value of a vector. For example, there are 3 neighbors are shared by sample i and j, and the rank of them in kNN is (1,2,4) and (1,3,4) respectively. Then w SNN ij gets its max value k−1 because the top ranking of SN is 1 for both sample i and j. Figure 5 shows the clustering performance of graph-based methods with different network construction methods. One can observe that GRACE, which uses ensemble cluster results in building network, reaches the highest clustering accuracy on 10 scRNA-seq datasets. It performs much better than all the other methods on GSE60361, GSE98561, and GSE65525. For instance, our GRACE algorithm has a clustering accuracy of 0.86 on the data set GSE60361, while the highest ARI of several other methods is 0.35. Similarly, ARI of the GRACE algorithm reaches 0.84 on the dataset GSE65525, while ARI in several other methods range from 0.18 to 0.50. On average, GRACE obtains the highest ARI (0.72), followed by PCC(0.50) and ED (0.13) performs the worst.
In conclusion, the ensemble clustering based graph construction method is confirmed to have high quality outputs and provide more reliable graph information for us to detect communities.

D. RUNNING TIME OF GRACE
In this section, we test the the efficiency of GRACE. Since GRACE is an ensemble algorithm, its running time is always greater than that of each single algorithm it integrates. However, the overhead of GRACE is small. We list the running time of the five single clustering algorithms and the integration step in GRACE (overhead). To be fair, we use spaltter [70] to generate five simulated single cell datasets, where the number of features of each simulated data is set as 5000, while the sample size increased from 1000 to 5000 across the step size 1000. In simulating data, the parameters used by spaltter are estimated from one real dataset GSE60361.
The experiments are conducted on a desktop computer with 3.2GHz Intel Core i7 CPU, 16 GB 2400MHz DDR4 RAM, and Windows 10 operating system. Table 3 presents the time consumption of GRACE's ensemble step and other five individual methods. With the increase of sample size, the time consumption of SIMLR increases rapidly, followed by SC3, while the other 3 methods present shorter running time. Comparing to the clustering run-time, the overhead of  GRACE is small and acceptable. Even for the largest data with 5000 samples, the overhead is around 5%.

VI. CONCLUSION AND DISCUSSION
In the past decades, there has been increasing interests in scRNA-seq data analysis, where the growing clustering methods have helped to solve many research problems. However, experimental results show that the existing methods are not robust across multiple datasets and even perform poorly on some complex datasets [68]. Since the scRNA-seq data from different platforms or laboratories are always unlabeled and have limited additional information, it's difficult to determine which clustering approach is more appropriate. To address this problem, we propose a novel clustering framework GRACE, a graph-based cluster ensemble approach. By integrating the outcomes of five high-performing clustering methods for network construction, GRACE is able to build a more reliable graph network than using other conventional similarity metrics. What's more, the comparative analysis on twelve datasets shows that GRACE is highly robust and exhibits a competitive performance.
In our future work, we would like to apply GRACE to some specific disease related studies, such as disease-related cell types and biological pathway identification. Besides, we will also apply the graph constructed in GRACE to scRNA-seq data visualization. Moreover, the idea of ensemble clustering is not limited to clustering scRNA-seq data and may be useful to a wide range of clustering applications.