Consensus Nature Inspired Clustering of Single-Cell RNA-Sequencing Data

Single-cell RNA sequencing (scRNA-seq) enables quantification of mRNA expression at the level of individual cells. scRNA-seq uncovers cellular heterogeneity, giving insights into the expression profiles of distinct cells and revealing cellular differentiation. The rapid advancements in scRNA-seq technologies enable researchers to explore questions regarding cancer heterogeneity and the tumor microenvironment. Analyzing, and in particular clustering, scRNA-seq data is computationally challenging due to its noisy, high-dimensional nature. In this paper, a computational clustering approach is proposed to cluster scRNA-seq data based on consensus clustering using swarm intelligent optimization algorithms to accurately recognize cell subtypes. The proposed approach uses variational auto-encoders to handle the curse of dimensionality, creating a latent, biologically relevant feature space that represents the original data. The new latent space is then clustered using the Particle Swarm Optimization algorithm, the Multi-Verse Optimization algorithm and the Grey Wolf Optimization algorithm, and a consensus solution is found from the solutions returned by the swarm intelligent algorithms. The proposed approach automatically derives the number of clusters without any prior knowledge. To evaluate the performance of the proposed approach, a total of four datasets were used, and a comparison against existing methods in the literature was performed. Experimental results show that the proposed approach performs better than the most widely used tools, achieving an adjusted Rand index of 0.95, 0.75, 0.88 and 0.90 for the Biase, Goolam, melanoma cancer and lung cancer datasets respectively.


I. INTRODUCTION
Single-cell RNA sequencing enables quantification of gene expression at the cell level, unlike bulk RNA sequencing, which averages the gene expression across a population of cells [1]. In other words, single-cell RNA sequencing attempts to represent the distribution of expression levels within each subpopulation per transcript for each individual cell in the sample, while bulk RNA sequencing only measures the average expression level per gene over a large population of cells [2].

Accordingly, bulk RNA sequencing is useful for comparative transcriptomics and disease studies, where it can quantify expression signatures. Nevertheless, it is inadequate for studying complex heterogeneous systems [3]. scRNA-seq analysis gives insight into cellular heterogeneity by identifying cell types and detecting rare cell types through the study of cell behavior within the microenvironment, which is applied in fields such as cancer treatment, tumor composition, embryonic development, etc. [4].

Single-cell RNA sequencing was first introduced by Tang et al. [2], though it gained wide popularity around 2014, as new protocols reduced the sequencing cost and made the sequencing process easier. There remains a need to modify existing computational tools and to implement new ones for single-cell RNA sequencing analysis, in order to study biological inquiries about particular cell type behavior, etc. [5]. In this work, a consensus nature-inspired clustering approach is proposed, resulting in better clustering performance. The major contributions of this research are:

• Achieving an adjusted Rand index (ARI) of 0.95 for the Biase dataset, an ARI of 0.75 for the Goolam dataset, an ARI of 0.88 for the melanoma cancer dataset and an ARI of 0.90 for the lung cancer dataset, with stable clusters regardless of the number of samples in the datasets.

II. RELATED WORK
Considering the many challenges that face clustering of single-cell data, in terms of execution time, clustering accuracy and stability, many methods have been developed to overcome these limitations of traditional methods. In this section, a brief background on clustering, its categories and the mechanism of the metaheuristic techniques used is presented, followed by a review of recently developed tools in light of these challenges.

Clustering is the process of finding a natural grouping of data objects such that each cluster contains similar data objects that simultaneously differ from the data objects in other clusters, depending on the similarity measure used [12]. Clustering is therefore considered an unsupervised learning problem, since prior information about the data groups is unavailable, unlike in classification problems where the labels/groups of the data are already known [13]. Clustering algorithms try to minimize the intra-cluster distance between data objects of the same cluster and maximize the inter-cluster distance between data objects of different clusters, to find an accurate grouping of these objects, as illustrated by the sketch below.
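To make this objective concrete, the following minimal sketch (illustrative only, with toy data and a hypothetical helper name) computes the mean intra-cluster and inter-cluster distances of a labeled partition; a good clustering keeps the former small and the latter large.

```python
import numpy as np

def intra_inter_distances(X, labels):
    """Mean pairwise distance within clusters vs. across clusters."""
    intra, inter = [], []
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            d = np.linalg.norm(X[i] - X[j])
            (intra if labels[i] == labels[j] else inter).append(d)
    return np.mean(intra), np.mean(inter)

# Toy example: two well-separated groups of points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)
print(intra_inter_distances(X, labels))  # small intra, large inter
```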

There are numerous algorithms and techniques to solve the clustering problem, and they can be categorized as follows:
• Hierarchical clustering [14]: attempts to build a hierarchical tree by finding structure among data points using either an agglomerative or a divisive approach.

• Partition-based clustering [15]: attempts to partition the data into k clusters by identifying the best k centers; those centers are either centroids, as in the K-means algorithm, or medoids.

• Graph-based clustering [16]: represents the data points as nodes in a graph, then uses pairwise similarities to compose the edges between the nodes.

The Grey Wolf Optimization (GWO) algorithm is inspired by the social hierarchy and hunting behavior of grey wolves. The levels of this hierarchy are α, β, δ and ω, each representing a position within the hierarchy of the grey wolves. The α wolves take the responsibility of deciding during the hunting process, leading and tracking the other wolves to maintain social order within their groups. The β wolves come after the α wolves in the hierarchy and are considered the advisors of the α wolves; once α wolves die or become too old, β wolves ascend and become α wolves. The ω wolves may be the children of the group and are controlled by the δ wolves, which are also responsible for providing information to the α and β wolves. The hunting process of grey wolves starts with searching for and tracking the prey. After that, the grey wolves surround their prey until it can no longer move. Lastly, the grey wolves attack the prey.

MVO was first introduced by Mirjalili et al. [24]. It is a swarm intelligent optimization algorithm inspired by the multi-verse theory in astrophysics. The multi-verse theory adopts the concept of multiple universes created by the big bang, together with the concept of interaction between these universes, which takes place through a variety of holes. These holes are black holes, white holes or wormholes. A transfer between any pair of universes happens when a black hole and a white hole interact through what is called a tunnel; wormholes create those tunnels, whereas black holes absorb everything and white holes emit everything. Each universe represents a solution to the optimization problem, and each object within a universe represents a feature. The fitness value per universe is calculated by some objective function and is known as the inflation rate, which controls the expansion through space. The fitness of a solution depends on the existence of white holes, which leads to fitter solutions, whereas the existence of black holes leads to poor solutions.

Many have tackled the problem of clustering scRNA-seq data using different clustering approaches; however, the proposed methods face limitations.

Xu and Su [25] introduced SNN-Cliq. In this approach, the shared-nearest-neighbor (SNN) graph is constructed by computing a similarity matrix that depends on the Euclidean distance as a similarity measure between data points. For each data point, a list of its k-nearest neighbors is made based on the similarity matrix. The graph is then constructed by considering each cell to be a data point, and a weighted edge between any two points is created if the two points have at least one common nearest neighbor. The maximal quasi-clique per node is found using a greedy algorithm fed the SNN graph as input. All possible quasi-cliques are found, then an elimination process is performed to remove any partially contained ones. Clusters are then identified by merging quasi-cliques based on their overlapping rate. Iterative merging is performed until no pair of clusters has an overlapping rate greater than a certain threshold.
Although this approach succeeds in having a clear definition of clusters, everything depends on how the single-cell data is represented as a graph: when the graph is too sparse, it fails to detect the clusters.

In SC3, k-means clustering is performed using the Hartigan-Wong algorithm [30], and the individual clustering results are combined into a consensus matrix. Hierarchical clustering is then performed on the consensus matrix using an agglomerative approach. Though SC3 is scalable to large datasets, it is also sensitive to outliers, and it requires the user to determine k.

Lin et al. [32] introduced CIDR, an approach that performs dimensionality reduction using principal coordinate analysis (PCoA) on a dissimilarity matrix. The number of coordinates is determined based on a variation of the scree algorithm. Hierarchical clustering is applied after determining the number of clusters according to the Calinski-Harabasz index [33]. CIDR provides hierarchical relationships among data points, but it suffers from high time complexity and the clusters it produces are not explicit.

Wang et al. [34] introduced a framework named SIMLR. Given an expression matrix, SIMLR calculates a symmetric matrix that captures pairwise similarities between cells. Gaussian kernels with multiple different hyper-parameters are used, utilizing the Euclidean distance between pairs of cells and k-nearest neighbors, and an optimization algorithm is used to improve some of the hyper-parameters. SIMLR then uses stochastic neighbor embedding (t-SNE) for dimensionality reduction, and k-means is performed on the resulting latent data. SIMLR makes no assumption about the data distribution, but it faces computational challenges when it comes to large datasets.

Gan et al. [35] introduced conCluster, a model that performs consensus clustering. The model performs filtering on the gene level to eliminate genes that are expressed in fewer than r% of cells or in more than (100-r)% of cells; such genes, defined as rare and ubiquitous genes respectively, hold no valuable genetic information. A gene set of the most variable genes across the cells is identified. For dimensionality reduction, stochastic neighbor embedding (t-SNE) is used, with the perplexity set to 30, to reduce the number of dimensions to two. K-means is later performed T times utilizing multiple initial parameters for the basic clusterings. A binary matrix is generated per clustering output using the resulting cluster labels, and the binary matrices from the multiple clustering trials are concatenated into one large binary matrix. K-means is performed once again on the concatenated binary matrix, using the Calinski-Harabasz index [33] to determine the number of clusters, and the results of the clustering trials are fused into a consensus one. conCluster provides robust clustering; however, it relies on combining other algorithms for the ensemble.

Yang et al. [36] embed four clustering methods: SC3, Seurat, CIDR and t-SNE+k-means. The gene expression matrix is taken as input after adjusting it to be suitable for all methods, and clustering with the four methods is performed individually. The results obtained are later used to construct an overall hypergraph that combines all the hypergraphs resulting from the individual results. For ensemble clustering, one of the partitioning algorithms HGPA, MCLA or CSPA is used.
Performance is evaluated using the average normalized mutual information (ANMI) for the ensemble solution and the individual ones, and the clustering result with the highest ANMI is selected as the final result, as sketched below.
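Assuming each method returns a flat vector of cluster labels, the ANMI-based selection can be sketched as follows; the definition of ANMI used here (the average NMI between a candidate and every individual solution) is a common one and is assumed rather than quoted from [36].

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def anmi(candidate, individual_solutions):
    """Average NMI between one labeling and a set of base labelings."""
    return np.mean([normalized_mutual_info_score(candidate, s)
                    for s in individual_solutions])

def pick_final(candidates, individual_solutions):
    """Return the candidate labeling (ensemble or individual) with the highest ANMI."""
    scores = [anmi(c, individual_solutions) for c in candidates]
    return candidates[int(np.argmax(scores))]
```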

A silhouette coefficient is measured to make sure samples are correctly grouped, and only groupings that maximize the clustering score are kept. FEATS can fit flexible cluster shapes; however, it suffers from high time complexity.

Cui et al. [44] proposed SCENA, an unsupervised method for clustering single-cell data. SCENA takes a gene expression matrix as input, where rows represent the genes and columns represent the cells. SCENA performs preprocessing in three stages: gene filtering, log transformation and normalization. Similarity matrices between cells, based on Euclidean distance and k-nearest neighbors, are generated for each highly variable feature set. For each final similarity matrix, spectral clustering is performed, resulting in a binary matrix that indicates whether two cells belong to the same cluster or not. A consensus matrix is generated from the binary matrices, and spectral clustering is used again to perform consensus clustering. A strength of SCENA is that it makes no prior assumption about the data distribution; however, spectral clustering is computationally intensive for large datasets.

It can be concluded that clustering single-cell data is a challenging computational problem that needs further exploration. Methods that follow partitioning clustering and methods using neural networks or affinity propagation are sensitive to outliers. Other methods, like hierarchical clustering, suffer from high time complexity. While density-based clustering suffers from both drawbacks, ensemble-based clustering shows robust clustering performance, as it integrates multiple methods.

III. PROPOSED APPROACH
The proposed system preserves the standard workflow of scRNA-seq analysis with the following steps. First, datasets are preprocessed in two steps: gene filtering and matrix normalization. Secondly, a variational auto-encoder (VAE) is used for dimensionality reduction, followed by a clustering step. In the clustering step, cells are clustered into groups where each cluster represents a unique type of cell; the latent feature space resulting from the dimensionality reduction step is used as input to three different algorithms (PSO, GWO, MVO). The best solutions resulting from these algorithms are used to generate a binary matrix that is clustered again by GWO, and the consensus solution is returned. Fig. 1 illustrates the workflow of the proposed CNIC approach.

The VAE compresses the data while a latent representation is learnt simultaneously. It consists of an encoding phase and a decoding phase; the encoding phase is responsible for compressing the data, while during the decoding phase the data is reconstructed into its original shape and a loss value is calculated. The aim of decoding in this approach is to ensure that the compressed latent space correctly represents the data. Accordingly, the VAE in this context can be summarized as an input layer, an encoding phase and a decoding phase, as sketched below.
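The following is a minimal VAE sketch in PyTorch, not the paper's exact architecture (the layer sizes and latent dimension are illustrative assumptions): the encoder maps each cell's expression vector to the mean and log-variance of a latent Gaussian, and the decoder reconstructs the input so that the reconstruction-plus-KL loss keeps the latent space faithful to the data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, n_genes, latent_dim=32, hidden=256):
        super().__init__()
        # Encoding phase: compress the expression profile.
        self.enc = nn.Sequential(nn.Linear(n_genes, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)      # latent mean
        self.logvar = nn.Linear(hidden, latent_dim)  # latent log-variance
        # Decoding phase: reconstruct the original shape.
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_genes))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    # Reconstruction error + KL divergence to the unit Gaussian prior.
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```

After training, the latent means produced by the encoder serve as the reduced feature space that is passed to the swarm algorithms.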
The clustering algorithms use the silhouette coefficient as their fitness function. The silhouette s(i) of an instance i in a cluster is computed as

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

where a(i) is the mean distance between instance i and all other instances in its own cluster, and b(i) is the mean distance between instance i and the instances of the nearest neighboring cluster.

The goal of each particle's movement is to attain the optimum velocity according to its local best value (P_local_best) and its neighborhood's global best (P_global_best). A particle's position changes according to its current position, its current velocity, its distance from P_local_best, and its distance from P_global_best. All particles update their velocities and positions based on equations (9) and (10), respectively:
$$V_i(t+1) = w\,V_i(t) + a_1 r_1 \big(P_{local\_best} - P_i(t)\big) + a_2 r_2 \big(P_{global\_best} - P_i(t)\big) \qquad (9)$$
$$P_i(t+1) = P_i(t) + V_i(t+1) \qquad (10)$$

where $P_i(t)$ and $V_i(t)$ indicate the particle's position and velocity at iteration t, respectively. The terms $a_1$, $a_2$ are acceleration coefficients, w is the inertia weight, and $r_1$, $r_2$ are random numbers.
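A minimal sketch of PSO-based centroid clustering under these update rules is shown below (parameter values are illustrative, and k is taken as given here, whereas the proposed approach derives the number of clusters automatically). Each particle encodes k candidate centroids, cells are assigned to their nearest centroid, and the silhouette coefficient serves as the fitness, matching the paper's objective.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def fitness(centroids, X):
    """Assign each cell to its nearest centroid and score the partition."""
    labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
    if len(np.unique(labels)) < 2:
        return -1.0  # degenerate partition; silhouette undefined
    return silhouette_score(X, labels)

def pso_cluster(X, k, n_particles=20, iters=50, w=0.7, a1=1.5, a2=1.5):
    P = X[np.random.choice(len(X), (n_particles, k))]   # positions: k centroids each
    V = np.zeros_like(P)                                # velocities
    local_best = P.copy()
    local_fit = np.array([fitness(p, X) for p in P])
    g = local_best[np.argmax(local_fit)]                # global best
    for _ in range(iters):
        r1, r2 = np.random.rand(2)
        # Equations (9) and (10): velocity update, then position update.
        V = w * V + a1 * r1 * (local_best - P) + a2 * r2 * (g - P)
        P = P + V
        fits = np.array([fitness(p, X) for p in P])
        improved = fits > local_fit
        local_best[improved], local_fit[improved] = P[improved], fits[improved]
        g = local_best[np.argmax(local_fit)]
    return g  # best set of centroids found
```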

The mathematical model of the GWO in this approach is as follows. The α wolf represents the fittest solution, while the β and δ wolves represent the second and third best solutions, respectively. All other solutions represent the ω wolves, which follow the α, β and δ wolves leading the hunting process. The search agents, i.e. the grey wolves excluding the fittest ones, encircle the prey according to equations (11) and (12):

$$D = |C \cdot X_p(t) - P(t)| \qquad (11)$$
$$P(t+1) = X_p(t) - A \cdot D \qquad (12)$$

where D represents the distance between the position of the prey $X_p$ and the position of the search agent P at iteration t, and A and C are coefficient vectors. The fittest grey wolves, represented by the α, β and δ wolves, adjust their positions according to the prey's position, and the remaining search agents reposition around them to start the hunting process, modeled by equations (13)-(21):

$$A = 2a \cdot r_1 - a \qquad (13)$$
$$C = 2 r_2 \qquad (14)$$
$$D_\alpha = |C_1 \cdot X_\alpha - P|, \quad D_\beta = |C_2 \cdot X_\beta - P|, \quad D_\delta = |C_3 \cdot X_\delta - P| \qquad (15)-(17)$$
$$X_1 = X_\alpha - A_1 \cdot D_\alpha, \quad X_2 = X_\beta - A_2 \cdot D_\beta, \quad X_3 = X_\delta - A_3 \cdot D_\delta \qquad (18)-(20)$$
$$P(t+1) = \frac{X_1 + X_2 + X_3}{3} \qquad (21)$$

where $D_\alpha$, $D_\beta$ and $D_\delta$ represent the distances between the fittest grey wolves and the remaining search agents, $r_1$ and $r_2$ are random vectors, and a is decreased over the course of the iterations.
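A minimal sketch of one GWO iteration under equations (13)-(21) follows; the standard linear decrease of a from 2 to 0 is assumed here.

```python
import numpy as np

def gwo_step(wolves, fitness_values, a):
    """One GWO iteration: every wolf moves toward alpha, beta and delta."""
    order = np.argsort(fitness_values)[::-1]          # maximizing fitness
    alpha, beta, delta = wolves[order[0]], wolves[order[1]], wolves[order[2]]
    new_positions = np.empty_like(wolves)
    for i, P in enumerate(wolves):
        candidates = []
        for leader in (alpha, beta, delta):
            r1, r2 = np.random.rand(2)
            A = 2 * a * r1 - a                        # equation (13)
            C = 2 * r2                                # equation (14)
            D = np.abs(C * leader - P)                # equations (15)-(17)
            candidates.append(leader - A * D)         # equations (18)-(20)
        new_positions[i] = np.mean(candidates, axis=0)  # equation (21)
    return new_positions

# 'a' is typically decreased linearly from 2 to 0 across iterations,
# shifting the swarm from exploration to exploitation.
```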
The best solution returned by each algorithm is encoded as a binary matrix with one row per cell and one column per cluster. All entries in a row are zeros except for a 1 indicating the cluster number that the cell (sample) belongs to. All binary matrices are concatenated into one matrix, and the new binary matrix is used as input to the GWO algorithm once again to be clustered, yielding the final solution. Fig. 4 illustrates the consensus clustering process; a minimal sketch of the encoding follows.
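Assuming each swarm algorithm's best solution is available as a flat label vector, the binary encoding and concatenation can be sketched as:

```python
import numpy as np

def one_hot(labels):
    """Binary matrix: one row per cell, a single 1 in its cluster's column."""
    labels = np.asarray(labels)
    B = np.zeros((labels.size, labels.max() + 1), dtype=int)
    B[np.arange(labels.size), labels] = 1
    return B

def consensus_input(solutions):
    """Concatenate the binary matrices of the PSO, GWO and MVO solutions."""
    return np.hstack([one_hot(s) for s in solutions])

# The concatenated matrix is then clustered once more (by GWO in this
# approach) to obtain the final consensus labels.
```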

To assess the performance of the clustering process, different evaluation measures are used to further demonstrate the advantage of the consensus nature-inspired approach over other existing approaches in the literature. Since the true labels and numbers of clusters are made publicly available by the original authors of the datasets, evaluation measures such as the Adjusted Rand Index, Completeness score, Homogeneity score and V-measure score are used.

1) ADJUSTED RAND INDEX (ARI) [49]
ARI is a measure that evaluates the similarity between two clustering results: the ground truth and the predicted labels. ARI values range from −1 to 1, such that lower values indicate poor clustering results while values closer to 1 indicate a clustering result similar to the ground truth; 1.0 indicates a perfect match between the predicted results and the ground truth. ARI makes no prior assumptions on the cluster structure, hence it can be used to compare different algorithms. ARI is calculated according to equation (22):

$$ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]} \qquad (22)$$

For the completeness score (CS), a value of 1 indicates a perfect score; CS is calculated using equation (27).

c: V-MEASURE (VM)
A harmonic mean of homogeneity and completeness that makes no assumption on the cluster structure; it is also used to qualitatively interpret the clustering result.
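All four measures are available in scikit-learn, so the evaluation can be reproduced with a short helper (the label vectors here are illustrative):

```python
from sklearn.metrics import (adjusted_rand_score, completeness_score,
                             homogeneity_score, v_measure_score)

def evaluate(true_labels, predicted_labels):
    """Compute the four external validation measures used in the paper."""
    return {
        "ARI": adjusted_rand_score(true_labels, predicted_labels),
        "Homogeneity": homogeneity_score(true_labels, predicted_labels),
        "Completeness": completeness_score(true_labels, predicted_labels),
        "V-measure": v_measure_score(true_labels, predicted_labels),
    }

# A relabeled copy of the same partition scores perfectly on all measures.
print(evaluate([0, 0, 1, 1], [1, 1, 0, 0]))
```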

IV. EXPERIMENTAL RESULTS
In this section, the experimental configuration and parameter settings are presented; the proposed approach was run on all datasets with the same settings. Four datasets were used to assess the performance of the proposed approach against competing approaches.

The standard deviation is used as a descriptive statistic to summarize the solutions computed by the proposed clustering approach. The average objective function value over all 100 runs of each algorithm is calculated and reported, as well as the average execution time needed by all algorithms to find the clustering solutions. Table 3 shows the mean and standard deviation of the fitness function (silhouette coefficient) of each algorithm across all runs on all datasets.

To evaluate the clustering accuracy, the performance of the proposed CNIC is compared to the currently most used methods for clustering single-cell data, using their default parameters as specified by their authors. These methods are SNN-Cliq, SC3, CIDR, pcaReduce, SEURAT and t-SNE+k-means. All approaches were applied to the Biase, Goolam and melanoma cancer datasets, and the results were evaluated by ARI.

For assessing the clustering stability, all implemented algorithms as well as the consensus solution were evaluated in terms of homogeneity score, completeness score and V-measure. Fig. 11, Fig. 12 and Fig. 13 show the homogeneity score, completeness score and V-measure score of the proposed approach and the other implemented algorithms, respectively. It is noticed that CNIC achieves better scores, indicating clustering stability even in the case of large datasets.