Single-Cell RNA-Sequencing Data Clustering via Locality Preserving Kernel Matrix Alignment

Single-cell RNA-sequencing (scRNA-seq) data provide opportunities to reveal new insights into many biological problems, such as elucidating cell types. An effective approach to elucidating cell types in complex tissues is to partition the cells into separate subgroups via clustering techniques, where the cells in a specific cluster belong to the same cell type based on gene expression patterns. In this work, we present a novel multiple kernel clustering framework for scRNA-seq data clustering via locality preserving kernel alignment. Specifically, we first generate a series of similarity kernel matrices by using different kernel functions. Then we transfer the clustering task to a multiple kernel k-means problem with the kernels aligned in a local manner, i.e., the similarity of a sample to its k-nearest neighbours is aligned with the ideal similarity matrix. In our method, the clustering process focuses on closer sample pairs that should stay together, and avoids involving unreliable similarity evaluations for farther sample pairs. In addition, we construct a local Laplacian matrix for each sample to constrain closer samples to be allocated similar labels. In this manner, the local structure of the data can be well preserved and utilized to produce better alignment for clustering. An alternating updating algorithm with theoretical analysis is developed to solve the proposed problem. We evaluate the performance of the proposed method on various real scRNA-seq data, and the results show that our method obtains superior results compared with other state-of-the-art approaches.


I. INTRODUCTION
Recent literature indicates that single-cell measurements play an important role in understanding cellular heterogeneity [1]-[5] and cell differentiation [6], [7]. Thanks to the rapid development of single-cell RNA-sequencing (scRNA-seq) techniques, a tremendous amount of transcriptome data has been generated at single-cell resolution [8], [9]. On the one hand, these datasets provide opportunities to reveal new insights into many biological problems, e.g., elucidating cell types; on the other hand, they also pose computational challenges due to the amount of data. A straightforward approach to elucidate cell types in complex tissues is to partition the cells into separate subgroups via clustering techniques [10]-[13], which can be regarded as an unsupervised classification problem [14]-[16]. Many previous clustering techniques can be used for this task, such as principal component analysis (PCA) [17], spectral clustering [18], and k-means [19]. However, unlike bulk RNA-seq or gene expression microarrays, scRNA-seq data contain high levels of noise and many missing values due to technical and sampling issues [20]-[22]. In addition, high variability exists in gene expression levels even between cells of the same type, which can degrade the performance of existing clustering approaches [23]-[27]. (The associate editor coordinating the review of this manuscript and approving it for publication was Li He.)
In order to address these issues in scRNA-seq data analysis, a variety of novel clustering methods have been proposed in recent years. For subtype classification and detection of relationships between the subtypes, some iterative clustering methods have been proposed [28]-[31]. Haghverdi et al. [32] used diffusion maps to perform dimension reduction of the data, which stresses continuity of cell states along putative developmental pathways. The Nonnegative Matrix Factorization (NMF) technique has also been used to decompose high-dimensional scRNA-seq data into biologically interpretable compositions [33], with which functional cell subgroups and biologically relevant features can be obtained simultaneously. Several graph theory-based algorithms have also been applied to scRNA-seq data clustering problems. Xu and Su [34] developed a quasi-clique-based clustering algorithm named SNN-Cliq to identify tight groups of highly similar nodes that are likely to belong to the same genuine clusters; in SNN-Cliq, the clusters are identified by using the proposed SNN graph. Spectral clustering, as a typical graph-based clustering method, has also been successfully deployed for this task. Park and Zhao [35] first constructed a series of symmetric doubly stochastic similarity matrices by using the Gaussian kernel function with varying parameters, and then learned a target similarity matrix from the previously constructed matrices for spectral clustering. In [36], Wu et al. integrated dimension reduction and clustering of single-cell RNA-sequencing data into a unified framework. Multiple kernel clustering is another popular modern clustering approach, which aims to optimally integrate a group of pre-specified kernels to improve clustering performance.
(IEEE Access, Volume 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/)
A critical issue in multiple kernel clustering is learning the kernel combination weights. Margolin [37] made the combination weights adaptively change with respect to samples to better capture their individual characteristics. Du et al. [38] presented a robust multiple kernel k-means algorithm that simultaneously finds the best clustering labels and the optimal combination of multiple kernels by replacing the squared error in k-means with an l2,1-norm based one. Lu et al. [39] employed kernel alignment maximization to jointly perform k-means clustering and multiple kernel learning. Wang et al. [40] presented an analytic framework (SIMLR) via multi-kernel learning which learns a similarity measure from scRNA-seq data in order to perform dimension reduction, clustering, and visualization. Compared to previous methods, SIMLR learns a distance metric that best fits the structure of the data by combining multiple kernels. Standard dimension reduction or clustering algorithms often rely on statistical assumptions that the diverse statistical characteristics of scRNA-seq data do not easily satisfy. Qi et al. [41] proposed to automatically learn similarity information from data and introduced a new clustering method in the form of a multiple kernel combination, which can directly discover groupings in scRNA-seq data. In this article, we propose a new scRNA-seq data clustering method via locality preserving multiple kernel alignment (referred to as LPKA for brevity). Considering that previous kernel alignment methods often rigidly constrain closer and farther sample pairs to be equally aligned to the same ideal similarity, inappropriately neglecting the intra-cluster variation of samples, we propose to align the kernels in a local manner, i.e., the similarity of a sample to only its k-nearest neighbours is aligned with the ideal similarity matrix.
In our method, the clustering process focuses on closer sample pairs that should stay together, and avoids involving unreliable similarity evaluations for farther sample pairs. In addition, we construct a local Laplacian matrix for each sample to constrain closer samples to be allocated similar labels. In this manner, the local structure of the data can be well preserved and utilized to produce better alignment for clustering. Experiments on 9 scRNA-seq datasets are conducted to demonstrate the superiority of our proposed method.

A. DATASETS COLLECTION AND KERNEL MATRICES GENERATION
1) DATASETS
In order to evaluate the efficacy of our proposed LPKA, we use real-world scRNA-seq datasets to test the clustering performance. Similar to [35], we test the performance of LPKA on 9 scRNA-seq datasets which represent several types of dynamic processes, such as cell cycle, cell differentiation, and response to external stimulus. For each dataset, the types of cells are known a priori. The number of cells, cell types, and genes for each dataset is summarized in Table 1.
For readability and completeness, we give detailed information on the datasets as follows: • Treutlein [42]. This dataset is composed of single-cell RNA-seq expression data for 80 lung epithelial cells at E18.5, together with five putative cell types: AT1, AT2, Clara, BP, and ciliated. Similar to [42], we considered the 959 genes with the highest loadings in the first four principal components. The dataset was downloaded from https://www.nature.com/articles/nature13173.
• Ting. This dataset contains 5 subtypes of single-cell transcriptomes from MEFs, the NB508 pancreatic cancer cell line, normal WBCs, bulk primary tumors diluted to 10 or 100 pg of RNA, and classical CTCs. We downloaded the data from GEO (GSE51372).
• Deng [43]. This dataset consists of transcriptomes of individual cells isolated from mouse embryos at different preimplantation stages. There are 135 cells and 19,703 genes, where the cells span developmental stages from zygote to blastocyst.
• Buettner [44]. This dataset contains the transcriptional profiles of 182 ESCs staged for cell-cycle phase (G1, S, and G2M) based on sorting of the Hoechst 33342-stained cell area of a fluorescence-activated cell sorting (FACS) distribution. The cells were sorted for three stages of the cell cycle and validated using gold-standard Hoechst staining. The data have been deposited at ArrayExpress (E-MTAB-2805).
• Pollen. There are 249 single cells from 11 populations captured using microfluidics, including neural cells and blood cells. The 11 clusters in the dataset were from different sources (CRL-2338, CRL-2339, K562, BJ, HL60, hiPSC, Keratinocyte, Fetal cortex (GW21+3), Fetal cortex (GW21), Fetal cortex (GW16), and NPC) that are expected to show robust differences in gene expression. Data were pre-filtered to exclude genes for which more than 90% of cells had zero measurements, and to include only single cells with more than 500,000 reads (n = 249).
• Tasic [30]. There are 49 transcriptomic cell types in this dataset, including 23 GABAergic, 19 glutamatergic and 7 non-neuronal types. To identify cell types, Tasic et al. [30] applied two parallel and iterative approaches for dimensionality reduction and clustering, iterative principal component analysis (PCA) and iterative weighted gene coexpression network analysis (WGCNA), and validated the cluster membership from each approach using a non-deterministic machine learning method (random forest). We downloaded the processed data from GEO (GSE71585).
• Zeisel [28]. In this dataset, the 39 cell types were identified via PCA and density-based clustering, and they were validated by differential gene expression. We filtered out cells with fewer than 1200 genes (yielding 6418 cells) for clustering analysis. We downloaded the data from GEO (GSE63473).

2) KERNEL MATRICES GENERATION
In our experiments, three kinds of kernels are used to generate kernel matrices, including Gaussian kernel, linear kernel and polynomial kernel.
The Gaussian kernel is one of the most widely used kernel functions, and it performs stably whether the data size is large or small. We follow [40] and consider multiple Gaussian kernel functions to construct kernel matrices as follows:

K(x_i, x_j) = \frac{1}{\epsilon_{ij}\sqrt{2\pi}} \exp\!\left( -\frac{\|x_i - x_j\|^2}{2\epsilon_{ij}^2} \right), \quad \epsilon_{ij} = \frac{\theta(\mu_i + \mu_j)}{2}, \quad \mu_i = \frac{1}{k} \sum_{l \in KNN(i)} \|x_i - x_l\|,

where KNN(i) represents the set of sample indices of the top k nearest neighbors of sample x_i. As can be seen, the parameters θ and k control the width of the neighborhoods. For generality, we vary θ over {1, 1.25, · · · , 2} and k over {10, 12, · · · , 30}. Thus, a total of 55 Gaussian kernel matrices are obtained.
The linear kernel is suitable for data samples that are linearly separable, and it has no parameter. We generate a linear kernel matrix as

K(x_i, x_j) = x_i^{\top} x_j.

The polynomial kernel projects a low-dimensional feature space to a higher-dimensional feature space. It is defined as

K(x_i, x_j) = (x_i^{\top} x_j + 1)^d.

In this work, we vary d over {0.1, 0.2, · · · , 1} and obtain 10 polynomial kernel matrices. Finally, for each dataset, we combine the different kinds of kernel matrices to obtain a kernel tensor of size n × n × 66 for further computation, where n is the number of samples in the dataset.
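The construction above can be sketched in Python. The density-adaptive Gaussian form follows the SIMLR-style kernel in [40]; since the original equations were lost in extraction, the exact normalization and the constant term in the polynomial kernel should be treated as our assumptions, and all function names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def gaussian_kernels(X, thetas=(1.0, 1.25, 1.5, 1.75, 2.0), ks=range(10, 31, 2)):
    """Density-adaptive Gaussian kernels: the bandwidth for a pair (i, j)
    depends on the mean distance of i and j to their k nearest neighbours,
    for every (theta, k) combination -> 5 * 11 = 55 matrices."""
    D = squareform(pdist(X))                       # n x n Euclidean distances
    kernels = []
    for k in ks:
        # mean distance to the k nearest neighbours (column 0 is the sample itself)
        mu = np.sort(D, axis=1)[:, 1:k + 1].mean(axis=1)
        for theta in thetas:
            eps = theta * (mu[:, None] + mu[None, :]) / 2.0
            kernels.append(np.exp(-D**2 / (2 * eps**2)) / (eps * np.sqrt(2 * np.pi)))
    return kernels

def linear_kernel(X):
    return X @ X.T

def polynomial_kernels(X, ds=np.arange(0.1, 1.01, 0.1)):
    # assumes x_i^T x_j + 1 > 0, which holds for nonnegative expression data
    G = X @ X.T + 1.0
    return [G**d for d in ds]

def build_kernel_tensor(X):
    """Stack all 55 + 1 + 10 = 66 kernel matrices into an (n, n, 66) tensor."""
    Ks = gaussian_kernels(X) + [linear_kernel(X)] + polynomial_kernels(X)
    return np.stack(Ks, axis=-1)
```

For a dataset of n cells, `build_kernel_tensor` returns the n × n × 66 array described above.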

B. KERNEL K-MEANS CLUSTERING
Given a set of n data samples {x_i}_{i=1}^{n} ⊆ X from k clusters, let φ(·) : x ∈ X → H be a feature mapping which maps x from the original space into a reproducing kernel Hilbert space H. Kernel k-means aims to minimize the sum-of-squares loss over the cluster assignment matrix Z ∈ {0,1}^{n×k}, and the problem can be solved by minimizing the following objective function:

\min_{Z \in \{0,1\}^{n \times k}} \; \sum_{i=1}^{n} \sum_{c=1}^{k} Z_{ic} \, \| \phi(x_i) - \mu_c \|^2, \quad \text{s.t.} \; \sum_{c=1}^{k} Z_{ic} = 1, \quad (4)

where n_c = \sum_{i=1}^{n} Z_{ic} and \mu_c = \frac{1}{n_c} \sum_{i=1}^{n} Z_{ic} \phi(x_i) denote the sample number and the centroid of the c-th cluster in H.
With some algebra, (4) can be transformed into the following matrix form:

\min_{Z} \; \mathrm{Tr}(K) - \mathrm{Tr}\big( L^{1/2} Z^{\top} K Z L^{1/2} \big), \quad \text{s.t.} \; Z \mathbf{1}_k = \mathbf{1}_n, \quad (5)

where Tr(·) denotes the trace of a matrix, K is the kernel matrix with K_{ij} = φ(x_i)^{\top} φ(x_j), L = \mathrm{diag}(n_1^{-1}, n_2^{-1}, \ldots, n_k^{-1}), and \mathbf{1}_l ∈ R^l is a column vector with all elements equal to 1.
Problem (5) is hard to solve since Z is discrete. Fortunately, it can be approximated by relaxing Z to take arbitrary real values. By defining H = Z L^{1/2} and letting H take real values, the relaxed version of (5) is obtained as:

\min_{H \in \mathbb{R}^{n \times k}} \; \mathrm{Tr}\big( K (I_n - H H^{\top}) \big), \quad \text{s.t.} \; H^{\top} H = I_k, \quad (6)

where I_k is the k×k identity matrix. Since Z^{\top} Z = L^{-1}, we have L^{1/2} Z^{\top} Z L^{1/2} = I_k, which yields the orthogonality constraint on H. Finally, (6) can be solved by taking the k eigenvectors that correspond to the k largest eigenvalues of K.
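The spectral solution of the relaxed problem (6) is a one-liner in NumPy; the function name below is ours.

```python
import numpy as np

def relaxed_kernel_kmeans(K, k):
    """Minimise Tr(K (I - H H^T)) s.t. H^T H = I_k: the optimum H is spanned
    by the eigenvectors of K belonging to its k largest eigenvalues."""
    _, eigvecs = np.linalg.eigh(K)   # eigenvalues returned in ascending order
    return eigvecs[:, -k:]           # columns for the k largest eigenvalues
```

Discrete cluster labels are then typically recovered by running ordinary k-means on the rows of H.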

C. MULTIPLE KERNEL K-MEANS CLUSTERING
In a multiple kernel setting, each sample is represented via a group of feature mappings {φ_p(·)}_{p=1}^{m} [45]. In detail, each sample is represented as φ_ξ(x) = [ξ_1 φ_1(x)^{\top}, ξ_2 φ_2(x)^{\top}, \ldots, ξ_m φ_m(x)^{\top}]^{\top}, where ξ = [ξ_1, ξ_2, \ldots, ξ_m]^{\top} collects the coefficients of the base kernels that we need to learn. As a consequence, the kernel function over the above mapping can be written as:

K_ξ(x_i, x_j) = φ_ξ(x_i)^{\top} φ_ξ(x_j) = \sum_{p=1}^{m} ξ_p^2 K_p(x_i, x_j). \quad (7)

By replacing the kernel matrix K in (6) with K_ξ calculated via (7), the multiple kernel k-means clustering problem can be rewritten in the following form [37]:

\min_{H, ξ} \; \mathrm{Tr}\big( K_ξ (I_n - H H^{\top}) \big), \quad \text{s.t.} \; H^{\top} H = I_k, \; ξ^{\top} \mathbf{1}_m = 1, \; ξ_p \ge 0. \quad (8)

Kernel alignment maximization has been widely used to learn kernel parameters in supervised learning. However, it is not directly applicable to unsupervised learning, where the true labels are absent [46]. An effective solution is to update the kernel coefficients by maximizing the alignment between the combined kernel K_ξ and HH^{\top}, where H can be regarded as the pseudo-labels from the last iteration [47]. Specifically, the kernel alignment maximization for multiple kernel clustering can be formulated as follows:

\max_{H, ξ} \; \frac{ \langle K_ξ, H H^{\top} \rangle }{ \sqrt{ \langle K_ξ, K_ξ \rangle } }, \quad \text{s.t.} \; H^{\top} H = I_k, \; ξ^{\top} \mathbf{1}_m = 1, \; ξ_p \ge 0, \quad (9)

where \langle K_ξ, H H^{\top} \rangle = \mathrm{Tr}(K_ξ H H^{\top}) and \langle K_ξ, K_ξ \rangle = \bar{ξ}^{\top} M \bar{ξ} with \bar{ξ} = [ξ_1^2, ξ_2^2, \ldots, ξ_m^2]^{\top}, and M is a positive semi-definite matrix with M_{pq} = \mathrm{Tr}(K_p^{\top} K_q) [46]. Directly optimizing (9) is difficult since it is a fourth-order fractional optimization problem. Thus, we need to derive a new, approximate optimization problem.
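The quantities appearing in (7) and (9) are straightforward to compute. Below is a minimal NumPy sketch; the function names are ours, and the constant factor ||HH^T||_F in the alignment denominator is dropped since it does not affect the optimization.

```python
import numpy as np

def combined_kernel(kernels, xi):
    """K_xi = sum_p xi_p^2 K_p, as in (7)."""
    return np.einsum('p,pij->ij', xi**2, np.asarray(kernels))

def alignment_matrix(kernels):
    """M with M_pq = Tr(K_p^T K_q): a Gram matrix of the flattened kernels,
    hence positive semi-definite."""
    F = np.stack([np.asarray(K).ravel() for K in kernels])
    return F @ F.T

def alignment_objective(kernels, xi, H):
    """The ratio maximised in (9), up to the constant ||H H^T||_F."""
    K_xi = combined_kernel(kernels, xi)
    xi_bar = xi**2
    return np.trace(K_xi @ H @ H.T) / np.sqrt(xi_bar @ alignment_matrix(kernels) @ xi_bar)
```

Note that `alignment_matrix` exploits the identity Tr(K_p^T K_q) = vec(K_p)^T vec(K_q), so M is computed as one matrix product instead of m² individual traces.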
In the following, we use Lemma 1 to obtain a second-order upper bound for the denominator in (9).
Lemma 1: ξ^{\top} M ξ is an upper bound of \bar{ξ}^{\top} M \bar{ξ}.
Proof: For a pair of positive semi-definite matrices K_p and K_q, there exist matrices U_p and U_q such that K_p = U_p U_p^{\top} and K_q = U_q U_q^{\top}. As a result, we have M_{pq} = \mathrm{Tr}(K_p^{\top} K_q) = \mathrm{Tr}(U_p U_p^{\top} U_q U_q^{\top}) = \mathrm{Tr}\big((U_p^{\top} U_q)(U_p^{\top} U_q)^{\top}\big) = \|U_p^{\top} U_q\|_F^2 \ge 0, where ||·||_F denotes the Frobenius norm of a matrix. Since ξ^{\top} \mathbf{1}_m = 1 and ξ_p \ge 0, we have 0 \le ξ_p \le 1 and hence ξ_p^2 \le ξ_p. Therefore, ξ^{\top} M ξ = \sum_{p,q=1}^{m} M_{pq} ξ_p ξ_q \ge \sum_{p,q=1}^{m} M_{pq} ξ_p^2 ξ_q^2 = \bar{ξ}^{\top} M \bar{ξ}. This completes the proof.
ξ^{\top} M ξ is much easier to handle than \bar{ξ}^{\top} M \bar{ξ} since it leads to a well-studied quadratic program. In addition, this term also works as a regularization on the kernel coefficients, preventing ξ_p and ξ_q from being jointly assigned large weights when M_{pq} is relatively high.
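Lemma 1 is easy to verify numerically for simplex-constrained weights. The check below draws random positive semi-definite kernels and random simplex points; all names are ours.

```python
import numpy as np

rng = np.random.default_rng(42)
m, n = 4, 6
# random PSD base kernels K_p = A A^T
kernels = [A @ A.T for A in (rng.normal(size=(n, n)) for _ in range(m))]
M = np.array([[np.trace(Kp.T @ Kq) for Kq in kernels] for Kp in kernels])

for _ in range(100):
    xi = rng.dirichlet(np.ones(m))          # xi_p >= 0 and sum_p xi_p = 1
    lhs = xi @ M @ xi                       # xi^T M xi
    rhs = (xi**2) @ M @ (xi**2)             # xi_bar^T M xi_bar
    assert lhs >= rhs - 1e-12               # Lemma 1: upper bound holds
```

The inequality relies on M_pq >= 0 and 0 <= xi_p <= 1, exactly the two facts established in the proof.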
In addition, we find that minimizing the negative of the numerator, i.e., −Tr(K_ξ H H^{\top}), together with ξ^{\top} M ξ cannot guarantee that the whole objective is convex w.r.t. ξ with H fixed. This would degrade the quality of the solution at each iteration, leading to a sub-optimal solution. We therefore use the following theorem to provide a convex substitute for −Tr(K_ξ H H^{\top}).
Theorem 1: With H^{\top} H = I_k, \mathrm{Tr}(K_ξ (I_n - H H^{\top})) \ge 0, and it is convex w.r.t. ξ with H fixed.
Proof: Since H^{\top} H = I_k, we have H H^{\top} H = H. Denoting H = [h_1, h_2, \ldots, h_k], we obtain H H^{\top} h_c = h_c for all 1 \le c \le k. This means HH^{\top} has k eigenvalues equal to 1, and its rank is no more than k, which implies that it has n − k eigenvalues equal to 0. As a consequence, I_n − HH^{\top} has n − k eigenvalues equal to 1 and k eigenvalues equal to 0, and is thus positive semi-definite. This induces \mathrm{Tr}(K_p (I_n − H H^{\top})) \ge 0. With \mathrm{Tr}(K_ξ (I_n − H H^{\top})) = \sum_{p=1}^{m} ξ_p^2 \mathrm{Tr}(K_p (I_n − H H^{\top})), we can conclude that \mathrm{Tr}(K_ξ (I_n − H H^{\top})) is a nonnegative convex quadratic function of ξ.
According to the above observations, the maximization problem (9) can be turned into the following minimization problem:

\min_{H, ξ} \; \mathrm{Tr}\big( K_ξ (I_n − H H^{\top}) \big) + λ \, ξ^{\top} M ξ, \quad \text{s.t.} \; H^{\top} H = I_k, \; ξ^{\top} \mathbf{1}_m = 1, \; ξ_p \ge 0, \quad (10)

where λ is a parameter used to balance the two terms.
As can be seen from (9) and (10), both align the combined kernel matrix K_ξ with the ideal kernel matrix HH^{\top} globally. In such a way, closer and farther sample pairs are equally aligned to the same ideal similarity, and the intra-cluster variation of samples is neglected. In other words, the discrimination between samples is not fully exploited. In addition, closer samples are not constrained to share similar label vectors. Therefore, the locality of the samples is not well preserved, which is a critical prior in unsupervised learning. In this article, we propose to locally align the similarity of each sample to its τ-nearest neighbours with the corresponding ideal kernel matrix, rather than enforce a global alignment over all samples; this is flexible and able to handle intra-cluster variations well. In addition, for each sample, we construct a local Laplacian matrix to regularize closer samples to be allocated similar labels. Thus the locality of the data samples can be well preserved.
Let K_ξ^{(i)} = S^{(i)\top} K_ξ S^{(i)} and H^{(i)} = S^{(i)\top} H denote the sub-matrices of K_ξ and H whose indices are specified by the τ-nearest neighbors of the i-th sample, where S^{(i)} ∈ {0,1}^{n×τ} is a matrix indicating the τ-nearest neighbors of the i-th sample. For each pair of neighboring samples, if they are close to each other, then their label vectors should also be similar; this can be formulated as follows:

\frac{1}{2} \sum_{p,q=1}^{τ} K_ξ^{(i)}(p, q) \, \| h_p^{(i)} − h_q^{(i)} \|^2, \quad (11)

where h_p^{(i)} and h_q^{(i)} represent the p-th and q-th rows of H^{(i)}, respectively. (11) can easily be rewritten in the following trace form:

\mathrm{Tr}\big( H^{(i)\top} L_τ^{(i)} H^{(i)} \big), \quad (12)

where L_τ^{(i)} = D^{(i)} − K_ξ^{(i)} is the local Laplacian matrix of sample i built from the local similarity matrix K_ξ^{(i)}, and D^{(i)} is the diagonal degree matrix of K_ξ^{(i)}. We then rewrite (10) in a locally regularized form and combine it with (12) to induce our final LPKA model:

\min_{H, ξ} \; \sum_{i=1}^{n} \mathrm{Tr}\big( K_ξ^{(i)} (I_τ − H^{(i)} H^{(i)\top}) \big) + λ \, ξ^{\top} M ξ + β \sum_{i=1}^{n} \mathrm{Tr}\big( H^{(i)\top} L_τ^{(i)} H^{(i)} \big), \quad \text{s.t.} \; H^{\top} H = I_k, \; ξ^{\top} \mathbf{1}_m = 1, \; ξ_p \ge 0, \quad (13)

where I_τ is the identity matrix of size τ×τ and β balances the local Laplacian regularization. By defining A^{(i)} = S^{(i)} S^{(i)\top}, we obtain the objective function of our proposed LPKA:

\min_{H, ξ} \; \sum_{i=1}^{n} \mathrm{Tr}\big( K_ξ (A^{(i)} − A^{(i)} H H^{\top} A^{(i)}) \big) + λ \, ξ^{\top} M ξ + β \sum_{i=1}^{n} \mathrm{Tr}\big( H^{(i)\top} L_τ^{(i)} H^{(i)} \big), \quad \text{s.t.} \; H^{\top} H = I_k, \; ξ^{\top} \mathbf{1}_m = 1, \; ξ_p \ge 0. \quad (14)
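The local ingredients of the model, the neighborhood selector S^{(i)}, the sub-kernel, and the local Laplacian, can be sketched as follows. Function names and the tie-breaking of neighbours (a sample counts as its own nearest neighbour) are our choices.

```python
import numpy as np

def neighborhood_selector(K_xi, i, tau):
    """S^(i): n x tau 0/1 matrix selecting the tau nearest neighbours of
    sample i, with 'nearest' measured by similarity under K_xi."""
    n = K_xi.shape[0]
    idx = np.argsort(K_xi[i])[::-1][:tau]      # tau most similar samples
    S = np.zeros((n, tau))
    S[idx, np.arange(tau)] = 1.0
    return S

def local_laplacian(K_xi, S):
    """L_tau^(i) = D^(i) - K_xi^(i), built from the tau x tau sub-kernel."""
    K_loc = S.T @ K_xi @ S
    D = np.diag(K_loc.sum(axis=1))
    return D - K_loc

def local_label_smoothness(K_xi, H, tau):
    """The beta-regularizer: sum_i Tr(H^(i)^T L_tau^(i) H^(i))."""
    total = 0.0
    for i in range(K_xi.shape[0]):
        S = neighborhood_selector(K_xi, i, tau)
        H_i = S.T @ H                          # rows of H restricted to the neighbourhood
        total += np.trace(H_i.T @ local_laplacian(K_xi, S) @ H_i)
    return total
```

For nonnegative kernels each L_tau^(i) is a graph Laplacian, so the regularizer is nonnegative and small exactly when neighbouring samples carry similar label vectors.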

1) OPTIMIZATION ALGORITHM
Note that (14) is not jointly convex with respect to H and ξ, but it is convex in each variable when the other is fixed. Thus, we design a two-step algorithm that solves the problem alternately.
Step 1: Updating H with ξ fixed. When ξ is fixed, H can be obtained by solving the following optimization problem:

\min_{H} \; \sum_{i=1}^{n} \mathrm{Tr}\big( K_ξ (A^{(i)} − A^{(i)} H H^{\top} A^{(i)}) \big) + β \sum_{i=1}^{n} \mathrm{Tr}\big( H^{(i)\top} L_τ^{(i)} H^{(i)} \big), \quad \text{s.t.} \; H^{\top} H = I_k,

which can be rewritten as:

\min_{H} \; \mathrm{Tr}\Big( H^{\top} \Big( β \sum_{i=1}^{n} S^{(i)} L_τ^{(i)} S^{(i)\top} − \sum_{i=1}^{n} A^{(i)} K_ξ A^{(i)} \Big) H \Big), \quad \text{s.t.} \; H^{\top} H = I_k,

which is a standard (relaxed) kernel k-means clustering problem and can be efficiently solved by eigen-decomposition.
Step 2: Updating ξ with H fixed. Given fixed H, the optimization of (14) w.r.t. ξ is a quadratic program with linear constraints, which turns into the following problem:

\min_{ξ} \; ξ^{\top} \big( λ M + \mathrm{diag}(z) \big) ξ, \quad \text{s.t.} \; ξ^{\top} \mathbf{1}_m = 1, \; ξ_p \ge 0,

where z_p = \sum_{i=1}^{n} \mathrm{Tr}\big( K_p (A^{(i)} − A^{(i)} H H^{\top} A^{(i)}) \big) + β \sum_{i=1}^{n} \mathrm{Tr}\big( H^{(i)\top} (D_p^{(i)} − K_p^{(i)}) H^{(i)} \big) collects the terms of base kernel p, and in the local setting the entries of M are computed as M_{pq} = \sum_{i=1}^{n} \mathrm{Tr}(K_p A^{(i)} K_q A^{(i)}). This problem can be efficiently solved by off-the-shelf quadratic programming solvers.
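The ξ-update is a small quadratic program over the probability simplex. A sketch using SciPy's SLSQP solver follows; the decomposition into a vector z of per-kernel terms and a matrix M is our reconstruction of the lost equation and should be treated as an assumption.

```python
import numpy as np
from scipy.optimize import minimize

def update_xi(z, M, lam):
    """Minimise xi^T (lam * M + diag(z)) xi over the simplex
    {xi_p >= 0, sum_p xi_p = 1}. z_p collects the fixed-H alignment and
    Laplacian terms of base kernel p (our decomposition)."""
    m = len(z)
    Q = lam * M + np.diag(z)                   # quadratic form, PSD when z >= 0
    res = minimize(lambda xi: xi @ Q @ xi,
                   x0=np.full(m, 1.0 / m),     # start from uniform weights
                   bounds=[(0.0, None)] * m,
                   constraints=({'type': 'eq', 'fun': lambda xi: xi.sum() - 1.0},),
                   method='SLSQP')
    return res.x
```

A dedicated QP solver would be faster for large m, but SLSQP handles the simplex constraints directly and suffices for the typical m = 66 kernels.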
In summary, our algorithm for solving (14) is outlined in Algorithm 1, where obj_t denotes the objective value at the t-th iteration.

2) CONVERGENCE ANALYSIS
In Algorithm 1, the neighborhood of each sample is kept unchanged during the optimization. Specifically, the τ-nearest neighbors of sample i are measured by K_ξ^{(i)}. In such a manner, the objective value of Algorithm 1 is guaranteed to decrease monotonically when updating one variable with the other fixed at each iteration. Meanwhile, the whole optimization problem is lower-bounded. As a consequence, the proposed algorithm is guaranteed to converge. To empirically study the convergence of Algorithm 1, we show the variation of the objective values of Eq. (14) on the datasets in Figure 1, which demonstrates that our proposed optimization algorithm is very efficient, i.e., the objective value decreases monotonically and the algorithm converges in fewer than five iterations.

1) EXPERIMENTAL SETTINGS
To systematically evaluate the clustering performance of LPKA on the collected datasets, three metrics are used in our experiments: Normalized Mutual Information (NMI) [48], Purity [49], and Adjusted Rand Index (ARI) [50]. NMI and Purity take values between 0 and 1, while ARI can be negative. The three metrics measure the concordance between the ground-truth cell types and the cell types produced by the clustering algorithms; thus higher values indicate better performance for all metrics.
Given two clustering results U and V on a set of N data points with N_U and N_V clusters, respectively, the normalized mutual information NMI is defined as

\mathrm{NMI}(U, V) = \frac{ I(U, V) }{ \sqrt{ H(U) H(V) } },

where the numerator I(U, V) is the mutual information between U and V, and the denominator contains the entropies H(U) and H(V) of the clusterings U and V.
For Purity, each identified cluster is assigned to the true label that is most frequent in the cluster, and then the accuracy of this assignment is computed by counting the number of correctly assigned samples divided by the number N:

\mathrm{Purity}(U, V) = \frac{1}{N} \sum_{u=1}^{N_U} \max_{v} \, | U_u \cap V_v |.

The ARI depends on the following four quantities:
• O_uv, the number of pairs of objects placed in the same group in both U and V;
• O_u, the number of pairs of objects placed in the same group in U but in different groups in V;
• O_v, the number of pairs of objects placed in the same group in V but in different groups in U;
• O, the number of pairs of objects placed in different groups in both U and V.
Then, ARI is defined as

\mathrm{ARI} = \frac{ 2 (O_{uv} O − O_u O_v) }{ (O_{uv} + O_u)(O_u + O) + (O_{uv} + O_v)(O_v + O) }.

As can be seen from Eq. (14), three parameters in LPKA, namely λ, β, and τ, need to be set. In our experiments, we tune all the parameters with a ''grid-search'' strategy. Specifically, λ and β are chosen from {0.01, 0.1, 1, 10, 100}, and τ is chosen from {0.3n, 0.4n, 0.5n}. Finally, the best clustering results are used for comparison.
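For reference, scikit-learn's `normalized_mutual_info_score` and `adjusted_rand_score` cover NMI and ARI; the sketch below implements Purity and the pair-count form of ARI above directly in NumPy, with function names of our choosing.

```python
import numpy as np
from itertools import combinations

def purity(labels_true, labels_pred):
    """Each predicted cluster votes for its most frequent true label."""
    correct = 0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        correct += np.bincount(members).max()
    return correct / len(labels_true)

def ari_pair_counts(labels_true, labels_pred):
    """ARI from the four pair-count quantities O_uv, O_u, O_v, O."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_u = labels_true[i] == labels_true[j]
        same_v = labels_pred[i] == labels_pred[j]
        if same_u and same_v:
            a += 1                              # O_uv
        elif same_u:
            b += 1                              # O_u
        elif same_v:
            c += 1                              # O_v
        else:
            d += 1                              # O
    return 2 * (a * d - b * c) / ((a + b) * (b + d) + (a + c) * (c + d))
```

The O(N^2) pair loop is fine for datasets of a few thousand cells; a contingency-table formulation is preferable at larger scale.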

2) COMPARED WITH OTHER METHODS
We compare the proposed LPKA with several existing methods, including PCA, traditional k-means, t-SNE [51], spectral clustering (''SC''), sparse spectral clustering (''SSC''), and SIMLR [40]. For a fair comparison, the parameters of all compared methods are tuned carefully as suggested in their corresponding papers. We show the NMI, Purity, and ARI of the different clustering algorithms on the different datasets in Figure 2, Figure 3, and Figure 4, respectively. As can be seen, the proposed LPKA outperforms all the other methods on the 9 datasets in terms of the three metrics, which demonstrates the superiority of LPKA. Therefore, our proposed LPKA can be used as a reliable pre-processing step to distinguish different cell types, which can reveal new insights into many other biological problems.

3) PARAMETER SENSITIVITY
In our experiments, we tune the parameters λ, β, and τ to obtain the optimal results. To study the sensitivity of LPKA with regard to the parameters in Eq. (14), we conduct experiments by fixing one of the three parameters and varying the other two. First, we fix β = 1 and vary τ and λ; Figures 5-7 plot the values of NMI, Purity, and ARI, respectively, with β fixed. Second, we fix λ = 1 and vary τ and β; Figures 8-10 plot the values of NMI, Purity, and ARI, respectively, with λ fixed. Third, we fix τ = 0.4n and vary λ and β; Figures 11-13 plot the values of NMI, Purity, and ARI, respectively, with τ fixed. As can be seen, the clustering results are robust to variations of λ, while they are somewhat sensitive to β and τ, which demonstrates the significance of preserving the local structure of the original data. Optimal clustering results can be obtained with different combinations of the three parameters.

IV. CONCLUSION
In this work, we present a novel multiple kernel clustering framework for scRNA-seq data clustering via locality preserving kernel alignment. A series of similarity kernel matrices are firstly generated by using different kernel functions. Then we transfer the clustering task to a multiple kernel k-means problem with the kernels aligned in a local manner. In order to preserve the local structure of the data for boosting final clustering performance, we construct a local Laplacian matrix for each sample to constrain that closer samples should be allocated similar labels. An alternate updating algorithm with theoretical analysis is developed to solve the proposed problem. Experiments with parameter sensitivity analysis on various real scRNA-seq data are conducted to demonstrate that our method can obtain superior results when compared with other state-of-the-art approaches.
ACKNOWLEDGMENT
(Xiao Zheng and Jiajia Chen contributed equally to this work.)

V. AVAILABILITY OF DATA AND MATERIALS
The implementation code and supporting files are available from http://tangchang.net/codes/LPKACode.zip