AAE-SC: A scRNA-Seq Clustering Framework Based on Adversarial Autoencoder

Single-cell RNA sequencing (scRNA-seq) provides the expression profiles of individual cells, and it is expected to provide higher cellular differential resolution than traditional bulk RNA sequencing. In scRNA-seq analysis, clustering is crucial for identifying cell types, and can be potentially exploited to understand high-level biological processes. Recently, autoencoder has been successfully applied in scRNA-seq clustering problem and achieved promising results. Most existing works focus on characterizing the sparsity of data, and directly utilize the bottleneck feature of the autoencoder for clustering might not be optimal. In this paper, a novel framework named Adversarial AutoEncoder ScRNA-seq Clustering (AAE-SC) is proposed to bring an additional constraint on the bottleneck feature. Specifically, AAE-SC adds an AAE module on top of the bottleneck layer, and constrains the bottleneck feature distribution to be aligned with a consistent distribution. Also, the AAE and the reconstructed modules are jointly optimized to generate a highly discriminative and consistent feature, which is further proceeded for clustering. We find that by using AAE-SC to impose certain constraints on the features of the hidden layer, the performance of clustering can be improved. Experimental results on three real-world datasets demonstrate that the proposed AAE-SC framework outperformed state-of-the-art methods by 2% at least and 5% at most. And AAE-SC shows more robustness than the baseline model for downsampled and unbalanced cluster size datasets.


I. INTRODUCTION
Technological advances in single-cell RNA sequencing (scRNA-seq) [1]- [5] have revolutionized transcriptomic studies by providing higher resolution of individual cellular differences of transcriptomes than commonly used bulk RNA sequencing. They allow researchers to systematically study the cellular heterogeneity, cellular developmental trajectories and classification of tumor sub-population across a large number of cells [6]. Unsupervised clustering is an essential step in the analysis of scRNA-seq to achieve the above tasks. Only after clustering, the cell types can be identified, and researchers can further depict the cellular functional states and infer the potential cellular dynamics [7].
Although clustering is a traditional machine learning research field [8]- [10] and there have been some representative methods such as k-means [11] and spectral clustering [12], clustering analysis on scRNA-seq data is The associate editor coordinating the review of this manuscript and approving it for publication was Wei Wang . still a challenge due to the dropout occurring in the raw data [13]. The dropout refers to the fact that there are some false-zero counts and gene count matrix may contain actually no reported data, which are caused by low sequencing depth and other technology limits. As shown in Fig. 1, different heat map colors indicate different gene expression levels (the value in the gene count matrix). It is obvious that most genes in cells have very low expression level and only a few genes express over 0. Therefore the dropout makes the scRNA-seq data highly sparse, and traditional clustering approaches fail to deal with this data. To alleviate this problem, several specific clustering algorithms including SNN-Clip [14], single-cell interpretation via multikernel learning (SIMLR) [15] and MPSSC [16] for scRNA-seq data have been proposed. However, their computational cost is huge for large-scale datasets, and the clustering performance is still inferior.
Recently, deep learning technology [17] has made significant breakthroughs in computer vision, natural language processing and other cross domains based on deep neural FIGURE 1. Explanation of scRNA-seq clustering task. Because dropout causes the gene expression level of the original data to be very low, the data is highly sparse, and it brings difficulties to the subsequent clustering. Therefore, special clustering algorithms are required to process this type of data and correctly assign different cell samples to different And identify the cell type. The heat map on the left of the figure is a visual representation of raw scRNA-seq data, and the numbers in the heat map indicate the expression value of each gene in the cell sample. The color bars in the figure indicate the level of gene expression.
networks (DNN) [18]. DNN can process high-dimensional data and extract efficient and effective features, thus becomes the dominant option for big data analysis. Among a variety of deep learning frameworks, the autoencoder [19] is a classic unsupervised algorithm. It consists of two components: an encoder and a decoder with symmetrical structures. The encoder first projects high-dimensional data to a low-dimensional feature in the hidden layer (i.e. the bottleneck feature), then the decoder attempts to reconstruct the original data from this bottleneck feature. If the bottleneck feature can reconstruct the input data well, or is considered to contains the most informative components of the original data, it is straightforward to be utilized for clustering. Actually, there are several relevant attempts which apply the autoencoder algorithm to scRNA-seq, including scScope [20], scvis [21], deep count autoencoder (DCA) [22] and single-cell-model-based deep embedded clustering (scDeepCluster) [23]. These models improve the clustering performance of scRNA-seq and significantly speed up calculations, but most of them focus on addressing the sparsity problem, and directly utilize the bottleneck feature for clustering. Without constraints on hidden code during feature learning process, the latent features might be noisy and distorted, which are not good for clustering.
In this paper, we propose a novel framework, named Adversarial AutoEncoder ScRNA-seq Clustering (AAE-SC). Our baseline model scDeepCluster lacks constraint on the hidden feature and its performance on clustering is limited. Therefore, inspired by the generative model Adversarial Autoencoder (AAE) [24] which can match the latent feature to any prior distribution while processing the data reconstruction stage, here we add an AAE module on the basis of scDeepCluster to preserve the data structure in hidden layer during the feature learning, forming the AAE-SC framework.
Specifically, AAE-SC first trains the additional discriminator network and data reconstruction module by the adversarial loss and the zero-inflated negative binomial (ZINB) loss. After acquiring the constrained initial features, AAE-SC clusters the hidden features by jointly optimizing the reconstruction loss and clustering loss from an improved deep clustering layer. Finally, experiments on several real-world datasets demonstrate that the proposed AAE-SC framework can considerably outperform the state-of-the-art models on three clustering evaluation metrics. Also, subsequent experiments also show that AAE-SC achieves better robustness than the baseline model. Our main contributions are as follows: • We proposed AAE-SC framework which innovatively utilizes the adversarial autoencoder component to constrain the low-dimensional feature and uses the constrained hidden feature for clustering.
• The proposed AAE-SC framework is evaluated on three real-world scRNA-seq datasets and the clustering results on the three clustering metrics are at least 2% and at most 5% better than that of the state-of-the-art model.
• Furthermore, regarding experiments on datasets with downsampled and unbalanced cluster size, our model also shows better robustness compared to the baseline model. And the effect of clustering coefficient on clustering performance and the network structure selection of AAE-SC is studied in detail. The remaining parts of this paper are organized as follows: The Section II mainly reviews the representative works of scRNA-seq clustering. The Section III introduces the proposed AAE-SC framework. The Section IV describes the dataset information, the relevant implemental details of the model and evaluation metrics of the experiment. The following section is about the analysis and discussion of the experimental results. Finally, we summarize the work of the paper and the look forward to the future work in the Section VI. Table 1 provides the abbreviation-full name comparison table of the full article.

II. RELATED WORK
In this section, we briefly reviewed and summarized the representative works in scRNA-seq clustering analysis. And we focused on these works from two aspects: traditional clustering methods and deep learning-based methods.

A. TRADITIONAL METHODS
Early researchers applied the traditional clustering algorithms to analyze the scRNA-seq data. The SNN-Clip [14] identified groups of cells that are densely connected by graph-based clique algorithm. It leveraged the concept of shared nearest neighbor to calculate the cell similarity for finding all quasicliques. Then several algorithms based on k-means have been proposed. RaceID [25] utilized k-means to unravel the heterogeneity of rare intestinal cell types. SAIC [26] applied an iterative k-means to identify the optimal subset of signature genes that separate single cells into distinct clusters. Since k-means is a greedy algorithm, these methods may fail to find their global optimum. Besides, k-means is sensitive to outliers since it tends to identify globular clusters, resulting in the failures in detecting of rare cell types. To overcome the above disadvantages, the RaceID2 [27] replaced k-means with k-medoids clustering and later the improved version of RaceID3 [28] added the random forest component to ameliorate the clustering accuracy. Some researchers have also tried to add additional constraints to the feature extraction phase before the clustering phase on scRNA-seq data begin. The LAK [29] integrated Lasso penalty into clustering method as the feature selection approaches, and then using k-means algorithm based on the selected genes which have an actual effect on clustering.
Some researchers also tried to determine the variety of cell groups by spectral clustering method. The SIMLR [15] used Gaussian Kernel and assisted spectral clustering to learn a better distance metric to model the special data structure. In addition, SIMLR can process the large-scale datasets with heavy noise. MPSSC [16] innovatively used L1 penalty to characterize the sparsity of data with multi-kernel spectral clustering. SinNLRR [30] was proposed to impose a non-negative and a low rank structure on cell similarity matrix and then utilized spectral clustering to detect cell types.
Although these methods have improved clustering performance on scRNA-seq (better performance on cluster metrics, see the Section V for details), they were usually not very scalable and required huge computing resources and storage when dealing with the large-scale dataset (for example, researches [23] have shown that it is hard to run the datasets which contain over 4000 cell samples with even large memory such as 256GB by MPSSC and SIMLR, and the clustering time of some spectral clustering methods on 2000 sample datasets is more than 10 times longer than other algorithms [31]). Some scalable tools like Seurat [32] and SCANPY [33] which utilized Louvain algorithm to detect the community have low time complexity on large-scale datasets, but they may not find small communities and therefore reduce the accuracy of clustering. Some researchers also tried to use existing datasets as reference to identify to cell types of scRNA-seq data. The single-cell Clusterbased automatic Annotation Toolkit for Cellular Heterogeneity (scCATCH) [34] algorithm annotated cell types through the tissue-specific cellular taxonomy reference database and the evidence-based scoring protocol.

B. DEEP LEARNING METHODS
Recently, deep learning has made breakthroughs in many areas of bioinformatics [9], [35], [36]. Among all the deep learning techniques, autoencoder has been the most popular so far. There has been many autoencoder approaches which aim to deal with scRNA-seq data more efficiently and accurately. Lin et al. [37] tried to reduce the dimensions of scRNA-seq data by neural networks with prior biological knowledge. The scScope [20] used a stacked auto-encoder to build a recurrent model and conducted batch effect removal, dropout imputation and cell subpopulation identification. Inspired by the recent success of autoencoders for sparse matrix imputation in collaborative filtering for recommendation system, Talwar et al. [38] proposed AutoImpute, which was also based on autoencoder and aimed to handle the dropout in scRNA-seq data. This model utilized overcomplete autoencoder to regenerate the imputed expression matrix by focusing on the non-zero entries in the input sparse matrix. Some works like deep variational autoencoder for scRNA-seq data (VASC) [39] and scvis [21] both utilized variational autoencoder (VAE) [40] to characterize the data structure of scRNA-seq afterwards. VASC modeled the dropout events and attempted to find the nonlinear hierarchical feature representation of original data, while scvis inferred the approximate posterior distribution of low-dimensional latent variables and accordingly learned a parametric mapping from a high-dimensional space to a low-dimensional embedding.
The imputation model DCA [22] adjusted the reconstruction loss of traditional auto-encoder into a special ZINB model-based loss function, and the loss function is the likelihood of the ZINB distribution. DCA constructed a denoising autoencoder with three neuron nodes in the output layer, which represented the mean of denoised data and two parameters of ZINB distribution respectively. It modeled special sparsity structure and inferenced dropout events of scRNA-seq data. On the basis of DCA, the scDeep-Cluster [23] added an extra clustering layer inspired by a deep learning clustering method of improved deep clustering (IDEC) [41], and it can iteratively update clustering assignment after trained the DCA. The scDeepCluster outperformed DCA in the performance of clustering task and became stateof-the-art approach for scRNA-seq clustering.

C. SUMMARY OF EXISTING METHODS
In general, previous researchers have improved traditional clustering algorithms or used deep learning algorithms to implement clustering analysis on scRNA-seq data and achieved better performances.
As for the algorithms using the k-means method, the RaceID3 [28] and LAK [29] algorithms have effectively improved the traditional k-means algorithm by using the random forest components and the Lasso penalty respectively. Also, they achieved better performance than other k-means based algorithms. On the benchmark dataset 10X PBMC, RaceID3 achieved about 69%, 70% and 55% on the three evaluation metrics of Clustering Accuracy (ACC), Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI), respectively. As for the LAK algorithm, the three metrics are improved to 78%, 75% and 68%. Regarding the three spectral clustering based methods, including SIMLR [15], MPSSC [16] and SinNLRR [30], the last algorithm combined the advantages of the first two algorithms and used a unique low rank structure on the cell similarity matrix, which achieved about 77%, 74% and 66% on the above three metrics (The performance of the first two algorithms are: SIMLR: 62%, 72%, 52%, and MPSSC: 76%, 73%, 65%). For other traditional methods, Seurat [32] and SCANPY [33] are more used for scRNA-seq data preprocessing or coarse-grained analysis. And the innovative use of reference database for cell type identification of scCATCH [34] reached 83%, 76% and 73% on the ACC, NMI and ARI under 10X PBMC dataset, which also shows competitive performance compared to the k-means based and spectral clustering based algorithms.
The methods based on deep learning mainly use autoencoder as the core component. As a representative of the earlier work, the performance of the AutoImpute [38] and IDEC [41] algorithms on the 10X PBMC dataset reached about 72%, 71%, 61% and 70%, 70%, 55% on the three metrics respectively. Afterwards, some work such as DCA [22], VASC [39] and scvis [21] improved the previous algorithm from different starting points, which made the clustering performance improved. Among them, the scDeepCluster [23] method combined the previous researchers' modeling of the special sparsity and dropout noise of scRNA-seq data. Also, scDeep-Clsuter leveraged the deep embedded clustering method to make the two processes of denoising and clustering can be jointly trained and optimized, achieving the best performance on the scRNA-seq data clustering task. It reached 82%, 77% and 72% on the three metrics under 10X PBMC dataset, and also reached better clustering performance under other scRNA-seq datasets (all the numerical performances for each algorithms are shown in Table 4).
Although the above methods have achieved certain clustering performance, they still suffer from some shortcomings [7], [13], [31], [42]. K-means based algorithms are not good at directly determining the optimal value of the number of clusters, and are not good at handling samples with unbalanced data clusters [13], [42]. The computational time complexity and computational space consumption of algorithms based on spectral clustering are very huge, making them not suitable for the current large-scale data analysis [7], [31]. The accuracy of the scCATCH algorithm depends on the selection of good reference datasets, while the scDeepCluster algorithm lacks constraints on hidden layer features, which may be unfavorable for the subsequent clustering.

III. PROPOSED FRAMEWORK
In this section, the baseline model scDeepCluster is introduced first. Then, the proposed Adversarial AutoEncoder ScRNA Clustering (AAE-SC) model is described, alongside the training and optimization process of our method.
As shown in Fig. 2, AAE-SC consists of an AAE module with noise, three independent layers at the end of the decoder of AAE to estimate ZINB loss and the improved deep clustering layer.
To make the autoencoder more robust, the DAE incorporates an extra Gaussian noise into the input samples, and attempts to reconstruct the original input from corrupted data. In DAE, both the encoder and decoder are composed of fully connected layers which are low-dimensional compared to the raw data. By reconstructing the clean data, the hidden layer learns effective low-dimensional feature representation.
Although common practice tends to employ the meansquare error (MSE) loss to fulfill the reconstruction process in traditional autoencoder and DAE, the scRNA-seq data is too sparse that the MSE loss cannot rebuild the original data VOLUME 8, 2020 well. Therefore scDeepCluster utilizes a specific loss function based on ZINB distribution from DCA. This distribution has shown its effectiveness to model the highly sparse and overdispersed data. ZINB can be estimated by the mean (µ) and dispersion (θ) of the negative binomial distribution and the additional coefficient (π) which represents the probability of dropout: where X stands for the original data. The scDeepCluster uses three independent fully connected layer at the end of decoder to estimate the above parameters.
To better perform the clustering task, scDeepCluster also employs the method of IDEC method in the latent space features instead of using traditional clustering algorithm such as k-means directly. After obtaining the latent space features from the hidden layer of DAE, scDeepCluster uses the same clustering approach with IDEC. The method first computes the distribution Q of soft clustering labels in sample features, and then defines an auxiliary target distribution P based on Q. Finally the clustering loss is defined as Kullback-Leibler (KL) divergence between P and Q, which is illustrated as below: where q ij is the soft label of embedded sample z i . This variable is used to measure the similarity between sample z i and cluster center µ j by Student's t-distribution. After that, scDeepCluster iteratively uses the self-training strategy to compute the auxiliary target distribution p ij with previous q ij .

B. AAE-SC FRAMEWORK
In addition to modeling and constraining the reconstructed data output by the decoder with a special prior ZINB distribution, we also constrain the prior distribution of the bottleneck feature of DAE to preserve the latent data structure and generate more consistent feature. Recent researches use variational inference like AAE [24] to match the aggregated posterior of the latent features of the autoencoder with an arbitrary prior distribution, and they have been proved to be effective in many fields. Therefore we modify the DAE in scDeepCluster to an AAE by adding a discriminator D on top of the bottleneck layer, and use the original encoder as a generator.
Based on the implementation of DAE in scDeepCluster, the input data X input is corrupted by a zero-mean Gaussian random noise ε and becomes X noise . We define the encoder and decoder functions as Z = F W E (X noise ) and G W D (Z ), where Z stands for the latent space feature in the bottleneck layer. The weight W E and W D are learning parameters of encoder and decoder respectively. In addition to the raw data, we also add the zero-mean Gaussian random noise to each layer of the encoder and make the model more robust.
Similar to the generative adversarial network (GAN) [43], AAE uses an adversarial training procedure on the autoencoder and a discriminator to match the aggregated posterior of the hidden vector with the prior distribution, which aims to learn a better mapping function and hidden code. However, the purpose of AAE and GAN are completely different. GAN uses an adversarial training method to learn the data distribution of the original data, so the random noise can be converted into new data similar to the raw data through a generator. Whereas, AAE is trained to make the hidden layer feature of the autoencoder conform to a prior distribution. So the purpose of GAN is to generate new data, while the goal of AAE is to restrict the data distribution of existing features. In this article we adopted AAE to make constraint on the hidden feature so that it can be clustering-friendly.
The additional discriminator of AAE is also composed of fully connected layers. Meanwhile, the final layer's output dimension is set to be 1, which is to determine the authenticity of input samples. The input of the discriminator are the latent features from the bottleneck layer of DAE and the random sampling data from the prior distribution with the same dimensions as the former. The generated data from prior distribution is true data and its label is set to 1, while the label of latent feature is set to 0, which is regarded as fake data. The discriminator network utilizes the binary cross entropy loss to train and update parameters: where s i , z i and n stand for the generated samples from prior distribution, the latent feature and the batch size respectively. After updating the parameters of discriminator D, all weights inside are frozen.
Unlike the GAN structure, which has an independent generator, the adversarial autoencoder trains the encoder part as a generator to confuse the discriminator D, and let D judge whether the input samples generated by encoder are true samples: Through the above adversarial training process, the hidden features are aligned to the prior distribution and the whole AAE framework learns a good mapping function from input data to a low-dimensional feature which is suitable for the downstream cluster analysis.
In addition to the variance inference by AAE, our method also employs the ZINB loss as the reconstruction loss and utilizes an IDEC layer. To estimate the three parameters (π, µ, θ ) of ZINB distribution described above, the last layer of the decoder is replaced with three independent fully connected layers and their dimensions are the same with the input data. Thus the architecture of the decoder is given as below (Z represents the output of bottleneck layer in AAE-SC): where W M , W π and W θ are the learning parameters in the final three fully connected layers respectively. The size factor sf is an independent biological variable that is calculated by the library size and the median of cells. The reconstruction loss function of the ZINB distribution is the negative log transformation of the ZINB likelihood: L r = −log(ZINB(X |π, µ, θ )).
AAE-SC also has an IDEC layer on top of the hidden layer of AAE. Based on the above description, the clustering loss is computed by KL-divergence between P and Q as below:

C. TRAINING STRATEGY & OPTIMIZATION
In this paper, our model has two stages: 1) Combination of the adversarial training and reconstruction stage, which aims to constrain the prior distribution of the hidden layer coding while reconstructing the noisy original data. 2) Jointly optimizing the reconstruction loss and clustering loss on the constrained features listed above to iteratively update the clustering label assignment. The objective function of model is defined as below: where α is a clustering coefficient to adjust the clustering loss to avoid the clustering space to be distorted. Loss in the pre-training phase corresponding to L 1 , and L 2 represents the target function in clustering process. As for the above loss function, the three types of parameters can be optimized and updated by Stochastic Gradient Descent (SGD) and back propagation.
Specifically, as described in [41], [44], the gradient of L c with respect to the clustering center µ j and latent feature sample z i can be computed as below: And during the clustering process the clustering center µ j is updated by: where l is the learning rate and n is the value of mini batch. The decoder weights W D are updated by: In the stage 1 the encoder weights W E are updated by: And in the stage 2 the encoder weights W E are updated by:

IV. EXPERIMENTS
In this section, we provide quantitative comparisons of the proposed AAE-SC model to other state-of-the-art scRNA-seq clustering methods in two categories: traditional clustering models and deep learning models.

A. DATASETS
The proposed AAE-SC model is evaluated on three real-world scRNA-seq datasets coming from different sequencing platforms. All the datasets are publicly available. The statistics of the datasets are summarized in Table 2 and the detailed information is shown as follows:   [45]. There are over 4,000 cells with 16,000 genes in the dataset and it has 8 different clusters.
• Mouse Bladder Cells [46] 2 : This dataset comes from the Mouse Cell Atlas project by [46]. We select the bladder tissue cell data from overall 400,000 single cells, and they can be divided into 16 different groups.
• Worm Neuron Cells [47] 3 : It is a worm cell dataset profiled by the sci-RNA sequencing platform. About 50,000 cells from the nematode Caenorhabditis elegans at the L2 larval stage in the dataset have been measured by previous researchers and the corresponding cell types have been identified. Following the approach in [47], we select the subset of these neural cells and remove the unlabeled individuals. Therefore, the dataset we use consists of 4,186 cells with over 10,000 genes and there are 10 different categories in total.

B. EXPERIMENTAL METHODS
To evaluate the performance of our proposed AAE-SC, we compare it with sixteen algorithms, which are the representative and widely used works in scRNA-seq clustering. The descriptions of these methods are summarized in Table 3.

C. EVALUATION METRICS
In our experiments, three metrics of ACC, NMI and ARI are used to evaluate AAE-SC model, which are widely used in unsupervised learning scenario. Introduction of these metrics are as follows: • ACC: The clustering accuracy (ACC) is used to measure the matching level of the clustering labels assigned to the samples and their true labels. Given the sample i, the assignment label p i and its groundtruth label t i , the ACC is computed as: where n is the number of sample points and map() refers to best mapping between assigned labels and true labels. It can be solved by the Hungarian algorithm with polynomial time.
• NMI: The Normalized Mutual Information (NMI) measures the similarity of two clusters from the perspective of information theory. It is defined as: where I(T,P) represents the mutual information between the ground truth label T and the model-predicted assigned label P. H() denotes the entropy of the labels and n is the batch size.
• ARI: The Adjusted Rand Index (ARI) evaluates the similarity between two clustering results by computing the pair relationship improved from the original RI (Rand Index). Given ground truth label T and the predicted clustering results assignment P, we first compute the four mathematical quantities: a: the number of sample pairs which are divided into the same cluster in both T and P.  d: the number of sample pairs which are divided into different clusters in P but the same in T . Base on the above quantities, the ARI is defined as: The value range of ACC and NMI is both [0,1], and that of ARI is [-1,1] of ARI. For all the three metrics, a larger score indicates a more accurate clustering result.

D. IMPLEMENTATION DETAILS
In experiments, the proposed AAE-SC network architecture is constructed with the same layers as that of the baseline model, scDeepCluster. Meanwhile, the encoder network dimensions is set to input -128 -64 -32, where input stands for the dimension of input data, and the decoder has a symmetric structure with the encoder. Besides, the discriminator network is built with dimensions 32 -128 -64 -32 -1. The reasons of setting the first layer and its symmetrical layer to be 128 nodes will be discussed in the following section. Furthermore, the activation function of the last layer of discriminator is sigmoid, while other fully-connected layers are all activated by ReLU. In the pretraining stage, we utilize the optimizer Adam with learning rate 0.001 for all the datasets. As for the clustering phase, the optimizer Adadelta is applied with learning rate 1.0. We discussed the choice of initial learning rate in subsequent experiments.
Here the standard normal distribution N (0, 1) is exploited as the prior distribution to align the bottleneck feature. And the corresponding group number is employed to be the number of clusters for the clustering layer as prior information on each dataset. All weights in fully-connected layers of the proposed AAE-SC model are initialized with the Glorot uniform. The whole model is pretrained by 300 iterations first, then the clustering stage start. Meanwhile, special experiments are conducted to determine the more appropriate value of parameter α in the following section. The rest hyperparameters are set to be the same as scDeepCluster.

E. DATA PRE-PROCESSING
All scRNA-seq data were pre-processed using SCANPY [33], following the data pre-processing implementation of the baseline scDeepCluster: For each dataset, genes with expression values less than 5 and cells with expression values less than 1 were first filtered out. Then, the entire reading matrix was normalized so that each cell has a total count equal to the median of the gene counts of per cell before normalization. After that, logarithmic transformation was applied on the data, and it was scaled to unit variance and zero mean.

A. QUANTITATIVE ANALYSIS
The clustering analysis results on three real-world scRNAseq datasets are listed in Table 4. All experimental data are the average of 10 independent experimental results.
The proposed model is first compared with the three traditional methods: PCA+k-means, SIMLR and MPSSC. PCA+k-means is considered as a typical traditional method, both the PCA and k-means are very commonly utilized in clustering process. Compared to the PCA+k-means method, AAE-SC demonstrates huge superiority, with an overall increase of 17%-32% on all of the three datasets. Since the PCA method only focuses on reducing the dimension of the data rather than extract effective features for clustering, resulting in poor final clustering effect. By employing spectral clustering, SIMLR and MPSSC achieve a remarkable improvement over PCA+k-means method. Although spectral clustering is better than the ordinary PCA+k-means method, SIMLR cannot model the large amount of noise and dropout events present in scRNA-seq data effectively. The MPSSC adds an additional L1 penalty on the basic of spectral clustering, so its performance is better than SIMLR. However, this artificially designed constraint does not fully model the essential characteristics of scRNA-seq data. As a result, their performance is inferior to the method proposed in this paper. Meanwhile, our method is also superior to other algorithms including the RaceID3, LAK, SinNLRR and scCATCH, which are based on k-means algorithm, spectral clustering algorithm and reference databases respectively. It can be clearly seen that our algorithm is superior to these algorithms on each dataset.
DEC and IDEC are the early deep learning methods which used autoencoder for clustering. For IDEC, the decoder structure is retained for subsequent clustering on the basis of DEC, and it is apparent that IDEC performs better than DEC on all the three datasets. However, because the scRNA-seq data is quite different from the traditional image data, and these two algorithms are not specifically designed for the task of scRNA-seq data clustering. Experimental results of the two algorithms on this kind of data are even worse than traditional MPSSC method, and this holds true for the AutoImpute algorithm. On the other hand, although DCA, scvis and VASC characterize the scRNA-seq data by a specific ZINB loss and variance inference model VAE respectively, all of them ignore taking the advantage of deep clustering. Therefore, they only achieve the similar performance with traditional spectral clustering algorithm, restraining the ability of deep learning to process big data.
The baseline model, scDeepCluster, follows the method of DEC and IDEC and adds an extra clustering layer connecting the hidden layer of DCA model. In this way, the scDeepCluster not only effectively models and describes scRNA-seq data through ZINB distribution, but also strengthens the effect of subsequent clustering tasks by the cLustering layer. In this case, it outperforms the above methods and becomes the previous state-of-the-art. Compared with scDeepCluster, our improved model constrains the hidden layer data to prevent distortion of the data structure during the learning and clustering process, thus showing remarkable improvement on 10X PBMC and Mouse Bladder Cells dataset. Especially in the experiment of 10X PBMC, our model exceeds the original scDeepCluster by about 5% on both ACC and ARI metrics. This confirms the importance of maintaining data structures in hidden layer and the effect of AAE for improving the clustering performance.
The GAN [43] model with a confrontation training process similar to that of AAE is also compared, and it is found that the GAN model plus k-means algorithm performs the worst among all comparison algorithms. The reason is that although the GAN model uses similar adversarial training ideas, the essence of GAN is to generate data rather than extract useful features, and this model is not suitable for cluster analysis.
In addition, all of our original experimental performances have been tested by Wilcoxon Signed-Rank Test with other algorithms. The results are shown in Table 5. It is clear that when the significance level is 0.05, the performance of our algorithm on all metrics is significant different compared with the above algorithms, which also proves statistically that our algorithm is better.

B. QUALITATIVE ANALYSIS
As described above, the scDeepCluster is improved by adding an extra clustering layer on DCA first, and AAE-SC impose constraint on the hidden features of scDeepCluster by AAE. To evaluate clustering effects and effectiveness of AAE-SC versus the baseline methods more intuitively, we visualize the hidden embedded representations of AAE-SC, scDeepCluster and DCA on 10X PBMC dataset using TSNE [49].
In Fig. 3, it is obvious that the samples in the same cell group distribute in a wide range and cannot be well clustered in DCA. Taking advantage of the extra clustering layer, the performance of scDeepCluster is significantly better than DCA. Although similar cells are made compact and dense by scDeepCluster, some different clusters cannot be separated well (such as cluster 2&6 and cluster 4&5). Our AAE-SC overcomes the above problems and clusters the cell samples well into different groups, which will be beneficial for the subsequent biological analysis.

C. SELECTIONS OF THE NUMBER OF CLUSTERS
Since the k-means algorithm is utilized in the improved deep embedding clustering (IDEC) method of AAE-SC, and the k-means algorithm demands the number of clusters in the data in advance. In order to determine the optimal number of clusters, additional experiments on the above three datasets were conducted.
For each dataset, the number of clusters used by the kmeans algorithm in the experiment was chosen within the range of length 2 around the number of cell types in the dataset. For the above three datasets of 10X PBMC, Mouse Bladder Cells and Worm Neuron Cells, the selection range of the number of clusters fall into [6,10], [14,18] and [8,12] respectively. The clustering performance on these three datasets is shown in Fig. 4. It can be clearly seen that on each dataset, the clustering performance is the best when the number of clusters is equal to that of cell types in the dataset.
We also utilize the metric Davies-Bouldin Index (DBI) [51] to determine the optimal clustering effect when selecting the number of clusters. Introduction of this metric is as follows: where avg() represents average distance between samples in the cluster, and d cen () represents the distance between the center points of two clusters. The smaller the value of DBI, the better the clustering effect can be considered. Fig. 5 shows the DBI value under different number of clusters on three datasets. It can be obviously seen that on each dataset, the DBI value is the best when the number of clusters is equal to that of cell types in the dataset. This also verifies the results of our above experiment. So, the number of cell types in each dataset is taken as the number of clusters for the algorithm to reach the best clustering performance.

D. THE ROBUSTNESS OF AAE-SC
Robustness of the AAE-SC model was also studied and compared to that of the baseline model, scDeepCluster. SCANPY [33] was exploited to downsample the above three datasets, and limited the total gene counts of each cell sample in the dataset to the range of [500, 1000, 1500], so that the clustering performance of the two models under these noisy data could be observed. Fig. 6 shows the clustering performances of AAE-SC and scDeepCluster on the downsampled datasets. AAE-SC performs better than the baseline model on three different down-sampled noise data in almost every dataset. Especially when the downsampling reaches only 500 gene counts per cell, the NMI metric of AAE-SC outperforms that of the baseline model by about 10%, and the other two metrics by 5%. The superior performance indicates that the constrained features extracted by AAE-SC model are more robust than the hidden layer features of the scDeepCluster baseline model in case of high-noise data.

E. EXPERIMENTS ON UNBALANCED DATASETS
In addition, the ability of the AAE-SC model to handle unbalanced datasets were studied. The unbalanced dataset means that the number of samples of certain cell types (clusters) in the dataset is much smaller than that of the dataset. In this experiment, three representative unbalanced single-cell datasets extracted from three mouse tissues in the Tabula Muris project [50] have been selected and some rare data clusters exist in these datasets. Relevant information of the datasets is demonstrated in Table 6. Obviously, there are clusters with very few samples in these datasets. The data pre-processing method we use on these three data sets is the same as the method above. Fig. 7 exhibits the clustering performance of AAE-SC versus that of the baseline model on three unbalanced datasets. It can be observed that the AAE-SC model performs at least 5% better than the baseline model scDeepCluster on all three metrics. Especially on the Skin dataset, AAE-SC outperforms the scDeepCluster model by at least 12% on each metric. The above experimental results can show that the AAE-SC model   has better clustering performance than the baseline model in case of the unbalanced datasets with scarce data clusters.

F. SELECTIONS OF THE INITIAL LEARNING RATE FOR TWO OPTIMIZERS
The selections of the initial learning rate for the two optimizers in our AAE-SC model were also studied. In addition to the learning rate used in the baseline model scDeepCluster (0.001 for Adam and 1 for Adadelta), we also compared the performance of the two optimizers with other learning rates. The learning rate of Adam is set as 0.0001, 0.001, 0.01 and 0.1 successively, and which of Adadelta is set as 0.01, 0.1, 1 and 10 successively. Fig. 8 exhibits the clustering performance of AAE-SC under different learning rates for the two optimizers on 10X PBMC dataset. It can be clearly seen that when the learning rates of Adam and Adadelta are 0.001 and 1, respectively, the clustering performance of the model is best. At the same time, the above two learning rates are also the default learning rates of the two optimizers in the baseline model scDeepCluster, so in order to make a fair comparison and achieve the best performance, we recommend using these two values as the initial learning rate of the two optimizers.

G. HYPER-PARAMETER ANALYSIS
The effect of the clustering coefficient α on the clustering performance is further investigated. In this study, we aim to find a suitable value of α, so that the final clustering effect can be improved. Meanwhile, we hope that the final model will not be overly sensitive to the changes of coefficient α and the performance of the model not fluctuates too much. Therefore, the effect of different network widths on the model performance was also studied, by changing the width of first layer of the adversarial autoencoder network.
In order to make the optimization process more efficient and to better explore the solution space, we chose to use   Bayesian optimization to optimize these two parameters. Specifically, we used the python package BayesianOptimization to make the analysis. For clustering coefficient α and the width of first layer in AAE-SC, we defined the search space as [0.1, 10] and [64,512] respectively in advance, and then performed experiments. The result of the experiment is that the optimal parameters of the two under Bayesian optimization are 1.50023 and 128.0076 respectively, so we finally chose to set these two parameters to 1.5 and 128. It is worth mentioning that this setting can make the width of the network narrower than the baseline model.

VI. CONCLUSION
In this paper, we propose AAE-SC, a single-cell RNA-seq data clustering model. The model combines benefits of modeling of specific biological noise modeling, variational inference and deep clustering. The AAE-SC model preserves the data structure and utilizes the constraint bottleneck feature to improve clustering analysis by an AAE module. Experimental results on three real-world scRNA-seq datasets show that AAE-SC achieves considerable better clustering performance than the state-of-the-art on three evaluation metrics.
Although the proposed model achieves superior performance, we believe that it has some limitations, including AAE may not be easy to train, and the training time may be relatively long, etc., so our future work will also focus on improving the above problems. Also, it is significant to explore the utilization of this model for processing other single-cell data, such as single-cell Hi-C data and scATAC-seq data. Besides, using XAI technology to improve the interpretability of this model is worth to explore.