A deep embedded clustering algorithm for the binning of metagenomic sequences

The study of metagenomic sequences brings a deep understanding of microbial communities. One of the crucial steps in metagenomic projects is to classify sequences into different organisms, named the binning problem. In the emerging methods for classification, deep learning is a potential technology to be applicable with high accuracy. However, it is well-known that reference databases, which are highly required by deep learning based methods, are not always available. As a result, some existing binning solutions have applied unsupervised learning processes, but utilizing the strength of deep learning in an unsupervised model is still a challenging problem. This work proposes a binning algorithm for metagenomic sequences, called MetaDEC, which applies a deep unsupervised learning approach. By following the two-phase paradigm, the algorithm firstly divides sequences into groups of overlapping sequences. The groups are then classified into clusters using an adversarial deep embedded clustering technique. Experimental results show that MetaDEC achieves competitive performance compared to existing methods on both simulated and real metagenomic data.


I. INTRODUCTION
M ETAGENOMICS is the study of genetic content collected directly from the environment, bypassing the need for culturing and isolating in laboratories. The field offers opportunities to get a deep inside into microbial communities that are infeasible with traditional methods of singlegenome sequencing technologies. There are many fields of applications from metagenomic studies such as biomedical science, biotechnology, energy, and agriculture [1].
Some initial metagenomic projects, e.g. Acid Mine Drainage, and Sargasso Sea are based on Sanger sequencing technology [2], [3]. The technology produces quite long sequences which provide meaningful information for the binning process. Unfortunately, because it requires high costs and is very time-consuming, the technology is impractical for most current projects, which require analyzing a huge amount of data. Thanks to the development of Next-Generation Sequencing approaches such as Illumina, and 454 pyrosequencing [4] which are appropriate for nowadays metagenomic projects. The technologies are able to process millions of reads per run in a short time with low costs. However, one of the limits of the sequencing technologies is that they produce sequences with short lengths, and thus brings research challenges for communities because of the lack of information in short sequences.
Due to the fact that DNA fragments in metagenomic samples are from multiple genomes, one of the major steps in metagenomic projects is to classify sequences into groups of closely related genomes. The step is referred to as binning problem. Binning results are beneficial to reconstructing genomes by assembly approaches or used for analyzing sequences directly from every single genome.
Some binning approaches such as TIPP2 [5], and mOTUs2 [6] use marker genes, e.g. 16S rRNA, recA for profiling metagenomic sequences. The strength of those methods is that they require low computing costs because of not needing to analyze the whole genomes of organisms. However, the classification quality of the methods is affected by the fact that many different species contain the same marker genes, and others may have a small ratio of 16S rRNA genes [7], [8]. This work focuses on whole-genome sequencing approaches which analyzing all genome sequences of microbial organisms. Those approaches can be classified into groups of supervised, and unsupervised methods.
Supervised methods are based on homology or composition information from an available reference database to support the classification process. MEGAN CE [9], and MEGAN-LR [10] use homology search tools such as DI-AMOND [11], or LAST [12] to determine the similarity between input sequences with reference sequences. The algorithms then assign the sequences into groups of known organisms. One of the drawbacks of the methods is that they are very time-consuming. Kraken2 [13] deal with the challenge by extracting long k-mers from sequences and comparing them with reference databases. On the other hand, DeepMicrobes [14] is composition-based supervised methods that utilizes genomic signatures extracted from sequences to classify the metagenomic data. The method applies a deep learning-based computational framework for its classification task.
Due to the limitation on reference databases, some binning algorithms apply an unsupervised learning process for classifying sequences. MetaCluster 2.0 [15] and MetaCRS [16] aim to cluster metagenomic data of long sequences or contigs. Both of the algorithms use composition features and apply k-means algorithm in their process. Focusing on analyzing short sequences, MetaCluter 5.0 [17] applies a two-round binning approach that classifies effectively samples with low-abundance species. Another binning algorithm for short reads, AbundanceBin [18] is only based on abundance levels of species in metagenome, and thus it does not work well for samples of species with similar abundance levels.BiMeta [19] and MetaProb [20] are other clustering algorithms that utilize sequence overlapping information and kmer frequency extracted from groups of reads. While BiMeta uses Euclidean distance of k-mer frequencies between data points in its second phase, MetaProb applies probabilistic signatures to get better classification quality. MetaProb 2 [21] is an improvement of MetaProb which requires an assembly task on the group of reads and a graph clustering algorithm for classifying sequences.
Recent studies on unsupervised deep learning-based clustering approach achieve significant improvement [22], [23] and are applied in a few metagenomic binning algorithms. Among them, the binning algorithm of Isis et al [24] uses an autoencoder architecture to compress composition-based representation of sequences and applied k-Means++ method on compressed representation to classify data. On the other hand, VAMB algorithm [25] only focuses on analyzing metagenomic contigs. It uses a variational autoencoder architecture to compress data into a latent posterior distribution (multivariate Gaussian distribution) in the clustering process.
This study proposes a novel unsupervised binning method for metagenomic sequences called MetaDEC (i.e., Metagenomic binning with Deep Embedded Clustering). The proposed approach follows the paradigm of two phases as used by previous studies [19], [20] in which the first phase does preprocessing works to groups sequences using the sequence overlapping information. However, different from previous works, MetaDEC then uses ADEC [23] -a novel unsupervised deep learning approach -to classify the sequence groups in its second phase. The method utilizes an autoencoder architecture with latent space interpolation factor to learn the underlying representation of data and optimize the learned representation toward the clustering objective.
The next section presents the details of the proposed method. The Experiments and Results section shows the strength of MetaDEC comparing with available binning approaches. Some conclusions are presented in the final section.

II. METHODS
In order to overcome the lack of phylogenetic information in short sequences, the proposed method applies a twophase paradigm in the clustering process ( Fig. 1). Phase 1 does preprocessing work to build groups of sequences using overlapping information between them. Clustering features are then extracted from the sequence groups. Phase 2 uses the features to continue classifying the groups into clusters of close organisms.

1) Phase 1: Grouping sequences and building seeds
Derived from the work in [19], the phase classifies reads that share sufficient long l-mer into the same groups and builds group representative. Based on an observation that the l-mers are unique in genomes [17], [26]. MetaDEC firstly builds a graph whose vertices are reads, and each edge is the connection between two reads that have sufficient substring overlapping. While previous methods [19], [20] used a greedy approach to build groups, this work applies a multilevel partitioning algorithm [27] to create a coarsened graph and acquires groups from connected components.
An observation in [19] revealed that genomics signatures of k-mer nucleotide frequency are also preserved in a group of non-overlapping short reads as in long sequences. Thus, MetaDEC selects a subgroup of non-overlapping reads in each group, called seed, as a representative of the group. The technique does not only reduce noises in features extracted from sequence groups that have unbalanced coverage but also saves computation costs [19]. Next, k-mer frequency distribution of each seed is computed as follows.
Given S = {r 1 , r 2 , ...r n } is a seed, where n is the number of reads in seed S. Let |r i |, i ∈ [1..n] be the length of read r i . In order to find k-mers of each read, MetaDEC uses a sliding window method with a window size of k. There is |r i |−k +1 k-mers in each read. Thus, the total number of k-mer of seed S, denoted |S|, is Overlapped l-mer graph Set of seeds k-mer frequency Step 2 136 1 2 ...

Update centers
and Cluster assignment Phase 1

FIGURE 1. Process of MetaDEC
In other words, there are at most 4 k different contents of k-mers because each k-mer is a combination of 4 kinds of nucleotides (A, T, G, C corresponding to Adenine, Thymine, Cytosine, Guanine, respectively). However, either of the DNA strands can be obtained from their reversed complement strands (e.g., ATTT -TAAA, GCGC -CGCG). The frequency of a k-mer and its reversed complement are the same. As a result, the number of total k-mer values could be reduced by half, from 4 k to 4 k /2 if k is odd, (4 k + 4 k/2 )/2 if k is even. Studies in [15], [19], [28] showed that value k = 4 in computing k-mer frequency is the best choice for extracting compositional features from DNA sequences or contigs. Thus, we also choose k = 4 in this work. It means that there are 136 k-mers in total. Let .., f S 136 } is the set of k-mer frequency representation of seed S. f S is normalized by dividing each entry by |S|. It is then normalized a second time into a normal distribution x S with mean µ = 0 and variance σ = 1. As a result, the final representation of seed S is

A. PHASE 2: CLASSIFY OF SEQUENCE GROUPS USING DEEP CLUSTERING
Given n groups of sequences which are represented by a set of n normalized frequencies of seeds constructed in phase 1 .n] be seeds of the groups. In this phase, MetaDEC classifies X into m clusters with m centroids C = {c 1 , c 2 , ...c m } applying a deep clustering method called ADEC (Adversarial Deep Embedded Clustering) [23]. The phase consists of two steps: Cluster initialization, and Clustering optimization.

1) Step 1: Clusters initialization
The first step aims to find a sufficient initialization of clusters (presented in Fig. 1 and Algorithm 1). Instead of using similarity measurement directly on X, the proposed method uses an autoencoder to learn the latent space representation of X. In detail, the autoencoder composes of two parts, encoder, and decoder. The encoder is a non-linear mapping function E θ : X −→ Z that transforms X space into a smaller dimensionality latent feature space Z = {z S1 , z S2 , ..., z Sn }, in which each z Si corresponds to encoded x Si , i ∈ [1..n]. The decoder is another non-linear mapping function D ϕ : Z −→X that transforms latent space Z back to the original data space of k-mer frequency representation. In our work, both encoder and decoder are modeled by neural networks. MetaDEC also adds interpolation factor on latent space based on a framework proposed from ACAI [29] by using another neural network called critic C ψ to improve the quality of latent space. The information from the critic is used as a regularization term in the autoencoder optimization process.
In this step, autoencoder (encoder E θ , decoder D ϕ ) and critic C ψ are trained in predefined iterations. Each iteration updates C ψ using a loss function L C (1) and E θ , D ϕ using a loss function L E,D (2).
). x S1 , x S2 are randomly picked data points in X and α, λ are randomly sampled in range [0, 1]. It is noted that instead of using square root function for each term in L C and L E,D as the work in [23], MetaDEC uses square function for a better model training.
After the optimization of the autoencoder and critic, latent space Z = E θ (X) is computed, and centroids C are initialized by applying k-means algorithm on Z.

2) Step 2: Clustering optimization
In this step, MetaDEC continues optimizing the clustering result from step 1. The process performs two sub-steps iteratively, which computes soft clustering assignment and learns from high confidence assignment. It runs until convergence, or a threshold met.
Firstly, soft cluster assignment for latent space Z is computed. This study uses the Student's t-distribution in calculating the distance from z Si , i ∈ [1..n] and cluster's centroid c j , j ∈ [1..m]. The probability q ij ∈ Q that data point i th belongs to cluster j th is computed as the following formulation.
in which f req j ′ = Σ i q ij ′ is the frequency of soft assignment. An objective here is to minimize the difference between soft cluster assignment distribution Q and auxiliary target distribution P . MetaDEC then minimizes Kullback-Leibler divergence (KLD) between P and Q to optimize E θ and clusters' centroids c i ∈ C . Beside optimizing KLD loss function, MetaDEC adds a neural network into the clustering optimization step, called discriminator G ω , to help improve the quality of latent space produced by encoder E θ . The discriminator network is optimized to be able to distinguish real groups and reconstructed groups by minimizing loss function L G (equation 5). The information produced by discriminator G ω is used as a regularization term in optimizing E θ and clusters' centroids C. The regularization term penalizes the encoder when it generates meaningless latent space for decoding. Equation 6 describes the final loss function L cls used for optimizing E θ and clusters' centroids C.
The proposed method keeps optimizing decoder D ϕ in the training process. The decoder in this step plays a role as a monitor for assuring the quality of the latent space. Thus, it needs to be trained to catch up with the changes from encoder E θ by optimizing reconstruction loss L D in (7).
Algorithm 2 presents the details of clustering optimization step. It firstly does pretraining discriminator before finding assignment results for elements in X to clusters.

B. PERFORMANCE METRICS
The quality performance of binning algorithms are evaluated using three metrics precision, recall and F-measure (also used in [17], [19], [20]). While precision measures the ratio of reads classified into a cluster that comes from the same species, recall presents the percentage of reads that belong to the same species are clustered into the same clusters. Because each of the two metrics cannot fully reflect the performance of a binning algorithm, F-measure is used as a harmonic metric between them. The three metrics are defined as follows.
in which A ij is the number of reads that come from species j classified into cluster i, m is the number of species in the Compute L G , L D , L cls using (5), (6), (7) 16: if i % aux_iter <= (aux_iter/2) then 17: ϕ ← ϕ − δ∆L D

A. DATASETS
The proposed method is tested on 25 simulated datasets which were used in previous studies [19]. Among them, there are 16 datasets following Illumina sequencing technology, named from S1 to S10, and from L1 to L6, which are pairedend short reads with the length of 80bp. Datasets from L1 to L6 contain sequences of species with different abundance levels. Besides, 9 datasets of Roche 454 single-end long reads named from R1 to R9, have the length of 700bp. The details of the datasets are presented in Appendix A. Furthermore, this study also evaluates the proposed algorithm on a real metagenome, Acid Mine Drainage (AMD) [2]. The dataset is downloaded from National Center for Biotechnology Infor-mation (NCBI) trace archive. This work tests MetaDEC, and compares it with published results of BiMeta, MetaCluster 2.0, AbundanceBin, MetaCluster 5.0 on those datasets [19].

B. EXPERIMENTAL SETTINGS
In the experiment, we use a specific machine for each phase of the proposed algorithm. Phase 1 is run on a machine with 500GB memory and Intel(R) Xeon(R) Gold 6152 processor. Phase 2, which includes the process of training neural networks, is run on a machine of NVIDIA Tesla P100 16GB GPU to accelerate the training time.
In the first phase of MetaDEC, two sequences are considered to be overlapped with each other if they share at least no olp overlapping l-mer substrings. This work empirically set no olp = 5 for short sequence datasets and no olp = 45 for long sequence datasets. Besides, the value of l of l-mer to determine overlapping sequences in phase 1 of the proposed algorithm is set to 30 empirically.
The second phase of MetaDEC is performed with four types of architecture including T iny, Small, Large, Xlarge which are presented in Appendix B. Experiments for simulated datasets in this work uses Small architecture.
In the other hands, the step of cluster initialization trains autoencoder and critic parts for 2000 iterations using Adam optimizer with a fixed learning rate at 0.001. Besides, in the cluster optimization step, discriminator is trained using the same optimizer with 200 iterations. The step continues to train the model using Adam optimizer with learning rate at 0.0001 with epsilon value 10e-8. The training will be stopped if the ratio of changing clustered data points after two consecutive iterations less than a threshold tol of 0.0001% or reaching the maximum number of 300 iterations. In addition, the degree of freedom η in t-student distribution used in step 2 of the phase is set as 1.

1) Results on short read datasets
The proposed method is firstly compared with BiMeta, Meta-Cluster 5.0 on short-read datasets from S1 to S10. The bar charts in Fig. 2 present the precision and recall of the binning algorithms. It can be seen from the chart that MetaDEC gets higher precision values than BiMeta and MetaCluster 5.0 on 9/10 datasets. Besides, there are 6/10 cases in which the proposed method returns better recall values than the two remaining algorithms.
Furthermore, overall F-measures of the three methods are presented in Table 1. The table shows that MetaDEC outperforms both BiMeta and MetaCluster 5.0 for 6/10 samples. On average, MetaDEC gets higher 3.79% and 12.91% compared with the two remaining methods, respectively.
Considering the ability to classify metagenomes of different abundance levels, MetaDEC is compared with BiMeta and AbundaceBin on datasets from L1 to L6. The datasets contain two species of Eubacterium eligens and Lactobacillus amylovorus, and have different abundance ratios. The  [19], BiMeta [19], and MetaDEC on datasets from S1 to S10. bar chart in Fig. 3 shows that MetaDEC achieves higher Fmeasure than both BiMeta and AbundanceBin on all samples of low-abundance ratios (Datasets from L1 to L4 with ratios of 1:1, 1:2, 1:3, 1:4, respectively). Especially, MetaDEC outperforms BiMeta for all cases. Because the strength of AbundanceBin is based on abundance ratios of species in datasets to classify sequences, it gets better results than BiMeta and MetaDEC for samples L5, and L6 (with highly different abundance levels).

2) Results on long read datasets
The proposed method is also compared with MetaCluster 2.0 and BiMeta on long sequence datasets from R1 to R9. Table 2 shows that MetaDEC outperforms the two algorithms for 8/9 cases. The proposed method gets 3.44% and 11.44% higher F-measure on average than BiMeta and MetaCluster 2.0 on the samples, respectively. For more details, bar charts in Fig.  4 show that MetaDEC returns higher precision values than both the algorithms on 8/9 samples. Besides, there are 6/9 cases that the proposed method achieves better recall values than MetaCluster 2.0 and BiMeta.   [19], BiMeta [19], and MetaDEC on datasets from L1 to L6.

D. RESUTLS ON A REAL DATASET
The proposed method is also tested on a real meta-genome. It is compared with BiMeta on the dataset of AMD. darmanus Type I, and Thermoplasmatalesarchaeon Gpl with abundance ratios of 5:5:1:1:1, respectively [2]. In order to assess binning results, assembled scaffolds of the five species are downloaded from NCBI. Reads from each cluster in the output results are aligned with the scaffolds using the BLAST tool to calculate roughly classification accuracy. The dataset is tested using two network architectures of Small and Large presented in Appendix B. Experimental results show that MetaDEC achieves F-measure values of 68.85% and 70.1% when architecture Small and Large applied, respectively. The results are very comparative to BiMeta which was reported that it got a F-measure value of 68.3 on the dataset [19]. and R3 datasets. It is noted that the architectures are assigned with different numbers of layers (T iny is 3 layers, Small is 4 layers, Large is 5 layers, and Xlarge is 7 layers). Line chart in Fig. 5 shows that MetaDEC reaches the highest Fmeasures with architecture Large on sample S5 and with architecture Small on sample R3. However, it gets the lowest F-measures with architecture Xlarge on both datasets. Furthermore, with the shallow network T iny, MetaDEC also returns lower F-measures on the datasets compared with architectures of Small and Large. The results reveal that if a model is too shallow (e.g. T iny with 3 layers), it is not able to learn meaningful latent space that has the ability to describe original data space. In contrast, a model that is too deep (e.g. Xlarge with 7 layers) is prone to overfitting the data instead of extracting meaningful features from it. In both cases, latent spaces learned from these models would not result in a good clustering performance. As can be seen from the visualization of latent spaces from results of cluster initialization for the both datasets (section II-A1 in Phase 2 of MetaDEC) presented in Appendix C, latent spaces of species from T iny and Xlarge models tend to be mixed together. It can result in poor centroid initialization. In contrast, latent spaces of Small and Large models give a more convex shape for each specie, and result in a good centroid initialization.
In the aspect of processing time, the line chart in Fig.  6 shows that the running time of MetaDEC is proportional to the complexity level of model architectures. It can be understood because the larger size of the model is used, the higher computational resources are required.
Furthermore, this work also evaluates the performance of the case of using neural network with zero layer (by applying k-means directly on k-mer frequency representation). The experiment result is compared with MetaDEC using Small architecture on all of the simulated datasets. The chart in Fig. 7 shows that MetaDEC performs better than the case merely applying k-means on average. It means that applying ADEC method providing promising classification quality on metagenomic data. S1 to S10 L1 to L6 R1 to R9 Datasets

IV. CONCLUSION
The complexity of the microbial community requires indepth exploration strategies to yield practical benefits. This study takes advantage of deep learning techniques to effectively solve the binning problem of metagenomic data. The proposed method employs an adversarial deep embedding approach to automatically learn a latent space of k-mer frequency representation of sequence groups, and is demonstrated to outperform state-of-art binning algorithms. In future works, it is worth extending our method to estimate the number of potential clusters in order to be more suitable in the binning context. Besides, a deeper investigation to find a better measuring function for soft cluster assignment in the second phase of MetaDEC is one of the potential research directions. Source codes and datasets used in this work can be downloaded from https://bioinfolab.fit.hcmute.edu.vn/MetaDEC. .

APPENDIX B ARCHITECTURE TYPES
In the trainning process for neural networks in phase 2 of MetaDEC, each architecture consists of four parts of encoder (θ), decoder (ϕ), critic (ψ), discriminator (ω). Each network is a multi layer perceptron. Each layer i of a network is defined as follows.
where W i and b i are layer's parameters, x i−1 is the output of previous layer, g i is the non-linear activation function. MetaDEC uses ReLU for all layers' g i , except the last layers of encoder, decoder and critic, which are set to be the identity function and the last layer of discriminator is set to be the sigmoid function. We initialized all layers' parameters to random numbers drawn from a zero-mean Gaussian distribution with a standard deviation of 0.01. Tables 4, 5, 6, 7 describe the architectures used in our experiments for encoder, decoder, critic, and discriminator, respectively.  Tiny  136  500  10  Small  136  500  1000  10  Large  136  500  500  2000  10  Xlarge  136  500  500  1000  1000  2000  10   TABLE 5. Details of decoder part, which has the inverse order of encoder

APPENDIX C VISUALIZATION OF LATENT SPACE
The visualization of latent spaces from results of cluster initialization (section II-A1 in Phase 2 of MetaDEC) for datasets S5 and R3 on four architectures, which are T iny (Fig. 8), Small (Fig. 9), Large (Fig. 10), and Xlarge (Fig.  11).