NASCUP: Nucleic Acid Sequence Classification by Universal Probability

Motivated by the need for fast and accurate classification of unlabeled nucleotide sequences on a large scale, we developed NASCUP, a new classification method that captures statistical structures of nucleotide sequences by compact context-tree models and universal probability from information theory. NASCUP achieved BLAST-like classification accuracy consistently for several large-scale databases in orders-of-magnitude reduced runtime, and was applied to other bioinformatics tasks such as outlier detection and synthetic sequence generation.

Moreover, existing databases sometimes contain mislabeled sequences that can potentially impair identification accuracy significantly 7. To address these challenges, we present NASCUP (http://github.com/nascup and Supplementary Software), an accurate and computationally efficient classification method that is scalable to large and growing datasets and robust against mislabeling errors.
Common approaches to sequence classification can be broadly divided into two categories: alignment-based and model-based. In alignment-based approaches, the class of the query sequence is determined by sequence-to-sequence comparison. Alignment tools, including BLAST 8, typically exhibit high accuracy, but are vulnerable to errors and often become time-consuming as the number of sequences increases. Model-based approaches derive statistical models from each group of sequences and compare the query sequence to these models. Sequence-to-model comparison is more scalable than sequence-to-sequence comparison, but it is often difficult to extract a model from a group of sequences that is statistically meaningful and does not overfit the data. By utilizing compact context-tree models along with the notion of universal probability from information theory, NASCUP delivers accuracy comparable to the sequence-to-sequence comparison approaches, while providing the robustness and scalability of a computationally efficient sequence-to-model comparison approach suitable for large-scale, expanding datasets. Similar to most model-based sequence classification tools such as RDP 9, HMMER 10, and Phymm 11, the NASCUP pipeline consists of two stages (Fig. 1a and Supplementary Fig. 1).
In the first, model-building stage, NASCUP learns the statistical structure of each nucleotide sequence group in a database from the occurrence counts of all k-mers (substrings of length k) in the sequences and builds a corresponding context-tree model 12 (alternatively referred to as a variable-order Markov model 13 or a probabilistic suffix tree 14) that best represents the data (Fig. 1b and Supplementary Fig. 2). Such context-tree models, as reported for protein sequence classification 15, are simple enough for fast and scalable processing (as in the k-mer count models of RDP), yet rich enough for accurate modeling of the data (as in the hidden Markov models of HMMER or the interpolated context models of Phymm). In the second, classification stage of its pipeline, NASCUP evaluates the likelihood of a test sequence under the context-tree model of each sequence group and chooses the group that maximizes the likelihood.
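The k-mer counting step that feeds the model-building stage can be sketched in a few lines. This is an illustrative sketch rather than NASCUP's actual C++ implementation; the function name kmer_counts and the toy sequences are ours.

```python
from collections import Counter

def kmer_counts(sequences, k):
    """Count occurrences of every k-mer (substring of length k)
    across all sequences in a group."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

# Toy sequence group; the 3-mer counts would feed the context-tree builder.
group = ["ACGTACGT", "ACGTT"]
counts = kmer_counts(group, 3)
```

In NASCUP, such counts are aggregated per group before a context tree of maximum depth k - 1 is built from them.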
In both the model-building and classification stages, NASCUP utilizes the notion of universal probability 16, 17 from information theory. In a nutshell, universal probability approximates all probability distributions in a given class of models, and serves as a close proxy to the unknown true probability distribution of given data without unnecessary overfitting. With theoretical performance guarantees and practical low-complexity implementations, universal probability has found many successful applications, including compression and prediction of sequential data of a priori unknown statistics. NASCUP measures how well a sequence group fits a context-tree model and how well a test sequence fits the chosen context-tree model of a sequence group by evaluating the universal probabilities of the sequences. The inference approach based on universal probability in particular, and the information-theoretic principle in general, has the additional benefit of having no tunable parameter 18 except the maximum depth of the context-tree models.
In order to demonstrate the classification accuracy and efficiency of NASCUP, we performed a set of comprehensive experiments on real sequence datasets from a variety of sources, organized on a functional or taxonomic basis with varying degrees of inter-group similarity (Supplementary Table 1). NASCUP achieved superb performance in accuracy and speed among the five classification methods compared, consistently across the diverse set of seven datasets (Fig. 1c). NASCUP achieved the highest average accuracy of 97.8% (in terms of both arithmetic and geometric means) among an expanded collection of thirteen classification methods, followed closely by the BLAST-based classification method (see Supplementary Table 2).
NASCUP also showed the highest accuracy at both the genus and species levels of the uncompacted SILVA dataset (Supplementary Table 3). Except for NASCUP and BLAST, the classification accuracy varied significantly over the datasets. In particular, HMMER was accurate on the functional RNA datasets but not on the metagenomic microbial datasets, while RDP worked well on the microbial datasets but showed unsatisfactory results on functional RNAs. NASCUP, RDP, and USEARCH ran significantly (often by orders of magnitude) faster than the other methods for most datasets (Supplementary Table 4).
To emulate the usage of classification tools in a realistic environment in which sequence databases scale over time, we prepared two types of expanding datasets: one increasing the number of sequences for a fixed number of groups, and the other increasing the number of groups for a fixed number of sequences per group (Supplementary Table 5). For sequencewise expansion, the model-building (first-stage) time of NASCUP grew linearly as the number of sequences increased. The actual classification time, however, was affected only marginally, since the second-stage classification operation is almost independent of the number of sequences in a group once the modeling has been completed (Fig. 1d top). The total runtime of NASCUP (the sum of modeling and classification times) was lower than that of the other four classification methods regardless of the data size (Fig. 1e left and Supplementary Table 6).
For groupwise expansion, the classification time of NASCUP grew as the number of groups (as well as the total number of sequences) increased. The modeling time did not increase, since the model-building procedure had to be performed only for newly added groups (Fig. 1d bottom and Supplementary Table 7). In both the sequencewise and groupwise expansion experiments, NASCUP was orders of magnitude faster than BLAST, the only method that achieved a comparable level of accuracy, across all database sizes.
We tested the robustness of NASCUP against mislabeling errors in the sequence database, which, in principle, exist in any real dataset labeled in the absence of the ground truth, and especially in phylogenetic datasets. We prepared datasets with mislabeling errors at rates ranging from 1% to 20% and compared the classification accuracy of the five main methods (Fig. 1f and Supplementary Fig. 5). The accuracy of NASCUP was robust, with marginal performance degradation even when 20% of the sequences in the database were mislabeled to arbitrary groups.
RDP exhibited a similar level of robustness, whereas the performance of the other three methods degraded as the error rate increased.
As mentioned earlier, NASCUP computes the likelihood of a test sequence belonging to each sequence group. This numerical likelihood value provides additional soft information that can augment the hard classification outcome. As the simplest application of such likelihood values, we evaluated how the accuracy of NASCUP improves when it produces a ranked list of likely groups, instead of a single most likely group, for a given test sequence. By increasing the list size, NASCUP achieved near-perfect accuracy (Fig. 2a). In particular, the ten most likely sequence groups produced by NASCUP included the correct group over 99% of the time for all datasets.
Instead of producing a fixed number of candidate groups, NASCUP can produce a variable-size list according to a target accuracy, which can be estimated from the likelihood values by Bayes' rule (Supplementary Table 8). The list of top candidate groups can then be fed into subsequent bioinformatics pipelines (such as cross-validation by BLAST) or actual biological experiments on a far smaller set of candidate groups preselected by NASCUP.
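As a sketch of how such a variable-size list can be formed, the per-group likelihood scores can be converted to posterior probabilities by Bayes' rule (assuming, for illustration, a uniform prior over groups) and accumulated until a target probability mass is reached. The function candidate_list and the score values below are hypothetical.

```python
import math

def candidate_list(log_likelihoods, threshold):
    """Smallest ranked list of groups whose total posterior probability
    (Bayes' rule with a uniform prior over groups, assumed here for
    illustration) reaches the target threshold."""
    # Log-sum-exp normalization turns log-likelihoods into posteriors.
    m = max(log_likelihoods.values())
    total = sum(math.exp(v - m) for v in log_likelihoods.values())
    post = {g: math.exp(v - m) / total for g, v in log_likelihoods.items()}
    ranked = sorted(post, key=post.get, reverse=True)
    cands, mass = [], 0.0
    for g in ranked:
        cands.append(g)
        mass += post[g]
        if mass >= threshold:
            break
    return cands

# Hypothetical log-likelihood scores for three groups.
scores = {"group1": -120.0, "group2": -125.0, "group3": -140.0}
```

Raising the threshold enlarges the candidate list only when the posterior mass of the top group alone is insufficient.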
As another application of the likelihood values computed by NASCUP, we performed one-class classification, which identifies whether or not a test sequence belongs to a single target sequence group. For NASCUP and HMMER, both of which build generative models for the target group and provide likelihood values for the test sequence, we used the normalized negative log-likelihood (NLL) value to distinguish member sequences of the target group from outliers. NASCUP showed a clear separation of outliers, which manifested in a much larger area under the precision-recall curve (Fig. 2b and Supplementary Fig. 6).
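A minimal sketch of the normalized-NLL outlier criterion, assuming the per-sequence log-likelihood has already been computed under some group model; the cutoff value is a hypothetical illustration, not a recommended threshold.

```python
def normalized_nll(log_likelihood, length):
    """Normalized negative log-likelihood: the negative log-probability of
    the sequence under the group model, divided by the sequence length so
    that sequences of different lengths are comparable."""
    return -log_likelihood / length

def is_outlier(log_likelihood, length, cutoff):
    # A larger normalized NLL means a worse fit to the group model,
    # hence a higher likelihood of being an outlier.
    return normalized_nll(log_likelihood, length) > cutoff
```

Normalizing by length is what makes a single cutoff applicable across sequences of different lengths.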
The combination of context-tree models and universal probability in NASCUP finds statistical generative models of nucleotide sequence groups that are parsimonious and consequently expected, by Occam's razor, to better represent the ground truth (Supplementary Figs. 3 and 4). In order to demonstrate the interpretive power of such generative models, we generated synthetic sequences randomly according to four generative models (combinations of context-tree vs. Markov models and universal vs. maximum-likelihood probabilities) and measured how often these sequences looked real by classifying them with BLAST. As the maximum context size increased, BLAST classified the synthetic sequences generated by NASCUP (context-tree models with universal probability) correctly with very high accuracy, while the synthetic sequences generated by the other three models failed to emulate real sequences due to inadequate modeling and overfitting (Fig. 2c and Supplementary Fig. 7). Similar trends were observed when BLAST was replaced by other classification methods.

Datasets. For 10-fold cross-validation, sequence groups with fewer than ten sequences were excluded. We compacted Greengenes and SILVA with cd-hit-est 23 at 97% similarity and limited the number of sequences in a group to 2,000. The datasets thus obtained had diverse characteristics: the number of groups ranged from 23 to 1,320, the sequence length from 20 to almost 5,000, and the average normalized intra-group pairwise sequence distance from 0.08 to 0.33. For the scalability experiments, we generated two types of expanding datasets from the Greengenes database, each in ten phases (Supplementary Table 5).
For the sequencewise expansion, we fixed a set of sequence groups and increased the number of sequences per group. For the groupwise expansion, we introduced a fixed number of new groups to an existing training set.
Implementation and experimental setup. We implemented NASCUP in C++ (source code available at http://github.com/nascup) and measured its performance in runtime (i.e., modeling time plus classification time) and classification accuracy on a Linux machine (Ubuntu 12.04, 2.2 GHz Intel Xeon E5-4620, 512 GB memory) without any parallelization. We compared NASCUP with four main alternative methods (BLAST, HMMER, RDP, and USEARCH) as well as additional methods (Phymm, gzip, UBLAST, caBLAST, BLAT, and three methods provided by QIIME 24: Naive Bayes in QIIME-2, and UCLUST and Mothur in QIIME-1). For BLAST and its alignment-based variants (USEARCH, UBLAST, caBLAST, and BLAT), the class of a query sequence was determined by the best hit. For gzip, the class was determined by the smallest difference between the lengths of the compressed representations of a sequence group and of the group appended by the query sequence. All command scripts for NASCUP and the other methods are provided in Supplementary Note 2.
Markov and context-tree models. The simplest probabilistic model for nucleotide sequences is the independent and identically distributed (IID) model that assigns one of four fixed probabilities p_A, p_C, p_G, and p_T to each symbol, and computes the probability of the entire sequence as the product of those probabilities. More precisely, a sequence x = X_1 · · · X_n with symbols A, C, G, and T respectively appearing n_A, n_C, n_G, and n_T times has the probability

P(x) = p_A^{n_A} p_C^{n_C} p_G^{n_G} p_T^{n_T}.    (1)

A d-th order Markov model introduces dependence across d + 1 consecutive symbols (i.e., k-mers for k = d + 1) by assigning one of the probabilities p_A(s), p_C(s), p_G(s), and p_T(s) to the symbol X_i at position i if the previous d symbols X_{i-d} · · · X_{i-1}, namely, the state, is equal to s. For example, a second-order Markov model assigns the probability

p_G(GA) p_G(AG) p_G(GG) p_C(GG)

to the sequence GA|GGGC from the third position. In general, a d-th order Markov model assigns the probability

P(x) = ∏_{i=d+1}^{n} p_{X_i}(X_{i-d} · · · X_{i-1})

from the (d + 1)-st position. By parsing the sequence into subsequences by states, this probability can be equivalently expressed as

P(x) = ∏_s P_s(x) = ∏_s ∏_{X ∈ {A,C,G,T}} p_X(s)^{n_X(s)},    (2)

where the products are over all states s ∈ {A, C, G, T}^d, P_s(x) denotes the probability assigned to the subsequence of x in state s (that is, the symbols in x that are preceded by s), and n_X(s) denotes the number of occurrences of symbol X in state s.

A context-tree model (CTM) 12 is both a specialization and a generalization of a Markov model, in which states with a common suffix share the same probability assignments and thus are aggregated to form a context. For example, the set {AA, CA, GA, TA} consists of all possible states of a second-order Markov chain that have the common suffix A. This set is represented in the shorthand notation *A, where * denotes "any" symbol. Since the probability assignments are the same for all states in the context *A, the symbols preceded by A effectively follow a first-order Markov distribution. Since the effective Markov order varies from one context to another, a CTM is also referred to as a variable-order Markov model 13.
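The equivalence between the sequential form and the state-parsed form in (2) can be checked with a short sketch; the toy parameter set p below is our own, covering only the states that occur in the example sequence.

```python
from collections import defaultdict

def markov_prob_sequential(x, d, p):
    """P(x) from the (d+1)-st position: the product over positions i of
    the probability p[state][symbol] of each symbol given its state."""
    prob = 1.0
    for i in range(d, len(x)):
        prob *= p[x[i - d:i]][x[i]]
    return prob

def markov_prob_by_states(x, d, p):
    """The same probability in the state-parsed form: first collect the
    counts n_X(s), then take the product of p_X(s) ** n_X(s) over states."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(d, len(x)):
        counts[x[i - d:i]][x[i]] += 1
    prob = 1.0
    for s, row in counts.items():
        for sym, n in row.items():
            prob *= p[s][sym] ** n
    return prob

# Toy second-order parameters covering only the states in the example.
p = {"GA": {"G": 0.5}, "AG": {"G": 0.5}, "GG": {"G": 0.5, "C": 0.5}}
```

Both forms give the same value because parsing by states merely regroups the factors of the sequential product.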
Each CTM is represented by a collection S of contexts that partition {A, C, G, T}^d and parameters p(s) = (p_A(s), p_C(s), p_G(s), p_T(s)) associated with each context s ∈ S. Using the contexts in S as leaves and merging contexts that share suffixes in a hierarchical manner, we can form a proper (that is, each node has 0 or 4 children) suffix tree with root node *^d = * · · · * (d times). For example, the contexts *A, AC, CC, GC, TC, *G, *T partition {A, C, G, T}^2 and form a suffix tree (Supplementary Fig. 2). Consequently, a CTM is equivalently represented by a suffix tree along with probability assignments on its leaves, and thus is also referred to as a probabilistic suffix tree model 14.

Maximum likelihood and universal probability estimates. When the parameters p_A, p_C, p_G, p_T of an IID model are not known, the most naive approach to estimating the true probability P(x) of a given sequence x is to find the parameters that maximize the sequence probability expression in (1) and then to compute the probability of the entire sequence using these parameter estimates.
It can be easily verified that the empirical frequencies (p_A, p_C, p_G, p_T) = (n_A/n, n_C/n, n_G/n, n_T/n) maximize (1) for any sequence x with symbol counts n_A, n_C, n_G, n_T and length n = n_A + n_C + n_G + n_T. The resulting maximum likelihood (ML) estimate of the true probability is

Q_ML(x) = (n_A/n)^{n_A} (n_C/n)^{n_C} (n_G/n)^{n_G} (n_T/n)^{n_T}.    (3)

Note that Q_ML(x) is a function of x only through the symbol count vector n = (n_A, n_C, n_G, n_T). Hence, with a slight abuse of notation, we will write the right-hand side (RHS) of (3) as Q_ML(n). More generally, for a depth-d CTM on a given context set S with unknown parameters p(s), s ∈ S, the ML estimate Q_ML(x) of the true probability P(x) = ∏_s P_s(x) (from the (d + 1)-st position) as in (2) is the product of the ML estimates of the subsequence probabilities P_s(x), namely,

Q_ML(x) = ∏_{s ∈ S} Q_ML(n(s)),    (4)

where Q_ML(n(s)) denotes the ML estimate of an IID probability in (3) evaluated with the count vector n(s) = (n_A(s), n_C(s), n_G(s), n_T(s)) for context s. The ML estimate Q_ML(x) overfits the given data x by discounting the symbols that did not appear in it. In fact, Q_ML(·) is not a valid probability assignment, since the sum of the estimated probabilities Q_ML(x) over all sequences x ∈ {A, C, G, T}^n is greater than 1.
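A sketch of the IID ML estimate Q_ML(n) in (3), including a numerical check of the claim that Q_ML is not a valid probability assignment; the zero-count convention 0^0 = 1 is handled by skipping zero counts.

```python
from math import prod
from collections import Counter
from itertools import product

def q_ml(counts):
    """ML estimate Q_ML(n) for the IID model: each symbol probability is
    its empirical frequency n_X / n (with the convention 0^0 = 1)."""
    n = sum(counts.values())
    return prod((c / n) ** c for c in counts.values() if c > 0)

# Q_ML overfits: a sequence of all A's is assigned probability 1,
# discounting the symbols that never appeared.
p_aaaa = q_ml({"A": 4})  # 1.0

# Q_ML is not a valid probability assignment: summed over all 16
# length-2 sequences it exceeds 1.
total = sum(q_ml(Counter("".join(t))) for t in product("ACGT", repeat=2))
# total == 7.0 > 1
```

The sum exceeds 1 because each sequence is scored by parameters tuned to that very sequence.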
As an alternative, NASCUP relies on the notion of universal probability 16, 17 from information theory, which is chosen independently of the data x and yet is close to all unknown probability models in a given class. The most basic example of universal probability is the Krichevsky-Trofimov (KT) estimate 25 for IID models, which assigns the sequence probability

Q_KT(x) = ∫ p_A^{n_A} p_C^{n_C} p_G^{n_G} p_T^{n_T} f(p_A, p_C, p_G, p_T) dp,    (5)

where f(p_A, p_C, p_G, p_T) is the Dirichlet prior on the quaternary probability simplex with parameters (1/2, 1/2, 1/2, 1/2). As a Dirichlet mixture of IID models, the KT estimate is a valid probability assignment (that is, Q_KT(x) ≥ 0 for every x and ∑_x Q_KT(x) = 1). Moreover, Q_KT is uniformly close to every IID probability model P on quaternary sequences of length n, in the sense that both the relative entropy (Kullback-Leibler divergence) 26 and the maximum log-likelihood ratio are upper bounded by (3/2) log n plus uniform constants independent of P. This bound vanishes when normalized by the sequence length n and is essentially tight, as no other probability estimate can approximate all IID probability models uniformly more closely 27, 28.
Since the Dirichlet distribution is the conjugate prior for the parameters of an IID model, the KT probability estimate has the predictive "add-half" formula

Q_KT(X_{i+1} = X | X_1 · · · X_i) = (i_X + 1/2) / (i + 2)    (6)

for the conditional probability when the sequence X_1 · · · X_i has symbol counts i_A, i_C, i_G, i_T with i = i_A + i_C + i_G + i_T. Applying this predictive estimate sequentially, the KT probability estimate of the entire length-n sequence x in (5) can be expressed as

Q_KT(x) = (Γ(2) / Γ(n + 2)) ∏_{X ∈ {A,C,G,T}} (Γ(n_X + 1/2) / Γ(1/2)),    (7)

where Γ(·) is the standard Gamma function. As with the ML estimate, the KT estimate is a function of x only through the symbol count vector n = (n_A, n_C, n_G, n_T), and hence we will write the RHS of (7) as Q_KT(n). Further extending the predictive estimate in (6), we can express the conditional probability of a sequence x with symbol count vector m given a preceding sequence y with symbol count vector n as

Q_KT(x | y) = Q_KT(n + m) / Q_KT(n).    (8)

Paralleling (2) and (4), we can generalize the KT estimate to a CTM on a given context set S with unknown parameters p(s). In this case, the KT estimate of the probability of the sequence x with symbol count vectors n(s), s ∈ S, is

Q_KT(x) = ∏_{s ∈ S} Q_KT(n(s)),    (9)

and the KT estimate of the conditional probability of x with symbol count vectors m(s) given y with symbol count vectors n(s) is

Q_KT(x | y) = ∏_{s ∈ S} (Q_KT(n(s) + m(s)) / Q_KT(n(s))).    (10)

As in the IID case, the KT probability estimate is universal in the sense that Q_KT is uniformly close to all CTMs on the given context set S.
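The closed-form KT estimate in (7) and its conditional extension Q_KT(x | y) = Q_KT(n + m) / Q_KT(n) are convenient to compute in the log domain with lgamma. This is a sketch under the assumption that count vectors are passed as plain dictionaries; it is not NASCUP's implementation.

```python
from math import lgamma, exp
from itertools import product
from collections import Counter

def log_q_kt(counts):
    """Log of the KT estimate Q_KT(n) for the quaternary alphabet, i.e. a
    Dirichlet(1/2, 1/2, 1/2, 1/2) mixture of IID models in the
    Gamma-function closed form of (7)."""
    n = sum(counts.get(s, 0) for s in "ACGT")
    val = lgamma(2.0) - lgamma(n + 2.0)
    for s in "ACGT":
        val += lgamma(counts.get(s, 0) + 0.5) - lgamma(0.5)
    return val

def log_q_kt_cond(m, n):
    """Log of the conditional KT estimate Q_KT(x|y) = Q_KT(n + m) / Q_KT(n),
    with componentwise addition of the count vectors."""
    nm = {s: m.get(s, 0) + n.get(s, 0) for s in "ACGT"}
    return log_q_kt(nm) - log_q_kt(n)

# The first symbol gets probability (0 + 1/2) / (0 + 2) = 1/4 by the
# add-half rule, and Q_KT sums to exactly 1 over all length-2 sequences.
p_first = exp(log_q_kt({"A": 1}))
total_len2 = sum(exp(log_q_kt(Counter("".join(t))))
                 for t in product("ACGT", repeat=2))
```

In contrast to the ML estimate, the total probability mass over all sequences of a given length is exactly 1, illustrating that Q_KT is a valid probability assignment.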
Model building. Given a collection of sequence groups in a database, NASCUP models each sequence group by a CTM with an unknown context set S and unknown parameters p(s), s ∈ S.

Scores, ranking, and multi-candidate classification. For each sequence group j = 1, . . . , J in the database, NASCUP computes the estimate Q_KT(x | y_j) of the likelihood that the query sequence x is generated from group j. This likelihood estimate serves as a score for each group, which can be rank-ordered to form a small list of candidate groups for the query sequence (Supplementary Table 8).

Outlier detection. The smaller the normalized negative log-likelihood (NLL) value, the better the conformance to the model. Conversely, the larger the value, the greater the deviation from the model, indicating a high likelihood of the sequence being an outlier (Supplementary Fig. 6).
Synthetic sequence generation. Let d be the depth of the context tree and l be the average length of the sequences y in a sequence group. First, we generate the starting string X_1 · · · X_d of length d by copying the most frequent starting string of the same length among the existing sequences in the group. Using a suffix of X_1 · · · X_d as the context s, we generate the next symbol X_{d+1} according to

Q_KT(X_{d+1} = X | X_1, . . . , X_d, y) = (n_X(s) + 1/2) / (n(s) + 2),

as in (6). Subsequently, we generate X_i, i = d + 2, . . . , l, each according to a similar predictive probability estimate with a suffix of X_{i-d} · · · X_{i-1} as the context and the corresponding count vector from the existing sequences y as well as the preceding symbols X_{d+1} · · · X_{i-1}. This sliding-window sequence generation procedure extends Polya's urn process in standard Bayesian statistics to a CTM. Due to the conjugacy of the Dirichlet prior, the distribution of the generated sequence x (from the (d + 1)-st position on) is equivalent to a CTM with random parameters p(s) drawn from the Dirichlet prior with parameters n_A(s) + 1/2, n_C(s) + 1/2, n_G(s) + 1/2, n_T(s) + 1/2 for each s ∈ S. To ascertain the quality of the synthetic sequences generated by our procedure, we generated synthetic sequences from the RF, RD, and GG datasets, one per group, and classified them using BLAST (Supplementary Fig. 7).
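The sliding-window Polya-urn generation described above can be sketched as follows; model_counts (per-context symbol counts taken from the group's sequences), the function signature, and the toy call are our own illustrative choices.

```python
import random
from collections import defaultdict

def generate(model_counts, start, d, length, seed=0):
    """Sliding-window Polya-urn generation from a CTM: each next symbol X
    is drawn with probability (n_X(s) + 1/2) / (n(s) + 2), where s is the
    current depth-d context; the drawn symbol is added back to the counts,
    so the generated prefix reinforces itself as in Polya's urn."""
    rng = random.Random(seed)
    counts = defaultdict(lambda: defaultdict(float))
    for s, row in model_counts.items():
        for sym, c in row.items():
            counts[s][sym] = float(c)
    x = list(start)
    while len(x) < length:
        s = "".join(x[-d:])
        weights = [counts[s][sym] + 0.5 for sym in "ACGT"]  # add-half rule
        sym = rng.choices("ACGT", weights=weights)[0]
        x.append(sym)
        counts[s][sym] += 1.0  # Polya urn: the drawn symbol reinforces itself
    return "".join(x)

# With empty counts this reduces to uniform generation from the 'GA' start.
synthetic = generate({}, "GA", d=2, length=12, seed=1)
```

Since rng.choices normalizes its weights, the denominator n(s) + 2 need not be computed explicitly: the four weights already sum to it.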
Statistical significance. We performed Wilcoxon signed-rank tests in the SciPy library on normalized and unnormalized aggregate datasets as well as individual datasets (Supplementary Table 9).
The normalized aggregate dataset was formed by combining the seven datasets in Supplementary Table 1.

Comparison of NASCUP and its design alternatives. Universal probability and context-tree models are the two key features of NASCUP. In order to demonstrate their combined benefit, we compared the classification accuracy among a few design alternatives by varying the model depth (Supplementary Figs. 3 and 4). Moreover, CTMs using the KT estimator always have fewer leaf nodes than CTMs using the ML estimator. In conclusion, the combination of KT and CTM used in NASCUP was the most parsimonious. This sparsity can be interpreted as being closer to the ground truth in principle (which was cross-examined by the synthetic sequence generation experiment), but it also leads to faster and more efficient classification as a practical benefit.
Comparison of NASCUP with twelve alternative classification methods. Including the four main alternatives, we compared NASCUP with twelve classification methods in terms of accuracy and running time (Supplementary Tables 2 and 4). Regardless of the diversity of the datasets, NASCUP showed a consistently high level of accuracy and achieved the best performance in terms of both the arithmetic and geometric mean accuracy. In particular, NASCUP showed the highest accuracy on the RF dataset, which was the most difficult to classify due to having the largest number of classes and the widest average intra-group sequence distance (AIGD). More than half of the thirteen classification methods (UBLAST, BLAT, caBLAST, RDP, gzip, UCLUST, and Mothur) exhibited unsatisfactory results with accuracy below 80% on RF. NASCUP and BLAST maintained accuracy above 95% across all datasets considered, whereas the performance of the other methods often varied significantly from dataset to dataset. NASCUP, RDP, USEARCH, and the QIIME-based methods ran significantly (often by orders of magnitude) faster than the rest for most of the datasets.
Runtime on the sequencewise and groupwise expanding datasets. With advances in sequencing technologies, sequence databases are constantly being updated and expanded. To ascertain the desired scalability of NASCUP in such environments, we performed classification experiments on two types of database expansion: sequencewise and groupwise expanding datasets (Supplementary Tables 6 and 7).

Supplementary Table 3. Experimental results on the large-scale SILVA (SSU, LSU) datasets. The sequence groups are formed at the genus and species levels from the latest SILVA release (v132) without compaction. The genus-level SSU and LSU sets have 1,861 and 587 groups, respectively, and the species-level SSU and LSU sets have 3,134 and 749 groups, respectively. NASCUP-7 denotes a context depth of 7, comparable to the 8-mers of RDP (whose default k is 8); NASCUP-6 denotes a depth of 6, comparable to 7-mers, which is the default parameter of NASCUP. Each accuracy value is the average over 10-fold cross-validation.

Supplementary Table 4. Total runtime (classification time) comparison. Each value in the table, including the average and geometric mean, is a log-transformed time (unit: log10 of a second). The classification time for BLAST and its variants equals the total runtime, since there is no separate modeling process. For HMMER, the preprocessing time for multiple sequence alignment was excluded.

Supplementary Table 8. Accuracy comparison from candidate extraction. The two sections represent different methods of extracting candidate groups: (1) a fixed number of candidates K, and (2) a threshold T on the sum of the posterior probabilities. Each value for the seven datasets is the average accuracy over 10-fold cross-validation. In the second section, the numbers in parentheses give the average number of candidate groups.