An Application of Random Walk Resampling to Phylogenetic HMM Inference and Learning

Statistical resampling methods are widely used for confidence interval placement and as a data perturbation technique for statistical inference and learning. An important assumption of popular resampling methods such as the standard bootstrap is that input observations are identically and independently distributed (i.i.d.). However, within the area of computational biology and bioinformatics, many different factors can contribute to intra-sequence dependence, such as recombination and other evolutionary processes governing sequence evolution. The SEquential RESampling (“SERES”) framework was previously proposed to relax the simplifying assumption of i.i.d. input observations. SERES resampling takes the form of random walks on an input of either aligned or unaligned biomolecular sequences. This study introduces the first application of SERES random walks on aligned sequence inputs and is also the first to demonstrate the utility of SERES as a data perturbation technique to yield improved statistical estimates. We focus on the classical problem of recombination-aware local genealogical inference. We show in a simulation study that coupling SERES resampling and re-estimation with recHMM, a hidden Markov model-based method, produces local genealogical inferences with consistent and often large improvements in terms of topological accuracy. We further evaluate method performance using empirical HIV genome sequence datasets.

Two classes of resampling methods are used. The first class is non-parametric; of these, the bootstrap method is among the most widely used [3], [4]. Given an input set of observations, the bootstrap method resamples observations uniformly at random with replacement. Re-estimation is then performed on resampled replicates, and repeatability is quantified by comparing re-estimates. Related non-parametric methods include the jackknife and the weighted bootstrap. In contrast, parametric methods resample directly from an explicit statistical model. Ideally, the model that generated the original inputs is available, but in practice a hypothesis model must typically be assumed. Non-parametric methods are a popular choice since they avoid the need to assume that observations were generated under a specific model.
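As a point of reference for the standard bootstrap described above, the following minimal sketch resamples alignment columns uniformly at random with replacement. The function name `bootstrap_sites` and the toy alignment are illustrative assumptions and do not come from any cited software.

```python
import random

def bootstrap_sites(alignment, rng):
    """Resample alignment columns uniformly at random with replacement.

    alignment: list of equal-length strings (one per sequence).
    Returns the resampled alignment and the sampled column indices.
    """
    k = len(alignment[0])
    cols = [rng.randrange(k) for _ in range(k)]  # each draw is independent
    replicate = ["".join(seq[c] for c in cols) for seq in alignment]
    return replicate, cols

# Toy alignment (hypothetical data, for illustration only).
alignment = ["ACGTAC", "ACGTTC", "ACCTAC"]
replicate, cols = bootstrap_sites(alignment, random.Random(0))
```

Because each column is drawn independently, the ordering of sites in the input plays no role in the replicate, which is exactly where the i.i.d. assumption enters.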
But the bootstrap method and other popular non-parametric resampling methods bring their own important limitation: the simplifying assumption that input observations are independent and identically distributed (i.i.d.). The i.i.d. assumption is invalid in the case where inputs consist of sequences of observations, as is common throughout genomics and many other topics in computational biology and bioinformatics.
To relax this simplifying assumption, we developed the SERES (or "SEquential RESampling") method for non-parametric/semi-parametric resampling from an input of either aligned or unaligned sequences [25]. SERES synthesizes and extends the bootstrap method with a simple but powerful insight due to [12]: inferences should be repeatable whether an input of unaligned sequences is read left-to-right or right-to-left. In lieu of using "mirrored" inputs, SERES performs random walks on input sequences. In this study, we focus on the SERES algorithm for aligned sequence inputs. A start point (i.e., an initial site) and direction for the SERES walk are chosen uniformly at random. The walk then proceeds, where aligned sites are sampled during each step of the walk; walk reversals occur with probability γ during each step (where γ is typically smaller than 0.5) and with certainty at the start and end of the alignment. The random walk concludes once the resampled replicate has length equal to the input alignment. For each resampled replicate, re-estimation is performed. Repeatability is then measured by quantifying disagreement among re-estimations.
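The walk described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the reference implementation from [25]: the function names are hypothetical, the random reversal is drawn before the boundary check, and a boundary site is sampled once when the walk reflects off it.

```python
import random

def seres_walk_indices(k, gamma, rng):
    """Return the k site indices visited by one SERES-style walk on k aligned sites."""
    if k < 1:
        raise ValueError("alignment must contain at least one site")
    if k == 1:
        return [0]
    pos = rng.randrange(k)        # start site chosen uniformly at random
    step = rng.choice((-1, 1))    # initial direction chosen uniformly at random
    visited = []
    while len(visited) < k:       # stop once the replicate matches the input length
        visited.append(pos)
        if rng.random() < gamma:  # reversal with probability gamma at each step
            step = -step
        if not 0 <= pos + step < k:
            step = -step          # reversal with certainty at either end
        pos += step
    return visited

def seres_replicate(alignment, gamma, rng):
    """Build one resampled alignment by reading columns along the walk."""
    idx = seres_walk_indices(len(alignment[0]), gamma, rng)
    return ["".join(seq[i] for i in idx) for seq in alignment], idx

# Toy alignment (hypothetical data, for illustration only).
rep, idx = seres_replicate(["ACGTACGTAC", "ACGTTCGTAC", "ACCTACGAAC"], 0.005, random.Random(7))
```

Unlike the bootstrap, consecutive columns in the replicate are always adjacent columns in the input, so local sequential dependence is retained between reversals.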
Our initial study of SERES focused on the SERES algorithm for unaligned sequence inputs [25], rather than aligned inputs. Briefly, the SERES algorithm for unaligned sequence inputs also takes the form of a random walk, with one main difference: resampling "reads" along unaligned sequences occur in an asynchronous fashion, and a set of anchors serves as synchronization "barriers" in much the same sense as in parallel computing. We previously applied the SERES algorithm to perform confidence interval placement for a classical problem in computational biology and bioinformatics: multiple sequence alignment (MSA) estimation. Using synthetic and empirical data, we showed that the use of SERES random walks within a resampling/re-estimation pipeline resulted in comparable or often better type I and type II error rates relative to state-of-the-art methods.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In this study, we address several corollary questions which constitute the three primary contributions of our study. (1) We propose the first application of SERES random walks on aligned sequences, whereas our earlier study focused on SERES random walks on unaligned sequences. (2) Our study utilizes SERES random walks as a means to "boost" HMM inference/learning performance. We show that SERES, like other non-parametric resampling methods, has utility as a data perturbation technique in addition to its use in confidence interval placement, as considered by our earlier work. (3) We introduce a SERES-based approach for another classical problem in computational biology and bioinformatics: recombination-aware local genealogical inference.

A. Standalone recHMM Analysis
The coalescent-with-recombination (CwR) model [8] is a classical population genetic model involving recombination. However, phylogenetic inference under the multi-species CwR model is computationally prohibitive, and alternatives such as the sequentially Markovian coalescent (SMC) model [17] are used as an approximation to the full CwR model. First-order hidden Markov models (HMMs) are a widely used choice for tractable SMC-based inference.
Phylogenetic HMMs (or "phylo-HMMs") are the class of HMMs with hidden states that correspond to phylogenies. Markovian dependence between phylo-HMM states is meant to capture intra-sequence dependence among local phylogenies, which can be caused by recombination and other evolutionary processes. There are a variety of phylo-HMM-based methods for local genealogical inference, depending on modeling assumptions [10], [14], [16], [26]. We focus on recHMM [26] as an exemplar method in this class.
The recHMM framework utilizes a statistical model that combines a finite-sites substitution model and a phylo-HMM to capture intra-sequence dependence due to recombination. The combined model parameters θ consist of local gene tree branch lengths, substitution model rates and base frequencies, and state transition probabilities. Emissions occur at a state with likelihood under a finite-sites substitution model, which can be efficiently calculated using Felsenstein's peeling algorithm [5]. Combined model likelihood is calculated using dynamic programming in the form of either the forward or backward algorithm [19]. Typically, model parameters in a traditional HMM are learned by addressing computationally difficult optimization problems; for this reason, heuristics such as the expectation-maximization (EM) algorithm and the related Baum-Welch algorithm are used. An EM-based approach is used to learn recHMM model parameters θ. [26] also applied a structural EM heuristic [6] to automatically learn the set of local gene trees represented by recHMM's states, with one distinct gene tree per state. The recHMM framework allows the user to specify the HMM state space size φ. In our simulation study, recHMM was run with a default setting of φ = 3; we also included model complexity experiments with alternative settings φ ∈ {10, 15}. We note that, in the structural EM used by [26], HMM states are distinguished by both gene tree topologies and branch lengths.
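To make the emission calculation concrete, the sketch below applies Felsenstein's peeling recursion to a single alignment column under the Jukes-Cantor model. It is an illustration only and is not recHMM's code: the nested-tuple tree encoding, the function names, and the branch lengths are assumptions, and the tree is rooted at an arbitrary node (which does not affect the likelihood under a reversible model).

```python
import math

BASES = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor probability of base a changing to base b along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def partials(node, column):
    """Conditional likelihoods of the data below `node`, one entry per base at `node`.

    A leaf is a sequence name (str); an internal node is a tuple
    (left_child, left_branch_length, right_child, right_branch_length).
    `column` maps each sequence name to its observed base at this site.
    """
    if isinstance(node, str):
        return [1.0 if column[node] == x else 0.0 for x in BASES]
    left, t_left, right, t_right = node
    p_left, p_right = partials(left, column), partials(right, column)
    out = []
    for x in BASES:
        s_left = sum(jc_prob(x, y, t_left) * p_left[j] for j, y in enumerate(BASES))
        s_right = sum(jc_prob(x, y, t_right) * p_right[j] for j, y in enumerate(BASES))
        out.append(s_left * s_right)
    return out

def site_likelihood(tree, column):
    """Likelihood of one column, using uniform base frequencies at the root."""
    return sum(0.25 * v for v in partials(tree, column))

# Four-sequence example with hypothetical branch lengths.
tree4 = (("s1", 0.1, "s2", 0.1), 0.05, ("s3", 0.1, "s4", 0.1), 0.05)
lik = site_likelihood(tree4, {"s1": "A", "s2": "A", "s3": "G", "s4": "G"})
```

In a phylo-HMM, a value such as `lik` would be computed once per site and per hidden state, with each state supplying its own gene tree.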
As is common practice for local search heuristics in the context of statistical learning, [26] utilized ψ independent optimization trials and then selected the best trial under the maximum likelihood criterion as a means to avoid getting stuck in local optima (cf. Figure 1 in [26]). We followed their practice in our study: when run as a standalone method, recHMM utilized ψ = 100 independent optimization trials.
Consistent with the study of [26], we used the posterior decoding algorithm to perform statistical inference of local phylogenies [19]. The posterior decoding algorithm addresses the following problem. Let G be the set of all possible unrooted gene tree topologies on n sampled individuals. The input consists of a multiple sequence alignment A on n sequences (one for each of n individuals) with length k (i.e., k sites in A). A is assumed to contain recombinant sequences, and historical recombination can cause local genealogies to vary across the sites in A [7]. The output consists of the following: for each aligned site a_i where 1 ≤ i ≤ k, we seek the conditional probability that the HMM is in a hidden state corresponding to a particular gene tree g ∈ G conditional on all sites in A and the fitted HMM model. For a particular HMM instance, the posterior decoding effectively estimates which gene tree is the most likely evolutionary history that explains the observed character at a given site conditional on the sequence of all observed sites in A. Analogous to the distinction between filtering and smoothing [22], the posterior decoding weighs any particular inference at a given site against the total evidence across all sites.
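The posterior decoding computation can be illustrated with a scaled forward-backward pass. The sketch below is generic HMM code rather than recHMM's implementation; the function name and the numeric values for a three-state model are assumptions chosen only for illustration, and the per-site emission likelihoods are taken as given.

```python
def posterior_decoding(init, trans, emis):
    """Per-site posterior state probabilities for a first-order HMM.

    init[s]     : initial probability of state s
    trans[s][t] : probability of moving from state s to state t
    emis[i][s]  : likelihood of site i under state s (e.g., from a gene tree)
    Returns post[i][s] = P(state s at site i | all sites).
    """
    k, m = len(emis), len(init)
    # Forward pass, rescaled at every site to avoid numerical underflow.
    prev = [init[s] * emis[0][s] for s in range(m)]
    z = sum(prev)
    prev = [p / z for p in prev]
    fwd = [prev]
    for i in range(1, k):
        cur = [emis[i][t] * sum(prev[s] * trans[s][t] for s in range(m)) for t in range(m)]
        z = sum(cur)
        prev = [c / z for c in cur]
        fwd.append(prev)
    # Backward pass, also rescaled; only ratios matter after normalization.
    bwd = [[1.0] * m for _ in range(k)]
    for i in range(k - 2, -1, -1):
        raw = [sum(trans[s][t] * emis[i + 1][t] * bwd[i + 1][t] for t in range(m)) for s in range(m)]
        z = sum(raw)
        bwd[i] = [r / z for r in raw]
    # Combine and normalize per site.
    post = []
    for i in range(k):
        raw = [fwd[i][s] * bwd[i][s] for s in range(m)]
        z = sum(raw)
        post.append([r / z for r in raw])
    return post

# Hypothetical three-state model (phi = 3) on four sites.
init = [1 / 3, 1 / 3, 1 / 3]
trans = [[0.98, 0.01, 0.01], [0.01, 0.98, 0.01], [0.01, 0.01, 0.98]]
emis = [[0.020, 0.004, 0.004], [0.015, 0.006, 0.005], [0.003, 0.018, 0.004], [0.004, 0.020, 0.003]]
post = posterior_decoding(init, trans, emis)
```

The forward pass alone corresponds to filtering; multiplying by the backward terms is what conditions each site's inference on the sites to its right as well.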

B. The SERES+recHMM Pipeline
The key algorithmic contribution of this study takes the form of a methodological pipeline for local phylogenetic inference which augments recHMM with SERES random walks. First, we ran SERES resampling on the input alignment A. Detailed pseudocode for this procedure is shown in Algorithm 1 (reproduced from [25]), and Fig. 1 provides an illustrated example of a SERES random walk on an input MSA. The SERES resampling procedure in our simulation study utilized a default reversal probability γ = 0.005. We also conducted additional experiments with alternative reversal probability values γ ∈ {0, 0.01, 0.1}.
Next, we ran recHMM on each SERES replicate. Consistent with the study of [26], we observed that the quality of recHMM's inference depends upon sufficiently intensive learning optimization. We adopted a conservative approach and restricted the number of independent learning trials ψ used in the SERES-based pipeline, where recHMM was run on each SERES replicate with ψ = 10 independent trials. For each dataset, the total number of independent learning trials used in the SERES-based pipeline was therefore equal to the number of independent learning trials used by the standalone recHMM method. Otherwise, recHMM re-estimation of a SERES replicate was run in an identical manner compared to the standalone recHMM method. Given optimized model parameter values, inference proceeded via the posterior decoding algorithm. The resulting output annotation consists of a per-site probability distribution over φ gene tree topologies.

Fig. 1. Illustrated example of a SERES random walk on an input multiple sequence alignment (MSA). The input to the SERES resampling algorithm consists of a multiple sequence alignment. The SERES resampling algorithm takes the form of a random walk. First, a start site and initial walk direction are chosen uniformly at random. In this example, the fourth site from the left is chosen as the start site, and the initial walk direction is rightward. The random walk proceeds, where an MSA site is sampled during each step of the walk. Walk reversals occur with certainty at the start and end of the input MSA and with probability γ at any other point of the walk. The walk concludes when the sampled replicate meets a sequence length criterion, namely that the number of sites in the sampled replicate and input MSA are equal. In this example, a first reversal occurs in the interior of the input MSA and a second reversal occurs at the left boundary of the input MSA. The output consists of the sampled replicate.

Algorithm 1 Pseudocode for SERES Walk on Aligned Sequences. Reproduced From [25]
1: procedure SERESWALKONALIGNEDSEQUENCES(A, γ, numReplicates)
Input: MSA A, walk reversal probability γ, number of SERES replicates numReplicates
Output: list of SERES replicates
return(replicates)
For each site, inferred posterior decoding probability distributions were aggregated across all SERES replicates in which the site appeared (with per-replicate multiplicity based on the number of times that the site was sampled within the replicate). The aggregated distribution was then normalized to obtain a valid probability distribution.
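One way to read the aggregation step is sketched below. Two assumptions are made that the text does not spell out: aggregation is taken to be a sum of the per-replicate posterior vectors, and each replicate's posterior columns are assumed to have already been mapped onto a shared ordering of gene tree topologies. The function name and toy inputs are hypothetical.

```python
def aggregate_posteriors(k, replicates):
    """Aggregate per-replicate posteriors back onto the k original sites.

    replicates: list of (site_indices, posteriors) pairs, where
      site_indices[j] is the original site sampled at replicate position j and
      posteriors[j][s] is the posterior for topology s at replicate position j.
    A site sampled several times within a replicate contributes once per
    occurrence. Sites never sampled by any replicate are returned as None.
    """
    m = len(replicates[0][1][0])
    totals = [[0.0] * m for _ in range(k)]
    for site_indices, posteriors in replicates:
        for j, site in enumerate(site_indices):
            for s in range(m):
                totals[site][s] += posteriors[j][s]
    aggregated = []
    for row in totals:
        z = sum(row)
        aggregated.append([v / z for v in row] if z > 0 else None)
    return aggregated

# Two toy replicates over four original sites and two topologies.
reps = [
    ([0, 1, 1], [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]),
    ([2, 1, 0], [[0.2, 0.8], [1.0, 0.0], [0.0, 1.0]]),
]
agg = aggregate_posteriors(4, reps)
```

Returning `None` for unsampled sites is a design choice of this sketch; with enough replicates, every site is expected to be covered.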

C. Simulated Datasets
Gene trees were simulated under the CwR model using ms [9]. Each CwR simulation sampled either 4, 5, 6, or 10 alleles with scaled recombination rate ρ ∈ {0.5, 1.0, 2.0} and total sequence length of 1 kb per replicate. For each gene tree, finite-length sequence evolution was simulated under the Jukes-Cantor model of nucleotide substitution [11] using Seq-Gen [20]. We used a substitution rate θ ∈ {0.5, 1.0, 2.0}. A model condition consisted of fixed values for all model parameters, and simulation procedures were repeated so that 30 replicate datasets were generated per model condition. Summary statistics for the simulated datasets are shown in Table I. We assessed topological accuracy of inferred gene trees relative to ground truth using the Robinson-Foulds measure [21], which is the proportion of bipartitions that occur in an inferred gene tree but not the true gene tree or vice versa. The coalescent simulations were performed using the following ms [9] command:
    ms <number of sampled alleles> 1 -r <rho> <number of sites> -T
where the number of sampled alleles is 4, 5, 6, or 10, the scaled recombination rate ρ is 0.5, 1, or 2, the number of sites is 1000, and the -T parameter outputs true local gene trees. The Seq-Gen simulations made use of the following command:
    seq-gen -mHKY -l <number of sites> -p <number of partitions> -s <mutation rate> -z <PRNG seed> <gene trees> > <seqfile>
where the -mHKY parameter with no additional options specifies the Jukes-Cantor mutation model, the -l parameter specifies sequence length of 1000 bp, the -p parameter is the number of local gene trees, the -s parameter specifies the mutation rate θ ∈ {0.5, 1, 2}, the -z option specifies the pseudo-random number generator seed, and the <gene trees> argument is the list of true gene trees that were output by ms.
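The Robinson-Foulds measure mentioned above can be sketched on trees that are already encoded as sets of non-trivial bipartitions. The encoding, the function names, and the choice to normalize by the total number of bipartitions in both trees are assumptions of this illustration, not details taken from the study's scripts.

```python
def canonical_split(side, taxa):
    """Represent a bipartition by the side that excludes the smallest taxon label."""
    side = frozenset(side)
    return side if min(taxa) not in side else frozenset(taxa) - side

def normalized_rf(splits_a, splits_b, taxa):
    """Proportion of bipartitions found in exactly one of the two trees."""
    a = {canonical_split(s, taxa) for s in splits_a}
    b = {canonical_split(s, taxa) for s in splits_b}
    total = len(a) + len(b)
    return len(a ^ b) / total if total else 0.0

# Unrooted four-taxon trees have a single non-trivial bipartition each.
taxa = {1, 2, 3, 4}
same = normalized_rf([{1, 2}], [{3, 4}], taxa)   # 12|34 written from either side
diff = normalized_rf([{1, 2}], [{1, 3}], taxa)   # 12|34 versus 13|24
```

For four sequences the measure is therefore binary (0 when the topologies match and 1 otherwise), while larger trees admit intermediate values.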

D. Empirical Datasets
We also re-analyzed two HIV datasets from the study of [26]. One dataset consisted of Indian samples that were originally studied by [15], and the other consisted of Malaysian samples that were originally studied by [13]. Both datasets had four sequences, including the putatively recombinant sequences that were the original foci of the two studies.

E. Software and Data Availability
Open-source software and open data can be found at https://gitlab.msu.edu/liulab/seres-based-recombinationbreakpoint-inference-data-and-scripts.

A. Simulation Study
The performance measures evaluated the extent to which each method's inferred per-site posterior probability distribution reflected topological accuracy. We initially examined the correlation between each method's inferred per-site posterior probability for a gene tree topology g and the topological accuracy of g. Equivalently, we quantified the anticorrelation between the former and the topological error of g, as measured by the Robinson-Foulds distance between g and the true gene tree topology for a site. We focus on correlation rather than anticorrelation to simplify discussion. Table II shows correlation results for each simulation study model condition.
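The equivalence between correlation with accuracy and anticorrelation with error can be checked with a small Pearson correlation sketch. The numeric values are hypothetical, and defining accuracy as one minus the normalized Robinson-Foulds error is an assumption made for this illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-site posterior probabilities for candidate topologies,
# paired with each topology's normalized Robinson-Foulds error.
posterior = [0.9, 0.7, 0.2, 0.1]
rf_error = [0.0, 0.0, 1.0, 1.0]
accuracy = [1.0 - d for d in rf_error]

r_accuracy = pearson(posterior, accuracy)
r_error = pearson(posterior, rf_error)
```

Because accuracy is an affine, decreasing function of error, the two coefficients have equal magnitude and opposite sign.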
Across all 4-sequence model conditions, SERES+recHMM inference was consistently better correlated with topological accuracy compared to standalone recHMM. Performance improvement obtained by coupling recHMM analysis with SERES resampling and re-estimation was robust to varying mutation rates and recombination rates. Absolute correlation improvements were large in magnitude, amounting to at least 0.203 for any model condition and as much as 0.305.
We next compared the two methods' inferred posterior probability distributions on the 4-sequence model conditions. Fig. 2 shows a histogram of recHMM-inferred per-site posterior probabilities for gene tree topologies falling into two classes: either the true gene tree topology for a site ("true class") or all other topologies ("false class"); Fig. 3 shows the equivalent histogram for SERES+recHMM.
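The true-class and false-class histograms can be illustrated as follows. The sketch pools posterior probabilities into ten equal-width bins and normalizes each class separately; the data layout (one dictionary of topology posteriors per site), the function names, and the toy values are assumptions, and the per-replicate normalization used for the published figures may differ in detail.

```python
def decile_histogram(probs):
    """Fraction of probabilities falling in each of ten equal-width bins on [0, 1]."""
    counts = [0] * 10
    for p in probs:
        counts[min(int(p * 10), 9)] += 1  # p = 1.0 is placed in the top bin
    n = len(probs)
    return [c / n for c in counts] if n else [0.0] * 10

def class_histograms(site_posteriors, true_topologies):
    """Split per-site posteriors into the true class and the false class."""
    true_probs, false_probs = [], []
    for post, truth in zip(site_posteriors, true_topologies):
        for topology, p in post.items():
            (true_probs if topology == truth else false_probs).append(p)
    return decile_histogram(true_probs), decile_histogram(false_probs)

# Two toy sites with three candidate topologies each.
sites = [
    {"T1": 0.95, "T2": 0.05, "T3": 0.0},
    {"T1": 0.25, "T2": 0.75, "T3": 0.0},
]
truth = ["T1", "T1"]
true_hist, false_hist = class_histograms(sites, truth)
```

A well-calibrated method concentrates the true-class mass in the upper bins and the false-class mass in the lower bins.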
Focusing on the true class of per-site inferences, inferences with less than 10% posterior probability were consistently reduced across all model conditions when comparing standalone recHMM versus SERES+recHMM; the reduction amounted to more than half in all cases. The latter method's posterior probability distribution was shifted rightward compared to the former method (i.e., the SERES+recHMM-inferred posterior probability mass was instead distributed among per-site inferences with higher posterior probability, relative to standalone recHMM). The effect was most pronounced for per-site inferences in the highest decile range of posterior probability (i.e., 90% posterior probability or greater). An opposite trend was observed for the false class of per-site inferences. Standalone recHMM's per-site inferences in the highest decile range of posterior probability (i.e., 90% posterior probability or greater) were typically the second highest in frequency compared to all other deciles; in contrast, SERES+recHMM consistently returned fewer per-site inferences in the top decile of posterior probability range: at most a few percentage points and nearing zero frequency for some model conditions. The SERES+recHMM-inferred posterior distribution for the false class of per-site inferences was consistently shifted leftward compared to standalone recHMM. We attribute these findings to two factors. First, the use of non-parametric resampling and re-estimation appears to be conducive to improved inference of true gene tree topologies. Second, incorrect inferences for all other gene tree topologies (in terms of relatively high inferred posterior probability) were less repeatable.
A similar performance outcome was observed on the larger 5-sequence model conditions. Across all model conditions, SERES+recHMM inference was more strongly correlated with topological accuracy compared to recHMM (Table II). We observed absolute improvements in correlation coefficients amounting to between 0.217 and 0.345. Taken together, SERES+recHMM's performance advantage over standalone recHMM was larger on the 5-sequence model conditions than on the smaller 4-sequence model conditions.
As in the 4-sequence and 5-sequence dataset comparisons, SERES+recHMM's per-site inference was more strongly correlated with topological accuracy across all 6-sequence and 10-sequence model conditions, when compared to standalone recHMM (Table II). However, the observed correlation coefficients for both methods were generally weaker when comparing the 6-sequence and 10-sequence dataset analyses versus analyses of smaller datasets; in particular, standalone recHMM inference was effectively not correlated with topological accuracy on some of the 10-sequence model conditions. Furthermore, the observed improvement in correlation returned by SERES+recHMM versus standalone recHMM varied across model conditions to a greater extent: ranging from nearly comparable (an absolute improvement of 0.004) to 0.206 on the 6-sequence model conditions and between 0.224 and 0.476 on the 10-sequence model conditions. On the 6-sequence model conditions, the histogram comparison of each method's per-site inferences was also different for the true class of per-site inferences, but not for the false class (Figs. 4 and 5). The latter was in fact consistent: for the false class of per-site inferences, SERES+recHMM's inferred posterior probability distributions were more strongly shifted leftward compared to recHMM. The effect was preserved even though posterior probabilities of the false class of inferences were more than double those seen on the 4-sequence and 5-sequence experiments. However, a different outcome was observed for the true class of per-site inferences: rather than a rightward shift, SERES+recHMM returned posterior probability distributions which were generally more diffuse compared to recHMM alone. We attribute these findings to the increased computational complexity of HMM learning optimization as the number of input sequences increases. It is likely that conservatively limiting SERES-based re-estimation to 10 learning iterations is insufficient for the larger model conditions in our study. More intensive learning optimization may yield improved re-estimation and a greater performance benefit from augmenting recHMM with SERES.
We also conducted additional experiments that evaluated the impact of key method parameters. Table IV compares inference accuracy for recHMM versus SERES+recHMM as different choices are used for the SERES reversal probability γ. We found that the performance advantage returned by SERES+recHMM over standalone recHMM was robust to the choice of reversal probability γ so long as the chosen value was not too high; reasonable choices are equivalent to reversal breakpoints separated by at least 100 bp of sequence length on average. The results are consistent with the original motivation for sequence-aware resampling and re-estimation. [25] noted the correspondence between an rth-order Markov process and a SERES random walk with reversal probability γ. For γ = 0.5, a first-order Markov process suffices; for γ < 0.5, higher-order Markovian processes are needed to capture sequential dependence. Essentially, smaller γ values mean that longer-distance sequential dependence is retained. Our results suggest that there is a critical point: past a certain threshold, longer-distance sequential dependence is critical to the performance of resampling and re-estimation for sequence-based inference problems.
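The link between γ and the average spacing of reversal breakpoints follows from treating each step as an independent trial: the number of steps until the next random reversal is then geometrically distributed with mean 1/γ. The short sketch below evaluates this for the non-zero γ values used in the study; it ignores the forced reversals at the alignment boundaries, and the function name is an illustrative assumption.

```python
def mean_steps_between_reversals(gamma):
    """Mean of a geometric distribution with per-step reversal probability gamma."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")
    return 1.0 / gamma

# Non-zero reversal probabilities considered in the simulation study.
spacing = {g: mean_steps_between_reversals(g) for g in (0.005, 0.01, 0.1)}
```

Under this reading, γ = 0.005 and γ = 0.01 correspond to average spacings of about 200 bp and 100 bp, respectively, whereas γ = 0.1 corresponds to about 10 bp, which falls below the 100 bp guideline stated above.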
Table V shows results for recHMM and SERES+recHMM analyses using alternative settings for the HMM state space size φ. We note that the simulated datasets in our study included between approximately 3 and 6 recombination-free intervals with distinct true gene trees, on average. Consistent with [26], we found that using a more complex recHMM model than necessary (i.e., more HMM states than the number of local gene trees encoded within a simulation replicate's ancestral recombination graph) resulted in overfitting. SERES+recHMM's performance was relatively robust to overfitting, as compared to standalone recHMM analysis.
Runtime and memory usage results are reported in Table III. Average runtimes for the two methods were roughly comparable: average runtime differences between the two methods were less than an hour on the 4-sequence and 5-sequence model conditions and less than two hours on the 6-sequence and 10-sequence model conditions, and neither method consistently returned faster average runtime. Throughout our study, we observed low memory usage for both methods that amounted to less than 100 MiB.

B. Empirical Study
1) Method comparison on Indian HIV-1 Dataset: As in the earlier studies of [15] and [26], our SERES-based re-analysis of the Indian HIV-1 dataset clearly detected local topology switching that is consistent with historical recombination. The finding supports the hypothesis that the sequence 95IN21301 is recombinant.
As shown in Fig. 6, the SERES+recHMM method recovered the five breakpoints described by both [15] and [26]; the specific coordinates described in the latter study were 6402 bp, 6969 bp, 7073 bp, 9431 bp, and 9585 bp. In our re-analysis, the breakpoints correspond to switching between the blue topology and orange topology. SERES+recHMM posterior decoding also clearly showed inference uncertainty in the first few hundred bp of the input alignment.
Furthermore, [26] reported two additional breakpoints at 4328 bp and 4401 bp that were not described in the study of [15]. Neither the standalone recHMM method nor the SERES+recHMM method recovered local topological incongruence in this specific region, although standalone recHMM posterior decoding recovered nearby local topology switching in the region from 4000 bp to 4200 bp. However, the SERES-based method indicated more uncertainty regarding gene tree inference within this region, relative to the five breakpoints described by both of the previous studies.
Re-analysis using SERES+recHMM also clarified patterns of local topology switching in other genomic regions. We detected local topological incongruence within the region from 3000 bp to 3500 bp. Some genomic regions exhibited local topology switching in standalone recHMM posterior decoding analysis that was not supported by the SERES+recHMM analysis (e.g., the region from 6000 bp to 6500 bp). Finally, throughout much of the genome alignment, SERES+recHMM inferred lower posterior decoding probability for the gene tree topology shown in green, relative to standalone recHMM. The region located between 5000 bp and 8000 bp was particularly striking: within this region, SERES+recHMM inferred essentially zero probability for the topology shown in green, whereas recHMM inferred highly variable probability that was often far from zero.
2) Method comparison on Malaysian HIV-1 Dataset: Analyses of the Malaysian HIV-1 dataset are shown in Fig. 7. A clear signal of recombination was detected by both methods.
Both methods recovered the six breakpoints that were described by both [26] and [13]. The posterior decoding probability distributions were mostly consistent in two of the breakpoint-delineated regions: between 2141 bp and 2856 bp, and between 3283 bp and 3617 bp. There was a larger discrepancy between the two methods' posterior decoding probabilities for the third breakpoint-delineated region, which was between ∼7600 bp and ∼8250 bp; the SERES+recHMM inference suggested greater uncertainty compared to the standalone recHMM analysis.
[26] also reported four additional breakpoints that were not described by [13]. These breakpoints delineate the regions from 2360 bp to 2553 bp and from 6415 bp to 6594 bp. Our standalone recHMM analysis recovered these breakpoints. However, our SERES+recHMM analysis had two important differences. For the former region, the topologies with highest posterior decoding probability differed between the two analyses. For the latter region, our SERES+recHMM analysis flagged higher uncertainty in the form of more diffuse posterior decoding probabilities.

Fig. 6. We re-analyzed a subset of the Indian HIV-1 genome dataset that was published by [15]. [26] re-analyzed the original dataset using recHMM. Our re-analysis compared local gene tree probabilities computed using standalone recHMM posterior decoding (top panel) versus SERES+recHMM posterior decoding (bottom panel). The plots show posterior decoding probabilities (y-axis) versus genome coordinate (x-axis). Local gene tree probabilities are colored based on the three possible unrooted topologies for the four-sequence dataset (shown in either blue, orange, or green).
We noted other differences between the two methods. The local gene tree topology colored in orange had generally lower posterior decoding probability in our SERES+recHMM analysis, as compared to standalone recHMM analysis. Several regions were flagged by SERES+recHMM as exhibiting greater inference uncertainty, including the regions from ∼4600 bp to ∼5000 bp and from ∼6250 bp to ∼6500 bp. Finally, our standalone recHMM analyses exhibited greater local variation and very short (less than ∼100 bp) regions of local topology switching. We attribute these findings to two factors: the use of resampling in our SERES+recHMM analyses and the larger state space used in our phylogenetic HMM analyses (where φ = 3) as compared to the study of [26] (where φ = 2).

IV. CONCLUSIONS
This study introduced the first application of SERES random walks on aligned sequences. The application is also the first to utilize SERES as a data perturbation technique to improve statistical inference and learning. Our performance validation experiments showed that coupling SERES with recHMM, an HMM-based method for recombination-aware local genealogical inference, yielded improved local inferences and potentially reduced type I and/or type II error. Re-analyses of two HIV genome sequence datasets clarify the findings in the earlier study of [26].
We conclude with thoughts on future research. First, we note that statistical learning was a major bottleneck for the methods under study, particularly for the SERES-based pipeline since optimization-based learning must be addressed for all SERES replicates. This scalability challenge is well suited to "pleasantly" parallel computation as well as more sophisticated parallelization techniques. Second, other studies have investigated recombination inference problems other than local genealogical inference (e.g., recombination rate estimation [24] and recombination hotspot/coldspot detection [1], [18]). SERES resampling and re-estimation may prove to be similarly beneficial in these other contexts. Finally, we believe that we have only begun to realize the full potential of SERES random walks. As with other non-parametric and semi-parametric resampling techniques, SERES promises to find wide utility in computational biology/bioinformatics and beyond.

Fig. 2. Histogram of posterior probabilities inferred by standalone recHMM method on 4-sequence model conditions. Local gene tree topologies at a site were split into two classes: the "true class" consists of the true gene tree topology for the site, and the "false class" contains all other gene tree topologies. For each class and each replicate dataset in a model condition, the inferred posterior probabilities for gene trees at any site were binned into deciles; the resulting histogram was then normalized (n = 30). The normalized histograms for the true and false classes are shown in blue and orange, respectively.

Fig. 3. Histogram of posterior probabilities inferred by SERES+recHMM method on 4-sequence model conditions. Figure layout and description are otherwise identical to Fig. 2.

Fig. 4. Histogram of posterior probabilities inferred by standalone recHMM method on 6-taxon model conditions. Figure layout and description are otherwise identical to Fig. 2.

Fig. 5. Histogram of posterior probabilities inferred by SERES+recHMM method on 6-taxon model conditions. Figure layout and description are otherwise identical to Fig. 2.

Fig. 7. Posterior probabilities of local gene tree topologies inferred by standalone recHMM versus SERES+recHMM method on Malaysian HIV-1 dataset. We re-analyzed the Malaysian HIV-1 genome dataset that was published by [13]. [26] re-analyzed the original dataset using recHMM. Distinct local gene tree topologies are shown using different colors. Figure layout and description are otherwise identical to Fig. 6.

TABLE I
SIMULATED DATASET STATISTICS. MODEL CONDITIONS IN OUR SIMULATION STUDY WERE PARAMETERIZED BY THE NUMBER OF SEQUENCES, RECOMBINATION RATE ρ, AND MUTATION RATE θ. THE NUMBER OF TRUE GENE TREES AND AVERAGE NORMALIZED HAMMING DISTANCE ("ANHD") ARE REPORTED FOR SIMULATED DATASETS FROM THE

TABLE III
THE RUNTIME AND MEMORY USAGE INFORMATION FOR STANDALONE RECHMM AND SERES+RECHMM METHODS ON SIMULATION STUDY MODEL CONDITIONS. MODEL CONDITIONS WERE PARAMETERIZED BY THE NUMBER OF SEQUENCES, RECOMBINATION RATE ρ, AND MUTATION RATE θ. BOTH METHODS UTILIZE MODELS WITH φ = 3 AND γ = 0.005 TO INFER A POSTERIOR PROBABILITY DISTRIBUTION OVER GENE TREE TOPOLOGIES. AVERAGE RUNTIME IN HOURS AND PEAK MEMORY USAGE IN GiB ARE REPORTED ACROSS ALL REPLICATES IN A MODEL CONDITION (n = 30)

TABLE IV
THE COMPARISON AMONG DIFFERENT REVERSAL PROBABILITIES γ ON 4-, 5-, AND 6-TAXON MODEL CONDITIONS. THE METHODS UTILIZE MODELS WITH φ = 3 TO INFER A POSTERIOR PROBABILITY DISTRIBUTION OVER GENE TREE TOPOLOGIES. FOR EACH METHOD'S INFERENCE, WE CALCULATED THE PEARSON CORRELATION BETWEEN THE INFERRED POSTERIOR PROBABILITY FOR A GENE TREE g AND THE TOPOLOGICAL DISTANCE BETWEEN g AND THE TRUE EVOLUTIONARY HISTORY OF A SITE (I.E., THE TRUE LOCAL GENE TREE). THE AVERAGES ARE REPORTED ACROSS ALL n REPLICATES IN A MODEL CONDITION (n = 30)

Fig. 6. Posterior probabilities of local gene tree topologies inferred by standalone recHMM versus SERES+recHMM method on Indian HIV-1 dataset.

TABLE V
THE COMPARISON AMONG DIFFERENT NUMBER OF STATES φ ON 5-TAXON MODEL CONDITIONS. THE METHODS UTILIZE MODELS WITH γ = 0.005 TO INFER A POSTERIOR PROBABILITY DISTRIBUTION OVER GENE TREE TOPOLOGIES. FOR EACH METHOD'S INFERENCE, WE CALCULATED THE PEARSON CORRELATION BETWEEN THE INFERRED POSTERIOR PROBABILITY FOR A GENE TREE g AND THE TOPOLOGICAL DISTANCE BETWEEN g AND THE TRUE EVOLUTIONARY HISTORY OF A SITE (I.E., THE TRUE LOCAL GENE TREE). THE AVERAGES ARE REPORTED ACROSS ALL n REPLICATES IN A MODEL CONDITION (n = 30)