Large-Scale Multiple Sequence Alignment and the Maximum Weight Trace Alignment Merging Problem

MAGUS is a recent multiple sequence alignment method that provides excellent accuracy on large challenging datasets. MAGUS uses divide-and-conquer: it divides the sequences into disjoint sets, computes alignments on the disjoint sets, and then merges the alignments using a technique it calls the Graph Clustering Method (GCM). To understand why MAGUS is so accurate, we show that GCM is a good heuristic for the NP-hard MWT-AM problem (Maximum Weight Trace, adapted to the Alignment Merging problem). Our study, using both biological and simulated data, establishes that MWT-AM scores correlate very well with alignment accuracy and presents improvements to GCM that are even better heuristics for MWT-AM. This study suggests a new direction for large-scale MSA estimation based on improved divide-and-conquer strategies, with the merging step based on optimizing MWT-AM. MAGUS and its enhanced versions are available at https://github.com/vlasmirnov/MAGUS.


INTRODUCTION
Multiple sequence alignment is a basic step in many bioinformatics pipelines, including phylogenetic estimation, protein structure prediction, the detection of selection, etc. Yet accurate alignment estimation is very challenging under some circumstances (e.g., when datasets evolve with high rates of evolution) [1]. Furthermore, only a few methods have been able to run on datasets with 1000 or more sequences [2], [3], [4], and many of the methods that can run on large datasets degrade in accuracy under those conditions [1], [4]. Thus, accurate alignment estimation on large datasets, especially ones that have evolved under high rates of evolution (and so have low pairwise sequence identity), is very challenging.
Divide-and-conquer strategies have been designed to enable highly accurate alignment methods to scale to large datasets while maintaining high accuracy. These methods operate in three stages: first the sequence dataset is divided into disjoint subsets, then alignments are computed for each of the subsets, and finally the subset alignments are merged into an alignment on the entire dataset. Examples of these methods include SATCHMO-JS [5], SATé [1], [6], and PASTA [2]. Of these, PASTA provides the best accuracy on datasets with many thousands of sequences and high rates of evolution, and has been shown to scale to 1,000,000 sequences. MAGUS [7] is a recent MSA method that uses this strategy, and it provides a substantial accuracy advantage over PASTA, especially on datasets with high rates of evolution. Yet MAGUS is essentially identical to PASTA for the first two stages, and only differs significantly in how it performs the third stage (where the disjoint subset alignments are merged into an alignment on the full dataset). Specifically, PASTA merges the set of k disjoint alignments by first merging k - 1 pairs of alignments (using either OPAL [8] or Muscle [9]), and then uses the overlap between the merged pairs to define the final merged alignment on the basis of transitivity. In contrast, MAGUS uses a method called the Graph Clustering Method (GCM) to merge the k disjoint alignments all at once, rather than pairwise. Thus, the advantage MAGUS has over PASTA is due to the use of GCM instead of PASTA's technique.
This study aims to understand why the use of GCM produces highly accurate merged alignments. To do this, we formulate an optimization problem for merging disjoint alignments, which we call the Maximum Weight Trace Alignment Merging (MWT-AM) problem. The Maximum Weight Trace problem itself is a classical problem in bioinformatics, introduced by Kececioglu nearly 30 years ago [10].
The input to the MWT problem is a set of sequences (e.g., DNA sequences) and weights on pairs of letters from the different sequences, and the objective is an MSA that has the maximum total possible weight (defined to be the sum of the weights of aligned letters in the output MSA). Kececioglu [10] showed that MWT is NP-hard and can be solved exactly in O(2^B L^B B^2) time for B sequences of total length L using dynamic programming. Kececioglu [10] also proposed a more practical branch-and-bound refinement of this algorithm, but the reduction in time is still not sufficient to analyze more than relatively small datasets (e.g., at most 20 sequences). Subsequent studies provided additional techniques, both for exact solutions [11], [12] and for heuristic solutions [13], [14]. Regrettably, not even the heuristics are able to run on moderate-sized datasets, and the methods of Moreno and Karp [14] and Koller and Raidl [13] are not publicly available.
In this study, we define the Maximum Weight Trace for Alignment Merger Problem, or MWT-AM. We then show how GCM can be seen as a heuristic for this problem, applied to a particular input that is constructed by MAGUS, and we explore the optimization criterion scores it obtains. We then modify the GCM strategy to try to improve the MWT-AM scores, and we evaluate the correlation between MWT-AM scores and alignment accuracy. We prove that MWT-AM is NP-hard, and explore variants of GCM to improve the MWT-AM criterion scores.
Our study shows that the GCM technique and its variants are better solutions to MWT-AM than MAFFT-merge [15] and T-Coffee [16], which are the only other methods currently available that can merge disjoint alignments. Our study also shows that MWT-AM scores correlate well with alignment accuracy, suggesting that MWT-AM is a desirable optimization problem for use within divide-and-conquer alignment estimation. Finally, we show that using GCM and its variants within a divide-and-conquer pipeline enables analyses of datasets with thousands of sequences and produces highly accurate alignments. GCM and MAGUS are freely available at https://github.com/vlasmirnov/MAGUS.

Maximum Weight Trace Alignment Merging
The input to the Maximum Weight Trace (MWT) problem is a set S of sequences and a weight on selected pairs of letters from the sequences; the objective is a multiple sequence alignment of the sequences that maximizes the total weight of the pairs of letters that appear together in a column. The output alignment is also referred to as a "trace," which has a graph-theoretic description in [10] but can be more simply described as follows. A trace is a partition of the letters of the input sequences into an ordered set of pairwise disjoint subsets X_1, X_2, ..., X_k so that (1) each set has at most one letter from each sequence and (2) if letters x and y from the same sequence s appear in X_i and X_j respectively with i < j, then x appears before y in s. By treating the subsets as defining columns, these two properties ensure that the provided ordering of the subsets defines a multiple sequence alignment of the sequences. The weight of the trace is the sum of the weights of those pairs of letters that appear in the same subset, and hence in the same column in the alignment defined by the trace.
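The two defining properties of a trace can be checked directly. Below is a minimal sketch in Python, using a hypothetical representation in which each letter is a (sequence id, position) pair and a trace is an ordered list of sets; the names are illustrative, not from MAGUS:

```python
def is_valid_trace(trace):
    """Check the two trace properties: (1) each subset contains at most
    one letter from each sequence, and (2) for each sequence, letters
    appear in subsets in the same order as they appear in the sequence.
    `trace` is an ordered list of sets of (seq_id, position) pairs."""
    last_position = {}  # seq_id -> position of the most recent letter seen
    for subset in trace:
        seqs_in_subset = set()
        for seq_id, pos in subset:
            if seq_id in seqs_in_subset:
                return False  # property (1) violated
            seqs_in_subset.add(seq_id)
            if pos <= last_position.get(seq_id, -1):
                return False  # property (2) violated
            last_position[seq_id] = pos
    return True

# Columns (A,C) then (B,D) for sequences "AB" and "CD": a valid trace.
valid = [{(0, 0), (1, 0)}, {(0, 1), (1, 1)}]
# Letter 1 of sequence 0 placed before letter 0 of sequence 0: invalid.
invalid = [{(0, 1)}, {(0, 0), (1, 0)}]
```

Because each subset holds at most one letter per sequence, tracking only the last position seen per sequence suffices to verify the ordering property.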
We generalize the MWT problem to allow the input to be a collection of disjoint alignments instead of a collection of individual sequences, and we refer to this as the Maximum Weight Trace Alignment Merging problem, or MWT-AM. We begin with some basic definitions.

Definition 1. Given an alignment A on sequence set S and a proper subset S' ⊂ S, the restriction of A to the sequences in S' is the sub-alignment of A induced by S'. In this restriction, if a site is entirely gapped in the sub-alignment, then it is removed (i.e., masked). Thus, the induced alignment contains no all-gap sites. Letting A' denote the induced alignment on S', we will say that A induces A'. Given a collection A of alignments A_1, A_2, ..., A_k, where A_i is an alignment of set S_i, we say that A is a merger of the alignments in A if and only if A is an MSA of the sequences in set S = ∪_i S_i and A induces A_i for each i = 1, 2, ..., k. We will also refer to the alignments in A as constraint alignments.
Definition 2. Given an alignment A that is a merger of alignments in A and given columns c and c' drawn from different alignments, we let A_{c,c'} be the indicator function that returns 1 if c and c' are in the same column in A and otherwise returns 0. Then, given a non-negative weight function w(c, c'), where c and c' are columns in different alignments in A, we define the weight of A to be weight(A) = Σ_{c,c'} [w(c, c') × A_{c,c'}]. In other words, the weight of the merged alignment A is the sum of the weights of column pairs, where the two columns come from different alignments in A.
We now define the MWT-AM problem, using these definitions. The input to MWT-AM is a set A of alignments as well as a set of weights on selected pairs of columns from different alignments in A. The output will be an alignment A that is a merger of the alignments in A (see Definition 1) that maximizes the total weight (see Definition 2).
The MWT-AM problem can also be described in terms of the MWT problem, as we now show. A trace is a partition of the columns of the input constraint alignments into an ordered list of clusters, so that (1) each cluster contains at most one column from each alignment and (2) the ordering of the clusters is "valid": if columns x and y from the same constraint alignment A appear in clusters C_i and C_j, respectively, with i < j, then x appears before y in A. Such a trace trivially gives us a multiple sequence alignment that induces each of our constraint alignments (i.e., the trace defines a merged alignment). Then, given a weighting w(x, y) for every pair of columns x, y from different constraint alignments, we can define the weight of a trace T to be Σ_{x,y} w(x, y), where the sum is taken over all pairs of columns x, y belonging to the same cluster in T. It is easy to see that this is exactly the same as the weight of the merged alignment defined by the trace according to Definition 2.
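Scoring a candidate trace under this definition is a direct summation over same-cluster column pairs. A minimal sketch, with a hypothetical representation (columns are string identifiers, and the weight function is a dict keyed by unordered pairs, with absent pairs weighing 0):

```python
def trace_weight(trace, w):
    """Weight of a trace: the sum of w(x, y) over all pairs of columns
    x, y that share a cluster.  `trace` is a list of clusters (each a
    list of column ids drawn from different constraint alignments);
    `w` maps frozenset({x, y}) to a non-negative weight (0 if absent)."""
    total = 0
    for cluster in trace:
        for i in range(len(cluster)):
            for j in range(i + 1, len(cluster)):
                total += w.get(frozenset((cluster[i], cluster[j])), 0)
    return total

# Two clusters: {a1, b1} and {a2, b2, c1}; unlisted pairs weigh 0.
w = {frozenset(("a1", "b1")): 3,
     frozenset(("a2", "b2")): 2,
     frozenset(("b2", "c1")): 4}
total = trace_weight([["a1", "b1"], ["a2", "b2", "c1"]], w)  # 3 + 2 + 4 = 9
```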
We now formally define the MWT-AM problem.
Input: Multiple sequence alignments A_1, A_2, ..., A_k, with A_i an MSA on set S_i of sequences, S_i ∩ S_j = ∅ for i ≠ j, and a weight function w(x, y) ≥ 0 for every pair of columns x, y with x and y from different MSAs.
Output: A trace T on A_1, A_2, ..., A_k of maximum weight (i.e., a merged alignment of the input alignments that maximizes the total weight).
It is easy to see that if each MSA A_i is a single sequence, then MWT-AM is identical to MWT. It is also easy to see that finding a trace with the largest MWT-AM score is identical to finding a merged alignment with the maximum total weight (Definition 2). The proof follows easily from the corresponding theorems for MWT in [12].

Graph Clustering Merger (GCM)
The Graph Clustering Merger (GCM) is an algorithm that was developed for use in MAGUS to merge the constraint alignments it produces in a divide-and-conquer setting. In this original formulation, GCM constructs a set of backbone alignments (i.e., a library graph) in order to weight the pairs of sites from different constraint alignments, but it can also be used to merge a set of disjoint alignments if the weights are provided by the user. In what follows, we describe GCM for use in both cases, noting that without user-provided weights it will perform an initial step (referred to as Step 0 below) to compute the weights. GCM then uses the weights it computes (or that it is given) to construct a merged alignment, referred to as a "trace" [10], through a sequence of steps. We explore variants of the original GCM algorithm from [7] in an attempt to improve the MWT-AM scores. Hence, the final general design for these GCM variants has several stages (Fig. 1), some of which allow for variants. If the input includes weights on the pairs of sites from different constraint alignments, then GCM skips Step 0. Fig. 1 presents a description of the stages of the algorithm.

Variants for Step 2: Clustering
The original technique for this step used in MAGUS is the Markov Clustering Algorithm (MCL) from [17]. Here we compare MCL to other approaches (for details, see Supplementary Materials, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TCBB.2022.3191848, Section S1). Furthermore, within MCL, we vary the inflation factor to evaluate its impact.
Multi-Level Regularized MCL (MLR-MCL) is a modification of MCL [18] and offers two extensions. The first improves MCL's scalability with a hierarchical structure: the input graph is coarsened to a more tractable size [19], clustered with MCL, uncoarsened, and the coarse clustering is refined. The second extension is a "regularization" operator, which is meant to smooth the flow distributions of neighboring nodes and reduce MCL's perceived over-clustering.
Region Growing (RG) is inspired by the heuristic in [13] and is reminiscent of Kruskal's algorithm for finding minimum spanning trees. We initialize our clustering with every node in its own cluster. Each pair of clusters is added to a max heap, weighted by the weight of the edge between them. We then proceed to take pairs of clusters off the heap, merging the pair together if they do not contain any nodes from the same sub-alignment (i.e., the new cluster would be a valid MSA column). When we merge cluster B into cluster A, we update A's weight to all of B's neighbors (we call clusters "neighbors" if any of their elements have edges between them) as follows: for each cluster C among B's neighbors, we let Weight(A, C) = Weight(A, C) + Weight(B, C) and put the pair (A, C) on the max heap. We continue merging pairs off the heap until no further merges are possible.
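The Region Growing procedure above can be sketched as follows. This is an illustrative Python implementation with a hypothetical data layout (node -> sub-alignment id, edge -> weight), not the MAGUS code; it emulates a max heap with changing priorities by lazily skipping stale entries:

```python
import heapq
from itertools import count

def region_growing(node_subaln, edge_weights):
    """Region Growing sketch: start with every node in its own cluster
    and repeatedly merge the heaviest pair of neighboring clusters whose
    union is still a valid column (at most one node per sub-alignment).
    node_subaln: node -> sub-alignment id; edge_weights: (u, v) -> weight.
    Returns the final clusters as a list of sets of nodes."""
    parent = {u: u for u in node_subaln}      # union-find style representative
    members = {u: {u} for u in node_subaln}   # representative -> cluster nodes
    nbr = {u: {} for u in node_subaln}        # representative -> {neighbor: weight}
    tick = count()                            # tiebreaker so heap never compares nodes
    heap = []

    def find(u):
        while parent[u] != u:
            u = parent[u]
        return u

    for (u, v), w in edge_weights.items():
        nbr[u][v] = nbr[u].get(v, 0) + w
        nbr[v][u] = nbr[v].get(u, 0) + w
        heapq.heappush(heap, (-nbr[u][v], next(tick), u, v))

    while heap:
        negw, _, a, b = heapq.heappop(heap)
        a, b = find(a), find(b)
        # Skip stale entries: already merged, or the pair weight has changed.
        if a == b or nbr.get(a, {}).get(b) != -negw:
            continue
        if {node_subaln[x] for x in members[a]} & {node_subaln[x] for x in members[b]}:
            continue  # merged cluster would hold two columns from one sub-alignment
        parent[b] = a                          # merge cluster b into cluster a
        members[a] |= members.pop(b)
        del nbr[a][b], nbr[b][a]
        for c, w in nbr.pop(b).items():        # reroute b's edges to a
            nbr[c].pop(b, None)
            nbr[a][c] = nbr[a].get(c, 0) + w
            nbr[c][a] = nbr[a][c]
            heapq.heappush(heap, (-nbr[a][c], next(tick), a, c))
    return [members[r] for r in {find(u) for u in node_subaln}]
```

The lazy-deletion check (comparing the popped weight against the current neighbor weight) is one common way to implement a max heap whose keys change as clusters merge.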

Variants for Step 3: Computing a Trace
We compare the original technique in GCM (which we will refer to as "minclusters") to two new techniques.
We explore the Fiduccia-Mattheyses (FM) algorithm [20], previously applied to the realm of multiple sequence alignment by MSARC [21]. FM is a heuristic for finding a minimum-weight split in a graph, constrained by a balancing requirement (the sizes of the two parts can differ by at most some value). Like MSARC, we use FM to recursively cut our graph in half, looking for bipartitions that minimize the weight of the split. Notably, the FM trace does not require a prior clustering, meaning that Step 2 can be skipped.
The MWTgreedy algorithm is a simplified version of the first step of the algorithm described by Moreno and Karp [14]. MWTgreedy takes the alignment graph (with or without clustering) and performs a breadth-first search on the graph, starting from the "left"; when it finds a cycle, it breaks the lowest-weight edge in the cycle.
In MWTsearch, we combine MWTgreedy with the look-ahead ability of the minclusters search algorithm used in MAGUS. Instead of just greedily breaking the smallest edge in every cycle, we consider every edge in the cycle as a possible move, and we try to find a sequence of such moves with the smallest total weight.
The RG-FAST method is nearly the same as the Region Growing (RG) clustering method described earlier, but we constrain the algorithm to produce a valid trace, rather than just a clustering.

Variants for Step 4: Optimizer
This is an optional step, which uses a greedy strategy to improve the MWT-AM score, moving nodes between columns in our trace and stopping when no additional gains can be made. A "move" entails transferring a node X from one cluster (column) to another, in such a way that the trace remains valid. In each iteration, we identify all valid moves (as defined above) with a positive improvement to the MWT-AM score. We perform these moves in descending order of gain, updating the gains of subsequent moves as needed. We stop when we no longer see any gainful moves.

Fig. 1. The stages of GCM (adapted from [7]). The input to GCM is a set of constraint alignments. Optionally, we also supply weights on pairs of columns (sites) from different constraint alignments.
Step 0: If the weights are not provided, this step constructs the backbone alignments and uses these to compute the weights on pairs of columns.
Step 1: We build an alignment graph, where each node represents a column from a constraint alignment and the weight of the edge between two nodes is the weight for the corresponding pair of columns. Thicker lines represent edges with higher weight.
Step 2: We cluster the alignment graph with our method of choice.
Step 3: We find a valid trace, where each cluster represents a column in our final alignment. Our example shows two common problems: the orange-outlined cluster contains columns from the same constraint alignment, and there is no valid ordering between the green-outlined cluster and the three other clusters that it "crosses".
Step 4: We optionally run our trace through the optimizer, which greedily moves nodes between clusters to increase the MWT-AM score.
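The greedy Step 4 optimizer can be sketched as follows. This simplified illustration (not the MAGUS implementation) recomputes all gains on each pass rather than maintaining a gain-ordered queue, and it checks only the one-column-per-constraint-alignment constraint; the full optimizer also enforces the cluster-ordering (trace validity) constraint:

```python
def greedy_optimize(clusters, w, node_subaln):
    """Greedily relocate single nodes (columns) between clusters while
    doing so strictly increases the MWT-AM score.
    clusters: list of sets of nodes; w: frozenset({x, y}) -> weight;
    node_subaln: node -> id of the constraint alignment it came from."""
    def pair_weight(node, cluster):
        # total edge weight between `node` and the other members of `cluster`
        return sum(w.get(frozenset((node, other)), 0)
                   for other in cluster if other != node)

    improved = True
    while improved:
        improved = False
        best = None  # (gain, node, source index, destination index)
        for i, src in enumerate(clusters):
            for node in src:
                loss = pair_weight(node, src)
                for j, dst in enumerate(clusters):
                    if j == i:
                        continue
                    # destination must not already hold a column from
                    # the same constraint alignment
                    if any(node_subaln[x] == node_subaln[node] for x in dst):
                        continue
                    gain = pair_weight(node, dst) - loss
                    if gain > 0 and (best is None or gain > best[0]):
                        best = (gain, node, i, j)
        if best is not None:
            _, node, i, j = best
            clusters[i].remove(node)
            clusters[j].add(node)
            improved = True
    return clusters
```

Applying the single best move per pass (rather than all positive moves in descending order) trades speed for simplicity but converges to the same kind of local optimum.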

Experimental Study

Overview
We explored GCM variants, T-Coffee [16], and MAFFT-merge [15]; to the best of our knowledge, T-Coffee and MAFFT-merge are the only other methods that are designed to merge three or more disjoint alignments. We used simulated and biological datasets from prior studies to explore these methods. The simulated datasets are nucleotide datasets that evolve down phylogenetic trees and so have true alignments, and the biological datasets are either nucleotide datasets from the Comparative Ribosomal Website [22] or protein datasets from the Homfam collection [23], which have curated reference alignments based on structure. We evaluated methods with respect to alignment accuracy and MWT-AM scores. For alignment accuracy, we used the number of shared homologies (i.e., the total number of true pairwise homologies recovered in the estimated alignment), SP-score (i.e., recall, or the fraction of true pairwise homologies recovered in the estimated alignment), and Modeler score (i.e., precision, or the fraction of the pairwise homologies in the estimated alignment that appear in the true or reference alignment), computed using FastSP [24]. We also recorded wall-clock running times.
The analyses of simulated datasets and some of the smaller biological datasets were run on a single node with 16 CPUs and 64 GB of memory on the Campus Cluster at UIUC, and hence were limited to 4 hours. Those biological datasets that did not complete within 4 hours on the campus cluster were then run on the tallis cluster (which has 256 GB of memory) and allowed more time to complete (50 hours for the larger CRW datasets and 168 hours for the larger Homfam datasets).
For a given sequence dataset, we produce a collection of disjoint alignments and weights on the pairs of sites, as follows. Given a sequence dataset, we have two ways of producing inputs to GCM and other methods for MWT-AM: (1) we randomly decompose the dataset into disjoint subsets of approximately the same size (varying from 50 to 500), or (2) we use the default PASTA decomposition (mincluster) to produce the decomposition (i.e., PASTA computes an initial alignment and tree, and then decomposes the dataset into subsets using the tree by deleting edges until the subsets are of size at most 200). We then compute alignments on each subset using an existing method (default: MAFFT L-INS-i [15]), thus producing a set of disjoint alignments. To define the weights on pairs of sites, we proceed as follows. We compute a set of 10 "backbone alignments," each of which contains 200 sequences that are obtained by selecting sequences randomly from the constraint alignments. These sequences are then aligned using existing methods (default: MAFFT L-INS-i). These backbone alignments are then used to define the input to GCM, so that the weight on a pair of sites from different alignments is the support-weighted number of pairs of letters (one from each site) that appear aligned in one or more of the backbone alignments (i.e., if x and y are two letters that are aligned in q backbone alignments, then this particular pair (x, y) contributes q to the total weight). As shown in Fig. 2 in [7], this is one of the ways of defining the weights on pairs of sites that provide good final alignment accuracy (e.g., in general, accuracy is improved by using several backbones rather than just one, and by using larger backbones rather than smaller ones).
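The backbone-based weighting scheme just described can be sketched as follows, with a hypothetical data layout (each backbone alignment is a list of columns of letter ids, and letter_to_column maps each letter back to its (constraint alignment, column) position):

```python
from collections import defaultdict

def backbone_weights(backbones, letter_to_column):
    """Sketch of the weighting scheme: for each backbone alignment, every
    pair of letters placed in the same backbone column contributes 1 to
    the weight of the pair of constraint-alignment columns containing
    those letters; letter pairs from the same constraint alignment are
    skipped.  A letter aligned with another in q backbones thus
    contributes q, matching the description above."""
    weights = defaultdict(int)
    for backbone in backbones:
        for col in backbone:
            for i in range(len(col)):
                for j in range(i + 1, len(col)):
                    (a1, c1) = letter_to_column[col[i]]
                    (a2, c2) = letter_to_column[col[j]]
                    if a1 != a2:  # only pairs across different alignments count
                        weights[frozenset(((a1, c1), (a2, c2)))] += 1
    return dict(weights)
```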

Experiments
We performed three experiments.
Experiment 0 explores the correlation between MWT-AM scores and alignment accuracy. Experiment 1 explores variants of the GCM-MWT algorithm in order to optimize MWT-AM scores. Experiment 2 compares the best variants of GCM-MWT to T-Coffee and MAFFT-merge, the only other methods that can merge disjoint alignments. Experiments 0 and 1 use the training datasets (1000M1 and 1000M4), and Experiment 2 uses the remaining datasets (i.e., the "testing datasets").

Datasets
We studied performance on both biological (nucleotide and proteins) and simulated datasets, all from prior studies (and available online). In total, we analyzed 304 datasets (284 nucleotide datasets and 20 protein datasets), and performed multiple analyses on each.
Simulated data. We use a total of 14 model conditions, each with 1000 sequences. 12 of these were used in the papers introducing SATé and PASTA [1], [2], [6]: we use 1000M1 and 1000M4 for the training phase, and then 1000L3, 1000S1, 1000M2, 1000L1, 1000S2, 1000S3, 1000M3, 1000L2, 1000L4, and 1000S4 for the testing phase. These sequences were evolved using the ROSE [25] software under i.i.d. models of substitutions and indels, with three indel length distributions: long (L), short (S), and medium (M). The integer at the end of the model condition name roughly indicates the rate of evolution, with 1 indicating the highest rate. Each model condition has 20 replicates. We also use two subsets of the RNASim [2] simulated datasets, named RNASim1 and RNASim2, each with 1000 sequences. The RNASim datasets evolve under a model that includes selection and takes RNA secondary structure into consideration. We provide the empirical statistics for these simulated nucleotide datasets in Table 1.
CRW (Nucleotide) Datasets. We use four nucleotide datasets developed by Robin Gutell for the Comparative RNA Website [22], which were used in previous studies to evaluate alignment accuracy [2], [3], [6], [7]. Specifically, we used versions of the 16S.M, 16S.3, 16S.T, and 16S.B.ALL datasets from [7], which were used to evaluate MAGUS, and obtained by taking the corresponding datasets from [6] and then pruning them to remove sequences that deviate by more than 20% from the median sequence length. Three of the four CRW datasets have at least 5000 sequences and the remaining one has only 740 sequences. We provide the empirical statistics for these CRW datasets in Table 1.
HomFam (Protein) Datasets We also used 20 protein datasets from the Homfam [23] collection. These range in size from about 10,000 sequences to nearly 94,000 sequences, and each has a reference alignment based on protein structure for a small subset of its sequences.

Evaluation Criteria
For alignment accuracy, we report the number of shared homologies (i.e., the total number of true pairwise homologies recovered in the estimated alignment), SP-score (i.e., recall, or the proportion of pairwise homologies in the reference alignment that appear in the estimated alignment), and Modeler score (i.e., precision, or the proportion of pairwise homologies in the estimated alignment that appear in the reference alignment). We also report the MWT-AM score, using the backbone alignments to define the weights on pairs of columns from the constraint alignments.
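Both accuracy measures can be computed from the sets of pairwise homologies, as in this minimal sketch (FastSP computes these scores far more efficiently; here gaps are denoted by '-', and alignments are lists of equal-length row strings):

```python
def homology_pairs(alignment):
    """All pairwise homologies in an alignment: pairs of (row, site)
    positions placed in the same column, where `site` is the ungapped
    index of the letter within its row."""
    pairs = set()
    pos = [0] * len(alignment)            # next ungapped index per row
    for c in range(len(alignment[0])):
        in_col = []
        for r, row in enumerate(alignment):
            if row[c] != '-':
                in_col.append((r, pos[r]))
                pos[r] += 1
        for i in range(len(in_col)):
            for j in range(i + 1, len(in_col)):
                pairs.add((in_col[i], in_col[j]))
    return pairs

def sp_and_modeler(estimated, reference):
    """SP-score (recall) and Modeler score (precision) over homology sets."""
    est, ref = homology_pairs(estimated), homology_pairs(reference)
    shared = len(est & ref)
    return shared / len(ref), shared / len(est)
```

For example, aligning "AC" against "AC" correctly in one column but splitting the second column yields an SP-score of 0.5 with a Modeler score of 1.0.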

Results for Experiment 0: Correlation Analyses
We explored the correlations between MWT-AM criterion and alignment accuracy using the default version of GCM on 1000M1 and 1000M4, and under two ways of decomposing the dataset: random decompositions and decompositions obtained during the first iteration of a default PASTA alignment.
Correlations Using GCM With Random Subset Decompositions. We ran default GCM on the 1000M1 and 1000M4 datasets using random subset decompositions with subsets of size 50, 100, 200, and 500. The number of backbones was set to 1 and 10, with sizes of 20, 40, 100, 200, and 500. For each model condition, we reported MWT-AM scores and the number of shared homologies between the estimated and the true alignment, and then Spearman's rank correlation was computed on all 20 replicates (Supplementary Table S1, available online). These results show correlations ranging from 0.851 to 0.997 for 1000M1 and from 0.241 to 0.971 for 1000M4. The conditions for which 1000M4 returned low correlations are 'extreme' conditions with only 1 backbone of 20 sequences. Thus, the correlation is very strong for 1000M1, a model condition where alignment is challenging so that SP-scores are not very high, and lower (but moderately strong) for 1000M4, a model condition where alignment is easier and SP-scores are high. Furthermore, the correlations are statistically significant (p < 0.005) for all conditions except for one 'extreme' condition.
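For reference, Spearman's rank correlation is the Pearson correlation of the rank-transformed values. A pure-Python illustration (not the code used in this study; ties receive averaged ranks, and inputs are assumed non-constant):

```python
def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and vals[order[j + 1]] == vals[order[i]]:
                j += 1                     # extend over a group of tied values
            avg = (i + j) / 2 + 1          # average rank for the tie group
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sdx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sdy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sdx * sdy)
```

Any monotonically increasing relationship between MWT-AM score and shared homologies gives a coefficient of 1, regardless of how nonlinear the relationship is.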
Correlations Using GCM With Default PASTA Decomposition. We ran the default version of GCM from [7] within the standard divide-and-conquer pipeline described above (10 backbones, each containing 200 sequences). Results (Fig. 2) show that the correlation between MWT-AM scores and alignment accuracy (using the number of shared homologies) is very strong for the 1000M1 model condition (r = 0.9) and essentially zero (r = -0.048) on the 1000M4 model condition. For 1000M1, the Spearman rank correlation is statistically significant (Fig. 2), but not for 1000M4. Increasing the size or number of the backbone alignments also improves accuracy and strengthens the correlation, as shown in Supplementary Table S1, available online (for random decompositions), and Supplementary Figure S2, available online (for PASTA decompositions).
Thus, for most model conditions (1000M1 and 1000M4) using random decomposition and specifically for a hard condition (1000M1) using a default PASTA decomposition, the Spearman rank correlation coefficient between MWT-AM scores and alignment accuracy (measured using the number of shared homologies) is at least moderately high and statistically significant. This suggests strongly that optimizing MWT-AM scores is a valuable approach to alignment estimation.

Results for Experiment 1: Designing GCM-MWT
Impact of Varying the Inflation Factor Within MCL. We explored variants of GCM using MCL for Step 2 in which we varied the inflation factor (IF) parameter within MCL, which controls the granularity of the clustering. We ran GCM with IF set to 1.1, 1.4, 2, 3, 4, 6, 8, and 10, and recorded the MWT-AM scores (Supp. Fig. S1). These experiments reveal that any IF in the range 2-10 produces equivalently good results, while lower IF values produce worse MWT-AM scores. We selected IF = 4.0 as the default setting.
Exploring Other Aspects of GCM. We explored all the variants of GCM described above on our training datasets, 1000M1 and 1000M4. Three methods are available for Step 2 (MCL, MLR-MCL, RG), six options for Step 3 (minclusters, FM, MWTgreedy, MWTsearch, RG, RG-FAST), and Step 4 has two possibilities (using or not using the optimizer). In addition, some steps are not mandatory for some combinations: Step 4 is optional, and Step 2 can be omitted if FM, MWTgreedy, MWTsearch, RG, or RG-FAST is used in Step 3. In all, there are currently 34 possible combinations of Steps 2-4 for the complete GCM pipeline.
A comparison between these variants with respect to MWT-AM score on the 1000M1 condition (Table 2) shows the following trends. First, the clustering step did not reliably improve MWT-AM scores, and any improvement that resulted was very small. Second, some combinations showed much less accurate results than the default mode (e.g., RG+MWTgreedy), but all combinations are more or less even (and very slightly better than the default combination) in terms of average accuracy when Step 4 (the optimizer) is added. This is not surprising, since the optimizer is designed to increase the MWT-AM score. Results on the 1000M4 condition show very little difference between variants, indicating that this model condition is too easy to distinguish between them (Supp. Table S2). We note that improvements in the MWT-AM score tend to also result in an improvement in alignment accuracy using SP-score, a trend that is consistent with our earlier observations. Interestingly, improvements in SP-score resulting from using the optimizer tend to result in decreases in the Modeler score, but the average of these two alignment accuracy measures either stays the same or improves.

Table 2 note: Median scores and running times were computed on 19 replicates. Stars (*) indicate a setting for which one or more replicates did not finish within 4 hours of running time on the campus cluster (nodes with 16 cores and 64 GB of memory); in these cases, we report the median across the replicates that completed.
Among the variants that do not use the optimizer step, we picked the default setting from MAGUS (i.e., MCL+minclusters) as one of the GCM variants to explore further in our testing phase: it ties for best for both MWT-AM score and alignment accuracy (average of SP-score and Modeler score) among all the methods that complete on all the datasets. There is essentially no difference between the methods that use the optimizer step (except for running time, where some methods are more computationally intensive); we therefore picked FM+OPTIMIZER as a second variant, in the hope that its substantially different approach might reveal some benefits during the testing phase. In sum, we selected two variants of GCM to explore in the testing phase: (1) MCL+minclusters (with no optimizer step at the end); this is the default setting used in MAGUS [7], and so we also refer to it as GCM(default); and (2) FM+OPTIMIZER, which we refer to as GCM(fm+opt).

Results for Experiment 2: Comparing GCM-MWT to Other Methods
We compare the GCM variants to MAFFT-merge (two variants) and T-Coffee with respect to alignment accuracy (SP-score, Modeler score, and the average of these two) and running time when applied to a set of disjoint alignments. For GCM, the weights are computed from the backbone alignments it computes. In these experiments, T-Coffee is much less accurate than the other methods (Supp. Fig. S3). Based on a discussion with its developer, Cedric Notredame, T-Coffee is not designed for datasets of this size and has also not been tested sufficiently on nucleotide datasets; hence, the use of T-Coffee for merging large nucleotide alignments may not be appropriate. The rest of this section focuses on the comparison between the GCM variants and MAFFT-merge, using default decompositions (which are phylogeny-based) or random decompositions.
Results on Random Subset Decompositions. We selected three simulated datasets (1000L3, RNASim, and 1000S4) and tested the performance of GCM(default) and MAFFT-merge (using the most accurate version, based on L-INS-i) when the constraint alignments are obtained through random decomposition rather than the phylogeny-based mincluster decomposition.
We compared the default GCM (MCL+minclusters) and MAFFT-merge on random decompositions, using subsets of size 50, 100, 200 and 500. The number of backbones was set to 1 and 10, with sizes of 20, 40, 100, 200 and 500. For each model condition, we report SP-scores.
Results for alignment accuracy (Fig. 4) show that MAFFT-merge and GCM(default) have close accuracy for slowly evolving sequences (1000S4 and RNASim), but GCM(default) is more accurate on fast-evolving sequences (1000L3). This is similar to what we observed on the PASTA decompositions in Fig. 3. Surprisingly, MAFFT-merge performed relatively well despite not having the advantage of using alignments on local subsets as constraints (see the recommendations for using MAFFT-merge at https://mafft.cbrc.jp/alignment/software/merge.html).
However, a comparison between these results and the same methods with the default phylogeny-based decomposition in Fig. 3 shows that random decompositions produce worse alignments. This underscores the importance of the subset decomposition step in pipelines that use divide-and-conquer to estimate alignments.
Results Using Default Decompositions on Simulated Datasets. In Fig. 3, we compare the two variants of MAFFT-merge to GCM(default) and GCM(FM+opt) on all the testing model conditions with default decompositions. This comparison shows the following trends. The two ways of running MAFFT-merge are clearly distinguished, with the L-ins-i version much more accurate than the default version. Both ways of running GCM are approximately equal in accuracy, and both are almost always more accurate than MAFFT-merge using L-ins-i, with the only exceptions being the easiest datasets (1000L4 and 1000S4), which have the lowest rate of evolution.
Running Time on Simulated Datasets Using Default Decompositions. MAFFT-merge in default mode is the fastest method, followed by both GCM(default) and GCM(FM+opt), and then by MAFFT-merge using L-ins-i (Fig. 3). Both ways of running GCM are often twice as fast as running MAFFT-merge using L-ins-i, and finish on all these datasets in at most 30 minutes.
Results on Biological Nucleotide (CRW) Datasets. Table 3 shows results on the CRW datasets. Although the methods are very similar on the smallest dataset, they are clearly distinguishable on the larger datasets. Most importantly, MAFFT-merge is less accurate than both GCM variants, obtaining SP-scores that are 10-13% lower and Modeler scores that are 4-6% lower than GCM(default). Note that the L-ins-i option for MAFFT-merge is unable to complete in 4 hours on these larger datasets, so we used the default setting for MAFFT-merge, which is less accurate. A comparison between the two GCM variants shows that they have the same average alignment accuracy (averaging SP-score and Modeler score), but GCM(default) has a somewhat better Modeler score and GCM(FM+opt) has a somewhat better SP-score.
A comparison of running times (Table 3) shows the following trends. On the 16S.M dataset (with 740 sequences) we used MAFFT-merge with the L-ins-i setting, which is more accurate but also more expensive; as a result, the three methods have nearly identical running times (6 minutes). The other three datasets are larger, requiring that we use the default version of MAFFT-merge instead of the L-ins-i version. On these three larger datasets, MAFFT-merge is much faster, finishing in under a minute on 16S.3 and 16S.T (each with about 5500 sequences) and in 7 minutes on the 16S.B.ALL dataset (with 24,246 sequences); in contrast, the two GCM variants take longer. As expected, GCM(default) is faster, taking 12-15 minutes on all three datasets, while GCM(FM+opt) is slower: 31-38 minutes on 16S.3 and 16S.T, and unable to complete on the 16S.B.ALL dataset within the allowed time (48 hours).

Results on Biological Protein (Homfam) Datasets
We also compared GCM(default), GCM(FM+opt), and MAFFT-merge on several large Homfam datasets with up to nearly 94,000 sequences. MAGUS and recursive MAGUS were evaluated on these datasets in [26], in comparison to many established protein alignment methods (e.g., MAFFT, Clustal-Omega, and Muscle) as well as more recently developed alignment methods (UPP and PASTA), and were shown to be more accurate (i.e., having a lower average of SPFN and SPFP) than all of them, with MAGUS more accurate than recursive MAGUS. Given the dominance of MAGUS in that study, here we focus the comparison on variants of MAGUS where we modify the merging step (i.e., replacing GCM(default), as used in MAGUS, with GCM(FM+opt) and MAFFT-merge(default)). We allowed all methods to run for a full week (168 hours) on the tallis cluster, which has 256 GB of memory.
These results show the following clear trends. First, on every dataset, GCM(FM+opt) produces better MWT-AM scores and also better SP-scores than GCM(default) (Table 4), showing that optimizing MWT-AM scores correlates with improved SP-scores on these datasets as well. Second, both GCM(default) and GCM(FM+opt) complete on every dataset, whereas MAFFT-merge(default) fails to return alignments on the two largest datasets (zf-CCHH with 88,345 sequences and rvp with 93,681 sequences). In those two cases, MAFFT returns a warning that some groups are forced to be a monophyletic cluster and then spends an enormous amount of time reallocating sequences into sub-MSAs. Third, when restricted to the remaining 18 datasets where MAFFT-merge(default) could run, GCM(FM+opt) had the best SP-scores on 11 datasets and MAFFT-merge(default) had the best SP-scores on 7 datasets. The averages across the 18 datasets also show an SP-score advantage for GCM(FM+opt).

DISCUSSION
This study showed that the MWT-AM optimization criterion (as computed in this study, using backbone alignments to define the weights) is generally highly correlated with alignment accuracy measured using the number of shared homologies. This trend shows that optimizing the criterion is desirable. This study also showed that the default usage of GCM within MAGUS is an effective heuristic for MWT-AM, explaining why MAGUS produces highly accurate merged alignments. In other words, by posing and studying MWT-AM, we are able to explain why MAGUS is a highly accurate alignment method.
On the other hand, we also saw that some modifications to the GCM heuristic, largely through the use of a final "optimizer" stage, can improve the MWT-AM scores compared to GCM(default) (i.e., GCM as used in MAGUS). Furthermore, both GCM(default) and GCM(FM+opt) are sufficiently fast that they can be used to merge a large number of alignments; hence, both ways of using GCM are effective within divide-and-conquer pipelines for large-scale multiple sequence alignment.
A comparison to other methods that can merge disjoint alignments reveals some significant differences. As we have
noted, T-Coffee did not produce satisfactory results in these experiments; the likely explanation is that T-Coffee has been designed primarily for amino acid alignments rather than nucleotide alignments, and has not been tested in this setting. Given the outstanding accuracy of T-Coffee observed for amino acid alignments (e.g., [27]), there is an opportunity for future work to provide a version of T-Coffee explicitly for merging large nucleotide alignments. The comparison to MAFFT-merge is also informative, and shows that MAFFT-merge using L-ins-i performs relatively well, generally being as accurate as GCM. However, MAFFT-merge with L-ins-i is more computationally intensive than GCM, making GCM more appealing on large datasets. Furthermore, we note that we used MAFFT-merge in a way that is not recommended (see the MAFFT-merge website at https://mafft.cbrc.jp/alignment/software/merge.html, which recommends that MAFFT-merge be given monophyletic clusters as inputs). Although monophyly is not ensured in PASTA decompositions, it is definitely violated in random decompositions; thus MAFFT-merge (using L-ins-i) is a valuable approach for merging non-monophyletic clusters, even though it was not initially designed for that purpose.

CONCLUSION
This study was motivated by the desire to understand why MAGUS, a recent method for multiple sequence alignment, produced more accurate alignments than its immediate predecessor, PASTA. Both MAGUS and PASTA operate by dividing a sequence dataset into disjoint sets, computing alignments on the subsets using MAFFT L-ins-i, and then merging these disjoint alignments together; essentially the only important difference between the two methods is how each merges the disjoint alignments. MAGUS computed a set of extra alignments in order to merge the disjoint alignments, defined an edge-weighted "alignment graph" from these extra alignments, and then computed a "trace" (i.e., merged alignment) using the Graph Clustering Merger (GCM) applied to the alignment graph. In contrast, to merge the set of disjoint alignments, PASTA used a much simpler strategy: it merged pairs of the constraint alignments and then followed this by transitivity. Thus, GCM uses a more complex approach than PASTA's merger strategy, but neither is based on an explicit optimization criterion. In exploring the GCM method further, we hypothesized that its approach might be an effective technique with respect to maximizing the support implied by the alignment graph, which we defined to be the total weight of the edges in the alignment graph that appear in the output alignment. We formulated this objective as a new optimization problem, which we termed the Maximum Weight Trace Alignment Merging (MWT-AM) problem. Thus, the MWT-AM problem is a generalization of a classical problem in bioinformatics called the Maximum Weight Trace problem [12]. Our study shows that the Graph Clustering Merger (GCM) method provides good accuracy for the MWT-AM problem. Further, our study shows that using default GCM or its improved variant, GCM(FM+opt), within the MAGUS pipeline produces alignments with excellent accuracy, and can be used on very large datasets (up to 93,681 sequences in this study).
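The MWT-AM objective described above (the total weight of alignment-graph edges realized in the output alignment) can be made concrete with a small scorer. The sketch below is our own illustration, not MAGUS's code: it assumes the alignment graph is given as weighted edges between subalignment columns, and that the merged alignment is represented as an assignment of each subalignment column to a final merged column. Note that a valid trace must also preserve column order within each subalignment; this scorer assumes it is given such a feasible assignment.

```python
# Sketch (assumed representation): scoring a merged alignment under
# the MWT-AM objective. A node is a column of one disjoint subset
# alignment, e.g. (subalignment_id, column_index); edge weights come
# from the backbone alignments.

def mwt_am_score(edge_weights, merged_column_of):
    """edge_weights: dict mapping frozenset({u, v}) -> weight.
    merged_column_of: dict mapping node -> index of its merged column.
    Returns the total weight of alignment-graph edges whose endpoints
    are placed in the same merged column (the weight of the trace)."""
    score = 0.0
    for edge, w in edge_weights.items():
        u, v = tuple(edge)
        if merged_column_of[u] == merged_column_of[v]:
            score += w
    return score
```

For example, with edges {A0-B0: 3.0, A1-B1: 2.0}, a merging that collapses A0 with B0 but keeps A1 and B1 apart scores 3.0, while one that also collapses A1 with B1 scores 5.0; GCM and its variants are heuristics for finding a feasible assignment maximizing this quantity.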
This study suggests multiple directions for future research. The first and most obvious direction is to develop new methods for merging a set of disjoint alignments. Currently, there are only a few methods that are able to merge disjoint alignments, and so there is clear opportunity for advances. To begin with, future research could explore modifying how GCM defines the weights on the pairs of columns, as the heuristic it uses to solve MWT-AM might work well with other weighting schemes. Also of interest is the theoretical approximability of MWT and MWT-AM; to the best of our knowledge, there are no results regarding polynomial-time constant-factor approximations for these problems.
Given the high accuracy of MAGUS, which depends on GCM, another direction for future work is to investigate changes to the ways the constraint alignments and backbone alignments are computed within MAGUS. In this study we relied on MAFFT for these computations, and MAFFT has been shown to generally provide very good results for alignment estimation, especially when used in its most accurate settings [1], [2], [6], [7]. Yet other methods could also be considered and could be computationally feasible. Based on results shown in [27], the MAGUS pipeline could potentially be even more accurate when used with methods for aligning amino acid sequences, such as PROMALS [28] and ProbCons [29]. Another interesting direction is to consider the use of computationally intensive Bayesian methods for the constraint alignments. For example, methods such as BAli-Phy [30] can have outstanding accuracy, but are generally limited to very small datasets (perhaps 50 sequences). As noted in [31], BAli-Phy can be used within the PASTA divide-and-conquer strategy, where it is only used on small datasets; similarly, we conjecture that the MAGUS pipeline could be used with BAli-Phy on subsets to improve accuracy.
Paul Zaharias received the PhD degree from the Muséum national d'Histoire naturelle de Paris, in 2019, under the direction of Nicolas Puillandre, where he worked on the systematics and evolution of marine molluscs. He is a postdoctoral researcher with the Muséum national d'Histoire naturelle de Paris working with Olivier Gascuel, and a former postdoctoral researcher (with Tandy Warnow) at the University of Illinois Urbana-Champaign. His current research interests include metrics for support in phylogenies, and computational methods for large-scale multiple sequence alignment and phylogeny estimation in the presence of substantial heterogeneity across the tree and across the genome.