Branching Out to Speciation in a Model of Fractionation: The Malvaceae

Fractionation is the genome-wide process of losing one gene per duplicate pair following whole genome doubling (WGD). An important type of evidence for duplicate gene loss is the frequency distribution of similarities between paralogous gene pairs in a genome or orthologous gene pairs in two species. We extend a previous branching process model for fractionation, originally accounting for paralog similarities, to encompass the distribution of ortholog similarities, after multiple rounds of whole genome doubling and fractionation, with the speciation event occurring at any point. We estimate the fractionation rates during all the inter-event periods in each lineage of the plant family Malvaceae. We suggest a major correction of the phylogenetic position of the durian sub-family, and discover a new triplication event in this lineage.


INTRODUCTION
T HE evolutionary history of the flowering plants (angiosperms) is punctuated with numerous whole genome doubling and tripling (WGD) events, a phenomenon that has only occasionally been identified in other phylogenetic domains. In the angiosperms, every species whose genome has been sequenced has at least one such event in its history, and very often two, three, four or more, except in one early diverging lineage [1]. The doubling events in any one lineage are typically spaced tens of millions of years apart, and the tetraploid or hexaploid genome at the origin of the event re-diploidizes in relatively short order, as homeologous sequences diverge, chromosomes fuse and inversion and translocations disrupt the separate identity of the two subgenomes. However, the set of genes in the genome can remain elevated for long periods of time and if there are several WGD, the genome can accumulate 100,000 genes or more instead of the approximately 25,000 necessary for most angiosperm genomes. It is true that pairs of duplicate genes tend to lose one redundant member over time, a process called fractionation with some categories of genes more susceptible to loss than others [2], [3], and sometimes from one subgenome rather than the other [4], but the question remains of whether fractionation occurs at a rapid enough pace to counter the effect of recurrent WGD on gene number. This is the main focus of this paper; the rate of fractionation after a WGD or a series of WGD.
An important type of evidence in analyzing ancient polyploidization events is the distribution of coding sequence similarities between two paralogous genes in a genome. For flowering plants with one, two, or more WGD events in their history, the distribution of similarities is a mixture of distributions, each component of which is centred at a similarity value indicative of the age of one of the events. We have developed a model for predicting the shape of these distributions based on the event times, the ploidy multiplicities of the events, rates of loss of duplicate genes from the genome (fractionation), and rates of sequence divergence [5]. Underlying these predictions is a paralog tree generated by a discrete-time branching process with one biologically-motivated constraint, which is mathematically tractable and whose parameters are well suited to statistical inference.
The mathematics of the branching process involving recurring WGD within a single genome, and the associated inferential tasks associated with paralogous gene pairs, have been worked out, but attempts to extend this to orthologs in two genomes post-speciation [5], [6] have involved treatments not fully consistent with the spirit of the single genome model. These previous models interrupted the ongoing fractionation process in the transition from one species to two, and established a new process, with a new lossrate parameter, to take account of the loss of genes from the two daughter species and the reduction in the number of orthologous gene pairs. In reality, however, speciation as such does not involve any sudden change in the ongoing evolution of the two diverging populations. This includes the uninterrupted continuation of the fractionation process independently in the two new species-neither of them is "aware" of what is happening in the other.
Thus a key aspect of this paper is to provide a treatment of the transition from one genome to two daughter genomes that smoothly continues the fractionation regime in place before speciation, and extends it independently in each of these species until a new WGD in that species or until the present time of observation. With some additional details, this solution enables a complete model encompassing all the WGD in the ancestral genome, the speciation event, and all the independent WGDs in the daughter genomes. It defines all the homolog pairs at the time of observation in terms of the event at which they originated. The key period for the model, that which includes the speciation event, is often also the key period for empirical studies, since it includes the speciation plus the fractionation process before that event.
In addition our model leads to the first formal expression of the fact, often empirically observed and intuitively reasonable if not widely known, that the distribution of relative frequencies of ortholog similarities in two genomes is independent of WGD that occur after speciation. However the absolute frequencies change after speciation, they remain a constant multiple of the relative frequencies at the moment of speciation; the last component of the mixture of distributions is always due to the speciation event creating the two diverging genomes, and this is the only speciation event leaving any trace on the distribution. The pre-speciation WGD and the single speciation event are the only events that determine the relative frequencies of the ortholog similarities; further WGD change the absolute frequencies but not the relative frequencies of these similarities. In this paper, we find for the first time an expression for the amplifying factor, the ratio of the absolute to the relative frequencies as a function of post-speciation WGDs.
Aside from providing an insight into fractionation, the distributions of homolog similarities have phylogenomic implications. To the extent that a distribution may be decomposed into a mixture of normal components, the local modes may be found at the means of these distributions, and serve as accurate reflection of the timing of the events. We have previously [6], [7] used these modal values-or their logarithmic transforms-as a similarity or distance matrix for input into a phylogenetic algorithm, such as neighbour-joining. We apply the model and inferential apparatus to eight genomes in the family Malvaceae. This is an ideal group of plants, since many of them, in several subfamilies and genera, have recently had their genomes sequenced, and they are very heterogeneous with respect to the number of WGD in their respective lineages since the Malvaceae emerged some 40 Mya [8]. One result is an overview of the variability in the proportional rates of fractionation in different lineages and in different periods. Another result is an unexpected phylogenetic positioning of one subfamily (Durionaceae, or Helicteroideae) within an early diverging clade of the Malvaceae containing cacao and jute rather its usual assigned position as originating close to cotton in a more recent multifurcating branching of subfamilies. Moreover, the Durio genome, shows a recent whole genome triplication, different from the WGD elsewhere in this plant family.
The next section summarizes the general model for generating the distribution of paralog similarities and its extension to the distribution of ortholog similarities. This is followed by a discussion of the Malvaceae genomes we study and a sketch of our inferential procedures. A full exposition of our results and inferences follows. The discussion of Durio evolution appears in the Appendix, which can be found on the Computer Society Digital Library at http:// doi.ieeecomputersociety.org/10.1109/TCBB.2019.2955649.

THE GENERAL MODEL
In our general model for WGD and fractionation, at times t 1 < Á Á Á < t nÀ1 each gene in the population of M i > 0 genes is independently replaced by j daughter genes, where 1 j r i . The quantity r i ! 2 represents the ploidy of the ith event and r i À j ! 1 is considered the effect of fractionation on the progeny of the parent gene. The trajectory of the process is in effect a sample point from the n À 1 probability distributions u 1 ðiÞ; . . .; u r i ðiÞ for i ¼ 1; . . .; n À 1. There is no provision for u 0 ðiÞ > 0, a condition called "no lineage extinction". The fact that the u j ðiÞ are defined over the progeny of each gene independently may be characterized as "sibling rivalry"; there is no constraint or relation on the survival of "cousins". Motivations for the "no lineage extinction" and "sibling rivalry" assumptions are given in [7], [9].
Let aðiÞ ¼ ða 1 ðiÞ; . . .; a r i ðiÞÞ represent the numbers of genes at time t i , of which 1; . . .; r i , respectively, survive until t iþ1 , so that Given M i , the probability of aðiÞ is P r i ðaðiÞÞ ¼ M i a 1 ðiÞ; . . .; a r i ðiÞ ðu 1 ðiÞ a 1 ðiÞ . . .u r i ðiÞ a r i ðiÞ Þ; (2) and the probability of an entire trajectory, defining a paralog gene tree is P r 1 ðað1ÞÞ. . .P r nÀ1 ðaðn À 1ÞÞ; ( with M 1 ! 1 given and the other M i determined by Equation (1). Once we know how to calculate these probabilities, it is possible to calculate the EðM i Þ. And using the independence of the trajectories starting at any two sibling genes existing at time t i , and their independence from the trajectory between time t 1 and t i , we can calculate EðN i Þ the expected number of pairs of genes at time t n originating at time t i .
The accumulation of multinomial coefficients in Equations (2) and (3), and the potentially high degree polynomials might seem computationally formidable. In practice, however, n seldom exceeds 5 or 6, and the r i are generally 2 or 3. Thus individual instances of the model are generally computationally tractable.
For example, suppose there is just M 1 ¼ 1 gene at time t 1 , and suppose all r i ¼ 2. We can write uðiÞ ¼ u 2 ðiÞ; i ¼ 1; . . .; n À 1 for the probability that both progeny of a gene at time t i survive until time t iþ1 . We have previously shown [6] the expected number N i of duplicate pairs of genes born at time t i surviving until t n is Suppose there are n A À 1 À s WGD in species A after speciation and n B À 1 À s in species B. Let be the expectation of the "amplifying factors" affecting the distribution of orthologs due to these WGD. Then are the expected number of ortholog pairs observed after the n A À 1 À s WGD in species A by which time there will have been n B À 1 À s WGD in species B. (We dispense with the terminology of "outparalogs" versus "orthologs", since we are keeping track of the event at which each pair of homologous genes originates). The three key factors in our improved model, terms in Equations (5) and (6), are ð1 þ u A ðsÞÞ; ð1 þ u B ðsÞÞ and ð1 þ uðs À 1ÞÞ. Between the two successive WGD, at time t sÀ1 in the pre-speciation genome, and t A sþ1 in genome A and t B sþ1 in genome B, the same fractionation regime should hold, despite the speciation at t s . Writing Àlog uðs À 1Þ ¼ rðt s À t sÀ1 ÞÞ; our model presumes r ¼ r A ¼ r B . The same proportional rate should hold before and after speciation, since speciation is a population-level event in the first instance, not involving any genome-level changes, in contrast with WGD.

THE MALVACEAE
The Malvaceae are family of plants in the rosid order Malvales. Considering only the genomes studied here, this family bifurcated early into two lineages, as indicated in Fig. 1, giving rise the subfamily Malvoideae and to the three sister subfamilies Grewioideae, Helicteroideae and Byttnerioideae. Our Malvoideae genomes reveal repeated instances of WGD, the Grewioideae and Helicteroideae only one each, and the Byttnerioideae none. Of note is that the genera that boast sequenced genomes are almost all economically important ones (cf [10]). We use the SYNMAP software on COGE [11], [12], and have direct access to some of the data, in an appropriate format, among those available on the COGE platform. Those genome data gathered elsewhere (cited below) were uploaded to a temporary private account on COGE for purposes of the present research.
The hibiscus, or "Rose of Sharon", (Hibiscus syriacus) genome [15] has undergone at least three WGD events. Two  [8], [16] and other references, and Malvoideae divergence from Grewioideae and Byttnerioideae [15], other times in the Malvoidae from [15]. Corchorus WGD is from [18], Grewioideae and Byttnerioideae divergence is from [20], and other points, like the the speciation of the two Byttnerioideae, simply interpolated. Most of these estimated times are 30-50 percent less than those given in the Time Tree of Life [13] or the Angiosperm Phylogeny Website [14], but are based on more recent technology. cotton genomes (Gossypium raimondii and Gossypium hirsutum) [16] are in a genus closely related to Hibiscus, i.e., in the same subfamily, the Malvoideae, and share the first of the ancestral hibiscus WGD events. They also have experienced a more recent WGD proper to the Gossypium genus. G. hirsutum is a recent tetraploid (4MYa) of G. raimondii and G. arboreum, so that the comparison of G. raimondii with G. hirsutum can be viewed as a proxy for the comparisons between G. raimondii and the G. arboreum subgenome of G. hirsutum. We also made some use of two other genomes [17] from the Gossypieae tribe, Kokia drynarioides and Gossypioides kirkii, although we did not include them in the full comparative protocol we applied to the other genomes mentioned here.
Of particular interest is the recently sequenced Durio zibethinus genome [23], which has been considered relatively close phylogenetically to the cotton genomes, but which is actually far from cotton and close to cacao. (See the Appendix, available in the online supplemental material.) As an outgroup, we use the agarwood (Aquilaria agallocha) genome [24], not in the Malvaceae family, but in the same order, Malvales.
The ancestor of the core eudicots contained 21 chromosomes, resulting from a tripling of a seven-chromosome precursor [25], [26]. This is known as the "g" tripling. Over half of the known flowering plants, including the Malvaceae, belong to this group. It is important to note that evidence of this event shows up in all our analyses.

INFERENCE ON THE DISTRIBUTION OF SIMILARITIES
Knowing the expected number of pairs of genes originating at each WGD in the past is the first step in predicting the full distribution of similarities. The second step is to derive the actual distribution of gene pair similarities, or an appropriate approximation to it, for each of the n À 1 WGD events.
One way gene pair divergence may be measured is in terms of a probability p reflecting similarity -the proportion of nucleotide positions that are occupied by the same base in the two orthologs (or paralogs). In analogy to radioactive decay, we relate p to time t as a negative exponential: p ¼ e Àt , for some constant .
The densities of similarities of pairs generated by the ith WGD can be approximated by a normal distribution Nðp i ; s 2 i Þ, and the expected frequency by We can predict the entire frequency distribution over all events as Were we to try to decompose a mixture of distributions arising from a single comparison of two genomes, we would likely use standard software such as EMMIX [27] or the R package mixtools. Such approaches, however, suffer from a number of problems, stemming from deviations from normality, very large differences among the O i , disproportionate noise at lower values of p, and reliance on biologically unaware significance testing.
We nevertheless use mixture-of-distribution estimation, but our approach tries to attenuate some of their problems through pairwise comparison of all the genomes in a group of plants, a group phylogenetically diverse enough to include lineages with different WGD histories, but not so scattered that finegrained relationships within the group will be missed. In our experience, the appropriate phylogenetic level is the plant family. The present study of the family Malvaceae builds on our previous work on the Brassicaceae [7] and on the Solanaceae [6].
The advantage in a family-based approach is that many pairs of genomes will share the same events, WGD or speciation, and hence component distributions showing the same average p. Indeed, knowing the phylogeny of the family helps us to discard spurious component distributions caused by small sample size fluctuations and to detect other component distributions that may be below the level of significance.
We first locate candidate p i in each comparison by picking out local modes in the distribution of similarities. We are guided by the phylogenetically-based knowledge that some of these p i are shared among several genome comparisons, since they reflect the same events. (We have developed a phylogenetic method based on modes and WGD [7], but the point of the present paper being the estimation of fractionation rates, we simply incorporated a phylogeny most in keeping with the biological literature, though this is confirmed by neighbour-joining in the Appendix, available in the online supplemental material. In each comparison, we fix these p i in a maximum likelihood mixture of distributions analysis to produce the amplitude and variance of each component. We then iterate, allowing the p i to shift a slight amount to improve the visual fit of the whole distribution. The amplitude and variance inferred for each component are estimates of the number of O 1 pairs, O 2 pairs, etc. These numbers, together with the p i , can then be used to produce estimates of the uðiÞ.
We use modes to estimate the p i because the overlapping tails of the component distributions preclude their estimation by averaging. Furthermore, the fact that hundreds or thousands of syntenically validated orthologs support the position of the mode, even if they are not exactly the same set of orthologs in each comparison, means we are not limited to the small set of genes that are single copy in all the genomes, and which in any case are susceptible to nonparallel fractionation.
Once we have decomposed the mixed distribution of similarities into its component distributions, the area under the distribution due to each component is an estimate of O i ; i ¼ 1; . . .; s, the number of ortholog pairs due to the WGD at time t i , whose expectation we derived in Equation (6). Then are estimates of the proportion of the area under the original distribution due to only the WGD at time t i or the speciation at time t s . There are the same number of independent proportions (s À 1) as parameters uðiÞ; i ¼ 1. . .; s À 1, so that we can estimate all the parameters. Although the roles of F A and F B are an important analytical result, the great disparity in the quantity of data produced by the various comparisons, due entirely to methodological disparities, preclude any systematic attempt to analyze them. We can nevertheless suggest a procedure for further work. First, for any two genomes A and B, an estimate of the product F A F B can be obtained by averaging all the quantities in (10) above. Suppose the lineage of genome A consists of a series of WGD interspersed with speciations giving rise to genomes B 1 ; B 2 ; . . ., themselves with no post-speciation WGD, then the successive values of F A at these speciation points could be compared in order to obtain some of the factors ð1 þ uðiÞÞ in Equations (5) at the intervening WGD.
If, on the other hand, the genomes B 1 ; B 2 ; . . . themselves underwent WGD after divergence from the A lineage, there is no direct way of factoring the product F A F B .

Figs. 2, 3 and 4 depict the distributions and their decomposi-
tions for the 38 comparisons, including seven self-comparisons, we succeeded in producing by COGE. Technical difficulties with COGE, beyond our control, such as the unavailability of coding sequence for Aquilaria and the Gossypieae genomes, or the small numbers of gene pairs detected, prevented us from obtaining five pairwise genomic comparisons and two self-comparisons. Table 1 presents details about the components distribution associated with the ancient or recent speciation in the lineages of every pair of genomes.
The individual graphs all share the same x-axis, ranging from 0.5 to 1.0. However, the y-axis scale differs greatly among the comparisons. Part of the reason for this is no doubt the different lengths of the time period since speciation, allowing fractionation to reduce the number of gene pairs. More important in many cases are difficulties with the genome assembly or, more crucially, its annotation. Among the genomes we analyzed, those with the most results included the hibiscus, upland cotton, cacao, monkey cacao, and durian genomes. The most problematic were the G. raimondii genome, the two jute genomes and the two Gossypieae. This does not necessarily speak to the quality of these genome sequences nor their annotation; there are a number of possible vulnerabilities beyond our control in the pipeline between the original database, through the uploading to COGE, to the SYNMAP computations and display. These problems do not extend to the other genomes and though they make affect the precision of some of the results, should not result in any biases.
In general, we note some moderate problems of fit between the model and the data. Many of these have to do with the evident non-normality of the component subgenomes. This is clearest in the speciation components, where there is skewness involving a steeper slope on the side closest to 100 percent similarity. Thus the data are fewer than expected from a normal approximation on this side, and contribute to a "shoulder" that exceeds the normal curve on the side closer to 50 percent similarity. In some cases, this partially obscures a slightly earlier peak, but careful inspection usually reveals a local mode where the peak is expected.
The two Gossypieae were not included in most of our analyses, largely because no annotations were available. Without CDS information, these genomes could only be compared to other genomes by using the capacity of SYNMAP to locate likely homology regions in sequence data based on the CDS in another genome. This problem was also encountered with the agarwood genome, but comparisons were much more productive in this case.
In the cases of G. raimondii genome, the two jute genomes and the two Gossypieae genomes, we were obliged to lower the minimum syntenic block size for the SYNMAP run from 5 to 3, in order to get even a minimal number of orthologous or paralogous pairs for further analysis. Even so, there were not enough pairs in some of the comparisons to detect some known events. For example, the white jute self-comparison does not detect the WGD, though it shows up both in the self-comparison of dark jute and in the comparison of the two jute genomes.  Despite these problems, some clear patterns emerge. The figures demonstrate a strong resemblance among the six comparisons with agarwood. Hibiscus shows a disproportionate number of pairs surviving from the g event, but lacking comparisons with cotton, we can only attribute this to statistical fluctuations in the relatively low number of pairs in the agarwood comparisons.
A key trend in these figures is the compact recent speciation peaks in the comparisons of the non-Malvinon-Malvoideaeoideae genomes: the jutes, durian, cacao and monkey cacao. This contrasts consistently with the more dispersed distributions comparing the non-Malvoideae genomes to the Malvoideae. The grouping of the three non-Malvoideae subfamilies as sister groups seems confirmed by the mean similarities in Table 1 (top section). This contrasts with suggestions in [19] that Corchorus is the earliest branch in the Malvaceae, and in [23] that Durio is a sister group to Gossypium. Our suggestion is confirmed by a neighborjoining analysis (not shown), which exactly reproduces the tree in Fig. 1, with the exception that Durio branches off slightly before Corchorus on the lineage leading to Theobroma.
As was recently shown [6] the mean similarity scores at speciation in Table 1 drop exponentially with the assigned times in Fig. 1, except for a lower similarity than expected for the divergence time of the Malvaceae species from the Malvales outgroup -see Fig. 5. Also in this figure the standard deviation of the component distribution is linearly related to its mean, again with the exception of the original Malvaceae emergence.
We used the data in the three sections of Table 1 to estimate the uðiÞ according to Equation (6), and transformed these according to the relationship r ¼ À log u t , to estimate proportional fractionation rates, shown in Table 2. We performed a similar analysis on the self comparisons, shown in Table 3. In both cases a range of values is shown, corresponding to the range of time estimates shown in Fig. 1. Of note is the relative constancy of rates except after recent WGD events. The larger recent rates may be methodological artifacts or may indicate a higher rate of fractionation, immediately after the WGD.

DISCUSSION AND CONCLUSION
A major methodological problem in estimating proportional fractionation rates is deciding on the time estimates for speciation and WGD events. The times given by the Timetree of Life Modes identified visually, assuring rough concordance with other comparisons sharing the same event history. Standard deviations and component proportions determined by maximum likelihood, assuming means coincide with modes. Incremental adjustments to the means applied to correct regions that fit less well due to component overlap and non-normality followed by further rounds of likelihood maximization. [13] or the Angiosperm Phylogeny Website [14] are based on a wide range of data and are systematically much older than those provided by more recent molecular phylogenies, often by a factor of two or more. And the more recent results usually involve a wide range of values. The accuracy of the rate estimations suffers accordingly, and can only be reported as a similarly wide range of values. This difficulty is partially due to the very different quality of the genome sequences and annotations. Over less than ten years, technology has improved by orders of magnitude, but there is still great variability among laboratories, reporting styles and journals. The studies we consulted performed "phylogenomic" estimations based on very few genes, chosen according to a variety of criteria, such as wide presence in many genomes or single-copy in all genomes. How these criteria affect the time estimates is unknown, but certainly create inconsistencies between different studies.
The principled extension of our model from WGD to speciation highlights inference difficulties due to gene loss other than fractionation. Our model focuses on gene loss within syntenic blocks and, up to the moment of speciation, disregards genes that are not in such blocks. But there are at least four reasons why genes do not show up in syntenic blocks aside from fractionation. The very criteria for defining a block, including a minimum number of gene pairs in a block with a maximum number of unpaired genes between successive pairs, necessarily excludes many pairs. Second, local rearrangements may disrupt blocks so that the resulting fragments are too small. Third, related to the previous, one member of a gene pair may be displaced, so that the pair no longer remains in syntenic context. Finally, and most important, both members of a pair may be lost, by silencing, pseudogenization, excision or other mechanism, transcending the "no lineage extinction" constraint on our model. A good proportion of genes in most organisms can be considered non-essential. Even those genes deemed essential under usual conditions, may be lost, and this has become of increasing interest as a mechanism of evolution [30].
How pervasive this non-fractionation loss may be is not known, but it will be necessary at some point to incorporate  it into our models. From a mathematical point of view, our current branching process is super-critical, which is not realistic, even if some organisms, like hibiscus, have accumulated very large numbers of genes through WGD. The fractionation rates found in the present study are of the order of a half those calculated for the Solanaceae [6]. This correlates with the somewhat faster evolutionary rate of the latter [8] and may represent the longer generation time of the woody tree mode prevalent in the Malvaceae compared to the annual herbaceous growth mode of the Solanaceae. If higher fractionation rates for recent WGD prove not to be methodological artifacts, this may be the result of long-lasting pairs settling into a stable subfunctionalization, particularly pertinent in the case of the g event.