Demographic History Inference and the Polyploid Continuum

Polyploidy is an important generator of evolutionary novelty across diverse groups in the Tree of Life, including many crops. However, the impact of whole-genome duplication (WGD) depends on the mode of formation: doubling within a single lineage (autopolyploidy) versus doubling after hybridization between two different lineages (allopolyploidy). Researchers have historically treated these two scenarios as completely separate cases based on patterns of chromosome pairing, but these cases represent ideals on a continuum of chromosomal interactions among duplicated genomes. Understanding the history of polyploid species thus demands quantitative inferences of demographic history and rates of exchange between subgenomes. To meet this need, we developed diffusion models for genetic variation in polyploids with subgenomes that cannot be bioinformatically separated and with potentially variable inheritance patterns, implementing them in the dadi software. We validated our models using forward SLiM simulations and found that our inference approach is able to accurately infer evolutionary parameters (timing, bottleneck size) involved with the formation of auto- and allotetraploids, as well as exchange rates in segmental allotetraploids. We then applied our models to empirical data for allotetraploid shepherd’s purse (Capsella bursa-pastoris), finding evidence for allelic exchange between the subgenomes. Taken together, our model provides a foundation for demographic modeling in polyploids using diffusion equations, which will help increase our understanding of the impact of demography and selection in polyploid lineages.

offer a promising avenue for further development of more gen-49 eralized demographic models for different types of polyploids, 50 but are also limited by their reliance on simulations and the com-51 parison of summary statistics rather than on likelihood-based that identifying homoeologous positions for genotyping is more 23 difficult, and it is further complicated by having to distinguish 24 between SNPs within a subgenome and fixed differences be- available for an allopolyploid, then these approaches are not able 33 to be used. One possible solution would be to simply genotype 34 a polyploid at its ploidy level without any attempt to separate 35 out lower-ploidy subgenomes. With this approach, the task of 36 determining the mode of polyploid formation could be incorpo-37 rated into any downstream analyses rather than needing to be 38 decided up front. 39 In this paper we develop a method for inferring the demo-40 graphic history of a single polyploid population using a dif-41 fusion framework that includes homoeologous exchanges and 42 a modified version of the site frequency spectrum (SFS) that 43 collapses data across polyploid subgenomes into a single SFS, 44 removing the need to classify the polyploid as an autopolyploid, 45 allopolyploid, or in between during SNP calling. The collapsed 46 SFS represents the combined sample allele frequencies across the polyploid subgenomes, and is analogous to combining allele 48 frequencies between populations. After describing the model, 49 we compare the SFS generated by this diffusion approximation 50 with frequency spectra generated by forward simulations. We 51 then use forward simulations to assess our ability to infer demo-  inheritance. This exchange rate parameter ranges between 0 and 69 1, with Θ = 0 corresponding to no allelic exchange and Θ = 1 70 corresponding to free allelic exchange between subgenomes. 71 Within this framework, the standard categorizations of allopoly-72 ploid and tetrasomic autopolyploid fit naturally and represent 73 the extremes in the amount of expected homoeologous crossover. 74 Values of Θ between 0 and 1 allow for intermediate amounts of 75 homoeologous exchange and most closely align with what we 76 would consider to be segmental allopolyploids. We incorporate 77 this range of exchange rates into a diffusion framework, param-78 eterizing homoeologous exchanges akin to migration, much like 79 Roux and Pannell (2015), to provide a generalized model for 80 polyploid demography.

81
A diffusion approximation for polyploids: We considered a single population of a K-ploid organism with S subgenomes, each containing k 1 , k 2 , . . . , k S chromosomes (∑ i k i = K). Our diffusion model tracks the joint density of derived mutations across each of the S subgenomes, in an infinite sites model. The density of derived mutations at relative frequencies x 1 , . . . , x S at time t is denoted as φ(x 1 , . . . , x S , t). In general, the relevant diffusion equation is (Kimura 1964) Here, M i represents the per-generation mean change in the fre-82 quency of an allele in subgenome i, V i the variance in that change, 83 and W ij the covariance between changes in subgenomes i and j. 84 Let N be some reference number of individuals (often the In general, the binomial distribution has variance np(1 − p), 91 where n is the sample size and p is the probability of "success". 92 In this case, n = Nνk i and p = x i , because each of the Nνk i 93 chromosomes in the next generation is independently copied 94 from a random chromosome in the previous generation, and 95 the proportion of carriers in the previous generation is x i . The Nνe i↔j x j . The first term is the loss of alleles due to exchange in the derived allele frequency within subgenome i is then For compatibility with the widely-used diploid equations, we rescaled time to measure in units of 2N generations. Defining E i↔j = 2Ne i↔j and ∆t = ∆τ/(2N), plugging in our results for the mean and variance terms, and multiplying both sides of Eq. 1 by 2N, we obtained: Note that the population size ν can be an arbitrary positive 12 function of time, although modelers often employ piecewise 13 constant or exponential functions for ν(t). 14 We noted the close analogy between Eq. 2 and the diffusion  With a distribution for the expected frequency of derived mutations in each subgenome at time t, we can generate the expected SFS for a sample of n individuals by integrating over the distribution of allele frequencies and calculating the probability of observing d i derived alleles using a binomial distribution (Sawyer and Hartl 1992): Here each dimension of the SFS corresponds to a different 4 subgenome.

5
Combining polyploid subgenomes: In practice, it may be difficult to partition allele counts between two or more subgenomes.
In that case, two or more dimensions of the model SFS must be collapsed down to a single dimension for comparison with the observed SFS. This problem can arise due to issues with phasing, unknown or unsampled parental lineages, or simply unknown origin for the polyploid. As an example, consider a sample of n polyploid individuals with two subgenomes with ploidal levels k 1 and k 2 . The original SFS for this population, E[d 1 , d 2 ], would be a two-dimensional array of size (nk 1 + 1) × (nk 2 + 1). We collapse the two dimensions together using the following

Collapsed Frequency
Subgenome 1 Frequency Subgenome 2 Frequency equation: In words, the d entry of the reduced SFS is the summation of all 6 entries for which the total allele count in the removed dimen-7 sions equals d. This collapse of the SFS is illustrated in Fig. 2.

8
The resulting dimension of the new SFS is of size nk 1 + nk 2 + 1 9 and is what we use for comparison with the observed SFS when 10 performing demographic inference with a polyploid population.

11
If more than two subgenomes are indistinguishable, then this re-12 duction process can be iterated to collapse all indistinguishable 13 genomes into a single dimension of the SFS.  This was repeated 100 times for a grand total of 5,000 simulated 31 frequency spectra for each scenario. We then constructed com-32 parable models using the diffusion approximation implemented Discussion. An initial burn-in of 40,000 generations was used to 51 reach an approximate state of equilibrium in genetic diversity. 52 The first set of frequency spectra were sampled immediately af-53 ter this burn-in period to obtain the mean SFS in an equilibrium

94
The corresponding model specified in dadi began with a 95 single diploid population that was split into two populations. 96 and combined in the same way and compared to the frequency 10 spectra from SLiM.

11
Segmental allopolyploids: For segmental allotetraploids, the 12 simulation setup was similar to the one used for allotetraploids. 13 We included the same initial period of divergence (T 1 ), as well 14 as the secondary period of time after polyploid formation (T 2 ).

15
During this secondary period, we added two levels of allelic 16 exchange between subgenomes. The levels we chose were 17 e i↔j = 5 × 10 −5 and e i↔j = 5 × 10 −7 . These levels correspond to 18 one exchange event every 10 or every 1,000 generations, respec-19 tively. Parameters for the initial period of divergence, T 1 , were 20 kept the same. We also did simulations with bottlenecks, using  Parameter inference and identifiability 27 In a separate set of SLiM simulations, we also sought to under-28 stand how well demographic parameters could be inferred from 29 a combined polyploid SFS in dadi. For these simulations, we 30 used a subset of the parameterizations listed above for allote-31 traploids and segmental allotetraploids, simulating 10 indepen-32 dent frequency spectra for each parameter combination. We also 33 included an additional layer of complexity in these simulations  The diffusion approximation in polyploids 28 We compared the expected SFS from the polyploid diffusion 29 approximation as implemented in dadi with frequency spectra  As might be expected for autopolyploids, the frequency spec- 38 tra appear similar to what we would obtain for a diploid but with 39 double the sample size (Fig. 4A&B). This is the case even though where the spike at 50% frequency is higher and the drop off in 50 the prevalence of sites with allele frequencies over 0.5 is greater.

51
Segmental allotetraploids also differ in the appearance of their 52 50% frequency spike, having small shoulders of increased counts 53 of sites with frequencies around 50% (Fig. 4C&D). This is caused 54 by the allelic exchange between subgenomes generating allelic 55 combinations that are not possible in allotetraploids due to the 56 complete separation of subgenomes. As the exchange rate in 57 segmental allotetraploids increases, the spike continues to level 58 out and eventually becomes visually indistinguishable from an 59 autotetraploid (Fig. 5).

60
Inferring demographic parameters in polyploids 61 As a follow-up to our validating simulations, we also sought 62 to understand how well we could infer demographic parame-63 ters using the numerical approaches implemented in dadi for 64 collapsed allotetraploid and segmental allotetraploid frequency 65 spectra. For the parameters that describe the formation of the 66 polyploid population itself, we are typically able to obtain pre-67 cise parameter estimates across our simulated scenarios (Fig. 6). 68 As expected, parameter estimates for the GBS data simulations   (Table 1). Both models resulted 108 in similar estimates for the parameters regarding the formation 109 of the polyploid lineage (ν bot and T 2 ; see Fig. 3), finding that C.