SNP genotyping and parameter estimation in polyploids using low-coverage sequencing data

Motivation: Genotyping and parameter estimation using high throughput sequencing data are everyday tasks for population geneticists, but methods developed for diploids are typically not applicable to polyploid taxa. This is due to their duplicated chromosomes, as well as the complex patterns of allelic exchange that often accompany whole genome duplication (WGD) events. For WGDs within a single lineage (auto polyploids), inbreeding can result from mixed mating and/or double reduction. For WGDs that involve hybridization (allopolyploids), alleles are typically inherited through independently segregating subgenomes. Results: We present two new models for estimating genotypes and population genetic parameters from genotype likelihoods for auto- and allopolyploids. We then use simulations to compare these models to existing approaches at varying depths of sequencing coverage and ploidy levels. These simulations show that our models typically have lower levels of estimation error for genotype and parameter estimates, especially when sequencing coverage is low. Finally, we also apply these models to two empirical data sets from the literature. Overall, we show that the use of genotype likelihoods to model non-standard inheritance patterns is a promising approach for conducting population genomic inferences in polyploids. Availability: A C++ program, EBG, is provided to perform inference using the models we describe. It is available under the GNU GPLv3 on GitHub: https://github.com/pblischak/polyploid-genotyping. Contact: blischak.4@osu.edu.


27
The discovery and analysis of genetic variation in natural populations is a central task of evolutionary 28 genetics, with applications ranging from the inference of population structure and patterns of historical 29 demography, detecting selection and local adaptation, and performing genetic association studies. The where B(α, β) represents the beta function with parameters α and β. Since genotypes must be inferred 113 from sequence data (d i ; see Methods), we can also account for this uncertainty by summing over likelihood is encumbered by the logarithm of the sum over genotypes, we instead use an expectation 118 conditional maximization algorithm to obtain maximum likelihood (ML) estimates for p and F (Meng 119 and Rubin, 1993). Since an analytical solution for the maximization step is not readily available, we 120 instead employ numerical maximization of the likelihood using Brent's method (Brent, 1973). Then, 121 given the ML parameter estimates, we can calculate the posterior probability of the genotype of each 122 individual at each locus using Bayes' theorem: with two diploid subgenomes can have a full genotype of 0, . . . , 4 copies of the alternative allele, but 142 each of these full genotypes can be found via a different combination of genotypes in the subgenomes: 143 {0 = (0, 0); 1 = (0, 1), (1, 0); 2 = (0, 2), (2, 0), (1, 1); 3 = (1, 2), (2, 1); 4 = (2, 2)}. In general, for an 144 allopolyploid that has two subgenomes with ploidy levels equal to m 1i and m 2i , there are a total of 145 (m 1i + 1) × (m 2i + 1) genotype combinations to consider. The probabilities of these genotypes are then 146 determined using the allele frequencies for the alternative allele in the subgenomes.

147
An obvious complication of not being able to separate the sequencing reads into sets coming from each 148 subgenome is that it makes independently estimating the allele frequencies and genotypes impossible.

149
However, it is sometimes the case that the parental species of the allopolyploid are known, which can 150 help with inferring genotypes by providing an outside estimate of the allele frequencies within the 151 subgenomes. For our model, we relax this use of outside knowledge further and assume that only a 152 single parent has been identified. Arbitrarily designating the known parent as subgenome one, we treat 153 the allele frequencies at each locus estimated in the parental population to be known (p * 1 ) and require 154 only the estimation of the allele frequencies in subgenomes two (p 2 ). We then model the full genotype 155 in the allopolyploid as the sum of the two independent subgenomes with separate, and potentially 156 unequal, allele frequencies. Since we assume Hardy Weinberg equilibrium within each subgenome, we can model the sum of the number of alternative alleles in the two subgenomes as a product of two binomial distributions. The log likelihood for known genotype data across individuals at all loci is then 159 given by The inclusion of genotype likelihoods is done in a similar way to the autopolyploid model, only now we 161 are summing over the values of the genotypes in both subgenomes one and two. The log likelihood for 162 the observed sequence data given the allele frequencies in each of the subgenomes is where P (d i |g i ) is the genotype likelihood, and P (g 1i |p * 1 ) and P (g 2i |p 2 ) are binomial distributions. = i a1 a2 a 2 P (g i = a 1 + a 2 |d i , p * 1 , p where P (g i = a 1 + a 2 |d i , p * 1 , p 2 ) is the joint conditional probability of the genotypes in subgenomes 169 one and two given the data and the current parameter estimates. Using these ML estimates, an empirical 170 Bayes estimate of the genotypes within each of the subgenomes can be found using their joint posterior 171 probability (note that subscripts i and are dropped for readability) 172 P (g 1 = a 1 , g 2 = a 2 |d) = P (d|g = g 1 + g 2 )P (g 1 = a 1 |p * 1 )P (g 2 = a 2 |p 2 ) a 1 a 2 P (d|g = g 1 + g 2 )P (g 1 = a 1 |p * 1 )P (g 2 = a 2 |p 2 ) , for a 1 = 0, . . . , m 1i and a 2 = 0, . . . , m 2i .

173
Other Approaches

174
We consider two additional approaches that use genotype priors that have been described in previous 175 studies. The first is an implementation of the SAMtools Hardy Weinberg equilibrium prior (Li, 2011) 176 and the second is a flat prior on genotypes that is similar to the model used by the Genome Analysis considered here because they can only handle specific ploidy levels (triploids and/or tetraploids).
sequencing error values at each locus, , across reads and individuals (Li, 2011). Then for the possible 183 values of the genotype (a = 0, . . . , m i copies of the alternative allele), the probability of the read data, which is the probability of drawing an alternative allele weighted by the probability of a sequencing 187 error.

188
Simulations 189 We generated sequencing read data with mean coverage per individual, per locus equal to 2x, 5x, 10x,

201
To compare our models with other methods, we reused these simulated data as input for the 202 estimation of genotypes and model parameters using models that assume either Hardy Weinberg 203 equilibrium or equal genotype probabilities (GATK-like). For the allopolyploid model, this also equates to infer the full genotype by again comparing RMSD values.
We tested our autopolyploid model on an empirical data set collected in the grass species under the assumption of Hardy Weinberg equilibrium and disequilibrium to assess which was a better 234 fit. These allele frequency estimates were then used as the reference panel for genotype estimation in B.

235
pubescens using the allopolyploid model.      here. Another potential avenue would be to develop ways to use more parental information, as well as 409 demographic parameters to account for the amount of divergence between the allopolyploid and its 410 parents. Models that help to identify parental taxa will also be an important contribution for future 411 research on allopolyploids.

413
As methods for the analysis of polyploid data continue to be developed, we are hopeful that the barriers 414 to more widespread study of these taxa will begin to drop. The prevalence of polyploidy in plants and 415 other groups of eukaryotes, including fish, amphibians, and fungi, make these methods fundamentally  Overall, we believe that using genotype likelihoods when studying polyploids to overcome difficulties 420 in determining allele copy number and for dealing with low-coverage sequencing data is a promising 421 approach for future model development.
422 Table 1: A key to the symbols and notation that are used in describing the autopolyploid and allopolyploid models. We use a either a bold or bold-capitalized letter when referring to the collection of parameters together (e.g., G refers to g i for all individuals at all loci). Parameters within subgenomes for the allopolyploid model use the same symbol but with either a 1 or a 2 added as a subscript.