Accounting for genotype uncertainty in the estimation of allele frequencies in autopolyploids

Despite the increasing opportunity to collect large-scale data sets for population genomic analyses, the use of high throughput sequencing to study populations of polyploids has seen little application. This is due in large part to problems associated with determining allele copy number in the genotypes of polyploid individuals (allelic dosage uncertainty–ADU), which complicates the calculation of important quantities such as allele frequencies. Here we describe a statistical model to estimate biallelic SNP frequencies in a population of autopolyploids using high throughput sequencing data in the form of read counts. We bridge the gap from data collection (using restriction enzyme based techniques [e.g., GBS, RADseq]) to allele frequency estimation in a unified inferential framework using a hierarchical Bayesian model to sum over genotype uncertainty. Simulated data sets were generated under various conditions for tetraploid, hexaploid and octoploid populations to evaluate the model’s performance and to help guide the collection of empirical data. We also provide an implementation of our model in the R package polyfreqs and demonstrate its use with two example analyses that investigate (i) levels of expected and observed heterozygosity and (ii) model adequacy. Our simulations show that the number of individuals sampled from a population has a greater impact on estimation error than sequencing coverage. The example analyses also show that our model and software can be used to make inferences beyond the estimation of allele frequencies for autopolyploids by providing assessments of model adequacy and estimates of heterozygosity.


32
Biologists have long been fascinated by the occurrence of whole genome duplication (WGD) in 33 natural populations and have recognized its role in the generation of biodiversity (Clausen et al. realize that it does not apply to a large portion of polyploid taxa. Nevertheless, we believe that 118 accounting for ADU by modeling genotype uncertainty has the potential to be applied more broadly 119 via modifications of the probability model used for the inheritance of alleles, which could lead to 120 more generalized population genetic models for polyploids (see the Extensibility section of the 121 Discussion).

123
Our goal is to estimate the frequency of a reference allele for each locus sampled from a population    Then for individual i at locus , we model the number of sequencing reads containing the reference 145 allele (r i ) as a Binomial random variable conditional on the total number of sequencing reads (t i ), 146 the underlying genotype (g i ) and a constant level of sequencing error ( ) Here g is the probability of observing a read containing the reference allele corrected for sequencing (2) The intuition behind including error is that we want to calculate the probability that we observe a 150 read containing the reference allele. There are two ways that this can happen.
(1) Reads are drawn 151 from the reference allele(s) in the genotype with probability g i ψ but are only observed as reference 152 reads if they are not errors (probability 1 − ).
(2) Similarly, reads from the non-reference allele(s) in the genotype are drawn with probability 1 − g i ψ but can be mistakenly read as a coming from a 154 reference allele if an error occurs (probability ). The sum across these two possibilities gives the 155 overall probability of observing a read containing the reference allele. If we also assume conditional 156 independence of the sequencing reads given the genotypes, the joint probability distribution for 157 sequencing reads is given by Since the r i 's are the data that we observe, the product of P (r i |t i , g i , ) across loci and individuals 159 will form the likelihood in the model.

160
The next level in the hierarchy is the conditional prior for genotypes. We model each g i as a

161
Binomial random variable conditional on the ploidy level of the population and the frequency of the 162 reference allele for locus (p ): We also assume that the genotypes of the sampled individuals are conditionally independent given the 164 allele frequencies, which is equivalent to taking a random sample from a population in Hardy-Weinberg We choose here to ignore other factors that may be influencing the distribution of genotypes such 169 as double reduction. In general, double reduction will act to increase homozygosity (Hardy 2015).

170
However, it is more prevalent for loci that are farther away from the centromere, which makes The marginal posterior distribution for allele frequencies can be obtained by summing over genotypes It would also be possible to examine the marginal posterior distribution of genotypes but here we 185 will focus primarily on allele frequencies.

186
Full conditionals and MCMC using Gibbs sampling 187 We estimate the joint posterior distribution for allele frequencies and genotypes in Eq. 5 using MCMC. is Beta distributed and is given by Eq. 7 below: This full conditional distribution for p has a natural interpretation as it is roughly centered at the 194 proportion of sampled alleles carrying the reference allele divided by the total number of alleles 195 sampled. The "+1" comes from the prior distribution and will not have a strong influence on the 196 posterior when the sample size is large.

197
The full conditional distribution for genotypes is a discrete categorical distribution over the 198 possible values for the genotypes (0, . . . , ψ). The distribution for individual i at locus is Similarly, the m th estimate of the expected heterozygosity is calculated using Eq. 10 [denominator of 262 Eq. 8 in Hardy (2015)]: The posterior distribution of a multi-locus estimate of heterozygosity can then be approximated by    . This is not to say, however, that sequencing coverage has no effect 346 on the posterior distribution of allele frequencies. Lower sequencing coverage affects the posterior 347 distribution by increasing the posterior standard deviation (Figure 2). An interesting pattern that 348 emerged during the simulation study is the observation that the allele frequencies closer to 0.5 tend to 349 have higher error rates, which is to be expected given that the variance of a Binomial random variable 350 is highest when the probability of success is 0.5. We also observed small differences in the RMSE the frequency of the reference allele is very low, the read ratio estimate tends to be higher than the 365 posterior mean. However, the overall pattern does not indicate over or under estimation for most 366 allele frequencies ( Figure S2). When we took the difference between the estimates at each locus, the 367 distribution was centered near 0 ( Figure S3). in this way could technically be applied to any type of polyploid, but is only really appropriate for 398 autopolyploids due to the model of inheritance that is used. Other methods for estimating SNP g i Simulated genotype for posterior predictive model checking. g The probability of observing a reference read corrected for sequencing error.