Estimating temporally variable selection intensity from ancient DNA data II

Recent technological innovations, such as next generation sequencing and DNA hybridisation enrichment, have made it possible to recover DNA information from historical and archaeological biological materials, which has motivated the development of various statistical approaches for inferring selection from allele frequency time series data. Recently, He et al. (2023a,b) introduced methods that can utilise ancient DNA (aDNA) data in the form of genotype likelihoods, therefore enabling the modelling of sample uncertainty arising from DNA molecule damage and fragmentation. However, their performance suffers from the underlying dependency on the allele age. Here we introduce a novel particle marginal Metropolis-Hastings within Gibbs framework for Bayesian inference of time-varying selection from aDNA data in the form of genotype like-lihoods. To circumvent the performance issue encountered in He et al. (2023a,b), we devise a novel numerical scheme for backward-in-time simulation of the Wright-Fisher diffusion and mix forward- and backward-in-time simulations in the particle filter for likelihood computation. Our framework also enables us to reconstruct the underlying population allele frequency trajectories, integrate temporal information in genotype likelihood calculations and test hypotheses on the drivers of past selection events. We conduct extensive evaluations through simulations and show its utility with an application to aDNA data from pigmentation loci in ancient horses.

(3)-(5), but the remaining two are computationally infeasible to generate 128 from. We therefore resort to the MH-within-Gibbs algorithm introduced by Tierney (1994) as a 129 hybrid framework that incorporates the MH algorithm (Metropolis et al., 1953;Hastings, 1970) 130 to implement the sampling steps when Gibbs sampling fails. More specifically, each iteration we 131 update the sample individual genotypes g 1:K using the full conditional probability distribution 132 p(g 1:K | ϑ, x 1:K , r 1:K ) in the Gibbs step and update the population genetic parameters ϑ and 133 the population mutant allele frequency trajectory x 1:K in the MH step.

134
In the MH-within-Gibbs algorithm, the population genetic parameters ϑ and the population 135 mutant allele frequency trajectory The product of the average weights of the set of particles at the sampling time points t 1:K yields 142 an unbiased estimate of the marginal likelihood p(r 1:K , g 1:K | ϑ), denoted byp(r 1:K , g 1:K | ϑ), and the population mutant allele frequency trajectory x 1:K is sampled once from the final set 144 of particles with their corresponding weights (see He et al. (2023a) for details).

145
For clarity, we write down the PMMH-within-Gibbs algorithm for our posterior computation:

150
Repeat Steps 2 and 3 until a sufficiently large sample of ϑ, x 1:K and g 1:K have been obtained:

155
Step 3b: Run a bootstrap particle filter with ϑ i and g i 1:K to getp(r 1:K , g i 1:K | ϑ i ) and x i 1:K .

156
Step 3c: Accept ϑ i and x i 1:K with otherwise set ϑ i = ϑ i−1 and x i 1:K = x i−1 1:K .

158
Note that the superscript i represents the iteration label in our procedure presented above. We with the selection coefficient s − for t < τ and s + for t ≥ τ , respectively, in the bootstrap particle  In our PMMH-within-Gibbs procedure, the marginal likelihood is calculated through a boot- given X(t) = x, we let p − (t, x) and p + (t, x) denote the probability that and x + ∆ + x , respectively, which are the two neighbouring points of x. For generality, we allow , and then we have where whence we obtain and then we can approximate this measure (not necessarily a probability measure) by drawing 204 a value x for X(t) according to the probability distribution and assigning weight w t to x. The weights at every time step multiply. Note that for simplicity, 206 we fix the grid point closest to 0 to be 1/(2N ), thereby p

207
With the backward-in-time simulation of the Wright-Fisher diffusion in the bootstrap parti-208 cle filter, our method can avoid the dependence of the timing when the mutant allele is created 209 but will require the timing when the mutant allele is lost, which is also commonly unavailable in   Step 1 until x i K ∈ (0, 1) for i = 1, 2, . . . , 10000:

246
Step 2: Simulate x 1:K with s − , s + and x 1 through the Wright-Fisher model with selection.

252
Note that the superscript i represents the replicate label. Such a procedure avoids the mismatch 253 between the underlying model in the data generation process and that in the selection inference 254 procedure (He et al., 2023a). We set the parameter α n,k g n,k to φψ and the other two to (1 − φ)ψ/2, 255 where φ and ψ are the parameters that control the quality of the simulated dataset in terms of 256 the missing rate (MR) and error rate (ER) with a common threshold for genotype calling (i.e., the performance difference will decline and may even be eliminated as the data quality improves 295 (see Supporting Information, Figure S1). testing selection changes than our method, but the gap will narrow and may even be eliminated 318 when the data quality improves (see the Supporting Information, Figure S3).   Figure S4 and  Figure 5: Boxplots for the AMR and AER of the genotype posteriors across different selection scenarios. GP and GL are shorthands for genotype posterior and genotype likelihood.

Selection of horse coat colouration 331
We employ our procedure to infer selection acting on the MC1R, KIT13 and TRPM1 genes,

397
In this work, we have introduced a novel Bayesian procedure for estimating temporally vari-  However, improving the quality of genotype likelihoods by integrating the temporal dimension 481 is a highly desirable feature in aDNA studies, and moreover, the performance difference in the 482 inference of selection is negligible when the data quality is not too poor. Our procedure therefore 483 is expected to be a convincing alternative for aDNA studies.