Inferring the timing and strength of natural selection and gene migration in the evolution of chicken from ancient DNA data

With the rapid growth of the number of sequenced ancient genomes, there has been increasing interest in using this new information to study past and present adaptation. Such an additional temporal component has the promise of providing improved power for the estimation of natural selection. Over the last decade, statistical approaches for detection and quantification of natural selection from ancient DNA (aDNA) data have been developed. However, most of the existing methods do not allow us to estimate the timing of natural selection along with its strength, which is key to understanding the evolution and persistence of organismal diversity. Additionally, most methods ignore the fact that natural populations are almost always structured, which can result in overestimation of the effect of natural selection. To address these issues, we introduce a novel Bayesian framework for the inference of natural selection and gene migration from aDNA data with Markov chain Monte Carlo techniques, co-estimating both timing and strength of natural selection and gene migration. Such an advance enables us to infer drivers of natural selection and gene migration by correlating genetic evolution with potential causes such as the changes in the ecological context in which an organism has evolved. The performance of our procedure is evaluated through extensive simulations, with its utility shown with an application to ancient chicken samples.


Introduction
which enables the co-estimation of the selection coefficient and the migration rate from genetic Wright-Fisher diffusion for a single locus evolving under natural selection and gene migration,

58
In this section, we first introduce the multi-allele Wright-Fisher diffusion for a single locus 59 evolving under natural selection and gene migration and then present our Bayesian method for 60 the joint inference of selection and migration from time series allele frequency data.

Wright-Fisher diffusion 62
Let us consider a population of N randomly mating diploid individuals at a single locus A 63 with discrete and nonoverlapping generations, where the population size N is finite and fixed 64 over time. Suppose that at locus A there are two allele types, labelled A 1 and A 2 , respectively. corresponds to randomly sampling 2N gametes with replacement from an effectively infinite 91 gamete pool to form new zygotes in the next generation through random union of gametes. Table 1: Relative viabilities of all possible genotypes at locus A when we distinguish between the alleles that originate on the island and the alleles that emigrate from the continent.
We let X (N ) (k) = (X space Ω X = {x ∈ [0, 1] 4 : 4 i=1 x i = 1} and satisfying the stochastic differential equation (SDE) 103 of the form 104 dX(t) = µ(X(t), t)dt + σ(X(t), t)dW (t), with initial condition X(t 0 ) = x 0 . In Eq. (1), µ(x, t) is the drift term where x c is the frequency of the A c 1 allele in the continent population, which is fixed over time, 106 σ(x, t) is the diffusion term where p(ϑ s , ϑ m ) is the prior probability distribution for the population genetic quantities and 141 can be taken to be a uniform distribution over the parameter space if their prior knowledge is

146
With the Markov property of the Wright-Fisher diffusion, we have where p(x 1 | ϑ s , ϑ m ) is the prior probability distribution for the allele frequencies of the under-148 lying population at the initial sampling time point, commonly taken to be non-informative (e.g.,

149
flat over the entire state space) if the prior knowledge is poor, and p(x k+1 | x k ; ϑ s , ϑ m ) is the in this work we take the prior p(x 1 | ϑ s , ϑ m ) to be a uniform distribution over the state space 154 Ω X , known as the flat Dirichlet distribution, if migration starts before the first sampling time point, i.e., t m ≤ t 1 ; otherwise, the prior p(x 1 | ϑ s , ϑ m ) is set to be a uniform distribution over 156 the state space Ω X restricted to the line satisfying the condition that x 3 = x 4 = 0, i.e., there is 157 no continent allele in the island population.

158
Given the allele frequency trajectories of the underlying population, the observations at each 159 sampling time point are independent. Therefore, we have 160 p(c 1:K , d 1: where p(c k , d k | x k ; ϑ s , ϑ m ) is the probability of the observations at the sampling time point t k 161 given its corresponding allele frequencies of the underlying population for k = 1, 2, . . . , K. If 162 the sample continent allele count d k is available, we introduce z k = (z 1,k , z 2,k , z 3,k , z 4,k ) to be 163 the counts of the A i 1 , A i 2 , A c 1 and A c 2 alleles in the sample at the k-th sampling time point, and 164 the emission probability p(c k , d k | x k ; ϑ s , ϑ m ) can be expressed as where Ω Z k = {z k ∈ N 4 : 4 i=1 z i,k = n k } and 1 A is the indicator function that equals to 1 if 166 condition A holds and 0 otherwise. Otherwise, the emission probability p(c k , d k | x k ; ϑ s , ϑ m ) 167 can be reduced to

169
The most challenging part in the computation of the posterior p(ϑ s , ϑ m , x 1:K | c 1:K , d 1:K ) is 170 obtaining the transition probability density function p(x k+1 | x k ; ϑ s , ϑ m ) for k = 1, 2, . . . , K −1. parameters and the allele frequency trajectories of the underlying population.

180
In our PMMH-based procedure, the marginal likelihood 181 p(c 1:K , d 1: is estimated with the bootstrap particle filter developed by Gordon et al. (1993), where the par- Step 1: Update the selection-related parameters ϑ s .

207
Step 2c: Accept ϑ m and x 1:K with otherwise set ϑ m = ϑ m and x 1: In this work we use random walk proposals for both selection-and migration-related parameters 210 in our blockwise PMMH algorithm unless otherwise specified.

211
Once enough samples of the parameters (ϑ s , ϑ m ) and the underlying population allele fre- where the likelihood is replaced by the product of the likelihoods for each locus.

223
In this section, we first evaluate the performance of our approach using simulated datasets

319
The resulting estimates for dominant selection (h = 0) and recessive selection (h = 1) can be 320 found in Figures S3 and S4, respectively, for the case of the population size N = 50000. They 325 This is mainly due to our parameter setting, i.e., the effect of selection, when the mutant allele 326 has been established in the population (e.g., our starting mutant allele frequency is 0.4), is the 327 strongest for recessive selection and weakest for dominant selection (see Figure S5).

328
In conclusion, our approach can produce reasonably accurate joint estimates of the timing   The time from which the data come ranges from approximately 2200 years ago to the present.

385
To evaluate the performance of our approach when samples are sparsely distributed in time 386 with small uneven sizes like the European chicken samples at the TSHR locus we have studied 387 above, we generated 300 simulated datasets that mimic the TSHR data, i.e., we used the sample 388 times and sizes as given in Table 2, the timing and strength of selection and migration as given 389 by MAP estimates found in Table 3 Table 2. We simulate the underlying population dynamics with the timing and strength of selection and migration estimated with the population size N = 180000 shown in Table 3. To aid visual comparison, we have picked the x axis in the left panel not to cover all 300 estimates. The histogram containing all 300 estimates can be found in Figure S8.
In summary, our approach works well on the ancient chicken samples, even though they are  timing and strength of selection and migration were taken to be our estimates given in Table 3 441 but the true population size was taken to be N = 4500. We ran our method with a misspecified 442 population size N = 180000 for these 300 replicates and find that this larger population size  Table 2. We take the timing and strength of selection and migration to be those estimated with the population size N = 180000 given in Table 3, but the true population size in the simulation is taken to be N = 4500. To aid visual comparison, we have picked the x axis in the left panel not to cover all 300 estimates. The histogram containing all 300 estimates can be found in Figure S9.
We explored how misspecification of genetic dominance or gene migration affects our infer-  In this work, we have focused on the continent-island model under the assumption that the 463 allele frequencies of the continent population are fixed over time. As has been previously noted 464 Figure 9: Empirical distributions of the estimates for 300 datasets simulated for TSHR based on the aDNA data presented in Table 2. We take the timing and strength of selection and migration to be those estimated with the population size N = 180000 given in Table 3, but the true dominance parameter in the simulation is taken to be (a) h = 0 and (b) h = 0.5, respectively. To aid visual comparison, we have picked the x axis in the left panel not to cover all 300 estimates. The histogram containing all 300 estimates can be found in Figure S10.  Figure 10: Empirical distributions of the estimates for 300 datasets simulated for TSHR based on the aDNA data presented in Table 2. We take the timing and strength of selection and migration to be those estimated with the population size N = 180000 given in Table 3, but the true migration rate in the simulation is taken to be (a) m = 0.0001 and (b) m = 0.001, respectively. To aid visual comparison, we have picked the x axis in the left panel not to cover all 300 estimates. The histogram containing all 300 estimates can be found in Figure S11.
likelihood will depend on the samples from both the island and continent populations, and the 479 selection-related parameters for the continent population are updated as an additional block.

480
In a similar manner, we can also allow gene migration to change the genetic composition of the 481 continent population, i.e., the two-island model. Our approach is also readily applicable to the but it may suffer from particle degeneracy and impoverishment issues if we extend our method 484 to jointly estimate the allele age, which results from low-frequency mutant alleles at the early 485 stage facing a higher probability of being lost.

486
It is possible to extend our procedure to handle the case of multiple islands or multiple loci.

487
For multiple islands, our method will be more computationally demanding with an increase in 488 the number of demes, but improvements to exact-approximate particle filtering techniques such  Table 2. We take the timing and strength of selection and migration to be those estimated with the population size N = 180000 given in Table 3, but the true migration time in the simulation is taken to be (a) km = −400 and (b) km = −100, respectively. To aid visual comparison, we have picked the x axis in the left panel not to cover all 300 estimates. The histogram containing all 300 estimates can be found in Figure S12.