Estimating time-varying selection coefficients from time series data of allele frequencies

Time series data of allele frequencies are a powerful resource for detecting and classifying natural and artificial selection. Ancient DNA now allows us to observe these trajectories in natural populations of long-lived species such as humans. Here, we develop a hidden Markov model to infer selection coefficients that vary over time. We show through simulations that our approach can accurately estimate both selection coefficients and the timing of changes in selection. Finally, we analyze some of the strongest signals of selection in the human genome using ancient DNA. We show that the European lactase persistence mutation was selected over the past 5,000 years with a selection coefficient of 2-2.5% in Britain, Central Europe and Iberia, but not Italy. In northern East Asia, selection at the ADH1B locus associated with alcohol metabolism intensified around 4,000 years ago, approximately coinciding with the introduction of rice-based agriculture. Finally, a derived allele at the FADS locus was selected in parallel in both Europe and East Asia, as previously hypothesized. Our approach is broadly applicable to both natural and experimental evolution data and shows how time series data can be used to resolve fine-scale details of selection.


Introduction
Time series data of allele frequencies are obtained from many sources including experimental Wright-Fisher model 49 Following the notation of Mathieson and McVean (2013), we consider a Wright-Fisher 50 population with an effective size of 2N e . We write f t as the frequency of the selected allele 51 at generation t for t = 0 . . . T . Suppose that the frequency trajectory is known exactly and 52 the selection coefficient s is constant over time. Then, an approximate maximum likelihood 53 estimator for s (Watterson, 1982) is . (1) That is, the total change in allele frequency, divided by the sum of the heterozygosity over 55 the time the allele is observed. Now suppose that the selection coefficient at generation t 56 is s t , but that it takes one of K possible values σ 0 . . . σ K . We assume that we know which 57 value s t takes at each generation and define indicator variables z t such that s t = σ zt . We 58 show in the Appendix that the maximum likelihood estimator of σ k is given by .
(2) This is Equation 1 with sums over generations when the selection coefficient is equal to σ k . In practice, f t is unknown. Instead, the 63 data consist of samples of n t chromosomes at each generation t (n t can be zero), of which 64 a t carry the selected allele. We treat f t as the hidden state in a hidden Markov model and where µ t = f t + sf t (1 − f t ) and ν t = ft ( In summary, the algorithm is as follows:  3. Repeat step 2 until iteration r where max k |σ r k − σ r−1 k | is less than some pre-defined 96 tolerance, and stop.

97
Because there are DK hidden states, running time is O(D 2 K 2 T ) and space is O(DKT ). generations of each changepoint.

117
• The root mean squared error in the weighted per-generation estimate of the selection We investigated performance as we varied parameter values and specified incorrect values 120 for fixed parameters, for example N e , c or K.  We ran an independent analysis where we fitted the observations using logistic regression where P i is the population to which individual i belongs and A i and B i are two of its 165ŝ standard error by assuming that the ratio ofŝ to its standard error is the same as the ratio 167 of β P i to its standard error. While this is not an explicit model of the evolutionary process, 168 it does allow us to account for variation in genome-wide ancestry across individuals.

170
Simulated data

171
In simulated data, we recover allele frequency trajectories, selection coefficients and the 172 timing of changes in selection coefficients (Fig. 2). Simulations also allow us to test the 173 robustness of the estimator to misspecification and highlight key features of its behavior.

174
First, under scenario 1, we tested robustness to misspecification of N e and c. These pa-175 rameters must be specified in advance. However, we find that the error in the estimates 176 is robust over one order of magnitude for N e , and two orders of magnitude for c ( Fig. S1 177 & S2) Thus, as long as reasonable estimates of these parameters are available, misspecifi-178 cation should not be a major concern. Second, we note that even for very large samples 179 the RMSE of the selection coefficientσ k andŝ t do not tend to zero. This is partly due to the stochastic effect of drift and partly due to the fact that the estimators can be biased, 181 particularly for low initial frequencies (Fig. S3). If the initial frequency is very low, there 182 is a relatively high chance that the allele is just by drift. For example, for an allele in a 183 single copy, there is a probability of ∼ e −1 ≈ 0.37 that the allele is lost in one generation 184 leading to a negative MLE for the selection coefficient.

185
As sample size increases, the RMSE ofŝ k decreases more reliably than that ofσ k ( inŝ t in the middle panel, and the posterior probability that z t is wrong in ±10 generations around each changepoint. We show estimates for sample sizes ranging from 1 to 1000, sampled either every generation or every 10 generations. and two selection coefficients (pre-and post-changepoint). We tested the performance of 208 this model under scenario 1 and find that our estimator outperforms it both in terms of 209 locating the changepoint and estimating the selection coefficients (Fig. S6).  In parts of Europe, for example Iberia, the derived allele did not become common until

223
We used data from 499 ancient Europeans, divided by region, to investigate whether 224 there were differences in the selective pressure across Europe, and whether the strength 225 of selection varied over time (Fig. 3). We estimate that in Britain and Central Europe, 226 the variant experienced a selection coefficient of ∼0.025, consistently for the past 4-5000 227 years. In Iberia, the selection coefficient was slightly lower-around 0.02. Bootstrapping 228 suggests that the selection coefficients outside Italy might be underestimated by up to 229 0.005 (Fig. 3). We find no evidence that the allele was ever under selection in Italy, with 230 an estimated selection coefficient of zero. One concern is that these differences might be due 231 to difference in the timing of ancestry changes across Europe. We therefore fitted a logistic 232 regression to the observations, including date and two ancestry components (inferred using selection coefficient trajectories for ADH1B (Fig. 4). We estimate that by 4000 BP, the 262 derived ADH1B was already common south of 30°N, but was still rare further north. Selec-263 tion intensified in the north around 4000 BP with a selection coefficient of around 2%. We  (Fig. 5A). In East Asia, we find that the same allele has also been under recent selection, 281 with a trajectory and selection coefficient in the north that is similar to that observed in 282 Europe (Fig. 5B). In the south we estimate a lower frequency but stronger selection though 283 with only one observation (out of 30) of the derived allele, this is very uncertain. In both 284 cases, we find consistent results with the logistic regression model (Figures S12 and S13). Ancient DNA is powerful tool for studying the role of natural selection in human evolution.

287
By detecting time-varying selection, we can identify environmental changes leading to se-288 lective pressure on particular alleles. Our approach is not limited to human data, and is 289 broadly applicable to ancient DNA, ecological or experimental evolution studies. 290 We find that the selection coefficient for the European lactase persistence allele was one hypothesis about selection for lactase persistence is that it allows the uptake of vita-305 min D from milk rather than UV radiation, which is advantageous in the North but not 306 South of Europe. However, our results show that selection was almost as strong in Iberia 307 as in Northern Europe and much stronger than in Italy, making this unlikely to be the sole 308 explanation. By allowing these inferences, our approach and others based on ancient DNA 309 should provide much deeper insight into the nature of recent human evolution.

311
We thank Ziyue Gao for helpful comments on an earlier version of the manuscript. This on f t , the distribution of f t+1 is binomial with size N e and probability f t + s t f t (1 − f t ).

470
Thus, log-likelihood of the selection coefficients σ 1 . . . σ K is given by: But, since s t = σ k 1 {Z t = k}, the log-likelihoods for each σ k do not depend on each other 472 so we can write Differentiating w.r.t. σ k and setting equal to zero gives.
Expanding the fraction to first order in σ k gives T t=1 Estimated selection coefficients and 95% confidence intervals in each region (0.004-0.015 and -0.01-0.18 in North and South, respectively). Right panels: Ancestry components for each individual (identical to Figure S11), with region-specific smoothed LOESS fit lines.