Timesweeper: Accurately Identifying Selective Sweeps Using Population Genomic Time Series

Despite decades of research, identifying selective sweeps, the genomic footprints of positive selection, remains a core problem in population genetics. Of the myriad methods that have been developed to tackle this task, few are designed to leverage the potential of genomic time-series data. This is because in most population genetic studies of natural populations only a single period of time can be sampled. Recent advancements in sequencing technology, including improvements in extracting and sequencing ancient DNA, have made repeated samplings of a population possible, allowing for more direct analysis of recent evolutionary dynamics. Serial sampling of organisms with shorter generation times has also become more feasible due to improvements in the cost and throughput of sequencing. With these advances in mind, here we present Timesweeper, a fast and accurate convolutional neural network-based tool for identifying selective sweeps in data consisting of multiple genomic samplings of a population over time. Timesweeper population genomic time-series data by first simulating training data under a demographic model appropriate for the data of interest, training a one-dimensional Convolutional Neural Network on said simulations, and inferring which polymorphisms in this serialized dataset were the direct target of a completed or ongoing selective sweep. We show that Timesweeper is accurate under multiple simulated demographic and sampling scenarios, identifies selected variants with high resolution, and estimates selection coefficients more accurately than existing methods. In sum, we show that more accurate inferences about natural selection are possible when genomic time-series data are available; such data will continue to proliferate in coming years due to both the sequencing of ancient samples and repeated samplings of extant populations with faster generation times, as well as experimentally evolved populations where time-series data are often generated. Methodological advances such as Timesweeper thus have the potential to help resolve the controversy over the role of positive selection in the genome. We provide Timesweeper as a Python package for use by the community.

Timesweeper, a fast and accurate convolutional neural network-based tool for identifying 23 selective sweeps in data consisting of multiple genomic samplings of a population over time. Timesweeper. This module is intended to be modified to suit the needs of the experiment, and 165 is written such that it is easy to incorporate utility functions, stochastic variable draws, and more 166 into each replicate. 167 Users must write their SLiM script such that it accepts two constant parameters at runtime: 168 the sweep type, and the output file name. The sweep types are denoted by the "scenarios" list in 169 the YAML configuration file, and a user may write their SLiM script to use these arguments 170 however they wish. In our default workflow, they specify the class of training/test data to be 171 generated, with "neut" denoting simulations lacking a selective sweep, "sdn" denoting simulations 172 with a selective sweep on a single-origin de novo mutation (hereafter referred to as the SDN 173 model), and "ssv" denoting simulations with a selective sweep from standing variation (the SSV 174 model). An example SLiM script is included in the software package, and many more included  Note that SSV sweeps with lower starting allele frequencies are more likely to be lost, and 181 thus if we were to condition on a randomly selected polymorphism to go to fixation, we would 182 obtain a downwardly biased initial selected frequency relative to what would be expected under 183 an SSV model that correctly deals with different fixation probabilities (see (Hermisson and 184 Pennings 2017)). To mitigate this problem in our simulations, when an SSV sweep is lost, we 185 restart the simulation from 500 generations prior to the onset of selection so that there is the 186 opportunity for different mutations to be selected in subsequent attempts. The rationale for this 187 design choice is that those sweep attempts that happen to select a more common polymorphism 188 will be more likely to result in fixation, although we did not investigate the extent to which this 189 avoids the frequency bias described above. We also note that our SSV mutations do not condition Simulating using stdpopsim 205 The second option for simulating training data is by injecting time-series sampling code into a (hereafter referred to as the "target population"), and the range of selection coefficients to use in a 213 log-uniform distribution. This module then adds the desired events to the stdpopsim SLiM 214 script before running the simulation. The output of this module is identical in format to that 215 generated when using the simulate_custom module described above, and is processed identically downstream. We note that although stdpopsim supports selective sweeps with some 217 degree of customizable parameters, this feature does not currently support all of the scenarios we 218 desired for our workflow (e.g. selection on a randomly chosen polymorphism that was previously 219 fitness-neutral, while specifying only the time at which the derived allele later became beneficial).

220
As more features are added to the stdpopsim software, we may update our workflow to better 221 take advantage of this resource. concatenates a VCF entry to an output file, hereafter referred to as a "multiVCF", for that replicate.

227
After a simulation replicate is complete, the multiVCF is split into separate files consisting of one 228 timepoint each, labeled in numerical order according to when they were sampled in the series.

229
Note that because our workflow uses forward simulations, the sweeping mutation may often be the "net-increasing" or "net-decreasing" are randomly assigned to the two alleles. The desired l × 251 m matrix showing the frequency of the net-increasing allele for each polymorphism at each 252 timepoint is then created (shown for several individual replicates as well as mean values in Figure   253 1A and Figure S1). For this paper, we set l to 51 (i.e. 25 flanking polymorphisms are examined 254 around each focal polymorphism to be classified) unless otherwise noted. When dealing with 255 simulated data, the number of polymorphisms obtained was always greater than l, and thus the 256 matrix was cropped to obtain l polymorphisms as described below in "Classifier Training and (shown for individual replicates as well as mean values in Figure 1B and Figure S1B). This matrix 270 is then sorted such that the haplotype with the highest net increase in frequency between the first 271 and last timepoints is located at the "bottom" of the frequency matrix, i.e. the 0 th index. regions (e.g. Figure 3 from (Garud et al. 2015), and this appears to carry over into 2-dimensional 281 visualizations of haplotype frequency time series ( Figure S1B). However, we did not test the 282 impact of this sorting on accuracy, and we note that for the one-dimensional CNN used for most haplotype frequencies) are identical in structure, although the number of parameters differ due to 296 the difference in input image size aside from the adjustment in input layer dimensions.

297
The 1DCNN time-series network begins with two 1D convolutional layers with 64 filters 298 and a kernel size of 3 each (i.e. the first convolutional layer runs rectangular filters across an area 299 of all polymorphisms/haplotypes by 3 timepoints at a time, sliding across timepoints with a stride 300 length of 1. These are followed by a 1D maximum pooling layer with a pool size of 3, followed 301 by a dropout layer with a dropout rate of 0.15. A flattening layer then precedes two blocks of 264-302 unit dense layers followed by dropout layers with a rate of 0.2. This is followed by a 128-unit 303 dense layer, a dropout layer with dropout rate of 0.3, and a final dense layer with 3 output units.

304
All layers described use ReLu activation function other than the final output layer, which uses a 305 softmax activation function for classification (identifying and classifying sweeps), and linear 306 activation for regression (inferring s). Our rationale for having the 1D convolutional filters stride 307 across timepoints was that they would be able to examine the selected allele and any linked 308 hitchhiking alleles in any given timepoint, and track them together across timepoints. We did not 309 experiment with transposing the input such that filters stride across polymorphisms rather than 310 time.

311
The one-timepoint network is a fully connected network (FCN) consisting of two blocks 312 of 512-unit dense layers followed by dropout layers with rates of 0.2, followed by a 128-unit dense To create training labeled data for models we first process simulated VCFs as discussed above.

322
During this process the mutation type specified in the SLiM simulation is obtained from each VCF   for the time-series allele frequency spectrum, and the rescaled allele frequency increments for each 405 polymorphism in the simulated region are calculated as ! = written to a temporary input file that was then used to run slattice with the default options.

433
WFABC was run with default parameters using input formatted from VCF files similarly as 434 described above.  which we refer to as TS-SHIC. TS-SHIC first calculates the f × l feature input array as described 459 above for each timepoint, and then concatenating them into an f × t format, where t is the number 460 of timepoints, that was then used to train and test the standard 1DCNN architecture described above. All training and testing was performed identically to other Timesweeper benchmarks, 462 including 10,000 replicates for each class for training, 5,000 replicates of each class for testing.

464
Applying Timesweeper to a Drosophila simulans evolve-and-resequence dataset 465 To demonstrate Timesweeper's flexibility and applicability to real data, we applied the AFT simulans pool-sequencing data from an evolve-and-resequence study. In brief, Barghi et al.  Figure 1A) and haplotypes (the bottom row in Figure 1B) is readily observable 548 in this representation of the mean, and the selected and unselected cases are readily discernable.

549
However, we note that individual selective sweeps may depart substantially from this expectation, Having constructed the data representations described above, we sought to train neural networks 558 to distinguish between three distinct classes: 1) Neutrally evolving regions, 2) Sweeps caused by selection on a single-origin de novo mutation (the SDN model, which produces classic "hard" in the vicinity of a sweep. As we show below, this approach yields excellent inferential power.

568
Here and in the following sections, we show that allele frequency trajectories are highly

577
Qualitatively similar results are shown in the precision-recall curves in Figure 2B. We find that 578 Timesweeper can estimate s with high accuracy across a range of selection strengths with both the 579 AFT and HFT data formats (Figure 1 C-D). we next benchmarked the same Timesweeper classifiers described above on a set of 15,000 587 simulated chromosomes 500 kilobases long-5,000 replicates each of the neutral, SSV, and SDN 588 scenarios. We found that Timesweeper was not only able to accurately detect true sweep 589 locations but has a low false positive rate across the flanking neutral regions along the chromosome 590 ( Figures S3-4). The HFT method, on the other hand, exhibits comparatively low resolution in 591 localizing sweeps ( Figures S3-4), likely because it examines a representation of haplotype 592 frequencies that has no spatial information about any potentially sweeping alleles. Timesweeper 593 also performs classification accurately for a variety of selection coefficients using the sampling 594 scheme described above (Figures S5-7). Testing on a range of s we found that AUROC is although we note that this may not hold for smaller numbers of timepoints, a possibility that we 651 did not investigate here. Interestingly, HFT performs worse the larger the window size becomes (AUROC of 0.96 for l =1 vs 0.80 for l=201; Figures S11-13). The regression task follows a similar 653 trend, with no major differences noted across the different window sizes for AFT and decreasing 654 performance with larger window sizes for HFT ( Figure S14).

655
In the context of genome-wide scans for selection, one will test many sites potentially 656 affected by selection, and in this context our goal is not only accurately detect sweeps, but to 657 localize them to a reasonably small candidate region. We found that, although small window sizes 658 produced accurate inferences at the target sites themselves, they were much more likely to 659 misclassify closely linked polymorphisms as being targets of selection as well than were networks 660 trained on larger window sizes ( Figures S3-4). Indeed, the number of linked polymorphisms within 661 500 bp of the true target of selection that are misclassified as sweeps decreases from 50 to 39 as l 662 increases from 1 to 201. We have adopted l=51 as our default as it appears to reach a good balance 663 between accuracy and computational speed (as larger window sizes require more trainable network 664 parameters), but if users wish to use larger values when running Timesweeper this is an option. 665 We note that the optimal value for any given dataset may depend on the density of polymorphism, 666 the recombination rate, and the distribution of selection coefficients, so we encourage users to 667 experiment with different window sizes before adopting one for their analysis.  Figure S26). 746 We also found that Timesweeper was able to differentiate between SDN and SSV   at an attenuated rate, even at 30,000 replicates (R 2 of 0.87; Figure S30).

773
Timesweeper is relatively robust to the effects of demographic model misspecification 774 As with all parametric approaches to population genetic inference, the demographic model used OoA models after training on the constant model (Table 1, Figure S33). Timesweeper outperforms previous methods 799 We compared Timesweeper to a variety of other methods created for the task of either detecting  ApproxWF being the next best performer (R 2 =0.6). Note that slattice obtained a negative R 2 820 value, because the magnitude of error was quite large for a substantial fraction of cases, but 821 visual inspection shows that many predictions by this methods were quite accurate (Figure 4).  Table 3. Accuracy of Timesweeper and competing methods on the selection coefficient estimation task.

829
In order to assess Timesweeper's utility on experimental evolution data we applied it to data 830 from a publicly available Drosophila simulans study published by Barghi et al. (2019). This study 831 contains 10 experimental replicates, each exposed to a warm environment for 60 generations, 832 giving us the ability to assess our method's performance on real data by examining the extent to 833 which Timesweeper's sweep candidates were replicated across these 10 datasets. We compare  In brief, we simulated a training set to match the experimental design described in Barghi 838 et al. and used these simulations to train a Timesweeper AFT classifier as described in Methods. 839 We note that this experiment lasted for a relatively short period of time (60 generations), such that 840 positively selected de novo mutations would rarely be expected to arise and reach high frequency 841 by the end of the experiment; we therefore trained Timesweeper to distinguish between 842 neutrality and the SSV model (example and average inputs in Figure S34), and refer to the latter 843 as the "sweep" model for the remainder of this section. We then used the resulting classifier, which 844 we found to be highly accurate (AUROC=0.96; Figure S35) to scan allele frequency estimates  (Tables S1 and S2). 853 We next investigated the degree to which the top sweep candidate SNPs (the 100 highest-  (Table 3). Upon closer examination, we   SNPs among the top 100 scoring sweep candidates for a given replicate were considered to be replicated if they were also found in the top-scoring 1% predictions by that same method in another replicate. Note that some candidate SNPs were replicated in more than one of the 9 additional experimental replicates (see Figure S36), and thus a total number of replicated hits >100 is possible. with two approaches: one using a matrix representation of haplotype frequencies estimated in each sampled timepoint, and one tracking allele frequencies through time in a window of 918 polymorphisms centered around a focal polymorphism. We found that while the haplotype 919 frequency tracker (HFT) had decent power to detect selective sweeps, the allele frequency tracker 920 (AFT), when compared to competing methods, had superior accuracy and was able to localize 921 selected variants with much higher resolution, and accurately infer selection coefficients. We note 922 that while the AFT method was able to distinguish between the SDN and SSV selection models 923 with greater accuracy than competing methods, accuracy on this task was considerably lower than 924 that for distinguishing between selection and neutrality. This is partially a consequence of how we 925 defined these models: the SSV model involves selection on a previously standing polymorphism, 926 but if by chance only a single ancestral copy of the adaptive allele survives the sweep, then the 927 result will be indistinguishable from the SDN model. 928 We also note that the next-best method for identifying sweeps on our initial benchmarking 929 dataset was the simplest: Fisher's exact test (FET) for allele frequency differences between the 930 first and final timepoint. We note that FET's performance dropped considerably when applied to 931 our test datasets with non-equilibrium demographic histories ( Figure S32), demonstrating that this 932 approach can only be used in settings where one knows the degree of genetic drift over the course 933 of the sampling interval will be small-scenarios involving periods of small effective populations 934 sizes, which many species experience in nature, and/or long sampling intervals will be 935 inappropriate for FET. In addition, we found that Timesweeper is better able to narrow down the

976
Our AFT method, on the other hand, was highly successful at localizing sweeps. This is 977 perhaps because this approach makes an inference for a single focal polymorphism while taking 978 into account spatial information about the presence or absence of nearby hitchhiking variants, 979 before sliding on to the next focal polymorphism. Although this method does not explicitly use 980 haplotype information, it may implicitly be able to track LD among polymorphisms by seeing 981 correlated shifts in allele frequencies (see Figure1). Thus, it is possible that little information is 982 lost by relying on allele frequency trajectories rather than haplotype frequencies. 983 An added benefit of the AFT approach is that it can be used with unphased data, or even 984 pooled sequencing data. This allowed us to test the method on data from an evolve-and-resequence 985 study that repeatedly used pool-seq to sample experimental populations of D. simulans over a 986 period of 60 generations of selection to thermal tolerance (Barghi et al. 2019). We found that 987 Timesweeper was able to detect regions with alleles that experienced large shifts in frequency 988 over the course of this selection experiment. Moreover, Timesweeper often detected the same 989 putatively sweeping alleles in multiple experimental replicates, suggesting that its results are 990 accurate. We stress that these encouraging results were obtained in spite of no adjustment of our 991 method to these data: the founder lines of the D. simulans experiment exhibit a very high density 992 of polymorphism, and thus our approach to examine 51-SNP windows caused us to examine only 993 a relatively small genetic map distance when making each classification-incorporating a larger 994 flanking window in this setting would likely yield even better performance. We believe that the 995 successful application of Timesweeper to these data in spite of this limitation bodes well for the 996 broad applicability and robustness of our approach.

997
In light of Timesweeper's strong performance on simulated and real genomic data, we 998 have released the software package implementing it to the public. We strove to make this package 999 easy to use and computationally efficient, as we believe that fast, user-friendly, and powerful 1000 computational tools are essential if we wish to realize the potential of population genomic time-1001 series data. We hope that further research in this area of population genetic methods development, 1002 combined with the application of these methods to time-series data from natural populations, will 1003 help us better characterize the genomic targets and evolutionary importance of positive selection.          Figure S6 but for the data described in Figure S11. Same as Figure S7 but for the data described in Figure S11. Same as Figure S9 but for the data described in Figure S11.  Same as Figure S6 but for the data described in Figure S15. Same as Figure S7 but for the data described in Figure S15. Same as Figure S9 but for the data described in Figure S15.  Figure S6 but for the data described in Figure S19.  Figure S7 but for the data described in Figure S19.  Figure S9 but for the data described in Figure S19.   Figure S6 but for the data described in Figure S23.  Figure S7 but for the data described in Figure S23.  Figure S9 but for the data described in Figure S23.

1407
Simulation parameterizations for selection coefficient and sampling start times are the same as 1408 described in Figure S1. Data was subsampled to sizes 30,000, 20,000, 10,000, 5,000, 2,000, and  Figure S6 but for the data described in Figure S27.