Skip to main content
bioRxiv
  • Home
  • About
  • Submit
  • ALERTS / RSS
Advanced Search
Confirmatory Results

Admixture has obscured signals of historical hard sweeps in humans

View ORCID ProfileYassine Souilmi, View ORCID ProfileRaymond Tobler, View ORCID ProfileAngad Johar, Matthew Williams, View ORCID ProfileShane T. Grey, View ORCID ProfileJoshua Schmidt, View ORCID ProfileJoão C. Teixeira, View ORCID ProfileAdam Rohrlach, View ORCID ProfileJonathan Tuke, Olivia Johnson, View ORCID ProfileGraham Gower, View ORCID ProfileChris Turney, View ORCID ProfileMurray Cox, Alan Cooper, View ORCID ProfileChristian D. Huber
doi: https://doi.org/10.1101/2020.04.01.021006
Yassine Souilmi
1Australian Centre for Ancient DNA, The University of Adelaide, Adelaide, SA 5005, Australia
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Yassine Souilmi
  • For correspondence: yassine.souilmi@adelaide.edu.au raymond.tobler@adelaide.edu.au angad.johar@unimelb.edu.au alanjcooper42@gmail.com cdh5313@psu.edu
Raymond Tobler
1Australian Centre for Ancient DNA, The University of Adelaide, Adelaide, SA 5005, Australia
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Raymond Tobler
  • For correspondence: yassine.souilmi@adelaide.edu.au raymond.tobler@adelaide.edu.au angad.johar@unimelb.edu.au alanjcooper42@gmail.com cdh5313@psu.edu
Angad Johar
1Australian Centre for Ancient DNA, The University of Adelaide, Adelaide, SA 5005, Australia
12Melbourne Integrative Genomics, School of BioSciences and School of Mathematics and Statistics, The University of Melbourne, Parkville, VIC 3010, Australia
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Angad Johar
  • For correspondence: yassine.souilmi@adelaide.edu.au raymond.tobler@adelaide.edu.au angad.johar@unimelb.edu.au alanjcooper42@gmail.com cdh5313@psu.edu
Matthew Williams
1Australian Centre for Ancient DNA, The University of Adelaide, Adelaide, SA 5005, Australia
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Shane T. Grey
2Transplantation Immunology Group, Immunology Division, Garvan Institute of Medical Research, NSW 2010, Australia
3St Vincent’s Clinical School, Faculty of Medicine, UNSW, Darlinghurst, New South Wales, Australia
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Shane T. Grey
Joshua Schmidt
1Australian Centre for Ancient DNA, The University of Adelaide, Adelaide, SA 5005, Australia
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Joshua Schmidt
João C. Teixeira
1Australian Centre for Ancient DNA, The University of Adelaide, Adelaide, SA 5005, Australia
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for João C. Teixeira
Adam Rohrlach
4ARC Centre of Excellence for Mathematical and Statistical Frontiers, The University of Adelaide, Adelaide, South Australia 5005, Australia
5Department of Archaeogenetics, Max Planck Institute for the Science of Human History, Jena, Germany
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Adam Rohrlach
Jonathan Tuke
4ARC Centre of Excellence for Mathematical and Statistical Frontiers, The University of Adelaide, Adelaide, South Australia 5005, Australia
6School of Mathematical Sciences, The University of Adelaide, Adelaide, South Australia 5005, Australia
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Jonathan Tuke
Olivia Johnson
1Australian Centre for Ancient DNA, The University of Adelaide, Adelaide, SA 5005, Australia
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Graham Gower
1Australian Centre for Ancient DNA, The University of Adelaide, Adelaide, SA 5005, Australia
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Graham Gower
Chris Turney
7Chronos, University of New South Wales, Sydney, NSW 2052, Australia
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Chris Turney
Murray Cox
8Statistics and Bioinformatics Group, School of Fundamental Sciences, Massey University, Palmerston North 4410, New Zealand
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Murray Cox
Alan Cooper
9South Australian Museum, Adelaide, SA 5005, Australia
10BlueSky Genetics, PO 287, Ashton, SA 5137, Australia
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • For correspondence: yassine.souilmi@adelaide.edu.au raymond.tobler@adelaide.edu.au angad.johar@unimelb.edu.au alanjcooper42@gmail.com cdh5313@psu.edu
Christian D. Huber
1Australian Centre for Ancient DNA, The University of Adelaide, Adelaide, SA 5005, Australia
11Department of Biology, Penn State University, University Park, PA 16802, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Christian D. Huber
  • For correspondence: yassine.souilmi@adelaide.edu.au raymond.tobler@adelaide.edu.au angad.johar@unimelb.edu.au alanjcooper42@gmail.com cdh5313@psu.edu
  • Abstract
  • Full Text
  • Info/History
  • Metrics
  • Supplementary material
  • Data/Code
  • Preview PDF
Loading

Abstract

The role of natural selection in shaping biological diversity is an area of intense interest in modern biology. To date, studies of positive selection have primarily relied upon genomic datasets from contemporary populations, which are susceptible to confounding factors associated with complex and often unknown aspects of population history. In particular, admixture between diverged populations can distort or hide prior selection events in modern genomes, though this process is not explicitly accounted for in most selection studies despite its apparent ubiquity in humans and other species. Through analyses of ancient and modern human genomes, we show that previously reported Holocene-era admixture has masked more than 50 historic hard sweeps in modern European genomes. Our results imply that this canonical mode of selection has likely been underappreciated in the evolutionary history of humans and suggests that our current understanding of the tempo and mode of selection in natural populations may be quite inaccurate.

Introduction

The rapidly growing availability of genomic datasets provides a powerful resource to address a fundamental question in evolutionary biology, namely the role of natural selection in shaping biological diversity1. The proliferation of genomic datasets has been accompanied by parallel developments in statistical methods for uncovering genetic signals of positive selection; however, while the power and precision of these statistical methods has continued to improve, they can be confounded by complex aspects of population history that are not modelled or remain unknown2. In particular, previous genomic studies of positive selection typically do not account for past phases of inter-population mixing (i.e. admixture), which can alter genomic signatures of positive selection and either mask these signals or lead to erroneous inferences about the underlying modes of selection3, 4.

In the case of humans, a consistent finding has been the apparent paucity of classical ‘hard sweep’ signals in modern genomic datasets (i.e. where a novel beneficial mutation increases to 100% frequency in a population). This has resulted in suggestions that humans adapted to novel environmental and sociocultural pressures through alternate modes of selection – e.g. ‘soft sweeps’ (i.e. where the beneficial allele sits on multiple genetic backgrounds)5, 6 or polygenic selection (i.e. subtle frequency shifts across numerous loci with small fitness effects)7, 8 – though the evidence for these processes has also been challenged4, 9–11. The apparent ubiquity of admixture in human population history12–14 suggests instead that the absence of hard sweep signals in modern human genomes may be a consequence of the masking effects of historical admixture events. Moreover, growing evidence suggests that admixture events pervade the history of most natural populations, which may have led to a potentially biased view of the historical role of selection in nature4, 15.

Using ancient human genomes to uncover historical hard sweeps

The recent emergence of population-scale ancient human genomic datasets from Eurasia provides a novel means to search for any hard sweeps that had occurred in the ancestors of modern Europeans, by applying selection scans to genomic datasets that predate known admixture events. Modern Europeans are largely composed of three distinct ancestries – i.e. Western Hunter-Gatherers (WHG), Anatolian Early Farmers (Anatolian EF), and Steppe Pastoralists (see Text S1) – as a result of extensive admixture between these ancestral populations and their descendants that occurred from the Early Holocene to Bronze Age (i.e. ∼12-5 ka;16). Accordingly, we reprocessed 1,162 ancient western Eurasian genomes (mostly high-density SNP scans; see Methods) through a uniform bioinformatic pipeline – thereby mitigating potential processing artefacts – and grouped these samples into 18 distinct ancient populations that existed before and after documented Holocene admixture events based on their archaeological and genetic relationships (Figs. 1, S1, S2; Table S1). To ascertain the impact of Holocene admixture on modern European populations, we also analyzed three modern European populations from the 1000 genomes project (1kGP17; i.e. Utah residents with Northern and Western European ancestry (CEU); Finnish in Finland (FIN), Toscani in Italy (TSI)), and also included two other non-European 1kGP populations for further comparison (i.e. one East Asian: Han Chinese in Beijing (CHB) and one African: Yoruba in Ibadan (YRI)).

Fig. 1.
  • Download figure
  • Open in new tab
Fig. 1. Geographic and temporal distribution of 1,162 ancient Eurasian samples used in this study.

Each human symbol represents a sample and the colors indicate different populations classified into broad groupings according to archaeological records of material culture and lifestyle (colors indicated at the bottom right-hand side; see Text S1). The green lines depict the generalized migration route of Anatolian farmers (Anatolia_EF) into Europe ∼8.5ka, where they mixed with Western Hunter-Gatherers (WHG; EHG refers to the contemporaneous Eastern Hunter-Gatherers) to create the Early European Farmers (EF). Similarly, the pink arrows represent the generalized movement of Steppe pastoralists (Steppe; samples east of the Ural Mountains not shown) which resulted in admixture with Late European Farmers (LF) ∼5ka, giving rise to Late Neolithic and Bronze Age (LNBA) societies.

We used SweepFinder2 (SF2)18 to scan the genomes of each ancient and modern population for regions exhibiting distorted allele frequency patterns characteristic of a fixed hard sweep (i.e. an excess of low- and high-frequency alleles19; also see Text S2). By contrasting each window against site frequency spectrum (SFS) patterns from the overall genomic background, SF2 tests explicitly for the signature of hard selective sweeps while controlling for demographic history and population structure, which often cause false positives in such tests2, 18. A set of candidate sweep regions was determined by first assigning SF2 scores to a set of ∼19,000 annotated human genes, and subsequently aggregating neighboring outlier genes into single sweep regions (Fig. S3; see Methods). Importantly, testing our sweep detection pipeline on simulated genomic datasets modelled upon Eurasian demographic history confirmed that it is robust to the impacts of substantial demographic bottlenecks – including strong bottlenecks associated with the founding Eurasian and subsequent WHG populations (successive 12-fold and 6-fold population size reductions, respectively; see Fig. 2A) – and also to missing data, ascertainment bias, and alignment error (Fig. 2B and Figs. S4-S7, SI Methods).

Fig. 2.
  • Download figure
  • Open in new tab
Fig. 2. Assessing the robustness of the hard sweep detection pipeline.

(A) Schematic of the West Eurasian population history model used to explore the statistical properties of our analytical pipeline and the impact of historical bottlenecks and admixture on the false positive rate. Each vertical segment denotes a major population branch (effective population sizes shown in gold text), with grey horizontal arrows denoting separation and admixture events (times shown on the right-hand side of the figure, assuming that admixture occurred 500 years after the onset of the migrations shown in Fig. 1; with percentages indicating the proportion of ancestry contributed by the incoming admixture branch). Model parameters are taken from one of three studies, as denoted by the associated superscript (1 = ref. 84; 2 = ref. 33; 3 = ref. 34). (B) Estimated false positive rate (FPR) measured at six different simulated populations sampled before (Anatolia_EF, Steppe, WHG) and after major admixture events (EF, LNBA, and Modern Europeans) using three different q value thresholds (see Fig. S19 for further information on sample size and sampling time). Results are shown for 30 simulated genomes, dots indicate mean values and horizontal lines represent quartile values. Candidate sweep detection was based on the most stringent threshold (q < 0.01; red densities) in the current study, whereby the maximum mean FPR observed amongst the simulated populations at this threshold (i.e. ∼11%) was used as a conservative estimate for the study-wide FPR.

Hard sweeps were common in Paleolithic West Eurasian populations

In direct contrast to previous studies of modern human genomes20 we were able to identify large numbers of hard sweeps in the ancient West Eurasian populations (Figs. S8, S9, Table S2, Supplementary Data 1-57), identifying 57 with high confidence (estimated study-wide false positive rate of <11% under a realistic simulated Eurasian demography; Fig. 2; see Methods, SI Methods). None of these sweep signals are evident in the African Yoruba population, and ∼90% of the sweeps show significantly inflated levels of genetic divergence relative to African populations using Fst-based tests (outFLANK method21; see Methods and Fig. S10), consistent with the underlying selection pressures postdating the separation of the founding Eurasian group from African populations.

Although these 57 putatively selected loci likely swept to high frequencies in West Eurasian populations by the Early Holocene to Bronze Age era (i.e. ∼12-5ka), only two sweeps (1:35.4-36.5 and 6:29.5-32.8; Fig. S8 and S9, Table S2) are still identified as SF2 outliers in selection scans of modern European populations, with these regions having been reported in previous selection studies8. This dramatic reduction in hard sweep signals was not an artefact of differences in either data quality or quantity between modern and ancient populations (Figs. S11-S14; SI Methods). Nor could their absence be explained by the degradation of the sweep signals through random allele frequency changes (i.e. genetic drift) and new mutations, as hard sweep signals in Eurasian human genomes are expected to remain visible to our analytical pipeline for around 70 thousand years in the absence of other effects (Fig. S15; see SI Methods and also refs. 18, 22). Rather, the loss of the hard sweep signals is consistent with the reduction in sweep haplotype frequency below detection limits caused by the introduction of other orthologous haplotypes during the Holocene admixture phases. For the analytical pipeline used in this study, power drops below 50% if the sweep haplotype does not occur in at least 85% of the population at the time of sampling, and becomes effectively undetectable if less than half the population has the sweep haplotype (Fig. S15 and SI Methods, also see ref. 23).

A historical hard sweep in the MHC-III region is now masked in modern Europeans

Perhaps the most striking example of a previously unreported hard sweep in this study is provided by a ∼1.5Mb region overlapping the major histocompatibility complex class III (MHC-III) region on chromosome 6 that shows depleted genetic variation in the Anatolian Early Farmer (Anatolian EF) population (Fig. 3). The MHC-III region includes numerous genes that encode antigen proteins involved in the immune response and has previously been identified as a recurrent target of balancing selection24–26 and also as a potential source of selection artefacts due to the high levels of local genetic variation hindering read alignment across this region. However, nearly all (98.6%) of SNPs in the MHC-III sweep region lie in regions of high mappability, which is consistent with levels observed in the other 56 sweeps (minimum = 96.7%; Table S3), indicating that poor read alignment was not an issue for our sweeps in general.

Fig. 3.
  • Download figure
  • Open in new tab
Fig. 3. Hard selective sweep in MHC class-III region in Anatolian Early Farmers.

(A) Haploimage of the MHC-III region and associated SweepFinder2 CLR score for the Anatolian Early Farmers, Central European Early Farmers (CentralEurope_EF) and British Bronze Age (UK_LNBA) populations. Pseudohaploid calls are shown for all samples in each population, with major alleles in yellow, minor alleles in black, and missing data in white. Elevated CLR scores coincide with a region of depleted variation in the Anatolian EF population, which returns to background levels in the subsequent admixed populations. (B) The estimated nucleotide diversity across the MHC-III region of the Anatolian Early Farmers (black line) and expected diversity under a hard sweep model (green line, see Text S3) relative to the underlying recombination rate in cM/Mb (grey line on top). The two dashed red lines indicate two local recombination hotspots flanking the sweep region. The close correspondence between the expected and observed genetic diversity across this region is unlikely to be a bioinformatic artefact and points to the authenticity of the signal.

The authenticity of the MHC-III hard sweep is further supported by the distinctive trough of genetic diversity observed across the affected region in the Anatolian EF samples, which is located between two recombination hotspots (i.e. areas of locally inflated recombination rates) that also demarcate the sharp transition to background diversity levels either side of the sweep region (Fig. 3). This correlation between genetic diversity and local recombination rates is unlikely to arise from read misalignment artefacts, but can be closely approximated by a hard sweep model acting upon a centrally located beneficial variant that incorporates the local recombination rates (Fig. 3; see Text S3), pointing to historical positive selection as the more likely cause.

In contrast to the Anatolian EF population, the two other major contributors to modern European genetic ancestry (i.e. WHG and Steppe) have SF2 scores and patterns of local genomic variation that are indistinguishable from neutral background levels in the same ∼1.5Mb MHC-III region (Fig. S16). Holocene-era populations show patterns that are consistent with a progressive dilution of the sweep signal across this period – with a weak SF2 signal still evident in Early European farmers (which draw ancestry from both Anatolian EF and WHG), that returns to neutral background levels in subsequent Bronze Age and modern European populations following the introduction of additional Steppe ancestry (Figs. 3, S16). Taken together, the evidence strongly implies that the MHC-III region was a target of strong positive selection in the Anatolian EF population, and had likely gone unnoticed previously due to the masking effects of Holocene-era admixture.

Admixture can lead to misclassification of historical hard sweeps

The Anatolian EF-specific MHC-III hard sweep suggests that if Holocene admixture is responsible more generally for masking the 57 hard sweep signals in modern European genomes, then sweeps that were specific to only one of the admixing ancient Eurasian populations (e.g. where selection is driven by a population-specific, or local, pressure) should be particularly prone to SF2 signal dilution from Holocene admixture events. Conversely, sweeps occurring closer in time to the Out-of-Africa (OoA) migration are more likely to have signals that survive Holocene admixture events by virtue of being present at high frequencies in all admixing populations. To test if sweep signal presence was dependent on the sweep antiquity, we inferred the approximate onset of the underlying selection pressure by manually scanning five moderate-to high-coverage Upper Paleolithic individuals for evidence of the sweep haplotype – namely, Ust’Ishim ∼45ka27, Kostenki14 ∼37ka28, GoyetQ-116 ∼35ka28, Věstonice16 ∼30ka28, and El Mirón ∼18ka28 – assigning each sweep into one of five age classes based on the oldest sample in which the sweep haplotype is observed (Fig. 4; see Methods and Text S4). Concordant sweep age classifications were obtained using an alternate quantitative method (based on a set of diagnostic marker alleles; Fig. S17 and SI Methods), suggesting that our classifications should provide robust inferences for aggregated sets of sweeps from the same age class, even though individual sweep dates are likely to be less reliable.

Fig. 4.
  • Download figure
  • Open in new tab
Fig. 4. Older sweeps are more robust to population admixture.

(A) Schematic representation of the inferred temporal origins of the 57 sweeps, classified according to the first presence of the sweep haplotype amongst the five moderate to high coverage Upper Paleolithic specimens (italicized labels, blue arrows indicating the approximate sample age). After classifying sweeps by their putative onset time, we calculated the proportion of sweeps present in each tested population (panel B; dots indicate point estimates at q < 0.05, error bars show 95% binomial confidence intervals) and in two studies reporting partial sweeps in modern Europeans29, 30 (panel C; iHS test statistics from ref. 29 being limited to outliers reported in at least two European populations to provide a stringent classification) conditional on the estimated onset times. Sweeps starting within the last 35kyrs (i.e. not observed in GoyetQ-116 or older samples) tend to have patterns consistent with local selection (i.e. being highly frequent in some ancient populations but absent in others; see Text S4) and are less likely to be reported as partial sweeps nearing fixation (i.e. exhibit an XP-EHH signal in ref. 29, 30; see key in panel C). While the latter difference was not significant (one-sided Fisher’s Exact Test p ∼0.17), our results are consistent with sweeps arising after the diversification of the Eurasian founders being more susceptible to admixture distortion

Sweep presence across the 12 ancient Eurasian populations was broadly consistent with the inferred antiquity of the selection pressure (Fig. 4B). Sweeps inferred to have started by 35ka (i.e. observed in GoyetQ-116 or an older sample) are detected at consistent levels across ancient populations both before and after the Holocene admixture events (i.e. no significant differences in sweep detection rates, logistic regression p > 0.17 for all three sweep age categories; see Methods and Text S4). In contrast, sweep detection differed significantly across the ancient populations for more recently selected loci (logistic regression p < 0.002 for both age categories) and exhibit patterns consistent with historical selection acting on populations ancestral to a subset of the tested populations (Fig. 4B). Similarly, sweeps starting by 35kya were more likely to appear as partial sweeps in two prior studies of positive selection in modern European populations29, 30 (66% vs. 46% for sweeps dated less than 35kya; Fig 4C), and were also more frequently observed as outliers for a statistic (i.e. XP-EHH) that is sensitive to loci nearing fixation (25% vs. 8%; Fig 4C). These results demonstrate that admixture can sufficiently distort the genetic signals resulting from a hard sweep, leading to the misclassification of the inferred mode of selection in studies where it is not explicitly accounted for.

Evolutionary scenarios underlying sweep signal dilution

To further assess the plausibility of Holocene admixture events masking historical hard sweep signals in modern human genomes, we used population genetic simulations (SLiM331) to model selection (Figs. S18-20; see Methods and SI Methods) within a plausible West Eurasian demographic model informed by ancient DNA studies32–34 (Figs. 2A, S19). Beneficial mutations were introduced on the Main Eurasian branch before (55ka) and after (44ka and 36ka) the diversification of the founding Eurasian population (Fig. 5, Figs. S19 and S20), with selection tests performed on the three ancient source populations (i.e. Anatolia EF and WHG both sampled at 8.5ka, and Steppe at 5ka) and three admixed European populations descending from two separate Holocene admixture phases35 at 8ka (i.e. European Early Farmers sampled at 7ka) and 4.5ka (i.e. Bronze Age Europeans sampled at 4ka, and Modern Europeans at 0ka; see Text S1). All simulated datasets matched the sample sizes, SNP numbers, missing data, and ascertainment bias observed in empirical data from the relevant populations (see Methods).

Fig. 5.
  • Download figure
  • Open in new tab
Fig. 5. Investigating the influence of admixture on sweep detection in modern populations.

Sweep detection power was estimated for selected loci simulated using a realistic Eurasian demographic model (see Figs. 2A, S19). Sweeps were timed to start prior to the diversification of the Eurasian founding population (55ka), or following the separation of population branches that eventually gave rise to Steppe (44ka) or WHG populations (36ka). Power was measured at a FPR of 0.1% and is shown relative to the onset of selection in (A) the three European source populations (i.e. Anatolia EF, WHG, and Steppe) as well as three admixed populations (B, C) following the mixing of WHG and Anatolian EF at 8.5ka (i.e. European EF, sampled 7ka) and European EF and Steppe herder admixture at 4.5kya (i.e. LNBA and Modern Europeans, sampled 4ka and 0ka, respectively). Only beneficial mutations preceding the diversification of Eurasian lineages (55ka) are evident in all populations when the selection pressure did not persist following the admixture events, with sweeps starting also 44ka also being detectable in European EFs since they are shared by both source populations (B, solid lines). Notably, power increases appreciably for strongly selected loci (s ≥ 0.02) when selection is allowed to continue in the post-admixture phase (B, dashed lines) due to these loci refixing following the admixture event.

We first investigated a model where selection is active along all population branches that inherit the beneficial mutation up until the 8ka admixture event, at which point the selection pressure is relaxed (e.g. due to an environmental change or the emergence/introduction of a technological innovation coincident with the admixture event). The results clearly demonstrate that Holocene-era admixture can effectively mask historical sweep signals in the absence of any ongoing selection pressures: only the sweeps that precede the division of Eurasian lineages (i.e. selection starting at 55ka) can be detected with reasonable power both before and after the admixture events. In contrast, sweeps starting after the diversification of Eurasian lineages (i.e. 44ka and 36ka) had slightly increased power in unadmixed populations that experienced the selection pressure (consistent with having more recent fixation times and less signal loss from drift), whereas power was negligible in all admixed European populations (Figs. 5, S20). As expected, the detection rates were strongly positively correlated with selection strength, whereas sample size had a more moderate positive association.

Interestingly, sweep detection power remained high in admixed populations whenever sweeps predated the split between the source population lineages – despite one of the source populations (WHG) having power close to zero under all modelled scenarios (Figs. 5, S20). The low power observed for WHG is likely a consequence of their relatively small effective population size32 exacerbating drift and causing rapid degradation of any fixed sweep signals (expected signal loss in 0.2 x Ne generations18 = ∼11,000 years to loss in WHG assuming a generation time of 29 years36; cf. ∼70,000 years on the Main Eurasian branch). This result indicates that hard sweeps are reasonably robust to admixture-induced signal loss when the sweep signal has been eroded in one of the source populations due to drift, providing that the other source populations have retained the sweep signal.

After amending the model to allow the selection pressure to persist following the Holocene admixture events, we observe a dramatic increase in the detection rate of sweeps postdating the subdivision of Eurasian populations (i.e. selection onset at 44ka and 36ka), with power being particularly high for strongly selected loci (s = 10%) across all admixed European populations (power between 65-85%; Figs. 5, S20). Notably, Modern Europeans achieve detection power exceeding that observed for any of the three ancestral source populations (Anatolia_EF, Steppe and WHG) in nearly all cases. This result implies that many of the historical sweep signals should have been present in modern European populations had the underlying selection pressure persisted after the Holocene admixture phase, suggesting an attenuation of the underlying selection pressure(s), or the influence of other factors that were not included in the simulations, following the European Bronze Age (see Discussion).

The mutational basis of the Eurasian hard sweeps

Under canonical models of hard sweeps, beneficial mutations only emerge following the environmental shift marking the onset of the selection pressure, though recent theoretical and simulation-based work has shown that hard sweep signals are possible when selection acts instead on mutations segregating at the time of the environmental shift (i.e. standing genetic variation; SGVs)37–40. Our results suggest that the underlying selection pressure(s) likely arose during the early stages of the AMH occupation of Eurasia – e.g. 16 [28%] and 44 [77%] of the 57 sweep haplotypes being observed in archaic samples dated at ∼45kya and ∼35kya, respectively (Fig. 4) – constraining the time in which de novo mutations could have arisen and suggesting that SGVs may have provided the mutational basis for many of our sweeps.

To examine this question in more detail, we calculated the probability of a sweep arising from SGV relative to a de novo mutation, conditional on the fixation of the beneficial allele within a particular time interval (see Text S5). Using equations from ref. 41 and assuming standard human generation times36, mutation rates42, and parameters from our demographic model (Table S4), we observe that fixation from SGVs is highly probable for selection strengths that match the majority of our empirical estimates (s ∼ 1%; see Table S2) when the temporal window in which fixation must occur is reasonably constrained (≤ 20kyrs), whereas de novo variants dominate when purifying selection on SGVs was strong (s ≥ 1%) in the period preceding the environmental shift (Figs. 6, S21). Extending our calculations to include the probability that selection from SGVs produces a hard sweep signal41 revealed that hard sweeps are highly probable (> 50%) regardless of the mutational input source, if SGVs had been deleterious in the period preceding the environmental change (Figs. S6, S21; Text S5). While further distinguishing between the two modes of selection is hampered by the absence of prior information on several key parameters (e.g. the strength of purifying selection on SGVs, the type and strength of epistasis, distance to the new fitness optimum), our results imply that mutations underlying the hard sweeps were probably initially deleterious and reinforce previous findings showing that selection from both de novo mutations and SGVs have occurred in Eurasian evolutionary history39.

Fig. 6.
  • Download figure
  • Open in new tab
Fig. 6. Mutational origins of the Eurasian hard sweeps.

The probability that a sweep was caused by a standing genetic variant (SGV) relative to a de novo mutation conditional on a sweep of either type occurring within a fixed time interval (left panel), following equations in ref. 41. SGVs are assumed to have been under some degree of purifying selection (denoted by different coloured symbols) prior to the environmental shift that initiates the beneficial selection phase (symbol shapes). Fixation from SGVs was highly likely (>75%) for constrained time intervals (≤20kyrs) when the beneficial selection strength was moderate (s ∼ 0.01) – both being plausible values for Eurasian sweeps – provided that purifying selection prior to the environmental shift was weak (s ≤ 0.001). Factoring in the probability that fixed SGVs all descend from a single copy at the time of environmental shift results in consistently high probabilities (≥50%) that selection resulted in a hard sweep pattern regardless of the mutational source, provided that SGVs had previously been deleterious (right panel; assuming that all mutational targets arose in a single contiguous locus – noting that these probabilities will likely increase as mutational targets are distributed across more independent loci; see Text S5). Notably, the fixation of variants that had previously been strongly deleterious required large mutational target sizes (>1000 possible mutations) to ensure fixation within a 40,000 year interval when beneficial selection strength was ∼ 0.01, which may be implausibly large for many traits.

Discussion

Our analyses of >1,000 ancient West Eurasian genomes has uncovered strong evidence for 57 hard sweeps in Early- to Middle-Holocene populations that have been almost entirely erased from descendent populations in modern Eurasia. Most of these selected loci had likely swept to high frequencies well before the Holocene era, which is supported by the strong selection strengths that were inferred for the sweeps (41 sweeps having s > 1%, with the largest value of s nearing 10%; Table S2; see Methods, Text S6). These selection coefficients are comparable with the strongest currently known for human populations (i.e. LCT locus having s between ∼2-6% in the recent history of European populations 43, 44), and suggest that such strong positive selection events have been much more common in recent human history than previously recognised. Our empirical and simulation results implicate Holocene-era admixture as the primary factor attenuating these historical sweep signals, which has led to these signals being missed or misinterpreted as other modes of selection in previous studies.

An intriguing implication arising from the simulations is that the selection pressure(s) underlying the sweeps may have eased during the Holocene period in some cases. This period marked the introduction of new technologies, diets, a stable warm climate, and living conditions that introduced novel selection pressures (i.e. selection on the LCT gene to reduce the costs associated with milk consumption in adulthood44) and may have also reduced the intensity of historical selection pressures underlying the 57 hard sweeps. Alternatively, these patterns could have resulted from the introduction of novel ancestry sources into Europe following the Bronze Age period45–47, which would have further diluted sweep signals and left insufficient time for their reappearance even if the selection pressure was still present. Indeed, this may explain why modern Tuscans and Finns had the fewest sweeps of any population in this study, with these populations descending from post-Bronze Age mixing events that involved African 45, 46 and Siberian47 groups, respectively (Fig. 4B).

In addition to masking historical sweeps in human populations, the obscuring effect of admixture might explain why species-wide selective sweep signals are rare in many species7, 48 while being abundant in others49–51. Crucially, recently admixed populations often lack detectable signs of structure, in which case admixture will not be correctly accounted for in any subsequent selection tests. For example, modern European populations have been considered sufficiently genetically homogeneous for the purpose of selection scans29, despite recent ancient DNA studies revealing that Europeans have multiple diverged ancestry components52. Indeed, modern genomic data is often insufficient to establish past admixture events, with widely-used PCA and ancestry decomposition methods being unable to detect historical admixture signals when suitable proxy populations for the admixing groups are lacking. Similarly, while admixture creates temporal variation in the coalescence rates of homologous genomic regions, these patterns can be equally well explained by historical population size changes in a single continuous population53. Accordingly, species with little apparent population structure may actually be more prone to the confounding effects of recent admixture than those exhibiting clear structure, and this may partly explain why species with distinctive population structuring often show strong local selection signals (e.g. Swedish A. thaliana3, African D. melanogaster54, M. lychnidis-dioicae55), whereas genetically homogeneous taxa or populations tend to lack fixed hard sweeps but harbour abundant partial or soft sweep signals (e.g. European humans6, 29, North American D. melanogaster56, M. silenes-dioicae55).

In accordance with previous work4, 57, our results emphasize the importance of incorporating historical population structure and admixture events into the null models of selection tests. If this information is not available – e.g. because DNA from ancestral source populations is lacking, the default scenario for most species – then the interpretation of historical selection signatures could be highly misleading and heavily weighted against the detection of historical hard sweep events. These factors imply that the extent of past hard sweep events have likely been underestimated in natural populations in general, biasing our understanding of the mode and tempo of adaptation in humans and other species.

Methods

Population designation

All ancient individuals used in this study were assigned to historical populations based on published analyses of their genetic relationships in combination with details regarding their archaeological context, with that temporal and spatial variability between individuals being minimised where possible (Figs. S1 and S2, Table S1). Our sample grouping resulted in 18 ancient populations occurring before and after the major Holocene admixture events that created the genetic landscape of modern Europe16, 52 (see Text S1). Additional testing using alternate sets of samples for two ancient populations showed that sweep signal detection was reasonably robust to changes in population sample configuration, with the close phylogenetic relationships across the samples providing a buffer to (see SI Methods; Figs. S13 and S14).

Data collection and processing

To produce a robust dataset and avoid potential bioinformatic batch effects, the raw sequence read data for 1,162 ancient genomic datasets (Table S1) was retrieved from the European Nucleotide Archive and processed through the following standardized pipeline. To minimize the risk of modern contamination, the forward and reverse reads of the paired-end reads were merged (collapsed) using fastp58, and only merged reads were retained (modern data is more likely to comprise large DNA fragments that do not collapse). All collapsed reads were filtered for potential residual adaptor sequences and chimaeras using Poly-X with fastp58. The retained filtered set of sequence reads were aligned to the human reference genome (h37d) using the Burrows-Wheeler Aligner v0.7.1559. All mapped reads were sorted using SAMtools v1.360 and then realigned around insertions and deletions and potential PCR duplicate reads were marked and removed using the Genome Analysis ToolKit (GATK) v3.561.

Prior to variant calling, all remaining aligned reads were screened and base-calls recalibrated for aDNA postmortem damage using mapDamage262. To further limit the impact of postmortem damage on variant calling63, bamUtil64 was used to trim 3 base-pairs from each of the 5’ and 3’ ends of each mapped read. From the resulting set of reads, pseudo-haploid variants were called at the set 1240k capture SNPs65 found on the 22 autosomes, using a combination of SAMtools mpileup66 and sequenceTools (https://github.com/stschiff/sequenceTools). Pseudo-haploidization of read data is a standard strategy in aDNA analyses, whereby a single read is randomly sampled at each prespecified SNP position65 in order to mitigate potential biases introduced by differences in coverage or post-mortem damage between samples16. The 1240k capture was developed to minimize ascertainment in non-African populations and was used to generate data for most samples used in the study, whereby concentrating on the 1240k variants ensured a common and robust set of variants for the subsequent analyses. The pseudo-haploid variant calls were converted from EIGENSOFT format67, 68 to binary Plink format using EIGENSOFT. Plink v1.969, 70 was used to assign samples to the predefined populations (Table S1) and convert the variants to reference polarized VCF files, with correct polarization being checked using BCFtools66. Finally, a custom Python script was used to generate the site frequency spectrum (SFS) input files for SF2 analysis (https://gist.github.com/yassineS/fe2712ad52d76460b927e3f391ea51f6).

Sweep scans

The SweepFinder2 composite likelihood ratio (CLR) statistic18, 71 was computed across successive 1kb intervals across all autosomes for each ancient and modern human population. The CLR statistic evaluates the evidence for hard selective sweeps in dynamically sized windows, by calculating the expected SFS under a hard selective sweep conditional on the neutral SFS, assuming a certain selection coefficient and local recombination rate. The neutral SFS is based on the ‘background’ SFS calculated from the whole genome assuming that the influence of positive selection on the genome-wide SFS is negligible.

SF2 controls for genome-wide effects such as ascertainment bias and demography by allowing these processes to affect the expected SFS under neutrality (i.e., the background SFS)72. Further, unlike many other selection methods, the assumptions on the input data for SF2 are suitable for the low coverage and ascertained nature of ancient DNA datasets. Since it is only based on the spatial (genomic) pattern of allele frequencies but not on haplotype homozygosity or population differentiation, it is possible to detect selection without reference to a second population, calling genotypes, or phasing haplotypes. By leveraging an empirical null model (i.e. the background SFS) and a model-based alternative hypothesis, SF2 is both more powerful and more robust than alternate test statistics also based on deviations of the SFS from expectations under the standard neutral model (e.g., Tajima’s D, Fay and Wu’s H)18, 71.

Note that SF2 also has an option to detect sweeps based on local genomic reductions in diversity. However, we did not calculate this diversity-based metric since the accurate and unbiased estimation of diversity requires fully sequenced genomes, whereas our dataset consists of an ascertained set of SNPs.

Outlier gene detection

Human gene annotations were obtained from the ENSEMBL database73 (genome reference version GRCh37), which was accessed using the R biomaRt package74, 75 (version 2.36.1). Of the 24,554 annotated ‘genes’ on the biomaRt database, we removed any that were not annotated in the NCBI database (ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz) and also excluded those that lacked specific protein and RNA based annotations (in the biomaRt transcript_biotype field). This resulted in a list of 19,603 genes, from which we removed 26 that did not contain any polymorphic sites in our datasets (all being situated in the most terminal areas of chromosomes), leaving 19,577 genes that were used in the subsequent analyses (Table S2).

To obtain p values for each gene, we transformed the raw SweepFinder2 CLR scores to Z scores using the following series of steps (Fig. S3). For each population, all SweepFinder2 CLR scores were logarithmically transformed and assigned to each of 19,577 genes by binning the transformed scores within the genomic boundaries of each gene. The gene boundaries were extended by 50kb on either side in order to also capture cis-regulatory regions. Because this typically resulted in several scores per gene, we took the maximum score to represent the evidence for a sweep involving that gene. Each gene score was corrected for gene length using a non-parametric standardization algorithm76, 77, resulting in the gene scores having an approximately standard normal distribution. Finally, p values were calculated for all genes and a q value correction74 applied for each population. The q value is a Bayesian posterior estimate of the p value that accounts for the expected inflation of false positives due to multiple testing74, whereby a q value of 0.01 implies a false discovery rate of 1% per population in this study.

Candidate sweep classification

Sweeps were identified by determining a set of outlier genes across all populations, which were classified into sweep regions according to 1) the distance between the mid-point of neighbouring pairs of outlier genes (i.e. inter-gene distance) and 2) overlapping sweep regions between populations. Step 1 was performed independently for each population, whereby all outlier genes with midpoints that were less than a specific distance apart from the midpoint of a neighbouring outlier gene were collapsed into a single category. After generating the collapsed categories for each population, step 2 was applied to ensure that the sweep categories sharing at least one gene across different populations were considered as a single historical sweep.

We ran our sweep quantification pipeline at three q value thresholds (i.e. q < 0.01, 0.05 or 0.10; which imply false discovery rates of 1%, 5% and 10% per population, respectively) and the three different inter-gene distances (midpoint distances less than 250kb, 500kb or 1Mb). As expected, changing the q value had a large impact on the number of sweeps (ranging from ∼50 for q < 0.01 to ∼500 for q < 0.1), whereas changing the inter-gene midpoint distance had comparatively little impact overall, particularly at more stringent q-value cutoffs (Fig. S8). Based on these results, we decided to use the most stringent q-value cutoff and the most liberal inter-gene distance to define a conservative set of candidate sweeps for all further analysis. However, because this stringent cutoff might lead to the removal of potentially causal genes in a sweep (which could have values slightly lower than 0.01), we first defined our sweeps based on the more permissive q < 0.1 threshold, and then removed all sweeps that did not have at least one gene with q < 0.1. To further improve sweep determination, we removed populations with small sample sizes (see ‘Impact of small sample sizes on sweep detection’ below) from the sweep determination process, as previous analyses of modern genomes suggest that SF2 has little power to detect sweeps when the number of haploid genome copies being analyzed is 10 or less23. Finally, to ensure that the sweeps being defined were all relevant to western Eurasian history, the two modern populations from East Asia (CHB) and Africa (YRI) were also excluded from the sweep classification process. This strategy resulted in a total of 57 candidate sweeps that were used in subsequent analyses.

Impact of small sample sizes and missing data on sweep detection

Previous results suggest that SF2 maintains high power to detect true sweeps when the number of haplotypes being analyzed is 10 or more for modern genomes23. To examine the impact of sample size on SF2 estimation on our ancient populations, we derived a measure of sample size that incorporates both pseudo-haploidy and different levels of data missingness in our ancient samples, called the ‘effective’ sample size, neff. We calculated neff for each population as kn(1 - M), where k is the ploidy of each sample, n is the number of samples and M is the average proportion of missing sites at informative SNPs in that population. Consistent with results from modern data23, we found that the number of detected sweeps tended to be systematically lower for populations with neff values < 10, regardless of the q value threshold used (Figs. S8, S12).

When inspecting the distribution of p values for gene scores in each population, we observed large deviations from the idealized distribution as neff values decreased below 10 (Figs. S4, S5). These results indicate that populations with neff < 10 lack sufficient power to detect sweep signals, whereby only the 12 ancient populations with neff ≥ 10 were used to determine the candidate sweeps.

Selection strength inference

We inferred the selection strength, s, for each of our 57 candidate sweeps as s = r ln(2Ne)/α (from ref. 78), where Ne is the effective population size, r is the recombination rate, and α is a composite selection parameter that is estimated for each sweep region by SF218 (see Table S2). For each candidate sweep, we took α at the position of the largest SF2 CLR value from all ancient Eurasian populations. Recombination rates were estimated for each sweep using information from ref. 79, and Ne was taken as 10,000 (noting that the estimation of s is robust to changes in Ne due to the logarithmic transformation). Notably, estimating s based on SF2 α estimates tended to result in slightly lower values than expected (see SI Methods and Fig. S18), suggesting that we have likely systematically underestimated the strength of selection for the 57 observed sweeps.

Estimating the earliest evidence for selection

To estimate when the selection pressure(s) underlying each of the 57 sweeps may have first arisen, we manually inspected five Upper Paleolithic Eurasian human samples with moderate to high coverage genomes: Ust’-Ishim80, Kostenki 1481, GoyetQ-11628, Věstonice28, El Mirón28 for the evidence of the sweep haplotype. The oldest sample with evidence for the sweep is taken as the origin of the selection pressure, and therefore provides a coarse lower bound on the onset of selection pressure. The sweep was called as present only if the full set of alleles observed in the sweep region were observed in the sample, noting that genotype calls were only possible for Ust’-Ishim (coverage > 40x), with pseudo-haploid calls being used for the four other samples.

Because some sweeps contained complex signals with multiple peaks – which may represent the combination of two or more neighbouring selection events into a single sweep region (see Table S2 for a list of sweeps, and Supplementary Data Figures S1-S57 to show the distribution of SF2 CLR scores across all populations) – we concentrated on alleles underlying the strongest signal in each sweep. The plots used to discriminate the sweep haplotypes for all 57 candidate sweeps are provided in Supplementary Data Figures S58-S114 (https://adelaide.figshare.com/s/f00443ef535a8af2a61d). To test the robustness of these qualitative assessments, we compared our qualitative classifications with those obtained from a quantitative method. Details of this method and the comparison with qualitative results are provided in SI Methods.

Testing sweep detection rate relative to inferred onset of selection

We predicted that hard sweep signals would be more prone to admixture-mediated loss if the onset of the selection pressure postdated the initial diversification of the Eurasian founding populations, as these sweeps are more likely to be ‘local’ to a subset of the source populations contributing to modern European ancestry. Accordingly, we reasoned that sweeps of deep antiquity should exhibit consistent detection frequencies across all ancient populations used in this study, whereas more recent sweeps should show inter-population variation that reflect their restriction to a subset of historical source populations. To examine this hypothesis, we quantified the proportion of sweeps present in each ancient population conditional on the inferred age of the sweep (see Methods: Estimating the earliest evidence for selection), then used the glm function in the R statistical language to test if proportions differed significantly across ancient populations grouped into five broad categories (HG, EF, Steppe, LF, LNBA; see Fig. 4). Specifically, for each sweep age class, we fit a logistic regression model where the presence of the sweep was a binary dependent variable and population group was the sole independent variable, and compared the fit of this model to the null model (dependent population variable removed, i.e. all groups have the same proportion of sweeps) to compute p values. Note that modern European populations were not included in these tests as our analyses indicate that these populations experienced further signal dilution following the Bronze age (see Text S1). Our results were unchanged after repeating all logistic regression analyses with the Southern Caucasus LNBA population removed from the LNBA group (since this population shows different ancestry composition to other European LNBA populations; see Supplementary Methods).

Fst-based selection tests

To further investigate the validity of the 57 candidate sweeps, we tested if our sweeps were enriched with highly divergent SNPs amongst the 12 ancient Eurasian populations with sufficient power to reliably detect sweeps (see Methods: Impact of small sample sizes and missing data on sweep detection) with a modern African population (i.e. YRI). We calculated Fst for each of the ∼1.1M ascertained SNPs using the standard Weir-Cockerham estimator82 and used OutFLANK21 to estimate the probability that each SNP was more divergent than expected under neutrality (based on estimating fitting a χ2 distribution to Fst values from putatively neutral SNPs). Importantly, OutFLANK is robust to non-equilibrium demographic models21, including rapid range expansions that are thought to have resulted in highly divergent alleles observed in modern human populations (i.e. through allele surfing83). Applying the q value correction to the p values to control for multiple testing resulted in 29 of the 57 candidate sweeps having one or more SNPs with q < 0.05, and 55 candidate sweeps having at least one SNP with q < 0.20 (Fig. S10). We then tested if the SNPs found in each of the 57 candidate sweeps had significantly higher Fst values than background genome levels. 49 of the 57 sweeps had significantly elevated Fst values relative to the remaining background genome (p < 0.05; Wilcoxon Rank Sum Test; Fig. S10), confirming that our sweeps were most likely selected following the divergence of ancestral Eurasians from African populations and were unlikely to be an artefact caused by non-equilibrium demographic processes such as allele surfing.

Forward simulations

To assess the impact of Holocene admixture on historical hard sweep signals in modern human genomes, we used forward simulations (SLiM331) to model moderate to strong selection (selection coefficient, s, between 1% and 10%) within a plausible West Eurasian demographic model (Fig. 2A and S19). Model parameters were largely taken from a recent study of Eurasian demographic history84, with further parameters for the Steppe population coming from two additional studies incorporating ancient Steppe samples33, 34. Two Holocene-era admixture events are included, namely a 50% contribution of WHG ancestry into the Main Eurasian branch at 8ka, and a further 33% contribution of Steppe (i.e. Yamnaya33) ancestry onto the same branch at 4.5ka. These admixture events generated the typical modern European individual that derives one-third of their ancestry from each of the Anatolian EF (modelled as descending from the Main Eurasian branch), WHG, and Steppe populations16.

Beneficial mutations were introduced on the Main Eurasian branch at three different times: 55 ka, 44 ka, and 36 ka (Fig. S19). We investigated a model where the strength of selection (i.e. s either 1%, 2% or 10%) was constant until the present, and another model where selection ceased following the 8ka admixture event (i.e. s = 0 after 8ka). Genomic regions of 5 Mb were simulated assuming mutation rates and recombination rates of 1x10-8/generation/bp, with the beneficial variant being introduced as a single mutation in the middle of the simulated region.

Diploid samples were generated for one modern population, sampled at the present, and five ancient populations: three populations that are ancestral to modern Europeans (i.e. Anatolia EF and WHG, sampled 8.5ka, and Steppe, sampled 5ka), and two admixed populations (i.e. Central Europe EF and Central Europe LNBA, sampled at 7ka and 4ka, respectively). Sample sizes matched those in empirical data, with 100 diploid samples taken for the modern Europeans. For each ancient population, we simulated missing data by randomly removing loci based on the site-specific missing data distribution of the relevant empirical samples, thereby replicating the population-specific data missingness patterns. We also reproduced the pseudo-haploidy of the ancient samples by randomly selecting one allele at each site. Additionally, because our analyses were limited to a set of ∼1.1 million variants from the 1240k capture probes (see Methods: Data collection and processing), which were selected based on heterozygosity in a set of sequenced reference individuals from both African and non-African populations85, we also sought to reproduce the ascertainment bias associated with these variants. To do this, we sampled two modern African samples and only retained sites that were heterozygous in either of the two African samples, or at least one of two randomly chosen modern European samples, with all other sites being discarded.

Finally, in all simulations, the number of sites was downsampled to 2000 positions in accordance with the average number of variants from 1240k capture probes found in a 5 Mb region. After removing all non-polymorphic sites, the SF2 statistic was computed at 5000 positions evenly distributed over the 5Mb region. To compute the neutral background SFS necessary for the SF2 analyses, we simulated 1000 neutral replicates of the demographic model and extracted the SFS for each population after replicating all steps in our simulation pipeline. Two hundred replicates were generated for each selection scenario and used for power analyses and to estimate the false positive rate (FPR). The final set of selection simulations are conditioned on sweeps that have escaped the initial stochastic phase (where rare beneficial alleles are lost through drift) by omitting all simulations where the beneficial mutation never reached a frequency of at least 10%.

More details on the statistical modelling procedure and full specification of the power analyses and FPR estimation are provided in the SI Methods.

Author Contributions

A.C., Y.S., R.T. and C.D.H. conceived the study, Y.S., R.T., M.W. assembled the dataset, Y.S., R.T., A.J., C.D.H., G.G., M.C., A.R., and O.J. performed analyses, A.C. and C.D.H supervised analyses, Y.S., R.T., A.J., C.D.H., S.G., C.T., A.R., O.J., J.S., J.T., and A.C. interpreted the results, A.C., Y.S., R.T., A.J., and C.D.H. wrote the paper with input from all co-authors.

Competing interests

The authors declare no competing financial interests.

Additional information

Supplementary information is available for this paper.

  • - Supplementary Methods

  • - Supplementary Text S1-S6

  • - Supplementary Tables S1-S4

  • - Supplementary Figures S1-S21

  • - Supplementary Data Figures 1-114

Correspondence and requests for materials should be addressed to Y.S., R.T., A.J., or C.D.H.

Footnotes

  • This version has been updated with more validation tests, statistical tests, and simulations.

  • https://adelaide.figshare.com/s/f00443ef535a8af2a61d

  • https://gist.github.com/yassineS/fe2712ad52d76460b927e3f391ea51f6

References

  1. 1.↵
    Jensen, J. D. et al. The importance of the neutral theory in 1968 and 50 years on: a response to Kern & Hahn 2018. Evolution 112, 2109 (2018).
    OpenUrl
  2. 2.↵
    Nielsen, R., Hellmann, I., Hubisz, M., Bustamante, C. & Clark, A. G. Recent and ongoing selection in the human genome. Nat. Rev. Genet. 8, 857–868 (2007).
    OpenUrlCrossRefPubMed
  3. 3.↵
    Huber, C. D., Nordborg, M., Hermisson, J. & Hellmann, I. Keeping it local: evidence for positive selection in Swedish Arabidopsis thaliana. Mol. Biol. Evol. 31, 3026–3039 (2014).
    OpenUrlCrossRefPubMed
  4. 4.↵
    Harris, R. B., Sackman, A. & Jensen, J. D. On the unfounded enthusiasm for soft selective sweeps II: Examining recent evidence from humans, flies, and viruses. PLoS Genet. 14, e1007859 (2018).
    OpenUrlCrossRef
  5. 5.↵
    Harris, A. M., Garud, N. R. & DeGiorgio, M. Detection and Classification of Hard and Soft Sweeps from Unphased Genotypes by Multilocus Genotype Identity. Genetics 210, 1429–1452 (2018).
    OpenUrlAbstract/FREE Full Text
  6. 6.↵
    Schrider, D. R. & Kern, A. D. Soft Sweeps Are the Dominant Mode of Adaptation in the Human Genome. Mol. Biol. Evol. 34, 1863–1877 (2017).
    OpenUrlCrossRefPubMed
  7. 7.↵
    Hernandez, R. D. et al. Classic selective sweeps were rare in recent human evolution. Science 331, 920–924 (2011).
    OpenUrlAbstract/FREE Full Text
  8. 8.↵
    Pritchard, J. K., Pickrell, J. K. & Coop, G. The genetics of human adaptation: hard sweeps, soft sweeps, and polygenic adaptation. Curr. Biol. 20, R208–15 (2010).
    OpenUrlCrossRefPubMedWeb of Science
  9. 9.↵
    Sohail, M. et al. Polygenic adaptation on height is overestimated due to uncorrected stratification in genome-wide association studies. Elife 8, (2019).
  10. 10.
    Jensen, J. D. On the unfounded enthusiasm for soft selective sweeps. Nat. Commun. 5, 5281 (2014).
    OpenUrlCrossRefPubMed
  11. 11.↵
    Berg, J. J. et al. Reduced signal for polygenic adaptation of height in UK Biobank. Elife 8, (2019).
  12. 12.↵
    Harris, A. M. & DeGiorgio, M. Admixture and Ancestry Inference from Ancient and Modern Samples through Measures of Population Genetic Drift. Hum. Biol. 89, 21–46 (2017).
    OpenUrlCrossRef
  13. 13.
    Stoneking, M. & Krause, J. Learning about human population history from ancient and modern genomes. Nat. Rev. Genet. 12, 603–614 (2011).
    OpenUrlCrossRefPubMed
  14. 14.↵
    Hellenthal, G. et al. A genetic atlas of human admixture history. Science 343, 747–751 (2014).
    OpenUrlAbstract/FREE Full Text
  15. 15.↵
    Kelley, J. L., Madeoy, J., Calhoun, J. C., Swanson, W. & Akey, J. M. Genomic signatures of positive selection in humans and the limits of outlier approaches. Genome Res. 16, 980–989 (2006).
    OpenUrlAbstract/FREE Full Text
  16. 16.↵
    Haak, W. et al. Massive migration from the steppe was a source for Indo-European languages in Europe. Nature 522, 207–211 (2015).
    OpenUrlCrossRefPubMed
  17. 17.↵
    1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
    OpenUrl
  18. 18.↵
    Huber, C. D., DeGiorgio, M., Hellmann, I. & Nielsen, R. Detecting recent selective sweeps while controlling for mutation rate and background selection. Mol. Ecol. (2015) doi:10.1111/mec.13351.
    OpenUrlCrossRefPubMed
  19. 19.↵
    Stephan, W. Selective Sweeps. Genetics 211, 5–13 (2019).
    OpenUrlAbstract/FREE Full Text
  20. 20.↵
    Szpak, M., Xue, Y., Ayub, Q. & Tyler-Smith, C. How well do we understand the basis of classic selective sweeps in humans? FEBS Lett. (2019) doi:10.1002/1873-3468.13447.
    OpenUrlCrossRef
  21. 21.↵
    Whitlock, M. C. & Lotterhos, K. E. Reliable Detection of Loci Responsible for Local Adaptation: Inference of a Null Model through Trimming the Distribution of F(ST). Am. Nat. 186 Suppl 1, S24–36 (2015).
    OpenUrlCrossRefPubMed
  22. 22.↵
    Sabeti, P. C. Positive Natural Selection in the Human Lineage. Science vol. 312 1614–1620 (2006).
    OpenUrlAbstract/FREE Full Text
  23. 23.↵
    Ferrer-Admetlla, A., Liang, M., Korneliussen, T. & Nielsen, R. On detecting incomplete soft or hard selective sweeps using haplotype structure. Mol. Biol. Evol. 31, 1275–1291 (2014).
    OpenUrlCrossRefPubMedWeb of Science
  24. 24.↵
    Black, F. L. & Hedrick, P. W. Strong balancing selection at HLA loci: evidence from segregation in South Amerindian families. Proc. Natl. Acad. Sci. U. S. A. 94, 12452–12456 (1997).
    OpenUrlAbstract/FREE Full Text
  25. 25.
    Brandt, D. Y. C., César, J., Goudet, J. & Meyer, D. The Effect of Balancing Selection on Population Differentiation: A Study with HLA Genes. G3 8, 2805–2815 (2018).
    OpenUrlAbstract/FREE Full Text
  26. 26.↵
    Bitarello, B. D. et al. Signatures of Long-Term Balancing Selection in Human Genomes. Genome Biol. Evol. 10, 939–955 (2018).
    OpenUrlCrossRef
  27. 27.↵
    Fu, Q. et al. Genome sequence of a 45,000-year-old modern human from western Siberia. Nature 514, 445–449 (2014).
    OpenUrlCrossRefPubMedWeb of Science
  28. 28.↵
    Fu, Q. et al. The genetic history of Ice Age Europe. Nature 534, 200–205 (2016).
    OpenUrlCrossRefPubMed
  29. 29.↵
    Johnson, K. E. & Voight, B. F. Patterns of shared signatures of recent positive selection across human populations. Nat Ecol Evol 2, 713–720 (2018).
    OpenUrl
  30. 30.↵
    Pickrell, J. K. et al. Signals of recent positive selection in a worldwide sample of human populations. Genome Res. 19, 826–837 (2009).
    OpenUrlAbstract/FREE Full Text
  31. 31.↵
    Haller, B. C. & Messer, P. W. SLiM 3: Forward Genetic Simulations Beyond the Wright-Fisher Model. Mol. Biol. Evol. 36, 632–637 (2019).
    OpenUrlCrossRef
  32. 32.↵
    Kamm, J., Terhorst, J., Durbin, R. & Song, Y. S. Efficiently Inferring the Demographic History of Many Populations With Allele Count Data. J. Am. Stat. Assoc. 1–16 (2019) doi:10.1080/01621459.2019.1635482.
    OpenUrlCrossRefPubMed
  33. 33.↵
    Damgaard, P. de B. et al. 137 ancient human genomes from across the Eurasian steppes. Nature 557, 369–374 (2018).
    OpenUrlCrossRefPubMed
  34. 34.↵
    Jones, E. R. et al. Upper Palaeolithic genomes reveal deep roots of modern Eurasians. Nat. Commun. 6, 8912 (2015).
    OpenUrlCrossRefPubMed
  35. 35.↵
    Turney, C. S. M. & Brown, H. Catastrophic early Holocene sea level rise, human migration and the Neolithic transition in Europe. Quat. Sci. Rev. 26, 2036–2041 (2007).
    OpenUrl
  36. 36.↵
    Moorjani, P. et al. A genetic method for dating ancient genomes provides a direct estimate of human generation interval in the last 45,000 years. Proc. Natl. Acad. Sci. U. S. A. 113, 5652–5657 (2016).
    OpenUrlAbstract/FREE Full Text
  37. 37.↵
    Hermisson, J. & Pennings, P. S. Soft sweeps and beyond: understanding the patterns and probabilities of selection footprints under rapid adaptation. Methods Ecol. Evol. 8, 700–716 (2017).
    OpenUrlCrossRef
  38. 38.
    Höllinger, I., Pennings, P. S. & Hermisson, J. Polygenic adaptation: From sweeps to subtle frequency shifts. PLoS Genet. 15, e1008035 (2019).
  39. 39.↵
    Peter, B. M., Huerta-Sanchez, E. & Nielsen, R. Distinguishing between selective sweeps from standing variation and from a de novo mutation. PLoS Genet. 8, e1003011 (2012).
    OpenUrlCrossRefPubMed
  40. 40.↵
    Nakagome, S., Hudson, R. R. & Di Rienzo, A. Inferring the model and onset of natural selection under varying population size from the site frequency spectrum and haplotype structure. Proc. Biol. Sci. 286, 20182541 (2019).
    OpenUrl
  41. 41.↵
    Hermisson, J. & Pennings, P. S. Soft sweeps: molecular population genetics of adaptation from standing genetic variation. Genetics 169, 2335–2352 (2005).
    OpenUrlAbstract/FREE Full Text
  42. 42.↵
    Narasimhan, V. M. et al. Estimating the human mutation rate from autozygous segments reveals population differences in human mutational processes. Nat. Commun. 8, 303 (2017).
    OpenUrlCrossRef
  43. 43.↵
    Mathieson, I. Estimating time-varying selection coefficients from time series data of allele frequencies. bioRxiv (2020).
  44. 44.↵
    Burger, J. et al. Low Prevalence of Lactase Persistence in Bronze Age Europe Indicates Ongoing Strong Selection over the Last 3,000 Years. Curr. Biol. 30, 4307–4315.e13 (2020).
    OpenUrl
  45. 45.↵
    Antonio, M. L. et al. Ancient Rome: A genetic crossroads of Europe and the Mediterranean. Science vol. 366 708–714 (2019).
    OpenUrlAbstract/FREE Full Text
  46. 46.↵
    Arauna, L. R., Hellenthal, G. & Comas, D. Dissecting human North African gene-flow into its western coastal surroundings. Proc. Biol. Sci. 286, 20190471 (2019).
    OpenUrl
  47. 47.↵
    Lamnidis, T. C. et al. Ancient Fennoscandian genomes reveal origin and spread of Siberian ancestry in Europe. Nat. Commun. 9, 5018 (2018).
    OpenUrlCrossRefPubMed
  48. 48.↵
    Gossmann, T. I. et al. Genome wide analyses reveal little evidence for adaptive evolution in many plant species. Mol. Biol. Evol. 27, 1822–1832 (2010).
    OpenUrlCrossRefPubMedWeb of Science
  49. 49.↵
    Sattath, S., Elyashiv, E., Kolodny, O., Rinott, Y. & Sella, G. Pervasive Adaptive Protein Evolution Apparent in Diversity Patterns around Amino Acid Substitutions in Drosophila simulans. PLoS Genet. 7, e1001302 (2011).
    OpenUrlCrossRefPubMed
  50. 50.
    Long, Q. et al. Massive genomic variation and strong selection in Arabidopsis thaliana lines from Sweden. Nat. Genet. 45, 884–890 (2013).
    OpenUrlCrossRefPubMed
  51. 51.↵
    Bonhomme, M. et al. Genomic Signature of Selective Sweeps Illuminates Adaptation of Medicago truncatula to Root-Associated Microorganisms. Mol. Biol. Evol. 32, 2097–2110 (2015).
    OpenUrlCrossRefPubMed
  52. 52.↵
    Lazaridis, I. et al. Genomic insights into the origin of farming in the ancient Near East. Nature 536, 419–424 (2016).
    OpenUrlCrossRefPubMed
  53. 53.↵
    Mazet, O., Rodríguez, W., Grusea, S., Boitard, S. & Chikhi, L. On the importance of being structured: instantaneous coalescence rates and human evolution—lessons for ancestral population size inference? Heredity vol. 116 362–371 (2016).
    OpenUrlCrossRefPubMed
  54. 54.↵
    Mousset, S. et al. Evidence of a high rate of selective sweeps in African Drosophila melanogaster. Genetics 163, 599–609 (2003).
    OpenUrlAbstract/FREE Full Text
  55. 55.↵
    Badouin, H. et al. Widespread selective sweeps throughout the genome of model plant pathogenic fungi and identification of effector candidates. Mol. Ecol. 26, 2041–2062 (2017).
    OpenUrlCrossRef
  56. 56.↵
    Garud, N. R., Messer, P. W., Buzbas, E. O. & Petrov, D. A. Recent Selective Sweeps in North American Drosophila melanogaster Show Signatures of Soft Sweeps. PLoS Genet. 11, e1005004 (2015).
    OpenUrlCrossRefPubMed
  57. 57.↵
    Zheng, Y. & Wiehe, T. Adaptation in structured populations and fuzzy boundaries between hard and soft sweeps. PLoS Comput. Biol. 15, e1007426 (2019).
    OpenUrlCrossRef
  58. 58.↵
    Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).
    OpenUrlCrossRefPubMed
  59. 59.↵
    Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv [q-bio.GN] (2013).
  60. 60.↵
    Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
    OpenUrlCrossRefPubMedWeb of Science
  61. 61.↵
    McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
    OpenUrlAbstract/FREE Full Text
  62. 62.↵
    Jónsson, H., Ginolhac, A., Schubert, M., Johnson, P. L. F. & Orlando, L. mapDamage2.0: fast approximate Bayesian estimates of ancient DNA damage parameters. Bioinformatics 29, 1682–1684 (2013).
    OpenUrlCrossRefPubMedWeb of Science
  63. 63.↵
    1. Shapiro, B. &
    2. Hofreiter, M.
    Kircher, M. Analysis of High-Throughput Ancient DNA Sequencing Data. in Ancient DNA: Methods and Protocols (eds. Shapiro, B. & Hofreiter, M.) 197–228 (Humana Press, 2012). doi:10.1007/978-1-61779-516-9_23.
    OpenUrlCrossRefPubMedWeb of Science
  64. 64.↵
    Jun, G., Wing, M. K., Abecasis, G. R. & Kang, H. M. An efficient and scalable analysis framework for variant extraction and refinement from population-scale DNA sequence data. Genome Res. 25, 918–925 (2015).
    OpenUrlAbstract/FREE Full Text
  65. 65.↵
    Mathieson, I. et al. Genome-wide patterns of selection in 230 ancient Eurasians. Nature 528, 499–503 (2015).
    OpenUrlCrossRefPubMed
  66. 66.↵
    Li, H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987–2993 (2011).
    OpenUrlCrossRefPubMedWeb of Science
  67. 67.↵
    Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).
    OpenUrlCrossRefPubMedWeb of Science
  68. 68.↵
    Patterson, N., Price, A. L. & Reich, D. Population structure and eigenanalysis. PLoS Genet. 2, e190 (2006).
    OpenUrlCrossRefPubMed
  69. 69.↵
    Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015).
  70. 70.↵
    Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
    OpenUrlCrossRefPubMed
  71. 71.↵
    DeGiorgio, M., Huber, C. D., Hubisz, M. J., Hellmann, I. & Nielsen, R. S weep F inder 2: increased sensitivity, robustness and flexibility. Bioinformatics 32, 1895–1897 (2016).
    OpenUrlCrossRefPubMed
  72. 72.↵
    Nielsen, R. et al. Genomic scans for selective sweeps using SNP data. Genome Res. 15, 1566–1575 (2005).
    OpenUrlAbstract/FREE Full Text
  73. 73.↵
    Kinsella, R. J., et al. Ensembl BioMarts: a hub for data retrieval across taxonomic space. Database vol. 2011 bar030–bar030 (2011).
    OpenUrlCrossRefPubMed
  74. 74.↵
    Storey, J. D. The positive false discovery rate: a Bayesian interpretation and the q-value. The Annals of Statistics vol. 31 2013–2035 (2003).
    OpenUrl
  75. 75.↵
    Durinck, S., Spellman, P. T., Birney, E. & Huber, W. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nature Protocols vol. 4 1184–1191 (2009).
    OpenUrl
  76. 76.↵
    Iglewicz, B. & Hoaglin, D. C. How to Detect and Handle Outliers. (Asq Press, 1993).
  77. 77.↵
    Daub, J. T. et al. Evidence for Polygenic Adaptation to Pathogens in the Human Genome. Molecular Biology and Evolution vol. 30 1544–1558 (2013).
    OpenUrlCrossRefPubMedWeb of Science
  78. 78.↵
    Durrett, R. & Schweinsberg, J. Approximating selective sweeps. Theor. Popul. Biol. 66, 129–138 (2004).
    OpenUrlCrossRefPubMedWeb of Science
  79. 79.↵
    Bhérer, C., Campbell, C. L. & Auton, A. Refined genetic maps reveal sexual dimorphism in human meiotic recombination at multiple scales. Nat. Commun. 8, 14994 (2017).
    OpenUrlCrossRefPubMed
  80. 80.↵
    Fu, Q. et al. Genome sequence of a 45,000-year-old modern human from western Siberia. Nature 514, 445–449 (2014).
    OpenUrlCrossRefPubMedWeb of Science
  81. 81.↵
    Seguin-Orlando, A. et al. Genomic structure in Europeans dating back at least 36,200 years. Science vol. 346 1113–1118 (2014).
    OpenUrlAbstract/FREE Full Text
  82. 82.↵
    Weir, B. S. & Cockerham, C. C. ESTIMATING F-STATISTICS FOR THE ANALYSIS OF POPULATION STRUCTURE. Evolution 38, 1358–1370 (1984).
    OpenUrlCrossRefPubMedWeb of Science
  83. 83.↵
    Hofer, T., Ray, N., Wegmann, D. & Excoffier, L. Large allele frequency differences between human continental groups are more likely to have occurred by drift during range expansions than by selection. Ann. Hum. Genet. 73, 95–108 (2009).
    OpenUrlCrossRefPubMedWeb of Science
  84. 84.↵
    Kamm, J., Terhorst, J., Durbin, R. & Song, Y. S. Efficiently Inferring the Demographic History of Many Populations With Allele Count Data. Journal of the American Statistical Association 1–16 (2019) doi:10.1080/01621459.2019.1635482.
    OpenUrlCrossRefPubMed
  85. 85.↵
    Lu, Y., Patterson, N., Zhan, Y., Mallick, S. & Reich, D. Technical design document for a SNP array that is optimized for population genetics. (2011).
Back to top
PreviousNext
Posted November 19, 2021.
Download PDF

Supplementary Material

Data/Code
Email

Thank you for your interest in spreading the word about bioRxiv.

NOTE: Your email address is requested solely to identify you as the sender of this article.

Enter multiple addresses on separate lines or separate them with commas.
Admixture has obscured signals of historical hard sweeps in humans
(Your Name) has forwarded a page to you from bioRxiv
(Your Name) thought you would like to see this page from the bioRxiv website.
CAPTCHA
This question is for testing whether or not you are a human visitor and to prevent automated spam submissions.
Share
Admixture has obscured signals of historical hard sweeps in humans
Yassine Souilmi, Raymond Tobler, Angad Johar, Matthew Williams, Shane T. Grey, Joshua Schmidt, João C. Teixeira, Adam Rohrlach, Jonathan Tuke, Olivia Johnson, Graham Gower, Chris Turney, Murray Cox, Alan Cooper, Christian D. Huber
bioRxiv 2020.04.01.021006; doi: https://doi.org/10.1101/2020.04.01.021006
Twitter logo Facebook logo LinkedIn logo Mendeley logo
Citation Tools
Admixture has obscured signals of historical hard sweeps in humans
Yassine Souilmi, Raymond Tobler, Angad Johar, Matthew Williams, Shane T. Grey, Joshua Schmidt, João C. Teixeira, Adam Rohrlach, Jonathan Tuke, Olivia Johnson, Graham Gower, Chris Turney, Murray Cox, Alan Cooper, Christian D. Huber
bioRxiv 2020.04.01.021006; doi: https://doi.org/10.1101/2020.04.01.021006

Citation Manager Formats

  • BibTeX
  • Bookends
  • EasyBib
  • EndNote (tagged)
  • EndNote 8 (xml)
  • Medlars
  • Mendeley
  • Papers
  • RefWorks Tagged
  • Ref Manager
  • RIS
  • Zotero
  • Tweet Widget
  • Facebook Like
  • Google Plus One

Subject Area

  • Genomics
Subject Areas
All Articles
  • Animal Behavior and Cognition (5972)
  • Biochemistry (13523)
  • Bioengineering (10313)
  • Bioinformatics (32880)
  • Biophysics (16953)
  • Cancer Biology (14013)
  • Cell Biology (19876)
  • Clinical Trials (138)
  • Developmental Biology (10742)
  • Ecology (15885)
  • Epidemiology (2067)
  • Evolutionary Biology (20203)
  • Genetics (13312)
  • Genomics (18516)
  • Immunology (13610)
  • Microbiology (31812)
  • Molecular Biology (13265)
  • Neuroscience (69376)
  • Paleontology (517)
  • Pathology (2166)
  • Pharmacology and Toxicology (3712)
  • Physiology (5802)
  • Plant Biology (11899)
  • Scientific Communication and Education (1799)
  • Synthetic Biology (3335)
  • Systems Biology (8112)
  • Zoology (1833)