Understanding natural selection in Holocene Europe using multi-locus genotype identity scans

Ancient DNA (aDNA) has been a revolutionary technology in understanding human history but has not been used extensively to study natural selection as large sample sizes to study allele frequency changes over time have thus far not been available. Here, we examined a time transect of 708 published samples over the past 7,000 years of European history using multi-locus genotype-based selection scans. As aDNA data is affected by high missingness, ascertainment bias, DNA damage, random allele calling, and is unphased, we first validated our selection scan, G12ancient, on simulated data resembling aDNA under a demographic model that captures broad features of the allele frequency spectrum of European genomes as well as positive controls that have been previously identified and functionally validated in modern European datasets on data from ancient individuals from time periods very close to the present time. We then applied our statistic to the aDNA time transect to detect and resolve the timing of natural selection occurring genome wide and found several candidates of selection across the different time periods that had not been picked up by selection scans using single SNP allele frequency approaches. In addition, enrichment analysis discovered multiple categories of complex traits that might be under adaptation across these periods. Our results demonstrate the utility of applying different types of selection scans to aDNA to uncover putative selection signals at loci in the ancient past that might have been masked in modern samples.


32
With the emergence of large sample size sequencing data, numerous population genetic studies 33 have attempted to identify targets of natural selection in the human genome 1 . However, the 34 majority of studies carried out on modern human populations have largely been restricted to 35 detecting selection events that have happened in the most recent periods of human history 36 because selective sweeps decay due to processes including recombination and mutation 1 and can 37 be obscured by demographic events such as admixture 1,2 . By directly tracking genomic changes 38 over time using aDNA, it may be possible to observe sweeps that otherwise are undetectable 39 from modern data. However, until recently, the large sample sizes required for such analyses 40 were unavailable and, as a result, many aDNA based studies to examine natural selection were 41 largely confined to specific alleles 3-7 . 42 43 Recently, increased sample sizes have enabled genome-wide selection scans on aDNA 44 4,5,[8][9][10][11][12][13] . However, most current approaches have focused on single site statistics that leverage 45 temporal data to detect allele frequency changes over time. An alternative strategy is to use 46 haplotype-based approaches, which are sensitive to footprints of selection left behind by 47 hitchhiking of linked alleles with adaptive alleles. Haplotype scans do not require temporal 48 samples and instead only require samples from a single population group from one specific time 49 to infer recent selective events 14-18 and might provide complementary information to approaches 50 that detect allele frequency changes over time. However, most haplotype-based methods require 51 phased genomes that are particularly challenging to obtain from ancient samples for several 52 reasons. First, aDNA read lengths are incredibly short (30-50bp) and read-based phasing has 53 reduced efficiency at these lengths 19 . Second, reference panels constructed from modern 54 haplotypes may introduce bias in calling alleles from aDNA due to divergence that has arisen 55 between ancient and modern genomes. By using trio or family data, where the phasing and 56 imputation can be assessed directly and precisely by examining transmission of alleles, 57 biological information can be used to obtain a ground truth dataset. However, due to the nature 58 of sampling in aDNA studies, there are relatively few trios or families that have been sequenced 59 of sufficient quality that may help with assessing the quality of phasing and imputation methods. 60 61 Recently, statistics that leverage multi-locus genotypes, which represent strings of 62 unphased genotypes from diploid individuals, were proposed to circumvent the need for phased 63 haplotypes 20-22 . However, a major challenge in applying these statistics to aDNA is its low 64 coverage (largely between 0.5-2x coverage), which results in, on average, only one of the two 65 diploid alleles being called. Moreover, the reference allele in modern genomes may bias which 66 of the two diploid alleles is mapped. In this study, we modified a multi-locus genotype-based 67 scan 22 for adaptation to be suitable for low-coverage aDNA data using a pseudo-haploidization 68 scheme, in which one allele per site is randomly selected to represent the genotype of the 69 individual at that position. We evaluated the performance of this method, which we call 70 G12ancient, on aDNA using simulations and well characterized functionally validated variants. We 71 then applied it to different epochs from an aDNA time transect to examine the timing of selection 72 of well-characterized candidate sweeps. Finally, we examined novel targets of selection to see if 73 our new method could complement other studies of natural selection carried out using allele 74 frequency-based methods 13 . 75 76 We carried out this analysis on a dataset of ancient individuals from Holocene Europe 77 representing a period of significant cultural change, beginning with the transition from hunting 78 and gathering to farming, which resulted in people living in much closer proximity to animals, as 79 well as major dietary changes. This was also a period that covered the transition to state-level 80 societies, which led to large population densities and division of labor 23 . Notably, several papers 81 document the first evidence of bacterial and viral pathogens in the aDNA record during the 82 Holocene, and it is of interest to understand if and how humans adapted to these new cultural 83 changes, environments, and diseases that affected us in our evolutionary past 10,24,25 . Given the 84 large sample sizes spanning this time transect that provide a nearly gapless record of human 85 populations in Europe, we attempted to estimate the timing of selection and generate hypotheses 86 about its correspondence with major demographic and cultural changes. 87 and that these homozygous haplotypes provide a similar signal to those from phased data. 158

Results
To validate our modified statistic and its applicability to aDNA data, we took several 159 approaches. As a first line of analysis, we examined the correlation between G12 computed on 160 diploid low-coverage data from the 1,000 genomes project 42 with that of G12ancient computed on 161 the same samples using a pseudo-haploid genotyping scheme along with introducing missingness 162 and ancient DNA damage at typical rates in our dataset (Methods: Running selection scans on 163 ancient datasets and Supplementary Fig. 3). The correlation between G12 and G12ancient across 164 all windows in the genome was 0.95 suggesting that our modified version of the selection 165 statistic G12ancient is almost equivalent to running G12, a selection statistic that has already been 166 examined previously and applied to various other datasets. (Methods: Running selection scans on 167 ancient datasets and Supplementary Fig. 3). Second, we tested the ability of G12ancient in 168 simulated data to demonstrate its performance on a range of theoretical settings. Third, we 169 demonstrate the ability of G12ancient to identify well-characterized and functionally validated 170 variants that have previously been found to be under selection by multiple modern and ancient 171 genomic studies 8,43 (Supplementary We first showed that our pseudo-haploidization approach does not reduce the ability of 185 G12ancient to detect selection, and that the distribution of G12ancient values of pseudo-haploidized 186 simulated data is comparable to that of running the haplotype-based statistic H12 on phased data 187 (Supplementary Fig. 4). When incorporating missingness and data sparsity at levels typically 188 . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted April 25, 2023. ; https://doi.org/10.1101/2023.04.24.538113 doi: bioRxiv preprint observed in aDNA to our simulated datasets (Methods: Running selection scans on simulated 189 data) the G12ancient signal is attenuated but can still be differentiated from neutrality. (Fig. 2 and  190 Supplementary Fig. 5). Additionally, we observe that G12ancient increases with stronger selection 191 ( Fig. 2 with the exception of Fig. 2 Fig. 6). For K = 5 the majority of the resulting sweeps 198 were hard, whereas for higher values of K the probability of a sweep being soft increased 199 (Supplementary Fig. 7). Again, G12ancient was able to distinguish selection from neutrality for 200 varying degrees of softness except for sweeps that were very young ( Supplementary Fig. 6 first 201 row, sample from 250 generations ago). Additionally, we observed that as sweeps became softer, 202 the G12ancient signal decreased, making it harder to detect sweeps that are old and very soft 203 ( Supplementary Fig. 6 bottom row, K = 50).

Age of sample (generations ago)
. CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted April 25, 2023. ; https://doi.org/10.1101/2023.04.24.538113 doi: bioRxiv preprint (rows). We sampled the population at 3 different time points (columns). For the soft sweep 209 simulations K=25 independent adaptive mutations were introduced to the population at the time 210 of the onset of selection. We ran a total of 100 simulations for each combination of parameters 211 with mutation rate µ = 1.25×10 -8 /bp, chromosome length L=5×10 5 and recombination 212 r = 1×10 -8 events/bp. Selection s = 0 represents neutrality. 213 Application of G12ancient to functionally validated variants from real data 214 215 To test the ability of G12ancient to detect selection signals on real data, we modified modern 216 genomic data from European individuals from the 1000 genomes project 42 , by introducing 217 missingness, ascertainment bias, sample size and random allele calling to mimic aDNA 218 (Methods: Generation of modern human data mimicking ancient data). We then examined the 219 ability of G12ancient to detect classic selective signals in the genes LCT, TLR1 and SLC24A5 220 which have been identified by multiple previously conducted selection scans and are regions that 221 are highly differentiated between Europeans and Asians (Supplementary Table 2). The causal 222 alleles at these loci have been fine mapped in association studies and have also been functionally 223 validated in cellular assays. These alleles are commonly used as positive controls in studies 224 carrying out tests for natural selection in humans 43 . The LCT locus is responsible for conferring 225 lactase persistence into adulthood; TLR1 is a gene involved in immune cell response and 226 SLC24A5 is the dominant locus contributing to light skin pigmentation in Europeans 3,46 . 227 228 Using our aDNA mimicking process on the modern data and then applying G12ancient, we 229 were able to identify the LCT, SLC24A5 and TLR1 loci in the top 3 peaks observed chromosome-230 wide in the European (CEU) population but not African (YRI) and South Asian (STU) 231 populations (Fig. 3a). We also examined the effect of utilizing different parameters for window-232 sizes and jumps (distance between analysis windows) and obtained an optimal choice of these 233 parameters on real data (Methods: G12ancient parameter choices and peak calling and 234 Supplementary Fig. 8).

236
Next, to establish that our process could identify the timing of signals of natural selection 237 from aDNA, we examined the LCT locus at different time periods of European history. This 238 locus is particularly relevant for this analysis as the causal allele was absent in Europe prior to 239 the arrival of Steppe Pastoralists in the Bronze Age and, therefore, could not have been under 240 selection prior to that point 3,8,12,13,43,47-51 . By applying G12ancient across different periods in our 241 time transect, we show that we were able to identify selection at the LCT locus, in the historical 242 period (this window is the top peak genome-wide), but we do not see signals of selection for 243 these in the Bronze Age and Iron Age populations (Fig. 3b). These results therefore are in line 244 with the rapid increase in frequency of the causal variant rs4988235 only in the historical period 245 (Fig. 3c), a finding that has also been replicated in other aDNA studies 8,13,43 . 246 . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made We began by re-examining 12 loci previously established to be under selection using 276 aDNA data 8 . Although the selection signals produced by the previous scan and our scan differ in 277 their methodology and, therefore, their ability to detect selective events, we wanted to assess if 278 we might be able to use our approach to localize when in time these signals were selected. 279 280 As seen in Fig. 4a, the time period in which we observe a signal of selection at the LCT 281 locus is limited to the historical period. In the N population, among the top peaks, we found a 282 signal which included the gene OCA2/HERC2, variations in this gene are associated with eye, 283 skin, and hair pigmentation variation 8,26,52 . This gene is the primary determinant of light eye 284 color in Europeans and in our analysis, we time the selection signal to the Neolithic 13 . However, 285 we observe a signal of selection at the HLA and at neighboring ZKSCAN3 in most of the epochs 286 ( Fig. 4a and Fig. 4c). 287 288 Outside of these 4 loci, our selection scan also revealed several other candidates which 289 we determined as being above our significance threshold. Several of these were associated with 290 skin and eye pigmentation. In the BA and IA epochs we observed a signal of selection in the 291 gene SCL24A5. As mentioned above, this gene is thought to be the major determining locus for 292 light skin pigmentation in Europeans 3,53 . While highly differentiated between Asians and 293 Europeans and appreciated as a major candidate of selection using modern European Genomes, 294 single SNP allele frequency approaches examining aDNA have yet to identify this particular 295 allele as a candidate 8,13 . This shows the value of employing alternate types of selection scans on 296 similar datasets to uncover putative selective sweeps. 297 298 We observed a signal at a locus associated with PPM1L as on the top peaks in N, which 299 is an obesity related marker in Humans 54 . This signal for selection on obesity and body weight 300 . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made to farming is also observed in single SNP based approaches 13 . 302 303 We also observed several signals in genes that were associated with immunity or auto-304 immunity. In the BA population, we observed a candidate in locus containing ADK, which 305 regulates the intra and extracellular concentrations of adenosine which has widespread effects on 306 cardiovascular and immune systems 55,56 . We see a signal at the ITCH gene in the BA, which is 307 associated with immune response, and regulation of autoimmune diseases 57,58 . In the IA we see 308 candidate sweep at the MAN2A1 locus -genetic variations in this gene have been shown to cause 309 human systemic lupus erythematosus 59 . In Fig. 4c   . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made In addition to examining individual SNPs, we examined mean G12ancient values across 324 broad categories of functional SNPs. We looked at loci that were associated with changes in gene 325 expression (eQTLs), identified as associated in Genome Wide Association Studies (GWAS), or 326 were thought to be previously under selection in Europeans (CMS) or highly differentiated 327 between Europeans and Asians (HiDiff) or were part of the HLA region. We found that 328 functional categories of SNPs were seen with significantly higher G12ancient values compared to 329 SNPs that were not annotated as being functionally relevant, with the HLA region being the most 330 elevated of the functional categories (Fig. 4b).

332
We next asked if we could associate biological functions to these top-scoring loci. We 333 computed a p-value based on deviation from neutrality based on simulations (Methods: 334 Enrichment Analysis). To determine if categories of genes associated with Genome-Wide 335 Association Studies were significantly associated with selection signals, we carried out 336 enrichment analysis using FUMA 60 , which maps SNPs to genes and performs gene set 337 enrichment analysis for GWAS annotations incorporating LD information as well as gene 338 matching by length and conservation scores (Methods: Enrichment Analysis). We found that 339 many categories of GWAS related to anthropometric traits, auto-immune traits as well as disease 340 related traits were under selection across the different time epochs. We report gene sets from the 341 GWAS catalog using FUMA: Gene2Func 60 and used a conservative significance threshold of -342 log !" >= 5. We report the list of all categories for which we observed enrichment in Fig. 5  In this paper, we introduce a modified version of a previously described selection 350 statistic 16,22 and applied it to a time transect of aDNA from Europe. To date, while allele 351 frequency-based approaches have been used extensively in the field, approaches using haplotype 352 scans have largely been lacking. A single study 61 performed a selection scan by phasing low 353 coverage aDNA samples, and running a widely used extended haplotype statistic, XP-EHH. 354 Here, we took an alternate approach aimed to reduce bias and artifacts from the use of modern 355 reference panels for phasing and imputing low coverage ancient DNA, but largely maintaining 356 power when compared to phased approaches in simulations. 357 358 Our results, which take advantage of the major increases in sample size in the availability of 359 aDNA data in the past 5-10 years demonstrate the potential of running multi-locus genotype-360 based scans on aDNA. Our modified statistic, which we verified through simulations and gold 361 standard variants, can potentially be employed in other settings where sequencing coverage is 362 low and there is high missingness requiring pseudo-haploidization. Importantly, since haplotype-363 based statistics are not as reliant on temporal data to exclude false positives, these statistics are 364 useful for ancient datasets from geographic regions that only have a single timepoint. 365 366 Despite its potential, our approach also has several limitations. As the results from the 367 simulation study show, our statistic is powered mostly for strong selective sweeps (s > 0.01). 368 Moreover, the timing of onset of selection is limited by our ability to detect selection below this 369 high threshold and therefore lack of selection at a particular time could also be due to a lack of 370 power. Another major limitation of our approach is that our window-based method is unable to 371 localize selection to a specific allele as it is based on detecting deviation in a surrounding region 372 of 200 SNPs. On data from a capture array like we have, this distance can span large distances 373 and decrease our target resolution. Here we used the closest gene to the peak SNP in a series of 374 windows to connect genes to candidates under selection. 375 376 An important future direction for this type of research is to carefully examine the 377 accuracy of imputation and phasing on low-coverage ancient data using biological confirmation 378 such as from trios, which could become more available as coverage increases for many samples 379 due to lower sequencing cost and better technology. Additional directions could also be to extend 380 these scans to other time periods or more importantly to other geographic regions in the world 381 where aDNA data is becoming rapidly available. 382 . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made To examine whether the G12ancient based selection scans would be applicable to aDNA 413 data; we developed a process of converting the modern human genomic data from the 1000 414 Genomes project 42 to mimic the statistical and physical properties of aDNA data and ran the 415 scans on modified modern data. We utilized a pseudo-haploidization scheme in which we 416 randomly selected (probability of selection is 0.5) one of the two alleles from the heterozygous 417 genotype as described in Supplementary Fig. 1.

419
To simulate ascertainment, we restricted the 1,000 genomes samples to just the 1.2 420 million positions that were on the aDNA capture array. Finally, we incorporated missingness on 421 a per site basis in modern data using the mean (0.55) and standard error (0.23) we observed in 422 . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted April 25, 2023. ; https://doi.org/10.1101/2023.04.24.538113 doi: bioRxiv preprint our sample of 708 individuals and randomly set the genotypes of a certain proportion of 423 individuals in the modern data to missing ( Supplementary Table 1 and Supplementary Fig. 2 Supplementary Fig. 2: Data processing scheme, we take modern genomic data and apply 435 ascertainment, pseudo-haploidization and add missingness to the data to make it mimic the 436 artefacts of aDNA data used in this study. 437 Running selection scans on modern data mimicking aDNA processing 438 439 We ran G12 on 91 GBR individuals from the 1000 Genomes 42 with phased genotypes 440 called using the standard process and G12ancient with the same individuals processed using our 441 ancient DNA mimicking approach. We then compared the G12 values and G12ancient values at 442 each SNP and calculated the Pearson correlation coefficient between G12 and G12ancient values 443 and found out they are strongly positively correlated with each other with a correlation 444 coefficient of 0.95 (Supplementary Fig. 3) suggesting that our new statistic behaved similarly to 445 the original G12 statistic on our data. 446 447 Based on the missingness observed in our ancient DNA data, we added missing data to 465 our simulated datasets following a beta distribution with mean 0.55 per SNP and standard 466 deviation of 0.23 (Supplementary Table 1). Moreover, we followed the pseudo-haploidization 467 scheme used in processing the data (Supplementary Fig. 2 and Supplementary Fig. 4). Finally,468 in order to incorporate the sparsity of aDNA data, we randomly selected 201 SNPs from our 469 pseudo-haplotype data. That is, we obtained a 201 SNP window for our sample of 177 470 individuals. 471 472 . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made We computed G12ancient in simulated data using 201 SNP windows in a total of 100 475 simulations for each combination of parameters tested. We first obtained G12ancient for hard 476 sweeps and neutrality (s = 0) with and without applying our pseudo-haploidization scheme and 477 with no missing data (Supplementary Fig. 4). ago. No missing data was added to the simulated data. We ran a total of 100 hard sweep 484 simulations for each combination of parameters with mutation rate µ= 1.25×10 -8 /bp, 485 chromosome length L=5×10 5 and recombination r = 5×10 -9 events/bp. 486 487 Next, we obtained the distribution of G12ancient values in data sets containing missing 488 data. We compared the G12ancient signal obtained from hard sweep and neutral simulations with 489 and without missing data, obtaining a reduction of signal when missingness was included 490 (Supplementary Fig. 4). . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted April 25, 2023. was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted April 25, 2023. ; https://doi.org/10.1101/2023.04.24.538113 doi: bioRxiv preprint 496 We tested the ability of G12ancient to detect sweeps across various degrees of softness in 497 sparse genomic data with high missingness. We introduced K beneficial mutations at the time of 498 the onset of selection for K=5,10,25 and 50 (Supplementary Fig. 6). To determine whether 499 these simulations were more likely to result in hard or soft sweeps, we computed the number of 500 distinct mutational origins at the selected site in each simulation (Supplementary Fig. 7). When 501 K=5 most simulations have a single origin, giving rise to hard sweeps. As K increases, so does 502 the number of origins in the simulations, increasing the probability of soft sweeps as well as the 503 softness of the sweep. 504 505 Supplementary Fig. 6: G12ancient values in a soft sweep model. We introduced K beneficial 506 mutations at the time of the onset of selection (rows), for K=5,10,25 and 50, where the higher K 507 the softer the sweep. We sampled the population at 3 different time points (columns). We ran a 508 total of 100 simulations for each combination of parameters with mutation rate µ = 1.25×10 -8 509 /bp, chromosome length L = 5×10 5 , recombination r = 5×10 -9 events/bp and s = 0.1. K = 0 510 corresponds to the scenario with no selection. 511 After examining the application of G12ancient on simulated data, we examined our ability 518 to identify 3 major signals of adaptation previously observed in modern Europeans 43 . We list 519 them here along with their known functional impact. 520 521 Gene To calibrate our G12ancient statistic we iterated over several parameter choices to improve 527 performance. The most significant parameters are window and jump. Window refers to the 528 analysis window size in terms of SNPs, and jump is the distance between centers of analysis 529 windows (readme.pdf). To find the best combination of window and jump, we ran a grid search 530 . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted April 25, 2023. ; https://doi.org/10.1101/2023.04.24.538113 doi: bioRxiv preprint and varied the window size from 50-400 SNPs with a step size of 25 SNPs and jump from 1-20 531 with a step size of 5 SNPs. We tried to optimize our process on the three signals of well 532 characterized adaptation in humans from the previous section on the H population which is 533 closest in time to modern samples. Larger window sizes resulted in decrease of G12ancient values, 534 and larger step sizes resulted in decrease of SNP density, as larger windows diminish the power 535 of the statistic by averaging over regions that are come from different linkage blocks. As jump 536 increases, fewer and fewer SNPs are used in the computation, as illustrated in Supplementary 537 Fig. 8. Overall, we found that a window of 200 SNPs and a jump of 1 were optimal for our 538 datasets and enabled us to detect the well characterized selection candidates at the genome-wide 539 significance threshold. 540 541 Window size across epochs 542 543 Window sizes are also dependent on the number of segregating sites in a population as 544 our windows are computed in units of SNPs. We chose to use a window size of 200 SNPs for all 545 populations, after examining several population genetic parameters across different epochs 546 (Supplementary Fig. 8)  After running the selection scans and computing G12ancient at each focal SNP, we performed quality 566 control to remove spurious peaks that could have occurred due to artifacts or issues with the data. 567 One reason a certain genomic position might have artificially high G12ancient values is if the focal 568 . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted April 25, 2023. ; https://doi.org/10.1101/2023.04.24.538113 doi: bioRxiv preprint SNP and the SNPs within its window range overlap with regions of low recombination rate in the 569 genome. The first step in post-processing/ quality control in our pipeline was to remove all 570 windows with mean per-window recombination rates in the lowest fifth percentile genome-wide. 571 Second, we also removed windows where the mean fraction of missing individuals (i.e., the mean 572 of the fraction of missing individuals per SNP for all the SNPs in that window) was greater than the 573 70th percentile of the mean fraction of missing individuals for all windows. Third, our 574 ascertainment scheme on the aDNA array results in each window having variable physical distance. 575 While most windows are of similar length, some windows are in sites where the distance between 576 positions is considerably lower or higher than the average. In order to show that our post-filtered 577 data is largely unaffected by these issues, we regressed G12ancient values against window size 578 (measured in the physical distance), missingness, and recombination rate after the percentile-based 579 removal process. We saw that the overall variability in the data explained by these three variables 580 combined was less than 5%, suggesting that we had effectively removed their association with 581 G12ancient values (Supplementary Table 4). A final issue could be that there are windows where 582 neighboring SNP positions are not captured well by the probes in our ascertainment scheme, and 583 missingness rates are clustered even though the overall missingness rate is similar to other 584 windows. To deal with these issues, we also removed windows that were consistently in the top 20 585 peaks genome-wide across a set of modern (the CEU, YRI, and STU populations) and the four 586 ancient European populations we analyzed. The rationale for this is that it is quite unlikely that we 587 see the same selective sweep across populations of such different ancestry, and across such a broad 588 range of time and signals of that nature are highly likely to be due to data processing issues. was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made day-and-ancient-dna-data. 647 . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted April 25, 2023. ; https://doi.org/10.1101/2023.04.24.538113 doi: bioRxiv preprint . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted April 25, 2023. ; https://doi.org/10.1101/2023.04.24.538113 doi: bioRxiv preprint . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted April 25, 2023. ; https://doi.org/10.1101/2023.04.24.538113 doi: bioRxiv preprint