Decreased adaptation at human disease genes as a possible consequence of interference between advantageous and deleterious variants

Advances in genome sequencing have dramatically improved our understanding of the genetic basis of human diseases, and thousands of human genes have been associated with different diseases. Despite our expanding knowledge of gene-disease associations, and despite the medical importance of disease genes, their evolution has not been thoroughly studied across diverse human populations. In particular, recent genomic adaptation at disease genes has not been well characterized, even though multiple evolutionary processes are expected to connect disease and adaptation at the gene level. Understanding the relationship between disease and adaptation at the gene level in the human genome is severely hampered by the fact that we do not even know whether disease genes have experienced more, less, or as much adaptation as non-disease genes during recent human evolution. Here, we compare the rate of strong recent adaptation in the form of selective sweeps between disease genes and non-disease genes across 26 distinct human populations from the 1,000 Genomes Project. We find that disease genes have experienced far fewer selective sweeps than non-disease genes during recent human evolution. This sweep deficit at disease genes is particularly visible in Africa, and less visible in East Asia or Europe, likely because more intense genetic drift in the latter populations creates more spurious selective sweep signals. Investigating further the possible causes of the sweep deficit at disease genes, we find that this deficit is very strong at disease genes with both low recombination rates and high numbers of associated disease variants, but is nonexistent at disease genes with higher recombination rates or lower numbers of associated disease variants.
Because recessive deleterious variants have the ability to interfere with adaptive ones, these observations strongly suggest that adaptation has been slowed down by the presence of interfering recessive deleterious variants at disease genes. These results clarify the evolutionary relationship between disease genes and recent genomic adaptation, and suggest that disease genes suffer not only from a higher load of segregating deleterious mutations, but also from an inability to adapt as much, and/or as fast, as the rest of the genome.

Each potential confounding factor is detailed in the Methods. For each confounding factor, the boxplot shows on the y-axis the ratio of the average factor value for disease genes, divided by the average factor value for non-disease genes. The boxplot error bars are obtained by calculating the ratio 1,000 times, each time by randomly sampling as many non-disease genes as there are disease genes.

Among other confounding factors, it is particularly important to take into account evolutionary constraint, i.e. the level of purifying selection experienced by different genes. A common intuition is that disease genes may exhibit less adaptation because they are more constrained (Blekhman et al., 2008), leaving less mutational space for adaptation to happen in the first place. Less adaptation at disease genes might thus represent a trivial consequence of varying constraint between genes (Kim et al., 2007), which says little about a specific connection between disease and adaptation. In the same vein, one might expect disease genes to be associated with higher mutation rates, and more frequent adaptation to follow as a trivial consequence of elevated mutation rates. Whether disease genes experience higher mutation rates is however still an open question (Osada et al., 2009; Eyre-Walker and Eyre-Walker, 2014). In any case, focusing specifically on disease and adaptation requires controlling for confounders such as constraint and mutation rate (see Methods, Results and Figure 1 for a complete list of confounders accounted for in this analysis).

A specific evolutionary relationship may exist between adaptation and disease beyond the simple effect of constraint, mutation rate or other confounders.
In an evolutionary context, once constraint and other confounding factors have been accounted for, we can imagine three potential scenarios for the comparison of adaptation between disease and non-disease genes. Under scenario 1, any potential difference in adaptation between disease and non-disease genes is entirely due to differences in constraint and other confounding factors. Under this scenario, there is no further evolutionary process linking disease and adaptation together. Therefore, there is no difference in adaptation between disease and non-disease genes once confounding factors have been accounted for.

Under scenario 2, disease genes have more adaptation than non-disease genes. For example, as already mentioned above, deleterious mutations can hitchhike together with adaptive mutations to high frequencies in human populations (Birky and Walsh, 1988; Barreiro and Quintana-Murci, 2010; Chun and Fay, 2011). Other, less well established, cases can be imagined where past adaptation decreased the robustness of a specific gene, and subsequent mutations become more likely to be associated with diseases (Xu and Zhang, 2014). Scenario 2 thus favors a relationship between adaptation and disease, where past adaptation precedes and influences the likelihood of a gene being associated with disease.

Under scenario 3, disease genes have less adaptation than non-disease genes even after accounting for confounding factors such as evolutionary constraint. Such a scenario might occur for example if disease genes happen to be genes that can be sensitive to changes in the environment, with a fitness optimum that can change over time, but where adaptation has not occurred yet to catch up with the new optimum. Such an adaptation lag (or lag load, to reuse the terminology introduced by J.
Maynard-Smith (1976)) may occur for example if higher pleiotropy at disease genes (Ittisoponpisan et al., 2017) makes it less likely for new mutations to be advantageous (Otto, 2004) (in addition to increasing the level of constraint already accounted for as a confounding factor). Such an adaptation lag, with genes further away from their optimum, might make such genes more prone to accumulate disease variants that fall too far from the "normal" functioning range around the optimum. An adaptation lag may also occur if deleterious mutations interfere with and slow down adaptation at disease genes more than at non-disease genes (Assaf et al., 2015; Hill and Robertson, 1966).

Even though uncovering the underlying evolutionary processes that govern the relationship between disease and adaptation will take a lot more work than the present analysis, it is important to first find which scenario is the most likely to be true, i.e. whether disease genes have as much, more, or less adaptation than non-disease genes. Finding out which of the three possible scenarios is true may give a preliminary basis to further hypothesize which evolutionary processes are more likely to dominate the relationship between disease and adaptation genome-wide.

Here, we compare recent adaptation in mendelian disease and non-disease genes in order to disentangle the connections between adaptation and disease. We specifically compare the abundance of recent selective sweep signals, where hitchhiking has raised haplotypes that carry an advantageous variant to higher frequencies (Smith and Haigh, 1974). Note that this means that we can only compare adaptation that was strong enough to induce hitchhiking at specific loci; we therefore do not take into account polygenic adaptation distributed across a large number of loci that did not leave any hitchhiking signals (see Discussion). As mentioned above, confounding factors may affect the comparison between disease and non-disease genes. In contrast with previous studies, we systematically control for a large number of confounding factors when comparing recent adaptation in human disease and non-disease genes, including evolutionary constraint, mutation rate, recombination rate, the proportion of immune or virus-interacting genes, etc. (see Methods for a full list of the confounding factors included). In addition to controlling for a large number of confounding factors, we estimate false positive risks (FPR) for our comparison pipeline that fully take into account the implications of controlling for many factors (see Methods and Results).

As a list of disease genes to test, we curate human mendelian non-infectious disease genes based on annotations in the DisGeNET and OMIM databases (Methods). We focus on mendelian disease genes rather than all disease genes including complex disease associations, because different evolutionary patterns can be expected between mendelian and complex disease genes based on previous studies (Blekhman et al., 2008; Quintana-Murci, 2016; Spataro et al., 2017).
In total, we compare 4,215 mendelian disease genes with non-disease genes in the human genome. In agreement with scenario 3, we find a strong deficit of selective sweeps at disease genes compared to non-disease genes. We further test multiple potential explanations for this deficit, and find that higher pleiotropy at disease genes is unlikely to explain the less frequent occurrence of sweeps. In contrast, we find that the sweep deficit at disease genes strongly depends on recombination and the number of known disease variants at given disease genes. This suggests that segregating deleterious mutations at disease genes might interfere with, and slow down, genetically linked adaptive variants enough to produce the observed lack of sweeps at disease genes.

Controlling for confounding factors with a bootstrap test
To compare disease and non-disease genes, we first ask which potential confounding factors differ between the two groups of genes. As expected, multiple measures of selective constraint are significantly higher in disease compared to non-disease genes. As a measure of long-term constraint, the density of conserved elements across mammals is slightly higher at disease genes compared to non-disease genes (Figure 1: conserved 50kb, conserved 500kb; Methods).

As a measure of more recent constraint, we contrast pS, the average proportion of variable synonymous sites, with pN, the average proportion of variable nonsynonymous sites (Figure 1; Methods). If the coding sequences of disease genes are more constrained, we expect a drop of pN at disease genes, but no such drop of pS at neutral synonymous sites. Accordingly, pN is lower at disease compared to non-disease genes, while pS is very similar between the two categories of genes (Figure 1). Therefore, selective constraint was stronger in the coding sequences of disease genes during recent human evolution.

As another measure of recent constraint, we also use McVicker's B estimator of background selection (McVicker et al., 2009). The amount of background selection at a locus can be used as a proxy for recent constraint, since it depends on the number of deleterious mutations that were recently removed at this locus. The lower B, the more background selection there is at a specific locus. In line with higher recent constraint at disease genes, B is slightly, but significantly lower at disease genes (Figure 1; Methods). Overall, we find evidence of higher constraint at disease genes.
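The comparison of confounding factors between the two groups of genes (ratio of average factor values, recomputed 1,000 times on random subsets of non-disease genes, as stated in the Figure 1 legend) can be illustrated with a minimal sketch. The function names, the sampling without replacement, and the percentile interval are assumptions made for illustration; they are not taken from the authors' pipeline.

```python
import random
from statistics import mean


def factor_ratio_distribution(disease_values, nondisease_values,
                              n_resamples=1000, seed=0):
    """Ratios of the mean factor value at disease genes over the mean at
    randomly sampled non-disease genes.

    Assumption: each resample draws, without replacement, as many
    non-disease genes as there are disease genes.
    """
    rng = random.Random(seed)
    pool = list(nondisease_values)
    disease_mean = mean(disease_values)
    sample_size = len(disease_values)
    ratios = []
    for _ in range(n_resamples):
        sample = rng.sample(pool, sample_size)
        ratios.append(disease_mean / mean(sample))
    return ratios


def percentile_interval(values, lower=0.025, upper=0.975):
    """Simple percentile interval (hypothetical choice of error bars)."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[int(lower * last)], ordered[int(upper * last)]


if __name__ == "__main__":
    # Toy values standing in for one confounding factor (e.g. GC content).
    toy_rng = random.Random(1)
    disease = [toy_rng.gauss(0.42, 0.05) for _ in range(200)]
    nondisease = [toy_rng.gauss(0.41, 0.05) for _ in range(800)]
    ratios = factor_ratio_distribution(disease, nondisease)
    print(percentile_interval(ratios))
```

A ratio interval that excludes 1 would indicate that the factor differs between disease and non-disease genes and is therefore worth matching.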

In addition to constraint, mutation rate could represent an important confounder. The proportion of variable neutral synonymous sites pS can be used to compare mutation rates, since the number of variable sites is proportional to the mutation rate under neutrality. As mentioned already, pS is very similar at disease and non-disease genes (Figure 1), suggesting that mutation rates are similar at disease and non-disease genes. This is further supported by the fact that multiple factors that could affect the mutation rate such as GC content or recombination are also similar at disease and non-disease genes (Figure 1; Methods). Aside from mutation rate and constraint, multiple other factors that could affect adaptation differ between disease and non-disease genes, [...]

We first rank genes based on the average iHS or nSL in genomic windows centered on genes (Methods), from the top-ranking genes with the strongest sweep signals to the genes with the weakest signals. We then slide a rank threshold from a high rank value to a low rank value (from top 5,000 to top 10, x-axis on Figure 2). For each rank threshold, we estimate the sweep enrichment (or deficit) at disease relative to non-disease genes (Figure 2 [...] sliding the rank threshold, we estimate a whole enrichment curve that is not only sensitive to the strongest sweeps but also to weaker sweep signals (for example using the top 5,000 threshold; [...] Petrov, 2020) because they match disease genes in terms of confounding factor values (Methods). Furthermore, control non-disease genes are chosen far from disease genes (>300kb; Methods). We do this to avoid choosing as controls non-disease genes that are too close to disease genes and thus likely to have the same sweep profile (especially in the case of large sweeps potentially overlapping both neighboring disease and non-disease genes).
This, together with the large number of confounding factors that we match, tends to limit the pool of possible control genes (Methods). The statistical impact of a limited control pool is however fully taken into account by the estimation of an FPR with block-randomized genomes (Methods). [...] Genomes Project Consortium, 2015). At this stage we must consider the fact that most gene-disease associations in our dataset were likely discovered in European cohorts. Because disease genes in Europe may not always be disease genes in other populations, we cannot exclude the possibility that a sweep enrichment or a sweep deficit might be more pronounced in Europe, unless the evolutionary processes that make a gene more likely to be a disease gene predated the split of different human populations. Conversely, one might expect distinct selective patterns between disease and non-disease genes to be more visible in Africa. Indeed, more intense drift, [...] Figure 2A, B and C respectively; Methods). Note that this FPR takes the clustering of multiple genes in the same sweeps into account (Enard and Petrov, 2020). A stronger depletion in Africa suggests that the evolutionary processes linking disease and adaptation at the gene level predate the split of African and European populations, given that most gene-disease association studies involved European cohorts. The stronger depletion in Africa also suggests that the same pattern might be present outside of Africa, but more hidden by genetic drift noise. It might indeed be harder to distinguish a deficit of true sweep signals at disease genes if it is swamped by an elevated level of false sweep signals occurring at random in the genome, due to more intense drift.
Figure 3A, B and C show the sweep deficit curves at disease genes compared to control non-disease genes in Africa, East Asia and Europe, [...]

The figure shows the averaged whole enrichment curves and their averaged confidence intervals from the bootstrap test, averaged over both iHS and nSL sweep ranks, and over all the populations from each continent (Methods). The y-axis represents the relative sweep enrichment at disease genes, calculated as the number of disease genes in putative sweeps, divided by the number of control non-disease genes in putative sweeps. The gray areas are the 95% confidence interval for this ratio.

Notably, the stronger depletion observed in Africa likely excludes the possibility that it could be mostly due to a technical artifact, where sweeps themselves might make it harder to identify disease genes in the first place. Sweeps increase linkage disequilibrium (LD) in a way that could make it more difficult to assign a disease to a single gene in regions of the genome with high LD and multiple genes genetically linked to a disease variant. This could result in a depletion of sweeps at monogenic disease genes, simply because disease genes are less well annotated in regions of high LD. However, if this were the case, because most disease genes were identified in Europe, we would expect such an artifact to deplete sweeps at disease genes primarily in Europe, not in Africa. This artifact is also very unlikely given that recombination rates are similar between disease and non-disease genes (Figure 1). Overall, these results support the third scenario where evolutionary processes decrease adaptation at disease genes. That said, it is important to note that we only detect a deficit of adaptation strong enough to leave hitchhiking signals. Our results do not imply that the same is true for adaptation that is too polygenic to leave signals detectable with iHS or nSL.
Note that the sweep deficit at disease genes in Africa is robust to differences in gene functions between disease and non-disease genes according to a [...] creates this deficit at disease genes. Because disease genes tend to be pleiotropic and many disease genes are involved in multiple diseases (see below), pleiotropy is a particularly attractive potential explanation for the lack of sweeps at disease genes. Pleiotropy is defined as the ability of a gene to affect multiple phenotypes. The involvement in multiple phenotypes may make it more difficult for mutations to emerge at pleiotropic genes without any adverse antagonistic effects (Otto, 2004). In addition to the higher selective constraint already accounted for, pleiotropy may thus also make it less likely for new mutations to be advantageous and cause a sweep (Otto, 2004), with the advantage provided by changes at specific phenotypes being mitigated by the adverse effects on other phenotypes.

We can test the involvement of pleiotropy with our dataset by comparing sweeps at disease genes involved in multiple diseases, with sweeps at disease genes involved in only one disease. If pleiotropy decreases the rate of sweeps at disease genes, we predict that genes involved in multiple diseases should experience fewer sweeps than genes involved in only one disease.

There are 1,221 disease genes in our dataset associated with five or more diseases (five+ disease genes), and 1,296 disease genes associated with only one disease according to the CUI (Concept Unique Identifiers) classification provided by DisGeNET (Methods). When comparing the five+ disease genes with one-disease genes far away (>300kb, as when comparing all disease genes with control non-disease genes), we do not find significantly fewer iHS and nSL
sweep signals at [...]

With pleiotropy likely having a limited role, we further test other possible explanations for the sweep deficit at disease genes. Another possibility is that adaptation may be limited at disease genes due to deleterious mutations interfering with and slowing down advantageous variants. This process has been mostly studied in haploid species (Peck, 1994 [...]

For these comparisons we focus solely on African populations for which we found the strongest sweep deficit (Figure 2). We first compare disease and control non-disease genes both from only regions of the genome with recombination rates lower than the median recombination rate (1.137 cM/Mb). In agreement with recombination being involved, we find that the sweep deficit at low recombination disease genes is much more pronounced than the overall sweep deficit found when considering all disease and control non-disease genes regardless of recombination (Figure 4, FPR=2×10⁻⁴). Conversely, the sweep deficit at disease genes compared to non-disease genes is much less pronounced when restricting the comparison to genes with recombination rates higher than the median recombination rate (1.137 cM/Mb), and remains only marginally significant (Figure 4, FPR=0.029). This provides evidence that genetic linkage may indeed be involved. Low recombination is however not sufficient on its own to create a sweep deficit, and we further test if the sweep deficit also depends on the number of disease variants at each disease gene. In our dataset, approximately half of all the disease genes have five or more disease variants, and the other half have four or fewer disease variants (Methods). In further agreement with possible interference of recessive deleterious variants, the sweep deficit is much more pronounced at disease genes with five or more disease variants (Figure 4, FPR=8×10⁻⁴).
The sweep deficit at disease genes with four or fewer disease variants is barely significant compared to control non-disease genes (Figure 4, FPR=0.032). In addition, disease genes with five or more disease variants, but with recombination higher than the median recombination rate, do not have a strong [...]

The sweep deficit is measured as the FPR score per gene (to make all tested groups comparable) over all window sizes, and nSL and iHS, as in Figure 1 [...]

The sweep deficit is measured as the overall FPR score per gene (Methods), to make all MeSH classes comparable even if they include different numbers of genes.
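The stratified comparisons reported above (recombination below or above the median of 1.137 cM/Mb, and five or more versus four or fewer disease variants) amount to partitioning disease genes into four groups before running the sweep comparison on each. The following is only a sketch of that partition; the function name, the input layout, and the handling of genes exactly at the threshold are assumptions for illustration.

```python
def stratify_disease_genes(genes, rec_threshold=1.137, variant_cutoff=5):
    """Partition disease genes by recombination rate and disease variant count.

    genes: dict mapping gene name -> (recombination rate in cM/Mb,
           number of known disease variants).
    Assumption: genes exactly at the threshold are placed in the
    high-recombination group.
    """
    groups = {
        "low_rec_many_variants": [],
        "low_rec_few_variants": [],
        "high_rec_many_variants": [],
        "high_rec_few_variants": [],
    }
    for name, (rec_rate, n_variants) in genes.items():
        rec_label = "low_rec" if rec_rate < rec_threshold else "high_rec"
        var_label = ("many_variants" if n_variants >= variant_cutoff
                     else "few_variants")
        groups[f"{rec_label}_{var_label}"].append(name)
    return groups


if __name__ == "__main__":
    # Hypothetical genes, for illustration only.
    toy = {"G1": (0.4, 12), "G2": (0.9, 2), "G3": (2.5, 8), "G4": (3.1, 1)}
    for label, members in stratify_disease_genes(toy).items():
        print(label, members)
```

Under the interference hypothesis discussed in the text, the strongest sweep deficit would be expected in the low-recombination, many-variants group.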

Discussion
We found a depletion of the number of genes in recent sweeps at human non-infectious, mendelian disease genes compared to non-disease genes. Although more work is now needed, the lack of sweeps at disease genes already favors specific evolutionary processes over others. For example, it makes it unlikely that past adaptations increasing the occurrence of disease variants through hitchhiking would be the dominant process linking disease and adaptation at the gene level. The lack of sweeps at disease genes also seems to be unrelated to any difference in mutation accumulation between disease and non-disease genes, since we find no sign of a difference in mutation rates between the two categories of genes in the first place, and since we match metrics accounting for mutation rate in our comparisons (for example, GC content and pS). Instead, a lack of sweeps, once selective constraint has been controlled for, seems to favor a relationship involving a lag of adaptation at disease genes beyond simple constraint (measured by the amount of deleterious mutations that are removed).

Multiple mechanisms might explain such a lag of adaptation. A first possible hypothesis is that disease genes are genes that can be sensitive to the environment and whose fitness optimum can change during evolution when the environment changes. When this happens, adaptation might then take more time to catch up with the new optimum. Although higher pleiotropy is a tempting hypothesis to explain such a lag (Otto, 2004), genes involved in multiple diseases do not have a particularly pronounced sweep depletion compared to genes associated with only one disease. Completely excluding pleiotropy may however require more effort, notably by considering measures of pleiotropy other than the number of diseases a gene has been associated with.
Another hypothesis is that disease genes may have a distribution of deleterious fitness effects that is different from other genes, but that the metrics of constraint that we used do not capture this difference. Specifically, we can imagine a case where disease genes have more currently segregating recessive deleterious variants than other genes, and where selective sweeps are impeded due to the interference of genetically linked recessive deleterious variants. The deleterious effects of these variants can reveal themselves when they hitchhike together with an advantageous variant that is just starting to increase in frequency (Assaf et al., 2015). Accordingly, we find a marked sweep depletion when restricting the comparison to disease and non-disease genes in low recombination regions of the genome and with higher numbers of disease variants (Figure 4). All these comparisons are however indirect, and we do not quantify directly the amount of recessive deleterious mutations at disease or non-disease genes. Further verifying that recessive deleterious mutations impede sweeps more at disease than non-disease genes will require showing that recessive deleterious mutations are indeed more abundant at disease genes, ideally by also estimating dominance coefficients. That said, the majority of disease variants are known to be recessive, and using the number of disease variants, as done in the present study, should be a good proxy of the actual number of segregating recessive deleterious mutations. Estimating dominance may prove challenging, since it is difficult to distinguish selection coefficient changes from dominance coefficient changes (Huber et al., 2018). Again, our results provide preliminary evidence to further test in the future.
In addition to suggesting possible explanatory evolutionary scenarios, our results highlight a number of potential limitations and biases that also need to be explored in more detail. First, the lack of sweeps at disease genes suggests the possibility of a technical bias against the annotation of disease genes in sweep regions with high LD, as described in the Results. This bias is unlikely to be the dominant explanation for our results, because then we would expect a stronger sweep deficit at disease genes in Europe than in Africa, given that most disease genes were annotated in Europe. The recombination rate at disease genes is also not different from the recombination rate at non-disease genes (Figure 1). The increase of the sweep deficit when comparing disease and non-disease genes only in low recombination regions (Figure 4), where disease annotation would then be more difficult regardless of overlapping a sweep or not, also suggests that this bias is unlikely. That said, it will still be useful to further investigate in the future how much this potential bias might have contributed to our observations.

Second, even though more intense genetic drift seems a reasonable explanation for the less [...]

In conclusion, although our analysis reveals a strong deficit of selective sweeps at human disease genes, it also suggests that more work is needed to better understand the evolutionary processes at work, and the biases that may have skewed our interpretations. Despite these limitations, our comparison nevertheless already suggests that specific evolutionary relationships between disease genes and adaptation might be more prevalent than others, especially interference between recessive deleterious and adaptive variants.

Disease gene lists
We consider genes that are known to be associated with diseases as disease genes. We focus on protein-coding genes associated with human mendelian non-infectious diseases. Complex diseases are associated with several loci and environmental factors. Patterns of positive selection at complex disease and mendelian disease genes may differ (Blekhman et al., 2008), which is why we restrict our analysis to mendelian disease genes. We also restrict our analyses to non-infectious disease genes, since interactions with pathogens are an entirely different problem. We [...] CTD includes a broad range of chemical-induced diseases that might only happen where people are exposed to these chemicals, especially some inorganic chemicals that may not be present in natural environments (Davis et al., 2021).

In order to study different types of diseases, we also divide disease genes into different [...]

We measure iHS and nSL in windows centered on human coding genes (i.e. windows whose center is located half-way between the most upstream transcript start site and most downstream transcript stop site of protein coding genes). We use windows of sizes ranging from 50kb to 1,000kb (50kb, 100kb, 200kb, 500kb and 1,000kb) since we do not want to presuppose the size of sweeps, and since the size of the selective sweeps may vary between different genes. Moreover, to avoid any preconception related to the expected strength or number of sweep signals, we use a moving rank threshold strategy to measure the enrichment or deficit in sweeps at disease genes. For example, we select the top 500 genes with the strongest sweep signals according to a specific statistic (iHS or nSL). We then compare the number of disease and non-disease genes within the top 500 genes with the strongest iHS or nSL signals. This was repeated for different top thresholds and the corresponding ranks from top 5,000 to top 10 [...]
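The moving rank threshold strategy described above can be sketched as follows: genes are ranked by their sweep statistic, and for each top-N threshold the number of disease genes and control non-disease genes within the top N is counted. This is an illustrative sketch only; the function names and the intermediate thresholds in the usage example are hypothetical (the text only states that thresholds run from top 5,000 to top 10), and the actual pipeline additionally resamples matched controls.

```python
def rank_genes(sweep_scores):
    """Map each gene to its rank (1 = strongest sweep signal)."""
    ordered = sorted(sweep_scores, key=sweep_scores.get, reverse=True)
    return {gene: position + 1 for position, gene in enumerate(ordered)}


def enrichment_curve(sweep_scores, disease_genes, control_genes, thresholds):
    """Count disease and control genes within each top-N rank threshold.

    Returns a list of (threshold, n_disease, n_control, ratio) tuples, where
    ratio is n_disease / n_control, or None when no control gene is in the
    top N.
    """
    ranks = rank_genes(sweep_scores)
    unranked = float("inf")
    curve = []
    for top in thresholds:
        n_disease = sum(1 for g in disease_genes
                        if ranks.get(g, unranked) <= top)
        n_control = sum(1 for g in control_genes
                        if ranks.get(g, unranked) <= top)
        ratio = n_disease / n_control if n_control else None
        curve.append((top, n_disease, n_control, ratio))
    return curve


if __name__ == "__main__":
    # Toy scores standing in for the average statistic in a gene window.
    scores = {f"gene{i}": 100 - i for i in range(100)}
    disease = {f"gene{i}" for i in range(0, 100, 4)}
    control = {f"gene{i}" for i in range(1, 100, 4)}
    for row in enrichment_curve(scores, disease, control, (50, 20, 10)):
        print(row)
```

A ratio below 1 at a given threshold corresponds to a sweep deficit at disease genes for that threshold; the whole curve is what the FPR later evaluates.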

Comparing recent adaptation between disease and non-disease genes
We use a previously developed gene-set enrichment analysis pipeline to compare recent adaptation between disease and non-disease genes (Enard and Petrov, 2020) (https://github.com/DavidPierreEnard/Gene_Set_Enrichment_Pipeline). This pipeline includes two parts. The first part is a bootstrap test that estimates the whole sweep enrichment or depletion curve at genes of interest (disease genes in our case). The second part is the estimation of a false positive risk (FPR, also known as false discovery rate in the context of multiple testing) that quantifies the statistical significance of the whole sweep enrichment curve using block-randomized genomes.
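To make the second part of the pipeline more concrete, the sketch below shows one way a false positive risk could be read off from randomized genomes. It relies on the description given in the Methods of the FPR score as a cumulative difference in the number of genes in sweeps between disease and control genes. The assumption that the FPR is the fraction of block-randomized genomes with a score at least as extreme as the observed one, and all function names, are ours; the block randomization itself is not reproduced here and its scores are taken as given inputs.

```python
def deficit_score(disease_counts, control_counts):
    """Cumulative difference (disease minus control) in the number of genes
    in sweeps, summed over rank thresholds, window sizes, statistics, etc.
    Negative values indicate a sweep deficit at disease genes.
    """
    return sum(d - c for d, c in zip(disease_counts, control_counts))


def false_positive_risk(observed_score, randomized_scores):
    """Fraction of randomized genomes whose score is at least as negative as
    the observed score (assumed definition, for a sweep deficit).
    """
    n_extreme = sum(1 for score in randomized_scores
                    if score <= observed_score)
    return n_extreme / len(randomized_scores)


if __name__ == "__main__":
    # Toy counts per rank threshold, and toy scores from randomized genomes.
    observed = deficit_score([40, 12, 3], [55, 20, 6])
    randomized = [-5, 3, -30, 8, 0, -2, 11, -9, 4, 1]
    print(observed, false_positive_risk(observed, randomized))
```

In practice many more randomized genomes would be needed to resolve small FPR values such as those reported in the Results.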

To compare disease and non-disease genes, we first need to select control non-disease genes that are sufficiently far away from disease genes. In that way, we avoid using as controls non-disease genes that overlap the same sweeps as neighboring disease genes, which would result in an underpowered comparison. The question is then how far from disease genes we need to choose non-disease control genes. Ideally, we would choose non-disease control genes as far as possible from disease genes in the human genome, further than the size of the largest known sweeps (for example the lactase sweep), which would be on the order of a megabase. However, because there are many disease genes in our dataset (4,215), there are very few non-disease genes in the human genome that are more than one megabase away from the closest disease gene. This is a problem, because the available number of potential control non-disease genes is an important parameter that can affect both the type I error (false positive rate) and the type II error (false negative rate) of the disease vs. non-disease genes comparison. Indeed, the smaller the control set, the more likely it is to deviate from being representative of the true null expectation at non-disease genes. The noise associated with a small sample could go either way. Either the small control sample happens by chance to have fewer sweeps, and the bootstrap test we use to compare disease and non-disease genes will become too liberal to detect sweep enrichments, and too conservative to detect sweep deficits. Or the small control sample happens by chance to have more sweeps than a larger control sample would, and the bootstrap test becomes too conservative to detect sweep enrichments, and too liberal to detect sweep deficits.
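The selection of control non-disease genes located far enough from any disease gene can be sketched as a simple distance filter. This sketch assumes that the distance is measured between gene boundaries on the same chromosome; the text does not specify how the distance is computed, and the function names and toy coordinates are hypothetical.

```python
def interval_distance(first, second):
    """Gap in base pairs between two (start, end) intervals; 0 if they overlap."""
    return max(0, max(first[0], second[0]) - min(first[1], second[1]))


def distant_controls(disease, nondisease, min_distance=300_000):
    """Return non-disease genes more than min_distance away from every
    disease gene on the same chromosome.

    disease, nondisease: dicts mapping gene name -> (chrom, start, end).
    """
    disease_by_chrom = {}
    for chrom, start, end in disease.values():
        disease_by_chrom.setdefault(chrom, []).append((start, end))

    selected = []
    for gene, (chrom, start, end) in nondisease.items():
        neighbors = disease_by_chrom.get(chrom, [])
        if all(interval_distance((start, end), d) > min_distance
               for d in neighbors):
            selected.append(gene)
    return selected


if __name__ == "__main__":
    disease = {"D1": ("chr1", 1_000_000, 1_010_000)}
    nondisease = {
        "N1": ("chr1", 1_200_000, 1_210_000),  # 190kb away: too close
        "N2": ("chr1", 1_400_000, 1_410_000),  # 390kb away: kept
        "N3": ("chr2", 1_000_000, 1_010_000),  # other chromosome: kept
    }
    print(distant_controls(disease, nondisease))
```

Raising `min_distance` shrinks the returned pool, which is the trade-off between sweep independence and control sample size discussed above.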
After trying distances between disease genes and control non-disease genes of 100kb, 200kb, 300kb, 400kb and 500kb, we find that the sweep deficit observed at disease genes increases steadily from 100kb to 300kb ( The sweep deficit is measured by the FPR score, that is the cumulative difference between the number of genes in sweeps at disease and control non-disease genes, across window sizes, sweep summary statistics, and African populations (see the rest of the Methods).

Another important aspect of the bootstrap test (first part of the pipeline), aside from setting the minimal distance of the control non-disease genes, is the matching of potential confounding factors likely to influence sweep occurrence. We choose non-disease control genes that have the same confounding factor characteristics as disease genes (for example, control non-disease genes that have the same gene expression level across tissues as disease genes). The precise matching algorithm is detailed in Enard & Petrov (2020).

When comparing disease and non-disease genes with the bootstrap test, we control for the following potential confounding factors that could influence the occurrence of sweeps at genes:
factor (same logic for other factors where we also use both 50kb and 500kb windows).
• GC content is calculated as a percentage per window in 50kb and 500kb windows. It is obtained from the UCSC Genome Browser (Kent et al., 2002).
• The density of coding sequences in 50kb and 500kb windows centered on genes. The density is calculated as the proportion of coding bases with respect to the whole length of the window. Coding sequences are Ensembl v99 coding sequences.
• The density of mammalian phastCons conserved elements (Siepel et al., 2005) (in 50kb and 500kb windows), downloaded from the UCSC Genome Browser (Kent et al., 2002). We used a threshold considering 10% of the genome as conserved, as it is unlikely that more than 10% of the whole genome is constrained according to previous evidence (Siepel et

rank threshold separately, in the whole enrichment (or deficit) curve (Figure 2). It does not provide any estimate of the significance of the whole curve, which is needed to estimate the significance of a sweep enrichment or deficit without making too many assumptions on how many sweeps are expected or how strong they are.
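
To make the window-based confounding factors listed above more concrete, here is a small Python sketch of the coding density, computed as the proportion of coding bases in a window centered on a gene. The function name, the half-open coordinate convention, and the choice of the gene midpoint as the window center are assumptions for illustration only; chromosome ends are ignored.

```python
def coding_density(gene_start, gene_end, coding_intervals, window=50_000):
    """Proportion of coding bases in a window centered on the gene midpoint.

    `coding_intervals` is a list of half-open (start, end) intervals on the
    same chromosome as the gene. Overlapping intervals are counted once.
    """
    center = (gene_start + gene_end) // 2
    w_start = center - window // 2
    w_end = w_start + window

    covered = 0
    last_end = w_start  # right edge of the bases already counted
    for start, end in sorted(coding_intervals):
        start = max(start, last_end)  # skip bases outside the window or already counted
        end = min(end, w_end)         # clip to the right edge of the window
        if end > start:
            covered += end - start
            last_end = end
    return covered / window


# Hypothetical toy coordinates, for illustration only.
exons = [(79_000, 81_000), (100_000, 102_000), (101_000, 103_000), (129_500, 131_000)]
print(coding_density(100_000, 110_000, exons, window=50_000))
print(coding_density(100_000, 110_000, exons, window=500_000))
```

The same windowing logic would apply to the other densities computed in 50kb and 500kb windows, with the coding intervals swapped for the relevant annotation.
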
To address the increased type I and type II error risks of the bootstrap test, as well as to get an unbiased significance estimate for whole enrichment curves, the second part of our pipeline conducts a false positive risk analysis based on block-randomized genomes (Enard and Petrov, 2020). Briefly, we re-estimate many whole enrichment curves reusing the same disease and control non-disease genes used in the first part of the pipeline by the bootstrap test, but after having randomly shuffled the locations of genes or clusters of neighboring genes in sweeps at those disease and control non-disease genes. To do this, we order the disease and control non-disease genes as they appear in the genome. We then define blocks of neighboring genes, whose limits do not interrupt clusters of genes in the same putative sweep. Then, we randomly shuffle the order of these blocks. Because we do not cut any cluster of genes that might be in the same sweep, the resulting block-randomized genomes preserve the same clustering of the genes in the same putative sweeps as in the real genome. With this approach, we look at the exact same set of disease and control non-disease genes and just shuffle sweep locations between them. Thus, by using many block-randomized genomes, we can estimate the null expected range of whole enrichment curves while fully accounting for the extra variance expected from having a limited sample of control non-disease genes. We can then estimate a false positive risk (FPR) for the whole enrichment or deficit curve by comparing the real observed one with the distribution of random curves generated with block-randomized genomes.
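
The block randomization described above can be sketched in Python as follows. This is a simplified illustration and not the implementation of Enard and Petrov (2020): the function names, the `target_size` parameter and the representation of sweeps as one label per gene (`None` for genes outside sweeps) are assumptions. Genes stay in genome order, and only the sweep labels are cut into blocks and shuffled.

```python
import random


def make_blocks(sweep_ids, target_size):
    """Cut the genome-ordered sweep labels into blocks of >= target_size genes.

    A block boundary is never placed between two neighboring genes that
    carry the same sweep label, so clusters of genes in a sweep stay intact.
    """
    blocks, current = [], []
    for i, sid in enumerate(sweep_ids):
        current.append(sid)
        next_sid = sweep_ids[i + 1] if i + 1 < len(sweep_ids) else None
        same_sweep = sid is not None and sid == next_sid
        if len(current) >= target_size and not same_sweep:
            blocks.append(current)
            current = []
    if current:  # trailing genes that did not fill a whole block
        blocks.append(current)
    return blocks


def block_randomize(sweep_ids, target_size, rng):
    """Return sweep labels after shuffling the order of the blocks."""
    blocks = make_blocks(sweep_ids, target_size)
    rng.shuffle(blocks)
    return [sid for block in blocks for sid in block]


# Hypothetical toy genome: ten genes in genome order, two putative sweeps.
sweeps = [None, "s1", "s1", "s1", None, None, "s2", "s2", None, None]
rng = random.Random(1)
print(block_randomize(sweeps, target_size=2, rng=rng))
```

Each shuffled list has the same number of genes in sweeps and the same sweep clusters as the input; only the positions of the sweeps relative to the (unchanged) disease and control gene labels differ.
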

To measure the FPR for a curve, we need to define a metric to compare the real curve with the randomly generated ones. In Figure 1, we show relative enrichments at each sweep rank threshold, that is, the number of disease genes in sweeps divided by the number of control non-disease genes in sweeps. As a summary metric for the curve, we could then use the sum of the relative enrichments over all thresholds. However, the issue with this approach is that a relative enrichment is the same whether we have two disease genes in sweeps and one control non-disease gene in sweeps, or 200 disease genes in sweeps and 100 control non-disease genes in sweeps. Thus, although relative enrichments are convenient for visualization on a figure, they are not adequate to measure the FPR. Instead of the relative enrichment, we use the difference between disease and non-disease genes, that is, the number of disease genes in sweeps, minus the average number of control non-disease genes in sweeps across the control sets built by the bootstrap test. We then use as a metric for a whole curve the sum of differences over all the rank thresholds. We use this sum of differences to estimate the enrichment or deficit curve FPR, as the proportion of block-randomized genomes where the sum of differences exceeds the observed sum of differences for an enrichment (one minus this proportion for a deficit).

Importantly, although so far we have described the case where we measure the FPR for one enrichment curve, nothing prevents us from calculating a single sum of differences over an entire group of enrichment or deficit curves. This way, we can measure a single FPR for any number of curves considered together. In our analysis, we measure a single FPR adding iHS and nSL curves together, and also adding together the curves for 50kb, 100kb, 200kb, 500kb and 1000kb windows (ten curves in total, 2 statistics × 5 window sizes).
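
The sum-of-differences metric and the resulting FPR can be written down in a few lines of Python. The sketch below follows the definitions given above; the function names and all the numbers are hypothetical and serve only as an illustration, not as the pipeline's actual code or results.

```python
def sum_of_differences(disease_counts, mean_control_counts):
    """Sum over rank thresholds of (disease genes in sweeps
    minus the average number of control genes in sweeps)."""
    return sum(d - c for d, c in zip(disease_counts, mean_control_counts))


def combined_sum(curves):
    """Single metric for a group of curves considered together.

    `curves` is a list of (disease_counts, mean_control_counts) pairs,
    for example one pair per statistic and window size.
    """
    return sum(sum_of_differences(d, c) for d, c in curves)


def false_positive_risk(observed_sum, null_sums, direction):
    """FPR from sums of differences obtained on block-randomized genomes."""
    exceeding = sum(1 for s in null_sums if s > observed_sum)
    proportion = exceeding / len(null_sums)
    if direction == "enrichment":
        return proportion
    if direction == "deficit":
        return 1.0 - proportion
    raise ValueError("direction must be 'enrichment' or 'deficit'")


# Hypothetical counts at three rank thresholds for two curves.
curves = [
    ([10, 5, 2], [20.0, 9.5, 3.0]),
    ([8, 4, 1], [15.0, 7.0, 2.5]),
]
observed = combined_sum(curves)

# Hypothetical sums of differences from ten block-randomized genomes.
null_sums = [-30.0, -5.0, 2.5, 0.0, 7.0, -1.0, 3.0, -12.0, 4.5, 1.0]
print(observed, false_positive_risk(observed, null_sums, "deficit"))
```

In practice many more block-randomized genomes than the ten used here would be needed, since the smallest FPR that can be resolved is bounded by one over the number of randomizations.
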
To generate Figure 4, we separate disease genes into groups of approximately the same size based on their recombination rate and numbers of disease variants annotated in OMIM/Uniprot. We separate the disease genes into two groups of equal size, those with recombination lower than 1.137 cM/Mb, and those with recombination higher than this value. To count the disease variants at each disease gene, we count not only the OMIM/Uniprot disease variants for that gene, but also all the other OMIM/Uniprot disease variants that occur in a 500kb window centered on that gene. We do this because the recessive deleterious variants from other nearby disease genes may also interfere with adaptation. Half of disease genes have fewer than five OMIM/Uniprot disease variants, and half have five or more.

Impact of functional differences between disease and non-disease genes on the sweep deficit

The sweep deficit at disease genes could be due to a different representation of gene functions at disease genes compared to control non-disease genes. In this case, disease genes would have less adaptation not because they are disease genes, but because the gene functions that are enriched