Mixture Density Regression reveals frequent recent adaptation in the human genome

How much genome differences between species reflect neutral or adaptive evolution is a central question in evolutionary genomics. In humans and other mammals, the prevalence of adaptive versus neutral genomic evolution has proven particularly difficult to quantify. The difficulty notably stems from the highly heterogeneous organization of mammalian genomes at multiple levels (functional sequence density, recombination, etc.) that complicates the interpretation and distinction of adaptive vs. neutral evolution signals. Here, we introduce Mixture Density Regressions (MDRs) for the study of the determinants of recent adaptation in the human genome. MDRs provide a flexible regression model based on multiple Gaussian distributions. We use MDRs to model the association between recent selection signals and multiple genomic factors likely to affect positive selection, if the latter was common enough in the first place to generate these associations. We find that a MDR model with two Gaussian distributions provides an excellent fit to the genome-wide distribution of a common sweep summary statistic (iHS), with one of the two distributions likely enriched in positive selection. We further find several factors associated with recent adaptation, including the recombination rate, the density of regulatory elements in immune cells, GC-content, gene expression in immune cells, the density of mammal-wide conserved elements, and the distance to the nearest virus-interacting gene. These results support that strong positive selection was relatively common in recent human evolution and highlight MDRs as a powerful tool to make sense of signals of recent genomic adaptation. Author Summary Over the last 50,000 years, human populations have been exposed to selective pressures that can trigger adaptation in the genome. The search for signals of these selective events is however obscured by the substantial variation of factors that are relevant for the prevalence and detection of adaptation across the genome. Here, we analyze the impact of multiple factors on positive selection using a biologically meaningful approach that considers the influence of adaptative and non-adaptive processes in the human genome. Our results show that this novel approach is better suited to find genomic factors associated with recent positive selection compared to the classic regression or correlation approaches. We find multiple genomic functional elements associated with selection across the genome, including novel associations that emerge only after controlling for multiple confounding factors. This strongly suggests that adaptation was common enough in recent evolutionary times to produce a widespread correlation between functional elements and positive selection in the human genome. Note on the language used in this manuscript In this manuscript we use discrete population groups, such as the Yoruba. We want to emphasize that these discrete groups are only used for convenience and clarity when presenting our results, but in fact represent arbitrary human constructions, the same way that the boundaries of countries are arbitrary. There is an unbroken continuum and mixing of geographical ancestries across groups often identified as distinct populations across the world. The discrete groups we use are, as such, by no mean discrete genetic entities. Once grouped together, the grouped individuals only happen to be genetically more similar, with their ancestries coming from specific geographic locations more predominantly than individuals from the other groups. How much more predominantly is completely arbitrary.


74
The characteristics of the human genome can influence both the occurrence and the 75 detection of recent positive selection in a specific genomic region. Similar to other 76 mammals, the human genome has a complex organization, with highly heterogeneous 77 recombination rates and distribution of functional elements (regulatory or coding) along 78 chromosomes. This inherent heterogeneity in the factors that are likely to influence both 79 the occurrence of positive selection and our ability to detect it, has complicated the study 80 background selection, has contributed to a persistent debate around the prevalence of 111 selective sweeps in the human genome [1,2]. 112 If recent adaptation and selective sweeps in particular were relatively common 113 during recent human evolution, we expect that other factors on top of recombination 114 should be relevant for their occurrence and detection in the human genome. Generally, 115 we expect that selective sweeps should occur more frequently around functional 116 segments of the genome where adaptive mutations are expected to take place [1]. 117 Therefore, we should find enough signals of positive association between selective 118 sweeps and the overall density of functional elements, either coding or non-coding. In a 119 pioneer analysis, Barreiro et al. (2008) showed for the first time a relationship between 120 genome-wide signatures of positive selection and the distribution of functional elements. 121 Among those highly differentiated Single Nucleotide Polymorphisms (SNPs), they found 122 an excess of non-synonymous SNPs compared to non-genic SNPs. Note, however, that 123 they did not explicitly control for the impact of background Selection (BGS), the 124 coincident removal of neutral variants together with genetically linked deleterious 125 mutations, which varies between the classes analyzed [2,23]. We also expect specific 126 functions to be associated with an increased occurrence of selective sweeps. statistical power to detect strong and recent incomplete sweeps [5,9,30,31]. Second, they 139 are not confounded by BGS [1,32]. This insensitivity is a particularly important attribute 140 when studying genomic regions subject to different levels of BGS. Given that multiple 141 genomic factors correlate with this process, their association with selection can be 142 confounded if the summary statistic used is also sensitive to BGS. In this regard, the 143 integrated Haplotype Score (iHS) [5] is a haplotype-based statistic that has been 144 extensively tested [1,5,8,25], and can detect selection signals both from de novo 145 mutations and to some extent from standing genetic variation, provided that selection 146 started from not too high initial allele frequency [9,30]. This summary statistic can be 147 used to scan many genomes given the availability of fast implementations [33]. 148 Therefore, we use iHS as a measure of recent positive selection across the human 149 genome. As shown in Figure 1 (see also Methods), iHS has a complex, asymmetric 150 distribution across the genome that is not well captured by classic distributions 151 (Supplemental Results S1: Figure S1). This raises an important limitation of violated 152 assumptions when using classic regression or correlation analysis to find the genomic 153 determinants of iHS, with for example the assumption of statistics following a gaussian 154 distribution clearly not met. Furthermore, classic regression and correlation analyses may 155 not capture well the process of recent positive selection through selective sweeps 156 measured by iHS. Indeed, classic regression and correlation assume a linear association 157 between iHS and a factor contributing to recent positive selection in a distributed way 158 across the entire genome. This neglects the localized nature of recent selection events, 159 that have likely affected only a limited genomic portion, while leaving the rest of the 160 genome (and likely substantial majority) without any correlation between iHS and a 161 contributing factor, even though the latter did contribute to selection occurring locally in 162 the genome. This creates a situation where only the higher values of a statistic such as 163 iHS, those that are more likely to represent selection, might correlate at all with genomic 164 factors influencing selection. In this respect, the distribution of iHS is particularly 165 interesting, with an upper tail that is much heavier than the lower tail. We can 166 hypothesize that localized selection in a limited portion of the genome might have 167 generated this heavy tail, rather than a scenario where more generalized selection 168 covering the entire genome would have shifted the whole distribution. What tools then 169 can we use to properly account for this complex distribution of iHS? 170 Here, we model the genomic determinants of recent positive selection in the 171 human genome. We measure recent selective sweeps with the iHS statistic from five 172 human populations represented in the 1,000 Genomes phase 3 dataset [34]. The iHS 173 statistic has more power to detect recent and incomplete sweeps, while it has reduced 174 power to detect complete sweeps or sweeps more than 30,000 years old [9,15]. Therefore, 175 we restrict our analyses to selection that occurred after the main human migration out of 176

205
We first validate that the MDR with two gaussian distributions fits the distribution of iHS 206 much better than a single gaussian (Supplemental Results S1: Figure S1). The tight, 207 visually obvious fit between the observed iHS distribution and the two gaussians in our 208 MDR model (Figure 1) also shows that more than two gaussians are very likely not 209 needed and would negligibly improve the fit. We consider multiple genomic factors to 210 model the association with recent selection signals. These factors are possible 211 determinants of selection according to previous evidence [1,5,9,17,18,29,37,38]. We use 212 a gene-centric perspective where each individual protein-coding gene in the human 213 genome is associated with a set of factors measured by centering a genomic window on 214 each gene (Methods). This approach is likely to detect the influence of factors that differ 215 between genes, rather than only differences between genic and intergenic regions. The set 216 of factors in the MDR model includes several genomic features, like the length of the 217 gene at the center of the window or recombination estimates from the latest deCODE 218 genetic map (Methods). We obtain other genomic factors, like the GC-content and the 219 densities of genes, coding sequences and conserved elements. Moreover, we include the 220 density of transcription factor binding sites according to the hypersensitivity to DNaseI 221 and chromatin immunoprecipitation (ChIP-seq) experiments [39]. In the latter case, we 222 also obtain the density of regulatory elements for specific subsets of cell lines: testis and 223 immune cells. Genomic factors that are calculated across the genomic windows can vary 242 depending on the window size used. Therefore, we infer the association between recent 243 selection and factors measured in genomic windows of varying sizes (Methods). In total, 244 we use five different window sizes (50, 100, 200, 500 and 1,000 kb) centered at the 245 genomic center of Ensembl v99 coding genes [41]. For example, we measure the 246 association between the average iHS within 50kb windows and the average 247 recombination rate within the same 50kb windows, or between the average iHS and 248 recombination within the same 1000kb windows (Methods). Using different window 249 sizes also relaxes assumptions about the expected strength of selection, as larger windows 250 are likely more sensitive to strong selection compared to smaller windows, and vice versa 251  (Table 1). Unsurprisingly, we find the strongest association with 339 recombination rate, which negatively correlates with iHS (e.g., Yoruba slope=-2.44; p-340 value<1E-16; Table 1). This is congruent with the eroding effect of recombination on 341 selection signatures, impacting the probability to detect selective sweeps. Recombination 342 can break haplotypes generated by selection, making it more difficult to detect the 343 selective events [17,18]. These results confirm the role of recombination as a key 344 determinant of the probability of detecting selection with iHS [8]. We also find an 345 association between GC-content and recent positive selection. When the effect of 346 recombination is not accounted for, that is, recombination is not included in the model, 347 GC-content negatively correlates with iHS (Yoruba slope=-0.29; p-value=1.22E-03; 348 Supplemental Results S3: Table S1). This is expected given that GC-content is positively 349 associated with long-term recombination [43]. However, adding the control for 350 recombination reveals a positive association between GC-content and iHS (Yoruba 351 slope=0.55; p-value=1.34E-06; Table 1). Given that recombination already explains a 352 great proportion of iHS variability including that shared with GC-content, the positive 353 association between GC-content and iHS could be independent from recombination. 354 Accounting for the eroding effect of recombination on haplotypes could make more  Table 1; Supplemental Results S3: Table S2). Since GC-content has been 364 positively associated with the density of functional elements [44,45], removing GC-365 content may thus make more visible a positive association between iHS and functional 366 density. We further test if this is also the case for more tissue-specific functional densities 367 such as the density of regulatory elements in testis or immune cells that are expected to 368 exhibit more positive selection (Methods), but we find no such evidence. Results S3: Table S3). From this model, we then further remove additional functional 378 factors whose association with selection is also affected by the removal of GC-content, 379 namely, gene number, gene length and ChIP-seq regulatory density (Methods). 380 Accordingly, the removal of these functional factors leads to a highly significant 381 association between GC-content and recent positive selection (slope=0.49; p-value<1E-382 16; Supplemental Results S3: Table S4). In summary, the removal of different functional 383 genomic factors from the model supports that the positive association between positive 384 selection and GC-content is mediated by the role of this factor as proxy of overall 385 functional density. However, we cannot exclude the implication of other processes 386 related to recombination that would require more detailed analyses (for example, see fine 387 scale patterns related to regulatory density below). 388 Another genomic factor associated with selection is the density of conserved 389 elements (Methods), showing a positive association with iHS (Yoruba slope=0.16; p-390 value=6.4E-04; Table 1). This is an expected result since coding sequences and non-391 coding regulatory elements tend to be conserved [37]. This factor shows a significant 392 association in all populations except in the Peruvian population sample, which might 393 reflect the impact of past bottlenecks on the visibility of the selective patterns (Table 1). 394 The density of conserved elements is thus likely to be a good proxy of overall (coding 395 and non-coding) functional density. The inclusion of this factor in the model could then, 396 however, explain the lack of association for other factors related to functional density 397 such as coding density. Indeed, the removal of conserved elements density makes 398 significant the positive association between coding density and selection in Yoruba 399 (Yoruba slope=0.26 vs. 0.16; p-value=2.35E-03 vs. 8.41E-02; Table 1; Supplemental 400 Results S3: Table S5). All other populations have a clear significant positive correlation 401 of iHS with coding density even without removing the density of conserved elements 402 (Table 1). In addition, the simultaneous removal of conserved elements density and GC-403 content, which can also act as a proxy of functional density, increases even more the 404 positive association of coding density with iHS in Yoruba (slope=0.29; p-value=7.76E-405 04; Supplemental Results S3: Table S6). These, and previous results about GC-content 406 suggest that the lack of association for multiple functional factors could be explained, at 407 least partially, by the simultaneous consideration of better measures of overall functional 408 density in our models. 409 The number of protein-protein interactions is another factor that correlates with 410 iHS. This factor associates negatively with selection, suggesting the existence of a slight 411 depletion of selection for regions with higher number of PPIs (Yoruba slope=-0.07; p-412 value=4.25E-02; Table 1). This is opposite to that found by Luisi  The modeling approach presented in this study can also be used to test the 422 influence of specific selective pressures. We illustrate this feature by showing that 423 positive selection is associated with multiple variables related to viral interaction (Table  424 1). One of the factors more highly associated with iHS is the distance to VIPs in all tested 425 populations (Yoruba slope=-0.17; p-value=4.16E-06; Table 1). Selection decreases 426 further away from VIPs, in other words, we find an enrichment of selection around VIPs. 427 Viruses have acted as a strong selective pressure during human evolution, shaping 428 genomic adaptation, and previous studies have also found more positive selection at VIPs 429 [9,24]. Gene expression is another functional factor significantly associated with 430 selection ( Table 1) Overall, we find positive associations between the density of regulatory elements 437 and iHS except in Yoruba where this association is not significant (Table 1; see below for  438 analyses under neutral conditions explaining these results). However, the densities of 439 regulatory elements in tissues expected to experience more positive selection (testis and 440 immune cells) show surprising patterns of association with iHS (Table 1). In particular, a 441 higher density of transcription factor binding sites in testis correlates with weaker iHS 442 selection signals (slope=-0.79; p-value<1E-16). This is a counterintuitive result, given 443 that more regulatory density would give more options to modulate gene expression and 444 hence more room for positive selection to act. An explanation could be the following: is likely unable to detect the effect of these fine-scale patterns of recombination. We also 458 observe a surprising lack of association between immune regulatory density and iHS 459 (Table 1), even though we find a strong positive association between iHS and immune 460 gene expression. Here again we find that the fine-scale patterns of recombination close to 461 immune regulatory elements are likely to blame, with an increase in the fine-scale 462 recombination rate as one gets closer to immune regulatory elements (Supplemental 463 Results S4: Figure S3). It is surprising that we find a positive association between iHS 464 and regulatory density across multiple tissues, but not when focusing on specific tissues 465 where selection is expected to act (i.e., testis and immune cells). Although recombination 466 rate tends to concentrate close to all regulatory elements, as it does more specifically 467 around regulatory elements in testis and immune cells (Supplemental Results S4: Figure  468 S1-S3), recombination declines rapidly when considering regulatory elements of all 469 tissues compared to testis and immune cells (Supplemental Results S4: Figure S4). This 470 suggests that we could have more power to detect positive associations with iHS and 471 regulatory density in all tissues compared to testis or immune cells (see below for further 472 evidence). 473 474

Robustness of recent selection patterns to varying genomic window sizes in Yoruba 475
Next, we ask if the trends observed with 1,000 kb windows in the Yoruba genomes also 476 hold or not when using smaller window sizes (Supplemental Results S2: Tables S1-S5).  Results S2: Tables S2, S5). These results suggest that both factors may correlate with 489 strong rather than weak selection. In the case of testis regulatory density, this pattern 490 might be explained by the fact that smaller windows take better into account the fine 491 scale patterns of recombination, which limits the negative influence of using average 492 recombination (see below for fine-scale analyses of recombination). Note, however, that 493 all these differences might also be explained by the fact that smaller windows do not 494 differentiate well between inside and outside of the genomic regions they are supposed to 495 delineate. That is, they may be more influenced by the surrounding, genetically linked 496 genomic regions [1,9]. See Supplemental Results S2 for the complete results across 497 populations and window sizes. 498 499 Population simulations of neutral expectations of associations with iHS 500 As described above, we find multiple strong functional associations with iHS in the 501 directions expected under recent positive selection. That said, some observations remain 502 unexplained at this point, with for example a lack of positive association between overall 503 regulatory density and iHS in Yoruba. We also observe a lack of association between the 504 overall sum of regulatory plus coding functional density and iHS in Yoruba (slope=-505 0.063; p-value=0.65; Supplemental Results S5: Figure S1). The association of this overall 506 coding plus regulatory density with iHS is strongly positive in the other tested 507 populations (Supplemental Results S5: Figure S1). This prompted us to better 508 characterize the null expectations of the associations between this overall coding plus 509 regulatory density (as a measure encompassing the different functional types of elements 510 considered in our model) and iHS in the absence of positive selection. In particular, the 511 heterogenous, fine scale patterns of recombination and their variation around functional 512 elements (Supplemental Results S4: Figures S1-S5) might make null neutral expectations 513 deviate from a simple lack of association with slopes centered around zero. 514 We therefore use forward SLiM [48] neutral simulations to estimate the neutral 515 expected associations between the overall coding plus regulatory density and iHS in our 516 analysis (Methods). Under neutral conditions (i.e., only neutral mutations), and recreating 517 the actual distribution of recombination rate and functional (coding plus regulatory) 518 elements (Methods), we find that the association between iHS and functional density 519 tends to be negative, and in every case much less positive compared to the studied 520 populations including Yoruba (Supplemental Results S5, Figure S1). In addition, the 521 magnitude of the second Gaussian component (estimated with p, see Methods) is also 522 lower under neutral expectations compared to Yoruba (Supplemental Results S5: Figure  523 S2). This suggests that the results observed with the MDR approach cannot be explained 524 by neutral evolution. Moreover, these results support that, although the second Gaussian 525 distribution captures neutral loci as shown by its presence even under strictly neutral 526 evolution (Supplemental Results S5: Figure S2), this component is also sensitive to 527 adaptation, supporting its enrichment in recent positive selection. These results also show 528 that the lack of association between iHS and functional coding plus regulatory density in 529 Yoruba does not represent a lack of support for positive selection. Indeed, the null 530 expected distribution for this association is negative and the observed Yoruba association 531 is clearly above the neutral distribution (Supplemental Results S5: Figure S1). In the 532 discussion we mention multiple reasons why recent positive selection may have been not 533 less, or even more common in the Yoruba, while still having a weaker impact on iHS. 534 535

Fine scale patterns of recombination around regulatory elements 536
The increase of recombination rate around regulatory elements may explain the fact that 537 the null expectations for the association between functional density and iHS is negative 538 (Supplemental Results S4: Figures S1-S5). Our neutral simulations recreate the 539 distribution of recombination rate and functional (coding plus regulatory) elements of the 540 human genome, thus regulatory and coding elements tend to be close to recombination 541 peaks also in the simulations (Supplemental Results S4: Figure S5), leading to lower iHS 542 and lowering the null expectation. This led us to further analyze the influence of local 543 patterns of recombination around regulatory elements. Given the local increase of 544 recombination around regulatory elements, it is likely that whole window estimates of 545 recombination are not a sufficient measure of its effect on the association between iHS 546 and regulatory densities. In particular, this might explain the surprising negative 547 association between iHS and testis regulatory density or the lack of association for 548 immune regulatory density (Table 1). Indeed, recombination rate tends to be higher 549 specifically around these regulatory elements compared to the whole set of regulatory 550 elements (Supplemental Results S4: Figure S4). We use three variables related to the 551 recombination around regulatory elements in immune cells, testis and across multiple 552 tissues, respectively. We calculate the more local, average recombination around 553 regulatory elements within each gene window, up to a maximum distance of 5kb from 554 each side of regulatory elements. Each variable is included in the original model of 555 Yoruba 1000 kb separately, and replaces the previous, window-wide average 556 recombination. Given that some gene windows can overlap with more regulatory 557 elements than others, we also consider the number of recombination data points obtained 558 for regulatory elements inside each gene window. In this way, we account for the fact 559 that recombination may vary more broadly in windows that have less regulatory 560 elements. See Supplemental Results S4 for further details about these calculations. 561 The iHS statistic decreases around regulatory elements in immune cells while 562 recombination increases, staying very high further from immune regulatory elements 563 compared to other tissues (Supplemental Results S4: Figures S1-S4). This might have 564 confounded the expected positive association with selection in the original regression 565 model with window-wide recombination rates. The control for local patterns of 566 recombination around immune regulatory elements should however increase our power to 567 detect this association. In line with this prediction, we find that the association between 568 immune regulatory density and iHS becomes positive and highly significant when 569 considering the local patterns of recombination around immune regulatory elements 570 (Yoruba slope=0.68 vs. 0.047, p-value=4.60E-14 vs. 6.24E-01; Table 1; Supplemental 571 Results S4: Table S1). In order to assess whether this increase is caused just by the lack 572 of consideration of recombination patterns at a window-wide scale, we repeat this 573 analysis but including the original window-wide average recombination variable. After 574 including the average recombination rate at window-wide scale, the positive association 575 remains stronger compared to the original model (slope=0.28 vs. 0.047, p-value=5.72E-576 03 vs. 6.24E-01; Table 1; Supplemental Results S4: Table S2). 577 As explained above, we hypothesize that the confounding effect of recombination 578 is caused by a mismatch between different scales. Gene windows with low recombination 579 can have local peaks of recombination around transcription factor binding sites 580 (Supplemental Results S4: Figures S1-S4), as these peaks are more accessible to 581 transcription factors [47]. Therefore, recombination could erode signals of positive 582 selection around regulatory elements even if the region has an overall low recombination. 583 In contrast, regions with high recombination at the average window-wide level should 584 suffer less from this confounding effect, as recombination rate is then high both at a local 585 and window-wide scale. Therefore, focusing on high-recombination regions should 586 improve our ability to detect the expected positive association between selection and 587 regulatory density. This approach, however, has the caveat of reducing the power to 588 detect positive selection due the higher probability that recombination breaks selected 589 haplotypes. We nevertheless repeat the analysis focusing only on gene windows with a 590 recombination rate equal or higher than the second tertile (1.552 cM/Mb). Results S1: Tables S1-S2). For example, they are unable to replicate the strong 652 enrichment in positive selection around VIPs reported by previous studies [9,24]. This is 653 also expected given the poor fit of a single Gaussian distribution to the whole iHS 654 distribution compared to our approach assuming two Gaussian distributions 655 (Supplemental Results S1: Figure S1). Therefore, our results support the consideration of 656 two distributions to model positive selection in the human genome. The characteristics of 657 our MDR approach make it possible to reveal expected, but also unexpected patterns.  we again used linear interpolation to estimate its genetic position. We followed the same 854 approach used to calculate the genetic position of window edges during recombination 855 calculations. In this case, we searched for genetic position data points at both sides of the 856 focal variant until 1,000 kb. We discarded variants with no data points around, as they 857 were present in genomic regions with low density of recombination data. The genetic 858 maps based on the deCODE 2019 map were used as an input in order to calculate the iHS 859 for the 5 populations. We modeled the association between genomic factors and iHS in each population and 876 window size by combining a Gaussian mixture distribution with linear regression. 877 Mixture distributions are useful to model data that cannot be fully described by a single 878 distribution and that are likely generated by multiple processes. In this case, a mixture of 879 two Gaussian distributions could fit better a scenario where selection influences some 880 genomic regions but not others, and hence a bimodal distribution of iHS would be 881 observed across the genome. Therefore, we posit that sweep signals in the human genome 882 may have components differentially influenced by positive selection. According to this, 883 we assumed that observed log iHS followed a mixture of two Gaussian distributions 884 We estimated model parameters by maximizing the log likelihood of the entire 915 dataset (L-BFGS-B optimization). These parameters included the mean and standard 916 deviation of each Gaussian distribution, along with the intercept and the slopes 917 representing the association between the selection-enriched component of iHS and each 918 genomic factor. For each genomic factor, we tested the significance of its slope by 919 comparing the log likelihood between the full model with all genomic factors and a 920 nested null model without the selected factor. We assumed that two times the difference 921 in log likelihood between the two models followed a chi-square distribution with one 922 degree of freedom. The script and input data needed to run the MDR are publicly 923 available at https://github.com/dtortosa/Mixture_Density_Regression_pipeline. 924 925

Population simulations to recreate the neutral expectation 926
We used SLiM [48] to simulate the neutral expectations in our analyses and assess in that 927 way if our results could be fully explained by neutral patterns. First, we randomly 928 selected 100 genomic segments of 20 mb each one, totaling 2000 mb. We recreated the 929 functional (coding and regulatory) elements observed in the human genome using the 930 same coordinates considered in the calculation of functional densities for the main 931 analyses (see Genomic features section). We also considered the same recombination 932 map (deCODE 2019) to recreate the actual recombination heterogeneity in these 933 segments. Therefore, we simulated a total of 2000 mb using the observed distribution of 934 functional elements and recombination rate. We only simulated neutral mutations in all 935 the simulated genomic regions. The population size was set to 10,000 individuals. We 936 first run a burn-in period of 100,000 generations, running then 100 independent neutral 937 simulations with 2,000 additional generations. For each independent neutral simulation, 938 we calculated iHS using the same approach than in the actual genomes. As the simulated 939 segments had direct correspondence with the human genome, we extracted the genes 940 included in each segment, and calculate 1000 kb windows centered around them. Inside 941 each window, we calculated the average of iHS and the number of iHS data points, along 942 with recombination rate and functional density as done in the actual genomes. Finally, we 943 used the MDR to model iHS as a function of recombination rate, functional density and 944 the number of iHS data points measured all in 1000 kb windows. Therefore, we model 945 iHS under neutral conditions across 100 independent runs simulating each one 2000 mb. 946 In addition to the genomic factors already mentioned, we also included the GC-content 947 observed for each gene window in the actual genome. It can be useful to simulate GC to 948 control for varying mutations rates, however we used a uniform mutation rate across the 949 simulated regions, so we did not include it in the simulations. We are still adding it as a 950 predictor using the GC-content in the actual gene windows because this factor is 951 associated with both functional density and long-term recombination [43][44][45]. This can 952 help to make visible more local patterns of recombination. In addition, it makes fairer the 953 comparison between the neutral simulations and the observed genomes, as in the latter 954 iHS was also modeled using GC-content. 955 which was not exposed to bottlenecks in the same degree than the rest of studied 971 populations. 972 973