A new test suggests that balancing selection maintains hundreds of non-synonymous polymorphisms in the human genome

The role that balancing selection plays in the maintenance of genetic diversity remains unresolved. Here we introduce a new test, based on the McDonald-Kreitman test, in which the number of polymorphisms that are shared between populations is contrasted to those that are private at selected and neutral sites. We show that this simple test is robust to a variety of demographic changes, and that it can also give a direct estimate of the number of shared polymorphisms that are directly maintained by balancing selection. We apply our method to population genomic data from humans and conclude that more than a thousand non-synonymous polymorphisms are subject to balancing selection.


Introduction 33
How genetic variation is maintained, either in the form of DNA sequence diversity or 34 quantitative genetic variation, remains one of the central problems of population genetics. 35 Balancing selection encapsulates several selective mechanisms that increase variability 36 within a population. These include heterozygote advantage (also referred to as 37 overdominance), frequency dependent selection, and selection that varies through space 38 and time (Nielsen, 2005). However, although there are some clear examples of each type of 39 selection (Allison, 1956;Nosil, et al., 2018), the overall role that balancing selection plays in However, these methods are generally quite complex to apply, often leveraging multiple 50 population genetic signatures of balancing selection and many require simulations to 51 determine the null distribution. Furthermore, they do not readily yield an estimate of the 52 number of polymorphisms that are directly subject to balancing selection, as opposed to 53 being in linkage disequilibrium. Here we introduce a method that is simple to apply and 54 which generates a direct estimate of the number of polymorphisms subject to balancing 55 selection. 56 57 One signature of balancing selection that has been utilised in several studies is the sharing 58 of polymorphisms between species (Asthana et al. 2004;Leffler et al. 2013; Gao et al. 2015). 59 If the species are sufficiently divergent that they are unlikely to share neutral 60 polymorphisms, then shared genetic variation can be attributed to balancing selection. 61 These studies have concluded that there are relatively few balanced polymorphisms that 62 are shared between humans and chimpanzees (Asthana et al. 2004; Leffler et al. 2013). 63 for all frequency bins except the lowest one (figure 1c). In each case Z increases as a 158 function of the time since the population split; this is to be expected because Z is related to 159 the proportion of shared polymorphisms that are subject directly to balancing selection (see 160 below), and as time progresses, so neutral and SDMs go to fixation or loss in one or both of 161 the descendant populations. Note, once again that the level of balancing selection in these 162 simulations is high because every locus contains a balanced polymorphism. The simulation above does not take into account the demographic effects that a division in a 178 population involves. We therefore performed more realistic simulations which involve 179 vicariance and dispersal scenarios with and without migration between the sampled 180 populations (Figures S1-S10). We also simulated with and without expansion after separation. We performed all simulations under two distributions of fitness effects (DFE), 182 which were estimated from human and Drosophila melanogaster populations. In the 183 vicariance scenario the ancestral population splits into two daughter populations of equal or 184 unequal sizes. In the dispersal scenario a single daughter population is generated by 185 duplicating part of the ancestral population, which remains the same size as it was before; 186 we vary the daughter population size. In both cases, we explore the consequences of 187 It is then straightforward to show that the proportion of shared non-synonymous 226 polymorphisms that are directly maintained by balancing selection is 227 This is clearly an unrealistic model in several respects. First, it can be expected that there 231 are slightly deleterious mutations in many populations and this will lead to an 232 underestimation of ; and second, it is likely that new balanced polymorphisms will be 233 arising all the time and these will contribute to private polymorphism, increasing RN/RS and 234 leading to a conservative estimate of . 235 236

Data analysis -humans 237
We have shown that the method has the potential to detect balancing selection under 238 realistic evolutionary models. We therefore applied our method to human data from the if the frequency of private and shared polymorphisms >0.1; we also find that Z>1 in the 243 South Asian and East Asian population comparison when using South Asian private polymorphisms and polymorphisms with frequencies > 0.1 (Figure 2). For several 245 comparisons we have no polymorphisms at the relevant frequencies, and many of the 246 confidence intervals on our estimates of Z are large. As a consequence, we combined the 247 data for all polymorphisms with frequencies > 0.1 (Figure 2, right-most point in each panel). 248 The patterns above are replicated; Z is significantly greater than one when we use private 249 polymorphisms from Africa, and in the comparison between the East and South Asian 250 populations, but Z<1 otherwise. In some cases, the Z value for the combined data can If we estimate to those comparisons in which Z is significantly greater than 1, we 280 estimate that approximately 12% of the non-synonymous shared polymorphisms between 281 the African and other human populations are subject to balancing selection, as well as 282 between the Asian populations (Table 1). These estimates are likely to be underestimates 283 because there will still be SDMs segregating in our data, even though we have removed the 284 lowest frequency variants (see simulation results). The proportions suggest that ~500 285 polymorphisms, which are shared between the African and other populations, are 286 maintained by balancing selection. We estimate rather more are maintained between the 287 two Asian populations, although the confidence intervals on this estimate are large. If we 288 combine data across the non-African populations, we estimate that ~1400 polymorphisms 289 are shared between African and non-African populations in total because of balancing 290 selection ( Table 1). The fact that the estimate from the African-non-African comparison is 291 larger than the estimate from the African versus each individual population, suggests that 292 many of the balanced polymorphisms shared between populations are unique to each 293 population pair -i.e. there are balanced polymorphisms shared between African and 294 European populations, that are not shared between Africans and East Asians.   A concern in any analysis of human population genetic data is the influence of biased gene 307 conversion (BGC). This process tends to increase the number and allele frequencies of 308 AT>GC mutations, and reduce the number and allele frequencies of GC>AT mutations. If this 309 process differentially affects synonymous and non-synonymous sites and shared and private 310 polymorphisms, then it could potentially lead to Z>1. To investigate whether BGC has an 311 effect we performed two analyses. In the first, we divided our genes according to whether 312 they were in high and low recombining regions, dividing the data at the median 313 recombination rate (RR). Our two groups differ substantially in their mean rate of 314 recombination (mean RR in low group = 1.2 x 10 -7 centimorgans per site and high group = 315 1.8 x 10 -6 centimorgans per site). We find that Z is actually higher in the low RR regions, 316 although not significantly so (  were generated by bootstrapping genes 100 times. 322 mutations that are not affected by BGC -i.e. G<>C and A<>T mutations. This reduces our 325 dataset by about 80%. As a consequence, we summed the data for all polymorphisms with 326 frequencies >0.1. We find that our estimate is largely unchanged from that when all 327 polymorphisms are included, however the confidence intervals are increased substantially 328 so that Z is only significantly greater than one for the African-non-African, and the South 329 versus East Asian comparisons (  1998). To investigate whether we could detect this signature in our data, we split our 346 dataset into HLA and non-HLA genes (The MHC sequencing consortium, 1999). Due to a lack 347 of private polymorphisms, we combined all frequency categories >0.1. We found Z was significantly greater in HLA than non-HLA genes for all population comparisons except 349 Europeans and South Asians using European private polymorphisms ( Figure S13). However, 350 the value of Z is greater than one in population comparisons in which African private 351 polymorphisms are used, consistent with balancing selection maintaining variation in the 352 HLA region. We estimate that a very substantial proportion of non-synonymous genetic 353 variation is being maintained by balancing selection, although the confidence intervals on 354 our estimates are large; roughly 50% of the shared non-synonymous SNPs are being 355 maintained by balancing selection between African and non-African populations in the HLA 356 region and this equates to approximately 200 polymorphisms (Table 4). 357 358 However, the signature of balancing selection that we have detected across all genes is not 359 simply due to the HLA genes. We find that Z>1 in non-HLA genes in most population 360 comparisons in which Z>1 for all genes, except in the comparison of East and South Asian 361   If we run our analysis grouping genes by their GO category and restricting the analysis to 382 those groups that have at least 100 polymorphisms with frequencies >0.1 we find 683 383 categories in which Z is significantly greater than 1 in at least population comparison 384 (Supplementary file S1) and we list those significant in 5 or more population comparisons in 385 Table 6. One of these GO categories, "nucleic acid binding" is shared across 7 of the 14 386 population comparisons, "endoplasmic reticulum membrane" across 6 population 387 comparisons; amongst those categories shared among 5 are "viral process" and "immune 388 system process", but there are many others which are surprising including "protein import 389 into the nucleus" (Supplementary file S1). Eighty-five categories are shared between 4 or 390 more population comparisons, and 155 amongst three or more population comparisons. 391 These include 7 categories related to immunity (including immune system process which is 392 significant in 5 population comparisons), and 40 categories that are linked to antigen 393 presentation though not classified as immune-related categories. There are also 5 viral-394 related categories (including viral process which is significant in 5 population comparisons).

Individual genes 407
We applied our statistic to individual human genes, combining all frequency bins (0-0.5) due 408 to a lack of polymorphism. We tested for significance using a one-tailed Fisher's exact test. 409 Of the 14,261 genes we analysed 514 had Z>1 in at least one population comparison. 410 Eighteen of these were nominally significant at p<0.1 (supplementary file S2), but no gene 411 was individually significant when we corrected for multiple testing using a Bonferroni 412 correction. Eighteen genes have Z>1 in at least 9 population comparisons; note that since 413 populations share polymorphisms, we cannot combine the evidence for balancing selection 414 across these populations (    The method can in principle be applied to any pair of populations or species. However, the 457 test is likely to be weak when the populations/species are closely related for two reasons. 458 First, there will be relatively few private polymorphisms, and second, the proportion of 459 shared polymorphisms that are subject to balancing selection is likely to be low, because so 460 many neutral polymorphisms are shared between populations because of recent common 461 ancestry. As the populations/species diverge so the number of private polymorphisms will 462 increase, and the proportion of shared polymorphisms that are balanced will increase. Of 463 course, as the time of divergence increases so the selective conditions that maintained the 464 polymorphism are likely to change and the polymorphism might become neutral, or subject 465 to directional selection. Our method is also likely, like all methods, to be better at detecting 466 balanced polymorphisms that are common, because most populations are dominated by 467 large numbers of rare neutral variants. Finally, our method requires that the neutral and 468 selected sites are interdigitated; the method is therefore easy to apply to protein coding 469 sequences, but may be more difficult to apply to other types of variation, such as that which 470 affects gene expression. the ancestral population is duplicated to form the daughter populations -this will lead to an 484 underestimate of Z because SDMs tend to contribute more to private than shared 485 polymorphism, and hence inflate RN/RS relative to SN/SS (Figure 1). Under more realistic 486 demographic models, in which at least one of the derived populations is reduced, this is 487 expected to depress Z in the population that is being reduced because more SDMs will tend 488 to segregate in smaller populations hence inflating RN/RS (compare Figure 1 to Figures S2  489   and S3). Simulations suggest there is however a slight increase in Z using private polymorphisms from the larger of the populations but these Z values never exceed 1 491 ( Figures S2 and S3). The second reason that we are likely underestimating the number of 492 balanced polymorphisms using our simple method is that we assume that there are no 493 balanced polymorphisms that are private to each population; these would inflate RN/RS. It might be argued that the evidence of balancing selection is weak because we typically find 525 Z>1 using the private polymorphism from only one of the two populations. However, we 526 have been unable to find a demographic model in which there is no balancing selection and 527 Z>1 -note that we never observe Z>1 under the Gravel et al. model of human demography 528 even when we change the parameters of the DFE and demographic model; furthermore, we 529 find that simulations which involve different demographies in the two populations generate 530 different Z values for the two populations, so there is an expectation in many species that 531 we will observe Z values that differ between populations. 532 533 Values of Z in excess of one could potentially be due to biased gene conversion; we expect 534 BGC to increase the allele frequency of AT>GC mutations and to decrease the frequency of 535 GC>AT mutations; this will tend to make AT>GC mutations more likely to be shared between 536 populations than GC>AT mutations. Since the GC-content at the third codon position is 537 typically higher than the GC content at the first two positions, this will mean that there are 538 more non-synonymous AT>GC mutations than synonymous, and hence more shared non-539 synonymous than synonymous polymorphisms. However, our results do not appear to be 540 affected by BGC; results are similar between genes in high and low recombination rate 541 regions of the human genome (Table 2), and our point estimates of Z are largely unaffected 542 by restricting the analysis to SNPs which are unaffected by BGC, although the confidence 543 intervals increase substantially (Table 3). 544

545
We estimate that there ~500 balanced polymorphisms shared between the African and each 546 of the other human populations, ~1250 shared between the Asian populations and ~1400 547 shared between the African and non-African populations; these are likely to be 548 underestimates due to the presence of slightly deleterious mutations. The fact that we 549 estimate that ~1400 shared non-synonymous polymorphisms are being maintained 550 between African and non-African populations, but only ~500 between each of the individual 551 populations and the African population suggests, that many of the polymorphisms shared 552 between the African and each individual population are unique -i.e. many polymorphisms shared between Africans and Europeans, may not be shared between Africans and Asians. 554 These numbers are substantial, but are consistent with those of Bitarello et al. (2018) who 555 estimated that ~8% of human genes show some evidence of long-term balancing selection, 556 and many of these signatures are shared between populations. However, their method 557 could not determine whether the balanced polymorphisms were coding or noncoding 558 mutations. 559 560 As expected, we find evidence of balancing selection affecting the HLA or MHC genes (table 561 4) Hedrick, 1998). However, we find evidence of balancing selection 562 even when these genes are removed from the analysis ( Table 5). The analysis of GO 563 categories shows that numerous categories show evidence of balancing selection across 564 multiple population comparisons (supplementary file S1). Some of these are expected, but 565 many are not, such as "nucleic acid binding", which is significant in 7 of the 14 population 566 populations (Gasper & Swanson, 2006). Two of these ten genes are associated with 576 tumours. MKI67 expression is associated with a higher tumour grade and early disease 577 recurrence (Yang, et al., 2017), and WDFY4 plays a critical role in the regulation of certain 578 viral and tumour antigens in dendritic cells (Theisen, et al., 2019). PKD1L2 is associated with 579 polycystic kidney disease and RP1L1 variants are associated with several retinal diseases 580 including occult macular dystrophy (Davidson, et al., 2013). SPTBN5 encodes for the 581 cytoskeletal protein spectrin, that plays a role in maintaining cytoskeletal structure (Huh,582 Glantz, Je, Morrow, & Kim, 2001) and C1orf167 expresses open reading frame protein that is 583 highly expressed in the testis (Fagerberg, et al., 2014). Finally, FAM230G is highly expressed 584 in testes (Delihas, 2018). 585 Twenty-five of the 514 with Z>1 genes overlap with those genes identified by Bitarello et al. 587 (2018), but this is similar to the level of overlap expected at random; i.e. they observed that 588 7.9% of protein coding genes overlapped regions identified by their method as being subject 589 to balancing selection, and we identified 514 candidates; so we expect 0.079 x 514 = 41 by 590 chance alone. The lack of a significant overlap is possibly not surprising; we have applied our C2. If C1 is in linkage disequilibrium (LD) with the A2 allele, and C2 is in LD with the B2 allele, 603 then C1C2 heterozygous individuals will have higher fitness than C1C1 and C2C2 604 homozygotes. This form of selection can lead to the maintenance of genetic variation (Zhao 605 and Charlesworth 2016) in low recombination rate regions. However, Z is not substantially 606 greater in regions of low recombination so AOD seems an unlikely explanation (Table 2).

Human data 619
Human variation data was obtained from 1000 genomes Grch37 vcf files (The 1000 620 Genomes Project Consortium, 2015). Variants were annotated using Annovar's hg19 621 database . The annotated data was then parsed to remove multi-622 nucleotide polymorphisms and indels. Because 1000 genomes data provides allele 623 frequencies for the non-reference allele rather than the minor allele, the minor allele 624 frequency for each superpopulation and also for the global minor allele frequency was 625 calculated. We used 1000 genomes from the African, South Asian, East Asian and European 626 populations. The American population was removed due to the fact that it is an admixed 627 population. GO category information was obtained from Ensembl's BioMart data mining 628 tool (Yates, et al., 2020). We used pyrho demography-aware recombination rate maps 629 (Spence & Song, 2019) for analyses that control for recombination rate. distribution of fitness effects for deleterious mutations was assumed to be a gamma 677 distribution using the parameters estimated from the African superpopulation using the 678 GammaZero model within the Grapes software (Galtier, 2016); the parameters are similar to those estimated by Eyre-Walker et al (2006), and used in the generic simulations (Gamma 680 shape = 0.17 and Mean Nes = 1144). We chose to infer the DFE for the African 681 superpopulation because this is currently the largest dataset available for a population that 682 has been inferred to be relatively stable. Dominance was calculated using the Huber model 683