Abstract
Balancing selection maintains advantageous diversity in populations through different mechanisms. While extensively explored from a theoretical perspective, an empirical understanding of its prevalence and targets lags behind our knowledge of positive selection. Here we describe a simple yet powerful statistic to detect signatures of long-term balancing selection (LTBS) based on the expectation that some types of LTBS result in an accumulation of polymorphic sites at moderate-to-intermediate frequencies. The Non-Central Deviation (NCD) quantifies the degree to which SNP frequencies within a window of a pre-defined size depart from deterministic expectations under balancing selection. The statistic can be implemented considering only polymorphisms (NCD1) or also including information on fixed differences (NCD2), and can detect LTBS under different frequencies of the balanced allele(s). Because of its simplicity, NCD can be applied to single loci or genomic data, and to populations with or without known demographic history. We show that, in humans, NCD1 and NCD2 have high power to detect long-term balancing selection, with NCD2 outperforming all existing methods. We applied NCD2 to genome-wide data from African and European human populations, and found that 0.6% of the analyzed windows show signatures of LTBS, corresponding to 0.8% of the base pairs and 1.6% of the SNPs in the analyzed genome. This suggests that, although not prevalent, LTBS affects the evolution of a sizable portion of the genome (overlapping ∼8% of protein-coding genes). These SNPs disproportionately overlap sites with protein-coding and amino-acid altering functions, but not putatively regulatory sites. Our catalog of candidates includes known targets of LTBS, but most candidates have not been previously identified. As expected, immune-related genes are among those with the strongest signatures, although most candidates are involved in other biological functions, suggesting that balancing selection potentially influences diverse human phenotypes.
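To make the idea of a deviation from a target frequency concrete, the following minimal Python sketch computes an NCD-like value for a single window. The abstract does not state the formula, so the form used here is an assumption: the root-mean-square deviation of minor allele frequencies from a target frequency, with each fixed difference counted as a site at frequency 0 in the NCD2 variant. The function name `ncd` and the parameters `target_freq` and `n_fixed_differences` are hypothetical and chosen for illustration only.

```python
import math
from typing import Sequence


def ncd(minor_allele_freqs: Sequence[float],
        target_freq: float = 0.5,
        n_fixed_differences: int = 0) -> float:
    """Illustrative NCD-like statistic for one genomic window.

    ASSUMPTION: NCD is taken to be the root-mean-square deviation of
    minor allele frequencies from `target_freq`. With
    n_fixed_differences == 0 this mimics NCD1 (polymorphisms only);
    with n_fixed_differences > 0 it mimics NCD2, where each fixed
    difference is treated as a site with frequency 0.
    """
    if not 0.0 < target_freq <= 0.5:
        raise ValueError("target_freq must be in (0, 0.5]")
    if n_fixed_differences < 0:
        raise ValueError("n_fixed_differences must be non-negative")
    for p in minor_allele_freqs:
        if not 0.0 < p <= 0.5:
            raise ValueError("minor allele frequencies must be in (0, 0.5]")

    n_sites = len(minor_allele_freqs) + n_fixed_differences
    if n_sites == 0:
        # No informative sites in the window: the value is undefined.
        return float("nan")

    # Squared deviations of the polymorphic sites from the target.
    sum_sq = sum((p - target_freq) ** 2 for p in minor_allele_freqs)
    # Each fixed difference contributes (0 - target_freq)^2.
    sum_sq += n_fixed_differences * target_freq ** 2
    return math.sqrt(sum_sq / n_sites)


# Hypothetical window: four SNPs and, for the NCD2-like variant,
# two fixed differences relative to an outgroup.
window_mafs = [0.45, 0.50, 0.40, 0.10]
ncd1_like = ncd(window_mafs, target_freq=0.5)
ncd2_like = ncd(window_mafs, target_freq=0.5, n_fixed_differences=2)
```

Under this assumed form, smaller values mean that the frequencies in the window sit closer to the target frequency, which is the pattern the statistic is designed to pick up. Adding fixed differences raises the value, so windows with many substitutions relative to polymorphisms look less like targets of LTBS.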
Author Summary
With the availability of whole-genome sequences on a population level, genetic variation in humans has been queried for signatures of natural selection. Most of these efforts have focused on positive selection, which results in novel adaptations. Balancing selection, an important form of natural selection that maintains advantageous genetic variants within populations, sometimes for millions of years, has attracted less attention. This is despite the important effects that variants under balancing selection have on phenotypic diversity and susceptibility to disease, as shown by the most eminent target of balancing selection: the Major Histocompatibility Complex locus (MHC, known as HLA in humans). We developed a statistic that identifies regions of the genome with signatures that are expected under balancing selection. This statistic has very high power to detect long-term balancing selection in humans, and it is simple enough to be used in a wide variety of species, giving it the potential to improve our understanding of balancing selection across taxonomic groups. When applied to human data, we find that long-term balancing selection has affected genomic regions that define the sequence of protein-coding genes more often than their regulation, and has targeted genes involved in immunity and a diversity of additional biological functions.