A hidden Markov model for investigating recent positive selection through haplotype structure
Introduction
Natural selection has played an important role in recent human evolution, and it continues to shape the pattern of genetic diversity in human populations. Genes under positive selection may be involved in adaptation to new environments and in resistance to infectious diseases (Hamblin et al., 2002, Bersaglieri et al., 2004, Tishkoff et al., 2007, Simonson et al., 2010, Yi et al., 2010, Beall et al., 2010, Peng et al., 2011, Xu et al., 2011, Xiang et al., 2013). In recent years, interest in detecting positive selection from DNA polymorphism data has grown, since the rapid accumulation of genome-wide polymorphism data provides a chance to systematically investigate the footprints of natural selection (Tajima, 1989, Fu and Li, 1993, Fay and Wu, 2000, Akey et al., 2002, Sabeti et al., 2002, Kim and Stephan, 2002, Nielsen et al., 2005, Voight et al., 2006, Tang et al., 2007, Sabeti et al., 2007, Williamson et al., 2007, Pickrell et al., 2009, Chen et al., 2010, Grossman et al., 2013). Recent positive selection (RPS), which occurred in the recent past and is still active, has gained particular attention. RPS can increase the frequency of advantageous alleles in a short time, and thus results in a high level of haplotype sharing in the vicinity of the selected mutant and in higher homozygosity among haplotypes carrying the selected allele than among those carrying the neutral allele. This unique pattern of multilocus haplotype structure enables the development of methods for identifying genes under RPS and for inferring the parameters of the selection process.
Statistical tests have been developed to test for natural selection based on the multilocus haplotype frequency distribution or haplotype structure (e.g., Ewens, 1972, Slatkin, 1994, Depaulis et al., 1998, Innan et al., 2005). Innan et al. (2005) presented a good review of these haplotype-based methods. Among the various haplotype-based tests, several exploit the specific haplotype structure caused by RPS by comparing the homozygosity level between selected and neutral haplotype groups (e.g., Hudson et al., 1994, Sabeti et al., 2002, Hanchard et al., 2005, Voight et al., 2006). The first test of this kind was proposed by Hudson et al. (1994). Their haplotype test was designed to examine a group of high-frequency haplotypes with little genetic variation among them. The test was carried out by estimating the probability of observing fewer polymorphic sites in repeated coalescent simulations given the sample size and allele counts. Hudson et al. (1994) applied the method to analyze the Sod gene in Drosophila melanogaster. At this locus, there are two alleles, labeled “slow” and “fast”. Hudson et al. (1994) found that there were no mutations within the slow allele group, which has a frequency of approximately 18%, and concluded that there was a significant deviation from neutrality. Sabeti et al. (2002) proposed the Relative Extended Haplotype Homozygosity (REHH) test, which starts by choosing a “core region” (Sabeti et al., 2002), a small region of very low historical recombination, and then calculates as the test statistic the ratio of the Extended Haplotype Homozygosity (EHH) of the core haplotype under test to that of the other core haplotypes. The significance level of the REHH test is obtained by coalescent simulation of neutral data that match the real data in haplotype group numbers and polymorphism level.
The method was applied to identify the selected core haplotypes in two malaria-resistance genes, G6PD and CD40, and later to the HapMap data for a genome-wide scan (Sabeti et al., 2007). The iHS (integrated haplotype score) test, a variant of the REHH test, was proposed by Voight et al. (2006). The integrated EHH (iHH, defined as the area under the EHH curve) is first estimated for the ancestral and derived alleles of the mutant. The iHS score, based on the log ratio of the two iHH values, is then standardized to follow an approximately normal distribution, and subsequently used to test for deviation from the neutral model.
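As a rough illustration of the EHH-based statistics described above, the following Python sketch computes EHH as the probability that two haplotypes drawn from a group are identical between the core marker and a flanking marker, integrates the EHH curve to obtain iHH, and forms an unstandardized log ratio of the ancestral and derived iHH values. The function names, toy haplotypes, and marker positions are hypothetical, and practical refinements (for example, truncating the EHH curve at a low threshold or standardizing against genome-wide values) are omitted.

```python
from collections import Counter
from math import comb, log


def ehh(haplotypes, core, end):
    """Probability that two haplotypes drawn without replacement from the
    group are identical at every marker between index `core` and `end`."""
    lo, hi = sorted((core, end))
    n = len(haplotypes)
    if n < 2:
        return 0.0
    # Count how many haplotypes share each distinct allele pattern.
    counts = Counter(tuple(h[lo:hi + 1]) for h in haplotypes)
    return sum(comb(c, 2) for c in counts.values()) / comb(n, 2)


def ihh(haplotypes, positions, core):
    """Area under the EHH curve across all markers (trapezoid rule)."""
    values = [ehh(haplotypes, core, j) for j in range(len(positions))]
    return sum(
        0.5 * (values[j] + values[j + 1]) * (positions[j + 1] - positions[j])
        for j in range(len(positions) - 1)
    )


# Toy data (hypothetical): five markers, core marker at index 2.
positions = [0.0, 1.0, 2.0, 3.0, 4.0]   # arbitrary map units
core = 2
derived = ["01110", "01110", "01110", "01110"]    # allele '1' at the core
ancestral = ["10001", "00000", "11010", "01001"]  # allele '0' at the core

ihh_derived = ihh(derived, positions, core)
ihh_ancestral = ihh(ancestral, positions, core)

# Unstandardized score: negative values mean longer shared haplotypes
# on the derived allele than on the ancestral allele.
unstandardized_ihs = log(ihh_ancestral / ihh_derived)
print(ihh_derived, ihh_ancestral, unstandardized_ihs)
```

In this toy example the derived haplotypes are identical across the whole window, so their EHH stays at 1 and the log ratio is negative, which is the direction expected when the derived allele sits on an unusually long shared haplotype.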
In addition to detecting selection, one may also be interested in estimating the selection intensity and the timing of the selection process. Several methods exist for this purpose (Slatkin and Rannala, 1997, Slatkin, 2000, Slatkin and Rannala, 2000, Slatkin, 2001, Slatkin, 2002, Kim and Stephan, 2002, Coop and Griffiths, 2004, Rannala and Reeve, 2003, Slatkin, 2008, Chen and Slatkin, 2013). Some of them consider a single marker linked to the selected locus (Slatkin and Rannala, 1997, Slatkin, 2000, Slatkin, 2001, Kim and Stephan, 2002), and only a few model the haplotype structure of multiple marker loci (Coop and Griffiths, 2004, Rannala and Reeve, 2003, Slatkin, 2008, Chen and Slatkin, 2013). Coop and Griffiths (2004) developed a full likelihood method under the structured-coalescent framework (Hudson and Kaplan, 1988). They adopted the time-reversible Moran model to first simulate the allele frequency trajectory of the selected mutant and then, conditioning on the trajectory, simulated the genealogical history of the sample. The limitation of their method is that only mutations among different haplotypes are considered and the method is applicable only to non-recombining regions. Rannala and Reeve (2003) modeled both recombination and mutation, but their method describes the haplotype structure in the vicinity of mutants under neutrality and makes the unrealistic assumption that allele frequencies at all loci remain constant during the selective process. Slatkin (2008) used a linear birth-and-death process to simulate the allelic genealogies of selected mutants and modeled the multilocus haplotype structure under the influence of both recombination and mutation. Chen and Slatkin (2013) also proposed a multilocus haplotype model that describes the dynamics of the haplotype structure under the joint effects of selection, recombination and mutation, by efficiently reducing the complexity of the state space.
Their method uses importance sampling to generate the historical allele frequency trajectory of the selected mutant, and thus works for populations with temporally changing size (Slatkin, 2001). All of these methods are coalescent-based and account for the randomness of the trajectory and genealogies by Monte Carlo averaging, which requires intensive computation. In comparison to these computationally intensive methods, the approach of Voight et al. (2006) is simplified and computationally feasible. Their method estimates the distance at which haplotype sharing decreases to a pre-chosen level, and then assumes that the decay of haplotype sharing follows a Poisson process. It further assumes that different haplotypes have independent histories, which avoids the intensive computation required to integrate over unknown gene genealogies, and it is thus suitable for whole-genome analysis.
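To make the Poisson-decay idea concrete, a minimal worked sketch follows. It is an illustration under stated assumptions rather than the exact estimator of any cited method: the symbols r (genetic distance from the selected site, in Morgans), t (generations since the common ancestor), and α (the pre-chosen sharing level) are introduced here, and each haplotype is assumed to accumulate recombination breaks independently at rate r per generation.

```latex
% One haplotype: no break within distance r over t generations.
\Pr(\text{unbroken to } r \mid t) = e^{-rt}
% Two independent haplotypes both unbroken, hence still identical out to r.
\Pr(\text{pair shared to } r \mid t) = e^{-rt}\, e^{-rt} = e^{-2rt}
% Distance r_\alpha at which sharing has decayed to the level \alpha.
e^{-2 r_\alpha t} = \alpha
\quad\Longrightarrow\quad
\hat{t} = -\frac{\ln \alpha}{2\, r_\alpha}
% Example: \alpha = 0.25,\ r_\alpha = 0.005 \text{ Morgans}
% \hat{t} = \ln 4 / 0.01 \approx 139 \text{ generations}
```

Because the pair probability factorizes into single-haplotype terms, no averaging over gene genealogies is needed, which is what keeps this kind of calculation cheap.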
In this paper, we propose a hidden Markov model to identify the ancestral haplotypes retained during the selective process, for the purpose of both detecting selection and estimating the selection intensity. Compared with existing methods such as the REHH and iHS tests, which use summary statistics to evaluate the similarity of haplotypes, our method is model-based, so it has the potential to be extended to more complicated scenarios, such as multiple ancestral haplotype groups (soft sweeps on standing variation, Hermisson and Pennings, 2005), haplotype data from multiple populations, and genotype data with unknown phase.
The method also differs from the aforementioned coalescent-based models in that we do not try to simulate gene genealogies among individuals, or the events occurring along the genealogies, by Markov chain Monte Carlo or importance sampling approaches (Slatkin, 2008, Coop and Griffiths, 2004, Chen and Slatkin, 2013). Our method is similar to that of Voight et al. (2006) in this respect. We treat each haplotype independently by assuming a “star” genealogy, and we ignore the randomness of the frequency trajectory of the selected allele. Both methods are computationally efficient and applicable to genome-wide analysis. Compared to Voight et al. (2006), our method provides a better estimate of the selection coefficient when the selected allele is common or nearly fixed, since we explicitly model the probability of effective recombination that breaks ancestral haplotype extents, which differs from the simple recombination process in Voight et al. (2006) and others. As we will show in a later section, when the selected mutant is at high frequency, the bias in the Voight et al. (2006) method can be as high as approximately 20%.
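The distinction between simple and effective recombination can be sketched numerically. The Python example below is an illustration under assumptions that are not taken from the paper: the selected allele follows a deterministic logistic trajectory with selection coefficient s and initial frequency x0, and a recombination event breaks the ancestral haplotype only when the partner chromosome carries the non-selected allele, which happens with probability 1 - x(t). All function names and parameter values are hypothetical.

```python
import math


def logistic_frequency(t, s, x0):
    """Deterministic frequency of the selected allele t generations after
    it arose at frequency x0 (assumed logistic growth at rate s)."""
    e = math.exp(s * t)
    return x0 * e / (1.0 - x0 + x0 * e)


def sweep_duration(x_now, s, x0):
    """Generations needed to rise from x0 to x_now on that trajectory."""
    return math.log(x_now * (1.0 - x0) / (x0 * (1.0 - x_now))) / s


def background_time(T, s, x0):
    """Closed form of the integral of 1 - x(u) over [0, T]: the number of
    generations 'weighted' by the chance of meeting a non-selected partner."""
    return T - math.log(1.0 - x0 + x0 * math.exp(s * T)) / s


def p_unbroken_simple(r, T):
    """Every recombination event within distance r breaks the haplotype."""
    return math.exp(-r * T)


def p_unbroken_effective(r, T, s, x0):
    """Only recombination with a non-selected background breaks it."""
    return math.exp(-r * background_time(T, s, x0))


# Hypothetical parameter values, for illustration only.
s, x0, r = 0.02, 1.0 / 20000, 0.001   # r in Morgans
for x_now in (0.40, 0.80):
    T = sweep_duration(x_now, s, x0)
    print(x_now, round(T, 1),
          p_unbroken_simple(r, T),
          p_unbroken_effective(r, T, s, x0))
```

Under these assumptions the two probabilities differ by a factor that depends only on how much time the allele has spent at appreciable frequency, so the gap widens as the current frequency approaches fixation, which is the regime where the text says the simple recombination process performs worst.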
The aim of this paper is twofold: first, we propose a hidden Markov model (HMM) that can explore the haplotype structure of a genomic region, and the inferred haplotype structure can be used to detect the existence of selection; second, we use a simplified population genetic model for the ancestral haplotype extent inferred from the HMM to estimate the selection intensity and the allele age. In the following sections, we first elucidate the details of the method. We then use coalescent simulations to investigate the power of detecting RPS and the accuracy of parameter estimation. Finally, we apply the method to several well-known genes under RPS to demonstrate its performance, including the lactase persistence gene (LCT) in Northern Europeans, and KITLG, TYRP1 and OCA2, known to affect skin pigmentation in Northern Europeans or East Asians.
Methods
In this section, we first present the HMM for identifying the extent of ancestral haplotypes. Two tests are further developed based on the HMM for detecting RPS. We then describe a population genetic model of hitchhiking. To be specific, we determine the allele frequency of a selected mutant and the approximate distribution of ancestral haplotype extents as a function of selection intensity and time, and then use this model to infer the selection intensity and the allele age of the selected
Power to detect selection
We used the coalescent simulator msms to generate haplotype samples under RPS, and used the samples to evaluate the power of this method in detecting RPS, and the accuracy and precision of selection coefficient estimation (Ewing and Hermisson, 2010). msms adopts a structured coalescent scheme to model the effect of a selective sweep on the genealogies of nearby loci. The allele frequency of the mutant at present was chosen to be 0.40 and 0.80, representing selected mutants with moderate and
Discussion
We present an HMM method for detecting recent positive selection and inferring selection intensity and allele age when there was a selective sweep. We have shown that the HMM method is effective in capturing the multilocus haplotype structure caused by RPS. Using coalescent simulations, we showed that the HMM method has more power to detect selection under a range of selection parameters than allele frequency spectrum-based methods, such as the CLR test (Nielsen et al., 2005). We also
Acknowledgments
We are grateful to Drs. Kun Chen, Thomas Mailund, Noah Rosenberg, and two anonymous reviewers for their insightful comments on an earlier version of the manuscript. We are grateful to Drs. António Santos and Jorge Rocha for providing their source code and guidance on using it, and to Jared Knoblauch for assistance with simulation and data analysis. This research was supported by NIH grants R01-GM40282 (to MS) and R01-GM078204 (to JH), and was supported in part by the National Science
References (71)
- et al. Genetic signatures of strong recent positive selection at the lactase gene. Am. J. Hum. Genet. (2004)
- The joint allele frequency spectrum of multiple populations: a coalescent theory approach. Theor. Popul. Biol. (2012)
- et al. Approximating selective sweeps. Theor. Popul. Biol. (2004)
- The sampling theory of selectively neutral alleles. Theor. Popul. Biol. (1972)
- et al. Simulating probability distributions in the coalescent. Theor. Popul. Biol. (1994)
- et al. Identifying recent adaptations in large-scale genomic data. Cell (2013)
- et al. Complex signatures of natural selection at the Duffy blood group locus. Am. J. Hum. Genet. (2002)
- et al. The evolution of human skin coloration. J. Hum. Evol. (2000)
- et al. Modeling recent human evolution in mice by expression of a selected EDAR variant. Cell (2013)
- et al. Assessment of linkage disequilibrium by the decay of haplotype sharing, with application to fine scale genetic mapping. Am. J. Hum. Genet. (1999)
- A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet.
- The effect of strongly selected substitutions on neutral polymorphism: analytical results based on diffusion theory. Theor. Popul. Biol.
- Reconstructing genetic ancestry blocks in admixed individuals. Am. J. Hum. Genet.
- Interrogating a high-density SNP map for signatures of natural selection. Genome Res.
- Natural selection on EPAS1 (HIF2) associated with low hemoglobin concentration in Tibetan highlanders. Proc. Natl. Acad. Sci. USA
- The timing of pigmentation lightening in Europeans. Mol. Biol. Evol.
- The hitchhiking effect on the site frequency spectrum of DNA polymorphisms. Genetics
- Positive selection in East Asians for an EDAR allele that enhances NF-κB activation. PLoS One
- Asymptotic distributions of coalescence times and ancestral lineage numbers for populations with temporally varying size. Genetics
- Population differentiation as a test for selective sweeps. Genome Res.
- Inferring selection intensity and allele age from multi-locus haplotype structure. Genes Genomes Genet.
- Ancestral inference on gene trees under selection. Theor. Popul. Biol.
- Neutrality tests based on the distribution of haplotypes under an infinite-site model. Mol. Biol. Evol.
- Biological Sequence Analysis
- Association of the OCA2 polymorphism His615Arg with melanin content in East Asian populations: further evidence of convergent evolution of skin pigmentation. PLoS Genet.
- MSMS: a coalescent simulation program including recombination, demographic structure and selection at a single locus. Bioinformatics
- Hitchhiking under positive Darwinian selection. Genetics
- Statistical tests of neutrality of mutations. Genetics
- Demographic history and rare allele sharing among human populations. Proc. Natl. Acad. Sci. USA
- Asymptotic line-of-descent distributions. J. Math. Biol.
- Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet.
- Screening for recently selected alleles by analysis of human haplotype similarity. Am. J. Hum. Genet.
- Soft sweeps: molecular population genetics of adaptation from standing genetic variation. Genetics
- Evidence for positive selection in the superoxide dismutase (Sod) region of Drosophila melanogaster. Genetics
- The coalescent process in models with selection and recombination. Genetics