Elsevier

Theoretical Population Biology

Volume 99, February 2015, Pages 18-30

A hidden Markov model for investigating recent positive selection through haplotype structure

https://doi.org/10.1016/j.tpb.2014.11.001

Abstract

Recent positive selection can increase the frequency of an advantageous mutant rapidly enough that a relatively long ancestral haplotype remains intact around it. We present a hidden Markov model (HMM) to identify such haplotype structures. With the HMM-identified haplotype structures, a population genetic model for the extent of ancestral haplotypes is then adopted for parameter inference of the selection intensity and the allele age. Simulations show that this method can detect selection under a wide range of conditions and has higher power than the existing frequency spectrum-based method. In addition, it provides good estimates of the selection coefficients and allele ages for strong selection. The method analyzes large data sets in a reasonable amount of running time. This method is applied to HapMap III data for a genome scan, and identifies a list of candidate regions putatively under recent positive selection. It is also applied to several genes known to be under recent positive selection, including the LCT, KITLG and TYRP1 genes in Northern Europeans, and OCA2 in East Asians, to estimate their allele ages and selection coefficients.

Introduction

Natural selection plays an important role in the recent history of human evolution, and is still active in shaping the genetic diversity pattern of human populations. Genes under positive selection may be involved in the adaptation to new environments and in the resistance to infectious diseases (Hamblin et al., 2002, Bersaglieri et al., 2004, Tishkoff et al., 2007, Simonson et al., 2010, Yi et al., 2010, Beall et al., 2010, Peng et al., 2011, Xu et al., 2011, Xiang et al., 2013). In recent years, interest has been growing in detecting positive selection using DNA polymorphism data, since the rapid accumulation of genomic level molecular polymorphism data provides a chance to systematically investigate the footprints of natural selection (Tajima, 1989, Fu and Li, 1993, Fay and Wu, 2000, Akey et al., 2002, Sabeti et al., 2002, Kim and Stephan, 2002, Nielsen et al., 2005, Voight et al., 2006, Tang et al., 2007, Sabeti et al., 2007, Williamson et al., 2007, Pickrell et al., 2009, Chen et al., 2010, Grossman et al., 2013). Recent positive selection (RPS), which occurred in the recent past and is still active, has gained particular attention. RPS can increase the frequency of advantageous alleles in a short time, and thus result in a high level of haplotype sharing in the vicinity of the selected mutant, and higher homozygosity among the selected haplotypes than among those carrying the neutral allele. This unique pattern of multilocus haplotype structure enables methodology development for identifying genes under RPS and parameter inference of the selection process.

Statistical tests have been developed to test for natural selection based on multilocus haplotype frequency distribution or haplotype structure (e.g., Ewens, 1972, Slatkin, 1994, Depaulis et al., 1998, Innan et al., 2005). Innan et al. (2005) presented a good review of these haplotype-based methods. Among the various haplotype-based tests, several exploit the specific haplotype structure caused by RPS by comparing the homozygosity level between selected and neutral haplotype groups (e.g., Hudson et al., 1994, Sabeti et al., 2002, Hanchard et al., 2005, Voight et al., 2006). The first of this kind was proposed by Hudson et al. (1994). Their haplotype test was designed to examine a group of high frequency haplotypes with little genetic variation among them. The test was carried out by estimating the probability of observing fewer polymorphic sites in repeated coalescent simulations given the sample size and allele counts. Hudson et al. (1994) applied the method to analyze the Sod gene in Drosophila melanogaster. At this locus, there are two alleles, labeled “slow” and “fast”. Hudson et al. (1994) found that there was no mutation among the slow allele group, which has a frequency of approximately 18%, and concluded that there was a significant deviation from neutrality. Sabeti et al. (2002) proposed a Relative Extended Haplotype Homozygosity (REHH) test, which starts by choosing a “core region” (Sabeti et al., 2002), a small region of very low historical recombination, and then calculates as the test statistic the ratio of the Extended Haplotype Homozygosity (EHH) of the core haplotype under test to that of the other core haplotypes. The significance level of the REHH test is generated by coalescent simulation of neutral data matched to the real data in haplotype group numbers and polymorphism level. The method was applied to identify the selected core haplotypes in two malaria-resistance genes G6PD and CD40, and later to the HapMap data for a genome-wide scan (Sabeti et al., 2007). The iHS (integrated haplotype score) test, a variant of the REHH test, was proposed by Voight et al. (2006). The integrated EHH (iHH, defined as the area under the EHH curve) is first estimated for the ancestral and derived alleles of the mutant. The iHS score is then standardized to follow an approximately normal distribution, and subsequently used to test the deviation from the neutral model.
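The EHH and iHH quantities described above can be sketched in a few lines. The code below is a minimal illustration, assuming phased haplotypes encoded as equal-length strings of "0" (ancestral) and "1" (derived) and marker positions in any consistent unit; the function names are hypothetical. It omits the EHH truncation and the genome-wide standardization that a real iHS analysis applies, so it returns only the unstandardized log ratio.

```python
from collections import Counter
from math import comb, log


def ehh(haplotypes, core, site):
    """Probability that two randomly chosen haplotypes are identical
    at every marker between the core index and the site index."""
    lo, hi = min(core, site), max(core, site)
    n = len(haplotypes)
    if n < 2:
        return 0.0
    counts = Counter(h[lo:hi + 1] for h in haplotypes)
    return sum(comb(c, 2) for c in counts.values()) / comb(n, 2)


def ihh(haplotypes, core, positions):
    """Area under the EHH curve (trapezoidal rule) across all markers."""
    values = [ehh(haplotypes, core, j) for j in range(len(positions))]
    area = 0.0
    for j in range(len(positions) - 1):
        width = positions[j + 1] - positions[j]
        area += 0.5 * (values[j] + values[j + 1]) * width
    return area


def unstandardized_ihs(haplotypes, core, positions):
    """Log ratio of iHH for ancestral-allele carriers to derived-allele carriers."""
    derived = [h for h in haplotypes if h[core] == "1"]
    ancestral = [h for h in haplotypes if h[core] == "0"]
    return log(ihh(ancestral, core, positions) / ihh(derived, core, positions))


# Toy data (made up): derived carriers are identical, ancestral carriers are diverse.
sample = ["01110", "01110", "01110", "00000", "10001", "01010"]
score = unstandardized_ihs(sample, core=2, positions=[0, 1, 2, 3, 4])
```

With this sign convention, a negative score indicates longer shared haplotypes on the derived allele than on the ancestral allele.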

In addition to detecting selection, one may also be interested in estimating the selection intensity and the timing of the selection process. There are several methods for this purpose (Slatkin and Rannala, 1997, Slatkin, 2000, Slatkin and Rannala, 2000, Slatkin, 2001, Slatkin, 2002, Kim and Stephan, 2002, Coop and Griffiths, 2004, Rannala and Reeve, 2003, Slatkin, 2008, Chen and Slatkin, 2013). Among them, some consider a single marker linked to the selected locus (Slatkin and Rannala, 1997, Slatkin, 2000, Slatkin, 2001, Kim and Stephan, 2002); only a few model the haplotype structure of multiple marker loci (Coop and Griffiths, 2004, Rannala and Reeve, 2003, Slatkin, 2008, Chen and Slatkin, 2013). Coop and Griffiths (2004) developed a full likelihood method under the structured-coalescent framework (Hudson and Kaplan, 1988). They adopted the time-reversible Moran model to first simulate the allele frequency trajectory of the selected mutant and then, conditioning on the trajectory, simulate the genealogical history of the sample. The limitation of their method is that only mutations among different haplotypes are considered and the method is only applicable to non-recombining regions. Rannala and Reeve (2003) modeled both recombination and mutation, but their method depicts the haplotype structure in the vicinity of mutants under neutrality and makes the unrealistic assumption of constant allele frequencies for all loci during the selective process. Slatkin (2008) used a linear birth-and-death process to simulate the allelic genealogies of selected mutants and modeled the multilocus haplotype structure under the influence of both recombination and mutation. Chen and Slatkin (2013) also proposed a multilocus haplotype model that describes the dynamics of the haplotype structure under the joint effects of selection, recombination and mutation, by efficiently reducing the complexity of state spaces. Their method exploits the importance sampling approach to generate the historical allele frequency trajectory of the selected mutant, and thus works for populations with temporally changing size (Slatkin, 2001). All these methods are coalescent-based and take into account the randomness of the trajectory and genealogies by Monte Carlo averaging, which requires intensive computation. In comparison to the above computationally intensive methods, the approach of Voight et al. (2006) is simplified and computationally feasible. Their method estimates the distance at which the haplotype sharing decreases to a pre-chosen level, and then assumes that the decay of haplotype sharing follows a Poisson process. The method of Voight et al. (2006) further assumes independent histories for different haplotypes to avoid the intensive computation due to the integration over unknown gene genealogies, and thus is suitable for whole-genome analysis.
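The Poisson-decay idea can be illustrated with a short calculation. The sketch below is a generic simplification and not the exact estimator of Voight et al. (2006): it assumes that each haplotype has an independent history, that breaks accumulate at rate r per generation within genetic distance r (in Morgans) of the selected site, and that sharing therefore decays as exp(-r t) after t generations. The function names and the constant in the exponent are assumptions; published estimators may use a different factor, for example when sharing is measured between pairs of chromosomes.

```python
from math import exp, log


def expected_sharing(r_morgans, t_generations):
    """Expected fraction of haplotypes still unbroken out to genetic
    distance r_morgans after t_generations, under Poisson decay."""
    return exp(-r_morgans * t_generations)


def age_from_decay(r_morgans, sharing_level):
    """Invert the decay curve: given the distance at which sharing has
    dropped to sharing_level, return the implied age in generations."""
    if r_morgans <= 0 or not 0 < sharing_level < 1:
        raise ValueError("need r_morgans > 0 and 0 < sharing_level < 1")
    return -log(sharing_level) / r_morgans


# Hypothetical input: sharing falls to 0.25 at 0.5 cM (0.005 Morgans).
age = age_from_decay(0.005, 0.25)
```

Because the estimate needs only one distance per locus, it avoids any averaging over genealogies, which is what makes this family of approaches cheap enough for genome-wide use.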

In this paper, we propose a hidden Markov model to identify the ancestral haplotypes retained during the selective process for the purpose of both detecting selection and estimating the selection intensity. Compared with existing methods such as the REHH and iHS tests, which use summary statistics to evaluate the similarity of haplotypes, our method is model-based, so it has the potential to be extended to more complicated scenarios, such as multiple ancestral haplotype groups (soft sweeps on standing variation, Hermisson and Pennings, 2005), haplotype data from multiple populations, and genotype data with unknown phase.

The method is also different from the aforementioned coalescent-based models in that we do not try to simulate gene genealogies among individuals and the events occurring along the genealogies by Markov chain Monte Carlo or importance sampling approaches (Slatkin, 2008, Coop and Griffiths, 2004, Chen and Slatkin, 2013). Our method is similar to that of Voight et al. (2006) in this respect. We treat each haplotype independently by assuming a “star” genealogy and ignore the randomness of the frequency trajectory of the selected allele. Both methods are computationally efficient and applicable to genome-wide analysis. Compared to Voight et al. (2006), our method provides a better estimate of the selection coefficient when the selected allele is common or nearly fixed, since we explicitly model the probability of effective recombination that breaks the ancestral haplotype extents, which is different from the simple recombination process in Voight et al. (2006) and others. As we will show in a later section, when the selected mutant is at high frequency, the bias in the Voight et al. (2006) method can be as high as ≈20%.
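A rough calculation shows why the distinction between simple and effective recombination matters most at high allele frequency. The sketch below is an illustration under stated assumptions and is not the model used in the paper: the selected allele follows a deterministic logistic trajectory dx/dt = s x (1 - x) from frequency x0 to xT, and a recombination event breaks the ancestral haplotype only when the partner chromosome does not carry the selected allele, which happens with probability 1 - x(t). All function names are hypothetical.

```python
from math import log


def sweep_duration(s, x0, xT):
    """Generations for a logistic trajectory to rise from x0 to xT."""
    return log(xT * (1 - x0) / (x0 * (1 - xT))) / s


def effective_exposure(s, x0, xT):
    """Integral of (1 - x(t)) dt over the sweep. Since dx/dt = s x (1 - x),
    (1 - x) dt = dx / (s x), which integrates to ln(xT / x0) / s."""
    return log(xT / x0) / s


def effective_fraction(x0, xT):
    """Share of all recombination events that are effective, that is,
    effective exposure divided by the full duration (s cancels out)."""
    return log(xT / x0) / log(xT * (1 - x0) / (x0 * (1 - xT)))


# Assumed starting frequency of one copy in 2N = 20000 chromosomes.
x0 = 1 / 20000
fraction_moderate = effective_fraction(x0, 0.40)
fraction_high = effective_fraction(x0, 0.80)
```

Under these assumptions the effective fraction falls as the present-day frequency rises, so treating every recombination event as a break overstates the decay of ancestral haplotypes most strongly for common or nearly fixed alleles.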

The aim of this paper is twofold: first, we propose a hidden Markov model (HMM) that can explore the haplotype structure of a genomic region, and the inferred haplotype structure can be used to detect the existence of selection; second, we use a simplified population genetic model for the ancestral haplotype extent inferred from the HMM to estimate the selection intensity and the allele age. In the following sections, we first elucidate the details of the method. We then use coalescent simulations to investigate the power of detecting RPS and the accuracy of parameter estimation. We apply the method to analyze several well-known genes under RPS to demonstrate its performance, including the lactose persistence gene (LCT) in Northern Europeans, and KITLG, TYRP1 and OCA2, known to influence skin pigmentation in Northern Europeans or East Asians.

Section snippets

Methods

In this section, we first present the HMM for identifying the extent of ancestral haplotypes. Two tests are further developed based on the HMM for detecting RPS. We then describe a population genetic model of hitchhiking. To be specific, we determine the allele frequency of a selected mutant and the approximate distribution of ancestral haplotype extents as a function of selection intensity and time, and then use this model to infer the selection intensity and the allele age of the selected
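The general idea of decoding an ancestral haplotype extent with an HMM can be sketched as follows. This is a generic two-state illustration and not the authors' model, whose states, emissions and transitions are not given in this excerpt. It assumes markers ordered outward from the selected site, a known ancestral haplotype, a fixed per-marker break probability `p_break`, a mismatch probability `err` inside the ancestral segment, and background emissions drawn from population allele frequencies strictly between 0 and 1. All names and default values are hypothetical.

```python
from math import log

NEG_INF = float("-inf")


def viterbi_extent(hap, ancestral, freqs, p_break=0.05, err=0.01):
    """Return how many markers, moving away from the core, are decoded
    as still lying on the ancestral haplotype.
    State 0 = ancestral, state 1 = background (absorbing)."""
    trans = [[log(1 - p_break), log(p_break)], [NEG_INF, 0.0]]

    def emit(state, i):
        if state == 0:
            return log(1 - err) if hap[i] == ancestral[i] else log(err)
        f = freqs[i]  # population frequency of allele "1"
        return log(f if hap[i] == "1" else 1 - f)

    # The chain is assumed to start in the ancestral state next to the core.
    score = [emit(0, 0), NEG_INF]
    back = []
    for i in range(1, len(hap)):
        new_score, pointers = [], []
        for s in (0, 1):
            candidates = [score[p] + trans[p][s] for p in (0, 1)]
            best = 0 if candidates[0] >= candidates[1] else 1
            new_score.append(candidates[best] + emit(s, i))
            pointers.append(best)
        score = new_score
        back.append(pointers)

    # Trace the most probable state path back from the last marker.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path.count(0)


# Toy haplotype (made up): matches the ancestral type for five markers.
extent = viterbi_extent("1111101001", "1111111111", [0.5] * 10)
```

Making the background state absorbing encodes the assumption that a haplotype does not return to the ancestral segment once it has been broken; a model with distance-dependent transitions would replace the constant `p_break`.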

Power to detect selection

We used the coalescent simulator msms to generate haplotype samples under RPS, and used the samples to evaluate the power of this method in detecting RPS, and the accuracy and precision of selection coefficient estimation (Ewing and Hermisson, 2010). msms adopts a structured coalescent scheme to model the effect of a selective sweep on the genealogies of nearby loci. The allele frequency of the mutant at present was chosen to be 0.40 and 0.80, representing selected mutants with moderate and

Discussion

We present an HMM method for detecting recent positive selection and inferring selection intensity and allele age when there has been a selective sweep. We have shown that the HMM method is effective in capturing the multilocus haplotype structure caused by RPS. Using coalescent simulations, we showed that the HMM method has more power to detect selection under a range of selection parameters than the allele frequency spectrum-based methods, such as the CLR test (Nielsen et al., 2005). We also

Acknowledgments

We are grateful to Drs. Kun Chen, Thomas Mailund, Noah Rosenberg, and two anonymous reviewers for their insightful comments on an earlier version of the manuscript. We are grateful to Drs. António Santos and Jorge Rocha for providing their source code and guidance on using it, and to Jared Knoblauch for assistance with simulation and data analysis. This research was supported by NIH grants R01-GM40282 (to MS) and R01-GM078204 (to JH), and was supported in part by the National Science

References (71)

  • P. Scheet et al.

    A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase

    Am. J. Hum. Genet.

    (2006)
  • W. Stephan et al.

    The effect of strongly selected substitutions on neutral polymorphism: analytical results based on diffusion theory

    Theor. Popul. Biol.

    (1992)
  • H. Tang et al.

    Reconstructing genetic ancestry blocks in admixed individuals

    Am. J. Hum. Genet.

    (2006)
  • J.M. Akey et al.

    Interrogating a high-density SNP map for signatures of natural selection

    Genome Res.

    (2002)
  • C. Beall et al.

    Natural selection on EPAS1 (HIF2α) associated with low hemoglobin concentration in Tibetan highlanders

    Proc. Natl. Acad. Sci. USA

    (2010)
  • S. Beleza et al.

    The timing of pigmentation lightening in Europeans

    Mol. Biol. Evol.

    (2013)
  • J. Braverman et al.

    The hitchhiking effect on the site frequency spectrum of DNA polymorphisms

    Genetics

    (1995)
  • J. Bryk et al.

    Positive selection in East Asians for an EDAR allele that enhances NF-κB activation

    PLoS One

    (2008)
  • H. Chen et al.

    Asymptotic distributions of coalescence times and ancestral lineage numbers for populations with temporally varying size

    Genetics

    (2013)
  • H. Chen et al.

    Population differentiation as a test for selective sweeps

    Genome Res.

    (2010)
  • H. Chen et al.

    Inferring selection intensity and allele age from multi-locus haplotype structure

    Genes Genomes Genet.

    (2013)
  • G. Coop et al.

    Ancestral inference on gene trees under selection

    Theor. Popul. Biol.

    (2004)
  • F. Depaulis et al.

    Neutrality tests based on the distribution of haplotypes under an infinite-site model

    Mol. Biol. Evol.

    (1998)
  • R. Durbin et al.

    Biological Sequence Analysis

    (1998)
  • M. Edwards et al.

    Association of the OCA2 polymorphism His615Arg with melanin content in East Asian populations: further evidence of convergent evolution of skin pigmentation

    PLoS Genet.

    (2010)
  • G. Ewing et al.

    MSMS: a coalescent simulation program including recombination, demographic structure and selection at a single locus

    Bioinformatics

    (2010)
  • J.C. Fay et al.

    Hitchhiking under positive Darwinian selection

    Genetics

    (2000)
  • Y.X. Fu et al.

    Statistical tests of neutrality of mutations

    Genetics

    (1993)
  • S. Gravel et al.

    Demographic history and rare allele sharing among human populations

    Proc. Natl. Acad. Sci. USA

    (2011)
  • R.C. Griffiths

    Asymptotic line-of-descent distributions

    J. Math. Biol.

    (1984)
  • R.N. Gutenkunst et al.

    Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data

    PLoS Genet.

    (2009)
  • N.A. Hanchard et al.

    Screening for recently selected alleles by analysis of human haplotype similarity

    Am. J. Hum. Genet.

    (2005)
  • J. Hermisson et al.

    Soft sweeps: molecular population genetics of adaptation from standing genetic variation

    Genetics

    (2005)
  • R.R. Hudson et al.

    Evidence for positive selection in the superoxide dismutase (Sod) region of Drosophila melanogaster

    Genetics

    (1994)
  • R.R. Hudson et al.

    The coalescent process in models with selection and recombination

    Genetics

    (1988)