A New Method to Scan Genomes for Introgression in a Secondary Contact Model

  • Anthony J. Geneva,

    Affiliation Department of Biology, University of Rochester, Rochester, New York, United States of America

  • Christina A. Muirhead,

    Affiliations Department of Biology, University of Rochester, Rochester, New York, United States of America, Ronin Institute, Montclair, New Jersey, United States of America

  • Sarah B. Kingan,

    Affiliation Department of Biology, University of Rochester, Rochester, New York, United States of America

  • Daniel Garrigan

    dgarriga@ur.rochester.edu

    Affiliation Department of Biology, University of Rochester, Rochester, New York, United States of America

Abstract

Secondary contact between divergent populations or incipient species may result in the exchange and introgression of genomic material. We develop a simple DNA sequence measure, called Gmin, which is designed to identify genomic regions experiencing introgression in a secondary contact model. Gmin is defined as the ratio of the minimum between-population number of nucleotide differences in a genomic window to the average number of between-population differences. Although it is conceptually simple, one advantage of Gmin is that it is computationally inexpensive relative to model-based methods for detecting gene flow and it scales easily to the level of whole-genome analysis. We compare the sensitivity and specificity of Gmin to those of the widely used index of population differentiation, FST, and suggest a simple statistical test for identifying genomic outliers. Extensive computer simulations demonstrate that Gmin has both greater sensitivity and specificity for detecting recent introgression than does FST. Furthermore, we find that the sensitivity of Gmin is robust with respect to both the population mutation and recombination rates. Finally, a scan of Gmin across the X chromosome of Drosophila melanogaster identifies candidate regions of introgression between sub-Saharan African and cosmopolitan populations that were previously missed by other methods. These results show that Gmin is a biologically straightforward, yet powerful, alternative to FST, as well as to more computationally intensive model-based methods for detecting gene flow.

Introduction

Secondary contact occurs when sympatry is restored between two or more populations that have evolved for some amount of time in allopatry. For evolutionary biologists, secondary contact between diverging populations can provide a compelling natural experiment. For example, the frequency and symmetry of hybrid matings can yield insight into the roles of sexual selection [1] and/or reinforcement [2] in speciation. Likewise, the frequency of backcrossing and subsequent introgression can reveal the extent to which postzygotic isolating mechanisms have accumulated [3]. In this context, studies of naturally occurring secondary contact offer a distinct advantage over laboratory-based studies of reproductive isolation—the patterns of introgression represent the fitness of hybrid genotypes in natural environments, replete with a variety of ecological selection pressures. Lastly, studies of secondary contact are not limited merely to satisfying the intellectual curiosity of evolutionary biologists: hybridization and introgression from closely related invasive populations can be a significant extinction threat for endangered endemic populations [4,5].

With the advent of comparative population genomics, there is now the potential to 1) quantify the frequency and tempo of introgression between natural populations experiencing secondary contact at the level of entire genomes, and 2) identify which genomic regions are exchanged. A variety of methods have been developed to estimate the rate and directionality of gene flow between diverging populations [6–9]. Generally, these methods estimate historical population demography to assess whether the observed data fit an isolation model and, if not, estimate the direction and magnitude of gene flow necessary to explain the observed data. Comparatively fewer methods have been developed to localize introgression (that is, to identify which genomic regions have experienced exchange), and most are tailored to particular taxa, for example requiring both “pure” and admixed samples or requiring that one population was formed by a recent dispersal event [10,11]. Many investigators rely upon unusually low observed values of the traditional fixation index, FST [12], to identify introgressing genomic regions (e.g., [13–15]). We suggest that FST may not be ideally suited for this particular application: it is derived from the variance in allele frequencies among populations and may lack power to detect introgression in cases of secondary contact [16]. This is because for FST to take on values close to zero following secondary contact, alleles must not only be shared across populations, but their frequencies in the two populations must also be equal. This is not necessarily expected in a secondary contact model, in which introgression is either very recent or otherwise limited. In this paper, we consider whether whole-genome sequence data can be leveraged to obtain both greater sensitivity and specificity to detect introgression than using FST alone.

While there are a variety of alternatives to FST for detecting introgression [6,8–11,17–19], our aim is to develop a method that fulfills seven criteria: 1) it has minimal prior assumptions, 2) is sensitive to recent gene flow, 3) has a low rate of false positives, 4) has a straightforward biological interpretation, 5) is applicable to a wide range of taxa, 6) can localize tracts of introgression in the genome, and 7) is fast to compute on large genomic datasets. To this end, we propose a simple haplotype-based sequence measure called Gmin, which can be quickly calculated in sliding windows across whole-genome alignments. Gmin is the ratio of the minimum between-population haplotype distance to the mean between-population haplotype distance, calculated in windows across the genome. We present the results of extensive computer simulations demonstrating that Gmin is more sensitive to recent introgression than FST in a secondary contact model. We also use Gmin on a previously published dataset to scan the X chromosome for introgression between sub-Saharan African and cosmopolitan populations of the commensal fruit fly Drosophila melanogaster.

Materials and Methods

Rationale for the Gmin measure

Assume that we have nucleotide sequences of multiple individuals sampled from two populations, such that there are a total of n1 sequences from population 1 and n2 sequences from population 2. The average number of pairwise nucleotide differences between sequences from the two populations is defined as

$$\bar{d}_{XY} = \frac{1}{n_1 n_2} \sum_{X=1}^{n_1} \sum_{Y=1}^{n_2} d_{XY}, \tag{1}$$

in which dXY is the Hamming distance (or p-distance) between sequence X from population 1 and sequence Y from population 2 [20]. Similarly, let min(dXY) be the minimum value of dXY among all n1×n2 comparisons. We can then define the ratio

$$G_{min} = \frac{\min(d_{XY})}{\bar{d}_{XY}}. \tag{2}$$

The ratio Gmin ranges from zero to unity and has the property that if n1 = 1 and n2 = 1, then Gmin = 1. Under a strict model of isolation (i.e., no historical gene flow), a lower bound is imposed upon Gmin by the divergence time between the two populations. However, for population divergence models that include recent gene flow, the lower bound is determined by the timing of the most recent gene flow event (for example, see Fig 1). A coalescent approximation for the expectation of Gmin is provided in S1 Text. We performed coalescent simulations to contrast Gmin with FST, calculated with the expression given by [21] (Eq. 3).
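The ratio described above (the minimum between-population Hamming distance divided by the mean between-population Hamming distance) can be illustrated with a minimal Python sketch. The function names `hamming` and `gmin` and the toy haplotypes are hypothetical, chosen only for illustration; this is not the POPBAM implementation, and it assumes fully aligned, phased sequences with no missing data.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))

def gmin(pop1, pop2):
    """Return (min d_XY, mean d_XY, Gmin) over all n1 x n2 between-population pairs."""
    dists = [hamming(x, y) for x, y in product(pop1, pop2)]
    d_min = min(dists)
    d_mean = sum(dists) / len(dists)
    if d_mean == 0:
        # The ratio is undefined when no between-population differences exist.
        return d_min, d_mean, float("nan")
    return d_min, d_mean, d_min / d_mean

# Toy data (hypothetical): the third haplotype in pop2 is nearly identical to
# haplotypes in pop1, mimicking a recently introgressed sequence.
pop1 = ["AAAAAAAAAA", "AAAAAAAATT", "CCAAAAAAAA"]
pop2 = ["GGGGAAAAAA", "GGGGAAAACC", "AAAAAAAATA"]

d_min, d_mean, g = gmin(pop1, pop2)
print(d_min, round(d_mean, 3), round(g, 3))
```

In this toy example one near-identical between-population pair pulls the minimum far below the mean, so the ratio is well under one; with a single sequence per population the minimum and mean coincide and the ratio equals one.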

Fig 1. Illustration of the average and minimum between population coalescent times for models that include A) population divergence in isolation, and B) secondary contact.

For sufficiently high rates of mutation, these two times are the main determinants of the observable quantities: the mean number of between-population nucleotide differences, $\bar{d}_{XY}$, and the minimum number of between-population differences, min(dXY).

https://doi.org/10.1371/journal.pone.0118621.g001

Behavior of the Gmin ratio

To characterize the behavior of the Gmin ratio, two sets of coalescent simulations were generated. The first set was intended only to examine the distribution of Gmin under the null model of neutral population divergence with no gene flow (isolation). The second set of simulations was designed to contrast the sensitivity and specificity of Gmin with those of FST, using a binary classification procedure. This second set considers a large parameter space for a secondary contact model, which includes an ancestral population of size N that splits into two descendant populations at time τD (measured in units of N generations). We focus on cases in which each of the descendant populations also has size N (however, for treatment of the effects of varying population size in secondary contact models, see [22]). Subsequently, at time τM (also measured in units of N generations) before the present, the source population is allowed to send migrants instantaneously to the other population. Instantaneous migration was assumed, rather than specifying a time for the onset of continuous gene flow, because it more discretely captures the effect of the timing of secondary contact. The number of migrating lineages is governed by the “migration probability” parameter, λ. For example, at time τM, let there be k ancestral lineages present in the source population, so that the number of lineages chosen to migrate is a binomially distributed random variable with expectation kλ. We assume that gene flow is unidirectional. This model is implemented in a modified version of the coalescent simulation software MS [23], called MSMOVE [24]. This modified version has the added feature of recording which simulated genealogies experienced a migration event.
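The instantaneous migration step can be pictured as k independent trials, each succeeding with the migration probability λ, so that the number of migrating lineages is binomial. The sketch below only illustrates that sampling step; the function name `draw_migrants` and the chosen values of k and λ are hypothetical, and it does not reproduce how MSMOVE itself is implemented.

```python
import random

def draw_migrants(k, lam, rng):
    """Number of the k source-population lineages that migrate: Binomial(k, lam)."""
    return sum(rng.random() < lam for _ in range(k))

rng = random.Random(1)           # fixed seed so the sketch is repeatable
k, lam = 10, 0.1                 # hypothetical values; lam = 0.1 is in the simulated set
draws = [draw_migrants(k, lam, rng) for _ in range(20000)]
mean_draw = sum(draws) / len(draws)

# The empirical mean should sit close to the binomial expectation k * lam.
print(mean_draw, k * lam)
```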

Since Gmin is intended to be measured in a sliding window scan of whole-genome sequence alignments, we performed simulations that approximate variably sized genomic windows. This was achieved by varying both the population mutation rate (θ = 4Nμ, where μ is the mutation rate for a given window) and the population crossing-over rate (ρ = 4Nc, where c is the rate of crossing-over per window). Specifically, we used values of θ ∊ {10,20,50,100,150} and ρ ∊ {0,1,10,20,50,100,150}. To provide a more familiar frame of reference for these simulation parameters, we provide the following expected values, calculated as if our simulated data were derived from population sampling of DNA sequences. For a sample size of 10 individuals, it is expected that θ = 10 corresponds to a window size with 28 segregating sites, while θ = 150 approximates a window with 424 segregating sites. Similarly, ρ approximates the size of haplotypes within windows. For example, when θ = 150 and ρ = 0, all 424 segregating sites would be partitioned among haplotypes that span the length of the window. However, when θ = 150 and ρ = 150, there are also 424 expected recombination events; therefore, each segregating site would have its own non-recombining coalescent history, on average.
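The segregating-site counts quoted above match the standard neutral expectation E[S] = θ·a_n, with a_n = Σ 1/i for i = 1 to n−1 (Watterson's result for a sample of n sequences). Whether the authors computed the figures exactly this way is an assumption, but the arithmetic can be checked with a short sketch; the function names are hypothetical.

```python
def harmonic_term(n):
    """a_n = sum of 1/i for i = 1 .. n-1, for a sample of n sequences."""
    return sum(1.0 / i for i in range(1, n))

def expected_segregating_sites(theta, n):
    """Neutral expectation E[S] = theta * a_n for a single panmictic sample."""
    return theta * harmonic_term(n)

for theta in (10, 20, 50, 100, 150):
    print(theta, round(expected_segregating_sites(theta, 10), 1))
```

For n = 10, a_n is about 2.829, which gives roughly 28 sites at θ = 10 and roughly 424 sites at θ = 150.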

For each pairwise combination of parameter values, a total of 10^4 independent windows were simulated. This scheme assumes that large windows are being used to scan the genome for gene flow, such that genealogical histories within windows can be correlated, but that adjacent windows contain independent genealogies. Additionally, we considered two different sample size configurations. The first configuration is one in which only a single source-population sequence is available (n1 = 10 and n2 = 1), and the second assumes that polymorphism data are available from both populations (n1 = 10 and n2 = 10). For both sample size configurations, the direction of the gene flow is from population 2 into population 1, going forward in time.

For the first set of simulations, which characterizes the behavior of Gmin under the null isolation model, we considered a range of population divergence times, τD ∊ {1/25,2/25,3/25,…,8}. We performed a variance partitioning analysis to quantify the effects of the n2, θ, ρ, and τ parameters (as well as their interactions) on the mean and variance of both Gmin and FST. We first fit a linear model that includes all parameters and their interactions. We then quantified the variance explained by each parameter by calculating the partitioned sum of squares. For all analyses, we tested for non-independence of parameters and for any potential bias-inducing effects of model complexity by comparing variance partitioning for each parameter after 1) iterating the order of parameters in the model, 2) running models both with and without interaction terms, and 3) serially removing parameters. All post-processing and analyses of simulated data were performed using the R statistical environment [25].

Sensitivity and specificity

To contrast the sensitivities of Gmin and FST to gene flow under the alternative secondary contact model, we examine the proportion of simulated true migrant genealogies that are deemed outliers using a simple designation criterion. While this is not meant to be a formal statistical test of gene flow versus isolation, it is a convenient procedure for approximating the sensitivity and specificity of Gmin and FST. Using this procedure, we classify a genomic window as being “positive” for gene flow on the basis of its standardized deviation from the genome-wide mean (Z-score). We defined three levels of stringency for considering an individual window as positive for gene flow: Z < −1.645, Z < −2.326, and Z < −3.090. Let the set of windows with a Z-score less than the threshold be denoted as Q. Furthermore, simulated windows are classified as “true” gene flow windows if they contain a genealogy in which an ancestral lineage has switched populations. Therefore, any particular parameterization of the secondary contact model will yield the set M of true gene flow windows. Let M ∩ Q represent the set of true gene flow windows with a Z-score below the threshold value. The sensitivity of the test (φ) can therefore be defined as the proportion

$$\varphi = \frac{|M \cap Q|}{|M|}. \tag{4}$$

Thus, φ = 1 when all true gene flow windows have an outlying Z-score. Conversely, we define specificity (ψ) as

$$\psi = \frac{|M \cap Q|}{|Q|}, \tag{5}$$

such that if ψ = 1, then all windows with an outlying Z-score are true gene flow windows. For the analysis of sensitivity and specificity, the simulated parameter combinations were the same as those used in the first set of simulations described in the previous subsection. The only exceptions were that we simulated a narrower range of divergence times, τD ∊ {1/100,2/100,3/100,…,1}, and added two additional parameters: the relative time of gene flow, which had the range τM ∊ {τD/100,2τD/100,3τD/100,…,τD} (for τD > 0), and the migration probability, λ ∊ {0.001,0.005,0.01,0.05,0.1}. In addition to assessing the sensitivity and specificity of Gmin and FST, we also evaluated the effect of each varied parameter on sensitivity and specificity. Variance partitioning was performed as described in the previous subsection.
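The classification procedure can be sketched in a few lines of Python: windows are converted to Z-scores, those below the threshold form the outlier set, and sensitivity and specificity are the shares of true gene flow windows and of outlier windows, respectively, that fall in the overlap. The function names and toy values are hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardized deviation of each window from the mean across all windows."""
    mu = mean(values)
    sd = pstdev(values)  # assumption: population SD; a sample SD would also be defensible
    return [(v - mu) / sd for v in values]

def sensitivity_specificity(values, is_migrant, z_threshold=-1.645):
    """Return (phi, psi) for one statistic measured across windows."""
    z = z_scores(values)
    outliers = {i for i, zi in enumerate(z) if zi < z_threshold}     # set Q
    migrants = {i for i, flag in enumerate(is_migrant) if flag}      # set M
    hits = len(migrants & outliers)
    phi = hits / len(migrants) if migrants else float("nan")
    psi = hits / len(outliers) if outliers else float("nan")
    return phi, psi

# Toy data (hypothetical): 17 non-migrant windows and 3 true migrant windows.
values = [0.9] * 17 + [0.2, 0.25, 0.5]
is_migrant = [False] * 17 + [True, True, True]

phi, psi = sensitivity_specificity(values, is_migrant)
print(round(phi, 3), psi)
```

Here two of the three migrant windows fall below Z = −1.645 and no non-migrant window does, so sensitivity is 2/3 and specificity is 1.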

Application to Drosophila melanogaster data

We developed Gmin in anticipation of high-quality short-read assemblies of population-level samples from more than one population. Such data have just begun to emerge from a variety of organisms. To contrast the sensitivity of Gmin with that of FST, we apply it to a subset of the highest quality available resequence dataset: X chromosome polymorphism of two populations of Drosophila melanogaster [10]. The two populations include a cosmopolitan population from France and a sub-Saharan African population from Rwanda. While these two populations generally show low levels of sequence divergence (chromosome average FST = 0.183), a recent study was able to detect a signal of recent cosmopolitan admixture in several African populations, including the deeply sampled Rwandan population [10].

We obtained 76 bp paired-end Illumina reads from seven French and nine Rwandan lines from the NCBI short read archive (see S1 Table for details on the sampled lines). All reads were aligned to the reference genome of D. melanogaster, build 5.45 (http://flybase.org), using the BWA software, version 0.6.2 [26]. The resulting alignments for individual lines in the BAM format were merged using the SAMTOOLS software package [27]. The values of FST and Gmin were calculated in non-overlapping 50 kb windows using the POPBAM software package [28]. We only analyzed nucleotide sites that met the following criteria: read depth per line greater than 5, Phred-scaled scores for the minimum root-mean squared mapping quality greater than or equal to 25, and a SNP quality that is at least 25; we also only incorporated reads with a minimum mapping quality of 20 and an individual base quality of at least 13. Of the 443 X chromosome 50 kb windows, seven (1.58%) had less than 25% of the reference genome positions passing the above filters and were subsequently ignored. Lastly, we construct neighbor-joining trees based on uncorrected Hamming distance in 50 kb windows using POPBAM. For the sake of consistency, individual windows were identified as outliers if Z < −1.645. We compare our analysis to that of Pool et al. [10], who utilized a Hidden Markov Model method based on the pairwise distances between sub-Saharan African and cosmopolitan genomes. In windows of 1000 non-singleton SNPs, each Rwandan line was assigned a posterior probability of admixture. We identified previously known admixed regions as those whose sum of posterior probabilities across lines is greater than 0.50 (see S5 Table from Pool et al. 2012).
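A non-overlapping window scan with a callable-site filter can be sketched as follows. This is not the POPBAM algorithm: the function name `window_gmin`, the use of `N` to mark sites failing the quality filters, and the toy 4-site windows are all assumptions made for illustration. Only the 25% minimum fraction of passing sites per window is taken from the text.

```python
def window_gmin(pop1, pop2, window, min_callable=0.25):
    """Return [(window_start, Gmin or None)] for non-overlapping windows.

    A site counts as callable only if no haplotype carries the missing-data
    code 'N' there (an assumed stand-in for the depth and quality filters).
    """
    length = len(pop1[0])
    results = []
    for start in range(0, length, window):
        end = min(start + window, length)
        cols = [i for i in range(start, end)
                if all(h[i] != "N" for h in pop1 + pop2)]
        if len(cols) < min_callable * (end - start):
            results.append((start, None))    # too few sites pass the filters
            continue
        dists = [sum(x[i] != y[i] for i in cols) for x in pop1 for y in pop2]
        mean_d = sum(dists) / len(dists)
        results.append((start, min(dists) / mean_d if mean_d > 0 else None))
    return results

# Toy alignment (hypothetical): three windows of four sites each.
pop1 = ["AAAAAAAANNNN", "AAAATTAANNNA"]
pop2 = ["AAAACCAAAAAA", "GGAACCAAAAAA"]
scan = window_gmin(pop1, pop2, window=4)
print(scan)
```

In the toy scan the first window contains an identical between-population pair (ratio 0), the second has equal distances for every pair (ratio 1), and the third is skipped because no site is callable.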

Results

Behavior of the Gmin ratio under an isolation model

Gmin is the ratio of min(dXY), the minimum number of nucleotide differences between haplotypes sampled from different populations, to the average number of between-population differences, $\bar{d}_{XY}$ (Eq. 2). In a strict isolation model of divergence, we expect that both min(dXY) and $\bar{d}_{XY}$ will increase as a function of the population divergence time, τD. Ultimately, Gmin is expected to approach unity for very ancient divergence times (τD >> 4N), because there is a high probability of only a single ancestral lineage remaining in each population. Conversely, for very recent divergence times, Gmin is expected to be much less than unity, since it is unlikely that all coalescent events will occur only between ancestral lineages from the same population before a single coalescent occurs between lineages from different populations. Computer simulations show that both Gmin and FST increase asymptotically to unity as the divergence time increases, but also that Gmin increases at a faster rate and plateaus at an earlier divergence time (Fig 2).

Fig 2. Expected values of Gmin in a pure isolation model.

A) The mean simulated values of Gmin plotted against divergence time for a model of divergence in isolation. B) Mean simulated values of Gmin plotted against population mutation rate under an isolation model with divergence occurring at time τD = N generations ago. Also shown are the mean simulated values of FST plotted against C) divergence time and D) population mutation rate under an isolation model. The shaded areas delimit the mean ± one standard deviation. The blue lines represent sample sizes in the two populations of n1 = 10 and n2 = 10, while the red lines represent sample sizes of n1 = 10 and n2 = 1. The simulations shown here do not include the effects of intra-locus recombination.

https://doi.org/10.1371/journal.pone.0118621.g002

In the isolation model, the variance of Gmin is most strongly affected by the time of population divergence, τD. Variation in τD alone explains approximately half of the simulated variance for both Gmin and FST (Table 1). When the population mutation rate θ ≤ 10, Gmin becomes downwardly biased (Fig 2B). We suspect that this bias arises for low mutation rates because, when few mutations occur on a set of correlated genealogies, Gmin does not always capture the minimum time of the between-population coalescent events; rather, it may reflect a randomly chosen between-population coalescent event that, by chance, has fewer mutations separating the two sequences than the true minimum event. Whether a single source-population sequence is available (n2 = 1) or polymorphism data are available (n2 = 10) has a minor, but predictable, effect: Gmin is always closer to unity when n2 = 1 than when n2 = 10 (Fig 2). It should be noted that although we report the results for FST in the case of n2 = 1, this is obviously not a situation in which FST (as a measure of difference in allele frequencies) would be applicable. Finally, we found no evidence of bias in any of the variance partitioning analyses, so the full models with all parameters and interaction terms have been included.

Table 1. Variance partitioning for Gmin and FST under the isolation model of divergence.

https://doi.org/10.1371/journal.pone.0118621.t001

Sensitivity and specificity

When we consider a secondary contact model, the two parameters that exert the strongest influence on the behavior of both Gmin and FST are the time of migration relative to divergence (τM) and the magnitude (λ) of the migration event (S2 and S3 Tables). Our simulations show that Gmin has increased sensitivity and specificity compared to FST for all combinations of the τM and λ parameters, regardless of the values of nuisance parameters, such as θ and ρ (Fig 3). The sensitivity of Gmin is greatest when τM is recent and λ is small (S1 Fig). It is interesting to note that the sensitivity of Gmin decreases with increasing λ because large amounts of migration tend to reduce the average between-population sequence distance, thereby also reducing the expected Gmin and increasing its variance (S4 Table). However, for FST, λ does not have a profound effect on its sensitivity (S4 Table). In contrast, increased λ results in a greater specificity for Gmin (Fig 3). This means that although high λ results in a lower proportion of the migrant genealogies appearing in the negative Z-score tail, a greater proportion of all genealogies in the tail are true migrant genealogies.

Fig 3. Comparison of Gmin and FST.

Heatmaps of percent improvement of Gmin over FST for sensitivity (left) and specificity (right). Improvement was calculated for varying rates of migration (migration probability) and time of migration (relative to time of population divergence) and averaged over all other parameters.

https://doi.org/10.1371/journal.pone.0118621.g003

Surprisingly, the rate of recombination has only a mild effect on the sensitivity of Gmin and FST (S2 Fig). This may be due to the relatively intermediate levels of recombination used in the computer simulations, since the recombination rate must be very high (ρ > 50) to break up introgressed haplotypes when τM is very recent. This is also true of specificity (S3 Fig). Likewise, increasing the population mutation rate also slightly increases both the sensitivity (S4 Fig) and the specificity (S5 Fig). These results suggest that sensitivity and specificity of Gmin are optimal when large genomic windows (θ > 10) with relatively low levels of recombination (ρ < 20) are considered.

A trade-off between sensitivity and specificity occurs when we contrast results from simulations of divergence from a single source population sequence (n2 = 1) with those from polymorphism data from both populations (n2 = 10). Gmin has increased sensitivity when n2 = 1 compared to when n2 = 10 (S6 Fig). In contrast, the specificity of Gmin is substantially greater when n2 = 10 (S7 Fig). Therefore, situations in which only a single source-population sequence is available result in Gmin having increased power to detect migrant genealogies at any given locus in the genome, while polymorphism data from two populations yield increased power to detect gene flow across the genome. The specificity result is intuitive from a biological standpoint: if low levels of gene flow occur, then having more sequences per population will increase the probability of recovering an introgressed haplotype. Sensitivity increases when n2 = 1 because there is less variance in the coalescent process in the ancestral population for genealogies that do not experience gene flow and the expected Gmin in an isolation model is closer to unity; this results in a higher proportion of migrant genealogies significantly departing from a genome-wide distribution.

Application to cosmopolitan admixture in Drosophila melanogaster

We compare the ability of Gmin versus FST to detect cosmopolitan admixture in a Rwandan population of D. melanogaster. We used POPBAM to calculate the two statistics in 436 non-overlapping 50 kb windows on the X chromosome in a sample of seven French and nine Rwandan lines (Fig 4A). The mean and standard error are 0.6500 ± 0.0311 for Gmin and 0.1725 ± 0.0083 for FST. Interestingly, the range of Gmin (0.0982–0.9833) is roughly 1.8 times as large as that of FST (0.0170–0.5107) (Fig 4B). This expanded range of Gmin is consistent with a greater sensitivity of Gmin, even for relatively low levels of population divergence. The outliers from the chromosome-wide Gmin distribution identified cosmopolitan admixture in all of the previously identified admixture windows (Fig 4A). In contrast, outlier values of FST appear in only one of the six previously identified tracts (Fig 4A). The outliers of Gmin also reveal two additional candidate introgression tracts on the X chromosome: a region consisting of five significant windows between coordinates 1.65–2.05 Mb, and a single window located at 12.95–13 Mb just above our arbitrary cut-off (Z = −1.6352); neither region was previously identified by Pool et al. [10]. The first region, near the 2 Mb coordinate, harbors a low frequency introgressed haplotype carried by the Rwandan line RG35. Neighbor-joining trees indicate that the RG35 sample is nested within the French samples, although the particular French line(s) with which it clusters varies across windows (S8 Fig). The second, marginally significant window involves a similar scenario in which RG35 is nested within the clade of French lines, sister to the French line FR229 (S9 Fig). These inferred low frequency introgressions went undetected in both our FST scan and the Hidden Markov Model analysis performed by Pool et al. [10]. The window size used by Pool et al. [10] was based on the number of SNPs, rather than physical distance, such that windows in this sub-telomeric region are larger than 100 kb, on average. Therefore, it is possible that the large windows analyzed by Pool et al. [10] contain conflicting genealogical histories, resulting in the distance between RG35 and any particular French line not being reduced, on average.

Fig 4. Cosmopolitan admixture in sub-Saharan African Drosophila melanogaster.

A) Gmin (above) and FST (below) in 50 kb windows across the X chromosome in a sample of seven French and nine Rwandan lines. Shaded regions indicate where Pool et al. [10] previously detected admixture. Open circles mark windows that are identified as outliers from the chromosome-wide distribution. B) Scatterplot of FST versus Gmin across the X chromosome in 50 kb windows. The diagonal line is added for reference only.

https://doi.org/10.1371/journal.pone.0118621.g004

Discussion

Comparative population genomic datasets, or whole genome alignments of many individuals from multiple populations within a species or between closely related species, are finally becoming realized in evolutionary genetics. One of the many potential uses of these new data is to estimate the degree to which introgression occurs between populations coming into secondary contact. Also of interest is pinpointing the genomic location of introgression and characterizing the functional properties of introgressing coding material, if any. Many of the first studies to make use of whole-genome datasets rely on the traditional fixation index, FST, to identify introgressed genomic regions. However, we have shown that FST has a number of inherent weaknesses for detecting introgression in a secondary contact model.

Our analyses focus on phased haplotype data, which can be especially useful for inferring details of historical population demography and gene flow [18,29,30]; moreover, haplotype sharing among populations is often used as a criterion for detecting introgression [19,31,32]. We show that haplotype-based measures of within- and between-population sequence differences, such as Gmin, offer better sensitivity and specificity than allele frequency measures such as FST. Furthermore, our simulations show that Gmin is robust to local variation in mutation rate and, to a lesser extent, recombination rate. The robustness of Gmin to the local recombination rate primarily occurs when gene flow is both recent and limited, in which case there is a limited opportunity for recombination to break up introgressed haplotypes (S2 and S3 Figs). This result suggests that choice of window size offers an avenue for distinguishing recent versus older introgression events (S10 Fig). Larger windows with more mutation and recombination events offer greater power to identify very recent introgression events, whereas smaller windows can identify older introgression events, albeit with less specificity than larger windows. In practice, the most useful window size will vary with the particular taxa of interest. Because Gmin is relatively easy to calculate, the optimal window size can be rapidly evaluated over a range of genomic intervals.

Like FST, Gmin is not a formal test statistic; rather, it is a sequence measure designed to identify a distinctly bimodal pattern of between-population coalescence that is expected under models of secondary contact, but not expected in models of strict population isolation. We were unable to derive a closed-form expression for the variance of the Gmin ratio in a pure isolation model, due in part to the fact that we observe a non-zero positive covariance between the numerator, min(dXY), and the denominator, $\bar{d}_{XY}$ (data not shown). Therefore, using Gmin as the basis for a simple single-locus test is not currently feasible. However, like FST, Gmin can be readily incorporated into other inferential frameworks, such as approximate likelihood methods [33]. Our approach differs from more formal inferential frameworks, such as those used by the IM program [9], in that IM tests the hypothesis of whether or not gene flow has occurred; the goal of Gmin is less formal, seeking instead to localize introgression genealogies in otherwise diverging genomes. In practice, a Gmin scan may be an extremely useful first step for identifying candidate regions for introgression. Unlike many likelihood-based methods for detecting gene flow in a population divergence model, Gmin can be quickly applied to large whole-genome datasets, and interpretation of Gmin requires a minimal set of assumptions. The fundamental assumption is that each individual in the analysis can be assigned to one of the two populations. This is in contrast to some methods for detecting admixed regions of the genome, which rely on investigators being able to assign individuals to two pure parental populations, as well as a third population of hybrid individuals [11]. Of course, knowing the hybrid status of individuals, or having more detailed information on the geographical distribution of samples, may enable more advanced analysis [6,17].

While Gmin is more sensitive to recent gene flow than FST, it has additional desirable properties that distinguish it from other recently proposed haplotype-based methods. For example, Harris and Nielsen [8] describe a method for detecting recent gene flow by measuring the genomic length distribution of tracts of identity-by-state. The computer simulations presented by Harris and Nielsen [8] demonstrate that their method can accurately infer the timing and magnitude of admixture events, as well as other demographic parameters, over a range of time scales. However, the identity-by-state method of Harris and Nielsen [8] may also be sensitive to 1) low quality reads and sequencing error, 2) reductions in effective population size due to background selection, and 3) accuracy of the required modeling of historical population bottlenecks. In contrast, we argue that Gmin is not as sensitive to errors in sequencing or assembly, because Gmin does not explicitly depend upon uninterrupted runs of shared polymorphic sites. Additionally, the lower tail of Gmin is not expected to be strongly affected by background selection under a secondary contact model. This is because background selection does not affect the tempo of neutral divergence [34] and can skew within-population polymorphism towards an excess of rare alleles [35], neither of which affects Gmin (however, for the effect of reductions in the effective population size, see below).

Besides recent introgression, the primary factor affecting Gmin is the number of ancestral lineages present at the time of the initial population split. As a result, the distribution of Gmin will be affected by any force that alters the probability density of within-population coalescent events, including changes in the effective population size or natural selection. If natural selection acts to reduce diversity in one population exclusively, or if the effective population size of one population is smaller than that of the other, we expect there to be fewer ancestral lineages present at the time of the initial population divergence. To consider the performance of Gmin in these cases, we can extrapolate from our computer simulation results for different sampling schemes, in particular when n1 = 10 and n2 = 1. We find that when only a single source-population genome is used, Gmin has greater sensitivity (S6 Fig), but reduced specificity compared to when n2 = 10 (S7 Fig). This suggests that forces acting to increase the rate of coalescence within populations, such as population bottlenecks, will result in increased confidence that small values of Gmin can be attributed to recent gene flow, but also a diminished ability to recover all of the introgressed regions in a genome. Similarly, the reduced specificity of FST when there is a reduction in within-population variation is well known [21,36,37]; however, Gmin does not appear to be as strongly affected as FST (S7 Fig).

In conclusion, we do not wish to argue that Gmin is in any way a panacea for the longstanding problem of distinguishing models of gene flow from those of pure isolation [38]. Indeed, Gmin lacks sensitivity when gene flow occurred more than halfway back to the time of the population divergence or when there is a large amount of gene flow (S1 Fig). For example, if a genomic region is sweeping across species boundaries [39], Gmin is not expected to be as informative as FST. Therefore, it is also important to caution that genomic intervals with low Gmin should be subsequently vetted to ensure that the region does not have an unusually low absolute value of the average number of between-population differences (the mean of dXY). However, in cases of recent secondary contact, and when the rates of gene flow are not extremely high, we have shown that Gmin performs well and is more reliable than FST (Fig 3). In addition, we illustrate how a simple statistical procedure employing Gmin to scan the X chromosome of recently diverged cosmopolitan and sub-Saharan African populations of Drosophila melanogaster performs as well as more sophisticated methods (Fig 4). However, unlike many more sophisticated methods, the calculation of Gmin is fast and broadly applicable to any taxon for which haploid genome sequences are available. Gmin can be easily calculated from population genomic data using the software package POPBAM [28]. We anticipate that with the continued emergence of new haplotype sequencing methods [40,41], these types of data will be increasingly used for evolutionary studies. In this case, Gmin can be an effective and biologically straightforward addition to the suite of tools available to evolutionary biologists.
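The scan-and-vet workflow described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: it assumes that each window's Gmin is standardized against the empirical mean and standard deviation across all windows, it borrows the Z < −1.645 cutoff from the S1 Fig caption, and it takes the quantity to be vetted to be the mean between-population number of differences. The function name `flag_candidate_windows` and all numerical values are hypothetical.

```python
from statistics import mean, stdev


def flag_candidate_windows(windows, z_cutoff=-1.645):
    """Flag windows with unusually low Gmin.

    windows: list of (gmin, mean_dxy) tuples, one per genomic window
             (hypothetical input format).
    Returns a list of (index, z, mean_dxy) for each window whose Gmin
    z-score falls below z_cutoff, so that mean_dxy can be inspected.
    """
    gmin_values = [g for g, _ in windows]
    mu = mean(gmin_values)
    sd = stdev(gmin_values)  # sample standard deviation; needs >= 2 windows
    if sd == 0:
        return []  # no variation among windows, so nothing can be an outlier
    flagged = []
    for i, (g, d_mean) in enumerate(windows):
        z = (g - mu) / sd  # assumed standardization across windows
        if z < z_cutoff:
            flagged.append((i, z, d_mean))
    return flagged


# Made-up values for ten windows: (Gmin, mean d_XY).
windows = [(0.82, 41.0), (0.79, 39.5), (0.85, 43.2), (0.80, 40.1),
           (0.83, 42.7), (0.30, 38.9), (0.81, 40.8), (0.84, 41.9),
           (0.78, 39.0), (0.82, 40.4)]

candidates = flag_candidate_windows(windows)
for i, z, d_mean in candidates:
    # A flagged window whose mean d_XY is itself unusually low would
    # warrant closer inspection before being treated as introgression.
    print(f"window {i}: Z = {z:.2f}, mean d_XY = {d_mean}")
```

Returning the mean of dXY alongside each flagged window keeps the vetting step explicit, rather than folding it into the outlier rule.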

Supporting Information

S1 Fig. A) Sensitivity of the FST and Gmin measures for varying rates of migration (migration probability) and time of migration (relative to time of population divergence).

The left column shows plots of sensitivity for FST and the right column shows sensitivity for Gmin. The top row shows sensitivity when outliers are defined by Z < −1.645, the middle row shows the same for Z < −2.326, and the bottom row shows sensitivity when Z < −3.090. B) Specificity of the FST and Gmin measures for varying rates and times of migration. The layout of the plots is the same as in panel A.

https://doi.org/10.1371/journal.pone.0118621.s001

(EPS)

S2 Fig. Sensitivity of FST (left column) and Gmin (right column) for varying levels of population recombination rate: ρ = 0 (top), ρ = 50 (middle), and ρ = 150 (bottom).

https://doi.org/10.1371/journal.pone.0118621.s002

(EPS)

S3 Fig. Specificity of FST (left column) and Gmin (right column) for varying levels of population recombination rate: ρ = 0 (top), ρ = 50 (middle), and ρ = 150 (bottom).

https://doi.org/10.1371/journal.pone.0118621.s003

(EPS)

S4 Fig. Sensitivity of FST (left column) and Gmin (right column) for varying levels of population mutation rate: θ = 10 (top), θ = 50 (middle), and θ = 150 (bottom).

https://doi.org/10.1371/journal.pone.0118621.s004

(EPS)

S5 Fig. Specificity of FST (left column) and Gmin (right column) for varying levels of population mutation rate: θ = 10 (top), θ = 50 (middle), and θ = 150 (bottom).

https://doi.org/10.1371/journal.pone.0118621.s005

(EPS)

S6 Fig. Sensitivity of FST (left column) and Gmin (right column) for varying sample size: n2 = 10 (top) and n2 = 1 (bottom).

https://doi.org/10.1371/journal.pone.0118621.s006

(EPS)

S7 Fig. Specificity of FST (left column) and Gmin (right column) for varying sample sizes: n2 = 10 (top) and n2 = 1 (bottom).

https://doi.org/10.1371/journal.pone.0118621.s007

(EPS)

S8 Fig. Neighbor-joining trees showing the first newly identified region of gene flow on the Drosophila melanogaster X chromosome between coordinates 1.65–2.05 Mb.

https://doi.org/10.1371/journal.pone.0118621.s008

(EPS)

S9 Fig. Neighbor-joining trees showing the second newly identified region of gene flow on the Drosophila melanogaster X chromosome between coordinates 12.95–13 Mb.

https://doi.org/10.1371/journal.pone.0118621.s009

(EPS)

S10 Fig. Gmin and FST scans of the Drosophila melanogaster X chromosome in differently sized windows.

https://doi.org/10.1371/journal.pone.0118621.s010

(EPS)

S1 Table. The sampled lines from two populations of Drosophila melanogaster.

https://doi.org/10.1371/journal.pone.0118621.s011

(DOCX)

S2 Table. Analysis of variance of sensitivity of Gmin and FST.

https://doi.org/10.1371/journal.pone.0118621.s012

(DOCX)

S3 Table. Analysis of variance of specificity of Gmin and FST.

https://doi.org/10.1371/journal.pone.0118621.s013

(DOCX)

S4 Table. Influence of migration probability (λ) on the sensitivity, specificity and variance of Gmin and FST.

https://doi.org/10.1371/journal.pone.0118621.s014

(DOCX)

S1 Text. A new method to scan genomes for introgression in a secondary contact model.

https://doi.org/10.1371/journal.pone.0118621.s015

(DOCX)

Acknowledgments

We would like to thank Sohini Ramachandran and Carlos Machado for valuable comments on earlier drafts of this manuscript. We also thank LeAnne Lovato for preliminary work on this project.

Author Contributions

Conceived and designed the experiments: AJG DG. Performed the experiments: AJG SBK. Analyzed the data: AJG SBK. Contributed reagents/materials/analysis tools: AJG DG CAM. Wrote the paper: AJG SBK DG.

References

  1. Ritchie MG. Sexual selection and speciation. Annu Rev Ecol Syst. 2007;38: 79–102.
  2. Yukilevich R. Asymmetrical patterns of speciation uniquely support reinforcement in Drosophila. Evolution. 2012;66: 1430–1446. pmid:22519782
  3. Gompert Z, Parchman TL, Buerkle CA. Genomics of isolation in hybrids. Phil Trans R Soc B. 2012;367: 439–450. pmid:22201173
  4. Rhymer JM, Simberloff D. Extinction by hybridization and introgression. Annu Rev Ecol Syst. 1996;27: 83–109.
  5. Seehausen OLE, Takimoto G, Roy D, Jokela J. Speciation reversal and biodiversity dynamics with hybridization in changing environments. Mol Ecol. 2008;17: 30–44. pmid:18034800
  6. Barton NH, Etheridge AM, Kelleher J, Véber A. Inference in two dimensions: allele frequencies versus lengths of shared sequence blocks. Theor Popul Biol. 2013;87: 105–119. pmid:23506734
  7. Durand EY, Patterson N, Reich D, Slatkin M. Testing for ancient admixture between closely related populations. Mol Biol Evol. 2011;28: 2239–2252. pmid:21325092
  8. Harris K, Nielsen R. Inferring demographic history from a spectrum of shared haplotype lengths. PLoS Genet. 2013;9: e1003521. pmid:23754952
  9. Sousa V, Hey J. Understanding the origin of species with genome-scale data: modelling gene flow. Nat Rev Genet. 2013;14: 404–414. pmid:23657479
  10. Pool J, Corbett-Detig R, Sugino R, Stevens K, Cardeno C, et al. Population genomics of sub-Saharan Drosophila melanogaster: African diversity and non-African admixture. PLoS Genet. 2012;8: e1003080. pmid:23284287
  11. Price AL, Tandon A, Patterson N, Barnes KC, Rafaels N, et al. Sensitive detection of chromosomal segments of distinct ancestry in admixed populations. PLoS Genet. 2009;5: e1000519. pmid:19543370
  12. Wright S. The genetical structure of populations. Ann Eugen. 1951;15: 323–354. pmid:24540312
  13. Nadachowska-Brzyska K, Burri R, Olason PI, Kawakami T, Smeds L, et al. Demographic divergence history of pied flycatcher and collared flycatcher inferred from whole-genome re-sequencing data. PLoS Genet. 2013;9: e1003942. pmid:24244198
  14. Neafsey DE, Barker BM, Sharpton TJ, Stajich JE, Park DJ, et al. Population genomic sequencing of Coccidioides fungi reveals recent hybridization and transposon control. Genome Res. 2010;20: 938–946. pmid:20516208
  15. Smith J, Kronforst MR. Do Heliconius butterfly species exchange mimicry alleles? Biol Lett. 2013;9: 20130503. pmid:23864282
  16. Murray MC, Hare MP. A genomic scan for divergent selection in a secondary contact zone between Atlantic and Gulf of Mexico oysters, Crassostrea virginica. Mol Ecol. 2006;15: 4229–4242. pmid:17054515
  17. Gompert Z, Buerkle CA. Bayesian estimation of genomic clines. Mol Ecol. 2011;20: 2111–2127. pmid:21453352
  18. Machado CA, Kliman RM, Markert JA, Hey J. Inferring the history of speciation from multilocus DNA sequence data: the case of Drosophila pseudoobscura and close relatives. Mol Biol Evol. 2002;19: 472–488. pmid:11919289
  19. Ralph P, Coop G. The geography of recent genetic ancestry across Europe. PLoS Biol. 2013;11: e1001555. pmid:23667324
  20. Nei M, Li W-H. Mathematical model for studying genetic variation in terms of restriction endonucleases. Proc Natl Acad Sci USA. 1979;76: 5269–5273. pmid:291943
  21. Charlesworth B. Measures of divergence between populations and the effect of forces that reduce variability. Mol Biol Evol. 1998;15: 538–543. pmid:9580982
  22. Geneva A, Garrigan D. Population genomics of secondary contact. Genes. 2010;1: 124–142. pmid:24710014
  23. Hudson RR. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics. 2002;18: 337–338. pmid:11847089
  24. Garrigan D, Geneva AJ. msmove: A modified version of Hudson's coalescent simulator ms allowing for finer control and tracking of migrant genealogies. 2014; https://doi.org/10.6084/m9.figshare.1060474
  25. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2013. https://doi.org/10.3758/s13428-013-0330-5 pmid:23519455
  26. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25: 1754–1760. pmid:19451168
  27. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25: 2078–2079. pmid:19505943
  28. Garrigan D. POPBAM: tools for evolutionary analysis of short read sequence alignments. Evol Bioinform. 2013;9: 343–353. pmid:24027417
  29. Pool JE, Hellmann I, Jensen JD, Nielsen R. Population genetic inference from genomic sequence variation. Genome Res. 2010;20: 291–300. pmid:20067940
  30. Pool JE, Nielsen R. Inference of historical changes in migration rate from the lengths of migrant tracts. Genetics. 2009;181: 711–719. pmid:19087958
  31. Hufford MB, Lubinsky P, Pyhajarvi T, Devengenzo MT, Ellstrand NC, et al. The genomic signature of crop-wild introgression in maize. PLoS Genet. 2013;9: e1003477. pmid:23671421
  32. Kijas JW, Lenstra JA, Hayes B, Boitard S, Porto Neto LR, et al. Genome-wide analysis of the world's sheep breeds reveals high levels of historic mixture and strong recent selection. PLoS Biol. 2012;10: e1001258. pmid:22346734
  33. Beaumont M, Zhang W, Balding D. Approximate Bayesian computation in population genetics. Genetics. 2002;162: 2025–2035. pmid:12524368
  34. Birky CW, Walsh JB. Effects of linkage on rates of molecular evolution. Proc Natl Acad Sci USA. 1988;85: 6414–6418. pmid:3413105
  35. Charlesworth D, Charlesworth B, Morgan MT. The pattern of neutral molecular variation under the background selection model. Genetics. 1995;141: 1619–1632. pmid:8601499
  36. Cruickshank TE, Hahn MW. Reanalysis suggests that genomic islands of speciation are due to reduced diversity, not reduced gene flow. Mol Ecol. 2014;23: 3133–3157. pmid:24845075
  37. Nei M. Analysis of gene diversity in subdivided populations. Proc Natl Acad Sci USA. 1973;70: 3321–3323. pmid:4519626
  38. Takahata N, Slatkin M. Genealogy of neutral genes in two partially isolated populations. Theor Popul Biol. 1990;38: 331–350. pmid:2293402
  39. Brand CL, Kingan SB, Wu L, Garrigan D. A selective sweep across species boundaries in Drosophila. Mol Biol Evol. 2013;30: 2177–2186. pmid:23827876
  40. Kirkness EF, Grindberg RV, Yee-Greenbaum J, Marshall CR, Scherer SW, et al. Sequencing of isolated sperm cells for direct haplotyping of a human genome. Genome Res. 2013;23: 826–832. pmid:23282328
  41. Langley CH, Crepeau M, Cardeno C, Corbett-Detig R, Stevens K. Circumventing heterozygosity: sequencing the amplified genome of a single haploid Drosophila melanogaster embryo. Genetics. 2011;188: 239–246. pmid:21441209