Brendan D. O'Fallon, Jon Seger, Frederick R. Adler, A Continuous-State Coalescent and the Impact of Weak Selection on the Structure of Gene Genealogies, Molecular Biology and Evolution, Volume 27, Issue 5, May 2010, Pages 1162–1172, https://doi.org/10.1093/molbev/msq006
Abstract
Coalescent theory provides an elegant and powerful method for understanding the shape of gene genealogies and resulting patterns of genetic diversity. However, the coalescent does not naturally accommodate the effects of heritable variation in fitness. Although some methods are available for studying the effects of strong selection (Ns ≫ 1), few tools beyond forward simulation are available for quantifying the impact of weak selection at many sites. Here, we introduce a continuous-state coalescent capable of accurately describing the distortions to genealogies caused by moderate to weak natural selection affecting many linked sites. We calculate approximately the full distribution of pairwise coalescent times, the lengths of coalescent intervals, and the time to the most recent common ancestor of a sample. Weak selection (Ns ≈ 1) is found to substantially decrease the tree depth, primarily through a shortening of the lengths of the basal coalescent intervals. Additionally, we demonstrate that only two parameters, population size and the variance of the distribution describing fitness heritability, are sufficient to describe most changes.
INTRODUCTION
Understanding the manner in which natural selection affects patterns of genetic variation is of fundamental importance in population genetics. Coalescent theory, first described by Kingman (1982a, 1982b, 1982c), provides an elegant framework for this endeavor by describing the shapes of gene genealogies and resulting patterns of variation. However, because the coalescent relies on the assumption that individuals in the population do not differ in their expected reproductive success, the theory does not easily accommodate natural selection. Although many authors have extended coalescent theory to include various forms of selection, most have focused on selective schemes with only a small number of segregating alleles, often two (Krone and Neuhauser 1997; Neuhauser and Krone 1997; Barton and Etheridge 2004; Coop and Griffiths 2004; Wakeley 2008). Although two-allele models aid our understanding in situations such as the selective sweep of an advantageous allele, many populations, particularly those with large sizes or high mutation rates, contain loci that simultaneously segregate many alleles. The coalescent process is less well understood in these situations. In this paper, we demonstrate that relatively weak natural selection affecting multiple linked sites can significantly distort the shapes of gene genealogies from the predictions of neutral and two-allele models, and we develop methods that accurately predict these distortions.
Previous work on the coalescent with selection has focused on the limiting cases of strong and weak selection. If selection is strong, then allele frequencies either remain constant or change deterministically over time. The population may be thought of as several subpopulations, each corresponding to an allele (or more accurately, a fitness variant). Within subpopulations, expected reproductive success is equal, and thus, coalescent theory requires only modest modification to accurately describe these cases (Kaplan et al. 1988; Wakeley 2008). This approach may be applied to a variety of selective schemes (such as overdominance and balancing selection). However, natural selection must be fairly strong for this approximation to hold, particularly if many alleles or loci are considered (Barton and Navarro 2002; Navarro and Barton 2002).
If natural selection is weaker, then allele frequencies fluctuate over time, and it is necessary to incorporate these fluctuations to accurately model genealogies. Barton and Etheridge (2004) and Barton et al. (2004), extending a model first put forth by Kaplan et al. (1988), used a diffusion approximation to describe the probability that an allelic class had a particular frequency at some time in the past and then described relationships between genes conditional on the allelic frequencies. The method worked well for the one-locus, two-allele case but was numerically difficult to extend to cases involving more loci or alleles. Hudson and Kaplan (1994, 1995) analyzed a model that tracked a larger number of mutational classes and that could be applied to weak selection. Selection was presumed to act in a multiplicative manner based on the number of mutations experienced by a particular sequence, and frequencies of these mutational classes were assumed to be Poisson distributed and constant. However, for weaker selection coefficients, the distribution of allelic classes will not remain constant, and the Poisson approximation becomes inaccurate.
At loci that harbor more than a few alleles, the combined effect of many mutations may distort genealogical structure even if each mutation has only a small fitness effect (Przeworski et al. 1999; McVean and Charlesworth 2000; Williamson and Orive 2002; Maia et al. 2004; Comeron et al. 2008). Quantitative analysis is difficult in this regime because both selection and drift are important and fitness variants may arise and be lost frequently. Results have primarily been obtained through simulation studies, both forward simulation (Golding 1997; McVean and Charlesworth 2000; Williamson and Orive 2002; Maia et al. 2004) and simulated reconstruction of the genealogy itself, using the ancestral selection graph (ASG; Krone and Neuhauser 1997; Neuhauser and Krone 1997; Przeworski et al. 1999). These studies have primarily concluded that weak selection has only a modest effect on genealogical structure and one that is maximized for intermediate levels of selection. However, forward simulation techniques cannot handle realistic population sizes, particularly when the entire genealogy must be tracked. Additionally, the ASG becomes inaccurate if multiple sites combine to yield large selection coefficients, thus limiting the total strength of selection that can be modeled (see Przeworski et al. 1999; some recent modifications to the ASG allow for stronger selection, e.g., Slade 2000).
At their core, many of the studies above involve the “structured coalescent” (Nordborg 1997) in which the population is divided into a number of discrete groups, usually representing allelic or fitness states. Within a given allelic class, individuals are identical, and thus, the neutral coalescent accurately describes the history of samples within groups, whereas the mutational regime describes movement between groups. In the case of weak selection, the number of potential states grows rapidly, and tracking the size of groups and movement of lineages among groups approaches intractability. To address these concerns, we have developed a model that assumes an infinite number of fitness states. In this case, the matrix describing transitions between groups becomes a continuous function (similar to a dispersal kernel), and the probability that a lineage is in a certain state some number of generations ago is given by a continuous probability density function. Despite this difference, the basic approach remains unchanged. We track the probability that a lineage is in a certain state some number of generations ago and then calculate the probability that two individuals share a parent in the previous generation by integrating over the distribution of potential states. If the population size is much larger than the sample size, then this pairwise coalescent rate is sufficient to describe the ancestry of the entire sample.
In this paper, we utilize the continuous approximation to examine the impact of weak selection operating at multiple sites on the structure of a genealogy. Our model tracks only the expected reproductive success of individuals; thus, fitness is a quantitative trait, and individuals are endowed not with genotypes or allelic states but with a single (nonnegative) real number describing expected reproductive success. We investigate how the distribution of ancestral fitnesses changes as one looks deeper into the past and how this influences the probability that two randomly selected individuals first shared a common ancestor at a certain time. These calculations are used to find the distribution of pairwise coalescence times, the distribution of lengths of coalescent intervals, and the time to the most recent common ancestor (TMRCA) of a sample of genes. The results are also compared with simulations using a more realistic finite-sites model of fitness variation. In addition, we demonstrate that a single parameter describing the variance in fitness heritability in a single generation is sufficient to describe most distortions to the genealogies brought about by selection.
METHODS
Model Description
We begin by describing a simple population genetic model where individuals are endowed with a genome consisting of a finite number of sites. We then demonstrate how a model that tracks only the relative fitness of individuals can be used to approximate the discrete-sites model. Using the simpler relative fitness model, we address genealogical structure in three steps. First, we calculate the probability that a randomly chosen lineage (the series of ancestors of an individual chosen from the “present” generation) has fitness w at a given generation in the past. Second, we calculate the probability that two lineages, with fitnesses drawn from the probability distribution calculated in the first step, first shared a common parent t generations ago. Finally, we use the calculations to derive the expected lengths of coalescent intervals and TMRCA for a sample of arbitrary size. We verify our assumptions through comparison to forward simulations, both of our continuous model and of a more realistic model with a discrete number of sites. Unless otherwise noted, we use “fitness” to mean relative fitness or, equivalently, an individual's expected number of offspring.
Consider an asexual population of constant size N with nonoverlapping generations. Each individual contains a nonrecombining genome of L sites where each site may exist in one of two possible states. Each site is mutated independently with probability μ each generation. An individual's absolute fitness is determined by the total number of sites that differ from a predetermined most-fit genotype. Specifically, if the genome differs at n sites, absolute fitness is given by e^(−sn). Each new generation is populated by selecting individuals in proportion to their absolute fitness, and if selected, a parent produces a single offspring. The parental generation is sampled repeatedly and with replacement until exactly N offspring exist. Because sites may mutate to both more-fit and less-fit states, the model encompasses both beneficial and deleterious mutations. Similar models have been analyzed by a number of authors, including McVean and Charlesworth (2000), Comeron and Kreitman (2002), Rouzine et al. (2003), and Seger et al. (2010).
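One generation of the discrete-sites model can be sketched as follows. This is a minimal illustration rather than the simulation code used for the results: the function name `next_generation` is hypothetical, each individual is represented only by its count n of sites differing from the most-fit genotype (sufficient here because fitness depends on n alone), and mutation is assumed to act on the offspring after it is copied from its parent, an ordering the model description does not specify.

```python
import math
import random

def next_generation(unfit_counts, L, mu, s, rng):
    """Advance the discrete-sites model by one generation.

    unfit_counts[i] is n for individual i: the number of sites that
    differ from the most-fit genotype. Returns the offspring counts and
    the index of each offspring's parent.
    """
    N = len(unfit_counts)
    # Absolute fitness e^(-s*n), as in the model description.
    weights = [math.exp(-s * n) for n in unfit_counts]
    # Sample N parents with replacement, in proportion to absolute fitness.
    parents = rng.choices(range(N), weights=weights, k=N)
    offspring = []
    for p in parents:
        n = unfit_counts[p]
        # Assumption: mutation is applied to the offspring after copying.
        # Each unfit site flips back with probability mu (beneficial) ...
        back = sum(1 for _ in range(n) if rng.random() < mu)
        # ... and each fit site flips with probability mu (deleterious).
        forward = sum(1 for _ in range(L - n) if rng.random() < mu)
        offspring.append(n - back + forward)
    return offspring, parents

# Illustrative parameter values only; they are not taken from the paper.
rng = random.Random(1)
population = [0] * 50
for _ in range(20):
    population, parents = next_generation(population, L=100, mu=0.001, s=0.01, rng=rng)
print(sum(population) / len(population))  # mean number of unfit sites
```

Returning the parent indices alongside the offspring makes it possible to reconstruct genealogies from a forward run.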
The primary approximation we use in this work is to track only the evolution of the relative fitnesses of individuals. Specifically, we assume that there is some function f that describes the probability that an individual has relative fitness wo conditional on its parent having fitness wp. The parameter τ² describes the variance of this distribution; if τ² = 0, then offspring have fitness identical to their parents and the model collapses to neutrality. If f is known, then offspring fitnesses may easily be generated by drawing a single random variable from f instead of simulating mutation at many independent sites. In the Appendix, we derive the mean and the variance of f for the discrete-sites model. To first order in s and μ, τ² ≈ Lμs² and is independent of parental fitness. Although we have been unable to derive a closed form for f, it is approximately Gaussian, if somewhat leptokurtic, for Lμ < 1 (fig. 1a and b). In the calculations and simulations below, we use a Gaussian function for f, and we refer to this model as the “Gaussian model.” Although the true f for the discrete-sites model is not exactly Gaussian (only a finite number of fitnesses are possible), we demonstrate below that many of the results are surprisingly insensitive to the shape of f, depending only on the standard deviation τ.
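A single generation of the Gaussian model might be sketched as below. Several details are assumptions made only for illustration: the function name `gaussian_generation` is hypothetical, the mean of f is taken to equal the parent's fitness, and negative draws are clipped to a tiny positive value so that fitness stays nonnegative.

```python
import random

def gaussian_generation(fitness, tau, rng):
    """One generation of a Gaussian fitness-heritability model (sketch).

    fitness: relative fitnesses of the parental generation (mean 1).
    Returns the normalized offspring fitnesses and each offspring's parent.
    """
    N = len(fitness)
    # Parents are sampled with replacement in proportion to fitness.
    parents = rng.choices(range(N), weights=fitness, k=N)
    # Assumption: f is Gaussian with mean equal to the parent's fitness
    # and standard deviation tau; draws are clipped to stay positive.
    raw = [max(rng.gauss(fitness[p], tau), 1e-12) for p in parents]
    # Relative fitness: divide by the mean fitness of all offspring.
    mean_raw = sum(raw) / N
    return [w / mean_raw for w in raw], parents

# Illustrative values only.
rng = random.Random(2)
population = [1.0] * 200
for _ in range(100):
    population, parents = gaussian_generation(population, tau=0.01, rng=rng)
mean = sum(population) / len(population)
variance = sum((w - mean) ** 2 for w in population) / len(population)
print(mean, variance)
```

With tau set to 0, every offspring inherits its parent's fitness exactly and the step reduces to neutral Wright-Fisher sampling.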
Performing the calculations below requires choosing exact values for both population variance, σ², and skewness parameter α for p(w). In lieu of analytic results, we resort to simulation data to find the appropriate σ². In the results that follow, the σ² corresponding to a particular N and τ has been interpolated from simulation runs conducted for each combination of N and τ (the length of each run varied with N, but at minimum 1 million generations were simulated, with the first 5N generations discarded as burn-in). The distribution of population fitnesses exhibited leftward (negative) skew. Our results in general are not strongly dependent on the choice of α. Except when indicated, we use α = −2. The resulting function closely, but not exactly, matches the actual distribution of fitnesses observed in simulation results (fig. 1c and d). Nonetheless, this choice of α yields results that are broadly consistent with those obtained in simulations over a range of parameter values.
Distribution of Ancestral Fitnesses
The probability that an offspring has fitness wo conditional on parent fitness may be expressed as the product of two quantities. The first is the probability that any parent with fitness wp could have given rise to an offspring with fitness wo. Because our model tracks only relative fitness, we must account for the fact that each offspring fitness is divided by the mean fitness of all offspring each generation. The expected value of this normalization factor (the mean fitness of offspring prior to normalization) may be expressed as the mean fitness of the parental generation plus the expected increase in fitness of the offspring. In our model, the mean fitness of the parents is exactly one, whereas Fisher's fundamental theorem states that the expected increase in fitness of the offspring is equal to the variance in fitness in the parental generation (Fisher 1931). At stationarity, the variance of fitnesses in the parental generation is σ², and therefore, the expected normalization factor is 1 + σ². Although sampling error may cause the actual value in a given generation to differ from the expected value, in the calculations below we ignore these fluctuations and assume that the factor is exactly 1 + σ². The second quantity accounts for the fact that fitter individuals are more likely to be a randomly chosen individual's parent. The product then describes the probability that an offspring with fitness wo had a parent with fitness wp.
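The expected normalization factor can be checked directly for any finite population. When parents are drawn in proportion to fitness and offspring inherit parental fitness on average, the expected pre-normalization mean offspring fitness is the fitness-weighted mean of parental fitness, which equals 1 plus the variance when the parental mean is 1. The sketch below verifies this identity numerically; the function names are hypothetical.

```python
def expected_offspring_mean(fitness):
    """Fitness-weighted mean of parental fitness: sum(w^2) / sum(w).

    This is the expected fitness of a parent drawn in proportion to
    fitness, and hence the expected mean offspring fitness before
    normalization if offspring match their parents on average.
    """
    return sum(w * w for w in fitness) / sum(fitness)

def population_variance(fitness):
    """Population (not sample) variance of the fitness values."""
    mean = sum(fitness) / len(fitness)
    return sum((w - mean) ** 2 for w in fitness) / len(fitness)

# An arbitrary example population with mean fitness 1.
parents = [0.9, 1.0, 1.1, 1.0]
print(expected_offspring_mean(parents), 1.0 + population_variance(parents))
```

The two printed values agree up to floating-point error, which is the finite-population counterpart of the factor 1 + σ² used in the text.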
Repeated application of equation (5) results in a series of distributions describing ancestral fitnesses at each generation in the past. Note that the distributions describe the fitness of the ancestors of a single individual, not the group of individuals ancestral to the current generation. Henceforth, we refer to the distribution resulting from t iterations as the distribution of ancestral fitnesses t generations ago.
Several distributions for different time points are shown in figure 2, in which two primary phenomena are evident. First, the expectation of ancestral fitness increases with t. Intuitively, this makes sense because individuals that leave more offspring are more likely to be represented in the current generation. Similar findings have been noted by Barton and Etheridge (2004) and Wakeley (2001), who both found a tendency for lineages to migrate toward more fit states. Second, the variance in ancestral fitnesses decreases, which also follows from lineages tending toward a narrow range of states. The change in both the mean and the variance of ancestral fitness decreases with time and approaches an equilibrium. The position and scale of the distribution at equilibrium depend on both N and τ: larger populations support more variability in fitnesses for a given τ, which leads to higher ancestral fitnesses. Larger τ also yields higher ancestral fitnesses as well as a more rapid approach to this level (results not shown).
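The tendency of ancestral fitness to rise with t can also be explored by forward simulation with lineage tracing, which is a different route from iterating equation (5). The sketch below is only illustrative: the function name `simulate_ancestral_fitness` is hypothetical, f is assumed Gaussian with mean equal to parental fitness, and the parameter values are arbitrary.

```python
import random

def simulate_ancestral_fitness(N, tau, generations, max_lag, rng):
    """Mean relative fitness of ancestors at lags 0..max_lag (sketch).

    Runs a forward Gaussian-model simulation, then traces every
    individual of the final generation back through its ancestors.
    Requires generations > max_lag.
    """
    fitness = [1.0] * N
    history = []  # one (fitness list, parent index list) per generation
    for _ in range(generations):
        parents = rng.choices(range(N), weights=fitness, k=N)
        # Assumption: Gaussian f centered on parental fitness, clipped > 0.
        raw = [max(rng.gauss(fitness[p], tau), 1e-12) for p in parents]
        mean_raw = sum(raw) / N
        fitness = [w / mean_raw for w in raw]
        history.append((fitness, parents))
    lineage = list(range(N))  # index of each lineage's current ancestor
    means = []
    for lag in range(max_lag + 1):
        fit, parents = history[-1 - lag]
        means.append(sum(fit[i] for i in lineage) / N)
        # Step every lineage back to its parent in the previous generation.
        lineage = [parents[i] for i in lineage]
    return means

means = simulate_ancestral_fitness(N=100, tau=0.02, generations=150,
                                   max_lag=50, rng=random.Random(4))
print(means[0], means[10], means[50])
```

A single run is noisy because lineages coalesce and share ancestors; averaging over many sampled generations or replicate runs would be needed before comparing against figure 2.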
Coalescent Rate and the Distribution of Pairwise Coalescence Times
Using the distributions of ancestral fitnesses described above, it is possible to compute the probability that any two individuals randomly sampled from the current generation first had the same ancestor t generations ago. Some approximation is involved. In particular, we assume that the fitnesses of the ancestors of the two individuals, say wA and wB, from t generations ago, are described by independent draws from the distribution of ancestral fitnesses at that time. This may not be the case in reality; for instance, if significant correlations exist in the distribution of fitnesses within generations, then knowing wA may influence the distribution of wB. This concern may be relevant to small populations with large amounts of fitness variability, such that only a few individuals found the entire next generation. Additionally, because we assume that the two lineages have not yet coalesced, their true fitnesses are likely to be more different than expected under the independence assumption (see Wilkins and Wakeley 2002). Nonetheless, the approximations are accurate for the parameter combinations examined here.
If the exact values wA and wB are not known, but instead are drawn from a known distribution, then the marginal probability of coalescence in the previous generation may be obtained by integrating equation (7) over the distributions describing wA and wB. If A and B are randomly selected from the same generation, then each has fitness w with probability described by p(w), and integrating over this distribution for both wA and wB yields the probability that two random individuals had the same parent, which is the reciprocal of the inbreeding effective population size. If there is no variance in fitness, then p(w) is a delta function at w = 1, and the probability of shared parentage is 1/N, as predicted by neutrality.
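For a concrete finite population, the probability that two randomly chosen offspring share a parent can be computed directly, which gives a simple check on the limiting statements above. This is a population-level analogue, not equation (7) itself, and the function name is hypothetical. Each offspring picks parent i with probability proportional to fitness, so the shared-parent probability is the sum of squared selection probabilities.

```python
def shared_parent_probability(fitness):
    """Probability that two offspring were drawn from the same parent.

    Each offspring independently chooses parent i with probability
    w_i / sum(w), so the two choices coincide with probability
    sum_i (w_i / sum(w))^2.
    """
    total = sum(fitness)
    return sum((w / total) ** 2 for w in fitness)

neutral = [1.0] * 1000
variable = [0.9, 1.1] * 500  # mean 1, variance 0.01
print(shared_parent_probability(neutral))         # equals 1/N under neutrality
print(1.0 / shared_parent_probability(variable))  # inbreeding effective size
```

When mean fitness is 1, the result equals (1 + σ²)/N, so heritable fitness variance lowers the inbreeding effective size below N.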
For small t, λt is close to the reciprocal of the inbreeding effective population size. However, λt increases with t in roughly linear fashion at first and then approaches an asymptote (fig. 3). Because λt reflects the probability of coalescence, it can be thought of as the reciprocal of a time-dependent inbreeding effective size, which decreases as one looks into the past, before eventually reaching an equilibrium size. In Seger et al. (2010), this effective size is plotted as a function of t and selection strength. If τ is very large, indicating either strong selection or high mutation rate, the approach to this equilibrium may be very fast. In this regime, simply assuming that the population has a constant effective size somewhat smaller than the census size may accurately describe the effects of selection; this is similar to the “background selection” limit proposed by Charlesworth et al. (1993). However, for smaller τ, the approach to the equilibrium is more gradual and must be taken into account to describe genealogies accurately. For a given τ, relatively large populations experience a greater increase in coalescent rate than do smaller populations (fig. 3), such that two populations of very different sizes may have similar final, “asymptotic” coalescent rates. For this reason, population size may have a less pronounced effect on the genealogies of genes under selection than on neutral genealogies.
This distribution of pairwise coalescent times is similar to the geometric distribution that describes the neutral pairwise coalescent times in discrete time models, but here, the rate λ (1/N under neutrality) increases with time. As expected from the results above, increasing τ results in a substantial reduction in both the mean and the variance of the pairwise coalescent time distribution (fig. 4). Even relatively small values create a noticeable distortion when compared with the neutral expectation. For instance, in one such case, the mean time to coalescence is reduced from 1,000 to near 700, and the standard deviation is reduced from 1,000 to roughly 600. Greater values of τ result in considerable reductions in mean time to coalescence; for instance, at τ = 0.00025, the mean time is reduced to approximately 540, with a standard deviation of 416.
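The relationship between a time-varying coalescent rate and the pairwise coalescence time distribution can be illustrated with a generalized geometric distribution, in which the probability of first coalescing at generation t is λt multiplied by the probability of not having coalesced in any earlier generation. The sketch below assumes this standard discrete-time form; the function names and the saturating shape used for the rising rate are hypothetical stand-ins, not the λt computed in the paper.

```python
import math

def pairwise_time_pmf(rates):
    """P(T = t) for t = 1..len(rates), where rates[t-1] is the
    probability of coalescing t generations ago given no earlier
    coalescence."""
    pmf = []
    survive = 1.0  # probability of not having coalesced so far
    for lam in rates:
        pmf.append(survive * lam)
        survive *= 1.0 - lam
    return pmf

def mean_and_sd(pmf):
    """Mean and standard deviation of T (renormalized over the horizon)."""
    total = sum(pmf)
    mean = sum((t + 1) * p for t, p in enumerate(pmf)) / total
    var = sum((t + 1 - mean) ** 2 * p for t, p in enumerate(pmf)) / total
    return mean, math.sqrt(var)

N = 1000
horizon = 20 * N
neutral = [1.0 / N] * horizon
# Hypothetical rate that starts at 1/N and rises toward an asymptote of 2/N.
rising = [(2.0 - math.exp(-t / 500.0)) / N for t in range(horizon)]
print(mean_and_sd(pairwise_time_pmf(neutral)))
print(mean_and_sd(pairwise_time_pmf(rising)))
```

With a constant rate the function recovers the ordinary geometric distribution, and any rate that rises with t pulls both the mean and the standard deviation below their neutral values.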
Timing and Length of Coalescent Intervals
The calculation of the distribution of pairwise coalescence times says little about the structure of genealogies with more than two tips. One way to gather additional information about tree shape is to examine the distribution of lengths of different coalescent intervals. Under neutrality, these are straightforward to calculate; the expected length for an interval with j lineages is 2N/[j(j − 1)] generations. Under selection, the times are more complex: they must take into account how long ago the interval occurred, because fitnesses, and thus the probability of coalescence, change as a function of time.
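As a neutral baseline for the comparisons that follow, the expected interval lengths, tree depth, and total tree length can be tabulated from the standard haploid result that an interval with j lineages lasts 2N/[j(j − 1)] generations on average. The function name below is hypothetical, and the values of N and n are arbitrary.

```python
def neutral_interval_lengths(N, n):
    """Expected length, in generations, of the coalescent interval with
    j lineages for j = n, n-1, ..., 2 in a haploid population of size N."""
    return {j: 2.0 * N / (j * (j - 1)) for j in range(n, 1, -1)}

lengths = neutral_interval_lengths(N=1000, n=10)
# Expected TMRCA: the sum of all interval lengths, 2N(1 - 1/n).
tmrca = sum(lengths.values())
# Expected total tree length: each interval weighted by its lineage count.
total_length = sum(j * t for j, t in lengths.items())
print(lengths[2], tmrca, total_length)
```

The interval with two lineages accounts for more than half of the expected depth, which is why changes near the root dominate the TMRCA.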
This procedure yields the total time until the second interval ends; the length of the interval may be found by subtraction. Distributions for other intervals may be found in the same manner but require iterative calculation of more recent intervals in order to obtain the distribution of starting times for the interval in question. This computation relies on the assumption that the pairwise coalescence rate λt is unaffected by coalescent events. In reality, the expected fitness of a lineage immediately after a coalescence (in backwards time) is higher than predicted by the ancestral fitness distribution because the ancestor had at least two offspring in one generation. Changes in fitness cause the rate of coalescence to deviate from λt. This deviation is likely to be transient, however, and as t increases, the expected fitness distribution will again approach the ancestral fitness distribution and the rate will return to λt. Equation (11) assumes that this return is instantaneous. Inclusion of the transient deviations in coalescence rate due to prior coalescences leads to skewed trees, a property investigated in Seger et al. (2010).
These calculations were performed for several parameter combinations and the expectations of the resulting length distributions compared with simulation results in figure 5. Several features are evident. First, increasing population size while holding τ constant results in an ever greater distortion of the genealogy from neutral expectation, particularly near the root, where coalescence times are much reduced. Along intermediate intervals, the lengths of coalescent intervals are similar to those predicted under neutrality. The numerical approximation consistently predicts a somewhat smaller deviation from neutrality near the basal nodes than does the simulation. This inaccuracy seems to result from a somewhat lower equilibrium fitness resulting from the calculations than actually found in simulations, an error that decreases the coalescence rate deep in the genealogy and may result either from inaccuracy in the prior p(w) or from the assumption that the fitnesses are independent draws from the ancestral fitness distribution. Second, the relative lengths of the one or two intervals closest to the tips are actually larger than the neutral expectation in the simulations. The reason for this remains unclear, although it may be related to the ambiguity in measuring interval lengths when more than one coalescent event occurs in a single generation. This explanation is consistent with the observation that the phenomenon decreases with population size. The distortion appears quite transient and is not likely to strongly influence patterns of nucleotide diversity.
The generation at which the final coalescent event occurs is the TMRCA of the sample. The timing of this event describes the depth of the tree and thus affects the total tree length and the total number of mutations that have arisen in the genealogy. The distribution of this event is particularly strongly influenced by heritable variation (fig. 6), with increasing τ reducing both the mean TMRCA and its variance. A modification of the procedure described in this section may also be used to find the total length of the tree, which is given by the sum over all intervals of the number of lineages in the interval multiplied by the length of the interval.
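The interval lengths, TMRCA, and total tree length for a given pairwise rate can also be approximated by Monte Carlo sampling rather than by the iterative numerical calculation described above. The sketch below is one such approach under stated assumptions: the pairwise rate is unaffected by earlier coalescences, at most one coalescence occurs per generation (reasonable when the population is much larger than the sample), and the rate function passed in is supplied by the user. The name `simulate_intervals` and the example rate are hypothetical.

```python
import math
import random

def simulate_intervals(n, rate_at, rng):
    """Sample one genealogy's interval lengths for a sample of size n.

    rate_at(t) is the pairwise coalescence probability t generations ago.
    Returns ({j: length of interval with j lineages}, TMRCA, total length).
    """
    j, t, start = n, 0, 0
    intervals = {}
    while j > 1:
        t += 1
        pairs = j * (j - 1) // 2
        # Assumption: at most one coalescence per generation.
        if rng.random() < min(1.0, pairs * rate_at(t)):
            intervals[j] = t - start
            start = t
            j -= 1
    tmrca = t
    # Total tree length: lineages in each interval times its length.
    total_length = sum(k * length for k, length in intervals.items())
    return intervals, tmrca, total_length

N = 200
# Hypothetical rate rising from 1/N toward 2/N as one looks back in time.
rising = lambda t: (2.0 - math.exp(-t / 100.0)) / N
rng = random.Random(6)
runs = [simulate_intervals(5, rising, rng) for _ in range(500)]
print(sum(r[1] for r in runs) / len(runs))  # mean TMRCA across replicates
```

Passing a constant rate of 1/N recovers the neutral expectations, which provides a quick sanity check before substituting a rate derived from a selection model.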
RESULTS
Comparison with Discrete-Sites Models
To validate our methods, we return to the discrete-sites model introduced at the beginning of the Methods section and compare the mathematical results from the continuous model to forward simulation of the discrete-sites case. We stress that the discrete-sites model makes no assumptions regarding the form of f or the continuous nature of relative fitnesses; it is nearly identical to the model presented in Rouzine et al. (2003) (without compensatory mutations) and Seger et al. (2010). For all results shown, simulations were allowed to burn in for 10N generations before data recording began. After burn-in, data were sampled every 1,000 generations for at least 5 million generations. We set in all cases. For a variety of selection coefficients and population sizes, the predicted distributions of both pairwise coalescent times and the TMRCA closely match those obtained from simulation of the more complex discrete-sites model (fig. 7). For larger selection coefficients, however, such that Ns > 10, the continuous approximation becomes inaccurate, and the mathematical analysis predicts a greater distortion than observed in simulations (results not shown). One potential cause of this disagreement is that when selection is relatively strong, few sites are segregating, and the stationary state distribution p(w) is no longer accurately represented by a continuous function. The effects of strong background selection have been investigated by other authors (Charlesworth et al. 1993; Wakeley 2008).
To more closely examine the relationship between the number of segregating sites and the accuracy of the approximation, we conducted additional simulations with N and s held fixed and varying levels of L and μ, while maintaining the genomic mutation rate at Lμ = 0.01 or Lμ = 0.1. The results in table 1 demonstrate that increasing the number of selected sites beyond 1 reduces both the pairwise coalescent time and the TMRCA substantially and therefore that one-locus, two-allele models (equivalent to L = 1) should not be used to infer the effects of selection at many sites, even when the expected mutation probability Lμ is held constant. As L increases, the population harbors more fitness variability, as measured by σ, with σ eventually ceasing to increase for L > 1,000.
Table 1
| L (a) | τ (b) | σ (c) | Pairwise Coalescent Time (d) | TMRCA (e) |
|---|---|---|---|---|
| Lμ = 0.01 | | | | |
| 1 | 0.0001 | 0.0005 | 973 (961) | 1,875 (1,029) |
| 10 | 0.0001 | 0.0014 | 958 (1,045) | 1,857 (1,241) |
| 100 | 0.00011 | 0.0021 | 746 (636) | 1,400 (642) |
| 1,000 | 0.00012 | 0.0022 | 714 (594) | 1,320 (567) |
| 2,500 | 0.00012 | 0.0022 | 711 (582) | 1,286 (567) |
| Calc. (f) | 0.0001 | 0.0021 | 681 (562) | 1,212 (555) |
| Lμ = 0.1 | | | | |
| 1 | 0.0003 | 0.0005 | 996 (1,072) | 1,924 (1,230) |
| 10 | 0.00032 | 0.0016 | 1,022 (1,084) | 1,976 (1,196) |
| 100 | 0.00032 | 0.0041 | 657 (560) | 1,213 (573) |
| 1,000 | 0.00032 | 0.0053 | 488 (381) | 889 (396) |
| 2,500 | 0.00032 | 0.0055 | 482 (360) | 880 (363) |
| Calc. (f) | 0.0003 | 0.0050 | 499 (378) | 854 (368) |
(a) Number of sites under selection.
(b) Standard deviation of the difference between parent and offspring fitness.
(c) Standard deviation of the population fitness distribution.
(d) Average pairwise coalescence time in generations, with standard deviation in parentheses.
(e) Time to most recent common ancestor, with standard deviation in parentheses.
(f) Results calculated using the numerical method with τ² = Lμs²; numbers in parentheses are standard deviations of the calculated distributions.
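The size of the effect reported in table 1 is easier to judge as a proportional reduction. The short calculation below compares the simulated means at L = 2,500 with those at L = 1 for each genomic mutation rate, using only values copied from the table; the dictionary layout and function name are illustrative choices.

```python
# Mean (pairwise coalescence time, TMRCA) from table 1, keyed by
# genomic mutation rate L*mu and then by number of selected sites L.
table1 = {
    0.01: {1: (973, 1875), 2500: (711, 1286)},
    0.1: {1: (996, 1924), 2500: (482, 880)},
}

def proportional_reduction(baseline, value):
    """Fractional decrease of value relative to baseline."""
    return 1.0 - value / baseline

reductions = {}
for rate, rows in table1.items():
    pair_1, tmrca_1 = rows[1]
    pair_many, tmrca_many = rows[2500]
    reductions[rate] = {
        "pairwise": proportional_reduction(pair_1, pair_many),
        "tmrca": proportional_reduction(tmrca_1, tmrca_many),
    }
    print(rate, reductions[rate])
```

These ratios use the L = 1 simulations as the reference point rather than the analytic neutral expectation, so they measure the effect of spreading the same genomic mutation rate across many sites.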
Robustness of Results to Other Fitness Heritability Functions
Although the results in figure 7 indicate that models with different assumptions regarding the heritability of fitness may have similar effects on genealogies provided they share the same τ, we further pursue this topic by examining different forms of the fitness heritability function f via forward simulation (fig. 8). Despite the considerable differences in the form of f, the distributions of pairwise coalescent times are indistinguishable, and the means and variances differ by only a few percent. Identical results were obtained with other population sizes and values of τ (results not shown). The similarity suggests that the higher moments of f do not greatly, or even moderately, affect genealogical structure and that the process is influenced primarily by the variance of the fitness heritability function, τ². One possible explanation for this result is the central limit theorem-like property of adding many small deviations, such that offspring fitness distributions are approximately Gaussian when compared with ancestors from many generations ago, regardless of the form of the deviation produced in each generation. These results suggest that analytic calculations using a Gaussian f may accurately describe genealogical relationships for a substantially larger class of models, including more traditional finite- or infinite-sites models.
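A comparison of this kind requires heritability functions that differ in shape but share the same variance. The sketch below constructs three candidate deviation distributions matched to a common τ; the Gaussian, uniform, and Laplace choices are illustrative assumptions, not necessarily the forms used for figure 8, and the function names are hypothetical.

```python
import math
import random

def gaussian_deviation(tau, rng):
    """Gaussian parent-offspring deviation with standard deviation tau."""
    return rng.gauss(0.0, tau)

def uniform_deviation(tau, rng):
    """Uniform on (-a, a); variance a^2/3 equals tau^2 when a = tau*sqrt(3)."""
    half_width = tau * math.sqrt(3.0)
    return rng.uniform(-half_width, half_width)

def laplace_deviation(tau, rng):
    """Laplace with scale b; variance 2*b^2 equals tau^2 when b = tau/sqrt(2).
    Sampled as the difference of two exponentials with mean b."""
    b = tau / math.sqrt(2.0)
    return rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)

def sample_variance(draw, tau, rng, n=100000):
    """Empirical variance of n deviations drawn with the given function."""
    values = [draw(tau, rng) for _ in range(n)]
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

tau = 0.001
rng = random.Random(8)
for draw in (gaussian_deviation, uniform_deviation, laplace_deviation):
    print(draw.__name__, sample_variance(draw, tau, rng) / tau ** 2)
```

Each printed ratio should be close to 1, confirming that the three shapes are matched in variance while differing in kurtosis, from flat-topped (uniform) to heavy-tailed (Laplace).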
DISCUSSION
This work demonstrates that a model of coalescence involving continuously variable fitnesses can capture some of the ways in which weak selection acting at many sites distorts genealogies from their neutral expectation. We find that weak selection, on the order of Ns ≈ 1, can significantly shorten the time taken for two lineages to reach a common ancestor and that the variance of this time is reduced by an even greater factor (fig. 4). Weak selection also distorts the shapes of larger trees by shortening the lengths of coalescent intervals near the root of the tree while leaving the other intervals similar in length to the neutral expectation (fig. 5). The TMRCA of a sample may also be significantly reduced, in some cases by nearly 50% compared with the neutral expectation (fig. 6). Although our numerical methods assume an infinite number of possible fitness states, our calculations appear accurate when compared with simulations with more than a few hundred sites.
Although a number of authors have constructed models of the coalescent with selection, most have focused on the action of selection at a single locus (Golding 1997; Neuhauser and Krone 1997; Barton and Etheridge 2004; Wakeley 2008) and have largely concluded that weak selection does not significantly impact the shape of genealogical trees. Simulation studies of selection at multiple sites have reinforced this view (Przeworski et al. 1999; Williamson and Orive 2002). The analysis here offers a contrasting view and demonstrates several ways in which tree shape is distorted as a result of selection. In particular, table 1 demonstrates that as the number of selected sites increases, populations harbor a greater standard deviation in fitness (σ), and this increase is commensurate with increased genealogical distortion. Among authors who have examined multiple-site simulations, some examined only selection coefficients too small to have an impact (Przeworski et al. 1999; Ns = 0.1). Others found deviations similar to those here but concluded that “selection only had a moderate effect on tree statistics…consistent with single locus results” (Williamson and Orive, 2002, p. 1379). Given that selection is likely to act at many sites simultaneously in natural populations and that many mutations are likely to impact reproduction only modestly, this work suggests that many real genealogies may experience considerable distortions due to selection.
One intriguing finding of this analysis is the fundamental importance of the variance τ² to the exclusion of the other properties of the distribution f (fig. 8). The parameter τ reflects the extent to which offspring fitness may differ from parental fitness and thus incorporates both mutation and selection. If mutation never occurs, or if mutations have no impact on fitness, then offspring will always have the same fitness as their parents, τ = 0, and genealogies conform to the neutral coalescent. Conversely, if mutations are frequent and their impact on fitness is large, offspring fitness may differ greatly from parental fitness, τ will be very large, and coalescences will happen much more rapidly than predicted under neutrality.
Why variance, but not, for instance, skewness, should be the primary determining factor of coalescence time remains unclear, but in many ways, it is fortunate. For instance, it suggests that very different mutational models may still have similar effects on genealogies, provided that they produce similar variances in fitness heritability. This finding may be particularly important when comparing analytic results with empirical data because in any real data set, the distribution of mutation probabilities and selection coefficients across a sequence is likely to be unknown. Nonetheless, the influence of a complicated mutational and selective regime on genealogical structure may be similar to the Gaussian model presented here, provided that τ² is the same. This result is supported by investigations of a finite-sites model of mutation and selection (fig. 7), which produces results very similar to those predicted by the continuous coalescent (as long as L is large enough), despite very different assumptions regarding the mutational model.
Although τ is presented here as a parameter of the model, τ may be calculated for any model in which the reproductive success of parents and offspring may be compared. Specifically, τ is the standard deviation of the differences between parent and offspring expected reproductive successes. τ may be estimated by tabulating the difference in actual reproductive success between a parent and its offspring over many parent–offspring pairs and then taking the square root of the variance among these differences. For haploid organisms with nonrecombining genomes, it is possible that τ may be calculated in laboratory studies. However, for organisms with recombining chromosomes, empirical assessment of τ is likely to be more difficult, unless a single polymorphism can be tracked. Nonetheless, it may be possible to estimate τ from a genealogy reconstructed for a particular genomic region.
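The tabulation described above can be sketched in a few lines. The function name `estimate_tau`, the input format, and the example values are hypothetical. The sketch also assumes that the measured reproductive successes are usable proxies for expected reproductive success; with real counts of offspring, sampling noise would inflate the estimate, and the sketch makes no attempt to correct for that.

```python
import math
import statistics


def estimate_tau(pairs):
    """Estimate tau from (parent_success, offspring_success) pairs.

    tau is taken to be the standard deviation of the parent-offspring
    differences, computed here from the sample variance (n - 1 divisor).
    """
    differences = [offspring - parent for parent, offspring in pairs]
    if len(differences) < 2:
        raise ValueError("at least two parent-offspring pairs are required")
    return math.sqrt(statistics.variance(differences))


if __name__ == "__main__":
    # Hypothetical relative reproductive successes, for illustration only.
    example_pairs = [(1.00, 1.10), (1.00, 0.90), (1.20, 1.20)]
    print(estimate_tau(example_pairs))
```

Using the sample variance rather than the population variance is a design choice that matters only when the number of pairs is small.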
One advantage of the approach presented here is that it allows for the likelihood of a genealogy to be calculated given a particular population size and τ. Using a genealogy sampler such as LAMARC (Kuhner, 2006) or BEAST (Drummond and Rambaut, 2007), it may then be possible to estimate the true values of these parameters using sequence data from natural populations. Such an approach would allow for estimation of the amount of heritable fitness variation produced at unlinked loci in a single generation, facilitating a test for selection based on deviations in genealogical structure. We consider the feasibility of such an approach in a forthcoming publication.
The work presented here is similar to models of coalescence in continuous habitats (Barton and Wilson 1995; Wilkins and Wakeley 2002; Wilkins 2004). These models typically assume Gaussian dispersal of offspring and strict population regulation, such that population density is distributed uniformly across the habitat. Under these assumptions and using a diffusion approximation, Wilkins and Wakeley (2002) found an analytic expression for the full distribution of pairwise coalescence times, conditional on the starting location of the samples and the variance of the dispersal function. The analysis in this paper may be seen as extending the model of Wilkins and Wakeley (2002) to include differential reproductive success along the habitat in a linear manner and assuming that population density is given by the skew-Gaussian distribution. One key difference, however, is that the model here rescales “space” each generation so that the mean “location” (fitness) is unity.
In order for the work here to be fully analytic, a function must be found that describes the steady-state distribution of fitnesses in a population with a particular mutation and selection model. Deriving this expression may require solving a functional equation relating the distribution of offspring fitnesses to the distribution of parental fitnesses. A potential alternative is to find an expression for the moments of the steady-state distribution by solving a system of equations relating each moment of the parental distribution to the offspring distribution. Such an approach requires a method of moment closure because each moment of the parental distribution affects a different moment in the offspring distribution (for instance, the mean of the offspring distribution is governed by the variance of the parental distribution). Additionally, as genealogical shape appears to be fairly sensitive to changes in the population variance (σ²), any approximations made must be quite accurate over the appropriate range of parameter values.
This analysis calls into question some techniques used to infer past population dynamics. Specifically, several related techniques have been proposed to infer population growth rates and historical sizes based upon analysis of the distribution of coalescent intervals (Kuhner et al. 1998; Pybus et al. 2000; Strimmer and Pybus 2001; Minin et al. 2008). As demonstrated above, however, weak selection may produce a systematic distortion of the intervals, such that basal intervals are much shorter than expected. Performing a skyride analysis or estimating the growth rate of a population from loci experiencing moderate selection at multiple sites produces a strong signal of population expansion, when in fact population size (and the population's inbreeding effective size) has remained constant (results not shown). Minin et al. (2008) analyzed the Egyptian hepatitis C virus (HCV), for example, and found a marked population expansion. However, because HCV has a high mutation rate and a single nonsegmented genome with high gene density, it seems unlikely that any region will be free from the effects of selection. Combining a skyline or growth rate analysis with the techniques presented here in a maximum-likelihood context may allow for a more robust estimate of historical population sizes and selection parameters.
We would like to thank Mary Kuhner, Joe Felsenstein, Jon Wilkins, and two anonymous reviewers for their helpful comments on a previous version of the manuscript. Financial support was provided by National Science Foundation grant DBI-0906018 to B.D.O.
APPENDIX
Here, we derive the mean and the variance of the fitness heritability function f from the discrete-sites model. We begin by considering the distribution of the number, say J, of mutated sites in an individual conditional on the individual's parent having exactly K mutated sites. Let X be the number of forward mutations and Y be the number of back mutations. Because each site mutates independently with probability μ, X and Y are conditionally independent binomial random variables with indices L − K and K, respectively, each with parameter μ. Let Z = X − Y be the change in the number of mutations from parent to offspring, so that J = K + Z. In the case where L and K are large and μ is small, X and Y are approximated well by Poisson random variables with rates μ(L − K) and μK. The large K assumption is satisfied when the selection coefficient is small, the regime we consider here. Z is then Skellam-distributed (Skellam 1946), with mean μ(L − 2K) and Pr{J = j | K} = Pr{Z = j − K}.
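As a rough numerical check on the Poisson approximation, the sketch below draws the forward and back mutation counts as binomial variables with indices L − K and K and parameter μ, takes their difference as the change in the number of mutated sites, and compares the sample moments with the Skellam moments. The parameter values and function names are hypothetical. The variance used for comparison, μ(L − K) + μK = μL, is the standard property of a difference of two independent Poisson variables and is not quoted from the text above.

```python
import random
import statistics


def binomial_draw(n, p, rng):
    """Draw a Binomial(n, p) variate as a sum of n Bernoulli trials."""
    return sum(1 for _ in range(n) if rng.random() < p)


def simulate_change(L, K, mu, replicates=4000, seed=7):
    """Simulate forward minus back mutations for a parent with K mutated sites."""
    rng = random.Random(seed)
    changes = []
    for _ in range(replicates):
        forward = binomial_draw(L - K, mu, rng)  # mutations at unmutated sites
        back = binomial_draw(K, mu, rng)         # reversions at mutated sites
        changes.append(forward - back)
    return changes


def skellam_moments(L, K, mu):
    """Mean and variance of a Skellam variable with rates mu(L-K) and mu*K."""
    rate_forward = mu * (L - K)
    rate_back = mu * K
    return rate_forward - rate_back, rate_forward + rate_back


if __name__ == "__main__":
    L, K, mu = 500, 200, 0.01  # illustrative values only
    changes = simulate_change(L, K, mu)
    print("simulated:", statistics.fmean(changes), statistics.pvariance(changes))
    print("Skellam:  ", skellam_moments(L, K, mu))
```

The exact binomial variance is Lμ(1 − μ), so the simulated variance should sit slightly below the Skellam value, with the gap vanishing as μ becomes small.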
References
Author notes
Associate editor: Rasmus Nielsen