Bayesian Nonparametric Inference of Population Size Changes from Sequential Genealogies

Julia A. Palacios; John Wakeley; Sohini Ramachandran

doi:10.1101/019216

Abstract

Sophisticated inferential tools coupled with the coalescent model have recently emerged for estimating past population sizes from genomic data. Accurate methods are available for data from a single locus or from independent loci. Recent methods that model recombination require small sample sizes, make constraining assumptions about population size changes, and do not report measures of uncertainty for estimates. Here, we develop a Gaussian process-based Bayesian nonparametric method coupled with a sequentially Markov coalescent model which allows accurate inference of population sizes over time from a set of genealogies. In contrast to current methods, our approach considers a broad class of recombination events, including those that do not change local genealogies. We show that our method outperforms recent likelihood-based methods that rely on discretization of the parameter space. We illustrate the application of our method to multiple demographic histories, including population bottlenecks and exponential growth. In simulation, our Bayesian approach produces point estimates four times more accurate than maximum likelihood estimation (based on the sum of absolute differences between the truth and the estimated values). Further, our method’s credible intervals for population size as a function of time cover 90 percent of true values across multiple demographic scenarios, enabling formal hypothesis testing about population size differences over time. Using genealogies estimated with ARGweaver, we apply our method to European and Yoruban samples from the 1000 Genomes Project and confirm key known aspects of population size history over the past 150,000 years.

1 Introduction

For a single non-recombining locus, neutral coalescent theory predicts the set of timed ancestral relationships among sampled individuals, known as a gene genealogy (Kingman 1982; Hudson 1983; Tajima 1983; Hudson 1990). In the coalescent model with variable population size, the rate at which two lineages coalesce, or have a common ancestor, is a function of the population size in the past. Here we denote the population size trajectory by N(t), where t is time in the past, and use the term local genealogy to describe ancestral relationships at one non-recombining locus. When analyzing multilocus sequences, a single local genealogy will not represent the full history of the sample. Instead, the set of ancestral relationships and recombination events among a sample of multilocus sequences can be represented by a graph, known as the ancestral recombination graph (ARG) which depicts the complex structure of neighboring local genealogies and results in a computationally expensive model for inferring N(t) (Griffiths and Marjoram 1997; Wiuf and Hein 1999).
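To make the dependence of coalescence on population size concrete, the following minimal sketch simulates the coalescent times of a single local genealogy. It assumes a constant haploid population size and the standard Kingman coalescent, in which k lineages coalesce at total rate k(k − 1)/(2N) per generation; the function name and the constant-size restriction are illustrative assumptions, not part of the method developed in this paper.

```python
import random


def simulate_coalescent_times(n, pop_size, rng):
    """Return the n - 1 coalescent times (generations before present).

    Assumes a constant haploid population size `pop_size`; while k lineages
    remain, the waiting time to the next coalescence is exponential with
    rate k * (k - 1) / (2 * pop_size).
    """
    times = []
    t = 0.0
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2.0 / pop_size
        t += rng.expovariate(rate)
        times.append(t)
    return times
```

Larger values of `pop_size` lower the rate and push coalescent times further into the past, which is the signal that inference of N(t) exploits.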

Recent studies have leveraged computationally simpler approximations for the coalescent with recombination—the sequentially Markov coalescent (SMC) (McVean and Cardin 2005) and its variant SMC′ (Marjoram and Wall 2006; Chen et al. 2009)—both of which model local genealogies as a continuous time Markov process along sequences (Figure 1). The difference between the SMC and SMC′ is that the SMC models only the class of recombination events that alter local genealogies of the sample. In general, the SMC′ is a better approximation to the ARG than the SMC (Chen et al. 2009; Wilton et al. 2015). Because of these features, in this work we rely on the SMC′ to model local genealogies with recombination.

Figure 1: SMC′ model for inferring population size trajectories.

Drawn after Rasmussen et al. (2014) to highlight notation specific to our study. A. Observed sequence data in a segment of length L from five individuals; three loci are shown delimited by recombination breakpoints b₁ and b₂. Only the derived mutations at polymorphic sites are shown. B. Corresponding local genealogies g_i for each locus i. The five sampled individuals are depicted as black filled circles. Local genealogies have a Markovian degree 1 dependency. Each inter-coalescent time (the time interval between coalescent events denoted as empty circles) provides information about past population size (number of gray filled circles at a given time point). Moving from left to right after recombination breakpoint b₁, the pruning location p₁ is selected from genealogy g₀ and the pruned branch is regrafted back on the genealogy (blue filled circle). The coalescent event of g₀ depicted as a red filled circle in g₁ is deleted. This creates the next genealogy g₁. The process continues until L. At L, the population size trajectory N(t) (depicted as a black curve superimposed on g₂) can be inferred.

Under the coalescent and the sequentially Markov coalescent (SMC and SMC′) models, population size trajectories and sequence data are separated by two stochastic processes: i) a state process which describes the relationship between the population size trajectory and the set of local genealogies, and ii) an observation process which describes how the hidden local genealogies are observed through patterns of nucleotide diversity in the sequence data. The observation process includes mutation and genotyping error while the state process models coalescence. Sequence data are then used to make inferences of population size trajectories. In this paper, we restrict attention to the state process of local genealogies and show how inferences of population size trajectories can be made from them. We solve a number of key modeling and inference problems, and thus provide a basis for developing efficient algorithms to infer population parameters from sequence data directly.

Whole-genome inference of population size trajectories has been hampered by the enormous size of the state space of local genealogies when the sample size is large. The pioneering pairwise sequentially Markovian coalescent (PSMC) method of Li and Durbin (2011) employed the SMC to make inferences from a sample of size two (n = 2). In this method, time is discretized and the population size trajectory is piece-wise constant, allowing pairwise genealogies to be discretized as well. Subsequent methods for samples larger than two similarly rely on the discretization of time and genealogies. The natural extension of the PSMC to n > 2 is the multiple sequentially Markovian coalescent (MSMC) (Schiffels and Durbin 2014). However, the MSMC models only the most recent coalescent event of the sample, and hence its estimation of population sizes is limited to very recent times. Other recent methods propose efficient ways of exploring the state space of hidden genealogies for n > 2 (Sheehan et al. 2013; Rasmussen et al. 2014), yet they also rely on discretizing the state space of local genealogies and assume a piece-wise constant trajectory of population sizes. We show that the a priori specification of change points for the piece-wise population size trajectory required by current approaches is problematic because estimates of N(t) are sensitive to this specification. Moreover, current methods do not generate interval estimates for N(t).

Gaussian Process-based Bayesian inference of population size trajectories has proven to be a powerful and flexible nonparametric approach when applied to a single local genealogy (Palacios and Minin 2013; Lan et al. 2015). The two main advantages of the GP-based approach are: (i) it does not require a specific functional form of the population size trajectory (such as constant or exponential growth) and (ii) it does not require an arbitrary specification of change points in a piece-wise constant or linear framework.

In this paper, we show the downstream effects of discretizing time, assuming a piecewise constant trajectory, and reporting only point estimates for past population sizes. We overcome previous limitations by introducing a Bayesian nonparametric approach with a Gaussian Process (GP) to model the population size trajectory as a continuous function of time. More specifically, we model the logarithm of the population size trajectory a priori as a Gaussian process (the log ensures our estimates are positive). As mentioned above, we assume that local gene genealogies are known. For our Bayesian model, we develop a Markov Chain Monte Carlo (MCMC) method to sample from the posterior distribution of population sizes over time. Our MCMC algorithm uses the recently developed algorithm Split Hamiltonian Monte Carlo (splitHMC) (Shahbaba et al. 2014; Lan et al. 2015). splitHMC updates all model parameters jointly and it can be extended to a full inferential framework that is directly applicable to sequence data. In order to compare our Bayesian GP-based estimation of population size trajectories with a piece-wise constant maximum likelihood-based estimation (e.g. Li and Durbin 2011; Sheehan et al. 2013; Schiffels and Durbin 2014), we implemented the Expectation-maximization (EM) algorithm within our framework and computed the observed Fisher information to obtain confidence intervals of the maximum likelihood estimates.

Lastly, we address a key problem for inference of population size trajectories under sequentially Markov coalescent models: the efficient computation of the transition densities needed in the calculation of likelihoods. Here, we express the transition densities of local genealogies in terms of local ranked tree shapes (Tajima 1983) and coalescent times, and show that these quantities are statistically sufficient for inferring population size trajectories either from sequence data directly or from the set of local genealogies. The use of ranked tree shapes allows us to exploit the state process of local genealogies efficiently since the space of ranked tree shapes has a smaller cardinality than the space of labeled topologies (Sainudiin et al. 2014).

2 Methods: SMC′ Calculations

Following notation similar to Rasmussen et al. (2014) (Table 1), a realization of the embedded SMC′ chain consists of a set of m local genealogies (g₀, g₁,…,g_m-1), m − 1 recombination breakpoints at chromosomal locations (b₁, b₂,…,b_m-1), and m − 1 pruning locations (p₁, p₂,…,p_m-1), where p_i = (u_i, w_i) indicates the time of the recombination event u_i and the branch w_i where recombination happened in genealogy g_i-1 (Figure 1). Genealogy g₀ corresponds to the genealogy of n sequences that contains the set of timed ancestral relationships among the n individuals for the chromosomal segment (0, b₁]. Genealogy g_i corresponds to the genealogy of the same n sequences for the chromosomal segment (b_i, b_i+1] for i = 1, 2,…,m − 2. Finally, t_jⁱ denotes the time when two of j lineages coalesce in genealogy g_i, measured in units of generations before present.

Table 1:

Notation for the SMC′ model used in this work.

Using capital letters to denote random variables, the evolution of the SMC′ process along chromosomal segments is governed by a point process B = {B_i}_i∈ℕ that represents the random locations of recombination breakpoints. We use S_i = B_i − B_i-1, for i = 1, 2,…, m, to denote the segment lengths for each local genealogy, with S₀ = B₀ = 0. Let G = {G_i}_i∈ℕ be the chain which records the local genealogies, and let P = (U, W) = {(U_i, W_i)}_i∈ℕ represent the chain which records the pruning locations (time and branch) on G. The sequence (G_i, P_i = {U_i, W_i}, B_i) has the following conditional independence relation:

Given a chain of local genealogies, pruning locations and recombination breakpoints, the joint transition probability to a new genealogy, pruning location and locus length can be expressed as the product of the locus length probability conditioned on the current genealogy (Expression 1, above), the pruning location probability conditioned on the current genealogy (Expression 2, above) and, the transition probability of the new genealogy conditioned on the current genealogy and pruning location (Expression 3, above).
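In symbols, the factorization described in words above can be sketched as follows. This is a reconstruction from the verbal description, with the three factors written in the order in which they are listed; the explicit conditioning on ρ and N(t) is an assumption based on the dependencies described in Section 2.1.

```latex
\begin{equation*}
P\big(G_i, P_i, B_i \mid G_{i-1}, B_{i-1}\big)
  = \underbrace{P\big(B_i \mid G_{i-1}, B_{i-1}, \rho\big)}_{\text{locus length}}\;
    \underbrace{P\big(P_i \mid G_{i-1}\big)}_{\text{pruning location}}\;
    \underbrace{P\big(G_i \mid G_{i-1}, P_i, N(t)\big)}_{\text{new genealogy}}
\end{equation*}
```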

2.1 Complete data transition densities

Consider the chain of local genealogies g = (g₀, g₁,…,g_m-1) with recombination breakpoints at b = (0, b₁,…,b_m-1). According to the SMC′ process, the first local genealogy g₀ follows the standard coalescent density: where and are the set of coalescent times in local genealogy g₀. The piece-wise constant function Aⁱ(t) denotes the number of ancestral lineages present at time t in genealogy g_i, that is with .
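The lineage-count function and the standard coalescent density can be illustrated in the simplest setting. The sketch below assumes a constant haploid population size N (so that each pair of lineages coalesces at rate 1/N) and takes the n − 1 coalescent times of a genealogy measured from the present; the function names are hypothetical and the constant-size restriction is a simplification of the general N(t) case treated in the text.

```python
import math


def lineages_at(t, coal_times, n):
    """Number of ancestral lineages at time t: n minus the coalescences at or before t."""
    return n - sum(1 for c in coal_times if c <= t)


def log_coalescent_density(coal_times, n, pop_size):
    """Log density of a genealogy's coalescent times under constant size.

    Each coalescent event contributes log(1 / N), and each interval with k
    lineages contributes -C(k, 2) * (interval length) / N.
    """
    logp = 0.0
    prev = 0.0
    for idx, t in enumerate(sorted(coal_times)):
        k = n - idx  # lineages present just before this coalescence
        pairs = k * (k - 1) / 2.0
        logp += -math.log(pop_size) - pairs * (t - prev) / pop_size
        prev = t
    return logp
```

For a variable N(t), the term `pairs * (t - prev) / pop_size` would be replaced by the integral of C(k, 2)/N(s) over the interval.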

Given a current local genealogy g_i-1, the distribution of the length S_i = B_i − b_i-1 of the current locus depends on the current state of the SMC′ chain through the local genealogy’s total tree length l_i−1 (the sum of all branch lengths in g_i-1) and the recombination rate per site per generation ρ. At recombination breakpoint b_i, a new local genealogy g_i is generated (Figure 1). This new local genealogy g_i depends on the previous local genealogy g_i-1 and the population size trajectory N(t). To generate g_i we first randomly choose a pruning location p_i (consisting of a pruning time u_i and a lineage w_i) uniformly along g_i-1. At pruning location p_i, we add a new lineage and coalesce it further in the past at time t_new with some lineage, c_i (Figure 2). We then delete the w_i lineage’s segment from u_i to t_del (the coalescent time of lineage w_i). The transition density to a new genealogy at recombination breakpoint b_i is then where l_i-1 denotes the total tree length of g_i-1.
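The first two steps of this generative process, drawing the locus length and a uniform pruning location, can be sketched as follows. The code assumes that the locus length is exponentially distributed with rate ρ times the total tree length (a common SMC′ convention; scaling conventions for ρ differ between papers) and represents a genealogy only by its coalescent times; the returned lineage index is a label within the current inter-coalescent interval, not a persistent branch identity. All function names are hypothetical.

```python
import random


def intervals(coal_times, n):
    """List of (k, start, end): k lineages are present on [start, end)."""
    out, prev = [], 0.0
    for idx, t in enumerate(sorted(coal_times)):
        out.append((n - idx, prev, t))
        prev = t
    return out


def total_tree_length(coal_times, n):
    """Sum of all branch lengths up to the most recent common ancestor."""
    return sum(k * (end - start) for k, start, end in intervals(coal_times, n))


def sample_segment_length(coal_times, n, rho, rng):
    """Assumption: segment length ~ Exponential(rho * total tree length)."""
    return rng.expovariate(rho * total_tree_length(coal_times, n))


def sample_pruning_location(coal_times, n, rng):
    """Draw (u, lineage) uniformly along the branches of the genealogy."""
    target = rng.random() * total_tree_length(coal_times, n)
    ivs = intervals(coal_times, n)
    for k, start, end in ivs:
        mass = k * (end - start)
        if target < mass:
            lineage, offset = divmod(target, end - start)
            return start + offset, min(int(lineage), k - 1)
        target -= mass
    # Fallback for floating-point round-off at the very top of the tree.
    k, start, end = ivs[-1]
    return end, 0
```

Intervals with more lineages or longer duration receive proportionally more pruning events, which is what "uniformly along g_i-1" means here.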

This generative process of local genealogies can result in the two types of transitions depicted in Figure 2. A visible transition results in a genealogy g_i which is different from g_i-1 (Figure 2A), while an invisible transition makes g_i identical to g_i-1 (Figure 2B).

Figure 2: Schematic representation of SMC′ transitions given a recombination break-point at location b_i (indicated as an arrow in each panel).

A: Visible transition. We uniformly sample the pruning location p_i from g_i-1 at time u_i along some branch w_i, add a new branch at u_i and re-graft it (dashed black line). The new branch coalesces with some branch c_i at time t_new. We then delete branch w_i and the coalescent time t_del to generate genealogy g_i. Any pruning time along the branch w_i (shown in green) would have produced the same visible transition from g_i-1 to g_i. B: Invisible transition. We uniformly sample the pruning location p_i = (u_i, w_i), add a new branch at u_i and re-graft it. The new branch coalesces with itself (dashed black line); that is, C_i = w_i, and then the segment of w_i is deleted. If C_i = w_i, any pruning location along the green branches would have produced the same invisible transition.

An invisible transition, g_i = g_i-1, occurs when c_i = w_i. Given the pruning location p_i = (u_i, w_i), a transition to an invisible event occurs when and C_i, the random variable indicating the lineage that coalesces with lineage , takes the value w_i. The probability of an invisible transition is given by Thus, the joint transition probability to an invisible event with pruning location (u_i, w_i), given g_i-1, is:
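As a worked special case, consider n = 2 with constant population size N and a genealogy of height T. The sketch below assumes that, under the SMC′, the re-grafted lineage coalesces with each lineage present at time t (including its own pruned branch) at rate 1/N, so that with two lineages the total rate is 2/N until time T. Under these assumptions the probability of re-coalescing with the pruned branch given pruning time u is the integral of (1/N) exp(−2(t − u)/N) from u to T, which has the closed form below. The function names are hypothetical, and the second function is only a numerical check of the first.

```python
import math


def invisible_prob_given_u(u, height, pop_size):
    """Closed form for n = 2, constant size: 0.5 * (1 - exp(-2 (T - u) / N))."""
    return 0.5 * (1.0 - math.exp(-2.0 * (height - u) / pop_size))


def invisible_prob_numeric(u, height, pop_size, steps=20000):
    """Midpoint-rule evaluation of int_u^T (1/N) exp(-2 (t - u) / N) dt."""
    h = (height - u) / steps
    total = 0.0
    for i in range(steps):
        t = u + (i + 0.5) * h
        total += math.exp(-2.0 * (t - u) / pop_size) / pop_size * h
    return total
```

The probability is bounded above by 1/2 in this case, because the new lineage is equally likely to find either of the two branches first.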

2.2 Transition densities averaged over unknown pruning locations

Even though we will assume that local genealogies are known, in order to anticipate later applications to sequence data we do not wish to make the same assumption about pruning locations. Thus, we average over pruning locations to obtain marginal transition densities between genealogies, for both visible and invisible transitions.

To compute the marginal visible transition density to a new genealogy , we need to average over all possible pruning locations p_i = (u_i, w_i) along g_i-1. By comparing the two genealogies g_i-1 and g_i in Figure 2A, we know that p_i corresponds to the lineage w_i at some time along , or equivalently, along . In general, comparison of g_i-1 and g_i may not provide complete information to identify the lineage that was pruned. When the children of the node corresponding to t_del and the children of the node corresponding to t_new are the same, pruning different branches can lead to the same transition. We enumerate all cases of incomplete information for visible transitions in Supporting Information Figure S1.

We introduce a function I^i-1(t), equal to the number of possible lineages at time t where the pruning location along g_i-1 would produce a visible transition to g_i. I^i-1(t) is a piece-wise constant function that takes the values in {0, 1, 2} depending on whether the pruning location p_i can happen in 0, 1 or 2 branches at time t. In the example in Figure 2A, For a general I^i-1(t) piece-wise constant function that indicates the number of possible pruning branches at time t, the marginal visible transition density to a new genealogy is

Turning now to the computation of marginal transition probabilities for invisible events, we need to average over all possible pruning locations p_i. Consider the example in Figure 2B and choosing a pruning time (u_i) along g_i-1. In order to have an invisible transition, the coalescing branch C_i must be the same pruning branch W_i. In Figure 2B the new coalescent time can happen along five lineages in the interval , three lineages in the interval , and two lineages in the interval . To generalize this calculation, we introduce the quantity with (n + 1) ≥ j ≥ k ≥ 2 which denotes the number of lineages in g_i that are free (do not coalesce), in the time segment . The time interval includes the interval of pruning up to the interval of self-coalescence . Thus, if the pruning time happens at time, an invisible transition with new coalescent time can happen along free lineages.

In Figure 2B, u_i happened in the time interval . If the new coalescent time happens in the interval along the same (unknown) pruning branch, then this invisible transition has probability with F_5,5 = 5.

Now consider the same example of Figure 2B but with an unknown pruning time u_i. The joint event where recombination occurs at pruning time and coalescent time occurs in the interval and this results in an invisible transition has probability: where denotes the double integral expression in Equation 9 for ease of notation.

An invisible transition would also result if and along the same (unknown) pruning branch; according to Figure 2B, this can happen along three lineages, so and this event has probability: If we continue considering the cases where and or we have and . Then, the joint probability of an invisible event and is For the cases when and the new coalescent time falls in another coalescent interval , we need to compute the following:

The joint probability of and no coalescence in the interval :
The probability of no coalescence in any of the intermediate coalescent intervals : and
The probability of coalescing at

Then, represents the probability that the pruning location is w_i at time and the new lineage coalesces at time with lineage c_i = w_i. Overall, the marginal transition probability to an invisible event is:

2.3 The likelihood of the embedded SMC′ chain

Instead of having a complete realization of the embedded SMC′ chain of m local genealogies g₀,…,g_m-1 and pruning locations p₁,…,p_m-1 at recombination breakpoints b₁,…,b_m-1, we assume that our data (unless otherwise noted) consist only of m local genealogies at recombination breakpoints from a chromosomal segment of length L. Note that our observed data are not sequence data. More specifically, our observed data are Then, the observed data likelihood is where h(L − b_m-1 | g_m-1, ρ) is the survival function in state g_m-1. Equation 13 is factored into terms that depend on N(t) alone and ones that depend on ρ alone. The terms that depend on ρ, given by Equation 5, depend on the data only through total tree lengths l₀,…, l_m−1 and locus lengths s₁,…, s_m-1, L − b_m−1. By the factorization theorem for sufficient statistics, local tree lengths l₀,…, l_m−1 and locus lengths s₁,…, s_m−1, L − b_m-1 are sufficient for inferring ρ.
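The sufficiency statement for ρ can be illustrated with a closed-form estimator. The sketch below assumes that each locus length is exponentially distributed with rate ρ l_i-1 and that the last locus is right-censored at L, so that the ρ-dependent part of the log-likelihood is the sum over i of [log(ρ l_i-1) − ρ l_i-1 s_i] minus ρ l_m-1 (L − b_m-1). Maximizing over ρ under that assumption gives the number of observed breakpoints divided by the total "exposure". The function name is hypothetical.

```python
def rho_mle(tree_lengths, segment_lengths, last_censored_length):
    """Maximum likelihood estimate of rho from sufficient statistics.

    tree_lengths: l_0, ..., l_{m-1} (total tree length of each local genealogy)
    segment_lengths: s_1, ..., s_{m-1} (completed locus lengths)
    last_censored_length: L - b_{m-1} (length of the final, censored locus)
    Assumes exponential locus lengths with rate rho * l_{i-1}.
    """
    m = len(tree_lengths)
    if len(segment_lengths) != m - 1:
        raise ValueError("expected m - 1 completed segments for m genealogies")
    exposure = sum(l * s for l, s in zip(tree_lengths[:-1], segment_lengths))
    exposure += tree_lengths[-1] * last_censored_length
    return (m - 1) / exposure
```

Only tree lengths and locus lengths enter the estimator, in line with the factorization argument above.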

3 Methods: Inference

Current coalescent-based methods that infer a population size trajectory N(t) from whole-genome data assume N(t) is a piece-wise constant function with change points x₁ = 0 < x₂ <… < x_d (Li and Durbin 2011; Sheehan et al. 2013; Rasmussen et al. 2014; Schiffels and Durbin 2014). That is,

$$N(t) = \sum_{i=1}^{d} N_i \, 1_{\{x_i \le t < x_{i+1}\}}, \qquad x_{d+1} = \infty. \tag{14}$$

Equation 14 presents two challenges. The first challenge lies in the specification of the change points. The narrower an interval is, the higher the probability that we do not observe coalescent times in that interval. The fewer observed coalescent times in an interval, the greater the uncertainty of the estimate (if the estimate even exists). The second challenge lies in the specification of the time window (0, x_d): if x_d is set too far in the past, we might not have enough data to accurately estimate N(t) for x_d ≤ t < ∞.
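A piece-wise constant trajectory of this kind is straightforward to represent in code. The following sketch assumes d change points starting at zero and d population sizes, with the last size applying to all times at or beyond the last change point; the function name and the example values are hypothetical.

```python
from bisect import bisect_right


def piecewise_constant(change_points, values):
    """Return N(t) with N(t) = values[i] for change_points[i] <= t < change_points[i + 1].

    The last value applies for all t >= change_points[-1].
    """
    if len(change_points) != len(values) or change_points[0] != 0:
        raise ValueError("need one value per change point, starting at 0")

    def pop_size(t):
        if t < 0:
            raise ValueError("time must be non-negative")
        return values[bisect_right(change_points, t) - 1]

    return pop_size
```

For example, `piecewise_constant([0, 10, 100], [1000, 500, 2000])` returns a function that equals 1000 on [0, 10), 500 on [10, 100) and 2000 afterwards.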

In order to solve the first challenge, Rasmussen et al. (2014) and Li and Durbin (2011) distribute the d change points evenly on a logarithmic scale: where κ is specified by the user. Schiffels and Durbin (2014) propose discretizing time according to the quantiles of the exponential distribution, where λ is the rate of an exponential distribution. Schiffels and Durbin (2014) model the time to the most recent coalescent event and set . However, Equation 16 is not directly applicable here because we use all coalescent events for inference.
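The two discretization strategies can be sketched as follows. The exact formulas are assumptions: the log-spaced grid uses a PSMC-style form x_i = (exp[(i − 1)/(d − 1) · log(1 + κ x_d)] − 1)/κ, which places points evenly on a logarithmic scale between 0 and x_d, and the quantile grid places the change points at the (i − 1)/d quantiles of an exponential distribution with rate λ. Function names are hypothetical, and the parameterization used by any particular software package may differ.

```python
import math


def log_spaced_grid(d, x_max, kappa):
    """d change points from 0 to x_max, evenly spaced in log(1 + kappa * x)."""
    scale = math.log(1.0 + kappa * x_max)
    return [(math.exp(i / (d - 1) * scale) - 1.0) / kappa for i in range(d)]


def exponential_quantile_grid(d, lam):
    """d change points at the 0, 1/d, ..., (d - 1)/d quantiles of Exp(lam)."""
    return [math.log(d / (d - i)) / lam for i in range(d)]
```

Larger κ concentrates change points near the present, which illustrates why the choice of κ and d matters for the resulting estimates.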

In the following sections, we first present our Bayesian nonparametric method, then develop a maximum likelihood method under a piece-wise constant trajectory so we can directly compare an EM-based method (Li and Durbin 2011; Sheehan et al. 2013) to our Bayesian nonparametric method. The R code for all simulation studies and real data analyses conducted in this paper is publicly available at http://ramachandran-data.brown.edu/datarepo/.

3.1 Gaussian-Process-based Bayesian Nonparametric Estimation of N(t)

For our Bayesian methodology, we assume the following log-Gaussian Process prior on the population size trajectory, N(t): where 𝒢𝒫(0, C(τ)) denotes a Gaussian process with mean function 0 and inverse covariance function C^-1(τ) = τ C^-1 with precision parameter τ. For computational convenience, we use Brownian motion as our prior for f(t) since its inverse covariance matrix is sparse. We place a Gamma prior on the precision parameter τ, Assuming that recombination rate ρ is known, the posterior distribution of model parameters (Figure 3) is then The first two factors on the right side of Equation 18, detailed in Equations 8 and 11, involve integration over N(t), an infinite dimensional random function (Equation 17). We approximate the integral by the Riemann sum over a partition of the integration interval. That is, for x_i < a < x_i+1 <… < x_k-1 < b < x_k, Δ_i = x_i+1 − a, Δ_k = b − x_k-1 and Δ_j = x_j+1 − x_j for i < j < k. is a representative value of f(t) in the interval (x_j, x_j+1); in our implementation, we set with . This way, we discretize our time window into d evenly spaced segments x₁ = 0 < x₂ <… < x_d, with , the maximum time to the most recent common ancestor observed in the sequence of local genealogies, and approximate N(t) by a piece-wise linear function evaluated at .
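The Riemann-sum approximation can be sketched as follows for an integral of the form ∫ₐᵇ exp[−f(t)] dt, that is, the integral of 1/N(t) when N(t) = exp[f(t)]. The code assumes one representative value of f per grid cell and clips the first and last cells to the integration limits, mirroring the Δ definitions above; the function name and the choice of integrand are illustrative assumptions.

```python
import math


def riemann_inverse_size_integral(a, b, grid, f_star):
    """Approximate int_a^b exp(-f(t)) dt using one value of f per grid cell.

    grid: increasing time points x_1 < ... < x_d
    f_star: representative log population size on each cell (x_j, x_{j+1}),
            so len(f_star) == len(grid) - 1
    Cells are clipped to [a, b], so partial cells contribute partial widths.
    """
    if len(f_star) != len(grid) - 1:
        raise ValueError("need one representative value per grid cell")
    total = 0.0
    for j in range(len(grid) - 1):
        lo = max(a, grid[j])
        hi = min(b, grid[j + 1])
        if hi > lo:
            total += math.exp(-f_star[j]) * (hi - lo)
    return total
```

With a finer grid the approximation improves, at the cost of a higher-dimensional parameter vector to sample.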

Figure 3: Structure of our Bayesian model for inferring population size trajectories from a realization of the SMC′ process at recombination breakpoints.

Hyperparameter τ controls the smoothness of the log-Gaussian process prior on N(t). Local genealogies depend on N(t) and form a Markov chain of degree one. Given the current local genealogy g_i-1, we sample the location of the new recombination breakpoint b_i and a pruning location p_i on genealogy g_i-1. The new genealogy g_i depends on N(t), p_i and g_i-1.
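The hierarchy summarized in Figure 3 can be written compactly as below. This is a sketch assembled from the verbal description of the model: the log-Gaussian process prior on N(t), the Gamma prior on τ with parameters α and β, and a posterior proportional to likelihood times priors. Writing the likelihood as a single factor P(g | f, ρ) is a simplification; the factor-by-factor form used in the main derivation may be arranged differently.

```latex
\begin{align*}
N(t) &= \exp[f(t)], \\
f(t) \mid \tau &\sim \mathcal{GP}\big(0,\, C(\tau)\big), \qquad C^{-1}(\tau) = \tau\, C^{-1}, \\
\tau &\sim \mathrm{Gamma}(\alpha, \beta), \\
P\big(f, \tau \mid \mathbf{g}, \rho\big) &\propto P\big(\mathbf{g} \mid f, \rho\big)\; P\big(f \mid \tau\big)\; P(\tau).
\end{align*}
```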

We condition on the set of m local genealogies g₀,…,g_m-1 to generate posterior samples for the vector and τ and use these posterior samples to infer N(t) at , where . Updating N(t) and τ separately is not recommended because of their strong dependency (Lan et al. 2015). Therefore, we update (N(t), τ) jointly in an MCMC sampling algorithm using Split Hamiltonian Monte Carlo (Shahbaba et al. 2014; Lan et al. 2015). Split Hamiltonian Monte Carlo relies on our ability to calculate the log-likelihood of the observed data and the gradient vector of the log-likelihood (i.e., the score function). The log-likelihood of the observed data is approximated via sums of the form in Equation 19. We approximate the score function ∇ℒ_obs(Y; f*) with respect to f* by applying Fisher’s identity: where, at each iteration in the MCMC, expectation is calculated using the current value of f*. We show the details of this calculation in the Appendix.
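For reference, Fisher's identity states that the score of the observed-data log-likelihood equals the conditional expectation of the complete-data score given the observed data. In the sketch below, Y_c denotes the complete data (local genealogies together with the unobserved pruning locations) and ℒ_c is a label introduced here for the complete-data log-likelihood; both symbols are notational assumptions for this illustration.

```latex
\begin{equation*}
\nabla_{f^*}\, \mathcal{L}_{\mathrm{obs}}(Y; f^*)
  = E\!\left[\, \nabla_{f^*}\, \mathcal{L}_{c}(Y_c; f^*) \;\middle|\; Y;\, f^* \right]
\end{equation*}
```

The identity is useful because the complete-data score is available in closed form once pruning locations are fixed, whereas the observed-data score involves the marginalization over them.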

Alternatively, one can update N(t) in the MCMC algorithm using the Elliptical Slice Sampler (Murray et al. 2010) with a fixed value of τ (perhaps estimated from previous studies or from a preliminary run of the Split Hamiltonian Monte Carlo algorithm). The advantage of the Elliptical Slice Sampler over Split Hamiltonian Monte Carlo is purely computational, since the Elliptical Slice Sampler does not require calculation of the score function.
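A single elliptical slice sampling update can be sketched as follows. This is a generic implementation of the algorithm of Murray et al. (2010) for a zero-mean Gaussian prior, not the code used in this paper: `log_lik` stands for any function returning the log-likelihood of a discretized log-trajectory, and `brownian_prior_draw` is a hypothetical helper that draws a Brownian-motion path on a grid with fixed precision τ.

```python
import math
import random


def elliptical_slice_step(f, log_lik, draw_prior, rng):
    """One elliptical slice sampling update of f under a zero-mean Gaussian prior."""
    nu = draw_prior(rng)  # auxiliary draw that defines the ellipse
    log_y = log_lik(f) + math.log(1.0 - rng.random())  # slice level
    theta = rng.uniform(0.0, 2.0 * math.pi)
    theta_min, theta_max = theta - 2.0 * math.pi, theta
    while True:
        proposal = [fi * math.cos(theta) + ni * math.sin(theta)
                    for fi, ni in zip(f, nu)]
        if log_lik(proposal) > log_y:
            return proposal
        # Shrink the bracket towards theta = 0 (the current state) and retry.
        if theta < 0.0:
            theta_min = theta
        else:
            theta_max = theta
        theta = rng.uniform(theta_min, theta_max)


def brownian_prior_draw(grid_steps, tau):
    """Return a sampler of Brownian-motion paths with precision tau (hypothetical helper)."""
    def draw(rng):
        path, value = [], 0.0
        for dt in grid_steps:
            value += rng.gauss(0.0, math.sqrt(dt / tau))
            path.append(value)
        return path
    return draw
```

Only likelihood evaluations are needed, with no gradients and no tuning parameters, which is the computational advantage noted above.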

3.2 Maximum-likelihood estimation of N(t) with measures of uncertainty

We assume that the population size trajectory N(t) is defined as in Equation 14. The standard coalescent density (Equation 4) and the transition densities defined in Equations 11 and 8 are tractable, so calculation of the likelihood (Equation 13) is tractable. However, maximization of the likelihood function cannot be performed analytically because pruning locations are missing. We rely on the Expectation-Maximization (EM) algorithm (Dempster et al. 1977) to find the maximum likelihood estimator of N = (N₁,…,N_d). The complete data Y_c for inferring N(t) are then the set of local genealogies g₀,…,g_m-1 and the set of pruning locations p₁,…,p_m-1. For the invisible transitions, we also need to know the new coalescent times , where ℐ ⊂ {1, 2,…,m − 1} denotes the set of indices of invisible transitions (transition i is an invisible transition if g_i = g_i-1).

The complete data log-likelihood is then The EM algorithm starts by initializing the population size trajectory to a piece-wise constant function with change points x₁,…, x_d with arbitrarily chosen vector N⁰. At the kth iteration of the algorithm we set The conditional expectation in Equation 21 is conditional on the observed data Y defined in Equation 12. Let be the ordered set of time points corresponding to the change points x₁,…,x_d and the coalescent time points tⁱ of local genealogy i. If the transition from g_i to g_i+1 is visible, we replace the jth time point by , where j corresponds to the index such that . For ease of notation, we will denote the number of time intervals |xⁱ| by D = d + n − 2. Let be an indicator function that takes the value of 1 when the jth interval contains a coalescent time of the first genealogy g₀. Then, the log density of the first genealogy is: Let be an indicator function that takes the value of 1 when the new coalescent time of genealogy i happens in the corresponding time interval , and let the adjusted interval length be
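In generic form, the EM update referred to above alternates between computing a conditional expectation and maximizing it. The sketch below uses the standard formulation of Dempster et al. (1977), with Y the observed data, Y_c the complete data and N^k the vector of piece-wise constant population sizes at iteration k; the indexing convention (starting from N⁰) follows the initialization described in the text.

```latex
\begin{equation*}
\mathbf{N}^{k}
  = \arg\max_{\mathbf{N}}\;
    E\!\left[\, \log P\big(Y_c \mid \mathbf{N}\big) \;\middle|\; Y,\ \mathbf{N}^{k-1} \right],
  \qquad k = 1, 2, \ldots
\end{equation*}
```

Each iteration cannot decrease the observed-data likelihood, so the sequence is typically run until successive values of N^k change by less than a chosen tolerance.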

Then, the augmented transition density can be expressed as: where zⁱ and Δ ⁱ are the vectors with and elements. For the EM algorithm we need to compute the conditional expected vectors and . The details of these calculations are in the Appendix.

We use the Fisher information matrix to compute approximate standard errors of and use these standard errors together with asymptotic normality of maximum likelihood estimators to produce confidence intervals for log population size piece-wise trajectories. We compute the observed Fisher information matrix following Louis (1982): where is the gradient and is the Hessian of the complete-data log-likelihood with respect to log N. This requires the calculation of conditional cross-product means and conditional second moments described in the Appendix.
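For reference, the identity of Louis (1982) expresses the observed information in terms of complete-data quantities. In the sketch below, θ = log N, S_c and H_c are labels introduced here for the gradient and Hessian of the complete-data log-likelihood with respect to θ, and the last line gives the usual Wald-type 95% interval on the log scale; the symbol names are notational assumptions.

```latex
\begin{align*}
I_{\mathrm{obs}}(\hat{\theta})
  &= E\!\left[-H_c(\hat{\theta}) \mid Y\right]
   - E\!\left[S_c(\hat{\theta})\, S_c(\hat{\theta})^{\top} \mid Y\right]
   + E\!\left[S_c(\hat{\theta}) \mid Y\right] E\!\left[S_c(\hat{\theta}) \mid Y\right]^{\top}, \\
\log \hat{N}_j &\pm 1.96\, \sqrt{\big[I_{\mathrm{obs}}(\hat{\theta})^{-1}\big]_{jj}}, \qquad j = 1, \ldots, d.
\end{align*}
```

At the maximum likelihood estimate the conditional expectation of the score is zero, so the last term of the first line vanishes.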

4 Results

We simulated 1000 local genealogies of 2, 20 and 100 individuals from each of the three different demographic models described in Table 2 using MaCS (Chen et al. 2009); see Supporting Information for details of these simulations. We assumed that all individuals were sampled at time t = 0 under a demographic model in Table 2.

Table 2:

Simulated demographic scenarios. The argument t denotes time measured in units of N₀ generations.

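We compare the point estimates with the truth for each demographic model using the sum of relative errors (SRE):

$$\mathrm{SRE} = \sum_{i=1}^{K} \frac{|\hat{N}(x_i) - N(x_i)|}{N(x_i)}, \tag{24}$$

where $\hat{N}(x_i)$ is the estimated population size trajectory at time x_i. We compute SRE at equally spaced time points x₁,…,x_K. Second, we compute the mean relative width (MRW) as follows:

$$\mathrm{MRW} = \frac{1}{K} \sum_{i=1}^{K} \frac{|\hat{N}_{97.5}(x_i) - \hat{N}_{2.5}(x_i)|}{N(x_i)}, \tag{25}$$

where $\hat{N}_{97.5}(x_i)$ corresponds to the 97.5% upper limit and $\hat{N}_{2.5}(x_i)$ corresponds to the 2.5% lower limit of $\hat{N}(x_i)$. For EM estimates, the interval between these limits corresponds to the 95% confidence interval estimated using the observed Fisher information; for Bayesian GP estimates, it corresponds to the 95% Bayesian credible interval (BCI) of $\hat{N}(x_i)$. To measure how well these intervals cover the truth, we compute the envelope measure (ENV) in the following way:

$$\mathrm{ENV} = \frac{1}{K} \sum_{i=1}^{K} 1_{\{\hat{N}_{2.5}(x_i) \le N(x_i) \le \hat{N}_{97.5}(x_i)\}}. \tag{26}$$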

We compute SRE, MRW and ENV for K = 150 at equally spaced time points.
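The three summary statistics can be computed with a few lines of code. The sketch below assumes the estimate, the interval limits and the true trajectory have all been evaluated at the same K time points, and it implements SRE as a sum of absolute relative errors, MRW as the average interval width relative to the truth, and ENV as the fraction of time points whose interval contains the truth. Function names are hypothetical.

```python
def sum_relative_errors(estimate, truth):
    """SRE: sum over time points of |estimate - truth| / truth."""
    return sum(abs(e - t) / t for e, t in zip(estimate, truth))


def mean_relative_width(lower, upper, truth):
    """MRW: average of (upper - lower) / truth over time points."""
    return sum(abs(u - l) / t for l, u, t in zip(lower, upper, truth)) / len(truth)


def envelope(lower, upper, truth):
    """ENV: fraction of time points where lower <= truth <= upper."""
    covered = sum(1 for l, u, t in zip(lower, upper, truth) if l <= t <= u)
    return covered / len(truth)
```

Lower SRE and MRW indicate more accurate and tighter estimates, while ENV close to the nominal level indicates well-calibrated intervals.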

For our Bayesian GP estimates, we estimate N(x_i) at d = 100 time points, unless stated otherwise. The parameters of the Gamma prior on the GP precision parameter τ were set to α = β = 0.001, reflecting our lack of prior information about the smoothness of the population size trajectory.

For our EM estimates, we used different discretizations based on Equation 15, varying the number of change points d and κ over the fixed interval (0, x_d) with x_d set to be the maximum observed coalescent time. For the cases where we only consider one genealogy (m = 1), the EM approach becomes standard maximum likelihood estimation. We summarize our posterior inference and compare our Bayesian GP method to the EM method. The population size trajectory is log-transformed for ease of visualization and for direct comparison with other methods (Minin et al. 2008; Palacios and Minin 2013).
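For the single-genealogy case (m = 1), the maximum likelihood estimate under a piece-wise constant trajectory has a closed form, which the following sketch illustrates. It assumes a haploid coalescent in which the log-likelihood contribution of interval j is −c_j log N_j − E_j / N_j, where c_j is the number of coalescent events in the interval and E_j is the integral of C(A(t), 2) over the interval; the maximizer is then E_j / c_j, and no estimate exists when c_j = 0. The function name is hypothetical.

```python
def piecewise_constant_mle(coal_times, n, change_points):
    """MLE of N_j on [x_j, x_{j+1}) from one genealogy; None if no events fall there.

    The last interval extends from the final change point to infinity.
    """
    d = len(change_points)
    exposure = [0.0] * d  # integral of C(k, 2) over each interval
    events = [0] * d      # number of coalescent events in each interval
    prev = 0.0
    for idx, t in enumerate(sorted(coal_times)):
        k = n - idx
        pairs = k * (k - 1) / 2.0
        for j in range(d):
            lo = change_points[j]
            hi = change_points[j + 1] if j + 1 < d else float("inf")
            overlap = min(t, hi) - max(prev, lo)
            if overlap > 0:
                exposure[j] += pairs * overlap
            if lo <= t < hi:
                events[j] += 1
        prev = t
    return [exposure[j] / events[j] if events[j] else None for j in range(d)]
```

Intervals without coalescent events return `None`, which makes explicit how a poorly chosen grid can leave parts of the trajectory without an estimate.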

Table 3:

Summary statistics for simulation results depicted in Figure 4. SRE is the sum of relative errors (Equation 24), MRW is the mean relative width of the 95% BCI (Equation 25), and ENV is the envelope measure (Equation 26). Values in bold indicate best performance.

Figure 4: Sensitivity to parameter discretization

Comparison of population size trajectories estimated from one simulated genealogy (m = 1) of 100 individuals with a constant population size. We show true trajectories as dashed lines. (A) Bayesian GP estimates at d = 50, 100 and 200 equally spaced time points. (B) EM estimates of a piece-wise constant trajectory with d = 5 change points and κ = 1, 10 and 100 (Equation 15). (C) EM estimates of a piece-wise constant trajectory with d = 10 change points and κ = 10, 100 and 500 (Equation 15). Point estimates are shown as solid black lines. 95% credible intervals and 95% confidence intervals are shown by gray shaded areas.

4.1 Sensitivity of EM estimates of N(t) to discretization

In Figure 4, we show our Bayesian GP and EM estimates of a constant population size trajectory from a single genealogy of 100 individuals with different discretizations. We find that our Bayesian GP point estimates depicted in Figure 4A recover the truth (dashed line) almost perfectly, with less uncertainty than the EM estimates (Figure 4B-C). Comparing our Bayesian GP estimates under different discretizations (50, 100 and 200 equally spaced time points; Figure 4A), we find that increasing the number of time points improves inference (Table 3) but that the differences between estimates among the three discretizations are marginal. In contrast, different grid definitions alter the EM estimates (Figure 4B). It is not clear how to define a good strategy for the definition of the grid for the EM method, even for the simple model of constant population size. For example, increasing κ from 100 to 500 with 5 change points (Figure 4B) does not improve estimation. Increasing the number of change points does not necessarily improve the estimates either; for example, increasing the number of change points from 5 to 10 for κ = 10 does not (Figures 4B-C). EM grid sensitivity persists even when the number of genealogies increases; Figure S2 in Supplementary Information shows that the best definition of change points when our data consist of 1000 local genealogies of 100 individuals has 10 change points evenly distributed.

Table 4:

Summary of simulation results depicted in Figures 5-7. SRE is the sum of relative errors calculated as in (24), MRW is the mean relative width of the 95% BCI as defined in (25), and ENV is the envelope measure calculated as in (26). Values in bold indicate best performance for each demographic model and sample size.

4.2 Comparing Methods of Estimating N(t)

Figure 5 shows the estimated population size trajectories when the number of samples is 2 for the three different demographic scenarios and varying numbers of local genealogies (100, 500 and 1000). For constant and exponential growth, our EM method assumes a piece-wise constant trajectory of 10 change points (d = 10) and κ = 1 using Equation 15 (similar to Li and Durbin (2011) and Rasmussen et al. (2014)). For the bottleneck scenario, some of the intervals did not have coalescent events; hence, for this case we assumed a piece-wise constant trajectory of 5 change points (d = 5) and κ = 1 for constructing our EM estimates. We show the boxplots of the time to the most recent common ancestor (TMRCA) at the bottom of each plot in Figure 5. The distribution of the TMRCA serves as an indicator of the uncertainty expected of our estimates. Both approaches, EM and Bayesian GP, show narrower confidence and credible intervals at the center of the distribution of the TMRCA, particularly during the bottleneck in Figure 5C.

Figure 5: Inference of population size trajectories N(t) for a pair of individuals (n = 2).

(A) Simulated data under constant population size, (B) exponential and constant trajectory, and (C) a bottleneck. We show estimates from m = 100, m = 500, and m = 1000 local genealogies. We show the true trajectories as dashed lines, blue lines and light blue shaded areas represent EM point estimates and 95% confidence areas, and red lines and pink shaded areas represent Bayesian GP posterior medians and 95% BCIs. Boxplots of the TMRCA are shown at the bottom of each plot.

For the constant population demographic model in Figure 5A, our Bayesian GP outperforms our EM estimates considerably. This is not surprising since a priori log N(t) has mean 0 in our Bayesian approach (Equation 17). Moreover, EM confidence intervals cover the true constant population size only around 30% of the time, while the GP credible intervals cover 100% of the truth (Table 4A). Despite placing a mean-0 prior on log N(t), the Bayesian GP method accurately recovers sudden changes, as shown in the bottleneck example. Our Bayesian GP prior on log N(t) is Brownian motion, which is not differentiable at any point; yet, our Bayesian GP recovers smooth curves (Figure 5B).

Table 4A shows the performance statistics for the estimates of N(t) in Figure 5. In general, our Bayesian GP has wider credible intervals than the EM confidence intervals but these credible intervals cover the true trajectory better than the EM confidence intervals in all the cases (MRW and ENV in Table 4). Our Bayesian GP estimates also generally have smaller sums of relative errors (SRE in Table 4). Under the bottleneck scenario, our Bayesian GP produces greater sums of relative errors than does the EM, but our Bayesian GP estimates recover the truth more accurately than the EM during the bottleneck.
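The three summaries reported in Table 4 can be computed from estimates on a common grid of time points, as in the sketch below. Equations 24-26 are not reproduced in this section, so the forms used here are assumptions based on the verbal definitions: relative errors summed over grid points (SRE), interval width relative to the truth averaged over grid points (MRW), and the percentage of grid points whose interval contains the truth (ENV). The function and argument names are hypothetical.

```python
def summary_statistics(truth, estimate, lower, upper):
    """Return (SRE, MRW, ENV) over a common grid of K time points.

    truth    -- true N(t) at each grid point (positive numbers)
    estimate -- point estimate of N(t) at each grid point
    lower    -- lower limit of the 95% interval at each grid point
    upper    -- upper limit of the 95% interval at each grid point
    The formulas are assumed forms of Equations 24-26, not copied from them.
    """
    k = len(truth)
    # Sum of relative errors of the point estimate.
    sre = sum(abs(e - n) / n for e, n in zip(estimate, truth))
    # Mean width of the interval, relative to the true value.
    mrw = sum((u - l) / n for l, u, n in zip(lower, upper, truth)) / k
    # Percentage of grid points where the interval covers the truth.
    covered = sum(1 for l, u, n in zip(lower, upper, truth) if l <= n <= u)
    env = 100.0 * covered / k
    return sre, mrw, env


# Toy numbers for illustration only; they are not results from the paper.
sre, mrw, env = summary_statistics(
    truth=[1.0, 1.0, 2.0, 2.0],
    estimate=[1.5, 1.0, 2.0, 1.0],
    lower=[1.0, 0.5, 2.5, 0.5],
    upper=[2.0, 1.5, 3.5, 1.5],
)
```

Under these assumed definitions, a method can have a small SRE and still a low ENV if its intervals are too narrow, which is the pattern described for the EM estimates.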

Figures 6 and 7 show our estimates when n = 20 and n = 100. The performance statistics of the estimates displayed are shown in Table 4(B) and (C). In general, our GP-based estimates have smaller SRE and larger ENV than the EM-based estimates; the MRW is usually wider in the GP-based estimates, accurately reflecting the uncertainty of the estimates. As expected, increasing the number of loci (m) generally decreases the width of the confidence and credible intervals of our estimates (MRW). Although this is generally true for EM estimates as well, EM estimates have very low coverage of the truth (ENV in Table 4) when the number of loci increases.

Figure 6: Inference of population size trajectories N(t) for n = 20.

(A) Simulated data under constant population size, (B) exponential and constant trajectory, and (C) a bottleneck. We show estimates from m = 1 genealogy, m = 100 local genealogies and m = 1000 local genealogies. We show the true trajectories as dashed lines, blue lines and light blue shaded areas represent EM point estimates and 95% confidence areas, and red lines and pink shaded areas represent Bayesian GP posterior medians and 95% BCIs. Boxplots of the TMRCA are shown at the bottom of each plot.

Figure 7: Inference of population size trajectories N(t) for n = 100.

(A) Simulated data under constant population size, (B) exponential and constant trajectory, and (C) bottleneck. We show estimates from m = 1 genealogy, m = 100 local genealogies and m = 1000 local genealogies. We show the true trajectories as dashed lines, blue lines and light blue shaded areas represent EM point estimates and 95% confidence areas, and red lines and pink shaded areas represent Bayesian GP posterior medians and 95% BCIs. Boxplots of the TMRCA are shown at the bottom of each plot.

4.3 Sampling more individuals versus sequencing more loci

Figures 5-7 show our estimates for n = 2, 20 and 100 sampled individuals across varying numbers of loci. Since performance of EM estimates depends strongly on the definition of the grid, we base what follows on the Bayesian GP estimates. We find that increasing the number of loci decreases the uncertainty of our estimates and allows us to infer N(t) further back in time. Increasing the number of samples does not necessarily improve the performance of our GP estimates. For example, under the bottleneck scenario, we are able to detect the bottleneck phase fairly accurately even for two samples with m = 1000 local genealogies. Increasing the number of samples to n = 20 and n = 100 does not improve estimation of the features of the bottleneck. This is because most TMRCAs observed under the bottleneck scenario occur during the bottleneck (Figures 5, 6 and 7), regardless of the number of individuals sampled. In contrast, in our exponential growth example, increasing the number of samples from n = 2 to n = 100 improves accuracy (point estimates are closer to the truth, see SRE in Tables 4A-C) and credible intervals cover the truth completely (ENV of 100%).

4.4 Sequential Tajima’s genealogies are sufficient statistics under the SMC′

Under the SMC′ process, marginally at each locus along the chromosome, a local genealogy is a realization of Kingman’s n-coalescent (Kingman 1982), a continuous-time Markov chain taking its values in the set 𝒦_n of partitions of the label set {1, 2,…, n}. A local genealogy g of n individuals consists of a labeled topology in 𝒦_n and coalescent times t = (t_n,…,t₂). The state space of a local genealogy is then 𝒢 = 𝒦_n ⊗ (ℝ⁺)ⁿ⁻¹, and the cardinality of the set 𝒦_n is n!(n − 1)!/2ⁿ⁻¹. However, only the set of ordered coalescent times carries information about N(t). For a single locus, the coalescent times are sufficient statistics for inferring N(t) (the proof is in the Appendix). A natural question that follows is whether the coalescent times corresponding to the set of local genealogies are sufficient statistics for inferring N(t) under the SMC′ model. We find that the sufficient statistics for inferring N(t) under the SMC′ model are the coalescent times taken together with the local ranked tree shapes. For a single locus, the set of coalescent times together with the ranked tree shape corresponds to a realization of Tajima’s n-coalescent. Tajima’s n-coalescent (Tajima 1983) is a continuous-time Markov chain taking its values in the set ℋ_n of ranked tree shapes, also called histories, evolutionary relationships, or the vintaged and sized coalescent (Sainudiin et al. 2014). The state space of Tajima’s local genealogy is then 𝒢^T = ℋ_n ⊗ (ℝ⁺)ⁿ⁻¹, and the cardinality of the set ℋ_n is given by the sequence of Euler zigzag numbers, whose first ten elements are 1, 1, 1, 2, 5, 16, 61, 272, 1385, 7936 (Disanto and Wiehe 2013). The probability of a particular ranked tree shape H_n of n samples (Tajima 1983) is given by P(H_n) = 2^(n−c−1)/(n − 1)!, where c is the number of cherries, defined as branching events that lead to exactly two leaves.

In the Methods section, we defined transition densities in terms of coalescent times and F_i,j quantities. The set of all F_i,j quantities from a local genealogy forms a triangular matrix, the F-matrix. In the Appendix, we show that (i) F-matrices are in bijection with ranked tree shapes and (ii) the set of local Tajima’s genealogies are sufficient statistics for inferring N(t) under the SMC′ model. These observations are crucial for inferring N(t) from sequence data directly. Coalescent-based inference from sequence data relies on marginalization over the hidden state space of genealogies. In the Appendix, we show that the state space needed is the space of local Tajima’s genealogies, as opposed to the space of local Kingman’s genealogies. For n = 10 sequences, there are 2,571,912,000 possible labeled topologies but only 7,936 possible ranked tree shapes.
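The gap between the two state spaces can be checked numerically. The sketch below is illustrative only (the function names are not from the paper): it evaluates n!(n − 1)!/2ⁿ⁻¹ for labeled topologies and generates Euler zigzag numbers with the standard recurrence 2E_(m+1) = Σ_k C(m, k) E_k E_(m−k), assuming, as the counts quoted in the text imply, that the number of ranked tree shapes for n samples is the nth element of the sequence.

```python
from math import comb, factorial


def labeled_topologies(n):
    """Number of labeled ranked topologies of n samples: n!(n-1)!/2^(n-1)."""
    return factorial(n) * factorial(n - 1) // 2 ** (n - 1)


def euler_zigzag(count):
    """First `count` Euler zigzag numbers E_0, E_1, ..., E_(count-1)."""
    e = [1, 1]
    for m in range(1, count - 1):
        # Recurrence: 2 * E_(m+1) = sum_k C(m, k) * E_k * E_(m-k), for m >= 1.
        e.append(sum(comb(m, k) * e[k] * e[m - k] for k in range(m + 1)) // 2)
    return e[:count]


def ranked_tree_shapes(n):
    """Number of ranked tree shapes of n samples (nth zigzag number, 1-based)."""
    return euler_zigzag(n)[n - 1]


if __name__ == "__main__":
    for n in (4, 10):
        print(n, labeled_topologies(n), ranked_tree_shapes(n))
```

For n = 4 the two ranked tree shapes are the caterpillar (one cherry) and the balanced tree (two cherries); with the probability 2^(n−c−1)/(n − 1)! these have probabilities 4/6 and 2/6, which sum to 1.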

4.5 Application to human data

We applied our method to a 2-Mb gene-free region on chromosome 1 (187,500,000-189,500,000) from five Yorubans from Ibadan, Nigeria (YRI) and five Utah residents of central European descent (CEU) from the 1000 Genomes pilot project (1000 Genomes Project Consortium 2012); this region was previously analyzed for the same purpose (Sheehan et al. 2013). We used ARGweaver (Rasmussen et al. 2014) to obtain a sample path of local genealogies for each of the two populations (YRI and CEU). The parameters used are 200 change points, a mutation rate of μ = 1.26 × 10⁻⁸ and a recombination rate of ρ = 1.6 × 10⁻⁸ (Rasmussen et al. 2014; details regarding the parameters used can be found in Supplementary Information). We note that ARGweaver assumes the SMC process whereas our method assumes the SMC′ process. Moreover, our inference is based on a single sample of the SMC process with known pruning times. Our ARGweaver local genealogies are discretized at 200 time points, and our GP-based inference is influenced by this discretization. In Figure 8 we show our estimates of past Yoruban (in blue) and European (in green) population sizes. The two population size trajectories experience a series of bottlenecks and overlap until about 100 KYA, assuming a diploid reference population size of N₀ = 10,000 and a generation time of 25 years. In Figure 8 we recover an out-of-Africa bottleneck that starts about 100 KYA and ends about 30 KYA in the European population. These results are consistent with previously published results (Li and Durbin 2011; Gronau et al. 2011; Rasmussen et al. 2011; Sheehan et al. 2013; Schiffels and Durbin 2014). In Supplementary Information Figure S4 we show the estimates of log N(t) instead of N(t), with time measured in units of N₀ generations (same scaling as with simulations in Figures 5-7).

Figure 8: Inference of human population size trajectories N(t) for n = 10.

Green solid line and green shaded areas represent the posterior median and 95% BCI for European population (CEU) and blue solid line and blue shaded areas represent the posterior median and 95% BCI for Yoruban population (YRI). Time is measured in years in the past assuming a generation length of 25 years and a reference diploid population of 10,000 individuals. The x-axis is log transformed.
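As a small worked example of the time scaling, the sketch below converts between years and units of N₀ generations. It assumes the values stated in the text (N₀ = 10,000 and 25 years per generation) and takes one time unit to be exactly N₀ generations, as described for Figure S4; the function names are hypothetical.

```python
N0 = 10_000            # reference diploid population size stated in the text
GENERATION_YEARS = 25  # generation time stated in the text


def years_to_n0_generations(years, n0=N0, generation_years=GENERATION_YEARS):
    """Convert years in the past to time in units of N0 generations."""
    return years / generation_years / n0


def n0_generations_to_years(t, n0=N0, generation_years=GENERATION_YEARS):
    """Convert time in units of N0 generations to years in the past."""
    return t * n0 * generation_years


if __name__ == "__main__":
    for kya in (30, 100, 150):
        print(kya, "KYA =", years_to_n0_generations(kya * 1000), "N0 generations")
```

Under these assumptions, 100 KYA corresponds to 4,000 generations, or 0.4 in units of N₀ generations, and 30 KYA corresponds to 0.12.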

Discussion

In this paper, we propose a Gaussian-process based Bayesian nonparametric method for estimating effective population size trajectories N(t) from a sequence of local genealogies, accounting for recombination. Under a variety of simulated demographic scenarios and sampling designs, our method recovers the truth with better precision and accuracy than a maximum likelihood approach (Figures 5-7). We apply our method to genealogies estimated from human genomic data using ARGweaver (Rasmussen et al. 2014) and conduct inference of the human population size trajectory for European and African populations; this application to real data recovers the known features of the out-of-Africa bottleneck (Figure 8).

Several recent approaches have emerged for inference of population size trajectories from multiple whole-genome sequences using the sequentially Markov coalescent (SMC) (Li and Durbin 2011; Sheehan et al. 2013; Schiffels and Durbin 2014). However, current SMC-based methods rely on maximum likelihood inference (EM) of both a discretized parameter space and a discretized state space in order to gain computational tractability, and incur the costs of reduced accuracy and biased estimates. Although in principle the EM approach and the Bayesian nonparametric approach approximate N(t) similarly — by either a piece-wise constant or a piece-wise linear function — the Bayesian nonparametric approach is not affected by increasing the number of parameters (or change points) in the estimation of N(t). For comparison with existing methods, we implemented an EM approach to infer population size trajectories from a sequence of local genealogies and we note that increasing the number of loci may actually increase the bias of the EM estimates (Figures 5-7). For example, in simulation, our EM approach incorrectly detects the initial period of the simulated bottleneck (around 0.8N₀ instead of 0.5N₀ generations ago) with narrow confidence intervals (Figure 7C).

There are many advantages to using Bayesian GP over EM for inference of population size trajectories. Similar to Palacios and Minin’s (2013) approach to inference from a single genealogy, we a priori assume that N(t) follows a log Brownian Motion process. This allows us to model N(t) as a continuous positive function. The main advantage of using a Brownian Motion process is that its inverse covariance function is a sparse matrix that allows for fast computations. Since the likelihood function involves integration over N(t), this integral is approximated by the Riemann sum over a regular grid of points. The finer the grid is, the better the approximation. We find that our method performs well for inferring N(t) at 100 change points in all our examples and, more importantly, results are not sensitive to the number of change points used in the analysis (Figure 4). Our Bayesian approach relies on MCMC for inference from the posterior distribution of model parameters. Because population sizes at different grid points are correlated, we adapt the recently developed MCMC technique Split Hamiltonian Monte Carlo (splitHMC) for jointly sampling all model parameters (Shahbaba et al. 2014; Lan et al. 2015). splitHMC is a Metropolis sampling algorithm that efficiently proposes states that are distant from current states with high acceptance rates. It has been shown to be more efficient in inferring N(t) from a single genealogy than elliptical slice sampling or regular Hamiltonian Monte Carlo sampling (Lan et al. 2015). However, splitHMC relies on calculating the score function at every single iteration. Because the pruning time in each local genealogy is unknown, we calculate the score function via Fisher’s identity.
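The sparsity of the Brownian motion precision matrix can be checked directly. The sketch below assumes a standard Brownian motion started at 0 and observed on a regular grid s_i = i·h, whose covariance is K_ij = h·min(i, j); it builds the tridiagonal candidate inverse and verifies numerically that the product with K is the identity. Any scaling hyperparameter of the prior is omitted, and all function names are illustrative rather than taken from the authors' implementation.

```python
def brownian_covariance(d, h):
    """Covariance of Brownian motion at s_i = i*h, i = 1..d: K_ij = h*min(i, j)."""
    return [[h * min(i, j) for j in range(1, d + 1)] for i in range(1, d + 1)]


def brownian_precision(d, h):
    """Tridiagonal inverse of the covariance above.

    Diagonal entries are 2/h except the last, which is 1/h;
    the first off-diagonals are -1/h; everything else is zero.
    """
    q = [[0.0] * d for _ in range(d)]
    for i in range(d):
        q[i][i] = (2.0 if i < d - 1 else 1.0) / h
        if i + 1 < d:
            q[i][i + 1] = q[i + 1][i] = -1.0 / h
    return q


def matmul(a, b):
    """Plain dense matrix product, sufficient for a small check."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def max_deviation_from_identity(m):
    """Largest absolute entry of (m - I)."""
    return max(abs(m[i][j] - (1.0 if i == j else 0.0))
               for i in range(len(m)) for j in range(len(m)))


d, h = 50, 0.1
Q = brownian_precision(d, h)
K = brownian_covariance(d, h)
deviation = max_deviation_from_identity(matmul(Q, K))
nonzeros = sum(1 for row in Q for x in row if x != 0.0)
```

Only 3d − 2 of the d² entries of the precision matrix are nonzero, so quadratic forms and linear solves involving the prior scale linearly in the number of grid points.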

In simulations, we find that our algorithm scales well with hundreds of individuals; our computational bottleneck is in the number of local genealogies. We envision that extending the current methodology to inference from sequence data directly will require a strategy for sampling shorter genomic segments. This would be a probabilistic alternative to arbitrarily choosing segment lengths (Sheehan et al. 2013; Rasmussen et al. 2014).

Under the SMC model, every recombination event along the genome translates to a new coalescent event for the sample under study, so increasing the number of loci results in more realizations of the coalescent process. The longer the segments are and the larger the number of samples taken, the greater the chance of observing variation due to recombination. This fact makes it hard to define a sampling strategy: longer genomes or larger sample sizes? We show that increasing the number of local genealogies improves precision of our Bayesian GP estimates (Figures 5-7). However, resolution into the past from contemporaneous sequences highly depends on the actual population size trajectory N(t).

We use ARGweaver (Rasmussen et al. 2014) to generate two samples of contiguous local genealogies corresponding to a 2-Mb region of chromosome 1 for five Europeans (CEU) and five Africans (YRI) from the 1000 Genomes Project; this genomic region is free of genes and was also analyzed in Sheehan et al. (2013). Taking these two samples of local genealogies as our data (4186 local genealogies for CEU and 6247 local genealogies for YRI), we were able to use our Bayesian GP method to infer Yoruban and European effective population size trajectories (Figure 8). We find an out-of-Africa bottleneck that began ∼ 100 KYA and ended ∼ 30 KYA in the European population, consistent with Li and Durbin (2011), Rasmussen et al. (2011), Gronau et al. (2011), Sheehan et al. (2013) and Schiffels and Durbin (2014). We note that our estimates are based on a single sample of local genealogies and thus ignore genealogical uncertainty. Moreover, we generated our data from the posterior distribution of local genealogies using ARGweaver at 200 time intervals, so our GP-based approach cannot fully detect sudden changes that may occur between the discretized times. In addition, ARGweaver assumes an SMC prior model on local genealogies and our GP-based method assumes the SMC′ process; the lack of invisible recombination events in ARGweaver’s genealogies will bias inference.

The natural next extension of the method presented in this study is to infer N(t) from sequence data directly and not from the set of local genealogies. Our MCMC approach allows us to extend the current methodology in a Bayesian hierarchical framework where the SMC′ process would be used as a prior distribution over local genealogies. The work we present here suggests that a combination of ARGweaver accommodating SMC′ and GP priors would result in an efficient method for inferring population size trajectories from sequence data directly. In addition, our model can be easily modified to model a variable recombination rate along chromosomal segments and to jointly infer variable recombination rates and N(t).

Finally, we show that, under the SMC′ model, local ranked tree shapes and coalescent times correspond to a set of local Tajima’s genealogies; these Tajima’s genealogies are the sufficient statistics for inferring N(t). Under the SMC′ model, the state space needed for inferring population size trajectories from sequence data is that of a sequence of local Tajima’s genealogies. This lumping, or reduction of the original SMC′ process, will allow more efficient inference from sequence data directly.

Current methods for inferring population size trajectories make tradeoffs to analyze whole genomes that limit both biological understanding of sudden population size changes and the ability to test hypotheses regarding population size changes. This work represents a critical set of theoretical results that lay the groundwork for efficient estimation of detailed histories from sequence data with measures of uncertainty.

Acknowledgements

We thank Amandine Veber for her valuable suggestions and comments. We thank Shiwei Lan for his suggestions for speeding up the MCMC sampling scheme and Sara Sheehan and Melissa J. Hubisz for helpful discussions. J.A.P. acknowledges scholarship from CONACyT Mexico to pursue her research work. This research is supported in part by NSF CAREER Award DBI-1452622 (to S.R.). S.R. is a Pew Scholar in the Biomedical Sciences, supported by The Pew Charitable Trusts, and an Alfred P. Sloan Research Fellow.

Appendix A

Discretization

For both our Bayesian method and our EM method, we assume that N(t) is a piece-wise linear (or piece-wise constant) function with d change points. Let be the ordered set of time points corresponding to the change points x₁,…,x_d and the coalescent time points tⁱ of local genealogy i. Then, we calculate all the factors needed for the observed data likelihood (Equation 13) and the complete data likelihood (Equation 20).

Let denote the discretized version of F ⁱ that represents the number of branches in g_i that do not coalesce with any other branch in the time interval . Note that the indices here are in increasing order, k ≤ j. Similarly, let denote the probability that U_i (the time along genealogy i), occurs in and the self-coalescing event occurs at time in .That is where is the joint probability of pruning time and not coalescing back to the same branch in the time interval , and

Expectation-Maximization Algorithm

E-step: Equations 22 and 23 show that for the E-step, the only expectations we need are and . We compute these expressions as follows:

For i ∈ ℐ and for i ∈ ℐ^c, let

Then where

M-step. Now, for the kth iteration of the algorithm and maximizing the complete data loglikelihood (Equation 20) we have where is an indicator function that takes the value of 1 when (x_l, x_l+1) covers the interval .

Observed score function for Split Hamiltonian Monte Carlo

Our Bayesian approach relies on Split Hamiltonian Monte Carlo (splitHMC) to sample from the posterior distribution of model parameters. This method requires the calculation of the observed score function. We use Fisher’s identity and calculate the observed score function as the conditional expected complete score function. The lth element of ∇ℒ_obs is

Fisher Information Calculation

The calculation of the Fisher information needed to estimate confidence intervals of a piece-wise constant trajectory of population sizes requires the following expected values:

For k < j and i ∈ ℐ and for k < j and i ∈ ℐ^c

For j < k and i ∈ ℐ and

For i ∈ ℐ^c and j < k where is as defined in Equation 30, and where and

For i ∈ ℐ and for i ∈ ℐ^c

The gradient vector of the complete data log-likelihood has lth element

With and Next, differentiating Equation (32), we have = 0 for all l ≠ m, so the Hessian is a diagonal matrix with (l, l)th element and where and

Also, where and for l < 0

Sufficient statistics under SMC′

Proposition 1.

For a single locus, the set of coalescent times are sufficient statistics for inferring N(t).

Proof. This can be proved using the factorization theorem. The marginal density of a local genealogy (Equation 3) has a unique factor that depends on N(t) and g only through t_n,…,t₂. The values of A(t) are induced by the natural order of the coalescent times.

Let F denote a lower triangular matrix of size n × n with the F_i,j entry the number of lineages that do not coalesce in the time interval (t_i+1, t_j), as defined in the methods section and with the following properties:

1. F_i,1 = 0 for all i = 1,…,n (The first column contains 0s for completion)
2. F_i,j = 0 for all j > i (Lower triangular matrix)
3. F_i,i = i for all i ≥ 2 (The diagonal corresponds to the number of lineages at each intercoalescent interval)
4. F_i,i−1 = i − 2 for all i ≥ 2 (At each intercoalescent interval, we lose two free lineages, so the second diagonal corresponds to the number of lineages minus two)
5. For j < n − 1, the last row of F is defined according to:
6. Let c denote the number of cherries, then
7. For i < n and j < i − 1, if F_n,j−1 = F_n,j − 2, then F_i,j−1 = F_i,j − 2.
8. Let v_i denote the set of lineages in the intercoalescent interval (t_i, t_i-1) with direct descendant internal nodes. The lineage labels correspond to the label of the coalescent time, when the direct descendant internal node was created. That is, the lineage created at t_n has label n: v_n = {n}; the lineage created at t_i has label i. Let |v_i| denote the size of the set v_i. Note that 1≤ |v_i| ≤c and
9. For i < n and j < i − 1, if F_n,j−1 = F_n,j − 1, then at time t_j, there is a coalescence between a singleton and a lineage in the set v_j. Let a_j be the lineage selected uniformly at random
10. For i < n and j < i − 1, if F_n,j−1 = F_n,j, then at time t_j, there is a coalescence between two lineages and from the set v_j. Let denote the minimum and the maximum of the two lineages selected, then

We show the correspondence between a ranked tree shape and the F-matrix in the example of Figure A1. The first row and the first column are set to 0, the first two diagonals are known with probability 1: F_i,i = i and F_i,i-1 = F_i,i − 2 for i > 1. In our example, n = 5 and so, the first diagonal corresponds to (0, 2, 3, 4, 5) and the second diagonal corresponds to (0, 1, 2, 3). The last row F₅, contains 0, followed by the number of branches that do not coalesce in the time intervals (t₆, t₂), (t₆, t₃), (t₆, t₄) and (t₆, t₅) corresponding to (0, 0, 2, 3, 5).

Figure A1:

Ranked tree shape for n = 5
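A minimal sketch of how an F-matrix can be computed from a ranked tree shape, under one reading of the definition: F_i,j counts the lineages present in the interval (t_i+1, t_i) that are not involved in any of the coalescent events at t_i, …, t_j+1. The input encoding (a list of merges from the most recent event at t_n to the root at t_2, with the lineage created at t_i labeled i) and the example tree are hypothetical; the tree was chosen so that the last row reproduces the values (0, 0, 2, 3, 5) quoted in the text, and it is not necessarily the tree of Figure A1.

```python
def f_matrix(leaves, merges):
    """Compute the F-matrix of a ranked tree shape (1-based, row/col 0 unused).

    leaves -- labels of the n sampled lineages (anything not in 2..n)
    merges -- n-1 pairs; merges[k] is the pair joined at time t_(n-k),
              and the lineage created by that event is labeled n-k
    """
    n = len(leaves)
    assert len(merges) == n - 1
    # event_at[i] = the two lineages that coalesce at time t_i.
    event_at = {n - k: set(pair) for k, pair in enumerate(merges)}
    # extant[i] = lineages present during the interval (t_(i+1), t_i).
    extant = {n: set(leaves)}
    for i in range(n, 2, -1):
        extant[i - 1] = (extant[i] - event_at[i]) | {i}
    F = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(2, n + 1):
        free = set(extant[i])      # lineages that have not coalesced yet
        F[i][i] = len(free)
        for j in range(i - 1, 0, -1):
            free -= event_at[j + 1]  # remove lineages merging at t_(j+1)
            F[i][j] = len(free)
    return F


# Hypothetical ranked tree with n = 5: (a,b) at t_5, then (5,c) at t_4,
# then the cherry (d,e) at t_3, and finally (4,3) at t_2.
F = f_matrix(list("abcde"), [("a", "b"), (5, "c"), ("d", "e"), (4, 3)])
last_row = F[5][1:]
```

In this example the differences along the last row encode the event types: a drop of 2 marks a cherry, a drop of 1 marks a singleton joining an internal lineage, and no change marks a merge of two internal lineages.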

Proposition 2.

There is a bijection between the set of ranked tree shapes ℋ_n and ℱ, the set of F-matrices.

Proof. The probability of the F matrix can be expressed as the product of the conditional probabilities of the columns of the F matrix, that is: since the first and last column of F are known with probability 1. Note F_.,j represents the j-th column vector of the F matrix.

Let d_i = F_n,i − F_n,i-1 for i = 3,…,n, and d₂ = F_n,2 then That is, the conditional probability of the (n − j)th column of F given the (n − j + 1)th column of F is the product of the conditional probability of the last element of the (n − j)th column and the conditional probability of the rest of the (n − j)th column. When d_n-j = 2 the rest of the column is known with probability 1 (property 7 of the F-matrix). When d_n-j = 1, the rest of the n − jth column has probability 1/|v_n-j+1| (property 9 of the F-matrix) and when d_n-j = 0, the rest of the n − jth column has probability (property 10 of the F-matrix). Then re-writing Equation 33, we have since , and , then and Since d_n = 2 and |v_n| = 1, for j = n − 1, then d_n-1 is either 1 or 2, then If we continue expanding the expressions, we get:

Note that the entries of the F matrix correspond to the same quantity needed to express the transition density of an invisible event (Equation 11). We claim that the sequence of coalescent times sets t⁰, t¹,…,t^m-1 and F ⁰,F ¹,…,F^m-1 matrices corresponding to the ranked tree shapes of local genealogies g₀, g₁,…,g_m-1 are sufficient statistics to infer N(t) under the SMC′ process. We prove this through the following propositions.

Proposition 3.

The probability density of Tajima’s genealogy is proportional, up to a combinatorial factor, to the probability density of Kingman’s genealogy.

Proof.

Proposition 4.

The marginal visible transition density from a local Kingman’s genealogy g_i-1 to g_i is proportional to the marginal visible transition density from the corresponding local Tajima’s genealogy to

Proof. When the labeled topology of g_i-1 is the same as the labeled topology of g_i, then a transition from g_i-1 to g_i contains the same information about pruning location as a transition from to (Supplementary Information, Figures S1A and S2D). In fact, the I^i-1(t) function defined in section 2.1.2 (Equation 8) can be defined in terms of the F ⁱ-matrix and the coalescent times t^i-1 and tⁱ. In this case, for some j ∈ and . Then Hence, if K_i-1 = K_i, the labeled topologies of g_i-1 and g_i, then When the labeled topologies of g_i-1 and g_i are different, but the children of and the children of are the same, we cannot exactly identify the pruning branch and the new coalescing branch (Supplementary Information, Figure S1B) and then a transition from g_i-1 to g_i contains the same information about pruning location as a transition from to . Let and , since the children of and are the same, it is enough to consider F^i-1. Then and

Now, when the deleted node corresponding to t_del is a cherry and the new node corresponding to t_new is also a cherry, there are four possible topologies K_i that lead to the same ranked tree shape F ⁱ, then

Proposition 5.

The marginal invisible transition density from a local Kingman’s genealogy g_i-1 to g_i is equal to the marginal invisible transition density from the corresponding local Tajima’s genealogy to .

Proof. since all needed to compute the transition probability are the coalescent times and the F^i-1 matrix. Since the topology does not change, the proof follows.

Proposition 6.

The likelihood of the partially observed embedded SMC′ chain of local Kingman’s genealogies is proportional, up to a combinatorial factor, to the likelihood of the partially observed embedded SMC′ chain of the corresponding local Tajima’s genealogies.

Proof. The proof follows from propositions 3, 4 and 5 needed to express the likelihood of partially observed embedded SMC′ chain (Equation 13).

Literature Cited

1000 Genomes Project Consortium (2012). An integrated map of genetic variation from 1,092 human genomes. Nature, 491(7422):56 – 65.
Chen, G. K., Marjoram, P., and Wall, J. D. (2009). Fast and flexible simulation of DNA sequence data. Genome Research, 19(1):136–142.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B, 39(1):1–38.
Disanto, F. and Wiehe, T. (2013). Exact enumeration of cherries and pitchforks in ranked trees under the coalescent model. Mathematical Biosciences, 242(2):195 – 200.
Griffiths, R. C. and Marjoram, P. (1997). An ancestral recombination graph. In Donnelly, P. and Tavaré, S., editors, Progress in population genetics and human evolution, volume 87 of IMA Volumes in Mathematics and Its Applications, pages 257–270. Springer Verlag, New York.
Gronau, I., Hubisz, M. J., Gulko, B., Danko, C. G., and Siepel, A. (2011). Bayesian inference of ancient human demography from individual genome sequences. Nature Genetics, 43(10):1031– 1034.
Hudson, R. R. (1983). Testing the constant-rate neutral allele model with protein sequence data. Evolution, 37:203–217.
Hudson, R. R. (1990). Gene genealogies and the coalescent process. Oxford Surveys in Evolutionary Biology, 7:1–44.
Kingman, J. (1982). The coalescent. Stochastic Processes and their Applications, 13(3):235–248.
Lan, S., Palacios, J. A., Karcher, M., Minin, V., and Shahbaba, B. (2015). An efficient Bayesian inference framework for coalescent-based nonparametric phylodynamics.
Li, H. and Durbin, R. (2011). Inference of human population history from individual whole-genome sequences. Nature, 475(7357):493–496.
Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 44(2):226–233.
Marjoram, P. and Wall, J. (2006). Fast “coalescent” simulation. BMC Genetics, 7(1).
McVean, G. and Cardin, N. (2005). Approximating the coalescent with recombination. Philos Trans R Soc Lond B Biol Sci, 360(1459):1387–1393.
Minin, V. N., Bloomquist, E. W., and Suchard, M. A. (2008). Smooth skyride through a rough skyline: Bayesian coalescent-based inference of population dynamics. Molecular Biology and Evolution, 25(7):1459–1471.
Murray, I., Adams, R. P., and MacKay, D. J. (2010). Elliptical slice sampling. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, volume 9, pages 541–548.
Palacios, J. A. and Minin, V. N. (2013). Gaussian process-based Bayesian nonparametric inference of population trajectories from gene genealogies. Biometrics, 63:8–18.
Rasmussen, M., Guo, X., Wang, Y., Lohmueller, K. E., Rasmussen, S., Albrechtsen, A., Skotte, L., Lindgreen, S., Metspalu, M., Jombart, T., Kivisild, T., Zhai, W., Eriksson, A., Manica, A., Orlando, L., De La Vega, F. M., Tridico, S., Metspalu, E., Nielsen, K., Ávila Arcos, M. C., Moreno-Mayar, J. V., Muller, C., Dortch, J., Gilbert, M. T. P., Lund, O., Wesolowska, A., Karmin, M., Weinert, L. A., Wang, B., Li, J., Tai, S., Xiao, F., Hanihara, T., van Driem, G., Jha, A. R., Ricaut, F.-X., de Knijff, P., Migliano, A. B., Gallego Romero, I., Kristiansen, K., Lambert, D. M., Brunak, S., Forster, P., Brinkmann, B., Nehlich, O., Bunce, M., Richards, M., Gupta, R., Bustamante, C. D., Krogh, A., Foley, R. A., Lahr, M. M., Balloux, F., Sicheritz-Pontén, T., Villems, R., Nielsen, R., Wang, J., and Willerslev, E. (2011). An aboriginal Australian genome reveals separate human dispersals into Asia. Science, 334(6052):94–98.
Rasmussen, M. D., Hubisz, M. J., Gronau, I., and Siepel, A. (2014). Genome-wide inference of ancestral recombination graphs. PLoS Genet, 10(5):e1004342.
Sainudiin, R., Stadler, T., and Véber, A. (2014). Finding the best resolution for the Kingman-Tajima coalescent: theory and applications. Journal of Mathematical Biology, pages 1–41.
Schiffels, S. and Durbin, R. (2014). Inferring human population size and separation history from multiple genome sequences. Nature Genetics, 46(8):919–925.
Shahbaba, B., Lan, S., Johnson, W., and Neal, R. (2014). Split Hamiltonian Monte Carlo. Statistics and Computing, 24(3):339–349.
Sheehan, S., Harris, K., and Song, Y. S. (2013). Estimating variable effective population sizes from multiple genomes: A sequentially Markov conditional sampling distribution approach. Genetics, 194(3):647–662.
Tajima, F. (1983). Evolutionary relationship of DNA sequences in finite populations. Genetics, 105(2):437–460.
Wilton, P. R., Carmi, S., and Hobolth, A. (2015). The SMC′ is a highly accurate approximation to the ancestral recombination graph. Genetics.
Wiuf, C. and Hein, J. (1999). Recombination as a point process along sequences. Theoretical Population Biology, 55(3):248–259.

Posted May 11, 2015.

[1] ↵
1000 Genomes Project Consortium (2012). An integrated map of genetic variation from 1,092 human genomes. Nature, 491(7422):56 – 65.
OpenUrl CrossRef PubMed Web of Science

[2] ↵
Chen, G. K., Marjoram, P., and Wall, J. D. (2009). Fast and flexible simulation of DNA sequence data. Genome Research, 19(1):136–142.
OpenUrl Abstract/FREE Full Text

[3] ↵
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statitistical Society. Series B, 39(1):1–38.
OpenUrl

[4] ↵
Disanto, F. and Wiehe, T. (2013). Exact enumeration of cherries and pitchforks in ranked trees under the coalescent model. Mathematical Biosciences, 242(2):195 – 200.
OpenUrl CrossRef PubMed

[5] ↵
Donnelly, P. and
Tavaré, S.
Griffiths, R. C. and Marjoram, P. (1997). An ancestral recombination graph. In Donnelly, P. and Tavaré, S., editors, Progress in population genetics and human evolution, volume 87 of IMA Volumes in Mathematics and Its Applications, pages 257–270. Springer Verlag, New York.

[6] Donnelly, P. and

[7] Tavaré, S.

[8] ↵
Gronau, I., Hubisz, M. J., Gulko, B., Danko, C. G., and Siepel, A. (2011). Bayesian inference of ancient human demography from individual genome sequences. Nature Genetics, 43(10):1031– 1034.
OpenUrl CrossRef PubMed

[9] ↵
Hudson, R. R. (1983). Testing the constant-rate neutral allele model with protein sequence data. Evolution, 37:203–217.
OpenUrl CrossRef Web of Science

[10] ↵
Hudson, R. R. (1990). Gene genealogies and the coalescent process. Oxford Surveys in Evolutionary Biology, 7:1–44.
OpenUrl

[11] ↵
Kingman, J. (1982). The coalescent. Stochastic Processes and their Applications, 13(3):235–248.
OpenUrl CrossRef

[12] ↵
Lan, S., Palacios, J. A., Karcher, M., Minin, V., and Shahbaba, B. (2015). An efficient Bayesian inference framework for coalescent-based nonparametric phylodynamics.

[13] ↵
Li, H. and Durbin, R. (2011). Inference of human population history from individual whole-genome sequences. Nature, 475(7357):493–496.
OpenUrl CrossRef PubMed Web of Science

[14] ↵
Louis, T. A. (1982). Finding the observed information matrix whe nusing the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 44(2):226–233.
OpenUrl Web of Science

[15] ↵
Marjoram, P. and Wall, J. (2006). Fast “coalescent” simulation. BMC Genetics, 7(1).

[16] ↵
McVean, G. and Cardin, N. (2005). Approximating the coalescent with recombination. Philos Trans R Soc Lond B Biol Sci, 360(1459):1387–1393.
OpenUrl CrossRef PubMed

[17] ↵
Minin, V. N., Bloomquist, E. W., and Suchard, M. A. (2008). Smooth skyride through a rough skyline: Bayesian coalescent-based inference of population dynamics. Molecular Biology and Evolution, 25(7):1459–1471.
OpenUrl CrossRef PubMed Web of Science

[18] ↵
Murray, I., Adams, R. P., and MacKay, D. J. (2010). Elliptical slice sampling. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, volume 9, pages 541–548.
OpenUrl

[19] ↵
Palacios, J. A. and Minin, V. N. (2013). Gaussian process-based Bayesian nonparametric inference of population trajectories from gene genealogies. Biometrics, 63:8–18.
OpenUrl

[20] ↵
Rasmussen, M., Guo, X., Wang, Y., Lohmueller, K. E., Rasmussen, S., Albrechtsen, A., Skotte, L., Lindgreen, S., Metspalu, M., Jombart, T., Kivisild, T., Zhai, W., Eriksson, A., Manica, A., Orlando, L., De La Vega, F. M., Tridico, S., Metspalu, E., Nielsen, K., Ávila Arcos, M. C., Moreno-Mayar, J. V., Muller, C., Dortch, J., Gilbert, M. T. P., Lund, O., Wesolowska, A., Karmin, M., Weinert, L. A., Wang, B., Li, J., Tai, S., Xiao, F., Hanihara, T., van Driem, G., Jha, A. R., Ricaut, F.-X., de Knijff, P., Migliano, A. B., Gallego Romero, I., Kristiansen, K., Lambert, D. M., Brunak, S., Forster, P., Brinkmann, B., Nehlich, O., Bunce, M., Richards, M., Gupta, R., Bustamante, C. D., Krogh, A., Foley, R. A., Lahr, M. M., Balloux, F., Sicheritz-Pontén, T., Villems, R., Nielsen, R., Wang, J., and Willerslev, E. (2011). An aboriginal Australian genome reveals separate human dispersals into Asia. Science, 334(6052):94–98.
OpenUrl Abstract/FREE Full Text

[21] ↵
Rasmussen, M. D., Hubisz, M. J., Gronau, I., and Siepel, A. (2014). Genome-wide inference of ancestral recombination graphs. PLoS Genet, 10(5):e1004342.
OpenUrl CrossRef PubMed

[22] ↵
Sainudiin, R., Stadler, T., and Véber, A. (2014). Finding the best resolution for the Kingman-Tajima coalescent: theory and applications. Journal of Mathematical Biology, pages 1–41.

[23] ↵
Schiffels, S. and Durbin, R. (2014). Inferring human population size and separation history from multiple genome sequences. Nature Genetics, 46(8):919–925.
OpenUrl CrossRef PubMed

[24] ↵
Shahbaba, B., Lan, S., Johnson, W., and Neal, R. (2014). Split Hamiltonian Monte Carlo. Statistics and Computing, 24(3):339–349.
OpenUrl

[25] ↵
Sheehan, S., Harris, K., and Song, Y. S. (2013). Estimating variable effective population sizes from multiple genomes: A sequentially Markov conditional sampling distribution approach. Genetics, 194(3):647–662.
OpenUrl Abstract/FREE Full Text

[26] ↵
Tajima, F. (1983). Evolutionary relationship of DNA sequences in finite populations. Genetics, 105(2):437–460.
OpenUrl Abstract/FREE Full Text

[27] ↵
Wilton, P. R., Carmi, S., and Hobolth, A. (2015). The SMC⁰ is a highly accurate approximation to the ancestral recombination graph. Genetics.

[28] ↵
Wiuf, C. and Hein, J. (1999). Recombination as a point process along sequences. Theoretical Population Biology, 55(3):248 – 259.
OpenUrl CrossRef PubMed Web of Science
