Skip to main content
bioRxiv
  • Home
  • About
  • Submit
  • ALERTS / RSS
Advanced Search
New Results

Geometry of the sample frequency spectrum and the perils of demographic inference

Zvi Rosen, Anand Bhaskar, Sebastien Roch, Yun S. Song
doi: https://doi.org/10.1101/233908
Zvi Rosen
aDepartment of Statistics, University of California, Berkeley, CA 94720, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Anand Bhaskar
bDepartment of Genetics, Stanford University, Stanford, CA 94305, USA
cHoward Hughes Medical Institute, Stanford University, Stanford, CA 94305, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Sebastien Roch
dDepartment of Mathematics, University of Wisconsin, Madison, WI 53706, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Yun S. Song
aDepartment of Statistics, University of California, Berkeley, CA 94720, USA
eComputer Science Division, University of California, Berkeley, CA 94720, USA
fChan Zuckerberg Biohub, San Francisco, CA 94158, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • For correspondence: yss@berkeley.edu
  • Abstract
  • Full Text
  • Info/History
  • Metrics
  • Preview PDF
Loading

Abstract

The sample frequency spectrum (SFS), which describes the distribution of mutant alleles in a sample of DNA sequences, is a widely used summary statistic in population genetics. The expected SFS has a strong dependence on the historical population demography and this property is exploited by popular statistical methods to infer complex demographic histories from DNA sequence data. Most, if not all, of these inference methods exhibit pathological behavior, however. Specifically, they often display runaway behavior in optimization, where the inferred population sizes and epoch durations can degenerate to 0 or diverge to infinity, and show undesirable sensitivity of the inferred demography to perturbations in the data. The goal of this paper is to provide theoretical insights into why such problems arise. To this end, we characterize the geometry of the expected SFS for piecewise-constant demographic histories and use our results to show that the aforementioned pathological behavior of popular inference methods is intrinsic to the geometry of the expected SFS. We provide explicit descriptions and visualizations for a toy model with sample size 4, and generalize our intuition to arbitrary sample sizes n using tools from convex and algebraic geometry. We also develop a universal characterization result which shows that the expected SFS of a sample of size n under an arbitrary population history can be recapitulated by a piecewise-constant demography with only κn epochs, where κn is between n/2 and 2n – 1. The set of expected SFS for piecewise-constant demographies with fewer than κn epochs is open and non-convex, which causes the above phenomena for inference from data.

1 Introduction

The sample frequency spectrum (SFS), also known as the site or allele frequency spectrum, is a fundamental statistic in population genomics for summarizing the genetic variation in a sample of DNA sequences. Given a sample of n sequences from a panmictic (i.e., randomly mating) population, the SFS is a vector of length n – 1 of which the kth entry corresponds to the number of segregating sites each with k mutant (or derived) alleles and n – k ancestral alleles. The SFS provides a compact way to summarize n sequences of arbitrary length into just n – 1 numbers, and is frequently used in empirical population genetic studies to test for deviations from equilibrium models of evolution. For instance, the SFS has been widely used to infer demographic history where the effective population size has changed over time (Nielsen, 2000; Gutenkunst et al., 2009; Gravel et al., 2011; Keinan and Clark, 2012; Excoffier et al., 2013; Bhaskar et al., 2015), and to test for selective neutrality (Kaplan et al., 1989; Achaz, 2009). In fact, many commonly used population genetic statistics for testing neutrality, such as Watterson’s θW (Watterson, 1975), Tajima’s θπ (Tajima, 1983), and Fu and Li’s θFL (Fu and Li, 1993) can be expressed as linear functions of the SFS (Durrett, 2008).

In the coalescent framework (Kingman, 1982a,b,c), the unnormalized expected SFS ξn for a random sample of n genomes drawn from a population is obtained by taking the expectation of the SFS over the distribution of sample genealogical histories under a specified population demography. In this work, we will be concerned with well-mixed, panmictic populations with time-varying his-torical population sizes, evolving according to the neutral coalescent process with the infinite-sites model of mutation. The coalescent arises as the continuum limit of a large class of discrete models of random mating, such as the Wright-Fisher, Moran, and Cannings exchangeable family of models (Möhle and Sagitov, 2001). The infinite-sites model postulates that every mutation in the genealogy of a sample occurs at a distinct site, and is commonly employed in population genetic studies for organisms with low population-scaled mutation rates, such as humans. The SFS also appears in the context of statistical modeling as a vector of probabilities. In particular, the normalized expected SFS Embedded Image, defined by normalizing the entries of ξn so that they sum to 1, gives the probability that a random mutation appears in k out of n sequences in the sample. Unless stated otherwise, we use the term expected SFS to refer to the unnormalized quantity ξn.

The expected SFS is strongly influenced by the demographic history of the population, and extensive theoretical and empirical work has been done to characterize this dependence (Fu, 1995; Wakeley and Hey, 1997; Polanski et al., 2003; Marth et al., 2004; Chen, 2012; Kamm et al., 2017; Jouganous et al., 2017). Fu (1995) showed that under the infinite-sites model for a panmictic pop-ulation with constant size and no selection, the expected SFS is given by Embedded Image, where θ/2 denotes the population-scaled mutation rate. When the population size is variable, how-ever, the formula for the expected SFS depends on the entire population size history. In particular, Polanski and Kimmel (2003) showed that the expected SFS under a time-varying population size is given by ξn = Anc, with An being an (n – 1)-by-(n – 1) invertible matrix that only depends on n and c = (c2,…, cn), where cm denotes the expected time to the first coalescence event in a random sample of size m drawn from the population at present. For any time-varying population size function η(t), cm is given by the following expression: Embedded Image

A natural statistical question that arises when using the SFS for demographic inference is whether it is theoretically possible to reconstruct the population history η(t) from the expected SFS ξn of a large enough sample size n. This question was famously answered by Myers et al. (2008) in the negative, by constructing a way of perturbing any population size history without altering the expected SFS for all sample sizes. However, in subsequent work by two of this paper’s authors (Bhaskar and Song, 2014), it was shown that for a wide class of biologically plausible population histories, such as those given by piecewise-constant and piecewise-exponential functions, the expected SFS of a finite sample size is sufficient to uniquely identify the population history. These two results give us insight into the map from population size histories to the expected SFS vectors. The space of all possible population size histories is of infinite dimension, while the expected SFS vectors for any fixed sample size n form a finite-dimensional space; naturally, the pre-image of an expected SFS under this map will typically be an infinite set of population size histories. However, imposing conditions such as those of Bhaskar and Song (2014) restricts us to a function space of finite dimension, so that the pre-images of the expected SFS may become finite/unique.

While the above results are concerned with the identifiability of demographic models from noiseless SFS data, they do not directly provide an explicit characterization of the geometry of the expected SFS as a function of the demographic model. Studying such geometry would be very useful for understanding the behavior of inference algorithms which perform optimization by repeatedly computing the image of the map from the space of demographic parameters to the expected SFS while trying to minimize the deviation of the expected SFS from the observed SFS data. To this end, our main contribution is a universal characterization of the space of expected SFS for any sample size n under arbitrary population size histories, in terms of the space of expected SFS of piecewise-constant population size functions with O(n) epochs. This is a useful reduction because the latter space is much more tractable for mathematical analysis and computation. We provide a complete geometric description of this space for a sample of size n = 4, and generalize our intuition to arbitrary sample sizes n using tools from convex and algebraic geometry. Our reduction also provides an explanation for a puzzling phenomenon frequently observed in empirical demographic inference studies - namely, for some observed SFS data, the optimization procedure for inferring population histories sometimes exhibits pathological behavior where the inferred population sizes and epoch durations can degenerate to 0 or diverge to ∞.

2 Piecewise-Constant Demographies

Let Πk be the set of piecewise-constant population size functions with k pieces. Any population size function in Πk is described by 2k – 1 positive numbers, representing the k population sizes (y1,…, yk) and the k – 1 time points (t1,…, tk-1) when the population size changes. Let Ξn,k, which we call the (n,k)-SFS manifold*, denote the set of all expected SFS vectors for a sample of size n that can be generated by population size functions in Πk. Similarly, let 𝒞n,k, called the (n, k)-coalescence manifold, denote the set of all vectors c = (c2,…, cn) giving the expected first coalescence times of samples of size 2,…,n for population size functions in Πk. Let Embedded Image and Embedded Image respectively be equal to the normalization of all points in Ξn,k and 𝒞n,k by their l1-norms (i.e., the sums of their coordinates). Note that both manifolds live in ℝn−1 and their normalized versions live in the (n – 2)-dimensional simplex Δn−2; this is the set of nonnegative vectors in ℝn−1 whose coordinates sum to 1.

Now that we have defined our basic objects of study, we can describe the remainder of the paper: In Section 3, we provide a complete geometric picture of the Ξ4,k SFS manifold describing the expected SFS for samples of size n = 4 under piecewise-constant population size functions with an arbitrary number k of pieces. We make explicit the map between regions of the demographic model space and the corresponding probability vectors, and this will foreshadow some of the difficulties with population size inference in practice. In Section 4, we develop a characterization of the space of expected SFS for arbitrary population size histories. In particular, we show that for any sample size n, there is a finite integer κn such that the expected SFS for a sample of n under any population size history can be generated by a piecewise-constant population size function with at most κn epochs. Stated another way, we show that the Ξnκn SFS manifold contains the expected SFS for all possible population size histories, no matter how complicated their functional forms. We establish bounds on κn that are linear in n, and along the way prove some interesting results regarding the geometry of the general Ξn,k, SFS manifold. Finally, in Section 5, we demonstrate the implications of our geometric characterization of the expected SFS for the problem of demographic inference from noisy genomic sequence data.

Before proceeding further, we state a proposition regarding the structure of the map from Πk to 𝒞n,k, which we will call Embedded Image the vector of k – 1 transformed breakpoints is denoted by Embedded Image and defined below, while the vector of population sizes in the k epochs is denoted by Embedded Image. This allows us to explain the algebraic nature of most of our proofs. All proofs of the results presented in this paper are deferred to Section 7.

Proposition 2.1

Fix a piecewise-constant population size function in Πk with epochs [t0, t1), [t1,t2),…, [tk-1,tk), where 0 = t0 <t1 < … < tk-1 <tk = ∞, and which has constant population size value yj in the epoch [tj-1, tj) for j = 1,…,k. Let xj = exp[–(tj – tj−1)/yj ] for j = 1,…, k, where xk = 0 (corresponding to time T = ∞), and define x0 = 1 (corresponding to time T = 0) for convenience. The vectors (x1,…, xk-1,y1,…,yk), where 0 < xj < 1 and yj > 0 for all j, (uniquely) identify the population size functions in Πk, and they satisfy both of the following equations: Embedded Image and Embedded Image where cm is the expected first coalescence time for a sample of size m, as defined in (1).

These two formulations provide two perspectives on the coalescence manifold 𝒞n,k:

  1. In (2), the left-hand matrix, call it M1(n,k), has each column of the same form with two parameters; this indicates they all live in a 2-dimensional surface. The vector (y1,…, yk) has all positive parameters. This means that the vector c = (c2,…, cn is contained in the cone over the surface described by the columns of M1.

  2. In (3), the left-hand matrix, call it M2(n,k) has each column of the same form with one parameter; this indicates they all live on a curve. The vector (y1,y2 – y1, …,yk – yk-1) on the left hand side has parameters with possibly negative coordinates. So the vector c = (c2,…, cn is contained in the linear span of the curve described by the columns of M2.

Proposition 2.1 gives us the algebraic mappings that will serve as our objects of interest. Since the SFS manifold is simply a linear transformation of the coalescence manifold, we will use these maps as our entry into understanding the SFS manifold.

3 The Ξ4,k SFS Manifold: A Toy Model

The first in-depth study will involve the set of all possible expected SFS for a sample of size 4. We choose n = 4 for a number of reasons: First, the cases of n = 2 and 3 are cones with simple boundaries in the line or plane. Second, when n = 4, the absolute SFS manifold lives in ℝ3, which can be nicely visualized, and the normalized SFS manifold lives in the 2-simplex, i.e. the triangle with vertices (1,0,0), (0,1,0), and (0,0,1). Finally, as observed in Proposition 2.1, the most interesting phenomena in SFS manifolds of any dimension are fundamentally phenomena of curves and surfaces. These are already captured in the n = 4 case.

For the sake of completeness, we begin by formally describing the coalescence manifolds 𝒞n,k for the trivial cases of n = 2 and n = 3.

Proposition 3.1

We list some basic results on the coalescence manifolds 𝒞n,k for small values of (n, k): Embedded Image

Note that from (2) and (3) for Embedded Image (Section 2), it follows that Embedded Image for a > 0. In words, rescaling the population sizes in each epoch by a constant a also rescales the first coalescence times by a. This implies that every point in the coalescence manifold 𝒞n,k generates a full ray contained in the 𝒞n,k coalescence manifold. Another consequence is that the normalized coalescence manifold Embedded Image is precisely the intersection of the coalescence manifold 𝒞n,k with the simplex Δn−2.

With that justification, we begin to consider the normalized coalescence manifold Embedded Image living in the simplex. As stated in Proposition 3.1, 𝒞4,1 is a ray, which implies that Embedded Image is a single point. We now characterize the set Embedded Image.

Proposition 3.2

The manifold Embedded Image is a two-dimensional subset of the 2-simplex which can be described as the union of the point Embedded Image with the interiors of the convex hulls of two curves γ1 and γ2. The curves are parametrized as follows: Embedded Image where [n]t denotes 1 + … + tn.

This set has some highly unpleasant geometry. First of all, the set is non-convex; it is also neither closed nor open, because most of the boundary is excluded with the exception of the point (2/3; 2/9; 1/9). The set is visualized in Figure 1(a).

Figure 1:
  • Download figure
  • Open in new tab
Figure 1: Coalescence and SFS manifolds for sample size 4 and 2 population epochs.

(a) The coalescence manifold 𝓒4,2 is the union of red and green cones. The 2-simplex, shaded in blue, intersects 𝓒4,2 in the normalized coalescence manifold Embedded Image. The green region corresponds to small-then-large demographies; the red region to large-then-small demographies. (b) The SFS manifold Ξ4,2 is the union of red and green cones. The 2-simplex intersects Ξ4,2 in the normalized SFS manifold Embedded Image. Here, too, the green region corresponds to small-then-large demographies; the red region to large-then-small demographies. Ξ4,2 is obtained from 𝓒4,2 by a linear transformation.

In order to precisely illustrate the geometry of Embedded Image we will consider how contours in the domain map to contours in the image. Specifically, we plot the images of lines with fixed values of x1, respectively fixed values of (y1, y2), to 𝒞4,2 in the 2-simplex. The resulting contours are pictured in Figure 2.

Figure 2:
  • Download figure
  • Open in new tab
Figure 2: Fixed-time and fixed-size contours in Embedded Image.

(a) The blue line segments correspond to the image of Embedded Image where x* is a constant fixing the break-point between the two demographies. The other input Embedded Image varies over all positive vectors, though scaledEmbedded Image vectors point to the same normalized value. As y1/y2 → 0, the image approaches γ2 and as y2/y1→ 0, the image approaches γ1. (b) The blue curves correspond to the image of Embedded Image where Embedded Image is a fixed vector indicating the population values and x takes all values in (0,1). The endpoints 0 and 1 correspond to breakpoints at ∞ and 0 respectively. For Embedded Image, x traces a loop in the green region; for Embedded Image, x traces a loop in the red region.

Finally, we consider how the map χ acts on the boundaries of the domain. To aid visualization, we limit the inputs to x1 and y1/y2, since all rescalings of y1 and y2 by the same positive constant while keeping x1 fixed map to the same normalized coalescence vector. The resulting map is illustrated in Figure 3.

Figure 3:
  • Download figure
  • Open in new tab
Figure 3: Pairing the boundaries of demography space and Embedded Image.

(a) The domain of χ4,2. Note that for fixed y1/y2, the normalized coalescence vector is the same. (b) The normalized SFS manifold Embedded Image projected onto its first two coordinates. (c) The normalized coalescence manifold Embedded Image projected onto its first two coordinates. The red square at left corresponding to y1 > y2 maps to the red regions at right; the green square at left corresponding to y2 < y1 maps to the green regions at right. The black line segments on left (corresponding to y1/y2 = 1; y2 < y1 and x1 = 0 (equivalently t1 = ∞); y2 > y1 and x1 = 1 (equivalently t1 = 0)) all map to the central black points on right, since they each mimic a constant demography. The green line corresponding to y1 = 0 maps to the curve β1 in Embedded Image and the curve γ1 in Embedded Image; the red line corresponding to y2 = 0 maps to the curve β2 inEmbedded Image and the curve γ2 in Embedded Image. The orange point (x1 = 1,y2 = 0) maps to Embedded Image in Embedded Image and maps to (1,0,0) in Embedded Image. The blue point (x1 = 0,y1 = 0) maps to (1,0,0) in Embedded Image and Embedded Image in Embedded Image. The remaining aqua and violet segments map to the segments of the same color.

We note that the map fails to be one-to-one within the domain only when y1 = 1; this is also in the pre-image of the point Embedded Image. The inverse function theorem implies that on the complement of y1 = 1, the map is a homeomorphism. This is consistent with our observation that the two rectangles in Figure 3(a) correspond to the two envelopes in Figure 3(c).

Proposition 3.3

For all values k ≥ 3, the manifold Embedded Image, and Embedded Image is the interior of the convex hull of the following curve: Embedded Image

As we can see from Proposition 3.3, Embedded Image is open and convex; however, we lose one useful property of the normalized map Embedded Image. Specifically, let Embedded Image be given by Embedded Image, noting that Embedded Image for λ > 0. Under this definition Embedded Image is generically one-to-one. Meanwhile, the analogous construction Embedded Image mapping the three-epoch demography with breakpoints (x1,x2) and population sizes (y1, y2,1) to the cor-responding normalized coalescence vector has two-dimensional pre-images, generically. For this reason, contour images do not lend themselves to easy description. Still, we can at least describe the image of the map on the boundaries of our domain.

The easiest way to visualize the map is first to understand how the time variables affect the value of the columns of M1(4,3) and to view the y variables as specifying points in the convex hull of those 3 columns. The boundaries of the square (x1,x2) ∈ [0,1] × [0,1] map the columns (after rescaling to the simplex) as follows: Embedded Image

The case of x2 = 1 is the most interesting: when we fix y1 = y3 = 0 and y2 = 1, we obtain the boundary curve γ3(t). Note that x2 = 1 corresponds to a second epoch of length 0. The intuition is that very short population booms at the second epoch lead to coalescence vectors close to γ3. The maps encoded by a general column of M1 (4, k) correspond to the interior of the orange region. Adding in convex combinations of points gives the lined region, which is the remainder of C4,3; this is discussed more rigorously in Section 7. When the number of epochs k steps higher, all columns of M1(4, k) still map to the same region of the simplex, so 𝒞4,k will still be contained in this convex hull. The region C4,3 is depicted in Figure 4(a).

Figure 4:
  • Download figure
  • Open in new tab
Figure 4: Coalescence and SFS manifolds for sample size 4 and 3 population epochs.

(a) The coalescence manifold 𝓒4,3 is the entire yellow and orange region. The 2-simplex, shaded in blue, intersects 𝓒4,2 in the normalized coalescence manifold Embedded Image. The orange region of Embedded Image, bounded by γ1, γ2, and γ3, is the image of the surface described by the columns of M1(4,3), while the yellow region adds in vectors gained by using linear combinations. (b) The SFS manifold Ξ4,3 is the entire yellow and orange region. The 2-simplex intersects Ξ4,3 in the normalized SFS manifold Embedded Image. Ξ4,2 is obtained from 𝓒4,3 by a linear transformation. The orange region of Embedded Image, bounded by β1, β2, and β3, is the image of the surface described by the columns of M1(4,3), while the yellow region adds in vectors gained by using linear combinations.

As mentioned earlier, the SFS manifold Ξn,k is merely a linear transformation of 𝒞a,k; however, since it is of interest in its own right, we include the formulae for Ξ4,k analogous to those derived in this section.

Proposition 3.4

The following hold for the normalized (4, k)-SFS manifold: Embedded Image Embedded Imageis the union of Embedded Image with the convex hulls of two curves: Embedded Image

Here, also, [n]t denotes 1 + t + … + tn. Finally, Embedded Image for all k, and Embedded Image is the convex hull of β3, where Embedded Image

Visualizations of Ξ4,2 and Ξ4,3 may be found in Figure 1(b) and Figure 4(b).

4 The Ξn,kSFS Manifold: General Properties

In this section, we examine the constant Kn, defined in Section 2 as the smallest index for which 𝒞n,k ⊆ 𝒞n,Kn for all k. The tools for the proofs in this section come from algebraic geometry (for the derivation of the lower bound) and convex geometry (for the upper bound).

The gist of the algebraic geometry argument is that, under the M2(n, k) formulation, the man-ifold 𝒞n,k can be seen to be a relatively open subset of an algebraic variety (manifold) built by a sequence of well-understood constructions. Details of this perspective are reserved for the Proofs section.

Two concrete consequences follow from this observation:

  1. the ability to compute all equations satisfied by 𝒞n,k using computer algebra, and

  2. a formula for the dimension of the coalescence and SFS manifolds.

While the former is harder to explain without more setup, the latter can be formulated as follows:

Proposition 4.1

The dimension of Embedded Image is given by: Embedded Image

In particular, 𝒞n,k ⊊ 𝒞n,k+1 for Embedded Image.

We will illustrate how these algebraic ideas can be applied in the next case we have not seen, namely to the sample size n = 5.

Example 4.2

Note that Embedded Image, by Proposition 3.1. We will use the new ideas above to describe Embedded Image for higher values of k.

Since the normalized coalescence manifold has dimension min(2k − 2, n − 2), we know that Embedded Image has dimension 2 inside of the 3-simplex; therefore, we anticipate that it will satisfy one equation, matching its codimension. The degree of the algebraic variety implies that this polynomial should have degree 8. Indeed, when we compute this equation using Macaulay2 (Grayson and Stillman, 2002), we obtain a huge degree-8 polynomial with 105 terms, whose largest integer coefficient is 5, 598, 720. Finally, Embedded Image is full-dimensional in the 3-simplex, so it will satisfy no algebraic equations relative to the simplex. It would be defined instead by the inequalities determining its boundary.

While Proposition 4.1 is useful for analyzing individual coalescence manifolds, it also leads to the observation that Embedded Image, since the inclusions are proper until that index. It is worth remarking that a slightly weaker lower bound of Embedded Image follows immediately from the identifiability result of Bhaskar and Song (2014, Corollary 7), which states that for a piecewise-constant population size function with k pieces, the expected SFS of a sample of size n ≥ 2k suffices to uniquely identify the function.

The convex geometry argument is more elementary. As we noted, the M1 formulation is con-tained in the convex hull over the surface described by a general column of M1. Because the columns are related, our selection of points in the surface is not unrestricted. For this reason, it is not obviously equal to the convex hull. However, once we fix some collection of values x1,…, xk for 𝒞n,k we can use convex geometry for the resulting polytope. In particular, we use Carathéodory’s Theorem (Carathéodory (1907) or Barvinok (2002, Theorem 2.3)), which states that for X a subset of ℝn every x ∈ cone(X) can be represented as a positive combination of vectors x1,…, xm ∈ X for some m ≤ n.

The argument, roughly, allows us to construct any point in that convex hull, with as few as n + 1 points. This allows us to place the point in 𝒞n,j for j ≤ 2n − 1. Since no new SFS are generated by using more than 2n − 1 epochs, we learn that κn is bounded above by 2n − 1.

Combining the two bounds obtained in this section, we have the following:

Theorem 4.3

For any integer n ≥ 2, there exists a positive integer κn such that Ξn,k ⊆ Ξn,κn for all k ≥ 1. Furthermore, κn satisfies Embedded Image

This allows us to express the SFS from any piecewise-constant demography as coming from a demography with relatively few epochs. Because the SFS is an integral over the demography, the SFS from a general measurable demography can be uniformly approximated by a piecewise-constant demography with sufficiently many epochs. Our results imply that it can be precisely obtained by a demography with at most 2n − 1 epochs.

5 Implications for Statistical Demographic Inference from Data

The SFS data used for demographic inference in population genomic studies are noisy observa-tions of the expected SFS from the underlying population demography. Finite sequence lengths, ancestral/derived allele confounding, and sequencing and variant calling errors are some common reasons for the empirical SFS observed in sequencing studies differing substantially from the ex-pected SFS for the underlying demographic model. It is thus possible that the empirical SFS observed in a sequencing study is not contained in the space of expected SFS Ξn,κn for any demo-graphic model. Commonly used demographic inference methods such as ∂a∂i (Gutenkunst et al., 2009), fastsimcoal2 (Excoffier et al., 2013), and fastNeutrino (Bhaskar et al., 2015) perform parametric demographic inference by searching for demographies which maximize the likelihood of the observed SFS Embedded Image. Under the widely used Poisson Random Field model which assumes that the genomic sites being analyzed are unlinked (Sawyer and Hartl, 1992), maximizing the likelihood is equivalent to minimizing the KL divergence between the empirical SFS and the expected SFS under the parametric demographic model. Given an observed SFS Embedded Image, these algorithms traverse the interior of some user-specified space of parametric population size functions such as Πk, while computing the expected SFS Embedded Image under the forward map Embedded Image in each optimization iteration. The optimization procedure either terminates and returns a demography η in the search space Πk whose expected SFS Embedded Image minimizes the KL divergence KLEmbedded Image among all demographies in Πk, or it exhibits runaway behavior where some population sizes or epochs diverge to infinity or go to 0 with successive optimization iterations.

Our geometric study of the Embedded Image SFS manifold can clearly explain the success and failure modes of these optimization algorithms. Suppose the demographic search space is Πk. When the observed SFS Embedded Image lies in the interior of the Embedded Image SFS manifold, this observed SFS is also exactly equal to the expected SFS of some demographic model η* ∈ Πk, and hence any of the above mentioned optimization algorithms, barring numerical difficulties,† should be able to find this demography η* whose expected SFS Embedded Image is exactly equal to the observed data Embedded Image and KLEmbedded Image. On the other hand, if the noise in the observed SFS causes it to lie outside the Embedded Image SFS manifold, these optimization algorithms will attempt to find the demography η(t) ∈ Πk which minimizes the projection under the KL divergence of the observed SFS Embedded Image onto the Embedded Image SFS manifold, Embedded Image

The optimization problem in (4) has a couple of issues. As described in Sections 3 and 4, the SFS manifold Embedded Image is a set where most of the boundary points are not contained in Embedded Image. For example, the description of Embedded Image in Section 3 showed that the only boundary point of Embedded Image contained in Embedded Image is the expected SFS corresponding to the constant population size function, namely the point (6/11, 3/11, 2/11). Since the points on the boundary of the Embedded Image correspond to different limiting regimes with the epoch durations and population sizes tending to 0 or ∞, commonly used demographic inference algorithms that attempt to solve the optimization problem in (4) experience runaway behavior when the observed SFS lies outside the Embedded Image SFS manifold. Figure 5 shows the SFS vectors of simulated sequences (blue circles) under the coalescent with a constant population size, where most of the simulated SFS fall outside the Embedded Image manifold. We used fastNeutrino to fit two-epoch piecewise-constant demographies to these simulated SFS. The observed SFS vectors which project onto the curved boundary of the upper convex set are inferred to come from a demography where both the population size and the duration of the recent epoch, y1 and t1 respectively, go to 0, while for the observed SFS projecting onto the curved boundary of the lower convex set, y1 and t1 diverge to infinity. In both cases, the location of the projection along the curved boundaries is determined by the value of x1 = exp(−t1/y1) where x1 ∈ (0,1). On the other hand, for the observed SFS which project onto the straight line boundaries of the upper and lower convex sets, the inferred recent epoch durations go to 0 (upper convex set) or diverge to infinity (lower convex set) while the inferred recent population size y1 is a constant relative to the ancient population size y2 and the location of the projection along the boundaries is determined by y1/y2.

Figure 5:
  • Download figure
  • Open in new tab
Figure 5: Each blue circle is the observed SFS of n = 4 haplotypes simulated using msprime (Kelleher et al., 2016) under a constant population size coalescent with recombination using realistic mutation and recombination rates of 10−8 mutations and 2.2 × 10−8 crossovers per basepair per generation per haploid. Each sequence has 1000 unlinked loci of length 10 kb each, resulting in an average of 7,300 segregating sites. The red crosses are the expected SFS at the two-epoch piecewise-constant demographies inferred for these simulated SFS using fastNeutrino; the red crosses are the projections of the observed SFS onto the closure of Embedded Image using the KL divergence, with the dotted blue lines showing the correspondence between the observed SFS and their projections. For observed SFS lying in the interior of Embedded Image, the observed SFS and their projections coincide, while the observed SFS lying outside Embedded Image project onto the boundaries of one of the two convex sets that form Embedded Image.

A second more subtle issue arises from the fact that the Embedded Image SFS manifold being projected onto in (4) may be a non-convex set, and hence the solution to the optimization problem in (4) may not be unique.‡ For example, we already observed in Section 3 that the set Embedded Image is non-convex and is given by the union of two convex sets. Hence, for n = 4 and k = 2, and for some values of the observed SFS Embedded Image, there could exist multiple different demographic models η1,η2 ∈ Π2, η1 ≠ η2, such that Embedded Image

By algebraic considerations, the observed SFS Embedded Image which have such non-unique projections onto Embedded Image form a set of measure zero among all possible probability vectors on three elements, and hence such SFS are unlikely to be encountered in real data. However, the existence of such SFS vectors Embedded Image with non-unique projections implies that slight perturbations to these vectors, say due to different quality control procedures for selecting the set of genomic sites to analyze, could result in very different demographic models being inferred due to the projection of the perturbed vector occurring onto one or the other of the two convex sets composing Embedded Image. This is also apparent in Figure 5, where several pairs of observed SFS vectors that are very close to each other project onto the different convex sets forming Embedded Image. As described in the previous paragraph, the boundaries of the upper and lower convex sets represent various limiting regimes with either vanishingly small or arbitrarily large recent population sizes and epoch durations, and this shows that even minor perturbations to the SFS vector could yield qualitatively very different inference results which cannot be reliably interpreted.

6 Discussion

In this work, we characterized the manifold of expected SFS Ξn,k generated by piecewise-constant population histories with k epochs, while giving a complete geometric description of this manifold for the sample size n = 4 and k = 2 epochs. This special case is already rich enough to shed light on the issues that practitioners can face when inferring population demographies from SFS data using popular software programs. While we demonstrated these issues in Section 5 using the fastNeutrino program, the issues we point out are inherent to the geometry of the SFS manifold and not specific to any particular demographic inference software. Our simulations showed that the demographic inference problem from SFS data can be fraught with interpretability issues, due to the sensitivity of the inferred demographies to small changes in the observed SFS data. These results can also be viewed as complementary to recent pessimistic minimax bounds on the number of segregating sites required to reliably infer ancient population size histories (Terhorst and Song, 2015).

Our investigation of piecewise-constant population histories also let us show a general result that the expected SFS for a sample of size n under any population history can also be generated by a piecewise-constant population history with at most 2n − 1 epochs. This result could have potential applications for developing non-parametric statistical tests of neutrality. Most existing tests of neutrality using classical population genetic statistics such as Tajima’s D (Tajima, 1989) implicitly test the null hypothesis of selective neutrality and a constant effective population size (Stajich and Hahn, 2004). Exploiting our result characterizing the expected SFS of samples of size n under arbitrary population histories in terms of the expected SFS under piecewise-constant population histories with at most κn epochs, we see that the KL divergence of an observed SFS Embedded Image to the expected SFS ξn(η*) under the best fitting piecewise constant population history η*; ∈ Πηn with at most κn ≤ 2n − 1 epochs is also equal (up to a constant shift) to the negative log-likelihood of the observed SFS Embedded Image under the best fitting population size history without any constraints on its form, assuming the commonly used Poisson Random Field model where the sites being analyzed are unlinked. One can then use the KL divergence inferred by existing parametric demographic inference programs to create rejection regions for the null hypothesis of selective neutrality without having to make any parametric assumption on the underlying demography. Such an approach would also obviate the need for interpreting the inferred demography itself, since the space of piecewise-constant population histories is only being used to compute the best possible log-likelihood under any single population demographic model. This approach could serve as an alternative to recent works which first estimate a parametric demography using genome-wide sites, and then perform a hypothesis test in each genomic region using simulated distributions of SFS statistics like Tajima’s D under the inferred demography (Rafajlovic et al., 2014). We leave the exploration of such tests for future work.

7 Proofs

Proof of Proposition 2.1

First, we reduce the integral expression for cm to a finite sum; then we make appropriate manipu-lations until we arrive at the desired expressions.

Coalescence in the Wright-Fisher model is an inhomogeneous Poisson process with parameter Embedded Image. Therefore, the probability density of first coalescence at time T is:

Embedded Image

Let Embedded Image. To compute the expected time to first coalescence, we have the integral:

Embedded Image

Substituting variables, τ = Rη(t), note that dτ = η(R−1(τ))dτ. Therefore, the integral becomes: Embedded Image where Embedded Image.

The population size η(t) is a piecewise constant function, whose value η(t) = ηj if tj-1 ≤ t ≤ tj. As specified in the Proposition, t0 = 0, tk = ∞, and (y1,…,yk) is the vector of population sizes. Observe that Embedded Image is also piecewise constant. In particular, Embedded Image

Let sj = tj − tj-1 for brevity. The resulting formula is: Embedded Image

We turn the integral into a sum of integrals on the constant epochs: Embedded Image

We now make the substitution xj = exp [–sj/yj]. Note that the old restriction tj+1 > tj > 0 becomes the new constraint 0 < xj < 1. Our formula for the cm is now: Embedded Image

Noting the linear form of this expression, we factor as a matrix multiplication: Embedded Image

Combining the first three matrices yields (2); combining the first two and last two separately yields (3).

Proof of Proposition 3.1

We justify each equation in turn:

  1. As mentioned in the introduction, this is a classical result in population genetics, and can be derived directly from (3).

  2. The inclusion 𝒞2,1 ⊂ 𝒞2,k is immediate, so we need only show that any a ∈ 𝒞2,k satisfies a > 0. Using (2), a is written as a sum of products of strictly positive numbers; so 𝒞2,k ⊂ 𝒞2,1.

  3. First, we show that 𝒞3,2 is the interior of the open cone spanned by (1, 0) and (1,1). Fix y1 = a/(1 — x1) (for a positive) and consider χ(x1,a/(1 − x1),y2): Embedded Image

    When x1 → 0, the second vector approaches (1,0); when x1 → 1, the first vector approaches (1,1). The vectors are in the interior of that cone for all other permissible values of x1 and y2. To show that 𝒞3,k =𝒞2,1, note that for larger values of k, the same cone of vectors are produced. In particular, χ(x1,…,xk-1,y1,…,yk) yields Embedded Image

    Clearly, the second coordinate of all vectors is bounded between 0 and 1.

Proof of Proposition 3.2

First we observe that γ1 and γ2 are normalizations of the curves defined by parameterizations Embedded Image and Embedded Image where t is constrained to the open interval (0,1).

Now we claim that the definition in terms of the map χ(x, y) is equivalent to the definition in terms of these two curves. We can use the first formulation of χ to prove this: Embedded Image

When y2 = y1, the image is the point (2/3,2/9,1/9) = X as stated. When y2 > y1, we can use the left-hand expression to view the image as a point on the line segment between 𝒞4,1 and the curve (t, t3/3, t6/6). When y2 < y1, the right-hand expression can be used to write the image as a point on the line segment between X and (1 − t, (1 −t3)/3, (1 −t6)/6). This means that the image of χ is contained in the regions and point specified.

To show that the reverse inclusion holds, we fix a point P in the interior of the convex hull of γ1. By convexity, the line segment from X to P is contained in the region; continue in the direction P − X until the line intersects the curve. This must occur because all points in the region are further from the bounding line than X. The point of intersection q is specified as q = γ1(τ) for some τ ∈ (0,1). By convexity, there exists some ρ such that ρ 𝒞4,1 + (1 − ρ)q = P. Fixing x1 = τ, y1 = ρ and y2 = 1, shows that P is in the image of χ. The same argument holds with slight variation for γ2.

Proof of Proposition 3.3

The strategy to prove the equality of 𝒞4,3 and the cone over {t, t3, t6} comes in two steps:

  1. Show that the columns of M1(4,k) are always contained in the region R whose boundary is γ1 ∪ γ2 ∪ γ3.

  2. Divide the convex hull of R into two regions and show that each of these regions are included in Embedded Image.

First we demonstrate that the regions maps precisely into R. We have already shown in the main text of the document that the boundaries of (0,1) × (0,1) map to the boundaries of R under the mapping defined by Embedded Image, where S is the sum of the coordinates. We compute the Jacobian of this map explicitly in Macaulay2 (Grayson and Stillman, 2002). The result is: Embedded Image

Plainly, this is nowhere zero in our domain. The inverse function theorem then implies that the interior is contained in the image of the boundaries. This accomplishes Step 1 of our proof.

For Step 2, we divide the image into two regions:

  1. The triangle defined by vertices (1, 0, 0), (2/3, 2/9,1/9) and (1/3,1/3,1/3), including the two edges [(1/3,1/3,1/3), (2/3, 2/9,1/9)] and [(2/3, 2/9,1/9), (1, 0, 0)].

  2. The remainder of the convex hull of R–explicitly, the interior of the region bounded by γ3 and the line segment [(1/3,1/3,1/3), (1, 0, 0)].

To show that the triangle is included, let x2 = є ≈ 0, and let x1 vary. Then the third col-umn sits arbitrarily close to (1,0,0) and the first column traces out γ2. Set y2 ≈ 0 and toggle y1 and y3, to obtain the full span, including the interior of the triangle, and the line segment [(1/3,1/3,1/3), (2/3, 2/9,1/9)]. Set x1 = 1 − є, and the first column sits at (1/3,1/3,1/3) while the third column traces out γ1. This catches the missing line segment.

For the remainder of the convex hull, fix a point P in this region. This point lies on a line segment between (2/3,2/9,1/9) and some point Q in γ3. Suppose it is equal to ρ · (2/3, 2/9,1/9) + (1 − ρ) · Q. Set x2 = 1 − є ≈ 1. We can choose є and x1 so that the second column is arbitrarily close to P. Furthermore, observe that the first column is approximately equal to the point on γ2 corresponding to x1 and the third column is approximately the point on γ1 corresponding to x1. Choosing y1 = y3 = ρ and y2 = 1 − ρ points us to Embedded Image

Proof of Proposition 3.4

This is a direct application of the linear map W4, computed as in Polanski and Kimmel (2003): Embedded Image

Proof of Proposition 4.1

In order to prove the result about dimension, we show that 𝒞n,k is a relatively open subset of a certain algebraic variety. Because the relevant operations are native to projective geometry, we transport our objects of interest in the obvious way to projective space. The same scaling properties that allow us to focus on the simplex also lead to good behavior in projective space.

Lemma 7.1. For k ≥ 2, the Zariski closure of 𝒞n,k is the affine cone over 𝓘(σk-2(Cn,pn)), where:

  1. Cn is the projective curve defined by mapping [s : t] to Embedded Image

  2. pn is the projective point Embedded Image,

  3. 𝓘 denotes the join of algebraic varieties, and

  4. σi(·) denotes the i-th secant variety. Following Harris (2013), the i-th secant variety is the union of i-dimensional planes generated by i + 1 points in the variety.

Proof of Lemma 7.1. The variety ϕ(σk-2(Cn),pn)) is the image of the following map: Embedded Image where si and ti are not simultaneously zero, and λ is unrestricted.

Define the map ϕ : ℝ2k-1 → (ℙ1)k−1 × ℝk sending Embedded Image

We can recast the expression in (3) as the composition ψ o ϕ. Based on this formulation, the set 𝒞n,k is clearly contained in 𝓘(σk-2(Cn),p). To demonstrate the equality of the Zariski closures, we only need to show that the dimensions match and that the variety is irreducible. Both joins and secants have the property that irreducible inputs yield irreducible outputs, so the variety of interest is irreducible. The image of ϕ is open in (ℙ1)k-1 × (ℙ1)k−2, and the map ψ has deficient rank on a set of positive codimension. Therefore, the composition of ψ o ϕ has full dimension. This proves the Lemma.

The i-th secant variety of an irreducible nondegenerate curve in ℙn has projective dimension given by min(2i + 1,n) (Harris, 2013, Exercise 16.16). The curve Cn is a toric transformation of a coordinate projection of the rational normal curve. The rational normal curve is nondegenerate, and both of these operations preserve that property. This means our secant variety has projective dimension min(2(k − 2) + 1,n − 2) = min(2k − 3,n − 2). The join with a point adds 1 to the dimension of the variety, while the operation of passing to the affine cone adds 1 to the dimension of the variety and the ambient space. However, normalizing to the (n − 2)-simplex subtracts 1 from both variety and ambient space again. This means that dimEmbedded Image, assuming that k ≥ 2.

Proof of upper bound in Theorem 4.3

Suppose a point c is in 𝒞n,q. By definition, this implies that there is a point (x1,…, xq-1, y1,…, yq) such that (2) yields Embedded Image

Since the point c is in the cone over the q columns of the matrix, Carathéodory’s Theorem implies that it is also in the cone over some n − 1 of the columns. Therefore we can replace the vector y1,…, yq with Embedded Image so that all but n − 1 (or fewer) are zero.

Passing to the expression in (3), this gives us: Embedded Image

Since at most n − 1 of the Embedded Image are nonzero, at most 2n − 2 of the indices of the vector at right are nonzero. We can delete the columns of the X matrix corresponding to zero entries except the first column. A new sequence Embedded Image can then be obtained from the ratio between the first entries in adjacent columns. The new sequence Embedded Image can be obtained by taking the sequence of partial sums of the vector.

Acknowledgments

We thank the Simons Institute for the Theory of Computing, where some of this work was carried out while the authors were participating in the “Evolutionary Biology and the Theory of Computing” program. This research is supported in part by a Math+X Research Grant, an NSF grant DMS-1149312 (CAREER), an NIH grant R01-GM109454, and a Packard Fellowship for Science and Engineering. YSS is a Chan Zuckerberg Biohub investigator.

Footnotes

  • ↵* The sets Ξn,k and 𝒞n,k are not technically manifolds; they would be more accurately described as semialgebraic sets. However, for expository purposes, we use the widely known term “manifold”.

  • ↵† The package ∂a∂i uses numerical methods to approximate the solution to a diffiusion PDE, while fastsimcoal2 uses coalescent simulations to estimate the expected SFS for a given demographic model. Hence, these software packages might have numerical issues beyond the failure modes we consider here. For this reason, we conduct our inference experiments in this section using the fastNeutrino package, which uses the analytic results of Polanski and Kimmel (2003) for exact computation of the expected SFS for piecewise-constant population size functions.

  • ↵‡ If Embedded Image is a convex set, the solution to (4) is unique due to the fact that the KL divergence is a convex function of either argument. Namely, for Embedded Image with equality holding if and only if Embedded Image.

References

  1. ↵
    Achaz, G. 2009. Frequency spectrum neutrality tests: one for all and all for one. Genetics, 183, (1) 249–258.
    OpenUrlAbstract/FREE Full Text
  2. ↵
    Barvinok, A. A course in convexity, volume 54. American Mathematical Society Providence, 2002.
  3. ↵
    Bhaskar, A. and Song, Y. S. 2014. Descartes’ rule of signs and the identifiability of population demographic models from genomic variation data. Annals of Statistics, 42,(6) 2469–2493.
    OpenUrl
  4. ↵
    Bhaskar, A., Wang, Y. R., and Song, Y. S. 2015. Efficient inference of population size histories and locus-specific mutation rates from large-sample genomic variation data. Genome Research, 25, (2) 268–279.
    OpenUrlAbstract/FREE Full Text
  5. ↵
    Carathéodory, C. 1907. Über den variabilitatsbereich der koeffizienten von potenzreihen, die gegebene werte nicht annehmen. Mathematische Annalen, 64,(1) 95–115.
    OpenUrlCrossRef
  6. ↵
    Chen, H. 2012. The joint allele frequency spectrum of multiple populations: A coalescent theory approach. Theoretical Population Biology, 81,(2) 179–195.
    OpenUrlCrossRefPubMed
  7. ↵
    Durrett, R. Probability models for DNA sequence evolution. Springer Science & Business Media, 2008.
  8. ↵
    Excoffier, L., Dupanloup, I., Huerta-Sánchez, E., Sousa, V. C., and Foll, M. 2013. Robust demographic inference from genomic and SNP data. PLoS Genetics, 9,(10) e1003905.
  9. ↵
    Fu, Y.-X. 1995. Statistical properties of segregating sites. Theoretical Population Biology, 48,(2) 172–197.
    OpenUrlCrossRefPubMedWeb of Science
  10. ↵
    Fu, Y.-X. and Li, W.-H. 1993. Statistical tests of neutrality of mutations. Genetics, 133,(3) 693–709.
    OpenUrlAbstract/FREE Full Text
  11. ↵
    Gravel, S., Henn, B. M., Gutenkunst, R. N., Indap, A. R., Marth, G. T., Clark, A. G., Yu, F., Gibbs, R. A., Bustamante, C. D., Altshuler, D. L., et al. 2011. Demographic history and rare allele sharing among human populations. Proceedings of the National Academy of Sciences, 108, (29) 11983–11988.
    OpenUrlAbstract/FREE Full Text
  12. ↵
    Grayson, D. R. and Stillman, M. E. Macaulay 2, a software system for research in algebraic geometry, 2002.
  13. ↵
    Gutenkunst, R. N., Hernandez, R. D., Williamson, S. H., and Bustamante, C. D. 2009. Inferring the joint demographic history of multiple populations from multidimensional snp frequency data. PLoS Genetics, 5,(10) e1000695.
  14. ↵
    Harris, J. Algebraic geometry: a first course, volume 133. Springer Science & Business Media, 2013.
  15. ↵
    Jouganous, J., Long, W., Ragsdale, A. P., and Gravel, S. 2017. Inferring the joint demographic history of multiple populations: Beyond the diffusion approximation. Genetics, 206,(3) 1549–1567.
    OpenUrlAbstract/FREE Full Text
  16. ↵
    Kamm, J. A., Terhorst, J., and Song, Y. S. 2017. Efficient computation of the joint sample frequency spectra for multiple populations. Journal of Computational and Graphical Statistics, 26, (1) 182–194.
    OpenUrl
  17. ↵
    Kaplan, N. L., Hudson, R., and Langley, C. 1989. The “hitchhiking effect” revisited. Genetics, 123,(4) 887–899.
    OpenUrlAbstract/FREE Full Text
  18. ↵
    Keinan, A. and Clark, A. G. 2012. Recent explosive human population growth has resulted in an excess of rare genetic variants. Science, 336,(6082) 740–743.
    OpenUrlAbstract/FREE Full Text
  19. ↵
    Kelleher, J., Etheridge, A. M., and McVean, G. 2016. Efficient coalescent simulation and genealog-ical analysis for large sample sizes. PLoS Computational Biology, 12,(5) e1004842.
  20. ↵
    1. Koch, G. and
    2. Spizzichino, F.
    Kingman, J. F. C. Exchangeability and the evolution of large populations. In Koch, G. and Spizzichino, F., editors, Exchangeability in Probability and Statistics, pages 97–112. North-Holland Publishing Company, 1982a.
  21. ↵
    Kingman, J. F. C. 1982b. On the genealogy of large populations. Journal of Applied Probability, 19, 27–43.
    OpenUrlCrossRef
  22. ↵
    Kingman, J. F. C. 1982c. The coalescent. Stochastic Processes and Their Applications, 13,(3) 235–248.
    OpenUrlCrossRef
  23. ↵
    Marth, G. T., Czabarka, E., Murvai, J., and Sherry, S. T. 2004. The allele frequency spectrum in genome-wide human variation data reveals signals of differential demographic history in three large world populations. Genetics, 166,(1) 351–372.
    OpenUrlAbstract/FREE Full Text
  24. ↵
    Möhle, M. and Sagitov, S. 2001. A classification of coalescent processes for haploid exchangeable population models. Annals of Probability, 29,(4) 1547–1562.
    OpenUrl
  25. ↵
    Myers, S., Fefferman, C., and Patterson, N. 2008. Can one learn history from the allelic spectrum? Theoretical Population Biology, 73,(3) 342–348.
    OpenUrlCrossRefPubMedWeb of Science
  26. ↵
    Nielsen, R. 2000. Estimation of population parameters and recombination rates from single nu-cleotide polymorphisms. Genetics, 154,(2) 931–942.
    OpenUrlAbstract/FREE Full Text
  27. ↵
    Polanski, A., Bobrowski, A., and Kimmel, M. 2003. A note on distributions of times to coalescence, under time-dependent population size. Theoretical Population Biology, 63,(1) 33–40.
    OpenUrlCrossRefPubMed
  28. ↵
    Polanski, A. and Kimmel, M. 2003. New explicit expressions for relative frequencies of single-nucleotide polymorphisms with application to statistical inference on population growth. Genet-ics, 165,(1) 427–436.
    OpenUrlAbstract/FREE Full Text
  29. ↵
    Rafajlović, M., Klassmann, A., Eriksson, A., Wiehe, T., and Mehlig, B. 2014. Demography-adjusted tests of neutrality based on genome-wide SNP data. Theoretical Population Biology, 95, 1–12.
    OpenUrlCrossRefPubMed
  30. ↵
    Sawyer, S. A. and Hartl, D. L. 1992. Population genetics of polymorphism and divergence. Genetics, 132,(4) 1161–76.
    OpenUrlAbstract/FREE Full Text
  31. ↵
    Stajich, J. E. and Hahn, M. W. 2004. Disentangling the effects of demography and selection in human history. Molecular Biology and Evolution, 22,(1) 63–73.
    OpenUrlCrossRef
  32. ↵
    Tajima, F. 1983. Evolutionary relationship of DNA sequences in finite populations. Genetics, 105, (2) 437–460.
    OpenUrlAbstract/FREE Full Text
  33. ↵
    Tajima, F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymor-phism. Genetics, 123,(3) 585–595.
    OpenUrlAbstract/FREE Full Text
  34. ↵
    Terhorst, J. and Song, Y. S. 2015. Fundamental limits on the accuracy of demographic inference based on the sample frequency spectrum. Proceedings of the National Academy of Sciences, 112, (25) 7677–7682.
    OpenUrlAbstract/FREE Full Text
  35. ↵
    Wakeley, J. and Hey, J. 1997. Estimating ancestral population parameters. Genetics, 145,(3) 847–855.
    OpenUrlAbstract/FREE Full Text
  36. ↵
    Watterson, G. 1975. On the number of segregating sites in genetical models without recombination. Theoretical Population Biology, 7,(2) 256–276.
    OpenUrlCrossRefPubMedWeb of Science
Back to top
PreviousNext
Posted December 13, 2017.
Download PDF
Email

Thank you for your interest in spreading the word about bioRxiv.

NOTE: Your email address is requested solely to identify you as the sender of this article.

Enter multiple addresses on separate lines or separate them with commas.
Geometry of the sample frequency spectrum and the perils of demographic inference
(Your Name) has forwarded a page to you from bioRxiv
(Your Name) thought you would like to see this page from the bioRxiv website.
CAPTCHA
This question is for testing whether or not you are a human visitor and to prevent automated spam submissions.
Share
Geometry of the sample frequency spectrum and the perils of demographic inference
Zvi Rosen, Anand Bhaskar, Sebastien Roch, Yun S. Song
bioRxiv 233908; doi: https://doi.org/10.1101/233908
Digg logo Reddit logo Twitter logo Facebook logo Google logo LinkedIn logo Mendeley logo
Citation Tools
Geometry of the sample frequency spectrum and the perils of demographic inference
Zvi Rosen, Anand Bhaskar, Sebastien Roch, Yun S. Song
bioRxiv 233908; doi: https://doi.org/10.1101/233908

Citation Manager Formats

  • BibTeX
  • Bookends
  • EasyBib
  • EndNote (tagged)
  • EndNote 8 (xml)
  • Medlars
  • Mendeley
  • Papers
  • RefWorks Tagged
  • Ref Manager
  • RIS
  • Zotero
  • Tweet Widget
  • Facebook Like
  • Google Plus One

Subject Area

  • Evolutionary Biology
Subject Areas
All Articles
  • Animal Behavior and Cognition (3505)
  • Biochemistry (7346)
  • Bioengineering (5323)
  • Bioinformatics (20263)
  • Biophysics (10016)
  • Cancer Biology (7743)
  • Cell Biology (11300)
  • Clinical Trials (138)
  • Developmental Biology (6437)
  • Ecology (9951)
  • Epidemiology (2065)
  • Evolutionary Biology (13322)
  • Genetics (9361)
  • Genomics (12583)
  • Immunology (7701)
  • Microbiology (19021)
  • Molecular Biology (7441)
  • Neuroscience (41036)
  • Paleontology (300)
  • Pathology (1229)
  • Pharmacology and Toxicology (2137)
  • Physiology (3160)
  • Plant Biology (6860)
  • Scientific Communication and Education (1272)
  • Synthetic Biology (1896)
  • Systems Biology (5311)
  • Zoology (1089)