Estimation of population divergence times from SNP data and a test for treeness

We present a method for estimating population divergence times from genome sequences when one individual is sampled from each population. Our method is a simplified version of one presented by Rassmussen et al. (2014) for testing for direct ancestry of an archaic genome. Our method does not require distinguishing ancestral from derived alleles or assumptions about demographic history before population divergence. We discuss the relationship of our method to two similar methods, one introduced by Green et al. (2010) and denoted by F(A | B) and the other introduced by Schlebusch et al. (2017) and called the TT method. When our method is applied to individuals from three or more populations, it provides a test of whether the population history is treelike. We illustrate the use of our method by applying it to three high-coverage archaic genomes, two Neanderthals (Vindija and Altai) and a Denisovan.

Several methods have been proposed for estimating the divergence time of populations under the assumption that they have remained isolated since they diverged. For distantly related populations, the numbers of mutational differences between sequences indicates relative times of divergence. Relative times can be converted to absolute times if the mutation rate is known. This method traces to Zuckerkandl and Pauling (1962;1965) and has been used and refined extensively.
This class of methods estimates genomic divergence times. Using it to estimate population or species divergence times assumes that divergence times are so large that the difference between genomic and population divergence can be ignored.
For recently diverged populations, the number of mutational differences probably does not provide a reliable estimate of population divergence times both because there may be too few mutations that distinguish populations and because the difference between the genomic and population divergence times may be substantial. To overcome this problem, Green et al. (2010) (in Supplement 14) introduced a method for estimating population divergence times that accounts for the difference between genomic and population divergence. This method was used in later papers from the same group (MEYER et al. 2012;PRÜFER et al. 2014;PRÜFER et al. 2017).
The Green et al. (2010) method is applicable when one genome is sampled from each of two populations. It depends on the statistic F(A|B), which is the fraction of sites in population A that carry the derived allele when that site is heterozygous in population B. Green et al. (2010) showed by simulation that F(A|B) decreases roughly exponentially with the separation time of populations A and B.
The pattern of decrease depends on the history of population sizes in B and in the population ancestral to A and B.
In a different context, Rasmussen et al. (2014) (Supplement 17) introduced a method for inferring whether an archaic sample was in a population directly ancestral to a present-day population. Although the Rasmussen et al. (2014) method appears to be different from the F(A|B) method, it is actually quite similar. It uses analytic expressions for the number of configurations of pairs of sites in the two populations, when no distinction is made between ancestral and derived alleles. It makes no assumptions about the history of population sizes in either population.
More recently, Schlebusch et al. (2017) in Section 9.1 of their supplementary materials, introduced another and similar method, called the TT method, for estimating population divergence times. Their method is based on analytic expressions for the seven configurations of SNPs that are polymorphic in the two populations. The TT method assumes ancestral and derived alleles can be distinguished and the population before divergence was of constant size.
In the present paper, we will derive a simpler version of the Rassmussen et al. method and describe the relationship to the F(A|B) and TT methods. We will show that our method provides a way to test whether the population history of three or more populations is accurately represented by a population tree.

Analytic theory of F(A|B)
We assume that two populations A and B diverged at time T in the past and remained isolated since. Two chromosomes are sampled from population B and one from A. Let N(t) denote the population size t generations before the present (t=0).
Between 0 and T, N(t) is the effective size of population B. Before T, it is the effective size of the ancestral population. Because only one chromosome is sampled from A, the effective size of A between 0 and T does not matter. If there is no recurrent mutation. A carries the derived allele only if one of the two B lineages coalesced with the A lineage and there was a mutation on the internal branch of the gene tree, as shown in Figure 1. We calculate the probability of those two events using standard coalescent theory.
The probability of the gene tree shown in Figure 1 is 2(1− c) / 3 where c is the probability that the two B lineages coalesce between 0 and T. The 2/3 reflects the fact that in the ancestral population each pair of lineages is equally likely to coalesce first. The probability that there is coalescence between 0 and T is where the approximation is accurate when N(t) is large.
We denote the expected length of the internal branch of the gene tree shown in Figure 1 by u. In general u depends on N(t) in a complicated way but if N is constant, u = 2N (WAKELEY 2009). The probability that a mutation occurs in the internal branch is µu where µ is the per site mutation rate.
The unconditional probability that the two B lineages carry different alleles is µ multiplied by twice the average coalescence time of those two lineages, t , where Note that t does not depend on T. When N is constant t = 2N .
We denote the probability that A carries the derived allele given that the two B lineages carry different alleles by P(A | B) . We distinguish this probability from the statistic F(A | B) computed from the data. From the rules of conditional probability we obtain and V2 (the expected times to coalescence in the two populations, given that they coalesce before the populations split), and θ, (the effective size of the ancestral population scaled by the mutation rate). They assume that the numbers of sites in each of the nine configurations take their expected values for a given sample size, and they derived expressions for each of the parameters. In particular, they showed that the two coalescence probabilities are given by where m i is the observed number of sites in configuration Oi. that, in the absence of mutations, the probabilities of the five configurations depend on five parameters, c1, the probability that the two lineages from A coalesce after the populations diverge, c2, the probability that the two lineages from B coalesce after the populations diverge, and k0, k1 and k2, the elements of the normalized folded site-frequency spectrum in a sample of size 4 immediately before the populations diverged: k0 is the probability of SSSS or ssss, k1 is the probability of SSSs or sSSS, and k2 is the probability of SSss, where the ordering of S and s does not matter.

Rasmussen et al. method
The data consist of the numbers of sites ni with each configuration. Rasmussen et al. (2014) assumed the data had a multinomial distribution with probabilities pi. They used standard numerical methods for estimating the five parameters from the data. This is a composite likelihood method because linkage disequilibrium can cause nearby sites to be non-independent. Rasmussen et al. (2014) applied their method to an archaic sample from Montana, which in this notation is population B, and several present-day Native American individuals, each of which in turn was population A. Rasmussen et al. restricted their analysis to sites that are polymorphic in a panel of African populations. As a test of direct ancestry, Rasmussen et al. used a likelihood ratio test of the hypothesis that c2=0. If c2=0, the branch to B from the population ancestral to A and B was so short that no coalescence events occurred. In doing this analysis, Rasmussen et al. (2014) needed no assumptions about the history of population sizes either before or after T.
In this paper, we simplify the Rasmussen et al. (2014) where the p i are the probabilities of configurations 1, 2 and 3. Given the data, n i for i=1, 2, 3, the three parameters can be estimated by assuming a multivariate normal distribution of the data. From ĉ , the estimated value of c, the estimated value of T (T ) is estimated by solving We can understand the relationship to F(A|B) by assuming the sample sizes are large enough that the probabilities take their expected values. In that case, ĉ = 2n 3 − n 2 2n 3 + n 2 (8) where n = n 1 + n 2 + n 3 .
The F(A|B) method described in the previous section is similar. To apply it, ancestral and derived alleles must be distinguished. Let S be the derived allele.

Application to Neanderthals and Denisovans, with a test of treeness
We illustrate the application of our method to three high-coverage archaic genomes, the Altai Neanderthal from the Denisova Cave in central Siberia (PRÜFER et al. 2014), the Vindija Neanderthal from the Vindija Cave in Croatia (PRÜFER et al. 2017) and the Denisova genome (MEYER et al. 2012). All three genomes were sequenced to sufficient depth that heterozygous sites can be called with confidence.
We ascertained a set of SNPs that are polymorphic in a panel of 40 African genomes and restricted our analysis to those SNPs. We used an additional filtering step for the Altai genome. Prüfer et al. (2014) showed that the Altai Neanderthal was inbred with an estimated inbreeding coefficient of 1/8. For the comparisons involving this individual, only sites not in runs of homozygosity longer than 2 mb were analyzed.
With three populations, there are six possible comparisons using each population in turn as population A or B. Table 1  To convert the estimates of c to estimates of T, we need to solve Equation (5) numerically after assuming something about the history of population sizes in each group. We used the size estimates obtained by Prüfer et al. (2017) from applying PSMC (LI AND DURBIN 2011) to each genome. PSMC returns piecewise constant estimates, with size Ni in time interval (t i ,t i+1 ) with t0=0. We used the time intervals and sizes reported in Figure S7.5 in Supplement 7 of Prüfer et al. (2017). For piecewise constant population sizes, Equation (5) reduces to where j is chosen so that t j < T ≤ t j+1 . Solving Equation (12)

Discussion and Conclusions
We present a simple method to estimate the divergence time In theory, the three methods differ in how they estimate divergence times.
Both the F(A | B) and TT methods estimate the divergence times scaled by the mutation rate. Our method estimates first the coalescence probability in each population and then estimates the coalescence time from some assumption about the history of population sizes after the populations diverged. In practice, the history of population sizes is inferred from PSMC which depends on an assumed mutation rate. Therefore, all three methods depend on the mutation rate. However, the coalescent probabilities estimated with our method and with the TT method do not depend on the assumed mutation rate and hence can be used in the test for a tree-like population history that we have proposed.
One goal of our paper is to call attention to three methods for estimating population divergence times using SNP data from pairs of genomes. These methods have a similar theoretical structure. The differences between them are relatively minor. Most important to the accuracy of results obtained using any of them is the assumption of complete isolation of the populations after they diverged from a common ancestor.  (6) in the text and maximizing the likelihood. The same estimates of k 1 and ĉ were obtained using Equations (7) and (8)   Illustration of the notation used in this paper. Populations A and B are assumed to have diverged from a common ancestor T generations in the past. Two chromosomes from B and one from A are sampled. A mutation from the ancestral to derived allele at a SNP is assumed to have occurred on the gene tree as shown by the arrow. Assuming no recurrent mutation, the gene tree and the branch on which the mutation occurred is the only way that B can be heterozygous for the derived allele and A also carry the derived allele.