Joint estimation of contamination, error and demography for nuclear DNA from ancient humans
===========================================================================================

* Fernando Racimo
* Gabriel Renaud
* Montgomery Slatkin

## Abstract

When sequencing an ancient DNA sample from a hominin fossil, DNA from present-day humans involved in excavation and extraction will be sequenced along with the endogenous material. This type of contamination is problematic for downstream analyses as it will introduce a bias towards the population of the contaminating individual(s). Quantifying the extent of contamination is a crucial step as it allows researchers to account for possible biases that may arise in downstream genetic analyses. Here, we present an MCMC algorithm to co-estimate the contamination rate, sequencing error rate and demographic parameters – including drift times and admixture rates – for an ancient nuclear genome obtained from human remains, when the putative contaminating DNA comes from present-day humans. We assume we have a large panel representing the putative contaminant population (e.g. European, East Asian or African). The method is implemented in a C++ program called ’Demographic Inference with Contamination and Error’ (DICE). We applied it to simulations and genome data from ancient Neanderthals and modern humans. With reasonable levels of genome sequence coverage (> 3X), we find we can recover accurate estimates of all these parameters, even when the contamination rate is as high as 50%.

Keywords
*   Ancient DNA
*   Contamination
*   MCMC
*   Human evolution
*   Demography

## 1. Author Summary

When extracting and sequencing ancient DNA from human remains, a recurrent problem is the presence of DNA from the paleontologists, archaeologists or geneticists that may have handled the fossil. If a DNA library is highly contaminated, this will introduce biases in downstream analyses, so it is important to determine the amount of extraneous DNA. Different methods exist for this purpose, but few are applicable to the nuclear genome, and none of them can extract reliable genomic information from highly contaminated samples. Thus, samples with high rates of contamination are usually discarded. Here, we present a method to jointly estimate contamination and error rates, along with demographic parameters, like drift times and admixture rates. Our method can serve to uncover important details about the evolutionary history of archaic and early modern humans from ancient DNA samples, even if those samples are highly contaminated.

## 2. Introduction

When sequencing a human genome using ancient DNA (aDNA) recovered from fossils, a common practice is to assess the amount of present-day human contamination in a sequencing library [1, 2, 3, 4, 5, 6]. Several methods exist to obtain a contamination estimate. First, one can look at ‘diagnostic positions’ in the mitochondrial genome at which a particular archaic population may be known to differ from all present-day humans. Then, one counts how many aDNA fragments support the present-day human base at those positions. This is the most popular technique and has been routinely deployed in the sequencing of Neanderthal genomes [7, 1]. However, contamination levels of the mitochondrial genome may sometimes differ drastically from those of the nuclear genome [8, 9].
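The diagnostic-position count described above can be sketched in a few lines of Python. This is a minimal illustration, not any published pipeline: the function name `diagnostic_contamination`, the input layout and the choice to ignore bases that match neither state are all assumptions made for this example.

```python
def diagnostic_contamination(fragments, diagnostic_sites):
    """Fraction of informative fragments that carry the present-day human base.

    fragments: iterable of (position, observed_base) pairs (hypothetical layout).
    diagnostic_sites: dict mapping position -> (archaic_base, present_day_base).
    """
    contaminant = 0
    informative = 0
    for position, base in fragments:
        if position not in diagnostic_sites:
            continue  # fragment does not overlap a diagnostic position
        archaic_base, present_day_base = diagnostic_sites[position]
        if base == present_day_base:
            contaminant += 1
            informative += 1
        elif base == archaic_base:
            informative += 1
        # any other base (e.g. a sequencing error) is ignored in this sketch
    if informative == 0:
        raise ValueError("no fragments overlap a diagnostic position")
    return contaminant / informative
```

Because the estimate is a simple proportion, its precision depends directly on how many fragments overlap the diagnostic positions.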

A second technique involves assessing whether the sample was male or female using the number of fragments that map to the X and the Y chromosomes. After determining the biological sex, the proportion of reads that are non-concordant with the sex of the archaic individual is used to estimate contamination from individuals of the opposite sex (e.g. Y-chr reads in an archaic female genome are indicative of male contamination) [8, 1, 10, 4]. Another method uses a maximum-likelihood approach to estimate contamination, but is only applicable to single-copy chromosomes, like the X chromosome in individuals known *a priori* to be male [11, 12]. Finally, one last technique involves using a maximum-likelihood approach to co-estimate the amount of contamination, sequencing error and heterozygosity in the entire autosomal nuclear genome [1, 3], using an optimization algorithm such as L-BFGS-B [13].

Afterwards, if the aDNA library shows low levels of present-day human contamination (< ∼2%), demographic analyses are performed on the sequences while ignoring the contamination. If the library is highly contaminated, it is usually treated as unusable and discarded. Neither of these outcomes is optimal: contaminating fragments may affect downstream analyses, while discarding the library as a whole may waste precious genomic data that could provide important demographic insights.

One way to address this problem was proposed by Skoglund et al. [14], who developed a statistical framework to separate contaminant from endogenous DNA fragments by using the patterns of chemical deamination characteristic of ancient DNA. The method produces a score which reflects the odds that a particular fragment is endogenous or not. This approach, however, may not be able to make a clean distinction between the two sources of DNA, especially for young ancient DNA samples, as chemical degradation may not have affected all fragments belonging to the ancient individual.

Instead of (or in addition to) attempting to separate the two types of fragments before performing a demographic analysis, one could incorporate the uncertainty stemming from the contaminant fragments into a probabilistic inference framework. Such an approach has already been implemented in the analysis of a haploid mtDNA archaic genome [15]. However, mtDNA represents a single gene genealogy, and, so far, no equivalent method has been developed for the analysis of the nuclear genome, which contains the richest amount of population genetic information. Here, we present a method to co-estimate the contamination rate, per-base error rate and a simple demography for an autosomal nuclear genome of an ancient hominin. We assume we have a large panel representing the putative contaminant population, for example, European, Asian or African 1000 Genomes data [16]. The method uses a Bayesian framework to obtain posterior estimates of all parameters of interest, including population-size-scaled divergence times and admixture rates.

## 3. Methods

### 3.1. Basic framework for estimation of error and contamination

We will first describe the probabilistic structure of our inference framework. We begin by defining the following parameters:

*   *rC*: contamination rate in the ancient DNA sample coming from the contaminant population

*   *∊*: error rate, i.e. probability of observing a derived allele when the true allele is ancestral, or vice versa.

*   *i*: number of chromosomes that contain the derived allele at a particular site in the ancient individual (*i* = 0, 1 or 2)

*   *dj*: number of derived fragments observed at site *j*

*   **d**: vector of *dj* counts for all sites *j* = {1,…, *N*} in a genome

*   *aj*: number of ancestral fragments observed at site *j*

*   **a**: vector of *aj* counts for all sites *j* = {1,…, *N*} in a genome

*   *wj*: known frequency of a derived allele in a candidate contaminant panel at site *j* (0 ≤ *wj* ≤ 1)

*   **w**: vector of *wj* frequencies for all sites *j* = {1,…, *N*} in a genome

*   *K*: number of informative SNPs used as input

*   *θ*: population-scaled mutation rate. *θ* = 4*Neμ*, where *Ne* is the effective population size and *μ* is the per-generation mutation rate.

We are interested in computing the probability of the data given the contamination rate, the error rate, the derived allele frequencies from the putative contaminant population (**w**) and a set of demographic parameters (**Ω**). We will use only sites that are segregating in the contaminant panel and we will assume that we observe only ancestral or derived alleles at every site (i.e. we ignore triallelic sites). In some of the analyses below, we will also assume that we have additional data (**O**) from present-day populations that may be related to the population to which the sample belongs. The nature of the data in **O** will be explained below, and will vary in each of the different cases we describe. The parameters contained in **Ω** may simply be the population-scaled times separating the contaminant population and the sample from their common ancestral population. However, **Ω** may include additional parameters, such as the admixture rate – if any – between the contaminant and the sample population. The number of parameters we can include in **Ω** will depend on the nature of the data in **O**.

For all models we will describe, the probability of the data can be defined as: ![Formula][1] where ![Formula][2] Here, *i* is the true (unknown) genotype of the ancient sample, and *P*[*i*|**Ω**, **O**] is the probability of genotype *i* given the demographic parameters and the data.

We now focus on computing the likelihood for one site *j* in the genome. In the following, we abuse notation and drop the subscript *j*. Given the true genotype of the ancient individual, the number of derived and ancestral fragments at a particular site follows a binomial distribution that depends on the genotype, the error rate and the rate of contamination [1, 3]: ![Formula][3] where ![Formula][4] ![Formula][5] ![Formula][6]
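To make the structure of this per-site likelihood concrete, the following Python sketch sums a binomial term over the three possible genotypes. The equations themselves are rendered as images in this text, so the expression for the probability of sampling a derived fragment is an assumption reconstructed from the verbal description: a fragment is a contaminant with probability *rC* (derived with probability *w*), otherwise endogenous (derived with probability *i*/2), and the observed state is flipped with probability *∊*. The function names and the input layout are hypothetical.

```python
from math import comb, log

def derived_fragment_prob(i, r_c, eps, w):
    """Assumed probability that one sampled fragment shows the derived allele."""
    # True derived state: contaminant with prob. r_c, endogenous otherwise.
    p_true = r_c * w + (1.0 - r_c) * (i / 2.0)
    # Observed state is flipped with probability eps.
    return p_true * (1.0 - eps) + (1.0 - p_true) * eps

def site_likelihood(d, a, r_c, eps, w, genotype_probs):
    """P[d | a, r_c, eps, w], summing over genotypes i = 0, 1, 2.

    genotype_probs holds P[i | Omega, O] for i = 0, 1, 2 and is taken as given.
    """
    total = 0.0
    for i, p_i in enumerate(genotype_probs):
        q = derived_fragment_prob(i, r_c, eps, w)
        total += p_i * comb(a + d, d) * q**d * (1.0 - q) ** a
    return total

def log_likelihood(sites, r_c, eps):
    """Sum of per-site log-likelihoods; sites holds (d, a, w, genotype_probs)."""
    return sum(log(site_likelihood(d, a, r_c, eps, w, g)) for d, a, w, g in sites)
```

Treating sites as independent lets the genome-wide log-likelihood be accumulated as a sum, which is the form an MCMC sampler would evaluate at each step.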

In the sections below, we will turn to the more complicated part of the model, which is obtaining the probability *P*[*i*|**Ω**, **O**] for a genotype in the ancient sample, given particular demographic parameters and additional data available. We will do this in different ways, depending on the kind of data we have at hand.

### 3.2. Diffusion-based likelihood for neutral drift separating two populations

First, we will work with the case in which **O** = **y**, where **y** is a vector of frequencies *yj* from an “anchor” population that may be closely related to the population of the ancient DNA sample. An example of this scenario would be the sequencing of a Neanderthal sample that is suspected to have contamination from present-day humans, from which many genomes are available.

For all analyses below, we restrict to sites where 0 < *yj* < 1. Note that it is entirely possible (but not required) that **y** = **w**, meaning that, aside from the ancient DNA sample, the only additional data we have are the frequencies of the derived allele in the putative contaminant population, which we can use as the anchor population too. However, it is also possible to use a contaminant panel that is different from the anchor population (Figure 1.A). We will assume we have sequenced a large number of individuals from a panel of the contaminant population (for example, the 1000 Genomes Project panel) and that the panel is large enough such that the sampling variance is approximately 0. In other words, the frequency we observe in the contaminant panel will be assumed to be equal to the population frequency in the entire contaminant population. In this case, **Ω** = {*τC*, *τA*}, where *τA* and *τC* are defined as follows:

![Figure 1.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F1.medium.gif)

Figure 1. A) Schematic of two-population modeling framework: at each site, derived and ancestral fragments (a, d) are binomially sampled from the true genotype of the archaic individual, with some amount of contamination and error. In turn, the true genotype depends on a demographic model, which can include the contaminant population. B) Schematic of three-population modeling framework, incorporating admixture between the archaic population and one of two anchor populations.

*   *τA*: drift time (i.e. time in generations scaled by twice the haploid effective population size) separating the population to which the ancient individual belongs from the ancestor of both populations

*   *τC*: drift time separating the anchor population from the ancestor of both populations

We need to calculate the conditional probabilities *P*[*i*|**Ω**, **O**] = *P*[*i*|*y*, *τC*, *τA*] for all three possibilities for the genotype in the ancient individual: *i* = 0, 1 or 2. To obtain these expressions, we rely on Wright-Fisher diffusion theory (reviewed in Ewens [17]), especially focusing on the two-population site-frequency spectrum (SFS) [18]. The full derivations can be found in Appendix A, and lead to the following formulas: ![Formula][7] ![Formula][8] ![Formula][9]

We generated 10,000 neutral simulations using msms [19] for different choices of *τC* and *τA* (with *θ* = 20 in each simulation) to verify that our analytic expressions were correct (Figure 2). The probability does not depend on *θ*, so the choice of this value is arbitrary.

![Figure 2.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F2.medium.gif)

Figure 2. Comparison of analytic solutions to *P*[*i*|*y*, *τC*, *τA*] and simulations under neutrality from msms, for different choices of *τA* and *τC*.

The above probabilities allow us to finally obtain *P*[*i* | *yj*, **Ω**, **O**].

### 3.3. Estimating drift and admixture in a three-population model

Although the above method gives accurate results for a simple demographic scenario, it does not incorporate the possibility of admixture from the ancient sample to the contaminant population. This is important, as the signal of contamination may mimic the pattern of recent admixture. We will assume that, in addition to the ancient DNA sample, we also have the following data, which constitute **O**:

1.  A large panel from a population suspected to be the contaminant in the ancient DNA sample. The sample frequencies from this panel will be labeled **w**, as before.

2.  Two panels of genomes from two “anchor” populations that may be related to the ancient DNA sample. One of these populations – called population Y – may (but need not) be the same population as the contaminant and may (but need not) have received admixture from the ancient population (Figure 1.B). The sample frequencies for this population will be labeled as **y**. The other population – called Z – will have sample frequencies labeled **z**. We will assume the drift times separating these two populations are known (parameters *τY* and *τZ* in Figure 1.B). This is a reasonable assumption as these parameters can be accurately estimated without the need of using an ancient outgroup sample, as long as admixture is not extremely high.

We can then estimate the remaining drift parameters, the error and contamination rates and the admixture time (*β*) and rate (*α*) between the archaic population and modern population *Y*. The diffusion solution for this three-population scenario with admixture is very difficult to obtain analytically. Instead, we use a numerical approximation, implemented in the program *∂a∂i* [20].

### 3.4. Markov Chain Monte Carlo method for inference

We incorporated the likelihood functions defined above into a Markov Chain Monte Carlo (MCMC) inference method, to obtain posterior probability distributions for the contamination rate, the sequencing error rate, the drift times and the admixture rate. Our program – which we called ’DICE’ – is coded in C++ and is freely available at: [http://grenaud.github.io/dice/](http://grenaud.github.io/dice/). We assumed uniform prior distributions for all parameters, and the boundaries of these distributions can be modified by the user.

For the starting chain at step 0, an initial set of parameters *X* = {*rC*, *∊*, **Ω**} is sampled randomly from their prior distributions. At step *k*, a new set of values for step *k* + 1 is proposed by drawing values for each of the parameters from normal distributions. The mean of each distribution is the value of that parameter at state *Xk*, and the standard deviation is the difference between the upper and lower boundary of the prior, divided by a constant that can be increased or decreased to achieve a desired rate of acceptance of new states [21]. By default, this constant is equal to 1,000 for all parameters. The new state is accepted with probability: ![Formula][10] where *P*[**a**, **d** | *Xk*] is the likelihood defined in Equation 1.

Unless otherwise stated below, we ran the MCMC chain for 100,000 steps in all analyses, with a burn-in period of 40,000 and sampling every 100 steps. The sampled values were then used to construct posterior distributions for each parameter.
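The sampling scheme described in this section can be sketched as a plain Metropolis sampler. The Python below is an illustration only and is not the DICE implementation, which is written in C++. It assumes that, with uniform priors and symmetric normal proposals, a proposal is accepted with probability min(1, likelihood ratio) and that proposals outside the prior boundaries are rejected; the function name `run_mcmc` and its arguments are hypothetical.

```python
import math
import random

def run_mcmc(log_likelihood, bounds, steps=100_000, burn_in=40_000,
             thin=100, scale=1000.0, seed=0):
    """Metropolis sampler with uniform priors on the intervals in `bounds`.

    log_likelihood: function mapping a dict of parameter values to a float.
    bounds: dict mapping parameter name -> (lower, upper) prior boundary.
    scale: constant dividing the prior width to give the proposal std. dev.
    """
    rng = random.Random(seed)
    names = list(bounds)
    # Step 0: draw the starting state from the uniform priors.
    state = {n: rng.uniform(*bounds[n]) for n in names}
    current = log_likelihood(state)
    samples = []
    for k in range(1, steps + 1):
        proposal = {
            n: rng.gauss(state[n], (bounds[n][1] - bounds[n][0]) / scale)
            for n in names
        }
        # Assumption: a uniform prior has zero density outside its bounds,
        # so out-of-range proposals are rejected outright.
        if all(bounds[n][0] <= proposal[n] <= bounds[n][1] for n in names):
            candidate = log_likelihood(proposal)
            # 1 - random() lies in (0, 1], which keeps log() finite.
            if math.log(1.0 - rng.random()) < candidate - current:
                state, current = proposal, candidate
        # Keep every `thin`-th state once the burn-in period is over.
        if k > burn_in and (k - burn_in) % thin == 0:
            samples.append(dict(state))
    return samples
```

With the defaults shown, which mirror the settings stated above, the function returns 600 thinned samples per chain.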

### 3.5. Multiple error rates and ancestral state misidentification

Fu et al. [5] showed that, when estimating contamination, ancient DNA data can be better fit by a two-error model than a single-error model. In that study, the authors co-estimate the two genome-wide error rates along with the proportion of the data that is affected by each rate. Therefore, we also included this error model as an option that the user can choose to incorporate when running our program.

Furthermore, we developed an alternative error estimation method that allows the user to flag transition polymorphisms, which are more likely to have occurred due to cytosine deamination in ancient DNA. These sites are therefore likely to be subject to different error rates than those common in present-day sequencing data [22, 23]. Our program can then estimate two error rates separately: one for transitions and one for transversions. Finally, we incorporated an option to include an ancestral state misidentification (ASM) parameter, which should serve to correct for mispolarization of alleles [24].
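Flagging transition polymorphisms only requires classifying the two alleles at each SNP. A minimal Python sketch is given below; the function name `is_transition` is hypothetical, and the way DICE itself expects flagged sites to be supplied is described in its README rather than here.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(allele_1, allele_2):
    """Return True for a transition (A<->G or C<->T), False for a transversion."""
    pair = {allele_1.upper(), allele_2.upper()}
    if len(pair) != 2 or not pair <= (PURINES | PYRIMIDINES):
        raise ValueError("expected two different bases from A, C, G, T")
    # Both alleles in the same chemical class means a transition.
    return pair <= PURINES or pair <= PYRIMIDINES
```

Cytosine deamination produces C-to-T changes (and G-to-A on the complementary strand), both of which fall into the transition class.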

### 3.6. BAM file functionality

The standard input for DICE is a file containing counts of particular ancestral/derived base combinations and SNP frequencies (see README file online). As an additional feature, we also developed a module for the user to directly input a BAM file and a file containing population allele frequencies for the anchor and contaminant panels, rather than the standard input. The user can either choose to convert the BAM file to native DICE format using a program provided with the software package and then run the program, or run it directly on the BAM file. In the latter case, instead of calculating genome-wide error parameters, the program will calculate error parameters specific to each sequenced fragment, based on mapping qualities, base qualities and estimated deamination rates at each site (see Appendix B).

## 4. Results: two-population method

### 4.1. Simulations

We first used DICE to obtain posterior distributions from simulated data, under the two-population inference framework. We simulated two populations (i.e. an archaic and a modern human population) with constant population size that split a number of generations ago. For each demographic scenario tested, we generated 20,000 independent replicates (*θ* = 1) in *ms* [25], making sure each simulation had at least one usable SNP. In general, this yielded ∼80,000 usable SNPs in total. We then proceeded to sample derived and ancestral allele counts using the same binomial sampling model we use in our inference framework, under different sequencing coverage and contamination conditions. In all simulations, the contaminant panel was the same as the anchor population panel. We then applied our method to the combined set of ∼80,000 SNPs.
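The fragment-sampling step of such a simulation can be sketched as follows. This Python example is a simplified illustration under stated assumptions: every site receives exactly `coverage` fragments (the simulations may instead have varied depth across sites), each fragment is a contaminant with probability *rC*, and the observed allele is flipped with probability *∊*. The function name `simulate_site` is hypothetical.

```python
import random

def simulate_site(i, w, r_c, eps, coverage, rng):
    """Draw (derived, ancestral) fragment counts for one site.

    i: true genotype of the ancient individual (0, 1 or 2 derived copies).
    w: derived allele frequency in the contaminant panel.
    """
    derived = 0
    for _ in range(coverage):
        if rng.random() < r_c:
            is_derived = rng.random() < w        # contaminant fragment
        else:
            is_derived = rng.random() < i / 2.0  # endogenous fragment
        if rng.random() < eps:
            is_derived = not is_derived          # sequencing error flips the call
        derived += is_derived
    return derived, coverage - derived
```

For instance, `simulate_site(1, 0.2, 0.05, 0.001, 30, random.Random(42))` draws counts for a heterozygous site at 30X with 5% contamination.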

Figures 3 and 4 show parameter estimation results from various demographic and contamination scenarios for a low-coverage (3X) and a high-coverage (30X) archaic genome, respectively, with low sequencing error (0.1%), and a contaminant/anchor population panel of 100 haploid genomes. In both cases, the method accurately estimates the error rate, the contamination rate and the drift parameters. All parameters are also accurately estimated for the same scenarios even if the sequencing error rate is high (10%) (Figure S1).

![Figure 3.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F3.medium.gif)

Figure 3. Estimation of parameters for a low-coverage ancient DNA genome (3X) with low sequencing error (0.1%), no admixture and a large anchor population panel (100 haploid genomes). Error bars represent 95% posterior intervals.

![Figure 4.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F4.medium.gif)

Figure 4. Estimation of parameters for a high-coverage ancient DNA genome (30X) with low sequencing error (0.1%), no admixture and a large anchor population panel (100 haploid genomes). Error bars represent 95% posterior intervals.

Figures 5, S2, S3 and S4 show how well the method estimates parameters over a wide range of contamination and drift scenarios, by displaying the absolute difference between simulated parameters and their corresponding posterior modes. So long as coverage is high (for example, 5X or 30X), the contamination and anchor drift parameters are accurately estimated even at 75% contamination. The method performs well even if the drift times on both sides of the tree are as small as ≈ 0.001 or as large as ≈ 5, but starts to become inaccurate when contamination is extremely high. In general, the contamination rate and anchor drifts are easier to determine than the drift corresponding to the ancient population.
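The accuracy measure used in these figures, the absolute difference between a simulated value and its posterior mode, can be computed from MCMC samples as sketched below. How the posterior mode was obtained from the samples is not stated here, so the histogram-based estimate, the number of bins and the function names are assumptions made for illustration.

```python
def posterior_mode(samples, lower, upper, bins=50):
    """Estimate the mode as the centre of the fullest histogram bin.

    lower, upper: prior boundaries of the parameter (histogram range).
    """
    width = (upper - lower) / bins
    counts = [0] * bins
    for x in samples:
        # Clamp so that values on the boundaries fall in the edge bins.
        index = min(max(int((x - lower) / width), 0), bins - 1)
        counts[index] += 1
    fullest = max(range(bins), key=counts.__getitem__)
    return lower + (fullest + 0.5) * width

def absolute_error(samples, true_value, lower, upper, bins=50):
    """Absolute difference between the posterior mode and the simulated value."""
    return abs(posterior_mode(samples, lower, upper, bins) - true_value)
```

The result depends on the bin width, so a kernel density estimate would be a reasonable alternative when the posterior is sampled sparsely.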

![Figure 5.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F5.medium.gif)

Figure 5. We tested the performance of the two-population method under a variety of drift and contamination scenarios for a sample of very low (0.5X) or very high (30X) coverage. We found that we needed more sites (≈ 1.6 million) to obtain accurate estimates from the low coverage sample. The MCMC chain was also run for a longer time (1 million steps). The top row shows the absolute difference between the estimated and the simulated contamination rate, while the bottom row shows the absolute difference corresponding to the anchor drift. In all simulations, the anchor drift was set to be equal to the ancient sample drift.

We find that for samples of very low coverage (0.5X, 1X, 1.5X) we require a larger number of sites to obtain accurate estimates (Figures S5, S6, S7). For example, for a sample of 0.5X coverage, we tried different numbers of independent replicate simulations and found that at 800,000 replicates, we obtained approximately 1.6 million valid SNPs for inference, which was enough to reach reasonable levels of accuracy (Figure S14). We note that this number of SNPs is approximately the same as what is available, for example, in the low-coverage (0.5X) Mezmaiskaya Neanderthal genome [4], which contains about 1.55 million valid sites with coverage ≥ 1, and which we analyze below. We also observed that the MCMC chain in some of these simulations needed a longer time to converge than when testing samples of higher coverage, especially when contamination is very high, and so in this set of simulations, we ran it for 1 million steps instead of 100,000, with a burn-in of 940,000 steps and sampling every 100 steps. Finally, we note that our failure to recover the true parameters under low coverage in a single MCMC run is partly due to the chain failing to converge. Indeed, when we run the MCMC 10 times and recover the estimates from the chain with the highest posterior probability, we are able to obtain increased accuracy relative to the single run, especially when the drift parameters are extremely low and when the contamination rate is extremely high (Figures S8, S9, S10).
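The multiple-run strategy mentioned at the end of this paragraph amounts to keeping the chain that reaches the highest posterior probability. A minimal Python sketch follows; `run_chain` is a hypothetical callable that runs one independent chain for a given seed and returns its parameter estimates together with the highest log-posterior it reached.

```python
def best_of_runs(run_chain, n_runs=10):
    """Run several independent chains and keep the best-scoring one.

    run_chain: function seed -> (estimates, max_log_posterior).
    """
    best = None
    for seed in range(n_runs):
        estimates, log_posterior = run_chain(seed)
        if best is None or log_posterior > best[1]:
            best = (estimates, log_posterior)
    return best
```

Using a different seed for each run gives each chain a different starting point, which is what allows some chains to escape regions where a single run would get stuck.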

Finally, we tested the method on simulations in a more realistic scenario, in which we generated ancient and contaminant fragments based on empirical fragment sizes and then mapped them to a simulated reference genome using BWA [26] with default parameters. We produced DNA sequences from the output of msms [19] via seq-gen v.1.3.3 [27] with the HKY substitution model [28]. This allows for multiple substitutions to occur at the same site since the split from chimpanzee (which could cause ASM). We then simulated ancient DNA fragments that had a fragment size distribution emulating empirical distributions. Contaminant fragments were also sampled from the contaminant population. We used the deamination rates from the single-stranded library from the Loschbour ancient individual [29] (∼ 8% at the 5’ end and ∼ 34% at the 3’ end with a residual deamination rate of ∼ 1% along the whole fragment) to artificially deaminate the ancient fragments. We simulated sequencing errors on both the ancient and contaminant fragments using empirical sequencing error rates from a PhiX library (Illumina Corp.) sequenced at the Max Planck Institute for Evolutionary Anthropology on an Illumina HiSeq, basecalled using freeIbis [30]. With the same empirical PhiX dataset distribution, we generated quality scores for each nucleotide. Fragments were mapped back to a random individual from the contaminant panel. Figure 6 shows DICE’s performance on this scenario with different error models. In all cases, we find that the parameters are estimated with high accuracy. As expected, the ts/tv model infers a higher error rate at transitions, due to the additional errors introduced by deamination on the ends of the ancient fragments.

![Figure 6.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F6.medium.gif)

Figure 6. Estimation of parameters for a high-coverage ancient DNA genome (30X) simulated under a realistic scenario in which fragments from the ancient and contaminant genome were generated and then mapped to a reference genome. We allowed for multiple substitutions at the same site after the split from chimp, as well as sequencing errors and post-mortem deamination errors at the ends of the fragments. The five panels show results from inferring parameters under five different error rate models. Top-left: single-error model. Top-right: two-error model [5]. Middle-left: model with separate errors for transitions (ts) and transversions (tv). Middle-right: single-error model with an ancestral state misidentification parameter. Bottom-left: model in which errors were inferred individually at each site, using base and mapping qualities obtained from the simulated BAM file. Error bars represent 95% posterior intervals.

### 4.2. Performance under violations of model assumptions

We evaluated the consequences of different violations of model assumptions. We started by observing the effects of using a small modern human panel. Figure S12 shows results for cases in which the contaminant/anchor panel is made up of only 20 haploid genomes. In this case, all parameters are estimated accurately, with only a slight bias towards overestimating the drift parameters, presumably because the low sampling of individuals acts as a population bottleneck, artificially increasing the drift time parameters estimated.

Additionally, we simulated a scenario in which only a single human contaminated the sample. That is, rather than drawing contaminant fragments from a panel of individuals, we randomly picked a set of two chromosomes at each unlinked site and only drew contaminant fragments from those two chromosomes. Figure S13 shows that inference is robust to this scenario, unless the contamination rate is very high (25%). In that case, the drift of the archaic genome is substantially under-estimated, but the error, contamination and anchor drift parameters only show slight inaccuracies in the estimate.

We then investigated the effect of admixture in the anchor/contaminant population from the archaic population, occurring after their divergence, which we did not account for in the simple, two-population model (Figure S11). In this case, the error and the contamination rates are accurately estimated, but both drift times are underestimated. This is to be expected, as admixture will tend to homogenize allele frequencies and thereby reduce the apparent drift separating the two populations.

### 4.3. Identifying the contaminant population

We sought to see whether we could use our method to identify the contaminant population, from among a set of candidate contaminants (for example, different present-day human panels). Because our MCMC samples are samples from the posterior distribution of the parameters and not the marginal likelihood of the data over the entire parameter space, we cannot perform proper Bayesian model selection. Instead, we used the posterior mode as a heuristic statistic that may suggest which panel is most likely to have contaminated the sample. We validated this choice of statistic using simulations under a variety of demographic scenarios (Figure S15). We simulated 5-population trees of varying drift times. The outgroup was chosen to be the ancient population and the rest were chosen to be the present-day human populations (A, B, C and D). One of the populations (A) was the true contaminant. To add another layer of complexity, we also allowed for admixture (at 0%, 5% and 50% rate) from the ancient population to the ancestral population of A and B. We then ran our MCMC method four times on each of these demographic scenarios, using D as the anchor and different panels as the putative contaminant in each run.

Figure S16 shows that the highest posterior mode always corresponds to the run that uses the true contaminant (A), and that the mode decreases the farther the tested contaminant is from the true contaminant in the tree. Additionally, Figures S17, S18 and S19 show the effect of misspecifying the contaminant panel for different admixture scenarios. The error rate and the anchor drift time are correctly estimated, even when the candidate contaminant is highly diverged from the true contaminant, while the other two parameters are more sensitive to misspecification. In general, the correct candidate contaminant produces the highest posterior probability and yields the best parameter estimates.

### 4.4. Empirical data

We first applied our method to published ancient DNA data from a high-coverage genome (52X) from Denisova cave in Siberia (the Altai Neanderthal) [4], and visually ensured that the chain had converged. The demographic, error and contamination estimates are shown in Table 1. We used the African (AFR) 1000 Genomes Phase 3 panel [16] as the anchor population. The drift times estimated for both samples are consistent with the known demographic history of Neanderthals and modern humans, and the contamination rates largely agree with previous estimates (see Discussion below).

[Table 1.](http://biorxiv.org/content/early/2016/01/19/022285/T1)

Table 1. Posterior modes of parameter estimates under the two-population inference framework for the Altai Neanderthal autosomal genome. We used different 1000G populations as candidate contaminants. Africans were the anchor population in all cases, so the modern human drift is with respect to Africans. Values in parentheses are 95% posterior quantiles.

We ran our method with different putative contaminant panels: Africans (AFR), East Asians (EAS), Admixed Americans (AMR), Europeans (EUR) and South Asians (SAS). For the Altai sample, we observe a contamination rate of ∼ 1% and an error rate of ∼ 0.1%, regardless of which panel we use. Furthermore, the drift on the Neanderthal side of the tree seems to be 6 times as large as the drift on the modern human side of the tree, reflecting the smaller effective population size of Neanderthals after their divergence. The EUR panel is the one with the highest posterior mode (Table 1).

We then tested a variety of ancient DNA nuclear genome sequences at different levels of coverage, obtained via different methods (shotgun sequencing and SNP capture) and from different hominin groups (modern humans and Neanderthals). We used AFR as the anchor panel and either AFR (Table S1) or EUR (Table S2) as the contaminant panel. For samples of high and medium average coverage, the MCMC converges to reasonable values for all parameters. For example, we estimate the ancient population drift parameter (*τA*) to be larger in Neanderthals than in various modern humans sampled across Eurasia, as the effective population size of the former was smaller and their split time to Africans was larger.

However, for samples of very low coverage, some of the parameters fail to converge properly, as the MCMC seems to get stuck at the boundaries of parameter space. We tested different boundaries and the problem remained. This appears to be less of a problem when using AFR as the putative contaminant panel than when using EUR as the putative contaminant panel, presumably because of the larger number of SNPs that may be informative for inference. In the former case, we only observe this problem when samples are at lower than ∼ 0.5X coverage. In the latter case, we observe the problem for samples at lower than ∼ 3X coverage.
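One simple way to flag a parameter that is stuck at a boundary is to measure how much of its posterior sample lies close to the prior bounds. The sketch below is an assumption-laden illustration, not the check implemented in DICE: the function name, the tolerance and the threshold fraction are arbitrary, the lower bound of 0 is assumed, and the sample values are made up. The upper bound of 5.0 mirrors the upper boundary for the drift parameter quoted in the caption of Table S3.

```python
def stuck_at_boundary(samples, lower, upper, tol=0.01, max_fraction=0.5):
    """Return True if most posterior samples hug one of the prior bounds.

    A sample counts as 'near a bound' if it lies within tol * (upper - lower)
    of either bound. The chain is flagged when more than max_fraction of the
    samples are near a bound. Both thresholds are arbitrary assumptions.
    """
    width = tol * (upper - lower)
    near = sum(1 for s in samples if (s - lower) <= width or (upper - s) <= width)
    return near / len(samples) > max_fraction


# Made-up drift samples: one chain piled up at the upper bound, one well inside.
print(stuck_at_boundary([4.99, 5.0, 4.98, 4.97], lower=0.0, upper=5.0))
print(stuck_at_boundary([1.0, 1.2, 0.9, 1.1], lower=0.0, upper=5.0))
```

A flag from such a check does not replace visual inspection of the trace, but it can help triage many low-coverage samples at once.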

For example, the low-coverage Neanderthal genome (0.5X) from Mezmaiskaya Cave in Western Russia [4] seems to converge to parameters within the prior boundaries when using AFR as the contaminant panel, but the ancient population drift gets stuck at the upper limit of parameter space when any of the other panels is used as the contaminant (Table S3). Regardless of which contaminant panel is used, there is good agreement with the modern human drift parameter obtained when using the Altai Neanderthal genome. However, we note that when using non-African populations as the contaminants, we obtain a higher (∼ 5%) contamination rate in the Mezmaiskaya Neanderthal than in the Altai Neanderthal. It is currently unclear to us whether this is due to the MCMC failing to properly converge or to a real feature of the data.

We sought to determine the robustness of our results to different levels of GC content. We did this because we initially hypothesized that endogenous DNA might be preserved at lower rates when GC content is low, leading to the presence of proportionally more contaminant DNA. We partitioned the Altai Neanderthal genome into three different regions of low (0% – 30%), medium (31% – 69%) and high (70% – 100%) GC content, using the ’GC content’ track downloaded from the UCSC genome browser [31]. We then used the two-population method to infer contamination, error and drift parameters, using Africans as the anchor population and Europeans as the contaminant population (Figure S20). We observe that contamination rates are higher in low-GC regions than in medium-GC regions (Welch one-sided t-test on the posterior samples, P < 2.2e-16), which in turn have higher contamination rates than high-GC regions (P < 2.2e-16). The opposite trend occurs in the error estimates, while the drift parameters are largely unaffected. However, we find that the differences we observe across GC levels are almost entirely eliminated by removing CpG sites from the input dataset (Figure S20), as CpG sites are known to have higher mutation rates than the rest of the genome. For this reason, we recommend filtering them out when testing for contamination on ancient DNA datasets, which is what was done in Tables 1 and 2.
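The partition by GC content and the comparison of posterior samples can be sketched as below. This is a minimal illustration under stated assumptions rather than our actual pipeline: the function names are hypothetical, GC values are assumed to be percentages (values strictly between 30 and 31, or between 69 and 70, are assigned to the medium class), and the sample values are made up. Only the Welch t statistic is computed; obtaining a P-value would additionally require the t distribution with Welch-Satterthwaite degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance


def gc_bin(gc_percent):
    """Assign a GC percentage to the low (0-30), medium (31-69) or high (70-100) class."""
    if gc_percent <= 30:
        return "low"
    if gc_percent >= 70:
        return "high"
    return "medium"


def welch_t(a, b):
    """Welch's t statistic for the one-sided comparison mean(a) > mean(b).

    Uses the unbiased sample variances and does not assume equal variances.
    """
    se2 = variance(a) / len(a) + variance(b) / len(b)
    return (mean(a) - mean(b)) / sqrt(se2)


# Made-up posterior samples of the contamination rate in two GC classes.
low_gc = [0.020, 0.030, 0.040]
medium_gc = [0.010, 0.020, 0.030]
print(gc_bin(20), gc_bin(50), gc_bin(80))
print(welch_t(low_gc, medium_gc))
```

A large positive statistic indicates a higher mean contamination rate in the first set of samples than in the second.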

[Table 2.](http://biorxiv.org/content/early/2016/01/19/022285/T2)

Table 2. 
Posterior modes of parameter estimates under the three-population inference framework for the Altai Neanderthal autosomal genome. We used different 1000G populations as candidate contaminants. In all cases, Africans were the unadmixed anchor population and Europeans were the admixed anchor population. The ancestral human drift refers to the drift in the modern human branch before the split of Europeans and Africans. The post-split European-specific and African-specific drifts were estimated separately without the archaic genome (*τAfr* = 0.009, *τEur* = 0.255).

Finally, we tested a present-day Yoruba genome (HGDP00936) sequenced to high coverage [4], which should not contain any contamination. Indeed, when applying our method, we find this to be the case (Figure S21). We infer 0% contamination, regardless of whether we use EUR or AFR as the candidate contaminant. Furthermore, the anchor drift time is very close to 0 when using AFR as the anchor population (as the sample belongs to that same population), while it is non-zero (= 0.22) when using EUR, which is consistent with the drift time separating Europeans from the ancestor of Europeans and their closest African sister populations [32].

## 5. Results: three-population method

### 5.1. Simulations

We applied our three-population method to estimate both drift times and admixture rates. We simulated a high-coverage (30X) archaic human genome under various demographic and contamination scenarios. Each of the two anchor population panels contained 20 haploid genomes. The admixture time was 0.08 drift units ago, which under a constant population size of 2N=20,000 would be equivalent to 1,600 generations ago. When running our inference program, we set the admixture time prior boundaries to be between 0.06 and 0.1 drift units ago.
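The conversion between drift units and generations used above is a single multiplication when the population size is constant. The following sketch (with a hypothetical function name) reproduces the figure quoted in the text.

```python
def drift_to_generations(tau, two_n):
    """Convert a drift time tau (in units of 2N generations) into generations,
    assuming a constant population size with 2N = two_n."""
    return tau * two_n


# Values from the text: an admixture time of 0.08 drift units with 2N = 20,000.
print(round(drift_to_generations(0.08, 20_000)))  # 1600
```

Under a variable population size history, this simple relationship no longer holds and the conversion requires an estimate of that history.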

We find that the admixture time is inaccurately estimated under this implementation – likely due to lack of information in the site-frequency spectrum – so we do not show estimates for that parameter below. For admixture rates of 0%, 5% or 20%, the error and contamination parameters are estimated accurately in all cases (Figures S22, S23 and S24, respectively). The method is less accurate when estimating the demographic parameters, especially the admixture rate, which is sometimes under-estimated. Importantly though, the accuracy of the contamination rate estimates is not affected by incorrect estimation of the demographic parameters.

We also tested what would happen if the admixture time was simulated to be recent: 0.005 drift units ago, or 100 generations ago under a constant population size of 2N=20,000. When estimating parameters, we set the prior for the admixture time to be between 0 and 0.01 drift units ago. In this last case, we observe that the drift times and the admixture rate (20%) are more accurately estimated than when the admixture event is ancient (Figure 7).

![Figure 7.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F7.medium.gif)

[Figure 7.](http://biorxiv.org/content/early/2016/01/19/022285/F7)

Figure 7. Estimation of error, contamination and demographic parameters in various three-population demographic scenarios, where the admixture rate is 20% and the admixture time was recent (0.005 drift units ago). The prior used for the admixture time was uniform over [0,0.01]. Error bars represent 95% posterior intervals.

As before, we also verified that the posterior mode was a good proxy to identify the true contaminant (A), when running the MCMC using different contaminant panels (A, B, C and D). In all cases, we used D as the unadmixed anchor panel and B as the admixed anchor panel. Results are shown in Figure S25 for all the demographic scenarios from Figure S15. Again, we observe that the true contaminant (A) is always the one that corresponds to the highest posterior probability, though we again caution that because we do not have the marginal probabilities, we cannot formally perform model selection to favor a particular panel. Furthermore, the admixture rate from the ancient population into the ancestors of A and B is robustly estimated unless the true contaminant (A) is highly diverged from the candidate contaminant (Figures S26, S27, S28, for admixture rates of 0%, 5% and 50%, respectively).

### 5.2. Empirical data

We also applied the three-population inference framework to the high-coverage Altai Neanderthal genome. We first estimated the two drift times specific to Europeans and Africans after the split from each other (*τY* and *τZ*, respectively), using ∂a∂i and the L-BFGS-B likelihood optimization algorithm [13], but without using the archaic genome (*τAfr* = 0.009, *τEur* = 0.255). Then, we used our MCMC method to estimate the rest of the drift times, the archaic admixture rate and the contamination and error parameters in the Neanderthal genome. We set the admixture time prior boundaries to be between 0.06 and 0.1 drift units ago, which is a realistic time frame given knowledge about modern human – Neanderthal cohabitation in Eurasia [33]. The error rate and contamination rates we obtain are similar to those obtained under the two-population method, and we estimate an admixture rate from Neanderthals into modern humans of 1.72% for the choice of contaminant panel with the highest posterior mode – which is again EUR (Table 2).

We also applied the method to the low-coverage Mezmaiskaya Neanderthal genome. As before, we are able to reach convergence for all parameters (including the admixture rate) with the exception of the Neanderthal drift, which gets stuck in the upper boundary of parameter space (Table S4).

## 6. Discussion

We have developed a new method to jointly infer demographic parameters, along with contamination and error rates, when analyzing an ancient DNA sample. The method can be deployed using a C++ program (DICE) that is easy to use and freely downloadable. We therefore expect it to be highly applicable in the field of paleogenomics, allowing researchers to derive useful information from previously unusable (highly contaminated) samples, including archaic humans like Neanderthals, as well as ancient modern humans.

Applications to simulations show that the error and contamination parameters are estimated with high accuracy, and that demographic parameters can also be estimated accurately so long as enough information (e.g. a large panel of modern humans) is available. The drift time estimates reflect how much genetic drift has acted to differentiate the archaic and modern populations since the split from their common ancestral population, and can be converted to divergence times in generations if an accurate history of population size changes is also available (for example, via methods like PSMC, [34]). Although we cannot perform proper model testing, we found via extensive simulations that the posterior mode of an MCMC run was a robust heuristic statistic to help detect which panel was most likely to have contaminated the sample. We caution, however, that the fact that a particular panel yields a higher posterior mode than another is no guarantee that it is a better fit to the data for demographic scenarios that may be different from the ones we simulated.

We also applied our method to empirical data, specifically to two Neanderthal genomes at high and low coverage, a present-day high-coverage Yoruba genome, and several ancient genome sequences of varying degrees of coverage, some obtained via shotgun-sequencing and some via SNP capture. For the high-coverage Yoruba genome, we infer no contamination, as would be expected from a modern-day sample, and drift times indicating the Yoruba sample indeed belongs to an African population.

The contamination and sequencing error estimates we obtained for the Altai Neanderthal are roughly in accordance with previous estimates [4]. The drift times we obtain under the three-population model for the African population (*τC* + *τAfr*) are approximately 0.411 + 0.009 = 0.42 drift units. The geometric mean of the history of population sizes from the PSMC results in Prüfer et al. [4] gives roughly *Ne* ≈ 21,818 since the African population size history started differing from that of Neanderthals, assuming a mutation rate of 1.25 × 10⁻⁸ per bp per generation. If we assume a generation time of 29 years, and use our drift time in the equation relating divergence time in generations to drift time (*t*/(2*Ne*) ≈ *τ*), this gives an approximate human-Neanderthal population divergence time of 531,486 years. This number roughly agrees with the most recent estimates obtained via other methods [4]. Additionally, the Neanderthal-specific drift time is approximately 6.5 times as large as the modern human drift time, which is expected as Neanderthals had much smaller population sizes than modern humans [35, 4]. The admixture rate from archaic to modern humans that we estimate is 1.72%, which is consistent with the rate estimate obtained via methods that do not jointly model contamination (1.5 – 2.1%) [4]. In the case of the Altai Neanderthal, we observe that the sample was probably contaminated by one or more individuals with European ancestry.
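The arithmetic behind the divergence time can be retraced with a few lines of code. The function name below is hypothetical; the inputs are the values quoted in the text (total African drift of 0.411 + 0.009, *Ne* ≈ 21,818 and a generation time of 29 years), and the calculation simply rearranges *t*/(2*Ne*) ≈ *τ* into *t* ≈ 2*Ne* *τ*.

```python
def divergence_time_years(tau, ne, generation_years):
    """Approximate divergence time in years from a drift time tau,
    using t ~= 2 * Ne * tau generations and a fixed generation time."""
    generations = 2 * ne * tau
    return generations * generation_years


tau_total = 0.411 + 0.009  # drift on the African side, in drift units
years = divergence_time_years(tau_total, ne=21_818, generation_years=29)
print(f"{years:,.0f}")  # 531,486
```

The result is only as reliable as the assumed effective size, mutation rate and generation time, all of which carry considerable uncertainty.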

When testing modern human and Neanderthal ancient genomes of lower coverage than the Altai Neanderthal, we obtain reasonable parameter estimates for samples of medium to high coverage. However, we run into problems in estimation when the samples are of low coverage. For these reasons, and based on our simulation results, we recommend that our method be used on nuclear genomes with > 3X coverage. The method may converge under certain conditions at coverages as low as 0.5X (for example, in the case of the Mezmaiskaya genome under the two-population model when using AFR as the anchor and contaminant panel), but, in such cases, we caution the user to check that convergence has been achieved before drawing any conclusions from the estimates. For SNP capture data, we obtain reliable estimates for samples that cover at least 500,000 sites that are polymorphic in the anchor panel.

The demographic models used in our approach are simple, involving no more than three populations and a single admixture event. This is partly due to limitations of known theory about the diffusion-based likelihood of an arbitrarily complex demography for the 2-D site-frequency spectrum – in the case of the two-population method – and to the inability of ∂a∂i [20] to handle more than 3 populations at a time. In recent years, several studies have made advances in the development of methods to compute the likelihood of an SFS for larger numbers of populations using coalescent theory [36, 37, 38], with multiple population size changes and admixture events. We hope that some of these techniques could be incorporated in future versions of our inference framework.

## Supporting Information

![Figure S1.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F8.medium.gif)

[Figure S1.](http://biorxiv.org/content/early/2016/01/19/022285/F8)

Figure S1. Estimation of parameters for a high-coverage ancient DNA genome (30X) with high sequencing error (10%), no admixture and a large anchor population panel (100 haploid genomes). Error bars represent 95% posterior intervals.

[Table S1.](http://biorxiv.org/content/early/2016/01/19/022285/T3)

Table S1. 
We applied the two-population method to ancient Neanderthal and modern human genomes ranging from 52X to 0.054X coverage. We tested both shotgun-sequencing data and SNP capture data. We used AFR as both the anchor panel and the putative contaminant panel. Samples are sorted by decreasing mean coverage. We define Convergence to be true (T) if all the parameters stably converged in a region of parameter space that does not include the upper parameter boundary. Otherwise Convergence is false (F). A line separates the two Convergence classes. SNPs = number of SNPs overlapping with anchor panel. Observations = total number of base observations analyzed. SC = SNP capture. SS = shotgun sequencing. HG = hunter-gatherer. LBK = Linear Pottery culture. MN = Middle Neolithic. LN = Late Neolithic. NEA = Neanderthal. MH = Modern Human. LogPos = Log-posterior mode. Reported Cov. = Mean read coverage reported in corresponding study. For SNP capture, this is the mean coverage of the targeted SNPs.

[Table S2.](http://biorxiv.org/content/early/2016/01/19/022285/T4)

Table S2. 
We applied the two-population method to ancient Neanderthal and modern human genomes ranging from 52X to 0.054X coverage. We tested both shotgun-sequencing data and SNP capture data. We used AFR as the anchor panel and EUR as the putative contaminant panel. Samples are sorted by decreasing mean coverage. We define Convergence to be true (T) if all the parameters stably converged in a region of parameter space that does not include the upper parameter boundary. Otherwise Convergence is false (F). A line separates the two Convergence classes. SNPs = number of SNPs overlapping with anchor panel. Observations = total number of base observations analyzed. SC = SNP capture. SS = shotgun sequencing. HG = hunter-gatherer. LBK = Linear Pottery culture. MN = Middle Neolithic. LN = Late Neolithic. NEA = Neanderthal. MH = Modern Human. LogPos = Log-posterior mode. Reported Cov. = Mean read coverage reported in corresponding study. For SNP capture, this is the mean coverage of the targeted SNPs.

[Table S3.](http://biorxiv.org/content/early/2016/01/19/022285/T5)

Table S3. 
Posterior modes of parameter estimates under the two-population inference framework for the Mezmaiskaya Neanderthal autosomal genome. We used different 1000G populations as candidate contaminants. AFR were the anchor population in all cases, so the modern human drift is with respect to Africans. Values in parentheses are 95% posterior quantiles. Except when using AFR as the contaminant, the Neanderthal drift parameter gets stuck at the upper boundary (5 drift units) of parameter space.

[Table S4.](http://biorxiv.org/content/early/2016/01/19/022285/T6)

Table S4. 
Posterior modes of parameter estimates under the three-population inference framework for the Mezmaiskaya Neanderthal autosomal genome. We used different 1000G populations as candidate contaminants. In all cases, Africans were the unadmixed anchor population and Europeans were the admixed anchor population. The ancestral human drift refers to the drift in the modern human branch before the split of Europeans and Africans. The post-split European-specific and African-specific drifts were estimated separately without the archaic genome (*τAfr* = 0.009, *τEur* = 0.255). In all cases, the Neanderthal drift parameter gets stuck at the upper boundary (5 drift units) of parameter space.

![Figure S2.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F9.medium.gif)

[Figure S2.](http://biorxiv.org/content/early/2016/01/19/022285/F9)

Figure S2. Absolute difference between estimated and simulated contamination rates for a variety of anchor drift and contamination scenarios, for different levels of coverage. In all simulations, the anchor drift was set to be equal to the ancient sample drift.

![Figure S3.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F10.medium.gif)

[Figure S3.](http://biorxiv.org/content/early/2016/01/19/022285/F10)

Figure S3. Absolute difference between estimated and simulated anchor drifts for a variety of anchor drift and contamination scenarios, for different levels of coverage. In all simulations, the anchor drift was set to be equal to the ancient sample drift.

![Figure S4.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F11.medium.gif)

[Figure S4.](http://biorxiv.org/content/early/2016/01/19/022285/F11)

Figure S4. Absolute difference between estimated and simulated ancient sample drifts for a variety of anchor drift and contamination scenarios, for different levels of coverage. In all simulations, the anchor drift was set to be equal to the ancient sample drift.

![Figure S5.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F12.medium.gif)

[Figure S5.](http://biorxiv.org/content/early/2016/01/19/022285/F12)

Figure S5. Absolute difference between estimated and simulated contamination rates for a variety of anchor drift and contamination scenarios, when coverage is low (0.5X, 1X or 1.5X). Here, we used a large number of sites and ran the MCMC chain for 1 million steps. In all simulations, the anchor drift was set to be equal to the ancient sample drift.

![Figure S6.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F13.medium.gif)

[Figure S6.](http://biorxiv.org/content/early/2016/01/19/022285/F13)

Figure S6. Absolute difference between estimated and simulated anchor drifts for a variety of anchor drift and contamination scenarios, when coverage is low (0.5X, 1X or 1.5X). Here, we used a large number of sites and ran the MCMC chain for 1 million steps. In all simulations, the anchor drift was set to be equal to the ancient sample drift.

![Figure S7.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F14.medium.gif)

[Figure S7.](http://biorxiv.org/content/early/2016/01/19/022285/F14)

Figure S7. Absolute difference between estimated and simulated ancient sample drifts for a variety of anchor drift and contamination scenarios, when coverage is low (0.5X, 1X or 1.5X). Here, we used a large number of sites and ran the MCMC chain for 1 million steps. In all simulations, the anchor drift was set to be equal to the ancient sample drift.

![Figure S8.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F15.medium.gif)

[Figure S8.](http://biorxiv.org/content/early/2016/01/19/022285/F15)

Figure S8. Absolute difference between estimated and simulated contamination rates for a variety of anchor drift and contamination scenarios, when coverage is low (0.5X, 1X or 1.5X). We used a large number of sites and ran 10 MCMC chains for 1 million steps each. To ensure convergence, we then selected the chain with the highest posterior probability, and here show estimates from that chain. In all simulations, the anchor drift was set to be equal to the ancient sample drift.

![Figure S9.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F16.medium.gif)

[Figure S9.](http://biorxiv.org/content/early/2016/01/19/022285/F16)

Figure S9. Absolute difference between estimated and simulated anchor drifts for a variety of anchor drift and contamination scenarios, when coverage is low (0.5X, 1X or 1.5X). We used a large number of sites and ran 10 MCMC chains for 1 million steps each. To ensure convergence, we then selected the chain with the highest posterior probability, and here show estimates from that chain. In all simulations, the anchor drift was set to be equal to the ancient sample drift.

![Figure S10.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F17.medium.gif)

[Figure S10.](http://biorxiv.org/content/early/2016/01/19/022285/F17)

Figure S10. Absolute difference between estimated and simulated ancient sample drifts for a variety of anchor drift and contamination scenarios, when coverage is low (0.5X, 1X or 1.5X). We used a large number of sites and ran 10 MCMC chains for 1 million steps each. To ensure convergence, we then selected the chain with the highest posterior probability, and here show estimates from that chain. In all simulations, the anchor drift was set to be equal to the ancient sample drift.

![Figure S11.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F18.medium.gif)

[Figure S11.](http://biorxiv.org/content/early/2016/01/19/022285/F18)

Figure S11. Estimation of parameters for a high-coverage ancient DNA genome (30X) with low sequencing error (0.1%), a large anchor population panel (100 haploid genomes) and admixture in the anchor population from the archaic population (5%), using the two-population inference framework, which does not model admixture. Error bars represent 95% posterior intervals.

![Figure S12.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F19.medium.gif)

[Figure S12.](http://biorxiv.org/content/early/2016/01/19/022285/F19)

Figure S12. Estimation of parameters for a high-coverage ancient DNA genome (30X) with low sequencing error (0.1%), no admixture and a small anchor population panel (20 haploid genomes). Error bars represent 95% posterior intervals.

![Figure S13.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F20.medium.gif)

[Figure S13.](http://biorxiv.org/content/early/2016/01/19/022285/F20)

Figure S13. Estimation of parameters for a high-coverage ancient DNA genome (30X), when the contaminant fragments are exclusively drawn from a single diploid individual from the contaminant panel. Error bars represent 95% posterior intervals.

![Figure S14.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F21.medium.gif)

[Figure S14.](http://biorxiv.org/content/early/2016/01/19/022285/F21)

Figure S14. Estimation of parameters for an ancient DNA genome of very low coverage (0.5X) with low sequencing error (0.1%) and a large anchor population panel (100 haploid genomes). Note that unlike the rest of the simulations, the number of SNPs used in this case was approximately 1.6 million instead of 80,000, and the MCMC chain was run for 1 million steps instead of 100,000. Using a lower number of SNPs or running the chain for a shorter time resulted in inaccurate inferences. Error bars represent 95% posterior intervals.

![Figure S15.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F22.medium.gif)

[Figure S15.](http://biorxiv.org/content/early/2016/01/19/022285/F22)

Figure S15. Three demographic models used to test the method when the contaminant is misspecified. When testing the two-population method, we set panel A as the true contaminant and panel D as the anchor. When testing the three-population method, we set panel A as the true contaminant, panel D as the unadmixed anchor and panel B as the admixed anchor. The numbers on the branches represent the drift parameters. The parameter *α* represents the admixture rate from the ancient population into the ancestor of A and B.

![Figure S16.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F23.medium.gif)

[Figure S16.](http://biorxiv.org/content/early/2016/01/19/022285/F23)

Figure S16. When testing different putative contaminants, the highest mode of the posterior from the MCMC under the two-population model corresponds to the true contaminant (panel A). The y-axis shows the difference between the log-posterior for contaminant panel A and the log-posterior for different candidate contaminant panels (A, B, C, D). We added 1 to the difference so that it could be plotted on a logarithmic scale. The three panels contain results for three admixture scenarios (from left to right: admixture rate of 0%, 5% and 50%) and each panel shows the difference under different contamination rates and demographic models (see Figure S15).

![Figure S17.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F24.medium.gif)

[Figure S17.](http://biorxiv.org/content/early/2016/01/19/022285/F24)

Figure S17. Parameter estimates under the two-population model using different putative contaminants, when the true contaminant is panel A. Each row of panels represents a different set of drift parameters, keeping the contamination rate fixed at 25% and the error rate at 0.1%. In this case, the admixture rate from the ancient population to the ancestor of A and B was kept at 0%. The anchor panel used was panel D (see Figure S15).

![Figure S18.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F25.medium.gif)

[Figure S18.](http://biorxiv.org/content/early/2016/01/19/022285/F25)

Figure S18. Parameter estimates under the two-population model using different putative contaminants, when the true contaminant is panel A. Each row of panels represents a different set of drift parameters, keeping the contamination rate fixed at 25% and the error rate at 0.1%. In this case, the admixture rate from the ancient population to the ancestor of A and B was kept at 5%. The anchor panel used was panel D (see Figure S15).

![Figure S19.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F26.medium.gif)

[Figure S19.](http://biorxiv.org/content/early/2016/01/19/022285/F26)

Figure S19. Parameter estimates under the two-population model using different putative contaminants, when the true contaminant is panel A. Each row of panels represents a different set of drift parameters, keeping the contamination rate fixed at 25% and the error rate at 0.1%. In this case, the admixture rate from the ancient population to the ancestor of A and B was kept at 50%. The anchor panel used was panel D (see Figure S15).

![Figure S20.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F27.medium.gif)

[Figure S20.](http://biorxiv.org/content/early/2016/01/19/022285/F27)

Figure S20. Estimation of parameters for the Altai Neanderthal genome across different GC levels using the two-population model, while keeping (black) or removing (red) CpG sites from the input dataset. Error bars represent 95% posterior intervals.

![Figure S21.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F28.medium.gif)

[Figure S21.](http://biorxiv.org/content/early/2016/01/19/022285/F28)

Figure S21. We tested one of the Yoruba genomes from Prüfer et al. [4] and obtained an estimate of 0% contamination, regardless of whether we used Europeans or Africans as the candidate contaminant. The anchor drift time is close to 0 when using Africans as the anchor population, as the sample belongs to that same population, while it is non-zero (= 0.22) when using Europeans. Error bars represent 95% posterior intervals.

![Figure S22.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F29.medium.gif)

[Figure S22.](http://biorxiv.org/content/early/2016/01/19/022285/F29)

Figure S22. Estimation of error, contamination and demographic parameters in various three-population demographic scenarios, where the admixture rate is 0%. The prior used for the admixture time was uniform over [0.06,0.1]. Error bars represent 95% posterior intervals.

![Figure S23.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F30.medium.gif)

[Figure S23.](http://biorxiv.org/content/early/2016/01/19/022285/F30)

Figure S23. Estimation of error, contamination and demographic parameters in various three-population demographic scenarios, where the admixture rate is 5% and the admixture time is ancient (0.08 drift units ago). The prior used for the admixture time was uniform over [0.06,0.1]. Error bars represent 95% posterior intervals.

![Figure S24.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F31.medium.gif)

[Figure S24.](http://biorxiv.org/content/early/2016/01/19/022285/F31)

Figure S24. Estimation of error, contamination and demographic parameters in various three-population demographic scenarios, where the admixture rate is 20% and the admixture time is ancient (0.08 drift units ago). The prior used for the admixture time was uniform over [0.06,0.1]. Error bars represent 95% posterior intervals.

![Figure S25.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F32.medium.gif)

[Figure S25.](http://biorxiv.org/content/early/2016/01/19/022285/F32)

Figure S25. When testing different putative contaminants, the highest mode of the posterior from the MCMC under the three-population model corresponds to the true contaminant (panel A). The y-axis shows the difference between the log-posterior for contaminant panel A and the log-posterior for different candidate contaminant panels (A, B, C, D). We added 1 to the difference so that it could be plotted on a logarithmic scale. The three panels contain results for three admixture scenarios (from left to right: admixture rate of 0%, 5% and 50%) and each panel shows the difference under different contamination rates and demographic models (see Figure S15).

![Figure S26.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F33.medium.gif)

Figure S26. Parameter estimates under the three-population model using different putative contaminants, when the true contaminant is panel A. Each row of panels represents a different set of drift parameters, keeping the contamination rate fixed at 25% and the error rate at 0.1%. In this case, the admixture rate from the ancient population to the ancestor of A and B was kept at 0%. The unadmixed anchor panel used was panel D and the admixed anchor panel was panel B (see Figure S15).

![Figure S27.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F34.medium.gif)

Figure S27. Parameter estimates under the three-population model using different putative contaminants, when the true contaminant is panel A. Each row of panels represents a different set of drift parameters, keeping the contamination rate fixed at 25% and the error rate at 0.1%. In this case, the admixture rate from the ancient population to the ancestor of A and B was kept at 5%. The unadmixed anchor panel used was panel D and the admixed anchor panel was panel B (see Figure S15).

![Figure S28.](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2016/01/19/022285/F35.medium.gif)

Figure S28. Parameter estimates under the three-population model using different putative contaminants, when the true contaminant is panel A. Each row of panels represents a different set of drift parameters, keeping the contamination rate fixed at 25% and the error rate at 0.1%. In this case, the admixture rate from the ancient population to the ancestor of A and B was kept at 50%. The unadmixed anchor panel used was panel D and the admixed anchor panel was panel B (see Figure S15).

## 7. Acknowledgments

We thank Kelley Harris, Philip Johnson, Graham Coop, Nicolas Duforet-Frebourg, Joshua Schraiber, Sergi Castellano, Christoph Theunert, Janet Kelso, Rasmus Nielsen and members of the Slatkin and Nielsen labs for helpful advice and discussions.

## Appendix A. Genotype probabilities conditional on a demography

Below we derive formulas 7, 8 and 9. Recall that we are interested in calculating the conditional probabilities *P*[*i*|**Ω**, **O**] = *P*[*i*|*y*, *τ*C, *τ*A] for all three possibilities for the genotype in the ancient individual: *i* = 0, 1 or 2. These can be obtained from the definition of conditional probability. Let ![Graphic][11] be the joint probability that a site has frequency *y* (0 < *y* < 1) in the contaminant panel and is homozygous for the derived allele in the ancient individual. Let ![Graphic][12] be the joint probability that a site has frequency *y* in the contaminant panel and is heterozygous in the ancient individual. Finally, let ![Graphic][13] be the joint probability that a site has frequency *y* in the anchor panel and is homozygous for the ancestral allele in the ancient individual. Then:

![Formula][14]

![Formula][15]

![Formula][16]

In the above expressions, the functions *f* depend on *τ*C and *τ*A, but we omit this conditioning for ease of notation. As can be seen, all we need to find are the joint probabilities ![Graphic][17] and ![Graphic][18]. Here is where diffusion theory comes into play. Let *ϕ*(*y, τ*|*x*, 0) be the Kimura solution to the neutral forward diffusion equation in the absence of mutation [42], given a frequency *x* at time 0 and an elapsed drift time *τ*:

![Formula][19]

Here, *x* is the unknown population frequency of the derived allele in the ancestral population and ![Graphic][20] (•) is the Gegenbauer polynomial of order h-1 [43].

Assuming the ancestral population follows an equilibrium frequency distribution *g(x)* = *θ/x*, we can write ![Graphic][21] as follows:

![Formula][22]

where *z* is the unknown population frequency of a derived allele in the population to which the ancient individual belongs.

The expression in parentheses is the second moment of the transition density and its solution is known [44]:

![Formula][23]
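
As a rough numerical check of this second moment, the sketch below propagates the exact allele-count distribution of a small Wright-Fisher population and compares its moments with the closed form *x* − *x*(1 − *x*)e^(−*τ*). It rests on assumptions that are not taken from the text above: drift time is measured as *τ* = *t*/*M*, with *M* the number of gene copies (2*N* in a diploid population), and the function names are purely illustrative.

```python
from math import comb, exp

def wf_step(dist, M):
    """Advance the allele-count distribution by one Wright-Fisher generation."""
    new = [0.0] * (M + 1)
    for k, pk in enumerate(dist):
        if pk == 0.0:
            continue
        p = k / M  # current derived-allele frequency
        for j in range(M + 1):
            # Binomial resampling of M gene copies from frequency p
            new[j] += pk * comb(M, j) * p**j * (1 - p) ** (M - j)
    return new

def wf_moments(start_count, M, generations):
    """Exact first and second moments of the frequency after some generations."""
    dist = [0.0] * (M + 1)
    dist[start_count] = 1.0
    for _ in range(generations):
        dist = wf_step(dist, M)
    m1 = sum(p * (k / M) for k, p in enumerate(dist))
    m2 = sum(p * (k / M) ** 2 for k, p in enumerate(dist))
    return m1, m2

def diffusion_second_moment(x, tau):
    """Diffusion-limit second moment, assuming tau = t / M (an assumption)."""
    return x - x * (1 - x) * exp(-tau)

if __name__ == "__main__":
    M, generations, start = 50, 25, 10   # x = 0.2, tau = 0.5
    m1, m2 = wf_moments(start, M, generations)
    print("first moment :", m1)
    print("second moment:", m2, "vs diffusion:", diffusion_second_moment(0.2, 0.5))
```

In this toy setting the first moment stays at *x*, and the exact second moment differs from the diffusion value only through the discreteness of the population; the gap shrinks as *M* grows.
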

This results in:

![Formula][24]

![Formula][25]

The integral of the first two terms of the sum was solved in Chen et al. [18]:

![Formula][26]

The third term of the sum can be solved by noting that, though the integrand is an infinite sum (i.e. formula A.4 multiplied by *x*), only the integrals of the first two terms of that infinite sum are not equal to 0. This can be seen by integrating the parts of the terms of that infinite sum that depend on *x*:

![Formula][27]

Therefore, after integrating the first two terms of the infinite sum, we obtain:

![Formula][28]

So we finally arrive at:

![Formula][29]

We can obtain ![Graphic][30] in a similar fashion:

![Formula][31]

Solving the term in the parentheses:

![Formula][32]

The first term of the difference is the first moment of the transition density, which is equal to *x* [44], while the second term is the second moment (formula A.6). Therefore:

![Formula][33]

![Formula][34]

And after using formulas A.9 and A.10, we obtain:

![Formula][35]

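
The inner terms of these integrals, that is, the genotype probabilities of the ancient individual conditional on the ancestral frequency *x*, can be sketched numerically from the first and second moments alone. The following is a minimal illustration, assuming that drift time *τ* is scaled so that expected heterozygosity decays as e^(−*τ*), and that the two alleles of the ancient individual are sampled independently from frequency *z*. The function name is hypothetical and the code is not taken from DICE.

```python
from math import exp

def genotype_probs_given_x(x, tau):
    """Return (P[i=0], P[i=1], P[i=2]) given ancestral frequency x and drift tau.

    Assumes E[z] = x and E[z^2] = x - x(1 - x) * exp(-tau), with tau scaled
    so that heterozygosity decays as exp(-tau) (an assumption of this sketch).
    """
    m1 = x                                   # first moment of the transition density
    m2 = x - x * (1 - x) * exp(-tau)         # second moment of the transition density
    p_hom_derived = m2                       # E[z^2]
    p_het = 2 * (m1 - m2)                    # E[2 z (1 - z)]
    p_hom_ancestral = 1 - p_het - p_hom_derived  # E[(1 - z)^2]
    return p_hom_ancestral, p_het, p_hom_derived

if __name__ == "__main__":
    for tau in (0.0, 0.1, 1.0, 50.0):
        print(tau, genotype_probs_given_x(0.3, tau))
```

With no drift the three values reduce to the Hardy-Weinberg proportions at frequency *x*, and for very long drift times the heterozygote probability vanishes, leaving probability *x* of being homozygous derived.
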
To obtain ![Graphic][36], we know that, assuming the anchor population to be at equilibrium:

![Formula][37]

And therefore:

![Formula][38]

So we finally obtain:

![Formula][39]

We now have all the elements necessary to obtain the conditional probabilities from formulas A.1, A.2 and A.3, which immediately lead us to formulas 7, 8 and 9.

## Appendix B. Probabilistic inference using BAM files

Here, we briefly explain how we infer fragment-specific error parameters in the optional BAM mode of DICE. Let ℝ be the set of all fragments in the BAM file, and *Rj* ∈ ℝ be a particular aligned fragment of length *l*. For fragment *Rj*, let {*bj*,1, …, *bj,l*} be the individual nucleotides in the fragment. At each position of the fragment, there is a specific probability *κj,i* that the base is erroneous. This probability is provided by the basecaller. Below, we compute the likelihood of observing a base *bj,i* ∈ *Rj* under a bi-allelic model, given an error rate *κj,i*. We focus on an individual fragment *Rj* and an individual position *i* on that fragment, so for simplicity we drop the subscripts *i* and *j* and let *bj,i* = *b* and *κj,i* = *κ*.

Let *ν* be the base that was originally sampled at a given site, before deamination or mismapping. This base could be ancestral or derived. Let *Pdam*[*ν* → *b*] be the probability of substitution from *ν* to *b* due to post-mortem chemical damage. The probabilities of different types of damage (e.g. C → T or G → A) occurring at different positions of a fragment can be computed following Ginolhac et al. [45] and Jónsson et al. [46], producing a matrix that can be provided to DICE as input. We offer the possibility of specifying different post-mortem damage matrices for the endogenous and the contaminant fragments.

Let *E* denote the event that a sequencing error has occurred, let *D* be the event that chemical damage has occurred, let *M* be the event that *Rj* was correctly mapped and let ¬ denote the complement of an event (i.e. the event has not occurred). We define the probability of observing sequenced base *b* given that no sequencing error has occurred at a position on a correctly mapped fragment that was originally *ν*, by summing over two possibilities: either chemical damage occurred or it did not:

![Formula][40]

Here, ![Graphic][41](*ν* = *b*) is an indicator function that is equal to 1 if *ν* is equal to *b*, and 0 otherwise. The probabilities *P[D]* and *P[¬D]* are respectively equal to *Pdam*[*ν* → *b*] and 1 – *Pdam*[*ν* → *b*].

Subsequently, we compute *P*[*b*|*ν*, *M*], the probability of observing *b* given *ν* under the assumption that *Rj* was mapped at the correct genomic location. We have:

![Formula][42]

This is because if a sequencing error has occurred, the probability of observing *b* is independent of *ν*, and therefore ![Graphic][43]. Finally, let *P[M]* be the probability that the fragment *Rj* is mapped at the correct location as given by the mapping quality. The probability of seeing *b* given that *ν* was the base that was sampled before deamination is then:

![Formula][44]

The probability of observing *b* given that the fragment was mismapped is independent of *ν*, hence ![Graphic][45]. If either the base quality or the mapping quality indicates a probability of error of 100%, *P*[*b*|*ν*] will be equal to ![Graphic][46]. These probabilities are used instead of the genome-wide error term *ϵ* in equations 4, 5 and 6. For instance, Equation 4 for a specific base *b* in fragment *Rj* becomes:

![Formula][47]

Here, *der* is the derived base and *anc* is the ancestral base. In case different post-mortem damage matrices are provided by the user for the ancient and the contaminant fragments, the events *contaminant* and *ancient* serve to denote which damage probabilities (i.e. *Pdam*) should be used in each case.
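
The layering of damage, sequencing error and mismapping described in this appendix can be sketched as follows. This is an illustrative reading and not the DICE implementation. It makes several assumptions: base and mapping qualities are Phred-scaled; the damage input is a dictionary of off-diagonal substitution probabilities; and a base produced by a sequencing error or a mismapped fragment is drawn with a fixed probability `p_random` (set to 1/4 here, a value chosen for the sketch and not taken from the text). All function and parameter names are hypothetical.

```python
def phred_to_prob(q):
    """Convert a Phred-scaled quality into an error probability."""
    return 10.0 ** (-q / 10.0)

def p_no_seq_error(b, v, p_dam):
    """P[b | v, no sequencing error, correctly mapped].

    p_dam maps (original, observed) pairs to damage probabilities for
    original != observed; the remaining mass stays on the original base.
    """
    if b != v:
        return p_dam.get((v, b), 0.0)
    damaged = sum(p for (src, dst), p in p_dam.items() if src == v and dst != v)
    return 1.0 - damaged

def p_observed_base(b, v, base_qual, map_qual, p_dam, p_random=0.25):
    """P[b | v], marginalising over sequencing error and mismapping."""
    kappa = phred_to_prob(base_qual)           # probability of a sequencing error
    p_mapped = 1.0 - phred_to_prob(map_qual)   # probability of correct mapping
    # Correctly mapped: either no sequencing error, or an error that
    # makes the observed base independent of v (assumed p_random).
    p_given_mapped = (1.0 - kappa) * p_no_seq_error(b, v, p_dam) + kappa * p_random
    # Mismapped fragments are also assumed to be independent of v.
    return p_mapped * p_given_mapped + (1.0 - p_mapped) * p_random

if __name__ == "__main__":
    damage = {("C", "T"): 0.3, ("G", "A"): 0.1}  # made-up rates for illustration
    for b in "ACGT":
        print(b, p_observed_base(b, "C", base_qual=30, map_qual=37, p_dam=damage))
```

Under these assumptions the four probabilities for a given original base sum to 1, and a base or mapping quality of 0 (a 100% error probability) collapses the result to `p_random`.
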

## Footnotes

*   Email address: fernandoracimo{at}gmail.com (Fernando Racimo)

*   Received July 8, 2015.
*   Accepted January 19, 2016.


*   © 2016, Posted by Cold Spring Harbor Laboratory

This pre-print is available under a Creative Commons License (Attribution-NonCommercial-NoDerivs 4.0 International), CC BY-NC-ND 4.0, as described at [http://creativecommons.org/licenses/by-nc-nd/4.0/](http://creativecommons.org/licenses/by-nc-nd/4.0/)

## 8. References

1.  R. E. Green, J. Krause, A. W. Briggs, T. Maricic, U. Stenzel, M. Kircher, N. Patterson, H. Li, W. Zhai, M. H.-Y. Fritz, et al., A draft sequence of the Neandertal genome, Science 328 (2010) 710–722.
2.  D. Reich, R. E. Green, M. Kircher, J. Krause, N. Patterson, E. Y. Durand, B. Viola, A. W. Briggs, U. Stenzel, P. L. Johnson, et al., Genetic history of an archaic hominin group from Denisova Cave in Siberia, Nature 468 (2010) 1053–1060. doi:10.1038/nature09710
3.  M. Meyer, M. Kircher, M.-T. Gansauge, H. Li, F. Racimo, S. Mallick, J. G. Schraiber, F. Jay, K. Prüfer, C. de Filippo, et al., A high-coverage genome sequence from an archaic Denisovan individual, Science 338 (2012) 222–226.
4.  K. Prüfer, F. Racimo, N. Patterson, F. Jay, S. Sankararaman, S. Sawyer, A. Heinze, G. Renaud, P. H. Sudmant, C. de Filippo, et al., The complete genome sequence of a Neanderthal from the Altai Mountains, Nature 505 (2014) 43–49. doi:10.1038/nature12886
5.  Q. Fu, H. Li, P. Moorjani, F. Jay, S. M. Slepchenko, A. A. Bondarev, P. L. Johnson, A. Aximu-Petri, K. Prüfer, C. de Filippo, et al., Genome sequence of a 45,000-year-old modern human from western Siberia, Nature 514 (2014) 445–449. doi:10.1038/nature13810
6.  A. Seguin-Orlando, T. S. Korneliussen, M. Sikora, A.-S. Malaspinas, A. Manica, I. Moltke, A. Albrechtsen, A. Ko, A. Margaryan, V. Moiseyev, et al., Genomic structure in Europeans dating back at least 36,200 years, Science 346 (2014) 1113–1118.
7.  R. E. Green, A.-S. Malaspinas, J. Krause, A. W. Briggs, P. L. Johnson, C. Uhler, M. Meyer, J. M. Good, T. Maricic, U. Stenzel, et al., A complete Neandertal mitochondrial genome sequence determined by high-throughput sequencing, Cell 134 (2008) 416–426. doi:10.1016/j.cell.2008.06.021
8.  R. E. Green, A. W. Briggs, J. Krause, K. Prüfer, H. A. Burbano, M. Siebauer, M. Lachmann, S. Pääbo, The Neandertal genome and ancient DNA authenticity, The EMBO Journal 28 (2009) 2494–2502.
9.  S. Sawyer, G. Renaud, B. Viola, J.-J. Hublin, M.-T. Gansauge, M. V. Shunkov, A. P. Derevianko, K. Prüfer, J. Kelso, S. Pääbo, Nuclear and mitochondrial DNA sequences from two Denisovan individuals, Proceedings of the National Academy of Sciences 112 (2015) 15696–15700.
10. P. Skoglund, J. Storå, A. Götherström, M. Jakobsson, Accurate sex identification of ancient human remains using DNA shotgun sequencing, Journal of Archaeological Science 40 (2013) 4477–4482. doi:10.1016/j.jas.2013.07.004
11. M. Rasmussen, X. Guo, Y. Wang, K. E. Lohmueller, S. Rasmussen, A. Albrechtsen, L. Skotte, S. Lindgreen, M. Metspalu, T. Jombart, et al., An Aboriginal Australian genome reveals separate human dispersals into Asia, Science 334 (2011) 94–98.
12. T. S. Korneliussen, A. Albrechtsen, R. Nielsen, ANGSD: analysis of next generation sequencing data, BMC Bioinformatics 15 (2014) 356. doi:10.1186/s12859-014-0356-4
13. R. H. Byrd, P. Lu, J. Nocedal, C. Zhu, A limited memory algorithm for bound constrained optimization, SIAM Journal on Scientific Computing 16 (1995) 1190–1208. doi:10.1137/0916069
14. P. Skoglund, B. H. Northoff, M. V. Shunkov, A. P. Derevianko, S. Pääbo, J. Krause, M. Jakobsson, Separating endogenous ancient DNA from modern day contamination in a Siberian Neandertal, Proceedings of the National Academy of Sciences 111 (2014) 2229–2234.
15. G. Renaud, V. Slon, A. T. Duggan, J. Kelso, Schmutzi: estimation of contamination and endogenous mitochondrial consensus calling for ancient DNA, Genome Biology 16 (2015) 1–18.
16. The 1000 Genomes Project Consortium, A global reference for human genetic variation, Nature 526 (2015) 68–74. doi:10.1038/nature15393
17. W. J. Ewens, Mathematical Population Genetics 1: I. Theoretical Introduction, volume 27, Springer Science & Business Media, 2004.
18. H. Chen, R. E. Green, S. Pääbo, M. Slatkin, The joint allele-frequency spectrum in closely related species, Genetics 177 (2007) 387–398.
19. G. Ewing, J. Hermisson, MSMS: a coalescent simulation program including recombination, demographic structure and selection at a single locus, Bioinformatics 26 (2010) 2064–2065. doi:10.1093/bioinformatics/btq322
20. R. N. Gutenkunst, R. D. Hernandez, S. H. Williamson, C. D. Bustamante, Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data, PLoS Genetics 5 (2009) e1000695. doi:10.1371/journal.pgen.1000695
21. G. O. Roberts, A. Gelman, W. R. Gilks, Weak convergence and optimal scaling of random walk Metropolis algorithms, The Annals of Applied Probability 7 (1997) 110–120. doi:10.1214/aoap/1034625254
22. M. Hofreiter, V. Jaenicke, D. Serre, A. von Haeseler, S. Pääbo, DNA sequences from multiple amplifications reveal artifacts induced by cytosine deamination in ancient DNA, Nucleic Acids Research 29 (2001) 4793–4799. doi:10.1093/nar/29.23.4793
23. A. W. Briggs, U. Stenzel, M. Meyer, J. Krause, M. Kircher, S. Pääbo, Removal of deaminated cytosines and detection of in vivo methylation in ancient DNA, Nucleic Acids Research 38 (2010) e87. doi:10.1093/nar/gkp1163
24. R. D. Hernandez, S. H. Williamson, C. D. Bustamante, Context dependence, ancestral misidentification, and spurious signatures of natural selection, Molecular Biology and Evolution 24 (2007) 1792–1800. doi:10.1093/molbev/msm108
25. R. R. Hudson, Generating samples under a Wright–Fisher neutral model of genetic variation, Bioinformatics 18 (2002) 337–338. doi:10.1093/bioinformatics/18.2.337
26. H. Li, R. Durbin, Fast and accurate short read alignment with Burrows-Wheeler transform, Bioinformatics 25 (2009) 1754–1760. doi:10.1093/bioinformatics/btp324
27. A. Rambaut, N. C. Grassly, Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees, Computer Applications in the Biosciences: CABIOS 13 (1997) 235–238.
28. M. Hasegawa, H. Kishino, T.-a. Yano, Dating of the human-ape splitting by a molecular clock of mitochondrial DNA, Journal of Molecular Evolution 22 (1985) 160–174. doi:10.1007/BF02101694
29. I. Lazaridis, N. Patterson, A. Mittnik, G. Renaud, S. Mallick, K. Kirsanow, P. H. Sudmant, J. G. Schraiber, S. Castellano, et al., Ancient human genomes suggest three ancestral populations for present-day Europeans, Nature 513 (2014) 409–413. doi:10.1038/nature13673
30. G. Renaud, M. Kircher, U. Stenzel, J. Kelso, freeIbis: an efficient basecaller with calibrated quality scores for Illumina sequencers, Bioinformatics 29 (2013) 1208–1209. doi:10.1093/bioinformatics/btt117
31. K. R. Rosenbloom, J. Armstrong, G. P. Barber, J. Casper, H. Clawson, M. Diekhans, T. R. Dreszer, P. A. Fujita, L. Guruvadoo, M. Haeussler, et al., The UCSC Genome Browser database: 2015 update, Nucleic Acids Research 43 (2015) D670–D681. doi:10.1093/nar/gku1177
32. M. Lipson, P.-R. Loh, A. Levin, D. Reich, N. Patterson, B. Berger, Efficient moment-based inference of admixture parameters and sources of gene flow, Molecular Biology and Evolution 30 (2013) 1788–1802. doi:10.1093/molbev/mst099
33. T. Higham, K. Douka, R. Wood, C. B. Ramsey, F. Brock, L. Basell, M. Camps, A. Arrizabalaga, J. Baena, C. Barroso-Ruíz, et al., The timing and spatiotemporal patterning of Neanderthal disappearance, Nature 512 (2014) 306–309. doi:10.1038/nature13621
34. H. Li, R. Durbin, Inference of human population history from individual whole-genome sequences, Nature 475 (2011) 493–496. doi:10.1038/nature10231
35. S. Castellano, G. Parra, F. A. Sánchez-Quinto, F. Racimo, M. Kuhlwilm, M. Kircher, S. Sawyer, Q. Fu, A. Heinze, B. Nickel, et al., Patterns of coding variation in the complete exomes of three Neandertals, Proceedings of the National Academy of Sciences 111 (2014) 6666–6671.
36. H. Chen, The joint allele frequency spectrum of multiple populations: a coalescent theory approach, Theoretical Population Biology 81 (2012) 179–195. doi:10.1016/j.tpb.2011.11.004
37. E. M. Jewett, N. A. Rosenberg, Theory and applications of a deterministic approximation to the coalescent model, Theoretical Population Biology 93 (2014) 14–29. doi:10.1016/j.tpb.2013.12.007
38. J. A. Kamm, J. Terhorst, Y. S. Song, Efficient computation of the joint sample frequency spectra for multiple populations, arXiv preprint arXiv:1503.01133 (2015).
39. W. Haak, I. Lazaridis, N. Patterson, N. Rohland, S. Mallick, B. Llamas, G. Brandt, S. Nordenfelt, E. Harney, K. Stewardson, et al., Massive migration from the steppe was a source for Indo-European languages in Europe, Nature (2015).
40. M. Rasmussen, M. Sikora, A. Albrechtsen, T. S. Korneliussen, J. V. Moreno-Mayar, G. D. Poznik, C. P. Zollikofer, M. S. P. de León, M. E. Allentoft, I. Moltke, et al., The ancestry and affiliations of Kennewick Man, Nature (2015).
41. M. Raghavan, P. Skoglund, K. E. Graf, M. Metspalu, A. Albrechtsen, I. Moltke, S. Rasmussen, T. W. Stafford Jr., L. Orlando, E. Metspalu, et al., Upper Palaeolithic Siberian genome reveals dual ancestry of Native Americans, Nature 505 (2014) 87–91. doi:10.1038/nature12736
42. M. Kimura, Solution of a process of random genetic drift with a continuous model, Proceedings of the National Academy of Sciences 41 (1955) 144.
43. M. Abramowitz, I. A. Stegun, Handbook of Mathematical Functions, Dover, New York, 1965.
44. J. F. Crow, M. Kimura, An Introduction to Population Genetics Theory, Harper and Row, New York, Evanston, London, 1970.
45. A. Ginolhac, M. Rasmussen, M. T. P. Gilbert, E. Willerslev, L. Orlando, mapDamage: testing for damage patterns in ancient DNA sequences, Bioinformatics 27 (2011) 2153–2155. doi:10.1093/bioinformatics/btr347
46. H. Jónsson, A. Ginolhac, M. Schubert, P. L. Johnson, L. Orlando, mapDamage2.0: fast approximate Bayesian estimates of ancient DNA damage parameters, Bioinformatics 29 (2013) 1682–1684. doi:10.1093/bioinformatics/btt193

 [1]: /embed/graphic-1.gif
 [2]: /embed/graphic-2.gif
 [3]: /embed/graphic-3.gif
 [4]: /embed/graphic-4.gif
 [5]: /embed/graphic-5.gif
 [6]: /embed/graphic-6.gif
 [7]: /embed/graphic-8.gif
 [8]: /embed/graphic-9.gif
 [9]: /embed/graphic-10.gif
 [10]: /embed/graphic-12.gif
 [11]: /embed/inline-graphic-1.gif
 [12]: /embed/inline-graphic-2.gif
 [13]: /embed/inline-graphic-3.gif
 [14]: /embed/graphic-52.gif
 [15]: /embed/graphic-53.gif
 [16]: /embed/graphic-54.gif
 [17]: /embed/inline-graphic-4.gif
 [18]: /embed/inline-graphic-5.gif
 [19]: /embed/graphic-55.gif
 [20]: /embed/inline-graphic-6.gif
 [21]: /embed/inline-graphic-7.gif
 [22]: /embed/graphic-56.gif
 [23]: /embed/graphic-57.gif
 [24]: /embed/graphic-58.gif
 [25]: /embed/graphic-59.gif
 [26]: /embed/graphic-60.gif
 [27]: /embed/graphic-61.gif
 [28]: /embed/graphic-62.gif
 [29]: /embed/graphic-63.gif
 [30]: /embed/inline-graphic-8.gif
 [31]: /embed/graphic-64.gif
 [32]: /embed/graphic-65.gif
 [33]: /embed/graphic-66.gif
 [34]: /embed/graphic-67.gif
 [35]: /embed/graphic-68.gif
 [36]: /embed/inline-graphic-9.gif
 [37]: /embed/graphic-69.gif
 [38]: /embed/graphic-70.gif
 [39]: /embed/graphic-71.gif
 [40]: /embed/graphic-72.gif
 [41]: /embed/inline-graphic-10.gif
 [42]: /embed/graphic-73.gif
 [43]: /embed/inline-graphic-11.gif
 [44]: /embed/graphic-74.gif
 [45]: /embed/inline-graphic-12.gif
 [46]: /embed/inline-graphic-13.gif
 [47]: /embed/graphic-75.gif