ARGON: Fast, Whole-Genome Simulation of the Discrete Time Wright-Fisher Process

Pier Francesco Palamara

doi:10.1101/036376

Abstract

Motivation Simulation under the coalescent model is ubiquitous in the analysis of genetic data. The rapid growth of real data sets from multiple human populations led to increasing interest in simulating very large sample sizes at whole-chromosome scales. When the sample size is large, the coalescent model becomes an increasingly inaccurate approximation of the discrete time Wright-Fisher model (DTWF). Analytical and computational treatment of the DTWF, however, is generally harder.

Results We present a simulator (ARGON) for the DTWF process that scales up to hundreds of thousands of samples and whole-chromosome lengths, with a time/memory performance comparable or superior to currently available methods for coalescent simulation. The simulator supports arbitrary demographic history, migration, Newick tree output, variable mutation/recombination rates and gene conversion, and efficiently outputs pairwise identical-bydescent (IBD) sharing data.

Availability ARGON (version 0.1) is written in Java, open source, and freely available at https://github.com/pierpal/ARGON.

Contact ppalama{at}hsph.harvard.edu

Supplementary information Supplementary data are available online.

1 Introduction

The coalescent ([Kingman, 1982]) can be constructed as an approximation of the discrete time Wright-Fisher process (DTWF, [Fisher et al., 1922, Wright, 1931]), and leads to simplified analytical and computational treatment. Simulators based on the coalescent process (e.g. [Hudson, 2002]) have been extensively adopted in computational methods. The coalescent approximation, however, relies on the assumption that the sample size is small compared to the effective population size, and violations of this assumption may result in substantial distortions of key genealogical properties ([Wakeley and Takahashi, 2003, Bhaskar et al., 2014]).

Until very recently, coalescent simulators did not scale up to long chromosomes and the very large sample sizes of modern day data sets, which now comprise hundreds of thousands of individuals. The recently developed simulators COSI2 ([Shlyakhter et al., 2014]) and SCRM ([Staab et al., 2015]) enable fast simulation under approximate coalescent models. The COSI algorithm uses a standard backwards-in-time approach, while SCRM adopts a “sequential” approach ([Wiuf and Hein, 2000, McVean and Cardin, 2005]). Both can be also used to simulate large sample sizes and chromosome-long regions under the coalescent process with reasonable time and memory requirements.

Here, we present ARGON, an efficient simulator of the DTWF process that scales up to very large chromosomes, and hundreds of thoudsands of samples. The simulator offers substantially improved performance compared to recent DTWF simulators, e.g. GENOME ([Liang et al., 2007]), and is comparable or superior to current coalescent simulators in terms of speed and memory usage.

2 Approach

ARGON proceeds backwards in time one generation at a time, occasionally sampling coalescent and recombination events subject to population structure and migration. Each individual is represented as a list of regions that are still being tracked at the current time (i.e. for which not all samples have found a common ancestor). Recombination events are sampled in genetic space from an exponential distribution, and rounded to the closest physiscal base pair position based on the desired recombination rate, which may be varying along the chromosome. A recombination event can be a crossover or a non-crossover event, based on user-specified rates. For each individual, a maximum of two parents are sampled, so that multiple recombination events result in alternating ancestry between two individuals from the previous generation. When two or more individuals choose the same parent, coalescence occurs if the individuals contain overlapping regions of genetic material. During coalescence, regions within individuals are annotated with links to descendant ancestral recombination graph (ARG [Griffiths, 1981]) nodes.

Compared to other DTWF implementations, ARGON offers substantially improved speed and memory usage. In the GENOME simulator, for instance, individuals are represented as arrays of blocks of genetic material of a fixed size, and individuals from all populations are explicitly represented in memory using arrays. In ARGON, large regions are represented as intervals with arbitrary boundary values, and hash map data structures are extensively used to take advantage of sparsity, avoiding explicit representation of all individuals. As in the GENOME simulator, ARGON can run in approximate mode, so that recombinations breakpoints are rounded to blocks of a user-specified genetic length. This reduces the granularity of the recombination process, improving speed and memory usage, at the cost of slightly inflated correlation of markers at a short genomic distance. We tested this approximation using nonrecombinant blocks of size 10μM (1μM = 0.000001 Morgans). Additional tests using 50μM blocks are described in the Supplementary Note.

ARGON can efficiently output a list of pairwise IBD segments longer than a user-specified centimorgan length. Previous approaches to output simulated IBD sharing data (e.g. [Palamara et al., 2012, Palamara and Pe’er, 2013, Browning and Browning, 2013]) required comparing recent ancestry for all pairs of individual at each marginal tree in the sampled ARG, with computational cost quadratic in the sample size and linear in the number of analyzed marginal trees. IBD segments are delimited by the occurrence of recombination events that change the most recent common ancestor for pairs of samples. In ARGON, these events are detected by visiting internal ARG nodes, which requires less computational effort than the approach based on marginal trees.

3 Results

3.1 Accuracy for small sample sizes

When n ≪ N_e, the coalescent is a good approximation of the DTWF. We performed extensive testing for several scenarios including population size variation, migration across multiple demes, and gene conversion. We report detailed results in the Supplementary Note. We find good agreement between ARGON and MS. We also tested the accuracy of COSI version 2.0, SCRM version 1.6.1, and MSprime version 0.1.6 ([Kelleher et al., 2015]), a new efficient simulator for which a preliminary version was released at the time of writing. All simulators were found to be well calibrated against MS.

View this table:

Table 1:

Comparison of theoretical predictions and simulation results for the number of singleton and doubleton alleles when n = N_e. We performed 100, 000 independent simulations, for a region of 1 Mb, μ = 2 × 10⁻⁸, and 1 cM/Mb. Reported errors are obtained as 100 × (θ_s -θ_t)/θ_s, where θ_s is the average simulation result and θ_t is the theoretical expectation. The ± sign introduces a standard error. Statistically significant deviations from 0 are highlighted. ARGON₁₀ results were obtained running ARGON with non-recombinant blocks of 10μM.

3.2 Deviation of the coalescent from the DTWF

The coalescent becomes a poor approximation of the DTWF process when the sample size is not substantially smaller than the effective population size ([Wakeley and Takahashi, 2003, Bhaskar et al., 2014]). We verified that ARGON matches the theoretical prediction for the number of singletons and doubletons described in [Bhaskar et al., 2014] for the DTWF (see Table 1). We simulated populations of effective size N_e = 1,000 and N_e = 20,000 haploid individuals, and sampled all present-day individuals. While ARGON matches the prediction of [Bhaskar et al., 2014] in both exact and approximate mode, MS simulations substantially deviate from the DTWF model. Simulation using other coalescent approaches (e.g. COSI2, SCRM, MSprime), will similarly deviate in these scenarios.

3.3 Scalability to large sample size and whole-chromosome length

We tested the run time and memory usage of ARGON and two recently developed programs that enable simulating very large sample sizes and long chromosomes: COSI2 and SCRM (see Table 2). We find that for large parameter values SCRM generally performs worse than ARGON. COSI2 is generally faster, but scales poorly for memory usage as the size of the region and the sample size grows. Additional tests, including a pre-release version of MSprime, are detailed in the Supplementary Note.

We further tested the performance of approximate algorithms for the same set of simulation parameters (see Table 3). We compared ARGON with a minimum recombination block size of 10μM (AR₁₀), SCRM with the “-l” flag set to 0 (SC₀), and COSI2 with the “-u” flag set to 0 (CS₀). We find that ARGON’s speed and memory usage is substantially improved, at the cost of slightly inflated correlation for neighboring markers (see Supplementary Note). For a constant population of size N_e = 20,000, for instance, squared correlation (r²) of markers 0 to 50 Kb apart (using 1 cM/Mb) was increased by ~27% for AR₁₀, but remained unchanged for markers at a larger distance. For CS₀ and SC₀, r² was decreased by ~5% for markers at a larger distance. AR₁₀ is faster than SC₀, and approximately as fast as CS₀, and uses less memory than both simulators for the large values of the test parameters. Comparison to other simulators and additional tests are detailed in the Supplementary Note.

View this table:

Table 2:

Comparison of simulation algorithms (AR=ARGON, SC=SCRM, CS=COSI2). Simulation parameters were the chromosome length (Mb), exponential expansion rate (measured as ρ = log(A/C)/G where A = ancestral size, C = current size, G = 500 = generation of expansion start), number of haploid samples (in thousands), and haploid ancestral population size (A). Recombination and mutation rates were set to 1 cM/Mb and 2 × 10⁻⁸ mutations per bp per generation, respectively. We compare simulation time (in hours) and memory usage (in Gb). All tests were run on a 2.27GHz Intel Xeon L5640, using a single core and up to 60 Gb of memory. † represents runs terminated due to insufficient memory (> 60Gb) or a memory error. Additional results are shown in the Supplementary Note.

View this table:

Table 3:

Comparison of approximate simulation algorithms (AR₁₀=ARGON with minimum recombination block of size 10μM, SC₀=SCRM with “-l” set to 0, CS₀=COSI2 with “-u” set to 0). See Table 2 for a description of the simulation setup. † represents runs terminated due to insufficient memory (> 60Gb) or a memory error. ⋆ indicates that the program took longer than 100 hours to complete. Additional results are shown in the Supplementary Note.

4 Discussion

We described ARGON, a fast and scalable simulator for the discrete time Wright-Fisher process. Accurate simulation of large genealogical and sequence data will facilitate analysis of modern day genomic data sets, which have now reached hundreds of thousands of samples ([Sudlow et al., 2015]). ARGON has similar or superior time and memory performance compared to current coalescent simulators when large samples and long chromosomes are simulated, and enables accurately simulating the DTWF process when the sample size approaches the effective population size, where the coalescent approximation makes current scalable simulators become imprecise. Version 0.1 of ARGON supports arbitrary demographic history with migration, variable mutation/recombination rate and gene conversion, and outpus Newick trees and pairwise IBD segments for large sample sizes in both exact and approximate modes.

Funding

This research was funded by NIH grant R01 MH101244.

Supplementary Note

1 Approach

Sampling the ARG: ARGON proceeds backwards in time one generation at a time, occasionally sampling coalescent and recombination events subject to population structure and migration. Each individual is represented as a list of regions that are still being tracked at the current time (i.e. for which not all samples have found a common ancestor). Recombination events are sampled in genetic space from an exponential distribution, and rounded to the closest physiscal base pair position based on the desired recombination rate, which may be varying along the chromosome. A recombination event can be a crossover or a non-crossover event, based on user-specified rates. For each individual, a maximum of two parents are sampled, so that multiple recombination events result in alternating ancestry between two individuals from the previous generation. When two or more individuals choose the same parent, coalescence occurs if the individuals contain overlapping regions of genetic material. During coalescence, regions within individuals are merged and annotated with links to descendant ancestral recombination graph (ARG [Griffiths, 1981]) nodes.

Sampling sequences from the ARG: Once the ARG has been computed, mutation events are sampled on the graph edges according to a Pois-son process. For each ARG edge affected by mutation, it is neccessary to calculate and store a list of samples that will be used to output genotypic information. Naïve calculation of these lists is intensive, and the storage of descendant lists for all nodes in the ARG results in prohibitive memory requirements. To overcome these difficulties, ARGON efficiently computes descendant lists using dynamic programming, and relies on a combination of top-down and bottom-up traversals of the ARG in order to always store the smallest possible set of descendant lists in memory while sampling mutations. During these operations, sparsity is exploited to further reduce memory consumption. Descendant lists are stored as either bit-set data structures, or lists of integers representing sample IDs, depending on their size. During the ARG traversals, whenever the size of a descendant list grows to exceed the expected memory consumption of the corresponding bit-set representation, the list is converted to the bit-set format. Similarly, ARGON offers an option (“-shrink” flag) to output the genotypes of the simulated individuals using a representation where the carriers of a derived allele are reported as a list of sample IDs, rather than the classical 0/1 representation of ancestral/derived alleles. Because a large number of mutations will have a set number of carriers, this representation saves computational speed and output disk space, and should be used to increase performance. Output formats also include the VCF format (default) and the hap/samples format.

These and other algorithmic details enable ARGON to be substantially more efficent than other DTWF implementations. In the GENOME simulator, for instance, individuals are represented as arrays of blocks of genetic material of a fixed size, and individuals from all populations are explicitly represented in memory using arrays. In ARGON, large regions are represented as intervals with arbitrary boundary values, and hash map data structures are extensively used to take advantage of sparsity, avoiding explicit representation of all individuals. As in the GENOME simulator, ARGON can run in approximate mode, so that recombinations breakpoints are rounded to blocks of a user-specified genetic length. This reduces the granularity of the recombination process, improving speed and memory usage, at the cost of slightly inflated correlation of markers at a short genomic distance. We tested this approximation using non-recombinant blocks of size 10μM and 50μM (1μM = 0.000001 Morgans).

2 Time and memory performance

We report detailed time and memory performance evaluation results for several simulators. Table 1 shows results for ARGON version 0.1.160101, SCRM version 1.6.1 [Staab et al., 2015], COSI version 2.0 [Shlyakhter et al., 2014], and MSprime version 0.1.6 [Kelleher et al., 2015]. ARGON was run using the “-shrink” flag and writing genotypes in the “seq” format, in which rare alleles are represented as lists of sample IDs. MSprime was run using the “-max-memory=60G” flag to enable generating large genealogies. Table 2 reports performance evaluation for several algorithms run in approximate mode: ARGON version 0.1.160101 with 10 and 50 microMorgans, FAST-SIMCOAL version 2.5.2.21 (implementing the SMC’ approximation), SCRM version 1.6.1 with the “-l” flag set to 0, and COSI version 2.0 with the “-u” flag set to 0. Note that the approximation flags for COSI2 and SCRM were set to the smallest allowed value, and that for both simulators slightly larger values may be used to find the appropriate trade-off between accuracy and speed. Table 3 reports a comparison of ARGON 0.1.160101 and GENOME version 0.2 [Liang et al., 2007], both run with a non-recombinant block size of 10 and 50 microMorgans. GENOME could not be run in exact mode due to excessive time and memoty requirements.

View this table:

Table 1:

Comparison of simulation algorithms (AR=ARGON, SC=SCRM, CS=COSI2, MS’=MSprime). Simulation parameters were the chromosome length (Mb), exponential expansion rate (measured as ρ = log(A/C)/G where A = ancestral size, C = current size, G = 500 = generation of expansion start), number of haploid samples (in thousands), and haploid ancestral population size (A). Recombination and mutation rates were set to 1 cM/Mb and 2 × 10⁻⁸ mutations per bp per generation, respectively. We compare simulation time (in hours) and memory usage (in Gb). All tests were run on a 2.27GHz Intel Xeon L5640, using a single core and up to 60 Gb of memory. † represents runs terminated due to insufficient memory (> 60Gb) or a memory error.

View this table:

Table 2:

Comparison of approximate algorithms (AR₁₀=ARGON with 10μM blocks, AR₅₀=ARGON with 50μM blocks, FS=FASTSIMCOAL, SC₀=SCRM with “-l” flag set to 0, CS₀=COSI2 with “-u” flag set to 0), time and memory performance. Simulation parameters were the chromosome length (Mb), exponential expansion rate (measured as ρ = log(A/C)/G where A = ancestral size, C = current size, G = 500 = generation of expansion start), number of haploid samples (in thousands), and haploid ancestral population size (A). Recombination and mutation rates were set to 1 cM/Mb and 2 × 10⁻⁸ mutations per bp per generation, respectively. We compare simulation time (in hours) and memory usage (in Gb). All tests were run on a 2.27GHz Intel Xeon L5640, using a single core and up to 60 Gb of memory. † represents runs terminated due to insufficient memory (> 60Gb) or a memory error. (⋆) indicates that the program took longer than 100 hours to complete. * indicates that the result has been linearly extrapolated from previous runs with smaller parameter values.

View this table:

Table 3:

Comparison of approximate algorithms (AR₁₀=ARGON with 10μM blocks, AR₅₀=ARGON with 50μM blocks, GEN₁₀=GENOME with 10μM blocks, GEN₅₀=GENOME with 50μM blocks), time and memory performance. Simulation parameters were the chromosome length (Mb), exponential expansion rate (measured as ρ = log(A/C)/G where A = ancestral size, C = current size, G = 500 = generation of expansion start), number of haploid samples (in thousands), and haploid ancestral population size (A). Recombination and mutation rates were set to 1 cM/Mb and 2 × 10⁻⁸, respectively. We compare simulation time (in hours) and memory usage (in Gb). All tests were run on a 2.27GHz Intel Xeon L5640, using a single core and up to 60 Gb of memory. † represents runs terminated due to insufficient memory (> 60Gb) or a memory error.

3 Simulator accuracy

We compared ARGON, MS, COSI2, SCRM and MSprime to test accuracy for small sample sizes, when the coalescent process is a good approximation of the discrete time Wright-Fisher process. We tested a number of scenarios, for which we provide a description and corresponding MS command below. We did not test all scenarios for MSprime, because version 0.1.6 does not support some of the required features.

Constant size: A population of constant size N = 20,000 haploid individuals; 50 haploid samples of 5 Mb, 10⁻⁸ recombinations per bp per generation, 2 × 10⁻⁸ mutations per bp per generation.
Exponential expansion: A population of ancestral size N = 20, 000 haploid individuals, expanding from generation 500 to a present day size of 200,000 haploid individuals; 50 haploid samples of 5 Mb, 10⁻⁸ recombinations per bp per generation, 2 × 10⁻⁸ mutations per bp per generation.
Bottleneck: A population of ancestral size N = 20, 000 haploid individuals, shrinking from generation 500 to a present day size of 2, 000 haploid individuals; 50 haploid samples of 5 Mb, 10⁻⁸ recombinations per bp per generation, 2 × 10⁻⁸ mutations per bp per generation.
Island model: 4 demes of constant size 5,000 haploid individuals, symmetric pairwise migration rate of 0.001 individuals per generation; 20 samples of 5 Mb from each deme, 10⁻⁸ recombinations per bp per generation, 2 × 10⁻⁸ mutations per bp per generation.
Island model, high migration: 4 demes of constant size 5000 haploid individuals, symmetric pairwise migration rate of 0.01 individuals per generation; 20 samples of 5 Mb from each deme, 10⁻⁸ recombinations per bp per generation, 2 × 10⁻⁸ mutations per bp per generation.
Piecewise constant: A population of varying constant size. Size 100, 000 haploid individuals from generation 0 to 100; 5, 000 individuals from generation 100 to 200; 50, 000 haploid individuals from generation 200 to 300; and ancestral size 10, 000 haploid individuals from generation 300 on; 50 haploid samples of 5 Mb, 10⁻⁸ recombinations per bp per generation, 2 × 10⁻⁸ mutations per bp per generation.

The flag “-c 1 300” was added when simulating gene conversion with same rate as recombination, and a mean tract length of 300 bp. The flag “-p 10” was added in all cases to increse the number of reported significant digits, and “-seed” was used to generate independent random draws. For the COSI2 simulator the infinite sites model was enabled adding the “infinite_sites yes” command in the configuration file. We tested the following summary statistics:

r₅₀ correlation of markers at distance between 0 and 50 Kb.
r₀₀ correlation of markers at distance between 50 and 100 Kb.
r₁₅₀ correlation of markers at distance between 100 and 150 Kb.
squared correlation of markers at distance between 0 and 50 Kb.
squared correlation of markers at distance between 50 and 100 Kb.
squared correlation of markers at distance between 100 and 150 Kb.
θ heterozygosity computed as for n samples and l bases in the region, each with sample frequency p_i.
S number of polymorphic sites in the sample.
S_<.1 number of polymorphic sites with minor allele frequency below 10% in the sample.
S_∈[.1,.4] number of polymorphic sites with minor allele frequency between 10% and 40% in the sample.
S_>.4 number of polymorphic sites with minor allele frequency above 40% in the sample.

r and r² statistics were obtained using Plink 1.9 using the flags “-r” and “-r2”, for variants with MAF greater than 10%. The tables below report the value of these statistics for the considered pairs of simulators after 5, 000 independent draws. The column Δ% reports differences of each feature as 100 × (mean₁ — mean₂)/mean₁. Z-scores corresponding to p < 0.001 are highlighted with an asterisk. Results are reported in tables 4 through 52.

We tested the accuracy of ARGON when a recombination map is used in input, and identical-by-descent (IBD) segments are requested in output. We simulated 400 haploid samples from populations of constant effective size ranging from 5, 000 to 30, 000 haploid individuals. We performed 10 independent simulations of a realistic ~ 249 Mb Human Chromosome 1, using the recombination map downloaded from https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.html. To test the accuracy of the IBD provided in output, we estimated the size of the simulated populations using output segments of length greater than 2 cM. To estimate the effective population size, we used the maximum likelihood estimator derived in [Palamara, 2014]: where and Ŝ_R represents the total number of segments longer than u

Morgans observed for all pairs. γ represents the total length (in Morgans) of the analyzed genomes. We observed good agreement between inferred and simulated population sizes (Figure 1), indicating that the distribution of IBD segments output by ARGON matches theoretical expectations. We plotted the density of recombination events, which we estimated using all edges of the IBD segments output across all simulations. As shown in figures 2 and 3, the density of observed recombination events follows the recombination map specified in input.

Figure 1:

Comparison of simulated effective population size vs. the population size estimated using Eq. 1 and IBD segments of length at least 2 cM output by ARGON.

Figure 2:

Genetic map.

Figure 3:

Density of IBD segment edges.

3.1 ARGON and MS

View this table:

Table 4:

s₁ = ARGON, s₂ = MS; model: constant size

View this table:

Table 5:

s₁ = ARGON, s₂ = MS; model: exponential expansion

View this table:

Table 6:

s₁ = ARGON, s₂ = MS; model: bottleneck

View this table:

Table 7:

s₁ = ARGON, s₂ = MS; model: island model

View this table:

Table 8:

s₁ = ARGON, s₂ = MS; island model, high migration

View this table:

Table 9:

s₁ = ARGON, s₂ = MS; model: piecewise model

3.2 ARGON and COSI2

View this table:

Table 10:

s₁ = ARGON, s₂ = COSI2; model: constant size

View this table:

Table 11:

s₁ = ARGON, s₂ = COSI2; model: exponential expansion

View this table:

Table 12:

s₁ = ARGON, s₂ = COSI2; model: bottleneck

View this table:

Table 13:

s₁ = ARGON, s₂ = COSI2; model: island model

View this table:

Table 14:

s₁ = ARGON, s₂ = COSI2; model: island model, high migration

View this table:

Table 15:

s₁ = ARGON, s₂ = COSI2; model: piecewise model

3.3 MS and COSI2

View this table:

Table 16:

s₁ = MS, s₂ = COSI2; model: constant size

View this table:

Table 17:

s₁ = MS, s₂ = COSI2; model: exponential expansion

View this table:

Table 18:

s₁ = MS, s₂ = COSI2; model: bottleneck

View this table:

Table 19:

s₁ = MS, s₂ = COSI2; model: island model

View this table:

Table 20:

s₁ = MS, s₂ = COSI2; model: island model, high migration

View this table:

Table 21:

s₁ = MS, s₂ = COSI2; model: piecewise model

3.4 MSprime and COSI2

View this table:

Table 22:

s₁ = MSprime, s₂ = COSI2; model: constant size

View this table:

Table 23:

s₁ = MSprime, s₂ = COSI2; model: exponential expansion

View this table:

Table 24:

s₁ = MSprime, s₂ = COSI2; model: bottleneck

View this table:

Table 25:

s₁ = MSprime, s₂ = COSI2; model: piecewise model

3.5 MSprime and ARGON

View this table:

Table 26:

s₁ = MSprime, s₂ = ARGON; model: constant size

View this table:

Table 27:

s₁ = MSprime, s₂ = ARGON; model: exponential expansion

View this table:

Table 28:

s₁ = MSprime, s₂ = ARGON; model: bottleneck

View this table:

Table 29:

s₁ = MSprime, s₂ = ARGON; model: piecewise model

3.6 MSprime and MS

View this table:

Table 30:

s₁ = MSprime, s₂ = MS; model: constant size

View this table:

Table 31:

s₁ = MSprime, s₂ = MS; model: exponential expansion

View this table:

Table 32:

s₁ = MSprime, s₂ = MS; model: bottleneck

View this table:

Table 33:

s₁ = MSprime, s₂ = MS; model: piecewise model

3.7 SCRM and MS

View this table:

Table 34:

s₁ = SCRM, s₂ = MS; model: constant size

View this table:

Table 35:

s₁ = SCRM, s₂ = MS; model: exponential expansion

View this table:

Table 36:

s₁ = SCRM, s₂ = MS; model: island model, high migration

View this table:

Table 37:

s₁ = SCRM, s₂ = MS; model: island model

View this table:

Table 38:

s₁ = SCRM, s₂ = MS; model: piecewise model

3.8 ARGON with approximation and MS

View this table:

Table 39:

s₁ = ARGON with 10μM approximation, s₂ = MS; model: constant size

View this table:

Table 40:

s₁ = ARGON with 10μM approximation, s₂ = MS; model: exponential expansion

View this table:

Table 41:

s₁ = ARGON with 50μM approximation, s₂ = MS; model: constant size

View this table:

Table 42:

s₁ = ARGON with 50μM approximation, s₂ = MS; model: exponential expansion

3.9 SCRM with approximation and MS

View this table:

Table 43:

s₁ = SCRM with “-l” set to 0, s₂ = MS; model: constant size

View this table:

Table 44:

s₁ = SCRM with “-l” set to 0, s₂ = MS; model: exponential expansion

3.10 COSI2 with approximation and MS

View this table:

Table 45:

s₁ = COSI2 with “-u” set to 0, s₂ = MS; model: constant size

View this table:

Table 46:

s₁ = COSI2 with “-u” set to 0, s₂ = MS; model: exponential expansion

3.11 ARGON and MS, with gene conversion

View this table:

Table 47:

s₁ = ARGON (with gene conversion), s₂ = MS (with gene conversion); model: constant size

View this table:

Table 48:

s₁ = ARGON (with gene conversion), s₂ = MS (with gene conversion); model: exponential expansion

3.12 ARGON and COSI2, with gene conversion

View this table:

Table 49:

s₁ = ARGON (with gene conversion), s₂ = COSI2 (with gene conversion); model: constant size

View this table:

Table 50:

s₁ = ARGON (with gene conversion), s₂ = COSI2 (with gene conversion); model: exponential expansion

3.13 MS and COSI2, with gene conversion

View this table:

Table 51:

s₁ = MS (with gene conversion), s₂ = COSI2 (with gene conversion); model: constant size

View this table:

Table 52:

s₁ = MS (with gene conversion), s₂ = COSI2 (with gene conversion); model: exponential expansion

Acknowledgements

We thank Alkes L. Price and John Wakeley for useful discussions and comments on an early draft.

References

[Bhaskar et al., 2014].↵
Bhaskar, A., Clark, A. G., and Song, Y. S. (2014). Distortion of genealogical properties when the sample is very large. Proceedings of the National Academy of Sciences, 111(6):2385–2390.
[Browning and Browning, 2013].↵
Browning, B. L. and Browning, S. R. (2013). Detecting identity by descent and estimating genotype error rates in sequence data. The American Journal of Human Genetics, 93(5):840–851.
OpenUrl CrossRef PubMed
[Fisher et al., 1922].↵
Fisher, R. A. et al. (1922). On the dominance ratio. Proceedings of the royal society of Edinburgh, 42:321–341.
[Griffiths, 1981].↵
Griffiths, R. C. (1981). Neutral two-locus multiple allele models with recombination. Theoretical Population Biology, 19(2):169–186.
OpenUrl CrossRef Web of Science
[Hudson, 2002].↵
Hudson, R. R. (2002). Generating samples under a wright-fisher neutral model of genetic variation. Bioinformatics, 18(2):337–338.
OpenUrl CrossRef PubMed Web of Science
[Kelleher et al., 2015].↵
Kelleher, J., Etheridge, A. M., and McVean, G. (2015). Efficient coalescent simulation and genealogical analysis for large sample sizes. bioRxiv, page 033118.
[Kingman, 1982].↵
Kingman, J. F. (1982). The coalescent. Stochastic processes and their applications, 13(3):235–248.
OpenUrl CrossRef
[Liang et al., 2007].↵
Liang, L., Zöllner, S., and Abecasis, G. R. (2007). Genome: a rapid coalescent-based whole genome simulator. Bioinformatics, 23(12):1565–1567.
OpenUrl CrossRef PubMed Web of Science
[McVean and Cardin, 2005].↵
McVean, G. A. and Cardin, N. J. (2005). Approximating the co-alescent with recombination. Philosophical Transactions of the Royal Society B: Biological Sciences, 360(1459):1387–1393.
OpenUrl CrossRef PubMed
[Palamara et al., 2012].↵
Palamara, P. F., Lencz, T., Darvasi, A., and Peer, I. (2012). Length distributions of identity by descent reveal fine-scale demographic history. The American Journal of Human Genetics, 91(5):809–822.
OpenUrl CrossRef PubMed
[Palamara and Pe’er, 2013].↵
Palamara, P. F. and Pe’er, I. (2013). Inference of historical migration rates via haplotype sharing. Bioinformatics, 29(13):i180–i188.
OpenUrl CrossRef PubMed
[Shlyakhter et al., 2014].↵
Shlyakhter, I., Sabeti, P. C., and Schaffner, S. F. (2014). Cosi2: An efficient simulator of exact and approximate coalescent with selection. Bioinformatics, 30(23):3427–3429.
OpenUrl CrossRef PubMed Web of Science
[Staab et al., 2015].↵
Staab, P. R., Zhu, S., Metzler, D., and Lunter, G. (2015). scrm: efficiently simulating long sequences using the approximated coalescent with recombination. Bioinformatics, 31(10):1680–1682.
OpenUrl CrossRef PubMed
[Sudlow et al., 2015].↵
Sudlow, C., Gallacher, J., Allen, N., Beral, V., Burton, P., Danesh, J., Downey, P., Elliott, P., Green, J., Landray, M., et al. (2015). Uk biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS medicine, 12(3):1–10.
OpenUrl
[Wakeley and Takahashi, 2003].↵
Wakeley, J. and Takahashi, T. (2003). Gene genealogies when the sample size exceeds the effective size of the population. Molecular biology and evolution, 20(2):208–213.
OpenUrl CrossRef PubMed Web of Science
[Wiuf and Hein, 2000].↵
Wiuf, C. and Hein, J. (2000). The coalescent with gene conversion. Genetics, 155(1):451–462.
OpenUrl Abstract/FREE Full Text
[Wright, 1931].↵
Wright, S. (1931). Evolution in mendelian populations. Genetics, 16(2):97.
OpenUrl FREE Full Text

References

[Browning and Browning, 2013].↵
Browning, B. L. and Browning, S. R.(2013). Detecting identity by descent and estimating genotype error rates in sequence data. The American Journal of Human Genetics, 93(5):840–851.
OpenUrl CrossRef PubMed
[Griffiths, 1981].↵
Griffiths, R. C. (1981). Neutral two-locus multiple allele models with recombination. Theoretical Population Biology, 19(2):169–186.
OpenUrl CrossRef Web of Science
[Kelleher et al., 2015].↵
Kelleher, J., Etheridge, A. M., and McVean, G. (2015). Efficient coalescent simulation and genealogical analysis for large sample sizes. bioRxiv, page 033118.
[Liang et al., 2007].↵
Liang, L., Zöllner, S., and Abecasis, G. R. (2007). Genome: a rapid coalescent-based whole genome simulator. Bioinformatics, 23(12): 1565–1567.
OpenUrl CrossRef PubMed Web of Science
[Palamara, 2014].↵
Palamara, P. F. (2014). Population genetics of identity by descent. PhD thesis (New York City: Columbia University).
[Palamara et al., 2012].↵
Palamara, P. F., Lencz, T., Darvasi, A., and Peer, I. (2012). Length distributions of identity by descent reveal fine-scale demographic history. The American Journal of Human Genetics, 91(5):809–822.
OpenUrl CrossRef PubMed
[Palamara and Pe’er, 2013].↵
Palamara, P. F. and Pe’er, I. (2013). Inference of historical migration rates via haplotype sharing. Bioinformatics, 29(13):i180–i188.
OpenUrl CrossRef PubMed
[Shlyakhter et al., 2014].↵
Shlyakhter, I., Sabeti, P. C., and Schaffner, S. F. (2014). Cosi2: An efficient simulator of exact and approximate coalescent with selection. Bioinformatics, 30(23):3427–3429.
OpenUrl CrossRef PubMed Web of Science
[Staab et al., 2015].↵
Staab, P. R., Zhu, S., Metzler, D., and Lunter, G. (2015). scrm: efficiently simulating long sequences using the approximated coalescent with recombination. Bioinformatics, 31(10): 1680–1682.
OpenUrl CrossRef PubMed