Abstract
Motivation Simulation under the coalescent model is ubiquitous in the analysis of genetic data. The rapid growth of real data sets from multiple human populations led to increasing interest in simulating very large sample sizes at whole-chromosome scales. When the sample size is large, the coalescent model becomes an increasingly inaccurate approximation of the discrete time Wright-Fisher model (DTWF). Analytical and computational treatment of the DTWF, however, is generally harder.
Results We present a simulator (ARGON) for the DTWF process that scales up to hundreds of thousands of samples and whole-chromosome lengths, with a time/memory performance comparable or superior to currently available methods for coalescent simulation. The simulator supports arbitrary demographic history, migration, Newick tree output, variable mutation/recombination rates and gene conversion, and efficiently outputs pairwise identical-bydescent (IBD) sharing data.
Availability ARGON (version 0.1) is written in Java, open source, and freely available at https://github.com/pierpal/ARGON.
Contact ppalama{at}hsph.harvard.edu
Supplementary information Supplementary data are available online.
1 Introduction
The coalescent ([Kingman, 1982]) can be constructed as an approximation of the discrete time Wright-Fisher process (DTWF, [Fisher et al., 1922, Wright, 1931]), and leads to simplified analytical and computational treatment. Simulators based on the coalescent process (e.g. [Hudson, 2002]) have been extensively adopted in computational methods. The coalescent approximation, however, relies on the assumption that the sample size is small compared to the effective population size, and violations of this assumption may result in substantial distortions of key genealogical properties ([Wakeley and Takahashi, 2003, Bhaskar et al., 2014]).
Until very recently, coalescent simulators did not scale up to long chromosomes and the very large sample sizes of modern day data sets, which now comprise hundreds of thousands of individuals. The recently developed simulators COSI2 ([Shlyakhter et al., 2014]) and SCRM ([Staab et al., 2015]) enable fast simulation under approximate coalescent models. The COSI algorithm uses a standard backwards-in-time approach, while SCRM adopts a “sequential” approach ([Wiuf and Hein, 2000, McVean and Cardin, 2005]). Both can be also used to simulate large sample sizes and chromosome-long regions under the coalescent process with reasonable time and memory requirements.
Here, we present ARGON, an efficient simulator of the DTWF process that scales up to very large chromosomes, and hundreds of thoudsands of samples. The simulator offers substantially improved performance compared to recent DTWF simulators, e.g. GENOME ([Liang et al., 2007]), and is comparable or superior to current coalescent simulators in terms of speed and memory usage.
2 Approach
ARGON proceeds backwards in time one generation at a time, occasionally sampling coalescent and recombination events subject to population structure and migration. Each individual is represented as a list of regions that are still being tracked at the current time (i.e. for which not all samples have found a common ancestor). Recombination events are sampled in genetic space from an exponential distribution, and rounded to the closest physiscal base pair position based on the desired recombination rate, which may be varying along the chromosome. A recombination event can be a crossover or a non-crossover event, based on user-specified rates. For each individual, a maximum of two parents are sampled, so that multiple recombination events result in alternating ancestry between two individuals from the previous generation. When two or more individuals choose the same parent, coalescence occurs if the individuals contain overlapping regions of genetic material. During coalescence, regions within individuals are annotated with links to descendant ancestral recombination graph (ARG [Griffiths, 1981]) nodes.
Compared to other DTWF implementations, ARGON offers substantially improved speed and memory usage. In the GENOME simulator, for instance, individuals are represented as arrays of blocks of genetic material of a fixed size, and individuals from all populations are explicitly represented in memory using arrays. In ARGON, large regions are represented as intervals with arbitrary boundary values, and hash map data structures are extensively used to take advantage of sparsity, avoiding explicit representation of all individuals. As in the GENOME simulator, ARGON can run in approximate mode, so that recombinations breakpoints are rounded to blocks of a user-specified genetic length. This reduces the granularity of the recombination process, improving speed and memory usage, at the cost of slightly inflated correlation of markers at a short genomic distance. We tested this approximation using nonrecombinant blocks of size 10μM (1μM = 0.000001 Morgans). Additional tests using 50μM blocks are described in the Supplementary Note.
ARGON can efficiently output a list of pairwise IBD segments longer than a user-specified centimorgan length. Previous approaches to output simulated IBD sharing data (e.g. [Palamara et al., 2012, Palamara and Pe’er, 2013, Browning and Browning, 2013]) required comparing recent ancestry for all pairs of individual at each marginal tree in the sampled ARG, with computational cost quadratic in the sample size and linear in the number of analyzed marginal trees. IBD segments are delimited by the occurrence of recombination events that change the most recent common ancestor for pairs of samples. In ARGON, these events are detected by visiting internal ARG nodes, which requires less computational effort than the approach based on marginal trees.
3 Results
3.1 Accuracy for small sample sizes
When n ≪ Ne, the coalescent is a good approximation of the DTWF. We performed extensive testing for several scenarios including population size variation, migration across multiple demes, and gene conversion. We report detailed results in the Supplementary Note. We find good agreement between ARGON and MS. We also tested the accuracy of COSI version 2.0, SCRM version 1.6.1, and MSprime version 0.1.6 ([Kelleher et al., 2015]), a new efficient simulator for which a preliminary version was released at the time of writing. All simulators were found to be well calibrated against MS.
3.2 Deviation of the coalescent from the DTWF
The coalescent becomes a poor approximation of the DTWF process when the sample size is not substantially smaller than the effective population size ([Wakeley and Takahashi, 2003, Bhaskar et al., 2014]). We verified that ARGON matches the theoretical prediction for the number of singletons and doubletons described in [Bhaskar et al., 2014] for the DTWF (see Table 1). We simulated populations of effective size Ne = 1,000 and Ne = 20,000 haploid individuals, and sampled all present-day individuals. While ARGON matches the prediction of [Bhaskar et al., 2014] in both exact and approximate mode, MS simulations substantially deviate from the DTWF model. Simulation using other coalescent approaches (e.g. COSI2, SCRM, MSprime), will similarly deviate in these scenarios.
3.3 Scalability to large sample size and whole-chromosome length
We tested the run time and memory usage of ARGON and two recently developed programs that enable simulating very large sample sizes and long chromosomes: COSI2 and SCRM (see Table 2). We find that for large parameter values SCRM generally performs worse than ARGON. COSI2 is generally faster, but scales poorly for memory usage as the size of the region and the sample size grows. Additional tests, including a pre-release version of MSprime, are detailed in the Supplementary Note.
We further tested the performance of approximate algorithms for the same set of simulation parameters (see Table 3). We compared ARGON with a minimum recombination block size of 10μM (AR10), SCRM with the “-l” flag set to 0 (SC0), and COSI2 with the “-u” flag set to 0 (CS0). We find that ARGON’s speed and memory usage is substantially improved, at the cost of slightly inflated correlation for neighboring markers (see Supplementary Note). For a constant population of size Ne = 20,000, for instance, squared correlation (r2) of markers 0 to 50 Kb apart (using 1 cM/Mb) was increased by ~27% for AR10, but remained unchanged for markers at a larger distance. For CS0 and SC0, r2 was decreased by ~5% for markers at a larger distance. AR10 is faster than SC0, and approximately as fast as CS0, and uses less memory than both simulators for the large values of the test parameters. Comparison to other simulators and additional tests are detailed in the Supplementary Note.
4 Discussion
We described ARGON, a fast and scalable simulator for the discrete time Wright-Fisher process. Accurate simulation of large genealogical and sequence data will facilitate analysis of modern day genomic data sets, which have now reached hundreds of thousands of samples ([Sudlow et al., 2015]). ARGON has similar or superior time and memory performance compared to current coalescent simulators when large samples and long chromosomes are simulated, and enables accurately simulating the DTWF process when the sample size approaches the effective population size, where the coalescent approximation makes current scalable simulators become imprecise. Version 0.1 of ARGON supports arbitrary demographic history with migration, variable mutation/recombination rate and gene conversion, and outpus Newick trees and pairwise IBD segments for large sample sizes in both exact and approximate modes.
Funding
This research was funded by NIH grant R01 MH101244.
Supplementary Note
1 Approach
Sampling the ARG: ARGON proceeds backwards in time one generation at a time, occasionally sampling coalescent and recombination events subject to population structure and migration. Each individual is represented as a list of regions that are still being tracked at the current time (i.e. for which not all samples have found a common ancestor). Recombination events are sampled in genetic space from an exponential distribution, and rounded to the closest physiscal base pair position based on the desired recombination rate, which may be varying along the chromosome. A recombination event can be a crossover or a non-crossover event, based on user-specified rates. For each individual, a maximum of two parents are sampled, so that multiple recombination events result in alternating ancestry between two individuals from the previous generation. When two or more individuals choose the same parent, coalescence occurs if the individuals contain overlapping regions of genetic material. During coalescence, regions within individuals are merged and annotated with links to descendant ancestral recombination graph (ARG [Griffiths, 1981]) nodes.
Sampling sequences from the ARG: Once the ARG has been computed, mutation events are sampled on the graph edges according to a Pois-son process. For each ARG edge affected by mutation, it is neccessary to calculate and store a list of samples that will be used to output genotypic information. Naïve calculation of these lists is intensive, and the storage of descendant lists for all nodes in the ARG results in prohibitive memory requirements. To overcome these difficulties, ARGON efficiently computes descendant lists using dynamic programming, and relies on a combination of top-down and bottom-up traversals of the ARG in order to always store the smallest possible set of descendant lists in memory while sampling mutations. During these operations, sparsity is exploited to further reduce memory consumption. Descendant lists are stored as either bit-set data structures, or lists of integers representing sample IDs, depending on their size. During the ARG traversals, whenever the size of a descendant list grows to exceed the expected memory consumption of the corresponding bit-set representation, the list is converted to the bit-set format. Similarly, ARGON offers an option (“-shrink” flag) to output the genotypes of the simulated individuals using a representation where the carriers of a derived allele are reported as a list of sample IDs, rather than the classical 0/1 representation of ancestral/derived alleles. Because a large number of mutations will have a set number of carriers, this representation saves computational speed and output disk space, and should be used to increase performance. Output formats also include the VCF format (default) and the hap/samples format.
These and other algorithmic details enable ARGON to be substantially more efficent than other DTWF implementations. In the GENOME simulator, for instance, individuals are represented as arrays of blocks of genetic material of a fixed size, and individuals from all populations are explicitly represented in memory using arrays. In ARGON, large regions are represented as intervals with arbitrary boundary values, and hash map data structures are extensively used to take advantage of sparsity, avoiding explicit representation of all individuals. As in the GENOME simulator, ARGON can run in approximate mode, so that recombinations breakpoints are rounded to blocks of a user-specified genetic length. This reduces the granularity of the recombination process, improving speed and memory usage, at the cost of slightly inflated correlation of markers at a short genomic distance. We tested this approximation using non-recombinant blocks of size 10μM and 50μM (1μM = 0.000001 Morgans).
ARGON can efficiently output a list of pairwise IBD segments longer than a user-specified centimorgan length. Previous approaches to output simulated IBD sharing data (e.g. [Palamara et al., 2012, Palamara and Pe’er, 2013, Browning and Browning, 2013]) required comparing recent ancestry for all pairs of individual at each marginal tree in the sampled ARG, with computational cost quadratic in the sample size and linear in the number of analyzed marginal trees. IBD segments are delimited by the occurrence of recombination events that change the most recent common ancestor for pairs of samples. In ARGON, these events are detected by visiting internal ARG nodes, which requires less computational effort than the approach based on marginal trees.
2 Time and memory performance
We report detailed time and memory performance evaluation results for several simulators. Table 1 shows results for ARGON version 0.1.160101, SCRM version 1.6.1 [Staab et al., 2015], COSI version 2.0 [Shlyakhter et al., 2014], and MSprime version 0.1.6 [Kelleher et al., 2015]. ARGON was run using the “-shrink” flag and writing genotypes in the “seq” format, in which rare alleles are represented as lists of sample IDs. MSprime was run using the “-max-memory=60G” flag to enable generating large genealogies. Table 2 reports performance evaluation for several algorithms run in approximate mode: ARGON version 0.1.160101 with 10 and 50 microMorgans, FAST-SIMCOAL version 2.5.2.21 (implementing the SMC’ approximation), SCRM version 1.6.1 with the “-l” flag set to 0, and COSI version 2.0 with the “-u” flag set to 0. Note that the approximation flags for COSI2 and SCRM were set to the smallest allowed value, and that for both simulators slightly larger values may be used to find the appropriate trade-off between accuracy and speed. Table 3 reports a comparison of ARGON 0.1.160101 and GENOME version 0.2 [Liang et al., 2007], both run with a non-recombinant block size of 10 and 50 microMorgans. GENOME could not be run in exact mode due to excessive time and memoty requirements.
3 Simulator accuracy
We compared ARGON, MS, COSI2, SCRM and MSprime to test accuracy for small sample sizes, when the coalescent process is a good approximation of the discrete time Wright-Fisher process. We tested a number of scenarios, for which we provide a description and corresponding MS command below. We did not test all scenarios for MSprime, because version 0.1.6 does not support some of the required features.
Constant size: A population of constant size N = 20,000 haploid individuals; 50 haploid samples of 5 Mb, 10−8 recombinations per bp per generation, 2 × 10−8 mutations per bp per generation.
Exponential expansion: A population of ancestral size N = 20, 000 haploid individuals, expanding from generation 500 to a present day size of 200,000 haploid individuals; 50 haploid samples of 5 Mb, 10−8 recombinations per bp per generation, 2 × 10−8 mutations per bp per generation.
Bottleneck: A population of ancestral size N = 20, 000 haploid individuals, shrinking from generation 500 to a present day size of 2, 000 haploid individuals; 50 haploid samples of 5 Mb, 10−8 recombinations per bp per generation, 2 × 10−8 mutations per bp per generation.
Island model: 4 demes of constant size 5,000 haploid individuals, symmetric pairwise migration rate of 0.001 individuals per generation; 20 samples of 5 Mb from each deme, 10−8 recombinations per bp per generation, 2 × 10−8 mutations per bp per generation.
Island model, high migration: 4 demes of constant size 5000 haploid individuals, symmetric pairwise migration rate of 0.01 individuals per generation; 20 samples of 5 Mb from each deme, 10−8 recombinations per bp per generation, 2 × 10−8 mutations per bp per generation.
Piecewise constant: A population of varying constant size. Size 100, 000 haploid individuals from generation 0 to 100; 5, 000 individuals from generation 100 to 200; 50, 000 haploid individuals from generation 200 to 300; and ancestral size 10, 000 haploid individuals from generation 300 on; 50 haploid samples of 5 Mb, 10−8 recombinations per bp per generation, 2 × 10−8 mutations per bp per generation.
The flag “-c 1 300” was added when simulating gene conversion with same rate as recombination, and a mean tract length of 300 bp. The flag “-p 10” was added in all cases to increse the number of reported significant digits, and “-seed” was used to generate independent random draws. For the COSI2 simulator the infinite sites model was enabled adding the “infinite_sites yes” command in the configuration file. We tested the following summary statistics:
r50 correlation of markers at distance between 0 and 50 Kb.
r00 correlation of markers at distance between 50 and 100 Kb.
r150 correlation of markers at distance between 100 and 150 Kb.
squared correlation of markers at distance between 0 and 50 Kb.
squared correlation of markers at distance between 50 and 100 Kb.
squared correlation of markers at distance between 100 and 150 Kb.
θ heterozygosity computed as for n samples and l bases in the region, each with sample frequency pi.
S number of polymorphic sites in the sample.
S<.1 number of polymorphic sites with minor allele frequency below 10% in the sample.
S∈[.1,.4] number of polymorphic sites with minor allele frequency between 10% and 40% in the sample.
S>.4 number of polymorphic sites with minor allele frequency above 40% in the sample.
r and r2 statistics were obtained using Plink 1.9 using the flags “-r” and “-r2”, for variants with MAF greater than 10%. The tables below report the value of these statistics for the considered pairs of simulators after 5, 000 independent draws. The column Δ% reports differences of each feature as 100 × (mean1 — mean2)/mean1. Z-scores corresponding to p < 0.001 are highlighted with an asterisk. Results are reported in tables 4 through 52.
We tested the accuracy of ARGON when a recombination map is used in input, and identical-by-descent (IBD) segments are requested in output. We simulated 400 haploid samples from populations of constant effective size ranging from 5, 000 to 30, 000 haploid individuals. We performed 10 independent simulations of a realistic ~ 249 Mb Human Chromosome 1, using the recombination map downloaded from https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.html. To test the accuracy of the IBD provided in output, we estimated the size of the simulated populations using output segments of length greater than 2 cM. To estimate the effective population size, we used the maximum likelihood estimator derived in [Palamara, 2014]: where and ŜR represents the total number of segments longer than u
Morgans observed for all pairs. γ represents the total length (in Morgans) of the analyzed genomes. We observed good agreement between inferred and simulated population sizes (Figure 1), indicating that the distribution of IBD segments output by ARGON matches theoretical expectations. We plotted the density of recombination events, which we estimated using all edges of the IBD segments output across all simulations. As shown in figures 2 and 3, the density of observed recombination events follows the recombination map specified in input.
3.1 ARGON and MS
3.2 ARGON and COSI2
3.3 MS and COSI2
3.4 MSprime and COSI2
3.5 MSprime and ARGON
3.6 MSprime and MS
3.7 SCRM and MS
3.8 ARGON with approximation and MS
3.9 SCRM with approximation and MS
3.10 COSI2 with approximation and MS
3.11 ARGON and MS, with gene conversion
3.12 ARGON and COSI2, with gene conversion
3.13 MS and COSI2, with gene conversion
Acknowledgements
We thank Alkes L. Price and John Wakeley for useful discussions and comments on an early draft.
References
- [Bhaskar et al., 2014].↵
- [Browning and Browning, 2013].↵
- [Fisher et al., 1922].↵
- [Griffiths, 1981].↵
- [Hudson, 2002].↵
- [Kelleher et al., 2015].↵
- [Kingman, 1982].↵
- [Liang et al., 2007].↵
- [McVean and Cardin, 2005].↵
- [Palamara et al., 2012].↵
- [Palamara and Pe’er, 2013].↵
- [Shlyakhter et al., 2014].↵
- [Staab et al., 2015].↵
- [Sudlow et al., 2015].↵
- [Wakeley and Takahashi, 2003].↵
- [Wiuf and Hein, 2000].↵
- [Wright, 1931].↵