Dsuite - fast D-statistics and related admixture evidence from VCF files

Summary The D-statistic, also known as the ABBA-BABA statistic, and related statistics are commonly used to assess evidence of gene flow between populations or closely related species. While the calculations are not computationally intensive, currently available implementations require custom file formats and are impractical to evaluate all gene flow hypotheses across datasets that include many populations or species. Dsuite is a fast C++ implementation, allowing genome scale calculations of the D-statistic across all combinations of tens or even hundreds of populations or species directly from a variant call format (VCF) file. Furthermore, the program can estimate the admixture fraction and provide evidence of whether introgression is confined to specific loci. Thus Dsuite facilitates assessment of gene flow across large genomic datasets. Availability and implementation Source code and documentation are available at: https://github.com/millanek/Dsuite


Introduction
There is evidence that admixture between populations and hybridization between species are common and that a bifurcating tree is often insufficient to capture their evolutionary history (Green et al., 2010;Patterson et al., 2012;Tung and Barreiro, 2017;Malinsky et al., 2018;Kozak et al., 2018). The Patterson's D-statistic, first used to detect introgression between modern human and Neanderthal populations (Green et al., 2010;Durand et al., 2011), has since been widely applied and used across a broad range of taxa (Fontaine et al., 2015;vonHoldt et al., 2016;Malinsky et al., 2018;Kozak et al., 2018). The D-statistic and the related estimate of the admixture fraction f, also referred to as f4 (Patterson et al., 2012), are easy to calculate and well suited for taking full advantage of genome-scale datasets, while being robust under most demographic scenarios (Durand et al., 2011).
Several programs for calculating D and f from genomic data are available, among them ADMIXTOOLS (Patterson et al., 2012), DFOIL (Pease and Hahn, 2015), HyDe (Blischak et al., 2018), andComp-D (Mussmann et al., 2019). However, what limits their utility is that none of the programs can handle the variant call format (VCF) (Danecek et al., 2011), the standard file format for storing genetic polymorphism data produced by variant callers such as samtools (Li, 2011) andGATK (De-Pristo et al., 2011). Moreover, as each calculation of D and f applies to four populations or taxa, the number of calculations/quartets grows geometrically with the size of the dataset. Therefore, previous studies utilizing D and f involved small numbers of populations or taxa, with few exceptions (Malinsky et al., 2018;Kozak et al., 2018). With ever more genomic data becoming available, there is a need for computationally efficient implementation able to handle datasets with tens and even hundreds of taxa. Dsuite addresses the above issues in that it calculates D and f statistics directly from VCF files, and is substantially more efficient than other programs. Thus Dsuite further increases the accessibility and applicability of admixture testing based on these statistics.

Methods and implementation
The statistics calculated by Dsuite apply to biallelic SNPs across four populations or taxa: P1, P2, P3, and O, related by the asymmetric tree (((P1,P2),P3),O). The outgroup O defines the ancestral allele, denoted by A, and the derived allele is denoted by B. The site patterns are ordered such that the pattern BBAA refers to P1 and P2 sharing the derived allele, ABBA to P2 and P3 sharing the derived allele, and BABA to P1 and P3 sharing the derived alleles. Under the null hypothesis, which assumes no gene flow, the ABBA and BABA patterns are expected to occur with equal frequencies, while a significant deviation from that suggests possible introgression between P3 and either P1 or P2. See (Durand et al., 2011) for more detail.
While simple site pattern counts can be computed for single sequences, the Dsuite implementation works with allele frequency estimates, so multiple individuals can, and ideally should, be included from each population or taxon. Denoting the allele frequency estimate at site i in P1 as !! , the following sums are calculated across all n biallelic sites:

The Dtrios program
Dsuite does not assume a priori knowledge of population or species relationships, only the outgroup has to be specified. The first subprogram, Dtrios calculates the sums in equation (1) for all trios of populations or taxa in the dataset. The command produces three types of output. For the first, in a file with the "BBAA.txt" suffix, Dtrios attempts to infer the population or species relationships: it orders each trio assuming that the correct tree is the one where the BBAA pattern is more common than the discordant ABBA and BABA patterns, which are assumed to result for example from incomplete lineage sorting, repeated mutation at the same site, or from introgression. In addition P1 and P2 are ordered so that nABBA >= nBABA and, therefore is never negative. The second type of output is the Dmin score, the minimum D for each trio, regardless of any assumptions about the tree topology (Malinsky et al., 2018). There is no attempt to infer the true tree; instead, the trio is ordered so that the difference between nABBA and nBABA is minimized. This output is in a file with the "Dmin.txt" suffix. Finally, there is also an option for the user to supply a tree in Newick format to specify known or hypothesized relationships between the populations or species. An output file with the "tree.txt" suffix then contains D values for trios ordered in a way consistent with this tree. To assess whether D is significantly different from zero, Dtrios uses a standard block-jackknife procedure as in (Green et al., 2010;Durand et al., 2011), obtaining an approximately normally distributed standard error. For three types of output, Dtrios calculates the Z-scores: = / _ ( ), and outputs the associated p-values. However, when testing more than one trio, users should take into account the multiple testing problem and adjust the p-values accordingly. Although the different D-statistics calculated on the same dataset do not constitute independent tests, users could consider them as such and control for overall false discovery rate, as this is a straightforward and conservative approach.

The DtriosCombine program
It is common practice, especially for larger genomic datasets, that VCF files are divided into smaller subsets by genomic regions, e.g. per chromosome. This facilitates the parallelization of many computational workflows. The DtriosCombine program enables parallel computa-tion of the D-statistic across genomic regions, by combining the outputs of multiple Dtrios runs, summing up the counts in equation (1). It also calculates overall block-jackknife standard error across all regions.

The Dinvestigate program
The program Dinvestigate is designed to provide further information about trios that have significant D-statistics. Dinvestigate estimates the admixture fraction f, using fG as in (Green et al., 2010), and also calculates fd (Martin et al., 2015) and fdM (Malinsky et al., 2015), two statistics related to fG but better suited to be applied to genomic windows. The fd and fdM statistics can be used to assess whether the admixture signal is confined to specific loci and to assist in locating any such loci.
Calculating fG, the Green et al. estimate of f, requires that P3 be split into two subsets: P3a and P3b. If the dataset contains more than one individual from P3, then individuals are randomly assigned to the two subsets. If there is only one individual from P3, then only SNPs where the individual is homozygous for the derived allele are used.
The program outputs overall fG, and also the more conservative fd, and fdM to the standard output for each trio specified by the user. In addition, it produces a text file for each trio which contains the values of fd and fdM in sliding genomic windows, with the size of the windows specified by the user as a fixed number of SNPs.

Performance
I assessed the computational performance of Dsuite by calculating the Dstatistic across the dataset of 73 species of Lake Malawi cichlid fishes from (Malinsky et al., 2018). These 73 species are represented in the dataset by 131 individuals. The outgroup is one individual from a Lake Tanganyika cichlid Neolamprologus brichardi. For the tests, I used 610,333 SNPs that were called on scaffold_0, the largest scaffold of the reference genome assembly.  Dsuite was the most efficient of the programs, both in terms of memory requirements and in terms of run time, being the only software to finish in under two hours. Admixtools was comparably fast, but required more than 27GB of memory, ~370x more than Dsuite. HyDe was more memory efficient, but the run time was longer, at just under four hours. Finally, Comp-D required substantial amount of memory at 8.3GB and also took by far the longest to complete the analysis; I cancelled the run after 24hours, with only ~3,000 trios completed. While the Dsuite analysis was run directly on the VCF file, all other software required format conversion. For Admixtools, I first obtained data in the PED format using vcftools v0.1.12b (Danecek et al., 2011) with the --plink option, and then translated these into the softwarespecific EIGENSTRAT format using the convertf program, which is included in the Admixtools package. Data conversion into the PHYLIP input format for HyDe and Comp-D was done using the vcf2phylip script (Ortiz, 2019). The conversions added approximately ten minutes to each of these workflows, in addition to the run times shown in Table 1.