## Abstract

**Summary** nQuire is a statistical framework that distinguishes between diploids, triploids and tetraploids using next generation sequencing. The command-line tool models the distribution of base frequencies at variable sites using a Gaussian Mixture Model, and uses maximum likelihood to select the most plausible ploidy model.

**Availability and Implementation** The model is implemented as a stand-alone Linux command-line tool in the C programming language and is available on GitHub under the MIT licence. Please refer to github.com/clwgg/nQuire for usage instructions. Contact: clemens.weiss{at}tuebingen.mpg.de or hernan.burbano{at}tuebingen.mpg.de

## Introduction

Polyploidy, the presence of more than two complete sets of chromosomes, can both fuel and hinder adaptation [1,8,10]. Polyploidization can lead to aneuploidy, which burdens mating due to the presence of individuals of different ploidy. Therefore, intraspecific variation in ploidy tends to occur - although not exclusively - in organisms that have the capacity to reproduce asexually [5,11,12], are self-compatible or are perennial [7].

Ploidy can be inferred from Next Generation Sequencing (NGS) data, for instance, by assessing the distribution of allele frequencies at biallelic Single Nucleotide Polymorphisms (SNPs) [11, 12]. This method assumes that alleles present at biallelic SNPs occur at different ratios for different ploidy levels, that is, 0.5/0.5 in diploids, 0.33/0.67 in triploids, and a mixture of 0.25/0.75 and 0.5/0.5 in tetraploids. To determine the ploidy level, the distribution of biallelic SNPs can be inspected visually and/or qualitatively compared with simulated data [11]. However, this methodology is not quantitative and relies on the identification of variable sites ("SNP calling"), which is performed using approaches that benefit from a previously known ploidy level [9]. A further development based on biallelic SNPs uses a Bayesian statistical approach to estimate allelic ratios, followed by a clustering procedure that helps discriminate between ploidy levels from Genotyping-By-Sequencing data [3].
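The expected ratios quoted above follow from counting how many of the chromosome copies carry one of the two alleles at a heterozygous site. A minimal sketch of that arithmetic is given below; the function name `expected_frequencies` is illustrative and not part of any published tool.

```python
from fractions import Fraction

def expected_frequencies(ploidy):
    """Allele frequencies expected at a heterozygous biallelic site.

    With `ploidy` chromosome copies, one allele can sit on 1..ploidy-1 of
    them, so its expected frequency among the reads is copies / ploidy.
    """
    return [Fraction(copies, ploidy) for copies in range(1, ploidy)]

# Print the expected frequencies for diploids, triploids and tetraploids.
for ploidy in (2, 3, 4):
    print(ploidy, [str(f) for f in expected_frequencies(ploidy)])
```

For a tetraploid this yields 1/4, 1/2 and 3/4, which is the mixture of 0.25/0.75 and 0.5/0.5 sites described in the text.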

Here we present a new statistical model that aims to distinguish between the distribution of base frequencies at variable sites for diploids, triploids and tetraploids directly from read mappings to a reference genome. It models base frequencies as a Gaussian Mixture Model (GMM), and uses maximum likelihood to assess empirical data under the assumptions of diploidy, triploidy and tetraploidy. We evaluated the performance of our method at different coverages using published genomes of *Saccharomyces cerevisiae* [12] and high-coverage genomes of *Phytophthora infestans* produced for this study.

## Methods

### Implementation

To distinguish between diploids, triploids and tetraploids based on their distributions of base frequencies at variable sites with only two bases segregating (Figure 1A), we implemented a GMM that models the base frequency profiles as a mixture of three Gaussian distributions (Figure 1B), which are scaled relative to each other as:

$$\log L = \sum_{i=1}^{n} \log \sum_{j=1}^{3} \alpha_j \, \mathcal{N}(x_i \mid \mu_j, \sigma_j)$$

Here, $n$ describes the number of data points, $x_i$ describes the value of each data point (i.e. the base frequency), and $\mu_j$ and $\sigma_j$ are the parameters of the $j$-th of three Gaussian distributions $\mathcal{N}$, which are scaled relative to each other through the parameter $\alpha_j$. The only constraint here is that $\sum_{j=1}^{3} \alpha_j = 1$.

This model allows estimating the parameters of the Gaussian mixture components, as well as their mixture proportions, by maximizing the log-likelihood, either with or without constraints on the possible parameter space.
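As an illustration of the quantity being maximized, the following sketch evaluates the log-likelihood of a Gaussian mixture for a handful of base frequencies. It is not taken from the nQuire source: the function names and the example frequencies are hypothetical, and the spread parameter is assumed to be a standard deviation.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Normal density; sigma is assumed to be the standard deviation."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def gmm_log_likelihood(xs, alphas, mus, sigmas):
    """Sum over data points of the log of the weighted mixture density."""
    if abs(sum(alphas) - 1.0) > 1e-9:
        raise ValueError("mixture proportions must sum to 1")
    total = 0.0
    for x in xs:
        density = sum(a * gaussian_pdf(x, m, s)
                      for a, m, s in zip(alphas, mus, sigmas))
        total += math.log(density)
    return total

# Hypothetical base frequencies clustered around 0.25, 0.5 and 0.75.
freqs = [0.24, 0.26, 0.49, 0.51, 0.74, 0.77]

# Three equally weighted peaks versus a single peak at 0.5.
three_peaks = gmm_log_likelihood(freqs, [1 / 3, 1 / 3, 1 / 3],
                                 [0.25, 0.5, 0.75], [0.05, 0.05, 0.05])
one_peak = gmm_log_likelihood(freqs, [1.0], [0.5], [0.05])
print(three_peaks, one_peak)
```

For these frequencies the three-peak parameterization gives the higher log-likelihood, because the single peak at 0.5 assigns very low density to the points near 0.25 and 0.75.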

The likelihood maximization of the GMM is implemented through an Expectation-Maximization (EM) algorithm (Figure 1B), which is specific to the GMM but can be extended to similar models. The algorithm estimates all parameters at once and computes a likelihood ("free model"). Alternatively, a likelihood can be calculated when parameters are held constant at the values expected under diploidy (one Gaussian with mean 0.5), triploidy (two Gaussians with means 0.33 and 0.67) and tetraploidy (three Gaussians with means 0.25, 0.5 and 0.75) ("fixed models"). Since all fixed models are nested within the free model, it is possible to directly compute the log-likelihood ratios, following:

$$\Delta \log L = \log L_{\mathrm{free}} - \log L_{\mathrm{fixed}}$$
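The EM procedure described above can be sketched generically as follows. This is a textbook EM for a one-dimensional Gaussian mixture, not the nQuire implementation. The `fix_means` switch is a hypothetical way to mimic a fixed model: holding only the means constant is an assumption, since the text does not state which other parameters are constrained. The simulated data are also illustrative.

```python
import math
import random

def gaussian_pdf(x, mu, sigma):
    """Normal density; sigma is assumed to be the standard deviation."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def log_likelihood(xs, alphas, mus, sigmas):
    """Log-likelihood of the data under the weighted Gaussian mixture."""
    return sum(
        math.log(sum(a * gaussian_pdf(x, m, s)
                     for a, m, s in zip(alphas, mus, sigmas)))
        for x in xs
    )

def em_fit(xs, alphas, mus, sigmas, n_iter=30, fix_means=False):
    """Run EM; return (alphas, mus, sigmas, log-likelihood per iteration)."""
    alphas, mus, sigmas = list(alphas), list(mus), list(sigmas)
    history = [log_likelihood(xs, alphas, mus, sigmas)]
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in xs:
            weighted = [a * gaussian_pdf(x, m, s)
                        for a, m, s in zip(alphas, mus, sigmas)]
            total = sum(weighted)
            resp.append([w / total for w in weighted])
        # M-step: re-estimate each component from its weighted points.
        for j in range(len(alphas)):
            n_j = sum(r[j] for r in resp)
            alphas[j] = n_j / len(xs)
            if n_j < 1e-12:
                continue  # effectively empty component; keep its shape
            if not fix_means:
                mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / n_j
            var = sum(r[j] * (x - mus[j]) ** 2
                      for r, x in zip(resp, xs)) / n_j
            sigmas[j] = max(math.sqrt(var), 1e-4)  # avoid a zero-width peak
        history.append(log_likelihood(xs, alphas, mus, sigmas))
    return alphas, mus, sigmas, history

# Simulated triploid-like frequencies: two peaks near 1/3 and 2/3.
random.seed(1)
data = ([random.gauss(1 / 3, 0.04) for _ in range(200)]
        + [random.gauss(2 / 3, 0.04) for _ in range(200)])

# Free model: three components, all parameters estimated.
free = em_fit(data, [1 / 3] * 3, [0.25, 0.5, 0.75], [0.1] * 3)
# Fixed triploid model: two components with means held at 1/3 and 2/3.
fixed_triploid = em_fit(data, [0.5, 0.5], [1 / 3, 2 / 3], [0.1, 0.1],
                        fix_means=True)
print(free[3][-1], fixed_triploid[3][-1])
```

Each EM iteration cannot decrease the log-likelihood, so the per-iteration history is a convenient convergence check. EM can stop at a local optimum, so a practical implementation would need a strategy for choosing starting values.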

The Δ*logL* values describe the distance between each fixed model and the best fit under the assumptions of the GMM. A substantially lower Δ*logL* for one fixed model than for the others supports the ploidy level described by this fixed model (Figure 1C). Therefore, we used Δ*logL* as a summary statistic, where the minimum value supports a given ploidy level.
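The decision rule can be illustrated with a few lines of arithmetic. The sketch assumes that ΔlogL is the free-model log-likelihood minus the fixed-model log-likelihood, which is consistent with the nesting described above. The log-likelihood values are invented for illustration and are not results from this study.

```python
def delta_logls(logl_free, logl_fixed):
    """Distance of each fixed model from the free model (smaller is closer)."""
    return {ploidy: logl_free - logl for ploidy, logl in logl_fixed.items()}

def best_supported(deltas):
    """Ploidy label with the smallest delta log-likelihood."""
    return min(deltas, key=deltas.get)

# Hypothetical log-likelihoods, chosen only to illustrate the arithmetic.
logl_free = 1250.0
logl_fixed = {"diploid": 310.0, "triploid": 1190.0, "tetraploid": 870.0}

deltas = delta_logls(logl_free, logl_fixed)
print(deltas)
print(best_supported(deltas))
```

With these invented numbers the triploid model lies closest to the free model (ΔlogL of 60, against 380 and 940), so triploidy would be the supported ploidy level.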

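Additionally, the GMM can be extended to a Gaussian Mixture Model with Uniform noise component (GMMU), by adding a uniform mixture component $\mathcal{U}$ with mixture proportion $\alpha_4$:

$$\log L = \sum_{i=1}^{n} \log \left( \sum_{j=1}^{3} \alpha_j \, \mathcal{N}(x_i \mid \mu_j, \sigma_j) + \alpha_4 \, \mathcal{U}(x_i) \right)$$

The constraint on the mixture proportions then becomes $\sum_{j=1}^{4} \alpha_j = 1$.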

The uniform noise component is used in our implementation to allow baseline noise removal. This is important when the Gaussian peaks are observable but embedded in background noise, which could be caused by highly repetitive genomes or low coverage.
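The effect of the uniform component can be seen in a small numerical sketch. It assumes that the uniform density spans the interval from 0 to 1 (the possible range of base frequencies). The parameter name `alpha_noise` and the example frequencies are hypothetical. None of this is taken from the nQuire source.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Normal density; sigma is assumed to be the standard deviation."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def gmmu_log_likelihood(xs, alphas, mus, sigmas, alpha_noise, lo=0.0, hi=1.0):
    """Log-likelihood of a Gaussian mixture plus one uniform noise component."""
    if abs(sum(alphas) + alpha_noise - 1.0) > 1e-9:
        raise ValueError("all mixture proportions must sum to 1")
    total = 0.0
    for x in xs:
        gaussians = sum(a * gaussian_pdf(x, m, s)
                        for a, m, s in zip(alphas, mus, sigmas))
        # Assumed uniform density on [lo, hi]; zero outside that interval.
        uniform = alpha_noise / (hi - lo) if lo <= x <= hi else 0.0
        total += math.log(gaussians + uniform)
    return total

# Hypothetical diploid-like frequencies with two stray points (0.05, 0.93).
freqs = [0.48, 0.49, 0.5, 0.5, 0.51, 0.52, 0.05, 0.93]

without_noise = gmmu_log_likelihood(freqs, [1.0], [0.5], [0.05],
                                    alpha_noise=0.0)
with_noise = gmmu_log_likelihood(freqs, [0.8], [0.5], [0.05],
                                 alpha_noise=0.2)
print(without_noise, with_noise)
```

Without the noise component, the two stray points dominate the log-likelihood because a narrow peak at 0.5 assigns them almost no density. With a modest uniform proportion they are absorbed by the noise term, and the peak is judged mainly on the points near 0.5.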

### *Phytophthora infestans* libraries

The two benchmarking libraries from *P. infestans* were generated according to the protocol by Meyer and Kircher [6] from DNA extracted from cultures [2]. The libraries were sequenced to high coverage on an Illumina HiSeq 3000 machine in paired-end 150 bp mode. These read data are available at the European Nucleotide Archive (ENA) under study number PRJEB20998.

## Results

### Method evaluation

We evaluated nQuire’s performance using three *S. cerevisiae* samples at 100x coverage, one for each of the three ploidy levels evaluated by the model, as well as two *P. infestans* samples, one diploid and one triploid, at 210x and 368x coverage, respectively. The Δ*logL* of each of the fixed models to the free model at full coverage is shown in Table 1. At those coverages, the best model is more than two times closer to the free model than the second-best model. The best model also coincides in all samples with the ploidy level inferred by visually inspecting the empirical distributions of base frequencies at full coverage (Figure S1A-C and S2A-B). To investigate the impact of coverage on the performance of the GMM, we downsampled mapped reads from all *S. cerevisiae* (Figure S1G-I) and *P. infestans* (Figure S2E-F) strains to different coverage levels ranging from almost full to 1x coverage. These analyses showed that while the Δ*logL* between the free model and the true fixed model starts to plateau at low coverage, the Δ*logL* between the free model and the two incorrect fixed models keeps increasing at higher coverage (Figure S1G-I and S2E-F).

### Method performance

nQuire directly processes BAM files [4] and is designed to be efficient in memory usage and runtime. To process a 1 GB *S. cerevisiae* BAM file (100x coverage), nQuire needs 70 seconds to build the appropriate data structures and 6 seconds to run the models and calculate the maximum likelihood estimates, and it uses a maximum of 8 MB of RAM. To process a 10 GB *P. infestans* BAM file (100x coverage), it needs 760 seconds, 100 seconds and 60 MB of RAM, respectively.

## Conclusion

We present nQuire, a statistical approach to distinguish diploids, triploids and tetraploids from NGS data. In comparison to a previous quantitative approach [3], nQuire requires neither SNP calls nor reference individuals with known ploidy. The higher level of noise resulting from omitting SNP calling is accounted for by using Gaussian distributions to approximate a binomial process, since such distributions are impacted less by the effects of outliers. nQuire will be useful to assess intraspecific variation in ploidy from both historic and modern samples, as well as in experimental evolution experiments.

## Acknowledgements

We thank Michael Dannemann, Richard Neher, Thomas Mailund, Oliver Kohlbacher, Moises Exposito-Alonso, Kay Pruefer, and members of the Research Group for Ancient Genomics and Evolution (AGE) for useful discussions and input on model implementation; the AGE group and Michael Dannemann for comments on the manuscript; and the Presidential Innovation Fund of the Max Planck Society for financial support.