Abstract
Next-generation sequencing of pooled samples (Pool-seq) is an important tool in population genomics and molecular ecology. In Pool-seq, the proportion of reads carrying an allele reflects the frequency of that allele in the sample. However, unequal individual contributions to the pool and sequencing errors can lead to inaccurate allele frequency estimates. When designing Pool-seq studies, researchers need to decide on the pool size (number of individuals) and the average depth of coverage (sequencing effort). An efficient sampling design should maximize the accuracy of allele frequency estimates while minimizing the sequencing effort. We introduce an R package, poolHelper, that enables users to simulate Pool-seq data under different combinations of average depth of coverage, pool size and number of pools, accounting for unequal individual contributions and sequencing errors, both modelled by parameters that users can modify. poolHelper can be used to assess how different combinations of those parameters influence the error of sample allele frequencies and expected heterozygosity. The mean absolute error is computed by comparing the sample allele frequencies computed from individual genotypes with the frequency estimates obtained with Pool-seq. Using simulations under a single-population model, we illustrate that increasing the depth of coverage does not necessarily lead to more accurate estimates, reinforcing that finding the best Pool-seq study design is not straightforward. Moreover, we show that simulations can be used to identify different combinations of parameters with similarly low mean absolute errors. The poolHelper package provides tools for performing simulations with different combinations of parameters before sampling and generating data, allowing users to define sampling schemes that minimize the sequencing effort.
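As a rough illustration of the kind of simulation described above, the following Python sketch draws diploid genotypes for one pool, generates reads under unequal individual contributions and sequencing errors, and computes the mean absolute error between the genotype-based sample frequencies and the read-based estimates. It does not use poolHelper and does not reproduce its models. The function names `poisson` and `simulate_mae`, the gamma-distributed contributions controlled by a coefficient of variation `cv`, the Poisson-distributed depth, the symmetric error model and all default values are assumptions made for illustration only.

```python
import math
import random


def poisson(rng, lam):
    """Draw a Poisson variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def simulate_mae(n_individuals, mean_depth, cv=0.3, seq_error=0.01,
                 n_sites=500, seed=1):
    """Mean absolute error between genotype-based and read-based frequencies.

    Hypothetical model, not poolHelper's: one pool, biallelic sites,
    contributions fixed per individual across sites.
    """
    rng = random.Random(seed)

    # Unequal contributions: gamma weights with coefficient of variation cv
    # (shape = 1 / cv^2); cv = 0 means every individual contributes equally.
    if cv > 0:
        shape = 1.0 / cv ** 2
        weights = [rng.gammavariate(shape, 1.0) for _ in range(n_individuals)]
    else:
        weights = [1.0] * n_individuals

    abs_errors = []
    for _ in range(n_sites):
        # Population frequency of the alternative allele at this site.
        p = rng.uniform(0.05, 0.95)
        # Diploid genotypes: number of alternative alleles (0, 1 or 2).
        genotypes = [(rng.random() < p) + (rng.random() < p)
                     for _ in range(n_individuals)]
        sample_freq = sum(genotypes) / (2 * n_individuals)

        # Depth varies around the mean; sites with zero reads are avoided.
        depth = max(1, poisson(rng, mean_depth))
        # Each read comes from one individual, chosen by contribution weight.
        donors = rng.choices(range(n_individuals), weights=weights, k=depth)

        alt_reads = 0
        for i in donors:
            # The read samples one of the donor's two alleles at random.
            is_alt = rng.random() < genotypes[i] / 2
            # A sequencing error flips the observed allele.
            if rng.random() < seq_error:
                is_alt = not is_alt
            alt_reads += is_alt

        abs_errors.append(abs(alt_reads / depth - sample_freq))

    return sum(abs_errors) / n_sites


if __name__ == "__main__":
    # Compare a few hypothetical designs (pool size x mean depth).
    for pool_size in (20, 100):
        for depth in (20, 50, 200):
            mae = simulate_mae(pool_size, depth)
            print(f"pool size {pool_size:>3}, mean depth {depth:>3}: MAE = {mae:.4f}")
```

Running the script prints one mean absolute error per design, which can be used to compare candidate pool sizes and depths under the stated assumptions. The values depend on the assumed error parameters and should not be read as results from the package.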
Competing Interest Statement
The authors have declared no competing interest.