RT Journal Article
SR Electronic
T1 Estimating Error Models for Whole Genome Sequencing Using Mixtures of Dirichlet-Multinomial Distributions
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 031724
DO 10.1101/031724
A1 Steven H. Wu
A1 Rachel S. Schwartz
A1 David J. Winter
A1 Donald F. Conrad
A1 Reed A. Cartwright
YR 2016
UL http://biorxiv.org/content/early/2016/09/20/031724.abstract
AB Motivation: Accurate identification of genotypes is critical for identifying de novo mutations, linking mutations with disease, and determining mutation rates. Because de novo mutations are rare, even low levels of genotyping error can produce a large fraction of false-positive de novo mutation calls. Biological and technical processes that adversely affect genotyping include copy-number variation, paralogous sequences, library preparation, sequencing error, and reference-mapping biases, among others. Results: We modeled the read depth for all data as a mixture of Dirichlet-multinomial distributions, resulting in significant improvements over previously used models. In most cases the best model comprised two distributions. The major-component distribution is similar to a binomial distribution with low error and low reference bias. The minor-component distribution is overdispersed, with higher error and reference bias. We also found that sites fitting the minor component are enriched for copy-number variants and low-complexity regions. We expect that this approach to modeling the distribution of NGS data will lead to improved genotyping. For example, this approach provides an expected distribution of reads that can be incorporated into a model to estimate de novo mutations using reads across a pedigree. Availability: Methods and data files are available at https://github.com/CartwrightLab/WuEtAl2016/. Contact: cartwright{at}asu.edu
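
The mixture model named in the abstract can be illustrated with a minimal sketch of the log-probability of one site's read counts under a two-component Dirichlet-multinomial mixture. This is not the authors' implementation (see the repository listed under Availability for that). The function names, the four-category count layout, and every parameter value below are hypothetical, chosen only to show a tight "major" component next to an overdispersed "minor" one.

```python
from math import exp, lgamma, log


def dm_log_pmf(counts, alpha):
    """Log-probability of a count vector under a Dirichlet-multinomial.

    counts: non-negative integer read counts, one per category.
    alpha:  positive concentration parameters, same length as counts.
    """
    n = sum(counts)
    a = sum(alpha)
    # Multinomial coefficient: n! / prod(x_k!)
    out = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    # Ratio of Dirichlet normalising constants
    out += lgamma(a) - lgamma(n + a)
    out += sum(lgamma(x + k) - lgamma(k) for x, k in zip(counts, alpha))
    return out


def mixture_log_pmf(counts, weights, alphas):
    """Log-probability under a mixture of Dirichlet-multinomials.

    weights: mixing proportions (positive, summing to 1).
    alphas:  one concentration vector per component.
    """
    terms = [log(w) + dm_log_pmf(counts, a) for w, a in zip(weights, alphas)]
    m = max(terms)  # log-sum-exp for numerical stability
    return m + log(sum(exp(t - m) for t in terms))


# Hypothetical example: read counts for four categories at one site.
site_counts = (14, 15, 0, 1)
weights = (0.9, 0.1)               # assumed mixing proportions
alphas = (
    (500.0, 490.0, 5.0, 5.0),      # "major": large alphas, near-multinomial
    (6.0, 3.0, 0.5, 0.5),          # "minor": small alphas, overdispersed
)
print(mixture_log_pmf(site_counts, weights, alphas))
```

Larger concentration parameters pull a component toward a plain multinomial, while smaller ones allow the extra variance that the abstract attributes to the minor component.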