An open resource for accurately benchmarking small variant and reference calls

Justin M Zook; Jennifer McDaniel; Nathan D Olson; Justin Wagner; Hemang Parikh; Haynes Heaton; Sean A Irvine; Len Trigg; Rebecca Truty; Cory Y McLean; Francisco M De La Vega; Chunlin Xiao; Stephen Sherry; Marc Salit

doi:10.1038/s41587-019-0074-6

Nat Biotechnol. 2019 May;37(5):561-566. doi: 10.1038/s41587-019-0074-6. Epub 2019 Apr 1.

Authors

Justin M Zook¹, Jennifer McDaniel², Nathan D Olson², Justin Wagner², Hemang Parikh², Haynes Heaton³,⁴, Sean A Irvine⁵, Len Trigg⁵, Rebecca Truty⁶, Cory Y McLean⁷,⁸, Francisco M De La Vega⁹, Chunlin Xiao¹⁰, Stephen Sherry¹⁰, Marc Salit²,¹¹,¹²

Affiliations

¹ Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA. jzook@nist.gov.
² Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA.
³ 10x Genomics, Pleasanton, CA, USA.
⁴ Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK.
⁵ Real Time Genomics, Hamilton, New Zealand.
⁶ Invitae Corporation, San Francisco, CA, USA.
⁷ Verily Life Sciences, South San Francisco, CA, USA.
⁸ Google Inc., Mountain View, CA, USA.
⁹ Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA.
¹⁰ National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
¹¹ Joint Initiative for Metrology in Biology, Stanford, CA, USA.
¹² Department of Bioengineering, Stanford University, Stanford, CA, USA.

Abstract

Benchmark small variant calls are required for developing, optimizing and assessing the performance of sequencing and bioinformatics methods. Here, as part of the Genome in a Bottle (GIAB) Consortium, we apply a reproducible, cloud-based pipeline to integrate multiple short- and linked-read sequencing datasets and provide benchmark calls for human genomes. We generate benchmark calls for one previously analyzed GIAB sample, as well as six genomes from the Personal Genome Project. These new genomes have broad, open consent, making this a 'first of its kind' resource that is available to the community for multiple downstream applications. We produce 17% more benchmark single nucleotide variations, 176% more indels and 12% larger benchmark regions than previously published GIAB benchmarks. We demonstrate that this benchmark reliably identifies errors in existing callsets and highlight challenges in interpreting performance metrics when using benchmarks that are not perfect or comprehensive. Finally, we identify strengths and weaknesses of callsets by stratifying performance according to variant type and genome context.

Publication types

Research Support, N.I.H., Extramural
Research Support, N.I.H., Intramural

MeSH terms

Benchmarking*
Computational Biology / trends*
Genetic Variation / genetics
Genome, Human / genetics*
Genomics / trends*
High-Throughput Nucleotide Sequencing
Humans
INDEL Mutation / genetics
Polymorphism, Single Nucleotide
Software / trends
