bioRxiv
New Results

Benchmarking challenging small variants with linked and long reads

Justin Wagner, Nathan D Olson, Lindsay Harris, Ziad Khan, Jesse Farek, Medhat Mahmoud, Ana Stankovic, Vladimir Kovacevic, Aaron M Wenger, William J Rowell, Chunlin Xiao, Byunggil Yoo, Neil Miller, Jeffrey A. Rosenfeld, Bohan Ni, Samantha Zarate, Melanie Kirsche, Sergey Aganezov, Michael Schatz, Giuseppe Narzisi, Marta Byrska-Bishop, Wayne Clarke, Uday S. Evani, Charles Markello, Kishwar Shafin, Xin Zhou, Arend Sidow, Vikas Bansal, Alvaro Martinez Barrio, Ian T Fiddes, Arkarachai Fungtammasan, Chen-Shan Chin, Fritz J Sedlazeck, Andrew Carroll, Marc Salit, Justin M Zook
doi: https://doi.org/10.1101/2020.07.24.212712
Justin Wagner
1 Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Dr, MS8312, Gaithersburg, MD 20899, USA
Nathan D Olson
1 Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Dr, MS8312, Gaithersburg, MD 20899, USA
Lindsay Harris
1 Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Dr, MS8312, Gaithersburg, MD 20899, USA
Ziad Khan
2 Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston TX 77030, USA
Jesse Farek
2 Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston TX 77030, USA
Medhat Mahmoud
2 Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston TX 77030, USA
Ana Stankovic
3 Seven Bridges, Omladinskih brigada 90g, 11070 Belgrade, Republic of Serbia
Vladimir Kovacevic
4 School of Electrical Engineering, University in Belgrade, Bulevar kralja Aleksandra 73, Belgrade, Republic of Serbia
Aaron M Wenger
5 Pacific Biosciences, Menlo Park, CA 94025, USA
William J Rowell
5 Pacific Biosciences, Menlo Park, CA 94025, USA
Chunlin Xiao
6 National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
Byunggil Yoo
7 Children’s Mercy Kansas City, Kansas City, MO, USA
Neil Miller
7 Children’s Mercy Kansas City, Kansas City, MO, USA
Jeffrey A. Rosenfeld
8 Rutgers Cancer Institute of New Jersey, New Brunswick, NJ, USA
Bohan Ni
9 Department of Computer Science, Johns Hopkins University, Baltimore, USA
Samantha Zarate
9 Department of Computer Science, Johns Hopkins University, Baltimore, USA
Melanie Kirsche
9 Department of Computer Science, Johns Hopkins University, Baltimore, USA
Sergey Aganezov
9 Department of Computer Science, Johns Hopkins University, Baltimore, USA
Michael Schatz
9 Department of Computer Science, Johns Hopkins University, Baltimore, USA
Giuseppe Narzisi
10 New York Genome Center, 101 Avenue of the Americas, New York, NY, USA
Marta Byrska-Bishop
10 New York Genome Center, 101 Avenue of the Americas, New York, NY, USA
Wayne Clarke
10 New York Genome Center, 101 Avenue of the Americas, New York, NY, USA
Uday S. Evani
10 New York Genome Center, 101 Avenue of the Americas, New York, NY, USA
Charles Markello
11 University of California at Santa Cruz Genomics Institute, 1156 High Street, Santa Cruz, CA, USA
Kishwar Shafin
11 University of California at Santa Cruz Genomics Institute, 1156 High Street, Santa Cruz, CA, USA
Xin Zhou
12 Department of Computer Science, Stanford University, Stanford, CA 94305, USA
Arend Sidow
13 Departments of Pathology and of Genetics, Stanford University, Stanford CA 94305 USA
Vikas Bansal
14 Department of Pediatrics, University of California San Diego, La Jolla CA 92093, USA
Alvaro Martinez Barrio
15 10x Genomics, Pleasanton, CA 94588, USA
Ian T Fiddes
15 10x Genomics, Pleasanton, CA 94588, USA
Arkarachai Fungtammasan
16 DNAnexus, Inc., Mountain View, CA, 94040, USA
Chen-Shan Chin
16 DNAnexus, Inc., Mountain View, CA, 94040, USA
Fritz J Sedlazeck
2 Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston TX 77030, USA
Andrew Carroll
17 Google Inc, 1600 Amphitheatre Pkwy, Mountain View, CA 94040, USA
Marc Salit
18 Joint Initiative for Metrology in Biology, SLAC National Laboratory, Stanford, CA, USA
Justin M Zook
1 Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Dr, MS8312, Gaithersburg, MD 20899, USA
  • For correspondence: jzook@nist.gov

Abstract

Genome in a Bottle (GIAB) benchmarks have been widely used to validate clinical sequencing pipelines and develop new variant calling and sequencing methods. Here we use accurate long and linked reads to expand the prior benchmark to include difficult-to-map regions and segmental duplications that are not readily accessible to short reads. Our new benchmark adds more than 300,000 SNVs, 50,000 indels, and 16% new exonic variants, many in challenging, clinically relevant genes not previously covered (e.g., PMS2). We increase coverage of the GRCh38 assembly from 85% to 92%, while excluding problematic regions for benchmarking small variants (e.g., copy number variants and assembly errors) that should not have been included in the previous version. Our new benchmark reliably identifies both false positives and false negatives across multiple short-, linked-, and long-read-based variant calling methods. As an example of its utility, this benchmark identifies eight times more false negatives in a short-read variant call set relative to our previous benchmark, mostly in difficult-to-map regions. To enable robust small variant benchmarking, we still exclude 3.6% of GRCh37 and 5.0% of GRCh38 in (1) highly repetitive regions, such as large, highly similar segmental duplications and the centromere, that are not accessible to our data and (2) regions where our sample is highly divergent from the reference due to large indels, structural variation, copy number variation, and/or errors in the reference (e.g., some KIR genes that have duplications in HG002). We have demonstrated the utility of this benchmark to assess performance in more challenging regions, which enables benchmarking in more difficult genes and continued technology and bioinformatics development.
The benchmarks are available at:
  • ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/NISTv4.1/
  • ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_v4.2_SmallVariantDraftBenchmark_07092020/
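The coverage figures quoted in the abstract (for example, 85% to 92% of GRCh38) are fractions of the reference that fall inside the benchmark regions. As a rough illustration only, and not the authors' pipeline, the sketch below computes such a fraction from BED-style intervals. It assumes tab-separated lines with 0-based, half-open coordinates; the function names, the toy intervals, and the genome length of 200 are all hypothetical.

```python
from typing import Iterable, List, Tuple

# (chromosome, start, end) with BED-style 0-based, half-open coordinates.
Interval = Tuple[str, int, int]


def parse_bed(lines: Iterable[str]) -> List[Interval]:
    """Parse the first three columns of tab-separated BED lines."""
    intervals: List[Interval] = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and common BED header or comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        intervals.append((chrom, int(start), int(end)))
    return intervals


def covered_bases(intervals: Iterable[Interval]) -> int:
    """Count bases covered by at least one interval, counting overlaps once."""
    total = 0
    current_chrom = None
    current_end = 0
    for chrom, start, end in sorted(intervals):
        if chrom != current_chrom:
            current_chrom, current_end = chrom, 0
        # Clip away the part already counted on this chromosome.
        start = max(start, current_end)
        if end > start:
            total += end - start
            current_end = end
    return total


def fraction_covered(intervals: Iterable[Interval], genome_length: int) -> float:
    """Fraction of a reference of the given length covered by the intervals."""
    return covered_bases(intervals) / genome_length


# Toy example (hypothetical data, not taken from the benchmark files).
toy = parse_bed(["chr1\t0\t100", "chr1\t50\t150", "chr2\t10\t20"])
print(covered_bases(toy), fraction_covered(toy, 200))
```

In practice the denominator matters: the result depends on whether the genome length counts all reference bases or only non-gap bases, and this sketch leaves that choice to the caller.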

Competing Interest Statement

AMW and WJR are employees and shareholders of Pacific Biosciences. AMB and ITF were employees and shareholders of 10X Genomics. FJS has received sponsored travel from Oxford Nanopore and Pacific Biosciences, and received a 2018 sequencing grant from Pacific Biosciences. AS and VK are employees of Seven Bridges. AC is an employee of Google Inc. and is a former employee of DNAnexus. AF and C-SC are employees of DNAnexus. SMES is an employee of Roche.

Footnotes

  • # The work was jointly supervised by these authors

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.
Posted July 25, 2020.
Subject Area

  • Genomics