bioRxiv
New Results

Reducing INDEL calling errors in whole-genome and exome sequencing data

Han Fang, Yiyang Wu, Giuseppe Narzisi, Jason A. O’Rawe, Laura T. Jimenez Barrón, Julie Rosenbaum, Michael Ronemus, Ivan Iossifov, Michael C. Schatz, Gholson J. Lyon
doi: https://doi.org/10.1101/006148
Han Fang
1Stanley Institute for Cognitive Genomics, One Bungtown Road, Cold Spring Harbor Laboratory, NY, USA;
2Stony Brook University, 100 Nicolls Rd, Stony Brook, NY, USA;
3Simons Center for Quantitative Biology, One Bungtown Road, Cold Spring Harbor Laboratory, NY, USA;
Yiyang Wu
1Stanley Institute for Cognitive Genomics, One Bungtown Road, Cold Spring Harbor Laboratory, NY, USA;
2Stony Brook University, 100 Nicolls Rd, Stony Brook, NY, USA;
Giuseppe Narzisi
3Simons Center for Quantitative Biology, One Bungtown Road, Cold Spring Harbor Laboratory, NY, USA;
4New York Genome Center, New York, NY;
Jason A. O’Rawe
1Stanley Institute for Cognitive Genomics, One Bungtown Road, Cold Spring Harbor Laboratory, NY, USA;
2Stony Brook University, 100 Nicolls Rd, Stony Brook, NY, USA;
Laura T. Jimenez Barrón
1Stanley Institute for Cognitive Genomics, One Bungtown Road, Cold Spring Harbor Laboratory, NY, USA;
5Centro de Ciencias Genomicas, Universidad Nacional Autonoma de Mexico, Cuernavaca, Morelos, MX;
Julie Rosenbaum
3Simons Center for Quantitative Biology, One Bungtown Road, Cold Spring Harbor Laboratory, NY, USA;
Michael Ronemus
3Simons Center for Quantitative Biology, One Bungtown Road, Cold Spring Harbor Laboratory, NY, USA;
Ivan Iossifov
3Simons Center for Quantitative Biology, One Bungtown Road, Cold Spring Harbor Laboratory, NY, USA;
Michael C. Schatz
3Simons Center for Quantitative Biology, One Bungtown Road, Cold Spring Harbor Laboratory, NY, USA;
Gholson J. Lyon
1Stanley Institute for Cognitive Genomics, One Bungtown Road, Cold Spring Harbor Laboratory, NY, USA;
2Stony Brook University, 100 Nicolls Rd, Stony Brook, NY, USA;

Abstract

Background INDELs, especially those disrupting protein-coding regions of the genome, have been strongly associated with human diseases. However, INDEL variant calling remains error-prone, with errors driven by library preparation, sequencing biases, and algorithm artifacts.

Methods We characterized whole genome sequencing (WGS), whole exome sequencing (WES), and PCR-free sequencing data from the same samples to investigate the sources of INDEL errors. We also developed a classification scheme based on coverage and composition to rank high- and low-quality INDEL calls. We performed a large-scale validation experiment on 600 loci and found that high-quality INDELs have a substantially lower error rate than low-quality INDELs (7% vs. 51%).

Results Simulation and experimental data show that assembly-based callers are significantly more sensitive and robust for detecting large INDELs (>5 bp) than alignment-based callers, consistent with published data. The concordance of INDEL detection between WGS and WES is low (52%), and WGS data uniquely identify 10.8-fold more high-quality INDELs. The validation rate for WGS-specific INDELs is also much higher than that for WES-specific INDELs (85% vs. 54%), and WES misses many large INDELs. In addition, the concordance of INDEL detection between standard WGS and PCR-free sequencing is 71%, and standard WGS data uniquely identify 6.3-fold more low-quality INDELs. Furthermore, accurate detection of heterozygous INDELs with Scalpel requires 1.2-fold higher coverage than detection of homozygous INDELs. Lastly, homopolymer A/T INDELs are a major source of low-quality INDEL calls, and they are highly enriched in the WES data.
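
To make the concordance figures above concrete, the following is a minimal sketch of how concordance between two INDEL call sets could be computed. It assumes, as an illustration only, that two calls match when chromosome, position, reference allele, and alternate allele are identical, and that concordance is the size of the intersection divided by the size of the union; the abstract does not state the exact definition used. The `Indel` type, the function names, and the example calls are hypothetical.

```python
from typing import Iterable, NamedTuple, Set


class Indel(NamedTuple):
    """Hypothetical minimal INDEL record (VCF-style fields)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def concordance(calls_a: Iterable[Indel], calls_b: Iterable[Indel]) -> float:
    """Fraction of all distinct calls that appear in both sets.

    Assumption: exact match on (chrom, pos, ref, alt) and an
    intersection-over-union definition of concordance.
    """
    a, b = set(calls_a), set(calls_b)
    union = a | b
    if not union:
        return 0.0  # no calls at all: define concordance as 0
    return len(a & b) / len(union)


def unique_to(calls_a: Iterable[Indel], calls_b: Iterable[Indel]) -> Set[Indel]:
    """Calls present in the first set but absent from the second."""
    return set(calls_a) - set(calls_b)


# Made-up example calls, not data from the study.
wgs_calls = [
    Indel("chr1", 100, "A", "AT"),
    Indel("chr1", 200, "ACG", "A"),
    Indel("chr2", 50, "G", "GTT"),
]
wes_calls = [
    Indel("chr1", 100, "A", "AT"),
    Indel("chr3", 10, "CT", "C"),
]

shared_fraction = concordance(wgs_calls, wes_calls)
wgs_specific = unique_to(wgs_calls, wes_calls)
wes_specific = unique_to(wes_calls, wgs_calls)
print(f"concordance: {shared_fraction:.2f}")
print(f"WGS-specific: {len(wgs_specific)}, WES-specific: {len(wes_specific)}")
```

In this toy example one of four distinct calls is shared, so the concordance is 0.25. A position-based match that tolerates small shifts in the reported coordinate would give a higher value than exact matching, which is one reason the matching rule needs to be stated alongside any concordance figure.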

Conclusions Overall, we show that the accuracy of INDEL detection with WGS is much greater than with WES, even in the targeted region. We calculated that 60X WGS depth of coverage from the HiSeq platform is needed to recover 95% of INDELs detected by Scalpel. While this is higher than current sequencing practice, the deeper coverage may reduce total project costs because of the greater accuracy and sensitivity. Finally, we investigate sources of INDEL errors (e.g. capture deficiency, PCR amplification, homopolymers) using various data sets, and the results can serve as a guideline for effectively reducing INDEL errors in genome sequencing.

  • List of abbreviations used

    INDELs: insertions and deletions
    WGS: whole genome sequencing
    WES: whole exome sequencing
    NGS: next-generation sequencing
    bp: base pair
    PCR: polymerase chain reaction
    STR: short tandem repeats
    poly-A: homopolymer A
    poly-C: homopolymer C
    poly-G: homopolymer G
    poly-T: homopolymer T
    other STR: short tandem repeats except homopolymers
    poly-A/T: homopolymer A or T
  • Copyright 
    The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
    Posted September 17, 2014.