bioRxiv
New Results

Simple statistical identification and removal of contaminant sequences in marker-gene and metagenomics data

Nicole M. Davis, Diana M. Proctor, Susan P. Holmes, David A. Relman, Benjamin J. Callahan
doi: https://doi.org/10.1101/221499
Nicole M. Davis
1 Department of Microbiology and Immunology, Stanford University School of Medicine, Stanford, CA 94305, USA
Diana M. Proctor
2 Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
3 Department of Orofacial Sciences, University of California, San Francisco School of Dentistry, San Francisco, CA 94143, USA
Susan P. Holmes
4 Department of Statistics, Stanford University, Stanford, CA 94305, USA
David A. Relman
1 Department of Microbiology and Immunology, Stanford University School of Medicine, Stanford, CA 94305, USA
2 Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
5 Infectious Diseases Section, Veterans Affairs Palo Alto Health Care System, Palo Alto, CA 94304, USA
Benjamin J. Callahan
6 Department of Population Health and Pathobiology, College of Veterinary Medicine, North Carolina State University, Raleigh, NC 27607, USA
7 Bioinformatics Research Center, North Carolina State University, Raleigh, NC 27695, USA
  • For correspondence: benjamin.j.callahan@gmail.com

Abstract

Background: The accuracy of microbial community surveys based on marker-gene and metagenomic sequencing (MGS) suffers from the presence of contaminants, that is, DNA sequences not truly present in the sample. Contaminants come from various sources, including reagents. Appropriate laboratory practices can reduce contamination but do not eliminate it. Here we introduce decontam (https://github.com/benjjneb/decontam), an open-source R package that implements a statistical classification procedure for identifying contaminants in MGS data based on two widely reproduced patterns: contaminants appear at higher frequencies in low-concentration samples, and they are often found in negative controls.

Results: decontam classified amplicon sequence variants (ASVs) in a human oral dataset in a manner consistent with prior microscopic observations of the microbial taxa inhabiting that environment and with previous reports of contaminant taxa. In metagenomics and marker-gene measurements of a dilution series, decontam substantially reduced technical variation arising from different sequencing protocols. Applying decontam to two recently published datasets corroborated and extended their conclusions that little evidence existed for an indigenous placenta microbiome, and that some low-frequency taxa seemingly associated with preterm birth were contaminants.

Conclusions: decontam improves the quality of metagenomic and marker-gene sequencing by identifying and removing contaminant DNA sequences. decontam integrates easily with existing MGS workflows and allows researchers to generate more accurate profiles of microbial communities at little to no additional cost.
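
The two patterns named in the Background can be illustrated with a small sketch. This is not decontam's implementation (decontam is an R package, and its actual test statistics are not described in this abstract); it is a toy Python illustration under stated assumptions. The function names `fits_contaminant_pattern` and `prevalence_gap`, the fixed-slope log-log comparison, and the example numbers are all hypothetical choices made for this illustration.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def fits_contaminant_pattern(freqs, concs):
    """Toy check of the frequency pattern (an assumption, not decontam's code).

    Compares two fixed-slope models on the log-log scale:
      - contaminant-like: log(freq) = a - log(conc)  (frequency falls as DNA concentration rises)
      - sample-like:      log(freq) = b              (frequency independent of concentration)
    Returns True when the contaminant-like model has the smaller residual sum of squares.
    """
    pairs = [(math.log(f), math.log(c)) for f, c in zip(freqs, concs) if f > 0 and c > 0]
    if len(pairs) < 2:
        raise ValueError("need at least two samples with positive frequency and concentration")

    # Contaminant-like model: the best intercept is the mean of log(freq) + log(conc).
    shifted = [lf + lc for lf, lc in pairs]
    rss_contaminant = sum((s - _mean(shifted)) ** 2 for s in shifted)

    # Sample-like model: the best intercept is the mean of log(freq).
    log_freqs = [lf for lf, _ in pairs]
    rss_flat = sum((lf - _mean(log_freqs)) ** 2 for lf in log_freqs)

    return rss_contaminant < rss_flat


def prevalence_gap(present_in_controls, present_in_samples):
    """Toy check of the prevalence pattern: fraction of negative controls containing
    the sequence minus the fraction of true samples containing it.
    A clearly positive gap is suggestive of a contaminant."""
    in_controls = sum(present_in_controls) / len(present_in_controls)
    in_samples = sum(present_in_samples) / len(present_in_samples)
    return in_controls - in_samples


if __name__ == "__main__":
    concs = [1.0, 2.0, 4.0, 8.0, 16.0]          # hypothetical DNA concentrations
    inverse = [0.1 / c for c in concs]           # frequency inversely proportional to concentration
    steady = [0.20, 0.21, 0.19, 0.20, 0.20]      # frequency roughly constant
    print(fits_contaminant_pattern(inverse, concs))
    print(fits_contaminant_pattern(steady, concs))
    print(prevalence_gap([True, True, False], [False, False, True, False]))
```

In practice the package itself would be used rather than a hand-rolled check like this; the sketch only makes concrete what "higher frequencies in low-concentration samples" and "often found in negative controls" mean as measurable signals.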

  • List of abbreviations

    MGS: marker-gene and metagenomic sequencing
    ASV: amplicon sequence variant
    MAG: metagenome-assembled genome
    OTU: operational taxonomic unit
  • Copyright 
    The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
    Posted July 25, 2018.