bioRxiv
New Results

Simple statistical identification and removal of contaminant sequences in marker-gene and metagenomics data

Nicole M. Davis, Diana Proctor, Susan P. Holmes, David A. Relman, Benjamin J. Callahan
doi: https://doi.org/10.1101/221499
Nicole M. Davis
1Department of Microbiology and Immunology, Stanford University School of Medicine, Stanford, CA 94305
Diana Proctor
2Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305
6Department of Orofacial Sciences, University of California, San Francisco School of Dentistry, San Francisco, CA 94143
Susan P. Holmes
3Department of Statistics, Stanford University, Stanford, CA 94305
David A. Relman
1Department of Microbiology and Immunology, Stanford University School of Medicine, Stanford, CA 94305
2Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305
4Infectious Diseases Section, Veterans Affairs Palo Alto Health Care System, Palo Alto, CA 94304
Benjamin J. Callahan
5Department of Population Health and Pathobiology, College of Veterinary Medicine, North Carolina State University, Raleigh, NC 27607
  • For correspondence: benjamin.j.callahan@gmail.com

Abstract

The accuracy of microbial community surveys based on marker-gene and metagenomic sequencing (MGS) suffers from the presence of contaminants — DNA sequences not truly present in the sample. Contaminants come from a variety of sources, including reagents. Appropriate laboratory practices can reduce contamination in MGS data, but do not eliminate it. Here we introduce decontam (https://github.com/benjjneb/decontam), an open-source R package which implements a statistical classification procedure for identifying contaminants in MGS data. Contaminants are identified on the basis of two widely reproduced signatures: contaminants are more frequent in low-concentration samples, and are often found in negative controls. In a dataset from the human oral microbiome, the classification of amplicon sequence variants by decontam was strongly consistent with prior microscopic observations of microbial taxa in that environment. In both metagenomics and marker-gene measurements of a mock community dilution series, the removal of contaminants identified by decontam substantially reduced technical variation due to differences in reagents and sequencing centers. The application of decontam to two recently published datasets corroborated and extended their conclusions that little evidence existed for an indigenous placenta microbiome, and that some low-frequency taxa seemingly associated with preterm birth were run-specific contaminants. decontam integrates easily with existing MGS workflows, and allows researchers to generate more accurate profiles of microbial community composition at little to no additional cost.
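The two signatures named in the abstract can be illustrated with a toy calculation. The sketch below is not the decontam implementation or its statistical test; it is a minimal illustration, on made-up numbers, of what "more frequent in low-concentration samples" and "often found in negative controls" could look like in data. The function names (`loglog_slope`, `prevalence`) and all values are hypothetical. The first signature is summarized here, as an assumption, by the slope of log(frequency) against log(DNA concentration); the second by the fraction of negative controls versus true samples in which a sequence is detected.

```python
import math


def loglog_slope(concentrations, frequencies):
    """Least-squares slope of log(frequency) against log(DNA concentration)."""
    xs = [math.log(c) for c in concentrations]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def prevalence(presence_flags):
    """Fraction of samples in which a sequence was detected."""
    return sum(presence_flags) / len(presence_flags)


# Hypothetical data: total DNA concentration per sample (arbitrary units).
conc = [1.0, 2.0, 4.0, 8.0, 16.0]

# A contaminant-like sequence: relative frequency falls as concentration rises.
contaminant_like = [0.08 / c for c in conc]
# A sample-like sequence: relative frequency does not depend on concentration.
sample_like = [0.05 for _ in conc]

slope_contaminant = loglog_slope(conc, contaminant_like)
slope_sample = loglog_slope(conc, sample_like)

# Hypothetical presence/absence of one sequence in negative controls
# and in true samples.
in_controls = [True, True, True, False]
in_samples = [True, False, False, False, False, False, False, False]

prev_controls = prevalence(in_controls)
prev_samples = prevalence(in_samples)
```

In this toy data the contaminant-like sequence has a log-log slope close to -1 and the sample-like sequence a slope close to 0, and the example sequence is detected far more often in controls than in true samples. Turning such summaries into a classification requires a proper statistical model and score threshold, which the package itself provides.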

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Posted November 17, 2017.