bioRxiv
New Results

Comprehensive Benchmarking and Ensemble Approaches for Metagenomic Classifiers

Alexa McIntyre, Rachid Ounit, Ebrahim Afshinnekoo, Robert Prill, Elizabeth Henaff, Noah Alexander, Sam Minot, David Danko, Jonathan Foox, Sofia Ahsanuddin, Scott Tighe, Nur A Hasan, Poorani Subramanian, Kelly Moffat, Shawn Levy, Stefano Lonardi, Nick Greenfield, Rita Colwell, Gail Rosen, Christopher E Mason
doi: https://doi.org/10.1101/156919
Alexa McIntyre, Weill Cornell Medicine
Rachid Ounit, UC-Riverside
Ebrahim Afshinnekoo, Weill Cornell Medicine
Robert Prill, IBM
Elizabeth Henaff, Weill Cornell Medicine
Noah Alexander, Weill Cornell Medicine
Sam Minot, OneCodex
David Danko, Weill Cornell Medicine
Jonathan Foox, Weill Cornell Medicine
Sofia Ahsanuddin, Weill Cornell Medicine
Scott Tighe, University of Vermont
Nur A Hasan, CosmosID
Poorani Subramanian, CosmosID
Kelly Moffat, CosmosID
Shawn Levy, HudsonAlpha
Stefano Lonardi, UC-Riverside
Nick Greenfield, OneCodex
Rita Colwell, CosmosID
Gail Rosen, Drexel University
Christopher E Mason, Weill Cornell Medicine
For correspondence: chm2042@med.cornell.edu

Abstract

One of the main challenges in metagenomics is the identification of microorganisms in clinical and environmental samples. While an extensive and heterogeneous set of computational tools is available to classify microorganisms using whole genome shotgun sequencing data, comprehensive comparisons of these methods are limited. In this study, we use the largest set to date (n=35) of laboratory-generated and simulated controls, spanning 846 species, to evaluate the performance of eleven metagenomics classifiers. We also assess the effects of filtering and combining tools to reduce the number of false positives. Tools were characterized on the basis of their ability to (1) identify taxa at the genus, species, and strain levels, (2) quantify the relative abundances of taxa, and (3) classify individual reads to the species level. Strikingly, the number of species identified by the eleven tools can differ by over three orders of magnitude on the same datasets. However, various strategies can ameliorate taxonomic misclassification, including abundance filtering, ensemble approaches, and tool intersection. Indeed, combining tools that use different heuristics improves precision. Nevertheless, these strategies were often insufficient to completely eliminate false positives from environmental samples. Such false positives are especially important where they concern medically relevant species, for which customized tools may be required. The results of this study provide positive controls, titrated standards, and a guide for selecting tools for metagenomic analyses by comparing ranges of precision and recall. We show that proper experimental design and analysis parameters, including depth of sequencing, choice of classifier or classifiers, database size, and filtering, can reduce false positives, provide greater resolution of species in complex metagenomic samples, and improve the interpretation of results.
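
To make the terms in the abstract concrete, the following is a minimal sketch of abundance filtering, tool intersection, and a simple voting ensemble, scored by precision and recall against a known truth set. It is not the method used in the study: the tool names, species profiles, abundance threshold, and function names are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def abundance_filter(profile, min_abundance):
    """Keep taxa whose relative abundance is at least min_abundance."""
    return {taxon for taxon, frac in profile.items() if frac >= min_abundance}

def vote(calls_per_tool, min_tools):
    """Keep taxa reported by at least min_tools of the tools.

    min_tools == number of tools gives the intersection;
    min_tools == 1 gives the union.
    """
    counts = Counter(taxon for calls in calls_per_tool for taxon in calls)
    return {taxon for taxon, n in counts.items() if n >= min_tools}

def precision_recall(predicted, truth):
    """Precision and recall of a predicted species set against the truth set."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical truth set and per-tool relative abundance profiles.
truth = {"Escherichia coli", "Staphylococcus aureus", "Bacillus subtilis"}
profiles = {
    "tool_a": {"Escherichia coli": 0.50, "Staphylococcus aureus": 0.30,
               "Bacillus subtilis": 0.15, "Shigella flexneri": 0.05},
    "tool_b": {"Escherichia coli": 0.60, "Staphylococcus aureus": 0.30,
               "Bacillus cereus": 0.10},
    "tool_c": {"Escherichia coli": 0.40, "Staphylococcus aureus": 0.20,
               "Bacillus subtilis": 0.30, "Bacillus cereus": 0.098,
               "Shigella flexneri": 0.002},
}

# Step 1: drop low-abundance calls from each tool (threshold is an assumption).
filtered = [abundance_filter(p, min_abundance=0.01) for p in profiles.values()]

# Step 2: combine tools with increasingly permissive agreement rules.
intersection = vote(filtered, min_tools=len(filtered))
majority = vote(filtered, min_tools=2)
union = vote(filtered, min_tools=1)

for name, calls in [("intersection", intersection),
                    ("majority", majority),
                    ("union", union)]:
    precision, recall = precision_recall(calls, truth)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```

In this toy example, requiring agreement from all three tools removes every false positive but misses one true species, while looser rules recover it at the cost of precision. This is the kind of precision and recall trade-off the abstract describes, shown here on invented data only.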

Copyright 
The copyright holder for this preprint is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
  • Posted June 28, 2017.

Subject Area

  • Genomics