Skip to main content
bioRxiv
  • Home
  • About
  • Submit
  • ALERTS / RSS
Advanced Search
New Results

Over-optimism in unsupervised microbiome analysis: Insights from network learning and clustering

View ORCID ProfileTheresa Ullmann, View ORCID ProfileStefanie Peschel, Philipp Finger, View ORCID ProfileChristian L. Müller, View ORCID ProfileAnne-Laure Boulesteix
doi: https://doi.org/10.1101/2022.06.24.497500
Theresa Ullmann
1Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig- Maximilians-Universität München, München, Germany
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Theresa Ullmann
  • For correspondence: tullmann@ibe.med.uni-muenchen.de
Stefanie Peschel
2Institute for Asthma and Allergy Prevention, Helmholtz Zentrum München, Neuherberg, Germany
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Stefanie Peschel
Philipp Finger
1Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig- Maximilians-Universität München, München, Germany
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Christian L. Müller
3Department of Statistics, Ludwig-Maximilians-Universität München, München, Germany
4Institute of Computational Biology, Helmholtz Zentrum München, Neuherberg, Germany
5Center for Computational Mathematics, Flatiron Institute, New York, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Christian L. Müller
Anne-Laure Boulesteix
1Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig- Maximilians-Universität München, München, Germany
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Anne-Laure Boulesteix
  • Abstract
  • Full Text
  • Info/History
  • Metrics
  • Supplementary material
  • Data/Code
  • Preview PDF
Loading

Abstract

In recent years, unsupervised analysis of microbiome data, such as microbial network analysis and clustering, has increased in popularity. Many new statistical and computational methods have been proposed for these tasks. This multiplicity of analysis strategies poses a challenge for researchers, who are often unsure which method(s) to use and might be tempted to try different methods on their dataset to look for the “best” ones. However, if only the best results are selectively reported, this may cause over-optimism: the “best” method is overly fitted to the specific dataset, and the results might be non-replicable on validation data. Such effects will ultimately hinder research progress. Yet so far, these topics have been given little attention in the context of unsupervised microbiome analysis. In our illustrative study, we aim to quantify over-optimism effects in this context. We model the approach of a hypothetical microbiome researcher who undertakes three unsupervised research tasks: clustering of bacterial genera, hub detection in microbial networks, and differential microbial network analysis. While these tasks are unsupervised, the researcher might still have certain expectations as to what constitutes interesting results. We translate these expectations into concrete evaluation criteria that the hypothetical researcher might want to optimize. We then randomly split an exemplary dataset from the American Gut Project into discovery and validation sets multiple times. For each research task, multiple method combinations (e.g., methods for data normalization, network generation, and/or clustering) are tried on the discovery data, and the combination that yields the best result according to the evaluation criterion is chosen. While the hypothetical researcher might only report this result, we also apply the “best” method combination to the validation dataset. The results are then compared between discovery and validation data. In all three research tasks, there are notable over-optimism effects; the results on the validation data set are worse compared to the discovery data, averaged over multiple random splits into discovery/validation data. Our study thus highlights the importance of validation and replication in microbiome analysis to obtain reliable results and demonstrates that the issue of over-optimism goes beyond the context of statistical testing and fishing for significance.

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

  • https://github.com/thullmann/overoptimism-microbiome

  • https://doi.org/10.5281/zenodo.6652711

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.
Back to top
PreviousNext
Posted June 28, 2022.
Download PDF

Supplementary Material

Data/Code
Email

Thank you for your interest in spreading the word about bioRxiv.

NOTE: Your email address is requested solely to identify you as the sender of this article.

Enter multiple addresses on separate lines or separate them with commas.
Over-optimism in unsupervised microbiome analysis: Insights from network learning and clustering
(Your Name) has forwarded a page to you from bioRxiv
(Your Name) thought you would like to see this page from the bioRxiv website.
CAPTCHA
This question is for testing whether or not you are a human visitor and to prevent automated spam submissions.
Share
Over-optimism in unsupervised microbiome analysis: Insights from network learning and clustering
Theresa Ullmann, Stefanie Peschel, Philipp Finger, Christian L. Müller, Anne-Laure Boulesteix
bioRxiv 2022.06.24.497500; doi: https://doi.org/10.1101/2022.06.24.497500
Reddit logo Twitter logo Facebook logo LinkedIn logo Mendeley logo
Citation Tools
Over-optimism in unsupervised microbiome analysis: Insights from network learning and clustering
Theresa Ullmann, Stefanie Peschel, Philipp Finger, Christian L. Müller, Anne-Laure Boulesteix
bioRxiv 2022.06.24.497500; doi: https://doi.org/10.1101/2022.06.24.497500

Citation Manager Formats

  • BibTeX
  • Bookends
  • EasyBib
  • EndNote (tagged)
  • EndNote 8 (xml)
  • Medlars
  • Mendeley
  • Papers
  • RefWorks Tagged
  • Ref Manager
  • RIS
  • Zotero
  • Tweet Widget
  • Facebook Like
  • Google Plus One

Subject Area

  • Bioinformatics
Subject Areas
All Articles
  • Animal Behavior and Cognition (4232)
  • Biochemistry (9128)
  • Bioengineering (6774)
  • Bioinformatics (23989)
  • Biophysics (12117)
  • Cancer Biology (9523)
  • Cell Biology (13772)
  • Clinical Trials (138)
  • Developmental Biology (7627)
  • Ecology (11686)
  • Epidemiology (2066)
  • Evolutionary Biology (15504)
  • Genetics (10638)
  • Genomics (14322)
  • Immunology (9477)
  • Microbiology (22831)
  • Molecular Biology (9089)
  • Neuroscience (48960)
  • Paleontology (355)
  • Pathology (1480)
  • Pharmacology and Toxicology (2568)
  • Physiology (3844)
  • Plant Biology (8327)
  • Scientific Communication and Education (1471)
  • Synthetic Biology (2296)
  • Systems Biology (6186)
  • Zoology (1300)