bioRxiv
New Results

Interpreting tree ensemble machine learning models with endoR

Albane Ruaud, Niklas Pfister, Ruth E Ley, Nicholas D Youngblut
doi: https://doi.org/10.1101/2022.01.03.474763
Albane Ruaud
(a) Max Planck Institute for Developmental Biology, Department of Microbiome Science, Tuebingen, Germany
Niklas Pfister
(b) University of Copenhagen, Department of Mathematical Sciences, Copenhagen, Denmark
Ruth E Ley
(a) Max Planck Institute for Developmental Biology, Department of Microbiome Science, Tuebingen, Germany
Nicholas D Youngblut
(a) Max Planck Institute for Developmental Biology, Department of Microbiome Science, Tuebingen, Germany
  • For correspondence: nicholas.youngblut@tuebingen.mpg.de

Background: Tree ensemble machine learning models are increasingly used in microbiome science because they are compatible with the compositional, high-dimensional, and sparse structure of sequence-based microbiome data. While such models are often good at predicting phenotypes from microbiome data, they yield only limited insight into how microbial taxa or genomic content may be associated.
Results: We developed endoR, a method to interpret a fitted tree ensemble model. First, endoR simplifies the fitted model into a decision ensemble. It then extracts information on the importance of individual features and their pairwise interactions, and visualizes these data as an interpretable network. Both the network and the importance scores derived from endoR provide insights into how features, and interactions between them, contribute to the predictive performance of the fitted model. Adjustable regularization and bootstrapping help reduce complexity and ensure that only essential parts of the model are retained. We assessed the performance of endoR on both simulated and real metagenomic data. We found that endoR infers true associations with accuracy comparable to or higher than that of other commonly used approaches, while easing and enhancing model interpretation. Using endoR, we also confirmed published results on gut microbiome differences between cirrhotic and healthy individuals. Finally, we used endoR to gain insights into components of the microbiome that predict the presence of human gut methanogens, as these hydrogen consumers are expected to interact with fermenting bacteria in a complex syntrophic network. Specifically, we analyzed a global metagenome dataset of 2203 individuals and confirmed the previously reported association between Methanobacteriaceae and Christensenellales. Additionally, we observed that Methanobacteriaceae are associated with a network of hydrogen-producing bacteria.
Conclusion: Our method accurately captures how tree ensembles use features and interactions between them to predict a response. As demonstrated by our applications, the resultant visualizations and summary outputs facilitate model interpretation and enable the generation of novel hypotheses about complex systems. An implementation of endoR is available as an open-source R-package on GitHub (https://github.com/leylabmpi/endoR).
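
To make the general idea in the abstract concrete (turning a tree ensemble into decision rules, then summarizing which features and feature pairs those rules use), here is a minimal Python sketch. It is not endoR's algorithm or API: endoR is an R package, and its regularization, bootstrapping, and importance definitions are not reproduced here. The toy trees, the sample values, the names `extract_rules`, `satisfies`, and `summarize`, and the choice to weight each rule by the fraction of samples it covers are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy ensemble: an internal node is (feature, threshold, left, right);
# a leaf is a plain float prediction. Feature names are made up.
trees = [
    ("taxon_A", 0.5, 0.0, ("taxon_B", 0.2, 0.0, 1.0)),
    ("taxon_B", 0.2, 0.0, ("taxon_C", 0.1, 1.0, 0.0)),
]

# Hypothetical relative abundances for four samples.
samples = [
    {"taxon_A": 0.9, "taxon_B": 0.5, "taxon_C": 0.0},
    {"taxon_A": 0.1, "taxon_B": 0.5, "taxon_C": 0.3},
    {"taxon_A": 0.8, "taxon_B": 0.1, "taxon_C": 0.0},
    {"taxon_A": 0.7, "taxon_B": 0.4, "taxon_C": 0.2},
]


def extract_rules(node, path=()):
    """Yield (conditions, prediction) for every root-to-leaf path of a tree."""
    if not isinstance(node, tuple):  # leaf reached
        yield path, node
        return
    feature, threshold, left, right = node
    yield from extract_rules(left, path + ((feature, "<=", threshold),))
    yield from extract_rules(right, path + ((feature, ">", threshold),))


def satisfies(sample, conditions):
    """Return True if the sample meets every condition of a rule."""
    return all(
        sample[f] <= t if op == "<=" else sample[f] > t
        for f, op, t in conditions
    )


def summarize(trees, samples):
    """Accumulate support-weighted counts for features and feature pairs."""
    feature_weight, pair_weight = Counter(), Counter()
    for tree in trees:
        for conditions, _prediction in extract_rules(tree):
            # Support: fraction of samples covered by this rule.
            support = sum(satisfies(s, conditions) for s in samples) / len(samples)
            features = sorted({f for f, _, _ in conditions})
            for f in features:
                feature_weight[f] += support
            # Features used together in one rule become a weighted network edge.
            for pair in combinations(features, 2):
                pair_weight[pair] += support
    return feature_weight, pair_weight


feature_weight, pair_weight = summarize(trees, samples)
print(dict(feature_weight))
print(dict(pair_weight))
```

In this sketch the entries of `pair_weight` can be read as weighted edges of a feature network, and `feature_weight` as node weights. A real analysis would start from a fitted model rather than hand-written trees.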

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

  • https://github.com/leylabmpi/endoR

  • https://github.com/aruaud/endoR_data_analysis

  • Abbreviations

    RF: random forest
    FS: feature selection
    CV: cross-validation
    TP: true positive
    TN: true negative
    FP: false positive
    FN: false negative
    BMI: body mass index
    ML: machine learning
    FSD: fully simulated dataset
    AP: artificial phenotype
    DNA: deoxyribonucleic acid
  • Copyright 
    The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.
    Posted January 04, 2022.

    Subject Area

    • Bioinformatics