bioRxiv
New Results

binomialRF: Interpretable combinatoric efficiency of random forests to identify biomarker interactions

Samir Rachid Zaim, Colleen Kenost, Joanne Berghout, Wesley Chiu, Liam Wilson, Hao Helen Zhang, Yves A. Lussier
doi: https://doi.org/10.1101/681973
Samir Rachid Zaim
1Center for Biomedical Informatics and Biostatistics, University of Arizona Health Sciences, 1230 N. Cherry Ave, Tucson, AZ, 85721, USA
2The Graduate Interdisciplinary Program in Statistics, 617 N. Santa Rita Ave. The University of Arizona, Tucson, AZ 85721, USA
3College of Medicine, Tucson, 1501 N. Campbell Ave, Tucson, AZ, 85721, USA
Colleen Kenost
1Center for Biomedical Informatics and Biostatistics, University of Arizona Health Sciences, 1230 N. Cherry Ave, Tucson, AZ, 85721, USA
3College of Medicine, Tucson, 1501 N. Campbell Ave, Tucson, AZ, 85721, USA
Joanne Berghout
1Center for Biomedical Informatics and Biostatistics, University of Arizona Health Sciences, 1230 N. Cherry Ave, Tucson, AZ, 85721, USA
3College of Medicine, Tucson, 1501 N. Campbell Ave, Tucson, AZ, 85721, USA
Wesley Chiu
1Center for Biomedical Informatics and Biostatistics, University of Arizona Health Sciences, 1230 N. Cherry Ave, Tucson, AZ, 85721, USA
3College of Medicine, Tucson, 1501 N. Campbell Ave, Tucson, AZ, 85721, USA
Liam Wilson
1Center for Biomedical Informatics and Biostatistics, University of Arizona Health Sciences, 1230 N. Cherry Ave, Tucson, AZ, 85721, USA
3College of Medicine, Tucson, 1501 N. Campbell Ave, Tucson, AZ, 85721, USA
Hao Helen Zhang
1Center for Biomedical Informatics and Biostatistics, University of Arizona Health Sciences, 1230 N. Cherry Ave, Tucson, AZ, 85721, USA
2The Graduate Interdisciplinary Program in Statistics, 617 N. Santa Rita Ave. The University of Arizona, Tucson, AZ 85721, USA
4Department of Mathematics, College of Sciences, 617 N. Santa Rita Ave. The University of Arizona, Tucson, AZ 85721, USA
  • For correspondence: hzhang@math.arizona.edu yves@email.arizona.edu
Yves A. Lussier
1Center for Biomedical Informatics and Biostatistics, University of Arizona Health Sciences, 1230 N. Cherry Ave, Tucson, AZ, 85721, USA
2The Graduate Interdisciplinary Program in Statistics, 617 N. Santa Rita Ave. The University of Arizona, Tucson, AZ 85721, USA
3College of Medicine, Tucson, 1501 N. Campbell Ave, Tucson, AZ, 85721, USA
5The Center for Applied Genetic and Genomic Medicine, 1295 N. Martin, Tucson, AZ 85721, USA
6The University of Arizona Cancer Center, 3838 N. Campbell Ave, Tucson, AZ 85721, USA
7The University of Arizona BIO5 Institute, 1657 E. Helen Street, Tucson, AZ 85721, USA
  • For correspondence: hzhang@math.arizona.edu yves@email.arizona.edu

Abstract

Background In this era of data science-driven bioinformatics, machine learning research has focused on feature selection because users want more interpretable models and post-hoc analyses for biomarker detection. However, when a study has more features (e.g., transcripts) than samples (e.g., mice or human samples), biomarker detection faces major statistical challenges because traditional statistical techniques are underpowered in high dimensions. Second- and third-order interactions among these features add a substantial combinatorial challenge. In computational biology, random forest [1] (RF) classifiers are widely used [2–7] because of their flexibility, strong performance, robustness to the “P predictors ≫ N subjects” problem, and ability to rank features. We propose binomialRF, a feature selection technique for RFs that provides an alternative interpretation of features using a correlated binomial distribution and scales efficiently to analyze multiway interactions.
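
To give a sense of the combinatoric challenge described above, the short sketch below counts the candidate second- and third-order interactions for a few feature counts. The feature counts and the helper name `n_interactions` are illustrative assumptions, not values taken from the study.

```python
from math import comb


def n_interactions(n_features: int, order: int) -> int:
    """Number of distinct unordered feature sets of the given order."""
    return comb(n_features, order)


# Hypothetical feature counts, chosen only to illustrate the growth rate.
for n_features in (100, 1_000, 10_000):
    pairs = n_interactions(n_features, 2)
    triples = n_interactions(n_features, 3)
    print(f"P={n_features}: {pairs} pairs, {triples} triples")
```

Even at a modest 1,000 features there are already hundreds of thousands of candidate pairs, which is why exhaustive testing of interactions becomes impractical as the feature count grows.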

Methods binomialRF treats each tree in an RF as a correlated but exchangeable binary trial. It determines importance by constructing a test statistic based on a feature’s selection frequency, from which it computes the feature’s rank, nominal p-value, and multiplicity-adjusted q-value using a one-sided hypothesis test with a correlated binomial distribution. A distributional adjustment addresses the co-dependencies among trees, which arise because the trees subsample from the same dataset. The proposed algorithm efficiently identifies multiway nonlinear interactions by generalizing the test statistic to count sub-trees.
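
The selection-frequency test described above can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the authors' implementation: trees are treated as independent trials (a plain binomial rather than the correlated binomial the method uses), the null probability that a feature is selected in a tree is assumed to be 1/P, and the Benjamini-Hochberg procedure stands in for the multiplicity adjustment. The function names and the `root_features` data are hypothetical.

```python
from collections import Counter
from math import comb


def binomial_upper_tail(k, n, p):
    """Exact one-sided p-value P(X >= k) for X ~ Binomial(n, p)."""
    tail = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))
    return min(1.0, tail)  # guard against floating-point overshoot


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted q-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def selection_frequency_test(root_features, n_features):
    """Test each feature's selection count against the assumed 1/P null.

    root_features holds, for each tree, the index of the feature it selected.
    Independence between trees is assumed here; the correlation adjustment
    described in the text is deliberately omitted.
    """
    n_trees = len(root_features)
    p_null = 1.0 / n_features
    counts = Counter(root_features)
    pvalues = [
        binomial_upper_tail(counts.get(j, 0), n_trees, p_null)
        for j in range(n_features)
    ]
    qvalues = benjamini_hochberg(pvalues)
    return [
        (j, counts.get(j, 0), pvalues[j], qvalues[j]) for j in range(n_features)
    ]


# Hypothetical forest: 100 trees, 10 features. Feature 0 is selected in
# 46 trees; each of the other nine features is selected in 6 trees.
root_features = [0] * 46 + [j for j in range(1, 10) for _ in range(6)]
results = selection_frequency_test(root_features, n_features=10)
for feature, count, pvalue, qvalue in results:
    print(f"feature {feature}: count={count}, p={pvalue:.3g}, q={qvalue:.3g}")
```

Because trees in a real forest are grown from overlapping subsamples, the independence assumption in this sketch would tend to overstate significance; that is the co-dependency the distributional adjustment is meant to address.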

Results In simulations and in studies of the Madelon benchmark datasets, binomialRF showed computational gains (ranging from 30 to 600 times faster) while maintaining competitive variable precision and recall in identifying biomarkers’ main effects and interactions. In two clinical studies, the binomialRF algorithm prioritized previously published, relevant pathological molecular mechanisms (features) with high classification precision and recall using features alone, as well as using their statistical interactions alone.

Conclusion binomialRF extends previous methods for identifying interpretable features in RFs and brings them together under a correlated binomial distribution to create an efficient hypothesis testing algorithm that identifies biomarkers’ main effects and interactions. Preliminary results in simulations demonstrate computational gains while retaining competitive model selection and classification accuracies. Future work will extend this framework to incorporate ontologies that provide pathway-level feature selection from gene expression input data.

Availability Github: https://github.com/SamirRachidZaim/binomialRF

Supplementary information Supplementary analyses and results are available at https://github.com/SamirRachidZaim/binomialRF_simulationStudy

Footnotes

  • This version of the manuscript has been revised to change the binomialRF algorithm to account for tree-to-tree sampling co-dependency. This effectively modifies the binomial assumption to a set of correlated Bernoulli trials. The corresponding tables and figures have been modified to reflect the performance of the new binomialRF algorithm.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.
Posted March 06, 2020.

Subject Area

  • Bioinformatics