bioRxiv
New Results

An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

Sahir Rai Bhatnagar, Yi Yang, Budhachandra Khundrakpam, Alan C Evans, Mathieu Blanchette, Luigi Bouchard, Celia MT Greenwood
doi: https://doi.org/10.1101/102475
Sahir Rai Bhatnagar
1 Department of Epidemiology, Biostatistics and Occupational Health, McGill University; Lady Davis Institute, Jewish General Hospital, Montréal, QC
Yi Yang
2 Department of Mathematics and Statistics, McGill University
Budhachandra Khundrakpam
3 Montreal Neurological Institute, McGill University
Alan C Evans
3 Montreal Neurological Institute, McGill University
Mathieu Blanchette
4 Department of Computer Science, McGill University
Luigi Bouchard
5 Department of Biochemistry, Université de Sherbrooke
Celia MT Greenwood
1 Department of Epidemiology, Biostatistics and Occupational Health, McGill University; Lady Davis Institute, Jewish General Hospital, Montréal, QC

Abstract

Predicting a phenotype and understanding which variables improve that prediction are two challenging and overlapping problems in the analysis of high-dimensional data such as those arising from genomic and brain imaging studies. It is often believed that the number of truly important predictors is small relative to the total number of variables, making computational approaches to variable selection and dimension reduction extremely important. To reduce dimensionality, commonly used two-step methods first cluster the data in some way and then build models that use cluster summaries to predict the phenotype.

It is known that important exposure variables can alter correlation patterns between clusters of high-dimensional variables, i.e., alter network properties of the variables. However, it is not well understood whether such altered clustering is informative for prediction. Here, assuming there is a binary exposure with such network-altering effects, we explore whether the use of exposure-dependent clustering relationships in dimension reduction can improve predictive modelling in a two-step framework. To test this hypothesis, we propose a modelling framework called ECLUST and evaluate its performance through extensive simulations.

With ECLUST, we found improved prediction and variable selection performance compared to methods that do not consider the exposure in the clustering step, or to methods that use the original data as features. We further illustrate this modelling framework through the analysis of three data sets from very different fields, each with high-dimensional data, a binary exposure, and a phenotype of interest. Our method is available in the eclust CRAN package.
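
The two-step idea described above can be illustrated with a minimal sketch. It is not the eclust package or the authors' implementation: the function names (`simulate`, `exposure_clusters`, `cluster_features`), the simulated data, the 0.5 threshold, and the use of connected components on a thresholded correlation-difference graph (in place of a proper clustering algorithm) are all assumptions made for illustration. Step 1 groups variables whose pairwise correlation changes between exposure groups; step 2 builds cluster-average features and their interactions with the exposure, which could then be passed to any penalized regression.

```python
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def corr_matrix(rows):
    """Correlation matrix of the columns of a list-of-rows data set."""
    cols = list(zip(*rows))
    p = len(cols)
    return [[pearson(cols[i], cols[j]) for j in range(p)] for i in range(p)]

def exposure_clusters(X, e, threshold=0.5):
    """Step 1 (illustrative): link two variables when their correlation
    differs between exposure groups by more than `threshold`, then take
    connected components of that graph as clusters."""
    c0 = corr_matrix([row for row, ei in zip(X, e) if ei == 0])
    c1 = corr_matrix([row for row, ei in zip(X, e) if ei == 1])
    p = len(c0)
    linked = [[abs(c1[i][j] - c0[i][j]) > threshold for j in range(p)]
              for i in range(p)]
    seen, clusters = set(), []
    for start in range(p):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first search over linked variables
            i = stack.pop()
            component.append(i)
            for j in range(p):
                if linked[i][j] and j not in seen:
                    seen.add(j)
                    stack.append(j)
        clusters.append(sorted(component))
    return clusters

def cluster_features(X, e, clusters):
    """Step 2 (illustrative): per-sample cluster averages, the exposure,
    and cluster-average-by-exposure interaction terms."""
    features = []
    for row, ei in zip(X, e):
        summaries = [sum(row[j] for j in c) / len(c) for c in clusters]
        features.append(summaries + [ei] + [s * ei for s in summaries])
    return features

def simulate(n_per_group=400, seed=1):
    """Hypothetical data: variables 0-2 share a latent factor only in
    exposed samples; variables 3-5 are independent noise."""
    rng = random.Random(seed)
    X, e = [], []
    for exposed in (0, 1):
        for _ in range(n_per_group):
            row = [rng.gauss(0, 1) for _ in range(6)]
            if exposed:
                latent = rng.gauss(0, 1)
                for j in range(3):
                    row[j] = latent + rng.gauss(0, 0.3)
            X.append(row)
            e.append(exposed)
    return X, e

X, e = simulate()
clusters = exposure_clusters(X, e)
features = cluster_features(X, e, clusters)
print(clusters)
```

In this toy setting the exposure-dependent step should recover the block of variables whose correlation structure exists only under exposure, which a clustering that ignores the exposure would see only in diluted form. The fixed threshold is a simplification; a real analysis would choose the clustering method and cut point from the data.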

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Posted October 12, 2017.
Subject Area

  • Genomics