bioRxiv
New Results

Embedding covariate adjustments in tree-based automated machine learning for biomedical big data analyses

Elisabetta Manduchi, Weixuan Fu, Joseph D. Romano, Stefano Ruberto, Jason H. Moore
doi: https://doi.org/10.1101/2020.08.24.265116
Elisabetta Manduchi
1 Department of Biostatistics, Epidemiology & Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
2 Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
Weixuan Fu
2 Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
Joseph D. Romano
1 Department of Biostatistics, Epidemiology & Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
Stefano Ruberto
1 Department of Biostatistics, Epidemiology & Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
Jason H. Moore
1 Department of Biostatistics, Epidemiology & Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
2 Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
For correspondence: jhmoore@upenn.edu

Abstract

Background: A typical task in bioinformatics is to identify which features are associated with a target outcome of interest and to build a predictive model. Automated machine learning (AutoML) systems such as the Tree-based Pipeline Optimization Tool (TPOT) are an appealing approach to this end. However, biomedical data often include baseline characteristics of the study subjects, or batch effects, that must be adjusted for in order to better isolate the effects of the features of interest on the target. The ability to perform covariate adjustments is therefore particularly important for applications of AutoML to biomedical big data analysis.

Results: We present an approach to adjust for covariates affecting features and/or target in TPOT. The approach regresses out the covariates in a manner that avoids ‘leakage’ during the cross-validation training procedure. We then describe applications of this approach to toxicogenomics and schizophrenia gene expression data sets. The TPOT extensions discussed in this work are available at https://github.com/EpistasisLab/tpot/tree/v0.11.1-resAdj.
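
The idea of regressing out a covariate without leakage can be illustrated with a minimal sketch. It is not the TPOT resAdj implementation: the function names (`fit_covariate_model`, `residualize`, `kfold_indices`, `adjusted_folds`) are hypothetical, and a single covariate with a simple linear fit is assumed for brevity. The point it shows is that, within each cross-validation split, the covariate model is estimated on the training portion only and then applied unchanged to the held-out portion.

```python
from statistics import mean


def fit_covariate_model(covariate, values):
    """Least-squares fit of values ~ intercept + slope * covariate."""
    c_bar, v_bar = mean(covariate), mean(values)
    sxx = sum((c - c_bar) ** 2 for c in covariate)
    sxy = sum((c - c_bar) * (v - v_bar) for c, v in zip(covariate, values))
    slope = sxy / sxx if sxx else 0.0
    return v_bar - slope * c_bar, slope


def residualize(model, covariate, values):
    """Subtract the covariate-predicted part from each value."""
    intercept, slope = model
    return [v - (intercept + slope * c) for c, v in zip(covariate, values)]


def kfold_indices(n, k):
    """Yield (train_idx, test_idx) for k contiguous folds over n samples."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def adjusted_folds(covariate, feature, k=3):
    """Return (train_residuals, test_residuals) for each fold.

    The covariate model is fit on the training samples only, so the
    held-out samples never influence the adjustment applied to them.
    """
    folds = []
    for train, test in kfold_indices(len(feature), k):
        c_train = [covariate[i] for i in train]
        f_train = [feature[i] for i in train]
        c_test = [covariate[i] for i in test]
        f_test = [feature[i] for i in test]
        model = fit_covariate_model(c_train, f_train)  # training data only
        folds.append((residualize(model, c_train, f_train),
                      residualize(model, c_test, f_test)))
    return folds
```

Fitting the covariate model once on the full data set before splitting would instead let held-out samples shape the adjustment, which is the kind of leakage the approach is designed to avoid. The same pattern would apply to the target when it, rather than a feature, is affected by the covariate.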

Conclusions: This work addresses an important need in AutoML that is particularly crucial for applications to bioinformatics and medical informatics, namely covariate adjustment. To this end, we present a substantial extension of TPOT, a genetic-programming-based AutoML approach. We show the utility of this extension through applications to large toxicogenomics and differential gene expression data sets. The method is generally applicable to many other scenarios in the biomedical field.

Competing Interest Statement

The authors have declared no competing interest.

  • List of abbreviations

    AutoML: Automated Machine Learning
    CV: Cross Validation
    FSS: Feature Set Selector
    GP: Genetic Programming
    GSEA: Gene Set Enrichment Analysis
    resAdj: residual adjustment
    TPOT: Tree-based Pipeline Optimization Tool
  • Copyright 
    The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
    Posted August 24, 2020.
    Subject Area

    • Bioinformatics