Skip to main content
bioRxiv
  • Home
  • About
  • Submit
  • ALERTS / RSS
Advanced Search
New Results

Learning Sparse Log-Ratios for High-Throughput Sequencing Data

Elliott Gordon-Rodriguez, Thomas P. Quinn, John P. Cunningham
doi: https://doi.org/10.1101/2021.02.11.430695
Elliott Gordon-Rodriguez
1Department of Statistics, Columbia University
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • For correspondence: eg2912@columbia.edu
Thomas P. Quinn
2Applied Artificial Intelligence Institute (A2I2), Deakin University.
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
John P. Cunningham
1Department of Statistics, Columbia University
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • Abstract
  • Full Text
  • Info/History
  • Metrics
  • Supplementary material
  • Data/Code
  • Preview PDF
Loading

Abstract

The automatic discovery of interpretable features that are associated to an outcome of interest is a central goal of bioinformatics. In the context of high-throughput genetic sequencing data, and Compositional Data more generally, an important class of features are the log-ratios between subsets of the input variables. However, the space of these log-ratios grows combinatorially with the dimension of the input, and as a result, existing learning algorithms do not scale to increasingly common high-dimensional datasets. Building on recent literature on continuous relaxations of discrete latent variables, we design a novel learning algorithm that identifies sparse log-ratios several orders of magnitude faster than competing methods. As well as dramatically reducing runtime, our method outperforms its competitors in terms of sparsity and predictive accuracy, as measured across a wide range of benchmark datasets.

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

  • https://github.com/cunningham-lab/codacore

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.
Back to top
PreviousNext
Posted February 12, 2021.
Download PDF

Supplementary Material

Data/Code
Email

Thank you for your interest in spreading the word about bioRxiv.

NOTE: Your email address is requested solely to identify you as the sender of this article.

Enter multiple addresses on separate lines or separate them with commas.
Learning Sparse Log-Ratios for High-Throughput Sequencing Data
(Your Name) has forwarded a page to you from bioRxiv
(Your Name) thought you would like to see this page from the bioRxiv website.
CAPTCHA
This question is for testing whether or not you are a human visitor and to prevent automated spam submissions.
Share
Learning Sparse Log-Ratios for High-Throughput Sequencing Data
Elliott Gordon-Rodriguez, Thomas P. Quinn, John P. Cunningham
bioRxiv 2021.02.11.430695; doi: https://doi.org/10.1101/2021.02.11.430695
Digg logo Reddit logo Twitter logo Facebook logo Google logo LinkedIn logo Mendeley logo
Citation Tools
Learning Sparse Log-Ratios for High-Throughput Sequencing Data
Elliott Gordon-Rodriguez, Thomas P. Quinn, John P. Cunningham
bioRxiv 2021.02.11.430695; doi: https://doi.org/10.1101/2021.02.11.430695

Citation Manager Formats

  • BibTeX
  • Bookends
  • EasyBib
  • EndNote (tagged)
  • EndNote 8 (xml)
  • Medlars
  • Mendeley
  • Papers
  • RefWorks Tagged
  • Ref Manager
  • RIS
  • Zotero
  • Tweet Widget
  • Facebook Like
  • Google Plus One

Subject Area

  • Bioinformatics
Subject Areas
All Articles
  • Animal Behavior and Cognition (3701)
  • Biochemistry (7820)
  • Bioengineering (5695)
  • Bioinformatics (21343)
  • Biophysics (10603)
  • Cancer Biology (8206)
  • Cell Biology (11974)
  • Clinical Trials (138)
  • Developmental Biology (6786)
  • Ecology (10425)
  • Epidemiology (2065)
  • Evolutionary Biology (13908)
  • Genetics (9731)
  • Genomics (13109)
  • Immunology (8171)
  • Microbiology (20064)
  • Molecular Biology (7875)
  • Neuroscience (43171)
  • Paleontology (321)
  • Pathology (1282)
  • Pharmacology and Toxicology (2267)
  • Physiology (3363)
  • Plant Biology (7254)
  • Scientific Communication and Education (1316)
  • Synthetic Biology (2012)
  • Systems Biology (5550)
  • Zoology (1133)