bioRxiv
New Results

Learning Sparse Log-Ratios for High-Throughput Sequencing Data

Elliott Gordon-Rodriguez, Thomas P. Quinn, John P. Cunningham
doi: https://doi.org/10.1101/2021.02.11.430695
Elliott Gordon-Rodriguez
1 Department of Statistics, Columbia University
  • For correspondence: eg2912@columbia.edu
Thomas P. Quinn
2 Applied Artificial Intelligence Institute (A2I2), Deakin University
John P. Cunningham
1 Department of Statistics, Columbia University

Abstract

The automatic discovery of interpretable features that are associated with an outcome of interest is a central goal of bioinformatics. In the context of high-throughput genetic sequencing data, and Compositional Data more generally, an important class of features is the log-ratios between subsets of the input variables. However, the space of these log-ratios grows combinatorially with the dimension of the input and, as a result, existing learning algorithms do not scale to the high-dimensional datasets that are increasingly common. Building on recent literature on continuous relaxations of discrete latent variables, we design a novel learning algorithm that identifies sparse log-ratios several orders of magnitude faster than competing methods. In addition to dramatically reducing runtime, our method outperforms its competitors in terms of sparsity and predictive accuracy, as measured across a wide range of benchmark datasets.
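
To make the notion of a log-ratio between subsets concrete, the following is a minimal illustrative sketch and not the authors' algorithm. It assumes one common form of log-ratio, the difference of mean logs (the log of a ratio of geometric means) between a numerator subset and a denominator subset. The function names `log_ratio` and `count_log_ratios` and the example counts are hypothetical. The counting function illustrates the combinatorial growth: each of the p variables goes to the numerator, the denominator, or neither, and both subsets must be non-empty.

```python
import math


def log_ratio(x, numerator, denominator):
    """Log of the ratio of geometric means of two disjoint index subsets of x.

    x must contain strictly positive values (e.g. counts with zeros replaced).
    """
    if not numerator or not denominator:
        raise ValueError("both subsets must be non-empty")
    if set(numerator) & set(denominator):
        raise ValueError("subsets must be disjoint")
    mean_log_num = sum(math.log(x[i]) for i in numerator) / len(numerator)
    mean_log_den = sum(math.log(x[i]) for i in denominator) / len(denominator)
    return mean_log_num - mean_log_den


def count_log_ratios(p):
    """Number of ordered (numerator, denominator) pairs of disjoint,
    non-empty subsets of p variables, by inclusion-exclusion."""
    return 3**p - 2 * 2**p + 1


# Hypothetical counts for four input variables in one sample.
counts = [10.0, 40.0, 5.0, 20.0]
value = log_ratio(counts, numerator=[0, 1], denominator=[2, 3])

# Rescaling the whole sample (e.g. a different sequencing depth)
# leaves the log-ratio unchanged, up to floating-point error.
scaled = log_ratio([7.0 * c for c in counts], numerator=[0, 1], denominator=[2, 3])

print(value, scaled)
print([count_log_ratios(p) for p in (2, 3, 10)])
```

Because the count of candidate pairs grows roughly like 3^p, exhaustive search quickly becomes infeasible as the number of input variables increases, which is the scaling problem the abstract refers to.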

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

  • https://github.com/cunningham-lab/codacore

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.
Posted February 12, 2021.

Subject Area

  • Bioinformatics