PT - JOURNAL ARTICLE
AU - Gordon-Rodriguez, Elliott
AU - Quinn, Thomas P.
AU - Cunningham, John P.
TI - Learning Sparse Log-Ratios for High-Throughput Sequencing Data
AID - 10.1101/2021.02.11.430695
DP - 2021 Jan 01
TA - bioRxiv
PG - 2021.02.11.430695
4099 - http://biorxiv.org/content/early/2021/05/25/2021.02.11.430695.short
4100 - http://biorxiv.org/content/early/2021/05/25/2021.02.11.430695.full
AB - The automatic discovery of sparse biomarkers that are associated with an outcome of interest is a central goal of bioinformatics. In the context of high-throughput sequencing (HTS) data, and compositional data (CoDa) more generally, an important class of biomarkers are the log-ratios between the input variables. However, identifying predictive log-ratio biomarkers from HTS data is a combinatorial optimization problem, which is computationally challenging. Existing methods are slow to run and scale poorly with the dimension of the input, which has limited their application to low- and moderate-dimensional metagenomic datasets. Building on recent advances from the field of deep learning, we present CoDaCoRe, a novel learning algorithm that identifies sparse, interpretable, and predictive log-ratio biomarkers. Our algorithm exploits a continuous relaxation to approximate the underlying combinatorial optimization problem. This relaxation can then be optimized efficiently using the modern ML toolbox, in particular, gradient descent. As a result, CoDaCoRe runs several orders of magnitude faster than competing methods, all while achieving state-of-the-art performance in terms of predictive accuracy and sparsity. We verify the outperformance of CoDaCoRe across a wide range of microbiome, metabolite, and microRNA benchmark datasets, as well as a particularly high-dimensional dataset that is outright computationally intractable for existing sparse log-ratio selection methods. Competing Interest Statement: The authors have declared no competing interest.