PT - JOURNAL ARTICLE
AU - Elliott Gordon-Rodriguez
AU - Thomas P. Quinn
AU - John P. Cunningham
TI - Learning Sparse Log-Ratios for High-Throughput Sequencing Data
AID - 10.1101/2021.02.11.430695
DP - 2021 Jan 01
TA - bioRxiv
PG - 2021.02.11.430695
4099 - http://biorxiv.org/content/early/2021/02/12/2021.02.11.430695.short
4100 - http://biorxiv.org/content/early/2021/02/12/2021.02.11.430695.full
AB - The automatic discovery of interpretable features that are associated with an outcome of interest is a central goal of bioinformatics. In the context of high-throughput genetic sequencing data, and Compositional Data more generally, an important class of features are the log-ratios between subsets of the input variables. However, the space of these log-ratios grows combinatorially with the dimension of the input, and as a result, existing learning algorithms do not scale to increasingly common high-dimensional datasets. Building on recent literature on continuous relaxations of discrete latent variables, we design a novel learning algorithm that identifies sparse log-ratios several orders of magnitude faster than competing methods. As well as dramatically reducing runtime, our method outperforms its competitors in terms of sparsity and predictive accuracy, as measured across a wide range of benchmark datasets. Competing Interest Statement: The authors have declared no competing interest.