RT Journal Article
SR Electronic
T1 REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 2020.03.29.014159
DO 10.1101/2020.03.29.014159
A1 Camille Marchet
A1 Zamin Iqbal
A1 Daniel Gautheret
A1 Mikael Salson
A1 Rayan Chikhi
YR 2020
UL http://biorxiv.org/content/early/2020/03/30/2020.03.29.014159.abstract
AB Motivation: In this work we present REINDEER, a novel computational method that performs indexing of sequences and records their abundances across a collection of datasets. To the best of our knowledge, other indexing methods have so far been unable to record abundances efficiently across large datasets. Results: We used REINDEER to index the abundances of sequences within 2,585 human RNA-seq experiments in 45 hours using only 56 GB of RAM. This makes REINDEER the first method able to record abundances at the scale of ~4 billion distinct k-mers across 2,585 datasets. REINDEER also supports exact presence/absence queries of k-mers. Briefly, REINDEER constructs the compacted de Bruijn graph (DBG) of each dataset, then conceptually merges those DBGs into a single global one. Then, REINDEER constructs and indexes monotigs, which in a nutshell are groups of k-mers of similar abundances. Availability: https://github.com/kamimrcht/REINDEER Contact: camille.marchet@univ-lille.fr