RT Journal Article
SR Electronic
T1 REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 2020.03.29.014159
DO 10.1101/2020.03.29.014159
A1 Camille Marchet
A1 Zamin Iqbal
A1 Daniel Gautheret
A1 Mikael Salson
A1 Rayan Chikhi
YR 2020
UL http://biorxiv.org/content/early/2020/03/30/2020.03.29.014159.abstract
AB Motivation: In this work we present REINDEER, a novel computational method that performs indexing of sequences and records their abundances across a collection of datasets. To the best of our knowledge, other indexing methods have so far been unable to record abundances efficiently across large datasets. Results: We used REINDEER to index the abundances of sequences within 2,585 human RNA-seq experiments in 45 hours using only 56 GB of RAM. This makes REINDEER the first method able to record abundances at the scale of ~4 billion distinct k-mers across 2,585 datasets. REINDEER also supports exact presence/absence queries of k-mers. Briefly, REINDEER constructs the compacted de Bruijn graph (DBG) of each dataset, then conceptually merges those DBGs into a single global one. Then, REINDEER constructs and indexes monotigs, which in a nutshell are groups of k-mers of similar abundances. Availability: https://github.com/kamimrcht/REINDEER Contact: camille.marchet{at}univ-lille.fr
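
The idea summarized in the abstract (record each k-mer's abundance in every dataset, then group k-mers whose abundances agree) can be illustrated with a toy sketch. This is not REINDEER's implementation: it uses plain dictionaries instead of compacted de Bruijn graphs, ignores reverse complements, and groups k-mers only by identical abundance vectors, which is a simplification of what the abstract calls monotigs. The function names count_kmers, build_index, group_by_profile and query are hypothetical and chosen for illustration only.

```python
from collections import Counter, defaultdict


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) in a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def build_index(datasets, k):
    """Map each k-mer to a tuple of its abundance in each dataset.

    Toy stand-in for a merged index: one entry per distinct k-mer,
    one abundance per dataset (0 when the k-mer is absent).
    """
    per_dataset = [count_kmers(reads, k) for reads in datasets]
    all_kmers = set().union(*per_dataset)
    return {
        kmer: tuple(counts.get(kmer, 0) for counts in per_dataset)
        for kmer in all_kmers
    }


def group_by_profile(index):
    """Group k-mers that share the same abundance vector.

    Assumption: a crude approximation of grouping k-mers of similar
    abundances; adjacency in the graph is not taken into account.
    """
    groups = defaultdict(list)
    for kmer, profile in index.items():
        groups[profile].append(kmer)
    return {profile: sorted(kmers) for profile, kmers in groups.items()}


def query(index, kmer, n_datasets):
    """Return the abundance vector of a k-mer (all zeros if unseen)."""
    return index.get(kmer, (0,) * n_datasets)


# Two tiny hypothetical "datasets", each a list of reads.
datasets = [["ACGTA"], ["ACGT", "ACGT"]]
index = build_index(datasets, k=3)
groups = group_by_profile(index)
```

Here query(index, "ACG", 2) gives the abundance of ACG in both datasets, and a vector of zeros answers a presence/absence query for a k-mer that occurs in neither. Storing one abundance vector per group rather than per k-mer is where the space saving in such a scheme would come from.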