Abstract
Background The need to associate information with words is shared by many applications and methods in high-throughput sequence analysis, and can be considered fundamental. Indexing billions of k-mers quickly runs into a scalability problem, because exact associative indexes can be memory expensive. To address this challenge, recent works take advantage of the properties of k-mer sets: they exploit the overlaps shared among k-mers by using a De Bruijn graph as a compact representation of the k-mer set.
Contribution We propose a scalable and exact index structure that associates unique identifiers with indexed k-mers and rejects alien k-mers. The proposed structure combines an extremely compact representation with high throughput. Moreover, it can be built efficiently from the De Bruijn graph sequences. Using the efficient implementation we provide, the k-mers of the human genome can be indexed with 8 GB of memory within 30 minutes, and those of the huge axolotl genome with 63 GB within 10 hours. Furthermore, while being memory efficient, the index supports more than a million queries per second on a single CPU in our experiments, and this throughput can be raised by using multiple cores. Finally, we also show that the index can represent metagenomic and transcriptomic sequencing data in practice.
Availability The index is implemented as a header-only C++ library, is open source, and is available at https://github.com/Malfoy/Blight. It was designed as a user-friendly library and comes with sample usage code.
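To make the contract described in the Contribution concrete, the following minimal C++17 sketch shows what "associate unique identifiers with indexed k-mers and reject alien k-mers" means at the interface level. It is only an illustration under stated assumptions: the class `NaiveKmerIndex` and its methods are hypothetical names, not the Blight API, and a plain hash map stands in for the compact structure. It also treats k-mers as plain strings and ignores reverse complements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical toy index (NOT the Blight API). A hash map of full k-mer
// strings is far less compact than the structure the abstract describes;
// it only illustrates the query semantics.
class NaiveKmerIndex {
public:
    NaiveKmerIndex(std::size_t k, const std::vector<std::string>& sequences)
        : k_(k) {
        assert(k > 0);
        for (const std::string& seq : sequences) {
            // Slide a window of length k over each input sequence.
            for (std::size_t i = 0; i + k_ <= seq.size(); ++i) {
                // try_emplace leaves the map unchanged for a repeated k-mer,
                // so each distinct k-mer keeps the first identifier it got.
                ids_.try_emplace(seq.substr(i, k_), ids_.size());
            }
        }
    }

    // Returns the identifier of an indexed k-mer, or an empty optional
    // for an alien k-mer (one that was never indexed).
    std::optional<std::uint64_t> query(const std::string& kmer) const {
        auto it = ids_.find(kmer);
        if (it == ids_.end()) return std::nullopt;
        return it->second;
    }

    // Number of distinct indexed k-mers; identifiers lie in [0, size()).
    std::size_t size() const { return ids_.size(); }

private:
    std::size_t k_;
    std::unordered_map<std::string, std::uint64_t> ids_;
};
```

Because every distinct k-mer receives an identifier in `[0, size())`, a caller could store the associated information in a flat array indexed by that identifier, which is the usual reason for wanting unique identifiers rather than arbitrary hash values.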