PT  - JOURNAL ARTICLE
AU  - Camille Marchet
AU  - Christina Boucher
AU  - Simon J Puglisi
AU  - Paul Medvedev
AU  - Mikaël Salson
AU  - Rayan Chikhi
TI  - Data structures based on k-mers for querying large collections of sequencing datasets
AID  - 10.1101/866756
DP  - 2020 Jan 01
TA  - bioRxiv
PG  - 866756
4099  - http://biorxiv.org/content/early/2020/12/17/866756.short
4100  - http://biorxiv.org/content/early/2020/12/17/866756.full
AB  - High-throughput sequencing datasets are usually deposited in public repositories, e.g. the European Nucleotide Archive, to ensure reproducibility. As the amount of data has reached petabyte scale, repositories do not support online sequence searches; yet such a feature would be highly useful to investigators. Towards this goal, in the last few years several computational approaches have been introduced to index and query large collections of datasets. Here we offer an accessible survey of these approaches, which are generally based on representing datasets as sets of k-mers. We review their properties, introduce a classification, and present their general intuition. We summarize their performance and highlight their current strengths and limitations. Competing Interest Statement: The authors have declared no competing interest.