PT - JOURNAL ARTICLE
AU - Lemane, Téo
AU - Lezzoche, Nolan
AU - Lecubin, Julien
AU - Pelletier, Eric
AU - Lescot, Magali
AU - Chikhi, Rayan
AU - Peterlongo, Pierre
TI - kmindex and ORA: indexing and real-time user-friendly queries in terabyte-sized complex genomic datasets
AID - 10.1101/2023.05.31.543043
DP - 2023 Jan 01
TA - bioRxiv
PG - 2023.05.31.543043
4099 - http://biorxiv.org/content/early/2023/10/31/2023.05.31.543043.short
4100 - http://biorxiv.org/content/early/2023/10/31/2023.05.31.543043.full
AB - Public sequencing databases contain vast amounts of biological information, yet they are largely underutilized as one cannot efficiently search them for any sequence(s) of interest. We present kmindex, an innovative approach that can index thousands of highly complex metagenomes and perform sequence searches in a fraction of a second. The index construction is an order of magnitude faster than previous methods, while search times are two orders of magnitude faster. With negligible false positive rates below 0.01%, kmindex outperforms the precision of existing approaches by four orders of magnitude. We demonstrate the scalability of kmindex by successfully indexing 1,393 complex marine seawater metagenome samples from the Tara Oceans project. Additionally, we introduce the publicly accessible web server “Ocean Read Atlas” (ORA) at https://ocean-read-atlas.mio.osupytheas.fr/, which enables real-time queries on the Tara Oceans dataset. The open-source kmindex software is available at https://github.com/tlemane/kmindex. Competing Interest Statement: The authors have declared no competing interest.