Abstract
Motivation A dictionary of k-mers is a data structure that stores a set of n distinct k-mers and supports membership queries. This data structure is at the heart of many important tasks in computational biology. High-throughput sequencing of DNA can produce very large k-mer sets, on the order of billions of strings. In such cases, the memory consumption and query efficiency of the data structure are a concrete challenge.
Results To tackle this problem, we describe a compressed and associative dictionary for k-mers, that is, a data structure where strings are represented in compact form and each of them is associated with a unique integer identifier in the range [0, n). We show that some statistical properties of k-mer minimizers can be exploited by minimal perfect hashing to substantially improve the space/time trade-off of the dictionary compared to the best-known solutions.
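To make the terminology concrete, the following is a minimal C++ sketch of the three notions used above: a compact (2-bit per base) representation of a k-mer, the minimizer of a k-mer, and an associative dictionary that maps each stored k-mer to an identifier in [0, n). It is an illustration under stated assumptions, not the SSHash implementation: the names `pack`, `minimizer`, and `ToyKmerDictionary` are hypothetical, the minimizer is taken as the lexicographically smallest m-mer (practical tools often order m-mers by a hash value instead), and a plain `std::unordered_map` stands in for minimal perfect hashing, so none of the space savings of the actual data structure are reproduced.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// Assumed 2-bit encoding of DNA bases: A=0, C=1, G=2, T=3.
inline std::uint64_t encode_base(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
    }
    assert(false && "input must be over the alphabet {A,C,G,T}");
    return 0;
}

// Pack a string of at most 32 bases into one 64-bit word (compact form).
inline std::uint64_t pack(std::string_view s) {
    assert(s.size() <= 32);
    std::uint64_t x = 0;
    for (char c : s) x = (x << 2) | encode_base(c);
    return x;
}

// Minimizer of a k-mer: here, the lexicographically smallest substring
// of length m (an assumption made for illustration only).
inline std::string_view minimizer(std::string_view kmer, std::size_t m) {
    assert(m >= 1 && m <= kmer.size());
    std::string_view best = kmer.substr(0, m);
    for (std::size_t i = 1; i + m <= kmer.size(); ++i) {
        std::string_view candidate = kmer.substr(i, m);
        if (candidate < best) best = candidate;
    }
    return best;
}

// Toy associative dictionary: each distinct k-mer gets an identifier in
// [0, n), assigned in order of first insertion. Uses an ordinary hash
// table rather than minimal perfect hashing.
class ToyKmerDictionary {
public:
    ToyKmerDictionary(const std::vector<std::string>& kmers, std::size_t k)
        : k_(k) {
        assert(k >= 1 && k <= 32);
        for (const auto& s : kmers) {
            assert(s.size() == k_);
            // emplace is a no-op for duplicates, so ids stay contiguous.
            ids_.emplace(pack(s), ids_.size());
        }
    }

    // Returns the identifier of kmer, or -1 if it is not in the set.
    std::int64_t lookup(std::string_view kmer) const {
        if (kmer.size() != k_) return -1;
        auto it = ids_.find(pack(kmer));
        return it == ids_.end() ? -1 : static_cast<std::int64_t>(it->second);
    }

    std::size_t size() const { return ids_.size(); }

private:
    std::size_t k_;
    std::unordered_map<std::uint64_t, std::size_t> ids_;
};
```

With k = 3 and the set {ACG, CGT, GTA}, `lookup` returns the identifiers 0, 1 and 2 for the stored k-mers and -1 for any other 3-mer. Consecutive k-mers of a sequence often share the same minimizer, and this kind of regularity is the statistical property that a minimizer-based layout can exploit.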
Availability The C++ implementation of the dictionary is available at https://github.com/jermp/sshash.
Contact giulio.ermanno.pibiri{at}isti.cnr.it
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
Overall, the major differences compared to the first version include:
- The removal of some technical details from the body of the paper (now in Supplementary Material).
- The addition of 3 figures (Fig. 1, Fig. 2a and Fig. 2b) to better illustrate the data structure and discuss examples.
- The addition of Table 5 and Table 7 reporting, respectively, the lookup time and construction time for all approaches.