Toward optimal fingerprint indexing for large scale genomics

Motivation To keep up with the scale of genomic databases, several methods rely on local sensitive hashing methods to efficiently find potential matches within large genome collections. Existing solutions rely on Minhash or Hyperloglog fingerprints and require reading the whole index to perform a query. Such solutions can not be considered scalable with the growing amount of documents to index. Results We present NIQKI, a novel structure with well-designed fingerprints that lead to theoretical and practical query time improvements, outperforming state-of-the-art by orders of magnitude. Our contribution is threefold. First, we generalize the concept of Hyperminhash fingerprints in (h,m)-HMH fingerprints that can be tuned to present the lowest false positive rate given the expected sub-sampling applied. Second, we provide a structure able to index any kind of fingerprints based on inverted indexes that provide optimal queries, namely linear with the size of the output. Third, we implemented these approaches in a tool dubbed NIQKI that can index and calculate pairwise distances for over one million bacterial genomes from GenBank in a few days on a small cluster. We show that our approach can be orders of magnitude faster than state-of-the-art with comparable precision. We believe this approach can lead to tremendous improvements, allowing fast queries and scaling on extensive genomic databases. Availability and implementation We wrote the NIQKI index as an open-source C++ library under the AGPL3 license available at https://github.com/Malfoy/NIQKI. It is designed as a user-friendly tool and comes along with usage samples. 2012 ACM Subject Classification Applied computing → Bioinformatics Digital Object Identifier 10.4230/LIPIcs.WABI.2022.25


38
Historically, genomic databases such as GenBank are growing exponentially 3 . Lower se-39 quencing costs and required investment, broader access to sequencing technologies, and The main bottleneck of existing methods is that a query requires the queried sketch to 114 be compared to all sketches in the database. The query is O(N.S) with N the number of 115 indexed entries. This cannot be considered scalable regarding the exponential growth of 116 available genomic resources. To cope with this algorithmic problem, we design a structure 117 to perform queries in Θ(#hits) where #hits is the number of fingerprint hits between the 118 queried sketch and the indexed sketches. Since the expected output of a query is a list of 119 genomes identifier associated with a number of hits (or an approximation of the Jaccard 120 index), we can consider our query time as close to optimal. 121 However, such a structure presents an overhead exponential with the fingerprint size. 122 For this reason, we are highly attentive to using powerful but small fingerprint sizes. We 123 generalize the Hyperminhash fingerprint and introduce (h, m)-Hyperminhash fingerprints 124 that cover both Minhash and Hyperloglog cases and provide an analysis to select the best 125 fingerprint for a given use case.

Figure 1
Example of processing partitions of 8 (N ) sketches where each sketch has 5 (S) fingerprints of 2 (W ) bits. B = {01303, 20103, 32112, 30221, 02212, 03212, 20321, 22031} is the set of sketches used as instance where the sketch s1 = 01303 corresponds to the tuple of the 5 fingerprints where the first fingerprint is 0, the second fingerprint is 1, the third fingerprint is 3 and so on. To represent the decomposition by column, we add specific color. For the first column (in blue) P1 corresponds to all the fingerprints seen as the first position of a sketch of B (the first partition) and X 1,vf contain the set of sketch identifiers (positions of the sketches in B) whose first fingerprint value is vf . We obtain S × 2 W = 20 sets X j,l (1 ≤ j ≤ S and 0 ≤ l ≤ 2 W − 1) corresponding to the stored sets.  Instead of indexing each sketch, we split the set of sketches B by column j between 1 and 141 S. We denote by P j the sequence of fingerprints f 1,j . . . f N,j . Unlike previous approaches, 142 our scheme handles each column P j independently by grouping the indices with the same 143 fingerprint on this column. Indeed, for each column j between 1 and S, we index in a vector Figure 1). 147 We argue that performing queries by partition presents several advantages. One advantage 148 of this processing partition by column is the possibility to stop queries that can not find 149 high-scoring matches. If we are looking for a minimum of m shared fingerprint among S 150 to report a match, if after P |S|−j no sketches share at least m − j fingerprints, that query 151 can be stopped as no matches can reach the minimal score. If not using the densification 152 technique, empty partitions could also be skipped.

153
As, for our case, a sketch represents a genome, and the fingerprints come from the set of 154 k-mers of this genome, one possible mechanism presented in [10] inspired by [7] is to insert C. Agret and B. Cazaux and A. Limasset 25:5 the k-mers of a given partition in a Bloom filter (or any set structure) to "protect" each partition from alien k-mers. If the selected k-mer is not in the partition set, the partition is 157 skipped as we know that the k-mer was not inserted in the partition and that any fingerprint 158 hit would be a false positive. 159 We use dense addressing to index our partitions in a trade-off favoring code simplicity and 160 throughput over memory footprint. This way, our index has an overhead of Θ(S.2 W ) as we  If this approach presents a costly overhead, its interest is that query a sketch performs 168 at worst S "costly" random access. We argue that using relatively small, well-designed 169 fingerprints, our scheme can perform swift queries with a reasonable memory footprint.
To begin, if n < 2 h − 1 + m, some bits can be in both prefix and suffix of the fingerprint 195 due to the overlap, and thus it is not space-efficient. Indeed, we can prove that the number 196 of fingerprint values that are not reached is equal to 2 m × (2 h − 1 + m − n). For this reason, 197 we take in all the following n ≥ 2 h − 1 + m.

25:6
Toward optimal fingerprint indexing for large scale genomics the suffix of length m and the prefix of length 2 h − 1 − max(0, k 2 m − 1) which are fixed.

203
As For the sake of completeness, we added all the calculations of k i=0 H(h, m, i), but the only 208 thing one needs to remember is that it can be calculated in constant time. Besides, we have We can extend this formal definition to the probability of interval {a, . . . , b}, i.e.
By approximating the value of P[HMH h,m (X) ≤ k] by 1 − ( give an approximation of a h,m and b h,m for a threshold ε of [0, 1] in O(1): As shown in Figure 2a,  The correspondence between the theoretical optimal fingerprint and the fingerprint with 222 the minimum false positive values in practice justifies the relevance of our study (see Figure 2).  Our implementation delivers a gzip-compressed sparse output to limit disk usage by 251 default. We provide a pretty printer to parse such a file and an option to directly output 252 (larger) human-readable output.    This experiment shows that using fingerprints as small as eight bits seems reasonable

318
To show those two regimes, we perform a first "idealistic" benchmark with randomly 319 generated genomes sharing no k-mers that display the best case of our index. We compare 320 those results to a real database composed of GenBank genomes in a second experiment. We 321 downloaded all Genbank bacterial assemblies (1,042,611) and randomly selected genomes 322 from this pool to build such an index. We want to point out that such a database is highly 323 redundant. For instance, it contains 142,568 assemblies of the Escherichia coli organism. It 324 constitutes a stress case for our index as each query associated with Escherichia coli should 325 report a large part of the indexed genomes among its matches.

326
In Figure 6 we ran Dashing, Mash, and NIQKI to compute pairwise distance on randomly

352
To access our approach scaling ability scalability, we choose to index and compute pairwise 353 distance on all bacterial genomes from GenBank. This dataset represents more than one 354 million genomes, counting more than five tera-nucleotides or more than one terabytes of  Table 1. We observe that we can index and compute pairwise distance on such a database 361 with medium-sized sketches in a few days with a reasonable memory footprint. If the indexing 362 time is roughly constant and dominated by reading the input, we observe that the query 363 time grows linearly with the sketch size (plus a "reading" time constant cost). We want to 364 recall that most of the query computational time is due to the database's redundancy, which 365 generates many matches for some queries. For comparison, on a simulated dataset of one 366 million random synthetics genomes indexed with 4096 fingerprints, the query time lasted 12 367 hours instead of 38 hours for the Genbank database.

368
Conclusions and future work 369 We showed that using inverted indexes on partitioned sketches leads to algorithmic im-   need for an efficient fingerprint that presents smoother patterns to enable a larger range of 414 possible hashes with or without estimation of the sub-sampling. On the fingerprint indexing 415 problem, we showed that our partition query could be considered optimal as its running 416 time is linear with the size of the output. However, the classical usage is to ask for matches 417 with a number of hits above a certain threshold. It would be interesting to investigate which 418 indexes or algorithmic solutions would provide an optimal answer to this problem.