Abstract
Consider comparing a sequencing read of unknown origin to a set of reference genomes. This problem underlies many applications, including metagenomic analyses. The exact genome generating the read is not in the reference set, but may be evolutionarily related to some references. Ideally, we need not just the identity of the references closest to the read but also their distances to the read. These distances can help us identify the read at the right taxonomic level and, more ambitiously, place it on a reference phylogeny. Aligning reads to reference genomes, the only available approach for computing such distances, becomes impractical for very large reference sets. It is also not effective at higher distances when used with efficient indexes (e.g., Bowtie2). While k-mers can create scalable indexes, existing k-mer-based methods are incapable of distance calculation. Thus, estimating distances between short reads and large, diverse reference sets remains challenging and is seldom attempted. We introduce a method called krepp that combines four ideas to solve this challenge and to further enable placing reads on a reference phylogeny. We use (i) locality-sensitive hashing to find inexact k-mer matches, (ii) a phylogeny-guided colored k-mer index to map each k-mer to all references containing it, (iii) a maximum likelihood framework to estimate read-genome distances using k-mer matches, and (iv) an extension of distances to clades of the reference tree, which enables placement using a likelihood ratio test. We show that krepp matches true distances in a fraction of the time required by alignment, extends to higher distances, and accurately places short reads coming from any part of the genome (not just marker genes) on the reference phylogeny. We demonstrate that krepp easily extends to databases with tens of thousands of reference genomes and performs well in characterizing real microbial samples.
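To make the first and third ideas above concrete, the following is a minimal illustrative sketch, not krepp's implementation. It assumes a bit-sampling style of locality-sensitive hashing (the hash of a k-mer is the subsequence at a fixed random set of positions) and a deliberately simplified distance estimate (total mismatches over total compared positions among matched k-mers). All function names, parameters, and the toy sequences are hypothetical. The actual likelihood used by krepp is not given in this abstract and presumably also accounts for k-mers that fail to match, which this naive estimate ignores and which therefore biases it downward.

```python
import random


def make_positions(k, h, seed=0):
    """Pick h of the k positions to sample (assumed LSH scheme: bit sampling)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(k), h))


def lsh_key(kmer, positions):
    """Hash a k-mer by keeping only the sampled positions."""
    return "".join(kmer[i] for i in positions)


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def build_index(reference, k, positions):
    """Map each LSH key to the set of reference k-mers that produce it."""
    index = {}
    for i in range(len(reference) - k + 1):
        kmer = reference[i:i + k]
        index.setdefault(lsh_key(kmer, positions), set()).add(kmer)
    return index


def naive_distance(read, index, k, positions, max_hd):
    """Simplified estimate: mismatches / compared positions over matched k-mers.

    Returns None when no read k-mer has a match within max_hd mismatches.
    Unmatched k-mers are ignored here, which is a simplification.
    """
    mismatches = 0
    matched = 0
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        candidates = index.get(lsh_key(kmer, positions), ())
        if not candidates:
            continue
        best = min(hamming(kmer, c) for c in candidates)
        if best <= max_hd:
            mismatches += best
            matched += 1
    if matched == 0:
        return None
    return mismatches / (matched * k)


# Toy usage with hypothetical sequences and parameters.
K, H = 8, 4
POSITIONS = make_positions(K, H, seed=0)
REFERENCE = "ACGGCAACGCAGGACCAGCAGACGACCA"
INDEX = build_index(REFERENCE, K, POSITIONS)
print(naive_distance(REFERENCE[5:25], INDEX, K, POSITIONS, max_hd=2))
```

Two k-mers that differ only at unsampled positions share a key, so they are retrieved together even though they are not identical; this is what lets a hash-based index report inexact matches along with their Hamming distances.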
Availability: The tool is available at github.com/bo1929/krepp. All results, auxiliary data, and scripts used in the analyses can be found at github.com/bo1929/shared.krepp.
Competing Interest Statement
The authors have declared no competing interest.