RT Journal Article
SR Electronic
T1 The number of k-mer matches between two DNA sequences as a function of k
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 527515
DO 10.1101/527515
A1 Röhling, Sophie
A1 Dencker, Thomas
A1 Morgenstern, Burkhard
YR 2019
UL http://biorxiv.org/content/early/2019/04/30/527515.abstract
AB We study the number Nk of length-k word matches between pairs of evolutionarily related DNA sequences as a function of k. We show that the number of substitutions per site that have occurred since two sequences evolved from their last common ancestor can be estimated from the slope of a certain function of Nk. This approach can be generalized from contiguous word matches to so-called spaced-word matches, where mismatches are allowed at certain pre-defined positions. Based on these theoretical results, we implemented a software program for alignment-free sequence comparison called Slope-SpaM. Test runs on simulated sequence data show that Slope-SpaM can accurately estimate phylogenetic distances for distance values up to around 0.5 substitutions per position. The statistical stability of our results is improved if spaced words are used instead of contiguous words. Unlike previous methods that are based on the number of (spaced) word matches, Slope-SpaM produces accurate results even if sequences share only local homologies. Availability: https://github.com/burkhard-morgenstern/Slope-SpaM Contact: bmorgen{at}gwdg.de