TY  - JOUR
T1  - The number of spaced-word matches between two DNA sequences as a function of the underlying pattern weight
JF  - bioRxiv
DO  - 10.1101/527515
SP  - 527515
AU  - Sophie Röhling
AU  - Burkhard Morgenstern
Y1  - 2019/01/01
UR  - http://biorxiv.org/content/early/2019/01/23/527515.abstract
N2  - We study the number Nk of (spaced) word matches between pairs of evolutionarily related DNA sequences as a function of the word length or pattern weight k, respectively. We show that, under the Jukes-Cantor model, the number of substitutions per site that have occurred since two sequences evolved from their last common ancestor can be estimated from the slope of a certain function of Nk. Based on these considerations, we implemented a software program for alignment-free sequence comparison called Slope-SpaM. Test runs on simulated sequence data show that Slope-SpaM can estimate phylogenetic distances with high accuracy for distances of up to around 0.5 substitutions per position. The statistical stability of our results is improved if spaced words are used instead of contiguous k-mers. Unlike previous methods that are based on the number of (spaced) word matches, our approach can deal with sequences that share only local homologies.
ER  - 