PT  - JOURNAL ARTICLE
AU  - Burkhard Morgenstern
TI  - Sequence Comparison without Alignment: The SpaM approaches
AID  - 10.1101/2019.12.16.878314
DP  - 2019 Jan 01
TA  - bioRxiv
PG  - 2019.12.16.878314
4099  - http://biorxiv.org/content/early/2019/12/17/2019.12.16.878314.short
4100  - http://biorxiv.org/content/early/2019/12/17/2019.12.16.878314.full
AB  - Sequence alignment is at the heart of DNA and protein sequence analysis. For the data volumes that are nowadays produced by massively parallel sequencing technologies, however, pairwise and multiple alignment methods have become too slow for many data-analysis tasks. Therefore, fast alignment-free approaches to sequence comparison have become popular in recent years. Most of these approaches are based on word frequencies, for words of a fixed length, or on word-matching statistics. Other approaches are based on the length of maximal word matches. While these methods are very fast, most of them are based on ad-hoc measures of sequence similarity or dissimilarity that are often hard to interpret. In this review article, I describe a number of alignment-free methods that we developed in recent years. Our approaches are based on spaced word matches (‘SpaM’), i.e. on inexact word matches that are allowed to contain mismatches at certain pre-defined positions. Unlike most previous alignment-free approaches, our approaches are able to accurately estimate phylogenetic distances between DNA or protein sequences based on stochastic models of molecular evolution.