Abstract
Sequence alignment is at the heart of DNA and protein sequence analysis. For the data volumes that are nowadays produced by massively parallel sequencing technologies, however, pairwise and multiple alignment methods have become too slow for many data-analysis tasks. Therefore, fast alignment-free approaches to sequence comparison have become popular in recent years. Most of these approaches are based on word frequencies, for words of a fixed length, or on word-matching statistics. Other approaches are based on the length of maximal word matches. While these methods are very fast, most of them are based on ad-hoc measures of sequence similarity or dissimilarity that are often hard to interpret. In this review article, I describe a number of alignment-free methods that we developed in recent years. Our approaches are based on spaced word matches (‘SpaM’), i.e. on inexact word matches that are allowed to contain mismatches at certain pre-defined positions. Unlike most previous alignment-free approaches, our approaches are able to accurately estimate phylogenetic distances between DNA or protein sequences based on stochastic models of molecular evolution.
1 Introduction
Alignment-free sequence comparison has a long tradition in bioinformatics. The first approaches to compare sequences without alignments were proposed in the 1980s by E. Blaisdell [6, 7]. The interest in alignment-free methods increased when more and more partially or completely sequenced genomes became available through novel sequencing technologies, leading to an urgent need for faster methods of sequence comparison. Most existing alignment-free methods represent sequences as word-frequency vectors for words of a fixed length k (so-called k-mers), and compare k-mer frequencies instead of comparing sequences position-by-position based on alignments [75, 32, 70, 13, 79]. This approach has been extended by taking the background word-match frequencies into account [62, 80, 73, 1]; a review of these latter methods is given in [63]. Other approaches to alignment-free sequence comparison are based on the length of maximal common sub-words of the compared sequences, to define alternative measures of sequence similarity or dissimilarity [77, 14, 45, 61, 76].
The main advantage of these word-based methods is their high speed, compared to alignment-based methods. While finding an optimal alignment of two sequences takes, for most scoring schemes, time proportional to the product of their lengths [56, 22, 51], word-based or alignment-free methods are much more efficient, since word-frequency vectors can be calculated in time proportional to the length of the analyzed sequences. Similarly, the length of longest common sub-words can be efficiently found using data structures such as generalized suffix trees or suffix arrays [25]. A review of earlier alignment-free methods is given in [78]; more recent review papers are [27, 72, 84, 5, 40]. A first systematic benchmark study of alignment-free methods was published in 2019, as a collaboration of several groups working in the field [83].
From the beginning, phylogenetic tree reconstruction has been a main application of alignment-free sequence comparison. Choi and Kim, for example, were able to calculate a phylogenetic tree of > 4,000 whole-proteome sequences [12], using the alignment-free tool FFP that was developed by the same group [70]. The fastest phylogeny methods are distance-based approaches: to calculate a tree for a set of taxa, these methods use pairwise distance values as input. Thus, for each pair of compared taxa, their distance or dissimilarity needs to be measured in some way. Once all pairwise distances have been calculated for the input set of taxa, a matrix with these distances can be used as input for standard distance-based methods for phylogeny reconstruction. The most commonly used methods for distance-based phylogeny reconstruction are Neighbor-Joining (NJ) [67] and BIONJ [24].
If DNA sequences are compared, a common way of defining the distance between two evolutionarily related sequences is to use the (estimated) number of substitutions per position that have occurred since the two sequences evolved from their last common ancestor. The simplest substitution model for nucleic-acid sequences is the Jukes-Cantor model. Here, all nucleotide substitutions are assumed to occur with the same probability per time unit. Under this model, the number of substitutions per position can be estimated from the number of mismatches per position in an alignment of the compared sequences, using the well-known Jukes-Cantor formula [37]. More elaborate substitution models are available for DNA or protein sequences that account for different substitution probabilities for different pairs of nucleotide or amino-acid residues.
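As a minimal illustration, the Jukes-Cantor correction mentioned above can be written in a few lines of Python. Under this model, a mismatch proportion p corresponds to an estimated distance of d = -3/4 · ln(1 - 4p/3). The function name is chosen for this sketch only and is not taken from any of the programs discussed in this review.

```python
import math


def jukes_cantor_distance(p: float) -> float:
    """Estimate substitutions per position from the mismatch proportion p.

    p is the observed fraction of mismatching positions in an alignment.
    The correction is only defined for 0 <= p < 0.75, since 0.75 is the
    mismatch proportion expected for unrelated random DNA sequences.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("Jukes-Cantor distance requires 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

For small p the estimate is close to p itself; for larger p it grows faster than p, because multiple substitutions at the same position become more likely.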
By contrast, most of the above-mentioned alignment-free methods are not based on probabilistic models of evolution; rather, they use heuristic measures of sequence similarity or dissimilarity. If sequences are represented by word-frequency vectors, for example, standard distance measures on vector spaces, such as the Euclidean distance or the Manhattan distance, can be applied to these frequency vectors in order to calculate a ‘distance’ between two compared sequences. Such distances, however, are hard to interpret from a phylogenetic point of view. Clearly, a pair of closely related sequences will have more words in common than a more distantly related pair of sequences, so the Euclidean distance between their word-frequency vectors will be smaller.
But distance values calculated in this way are not measures of evolutionary distances, e.g. in terms of events that happened since two sequences separated from their last common ancestor. They only indicate if one pair of sequences shares more or less similarity than another pair of sequences. Such heuristic distance measures can be used for clustering, but not for more accurate phylogenetic analyses.
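The heuristic word-frequency approach described above can be sketched as follows. This is a simplified illustration rather than the implementation of any particular tool: the function names are hypothetical, and relative frequencies are used as one possible normalisation.

```python
import math
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count all overlapping words of length k in a single pass over seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def relative_frequencies(seq: str, k: int) -> dict:
    """Turn k-mer counts into relative frequencies (empty if seq is shorter than k)."""
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


def euclidean_distance(f1: dict, f2: dict) -> float:
    """Euclidean distance between two sparse word-frequency vectors."""
    words = set(f1) | set(f2)
    return math.sqrt(sum((f1.get(w, 0.0) - f2.get(w, 0.0)) ** 2 for w in words))
```

The resulting value allows sequence pairs to be ranked by similarity, but it is not an estimate of the number of substitutions per position.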
Since the distance values calculated by earlier word-based alignment-free methods have no direct phylogenetic interpretation, it would make no sense to ‘evaluate’ the accuracy of these values directly. Therefore, the developers of these methods did not evaluate and benchmark the distance values produced by their methods themselves. Instead, they applied clustering algorithms or distance-based tree reconstruction methods to these distances and evaluated the resulting trees. Again, since the computed distances between the sequences have no direct meaning, the branch-lengths of these trees were usually ignored, and only the resulting tree topologies were evaluated. The standard approach to evaluate tree topologies is to compare them to trusted reference topologies using the Robinson-Foulds metric [65]. Note that this is only a very rough and indirect method to benchmark the performance of sequence-comparison methods.
It was only in the last ten years or so that alignment-free methods were proposed that can estimate phylogenetic distances in the sense of an underlying probabilistic model of sequence evolution. The first such approach was published in 2009 by Haubold et al. [29]. These authors developed kr, an alignment-free method that can accurately estimate phylogenetic distances between DNA sequences in the sense of the Jukes-Cantor model. That is, kr estimates the number of nucleotide substitutions per sequence position since the compared sequences evolved from their last common ancestor. To this end, the program uses the average length of common substrings between the compared sequences. Later, we proposed an approach to estimate phylogenetic distances based on the length distribution of k-mismatch common substrings [53].
In the last few years, other alignment-free methods have been proposed to estimate phylogenetic distances in a rigorous way [82, 28, 47, 39]. Some of these methods are based on so-called micro-alignments, short gap-free pairwise alignments with a simplistic structure that can be rapidly calculated. So, strictly speaking, these methods are not alignment-free. They are referred to as ‘alignment-free’ anyway, since they avoid the time-consuming process of calculating optimal alignments over the entire length of the input sequences. Other approaches to estimate distances in a stochastically rigorous way are based on the number of word matches [55]. More recently, extremely fast programs have become popular that can accurately estimate phylogenetic distances between DNA sequences from the number of word matches, using the so-called Jaccard index [36] and min-hash algorithms [10]. A widely used implementation of these ideas is Mash [58]; further improvements to this approach have been proposed and are implemented in the programs Skmer [68], Dashing [4] and Mash Screen [57].
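To make the Jaccard-index idea concrete, the following sketch computes the exact Jaccard index of two k-mer sets, a bottom-s min-hash estimate of it, and the distance formula D = -(1/k) · ln(2j / (1 + j)) that Mash uses to convert a Jaccard index j into an estimated mutation rate. Everything else here is an assumption made for illustration: the function names, the use of SHA-1 as hash function, and returning infinity when no k-mers are shared. Canonical k-mers and other details of the actual programs are omitted.

```python
import hashlib
import math


def kmer_set(seq: str, k: int) -> set:
    """Set of distinct k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_index(a: set, b: set) -> float:
    """Exact Jaccard index |A ∩ B| / |A ∪ B| of two k-mer sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mash_distance(j: float, k: int) -> float:
    """Convert a Jaccard index j for word length k into a distance estimate."""
    if j <= 0.0:
        return math.inf  # assumption of this sketch: no shared k-mers
    return -math.log(2.0 * j / (1.0 + j)) / k


def hash64(word: str) -> int:
    """Deterministic 64-bit hash of a k-mer (SHA-1 is an arbitrary choice)."""
    return int.from_bytes(hashlib.sha1(word.encode()).digest()[:8], "big")


def bottom_sketch(kmers: set, s: int) -> list:
    """Bottom-s sketch: the s smallest hash values of a k-mer set."""
    return sorted(hash64(w) for w in kmers)[:s]


def sketch_jaccard(sk1: list, sk2: list, s: int) -> float:
    """Estimate the Jaccard index from two bottom-s sketches."""
    merged = sorted(set(sk1) | set(sk2))[:s]  # bottom-s sketch of the union
    shared = set(sk1) & set(sk2)
    return sum(h in shared for h in merged) / len(merged) if merged else 0.0
```

The speed of this family of methods comes from the sketch size s being much smaller than the number of distinct k-mers in a genome, so that two genomes are compared through a few thousand hash values each.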
2 Spaced Words
In 2013, we proposed to use so-called spaced words that contain wildcard characters at certain positions, for alignment-free DNA and protein sequence comparison [8, 43, 33]. A spaced word is based on a pre-defined binary pattern P whose positions are called match positions (‘1’) and don’t-care positions (‘0’). Given such a pattern P, we defined a spaced word w with respect to P to be a word that has the same length as the pattern P and that has symbols for nucleotide or amino-acid residues at the match positions of P and wildcard symbols (‘*’) at the don’t-care positions; see Figure 1 for an example. Spaced words, or spaced seeds, have previously been introduced in database searching, to improve the sensitivity of the standard seed-and-extend search strategy [49]. Efficient algorithms have been proposed to optimize the underlying patterns [35], and for spaced-seed hashing [60].
Figure 1: Spaced word w with respect to a pattern P = 1100101 of length ℓ = 7. w consists of nucleotide symbols at the match positions (‘1’) of P and of wildcard symbols, represented as ‘*’, at the don’t-care positions (‘0’). w occurs at position 4 in the DNA sequence S.
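The definition of a spaced word can be illustrated with a short Python sketch. The function names and the example sequence in the usage note are hypothetical, and positions are 0-based here, whereas the figure caption presumably counts from 1.

```python
from collections import Counter


def spaced_word(seq: str, pos: int, pattern: str) -> str:
    """Spaced word of seq at 0-based position pos with respect to pattern.

    pattern is a string of '1' (match position) and '0' (don't-care
    position); residues at don't-care positions are replaced by '*'.
    """
    window = seq[pos:pos + len(pattern)]
    if pos < 0 or len(window) < len(pattern):
        raise ValueError("pattern does not fit into seq at this position")
    return "".join(c if p == "1" else "*" for c, p in zip(window, pattern))


def spaced_word_counts(seq: str, pattern: str) -> Counter:
    """Count the spaced words at all positions of seq."""
    last = len(seq) - len(pattern)
    return Counter(spaced_word(seq, i, pattern) for i in range(last + 1))
```

For example, `spaced_word("ACGTTAGCA", 1, "1100101")` gives `"CG**A*C"`: the residues at the four match positions are kept, and the three don’t-care positions become wildcards.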
In a first study, we simply replaced word-frequency vectors by spaced-word frequency vectors to calculate distances between DNA and protein sequences. As in earlier word-based methods, we used the Euclidean distance or, alternatively, the Jensen-Shannon distance of the frequency vectors to define the distance between two DNA or protein sequences. As a result, the quality of the resulting phylogenetic trees was improved compared to when we used contiguous words, in particular when we used multiple binary patterns and the corresponding spaced-word frequencies instead of a single pattern [43]. The resulting software program is called Spaced Words, or Spaced, for short.
Our spaced-words approach was motivated by the spaced-seeds [48] that have been introduced in database searching, to improve the sensitivity of hit-and-extend approaches such as BLAST [2]. The main advantage of spaced word matches – or ‘spaced seeds’ – compared to contiguous word matches is that neighbouring spaced-word matches are statistically less dependent, so they are distributed more evenly over the sequences. In database searching, this increases the sensitivity, i.e. the probability of finding sequence similarities. For alignment-free sequence comparison, we have shown that results obtained with spaced words are statistically more stable than results based on contiguous words [55].
Note that, like earlier alignment-free approaches, this first version of the program Spaced was still based on heuristic measures of sequence similarity. It does not estimate evolutionary distances in the sense of some probabilistic model. Later, however, we introduced a new distance measure in Spaced based on the number of spaced-word matches [55] that actually estimates phylogenetic distances between DNA sequences in the sense of the Jukes-Cantor model [37]. This is now the default distance measure used in the program Spaced. To find good patterns, or sets of patterns in the multiple-pattern approach, we developed a program called rasbhari [26].
3 Filtered Spaced-Word Matches and Prot-SpaM
In a subsequent project, we introduced a different approach to use spaced words for alignment-free sequence comparison. Instead of comparing spaced-word frequencies, we used spaced-word matches (SpaM) as a special type of micro-alignments. For a binary pattern P as above, a SpaM between two sequences is simply the occurrence of the same spaced word in both sequences with respect to P, see Fig. 2 for an example. In other words, a SpaM is a local, gap-free alignment that has the same length as the pattern P and that has matching nucleotides or amino acids at the match positions and possible mismatches at the don’t-care positions of P. The idea is to consider a large number of SpaMs, and to estimate phylogenetic distances between two sequences by looking at the residues that are aligned to each other at the don’t-care positions of these SpaMs. Obviously, this is only possible if the considered SpaMs represent true homologies, so we have to filter out spurious background SpaMs. To do so, our program first considers all possible SpaMs between two input sequences and calculates a score for each SpaM based on the aligned residues at the don’t-care positions. The program then discards all SpaMs with scores below some threshold. We could show that, with this sort of SpaM filter, one can reliably separate true homologies (‘signal’) from random SpaMs (‘noise’).
Figure 2: Spaced-word match (SpaM) between two DNA sequences S1 and S2 with respect to a binary pattern P = 1100101 of length ℓ = 7, representing match positions (‘1’) and don’t-care positions (‘0’). The two segments have matching nucleotides at all match positions of P but may mismatch at the don’t-care positions.
An implementation of this approach for DNA sequences is called Filtered Spaced Word Matches (FSWM). To estimate distances between DNA sequences, FSWM calculates the proportion of mismatches at the don’t-care positions of the selected SpaMs, as an estimate of the proportion of mismatches in the (unknown) full alignment of the two sequences. It then applies the usual Jukes-Cantor correction, to calculate the estimated number of substitutions per position since the two sequences evolved from their last common ancestor. By default, the program uses a pattern P of length ℓ = 112 with 12 match positions and 100 don’t-care positions, but the user can adjust these parameters. The length of the pattern P is a certain limitation, as it means that, by default, the program is restricted to using gap-free homologies of length ≥ ℓ = 112. A sufficient number of don’t-care positions is necessary, though, to reliably distinguish SpaMs representing true homologies from random background SpaMs. To speed up the program, it can be run with multiple threads; by default 10 threads are used.
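The steps described above (finding SpaMs, scoring and filtering them, and applying the Jukes-Cantor correction to the mismatch proportion at the don’t-care positions) can be sketched as follows. This is a toy illustration under stated assumptions, not the FSWM implementation: all function names are hypothetical, a simple +1/-1 match/mismatch score stands in for the nucleotide substitution matrix used by the program, and further details of the actual software are omitted.

```python
import math
from collections import defaultdict


def match_key(seq: str, pos: int, pattern: str) -> str:
    """Residues of seq at the match positions of pattern, starting at pos."""
    return "".join(seq[pos + t] for t, p in enumerate(pattern) if p == "1")


def find_spams(s1: str, s2: str, pattern: str) -> list:
    """All pairs (i, j) where s1 and s2 agree at every match position."""
    ell = len(pattern)
    index = defaultdict(list)
    for j in range(len(s2) - ell + 1):
        index[match_key(s2, j, pattern)].append(j)
    return [(i, j)
            for i in range(len(s1) - ell + 1)
            for j in index.get(match_key(s1, i, pattern), [])]


def dont_care_pairs(s1: str, s2: str, i: int, j: int, pattern: str) -> list:
    """Residue pairs aligned at the don't-care positions of one SpaM."""
    return [(s1[i + t], s2[j + t]) for t, p in enumerate(pattern) if p == "0"]


def spam_score(pairs: list, match: int = 1, mismatch: int = -1) -> int:
    """Toy score; an assumption standing in for a substitution matrix."""
    return sum(match if a == b else mismatch for a, b in pairs)


def fswm_like_distance(s1: str, s2: str, pattern: str, threshold: int = 0):
    """Jukes-Cantor distance from the don't-care positions of kept SpaMs."""
    mismatches = total = 0
    for i, j in find_spams(s1, s2, pattern):
        pairs = dont_care_pairs(s1, s2, i, j, pattern)
        if spam_score(pairs) >= threshold:  # discard scores below threshold
            mismatches += sum(a != b for a, b in pairs)
            total += len(pairs)
    if total == 0:
        return None  # no SpaM passed the filter
    p = mismatches / total
    if p >= 0.75:
        return math.inf  # Jukes-Cantor correction undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

With a realistic pattern of 100 don’t-care positions, the score of a random background SpaM is expected to be strongly negative, whereas SpaMs in homologous regions score clearly higher; this is what makes a simple threshold sufficient to separate the two groups.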
An implementation of the same algorithm for protein sequences is called Prot-SpaM [46]. Here, we use by default a pattern with 6 match positions and 40 don’t-care positions, i.e. with a length of ℓ = 46. For protein sequences, we use the BLOSUM62 substitution matrix [30] to score SpaMs in order to filter out low-scoring random SpaMs. Again, the user can modify these parameters. To estimate the evolutionary distance between two protein sequences, we consider the pairs of amino acids aligned to each other at the don’t-care positions of the selected spaced-word matches, and we use the Kimura model [38], which approximates the PAM distance [17] between sequences based on the number of mismatches per position.
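For illustration, the widely cited form of Kimura's approximation for protein distances is d = -ln(1 - p - 0.2·p²), where p is the proportion of mismatching amino-acid pairs. The sketch below assumes this standard form; whether Prot-SpaM applies exactly this expression is an assumption, and the function name is hypothetical.

```python
import math


def kimura_protein_distance(p: float) -> float:
    """Approximate substitutions per site from the mismatch proportion p.

    Uses d = -ln(1 - p - 0.2 * p**2); the argument of the logarithm must
    stay positive, which restricts p to values below roughly 0.854.
    """
    arg = 1.0 - p - 0.2 * p * p
    if p < 0.0 or arg <= 0.0:
        raise ValueError("mismatch proportion outside the valid range")
    return -math.log(arg)
```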
4 Read-SpaM: estimating phylogenetic distances based on unassembled sequencing reads
Several authors have pointed out that alignment-free approaches can be applied, in principle, not only to full genome sequences, but also to unassembled reads. Some approaches have been particularly designed for this purpose [82, 68]. The ability to estimate phylogenetic distances based on unassembled reads is not only useful in phylogeny studies, but also in biodiversity research [68] or in clinical studies [20, 9]. Here, species or strains of bacteria can often be identified by genome skimming, i.e. by low-coverage sequencing [81, 21, 64, 19, 50, 68].
We adapted our FSWM approach to estimate phylogenetic distances between different taxa using unassembled reads; we called this approach Read-SpaM [42]. This software can estimate distances between an assembled genome from one taxon and a set of unassembled reads from another taxon, or between sets of unassembled reads from two different taxa. Using simulated sequence data, we could show that Read-SpaM can accurately estimate distances between genomes up to 0.8 substitutions per position, for sequence coverage as low as 2^-9X, if an assembled genome is compared to unassembled reads. If unassembled reads from two different taxa are compared, distance estimates by Read-SpaM are still accurate for up to 0.7 substitutions per position, for a sequencing coverage down to 2^-4X.
5 The most recent approaches: Multi-SpaM and Slope-SpaM
For nucleic-acid sequences, we extended our Filtered Spaced Word Matches approach from pairwise to multiple sequence comparison [18]. Our software Multi-SpaM is based on spaced-word matches between four sequences each. Such a multiple spaced-word match is, thus, a local gap-free four-way alignment with columns of identical nucleotides at the match positions of the underlying binary pattern P, while mismatches are, again, allowed at the don’t-care positions of P. An example is given in Fig. 3; such local four-way alignments are also called P-blocks.
Figure 3: P-block for a pattern P = 11001: the spaced word W = CC**G occurs at [S1, 2], [S4, 1], [S5, 7] and [S6, 3] (top). A P-block defines a gap-free local four-way alignment (bottom).
Multi-SpaM samples P-blocks from the set of input sequences; by default up to 10^6 P-blocks are sampled. To ensure that these P-blocks represent true homologies, only those P-blocks are considered that have a score above a certain threshold. For each of the sampled P-blocks, the program then uses RAxML [74] to calculate an optimal unrooted quartet tree topology. Finally, the program Quartet MaxCut [71] is used to calculate a supertree topology from these quartet topologies.
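The notion of a P-block can be illustrated with the following sketch, which enumerates all P-blocks for a small set of sequences. It is a toy version under stated assumptions: the names are hypothetical, positions are 0-based, and all P-blocks are enumerated exhaustively, whereas the program described above samples P-blocks and filters them by score. The subsequent quartet-tree and supertree steps are not covered.

```python
from collections import defaultdict
from itertools import combinations


def spaced_word_at(seq: str, pos: int, pattern: str) -> str:
    """Spaced word of seq at 0-based position pos ('*' at don't-care positions)."""
    window = seq[pos:pos + len(pattern)]
    return "".join(c if p == "1" else "*" for c, p in zip(window, pattern))


def find_p_blocks(seqs: list, pattern: str) -> list:
    """Enumerate P-blocks as (spaced word, four (sequence index, position) pairs).

    A P-block consists of four occurrences of the same spaced word in four
    different sequences.
    """
    ell = len(pattern)
    occurrences = defaultdict(list)
    for idx, seq in enumerate(seqs):
        for pos in range(len(seq) - ell + 1):
            occurrences[spaced_word_at(seq, pos, pattern)].append((idx, pos))
    blocks = []
    for word, occ in occurrences.items():
        for quad in combinations(occ, 4):
            if len({idx for idx, _ in quad}) == 4:  # four distinct sequences
                blocks.append((word, quad))
    return blocks
```

For the hypothetical input `["ACCATG", "CCTTG", "GCCGAGT", "TTCCAAG"]` with pattern `"11001"`, the only P-block is the spaced word `CC**G` at positions 1, 0, 1 and 2 of the four sequences.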
As another approach to alignment-free phylogeny reconstruction, we developed a program called Slope-SpaM [66]. This program considers the number N_k of k-mer matches or spaced-word matches for patterns P with k match positions, respectively, between nucleic-acid sequences. It calculates N_k for different values of k and estimates the Jukes-Cantor distances between sequences (i.e. the average number of substitutions per sequence position since the sequences diverged from their most recent common ancestor) from the decay of N_k when k increases. This way, evolutionary distances can be calculated accurately for up to around 0.5 substitutions per sequence position.
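The idea of estimating distances from the decay of the number of matches can be sketched as follows, using contiguous k-mers only. The sketch rests on a simplified model that is an assumption of this example: the expected number of k-mer matches is roughly L·p^k for homologous positions plus (L1 - k + 1)·(L2 - k + 1)·q^k for background matches, where p is the match probability at homologous positions and q the background match probability (1/4 for uniform nucleotide frequencies). After subtracting the background term, the logarithm of the match count is linear in k with slope ln p. All function names are hypothetical.

```python
import math
from collections import Counter


def count_kmer_matches(s1: str, s2: str, k: int) -> int:
    """Number of position pairs (i, j) at which s1 and s2 share a k-mer."""
    c1 = Counter(s1[i:i + k] for i in range(len(s1) - k + 1))
    c2 = Counter(s2[j:j + k] for j in range(len(s2) - k + 1))
    return sum(n * c2[word] for word, n in c1.items() if word in c2)


def estimate_match_probability(nk: dict, len1: int, len2: int,
                               q: float = 0.25) -> float:
    """Least-squares slope of ln(N_k - background) against k, returned as p."""
    xs, ys = [], []
    for k, n in sorted(nk.items()):
        background = (len1 - k + 1) * (len2 - k + 1) * q ** k
        if n - background > 0:
            xs.append(k)
            ys.append(math.log(n - background))
    if len(xs) < 2:
        raise ValueError("need at least two usable values of k")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(slope)


def slope_distance(nk: dict, len1: int, len2: int, q: float = 0.25) -> float:
    """Jukes-Cantor distance from the estimated mismatch probability 1 - p."""
    mismatch = 1.0 - estimate_match_probability(nk, len1, len2, q)
    if mismatch >= 0.75:
        raise ValueError("estimated mismatch rate too high for Jukes-Cantor")
    return -0.75 * math.log(1.0 - 4.0 * mismatch / 3.0)
```

In use, `nk` would be filled with `count_kmer_matches(s1, s2, k)` for a range of k. Because only the slope enters the estimate, the length L of the homologous region does not need to be known, which is why this type of estimate can tolerate sequences that share homology only locally.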
6 Back to Multiple Sequence Alignment
There is no strict separation between sequence alignment and word-based, alignment-free methods. As mentioned above, a whole class of so-called ‘alignment-free’ methods are based on ‘micro-alignments’, local pairwise alignments of a simple structure that can be rapidly calculated. In Multi-SpaM, we extended this approach to local multiple alignments.
Ironically, one of the first major applications of fast alignment-free methods was multiple sequence alignment (MSA). The programs MUSCLE [23] and Clustal Omega [69], for example, use word-frequency vectors to rapidly calculate guide trees for the ‘progressive’ approach to MSA [23]. Similarly, fast alignment-free methods are used to find anchor points [54, 34] to make alignments of large genomic sequences possible [31, 52, 41, 15, 16, 3, 59]. In a recent study [44], we used our program FSWM to generate anchor points for multiple genome alignment. We could show that, if distantly related genomes are compared, spaced-word matches are more sensitive and lead to better output alignments than anchor points that are based on exact word matches.
Software Availability
We made Filtered Spaced Word Matches (FSWM) available through a web interface at the Göttingen Bioinformatics Compute Server (GOBICS) at http://fswm.gobics.de/, see Figure 4. There are certain limitations at this web server for the size of the input data: (a) the upper limit for the total size of the input sequences is 512 MB, (b) the number of input sequences must be between 2 and 100, and (c) the minimum length of each input sequence is 1000 bp. At our web server, the underlying pattern P has by default 12 match positions and 100 don’t-care positions. The number of match positions can be adapted by the user. To calculate a score for each spaced-word match, a nucleotide substitution matrix published by Chiaromonte et al. [11] is used. By default, the cut-off value to distinguish ‘homologous’ from background spaced-word matches is set to 0. This value, too, can be adjusted by the user.
Figure 4: Homepage of Filtered Spaced Word Matches (FSWM) at the Göttingen Bioinformatics Compute Server (GOBICS).
In addition, the above-described software tools FSWM, Prot-SpaM, Multi-SpaM, Read-SpaM and Slope-SpaM are freely available as source code through GitHub or through our home page; details are given in Table 1.
Our software is available as open source code from the following URLs:
References
- [1]
- [2]
- [3]
- [4]
- [5]
- [6]
- [7]
- [8]
- [9]
- [10]
- [11]
- [12]
- [13]
- [14]
- [15]
- [16]
- [17]
- [18]
- [19]
- [20]
- [21]
- [22]
- [23]
- [24]
- [25]
- [26]
- [27]
- [28]
- [29]
- [30]
- [31]
- [32]
- [33]
- [34]
- [35]
- [36]
- [37]
- [38]
- [39]
- [40]
- [41]
- [42]
- [43]
- [44]
- [45]
- [46]
- [47]
- [48]
- [49]
- [50]
- [51]
- [52]
- [53]
- [54]
- [55]
- [56]
- [57]
- [58]
- [59]
- [60]
- [61]
- [62]
- [63]
- [64]
- [65]
- [66]
- [67]
- [68]
- [69]
- [70]
- [71]
- [72]
- [73]
- [74]
- [75]
- [76]
- [77]
- [78]
- [79]
- [80]
- [81]
- [82]
- [83]
- [84]