andi: fast and accurate estimation of evolutionary distances between closely related genomes

Bernhard Haubold; Fabian Klötzl; Peter Pfaffelhuber

doi:10.1093/bioinformatics/btu815

Bioinformatics. 2015 Apr 15;31(8):1169-75. doi: 10.1093/bioinformatics/btu815. Epub 2014 Dec 10.

Authors

Bernhard Haubold¹, Fabian Klötzl², Peter Pfaffelhuber¹

Affiliations

¹ Department of Evolutionary Genetics, Max-Planck-Institute for Evolutionary Biology, 24306 Plön, Germany, Institute for Neuro- and Bioinformatics, Lübeck University, 23562 Lübeck, Germany and Mathematical Stochastics, Mathematical Institute, Freiburg University, Germany.
² Department of Evolutionary Genetics, Max-Planck-Institute for Evolutionary Biology, 24306 Plön, Germany, Institute for Neuro- and Bioinformatics, Lübeck University, 23562 Lübeck, Germany and Mathematical Stochastics, Mathematical Institute, Freiburg University, Germany.

PMID: 25504847
DOI: 10.1093/bioinformatics/btu815

Abstract

Motivation: A standard approach to classifying sets of genomes is to calculate their pairwise distances. This is difficult for large samples. We have therefore developed an algorithm for rapidly computing the evolutionary distances between closely related genomes.

Results: Our distance measure is based on ungapped local alignments that we anchor through pairs of maximal unique matches of a minimum length. These exact matches can be looked up efficiently using enhanced suffix arrays, and our implementation requires only about 1 s and 45 MB of RAM per Mbase analysed. The pairing of matches distinguishes non-homologous from homologous regions, leading to accurate distance estimation. We show this by analysing simulated data and genome samples ranging from 29 Escherichia coli/Shigella genomes to 3085 genomes of Streptococcus pneumoniae.
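The anchoring idea can be illustrated with a small Python sketch. It is not the andi implementation: it swaps the enhanced suffix array for a plain k-mer dictionary, extends matches only to the right, and treats two consecutive anchors as a pair when they lie the same distance apart in both sequences. The function names (`unique_kmers`, `find_anchors`, `anchor_distance`) and the parameter `k` are hypothetical, and the result is an uncorrected mismatch rate.

```python
from collections import defaultdict


def unique_kmers(subject, k):
    """Map every k-mer that occurs exactly once in subject to its start position."""
    positions = defaultdict(list)
    for j in range(len(subject) - k + 1):
        positions[subject[j:j + k]].append(j)
    return {kmer: pos[0] for kmer, pos in positions.items() if len(pos) == 1}


def find_anchors(query, subject, k):
    """Greedily collect exact matches (query_start, subject_start, length).

    Each match is seeded by a k-mer that is unique in the subject and then
    extended to the right for as long as the two sequences agree.
    """
    index = unique_kmers(subject, k)
    anchors = []
    i = 0
    while i + k <= len(query):
        j = index.get(query[i:i + k])
        if j is None:
            i += 1
            continue
        length = k
        while (i + length < len(query) and j + length < len(subject)
               and query[i + length] == subject[j + length]):
            length += 1
        anchors.append((i, j, length))
        i += length + 1  # skip the position that ended the match
    return anchors


def anchor_distance(query, subject, k=12):
    """Mismatch rate over regions bracketed by pairs of anchors.

    Returns None when no anchor pair is found (assumption of this sketch).
    """
    anchors = find_anchors(query, subject, k)
    mismatches = 0
    aligned = 0
    paired = set()
    for idx in range(len(anchors) - 1):
        qi, sj, length = anchors[idx]
        qn, sn, _ = anchors[idx + 1]
        q_end, s_end = qi + length, sj + length
        gap = qn - q_end
        if sn - s_end != gap:
            continue  # anchors not equidistant: region treated as non-homologous
        # Ungapped alignment of the segment between the two anchors.
        mismatches += sum(a != b for a, b in zip(query[q_end:qn], subject[s_end:sn]))
        aligned += gap
        paired.update((idx, idx + 1))
    # Anchors are exact matches, so they add aligned positions but no mismatches.
    aligned += sum(anchors[idx][2] for idx in paired)
    return mismatches / aligned if aligned else None
```

Calling `anchor_distance(query, subject, k=12)` on two closely related sequences gives the fraction of mismatched positions within the anchored regions. Segments flanked by anchors that are not equidistant in the two sequences are skipped, which is how the pairing separates homologous from non-homologous regions in this simplified version.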

Availability and implementation: We have implemented the computation of anchor distances in the multithreaded UNIX command-line program andi (for ANchor DIstances). C sources and documentation are posted at http://github.com/evolbioinf/andi/

Contact: haubold@evolbio.mpg.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Animals
Biological Evolution*
Databases, Genetic
Genome*
Genomics / methods*
Humans
Phylogeny
Sequence Alignment / methods*
Sequence Analysis, DNA / methods*
Software*