Starcode: sequence clustering based on all-pairs search

Eduard Zorita; Pol Cuscó; Guillaume J Filion

doi:10.1093/bioinformatics/btv053

Bioinformatics. 2015 Jun 15;31(12):1913-9. doi: 10.1093/bioinformatics/btv053. Epub 2015 Jan 31.

Authors

Eduard Zorita¹, Pol Cuscó¹, Guillaume J Filion¹

Affiliation

¹ Genome Architecture, Gene Regulation, Stem Cells and Cancer Programme, Centre for Genomic Regulation (CRG), Dr. Aiguader 88, 08003 Barcelona and Universitat Pompeu Fabra (UPF), 08002 Barcelona, Spain.

Abstract

Motivation: The increasing throughput of sequencing technologies offers new applications and challenges for computational biology. In many of those applications, sequencing errors need to be corrected. This is particularly important when sequencing reads from an unknown reference such as random DNA barcodes. In this case, error correction can be done by performing a pairwise comparison of all the barcodes, which is a computationally complex problem.
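To make the cost of the all-pairs comparison concrete, here is a minimal brute-force sketch in Python. It is not the starcode implementation: the function names `levenshtein` and `all_pairs_within` and the example barcodes are illustrative assumptions. Each of the n(n-1)/2 pairs is scored with a standard dynamic-programming edit distance, so the running time grows quadratically with the number of barcodes.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def all_pairs_within(seqs, max_dist):
    """Hypothetical helper: every pair whose distance is at most max_dist."""
    return [(s, t) for s, t in combinations(seqs, 2)
            if levenshtein(s, t) <= max_dist]


# Made-up barcodes, for illustration only.
barcodes = ["ACGTAC", "ACGTAA", "TTGCAA", "ACGTC"]
pairs = all_pairs_within(barcodes, 1)
print(pairs)  # [('ACGTAC', 'ACGTAA'), ('ACGTAC', 'ACGTC')]
```

With millions of barcodes this baseline quickly becomes impractical, which is the difficulty the paper sets out to address.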

Results: Here, we address this challenge and describe an exact algorithm to determine which pairs of sequences lie within a given Levenshtein distance. For error correction or redundancy reduction purposes, matched pairs are then merged into clusters of similar sequences. The efficiency of starcode is attributable to the poucet search, a novel implementation of the Needleman-Wunsch algorithm performed on the nodes of a trie. On the task of matching random barcodes, starcode outperforms sequence clustering algorithms in both speed and precision.
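The general idea of running an edit-distance computation over the nodes of a trie can be sketched as follows. This is a generic illustration, not the poucet search itself, whose details are not given in the abstract. All names (`TrieNode`, `build_trie`, `search`, `cluster`) are hypothetical, sequences are assumed to be non-empty, and the merging step uses a simple union-find (single-linkage) rule that is an assumption rather than the authors' clustering method. Each trie node extends the dynamic-programming row of its parent, so sequences sharing a prefix share that part of the computation, and a branch is abandoned once every entry of its row exceeds the distance threshold.

```python
class TrieNode:
    __slots__ = ("children", "word")

    def __init__(self):
        self.children = {}
        self.word = None  # set on the node that ends a stored sequence


def build_trie(seqs):
    root = TrieNode()
    for s in seqs:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.word = s
    return root


def search(root, query, max_dist):
    """Return stored sequences within max_dist (Levenshtein) of query."""
    hits = []
    first_row = list(range(len(query) + 1))  # empty trie prefix vs. query prefixes
    stack = [(child, ch, first_row) for ch, child in root.children.items()]
    while stack:
        node, ch, prev = stack.pop()
        row = [prev[0] + 1]
        for j, qc in enumerate(query, start=1):
            row.append(min(row[j - 1] + 1,             # insertion
                           prev[j] + 1,                # deletion
                           prev[j - 1] + (qc != ch)))  # substitution or match
        if node.word is not None and row[-1] <= max_dist:
            hits.append(node.word)
        if min(row) <= max_dist:  # otherwise no descendant can match: prune
            stack.extend((c, k, row) for k, c in node.children.items())
    return hits


def cluster(seqs, max_dist):
    """Merge matched pairs into clusters with union-find (an assumed rule)."""
    unique = sorted(set(seqs))
    root = build_trie(unique)
    parent = {s: s for s in unique}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    for s in unique:
        for t in search(root, s, max_dist):
            parent[find(s)] = find(t)
    groups = {}
    for s in unique:
        groups.setdefault(find(s), []).append(s)
    return sorted(groups.values())


# Made-up barcodes, for illustration only.
clusters = cluster(["ACGTAC", "ACGTAA", "TTGCAA", "ACGTC", "TTGCAT"], 1)
print(clusters)
# [['ACGTAA', 'ACGTAC', 'ACGTC'], ['TTGCAA', 'TTGCAT']]
```

Note that under this single-linkage rule ACGTAA and ACGTC land in the same cluster even though they are two edits apart, because both are within one edit of ACGTAC. Whether such transitive merging is desirable depends on the application.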

Availability and implementation: The C source code is available at http://github.com/gui11aume/starcode.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Cluster Analysis*
Computational Biology / methods*
High-Throughput Nucleotide Sequencing / methods*
Humans
Sequence Analysis, DNA / methods*
Software*