Reducing storage requirements for biological sequence comparison

Michael Roberts; Wayne Hayes; Brian R Hunt; Stephen M Mount; James A Yorke

doi:10.1093/bioinformatics/bth408

Bioinformatics. 2004 Dec 12;20(18):3363-9. doi: 10.1093/bioinformatics/bth408. Epub 2004 Jul 15.

Authors

Michael Roberts¹, Wayne Hayes, Brian R Hunt, Stephen M Mount, James A Yorke

Affiliation

¹ Institute for Physical Science and Technology, University of Maryland, College Park, MD 20742-2431, USA.

PMID: 15256412
DOI: 10.1093/bioinformatics/bth408

Abstract

Motivation: Comparison of nucleic acid and protein sequences is a fundamental tool of modern bioinformatics. A dominant method of such string matching is the 'seed-and-extend' approach, in which occurrences of short subsequences called 'seeds' are used to search for potentially longer matches in a large database of sequences. Each such potential match is then checked to see if it extends beyond the seed. To be effective, the seed-and-extend approach needs to catalogue seeds from virtually every substring in the database of search strings. Projects such as mammalian genome assemblies and large-scale protein matching, however, have such large sequence databases that the resulting list of seeds cannot be stored in RAM on a single computer. This significantly slows the matching process.
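The seed-and-extend idea described above can be illustrated with a minimal sketch. It assumes exact-match seeds of a fixed length `k`, an in-memory dictionary as the seed catalogue, and extension by exact character matching only; the function names (`build_seed_index`, `extend`, `seed_and_extend`) are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def build_seed_index(database, k):
    """Catalogue every length-k substring (seed) of every database string."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(database):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def extend(query, q_pos, target, t_pos, k):
    """Grow an exact seed hit left and right (assumption: no mismatches allowed).

    Returns (query_start, target_start, match_length).
    """
    left = 0
    while (q_pos - left > 0 and t_pos - left > 0
           and query[q_pos - left - 1] == target[t_pos - left - 1]):
        left += 1
    right = k
    while (q_pos + right < len(query) and t_pos + right < len(target)
           and query[q_pos + right] == target[t_pos + right]):
        right += 1
    return (q_pos - left, t_pos - left, left + right)


def seed_and_extend(query, database, index, k):
    """Look up each query seed, then check how far each hit extends."""
    matches = set()
    for q_pos in range(len(query) - k + 1):
        for seq_id, t_pos in index.get(query[q_pos:q_pos + k], ()):
            q_start, t_start, length = extend(
                query, q_pos, database[seq_id], t_pos, k)
            matches.add((seq_id, q_start, t_start, length))
    return matches


database = ["ACGTACGTGG", "TTTACGTAAA"]
index = build_seed_index(database, 4)
matches = seed_and_extend("CCACGTAC", database, index, 4)
```

Note that `build_seed_index` stores one entry per substring position, so the index grows roughly in proportion to the total database length. This is the storage cost the abstract identifies as the bottleneck.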

Results: We present a simple and elegant method in which only a small fraction of seeds, called 'minimizers', needs to be stored. Using minimizers can speed up string-matching computations by a large factor while missing only a small fraction of the matches found using all seeds.
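The abstract does not spell out how minimizers are chosen. As a rough sketch, the following uses the window-based definition commonly associated with the term: among every `w` consecutive length-`k` substrings, keep only the smallest one. The lexicographic ordering, the leftmost tie-break, and the names `minimizers`, `k`, and `w` are assumptions made here for illustration.

```python
def minimizers(seq, k, w):
    """Return the (k-mer, position) pairs that are smallest in some window.

    Assumptions: k-mers are compared lexicographically and ties are broken
    by taking the leftmost position in the window.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    chosen = set()
    for start in range(len(kmers) - w + 1):
        best = min(range(start, start + w), key=lambda i: (kmers[i], i))
        chosen.add((kmers[best], best))
    return chosen


seq = "ACGTACGTGG"
selected = minimizers(seq, 3, 4)   # 2 of the 8 k-mers are kept

# Two strings that share a substring of length w + k - 1 share a whole
# window of k-mers, and therefore share that window's minimizer.
a = "GGGACGTACTT"
b = "CCACGTACAA"
shared = ({m for m, _ in minimizers(a, 3, 4)}
          & {m for m, _ in minimizers(b, 3, 4)})
```

Under this scheme only the selected k-mers would need to be stored in the seed catalogue, while any two sequences with a sufficiently long exact match are still guaranteed to have a stored seed in common.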

Publication types

Evaluation Study
Research Support, U.S. Gov't, Non-P.H.S.
Research Support, U.S. Gov't, P.H.S.

MeSH terms

Algorithms*
Databases, Genetic*
Information Storage and Retrieval / methods*
Numerical Analysis, Computer-Assisted*
Sequence Alignment / methods*
Sequence Analysis / methods*

Grants and funding

1R01HG0294501/HG/NHGRI NIH HHS/United States