A fast, lock-free approach for efficient parallel counting of occurrences of k-mers

Guillaume Marçais; Carl Kingsford

doi:10.1093/bioinformatics/btr011

Bioinformatics. 2011 Mar 15;27(6):764-70. doi: 10.1093/bioinformatics/btr011. Epub 2011 Jan 7.

Authors

Guillaume Marçais¹, Carl Kingsford

Affiliation

¹ Department of Computer Science, University of Maryland, College Park, MD 20742, USA. gmarcais@umd.edu

Abstract

Motivation: Counting the number of occurrences of every k-mer (substring of length k) in a long string is a central subproblem in many applications, including genome assembly, error correction of sequencing reads, fast multiple sequence alignment and repeat detection. Recently, the deep sequence coverage generated by next-generation sequencing technologies has caused the amount of sequence to be processed during a genome project to grow rapidly, and has rendered current k-mer counting tools too slow and memory intensive. At the same time, large multicore computers have become commonplace in research facilities allowing for a new parallel computational paradigm.

Results: We propose a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient. It is based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length. Due to their flexibility, suffix arrays have been the data structure of choice for solving many string problems. For the task of k-mer counting, important in many biological applications, Jellyfish offers a much faster and more memory-efficient solution.
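To make the idea of a lock-free counting table concrete, here is a minimal C++ sketch. It is not the Jellyfish implementation. It assumes a common technique, open addressing with linear probing, in which an empty slot is claimed with an atomic compare-and-swap and the count is bumped with an atomic add. The names `encode_kmer`, `KmerTable` and `count_kmers`, the 2-bit base encoding, the hash mixer and the fixed table size are all illustrative assumptions. The limit of 31 bases fits naturally here: 31 bases at 2 bits each use 62 bits, so an all-ones 64-bit word can never be a valid key and can mark an empty slot.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <thread>
#include <vector>

// Pack the k-mer starting at s[pos] into 2 bits per base (assumes k <= 31 and
// pos + k <= s.size()). Returns false if a character is not A, C, G or T.
bool encode_kmer(const std::string& s, std::size_t pos, unsigned k, std::uint64_t& out) {
    std::uint64_t v = 0;
    for (unsigned i = 0; i < k; ++i) {
        std::uint64_t code;
        switch (s[pos + i]) {
            case 'A': code = 0; break;
            case 'C': code = 1; break;
            case 'G': code = 2; break;
            case 'T': code = 3; break;
            default: return false;
        }
        v = (v << 2) | code;
    }
    out = v;
    return true;
}

// Hypothetical fixed-size counting table; no mutex is taken anywhere.
class KmerTable {
public:
    explicit KmerTable(std::size_t capacity_pow2)
        : mask_(capacity_pow2 - 1), keys_(capacity_pow2), counts_(capacity_pow2) {
        assert(capacity_pow2 != 0 && (capacity_pow2 & (capacity_pow2 - 1)) == 0);
        for (auto& key : keys_) key.store(kEmpty);
        for (auto& c : counts_) c.store(0);
    }

    // Increment the count of key. Returns false if the table is full.
    bool add(std::uint64_t key) {
        std::size_t i = hash(key) & mask_;
        for (std::size_t probes = 0; probes <= mask_; ++probes, i = (i + 1) & mask_) {
            std::uint64_t cur = keys_[i].load();
            if (cur == kEmpty) {
                // Try to claim the empty slot. If another thread wins the race,
                // expected is updated to the key that thread stored.
                std::uint64_t expected = kEmpty;
                cur = keys_[i].compare_exchange_strong(expected, key) ? key : expected;
            }
            if (cur == key) {
                counts_[i].fetch_add(1);
                return true;
            }
        }
        return false;
    }

    // Look up a count; intended to be called after all writers have finished.
    std::uint64_t count(std::uint64_t key) const {
        std::size_t i = hash(key) & mask_;
        for (std::size_t probes = 0; probes <= mask_; ++probes, i = (i + 1) & mask_) {
            std::uint64_t cur = keys_[i].load();
            if (cur == key) return counts_[i].load();
            if (cur == kEmpty) return 0;
        }
        return 0;
    }

private:
    static constexpr std::uint64_t kEmpty = ~0ULL;  // never a valid key when k <= 31

    // A generic 64-bit mixing function, chosen only for illustration.
    static std::uint64_t hash(std::uint64_t x) {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return x;
    }

    std::size_t mask_;
    std::vector<std::atomic<std::uint64_t>> keys_;
    std::vector<std::atomic<std::uint64_t>> counts_;
};

// Count all k-mers of the reads using n_threads workers sharing one table.
void count_kmers(const std::vector<std::string>& reads, unsigned k,
                 KmerTable& table, unsigned n_threads) {
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&, t] {
            for (std::size_t r = t; r < reads.size(); r += n_threads) {
                const std::string& s = reads[r];
                for (std::size_t p = 0; p + k <= s.size(); ++p) {
                    std::uint64_t key;
                    if (encode_kmer(s, p, k, key)) table.add(key);
                }
            }
        });
    }
    for (auto& w : workers) w.join();
}
```

A caller would construct a `KmerTable` with a power-of-two capacity comfortably larger than the expected number of distinct k-mers, run `count_kmers`, and then query `count` with an encoded k-mer. The sketch skips k-mers containing characters other than A, C, G and T, does not resize the table, and does not treat a k-mer and its reverse complement as the same key. A production tool would need to decide how to handle each of those cases.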

Availability: The Jellyfish software is written in C++ and is GPL licensed. It is available for download at http://www.cbcb.umd.edu/software/jellyfish.

Publication types

Research Support, N.I.H., Extramural
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms*
Animals
Base Sequence
Computational Biology / methods*
Genome
Humans
Sequence Alignment
Sequence Analysis, DNA / methods*
Software*
