Data-dependent bucketing improves reference-free compression of sequencing reads

Rob Patro; Carl Kingsford

doi:10.1093/bioinformatics/btv248

Bioinformatics. 2015 Sep 1;31(17):2770-7. doi: 10.1093/bioinformatics/btv248. Epub 2015 Apr 24.

Authors

Rob Patro¹, Carl Kingsford²

Affiliations

¹ Department of Computer Science, Stony Brook University, Stony Brook, NY 11794-4400, USA.
² Department of Computational Biology, School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213, USA.

Abstract

Motivation: The storage and transmission of high-throughput sequencing data consume significant resources. As our capacity to produce such data continues to increase, this burden will only grow. One approach to reduce storage and transmission requirements is to compress this sequencing data.

Results: We present a novel technique to boost the compression of sequencing reads that is based on the concept of bucketing similar reads so that they appear nearby in the file. We demonstrate that, by adopting a data-dependent bucketing scheme and employing a number of encoding ideas, we can achieve substantially better compression ratios than existing de novo sequence compression tools, including other bucketing and reordering schemes. Our method, Mince, achieves up to a 45% reduction in file sizes (28% on average) compared with existing state-of-the-art de novo compression schemes.
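
The bucketing idea can be illustrated with a minimal sketch. This is not Mince's actual bucketing rule, which the abstract does not specify. It assumes, purely for illustration, that each read is labeled by its lexicographically smallest k-mer, and the names `bucket_label` and `bucket_reads` and the choice k = 4 are hypothetical. Reads that share a label are written out next to each other, which is the property a downstream compressor could exploit.

```python
from collections import defaultdict


def bucket_label(read: str, k: int = 4) -> str:
    """Hypothetical bucket label: the lexicographically smallest k-mer in the read."""
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def bucket_reads(reads, k: int = 4):
    """Group reads by label, then emit them bucket by bucket.

    Illustrative assumption only; this is not the scheme used by Mince.
    """
    buckets = defaultdict(list)
    for read in reads:
        buckets[bucket_label(read, k)].append(read)

    ordered = []
    for label in sorted(buckets):       # deterministic bucket order
        ordered.extend(buckets[label])  # input order is kept within a bucket
    return ordered


# Toy input: two pairs of near-identical reads, interleaved.
reads = ["AACGTTGC", "CCTGGATC", "AACGTTGG", "CCTGGATT"]
reordered = bucket_reads(reads)
print(reordered)
```

On this toy input the two reads labeled `AACG` end up adjacent, followed by the two labeled `CCTG`. The sketch discards the original read order and performs no encoding of its own; it only shows how reordering brings similar reads together.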

Availability and implementation: Mince is written in C++11, is open source and has been made available under the GPLv3 license. It is available at http://www.cs.cmu.edu/~ckingsf/software/mince.

Contact: carlk@cs.cmu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms*
Computational Biology / methods*
Data Compression / methods*
Genomics
High-Throughput Nucleotide Sequencing / methods*
Humans
Sequence Alignment
Sequence Analysis, DNA / methods*
Software*
