New Results

Squeakr: An Exact and Approximate k-mer Counting System

Prashant Pandey, Michael A. Bender, Rob Johnson, Rob Patro
doi: https://doi.org/10.1101/122077
1 Department of Computer Science, Stony Brook University, Stony Brook, NY 11790, USA (all authors)

Abstract

Motivation k-mer-based algorithms have become increasingly popular in the processing of high-throughput sequencing (HTS) data. These algorithms span the analysis pipeline, from k-mer counting (e.g., for estimating assembly parameters) to error correction, genome and transcriptome assembly, and even transcript quantification. Yet these tasks often use very different k-mer representations and data structures. In this paper, we set forth the fundamental operations for maintaining multisets of k-mers and classify existing systems from a data-structural perspective. We then show how to build a k-mer-counting and multiset-representation system using the counting quotient filter (CQF), a feature-rich approximate membership query (AMQ) data structure. We introduce the k-mer-counting/querying system Squeakr (Simple Quotient filter-based Exact and Approximate Kmer Representation), which is based on the CQF. This off-the-shelf data structure turns out to be an efficient (approximate or exact) representation for sets or multisets of k-mers.

Results Squeakr takes 2×–4.3× less time than the state-of-the-art to count and perform a random-point-query workload. Squeakr is memory-efficient, consuming 1.5×–4.3× less memory than the state-of-the-art. It offers competitive counting performance, and answers point queries (i.e., queries for the abundance of a particular k-mer) over an order of magnitude faster than other systems. The Squeakr representation of the k-mer multiset turns out to be immediately useful for downstream processing (e.g., de Bruijn graph traversal) because it supports fast queries and dynamic k-mer insertion, deletion, and modification.
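
The multiset operations named above (insertion, deletion, and point queries for the abundance of a k-mer) can be illustrated with a minimal Python sketch. It uses an exact `collections.Counter` as a stand-in for the multiset; it is not the CQF and does not reflect Squeakr's actual interface. The function names `count_kmers`, `point_query`, and `delete_kmer` are hypothetical, and reads are assumed to be plain uppercase strings.

```python
from collections import Counter


def count_kmers(reads, k):
    """Insert every length-k substring of every read into an exact multiset."""
    counts = Counter()
    for read in reads:
        # A read of length n contains n - k + 1 k-mers (none if n < k).
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def point_query(counts, kmer):
    """Return the abundance of a single k-mer (0 if it was never inserted)."""
    return counts.get(kmer, 0)


def delete_kmer(counts, kmer):
    """Remove one occurrence of a k-mer, dropping the key when it reaches zero."""
    if counts[kmer] > 1:
        counts[kmer] -= 1
    else:
        counts.pop(kmer, None)


# Hypothetical toy input: two short reads, k = 3.
reads = ["ACGTAC", "CGTA"]
counts = count_kmers(reads, 3)
cgt_before = point_query(counts, "CGT")
delete_kmer(counts, "CGT")
cgt_after = point_query(counts, "CGT")
```

An exact hash-based multiset like this one stores every k-mer in full, which is the memory cost that a compact structure such as the CQF is meant to reduce; the sketch only shows the operations, not the space savings.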

Availability https://github.com/splatlab/squeakr

Contact ppandey{at}cs.stonybrook.edu

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Posted March 29, 2017.

Subject Area

  • Bioinformatics