New Results

A randomized parallel algorithm for efficiently finding near-optimal universal hitting sets

Barış Ekim, Bonnie Berger, Yaron Orenstein
doi: https://doi.org/10.1101/2020.01.17.910513
Barış Ekim
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge MA 02139, USA; Department of Mathematics, Massachusetts Institute of Technology, Cambridge MA 02139, USA
Bonnie Berger
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge MA 02139, USA; Department of Mathematics, Massachusetts Institute of Technology, Cambridge MA 02139, USA
Yaron Orenstein
School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva 8410501, Israel
  • For correspondence: yaronore@bgu.ac.il

Abstract

As the volume of next-generation sequencing data increases, the need for algorithms that process it efficiently becomes urgent. Universal hitting sets (UHSs) were recently introduced as an alternative to the central idea of minimizers in sequence analysis, in the hope that they could more efficiently address common tasks such as computing hash functions for read overlap, sparse suffix arrays, and Bloom filters. A UHS is a set of k-mers that hits every sequence of length L (that is, every L-long sequence contains at least one k-mer from the set), and can thus serve as a set of indices to L-long sequences. Unfortunately, methods for computing small UHSs are not yet practical for real-world sequencing instances because of their serial and deterministic nature, which leads to long runtimes and high memory demands when handling typical values of k (e.g. k > 13). To address this bottleneck, we present two algorithmic innovations that significantly decrease runtime while keeping memory usage low: (i) we leverage advanced theoretical and architectural techniques to parallelize and decrease memory usage in calculating k-mer hitting numbers; and (ii) we build upon techniques from randomized Set Cover to select universal k-mers much faster. We implemented these innovations in PASHA, the first randomized parallel algorithm for generating near-optimal UHSs, which newly handles k > 13. We demonstrate empirically that PASHA produces sets only slightly larger than those of serial deterministic algorithms; moreover, the set size is provably guaranteed to be within a small factor of the optimal size. PASHA improves on the runtime and memory usage of the current best algorithms by orders of magnitude. We expect our newly practical construction of UHSs to be adopted in many high-throughput sequence analysis pipelines.
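
To make the UHS definition above concrete, the following is a minimal brute-force sketch that checks whether a given set of k-mers hits every sequence of length L. It is not PASHA and not the authors' method: the function names `is_universal_hitting_set` and `all_kmers` are hypothetical, the DNA alphabet is an assumption, and the exhaustive enumeration is exponential in L, so it is only usable for tiny parameters.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed DNA alphabet; not specified in the abstract


def all_kmers(k, alphabet=ALPHABET):
    """Return the set of all strings of length k over the alphabet."""
    return {"".join(p) for p in product(alphabet, repeat=k)}


def is_universal_hitting_set(kmers, k, L, alphabet=ALPHABET):
    """Return True if every length-L string contains at least one k-mer from `kmers`.

    Brute force over all len(alphabet)**L strings: illustration only.
    """
    kmers = set(kmers)
    for chars in product(alphabet, repeat=L):
        seq = "".join(chars)
        # The sequence is "hit" if any of its L - k + 1 windows is in the set.
        if not any(seq[i:i + k] in kmers for i in range(L - k + 1)):
            return False
    return True


if __name__ == "__main__":
    k, L = 2, 3
    full = all_kmers(k)
    # Dropping "AC" keeps the set universal: no 3-mer consists only of "AC" windows.
    print(is_universal_hitting_set(full - {"AC"}, k, L))
    # Dropping "AA" breaks universality: the sequence "AAA" is no longer hit.
    print(is_universal_hitting_set(full - {"AA"}, k, L))
```

The example shows why a UHS can be smaller than the full set of k-mers: a k-mer such as "AC" can be removed when no L-long sequence is built entirely from it, whereas a k-mer such as "AA" cannot.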

Footnotes

  • bab{at}mit.edu

  • http://pasha.csail.mit.edu

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Posted January 18, 2020.
Subject Area

  • Bioinformatics