bioRxiv
New Results

Syllable-PBWT for space-efficient haplotype long-match query

Victor Wang, Ardalan Naseri, Shaojie Zhang, Degui Zhi
doi: https://doi.org/10.1101/2022.01.31.478234
Victor Wang
1School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, USA
Ardalan Naseri
1School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, USA
Shaojie Zhang
2Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA
  • For correspondence: Degui.Zhi@uth.tmc.edu, shzhang@cs.ucf.edu
Degui Zhi
1School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, USA

Abstract

The positional Burrows-Wheeler transform (PBWT) has led to tremendous strides in haplotype matching on biobank-scale data. For genetic genealogical search, PBWT-based methods have optimized the asymptotic runtime of finding long matches between a query haplotype and a predefined panel of haplotypes. However, to enable fast query searches, the full-sized panel and PBWT data structures must be kept in memory, preventing existing algorithms from scaling up to modern biobank panels consisting of millions of haplotypes. In this work, we propose a space-efficient variation of PBWT named Syllable-PBWT, which divides every haplotype into syllables, builds the PBWT positional prefix arrays on the compressed syllabic panel, and leverages the polynomial rolling hash function for positional substring comparison. With the Syllable-PBWT data structures, we then present a long match query algorithm named Syllable-Query. Compared to Algorithm 3 of Sanaullah et al. (2021), the most time- and space-efficient previously published solution to the long match query problem, Syllable-Query reduced the memory use by a factor of over 100 on both the UK Biobank genotype data and the 1000 Genomes Project sequence data. Surprisingly, the smaller size of our syllabic data structures allows for more efficient iteration and CPU cache usage, granting Syllable-Query even faster runtime than existing solutions. The implementation of our algorithm is available at https://github.com/ZhiGroup/Syllable-PBWT.
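
The two ingredients named in the abstract, syllables and a polynomial rolling hash, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the syllable length `b`, the modulus `MOD`, and the base `BASE` are all assumptions chosen for the example. The sketch packs each block of `b` binary sites into one integer and precomputes prefix hashes, so that two haplotypes can be compared over the same range of syllables in constant time.

```python
# Illustrative sketch only; all names and constants here are assumptions,
# not taken from the Syllable-PBWT source code.
MOD = (1 << 61) - 1   # large prime modulus (assumed)
BASE = 1_000_003      # hash base (assumed), larger than any syllable value used below


def to_syllables(hap, b):
    """Pack a 0/1 haplotype into integers, one per block of b sites."""
    out = []
    for start in range(0, len(hap), b):
        block = hap[start:start + b]
        value = 0
        for allele in block:
            value = (value << 1) | allele
        value <<= b - len(block)  # zero-pad a short final block
        out.append(value)
    return out


def prefix_hashes(syllables):
    """h[k] is the polynomial hash of syllables[0:k]."""
    h = [0]
    for s in syllables:
        h.append((h[-1] * BASE + s) % MOD)
    return h


def powers(n):
    """BASE**0 .. BASE**n modulo MOD."""
    p = [1]
    for _ in range(n):
        p.append(p[-1] * BASE % MOD)
    return p


def range_hash(h, pows, lo, hi):
    """Hash of syllables[lo:hi], derived from prefix hashes in O(1)."""
    return (h[hi] - h[lo] * pows[hi - lo]) % MOD


def same_range(h_a, h_b, pows, lo, hi):
    """True if two haplotypes hash equally on syllables lo..hi-1."""
    return range_hash(h_a, pows, lo, hi) == range_hash(h_b, pows, lo, hi)


# Two toy haplotypes of 12 sites, split into syllables of 4 sites.
hap_a = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0]
hap_b = [1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1]
syl_a = to_syllables(hap_a, 4)
syl_b = to_syllables(hap_b, 4)
pows = powers(len(syl_a))
h_a = prefix_hashes(syl_a)
h_b = prefix_hashes(syl_b)
print(syl_a, syl_b)
print(same_range(h_a, h_b, pows, 1, 2), same_range(h_a, h_b, pows, 0, 2))
```

Equal hashes indicate equal substrings only with high probability, since distinct substrings can collide. A query algorithm built on this idea would therefore either accept a small error rate or verify candidate matches against the panel.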

Competing Interest Statement

The authors have declared no competing interest.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Posted February 02, 2022.
Subject Area

  • Bioinformatics