New Results

Fast Alignment-Free Similarity Estimation By Tensor Sketching

Amir Joudaki, Gunnar Rätsch, André Kahles
doi: https://doi.org/10.1101/2020.11.13.381814
1Department of Computer Science, ETH Zurich, Zurich 8092, Switzerland
2University Hospital Zurich, Biomedical Informatics Research, Zurich 8091, Switzerland
3SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
For correspondence: ajoudaki@inf.ethz.ch raetsch@inf.ethz.ch andre.kahles@inf.ethz.ch

Abstract

The sharp increase in the capacity of next-generation sequencing technologies has created a demand for algorithms capable of quickly searching a large corpus of biological sequences. The complexity of biological variability and the magnitude of existing data sets have impeded finding algorithms with guaranteed accuracy that run efficiently in practice. Our main contribution is the Tensor Sketch method, which efficiently and accurately estimates edit distances. In our experiments, Tensor Sketch had a Spearman’s rank correlation of 0.956 with the exact edit distance, improving on its best competitor, Ordered MinHash, by 23%, while running almost 5 times faster. Finally, all sketches can be updated dynamically if the input is a sequence stream, making the method appealing for large-scale applications where data cannot fit into memory.
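To make the evaluation metric concrete, the following is a minimal sketch of how a Spearman rank correlation between exact edit distances and a set of estimated distances could be computed. The function names `edit_distance`, `ranks`, and `spearman` are illustrative choices, not taken from the paper, and the numbers in the usage note are toy values rather than experimental results.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def ranks(values):
    """1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the rank vectors.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

In use, one would compute `edit_distance` for each sequence pair, compute the sketch-based distance for the same pairs, and pass the two lists to `spearman`. A value near 1 means the estimate ranks pairs almost the same way as the exact edit distance does.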

Conceptually, our approach has three steps: 1) represent sequences as tensors over their sub-sequences, 2) apply tensor sketching that preserves tensor inner products, 3) implicitly compute the sketch. The sub-sequences, which are not necessarily contiguous pieces of the sequence, allow us to outperform k-mer-based methods, such as min-hash sketching over a set of k-mers. Typically, the number of sub-sequences grows exponentially with the sub-sequence length, introducing both memory and time overheads. We directly address this problem in steps 2 and 3 of our method. While rank-1 and super-symmetric tensors are known to admit efficient sketching, the sub-sequence tensor satisfies neither of these properties. Hence, we propose a new sketching scheme that completely avoids the need for constructing the ambient space.
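The three steps can be illustrated with a small, simplified sketch. It is an assumption-laden toy rather than the authors' implementation: it uses unnormalized sub-sequence counts, a count-sketch style hash built from one random bucket hash and one random sign per tuple position, and hypothetical names (`make_hashes`, `sketch_explicit`, `sketch_implicit`). The explicit version enumerates every length-`t` sub-sequence, while the implicit version obtains the same vector with a dynamic program that never materializes the tensor.

```python
import random
from itertools import combinations


def make_hashes(alphabet, t, D, seed=0):
    """For each tuple position, draw a random bucket in [0, D) and a random sign per symbol."""
    rng = random.Random(seed)
    h = [{c: rng.randrange(D) for c in alphabet} for _ in range(t)]
    s = [{c: rng.choice((-1, 1)) for c in alphabet} for _ in range(t)]
    return h, s


def sketch_explicit(x, t, D, h, s):
    """Reference version: visit every length-t sub-sequence (cost grows as C(len(x), t))."""
    phi = [0] * D
    for idx in combinations(range(len(x)), t):
        bucket = sum(h[p][x[i]] for p, i in enumerate(idx)) % D
        sign = 1
        for p, i in enumerate(idx):
            sign *= s[p][x[i]]
        phi[bucket] += sign
    return phi


def sketch_implicit(x, t, D, h, s):
    """Same vector in O(len(x) * t * D) time, consuming one character at a time."""
    # T[p][r]: signed count of length-p sub-sequences seen so far that land in bucket r
    T = [[0] * D for _ in range(t + 1)]
    T[0][0] = 1  # the empty sub-sequence: bucket 0, sign +1
    for c in x:
        # update long tuples first so c extends only sub-sequences that ended before it
        for p in range(t, 0, -1):
            shift, sign = h[p - 1][c], s[p - 1][c]
            prev, row = T[p - 1], T[p]
            for r in range(D):
                row[(r + shift) % D] += sign * prev[r]
    return T[t]
```

Because the implicit version only appends characters to running tables, it can be fed a stream. Under the independence assumptions made here, the dot product of two such sketches equals, in expectation over the random signs, the inner product of the two sub-sequence count tensors.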

The main advantages of our tensor-sketching technique are three-fold: 1) Tensor Sketch has higher accuracy than any of the other assessed sketching methods used in practice. 2) All sketches can be computed in a streaming fashion, leading to significant time and memory savings when there is overlap between input sequences. 3) It is straightforward to extend tensor sketching to different settings, leading to efficient methods for related sequence analysis tasks. We view tensor sketching as a framework to tackle a wide range of relevant bioinformatics problems, and we are confident that it can bring significant improvements for applications based on edit distance estimation.

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

  • a few typos fixed

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
Posted March 01, 2021.
Subject Area

  • Bioinformatics