bioRxiv
New Results

Content-Based Similarity Search in Large-Scale DNA Data Storage Systems

Callista Bee, Yuan-Jyue Chen, David Ward, Xiaomeng Liu, Georg Seelig, Karin Strauss, Luis Ceze
doi: https://doi.org/10.1101/2020.05.25.115477
Callista Bee
1 Paul G. Allen School of Computer Science & Engineering, University of Washington
  • For correspondence: kstwrt@cs.washington.edu kstrauss@microsoft.com luisceze@cs.washington.edu
Yuan-Jyue Chen
2 Microsoft
David Ward
1 Paul G. Allen School of Computer Science & Engineering, University of Washington
Xiaomeng Liu
1 Paul G. Allen School of Computer Science & Engineering, University of Washington
Georg Seelig
1 Paul G. Allen School of Computer Science & Engineering, University of Washington
3 Department of Electrical & Computer Engineering, University of Washington
Karin Strauss
2 Microsoft
  • For correspondence: kstwrt@cs.washington.edu kstrauss@microsoft.com luisceze@cs.washington.edu
Luis Ceze
1 Paul G. Allen School of Computer Science & Engineering, University of Washington
  • For correspondence: kstwrt@cs.washington.edu kstrauss@microsoft.com luisceze@cs.washington.edu

Abstract

Synthetic DNA has the potential to store the world’s continuously growing amount of data in an extremely dense and durable medium. Current proposals for DNA-based digital storage systems include the ability to retrieve individual files by their unique identifier, but not by their content. Here, we demonstrate content-based retrieval from a DNA database by learning a mapping from images to DNA sequences such that an encoded query image will retrieve visually similar images from the database via DNA hybridization. We encoded and synthesized a database of 1.6 million images and queried it with a variety of images, showing that each query retrieves a sample of the database in which visually similar images appear at a rate much greater than chance. We compare our results with several algorithms for similarity search in electronic systems, and demonstrate that our molecular approach is competitive with state-of-the-art electronics.

One Sentence Summary: Learned encodings enable content-based image similarity search from a database of 1.6 million images encoded in synthetic DNA.

Competing Interest Statement

C.B., Y.C., G.S., K.S., and L.C. have filed a patent application on the core idea. K.S. and Y.C. are employed by Microsoft.

Footnotes

  • https://github.com/uwmisl/primo-data

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Posted May 27, 2020.

Subject Area

  • Bioengineering