PT - JOURNAL ARTICLE
AU - Bee, Callista
AU - Chen, Yuan-Jyue
AU - Ward, David
AU - Liu, Xiaomeng
AU - Seelig, Georg
AU - Strauss, Karin
AU - Ceze, Luis
TI - Content-Based Similarity Search in Large-Scale DNA Data Storage Systems
AID - 10.1101/2020.05.25.115477
DP - 2020 Jan 01
TA - bioRxiv
PG - 2020.05.25.115477
4099 - http://biorxiv.org/content/early/2020/05/27/2020.05.25.115477.short
4100 - http://biorxiv.org/content/early/2020/05/27/2020.05.25.115477.full
AB - Synthetic DNA has the potential to store the world’s continuously growing amount of data in an extremely dense and durable medium. Current proposals for DNA-based digital storage systems include the ability to retrieve individual files by their unique identifier, but not by their content. Here, we demonstrate content-based retrieval from a DNA database by learning a mapping from images to DNA sequences such that an encoded query image will retrieve visually similar images from the database via DNA hybridization. We encoded and synthesized a database of 1.6 million images and queried it with a variety of images, showing that each query retrieves a sample of the database in which visually similar images appear at a rate much greater than chance. We compare our results with several algorithms for similarity search in electronic systems, and demonstrate that our molecular approach is competitive with state-of-the-art electronics. One Sentence Summary: Learned encodings enable content-based image similarity search from a database of 1.6 million images encoded in synthetic DNA. Competing Interest Statement: C.B., Y.C., G.S., K.S., and L.C. have filed a patent application on the core idea. K.S. and Y.C. are employed by Microsoft.