Diagnostic design with machine learning model-based optimization

Hayden C. Metsky; Nicole L. Welch; Nicholas J. Haradhvala; Laurie Rumker; Yibin B. Zhang; Priya P. Pillai; David K. Yang; Cheri M. Ackerman; Juliane Weller; Paul C. Blainey; Cameron Myhrvold; Michael Mitzenmacher; Pardis C. Sabeti

doi:10.1101/2020.11.28.401877

Abstract

Diagnostics, particularly for rapidly evolving viruses, can greatly benefit from principled machine learning-based design. Concentrating here on CRISPR-based viral diagnostics, we screen ~19,000 guide-target pairs and devise a deep neural network to predict a design’s diagnostic signal; we then leverage this model to generate assays that are optimal in expectation over a virus’s genomic variation. We introduce ADAPT, a system for rapid and fully automated design using our approach. We design CRISPR-based diagnostics for the 1,933 vertebrate-infecting viral species, and experimentally show that ADAPT’s designs are sensitive and specific down to the lineage level against extensive viral variation and exhibit lower limits of detection than designs from standard techniques. Model-based optimization is primed to advance CRISPR-based viral diagnostics and many genome-informed technologies for detecting and responding to pathogens.

Competing Interest Statement

H.C.M., N.J.H., C.M., and P.C.S. are co-inventors on a patent application filed by the Broad Institute related to work in this manuscript. N.J.H. is a consultant to Constellation Pharmaceuticals. P.C.B. is a consultant to and equity holder in 10X Genomics, GALT, Celsius Therapeutics, and Next Generation Diagnostics. P.C.S. is a co-founder of and consultant to Sherlock Biosciences and a Board Member of Danaher Corporation, and holds equity in the companies.

Footnotes

* Jointly supervised this work.
1 Although high, this is only an upper bound and is meant to restrict the search space and thus restrict runtime.
2 In particular, by sampling nucleotides (i.e., concatenating locality-sensitive hash functions drawn from a Hamming distance family) and taking the consensus of sequences within each cluster, where clusters are defined by their hash values.
3 We synthesize the reverse complement of g and use that for detection, so these rules correspond to permitting G-T base pairing.
4 We typically add a fixed constant (4) to the objective values before reporting them to the user, which we find makes the values more interpretable to users because it makes them more likely to be non-negative. This shift has no impact on the design options or their rankings.
5 The parameters include primer length, number of tolerated mismatches, bounds on GC content, and the fraction of genomes that must be covered.
6 One technicality: many species have segmented genomes. For these, ADAPT also needs the label of the segment. ADAPT effectively treats each segment as a separate taxonomy; i.e., for species that are segmented, the t_i values are actually pairs of taxonomy ID and segment.
7 The “reference” sequences are determined by NCBI, but can also be provided by the user. They are manually curated, high-quality genomes and encompass major strains.
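
Footnote 2 describes clustering by concatenated locality-sensitive hash functions from a Hamming distance family, followed by taking a consensus within each cluster. The following Python is a minimal sketch of that general idea, not ADAPT's implementation: it assumes equal-length, pre-aligned sequences, and the function names (`sample_positions`, `lsh_key`, `consensus`, `cluster_consensus`) and the parameter `k` are hypothetical names chosen for illustration.

```python
import random
from collections import Counter, defaultdict


def sample_positions(seq_len, k, rng):
    """Draw k distinct positions; reading one position is one hash
    function from the Hamming distance family (assumed setup)."""
    return sorted(rng.sample(range(seq_len), k))


def lsh_key(seq, positions):
    """Concatenate the sampled nucleotides into a single hash value."""
    return "".join(seq[i] for i in positions)


def consensus(seqs):
    """Most common nucleotide in each column of equal-length sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def cluster_consensus(seqs, k, seed=0):
    """Group sequences by hash value and return one consensus per cluster."""
    rng = random.Random(seed)
    positions = sample_positions(len(seqs[0]), k, rng)
    clusters = defaultdict(list)
    for s in seqs:
        clusters[lsh_key(s, positions)].append(s)
    return {key: consensus(members) for key, members in clusters.items()}
```

In this sketch, sequences that differ at few positions are likely to agree at all `k` sampled positions and so share a cluster; a smaller `k` gives coarser clusters, and `k` must not exceed the sequence length.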

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.