Abstract
Diagnostics, particularly for rapidly evolving viruses, stand to benefit from a principled, measurement-driven design that harnesses machine learning and vast genomic data, yet the capability for such design has not previously been built. Here, we develop and extensively validate an approach to designing viral diagnostics that applies a learned model within a combinatorial optimization framework. Concentrating on CRISPR-based diagnostics, we screen a library of 19,209 diagnostic–target pairs and train a deep neural network that predicts diagnostic signal better than contemporary techniques. Our model then makes it possible to design assays that are maximally sensitive over the spectrum of a virus’s genomic variation. We introduce ADAPT, a system for fully automated design, and use ADAPT to design optimal diagnostics for the 1,933 vertebrate-infecting viral species, within 2 hours for most species and within 24 hours for all but 3. We experimentally show that ADAPT’s designs are sensitive and specific down to the lineage level, including against viruses that pose challenges involving genomic variation and specificity. ADAPT’s designs exhibit significantly higher fluorescence and permit lower limits of detection, across a virus’s entire variation, than the outputs of standard design techniques. Our model-based optimization strategy can be applied broadly to viral nucleic acid diagnostics and other sequence-based technologies, and helps achieve a critically needed, proactive resource of assays for surveilling and responding to pathogens.
Competing Interest Statement
H.C.M., N.J.H., C.M., and P.C.S. are co-inventors on a patent application filed by the Broad Institute related to work in this manuscript. N.J.H. is a consultant to Constellation Pharmaceuticals. P.C.B. is a consultant to and equity holder in 10X Genomics, GALT, Celsius Therapeutics, and Next Generation Diagnostics. P.C.S. is a co-founder of and consultant to Sherlock Biosciences and a Board Member of Danaher Corporation, and holds equity in the companies.
Footnotes
* Jointly supervised this work.
1 Although high, this is only an upper bound; it is meant to restrict the search space and thus the runtime.
2 In particular, by sampling nucleotides (i.e., concatenating locality-sensitive hash functions drawn from a Hamming distance family) and taking the consensus of sequences within each cluster, where clusters are defined by their
3 We synthesize the reverse complement of g and use that for detection, so these rules correspond to permitting G-T base pairing.
4 We typically add a fixed constant (4) to the objective values before reporting them to the user, which we find makes the values easier to interpret because they are then more likely to be non-negative. This shift has no impact on the design options or their rankings.
5 The parameters include primer length, number of tolerated mismatches, bounds on GC content, and the fraction of genomes that must be covered.
6 One technicality: many species have segmented genomes. For these, ADAPT also needs the label of the segment. ADAPT effectively treats each segment as a separate taxonomy; i.e., for species that are segmented, the t_i are actually pairs of taxonomy ID and segment.
7 The “reference” sequences are determined by NCBI, but can also be provided by the user. They are manually curated, high-quality genomes and encompass major strains
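
Footnote 2 describes clustering sequences by sampling nucleotides and then taking a per-cluster consensus. The following is a minimal, generic sketch of that idea using bit-sampling locality-sensitive hashing for Hamming distance; it is not ADAPT's implementation. The function names (`make_hash`, `cluster_by_hash`, `consensus`), the number of sampled positions `k`, and the choice to define a cluster as the set of sequences sharing a hash value are all assumptions made for illustration, and the sketch assumes equal-length sequences.

```python
import random
from collections import Counter, defaultdict


def make_hash(seq_len, k, rng):
    """Build a hash that concatenates k bit-sampling functions.

    Each bit-sampling function reads one randomly chosen position;
    concatenating k of them reads k positions (hypothetical helper).
    """
    positions = sorted(rng.sample(range(seq_len), k))
    return lambda seq: "".join(seq[i] for i in positions)


def cluster_by_hash(seqs, k, seed=0):
    """Group equal-length sequences by their sampled-nucleotide hash.

    Assumption: a cluster is the set of sequences with the same hash value.
    """
    rng = random.Random(seed)
    h = make_hash(len(seqs[0]), k, rng)
    clusters = defaultdict(list)
    for seq in seqs:
        clusters[h(seq)].append(seq)
    return dict(clusters)


def consensus(seqs):
    """Return the most common nucleotide at each position."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


if __name__ == "__main__":
    seqs = ["ACGTACGT", "ACGTACGA", "ACGTACGT", "CATGCATG", "CATGCATG"]
    for key, members in cluster_by_hash(seqs, k=3).items():
        print(key, consensus(members), len(members))
```

Sequences that differ at few positions are likely to agree on all `k` sampled positions and so fall into the same cluster, while sequences that differ at every position can never collide. A larger `k` makes clusters tighter at the cost of splitting similar sequences more often.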