Abstract
Harnessing genomic data and predictive models will provide activity-informed diagnostic assays for thousands of viruses and offer rapid design for novel ones. Here we develop and extensively validate new algorithms that design nucleic acid assays having maximal predicted detection activity over a virus’s full genomic diversity with stringent specificity. Focusing on CRISPR-Cas13a detection, we test a library of ~19,000 guide-target pairs and construct a convolutional neural network that predicts Cas13a detection activity better than other techniques. We link our methods by building ADAPT, an end-to-end system that automatically leverages the latest viral genome data. We designed optimal species-specific assays for the 1,933 vertebrate-infecting viral species, within 2 hours for most species and within 24 hours for all but 3. ADAPT’s designs are sensitive and specific down to the lineage level for the range of taxa we tested, including ones that pose challenges involving genomic diversity and specificity. They also exhibit significantly higher fluorescence and lower limits of detection, across a virus’s full spectrum of genomic diversity, than designs from standard techniques. ADAPT is available in an accessible software package and can be applied to other detection technologies to enhance critically needed viral diagnostic and surveillance efforts.
Competing Interest Statement
H.C.M., N.J.H., C.M., and P.C.S. are co-inventors on a patent application filed by the Broad Institute related to work in this manuscript. N.J.H. is a consultant to Constellation Pharmaceuticals. P.C.B. is a consultant to and equity holder in 10X Genomics, GALT, Celsius Therapeutics, and Next Generation Diagnostics. P.C.S. is a co-founder of and consultant to Sherlock Biosciences and a Board Member of Danaher Corporation, and holds equity in the companies.
Footnotes
* Jointly supervised this work.
1 Although high, this is only an upper bound; it is meant to restrict the search space and thus the runtime.
2 In particular, by sampling nucleotides (i.e., concatenating locality-sensitive hash functions drawn from a Hamming distance family) and taking the consensus of the sequences within each cluster, where clusters are defined by their hash values.
3 We synthesize the reverse complement of g and use that for detection, so these rules correspond to permitting G-T base pairing.
4 We typically add a fixed constant (4) to the objective values before reporting them to the user; we find this makes the values more interpretable because they are more likely to be non-negative. This shift has no impact on the design options or their rankings.
5 The parameters include primer length, number of tolerated mismatches, bounds on GC content, and the fraction of genomes that must be covered.
6 One technicality: many species have segmented genomes. For these, ADAPT also needs the label of the segment. ADAPT effectively treats each segment as a separate taxonomy; i.e., for species that are segmented, the t_i are actually pairs of taxonomy ID and segment.
7 The “reference” sequences are determined by NCBI, but can also be provided by the user. They are manually curated, high-quality genomes and encompass major strains.
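The clustering described in footnote 2 can be illustrated with a minimal sketch. It assumes equal-length, gap-free sequences and a single hash built from k sampled positions; the function names (`make_hash`, `consensus`, `cluster_consensus`) and the parameters `k` and `seed` are hypothetical and are not taken from ADAPT's code. In a Hamming distance family, each elementary hash function returns the nucleotide at one position, so concatenating k of them amounts to reading the sequence at k sampled positions.

```python
import random
from collections import Counter, defaultdict


def make_hash(seq_len, k, rng):
    """Concatenate k elementary Hamming-family hashes (one sampled position each)."""
    positions = sorted(rng.sample(range(seq_len), k))
    return lambda seq: "".join(seq[p] for p in positions)


def consensus(seqs):
    """Return the most common nucleotide at each position of equal-length sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def cluster_consensus(seqs, k=4, seed=0):
    """Group sequences by hash value and return one consensus per cluster."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    h = make_hash(len(seqs[0]), k, rng)
    clusters = defaultdict(list)
    for s in seqs:
        clusters[h(s)].append(s)  # sequences with equal hash values share a cluster
    return sorted(consensus(members) for members in clusters.values())


# Two groups that differ at every position separate for any choice of positions.
example = ["AAAAAAAA", "CCCCCCCC", "AAAAAAAA", "CCCCCCCC", "AAAAAAAA"]
print(cluster_consensus(example))  # ['AAAAAAAA', 'CCCCCCCC']
```

Sequences that agree at all sampled positions collide, so similar sequences are likely to share a cluster; a smaller k makes clusters coarser, a larger k makes them finer.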