Abstract
Detecting and mitigating off-target activity is critical to the practical application of CRISPR-mediated genome and epigenome editing. While numerous methods have been developed to map Cas9 binding specificity genome-wide, they are generally time-consuming and/or expensive, and not applicable to catalytically dead CRISPR enzymes. We have developed a rapid, inexpensive, and facile assay for identifying off-target CRISPR enzyme binding and cleavage by chemically mapping the unwound single-stranded DNA structures formed upon binding of a sgRNA-loaded Cas9 protein (“CasKAS”). We demonstrate this method in both in vitro and in vivo contexts.
CRISPR-based methods for editing the genome and epigenome have emerged as a highly versatile means of manipulating the genetic makeup and regulatory states of cells. CRISPR technologies hold the potential to transform medical practice by enabling direct elimination of pathogenic sequence variants or manipulation of aberrant gene expression programs. CRISPR has also become a standard tool for discovery in biomedical research, including its uses for high-throughput, massively parallel genomic screens 1.
The presence of significant off-target effects is of universal concern for genome engineering technologies, presenting a major hurdle to fully realizing their potential utility. CRISPR tools have been shown to exhibit biochemical activity away from their intended target sites, which is particularly problematic for therapeutic applications, where risks of activity at sites other than the intended target leading to negative consequences to patient health must be minimal. Understanding and mapping these effects is therefore an urgent need.
To this end, numerous experimental approaches have been developed to experimentally map off-target effects genome-wide. Methods such as Digenome-seq 2 look for particular types of cut sites around target sequences in whole-genome sequencing data; however, deep whole-genome sequencing remains expensive. Assays such as BLESS 3, GUIDE-seq 4, HTGTS 5, DSBCapture 6, BLISS 7, SITE-seq 8, CIRCLE–seq 9, TTISS 10, INDUCE-seq 11, and CHANGE-seq 12 aim instead to directly map Cas9 cleavage events. However, all these methods involve some combination of complex and laborious molecular biology protocols and non-standard reagents, and have not been widely adopted. Other methods, such as DISCOVER-seq 13, which maps DNA repair activity by applying ChIP-seq against the MRE11 protein, as well as earlier applications of ChIP-seq to map catalytically dead dCas9 occupancy sites genome-wide 14,15, suffer from technical issues associated with the ChIP procedure. Most recently, long-read sequencing has been adapted to the problem of Cas9 specificity profiling, in the form of SMRT-OTS and Nano-OTS 16, but the cost of these methods is relatively high while their throughput is comparatively low.
Various computational models have also been trained to predict off-targets genome-wide 17,18. However, these exhibit far from perfect accuracy, and thus in many situations, especially within clinical contexts, direct experimental evidence is needed to accurately identify potential unintended effects of CRISPR-based reagents.
A faster, more accessible, and versatile method for mapping CRISPR off targets is thus still a critical need in the field. When a Cas9-sgRNA ribonucleoprotein (RNP) is engaged with its target site, the sgRNA invades the DNA double helix, forming a ssDNA structure on the other strand (Fig. 1a). We thus reasoned that mapping ssDNA-containing regions should be a sensitive biochemical signal of productive Cas9 binding. The recently developed KAS-seq 19 assay for mapping single-stranded DNA (ss-DNA) (kethoxal-assisted ssDNA sequencing 19) is ideally suited for the purpose of identifying ssDNA generated by CRISPR protein binding to DNA (Fig. 1a-b). KAS-seq is based on the specific covalent labeling of unpaired guanine bases with N3-kethoxal, generating an adduct to which biotin can then be added using click chemistry. After shearing, biotinylated DNA, corresponding to regions containing ssDNA structure, can be specifically enriched for and sequenced.
To determine the feasibility of using KAS-seq to map regions of ssDNA generated by Cas9 binding, we carried out an initial in vitro experiment using mouse genomic DNA (gDNA), purified dCas9 and two sgRNAs targeting the Hoxa locus.
Strikingly, we observed strong peaks at the expected target sites for each sgRNA (Fig. 1c). Detailed examination of dCas9 CasKAS profiles around the predicted sgRNA target sites revealed strand coverage asymmetry patterns similar to those observed for ChIP-seq around transcription factor binding sites 20 (Fig. 1d), indicating that enrichment derives from the sgRNA target site itself and confirming the utility of N3-kethoxal for mapping dCas9 occupancy sites. We termed the assay “CasKAS”.
We then reasoned that CasKAS should also capture active Cas9 complexed with DNA, as the enzyme is thought to remain associated with DNA for some time after cleavage 21. We performed CasKAS experiments with the same sgRNAs and active Cas9 nuclease, and again observed enrichment at the expected on-target sites (Fig. 1e). Examination of Cas9 CasKAS read profiles around the on-target site showed that the 5’ ends of reads are precisely positioned around the expected cut site, with one cut position on the target strand (which binds the sgRNA and is cleaved by the HNH domain) and two to three such positions on the non-target strand (which is cleaved by the RuvC domain; Fig. 1f), consistent with the previously known patterns of Cas9 cleavage 22,23. CasKAS therefore provides target specificity profiles for both active and catalytically dead Cas9 enzymes.
In vitro CasKAS data was highly reproducible between replicates (Fig. 1g), and a modest sequencing depth of between 10 and 20 million mapped reads was sufficient to capture off-target specificity profiles (Fig. 1h), which is an order of magnitude lower than required for resequencing the whole genome.
We observed similar results with two mouse sgRNAs targeting the Nanog locus (Supplementary Fig. 1) and with two human sgRNA (“EMX1” and “VEGFA”; Supplementary Fig. 2 and 3). We found no enrichment using components of the RNP in isolation – sgRNAs, dCas9 or Cas9 (Supplementary Fig. 2).
Next we tested the application of CasKAS in vivo. Living cells contain substantial ssDNA due to active transcription and other processes 19, so in vivo CasKAS signal derives from a mixture of Cas9-associated ssDNA and endogenous processes. We carried out KAS-seq experiments using both dCas9 and Cas9 in HEK293 cells transfected with RNPs targeting EMX1 or VEGFA, as well as negative, no-guide controls, which provided a map of background endogenous ssDNA profiles. At EMX1, which is not active in HEK293 cells, we observe strong peaks at the expected target site (Fig. 1i), as well as an asymmetric read profile around it for dCas9 (Fig. 1j), and a substantial degree of 5’ end clustering at the cut site, similar to what is observed in vitro for active Cas9 (Fig. 1g). The VEGFA gene is active in HEK293 cells, but the dCas9/Cas9 CasKAS signal is still readily identifiable as an addition to the endogenous ssDNA enrichment pattern (Supplementary Fig. 4). These results demonstrate the utility of CasKAS for profiling CRISPR specificity both in vitro and in vivo
We next examined the genome-wide specificity of sgR-NAs as measured by CasKAS. We focus on the mouse sgRNA #1 as it displayed a substantial number of off-targets yet that number was also sufficiently small for all of them to be examined directly. We first called peaks de novo (see Methods for details) without relying on off-target prediction algorithms, then manually curated the resulting peak set, excluding peaks not exhibiting the canonical asymmetric read distribution around a fixed point on the two strands (Fig. 2a). Remarkably, while we found 32 peaks at predicted off-target sites, we also found 198 (i.e. ∼6× as many) additional manually curated peaks; while these peaks exhibit generally lower CasKAS signal (Fig. 2b), they all display proper peak shape characteristics (see Supplementary Fig. 13 for details), suggesting that they are genuine sites of occupancy. Most of the predicted (in total ∼7,500) off-target sites for this sgRNA did not show substantial occupancy by dCas9 CasKAS (Fig. 2c-d).
Sequence comparison of the occupied predicted off-target sites allowed us to evaluate determinants of Cas9 specificity (Fig. 2e). Consistent with previous reports 24,25, the PAM-distal region was much less sequence-constrained than the PAM-proximal seed region. We observed a similar pattern with the other sgRNAs we profiled, in both mouse and human (Supplementary Fig. 5-8 and Supplementary Fig. 9-12).
When analyzing peaks not associated with predicted off-target sites (Supplementary Fig. 14) we observed other telling patterns – at numerous sites with strong dCas9 CasKAS signal, we observe a large number of mismatches to the sgRNA sequence as well as “bulge” regions wherein indels are observed in the target sequence. These mismatches and bulges were in general much larger than what is considered permissible by off-target prediction algorithms; we speculate that the lack of consideration of potential target sequences with large numbers of mismatches or substantial insertions could explain the much larger number of such sites compared to the set of occupied predicted off-targets.
We next devised a simple metric for evaluating the degree of read clustering at cut sites (a “C-score”; see Methods for details) to estimate the degree of cutting by Cas9. The on-target site exhibits the second highest dCas9 CasKAS signal genome-wide. However, strikingly, even though all CasKAS-identified off-target sites showed Cas9 binding, only the on-target site displayed strong cutting activity (Fig. 2f). The behavior of other sgRNAs varies (Supplementary Fig. 5-8 and 15), with some showing multiple clearly identifiable cut sites. Overall, these results are consistent with previous reports that Cas9 requires more successful RNA:DNA basepairing for cleavage activity than is necessary for binding 26,27. Thus, interpreting the read distributions of Cas9 CasKAS at target sites enables simultaneous detection of binding specificity and the promiscuity of catalytic activity.
Finally, we compared in vitro and in vivo CasKAS profiles (Fig. 2g-h). We find many fewer strongly enriched sites in in vivo datasets than in vitro, with the on-target site being either the top (for dCas9) or among the top (for Cas9) sites in vivo. A potential explanation for this difference is the previously reported impediment of Cas9/dCas9 binding to DNA by the presence of nucleosomes 28.
In conclusion, we have presented CasKAS, a simple and robust method for mapping the specificity of active and catalytically dead versions of CRISPR enzymes. CasKAS has numerous advantages over existing tools while also opening up new possibilities for studying CRISPR biology. CasKAS requires no specialized molecular biology protocols, takes just a few hours in vitro (and a similar amount of time after harvesting cells in vivo), and, due to the strong, active enrichment of target sequences, is inexpensive. In contrast to previously developed methods, It measures strand invasion by CRISPR, which is biochemically more specific and relevant to CRISPR function than DNA association. We compared de novo called CasKAS peaks to those generated by other means, and while we found large sets of peaks unique to each method, those found only by CasKAS contained much higher fractions of predicted off-target sites than those unique to other methods (Supplementary Fig. 16).
CasKAS does not rely on measuring DNA cleavage or modification and can thus be used to profile the specificity of all types of DNA-targeting CRISPR proteins. CasKAS also does not rely on cellular repair processes, cell division, or delivery of additional exogenous DNA (as in GUIDE-seq) to generate a detectable signal. These advantages, coupled with low cell input requirements, may increase the utility of the method in rare primary cell types, tissues from animal models, or even for direct assessment of specificity in edited patient cells (e.g. ex vivo edited immune cells). A current limitation of CasKAS is the requirement that a G nucleotide is present within the sgRNA sequence, since kethoxal requires an exposed G to react with. However, only a small fraction (≤5%) of sgRNAs in the human genome lack any Gs for S. pyogenes PAM sequences (Supplementary Fig. 17). Another minor limitation of the current in vitro protocol is that labeling is carried out on high molecular weight (HMW) DNA and samples must be sheared serially. We have explored using pre-sheared and end-repaired DNA (to minimize kethoxal labeling of Gs on sticky ends generated by sonication), with comparable results to using HMW DNA (Supplementary Fig. 18); we anticipate that further optimization or using other approaches, such as enzymatic fragmentation, should allow the parallel high-throughput plate-based profiling of the specificity of very large numbers of sgRNAs.
In addition to being highly valuable for off-target profiling in vitro and in previously difficult to assay settings such as primary cells, we expect CasKAS to provide fruitful insights into the mechanisms and dynamics of in vivo CRISPR action (taking advantage of finely controllable CRISPR systems such as vfCRISPR 29), and the influence of transcriptional, regulatory, and epigenetic and other functional genomic contexts on CRISPR activity.
Methods
Guide RNA sequences
Guide RNAs were obtained from IDT (“sgRNA #1” and “sgRNA #2”) or from Synthego (all others).
The following sgRNA sequences were used in this study:
“sgRNA #1”: GCTTAATTAAGGTAAACGTC
“sgRNA #2”: CCAACCTGGCGGCTCGTTGG
“EMX1 Tsai”: GAGTCCGAGCAGAAGAAGAA
“VEGFA-site1”: GGGTGGGGGGAGTTTGCTCC
“Nanog-sg2”: GATCTCTAGTGGGAAGTTTC
“Nanog-sg3”: GTCTGTAGAAAGAATGGAAG
Guide RNAs were dissolved to a concentration of 100 µM using nuclease-free 1× TE buffer and stored at –20 °C.
In vitro CasKAS
In vitro CasKAS experiments were executed as follows.
First, 1 µL of each synthetic sgRNA were incubated at room temperature with 1 µL of recombinant purified dCas9 (MCLab dCAS9B-200) for 20 minutes. The RNP was then incubated with 1 µg of gDNA at 37 °C for 10 minutes.
The KAS reaction was then carried out by adding 1 µL of 500 mM N3-kethoxal (ApeXBio A8793). DNA was immediately purified using the MinElute PCR Purification Kit (Qiagen 28006), and eluted in 87.5 or 175 µL 25mM K3BO3.
In vivo CasKAS
For in vivo CasKAS experiments, HEK293T cells were seeded at 400,000 cells/well into a 6-well plate the day before RNP transfection. Media was exchanged 2 hours before transfection. For each well, 6,250 ng of Cas9 (MCLAB CAS9-200) or dCas9 (MCLAB dCAS9B-200) and 1,200 ng sgRNA was complexed with CRISPRMAX (Thermo Fisher CMAX00008) reagent in Opti-MEM (Thermo Fisher 51985091) following manufacturer’s protocol. After incubation at room temperature for 15 minutes, the RNP solution was directly added to each well and gently mixed. The cells were incubated with the RNP complex for 14 hours at 37 °C. To harvest and perform kethoxal labeling, media was removed and room temperature 1×PBS was used to wash the cells. Cells were then dissociated with trypsin, trypsin was quenched with media, cells were pelleted at room temperature, and then resuspended in 100 µL of media supplemented with 5 M N3-kethoxal (final concentration). Cells were incubated for 10 minutes at 37 °C with shaking at 500 rpm in a Thermomixer. Cells were then pelleted by centrifuging at 500 g for 5 minutes at 4 °C. Genomic DNA was then extracted using the Monarch gDNA Purification Kit (NEB T3010S) following the standard protocol but with elution using 175 µL 25 mM K3BO3 at pH 7.0.
Click reaction, biotin pull down and library generation
The click reaction was carried out by combining 175 µL purified DNA, 5 µL 20 mM DBCO-PEG4-biotin (DMSO solution, Sigma 760749), and 20 µL 10×PBS in a final volume of 200 µL or 87.5 µL purified and sheared DNA, 2.5 µL 20 mM DBCO-PEG4-biotin (DMSO solution, Sigma 760749), and 10 µL 10×PBS in a final volume of 100 µL. The reaction was incubated at 37 °C for 90 minutes.
DNA was purified using AMPure XP beads (50 µL for a 100 µL reaction or 100 µL for a 200 µL reaction), beads were washed on a magnetic stand twice with 80% EtOH, and eluted in 130 µL 25mM K3BO3.
Purified DNA was then sheared on a Covaris E220 instrument down to ∼150-400 bp size.
For streptavidin pulldown of biotin-labeled DNA, 10 µL of 10 mg/mL Dynabeads MyOne Streptavidin T1 beads (Life Technologies, 65602) were separated on a magnetic stand, then washed with 300 µL of 1×TWB (Tween Washing Buffer; 5 mM Tris-HCl pH 7.5; 0.5 mM EDTA; 1 M NaCl; 0.05% Tween 20). The beads were resuspended in 300 µL of 2×Binding Buffer (10 mM Tris-HCl pH 7.5, 1 mM EDTA; 2 M NaCl), the sonicated DNA was added (diluted to a final volume of 300 µL if necessary), and the beads were incubated for≥15 minutes at room temperature on a rotator. After separation on a magnetic stand, the beads were washed with 300 µL of 1×TWB, and heated at 55 °C in a Thermomixer with shaking for 2 minutes. After removal of the supernatant on a magnetic stand, the TWB wash and 55 °C incubation were repeated.
Final libraries were prepared on beads using the NEB-Next Ultra II DNA Library Prep Kit (NEB, #E7645) as follows. End repair was carried out by resuspending beads in 50 µL 1×EB buffer, and adding 3 µL NEB Ultra End Repair Enzyme and 7 µL NEB Ultra End Repair Enzyme, followed by incubation at 20 °C for 30 minutes (in a Thermomixer, with shaking at 1,000 rpm) and then at 65 °C for 30 minutes.
Adapters were ligated to DNA fragments by adding 30 µL Blunt Ligation mix, 1 µL Ligation Enhancer and 2.5 µL NEB Adapter, incubating at 20 °C for 20 minutes, adding 3 µL USER enzyme, and incubating at 37 °C for 15 minutes (in a Thermomixer, with shaking at 1,000 rpm).
Beads were then separated on a magnetic stand, and washed with 300 µL TWB for 2 minutes at 55 °C, 1000 rpm in a Thermomixer. After separation on a magnetic stand, beads were washed in 100 µL 0.1×TE buffer, then resuspended in 15 µL 0.1×TE buffer, and heated at 98 °C for 10 minutes.
For PCR, 5 µL of each of the i5 and i7 NEB Next sequencing adapters were added together with 25 µL 2×NEB Ultra PCR Mater Mix. PCR was carried out with a 98 °C incubation for 30 seconds and 12 cycles of 98 °C for 10 seconds, 65 °C for 30 seconds, and 72 °C for 1 minute, followed by incubation at 72 °C for 5 minutes.
Beads were separated on a magnetic stand, and the supernatant was cleaned up using 1.8× AMPure XP beads.
Libraries were sequenced in a paired-end format on an Illumina NextSeq instrument using NextSeq 500/550 high output kits (2×36 cycles).
Data processing
Demultipexed fastq files were mapped to the hg38 assembly of the human genome or the mm10 version of the mouse genome as 2×36mers using Bowtie 30 with the following settings: -v 2 -k 2 -m 1 --best --strata -X 1000. Duplicate reads were removed using picard-tools (version 1.99).
Browser tracks generation, fragment length estimation, TSS enrichment calculations, and other analyses were carried out using custom-written Python scripts (https://github.com/georgimarinov/GeorgiScripts). The refSeq set of annotations were used for evaluation of enrichment around TSSs.
Peak calling
Peak calling on in vitro binding datasets was carried out using version 2.1.0 of MACS231 with default settings.
Peaks were then compared against the ENCODE set of “blacklisted” regions 32 to filter out likely artifacts.
Sequence analysis
Guide RNA off-target predictions were obtained from Cas-OFFinder 33
Multiple sequence alignments of sgRNA sequences and their off-targets were generated using MUSCLE 34 and visualized using JalView 35.
Quantification
Cutting score calculation
The Cas9 cutting C-score was calculated as follows.
First, basepair-level Read-Per-Million (RPM) profiles for mapped read 5’ ends were generated separately for the forward and reverse strands. Then the C-score was calculated by multiply the forward and reverse strand profiles (summed over a running window of 3 bp): Where c, i indicate the coordinates by chromosome and position.
Data availability
Sequencing reads for the datasets described in this study are available from GEO accession GSE171962.
Author contributions
G.K.M. conceptualized the study, performed initial in vitro CasKAS experiments, analyzed data, and wrote the manuscript with input from all authors. S.H.K. developed the in vivo CasKAS protocol and performed in vivo CasKAS experiments. S.T.B. carried out in vitro CasKAS optimization. A.E.T. and J.T. supplied sgRNAs and designed off-target profiling experiments. A.E.T. carried out off-target analysis for mouse sgRNAs. T.W. provided key reagents. W.J.G., A.K., C.H. M.C.B. and L.B. supervised the study.
Competing interests
G.K.M., W.J.G, T.W. and C.H. have submitted a provisional patent application based on this work.
Supplementary Materials
Supplementary Figures
Acknowledgments
This work was supported by NIH grants (P50HG007735, RO1 HG008140, U19AI057266 and UM1HG009442 to W.J.G., 1UM1HG009436 to W.J.G. and A.K., 1DP2OD022870-01 and 1U01HG009431 to A.K., and HG006827 to C.H.), the Rita Allen Foundation (to W.J.G.), the Baxter Foundation Faculty Scholar Grant, and the Human Frontiers Science Program grant RGY006S (to W.J.G). W.J.G is a Chan Zuckerberg Biohub investigator and acknowledges grants 2017-174468 and 2018-182817 from the Chan Zuckerberg Initiative. S.K. is supported by MSTP training grant T32GM007365 and the Paul and Daisy Soros Fellowship. J.T. is supported by the NIDDK F99/K00 fellowship of the National Institutes of Health (F99DK126120). M.C.B. is supported by a grant from Stanford ChEM-H and an NIH Directors New Innovator Award (1DP2HD08406901). Fellowship support also provided by the Stanford School of Medicine Dean’s Fellowship (G.K.M.), the Siebel Scholars, the Enhancing Diversity in Graduate Education Program and the Weiland Family Fellowship (A.E.T.). C.H. is a Howard Hughes Medical Institute Investigator.
The authors would like to thank Zohar Shipony and members of the Greenleaf, Kundaje, and Bassik labs for helpful discussion and suggestions regarding this work.