Abstract
Acute myeloid and lymphoid leukemias often harbor chromosomal translocations involving the Mixed Lineage Leukemia-1 gene, which encodes the KMT2A lysine methyltransferase. The most common translocations produce in-frame fusions of KMT2A to trans-activation domains of chromatin regulatory proteins. Here we develop a strategy to map the genome-wide occupancy of oncogenic KMT2A fusion proteins in primary patient samples regardless of fusion partner. By modifying the versatile CUT&Tag method for full automation we identify common and tumor-specific patterns of aberrant chromatin regulation induced by different KMT2A fusion proteins. Integration of automated and single-cell CUT&Tag uncovers lineage heterogeneity within patient samples and provides an attractive avenue for future diagnostics.
Introduction
Ten percent of acute leukemias harbor chromosomal translocations involving the Lysine Methyl-transferase 2A (KMT2A) gene (also referred to as Mixed Lineage Leukemia-1)1. In its normal role, KMT2A catalyzes methylation of the K4 residue of the histone H3 nucleosome tail and is required for fetal and adult hematopoiesis2. The N-terminal portion of KMT2A contains a low complexity domain that mediates protein-protein interactions, an AT-hook/CXXC domain that binds DNA, and multiple chromatin-interacting domains (PHD domains and a bromo domain), whereas the C-terminal portion contains a trans-activation domain that interacts with histone acetyl-transferases and a SET domain that catalyzes histone H3K4 methylation3,4. The KMT2A pre-protein is cleaved to form a 320-kDa N-terminal fragment (KMT2A-N) and a 180-kDa C-terminal fragment (KMT2A-C) that form a stable dimer5,6.
KMT2A contributes to leukemogenesis through oncogenic chromosomal rearrangements that fuse the DNA-binding domain in the N-terminal portion of KMT2A to a diverse array of other chromatin regulatory proteins7,8. Although more than 80 translocation partners have been identified in KMT2A-rearranged (KMT2Ar) leukemias, fusions involving the AF9, ENL, ELL, AF4 or AF10 transcriptional elongation factors account for the majority of cases1,8. These fusion partners regulate RNA Polymerase II (RNAPII) elongation (AF9, ENL, ELL and AF4) or recruit the Dot1L-H3K79 histone methyltransferase (ENL, AF9 and AF10)9-12. Additionally, ENL and AF9 interact with the CBX8 chromobox protein to neutralize the PRC1 gene silencing complex13-16.
Previous work has suggested that KMT2A fusion proteins bind different genomic loci depending on the fusion partner to drive different leukemia subtypes17,18. For example, AF4 fusions are more common in acute lymphoid leukemia (ALL), and AF9 fusions are associated with acute myeloid leukemia (AML)1. In addition, KMT2A rearrangements are prevalent in mixed phenotype acute leukemia (MPAL), and numerous examples of tumors that interconvert between lineage types have been documented17,19-21. However, because methods for efficiently and reliably profiling KMT2A fusion binding sites in patient samples are lacking, the relationship between KMT2A fusions, associated chromatin proteins, leukemia subtypes, and lineage plasticity has been challenging to fully characterize. Here, we establish a chromatin profiling platform that efficiently profiles oncogenic fusion proteins, transcription-associated complexes, and histone modifications in cell lines and patient samples. By integrating these results with related single-cell methods we characterize the regulatory dynamics of KMT2Ar leukemias and find that distinct fusion partners display differential affinity for various transcriptional cofactors and may influence lineage plasticity.
Results
A strategy for mapping the binding sites of diverse KMT2A fusion proteins
Characterizing the chromatin localization of oncogenic fusion proteins has often been limited because ChIP-seq is poorly suited to small amounts of patient material. To efficiently compare the binding sites for wildtype KMT2A and the fusion proteins, we applied AutoCUT&RUN22 across a panel of four KMT2Ar leukemia cell lines and five primary KMT2Ar patient samples sorted for CD45-positive blasts. This collection spans the spectrum of KMT2Ar leukemia subtypes with diverse KMT2A translocations that create oncogenic fusion proteins with the transcriptional elongation factors AF4 (SEM, RS4;11, primary ALL-1 and primary MPAL-2), AF9 (primary MPAL-1), ENL (KOPN-8, primary AML-2), AF6 (ML-2), or a relatively rare fusion to the cytoplasmic GTPase Sept6 (primary AML-1) (Supplementary Table 1). With the exception of ML-2, these samples also contain a wildtype copy of the KMT2A locus. Antibodies to the C-terminal portion recognize only wildtype KMT2A-C, while antibodies to the N-terminal portion recognize both wildtype KMT2A-N and the fusion protein (Fig. 1a). Therefore, binding sites unique to the oncogenic fusion protein can be identified by comparing chromatin profiles obtained with C-terminal and N-terminal KMT2A antibodies. We used an automated CUT&RUN platform to profile replicate samples of cell lines with two different antibodies to the N-terminus and two to the C-terminus of KMT2A, and correlation analysis of sequencing results showed high reproducibility (Supplementary Figure 1).
By examining the profiling landscapes we identified many sites where both the N-terminus and the C-terminus of KMT2A coincide in both H1 hESCs and KMT2Ar leukemia cells (Fig. 1b). Other sites are apparent where only the N-terminus of KMT2A is detected and only in leukemia cells: these must be sites of fusion protein binding (Fig. 1b). To define fusion protein binding sites, we used Gaussian mixture modeling to partition KMT2A peaks into two different distributions in the total enrichment-normalized ratio of N-terminus to C-terminus KMT2A signal (KMT2A N/C score) (Fig. 1c). In the control H1 cells, all of the KMT2A N/C scores fall within a single Gaussian distribution and so a two-component model fails to partition the data, whereas in each of the KMT2Ar leukemia samples two-component Gaussian mixture modeling identifies one group of sites with appreciably higher N/C scores than the other, allowing us to set a threshold to call oncogene target sites (Fig. 1c, Supplementary Fig. 2a-h). For example, in the primary MPAL-1 sample 7,264 sites are identified as binding KMT2A-N and KMT2A-C, whereas 1,517 sites are enriched only for the N-terminus of KMT2A and are therefore called as oncogene binding sites (Fig. 1d). While ∼60% of full-length KMT2A binding sites coincide with gene promoters, ∼70% of fusion protein binding sites occupy gene bodies and often span broad domains up to 10 kb over transcribed regions (Fig. 1e,f). This pattern of oncogenic KMT2A fusion localization is consistent with previous reports18,23.
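The thresholding idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the N/C score is expressed as a log2 ratio, uses simulated scores, and fits the two-component mixture with a small hand-written expectation-maximization routine. The function names (`fit_two_gaussians`, `fusion_posterior`) and the 0.5 posterior cutoff are hypothetical choices made for illustration.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(scores, n_iter=200):
    """Fit a two-component 1D Gaussian mixture by expectation-maximization."""
    xs = sorted(scores)
    n = len(xs)
    half = n // 2
    grand_mean = sum(xs) / n
    grand_sd = math.sqrt(sum((x - grand_mean) ** 2 for x in xs) / n) or 1.0
    # Initialise the components from the lower and upper halves of the scores.
    means = [sum(xs[:half]) / half, sum(xs[half:]) / (n - half)]
    sds = [grand_sd, grand_sd]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each score.
        resp = []
        for x in xs:
            p = [weights[k] * normal_pdf(x, means[k], sds[k]) for k in (0, 1)]
            total = p[0] + p[1]
            resp.append((p[0] / total, p[1] / total) if total > 0 else (0.5, 0.5))
        # M-step: update weight, mean and standard deviation of each component.
        for k in (0, 1):
            nk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sds[k] = max(math.sqrt(var), 1e-3)
    return weights, means, sds


def fusion_posterior(x, weights, means, sds):
    """Posterior probability that x belongs to the higher-mean component."""
    hi = 0 if means[0] > means[1] else 1
    p = [weights[k] * normal_pdf(x, means[k], sds[k]) for k in (0, 1)]
    total = p[0] + p[1]
    return p[hi] / total if total > 0 else 0.0


# Simulated log2 N/C scores (assumption): most peaks near 0, a minority shifted up.
random.seed(0)
log2_nc = [random.gauss(0.0, 0.5) for _ in range(700)]
log2_nc += [random.gauss(3.0, 0.5) for _ in range(150)]

weights, means, sds = fit_two_gaussians(log2_nc)
low_mean = min(means)
# Call a site only if it is above the low component and likely in the high one.
calls = [x > low_mean and fusion_posterior(x, weights, means, sds) > 0.5
         for x in log2_nc]
print(sum(calls), "of", len(log2_nc), "simulated peaks called as fusion-specific")
```

A sketch like this does not decide whether one or two components describe the data better, which is the situation reported for the H1 control; a model comparison step would be needed for that.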
Only ∼1% of fusion protein binding sites are shared across leukemia samples, and these shared sites represent 6% of total sequence space (506kb/8377kb) bound by KMT2Ar in any cell type (Supplementary Table 2). The group of 15 genes that is targeted by the fusion protein in all KMT2Ar leukemia samples is highly enriched for master regulators of hematopoiesis as well as genes that are required for KMT2Ar leukemia24-28. Interestingly, several of the shared KMT2A oncogene targets had not been investigated as downstream mediators of leukemia. By examining the DepMap database of CRISPR-Cas9 screens targeting KMT2Ar leukemia cell lines29,30, we identified SENP6 and ARID2 as oncoprotein targets that are novel dependencies in KMT2Ar leukemia (Supplementary Table 2).
Principal Component Analysis (PCA) of oncogene binding sites indicates that the specific partner in each fusion protein likely influences the tumor-specific localization. KMT2A-AF4 samples cluster together and include both an ALL patient sample as well as an MPAL patient sample (Fig. 1g). The ALL cell line KOPN-8 carries a KMT2A-ENL fusion and has distinct oncogene binding sites. In contrast, despite the primary AML-1, AML-2, and MPAL-1 samples producing distinct KMT2A fusion proteins, the oncogene binding profile is similar. This suggests that KMT2A fusion partners are not the sole determinants of oncogene landscapes, and that lineage-specific features of the chromatin landscape also contribute to oncogene targeting.
We reasoned that the distinct binding sites of KMT2A fusion proteins may in part be driven by the unique cofactors with which the fusion partners associate. Therefore, we mapped the distributions of ENL, ELL, and Dot1L, three chromatin proteins that have previously been shown to interact with KMT2A fusion proteins31. Overall, regions bound by KMT2A fusion proteins are also highly enriched for ENL in most of the samples we profiled, but are only slightly enriched for ELL binding in three of the four KMT2A-AF4 fusion lines (Fig. 1h,i; Supplementary Fig. 2b-e).
Whereas Dot1L has been proposed to be a central component of oncogenic transformation by a variety of KMT2A fusion proteins, this histone methyltransferase is most enriched at the oncogene binding sites of the primary MPAL-1 patient sample (Fig. 1d,j). This leukemia carries a KMT2A-AF9 fusion protein, and AF9 is normally a component of the DotCom complex32. Thus, as for the ENL and ELL transcriptional elongation factors, the interaction of the oncogenic KMT2A fusion partner with an elongation-coupled histone methyltransferase appears to drive localization to gene bodies (Supplementary Fig. 1i). Finally, the primary AML-1 sample harbors a relatively rare KMT2A-Sept6 fusion, and this leukemia has a distinctive set of fusion protein binding sites and lacks the characteristic wide spreading of fusion protein into transcribed genes (Fig. 1k). This suggests that the Sept6 fusion protein is mechanistically distinct from the other fusion proteins we profiled. Thus, our profiling strategy distinguishes loci of fusion protein mislocalization from endogenous KMT2A sites, and these loci can be used to delineate differences between KMT2Ar leukemias that contribute to leukemogenesis.
Mapping chromatin features of KMT2A fusion protein binding sites
We next aimed to characterize chromatin features around the fusion protein binding sites that we had identified in each KMT2Ar cell line. To do this economically and at a scale that could be generally applied to patient samples, we developed AutoCUT&Tag, a modification of our previous AutoCUT&RUN robotic platform22. CUT&Tag takes advantage of the high efficiency and low background of antibody-tethered Tn5 tagmentation-based chromatin profiling relative to previous methods, such as ChIP-seq and CUT&RUN33. The standard CUT&Tag protocol requires DNA extraction before library enrichment by PCR. However, we recently developed conditions for DNA release and PCR enrichment without extraction (CUT&Tag-Direct)34. In this improved protocol a low concentration of SDS is sufficient to displace bound Tn5 from tagmented DNA, and the subsequent addition of the non-ionic detergent Triton-X100 is sufficient to quench the SDS and allow for efficient PCR. This streamlined protocol makes CUT&Tag compatible with robotic handling of samples in a 96-well format in a single plate and generates profiles with data quality comparable to those produced by benchtop CUT&Tag (Supplementary Fig. 3).
To define the chromatin features around KMT2A fusion protein binding sites, we used AutoCUT&Tag to profile the active histone modifications H3K4me1, H3K4me3, and H3K36me3, and the silencing histone modifications H3K27me3 and H3K9me3. Together, these five histone modifications distinguish active promoters, enhancers, transcribed regions, developmentally silenced, and constitutively silenced chromatin35, and provide a straightforward picture of the regulatory status of a genome (Supplementary Fig. 3c). Replicate profiles for each mark in leukemia cell lines were very similar and were merged for further analysis (Supplementary Fig. 4).
We first examined the chromatin features of wildtype KMT2A and oncoprotein binding sites. Consistent with KMT2A catalyzing trimethylation of the H3K4 residue, binding sites for wildtype KMT2A are heavily marked with H3K4me3, whereas oncogene binding sites are relatively depleted for H3K4me3 (Fig. 2a). This difference in chromatin marking supports the observation that oncogenes bind at new sites in the genome without the accompanying wildtype methyltransferase. Interestingly, at a limited subset of broad KMT2A-AF4, KMT2A-ENL, and KMT2A-AF9 binding sites we see that H3K4me3 is deposited away from the gene promoter (Fig. 2b). This suggests that these oncogenic fusions may direct the aberrant localization of an alternative H3K4-methyltransferase under certain circumstances. Oncoprotein binding sites lack H3K27me3 or H3K9me3 (Fig. 2c,d), but are enriched in H3K4me1 and H3K36me3, both of which mark transcribed gene bodies (Fig. 2e,f). Such enrichment of gene-body marks is as expected for mis-targeting of transcriptional elongation complexes via KMT2A fusions31.
Histone modification profiling holds the potential to reveal similarities and distinctions between leukemias by reporting their transcriptional regulome status. For example, H3K4me3 reports activity of gene promoters, and the signal for this modification at blood cell marker gene promoters resembles the immunophenotype characteristic of each leukemia (Fig. 2g). Correlation matrices for each histone mark across the genome showed the greatest variance in H3K4me1, H3K4me3 and H3K27me3 modifications (Supplementary Fig. 4), and we examined these profiles in more detail. We first identified leukemia enriched regions for each modification using the SEACR peak-calling method36, and performed PCA to determine modification-specific similarities between samples. Overall, ALL, AML, and MPAL leukemias clustered together by their H3K4me1 and H3K4me3 profiles (Fig. 2h,i), consistent with similar repertoires of lineage-specific transcriptionally active regions in each leukemia type. In contrast, PCA based on profiling of H3K27me3 partitions samples into groups largely unrelated to their leukemia subtype, suggesting that there are distinctions between leukemias in silenced regions (Fig. 2j). H3K27me3 is an epigenetically inherited histone modification that is linked to developmental progression as cells determine their identities. Thus, these distinct H3K27me3 leukemia landscapes may be indicative of the hematopoietic transitions that are defective in each tumor.
Clustering of regulatory elements between leukemias
To identify common groups of regulatory elements that are shared between leukemia subtypes we compiled merged lists of H3K4me1 peaks from all samples, quantified the sample-specific signal over all peaks (Figure 3a) and performed t-distributed stochastic neighbor embedding (t-SNE) of these elements, followed by density peak clustering (Figure 3e)37. A majority of features are shared across all leukemias (“common”), while a smaller number are specific to each sample. As expected, we also identified groups of H3K4me1-enriched regions in AML (“myeloid elements”), ALL (“lymphoid elements”), or shared between AML and MPAL (“mixed myeloid elements”), or ALL and MPAL (“mixed lymphoid elements”). Thus, this regulatory analysis implies that MPAL leukemias share features with both ALL and AML.
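As a rough illustration of the first two steps described above (compiling a merged peak list and quantifying sample-specific signal over it), the following sketch merges overlapping intervals across samples and counts fragments per merged peak. It is a simplified stand-in rather than the pipeline used here: the interval representation, the function names `merge_peaks` and `signal_matrix`, and the toy coordinates are all hypothetical, counts are left unnormalized, and the t-SNE and density peak clustering steps are not shown.

```python
from collections import defaultdict


def merge_peaks(peak_lists):
    """Union peaks (chrom, start, end) from all samples, merging overlaps.

    Intervals that overlap or touch are merged into one region.
    """
    by_chrom = defaultdict(list)
    for peaks in peak_lists:
        for chrom, start, end in peaks:
            by_chrom[chrom].append((start, end))
    merged = []
    for chrom in sorted(by_chrom):
        intervals = sorted(by_chrom[chrom])
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                cur_end = max(cur_end, end)
            else:
                merged.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end))
    return merged


def signal_matrix(merged, fragments_by_sample):
    """Count fragments overlapping each merged peak, one row per sample."""
    matrix = {}
    for sample, fragments in fragments_by_sample.items():
        row = []
        for chrom, p_start, p_end in merged:
            # Half-open intervals: overlap if each starts before the other ends.
            row.append(sum(1 for f_chrom, f_start, f_end in fragments
                           if f_chrom == chrom and f_start < p_end and f_end > p_start))
        matrix[sample] = row
    return matrix


# Toy peaks and fragments for two hypothetical samples.
peaks_a = [("chr1", 100, 200), ("chr1", 500, 600)]
peaks_b = [("chr1", 150, 250), ("chr2", 10, 50)]
merged = merge_peaks([peaks_a, peaks_b])

fragments = {
    "sample_A": [("chr1", 120, 180), ("chr1", 550, 580), ("chr1", 900, 950)],
    "sample_B": [("chr1", 240, 300), ("chr2", 20, 40)],
}
counts = signal_matrix(merged, fragments)
print(merged)
print(counts)
```

In practice the per-sample rows would be normalized for sequencing depth before any embedding or clustering.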
PCA indicates that the other histone modifications we profiled are also able to distinguish between KMT2Ar leukemias, and so we extended our t-SNE and clustering analysis to identify groups of regions enriched for H3K4me3, H3K36me3, and H3K27me3 that are shared between KMT2Ar leukemia subtypes. Most H3K4me3 and H3K36me3 peaks are common across leukemias, indicating that they largely share gene expression repertoires (Fig. 3b,c). Grouping H3K4me3-marked promoter regions by t-SNE also identified myeloid, lymphoid, mixed myeloid and mixed lymphoid elements (Fig. 3f); however, compared with H3K4me1, a smaller proportion of H3K4me3-marked features show any lineage specificity. This is consistent with previous reports that regulatory elements marked by H3K4me1 generally show more cell-type specificity than the promoter elements marked by H3K4me338,39. H3K27me3 peaks show diversity similar to H3K4me1 (Fig. 3d). As suggested by the H3K27me3 PCA, t-SNE of the H3K27me3 developmentally repressed landscape is uniquely able to subdivide lymphoid-specific elements and mixed lymphoid-specific elements into two spatially separated groups (Fig. 3h). We conclude that high-throughput CUT&Tag profiling of active and repressed chromatin landscapes provides a powerful tool to characterize KMT2Ar leukemias, and that profiling the developmentally repressed genome reveals tumor-specific differences that are not apparent by profiling the active genome.
Integration of autoCUT&Tag with scCUT&Tag reveals tumor heterogeneity
Given our ability to identify regulatory regions that discriminate leukemia types, we reasoned that the heterogeneous usage of those elements within the same leukemia might underlie the phenotypic plasticity of KMT2Ar leukemia. To test this, we performed CUT&Tag on single KMT2Ar leukemia cells, where antibody binding and pA-Tn5 tethering are performed on bulk samples, and then individual cells are arrayed on an ICELL8 platform for barcoded PCR library enrichment33. After optimizing the SDS and Triton X-100 inputs to CUT&Tag-Direct for single-cell applications, we were able to increase the median number of unique reads per cell from ∼6,000 to ∼24,000 (Fig. 4a), while maintaining a high fraction of reads in peaks (Supplementary Fig. 5a).
To examine the cellular heterogeneity of active and repressed chromatin in KMT2Ar leukemia we applied this modified protocol to profile between 1,137 and 3,611 cells from our collection of samples for the H3K4me3, H3K27me3, and H3K36me3 histone modifications. After cells with fewer than ∼300 fragments were excluded, single-cell CUT&Tag for H3K4me3, H3K36me3, and H3K27me3 yielded a median of 4,972, 3,962, and 13,025 unique reads per cell, respectively (Supplementary Fig. 5b). Profiles for each single cell were first binned across the groups of regulatory features we identified by clustering analysis of our bulk profiling data (Supplementary Fig. 5c,d,e), and cells were projected in UMAP space based on that binning (Fig. 4b-d). Discrete sample-specific clusters were resolved by UMAP projection of cells profiled for H3K4me3 and H3K27me3 (Fig. 4b,c) but not H3K36me3 (Fig. 4d), indicating that the differences in the H3K4me3 and H3K27me3 landscapes among KMT2Ar leukemia cells of the same sample are generally smaller than the differences between samples.
To directly examine whether individual cells in KMT2Ar leukemia samples show differential enrichment for H3K4me3 at lineage-specific elements, we compared the percentage of fragments within individual cells for each leukemia type that fell within the myeloid- or lymphoid-enriched features as defined by bulk CUT&Tag profiling (Supplementary Fig. 5f). Whereas the ALL and AML cells generally exhibit mutually exclusive enrichment for mixed lymphoid or myeloid elements, respectively, the majority of MPAL cells are enriched for both (Fig. 4e). Interestingly, the distribution of cells in the lymphoid-myeloid space differs between samples defined by different fusions. A small subset of KMT2A-AF4 ALL cells exhibit bias toward myeloid features (Fig. 4f), and cells in the primary MPAL-2 sample containing KMT2A-AF4 are more dispersed than cells from the KMT2A-AF9-containing primary MPAL-1 (Fig. 4g). This suggests that KMT2A-AF4-containing leukemias may have greater lineage plasticity than leukemias carrying the other KMT2A fusion proteins we profiled. Thus, single-cell CUT&Tag profiling is able to resolve heterogeneous lineage biases within primary pediatric leukemia samples, providing a powerful tool for characterizing these cancers.
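The per-cell comparison described above amounts to computing, for each cell, the percentage of its fragments that fall in lymphoid-enriched versus myeloid-enriched elements. A minimal sketch of that calculation follows; the data layout (a mapping from cell barcode to fragment intervals), the function name `lineage_percentages`, and the toy coordinates are hypothetical, and the fragment-count filter is exposed as a parameter so that the toy example can use a small value.

```python
def overlaps_any(fragment, elements):
    """True if a (chrom, start, end) fragment overlaps any element."""
    chrom, start, end = fragment
    return any(e_chrom == chrom and start < e_end and end > e_start
               for e_chrom, e_start, e_end in elements)


def lineage_percentages(cell_fragments, lymphoid, myeloid, min_fragments=300):
    """Percent of each cell's fragments in lymphoid and in myeloid elements.

    Cells with fewer than min_fragments fragments are excluded.
    Returns {barcode: (percent_lymphoid, percent_myeloid)}.
    """
    result = {}
    for barcode, fragments in cell_fragments.items():
        if len(fragments) < min_fragments:
            continue
        n_lymphoid = sum(overlaps_any(f, lymphoid) for f in fragments)
        n_myeloid = sum(overlaps_any(f, myeloid) for f in fragments)
        total = len(fragments)
        result[barcode] = (100.0 * n_lymphoid / total, 100.0 * n_myeloid / total)
    return result


# Toy element sets and cells (hypothetical coordinates and barcodes).
lymphoid_elements = [("chr1", 100, 200)]
myeloid_elements = [("chr2", 100, 200)]
cells = {
    "cell_A": [("chr1", 110, 150), ("chr1", 120, 160),
               ("chr3", 1, 50), ("chr2", 500, 600)],
    "cell_B": [("chr2", 150, 250), ("chr1", 190, 210)],
    "cell_C": [("chr1", 1, 10)],  # too few fragments, dropped by the filter
}
percentages = lineage_percentages(cells, lymphoid_elements, myeloid_elements,
                                  min_fragments=2)
for barcode, (pct_lymphoid, pct_myeloid) in sorted(percentages.items()):
    print(barcode, pct_lymphoid, pct_myeloid)
```

Plotting the two percentages against each other for every cell gives the kind of lymphoid-myeloid space in which lineage-biased and mixed cells can be compared.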
Discussion
Here we have applied high-throughput chromatin profiling to KMT2Ar leukemias to delineate fusion protein-specific targets and to identify chromatin features that are characteristic of myeloid, lymphoid and mixed-lineage leukemias. To profile these features with high signal-to-noise at the low sequencing depths needed for maximum economy, we modified CUT&Tag-Direct34 for full automation in 96-well format on a standard robot. As CUT&Tag-Direct requires only hundreds to thousands of cells for informative profiles of histone modifications40, AutoCUT&Tag is suitable for profiling samples in a wide range of studies, including developmental and disease studies and screening of patient samples.
By also performing AutoCUT&RUN on KMT2A fusions and components of the SuperElongation and DotCom complexes we have elucidated mechanistic details that likely contribute to the heterogeneity of these tumors. First, we found that the most common KMT2A fusion proteins, including KMT2A-AF4, KMT2A-AF9 and KMT2A-ENL, all colocalize with the ENL protein in gene bodies, whereas a relatively rare KMT2A-Sept6 fusion protein does not colocalize with ENL and also tends to be more tightly associated with promoters. This suggests that the interaction of the C-terminal domain of AF4, ENL and ELL with transcriptional elongation complexes likely recruits the fusion protein from the promoter into the gene body. Consistent with the possibility that these interactions play a pivotal role in oncogenic transformation, the wildtype ENL allele is required for tumor growth in numerous KMT2Ar cell lines41.
How do KMT2A fusion proteins promote lineage plasticity in KMT2Ar leukemia? AF4 is the most common KMT2A fusion partner in pediatric leukemias1,7, and KMT2A-AF4 fusions are associated with lineage switching, where an ALL at diagnosis presents as AML upon tumor relapse17,21. Our single-cell CUT&Tag profiling data suggest that, of the samples included in our study, KMT2A-AF4 leukemias are also likely to be the most plastic, since they contain the most diverse set of active regulatory elements.
The enhanced throughput and consistency of the AutoCUT&RUN and AutoCUT&Tag platforms for chromatin profiling make these technologies suitable for profiling patient specimens. Recent advances in our understanding of KMT2Ar leukemia have allowed for the development or repurposing of numerous targeted compounds as therapeutics42,43, and the set of 14 genes bound by the oncoprotein across KMT2Ar leukemias may represent promising potential therapeutic targets. Incorporating AutoCUT&RUN and AutoCUT&Tag into longitudinal clinical trials could provide high resolution for assessing the efficacy of novel epigenetic medicines. In addition, these technologies are scalable and cost effective, meaning the information obtained from these trials could also be used to apply chromatin profiling for patient diagnosis in the future.
Data Accession
Gene Expression Omnibus GSEXXXXX
Author contributions
DHJ and SH optimized the CUT&Tag method for automation, and DHJ adapted these modifications for single-cell CUT&Tag profiling. SM provided clinical samples and helpful discussion. DHJ, JFS, KA, and SH designed experiments. DHJ, EB and JFS performed experiments. DHJ, MPM, SJW and JFS performed data analysis. DHJ, MPM, KA, and SH wrote the manuscript. All authors read and approved the final manuscript.
Methods
Cell Culture
Human K562 cells were purchased from ATCC (Manassas, VA, Catalog #CCL-243) and cultured according to supplier’s protocol. H1 hESCs were obtained from WiCell (Cat# WA01-lot# WB35186) and cultured in Matrigel™ (Corning) coated plates in mTeSR™1 Basal Media (STEMCELL Technologies cat# 85851) containing mTeSR™1 Supplement (STEMCELL Technologies cat# 85852). The KMT2Ar cell lines ML-2, KOPN-8, RS4;11 and SEM were obtained from the Bleakley lab at the Fred Hutchinson Cancer Research Center.
Primary Patient Samples
Cryopreserved CD45-positive leukemia blasts for primary MPAL-1 (Sample ID: SJMPAL012424_D1, Alias TB-11-3295) and primary ALL-1 (Sample ID: SJALL048347_D1, Alias TB-13-0939) were obtained from St. Jude Children’s Research Hospital in accordance with institutional regulatory practices. Cryopreserved CD45-positive leukemia blasts for primary AML-1 (Sample ID: A40725), primary AML-2 (Sample ID: A67194) and primary MPAL-2 (Sample ID: A58548) were obtained from the Meshinchi lab at the Fred Hutchinson Cancer Research Center. Diagnosis of clinical samples as ALL, AML or MPAL was based on flow cytometry of samples stained with CD45-APC-H7 (BD Cat# 560178), cytoplasmic CD3-PE (BD Cat# 347347), CD34-PerCP Cy5.5 (BD Cat# 347203), CD18-APC (BD Cat# 340437), cytoplasmic MPO-FITC (Dako Cat# F071401-1), and CD33-PE-Cy7 (BD Cat# 333946). The KMT2A fusion present in each sample was determined by RNA-sequencing.
Antibodies
For profiling the wildtype and oncogenic KMT2A protein we used two monoclonal antibodies targeting the KMT2A N-terminus: Mouse anti-KMT2A (1:100, Millipore Cat# 05-764), referred to as KMT2A-N1, and Rabbit anti-KMT2A (1:100, Cell Signaling Tech Cat# 14689S), referred to as KMT2A-N2; as well as two monoclonal antibodies targeting the KMT2A C-terminus: Mouse anti-KMT2A (1:100, Millipore Cat# 05-765), referred to as KMT2A-C1, and Mouse anti-KMT2A (1:100, Santa Cruz Cat# sc-374392), referred to as KMT2A-C2. Since pA-MNase does not bind efficiently to many mouse antibodies, we used a rabbit anti-Mouse IgG (1:100, Abcam Cat# ab46540) as an adapter; this antibody was also used in the absence of a primary antibody as the IgG negative control. For profiling the SEC and DotCom components via manual and AutoCUT&Tag we used rabbit anti-ENL (Cell Signaling Tech Cat# 14893S), rabbit anti-ELL (Cell Signaling Tech Cat# 14468S) and rabbit anti-Dot1L (Cell Signaling Tech Cat# 90878S). For profiling histone marks via manual and AutoCUT&Tag, as well as single-cell CUT&Tag, we used Rabbit anti-H3K4me1 (1:100, Thermo Cat# 710795), Rabbit anti-H3K4me3 (1:100 for bulk profiling or 1:10 for single-cell experiments, Active Motif Cat# 39159), Rabbit anti-H3K36me3 (1:100 for bulk profiling or 1:10 for single-cell experiments, Epicypher Cat# 13-0031), Rabbit anti-H3K27me3 (1:100 for bulk profiling or 1:10 for single-cell experiments, Cell Signaling Technologies Cat# 9733S), and Rabbit anti-H3K9me3 (1:100, Abcam Cat# ab8898). To increase the local concentration of pA-Tn5, all CUT&Tag reactions also included the secondary antibody Guinea Pig anti-Rabbit IgG (1:100, antibodies-online Cat# ABIN101961).
AutoCUT&RUN
Primary patient samples were thawed at room temperature, washed and bound to Concanavalin-A (ConA) paramagnetic beads (Bangs Laboratories Cat# BP531) for magnetic separation. Samples were then suspended in Antibody Binding Buffer and split for incubation with either the KMT2A N- or C-terminus specific antibodies or the IgG control antibody overnight. Sample processing was performed by the CUT&RUN core facility at the Fred Hutchinson Cancer Research Center according to the AutoCUT&RUN protocol available through the Protocols.io website (dx.doi.org/10.17504/protocols.io.ufeetje).
CUT&Tag
Manual CUT&Tag reactions were performed according to the CUT&Tag-Direct protocol34. Briefly, nuclei were prepared by suspending cells in NE1 Buffer (20 mM HEPES-KOH pH 7.9, 10 mM KCl, 0.5 mM Spermidine, 0.1% Triton X-100, 20% Glycerol) for 10 min on ice. Samples were then spun down and resuspended in Wash Buffer (20 mM HEPES pH 7.5, 150 mM NaCl, 0.5 mM Spermidine, Roche Complete Protease Inhibitor EDTA-Free) and lightly cross-linked by addition of 16% formaldehyde to a final concentration of 0.1%. After 2 min, cross-linking was stopped by addition of 2.5 M glycine to a final concentration of 75 mM. Nuclei were washed and either cryopreserved in a Mr. Frosty Chamber for long-term storage, or bound to ConA magnetic beads for further processing. ConA-bound nuclei were suspended in Antibody Binding Buffer (Wash Buffer containing 2 mM EDTA) and split into individual 0.5 mL tubes for antibody incubation at room temperature for 1 hr or 4°C overnight. Samples were then washed to remove unbound primary antibody, brought up in Wash Buffer containing the secondary antibody, and incubated at 4°C for 1 hr. Samples were then washed and brought up in 300-Wash Buffer (Wash Buffer with 300 mM NaCl) containing pA-Tn5 (1:150 dilution), and incubated at 4°C for 1 hr. Samples were then washed in 300-Wash Buffer, brought up in Tagmentation Buffer (300-Wash Buffer plus 10 mM MgCl2), and incubated at 37°C for 1 hr to allow the Tn5 tagmentation reaction to go to completion. Samples were then washed with TAPS wash buffer (10 mM TAPS with 0.2 mM EDTA), and brought up in 5 µL of Release Solution (10 mM TAPS with 0.1% SDS). Samples were then incubated in a thermocycler with heated lid at 58°C for 1 hr to release Tn5 and prepare tagmented chromatin for PCR. Neutralizing Solution (15 µL 0.67% Triton-X100) was added, followed by 2 µL barcoded i5 primer (10 µM), 2 µL barcoded i7 primer (10 µM) and 25 µL of NEBNext PCR mix.
Samples were then placed in a thermocycler and PCR amplification was performed using 12-14 rapid cycles. CUT&Tag libraries were then cleaned up with a single round of SPRIselect beads at a 1.3 : 1 v/v ratio of beads to sample, quantified on a Tapestation bioanalyzer instrument and pooled for sequencing.
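The concentrations in the manual CUT&Tag protocol above can be checked with simple dilution arithmetic. The sketch below is illustrative only: the 1,000 µL sample volume used for the cross-linking step is an assumption (the protocol does not state it), and the helper names `spike_volume` and `final_concentration` are hypothetical. The volumes for the release, neutralization and PCR steps are taken from the text.

```python
def spike_volume(sample_volume, stock_conc, final_conc):
    """Volume of stock to add to a sample so the mixture reaches final_conc.

    Solves stock_conc * v / (sample_volume + v) = final_conc for v.
    Concentrations share one unit; the result is in the unit of sample_volume.
    """
    if not 0 < final_conc < stock_conc:
        raise ValueError("final concentration must be between 0 and the stock")
    return sample_volume * final_conc / (stock_conc - final_conc)


def final_concentration(component_volume, component_conc, total_volume):
    """Concentration of a component after dilution into total_volume."""
    return component_volume * component_conc / total_volume


# Cross-linking: 16% formaldehyde to 0.1% final, in an ASSUMED 1,000 uL sample.
sample_ul = 1000.0
formaldehyde_ul = spike_volume(sample_ul, 16.0, 0.1)

# Quenching: 2.5 M (2,500 mM) glycine to 75 mM final, added to the fixed sample.
glycine_ul = spike_volume(sample_ul + formaldehyde_ul, 2500.0, 75.0)

# PCR reaction: 5 uL release solution (0.1% SDS) + 15 uL 0.67% Triton-X100
# + 2 uL i5 primer + 2 uL i7 primer + 25 uL PCR mix.
pcr_total_ul = 5 + 15 + 2 + 2 + 25
sds_final_pct = final_concentration(5, 0.1, pcr_total_ul)
triton_final_pct = final_concentration(15, 0.67, pcr_total_ul)

print(f"formaldehyde to add: {formaldehyde_ul:.2f} uL")
print(f"glycine to add: {glycine_ul:.2f} uL")
print(f"SDS in PCR: {sds_final_pct:.4f}%, Triton-X100 in PCR: {triton_final_pct:.4f}%")
```

Under these volumes the non-ionic detergent is in roughly 20-fold excess over SDS in the final reaction, which is consistent with the description of Triton-X100 quenching the SDS before PCR.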
AutoCUT&Tag
A detailed protocol complete with program downloads has been made publicly available on protocols.io for implementing AutoCUT&Tag on a Beckman Coulter Biomek liquid handling robot (https://www.protocols.io/view/autocut-amp-tag-streamlined-genome-wide-profiling-bgztjx6n). To facilitate adaptation of the method to other standard liquid handling modules, the complete specifications for each step in the automated procedure are outlined in the guidelines section. Briefly, nuclei were extracted, lightly cross-linked, bound to ConA beads and incubated with primary antibody as in manual CUT&Tag. Up to 96 samples were then arrayed in a 96-well PCR plate and positioned on a stationary ALP on the Beckman Coulter Biomek FX Robot equipped with an ALPAQUA Magnet Plate for standard magnetic separation, an ALPAQUA LE Magnet Plate for low-volume elution, and a thermal block for temperature-controlled incubation. Wash Buffer and 300-Wash Buffer were loaded in Deep Well Plates; Secondary Antibody Solution, pA-Tn5 solution, Tagmentation Buffer, TAPS Buffer and Release Buffer were all loaded into V-Bottom Plates and positioned on Stationary ALPs in accordance with the preprogrammed AutoCUT&Tag method. The AutoCUT&Tag processing was conducted over the course of 4 hours. The sample plate containing ConA-bound tagmented nuclei in 10 µL 0.1% SDS was then removed, sealed and placed on a thermocycler with heated lid for a 1 hour incubation at 58°C. Using a reservoir and multichannel pipettor, 54 µL of 0.15% SDS neutralization solution was added to each well, followed by 4 µL of premixed i5/i7 barcoded primers, and 36 µL of premixed KAPA PCR Master Mix. The plate was then sealed and returned to a thermocycler for 14 rapid PCR cycles.
Following PCR amplification, the sample plate was returned to the Biomek for one round of post-PCR cleanup on the Biomek deck, set up in accordance with a preprogrammed post-PCR cleanup method, including a second 96-well plate preloaded with SPRISelect Ampure beads, a Deep Well Plate loaded with 80% Ethanol for bead washes, and two V-Bottom Plates preloaded with 10 mM Tris-HCl pH 8.0 for tip washes and elution. Upon completion of the 1 hr cleanup the samples were quantified using a Tapestation bioanalyzer instrument and pooled for sequencing.
Single-cell CUT&Tag
Nuclei were extracted and lightly cross-linked using the same strategy as for manual CUT&Tag. The nuclei concentration was then quantified to allow for accurate dilution prior to dispensing into nanowells on the ICELL8. For each antibody 10 µL of ConA beads were washed in Binding Buffer (20 mM HEPES-KOH pH 7.9, 10 mM KCl, 1 mM CaCl2, 1 mM MnCl2) and bound to the sample for 10 min. Samples were then split into 0.5 mL Lobind tubes, one for each antibody, and resuspended in 25 µL of Antibody Buffer containing primary antibody at a 1:10 dilution. Samples were incubated at 4°C overnight, washed twice with 100 µL of Wash Buffer, and then resuspended in 50 µL Wash Buffer containing secondary antibody at a 1:50 dilution. Samples were incubated at 4°C for 1 hr, washed twice with 100 µL of Wash Buffer, and then resuspended in 50 µL 300-Wash Buffer with a 1:50 dilution of pA-Tn5. Samples were incubated at 4°C for 1 hr, washed twice with 100 µL of 300-Wash Buffer, and then resuspended in 50 µL of Tagmentation Solution (300-Wash Buffer with 10 mM MgCl2). Samples were incubated at 37°C in a thermocycler with heated lid for 1 hr to allow the tagmentation reaction to go to completion. Samples were washed with 10 mM TAPS to remove any residual salt, and then resuspended in 10 mM TAPS pH 8.5 containing 1X DAPI and 1X secondary diluent reagent (Takara Cat# 640196) at a concentration of 400 nuclei/µL. 80 µL of cell suspension was loaded into 8 wells of the 384-well plate, together with 25 µL of the fiducial reagent (Takara Cat# 640196) according to the manufacturer’s instructions. Sample suspension (35 nL) was dispensed on the ICELL8 into the nanowells of a 350v Chip (Takara Cat# 640019). The 350v Chip was dried and sealed, and cells were centrifuged at 1200xg for 3 min. The Chip was then imaged to identify wells containing a single nucleus and a filter file was prepared.
During image processing, 35 nL of 0.19% SDS in TAPS was added to all nanowells on the ICELL8 using an unfiltered dispense. The Chip was then dried, sealed and centrifuged at 1200xg for 3 min and then heated at 58°C in a thermocycler with heated lid for 1 hr to release the pA-Tn5 and prepare the tagmented chromatin for PCR. Before opening, the Chip was centrifuged at 1200xg, and 35 nL of 2.5% Triton-X100 neutralization solution was added to all wells containing a single nucleus via a filtered dispense on the ICELL8. The Chip was then dried and 35 nL of i5 indices was added via a filtered dispense. The Chip was then dried and 35 nL of i7 indices was added via a filtered dispense. The Chip was then dried, sealed and centrifuged at 1200xg for 3 min. Then 100 nL of KAPA PCR mix (2.775 X HiFi Buffer, 0.85 mM dNTPs, 0.05 U KAPA HiFi polymerase / µL) (Roche Cat# 07958846001) was added to all wells containing a single nucleus via two 50 nL filtered dispenses. The Chip was centrifuged at 1200xg for 3 min, sealed and placed in a thermocycler for PCR amplification using the following conditions: 1 cycle 58 °C 5 min; 1 cycle 72 °C 10 min; 1 cycle of 98 °C 45 sec; 15 cycles of 98 °C 15 sec, 60 °C 15 sec, 72 °C 10 sec; 1 cycle 72 °C 2 min. The Chip was then centrifuged at 1200xg for 3 min into a collection tube (Takara Cat# 640048). To remove residual PCR primers and detergent, the sample was then cleaned up using two rounds of SPRISelect Ampure bead cleanup at a 1.3 : 1 v/v ratio of beads to sample. Samples were resuspended in 30 µL of 10 mM Tris-HCl pH 8.0, quantified on a Tapestation bioanalyzer instrument, and pooled with bulk samples for sequencing.
DNA sequencing and Data processing
The size distributions and molar concentration of libraries were determined using an Agilent 4200 TapeStation. Up to 48 barcoded CUT&RUN libraries or 96 barcoded CUT&Tag libraries were pooled at approximately equimolar concentration for sequencing. Paired-end 25 × 25 bp sequencing on the Illumina HiSeq 2500 platform was performed by the Fred Hutchinson Cancer Research Center Genomics Shared Resources. This yielded 5-10 million reads per antibody. Single-cell CUT&Tag libraries were prepared using unique i5 and i7 barcodes and pooled with bulk samples for sequencing. For 500-100 cells 20 million reads were sufficient to obtain an average of approximately 80% saturation of the estimated library size for each single cell. Paired-end reads were aligned using Bowtie2 version 2.3.4.3 to UCSC HG19 with options: --end-to-end --very-sensitive --no-mixed --no-discordant -q --phred33 -I 10 -X 700.
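The alignment settings listed above can be assembled into a command line as in the following minimal sketch. The option list is taken verbatim from the text; the index prefix, FASTQ file names and output name are hypothetical placeholders, and the helper function bowtie2_command() is introduced here only for illustration. The -x, -1, -2 and -S arguments are the standard Bowtie2 flags for the index, the two mate files and the SAM output.

```python
import shlex

# Alignment options exactly as listed in the text.
BOWTIE2_OPTIONS = [
    "--end-to-end", "--very-sensitive", "--no-mixed", "--no-discordant",
    "-q", "--phred33", "-I", "10", "-X", "700",
]


def bowtie2_command(index_prefix, fastq_r1, fastq_r2, sam_out):
    """Build a paired-end Bowtie2 command as an argument list (illustrative helper)."""
    return (
        ["bowtie2"]
        + BOWTIE2_OPTIONS
        + ["-x", index_prefix, "-1", fastq_r1, "-2", fastq_r2, "-S", sam_out]
    )


# Hypothetical paths, for illustration only.
cmd = bowtie2_command("hg19", "sample_R1.fastq.gz", "sample_R2.fastq.gz", "sample.sam")

# Print a shell-safe rendering; pass `cmd` to subprocess.run() to execute it.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Here -I 10 and -X 700 restrict concordant pairs to fragment lengths between 10 and 700 bp, and --no-mixed together with --no-discordant discards pairs that do not align as a proper pair.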
Identifying KMT2Ar oncoprotein targets
To identify unique KMT2Ar targets, we first generated a merged set of 18087 SEACR peaks originating from either N-terminal or C-terminal KMT2A antibody-targeted CUT&RUN in any cell type assayed. We quantified the number of fragments mapping to each peak i from each dataset j, and summed reads mapped from the two antibodies targeting the same KMT2A terminus in the same dataset to yield N-terminal (nij) and C-terminal (cij) fragments mapped in each peak, existing in cell type sets Nj and Cj, respectively. We calculated the cell type-specific “N over C ratio” (NCR) for each peak as follows: where min(x) = minimum value of x across the peak set; and ECDF(y)(x) = Empirical Cumulative Distribution Function of set y evaluated at x, as implemented in R (https://www.r-project.org/) using the ecdf() function. As illustrated in equation (1), ECDF was used to shrink NCR values towards zero in inverse proportion with the mean nij+cij signal observed in the peak. Each peak-cell type combination Pij was assigned a True or False value for peak identity and for KMT2Ar identity. Peak identity was asserted as True, and the peak added to peak set Pj, if the mean nij+cij was at least 1.96 standard deviations above the mean Nj+Cj. For all Pj, KMT2Ar identity was evaluated by fitting a two-component Gaussian Mixture Model to all NCRj corresponding to Pj, and asserting as True any NCRij greater than the NCR value, above the mean of NCRj, at which the two fitted Gaussian distributions intersect. Gaussian Mixture Modeling was implemented in R using the normalmixEM() function from the “mixtools” library. For all peaks assigned as KMT2Ar in any cell type, NCR scores were hierarchically clustered using the hclust() function in R on a Euclidean distance matrix generated by the dist() function.
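The two thresholding steps described above can be illustrated with a minimal Python sketch; the analysis itself was carried out in R, so this is not the authors' code. It assumes the mixture parameters (weights, means and standard deviations) have already been fitted, that the intersection is taken between the weighted component densities, that the standard deviation is the sample standard deviation, and that when two intersections lie above the mean the lower one is used. All function names are hypothetical.

```python
import math
import statistics


def peak_identity(signals, z=1.96):
    """Flag peaks whose signal is at least z sample SDs above the mean signal."""
    cutoff = statistics.mean(signals) + z * statistics.stdev(signals)
    return [s >= cutoff for s in signals]


def gaussian_intersections(w1, mu1, sd1, w2, mu2, sd2):
    """Solve w1*N(x; mu1, sd1) = w2*N(x; mu2, sd2) for x.

    Taking logs of both sides gives a quadratic a*x^2 + b*x + c = 0.
    """
    a = 1.0 / (2 * sd2 ** 2) - 1.0 / (2 * sd1 ** 2)
    b = mu1 / sd1 ** 2 - mu2 / sd2 ** 2
    c = (mu2 ** 2 / (2 * sd2 ** 2) - mu1 ** 2 / (2 * sd1 ** 2)
         + math.log(w1 / sd1) - math.log(w2 / sd2))
    if abs(a) < 1e-12:  # equal variances: the equation is linear
        return [-c / b] if b != 0 else []
    disc = b * b - 4 * a * c
    if disc < 0:  # the weighted densities never cross
        return []
    root = math.sqrt(disc)
    return sorted([(-b - root) / (2 * a), (-b + root) / (2 * a)])


def kmt2ar_calls(ncr, w1, mu1, sd1, w2, mu2, sd2):
    """Call True for NCR values above the intersection lying above the mean NCR."""
    mean_ncr = statistics.mean(ncr)
    above = [x for x in gaussian_intersections(w1, mu1, sd1, w2, mu2, sd2)
             if x > mean_ncr]
    if not above:  # no usable threshold: call nothing
        return [False] * len(ncr)
    threshold = min(above)  # assumption: use the lowest crossing above the mean
    return [value > threshold for value in ncr]


# Toy values, for illustration only.
calls = kmt2ar_calls([-0.5, 0.0, 0.2, 0.3, 2.5], 0.5, 0.0, 1.0, 0.5, 2.0, 1.0)
```

With equal weights and equal standard deviations the crossing falls midway between the two component means, which gives a quick sanity check on fitted parameters.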
t-SNE embedding of the active and repressed chromatin regions
For histone modification data, peaks were called from merged replicate datasets using SEACR36, and peak sets were merged for each modification across all cell types. We generated matrices of raw read counts mapping in each cell type (columns) to merged peaks (rows) for each modification, and we filtered out instances where counts were lower than any count value whose empirical cumulative distribution function (ECDF) value diverged by more than 5% from the ECDF value predicted by a lognormal fit of the data distribution, obtained using the fitdistr() function from the MASS library with “densfun” set to “lognormal”. We then log10-transformed the results and rescaled columns to z-scores. Principal component analysis (PCA) was performed on the resulting transformed matrices using the prcomp() function in R. For t-SNE analysis, all principal components contributing greater than 1% variance were used as input to the Rtsne() function from the Rtsne library, with perplexity set as the nearest integer to the square root of the number of peaks, and check_duplicates set as FALSE. We used the resulting two-dimensional t-SNE values as input to the densityClust() function from the densityClust library, and used that output in the findClusters() function, with rho and delta values set to the 95th percentile of all rho and delta values output from densityClust(), respectively. To generate cluster average heatmaps, scaled count values were averaged by cluster and the resulting matrix was used as input to the heatmap.2() function from the gplots library. PCA and t-SNE plots were generated using the ggplot2 library (https://ggplot2.tidyverse.org/).
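The transformation and parameter choices that precede t-SNE can be sketched as follows. This is a minimal Python illustration rather than the R code used in the study: it assumes all counts are positive after filtering, that z-scores use the sample standard deviation, and that component standard deviations (as returned by a PCA) are supplied by the caller. The function names are hypothetical.

```python
import math
import statistics


def log10_zscore_columns(counts):
    """log10-transform a peaks x cell-types matrix, then z-score each column."""
    n_rows, n_cols = len(counts), len(counts[0])
    logged = [[math.log10(v) for v in row] for row in counts]
    scaled = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        col = [logged[i][j] for i in range(n_rows)]
        mu, sd = statistics.mean(col), statistics.stdev(col)
        for i in range(n_rows):
            scaled[i][j] = (col[i] - mu) / sd
    return scaled


def components_above(sdev, min_fraction=0.01):
    """Indices of principal components explaining more than min_fraction of variance."""
    variances = [s * s for s in sdev]
    total = sum(variances)
    return [k for k, v in enumerate(variances) if v / total > min_fraction]


def tsne_perplexity(n_peaks):
    """Perplexity as the nearest integer to sqrt(number of peaks)."""
    return round(math.sqrt(n_peaks))


# Toy matrix: three peaks (rows) by two cell types (columns).
scaled = log10_zscore_columns([[10, 1], [100, 10], [1000, 100]])
kept = components_above([3.0, 1.0, 0.1])
```

For example, a peak set of 10,000 rows would give a perplexity of 100 under this rule.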
UMAP embedding of single cells
Cluster-specific regions defined from the bulk t-SNE embeddings were used to generate a single-cell count matrix of N features (Fig 3). These matrices were then normalized by sequencing depth. Next, each feature in the matrix (row) was scaled by subtracting the mean and dividing by the standard deviation (z-scaling). The upper and lower bound values of the matrix were capped at ±1.544. We then used hierarchical clustering on the count matrix to organize features and cells. While clustering, we confined single cells to their cell type so that no cells could be organized outside of their cell type category. In addition, the normalized count matrix was reduced from N dimensions to two dimensions using UMAP and plotted.
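The normalization steps applied before clustering and UMAP can be sketched as follows. This minimal Python illustration assumes that depth normalization divides each cell's counts by that cell's total count in the matrix, that every cell has at least one fragment, that the sample standard deviation is used, and that features with zero variance are set to zero; none of these details are specified in the text. The function name is hypothetical.

```python
import statistics

CAP = 1.544  # bound stated in the text


def normalize_scale_cap(counts, cap=CAP):
    """Depth-normalize columns (cells), z-scale rows (features), then cap values."""
    n_feat, n_cells = len(counts), len(counts[0])
    # Assumed depth: total count per cell across all features (must be non-zero).
    depth = [sum(counts[i][j] for i in range(n_feat)) for j in range(n_cells)]
    norm = [[counts[i][j] / depth[j] for j in range(n_cells)] for i in range(n_feat)]
    out = []
    for row in norm:
        mu, sd = statistics.mean(row), statistics.stdev(row)
        if sd == 0:  # constant feature: no information, set to zero
            out.append([0.0] * n_cells)
            continue
        out.append([max(-cap, min(cap, (v - mu) / sd)) for v in row])
    return out


# Toy matrix: two features (rows) by five cells (columns).
scaled = normalize_scale_cap([[1, 1, 1, 1, 9], [9, 9, 9, 9, 1]])
```

In the toy matrix the outlying cell has an uncapped z-score of about 1.79 for the first feature, so it is clipped to 1.544, which limits the influence of extreme values on clustering.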
Comparison of myeloid and lymphoid enriched H3K4me3 signal in single cells
We quantified the number of unique fragments from each cell that fell in the myeloid, mixed myeloid, lymphoid, mixed lymphoid, common or lineage-specific clusters of H3K4me3 peaks as defined by analysis of bulk data. The number of unique fragments that fell in each cluster was then divided by the number of base pairs covered by the set of peaks in that cluster. The percent of bp-normalized myeloid, mixed myeloid, lymphoid, and mixed lymphoid fragments for each cell was then determined relative to the total bp-normalized signal in all peaks.
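The per-cell calculation described above amounts to the following minimal Python sketch. The cluster names and numbers are toy values chosen for illustration, and the function name is hypothetical.

```python
def bp_normalized_percent(fragments, peak_bp):
    """Percent of bp-normalized signal per cluster for one cell.

    fragments: unique fragment count per cluster for the cell.
    peak_bp:   base pairs covered by the peaks of each cluster.
    """
    density = {c: fragments[c] / peak_bp[c] for c in peak_bp}
    total = sum(density.values())
    if total == 0:  # cell with no fragments in any cluster
        return {c: 0.0 for c in density}
    return {c: 100.0 * d / total for c, d in density.items()}


# Toy values for a single cell, for illustration only.
percent = bp_normalized_percent(
    {"myeloid": 10, "lymphoid": 5, "common": 5},
    {"myeloid": 1000, "lymphoid": 1000, "common": 500},
)
```

Dividing by the covered base pairs first keeps a cluster with many or wide peaks from dominating the percentages simply because it spans more of the genome.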
Preparation of Figure Panels
All heat maps were generated using DeepTools45. All of the data were analyzed using either bash or Python (https://github.com/python). The following packages were used in Python: Matplotlib, NumPy, Pandas, SciPy, and Seaborn.
Supplementary Figures
Acknowledgements
We thank the Fred Hutchinson Genomics Shared Resource Facility for technical support, particularly Phil Corrin and Jeff Delrow for help with AutoCUT&RUN profiling of KMT2A. We thank Terri Bryson and Trizia Llagas for help with cell culture, and Jorja Henikoff and Matt Fitzgibbon for preparing the sequencing data for analysis. In addition, we thank Jitendra Thakur for helpful discussions related to data analysis and presentation. We thank Charles Mullighan, Jenny Lill and Marie Bleakley for generously sharing the KMT2Ar samples and cell lines used in this study. This work was supported by NIH grants R01 HG010492 (S.H.), 4DN TCPA A093 (S.H.) and F32 GM129954 (M.M.), by the Howard Hughes Medical Institute (S.H.), by a pilot project grant from the Chan-Zuckerberg Initiative (S.H.), by a Damon Runyon-Sohn Foundation Fellowship (J.F.S.) and by an Alex’s Lemonade Stand Foundation Young Investigator Award (J.F.S.).