Abstract
Parallel single-cell sequencing protocols represent powerful methods for investigating regulatory relationships, including epigenome-transcriptome interactions. Here, we report the first single-cell method for parallel chromatin accessibility, DNA methylation and transcriptome profiling. scNMT-seq (single-cell nucleosome, methylation and transcription sequencing) uses a GpC methyltransferase to label open chromatin followed by bisulfite and RNA sequencing. We validate scNMT-seq by applying it to mouse embryonic stem cells, finding links between all three molecular layers and revealing strong and widespread associations between chromatin accessibility and DNA methylation.
Understanding regulatory associations between the epigenome and the transcriptome requires simultaneous profiling of multiple molecular layers. Previously, such multi-omics analyses have been limited to bulk assays, which profile ensembles of cells. These studies have used variation in the expression of a gene across individuals1 or between cell types2 or conditions to assess such linkages. Alternatively, it is also possible to link chromatin state with transcription by exploiting variability between genes within a single sample. However, insights from such an approach are limited to the discovery of genome-wide global trends3.
With rapid advances in single-cell technologies it is increasingly possible to leverage variation between single cells in order to probe regulatory associations between molecular layers. Existing protocols allow the methylome and the transcriptome or, alternatively, the methylome and chromatin accessibility to be assayed in the same cell4–7. However, it is well known that DNA methylation and other epigenomic features such as chromatin accessibility do not act independently of one another. Consequently, the ability to profile, at single cell resolution, multiple epigenetic features in conjunction with gene expression is critical for obtaining a more complete understanding of how transcription, and thus cell state, is regulated8.
To address this, we have developed a method that enables the joint analysis of the transcriptome, the methylome and chromatin accessibility. Our approach builds on previous parallel protocols such as single-cell methylation and transcriptome sequencing (scM&T-seq)1, in which physical separation of DNA and RNA is performed first, to enable the cell’s transcriptome to be profiled using a conventional Smart-seq2 protocol9. To measure chromatin accessibility together with DNA methylation, we adapted the Nucleosome Occupancy and Methylation sequencing (NOMe-seq) method7,10, where a methyltransferase (methylase) enzyme is used to label accessible (or nucleosome depleted) DNA prior to bisulfite sequencing (BS-seq), which distinguishes between the two chromatin states. In mammalian cells, cytosine residues in CpG dinucleotides are frequently methylated, whereas cytosines followed by either adenine, cytosine or thymine (collectively termed CpH) are methylated at a much lower rate11. Consequently, by using a GpC methylase enzyme (M.CviPI) to label accessible chromatin, NOMe-seq can recover endogenous CpG methylation information in parallel. NOMe-seq is particularly attractive for single-cell applications since, in contrast to count-based methods such as ATAC-seq or DNase-seq, GpC accessibility is encoded through the bisulfite conversion and hence inaccessible chromatin can be directly discriminated from missing data. Additionally, the resolution of the method is determined by the frequency of GpC sites within the genome (~1 in 16bp), rather than the size of a library fragment (>100bp) (see Fig. 1a for an illustration of the protocol).
To demonstrate the performance of scNMT-seq, we applied the method to a batch of 70 serum-grown EL16 mouse embryonic stem cells (ESCs), together with four negative (empty wells) and three scM&T-seq controls (cells processed using scM&T-seq, i.e., which did not receive M.CviPI enzyme treatment). This facilitates direct comparison with previous methods for assaying DNA methylation and transcription in the same cell4,12.
We isolated single cells into GpC methylase reaction mixtures by FACS, before physically separating the DNA and RNA prior to bisulfite and RNA sequencing library preparation1. See Supplementary Table 1 for sequencing summary statistics. Alignment of the BS-seq data and other bioinformatics processing can be carried out using established pipelines, with the addition of a filter to discard G-C-G positions, for which it is intrinsically not possible to distinguish endogenous methylation from in vitro methylated bases (21% genome-wide). Similarly, we remove C-C-G positions to mitigate possible off-target effects of the enzyme10 (27% genome-wide). In total, 58 out of 70 cells processed using scNMT-seq passed quality control for both bisulfite and RNA-seq.
First, we considered the RNA-seq component, which is directly comparable to scM&T-seq transcriptome data. On average, we detected 7,700 genes per cell (CPM ≥1), which is comparable with data from the same cell type profiled using scM&T-seq1. We used PCA and hierarchical clustering to jointly analyse cells across protocols and studies (using data from Angermueller et al. 20164), and found that scM&T-seq and scNMT-seq samples prepared in parallel cluster together. This indicates that the enzyme treatment does not adversely affect the transcriptome (Supplementary Fig. 1). Larger differences were observed when comparing across studies, most likely reflecting differences in the cell lines used (male E14 versus female EL1613, Supplementary Fig. 1).
The need to filter out C-C-G and G-C-G positions from the methylation data reduces the number of genome-wide cytosines that can be assayed from 22 million to 11 million. However, despite this filter, a large proportion of the loci in genomic regions with important regulatory roles, such as promoters and enhancers, can be profiled using scNMT-seq (Fig. 1b). Consistent with this theoretical expectation, we observed high empirical coverage: 51% of promoters and 78% of gene bodies were captured by at least 5 cytosines (Fig. 1c, Supplementary Fig. 2a). We also compared the methylation coverage to data from our previous publication4, again finding small differences relative to conventional BS-seq, although these differences became more pronounced when down-sampling the total sequence coverage (e.g. the reduction in gene body coverage increased from 5% to 16% when sampling 1/10th of the reads; Supplementary Fig. 2b). Due to the higher frequency of GpC compared to CpG dinucleotides in the mouse genome, the coverage of GpC accessibility was larger than that observed for endogenous CpG methylation (Fig. 1b, c and Supplementary Fig. 2a). We found that, on average, 91% of gene bodies and 79% of promoters per cell could be assessed, which is the highest coverage achieved by any single-cell accessibility protocol to date (9.4% using scATAC-seq14, and with scDNase-seq, ~50% of genes >1 RPKM, >80% of genes >3 RPKM15). Analogous to the analysis of the RNA-seq data, we compared the CpG methylation profiles obtained from scNMT-seq to single-cell libraries profiled using scM&T-seq4, scBS-seq12 and bulk BS-seq16, finding that cells did not cluster by protocol or by study, with most variation being attributable to differences in cell type (Supplementary Fig. 3).
To validate the accuracy of the GpC accessibility measurements, we generated a synthetic bulk dataset by merging GpC methylation data from all cells, and compared this with published bulk DNase-seq data from the same cell type7. Globally, we observed high consistency between datasets (Pearson R = 0.75, weighted by coverage in our merged dataset, Supplementary Fig. 4). The most notable difference was that the scNMT-seq data showed oscillating profiles, with peaks spaced ~180 to ~200bp apart, consistent with the positions of nucleosomes (Fig. 1d) and similar to profiles obtained with bulk-cell NOMe-seq2.
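The merging step used to build the synthetic bulk dataset can be illustrated with a minimal Python sketch. It assumes that each cell's GpC calls are held as a dictionary mapping a site to a pair of (methylated reads, total reads); this data layout and the function name merge_cells are hypothetical and are not taken from the pipeline used in this study.

```python
from collections import defaultdict

def merge_cells(cells):
    """Pool per-site read counts over cells and return pooled rates.

    `cells` is a list of dicts mapping (chromosome, position) to
    (methylated_reads, total_reads). Hypothetical layout, for illustration.
    """
    pooled = defaultdict(lambda: [0, 0])
    for calls in cells:
        for site, (meth, total) in calls.items():
            pooled[site][0] += meth
            pooled[site][1] += total
    # Rate = pooled methylated reads / pooled total reads. The pooled total
    # is returned as well so that it can be used as a coverage weight.
    return {site: (m / t, t) for site, (m, t) in pooled.items() if t > 0}

# Toy example with two cells and two GpC sites.
cell_a = {("chr1", 100): (1, 1), ("chr1", 116): (0, 1)}
cell_b = {("chr1", 100): (0, 1)}
bulk = merge_cells([cell_a, cell_b])
```

Returning the pooled read depth alongside each rate is a design choice of the sketch: it makes a coverage-weighted comparison with an external bulk dataset straightforward.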
Next, we examined GpC methylation levels at known regulatory regions in single cells. Across the genome, GpC accessibility was ~30%, with low cell-cell variability. However, we found a large increase in GpC accessibility at known DNase hypersensitivity sites (DHS, ~60% GpC methylated, Supplementary Fig. 5), as well as transcriptional start sites (~60% GpC methylated, Fig. 1e). We observed similar patterns for protein- and transcription factor binding sites (from p300, CTCF, Nanog and Oct4 ChIP-seq data), which were accessible at the centre of the peaks. Cells processed using the scM&T-seq control were universally low in GpC methylation (~2%) with no enrichment at regulatory regions, indicating that our accessibility data are not affected by endogenous GpC methylation (Supplementary Fig. 6). To illustrate the high-resolution GpC accessibility measurements obtained by our method, we profiled the pattern and density of nucleosomes around transcription start sites, finding characteristic nucleosome depleted regions at transcription start sites and variation between cells in the position of nucleosomes (see Supplementary Fig. 7 for example plots).
To assess how differences in gene expression are associated with methylation and GpC accessibility, we stratified loci based on the expression level of the nearest gene using the RNA-seq profiles from the corresponding cells. We found that highly expressed genes were associated with the greatest GpC accessibility at promoters and at nearby regulatory sites, whereas the GpC accessibility of lowly-expressed genes was reduced (Fig. 1e; Supplementary Fig. 8).
Taken together, these results demonstrate that our method is able to robustly profile gene expression, DNA methylation and GpC accessibility within the same single cell.
Having established the efficacy of our method, we next explored its potential for identifying loci with coordinated epigenetic and transcriptional heterogeneity. Globally, we observed a clear relationship between average CpG methylation rate and the GpC accessibility across cells, where methylated loci were associated with decreased accessibility (Fig. 2a). When assessing the heterogeneity of CpG methylation in different genomic contexts, enhancers were most variable (particularly primed enhancers, which are H3K4me1 marked but lack H3K27ac), followed by non-CGI promoters and inactive promoters (Supplementary Fig. 9), which is in agreement with previous data4,12. In contrast, heterogeneity in GpC accessibility was largest at known binding sites of transcription factors (Oct4 and Nanog) and regions of active chromatin (p300 binding sites and DNase-hypersensitive sites), indicating cell-to-cell differences in the accessibility of the DNA to important regulatory factors (Fig. 2a and Supplementary Fig. 9).
We next jointly considered the GpC accessibility and CpG methylation data to test for correlated changes between the two layers. Significant associations were observed across all genomic contexts, with up to 98 loci showing significant patterns (FDR < 10%; Fig. 2b; Supplementary Fig. 10a and 11). The majority of significant correlations were negative, reflecting the known relationship between these two layers17. The largest number of individual associations was observed in intronic regions (N=98), followed by super-enhancer regions (N=51, Fig. 2b).
In addition to coupling between different epigenetic layers, we also considered associations of CpG methylation and GpC accessibility with gene expression levels. Because these effects were generally weaker than the relationship between accessibility and methylation, we used a data-driven approach to optimise the set of promoter proximal regions in which to test for such associations (Methods). This analysis identified −100bp to +100bp for accessibility and −1kb to +1kb for methylation as suitable parameters for such analyses (Supplementary Fig. 12). Notably, the strongest associations between accessibility and expression were observed upstream of the TSS, whereas the linkages for DNA methylation were most pronounced downstream of the TSS. We used these regions to assess linkages of DNA methylation and accessibility with gene expression. We found 4 significant associations between GpC accessibility and gene expression, with a greater number of positive (3) than negative (1) correlations (Fig. 2c and Supplementary Fig. 13a and 14). For CpG methylation and transcription, we found 39 significant associations with an enrichment for negative correlations (33/39), confirming the known negative relationship between DNA methylation and gene expression (Supplementary Fig. 15a and 16). See Supplementary Table 2 for a list of all significant correlations.
As an example, Fig. 2d displays the gene Cth and surrounding region, showing mean accessibility and methylation rates across the locus as well as a scatter plot depicting significant associations between GpC accessibility or CpG methylation at the promoter region and Cth expression. Notably, this relationship could also be observed in individual cells, as shown in the zoom-in examples, revealing specific cells with either an accessible promoter and expressed transcripts or an inaccessible promoter and no expression.
We additionally analysed associations across genes within each cell (rather than across cells within each gene), which is similar to previous approaches used to investigate such linkages using a single bulk sample. This approach showed global correlations in different genomic contexts (Supplementary Fig. 10b, 13b, 15b), indicating that our method is accurately measuring each layer and recapitulates the expected bulk-cell results.
In conclusion, we describe a method for parallel single-cell DNA methylation, gene expression and high resolution chromatin accessibility measurements and report novel associations between each molecular layer with a strong enrichment for DNA methylation – chromatin accessibility correlations. This method will greatly expand our ability to investigate relationships between the epigenome and transcriptome in heterogeneous cell types and across developmental transitions.
Methods
Cell culture
Mouse embryonic stem cells were previously derived from a 129×Cast/129 embryo13 and cultured in serum media without feeders as previously described4. Single cells were collected by FACS, selecting for live cells and low DNA content (i.e., G0 or G1 phase cells) using ToPro-3 and Hoechst 33342 staining as previously described4. The cell line was subjected to routine mycoplasma testing using the MycoAlert testing kit (Lonza).
Library preparation
Cells were collected directly into 2.5μl methylase reaction mixture comprising 1x M.CviPI Reaction buffer (NEB), 2U M.CviPI (NEB), 160 μM S-adenosylmethionine (NEB), 1U/μl RNasin (Promega) and 0.1% IGEPAL (Sigma), and then incubated for 15 minutes at 37°C. The reaction was stopped and the RNA preserved with the addition of 5μl RLT plus (Qiagen) prior to scM&T-seq library preparation according to the published protocols for G&T-seq19 and scBS-seq20, but with the following modifications. Three G&T-seq washes were performed with 15μl volumes (steps 22 to 24 of the G&T-seq protocol21) and the reverse transcription reaction and PCR were performed using the volumes of the published Smart-seq2 protocol22 (i.e. 10 μl for reverse transcription and 25 μl for PCR).
Sequencing
Twenty of the BS-seq libraries, including 3 negative controls, were initially sequenced on a 50bp single-end MiSeq run to assess quality. The negative controls were found to have substantially reduced mapping efficiencies compared to the single-cell samples (mean of 2.7% compared to 36.8%, see Supplementary Table 1). All single-cell BS-seq libraries were subsequently sequenced to a mean depth of 17 million paired-end reads and RNA-seq libraries were sequenced to a mean depth of 1.7 million paired-end reads. Both sets of libraries were sequenced on HiSeq 2500 instruments using v4 reagents and 125bp read length.
Data processing
Bisulfite-seq alignment
Single-cell bisulfite libraries were processed using Bismark23 as described20 but with the additional –NOMe option in the coverage2cytosine script, which produces CpG report files containing only A-C-G and T-C-G positions and GpC report files containing only G-C-A, G-C-C and G-C-T positions.
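The trinucleotide rules described above can be written down as a small Python sketch. It is an illustration of the filtering logic only and is not the Bismark implementation; the function name classify_cytosine is hypothetical, and the sketch assumes an upper-case sequence read 5' to 3' on the strand that carries the cytosine.

```python
# Trinucleotide contexts (upstream base, cytosine, downstream base).
CPG_CONTEXTS = {"ACG", "TCG"}          # endogenous CpG methylation
GPC_CONTEXTS = {"GCA", "GCC", "GCT"}   # GpC accessibility
DISCARDED = {"GCG", "CCG"}             # ambiguous or possible off-target

def classify_cytosine(seq, i):
    """Return "CpG", "GpC", "discard" or None for the cytosine at index i.

    None is returned when the base is not a cytosine, when a flanking base
    is missing, or when the context is not informative for either readout.
    """
    if i <= 0 or i + 1 >= len(seq) or seq[i] != "C":
        return None
    context = seq[i - 1 : i + 2]
    if context in DISCARDED:
        return "discard"
    if context in CPG_CONTEXTS:
        return "CpG"
    if context in GPC_CONTEXTS:
        return "GpC"
    return None
```

For example, classify_cytosine("TTGCATT", 3) returns "GpC", whereas the cytosine in "TTGCGTT" is discarded because methylation there could be either endogenous or enzyme-derived.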
RNA-seq alignment
Single-cell RNA-seq libraries were aligned with HiSat224 using options −O3 −m64 −msse2 −funroll-loops −g3 −DPOPCNT.
Allele-sorting
Since the cell line used was derived from a hybrid embryo (129 × 129/cast), reads were separated by known SNPs between the two strains using SNPsplit25. However, for the purposes of this study, genome-specific data were merged and the allelic origin was therefore ignored.
Quality control
From the bisulfite-seq data, we discarded cells that had (1) less than 10% mapping efficiency or (2) fewer than 500,000 CpG sites or 5,000,000 GpC sites covered. In total, 64 cells (88%) passed the quality control (Supplementary Fig. 18). From the RNA-seq data, we discarded cells that had (1) fewer than 300,000 reads mapped, (2) more than 15% of total reads mapped to mitochondrial genes or (3) fewer than 2,000 genes expressed. In total, 66 cells (90%) passed the quality control (Supplementary Fig. 17), 61 of which also passed BS-seq QC (84%), comprising 58 scNMT-seq cells and 3 scM&T-seq cells.
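The thresholds above can be expressed as two small predicates, shown here as a hedged Python sketch. The function and argument names are hypothetical, fractions are used in place of percentages, and the treatment of values that fall exactly on a threshold is an assumption, since the text only specifies which cells were discarded.

```python
def passes_bs_qc(mapping_efficiency, cpg_sites, gpc_sites):
    """Bisulfite-seq QC: keep a cell only if no discard criterion applies."""
    return (
        mapping_efficiency >= 0.10     # discard if < 10% mapping efficiency
        and cpg_sites >= 500_000       # discard if < 500,000 CpG sites covered
        and gpc_sites >= 5_000_000     # discard if < 5,000,000 GpC sites covered
    )

def passes_rna_qc(mapped_reads, mito_fraction, genes_expressed):
    """RNA-seq QC: keep a cell only if no discard criterion applies."""
    return (
        mapped_reads >= 300_000        # discard if < 300,000 reads mapped
        and mito_fraction <= 0.15      # discard if > 15% mitochondrial reads
        and genes_expressed >= 2_000   # discard if < 2,000 genes expressed
    )

def passes_both(bs_metrics, rna_metrics):
    """A cell enters the joint analysis only if both libraries pass."""
    return passes_bs_qc(*bs_metrics) and passes_rna_qc(*rna_metrics)
```

Keeping the two predicates separate mirrors the way the counts are reported above, with a per-assay pass rate and a joint pass rate.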
CpG Methylation and GpC accessibility quantification
Following the approach of Smallwood et al.7, individual CpG or GpC sites in each cell were modelled using a binomial model where the number of successes is the number of reads that support methylation and the number of trials is the total number of reads. A CpG methylation or GpC accessibility rate for each site and cell was calculated by maximum a posteriori estimation assuming a beta prior distribution. Subsequently, CpG methylation and GpC accessibility rates were computed for each genomic feature assuming a Normal distribution across cells and accounting for differences in the standard errors of the single-site estimates. The coverage (number of observed CpG or GpC sites) was recorded and used as a weight in subsequent analyses. See Supplementary Table 3 for details of genomic contexts used in this study.
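The per-site estimate has a closed form, which the following Python sketch illustrates. With a Binomial likelihood (k methylated reads out of n) and a Beta(alpha, beta) prior, the posterior is Beta(k + alpha, n - k + beta) and its mode is (k + alpha - 1) / (n + alpha + beta - 2). The prior parameters are not stated in the text, so the defaults below are placeholders chosen purely for illustration, and the function name map_rate is hypothetical.

```python
def map_rate(methylated, total, alpha=2.0, beta=2.0):
    """Maximum a posteriori rate under a Binomial likelihood and Beta prior.

    alpha and beta are placeholder values (an assumption of this sketch);
    both must exceed 1 so that the posterior mode lies inside (0, 1).
    """
    if not 0 <= methylated <= total:
        raise ValueError("need 0 <= methylated <= total")
    if alpha <= 1 or beta <= 1:
        raise ValueError("this sketch requires alpha > 1 and beta > 1")
    return (methylated + alpha - 1) / (total + alpha + beta - 2)

# A site covered by a single methylated read is pulled towards the prior
# mode (0.5 here) instead of being assigned a rate of exactly 1.
single_read_rate = map_rate(1, 1)
```

The shrinkage matters in single-cell data, where most sites are covered by one or two reads and the raw ratio k/n would otherwise take only the values 0, 0.5 or 1.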
RNA quantification
Gene expression counts were quantified from the mapped reads using featureCounts26. Gene annotations were obtained from Ensembl version 8726. Only protein-coding genes matching canonical chromosomes were considered. Following a previous approach27, the count data were log-transformed and size-factor adjusted based on a deconvolution approach that accounts for variation in cell size28.
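As a minimal illustration of the transformation, the Python sketch below applies a per-cell size factor and a log transform to raw counts. It takes the size factor as given and does not implement the deconvolution method used to estimate it; the base-2 logarithm and the pseudocount of 1 are assumptions of the sketch, and the function name log_normalise is hypothetical.

```python
import math

def log_normalise(counts, size_factor, pseudocount=1.0):
    """Return log2(count / size_factor + pseudocount) for one cell.

    `size_factor` is assumed to have been estimated beforehand, for
    example by a pooling-based deconvolution method.
    """
    if size_factor <= 0:
        raise ValueError("size factor must be positive")
    return [math.log2(c / size_factor + pseudocount) for c in counts]

# Toy cell with three genes and a size factor of 2.
normalised = log_normalise([0, 6, 30], size_factor=2.0)
```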
Statistical analysis
CpG Methylation and GpC accessibility profiles
CpG methylation and GpC accessibility profiles were visualised by taking predefined windows around the genomic context of interest. For each cell and feature, CpG methylation and GpC accessibility values were averaged using running windows of 50 bp. The information from multiple cells was combined by calculating the mean and the standard deviation for each running window. Profiles were calculated using a subset of 20 cells with similar mean methylation rate values. Genes were split into three classes according to a histogram of the log2 normalised expression counts (x): Low (x<2), Medium (2<x<6) and High (x>6). For genomic features that are not directly linked to genes (i.e. enhancers or transcription factor binding sites), all possible relationships between genes and features within 5kb of the gene (upstream and downstream of gene start and stop) were considered.
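The windowed averaging and the expression classes can be sketched in Python as follows. Positions are taken relative to the centre of the feature of interest. The step between consecutive windows is not stated above, so the sketch defaults to non-overlapping 50 bp windows, and it assigns the boundary values x = 2 and x = 6 to the Medium and High classes respectively; both choices are assumptions, as are the function names.

```python
def window_means(sites, flank, window=50, step=50):
    """Average per-site rates in windows spanning [-flank, +flank).

    `sites` is a list of (relative_position, rate) pairs for one cell and
    one feature. Windows without any observed site yield None.
    """
    profile = []
    for start in range(-flank, flank - window + 1, step):
        values = [rate for pos, rate in sites if start <= pos < start + window]
        mean = sum(values) / len(values) if values else None
        profile.append((start, mean))
    return profile

def expression_class(x):
    """Classify a log2 normalised expression value as Low, Medium or High."""
    if x < 2:
        return "Low"
    if x < 6:
        return "Medium"
    return "High"

# Three GpC sites around a feature centre, summarised in two 50 bp windows.
example_profile = window_means([(-40, 0.0), (-10, 1.0), (10, 1.0)], flank=50)
```

Profiles from multiple cells could then be combined by taking the mean and standard deviation of the per-window values, skipping windows that are None in a given cell.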
GpC accessibility profiles around the TSS in a single cell (as displayed in Supplementary Fig. 9a and Fig. 2e) were generated using a generalised linear model (GLM) of basis function regression coupled with a Bernoulli likelihood using BPRMeth29.
Correlation analysis
For the correlation analysis across cells, genes with low expression levels and low variability were discarded, according to the rationale of independent filtering30. Genomic features observed in fewer than 50% of the cells and with a coverage of less than 3 sites were discarded. Furthermore, only the top 50% of the most variable loci were considered for analysis and a minimum number of 20 cells was required to compute a correlation. Only genomic contexts with more than 100 features that passed the filtering criteria were considered for the analysis. For association tests, all possible relationships between genes and genomic features within 8kb of the gene (upstream and downstream) were considered. Following our previous approach4, we tested for linear associations by computing a weighted Pearson correlation coefficient, thereby accounting for differences in the coverage between cells. When assessing correlations between GpC accessibility and CpG methylation, we used the average CpG methylation coverage as a weight.
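A weighted Pearson correlation coefficient can be computed from weighted means, variances and covariance, as in the following Python sketch. The weights stand for the per-cell coverage of the feature; the function name weighted_pearson is hypothetical and the sketch is not the code used for the analyses reported here.

```python
import math

def weighted_pearson(x, y, w):
    """Weighted Pearson correlation of x and y with non-negative weights w."""
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    # The common factor 1/total cancels between numerator and denominator.
    return cov / math.sqrt(var_x * var_y)

# With equal weights the result matches the ordinary Pearson coefficient.
r_equal = weighted_pearson([1, 2, 3], [1, 3, 2], [1, 1, 1])
```

Cells in which a feature is covered by many sites contribute more to the estimate than cells in which the rate rests on only a few sites.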
Two-tailed Student’s t-tests were performed to test for nonzero correlation, and P-values were adjusted for multiple testing for each context using the Benjamini-Hochberg procedure.
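For illustration, the test statistic and the multiple-testing adjustment can be sketched in Python. For a correlation r computed from n cells, the statistic t = r * sqrt((n - 2) / (1 - r^2)) is compared against a Student's t distribution with n - 2 degrees of freedom; applying this formula unchanged to a weighted correlation is an assumption of the sketch. The P-value lookup itself is omitted so that the example needs only the standard library, and the function names are hypothetical.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for testing a nonzero correlation (n - 2 degrees of freedom)."""
    if n < 3 or abs(r) >= 1:
        raise ValueError("need n >= 3 and |r| < 1")
    return r * math.sqrt((n - 2) / (1 - r * r))

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Adjustment is applied separately within each genomic context.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

Features with an adjusted P-value below 0.1 would correspond to the FDR < 10% threshold used in the main text.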
To improve the correlations of promoter methylation or accessibility with expression, we optimised the genomic window used to define the CpG methylation or GpC accessibility rate as follows. First, we selected 20 random cells, extracted +/−4kb regions around the transcription start site of all genes and divided them into overlapping 200bp windows with a stride of 50bp (Supplementary Fig. 12). Then, for each cell and window, we computed the correlation across all genes between the CpG methylation or GpC accessibility rates and the corresponding gene expression. Finally, we selected the regions for which the correlation was maximised: +/−100bp in the case of accessibility and +/−1kb in the case of methylation.
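The tiling scheme can be made concrete with a short Python sketch that enumerates the candidate windows and picks the one with the strongest correlation. Coordinates are relative to the transcription start site. Ranking windows by the absolute value of the correlation is an assumption made here so that the same helper covers the negative correlations expected for methylation; the function names are hypothetical.

```python
def tss_windows(flank=4000, width=200, stride=50):
    """Overlapping windows (start, end) tiling [-flank, +flank] around a TSS."""
    return [(s, s + width) for s in range(-flank, flank - width + 1, stride)]

def strongest_window(correlations):
    """Return the window whose correlation has the largest magnitude.

    `correlations` maps (start, end) to a correlation coefficient, for
    example averaged over the randomly selected cells.
    """
    return max(correlations, key=lambda window: abs(correlations[window]))

windows = tss_windows()
# Toy correlations for two windows; values are made up for illustration.
best = strongest_window({(-100, 100): 0.3, (900, 1100): -0.1})
```

With the default parameters the tiling yields 157 windows per gene, from (-4000, -3800) to (3800, 4000).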
Author contributions
S.J.C and W.R. conceived the project. S.J.C, T.M.S and H.J.L performed experiments. R.A., S.J.C and C-A.K performed statistical analysis. F.K. processed and managed sequencing data. S.J.C, R.A, J.C.M, O.S, W.R interpreted results and drafted the manuscript. G.S., G.D.K, J.C.M, O.S. and W.R supervised the project.
Footnotes
9 Joint senior authors.