Abstract
Conventional approaches to studying 5mC marks in single cells or samples with picogram input DNA amounts usually suffer from low genome coverage due to DNA degradation. Many methods have been developed to optimize the library construction efficiency for bisulfite-treated DNA. However, most of these approaches ignored the amplification bias of bisulfite-treated DNA, which leads to shallow genome coverage. In this study, we developed the T7-assisted enzymatic methyl-sequencing method (TEAM-seq), which adopts enzymatic conversion to minimize DNA degradation and T7 polymerase-assisted unbiased amplification. We demonstrate that TEAM-seq delivered, to the best of our knowledge, the highest reported coverage(70% for 100pg, 35% for 20pg) of single cell genomes in whole-genome 5mC sequencing.
Introduction
In recent years, the rapid development of single-cell sequencing technology has provided many valuable insights into complex biological systems, for example, revealing complex and rare cell populations and tracking the trajectories of distinct cell lineages in development 1. Numerous single-cell RNA and DNA sequencing methods have been developed 2. Heritable phenotype changes occur not only owing to changes in DNA and RNA nucleotide sequence, but also via epigenetic modifications, which do not change the DNA sequence itself. Among the epigenetic modifications, methylation of cytosine with the formation of 5-methylcytosine (5mC) is the most abundant epigenetic base change in vertebrates. This epigenetic modification has been intensively studied in relation to embryonic development, genomic imprinting, X-inactivation, cellular proliferation, and differentiation 3, 4. Conventional approaches to studying 5mC epigenetic marks usually rely on chemical treatment to convert modified/unmodified sites to readable mutated nucleotides, which requires high DNA input 5. However, the limited number of DNA copies significantly complicates the application of these conventional approaches to studies in single cells.
The standard and well-established technology to detect sequence-specific 5mC marks is bisulfite sequencing 6, 7, which utilizes sodium bisulfite to convert the unmethylated cytosine to uracil after constructing the sequencing library. Methylated cytosine is resistant to bisulfite conversion. Although bisulfite sequencing has been established as the gold standard for bulk DNA methylation analysis, single-cell adaptations of this method for low DNA input face a major hurdle of bisulfite-induced DNA degradation. Pioneer studies of single-cell bisulfite sequencing were performed by integrating all conventional steps, including bisulfite sequencing, into a single-tube reaction, followed by PCR amplification and deep sequencing 6, 7. The constructed DNA library is usually broken by the harsh bisulfite treatment, and the CpG/genome coverage drops to 5%/4% 8. The CpG coverage rate can be improved to approximately 18% through post-bisulfite adaptor tagging 9. Single-cell genome-wide bisulfite sequencing, in which the adaptor tags bisulfite-treated DNA with a 3′ stretch of random nucleotides was shown to further improve the coverage by up to 5–48.4% of all CpG sites 10–13. Recent advances in single-nucleus methylcytosine sequencing have been based on this approach, with an additional use of adaptase and random priming to tag bisulfite-treated DNA with adaptors 14.
In addition, recent advances have facilitated high-throughput single-cell methylation sequencing. Single-cell combinatorial indexing for methylation analysis uses transposons to assign combinatorial indices on single cells and then tag the other adaptor by random priming after bisulfite conversion 15. Although these previous methods improved the adaptor tailing efficiency, they suffered from DNA degradation because of the harsh bisulfite treatment. Recently, a gentle conversion method that combines ten-eleven translocation (TET) oxidation with pyridine borane sequencing was reported 16, which used a high concentration of TET1 to convert 5mC to 5CaC, which in turn, was converted to uracil by bicarbonate treatment. However, high concentrations of TET1 are not available in most laboratories. Overall, single-cell methylation sequencing coverage was poor, as it covered only 20% of CpG sites and less than 10% of the genome. Therefore, it is necessary to improve the coverage, accuracy, and read length of single-cell methylation sequencing.
The methods mentioned above improve the adaptor tagging efficiency or minimize DNA loss, but they ignore biased PCR amplification that causes uniform genome coverage. Here, we describe the T7-assisted enzymatic methyl-sequencing method (TEAM-seq) that improves coverage. In TEAM-seq, transposons tagmentate genomic DNA with T7 adaptors. Then, a gentle and high conversion efficiency kit is used to protect the methylation sites by TET2 and convert the unmethylated sites to uracil by the activity of apolipoprotein B mRNA editing enzyme catalytic subunit 3A (APOEC3A). This method can also be easily scaled up to high-throughput analysis of single cells. With TEAM-seq, we achieved a superior 70% coverage efficiency with only 2× sequencing depth and ultra-low DNA input, which, to the best of our knowledge, is the highest single-cell methylation coverage ever reported.
Results
Feasibility of enzyme-based conversion to detect the methylome
To detect the methylome, the conventional method is bisulfite sequencing (BS), which relies on the bisulfite-induced conversion of cytidine to uridine in the genome. However, 5-methylcytidine (5mC) does not get converted in this way (Sfig.1a). Furthermore, this harsh treatment usually results in very short fragments and significant loss of genomic DNA 17. To avoid the harsh bisulfite treatment, we consider using the gentle enzyme-based conversion methods, such as TET-assisted pyridine borane sequencing (TAPS)16 or commercially available EM-seq (NEBNext® Enzymatic Methyl-seq Kit | NEB). Unlike directly converting 5mC to U by TAPS, the commercially available enzyme-based method termed Enzymatic Methyl-seq (EM-seq) has been developed recently by New England Biolabs, which uses APOEC3A to convert cytidine to uridine for a gentler treatment of DNA 18. The conversion protocol included two general steps: 1) TET2 transformed 5mC/5hmC to 5-formylcytosine (5fC), 5-carboxycytosine (5caC), and 5-(β-glucosyloxymethyl)cytosine (5gmC), which could be protected from downstream APOEC3A conversion; 2) APOEC3A converted all the remaining unmodified cytidines to uridines (Sfig.1a). The unconverted “C” in sequencing data represented methylation sites, as in BS. A comprehensive analysis of the catalytic activity of this enzyme and a comparative study of the NA12878 dataset have been published previously19. Here, we sought to further extend the application of EM-seq to plant genomic DNA of Arabidopsis.
The BS and EM-seq libraries were prepared in a pairwise manner. We used 1μg of Arabidopsis DNA for the initial test. Both BS and EM-seq require PhiXDNA to balance the sequencing signal bias. The sequencing quality and data yield were comparable between EM-seq and BS (Sfig1.b–f). EM-seq was characterized by a 14% lower duplication rate (Sfig.1c) and better methylome coverage (Sfig.1d, e) compared to those observed with the BS approach. To assess the accuracy of EM-seq, we added two different types of DNA to estimate the false-positive rate (lambda DNA) and 600 bp synthetic methylated DNA to estimate the false-negative rate. We found that false positive and false negative rates were comparable for EM-seq and BS approaches (1.69% vs. 0.96% and 6.03% vs. 2.93%, respectively) (Sfig.1c). The conversion rate of unmodified C in EM-seq was approximately 98.31% (Sfig.1c). The same software and pipeline were used to analyze the outputs of BS and EM-seq experiments (Supplementalmaterial1_detailedprotocol).
Next, we analyzed the consistency of EM-seq and BS methylome data. At least four reads covered more than 15 million CpG positions in both methods. We defined a base as methylated if its methylation ratio was greater than 0.5 and found that 98.5% of CpG regions (covered > 4 times) had a consistent modification pattern in EM-seq and BS experiments. (Fig. 1a). For example, both EM-seq and BS reported highly correlated methylation sites across chromosomal regions (Fig. 1b). By further comparing the modification level for each 5mC covered by at least three reads, we observed a good correlation (0.89) between EM-seq and BS data (Fig. 1c). The density of the methylation across chromosome 4 or on gene elements were similar between the two methods (Fig. 1d–f). For example, both methods showed equal methylation level distribution around the CpG islands (CGIs) (Fig. 1g). Together, these results indicate that EM-seq can directly replace whole genome BS (WGBS) and provide comparable results. We then further examined whether this non-destructive method could be used for methylation detection in ultra-low input samples, using various conversion methods.
DNA polymerase amplification of a large uridine-rich fragment
Multiple displacement amplification (MDA) is one of the most popular choices for non-destructive amplification of the genome in ultra-low input samples 20, 21. We started by performing EM-MDA-seq in a 1-ng sample of genomic DNA (Fig. 2b). Genome cytidines were converted to uridines by EM-seq, producing long uridine-rich genomic fragments (Fig. 2a). The converted DNA fragments underwent MDA and 10-fold amplification was achieved after 30 min of incubation (Fig. 2b). Subsequently, conventional DNA library construction was performed, and EM-MDA-seq of this 1-ng input showed excellent performance (84.79% coverage at the 6.96× depth) (Fig. 2c, green). However, in the experiment with a picogram input, EM-MDA-seq showed a large amplification bias and insufficient coverage (5.41% coverage at the 2.62× depth) (Fig. 2c, green). We then decided to lower the EM-MDA-seq bias with picogram input samples by improving the priming efficiency and polymerase speed.
It has been previously reported that random hexamer priming results in a bias in nucleotide composition and that this bias influences the uniformity of the location of the reads. The novel MDA “Trueprime” method, which is based on the use of DNA primase without a random hexamer, demonstrated superior breadth and uniformity of genome coverage with high reproducibility 22 (Fig. 2b). However, Trueprime was not very efficient in amplifying the uridine-rich converted DNA and yielded only 2.03% coverage at the 3.94× depth. (Fig. 2c, blue).
It has been noted previously that DNA polymerase has base-pairing preference for nucleotide incorporation 23, which may cause amplification bias during EM-MDA-seq. Most DNA polymerases incorporate GT and AC nucleotides with high and low efficiency, respectively 23. In our experiments, we also found that the speed of Phi29 was 5-fold lower in bisulfite-converted DNA than in untreated genomic DNA (based on BGI production center reports), supporting the notion of the low efficiency of adenosine incorporation. We tried several solutions to normalize the nucleotide incorporation speed, including poly (ethylene glycol)/trimethylammonium chloride solution, slow down amplification, and minimizing the bias. We also tried an alternative robust polymerase, BST. However, none of these approaches showed superiority in terms of uniform coverage (Fig. 2c). However, to solve the issue of the AU bias in EM-seq, we used the GpC methyltransferase M.CviPI to pretreat the genome, which preserves 25% more GC content through the EM-seq conversion (Fig. 2a). Surprisingly, GpC methyltransferase treatment did not lead to obvious advantages over the control group in a side-to-side comparison (Fig. 2c orange).
Thus, the DNA polymerases Phi29 and BST had amplification bias in the converted uridine-rich DNA and were concluded to be unsuitable for EM-seq amplification. We then pursued an alternative amplification method with a more even coverage, by using T7 RNA polymerase amplification.
Uniform amplification of ultra-low input uridine-rich DNA fragments by T7 RNA polymerase
Considering that the AC/GT bias amplification may be general problem for DNA polymerase, we used RNA polymerase instead of DNA polymerase to avoid the bias amplification. An improved single-cell whole genome amplification method based on the use of T7 polymerase has been recently reported that outperformed other MDA approaches, enabling a more uniform coverage of the genome 24. Unlike the MDA, using Phi29 DNA polymerase to ceaselessly synthesize the complementary strand on the single strand DNAs, T7 polymerase generated marvelous RNAs on the DNA templates with less bias. Inspired by that finding, we have developed a novel method, which we termed T7-Assisted Enzymatic methyl-seq (TEAM-seq).
In this method, we used Tn5 transposons to tagmentate the genome with biotinide adaptors and obtain medium-sized fragments (500–2,000 bp). The tagmentated DNA was then enzymatically converted, and single stranded DNAs were extended on the T7 promoter oligo to generate the T7 promoter on the DNA (Fig. 3a). Throughout the process, we used streptavidin beads to bind DNA without elution to minimize DNA loss in the purification. Finally, the enzymatically converted DNA was amplified using T7 RNA polymerase. The final constructed library was sequenced by the nanopore or Pacbio method to generate long reads for paternal/maternal phasing (the detailed protocol and experimental information can be found in the supplementarymaterial1). We used 100-pg or 20-pg of genomic DNA for the initial analysis. The coverage and other parameters were better or comparable to those achieved by the WGBS of a 1-μg sample (11.58% duplication rate, coverage at 2.58× depth). The consistency between TEAM-seq and WGBS approaches was approximately ∼93% (Fig. 3b). We also summarized the features of the different published methods (Fig.3c). We then performed a detailed study of the coverage and accuracy of TEAM-seq (comparing with scWGBS, and conventional WGBS).
The sequencing quality and data yield for 100-pg/20-pg samples were similar to those achieved by the conventional nanopore DNA sequencing. Analysis of the 20-pg input showed a higher base volatility than that of the larger, 100-pg input (Sfig. 2a). The 20-pg library resulted in 37.09% genome coverage with 2.05× sequencing depth, which are, to the best of our knowledge, the highest recorded values for such analyses (Fig. 3b). The accumulation curve showed the coverage saturation in each method (Sfig. 2b). For the picogram-level input, TEAM-seq showed lower GC bias and equal coverage of CGIs(CpG islands) than those achieved by the analyses of a 1-μg input by WGBS and 100-ng input by single cell WGBS (Sfig. 2c–e). The higher and more uniform genome coverage resulted in a larger number of all C sites covered by TEAM-seq in the case of a picogram input. In the experiment with a 20-pg input, TEAM-seq covered 26.97% of CpG and 24.75% of C sites at the 2× depth, whereas for a 100 pg input, these values were higher: 54.85% and 52.79%, respectively. Further, TEAM-seq covered 34.12% and 63.74% of CGIs in experiments with 20 and 100-pg inputs, respectively.
TEAM-seq also showed a high consistency and methylation level correlation with the results of the gold standard WGBS. Over ∼93% of CpGs in TEAM-seq (100-pg/20-pg) showed modification states that matched those detected by WGBS (Sfig. 3a). The correlations between WGBS and TEAM-seq results (100-pg/20-pg) were 0.85/0.81 for the sites with >20× coverage. Distributions of the methylated sites were also very similar (average standard error ± 0.2 on CpG; ± 0.6 on CHG) on chromosome and gene element levels and close to those observed for WGBS (Sfig. 3c, d). The methylation ratios of the 20 pg genomic DNA sample or single-cell genomic DNA, which only had 1 or 2 genome copies, showed that the methylation distribution was polarized (peaks on 0 or 1) (Sfig. 3b). For example, in the 20 pg input sample, a similar methylation level distribution in the CGI was observed (Sfig. 3e). Moreover, we found that TEAM-seq had a shallow coverage-related methylation ratio bias (Sfig. 3f). Three technicians performed the experiments, and the reproducibility was sufficiently good (Sfig.4).
In summary, TEAM-seq uniformly amplified enzymatically converted DNA, resulting in better coverage of samples with ultra-low input DNA amount. Next, we used TEAM-seq to perform single-cell methylome analysis to expand its application scope.
Indexed TEAM-seq for mid to high throughput single-cell analysis
Using transposons, we designed a unique, relatively long barcode, which could be identified by nanopore sequencing (Sfig. 5). After cell nuclei were sorted into plates, we labeled the cell genome using the barcoded transposons. We then performed single-cell TEAM-seq on the pooled cells (Please see Supplemental material 1_detailed protocol). Due to the throughput limitation of nanopore sequencing, we only processed 50–100 cells. Our proposed method can be easily scaled up and down based on your nanopore sequencing budget. We observed a meager collision rate of 5%, based on the artificial mixture of the human(HeLa) and mouse(4T1) cells (Fig. 4a, b).
Next, we profiled an artificial mixture of HeLa cells and the HEK293T cells. We could debarcode >70% reads and 64 cells (100% cell capturing efficiency), all of which passed the quality filters (depth > 0.2×, mapping rate > 45%, genome coverage > 3%). Each cell had a median of 10 million reads and a mean alignment rate of 56%, approaching the levels seen in the TEAM-seq experiment with 20-pg DNA input (Sfig. 6a). These data translate to the coverage of mappable CpG ranging from ∼5% to ∼20% (Sfig. 6c) and genome coverage ranging from ∼10% to ∼30% (Sfig. 6a). As expected, the sequencing depth was not yet saturated, and an increase in the amount of data could further improve the genome coverage (Sfig. 6b). Among these covered CpG and genome sites, ∼0.2 million Ensembl builds were covered by at least half of the cell population (Sfig. 6f). We did not observe any abnormal coverage bias on the whole genome scale or in CGI (Sfig. 6e, d). Next, we summarized the methylation status of each cell across Ensembl builds to observe the accuracy and variance of methylation mark detection. The correlations between single cells ranged from 0.4 to 0.7 in 0.1–2 kb fragments of Ensembl build (Sfig. 7a). The variance across the genome comprised 0.01–0.05 in CGIs, enhancers, and gene bodies (Sfig. 7b). On H3K27me3/H3K27ac/H3K4me3, all cells produced the expected nucleosome pattern with corresponding heterochromatin/euchromatin properties (Sfig. 7c).
Next, we profiled a mixture of HEK293T and HeLa human cell lines (Fig. 4). We performed non-negative matrix factorization (NMF) followed by k-means embedding to project cells in two-dimensional space, producing two clearly defined clusters (Fig. 4c, d). We correlated the methylation rate of the collapsed cluster with publicly available WGBS datasets (Fig. 4e). One cell cluster showed the highest correlation with the corresponding cell type (Fig. 4e). To display the advantage of increased coverage of the single cell, we down-sampled the coverage and found that the identified two cluster distances decreased with coverage until merging (Fig. 4d). Overall, TEAM-seq could generate high methylation site coverage on the indexed single cells in a high-throughput manner.
Discussion
In the current study, we developed the TEAM-seq method based on the enzymatic cytidine conversion (EM-seq) and T7 polymerase amplification, achieving a very high genome coverage in 100 pg and 20 pg input samples, and even in single cells. However, EM-seq still demonstrated 0.7% lower conversion rates than the gold standard BS-seq, resulting in a higher false-positive/negative rate. The higher number of false positive/negative methylation marks may be caused by the substrate/motif preference of Tet1/2 and APOEC3A or different enzymatic efficiency. The slightly higher false positive/negative rate would be problematic in single-cell analysis. Because a single cell has only one/two copies of genome, the accuracy is difficult to correct. However, this problem cannot be easily avoided in enzymatic conversion. Given the significant DNA loss during the harsh bisulfite treatment, we believe that enzymatic conversion is a better choice for single-cell methylation analysis. Moreover, we are developing a machine learning algorithm that may help distinguish false positive/negative methylation marks in single cells.
We have tried many other approaches in the current study, but most methods produced surprisingly unsatisfactory results. The Trueprime kit demonstrated superior performance in single-cell genome amplification compared to that achieved with MDA, but not for enzymatically converted uridine-rich DNA. We suspected that heavy modifications (5CaC/5fC/5gmC) generated by EM-seq inhibited primer synthesis by primase. A similar phenomenon was also observed during the GpC methyltransferase treatment, which preserved 25% of the GC content in the enzymatic conversion. The high GC content adversely affected MDA owing to the parallel increase in heavy modifications (5CaC/5fC/5gmC). We also observed a related phenomenon in a mixture of HEK293T and HeLa cells. HEK293T cells generated twice more reads than HeLa cells, suggesting the preferential amplification of HEK293T cell DNA. This bias was not observed in a mixture of HeLa and 4T1 cells. Considering the higher abundance of 5mC marks in cancer cells25, heavy modifications may also lead to the inefficiency of T7 polymerase. Therefore, the vast global difference in the methylation rate may cause biased amplification in cell populations for pooling the indexed cells.
In TEAM-seq, we used Tn5 to fragment genomic DNA, but the fragment length was difficult to control precisely in single cells. The varied fragment length required nanopore or Pacbio sequencing. The nanopore method has higher error rate than Pacbio. However, we found that two reads per site could confidently correct the sequencing errors. Otherwise, our patented linked-reads method 26 on the Pacbio platform could be an alternative to nanopore sequencing, which could increase 5x output of Pacbio sequencing.
Overall, TEAM-seq reported genome coverage for ultralow input DNA amounts and DNA from single cells. This method could help in studies of samples with a limited DNA amount, e.g., embryos, oocytes, clinical tissues, etc. The high coverage of single cell genomes may further disclose methylation heterogeneity of different cell populations.
Methods
Experimental methods
Assess the EM-seq and WGBS
Arabidopsis DNAs were given by the other lab. The EM-seq was performed and NEB EM-seq protocol (NEB E7120S)(1ug DNA per reaction). The WGBS was parallelly performed by using MGIEasy Whole Genome Bisulfite Sequencing Library Prep Kit (MGI 1000005251). Both methods were similar except the conversion steps. The final products were sequenced by MGISEQ-2000.
MDA for the converted DNA
Both of the WGBS and EM-seq converted cytidine for the constructed DNA library, which was fragmented and ligated with adaptors. Unlike the WGBS, MDA-EM-seq used the Phi29 to do the multiple displacement amplification on the non-fragmented DNA. The genomic DNA were directly processed to EM-seq conversion steps (https://international.neb.com/protocols/2019/03/28/protocol-for-use-with-standard-insert-libraries-370-420-bp-e7120, 1.5-1.9). The converted single-strand DNAs were around 6∼10kb after enzyme treatment. The clean-up single-strand DNAs were processed to the MDA amplification by REPLI-g Single Cell Kit (Qiagen 150343). For Trueprime MDA-EM-seq, the clean-up DNAs underwent the protocol SYGNIS TruePrimeTM WGA & Single Cell WGA Kits (Lucigen), instead of the REPLI-g single cell kit. In the EM-MDA-seq with additives, all the addictive (3% PEG,100mM TMAC, 3%PEG+100mM TMAC) were added to the REPLI-g Single Cell reaction. In the EM-MDA-seq with BST amplification, we used the BST2.0 (NEB M0537S) to substitute the Phi29 in REPLI-g single cell kit. The amplified DNAs were processed as MGIEasy Universal DNA Library Prep Kit V1.0 and sequenced in MGISEQ-2000.
scWGBS_ng
The scWGBS_ng was downloaded 27. Briefly, bisulfite conversion was performed using the EZ DNA Methylation-Direct Kit (Zymo Research, D5020) according to the manufacturer’s protocol, with the modification of eluting the DNA and minimizing the DNA loss. The single stranded library construction was followed.
TEAM-seq
We first fragment the DNA by Tn5 transposon with biotin adaptor. After tagmentation, the DNA with biotin adaptors could be bound with the streptavidin beads. Then the DNAs carried on streptavidin beads were processed to EM-seq (https://international.neb.com/protocols/2019/03/28/protocol-for-use-with-standard-insert-libraries-370-420-bp-e7120, 1.5-1.9). The DNA immobilized on the beads could be smoothly transfer to the next reaction without the significant loss. After conversion, the single strand DNAs (beads-on reaction) were extended on the complementary oligos to generate the T7 promoters. Then we performed the bead-on T7 transcription reaction by HiScribe™ T7 High Yield RNA Synthesis Kit (NEB E2040S). The amplified RNAs were processed to the PCR-cDNA sequencing kit (nanopore). The detailed description of methods can be found in Supplemental Material supplementalmaterial1_detailedprotocol. For the single cell TEAM-seq, we used the Qiagen protease to lysis the single cell to release DNAs and heat inactivated the Qiagen protease before processing to the Tn5 tagmentation.
Cell culture
Human mammary gland carcinoma cell line HeLa/HEK293T were obtained from ATCC. MCF-7 were grown in DMEM (Gibco,11995065) supplemented with 10% FBS (Gibco,10099141), 0.01mg/ml insulin (HY-P1156, MedChemExpress), and 1% penicillin-streptomysin (Gibco, 15140122). Cell line was regularly checked for mycoplasma infection (Yeasen, 40612ES25).
Bioinformatics method
Data processing
Sequencing adapters were cut using fastp (v0.21.0). The remaining reads were trimmed based on their theoretical barcode locations. Then they were aligned to the barcode white list in the following pattern: Ns-GGGAGATACAACCTACAATCACT-10bp barcode- AAATATATATAAAAAACAA-Ns using bowtie2 local alignment mode (v2.3.4). Only reads aligned to the barcode reference with high mapping quality would be processed and used for subsequent analysis.
To assess the accuracy of demultiplexing, 10,048 reads were simulated with a length of 52 bases, including a 23bp 5’ primer, 10bp barcode from the white list, and a 19bp 3’ primer. In order to intimate nanopore sequencing errors, a mean indel rate of 2.96% and a mean mismatch rate of 2.77% were introduced to the simulated data.
Reads were aligned by methylpy (v1.3) in the following steps: (1) reads mapped to the forward strand GRCh37 reference genome were de-duplicated with the parameter “--pbat --remove-clonal --path-to --picard --min-mapq 15”. (2) Remaining unmapped reads were then mapped to the reverse strand of the reference genome in the same way. (3) Merge all mapped bam, then call methylation status by methylpy call_methylated_sites. In terms of human-mouse mixed library, a combined human-mouse hybrid genome was used instead of GRCh37 reference.
Fastq files of scWGBS, sci-MET, snmC-seq and scBS-seq downloaded from the database were aligned to the reference genome and called the methylation ratio using Bismark (v0.22.3). Any other conventional WGBS data was analysed using our standard pipeline.
Analysis of cytosine modifications called by TEAM-seq, scWGBS and WGBS
Chromosomal methylation status was plotted using CG/CHG/CHH positions covered by more than three times. The chromosome was then binned into equal sizes (100kb for Arabidopsis Chr4, 2Mb for human Chr1 and Chr4). Mean methylation ratio per bin was calculated and plotted along the chromosome. The mean occurrences of C/CG/CHG/CHH in each bin was computed and plotted in the same way.
Information of histone (H3K27me3/H3K27ac/H3K4me3) ChIP-seq peaks was downloaded from the ENCODE database. Bed files of gene elements including CGIs, TSS, TES and gene body were downloaded from UCSC genome browser. These regions were binned into 20 or 50 windows, flanking by 50 windows of the up-/downstream regions. The methylation ratio of each bin was averaged and plotted using R.
Average coverage depth in CGI
Total nucleotide count for each bin was reported using samtools bedcov (v1.9), then average coverage depth in each window of the CGI region was computed for TEAM-seq, scWGBS and WGBS. To overcome differences in sequencing depth, TEAM-seq and WGBS data were down-sampled to the corresponding sequencing depth of scWGBS. Average coverage depth for single cell data was computed in the same manner.
Correlation and consistency analysis
For each single cell, weighted methylation ratios of the Ensembl Regulatory Build regions were computed in a previously described method 28,10. We then calculated the Pearson correlation of regions including all Ensembl Regulatory loci, promoter region, enhancer region and CTCF binding sites. Clustered heatmap was then plotted using pheatmap in R.
In order to correlate with bulk WGBS data, we merged the methylation information of all cells in cluster 1 and calculated methylation ratio at single nucleotide level. The methylation ratio was then correlated to WGBS datasets available from the ENCODE database as well as the conventional HeLa WGBS data 29.
The consistency of methylation status was compared between WGBS and EM-seq, TEAM-seq 100pg and 1ug WGBS, TEAM-seq 20pg and WGBS, and among three TEAM-seq 20pg libraries. For each comparison, only positions covered more than three times were used and a methylation ratio of 0.5 was used as the threshold. For example, if the methylation ratio of a cytosine is either below 0.5 or above 0.5 in both libraries, the methylation pattern was considered as consistent. Venn diagram and Euler diagram were plotted using VennDiagram and eulerr packages in R.
NMF decomposition and k-means clustering
Non-negative Matrix Factorization (NMF) is a decomposition algorithm to split a matrix V into two matrices W (dictionary matrix) and H (meta-feature matrix). Due to its clustering property, it has been widely used in DNA methylation studies to identify features that contribute the most to the cluster. Therefore, using Ensembl Regulatory Build as the feature, NMF clustering was performed to distinguish HeLa and HEK293T cell lines. Then, k-means clustering algorithm was used to cluster cells by selecting 100%, 60%, 30% and 10% features to investigate the trend of two clusters getting more similar to each other. All clustering analyses were performed in R.
Data availability
TEAM-seq_100pg, TEAM-seq_20pg, HeLa&4T1 mixture data as well as HeLa&HEK293T mixture data are available at China National GeneBank (CNGB) 30, 31 with project number of CNP0002270.
Author contributions
CT designed and supervised the experiments. JW performed the experiments; YTF, FC, and CT performed bioinformatics data analysis. All authors collectively analyzed experimental data. All authors read and approved the final draft of the manuscript.
Competing interest
A patent application has been filed by BGI Genomics Co Ltd for the technology disclosed in this publication.
Acknowledgment
This research was supported by the Science, Technology, and Innovation Commission of Shenzhen Municipality (grant number JSGG20170824152728492). The supporter had no role in designing the study, data collection, analysis, interpretation, or writing of the manuscript.
Footnotes
↵† These authors contributed equally to this work.