Abstract
Detection of DNA cytosine modifications such as 5-methylcytosine (5mC) and 5-hydroxy-methylcytosine (5hmC) is essential for understanding the epigenetic changes that guide development, cellular lineage specification, and disease. The wide variety of approaches available to interrogate these modifications has created a need for harmonized materials, methods, and rigorous benchmarking to improve genome-wide methylome sequencing applications in clinical and basic research.
We present a multi-platform assessment and a global resource for epigenetics research from the FDA’s Epigenomics Quality Control (EpiQC) Group. The study design leverages seven human cell lines that are publicly available from the National Institute of Standards and Technology (NIST) and Genome in a Bottle (GIAB) consortium. These genomes were subject to a variety of genome-wide methylation interrogation approaches across six independent laboratories. Our primary focus was on cytosine modifications found in mammalian genomes (5mC, 5hmC). Each sample was processed in two or more technical replicates by three whole-genome bisulfite sequencing (WGBS) protocols (TruSeq DNA methylation, Accel-NGS, SPLAT), oxidative bisulfite sequencing (oxBS), Enzymatic Methyl-seq (EM-seq), Illumina EPIC targeted-methylation sequencing, and ATAC-seq. Each library was sequenced to high coverage on an Illumina NovaSeq 6000. The data were subject to rigorous quality assessment and subsequently compared to Illumina EPIC methylation microarrays. We provide a wide range of sequence data for commonly used genomics reference materials, as well as best practices for epigenomics research. These findings can serve as a guide for researchers to enable epigenomic analysis of cellular identity in development, health, and disease.
Introduction
DNA methylation, the addition of a methyl group to a nitrogenous base, plays a key role in the regulation of gene expression, disease onset, cellular development, and transposable element activity [1]. In mammalian genomes, a methyl group binds to the fifth carbon of cytosine, creating 5-methylcytosine (5mC) or its oxidized form, 5-hydroxy-methylcytosine (5hmC) [2]. This modification most often occurs at regions in the genome known as CpG dinucleotides, which are characterized by a cytosine nucleotide followed immediately by a guanine nucleotide [3]. Variations in DNA methylation levels correlate with altered gene expression [4], and this phenomenon holds significant implications for developmental processes [4], cancer [5], and biological age [6]. The prevalence, location, and dynamic methylation and hydroxymethylation of CpG sites in the genome are areas of focus for studies seeking to examine their array of physiological effects.
The field of epigenetics has expanded rapidly in recent decades. Since its introduction in 1992 [7], sodium bisulfite treatment, which selectively deaminates unmethylated cytosines to uracil, has emerged as the dominant protocol for 5mC and 5hmC profiling. The advent of massively parallel sequencing in the early 2000s spurred the development of new bisulfite-based and other methods to capture DNA methylation information. The scale of bisulfite analyses has expanded from specific regions to whole-genome methylation sequencing (WMS), including preparation methods such as Swift Biosciences Accel-NGS Methyl-Seq, SPlinted Ligation Adapter Tagging (SPLAT) [8], and Illumina TruSeq DNA Methylation. More recently, protocols utilizing oxidative bisulfite sequencing (TrueMethyl oxBS) [9], enzymatic deamination (EM-seq) [10], targeted-methylation sequencing (Illumina EPIC Capture), and transposase-accessible chromatin sequencing (ATAC-seq and Omni-ATAC-seq) [11,12], among others, have further accelerated the breadth and rate of discovery.
As the field of epigenomics continues to advance, there is a need to establish definitive standards and benchmarks reflecting the DNA methylome of human cells and tissues. In particular, there is a need to characterize the unique biases of each library preparation, which can influence not only estimates of methylation, but sequencing quality metrics such as insert sizes of the libraries [8], quality scores [8], duplication rates [8], mapping efficiency [13], and evenness of coverage [14]. Together, these factors can contribute to unexpected differences in methylation calls and result in biased methylation measurements [14]. Bisulfite conversion also presents a computational challenge for data alignment, owing to asymmetrical C-T alignment and reduced sequence complexity. Commonly used bisulfite-sensitive sequence aligners are designed either to work with a three-letter alphabet, or using wild-card algorithms [15]. The choice of aligner can significantly impact computational time, alignment efficiency and data accuracy.
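The three-letter strategy described above can be illustrated with a minimal sketch. It assumes a top-strand read with no mismatches and uses exact string matching in place of a real index; the function names are hypothetical and do not correspond to the internals of any aligner named in this study.

```python
def bisulfite_convert(seq, methylated_positions=frozenset()):
    """Simulate bisulfite treatment of a top-strand fragment:
    unmethylated C is read as T, methylated C is retained."""
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


def to_three_letter(seq):
    """Collapse every C to T, giving the reduced (A, G, T) alphabet."""
    return seq.replace("C", "T")


def find_and_call(reference, read):
    """Place the read in three-letter space, then compare it with the
    original reference to call methylation at each reference cytosine."""
    pos = to_three_letter(reference).find(to_three_letter(read))
    if pos < 0:
        return None, {}
    calls = {}
    for offset, base in enumerate(read):
        if reference[pos + offset] == "C" and base in ("C", "T"):
            # A retained C implies methylation; a T implies conversion.
            calls[pos + offset] = (base == "C")
    return pos, calls


reference = "TTACGGATCCGAAT"
fragment = reference[2:12]
# Position 1 of the fragment (a CpG cytosine) is treated as methylated.
read = bisulfite_convert(fragment, methylated_positions={1})
position, calls = find_and_call(reference, read)
```

Because both sequences are collapsed to three letters before matching, the converted read can be placed without mismatch penalties at cytosines, at the cost of the reduced sequence complexity noted above.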
Here, the FDA’s Epigenomics Quality Control (EpiQC) Group presents a comparative analysis of targeted and genome-wide methylation protocols to function as a comprehensive resource for epigenetics research. These data come from seven publicly available human cell line genomes from the Genome in a Bottle (GIAB) consortium, which has developed a series of reference materials to enable reproducible genomics research [16]. Aliquots of cell lines were processed as two or more technical replicates across six independent laboratories. The resultant libraries were sequenced on multiple Illumina NovaSeq 6000 flowcells, quality controlled, computationally refined, and measured against Illumina methylation arrays to characterize each methylation assay. This reference dataset can act as a useful benchmarking tool and a reference point for future studies as epigenetics research becomes more widespread within genomics research.
Results
Whole Methylome Sequencing
Genomes were sequenced from seven well-characterized human cell lines (HG001-HG007) from the GIAB Consortium [17]. These seven cell lines come from one female HapMap CEU participant (HG001) and two Personal Genomes Project parent/son trios: an Ashkenazi Jewish trio (HG002-HG004) and a Han Chinese trio (HG005-HG007). Genome-wide methylation was examined using a variety of common, commercially available bisulfite and enzymatic conversion library preparation kits, including NEBNext Enzymatic Methyl-Seq (referred to here as EMSeq), Swift Biosciences Accel-NGS Methyl-Seq (referred to here as MethylSeq), SPlinted Ligation Adapter Tagging (referred to here as SPLAT), NuGEN TrueMethyl oxBS-Seq (referred to here as TrueMethyl), and Illumina TruSeq DNA Methylation (referred to here as TruSeq). Aliquots of the same stock of cell lines were distributed to six independent laboratories, with one lab preparing libraries from each methylome assay, and two labs preparing EMSeq libraries. Biological and technical replicates of genomic libraries were pooled and sequenced in multiplex using paired-end 150bp chemistry across two S2 and four S4 flow cells on Illumina NovaSeq 6000, and outputs across flow cells were combined per replicate for subsequent analysis (Table 1).
Each methylome replicate was sequenced to between 475M and 2.3B paired-end reads when combining all rounds of sequencing per replicate (Figure 1A), with the variation resulting from imbalance in library pooling. In contrast, each library type exhibited tight, assay-specific distributions of estimated insert sizes per read pair, as calculated from mapping distance of paired-end reads (Figure 1B). The combination of variable sequencing depth and insert sizes resulted in divergent genome coverage distributions per assay type across the seven cell lines (Figure 1C). Generally, MethylSeq, SPLAT, and EMSeq had the deepest coverage, followed by bisulfite and oxidative-bisulfite replicates from TrueMethyl, and finally TruSeq, which returned an imbalanced coverage of the genome, with the lowest percentage of the genome covered at lower depths, but a long tail of high-coverage sites. TruSeq also showed an imbalance of coverage of cytosines in CpG contexts, with a lowered mean and a longer tail, compared to more normal distributions in other assays (Figure 1D). TruSeq replicates exhibited GC-rich bias in genomic coverage and dinucleotide distribution (Figure 1E,F), owing to the random hexamer priming strategy implemented by this library preparation, in contrast to the more balanced profiles of other genomic assays.
All libraries were passed through an alignment and methylation calling pipeline (see below). Reads were filtered out if they did not map to the reference genome, were marked as PCR or optical duplicates, or returned a mapping quality score below Q10. The number of reads filtered varied by assay, with EMSeq retaining 68-85% of reads per preparation, MethylSeq retaining 80%, SPLAT retaining 75-82%, TrueMethyl retaining 58-62% for oxidative replicates and 65-70% for bisulfite-only replicates, and finally TruSeq retaining as low as 45% of reads (Figure 1G). As a result, different sequencing depths were required to achieve a given mean depth of coverage per CpG dinucleotide (Figure 1H), with EMSeq returning the greatest depth per base, followed by MethylSeq/SPLAT, and then TruSeq/TrueMethyl.
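The read filter described above can be sketched as follows. This is an illustration only, not the pipeline used in the study: it assumes each alignment is summarized by its SAM flag and mapping quality, and it uses the standard SAM flag bits for unmapped (0x4) and duplicate (0x400) reads. The function names and the example records are hypothetical.

```python
FLAG_UNMAPPED = 0x4     # SAM flag bit: read is unmapped
FLAG_DUPLICATE = 0x400  # SAM flag bit: PCR or optical duplicate


def keep_read(flag, mapq, min_mapq=10):
    """Return True if a read passes all three filters:
    mapped, not a duplicate, and mapping quality of at least min_mapq."""
    if flag & FLAG_UNMAPPED:
        return False
    if flag & FLAG_DUPLICATE:
        return False
    return mapq >= min_mapq


def retention_rate(records):
    """Fraction of (flag, mapq) records that pass the filter."""
    records = list(records)
    if not records:
        return 0.0
    return sum(keep_read(flag, mapq) for flag, mapq in records) / len(records)


# Hypothetical (flag, mapq) pairs: two clean reads, one unmapped,
# one duplicate, and one with low mapping quality.
example_records = [(99, 60), (147, 60), (77, 0), (1123, 60), (83, 5)]
example_rate = retention_rate(example_records)
```

Applied per library, a rate of this kind corresponds to the retained-read percentages reported per assay.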
Mapping and Methylation Calling Comparison
Alignment was performed using a set of commonly available aligners for methylome read mapping, including Bismark [18], BitMapperBS [19], bwa-meth [20], and GemBS [21], all against a GRCh38 reference genome appended with bisulfite controls (see Methods; Figure S1). The run time of each aligner was first tested using one million random paired-end reads from each HG002 library. BitMapperBS was the fastest aligner, with an average of 550-650 read pairs processed per CPU core per second, with stable performance between replicates (Supplementary Table 1). Bismark, bwa-meth, and GemBS showed equal alignment speed (about 200 read pairs per CPU core per second). However, Bismark showed the most variability of timing between runs.
Mapping rates varied between the algorithms across methylome library types. On average, bwa-meth and GemBS had the highest rate of reads mapping properly (forward and reverse mates aligning in proper orientation within an expected distance of one another), with values between 92-98%, while Bismark and BitMapperBS returned a rate of 78-86% (Figure 2A). Reciprocally, BitMapperBS and Bismark had a higher rate of unmapped reads (9-18%) than bwa-meth and GemBS (0-2%), owing to different read filtering strategies by the aligners. Bismark and BitMapperBS had fewer ambiguous (secondary and supplementary) alignments for reads that were properly mapped than bwa-meth and GemBS, and all four aligners returned very similar read duplication estimates.
Coverage of cytosines in CpG dinucleotide contexts also varied by caller, though callers performed consistently across assays (Figure 2B). Generally, all four aligners captured a similar, assay-specific fraction of CpG sites at low mean depths, while at higher depths the per-algorithm average dropped off, with Bismark dropping fastest, followed by GemBS, followed by BitMapperBS. Overall, bwa-meth captured the highest fraction of CpG sites along increasing depth cutoffs compared to other algorithms. Accordingly, all down-stream analyses were performed using bwa-meth methylation calls.
In contrast to mapping and coverage rates, per-read methylation bias (or “mBias”) curves were extremely similar among all four algorithms, with different, strand-specific profiles seen for each assay (Figure 2C). EMSeq and TrueMethyl showed hypomethylation at the 3’ OT end and 5’ OB end; MethylSeq showed hypermethylation in these same regions; SPLAT was relatively flat; and TruSeq was more irregular, though overall hypermethylated. In line with this, the Spearman correlation of epigenome-wide methylation profiles between assays and algorithms showed high differentiation among assays, followed by closer grouping of alignment strategies within assays (Figure 2D).
Differences in sequencing depth, and thus CpG coverage, were shown to be a driver of differences in methylation estimates. When replicates of HG002 were compared in a pairwise manner, the coefficient of variation (stdev/mean) of CpG coverage was higher in sites with 20% or more difference in estimated methylation percentage, as compared to sites with 10% or less difference (Figure 2E), for all but one comparison.
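The comparison above can be sketched as follows. The text defines the coefficient of variation as stdev/mean but does not specify how it was aggregated, so this sketch assumes a per-site CV of coverage across the two replicates, averaged within each group of sites. The function names, the grouping labels, and the example values are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (stdev/mean)."""
    m = mean(values)
    if m == 0:
        return float("nan")
    return stdev(values) / m


def mean_cv_by_group(sites, low=10.0, high=20.0):
    """sites: iterable of (cov_a, cov_b, meth_a, meth_b), methylation in percent.
    Sites whose methylation differs by <= low are 'concordant',
    by >= high are 'discordant'; sites in between are ignored."""
    groups = {"concordant": [], "discordant": []}
    for cov_a, cov_b, meth_a, meth_b in sites:
        diff = abs(meth_a - meth_b)
        cv = coefficient_of_variation([cov_a, cov_b])
        if diff <= low:
            groups["concordant"].append(cv)
        elif diff >= high:
            groups["discordant"].append(cv)
    return {name: mean(cvs) if cvs else float("nan") for name, cvs in groups.items()}


# Hypothetical sites from one pairwise replicate comparison.
example_sites = [
    (20, 22, 80.0, 85.0),
    (30, 30, 10.0, 12.0),
    (5, 40, 90.0, 50.0),
    (8, 25, 20.0, 60.0),
]
example_cv = mean_cv_by_group(example_sites)
```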
Downsampled Coverage and Methylation Estimates
Downsampling can be used to simulate the effect of generating similar amounts of sequence data for a given sample when the number of reads sequenced is unbalanced, as in the data generated herein (Figure 1). Downsampling can be done on aligned reads (BAM files) or on the methylation call files (bedGraph files). As the downsampling process at the alignment level can be slow and demanding in terms of disk space and compute time, we set out to evaluate if downsampling methylation calls in bedGraph format recapitulated downsampling aligned reads (BAM files) (Figure S2, Figure S3). Both downsampling approaches yielded similar results in methylation calls, number of CpG sites detected, and distribution of read counts (Figure S2B-D). We also measured the distribution of read counts between the different downsampling approaches (Figure S2E). These data support that downsampling of bedGraph files produces equivalent DNA methylation calls and count distributions as downsampling BAM files, but with the added benefit that the targeted average coverage is more accurately estimated when downsampling bedGraphs.
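One way to downsample methylation calls directly is to thin the per-site read counts at random. The sketch below assumes binomial thinning, in which every methylated and unmethylated read is kept independently with probability equal to the target mean coverage divided by the observed mean coverage. This is an assumed scheme for illustration and may differ from the procedure used in this study; the function name and the record layout (chromosome, position, methylated count, unmethylated count) are hypothetical.

```python
import random


def downsample_calls(sites, target_mean_cov, seed=0):
    """Thin per-site read counts toward a target mean coverage.

    sites: list of (chrom, pos, n_methylated, n_unmethylated).
    Each read is kept independently with probability
    target_mean_cov / observed_mean_cov (capped at 1).
    Sites left with zero reads are dropped.
    """
    rng = random.Random(seed)
    observed_mean = sum(m + u for _, _, m, u in sites) / len(sites)
    keep_p = min(1.0, target_mean_cov / observed_mean)
    thinned = []
    for chrom, pos, n_meth, n_unmeth in sites:
        kept_meth = sum(rng.random() < keep_p for _ in range(n_meth))
        kept_unmeth = sum(rng.random() < keep_p for _ in range(n_unmeth))
        if kept_meth + kept_unmeth > 0:
            thinned.append((chrom, pos, kept_meth, kept_unmeth))
    return thinned


# Hypothetical input: 2,000 CpG sites at 30X with about 67% methylation.
example_calls = [("chr1", 100 + i, 20, 10) for i in range(2000)]
example_10x = downsample_calls(example_calls, target_mean_cov=10)
example_10x_mean = sum(m + u for _, _, m, u in example_10x) / len(example_10x)
```

Because the thinning acts on counts that are already summarized per CpG, the achieved mean CpG coverage can be checked directly after sampling, which is consistent with the observation that the targeted coverage is easier to hit at the bedGraph level.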
Given that downsampling bedGraphs yielded reproducible methylation calls, we evaluated the performance of different library preparation methods for genome-wide DNA methylation analysis using downsampled, replicate-merged bedGraph files. The bedGraphs for all assays and genomes were downsampled along a range from 5X to 30X mean coverage. We subsequently evaluated the CpG sites covered by each assay and the reproducibility of methylation calls. In bedGraphs downsampled to average 10X CpG coverage, 12-15M (43-54%) CpG sites across the genome are covered at 10X or greater and 20-26M (71-92%) are covered by at least 5X (Figure 3A). This pattern is consistent across libraries and average coverage level. However, the number of sites detected at each cut-off varied between the different assays, with the EMSeq assay capturing the greatest number (range 25.6-26.3M) and the TruSeq assay capturing the lowest number of CpG sites (range 20.3-20.5M) in the 10X downsampled bedGraphs with a minimum cutoff of >=5 reads. Approximately 16M (range 15.9-16.4M) CpG sites were consistently detected by all assays (Figure 3C) and an additional 5M (range 4.6-5.3M) CpG sites were detected in EMSeq, MethylSeq, SPLAT, and TrueMethyl, but not by TruSeq. The numbers were remarkably stable between genomes (Figure S5). The different library types displayed differences in coverage around the transcription start site (TSS), with TrueMethyl showing the most even coverage, lower coverage in EMSeq followed by MethylSeq/SPLAT, whereas TruSeq displayed higher coverage around the TSS, likely due to its bias for GC-rich regions, which coincide with CpG islands around the TSS (Figure 3D). In pairwise comparisons, the CpG-level DNA methylation calls were generally very reproducible (Pearson’s rho 0.87-0.92) and the average deviation from the mean was low (RMSE 0.15-0.17) (Figure 3E).
Each of the genome-wide methylome sequencing assays performed approximately equivalently, with the exception of TruSeq consistently yielding more variable DNA methylation calls than the other methods. The number of CpG sites captured, RMSE, and correlation coefficients for each assay and genome is outlined in Figure S4.
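The two reproducibility metrics reported above can be computed as in the following minimal sketch. It assumes methylation calls are stored per CpG as a (beta, coverage) pair with beta on a 0-1 scale, and that only sites meeting a minimum coverage in both replicates are compared; the function names and data layout are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rmse(x, y):
    """Root-mean-square error between paired methylation estimates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def paired_betas(calls_a, calls_b, min_cov=5):
    """Return paired beta values for sites covered at >= min_cov in both replicates.
    calls_a, calls_b: dict mapping site -> (beta, coverage)."""
    shared = sorted(
        site for site in calls_a.keys() & calls_b.keys()
        if calls_a[site][1] >= min_cov and calls_b[site][1] >= min_cov
    )
    return [calls_a[s][0] for s in shared], [calls_b[s][0] for s in shared]


# Hypothetical calls for two replicates; "cg4" fails the coverage cutoff.
rep1 = {"cg1": (0.90, 12), "cg2": (0.10, 8), "cg3": (0.50, 20), "cg4": (0.70, 3)}
rep2 = {"cg1": (0.80, 10), "cg2": (0.20, 9), "cg3": (0.55, 15), "cg4": (0.10, 30)}
betas1, betas2 = paired_betas(rep1, rep2)
example_r = pearson_r(betas1, betas2)
example_rmse = rmse(betas1, betas2)
```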
Differential Methylation of Family Trios Among Methylome Assays
After downsampling to median 10X coverage, 2,227,395 CpG sites present on chromosome 1 in replicates from all five assays (EMSeq, MethylSeq, SPLAT, TrueMethyl, and TruSeq) were analyzed for differential methylation signal between assays. This analysis was done at the family level (Ashkenazi Trio HG002-HG004 against the Chinese Trio HG005-HG007) to avoid a one-to-one differential analysis. This also included a restriction to sites with 5X coverage in at least two out of three members of each family group, which resulted in small data reductions for EMSeq, MethylSeq, and TrueMethyl (6%, 8%, and 5%, respectively), with greater losses for SPLAT (12%) and TruSeq (27%). Coverage levels after this filtration step were highly correlated among MethylSeq, TrueMethyl, and SPLAT (r ≥ 0.75), while TruSeq and EMSeq were the least correlated assays. The correlation matrix for HG002 samples is seen in Figure S6; these correlations are representative of all members of the family trio.
To assess consistency in sites identified as differentially methylated (DM) by each assay (DMA), we computed the fraction of DMA sites that were uniquely identified by that assay (a pseudo false-positive rate) (Table 2). We also computed the total number of DM sites commonly identified by three or more assays (DM3+), which totaled 0.15% of the common sites. We then determined the percentage of DMA sites that were also DM3+ sites (a measure of specificity), as well as the percentage of DM3+ sites that were also DMA sites (a measure of sensitivity). EMSeq and TrueMethyl produced the smallest numbers of DMA sites among the assays, with the lowest proportions of unique sites (35%) and the highest proportions of DMA sites in DM3+ sites (39%), indicating a good balance between sensitivity and specificity. MethylSeq and SPLAT both had higher numbers of DMA sites, associated with greater rates of unique DM sites (46% and 49%, respectively) but also the highest sensitivity to detect DM3+ sites (75% and 78%, respectively). TruSeq, which was associated with a much larger number of DMA sites than any other assay, had the lowest concordance with the other assays, with only 13% of its DMA sites in DM3+ and 58% of the DM3+ sites among its DMA sites.
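The concordance measures defined above reduce to set operations once each assay's DM sites are available. The following sketch assumes each assay's DMA sites are given as a Python set of site identifiers; the function name, dictionary keys, and toy site sets are hypothetical.

```python
from collections import Counter


def concordance_summary(dm_sites_by_assay, min_assays=3):
    """Summarize agreement among assays.

    dm_sites_by_assay: dict mapping assay name -> set of DM site identifiers.
    Returns the consensus set (sites DM in >= min_assays assays) and, per assay,
    the unique fraction and the two overlap fractions described in the text.
    """
    counts = Counter(
        site for sites in dm_sites_by_assay.values() for site in sites
    )
    consensus = {site for site, n in counts.items() if n >= min_assays}
    summary = {}
    for assay, sites in dm_sites_by_assay.items():
        unique = {s for s in sites if counts[s] == 1}
        overlap = sites & consensus
        summary[assay] = {
            "n_dma": len(sites),
            # pseudo false-positive rate: DM sites seen by this assay only
            "frac_unique": len(unique) / len(sites) if sites else float("nan"),
            # share of this assay's DM sites that are consensus sites
            "frac_dma_in_consensus": len(overlap) / len(sites) if sites else float("nan"),
            # share of consensus sites recovered by this assay
            "frac_consensus_in_dma": len(overlap) / len(consensus) if consensus else float("nan"),
        }
    return consensus, summary


# Toy example with four hypothetical assays and integer site identifiers.
toy = {"A": {1, 2, 3, 4}, "B": {2, 3, 5}, "C": {2, 3, 4, 6, 7}, "D": {3, 8}}
toy_consensus, toy_summary = concordance_summary(toy)
```

In the toy example, sites 2 and 3 form the consensus set; assay A recovers both of them, while half of assay D's sites are unique to it.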
We analyzed the profile of coverage variability for each assay (Figure 4), which illustrated the agreement with other assays for DM sites as a function of coverage, with values ranging between the 5th and 95th percentiles of median coverage across the six samples. For all assays, the analysis shows that agreement declines at higher coverage levels, but this effect is small for EMSeq, MethylSeq, and TrueMethyl. Because SPLAT has a heavy-tailed coverage distribution, the impact is more pronounced, while for TruSeq the coverage distribution is extremely diffuse and there is markedly poor agreement with other assays in its upper coverage percentiles.
Differential Methylation Within Microarray Sites
Of the 82,013 probes mapping to chromosome 1 on the 850k EPIC Illumina methylation array, 81,630 (99.5%) overlapped with sites common to all five assays. Of these, the number of sites identified as differentially methylated by each assay (DMAs) ranged from 189 (TrueMethyl) to 729 (TruSeq). For all assays other than TruSeq, 100% of these DMAs had an estimated percent methylation difference (PMD) of 20% or greater between the family groups, and for TruSeq 725 of the 729 sites met this criterion. To analyze concordance between whole methylome sequencing (WMS) and microarray results, we computed the proportion of these DMAs for which a corresponding difference of at least 20% was observed for the microarrays, with these array PMDs estimated via ANOVA models with random intercepts for each genome. The overall agreement was comparable for four of the five methods with values ranging from 79.3% (MethylSeq) to 83.0% (EMSeq) and no statistically significant differences in proportion (Supplementary Table 2). However, for TruSeq the fraction of DMAs that were matched by the array was only 63.2%, which was significantly lower in comparison to every other assay. Similar results were observed when the results were separated into hypermethylated and hypomethylated sites.
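The array concordance proportion can be sketched as follows. For simplicity, this sketch selects sequencing DM sites by the 20% PMD threshold alone, whereas the study identified them through differential methylation testing, and it additionally requires the array difference to have the same sign, which is an assumption about what "corresponding" means. The function name, probe identifiers, and values are hypothetical.

```python
def array_concordance(seq_pmd, array_pmd, threshold=20.0):
    """Proportion of sequencing DM sites matched by the array.

    seq_pmd, array_pmd: dict mapping site -> signed percent methylation
    difference between the two family groups.
    A site is matched if the array difference is at least `threshold`
    in magnitude and has the same direction as the sequencing difference.
    """
    dm_sites = [
        site for site, diff in seq_pmd.items()
        if abs(diff) >= threshold and site in array_pmd
    ]
    if not dm_sites:
        return float("nan")
    matched = sum(
        1 for site in dm_sites
        if abs(array_pmd[site]) >= threshold
        and (array_pmd[site] > 0) == (seq_pmd[site] > 0)
    )
    return matched / len(dm_sites)


# Hypothetical probes: cg1 agrees, cg2 is below threshold on the array,
# cg3 changes direction, and cg4 is not DM by sequencing.
seq_example = {"cg1": 35.0, "cg2": -28.0, "cg3": 22.0, "cg4": 5.0}
array_example = {"cg1": 30.0, "cg2": -10.0, "cg3": -25.0, "cg4": 40.0}
example_concordance = array_concordance(seq_example, array_example)
```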
ATAC-seq Integration
ATAC-Seq provides information about DNA organization within the nucleus, which can be synthesized alongside methylation data to better understand the mechanisms of epigenetic pathways. Two protocols are routinely used to prepare ATAC-Seq libraries from cells and tissues: the original ATAC-Seq protocol published by Buenrostro et al. [22] and the Omni-ATAC protocol published by Corces et al. [12]. In order to provide a complete epigenomic dataset for the 7 cell lines, we generated ATAC-Seq libraries with both protocols, on the same cell aliquots.
Both ATAC and Omni-ATAC produce similar fragment profiles for all the cell lines (Figure 5a). After mapping to the human genome, the Omni-ATAC protocol provided the most reads to the autosomal regions when compared to ATAC, and the least mitochondrial contamination (Figure 5b). The Omni-ATAC protocol also showed an improvement in enrichment around the TSS of genes compared to the ATAC protocol (Figure 5c). Spearman correlations between libraries for the same protocol, and between protocols, were calculated to provide an assessment of reproducibility. As shown in Figure 5d, the Omni-ATAC shows the best correlation across protocols. To evaluate the impact of the difference in data quantity and quality obtained by both protocols, we performed a differential accessibility analysis between HG002 and HG005 cell lines. The results summarized in Figure S7 suggest that the higher quality of the Omni-ATAC datasets results in more peaks being called as significantly open.
The above analysis was produced with the data generated by paired-end 150 nucleotide sequencing. To determine if ATAC-Seq analysis would benefit from shorter reads (as is more common for ATAC-seq libraries), we repeated the quality control with reads hard trimmed in silico to 3 lengths: 50, 75, and 100 bp for mates of paired-end sequences. The results show that trimming the reads does not have an impact on the quality metrics obtained (Figure 5e), annotation to genomic regions (Figure 5f), or the proportion of reads mapping to the mitochondrial genome. Overall, both libraries are minimally impacted by experimental read length, and the Omni-ATAC protocol generates libraries with more reproducible replicates, which can improve the overall results obtained in downstream analysis.
Multi-omic data integration is becoming an essential component of epigenomics studies. Using the data generated for HG001, the mean methylation at CpG sites (across all the methylomic libraries) as a function of chromatin accessibility measured by Omni-ATAC-Seq (open/closed) was plotted by genomic region. A genomic location was considered “closed” if it was not called as an accessible peak when analyzing the Omni-ATAC-Seq data. As shown in Figure 5g, there is an overall increase in mean methylation across gene features starting from 5’ Regulatory/5’UTR to 3’ Downstream 5k region. It is in the 5’ region (Regulatory and 5’UTR) that we see the widest difference in mean methylation between the two chromatin conformations, with “open” chromatin showing the lowest methylation level. This lower mean methylation in the “open” chromatin was still observed for the first exon, but the difference is much smaller. First introns showed no difference in mean methylation between the chromatin states. The highest mean methylation was observed for exons and introns (i.e., other than the first), with very little difference between them. Interestingly, mean methylation becomes slightly higher in “open” chromatin compared to “closed” chromatin in the introns and exons, and remains as such in the 3’UTR. Finally, integrating transcriptomic data from publicly accessible RNA-seq data for HG001 (SRA run identifier SRR1153470) shows concordance between methylation state, chromatin accessibility, and gene expression (Figure S8).
Microarray Normalization and Site Filtering
Each cell line had 3-6 biological or technical replicates with microarray data from the Illumina MethylationEPIC Beadchip (850k array) generated from up to 3 labs. These replicates were used to assess different microarray normalization pipelines. We implemented 26 normalization pipelines with different combinations of between-array and within-array normalization methods. The between-array normalization methods evaluated were no normalization (None), quantile normalization (pQuantile) [23], functional normalization (funnorm) [24], ENmix [25], dasen [26], SeSAMe [27], and Gaussian Mixture Quantile Normalization (GMQN) [28]. The within-array normalization methods evaluated were no normalization (None), Subset-quantile Within Array Normalisation (SWAN) [29], peak-based correction (PBC) [30], and Regression on Correlated Probes (RCP) [31]. All combinations were implemented with the exception of pQuantile + SWAN and SeSAMe + SWAN, which were not possible due to incompatible R object types.
We first performed principal component analysis (PCA) and visually inspected the first two principal components (PCs) for each normalization pipeline. Generally, samples from the same cell line clustered together more tightly after normalization, although a few pipelines (PBC alone, GMQN alone, GMQN + PBC) did not show obvious improvement in replicate clustering (Figure S9). Most pipelines failed to clearly distinguish samples from cell lines HG005 and HG006, the Han Chinese father/son pair, from one another.
A variance partition analysis was used to compute the percentage of methylation variance explained by cell line at each CpG site in each normalized dataset. Funnorm + RCP had the highest median across the epigenome (90.4%), although many pipelines had medians in the 85-90% range (Figure 6a). SeSAMe and RCP performed well (median >85%) no matter which methods they were combined with. While using RCP or SWAN usually improved performance compared to having no within-array normalization, using PBC for within-array normalization always reduced the median variance explained by cell line. For all downstream analyses, we used the funnorm + RCP normalized microarray data because this pipeline had the highest median variance explained by cell line. Figure 6a shows the full distribution of variance explained by cell line across the epigenome for each normalization pipeline. Most pipelines had a bimodal distribution, meaning CpG sites typically had almost no variation explained by cell line or nearly 100% of variation explained by cell line.
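The idea of "variance explained by cell line" at a single CpG site can be illustrated with a simplified one-way decomposition, in which the between-cell-line sum of squares is divided by the total sum of squares. This is a stand-in for illustration only: the variance partition analysis used in the study may rely on a different model (for example, a mixed model), and the function name and beta values below are hypothetical.

```python
def fraction_variance_explained(values_by_group):
    """Between-group sum of squares divided by total sum of squares.

    values_by_group: dict mapping cell line -> list of replicate beta values
    at a single CpG site. Returns 0.0 when the site shows no variation.
    """
    all_values = [v for values in values_by_group.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    if ss_total == 0:
        return 0.0
    ss_between = sum(
        len(values) * (sum(values) / len(values) - grand_mean) ** 2
        for values in values_by_group.values()
    )
    return ss_between / ss_total


# Replicates agree within each cell line and differ between cell lines.
site_reproducible = {"HG001": [0.10, 0.10], "HG002": [0.90, 0.90]}
# Replicates disagree within each cell line; cell line means are identical.
site_noisy = {"HG001": [0.10, 0.90], "HG002": [0.10, 0.90]}

explained_reproducible = fraction_variance_explained(site_reproducible)
explained_noisy = fraction_variance_explained(site_noisy)
```

The two toy sites correspond to the two modes of the bimodal distribution described above: one with nearly all variance attributable to cell line, and one with almost none.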
In light of previous work that has shown that microarray data is not reliable for sites with low population variation [32], we investigated whether sites with poor concordance between replicates (% variance explained near 0) overlapped with low-varying sites. We used the 59 SNP probes on the Illumina EPIC array to compute a data-driven threshold for categorizing sites as low varying (Figure 6b-d, see Methods for details). We found that nearly all CpG sites in the normalized (funnorm + RCP) microarray data with poor concordance between replicates met our definition of low-varying sites (Figure 6e). When we compared the microarray beta values to the sequencing-based beta values for all 3 HG002 microarray replicates (Figure S11, Figure S12, Figure S13), we observed that these low-varying sites tended to have more extreme methylation values according to at least one platform, and there were many sites with large discrepancies (>20%) between methylation estimates from different platforms. This suggests that our data-driven definition of low-varying CpG sites, which can be applied to any Illumina 450k or 850k array dataset, may be useful for filtering out less reliable CpG sites before analysis.
Microarray Versus Sequencing Comparison
We performed 5 additional variance partition analyses, adding samples from one sequencing platform (EMSeq, MethylSeq, SPLAT, TrueMethyl, or TruSeq) at a time, to evaluate the concordance between microarray and sequencing data. Because each cell line had 3-6 microarray replicates and only one (merged replicate) sequencing sample, these results are largely driven by the microarray data and the values may be biased upward by this. However, these models are a useful way to compare agreement between sequencing and microarray data across sequencing platforms, where a higher percentage of variance explained by cell line in one platform compared to another indicates better agreement with the microarray data.
For low-varying microarray sites, cross-platform agreement was low for all sequencing platforms (Figure S10a). This was expected, because we observed poor concordance between microarray replicates at these sites as well. For a small number of these low-varying sites, nearly 100% of the variation in methylation was explained by platform, indicating that there were some technical artifacts introduced by platform, but these technical artifacts were not widespread across the epigenome (Figure S10c).
For high-varying microarray sites, most of the variability across the epigenome was explained by cell line rather than platform, indicating good cross-platform concordance (Figure S10b,d). MethylSeq was most concordant with the microarray data, followed by SPLAT and EMSeq, which were comparable to one another, then TruSeq and finally TrueMethyl. Visual inspection of the microarray beta values compared to the sequencing beta values for 3 HG002 microarray replicates (Figure S11, Figure S12, Figure S13) shows much more noise in the TruSeq and TrueMethyl comparisons.
Discussion
The EpiQC study provides a comprehensive resource for epigenetic research, using human cell lines already established as reference materials to advance genomics research from the Genome in a Bottle consortium. In addition to providing an epigenetic data layer to existing genomic references, we sought to generate datasets for a broad range of methylome sequencing assays, including whole genome bisulfite sequencing (WGBS) and enzymatic deamination (EMSeq). We also provided data from targeted approaches, including chromatin accessibility datasets (ATAC-Seq) from two protocols common to the field of epigenetics, EPIC Methyl Capture for a subset of genomic CpGs, and the Illumina 850k array. Finally, we provide sequence and epigenetic data for Oxford PromethION, an emerging third generation long read instrument.
While most of the published and/or commercialized assays have been tested with some standard sample (e.g. GM12878), the sample used to benchmark each assay was drawn from different DNA aliquots, extracted from cells grown at different passages, and potentially grown in different media. Here, aliquots of the same gDNA were distributed across multiple laboratories, and used for all data generated. To remove additional variability, all libraries were sequenced on one instrument (then a second time all on one instrument), across multiple NovaSeq 6000 flow cells. For whole methylome sequencing, libraries were produced in duplicate, and triplicates were generated for the ATAC-Seq protocols. In total, we are sharing with the scientific community over 7 Tb of epigenetic data.
Benchmarking whole methylome sequencing technologies is important for determining which technology and method will achieve the best performance, and to provide recommendations and standards for future comprehensive methylomic studies. Large projects such as the NIH Roadmap Epigenomics Project [33] and the International Human Epigenome Consortium [34] have produced, compiled and analyzed a vast amount of WGBS data comprising tissues and cell lines from normal and neoplastic tissues. These data continue to provide an invaluable resource for the epigenetics research community and have helped broaden our understanding of the various roles that epigenetics plays in health and disease. However, new methods are constantly being developed that address and circumvent issues with traditional approaches in terms of DNA input, resolution, and cost. Third-generation sequencing approaches are also rapidly advancing and are emerging as a complementary method to the gold standard bisulfite conversion methods. Our study encompassed the most up-to-date range of assays for measuring whole-genome DNA methylation. We were able to incorporate sample preparation protocols using the gold standard bisulfite conversion (Swift Accel-NGS Methyl-Seq, TrueMethyl-Seq, EPIC Methyl Capture and 850k array, and SPLAT), a new method utilizing enzymatic deamination (EM-Seq), and Oxford Nanopore sequencing. With the use of 7 different cell lines, this is to our knowledge the most extensive examination of DNA methylation analysis methods on the most extensive set of samples.
Cost is an important parameter when deciding which library preparation method to use. Libraries with longer inserts benefit from less adapter contamination and fewer overlapping reads, which increases coverage efficiency, especially when employing cost-effective sequencing on the Illumina HiSeq or NovaSeq systems with paired-end 150 bp reads. In this study, this sequencing scheme resulted in a highly variable depth of coverage per library preparation. While imbalanced pools may account for some of the difference, library preparation methods had the biggest impact. Except for TruSeq, all library preparations start with shearing of the gDNA. For the other bisulfite-dependent protocols, the DNA fragments range between 200 and 400 bp, whereas EM-Seq allows for longer fragments (~550 bp). TruSeq libraries tend to have short (130 bp) insert sizes and are therefore more suitable for 75 bp paired-end read lengths. Despite the imbalance of coverage, this study provides robust recommendations for downsampling across sequencing types, showing both how different downsampling schemes (i.e. at the BAM level or at the methylation bedGraph level) are comparable, and how downsampled datasets can be directly compared to one another to assess the performance of the assays themselves.
The methods that have proven to have greater genome-wide evenness of coverage, namely Accel-NGS MethylSeq [35], SPLAT [36], and TrueMethyl [37], tend to have longer insert sizes (200-300 bp), fewer PCR duplicates (down to a few percent, depending on sequencing platform), and high mapping efficiencies (>75%). The SPLAT libraries herein had shorter insert sizes than desired due to the use of 400 bp Covaris shearing prior to library preparation. To achieve insert sizes of at least 300 bp, the SPLAT authors now recommend using DNA fragmented to 500-600 bp as input and performing final library purification at a 0.8x AMPure ratio to remove shorter fragments. The same recommendation would work for the MethylSeq and TrueMethyl protocols. SPLAT is the only method in our evaluation that is not commercial/kit-based and could be comparatively 10x cheaper per library [36]. This can be important when considering the sample preparation cost alongside sequencing costs.
Another important parameter is the amount of data retained from a WGBS experiment following adapter and quality trimming, mapping, and deduplication. Here, we show the effects of each mapping step on each methylome assay, and how reads are filtered along each step, including the estimated number of reads required to achieve a certain mean coverage per CpG. Similarly, previous studies (e.g. Miura et al., 2016 and Zhou et al., 2019) have implemented a metric to estimate the efficiency of WGBS genome coverage by determining the raw library size (number of PE 150 bp reads prior to filtering) required to achieve at least 30x coverage of 50% or more of the genome. According to these studies, this corresponded to 500M for Accel-NGS, 900M for TruSeq DNA methylation, and 1000M for the QIAGEN QIAseq Methyl Library Kit [35]. Standardization and adoption of such a metric in future studies would make it significantly easier to compare and contrast results from different methods.
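The relationship between raw library size and achieved coverage can be illustrated with simple arithmetic. The sketch below is a simplification under stated assumptions: it targets mean coverage rather than the fraction of the genome reaching a depth threshold (which would require the empirical coverage distribution), and the function name and the retained fraction are hypothetical inputs, not values from this study.

```python
import math


def read_pairs_for_mean_coverage(target_coverage, genome_size_bp,
                                 read_length_bp, retained_fraction):
    """Estimate the raw read pairs needed to reach a mean coverage.

    retained_fraction is the share of raw sequenced bases that survive
    trimming, mapping, and deduplication (hypothetical input, 0 < f <= 1).
    """
    if not 0 < retained_fraction <= 1:
        raise ValueError("retained_fraction must be in (0, 1]")
    # Total aligned bases required across the genome
    bases_needed = target_coverage * genome_size_bp
    # Usable bases contributed by one raw read pair after filtering
    usable_bases_per_pair = 2 * read_length_bp * retained_fraction
    return math.ceil(bases_needed / usable_bases_per_pair)
```

Halving the retained fraction doubles the required raw library size, which is why reporting per-step read losses makes library preparation methods easier to compare.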
NEB’s EM-Seq protocol [38] compares favorably to the bisulfite sequencing-based approaches analyzed herein. In almost all comparisons, EM-Seq libraries capture more CpG sites at equal or better coverage. A “conventional” pre-enzymatic conversion library preparation approach is recommended in the EM-Seq protocol (NEB), as the cytosine bases in the adapter sequences are methylated and thus preserved during the enzymatic APOBEC treatment. However, for some studies using low- or poor-quality DNA samples, such as those from FFPE or liquid biopsies that are composed of a mix of ssDNA and dsDNA molecules, the EM-seq approach in combination with library preparation methods such as SPLAT or Accel-NGS MethylSeq, which are capable of capturing both ssDNA and dsDNA, may prove to be beneficial for creating higher quality libraries.
Beyond library preparation, the choice of algorithmic tools has an impact on the performance of each methylome assay. Asymmetrical C-T distributions between DNA strands and reduced sequence complexity make epigenetic sequence alignment different from regular DNA processing. Computational time, alignment efficiency, and accuracy are the main factors for choosing an aligner, all of which are affected by these properties. We observed a general trade-off between computational time and both alignment efficiency and accuracy for all aligners, with bwa-meth providing the optimal balance of high accuracy and efficiency.
Choice of computational algorithms is equally important in analyzing methylation microarray data. In this study, we compared 26 different normalization pipelines. Many algorithms (SWAN, RCP, pQuantile, dasen, funnorm, ENmix, and SeSAMe) generally performed well in this dataset, clustering replicates from the same cell line (across different labs) together while preserving differences between cell lines, but all pipelines performed poorly at sites with low population variance, confirming previous work [32]. We proposed using the 59 SNPs on the 850k array to calculate a data-driven threshold for classifying low-varying sites. Using our threshold, which can be calculated in any Illumina microarray dataset with or without technical replicates, we observed that low-varying sites had poor concordance across replicates from the same cell line, tended to have extreme (near 0% or 100%) methylation values, and showed poor agreement with sequencing data regardless of sequencing platform. This suggests that low-varying sites are not well captured by microarrays and should be filtered out before analysis. It is very possible that the issue of unreliable data at low-varying sites is not specific to microarrays, but we were not able to address this question in the sequencing data because of the limited number of replicates, which were ultimately merged for analysis.
One final caveat herein is the use of high quality DNA from cell lines. Using this highly controlled input, the methods examined within this study produced mostly comparable data. However, the performance of each kit may be more variable on less optimal input DNA (lower input, more highly fragmented, etc.) that mirrors real clinical samples more closely. The optimal data herein could serve as a launch point for future studies of more realistic inputs.
Methods
Library preparation
Illumina TruSeq DNA Methylation (TruSeq)
100 ng of genomic DNA was bisulfite converted using EZ DNA Methylation-Gold Kit (Zymo Research). Sequencing libraries were prepared according to the manufacturer’s protocol (Illumina). The libraries were amplified with 10 PCR cycles using the FailSafe PCR enzyme (Illumina/Epicentre).
SPlinted Ligation Adapter Tagging (SPLAT)
100 ng gDNA was fragmented to 400 bp (Covaris). Bisulfite conversion was performed using the EZ DNA Methylation-Gold kit (Zymo Research). SPLAT libraries were constructed as described previously (Raine et al., 2017). The libraries were amplified with 4 PCR cycles using KAPA HiFi Uracil+ PCR enzyme (Roche).
Illumina EPIC Capture
500 ng of genomic DNA was prepared according to the manufacturer’s protocol (Illumina). Pools of 3 and 4 libraries were amplified using KAPA Uracil+ HiFi enzyme (Roche).
Swift Biosciences Accel-NGS Methyl-Seq (MethylSeq)
100 ng of genomic DNA was spiked in with 1% unmethylated Lambda gDNA, and fragmented to 350 bp (Covaris). Bisulfite conversion was performed using EZ DNA Methylation-Gold kit (Zymo Research). Libraries were prepared according to manufacturer’s instructions (Swift), using dual-indexing primers. A total of 6 rounds of amplification were performed using the Enzyme R3 provided with the kit.
NuGEN TrueMethyl oxBS-Seq (TrueMethyl)
200 ng of genomic DNA was spiked with 1% unmethylated Lambda gDNA and fragmented to 400 bp (Covaris). Fragmented DNA was processed for end-repair, A-tailing, and ligation using NEB’s methylated hairpin adapter. Ligation was performed at 16 °C overnight in a thermocycler. The USER enzyme reaction was performed the next morning, according to the manufacturer’s protocol, before Ampure XP bead cleanup of the ligated DNA. Each sample was then split into 2 aliquots to perform oxidation + bisulfite conversion or mock (water) + bisulfite conversion according to the NuGen OxBS module instructions (Tecan/NuGen). PCR amplification was performed using NEB’s dual-indexing primers and KAPA Uracil+ HiFi enzyme for a total of 10 cycles.
Enzymatic Methyl-Seq (EMSeq)
100, 50 and 10 ng of genomic DNA spiked in with 2 ng unmethylated lambda and 0.1 ng CpG methylated pUC19 was fragmented to 500 bp (Covaris S2, 200 cycles per burst, 10% duty-cycle, intensity of 5 and treatment time of 50 seconds). EM-seq libraries were prepared using the NEBNext Enzymatic Methyl-seq (E7120, NEB) kit following manufacturer’s instructions. Final libraries were amplified with the included NEBNext Q5U polymerase using 4 cycles for 100 ng, 5 cycles for 50 ng and 7 cycles for 10 ng inputs.
MeDIP and hMeDIP-Seq
MeDIP-seq and hMeDIP-Seq were performed on all biological triplicates after DNA isolation, according to the protocol of Taiwo et al. [39], with minor adjustments. For DNA fragmentation to a size of 200 bp, 300 ng of isolated DNA was sonicated on the Bioruptor (Diagenode) using instrument settings of 15 cycles, each consisting of 30 seconds on/off periods. After fragmentation, the genomic DNA size range was assessed using an Agilent 2100 Bioanalyzer and high-sensitivity DNA chips (Agilent Technologies), according to the manufacturer’s instructions. Libraries were prepared using 300 ng of fragmented DNA (~200 bp) and the NEBNext Ultra DNA Library Prep Kit for Illumina (NEB), according to the manufacturer’s protocol. The purified adaptor-ligated DNAs were used for Methylated DNA ImmunoPrecipitation (MeDIP), according to the manufacturer’s instructions for the MagMeDIP kit (Diagenode) and IPure kit (Diagenode).
PCR was used to amplify the MeDIP/hMeDIP adaptor-ligated DNA fragments. In brief, 25 μL NEBNext High Fidelity 2x PCR Master mix (NEB), 1 μL of Index primer (NEB) that was used as a barcode for each sample, and 1 μL of Universal PCR primer (NEB) were added to 23 μL of the MeDIP adaptor-ligated DNA fragments. PCR was performed using the following temperature profile: 98 °C for 30 s, 15 cycles of 98 °C for 10 s, 65 °C for 30 s, and 72 °C for 30 s, followed by 5 minutes at 72 °C and a hold at 4 °C, as described before. Thereafter, PCR-amplified DNAs (libraries) were cleaned following the Cleanup of PCR Amplification step of the NEBNext Ultra DNA Library Prep Kit for Illumina (NEB). Fragmented DNA size and quality were checked using the Agilent 2200 TapeStation and High Sensitivity D5000 Screen Tape. In addition, generated libraries were size-selected on a 6% TBE gel; fragments of 250-500 bp were excised and the Illumina Truseq Purify cDNA construct was used to extract and purify the DNA libraries. Libraries were quantified on a Qubit fluorimeter (Invitrogen) using the Qubit dsDNA HS Assay kit (Invitrogen) and quality-checked using the Agilent 2200 TapeStation and High Sensitivity D5000 Screen Tape. All kits and chips were used according to the manufacturer’s protocol.
Illumina Infinium MethylationEPIC BeadChip (850k array)
Bisulfite conversion was performed using the EZ DNA Methylation Kit (Zymo Research) with 250 ng of DNA per sample. The bisulfite-converted DNA was eluted in 15 μl according to the manufacturer’s protocol, evaporated to a volume of <4 μl, and used for methylation analysis on the 850k array according to the manufacturer’s protocol (Illumina).
Microarray experiments were run at three different labs, two of which included technical replicates. The resulting dataset consisted of 30 samples, with each of the 7 cell lines having between 3 and 6 replicates (both biological and technical). For all cell lines (HG001-HG007), 2 technical replicates were generated at lab 1 and 1 biological replicate was generated at lab 2. Additionally, 3 technical replicates were generated for the Han Chinese family trio cell lines (HG005-HG007) at lab 3.
Preparation of ATAC-Seq libraries
ATAC vs Omni-ATAC protocols: cryopreserved cells were thawed, counted, and split into 2 aliquots for processing in parallel according to each protocol. Library quality control was assessed with Qubit and TapeStation HS D1000.
LC-MS/MS quantification of 5mC and 5hmC
Genomic DNA from HG001-007 cell lines was used for the analysis. Samples were digested into nucleosides using Nucleoside digestion mix (M0649S, New England Biolabs) following the manufacturer’s protocol. Briefly, 200 ng of each sample was digested in a total volume of 20 μl using 1 μl of the digestion mix. Samples were incubated at 37°C for 2 hours.
LC-MS/MS analysis was performed using two biological duplicates and two technical duplicates by injecting digested DNA on an Agilent 1290 UHPLC equipped with a G4212A diode array detector and a 6490A Triple Quadrupole Mass Detector operating in the positive electrospray ionization mode (+ESI). UHPLC was performed on a Waters XSelect HSS T3 XP column (2.1 × 100 mm, 2.5 μm) using a gradient mobile phase consisting of 10 mM aqueous ammonium formate (pH 4.4) and methanol. Dynamic multiple reaction monitoring (DMRM) mode was employed for the acquisition of MS data. Each nucleoside was identified in the extracted chromatogram associated with its specific MS/MS transition: dC [M+H]+ at m/z 228-112, 5mC [M+H]+ at m/z 242-126, and 5hmC [M+H]+ at m/z 258-142. External calibration curves with known amounts of the nucleosides were used to calculate their ratios within the analyzed samples.
Sequencing
NEB Sequencing
An Illumina NovaSeq 6000 was used for sequencing. Dual-unique index pools were constructed from libraries made at multiple sites after quantification using an Agilent Bioanalyzer. To maximize usable reads, 5mC converted libraries were sequenced in pools containing unconverted libraries instead of PhiX. Pools were loaded at ~250 pM for pools with length < 500 bp (paired-end 2×100) or ~300 pM for longer-insert pools (paired-end 2×150). In some cases dual-unique balancing libraries were not available. These were sequenced in combination with the dual-unique libraries and demultiplexed using the expected index 2 sequence derived from the universal adapter. When too many libraries used the same indices, we employed an Illumina XP manifold system to sequence in 4 distinct pools. Basecalling occurred on the NovaSeq using RTA v3.4.4x. Demultiplexing and fastq generation were performed using Picard 2.20.6 with default settings except as listed below:
picard ExtractIlluminaBarcodes MAX_NO_CALLS=0 MIN_MISMATCH_DELTA=2 MAX_MISMATCHES=2
picard IlluminaBasecallsToFastq \
read_structure=100T8B8B100T RUN_BARCODE=A00336 \
LANE=<lane> FIRST_TILE=<tile> TILE_LIMIT=1 \
MACHINE_NAME=<instrument> FLOWCELL_BARCODE=<flowcell>
Illumina Sequencing
Aliquots of stock DNA were sent to Illumina in order to improve the depth of sequencing for WGBS libraries. Libraries were pooled and diluted to 1.5 nM (final loading concentration of 300 pM on the flow cell), then sequenced on Illumina NovaSeq S4 flow cells with direct flow cell loading (Xp workflow) according to the manufacturer’s instructions. MethylSeq and SPLAT libraries were multiplexed on two lanes; SPLAT libraries were run on their own in the third lane; and TrueMethyl libraries were run on their own in the fourth lane. Run data were uploaded to BaseSpace and fastq files were generated using default parameters.
Alignment
Quality Control
FastQC was used to evaluate the quality of sequencing data, including base qualities, GC content, adapter content, and overrepresentation analysis. Adapters were trimmed using Trim Galore (https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/).
Mapping
Sequencing replicates were mapped against a modified build of the human reference genome (build GRCh38) which included additional contigs representing bisulfite controls spiked within the pooled libraries, including lambda, T4, and Xp12 phages and pUC19 plasmid. Alignment to the genome was performed with Bismark (v0.22.1), BitMapperBS (v1.0.2.2), BWA-METH (v0.2.1), and gemBS (v3.2.0). BS-Seeker3 and BRAT-nova were excluded after failing to build an index of the reference genome and producing repeated memory errors. Alignments were run using default parameters for each software.
For the time comparison analysis, we subsampled a random set of one million read pairs per library, using the same random seed for each. Each pipeline was run on the subsampled inputs a total of 10 times. All experiments were performed using a 24 CPU-threaded server running Ubuntu 16.04, and the performance of each replicate was timed (see Supplementary Table 1). Post-alignment statistics were generated using samtools stats and Qualimap. Alignment files generated from the four pipelines were fed into MethylDackel for methylation bias (mBias) analysis and methylation calling, using the suggested trimming parameters from the mBias analysis for each replicate.
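Seeded subsampling of read pairs, as used for the timing comparison, can be sketched with reservoir sampling. This is an illustration only: the tool actually used for subsampling is not stated here, and the function names below are hypothetical. Fixing the seed makes the selected pairs reproducible across runs, and sampling pairs jointly keeps mates together.

```python
import random
from itertools import islice


def read_fastq(handle):
    """Yield 4-line FASTQ records as tuples of stripped lines.

    `handle` must be an iterator over lines, such as an open file.
    """
    while True:
        record = tuple(line.rstrip("\n") for line in islice(handle, 4))
        if len(record) < 4:
            return
        yield record


def subsample_pairs(r1_records, r2_records, n_pairs, seed):
    """Reservoir-sample up to n_pairs read pairs, keeping mates together."""
    rng = random.Random(seed)
    reservoir = []
    for i, pair in enumerate(zip(r1_records, r2_records)):
        if i < n_pairs:
            reservoir.append(pair)
        else:
            # Replace an existing entry with probability n_pairs / (i + 1)
            j = rng.randint(0, i)
            if j < n_pairs:
                reservoir[j] = pair
    return reservoir
```

Reservoir sampling needs only a single pass over the input and memory proportional to the sample size, which suits FASTQ files too large to load at once.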
CpG Characterization
We examined the number of common CpG sites for all possible combinations of the four aligners using bedtools intersect (https://github.com/arq5x/bedtools2). The intersection attributes of CpG methylation estimates from each aligner were visualized with Intervene (https://github.com/asntech/intervene). Pairwise Spearman correlation was calculated to evaluate the concordance of CpG methylation calls from the four aligners.
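The concordance calculation for one pair of aligners can be sketched in plain Python. This is a minimal illustration, not the pipeline used in the study (which relied on bedtools and Intervene for the intersections): the input layout, a dictionary keyed by (chromosome, position), and all function names are assumptions. Spearman correlation is computed as the Pearson correlation of average ranks.

```python
def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_on_common_sites(calls_a, calls_b):
    """Return (number of shared CpG sites, Spearman rho on those sites).

    calls_a and calls_b map (chrom, pos) -> methylation fraction.
    """
    common = sorted(set(calls_a) & set(calls_b))
    xs = [calls_a[site] for site in common]
    ys = [calls_b[site] for site in common]
    return len(common), pearson(average_ranks(xs), average_ranks(ys))
```

Restricting the correlation to shared sites separates two questions: how many CpGs each aligner recovers, and how well the aligners agree where both make a call.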
We further evaluated the performance of the four methods by comparing the distribution of annotations, including 3’ UTR, 5’ UTR, Exon, Intergenic, Intron, Non-coding, Promoter-TSS, TSS, and unknown regions. Additionally, to explore each aligner’s effect on methylation level in relation to the TSS, we profiled the DNA methylation level at each CpG site within ±5 kb of the gene’s TSS.
Downsampling
The bedGraph files generated by the BWA-meth aligner (see results for rationale to proceed with BWA-meth calls for secondary analyses) for each technical replicate were combined by summing the methylated and unmethylated counts per CpG site by chromosome. Next, the strands were merged in order to produce one value per CpG dinucleotide using MethylDackel mergeContext. The resulting replicate-CpG-merged bedGraphs were downsampled using https://github.com/nebiolabs/methylation_tools/downsample_methylKit.py, which keeps a fraction of the counts corresponding to the desired downsampling depth.
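The replicate merging and count-level downsampling described above can be sketched as follows. This is not the referenced downsample_methylKit.py script, whose internals are not described here: the binomial thinning approach (keeping each observed read independently with a fixed probability), the data layout, and the function names are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def sum_replicates(replicates):
    """Sum (methylated, unmethylated) counts per CpG across replicates.

    Each replicate maps (chrom, start) -> (n_meth, n_unmeth).
    """
    totals = defaultdict(lambda: [0, 0])
    for rep in replicates:
        for site, (n_meth, n_unmeth) in rep.items():
            totals[site][0] += n_meth
            totals[site][1] += n_unmeth
    return {site: tuple(counts) for site, counts in totals.items()}


def downsample_counts(sites, fraction, seed=0):
    """Keep each observed read independently with probability `fraction`."""
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be in [0, 1]")
    rng = random.Random(seed)
    out = {}
    for site in sorted(sites):  # fixed order keeps results reproducible
        n_meth, n_unmeth = sites[site]
        kept_meth = sum(rng.random() < fraction for _ in range(n_meth))
        kept_unmeth = sum(rng.random() < fraction for _ in range(n_unmeth))
        if kept_meth + kept_unmeth > 0:  # drop sites left with no coverage
            out[site] = (kept_meth, kept_unmeth)
    return out
```

Thinning methylated and unmethylated counts with the same probability preserves the expected methylation fraction at each site while reducing depth.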
To compare downsampling of mapped reads (BAM files) and bedGraph files, the BAM files from all replicates representing EMSeq HG006 (Lab 1) and MethylSeq HG004 (Lab 1) were respectively merged using samtools merge. The merged BAMs were then downsampled using samtools view with the -s parameter, calculating the fraction of reads necessary to achieve the desired mean coverage per BAM. Methylation was called on these BAM files using the same methodology as above. The strands were merged by CpG dinucleotide using MethylDackel mergeContext, creating one methylation call per CpG site. The procedure is outlined in the Supplementary Information (Figures S2A and S3A).
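The fraction passed to samtools view -s follows from the ratio of the desired to the current mean coverage. The helper below is a hypothetical convenience function, assuming coverage scales linearly with the number of reads kept; it formats the argument in the SEED.FRACTION form that samtools view -s accepts, where the integer part is the random seed and the decimal part is the fraction of reads to keep.

```python
def samtools_subsample_arg(current_mean_coverage, target_mean_coverage, seed=42):
    """Return the SEED.FRACTION string for `samtools view -s`."""
    if current_mean_coverage <= 0 or target_mean_coverage <= 0:
        raise ValueError("coverages must be positive")
    # Assumes mean coverage is proportional to the number of reads kept
    fraction = round(target_mean_coverage / current_mean_coverage, 4)
    if not 0 < fraction < 1:
        raise ValueError("target coverage must be below current coverage")
    decimals = f"{fraction:.4f}".split(".")[1]
    return f"{seed}.{decimals}"
```

For example, reducing a 40x BAM to 10x with seed 42 yields the argument 42.2500.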
Differential Methylation Analysis
Differential methylation between the two family groups (HG002-HG004 vs HG005-HG007) was assessed at each site on chromosome 1 for which at least two samples per group were covered by 5 or more reads. Following aggregation of replicates, strand merging, and downsampling to median 10X coverage, analysis was independently conducted via logistic regression for each of five platforms (MethylSeq, EMSeq, TruSeq, SPLAT, and TrueMethyl bisulfite replicates) using the standard “glm” function in R. p-values were adjusted using the Benjamini-Hochberg correction and adjusted values < 0.05 were considered statistically significant. Comparisons among platforms considered only sites that were present in all datasets.
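The per-site test and multiple-testing correction can be sketched in Python for readers who want to see the arithmetic. This is an illustration under stated assumptions, not the R code used in the study: it tests the group effect in a binomial logistic model with a likelihood-ratio test (the summary of an R glm fit reports a Wald test instead), it pools read counts within each family group, and the function names are hypothetical.

```python
import math


def binomial_loglik(n_meth, n_total, p):
    """Binomial log-likelihood with the constant term dropped."""
    ll = 0.0
    if n_meth > 0:
        ll += n_meth * math.log(p)
    if n_total - n_meth > 0:
        ll += (n_total - n_meth) * math.log(1 - p)
    return ll


def lrt_pvalue(meth_a, total_a, meth_b, total_b):
    """Likelihood-ratio test for a two-group effect on methylation."""
    if total_a <= 0 or total_b <= 0:
        raise ValueError("both groups need coverage")
    p_a = meth_a / total_a  # MLE under the full model, group A
    p_b = meth_b / total_b  # MLE under the full model, group B
    p_0 = (meth_a + meth_b) / (total_a + total_b)  # MLE under the null
    ll_full = (binomial_loglik(meth_a, total_a, p_a)
               + binomial_loglik(meth_b, total_b, p_b))
    ll_null = (binomial_loglik(meth_a, total_a, p_0)
               + binomial_loglik(meth_b, total_b, p_0))
    stat = max(0.0, 2 * (ll_full - ll_null))
    # Survival function of a chi-square distribution with 1 degree of freedom
    return math.erfc(math.sqrt(stat / 2))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

Sites whose adjusted value falls below 0.05 would then be called differentially methylated, matching the threshold stated above.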
ATACseq Processing
Pre-Processing
Trim Galore was used both to remove adapters and, for the purpose of the read length titration experiment, to hard-trim reads to fixed lengths (50bp, 75bp and 100bp) starting from the five-prime-end. The NextSeq quality trimming option was set to 20. The hard-trimmed reads were then processed with the pigx-chipseq pipeline for preprocessing, peak calling and reporting for ChIP and ATAC sequencing experiments (https://github.com/BIMSBbioinfo/pigx_chipseq, v0.0.41).
Alignment
Briefly, reads were aligned to the human reference genome (build GRCh38) using bowtie2 (v2.3.4.3) with maximum fragment length for valid paired-end alignments extended to 2000 bp. Alignments were subsequently filtered via samtools (v1.9) removing mappings with mapping quality below 10 and discarding duplicate alignments.
Peak Calling
MACS2 (v2.1.1.20160309) was used to call peaks on the filtered alignments with automatic duplicate removal enabled (--keep-dup ‘auto’), input format specified as paired-end BAM (--format ‘BAMPE’), shifting model-building disabled (--nomodel), effective genome size set to human (--gsize ‘hs’), and peaks with FDR above 0.05 discarded (-q 0.05).
Oxidative Bisulfite Analysis
TrueMethyl Libraries
Data quality was assessed with FastQC. Adapters were trimmed using Trim Galore. Reads were aligned to the hg38 genome using Bismark/Bowtie2. CpG methylation data were extracted using MethylDackel, in destranded format, keeping sites covered by at least 5 reads. These data were loaded into the R/Bioconductor bsseq package [40]. CpG sites common to all replicates were obtained, and the M (counts for methylated C) and Cov (total count) matrices were extracted and used to generate the matrices required for the MLML2R package [41] to estimate the levels of 5mC, 5hmC, and C from the beta values. The resulting estimates were used to create bed files for further comparison with corresponding MeDIP/hMeDIP-Seq data.
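The logic behind combining bisulfite and oxidative bisulfite signals can be illustrated with a naive subtraction estimator. This sketch is not the maximum-likelihood procedure implemented in MLML2R; it only shows the underlying relationship (BS reports 5mC + 5hmC, oxBS reports 5mC alone) with simple clipping, and the function name is hypothetical.

```python
def naive_cytosine_states(beta_bs, beta_ox):
    """Estimate (5mC, 5hmC, C) proportions at one CpG from beta values.

    beta_bs: methylation fraction from bisulfite only (5mC + 5hmC)
    beta_ox: methylation fraction from oxidative bisulfite (5mC only)
    """
    for beta in (beta_bs, beta_ox):
        if not 0 <= beta <= 1:
            raise ValueError("beta values must be in [0, 1]")
    p_5mc = beta_ox
    # Sampling noise can make the difference negative, so clip at zero
    p_5hmc = max(0.0, beta_bs - beta_ox)
    p_c = max(0.0, 1.0 - p_5mc - p_5hmc)
    return p_5mc, p_5hmc, p_c
```

Clipping keeps each proportion within bounds but introduces bias at sites where the oxBS signal exceeds the BS signal, which is one motivation for constrained estimators such as the one used in the study.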
Microarray Normalization and Site Filtering
Microarray normalization methods were divided into two broad categories: between-array normalization and within-array normalization. Between-array normalization is used to reduce technical variation while preserving biological variation between samples, while within-array normalization is used to correct for the two different probe designs on the Illumina methylation arrays, which have been observed to have different dynamic ranges [30]. The between-array normalization methods evaluated were pQuantile [23], funnorm [24], ENmix [25], dasen [26], SeSAMe [27], and GMQN [28]. We implemented all possible combinations of between-array and within-array normalization methods as well as each method individually. Samples from all 3 labs were normalized together as one joint dataset.
In order to evaluate the performance of each pipeline, all 30 microarray samples from 3 labs were pooled together in a variance partition analysis [42]. For each pipeline and at each CpG site, the percentage of variation in DNA methylation beta values explained by cell line and lab was calculated. Additionally, we performed principal components analysis (PCA) and visually inspected clustering of technical and biological replicates across all normalization pipelines. A superior normalization pipeline would have more variation explained by cell line across the epigenome compared to other pipelines as well as clear clustering of biological and technical replicates.
After normalization, we used the 59 SNP probes on the 850k array, meant to identify sample swaps [43], to define a data-driven classification of low-varying sites. Previous studies have found that low-varying sites have poor reproducibility on the Illumina arrays [32] and have suggested data-driven probe filtering using technical replicates [44,45] or beta value ranges [32]. However, not all studies have technical replicates, and previously proposed beta value range cutoffs for one experiment may not be generalizable to another experiment. We first called genotype clusters based on the beta values at each of the 59 SNP probes within each of the 3 different labs (Figure 6b). Although we used a naïve approach for calling genotypes (<25% methylation = cluster 1, 25-50% methylation = cluster 2, >75% methylation = cluster 3), which was sufficient for the clear separation in our dataset (Figure 6b), more sophisticated methods [46] can be used for datasets with less clear separation and/or outlier values. In theory, because these 59 SNP probes are meant to measure genotypes, cell lines with the same genotype should have exactly the same readout in an experiment without any technical noise. Therefore, we can use variance within genotype clusters from the same experiment as a measure of technical noise and determine the minimum population variation needed to exceed the observed technical variation. Within each of the 3 labs, we calculated methylation variance at each SNP probe within each genotype cluster, giving us a distribution of observed technical noise (Figure 6c). To avoid being overly conservative due to outlier values at these 59 SNP probes, we used the 95th percentile of these genotype cluster variances as the threshold for defining low-varying sites (Figure 6c-d).
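The threshold calculation can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the study: the cut points for genotype clusters are parameters whose defaults (0.25 and 0.75) are chosen here only so that every beta value falls into a cluster, the sample variance is used within clusters, the percentile uses linear interpolation, and the data layout and function names are hypothetical.

```python
import statistics


def assign_cluster(beta, cuts=(0.25, 0.75)):
    """Naive genotype call from a SNP-probe beta value (assumed cut points)."""
    if beta < cuts[0]:
        return 0
    if beta > cuts[1]:
        return 2
    return 1


def percentile(values, q):
    """Percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def low_variance_threshold(snp_betas_by_lab, q=95):
    """Return the q-th percentile of within-cluster variances.

    snp_betas_by_lab: {lab: {probe: [beta value per sample]}}
    """
    variances = []
    for probes in snp_betas_by_lab.values():
        for betas in probes.values():
            clusters = {}
            for beta in betas:
                clusters.setdefault(assign_cluster(beta), []).append(beta)
            for members in clusters.values():
                if len(members) >= 2:  # variance needs at least two samples
                    variances.append(statistics.variance(members))
    if not variances:
        raise ValueError("no genotype cluster has two or more samples")
    return percentile(variances, q)
```

A CpG site whose variance across samples falls below the returned threshold would be classified as low-varying, since its biological variation cannot be distinguished from the technical noise observed at the SNP probes.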
Microarray Versus Sequencing Comparison
Variance partition analyses were used to compare the microarray and sequencing datasets and assess cross-platform concordance. Each variance partition analysis included all microarray replicates, normalized with funnorm + RCP, and one sequencing sample per cell line from a single sequencing platform and lab (with replicates merged). The percent of variation in DNA methylation explained by cell line and platform (sequencing or microarray) was calculated at each overlapping CpG site. This produced 5 sets of results, one per sequencing platform. The percentage of variation explained by cell line at each site was used as a measure of cross-platform concordance between each sequencing platform and the microarray data, and the percentage of variation explained by platform was used as a measure of platform- or experiment-specific artifacts. Each variance partition analysis was performed on the same 842,965 CpG sites, which were present in all 6 datasets, to ensure a fair comparison.
Data Availability
All data sequenced for this study is available within SRA under accession number SRR8324451. All code used to process data and generate files is publicly available on Github at https://github.com/Molmed/epiqc.
Disclaimer
The views presented in this article do not necessarily reflect those of the U.S. Food and Drug Administration. Any mention of commercial products is for clarification and is not intended as an endorsement.
Author Contributions
C.E.M, Y.W, Y.D, J.M.G, C.W, M.S, M.N, C.S, A.M, J.W.D, W.X, H.H, B.N, and W.T conceived of and designed the study. A.R, U.L, D.B, A.A, G.G, J.I, F.W, V.K.C.P, L.W, C.L, Z.C, Z.Y, J.L, X.Y, H.W, S.G, and D.B.M prepared sequencing libraries. V.K.C.P and L.W pooled and sequenced the libraries. T.A, R.R, C.R.A, I.I.C, T.G, Y.P.D, and M.N generated microarrays. J.F, A.L, J.N, B.W.L, M.L, M.A.C, C.R.A, T.G, C.L, K.P, R.C, S.L, G.G, A.M, P.P.L, M.M, A.S, S.B, A.B, V.F, W.L, J.X, and A.A contributed to bioinformatics analysis. J.F, B.W.L, J.N, C.L, M.L, S.L, and T.G generated figures. J.F, B.W.L, J.N, C.L, S.L, T.G, M.L, J.G, V.K, C.P.C.W, and J.X contributed to writing and editing the manuscript.
Competing Financial Interests
B.W.L, M.C., L.W., and V.K.C.P are employees of New England Biolabs. S.L and J.W.D are employees of Abbvie, Inc. S.B is an employee of Illumina, Inc. F.W, J.I, W.L are employees of New York Genome Center.
Supplementary Results
EPIC Methyl Capture Targeted Methylome Sequencing
We compared sequencing replicates of Illumina Methyl Capture EPIC, a targeted approach interrogating roughly 3.3 million CpGs with a preference for CpG islands and promoter regions, to methylome-wide assays across all seven genomes. Results shown for HG002 are representative of all seven genomes. Concordance between biological replicates was extremely high, with >98% of captured CpGs overlapping between replicates (Figure S14A), and very nearly 3.3 million CpGs captured in all seven genomes (Figure S14B). Some off-target CpGs were captured, representing roughly 12.5% of total bases sequenced per replicate (Figure S14C). Within off-target regions, nearly all were captured only at 1X depth, with very few exceeding 5X, while the mean coverage per CpG was closer to 20X for on-target CpGs, with a long tail exceeding 50X for many sites (Figure S14D). Methylation percentage was more imbalanced for EPIC replicates than expected, with a higher proportion of sites estimated as 100% methylated than in other assays (Figure S14E). This was reflected in an analysis of concordance, which showed an r-value of roughly 0.68 per assay in comparison to EPIC when examining only targeted regions (Figure S14F), a value likely driven down by an over-estimation of methylation within EPIC capture.
Hydroxy-methylcytosine Estimation
The TrueMethyl protocol is one of the few assays allowing investigators to measure 5mC and 5hmC (and C) in an indirect manner. For completeness, each cell line replicate was processed using both bisulfite only (BS = 5mC + 5hmC) and an oxidative reaction prior to the bisulfite reaction (OX = 5mC). In parallel, total 5mC and 5hmC were measured by LC-MS/MS. Figure S15 shows that all cell lines have a higher level of 5mC compared to 5hmC (Figure S15A,B). The low 5hmC levels were also observed at single-nucleotide resolution, with similar correlations between the two library preparations across all cell lines (Figure S15C) and within each cell line (Figure S15D), where the PCA plot shows little to no separation between libraries prepared using the BS or OX protocols.
As stated above, preparation of BS and OX libraries in parallel allows the determination of 5mC, 5hmC, and C. We used the MLML2R package to estimate the level of each cytosine state for each CpG sequenced, using HG002 as an example. The results are shown in Figure S15E. The top panel shows that some CpG sites not only show 100% of a specific cytosine mark (C = 100% unmethylated CpG, mC = 100% methylated CpG), but also a mixture of two (mC_C = methylated or unmethylated C; hmC_C = hydroxymethylated or unmethylated C; mC_hmC = methylated or hydroxymethylated C) or of all three cytosine marks (mC_hmC_C). Consistent with the LC-MS/MS quantitation, hmC marks were found in low proportions at some CpG sites. The results observed for HG002 were representative of all 7 cell lines.
Input titration for EM-Seq
In order to investigate the impact of input DNA, we generated EM-Seq libraries using 10 ng, 50 ng, and 100 ng of DNA for each replicate of each Genome in a Bottle cell line. We also randomly subsampled each run in silico to sets of 1M, 5M, 10M, 25M, 50M, and 100M paired-end 150 bp reads per input. Across this gradient of subsampled reads, the input amount had an effect on the number of CpGs uniquely captured at or below 25M read pairs, though most CpGs were covered even with 10 ng of input DNA at 50M read pairs and above (Figure S16A). For CpGs covered across input titers, the mean coverage per CpG remained even, and increased linearly with the number of reads (Figure S16B).
Biological Insight within Sequence Data vs Microarray
To determine the biological relevance of our results, we considered 52 CpGs on chromosome 1 that had been previously identified as differentially methylated in an array analysis of approximately 300 individuals from Caucasian-American, African-American, and Han Chinese-American populations [47]. Annotation and methylation results from all 52 CpGs are available within Supplementary Table 3. Of the 7 sites with reported |PMD|>0.2 between Chinese-Americans and Caucasian-Americans, 5 were identified as DMAs for all five assays as well as having |PMD|>0.2 in our arrays. Of the two remaining sites, one (on the TAS1R3 promoter) had insufficient read coverage for MethylSeq and TruSeq but was a DMA for the remaining assays, and the second (located on the C1orf100 promoter) was identified as a DMA for only SPLAT and TruSeq. In addition to TAS1R3, which is a sweetness taste receptor that is known to vary phenotypically between the Asian and Caucasian populations [48], there was strong concordance for 6 CpGs on the PM20D1 promoter, a gene associated with obesity and Alzheimer’s disease with demonstrated population-based variation [49, 50].
We additionally reviewed a collection of 3379 sites on chromosome 1 that were identified as DMAs by at least three of the five sequencing assays. Following annotation with HOMER [51], analysis with DAVID bioinformatics [52] identified a subset of 32 genes associated with osteoporosis (Benjamini-Hochberg adjusted p-value < 5.5E-8) according to the GAD database [53] (Supplementary Table 4). These include PBX1 and WLS, both of which have been associated with bone mineral density in previous studies [54, 55]. These results are of interest not only because of the high rate of osteoporosis in the Ashkenazi Jewish population relative to other ethnic groups [56], but also because only 4 of the 94 CpGs associated with these 32 genes were present on the Illumina array, highlighting the ability of whole methylome sequencing methods to detect differences unobservable in array-based datasets.
Methylation Capture in Oxford Nanopore PromethION
Aliquots of all seven cell lines were sequenced across three Oxford Nanopore PromethION R9.4 flow cells. Bases and methylation values were called using Megalodon 2.2.1 with Guppy 4.0 as the underlying basecaller, allowing simultaneous base calling and base modification calling from raw signal data. Compared to other methylome data captured from more traditional sequencing, PromethION showed a normal distribution of CpG coverage (Figure S17A). However, the methylation percentage distribution was much less bimodal, with far fewer CpGs demonstrating 100% methylation across the genome (Figure S17B), reflecting current limitations in uniform base modification detection across DNA strands from Nanopore data. Despite this, the correlation of methylation calls between Nanopore data and the other sequencing assays was high, with r values ranging from 0.794 (versus EM-Seq) to 0.825 (versus TruSeq) (Figure S17C). Most sites were called at 0% or 100% methylation, but many sites called at 100% by other assays showed lower methylation in PromethION. The findings reported for HG002 are representative of findings for all other cell lines.
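A cross-assay correlation of this kind can be sketched as below. The example assumes that per-CpG methylation fractions for each assay are stored in dictionaries keyed by (chromosome, position) and that a Pearson coefficient is computed over the CpGs present in both assays; the data structure, the function name, and the choice of Pearson correlation are illustrative assumptions and may differ from the analysis underlying Figure S17C.

```python
import math


def pearson_on_shared_cpgs(meth_a, meth_b):
    """Pearson r between two assays, restricted to CpGs called in both.

    meth_a, meth_b: dicts mapping (chrom, pos) -> methylation fraction in [0, 1]
    Returns (r, number_of_shared_cpgs).
    """
    shared = sorted(set(meth_a) & set(meth_b))
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared CpGs")

    xs = [meth_a[k] for k in shared]
    ys = [meth_b[k] for k in shared]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one assay is constant")

    return cov / math.sqrt(var_x * var_y), n


if __name__ == "__main__":
    # Hypothetical methylation fractions for a handful of CpGs.
    nanopore = {("chr1", 100): 0.05, ("chr1", 200): 0.70, ("chr1", 300): 0.85}
    em_seq = {("chr1", 100): 0.00, ("chr1", 200): 0.95, ("chr1", 300): 1.00}
    r, n = pearson_on_shared_cpgs(nanopore, em_seq)
    print(f"r={r:.3f} over {n} shared CpGs")
```

Restricting the comparison to shared sites matters because the assays differ in which CpGs reach sufficient coverage; a minimum-coverage filter would normally be applied before building the dictionaries.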
Supplementary Figures
Supplementary Tables
Acknowledgments
Library preparation and array-based analysis were performed by the SNP&SEQ Technology Platform in Uppsala (www.sequencing.se). The facility is part of the National Genomics Infrastructure (NGI) Sweden and Science for Life Laboratory and is supported by the Swedish Research Council. I.I.C, R.R, and C.R.A are supported by ISCIII, project number PI18/00050. T.G and Y.P.D are supported by NIH Grants 5P30GM114737, P20GM103466, U54 MD007584, and 2U54MD007601. The genomic work carried out at the Loma Linda University Center for Genomics was funded in part by the National Institutes of Health (NIH) grant S10OD019960 (CW). This project is partially supported by AHA grant 18IPA34170301 (CW).