"Same Difference": Comprehensive evaluation of three DNA methylation measurement platforms

Background DNA methylation in CpG context is fundamental to the epigenetic regulation of gene expression in high eukaryotes. Disorganization of methylation status is implicated in many diseases, cellular differentiation, imprinting, and other biological processes. Techniques that enrich for biologically relevant genomic regions with high CpG content are desired, since, depending on the size of an organism’s methylome, the depth of sequencing required to cover all CpGs can be prohibitively expensive. Currently, restriction enzyme based reduced representation bisulfite sequencing and its modified protocols are widely used to study methylation differences. Recently, Agilent Technologies and Roche NimbleGen have ventured to both reduce sequencing costs and capture CpGs of known biological relevance by marketing in-solution custom-capture hybridization platforms. We aimed to evaluate the similarities and differences of these three methods considering each targets approximately 10-13% of the human methylome. Results Overall, the regions covered per platform were as expected: targeted capture based methods covered >95% of their designed regions whereas the restriction enzyme-based method covered >70% of the expected fragments. While the total number of CpG loci shared by all methods was low, ~30% of any platform, the methylation levels of CpGs common across platforms were concordant. Annotation of CpG loci with genomic features revealed roughly the same proportions of feature annotations across the three platforms. Targeted capture methods encompass similar amounts of annotations with the restriction enzyme based method covering fewer promoters (~9%) and shores (~8%) and more unannotated loci (7-14%). Conclusions Although all methods are largely consistent in terms of covered CpG loci and cover similar proportions of annotated CpG loci, the restriction based enrichment results in more unannotated regions and the commercially available capture methods result in less off-target regions. Quality of DNA is very important for restriction based enrichment and starting material can be low. Conversely, quality of the starting material is less important for capture methods, and at least twice the amount of starting material is required. Pricing is marginally less for restriction based enrichment, and number of samples to be prepared is not restricted to the number of samples a kit supports. The one advantage of capture libraries is the ability to custom design areas of interest. The choice of the technique should be decided by the number of samples, the quality and quantity of DNA available and the biological areas of interest since comparable data are obtained from all platforms.


Background
DNA cytosine methylation in the form of 5-methylcytosine (5mC) in CpG context is an epigenetic marker that is important for regulation of gene expression. Changes in CpG methylation are implicated in many diseases, and proper methylation patterns are required for normal development [1]. Large scale studies such as ENCODE [2] and the Human Epigenomics Roadmap [3] have performed extensive profiling of 5mC in various cell lines and tissues revealing a rich and dynamic landscape of 5mC patterns in the human genome. Given the importance of these markers to cellular development and contribution to disease, a number of approaches have been developed for detecting the methylation status of cytosines [4], with bisulfite sequencing (BS-seq) being widely used to provide single-base quantitative measurement of 5mC (and 5hydroxymethylcytosine, 5hmC). Bisulfite sequencing refers to massively parallel sequencing after chemical deamination of cytosines (C) to uracils (U), followed by polymerase chain reaction (PCR). The deamination of cytosines is accomplished by the use of sodium bisulfite, and this pretreatment preserves both the methyl modification in 5mC and the 5-hydroxymethyl modification in 5hmC [5]. The benchmark in methylome coverage is whole genome bisulfite sequencing (WGBS), which at 30X sequencing coverage, accounts for ~94% of all cytosines in the genome with 99.8% of them being CpG loci [6]. However, different WGBS library preparation protocols can bias region coverage. Since no method completely covers the methylome, and biologically relevant CpGs have been identified in known genomic features [1,7], developing focused assays considering these features is in demand with the caveat that these approaches will leave gaps in the methylome potentially excluding important CpGs.
There are several methods for acquiring DNA methylation levels and we investigated the characteristics of three currently widely used platforms: i) enrichment by enzymatic digestion 3 (MspI) enhanced reduced representation bisulfite sequencing (ERRBS) [8], ii) capture based Agilent SureSelect Methyl-Seq (SSMethylSeq), and iii) capture based Roche NimbleGen SeqCap Epi CpGiant (CpGiant).
In this paper we present an analysis of the methylation pattern of the human lung fibroblast IMR-90 cell line profiled by each of the platform protocols, using two technical replicate libraries for ERRBS, and two libraries each for SSMethylSeq and CpGiant, one at the manufacturer's suggested concentration and one at a reduced concentration (Table 1 and Additional file 1: Table   S1). The capture libraries differ only in concentration of input material and are treated effectively as technical replicates. All libraries were sequenced to equivalent depth and compared to a library made using WGBS.

Computational analysis
Illumina's CASAVA 1.8.2 was used to generate fastq files from basecalls. Raw fastq reads were processed by a custom pipeline that consists of: 1) filtering raw fastq reads for pass filter reads, 2) trimming adapter sequence by FLEXBAR [9], 3) genomic alignments performed using Bismark  [10] and Bowtie [11] to reference human genome hg19, and 4) methylation calling by a custom PERL script [8]. Custom analysis scripts were written in R (version 3.3.0 [12]), including packages:  [19], and VennDiagram package version 1.6.17 [20], were used to perform the analysis and are available at GitHub (https://github.com/thk2008/methylseqplatformcomparison). Sequencing data and methylation results are available to download from GEO (GSE83595)

Region coverage
Targeted capture techniques (SSMethylSeq and CpGiant) have a designed set of genomic regions and therefore, a predicted set of CpGs covered. SSMethylSeq is specifically designed to capture CpGs from a single DNA strand, where the other platform capture CpGs from both DNA strands.
The ERRBS protocol is considered targeted to the extent that the restriction digest produces consistent genomic fragments, sizes from 84-334bp are isolated during the library preparation gel extraction step. Note that since the DNA for WGBS is randomly sheared and coverage depends upon sequencing depth, there are no predicted regions per se and consequently WGBS is

Sample preparation and sequencing
The general library preparation protocol for ERRBS is to digest input DNA with the methylation  Table 1.

Sequencing and alignment characteristics
Generally, all platforms produced similar sequencing results with no noticeable bias or reduced quality scores. The number of clusters and number of pass filter reads produced a typically consistent number of usable reads for all samples (254M +/-37M), see Additional file 1: Table S3 for more details.

Strand methylation symmetry
Maintenance of symmetric methylation patterns across complementary CpG sites is required in order to preserve methylation patterns across cellular divisions. In principle, measuring methylation levels on one strand is sufficient to infer the cellular methylation state. Indeed, the SSMethylSeq platform is designed to capture only one strand in contrast to all other platforms that capture from both strands ( Figure 2B). We note that this is different from measuring hemimethylation states, which refers to different methylation between two parental alleles. This  Figure 4C).

CpG-unit overlap and methylation levels concordance
Overall, the average number of common CpG-units covered by all three platforms is ~30% +/-1% and sd = 0.05 (Additional file 1: Figure S3 and Table S4). Among inter-platform samples we observed less overlap; median number of shared CpG-units ~1.5M, median overlap 39.1% ( Figure 5 upper triangle, Additional file 1: Figure S3 and Table S4). Methylation levels of common CpG-unit's across all platforms are highly concordant indicating that methylation level measurements are consistent and reproducible with average MAD = 0.41 and sd = 0.25 ( Figure 5 lower triangle, Additional file 1: Figure S3 and Table S4). Inter-platform concordance was slightly lower than intra-platform concordance with mean MAD = 0.46 and sd = 0.27 ( Figure 5 lower triangle, Additional file 1: Figure S3 and Table S4). These results demonstrate that while the platforms differ in their capture approaches and bisulfite conversion kits (Table 1)

Coverage of genomic feature regions by CpG-units
Next we were interested in what genomic features are covered by each platform and the degree of coverage. Analogous to the previous analysis, here we define a region as being one of exon, intron, promoter, CpGi, or shore, and coverage is a genomic region that contains at least one detected CpG-unit. It should be noted that the genomic feature annotations are not mutually exclusive and that some CpG-units are annotated by more than one category. Naturally, the targeted platforms focus on genomic regions known to play important roles in epigenetic regulation such as promoter regions, CpG islands, and CpG shores and ERRBS covers the same regions, but to a lesser degree ( Figure 6A) CpG-units in particular genomic region ( Figure 6B). Conversely, we looked at the number of CpGunits that have an annotation and observed similar trends for the three platforms with ERRBS having a larger proportion of unannotated CpG-units (~27%, Figure 6C). Overall, no platform is significantly enriched for particular genomic region although ERRBS is slightly less represented in promoters, CpG islands and CpG shores, while having more representation of unannotated CpGunits.

Overlap of CpG-units annotated with a genomic feature
We evaluated the overlap of annotated CpG-units to determine whether any platform is enriched for a particular genomic feature (e.g. exons, introns, promoters, CpG islands and shores). Again, it should be mentioned that the genomic feature annotations are not mutually exclusive and that some CpG-units are annotated by more than one category. However, a CpG-unit is counted as annotated if it has one or more annotations. Comparing the overlap in annotations across platforms, we see a similar grouping as above with the methylation levels, intra-platform overlap is high (mean overlap 93.7% sd=2.9%), and inter-platform overlap is lower (mean 55.1% sd=15.5%) ( Figure 7 and Additional file 1: Figures S4-S9). Thus, we observed similar proportions of genomic region coverage across all platforms, but lower overlap of annotated CpG-units suggesting coverage of different loci within those regions.

Discussion
We Mean and median coverage per CpG-unit. Figure S3. Intra-and Inter-platform CpG-unit overlap and methylation levels concordance. Table S4. Intra-and Inter-platform details. Figure S4.
Overlap of exon annotation of CpG-units as UpSet plot. Figure S5. Overlap of intron annotation of CpG-units as UpSet plot. Figure S6. Overlap of promoters annotation of CpG-units as UpSet plot.