Detecting DNA cytosine methylation using nanopore sequencing

A hidden Markov model (HMM)-based tool enables detection of 5-methylcytosine (5-mC) from single-molecule nanopore-sequencing data generated directly from human genomic DNA without chemical treatment. In nanopore sequencing devices, electrolytic current signals are sensitive to base modifications, such as 5-methylcytosine (5-mC). Here we quantified the strength of this effect for the Oxford Nanopore Technologies MinION sequencer. By using synthetically methylated DNA, we were able to train a hidden Markov model to distinguish 5-mC from unmethylated cytosine. We applied our method to sequence the methylome of human DNA, without requiring special steps for library preparation.

single-molecule detection 5 . Enzymatic treatment to convert 5-mC to 5-carboxylcytosine via tet methylcytosine dioxygenase 1 (Tet1) can enhance this signal 6 ; however, others have found this conversion to be incomplete and context dependent 7 .
Previous work has shown that 5-mC can be distinguished from cytosine by careful analysis of the electrical current signals measured by nanopore-based sequencing devices 8,9 . Here we describe the development of a method to directly detect DNA modifications, in this case 5-mC, based on only the electrical readout of the Oxford Nanopore Technologies (ONT) MinION instrument, without any chemical treatment. We characterized the error rate of our approach by sequencing negative and positive control samples, and we assessed the biological context of the methylation patterns that were detected using low-coverage humangenome-sequencing data.
We and others have demonstrated how hidden Markov models (HMMs) can be used to analyze nanopore sequencing data [10][11][12][13] . Our HMM computes the probability of observing a sequence of signal-level events, as measured by the MinION, given a specific nucleotide sequence (Supplementary Note). Here we extend our HMM 10 to detect methylation.
An HMM contains a set of emission distribution functions that model the probability of some observed data for each state of the HMM. For MinION-derived sequencing data, the observations are segmented electrical current samples, termed events. Each event represents a current level measured in picoamperes over some interval of time. The emission distribution functions model the event observations using Gaussian distributions with a mean and s.d. that depend on the particular short sequence, or k-mer, which in this case is a 6-mer (Online Methods).
There were clear differences in the event distributions of methylated and unmethylated DNA, using both the old R7.3 nanopore chemistry and the newer R9 pore (Fig. 1). We used PCR-amplified Escherichia coli DNA that was either untreated or treated with the CpG methyltransferase M.SssI to assess Gaussian distributions and to train our HMM for an expanded set of k-mers that included 5-mC in a CpG context (Online Methods; see Fig. 1 for examples). We observed many k-mers whose electrical signal shifted when sequencing PCR-amplified and M.SssI-treated (hereafter referred to PCR + M.SssI + ) E. coli DNA (Online Methods, Supplementary Tables 1-5; a complete description of the data sets we generated is in Supplementary Table 5). The shift in signal depended on the position of 5-mC in the k-mer (Supplementary Fig. 1).
We applied our trained model to a sample with a reference methylome to confirm our observations of detectable signal differences from methylated DNA. We performed multiple sequencing runs, using both R7.3 and R9 flow cells, on genomic DNA from detecting dna cytosine methylation using nanopore sequencing Jared T Simpson 1,2 , Rachael E Workman 3 , P C Zuzarte 1 , Matei David 1 , L J Dursi 1 & Winston Timp 3 in nanopore sequencing devices, electrolytic current signals are sensitive to base modifications, such as 5-methylcytosine (5-mc). here we quantified the strength of this effect for the oxford nanopore technologies minion sequencer. by using synthetically methylated dna, we were able to train a hidden markov model to distinguish 5-mc from unmethylated cytosine. We applied our method to sequence the methylome of human dna, without requiring special steps for library preparation.
The epigenetic state of the cell is critical for regulating gene expression and cellular responses to stimuli; it functions as a cellular memory, retaining information even through cell division. Methylation is an important aspect of the epigenome, and although effective methods exist to map its occurrence in the genome, each has limitations. To address some of these challenges, we demonstrate the direct detection of methylation marks in complex templates using long nanopore sequence reads and provide a freely available analysis pipeline (https://github.com/ jts/nanopolish and Supplementary Software).
The current 'gold standard' method for profiling methylation in DNA is bisulfite sequencing 1 . Sodium bisulfite treatment converts unmodified cytosines to uracil, while leaving methylated cytosines unchanged. Bisulfite treatment is a harsh process, which results in extensive DNA fragmentation 1 , and any breaks in the insert prevent downstream amplification, thereby necessitating high levels of input DNA (~100 ng). The use of a large amount of material complicates analysis, as the epigenome is highly variable and results in a heterogeneous mixture of epigenetic states 2 , although recent methods, including post-bisulfite adaptor tagging 1 , have attempted to lower input requirements. In addition, short reads only reveal short-range patterns of methylation 3 .
The Pacific Biosciences RSII instrument is able to directly detect methylation from untreated DNA at both known and previously unrecognized motifs by analyzing polymerase kinetics 4 in long sequence reads. For 5-mC, alterations in polymerase kinetics are subtle and spread over several bases, effectively precluding direct a lymphoblast cell line derived from a human female (Coriell NA12878), along with two control samples, PCR-amplified NA12878 DNA (used as a negative control) and aliquots of this negative control that were treated with M.SssI (used as a methylation-positive control). We confirmed the methylation status of the control samples using bisulfite sequencing (Online Methods). We refer to sequencing runs from the unamplified NA12878 samples as 'natural DNA' data sets, as the cell's methylation state was preserved. All of the natural DNA runs from a pore that were analyzed are jointly referred to as 'merged' data sets.
We applied our trained model to calculate the log-likelihood ratio between an unmethylated version of a reference genome substring and a version that contained at least one 'CG' dinucleotide (we considered nearby sites jointly as a group, as each event's current level was affected by multiple bases) (Online Methods and Supplementary Note).
Presently, our model cannot detect non-CpG methylation or distinguish k-mers with a mixture of methylated and unmethylated CpGs, as our training set is limited to completely methylated or unmethylated sequences. In practice, this means that when the CpG sites are within 10 bp of each other, we must assign the same methylation status to the entire group. More comprehensive training sets will address this limitation. We assumed that CpG sites were not hemimethylated, so we summed the log-likelihood ratios from both strands of two-direction (2D) nanopore reads (i.e., reads that include a linked template and complementary strand). We confirmed that the log-likelihood ratio of 'singleton sites'-i.e., regions that contained only a single CpG-tended to be less than 0 for our negative control and greater than 0 for our positive control (Online Methods).
Next we assessed the accuracy of methylation calls for singleton sites from a single 2D nanopore read. We randomly sampled 100,000 singleton sites from each of the negative and positive NA12878 control data sets independently, and we then used the loglikelihood ratio to make a methylation call for each site. A simple binary classifier that called a site 'methylated' if the log-likelihood ratio was positive had an accuracy of 83% for singleton sites using the R7.3 data and 87% using the R9 data (classifications are provided as a contingency table in Supplementary Tables 6 and 7). We assessed the tradeoff between true-positive and false-positive rates by computing a receiver operating characteristic (ROC) by varying the threshold for calling a site methylated ( Fig. 2a and Online Methods).
This classification procedure can be improved by requiring the absolute value of the log-likelihood ratio to reach a threshold to make a call at a given site. For a threshold that varied from 0 to 10, accuracy (calculated as '1 − error rate') improved to >95% as the stringency for making a call increased (Supplementary Fig. 2a). However, this came at the cost of making calls at fewer sites (Supplementary Fig. 2b). After setting a threshold of 2.5, our classifier had an accuracy of 91% and made calls at 68% of sites for the R7.3 data. For the R9 data, we achieved an accuracy of 94% and made calls at 77% of sites when using the same call threshold. For the subsequent results, we used this calling method rather than the naive binary classifier. Our accuracy calculations assumed a completely unmethylated negative control and a completely methylated  We also explored the biological context of our methylation calls. We used a log-likelihood threshold of 2.5 to make methylation calls for each group of sites in each sample and calculated the percentage of called sites that were methylated as a function of their distance from an annotated transcription start site. We performed this analysis for the merged natural DNA data set, the controls and a recent bisulfite sequencing data set for NA12878 (Encyclopedia of DNA Elements (ENCODE) accession ENCFF279HCL). As expected, we observed a tendency for CpG sites near transcription start sites (TSS) to be unmethylated, with the percentage of methylation increasing as the distance from the TSS increased (Fig. 2b). The pattern of methylation for the nanopore-based calls closely tracked the pattern for the bisulfite data for both the R7.3 and R9 data (Fig. 2b). This pattern was consistent across chromosomes (Supplementary Data 1), with the exceptions of chromosome 21, which was noisy owing to low coverage and few genes, and chromosome X, which had a flatter profile that probably resulted from methylation associated with X-inactivation. Methylated CpGs in the negative and positive controls had lowpercentage-methylated and high-percentage-methylated profiles, respectively, which were independent of distance from the TSS. We also found that at annotated CpG islands, the percentage of methylation detected by our method correlated well with that by bisulfite sequencing (Supplementary Fig. 3

, Supplementary Data 2 and Online Methods).
We applied our nanopore-based detection method to two human breast cell lines, MCF10A (a nontumorigenic epithelial cell line) and MDA-MB-231 (an aggressive metastatic cancer cell line). To improve coverage, we applied a reduced representation approach (Online Methods). Each sample was sequenced on two nanopore R7.3 flow cells and subjected to bisulfite sequencing on an Illumina MiSeq instrument. For CpGs with at least 10× coverage in both the bisulfite-based and nanopore-based sequencing data, we observed good correlation between the two data types (Supplementary Fig. 4; MCF10A, r = 0.91; MDA-MB-231, r = 0.91). Points with low correlation were partially due to sampling issues from the low nanopore coverage (~10×) as compared to the bisulfite coverage (often >100×). Our methodology may also have been affected by confounding factors such as alternative cytosine modifications (5-hydroxymethylation; 5-hmC) or non-CpG methylation, whereas the bisulfite sequencing data (as analyzed) was insensitive to these effects; for example, bisulfite sequencing may have called 5-hmC as 5-mC.
We then examined regions with high nanopore coverage, which overlapped CpG islands that bisulfite sequencing data identified as being differentially methylated (top coverage region in Fig. 3a; other regions in Supplementary Data 3). Across the regions, data from the bisulfite-based and nanopore-based sequencing approaches largely showed the same trends in the amount of methylation when accounting for coverage levels. In the top coverage region, a hypermethylated island in cancer (chromosome 9: 63,817,919-63,818,178) was correctly identified by both the nanopore and bisulfite data, whereas a second island (chromosome 9: 63,815,013-63,815,679) showed no significant change in methylation percentage in either data set (Fig. 3a).
We also computed per-read nanopore data in a similar manner (top coverage region in Fig. 3b; other regions in Supplementary Data 4). CpGs that could not be called definitively due to a loglikelihood ratio below our threshold value were omitted from the analysis. We found a clear shift in the hypermethylated island region when comparing methylation calls in cancer cells to those in normal cells (Fig. 3b). We also suggest that the hypermethylation seems to be a larger-than-local phenomenon; reads 4,529 and 4,526 seemed to be almost entirely hypermethylated, partially driving the shift, but suggesting that the hypermethylation had a nonlocal effect on methylation that was not easily appreciated by looking at the local percentage-methylation data commonly available.
Our work demonstrates that 5-mC can be identified by sequencing DNA with the MinION instrument, at up to 95% accuracy using stringent calling thresholds and the most recent pore chemistry. Our results on human DNA used low-coverage data for which the interrogated sites were covered by a single nanopore read. As the amount of sequencing data per run from the MinION and related sequencing instruments grows, we anticipate an improvement in calling accuracy by integrating signals from multiple overlapping reads. Our methodology should be generally extendable to other pore types and nanopore platforms.

brief communications
The use of training data from entirely CpG-methylated DNA reduces the ability to call heterogeneous methylation within a region. This is especially relevant in high-CpG-density regions. In future work, we plan to generate more-extensive training sets for all of the possible methylation combinations, including non-CpG methylation.
Other modifications-such as 5-hmC 8 , 5-formylcytosine (5-fC) and 5-carboxylcytosine (5-caC) -or DNA damage (such as 8-oxoguanine) will also modulate the signal and cannot be resolved from our existing training set. These modifications have functional consequences in protein binding 14 and DNA structure 15 . Previous work has demonstrated the discrimination of cytosine, 5-mC, 5-hmC, 5-fC and 5-caC in a research setting, and it seems likely that modulation of the current through the nanopore will be suitable to discriminate base modifications in any type of pore-based sequencing 16 . There are other potential applications for direct sequencing to study modifications. N 6 -methyladenosine has recently been shown to occur in eukaryotes, even in mammals 17 , and exogenous methylation has applications in the study of nuclear architecture and protein occupancy, for example, by using DNA adenine methyltransferase identification (damID) 18 or nucleosome occupancy and methylome sequencing (NOME-seq) 19 . Finally, DNA damage resulting from heavy metals, oxidation, UV light or other alterations might be detected with this method. These changes are otherwise impossible to probe on a single-base level. Training models for these base modifications may allow for a unified calling model that can be applied to detect interleaved patterns of these methylation marks on human DNA, allowing for comprehensive epigenetic profiling from a single assay.

methods
Methods, including statements of data availability and any associated accession codes and references, are available in the online version of the paper.

Reduced representation to detect CpG islands in cancer and normal breast tissue.
To apply the methods toward calling differential methylation in CpG islands of cancer and normal cells, we used a reduced-representation approach, which involves restriction digestion and size selection to enhance for CpG island presence in selected molecules 20 . We extracted genomic DNA from breast tissue cell lines MDA-MB-231 and MCF-10A (ATCC) using Masterpure (Epicentre Biotechnologies). 100 µg of DNA from each cell line was digested with the MseI restriction endonuclease (NEB); then, from a 50-µg aliquot, 3,500-to 6,000-bp-sized fragments were size-selected using automated size selection (Blue Pippin, Sage Sciences). This size window was chosen as it contained a large percentage of fragments with CpG islands, and specifically islands that have been shown in previous studies to differ between these cell lines 21 . This size-selection range was also designed to select for a small fraction of the human genome (12 Mb) and to ensure adequate coverage. After digestion and size selection, ~50 ng of DNA remained. This DNA was split into three aliquots-two 20-ng aliquots were used for nanopore-based sequencing, and one 10-ng aliquot was used for Illumina bisulfite-sequencing-based validation. Approximately 20 ng of genomic DNA was end-repaired and dA-tailed (NEB), subsequently cleaned with 1× Ampure XP (Beckman Coulter) and eluted in water. Adaptor ligation, using the Blunt/TA Ligase Master Mix (NEB), affixed leader and biotinylated hairpin adapters on either end of the library molecules (SQK-MAP006 Genomic DNA Sequencing Kit, ONT), along with loaded motor proteins on leader adapters and low-input tether molecules designed to bind to the synthetic membrane surface of the pore array. My-One Streptavidin C1 Dynabeads (Thermo Fisher) were used to enrich for library molecules containing the biotinylated hairpin. The low-input tether that was added allowed for on-bead loading of the sequencer, and no elution off the streptavidin beads was performed. On-bead libraries were mixed with Running Buffer and Fuel Mix (ONT), loaded on the MinION Mk1 sequencer and run for 48 h.
Validation with Illumina sequencing data. The E. coli, NA12878 positive control, NA12878 negative control, and MCF10A and MDA-MB-231 cancer/normal cell line reduced-representation samples were assayed for CpG methylation content using bisulfite sequencing with the Illumina MiSeq instrument. 500 ng of E. coli DNA was sheared to 300-bp fragments using the Bioruptor Pico (Diagenode), then end-repaired and dA-tailed using the NEBNext Ultra II library prep kit (NEB). Methylated adapters (NEB) were ligated using the Blunt/TA ligase Master Mix; the libraries were then bisulfite-converted (Zymo Methylation-Lightning). Libraries were then amplified using indexed primers following NEB recommendations, except for using Kapa Hifi Uracil as the polymerase. 10 ng of the reduced-representation samples from MDA-MB-231 and MCF10A were prepared for Illumina sequencing using Accel-NGS Methyl-Seq (Swift Biosciences). All samples were sequenced on an Illumina Miseq at Johns Hopkins University using v3 150 chemistry. The bisulfite-based sequencing data were analyzed with Bismark 22 , which was set to interpret only methylation in a 'CG' context. Region-specific analysis and local smoothing for samples from cancer and normal cells was performed using the bsseq package 23 .
Model development and training. ONT provides a reference set of parameters for all 4,096 6-mers over the standard nucleotide alphabet (A,C,G and T) for each of the three 'strand models' . The first strand model is called the 'template' model and is used for the first strand of DNA that passes through the pore. The other two models (complement.pop1 and complement.pop2) are for the 'complement' strand that is sequenced after the template strand when a hairpin adaptor is used to make a 'two-direction' (2D) read. During data preprocessing one of the two complement models is selected for each read by either Metrichor (R7.3) or our analysis pipeline (R9).
We extended our previously developed HMM 10,24 to learn new parameters ( ′ ′ m s k k , 2 ) for each k-mer for each of the three strand models for each pore type by aligning nanopore reads to a known reference genome. For example, the probability of observing event e i given that it came from k-mer k is: This training method is described in the Supplementary Note.
To test and validate our training procedure, we used nanopore reads from PCR-amplified genomic DNA from two E. coli strains, K12 MG1655 and K12 ER2925. The PCR amplicons were generated without defects or methylation marks, so we expected that the events measured by the MinION instrument, and hence our trained parameters, would closely match the ONT reference parameters. To assess this, we trained new parameters for all PCR data sets independently and counted the number of k-mers whose trained means differed from the ONT reference value by more than 0.1 pA, 0.5 pA, 1.0 pA, 2.0 pA and 4.0 pA for each of the three strand models. PCR-treated samples were run on three R7.3 flow cells and one R9 flow cell; the results are summarized in Supplementary  Table 1. For all three R7.3 PCR samples, the trained template and complement.pop1 parameters closely match the values expected from the ONT reference models; very few k-mers had a trained mean value that was more than 1 pA different than the ONT mean value. However, for one of the samples (PCR-R7.3-timp-113015) the complement.pop2 model had many k-mers with a trained mean value that was more than 1 pA different from the ONT mean value. This result was explained by a low flow cell temperature of 31-32 °C, relative to the expected 37 °C, which ONT indicated might affect the signal levels. We resequenced this sample after eliminating the temperature variation by controlling room ventilation more precisely to obtain the data set PCR-R7.3-timp-021216. The trained parameters for this third run were consistent with the ONT reference models for all three strands. For the R9 PCR data set, the trained parameters showed higher disagreement with the ONT R9 HMM models than when comparing the R7.3 data and model, but they were still generally consistent.
To generate methylation controls, we treated the same PCRamplified ER2925 DNA samples described above with the M.SssI methyltransferase (Zymo), which converts cytosine in a CpG context to 5-mC. We validated the methylation state of these samples by using whole-genome bisulfite sequencing on an Illumina MiSeq instrument, which resulted in 91% and 98% CpG methylation for the R7.3 samples (PCR+M.SssI-R7.3-timp-113015 and PCR+M.SssI-R7.3-timp-021216) and 97% methylation for the R9 run (PCR+M.SssI-R9-timp-061716). Each sample was run on its own flow cell (two R7.3 flow cells, one R9 flow cell), and we ran our training procedure for these runs similar to that for the unmethylated controls. In this case, for both R7.3 runs and all three strand models, we observed many k-mers with trained means that had a >1 pA difference when compared to the ONT reference models (Supplementary Table 2). For the R9 PCR + M.SssI + data set, we observed large differences with respect to the ONT HMM models. These differences often exceeded 4 pA, which only rarely occurred for the R9 PCR data. Because we consistently observed larger differences for the PCR + M.SssI + data sets across three independent runs using two different nanopores, we concluded that the differences were caused by 5-mC.
Next we sought to learn a new set of strand models for a k-mer collection that included methylated cytosine. We refer to this alphabet as the CpG alphabet, which contains the symbols A, C, G, T and M, where M stands for 5-mC and can only appear in a CpG context. We trained new parameters over this expanded alphabet by converting all CG dinucleotides in the E. coli reference genome to MG. The results of training using the PCR + M.SssI + data sets over this expanded alphabet are presented in Supplementary  Table 4. Again, we observed many k-mers whose trained means were more than 1 pA or 4 pA (for R7.3 and R9, respectively) different than the ONT reference parameters, for all three strand models (in this case we compared the mean for methylated k-mers to the unmethylated version in the ONT reference model).
To confirm that this effect was not due to simply increasing the alphabet size of the model, which implicitly trains 7-mers rather than 6-mers for a subset of k-mers, we ran the same procedure on the PCR (unmethylated) data sets in parallel. Here we did not observe the same magnitude of difference for the trained means (Supplementary Table 3). Figure 1b is an example of a strong difference in signal, for the 6-mer AGGTMG, between the unmethylated data set PCR-R7.3-timp-021216 and the methylated data set PCR+M. SssI-R7.3-timp-021216. Figure 1e shows the distributions for the same k-mer for the R9 training data sets (PCR-R9-timp-061716 and PCR+M.SssI-R9-timp-061716). We note, however, that not all 6-mers demonstrated a large difference in signal (Fig. 1c,f).
We determined that the difference in the trained mean, compared to the ONT reference model, was dependent on the position of the methylated base in the 6-mer (Supplementary Fig. 1) and on the pore. When the methylated base was in the fifth and sixth positions of the k-mer (k-mers with pattern abcdMf and abcdeM), we saw a consistent shift toward higher current observations for both pore types. When the methylated base was in the first position of the k-mer (pattern Mbcdef), almost no shift was observed for the R7.3 data, but the R9 data showed a weak signal.

Confirmation of the methylation status of control samples.
To confirm the methylation status of our positive and negative controls for NA12878, we performed whole-genome bisulfite sequencing on an Illumina MiSeq instrument and found an average of 0.5% and 0.4% CpGs methylated in the R7.3 and R9 negative controls, respectively, and 96% and 95% of CpGs methylated in the R7.3 and R9 positive controls, respectively.
Log-likelihood ratio assessment. To calculate the log-likelihood ratio for k-mers that contained at least one CpG dinucleotide, we compared methylated and unmethylated versions of the k-mer as: Here S R is a substring of the reference genome that contains at least one CG dinucleotide (we considered nearby sites jointly as a group, as each event's current level is affected by multiple bases). S M is a modified copy of S R , in which all CG dinucleotides were changed to MG. The likelihoods were calculated by computing the probability of observing the events aligned to this portion of the reference genome, (e j, 1 , …, e j, nj ), given the HMM parameterized by S M or S R . For complete details of this calculation see the Supplementary Note. Supplementary Figure 5 is a histogram of the log-likelihood ratios computed from the three sample types (negative control, positive control and natural DNA) for the R7.3 and R9 data sets. In this analysis we restricted the histogram to only include singleton sites, the CpG groups described above that only contain a single CG dinucleotide. The log-likelihood ratio distribution for the negative-control (PCR) and positive-control data sets were clearly shifted to be less than 0 (stronger evidence for no methylation) and greater than 0 (stronger evidence for methylation), respectively. The histogram for the merged runs of natural NA12878 DNA had a single peak around 0 and was wider than the distribution of the two control samples. This reflects the underlying biology, as CGs for natural human DNA consist of a mixture of methylation states. We observed a stronger methylated/unmethylated signal for the R9 data than for the R7.3 data.
CpG-island analysis. We compared the percentage of methylation calculated by our nanopore-based method to that by bisulfite measurements at annotated CpG islands (CGIs). Supplementary Data 1 shows the results for the merged natural DNA data set sequenced using the R7.3 pore and the R9 pore. We found that the nanopore-derived and bisulfite-derived data had consistent patterns of methylation at CGIs, with an overall correlation of 0.83 (R7.3) and 0.84 (R9) (by Pearson's correlation coefficient). CGIs that were in an annotated promoter showed low levels of methylation, as expected. Note that the quantization of the nanopore data was due to the low coverage of our data sets. Repeating the calculation for our control data sets broke the correlation. In Supplementary Data 1 we also show all individual runs of the natural NA12878 DNA to demonstrate that the detected pattern was reproducible across replicates.
Code availability. The details of our probabilistic model, training and classification algorithms, and a description of the analysis method we used to generate the results presented above are contained in the Supplementary Note. Our analysis pipeline is reproducible and has been documented in the GitHub repository: https://github.com/jts/methylation-analysis. Our training and classification code is available at https://github. com/jts/nanopolish. Data availability. All data generated for this study was deposited in the European Nucleotide Archive under accession PRJEB13021. In addition, we used PCR-amplified E. coli K12 sequencing data that was previously collected (ERR1147229).