## Abstract

Chemical modifications to DNA regulate cellular state and function. The Oxford Nanopore MinION is a portable single-molecule DNA sequencer that can sequence long fragments of genomic DNA. Here we show that the MinION can be used to detect and map two chemical modifications of cytosine: 5-methylcytosine and 5-hydroxymethylcytosine. We present a probabilistic method that enables expansion of the nucleotide alphabet to include bases containing chemical modifications. Our results on synthetic DNA show that individual cytosine base modifications can be classified with accuracy up to 95% in a three-way comparison and 98% in a two-way comparison.

**Statement of Significance** Nanopore-based sequencing technology can produce long reads from unamplified genomic DNA, potentially allowing the characterization of chemical modifications and non-canonical DNA nucleotides as they occur in the cell. As the throughput of nanopore sequencers improves, simultaneous detection of multiple epigenetic modifications to cytosines will become an important capability of these devices. Here we present a statistical model that allows the Oxford Nanopore Technologies MinION to be used for detecting chemical modifications to cytosine using standard DNA preparation and sequencing techniques. Our method is based on modeling the ionic current due to DNA k-mers with a variable-order hidden Markov model where the emissions are distributed according to a hierarchical Dirichlet process mixture of normal distributions. This method provides a principled way to expand the nucleotide alphabet to allow for variant calling of modified bases.

## Introduction

Eukaryotic DNA chemical modifications of cytosine (C) include 5-methylcytosine (5-mC), 5-hydroxymethylcytosine (5-hmC), 5-formylcytosine, and 5-carboxylcytosine. DNA methylation is involved in multiple facets of biology, such as gene regulation, cell differentiation and development, and disease. In addition, 5-mC and 6-methyladenine (6-mA) are involved in bacterial gene regulation^{1–3}.

Next-generation sequencing technologies use chemical treatment to detect cytosine methylation. The treatment causes base substitutions that can be read without an expanded nucleotide alphabet. These techniques are limited to read lengths of 100-500 base pairs and can only detect one cytosine variant at a time. Single-molecule real-time (SMRT) sequencing generates long reads (1-5 kb), and researchers have shown that it can detect multiple modifications to DNA simultaneously using enzyme kinetics^{4,5}. The Oxford Nanopore Technologies’ (ONT) MinION is a portable, low-cost single-molecule DNA sequencer that can sequence long fragments of DNA (50 kb) at up to 92% accuracy without amplification^{6}.

Computational analysis of nanopore data has historically been a niche area of bioinformatic research^{7,8}, but the field has broadened since the beginning of the MinION Access Program in 2014. Recently published algorithms have focused on alignment and *de novo* genome assembly using hidden Markov models^{9–12}. We build on this literature by taking a similar approach to detecting base modifications. Our group and others have previously shown that ionic current measurements from low-throughput nanopore sensors can discriminate all five C5-cytosine variants^{13,14}. In this paper, we demonstrate that the MinION nanopore sequencer can discriminate among C, 5-mC, and 5-hmC at high throughput without special DNA preparation.

Our method is based on a generative model of the MinION’s ionic current signal. In particular, we assume that the signal is emitted by a variable-order pair hidden Markov model (HMM) that tracks a reference sequence but allows a reference nucleotide to match any of several modified bases (Figure 1A-B, Figure 4). We augment the HMM by modeling the ionic current distributions with a hierarchical Dirichlet process mixture model (HDP), a Bayesian nonparametric method that shares statistical strength to robustly estimate a set of potentially complex distributions^{15}. We show that the HDP meaningfully enhances the HMM’s ability to detect cytosine variants by comparing it to a simpler HMM with emissions modeled by parametric normal distributions.

This model allows for simultaneous reference alignment and probabilistic calling of DNA modifications. We show that it can accurately distinguish DNA modifications using synthetic DNA substrates containing homogeneously methylated, hydroxymethylated, or unmethylated cytosine residues.

## Results

### Methylation variant calling

We sequenced synthetic DNA strands containing entirely either cytosine, 5-methylcytosine, or 5-hydroxymethylcytosine on the MinION using a standard preparation protocol (see Methods for details). During sequencing, the MinION records ionic current in real time at 3 kHz and then divides it into “events”, each of which corresponds to a single-nucleotide step of the DNA molecule passing through the nanopore. The current software (and our method) models each event as being due to a six-nucleotide segment of DNA, which we refer to as a 6-mer. A hairpin is ligated to the end of the DNA duplex during sample preparation so that both the template and complement strands are sequenced. We separately align the events from the template and complement strands to a reference sequence with our model and marginalize over the HMM’s states to obtain the posterior probability on the methylation status of a given cytosine. We then call the variant as the methylation status with the highest marginal probability. We performed three-way classification experiments between all three cytosine variants and two-way classification experiments between only cytosine and 5-methylcytosine. The error rates for each read (across cytosines) and each cytosine (across reads) are summarized below.
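
The final calling step can be sketched as follows. This is a minimal illustration that assumes the marginal posterior for one cytosine is already available as a mapping from variant label to probability; the function name `call_variant` and the example values are hypothetical and are not taken from our software.

```python
def call_variant(posterior):
    """Return the variant label with the highest marginal posterior probability."""
    total = sum(posterior.values())
    if total <= 0:
        raise ValueError("posterior must have positive total mass")
    label = max(posterior, key=posterior.get)
    # Normalize so the reported probability is comparable across sites.
    return label, posterior[label] / total


# Hypothetical posterior for a single reference cytosine (illustrative values).
site_posterior = {"C": 0.15, "5-mC": 0.70, "5-hmC": 0.15}
call, prob = call_variant(site_posterior)
```

In a two-way experiment the same function applies with only the `"C"` and `"5-mC"` entries present.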

### Methylation calling error rate

The mean and median per-read accuracy using the best performing HMM-HDP model were 74% and 80% respectively for the template reads and 67% and 76% for the complement reads. The distribution of per-read accuracies is shown in Figure 2A. These results represent a significant improvement over the 33% accuracy that would be expected by chance. They are also significantly better than the results of the HMM with the emissions modeled by normal distributions, which achieved mean and median accuracy of 58% and 62%, respectively, for the template reads and 47% and 50% for the complement reads. When the HMM-HDP classifies between only cytosine and 5-methylcytosine, the mean and median accuracy increase to 83% and 85% respectively for template reads and 78% and 84% for complement reads (Table 1).

The accuracy varied substantially between different sites on the DNA substrate (Figure 2B). Averaged across reads, the best-performing three-way model classified cytosines at accuracies ranging from 16% to 95% with median accuracy of 76% for template reads and 70% for the complement reads (Table 1). The highest accuracy was achieved in a two-way classification at 98% on template reads, with a median accuracy of 82%. The variability in accuracies agrees with previous research that showed that sequence context affects methylation-calling error rate^{13,16}. Figure 2C shows the classifier’s tradeoff between false positive and false negative rate by site across thresholds on the posterior probability.
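
The two summaries used here (accuracy per read across cytosines, and accuracy per cytosine across reads) differ only in the grouping key. A small sketch, assuming each call is stored as a hypothetical `(read_id, site, truth, call)` tuple:

```python
from collections import defaultdict


def accuracy_by(calls, key_index):
    """Fraction of correct calls grouped by read (key_index=0) or by site (key_index=1)."""
    correct, total = defaultdict(int), defaultdict(int)
    for record in calls:
        key = record[key_index]
        total[key] += 1
        correct[key] += record[2] == record[3]  # True counts as 1
    return {key: correct[key] / total[key] for key in total}


# Illustrative calls: E is an assumed code for 5-mC.
calls = [
    ("r1", 10, "C", "C"),
    ("r1", 25, "C", "E"),
    ("r2", 10, "E", "E"),
    ("r2", 25, "E", "E"),
]
per_read = accuracy_by(calls, 0)
per_site = accuracy_by(calls, 1)
```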

It is likely that some of the difficulty in classifying certain sites results from 6-mer ionic current distributions that vary only slightly between the methylation states. We observed a statistically significant correlation between the mean pairwise Hellinger distance between the distributions of the methylation states of the 6-mers overlapping a site and its classification accuracy: Pearson correlation 0.52 (p = 6.6E-31) on the template strand and 0.36 (p = 9.0E-15) on the complement strand (Figure 2D).
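
To make the distance measure concrete, the sketch below computes the mean pairwise Hellinger distance for one 6-mer. It assumes, purely for illustration, that each methylation state is summarized by a single normal distribution, for which the distance has a closed form; the HDP densities used in the analysis are not restricted to this shape, and the parameter values are hypothetical.

```python
import math
from itertools import combinations


def hellinger_normal(mu1, sd1, mu2, sd2):
    """Hellinger distance between two univariate normal distributions."""
    var_sum = sd1 ** 2 + sd2 ** 2
    # Bhattacharyya coefficient of the two normals.
    bc = math.sqrt(2.0 * sd1 * sd2 / var_sum) * math.exp(-((mu1 - mu2) ** 2) / (4.0 * var_sum))
    return math.sqrt(max(0.0, 1.0 - bc))


def mean_pairwise_hellinger(params):
    """Average the distance over all unordered pairs of methylation states."""
    pairs = list(combinations(params, 2))
    return sum(hellinger_normal(*a, *b) for a, b in pairs) / len(pairs)


# Hypothetical (mean pA, sd pA) for one 6-mer under C, 5-mC and 5-hmC.
states = [(52.0, 1.2), (53.1, 1.3), (51.4, 1.1)]
score = mean_pairwise_hellinger(states)
```

The distance is 0 for identical distributions and approaches 1 as they stop overlapping, so a low score marks a 6-mer whose methylation states are hard to tell apart.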

### The hierarchical Dirichlet process more realistically models ionic current distributions

Figure 3 compares the current signal distributions of three representative 6-mers from the HDP, the maximum likelihood estimate (MLE) normal distribution, and a kernel density estimate. Qualitatively, compared to MLE, the HDP posterior densities reflect the nuance of the 6-mer distributions more realistically. As a nonparametric method, the HDP can approximate any empirical distribution with sufficient data. The statistical shrinkage between the distribution estimates also tends to smooth away small-scale irregularities that can be observed in the kernel density estimate.

### Comparison of different HDP topologies

The HDP boosts its statistical strength by sharing information between the set of distributions it estimates. In effect, this encourages them to be more similar to each other than if they were modeled independently. The HDP model also has the possibility of encouraging a greater degree of similarity between pre-specified subgroups of distributions (see Methods for details). This can increase statistical strength further, assuming that the subgroups reflect clusters of similarity in the true distributions. Since the biophysical relationship between each given 6-mer sequence and the observed ionic current distribution is poorly understood, we empirically tested whether certain subgroupings would be informative in this manner.

We tested five HDP models with different subgroupings of 6-mers. The two-level HDP does not separate them into any subgroups (Figure 1C), whereas the rest of the models group 6-mers by features of their 6-mer sequence (Figure 1D). The “Multiset” HDP groups 6-mers by their nucleotide content without regard for the order. “Composition” groups 6-mers by how many purines and pyrimidines they contain. “MiddleNucleotides” groups 6-mers based on the center two bases in the 6-mer. Finally, “GroupMultiset” groups the 6-mers by their nucleotide content without regard for their order or their methylation status. We used methylation-calling accuracy to assess the performance of these structures. The best performing model was the “Multiset” model (Table 1). However, its gain in accuracy over the simpler ungrouped model was small.
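
One plausible reading of the four grouping rules is sketched below as key functions: two 6-mers fall in the same subgroup when their keys are equal. The single-letter codes `E` (5-mC) and `O` (5-hmC) and the function names are assumptions made for this example, and the treatment of modified cytosines as pyrimidines in the composition key is likewise an assumption.

```python
# Assumed single-letter codes: E = 5-mC, O = 5-hmC (illustrative only).
PURINES = set("AG")
CYTOSINE_VARIANTS = set("CEO")


def multiset_key(kmer):
    """Nucleotide content, ignoring order but keeping methylation status."""
    return "".join(sorted(kmer))


def composition_key(kmer):
    """(number of purines, number of pyrimidines)."""
    n_purine = sum(base in PURINES for base in kmer)
    return (n_purine, len(kmer) - n_purine)


def middle_nucleotides_key(kmer):
    """The two central bases of the k-mer."""
    mid = len(kmer) // 2
    return kmer[mid - 1:mid + 1]


def group_multiset_key(kmer):
    """Nucleotide content, ignoring both order and methylation status."""
    collapsed = ["C" if base in CYTOSINE_VARIANTS else base for base in kmer]
    return "".join(sorted(collapsed))
```

For example, `ACGTEC` and `ACGTOC` share a `group_multiset_key` but have different `multiset_key` values, which is the distinction between the “GroupMultiset” and “Multiset” topologies.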

## Discussion

To date, few sequencing technologies have been able to directly sequence modified bases alongside canonical nucleotides. ONT’s standard statistical model for the MinION also does not distinguish 6-mers according to methylation. Our results show that it is possible to expand the nucleotide alphabet to include 5-mC and 5-hmC using a hybrid statistical model composed of a pair HMM and an HDP mixture of normal distributions.

We demonstrate that high-throughput nanopore sensing can successfully discriminate between cytosine, 5-methylcytosine, and 5-hydroxymethylcytosine. Using MinION signal data, we achieved three-way and two-way classification accuracy up to 95% and 98%, respectively, for single cytosines and median accuracies of 80% and 85% by read. The classification accuracy varies between sequence contexts: some modified cytosines are reliably captured while others are not discernible. We only classified cytosine variants based on one strand, however, and in an application where there is symmetric methylation the context on the reverse complement strand may be more accurately classified. Rereading uncopied DNA may also improve the accuracy.

We anticipate numerous biological applications for this technology. In particular we expect the combination of long reads and detection of multiple base modifications to be widely useful. For instance, it could be applied to studying genomic methylation and haplotype phasing. Since no extra sample preparation is necessary, this information is available “for free” in any sequencing experiment. With appropriate training data, our methodology could be easily generalized to detect additional nucleotides and different base modifications as well. As nanopore sequencing evolves, the accuracies for detecting base modifications will improve further, opening this technology to diagnostics and other clinical applications.

## Methods

### Creating a controlled set of C, 5-mC, and 5-hmC sequences

We used 897 bp synthetic DNA strands from ZYMO Research (Catalog # D5405) that contain entirely C, 5-mC, or 5-hmC bases. Apart from the cytosines, the strands have identical sequences. We performed sequencing experiments (using SQK-MAP006 kits) with four MinION flow cells: one for each of the three substrates, and one where all the substrates were barcoded with uniquely identifying sequences (using an ONT kit) and run together on one flow cell. All models were trained on the reads run in separate flow cells. The barcoded reads served as our test dataset. This experimental design maximized the amount of training data while controlling for batch effects between MinION runs. Sequence data were processed using Metrichor (versions 1.15.0 and 1.19.0), and only ‘pass’ 2D reads were used for downstream analysis.

### Mapping of Reads and Event Alignment

We align ionic current events to the reference sequence in a two-step process. First we generate a guide alignment between nucleotide sequences, which we then use to guide a second alignment of events to the reference. To generate the guide alignments, we used a concatenated sequence from Metrichor’s ‘2D alignment’, which allows for each base in the MinION nucleotide sequence to be mapped to an event in the template and complement event sequence. We then generated a guide alignment of the nucleotide sequence to the reference with BWA-MEM in ont2d mode^{17}. Runs of consecutive matches in this guide alignment serve as anchors for the event-to-reference alignment using the banded alignment scheme described by Paten et al.^{18}. The anchors are mapped back to events in the event sequence, and the events are then realigned to the reference using the HMM described below, constrained by the anchors.
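
The anchor-finding step can be illustrated with a short sketch. It assumes the guide alignment has been reduced to a list of `(read_pos, ref_pos)` matched pairs in order; the function name and the `min_length` cutoff are hypothetical and stand in for whatever criteria the banded alignment scheme applies.

```python
def match_runs(aligned_pairs, min_length=2):
    """Group gap-free stretches of (read_pos, ref_pos) pairs into anchor runs."""
    runs, current = [], []
    for pair in aligned_pairs:
        # A pair extends the run only if both coordinates advance by exactly one.
        if current and pair == (current[-1][0] + 1, current[-1][1] + 1):
            current.append(pair)
        else:
            if len(current) >= min_length:
                runs.append(current)
            current = [pair]
    if len(current) >= min_length:
        runs.append(current)
    return runs


# Illustrative guide alignment with one insertion and one large jump.
pairs = [(0, 10), (1, 11), (2, 12), (4, 13), (5, 14), (7, 20)]
anchors = match_runs(pairs)
```

Each run would then be mapped back to event indices and used to constrain the event-to-reference alignment.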

### Structure of variable-order hidden Markov model

Our HMM is structured to allow alignment of multiple different bases at a given position in the reference sequence. In this study, we allow for any cytosine variant to be aligned to a given cytosine residue. The fact that each event corresponds to six positions in the reference means that more than one event reports on a single ambiguous position. Accordingly, the HMM must be constrained so that two nearby match states cannot label a reference cytosine’s methylation inconsistently. To accommodate this, we implemented our HMM in a variable-order meta-structure that allows for multiple paths over a reference 6-mer depending on the number of methylation possibilities (i.e. the number of cytosine options raised to the power of the number of ambiguous positions in the 6-mer). The dynamic programming matrix has high-dimensional cells to accommodate these paths. We restrict the recursion by only allowing transitions if the bases at positions 2-6 in the first 6-mer are identical to the bases at positions 1-5 in the second 6-mer (Figure 5). The joint probability for the event sequence and the reference is calculated with the forward-backward algorithm, and the likelihood of methylation at each cytosine is calculated by marginalizing over the HMM’s states.
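
The path count and the transition restriction can be sketched in a few lines. The codes `E` (5-mC) and `O` (5-hmC) and the function names are assumptions for illustration; the sketch covers only the bookkeeping, not the dynamic programming itself.

```python
from itertools import product

CYTOSINE_OPTIONS = "CEO"  # assumed codes: C, E = 5-mC, O = 5-hmC


def expand_kmer(ref_kmer, options=CYTOSINE_OPTIONS):
    """List every labeling of a reference 6-mer, one choice per ambiguous cytosine."""
    choices = [options if base == "C" else base for base in ref_kmer]
    return ["".join(labels) for labels in product(*choices)]


def transition_allowed(first, second):
    """Allow a step only if positions 2-6 of `first` equal positions 1-5 of `second`."""
    return first[1:] == second[:-1]


paths = expand_kmer("ACGTCA")  # two cytosines, three options each
```

With three cytosine options and two ambiguous positions, `expand_kmer` yields 3^2 = 9 labelings. The overlap check is what prevents adjacent match states from labeling the same reference cytosine inconsistently.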

### Hierarchical Dirichlet process mixture model

The HDP mixture is a statistical model in which a collection of mixture distributions (here corresponding to the signal emission distributions for the 46,656 different 6-mers in the expanded alphabet) are composed of a countably infinite set of shared mixture components. The weights of the components in each mixture distribution are determined according to a separate Dirichlet process on the shared collection of components^{19}. In addition, the mixture components themselves are distributed according to a Dirichlet process that draws components from a base distribution. In our model, the base distribution is the normal-inverse gamma distribution, which is a conjugate prior to the normal distribution (that is, to the mixture components).
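
For orientation, a two-level HDP mixture of this kind is conventionally written as below. The symbols are assumptions introduced here rather than notation from the text: $H$ is the normal-inverse gamma base distribution, $\gamma$ and $\alpha$ are concentration parameters, $j$ indexes the 6-mers, and $x_{ji}$ is the $i$-th event mean assigned to 6-mer $j$.

$$
\begin{aligned}
G_0 \mid \gamma, H &\sim \mathrm{DP}(\gamma, H) \\
G_j \mid \alpha, G_0 &\sim \mathrm{DP}(\alpha, G_0) \\
(\mu_{ji}, \sigma^2_{ji}) \mid G_j &\sim G_j \\
x_{ji} \mid \mu_{ji}, \sigma^2_{ji} &\sim \mathcal{N}(\mu_{ji}, \sigma^2_{ji})
\end{aligned}
$$

Because $G_0$ is discrete with probability one, every $G_j$ places its mass on the same atoms, which is why the mixture components are shared across 6-mers while the weights differ.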

Sharing mixture components statistically shrinks our estimates of the current distributions toward each other. This boosts statistical strength since each distribution can share the information learned by the others. We also have the option of adding a further layer of Dirichlet processes between the Dirichlet process that generates the distribution over shared components and the Dirichlet processes that generate the 6-mer distributions. After doing so, the Dirichlet processes are arranged in a tree structure (Figure 1D). This encourages a greater degree of shrinkage within each subtree. We experimented with several topologies for this tree, each representing a different grouping of 6-mers based on their sequence composition (see Results for descriptions of the groupings).

### Generating preliminary alignments without consideration for methylation status

ONT provides a lookup table of parametric distributions that they use to characterize the current distributions of the 4096 canonical base 6-mers. We take advantage of this table to heuristically initialize the emission distributions in our HMM over the expanded alphabet. To do so, we generate a preliminary alignment using the table and then infer the methylation status of the events based on their flow cell (as mentioned above, the substrates within a flow cell only contain one kind of methylation). We can then use high probability matches from this alignment to train the emission distributions of the HMM.

To generate preliminary alignments we used the ONT table to calculate the probability of an event being due to a particular 6-mer in the Match and Insert-Y states of the HMM. The event’s mean current and fluctuation in the mean (noise) are modeled as normal distributions. We assume independence of the mean and noise variables, so the conditional probability of an event for a given 6-mer is just the product of the mean and noise marginal probabilities. The Insert-X state is silent and therefore does not have an emission probability.
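
Under the independence assumption, the emission density factorizes into two normal densities. A minimal sketch follows; the dictionary keys and the parameter values are hypothetical and do not reflect the layout of the ONT table.

```python
import math


def normal_pdf(x, mu, sd):
    """Density of a univariate normal distribution."""
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def event_emission(event_mean, event_noise, kmer_params):
    """Density of an event under one 6-mer, assuming independent mean and noise."""
    return (
        normal_pdf(event_mean, kmer_params["mean_mu"], kmer_params["mean_sd"])
        * normal_pdf(event_noise, kmer_params["noise_mu"], kmer_params["noise_sd"])
    )


# Hypothetical table entry for a single 6-mer (values in pA).
params = {"mean_mu": 55.0, "mean_sd": 1.0, "noise_mu": 1.5, "noise_sd": 0.5}
density = event_emission(55.0, 1.5, params)
```

In practice such products are usually accumulated in log space to avoid underflow over long event sequences.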

### Supervised training of 6-mer distributions

We train the HMM with a variant of the Baum-Welch procedure. First, we heuristically initialize the emission distributions by training them on aligned events above a probability threshold (0.9) from the preliminary alignment described above. In the control experiments using normal distributions, this simply entails calculating the maximum likelihood normal distribution for each 6-mer. For the HMM-HDP, we estimate the posterior mean density for each 6-mer’s distribution using a Markov chain Monte Carlo (MCMC) algorithm. In both cases, we only estimate distributions for the event mean current following the preliminary alignment (a separate neural net experiment suggested that the event noise did not add to classification accuracy; Supplementary Methods). We then produce new alignments and re-estimate the emission distributions from high confidence assignments as in the initialization. At this step, we also re-estimate the HMM’s transition probabilities independently. This process is iterated until the model’s variant calling accuracy stops improving.

The MCMC algorithm we use for the HDP is the Chinese Restaurant Franchise Algorithm (Teh et al., 2006), a Gibbs sampler for HDP mixture models. We discard the first 900,000 samples as burn-in (30 times the total number of assignment data points) and collect 10,000 samples, thinning sampling iterations by 100. Whenever we record samples from the Markov chain, we evaluate the posterior predictive distribution for each 6-mer at a grid of 1200 evenly spaced points in the interval between 30 pA and 90 pA. After sampling, we compute our estimate of the posterior mean density as the mean of the sampled densities at each grid point. Subsequently, we interpolate within the grid using natural cubic splines.
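
The grid evaluation and averaging steps can be sketched as follows. The function names are hypothetical, and each sampled density is assumed to be a list of values already evaluated on the grid. The spline interpolation is omitted; a library routine such as SciPy's `CubicSpline` with `bc_type="natural"` is one way to perform it.

```python
def make_grid(lo=30.0, hi=90.0, n=1200):
    """Evenly spaced evaluation points in pA, including both endpoints."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def posterior_mean_density(sampled_densities):
    """Average the sampled densities point-wise over the grid."""
    n_samples = len(sampled_densities)
    # zip(*...) walks the grid points, yielding one value per recorded sample.
    return [sum(column) / n_samples for column in zip(*sampled_densities)]


grid = make_grid()
# Two toy "samples" on a two-point grid, for illustration only.
toy_mean = posterior_mean_density([[0.0, 2.0], [2.0, 4.0]])
```

With 1200 points spanning 60 pA, adjacent grid points are about 0.05 pA apart.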