Analysis of mitochondrial genome methylation using Nanopore single-molecule sequencing

The level and the biological significance of mitochondrial DNA (mtDNA) methylation in human cells is a controversial topic. Using long-read third-generation sequencing technology, mtDNA methylation can be detected directly from the sequencing data, which overcomes previously suggested biases, introduced by bisulfite treatment-dependent methods. We investigated mtDNA from whole blood-derived DNA and established a workflow to detect CpG methylation with Nanopolish. In order to obtain native mtDNA, we adjusted a whole-genome sequencing protocol and performed ligation library preparation and Nanopore sequencing. To validate the workflow, 897bp of methylated and unmethylated synthetic DNA samples at different dilution ratios were sequenced and CpG methylation was detected. Interestingly, we observed that reads with higher methylation in the synthetic DNA did not pass Guppy calling, possibly affecting conclusions about DNA methylation in Nanopore sequencing. We detected in all blood-derived samples overall low-level methylation across the mitochondrial genome, with exceptions at certain CpG sites. Our results suggest that Nanopore sequencing is capable of detecting low-level mtDNA methylation. However, further refinement of the bioinformatical pipelines including Guppy failed reads are recommended.


Introduction
Methylation of the mitochondrial DNA (mtDNA) has been previously studied in various biological contexts 1,2 . In terms of environmental and lifestyle factors, mtDNA methylation changes have been observed to be associated with air pollution and smoking [3][4][5] . Specific CpG methylation status within different genes of the mtDNA correlates with airborne particulate matter 3,4 . Furthermore, in smokers, methylation levels of particular CpG sites were higher compared to past-and non-smokers 5 . Aside from environmental factors, changes in mtDNA methylation have been reported to be related to neurodegeneration such as Alzheimer's disease (AD), Parkinson's disease (PD) 6 and amyotrophic lateral sclerosis (ALS) 7 . Nuclear genome methylation correlates with age 8,9 , and it has been suggested that mtDNA can behave similarly, as an association between methylation level of two CpG sites within the mtDNA and age has been observed 10 .
Given the importance of environmental impact on mtDNA and its association with disease, the investigation of mtDNA CpG methylation is of interest. However, the level of methylation and its biological significance is still under debate. The underlying reason for this debate can be due to methodology.
The most common method to detect mtDNA methylation has been bisulfite treatment. With bisulfite treatment, the characterization of the methylation pattern depends on the resistance of 5-methylcytosine (5-mC) to be converted to uracil. However, the secondary and tertiary structure of intact mtDNA may block cytosines from bisulfite conversion, which influences the outcome of the detected methylation levels 11,12 .
Bisulfite conversion-resistant cytosines can lead to an overestimation of mtDNA methylation. To counteract the bias of bisulfite-resistant cytosines, linearization of mtDNA has been recommended 11 . For linearized mtDNA prior to bisulfite treatment, lower levels of methylation have been observed compared to intact mtDNA 12,13 . The low levels of methylation have led to suggestions of this being artifactual meaning an absence of methylation in the mitochondrial genome altogether 12,14 . On the other hand, there has also been supportive evidence for the presence of mtDNA methylation after linearization 15,16 . Furthermore, cytosines in a non-CpG context have been reported to be relatively highly methylated 16,17 .
Using long-read, single-molecule sequencing technologies like Nanopore sequencing 18 , it is possible to overcome limitations of the indirect measurement of methylation with bisulfite treatment, as it can be detected directly from the squiggle signals 19 . Nanopore sequencing enables the direct sequencing of fulllength linearized mtDNA reads in contrast to the shorter fragments obtained from second-generation sequencing 20 . To our knowledge, there are currently only two studies that have used Nanopore sequencing to investigate mtDNA methylation 21,22 . Thus, detecting mtDNA methylation with this novel sequencing technology remains to be thoroughly investigated and validated.
Herein, we focus on the exploration of CpG methylation on the mtDNA using Nanopore sequencing. We present a workflow to analyze mtDNA CpG methylation from human blood-derived DNA using a modified library preparation step and analysis pipeline on six individuals. In addition, we compare strand-specific CpG methylation of the mtDNA to the nuclear-encoded 45S rRNA gene. We then validated our methylation pipeline with a short 897 bp synthetic DNA product, suggesting that highly methylated reads are prone to lower Guppy base-calling Phred quality scores.

Demographics of individuals
All individuals included in this study provided informed consent and local ethics committee approval at the Research Ethics Board of the University of Luebeck was obtained. In this study, six individuals were included (L-3244, L-13062, L-6581, L-2131, L-2132 and L-2135). Three of the included individuals are patients with PD who also have Parkin mutations, with two being biallelic mutation carriers (L-3244 and L-13062) and one heterozygous carrier (L-6581) (Supplementary Table 1). The other three are healthy control subjects (L-2131, L-2132 and L-2135). Movement disorder specialists performed the clinical assessments of patients.
The genetic screening for Parkin was performed with Sanger Sequencing and MLPA from blood-derived DNA, as previously described 23 . DNA extraction from blood was performed with the QIAGEN blood DNA midi kit following the manufacturer's instructions.

Library preparation and Nanopore sequencing
Whole genomes DNA concentration of the blood-derived DNA was quantified with Qubit fluorometric quantification using the dsDNA BR Assay kit (Thermo Fisher Scientific). Library preparation of 1.5 µg input DNA from the six DNA samples was performed with the ligation sequencing kit (SQK-LSK109) following a Nanopore wholegenome sequencing protocol (premium WGA). To conserve DNA methylation, whole-genome amplification with random primers was omitted from this established protocol (the intended first step of the premium WGA protocol).
DNA concentration was quantified as described above and 1 µg input DNA was used. The library was prepared using the Nanopore ligation sequencing kit (SQK-LSK109) and multiplexed with the native barcoding expansion (EXP-NBD104).

Sequencing
Nanopore sequencing was performed with the MinION using the R9. for Nanopore community members (https://community.nanoporetech.com). We aimed to obtain 10 Gigabases (Gb) of data before stopping the run which lasted ~24 hours.

DNA methylation analysis
Base-called Nanopore reads were aligned to the reference sequence (i.e., mitochondrial genome, hg38 assembly), using Minimap2 (version 2.17). DNA methylation in a CpG dinucleotide context was called with Nanopolish (version 0.13.2), a software tool that uses a Hidden Markov Model (HMM) 24 . Nanopolish reports the log-likelihood ratio for each observed event (i.e., CpG site within a k-mer sequence). The methylation frequency (MF) was calculated with the default threshold of the log-likelihood ratio. Thus, with a loglikelihood ratio >2 the CpG site was classified as methylated and with a value <-2 the site was classified as unmethylated. In addition, the split-group parameter was used, which provides the MF of each detected CpG site in the reference sequence separately. The MF describes the proportion of reads that support methylation at the given CpG site. To investigate the methylation of the synthetic DNA, alignment, methylation calling, and calculation of MF were performed. The D5405 reference sequence provided by the manufacturer was used. Reads that passed Guppy calling, as well as failed reads, were analyzed.

Plus-and minus-strand methylation comparison
To explore heavy-strand (H-strand) and light-strand (L-strand) methylation, methylation calls were stratified by plus-and minus-strand and then the MF was calculated. The plus-strand represents the L-strand and the minus-strand the H-strand.
In order to compare mtDNA to a nuclear-encoded gene, we used the 45S rRNA gene (13 kb core region of the 45S cluster 5). There are multiple copies of the 45S rRNA gene in the human genome, comparable in size and coverage to the mtDNA. DNA methylation of the 45S rRNA gene was stratified by plus-and minusstrand to allow comparison to the H-and L-strand of mtDNA.

Coverage and mtDNA methylation analysis
In order to investigate the relationship between coverage and MF, different subsets of the obtained sequencing data (i.e., Fast5 and FastQ files) with increasing coverage were used to detect methylation. The mean coverage and MF of each data set were calculated.

Statistical analysis
For statistical analysis, GraphPad Prism software (version 9) was used. To compare median MF between plus-and minus-strand, the non-parametric Mann-Whitney U test for pairwise comparison was performed.
For correlation analyses, the non-parametric Spearman correlation was used. All p-values were exploratory and not corrected for multiple testing.

Results and Discussion
We investigated mtDNA methylation in six blood-derived samples from three patients with PD carrying Parkin mutations and three control subjects. We obtained a mean of 10 Gb of data and a read length of 5.4kb with whole-genome Nanopore sequencing. Alignment to the mitochondrial genome (hg38) resulted in a mean coverage of 269X (SD=±58.59).

Mitochondrial DNA methylation
Overall, we detected low-level methylation (mean MF±SD=0.060±0.054) of mtDNA across the six samples.
However, there were 32 unique CpG sites, where an MF higher than 0.2 was detected (range of MF=0.2-0.793, Figure 1). These sites were located in the following genes:

Heavy-and light-strand methylation differences
We next explored differences in methylation of the H-and L-strand. The H-strand has a higher purine content (G and A) compared to the L-strand and encodes 12 out of the 13 mitochondrial-encoded proteins 25 . As the reference sequence of the mitochondrial genome represents the L-strand, Nanopolish reports the L-strand as the plus-strand and the H-strand as the minus-strand 24 .
The MF detected from the mtDNA plus-and minus-strands was positively correlated (Spearman's r=0.3683, Spearman's exploratory p-value<0.0001, Figure 2A). For comparison, the 45S rRNA gene was used as a nuclear-encoded comparison, since it has comparable size and coverage to the mtDNA (size=13 kb, mean coverage=623X). The overall correlation between plus-strand and minus-strand methylation was stronger in the 45S rRNA gene (Spearman's r=0.7482, Spearman's exploratory p-value<0.0001, Figure 2B Given the low effect of differences between L-and H-strands, we cannot suggest strand-specific CpG methylation in mtDNA.

Relationship between the coverage, number of called sites and methylation frequency
The mean MF for the six individuals was at ~0.06 with >100X coverage ( Figure 3A). The relationship between coverage and MF of all detected individual CpG sites did not correlate (Spearman's r=0.0009, Spearman's exploratory p-value>0.05, Figure 3B). However, the number of called sites and MF was negatively correlated (Spearman's r=-0.3114, Spearman's exploratory p-value<0.0001, Figure 3C). The number of called sites represents how often a given CpG site was evaluated as methylated or unmethylated by Nanopolish 24 . Likewise, a negative association between coverage and the detected methylation frequency was shown with whole-genome bisulfite sequencing 12 and there were speculations that bisulfite conversionresistant cytosines were the culprit. However, the detection of mtDNA methylation from the Nanopore sequencing data does not depend on bisulfite conversion and we observed a plateau of the mean MF after 100X coverage. Furthermore, with all generated sequencing data included, we did not observe a negative association between the coverage and the MF of detected CpG sites. On the other hand, we observed a negative correlation between the number of called sites and the detected MF, which could be due to the general low-level methylation of the mtDNA.

Synthetic DNA
To further validate the workflow, we performed Nanopore sequencing of an 897 bp synthetic methylated and unmethylated control DNA sample at different dilution ratios (0%, 50% and 100%). We obtained a mean of 14Mb data and a read length of 963 bp. As the three samples were multiplexed for sequencing, barcodes with a length of 40 bp were added on both sides of the DNA fragments, which led to the higher mean read length of 963 bp compared to the 897 bp reference sequence. Alignment resulted in a mean coverage of 5018X (SD=±424X).

Synthetic DNA methylation
The CpG methylation of three samples with a ratio of 0%, 50% or 100% of methylated synthetic DNA molecules was investigated ( Figure 4A). In the 0% methylated sample, an MF=0.021 (SD±0.042) was detected similar to the 50% methylated sample (mean MF±SD=0.022±0.038). In the 100% methylated sample, the detected methylation was higher (mean MF±SD=0.262±0.141). Particularly at CpG sites (e.g. D4505:195-350), the detected methylation level in the 0% methylated sample was higher than zero (range of 8 MF=0.014-0.193), which could indicate false-positive calls. On the other hand, the detected methylation levels were overall lower than expected in the 50% and 100% methylated samples and the detected methylation levels in the 0% and 50% methylated samples were almost indistinguishable.

Investigation of Guppy failed reads
To investigate the possibility of preferential sequencing of unmethylated reads and methylated reads which do not pass the quality threshold during base-calling, we analyzed Guppy failed reads. During the Guppy base-calling, the software reports Phred quality scores (qscores) of each read, which indicates the accuracy of the base-calling. Reads with low qscores are classified as failed. Exploring these failed reads only, we detected low methylation in the 0% methylated sample (mean MF±SD=0.057±0.123) and higher levels in the 50% methylated sample (mean MF±SD=0.430±0.168) and the 100% methylated sample (mean MF±SD=0.880±0.156). Thus, the detected methylation was overall higher in Guppy failed reads compared to Guppy passed reads and showed a better representation of the dilution ratios ( Figure 4B). As the detected MF in the 0% methylated sample was higher in the failed reads compared to the passed, there is the possibility of even more false-positive calls.
Including both the passed and failed reads, the proportion of failed reads was n=55/6550 in the 0% methylated sample, n=26068/56577 in the 50% methylated sample and n=44505/51555 in the 100% methylated sample ( Figure 4C), which further shows that methylated reads are prone to lower Guppy qscores. Furthermore, there was a negative correlation between the number of called sites and the detected MF from passed reads in the 100% methylated sample (Spearman's r=-0.2987, Spearman's exploratory p-value=0.0315, Figure 4D) compared to a positive correlation when the MF was detected from the failed reads (Spearman's r=0.5574, Spearman's exploratory p-value<0.0001, Figure 4E). In the 50% methylated sample, there was a similar change in the direction of the relationship between the number of called sites and MF in the passed reads (Spearman's r=-0.6078, Spearman's exploratory p-value<0.0001) compared to the failed reads (Spearman's r=0.3048, Spearman's exploratory p-value=0.0280). In contrast, the 0% methylated sample showed a negative correlation between the number of called sites and the MF in the passed reads (Spearman's r=-0.6452, Spearman's exploratory p-value=<0.0001), but there was no correlation in the failed reads (Spearman's r=0.2310, Spearman's exploratory p-value=ns).
Although the methylation patterns from the failed reads were more representative of the corresponding dilution ratios of methylated DNA, a higher probability for sequencing errors will be present in the failed reads.

Reanalysis of mitochondrial methylation including Guppy failed reads
To consider the results obtained from the investigation of the synthetic DNA samples, mtDNA methylation was reanalyzed, including Guppy passed and failed reads. We obtained a mean coverage of 280X (SD=±62X) of the mitochondrial genome, which was on average 11X more coverage compared to just including the passed reads. Since the number of failed reads were low, the overall change in the MF was marginal (Supplementary Table 2). However, the methylation "peak" in the D-loop, for samples L-3244 (MF=0.793) and L-6581 (MF=0.688) was even higher after the inclusion of the failed reads (L-3244: MF=0.797, L-6581: MF=0.695). Thus, reanalysis of highly methylated CpG sites within mtDNA may be of importance.

Review of literature on mtDNA Nanopore sequencing
We performed a literature search for the terms "mtDNA" and "Nanopore" and found 36 publications, seven of which performed Nanopore sequencing of human mtDNA 21,22,[26][27][28][29][30] (Supplementary Figure 3). In these studies, the coverage of the mitochondrial genome ranged from 60.4X 21 to over 10,000X 22 (Supplementary   Table 3). In the latter case, extraction of mtDNA from a subcellular fraction was performed before Nanopore sequencing, which enhanced the fraction of sequenced reads that mapped to the mitochondrial genome.
Two out of the six publications investigated mtDNA methylation in the context of cancer research 21,22 .
Although in our study, we included patients with PD and Parkin mutations, we did not further investigate the significance of mtDNA methylation differences in the context of PD. Possible differences in mtDNA methylation between patients and control subjects in this small sample set are speculative and future studies to thoroughly investigate mtDNA methylation in PD are warranted. For the published studies in cancer cells, Nanopolish was used to call methylation from the Nanopore sequencing data 21,22 . In addition to Nanopolish, one study used Guppy in combination with Medaka 22 . Low-level mtDNA methylation in these two studies has been described 21,22 , albeit one study evaluated only three specific sites. We also detected overall lowlevel mtDNA methylation in blood-derived DNA, in concordance with reported literature 21,22 .

Strengths and limitations
In our study, we present an optimized protocol for the investigation of mtDNA CpG methylation. We obtained high coverage of the mtDNA as well as the nuclear-encoded 45S rRNA gene and the synthetic DNA samples. The mean coverage of the mtDNA ranged from 177X to 320X across individuals. Linearization of mtDNA has been recommended to counteract the influence of secondary and tertiary structure in bisulfitedependent methods [11][12][13] . Thus, we performed T7 Endonuclease I digestion of whole-cell DNA for singlemolecule sequencing. Still, the sample size of our study was small with six included individuals and our analysis was limited to DNA methylation in a CpG context. As previous studies have shown that there is non-CpG methylation present in the mtDNA 16,31 , future workflows should be expanded to include non-CpG methylation, as well. We were also unable to measure the exact amount of linearized mtDNA contained in the sample, which can have an impact on the reflection of MF. Given these limitations, we investigated two different types of genetic material in addition to the 16.5 kb mitochondrial genome to validate our workflow i.e, an 897bp synthetic DNA sample and the native 13kb 45S rRNA nuclear-encoded gene. Lastly, all experiments and analyses were performed in the same institute to reduce batch effects.

Conclusion
In conclusion, our results suggest the applicability of Nanopore sequencing for the investigation of mtDNA methylation. Overall, we detected low-level CpG methylation of mtDNA with exceptions for certain sites.    Relationship between the number of called sites and methylation frequency in the sample with 100% methylated reads, only reads that passed Guppy quality-threshold were included in the analysis. E) Relationship between the number of called sites and methylation frequency in the sample with 100% methylated reads, only reads that failed Guppy quality-threshold were included in the analysis.