Abstract
Methylartist is a consolidated suite of tools for processing, visualising, and analysing nanopore methylation data derived from modified basecalling methods. All detectable methylation types (e.g. 5mCpG, 5hmC, 6mA) are supported, enabling integrated study of base pairs when modified naturally or as part of an experimental protocol.
Background
Covalent modification of nucleobases is an important component of genomic regulatory regimes across all domains of life [1–3] and is harnessed by various genomic footprinting assays, including DamID[4], SMAC-seq[5], and NOMe-seq[6]. Nanopore sequencing offers comprehensive assessment of base modifications from arbitrarily long sequence reads through analysis of electrical current profiles, generally through machine learning models trained to discriminate between modified and unmodified bases [7]. An increasing number of computational tools have been developed or enhanced for calling modified bases [8], including nanopolish [7], megalodon [9], and guppy [10], along with an increasing number of available pre-trained models.
Results and Discussion
Methylartist offers novel and useful visualisation outputs beyond those available through extant visualisation tools aimed at nanopore-derived methylation [11–13] in terms of the plots and options that it offers as well as support for arbitrary modification types. This has utility for identification of modified bases in assay-specific contexts which include GpC methylation (NOMe-seq), and 6mA (SMAC-seq, DamID in a 5′-GATC-3′ context, as well as native RNA base modifications). With few exceptions [14,15], most currently available models for calling modified bases involve some form of methylation or hydroxymethylation, so modifications will be referred to collectively as “methylation”, without loss of generality.
Modified bases are called from signal-level data stored in fast5 files using an appropriate basecalling model. Methylartist supports importing modified base calls from megalodon (db-megalodon), nanopolish (db-nanopolish) and guppy (db-guppy). Additional import methods for translating the basecalling output to the SQLite database format used by methylartist will be made available as the need arises. To demonstrate the capabilities of methylartist, we sequenced MCF-7 cells sourced from ATCC and from ECACC on the Oxford Nanopore Technologies PromethION platform. MCF-7 is a widely studied breast cancer cell line with many sub-lines expressing different cellular phenotypes [16]. We anticipated that sourcing cells originating from different repositories would yield divergent yet comparable methylation profiles for demonstration purposes.
The command “methylartist segmeth” aggregates methylation calls over segments into a table of tab-separated values, useful for comparing whole-genome methylation or methylation over various annotations such as promoters, enhancers, or transposable element families. The resulting table is useful on its own or as input to “methylartist segplot” or “methylartist composite”. Category-based methylation data aggregated with “segmeth” can be plotted either as strip or violin plots using the “segplot” command (Figure 1).
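The aggregation step described above can be illustrated with a minimal sketch. This is not the methylartist implementation: the function names (aggregate_calls, to_tsv), the tuple layout of the calls and segments, and the output column names are all assumptions made for illustration, and the sketch simply counts methylated and total calls per half-open segment before writing tab-separated rows.

```python
from collections import defaultdict


def aggregate_calls(calls, segments):
    """Count methylated and total calls falling inside each segment.

    calls    -- iterable of (chrom, pos, is_methylated) tuples (hypothetical layout)
    segments -- list of (chrom, start, end, label) tuples, half-open [start, end)
    """
    counts = defaultdict(lambda: [0, 0])  # segment -> [methylated, total]
    for chrom, pos, is_meth in calls:
        for seg in segments:
            seg_chrom, start, end, _label = seg
            if chrom == seg_chrom and start <= pos < end:
                counts[seg][1] += 1
                if is_meth:
                    counts[seg][0] += 1
    rows = []
    for seg in segments:
        meth, total = counts[seg]
        # Segments with no calls get no fraction rather than a misleading zero.
        frac = meth / total if total else None
        rows.append((*seg, meth, total, frac))
    return rows


def to_tsv(rows):
    """Render aggregated rows as tab-separated text (column names are assumed)."""
    lines = ["chrom\tstart\tend\tlabel\tmeth_calls\ttotal_calls\tmeth_frac"]
    for chrom, start, end, label, meth, total, frac in rows:
        frac_txt = "NA" if frac is None else f"{frac:.3f}"
        lines.append(f"{chrom}\t{start}\t{end}\t{label}\t{meth}\t{total}\t{frac_txt}")
    return "\n".join(lines)


example_calls = [
    ("chr1", 105, True),
    ("chr1", 150, False),
    ("chr1", 520, True),
    ("chr2", 10, True),
]
example_segments = [
    ("chr1", 100, 200, "promoter"),
    ("chr1", 500, 600, "enhancer"),
    ("chr2", 100, 200, "promoter"),
]
segment_rows = aggregate_calls(example_calls, example_segments)
print(to_tsv(segment_rows))
```

The nested loop is quadratic and is used here only for readability; an interval index would be the natural choice for genome-scale annotation sets.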
Locus- or region-specific plots can be created in two ways, depending on the size of the window. For local regions, “methylartist locus” generates plots such as Figure 2a, which depicts the methylation status of WNT7B, a gene known to be expressed in MCF-7 cells [17], and Figure 2b, which shows SHH, where a CpG island appears to be differentially methylated between the ATCC and ECACC cultivars. These plots include an optional track showing genes, methylation calls relative to aligned read positions, a translation from genome space into a modified base space consisting only of instances of the methylated motif, a plot of the methylation statistic (e.g. log likelihood ratio), and a smoothed sliding-window plot showing methylation fraction across the region. The “locus” plotting mode also supports separating methylation profiles by phase, if the .bam files are first phased via WhatsHap [18] to add the “PS” and “HP” tags. Using this feature we can show apparent haplotype-specific methylation patterns that differ between the ATCC and ECACC cultivars in the TP53BP1 gene (Figure 2c).
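To make the two coordinate-related panels concrete, the following sketch shows one way to translate genomic motif positions into a modified base space (the rank of each motif instance) and to compute a sliding-window methylation fraction over that space. The function names are hypothetical, and a plain moving average is used as a stand-in; it should not be read as the smoothing method that methylartist itself applies.

```python
def motif_space_index(motif_positions):
    """Map each genomic motif position to its 0-based rank among motif instances."""
    return {pos: rank for rank, pos in enumerate(sorted(set(motif_positions)))}


def sliding_fraction(site_fractions, window):
    """Moving average of per-site methylation fractions ordered in motif space.

    Returns one value per full window; an empty list if the window does not fit.
    """
    if window < 1 or window > len(site_fractions):
        return []
    running = sum(site_fractions[:window])
    smoothed = [running / window]
    for i in range(window, len(site_fractions)):
        # Slide by one site: add the entering value, drop the leaving one.
        running += site_fractions[i] - site_fractions[i - window]
        smoothed.append(running / window)
    return smoothed


# Three motif instances that are unevenly spaced in the genome become
# evenly spaced once expressed as ranks in motif space.
motif_index = motif_space_index([5000, 1000, 1002])
smoothed_profile = sliding_fraction([1.0, 0.0, 1.0, 1.0], window=2)
print(motif_index)
print(smoothed_profile)
```

Working in motif space means that a dense CpG island and a sparse intergenic stretch contribute equally per site, which is why the smoothed profile is computed over ranks rather than over base pairs in this sketch.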
For larger regions, “methylartist region” may yield a more expedient result as it aggregates methylation calls into bins, which can be normalised for occurrences of the methylation motif per bin. The meaning of “larger” here depends on the density of methylation motifs in the region, but generally 800 kbp to 1 Mbp is a reasonable threshold. Region plots can also span an entire chromosome efficiently (Figure 3). The presentation is similar to “methylartist locus” but without the methylation statistic plot, and the coordinate transformation is from genome space into binned modified base space where bin sizes are normalised to equalise content of the modified base motif. Unless overridden via other options, the alignment plot is removed for regions larger than 5 Mbp and the panels are rescaled appropriately. Both locus and region plots support an extensive set of parameters controlling dimensions, colour selection, highlighting, smoothing parameters, and panel ratios.
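The idea of bins normalised to equal motif content can be sketched as follows. This is an illustrative assumption about how such binning might be done, not the code used by methylartist: the function name equal_motif_bins and its motifs_per_bin parameter are hypothetical, and each bin is simply cut after a fixed number of motif instances, so bins vary in genomic width but not in motif count (except possibly the last).

```python
def equal_motif_bins(motif_positions, motifs_per_bin):
    """Split a region into bins that each hold the same number of motif instances.

    Returns (start, end, n_motifs) tuples with half-open genomic coordinates.
    The final bin may contain fewer motifs than the others.
    """
    if motifs_per_bin < 1:
        raise ValueError("motifs_per_bin must be at least 1")
    positions = sorted(motif_positions)
    bins = []
    for i in range(0, len(positions), motifs_per_bin):
        chunk = positions[i:i + motifs_per_bin]
        bins.append((chunk[0], chunk[-1] + 1, len(chunk)))
    return bins


# A dense cluster yields a narrow bin; sparse motifs yield a wide one.
region_bins = equal_motif_bins([10, 12, 15, 400, 900, 905, 2000], motifs_per_bin=3)
print(region_bins)
```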
To demonstrate analysis and plotting of non-CpG methylation in a relevant context, we carried out a version of SMAC-seq [5], in which nuclei are treated with EcoGII [19] to mark accessible chromatin with 6mA. We analysed the SMAC-seq data with megalodon using the “res_dna_r941_min_modbases-all-context_v001” model from the rerio repository (https://github.com/nanoporetech/rerio), created a methylartist database via “db-megalodon”, and used the “segmeth” utility in methylartist to identify loci from the Eukaryotic Promoter Database (EPD) [20] with high 6mA relative to unmodified adenine. In general, we see higher apparent 6mA methylation in regions defined by the EPD than in 50k size-matched regions of the genome drawn at random (Supplemental Figure 1). These loci were plotted en masse via the “locus” tool and screened visually; examples with corresponding CpG methylation plots are shown in Figure 4 and Supplemental Figure 2. Methylartist supports settings to improve visualisation of data where the expected distribution is a series of peaks, including the ability to limit inclusion of reads with an unusually high fraction of methylated bases (--maxfrac), and the option to skip sites below a threshold of methylated + unmethylated call coverage (--mincalls).
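The effect of the two filters can be illustrated with a short sketch. Only the option names --maxfrac and --mincalls come from the text above; the data layout, the function names, and the choice of inclusive comparisons (keep reads with fraction <= maxfrac, keep sites with coverage >= mincalls) are assumptions made for illustration and may differ from the behaviour of methylartist.

```python
def filter_reads_by_fraction(reads, maxfrac):
    """Keep reads whose fraction of methylated calls does not exceed maxfrac.

    reads -- dict of read_id -> list of (pos, is_methylated) (hypothetical layout)
    """
    kept = {}
    for read_id, calls in reads.items():
        if not calls:
            continue  # a read without calls carries no information
        frac = sum(1 for _pos, is_meth in calls if is_meth) / len(calls)
        if frac <= maxfrac:
            kept[read_id] = calls
    return kept


def filter_sites_by_coverage(reads, mincalls):
    """Return pos -> (methylated, unmethylated) for sites with enough calls."""
    counts = {}
    for calls in reads.values():
        for pos, is_meth in calls:
            meth, unmeth = counts.get(pos, (0, 0))
            counts[pos] = (meth + 1, unmeth) if is_meth else (meth, unmeth + 1)
    return {pos: c for pos, c in counts.items() if sum(c) >= mincalls}


example_reads = {
    "r1": [(1, True), (2, False), (3, False)],
    "r2": [(1, True), (2, True), (3, True)],  # fully methylated read
    "r3": [(1, False), (2, True)],
}
kept_reads = filter_reads_by_fraction(example_reads, maxfrac=0.8)
kept_sites = filter_sites_by_coverage(kept_reads, mincalls=2)
print(sorted(kept_reads))
print(kept_sites)
```

In a footprinting context, a read that is methylated at nearly every adenine is more likely to reflect naked or over-treated DNA than chromatin structure, which is the motivation for excluding it before peaks are visualised.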
In order to facilitate the study of methylation patterns across families of highly duplicated sequences such as transposable elements [21], methylartist supports a “composite” methylation plot, which aligns each instance of a repeat element family to a user-supplied consensus sequence and shows the methylation profile of a user-defined number of individual elements (Figure 5). Finally, the “wgmeth” tool in methylartist can also output bedMethyl files and files suitable for input to DSS, a package for assessing differential methylation [22].
Conclusion
We have demonstrated that methylartist has substantial utility as a plotting tool and as an accessible augmentation to the available tools for analysis and visualisation of nanopore-derived methylation data, including non-CpG methylation useful in chromatin footprinting assays. Functionality will be expanded and updated in the future as use cases arise and as methods for analysis of nanopore data continue to evolve. For instance, the ability to seamlessly compare the same dataset on multiple modified basecallers or models could be useful for benchmarking applications. For demonstration purposes, we sequenced MCF-7 cultivars from two sources (ATCC, ECACC). While a comprehensive assessment of differences between cultivars is beyond the scope of this paper, it is well documented that significant differences exist between MCF-7 sub-lines [16], which is reflected in some of the examples used here. Methylartist provides a set of readily usable computational tools with which comprehensive assessment and visualisation of inter- and intra-sample methylation is possible, including allele-specific methylation (Figure 2c) and methylation in regions that are difficult to access comprehensively with short-read or hybridisation-based methods, such as transposable elements [21].
Methods
Cell culture
MCF-7 cells (ATCC, ECACC) were grown to 60-80% confluency in high-glucose Dulbecco’s Modified Eagle Medium (DMEM, Life Technologies) supplemented with 10% heat-inactivated Fetal Bovine Serum (Life Technologies), 2mM L-Glutamine (Life Technologies) and 100U/mL Penicillin-Streptomycin solution (Life Technologies). Cells were washed with Dulbecco’s Phosphate Buffered Saline (DPBS, Life Technologies), lifted with Trypsin 0.25% EDTA (Life Technologies), pelleted by centrifugation, and washed again with DPBS.
Long-read PromethION sequencing
Genomic DNA was isolated using a Circulomics Big DNA Tissue Kit. Both the high molecular weight (HMW) and ultra-high molecular weight (UHMW) protocols were carried out for each cultivar (ATCC, ECACC) according to the manufacturer’s instructions, for a total of 4 PromethION flow cells (Supplemental Table 1). Due to high DNA viscosity, the UHMW protocol was modified to include vortexing after addition of CLE3 digestion buffer and incubation at 37°C instead of room temperature. For simplicity and for demonstration purposes, replicates were combined within each cultivar to yield one high-depth ATCC sample and one high-depth ECACC sample.
SMAC-seq
For both experimental conditions tested, 1×10^6 MCF-7 cells (ATCC) were resuspended in 500 μl of ice-cold Cell Fractionation Buffer (Abcam) and incubated for 10 minutes on ice. Nuclei were pelleted by centrifugation for 3 min at 500g, then resuspended in 200 μl of ice-cold Nuclei Wash Buffer (10 mM Tris pH7.4, 10mM NaCl, 3mM MgCl2, 0.1 mM EDTA). The nuclei were then pelleted by centrifugation for 3 min at 500g and resuspended in EcoGII reaction buffer (1X NEB CutSmart Buffer, 0.3 M sucrose). 200U of EcoGII and 0.6 mM SAM were added and nuclei were incubated either at 37°C for 10 min (condition one) or incubated at 37°C with 1000 rpm agitation for 15 min, with SAM replenished at 7.5 min (condition two). DNA was extracted with the Monarch® Genomic DNA Purification Kit (NEB) according to manufacturer’s instructions with mixing by inversion instead of vortexing. 1 μg of Genomic DNA was prepared for Nanopore sequencing using the Ligation Sequencing Kit (Nanopore LSK110). Samples were sequenced for 72 hr on an r9.4 MinION flowcell on the Nanopore MinION Mk1C.
Read alignment and variant calling
Nanopore reads were aligned to hg38 via minimap2 2.17[23] with parameters -a -x map-ont --cs:long --MD. Illumina reads were mapped to hg38 via bwa mem2 2.0pre2 (sse4.1) with default parameters and duplicate reads were marked via the MarkDuplicates tool in Picard 2.23.8(http://broadinstitute.github.io/picard).
Phasing
To inform phasing, variants were detected in the MCF-7 Illumina data using HaplotypeCaller in GATK 4.1.9.0[24]. The resulting VCF was phased using whatshap 1.0[18] and aligned nanopore reads. Nanopore reads were then tagged with haplotypes using the ‘haplotag’ function of whatshap.
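Once reads carry haplotype tags, methylation calls can be separated by haplotype. The sketch below assumes reads have already been loaded into plain Python dictionaries (the layout and the function name methylation_by_haplotype are hypothetical); only the “HP” tag name is taken from the text. Reads without an “HP” tag are grouped as unphased.

```python
from collections import defaultdict


def methylation_by_haplotype(reads):
    """Compute the methylated fraction of calls per haplotype.

    reads -- iterable of dicts with 'tags' (dict, may contain 'HP')
             and 'calls' (list of booleans, True = methylated)
    """
    totals = defaultdict(lambda: [0, 0])  # haplotype -> [methylated, total]
    for read in reads:
        haplotype = read["tags"].get("HP", "unphased")
        for is_meth in read["calls"]:
            totals[haplotype][0] += int(is_meth)
            totals[haplotype][1] += 1
    return {hap: meth / total for hap, (meth, total) in totals.items() if total}


tagged_reads = [
    {"tags": {"HP": 1, "PS": 1000}, "calls": [True, True]},
    {"tags": {"HP": 2, "PS": 1000}, "calls": [False, True]},
    {"tags": {}, "calls": [False]},
]
haplotype_fractions = methylation_by_haplotype(tagged_reads)
print(haplotype_fractions)
```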
Methylation calling
Basecalling along with modified base calls was done using megalodon 2.2.9 with guppy 4.4.0 using the res_dna_r941_prom_modbases_5mC_v001 model for 5mCG detection (PromethION), or the res_dna_r941_min_modbases-all-context_v001 model for 6mA detection (SMAC/MinION).
Implementation
Methylartist is implemented in Python using SQLite[25], matplotlib[26], seaborn[27], numpy[28], scipy[29], pandas[30], scikit-bio[31], pysam[32](https://github.com/pysam-developers/pysam), bx-python(https://github.com/bxlab/bx-python), and the ONT fast5 API(https://github.com/nanoporetech/ont_fast5_api). Methylartist is available at https://github.com/adamewing/methylartist and can also be installed via pip (pip install methylartist).
Command-line arguments to methylartist for all figures presented in this manuscript are available in supplemental materials. Additional documentation and examples are available at https://github.com/adamewing/methylartist.
Ethics approval and consent to participate
Not applicable.
Data Availability
The dataset supporting the conclusions of this article is available in the NCBI Sequence Read Archive (SRA) repository as BioProject PRJNA748257.
Competing interests
The authors declare that they have no competing interests.
Funding
The Translational Research Institute is supported by a grant from the Australian Government. This study was funded by the Australian Department of Health Medical Research Future Fund (MRFF) (MRF1175457 to A.D.E.), the Australian National Health and Medical Research Council (NHMRC) (GNT1161832 to S.W.C.), the University of Queensland Genome Innovation Hub and the Mater Foundation.
Contributions
SWC cultured cells, carried out SMAC-seq, and tested methylartist. MK cultured cells and extracted DNA for PromethION sequencing. ADE wrote methylartist and wrote the manuscript with input and contributions from all authors.
Acknowledgements
The authors thank members of the Mater Research Genome Plasticity and Disease group, in particular G. Faulkner and P. Gerdes for testing methylartist and P. Carreira for technical assistance, as well as the Kinghorn Centre for Clinical Genomics for providing PromethION sequencing services and Macrogen Oceania for providing Illumina sequencing services. The authors acknowledge the Translational Research Institute (TRI) for research space, equipment, and core facilities that enabled this research.