Abstract
Covalent DNA modifications, such as 5-methylcytosine (5mC), are increasingly the focus of numerous research programs. In eukaryotes, both 5mC and 5-hydroxymethylcytosine are now recognized as stable epigenetic marks, with diverse functions. Bacteria, archaea, and viruses contain various modified DNA nucleobases, including several in which one base is largely or entirely replaced by a particular covalent modification. Numerous databases describe RNA and histone modifications, but no database specifically catalogues DNA modifications, despite their broad importance as an element of epigenetic regulation. To address this need, we have developed DNAmod: the DNA modification database. DNAmod is an open-source database (http://dnamod.hoffmanlab.org) that catalogues DNA modifications and provides a single source to learn about their properties. DNAmod provides a web interface to easily browse and search through its modifications. The database annotates the chemical properties and structures of all curated modified DNA bases, and a much larger list of candidate chemical entities. DNAmod includes manual annotations of available sequencing methods and descriptions of each modification's occurrence in nature, and provides existing and suggested nomenclature. DNAmod enables researchers to rapidly review previous work, select mapping techniques, and track recent developments concerning modified bases of interest.
Introduction
A rapidly growing body of research is continuing to reveal numerous gene-regulatory effects of covalent DNA modifications, such as 5-methylcytosine (5mC). We now recognize 5mC as a stable epigenetic mark and as having diverse functions beyond transcriptional repression1. An increasing number of studies demonstrate the importance of other cytosine modifications, such as 5-hydroxymethylcytosine (5hmC), 5-formylcytosine (5fC), and 5-carboxylcytosine (5caC)2–6. More recently, three analogous modifications of thymine were found to occur in mammals7, 8 and can now largely be sequenced9. N6-methyladenine, previously thought to mainly occur as an RNA modification, has now been found in the DNA of multiple eukaryotes10. Bacteria, archaea, and especially bacteriophages have long been known to have a diverse array of modified11 and hypermodified bases: modified DNA bases that largely or completely replace an unmodified base12.
RNA modifications are profiled across multiple databases, including RNAMDB13, MODOMICS14, and RMBase15. Furthermore, histone modifications in humans are catalogued in HHMD16. Despite widespread recognition of DNA modifications as an important element of epigenetic regulation, no database exists to catalogue them. Some databases include particular classes of DNA modifications17, such as restriction endonucleases and DNA methyltransferases in REBASE18; methylation databases, like MethDB19; databases including DNA metabolic pathways, such as KEGG20; and those focused on DNA damage and repair, like REPAIRtoire21. There is, however, a pressing need to focus upon DNA modifications from a broad perspective and organize them in a single location. To address this, we have created DNAmod: the DNA modification database (http://dnamod.hoffmanlab.org). DNAmod is the first database to comprehensively catalogue DNA modifications and provides a single resource to launch an investigation of their properties.
Database construction and visualization
DNAmod consists of two components: a relational database back-end and a web interface front-end. We use the Chemical Entities of Biological Interest (ChEBI) database22,23 to seed DNAmod, importing a nucleobase-related subset of its contents, consisting of chemical entities and related annotations. We perform queries against the entities to construct a set of candidate DNA modifications for DNAmod, retaining most of these as a separate unverified set. Then, we filter candidate entities into a curated set of verified DNA modifications, augmenting them with modification-specific annotations. Finally, we provide the ability to either search or browse through the catalogue of DNA modifications, integrating ChEBI’s information with our own.
Identifying candidate DNA modifications from ChEBI
DNAmod leverages the ChEBI database23 to define a set of modified DNA candidates for inclusion and to add preliminary information for each candidate. ChEBI is a database of small biologically relevant molecules, which affect living organisms. We query ChEBI via ChEBI Web Services23. We use Biopython24 and the Python Simple Object Access Protocol client, suds25, to query ChEBI and construct the DNAmod database.
ChEBI provides an ontology which encodes the relationships between its compounds. We use this ontology to define precisely the notion of parents and children, which we use to hierarchically retrieve and display modifications. We use two kinds of relationships for this purpose, each of which can also be represented by their associated symbols, defined by ChEBI22: has functional parent and is a (Δ). We use these relationships to find candidate DNA modifications, by identifying entities related to the core nucleobases, which we represent by their symbols: {A, C, G, T, U}. We include uracil, since many of its descendents in the ontology are modifications of thymine (CHEBI:17821, which is equivalent to 5-methyluracil), and are not annotated as descendents of thymine itself. For each of these bases, we import all entities that are annotated in the ontology as a child of one of these bases, via the has functional parent relationship. ChEBI ranks entities based on their degree of curation. We only import entities with the highest rating, three stars, indicating manual curation by ChEBI. Whenever possible, we only include entities as nitrogenous bases (nucleobases). If not available, we then select their nucleoside form and finally, if necessary, the nucleotide. These imported bases form the candidate set of modifications (the “unverified” set), from which we create a curated set of DNA modifications (the “verified” set).
The ChEBI ontology does not generally encode has functional parent relationships for nucleobases beyond the children of the unmodified nucleobases. It instead encodes modified nucleobases with an is a relationship to their parent base. This is because descendent entities of specific modifications are generally subtypes of the class of modifications from which they originate. For example, 3-methyladenine is a methyladenine. Methyladenine, however, has functional parent adenine, since it is conceived of as possessing adenine as a characteristic group and as being derived via functional modification22. We therefore need to make use of both of these relationships, within the ChEBI ontology, to accurately capture the desired nucleobase hierarchy.
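The import rules above (three-star entries only, preferring the nucleobase, then the nucleoside, then the nucleotide) can be sketched as follows. This is a minimal illustration over toy records, not DNAmod's actual code: the record fields (`modification`, `form`, `stars`), the grouping key, and the two "example" entries are all hypothetical.

```python
# Preference order described in the text: nucleobase, then nucleoside, then nucleotide.
FORM_PREFERENCE = ("nucleobase", "nucleoside", "nucleotide")


def select_candidates(entities, min_stars=3):
    """Keep one entity per modification, using the most preferred available form.

    `entities` is a list of dicts with hypothetical keys: name, modification,
    form, and stars (the curation rating).
    """
    best = {}
    for entity in entities:
        if entity["stars"] < min_stars:
            continue  # skip entries below the required curation rating
        key = entity["modification"]
        rank = FORM_PREFERENCE.index(entity["form"])
        if key not in best or rank < FORM_PREFERENCE.index(best[key]["form"]):
            best[key] = entity
    return best


# Toy input: illustrative records only, not real ChEBI annotations.
entities = [
    {"name": "5-methylcytosine", "modification": "5mC", "form": "nucleobase", "stars": 3},
    {"name": "5-methylcytidine", "modification": "5mC", "form": "nucleoside", "stars": 3},
    {"name": "example nucleotide", "modification": "X", "form": "nucleotide", "stars": 3},
    {"name": "example nucleoside", "modification": "X", "form": "nucleoside", "stars": 2},
]
selected = select_candidates(entities)
```

In this toy input, the nucleoside form of the second modification is rated below three stars, so the sketch falls back to the nucleotide.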
ChEBI also provides selected citations, associated with some of its entities. We query ChEBI for its citations, via their PubMed IDs26. We use the Biopython24 package Bio.Entrez to query the PubMed citation database, using NCBI’s Entrez Programming Utilities26. We retrieve the details of each citation, and use them to construct a formatted citation. At this time, we only support publications that are indexed in PubMed.
Manual curation and annotation
We manually create a whitelist, which contains our curated (or “verified”) set of candidates that we deem DNA modifications. For each of these bases, we also import all descendents with an eventual has functional parent or is a relationship with any of the members of the verified set. We expand the verified set to include any bases recursively imported in this manner, since they are descendents of verified DNA nucleobases. This rule has one exception: we exclude any bases that possess an ancestor in our blacklist, a curated list of specific entities to exclude, as non–DNA modifications.
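We proceed to formalize the above description, of bases imported based upon the ChEBI ontology22 and their filtering, as follows. Let $a \multimap b$ specify that $a$ has the has functional parent relationship with $b$. Similarly, let $a \mathbin{\Delta} b$ specify that $a$ has the is a relationship with $b$. The definition of $\multimap$ is transitive: for all $n$ entities, $l_i$, for $i = 0$ to $n - 1$, between $a$ and $b$: $a \multimap l_{n-1} \multimap \cdots \multimap l_0 \multimap b \implies a \multimap b$. The analogous definition holds for $\Delta$. We call each $l_i$ a child of $l_{i-1}$ and call each $l_{i-1}$ a parent of $l_i$. We refer to $a$ as a descendent of $b$ and refer to $b$ as an ancestor of $a$. Let $\mathcal{U}$ represent the first level of children of the unmodified nucleobases, such that $\mathcal{U} = \{u \mid \exists n \in \{\mathrm{A}, \mathrm{C}, \mathrm{G}, \mathrm{T}, \mathrm{U}\} : u \multimap n\}$. Let $\mathcal{V}$ represent the manually-annotated, verified proper subset of $\mathcal{U}$. Finally, we manually curate a blacklist of excluded entities, $B$, satisfying: $B \cap \mathcal{V} = \emptyset$. We import the set of verified DNA modifications, $M$, defined in set-builder notation with predicates, as:
$$M = \{\, m \mid (m \in \mathcal{V} \lor \exists v \in \mathcal{V} : (m \multimap v \lor m \mathbin{\Delta} v)) \land \nexists b \in B : (m = b \lor m \multimap b \lor m \mathbin{\Delta} b) \,\}$$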
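The whitelist, blacklist, and descendent rules can be illustrated with a short sketch. It assumes a toy ontology stored as a child-to-parents mapping in which has functional parent and is a links are merged into a single parent relation, so chains that mix the two relationships also count; that merging is a simplifying assumption. The function names and the entities `mod_a`, `mod_b`, and so on are hypothetical.

```python
def ancestors(entity, parents):
    """Return every entity reachable by following parent links transitively."""
    seen = set()
    stack = list(parents.get(entity, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def verified_modifications(parents, verified, blacklist):
    """Collect verified entities and their descendents, minus blacklisted lineages."""
    result = set()
    for entity in parents.keys() | verified:
        lineage = ancestors(entity, parents)
        # Included if whitelisted itself or descended from a whitelisted entity.
        included = entity in verified or bool(lineage & verified)
        # Excluded if blacklisted itself or descended from a blacklisted entity.
        excluded = entity in blacklist or bool(lineage & blacklist)
        if included and not excluded:
            result.add(entity)
    return result


# Toy ontology: each key maps to its direct parents (either relationship type).
parents = {
    "mod_a": {"C"},
    "mod_a1": {"mod_a"},
    "mod_a2": {"mod_a"},
    "mod_a2x": {"mod_a2"},
    "mod_b": {"C"},
}
imported = verified_modifications(parents, verified={"mod_a"}, blacklist={"mod_a2"})
```

Here `mod_a2` and its child are dropped because of the blacklist, and `mod_b` is never imported because it is neither whitelisted nor descended from a whitelisted entity.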
We additionally provide two kinds of manual annotations: sequencing techniques and occurrence in nature, for each modified DNA base. We surveyed the literature of sequencing methods for covalent DNA modifications27–30, and annotated the available methods for each base. These annotations include the method’s name, our categorizations of the basis for the method (such as chemical conversion), its resolution, limited genome-wide applicability or use of an enrichment method, and the citation for the method (Table 1A). We consider any method which involves affinity-based recognition of targets to be of “low” resolution31. These methods can also suffer from low specificity or cross-reactivity of the antibody27. Conversely, we annotate any methods based principally upon the detection of a chemically converted modification as “high” resolution. This generally reflects the resulting resolution of the method’s output data and often corresponds to the necessity to bin genomic regions during downstream analyses of the detected analyte.
For each modified base, we investigated whether it had been previously reported to occur in vivo, either as an endogenously-generated modification or as a result of exogenous stimuli, such as exposure to an environmental toxin. We annotate any modification observed in vivo merely as “natural”. We additionally provide non-exhaustive examples of some organisms in which the modifications have been reported. We annotate any modification not observed in vivo as “synthetic”, and list a reference in which it was synthesized or in which the synthetic base was used. For each of these annotations, we also briefly annotate a primary biological function, if known (Table 1B).
We enter these annotations in two annotation source files (Table 1), which we later import into our database. This decouples them from the rest of our pipeline and allows experts to submit additions from their domain of expertise, without requiring knowledge of our pipeline or programming workflow.
DNAmod integrates manually-curated nomenclature, including the name and abbreviation deemed most consistent and in common use2,32,33. We additionally provide recommendations for one-letter symbols of selected modified bases, and in some instances for their base-pairing complements. We have previously described these, as part of our expanded epigenetic alphabet, which we currently use to model modification-sensitive transcription factor binding sites34. We provide an example of these tables for 5-formylcytosine in Figure 1.
We store all data, either imported from ChEBI or from our manual annotations, within a SQLite35 database, used via the Python sqlite3 package36.
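As a rough illustration of this storage step, the sketch below uses the standard-library sqlite3 package with an in-memory database. The table layout, column names, and the placeholder identifier `CHEBI:EXAMPLE` are assumptions made for the example and are not DNAmod's actual schema; the inserted row is illustrative.

```python
import sqlite3

# In-memory database for illustration; a real build would write to a file.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one table for imported entities, one for manual
# sequencing annotations keyed by the same identifier.
conn.executescript(
    """
    CREATE TABLE modification (
        chebi_id TEXT PRIMARY KEY,
        name     TEXT NOT NULL,
        verified INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE sequencing_method (
        chebi_id   TEXT NOT NULL REFERENCES modification(chebi_id),
        method     TEXT NOT NULL,
        resolution TEXT NOT NULL
    );
    """
)

# Parameterized inserts keep imported and manually annotated values separate
# from the SQL text itself.
conn.execute(
    "INSERT INTO modification VALUES (?, ?, ?)",
    ("CHEBI:EXAMPLE", "5-hydroxymethylcytosine", 1),
)
conn.execute(
    "INSERT INTO sequencing_method VALUES (?, ?, ?)",
    ("CHEBI:EXAMPLE", "TAB-seq", "high"),
)

# Join imported data with manual annotations for verified modifications.
rows = conn.execute(
    """
    SELECT m.name, s.method, s.resolution
    FROM modification AS m
    JOIN sequencing_method AS s USING (chebi_id)
    WHERE m.verified = 1
    """
).fetchall()
```

Keeping manual annotations in their own table mirrors the decoupling of annotation source files from the imported ChEBI data.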
Website generation
We use a static website to display and provide navigation for the information contained within the database. We generate it by formatting the content of the database using Jinja237, a static Python templating engine. Two templates are sufficient to generate all HTML files. We use a single template for all modification pages and another for the homepage. We also record the date of the most recent update to the database. All webpages use the Bootstrap38 framework, which provides a standardized, portable, and mobile-compatible viewing format. An image of the chemical structure of each compound is created by converting Simplified Molecular-Input Line-Entry System (SMILES) data, if available from ChEBI, into a vector graphic, using the cheminformatics toolkit Open Babel39, via its Python wrapper Pybel40.
Searching and navigation
The modifications contained within DNAmod are accessible either via a search input field or by selecting them from a visual representation of curated modified DNA bases or from a separate list of candidate entities. Three tabs on the DNAmod homepage provide these navigation options. The first tab provides the ability to search for a DNA modification, the second tab contains the curated DNA modifications displayed as a pie menu, and the third tab lists all other entities, categorized by their parent unmodified nucleobases.
Client-side search functionality provides a means of rapidly finding bases with differing nomenclature (Figure 2A), while maintaining a static webpage. We use the elasticlunr.js JavaScript module41 to implement this. Searching allows matching to multiple fields: the common or International Union of Pure and Applied Chemistry (IUPAC) names, all synonyms, any assigned abbreviation, and a symbol, if available. DNAmod returns curated DNA modifications in green, and others in magenta. The search results provide the field matched by the query, such as “abbreviation”, along with the common name of the associated hit.
Alternatively, the modifications in DNAmod can be browsed through a pie menu42 interface (Figure 2B), which hierarchically arranges the bases according to their structure within the ChEBI ontology. The innermost ring consists of the four unmodified DNA bases and consecutive outer rings represent children of the previous base. We demarcate natural versus synthetic bases by colouring natural bases in teal and synthetic bases in grey.
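The multi-field matching can be approximated in a few lines. The sketch below is a plain substring search in Python, not the elasticlunr.js index that the site uses, and the record layout (`common_name`, `iupac_name`, `synonyms`, `abbreviation`, `symbol`, `verified`) is a hypothetical one chosen for illustration.

```python
# Fields searched, in the order they are checked (hypothetical field names).
SEARCH_FIELDS = ("common_name", "iupac_name", "synonyms", "abbreviation", "symbol")


def search(records, query):
    """Return (matched field, common name, verified flag) for each matching record."""
    query = query.lower()
    hits = []
    for record in records:
        for field in SEARCH_FIELDS:
            value = record.get(field)
            if not value:
                continue  # field missing or empty for this record
            values = value if isinstance(value, list) else [value]
            if any(query in text.lower() for text in values):
                hits.append((field, record["common_name"], record["verified"]))
                break  # report only the first matching field per record
    return hits


# Toy records: the second entry is an invented, unverified candidate.
records = [
    {"common_name": "5-formylcytosine", "iupac_name": None, "synonyms": [],
     "abbreviation": "5fC", "symbol": None, "verified": True},
    {"common_name": "example candidate", "iupac_name": None, "synonyms": ["toy entity"],
     "abbreviation": None, "symbol": None, "verified": False},
]
abbreviation_hits = search(records, "5fc")
synonym_hits = search(records, "toy")
```

The verified flag returned with each hit is what a front end could use to colour curated and candidate results differently.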
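The ring assignment described here amounts to computing each entity's depth below an unmodified base. A minimal sketch, assuming a parent-to-children mapping and a breadth-first traversal (the traversal choice, function name, and `mod_*` entities are assumptions, not details given in the text):

```python
from collections import deque


def ring_levels(children, roots):
    """Assign ring 0 to the unmodified bases and ring k+1 to children of ring k."""
    levels = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in levels:  # keep the first (shallowest) assignment
                levels[child] = levels[node] + 1
                queue.append(child)
    return levels


# Toy hierarchy with hypothetical modifications.
children = {"C": ["mod_1"], "mod_1": ["mod_2"], "T": ["mod_3"]}
levels = ring_levels(children, roots=["A", "C", "G", "T"])
```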
DNAmod structure and content
Individual modification pages visually represent the data contained within the backing database. We standardize and display all modifications in an identical format. DNAmod may omit some information, however, depending upon the extent of ChEBI’s annotations and whether the page is for a verified DNA modification or merely a candidate entry.
Modification pages begin with a header displaying the DNA modification’s ChEBI name. The top-right corner of the page lists the unmodified ancestor of the modification. For example, 5-hydroxymethyluracil is a modification of thymine (Figure 3), whereas 6-dimethyladenine is a modification of adenine.
Each modification begins with a short textual description of its chemistry, followed by a table containing its chemical properties. We import these from ChEBI, which provides their chemical formula, net charge, and average mass.
We annotate entities with all names available from ChEBI, including their IUPAC name, SMILES string, and common synonyms. We also provide a recommended abbreviation and in some instances a suggested single-letter symbol for bioinformatic purposes, from our proposed expanded alphabet34 (Figure 3).
We provide literature annotations for many modifications, including all DNA modifications observed in vivo. We provide a list of methods that have been used to map the genomic locations of a modification (see above). We additionally provide information on a modification’s occurrence, either naturally or only synthetically, where applicable, including some organisms in which it has been observed in vivo (see above). Finally, each page ends with the ChEBI database reference and a ChEBI-derived list of related literature citations (Figure 3).
Discussion
DNAmod enables researchers to rapidly obtain information on covalently modified DNA nucleobases and assists those interested in profiling a modification. It additionally provides a reference toward standardization of modified base nomenclature and offers the potential to track recent developments within the field. We expect DNAmod to continue to grow, particularly as new discoveries about DNA modifications are made. We also hope that DNAmod will serve to highlight modifications that have received inadequate attention, but may be of substantial biological importance.
The nomenclature used to describe a particular DNA modification is often inconsistent, with some early efforts toward standardization of particular classes32,33. The ChEBI name, for instance, often corresponds to the common chemical name of the compound, which is occasionally distinct from its common name within the biological literature, in the context of a DNA modification. We address this and attempt to encourage standardization by endeavouring to ensure that other names are annotated, while providing specific nomenclature recommendations. In particular, the suggested name of verified DNA modifications, as displayed on the homepage and within the recommended notation section, is always manually-curated and sometimes differs from the name assigned by ChEBI.
The inclusion of assays available to sequence different DNA modifications provides a means of assessing and selecting a sequencing method. It additionally attempts to track sequencing methods over time, as resolution improves, and especially to highlight recent developments, like direct detection of various modifications via nanopore sequencing43. We annotate a sequencing method only under the base or set of bases that it directly elucidates and independently maps. This includes bases that are obtained together with another nucleobase, since confounded mixtures are often obtained. 5mC and 5hmC, for example, cannot be distinguished with only conventional bisulfite sequencing. Alternatively, some methods, such as various nanopore-based methods, have the capacity to independently resolve between modifications. Accordingly, oxidative bisulfite sequencing (oxBS-seq), often used in combination with conventional bisulfite sequencing to elucidate 5hmC via subtraction, is only annotated as a sequencing method for 5mC, which it directly elucidates. Conversely, TET-assisted bisulfite sequencing (TAB-seq), also used for 5hmC detection, is only annotated under 5hmC, which it directly elucidates27.
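The subtraction mentioned for oxBS-seq can be made concrete with a small worked example. Conventional bisulfite sequencing reports 5mC and 5hmC together, while oxBS-seq reports 5mC alone, so their difference at a site estimates 5hmC. The function name and the clamping of negative differences to zero are assumptions made for this sketch; real analyses typically handle sampling noise statistically rather than by clamping.

```python
def estimate_5hmc(bs_fraction, oxbs_fraction):
    """Estimate the 5hmC fraction at one site by subtraction.

    bs_fraction:   fraction of reads called modified by conventional
                   bisulfite sequencing (5mC + 5hmC, confounded).
    oxbs_fraction: fraction called modified by oxBS-seq (5mC only).
    """
    difference = bs_fraction - oxbs_fraction
    # Sampling noise can make the difference negative; clamp it to zero here.
    return max(difference, 0.0)


site_estimate = estimate_5hmc(0.75, 0.5)  # 0.25 of molecules estimated as 5hmC
noisy_estimate = estimate_5hmc(0.25, 0.5)  # clamped to 0.0
```

Because 5hmC is only obtained indirectly in this design, the example is consistent with annotating oxBS-seq under 5mC alone.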
We demarcate bases that have been found to occur in vivo, providing examples of organisms in which a modification has been found, along with associated citations. This is merely to substantiate its in vivo presence, however, and does not comprehensively list organisms which contain that particular modification. Finally, our brief annotations of the biological roles of various DNA modifications are expected to change as further research is conducted.
Future work
We plan to keep DNAmod updated continuously, manually reviewing newly added ChEBI compounds, continuing to request that missing DNA modifications be added to ChEBI (which we then automatically import), and curating any additions. We also add new sequencing annotations, as we come across them, and plan to continue to do so.
Integrating additional external databases will further increase DNAmod’s utility. In particular, we envision potential integration with domain-specific DNA modification databases. For instance, modifications involved in DNA damage and repair could be linked to REPAIRtoire21 data.
We used ChEBI Web Services23 to obtain information from their database. ChEBI has, however, recently released a Python application programming interface (API), permitting us to directly access their data44. Switching from our current web-based queries to use of their API would likely result in a more robust system and expedite the database-building process.
Availability
The DNAmod website and its backing SQLite database are freely available at: http://dnamod.hoffmanlab.org. Python source code and web assets for this project and an issue tracker are available at: http://bitbucket.org/hoffmanlab/dnamod. To ensure persistent availability, we have deposited in Zenodo the current version of our code (doi: 10.5281/zenodo.60827) and SQLite database (doi: 10.5281/zenodo.60824). All source code and web assets are licensed under a GNU General Public License, version 2 (GPLv2). DNAmod’s data is licensed under a Creative Commons Attribution 4.0 International license (CC BY 4.0).
Funding
This work was supported by the University of Toronto Undergraduate Research Opportunities Program (to A.J.S.), the Natural Sciences and Engineering Research Council of Canada (RGPIN-2015-03948 to M.M.H. and an Alexander Graham Bell Canada Graduate Scholarship to C.V.), the Canadian Cancer Society (703827 to M.M.H.), the Ontario Ministry of Training, Colleges and Universities (Ontario Graduate Scholarship to C.V.), the Ontario Institute for Cancer Research through funding provided by the Government of Ontario (CSC-FR-UHN to John E. Dick), the University of Toronto McLaughlin Centre (MC-2015-16 to M.M.H.), and the Princess Margaret Cancer Foundation.
Conflict of interest statement. None declared.
Acknowledgements
We thank Daniel D. De Carvalho and Christopher E. Mason for helpful feedback on early versions of DNAmod. We thank the creators of ChEBI22, and all those who have worked to improve it23,44,45. In particular, we thank Gareth Owen, Steve Turner, and Marcus Ennis for actively responding to curation requests and Venkatesh Muthukrishnan for managing ChEBI issues. The authors thank Carl Virtanen, Qun Jin, and Zhibin Lu for technical assistance. Thanks to Gabriel Moreno-Hagelsieb for revising the NAR LaTeX template for use with an external BibTeX file. Thanks to Thomas D. Schneider for providing a NAR BibTeX template. We thank Casey M. Bergman for collating and distributing these TeX files.