A systematic mapping of the genomic and proteomic variation associated with monogenic diabetes

Aims
Monogenic diabetes is a group of diseases caused by rare variants in single genes. Multiple genes have been described as responsible for monogenic diabetes, but the information on the variants is not unified among different resources. In this work, we aimed to develop an automated pipeline that collects all the genetic variants associated with monogenic diabetes from different resources, unifies the data, and translates the genetic sequences to proteins.

Methods
The pipeline developed in this work is written in Python with the use of Jupyter notebook. It consists of 6 modules that can be implemented separately. The translation step is performed using the ProVar tool, also written in Python. All the code, along with the intermediate and final results, is available for public access and reuse.

Results
The resulting database had 2701 genomic variants in total and was divided into two levels: the variants reported to have an association with monogenic diabetes and the variants that have evidence of pathogenicity. Of them, 2565 variants were found in the ClinVar database and the remaining 136 were found in the literature, showing that the overlap between resources is not absolute.

Conclusions
We have developed an automated pipeline for collecting and harmonizing data on genetic variants associated with monogenic diabetes. Furthermore, we have translated variant genetic sequences into protein sequences accounting for all protein isoforms and their variants. This allows researchers to consolidate information on variant genes and proteins associated with monogenic diabetes and facilitates their study using proteomics or structural biology. Our open and flexible implementation using Jupyter notebooks enables tailoring and modifying the pipeline and its application to other rare diseases.

Research in context
• Monogenic diabetes is a group of Mendelian diseases with an autosomal-dominant pattern of inheritance.
• Monogenic diabetes is mainly caused by rare genetic variants that are usually evaluated manually.
• The data on the variants are stored in several resources and are not unified in terms of the genomic coordinates, alleles, and variant annotation.
• What can be done for the systematic evaluation of the variants and their protein consequences?
• In this work, we have created an automated Jupyter notebook-based pipeline for the collection and unification of the variants associated with monogenic diabetes.
• The database of the genetic variants was created and translated to all possible variant protein sequences.
• These results will be used for the analysis of proteomics data and protein structure modeling.


Introduction
The most common forms of monogenic diabetes include maturity-onset diabetes of the young (MODY), neonatal diabetes (1), inherited lipodystrophies, and mitochondrial diabetes, among others (2). Today, international guidelines are available for the diagnosis and follow-up of patients with suspected MODY (3). These patients may now receive a molecular genetic diagnosis using diagnostic gene sequencing panels (panelapp.genomicsengland.co.uk/panels/472). This allows precise MODY subtyping and, depending on the diagnosis, the opportunity to avoid lifelong insulin medication and complications through lifestyle management or alternative treatment using oral antidiabetic drugs (4). Because an early and correct diagnosis may enable successful treatment with low doses of sulfonylurea or diet alone and postpone complications, the timeliness of diagnosis and care is crucial (4). It is estimated that around 77 % of monogenic diabetes cases remain undiagnosed (5). These patients and their relatives remain unaware of their familial condition and do not benefit from adapted care.
The first challenge in establishing a firm diagnosis in all cases of MODY is the mapping of all genes that can cause monogenic diabetes. To date, multiple genes have been found to be associated with familial forms of diabetes (2). Fourteen of them are often referred to as "the MODY genes" (3), although this list is subject to debate in the literature (6) and is being systematically assessed by international experts using established guidelines to determine gene-disease relationships (clinicalgenome.org/affiliation/40016). A second challenge is the difficulty of evaluating the pathogenicity of genetic variants (7), for which there is also an ongoing international effort to establish guidelines and provide expert variant curations in the ClinVar database (clinicalgenome.org/affiliation/50016) (8). Furthermore, the response to alternative treatment might differ between populations (9). One way to address the challenge of precise diagnostics is to complement genetic screening with additional data, combining both molecular and clinical dimensions (10).
The recent advent of high-performance computational models for protein structures notably holds the promise to democratize the study of gene sequences (11). We aim to study the properties of the proteins encoded by genes carrying alleles suspected to cause monogenic diabetes. This can, for example, help us understand whether the structure and properties of the protein are affected by these alleles, hence shedding light on the pathogenicity of the variant investigated (12).
The adoption of these approaches is impaired by the difficulty of mapping variants associated with monogenic diabetes to the different forms of proteins that they encode. First, given that the variants associated with monogenic diabetes are rare, the coverage by genomic databases is low. Maintaining an updated list of variants requires constant monitoring of the literature by experts. Second, variants reported in the literature often lack standardization in their identifiers and coordinates, making it challenging to map them to a given genome build and requiring manual variant mapping. Third, inferring the consequences on protein sequences is still a daunting task for some variants (for example, alleles affecting splice sites and untranslated regions [UTRs]). Fourth, a given protein-coding sequence might encode different protein isoforms, which will produce different forms of proteins upon folding and post-translational modification (13); hence, for a given variant, multiple protein sequences need to be investigated. Mapping genetic variants associated with monogenic diabetes from genes to proteins is therefore neither tractable nor sustainable without automation using dedicated bioinformatic tools.
Here, we describe a new open-source modular pipeline based on Jupyter notebooks (eprints.soton.ac.uk/403913) that allows for the systematic collection of variants associated with monogenic diabetes and their mapping to Ensembl (14) and ClinVar (8). We demonstrate how the different genes associated with monogenic diabetes harbor variants of different levels of pathogenicity. Finally, we port the variant sequences to the protein level and provide the resulting sequences in a standard format that can readily be used for proteomic and structural proteomic analyses using mass spectrometry or protein structure modeling.

General architecture
The pipeline consists of seven independent modules written in Python using Jupyter notebooks (figure 1). The notebooks are chained together as a pipeline, but they can also be used as standalone applications, or integrated into other pipelines. First, the pipeline takes a list of genes and extracts exonic variants from Ensembl, retaining only those variants that are predicted to affect protein sequences. Next, the program integrates the variants from ClinVar and maps them to Ensembl. Similarly, variants are extracted from the literature, here using the literature mining by Rafique et al. (15), and mapped to Ensembl. Subsequently, the harmonized collection of variants is consolidated in a database stored in the form of a text file that can easily be parsed and reused. Finally, the table of variants is mapped to all the transcripts linked by Ensembl to the genes of interest to obtain protein sequences of all the possible isoforms encoded by these genes. In this last step, the DNA sequences are translated to amino acid sequences and stored as protein FASTA files. To visually inspect the results, a separate module allows overlaying all the variants that can possibly affect a given gene onto the corresponding amino acid sequences.
Figure 1 abbreviations: MODY, maturity-onset diabetes of the young; MD, monogenic diabetes; ND, neonatal diabetes; API, application programming interface; PMID, identifier of a scientific publication in the PubMed database; dbSNP identifiers, identifiers of genomic variants in the dbSNP database, which start with "rs".
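The consolidation step described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: it assumes that each source yields rows with hypothetical `chrom`, `pos`, `ref`, and `alt` fields on a common genome build, and it writes a tab-separated text file with a made-up column layout.

```python
import csv
import io


def consolidate(sources):
    """Merge variants from several sources, keyed by coordinates and alleles.

    `sources` maps a source name (e.g. "ClinVar") to an iterable of rows.
    The row fields are assumptions made for this sketch.
    """
    merged = {}
    for name, rows in sources.items():
        for row in rows:
            key = (str(row["chrom"]), int(row["pos"]), row["ref"], row["alt"])
            merged.setdefault(key, set()).add(name)
    return merged


def write_tsv(merged, handle):
    """Write the merged variants as a tab-separated text table."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["chrom", "pos", "ref", "alt", "sources"])
    for (chrom, pos, ref, alt), names in sorted(merged.items()):
        writer.writerow([chrom, pos, ref, alt, ",".join(sorted(names))])


# Toy input with invented coordinates, for illustration only.
example_sources = {
    "ClinVar": [{"chrom": "1", "pos": 100, "ref": "G", "alt": "A"}],
    "literature": [
        {"chrom": "1", "pos": 100, "ref": "G", "alt": "A"},
        {"chrom": "1", "pos": 250, "ref": "C", "alt": "T"},
    ],
}
example_merged = consolidate(example_sources)
buffer = io.StringIO()
write_tsv(example_merged, buffer)
```

Keying on coordinates and alleles rather than on identifiers lets the same variant reported under different names collapse into one row, while the `sources` column records which resources reported it.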

Module 1 - Mining variants in Ensembl
In Module 1, a list of genes is mapped to Ensembl genes, transcripts, and exon identifiers using the Ensembl REST API (16). Subsequently, the Ensembl REST API is queried using the exon identifiers to return all variants in Ensembl overlapping with the corresponding regions along with their annotation (identifiers, coordinates, consequences, etc.). For multi-allelic variants, we treat every alternative allele as a different entry, and add it as a new line to the table produced.
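The steps of Module 1 can be sketched as below. This is an illustration under stated assumptions rather than the notebook's code: the URLs follow the public Ensembl REST endpoints `lookup/symbol` and `overlap/id`, the record layout (`alleles` with the reference allele first, `seq_region_name`, `consequence_type`) is assumed from the overlap endpoint's JSON output, and the helper names and example record are hypothetical.

```python
ENSEMBL_REST = "https://rest.ensembl.org"


def lookup_url(symbol, species="homo_sapiens"):
    """URL for a gene with its transcripts and exons expanded."""
    return (f"{ENSEMBL_REST}/lookup/symbol/{species}/{symbol}"
            "?expand=1;content-type=application/json")


def overlap_url(exon_id):
    """URL for all variants overlapping an exon identifier."""
    return (f"{ENSEMBL_REST}/overlap/id/{exon_id}"
            "?feature=variation;content-type=application/json")


def split_alleles(record):
    """Return one table row per alternative allele of a variant record.

    Assumes the first entry of `alleles` is the reference allele.
    """
    ref, *alts = record["alleles"]
    return [
        {
            "id": record["id"],
            "chrom": record["seq_region_name"],
            "start": record["start"],
            "end": record["end"],
            "ref": ref,
            "alt": alt,
            "consequence": record.get("consequence_type"),
        }
        for alt in alts
    ]


# Invented multi-allelic record, for illustration only.
example_record = {
    "id": "rs_example",
    "seq_region_name": "12",
    "start": 1000,
    "end": 1000,
    "alleles": ["G", "A", "T"],
    "consequence_type": "missense_variant",
}
example_rows = split_alleles(example_record)
```

In practice, the URLs would be fetched with an HTTP client and each returned record passed through `split_alleles`, so that a multi-allelic variant contributes one line per alternative allele to the table.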
The example of the top rows of the reference table is given in the Supplementary materials.
For the variants that could not be mapped automatically in Module 4, the title of the publication as obtained from Module 4 is queried against the Entrez Programming Utilities API (www.ncbi.nlm.nih.gov/books/NBK25500) to return the PubMed (pubmed.ncbi.nlm.nih.gov) identifiers (PMIDs) of these articles. Note that some of the PMIDs were not mapped and had to be added manually. Next, these PMIDs are used to query the same API and return the "rs" identifiers of the variants mentioned in these publications. Finally, the variants are mapped to Ensembl using their identifiers and passed to Module 6. Note that not all these variants could be mapped automatically; those that did not map to Ensembl were formatted manually for input to Module 6.
Variants in BLK, KLF11, and PAX4 were removed, as these genes were reported to lack association with MODY in more recent literature (6).
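The two Entrez queries described above (publication title to PMID, then PMID to "rs" identifiers) can be sketched as follows. This is a hedged illustration, not the pipeline's code: it uses the E-utilities `esearch.fcgi` and `elink.fcgi` endpoints, while the function names, the choice of the `[Title]` field tag, and the assumed layout of the ELink JSON response (`linksets`, `linksetdbs`, `links`) are assumptions of this sketch.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def pmid_search_url(title):
    """ESearch URL looking up a publication by title in PubMed."""
    params = {"db": "pubmed", "term": f"{title}[Title]", "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def snp_link_url(pmid):
    """ELink URL listing dbSNP records linked to a PubMed article."""
    params = {"dbfrom": "pubmed", "db": "snp", "id": str(pmid), "retmode": "json"}
    return f"{EUTILS}/elink.fcgi?{urlencode(params)}"


def rs_ids_from_elink(payload):
    """Collect "rs" identifiers from a parsed ELink JSON response.

    Assumes dbSNP UIDs are rs numbers without the "rs" prefix.
    """
    ids = []
    for linkset in payload.get("linksets", []):
        for linked_db in linkset.get("linksetdbs", []):
            if linked_db.get("dbto") == "snp":
                ids.extend(f"rs{uid}" for uid in linked_db.get("links", []))
    return ids


# Invented response fragment, for illustration only.
example_payload = {
    "linksets": [
        {"dbfrom": "pubmed", "linksetdbs": [{"dbto": "snp", "links": ["123", "456"]}]}
    ]
}
example_rs_ids = rs_ids_from_elink(example_payload)
```

Articles for which either query returns nothing would fall back to the manual handling described in the text.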

Translation of the variant sequences into protein sequences
The translation step was performed using the ProVar tool (github.com/ProGenNo/ProHap).
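To illustrate what translating a variant sequence involves, the sketch below applies a single variant to a coding sequence and translates the result with the standard genetic code. It is not how ProVar is implemented; the function names and the toy coding sequence are hypothetical, and positions are assumed to be 1-based within the coding sequence.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def apply_variant(cds, pos, ref, alt):
    """Replace `ref` with `alt` at a 1-based position of a coding sequence."""
    start = pos - 1
    if cds[start:start + len(ref)] != ref:
        raise ValueError("reference allele does not match the sequence")
    return cds[:start] + alt + cds[start + len(ref):]


def translate(cds):
    """Translate a coding sequence up to the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy coding sequence, for illustration only.
reference_cds = "ATGGGTCGATAA"
reference_protein = translate(reference_cds)
missense_protein = translate(apply_variant(reference_cds, 4, "G", "C"))
nonsense_protein = translate(apply_variant(reference_cds, 7, "C", "T"))
```

Because the pipeline considers every transcript of a gene, the same genomic variant would be applied to each isoform's coding sequence separately, which is why one variant can yield several variant protein sequences.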

Sequence overlay
All the variants from the resulting database were overlaid with the reference protein sequences obtained from Ensembl for all the transcripts of all genes. The variants are represented using two separate rows corresponding to the two levels of pathogenicity confidence (see example in figure 3). FASTA files are parsed using the Pyteomics library (19). The protein sequences and variants are plotted using the Matplotlib library (20).

Results
Unlike more common forms of diabetes like type 2 diabetes (T2D), where large numbers of samples are available and federated initiatives consolidate information on genetic variants and their consequences in aggregated and harmonized forms (e.g. t2d.hugeamp.org), monogenic diabetes, as a rarer disease, relies on small cohorts and information on genetic variants is scattered in the literature and online databases. The aggregation and comparison of variants known to be associated with monogenic diabetes therefore currently relies on expert manual curation and annotation.

Mining variants in genes associated with monogenic diabetes
We mined variants based on a list of 101 genes associated with monogenic diabetes, aiming at being as comprehensive as possible. These 101 genes consist of: i. 14 MODY genes taken from OMIM (21) or other reviews on MODY such as (3); ii. 10 genes associated with neonatal diabetes, lipodystrophy, and insulin signaling taken from (2); iii. 77 genes having variants with any evidence of association with either MODY, neonatal diabetes, or just the condition referred to as "monogenic diabetes" by ClinVar.

Categorizing by consequence and pathogenicity
Since monogenic diabetes, being a Mendelian disease, is determined mostly by rare, highly penetrant coding variants (5), we focused on exonic variants when mining Ensembl.
Nevertheless, some variants reported in ClinVar and in the literature mapped to untranslated regions (UTR) and splice regions. In these cases we decided whether to keep these variants or

Discussion
In this work, we presented a computational pipeline that allows for the systematic collection and mapping of variants associated with monogenic diabetes. The sources of information on the genetic variation are not unified, which makes mapping of the variants challenging. An automated and reproducible pipeline for variant mapping has been developed and is available for public use. A database of variant protein sequences was created for the gene products of variants associated with monogenic diabetes. All known variants reported to be associated with monogenic diabetes published by the beginning of 2023 have been included in the database. The database contains variants with two levels of clinical significance: variants ever reported as associated with monogenic diabetes and pathogenic variants. We considered variants pathogenic if they had pathogenic or likely pathogenic clinical significance according to ClinVar, regardless of their star status. All the monogenic diabetes-associated variants have been translated into protein products and can be compared to the canonical protein sequences. This will help predict the effect of genetic variation on the resulting protein structure and function.
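The two-level rule described above can be written as a short sketch. The function name, the level labels, and the representation of clinical significance as a list of strings are assumptions of this illustration; the review (star) status is deliberately ignored, as stated in the text.

```python
PATHOGENIC_TERMS = {
    "pathogenic",
    "likely pathogenic",
    "pathogenic/likely pathogenic",
}


def evidence_level(clinical_significance):
    """Assign one of the two database levels to a variant.

    `clinical_significance` is an iterable of ClinVar-style significance
    strings; the review status is not taken into account.
    """
    terms = {term.strip().lower() for term in clinical_significance}
    return "pathogenic" if terms & PATHOGENIC_TERMS else "associated"


# Toy inputs, for illustration only.
level_a = evidence_level(["Likely pathogenic"])
level_b = evidence_level(["Uncertain significance"])
level_c = evidence_level([])
```

Variants reported only in the literature, with no ClinVar significance at all, fall into the broader "associated" level under this rule.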
The workflow is automated and aims to gather multiple variants from different sources and avoid their manual annotation. The implementation in Jupyter notebooks provides a good trade-off between automation and flexibility. For example, researchers can execute the entire pipeline as is, adapt it to specific use cases, execute only modules of interest, or completely change the set of genes to study another disease. The public availability, extensive documentation, and permissive license further enable the reuse of our work.
The collected genetic variants associated with monogenic diabetes have been translated to protein sequences and mapped to all known protein isoforms, resulting in a collection of all possible variant protein sequences. The variant protein sequences can also be used in protein structure modeling using tools like AlphaFold (11). This work illustrates how the different genes associated with monogenic diabetes show very different levels of annotation. Besides well-investigated genes, featuring a high number of variants with unambiguous consequences, many understudied genes bear variants lacking evidence of pathogenicity.
Furthermore, some variants simply lack basic genomic annotation, and are reported as amino acid changes, e.g. "Gly292Argfs". An amino acid substitution cannot always be mapped to a single genetic variant. Furthermore, most of the genes encode several protein isoforms and knowing an amino acid change does not give information on which isoform it affects and how it maps to other isoforms. Thus, reporting single amino acid substitutions impairs their inclusion in genomic and bioinformatic studies. In this work, we manually curated these variants and were able to match some of them to genomic coordinates using other variants and aligning protein isoforms using the IsoAligner tool (26). For complex proteomic research in humans, it is important to account for common variation. In future work, we are going to combine our variant sequences with the database of human protein haplotypes (27).
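The statement that an amino acid substitution cannot always be mapped to a single genetic variant can be illustrated with a small sketch that enumerates every single-nucleotide change turning a codon for one amino acid into a codon for another under the standard genetic code. The function name is hypothetical and the glycine-to-arginine example is chosen only for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def single_nucleotide_routes(ref_aa, alt_aa):
    """List (reference codon, variant codon) pairs differing by one base."""
    routes = []
    for codon, amino_acid in CODON_TABLE.items():
        if amino_acid != ref_aa:
            continue
        for position in range(3):
            for base in BASES:
                if base == codon[position]:
                    continue
                mutated = codon[:position] + base + codon[position + 1:]
                if CODON_TABLE[mutated] == alt_aa:
                    routes.append((codon, mutated))
    return routes


# One-letter codes: G = glycine, R = arginine.
gly_to_arg = single_nucleotide_routes("G", "R")
```

Without the reference codon and the transcript, a reported glycine-to-arginine change is therefore compatible with several distinct nucleotide substitutions, which is why such reports needed manual curation.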
Our work focused on single nucleotide variants (SNVs) and short indels and therefore does not cover larger insertions or deletions causing monogenic diabetes. Large genomic rearrangements, such as full deletion of the HNF1B gene (28) or full deletion of the 17q12 locus, have been shown to cause HNF1B-MODY (MODY5) and can be missed by conventional point mutation screening (28,29). These events cannot be directly translated to protein sequences and, as a result, are not reflected in our database.
In our work, we have analyzed the distribution of pathogenicity among the variants with known consequence types. Based on this analysis, we included variants classified as 'splice donor variant' or 'splice acceptor variant' and filtered out those classified as 'splice region variant'. In future work, the consequences of splice region variants should be given more attention, as this is an understudied field and these variants can play a significant role in rare diseases (30).
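The analysis and filtering described above can be sketched as follows. This is an illustration only: it assumes each variant row carries a single consequence term in Sequence Ontology style (as used by Ensembl) and a boolean `pathogenic` flag, and the function names and toy rows are hypothetical.

```python
from collections import Counter


def pathogenic_fraction_by_consequence(rows):
    """Fraction of pathogenic variants for each consequence term."""
    total = Counter()
    pathogenic = Counter()
    for row in rows:
        total[row["consequence"]] += 1
        if row["pathogenic"]:
            pathogenic[row["consequence"]] += 1
    return {term: pathogenic[term] / total[term] for term in total}


def drop_splice_region(rows):
    """Keep all rows except those annotated as splice region variants."""
    return [row for row in rows if row["consequence"] != "splice_region_variant"]


# Toy rows with invented values, for illustration only.
example_rows = [
    {"consequence": "splice_donor_variant", "pathogenic": True},
    {"consequence": "splice_acceptor_variant", "pathogenic": True},
    {"consequence": "splice_region_variant", "pathogenic": False},
    {"consequence": "splice_region_variant", "pathogenic": True},
    {"consequence": "missense_variant", "pathogenic": False},
]
example_fractions = pathogenic_fraction_by_consequence(example_rows)
example_kept = drop_splice_region(example_rows)
```

Variants annotated with several consequence terms would need an explicit rule, for example keeping the most severe term, which this sketch does not attempt.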
Research on monogenic diabetes is a rapidly developing field, and new variants are constantly being reported from different cohorts. With the pipeline we have developed, we and others can easily update MODY variant collections as new variants are reported. Our pipeline can also be used, as a whole or in parts, to study other diseases. This can enable researchers to automatically and reproducibly collect variants associated with phenotypes of interest and consolidate them into a unified format. In research on rare diseases, the availability of flexible pipelines based on notebooks represents a good compromise between manual expert curation that lacks reproducibility and automated pipelines that cannot be tailored to the application.