History of rare diseases and their genetic causes - a data-driven approach

This dataset provides information about monogenic rare diseases with a known genetic cause, supplemented with manually extracted provenance for both the disease and the discovery of its underlying genetic cause. We collected 4166 rare monogenic diseases according to their OMIM identifiers and linked them to 3163 causative genes, which are annotated with Ensembl identifiers and HGNC symbols. The PubMed identifiers of the scientific publication that first described the rare disease and of the publication that identified the gene causing this disease were added using information from OMIM, Wikipedia, Google Scholar, Whonamedit, and PubMed. The data is available as a spreadsheet and as RDF in a semantic model modified from DisGeNET. This dataset relies on publicly available data and publications with PubMed IDs, but to our knowledge this is the first time this data has been linked and made available for further study under a liberal license. Analysis of this data reveals the timeline of rare disease and causative gene discovery and links it to developments in methods and databases.


Background and summary
Descriptions of unusual diseases date back to the Middle Ages, but rare genetic diseases are a relatively new chapter in the history of medicine, as genes were only discovered to be carriers of hereditary diseases in the middle of the last century (see e.g. Reflections on medicine and art). Identifying the disease-causing mutation among the plethora of genetic variation that an individual human carries is a difficult task in the diagnosis of rare diseases. A human individual has about 25 thousand genetic variants in the exome; to identify the disease-causing one, experts cross-check with variant databases and use variant pathogenicity prediction algorithms. Several bioinformatics workflows are available to go from the raw data generated in this process to the detection of the causative mutation. Within this process, mapping of genetic data, identifiers and information is required in several ways. A typical workflow was described by Gilissen et al. 1. To link information about rare diseases to their causative genes or gene variants, there are many genotype-phenotype databases 2 and some current resources like OMIM 3, Orphanet 4, and DisGeNET 5, which include provenance information, e.g. in the form of manually curated or text-mining-derived literature lists. DisGeNET provides an extensive collection of linked data including a semantic model; Orphanet does as well but focuses more on information related to patient care. OMIM is the online version of the encyclopaedia of genetic (Mendelian) diseases and is probably the one most consulted by clinicians. It provides a literature list for each disease or gene as well as gene-disease mapping spreadsheets, e.g. morbid map. None of them links one gene to one disease together with the single publication which first described the disease or first established the link between gene and disease; gene-disease associations are usually described by multiple references.
This study produced a mapping dataset which links rare monogenic diseases to their causative genes (and vice versa), backed up by provenance, i.e. the publication which proved the genetic cause of a disease for the first time. The data is annotated with OMIM identifiers (disease), gene identifiers (HGNC 6 and Ensembl 7), and PubMed identifiers for the literature. Based on this spreadsheet we used a modified version of the DisGeNET semantic model to produce a Resource Description Framework (RDF) file. Additionally, we produced a linkset which can be used directly within the Cytoscape add-on CyTargetLinker 8 (Cytoscape is a popular network analysis software 9), e.g. for genetic variant or gene expression analysis linking genetic variants to genes 10 or pathways 11. Several kinds of information can be retrieved from this dataset, e.g. the timeline of discovery of genes responsible for rare diseases, the links between groups of diseases, and epidemiology information.

Workflow
First, the information was collected, and suitable data was extracted and reviewed. Second, the dataset was created and made accessible in three different ways: spreadsheet, RDF, and linkset (for the Cytoscape add-on). Third, the dataset was used to retrieve information about the discovery of rare-disease-causing genes and, fourth, it was applied in different analysis use cases. The workflow is shown in Figure 1.

Creation of the dataset
The OMIM database was accessed for a list of all known gene-disease relationships. We manually extracted rare (about 1:1000, which is less exclusive than the EU definition 12) monogenic diseases. The first description of each disease was manually retrieved by literature research using different resources such as OMIM, PubMed, Whonamedit [https://www.whonamedit.com/], or Wikipedia. If no clear first-description publication could be identified, the oldest publication in the OMIM literature list for this disease was used. Diseases were annotated with OMIM identifiers and publications with PubMed identifiers. The causative gene for each disease was annotated with the stable gene identifier from Ensembl and the appropriate HGNC symbol. Provenance for the first publication proving that a particular disease is caused by a specific gene was manually identified based on the literature provided on OMIM and other sources (PubMed, Google Scholar), and the PubMed identifier of this publication was added. Table 1 shows the structure of the final dataset.

Creation of linkset for CyTargetLinker
CyTargetLinker is a Cytoscape add-on which allows networks to be extended with information provided in linksets. These linksets are usually derived from external databases and contain e.g. gene-microRNA or drug-drug target relationships. The rare diseases (RD) linkset for CyTargetLinker was created as described in Kutmon et al. 8. In brief, we used a Java program (available at https://github.com/CyTargetLinker/linksetCreator) to convert tab-delimited text files into XGMML linksets.

Data availability
The data collection (TSV) of the gene-RD-provenance data is available here: https://figshare.com/account/home#/collections/4400798. The most recent version is version 2 (DOI: 10.6084/m9.figshare.7718537.v1). The applications of the gene-RD-provenance dataset are available from these resources: the linkset for CyTargetLinker in the CyTargetLinker repository, and the RDF (nanopublication). We checked whether the publications were available on Wikidata 14 via their PubMed identifiers and, if not, a new entry was created using QuickStatements (https://tools.wmflabs.org/quickstatements/#/).

Software availability
The code to create the RDF from a spreadsheet or TSV is available at: https://github.com/BiGCAT-UM/raredisease-omim/tree/master/rdf. The code to create the linkset for CyTargetLinker is available at: https://github.com/CyTargetLinker/linksetCreator. The queries to retrieve information about the publications from Wikidata using the PubMed identifier (PMID) can be found here: https://github.com/egonw/pubmedWikidata or in the attachment.

Bibliographic information
We queried Wikidata with the list of PMIDs to retrieve several different kinds of information. The SPARQL queries can be found here: https://github.com/egonw/pubmedWikidata.
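As an illustration of this kind of lookup, the following minimal sketch builds a SPARQL query that retrieves publication date and journal for a list of PMIDs. It is not the query from the repository above; the function name and the example PMIDs are arbitrary placeholders. It assumes the Wikidata properties P698 (PubMed ID), P577 (publication date) and P1433 (published in).

```python
def build_pmid_query(pmids):
    """Build a SPARQL query for Wikidata items matching the given PMIDs.

    Hypothetical helper for illustration only. PMIDs are stored on
    Wikidata as string literals, hence the quotes.
    """
    values = " ".join(f'"{int(p)}"' for p in pmids)
    lines = [
        "SELECT ?article ?pmid ?date ?journalLabel WHERE {",
        f"  VALUES ?pmid {{ {values} }}",
        "  ?article wdt:P698 ?pmid .",                    # P698: PubMed ID
        "  OPTIONAL { ?article wdt:P577 ?date . }",       # P577: publication date
        "  OPTIONAL { ?article wdt:P1433 ?journal . }",   # P1433: published in
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }',
        "}",
    ]
    return "\n".join(lines)


# Placeholder identifiers, not taken from the dataset.
query = build_pmid_query([1234567, 7654321])
print(query)
```

The resulting string can be submitted to the public Wikidata SPARQL endpoint (https://query.wikidata.org/sparql); for several thousand PMIDs the list would likely need to be split into batches.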

Timeline of first descriptions of rare diseases
In this data collection, the earliest first description of a rare disease dates from 1788 (Olof Ekman) and concerns osteogenesis imperfecta. For this disease, two collagen genes were identified as responsible in 1988 and 1989: COL1A1 15 and COL1A2 16. Several first descriptions of a phenotype were later split into separate diseases and associated with different genetic causes. In 1817 James Parkinson wrote "An Essay on the Shaking Palsy", describing the disease which was later named after him. By now, there are 19 different entries in OMIM named Parkinson's disease (or a variant of it) with different genetic causes (not to be confused with Wolff-Parkinson-White syndrome, which was named in part after Sir John Parkinson). A remarkable peak occurs in 1886, when Charcot-Marie-Tooth disease was described, which was later subclassified into about 58 different subtypes and responsible genes. In the same year, Pheochromocytoma and Multiple endocrine neoplasia were first described (each with about 5 different subtypes and responsible genes). After 1901 no year passed without a new description of a rare disease.

Timeline of rare disease-gene discovery
Using the PubMed identifiers of the publications stating for the first time that a certain disease is caused by a specific gene, Wikidata was queried (SPARQL code 10.1.6 17). In Figure 2.B we plotted the number of publications per year. In this dataset, the earliest gene-disease link is from 1967, when Seegmiller et al. found that a neurological disorder (which had first been described three years earlier and was later named Lesch-Nyhan syndrome 18) was caused by the absence of hypoxanthine-guanine phosphoribosyltransferase 19. The discovery rate increased after the invention of both Sanger sequencing and PCR, reached a plateau between about 1996 and 2006, and increased again after the development of next generation sequencing.
It remains to be discussed whether the decline in discovery rates after 2013 is due to saturation (all disease-causing genes found) or because new genetic causes for syndromes are now allocated to previously described diseases instead of "inventing" a new disease. In parallel, the average time from the first description of a disease until the identification of the causative gene peaked in 1994 at 41.6 years and has since declined to 8 years in 2017 (Figure 2.B). After the establishment of next generation sequencing in 2007, by about 2013 it became standard practice to identify the genetic cause together with the description of a new genetic disease and to publish both in the same document. The time span does not drop to zero because rare genetic causes of long-known diseases (e.g. Parkinson's or Alzheimer's disease) are still being discovered. At the moment it is speculative whether the rapid decline after 2014 is due to saturation (the dataset currently links 3163 genes, of about 22 000 possible human genes, to one of 4166 rare genetic diseases currently known) or due to the delay with which literature information reaches databases.
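The average time between first description and gene identification can be computed per year of gene discovery. The sketch below shows one way to do this, assuming the lag is averaged over all genes identified in a given year; the records are invented toy values, not entries from the dataset, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Toy rows: (disease_id, year_first_described, year_gene_identified).
# All values are invented for illustration.
records = [
    ("D1", 1950, 1994),
    ("D2", 1960, 1994),
    ("D3", 2010, 2017),
    ("D4", 2017, 2017),  # disease and gene published in the same year
]


def mean_lag_by_gene_year(rows):
    """Average years from disease description to gene identification,
    grouped by the year in which the gene was identified."""
    lags = defaultdict(list)
    for _, described, gene_found in rows:
        lags[gene_found].append(gene_found - described)
    return {year: mean(values) for year, values in sorted(lags.items())}


lag = mean_lag_by_gene_year(records)
print(lag)
```

In the toy data, the 2017 average stays above zero because one long-known disease (D3) is mixed with a disease whose gene was published together with its description (D4), which mirrors the effect described above.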
In which journals are newly discovered diseases or genes published?
Information on the journals in which newly discovered diseases or new causative genes for rare diseases are preferentially published was retrieved from Wikidata. We first checked whether all these publications were available in Wikidata (https://tools.wmflabs.org/sourcemd/) and, if not, the paper information was uploaded (https://tools.wmflabs.org/quickstatements/#/). About 800 publications were added to Wikidata. A PubMed identifier was available for 3144 first descriptions of rare diseases and for 4263 of in total 4565 publications on new genes causing a rare disease. To get the journals in which these papers were published, the SPARQL endpoint of Wikidata was accessed using the query given in the attachment (10.2.1.). Table 2 lists the top 10 journals in which new diseases and new causative genes have been described. According to the data, the most important journals for publishing both new rare diseases and newly identified genes for rare diseases are the American Journal of Human Genetics (14.9% for diseases and 26.7% for genes) and Nature Genetics (4.7% for diseases and 18.4% for genes). In total, we identified 364 different journals for new rare diseases and 197 for genes. The larger number for diseases may be due to the broader spectrum of medical disciplines, and therefore journals, in which the first observation of a now rare disease was described, as well as the broader timeframe in which these were published.
Figure 2.C shows the citation count distribution across publication years for the first description of a causative gene for a rare disease. The mean citation count for such a paper is 56. Among the top 10 most cited papers are several which identified rare genetic causes for Alzheimer's and Parkinson's disease, Huntington's disease, macular degeneration, Rett syndrome, Crohn's disease, amyotrophic lateral sclerosis and chromosome 9p-linked frontotemporal dementia. High citation numbers indicate that a lot of research is done on these diseases.
Low or absent citation numbers may be due to the newness of the finding (less than 5 years), disagreement among researchers, non-reproducibility of the result, or a lack of "interest" in terms of grants and investment of research capacity. Among the truly "neglected" diseases, i.e. diseases for which the genetic cause was identified before 2013 but whose publication has been cited only once so far, are e.g. 3-ketothiolase deficiency, glucose 6-phosphate isomerase deficiency, Leber congenital amaurosis, sarcosinemia and nonsyndromic oculocutaneous albinism, as well as many disorders which do not have a name yet and are described by their phenotypic appearance (e.g. "Manifestations of X-linked congenital stationary night blindness in three daughters of an affected male: demonstration of homozygosity"). Apart from these, about 1000 gene-rare disease discoveries have not been cited yet (according to the available data). The most cited disease publications are Inflammatory bowel disease 25, early onset, autosomal recessive (4080 citations), D-2-hydroxyglutaric aciduria 2 (1598), Macular degeneration, age-related, 4 (1498), and Osteogenesis imperfecta, type XII (1031).

Authors
Of the authors of papers which first described a causative gene for a rare disease, 2641 have an ORCID, which is about 15% of all authors (link to query). Of these, 1637 are male (62.4%) and 986 female (37.6%); 18 individuals have no sex/gender information. Table 3 gives the top 10 authors with the most gene discoveries. Of the papers first describing a rare disease, 993 (19.7%) have an ORCID. The lower number may be because descriptions of rare diseases reach further into the past, when ORCID was not available.

Application of the dataset for analysis
Network of genes causing diseases
Network analysis shows 2357 gene-disease causative relationships in which one gene causes one disease, or conversely one disease is caused by one gene (Figure 3.A upper left). Provenance can be given by one or more publications, depicted as one or more edges connecting the nodes. We found 446 triplets of two genes causing one disease or one gene causing two diseases; in the majority of these cases, one gene causes two diseases. This may be explained by the fact that there are often two varieties of one disease which were given separate identifiers in OMIM. 1226 genes and diseases are linked in more complex patterns (Figure 3.A bottom). Again, in the majority of cases one gene is responsible for multiple diseases or varieties of a disease. The largest complex is centred on different, mainly mitochondrial, disorders and their associated diseases. Here, a lot of overlap between the diseases and causative genes is observed, possibly because several genes contribute to the functionality of one complex.
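One way to obtain such a grouping is to treat genes and diseases as nodes of a bipartite graph and to classify its connected components by size. The sketch below assumes that a "pair" is a component with two nodes, a "triplet" one with three nodes, and everything larger is "complex"; this reading of the categories, the function names and the toy edges are assumptions and may differ from the analysis actually performed in Cytoscape.

```python
from collections import Counter, defaultdict

# Toy gene-disease links with placeholder names, not dataset entries.
edges = [
    ("GENE_A", "DIS_1"),                        # one gene, one disease
    ("GENE_B", "DIS_2"), ("GENE_B", "DIS_3"),   # one gene, two diseases
    ("GENE_C", "DIS_4"), ("GENE_D", "DIS_4"), ("GENE_D", "DIS_5"),
]


def components(pairs):
    """Return the connected components of the bipartite gene-disease graph."""
    adjacency = defaultdict(set)
    for gene, disease in pairs:
        # Prefix node keys so a gene and a disease can never collide.
        adjacency[("gene", gene)].add(("disease", disease))
        adjacency[("disease", disease)].add(("gene", gene))
    seen, result = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:  # iterative depth-first search
            node = stack.pop()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        result.append(component)
    return result


def classify(component):
    """Label a component by its number of nodes (assumed definition)."""
    if len(component) == 2:
        return "pair"
    if len(component) == 3:
        return "triplet"
    return "complex"


counts = Counter(classify(c) for c in components(edges))
print(counts)
```

Multiple publications supporting the same gene-disease link would appear as repeated pairs in the input and do not change the component structure.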

Integration in network analysis
Based on this information we created a linkset for the Cytoscape plugin CyTargetLinker (see https://github.com/CyTargetLinker/linksetCreator for how to use and create these linksets). Figure 3.B shows the result of such a network extension.

Linking data and information with other datasets, mappings and RDF based databases
DisGeNET provides mapping datasets which allow mapping of OMIM identifiers to Concept Unique Identifiers (CUI) and Orphanet identifiers (ORPHA). Using our list of rare disease OMIM identifiers, it was possible to map 99.2% of them to a CUI and 58.0% to ORPHA.
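The mapping coverage reported above is a simple ratio of mapped to total identifiers. A minimal sketch, assuming the DisGeNET mapping has already been loaded into a dictionary from OMIM identifier to target identifier (all identifiers below are placeholders and the function name is hypothetical):

```python
def coverage(disease_ids, mapping):
    """Percentage of unique disease identifiers with a non-empty mapping."""
    ids = set(disease_ids)
    if not ids:
        return 0.0
    mapped = sum(1 for identifier in ids if mapping.get(identifier))
    return round(100 * mapped / len(ids), 1)


# Placeholder identifiers, not real OMIM, CUI or ORPHA entries.
omim_ids = ["OMIM:A", "OMIM:B", "OMIM:C", "OMIM:D"]
omim_to_cui = {"OMIM:A": "CUI:1", "OMIM:B": "CUI:2", "OMIM:C": "CUI:3"}
omim_to_orpha = {"OMIM:A": "ORPHA:1", "OMIM:B": ""}

print(coverage(omim_ids, omim_to_cui))    # share mapped to a CUI
print(coverage(omim_ids, omim_to_orpha))  # share mapped to ORPHA
```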

DisGeNET RDF and database - identification of disease superclasses
Querying the DisGeNET database and RDF, we found that 58.7% of the rare diseases are annotated with a disease superclass term (MeSH terms). The top ten superclass terms are listed in Table 4. The majority of rare diseases are annotated with "congenital, hereditary, and neonatal diseases and abnormalities", "nervous system disease", and "musculoskeletal diseases".

What kind of disorders have been discovered and when?
Linking the publication date of a rare disease description with the MeSH disease superclass information from DisGeNET reveals trends in which classes of diseases have been identified over time. Figure 4 shows a timeline in blocks of 20 years with the number of rare diseases described in each superclass.
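Grouping by 20-year blocks amounts to integer division of the publication year. The following sketch counts descriptions per block and superclass; the block origin of 1780, the function name and the rows are assumptions for illustration, since the exact block boundaries of Figure 4 are not stated here.

```python
from collections import Counter


def block_start(year, width=20, origin=1780):
    """Return the first year of the block of `width` years containing `year`.

    The origin of 1780 is an assumed starting point, not taken from Figure 4.
    """
    return origin + ((year - origin) // width) * width


# Toy rows: (year of first description, MeSH superclass); placeholder values.
rows = [
    (1788, "class_x"),
    (1817, "class_y"),
    (1886, "class_y"),
    (1893, "class_y"),
]

counts = Counter((block_start(year), superclass) for year, superclass in rows)
for (start, superclass), n in sorted(counts.items()):
    print(f"{start}-{start + 19}  {superclass}: {n}")
```

A disease annotated with several superclasses would contribute one row per superclass, so block totals can exceed the number of diseases.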