Abstract
This dataset provides information about rare monogenic diseases with a known genetic cause, supplemented with manually extracted provenance for both the disease and the discovery of its underlying genetic cause.
We collected 4166 rare monogenic diseases by their OMIM identifiers and linked them to 3163 causative genes, which are annotated with Ensembl identifiers and HGNC symbols. For each disease, we added the PubMed identifiers of the publication that first described the disease and of the publication that identified the causative gene, using information from OMIM, Wikipedia, Google Scholar, Whonamedit, and PubMed. The data is available as a spreadsheet and as RDF in a semantic model modified from DisGeNET.
This dataset relies on publicly available data and publications with PubMed IDs, but to our knowledge this is the first time this data has been linked and made available for further study under a liberal license. Analysis of this data reveals the timeline of rare disease and causative gene discovery and links these discoveries to developments in methods and databases.