TY - JOUR
T1 - Validating gene-phenotype associations using relationships in the UMLS
JF - bioRxiv
DO - 10.1101/2020.07.29.226993
SP - 2020.07.29.226993
AU - Andrew L Blumenfeld
AU - Claudia Gonzaga-Jauregui
AU - Deepika Sharma
AU - Geisinger-Regeneron DiscovEHR Collaboration
AU - Regeneron Genetics Center
AU - Ashish Yadav
AU - Shareef Khalid
AU - Suganthi Balasubramanian
AU - Jeffrey G Reid
AU - Lukas Habegger
AU - Michael N Cantor
AU - Jeffrey Staples
Y1 - 2020/01/01
UR - http://biorxiv.org/content/early/2020/07/30/2020.07.29.226993.abstract
N2 - Objective: Large-scale next-generation sequencing of population cohorts paired with patients’ electronic health records (EHRs) provides an excellent resource for the study of gene-disease associations. To validate those associations, researchers often consult databases that identify relationships between genes of interest and relevant disease phenotypes, which we refer to simply as “phenotypes”. However, most of these databases contain phenotypes that are not suited for automated analysis of EHR data, which often capture these phenotypes in the form of International Classification of Diseases (ICD) codes. There is a need for a resource that comprehensively provides gene-phenotype mappings in a format that can be used to evaluate phenotypes from EHRs. Methods: We built a directed graph database of genes, medical concepts and ICD codes based on a subset of the National Library of Medicine’s Unified Medical Language System (UMLS) and other resources. To obtain associations between genes and ICD codes, we traversed the defined relationships from gene, variant and disease concepts to ICD codes, resulting in a set of mappings that link specific genes and variants to these ICD codes. Results: Our method created 249,764 mappings between genes and ICD codes, including 27,226 “disease” phenotypes and 222,538 “symptom” phenotypes, and provided mappings for 4,456 unique genes.
Paths were validated by manual review of a diverse sample. In a cohort of 92,455 samples, we used these mappings to validate gene-phenotype associations in 32,786 samples in which the person had a potentially disease-causing genetic mutation and at least one corresponding diagnosis in their EHR. Conclusion: The concepts and relationships in the UMLS can be used to generate gene-ICD phenotype mappings that are not explicit in the source vocabularies. We were able to use these mappings to validate gene-disease associations in a large cohort of sequenced exomes paired with EHRs. Competing Interest Statement: All individual authors are full-time employees of the Regeneron Genetics Center and receive stock in Regeneron Pharmaceuticals, Inc. as part of their compensation. This research was funded by the Regeneron Genetics Center. No other conflicts are reported.
ER -