RT Journal Article
SR Electronic
T1 Evaluating disease similarity using latent Dirichlet allocation
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 030593
DO 10.1101/030593
A1 James Frick
A1 Rajarshi Guha
A1 Tyler Peryea
A1 Noel T. Southall
YR 2015
UL http://biorxiv.org/content/early/2015/11/03/030593.abstract
AB Measures of similarity between diseases have been used for applications from discovering drug-target interactions to identifying disease-gene relationships. It is challenging to quantitatively compare diseases because much of what we know about them is captured in free text descriptions. Here we present an application of Latent Dirichlet Allocation as a way to measure similarity between diseases using textual descriptions. We learn latent topic representations of text from Online Mendelian Inheritance in Man records and use them to compute similarity. We assess the performance of this approach by comparing our results to manually curated relationships from the Disease Ontology. Despite being unsupervised, our model recovers a record’s curated Disease Ontology relations with a mean Receiver Operating Characteristic Area Under the Curve of 0.80. With low dimensional models, topics tend to represent higher level information about affected organ systems, while higher dimensional models capture more granular genetic and phenotypic information. We examine topic representations of diseases for mapping concepts between ontologies and for tagging existing text with concepts. We conclude topic modeling on disease text leads to a robust approach to computing similarity that does not depend on keywords or ontology.