Abstract
Objectives Concept embeddings are low-dimensional vector representations of concepts such as MeSH:D009203 (Myocardial Infarction), whose similarity in the embedded vector space reflects their semantic similarity. Here, we test the hypothesis that replacing non-biomedical concept synonyms can improve the quality of biomedical concept embeddings.
Materials and methods We developed an approach that leverages WordNet to replace sets of synonyms with the most common representative of the synonym set.
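The replacement step can be illustrated with a minimal sketch. It is not the wn2vec implementation: the synonym sets below are hypothetical toy data standing in for WordNet synsets, the function names `build_replacement_map` and `replace_synonyms` are invented for illustration, and the sketch assumes that the synonym sets do not overlap and that "most common" means most frequent in the corpus being processed.

```python
from collections import Counter

# Hypothetical toy synonym sets; the paper derives its sets from WordNet.
SYNSETS = [
    {"tumor", "tumour", "neoplasm"},
    {"physician", "doctor", "medico"},
]


def build_replacement_map(tokens, synsets):
    """Map each synonym found in the corpus to the most frequent member of its set."""
    counts = Counter(tokens)
    mapping = {}
    for synset in synsets:
        # Only consider members that actually occur in the corpus.
        present = [word for word in synset if counts[word] > 0]
        if len(present) < 2:
            continue  # nothing to merge for this set
        # Highest count wins; ties are broken alphabetically for determinism.
        representative = min(present, key=lambda word: (-counts[word], word))
        for word in present:
            mapping[word] = representative
    return mapping


def replace_synonyms(tokens, mapping):
    """Return a copy of the token list with synonyms replaced by their representative."""
    return [mapping.get(token, token) for token in tokens]


corpus = (
    "the doctor saw a tumor the physician removed the tumour "
    "the doctor noted the tumor"
).split()
mapping = build_replacement_map(corpus, SYNSETS)
normalized = replace_synonyms(corpus, mapping)
print(" ".join(normalized))
```

The normalized token stream would then be passed to the embedding algorithm in place of the original text, so that occurrences of all synonyms contribute to a single vector.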
Results We tested our approach on 1055 concept sets and found that the mean intracluster distance in the vector space was reduced by 8% on average. Assuming that homophily of related concepts in the vector space is desirable, our approach tends to improve the quality of embeddings.
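As a rough illustration of the evaluation metric, the sketch below computes a mean intracluster distance for one concept set before and after replacement. The abstract does not specify the distance measure, so the mean pairwise cosine distance is assumed here; the two-dimensional vectors are made-up values, not results from the study, and the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_intracluster_distance(vectors):
    """Average cosine distance over all unordered pairs of vectors in one concept set."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("a concept set needs at least two vectors")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Made-up embeddings of one concept set, without and with synonym replacement.
before = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
after = [[1.0, 0.0], [1.0, 1.0], [1.0, 1.0]]

d_before = mean_intracluster_distance(before)
d_after = mean_intracluster_distance(after)
# Negative values mean the concepts moved closer together.
relative_change = (d_after - d_before) / d_before
print(f"before={d_before:.3f} after={d_after:.3f} change={relative_change:.1%}")
```

Averaging such relative changes over all concept sets would give a summary figure comparable in kind to the reduction reported above.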
Discussion and Conclusion This pilot study shows that non-biomedical synonym replacement tends to improve the quality of biomedical concept embeddings generated with the Word2Vec algorithm. Our approach is implemented in a freely available Python package at https://github.com/TheJacksonLaboratory/wn2vec.
Competing Interest Statement
The authors have declared no competing interest.