RT Journal Article
SR Electronic
T1 A novel Word2vec based tool to estimate semantic similarity of genes by using Gene Ontology terms
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 103648
DO 10.1101/103648
A1 Dat Duong
A1 Eleazar Eskin
A1 Jingyi Jessica Li
YR 2017
UL http://biorxiv.org/content/early/2017/07/16/103648.abstract
AB The Gene Ontology (GO) database contains GO terms that describe biological functions of genes and proteins in the cell. A GO term contains one or two sentences describing a biological aspect. The GO database is used in many applications. One application is the comparison of two genes or two proteins by first comparing the semantic similarity of the GO terms that annotate them. Previous methods for this task have relied on the fact that GO terms are organized into a tree structure. In this old paradigm, the locations of two GO terms in the tree dictate their similarity score. In this paper, we introduce a new solution to the problem of comparing two GO terms. Our method uses natural language processing (NLP) and does not rely on the GO tree. In our approach, we use the Word2vec model to compare two words. Using this model as the key building block, we compare two sentences and, in turn, the definitions of two GO terms. Because a gene or protein is annotated by a set of GO terms, we can apply our method to compare two genes or two proteins. We validate our method in two ways. In the first experiment, we measure the similarity of genes in the same regulatory pathways. In the second experiment, we test the model’s ability to differentiate a true protein-protein network from a randomly generated network. Our results are equivalent to those of previous methods that depend on the GO tree. This shows promise for the development of NLP methods for comparing GO terms.
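
The abstract describes a layered comparison: word vectors compare two words, word scores are combined to compare two GO term definitions, and definition scores are combined to compare two genes. The following is a minimal sketch of that layering, not the authors' implementation. The abstract does not say how scores are aggregated, so the symmetric best-match average used here is an assumption, and the tiny hand-made vectors in TOY_VECTORS are hypothetical stand-ins for embeddings from a trained Word2vec model. All function names are illustrative.

```python
import math

# Hypothetical 3-dimensional word vectors, hand-made for illustration only.
# A real pipeline would load embeddings from a trained Word2vec model.
TOY_VECTORS = {
    "dna": (0.9, 0.1, 0.0),
    "repair": (0.8, 0.3, 0.1),
    "replication": (0.85, 0.2, 0.05),
    "lipid": (0.1, 0.9, 0.2),
    "transport": (0.2, 0.8, 0.4),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def word_similarity(w1, w2):
    """Word-level score: 1.0 for identical words, 0.0 for unknown words."""
    if w1 == w2:
        return 1.0
    if w1 not in TOY_VECTORS or w2 not in TOY_VECTORS:
        return 0.0
    return cosine(TOY_VECTORS[w1], TOY_VECTORS[w2])


def best_match_average(items_a, items_b, pair_score):
    """Symmetric best-match average over two collections.

    Assumed aggregation rule: the abstract does not specify one.
    """
    if not items_a or not items_b:
        return 0.0
    a_to_b = sum(max(pair_score(a, b) for b in items_b) for a in items_a) / len(items_a)
    b_to_a = sum(max(pair_score(a, b) for a in items_a) for b in items_b) / len(items_b)
    return (a_to_b + b_to_a) / 2


def definition_similarity(def_a, def_b):
    """Compare two GO term definitions as bags of lowercase words."""
    return best_match_average(def_a.lower().split(), def_b.lower().split(), word_similarity)


def gene_similarity(terms_a, terms_b):
    """Compare two genes, each given as a list of GO term definitions."""
    return best_match_average(terms_a, terms_b, definition_similarity)


if __name__ == "__main__":
    gene_x = ["dna repair", "dna replication"]
    gene_y = ["dna replication"]
    gene_z = ["lipid transport"]
    print("x vs y:", round(gene_similarity(gene_x, gene_y), 3))
    print("x vs z:", round(gene_similarity(gene_x, gene_z), 3))
```

In this toy setup, the two genes annotated with DNA-related definitions score higher against each other than against the gene annotated with a lipid-related definition. Because nothing in the sketch consults the GO tree, the same functions apply to any pair of definitions for which word vectors are available.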