Abstract
The Gene Ontology (GO) contains GO terms that describe the biological functions of genes and proteins in the cell. Each GO term has a definition of one or two sentences describing a biological aspect. GO is used in many applications. One application is the comparison of two genes or two proteins by first comparing the semantic similarity of the GO terms that annotate them. Previous methods for this task have relied on the fact that GO terms are organized into a tree structure. In this old paradigm, the locations of two GO terms in the tree dictate their similarity score. In this paper, we introduce a new solution to the problem of comparing two GO terms. Our method uses natural language processing (NLP) and does not need the GO tree. We use the Word2vec model to compare two words. Using this model as the key building block, we compare two sentences and, in turn, the definitions of two GO terms. Because a gene or protein is annotated by a set of GO terms, we can apply our method to compare two genes or two proteins. We evaluate our method in two ways. In the first experiment, we measure the similarity of genes in the same regulatory pathways. In the second experiment, we test the model’s ability to differentiate a true protein-protein network from a randomly generated network. Our results are comparable to those of previous methods that depend on the GO tree. This shows the promise of NLP methods for comparing GO terms.
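The word-to-sentence-to-gene pipeline summarized above can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper. The toy word vectors stand in for a trained Word2vec model, and the symmetric best-match average used to combine word scores into sentence scores, and sentence scores into gene scores, is an assumption chosen for simplicity. All function names are hypothetical.

```python
import math

# Toy 3-dimensional vectors standing in for a trained Word2vec model.
# The values are made up for illustration only.
TOY_VECTORS = {
    "kinase": (0.9, 0.1, 0.0),
    "phosphorylation": (0.8, 0.2, 0.1),
    "transport": (0.0, 0.9, 0.2),
    "membrane": (0.1, 0.8, 0.3),
    "binding": (0.3, 0.3, 0.9),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def word_similarity(w1, w2, vectors=TOY_VECTORS):
    """Compare two words; out-of-vocabulary words score 0 (an assumption)."""
    if w1 not in vectors or w2 not in vectors:
        return 0.0
    return cosine(vectors[w1], vectors[w2])


def best_match_average(items_a, items_b, sim):
    """Symmetric best-match average (assumed aggregation rule):
    average each item's best score against the other collection,
    in both directions, then take the mean of the two averages."""
    if not items_a or not items_b:
        return 0.0
    a_to_b = sum(max(sim(a, b) for b in items_b) for a in items_a) / len(items_a)
    b_to_a = sum(max(sim(a, b) for a in items_a) for b in items_b) / len(items_b)
    return (a_to_b + b_to_a) / 2


def sentence_similarity(s1, s2):
    """Compare two sentences (e.g. GO term definitions) word by word."""
    return best_match_average(s1.lower().split(), s2.lower().split(), word_similarity)


def gene_similarity(definitions_a, definitions_b):
    """Compare two genes, each given as a list of GO term definitions."""
    return best_match_average(definitions_a, definitions_b, sentence_similarity)


if __name__ == "__main__":
    gene_a = ["kinase phosphorylation"]
    gene_b = ["phosphorylation kinase binding"]
    gene_c = ["membrane transport"]
    print(gene_similarity(gene_a, gene_b))
    print(gene_similarity(gene_a, gene_c))
```

In this sketch, no step consults the position of a GO term in the GO tree; only the definition text enters the score.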