Abstract
As new protein sequences are being collected at a faster pace than they can be annotated, there have been growing efforts to develop better methods for predicting protein functions. Because protein functions are annotated with Gene Ontology (GO) terms, one key auxiliary resource is the GO data itself. GO terms have definitions consisting of a few sentences describing a biological event, and are arranged in a tree structure with specific terms as child nodes of generic terms. The definitions of GO terms and their positions on the GO tree can be used to construct their vector representations. These vectors can then be integrated into existing prediction models to improve classification accuracy. In this paper, we adapt two neural network architectures, Embeddings from Language Models (ELMo) and Bidirectional Encoder Representations from Transformers (BERT), to encode GO definitions into vectors. We evaluate these new encoders against previous definition and position encoders on three tasks: (1) measuring the similarity between GO terms, (2) asserting relationships between orthologs and interacting proteins based on their GO annotations, and (3) predicting GO terms for protein sequences. Results on tasks 1 and 2 show that BERT-based encoders are often better than the other kinds of encoders. Results on task 3 show that using GO vectors as additional prediction features increases accuracy, primarily for GO terms with low occurrence counts in the dataset; although having GO vectors as features clearly helps, the choice of encoder does not greatly affect the outcome.
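To make the BERT-based definition encoder concrete, the following is a minimal sketch of embedding a GO definition into a fixed-length vector. It assumes the HuggingFace `transformers` library and the generic `bert-base-uncased` checkpoint, and pools by averaging the last two hidden layers (layers 11 and 12) over all tokens; the encoders evaluated in the paper may be trained or pooled differently.

```python
# A minimal sketch, not the paper's implementation: embed a GO definition
# with a pretrained BERT model by averaging hidden layers 11 and 12.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def encode_go_definition(definition: str) -> torch.Tensor:
    """Return a 768-dim vector for a GO definition."""
    inputs = tokenizer(definition, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # hidden_states is a tuple of 13 tensors: the embedding layer output
    # followed by the 12 transformer layers, each (1, seq_len, 768).
    layer11, layer12 = outputs.hidden_states[-2], outputs.hidden_states[-1]
    # Average the two layers, then mean-pool over the token dimension.
    return ((layer11 + layer12) / 2).mean(dim=1).squeeze(0)

# Hypothetical usage: embed a (paraphrased) GO definition.
vec = encode_go_definition(
    "A biological process accomplished by one or more ordered "
    "assemblies of molecular functions."
)
print(vec.shape)  # torch.Size([768])
```

Vectors produced this way can be compared with cosine similarity (task 1) or concatenated to sequence-derived features in a downstream GO-term classifier (task 3).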
Footnotes
eeskin@cs.ucla.edu, jli@stat.ucla.edu, kwchang@cs.ucla.edu