Abstract
As new protein sequences are being collected at a faster pace than they can be annotated, there have been growing efforts to develop better methods for predicting protein functions. Because protein functions are annotated with Gene Ontology (GO) terms, one key auxiliary resource is the GO data itself. GO terms have definitions consisting of a few sentences describing a biological event, and are arranged in a tree structure with specific terms as child nodes of generic terms. The definitions of GO terms and their positions on the GO tree can be used to construct their vector representations. These vectors can then be integrated into existing prediction models to improve classification accuracy. In this paper, we adapt two neural network architectures, Embeddings from Language Models (ELMo) and Bidirectional Encoder Representations from Transformers (BERT), to encode GO definitions into vectors. We evaluate these new encoders against previous definition and position encoders on three tasks: (1) measuring the similarity between GO terms, (2) asserting relationships between orthologs and interacting proteins based on their GO annotations, and (3) predicting GO terms for protein sequences. Results on tasks 1 and 2 show that BERT-based encoders are often better than the other kinds of encoders. Results on task 3 show that using GO vectors as additional prediction features increases accuracy, primarily for GO terms with low occurrence counts in the dataset; although having GO vectors as features clearly helps, the choice of encoder does not greatly affect the outcome.
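To make the BERT-based definition encoder concrete, the following is a minimal sketch of embedding a GO definition into a fixed-length vector. It assumes the HuggingFace `transformers` library and the generic `bert-base-uncased` checkpoint, and pools by averaging the last two hidden layers (layers 11 and 12) over all tokens; the encoders evaluated in the paper may be trained or pooled differently.

```python
# A minimal sketch, not the paper's implementation: embed a GO definition
# with a pretrained BERT model by averaging hidden layers 11 and 12.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def encode_go_definition(definition: str) -> torch.Tensor:
    """Return a 768-dim vector for a GO definition."""
    inputs = tokenizer(definition, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # hidden_states is a tuple of 13 tensors: the embedding layer output
    # followed by the 12 transformer layers, each (1, seq_len, 768).
    layer11, layer12 = outputs.hidden_states[-2], outputs.hidden_states[-1]
    # Average the two layers, then mean-pool over the token dimension.
    return ((layer11 + layer12) / 2).mean(dim=1).squeeze(0)

# Hypothetical usage: embed a (paraphrased) GO definition.
vec = encode_go_definition(
    "A biological process accomplished by one or more ordered "
    "assemblies of molecular functions."
)
print(vec.shape)  # torch.Size([768])
```

Vectors produced this way can be compared with cosine similarity (task 1) or concatenated to sequence-derived features in a downstream GO-term classifier (task 3).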
Footnotes
eeskin@cs.ucla.edu, jli@stat.ucla.edu, kwchang@cs.ucla.edu