TY  - JOUR
T1  - Learning Representations for Gene Ontology Terms by Contextualized Text Encoders
JF  - bioRxiv
DO  - 10.1101/765644
SP  - 765644
AU  - Duong, Dat
AU  - Uppunda, Ankith
AU  - Ju, Chelsea
AU  - Zhang, James
AU  - Chen, Muhao
AU  - Eskin, Eleazar
AU  - Li, Jingyi Jessica
AU  - Chang, Kai-Wei
Y1  - 2019/01/01
UR  - http://biorxiv.org/content/early/2019/10/20/765644.abstract
N2  - Functions of proteins are annotated by Gene Ontology (GO) terms. As the number of newly collected sequences rises faster than the number of sequences annotated with GO terms, there have been efforts to develop better annotation techniques. When annotating protein sequences with GO terms, one key auxiliary resource is the GO data itself. GO terms have definitions consisting of a few sentences describing some biological events, and are also arranged in a tree structure with specific terms being child nodes of generic terms. The definitions and positions of the GO terms on the GO tree can be used to construct vector representations for the GO terms. These GO vectors can then be integrated into existing prediction models to improve classification accuracy. In this paper, we adapt the Bidirectional Encoder Representations from Transformers (BERT) to encode GO definitions into vectors. We evaluate BERT against the previous GO encoders in three tasks: (1) measuring similarity between GO terms, (2) asserting relationships between orthologs and interacting proteins based on their GO annotations, and (3) predicting GO terms for protein sequences. For task 3, we show that using GO vectors as additional prediction features increases the accuracy, primarily for GO terms with low occurrences in the manually annotated dataset. In all three tasks, BERT often outperforms the previous GO encoders.
ER  - 