Abstract
The sequence-structure-function relationships that ultimately generate the diversity of extant proteins are complex, as proteins bridge the gap between the multiple informational and physical scales involved in nearly all cellular processes. One limitation of existing protein annotation databases such as UniProt is that fewer than 1% of proteins have experimentally verified functions, so computational methods are needed to fill in the missing information. Here, we demonstrate that a multi-aspect framework based on protein language models can learn sequence-structure-function representations of amino acid sequences and can provide the foundation for sensitive, sequence-structure-function-aware protein sequence search and annotation. Based on this model, we introduce Protein-Vec, a multi-aspect information retrieval system for proteins that covers sequence, structure, and function aspects and enables computational protein annotation and function prediction at tree-of-life scales.
Competing Interest Statement
K.C. is currently a Senior Director of Frontier Research at Genentech. R.B. is currently VP of Machine Learning for Drug Discovery at Genentech gCS. K.C. and R.B. are currently employed by Prescient Design. T.H. is currently CTO of Phare Health. The remaining authors declare no competing interests.