Abstract
Inferring the properties of a protein from its amino acid sequence is one of the key problems in bioinformatics. Most state-of-the-art approaches to protein classification are tailored to a specific task and rely on handcrafted features, such as position-specific scoring matrices obtained from expensive database searches; these approaches achieve remarkable performance across different tasks. We argue that a similar level of performance can be reached with a generic architecture that is not tailored to the classification task under consideration, by leveraging the vast amount of unlabeled protein sequence data available in protein sequence databases. To this end, we put forward UDSMProt, a universal deep sequence model that is pretrained on a language modeling task on the Swiss-Prot database and finetuned on various protein classification tasks. For three different tasks, namely enzyme class prediction, gene ontology prediction, and remote homology and fold detection, we demonstrate that it is feasible to infer protein properties from the sequence alone and to reach state-of-the-art performance.