RT Journal Article
SR Electronic
T1 Universal Deep Sequence Models for Protein Classification
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 704874
DO 10.1101/704874
A1 Nils Strodthoff
A1 Patrick Wagner
A1 Markus Wenzel
A1 Wojciech Samek
YR 2019
UL http://biorxiv.org/content/early/2019/07/17/704874.abstract
AB Inferring the properties of a protein from its amino acid sequence is one of the key problems in bioinformatics. Most state-of-the-art approaches for protein classification are tailored to specific classification tasks and rely on handcrafted features, such as position-specific scoring matrices obtained from expensive database searches; these approaches show astonishing performance on different tasks. We argue that a similar level of performance can be reached by leveraging the vast amount of unlabeled protein sequence data available from protein sequence databases, using a generic architecture that is not tailored to the specific classification task under consideration. To this end, we put forward UDSMProt, a universal deep sequence model that is pretrained on a language modeling task on the Swiss-Prot database and finetuned on various protein classification tasks. For three different tasks, namely enzyme class prediction, gene ontology prediction, and remote homology and fold detection, we demonstrate the feasibility of inferring protein properties and reaching state-of-the-art performance from the sequence alone.