bioRxiv
New Results

Universal Deep Sequence Models for Protein Classification

Nils Strodthoff, Patrick Wagner, Markus Wenzel, Wojciech Samek
doi: https://doi.org/10.1101/704874
For correspondence: nils.strodthoff@hhi.fraunhofer.de

Abstract

Inferring the properties of a protein from its amino acid sequence is one of the key problems in bioinformatics. Most state-of-the-art approaches to protein classification are tailored to specific tasks and rely on handcrafted features, such as position-specific scoring matrices obtained from expensive database searches. These approaches show astonishing performance on different tasks. We argue that a similar level of performance can be reached with a generic architecture that is not tailored to the classification task under consideration, by leveraging the vast amount of unlabeled sequence data available in protein sequence databases. To this end, we put forward UDSMProt, a universal deep sequence model that is pretrained on a language modeling task on the Swiss-Prot database and fine-tuned on various protein classification tasks. For three different tasks, namely enzyme class prediction, gene ontology prediction, and remote homology and fold detection, we demonstrate the feasibility of inferring protein properties from the sequence alone while reaching state-of-the-art performance.
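
To make the phrase "language modeling task" concrete, the following minimal sketch shows how an amino acid sequence could be turned into next-token prediction pairs for pretraining. It is an illustration under stated assumptions, not the UDSMProt implementation: the vocabulary layout, the special token ids, and the helper names `encode` and `lm_pairs` are all hypothetical.

```python
# Hypothetical illustration; the vocabulary and helper names are assumptions,
# not taken from the UDSMProt code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
PAD, BOS, UNK = 0, 1, 2               # special token ids (assumed)
VOCAB = {aa: i + 3 for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence):
    """Map an amino acid string to integer token ids, prefixed with BOS."""
    return [BOS] + [VOCAB.get(aa, UNK) for aa in sequence.upper()]


def lm_pairs(token_ids):
    """Build (input, target) lists for next-token prediction.

    The target is the input shifted left by one position, so a model
    trained on these pairs predicts each residue from the ones before it.
    """
    return token_ids[:-1], token_ids[1:]


if __name__ == "__main__":
    ids = encode("MKTAY")
    inputs, targets = lm_pairs(ids)
    print(inputs)   # [1, 13, 11, 19, 3]
    print(targets)  # [13, 11, 19, 3, 22]
```

Because the targets come from the sequence itself, no labels are needed at this stage, which is what allows pretraining on unlabeled sequence data before fine-tuning on a labeled classification task.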

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
Posted July 17, 2019.

Subject Area

  • Synthetic Biology