New Results

Self-Supervised Contrastive Learning of Protein Representations By Mutual Information Maximization

Amy X. Lu, Haoran Zhang, Marzyeh Ghassemi, Alan Moses
doi: https://doi.org/10.1101/2020.09.04.283929
Amy X. Lu
1insitro, University of Toronto
Haoran Zhang
2Department of Computer Science, University of Toronto
3Vector Institute for Artificial Intelligence
Marzyeh Ghassemi
2Department of Computer Science, University of Toronto
3Vector Institute for Artificial Intelligence
Alan Moses
2Department of Computer Science, University of Toronto
4Department of Cell and Systems Biology, University of Toronto
  • For correspondence: alan.moses@utoronto.ca

Abstract

Pretrained embedding representations of biological sequences that capture meaningful properties can alleviate many problems associated with supervised learning in biology. We apply the principle of mutual information maximization between local and global information as a self-supervised pretraining signal for protein embeddings. To do so, we divide protein sequences into fixed-size fragments and train an autoregressive model to distinguish between subsequent fragments from the same protein and fragments from random proteins. Our model, CPCProt, achieves performance comparable to state-of-the-art self-supervised models for protein sequence embeddings on various downstream tasks, while using only 0.9% to 8.9% of the parameters of the benchmarked models. Further, we explore how downstream assessment protocols affect embedding evaluation, and how contrastive learning hyperparameters affect empirical performance. We hope that these results will inform the development of contrastive learning methods in protein biology and other modalities.
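
As an illustration of the contrastive objective described in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes non-overlapping fragments, a plain dot-product score between a context vector and candidate fragment embeddings, and an InfoNCE-style loss. The function names `split_into_fragments` and `info_nce_loss`, the fragment length, and the toy vectors are all hypothetical; the encoder and autoregressive model that would produce the embeddings are omitted.

```python
import numpy as np


def split_into_fragments(sequence, fragment_len):
    """Split a sequence into non-overlapping fixed-size fragments.

    Trailing residues that do not fill a whole fragment are dropped
    (an assumption made for this sketch).
    """
    n_fragments = len(sequence) // fragment_len
    return [
        sequence[i * fragment_len:(i + 1) * fragment_len]
        for i in range(n_fragments)
    ]


def info_nce_loss(context, positive, negatives):
    """InfoNCE-style loss for a single context vector.

    context:   shape (d,), summary of the fragments seen so far
    positive:  shape (d,), embedding of a later fragment, same protein
    negatives: shape (k, d), embeddings of fragments from other proteins

    The score is a dot product (an assumption); the loss is the
    cross-entropy of picking the positive among the k + 1 candidates.
    """
    candidates = np.vstack([positive[None, :], negatives])  # positive at row 0
    scores = candidates @ context
    scores = scores - scores.max()  # shift for numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return float(-log_probs[0])


# Toy usage with made-up inputs.
fragments = split_into_fragments("MKTAYIAKQR", fragment_len=4)
context = np.array([1.0, 0.0])
positive = np.array([5.0, 0.0])
negatives = np.array([[-5.0, 0.0], [0.0, 0.0]])
loss = info_nce_loss(context, positive, negatives)
```

The loss is small when the context scores the true later fragment above the fragments drawn from other proteins, and it equals the log of the number of candidates when the scores carry no information.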

Competing Interest Statement

Amy X. Lu is currently an employee of Insitro Inc. Work completed at the University of Toronto.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
Posted September 06, 2020.
Subject Area

  • Bioinformatics