PT - JOURNAL ARTICLE
AU - Lu, Amy X.
AU - Zhang, Haoran
AU - Ghassemi, Marzyeh
AU - Moses, Alan
TI - Self-Supervised Contrastive Learning of Protein Representations By Mutual Information Maximization
AID - 10.1101/2020.09.04.283929
DP - 2020 Jan 01
TA - bioRxiv
PG - 2020.09.04.283929
4099 - http://biorxiv.org/content/early/2020/09/06/2020.09.04.283929.short
4100 - http://biorxiv.org/content/early/2020/09/06/2020.09.04.283929.full
AB - Pretrained embedding representations of biological sequences which capture meaningful properties can alleviate many problems associated with supervised learning in biology. We apply the principle of mutual information maximization between local and global information as a self-supervised pretraining signal for protein embeddings. To do so, we divide protein sequences into fixed-size fragments, and train an autoregressive model to distinguish between subsequent fragments from the same protein and fragments from random proteins. Our model, CPCProt, achieves comparable performance to state-of-the-art self-supervised models for protein sequence embeddings on various downstream tasks, but reduces the number of parameters to between 0.9% and 8.9% of benchmarked models. Further, we explore how downstream assessment protocols affect embedding evaluation, and the effect of contrastive learning hyperparameters on empirical performance. We hope that these results will inform the development of contrastive learning methods in protein biology and other modalities. Competing Interest Statement: Amy X. Lu is currently an employee of Insitro Inc. Work completed at the University of Toronto.
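The abstract describes a CPC-style contrastive objective: an autoregressive context over earlier fragments must identify the true next fragment among fragments from random proteins. The batch-softmax loss commonly used for this (InfoNCE) can be sketched as below; the function name, shapes, and NumPy implementation are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def info_nce_loss(context, positives, temperature=1.0):
    """Hypothetical sketch of an InfoNCE-style contrastive loss.

    context:   (B, D) context vectors, e.g. an autoregressive summary of a
               protein's earlier fragments.
    positives: (B, D) encodings of each protein's true next fragment; the
               other rows in the batch serve as negatives (fragments drawn
               from random proteins).
    Returns the mean cross-entropy of identifying the true fragment
    within the batch.
    """
    # Pairwise similarity scores: row i against every candidate fragment.
    logits = context @ positives.T / temperature          # (B, B)
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Diagonal entries correspond to the matched (positive) pairs.
    return -np.mean(np.diag(log_probs))
```

With well-aligned context/fragment pairs the loss approaches zero, while mismatched pairs score near log(B), which is why larger batches (more negatives) tighten the mutual-information bound this objective estimates.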