Abstract
Experimental structures are leveraged through multiple sequence alignments (MSAs), or more generally through homology-based inference (HBI), which transfers information from a protein with known annotation to a query without any annotation. A recent alternative expands HBI from sequence-distance lookup to embedding-based annotation transfer (EAT), where the embeddings are derived from protein Language Models (pLMs). Here, we introduce contrastive learning on single-protein representations from pLMs. This learning procedure creates a new set of embeddings that optimizes constraints captured by the hierarchical classification of protein 3D structures defined by the CATH resource. The approach, dubbed ProtTucker, recognizes distant homologous relationships better than more traditional techniques such as threading or fold recognition. Thus, these embeddings allow sequence comparison to step into the “midnight zone” of protein similarity, i.e., the region in which distantly related sequences have seemingly random pairwise sequence similarity. The novelty of this work lies in the particular combination of tools and sampling techniques that achieved performance comparable to or better than existing state-of-the-art sequence comparison methods. In addition, because this method does not need to generate alignments, it is also orders of magnitude faster. The code is available at https://github.com/Rostlab/EAT.
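The core idea of EAT described above, replacing sequence-distance lookup with a nearest-neighbor lookup in embedding space, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embeddings and CATH-style labels are random stand-ins for per-protein pLM vectors, and the function name `eat_transfer` is hypothetical.

```python
# Hedged sketch of embedding-based annotation transfer (EAT):
# the annotation of the closest labeled protein in embedding space
# is transferred to the unlabeled query.
import numpy as np

def eat_transfer(query_emb, lookup_embs, lookup_labels):
    """Return the label of the lookup protein closest (Euclidean) to the query."""
    dists = np.linalg.norm(lookup_embs - query_emb, axis=1)
    nearest = int(np.argmin(dists))
    return lookup_labels[nearest], float(dists[nearest])

rng = np.random.default_rng(0)
# Five annotated proteins with 128-dimensional embeddings (stand-in data).
lookup_embs = rng.normal(size=(5, 128))
lookup_labels = ["1.10.10.10", "2.40.50.140", "3.30.70.330",
                 "1.20.5.170", "2.60.40.10"]
# A query embedding close to the third annotated protein.
query = lookup_embs[2] + 0.01 * rng.normal(size=128)

label, dist = eat_transfer(query, lookup_embs, lookup_labels)
print(label)  # → "3.30.70.330"
```

Because the lookup is a single pairwise-distance computation rather than an alignment, it scales to large databases; in practice one would precompute the lookup embeddings once and reuse them for every query.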
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
* Thorough overhaul of manuscript
* Addition of new data (ablation study, detailed runtime estimates, detailed analysis of the sensitivity-distance relation, proteome analysis)
* Total number of tables added: 3 (SOM: 4)
Abbreviations used
- 3D
- three-dimensional
- BFD
- Big Fantastic Database (11)
- CATH
- hierarchical classification of protein 3D structures in Class, Architecture, Topology and Homologous superfamily (1,2)
- DL
- Deep Learning
- EAT
- Embedding-based Annotation Transfer
- EI
- evolutionary information
- embeddings
- fixed-size vectors derived from pre-trained pLMs
- ESM-1b
- pLM from Facebook dubbed Evolutionary Scale Modeling (12)
- FNN
- Feed-forward Neural Network
- FunFams
- functional families as sub-classification of the most fine-grained H level in CATH (13)
- HBI
- Homology Based Inference
- HMM
- Hidden Markov Model
- HMMer
- method for HMM-profile searches and alignments (6)
- HSSP
- homology-derived secondary structure of proteins (14)
- HVAL
- distance from the empirical curve separating protein pairs whose structural similarity is recognizable from pairwise sequence alignments (15)
- LM
- Language Model
- MMseqs2
- fast database search and multiple sequence alignment method (10)
- MSA
- Multiple Sequence Alignment
- NLP
- Natural Language Processing
- PDB
- Protein Data Bank
- PIDE
- percentage pairwise sequence identity
- pLM
- protein Language Model
- ProSE
- pLM based on long short-term memory (LSTM) cells dubbed Protein Sequence Embeddings (16)
- ProtBERT
- pLM based on the LM BERT (17)
- ProtT5
- pLM based on the LM T5 (18)