ABSTRACT
Protein language models (pLMs) generate high-dimensional representations of proteins, so-called embeddings, that capture complex information stored in the set of evolved sequences. Interpreting these embeddings remains an important challenge. ProtSpace provides one solution through an open-source Python package that visualizes protein embeddings interactively in 2D and 3D. Combining the embedding space with a view of protein 3D structure helps to discover functional patterns that traditional sequence analysis readily misses.
We present two examples to showcase ProtSpace. First, investigations of phage data sets showed distinct clusters of major functional groups and a mixed region, possibly suggesting bias in the protein sequences currently used to train pLMs. Second, the analysis of venom proteins revealed unexpected convergent evolution between scorpion and snake toxins; this challenges existing toxin family classifications and adds evidence refuting the aculeatoxin family hypothesis.
ProtSpace is freely available as a pip-installable Python package (source code & documentation) with examples on GitHub (https://github.com/tsenoner/protspace) and as a web interface (https://protspace.rostlab.org). The platform enables seamless collaboration through portable JSON session files.
Competing Interest Statement
The authors have declared no competing interest.
Abbreviations
- 2D: two-dimensional
- 3D: three-dimensional
- BLAST: Basic Local Alignment Search Tool
- CLANS: CLuster ANalysis of Sequences
- EFI-EST: Enzyme Function Initiative-Enzyme Similarity Tool
- HTML: Hypertext Markup Language (file format)
- HMM: Hidden Markov Model
- JSON: JavaScript Object Notation (file format)
- MDS: Multidimensional scaling
- MSA: multiple sequence alignments
- PCA: Principal component analysis
- PHROG: protein orthologous groups
- pLM: protein Language Model
- SVG: Scalable Vector Graphics (file format)
- UMAP: Uniform Manifold Approximation and Projection