Abstract
Visualizing relationships and similarities between proteins can reveal insightful biology. Current approaches to visualize and analyze proteins based on sequence homology, such as sequence similarity networks (SSNs), create representations of BLAST-based pairwise comparisons. These approaches could benefit from incorporating recent protein language models, which generate high-dimensional vector representations of protein sequences from self-supervised learning on hundreds of millions of proteins. Inspired by SSNs, we developed an interactive tool - Protein Language UMAPs (PLUMAPs) - to visualize protein similarity with protein language models, dimensionality reduction, and topic modeling. As a case study, we compare our tool to Sequence Similarity Network (SSN) using the proteomes of two related bacterial species, Mycobacterium tuberculosis and Mycobacterium smegmatis. Both SSNs and PLUMAPs generate protein clusters corresponding to protein families and highlight enrichment or depletion across species. However, only in PLUMAPs does the layout distance between proteins and protein clusters meaningfully reflect similarity. Thus in PLUMAPs, related protein families are displayed as nearby clusters, and larger-scale structures correlate with cellular localization. Finally, we adapt techniques from topic modeling to automatically annotate protein clusters, making them more easily interpretable and potentially insightful. We envision that as large protein language models permeate bioinformatics and interactive sequence analysis tools, PLUMAPs will become a useful visualization resource across a wide variety of biological disciplines. Anticipating this, we provide a prototype for an online, open source version of PLUMAPs.
Competing Interest Statement
The authors have declared no competing interest.