Abstract
Foundation models, which encode patterns in large, high-dimensional data as embeddings, show promise in many machine learning applications in molecular biology. The embeddings learned by these models provide informative features for downstream prediction tasks; however, the information a model captures is often not interpretable. One approach to understanding the captured information is to analyze the learned embeddings, which in molecular biology has so far focused mainly on visualizing individual embedding spaces. This study introduces a quantitative framework for cross-space comparison, enabling intuitive exploration and comparison of embedding spaces in molecular biology. The framework emphasizes analyzing how known biological information is distributed within embedding space neighborhoods and provides insights into relationships between multiple embedding spaces. Comparison techniques include global pairwise distance measurements as well as local nearest neighbor analyses. By applying our framework to embeddings from protein language models, we demonstrate how embedding space analysis can serve as a valuable pre-filtering step for task-specific supervised machine learning applications and for recognizing differential patterns in data encoded within and across different embedding spaces. To support broad usability, we provide a Python library that implements all analysis methods, available at https://github.com/broadinstitute/EmmaEmb.
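The two comparison families named above, global pairwise distances and local nearest neighbors, can be illustrated with a minimal sketch. It is not the EmmaEmb API: the function names (`global_distance_correlation`, `mean_knn_overlap`), the choice of Euclidean distance, and the use of Pearson correlation are all assumptions made for illustration, and the toy coordinates are invented. Two embedding spaces are assumed to describe the same samples in the same row order.

```python
import math
from itertools import combinations


def pairwise_distances(X):
    """Full Euclidean distance matrix between the rows of X (assumed metric)."""
    return [[math.dist(a, b) for b in X] for a in X]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def global_distance_correlation(A, B):
    """Global comparison: correlate the pairwise distances of two spaces."""
    da, db = pairwise_distances(A), pairwise_distances(B)
    pairs = list(combinations(range(len(A)), 2))
    return pearson([da[i][j] for i, j in pairs],
                   [db[i][j] for i, j in pairs])


def knn(X, k):
    """Set of the k nearest neighbors (excluding itself) for every row of X."""
    d = pairwise_distances(X)
    neighbors = []
    for i, row in enumerate(d):
        order = sorted((j for j in range(len(X)) if j != i),
                       key=lambda j: row[j])
        neighbors.append(set(order[:k]))
    return neighbors


def mean_knn_overlap(A, B, k):
    """Local comparison: mean fraction of shared k-nearest neighbors."""
    na, nb = knn(A, k), knn(B, k)
    return sum(len(a & b) / k for a, b in zip(na, nb)) / len(A)


# Toy data: B is A scaled by 2, so both measures report perfect agreement.
A = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
B = [(2 * x, 2 * y) for x, y in A]
print(global_distance_correlation(A, B))
print(mean_knn_overlap(A, B, k=2))
```

For real embeddings with thousands of samples, a vectorized implementation would be the practical choice; the pure-Python version here only keeps the logic explicit.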
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
† Co-supervision
Further development of method and application examples.