Abstract
Protein language models (PLMs) implicitly learn distributional constraints on protein sequences upheld over the course of evolution. As a consequence, the sequence and mutation-level likelihoods of such models form effective zero-shot predictors of the effects of mutations. Although various schemes have been proposed for exploiting the distributional knowledge captured by PLMs to enhance supervised fitness prediction and sequence design tasks, a lack of head-to-head comparison across different prediction strategies and different classes of PLM has made it challenging to identify the best-performing methods. Our contribution is to extend previously proposed ranking-based loss functions to develop likelihood scoring functions for family-based and masked PLMs. We demonstrate that in the low-data setting, the best configurations outperform the current state-of-the-art approach, which is based on frozen embeddings. Furthermore, we propose ensembling strategies that exploit the strong dependence of the mutational distributions learned by PLMs on sequence context, showing that they can be used to guide efficient optimisation strategies over fitness landscapes.
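To make the zero-shot scoring concrete, the sketch below estimates the effect of a single substitution with a masked PLM by comparing the model's log-probabilities for the mutant and wild-type residues at a masked position (a masked-marginal score). The checkpoint name (facebook/esm2_t6_8M_UR50D) and the helper masked_marginal_score are illustrative assumptions, not the scoring functions developed in this work.

```python
# Minimal masked-marginal scoring sketch (assumed checkpoint and helper name,
# not this paper's exact scoring function).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_NAME = "facebook/esm2_t6_8M_UR50D"  # small public ESM-2 masked PLM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME).eval()


@torch.no_grad()
def masked_marginal_score(sequence: str, position: int, mutant_aa: str) -> float:
    """Score a single substitution as log p(mutant) - log p(wild type).

    The wild-type residue at `position` (0-indexed) is masked, and the model's
    conditional distribution at that position is compared for the two residues.
    The +1 offset accounts for the tokenizer's leading special (CLS) token.
    """
    wildtype_aa = sequence[position]
    inputs = tokenizer(sequence, return_tensors="pt")
    inputs["input_ids"][0, position + 1] = tokenizer.mask_token_id

    logits = model(**inputs).logits[0, position + 1]
    log_probs = torch.log_softmax(logits, dim=-1)

    mutant_id = tokenizer.convert_tokens_to_ids(mutant_aa)
    wildtype_id = tokenizer.convert_tokens_to_ids(wildtype_aa)
    return (log_probs[mutant_id] - log_probs[wildtype_id]).item()


# Example: score the substitution A4G (0-indexed position 3) in a toy sequence.
print(masked_marginal_score("MKTAYIAKQR", position=3, mutant_aa="G"))
```

Because this score conditions on the surrounding residues, evaluating it under varied sequence contexts is one plausible way to build the kind of context-dependent ensembles referred to above.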
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
* Work was initiated during an internship at Instadeep, and subsequently completed by the research team.
This version of the manuscript has been revised to include more comprehensive results and analysis.