Abstract
Machine learning could enable an unprecedented level of control in protein engineering for therapeutic and industrial applications. To be useful for designing proteins with desired properties, machine learning models must capture the protein sequence-function relationship, often termed the fitness landscape. Existing benchmarks like CASP or CAFA assess structure and function predictions of proteins, respectively, yet they do not target metrics relevant for protein engineering. In this work, we introduce Fitness Landscape Inference for Proteins (FLIP), a benchmark for function prediction to encourage rapid scoring of representation learning for protein engineering. Our curated tasks, baselines, and metrics probe model generalization in settings relevant for protein engineering, such as low-resource and extrapolative settings. Currently, FLIP encompasses experimental data across adeno-associated virus stability for gene therapy, protein domain B1 stability and immunoglobulin binding, and thermostability from multiple protein families. To enable ease of use and future expansion to new tasks, all data are presented in a standard format. FLIP scripts and data are freely accessible at https://benchmark.protein.properties.
Competing Interest Statement
KKY was previously employed by Generate Biomedicines.
Footnotes
christian.dallago{at}tum.de
jodymou{at}mit.edu
kjohnston{at}caltech.edu
bwittman{at}caltech.edu
nick_bhat{at}berkeley.edu
samlg{at}mit.edu
amadani{at}salesforce.com
yang.kevin{at}microsoft.com
Glossary
- Epistasis
- in the most general sense, epistasis refers to interactions that lead to non-independence of effects. For proteins, this means that the effect of a mutation in a protein sequence on fitness can vary based on co-occurring mutations.
- Fitness
- ability of a protein sequence to perform a specific, desired function.
- Fitness landscape
- both (1) a dataset mapping many protein sequences to fitness within a defined region of sequence space and (2) a conceptual framework for thinking about the mapping of protein sequence to fitness.
- Function
- a task performed by a protein sequence, typically referring to either a native task or a task desired by a protein engineer.
- Homology
- sharing a common origin at all levels (organism, population and species), which often results in similarity. For proteins, both sequences and structures can be considered homologous due to common origin. [1]
- Multiple sequence alignment
- an arrangement of three or more sequences such that similar regions are aligned. Gaps can be inserted within some sequences at a penalty so that as many of the similar regions of the sequences as possible are aligned.
- Mutagenesis
- introduction of genetic mutations. In protein engineering, mutagenesis is typically performed on a single DNA sequence encoding a protein.
- Mutant
- a resulting DNA (and, equivalently, protein) sequence from mutagenesis on an initial starting sequence. Mutagenesis for protein engineering can either result in a single mutant or a library (pool) of mutants.
- Parent sequence
- another word for the initial starting sequence prior to mutagenesis. This is not to be conflated with “wild type sequence”.
- Sequence identity
- similarity between two (typically aligned) sequences, commonly reported as the percentage of aligned positions that carry identical residues.
- Thermostability
- ability of a protein to preserve its structure and function under extreme temperature conditions. [2]
- Tree of life
- referring to the phylogenetic tree of life, which depicts the relationships of biological species based on their last common ancestors.
- Variant
- within this text, variant is used synonymously with mutant (see above).
- Wild type sequence
- a protein sequence that arises in nature and predominates within a natural population. While a wild type sequence can function as a parent sequence, these two terms have distinct meanings and should not be conflated.
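
Two of the glossary terms above, sequence identity and epistasis, can be made concrete with a small calculation. The following is a minimal sketch under stated assumptions, not part of FLIP: the function names `sequence_identity` and `pairwise_epistasis` are hypothetical, identity is computed over aligned columns while skipping columns where both sequences have a gap (other denominators are also in common use), and epistasis is measured as the deviation of a double mutant from an additive expectation (one of several possible scales).

```python
def sequence_identity(a: str, b: str, gap: str = "-") -> float:
    """Fraction of aligned columns with identical residues.

    Assumption: columns where both sequences have a gap are skipped;
    a gap opposite a residue counts as a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == gap and y == gap:
            continue  # column carries no information for this pair
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def pairwise_epistasis(f_parent: float, f_a: float, f_b: float, f_ab: float) -> float:
    """Deviation of the double mutant from the additive expectation.

    Assumption: fitness effects are compared on an additive scale
    relative to the parent sequence. Zero means the two mutations
    act independently on that scale.
    """
    observed = f_ab - f_parent
    expected = (f_a - f_parent) + (f_b - f_parent)
    return observed - expected


if __name__ == "__main__":
    # Toy aligned sequences differing at one of four positions.
    print(sequence_identity("ACDE", "ACDF"))
    # Toy fitness values: the double mutant exceeds the additive expectation.
    print(pairwise_epistasis(1.0, 1.5, 2.0, 3.0))
```

A nonzero value from `pairwise_epistasis` indicates that the effect of one mutation depends on the presence of the other, which is the non-independence described in the epistasis entry.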