Masked Inverse Folding with Sequence Transfer for Protein Representation Learning

Self-supervised pretraining on protein sequences has led to state-of-the-art performance on protein function and fitness prediction. However, sequence-only methods ignore the rich information contained in experimental and predicted protein structures. Meanwhile, inverse folding methods reconstruct a protein's amino-acid sequence given its structure, but do not take advantage of sequences that do not have known structures. In this study, we train a masked inverse folding protein masked language model parameterized as a structured graph neural network. During pretraining, this model learns to reconstruct corrupted sequences conditioned on the backbone structure. We then show that using the outputs from a pretrained sequence-only protein masked language model as input to the inverse folding model further improves pretraining perplexity. We evaluate both of these models on downstream protein engineering tasks and analyze the effect of using information from experimental or predicted structures on performance.


INTRODUCTION
Large pretrained protein language models have advanced the ability of machine learning models to predict protein function and fitness from sequence, especially when labeled training data is scarce (Alley et al., 2019; Rao et al., 2019; Elnaggar et al., 2021; Brandes et al., 2021). These models address the limitation that effective deep learning models generally require an abundance of labeled data to train. Since high-quality labels are only available for a limited number of sequences in most applications, protein language models first expose the model to a large quantity of unlabeled sequences in a pretraining phase (Figure 1a), with the goal of imparting a general foundation of knowledge about protein sequences so that the model can be rapidly specialized to downstream tasks of interest with less training data than training from scratch (Figure 1b). The recent state of the art, inspired by BERT (bidirectional encoder representations from transformers) (Devlin et al., 2018), uses increasingly large transformer (Vaswani et al., 2017) models to reconstruct masked and mutated protein sequences taken from databases such as UniProt (UniProt Consortium, 2021), UniRef (Suzek et al., 2015), and BFD (Steinegger et al., 2019; Steinegger & Söding, 2018). We will use the term masked language models (MLMs) to refer to models trained with the BERT reconstruction objective. Pretrained protein MLMs contain structural information (Rao et al., 2019; Chowdhury et al., 2021), encode evolutionary trajectories (Hie et al., 2022b; 2021), are zero-shot predictors of mutation fitness effects, improve out-of-domain generalization on protein engineering datasets (Dallago et al., 2021), and suggest improved sequences for engineering (Hie et al., 2022a). Protein MLMs are now incorporated into the latest machine-learning methods for detecting signal peptides (Teufel et al., 2021) and predicting intracellular localization (Thumuluri et al., 2022).
However, only training on sequences ignores the rich information contained in experimental and predicted protein structures, especially as the number of high-quality structures from AlphaFold (Jumper et al., 2021;Varadi et al., 2022) increases.
Meanwhile, inverse folding methods reconstruct a protein's amino-acid sequence given its structure. Deep learning-based inverse folding is usually parametrized as a graph neural network (GNN) (Ingraham et al., 2019; Strokach et al., 2020; Jin et al., 2021; Jing et al., 2020; Hsu et al., 2022; Dauparas et al., 2022; Shi et al., 2022) or an SE(3)-equivariant transformer (McPartlon et al., 2022) that either reconstructs or autoregressively decodes the amino-acid sequence conditioned on the desired backbone structure. This is similar in spirit to work in fixed-backbone protein design, which involves the design of proteins with a given target backbone structure. Outside of deep learning-based methods, researchers use packing algorithms (Dahiyat & Mayo, 1997; Street & Mayo, 1999; DeGrado et al., 1991; Harbury et al., 1998), physics-based energy functions (Alford et al., 2017), or match structural motifs to sequence motifs (Zhou et al., 2020). More recent methods attempt to invert deep-learning protein structure prediction models (Jendrusch et al., 2021; Moffat et al., 2021). See Ovchinnikov & Huang (2021) for a more comprehensive review of fixed-backbone protein design approaches. The ability to generate amino-acid sequences that fold into a desired structure is useful for developing novel therapeutics, biosensors (Quijano-Rubio et al., 2021), industrial enzymes (Siegel et al., 2010), and targeted small molecules (Lucas & Kortemme, 2020). Furthermore, single-chain inverse folding approaches could be coupled with recent sequential-assembly-based multimer structure prediction techniques (Bryant et al., 2022) for fixed-backbone multimer design.
However, we are primarily interested in using inverse folding as a pretraining task, with the intuition that incorporating structural information should improve performance on downstream fitness or function prediction tasks. Furthermore, current inverse folding methods must be trained on sequences with known or predicted structures, and thus do not take maximal advantage of the large number of sequences that do not have known structures or of the menagerie of pretrained protein MLMs. For example, UniRef50 contains 42 million sequences, while the PDB (Rose et al., 2016) currently contains 190 thousand experimentally-measured protein structures. In this study, we train a Masked Inverse Folding (MIF) protein masked language model. To our knowledge, this is the first example of combining the MLM task with structure conditioning as a pretraining task (Figure 1a). We then show that using the outputs from a pretrained sequence-only protein MLM as input to MIF further improves pretraining perplexity by leveraging information from sequences without experimental structures. We refer to this model as Masked Inverse Folding with Sequence Transfer (MIF-ST). Figure 1c compares the previous sequence-only dilated convolutional protein MLM (CARP), MIF, and MIF-ST. This is a novel way of transferring information from unlabeled protein sequences into a model that requires structure. We evaluate MIF and MIF-ST on downstream protein engineering tasks and analyze the effect of experimental and predicted structures on performance. Finally, we comment on the state of pretrained models for protein fitness prediction.

PRETRAINING
Proteins are chains of amino acids that fold into three-dimensional structures. In masked language modeling pretraining on protein sequences, a model learns to reconstruct the original protein sequence from a corrupted version; the model likelihoods are then used to make zero-shot predictions, or the pretrained weights are used as a starting point for training on a downstream task, such as structure or fitness prediction. While MLM pretraining on protein sequences can encode structural and functional information, we reasoned that conditioning the model on the protein's backbone structure should improve sequence recovery. A protein's backbone structure consists of the coordinates of each amino-acid residue's C, Cα, Cβ, and N atoms, leaving out information about the side chains (which would trivially reveal each residue's amino-acid identity), as illustrated in Figure 1d. We call the pretraining task of reconstructing a corrupted protein sequence conditioned on its backbone structure Masked Inverse Folding. We trained a 4-layer MIF on the CATH4.2 dataset (Sillitoe et al., 2019) using the training, validation, and testing splits from Ingraham et al. (2019), in which no CATH topology classification is shared between data splits. MIF is parametrized as a structured GNN (Ingraham et al., 2019); for details, refer to the Methods.
We previously trained CARP-640M, a dilated convolutional protein masked language model with approximately 640 million parameters, on UniRef50; it achieves comparable results to ESM-1b, which has a similar number of parameters and is trained on an earlier release of UniRef50 (Yang et al., 2022). Importantly, all sequences with greater than 30% identity to the CATH test set were removed from CARP-640M's training set in order to obtain a fair evaluation on the CATH test set. As shown in Table 1, conditioning on the backbone structure drastically improves perplexity and sequence recovery compared to CARP-640M, despite MIF having 20 times fewer parameters and being trained on only the 19 thousand examples in CATH compared to the 42 million sequences in UniRef50.

Figure 1: a) Pretraining phase. Models are pretrained with masked language modeling with optional structural conditioning (red): amino acids in unlabeled protein sequences are randomly masked and mutated to other amino acids, and the model is trained to recover the original sequence. b) After pretraining, models are specialized to downstream tasks. The weights of the pretrained model are transferred to the new task (dotted lines). For some zero-shot tasks, such as predicting the impact of mutations, the masked language modeling decoder is also useful and can be transferred. Otherwise, the decoder is replaced with a prediction head trained to output a prediction useful for the downstream task. Once again, the downstream tasks may also be conditioned on structure. c) Summary of models: the Convolutional Autoencoding Representations of Proteins (CARP) protein masked language model, the Masked Inverse Folding (MIF) model, and the Masked Inverse Folding with Sequence Transfer (MIF-ST) model. d) The backbone atoms of amino-acid residues i and j with their dihedral and planar angles highlighted.
Increasing the GNN depth to 8 layers does not improve pretraining performance, so we use MIF with 4 layers for all following experiments. For comparison, we also train a 3.5M-parameter geometric vector perceptron (GVP) (Jing et al., 2020) on the same masked inverse folding task, which we will refer to as GVPMIF. The GVP architecture improves pretraining perplexity and recovery, but we find that it does not improve performance on downstream tasks.
While conditioning on structure improves sequence recovery compared to sequence-only pretraining, we hypothesized that transferring information from sequences for which no structure is available should further improve performance on the pretraining task. Therefore, we transfer sequence information from CARP-640M by directly replacing MIF's learned sequence embedding with the outputs of CARP-640M pretrained on UniRef50. The pretrained CARP-640M weights were not finetuned during training on CATH4.2. Figure 1c illustrates CARP, MIF, and MIF-ST. As shown in Table 1, sequence transfer improves perplexity and recovery on the CATH4.2 test set over both CARP-640M and MIF. Using the CARP-640M architecture with randomly-initialized weights as input to MIF did not improve performance, showing that simply increasing model capacity is insufficient and that sequence transfer is necessary for the improvement.
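Conceptually, the sequence-transfer swap amounts to replacing a learned token-embedding lookup with frozen pretrained-LM outputs before concatenation with the structural node features. The sketch below is a minimal numpy illustration: `toy_plm`, the dimensions, and all weights are stand-ins, not the real CARP-640M or MIF code.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_struct, d_s = 8, 10, 30                       # residues, structure / sequence feature dims

table = rng.normal(size=(20, d_s))                 # trainable token-embedding table (MIF)
proj = rng.normal(size=(d_s, d_s))

def toy_plm(tokens):
    # Stand-in for CARP-640M: any frozen function of the token sequence.
    return np.tanh(table[tokens] @ proj)

def learned_embedding(tokens):
    # MIF: look up a trainable per-token embedding.
    return table[tokens]

def frozen_plm_embedding(tokens):
    # MIF-ST: same interface, but the embedding comes from a frozen
    # pretrained language model whose weights are never finetuned.
    return toy_plm(tokens)

tokens = rng.integers(0, 20, size=L)               # toy amino-acid indices
struct_feats = rng.normal(size=(L, d_struct))      # per-residue structural features

# In both cases the node input is [structure features ; sequence embedding].
v_mif = np.concatenate([struct_feats, learned_embedding(tokens)], axis=-1)
v_mifst = np.concatenate([struct_feats, frozen_plm_embedding(tokens)], axis=-1)
assert v_mif.shape == v_mifst.shape == (L, d_struct + d_s)
```

Because the two embeddings share a shape, the downstream GNN is unchanged; only the source of the sequence representation differs.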

DOWNSTREAM TASKS
After pretraining, a MIF model can be used to perform any downstream task that a sequence-only protein MLM can, with the caveat that a structure must be provided. Intuitively, structure-conditioned pretraining and having access to structures for the downstream task should both improve performance. We evaluate MIF and MIF-ST on downstream tasks relevant to protein engineering, including out-of-domain generalization and zero-shot mutation effect prediction.

OUT-OF-DOMAIN GENERALIZATION
It is desirable for pretrained protein models to be able to make the types of out-of-domain predictions that often occur in protein engineering campaigns. For example, a protein engineer may want to train a model on single mutants and make predictions for sequences with multiple mutations, or train a model that is accurate for sequences with fitness greater than what is seen in the training set.
We finetune and evaluate on two fitness landscapes. We use splits from FLIP (Dallago et al., 2021) over the GB1 landscape; these splits test generalization from fewer to more mutations or from lower-fitness sequences to higher-fitness sequences. We use PDB 2GI9 (Franks et al., 2006) as the structure.
We compare MIF and MIF-ST to CARP-640M, GVPMIF, and ESM-1b. All large models are finetuned end-to-end on a single Nvidia V100 GPU with a 2-layer perceptron as the predictive head until the validation performance stops improving. In addition, we use the small CNN from Yang et al. (2022) and ridge regression as baselines.
As shown in Table 2, no model or pretraining scheme outperforms all others on both MSE and Spearman ρ for the Rma NOD task. For protein engineering tasks, rank ordering is generally more important than minimizing error, so we will primarily compare the Spearmans. However, ridge regression achieves the best Spearman at the cost of a very high MSE. The small CNN is also a strong baseline on this task.

Table 3 shows results on the GB1 tasks. All models except GVPMIF consistently benefit from pretraining on the GB1 tasks. Combining structure conditioning with sequence transfer seems to help slightly, with the biggest gains coming when the training set is limited to single- and double-mutants. However, for the most challenging 1-vs-many split, ridge regression results in the best performance, and for low-vs-high, the small CNN results in the best performance. MIF-ST with random weights consistently converged to a degenerate solution, predicting the same value for every sequence in the test set, for all 3 random seeds on 3 of the 4 splits.
In general, pretraining usually helps, as does adding structure when comparing MIF and MIF-ST to CARP-640M, and adding sequence transfer when comparing MIF-ST to MIF. However, different tasks, even those on the same underlying protein fitness landscape, are not best predicted by the same models, and the baseline models compare favorably on many tasks. Furthermore, there is no correlation between pretraining performance and out-of-domain performance, even when adding structure or sequence transfer. This suggests a mismatch between the masked language model pretraining task and the sort of out-of-domain performance desired for protein engineering.

ZERO-SHOT MUTATION EFFECT PREDICTION

We score sequences by masking every mutated position and computing the log-odds ratio between the mutated and wild-type residues at each mutated position, assuming an additive model when a sequence contains multiple mutations (the "masked marginal" method), except for stability, where we use the pseudolikelihood. Where possible, we compare to ESM-1v, a transformer masked language model trained on UniRef90; Structured Transformer; the SE(3)-equivariant model from McPartlon et al. (2022); GVP; and GVP-AF2. The ESM-1x values for DeepSequence are taken from prior work; the ESM-1x values for RBD are taken from Hsu et al. (2022). We compute values for ESM-1v on MSP, stability, and GB1 using only the second model, not the full ensemble of five independent models. Values for GVP and GVP-AF2 are both taken from Hsu et al. (2022); we take the best reported value for each task across several tested model variants. Note that this GVP is trained on a different dataset and task than our GVPMIF model.
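The masked-marginal log-odds scoring described above can be sketched as follows. This is a toy numpy illustration: `toy_log_probs` stands in for a masked forward pass of the real model, and the wild-type sequence and mutations are made up.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"  # 20-letter amino-acid vocabulary

def masked_marginal_score(wt_seq, mutations, log_probs_with_mask):
    """Score a variant under the additive masked-marginal rule.

    mutations: list of (position, wt_aa, mut_aa) tuples.
    log_probs_with_mask(seq, pos): length-20 log-probability vector for
    position `pos` when that position is masked in `seq`.
    """
    score = 0.0
    for pos, wt, mut in mutations:
        assert wt_seq[pos] == wt
        lp = log_probs_with_mask(wt_seq, pos)          # mask and run the model
        score += lp[AA.index(mut)] - lp[AA.index(wt)]  # per-site log-odds ratio
    return score                                       # additive over mutations

def toy_log_probs(seq, pos):
    # Toy model: near-uniform, with extra mass on the wild-type residue.
    p = np.full(20, 0.04)
    p[AA.index(seq[pos])] += 0.24
    return np.log(p / p.sum())

s = masked_marginal_score("ACDE", [(1, "C", "W"), (3, "E", "A")], toy_log_probs)
assert s < 0.0  # the toy model prefers wild type, so mutations score negatively
```

With a real MLM, a more negative score predicts a more deleterious variant, and the additive assumption is what allows multi-mutant scoring from single masked passes.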
For all tasks except DeepSequence, MIF is better than the sequence-only methods, and on DeepSequence, adding sequence transfer improves performance above that of the sequence-only methods. Within DeepSequence, MIF-ST beats CARP-640M on 22 out of 41 datasets and MIF on 37 out of 41 datasets. Figure 2a shows results for each of the DeepSequence datasets. On the other tasks, MIF and MIF-ST achieve similar results, with sequence transfer not consistently improving zero-shot performance despite improving pretraining performance. We suspect this is because fitness is unidentifiable from observational sequence data alone (Weinstein et al., 2022), and therefore improved density estimation does not necessarily lead to improved zero-shot fitness predictions. Table A2 shows that MIF outperforms CARP-640M on all ten folds tested in the stability dataset, and MIF-ST outperforms MIF on six out of ten folds. On both MSP and stability, MIF and MIF-ST are comparable to other inverse folding methods, but GVP-AF2 is the clear winner on RBD.

For DeepSequence, GB1, and RBD, we also compared the effect of using PDB or AlphaFold structures, as shown in Table 5. Surprisingly, AlphaFold structures lead to better predictions for both GB1 and DeepSequence. (We were only able to find PDB structures for 38 of the 41 DeepSequence datasets, so the results in Table 5 differ slightly from those in Table 4. The PDB structures used are listed in Table A1.) As shown in Figure 2b, the AlphaFold structure for GB1 is nearly identical to its PDB structure. It is unclear why AlphaFold structures lead to better zero-shot predictions in these cases. However, using an AlphaFold-Multimer prediction for RBD greatly degrades performance. Upon examining the structures, this is not surprising: AlphaFold places the RBD on the wrong side of ACE2, as shown in Figure 2c. Table A3 shows zero-shot performance on the GB1 dataset separated by number of mutations, using both PDB and AlphaFold structures.
Without structural information, CARP-640M performs poorly even for single mutants and shows no correlation at all for triple and quadruple mutants. Adding structure allows MIF and MIF-ST to make much better predictions at all mutation levels, but accuracy nevertheless falls quickly as the number of mutations increases.

CONCLUSION AND DISCUSSION
Protein structure is a richer source of information than protein sequence, but the largest protein sequence databases contain billions of sequences, while the number of experimental structures is currently in the hundreds of thousands. In this work, we investigate masked inverse folding on 19 thousand structures and sequences as a pretraining task. We observe that MIF is an effective pretrained model for a variety of downstream protein engineering tasks. We then extend MIF by transferring information from a model trained on tens of millions of protein sequences, improving pretraining perplexity and performance on some downstream tasks. High-quality predicted structures from AlphaFold often improve zero-shot performance over experimental structures. However, improving pretraining perplexity does not always lead to better downstream performance, and no model consistently outperforms the others on out-of-domain prediction tasks. We suspect that this is due to a mismatch between the masked language model pretraining task and out-of-domain fitness prediction.
However, the MIF and MIF-ST pretraining schemes have several important limitations. First, they require structures as input during downstream tasks. This is ameliorated by the ability to predict high-quality structures for most protein sequences. In this work, we use a single structure for each protein and its variants; we may be able to improve results by predicting structures for all variants. However, this would be computationally expensive for large datasets, and it is currently unclear how well AlphaFold predicts the effects of single mutations (Pak et al., 2021). Some datasets, such as the FLIP Meltome landscape, contain many unrelated sequences; collating or predicting structures for each sequence would be a significant endeavor. Furthermore, because the structure is held constant during pretraining, it is unclear how to deal with insertions and deletions in downstream tasks. For example, this prevented us from evaluating on the FLIP AAV landscape.
While we condition on structure and reconstruct sequence, there are other methods for incorporating protein structural information, such as predicting structure similarity between protein sequences (Bepler & Berger, 2019), corrupting and reconstructing the structure in addition to the sequence (Mansoor et al., 2021; Chen et al., 2022), encoding surface features (Townshend et al., 2019), contrastive learning (Zhang et al., 2022), or using a graph encoder without sequence decoding (Somnath et al., 2021; Fuchs et al., 2020). LM-GVP uses the same architecture as MIF-ST: a pretrained language model feeding into a GNN that encodes backbone structure. However, in LM-GVP the structure-aware module is used as a finetuned prediction head without any pretraining. Some of these methods suggest improvements that should be composable with MIF and MIF-ST. Using a more advanced GVP or SE(3)-transformer architecture instead of the structured GNN as the base model would likely improve pretraining performance, as would augmenting with AlphaFold structures or adding noise to the input structures. Another obvious extension is to train with an autoregressive or span-masking loss, which should be more amenable to generation tasks, better handle insertions and deletions, and may generalize better to complexes.
Machine learning on molecular data generally entails fewer societal risks than work on language, images, medical, or human data. Pretraining data comes from large, curated protein databases that compile results from the scientific literature, with no privacy concerns. However, large pretrained models incur significant energy and monetary costs to train. CARP-640M and ESM are trained on hundreds of V100s for weeks at a time, contributing to greenhouse gas emissions and putting retraining out of the reach of most academic labs.
Most work in protein pretraining has used methods borrowed from natural language processing on amino-acid sequences. However, techniques that treat proteins explicitly as biomolecules and leverage information from structure, annotations, and even free text should improve downstream performance. We hope that MIF and MIF-ST will lead to more investigations of multimodal protein pretraining tasks.

MASKED INVERSE FOLDING
We use the BERT corruption scheme and train the model to reconstruct the original amino acids conditioned on the corrupted sequence and the backbone structure. With a vocabulary T of amino acids, we start from an amino-acid sequence s of length L with amino acids s_i ∈ T for 1 ≤ i ≤ L. A set M containing 15% of the positions is selected uniformly at random; 80% of these are changed to a special mask token, 10% are randomly mutated to another amino acid, and the remaining 10% are left unchanged, generating s_noised. The model learns to predict the original amino acids by minimizing the negative log likelihood at positions i ∈ M:

L_MLM = − Σ_{i ∈ M} log p(s_i | s_noised, G)

where G is the backbone structure graph defined below.
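A minimal sketch of this corruption scheme on integer-encoded sequences (numpy; the real training code may differ in implementation details):

```python
import numpy as np

MASK = 20  # index of the special mask token; 0-19 are the amino acids

def bert_corrupt(seq, rng, p_sel=0.15):
    """BERT corruption: select ~15% of positions M; of those, 80% become the
    mask token, 10% are mutated to a random amino acid, 10% stay unchanged."""
    seq = np.asarray(seq)
    noised = seq.copy()
    sel = rng.random(len(seq)) < p_sel              # the selected positions M
    action = rng.random(len(seq))                   # sub-split within M
    noised[sel & (action < 0.8)] = MASK             # 80% -> mask token
    mutate = sel & (action >= 0.8) & (action < 0.9) # 10% -> random mutation
    noised[mutate] = rng.integers(0, 20, size=mutate.sum())
    return noised, sel                              # loss is computed only on sel

rng = np.random.default_rng(0)
seq = rng.integers(0, 20, size=1000)
noised, sel = bert_corrupt(seq, rng)
assert (noised[~sel] == seq[~sel]).all()            # unselected positions untouched
assert 0.05 < sel.mean() < 0.25                     # roughly 15% selected
```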
We represent protein backbone structures as graphs G = (V, E), where each node in V is an amino acid connected by edges in E to its k-nearest amino-acid neighbors in the structure. We set k = 30 throughout. Each node's structural input features consist of the sine and cosine transformations of its dihedral and planar angles to its nearest neighbors in the primary structure. Note that the ω coordinates are symmetric, whereas the ϕ and ψ coordinates are asymmetric and depend on residue order, so we encode the angles in both the forward and backward directions, i.e. ϕ_{i+1,i} and ϕ_{i,i+1}, respectively. Figure 1d illustrates the backbone atoms of two residues and shows their dihedral and planar angles. The dihedral angles, planar angles, and residue distances used here do not follow conventional protein definitions; instead, they follow the trRosetta conventions (Yang et al., 2020).
The input edge features for the i-th residue consist of the dihedral and planar angles and the Euclidean distances d_{i,j} between the Cβ atom of residue i and the Cβ atoms of its k-nearest neighbors j ∈ N(i, k), with j ≠ i.
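Building the k-nearest-neighbor edge set with Cβ distances can be sketched as below (a dense O(L²) numpy version for clarity; a real implementation might use a spatial index for long chains):

```python
import numpy as np

def knn_edges(cb_coords, k):
    """For each residue i, return the indices of its k nearest neighbors
    N(i, k) (excluding i itself) and the Euclidean C-beta distances d_ij."""
    diff = cb_coords[:, None, :] - cb_coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)        # (L, L) pairwise distances
    np.fill_diagonal(dist, np.inf)              # forbid self-edges (j != i)
    nbrs = np.argsort(dist, axis=1)[:, :k]      # N(i, k), nearest first
    d = np.take_along_axis(dist, nbrs, axis=1)  # d_{i,j} for j in N(i, k)
    return nbrs, d

rng = np.random.default_rng(0)
coords = rng.normal(size=(50, 3))               # toy C-beta coordinates
nbrs, d = knn_edges(coords, k=30)
assert nbrs.shape == (50, 30) and d.shape == (50, 30)
assert (d > 0).all() and (np.diff(d, axis=1) >= 0).all()  # sorted, no self-edges
```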
We embed each token in s_noised into a vector of size d_s and concatenate it with the structural input node features. Finally, we embed each node and edge into the model dimension d. Throughout, we set d = 256 and d_s = 30.
We parametrize MIF as a bidirectional version of the structured GNN from Ingraham et al. (2019). Although Ingraham et al. (2019) focus on a Structured Transformer, they note that replacing the transformer attention mechanism with a simple multilayer perceptron improves performance, and we use this structured GNN architecture in both MIF and MIF-ST. The node and edge embeddings V_0 and E_0 are passed to a standard message-passing GNN with a multilayer-perceptron aggregation function. The m-th GNN layer takes as input node representations V_{m−1} and edge representations E_{m−1} and outputs V_m and E_m. Within each layer, we first gather relational information from every neighboring node and then compute the incoming message at each node:

m_i = Agg_{j ∈ N(i,k)} f_msg([v_i; v_j; e_{i,j}])

We parameterize f_msg as a three-layer neural network with hidden dimension d and ReLU nonlinearities and Agg as a mean over the neighbor dimension. We then compute new node embeddings with another feed-forward neural network.
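One message-passing step as described above can be sketched in numpy as follows. This toy version uses random weights, updates only node representations, and omits the edge updates, normalization, and residual connections of the real model:

```python
import numpy as np

rng = np.random.default_rng(0)
L, k, d = 16, 5, 32  # residues, neighbors per residue, model dimension

def relu(x):
    return np.maximum(x, 0.0)

def f_msg(x, weights):
    """Three-layer MLP with ReLU nonlinearities (toy random weights)."""
    w1, w2, w3 = weights
    return relu(relu(x @ w1) @ w2) @ w3

def gnn_layer(V, E, nbrs, weights):
    """One message-passing step: concatenate (v_i, v_j, e_ij) for each
    neighbor, apply f_msg, and mean-aggregate over the neighbor dimension."""
    Vi = np.repeat(V[:, None, :], k, axis=1)   # v_i broadcast to each neighbor slot
    Vj = V[nbrs]                               # v_j for j in N(i, k)
    msgs = f_msg(np.concatenate([Vi, Vj, E], axis=-1), weights)
    return msgs.mean(axis=1)                   # Agg = mean over neighbors

V = rng.normal(size=(L, d))                    # node representations V_{m-1}
E = rng.normal(size=(L, k, d))                 # edge representations E_{m-1}
nbrs = rng.integers(0, L, size=(L, k))         # toy neighbor indices
weights = [rng.normal(size=(3 * d, d)), rng.normal(size=(d, d)), rng.normal(size=(d, d))]
assert gnn_layer(V, E, nbrs, weights).shape == (L, d)
```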
The sequence logits are computed as a linear mapping of the final node representations.
MIF and MIF-ST were trained with dynamic batch sizes to maximize GPU utilization with a maximum batch size of 6000 tokens or 100 sequences, the Adam optimizer, a maximum learning rate of 0.001, and a linear warmup over 1000 steps. Models were trained on one Nvidia V100 GPU for approximately one day, until validation perplexity stopped improving.
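The linear warmup described above can be sketched as follows (assuming the rate simply holds at the maximum after warmup, which the text does not state explicitly):

```python
def warmup_lr(step, max_lr=1e-3, warmup_steps=1000):
    """Linear learning-rate warmup: ramp from 0 to max_lr over warmup_steps,
    then hold constant at max_lr."""
    return max_lr * min(1.0, step / warmup_steps)

assert warmup_lr(0) == 0.0
assert warmup_lr(500) == 5e-4                      # halfway through warmup
assert warmup_lr(1000) == warmup_lr(5000) == 1e-3  # at and beyond warmup
```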