## Abstract

We propose the Neural Potts Model objective as an amortized optimization problem. The objective enables training a single model with shared parameters to explicitly model energy landscapes across multiple protein families. Given a protein sequence as input, the model is trained to predict a pairwise coupling matrix for a Potts model energy function describing the local evolutionary landscape of the sequence. Couplings can be predicted for novel sequences. A controlled ablation experiment assessing unsupervised contact prediction on sets of related protein families finds a gain from amortization for low-depth multiple sequence alignments; the result is then confirmed on a database with broad coverage of protein sequences.

## 1 Introduction

When two positions in a protein sequence are in spatial contact in the folded three-dimensional structure of the protein, evolution is not free to choose the amino acid at each position independently. This means that the positions co-evolve: when the amino acid at one position varies, the assignment at the contacting site may vary with it. A multiple sequence alignment (MSA) summarizes evolutionary variation by collecting a group of diverse but evolutionarily related sequences. Patterns of variation, including co-evolution, can be observed in the MSA. These patterns are in turn associated with the structure and function of the protein (Göbel et al., 1994). Unsupervised contact prediction aims to detect co-evolutionary patterns in the statistics of the MSA and infer structure from them.

The standard method for unsupervised contact prediction fits a Potts model energy function to the MSA (Lapedes et al., 1999; Thomas et al., 2008; Weigt et al., 2009). Various approximations are used in practice including mean field (Morcos et al., 2011), sparse inverse covariance estimation (Jones et al., 2011), and pseudolikelihood maximization (Balakrishnan et al., 2011; Ekeberg et al., 2013; Kamisetty et al., 2013). To construct the MSA for a given input sequence, a similarity query is performed across a large database to identify related sequences, which are then aligned to each other. Fitting the Potts model to the set of sequences identifies statistical couplings between different sites in the protein, which can be used to infer contacts in the structure (Weigt et al., 2009). Contact prediction performance depends on the depth of the MSA and is reduced when few related sequences are available to fit the model.

In this work we consider fitting many models across many families simultaneously with parameter sharing across all the families. We introduce this formally as the Neural Potts Model (NPM) objective. The objective is an amortized optimization problem across sequence families. A Transformer model is trained to predict the parameters of a Potts model energy function defined by the MSA of each input sequence. This approach builds on the ideas in the emerging field of protein language models (Alley et al., 2019; Rives et al., 2019; Heinzinger et al., 2019), which proposes to fit a single model with unsupervised learning across many evolutionarily diverse protein sequences. We extend this core idea to train a model to output an explicit energy landscape for every sequence.

To evaluate the approach, we focus on the problem setting of unsupervised contact prediction for proteins with low-depth MSAs. Unsupervised structure learning with Potts models performs poorly when few related sequences are available (Jones et al., 2011; Kamisetty et al., 2013; Moult et al., 2016). Since larger protein families are likely to have structures available, the proteins of greatest interest for unsupervised structure prediction are likely to have lower depth MSAs (Tetchner et al., 2014). This is especially a problem for higher organisms, where there are fewer related genomes (Tetchner et al., 2014). The hope is that for low-depth MSAs, the parameter sharing in the neural model will improve results relative to fitting an independent Potts model to each family.

We investigate the NPM objective in a controlled ablation experiment on a group of related protein families in Pfam (Finn et al., 2016). In this artificial setting, information can be generalized through the pre-trained shared parameters to improve unsupervised contact prediction on a subset of MSAs that have been artificially truncated to reduce their number of sequences. We then study the model in the setting of a large dataset without artificial reduction, training the model on MSAs for UniRef50 sequences. In this setting there is also an improvement on average for low-depth MSAs, both for sequences in the training set and for sequences not in the training set.

## 2 Background

### Multiple sequence alignments

An MSA is a set of aligned protein sequences that are evolutionarily related. MSAs are constructed by retrieving related sequences from a sequence database and aligning the returned sequences using a heuristic. An MSA can be viewed as a matrix where each row is a sequence, and columns contain aligned positions after removing insertions and replacing deletions with gap characters.
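As a concrete sketch of the matrix view (toy sequences, not real proteins):

```python
# A toy MSA as a matrix: each row is an aligned sequence, each column an
# aligned position; "-" is the gap character. Sequences are illustrative.
msa = [
    "MKV-LA",
    "MRV-LA",
    "MKVQLA",
]

def column(msa, j):
    """Residues observed at aligned position j, across all sequences."""
    return [seq[j] for seq in msa]

# Every row has the same length: that is what makes it an alignment.
aligned = len(set(len(seq) for seq in msa)) == 1
```

Co-evolutionary analysis operates on the columns of this matrix, e.g. `column(msa, 1)` collects the variation at position 1 across the family.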

### Potts model

The generalized Potts model defines a Gibbs distribution over a protein sequence (*x*_{1}, …, *x*_{L}) of length *L* with the negative energy function

−*E*(*x*; *W*) = Σ_{i} *h*_{i}(*x*_{i}) + Σ_{i<j} *J*_{ij}(*x*_{i}, *x*_{j}),   (1)

which defines potentials *h*_{i} for each position in the sequence, and couplings *J*_{ij} for every pair of positions. The parameters of the model are *W* = {*h, J*}, the set of fields and couplings respectively. The distribution *p*(*x*; *W*) is obtained by normalization as exp{−*E*(*x*; *W*)}*/Z*(*W*).
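To make the parametrization concrete, here is a minimal sketch with a hypothetical two-letter alphabet, a length-3 sequence, and illustrative parameter values (not from the paper); exhaustive normalization is only feasible at this toy scale:

```python
import itertools
import math

# Toy Potts model: -E(x; W) = sum_i h_i(x_i) + sum_{i<j} J_ij(x_i, x_j).
# Alphabet and parameter values are illustrative, not from the paper.
ALPHABET = ["A", "B"]
L = 3
h = {(i, a): (0.3 if a == "A" else 0.0)  # fields h_i(a)
     for i in range(L) for a in ALPHABET}
J = {(i, j, a, b): 0.0 for i in range(L) for j in range(L)
     for a in ALPHABET for b in ALPHABET}
J[(0, 2, "A", "B")] = 1.5                # one strong pairwise coupling

def neg_energy(x, h, J):
    site = sum(h[(i, x[i])] for i in range(len(x)))
    pair = sum(J[(i, j, x[i], x[j])]
               for i, j in itertools.combinations(range(len(x)), 2))
    return site + pair

def partition(h, J):
    # Z(W): tractable only because L and the alphabet are tiny here.
    return sum(math.exp(neg_energy(x, h, J))
               for x in itertools.product(ALPHABET, repeat=L))

def prob(x, h, J):
    return math.exp(neg_energy(x, h, J)) / partition(h, J)
```

The coupling `J[(0, 2, "A", "B")]` makes sequences pairing "A" at position 0 with "B" at position 2 more probable, which is exactly the co-evolutionary signal a fitted Potts model is meant to capture.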

Since the normalization constant is intractable, pseudolikelihood is commonly used to fit the parameters (Balakrishnan et al., 2011; Ekeberg et al., 2013). Pseudolikelihood approximates the likelihood of a sequence *x* as a product of conditional distributions:

*ℓ*_{PL}(*x*; *W*) = −Σ_{i} log *p*(*x*_{i} | *x*_{−i}; *W*).

To estimate the Potts model, we take the expectation over an MSA ℳ:

ℒ_{PL}(*W*) = 𝔼_{*x*∼ℳ} [*ℓ*_{PL}(*x*; *W*)].   (2)

In practice, we have a finite set of sequences in the MSA to estimate Eq. (2).
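A minimal sketch of the pseudolikelihood computation, with an illustrative two-letter alphabet and toy parameters. Note that each conditional only requires the energies of single-site substitutions, so the intractable *Z*(*W*) never appears:

```python
import math

# Pseudolikelihood for a toy Potts model: l_PL(x; W) = -sum_i log p(x_i | x_-i; W).
# Alphabet and parameters are illustrative.
ALPHABET = ["A", "B"]

def neg_energy(x, h, J):
    L = len(x)
    s = sum(h[i][x[i]] for i in range(L))
    s += sum(J.get((i, j, x[i], x[j]), 0.0)
             for i in range(L) for j in range(i + 1, L))
    return s

def pseudolikelihood_loss(x, h, J):
    loss = 0.0
    for i in range(len(x)):
        # p(x_i | x_-i) compares energies of single-site mutations only,
        # so the partition function Z(W) cancels out.
        logits = {}
        for a in ALPHABET:
            xi = list(x)
            xi[i] = a
            logits[a] = neg_energy(tuple(xi), h, J)
        log_norm = math.log(sum(math.exp(v) for v in logits.values()))
        loss -= logits[x[i]] - log_norm
    return loss

h = [{"A": 0.0, "B": 0.5}, {"A": 0.2, "B": 0.0}]
J = {(0, 1, "A", "B"): 1.0}
loss = pseudolikelihood_loss(("A", "B"), h, J)
```

Sequences favored by the parameters (here "AB", thanks to the coupling) receive a lower pseudolikelihood loss than disfavored ones.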

An *L*_{2} regularization *ρ*(*W*) = *λ*_{J}*‖J‖*^{2} + *λ*_{h}*‖h‖*^{2} is added, and sequences are reweighted to account for redundancy (Morcos et al., 2011). We write the regularized finite sample estimator as

ℒ̂_{PL}(*W*) = (1/*M*_{eff}) Σ_{m=1}^{M} *w*^{m} *ℓ*_{PL}(*x*^{m}; *W*) + *ρ*(*W*),   (3)

which sums over all *M* sequences of the finite MSA, weighted with *w*^{m} summing collectively to *M*_{eff}. The finite sample estimate of the parameters *Ŵ* is obtained by minimizing ℒ̂_{PL}.
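The reweighting can be sketched as follows, assuming the common scheme of weighting each sequence by the inverse count of its neighbors above a sequence-identity threshold (the 80% threshold and the toy sequences are illustrative):

```python
def hamming_identity(a, b):
    """Fraction of identical positions between two aligned sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def sequence_weights(msa, threshold=0.8):
    # w^m = 1 / |{m' : id(x^m, x^m') >= threshold}| (the count includes m
    # itself), so a cluster of near-duplicates shares one unit of weight.
    weights = []
    for a in msa:
        n_similar = sum(hamming_identity(a, b) >= threshold for b in msa)
        weights.append(1.0 / n_similar)
    return weights

msa = ["MKVLA", "MKVLA", "MKVLG", "QRWDE"]
w = sequence_weights(msa, threshold=0.8)
m_eff = sum(w)  # M_eff: effective number of (non-redundant) sequences
```

Here the three near-identical sequences collectively contribute weight 1, the unrelated sequence contributes weight 1, so *M*_{eff} = 2 despite *M* = 4.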

### Idealized MSA

Notice how in Eq. (2) we idealized the MSA as a distribution, defined by the protein family. We consider the set of sequences actually retrieved for the MSA in Eq. (3) as a finite sample from this underlying idealized distribution. For some protein families this sample will contain more information than for others, depending on which sequences are present in the database. We will refer to *W*^{*} as a hypothetical idealized estimate of the parameters, to explain how the Neural Potts Model can improve on the finite sample estimate *Ŵ* for low-depth MSAs.

### 2.1 Amortized optimization

We review amortized optimization (Shu, 2017), a generalization of amortized variational inference (Kingma & Welling, 2013; Rezende et al., 2014) that uses learning to predict the solution to continuous optimization problems, making the computation more tractable and potentially generalizing across problem instances. We are interested in repeatedly solving expensive optimization problems

*W*^{*}(*x*) = argmin_{W} ℒ(*W*; *x*),   (4)

where *W* ∈ ℝ^{m} is the optimization variable, *x* ∈ ℝ^{n} is the input or conditioning variable to the optimization problem, and ℒ : ℝ^{m} × ℝ^{n} → ℝ is the objective. We assume *W*^{*}(*x*) is unique. We consider the setting of having a distribution over optimization problems with inputs *x* ∼ *p*(*x*), and the arg min of those optimization problems *W*^{*}(*x*).

Amortization uses learning to leverage the shared structure present across the distribution, *e*.*g*. a solution *W* ^{*}(*x*) is likely correlated with another solution *W* ^{*}(*x*^{′}). Assuming an underlying regularity of the data and loss ℒ, we can imagine learning to predict the outcome of the optimization problem with an expressive model *W*_{θ}(*x*) such that hopefully *W*_{θ} ≈ *W* ^{*}. Modeling and learning *W*_{θ}(*x*) are the key design decisions when using amortization.

### Modeling approaches

In this paper we consider models *W*_{θ}(*x*) that directly predict the solution to Eq. (4) with a neural network, an approach which follows fully amortized variational inference models and the meta-learning method (Mishra et al., 2017). The model can also leverage the objective information ℒ (*W*; *x*) and gradient information ∇_{W} ℒ (*W*; *x*), *e*.*g*. by predicting multiple candidate solutions *W* and selecting the most optimal one. This is sometimes referred to as semi-amortization or unrolled optimization-based models and is considered in Gregor & LeCun (2010) for sparse coding, Li & Malik (2016); Andrychowicz et al. (2016); Finn et al. (2017) for meta-learning, and Marino et al. (2018); Kim et al. (2018) for posterior optimization.

### Learning approaches

There are two main classes of learning approaches for amortization. *Gradient-based* approaches leverage gradient information of the objective and optimize

min_{θ} 𝔼_{*x*∼*p*(*x*)} ℒ(*W*_{θ}(*x*); *x*),   (5)

whereas *regression-based* approaches optimize a distance to ground-truth solutions *W*^{*}, such as the squared *L*^{2} distance

min_{θ} 𝔼_{*x*∼*p*(*x*)} *‖W*_{θ}(*x*) − *W*^{*}(*x*)*‖*^{2}_{2}.   (6)

Prior work has shown that models trained with these objectives can learn to predict the optimal *W*^{*} directly as a function of *x*. Given enough regularity of the domain, if we observe new (test) samples *x*^{′} ∼ *p*(*x*) we expect the model to generalize and predict the solution to the original optimization problem Eq. (4). Gradient-based approaches have the computational advantage of not requiring the expensive ground-truth solution *W*^{*}, while regression-based approaches are less susceptible to poor local optima in ℒ. Gradient-based approaches are used in variational inference (Kingma & Welling, 2013), style transfer (Chen & Schmidt, 2016), meta-learning (Finn et al., 2017; Mishra et al., 2017), and reinforcement learning, *e*.*g*. for the policy update in model-free actor-critic methods (Sutton & Barto, 2018). Regression-based approaches are more common in control for behavioral cloning and imitation learning (Duriez et al., 2017; Ratliff et al., 2007; Bain & Sammut, 1995).
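A toy sketch (not from the paper) contrasting the two approaches on a family of one-dimensional problems ℒ(*W*; *x*) = cosh(*W* − 2*x*), whose solution *W*^{*}(*x*) = 2*x* we amortize with a linear model *W*_{θ}(*x*) = *θx*:

```python
import math
import random

# Toy amortized optimization: L(W; x) = cosh(W - 2x), so W*(x) = 2x.
# The loss family, model, and learning rate are all illustrative.
random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(200)]

def train(loss_grad, lr=0.1, steps=500):
    """SGD on theta, given d(loss)/d(theta) for one problem instance x."""
    theta = 0.0
    for t in range(steps):
        x = xs[t % len(xs)]
        theta -= lr * loss_grad(theta, x)
    return theta

# Gradient-based (Eq. 5): differentiate L itself through the model,
# d/dtheta cosh(theta*x - 2x) = sinh((theta - 2) * x) * x.
grad_based = train(lambda theta, x: math.sinh((theta - 2.0) * x) * x)

# Regression-based (Eq. 6): squared distance to the ground-truth solution,
# d/dtheta (theta*x - 2x)^2 = 2 * (theta - 2) * x * x.
reg_based = train(lambda theta, x: 2.0 * (theta - 2.0) * x * x)
```

Both recover *θ* ≈ 2, but only the regression variant needed the ground-truth *W*^{*}(*x*) inside its loss, mirroring the trade-off described above.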

## 3 Neural Potts Model

In Eq. (2) we introduced the Potts model for a single MSA (an aligned set of sequences), with the objective ℒ̂_{PL} to optimize. As per Eq. (5), we will now introduce a neural network that estimates the Potts model parameters from a single sequence: *h*_{θ}(*x*), *J*_{θ}(*x*) = *W*_{θ}(*x*), with a single forward pass.

We propose minimizing the following objective for the NPM parameters *θ*, which directly minimizes the Potts model losses in expectation over our data distribution *x* ∼ 𝒟 and their MSAs ℳ(*x*):

ℒ_{NPM}(*θ*) = 𝔼_{*x*∼𝒟} 𝔼_{*x*^{′}∼ℳ(*x*)} [*ℓ*_{PL}(*x*^{′}; *W*_{θ}(*x*))].   (7)

To compute the loss for a given sequence *x*, we compute the Potts model parameters *W*_{θ}(*x*), and evaluate the pseudolikelihood loss *ℓ*_{PL} on a set of sequences from the MSA constructed with *x* as the query sequence. This fits exactly into the “amortized optimization” framework of Section 2.1, Eq. (5): we train a model to predict the outcome of a set of highly related optimization problems. One key extension to the described amortized optimization setup is that the model *W*_{θ} estimates the Potts model parameters from only the MSA query sequence *x* as input, rather than from the full MSA ℳ(*x*). Thus, our model must learn to distill the protein energy landscape into its parameters, since it cannot look up related proteins at runtime. A full algorithm is given in Appendix A.

Similar to the original Potts model, we need to add a regularization penalty *ρ*(*W*) to the main objective. For a finite sample of *N* different query sequences, and a corresponding sample of *N* × *M* aligned sequences from the MSAs {*x*_{n}}, the finite sample regularized loss, i.e. the NPM training objective, becomes

ℒ̂_{NPM}(*θ*) = (1/*N*) Σ_{n=1}^{N} [ (1/*M*_{eff,n}) Σ_{m=1}^{M} *w*^{m}_{n} *ℓ*_{PL}(*x*^{m}_{n}; *W*_{θ}(*x*_{n})) + *ρ*(*W*_{θ}(*x*_{n})) ].   (8)
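The structure of this objective can be sketched as follows. The predictor here is a fixed, hypothetical stub standing in for the trained network, and the predicted couplings are omitted for brevity; only the loop structure (outer average over queries, inner minibatch from each query's MSA, pseudolikelihood under the *predicted* parameters) reflects the objective above:

```python
import math
import random

# Sketch of the NPM objective: a predictor maps a query sequence to Potts
# parameters; the loss is the pseudolikelihood of sampled MSA sequences
# under those predicted parameters. Alphabet and stub are illustrative.
ALPHABET = ["A", "B"]

def predict_potts(query):
    """Stand-in for W_theta(x): a fixed illustrative mapping, not a network."""
    h = [{a: (0.5 if a == query[i] else 0.0) for a in ALPHABET}
         for i in range(len(query))]
    J = {}  # predicted couplings omitted in this sketch
    return h, J

def pl_loss(seq, h, J):
    loss = 0.0
    for i, a_true in enumerate(seq):
        logits = {a: h[i][a] for a in ALPHABET}  # J terms omitted for brevity
        log_norm = math.log(sum(math.exp(v) for v in logits.values()))
        loss -= logits[a_true] - log_norm
    return loss

def npm_loss(queries, msas, M=2, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for query, msa in zip(queries, msas):
        h, J = predict_potts(query)              # single forward pass
        sample = [rng.choice(msa) for _ in range(M)]  # minibatch from MSA
        total += sum(pl_loss(s, h, J) for s in sample) / M
    return total / len(queries)

loss = npm_loss(["AB", "BA"], [["AB", "AA"], ["BA", "BB"]])
```

Crucially, the predictor sees only the query sequence; the MSA enters only through the loss, which is what forces the landscape to be distilled into the shared parameters.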

### Inductive generalization gain

An inductive generalization gain (see Fig. 2) occurs when the Neural Potts Model improves over the individual Potts model. Intuitively this is possible because the individual Potts models are not perfect estimates (finite/biased MSAs), while the shared parameters of *W*_{θ} can transfer information between related protein families and from pre-training with another objective like masked language modeling (MLM).

Let us start with the normal amortized optimization setting, where we expect an amortization gap (Cremer et al., 2018). The amortization gap means that *W*_{θ}(*x*) will lag behind the optimal *W*^{*} for the objective: ℒ(*W*_{θ}(*x*)) > ℒ(*W*^{*}). This is closely related to underfitting: the model *W*_{θ} is not flexible enough to capture *W*^{*}(*x*). However, recall that in the Potts model setting there is a finite-sample training objective (Eq. (8)), with minimizer *Ŵ*. We can expect an amortization gap in the training objective; however, this amortization gap can now be advantageous. Even if the amortized solution *W*_{θ}(*x*) is only near-optimal on the training objective, it can likely find a more generalizable region of the overparametrized domain *W* through the parameter sharing of *θ*, allowing it to transfer information between related instances. The inductive bias of *W*_{θ}(*x*) can allow the neural amortized estimate to generalize better, especially when the finite sample is poor. This inductive bias depends on the choice of model class for *W*_{θ}, its pre-training, as well as the shared structure between the protein families in the dataset. Concretely, for the generalization loss ℒ we will consider not just the pseudolikelihood loss on test MSA sequences, but also the performance on downstream validation objectives like predicting contacts, a proxy for the model's ability to capture the underlying structure of the protein. We will show that for some samples ℒ(*W*_{θ}(*x*)) < ℒ(*Ŵ*), i.e. there is an inductive generalization gain. This is visually represented in Fig. 2; Table 1 compares amortized optimization and NPM, making a connection to multi-task learning (Caruana, 1998). Additionally, we could frame NPM as a hypernetwork, a neural network that predicts the weights of a second network (in this case the Potts model), as in, *e*.*g*., Gomez & Schmidhuber (2005); Ha et al. (2016); Bertinetto et al. (2016).

In summary, the goal for the NPM is to “distill” an ensemble of Potts models into a single feedforward model. From a self-supervised learning perspective, rather than supervising the model with the input directly, we use supervision from an energy landscape around the input.

## 4 Experiments

In Section 4.1 we present results on a small set of related protein domain families from Pfam, where we artificially reduce the MSA depth for a few families to study the inductive generalization gain from the shared parameters. In Section 4.2 we present results on a large Transformer trained on MSAs for all of UniRef50.

For the main representation *g*_{θ}(*x*) we use a bidirectional Transformer model (Vaswani et al., 2017). To compute the four-dimensional pairwise coupling tensor *J*_{θ}(*x*) from the sequence embedding *g*_{θ}(*x*), we introduce the multi-head bilinear form (mhbf) in Appendix B. One can think of the multi-head bilinear form as the *L* × *L* self-attention maps of the Transformer's multi-head attention module, but without softmax normalization. When using the mhbf for direct prediction, there are *K*^{2} heads, one for every amino acid pair *k, l*. For the Pfam experiments, we extend the architecture with convolutional layers after the mhbf, where the final convolutional layer has *K*^{2} output channels. We initialize *g*_{θ}(*x*) with a Transformer pre-trained with masked language modeling, following Rives et al. (2019).

To evaluate Neural Potts Model energy landscapes, we will focus on proteins with a structure in the Protein Data Bank (PDB), using the magnitude of the couplings after APC correction to rank contacts. The protocol is described in Appendix C.2.

### 4.1 Pfam clans

To study generalization in a controlled setting, we investigate a small set of structurally-related MSAs from the Pfam domain family database (Finn et al., 2016) belonging to the same Pfam clan. We expect that on a collection of related MSAs, information could be generalized to improve performance on low-depth MSAs. Families within a Pfam clan are linked by a distant evolutionary relationship, giving them related but not trivially-similar structure. We obtain contact maps for the sequences in each of the families where a structure is available in the PDB. At test time we input the sequence and compare the generated couplings under the model to the corresponding structure.

We compare the NPM to two baselines. The first direct comparison is to an independent Potts model trained directly on the MSA. For the second baseline we construct the “nearest neighbor” Potts model, by aligning each test sequence against all families in the training set, and using the Potts model from the closest matching family.

We perform the experiment using a five-fold cross-evaluation scheme, in which we partition the clan’s families into five equally-sized buckets. As in standard cross-validation, each bucket will eventually serve as an evaluation set. However, we do not remove the evaluation bucket. Instead, we artificially reduce the number of sequences in the MSAs in the evaluation bucket to a smaller fixed MSA depth. MSAs in the remaining buckets remain unaltered. The goal of this setup is to check the model’s ability to infer contacts on artificially limited sets of sequences. Both NPM and the baseline independent Potts model are fit on the reduced set of sequences. Note that while the baseline Potts model uses the reduced MSA of the target directly, NPM is trained on the reduced MSA but evaluated using only the target sequence as input. We train a separate NPM on each of the five cross-evaluation rounds, evaluate on the structures corresponding to the bucket with reduced MSAs, and show averages and standard deviations across rounds. Further details are provided for model training in Appendix C.1 and for the Pfam dataset in Appendix C.3.

Figure 3 shows the resulting contact prediction performance on the 181 families in the NADP Rossmann clan, with additional results on the P-loop NTPase, HTH, and AB hydrolase clans in Appendix D Fig. 9. We initialize a 12-layer Transformer with protein language modeling pre-training. Because of the small dataset size, we keep the weights of the base Transformer *g*_{θ} frozen and only finetune the final layers. As a function of increasing MSA depth, contact precision improves for both NPM and independent Potts models. For the shallowest MSAs, NPM has a higher precision relative to the independent Potts models. The advantage at low MSA depth is most pronounced for long range contacts, outperforming independent Potts models up to MSA depth 1000. These experiments suggest NPM is able to realize an inductive gain by sharing parameters in the pre-trained base model as well as the fine-tuned final layers and output head. Figure 4 shows training trajectories. We observe near-monotonic decrease of the amortized pseudo-likelihood loss (Eq. (7)) on the MSAs in the evaluation set, and increase of the top-L long range contact precision. This indicates that improving the NPM objective improves the unsupervised contact precision across the reduced-depth MSAs. Furthermore we see expected overfitting for smaller MSA depth: better training loss but worse contact precision.

Additionally, we assess performance of different architecture variants: direct prediction with the multi-head bilinear form (always using symmetry), with or without tied projections, and addition of convolutional layers after the multi-head bilinear form. The variants are described in detail in Appendix B. We find in Appendix D Fig. 8 that addition of convolutional layers after the multi-head bilinear form performs best; for the variant without convolutional layers, the head without weight tying performs best.

### 4.2 UniRef50

We now perform an evaluation in the more realistic setting of the UniRef50 dataset (Suzek et al., 2007). First we examine MSA depth across UniRef50. Appendix C.4 Fig. 7 shows that 19% of sequences in UniRef50 have MSAs with fewer than 10 sequences (38% when a minimum query sequence coverage of 80% is specified).

We ask whether an amortization gain can be realized in two different settings: (i) for sequences the model has been trained on; (ii) for sequences in the test set. We partition the UniRef50 representative sequences into 90% train and 10% test sets, constructing an MSA for each of the sequences. During training, the model is given a sequence from the train set as input, and the NPM objective is minimized using a sample from the MSA of the input sequence. In each training epoch, we randomly subsample a different set of 30 sequences from the MSA to fit the NPM objective. We use ground-truth structures to evaluate the NPM couplings and independent Potts model couplings for contact precision. The dataset is further described in Appendix C.4; and details on the model and training are given in Appendix C.1.

The independent Potts model baseline is trained on the full MSA. This means that in setting (i) the NPM and independent Potts models have access to the same underlying MSAs during training. In setting (ii) the independent Potts model is afforded access to the full MSA; however the NPM has not been trained on this MSA and must perform some level of generalization to estimate the couplings.

Figure 5 shows a comparison between the NPM predictions and individual Potts models fit from the MSA. The Neural Potts Model is given only the query sequence as input. On top-L/5 long range precision, NPM has better precision than independent Potts models for 22.3% of train and 22.7% of test proteins. We visualize in Fig. 6 example proteins with low MSA-depth where NPM does better than the individual Potts model. For shallow MSAs, the average performance of NPM is higher than the Potts model, suggesting an inductive generalization gain.

To contextualize the results, let us consider what each outcome would mean for the amortized Neural Potts Model:

- *Matches the independent Potts model on training data*: the NPM can predict good-quality couplings from a single feedforward pass, without access to the full MSA at inference time.
- *Surpasses the independent model on training data*: the amortization helps the NPM improve over independent Potts models, i.e. it realizes an inductive generalization gain.
- *Matches the independent model on test sequences*: the model is able to synthesize a good Potts model for sequences not in its training data.
- *Surpasses the independent model on test sequences*: the model actually improves over an independent Potts model even for sequences not in its training data.

In combination, these results indicate that non-trivial generalization happens when NPM is trained on UniRef50.

## 5 Related Work

Recently, protein language modeling has emerged as a promising direction for learning representations of protein sequences that are useful across a variety of tasks. Rives et al. (2019) and Rao et al. (2019) trained protein language models with the masked language modeling (MLM) objective originally proposed for natural language processing by Devlin et al. (2019). Alley et al. (2019), Heinzinger et al. (2019), and Madani et al. (2020) trained models with autoregressive objectives. Transformer protein language models trained with the MLM objective learn information about the underlying structure and function of proteins, including long range contacts (Rives et al., 2019; Vig et al., 2020). This paper builds on the ideas in the protein language modeling literature, introducing the following new ideas: the first is supervision with an energy landscape (defined by a set of sequences) rather than objectives which are defined by a single sequence; the second is to use amortized optimization to fit a single model across many different energy landscapes with parameter sharing; the final is the consideration of the unsupervised contact prediction problem setting rather than the use of representations in a supervised pipeline.

Unsupervised structure learning is reviewed in the introduction. The main approach has been to learn a set of constraints from a family of related sequences by fitting a Potts model energy function to the sequences. Our work builds on this idea, but rather than fitting a Potts model to a single family of related sequences, proposes through amortized optimization to fit Potts models across many sequence families with parameter sharing in a deep neural network.

Supervised learning has produced breakthrough results for protein structure prediction (Xu, 2018; Senior et al., 2019; Yang et al., 2019). State-of-the-art methods use supervised learning with deep residual networks on co-evolutionary features derived from the unsupervised structure learning pipeline. While Xu et al. (2020) show that reasonable predictions can be made without co-evolutionary features, their work also shows that these features contribute significantly to the performance of state-of-the-art pipelines.

Prior work studying protein language models for contact prediction focuses on the supervised setting. Bepler & Berger (2019) studied pre-training an LSTM on protein sequences and fine-tuning on contact data. Rives et al. (2019) and Rao et al. (2019) studied supervised contact prediction from Transformer protein language models. Vig et al. (2020) found that contacts are represented in Transformer self-attention maps. Our work differs from prior work on structure prediction using protein language models by focusing on the unsupervised structure learning setting. It would be a logical extension of this work to integrate the Neural Potts model into the supervised pipeline.

## 6 Discussion

This paper explores how a protein sequence model can be trained to produce a local energy landscape that is defined by a set of evolutionarily related sequences for each input. The training objective is cast as an amortized optimization problem. By learning to output the parameters for a Potts model energy function across many sequences, the model may learn to generalize across the sequences.

We also formally and empirically investigate the generalization capability of models trained through amortized optimization. We consider the setting of training independent Potts models on the MSA of each sequence, in comparison with training a single model using the amortized objective to predict Potts model parameters for many inputs. Empirically the amortized objective provides an inductive gain when few related sequences are available in the MSA for training the independent Potts model.

A number of direct extensions exist for future work, including further investigation of model architecture and parameterization of the energy function by the deep network, use of the amortized models in a supervised pipeline, and combining independent Potts models with amortized couplings. The hidden representations could also be investigated for structure prediction and other tasks using the approaches in the protein language modeling literature. The main contribution of this work is to directly incorporate information from a set of sequences related to the input in the learning objective. It would be interesting to investigate other possible approaches for incorporating this type of supervision into models that aim to learn underlying structure from sequence data.

## Appendix A Learning the Neural Potts Model

### B Model Architecture: Multi-head Bilinear Form for Pairwise Couplings

In this section, we describe the model architecture used to compute a four-dimensional pairwise coupling tensor *J*_{θ}(*x*) from the sequence embedding *g*_{θ}(*x*).

#### B.1 Multi-head Bilinear Form

We denote the sequence length by *L* and the amino acid vocabulary size by *K* = 21. The single-site potentials are *h* ∈ ℝ^{L×K}, and the pairwise couplings are a four-dimensional tensor *J* ∈ ℝ^{L×K×L×K}, indexed as *J*_{ij}(*k, l*).

We start with a sequence-level model that produces the embedding *e* of the sequence (typically the final hidden layer output): *e* = *g*_{θ}(*x*) ∈ ℝ^{L×d}. The estimator for the single-site potentials *h*_{θ}(*x*) is a linear projection layer on the embedding: *h*_{θ}(*x*) = *g*_{θ}(*x*)*P*^{h} with *P*^{h} ∈ ℝ^{d×K}. Now we discuss how to parametrize the estimator *J*_{θ}(*x*) ∈ ℝ^{L×K×L×K}.

#### Multi-head bilinear form for direct prediction

We introduce a *multi-head bilinear form* (mhbf) on the embedding *e*; i.e. for every pair *k, l* of amino acids we have a bilinear form, parametrized with a learned interaction matrix *B*^{kl} ∈ ℝ^{d×d} connecting the hidden states at positions *e*_{i}, *e*_{j} ∈ ℝ^{1×d}. So we compute the *K*^{2} bilinear forms for amino acid pairs (*k, l*) between all *L* × *L* position pairs (*i, j*): *J*_{ij}(*k, l*) = *e*_{i}*B*^{kl}*e*_{j}^{T}. We always use a low-rank decomposition *B*^{kl} = *U*^{kl}*V*^{klT} with both *U*^{kl}, *V*^{kl} ∈ ℝ^{d×d^{′}}, so the bilinear form becomes the inner product in the lower-dimensional space, with *d*^{′} the projection dimension: (*e*_{i}*U*^{kl})(*e*_{j}*V*^{kl})^{T}. We can interpret this as an inner product of embeddings *i, j* after linear projection to a space specific to amino acid pair (*k, l*). This low-rank multi-head bilinear form is similar to the multi-head attention mechanism introduced in Vaswani et al. (2017), but without softmax normalization.
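A minimal sketch of the low-rank multi-head bilinear form, with illustrative tiny dimensions (real models use *K* = 21 and much larger *d*):

```python
import random

# Sketch of the low-rank multi-head bilinear form:
# J_ij(k, l) = (e_i U^{kl}) (e_j V^{kl})^T. Dimensions are illustrative.
random.seed(0)
L, d, d_proj, K = 3, 4, 2, 2

def rand_mat(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

e = rand_mat(L, d)  # sequence embedding: one d-dim vector per position
U = {(k, l): rand_mat(d, d_proj) for k in range(K) for l in range(K)}
V = {(k, l): rand_mat(d, d_proj) for k in range(K) for l in range(K)}

def project(vec, mat):
    """Row vector times matrix: R^d -> R^{d_proj}."""
    return [sum(vec[a] * mat[a][r] for a in range(len(vec)))
            for r in range(len(mat[0]))]

def coupling(i, j, k, l):
    """Inner product after the pair-specific low-rank projections."""
    u = project(e[i], U[(k, l)])
    v = project(e[j], V[(k, l)])
    return sum(x * y for x, y in zip(u, v))

J = {(i, j, k, l): coupling(i, j, k, l)
     for i in range(L) for j in range(L)
     for k in range(K) for l in range(K)}
```

Each head is exactly the bilinear form *e*_{i}*B*^{kl}*e*_{j}^{T} with *B*^{kl} = *U*^{kl}*V*^{klT}, only restricted to rank *d*^{′}.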

Notation-wise, our parameters *θ* include the parameters of the transformer that produces the embedding and the components of the interaction matrix {*U*^{kl}, *V* ^{kl}}.

#### Direct prediction: tied projection

One way to reduce the number of parameters in the multi-head bilinear form is to *share/tie* the projection matrices across amino acid pairs, rather than giving each of the *K*^{2} heads *B*^{kl} its own decomposition per *k, l*. We set *U*^{kl} = *U*^{k} and *V*^{kl} = *V*^{l}, such that head *B*^{kl} = *U*^{k}*V*^{lT}. Note that the dot product in this case is after a linear projection specific to the single-site amino acids *k* and *l* separately: *J*_{ij}(*k, l*) = (*e*_{i}*U*^{k})(*e*_{j}*V*^{l})^{T}.

#### Direct prediction: Symmetry

We can parametrize the estimator of *J* to be symmetric under interchanging both *i, j* and *k, l*: *J*_{ij}(*k, l*) = *J*_{ji}(*l, k*), i.e. the order in which we consider the interaction between amino acid *k* at position *i* and amino acid *l* at position *j* does not matter. This does not mean symmetry of each interaction matrix! We require

*e*_{i}*B*^{kl}*e*_{j}^{T} = *J*_{ij}(*k, l*) = *J*_{ji}(*l, k*) = *e*_{j}*B*^{lk}*e*_{i}^{T} = *e*_{i}*B*^{lkT}*e*_{j}^{T}.

The second equality is the symmetry constraint; the last follows by transposing the bilinear form. From *B*^{kl} = *B*^{lkT} it follows that *U*^{kl}*V*^{klT} = *V*^{lk}*U*^{lkT}, for which *V*^{kl} = *U*^{lk} is the obvious choice. In the tied parametrization, this simply becomes *V*^{l} = *U*^{l}, such that *B*^{kl} = *U*^{k}*U*^{lT}. Once again, note that the dot product now becomes *J*_{ij}(*k, l*) = (*e*_{i}*U*^{k})(*e*_{j}*U*^{l})^{T}. We present a tensor decomposition perspective on this multi-head bilinear form in Appendix B.2.
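The tied symmetric parametrization can be sketched and checked numerically (dimensions are illustrative):

```python
import random

# Sketch of the tied, symmetric variant: B^{kl} = U^k U^{lT}, i.e.
# J_ij(k, l) = (e_i U^k)(e_j U^l)^T. Dimensions are illustrative.
random.seed(1)
L, d, d_proj, K = 3, 4, 2, 2
e = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(L)]
U = {k: [[random.gauss(0.0, 1.0) for _ in range(d_proj)] for _ in range(d)]
     for k in range(K)}

def project(vec, mat):
    return [sum(vec[a] * mat[a][r] for a in range(len(vec)))
            for r in range(len(mat[0]))]

def J(i, j, k, l):
    u = project(e[i], U[k])
    v = project(e[j], U[l])
    return sum(x * y for x, y in zip(u, v))
```

By construction `J(i, j, k, l) == J(j, i, l, k)`: swapping (*i, j*) and (*k, l*) simultaneously just swaps the two factors of the inner product.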

#### Convolutional layers after multi-head bilinear form

As an extended model architecture, we consider adding convolutional layers after the multi-head bilinear form (only used for the Pfam experiments). In this case, rather than having *K*^{2} heads *B*^{kl}, we have an arbitrary number of heads *F*, which becomes the number of channels of the subsequent convolutional layers: *B*^{f} = *U*^{f}*V*^{fT} for *f* = 1, …, *F*. We add 1 × 1 convolutional layers, also with *F* channels, and finally *K*^{2} output channels for the last convolutional layer. The weight tying and symmetry considerations of the mhbf do not apply in this model variation.

#### B.2 Tensor decomposition view on multi-head bilinear form

We can view the multi-head bilinear form as a tensor decomposition of *J*, for which we use Einstein notation: any index appearing both as a subscript and as a superscript is summed over its range. Let 𝒰 ∈ ℝ^{K×K×d×d′} be the tensor collecting the *U*^{kl} matrices, indexed as 𝒰^{klr}_{α}, with *α, β* ∈ [1… *d*] and *r* ∈ [1… *d*^{′}]; the same for 𝒱. The *J* estimate in the full untied asymmetric case, written as a tensor, becomes

$$J_{ij}(k, l) = e_i^{\alpha}\, \mathcal{U}_{\alpha}^{klr}\, \mathcal{V}_{\beta r}^{kl}\, e_j^{\beta},$$

or, in the symmetric and tied (𝒰 ∈ ℝ^{K×d×d′}) version:

$$J_{ij}(k, l) = e_i^{\alpha}\, \mathcal{U}_{\alpha}^{kr}\, \mathcal{U}_{\beta r}^{l}\, e_j^{\beta}.$$

Note that 𝒰, 𝒱 are shared across proteins, while the embeddings *e* = *g*_{θ}(*σ*) are specific per protein, based on a high-capacity sequence-level model.
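This kind of contraction maps directly onto `np.einsum`; the following sketch (illustrative shapes) checks the untied tensor form against the explicit per-head bilinear form:

```python
import numpy as np

# Einstein-notation contraction of the full untied asymmetric case, written
# with np.einsum (illustrative sizes). Ut, Vt collect the per-head projections.
L, K, d, dp = 5, 3, 8, 4
rng = np.random.default_rng(3)
E = rng.normal(size=(L, d))            # embeddings e, specific per protein
Ut = rng.normal(size=(K, K, d, dp))    # tensor of U^{kl}, shared across proteins
Vt = rng.normal(size=(K, K, d, dp))    # tensor of V^{kl}

# Contract alpha, beta, r; the indices i, j, k, l remain free.
J = np.einsum('ia,klar,jb,klbr->ikjl', E, Ut, E, Vt)

# Same quantity via the explicit bilinear form, as a consistency check.
J_ref = np.empty_like(J)
for k in range(K):
    for l in range(K):
        J_ref[:, k, :, l] = E @ (Ut[k, l] @ Vt[k, l].T) @ E.T
assert np.allclose(J, J_ref)
```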

### C Experiment Details

#### C.1 Training details

We summarize the precise model architecture and optimization settings in Table 2. During each NPM training step, for a given input *x*, *M* sequences are randomly sampled (*M* = 100 or 30, see Table 2) for the pseudo-likelihood loss evaluation in Eq. (8). Each sequence is selected with probability according to its sequence weight *w*^{m}. One can think of these *M* sampled sequences as similar to a minibatch. Note that to compute the independent Potts model baseline, the Potts model is computed without any downsampling of the MSA. Additionally, in the Pfam experiments the loss term for family *n* in Eq. (8) is upweighted with a factor that places more weight on the well-formed, deep MSAs and discounts the shallower MSAs. In both the Pfam and UniRef experiments, we enforce a maximum sequence length of 512 via random contiguous crops of positions.
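The weighted sampling step can be sketched as follows (toy MSA and weights; the sequence weights *w*^{m} are assumed precomputed):

```python
import numpy as np

# Sketch of the per-step minibatch sampling: M sequences are drawn from the
# MSA with probability proportional to their sequence weights w^m.
# Names and sizes are illustrative.
rng = np.random.default_rng(4)
msa = rng.integers(0, 21, size=(1000, 128))   # toy MSA: 1000 sequences, L = 128
w = rng.random(1000)                          # precomputed sequence weights w^m
M = 100

idx = rng.choice(len(msa), size=M, replace=True, p=w / w.sum())
batch = msa[idx]                              # "minibatch" for the PL loss, Eq. (8)
```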

#### C.2 Validation details

To compute precisions, we convert the pairwise couplings *J* ∈ ℝ^{L×K×L×K} to an *L* × *L* pairwise coupling score by (1) zeroing all positions in *J* corresponding to gap characters, (2) computing the magnitude via the Frobenius norm over the *K* × *K* matrix *J*_{ij} for every pair of positions *i, j*, and (3) applying the Average Product Correction (Dunn et al., 2008). True contacts are defined as pairwise distances less than or equal to 8 Angstroms. Precision is calculated as the true positive fraction of the top *L*, ⌊*L/*2⌋, or ⌊*L/*5⌋ predicted contacts. In addition to precision, the Area Under the Precision-Recall Curve (AUC) is computed, over thresholds in increments of *L/*10 up to *L*. Precision and AUC metrics are computed at sequence separations *s* in the short (6 ≤ *s <* 12), medium (12 ≤ *s <* 24), and long (24 ≤ *s*) ranges.
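Steps (1)-(3) and a top-*L* precision computation can be sketched as follows (a simplified sketch; the gap index, separation handling, and tie-breaking are illustrative assumptions, not the exact evaluation code):

```python
import numpy as np

def contact_scores(J, gap_idx=20):
    """J: (L, K, L, K) couplings -> (L, L) APC-corrected contact scores."""
    J = J.copy()
    J[:, gap_idx, :, :] = 0.0                    # (1) zero gap-character rows
    J[:, :, :, gap_idx] = 0.0                    #     and columns
    S = np.linalg.norm(J, axis=(1, 3))           # (2) Frobenius norm per (i, j)
    apc = S.mean(1, keepdims=True) * S.mean(0, keepdims=True) / S.mean()
    return S - apc                               # (3) Average Product Correction

def top_l_precision(scores, contacts, min_sep=24):
    """Fraction of the top-L scored pairs (j - i >= min_sep) that are contacts."""
    L = scores.shape[0]
    i, j = np.triu_indices(L, k=min_sep)
    order = np.argsort(-scores[i, j])[:L]
    return contacts[i[order], j[order]].mean()
```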

For the independent Potts model baselines in all experiments, we use CCMpred (Seemayer et al., 2014), a GPU implementation of pseudolikelihood maximization (Balakrishnan et al., 2011). The coupling matrix *J* from the independent Potts model is processed in the same way following steps (1-3) described above.

#### C.3 Pfam training data and setup

##### Data Selection

We use the Pfam database (Finn et al., 2016) version 28.0. All MSAs in the HTH (n=217), P-loop NTPase (n=198), NADP Rossmann (n=181), and AB hydrolase (n=67) clans were parsed from the multiple alignment file Pfam-A.full. We apply two preprocessing steps to all MSAs. First, for speed, we only load up to a maximum of 100k sequences from each MSA. Next, we apply HHfilter, from the HHSuite3 (Steinegger et al., 2019) toolset, with all default settings, to each MSA. We find that filtering improves contact prediction accuracy of the independent Potts model baseline.

##### Dataset splits

We perform the experiment using a five-fold cross-evaluation scheme, in which we partition the clan’s families into five equally-sized buckets. As in standard cross-validation, each bucket eventually serves as an evaluation set. However, we do not remove the evaluation bucket; instead, we artificially reduce the number of sequences in the MSAs of the evaluation bucket to a small fixed MSA depth (i.e., we “purge” the MSA). All Pfam experiments are repeated five times, each with a different selection for the reduced bucket. In our figures, we plot average results, with confidence interval bounds defined by the standard deviation across the five folds.
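A minimal sketch of the purging scheme; the function name, reduced depth, and data layout are illustrative assumptions:

```python
import numpy as np

# Sketch of the five-fold scheme: families are split into five buckets and,
# per fold, the evaluation bucket's MSAs are "purged" to a fixed small depth.
def purge_fold(msas_by_family, fold, n_folds=5, reduced_depth=16, seed=0):
    rng = np.random.default_rng(seed)
    families = sorted(msas_by_family)
    buckets = np.array_split(np.array(families), n_folds)
    eval_bucket = set(buckets[fold])
    out = {}
    for fam, msa in msas_by_family.items():
        if fam in eval_bucket and len(msa) > reduced_depth:
            keep = rng.choice(len(msa), size=reduced_depth, replace=False)
            out[fam] = [msa[i] for i in keep]    # depth-reduced evaluation MSA
        else:
            out[fam] = msa                       # training buckets kept intact
    return out, eval_bucket
```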

##### During NPM training

we iterate over the set of MSAs in the four buckets that have not been reduced, as well as the reduced bucket. At a given training step, we randomly select a sequence *x* within an MSA for use as input to NPM. This selection is likely to return a sequence with inserted gap characters. We drop these gap characters and their corresponding columns in the MSA. We then randomly subsample 100 sequences from the MSA to fit the NPM objective. The procedure is described in more detail in Appendix C.1.

##### Evaluation

During evaluation, we assess the NPM and the independent Potts model via a contact prediction task (described in previous subsection), on the families in the evaluation bucket. For each family, a single structure is selected as target, using the pdbmap included in Pfam. NPM’s contact predictions are made using only the sequence belonging to the target structure. To compute the independent Potts model for a given family in the evaluation bucket, the depth-reduced MSA is aligned to the sequence from the target structure, and the Potts model is computed without further downsampling.

As an additional baseline, we predict contacts for validation sequences using the Potts model of the “Nearest Neighbor” family in the train set. For a given validation sequence, we calculate “nearness” to all train families via calls to HHalign given the sequence and the train family’s Pfam seed alignment as input. We select the family with the highest HHalign probability score as the nearest neighbor. The nearest neighbor prediction is generated as follows: (1) the validation sequence is aligned to the selected train family’s MSA; (2) an independent Potts model is fit to the selected train family’s MSA, using a random member sequence as reference, yielding a predicted contact map for the train family; (3) the rows and columns of the predicted contact map that align to the validation sequence are extracted to construct a prediction for the validation sequence.
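Step (3) of the nearest-neighbor baseline reduces to a sub-matrix extraction; a small sketch with hypothetical alignment indices:

```python
import numpy as np

# Sketch of step (3): given the columns of the train family's predicted
# contact map that align to the validation sequence, extract the
# corresponding sub-matrix. Alignment indices here are illustrative.
rng = np.random.default_rng(6)
train_map = rng.random((120, 120))        # predicted contact map, train family
aligned_cols = np.array([3, 4, 7, 8, 9])  # train-MSA columns aligned to the
                                          # validation sequence positions
val_pred = train_map[np.ix_(aligned_cols, aligned_cols)]
```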

#### C.4 UniRef50 training data and setup

For the experiments in Section 4.2, we retrieve the UniRef50 (Suzek et al., 2007) database dated 2018-03. The UniRef50 clusters are partitioned randomly into 90% train and 10% test sets. For all sequences, we construct MSAs using HHblits (Steinegger et al., 2019) against the UniClust30_2017_10 database. HHblits is run using the default settings, for 3 iterations with an e-value of 0.001.

It is important to note that given this MSA generation procedure, validation sequences can be included in MSAs of train sequences. However, we are guaranteed that validation sequences are not trained on as input to NPM.

##### Evaluation of contact precision

We use contact precision as a proxy to measure unsupervised structure learning in the underlying Potts model. To define a set of structures for evaluation, we collect structures from the PDB, and assign them to either the training sequences or test sequences. This allows us to separately examine performance of NPM on sequences from its training set, and sequences from its test set. Note, that the structures are used only to evaluate unsupervised contact prediction performance of the model; the model is never trained on structures.

We query the Protein Data Bank (PDB) to obtain a list of all protein structures with a resolution less than 2.5 Å, a length greater than 40 residues, and a submission date before May 1, 2020. We search each PDB entry for hits against the sequences in the training and test sets for NPM, respectively. If the PDB entry retrieves hits only to training sequences, we assign it to the training-sequences group. If the PDB entry retrieves hits only to test sequences, we assign it to the test-sequences group. Any PDB entry that hits both training and test sequences, or neither, is discarded. To perform the search we use the MMseqs2 software suite (Steinegger & Söding, 2018) with the default settings at 50% sequence identity and 80% target coverage. We then cluster each of the two groups of PDB entries at 50% sequence similarity, resulting in a dataset of 11040 structures assigned to train sequences and 211 structures assigned to test sequences. MSA construction for the PDB entries precisely follows the procedure for UniRef50 (first paragraph); the method for contact prediction from the model couplings (for NPM or the independent Potts model) is described in Appendix C.2.
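The assignment rule can be written as a small predicate (function and variable names are illustrative, not from the paper's code):

```python
# Sketch of the train/test assignment rule for PDB entries, given the
# MMseqs2 hit identifiers for one entry.
def assign_pdb_entry(hits, train_ids, test_ids):
    """Return 'train', 'test', or None (discarded) for one PDB entry."""
    in_train = any(h in train_ids for h in hits)
    in_test = any(h in test_ids for h in hits)
    if in_train and not in_test:
        return "train"
    if in_test and not in_train:
        return "test"
    return None  # hits both groups, or neither -> discard
```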

### D Additional experiments

## Footnotes

tsercubda{at}fb.com,rverkuilbda{at}fb.com,jmeierbda{at}fb.com

zl2799{at}nyu.edu

carolinechen{at}fb.com

jasonliu{at}fb.com

yann,arives{at}cs.nyu.edu

† Work performed while at Facebook AI Research