An end-to-end deep learning method for rotamer-free protein side-chain packing

Protein side-chain packing (PSCP), the task of determining amino acid side-chain conformations, has important applications to protein structure prediction, refinement, and design. Many methods have been proposed to resolve this problem, but their accuracy is still unsatisfactory. To address this, we present AttnPacker, an end-to-end, SE(3)-equivariant deep graph transformer architecture for the direct prediction of side-chain coordinates. Unlike existing methods, AttnPacker directly incorporates backbone geometry to simultaneously compute all amino acid side-chain atom coordinates without delegating to a rotamer library, or performing expensive conformational search or sampling steps. Tested on the CASP13 and CASP14 native and non-native protein backbones, AttnPacker predicts side-chain conformations with RMSD significantly lower than the best side-chain packing methods (SCWRL4, FASPR, Rosetta Packer, and DLPacker), and achieves even greater improvements on surface residues. In addition to RMSD, our method also achieves top performance in side-chain dihedral prediction across both data sets.


Introduction
Protein side-chain packing (PSCP) involves predicting the three-dimensional coordinates of side-chain atoms given the backbone conformation (coordinates) and primary sequence. This problem has important applications to protein structure prediction [11,10,6], design [23,26,8,29], and protein-protein interactions [28,14]. Traditional methods for PSCP rely on minimizing some energy function over a set of rotamers from a fixed library [15,34,25,4,1,9,32,5]. These methods tend to differ primarily in their choice of rotamer library, energy function, and energy minimization procedure. Although many of these methods have shown success, the use of search heuristics coupled with a discrete sampling procedure could ultimately limit their accuracy. Currently, the fastest methods (OSCAR-star [25], FASPR [15], SCWRL4 [9]) do not employ deep learning (DL) and are rotamer-library-based.
Several machine learning (ML) methods have been proposed for the task of side-chain prediction [22,21,30,31,34,20,32]. One of the earliest of these methods, SIDEPro [22], attempts to learn an additive energy function over pairwise atomic distances for each side-chain rotamer. This is achieved by training a family of 156 feedforward networks, one for each amino acid and contacting atom type. The rotamer with the lowest energy is then selected. DLPacker [21] formulates packing as an image-to-image transformation problem and employs a deep U-net-style neural network. The method iteratively predicts side-chain atom positions using a voxelized representation of the residue's local environment as input, and outputs densities for the respective side-chain's atoms. The output densities are then compared to a rotamer database, and the closest-matching rotamer is selected. The most recent version of OPUS-Rota4 [31] uses a pipeline of multiple deep networks to predict side-chain coordinates. The method uses predicted side-chain dihedral angles to obtain an initial model, and then applies gradient descent on predicted distance constraints to obtain a final structure. It is worth noting that OPUS-Rota4 is, to the best of our knowledge, the only ML-based PSCP method that directly utilizes a multiple sequence alignment (MSA) as part of its input.
Here we present AttnPacker, a new deep architecture for PSCP. Our method is inspired by recent breakthroughs in modeling three-dimensional data and architectures for protein structure prediction, most notably AlphaFold2 [18], Tensor Field Networks (TFNs) [27], and the SE(3)-Transformer [12]. By modifying and combining components of these architectures, we are able to significantly outperform traditional PSCP methods as well as machine learning methods in terms of speed, memory efficiency, and overall accuracy, using only features derived directly from primary sequence and backbone coordinates as input.
Specifically, we introduce deep graph transformer architectures for updating both residue and pairwise features based on the input protein's backbone conformation. Inspired by AlphaFold2, we propose locality-aware triangle updates to refine our pairwise features using a graph-based framework for computing triangle attention and multiplication updates. By doing this, we are able to significantly reduce the memory required for performing triangle updates which enables us to build higher capacity models. In addition, we explore several SE(3)-equivariant attention mechanisms and propose an SE(3)-equivariant transformer architecture for learning from 3D points.
Our method, AttnPacker, significantly outperforms traditional PSCP methods on CASP13 and CASP14 native backbones, with average reconstructed RMSD over 19% and 25% lower than the next best method on each test set, respectively. AttnPacker also surpasses the deep learning method DLPacker, with 13% lower average RMSD on each test set. In addition, our method achieves top RMSD scores for CASP14 non-native backbone targets predicted by AlphaFold2. On top of our method's favorable performance in RMSD minimization, it also achieves the lowest χ1-4 mean absolute error (MAE) across all methods on the CASP13 and CASP14 targets.

Network Architecture
Our method draws on several recent breakthroughs in deep learning and protein structure prediction, namely the two-stage architecture introduced by AlphaFold2, and a TFN-based equivariant transformer inspired by Fuchs et al. [12]. Moreover, our model directly predicts the 3D coordinates of all side-chain atoms for a given protein using only backbone coordinates and primary sequence as input. A detailed overview of input features and the input embedding procedure can be found in Supplementary Material, Table S3, and Figure S1.
The first component, outlined in Figure 1, is a deep graph transformer network which utilizes the geometry of the input backbone to revise node and pair features. This component is similar to AlphaFold2's 'Evoformer' module for processing MSA and pair features, but we replace axial self-attention and global triangle updates with graph-based self-attention and triangle updates for our residue and pair features. In AlphaFold2, triangle updates are performed by considering all triples of residues in the input protein, updating the corresponding pair representations in a geometrically motivated manner. These procedures require O(L³d) time to compute, and Ω(L³) space to store attention logits, where L is the number of residues and d is the hidden dimension of the pair features. In our setting, we are able to avoid this restrictive time and space complexity by incorporating locality information derived from the protein's backbone atom coordinates. To do this, we introduce locality-aware triangle updates (Section 2.2) and compare the performance of this approach against the update procedures used in AlphaFold2.

Figure 1: Architecture of the Locality Aware Graph Transformer Block. Standard multi-headed graph attention is used to update node features, while edges between adjacent nodes are updated with triangle multiplication and attention. For each triangle update, edge adjacency information is used to select only those triangles having maximum side length less than a fixed threshold.
The second component of our architecture is an SE(3)-equivariant neural network which takes the output of the first component, along with the protein's backbone Cα coordinates, to simultaneously predict each residue's side-chain coordinates. Our equivariant architecture is derived from the SE(3)-Transformer introduced by Fuchs et al. Similar to the SE(3)-Transformer, our attention blocks use TFNs to produce keys and values for scalar and point features, but our implementation differs in a few key areas. First, our implementation uses shared attention with dot-product-based similarity (see Table 5). That is, we combine the attention logits of each feature type to produce shared attention weights. Furthermore, in each attention block, we augment the input to the TFN radial kernel with pairwise distance information, and make further use of the pair features to bias attention weights and update scalar features (see Algorithm S2, Algorithm S3). The architecture is outlined in Figure 2, full implementation details can be found in Section S6, and ablation studies guiding these decisions are given in Table 6 and Table 7.

Figure 2: Keys and values are computed using TFNs. The input to each TFN radial kernel consists of pair features, along with distance information between points for the corresponding pair. Pair features are also linearly projected to bias the similarity logits for adjacent pairs of residues.
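To make the shared-attention mechanism concrete, here is a minimal NumPy sketch (single head; all function names and shapes are our own, hypothetical choices). Note that raw dot products between point features are not rotation-invariant by themselves; in the actual model the point keys and values are produced by TFNs, which is what guarantees equivariance. The sketch only illustrates how per-type logits are summed with a pair bias and a single softmax is shared across both feature types.

```python
import numpy as np

def shared_pair_biased_attention(q_s, k_s, v_s, q_p, k_p, v_p, pair_bias):
    """Sketch of shared, pair-biased attention (hypothetical shapes).

    q_s, k_s, v_s: [N, d]    scalar (type-0) queries/keys/values
    q_p, k_p, v_p: [N, p, 3] point  (type-1) queries/keys/values
    pair_bias:     [N, N]    linear projection of pair features

    Scalar and point features each contribute a dot-product logit; the
    logits are summed with the pair bias and a single softmax produces
    weights shared by both feature types.
    """
    d = q_s.shape[-1]
    p = q_p.shape[1]
    logit_s = q_s @ k_s.T / np.sqrt(d)                         # scalar logits
    logit_p = np.einsum('ipc,jpc->ij', q_p, k_p) / np.sqrt(3 * p)
    logits = logit_s + logit_p + pair_bias                     # combined
    w = np.exp(logits - logits.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                              # one softmax
    out_s = w @ v_s                                            # scalar update
    out_p = np.einsum('ij,jpc->ipc', w, v_p)                   # point update
    return out_s, out_p
```

In the real architecture the values `v_p` come from TFN outputs, so the weighted sum of points remains equivariant under rotation of the input.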

Locality Aware Triangle Updates
We experimented with multiple approaches for improving the time and space requirements of triangle updates. We ultimately settled on a hypergraph-based approach where a separate spatial graph is constructed for edge features.
In this graph, there is a vertex v_ij for residue pair (i, j) if the distance d(i, j) between the corresponding Cα atoms is within a threshold θ. We add hyperedges between v_ij and v_ik, v_jk only if max{d(i, k), d(j, k)} < θ. Interpreting this graph in the context of triangles, the vertices coincide with triangle edges of side length at most θ, and edges correspond to triangles. The process is illustrated in Figure 3.

Figure 3: Overview of pair graph generation for locality aware attention. Each pair of residues within a fixed distance is included as a node in this graph. (A) We begin with the spatial graph for residues and focus on a specific pair (i, j). (B) For each node x, we then compute the maximum of the distances d(i, x) and d(j, x) and disregard residues falling outside our distance threshold θ. (C) All triangles that include pair (i, j) in the remaining graph will have maximum side length at most θ. (D) The final graph has nodes for each edge in (C), and edges for each triangle.
In order to efficiently compute triangle updates, we select the 2N_e nearest neighbors of each vertex v_ij using max{d(i, k), d(j, k)} as our distance function. This results in a 3-uniform, N_e-regular hypergraph containing as many nodes as there are underlying residue pairs with Cα distance less than the threshold θ. We denote the latter quantity by N_v. Triangle multiplication and attention updates are straightforward to apply, since the neighbors of each node v_ij span strictly over residues k for which the corresponding triangle is well-defined. The update operations require time and space proportional to O(N_v · N_e) in total. If the distance threshold θ is chosen so that the number of resulting pairs is linear in the number of atoms, then choosing constant N_e results in N_v · N_e = o(n²) space complexity, and time complexity O(N_v · N_e · d). Computing the hypergraph still requires O(L³) time, but this can be performed on a CPU as a pre-processing step.
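The hypergraph construction above can be sketched in a few lines of NumPy. This is a brute-force illustration under our own naming (a production version would vectorize the neighbor selection, and the paper keeps 2N_e neighbors per vertex where we use a single `n_edges` parameter):

```python
import numpy as np

def build_pair_hypergraph(ca_coords, theta=15.0, n_edges=30):
    """Sketch of the locality-aware pair graph for triangle updates.

    Nodes are ordered residue pairs (i, j) with Ca distance d(i, j) < theta.
    For each node v_ij we keep up to n_edges residues k, ranked by the
    metric max(d(i, k), d(j, k)), so that every retained triangle (i, j, k)
    has maximum side length below theta.
    """
    L = ca_coords.shape[0]
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)          # pairwise Ca distances

    # Pair nodes within the threshold (ordered pairs, i != j).
    pair_idx = [(i, j) for i in range(L) for j in range(L)
                if i != j and dist[i, j] < theta]

    # For each node v_ij, rank candidate residues k by max(d(i,k), d(j,k)).
    neighbors = {}
    for (i, j) in pair_idx:
        m = np.maximum(dist[i], dist[j])          # metric per candidate k
        m[[i, j]] = np.inf                        # exclude the pair itself
        order = np.argsort(m)[:n_edges]
        neighbors[(i, j)] = [int(k) for k in order if m[k] < theta]
    return pair_idx, neighbors
```

By construction, every triangle implied by `neighbors` has all three Cα side lengths below `theta`, which is exactly the locality condition the triangle updates rely on.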

Results
We compare side-chain reconstruction accuracy for three of our models against several popular methods: DLPacker, RosettaPacker, SCWRL4, and FASPR. To understand the impact of different architectural components, we list results for our TFN-Transformer without triangle updates (TFN), with standard triangle updates (+Tri), and with locality-aware triangle updates (+Tri+Local), respectively. We use the same hyperparameters for the TFN-Transformer component, and use a distance cutoff of 15Å with 30 nearest neighbors for local triangle updates. All other hyperparameters are held constant unless otherwise specified. A full summary of model hyperparameters, training data, and training procedure can be found in Table S1 and Section S3.
To evaluate the performance of each method, we consider residue-level RMSD and dihedral angle deviations between the predicted and native side-chains. RMSD between predicted and native side-chain atoms is computed separately for each residue, and averaged over all residues in the corresponding data set(s). Dihedral angle mean absolute error (MAE) is computed analogously. Dihedral accuracy is defined as the fraction of dihedral angles with absolute difference less than 20° from the corresponding native angle. We further divide our results based on residue centrality. Core residues are defined as those amino acids with at least 20 Cβ atoms within a 10Å radius of the target residue's Cβ atom. Surface residues are defined as those amino acids with at most 15 Cβ atoms within the same region of interest. All comparisons are made on native structures from CASP13 and CASP14 (see Table S11 for a list of targets in each test set). We begin by comparing the average RMSD over the CASP13 and CASP14 test sets (see Table S6 for results on the CASP13-FM and CASP14-FM test sets). Table 1 shows that variants of our method consistently achieve the lowest RMSD in each centrality category across all data sets. The inclusion of triangle updates yields a significant improvement over the TFN-Transformer alone. Compared to the other deep learning method, DLPacker, our method obtains much lower RMSD in all categories, with the largest improvement on surface residues. Interestingly, the impact of triangle updates is less substantial on protein surface residues, suggesting that triangle updates play a larger role in determining the conformations of protein core residues. We explore this further in Figure 5.
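The evaluation criteria above can be stated precisely in code. The following NumPy helpers are a sketch under our own naming; note the periodic wrap applied to dihedral differences, and the core/surface centrality classes defined by Cβ counts:

```python
import numpy as np

def sidechain_rmsd(pred, native):
    """Per-residue side-chain RMSD. pred/native: [n_atoms, 3] coordinates
    of one residue's side-chain atoms (no alignment; the backbone is fixed)."""
    return float(np.sqrt(np.mean(np.sum((pred - native) ** 2, axis=-1))))

def _wrap(diff_deg):
    # Wrap angle differences into [-180, 180) to respect periodicity.
    return (np.asarray(diff_deg) + 180.0) % 360.0 - 180.0

def dihedral_mae(pred_chi, native_chi):
    """Mean absolute error of chi angles, in degrees."""
    return float(np.mean(np.abs(_wrap(np.asarray(pred_chi) - np.asarray(native_chi)))))

def dihedral_accuracy(pred_chi, native_chi, cutoff=20.0):
    """Fraction of chi angles within `cutoff` degrees of the native value."""
    return float(np.mean(np.abs(_wrap(np.asarray(pred_chi) - np.asarray(native_chi))) < cutoff))

def centrality_class(n_cb_within_10A):
    """Core: >= 20 Cb atoms within 10 A of the residue's Cb; surface: <= 15."""
    if n_cb_within_10A >= 20:
        return 'core'
    if n_cb_within_10A <= 15:
        return 'surface'
    return 'intermediate'
```

The wrap matters: a predicted angle of 170° against a native angle of -170° is only 20° off, not 340°.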
In terms of residue-level RMSD, the most significant improvements are achieved for Arg and His, each of which has a large, positively charged side-chain, as well as the bulky hydrophobic amino acids Phe, Trp, and Tyr. A full overview of residue-level RMSD on the CASP13 and CASP14 data sets can be found in Supplementary Tables S8 and S9. As shown in Figure 4, adding triangle updates before the TFN-Transformer improves the overall average RMSD of Arg, His, Phe, Tyr, and Trp by 11%, 16%, 20%, 28%, and 26%, respectively. For surface residues, the performance difference between the TFN-Transformer and TFN+Tri is significantly smaller. On the other hand, the performance gap between our method and others increases for surface residues, where we outperform the next best method by 16%, 17%, 25%, 20%, and 28%, respectively.
We further investigated the role of centrality in side-chain reconstruction accuracy by measuring average RMSD and average χ1-4 MAE with respect to the number of Cβ atoms in a residue's microenvironment. Not surprisingly, RMSD and chi angle prediction error decrease rapidly as centrality increases. In Figure 5, we verify that the marginal improvement of triangle updates increases with centrality. For both RMSD and MAE, the performance gap between TFN and TFN+Tri increases with centrality, suggesting that triangle updates are important for accurately determining side-chain conformations in protein cores. The opposite is true when comparing TFN+Tri with DLPacker, where the gap decreases as centrality increases. This is not surprising, as DLPacker iteratively constructs each residue's side-chain using only atoms in the residue's immediate microenvironment as input. This choice of input features implies that features for protein surface residues are more sparse than those for core residues. In addition to achieving top performance in reconstructed side-chain RMSD, our method is also significantly faster than all other methods except for FASPR.

We now turn our attention to dihedral angle prediction accuracy. As pointed out by Zhang et al. [15], the criterion for side-chain dihedral prediction accuracy is much sharper when all chi dihedrals are considered and a 20° cutoff is used when comparing to the native structure. We opt to use this criterion for all reported accuracies. In addition, we assess the quality of each method in terms of χ1-4 MAE with respect to the native structure.
We analyzed side-chain dihedral accuracy on the CASP13 and CASP14 data sets and show the results in Table 3. In terms of χ1-4 MAE, the TFN-Transformer with triangle updates achieves top-1 or top-2 performance on each data set. The performance carries over into prediction accuracy, where the method also achieves top-1 or top-2 performance on each data set regardless of residue centrality. Local triangle updates achieve competitive performance, with top-2 MAE scores for all but one category.
In line with the results reported by Misiura et al. [21], when compared to traditional methods, deep learning methods recover χ1 dihedral angles considerably closer to those of the native structure. For our methods, this improvement carries over to χ2 angle prediction, where triangle updates obtain an 8% and 10% improvement over the next best method, DLPacker, on the CASP13 and CASP14 targets.

Table 3: Side-chain dihedral prediction results on the CASP13 and CASP14 targets. Accuracy results shown for all, core, and surface residues.
To better understand the instances where our method loses its advantage to traditional PSCP algorithms, we consider the performance on four amino acids with many side-chain dihedral degrees of freedom. Following DLPacker, we consider dihedral prediction for the charged and polar amino acids Lys, Arg, Glu, and Gln in Table 4. For these amino acids, variants of our method are still competitive with traditional PSCP algorithms, comparing favorably in χ1-2 MAE but losing their edge at higher orders. Comparing accuracy scores tells a different story. Although we are able to obtain comparable or lower MAE values for each degree of freedom, we only obtain top-1 or top-2 accuracy for Gln and Glu. On the other hand, the physics-based RosettaPacker obtains top-1 or top-2 performance for each amino acid, along with top-1 χ4 scores. As the authors of DLPacker also point out, new training methods or loss functions which improve higher-order dihedral accuracy are an important area for future research.

Ablation and Architecture Assessment
We consider several variants of SE(3)-equivariant self-attention. Each variant can be categorized by the operation used to compute the similarity between keys and queries of different feature types (i.e. 1D scalar and 3D point features) and the operation used to compute the final attention weights. We focus on the well-known dot-product similarity used in the SE(3)-Transformer, and negative-distance similarity, which was first presented in the Invariant Point Attention module of AlphaFold2. Aside from calculating similarity between points, these architectures also differ in their calculation of attention scores: AlphaFold2 uses the same shared attention weights for scalar and point features, whereas the SE(3)-Transformer computes attention weights for each type. This is outlined more formally in Table 5. Following the conventions of [27,12], we use a superscript ℓ to denote type-ℓ features of dimension 2ℓ + 1.

Table 5: Description of similarity and attention types. Here, we use a superscript ℓ to denote type-ℓ features of dimension 2ℓ + 1 (i.e. ℓ = 0 for scalar features and ℓ = 1 for point features). For distance similarity, we use T_xy to denote a transformation mapping points into so-called "local frames" of the respective node, used to ensure equivariance of the operation.
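The two similarity types and two attention-weight schemes can be sketched as follows (NumPy, single head, our own function names). Equivariance details are omitted: in particular, for distance similarity the real model first maps points through the local-frame transform T_xy, which this sketch skips.

```python
import numpy as np

def similarity_logits(q, k, kind):
    """Similarity variants, simplified.

    q, k: [N, d] scalar features or [N, p, 3] point features.
    'dot'  : scaled dot product (SE(3)-Transformer style).
    'dist' : negative squared distance (Invariant Point Attention style);
             the local-frame transform T_xy is omitted here.
    """
    if kind == 'dot':
        qf = q.reshape(q.shape[0], -1)
        kf = k.reshape(k.shape[0], -1)
        return qf @ kf.T / np.sqrt(qf.shape[1])
    if kind == 'dist':
        diff = q[:, None] - k[None, :]
        return -np.sum(diff ** 2, axis=tuple(range(2, diff.ndim)))
    raise ValueError(kind)

def attention_weights(logit_scalar, logit_point, shared=True):
    """Shared: one softmax over summed logits, reused for both types.
    Per-type: an independent softmax for each feature type."""
    def softmax(x):
        e = np.exp(x - x.max(-1, keepdims=True))
        return e / e.sum(-1, keepdims=True)
    if shared:
        w = softmax(logit_scalar + logit_point)
        return w, w
    return softmax(logit_scalar), softmax(logit_point)
```

The four combinations of `kind` and `shared` correspond to the four attention variants compared in the ablation.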
We trained four TFN-Transformer models differing only by attention and similarity type. Each model used a hidden dimension of 24 for point features and 180 for scalar features. Ten attention heads were used in each of eight attention blocks. A head dimension of 20 was used for scalar features and 4 was used for point features. A maximum of 14 nearest neighbors were considered and a radius of 16Å was used as a cutoff. The average RMSD and dihedral accuracy results of the four attention variants are shown in Table 6.

The results in Table 6 show that shared attention weights produce better results for both similarity types. Overall, dot-product-based similarity outperforms distance-based similarity, even when per-type attention is used. As pointed out by Fuchs et al. [12], this may be due to the fact that each basis kernel in a TFN is completely constrained in the angular direction. By using dot-product-based similarity, the angular profiles of the basis kernels are modulated by the attention weights.
We also experimented with different architectural variants. First, we used a linear projection, rather than a TFN, to compute keys at each attention head. Next, we augmented the input to the TFN radial kernel by concatenating the pairwise distances between hidden coordinates. Third, we tried removing the attention calculation between points, instead using weights derived from scalar features for pointwise attention. For each variant, shared attention with dot-product similarity was used. As shown in Table 7, using a linear projection for attention keys has the largest impact on RMSD score, surprisingly more than removing point-based attention altogether.

We also considered the effect of the neighbor distance threshold on performance. This threshold is used as the maximum valid edge length for triangle updates, and to determine residue adjacency in the locality-aware graph transformer. It is also used to define the neighborhood of scalar and point features in the TFN-Transformer. We tried three distance thresholds: 8Å, 12Å, and 16Å. The results are shown in Table 7.
Average RMSD clearly decreases with increasing neighbor distance. The marginal improvement diminishes with increasing radius, and the bulk of the improvement comes from residues in the protein core.

Concluding Discussion
In this work, we developed AttnPacker, an SE(3)-equivariant model for direct prediction of side-chain coordinates based on backbone geometry. AttnPacker uses spatial information derived from protein backbone coordinates to efficiently model residue and pairwise neighborhoods. This, coupled with an SE(3)-equivariant architecture, allows for the simultaneous prediction of all amino acid side-chain coordinates without the use of rotamer libraries or conformational sampling.
Components of our model were inspired by AlphaFold2 and the SE(3)-Transformer. By generalizing and carefully evaluating ideas from these architectures, we were able to match or improve on the efficiency of the original implementations. Specifically, we generalized the Evoformer module of AlphaFold2 to spatial graphs specific to protein backbones. We also modified the attention heads of the SE(3)-Transformer to incorporate pair bias, distance information between hidden points, and shared dot-product similarity.
Our TFN-Transformer outperforms all other methods in terms of average RMSD on the CASP13 and CASP14 targets, even without triangle updates for pair features. On the other hand, the baseline SE(3)-Transformer with per-type attention falls short of DLPacker on the CASP13 targets. Part of this improvement is achieved by augmenting edge features with pairwise distance information between hidden points. We hypothesize that this information is especially important for deeper TFN-based architectures. TFNs require a basis of spherical harmonics which is typically computed once, on the relative positions of input points, and then shared across all subsequent layers. As a result of using spherical basis functions, information about relative distances between hidden coordinates is lost. To account for this, distances between pairs of initial point features are typically concatenated to the input of the TFN radial kernel. Traditionally, this input is the same across all layers: the pairwise distance d_ij between initial points i and j, and (optionally) a fixed vector of pair features e_ij. In theory, distance information between hidden point features could be captured by the TFN convolution operation, but we believed that incorporating this information at the input level could be beneficial. Since separate radial kernels are learned for each TFN, we chose to augment our input with distance information between hidden coordinate features at each layer, resulting in a modest improvement over the baseline implementation.
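The per-layer augmentation described above amounts to recomputing hidden-point distances at every layer and appending them to the radial kernel's edge input. A NumPy sketch (the function name and shapes are our own, hypothetical choices):

```python
import numpy as np

def radial_kernel_input(edge_feats, hidden_points):
    """Per-layer input to a (hypothetical) TFN radial kernel.

    Traditionally the kernel sees only initial-point distances d_ij plus
    fixed edge features e_ij, shared by all layers. Here we recompute
    distances between the current layer's hidden point features and append
    them, so each layer's kernel is conditioned on up-to-date geometry.

    edge_feats:    [N, N, e]    pair features e_ij
    hidden_points: [N, p, 3]    current layer's type-1 (point) features
    returns:       [N, N, e+p]  augmented kernel input
    """
    diff = hidden_points[:, None] - hidden_points[None, :]   # [N, N, p, 3]
    hidden_d = np.linalg.norm(diff, axis=-1)                 # [N, N, p]
    return np.concatenate([edge_feats, hidden_d], axis=-1)
```

Because distances are invariant under rotation and translation, appending them to the radial kernel input preserves the network's SE(3) equivariance.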
The difference in performance between our model trained using full triangle updates and our model trained with local triangle updates is very small, especially considering the fact that all other hyperparameters were the same for each model. Restricting residue and pairwise attention updates to at most 30 nearest neighbors did not considerably degrade performance. On the other hand, this drastically reduces the memory required to train the model. In the future, it would be interesting to see if a larger or deeper model could improve performance further.
On top of outperforming other popular methods, our model presents several other advantages. First, it is extremely fast. We are able to predict all side-chain conformations for a 500-residue protein in less than a second using a single NVIDIA RTX A6000 GPU. On the other hand, DLPacker must be run iteratively for each amino acid side-chain, causing a large dependence on protein length. OPUS-Rota4 also requires several pre- and post-processing steps in the form of derived constraints and gradient descent. Since our method directly predicts side-chain coordinates, the output is fully differentiable, which benefits downstream prediction tasks such as refinement or protein-protein interaction modeling. This also circumvents the use of engineered energy functions and rotamer libraries, and places the emphasis on architectural innovations and better loss functions.
Our model is also very simple to use: it requires only a PDB file to run. In contrast, OPUS-Rota4 requires voxel representations of atomic environments derived from DLPacker, logits from trRosetta100, secondary structure, and constraint files derived from the output of OPUS-CM. Obtaining the requisite input data made it too difficult for us to compare results with this method.
In addition to our method's speed and simplicity, it further succeeds in efficiently modeling residue-level local environments by using locality-based graph attention during the feature and structure generation stages. In contrast, DLPacker and OPUS-Rota4 use 3D-voxelized representations of each amino acid's microenvironment of size v × v × v × c, where v is the voxel grid width (40 in the case of DLPacker and OPUS) and c is the number of channels. Although this choice of representation has helped facilitate good performance for each method, the memory requirements prevent simultaneous modeling of all amino acid side-chains, which could ultimately hinder reconstruction accuracy. We hypothesize that simultaneously modeling all side-chains helps contribute to our method's RMSD improvements for surface residues.
Although AttnPacker yields significant improvements in residue-level RMSD, we remark that traditional PSCP methods SCWRL, FASPR, and RosettaPacker still perform comparably in terms of χ 3 and χ 4 angle prediction.
We suspect that incorporating dihedral information into our model -either directly or through an appropriate loss function -could help improve performance.

Conflict of interest
The authors declare that they have no conflict of interest.

Supplementary Material

S1 Data Collection

SCWRL4 (ver. 4.02) and FASPR were run with default configurations. Rosetta's fixbb application was run with non-flexible backbone coordinates and the maximum number of rotamers by passing the -EX1, -EX2, -EX3, and -EX4 flags. We also included the flags -packing:repack_only to disable design, -no_his_his_pairE, and -multi_cool_annealer 10 to set the number of annealing iterations; these settings are recommended in the Rosetta tutorial. We ran Rosetta Packer 5 times for each target protein using Rosetta's ref2015 energy function and selected the conformation with lowest energy. DLPacker was run using the pre-trained release from the authors' GitHub, downloaded Sept. 17th, 2021. Side-chains were reconstructed in non-increasing order of the number of atoms in the corresponding amino acid's microenvironment.
We ran each method on each of the targets listed in Table S11. All of the methods were run using the native backbone coordinates as input. For our models, we considered only the first 800 amino acids and corresponding coordinates in each target's PDB file. The output of FASPR, SCWRL, Rosetta Packer, and DLPacker was cropped to the same length where applicable.

S2 Overview of Hyperparameters
We tried to maintain consistent hyperparameters for all models. We mainly tuned parameters for model depth, number of nearest neighbors, number of attention heads, head dimension, and the distance at which residues or pair features should be considered neighbors. We settled upon the hyperparameter values listed in Table S1. In choosing the parameters, we aimed to balance memory usage with model capacity in each submodule. The final settings required ~32 GB of GPU memory during training when full triangle updates are used (based on a maximum sequence length of 300 residues). The actual memory usage is lower when local triangle updates are used.

S3 Training Details
We trained all models on the BC40 training data set (SOURCE) and validated on the BC40 validation set. The data set contains ~36k proteins, selected from the PDB database with a 40% sequence identity cutoff. We first train our models for 10 epochs with a sequence crop size of 300 and an initial learning rate of 10^-3. The learning rate is decreased by a factor of two every three epochs, and we do not use any warm-up. After the first stage completes, the models are trained for an additional 2 epochs with a learning rate of 10^-4 and a sequence crop size of 500. No other parameters are modified between the two training stages.
To optimize our models, we use Adam [19] with parameters β1 = 0.9, β2 = 0.999, and ε = 10^-8, and use a minibatch size of 32. To stabilize training and avoid using learning rate warm-up schedules, we use ReZero [3] for every residual connection in our transformer blocks. We also apply gradient clipping by global norm [24] to clip the gradients of each example in a minibatch to have ℓ2 norm at most 1.
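The two stabilization tricks above are simple to state in code. Below is a minimal NumPy sketch (class and function names are our own): a ReZero residual branch, which is scaled by a learned gate initialized to zero so every block starts as the identity, and gradient clipping by global norm.

```python
import numpy as np

class ReZeroResidual:
    """Residual branch scaled by a learned scalar initialized to zero
    (ReZero, Bachlechner et al.). At initialization the block is the
    identity, which removes the need for learning-rate warm-up."""
    def __init__(self, fn):
        self.fn = fn
        self.alpha = 0.0          # learned scalar, init 0

    def __call__(self, x):
        return x + self.alpha * self.fn(x)

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their joint L2 norm is at
    most max_norm; gradients below the threshold are left unchanged."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads]
```

In training, `alpha` would be a trainable parameter updated by Adam alongside the weights of `fn`.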
We apply gradient checkpointing on the triangle attention logits and TFN kernel outputs. This yields a massive decrease in memory consumption during training, at the cost of a ≈50% decrease in speed. Details are provided in Section S8. Overall, each model was trained for roughly six days on a single NVIDIA RTX A6000 GPU (~5 days for the first stage, and ~1 day for stage 2).

S3.1 Loss Function
Our loss function consists of two equally-weighted terms. The first is an auxiliary loss over predicted distances of distal side-chain atoms (see 'tip-atom' defined in [13]). This term is applied to the pair output of the locality-aware graph transformer. The pair output is first symmetrized, and then logits are obtained by linearly projecting into 46 bins covering 2Å-20Å. Two bins are also added for predicted distances less than 2Å and greater than 20Å. If locality-aware attention is used, the loss is only computed for pair (i, j) if the corresponding residues are adjacent. Pairwise distograms are obtained by taking a softmax of the logits, and an averaged cross-entropy loss is then applied.
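The binning scheme and cross-entropy described above can be sketched as follows (NumPy; function names and the exact bin ordering — underflow first, overflow last — are our own assumptions):

```python
import numpy as np

def distance_to_bin(d, n_bins=46, d_min=2.0, d_max=20.0):
    """Map a tip-atom distance to one of 48 target bins: 46 uniform bins
    covering [2 A, 20 A), plus one underflow (< 2 A) and one overflow
    (>= 20 A) bin, matching the distogram head described above."""
    if d < d_min:
        return 0                                   # underflow bin
    if d >= d_max:
        return n_bins + 1                          # overflow bin
    width = (d_max - d_min) / n_bins
    return 1 + int((d - d_min) / width)

def distogram_cross_entropy(logits, true_dist):
    """Averaged softmax cross entropy against binned true distances.
    logits: [n_pairs, 48]; true_dist: [n_pairs] distances in Angstroms."""
    targets = np.array([distance_to_bin(d) for d in true_dist])
    z = logits - logits.max(-1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(-1, keepdims=True))
    return float(-np.mean(log_p[np.arange(len(targets)), targets]))
```

With uniform (all-zero) logits the loss equals log(48), a useful sanity check when wiring up the head.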
The second loss term is applied to the predicted coordinates. Let Ĉ^(i) denote the predicted side-chain coordinates of residue i, and let C^(i) denote the corresponding native coordinates. The per-residue loss is

L(i) = huber(Ĉ^(i), C^(i), β),    (1)

where huber(x, y, β) is the Huber loss between x and y with smoothing parameter β. The final loss is computed as mean_i L(i).
Some care must be taken in computing Equation (1), since some residues have symmetric side-chains. For these residues, we consider all possible symmetries by swapping the coordinates of symmetric atoms, and take the lesser of the swapped and unswapped loss for the respective residue. A list of residues with symmetric side-chains, and the pairs of atoms for which we swap coordinates, can be found in Table S2.
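The symmetry-aware coordinate loss can be sketched directly (NumPy; function names are ours, and the Huber form below is the standard smooth-L1 variant, which we assume matches the paper's "smoothing parameter β"):

```python
import numpy as np

def huber(x, y, beta=1.0):
    """Elementwise Huber (smooth-L1) loss between coordinate arrays,
    summed: quadratic for residuals below beta, linear above."""
    r = np.abs(x - y)
    return float(np.sum(np.where(r < beta, 0.5 * r ** 2 / beta, r - 0.5 * beta)))

def symmetry_aware_loss(pred, native, swap_pairs, beta=1.0):
    """Coordinate loss respecting symmetric side-chains (e.g. the two
    equivalent ring orderings of Phe/Tyr).

    pred, native: [n_atoms, 3]; swap_pairs: (a, b) atom-index pairs whose
    coordinates are exchanged under the symmetry. Returns the minimum of
    the swapped and unswapped alternatives.
    """
    loss = huber(pred, native, beta)
    alt = native.copy()
    for a, b in swap_pairs:
        alt[[a, b]] = alt[[b, a]]
    return min(loss, huber(pred, alt, beta))
```

A prediction that matches the native structure up to the atom swap therefore incurs zero loss, as intended.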

S4.1 Input Features
Our model uses only input features derived directly from primary sequence and backbone coordinates. An overview of input feature types and their corresponding shapes can be found in Table S3:

Backbone phi, psi, and omega dihedral angles.

seq_pos [L]
The sequence index of the respective residue, from 1 to L.

centrality [L]
The number of Cα atoms within 16Å of the respective residue's Cα atom.

atom_distance
One-hot encoded binned distances between Cα−Cα, Cα−N, and N−O atoms in each residue pair. Each bin represents a distance from 2Å to 20Å, with two separate bins used for distances falling outside this range.

tr_orientations [L, L, 3]
Dihedral and planar angles defined by Yang et al. [33].

We use standard residue-level features and encodings for our input. These include a representation of amino acid type, binned relative sequence position, and embeddings/encodings of backbone dihedral angles. Less standard is the inclusion of residue centrality. The use of centrality-based encodings was studied in [35], where the authors found that transformers significantly benefit from the addition of centrality encodings on graph-like data. We choose to incorporate this information at the input level, rather than in each attention block (as proposed in [35]). Originally, we included SS3 secondary structure as part of our input, but found that this feature gave the same (if not slightly worse) results in terms of average residue RMSD.
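The centrality feature is computed directly from Cα coordinates. A minimal numpy sketch, assuming the 16 Å cutoff described above:

```python
import numpy as np

def centrality(ca_coords, radius=16.0):
    """Per-residue centrality: the number of other C-alpha atoms within
    `radius` Angstroms of each residue's C-alpha atom.
    ca_coords: array of shape (L, 3)."""
    dists = np.linalg.norm(ca_coords[:, None, :] - ca_coords[None, :, :], axis=-1)
    return (dists < radius).sum(-1) - 1   # exclude the residue itself
```

The result is an integer vector of shape [L], which is then one-hot encoded (d_cen bins) before embedding.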
For our pair embedding, we follow a scheme similar to [17], but also incorporate residue-pair information following the approach of AlphaFold2. Rather than embed each residue pair separately, the authors propose using two separate embeddings for residue types. Denote the embeddings as E_A and E_B; then the pair feature for residues i and j of types r_i and r_j (resp.) is given by the outer sum of E_A(r_i) and E_B(r_j). The authors further augment this information by adding an embedding of the relative sequence separation between the two residues, and we follow the same approach here.

Figure S1: Overview of our input feature embedding. The feature shape is shown in parentheses. Residue and pairwise features are embedded separately. Only information derived from primary sequence and backbone coordinates is considered.

Full explanations of the procedures referenced in Figure S1 are given in Table S4. For residue features in our final models, we use d_cen = 7 for the one-hot encodings of centrality features, d_A = 6 for our backbone torsion-angle embedding dimension, and d_Rr = 32 for the residue-type embedding dimension. For pairwise features, we use d_D = 34 for each one-hot distance encoding, d_Rp = 48 for residue-pair embeddings, and d_A = 6 for embeddings of trRosetta orientations.

Maps the (signed) sequence separation s to the index of a predefined interval. We chose the intervals [0,1), [1,2), [2,3), [3,4), [4,5), [5,6), [6,7), [7,8), [8,10), [10,12), [12,15), [15,20), [20,30), [30,∞), plus the negation of each interval.

bin_dist(d)
Maps the pairwise distance d to one of 32 equal-width bins covering the range 3Å–20Å. Distances less than 3Å and distances greater than 20Å are placed into separate bins.

Table S4: Explanations of the procedures referenced in Figure S1. The name of each procedure (top), along with its domain and range (bottom), is given in the first column. Here, we use L to denote the protein length.
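The two binning procedures above are straightforward to implement. A numpy sketch follows; the name `bin_seq_sep` for the sequence-separation procedure is our label (the source gives only its description), while `bin_dist` matches the name above. Note that 32 interior bins plus two outlier bins gives the 34 classes matching d_D = 34.

```python
import numpy as np

# Interval upper bounds for non-negative sequence separations.
UPPER = [1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 15, 20, 30, np.inf]

def bin_seq_sep(s):
    """Map signed sequence separation s to an interval index; negative
    separations get a mirrored second set of intervals.
    (Name `bin_seq_sep` is our label for the procedure in Table S4.)"""
    idx = next(i for i, u in enumerate(UPPER) if abs(s) < u)
    return idx if s >= 0 else idx + len(UPPER)

def bin_dist(d, lo=3.0, hi=20.0, n_bins=32):
    """Map pairwise distance d to one of 32 equal-width bins over
    3A-20A, with separate under- and overflow bins (34 classes)."""
    if d < lo:
        return 0
    if d >= hi:
        return n_bins + 1
    return 1 + int((d - lo) / (hi - lo) * n_bins)
```

The returned indices are then one-hot encoded before being embedded as pair features.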

S5 Locality Aware Triangle Attention
Consider a set of points P and a threshold θ > 0, and let T(P; θ) denote the set of triangles in P with maximum side length at most θ.
We develop a hypergraph approach to performing triangle updates. Define G(P; θ) = (V, E, d), where the vertex set V contains a node v_ij for each pair of points i, j within distance θ, the edge set E is defined by the adjacency relation joining nodes that share a common underlying point and whose three underlying points form a triangle in T(P; θ), and d assigns to each pair of adjacent nodes the maximum side length of the corresponding triangle. Observe that there is a node for every pair of points within distance θ, and the distance between two nodes sharing a common underlying point is the maximum edge length on the corresponding triangle.
When the underlying point set consists of backbone atom coordinates for a protein, a reasonable choice of θ results in a relatively small set of triangles. Furthermore, in order to efficiently compute triangle updates for each node v_ij, we restrict N_G(v_ij) to the N_e nearest neighbors under d(·, ·). This yields a 3-uniform, O(N_e)-regular hypergraph with as many nodes as there are pairs of underlying points within distance θ. We denote the latter quantity by N_v.
In theory, the pairwise features selected for triangle attention can be further restricted to the (at most) N nearest neighbors of each residue, since only the pair features for each residue's N nearest neighbors are used to bias residue attention logits. In practice, we restrict pair features to the (at most) 2N nearest neighbors of each residue. More formally, the vertices of our triangle hypergraph consist of pairs (i, j) such that j is among the 2N nearest neighbors of i. Triangle multiplication and attention updates then require time and space proportional to O(N_v · N_e) = O(n · N_e²) in total. These updates are identical to those of AlphaFold2, but restricted to this subset of triangles. As an alternative, we can easily implement local triangle updates by computing a global triangle mask for each sample (see Figure S3). This approach does not yield the time and space savings previously discussed, but does serve as a simple drop-in replacement for existing models.
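The vertex set described above (pairs (i, j) with j among the 2N nearest neighbors of i) can be built directly from coordinates. A minimal numpy sketch, using Cα distances as the neighbor metric:

```python
import numpy as np

def triangle_vertices(coords, n_nbrs):
    """Vertex set of the triangle hypergraph: all pairs (i, j) such that
    j is among the 2 * n_nbrs nearest neighbors of i.
    coords: array of shape (L, 3) of C-alpha coordinates."""
    n = len(coords)
    dists = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    np.fill_diagonal(dists, np.inf)          # a residue is not its own neighbor
    k = min(2 * n_nbrs, n - 1)
    nbrs = np.argsort(dists, axis=-1)[:, :k]
    return {(i, int(j)) for i in range(n) for j in nbrs[i]}
```

Turning this set into a boolean L × L mask gives the simple drop-in "global triangle mask" variant mentioned above.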

Algorithm S1 SE(3)-Equivariant Normalization
For normalization, we propose a method similar to layer normalization with a norm-based non-linearity. Several papers have proposed SE(3)-equivariant normalization schemes (e.g. [12,7,16]); most include some form of layer normalization [2] or a restriction on the ℓ2 norm of coordinate features. In our experiments, we found that applying layer normalization to coordinate norms (and subsequently scaling by these values) sometimes caused instability in the early stages of training. In light of this, in Algorithm S1 we simply learn a scale σ_ℓ and bias β_ℓ for each feature type ℓ.
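A minimal numpy sketch of a norm-based normalization in the spirit of Algorithm S1: each feature vector is rescaled by an affine function (scale σ, bias β) of its ℓ2 norm, leaving directions untouched. The exact functional form is our assumption; what matters is that only vector lengths change, so rotating the input rotates the output identically.

```python
import numpy as np

def equivariant_norm(x, sigma=1.0, beta=0.0, eps=1e-6):
    """Normalize type-l features x of shape (d, 2l+1) by rescaling each
    feature vector with a learned affine function of its l2 norm.
    Directions are preserved, so SE(3) equivariance is maintained.
    (Sketch only; the exact form of Algorithm S1 is an assumption.)"""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / (norms + eps) * (sigma * norms + beta)
```

Equivariance can be checked directly: applying a rotation before or after the normalization gives the same result.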
Algorithm S2 Augment Edge Features
Input:
z_ij: pair features in R^{d_z}
p_ij: hidden coordinate features in R^{d_p×3} for each residue
r_ij: input relative coordinates in R^{1×3}
Output:
z_ij: augmented edge features in R^{d_z+2·d_p}
function AugmentPairFeats(z_ij, p_ij, r_ij)

In augmenting the pair features with distance information, we choose to append both normalized and un-normalized distances between the hidden points. The output of this function is passed directly to the TFN radial kernel, which employs a 3-layer MLP with GELU non-linearity to produce pairwise kernels for each pair of input feature types.
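A sketch of the augmentation step in numpy. How the hidden points of the two residues are combined, and the exact normalization applied, are our assumptions; the sketch only illustrates the stated output dimension d_z + 2·d_p.

```python
import numpy as np

def augment_pair_feats(z_ij, p_i, p_j, r_ij, eps=1e-6):
    """Append un-normalized and normalized distances between the hidden
    points of residues i and j to the pair feature z_ij, producing an
    output of dimension d_z + 2 * d_p.
    p_i, p_j: hidden coordinate features of shape (d_p, 3).
    (How points are paired and normalized is an assumption.)"""
    deltas = (p_i + r_ij) - p_j                 # hidden displacements, (d_p, 3)
    dists = np.linalg.norm(deltas, axis=-1)     # raw distances, (d_p,)
    normed = dists / (dists.sum() + eps)        # normalized variant, (d_p,)
    return np.concatenate([z_ij, dists, normed])
```

The concatenated vector is what the 3-layer GELU MLP of the radial kernel consumes.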

Algorithm S3 TFN-Transformer Attention Head
Input:
z_ij: pair features in R^{d_z}
B^{ℓ,k}_ij: equivariant basis mapping type-ℓ features to type-k features
r_ij: input relative coordinates in R^{1×3}
N_i: list of neighbor indices for each residue i = 1..L
Output:
o^ℓ_i: attention head features for each input type ℓ = 0 . . .

In Algorithm S3, we use d_ℓ to denote the input dimension, and h_ℓ the hidden (head) dimension, of input feature type ℓ. Each attention head has a separate learnable weight γ_ℓ for each input type ℓ. This weight is the softplus of a learnable scalar and is initialized so that γ_ℓ = 1. For each input type, the outputs of the attention heads are concatenated and linearly projected so that the output dimension matches the original input dimension. All linear projections (SE3Linear) follow the scheme proposed in [7].
We give a full description of our TFN-Transformer in Algorithm S4. It consists of three components: (1) an equivariant mapping of scalar and coordinate input features to hidden feature types/dimensions, (2) multiple TFN-based attention layers, and (3) a mapping of hidden features back to the output types/dimensions.

Figure S4: The full architecture of AttnPacker. We use s_i to denote scalar (residue) features and z_ij to denote pair features. Backbone coordinates are used to define residue and pair adjacencies, and are included in the equivariant input mapping (see Algorithm S4). The equivariant input mapping returns type-ℓ features x^ℓ_i for user-defined types ℓ > 0. In our model, scalar output features x^0_i are discarded, and the coordinate output x^1_i has d_coord = 32 channels, one for each possible side-chain atom type.
Our full architecture is shown in Figure S4. It consists of an input embedding (Figure S1) that produces scalar features for residues and pairs, followed by our locality-aware graph transformer and TFN-Transformer. Pair features are modified only in the locality-aware graph transformer block; the TFN-Transformer still makes use of pair features to produce radial kernels, bias attention logits, and augment the output of each attention head.

S8.1 Triangle Attention
In Section 2.2, we determined that the number of pair features requiring triangle updates is bounded by 2NL, where N is the maximum number of neighbors per point and L is the protein length. Locality-aware triangle attention stores at most N attention logits per pairwise feature. It follows that the number of attention logits stored for an attention layer with h heads is 2N²L·h. In practice, computing neighbor-wise attention can be costly: depending on the implementation, repeating or grouping of neighbor features may cause the gradients for attention logits to require significantly more space. We experimented with several implementations, but ultimately decided to mask full triangle attention logits (O(L³) space) as a proof of concept. Masking attention logits effectively stops the gradient flow for all pair features which are not part of the triangle graph. This resulted in a modest improvement in memory, but more work will be needed to realize the full time and space benefits of this architecture.

S8.3 TFN Memory Analysis
One of the main drawbacks of the SE(3)-Transformer is the high memory usage caused by computing equivariant pairwise kernels in each attention block. To alleviate some of this overhead, we modify the TFN implementation used in the original SE(3)-Transformer proposed by Fuchs et al. [12]. Given input and output feature tensors of sizes (d_in, o_in) and (d_out, o_out) respectively, the corresponding basis element mapping between these types has shape (o_in, o_in · o_out). Let f = min(o_in, o_out) denote the frequency of the mapping; then for each pair of features we require a radial kernel of size (d_in, d_out, f). To ensure equivariance, the kernel passes through the corresponding basis element, yielding an intermediate tensor of shape (d_in, d_out, o_in, o_in · o_out). That is, the kernel is obtained by multiplying the radial weights for each pair through the corresponding basis element. The input features are then multiplied through the respective kernel to yield the desired output, and the process is repeated for each pair of input and output types.
As TFNs are used to produce key and value vectors in each attention block, the intermediate kernel mapping between type-ℓ features at each block requires memory scaling with the number of edges and with (h_ℓ · d_ℓ)², where h_ℓ and d_ℓ are the number of heads and the head dimension for type-ℓ features.
In our implementation, we obtain a factor ≈ o_in · o_out reduction in memory by changing the order of matrix multiplication in the TFN kernel. Rather than multiplying the radial weights through the basis, we first multiply the features through the basis, and then multiply the result with the radial weights. Since the full per-edge kernel is never materialized, the memory required to store the intermediate tensors is greatly reduced. This alleviates much of the memory burden of TFNs and, together with gradient checkpointing, allows us to fit a much deeper and larger model on a single GPU.
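The reordering is just an application of associativity of the tensor contractions. The following numpy sketch demonstrates it for a single edge with small illustrative sizes (not the model's actual dimensions): both contraction orders produce identical output, but the feature-first order never materializes the large intermediate kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, f, o_in, o_out = 8, 8, 2, 3, 3     # illustrative sizes
radial = rng.normal(size=(d_in, d_out, f))      # radial weights for one edge
basis = rng.normal(size=(f, o_out, o_in))       # equivariant basis element
feats = rng.normal(size=(d_in, o_in))           # input features on this edge

# Order 1: multiply radial weights through the basis to build the full
# per-edge kernel (large intermediate), then apply it to the features.
kernel = np.einsum("abf,fko->abko", radial, basis)   # (d_in, d_out, o_out, o_in)
out_kernel_first = np.einsum("abko,ao->bk", kernel, feats)

# Order 2: multiply the features through the basis first (small
# intermediate), then contract with the radial weights.
tmp = np.einsum("fko,ao->afk", basis, feats)         # (d_in, f, o_out)
out_feats_first = np.einsum("abf,afk->bk", radial, tmp)

assert np.allclose(out_kernel_first, out_feats_first)
assert kernel.size > tmp.size   # 576 vs 48 elements in this toy case
```

In the model, the same saving applies on every edge and for every pair of feature types, which is where the bulk of the memory reduction comes from.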

S9.1 CASP-FM RMSD
To better understand AttnPacker's ability to generalize to new folds, we evaluated each method on CASP13 and CASP14 free-modelling (FM) targets. These datasets consist of proteins with previously unseen folds and hard analogous fold-based models (see Table S11 for a complete list). The results are shown in Table S6. For the non-native backbone comparison, we trained two separate models, TFN+Tri(25) and TFN+Tri(75). These models were trained using the same list of targets as the original model, but with each target's backbone coordinates replaced by those predicted by AlphaFold2 with probability 25% and 75%, respectively.
To train these models, we calculated separate rotations and translations mapping the backbone atom coordinates of each residue in the native structure to the corresponding coordinates of the decoy (non-native) structure. We then applied these transformations to align both backbone and side-chain coordinates for each residue in the native structure and treated the result as the ground-truth. After applying this transformation, we were able to use the loss functions described in Section S3.1 without modification.
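The per-residue alignment described above can be sketched with the Kabsch algorithm; that Kabsch specifically is used is our assumption (the text only specifies a rotation and translation fit to the backbone atoms of each residue).

```python
import numpy as np

def kabsch(P, Q):
    """Rigid transform (R, t) minimizing ||P @ R.T + t - Q|| for paired
    point sets P, Q of shape (n, 3), via the Kabsch algorithm."""
    Pc, Qc = P - P.mean(0), Q - Q.mean(0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T    # proper rotation (det = +1)
    t = Q.mean(0) - P.mean(0) @ R.T
    return R, t

def align_residue(native_bb, decoy_bb, native_sc):
    """Fit one residue's native backbone atoms onto the decoy backbone,
    then move the native side-chain atoms with the same rigid transform
    to produce the training ground truth."""
    R, t = kabsch(native_bb, decoy_bb)
    return native_sc @ R.T + t
```

Applying this per residue yields side-chain targets expressed in the decoy's backbone frame, after which the loss of Section S3.1 can be used unchanged.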
The average RMSD results shown in Section S9.2 were calculated by aligning the native and decoy residue coordinates as described in the previous paragraph. Substituting non-native backbone coordinates with 25% probability yields the best results in terms of average RMSD, even better than those of AlphaFold2. Both models trained partially with non-native backbone data outperform the baseline TFN+Tri model for side-chain RMSD minimization. In terms of side-chain dihedral prediction accuracy, AlphaFold2 achieves the best results, with our methods achieving competitive accuracy for all but surface residues. Similar to the results on native backbones, our models are competitive in terms of dihedral MAE for χ1 and χ2, but lose their edge for higher-order χ values.