## Abstract

Designing proteins to achieve specific functions often requires *in silico* modeling of their properties at high-throughput scale and can significantly benefit from fast and accurate protein structure prediction. We introduce EquiFold, a new end-to-end differentiable, SE(3)-equivariant, all-atom protein structure prediction model. EquiFold uses a novel coarse-grained representation of protein structures that does not require multiple sequence alignments or protein language model embeddings, inputs that are commonly used in other state-of-the-art structure prediction models. Our method relies on geometrical structure representation and is substantially smaller than prior state-of-the-art models. In preliminary studies, EquiFold achieved comparable accuracy to AlphaFold but was orders of magnitude faster. The combination of high speed and accuracy makes EquiFold suitable for a number of downstream tasks, including protein property prediction and design.

## 1 Introduction

Recent studies using deep neural networks to predict protein structure [1, 2, 3, 4, 5] have accelerated the development of structure-based methods for protein property prediction and design. However, some models tend to predict one or few conformations that may not be optimal for properties of interest, such as ligand-binding pockets [6] and protein-protein interfaces [7, 8, 9]. Moreover, these models require either multiple sequence alignments (MSA), protein language model embeddings, or other statistical information distilled from large sequence databases. While complex inputs such as embeddings and MSAs have proven useful for structure prediction, they increase the complexity of properly testing these methods, require significant time to derive, and often scale poorly with respect to sequence library size, and thus may overall limit downstream use. Conversely, the physics of protein dynamics and interactions do not depend directly on these inputs.

Here, we introduce EquiFold, a novel representation of protein structures and an end-to-end differentiable, SE(3)-equivariant neural network that predicts a protein structure given its primary sequence via iterative refinement. EquiFold makes atomically accurate predictions for *de novo* designed mini-proteins with *in silico* predicted structures [10] and on experimental antibody structures from the Protein Data Bank [11]. We focus on this set because small structures (e.g., designed mini-proteins) and flexible loops (e.g., the CDR-H3 loop of IgG-based antibodies) have proven to be as difficult to predict as larger structures in recent tests of state-of-the-art methods [12, 13, 14]. The model relies solely on geometrical structure representation and can readily incorporate various energy functions as physical priors, which we hypothesize to be instrumental in exploring conformational landscapes towards predicting properties of interest.

Preprint. Under review.

## 2 Related Work

AlphaFold [1] and similar models [2, 3, 4] have been successful at protein structure prediction by representing the geometry of the chain of backbone atoms as a set of nodes paired with Euclidean transformations and using an iterative refinement procedure that updates the transformations at each block of the structure module. In these models, side-chain geometries are implicitly modeled until they are predicted as a series of torsion angles by the module’s final block. However, such implicit modeling of side-chains may make it difficult to model their atoms’ placements and interactions in 3D space. For example, to avoid steric clashes, such models must learn from data complex distributions in the high-dimensional space of torsion angles that span multiple residues. In contrast, models that represent side-chain degrees of freedom explicitly in 3D would likely need to learn substantially simpler distributions to achieve the same goal.

Coarse-grained structure representations of proteins [15, 16, 17] typically model each residue by one or few nodes with associated positions determined by its atoms’ coordinates. These strategies can increase computational efficiency for predictive tasks, such as interacting residue predictions [18] and functional residue prediction [19], and are appropriate for certain generative modeling tasks, such as generating backbone scaffolds [20, 21]. However, these approaches have previously sacrificed all-atom structure resolution needed for design and packing-related tasks, along with much information useful for predicting protein functions.

To address these limitations, we develop a novel coarse-grained representation that retains all-atom structure resolution. In this representation, side-chain degrees of freedom are modeled explicitly in 3D space rather than intrinsically through torsion angles, which we conjecture makes it easier to model geometry and interactions in 3D space.

SE(3)-equivariant neural networks have been used in various 3D object modeling tasks, including atomic potential prediction [22, 23, 24], molecular property prediction [25, 26, 27, 28, 29], protein structure prediction [2], and docking [30, 31]. They incorporate the symmetry of 3D space and are substantially more data efficient than their non-symmetry-aware counterparts [32, 33]. Input to SE(3)-equivariant networks consists of geometric tensors, or the irreducible representations of the SO(3) group (the group of rotations in $\mathbb{R}^3$), of various degrees $l$, such as scalars ($l = 0$), vectors ($l = 1$), and higher degree ($l \geq 2$) tensors that transform under a rotation $R$ through multiplication with corresponding Wigner D-matrices $D_l(R)$. Objects with an associated rotation, such as backbone frames [1], can be initially embedded with geometric tensors [15]. We use an SE(3)-equivariant model adapted from Equiformer [29] along with an initial embedding of coarse-grained nodes with geometric tensors.

## 3 Methods

### 3.1 A coarse-grained representation

Given a protein sequence $a = (a_1, \ldots, a_N)$ of length $N$, where $a_i$ is one of the 20 canonical amino acids, its structure is specified by the set of 3D coordinates of its atoms grouped by amino acid, where $n_{a_i}$ is the number of atoms in residue $a_i$. In a CG representation, each amino acid $a_i$ is represented by a predetermined set of CG *nodes*, where each CG node represents a subset of the amino acid’s constituent atoms. The CG nodes are chosen such that 1) their union represents all atoms of the amino acid; 2) each member atom of a node shares at least one covalent bond with another member of the same node; and 3) each node consists of at least three atoms that collectively form a rigid body whose orientation in 3D is uniquely determined. Based on the last property, we define a *forward* CG mapping of the 3D coordinates of an amino acid $a_i$’s atoms into its CG representation, a set of tuples in which each tuple consists of a CG node identity and a corresponding Euclidean transformation that maps the predefined template coordinates for the atoms in that node to the corresponding input atom coordinates (see Appendix A.1). Table 3 illustrates an example of a CG scheme used in this work that respects the above properties while minimizing the redundancy of atom representations across CG nodes. Figure 1 gives illustrations of this CG scheme for select amino acids.

We define a *reverse* CG mapping for an amino acid $a_i$ that maps its CG representation to the 3D atom coordinates as follows: for each atom in the amino acid, we average the 3D coordinates assigned to that atom by each of the CG nodes that contain it.
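The reverse mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the node names, atom names, and template coordinates below are hypothetical stand-ins and do not come from Table 3.

```python
from collections import defaultdict

def apply_transform(R, t, x):
    """Apply the Euclidean transformation (R, t) to a 3D point x, i.e. R x + t."""
    return tuple(sum(R[i][k] * x[k] for k in range(3)) + t[i] for i in range(3))

def reverse_cg_mapping(nodes, templates):
    """Average, per atom, the positions proposed by every CG node that contains it.

    nodes:     list of (node_type, R, t) tuples for one amino acid
    templates: dict mapping node_type -> {atom_name: template (x, y, z)}
    """
    proposals = defaultdict(list)
    for node_type, R, t in nodes:
        for atom, x in templates[node_type].items():
            proposals[atom].append(apply_transform(R, t, x))
    return {
        atom: tuple(sum(p[i] for p in points) / len(points) for i in range(3))
        for atom, points in proposals.items()
    }

# Toy example (hypothetical nodes): both nodes contain the atom "CA".
IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
templates = {
    "node_a": {"N": (-1.0, 0.0, 0.0), "CA": (0.0, 0.0, 0.0), "C": (1.0, 0.0, 0.0)},
    "node_b": {"CA": (0.0, 0.0, 0.0), "CB": (0.0, 1.0, 0.0), "CG": (0.0, 2.0, 0.0)},
}
nodes = [
    ("node_a", IDENTITY, (0.0, 0.0, 0.0)),
    ("node_b", IDENTITY, (0.2, 0.0, 0.0)),  # slightly shifted relative to node_a
]
coords = reverse_cg_mapping(nodes, templates)
```

In this toy case the shared atom ends up halfway between the two nodes' proposals, while atoms that belong to a single node are placed by that node alone.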

### 3.2 Geometric tensor features and initial embedding

To each CG node, we assign a set of geometric tensor features of degree $l = 0, \ldots, l_{max}$ with $n_c$ channels per degree [25, 28, 29]. As there are $2l + 1$ features associated with an $l$-degree tensor, there are $n_c \times (l_{max} + 1)^2$ features in total per node. We define the initial embedding of each node from its type in $C$, the pre-determined set of the CG node types, and its rotation: LookUp, a typical embedding function, maps the node type to a feature vector, which is then multiplied by the direct sum of the Wigner D-matrices $D_l$ corresponding to the tensor features of various degrees $l$, evaluated at the node’s rotation.
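As a quick check of the feature count, and as a sketch of what the initial embedding looks like in the simplest case, the Python snippet below restricts itself to $l_{max} = 1$. It relies on the standard fact that, in a Cartesian basis, the $l = 0$ Wigner D-matrix is the scalar 1 and the $l = 1$ Wigner D-matrix is the rotation matrix itself. The look-up table contents and the node type name are hypothetical.

```python
def features_per_node(n_c, l_max):
    """Sum n_c channels over degrees l = 0..l_max, each with 2l + 1 components."""
    return sum(n_c * (2 * l + 1) for l in range(l_max + 1))

def rotate(R, v):
    """Multiply a 3x3 rotation matrix by a 3-vector."""
    return tuple(sum(R[i][k] * v[k] for k in range(3)) for i in range(3))

def initial_embedding(lookup, node_type, R):
    """l_max = 1 case: scalars (l = 0) are unchanged, vectors (l = 1) rotate with R."""
    scalars, vectors = lookup[node_type]
    return list(scalars), [rotate(R, v) for v in vectors]

# The closed form n_c * (l_max + 1)^2 agrees with the explicit sum.
counts = {(n_c, l_max): features_per_node(n_c, l_max)
          for n_c, l_max in [(1, 0), (8, 1), (32, 2)]}

# Hypothetical look-up table with n_c = 2 channels per degree.
lookup = {"node_a": ([0.5, -1.0], [(1.0, 0.0, 0.0), (0.0, 0.0, 2.0)])}
ROT_Z_90 = ((0.0, -1.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0))
scalars, vectors = initial_embedding(lookup, "node_a", ROT_Z_90)
```

For higher degrees the same idea applies, with the rotation matrix replaced by the corresponding Wigner D-matrix for each degree.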

### 3.3 Structure prediction via iterative refinement using an SE(3)-equivariant neural network

Given an input sequence, its CG representation is instantiated with Euclidean transformations whose translations and rotations are sampled from a normal distribution with zero mean and unit variance and a uniform distribution over the SO(3) group, respectively. The output structure is predicted via iterative refinement using an SE(3)-equivariant model that consists of $N_{blocks}$ blocks sharing the same architecture, where each block is composed of $N_{sub}$ equivariant sub-blocks. The block architecture is adapted from Equiformer [29], as detailed in Appendix A.2.

Each block takes as input either the initial CG representation or the output of the previous block. A block outputs two $l = 1$ tensors for each node, one of which is used as the vector part of a non-unit quaternion to compute an update $R'$ to the node’s rotation $R_{in}$ and the other to compute an update to its translation [1] (we drop amino acid and CG node indices for clarity). The input Euclidean transformation is then updated with these rotation and translation updates to give the block’s output transformation.

A block can optionally transform or simply copy the input embedding of each node, but either way, it is multiplied by the direct sum of the Wigner D-matrices corresponding to the update rotation *R′*. The proof of equivariance of these update rules is provided in Appendix A.3.
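The initialization and the per-block update can be illustrated with a small numerical sketch. The explicit update rule used here, $R_{out} = R' R_{in}$ and $t_{out} = t_{in} + t'$, is an assumption chosen to be consistent with the description above (the embedding is multiplied by the Wigner D-matrices of $R'$, and the block outputs are equivariant $l = 1$ tensors); the scalar part of the quaternion is fixed to 1 following the convention in [1]. The function names are hypothetical. The script checks numerically that updating and then applying a global transformation gives the same result as applying the global transformation first.

```python
import math
import random

def matmul(A, B):
    """Product of two 3x3 matrices stored as nested tuples."""
    return tuple(
        tuple(sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3))
        for i in range(3)
    )

def matvec(A, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return tuple(sum(A[i][k] * v[k] for k in range(3)) for i in range(3))

def add(u, v):
    return tuple(a + b for a, b in zip(u, v))

def quaternion_to_matrix(q):
    """Rotation matrix of a quaternion (w, x, y, z); q is normalized first."""
    norm = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / norm for c in q)
    return (
        (1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)),
        (2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)),
        (2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)),
    )

def random_transform(rng):
    """Rotation uniform over SO(3) (normalized 4D Gaussian), translation ~ N(0, I)."""
    R = quaternion_to_matrix([rng.gauss(0.0, 1.0) for _ in range(4)])
    t = tuple(rng.gauss(0.0, 1.0) for _ in range(3))
    return R, t

def update(R_in, t_in, v_rot, v_trans):
    """Assumed update rule: R_out = R' R_in and t_out = t_in + t'."""
    R_upd = quaternion_to_matrix((1.0, *v_rot))  # non-unit quaternion (1, v_rot)
    return matmul(R_upd, R_in), add(t_in, v_trans)

def move(R_g, t_g, R, t):
    """Apply a global transformation (R_g, t_g) to a node transformation (R, t)."""
    return matmul(R_g, R), add(matvec(R_g, t), t_g)

rng = random.Random(0)
R_in, t_in = random_transform(rng)   # initial node transformation
R_g, t_g = random_transform(rng)     # arbitrary global transformation
v_rot, v_trans = (0.1, -0.2, 0.3), (0.5, 0.0, -1.0)  # stand-ins for block outputs

# Path 1: update the node, then move the whole structure.
R_1, t_1 = move(R_g, t_g, *update(R_in, t_in, v_rot, v_trans))

# Path 2: move the input first; equivariant l = 1 outputs rotate with R_g.
R_2, t_2 = update(*move(R_g, t_g, R_in, t_in),
                  matvec(R_g, v_rot), matvec(R_g, v_trans))

max_diff = max(
    max(abs(a - b) for ra, rb in zip(R_1, R_2) for a, b in zip(ra, rb)),
    max(abs(a - b) for a, b in zip(t_1, t_2)),
)
```

Under these assumptions `max_diff` is at the level of floating-point round-off, which is the numerical counterpart of the equivariance property proved in Appendix A.3.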

### 3.4 Loss functions

To train the model, the Frame Aligned Point Error (FAPE) loss and the structure violation loss introduced in [1] are computed based on the CG node Euclidean transformations and the reverse CG mapped structures output from each block, respectively. Unlike in [1], all atom FAPE loss is computed for every block. We observed that the structure violation loss is important, since EquiFold is more susceptible to predicting non-physical bond lengths and angles and non-bonded atom distances, due to its modeling of CG nodes in extrinsic 3D space, compared to other models that predict torsion angles representing internal degrees of freedom [1, 2, 3]. However, we do not observe strong instabilities in EquiFold training dynamics like those reported in [1], when the loss is used from the beginning of training. More details on the losses, including their relative weights, other hyper-parameter choices, and training strategies are found in Appendix A.4.

## 4 Results

We report EquiFold’s performance on two structure datasets we curated to focus on structures that present key challenges to protein structure prediction methods: designed mini-proteins and antibody loops. In each challenge, we are primarily predicting structures that are small, have high error in recent blind benchmarks, and have significant potential applications in biotechnology and protein design. These sets also comprise regions that elude homology detection and the traditional concepts that underpin MSAs (as do, by construction, *de novo* designed sequences). EquiFold achieved high accuracy and speed over these sets, demonstrating that it can enable new downstream protein design and engineering.

### 4.1 *De novo* designed mini-proteins

We first tested EquiFold on a set of *de novo* designed mini-proteins from [10], each having one of four different folds (*ααα, αββα, βαββ,* and *ββαββ*), with associated *in silico* structures predicted using Rosetta [34]. After filtering for sequences with stability score greater than 1, we retained 2,842 sequences whose lengths range from 43 to 50. We randomly split the sequences into train, validation, and test sets of size 2,742, 50, and 50. Test sequences have nearest training sequence similarity ranging from 44% to 80% with mean of 67.2%.^{1} Table 1 shows all-atom and $C_\alpha$ RMSD based on the test set broken down by fold. Average inference speed over the entire test set was 0.03 seconds per sequence, with the model containing 2.3M trainable parameters (see Table 4). Figure 2 shows test example predictions overlaid with ground truth structures. Rather than serving as a benchmark relative to other structure prediction models, since the ground truth here are *in silico* predicted structures, this result illustrates the ability of EquiFold to learn a variety of distinct protein topologies with all-atom accuracy at atomic resolution.

### 4.2 Antibodies

We obtained all antibody (Ab) experimental structures from the PDB [11] that were listed in The Structural Antibody Database (SAbDab) [35], as accessed on January 12, 2022. We processed this dataset to obtain the variable fragment portions of the structures and annotated the sequences with Chothia numbering using ANARCI [36]. We obtained 6,789 structures with a resolution better than 4 Å that were deposited before July 1, 2021 as the training set, of which 50 structures were used for validation. We used the same test set structures as in [13], which have a resolution better than 3 Å and were deposited after the aforementioned date. Compared to other models, EquiFold achieves similar or better accuracy in backbone atom RMSD across different sequence regions (see Table 2) and results in an all-atom RMSD of 1.52 Å averaged over the test set. Importantly, the model has fast inference speeds of approximately 1 second per Ab on average on a single A100 GPU for predicting all-atom structures and contains 7.38M trainable parameters (see Table 4), compared to 559.6M for IgFold (including 558M for the antibody language model AntiBERTy) [13] and 93.2M for AlphaFold [1], which additionally requires time-consuming input preparation steps. Given its high speed and accuracy, it is practical to predict structures at high-throughput scale for millions of antibodies observed in deep sequencing data sets [37] and integrate the model in a design workflow [16, 21]. Figure 3 shows selected test set predictions overlaid with ground truth structures.

## 5 Conclusion and Future Work

We introduced EquiFold, a new end-to-end differentiable protein structure prediction model that uses a novel coarse-grained (CG) representation of proteins. The model achieves high test set accuracy on two datasets, while running substantially faster than prior models. Notably, it is trained on significantly less data and does not rely on multiple sequence alignments [1, 2] or protein language model embeddings [3, 4, 13]. Its accuracy and speed make it practical to integrate EquiFold as a sub-component of a larger model. For instance, EquiFold can be combined in an end-to-end differentiable fashion with another neural network that predicts various molecular properties based on the structure output by its last block.

We leave to future work training EquiFold on more general classes of proteins from the PDB [11] and examining its generalizability to novel folds unseen in training data. Scaling to larger proteins will require addressing the quadratic complexity of the message passing layers, possibly using strategies similar to those in earlier works [1]. To integrate physical priors, we will extend the CG representation to include hydrogens and implement various energy functions such as the Rosetta all-atom energy function [34]. With such utilities, EquiFold can be adapted to generate conformational ensembles within an energy band, perform flexible docking, and fit structural models to experimental data such as that from cryogenic electron microscopy (cryo-EM) and hydrogen-deuterium exchange (HDX). Lastly, the CG representation could be used in generative modeling of protein sequence and structure together, rather than modeling them sequentially as done in recent works [16, 20, 21].

## A Appendix

### A.1 Computing coarse-grained node template coordinates and ground-truth Euclidean transformations

For each coarse-grained (CG) node defined in Table 3, we compute the template coordinates of all the atoms comprising the node as follows. Given a protein structure dataset, we compute the Euclidean transformation $T_q$ corresponding to each CG node $q$ in the dataset using Algorithm 21 of [1] with the 3D coordinates of the first three atoms of the CG node $q$’s atom group as input. Next, we apply the inverse Euclidean transformation $T_q^{-1}$ to the 3D coordinates of all the atoms in the node $q$ to map them into the corresponding local frame. Lastly, we average the observed coordinates in the local frames across all instances in the dataset grouped by CG node type defined in Table 3.
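The template computation can be sketched as follows. The frame construction is a Gram-Schmidt procedure in the spirit of Algorithm 21 of [1]; which of the three atoms serves as the origin and which defines the first axis is an assumption here, and the toy coordinates are made up for illustration.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(a):
    n = math.sqrt(dot(a, a))
    return tuple(x / n for x in a)

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def frame_from_three_points(x1, x2, x3):
    """Orthonormal axes (e1, e2, e3) and origin built from three atoms.

    Assumed convention: origin at x2, first axis along x3 - x2.
    """
    e1 = unit(sub(x3, x2))
    v2 = sub(x1, x2)
    e2 = unit(sub(v2, tuple(dot(e1, v2) * c for c in e1)))  # remove e1 component
    e3 = cross(e1, e2)
    return (e1, e2, e3), x2

def to_local(axes, origin, x):
    """Inverse transformation: coordinates of x expressed in the local frame."""
    d = sub(x, origin)
    return tuple(dot(e, d) for e in axes)

def template_coordinates(instances):
    """Average local-frame coordinates over all instances of one CG node type."""
    local = []
    for atoms in instances:
        axes, origin = frame_from_three_points(*atoms[:3])
        local.append([to_local(axes, origin, x) for x in atoms])
    return [
        tuple(sum(inst[k][i] for inst in local) / len(local) for i in range(3))
        for k in range(len(local[0]))
    ]

# Two copies of the same rigid four-atom group in different poses.
instance_a = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
instance_b = [(5.0, 6.0, 5.0), (5.0, 5.0, 5.0), (4.0, 5.0, 5.0), (5.0, 5.0, 6.0)]
template = template_coordinates([instance_a, instance_b])
```

Because the two instances differ only by a rigid motion, their local-frame coordinates coincide and the average reproduces them; with real data the average smooths over small geometric variation between instances.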

To compute the ground truth Euclidean transformation for a given CG node to be used in the FAPE loss [1], we use the Kabsch algorithm [38] to determine the transformation from the template coordinates to the observed coordinates that minimizes the root mean square error (RMSE) of the node’s constituent atoms. To elaborate, for the $j$-th CG node containing $M$ atoms of the $i$-th amino acid $a_i$, we apply the Kabsch algorithm to the template coordinates $W$ and the corresponding input coordinates $X$. The algorithm factorizes the covariance matrix of $W_c$ and $X_c$, the mean-centered template and input coordinates, into singular vectors and singular values using singular value decomposition. The resulting rotation is assembled from the singular vectors, and the resulting translation from the rotation and the mean coordinates of $W$ and $X$.
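A compact NumPy sketch of the Kabsch step is given below. It follows the textbook form of the algorithm, including the usual determinant correction that rules out reflections; the function name, array layout (one atom per row), and example coordinates are assumptions made for illustration.

```python
import numpy as np

def kabsch(W, X):
    """Rigid transformation (R, t) minimizing the RMSE between R w + t and x.

    W: (M, 3) template coordinates; X: (M, 3) observed coordinates.
    """
    w_mean, x_mean = W.mean(axis=0), X.mean(axis=0)
    W_c, X_c = W - w_mean, X - x_mean          # mean-centered coordinates
    H = W_c.T @ X_c                            # 3 x 3 covariance matrix
    U, _, Vt = np.linalg.svd(H)                # H = U diag(S) Vt
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # -1 if the best fit is a reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = x_mean - R @ w_mean
    return R, t

# Toy check: recover a known rotation about z and a known translation.
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([1.0, -2.0, 0.5])
W = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 1.2, 0.0], [0.3, 0.4, 1.1]])
X = W @ R_true.T + t_true
R_est, t_est = kabsch(W, X)
```

With noise-free input the recovered transformation matches the one used to generate `X`; with real structures it is the least-squares fit of the template to the observed atoms.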

### A.2 SE(3)-equivariant neural network architecture details

Each block of the neural network consists of $N_{sub}$ sub-blocks that share the same architecture, where each sub-block is an adapted version of the Equiformer’s “Transformer block” [see 29, Figure 1]. As the general theory of SE(3)-equivariant neural networks and the implementation of the Transformer block are well-described in [29], here we describe only the modifications in our adapted version.

Given the input set of coarse-grained (CG) nodes and their Euclidean transformations, the block initially computes pairwise distances $r_{ij}$ and normalized distance vectors, where $i$ and $j$ index CG nodes. Each distance $r_{ij}$ is projected onto $d_{bessel}$ radial Bessel basis functions with learnable weights [22] and a cutoff distance $r_c$, which are used in radial functions that parameterize tensor products in the Equivariant Graph Attention module. Instead of a polynomial envelope function used in [22], we apply `e3nn.math.soft_unit_step` from [39] with $10(1 - r_{ij}/r_c)$ as input, so that the envelope vanishes at and beyond the cutoff distance. The normalized distance vectors are used to compute spherical harmonics input to tensor products. When training the network, gradients do not propagate through $r_{ij}$ and the normalized distance vectors.
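The radial featurization can be sketched in plain Python. The smooth step below has the same functional form as `e3nn.math.soft_unit_step` ($e^{-1/x}$ for $x > 0$, zero otherwise). The Bessel prefactor $\sqrt{2/r_c}$, the initial frequencies $n\pi$, the choice to multiply the basis by the envelope, and the example values $d_{bessel} = 8$ and $r_c = 12$ are assumptions for illustration; in the model the frequencies would be learnable.

```python
import math

def soft_unit_step(x):
    """Smooth step: exp(-1 / x) for x > 0, and 0 otherwise."""
    return math.exp(-1.0 / x) if x > 0.0 else 0.0

def bessel_basis(r, r_c, frequencies):
    """Radial Bessel functions sqrt(2 / r_c) * sin(w * r / r_c) / r, one per frequency w."""
    prefactor = math.sqrt(2.0 / r_c)
    return [prefactor * math.sin(w * r / r_c) / r for w in frequencies]

def radial_features(r, r_c, frequencies):
    """Bessel basis values damped by the soft cutoff envelope."""
    envelope = soft_unit_step(10.0 * (1.0 - r / r_c))
    return [envelope * b for b in bessel_basis(r, r_c, frequencies)]

d_bessel, r_c = 8, 12.0                       # hypothetical example values
frequencies = [n * math.pi for n in range(1, d_bessel + 1)]
inside = radial_features(4.0, r_c, frequencies)      # a pair well within the cutoff
at_cutoff = radial_features(r_c, r_c, frequencies)   # a pair exactly at the cutoff
```

The envelope is infinitely differentiable, so the features and all their derivatives go to zero smoothly as a pair of nodes approaches the cutoff distance.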

In our adapted version of the Equivariant Graph Attention module, after the application of the initial linear layers to input node embeddings, instead of element-wise summation, channel-wise fully connected tensor products are applied to the embeddings of every CG-node pair $ij$, which is followed by another linear layer to produce output tensors $x_{ij}$ with the same number of channels as the input. Next, Depth-wise Tensor Product (DTP) is applied to $x_{ij}$ and the spherical harmonics of the normalized distance vector with a radial function that takes as input the Bessel basis mentioned above and additionally a scalar *edge* embedding vector corresponding to the primary amino acid sequence distance for the CG-pair $ij$, clamped at a maximum absolute distance of 32; for input proteins with multiple chains, sequence distances across chains are set at the maximum distance. The edge embedding is implemented via a simple look-up table with learnable weights and has the same dimension as the number of channels in input tensors. The output of the DTP layer is uniformly shuffled and grouped by $N_{head}$ attention heads and a linear layer is applied to produce tensors of various degrees with appropriate channel numbers for the remainder of the module.

The outputs of each sub-block except for the last one are updated node embeddings corresponding to the input CG nodes. The last sub-block outputs only two $l = 1$ vectors per CG node as mentioned in Section 3.3. Edge embeddings are shared across sub-blocks of a given block.
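The sequence-distance input to the edge embedding described above can be illustrated with a short sketch. It assumes that the distance is a signed residue-index offset, so that the look-up table has $2 \times 32 + 1$ rows, and that cross-chain pairs are assigned the positive maximum; both choices, along with the function name and chain labels, are assumptions.

```python
MAX_DIST = 32

def edge_index(res_i, res_j, chain_i, chain_j, max_dist=MAX_DIST):
    """Row of the edge look-up table for the CG-node pair (i, j).

    Assumes signed offsets in [-max_dist, max_dist], shifted to start at 0.
    """
    if chain_i != chain_j:
        offset = max_dist                 # pairs across chains: maximum distance
    else:
        offset = max(-max_dist, min(max_dist, res_j - res_i))
    return offset + max_dist

num_rows = 2 * MAX_DIST + 1               # 65 rows in the look-up table
examples = {
    "near": edge_index(10, 12, "H", "H"),
    "far": edge_index(0, 100, "H", "H"),
    "far_reversed": edge_index(100, 0, "H", "H"),
    "cross_chain": edge_index(5, 6, "H", "L"),
}
```

All CG nodes of the same residue share a residue index, so pairs of nodes within one residue map to the zero-offset row under this scheme.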

### A.3 Proof of equivariance of the Euclidean transformation update

Under a global Euclidean transformation, the Euclidean transformation corresponding to a CG node, which specifies the mapping between the template and observed coordinates, transforms by left composition with the global transformation, and the update transformation $T'$ output by a block transforms according to the equivariance of the two $l = 1$ tensors from which it is computed.

We show that the update rules given in Section 3.3 are equivariant by applying a global transformation to the input transformation and the update transformation $T'$ and showing that their composition is equivalent to the global transformation of the output.
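One way to write the argument out explicitly is sketched below. The notation $T_g = (R_g, t_g)$ for the global transformation and the explicit update rules $R_{out} = R' R_{in}$, $t_{out} = t_{in} + t'$ are assumptions consistent with the description in Section 3.3, not a transcription of the original equations. The transformation law for $R'$ follows because the vector part of the quaternion is an $l = 1$ tensor: rotating it by $R_g$ rotates the axis of $R'$ and leaves its angle unchanged.

```latex
\begin{aligned}
T_{in} = (R_{in}, t_{in}) &\;\mapsto\; T_g \circ T_{in} = (R_g R_{in},\; R_g t_{in} + t_g), \\
T' = (R', t') &\;\mapsto\; (R_g R' R_g^{-1},\; R_g t'), \\
R_{out} = R' R_{in} &\;\mapsto\; (R_g R' R_g^{-1})(R_g R_{in}) = R_g R' R_{in} = R_g R_{out}, \\
t_{out} = t_{in} + t' &\;\mapsto\; (R_g t_{in} + t_g) + R_g t' = R_g (t_{in} + t') + t_g = R_g t_{out} + t_g .
\end{aligned}
```

The last two lines state that the updated transformation computed from globally transformed inputs equals $T_g \circ T_{out}$, which is the equivariance property.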

### A.4 Hyper-parameters and training details

Table 4 provides the hyper-parameter values used for the two experiments described in Section 4. Both models were trained using the Adam optimizer [40] with PyTorch [41] default parameters, including `betas=(0.9, 0.999)` and the initial learning rate of 10^{-3}. We used a warm-up phase of 10,000 steps, where the learning rate was linearly increased from 0 to the final value and input structures to the model were linearly interpolated between the ground truth structures and random initializations; for rotations, we used quaternion spherical linear interpolation [20].
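The two ingredients of the warm-up phase can be sketched as follows: a linear learning-rate ramp over 10,000 steps and quaternion spherical linear interpolation (slerp). The function names are hypothetical, and the sketch deliberately does not fix how the interpolation weight between ground truth and random initialization is scheduled over the warm-up, since that detail is not specified above.

```python
import math

BASE_LR = 1e-3
WARMUP_STEPS = 10_000

def learning_rate(step):
    """Linear warm-up from 0 to BASE_LR, constant afterwards."""
    return BASE_LR * min(1.0, step / WARMUP_STEPS)

def slerp(q0, q1, s):
    """Spherical linear interpolation between unit quaternions for s in [0, 1]."""
    cos_theta = sum(a * b for a, b in zip(q0, q1))
    if cos_theta < 0.0:                   # q and -q are the same rotation: use the shorter arc
        q1, cos_theta = tuple(-c for c in q1), -cos_theta
    if cos_theta > 0.9995:                # nearly parallel: linear interpolation is stable
        q = tuple(a + s * (b - a) for a, b in zip(q0, q1))
    else:
        theta = math.acos(cos_theta)
        w0 = math.sin((1.0 - s) * theta) / math.sin(theta)
        w1 = math.sin(s * theta) / math.sin(theta)
        q = tuple(w0 * a + w1 * b for a, b in zip(q0, q1))
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)

# Halfway between the identity and a 90-degree rotation about z is a 45-degree rotation.
identity = (1.0, 0.0, 0.0, 0.0)
rot_z_90 = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
halfway = slerp(identity, rot_z_90, 0.5)
```

Translations can be interpolated component-wise with the same weight `s`, while slerp keeps the interpolated rotation on the unit-quaternion sphere.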

Both models were trained with the FAPE and structure violation losses computed on the output structure of each block with equal weights of 1. A mini-batch size of 8 was used with 8 A100 GPUs in PyTorch’s Distributed Data Parallel mode [41]. Model training for the mini-protein experiment was stopped after one day as the validation loss converged. The antibody model was trained for approximately 3.5 days before being trained for an additional 7 days at a reduced learning rate of 2 × 10^{-4}.

To circumvent the large memory requirement originating from the quadratic computational complexity of attention layers, we computed gradient updates of the learnable weights of the models on a block-by-block basis, stopping gradient propagation through rotations and translations after each block. Early experiments on mini-proteins showed that this model weight update algorithm did not substantially affect training dynamics. Similarly, we did not observe a significant benefit to using tensors of degrees higher than 1. We leave to future work a more careful benchmarking of different model weight update schemes and other hyper-parameters.

## Footnotes

^{1} Sequence similarity is defined as $(l_{query} - n_{edit})/l_{query}$, where $l_{query}$ is the length of the query sequence and $n_{edit}$ the Levenshtein edit distance between query and target sequences.
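The similarity measure in this footnote can be computed with a standard dynamic-programming Levenshtein distance, as in the sketch below. The function names and example sequences are illustrative only.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

def sequence_similarity(query, target):
    """(l_query - n_edit) / l_query, as defined in the footnote."""
    return (len(query) - levenshtein(query, target)) / len(query)

def nearest_training_similarity(query, training_sequences):
    """Similarity of the query to its closest training sequence."""
    return max(sequence_similarity(query, t) for t in training_sequences)

# Made-up sequences for illustration.
train = ["ACDEFGHIKL", "MNPQRSTVWY"]
similarity = nearest_training_similarity("ACDEFGHIKV", train)
```

Note that this measure is not symmetric in query and target, because it is normalized by the query length.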