Abstract
In this work, we establish a framework to tackle the inverse protein design problem: the task of predicting a protein’s primary sequence given its backbone conformation. To this end, we develop a generative SE(3)-equivariant model which significantly improves upon existing autoregressive methods. Conditioned on backbone structure, and trained with our novel partial masking scheme and side-chain conformation loss, our model achieves state-of-the-art native sequence recovery on the structurally independent CASP13, CASP14, CATH4.2, and TS50 test sets. Beyond accurately recovering native sequences, we demonstrate that our model captures functional aspects of the underlying protein by accurately predicting the effects of point mutations on Deep Mutational Scanning datasets. We further verify the efficacy of our approach through comparison with recently proposed inverse protein folding methods and through rigorous ablation studies.
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
* mmcpartlon{at}uchicago.edu
† blai{at}ttic.edu
‡ jinboxu{at}gmail.com
1 Results obtained for DenseCPD have residue probabilities truncated to 3 significant digits. To compute perplexity, we replace probabilities of 0 with 10^-3. Consequently, the perplexities obtained for DenseCPD in Table 1 serve as a lower bound on the true perplexity.
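
The zero-replacement step in footnote 1 can be illustrated with a minimal sketch. It assumes the common definition of perplexity as the exponential of the mean negative log-probability assigned to the native residue at each position; the source does not spell out its exact formula, and the function name `perplexity_with_floor` and its parameters are hypothetical, chosen only for illustration.

```python
import math


def perplexity_with_floor(native_probs, floor=1e-3):
    """Perplexity over per-position probabilities of the native residue.

    Assumption: perplexity = exp(mean negative log-probability).
    Probabilities that are exactly 0 (e.g. after truncation) are replaced
    by `floor`, mirroring the substitution described in footnote 1.
    """
    if not native_probs:
        raise ValueError("native_probs must contain at least one value")
    # Replace only exact zeros; all other probabilities are left unchanged.
    adjusted = [floor if p == 0.0 else p for p in native_probs]
    mean_nll = -sum(math.log(p) for p in adjusted) / len(adjusted)
    return math.exp(mean_nll)


# Hypothetical example: the second probability was truncated to 0.
truncated = [0.5, 0.0]
# The same positions if the untruncated value had been 1e-5 (made-up value).
untruncated = [0.5, 1e-5]

floored_ppl = perplexity_with_floor(truncated)
true_ppl = perplexity_with_floor(untruncated)
```

Because a probability truncated to 0 was originally no larger than the substituted floor, the substitution can only raise the assumed likelihood, so `floored_ppl` does not exceed `true_ppl` in this example. This is the sense in which the reported value is a lower bound.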