Abstract
Computational protein design has the potential to deliver novel molecular structures that can function as binders or catalysts. Neural network models that use backbone coordinate-derived features perform exceptionally well on native sequence recovery tasks and can be applied to design new proteins. A statistical energy-based framework for modeling protein sequence landscapes using Tertiary Motifs (TERMs) has also demonstrated strong performance on protein design tasks. In this work, we combine the two methods to make neural structure-based models more suitable for protein design. Specifically, we supplement backbone-coordinate features with TERM-derived data as inputs, and we generate energy functions as outputs. We present two architectures that generate Potts models over the sequence space: TERMinator, which uses both TERM-based and coordinate-based information, and COORDinator, which uses only coordinate-based information. Using these two models, we demonstrate that TERMs can be utilized to improve the native sequence recovery performance of neural models. Furthermore, we demonstrate that sequences designed by TERMinator are predicted by AlphaFold to fold to their target structures. Finally, we show that both TERMinator and COORDinator learn notions of energetics, and that these methods can be fine-tuned on experimental data to improve predictions. Our results suggest that using TERM-based and coordinate-based features together may be beneficial for protein design, and that structure-based neural models producing Potts energy tables have utility for flexible applications in protein science.
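For context, a Potts model over the sequence space assigns each sequence an energy built from per-residue and residue-pair terms; in the standard formulation (generic notation, not necessarily the exact symbols used in this work), the energy of a sequence $S = (s_1, \dots, s_L)$ on a fixed backbone is:

```latex
E(S) = \sum_{i} E_i(s_i) + \sum_{i < j} E_{ij}(s_i, s_j)
```

Here $E_i(s_i)$ is the self-energy of placing amino acid $s_i$ at position $i$, and $E_{ij}(s_i, s_j)$ is the pair energy between positions $i$ and $j$; low-energy sequences under these tables are candidate designs for the target structure.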
Code
Code will be publicly released at a later date.
Competing Interest Statement
Gevorg Grigoryan is a co-founder, shareholder, and Chief Technology Officer at Generate Biomedicines, Inc. All other authors declare no competing interests.