Abstract
The combinatorial scale of amino-acid sequence-space has traditionally precluded substantive study of the full protein sequence-structure map. It remains unknown, for instance, how much of the vast uncharted landscape of far-from-natural sequences encodes the familiar ensemble of natural folds in a fashion consistent with the laws of biophysics but seemingly untouched by evolution on Earth. The scale of sequence perturbations required to access these spaces exceeds the reach of even gold-standard experimental approaches such as directed evolution. We overcome this limitation by exploiting the innate capacity of protein language models (pLMs) to explore sequences outside their natural training data through generation and self-feedback. We recast pLMs as probes that explore regions of protein “deep space” that possess little-to-no detectable homology to natural examples, while enforcing core structural constraints, in a novel sequence design approach that we term “foldtuning.” We build a library of foldtuned pLMs for >700 natural folds in the SCOP database, covering numerous high-priority targets for synthetic biology, including GPCRs and small GTPases, composable cell-surface-receptor and DNA-binding domains, and small signaling/regulatory domains. Candidate proteins generated by foldtuned pLMs reflect distinctive new “rules of language” for sequence innovation beyond detectable homology to any known protein and sample subtle structural alterations in a manner reminiscent of natural structural evolution and diversification. Experimental validation of two markedly different fold targets, the tyrosine-kinase- and small-GTPase-regulating SH3 domain and the bacterial RNase inhibitor barstar, demonstrates that foldtuning proposes protein variants that express and fold stably in vitro and function in vivo.
Foldtuning reveals protein sequence-structure information at scale outside the context of evolution and promises to push forward the redesign and reconstitution of novel-to-nature synthetic biological systems for applications in health and catalysis.
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
Main text, Figures 1-4, abstract, and methods updated to reflect additional experiments; author list updated to reflect relevant contributions.