Abstract
The advent of natural language interaction with machines has ushered in new innovations in text-guided generation of images, audio, video, and more. In this arena, we introduce the Biological Multi-Modal Model (BioM3), a novel framework for designing functional proteins via natural language prompts. This framework integrates natural language with protein design through a three-stage process: alignment of protein and text representations in a joint embedding space learned by contrastive learning, refinement of the text embeddings, and conditional generation of protein sequences via a discrete autoregressive diffusion model. BioM3 synthesizes protein sequences with detailed descriptions of protein structure, lineage, and function from text annotations to enable the conditional generation of novel sequences with desired attributes through natural language prompts. We present in silico validation of the model on subcellular localization prediction, reaction classification, remote homology detection, scaffold in-painting, and structural plausibility, together with in vivo and in vitro experimental tests of natural language prompt-designed synthetic analogs of Src-homology 3 (SH3) domain proteins that mediate signaling in the Sho1 osmotic stress response pathway in baker’s yeast. BioM3 achieves state-of-the-art performance in zero-shot prediction and homology detection tasks, and generates proteins with native-like tertiary folds and wild-type levels of experimentally assayed function.
Competing Interest Statement
N.P. is a co-author of US Provisional Patent Applications 63/314,898 and 63/669,836. R.R. is a co-founder of Evozyne, Inc. and a co-author of US Patent Application 17/642,582, US Provisional Patent Applications 62/900,420 and 63/669,836, and International Patent Application PCT/US2020/050466. A.L.F. is a co-founder of Evozyne, Inc. and a co-author of US Patent Applications 16/887,710 and 17/642,582, US Provisional Patent Applications 62/853,919, 62/900,420, 63/314,898, 63/479,378, 63/521,617, and 63/669,836, and International Patent Applications PCT/US2020/035206, PCT/US2020/050466, and PCT/US24/10805.
Footnotes
niksapraljak1{at}uchicago.edu; hughy{at}uchicago.edu; moorem1{at}uchicago.edu; socolich{at}uchicago.edu; ranganathanr{at}uchicago.edu; andrewferguson{at}uchicago.edu