Abstract
We introduce a method to reduce the cost of synthesizing proteins and other biological sequences designed by a generative model by as much as a trillion-fold. In particular, we make our generative models manufacturing-aware, such that model-designed sequences can be efficiently synthesized in the real world with extreme parallelism. We demonstrate the approach by training and synthesizing samples from generative models of antibodies, T cell antigens and DNA polymerases. For example, we train a manufacturing-aware generative model on 300 million observed human antibodies and synthesize ∼10¹⁷ generated designs from the model, achieving a sample quality comparable to a state-of-the-art protein language model, at a cost of 10³ dollars. Using previous methods, synthesis of a library of the same accuracy and size would cost roughly a quadrillion (10¹⁵) dollars.
Competing Interest Statement
All authors are current or previous employees and/or shareholders of JURA Bio. A provisional patent with ENW, MGG, AS, and EBW as authors has been filed.
Footnotes
The reference after "A detailed description of variational synthesis can be found" incorrectly pointed to [32]; it should have been [31]. This is now corrected.
1. In practice our training procedure removes duplicated samples, but since only 0.06% of the samples drawn from q(x|b = 1) appear more than once, the difference between the de-duplicated distribution and q(x|b = 1) is minimal.
2. Note that computing q(x) requires a small modification to the likelihood computation code provided with the ProGen2 model: only the likelihood in the forward (i.e., N-terminal to C-terminal) direction is needed, not the reverse direction, and the stop symbol must be included since it is part of the generative process.