Abstract
The design of biomolecules, in particular proteins, is witnessing a rapid change in available tooling and approaches, moving from design through physicochemical force fields to the fast generation of plausible, complex sequences via end-to-end differentiable statistical models. To achieve conditional and controllable protein design, researchers at the interface of artificial intelligence and biology leverage advances in natural language processing (NLP) and computer vision, coupled with advances in computing hardware, to learn patterns from growing biological databases, curated annotations thereof, or both. Once learned, these patterns can provide novel insights into mechanistic biology and guide the design of biomolecules. However, navigating the many recent protein design tools and understanding their practical applications is complex. To facilitate this, we 1) document advances in deep learning (DL) assisted protein design from the last three years, 2) present a practical pipeline that goes from de novo-generated sequences to their predicted properties and web-powered visualization within minutes, and 3) leverage it to suggest a generated protein sequence which might be used to engineer a biosynthetic gene cluster to produce a molecular glue-like compound. Lastly, we discuss challenges and highlight opportunities for the protein design field.
Availability
pLM-generated and UniRef50-sampled sequence sets and predictions are available at http://data.bioembeddings.com/public/design. The code base and notebooks for analysis are available at https://github.com/hefeda/PGP. An online version of Table 1 can be found at https://github.com/hefeda/design_tools.
Competing Interest Statement
CD was employed by VantAI and NVIDIA at different periods during the time of writing. MA, AG, and LN are employees of VantAI. NVIDIA and VantAI had no influence on the contents of this manuscript.
Footnotes
Abbreviations
- ADMM: Alternating Direction Method of Multipliers
- CNN: Convolutional Neural Network
- DL: Deep Learning
- FNN: Fully-connected Neural Network
- GAN: Generative Adversarial Network
- GCN: Graph Convolutional Network
- GNN: Graph Neural Network
- GO: Gene Ontology
- GVP: Geometric Vector Perceptron
- LSTM: Long Short-Term Memory
- MLP: Multilayer Perceptron
- MSA: Multiple Sequence Alignment
- NLP: Natural Language Processing
- NSR: Natural Sequence Recovery
- pLM: protein Language Model
- VAE: Variational Autoencoder