ABSTRACT
The mapping from protein sequence to function is highly complex, making it challenging to predict how sequence changes will affect a protein’s behavior and properties. We present a supervised deep learning framework to learn the sequence-function mapping from deep mutational scanning data and make predictions for new, uncharacterized sequence variants. We test multiple neural network architectures, including a graph convolutional network that incorporates protein structure, to explore how a network’s internal representation affects its ability to learn the sequence-function mapping. Our supervised learning approach outperforms physics-based and unsupervised prediction methods. We find that capturing nonlinear interactions and sharing parameters across sequence positions are important for learning the relationship between sequence and function. Further analysis of the trained models reveals the networks’ ability to learn biologically meaningful information about protein structure and mechanism. Finally, we demonstrate the models’ ability to navigate sequence space and design new proteins beyond the training set. We applied the GB1 models to design a sequence that binds to IgG with substantially higher affinity than wild-type GB1. Our software is available from https://github.com/gitter-lab/nn4dms.
Significance
Understanding the relationship between protein sequence and function is necessary to design new and useful proteins with applications in bioenergy, medicine, and agriculture. The mapping from sequence to function is tremendously complex because it involves thousands of molecular interactions that are coupled over multiple length scales and timescales. In this work, we show that neural networks can learn the sequence-function mapping from large protein datasets. Neural networks are appealing for this task because they can learn complicated relationships from data, make few assumptions about the nature of the sequence-function relationship, and can learn general rules that apply across the length of the protein sequence. We demonstrate that the learned models can be applied to design new proteins with properties that exceed those of natural sequences.
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
The manuscript has been updated to include additional data splits, test mutational and positional extrapolation, characterize the binding affinity of the A24Y and E19Q+A24Y GB1 mutants, assess the robustness of the random-restart hill climbing design algorithm, and clarify the text.