Abstract
The ability to perform de novo protein design will allow researchers to expand the pool and variety of available proteins. By designing synthetic structures computationally, they can utilise more structures than are available in the Protein Data Bank, design structures that are not found in nature, or direct the design of proteins to acquire a specific desired structure. While some researchers attempt to design proteins from first physical and thermodynamic principles, we tested whether de novo design of helical protein backbones can be performed statistically using machine learning, by building a model with a long short-term memory (LSTM) generative adversarial network (GAN) architecture. The LSTM-based GAN model used only the ϕ and ψ angles of each residue from an augmented dataset of only helical protein structures. Though the network’s generated backbone structures were not perfect, they were idealised and evaluated after generation, where the non-ideal structures were filtered out and the adequate structures kept. The results were successful in developing a logical, rigid, compact, helical protein backbone topology. This paper is a proof of concept showing that it is possible to generate a novel helical backbone topology with an LSTM-GAN architecture using only the ϕ and ψ angles as features. The next step is to use these backbone topologies and sequence design them to form complete protein structures.
Author summary
This research project stemmed from the desire to expand the pool of protein structures that can be used as scaffolds in computational vaccine design, since the number of structures available from the Protein Data Bank was not sufficient to allow for great diversity or to increase the probability of grafting a target motif onto a protein scaffold. A protein structure’s backbone can be defined by the ϕ and ψ angles of each amino acid in the polypeptide, which effectively translates a protein’s 3D structure into a table of numbers. Since protein structures are not random, this numerical representation can be used to train a neural network to mathematically generalise what a protein structure is, and then to use this generalisation to generate new protein structures. Instead of using all proteins in the Protein Data Bank, a curated dataset was used, encompassing protein structures with specific characteristics that will, theoretically, allow them to be easily evaluated computationally and chemically. This paper details how a trained neural network was able to successfully generate logical helical protein backbone structures.
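As an illustration of how a backbone becomes a table of numbers, the following minimal sketch computes ϕ and ψ from backbone atom coordinates. It is not the pipeline used in this work: the function names (`dihedral`, `backbone_torsions`) and the input layout (one dictionary per residue holding `N`, `CA`, and `C` coordinates) are assumptions made for the example, and the ω angle is ignored.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four 3D points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    # Unit vector along the central bond.
    norm = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Components of the outer bonds perpendicular to the central bond.
    s0 = _dot(b0, b1)
    s2 = _dot(b2, b1)
    v = (b0[0] - s0 * b1[0], b0[1] - s0 * b1[1], b0[2] - s0 * b1[2])
    w = (b2[0] - s2 * b1[0], b2[1] - s2 * b1[1], b2[2] - s2 * b1[2])
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))

def backbone_torsions(residues):
    """Return one (phi, psi) row per residue.

    `residues` is a hypothetical input format: a list of dicts with
    keys "N", "CA", "C", each mapping to an (x, y, z) tuple.
    phi is undefined for the first residue and psi for the last,
    so those entries are None.
    """
    table = []
    for i, res in enumerate(residues):
        phi = psi = None
        if i > 0:
            # phi: C(i-1), N(i), CA(i), C(i)
            phi = dihedral(residues[i - 1]["C"], res["N"], res["CA"], res["C"])
        if i < len(residues) - 1:
            # psi: N(i), CA(i), C(i), N(i+1)
            psi = dihedral(res["N"], res["CA"], res["C"], residues[i + 1]["N"])
        table.append((phi, psi))
    return table
```

Each row of the returned table corresponds to one residue, so a structure of n residues becomes an n × 2 array of angles, which is the kind of fixed-width numerical input a sequence model can consume.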
Footnotes
We added formulae, emphasised that we are designing helical protein backbones only, added more quantification to our results, added more results, improved our figures, and removed the sequence and folding simulation data.