State-of-the-art estimation of protein model accuracy using AlphaFold

James P. Roney; Sergey Ovchinnikov

doi:10.1101/2022.03.11.484043

Abstract

The problem of predicting a protein’s 3D structure from its primary amino acid sequence is a longstanding challenge in structural biology. Recently, approaches like AlphaFold have achieved remarkable performance on this task by combining deep learning techniques with coevolutionary data from multiple sequence alignments (MSAs) of related protein sequences. The use of coevolutionary information is critical to these models’ accuracy, and without it their predictive performance drops considerably. In living cells, however, the 3D structure of a protein is fully determined by its primary sequence and the biophysical laws that cause it to fold into a low-energy configuration. Thus, it should be possible to predict a protein’s structure from only its primary sequence by learning a highly accurate biophysical energy function. We provide evidence that AlphaFold has learned such an energy function and uses coevolution data to solve the global search problem of finding a low-energy conformation. We demonstrate that AlphaFold’s learned potential function can be used to rank the quality of candidate protein structures with state-of-the-art accuracy, without using any coevolution data. Finally, we explore several applications of this potential function, including the prediction of protein structures without MSAs.

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

jamesproney{at}gmail.com
The current version of this draft has some noteworthy differences from the previous version. In our original draft, we investigated two choices for the one-hot encoded sequence associated with the decoy structures: a sequence of all alanines, and the target sequence. We later discovered that we had used an incorrect encoding for the target sequence (this error is described in detail in the accompanying code release). Consequently, our reported results for the target sequence are significantly different in this draft. We also switched from using a sequence of all alanines to a sequence of all gap tokens to avoid any bias towards alanine-rich sequences, although these sequences gave highly similar results. Our previous draft also experimented with combining multiple choices of decoy sequence to create even stronger ranking results, but we ultimately decided it was simpler to focus on the results from using the gap sequence. Finally, we added an applications section to the main text, which expands upon the decoy generation results that were previously included in the appendix. This revision also includes new experiments exploring applications to protein design and mutation effect prediction, the latter of which is described in Appendix E.
https://github.com/jproney/AF2Rank

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.