bioRxiv
New Results

OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization

Gustaf Ahdritz, Nazim Bouatta, Sachin Kadyan, Qinghui Xia, William Gerecke, Timothy J O’Donnell, Daniel Berenberg, Ian Fisk, Niccolò Zanichelli, Bo Zhang, Arkadiusz Nowaczynski, Bei Wang, Marta M Stepniewska-Dziubinska, Shang Zhang, Adegoke Ojewole, Murat Efe Guney, Stella Biderman, Andrew M Watkins, Stephen Ra, Pablo Ribalta Lorenzo, Lucas Nivon, Brian Weitzner, Yih-En Andrew Ban, Peter K Sorger, Emad Mostaque, Zhao Zhang, Richard Bonneau, Mohammed AlQuraishi
doi: https://doi.org/10.1101/2022.11.20.517210
Gustaf Ahdritz
1 Department of Systems Biology, Columbia University
2 Harvard University
Nazim Bouatta
3 Laboratory of Systems Pharmacology, Harvard Medical School
Sachin Kadyan
1 Department of Systems Biology, Columbia University
Qinghui Xia
1 Department of Systems Biology, Columbia University
William Gerecke
3 Laboratory of Systems Pharmacology, Harvard Medical School
Timothy J O’Donnell
4 Icahn School of Medicine at Mount Sinai
Daniel Berenberg
5 Prescient Design, Genentech
6 Department of Computer Science, Courant Institute of Mathematical Sciences, New York University
Ian Fisk
7 Flatiron Institute
Niccolò Zanichelli
8 OpenBioML
Bo Zhang
9 Scientific Computing and Imaging Institute, University of Utah
Arkadiusz Nowaczynski
10 NVIDIA
Bei Wang
10 NVIDIA
Marta M Stepniewska-Dziubinska
10 NVIDIA
Shang Zhang
10 NVIDIA
Adegoke Ojewole
10 NVIDIA
Murat Efe Guney
10 NVIDIA
Stella Biderman
11 EleutherAI
12 Booz Allen Hamilton
Andrew M Watkins
5 Prescient Design, Genentech
Stephen Ra
5 Prescient Design, Genentech
Pablo Ribalta Lorenzo
10 NVIDIA
Lucas Nivon
13 Cyrus Bio
Brian Weitzner
14 Outpace Bio
Yih-En Andrew Ban
15 Arzeda
Peter K Sorger
3 Laboratory of Systems Pharmacology, Harvard Medical School
Emad Mostaque
16 Stability AI
Zhao Zhang
17 Texas Advanced Computing Center
Richard Bonneau
5 Prescient Design, Genentech
Mohammed AlQuraishi
1 Department of Systems Biology, Columbia University
For correspondence: m.alquraishi@columbia.edu, nazim_bouatta@hms.harvard.edu

Abstract

AlphaFold2 revolutionized structural biology with the ability to predict protein structures with exceptionally high accuracy. Its implementation, however, lacks the code and data required to train new models. These are necessary to (i) tackle new tasks, like protein-ligand complex structure prediction, (ii) investigate the process by which the model learns, which remains poorly understood, and (iii) assess the model’s generalization capacity to unseen regions of fold space. Here we report OpenFold, a fast, memory-efficient, and trainable implementation of AlphaFold2, and OpenProteinSet, the largest public database of protein multiple sequence alignments. We use OpenProteinSet to train OpenFold from scratch, fully matching the accuracy of AlphaFold2. Having established parity, we assess OpenFold’s capacity to generalize across fold space by retraining it using carefully designed datasets. We find that OpenFold is remarkably robust at generalizing despite extreme reductions in training set size and diversity, including near-complete elisions of classes of secondary structure elements. By analyzing intermediate structures produced by OpenFold during training, we also gain surprising insights into the manner in which the model learns to fold proteins, discovering that spatial dimensions are learned sequentially. Taken together, our studies demonstrate the power and utility of OpenFold, which we believe will prove to be a crucial new resource for the protein modeling community.

Competing Interest Statement

M.A. is a member of the Scientific Advisory Boards of Cyrus Biotechnology, Deep Forest Sciences, Nabla Bio, Oracle Therapeutics, and FL2021-002, a Foresite Labs company. P.K.S. is a member of the Scientific Advisory Board or Board of Directors of Glencoe Software, Applied Biomath, RareCyte, and NanoString and is an advisor to Merck and Montai Health.

Footnotes

  • Author list and affiliations cleaned; animations added; acknowledgements and author contributions list updated; figures adjusted

  • https://github.com/aqlaboratory/openfold

  • https://registry.opendata.aws/openfold/

  • https://figshare.com/articles/media/Folding_animations/21561939

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Posted November 24, 2022.
bioRxiv 2022.11.20.517210; doi: https://doi.org/10.1101/2022.11.20.517210
Subject Area

  • Bioinformatics