bioRxiv
New Results

Predicting and Interpreting Protein Developability via Transfer of Convolutional Sequence Representation

Alexander W. Golinski, Zachary D. Schmitz, Gregory H. Nielsen, Bryce Johnson, Diya Saha, Sandhya Appiah, Benjamin J. Hackel, Stefano Martiniani
doi: https://doi.org/10.1101/2022.11.21.517400
Alexander W. Golinski
1Department of Chemical Engineering and Materials Science, University of Minnesota, Minneapolis, MN 55455
Zachary D. Schmitz
1Department of Chemical Engineering and Materials Science, University of Minnesota, Minneapolis, MN 55455
Gregory H. Nielsen
1Department of Chemical Engineering and Materials Science, University of Minnesota, Minneapolis, MN 55455
Bryce Johnson
1Department of Chemical Engineering and Materials Science, University of Minnesota, Minneapolis, MN 55455
Diya Saha
1Department of Chemical Engineering and Materials Science, University of Minnesota, Minneapolis, MN 55455
Sandhya Appiah
1Department of Chemical Engineering and Materials Science, University of Minnesota, Minneapolis, MN 55455
Benjamin J. Hackel
1Department of Chemical Engineering and Materials Science, University of Minnesota, Minneapolis, MN 55455
  • For correspondence: hackel@umn.edu sm7683@nyu.edu
Stefano Martiniani
1Department of Chemical Engineering and Materials Science, University of Minnesota, Minneapolis, MN 55455
2Center for Soft Matter Research, Department of Physics, New York University, New York, NY 10003
3Simons Center for Computational Physical Chemistry, Departments of Chemistry, New York University, New York, NY 10003
4Courant Institute of Mathematical Sciences, New York University, New York, NY 10003
  • For correspondence: hackel@umn.edu sm7683@nyu.edu

Abstract

Engineered proteins have emerged as novel diagnostics, therapeutics, and catalysts. Poor protein developability, quantified by expression, solubility, and stability, often hinders their utility. The ability to predict protein developability from amino acid sequence would reduce the experimental burden of selecting candidates. Recent advances in screening technologies enabled a high-throughput developability dataset for 10^5 of 10^20 possible variants of the protein ligand scaffold Gp2. In this work, we evaluate the ability of neural networks to learn a developability representation from a high-throughput dataset and to transfer this knowledge to predict recombinant expression beyond observed sequences. The model convolves learned amino acid properties to predict expression levels 44% closer to the experimental variance than a non-embedded control. Analysis of the learned amino acid embeddings highlights the uniqueness of cysteine, the importance of hydrophobicity and charge, and the unimportance of aromaticity when aiming to improve the developability of small proteins. We identify clusters of similar sequences with increased developability through nonlinear dimensionality reduction, and we explore the inferred developability landscape via nested sampling. The analysis enables the first direct visualization of the fitness landscape and highlights the existence of evolutionary bottlenecks in sequence space, which give rise to competing subpopulations of sequences with different developability. The work advances applied protein engineering efforts by predicting and interpreting protein scaffold developability from a limited dataset. Furthermore, our statistical mechanical treatment of the problem advances foundational efforts to characterize the structure of the protein fitness landscape and the amino acid characteristics that influence protein developability.
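As an illustration of what "convolving learned amino acid properties" can look like, the following is a minimal NumPy sketch, not the authors' implementation. The embedding size, filter count, window width, ReLU activation, global max-pooling, the random weights, and the example sequence are all assumptions chosen for the sketch; in a trained model the embedding table and filters would be learned from data.

```python
import numpy as np

# The 20 canonical amino acids; each gets one row in the embedding table.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def embed_sequence(seq, embedding):
    """Look up one embedding row per residue -> array of shape (len(seq), embed_dim)."""
    return embedding[[AA_INDEX[aa] for aa in seq]]


def conv1d_valid(x, kernels, bias):
    """Slide filters along the sequence axis ("valid" padding), then apply ReLU.

    x: (L, D) embedded sequence; kernels: (F, W, D); bias: (F,).
    Returns an array of shape (L - W + 1, F).
    """
    n_filters, width, _ = kernels.shape
    n_windows = x.shape[0] - width + 1
    out = np.empty((n_windows, n_filters))
    for i in range(n_windows):
        window = x[i:i + width]  # (W, D) slice of neighbouring residues
        # Contract each filter with the window over positions and embedding dims.
        out[i] = np.tensordot(kernels, window, axes=([1, 2], [0, 1])) + bias
    return np.maximum(out, 0.0)


def sequence_representation(seq, embedding, kernels, bias):
    """Fixed-length vector for a sequence: embed, convolve, global max-pool."""
    features = conv1d_valid(embed_sequence(seq, embedding), kernels, bias)
    return features.max(axis=0)


# Hypothetical, randomly initialised parameters (stand-ins for learned weights).
rng = np.random.default_rng(0)
embedding = rng.normal(size=(20, 4))   # 4 "properties" per amino acid (assumed)
kernels = rng.normal(size=(8, 3, 4))   # 8 filters spanning 3 residues (assumed)
bias = np.zeros(8)

example_seq = "ACDEFGHIKL"             # arbitrary sequence, not a Gp2 variant
conv_out = conv1d_valid(embed_sequence(example_seq, embedding), kernels, bias)
rep = sequence_representation(example_seq, embedding, kernels, bias)
print(conv_out.shape, rep.shape)
```

Because the embedding table has one row per amino acid, its rows can be inspected after training, which is the kind of analysis the abstract refers to when it discusses cysteine, hydrophobicity, and charge.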

Significance statement

Predicting and understanding protein developability is a critical limiting step in biologic discovery and engineering because experimental throughput is limited. We demonstrate that a machine learning model can first learn sequence-developability relationships from high-throughput assay data and then transfer the learned developability representation to predict the true metric of interest, recombinant yield in bacterial production. Model performance is 44% better than that of a model not pre-trained using the high-throughput assays. Analysis of model behavior reveals the importance of cysteine, charge, and hydrophobicity to developability, as well as an evolutionary bottleneck that greatly limited sequence diversity above 1.3 mg/L yield. Experimental characterization of model-predicted candidates confirms the benefit of this transfer learning and in silico evolution approach.
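The transfer step described above can be sketched in a few lines, under loudly stated assumptions: the "pre-trained" representation is replaced here by a fixed random projection, the yield data are synthetic, and the small top model is a closed-form ridge regression. None of these choices is taken from the paper; the sketch only shows the general pattern of freezing a representation and fitting a small model on a limited set of yield measurements.

```python
import numpy as np


def fit_ridge(features, targets, alpha=1.0):
    """Closed-form ridge regression with an unpenalised intercept."""
    x_mean = features.mean(axis=0)
    y_mean = targets.mean()
    centered = features - x_mean
    gram = centered.T @ centered + alpha * np.eye(centered.shape[1])
    weights = np.linalg.solve(gram, centered.T @ (targets - y_mean))
    intercept = y_mean - x_mean @ weights
    return weights, intercept


def predict(features, weights, intercept):
    """Linear prediction on top of the frozen representation."""
    return features @ weights + intercept


rng = np.random.default_rng(0)

# Hypothetical stand-in for layers pre-trained on high-throughput assay data.
# These weights are never updated in the transfer step.
frozen_projection = rng.normal(size=(16, 4))


def frozen_representation(raw_features):
    """Map raw sequence features to the (frozen) learned representation."""
    return np.tanh(raw_features @ frozen_projection)


# Synthetic "yield" data: a small labelled set, as in a low-throughput assay.
raw = rng.normal(size=(60, 16))
reps = frozen_representation(raw)
true_weights = np.array([1.0, -2.0, 0.5, 0.0])  # invented for the sketch
yields = reps @ true_weights + 0.3 + 0.01 * rng.normal(size=60)

# Fit only the small top model on 40 examples; evaluate on the remaining 20.
weights, intercept = fit_ridge(reps[:40], yields[:40], alpha=1e-3)
test_mse = np.mean((predict(reps[40:], weights, intercept) - yields[40:]) ** 2)
print(weights.shape, test_mse)
```

Fitting only a few parameters on top of a frozen representation is one common way to avoid overfitting when the target metric, here recombinant yield, can be measured for only a small number of sequences.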

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

  • Figure 2's legend has been updated for accuracy and clarity of displayed correlation coefficients.

  • https://github.com/HackelLab-UMN/DevRep2

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
Posted December 03, 2022.
bioRxiv 2022.11.21.517400; doi: https://doi.org/10.1101/2022.11.21.517400
Subject Area

  • Bioinformatics