Abstract
Advances in DNA sequencing and machine learning are illuminating protein sequences and structures on an enormous scale. However, the energetics driving folding are invisible in these structures and remain largely unknown. The hidden thermodynamics of folding can drive disease, shape protein evolution, and guide protein engineering, and new approaches are needed to reveal these thermodynamics for every sequence and structure. We present cDNA display proteolysis, a new method for measuring thermodynamic folding stability for up to 900,000 protein domains in a one-week experiment. From 1.8 million measurements in total, we curated a set of ~850,000 high-quality folding stabilities covering all single amino acid variants and selected double mutants of 354 natural and 188 de novo designed protein domains 40-72 amino acids in length. Using this immense dataset, we quantified (1) environmental factors influencing amino acid fitness, (2) thermodynamic couplings (including unexpected interactions) between protein sites, and (3) the global divergence between evolutionary amino acid usage and protein folding stability. We also examined how our approach could identify stability determinants in designed proteins and evaluate design methods. The cDNA display proteolysis method is fast, accurate, and uniquely scalable, and promises to reveal the quantitative rules for how amino acid sequences encode folding stability.
One-Sentence Summary Massively parallel measurement of protein folding stability by cDNA display proteolysis
Main Text
Protein sequences vary by more than ten orders of magnitude in thermodynamic folding stability (the ratio of unfolded to folded molecules at equilibrium) (1, 2). Even single point mutations that alter stability can have profound effects on health and disease (3–5), pharmaceutical development (6–8), and protein evolution (9–13). Thousands of point mutants have been individually studied over decades to quantify the determinants of stability (14, 15), but these studies highlight a challenge: similar mutations can have widely varying effects in different protein contexts, and these subtleties remain difficult to predict despite substantial effort (16, 17). In fact, even as deep learning models have achieved transformative accuracy at protein structure prediction (18–21), progress in modeling folding stability has arguably stalled (22–24). New high-throughput experiments have the potential to transform our understanding of stability by quantifying the effects of mutations across a vast number of protein contexts, revealing new biophysical insights and empowering modern machine learning methods.
Here, we present a powerful new high-throughput stability assay along with a uniquely massive dataset of 851,552 folding stability measurements. Our new method - cDNA display proteolysis - combines the strengths of cell-free molecular biology and next-generation sequencing and requires no on-site equipment larger than a qPCR machine. Assaying one library (up to 900,000 sequences in our experiments) requires one week and only ~$2,000 in reagents, excluding the cost of DNA synthesis and sequencing. Compared to mass spectrometry-based high-throughput stability assays (25–28), cDNA display proteolysis achieves a 100-fold larger scale and can easily be applied to study mutational libraries that pose difficulties for proteomics. Compared to our earlier yeast display proteolysis method (29), cDNA display proteolysis resolves a wider dynamic range of stability and is more reproducible even at a 50-fold larger experimental scale. Large-scale proteolysis data have already played a key role in the development of machine learning methods for protein design and protein biophysics (30–36). The cDNA display proteolysis method massively expands this capability and has the potential to expand our knowledge of stability to the scale of all known small domains.
Our new dataset of 851,552 absolute folding stabilities is unique in size and character. Current thermodynamic databases contain a skewed assortment of mutations measured under many varied conditions (14). In contrast, our new dataset comprehensively measures all single mutants for 354 natural domains and 188 designed proteins - including single deletions and two insertions at each position - all under identical conditions. Our dataset also includes comprehensive double mutations at 595 site pairs spread across 208 domains (a total of 222,265 double mutants). By maintaining uniform experimental conditions, our data can be used to examine the determinants of absolute folding stability in addition to the effects of mutations. Using our unique dataset, we investigated how individual amino acids and pairs of amino acids contribute to folding stability (Figs. 3 and 4) as well as how selection for stability interacts with other selective pressures in natural protein domains (Figs. 5 and 6). We also explored how our unique scale of data can be applied in protein design (Fig. 7).
Massively parallel measurement of folding stability by cDNA display proteolysis
Proteases typically cleave unfolded proteins more quickly than folded ones, and proteolysis assays have been used for decades to measure folding stability (37) and select for high stability proteins (38, 39). In 2017, we introduced the high-throughput yeast display proteolysis method for measuring folding stability using next generation sequencing (29, 40–46). To improve the scale, precision, speed, and cost of stability measurements, we developed cDNA display proteolysis. Each experiment begins with a DNA library. Here, we employ synthetic DNA oligo pools where each oligo encodes one test protein. The DNA library is transcribed and translated using cell-free cDNA display (47), based on mRNA display (48, 49), resulting in proteins that are attached at the C-terminus to their cDNA. We then incubate the protein-cDNA complexes with different concentrations of protease, quench the reactions, and pull down the proteins using an N-terminal PA tag (Fig. 1A). Intact (protease-resistant) proteins will also carry their C-terminal cDNA. Finally, we determine the relative amounts of all proteins in the surviving pool at each protease concentration by deep sequencing (Fig. 1D). To control for any effects of protease specificity, we perform separate experiments with two orthogonal proteases: trypsin (targeting basic amino acids) and chymotrypsin (targeting aromatic amino acids).
We inferred the protease stability of all sequences from our sequencing counts using a Bayesian model of the experimental procedure. We modeled protease cleavage using single turnover kinetics (50, 51) (Fig. 1B eqs. 1 to 3, Fig. S1, and Supplementary Text for the derivation) because we assume the enzyme is in excess over all substrates (up to ~20 pM of substrate based on previous estimates (47) versus 141 pM for the lowest concentration of protease). To parameterize the model, we used a universal kmax cleavage rate for all sequences (Fig. S1) and used our sequencing data to infer a unique K50 for each sequence (the protease concentration at which the cleavage rate is one-half kmax, see Methods). The K50 values inferred by the model were consistent between two replicates of the proteolysis procedure (R = 0.97 for trypsin and 0.99 for chymotrypsin for ~84% of sequences in a pool of 806,640 sequences after filtering based on confidence and dynamic range; Fig. 1E).
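A minimal sketch of the saturable cleavage behavior described above may help fix ideas. It assumes the rate takes the form kmax·[E]/(K50 + [E]), which matches the stated definition of K50 as the protease concentration giving one-half kmax, and that intact protein decays with first-order kinetics at a fixed protease concentration. The exact equations are those of Fig. 1B and the Supplementary Text; the function names, the incubation time, and all numerical values here are hypothetical.

```python
import numpy as np

def cleavage_rate(protease_conc, k50, kmax):
    """Saturable rate; equals kmax / 2 when protease_conc == k50."""
    return kmax * protease_conc / (k50 + protease_conc)

def surviving_fraction(protease_conc, k50, kmax, time):
    """First-order loss of intact protein at a fixed protease concentration (assumed form)."""
    return np.exp(-cleavage_rate(protease_conc, k50, kmax) * time)

# Hypothetical values in arbitrary units, for illustration only.
kmax, time = 1.0, 5.0
conc = np.array([0.1, 1.0, 10.0, 100.0])
resistant = surviving_fraction(conc, k50=50.0, kmax=kmax, time=time)
susceptible = surviving_fraction(conc, k50=0.5, kmax=kmax, time=time)
```

In this sketch a sequence with a higher K50 survives a given protease concentration better, which is the signal that the sequencing counts are used to infer.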
To infer each sequence’s thermodynamic folding stability (ΔG for unfolding), we used a kinetic model that separately considers idealized folded (F) and unfolded (U) states (Fig. 1B eq. 4). We model both states using the same single-turnover equations as before (Fig. 1B eq. 3), with separate K50 protease concentrations for each state (K50,F and K50,U) and a shared kmax. We assume that cleavage in the folded state exclusively occurs outside the folded domain (e.g. in the N-terminal PA tag added to all sequences), so we use an identical K50,F for all sequences. In contrast, K50,U reflects an individual sequence’s unique protease susceptibility in the unfolded state, which depends on its potential cleavage sites. We inferred K50,U for each sequence using a position-specific scoring matrix (PSSM) model of protease cleavage parameterized using measurements of 64,238 scrambled sequences (sequences with a high probability of being fully unfolded, Fig. S2; see also Methods). Finally, we assume that folding, unfolding, and enzyme binding are all in rapid equilibrium relative to cleavage, implying that K50,U, K50,F, and the overall K50 can be approximated by the enzyme-substrate equilibrium dissociation constants for each state (Fig. 1B eq. 6). Although these approximations will not be universally accurate, they are appropriate for the small domains examined here and facilitate consistent analysis of all test sequences. With these approximations, we can express a sequence’s ΔG in terms of the universal K50,F, its inferred K50,U, and its experimentally measured K50 (Fig. 1B eqs. 5 and 7, and Supplementary Text for the derivation). For most analyses we combine our independent trypsin and chymotrypsin data into a single overall ΔG estimate (see Methods).
Based on our kinetic model, (1) stability (ΔG) will be underestimated if significant cleavage occurs inside the test domain in the folded state, (2) stability can be over- or under-estimated depending on the accuracy of K50,U (independent measurements with trypsin and chymotrypsin help correct this), and (3) ΔG values become unreliable if K50 approaches K50,F or K50,U (Fig. 1C).
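To illustrate how a ΔG can follow from the three K50 values, the sketch below works through one form consistent with the rapid-equilibrium assumption. If the observed K50 behaves as a population-weighted dissociation constant, then 1/K50 = f_U/K50,U + f_F/K50,F, which rearranges to [F]/[U] = (1/K50,U − 1/K50)/(1/K50 − 1/K50,F). This derivation is ours for illustration, and the temperature and function name are assumptions; the expressions actually used are given in Fig. 1B and the Supplementary Text.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)

def delta_g_unfolding(k50, k50_f, k50_u, temperature=298.15):
    """ΔG of unfolding (kcal/mol) from K50 values in any shared concentration unit.

    Assumes 1/K50 = f_U/K50_U + f_F/K50_F (rapid equilibrium); the temperature is
    a hypothetical default. Returns NaN outside the resolvable range.
    """
    if not (k50_u < k50 < k50_f):
        return math.nan
    folded_over_unfolded = (1.0 / k50_u - 1.0 / k50) / (1.0 / k50 - 1.0 / k50_f)
    return R_KCAL * temperature * math.log(folded_over_unfolded)
```

The logarithm diverges as K50 approaches either K50,F or K50,U, which is one way to see why ΔG values become unreliable near those limits.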
Folding stabilities from cDNA display proteolysis are consistent with traditional experiments on purified proteins
In Fig. 1G, we compare stabilities measured by cDNA display proteolysis to previous results from experiments on purified protein samples for 1,143 variants of ten proteins (52–65). All Pearson correlations are above 0.7. Our stability measurements for these 1,143 sequences were all performed in libraries of 244,000–900,000 total sequences. Although several sets of mutants show systematic offsets (y-intercept values) between literature values and our measurements, these offsets correlate with temperature differences between experimental conditions (with the exception of the N-terminal domain of Ribosomal Protein L9 (2HBB), Fig. S3, see Table S2 for all experimental conditions). We also noticed several variants of Protein GB1 appear unstable in our data but stable in the previous experiments (52). Our structural analysis of these mutations suggests that our measurements are more likely to be correct (Fig. S4). Overall, the consistency between our cDNA display proteolysis data and traditional biophysical measurements establishes that (1) small domains are cleaved mainly in the globally unfolded state, (2) our method can reliably measure these cleavage rates on a massive scale, and (3) our unfolded state model can remove protease-specific effects to attain accurate quantitative folding stability measurements.
Comprehensive mutational analysis across designed and natural protein domains
To systematically examine how individual residues influence folding stability, we used cDNA display proteolysis to measure stability for all single substitutions, deletions, and Gly and Ala insertions in 983 natural and designed domains. We chose our natural domains to cover almost all of the small monomeric domains in the Protein Data Bank (30-72 amino acids in length). Our designed domains included (1) previous Rosetta designs with ααα, αββα, βαββ, and ββαββ topologies (40-43 a.a.) (29, 66), (2) new ββαα proteins designed using Rosetta (47 a.a.), and (3) new domains designed by trRosetta hallucination (46 to 69 a.a.) (42, 67). We collected these data using four giant synthetic DNA oligonucleotide libraries and obtained K50 values for 2,520,337 sequences; 1,844,548 of these measurements are included here. K50 values were reproducible across libraries (Fig. S5). Oligo pools were synthesized by Agilent Technologies (one 244,000-sequence library, length 170 nt) and Twist Bioscience (three libraries of 696,000 - 900,000 sequences, length 250-300 nt).
Deep mutational scanning of hundreds of domains revealed several overall patterns. The largest fraction of these domains showed clear, biophysically reasonable sequence-stability relationships that were consistent between separate experiments with trypsin and chymotrypsin. However, other domains were completely unfolded, too stable to resolve, insensitive to mutation, or inconsistent between the proteases. For 42 domains that were too stable to resolve, we introduced single mutations to destabilize the wild-type sequence, then performed new mutational scanning experiments in these 121 new “wild-type” backgrounds (Fig. S6). In four domains, mutational scanning revealed trypsin-sensitive loops that could be cleaved in the folded state, leading to inconsistent stabilities between trypsin and chymotrypsin (Fig. S7). In these cases, we introduced one to two substitutions into the wild-type sequences to remove trypsin-sensitive sites, then performed new mutational scanning experiments in these alternative backgrounds. This led to consistent results between the two proteases. In total, we performed deep mutational scanning for 983 domain sequences, including both original and revised wild-type backgrounds.
Our overall categorization of all domains is shown in Fig. 2B (see Fig. S8 for inclusion criteria). Based on these categories, we assembled three curated datasets for machine learning (Fig. 2A). Our ΔΔG dataset (Dataset #1) includes 586,938 sequences (single and double mutants) from 251 natural domains and 145 designs. In this dataset, the wild-type sequence is 1.25-4.5 kcal/mol in stability so that most ΔΔG values (including for stabilizing mutants) are correctly resolved. Our ΔG dataset (Dataset #2) includes all 851,552 single and double mutants from 354 natural domains and 188 designs. In this dataset, the large majority of mutant ΔGs are accurately resolved, but the wild-type ΔG may lie outside the dynamic range, preventing accurate ΔΔG calculations. Finally, Dataset #3 includes all ~1.8 million confidently estimated K50 values, even when trypsin and chymotrypsin measurements produced inconsistent ΔG estimates. The main domain classes in Dataset #1 are shown in Fig. 2C; all natural domains included in Dataset #1 are listed by category in Fig. S9 (see Supplementary Materials for all wild-type sequences).
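The wild-type stability window used for the ΔΔG dataset can be expressed as a simple filter. The sketch below assumes ΔΔG is defined as ΔG(mutant) minus ΔG(wild type), so that destabilizing mutations are negative; this sign convention and the function name are assumptions for illustration, and the full inclusion criteria are those of Fig. S8.

```python
import math

WT_MIN, WT_MAX = 1.25, 4.5  # kcal/mol, wild-type window quoted for Dataset #1

def ddg_if_resolved(dg_mutant, dg_wild_type):
    """Return ΔG(mutant) - ΔG(wild type), or NaN when the wild type is outside the window."""
    if not (WT_MIN <= dg_wild_type <= WT_MAX):
        return math.nan
    return dg_mutant - dg_wild_type
```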
Mutational scanning results for nine domains are shown in Fig. 2D and E. Like all mutational scans in Datasets 1 and 2, these examples show a strong consistency between independent ΔG measurements with trypsin and chymotrypsin (Pearson correlation 0.94 ± 0.04 for 542 domains in Dataset 2, median ± std.). In each structure, sites are colored according to the average effect of an amino acid substitution, with the most critical sites (where mutations are very destabilizing) colored dark blue. Most of these critical sites are in the hydrophobic core. However, our data also reveal numerous other critical interactions, such as a side chain hydrogen bond between S23 and D42 in the U-box domain of human E4B Ubiquitin ligase and a cation-π interaction between R10 and W32 in the chromodomain of human chromobox protein homolog 7 (residues have been re-numbered based on the exact sequence included in our experiments). These unique stabilizing interactions reveal the rich biophysical diversity found in our systematic exploration of stability across hundreds of domains.
Trends in amino acid fitness at different sites and across domains
We first sought to define the major sources of variation between protein sites that determine the relative stabilities of all 20 amino acids at that site (i.e. the site’s stability landscape). To this end, we performed principal component (PC) analysis using 293,697 ΔG measurements at 15,440 sites in 337 domains from Dataset #1 after centering our data to set the average ΔG at each site to zero (Fig. 3A, B). Each principal component expresses specific properties of a site that determine which amino acids are stabilizing or destabilizing. Based on the loadings of the different amino acids onto each principal component (Fig. 3C), we interpreted the first four components to reflect amino acid hydrophobicity (PC1; 31% of the total variance explained by this PC), helical probability (PC2; 15%), aliphatic vs. aromatic favorability (PC3; 12%), and positive vs. negative charge (PC4; 7%). The fifth principal component (6%) was more complex: at one extreme were small amino acids that could be buried in dense environments, along with positively charged amino acids that can “snorkel” their charged moieties to the surface even when partially buried. At the other extreme were negatively charged amino acids that are energetically costly to bury. We interpreted this component to reflect an “ease of burial” that is orthogonal to the hydrophobic property captured by PC1. These interpretations are also consistent with the structural environments at each site, as shown in Fig. 3D. For example, the first principal component reflecting hydrophobicity is high at buried positions and low at exposed positions (Fig. 3D).
These first five principal components collectively form a coarse model of the properties of protein sites, but some sites have unique stability landscapes that cannot be accurately represented by this model. We reconstructed the stability landscapes at all sites using the first five components and examined how different sites and domains deviated from these simplified landscapes (Fig. 3E). On average, stability landscapes reconstructed using five principal components were similarly accurate (in terms of mean absolute error) for both high and low stability domains (Fig. 3F). However, as expected, these coarse reconstructions were less accurate for domains with more varied stability landscapes (domains with a higher standard deviation of ΔG for all substitutions). The coarse model was also more accurate at reconstructing the stability landscapes of de novo designed domains and less accurate at reconstructing the landscapes of natural domains (Fig. 3F). This remained true for any number of principal components and even when designed proteins were excluded from the initial PCA (Fig. S10). This indicates that the de novo design protocols examined here lead to structures with “typical” amino acid environments that can be accurately described by only five principal components, and that these proteins generally lack the more specialized environments found in natural domains. Indeed, wild-type amino acids in natural domains tend to be more stable than the fit from the coarse model (Fig. S11). This suggests the remaining components capture additional biophysical effects that contribute to the compatibility between wild-type amino acids and their environments.
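The analysis in this section (site-centered PCA followed by a five-component reconstruction) can be sketched as follows. The matrix here is synthetic and stands in for a sites × amino acids table of ΔG values; the additional column centering before the SVD is the conventional PCA step and is an assumption about the procedure, as are all variable names.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sites, n_aa, n_keep = 200, 20, 5

# Synthetic stand-in for a sites x amino-acids ΔG matrix (not the measured data).
latent = rng.normal(size=(n_sites, 3)) @ rng.normal(size=(3, n_aa))
dg = latent + 0.1 * rng.normal(size=(n_sites, n_aa))

# Center each site so its average ΔG is zero, as described in the text.
centered = dg - dg.mean(axis=1, keepdims=True)

# PCA via SVD of the column-centered matrix (conventional choice, assumed here).
col_mean = centered.mean(axis=0)
u, s, vt = np.linalg.svd(centered - col_mean, full_matrices=False)
explained = s**2 / np.sum(s**2)  # fraction of variance per component

# Reconstruct each site's landscape from the first five components.
recon = (u[:, :n_keep] * s[:n_keep]) @ vt[:n_keep] + col_mean
site_mae = np.abs(centered - recon).mean(axis=1)  # per-site mean absolute error
```

Sites with a large `site_mae` in such an analysis are the ones whose landscapes the coarse model fails to capture.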
Three example proteins shown in Fig. 3G illustrate how the coarse five-component model captures (or fails to capture) protein stability landscapes. At one extreme, the stability landscape of the designed protein r11_692_TrROS (from trRosetta hallucination) is accurately approximated by the coarse model (average per-residue MAE 0.13 kcal/mol). In contrast, the two natural domains (an SH3 domain (1QP2) and a unique NifT/FixU barrel domain (2JN4); Fig. S12) contain many sites with unique properties that are not accurately represented by the model (average per-residue MAE of 0.34 kcal/mol and 0.31 kcal/mol for the SH3 and β-barrel domains, respectively). Seven of these sites are highlighted in Fig. 3H. Each stability landscape contains sharp differences between closely related amino acids that are not captured by the coarse model, such as V versus L at V8 and Q19 in 1QP2, and Q versus E at Q19 in 1QP2, M28 in 2JN4, and T60 in 2JN4. These unusual patterns are unlikely to be experimental artifacts because the patterns are consistent between independent experiments with trypsin and chymotrypsin and the same patterns are seen in both our K50 and ΔG analysis (Fig. 3H). Our massive dataset enabled us to identify the global trends in stability landscapes as well as specific cases that depart from these trends. These unusual cases with large reconstruction errors may provide the opportunity to study how protein flexibility and/or rare side chain interactions contribute to folding stability. These unusual sites will also serve as stringent test cases for models of protein stability.
Quantifying thermodynamic coupling for hundreds of amino acid pairs
Next, we examined how side chain interaction between amino acid pairs affects folding stability. We constructed comprehensive substitutions (20 x 20 amino acids) of 595 amino acid pairs from 208 natural domains and designs in our ΔG dataset (Dataset #2) and measured stability for all sequences by cDNA display proteolysis. We selected pairs that were suggested to form energetically important hydrogen bonds in our mutational scanning data as well as other pairs forming close contacts (Fig. 4A; Methods). To quantify the interactions between side chains, we constructed an additive model for each amino acid pair with 40 coefficients that capture the independent stability contributions of each amino acid in each position. The deviations from these models quantify the “thermodynamic coupling” between specific amino acids. Among our curated set of wild-type pairs, thermodynamic couplings were typically 0.5-1.0 kcal/mol in magnitude, with the largest couplings stronger than 2 kcal/mol (Fig. 4B). Among all sequences tested (wild-type or mutant pairs), pairs with opposite charges and cysteine pairs tended to have positive (favorable) couplings, whereas pairs with the same charge and acidic-aromatic/aliphatic amino acid pairs tended to have negative couplings (Fig. 4C). These couplings are lower than our observed wild-type couplings because the side chain orientations and environment surrounding wild-type pairs will typically be optimized for that pair. Nonetheless, our data recapitulate expected patterns of side chain interactions, provide a wealth of data for training machine learning models, and identify a wide range of noteworthy interactions for further study.
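The additive model and the resulting couplings can be sketched with ordinary least squares. The example below fits one coefficient per amino acid at each of the two positions (40 in total for a 20 × 20 grid) and reports the residuals as couplings. The use of unregularized least squares, the function name, and the synthetic data are assumptions for illustration; the fitting procedure actually used is described in Methods.

```python
import numpy as np

def fit_additive(dg):
    """Least-squares fit of dg[i, j] ~ a[i] + b[j]; returns (predicted, coupling).

    Missing measurements may be given as NaN and are ignored in the fit.
    """
    n1, n2 = dg.shape
    rows, cols = np.nonzero(~np.isnan(dg))
    design = np.zeros((rows.size, n1 + n2))
    design[np.arange(rows.size), rows] = 1.0       # coefficient for residue at site 1
    design[np.arange(rows.size), n1 + cols] = 1.0  # coefficient for residue at site 2
    coef, *_ = np.linalg.lstsq(design, dg[rows, cols], rcond=None)
    predicted = coef[:n1, None] + coef[None, n1:]
    return predicted, dg - predicted

# Synthetic 20 x 20 grid: purely additive except for one favorable pair.
rng = np.random.default_rng(1)
a, b = rng.normal(size=20), rng.normal(size=20)
dg = a[:, None] + b[None, :]
dg[3, 7] += 2.0
predicted, coupling = fit_additive(dg)
```

Because the fit absorbs part of any single deviation into the row and column coefficients, the recovered coupling for the planted pair is somewhat smaller than the 2.0 kcal/mol that was added.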
Several notable pairs are highlighted in Fig. 4D to F. In an OB-domain from Shewanella oneidensis, we found strong thermodynamic coupling between two unrelated pairs of amino acids: the wild-type Tyr-Glu pair and a mutant Lys-Trp pair that may form a cation-π interaction (thermodynamic couplings of 1.6±0.2 and 1.4±0.2 kcal/mol respectively; mean±std from calculating the coupling using bootstrap resampling of the ~400 amino acid combinations; Fig. 4D, S13A). In the Alpha-spectrin SH3 domain, our comprehensive double mutant scanning of Y10 and Y52 uncovered the highly stable, tightly coupled double mutant Y10H/Y52K (coupling of 2.5±0.4 kcal/mol for His-Lys versus 1.0±0.2 kcal/mol for the wild-type pair) (Fig. 4E, S13B). AlphaFold modeling predicts that this double mutant introduces a new hydrogen bonding network to replace the original Tyr-Tyr interaction. We also identified an unexpected thermodynamic coupling between an amino acid pair lacking a direct side chain interaction. In the SH3 domain of Myo3, mutations at K24 are destabilizing even though the side chain makes no clear interactions. To investigate interactions of K24, we quantified thermodynamic couplings to nearby Y9 (0.0±0.1 kcal/mol) and D10 (1.0±0.2 kcal/mol) (Fig. 4F and Fig. S13C). The surprising K24-D10 coupling - between two side chains that appear not to interact - highlights the difficulty of inferring energetic interactions from structural data alone, and suggests a possible longer-ranged ionic interaction.
We also investigated thermodynamic couplings within 36 different three-residue networks. For each triplet, we comprehensively measured stability for all possible single and double substitutions in both the wild-type background and in the background where the third amino acid was replaced by alanine (400 mutants x 3 pairs x 2 backgrounds = ~2,400 mutants in total for each triplet). As before, we modeled each set of 400 mutants (i.e. one residue pair in one background) using 40 single-amino acid coefficients (we did not globally model all 2,400 mutants together). One notable triplet is found in the J domain of HSJ1a, where R60 and D64 both interact with the hydroxyl group on Y3 (Fig. 4G left). We observe strong couplings (> 1.5 kcal/mol) between each pair of two out of the three amino acids. However, when any of the three amino acids is mutated to alanine, the coupling between the remaining two amino acids becomes much weaker (< 0.5 kcal/mol, Fig. 4G middle and right, Fig. S13D). These results reveal a strong third-order thermodynamic coupling: the interaction between two amino acids is mediated by a third amino acid.
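The third-order coupling can be summarized as the change in a pairwise coupling when the third residue is replaced by alanine. The short sketch below uses this standard thermodynamic-cycle definition; the function name and the numerical values are hypothetical and chosen only to be consistent with the ranges quoted above (> 1.5 and < 0.5 kcal/mol).

```python
def third_order_coupling(pair_coupling_wt_background, pair_coupling_ala_background):
    """Change in a pairwise coupling (kcal/mol) when the third residue is mutated to Ala."""
    return pair_coupling_wt_background - pair_coupling_ala_background

# Hypothetical pairwise couplings for one triplet (not measured values).
example = third_order_coupling(1.6, 0.4)

# Library size per triplet, as stated in the text: 400 mutants x 3 pairs x 2 backgrounds.
mutants_per_triplet = 400 * 3 * 2
```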
This strong three-way coupling is especially noteworthy because the interactions do not appear in the deposited NMR structural ensemble (2LGW; Fig. S14A and B). The interaction network shown in Fig. 4G comes from the AlphaFold predicted structure for our wild-type sequence taken from the J domain of human HSJ1a. This network reproduces interactions seen in other J-domain crystal structures from C. elegans (2OCH) and P. falciparum (6RZY). However, in the deposited NMR ensemble for 2LGW, the backbone near Y5 (Y3 in our numbering) always positions that residue away from the helix containing R62 and D66, making the interaction network impossible. The strong couplings we identify support the AlphaFold model and suggest the deposited ensemble is missing conserved interactions that form in HSJ1a and other J domain proteins. This example illustrates how large-scale folding stability measurements can reveal the thermodynamic effects of a critical interaction even when that interaction is missing in the deposited NMR structure. Notably, AlphaFold itself does not always predict this network either: when we include the disordered linkers from the NMR construct or those used for cDNA display proteolysis, AlphaFold also predicts alternative structures lacking the interaction network (Fig. S14D and E).
The scale of our cDNA display proteolysis experiments makes it straightforward to characterize unique cases like these, and again these cases will serve as stringent tests for models of folding stability. Strong third-order couplings like this example also present a special challenge for computational models that calculate stabilities by summing interaction energies between pairs of residues using a single reference structure. Deep learning models that implicitly represent entire conformational landscapes (42) may be more promising, but training these models using large-scale thermodynamic measurements will be essential to achieve their potential.
Natural sequences systematically deviate from their highest stability variants
How does selection for stability influence protein sequence evolution in concert with other evolutionary mechanisms? It is well known that proteins contain specific functional residues that are commonly deleterious to stability (68, 69). However, the challenge of measuring stability has made it difficult to experimentally distinguish selection for stability from other selective pressures on a global level (70–72). To examine the strength of selection for stability, we created a simple classification model to predict the wild-type amino acid at any site in a natural protein based on the folding stabilities of all substitution variants at that site (excluding Cys) (Fig. 5A). The model contains two parts: (1) a shared weight function that converts absolute stabilities of protein variants into relative probabilities of those sequences, and (2) amino-acid specific offsets that shift amino acid probabilities by a constant amount at all sites. We fit the shared weight function parameters (a flexible monotonically increasing function) and the offsets together using absolute stability data for wild-type sequences and substitution variants at 4,718 sites in 80 non-redundant natural proteins (85,004 ΔG measurements in all, Fig. 5A). Our simple model fits the data well by three criteria: (1) it correctly produces the overall frequencies of the 19 (non-Cys) amino acids (Fig. 5B), (2) the output amino acid probabilities are correctly calibrated across the full range of probability (Fig. S15), and (3) the model performs similarly well on the training set and on a held-out testing set consisting of 621 sites in 11 domains with no similarity to the training set (Fig. 5E).
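The structure of this classification model can be sketched as a softmax over the amino acids at a site, combining a shared function of each variant's ΔG with a per-amino-acid offset. The code below is an illustration only: the log-linear weight function (slope ln 5.7 per kcal/mol, taken from the fold difference reported for the main range of the data) stands in for the flexible monotonic function that was actually fit, and the function names and example values are hypothetical.

```python
import numpy as np

def wild_type_probabilities(dg, offsets, weight_fn):
    """Softmax over amino acids of weight_fn(ΔG) + offset (both in natural-log units)."""
    logits = weight_fn(np.asarray(dg, dtype=float)) + np.asarray(offsets, dtype=float)
    logits = logits - logits.max()  # subtract the maximum for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def log_linear_weight(dg, fold_per_kcal=5.7):
    """Assumed stand-in for the fitted monotonic weight function."""
    return np.log(fold_per_kcal) * dg

# Hypothetical site with three candidate amino acids and equal offsets.
dg_site = np.array([3.0, 2.0, 2.0])
p = wild_type_probabilities(dg_site, np.zeros(3), log_linear_weight)
```

With equal offsets, the variant that is 1 kcal/mol more stable is 5.7-fold more probable than either alternative in this sketch.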
The model parameters reveal the strength of selection for stability across this heterogeneous set of domains from many organisms. Within the main range of our data (folding stabilities from 1.5 to 4 kcal/mol), amino acid probabilities increase approximately linearly with increased stability, with a 1 kcal/mol stability difference between protein variants indicating a ~5.7-fold difference in sequence likelihood (Fig. 5C). The global offsets to each amino acid’s probability (Fig. 5D) are different from the empirical amino acid frequencies (Fig. 5B) and indicate the probability of each amino acid under conditions where all sequence variants are equally stable. The offsets span a 23-fold range: the most likely amino acid (Glu) is 23-fold more likely to occur (2^1.5/2^-3.0) than the least likely amino acid (Trp) under the condition that sequence variants containing these amino acids at the same site are equally stable (Fig. 5D). This probability difference corresponds to a stability difference of ~1.8 kcal/mol (Fig. 5C); i.e. Trp and Glu would be equally likely at a site if the Trp variant were 1.8 kcal/mol more stable than the Glu variant. Overall, the most likely amino acids are the charged amino acids Glu, Asp, and Lys, suggesting selection for solubility, whereas the least likely amino acids are the nonpolar aromatic amino acids Trp, Phe, and Tyr, along with Met. These offsets provide a quantitative “favorability” metric incorporating all non-stability evolutionary influences on amino acid composition, including selection for amino acid synthesis cost (73, 74), codon usage (75, 76), avoiding oxidation-prone amino acid(s), net charge, and function. These offsets also highlight that biophysical models and protein design methods trained to reproduce native protein sequences will not consistently optimize folding stability; Fig. 5D quantifies how much specific amino acids are over- or underrepresented in small, naturally occurring domains compared to their effects on stability. Notably, these offsets are similar to findings from an independent analysis of global discrepancies between variant effect data and sequence likelihood modeling (77).
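The arithmetic linking the Glu and Trp offsets to a stability difference can be checked directly. The sketch below assumes the offsets are expressed as powers of two (1.5 for Glu and −3.0 for Trp) and that the ~5.7-fold change per kcal/mol applies log-linearly across the relevant range; the variable names are ours.

```python
import math

offset_glu, offset_trp = 1.5, -3.0  # log2 offsets quoted in the text
fold_per_kcal = 5.7                 # likelihood fold change per 1 kcal/mol

# Ratio of Glu to Trp probability at equal stability: 2^1.5 / 2^-3.0 = 2^4.5.
fold_ratio = 2.0 ** (offset_glu - offset_trp)

# Stability difference that would offset this ratio, assuming log-linearity.
equivalent_dg = math.log(fold_ratio) / math.log(fold_per_kcal)

print(f"{fold_ratio:.1f}-fold, {equivalent_dg:.2f} kcal/mol")
```

This gives roughly 22.6-fold and 1.79 kcal/mol, in line with the rounded values of 23-fold and ~1.8 kcal/mol.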
Properties of functional residues across diverse domains
Selection for function also causes protein sequences to diverge from the highest stability sequence variants. Previous studies (70, 71) have applied this strategy to identify functional sites based on the difference between evolutionary conservation and predicted effects on stability. We expanded this strategy to employ experimental stability measurements and examined the properties of functional sites on a large scale. We identified functional sites in 92 diverse protein domains by comparing each site’s average ΔΔG of substitutions with its normalized GEMME (78) score, an evolutionary-based measure of sensitivity to mutations (Fig. 6A, see Methods for details). High sensitivity generally indicates high evolutionary conservation. Sites where wild-type amino acids are critical for stability (higher average ΔΔG, rightward) tend to be predicted as more sensitive to mutation by GEMME (upward) and vice versa. We defined all sites in the upper left region (where the wild-type amino acid is conserved yet unimportant for stability, 9.3% in total) to be “functional” sites. This classification correctly identifies key binding residues in the chromodomain of HP1 and the SH3 domain of BBC1 (Fig. 6B and C, see Fig. S16 for mutational scanning and conservation data on these examples). We found that Gly, Asp, and the bulky amino acids Trp, Arg, and Tyr were frequently classified as functional (Fig. 6D). However, like previous studies, our classification method has the notable weakness that any site that is important for folding stability will not be considered functional.
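The "upper left region" rule can be sketched as a two-threshold filter on each site's stability importance and evolutionary sensitivity. The rectangular boundary, both threshold values, and all names below are hypothetical placeholders; the boundary actually used is defined in Fig. 6A and Methods.

```python
import numpy as np

def classify_functional(stability_importance, sensitivity, max_importance, min_sensitivity):
    """Flag sites that are conserved (high sensitivity) yet unimportant for stability.

    stability_importance: per-site average effect of substitutions on stability,
        oriented so that larger values mean the wild type matters more for stability.
    sensitivity: per-site normalized evolutionary sensitivity score.
    """
    importance = np.asarray(stability_importance, dtype=float)
    sens = np.asarray(sensitivity, dtype=float)
    return (importance < max_importance) & (sens > min_sensitivity)

# Hypothetical sites: (importance, sensitivity); thresholds are placeholders.
importance = np.array([0.2, 2.5, 0.1, 1.8])
sensitivity = np.array([0.9, 0.9, 0.2, 0.3])
functional = classify_functional(importance, sensitivity, max_importance=0.5, min_sensitivity=0.7)
```

As the text notes, a rule of this kind cannot flag a site that is both functional and important for stability, since such a site falls outside the upper left region by construction.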
Across all 92 domains, the fraction of functional sites ranged from 0 to ~25% (Fig. 6E). The domains with the highest fraction of functional sites (the Sso7d protein (1JIC) and Ribosomal protein S19 (1QKH)) are both nucleic acid binding proteins, with the functional sites located on the surface primarily at the binding interface (Fig. 6F). To identify buried functional sites, we compared each site’s evolutionary-based sensitivity to non-polar mutations (normalized GEMME score for hydrophobic substitutions) to the average ΔΔG of nonpolar substitutions (Fig. 6G), a more permissive metric. With this approach, most functional sites are still located at the protein surface, but a small fraction are located in the core (Fig. 6H). One example is A64 in the DUF1471 domain of yahO. A64 is highly sensitive to non-polar mutations and buried in the core of the domain, but substitutions to Tyr or Phe increase folding stability (Fig. 6I). This indicates that A64 modulates the function of the domain even without interacting with external partner molecules, perhaps by maintaining the overall protein shape. Similarly, in the N-terminal domain of FK506-binding protein 3, L55 is buried in the core and highly conserved even though substitutions to Ile, Val, or Phe have no effect on stability (Fig. 6J). This domain binds DNA and the other functional residues are mainly located at the binding interface. Although L55 does not directly interact with DNA, substitutions to other hydrophobic amino acids may change the orientations of the surface side chains and prevent proper DNA binding. Notably, chemical shift perturbations in this domain indicate which residues change their magnetic environment in response to DNA binding (Fig. 6J) (79). Chemical shift perturbations are found mainly in the functional residues on the protein surface, but L55 experiences a chemical shift perturbation as well, indicating allosteric communication between the functional surface residues and L55. 
These results highlight unusual cases where buried sites are conserved due to specific functional requirements rather than to maintain overall stability.
Large-scale stability analysis to characterize unique designs, identify stabilizing mutations, and evaluate design methods
The unique scale of cDNA display proteolysis creates new opportunities for improving protein design. Here, we examine three applications of our method and massive dataset: (1) characterizing the stability determinants of rare, highly polar proteins, (2) identifying stabilizing mutations, and (3) benchmarking the protein design tool PROSS (80). The hydrophobic effect is considered the dominant force in protein folding (1), and measuring stability for thousands of our previously-designed domains (29) by cDNA display proteolysis revealed a general trend of increasing stability with increasing hydrophobicity (Fig. 7A). However, increased hydrophobicity can promote protein aggregation, non-specific interactions, and low expression yield. To study the properties of high stability, low hydrophobicity proteins, we examined hundreds of designed proteins by deep mutational scanning across a wide range of hydrophobicity and stability. Although the mutational scanning patterns for low hydrophobicity proteins were not obviously different from other designs, we identified several designs that possessed exceptionally strong polar interactions (large dots in Fig. 7A). In Fig. 7B, we highlight stabilizing polar networks and a cation-π interaction in these unusual designs (see Fig. S17 for full mutational scanning results). The average ΔΔG for substitutions at these polar sites ranges from −0.20 to −1.33 kcal/mol, corresponding to the top 63 to 1.5%ile for all 3,694 polar sites in 145 designs. Our unusually massive dataset made it possible to identify these rare highly stabilizing interactions. Notably, the second hydrogen bond network in EHEE_rd2_0152 is also found in two other more hydrophobic designs. However, the network is less sensitive to substitution in those designs, highlighting how the overall protein environment mediates the effects of substitutions even on the protein surface (Fig. S18).
We next examined how our approach could be used to identify stabilizing mutations. Predicting and designing stabilizing mutations is a major goal of protein modeling, but prediction accuracy remains low (22). In part, this is because stabilizing mutants are rare in current databases (14, 15) (outside of reverting a destabilizing mutant), limiting the data available for improving modeling. In contrast, our large-scale approach revealed 2,600 stabilizing mutations, defined as mutations that increase folding stability by at least 1 kcal/mol. The overall fraction of stabilizing mutations was 0.06% to 0.6% for different protein types (Fig. 7C). Stabilizing mutations were enriched at functional sites (23% of the stabilizing mutations from 7.5% of sites classified as functional), but these were still a small fraction of the total. Notably, our set includes 112 examples of stabilizing insertions and deletions which are nearly absent from current databases. In Fig. 7D, we show three examples of different classes of stabilizing mutations found in our dataset with effects ranging from +1.2 to +3.1 kcal/mol (Fig. S19).
Finally, we applied our method to evaluate PROSS (80), an automated method for enhancing folding stability within sequence constraints inferred from a multiple sequence alignment. We tested 1,156 PROSS designs for 266 protein domains (a 10-100x increase over a previous benchmarking study (81)). Unlike previous studies, our mutational scanning data for all 266 wild-type domains enabled us to examine the isolated effect of every individual substitution in every PROSS design. The average increase in stability from PROSS was 0.6±1.0 (mean±std) kcal/mol, and 40% of 727 domains (with wild-type ΔG < 4 kcal/mol) had at least one design with a 1 kcal/mol increase in stability (Fig. 7E). As expected, PROSS avoided mutations at functional positions: only 1.9% of PROSS-designed mutations were found at functional positions compared to 8.7% of sites classified as functional (defined in Fig. 6A). Three examples of domains successfully stabilized by PROSS are shown in Fig. 7F. Although the median number of designed mutations was only 4, more mutations typically led to a larger increase in stability (Fig. S20A), as theorized previously (22). Based on our mutational scanning data, the average effect of an individual PROSS mutation was 0.22±0.47 kcal/mol (green line, Fig. S20B). On average, the added stabilization from PROSS is comparable in size to the effect of the best single mutant designed by PROSS, and smaller than the additive effect of the two best designed mutations (Fig. S20C). Evaluating individual mutations recommended by PROSS (or other design tools) by direct comparisons to mutational scanning data provides a novel approach for systematically improving these design methods.
Discussion
The cDNA display proteolysis method massively expands the scale of folding stability experiments. Still, the method currently has notable limitations. Because we digest proteins under native conditions, our inferred thermodynamic stabilities are only accurate when (1) folding is fully cooperative (no unfolded segments get cleaved without global unfolding (82)), (2) folding is at equilibrium during the assay (no kinetic stability or spurious stability due to aggregation), (3) K50,U is accurately inferred (Fig. 1C), and (4) cleavage rates fall within the measurable range of the assay, which currently limits the dynamic range to ~5 kcal/mol (Fig. 1C). Many domains - particularly larger protein structures - will not satisfy these conditions, and issues such as non-cooperativity, kinetic stability, or aggregation are invisible in a single measurement. Combining cDNA display proteolysis with chemical denaturation (pulse proteolysis, (37)) may overcome these obstacles and enable mega-scale analysis of less cooperative and/or higher stability proteins, while also avoiding the need to infer K50,U. Advances in DNA synthesis (including methods like DropSynth (83, 84)) will also make it possible to expand cDNA display proteolysis to analyze diverse libraries of larger domains. Lastly, multiplexed measurements and automated data processing always have the potential to introduce inaccuracies, although we worked to exclude unreliable data. For notable individual results, examining the raw data can be helpful, and we included all data and code to regenerate all fits.
Despite these limitations, the unique scale of cDNA display proteolysis opens completely new possibilities for studying protein stability. By comprehensively measuring single mutants across nearly all small structures in the Protein Data Bank, we quantified several global trends: trends in amino acid fitness at different sites, trends in the effects of single and double mutants, and trends in how stability influences sequence evolution. Along with these global trends, our large-scale analysis also uncovered hundreds of exceptional cases that would be challenging to identify by smaller-scale methods. These include mutations with extreme effects, sites with unusual stability landscapes, and pair interactions with unusually strong thermodynamic couplings. The strong thermodynamic couplings we identified in the J domain of human HSJ1a (Fig. 4G) - missing in the deposited NMR structure - highlight how large-scale stability assays can complement other methods for revealing structural details in solution. The 2,400 double mutants examined in that domain made up only 0.3% of the experimental library. Beyond studying the origins of stability, cDNA display proteolysis will have a range of other applications, including assaying designed proteins on a massive scale to systematically improve design methods (29, 43, 85), identifying folded domains in metagenomic sequences, and dissecting the relationships between folding stability and function (41).
Achieving an accurate, quantitative understanding of protein stability and its sequence dependence has been a central goal in biophysics for decades. We envision millions of cDNA display proteolysis measurements forming the foundation for a new generation of deep learning models predicting absolute folding stabilities and effects of mutations. Breakthroughs in deep learning-powered structure prediction have proven the power of these models in protein science, but collecting sufficient thermodynamic data has always been a major obstacle. Due to the scale and efficiency of cDNA display proteolysis, the main limit to measuring stability for millions of small domains is the cost of DNA synthesis (86–88) and sequencing (89, 90) - both of which are rapidly decreasing (91–94). With the flexibility of DNA oligo synthesis, cDNA display proteolysis can assay massive mutational libraries (as shown here) as well as massive libraries of unrelated sequences and structures, which will add essential diversity in training datasets. The size and diversity of protein sequence space creates enormous challenges for biology and protein design. The cDNA display proteolysis method offers a powerful approach to map folding stability across this space on an unprecedented scale.
Funding
Northwestern University Startup Funding (GJR), JSPS KAKENHI 19J30003 (KT), Human Frontier Science Program Long-Term Fellowship (KT), and JST PRESTO Grant JPMJPR21E9 (KT). This research was supported in part through the computational resources and staff contributions provided for the Quest high performance computing facility at Northwestern University which is jointly supported by the Office of the Provost, the Office for Research, and Northwestern University Information Technology.
Author contributions
KT designed and performed all experiments, and analyzed the data with help from GJR. JD designed and analyzed stabilities of hallucination-based proteins with help from SO. JC designed ββαα proteins using Rosetta with help from GJR. EL computed GEMME scores with help from YBM, and also assisted with interpretation of GEMME. JW generated the PROSS designs. NMM provided assistance with mathematical derivation and review of enzyme kinetic interpretation. GJR and KT conceived the project. GJR supervised the project and acquired funding. KT and GJR wrote and revised the manuscript, with input from all authors.
Competing interests
Authors declare that they have no competing interests.
Data and materials availability
All data and code are available in the main text, the supplementary materials, or for download at https://doi.org/10.5281/zenodo.7401275 or https://github.com/Rocklin-Lab/cdna-display-proteolysis-pipeline
Supplementary Materials
Materials and Methods
DNA oligo library construction
All sequences were reverse-translated and codon-optimized using DNAworks2.0 (95). Sequences were optimized using E. coli codon frequencies because we used an in vitro translation kit derived from E. coli. Oligo libraries encoding amino acid sequences of Library 1 were purchased from Agilent Technologies. Oligo libraries for Libraries 2-4 were purchased from Twist Bioscience.
Library 1
We selected ~250 designed proteins and ~50 natural proteins that are shorter than 45 amino acids. We then created amino acid sequences for deep mutational scanning and padded them with Gly, Ala, and Ser residues so that all sequences have 44 amino acids. The library contains ~244,000 sequences in total.
Library 2
We selected ~350 natural proteins that have PDB structures in a monomeric state and have 72 or fewer amino acids after removing N- and C-terminal linkers. We then created amino acid sequences for deep mutational scanning and padded them with Gly, Ala, and Ser residues so that all sequences have 72 amino acids. The library contains ~650,000 sequences in total. This library also includes scrambled sequences used to construct the unfolded state model.
Library 3
We selected ~150 designed proteins and created amino acid sequences for deep mutational scanning of the proteins. We also included comprehensive deletions and Gly/Ala insertions for all wild-type proteins included in Library 1 and Library 2. Additionally, amino acid sequences for comprehensive double mutant analysis of polar amino acid pairs were included.
Library 4
Amino acid sequences for exhaustive double mutant analysis of amino acid pairs located in close proximity were included. We also included sequences that overlap with the other libraries to calibrate effective protease concentrations and to check consistency between libraries.
EEHH design method
EEHH protein design was performed in three steps: (1) backbone construction, (2) sequence design, (3) selection of designs for deep mutational analysis. Backbone construction (the de novo creation of a compact, three-dimensional backbone with a pre-specified secondary structure) was performed using a blueprint-based approach described previously (96, 97). All blueprints are included as Blueprints_for_EEHH.zip in Supplementary Materials.
Hallucination design method
We used a TrRosetta hallucination protocol described previously in (42, 67) and available at https://github.com/gjoni/trDesign/tree/master/02-GD to unconditionally generate protein backbones and sequences with lengths ranging from 46 to 69 amino acids by maximizing the Kullback–Leibler divergence between the predicted and background distance/angle distributions. Predicted distograms and anglegrams were used to obtain 3D structures of these models as described in the TrRosetta paper (98). We selected the best designs according to the predicted distogram and 3D structure match.
DNA and mRNA preparation for cDNA display proteolysis method
Oligo libraries were amplified by PCR using KOD PCR Master Mix (TOYOBO) to add a T7 promoter, a PA tag at the N-terminus, and a His tag at the C-terminus of the proteins. The number of cycles was chosen based on a test qPCR run using SsoAdvanced Universal SYBR Green Supermix (BIORAD) to avoid overamplification. The PCR product was gel extracted to isolate the product of the expected length. We then used the T7-Scribe Standard RNA IVT Kit (CELLSCRIPT) to synthesize mRNA using the DNA fragment as a template.
Preparation of protein-cDNA complex
We followed the previously described protocol (47, 99) with some modifications.
Photocross-linking between mRNA and the puromycin linker
We prepared the photocrosslinking reaction solution including 200 mM NaCl, 40 mM Tris-HCl (pH 7.5), 20 μM cnvK linker (EME corporation), 20 μM mRNA. The solution was incubated at 95°C for 5 min, then slowly cooled down to 45°C (0.1°C / 1 second) using a thermal cycler. Then the solution including the duplex was irradiated with UV light at 365 nm using a 6W Handheld lamp (Thermofisher).
In vitro translation and reverse transcription
We prepared the PUREfrex 2.0 (GeneFrontier) translation system with the mRNA-cnvK linker duplex and RiboLock RNase Inhibitor (Thermofisher) and incubated the sample at 37°C for 2 hrs. After the incubation, 100 mM EDTA was added to the sample to dissociate ribosomes. Then, an equal amount of binding/washing buffer (30 mM Tris pH 7.5, 500 mM NaCl, 0.05% Tween 20) was added. The solution was added to Dynabeads MyOne Streptavidin C1 (Thermofisher) to pull down the protein-mRNA complex and incubated at room temperature for 20 min. Then, the beads were washed once with binding/washing buffer and rinsed twice with TBS (10 mM Tris-HCl pH 7.5, 100 mM NaCl). We then added reverse transcription solution (PrimeScript RT Reagent Kit; Takara) to the beads carrying the protein-mRNA complex and incubated the beads at 37°C for 30 min.
Purification of protein-cDNA complex
After the reverse transcription, the protein-cDNA complex was eluted with binding/washing buffer containing RNase T1 (Thermofisher). The eluate was added to His Mag Sepharose Ni (Cytiva) and incubated at room temperature for 30 min. The complex was then eluted with binding/washing buffer containing 400 mM imidazole, and the eluate was buffer-exchanged using a Zeba Spin Desalting Column (Thermofisher). The complex was then snap-frozen in liquid nitrogen and stored at −80°C until the protease assay.
Protease assay on protein-cDNA complex
We prepared 40 μL each of an 11-point three-fold protease dilution series starting from 25 μM for replicate 1 and 43.3 (= 25 × 3^0.5) μM for replicate 2, then added them to 12 aliquots (20 μL each) of the protein-cDNA complex. After 5 min of protease digestion at room temperature, we added 200 μL of chilled 2% BSA in PBS to quench the reaction. The solution was then added to 10 μL of Dynabeads Protein G (Thermofisher) with anti-PA tag antibody (Wako; Clone number: NZ-1; 1 μg antibody per 30 μL beads) and incubated at 4°C for 1 hr. The beads were washed three times with washing buffer (PBS including 800 mM NaCl and 1% Triton) and rinsed three times with PBS, then the complex was eluted with 50 μL of PBS including 250 μg/mL PA peptide (Wako) and 200 μg/mL BSA (Thermofisher).
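The two dilution series can be written out explicitly. The sketch below lists nominal concentrations only, assuming 11 points, a three-fold step, and the stated starting concentrations; it does not model any further dilution on mixing with the protein-cDNA complex, and the function name is hypothetical.

```python
import math

def dilution_series(top_conc_um, n_points=11, fold=3.0):
    """Return an n-point serial dilution (in uM) starting at top_conc_um."""
    return [top_conc_um / fold**i for i in range(n_points)]

# Replicate 1 starts at 25 uM; replicate 2 starts at 25 x 3^0.5 (about 43.3 uM)
rep1 = dilution_series(25.0)
rep2 = dilution_series(25.0 * math.sqrt(3.0))

for c1, c2 in zip(rep1, rep2):
    print(f"{c1:10.5f} uM   {c2:10.5f} uM")
```

Because the replicate 2 series is offset by a factor of 3^0.5, its points fall halfway between the replicate 1 points on a logarithmic scale, so the two replicates together sample the concentration range at half-step resolution.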
qPCR analysis of cDNA display proteolysis results on individual proteins (for Fig. S1)
The cDNA amount for each specific sequence in the eluents was quantified by qPCR using SsoAdvanced Universal SYBR Green Supermix and specific primers for each sequence. The qPCR was performed using CFX96 Touch Real-Time PCR Detection System (BIORAD), and the qPCR cycles were determined by the CFX Maestro Software (BIORAD).
Next-generation sequencing sample preparation
For DNA library analysis, one-half volume (25 μL) of the eluted cDNA of the complex was amplified by PCR using SsoAdvanced Universal SYBR Green Supermix (BioRad) to add P5 and P7 NGS adapter sequence. The number of cycles was chosen based on a test qPCR run using the same PCR reagents to avoid overamplification. The DNA fragment length and concentration were confirmed by 4200 TapeStation System (Agilent), then the samples were analyzed by NovaSeq 6000 System (Illumina).
Processing of next-generation sequencing data
Each library in a sequencing run was identified via a unique 6 or 8 bp barcode. Following sequencing, reads were paired using the PEAR program (100), then the adapter sequences were removed by Cutadapt (101). A read was counted toward a sequence only if it perfectly matched one of the ordered sequences at the nucleotide level.
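The exact-match counting rule can be illustrated as follows. This is a minimal sketch, not the pipeline code: it assumes reads have already been merged and adapter-trimmed, and the function name and the toy sequences are hypothetical.

```python
from collections import Counter

def count_exact_matches(reads, ordered_sequences):
    """Count reads that exactly match an ordered DNA sequence."""
    ordered = set(ordered_sequences)
    # Reads with any mismatch to the ordered set are discarded
    counts = Counter(read for read in reads if read in ordered)
    # Report zero for ordered sequences that were never observed
    return {seq: counts.get(seq, 0) for seq in ordered_sequences}

ordered = ["ATGGCT", "ATGGCA", "ATGTTT"]
reads = ["ATGGCT", "ATGGCT", "ATGGCA", "ATGGCC", "ATGGCT"]
counts = count_exact_matches(reads, ordered)
print(counts)
```

In this toy example the read "ATGGCC" is dropped because it differs from every ordered sequence by one nucleotide.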
Overall strategy for inferring K50 and ΔG from sequencing data
We used Bayesian inference to infer K50 and ΔG values for all sequences in our library. This analysis uses two main models. The first model is called the “K50 model” and infers each sequence’s K50 values based on the sequencing count data. The second model is called the “unfolded state model” and predicts each sequence’s unfolded state K50 value (K50,U) based on its sequence. Both models are implemented in Python 3.9 using the Numpyro package (102) version 0.80. Here, we first describe the structure of each model, and then we describe the practical process of fitting the parameters of each model. Our scripts to reproduce the complete fitting process are provided in the Supplementary Materials.
Structure of the K50 model to infer K50 values from next-generation sequencing data
We modeled our selection results using the single turnover kinetics model described in Fig. 1B. We chose this model because we expect that the total concentration of protein-cDNA complex is low compared to the amount of added enzyme and because the model captures the saturation behavior observed by qPCR at high enzyme concentration (Fig. S1). Instead of attempting to capture the microscopic complexity of our system (millions of different substrates and potential inhibitors), the purpose of the model is to treat each substrate in a consistent, simplified manner and infer reasonable parameters.
Our model makes two main assumptions. First, we assume that each sequence is cleaved independently, with no competition or product inhibition. As shown in Fig. 1 eqs. 2 and 3, cleavage is described by four parameters: enzyme concentration (E), time (t), and the kinetic parameters K50 and kmax. All experiments used a fixed five minute reaction time. Based on qPCR analysis of individual sequences (Fig. S1), we fixed the quantity kmax * t at 10^0.65 for all sequences. Each sequence’s unique stability is defined by the K50 parameter that represents the enzyme concentration producing the half maximal cleavage rate (Fig. 1 eq. 3). Our second main assumption is that we can interpret our K50 values as representing the dissociation constants (KD) between each protein sequence and the enzyme (K50 ≈ KD, Fig. 1 eq. 6). From this assumption, we can determine the folding stability of each sequence (ΔG) based on the relationship between the observed K50 value and theoretical K50 values for the fully folded and fully unfolded states (K50,F and K50,U, Fig. 1 eqs. 5-7). Although we can directly fit K50 values without making any assumptions about the microscopic basis for K50 (see Supplementary Text for details), assuming that K50 ≈ KD aids our interpretation and enables us to directly fit ΔG values to our data using the Coupled approach described below.
To fit our model to our sequencing counts data, we first assume that the cDNA display process produces an unknown initial distribution of full-length protein-cDNA complexes (the cDNA0 distribution). The distribution of sequences at enzyme concentration E (the cDNAE distribution) is the product of the initial sequence distribution cDNA0 and the surviving fraction of each sequence according to Fig. 1 eqs. 2 and 3, after re-normalizing the total surviving fraction of all sequences to 1.
Finally, we assume that our deep sequencing counts result from nsel independent selections from the cDNAE distribution, where nsel is the number of sequencing reads that exactly matched our specified DNA sequences.
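The generative structure of the K50 model can be sketched as a forward simulation. This is an illustration under stated assumptions rather than the fitting code: the cleavage rate is assumed to saturate hyperbolically with enzyme concentration, k = kmax·E/(K50 + E), which is consistent with K50 being the concentration that gives the half maximal rate; kmax·t is taken as 10^0.65; and the initial fractions, K50 values, and function names are hypothetical.

```python
import numpy as np

KMAX_T = 10 ** 0.65  # assumed fixed value of k_max * t

def surviving_fraction(k50, enzyme_conc, kmax_t=KMAX_T):
    """Assumed single-turnover form: exp(-kmax*t * E / (K50 + E))."""
    return np.exp(-kmax_t * enzyme_conc / (k50 + enzyme_conc))

def expected_distribution(cdna0, k50, enzyme_conc):
    """cDNA_E: initial distribution times surviving fraction, renormalized to 1."""
    weights = cdna0 * surviving_fraction(k50, enzyme_conc)
    return weights / weights.sum()

rng = np.random.default_rng(0)
cdna0 = np.array([0.25, 0.25, 0.5])   # hypothetical initial fractions (cDNA_0)
k50 = np.array([0.1, 1.0, 10.0])       # hypothetical K50 values, uM
cdna_e = expected_distribution(cdna0, k50, enzyme_conc=1.0)

# Sequencing counts modeled as n_sel independent draws from cDNA_E
n_sel = 10_000
counts = rng.multinomial(n_sel, cdna_e)
print(cdna_e, counts)
```

Because the distribution is renormalized at each enzyme concentration, only relative survival is observable: a sequence with a high K50 increases in frequency as protease is added even though its absolute amount decreases.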
We apply the K50 model in two different ways based on whether K50 values for trypsin and chymotrypsin are Independent or Coupled. The “Independent” procedure is used in Steps 1, 2 and 5 in the section “Procedure for fitting all data”. In the independent procedure, the inputs to the model are the sequencing counts data from experiments with one protease, the enzyme concentrations, the reaction time, and the kmax constant. We fit the model by sampling two parameters per sequence from normal prior distributions: (1) K50, and (2) the initial fraction of each sequence in the cDNA0 distribution. The “Coupled” procedure is used in Step 5 in the section “Procedure for fitting all data”. In the coupled procedure, the inputs to the model are the sequencing counts data from experiments with both proteases, the enzyme concentrations, the reaction time, the kmax constant, the K50,F constants representing the universal K50 value for sequences in the folded state (one for each protease), and the predicted K50,U values for all sequences for both proteases from the unfolded state model. We then assume that each sequence has a specific ΔG value that is shared across both proteases. We use this shared ΔG value along with K50,F and K50,U (for each protease) to determine K50 for each protease according to Fig. 1 Eqs. 5 and 7. Finally, we fit the coupled model by sampling two parameters per sequence from normal prior distributions: (1) ΔG, and (2) the initial fractions of each sequence in cDNA0.
Full results from both the independent and coupled fitting procedure are provided in K50_dG_Dataset1_Dataset2.csv and K50_Dataset3.csv. For our stability parameters (protease-specific K50 in the independent procedure and ΔG in the coupled procedure) we report the median of the posterior distribution as well as the upper and lower limits of the 95% confidence interval (the 2.5%ile and 97.5%ile values of the posterior distribution). We also used the protease-specific K50 values from the independent procedure to compute protease-specific ΔG values. We do this using the same K50,F and K50,U values used in the coupled procedure according to Fig. 1 Eqs. 5 and 7. These protease-specific ΔG estimates are also reported in K50_dG_Dataset1_Dataset2.csv and are only used to examine the consistency between different proteases (e.g. Fig. 1F and Fig. 2D). In some cases, the independently fit K50 values can lead to impossible values for ΔG. This can occur if K50 is higher than K50,F (observed cleavage is slower than our limit for cleavage in the folded state) or if K50 is lower than K50,U (observed cleavage is faster than predicted cleavage in the unfolded state). If the median protease-specific K50 or the confidence interval limits for a particular sequence lead to impossible ΔG values for that sequence, we report dummy values for the corresponding protease-specific ΔG estimates.
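The reported summary statistics and the check for impossible ΔG values can be sketched as follows. This is a minimal illustration with hypothetical function names and toy numbers; it assumes K50 values are expressed in log10 molar units and treats a K50 equal to either limit as undefined, since ΔG would be unbounded there.

```python
import numpy as np

def summarize_posterior(samples):
    """Return the median and the 2.5th and 97.5th percentiles of posterior samples."""
    lower, median, upper = np.percentile(samples, [2.5, 50.0, 97.5])
    return median, lower, upper

def dg_is_defined(log10_k50, log10_k50_u, log10_k50_f):
    """True only if K50 lies strictly between the unfolded and folded limits."""
    return log10_k50_u < log10_k50 < log10_k50_f

samples = np.linspace(-6.0, -5.0, 101)  # toy posterior samples of log10 K50
median, lower, upper = summarize_posterior(samples)

# Toy limits: predicted unfolded-state K50 and universal folded-state K50
ok = dg_is_defined(median, log10_k50_u=-6.5, log10_k50_f=-4.0)
too_slow = dg_is_defined(-3.5, log10_k50_u=-6.5, log10_k50_f=-4.0)
print(median, lower, upper, ok, too_slow)
```

In practice the same check would be applied to the median and to both interval limits, with a dummy value reported whenever any of them falls outside the allowed range.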
Structure of the unfolded state model to infer unfolded K50 (K50,U) from scrambled sequence data
Our unfolded state model is similar to the model employed previously (29) with two notable differences. First, instead of assuming that all scrambled sequences are fully unfolded, we assume that each scrambled sequence has its own unknown folding stability, with a prior distribution biased toward low stability (normal prior centered at ΔG = −1, sigma = 4). Second, instead of fitting an unfolded state model for each protease independently, we assume that each scrambled sequence’s stability (ΔG) is common across both proteases, and fit the models for each protease together. As a result, the majority of scrambled sequences are modeled as completely unfolded (Fig. S2C), but some scrambled sequences are modeled as stable when that interpretation is consistent with both the trypsin and chymotrypsin data.
Our unfolded state model has three parts: (1) a position-specific scoring matrix (PSSM) that describes how the amino acid sequence in a 9-mer window (the P5 to P4’ positions in protease nomenclature) determines the cleavage rate at the P1 position, (2) a local response function describing the saturation of the cleavage rate for a single P1 position, and (3) a global response function that determines K50,U based on the sum of the cleavage rates at all possible P1 positions in the full sequence.
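The first part of the unfolded state model operates on 9-residue windows around each candidate P1 position. The sketch below shows one way to extract these windows and score them; it assumes that positions beyond the sequence ends are filled with a blank ‘X’ symbol and that the PSSM is applied by summing one score per window position, which is the usual convention but is an assumption here. The function names and the toy PSSM are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY" + "X"  # 20 amino acids + blank

def p1_windows(sequence, left=4, right=4, blank="X"):
    """Return the P5..P4' window for every possible P1 position."""
    padded = blank * left + sequence + blank * right
    width = left + 1 + right  # nine positions; P1 is index 4
    return [padded[i:i + width] for i in range(len(sequence))]

def window_score(window, pssm):
    """Sum PSSM scores over the nine window positions (assumed convention)."""
    return sum(pssm[pos][aa] for pos, aa in enumerate(window))

windows = p1_windows("MKTAYIAK")

# Toy PSSM: all zeros except a favorable score for Lys at P1
pssm = [{aa: 0.0 for aa in AMINO_ACIDS} for _ in range(9)]
pssm[4]["K"] = 1.0
scores = [window_score(w, pssm) for w in windows]
print(windows[0], windows[-1], scores)
```

A table of this shape has 21 × 9 entries per protease, one for each symbol at each window position.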
To fit the PSSM, we assumed an identical normal prior distribution of scores at all positions, with several exceptions. Due to known critical importance of the P1 position, we used a wider prior distribution of scores for all amino acids in the P1 position for both proteases. We also used wider prior distributions at all positions (P5-P4’) for the amino acids Asp, Glu, and Pro, due to the established large effects of these amino acids on cutting rates.
For the local response function describing saturation of the cleavage rate at P1 site k, we used a logistic function: where SSk (site saturation) is the saturation of the cutting rate at site P1=k, aasite is the amino acid identity at site, and logistic is the logistic function f(x) = 1 / (1 + e^x). We fit the 21 (20 amino acids + ‘X’ representing empty sites) x 9 = 189 elements of the PSSM for each protease.
For the global response function (determining K50,U based on the sum of SSk across the full protein sequence), we use a sum of logistic functions with 10 different activation thresholds. where maxK50,U is the highest possible K50,U value (K50,U assuming no cut sites), Scale is the range of possible K50,U values, and thresholdl is the value of the lth activation threshold for the global response function. All K50 values (including maxK50,U) are in log10 molar units.
The key parameters of the unfolded state model (for a single protease) are the 21 x 9 =189 elements of the PSSM, the maxK50,U, the scale, and the 10 threshold values. These parameters determine K50,U for each sequence by Eqs. 9 and 10. In addition to these parameters, we also sample the ΔG values for each scrambled sequence during fitting. These sampled parameters (as well as the universal K50,F value for all sequences) are sufficient to determine a theoretical K50 value for each scrambled sequence by re-writing Fig. 1 Eq. 6: where fU is the fraction of unfolded molecules:
The input data for the model are the observed K50 values for all scrambled sequences. The parameters of the model are fit by assuming that all observed K50 values should agree (with small, normally distributed errors) with the theoretical K50 values determined by the model parameters. After fitting the model, we used the median of the posterior distributions of PSSM, maxK50,U, scale, and the 10 threshold parameters as the final model parameters. We used these final model parameters to calculate K50,U for all sequences in our experiments without considering any uncertainty from the model posterior distribution.
Procedure for fitting all data
Step 1: Estimation of ‘effective’ protease concentrations for each library
We employed four DNA oligonucleotide libraries for this study. Although we tried to minimize the difference between assay conditions, we also fit “effective” protease concentrations to our data in order to minimize batch-to-batch differences. We used the K50 model to perform this fitting and fit protease concentrations for trypsin and chymotrypsin entirely independently. The main assumption of this fitting is that each sequence should have the same K50 when assayed in different libraries. By enforcing that each sequence had a single K50 value regardless of what library it appears in, we calibrated the protease concentrations in each library against each other. Although we did not use universal control sequences in all four libraries, each library contained 1000 to 2000 sequences that overlapped at least one other library in a fully connected graph. Specifically, the library pairs 1+4, 2+4, 3+4, 1+2, and 2+3 each included 1,000 to 2,000 overlapping sequences.
The overall model included 96 experimental conditions (12 protease concentrations per replicate x 2 replicates x 4 libraries; one of the 12 protease concentrations was the fixed “no protease” starting condition). However, each sequence was only present in 48 of the 96 conditions because any individual sequence was only present in two out of the four libraries. The inputs to fit the model were the sequencing counts data, the reaction time (t), and the kmax constant. Additionally, to set the overall scale of the protease concentration series, we fixed the effective protease concentrations for Library 4 at the expected protease concentrations (i.e. three-fold serial dilutions of 25 μM protease (Replicate 1) or 43.3 μM protease (Replicate 2)). We also fixed all of the starting samples at zero protease. Using these model inputs, we sampled the K50 values (one per sequence), the remaining 66 protease concentrations, and the initial sequence distributions cDNA0 (a separate cDNA0 was used for each of the 8 replicates). Normal priors (with lower/upper boundaries for some parameters) covering the range of experimentally relevant values were used for the model parameters. Sampling was performed using the No U-Turn Sampler (NUTS) in Numpyro with 50 steps of equilibration and 25 steps of production. We used the medians of the protease concentrations from our 25 posterior samples as our final calibrated protease concentrations for all further analysis (discarding the uncertainties).
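The bookkeeping behind the number of fitted protease concentrations can be verified directly. This sketch only restates the counts given above; the variable names are illustrative.

```python
n_libraries, n_replicates, n_concentrations = 4, 2, 12

# Total experimental conditions across all libraries and replicates
n_conditions = n_libraries * n_replicates * n_concentrations

# One fixed "no protease" starting sample per replicate of each library
n_zero_protease = n_libraries * n_replicates

# Library 4 nonzero concentrations are fixed at their expected values
n_fixed_library4 = n_replicates * (n_concentrations - 1)

# Remaining concentrations are sampled during fitting
n_fitted = n_conditions - n_zero_protease - n_fixed_library4

# Each overlapping sequence appears in two of the four libraries
conditions_per_sequence = 2 * n_replicates * n_concentrations

print(n_conditions, n_fitted, conditions_per_sequence)
```

The result is 96 conditions in total, of which 8 are fixed at zero protease and 22 are fixed for Library 4, leaving 66 concentrations to be fit, with each sequence observed in 48 conditions.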
Step 2: Estimation of K50 values of scramble sequences
To train the unfolded state model, we need to determine K50 values for our scramble sequences, which were included in Library 2. We used the Independent K50 model for this step. The input data were the sequencing counts data from two replicates (i.e. 12 protease concentrations x 2 replicates = 24 data points per sequence), the reaction time (t), the kmax constant, and the effective protease concentrations obtained in Step 1. We sampled the initial sequence distribution cDNA0 (a separate cDNA0 for each replicate) and K50 for all sequences included in Library 2. Normal priors (with lower/upper boundaries for some parameters) covering the range of experimentally relevant values were used for the model parameters. Sampling was performed using the No U-Turn Sampler (NUTS) in Numpyro with 100 steps of equilibration and 50 steps of production.
Step 3: Construction of unfolded state model
We trained the unfolded state model for predicting K50,U using K50 values obtained in Step 2. The input sequences were scrambled sequences of wild-type domains selected for deep mutational screening. In addition to our set of exactly scrambled sequences (matching the wild-type amino acid composition 100%), we also included scrambled sequences containing 50%, 60%, 70%, 80%, and 90% of the number of hydrophobic amino acids in the original wild-type sequences. These sequences helped ensure the large majority of our scrambled pool was fully unfolded. Additionally, because all sequences in our experiments are padded with G/S/A linkers up to a constant length, we generated scrambled sequences using two different padding procedures. In the first approach, we designed scrambled sequences that matched the original wild-type length and were padded with G/S/A up to 72 amino acids. In the second approach, we designed 72 amino acid-length scrambles approximately matching the composition of an original wild-type domain, regardless of the length of that wild-type. These scrambled sequences required no additional padding. After measuring K50 for all scrambles, we only used sequences with a 95% confidence interval smaller than 0.5 log10 molar units for model fitting (64,238 sequences in total, see Fig. S3). In addition to the exact experimental sequences, we also augmented the training dataset with dummy sequences where GS linkers were replaced by the blank ‘X’ amino acid.
The inputs for the model are amino acid sequences created as described above, and their observed K50 for trypsin and chymotrypsin obtained in Step 2. The parameters of the model are fit by assuming that all observed K50 values should agree (with small, normally distributed errors) with the theoretical K50 values. In this model, we sampled the 21 x 9 = 189 elements of the PSSM, the site bias, the maxK50,U, the scale, and the 10 threshold values. These parameters determine K50,U for each sequence by Eqs. 9 and 10. In addition to these parameters, we also sampled the ΔG values for each scrambled sequence during fitting.
Normal priors (with lower/upper boundaries for some parameters) covering the range of experimentally relevant values were used for the model parameters. Using NUTS, we sampled the parameters described above, then reported the median of the 100 posterior samples after removing the initial 400 steps. In Step 4, we used these final model parameters to calculate K50,U for all sequences in our experiments without considering any uncertainty from the model posterior distribution.
Step 4: Prediction of unfolded K50 values (K50,U) across the full dataset
Using the final model parameters obtained in Step 3, we predicted K50,U values for each amino acid sequence in the libraries without considering any uncertainty. Additionally, since the model was constructed to predict unfolded K50 for sequences with 86 amino acids, we added a Gly linker ‘GGG’ to both ends, followed by padding by ‘X’ up to 86 amino acids.
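The padding step can be sketched as follows. The text does not state where the ‘X’ padding is placed, so this sketch assumes it is appended at the C-terminal end; the function name `pad_for_unfolded_model` is hypothetical.

```python
TARGET_LENGTH = 86  # input length expected by the unfolded state model
GLY_LINKER = "GGG"
PAD_CHAR = "X"      # blank amino acid


def pad_for_unfolded_model(sequence, target_length=TARGET_LENGTH):
    """Add a GGG linker to both ends, then pad with X up to target_length.

    Assumption: all X padding is appended at the C-terminal end.
    """
    core = GLY_LINKER + sequence + GLY_LINKER
    if len(core) > target_length:
        raise ValueError("sequence is too long for the unfolded state model")
    return core + PAD_CHAR * (target_length - len(core))


padded = pad_for_unfolded_model("A" * 72)
```

For a 72-residue sequence, the two linkers bring the length to 78, leaving 8 positions to fill with ‘X’.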
Step 5: Estimation of K50 values and calculation of ΔG for trypsin and chymotrypsin
We applied the Coupled K50 model to each of the four libraries separately. The inputs to the model are the sequencing count data from trypsin and chymotrypsin experiments (i.e. 12 protease concentrations x 2 replicates x 2 proteases = 48 data points per sequence), the effective protease concentrations obtained in Step 1, the reaction time, the kmax constant (t*kmax = 10^0.65 based on qPCR analysis; see Fig. S1), the K50,F constants (3 for trypsin, 2 for chymotrypsin; determined based on the dynamic range of the proteolysis experiment; see Fig. S5), and the K50,U values predicted by the unfolded model in Step 4. Using these inputs, we sampled ΔG shared between trypsin and chymotrypsin, and the initial sequence distribution cDNA0 for each protease for each replicate (although our experiments utilized the same batch of the cDNA-protein complex for two replicates).
Normal priors (with lower/upper boundaries for some parameters) covering the range of experimentally relevant values were used for the model parameters. Using NUTS in Numpyro, we sampled the posterior of the shared ΔG along with the other parameters, then obtained the median of the 50 posterior samples after removing the initial 100 steps. Full results from both the independent and coupled fitting procedures are provided in K50_dG_Dataset1_Dataset2.csv and K50_Dataset3.csv. For our stability parameters (protease-specific K50 in the independent procedure and ΔG in the coupled procedure) we report the median of the posterior distribution as well as the upper and lower limits of the 95% confidence interval (the 2.5%ile and 97.5%ile values of the posterior distribution).
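Reporting a median together with the 2.5th and 97.5th percentiles of the posterior samples can be illustrated with the standard library alone. This is a minimal sketch; the linear-interpolation percentile rule and the function names are assumptions, and the published pipeline may compute percentiles differently.

```python
import statistics


def percentile(sorted_values, q):
    """Percentile (q in [0, 100]) of a sorted list, by linear interpolation."""
    if not sorted_values:
        raise ValueError("no samples")
    position = (len(sorted_values) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1.0 - fraction) + sorted_values[upper] * fraction


def summarize_posterior(samples):
    """Median and 95% interval limits of a list of posterior samples."""
    ordered = sorted(samples)
    return {
        "median": statistics.median(ordered),
        "ci_lower": percentile(ordered, 2.5),
        "ci_upper": percentile(ordered, 97.5),
    }


# Toy example with evenly spaced "samples" from 0 to 100.
summary = summarize_posterior([float(i) for i in range(101)])
```

With only 50 posterior samples per parameter, the 2.5th and 97.5th percentiles fall between the two most extreme samples on each side, so the interval limits are themselves fairly noisy.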
We also applied the Independent K50 model to each of the four libraries separately. The inputs to the model are the sequencing count data (i.e. 12 protease concentrations x 2 replicates = 24 data points per sequence), the effective protease concentrations obtained in Step 1, the reaction time, and the kmax constant (t*kmax = 10^0.65 based on qPCR analysis; see Fig. S1). Using these inputs, we sampled K50 for each protease, and the initial sequence distribution cDNA0 for each protease for each replicate (although we utilized the same batch of the cDNA-protein complex for two replicates).
Normal priors (with lower/upper boundaries for some parameters) covering the range of experimentally relevant values were used for the model parameters. Using NUTS in Numpyro, we sampled the posteriors of K50 for trypsin and K50 for chymotrypsin along with the other parameters, then obtained the median of the 50 posterior samples after removing the initial 100 steps.
Then, we computed protease-specific ΔG values using the protease-specific K50 values from the Independent model. We do this using the same K50,F and K50,U values used in the coupled procedure according to Fig. 1 Eqs. 5 and 7. These protease-specific ΔG estimates are also reported in K50_dG_Dataset1_Dataset2.csv and K50_Dataset3.csv, and are only used to examine the consistency between different proteases (e.g. Fig. 1F and Fig. 2D). In some cases, the independently fit K50 values can lead to impossible values for ΔG. This can occur if K50 is higher than K50,F (observed cleavage is slower than our limit for cleavage in the folded state) or if K50 is lower than K50,U (observed cleavage is faster than predicted cleavage in the unfolded state). If the median protease-specific K50 or the confidence interval limits for a particular sequence lead to impossible ΔG values for that sequence, we reported dummy values for the corresponding protease-specific ΔG estimates.
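The conversion from K50 to ΔG, including the handling of impossible values, can be sketched as below. The formula is the two-state relation obtained when folded and unfolded states are in rapid equilibrium and share the same kmax; it is written here as an assumption and may differ in form from Eqs. 5 and 7 in Fig. 1. All K50 inputs are assumed to be on a linear (not log10) concentration scale, the temperature of 298.15 K is an assumed default, and `None` stands in for the dummy value.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)


def delta_g_from_k50(k50, k50_unfolded, k50_folded, temperature_k=298.15):
    """Two-state unfolding free energy (kcal/mol) from an observed K50.

    Returns None when K50 lies outside (K50,U, K50,F), where no finite
    ΔG exists. Assumes a shared kmax for the folded and unfolded states.
    """
    if not (k50_unfolded < k50 < k50_folded):
        return None
    # Ratio of unfolded to folded molecules at equilibrium.
    k_unfold = (1.0 - k50 / k50_folded) / (k50 / k50_unfolded - 1.0)
    return -R_KCAL * temperature_k * math.log(k_unfold)


# Hypothetical values on a linear scale, chosen so that [U] = [F].
dg_midpoint = delta_g_from_k50(2.0 / 1.01, k50_unfolded=1.0, k50_folded=100.0)
dg_too_slow = delta_g_from_k50(150.0, k50_unfolded=1.0, k50_folded=100.0)
```

In this form ΔG grows without bound as K50 approaches K50,F, which is why measurements near the folded-state limit carry large uncertainties.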
The actual number of sequencing counts, as well as the number of counts predicted for all sequences at all concentrations according to the fitted model parameters, are given in Raw_NGS_count_tables.zip and Pipeline_K50_dG.zip.
Data selection for Fig. 1E and F
We show all data from Library 3 within the range −2 < ΔG < 5 kcal/mol & log10_K50_trypsin < 1.75 & log10_K50_chymotrypsin < 2.25. We then overlaid the wild-type and four mutants of Protein G measured in Library 2.
Replicate analysis of K50 (Fig. 1E)
Instead of sampling K50 values using 24 samples per protease at one time as described in Step 5 above, we sampled K50 values using one experiment set (i.e. 12 samples) and obtained K50 for trypsin replicate 1 and 2, and chymotrypsin replicate 1 and 2. Note that we still used the calibrated protease concentrations to improve consistency between replicates. The replicates were conducted on different days using the same preparation of the protein-cDNA complex.
Classification of Datasets #1, #2, and #3 based on the quality of the data (For Fig. 2)
All mutational scanning data was classified into nine groups (0 through 8) according to the protocol in Fig. S8. We determined that a mutational scan was high quality (suitable for Dataset #2) if there was minimal missing data, minimal low confidence data, an appropriate slope, intercept, and correlation between the trypsin and chymotrypsin samples, sufficient wild-type stability, and the mutational scan did not include an unusual fraction of stabilizing mutations suggesting poor folding. For inclusion in the smaller Dataset #1, we additionally required that the wild-type stability was lower than 4.5 kcal/mol so that stabilizing mutations could still fall within the assay’s dynamic range. These sequences are considered “Group 0”; the remaining sequences in Dataset 2 are considered “Group 1”. Double mutant sequences were included in Datasets 1 and 2 based on whether the original wild-type mutational scan was included in that dataset.
All sequences in Dataset 1 and Dataset 2 are included in K50_dG_Dataset1_Dataset2.csv. All sequences in this file have an inferred ΔG estimate, but only sequences in Dataset 1 have a tabulated ΔΔG estimate. One can calculate ΔΔG for the remaining sequences in Dataset 2, but these ΔΔG values will be biased toward destabilizing mutations because stabilizing mutations would typically be indistinguishable from the wild-type stability. Note that Datasets 1 and 2 include a small number of sequences with low quality data because these sequences come from mutational scans that are high quality overall. Although these tables include all K50, ΔG, and ΔΔG data (for Dataset 1), low quality data have been filtered out and replaced by a – symbol in the columns labeled “_ML” (for machine learning).
The remaining groups were defined this way:
Group 2: The wild-type protein is too unstable to see sequence-stability relationships.
Group 3: Poor expression (low counts in next-generation sequencing) for the assay.
Group 4: Very few destabilizing mutations, suggesting aggregation and/or molten globule formation.
Group 5: The wild-type is too stable to see consistency between trypsin and chymotrypsin.
Groups 6 and 7: Low agreement between trypsin and chymotrypsin due to the absence of aromatic amino acids (i.e. chymotrypsin cleavage sites) or the presence of protease recognition sequences in the linker region.
Group 8: Did not fit into groups 2-7, but did not pass the quality metrics for groups 0 and 1.
Dataset #3 includes all data combined (Groups 0-8), even the data from Groups 2-8 that were excluded from Datasets 1 and 2. Although many of the K50 values from Groups 2-8 likely reflect factors other than folding stability (e.g. aggregation, low expression, etc.), these data can still be used to train models that directly predict K50. Again, a small fraction (~4%) of the K50 values in Dataset #3 are low confidence and have been replaced by a – symbol in the “_ML” columns.
Principal component analysis (related to Fig. 3)
We performed principal component analysis to determine the factors influencing the stability of different amino acids. To this end, we utilized 15,440 sites in the 337 domains classified as Group 0 above. All folding stability data were clipped to the range from −1 to 5 kcal/mol because folding stability outside the dynamic range is not reliable, and then the average stability of the 20 amino acids at each site was subtracted from the data. Using these data, we performed PC analysis using the scikit-learn library in Python 3.
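The preprocessing that precedes the PCA (clipping to the dynamic range, then centering each site) can be sketched in plain Python. The function names are hypothetical, and the input is assumed to be the list of 20 ΔG values measured at one site.

```python
def clip(value, low=-1.0, high=5.0):
    """Restrict a ΔG value (kcal/mol) to the assay's dynamic range."""
    return max(low, min(high, value))


def preprocess_site(delta_g_by_aa):
    """Clip the ΔG values at one site, then subtract the site mean."""
    clipped = [clip(value) for value in delta_g_by_aa]
    site_mean = sum(clipped) / len(clipped)
    return [value - site_mean for value in clipped]


# Hypothetical site: one very destabilizing and one very stabilizing residue.
site = [-3.0] + [2.0] * 18 + [9.0]
centered = preprocess_site(site)
```

Stacking one such 20-element row per site gives a 15,440 x 20 matrix, which could then be passed to `sklearn.decomposition.PCA`. Centering each site removes the overall stability of the site, so the components describe which amino acids are preferred there rather than how stable the domain is.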
Side chain contacts and burial analysis (Fig. 3D and 6H)
Burial values and contact counts were computed based on AlphaFold models (18) of all sequences using the included script Burial_side_chain_contact_Fig3_Fig6.ipynb (based on Bio.PDB (103) and BioPython (104)). The calculation is based on the Rosetta “sidechain_neighbors” LayerDesign method previously reported (29). Briefly, to calculate the burial or contacts of residue X, we added up the number of residues in a cone projecting out 9 Å away from the Cβ atom on residue X in the direction of the residue X Cα-Cβ vector. “Burial” (Fig. 6H) indicates the number of Cα atoms in the cone. The contact counts (Fig. 3D) each count different atoms inside the cone: “Side chain contact count” counts all Cβ atoms; “Aromatic side chain contact count” counts all CE2 atoms of Phe, Tyr, and Trp; “Acidic side chain contact count” counts all Glu OE1 and Asp OD1 atoms; and “Basic side chain contact count” counts all Lys NZ and Arg NE atoms.
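The cone-based counting can be illustrated with a hard-cutoff sketch. Only the 9 Å length comes from the text; the cone half-angle (`half_angle_deg`) is a hypothetical parameter chosen for illustration, and the actual Rosetta-based calculation may weight neighbors smoothly rather than with hard cutoffs.

```python
import math


def count_in_cone(ca, cb, targets, max_distance=9.0, half_angle_deg=60.0):
    """Count target atoms inside a cone that starts at cb and points along ca->cb.

    ca, cb: (x, y, z) coordinates of the residue's C-alpha and C-beta atoms.
    targets: iterable of (x, y, z) coordinates of the atoms to count.
    half_angle_deg is an assumed value, not taken from the source.
    """
    axis = [b - a for a, b in zip(ca, cb)]
    axis_norm = math.sqrt(sum(c * c for c in axis))
    cos_limit = math.cos(math.radians(half_angle_deg))
    count = 0
    for atom in targets:
        offset = [t - b for t, b in zip(atom, cb)]
        distance = math.sqrt(sum(c * c for c in offset))
        if distance == 0.0 or distance > max_distance:
            continue  # skip the atom itself and atoms beyond the cone length
        cos_angle = sum(o * x for o, x in zip(offset, axis)) / (distance * axis_norm)
        if cos_angle >= cos_limit:
            count += 1
    return count


# Toy coordinates: two atoms in front of the side chain, one behind, one too far.
neighbors = [(5.0, 0.0, 0.0), (3.0, 1.0, 0.0), (-5.0, 0.0, 0.0), (1.0, 20.0, 0.0)]
n_in_cone = count_in_cone((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), neighbors)
```

Passing Cα coordinates as `targets` would give a burial-like count, while passing Cβ, CE2, OE1/OD1, or NZ/NE coordinates would give the corresponding contact counts.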
Secondary structure determination (Fig. 3D)
Using the DSSP algorithm (105, 106), we obtained secondary structure information based on AlphaFold models.
Selection method of site pairs for double mutational analysis (related to Fig. 4)
Double mutants were selected for analysis in two ways. First, we manually selected polar interactions where either amino acid appeared important for stability in single mutational analysis. These pairs were mainly included in Library 3. Second, we used the program confind (107, 108) to identify interacting residues. All confind pairs with notable interactions such as polar interactions and cation-π interactions were selected, along with a randomly chosen subset of more common interactions such as hydrophobic interactions. These pairs were included in Library 4.
Thermodynamic coupling analysis (related to Fig. 4)
Thermodynamic coupling refers to the change in folding stability due to the interaction between two amino acids after removing folding stability effects from each amino acid individually. To determine this “nonadditivity”, we first modeled our double mutant data using a fully additive model (no thermodynamic coupling). The deviations from this model then reveal the thermodynamic coupling. Our additive model assumes that the absolute stability (ΔG) of each sequence is the sum of an amino acid-dependent term for site one (ΔG1) and an amino acid-dependent term for site two (ΔG2):
$$\Delta G_{expected}(aa_1, aa_2) = \Delta G_1(aa_1) + \Delta G_2(aa_2) \tag{13}$$
The forty site-specific terms (one ΔG1 term for each amino acid at site one and one ΔG2 term for each amino acid at site two) are not experimentally measurable; they are inferred based on minimizing the error of the additive model. We used Bayesian inference to infer the forty ΔG1 and ΔG2 terms for each set of mutants. The inputs to fit the model were the observed 400 ΔG values (20 amino acids at site one x 20 amino acids at site two) for a particular site pair. Using NUTS, we sampled ΔG1 and ΔG2 by assuming that the 400 observed ΔG values should agree (with small, normally distributed errors) with the expected ΔG values determined by eq. 13. Both expected and observed ΔG values were clipped to the range of −1 to 5 kcal/mol. We used 100 steps of burn-in and used the median of 50 posterior samples as the final values of the ΔG1 and ΔG2 terms. Using these terms, we calculated the expected (additive) ΔG for each sequence, and then the thermodynamic coupling as the difference between the observed and expected stability:
$$\text{thermodynamic coupling}(aa_1, aa_2) = \Delta G_{observed}(aa_1, aa_2) - \Delta G_{expected}(aa_1, aa_2) \tag{14}$$
To calculate the uncertainty in the thermodynamic coupling, we re-fit the additive model 50 times by bootstrap resampling of the 400 observed ΔG values. This ensures the ΔG1 and ΔG2 terms are not overly dependent on a single experimental measurement. The model fitting code is provided in Additive_model_Fig4.ipynb.
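To make the additive model concrete, the sketch below fits it by ordinary least squares on a complete table, where the solution is simply row mean plus column mean minus grand mean. This is a swapped-in technique for illustration: it is not the Bayesian NUTS fit described above, and it ignores the clipping to the −1 to 5 kcal/mol range. The function name `fit_additive` is hypothetical.

```python
def fit_additive(observed):
    """Least-squares additive fit of a complete table of ΔG values.

    observed[i][j] is the ΔG with amino acid i at site one and j at site two.
    Returns (dg1, dg2, expected, coupling), where coupling = observed - expected.
    The split of the grand mean between dg1 and dg2 is arbitrary; here it is
    carried entirely by dg1.
    """
    n_rows, n_cols = len(observed), len(observed[0])
    grand_mean = sum(map(sum, observed)) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in observed]
    col_means = [sum(row[j] for row in observed) / n_rows for j in range(n_cols)]
    dg1 = row_means
    dg2 = [c - grand_mean for c in col_means]
    expected = [[dg1[i] + dg2[j] for j in range(n_cols)] for i in range(n_rows)]
    coupling = [
        [observed[i][j] - expected[i][j] for j in range(n_cols)]
        for i in range(n_rows)
    ]
    return dg1, dg2, expected, coupling


# Toy 3 x 3 table that is exactly additive, then with one coupled pair.
site1 = [0.0, 1.0, 2.0]
site2 = [0.0, 0.5, 1.5]
table = [[a + b for b in site2] for a in site1]
_, _, _, coupling_additive = fit_additive(table)
table[0][0] += 0.9
_, _, _, coupling_perturbed = fit_additive(table)
```

In the toy example, a 0.9 kcal/mol interaction added to a single cell shows up as a coupling of only 0.4 kcal/mol, because the additive terms absorb part of it. With 20 x 20 tables this absorption is much smaller, and the bootstrap resampling described above probes how strongly the fitted terms depend on individual measurements.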
Wild-type amino acid prediction model (related to Fig. 5)
The classification model in Fig. 5 used a sum of logistic functions with learned amplitudes to define the weighting function. The overall model is defined below: where p(aa) is the probability of amino acid aa, softmax is the softmax function, logistic is the logistic function f(x) = 1 / (1 + e^x), i indexes the 100 logistic functions defining the weighting function, amp is the learned vector describing the amplitudes of the logistic functions, threshold is the vector describing the centers of the logistic functions, steepness defines the steepness of the logistic functions, and offset is the learned vector (length 19 for the 19 non-Cys amino acids) describing the absolute probability offset for each amino acid.
We used Bayesian inference to infer the amp vector (length 100) and offset vector (length 19 for the 19 non-Cys amino acids). The logistic threshold vector was fixed at 100 evenly spaced points between −2 and 7 kcal/mol. The steepness term was fixed at 5. The inputs to fit the model were the observed ΔG values and the wild-type amino acid identities for each site within the natural protein domains. Using NUTS, we sampled amp and offset by assuming that the observed wild-type amino acids were randomly chosen at each site according to the predicted probability distribution for that site, calculated according to eq. 15. We then reported the median and the standard deviation of 100 posterior samples after removing the initial 500 steps. The fitting script is included in Classification_model_Fig5.ipynb.
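One way the listed ingredients could be combined is sketched below. The way the pieces fit together is an assumption made for illustration: each amino acid's score is taken to be its offset plus a weighting function of its ΔG, and the scores are passed through a softmax. The function names are hypothetical, and the sketch uses the increasing logistic 1 / (1 + e^(−x)); using the decreasing form instead only flips the sign of the amplitudes, because the resulting constant shift is identical for all amino acids and cancels in the softmax.

```python
import math

N_LOGISTICS = 100
# 100 evenly spaced centers between -2 and 7 kcal/mol (fixed, not learned).
THRESHOLDS = [-2.0 + 9.0 * i / (N_LOGISTICS - 1) for i in range(N_LOGISTICS)]
STEEPNESS = 5.0


def logistic(x):
    """Numerically stable increasing logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def weight(delta_g, amp):
    """Weighting function: amplitude-weighted sum of shifted logistics."""
    return sum(a * logistic(STEEPNESS * (delta_g - t)) for a, t in zip(amp, THRESHOLDS))


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def wild_type_probabilities(delta_g_by_aa, amp, offset):
    """Assumed model form: p(aa) = softmax(weight(ΔG_aa) + offset_aa)."""
    return softmax([weight(dg, amp) + off for dg, off in zip(delta_g_by_aa, offset)])


# Hypothetical site: 18 amino acids at 0 kcal/mol and one at 4 kcal/mol.
probs = wild_type_probabilities([0.0] * 18 + [4.0], [0.05] * 100, [0.0] * 19)
```

A sum of 100 fixed-position logistics acts as a flexible, piecewise-smooth function of ΔG, so the fit can learn how strongly stability at each level influences the wild-type amino acid probability without assuming a particular functional shape.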
GEMME analysis (related to Fig. 6)
To calculate the “Normalized averaged GEMME score”, which represents the sensitivity of a wild-type amino acid to substitutions inferred from evolutionary information (“ΔΔE” in the previous reports (70, 71)), we ran GEMME (78) on each natural amino acid sequence using the default parameters. We computed a single score for each site by averaging the scores of the 19 amino acids (except Cys), and then standardized each domain individually (subtracted the domain’s mean and divided by the domain’s standard deviation) so that the site scores within a domain had a mean of zero and a standard deviation of one. Finally, we flip the sign of the score so that positive values imply high susceptibility to mutations (i.e. very negative raw GEMME scores for non-wild-type amino acids). We define this standardized score for each site as the “Normalized GEMME score”. To build the input multiple sequence alignments, we performed five iterations of the profile HMM homology search tool Jackhmmer (109, 110) against the UniRef100 database of non-redundant proteins (111) using the EVcouplings framework (112). We used the default bitscore threshold of 0.5 bit per residue.
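The per-domain normalization of the GEMME scores can be sketched as follows. The input layout (one list of 19 raw scores per site) and the function name are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics


def normalized_gemme_scores(site_scores):
    """Average, standardize within one domain, and flip the sign.

    site_scores: one list per site, each holding the raw GEMME scores of the
    19 amino acids considered at that site.
    Assumption: the population standard deviation is used for standardization.
    """
    site_means = [statistics.fmean(scores) for scores in site_scores]
    domain_mean = statistics.fmean(site_means)
    domain_sd = statistics.pstdev(site_means)
    if domain_sd == 0.0:
        raise ValueError("all sites have the same average score")
    # Sign flip: very negative raw scores become large positive values.
    return [-(mean - domain_mean) / domain_sd for mean in site_means]


# Toy domain with three sites of increasing mutational sensitivity.
toy_domain = [[-1.0] * 19, [-3.0] * 19, [-5.0] * 19]
scores = normalized_gemme_scores(toy_domain)
```

Because each domain is standardized separately, the scores rank sites within a domain and are not directly comparable as absolute quantities between domains.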
Supplementary Text
Derivation of eq.3 in Fig. 1B
We modeled the cleavage events, where protease enzymes (E) and protein substrates (S) form an ES complex to produce cleaved protein products (P). The goal is to obtain a product formation equation in terms of the total product, the initial enzyme and substrate concentrations, and the kinetic constants.
We also defined the equilibrium constant K50:
Based on model (1), we can obtain the following dynamic formulas:
The first two of these are assumed to be at quasi-steady state. The following are additional conservation equations for substrate-product and enzyme: where [S0] is the initial amount of substrate.
Additionally, the reaction conditions in the study were not substrate-excessive but enzyme-excessive (because [E] >> [ES] or [S]):
Using eqs. 1’, 19, and 20, the following can be derived to find an expression for the enzyme-substrate complex in terms of the initial substrate and enzyme concentrations:
Substituting eq. 22 into eq. 18 and using the approximation [Etotal] ≈ [E], an expression for the dynamics of product formation in terms of the enzyme and substrate concentrations can be found:
Thus, the observed kinetic rate is as follows (this is eq. 3 in Fig. 1B):
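The steps above can be summarized in a compact sketch. The symbols k_max and k_obs and the exact form of each line are assumptions made for illustration; they follow from the single-substrate scheme, the definition of K50 as the constant relating [E], [S], and [ES], and the enzyme-excess approximation, and the lines are not numbered to match the equation numbers in the text.

```latex
\begin{align*}
&E + S \;\rightleftharpoons\; ES \;\xrightarrow{\;k_{max}\;}\; E + P,
  \qquad K_{50} = \frac{[E][S]}{[ES]} \\
&[S_0] = [S] + [ES] + [P],
  \qquad [E_{total}] = [E] + [ES] \approx [E] \\
&[ES] = \frac{[E]\,\bigl([S_0] - [P]\bigr)}{K_{50} + [E]} \\
&\frac{d[P]}{dt} = k_{max}[ES]
  = \frac{k_{max}[E]}{K_{50} + [E]}\,\bigl([S_0] - [P]\bigr) \\
&k_{obs} = \frac{k_{max}[E]}{K_{50} + [E]},
  \qquad \frac{[S_0] - [P]}{[S_0]} = e^{-k_{obs}\,t}
\end{align*}
```

In this form K50 is the enzyme concentration at which cleavage proceeds at half of its maximal rate, and the fraction of intact substrate decays as a single exponential in time.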
Derivation of eq. 6 and eq. 7 in Fig. 1B
We modeled the cleavage events, where protease enzymes (E) and folded substrates (F) or unfolded substrates (U) form an FE or UE complex to produce cleaved protein products (PF or PU). The goal is to obtain a product formation equation in terms of the total product, the initial enzyme and substrate concentrations, and the kinetic constants. We follow a similar derivation to that above for a single enzyme/substrate: where kf and ku are the rate constants for cleavage of the bound folded and unfolded substrates, respectively. Assuming that the binding and unbinding reactions and the folding and unfolding transitions are in quasi-equilibrium, eqs. 23, 24, and 25 hold throughout the time course.
We write an equation for the overall product formation:
Conservation equations for substrate-product and enzyme in this case are: where [S0] is the initial and total concentration of substrate and [E0] is the initial concentration of enzyme.
Step 1: Write product formation eq. 26 in terms of [FE] and constants only, by substituting for the [UE] complex
Step 2: Replace the [FE] dependence with a ([S0] - [PF] - [PU]) dependence using conservation laws
Thus, we get an equation which describes the dependence of [FE] on initial substrates and products, with terms in the denominator that capture sequestration in intermediate bound states.
Substituting this into the product formation equation:
Then, we defined [Ptotal] = [PF] + [PU].
Because the reaction conditions in the study were not substrate-excessive but enzyme-excessive (i.e. [E] >> [S] or [ES]), [E] ≈ [E0]:
Finally, we can rewrite the product formation eq. 3 in terms of the initial substrate concentration, the total product, and an observed kinetic rate, which is a function of the kinetic rates and the initial enzyme concentration:
Step 3: Derivation of eq. 6 and eq. 7 in Fig. 1B
By comparing eq. 32 with eq. 3’, we can derive the following equations (including eq. 6 in Fig. 1B):
Using eq. 34, we rewrite the formula for KUF in terms of the half-max reaction rates:
Thus, eq. 7, which gives the ratio of unfolded to folded substrate, is derived:
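The two-state derivation above can be sketched in the same compact form. This sketch assumes that the folded and unfolded complexes are cleaved with the same rate constant (kf = ku = k_max), that K_UF denotes the ratio [U]/[F], and that [E] ≈ [E0]; the resulting expressions are therefore an illustration under these assumptions and may differ in form from eqs. 6 and 7 in Fig. 1B.

```latex
\begin{align*}
&K_{UF} = \frac{[U]}{[F]},
  \qquad K_{50,F} = \frac{[F][E]}{[FE]},
  \qquad K_{50,U} = \frac{[U][E]}{[UE]} \\
&[S_0] - [P_{total}] = [F] + [U] + [FE] + [UE]
  = [F]\left(1 + K_{UF} + \frac{[E]}{K_{50,F}} + \frac{K_{UF}[E]}{K_{50,U}}\right) \\
&\frac{d[P_{total}]}{dt} = k_{max}\bigl([FE] + [UE]\bigr)
  = k_{max}[E][F]\left(\frac{1}{K_{50,F}} + \frac{K_{UF}}{K_{50,U}}\right) \\
&k_{obs} = \frac{k_{max}[E]}{K_{50} + [E]},
  \qquad K_{50} = \frac{1 + K_{UF}}{\dfrac{1}{K_{50,F}} + \dfrac{K_{UF}}{K_{50,U}}} \\
&K_{UF} = \frac{1 - K_{50}/K_{50,F}}{K_{50}/K_{50,U} - 1},
  \qquad \Delta G = -RT \ln K_{UF}
\end{align*}
```

The last line has a finite, real solution only when K50,U < K50 < K50,F, which matches the two cases of impossible ΔG values described in Step 5.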
Files for Supplementary Materials
Raw_NGS_count_tables.zip
NGS_count_lib1.csv
NGS_count_lib2.csv
NGS_count_lib3.csv
NGS_count_lib4.csv
K50_dG_tables.zip
K50_dG_lib1.csv
K50_dG_lib2.csv
K50_dG_lib3.csv
K50_dG_lib4.csv
Processed_K50_dG_datasets.zip
K50_dG_Dataset1_Dataset2.csv
K50_Dataset3.csv
Single_DMS_list.csv
Double_DMS_list.csv
Triple_DMS_list.csv
Heat_maps_single_DMS.pdf
Heat_maps_double_DMS.pdf
Data_tables_for_figs.zip
Pipeline_qPCR_data.zip
Raw_qPCR_data_FigS1.csv
Process_qPCR_data.ipynb
Pipeline_K50_dG.zip
STEP1_module.ipynb
STEP1_run.ipynb
STEP2_run.ipynb
STEP3_run.ipynb
STEP4_module.ipynb
STEP4_run.ipynb
STEP5_module.ipynb
STEP5_run.ipynb
Raw_NGS_counts_overlapped_seqs_STEP1_lib1_lib2.csv
Raw_NGS_counts_overlapped_seqs_STEP1_lib2_lib3.csv
Raw_NGS_counts_overlapped_seqs_STEP1_lib1_lib4.csv
Raw_NGS_counts_overlapped_seqs_STEP1_lib2_lib4.csv
Raw_NGS_counts_overlapped_seqs_STEP1_lib3_lib4.csv
K50_scrambles_for_STEP3.csv
STEP1_out_protease_concentration_trypsin
STEP1_out_protease_concentration_chymotrypsin
STEP3_unfolded_model_params
Pipeline_figure_model.zip
AlphaFold_model_PDBs.zip
Blueprints_for_EEHH.zip
eehh_EA_GBB_AGBB.bp
eehh_GG_GBB_AGBB.bp
eehh_XX_XXX_XXXX.bp
Acknowledgments
We thank Epsilon Molecular Engineering (EME) Corp for providing us with cnvK linker for cDNA display, Rush University and Genome Research Core at University of Illinois Chicago for performing next-generation sequencing, and David Minh, Timothy Whitehead, Kresten Lindorff-Larsen, David M. McCandlish, Jack Maguire, John Chodera, Parisa Hosseinzadeh, and the members of the Rocklin lab for discussions and comments on the manuscript.
Footnotes
We added new authors and a new methods section, and modified the main text and the legend of Fig. 6.
https://github.com/Rocklin-Lab/cdna-display-proteolysis-pipeline