Abstract
While bioinformatics reveals patterns in protein sequences and structural biology methods elucidate atomic details of protein structures, it is difficult to attain equally high-resolution energetic information about protein conformational ensembles. We present PIGEON-FEATHER, a method for calculating free energies of opening (ΔGop) at single- or near-single-amino acid resolution for protein ensembles of all sizes from hydrogen exchange/mass spectrometry (HX/MS) data. PIGEON-FEATHER disambiguates and reconstructs all experimentally measured isotopic mass envelopes using a Bayesian Monte Carlo sampling approach. We applied PIGEON-FEATHER to reveal how E. coli and human dihydrofolate reductase orthologs (ecDHFR, hDHFR) have evolved distinct ensembles tuned to their catalytic cycles, and how two competitive inhibitors of ecDHFR arrest its ensemble in different ways. Extending the method to a large protein-DNA complex, we mapped ligand-induced ensemble reweighting in the E. coli lac repressor to understand the functional switching mechanism crucial for transcriptional regulation.
Proteins can perform biological functions because they adopt conformational ensembles: sets of rapidly interconverting conformational states with different populations and energies.1–3 For a given process, the thermodynamic ensemble of a protein or residue has an ensemble energy determined by the distribution of conformational states that it occupies. To respond to molecular perturbations like ligand binding, proteins reweight their ensembles. Disease-causing mutations disrupt ensembles. Because of this key relationship between a protein’s ensemble and function, measuring ensemble energies at amino acid resolution is crucial to deeply understanding protein behavior.
However, biophysical techniques are limited in their capacity to measure ensemble energies. X-ray crystallography and cryo-electron microscopy (cryo-EM) capture only the lowest-energy conformational states of proteins and do not report ensemble energies. Nuclear magnetic resonance (NMR) spectroscopy monitors shifts among conformational states but is limited by experimental throughput and protein size. Molecular dynamics (MD) simulations can reveal the details of the earliest stages in a conformational transition, but are constrained by simulation timescales, inaccuracies in force fields, and the availability of structural data.
We set out to measure amino acid-level ensemble energies using hydrogen exchange with mass spectrometry (HX/MS). With the advent of integrated robotics4,5 and accurate protein structure prediction methods6,7, HX/MS presents a powerful strategy for probing protein ensembles in solution. HX/MS measures the rate of exchange of labile hydrogen atoms with deuterium in a protein’s backbone amide groups to report changes in its ensemble at the peptide level (Fig. 1a).8 For hydrogen exchange to occur in a folded protein at equilibrium (EX2 conditions, see section 1 of the SI), the protein must transiently expose its backbone amide hydrogens to solvent, which incurs an energetic cost for each exchangeable site: the free energy of opening, or ΔGop (Fig. 1b).9 The ΔGop is an ensemble energy derived from the hydrogen exchange rate.
Because of a lack of methods to determine exchange rates for each amino acid, HX/MS data are typically interpreted as relative differences in total hydrogen exchange between states (e.g., apo vs. holo proteins) for selected peptides. This approach is flawed for three reasons. First, total hydrogen exchange is biased by the timecourse, because exchange half-lives can span years but the experiment is performed within hours.10 Second, the effects of protein sequence on exchange rates are ignored. The ΔGop depends on the intrinsic rate of exchange for each amino acid, or how it exchanges in an unstructured polypeptide (Fig. 1b), which depends on its own and its neighbors’ identities11; but the peptide resolution of HX/MS data precludes assignments of amino acid-specific exchange rates. Third, the manner of curating, combining, and visualizing the peptide-level total hydrogen exchange data influences how we understand protein ensembles.12,13
We therefore developed PIGEON-FEATHER (Peptide ID Generation for Exchange Of Nuclei - Free Energy Assignment Through Hydrogen Exchange Rates), an HX/MS analysis method to assign ΔGop for almost all the amino acids in a protein (Fig. 1c). PIGEON-FEATHER disambiguates peptides based on tandem MS fragment coverage and fits hydrogen exchange rates via Bayesian sampling to reconstruct full isotopic mass envelopes, capitalizing on information-rich raw HX/MS data for ΔGop determination. Benchmarking shows that PIGEON-FEATHER calculates exchange rates with near-perfect accuracy, consistently outperforming existing methods. We applied the method to bacterial and human enzyme orthologs to quantify how evolutionary divergence in ΔGop permits catalysis in different organisms, and we measured how two inhibitors differentially impact the bacterial enzyme’s ensemble to discern why only one of them is effective in the human ortholog. Finally, we used PIGEON-FEATHER to connect ligand-induced allosteric changes in the regulatory domain of a large transcription factor with local unfolding in its DNA-binding domain.
Results
Peptide disambiguation using PIGEON
In hydrogen-deuterium (H-D) exchange/MS experiments, following the exchange timecourse, proteins are proteolytically digested and the resulting peptides are measured for their level of D uptake. We first identify each MS peak by performing an MS/MS experiment on the undeuterated protein and matching the monoisotopic masses and y- and b-ion fragments to theoretical masses based on the protein sequence. The observed mass, charge, and chromatographic retention time are used to identify the corresponding peaks in the deuterated sample spectra. However, the combination of monoisotopic mass and fragment coverage is frequently inadequate to uniquely specify one peptide (Fig. 2a-c).14–16 This complicates the analysis because considering only peptides with strong fragment coverage can reduce sequence coverage, but including less well-supported peptides introduces the risk of spurious identifications (Fig. 2d).
PIGEON enables accurate peptide identification (Fig. 2e). First, PIGEON removes systematic errors from e.g. insufficient calibration (Figs. 2e, ii, S1a-c). Next, PIGEON removes all duplicate matches except for one consensus peak per charge state per peptide by identifying those with the highest fragment ion coverage and/or peak signal (Fig. S1d). Finally, PIGEON corrects double assignments in the peptide list by checking for their monoisotopic m/z and retention times, discarding those without fragment support (Fig. 2e, iii, 2g). For co-eluting peptides where both mass duplicates have strong fragment support, the user may decide to keep all, none, or only ambiguous peptides with substantial sequence overlap (Fig. 2e, iv, 2h). In this way, PIGEON allows flexible treatment for peptides for which the true assignment is ambiguous and the proportion of each peptide in the signal unknown, particularly those that are identical up to a small sequence difference. PIGEON uses sequence overlap as a discriminator because highly overlapping peptides typically report on the same set of amino acids, so uptake behavior is likely to be consistent. The remaining assignments form a peptide pool for HX analysis using FEATHER (Fig. 2e, v, 2h).
We evaluated how PIGEON impacts peptide assignments for HX/MS on E. coli dihydrofolate reductase (ecDHFR), an essential enzyme that reduces 7,8-dihydrofolate (DHF) to 5,6,7,8-tetrahydrofolate (THF) via NADPH-dependent hydride transfer. There are 2,107 possible peptides of length 3-15 amino acids in the ecDHFR sequence, many of which are degenerate in mass, especially when considering instrument sensitivity limitations (Fig. 2a, b). We assessed the impact of each filtering step and each level of disambiguation stringency for co-eluting peptides on the number of assigned peaks from MS/MS (Fig. 2h, Table S1). In one case, for example, we discarded 152 of 747 initial peptide assignments as duplicates.
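The mass degeneracy described above can be illustrated with a minimal sketch that enumerates every contiguous fragment of a sequence and groups fragments of near-identical neutral mass. The function names, the neighbor-chaining rule, and the 7 ppm default are assumptions made for illustration and are not taken from the PIGEON implementation; the residue masses are standard monoisotopic values.

```python
# Standard monoisotopic residue masses in Da; one water is added per peptide.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def enumerate_peptides(sequence, min_len=3, max_len=15):
    """Return (start, end, subsequence) for every contiguous fragment; end is exclusive."""
    fragments = []
    for length in range(min_len, max_len + 1):
        for start in range(len(sequence) - length + 1):
            fragments.append((start, start + length, sequence[start:start + length]))
    return fragments


def monoisotopic_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def degenerate_groups(sequence, tol_ppm=7.0):
    """Chain together fragments whose mass is within tol_ppm of the previous one
    in mass-sorted order (hypothetical grouping rule)."""
    ranked = sorted(
        (monoisotopic_mass(p), s, e, p) for s, e, p in enumerate_peptides(sequence)
    )
    groups, current = [], []
    for item in ranked:
        if current and (item[0] - current[-1][0]) / current[-1][0] * 1e6 > tol_ppm:
            if len(current) > 1:
                groups.append(current)
            current = []
        current.append(item)
    if len(current) > 1:
        groups.append(current)
    return groups
```

Fragments that share an amino acid composition, or that differ only by leucine/isoleucine substitution, always fall in the same group, which is why mass alone cannot separate them and fragment-ion evidence is needed.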
ΔGop assignment using FEATHER
Following peptide disambiguation from the MS/MS experiment, HX/MS data can be analyzed using FEATHER. FEATHER is a Bayesian method to assign protection factors (PFs), the ratio of intrinsic and observed exchange rates for each residue (Figs. 1b, 3). As a first step, peptides with N/C terminal overlaps are subtracted from one another at the mass envelope level to produce explicit mini-peptides at an enhanced resolution as a data augmentation for PF fitting (Fig. 3a).
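As a worked illustration of how a PF relates to ΔGop, the sketch below uses the commonly applied EX2 approximation ΔGop ≈ RT ln(PF), which holds when PF is much greater than 1. The temperature of 298.15 K and the function names are assumptions for illustration; the experimental temperature is not stated in this section.

```python
import math

R_GAS = 8.314462618e-3  # gas constant in kJ/(mol*K)


def protection_factor(k_int, k_obs):
    """PF as the ratio of the intrinsic to the observed exchange rate."""
    return k_int / k_obs


def dg_op_kj_per_mol(pf, temperature_k=298.15):
    """Free energy of opening from a PF, assuming dG_op = RT ln(PF) (EX2 limit)."""
    return R_GAS * temperature_k * math.log(pf)


# Example: a residue exchanging one million times slower than its intrinsic rate.
pf = protection_factor(k_int=10.0, k_obs=1e-5)
dg = dg_op_kj_per_mol(pf)
```

Under this assumption, each unit of log10(PF) corresponds to roughly 5.7 kJ/mol at 298.15 K.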
While there are several methods to calculate centroid values from isotopic mass envelopes to fit PFs17–19, the degenerate nature of MS centroids causes considerable information loss, as many envelopes can yield identical centroid values.16 By contrast, direct envelope fitting introduces additional constraints, thereby minimizing calculation uncertainties and enhancing accuracy.20 Building on a centroid-based Bayesian method17, we implemented sampling improvements and isotopic mass envelope fitting.20 First, FEATHER initiates full isotopic mass envelope reconstruction for each mini-peptide at each timepoint with preliminary PF guesses. The PFs are refined through iterative Markov Chain Monte Carlo (MCMC) sampling (Fig. 3b, c). PFs are explored on discrete grids on the logarithmic scale within a log(PF) range of 0 to 14. The full theoretical kinetic exchange curves for each residue are built from single exponential functions given the PF value. D-to-H back-exchange, an effect of the chromatography steps in HX/MS, is corrected at each timepoint, followed by a probability distribution calculation to determine the isotopic envelope of deuteration. The isotopic envelope of deuteration and experimental isotopic envelope for each amino acid are then convolved to obtain a theoretical envelope. Then, the error and score based on the comparison between the theoretical and observed mass spectra are calculated. Finally, FEATHER performs an MCMC simulation to reduce the score by randomly sampling the log(PF)s on discrete grids. FEATHER continues sampling until the absolute error between the theoretical and measured envelopes is minimized and the model converges on optimized PFs for each amino acid or mini-peptide (Fig. 3b). To enhance the sampling, we implemented an explicit PF swapping step (Fig. 3c), motivated by the switchable nature of exchange rates within a peptide that can yield the same mass spectrum.
During each swapping step, a subset of sampled residues may swap their log(PF) values with adjacent amino acids. The final PF values are derived by combining results from multiple bootstrap replicates. The standard deviation of PFs within a cluster aggregated across multiple bootstrapping runs is used as a measurement of the error. We also implemented two optional Bayesian priors: a structural prior, which leverages a protein structure to bias sampling, and an uptake prior, which compares calculated exchange rates with centroid-level uptake (see Methods).
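The envelope reconstruction and the motivation for the swapping step can be sketched as follows. This is a simplified illustration under stated assumptions, not the FEATHER code: log(PF) is taken as base 10, back-exchange is applied as a uniform multiplicative loss, and the names `deuteration_probabilities`, `poisson_binomial`, and `theoretical_envelope` are hypothetical.

```python
import math


def deuteration_probabilities(k_int, log_pf, t, saturation=1.0, back_exchange=0.0):
    """Probability that each residue carries D at time t (single-exponential kinetics)."""
    probs = []
    for k, lpf in zip(k_int, log_pf):
        k_obs = k / 10.0 ** lpf  # assumed base-10 log(PF)
        probs.append(saturation * (1.0 - back_exchange) * (1.0 - math.exp(-k_obs * t)))
    return probs


def poisson_binomial(probs):
    """Distribution of the number of deuterated sites among independent residues."""
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for n, w in enumerate(dist):
            nxt[n] += w * (1.0 - p)
            nxt[n + 1] += w * p
        dist = nxt
    return dist


def convolve(a, b):
    """Discrete convolution of two intensity lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def theoretical_envelope(natural_envelope, k_int, log_pf, t, **kwargs):
    """Natural (undeuterated) envelope convolved with the deuteration distribution."""
    probs = deuteration_probabilities(k_int, log_pf, t, **kwargs)
    return convolve(natural_envelope, poisson_binomial(probs))
```

When two residues in a peptide have the same intrinsic rate, exchanging their log(PF) values leaves the envelope unchanged, so a single peptide cannot say which residue is the protected one. Proposing swaps between adjacent residues lets a sampler move directly between such equivalent or near-equivalent assignments, and overlapping peptides then favor one of them.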
Simulated data benchmarks
To assess FEATHER accuracy, we prepared simulated datasets by assigning exchange rates to all non-proline amino acids in synthetic proteins of varied sequences and lengths (Tables S2, S3). The proteins were decomposed into peptides with sequence coverage, noise, and overlap reminiscent of true HX/MS data (see Methods). We used FEATHER to calculate the exchange rates and PFs for each amino acid in these proteins (Fig. 3d, e). FEATHER performance was measured as the correlation coefficient of the true vs. calculated log(PF) (R) and root-mean-square error (RMSE) (Fig. 3f). FEATHER achieved R values near 1 for proteins of all sizes (Fig. 3g).
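The two benchmark metrics can be computed as in the short sketch below, assuming R denotes the Pearson correlation coefficient between true and calculated log(PF) values; the function names are illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rmse(x, y):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Hypothetical true and fitted log(PF) values for four residues.
true_log_pf = [1.0, 2.0, 3.0, 4.0]
fitted_log_pf = [1.1, 1.9, 3.2, 3.8]
r_value = pearson_r(true_log_pf, fitted_log_pf)
error = rmse(true_log_pf, fitted_log_pf)
```

Reporting both is useful because R is insensitive to a constant offset in log(PF), whereas RMSE is not.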
We found that peptide subtraction, PF swapping-sampling, and MS reconstruction contribute synergistically to FEATHER performance (Fig. 3g, Table S4); the benefit of including all three features exceeds the sum of their individual contributions (Table S4). Notably, FEATHER benefits more from increased data volume when using mass envelope data rather than centroids, highlighting further advantages of envelope fitting. Further, removing half of the timepoints, removing timepoints beyond 1e6 seconds, and noising the MS peaks did not significantly reduce accuracy (Tables S3, S5).
To compare FEATHER to other PF calculation methods17–21, we also analyzed our optimal synthetic dataset using BayesianHDX17, PyHDX18, ExPfact19, and HDsite20 (Table S6). FEATHER considerably outperforms each method. Although FEATHER builds on its sampling method, BayesianHDX fits centroids and samples less efficiently, leading to diminished accuracy. HDsite was the first method to reconstruct full mass envelopes, but the iterative fitting approach and scoring function reduced its performance. Due to its incompatibility with dense data and weeks-long compute time, we could only test HDsite on one small dataset. Overall, FEATHER combines the best features of prior methods with sampling and architecture improvements for unprecedentedly accurate ΔGop calculations.
PIGEON-FEATHER determines single-site ΔGop values for ecDHFR
ecDHFR is a clinical target for treating infections.22 After confirming its activity (Fig. S2), we tested the performance of PIGEON-FEATHER on ecDHFR (Fig. 4a-d). Comparing the fitted model with the experimental mass spectra and centroid-level D uptake demonstrates that FEATHER can accurately fit real HX/MS data (Fig. 4a) with low error (Fig. 4b). All D uptake plots and fitted models are in the SI files.
From 442 PIGEON-disambiguated peptides, we resolved 91% of positions at single-site resolution and 4% as dipeptides, totaling 95% of the protein sequence (Figs. 4c, S3, Tables S7, S8). The coverage and resolution exceed those of previous HX/MS experiments for ecDHFR23–25 owing to the use of several acid proteases (Fig. S4), and the peptide-level exchange corroborates a recent study (Fig. S5).25 The projection of ΔGop on the structure highlights βB, βC, and βH as the regions most protected from exchange (Fig. 4c), with residues M42, I62, and F153 having the highest ΔGop. Critically, PIGEON-FEATHER can resolve single-site ΔGop in ecDHFR’s catalytically important loops: the Met20 loop, the FG loop, and the GH loop. The Met20 loop residues exchange quickly (Fig. 4a, c, d) as expected given diffuse electron density in APO ecDHFR X-ray crystal structures.26,27 However, the FG and GH loops contain well-protected residues, like D127-Y128 (ΔGop=40, 38 kJ/mol) and A143 (ΔGop=70 kJ/mol).
The three PIGEON modes produce consistent FEATHER results with R values ranging from 0.84 to 0.90 (Fig. S6). Correlating ΔGop from singlicate experiments against the pooled dataset (n=6) gives an average R value of 0.74, demonstrating the method’s reproducibility (Figs. S7, S8a-g). In the absence of long timepoints (>4 hours), we observed a drop of 0.09 in R, similar to the synthetic data benchmark (Figs. S8h, S9). We found that three bootstrapping replicates are sufficient (Fig. S10), and that FEATHER converges to consistent results even without the structural and uptake priors (Fig. S11). FEATHER-derived PFs show a 0.48 correlation with a widely used phenomenological model that defines the PF as a function of the numbers of amide hydrogen bonds and local contacts (Fig. S12).28 FEATHER-derived PFs also correlate with computationally calculated hydrogen bonds and solvent accessible surface area (SASA) using the macromolecular modeling suite Rosetta, suggesting a potential use of the method for structural modeling (Fig. S12).29,30
Single-site ΔGop values for ecDHFR and human DHFR (hDHFR) enable comparison
Like protein sequences and structures, ΔGop values may be evolutionarily conserved as features of the conformational ensemble.31 Although the sequence identity between ecDHFR and hDHFR is only 26%32, the X-ray crystal structures are almost identical (all-atom RMSD=1 Å, PDB IDs: 4M6J, 5DFR).33,34 While peptide-level H-D exchange cannot be quantitatively compared for proteins with different sequences, absolute ΔGop for structurally aligned regions of related proteins may be compared. We therefore applied HX/MS-PIGEON-FEATHER to determine ΔGop values for hDHFR (Figs. 4e, S13, Tables S9, S10). Despite modest sequence identity, the residue-level ΔGop values for the two DHFRs are moderately correlated (Fig. 4f, g).
Although the two orthologs are structurally similar, previous 15N relaxation dispersion experiments showed that the ms-scale motions that dominate ecDHFR Met20 loop are absent in hDHFR, and mutations in ecDHFR that disrupt these motions slow NADP+ dissociation.35–37 It was hypothesized that lower intracellular NADP+ and THF concentrations in vertebrates allow efficient cofactor exchange in DHFR, enabling a hinge-like open-close conformational change to control the ligand flux while maintaining the structure of the Met20 loop. Bacteria, which have ∼1:1 NADPH:NADP+, may instead have evolved a loop-mediated exchange mechanism to avoid end product inhibition.38
In support of this hypothesis, PIGEON-FEATHER yields the otherwise non-intuitive result that several positions in the central β sheet, a universally conserved DHFR motif, have much higher ΔGop in ecDHFR than in hDHFR (Fig. 4g). We hypothesize that this stabilization compensates for the larger loop dynamics in ecDHFR and facilitates the open-close conformational change in hDHFR.33 In line with this hypothesis, both hinge loops, which mediate the open-close conformational change in hDHFR and are much shorter in ecDHFR, are adjacent to β strands that are more stable in ecDHFR. Meanwhile, several peripheral residues in the Met20 and FG loops and helix αC are stabilized by 9-14 kJ/mol in hDHFR compared to ecDHFR (Fig. 4f, g, S14). Even in the absence of ligands, hDHFR may have evolved to sample the hinge-closed conformational state with a higher probability than ecDHFR because this conformation has no specific functional role in ecDHFR. PIGEON-FEATHER thus introduces a strategy to classify the catalytic mechanisms for many other DHFR orthologs with varied-length functional loops and hinges towards developing pathogen-specific antibiotics.33,39
Two inhibitors differentially impact the ecDHFR ensemble
Trimethoprim (TMP) and methotrexate (MTX) are competitive inhibitors that bind ecDHFR with sub-nM affinities (Fig. 5a), but because TMP inhibits bacterial DHFRs at least four orders of magnitude more strongly than vertebrate orthologs, it is used as an antibiotic.32,39,40 X-ray crystal structures for TMP- and MTX-ecDHFR have <0.5 Å RMSD (Fig. 5a, b).41,42 NMR spectroscopy revealed near-identical µs-to-ms motions in the two inhibitor-bound states.43 Both states favor a conformation of the Met20 loop that blocks the active site.41–44 As mounting drug resistance mutations45 reduce inhibitor efficacy and motivate the search for new therapeutics, we revisited the question of how TMP and MTX impact the ecDHFR ensemble.
PIGEON-FEATHER-derived changes in ΔGop (ΔΔGop) between the APO and inhibitor states revealed that TMP and MTX have distinct effects on ecDHFR (Fig. 5a-c, Figs. S14-16). Although both inhibitors stabilize the β sheet and binding pocket helix, they each also have unique energetic effects (Figs. 5a, b, S14, S17, S18). Compared to TMP, MTX more strongly impacts the helix αC (Figs. 5d, S18), GH loop (Figs. 5e, S18), Met20 loop (Figs. 5d, f, S18), and N-terminal region of αB (Fig. S18). Compared to the APO enzyme, αC is stabilized in MTX-ecDHFR via interactions with the Met20 loop, while TMP binding has no effect (Fig. 5a-d). In the αC-proximal loop, R57 forms bidentate hydrogen bonds with MTX that are not possible with TMP (Fig. S19). In αB, the difference is most pronounced in K32 (ΔΔGop,MTX-TMP=33 kJ/mol) due to its hydrophobic interaction with the methylene groups of the glutamate moiety of MTX (Fig. S19). Allosteric TMP-specific increases in ΔGop are pervasive in the β sheet where TMP resistance mutations are enriched (Figs. 5a, b, S20). We highlight these effects to demonstrate the resolution and quantification afforded by PIGEON-FEATHER compared to conventional HX/MS analysis methods, and to emphasize results that are not obvious from structures.
Altogether, PIGEON-FEATHER shows how each inhibitor reshapes the ecDHFR ensemble (Fig. 5g). MTX interacts with the Met20 loop and introduces a backbone-sidechain hydrogen bond between E17 in the Met20 loop and S49 in αC (Fig. 5d), further stabilizing the backbone-backbone hydrogen bond between N18 and G15. The FG loop is therefore released from its APO-state interactions with the Met20 loop to become relatively flexible, while the GH, FG, and Met20 loops at the edge of the sheet form a strong intramolecular hydrogen bond network (Fig. 5b-e, g). The destabilized FG loop disrupts backbone hydrogen bonds in βG by unfolding the small helix that connects them (Figs. 5g, S5). Because TMP binding does not bias the Met20 loop to the same conformation as MTX, the nucleotide-binding domain, FG loop, and βG strand remain relatively unperturbed (Figs. 5a, f, S14, S18).
These results combined with X-ray crystallography suggest that TMP’s selective inhibition of bacterial DHFRs may be linked to its binding conformation and to allosteric effects on the βG and βH strands. Generally, the amino acid content and local structure are similar in the active site among bacterial and vertebrate orthologs (Fig. S21).39,46 In ecDHFR and hDHFR complexes, the diaminopyrimidine ring of TMP is positioned similarly, but the trimethoxyphenyl group is positioned differently, with a 90° rotation in the dihedral angle for the pyrimidine-to-methylene bond (Fig. S22).42,47 This rotameric change may destabilize αB to reduce TMP-hDHFR affinity. PIGEON-FEATHER shows that TMP binding also relies heavily on β-sheet stabilization, which may have a higher energetic cost in vertebrate DHFR orthologs where the N-terminal βG forms an extended loop with the adjoining FG loop (Figs. 4e, S22). PIGEON-FEATHER reveals how ligand-induced conformational reweighting in this region, which is far from the active site and least conserved among orthologs, is necessary for ligand binding and catalysis in both ecDHFR and hDHFR as determined previously by mutational studies.48–51 TMP resistance mutations52–54 in the β sheet of ecDHFR may limit TMP efficacy by similarly disrupting the wild-type pattern of ligand stabilization in this region (Fig. S20). By contrast, MTX binding destabilizes βG and likely energetically offsets this disruption through its interactions with αC and three functional loops (Fig. S19), resulting in the inhibition of both ecDHFR and hDHFR.
PIGEON-FEATHER clarifies functional switching in the lac repressor (LacI)
LacI is a model transcription factor from E. coli. Isopropyl β-D-1-thiogalactopyranoside (IPTG) binds the regulatory ligand-binding domain (LBD) of LacI to cause an order-disorder transition in the DNA-binding domain (DBD) more than 40 Å away, but X-ray crystal structures show <1.5 Å RMSD among the apo, DNA-bound, and IPTG-bound LBDs.55 We previously applied HX/MS to learn how IPTG and other ligands modulate the rigidity of secondary structures in the LBD.56 The results were well-correlated with functional data from mutational studies57,58 and were validated by subsequent NMR59 that confirmed the roles of the same regions in stabilized and truncated mutants. However, these studies did not show how the energetic changes in the LBD upon IPTG binding actually remodel the DBD ensemble. We applied PIGEON-FEATHER with fully replicated HX/MS experiments to probe the allostery of this large protein at the amino acid level, including the DBD (Fig. 6). The peptide-level trends were consistent with our previous study (Figs. 6a, b, S23-25), but PIGEON-FEATHER also revealed crucial biological insights about the IPTG to DNA ensemble transition.
We found that IPTG binding differentially reweights specific sets of interactions at the dimer interface over a 70 Å distance to disfavor LacI-DNA interactions and activate transcription (Fig. 6c, d). The C-terminal LBD subdomain is virtually identical by X-ray crystallography in the DNA- and IPTG-bound states.55 Here, PIGEON-FEATHER shows that helices H17 and H15 are more protected in the DNA state due to better hydrophobic interactions among I283, Y282, and L251, while further stabilizing the D278-C281 and D278-Y282 backbone hydrogen bonds (ΔΔGop, DNA-IPTG =48 kJ/mol) (Fig. 6e). Despite the 55 Å distance, this region is coupled to the DBD, where in the DNA state, a cross-interface hydrophobic interaction between L56 and V52 stabilizes the hinge helix and protects the R51-A116 hydrogen bond, further stabilizing helix H4 (Fig. 6f). This bond is a true molecular switch: in IPTG-LacI, the probability of this interaction is dramatically decreased to bias the ensemble to its active state.
Helix H1 in the N-terminal subdomain of the LBD is one of the most elusive helices in LacI. We previously observed polymorphism in its exchange behavior: the N-terminal half is more protected from H-D exchange in the IPTG state, while the C-terminal half is less protected (Fig. S24). Using PIGEON-FEATHER, we determined that the N-terminal protection at A75 and V79 originates from IPTG binding (alongside S193 and S197 in the pocket periphery), while the C-terminal deprotection is due to the weakening of the hydrogen bond between R86 and A82 (ΔΔGop, DNA-IPTG =31 kJ/mol), caused by its hinge motion (Fig. 6g, h). The IPTG-triggered disruption to this region culminates in destabilization of the C-terminal region of helix H4 and the hinge helix (Figs. 6f, i). The interactions of the hinge helices with each other and with the minor groove of the DNA operator are therefore reduced60, which linearizes the DNA and releases the helix-turn-helix motif of the DBD from the major groove via increased local unfolding (Fig. 6d).
Discussion
PIGEON-FEATHER determines high-resolution ΔGop values as energetic features of protein ensembles from conventional HX/MS measurements. Detailed benchmarks show that it surpasses other methods and scales with protein size for nearly perfect accuracy, improving the utility and interpretability of HX/MS such that it may be seamlessly combined with cryo-EM and MD simulations. PIGEON-FEATHER includes several features that contribute to its high accuracy. After disambiguating peptides using tandem MS fragment identification, it fits their full isotopic mass envelopes, which circumvents the degeneracy of centroid fitting inherent in other methods by introducing additional constraints and minimizing calculation uncertainties. Inspired by the physics of the hydrogen exchange phenomenon, PIGEON-FEATHER’s swapping-sampling mechanism significantly improves sampling efficiency on the rugged PF space in the MCMC simulations. The peptide subtraction feature further improves the method’s accuracy.
The power of PIGEON-FEATHER lies in its ability to reveal how proteins traverse free energy landscapes in response to various perturbations, including ligand binding, mutations, and environmental conditions (pH, temperature, pressure, solvent, and denaturant). The inclusion of ensemble energetic information adds an extra dimension to the sequence-structure-function paradigm for proteins. Integrated with recent advances in protein structure determination61, structure prediction7,62, structure search63, and bioinformatic analysis64,65, PIGEON-FEATHER enables a deeper understanding of proteins and protein families. Because the method permits the comparison of ΔGop within fold families, biologists may directly observe changes in ensemble energies over the course of evolution. PIGEON-FEATHER reveals allosteric protein mechanisms at high resolution, which can aid the identification and design of new therapeutics.
PIGEON-FEATHER can also be integrated into protein engineering and computational structural modeling methods, for example to build residue-level scoring functions for enhanced docking and virtual epitope screening, integrating with structural modeling66 and prediction6,67. With PIGEON-FEATHER, HX/MS datasets can be used to train machine learning models aimed at designing proteins with defined free energy landscapes. In principle, PIGEON-FEATHER can be extended to support multimodal distributions and HX-electron capture dissociation (ECD)/electron transfer dissociation (ETD)/MS. We will incorporate these capabilities in future developments.
Methods
PIGEON: detailed methods
Initial peptide identification and removal of systematic bias
PIGEON requires a list of peptide identifications for each MS/MS dataset. Peptide lists can be produced using any commercially available MS software that matches MS peaks by monoisotopic mass and retention time to sequence ranges and assigns a score based on b- and y-ion coverage (Fig. 2d). We collect 2-4 MS/MS datasets per HX/MS experiment and produce the peptide list using Bruker Biotools. To allow for interpreting datasets with a systematic m/z error from the MS calibration, we apply an initial high 30 ppm tolerance for m/z error in Biotools (Figs. 2e, i, f, S1a). To account for different protease specificities and heterogeneity in protease activities, our database includes every possible proteolytic fragment for the protein of interest up to a maximum length of 15 residues rather than only including peptide sequences resulting from pepsin cleavage (Fig. 2b). The 30 ppm tolerance introduces many false identifications, so we impose a lower tolerance threshold after correcting for calibration errors using PIGEON (Fig. 2e, ii). To do this, PIGEON first pools all identified peaks across MS2 datasets and identifies a subset of high-confidence peaks (with scores >150) from this pool (Fig. S1b), then fits a heuristic trend line to the subset using a fractional exponent function f(x) or polynomial function g(x):
We re-threshold the data to ±7 ppm to include only points around this trendline (Fig. S1c). Up to half of the peaks are discarded at this stage, with up to 100% increase in mean score.
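The recalibration step can be sketched as below. Because the fitted function is not reproduced here, a least-squares straight line in theoretical m/z stands in for the heuristic fractional-exponent or polynomial trend; that substitution, the dictionary keys, and the function names are assumptions for illustration only.

```python
def ppm_error(mz_obs, mz_theo):
    """Signed mass error in parts per million."""
    return (mz_obs - mz_theo) / mz_theo * 1e6


def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept)."""
    if not xs:
        raise ValueError("no high-confidence peaks to fit")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def rethreshold(peaks, score_cutoff=150.0, window_ppm=7.0):
    """Fit the ppm-error trend on high-scoring peaks, then keep peaks near the trend."""
    trusted = [p for p in peaks if p["score"] > score_cutoff]
    slope, intercept = fit_line(
        [p["mz_theo"] for p in trusted],
        [ppm_error(p["mz_obs"], p["mz_theo"]) for p in trusted],
    )
    kept = []
    for p in peaks:
        expected = slope * p["mz_theo"] + intercept
        if abs(ppm_error(p["mz_obs"], p["mz_theo"]) - expected) <= window_ppm:
            kept.append(p)
    return kept


def make_peak(mz_theo, error_ppm, score):
    """Build a synthetic peak with a chosen ppm error (test helper)."""
    return {"mz_theo": mz_theo, "mz_obs": mz_theo * (1 + error_ppm * 1e-6), "score": score}


# Synthetic dataset with a systematic +15 ppm calibration offset.
peaks = [
    make_peak(400.0, 15.0, 300), make_peak(600.0, 15.0, 250), make_peak(800.0, 15.0, 200),
    make_peak(500.0, 17.0, 20),   # follows the offset: kept
    make_peak(700.0, -5.0, 20),   # 20 ppm away from the trend: discarded
]
kept = rethreshold(peaks)
```

Note that a peak with a small raw error can still be discarded if it sits far from the trend, which is the point of correcting for calibration before tightening the tolerance.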
Removal of redundant peptide matches
In HX/MS experiments, one to three charge states are typical per peptide. Each charge state for each peptide may have redundant matches that differ by small amounts in retention time or m/z, but correspond to the same peak. PIGEON uses the match with the highest fragment coverage for each charge state. In the approximately 30% of cases with equal fragment coverage among matches, PIGEON chooses the match with higher intensity. This leaves 100-1,000 peaks, with a 2-fold to 4-fold increase in mean score (Figs. 2e, ii, S1d). At this stage, the user may also impose a minimum score for matches, for use with higher quality datasets.
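The selection rule above amounts to a keyed maximum with a tie-break, as in this minimal sketch. The dictionary keys and the function name are hypothetical and do not reflect the PIGEON data model.

```python
def pick_consensus(matches):
    """Keep one match per (peptide, charge): highest fragment coverage, then intensity."""
    best = {}
    for m in matches:
        key = (m["start"], m["end"], m["charge"])
        rank = (m["fragment_coverage"], m["intensity"])
        if key not in best or rank > best[key][0]:
            best[key] = (rank, m)
    return [m for _, m in best.values()]


matches = [
    {"start": 5, "end": 14, "charge": 2, "fragment_coverage": 6, "intensity": 1.0e5},
    {"start": 5, "end": 14, "charge": 2, "fragment_coverage": 8, "intensity": 4.0e4},
    {"start": 5, "end": 14, "charge": 3, "fragment_coverage": 4, "intensity": 2.0e4},
    {"start": 5, "end": 14, "charge": 3, "fragment_coverage": 4, "intensity": 9.0e4},
]
consensus = pick_consensus(matches)
```

Comparing `(fragment_coverage, intensity)` tuples makes intensity matter only when coverage is tied.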
Removal of false duplicates
The most problematic case for HX/MS analysis is m/z degeneracy, where multiple peptides are redundantly assigned to a single peak in the MS because their theoretical m/z or masses are very close or identical (Fig. 2a-c). To check for these false duplicates, PIGEON compares each peak assignment to all other peak assignments in the dataset. If the monoisotopic m/z and the retention time for each sequence is within a threshold (default: 0.1 Da, 0.5 minutes) of another, PIGEON flags the pair. If the peptide in question is unsupported by fragment information (default: has score 0) it is dropped (Fig. 2e, iii, g). If multiple duplicate peptides have fragment support, this indicates that these peptides co-elute and are both present in an unknown ratio. Because the optimal treatment in this case depends on the details of the analysis (in particular on how high coverage is possible when excluding these co-eluting peptides, the resolution of information desired, and the extent of manual data refinement after rate fitting) several modes are provided in this case (Fig. 2e, iv). These are: to keep duplicates where both have fragment support (‘KEEP’), to drop all (‘DROP’), or to drop only those sets that do not mostly (default >80%) overlap in sequence (‘SELECT’).
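The flagging logic and the three modes can be sketched as follows. The thresholds match the stated defaults, but the overlap definition (shared residues divided by the length of the longer peptide), the dictionary keys, and the function names are assumptions for illustration.

```python
def overlap_fraction(a, b):
    """Shared residues divided by the longer peptide's length (assumed definition)."""
    shared = max(0, min(a["end"], b["end"]) - max(a["start"], b["start"]))
    return shared / max(a["end"] - a["start"], b["end"] - b["start"])


def resolve_duplicates(peaks, mode="SELECT", mz_tol=0.1, rt_tol=0.5, min_overlap=0.8):
    """Drop unsupported duplicates, then treat co-eluting supported pairs by mode."""
    pairs = []
    for i in range(len(peaks)):
        for j in range(i + 1, len(peaks)):
            close_mz = abs(peaks[i]["mz"] - peaks[j]["mz"]) <= mz_tol
            close_rt = abs(peaks[i]["rt"] - peaks[j]["rt"]) <= rt_tol
            if close_mz and close_rt:
                pairs.append((i, j))

    # Flagged assignments without fragment support (score 0) are always dropped.
    dropped = {k for pair in pairs for k in pair if peaks[k]["score"] == 0}

    for i, j in pairs:
        if peaks[i]["score"] == 0 or peaks[j]["score"] == 0:
            continue  # already handled above
        if mode == "DROP":
            dropped.update((i, j))
        elif mode == "SELECT" and overlap_fraction(peaks[i], peaks[j]) <= min_overlap:
            dropped.update((i, j))
        # mode == "KEEP": both supported duplicates are retained
    return [p for k, p in enumerate(peaks) if k not in dropped]


peaks = [
    {"name": "A", "start": 0, "end": 10, "mz": 500.00, "rt": 5.0, "score": 200},
    {"name": "B", "start": 1, "end": 10, "mz": 500.05, "rt": 5.2, "score": 180},
    {"name": "C", "start": 50, "end": 60, "mz": 500.02, "rt": 5.1, "score": 0},
    {"name": "D", "start": 80, "end": 90, "mz": 700.00, "rt": 9.0, "score": 0},
    {"name": "E", "start": 100, "end": 110, "mz": 900.00, "rt": 12.0, "score": 150},
    {"name": "F", "start": 120, "end": 130, "mz": 900.03, "rt": 12.1, "score": 160},
]
```

In this toy list, C is removed in every mode, the highly overlapping pair A/B survives `SELECT`, and the non-overlapping pair E/F survives only `KEEP`.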
FEATHER: detailed methods
The FEATHER method uses a Bayesian framework to reconstruct full isotopic mass envelopes to fit PFs given three inputs from the user: the protein sequence, a peptide list from PIGEON, and the HX/MS dataset.
Peptide subtraction
The experimental peptides from the HX/MS dataset for each functional state of the protein are categorized according to their overlaps. Exhaustive combinations of peptides within each category are then subjected to subtraction (Fig. 3a). Peptides are subtracted iteratively until no new peptides can be produced; there is also an option to perform only one round of subtraction. We use a Monte Carlo (MC) simulation to subtract the isotopic mass envelopes, aiming to minimize the error between the convolution of the envelopes of the two shorter peptides and the envelope of the parent peptide. The initial guess for this process is obtained by deconvolving the parent peptide envelope using a Fourier transform.
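The iterative part of the subtraction can be illustrated at the level of sequence ranges alone. The sketch below assumes that two peptides can be subtracted when they share an N- or C-terminus, which is a common convention but is an assumption here; it tracks only residue indices (inclusive `(start, end)` tuples) and does not attempt the envelope deconvolution or the MC refinement.

```python
def subtract_peptides(peptides, max_rounds=None):
    """Iteratively generate new sequence ranges by subtracting peptides
    that share a terminus. `peptides` holds inclusive (start, end) tuples.

    max_rounds=None repeats until no new range appears;
    max_rounds=1 performs a single round.
    """
    known = set(peptides)
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        new = set()
        for (s1, e1) in known:
            for (s2, e2) in known:
                if s1 == s2 and e1 > e2:      # shared N-terminus
                    new.add((e2 + 1, e1))
                elif e1 == e2 and s1 < s2:    # shared C-terminus
                    new.add((s1, s2 - 1))
        new -= known
        if not new:
            break
        known |= new
        rounds += 1
    return known

all_rounds = subtract_peptides({(1, 10), (1, 5), (6, 12)})
one_round = subtract_peptides({(1, 10), (1, 5), (6, 12)}, max_rounds=1)
```

In this example the first round yields the range 6 to 10, and only then can a second round combine it with 6 to 12 to give 11 to 12, which is why iterating can resolve more segments than a single round.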
Isotopic mass envelope reconstruction
Following peptide subtraction, initial PF guesses for each individual amino acid are corrected using Markov Chain Monte Carlo (MCMC) sampling based on errors between the theoretical envelope and the measured envelope along with the centroid value until the model converges (Fig. 3c). The theoretical envelopes are derived by convolving the natural isotopic mass envelope from the MS/MS experiment with the theoretical deuteration isotopic distribution at a specific time point. The deuteration isotopic distribution is calculated based on the single exponential kinetic curves of all residues in a peptide given the log(PF)s and the peptide’s back exchange level. The peptide exchange does not take into account the first two residues and prolines. The forward model (fmod), which calculates the isotopic mass envelope Ef,t fmod of peptide f at exchange time t, is defined as
Where Conv. is the convolution function, f(x) is the function that calculates the probability distribution given the probability of each independent event, the saturation level ϕ is the deuterium percentage of the D2O buffer, ∂i is an indicator that equals one if residue i has an observable amide and zero otherwise, βf is the back-exchange level of peptide f, and Ef,t nat,non-D is the natural isotopic mass envelope.
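A forward model of this kind can be sketched with NumPy. This is an illustration under stated assumptions rather than the FEATHER implementation: the per-residue deuteration probability is taken to be the product of the observability indicator, the saturation level, one minus the back-exchange level, and the single-exponential uptake, and the function and argument names are hypothetical. PFs are passed directly (not as logarithms) to avoid assuming a logarithm base.

```python
import numpy as np

def deuteration_distribution(probs):
    """Distribution of the total number of incorporated deuterons for
    independent per-residue probabilities (a Poisson binomial)."""
    dist = np.array([1.0])
    for p in probs:
        dist = np.convolve(dist, [1.0 - p, p])
    return dist

def forward_envelope(pf, k_int, observable, t, natural_envelope,
                     saturation=0.9, back_exchange=0.3):
    """Model isotopic envelope of one peptide at exchange time t.

    ASSUMPTION: probability_i = observable_i * saturation
                                * (1 - back_exchange) * (1 - exp(-k_int_i / PF_i * t)).
    """
    pf = np.asarray(pf, dtype=float)
    k_int = np.asarray(k_int, dtype=float)
    observable = np.asarray(observable, dtype=float)
    uptake = 1.0 - np.exp(-(k_int / pf) * t)
    probs = observable * saturation * (1.0 - back_exchange) * uptake
    envelope = np.convolve(deuteration_distribution(probs), natural_envelope)
    return envelope / envelope.sum()

# Toy peptide: four residues, the first two treated as unobservable.
nat = np.array([0.6, 0.3, 0.1])
args = dict(pf=[1e3] * 4, k_int=[1.0] * 4, observable=[0, 0, 1, 1],
            natural_envelope=nat)
env_start = forward_envelope(t=0.0, **args)
env_late = forward_envelope(t=1e4, **args)
```

At t = 0 the result reduces to the natural envelope padded with zeros, and at long times the envelope shifts to higher mass, capped by the saturation and back-exchange factors.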
Working from the BayesianHDX17 score function, we replaced the original noise model with a Gaussian function of the summed squared fitting error for individual isotopic mass envelope peaks, for peptide f at time t, replicate n, and peak p. PE represents the probability of observing the envelope E.
Here, the model M = ({ki}) is the set of residue-resolved exchange rates {ki}, N is the total number of peptides, and σE,f,t is the average mass envelope peak error estimate for peptide f at time t.
We also incorporate the agreement of the fits with the centroid-level observations. The probability PC of the experimentally observed centroid value of the mass spectrum, Cexp, for peptide f at time t and replicate n is given by
σC,f,t is the average centroid value error estimate for peptide f at time t.
The joint likelihood of isotopic mass envelope and centroid value of the mass spectrum is:
Where wC and wE are the weights of the centroid and envelope contributions, respectively. Their default values are both set to 1.
The likelihood function is the joint likelihood of all the peptides, timepoints and replicates.
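One way to combine the two Gaussian terms into a score for a single peptide, timepoint, and replicate is sketched below. The exact functional form is an assumption: the weights are applied as multipliers on the negative log-probabilities (equivalent to exponents on the probabilities), and normalization constants are dropped. The function name and arguments are hypothetical.

```python
import numpy as np

def neg_log_likelihood(env_exp, env_mod, sigma_e,
                       c_exp, c_mod, sigma_c, w_e=1.0, w_c=1.0):
    """Weighted negative log-likelihood for one envelope and its centroid.

    ASSUMPTION: Gaussian noise on the summed squared envelope error and
    on the centroid, with constants omitted. Lower values mean better fits.
    """
    sse = np.sum((np.asarray(env_exp) - np.asarray(env_mod)) ** 2)
    nll_envelope = sse / (2.0 * sigma_e ** 2)
    nll_centroid = (c_exp - c_mod) ** 2 / (2.0 * sigma_c ** 2)
    return w_e * nll_envelope + w_c * nll_centroid

value = neg_log_likelihood([0.5, 0.5], [0.6, 0.4], 0.1,
                           c_exp=3.2, c_mod=3.0, sigma_c=0.2)
perfect = neg_log_likelihood([0.5, 0.5], [0.5, 0.5], 0.1,
                             c_exp=3.0, c_mod=3.0, sigma_c=0.2)
```

Under these assumptions, the score for a full dataset would be the sum of this quantity over all peptides, timepoints, and replicates, since the joint likelihood is a product of the individual terms.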
The PFs are derived by combining results from multiple posterior analyses, each of which employs bootstrapping techniques where 90% of the peptides are randomly resampled from the entire dataset. In a typical bootstrapping replicate fitting process, simulated annealing is applied with the temperature decreasing from 50 to 0.1. At temperatures of 2 and 0.01, 1500 models are saved. This process is repeated five times, resulting in the collection of 7500 models. Finally, the top models (n=50 by default) from each bootstrap replicate are collected and pooled. K-means clustering is then applied to obtain the PFs within each resolution segment, where the number of clusters corresponds to the size of the resolution segment (1 for a single resolved residue). The standard deviation of PFs within the cluster, when combined across multiple bootstrapping runs, is used as an error measure.
Log(PF) swapping
During each sampling step, 20% of residues are randomly selected. Each residue in this subset may swap its log(PF) value with that of an adjacent residue. Whether the swap is accepted is determined by the Metropolis criterion applied to the score function at a temperature (T) of 0.
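A swap move of this kind could be sketched as follows. The sketch assumes that lower scores are better, so that the Metropolis criterion at T = 0 reduces to accepting only swaps that do not increase the score; `swap_step` and `score_fn` are hypothetical names.

```python
import random

def swap_step(log_pf, score_fn, fraction=0.2, rng=random):
    """Try swapping log(PF) values between adjacent residues.

    ASSUMPTION: lower score is better, so at T = 0 a swap is accepted
    only if the score does not increase.
    """
    current = list(log_pf)
    current_score = score_fn(current)
    n_pick = max(1, round(fraction * len(current)))
    for i in rng.sample(range(len(current)), n_pick):
        neighbors = [j for j in (i - 1, i + 1) if 0 <= j < len(current)]
        j = rng.choice(neighbors)
        trial = current.copy()
        trial[i], trial[j] = trial[j], trial[i]
        trial_score = score_fn(trial)
        if trial_score <= current_score:
            current, current_score = trial, trial_score
    return current

# Toy score: squared distance to a known target profile.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_score = lambda v: sum((a - b) ** 2 for a, b in zip(v, target))
start = [2.0, 1.0, 3.0, 5.0, 4.0]
result = swap_step(start, toy_score, rng=random.Random(0))
```

Because a swap only permutes values, the set of log(PF) values within a segment is preserved; the move changes which residue each value is assigned to.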
Priors
We also implemented two optional Bayesian priors to enhance FEATHER performance. The structural prior leverages computationally predicted or experimentally solved protein structures to bias the sampling based on probable hydrogen bonding patterns within the protein ensemble. This prior is justified by seminal work that demonstrated how backbone amide groups involved in hydrogen bonds with other protein atoms, ligands, or ordered water molecules limit H-D exchange.10,68 The uptake prior compares the FEATHER-determined exchange rates for each amino acid with the D uptake plots to ensure consistency with empirical observations in the calculated exchange rates. During the sampling process, two broad Gaussian biases are applied to the score function for positions with low and high PFs, based on their categorization according to D uptake.
In the structural and uptake priors, observable residues are categorized into three groups: high PF, low PF, and regular PF residues. This classification is based on an analysis of backbone hydrogen bonds in the protein structure and on peptide uptake plots. For the structural prior, a residue is considered a high PF residue if it forms backbone hydrogen bonds with other protein atoms, binding partners, or water molecules. Residues that lack backbone hydrogen bonds are classified as low PF residues. For the uptake prior, residues covered by slow-exchange peptides, where the maximum deuterium uptake is less than 20%, are considered high PF residues. Conversely, residues covered by fast-exchange peptides, where the minimum deuterium uptake is more than 100% (uptake can exceed 100% because of error in the fully deuterated sample), are classified as low PF residues. Residues covered by both slow- and fast-exchange peptides are categorized as regular PF residues. To account for these categorizations in the sampling process, two soft biases are applied to the score function. For high PF residues, a Gaussian prior norm(6, 4.0) is added, and for low PF residues, a Gaussian prior norm(1.5, 2.0) is added, both based on seminal NMR studies.10,68 The scales of the structural and uptake priors are set to 0.7 and 0.5, respectively. The PFs of regular residues are sampled without bias.
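The uptake-based categorization and the resulting soft bias can be sketched as below. Several details are assumptions made for illustration: the second parameter of norm(·, ·) is read as a standard deviation, the bias is written as a scaled negative log-Gaussian added to a score in which lower is better, and residues covered by neither slow- nor fast-exchange peptides are treated as regular. All function names are hypothetical.

```python
HIGH_PF = (6.0, 4.0)   # norm(6, 4.0); second value ASSUMED to be a std. dev.
LOW_PF = (1.5, 2.0)    # norm(1.5, 2.0)

def classify_by_uptake(covering_peptides):
    """Classify a residue from the (min_uptake, max_uptake) percentages
    of the peptides that cover it."""
    slow = any(max_u < 20.0 for _, max_u in covering_peptides)
    fast = any(min_u > 100.0 for min_u, _ in covering_peptides)
    if slow and not fast:
        return "high"
    if fast and not slow:
        return "low"
    return "regular"   # covered by both kinds, or (ASSUMED) by neither

def prior_penalty(log_pf, category, scale):
    """Scaled negative log-Gaussian bias added to the score."""
    if category == "regular":
        return 0.0
    mean, sd = HIGH_PF if category == "high" else LOW_PF
    return scale * (log_pf - mean) ** 2 / (2.0 * sd ** 2)

category = classify_by_uptake([(0.0, 10.0)])
penalty = prior_penalty(10.0, category, scale=0.5)   # uptake prior scale
```

With these broad widths the bias nudges the sampler rather than constraining it: a high PF residue sampled four log units from the prior mean incurs only a small penalty.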
Simulated data creation
An amino acid sequence and a corresponding array of PFs of identical length and within a specified range (e.g., 2–10) were randomly generated using the Python built-in random module. The intrinsic exchange rates (kint) were calculated using the HDXrate library69 given the protein sequence and the pH and temperature of the experiment (pH 7, 293.0 K)11,70. Subsequently, the deuterium uptake kinetics of each residue i were determined using a single exponential function that incorporates PF and kint.
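The residue-level simulation can be sketched with the standard library alone. The single-exponential form below uses the usual relation in which the observed rate is kint divided by PF. Two points are assumptions: the example range 2–10 is interpreted as log10(PF), and kint is supplied by the caller rather than computed, since no HDXrate calls are shown here. Function names are hypothetical.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def random_protein(length, log_pf_range=(2.0, 10.0), rng=random):
    """Random sequence plus one random log(PF) value per residue."""
    sequence = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
    log_pf = [rng.uniform(*log_pf_range) for _ in range(length)]
    return sequence, log_pf

def residue_uptake(k_int, log_pf, t):
    """Fractional deuterium uptake of one residue at time t.

    ASSUMPTION: log_pf is base 10, so k_obs = k_int / 10**log_pf.
    """
    k_obs = k_int / 10.0 ** log_pf
    return 1.0 - math.exp(-k_obs * t)

sequence, log_pf = random_protein(50, rng=random.Random(1))
half = residue_uptake(k_int=1.0, log_pf=0.0, t=math.log(2))
```

The last line is a sanity check on the kinetics: with an observed rate of 1 per unit time, uptake reaches one half at t = ln 2.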
Peptides of lengths ranging from 5 to 12 residues were then generated at random positions distributed evenly along the sequence. The theoretical isotopic mass envelope of a peptide was obtained using the forward model described above. We employed pyOpenMS71 to calculate the natural isotopic mass envelope of a simulated peptide. The peptide deuterium incorporation is the sum of the deuterium uptake of all the residues within it. The simulated data generation is optimized to achieve coverage comparable to that of the experimental dataset by ensuring a similar number of subtracted peptides after one round of subtraction.
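Peptide generation and the peptide-level sum can be sketched as follows. Drawing start positions uniformly at random is an assumption about how peptides are spread along the sequence, and the half-open `(start, end)` index convention and function names are choices made for this sketch. Unobservable residues can be handled by setting their uptake to zero before summing.

```python
import random

def random_peptides(seq_len, n_peptides, min_len=5, max_len=12, rng=random):
    """Random peptides as half-open (start, end) index ranges."""
    peptides = []
    for _ in range(n_peptides):
        length = rng.randint(min_len, max_len)       # inclusive bounds
        start = rng.randint(0, seq_len - length)
        peptides.append((start, start + length))
    return peptides

def peptide_uptake(residue_uptakes, start, end):
    """Peptide deuterium incorporation as the sum over its residues."""
    return sum(residue_uptakes[start:end])

peptides = random_peptides(seq_len=100, n_peptides=30, rng=random.Random(2))
total = peptide_uptake([0.5] * 20, 3, 8)
```

Matching the coverage of an experimental dataset, as described above, would then amount to adjusting `n_peptides` until one round of subtraction yields a similar number of subtracted peptides.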
Data availability
All data needed to reproduce this work are available at the PRIDE database (dataset identifier PXD054550).
Code availability
The source code for the software is available at https://github.com/glasgowlab/PIGEON-FEATHER, along with the synthetic datasets used for benchmarking, an example HX/MS dataset, and a tutorial.
Acknowledgements
We gratefully acknowledge Glasgow Lab members for requesting, adding, discussing, testing, and troubleshooting various features of the software. PIGEON-FEATHER also benefited from testing by Danielle Swingle at the Advanced Science Research Center (ASRC) at the City University of New York (CUNY). We thank Dr. Andrea Piserchio, Dr. Kevin Gardner, and Dr. Rinat Abzalimov from the ASRC, as well as Dr. Shawn Costello, Sophie Shoemaker, and Dr. Naomi Latorraca for feedback on the method. We also thank Dr. Abzalimov in his role as the Mass Spectrometry Facility Manager at the ASRC, where the HX/MS experiments were performed. We are grateful to Dr. Supriya Pratihar, Dr. Hashim Al-Hashimi, and Dr. Art Palmer at Columbia University Medical Center for valuable discussions, and to Dr. Costello, Dr. Al-Hashimi, Dr. Abzalimov, Dr. Roksana Azad, Dr. Jeff Glasgow, Dr. Tanja Kortemme, and Dr. Stavros Lomvardas for critical feedback on this manuscript.