Abstract
Hydrogen Deuterium Exchange Mass Spectrometry (HDX-MS) is a powerful technique to monitor the intrinsic and conformational dynamics of proteins. Most HDX-MS experiments compare protein states (e.g. apoprotein vs liganded) and provide detailed information on differential dynamics between them obtained from multiple overlapping peptides. However, differential dynamics are difficult to compare across protein derivatives, oligomeric assemblies, homologues and samples treated under different buffer and protease conditions. A main reason is that peptide-based D-uptake differences do not inform on absolute intrinsic dynamics at the level of single aminoacyl residues. Such information is offered by protection factors, i.e. the position of the local equilibrium between the D-exchange-competent ‘open’ state and the non-exchanging ‘closed’ state. We present PyHDX, a software tool to calculate protection factors and Gibbs free energies typically within minutes from HDX-MS-derived peptide lists. PyHDX provides intrinsic information on the thermodynamics of protein dynamics at single-residue level. An interactive web interface further streamlines the process of transforming peptide lists to either coloured linear sequence maps or 3D structures of Gibbs free energies/protection factors.
Availability: PyHDX source code is released under the MIT license and can be accessed at the project’s GitHub page.
1 Introduction
Intrinsic dynamics commonly underlie protein function[1]. Hydrogen/deuterium exchange mass spectrometry (HDX-MS) is a powerful biophysical technique to monitor the intrinsic dynamics of proteins at near-residue resolution and on millisecond to hour timescales[2]. It is routinely used to compare different protein states, ultimately informing on differentially dynamic regions. To do so, it exploits the fact that different regions of a protein undergo D-exchange at different rates. In the common ‘bottom-up’ implementation of HDX-MS[3], proteins are D-labelled for various incubation times in D2O buffers. The reaction is then quenched to arrest exchange, and the proteins are digested by proteases and analysed by liquid chromatography-MS. Generally, D-labelling is carried out over timescales spanning 3-4 orders of magnitude[3].
D-uptake values are calculated from the average mass changes of peptides between deuterated and undeuterated populations using dedicated software[4]. The local D-uptake is directly correlated with the dynamics and flexibility of that region, with high D-uptake corresponding to increased flexibility.
The D-uptake values are generally presented either as D-uptake curves or D-uptake heatmaps (Figure 1A). These representations only show a 2D slice of the 3D (time, deuteration, peptide) dataset and suffer from poor spatial resolution, as the uptake of full peptides is not converted to residue-level information. Therefore, these types of representation generally do not capture the full breadth of the experiment and cannot convey all the information present in the dataset.
It is therefore desirable to represent HDX-MS data in a different form, in which the information along the peptide and time axes is collapsed. This can be achieved by a) deconvoluting overlapping peptides and mapping the information onto the linear protein sequence and b) extracting thermodynamic or kinetic quantities from the temporal information. This representation permits protein dynamics to be easily compared across different proteins or experimental conditions, as well as with other orthogonal methods.
Relating the observed D-uptake kinetics of individual peptides to the underlying kinetics of protein dynamics, or to thermodynamic quantities such as the Gibbs free energy at single-residue level, is challenging. Typically, the Linderstrøm-Lang two-state approximation is used to model the kinetics of HDX-MS experiments[5, 6, 7, 8]. In this model, backbone amides can be either in a ‘closed state’, where amide hydrogens form hydrogen bonds in secondary structures and cannot exchange with deuterium in solution, or in an exchange-competent ‘open state’, where hydrogen bonds are compromised. The position of this equilibrium is given by the ‘protection factor’ (PF), where the PF is high when the equilibrium is shifted to the ‘closed’ state and the region/protein is more rigid. This description also gives access to the difference in Gibbs free energy between the ‘open’ and ‘closed’ state, since the PF acts as the equilibrium constant between the two states.
Obtaining PF values for single residues is complicated by the fact that bottom-up HDX-MS experiments only yield the deuteration value of each proteolytic peptide in toto, rather than for individual amino acids directly. To extract residue-level information, overlapping peptides can be exploited[9]. Several recent methods derive residue-level dynamics from HDX-MS in the form of protection factors[10] or Gibbs free energy[11] and contribute towards a more comprehensive description of HDX-MS datasets. Nevertheless, significant hurdles for HDX-MS users remain. Specifically: a) analysing datasets requires optimization of a number of parameters equal to the number of residues in the protein, so analysis procedures can take many hours or days; b) custom non-commercial academic software is frequently challenging to install and can have a steep learning curve; c) the solutions require commercial licenses or the software itself is distributed under a commercial license. To address these issues, we developed the software tool PyHDX, which derives protection factors and Gibbs free energy rapidly, at near-residue level, from overlapping peptides. The full analysis, including classification (assigning colours to residues) and visualization, can be done in a web interface, typically within minutes (Supplementary Movie 1). Input data is in the form of an ‘HDX data’ table in CSV format, which lists all peptides and their associated D-exposure time and D-uptake, and is exported from widely used software such as DynamX (Waters, UK) or HDExaminer (Sierra Analytics, CA, USA). Using the obtained protection factors, residues can be coloured either by discrete binning in “flexibility” classes or by continuous custom colour maps. All output can be directly visualized in the web interface or exported either as a text file or as a PyMOL (Schrödinger, LLC) script to colour 3D structures.
2 Theory
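The Linderstrøm-Lang model describing HDX-MS experiments can be written as[8, 7, 6, 5]:

$$\mathrm{NH_{closed}} \underset{k_{close}}{\overset{k_{open}}{\rightleftharpoons}} \mathrm{NH_{open}} \xrightarrow{k_{int}} \mathrm{ND} \tag{1}$$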
Where NH is an amide hydrogen and ND an amide deuterium. Ultimately, the aim is to determine the position of the equilibrium between the open and closed states, by measuring the kinetics of formation of ND. The intrinsic exchange rate kint in the open state is dependent on the pH and temperature at which the deuterium labelling reaction takes place, as well as the primary sequence of the peptide, and can be calculated accurately[12, 13, 10]. This intrinsic rate is a major influence on the kinetics of D-exchange as it can vary up to three orders of magnitude (pH 6, 0°C, vs pH 8, 30°C). To correct for back-exchange[3, 14], a fully deuterated (FD) control sample is used. The experimentally determined FD provides the maximal degree of D-exchange possible for any given peptide. The corrected D-uptake for each peptide is then calculated as[14, 4]:
Where Dcorr is the corrected D-uptake, D is the experimental D-uptake, DFD the D-uptake of the fully deuterated control and nlabile the number of exchangeable amide hydrogens.
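As an illustration of the correction just described, the following minimal sketch assumes that the corrected uptake is the measured uptake expressed as a fraction of the fully deuterated control and rescaled by the number of exchangeable amides, i.e. Dcorr = (D / DFD) · nlabile. This form is inferred from the variable definitions above and is an assumption, not a quotation of the original expression; the function name is hypothetical.

```python
def corrected_uptake(d, d_fd, n_labile):
    """Back-exchange-corrected D-uptake of one peptide at one timepoint.

    Assumed form: (measured uptake / fully deuterated uptake) * exchangeable amides.
    """
    if d_fd <= 0:
        raise ValueError("fully deuterated control uptake must be positive")
    return d / d_fd * n_labile


# A peptide with 8 exchangeable amides that took up 3.0 Da, while its
# fully deuterated control reached only 6.0 Da because of back-exchange.
example = corrected_uptake(3.0, 6.0, 8)
```

Under this assumption a peptide that reaches the same uptake as its fully deuterated control is assigned exactly nlabile deuterons.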
Using the steady-state approximation, the observed rate of formation of the deuterated residue ND is given by:

$$k_{obs} = \frac{k_{open}\,k_{int}}{k_{open} + k_{close} + k_{int}} \tag{3}$$
Assuming that the protein dynamics are faster than the exchange reaction (kopen + kclose ≫ kint) and introducing the substitution PF = kclose/kopen, the expression reduces to:

$$k_{obs} = \frac{k_{int}}{1 + PF} \tag{4}$$
Where PF is the protection factor[15, 8, 16] for this particular protein residue. Protection factors are defined as the ratio of the closing rate to the opening rate and are therefore (in a two-state system) the inverse of the ratio of open-state to closed-state occupancy. This is equivalent to the system’s Boltzmann factor and thus protection factors inform directly on the states’ Gibbs free energy difference:

$$\Delta G = RT \ln(PF) \tag{5}$$

where R is the universal gas constant and T the absolute temperature.
In this equation, if the energy difference ΔG is small, the relative occupancy of the open state is high and the residues displaying it would be highly dynamic or disordered.
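The relation between protection factor and Gibbs free energy can be made concrete with a short sketch. It assumes the standard two-state relation ΔG = RT ln(PF); the function names are illustrative and not part of PyHDX.

```python
import math

R = 8.314462618  # universal gas constant, J mol^-1 K^-1


def delta_g(pf, temperature_k):
    """Gibbs free energy difference (J mol^-1) for a protection factor."""
    return R * temperature_k * math.log(pf)


def protection_factor(dg, temperature_k):
    """Inverse relation: protection factor for a free energy in J mol^-1."""
    return math.exp(dg / (R * temperature_k))


# PF = 19 at 30 °C (303.15 K) corresponds to roughly 7.4 kJ mol^-1.
dg_kj = delta_g(19, 303.15) / 1000
```

A protection factor of 1 (equal occupancy of open and closed states) maps to ΔG = 0, and every tenfold increase in PF adds about 5.8 kJ mol−1 at 30 °C.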
Given a protein with Nr residues r and a HDX-MS experiment which yielded Np peptides p at Nt timepoints t, each with an associated measured deuterium uptake D(t,p), the loss function to be minimized to find the protection factors is then:
Where the quantity g ≡ log10(PF) is introduced to ensure that the optimization takes place in energy space. D is an Np × Nt matrix of measured D-uptake values and X is an Np × Nr coupling matrix whose elements describe to which residues each peptide corresponds:
Extracting per-residue protection factors from HDX-MS data typically involves solving an underdetermined system, as the measured D-uptake is an average of several residues. To prevent overfitting, the L1-regularization term is added, where Δ represents the difference operator[15]. This regularization term ensures that the returned solution will have minimal fluctuations in g along r, unless available data dictates otherwise.
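For orientation, one way to write a loss function and coupling matrix that are consistent with the definitions above is given below. This is a reconstruction under stated assumptions, not the original equations (6) and (7): a squared-error data term, an unnormalized L1 penalty on first differences with weight λ, and a binary coupling matrix are all assumed.

```latex
\mathcal{L}(g) = \sum_{p=1}^{N_p} \sum_{t=1}^{N_t}
  \left[ D(t,p) - \sum_{r=1}^{N_r} X_{pr}
  \left( 1 - \exp\!\left( -\frac{k_{int,r}}{1 + 10^{g_r}}\, t \right) \right) \right]^2
  + \lambda \sum_{r=1}^{N_r - 1} \left| g_{r+1} - g_r \right|

X_{pr} =
\begin{cases}
  1 & \text{if residue } r \text{ is an exchanging residue of peptide } p \\
  0 & \text{otherwise}
\end{cases}
```

The exponent uses the observed rate kint/(1 + PF) with PF = 10^g, so each residue contributes a single-exponential uptake curve and each peptide sums the curves of its residues.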
To further ensure convergence to the correct solution, the vector g is initialized with guess values obtained by weighted averaging (by inverse peptide length) of all peptides at a given timepoint. This procedure yields a kinetic uptake curve per residue, from which the D-uptake rate is extracted by determining the half-life of D-uptake and converting this value back to a rate.
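The initial-guess procedure can be sketched as follows. This is an illustrative reading of the description above, not PyHDX's implementation: peptides are assumed to be given as inclusive, zero-based residue ranges with a fractional uptake, the half-life is found by linear interpolation between timepoints, and all function names are hypothetical.

```python
import math


def residue_average(peptides, n_residues):
    """Average peptide uptake per residue, weighted by inverse peptide length.

    peptides: list of (start, end, fractional_uptake), inclusive 0-based indices.
    Residues not covered by any peptide are returned as None.
    """
    totals = [0.0] * n_residues
    weights = [0.0] * n_residues
    for start, end, fraction in peptides:
        weight = 1.0 / (end - start + 1)  # short peptides count more
        for r in range(start, end + 1):
            totals[r] += weight * fraction
            weights[r] += weight
    return [t / w if w > 0 else None for t, w in zip(totals, weights)]


def half_life(times, fractions):
    """Time at which the uptake curve first crosses 0.5 (linear interpolation)."""
    points = list(zip(times, fractions))
    for (t0, f0), (t1, f1) in zip(points, points[1:]):
        if f0 < 0.5 <= f1:
            return t0 + (0.5 - f0) / (f1 - f0) * (t1 - t0)
    return None  # curve never crosses 0.5 within the sampled window


def rate_from_half_life(t_half):
    """Convert a half-life to a first-order rate constant."""
    return math.log(2) / t_half
```

Applying `residue_average` once per timepoint gives one uptake curve per residue; `half_life` and `rate_from_half_life` then turn each curve into a rate that could be compared with the intrinsic rate to seed g.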
With a median polypeptide length of 278 residues[17] for the E. coli proteome, and outliers up to 1039 (Antigen 43), a typical number of fit parameters is expected to be in the hundreds. This large number of fit parameters leads to long computation times when traditional fitting algorithms are used. To overcome this and deliver results in minutes, we used the machine learning framework TensorFlow, which is highly optimized to handle up to millions of parameters[18] and features automatic differentiation to rapidly calculate gradients. We implemented a custom network ‘layer’ on top of existing TensorFlow infrastructure to calculate D-uptake according to equation (6). The graph of the neural network was designed such that data input enters the fitting layer separately. This architecture potentially allows for the simultaneous input of several datasets of the same protein acquired under differing experimental conditions (pH, temperature). Variation of these experimental parameters can change the intrinsic D-exchange rate by factors of up to 1000[12], which can increase the dynamic range of the experiment by the same factor, without the need to extend the temporal range sampled. This approach has been used routinely to bring out D-exchange detail in highly dynamic regions that otherwise exchange too fast to be analysed in depth[19]. A prerequisite of this approach is the assumption that the protein’s dynamics are invariant under the different experimental conditions. When temperature-induced changes of the protein’s dynamics are considered, variation of temperature can increase the information obtained from HDX-MS experiments to include local enthalpic and entropic contributions, as well as coupling to global unfolding events[20].
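To show the calculation such a fitting layer has to perform, here is a framework-free sketch of the forward model and a regularized loss. It is not the TensorFlow implementation: it assumes single-exponential uptake per residue with rate kint/(1 + 10^g), a binary coupling matrix, a squared-error data term and an unnormalized L1 penalty on neighbouring differences in g. All names are hypothetical.

```python
import math


def predicted_uptake(coupling, k_int, g, times):
    """Predicted D-uptake per peptide (rows) and timepoint (columns).

    coupling: Np x Nr matrix, 1 where residue r belongs to peptide p.
    k_int:    intrinsic exchange rate per residue.
    g:        log10 of the protection factor per residue.
    """
    residue_fraction = [
        [1.0 - math.exp(-k / (1.0 + 10.0 ** gi) * t) for t in times]
        for k, gi in zip(k_int, g)
    ]
    return [
        [sum(x * residue_fraction[r][j] for r, x in enumerate(row))
         for j in range(len(times))]
        for row in coupling
    ]


def loss(measured, coupling, k_int, g, times, lam):
    """Squared error plus lam times the L1 norm of first differences in g."""
    predicted = predicted_uptake(coupling, k_int, g, times)
    squared_error = sum(
        (m - p) ** 2
        for m_row, p_row in zip(measured, predicted)
        for m, p in zip(m_row, p_row)
    )
    roughness = sum(abs(b - a) for a, b in zip(g, g[1:]))
    return squared_error + lam * roughness
```

An automatic-differentiation framework would evaluate the same expression on tensors and obtain the gradient with respect to g directly, which is what makes fits with hundreds of parameters fast.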
3 Application
To assess the performance of PyHDX, we chose the well characterized 17 kDa E. coli chaperone SecB (ecSecB) that assembles into tetramers. D-uptake was measured across six D-labelling timepoints (10 sec to 100 min; 30°C, pHread 8) and a fully deuterated control was included (details in Supplementary Information). D-uptake data are shown as uptake curves for three representative peptides (Figure 1A, left) and as a heatmap for all peptides at t=30s (Figure 1A, right).
PyHDX was used to calculate Gibbs free energies and protection factors for all residues (Figure 1B). To facilitate visualization, the residues were then classified, on the basis of their Gibbs free energies, into three kinetic classes of relative flexibility: ‘rigid’ (30 kJ mol−1, blue), ‘flexible’ (17.5 kJ mol−1, green) and ‘disordered’ (5 kJ mol−1, red). Colours were assigned by linear interpolation between these fixed nodes. The resulting energy landscape was visualized on the structure of one ecSecB protomer (Figure 1C).
We find that both the β4 strand and the internalized multimerization helix α1 are mostly rigid (ΔG ≈ 30 kJ mol−1). Flexible regions (ΔG ≈ 17.5 kJ mol−1) are confined to the edges of the β-sheet and to connecting loops. The C-tail (aa 138-155) as well as parts of β2 are disordered (ΔG ≈ 5 kJ mol−1). The calculated Gibbs free energy for the C-tail is 7.4 kJ mol−1 (PF=19), indicative of high flexibility, which is consistent with it not having been resolved by X-ray crystallography[23]. Nevertheless, since this region was resolved as unstructured and highly disordered by NMR[21], we hypothesize that the true protection factor of this region is lower than the value returned by PyHDX. This overestimation likely results from the limited time resolution of the experiment: under the temperature and pH conditions employed, the first timepoint is already at 99% D-exchange. Therefore, the obtained protection factor of the C-tail represents an experimentally imposed upper limit. Accurate calculation of protection factors in such dynamic regions would require additional experimental data obtained in regimes that slow down D-exchange (e.g. lower temperature and/or pH), as is traditionally done[19, 24].
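The colouring by linear interpolation between fixed nodes can be sketched as below. The node energies and colour names come from the text; the use of pure RGB primaries, the clamping outside the node range, and the function name are assumptions made for illustration.

```python
# (ΔG in kJ mol^-1, RGB colour): disordered = red, flexible = green, rigid = blue
NODES = [
    (5.0, (1.0, 0.0, 0.0)),
    (17.5, (0.0, 1.0, 0.0)),
    (30.0, (0.0, 0.0, 1.0)),
]


def delta_g_to_rgb(delta_g_kj):
    """Interpolate an RGB colour for a Gibbs free energy in kJ mol^-1."""
    if delta_g_kj <= NODES[0][0]:
        return NODES[0][1]  # clamp below the lowest node
    if delta_g_kj >= NODES[-1][0]:
        return NODES[-1][1]  # clamp above the highest node
    for (x0, c0), (x1, c1) in zip(NODES, NODES[1:]):
        if x0 <= delta_g_kj <= x1:
            f = (delta_g_kj - x0) / (x1 - x0)
            return tuple(a + f * (b - a) for a, b in zip(c0, c1))


# A residue halfway between 'disordered' and 'flexible' is an equal red/green mix.
halfway = delta_g_to_rgb(11.25)
```

Interpolating each channel independently keeps the mapping continuous, so residues close to a class boundary receive visibly intermediate colours rather than an abrupt switch.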
To test whether PyHDX can generate useful comparisons between homologous but non-identical proteins obtained under different experimental regimes in different laboratories, we determined the flexibility of a SecB structural homologue with modest sequence conservation (13% identity/27% similarity) from M. tuberculosis (mtSecB), and calculated ΔG and protection factors[22] (Figure 1D).
Comparing the mean ΔG values for both proteins (19.6 kJ mol−1 and 18.3 kJ mol−1 for ecSecB and mtSecB, respectively), a ‘flexibility index’ can be derived from which both proteins, despite their differing physiological role, appear to have a very similar overall flexibility.
However, care must be taken when making such comparisons, because the experimental D-labelling conditions of mtSecB (20 °C, pD 6, 4 timepoints from 10 s to 30 min) differ from those of ecSecB, lowering the intrinsic D-exchange rate roughly 100-fold. Although this is corrected for in deriving ΔG values, the dynamic range of the experiment is shifted towards lower energies and protection factors. Accurate cross-comparisons of the protein flexibility index would be possible only if all residues were fully within the experiment’s dynamic range.
PyHDX yields a strikingly similar overall energetic landscape for the two proteins. The similarities of flexibility profiles of mtSecB and ecSecB (Figure 1F) are conserved along the secondary structure elements of the proteins. The protein cores are rigid with only the edges of the central β-sheet showing flexibility as do the disordered C-tails (Figure 1B, D, F).
However, we hypothesize that the experimental conditions skew the actual ΔG values obtained, as highlighted by the extremes in protection factors. The rigid regions of mtSecB appear more flexible than those of ecSecB when ΔG values are compared. The obtained ΔG values for mtSecB likely represent a lower limit, since these regions only show 2% D-uptake at the last (30 minutes) timepoint, resulting in a poor description of D-uptake by the fitting. To obtain accurate protection factors for the rigid regions of mtSecB, the exposure time of these measurements would need to be extended to at least 40 hours, corresponding to the half-time of D-uptake under these conditions for a protection factor of 10⁵. Moreover, several short and slowly exchanging peptides (aa 136-153, Supplementary Document 2) overlap with longer peptides, which skews initial guesses to lower protection factors. Similarly, because the dynamic range of the experiment is shifted towards lower protection factors, the disordered C-tail of mtSecB (not resolved in the X-ray structure, Figure 1D) is assigned a ΔG value of 4.4 kJ mol−1, compared to 7.4 kJ mol−1 for ecSecB.
To probe whether PyHDX is applicable to the wider protein space, we derived Gibbs free energies for five more proteins measured under similar experimental conditions and observed a wide range of flexibility profiles (see Supplementary Information for experimental details). The Gibbs free energies for all five proteins were pooled and classified into three categories by using a multilevel Otsu threshold (Figure 1G).
The obtained fit results generally describe the data well (Supplementary Documents 1-4, mean absolute error < 1 for 66% of peptides). For some peptides, however, the full complexity of the measured D-uptake kinetics is not completely captured (e.g. peptide 42-55, Supplementary Document 1). Several factors can contribute to this.
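The exposure-time estimate for the rigid regions of mtSecB can be related to the reduced rate expression with a small calculation. The sketch assumes first-order uptake with observed rate kint/(1 + PF), so that the half-time is ln 2 · (1 + PF)/kint. The intrinsic rate printed by the example is back-calculated from the quoted 40 hours and is therefore a hypothetical value, not one reported in the text.

```python
import math


def half_life_seconds(k_int, pf):
    """Half-time of D-uptake for a residue with observed rate k_int / (1 + PF)."""
    k_obs = k_int / (1.0 + pf)
    return math.log(2) / k_obs


def implied_k_int(t_half_seconds, pf):
    """Intrinsic rate (s^-1) that would give the stated half-time at this PF."""
    return math.log(2) * (1.0 + pf) / t_half_seconds


# Hypothetical back-calculation: a 40 h half-time at PF = 1e5 implies an
# intrinsic exchange rate of about 0.48 s^-1.
k_int = implied_k_int(40 * 3600, 1e5)
hours = half_life_seconds(k_int, 1e5) / 3600
```

Because the half-time scales linearly with 1 + PF, each additional order of magnitude in protection factor requires roughly ten times longer labelling to be resolved under the same conditions.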
First, at the chosen regularization level (λ = 20), some features in the protection factor profile consisting of several amino acids are “flattened out” as they would incur a high loss penalty at this regularization level. Tuning the regularization parameter can alleviate this issue, at the risk of overfitting.
Second, some peptides exhibit D-uptake profiles where the deuterium uptake reaches a plateau at the first timepoint and remains constant thereafter (e.g. peptide 103-123, Supplementary Document 2). This behaviour suggests that either some residues in this peptide exchange fast while the rest of the residues exchange slowly or that large variations occur in intrinsic exchange rates within this peptide. Such complex behaviour provides a significant challenge to fitting algorithms as exact residue-level protection factor assignments within ‘subfragments’[11] of peptides are impossible due to the non-identifiability issue[11, 4, 14].
As residues differ in their intrinsic D-exchange rates, an incorrect assignment of protection factors from initial guesses can lead to convergence to local minima during the optimization process. Convergence to local minima is further evidenced by the observation that variations in the initial guesses result in variations in the final fit result. Possible strategies to mitigate this are the introduction of different optimization algorithms such as stochastic gradient descent or repeated fitting with variable initial guesses[10, 15].
Third, errors can be introduced by the approximations made when reducing the complexity of an HDX-MS experiment with the Linderstrøm-Lang model and by the further approximations leading to equation (4). The steady-state approximation will break down at high intrinsic exchange rates or low protection factors. Implementation of numerical integration of the differential equations should expand the applicability of our method to experiments in these kinetic regimes.
Finally, the two-state model for residue dynamics may need to be expanded to include a second non-exchanging closed state[8, 20]. This type of kinetic model approximates to a biexponential association model, which we empirically find to be applicable to many peptides. With numerical integration in place, extending the model to include a second closed state is a trivial exercise. However, the introduction of two additional fit parameters will further increase the likelihood of overfitting and will likely require a large dataset to be fitted in a global analysis, spanning many timepoints and temperatures, and possibly pH regimes.
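The breakdown of the approximation can be illustrated by integrating the two-state scheme numerically and comparing it with the single-exponential expression. The sketch below is not part of PyHDX: it uses a hand-written fixed-step fourth-order Runge-Kutta integrator, starts from the open/closed equilibrium, and the rate constants in the examples are hypothetical values chosen only to show the two regimes.

```python
import math


def simulate_uptake(k_open, k_close, k_int, t_end, dt=1e-3):
    """Fraction deuterated at t_end from RK4 integration of closed <-> open -> D."""
    pf = k_close / k_open
    closed, opened = pf / (1.0 + pf), 1.0 / (1.0 + pf)  # equilibrium at t = 0

    def deriv(c, o):
        dc = -k_open * c + k_close * o
        do = k_open * c - (k_close + k_int) * o
        return dc, do

    for _ in range(round(t_end / dt)):
        k1c, k1o = deriv(closed, opened)
        k2c, k2o = deriv(closed + 0.5 * dt * k1c, opened + 0.5 * dt * k1o)
        k3c, k3o = deriv(closed + 0.5 * dt * k2c, opened + 0.5 * dt * k2o)
        k4c, k4o = deriv(closed + dt * k3c, opened + dt * k3o)
        closed += dt * (k1c + 2 * k2c + 2 * k3c + k4c) / 6.0
        opened += dt * (k1o + 2 * k2o + 2 * k3o + k4o) / 6.0
    return 1.0 - closed - opened  # whatever is no longer NH has become ND


def approximate_uptake(k_open, k_close, k_int, t):
    """Single-exponential uptake with observed rate k_int / (1 + PF)."""
    pf = k_close / k_open
    return 1.0 - math.exp(-k_int / (1.0 + pf) * t)


# Fast dynamics, PF = 100: the approximation tracks the integration closely.
well_behaved = (simulate_uptake(1, 100, 1, 50), approximate_uptake(1, 100, 1, 50))
# Fast intrinsic exchange, PF = 1: the approximation overestimates uptake.
broken = (simulate_uptake(1, 1, 100, 0.1), approximate_uptake(1, 1, 100, 0.1))
```

In the second case the open half of the population exchanges almost immediately while the closed half can only exchange as fast as it opens, which a single rate constant cannot describe. The fixed step size must stay small relative to the fastest rate constant for the explicit integrator to remain stable.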
4 Conclusion
In summary, PyHDX significantly simplifies and accelerates routine HDX-MS kinetics analyses and makes it possible to compare flexibility profiles across mutant derivatives, proteins in quaternary assemblies, homologous proteins, protein families and protein folds. Its open-source nature and modular implementation allow for easy modification and extension of the software, and users are encouraged to offer suggestions, request features or submit code extensions.
5 Implementation
PyHDX is built on top of the scientific Python ecosystem. Computation is done using the packages numpy[25], scipy[26], scikit-image[27] and symfit[28]. Fitting of protection factors is implemented on the machine learning platform TensorFlow[18]. Computationally intensive tasks are scheduled to be processed in parallel through dask[29]. Intrinsic exchange rates are calculated as previously described[12, 13] and as implemented by expfact[10]. Graphical output is generated with either matplotlib[30] or bokeh[31]. PyHDX features an API for data analysis in Jupyter notebooks[32] and a web application implemented in panel[33], using NGL[34, 35] to visualize proteins.
Competing Interests
The authors declare they have no competing financial interests or other conflicts of interest.
Author Contributions
JHS conceived all mathematical analysis and developed and implemented software and web interface. SKr, BYS and SK provided HDX-MS data and analysis. SKr guided optimization of software parameters and validated output. JHS wrote the first draft with contributions from SKr and AE. All authors reviewed and approved the final manuscript. AE conceived and managed the project.
Acknowledgements
We are grateful to: J. Marcoux, P. Geneveux and L. Mourey for generously sharing mtSecB HDX-MS data; J. Claesen for discussions; the open source software community for their support and advice. Research in our lab was funded by grants (to AE): ProFlow (FWO/F.R.S.-FNRS “Excellence of Science - EOS” programme grant #30550343) and CARBS (#G0C6814N; FWO) and (to AE and SK): FOscil (ZKD4582 - C16/18/008; KU Leuven). SKr was a FWO [PEGASUS]2 MSC postdoctoral fellow. JHS was a PDM, KU Leuven postdoctoral fellow. This project has received funding from the Research Foundation – Flanders (FWO) and the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 665501.