An equilibrium constant pH molecular dynamics method for accurate prediction of pH-dependence in protein systems: Theory and application

Computational modeling and simulation of biomolecular systems at their functional pH ranges requires an accurate approach to exploring the pH dependence of conformations and interactions. Here we present a new approach – the Equilibrium Constant pH (ECpH) method – to perform conformational sampling of protein systems in the framework of molecular dynamics simulations in an N, P, T-thermodynamic ensemble. The performance of ECpH is illustrated for two proteins with experimentally determined conformational responses to pH change: the small globular water-soluble bovine b-lactoglobulin (BBL), and the dimer transmembrane antiporter CLC-ec1 Cl−/H+. We show that with computational speeds comparable to equivalent canonical MD simulations we performed, the ECpH trajectories reproduce accurately the pH-dependent conformational changes observed experimentally in these two protein systems, some of which were not seen in the corresponding canonical MD simulations. Table of Contents artwork


Introduction
The prospect of exploring the pH dependence of macromolecular biological systems with molecular modeling tools, holds great promise for gaining reliable mechanistic insight and predicting the optimal pH for the peak functional activity. As the functions of a large variety of proteins were shown to be influenced by switching between protonation states of one or more residues [1][2][3][4][5] , an understanding of the atomistic underpinnings of pH dependence should offer powerful tools for regulation of protein activity by increasing the population of specific conformational states, and guiding experimental design.
In canonical molecular dynamics (cMD) simulations the pH of the system is defined by a set of fixed protonation states of all titratable residues that are calculated for a certain pH from per-residue pKa values. However, the pKa values are usually known only for the solvated single amino acids, and not in the structural context of a specific protein where they are likely to be shifted by the non-homogeneous environment. This complicates the assignment of protonation states of protein residues. Prediction methods based on the frozen structure of the protein, such as the used in the popular PROPKA server 6 , have been developed to provide estimates of individual pKa values in a given structure. While this approach is rapid and convenient, it ignores the conformational flexibility of protein macromolecule under realistic conditions, and thereby can miss key elements of the protein's response to pH changes. The possibility for inaccurate pKa assignments and correspondingly for the incorrect protonation states that it produces for the cMD simulations, can result in inaccurate inferences regarding the states visited by the system and the transitions of interest.
Not surprisingly, the perceived benefits of performing molecular dynamics simulation of proteins with correctly defined protonation states at a given pH have inspired efforts to develop fast and accurate implementations of a number of constant pH (CpH) versions of MD simulations [7][8][9][10][11][12] . Implicit solvent models have enabled CpH implementation in semi-grand canonical ensemble frameworks, and several newer variations of CpH use allatom simulations that can be coupled to advanced sampling techniques, and sometimes even with polarizable force fields 7,8,10,12,13 .
The approach we present here enables equilibrium constant pH (ECpH) simulation within a canonical (or N, P, T) ensemble based on starting with a protein construct in which all titratable sites are fully protonated, and scaling for each site the electrostatic and van-der-Waals interactions using a linear combination of its individual protonation probabilities as the conformation changes in the simulation towards convergence. This procedure takes advantage of a definition of protonation probability for each titratable site and yields a definition of the system in a given conformation at a certain pH. The protonation probability changes accurately with conformation during simulations, as long as the only contribution to the energy difference between two protonation states come from nonbonded interactions. With the model of a given protein state at each pH value obtained in this manner, it is possible to construct the dynamic representation of the pH dependence of a protein from calculated protonation states that reflect the actual proton-occupancies of sites during the dynamics. A major advantage of this development is that it eliminates the approximation of representing protonation states as a binary protonated/non-protonated choice.
We illustrate the new ECpH method and its application for two systems of different size and complexity: the bovine ß-lactoglobulin (BBL) known to undergo a pH-dependent conformational change, and the membrane protein CLC-ec1, a Cl -/H + antiporter in which the dependence on pH leads to major conformational changes involved in regulation of Cltransport [14][15][16] . The results are presented in parallel with benchmark tests comparing the ECpH trajectories of these systems to equivalent canonical (cMD) simulations showing that the new method reproduces the stability and ensemble dynamic properties of cMD trajectories. Using direct comparisons with structural data for the two systems to probe the validity of the ECpH results of calculations for a range of pH values, we show that ECpH simulations achieve better agreement with experiment in every detail. For BBL, ECpH reproduces the pH-dependent rigidity of its secondary structure [17][18][19][20] , and samples the intermediate  conformation observed using tryptophan fluorescence at pH 7 21 , whereas the corresponding cMD simulations   miss the intermediate conformations detected experimentally. Similarly, application of ECpH to the larger CLC-ec1 transporter, illustrates the successful representation of the more complex pH dependence of functionally important conformational rearrangements occurring in a pH range of 4.5 to 7.5, and the ability of this new method to bring to light some major repositioning of transmembrane helices inferred from experimental measurements.
Thus, recently published experimental data 22 identified the same changes in the switch from the inward facing conformational state at pH 7.5 to an outward facing opened state at pH 4. Here the enhanced sampling of pHdependent conformations in CLC-ec1 system is achieved by coupling the ECpH approach to the same Hamiltonian Replica Exchange Molecular Dynamics (H-REMD) protocol that offers promising scalability, accuracy, and significant acceleration of sampling with CpH methods 12 .

I. Theoretical framework
Together with the proper assignment of individual pKa values, the suitable representation of the partial nature of protonation state is essential for the accurate prediction of pH dependence of protein structure and dynamics. In the semi-grand canonical ensemble framework, the concept of co-existing protonation states is realized through discrete switching among protonation states. Because of the high frequency of proton exchange, this approach requires significant computational resources, and for the real-world all-atom protein systems remains mostly unfeasible. Here we show that the ECpH method formulated in the same n,P,T-ensemble framework commonly used in MD simulations, and using the same information for input pKa values, yields better insight than cMD into the pH-dependence of protein dynamics .
The protonation scheme. A central element of the theoretical framework of the ECpH method described here is the construction of the dynamic protonation scheme for a molecular system composed of any number of protonatable sites. To enable the shift between protonated and not protonated states, the number of particles is kept the same in the two states, which is achieved by positioning dummy-proton representations at each of the positions that are potentially protonated. The presence/absence of a proton at each site is represented by linearly scaling the non-bonded interactions of sites with titratable hydrogens by an individual protonation probability value that is dynamically determined by (1)-the pKa value of that site in a particular state of the entire system, and (2)-by the ambient pH. For a gradual shift between the protonated/non-protonated states of the system, the charges of the atoms neighboring the titratable residues are also scaled linearly, by the same scaling factor. Thus, the core idea of the ECpH method is the introduction of the modified force field expression described below, representing the pH-dependence of the state of the system.
Expression of the pH dependence of the state. The implementation of this core idea of pH dependence a canonical ensemble framework is illustrated first by considering a system with a single protonation site. The protonation probability l for this case is equal to 1, corresponding to a system at very low pH. Expressing the average energy of the subset of atoms of a titratable residue in this protonated state from the probability density ( "#$% & ) integrated over configuration space W we get: where 〈 "#$% & 〉 is an average total energy of the single protonation site of the system, "#$% & ( ) is a Boltzmann probability density of a protonated site over its whole conformational space, and "#$% & is the internal energy of this part of the system.
The expected pH-dependent differences in dynamics arise from the protonation probability densities modified by the change in proton availability defined by the pH. This pH dependence is introduced through individual protonation probabilities for each protonatable site throughout the system. We assume that the system visits a shared configurational space at the different pH values; degradation processes (e.g., complete unfolding or fragmentation) are outside the scope of this method. Therefore, for a system in any protonation state with a probability density "#$% 6 ( ), the average energy is expressed as: where the probability density function is defined on the same configuration space Ω. Thus, by scaling the internal energy "#$% & by the l corresponding to the pH, eq. 2 defines the Boltzmann probability density for any singlesite system with protonation probability λ at a given pH, in terms of the probability density of the fully protonated This definition is equivalent to splitting the entire ensemble into a non-protonated ( <$<"#$%

6
) and a protonated fractions ( "#$% 6 ): so that at extreme l values the internal energy would be: and for any other l the internal energy of the corresponding ensemble is expressed as: As the protonation probabilities vary at different pH values, the difference between energies of ensembles at different pH values arises from interactions of the titratable proton, so that the internal energies "#$% <$<"#$% represent interactions between all atoms of the system and the titratable proton, according to a probability l; this makes <$<"#$% equal to zero.
By the same definitions, expressing the average total energy of a system with K titratable sites at a very low pH so that for a system with multiple pH-dependent protonation sites and a set of protonation probabilities l(pH): where E is an individual protonation probability for the site k calculated for the current pH. It should be noted Calculations in a given pH range. In the protonation scheme defined by the conceptual framework developed here, a single canonical ensemble is created for each pH in a given range, by using the corresponding protonation probabilities to scale the internal energies of the titratable protonation sites of the system. We follow the common protocol 24, 25 of applying scaling only to nonbonded terms of the potential energy to eliminate the dependence of kinetic energy on pH. This is achieved by preserving the proton mass in the dummy atoms when l=0. While this could have a small effect on the dynamics of the system, it does not affect the main goal of proper sampling of the conformation distributions.
Definition of the effective forcefield. The electrostatic potential energy and van der Waals terms for interactions of titratable protons are scaled in accordance to the energy definitions in Eqn-s (8) and (9): The partial charges of atoms neighboring the titratable residues, which are affected by the change in protonation state are calculated as following: transition between the e and d tautomeric forms, we use two scaling parameters in Eqn-s (8), (9), and (10): one l to model deprotonation, and a second one -binary -to encode the information about the tautomeric form.
Conservation of the net charge. All CpH approaches based on discrete protonation states face a common problem, emerging from the charge-exchange procedure that affects the net charge of the molecular system, which becomes variable, non-zero, during the calculation. Thus, approaches proposed to enable simulations in explicit solvent with PME treatment of long-range electrostatics, were shown to affect the dynamical properties of the system 26,27 . Here we use an approach that overcomes the problem by distributing the excessive net charge to a sufficiently large number of water molecules in the solution to avoid affecting their state. This is achieved by assigning to the oxygen atoms of randomly selected water molecules an additional partial charge of 0.001e.
The extremely dilute acidic environment of this water solution acts as a charge buffer that can absorb from, or provide to, the molecular system the necessary charges without affecting enthalpy, entropy, or the interactions with proteins or membranes.

Calculation of the protonation probabilities.
Once individual pKa values are set for each protonatable residue in the system, the protonation probabilities are evaluated from the equilibrium constant of protonation/deprotonation process: In this paper only several pKa values were set specifically in each system for the key residues that define the pH dependence based on the experimental findings (see Methods section). Otherwise the model pKa-s of singular amino acids in water solution were used.

II. Performance of the new Equilibrium Constant pH molecular dynamics method
The implementation of the ECpH method to the study of conformational transitions in protein systems is designed to have the general characteristics of conventional MD simulations. This includes the stability of the trajectories generated in the runs, the integrity of the overall protein structure throughout the simulation, and the similarity of dynamic properties of the generated ensembles. To test these elements of the performance of the ECpH method in comparison to parallel canonical cMD simulations, we used as first benchmark a small (162 amino acids) globular protein -bovine b-lactoglobulin (BBL). BBL is known for a signature, pH-dependent conformational shift of the EF loop at pH ~7.3 (Tanford transition) that is attributed to an unusually high pKa value of Glu89 4, 28 . In addition, the initiation of an unfolding process of BBL at low (< 3) and at high (> 9) pH values, is associated with a series of conformational transitions that have been captured with numerous experimental techniques including NMR, X-ray crystallography and Heteronuclear sequential quantum correlation spectroscopy (HSQC) 21 . We therefore investigated the performance of ECpH method in describing the pH-dependent dynamic properties of BBL from a set of ECpH trajectories obtained at pH conditions ranging from 2 to 11, and compared it to cMD simulations with protonation states of titratable residues varied to represent states at pH 2, 6, and 8.

Stability and dynamical properties of ECpH trajectories.
The root-mean-square deviation (RMSD) computed along the trajectory for backbone atoms of the protein (see Fig. 1A) is commonly used to infer on the stability of MD methods 12,29 . We also use the comparison of standard deviation of the RMSD (STDRMSD) calculated for short segments of the trajectories to quantify that stability (Fig. 1B). As shown in Fig 1A, Figure 1B). The initiation of an unfolding process involving loosening of the proteins' loops was observed at pH 2 in NMR and other spectroscopic studies 18,30,31 .
Indeed, the RMSD profile of the ECpH trajectory at pH 2 indicates a conformational change of the loops AB, BC, CD, and DE into a loose opened conformation at ~600 ns ( Figure S1). In the canonical MD simulations, the geometry of BBL evolves gradually to the same RMSD values. Transition to a more basic environment (pH 8) forces the system into a gradual conformational change associated with the Tanford transition of loop EF. We used a standard technique for testing protein stability, native contact analysis 29 , to evaluate the stability of BBL in the ECpH and cMD simulations, with reference to the crystal structure of BBL at pH 6 (PDB ID 3BLG) in which the EF loop is in the closed conformation ( Figure 1C). Interestingly, the changes in the fraction of native contacts in response to pH changes are similar, and both methods reproduce the experimental structural data 17-21, 30-31 regarding the extremely stable secondary b-barrel structure of the BBL core at the pH range studied which was also confirmed by Solvent Accessible Surface area distributions computed at similar pH range ( Figure   S2). In more basic environments (pH > 8) a more dramatic loss of native contacts is observed, while the occupancy of the initial state populated at lower pH, vanishes ( Figure 1D).
Water distribution. Since the force field modification scheme uses explicit water molecules as a charge buffer, the method was also tested for the ability to reproduce proper water dynamics. The radial distribution functional (RDF) analysis demonstrates identical water densities in canonical MD and ECpH trajectories. The water distribution around titratable protons calculated separately, morphs gradually between non-protonated and protonated state densities observed in canonical MD as the pH changes ( Figure S3).
Computational speed. For practical use of ECpH with real-world systems, the computational speed needs to remain within the range of canonical MD performance. Table 1 shows such a performance comparison for BBL and CLC-ec1 in explicit environments (aqueous, and membrane, respectively), as described in the following section.  The distributions of preserved native contacts in ECpH simulations at pH conditions from 2 to 11 ( Figure 1D) are consistent with the conformational change occurring at pH 8. Notably, in the ECpH simulation at pH 8 the conformational transition is delayed by ~300 ns compared to the cMD trajectory (Figure 2, left). This delay reflects the fact that in cMD the deprotonation of Glu89 is set artificially, while ECpH represents the more realistic situation of the titratable proton of Glu89 maintaining a ~17% occupancy at pH 8. Notably, the gradual decrease of proton occupancy in Glu89 seen in the in ECpH simulation at pH 7 offers a more detailed description of the mechanism of loop opening, in which the exit of the Glu89 sidechain from the hydrophobic cavity involves first the formation of a hydrogen bond between it and Asn109 ( Figure 3). The stable interaction between the EF and GH loops forces the latter to shift and provide space for repositioning of the AB loop ( Figure 3A). AB and EF form the gates to the hydrophobic core of BBL.
The overall conformational change triggers a new orientation of the Trp61 side chain that is now stabilized by π-interactions with Arg40 located on AB ( Figure 4A). Notably, the move of the AB loop is observed only in ECpH and not in the cMD simulations. The reason is that in the cMD runs, the dependence of the Trp61 sidechain on pH, is lost. In contrast, the ECpH trajectories show that Trp61 visits three orientational states. State 2 ( Figure   4B) is sampled at pH 7 and reflects the conformational changes discussed above. State 3 is visited only very briefly in cMD at pH 6 ( Figure 4C), whereas in the ECpH simulations it is strongly populated at extreme pH values at which multiple experiments observed the beginning of unfolding. Thus, in a basic environment (pH > 9.5) the fraction of secondary b-sheet structure of BBL was shown to decrease rapidly while the portion of disordered protein structure increases 19 . Although the unfolding process is too slow to be captured in the 1 µs simulations we carried out 18  in the ECpH simulations in acidic environment (pH 2-5). In the cMD trajectories, however, the BBL protein appears to be more rigid than expected at acidic pH, as the fluctuations of BC, CD and Ha either do not differ from those at other pH values or are even decreased.
Unlike the other structural elements, the increased flexibility of Helix a observed with NMR at acidic pH, is not captured in the either method (Figure 5a), most probably because the trajectories are too short (1 µs) to model the helix unfolding. However, the ECpH method does record the significant mobility of Helix a in the extremely basic environment (pH > 10), in line with the decreased fraction of ordered secondary structure observed at pH > 9 19 . At neutral pH the mobility of all loops is reduces in both MD methods. Notably, the decreased mobility of the AB loop at basic pH compared to acidic and neutral conditions is the consequence of the conformational change detected in the ECpH simulations at pH 7. In cMD, however, the mobility of the AB loop is increased at pH 8, impairing the hydrophobic core integrity that requires the observed closure of the AB loop at that pH.
CLC-ec1. The performance of ECpH in the modeling of pH-dependent conformational changes in a class of larger and more complex proteins, in a membrane environment, is illustrated for the CLC-ec1 transporter protein that performs the coupled antiport of Cland H + 32, 33 . The H + /Clexchange follows a ~ 1:2 stoichiometry that suggests a pH-dependence of the transport function 32,34 . Indeed, the efficacy of CLC-ec1 mediated transport is pH dependent, and the largest Cluptake is observed at pH 4.5. Interestingly, crystallography data for the WT CLC-ec1 at pH 4.6 (PDB ID 1KPL) and pH 8.5 (PDB ID 1KPK) does not show major difference in conformational state of the protein 35 . However, other experimental evidence, e.g., from NMR, was suggested to indicate that the interconversion between functional states of the CLC protein in response to pH changes involves large-scale conformational changes such as the movement of Helix R and the structural change of the P-Q linker [36][37][38] . The introduction of a disulfide crosslink by the mutations A399C/A432C inhibited Cl-transport, leading to the suggestion of a connection between the displacement of Helix O and the gate opening in outward-facing state in WT protein 16 . Major structural rearrangements were also suggested from DEER experiments, cross-linking, and NMR experiments, which concluded that motions of helices G, N, O and P act to widen the extracellular bottleneck 14,15 . The recent high-resolution X-ray structure of a mutant construct of the CLC-ec1 protein (E148Q/E203Q/E113Q, PDB ID 6V2J) termed QQQ-CLC 14 , does show structural rearrangements to form an outward-facing open conformational state at pH 7.5 that mimics the WT-like conformation at low pH. We therefore used the QQQ mutant structure as a reference to investigate the pH-dependent changes predicted from parallel ECpH and cMD simulations in that pH change.
According to the QQQ-CLC data, the conformational change occurring between pH 4.5 and pH 7.5 involves the repositioning of helices M, N, O, P, and G. This is supported by the inter-subunit distance distribution for Caatoms measured in DEER for Met373, Val374 (Helix N), Glu385 (Helix O), Asp417 (Helix P) and Asp171 (Helix G). From the trajectories obtained for the two pH conditions, we evaluated first the distribution of distances between the Ca-atoms of the residues A399 and A432 within a subunit that is believed to change during the chloride transport connected by the presence of a crosslink in the A399C/A432C mutant. The change in the distribution of 399-432 Ca-atoms distances between pH 5 and 8 is evident in the ECpH trajectories, but is absent in the cMD simulations ( Figure 6, right). In the DEER experiment the inter-subunit distance of Asp171 which is located on the intracellular side of the CLC protomers exhibits the largest change observed among those detected experimentally (6 Å) and is reproduced by the data presented here from the ECpH calculation, where the second distribution peak of a new conformation appears at pH 5. Meanwhile the distributions obtained with cMD shows only one peak for Asp171 inter-subunit distance at both pH values.
Other distance changes observed in the DEER experiment are for Met373, Val374, and Glu385, all of which are located at the extracellular end of CLC-ec1 (Figure 6, left). Although subtle, the conformational rearrangement of Helix N recorded in the ECpH ensemble at acidic pH, agrees perfectly with the experiment. Further, the shift in inter-subunit distance distribution for E385 agrees well with the experimental results used to suggest a repositioning of Helix O as suggested in ref. 14. As seen in Figure 6, however, the predictions of pH-dependent relative changes obtained from the comparative cMD simulations at pH 4.5 and 7.5 were not in agreement with the experimental data 14 . Analysis of the likely reason for the discrepancy between the experimental and ECpHcomputed results for the Asp417 in the P-Q flexible loop at the CLC-ec1 dimer interface indicates that the larger movement of Asp417 observed in the DEER experiment may depend on a preceding reorientation of Tyr419 (also in the P-Q loop). Indeed, the inter-subunit distance between Ca-atoms of Tyr419 calculated with ECpH ( Figure S4) suggests a larger distance between the P-Q loops than that inferred from the Asp417 data. Because it had been shown previously 37-38 that at pH 4 the orientation of Tyr419 (P-Q loop) changes from buried to outwardfacing, it is reasonable to infer from the relative positions of Asp417 and Tyr419 that for the former to move, the latter needs to reorient and clear the space inside the buried area in P-Q region. This movement has started in the ECpH simulations and longer trajectories are likely to eliminate the discrepancy.
The extent of transition to the active state represented by the inter-subunit densities in 0.5 µs H-REMD ECpH simulations at pH 5 ( Figure 6), is quantified for each subunit by the distribution of RMSD of G, N, O and P helices (referenced to PDB 1OTS; Figure S5). Of the four helices involved in pH-dependent structural change of CLC-ec1 only helix G shows similar RMSD occupancies for the new state in both subunits, while the displacement of the other helices is not symmetrical in the two. As the CLC-ec1 monomers are known to be functionally independent 35 , the asymmetrical response to a pH change is not surprising, and the same changes are expected to occur in both monomers during longer trajectories.

Discussion
The new equilibrium constant pH method presented here was designed to enable the exploration of pH dependent structural dynamics of macromolecular systems with greater accuracy and reliability than conventional MD simulations in a framework that conserves their properties and practical attributes. The trajectories generated within ECpH for the illustration of the method and its performance were indeed shown to have the same level of stability as the corresponding cMD trajectories, with similar performance characteristics and dynamics, at similar computational speeds.
The differences in the pH-dependent behavior of the systems described by ECpH simulations compared to cMD, point to the significant advantages of using partial protonation probabilities, over the binary representation of protonation changes in cMD. Because ECpH is able to distinguish in this manner between the effects of pH in a narrow range of physiologically relevant values (5 -8)  The corresponding cMD simulations were performed for pH 2, 6 and 8. At pH 2 all titratable residues were protonated, while ensembles at pH 6 and 8 differed in the protonation the protonation states of Glu89 and His residues. The simulations were also performed for 1µs each, starting from both states of the EF loop.
For CLC-eq1 the simulated molecular construct was built from PDB ID 1OTS 41  The ECpH MD simulation was performed for a pH range from 2 to 9. The Hamiltonian replica exchange protocol was applied every 0.1 ns of 0.5 µs ECpH MD simulation to a random pair of replicas. pKa values were set for Glu113 and Glu148 at 8 and 6.5, respectively according to experimental and computational data 5,46 . All other titratable residues were assigned the pKa values of the corresponding amino acids. For cMD of the CLC-ec1 system data were generated from 2 parallel trajectories of 1.6 µs each for both pH 4.5 and pH 7.5.
Native contact analysis. The native contact analysis was performed on 1 µs trajectories at a pH range from 2 to

Supporting Information
The Supporting information contains depiction of a loose loop conformation of BBL protein at pH 2, a Solvent Accessible Surface Area (SASA) distribution for the hydrophobic core of BBL protein, the radial distribution of water molecules around the titratable residues, the RMSD distribution for CLC-ec1 helices G, N, O, P in the two subunits of the dimer and the inter-subunit distances between Tyr419 residues from ECpH simulations at pH 5 and 8. The ECpH python code compatible with OpenMM is published on the GitHub: https://github.com/weinsteinlab/pH-

Replica-Exchange
The Supporting Information is available free of charge on the ACS Publications website.

Funding Sources
The work was supported by National Institutes of Health (NIH) Grant R01GM106717 (H.W.), and NSF award BIGDATA: IA: Collaborative Research: In Situ Data Analytics for Next Generation Molecular Dynamics Workflows (NSF #1740990).

Notes
The authors declare no competing financial interest.