Abstract
We recently showed that the time-structure based independent component analysis (tICA) method from the Markov state model literature provides a set of variationally optimal slow collective variables for Metadynamics (tICA-Metadynamics). In this paper, we extend the methodology towards efficient sampling of protein mutants by borrowing ideas from transfer learning methods in machine learning. Our method explicitly assumes that a similar set of slow modes and states is found in both the wild type and its mutants. Under this assumption, we describe a few simple techniques using sequence mapping for transferring the slow modes and structural information contained in the wild type simulation to a mutant model for performing enhanced sampling. The resulting simulations can then be reweighted onto the full phase space using MBAR, allowing for thermodynamic comparison against the wild type. We first benchmark our methodology by recapturing alanine dipeptide dynamics across a range of different atomistic force fields after learning a set of slow modes using Amber ff99sb-ILDN. We next extend the method by including structural data from the wild type simulation and apply the technique to recapturing the effects of the GTT mutation on the FIP35 WW domain.
Introduction
Efficient sampling of protein configuration space remains an unsolved problem in computational biophysics. While algorithmic advances in molecular dynamics (MD) code bases1 combined with distributed computing hardware2, specialized chips3, and increasingly fast large-scale GPU clusters have provided routine access to microsecond timescale dynamics, there is still room for significant improvement. One such potential avenue is predicting the effects of mutations on a protein’s free energy landscape. Under the current scheme, one would have to re-run the entire simulation in order to ascertain the effect of each mutation. Due to the vast amount of computational resources required for even one simulation, most current MD papers run one simulation in a single force field for a single protein. However, considering the important role of mutagenesis as an experimental biophysical probe, the biological role of SNPs in medicine and disease, as well as phylogenetic and evolutionary questions connecting mutations, there are often hundreds to thousands of mutations (or more) that would be relevant for simulation. Instead of simulating them, the predictions from a single simulation are extrapolated to other conditions, and those changes/mutations are often not explicitly tested in silico. While such hypothesis generation is useful for guiding future work, the gap between extrapolated predictions and experimental realization is large.
There is an obvious scaling problem between the computational and time cost of unbiased MD and the number of interesting mutants that could be investigated using simulation. For example, there are several hundred known protein kinases4,5, each having tens to hundreds of known mutants. These kinases have critical protonation and phosphorylation sites that significantly affect their free energy landscapes6,7. To predict these mutations’ effects, do we need to re-run an entirely new simulation on the mutated protein? Are modern force fields even capable of elucidating such effects? Even if we assume an accurate enough force field8,9, how do we efficiently sample these mutants, or perhaps even propose novel variants to be probed via experimental assays? Arguably, for MD to decrease the gap between theoretical hypothesis and experimental realization, an ability to efficiently sample the effects of mutations is required. Since unbiased MD is too slow, we turn to enhanced sampling.
While enhanced sampling methods such as Metadynamics or Umbrella sampling offer promise, they require identification of a set of collective variables (CVs)10 to sample along. Metadynamics10–13 can be thought of as computational sand filling along the CVs of interest to enhance sampling between kinetically separated regions. Therefore, these CVs should correlate with the slowest structural degrees of freedom within the system, and exclusion of slow modes leads to hysteresis and convergence issues12,14. For example, even for the simplest test cases such as capped alanine dipeptide, hysteresis can arise if we choose the faster ψ coordinate for enhanced sampling.
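The "sand filling" picture can be made concrete with a minimal one-dimensional sketch of well-tempered Gaussian deposition. This is an illustration only, not the PLUMED implementation used in this work; the function names and the parameter values in the usage note are hypothetical, and the height rescaling assumes the standard well-tempered form w = w0 exp(−V(s)/(kBΔT)) with kBΔT = (γ − 1)kBT for bias factor γ.

```python
import math

def bias(s, centers, heights, sigma):
    """Total bias at CV value s: the sum of all Gaussians deposited so far."""
    return sum(h * math.exp(-((s - c) ** 2) / (2.0 * sigma ** 2))
               for c, h in zip(centers, heights))

def deposit(s, centers, heights, sigma, w0, kT, bias_factor):
    """Deposit one well-tempered Gaussian at the current CV value s.

    The height is scaled down by the bias already present at s, so that
    filled regions receive progressively less "sand".
    """
    kb_delta_t = (bias_factor - 1.0) * kT  # kB * DeltaT, assumed convention
    height = w0 * math.exp(-bias(s, centers, heights, sigma) / kb_delta_t)
    centers.append(s)
    heights.append(height)
    return height
```

For example, calling `deposit(0.0, centers, heights, sigma=0.1, w0=1.0, kT=2.5, bias_factor=10.0)` twice on initially empty lists yields a second hill that is lower than the first, because the first hill already raised the bias at that CV value.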
Given all of these problems with enhanced sampling algorithms, we instead aim to solve a simpler problem. What if we are given unbiased MD simulations for the wild type (WT) and we wish to learn the dynamics for a closely related mutant? The mutant could correspond to a change in force field, an amino acid substitution, post-translational modifications, or even an alternative drug in the case of drug-binding simulations. We expect that these mutants likely sample a very similar free energy landscape, albeit with different thermodynamics and kinetics. Could we design a better sampling scheme by transferring knowledge from the WT simulation to the mutant?
Transfer learning15,16 is a method from the machine learning literature where knowledge learnt from modeling one task is transferred to a model for the purpose of learning another task. We wish to replicate a similar effect in molecular modeling, where we transfer the knowledge learnt from a protein’s wild type to a simulation of its mutant. Ultimately, we aim to efficiently sample the mutant to predict the effects of force field changes, post-translational modifications, and/or amino-acid substitutions.
The idea of knowledge transfer is not new in computational biophysics. Researchers constantly use homology modeling17 to create models for systems which have not been crystallized, or select CVs for enhanced sampling simulations10 based upon an intuition learnt from failed runs, literature search, or previously published modeling work on homologous systems. However, this is often done in an ad-hoc or heuristic fashion. For example, it might be difficult to find the “right” template for homology modeling when a large set of structures with similar sequence identity is available.
We hypothesize that an efficient use of transfer learning would maximally leverage the reaction coordinates and the thermodynamic and structural information contained in the WT simulation. Our key results stem from recognizing that protein mutants sample a similar set of free-energy minima connected via similar slow modes. Our model assumes that these slow modes involve the same set of residues across the WT and mutant sequences, so that all that remains is identifying those slow modes (Figure 1) in the WT simulation12 and transferring them onto a mutant simulation.
We propose transferring information from the WT’s tICA (time-structure based independent component analysis) model and MSM (Markov state model) to the mutant Metadynamics or Umbrella sampling simulations (Figure 1). tICA is a dimensionality reduction technique18–21 capable of finding reaction coordinates (tICs) within the dataset. MSMs are kinetic models of protein dynamics that model the dynamics as memory-less jump processes. tICA was initially used as a dimensionality reduction process21 for defining the Markov models’ state space, though it was later shown that both tICA and MSMs solve the same problem22 of approximating the underlying transfer operator, albeit with a differing choice of basis. The tICA19–21 method has non-linear18 extensions available which significantly improve its descriptive abilities. Furthermore, a variational principle22 for tICA and MSMs allows a researcher to systematically validate23 modeling parameters and potentially integrate out subjective modeling decisions. We recently showed that these tICs21 provide a set of excellent CVs for enhanced sampling via Metadynamics12 or other schemes. Therefore, we hypothesize that the answer lies in transferring these tICs over from one simulation to another.
But how do we transfer these slow tICA coordinates? At this point it is worth recalling that each tIC is a linear combination of input features12,18,19,21,24. These input features are a set of real numbers encoding the protein’s conformational state, and concretely might be dihedrals, contacts, or RMSDs to a set of landmark points. Furthermore, these features might be the result of a nonlinear transform such as a Gaussian kernel12,24. Therefore, what we wish to compute are these protein structural features for a new, closely related sequence (Figure 1). For this, we need to determine a set of features that can be applied to both the WT and mutant systems after performing a structural or sequence alignment (Figure 1). For example, this might involve figuring out the equivalent atom indices for the backbone dihedrals, contact distances, RMSDs, etc. that make up the set of features used to construct the WT’s tICA model. Once such a mapping has been established, it is straightforward to transfer the linear combinations that make up the slowest modes for enhanced sampling simulations. In practice, we find we only have to modify small parts of the input scripts that are fed into Plumed25 for performing the enhanced sampling simulations.
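The transfer step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the scripts used in this work: the WT-to-mutant atom index map `wt_to_mut` is assumed to come from a prior sequence or structural alignment, features are represented as tuples of atom indices, and both function names are hypothetical.

```python
def map_feature_atoms(wt_features, wt_to_mut):
    """Translate each WT feature (a tuple of atom indices) to mutant indices.

    wt_to_mut is a hypothetical dict {wt_atom_index: mutant_atom_index}
    produced by an alignment step that is not shown here.
    """
    return [tuple(wt_to_mut[atom] for atom in atoms) for atoms in wt_features]

def project_on_tic(feature_values, means, weights):
    """Evaluate one tIC as a linear combination of mean-centred features.

    means and weights are taken from the WT tICA model and reused unchanged;
    only the atom indices behind each feature differ in the mutant.
    """
    return sum(w * (f - m) for f, m, w in zip(feature_values, means, weights))
```

The design choice illustrated here is that the tICA weights and feature means are never refit for the mutant. Only the bookkeeping that says which atoms define each feature is updated.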
Our method explicitly makes the following set of assumptions:
The wild type and mutant proteins occupy a similar set of configurations in phase space, are connected via similar pathways, and have a similar set of slow modes.
The wild type simulation captures a large portion of this accessible phase space, and tICA and MSMs correctly enumerate these slowest modes.
We can calculate equivalent features for the mutant and WT proteins.
There has been some previous work in using MSMs for efficient sampling of protein mutants. In particular, Voelz et al.26 used an information theoretic approach to find maximally surprising changes to a mutant MSM for performing new rounds of iterative sampling. However, their approach requires at least partial convergence of a rudimentary mutant MSM before such comparisons can be made. The amount of sampling required to make this rudimentary MSM could easily exceed the sampling of the WT, e.g. if the mutation slows down the dominant kinetics by an order of magnitude. Furthermore, at least initially, the rudimentary mutant MSM is likely to have large statistical uncertainties, potentially leading to false positives for the surprisal/self-information distance metric proposed in their paper26. Here, we approach the mutant problem from a fundamentally different perspective that aims to cannibalize all available data in the WT MSM.
Transferable tICA-Metadynamics is an efficient way to sample mutations
We begin by showing, as a simple proof of concept, that the dynamics of alanine dipeptide can be recaptured across several force fields (FFs) after learning the slowest modes in the “WT” model (Amber99sb-ildn8). We downloaded a previously generated dataset27 that contained 4 μs of capped alanine dipeptide run using the Amber99sb-ildn FF8. We then trained a tICA model on the backbone dihedrals at a lag time of 1 ns. As shown in Figure 2a, the tICA model captures the slowest mode as corresponding to movement in and out of the αL basin, while the next mode is flux in and out of the αR basin. We next ran bias-exchange10,11 tICA-Metadynamics simulations in 3 different FFs (Amber99sb-ildn, Charmm27, and Amber03). The exact parameters for the well-tempered Metadynamics runs are given in SI Table 1, though we empirically found that a range of parameters worked. All MD trajectories were run in the NPT ensemble with a Monte Carlo barostat (1 atm), a Langevin integrator (300 K), and a 2 fs timestep. We used the PME method28 to handle long range electrostatics using a 1 nm cutoff. The simulations were performed on GPUs using OpenMM29,1 and Plumed25. After running the Metadynamics simulations, we combined the data across the two tICs using the Multi-state Bennett Acceptance Ratio (MBAR)30,31 algorithm. For each simulated frame, we used the last reported bias across the tIC CVs as an estimate for input into the MBAR algorithm.
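To make the reweighting step more concrete, the following is a minimal pure-Python sketch of the MBAR self-consistent equations. It is a toy illustration and not a substitute for a production MBAR library. The layout of `u_kn` is an assumption for this sketch: each row would hold the reduced bias β V_k(x_n) of one tIC-biased replica evaluated on every pooled frame, with an extra all-zero row (and a zero frame count) standing in for the unbiased target ensemble.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def mbar_free_energies(u_kn, n_k, n_iter=200):
    """Iterate the MBAR equations and return reduced free energies f_k.

    u_kn[k][n] : reduced potential of pooled frame n evaluated in state k
    n_k[k]     : number of frames sampled from state k (0 for target-only states)
    The result is shifted so that f_0 = 0.
    """
    n_states, n_frames = len(u_kn), len(u_kn[0])
    sampled = [k for k in range(n_states) if n_k[k] > 0]
    f = [0.0] * n_states
    for _ in range(n_iter):
        # log of the mixture denominator, one value per frame
        log_den = [logsumexp([math.log(n_k[k]) + f[k] - u_kn[k][n]
                              for k in sampled])
                   for n in range(n_frames)]
        f_new = [-logsumexp([-u_kn[k][n] - log_den[n]
                             for n in range(n_frames)])
                 for k in range(n_states)]
        f = [value - f_new[0] for value in f_new]
    return f
```

A fixed number of iterations is used for brevity; a convergence check on successive `f` values would be the more careful choice.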
The results are given in Figure 2b and 2c. We explicitly projected the Charmm27 and Amber03 datasets27 using Amber99sb-ildn’s state decomposition, allowing us to compare the models across force fields without having to worry about state equivalence. It can be seen that our sampling scheme efficiently learns the differences between the dynamics upon mutating the force field from Amber99sb-ildn to Charmm27 or Amber03 (Figure 2b). For example, the αL basin in Amber03 is significantly higher in free-energy (Figure 2c) compared to Amber99sb-ildn and Charmm27.
Transferable tICA-Metadynamics can use Wild type simulation’s structural data by coupling to a MSM structural reservoir
Up to this point, our modeling efforts have only focused on using the slow tICs within the WT simulation for efficiently sampling the mutant. This might be sufficient for small peptide systems but is unlikely to work for large systems due to, for example, structural features missing from the construction of our tICA coordinates. While we could systematically improve the quality of our tICA model via variational analysis22, there is always a finite chance of missing structural degrees of freedom. To overcome this, we recommend coupling the Metadynamics simulations to a structural reservoir containing structures sampled from the WT MSM simulation (Figure 3a). Then, all that remains is creating a proposal distribution and an acceptance criterion for inserting the WT MSM state into the mutant Metadynamics simulation (Figure 3a). Ordinary Bias-Exchange10,32 swaps protein coordinates between replicas a and b according to the following criterion:
$$P_{a \leftrightarrow b} = \min\left(1, \exp\left(\beta\left[V_a(x_a, t) + V_b(x_b, t) - V_a(x_b, t) - V_b(x_a, t)\right]\right)\right)$$
where $V_a(x_a, t)$ is the Metadynamics bias potential acting on coordinates, $x_a$, of replica a at time t, and $\beta = 1/k_B T$. However, since a MSM structural reservoir has no external bias acting on it, we change the swap probability to an insertion (from MSM to Metadynamics) probability:
$$P_{\mathrm{MSM} \to a} = \min\left(1, \exp\left(\beta\left[V_a(x_a, t) - V_a(x_{\mathrm{MSM}}, t)\right]\right)\right)$$
where $x_{\mathrm{MSM}}$ are the coordinates for the MSM state under consideration. If accepted, the MSM state is put into the mutant Metadynamics simulation. Given enough sampling, this scheme resembles a Metropolis step. To improve the acceptance probability, we used the WT Markovian transition model to propose a transition state after figuring out the mutant’s current MSM state within the simulation. Using the WT transition matrix provides an excellent proposal distribution since we hypothesize that the mutant only minimally perturbs certain elements of the matrix. Our reservoir approach is similar to the high-temperature reservoir introduced by Okur et al.33, though in this instance the ensemble of structures is obtained via a regular MD run, and the proposal is drawn using the WT transition matrix.
While the WT MSM transition matrix serves as an excellent proposal distribution, it is also possible to use other proposal distributions such as the uniform distribution. Furthermore, several sampling techniques from the Monte Carlo literature, such as the Wang-Landau scheme, can be employed as well. We note that for mutant simulations, generating this MSM state reservoir would require additional steps of homology17 modeling, minimization, and equilibration, though this is a pleasantly parallel problem.
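The reservoir insertion move can be sketched as follows. This is a hedged illustration rather than the code used for the simulations: it assumes the insertion probability takes the form obtained from the bias-exchange criterion when the reservoir carries no bias, min(1, exp(β[V_a(x_a, t) − V_a(x_MSM, t)])), and the names `propose_state`, `insertion_probability`, `try_insertion`, and `bias_of_state` are hypothetical.

```python
import math
import random

def propose_state(transition_row, rng):
    """Sample a destination MSM state from one row of the WT transition matrix."""
    r = rng.random()
    cumulative = 0.0
    for state, probability in enumerate(transition_row):
        cumulative += probability
        if r < cumulative:
            return state
    return len(transition_row) - 1  # guard against floating-point round-off

def insertion_probability(bias_current, bias_candidate, beta):
    """min(1, exp(beta * (V_a(x_a) - V_a(x_MSM)))), written to avoid overflow."""
    return math.exp(min(0.0, beta * (bias_current - bias_candidate)))

def try_insertion(current_state, bias_current, bias_of_state, T, beta, rng):
    """Propose a reservoir state via the WT transition matrix T and test it.

    bias_of_state is a hypothetical callable returning the replica's current
    Metadynamics bias evaluated on a reservoir structure for the given state.
    Returns the accepted state index, or None if the move is rejected.
    """
    candidate = propose_state(T[current_state], rng)
    p_accept = insertion_probability(bias_current, bias_of_state(candidate), beta)
    return candidate if rng.random() < p_accept else None
```

Moves into regions where the replica has accumulated less bias than at its current position are always accepted, which is what lets the reservoir seed basins that the tIC coordinates alone might miss.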
Lastly, it is possible to use a neutral replica within this setup. The neutral replica has no external bias acting on it and approximately samples from the canonical distribution. However, if a neutral replica is used, we recommend only allowing the neutral replica to swap with the biased replicas since an appropriate asymptotically correct swapping criterion for swapping between the neutral replica and the MSM state reservoir doesn’t exist.
We tested our methodology by predicting the effects of the GTT mutation upon the folding of the WW domain3,37,38. We began by learning a tICA model (50 ns lag time) on the backbone dihedrals and selected contacts for the WT protein (WW-FIP). We kept the top 15 tICs, and built a 200-state MSM at a 50 ns lag time. Our tICA model indicated that the slowest mode (exchange timescale > 1 μs) corresponded to folding, while the second slowest mode (exchange timescale > 100 ns) corresponded to formation of an off-pathway register-shifted state (Figure 3b-c). Since every subsequent slow tIC mode has an exchange timescale of less than ~100 ns, we chose to focus our sampling on these two tICs. We ran the simulations for both the FIP35 WT protein and the GTT triple mutant for a more systematic comparison. Similar to previous work37, all the simulations were performed in the NVT ensemble with a 2 fs time step at 395 K. We used the PME method28 to handle long range electrostatics using a 1 nm cutoff. The simulations were performed on GPUs using OpenMM29,1 and Plumed25. After running the Metadynamics simulations, we used MBAR to reweight to the MSM state space and obtained the PMFs along the dominant tIC. All relevant simulation parameters are shown in SI Table 2.
The results for both of our enhanced sampling simulations are given in Figure 3d. Two different insights emerge from our enhanced sampling scheme relative to the Anton results (SI Figure 1). Similar to the Anton simulations, our FIP unfolded state (Figure 3d, tIC value > −0.25) has a distinct two-state behavior. Basin ‘C’ corresponds to the unfolded and collapsed state. This basin also includes an off-pathway register-shifted state. The second high free-energy basin (Figure 3d, B) is an on-pathway intermediate state where two of the three beta-strands have formed. The unfolded state in our ensemble is more populated than in the Anton simulations. These on- and off-pathway intermediate states were not detected in the original two-state folding reaction coordinate for the WW domain37,38, though they were later found from the simulations using a variety of techniques39. We note that our tICA analysis was able to identify the on-pathway folding intermediate and the off-pathway state as the top two slowest modes (tICs) within our model.
As can be seen in Figure 3c-d, our simulations indicate that the GTT mutant destabilizes the unfolded state and the on-pathway intermediate state, leading to an increased folded population and faster folding timescales. These results are in line with the previous computational and experimental work37, though our simulations required about 200-300x less aggregate sampling (~1-3 μs vs 600 μs). More importantly, the current sampling was performed in parallel so that no single walker had to be run for more than 50-200 ns (~3-7 days on K40 GPUs using OpenMM1 and Plumed25). We also believe it might be possible to optimize this further by modifying the Metadynamics parameters and the Metropolis swap scheme and rate.
Transferable tICA-Metadynamics can use Wild type simulation’s thermodynamic data as a prior for the underlying free energy landscape
Lastly, we turn to efficiently using the thermodynamic information contained in the WT simulation. To that end, we recommend using the WT simulation to identify the free-energy minima along each tIC coordinate, i.e. the thermodynamic minima, to plug into a variant of Metadynamics, namely Transition-Tempered Metadynamics (TTMetaD)40. In TTMetaD, the Gaussian heights are scaled according to the number of trips between basins. We also believe that it is possible to use the WT free energy surface as a Bayesian41 prior for the mutant Metadynamics simulation, though that is beyond the scope of this work. The latter might involve starting off with a ‘partially’ constructed free-energy landscape such that the Metadynamics engine only has to fill in the regions that differ between the WT and the mutant.
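One simple way to read the basin locations off the WT data is to histogram the WT trajectory along a tIC, convert populations to a free-energy profile, and pick the local minima. The sketch below shows this under stated assumptions: the frames are treated as unbiased and equally weighted, the bin count is an arbitrary choice, the tIC values are assumed to span a nonzero range, and both function names are hypothetical. It is not the TTMetaD algorithm itself, only a way to prepare its basin inputs.

```python
import math

def free_energy_profile(tic_values, n_bins, kT=1.0):
    """Histogram tIC values and return (bin centers, F = -kT ln p) per bin."""
    lo, hi = min(tic_values), max(tic_values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    counts = [0] * n_bins
    for value in tic_values:
        index = min(int((value - lo) / width), n_bins - 1)
        counts[index] += 1
    total = len(tic_values)
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]
    free = [-kT * math.log(c / total) if c > 0 else math.inf for c in counts]
    return centers, free

def local_minima(centers, free):
    """Return bin centers whose free energy is strictly below both neighbours."""
    minima = []
    for i, value in enumerate(free):
        left = free[i - 1] if i > 0 else math.inf
        right = free[i + 1] if i < len(free) - 1 else math.inf
        if value < left and value < right:
            minima.append(centers[i])
    return minima
```

Because the comparison is strict, flat plateaus are not reported as minima; noisy profiles would likely need smoothing or coarser bins before this step.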
Our current results open up several interesting avenues for future work. For example, up to this point, we have only focused on enumerating the thermodynamic differences between the mutants. However, recent work in kinetic reweighting, whether via Maximum Caliber42, TRAM43, or plain transition state theory, could potentially be used to obtain an estimate of the mutants’ perturbed kinetics. This raises the intriguing possibility of getting estimates for both the kinetics and thermodynamics of a mutant simulation for a minuscule fraction of the WT’s compute cost. An excellent application for this would be the ability to predict changes in a drug’s binding and unbinding kinetics. Our approach explicitly includes all of the protein’s slow conformational modes in addition to the drug binding mode, which we expect to make it more accurate.
One possible problem with our current approach is determining how far we can move away from the WT in sequence space before the transfer approach fails. Are the tICs learnt from a WT simulation applicable to a sequence with minimal sequence similarity? What is the distance metric, and how do we define minimal? A similar problem is faced in homology modeling, where the quality of the model depends on the underlying sequence conservation. It is possible that the heuristic 40-50% sequence identity cutoff used in homology modeling might be applicable here too, but we concede that this value is simple conjecture at this point.
A more involved solution to this problem is to consider clustering the entire sequence superfamily. For example, there are 518 known human kinases5. One could potentially cluster the sequences using evolutionary distance metrics into m representative sequence clusters, where m is the number of unbiased simulations that can be performed. Those m simulations are then run and analyzed via tICA and Markov models. It is worth noting that the simulations for the m sequences are perfectly parallelizable, allowing for synergistic collaborations between different research institutes. For all other sequences, we can then use the tICs from their closest cluster center, or perhaps even combine the tICs from the k nearest neighbors.
To summarize, we present a new method, Transferable tICA-Metadynamics, for the efficient sampling of protein mutations by transferring the reaction coordinates and the structural and thermodynamic data from the WT simulation to the mutant. Our method explicitly assumes that the WT and the mutant share a similar set of slow modes. Under this assumption, we then show that the slow modes of the WT can be transferred to the mutant simulation by computing an equivalent set of protein structural features. This requires using a protein structural alignment to identify equivalent residues, which is readily possible using modern software44,45. We benchmarked our method on two test cases, showing how switching force fields in alanine dipeptide causes shifts in the propensity and location of the αL basin, and recapturing the previous result that the GTT mutant of the WW domain stabilizes the folded state.
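The final assignment step, picking the closest simulated cluster center (or the k nearest) for a new sequence, could look like the sketch below. Plain fractional sequence identity over pre-aligned sequences is used here as a stand-in for the evolutionary distance metrics mentioned above; that substitution, the function names, and the toy sequences in the usage note are assumptions for illustration only.

```python
def sequence_identity(a, b):
    """Fraction of identical residues between two pre-aligned sequences.

    Columns where either sequence has a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 0.0
    return sum(x == y for x, y in compared) / len(compared)

def nearest_centers(query, centers, k=1):
    """Return the names of the k cluster centers most similar to query.

    centers is a dict {name: aligned_sequence} for the m simulated systems.
    """
    ranked = sorted(centers,
                    key=lambda name: sequence_identity(query, centers[name]),
                    reverse=True)
    return ranked[:k]
```

With toy inputs such as `centers = {"c1": "ACDEFG", "c2": "AAAAAA"}`, the query `"ACDEFH"` is assigned to `"c1"`, whose tICs would then be reused, and `k=2` returns both centers in order of similarity for the case where tICs from several neighbors are combined.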
Code and data availability
All the code needed to reproduce the main results of this paper is available at https://github.com/msultan/ticametadynamics.
Acknowledgements
The authors would like to thank various members of the Pande lab for useful discussions and feedback on the manuscript. M.M.S. would like to acknowledge support from the National Science Foundation grant NSF-MCB-0954714.
Footnotes
† pande{at}stanford.edu