Refinement of protein structure via contact based potentials in replica exchange simulations

Proteins are complex biomolecules which perform critical tasks in living organisms. Knowledge of a protein’s structure is essential for understanding its physiological function in detail. Despite the incredible progress in experimental techniques, protein structure determination is still expensive, time-consuming, and arduous. That is why computer simulations are often used to complement or interpret experimental data. Here, we explore how in silico protein structure determination based on replica exchange molecular dynamics can benefit from including contact information derived from theoretical and experimental sources, such as direct coupling analysis or NMR spectroscopy. To reflect the influence from erroneous and noisy data we probe how false-positive contacts influence the simulated ensemble. Specifically, we integrate varying numbers of randomly selected native and non-native contacts and explore how such a bias can guide simulations towards the native state. We investigate the number of contacts needed for a significant enrichment of native-like conformations and show the capabilities and limitations of this method. Adhering to a threshold of approximately 75% true-positive contacts within a simulation, we obtain an ensemble with native-like conformations of high quality. We find that contact-guided REX MD is capable of delivering physically reasonable models of a protein’s structure. Author summary Protein structure prediction, that is obtaining a protein structure starting from a sequence using any computational method, is a great challenge. Over the past years a broad variety of methods evolved, ranging from algorithms for “blind” or de novo predictions using Monte-Carlo or physics-based biomolecular simulation methods to algorithms transferring structure information obtained from known homologous proteins. Recently, purely data-driven approaches using neural networks have shown to be capable of predicting high-quality structures. However, some local structural motifs are only poorly resolved and need further refinement. Here, we explore to what extent contact information helps guiding replica exchange molecular dynamics towards the native fold. By adding a contact pair bias potential to the energy function, we effectively guide the search towards the target structure by narrowing the conformational space to be sampled. We find that such an energetic bias, even if containing false-positive contacts to a certain extent, greatly enhances the refinement process and improves the chance of finding the native state in a single run.


Introduction
Knowledge of protein structures is crucial for understanding the proteins' functions or the biological processes they take part in. Structural knowledge is also critical in related fields such as pharmacology to design drugs or medicine to understand the molecular origins of pathogenesis. Both protein structure and function are intrinsically encoded in the corresponding amino acid sequence [1][2][3]. Over the past years, experimental techniques to gain such sequential data have become exceptionally efficient and lead to fast growing sequence databases, e.g., GenBank [4] and UniProt [5]. In contrast, experimental structure determination with methods such as X-ray crystallography or NMR spectroscopy are comparably time-consuming and expensive with often involved procedures. Some other experimental techniques, e.g., SAXS, FRET, or cryoEM, do not directly provide structural data but have to be carefully interpreted [6][7][8].
Computer simulations are commonly used to interpret and complement experimental data. Novel, purely data-driven approaches can predict protein structures of high quality [9,10] but lack insight into the physical processes driving structure adoption and cannot be easily complemented by experimental information. Depending on the method, local structural motifs are often less resolved [9] and could benefit from additional refinement. Physics-driven approaches are based on energy functions called force fields. Lindorff et al. demonstrated by Molecular Dynamics (MD) simulations that current force fields are sufficiently accurate to reversibly fold proteins starting from unfolded conformations [11,12]. Still, the computational cost for such de novo folding simulations is extremely high. As a result, simulations on the millisecond timescale can currently only be performed on specialized supercomputers like Anton. Luckily, it is possible to guide the simulations towards target structures or ensembles by introducing an energetic bias based on experimental data. This bias helps smoothing the energy landscape, which has frustrated, glassy properties with many competing minima separated by high barriers. At the same time, the computational costs are lowered due to the reduced sampling space. One can further lower computational demands by enhanced sampling techniques [13][14][15].
In this work, we assume having access to varying amounts of error-ridden contact information as an additional potential bias for MD simulations. Such contact information, i.e. information about adjacent amino acids, can be obtained from different sources. Sparse NMR contact maps would be one example. By themselves they provide insufficient information for structure generation and have to be complemented. an equal force coefficient k (see Eq 7) to all used contacts, as the study is performed to estimate the influence resulting from both native and non-native contacts. We analyze the data for each test case with simulated times of 250 ns, especially for the lowest temperature replica. By comparing the test cases to a reference simulation not including any contact information, we can thus estimate the total number of required restraints and the bias strength. The study shows good results for both tested proteins as long as the used bias quality was above a certain threshold. It is possible to include additional experimental bias into such simulations and use them as a tool for hybrid data integration.

Test systems
We use two well-known proteins for our systematic study. The first candidate is the 20-residue miniprotein Trp-Cage (PDB: 1l2y [23]). Its tertiary structure consists of an α-helix followed by a turn and a 3/10-helix. Trp-Cage was specifically designed as a fast folder and reaches folding timescales of approximately 4 µs [24]. The second test system is Villin Headpiece (VHP, PDB: 1vii [25]). This protein has a sequence length of 35 residues and forms a three-helix structure. Similar to Trp-Cage, VHP can also achieve folding times in the order of µs [26,27].

Trp-Cage
We performed REX simulations of Trp-Cage for a total of 60 replica yielding trajectories of 250 ns simulated time ranging from T 0 = 300K to T 59 = 624.67K. To compare the different scenarios as listed in Table 2, each REX simulation initiates from the same unfolded conformation. As given in Eq (7), a sigmoid potential with the same coupling strength k is assigned to all implemented restraints (only C α -C α pairs). Note that the resulting force is distance-dependent and the potential has a limit of 10 kJ mol -1 . As a first analysis step, we checked if the coupling strength of the implemented restraints is adequate. The energetic bias is supposed to guide the protein towards a native-like structure. We determined the time-dependent backbone RMSD with respect to the native structure for all replica in each test case. To get an overview and quickly compare the RMSD statistics across all scenarios, the color-coded RMSD values of all replica are displayed in heatmap plots. Fig 1A and Fig 1B exemplary show a comparison between the reference simulation and the simulation with 12 native contacts, respectively. The reference run mostly shows RMSD values greather than 4Å with a few random exceptions at lower temperature replica throughout the simulation. As expected, introducing a bias potential with purely native contact restraints strongly improves the RMSD values for lower temperature replica, as displayed in Fig 1B. Here, the majority of low-temperature RMSD values turns from red (more than 4Å) to blue (less than 4Å) compared to the reference simulation. Heatmap plots of the remaining cases using only ideal contact pairs at 100% TPR can be found in S1 Fig to S3 Fig. They prove that the number of used native contact restraints is correlated with the enrichment of native-like conformations. The blue region therefore grows with increased "correct" bias compared to the unbiased reference simulation. The greatest step-wise improvement for lower temperature replica is observed for the transition from 6 to 12 native contacts. With any sort of contact information usually being error-prone, it is nearly impossible to apply a perfect bias corresponding to a TPR of approximately 100%. Heatmap plots for the other REX simulations with TPRs of 75% and 50% are shown in S4   native contacts (Fig 2A to Fig 2D) show a strong enrichment of conformations with RMSD values between 1.6 and 3.0Å as indicated by the green region. Frequencies of conformations with RMSD values above 3.0Å got reduced accordingly. In case of Trp-Cage, the net gain of native-like folds does not improve when the bias exceeds 12 restraints, which corresponds to approximately L/2 contact pairs. Test cases with mixed contacts at 75% TPR ( Fig 2E to Fig 2H) show a similar behavior. Additionally, the two scenarios with 12 and 24 mixed contacts also enrich conformations with RMSDs around 5.0Å. Such conformations can be filtered out in a subsequent step using a suitable method or algorithm for frame selection. Scenarios with a low bias quality of 50% TPR ( Fig 2I to Fig 2L) show worse statistics compared to the reference simulations and are therefore inappropriate for the intended purpose. Naively one might guess that the "correct" bias would cancel the "wrong" bias out. However, if implemented restraints are clustered within the contact map, the local bias adds up and apparently is strong enough to trap the protein in unfavored conformations. When using large numbers of contact restraints with respect to the protein length, it is necessary to reduce the overall force coefficient k of the sigmoid potential (see Eq 7) to compensate this effect.
The disadvantage of RMSD-based evaluation of structure quality is that local deviations between mobile and target structure already result in a disproportionate increase. This is why we transition to the so-called Global Distance Test (GDT) which takes local misalignments better into account. The distributions of the two scores GDT TS and GDT HA provide an additional perspective to the estimation of the bias quality necessary for an effective integration of contact information into REX simulations. Table 2 gives an overview of occurring percentiles of GDT TS and GDT HA scores for all REX MD simulations with Trp-Cage as a test system. To simplify the comparison, shaded table cells indicate improved percentiles P x compared to the reference P x,ref for each score variant, i.e. cells with Additionally, we use a bold font for values which satisfy to highlight a significant improvement. In Eq (2) each P x is compared to a percentile-specific threshold only depending on corresponding reference values. The threshold is defined as the sum of the percentile itself and a weighted difference of this percentile to the highest observed value. The difference indicates the practically possible improvement in relation to the reference. To determine a "significant" improvement, we set the coefficient w to 50%. In scenarios with TPRs of 75% or higher, The comparison of 50% TPR simulations to the reference shows that highly error-prone contact bias has a very negative influence and is not sufficient to effectively improve structure determination in REX.

Villin Head Piece
All VHP REX simulations were performed under the same conditions as for Trp-Cage. However, due to the increased system size 40 additional replica were required to achieve nearly constant exchange rates across the considered temperature range. The REX RMSD heatmap plots (see S8 Fig to S14 Fig) have similar but less prominent tendencies compared to Trp-Cage. The reference shows very poor RMSD statistics throughout. Even with enhanced sampling as in REX we observe that the simulated time span of 250 ns is too short for VHP to be guided towards the native structure without additional bias. As soon as the bias potential is activated, we clearly see an analogous growth of the blue region within the heatmap plots indicating enrichment of native-like conformations. Furthermore, the most potent improvement was observed for the transition from 12 to 24 native contacts as shown in S9 Fig Fig 3E to Fig 3H. We still observe an enrichment of low RMSD conformations but also a drastic increase of conformations with values around 5.0 to 8.0Å. This is the result of the first three non-native contacts included, which were randomly selected for this test case. Long-range contact pairs (i, j), which are far away from the main diagonal of a contact map, have a more significant influence compared to contact pairs with a small difference in their sequence numbers ∆ ij = |i − j| close to the main diagonal. All simulations with a TPR of 75% nonetheless show a net gain of native-like conformations. Therefore, they should be favored over the unbiased scenario given a functional algorithm to filter the best structures. Scenarios with a TPR of 50% are displayed in Fig 3I to Fig 3L. Here, we observe that both low and high RMSD conformations appear less compared to the reference, whereas conformations between 5.0 and 9.0Å are enriched. The high ratio of non-native contacts lowers the chance of successful structure determination as before for Trp-Cage. Table 3 specifies the percentiles of GDT TS and GDT HA score distributions. We find that the VHP REX simulations benefitted from restraints with a TRP of 75% or higher, resulting in a significant increase of GDT score statistics. For ideal results in mixed scenarios with 75% TPR, a restraint number in the order of the protein length is required. Analogous to our previous observation during the RMSD-based discussion, simulations with only 50% TPR were worse compared to the reference scenario. We conclude that a bias of such poor quality is therefore unsuited to achieve useful results February 11, 2020 6/18 within contact-guided REX simulations. To highlight the local accuracy and visualize how well the simulated structures fit the native state, Fig 4A gives an overview of the best observed structures during the 75% TPR simulation ranked by HA scores. Each protein residue is assigned a colored rectangle representing the C α -C α distance between mobile and reference after a least-squares fit. As evident in Fig 4A,

Conclusions
Usually, contact information on its own is insufficient to fully determine a protein's 3D structure. We showed that contact-guided REX MD is capable of delivering physically reasonable structural models of a protein's conformational state. Contacts derived from various sources can easily be included and thus interpreted in terms of structural ensembles. Our ready-to-use framework for structure determination relies on the well-established and heavily used simulation method of molecular dynamics. As a physics-based method, it is conceptually transparent compared to purely data-driven methods such as AlphaFold [10]. Within one single REX MD run, it is possible to conveniently obtain high-quality results without the need for arduous adjustment of system-specific parameters.
We performed an extensive study on two well-known proteins to test the capabilities and limitations of our framework for protein structure determination. A total of 14 scenarios differing with respect to bias quality by varying true-positive rate and number of used contact restraints were considered. For a facilitated comparison, we kept the coupling strength equal across all restraints. We observed a significant enrichment of native-like conformations as long as the bias quality was above an apparent threshold of 75% TPR. We find that highly error-prone contact information as implemented into 50% TPR scenarios is not sufficient for effective structure determination within REX MD. Typically, it is a priori unknown which of the used contacts are really native, and both experimentally and theoretically derived contact information can mistakenly contain false positives. One possible approach to resolve this issue is a dynamic weighting of contact bias. Initially assigning each contact the same force coefficient, contacts can be monitored regularly and those remaining unrealized can be weakened or completely deactivated accordingly. Furthermore, we evaluated the step-wise improvement by comparing different scenarios with varying numbers of contacts at fixed TPR. In the case of error-free bias, the chance of finding native-like structures increases with the number of contacts included according to our expectations. For more realistic cases with a 75% TPR, we observe significant performance improvements compared to the reference for L/2 to L restraints, with L being the protein sequence length. As a proof of concept, we showed that it is possible to find physically reasonable folds starting from an unfolded state, provided enough turnarounds and a sufficiently long simulation time during REX. To avoid the unnecessary increase of computational costs associated with the large simulation box, one should always start in a pre-estimated folded conformation instead. This would greatly increase the computational performance of each REX run.
Although computationally rather involved, this method is particularly suitable for refinement of often available low-resolution structural models. The underlying force field contains rich information on various physical interactions determining protein dynamics. Since the bias is known and thus can retrospectively be balanced out, the simulated ensemble is thermodynamically correct. Thus, it is in principle possible to infer a free energy landscape with statistical techniques such as the "weighted histogram analysis method".

Molecular Dynamics
Computer simulations are often used to complement or interpret results of real experiments. Molecular Dynamics (MD) is such an in silico approach to study the movements of atoms or biomolecules. Here, one of the many classical force fields is selected and applied to the studied system. The resulting interactions between all atoms are calculated by using Newton's equations of motions. Typical timesteps are in the order of 1-2 fs. Details of certain mechanisms, such as protein folding or ligand binding, can be observed by analyzing the trajectory of the simulation.

Replica Exchange
Replica exchange (REX), sometimes referred to as REMD or parallel tempering, is an enhanced sampling technique [20][21][22]. This efficient method is commonly applied to overcome protein entrapment resulting from the multiple-minima problem during MD simulations. REX simulates N non-interacting copies ("replica") of a system at different temperatures T i . Each replica corresponds to one of N MD simulations performed simultanously. After a fixed time dt the atom positions and momenta of adjacent replica can be exchanged. The exchange probability is given by the Metropolis criterion [20] w(X i → X j ) = min(1, e −∆ ); with ∆ = ( where X i denotes the state of replica i, β −1 i = k B T the inverse temperature, and E i the energy of state X i . Since exchange rates are signifficantly lower for large temperature differences, which can be seen in ∆ from Eq (3), it is suffiencent to only exchange adjacent replica.
The intention of REX is to enhance sampling of both high and low energy states. For this reason the temperature range has to be chosen accordingly. Replica at the highest occuring temperatures should have enough energy to overcome potential barriers. Meanwhile, low temperature replica are supposed to explore conformations close to local minima. In combination, REX increases the chance of finding the global energy minimum and thus the native state of a protein. To achieve a random walk it is mandatory to aim for constant exchange rates across all replica and to make sure that all replica are shuffled sufficiently.

Replica Exchange Temperature Generator
In REX simulations, every replica resembles the dynamics of a canonical ensemble, where the probability distribution of each microstate follows the Boltzmann distribution e −βE . Since exchange rates are propotional to the energy difference of two adjacent replica (see Eq (3)), an exponential temperature distribution is needed to guarantee a random walk in conformation space [13,20,28]. This distribution is a priori unknown as it depends on the protein size and number of solvent molecules. However, an initial temperature distribution can be estimated [29]. A simple temperature generator is given by T i refers to the temperature of replica i, while k stands for the growth parameter which has to be modified based on the system size. To get more consistent exchange rates during the simulation across all replica, we slightly modified the generator following the equations Eq (5) recursively describes the temperature of replica i. ∆ denotes the temperature difference of two adjacent replica, as specified in Eq (6), while a i is a step size modifying coefficient. With a i = 1∀i the generator will produce the same temperature distribution as given by Eq (4). To keep the exchange rates almost constant over the whole simulated temperature range, we increased a i every ten replica by 4%. Used parameters and resulting temperature distributions can be taken from S3 Appendix for Trp-Cage and S4 Appendix for VHP.

Sigmoid Potential
In order to guide protein folding, we implement an attractive potential to increase the bias towards native-like conformations. In our simplistic model, we apply the force only February 11, 2020 9/18 between the C α atoms of used contact pairs. The potential is given by with the force coefficient k and the sigmoid function L denotes the sigmoid function's limit, whereas r and r 0 stand for the atom distance and equilibirium distance, respectively. Furthermore, r 0 determines the position of the inflection point of the sigmoid function and thus the local maximum of the sigmoid function's derivative. The parameter α affects the S-shape of the function, i.e. how fast the transition from low values to high values takes place. For the sake of simplicity, we set L = 1 so that the force only depends on the force coefficient k. simulations, we held k constant at 10 kJ mol -1 (approximately 4 k B T at 300K). Furthermore, we chose α = 2.5 nm −1 and r 0 = 1.6 nm.

Coevolution
The physiological function of a protein is directly linked to its 3D structure determined by the amino acid sequence [30][31][32]. Through mutation and selection a global structure is preserved within a protein family as a result of coevolution. Destabilizing mutations are statistically compensated by mutations of spatially close amino acids, leaving an evolutionary fingerprint [33,34]. Coevolution analysis methods such as direct coupling analysis [16] can infer spatially close amino acids based on multiple sequence alignments. The obtained contact pairs can then be used as an additional bias for, e.g., protein folding simulations or structure determination.

Native Contact Enrichment
One key aspect of this study was to measure the influence and correlation of native and non-native contacts during REX simulations. For this purpose we explicitly used randomly selected contact pairs based on the known structures of the test systems. To quantify the true-positive rate of chosen contact pairs, we define native contacts to fulfill the conditions Eq (9) defines a maximum cutoff distance r ij of 6Å between C α atoms for two residues i and j. Eq (10) excludes residue pairs which are very close to each other with respect to their sequence position and would appear on the main diagonal of a contact map. Based on these definitions, we created one list with native contacts and another list with non-native contacts. More precisely, we omitted direct neighbors of native contacts within the contact map from the non-native list. For example, if residue pair (i,j) is native, then all nine combinations of (i , j ) with i ∈ {i − 1, i, i + 1} and j ∈ {j − 1, j, j + 1} were excluded. Lastly, we randomly selected contact pairs from each list to construct the different scenarios at fixed TPRs for the method performance study. Contact pairs (i, j) used as restraints during the study can be looked up in the contact maps given in S19 Fig to S24 Fig.

Global Distance Test
During the trajectory analyses backbone RMSD values after structure alignment are considered to initially evaluate the method's performance. In terms of protein structure determination, however, RMSD is not a proper quantity to assess results as it correlates strongly with the largest displacement between the mobile and target structure. This means if the mobile structure fits the target to a large extent and only one small segment is misaligned, the RMSD becomes disproportionately large. This problem is solved for the so-called Global Distance Test (GDT) [35][36][37]. Analogously to the RMSD, the mobile structure is first aligned to the target structure. Then the displacement distance of each C α residue is calculated and compared with various cutoff thresholds to estimate how similar the two structures are. In a last step, percentages of residues with displacements below a considered threshold are used to calculate score values. The two most common scores are the Total Score (TS), GDT TS = 1 4 (P 1 + P 2 + P 4 + P 8 ), (11) and the High Accuracy (HA) score, GDT HA = 1 4 (P 0.5 + P 1 + P 2 + P 4 ).
Here, variables P x denote the percentage of residues with displacements below a distance cutoff of xÅ. Note that both scores range between 0 and 100 and their interpretation is based on the "fit resolution" set by the applied cutoff distances.

Setup of REX MD simulations
All simulations were set up on a standard desktop PC using an Intel Core i5-3470 CPU with four cores at 3.20 Ghz frequency. The runs were done with GROMACS 2016.3 [38,39] using the AMBER99SB-ILDN force field [40] and the TIP3P explicit solvent model [41]. Starting from the pdb structure, the protein was unfolded in a normal MD simulation at a high temperature. We selected an unfolded state as initial structure for the REX simulations. All necessary files were generated for the lowest replica temperature T 0 . A REX temperature generator based on Eqs (5) and (6) yielded the temperature distribution for N replica (cf. S3 Appendix and S4 Appendix). After verifying sufficient exchange rates in short REX simulations (every 1000 steps, rate ≈ 16%), the sigmoid potential based on Eq (7) was provided as a look-up table. Each simulation run for 250ns with a stepsize of 2f s on 60 (Trp-Cage) and 100 (Villin headpiece) replica. Restraints were added via tabulated bonded interactions [39] to the topology file.
The production runs were performed on the ForHLR II cluster. We only used thin nodes, consisting of two Deca-Core Intel Xeon E5-2660 v3 processors (Haswell) with a base clock rate of 2.6 GHz (max. turbo-clock rate of 3.3 GHz), 64 GB main memory, and 480 GB local SSD storage.
Supporting information S1 Appendix. Sample .mdp file for REX simulations. (PDF)