Abstract
Heuristics based on physical insight have always been an important part of structure determination. However, recent efforts to model conformational ensembles and to make sense of sparse, ambiguous, and noisy data have revealed the value of detailed, quantitative physical models in structure determination. We review these two key challenges, describe different approaches to physical modeling in structure determination, and illustrate several successes and emerging technologies enabled by physical modeling.
Highlights
Quantitative physical modeling is emerging as a key tool in structure determination
There are different approaches to incorporate physical modeling into structure determination
Modeling conformational ensembles and making sense of sparse, noisy, and ambiguous data are two challenges where physical modeling can play a prominent role
Introduction
Heuristics derived from physical insight have always played an important role in biomolecular structure determination, but more rigorous quantitative physical models are increasingly used to transform experimental data into structures and ensembles. These physical approaches become more important as the biomolecular system of study becomes more flexible and conformationally heterogeneous (Figure 1), and as experimental data become sparse, ambiguous, or noisy (Figure 2). Systems with these characteristics have recently come into focus, due to both the recognition of the importance of conformational heterogeneity and the emerging range of experimental techniques that can provide incomplete information about protein structures [1–5].
Physical modeling has become increasingly powerful in recent years, driven by improvements in computer power, improved physical models of protein structure [6–8], and improved algorithms for conformational [9–12] and data-driven [13–17] sampling.
Combined with advances in experimental methodology, these developments are leading to a new era in structural biology where physical modeling plays a pivotal role [18–20]. Here, we outline two challenges where physical modeling can make contributions to structure determination, survey some recent successes, and provide a perspective on emerging areas where physical modeling can play a key role.
There are several emerging challenges in structural biology
Challenge 1: Modeling Conformational Ensembles
When we refer to “the structure” of a biomolecular system, we are actually referring to some continuous cloud of structures in the neighborhood of a representative structure. While historically this single structure viewpoint has dominated in structural biology, there is increasing recognition of the importance of heterogeneity and dynamics, enabled by significant improvements in experimental techniques and computational capability.
Nearly all measurements in structural biology are ensemble averages, where the observed signal comes from the average across many molecules. The challenge of interpreting such averaged data increases as the conformational ensemble becomes more heterogeneous. A simple thought experiment illustrates the central concept (Figure 1), where three systems have the same average for some observable, but different conformational distributions. One system (orange) is tightly clustered, where the average conformation provides an excellent representation of the ensemble. Another system (green) has a broad distribution, where the average conformation is only somewhat representative. The final system (blue) has a multimodal distribution, where the average conformation is improbable and not representative of the underlying ensemble at all. As the experimental average is the same in each case, modeling is critical to making correct inferences about the ensemble.
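The thought experiment can be made concrete with a small numerical sketch. The three distributions below are invented stand-ins for the orange, green, and blue systems of Figure 1; the widths, state separation, and the `fraction_near_mean` helper are illustrative assumptions, not values taken from the figure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_conformers = 100_000
target = 10.0  # hypothetical ensemble-averaged observable (arbitrary units)

# Three hypothetical ensembles that share the same average observable.
narrow = rng.normal(target, 0.5, n_conformers)   # tightly clustered
broad = rng.normal(target, 3.0, n_conformers)    # single broad state
# Equal mixture of two narrow states placed symmetrically about the target.
state_centers = rng.choice([target - 4.0, target + 4.0], size=n_conformers)
bimodal = rng.normal(state_centers, 0.5)         # multimodal

def fraction_near_mean(samples, width=1.0):
    """Fraction of conformers whose observable lies within `width` of the ensemble average."""
    return float(np.mean(np.abs(samples - samples.mean()) < width))

for name, samples in [("narrow", narrow), ("broad", broad), ("bimodal", bimodal)]:
    print(f"{name:8s} mean = {samples.mean():6.2f}   "
          f"fraction near mean = {fraction_near_mean(samples):.3f}")
```

All three ensembles report essentially the same average, yet almost no conformer of the bimodal ensemble resembles that average, which is why the average alone cannot distinguish the three cases.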
Challenge 2: Making sense of Sparse, Ambiguous, and Noisy Data
An increasing variety of experimental methods can provide incomplete information about the structure of a biomolecule or complex [1–5]. The appeal of these approaches is that they are often applicable to a wide range of systems, including those where traditional approaches have proven intractable. However, these experiments often provide only an incomplete picture of the structure.
Figure 2 shows several common pathologies. First, the data may be sparse, often only providing information about a few degrees of freedom, e.g. an EPR experiment might measure a single distance between probes. Second, the data may be ambiguous, where there are multiple molecular features that could explain a particular signal, e.g. an NMR experiment might tell us that two protons are close together, but not specifically which ones. Finally, experimental data is almost always corrupted by noise, which must be interpreted as such to avoid overfitting. Noise comes in many forms, ranging from simple additive noise (often modeled by an appropriate distribution, e.g. Gaussian noise) to more challenging cases where experimental artifacts lead to the presence of false-positive and false-negative signals.
Overcoming the dual challenges of modeling ensembles and making sense of sparse, ambiguous, and noisy data requires a synergistic combination of experiment, statistical inference, and physical modeling.
What do we mean by physical modeling?
The term “physical modeling” encompasses many approaches, ranging from physically-motivated heuristics to models rooted in rigorous statistical mechanics. The former have always been an integral part of biomolecular structure determination, while the latter are becoming increasingly important in modern structural biology.
Heuristic approaches are motivated by physical considerations and empirical observations. One example is the use of stereochemical restraints during the refinement of X-ray crystal structures [21] that prevent physically impossible bond lengths and overlap between atoms, even though these unrealistic features might lead to naïve improvements in the agreement with experimental data. These heuristics are not a comprehensive physical description of biomolecular structure—clearly, one could not hope to predict the correct fold of a protein using only simple stereochemical restraints.
Conversely, statistical mechanics is a rigorous, comprehensive theory that connects the probability of observing a particular conformation, p(x), with the potential energy, U(x), through the Boltzmann distribution:
$$p(x) = \frac{1}{Z} \exp\left(-\frac{U(x)}{RT}\right) \tag{1}$$
where x denotes the conformation, R is the gas constant, T is the absolute temperature, and Z is a normalization constant called the partition function.
Typically, the potential energy is modeled using an empirical approximation called a force field [6, 7]. Samples from p(x) are generated using molecular dynamics or Monte Carlo simulations, often augmented by various enhanced sampling algorithms [10, 12, 13, 22].
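As a minimal sketch of how Monte Carlo sampling draws conformations with Boltzmann probability, the example below runs a Metropolis walk on a one-dimensional toy problem. The double-well `potential`, the step size, and the temperature are hypothetical choices for illustration; a real application would use a force field and a much more elaborate move set.

```python
import numpy as np

R = 8.314462618e-3  # gas constant in kJ/(mol K)

def potential(x):
    """Hypothetical double-well potential energy in kJ/mol (a toy, not a force field)."""
    return 5.0 * (x**2 - 1.0) ** 2

def metropolis(n_steps, temperature=300.0, step=0.5, seed=0):
    """Metropolis Monte Carlo sampling of exp(-U(x)/RT) along one coordinate."""
    rng = np.random.default_rng(seed)
    rt = R * temperature
    x = 1.0
    u = potential(x)
    samples = np.empty(n_steps)
    for i in range(n_steps):
        x_new = x + rng.uniform(-step, step)   # symmetric trial move
        u_new = potential(x_new)
        # Accept with probability min(1, exp(-(U_new - U_old) / RT)).
        if u_new <= u or rng.random() < np.exp(-(u_new - u) / rt):
            x, u = x_new, u_new
        samples[i] = x                         # rejected moves repeat the old state
    return samples

samples = metropolis(100_000)
print("fraction of samples in the right-hand well:", np.mean(samples > 0.0))
print("<x^2> from sampling:", np.mean(samples**2))
```

Because the toy potential is symmetric, both wells should be visited about equally, and averages such as <x^2> can be checked against direct numerical integration of the Boltzmann weights.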
Rosetta is another widely used example of physical modeling [8]. Although the underlying philosophy and parameterization of Rosetta differ substantially from those of statistical mechanical models, the underlying goal is essentially the same—to reproduce the conformational landscape of a biomolecular system of interest.
There are different approaches to incorporating physical models into structure determination
The aim of integrative structural biology is to construct a structural model of a biomolecular system from one or more experimental datasets, which is a problem of statistical inference that can be approached from a variety of perspectives, including maximum likelihood, maximum entropy, maximum parsimony, and Bayesian approaches.
The likelihood, 𝓛(θ|D) ~ ℙ(D|θ), is central to many methods, where D is the observed data and θ is a set of parameters specifying the structural ensemble, e.g. atomic coordinates and B-factors. This probabilistic relationship encapsulates the experimental measurement and relates the model to experimental observables. The likelihood function is often evaluated on single structures. However, newer ensemble refinement methods [23–25] use likelihood functionals to evaluate distributions of structures, which, as described later, is more suitable for conformationally heterogeneous ensembles.
Maximum likelihood (ML) methods seek to find the single set of parameter values with maximum likelihood. Naive ML methods rely entirely on the data, making these methods sensitive to noise and notoriously prone to overfitting. To mitigate this, ML methods are often augmented by ad hoc penalty terms motivated by physical considerations, e.g. the use of restraints on crystallographic B-factors, which ensure that variations in flexibility between nearby atoms are physically plausible [26]. However, even after augmentation with penalty terms, ML methods are still prone to overfitting as the data-to-parameter ratio becomes increasingly poor.
In contrast to ML, maximum entropy (MaxEnt) methods seek to find a distribution of parameters, p(θ|D), to explain the observed data. Although there are many possible distributions that could match the observed data, there is a unique maximum entropy distribution [27, 28], providing a powerful basis for statistical inference. An ensemble generated using Eq. 1 alone may not agree with experiment. MaxEnt methods seek to minimally perturb (in a well-defined MaxEnt sense) this ensemble, either through biasing [23–25] or reweighting [29, 30], to bring the results into agreement with experimental measurements.
Maximum parsimony methods [20, 31] have many similarities with MaxEnt approaches. A key distinction is that maximum parsimony aims for simple models, e.g. describing an ensemble with a minimal number of representative conformations.
The Bayesian approach offers a different perspective [32] that partially encompasses both MaxEnt and maximum parsimony methods. Bayes' theorem is a simple and elegant statement that combines prior understanding with new information in a statistically consistent way. The quantity of interest is the posterior distribution, p(θ|D), which is obtained by combining the likelihood function, 𝓛(θ|D), with the prior, p(θ): p(θ|D) ∝ 𝓛(θ|D) p(θ).
Bayesian methods differ from ML in several key respects. First, the prior, often given by Eq. 1, represents our knowledge of protein structures in the absence of data. The prior, rather than ad hoc penalty terms, provides a means to make sense of otherwise sparse, ambiguous, or noisy data. Second, Bayesian methods generate an ensemble from the posterior distribution, rather than a single sample, as in ML. The assumption of maximum entropy [27, 28] underlying Eq. 1 leads to ensembles that are as broad as possible given both the data and energetic considerations from the prior, which mitigates over-fitting. Finally, the prior may include “nuisance parameters”, like the level of noise corrupting a particular observable. During sampling, these parameters are jointly inferred with the other parameters describing the model, leading to a statistically consistent ensemble without the need to specify the exact values of nuisance parameters.
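The role of the prior and of nuisance parameters can be illustrated with a deliberately small grid calculation. Everything in the sketch is an assumption made for illustration: the single structural parameter `theta` (a distance), the harmonic energy that defines a Boltzmann-like prior, the 1/σ prior on the noise level, the Gaussian noise model, and the four invented measurements.

```python
import numpy as np

RT = 2.494  # kJ/mol at roughly 300 K

theta = np.linspace(0.0, 20.0, 401)        # hypothetical distance (Å)
sigma = np.linspace(0.2, 5.0, 200)         # noise level, a nuisance parameter (Å)
data = np.array([11.8, 12.6, 9.9, 13.1])   # invented measurements (Å)

def log_prior_theta(theta):
    """Boltzmann-like prior from a hypothetical harmonic energy centred at 8 Å."""
    energy = 0.5 * 0.5 * (theta - 8.0) ** 2   # force constant 0.5 kJ/(mol Å^2)
    return -energy / RT

def log_likelihood(theta, sigma, data):
    """Independent Gaussian noise model, up to an additive constant."""
    resid = data[None, None, :] - theta[:, None, None]
    s = sigma[None, :, None]
    return np.sum(-0.5 * (resid / s) ** 2 - np.log(s), axis=-1)

# Joint log-posterior on the (theta, sigma) grid: prior(theta) * prior(sigma) * likelihood.
log_post = (log_prior_theta(theta)[:, None]
            - np.log(sigma)[None, :]          # assumed 1/sigma prior on the noise level
            + log_likelihood(theta, sigma, data))
post = np.exp(log_post - log_post.max())
post /= post.sum()

marginal_theta = post.sum(axis=1)   # noise level integrated out
marginal_sigma = post.sum(axis=0)   # structure integrated out
theta_mean = float(np.sum(theta * marginal_theta))
sigma_mean = float(np.sum(sigma * marginal_sigma))
print(f"posterior mean distance: {theta_mean:.2f} Å, inferred noise level: {sigma_mean:.2f} Å")
```

The posterior mean falls between the prior preference and the average of the data, and the noise level is inferred jointly rather than being fixed in advance.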
The lines between these different approaches are often blurred, and many methods do not clearly fall into any of the categories. These are often more ad hoc combinations of physically-motivated scoring functions and sampling strategies that do not produce a well-defined ensemble. However, although these methods have less rigorous statistical underpinning, they are often quite successful.
The term “ensemble” is highly overloaded in structural biology
In statistical mechanics, an ensemble has a specific technical meaning: the probability distribution over all possible configurations of a system under specified conditions. Unfortunately, in structural biology, it has become common to refer to almost any collection of conformations as an ensemble, which can be confusing. There are several key characteristics of these pseudo-ensembles that must be considered. Does the likelihood consider only individual structures, or properties of the distribution as a whole? Do the structures sampled come from a well-defined distribution, e.g. a Boltzmann distribution, or are they simply a set of low-energy conformations, e.g. as in traditional NMR refinement? How are experimental errors handled? What priors are used? What is sampled over: just atomic coordinates, or also other parameters like error magnitudes? Only by considering these questions can one arrive at the correct interpretation of the “ensemble”.
Maximum entropy and related methods can be robust against over-fitting
Maximum likelihood methods become prone to overfitting as the data-to-parameter ratio becomes poor. For example, it is uncommon to see multi-copy refinement of X-ray crystal structures, where heterogeneity is represented using multiple copies of the system [33], as the data-to-parameter ratio falls in inverse proportion to the number of copies. Phillips and co-workers undertook a systematic study of 50 experimental structures, and found that adding up to, on average, ~10 copies yielded improved models [34]. However, ensembles from maximum entropy or Bayesian methods can easily have thousands of models. How are these models not grossly overfit?
The key to understanding this apparent paradox is to realize that the atomic coordinates are not free parameters in maximum entropy and related methods. Consider a simple maximum entropy reweighting procedure [30]. First, an unbiased ensemble is generated using Eq. 1, say with 1000 conformers, giving 1000 × 3 × Natoms coordinates. But these coordinates are now fixed, and instead the weights for each conformation, 1000 in total, are used to bring the computed averages into accordance with experimental observations. However, even these 1000 weights are not free parameters, as the maximum entropy principle prescribes a particular set of weights that simultaneously maximizes entropy and brings computed average quantities into agreement with their experimentally observed counterparts. In practice, there is one Lagrange multiplier to be determined for each experimental observation, so the data-to-parameter ratio is essentially one-to-one, regardless of the number of conformers in the ensemble. Similar ideas apply to the ensemble refinement schemes discussed in the next section.
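The parameter counting in this argument can be checked with a short sketch of maximum entropy reweighting. It assumes a uniform starting ensemble, exact (noise-free) experimental averages, and weights of the exponential form w_i ∝ exp(−λ·f_i); the function name `maxent_reweight`, the Newton solver, and the synthetic observables are illustrative choices rather than the procedure of any particular published code.

```python
import numpy as np

def maxent_weights(f_calc, lam):
    """Normalized weights w_i ∝ exp(-lam · f_i) for a uniformly weighted starting ensemble."""
    log_w = -f_calc @ lam
    log_w -= log_w.max()            # guard against overflow
    w = np.exp(log_w)
    return w / w.sum()

def maxent_reweight(f_calc, f_exp, n_iter=50, tol=1e-10):
    """Solve for one Lagrange multiplier per observable by Newton's method.

    f_calc: (n_conformers, n_observables) values computed for each fixed conformer.
    f_exp:  (n_observables,) target ensemble averages.
    """
    lam = np.zeros(f_calc.shape[1])
    for _ in range(n_iter):
        w = maxent_weights(f_calc, lam)
        avg = w @ f_calc
        grad = f_exp - avg                          # gradient of the convex dual function
        if np.max(np.abs(grad)) < tol:
            break
        centered = f_calc - avg
        cov = (centered * w[:, None]).T @ centered  # Hessian: weighted covariance
        lam = lam - np.linalg.solve(cov, grad)
    return maxent_weights(f_calc, lam), lam

# Synthetic example: 1000 fixed conformers, two calculated observables.
rng = np.random.default_rng(1)
f_calc = rng.normal(size=(1000, 2)) * np.array([1.0, 2.0]) + np.array([5.0, 10.0])
f_exp = np.array([5.3, 9.5])        # invented "experimental" averages

weights, lam = maxent_reweight(f_calc, f_exp)
print("reweighted averages:", weights @ f_calc)
print("number of weights:", weights.size, " number of fitted parameters:", lam.size)
```

Although 1000 weights change, only two numbers (one multiplier per observable) are actually fitted, which is the sense in which the data-to-parameter ratio stays one-to-one.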
Physical modeling offers solutions to several key challenges in structural biology
Challenge 1: Modeling Conformational Ensembles
The form of the likelihood function is of critical importance in ensemble refinement. If the likelihood function considers only single structures, there is little hope of reproducing the correct ensemble, as the likelihood function “cannot see the big picture”. Single structure-based likelihoods have the effect of forcing all structures to satisfy the average data, rather than reflecting the true distribution (blue vs orange systems in Figure 1). However, in many cases an “ensemble” of structures is still produced. For example, Bayesian single copy refinement [30] will produce an ensemble of structures, but the resulting heterogeneity arises from the non-zero temperature and sampling over nuisance parameters, rather than necessarily reflecting the true underlying ensemble.
A variety of replica-based approaches use restraints that couple the behavior of many replicas or copies of the system to the measured averages from experiment, as recently reviewed in [19, 20].
Replica-averaged ensemble approaches simulate several replicas of the system in parallel, which are coupled through a harmonic potential that restrains properties averaged over all replicas to the corresponding experimental quantities [35]. While successful [36–39], these methods lack a formal connection to maximum entropy or Bayesian principles.
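The difference between restraining each structure and restraining the replica average can be seen in a few lines. The harmonic forms below are a simplified sketch; the exact functional form and force-constant scaling used in published replica-averaged methods may differ, and the observable values are invented.

```python
import numpy as np

def replica_averaged_restraint(values, target, k):
    """Harmonic penalty on the observable averaged over all replicas."""
    return 0.5 * k * (np.mean(values) - target) ** 2

def per_structure_restraint(values, target, k):
    """Harmonic penalty applied to every replica individually."""
    return 0.5 * k * float(np.sum((np.asarray(values) - target) ** 2))

# Four hypothetical replicas populating two states whose average matches experiment.
two_state = np.array([6.0, 6.0, 14.0, 14.0])
target = 10.0   # invented experimental average
k = 1.0         # illustrative force constant

print("replica-averaged penalty:", replica_averaged_restraint(two_state, target, k))
print("per-structure penalty:   ", per_structure_restraint(two_state, target, k))
```

The replica-averaged restraint leaves the two-state ensemble unpenalized, whereas the per-structure restraint would push every replica toward the improbable average value.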
Pitera and Chodera [23] derived an expression for the maximum entropy biasing potential to bring calculated averages from a single simulation into agreement with experiment. This formulation is difficult to use in practice, as it requires determining Lagrange multipliers through trial and error. Nevertheless, Pitera and Chodera were able to identify an important link between their maximum entropy formalism and replica-averaged restraints: as the number of replicas and the harmonic force constant both increase, the replica-averaged ensemble approach converges to the correct MaxEnt distribution. This link was made rigorous in several follow-up papers [24, 25] and now forms the backbone of a number of approaches. Hummer and co-workers introduced a Bayesian ensemble refinement method, BioEN, a combination of replica ensemble refinement and the Ensemble Refinement of SAXS (EROS) method, which combines the principles of both restraining and reweighting [30].
Ensemble heterogeneity explains much of the difficulty in characterizing intrinsically disordered proteins (IDPs) experimentally, as they are ensembles of inter-converting conformations [40, 41]. The Bayesian weighting method is an approach for characterizing an ensemble of IDPs where the weights are defined using a Bayesian estimate from calculated chemical shift data [42]. This method has been successful in determining the relative fractions of mutated structures in an ensemble for aggregative proteins [43].
Challenge 2: Making sense of Sparse, Ambiguous, and Noisy Data
Data from some experimental techniques can often be sparse, ambiguous, and noisy, due to inherent limitations of the technique, or the number and difficulty of the experiments that must be performed. Nevertheless, such data can still be highly valuable in inferring the structures of biomolecules and complexes. A number of computational methods have been developed over the past decade that can translate such low-information data into meaningful structural models.
High ambiguity driven biomolecular docking (HADDOCK) is a data-driven docking approach that can take highly ambiguous data from different sources and convert them into distance restraints to guide docking processes [44, 45]. Among its many applications, HADDOCK has been used to study protein complex interfaces using cryo-EM data [46] and protein-ligand complexes using sparse intermolecular NOEs [47].
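One common way to turn an ambiguous signal into a single distance restraint is to combine all candidate atom pairs through an r⁻⁶ sum, so that the restraint is satisfied whenever any one pair is close. The sketch below illustrates that idea only; the function name and the example distances are hypothetical, and whether a given package uses exactly this form should be checked against its documentation.

```python
import numpy as np

def effective_distance(distances):
    """Effective distance over all candidate pairs: (sum_i d_i^-6)^(-1/6)."""
    d = np.asarray(distances, dtype=float)
    return float(np.sum(d ** -6.0) ** (-1.0 / 6.0))

# Three candidate atom pairs could explain one signal; only one is actually close.
candidates = [4.0, 20.0, 25.0]   # invented distances in Å
d_eff = effective_distance(candidates)
print(f"effective distance: {d_eff:.4f} Å")
```

The effective distance is dominated by the shortest candidate, so a restraint on it is satisfied without having to decide in advance which pair is responsible for the signal.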
The Integrative Modeling Platform (IMP) is a flexible software suite aimed at integrative structural biology, which facilitates development of integrative applications, models, and methods, and allows incorporation of data from diverse sources [15]. Among many applications, protein complex structures have been defined with IMP using in vivo FRET data through a Bayesian approach [48], and using a combination of cross-linking data with biochemical and EM localization data [49].
Rosetta is an extensive software suite aimed at protein structure prediction and molecular design. There are several applications of Rosetta with sparse experimental data, where Monte Carlo-based fragment assembly is guided towards native structures by data [50]. Backbone chemical shifts and distance restraints have been used to guide structure determination [51]. Also, paramagnetic relaxation enhancement (PRE) [52], pseudo-contact shift (PCS) [53], and residual dipolar coupling (RDC) [54] restraints have been used to similar effect. Recently, the RASREC (resolution-adapted structural recombination) algorithm was developed, which yields better models with narrower sampling [17, 55]. RASREC enriches the structure pool by re-using structural features that were frequently observed in previous runs. It requires fewer restraints, and develops models that are closer to the native structure, including for NMR on deuterated samples up to 40 kDa [56, 57].
A newer approach based on Bayesian inference, Metainference, can address statistical and systematic errors in data produced by high-throughput techniques, and can handle experimental data averaged over multiple states [14]. It is suitable for studying structural heterogeneity in complex macromolecular systems. A combination of Metainference and Parallel-bias Metadynamics (PBMetaD), an accelerated sampling technique, provides an efficient way of simultaneously treating error and sampling configuration space in all-atom simulations [9]. Coupling Metainference and Metadynamics has been particularly successful in characterizing structural ensembles of disordered peptides [58, 59].
Modeling Employing Limited Data (MELD) is a Bayesian approach that combines statistical mechanics (Eq. 1), detailed all-atom physical models [7], and enhanced sampling to infer protein structures from sparse, ambiguous, and noisy data [13]. MELD was specifically designed to be robust in the presence of false-positive signals, and has been applied to EPR, NMR, and evolutionary data [13], de novo prediction of protein structures based on simple heuristics [60, 61], and mutagenesis guided peptide-protein docking [62, 63].
Physical modeling is enabling emerging techniques in structural biology
Advances in physical modeling will be key to enabling technologies for new approaches to structure determination. Below we outline just a few—of many—emerging techniques where the ability to model ensembles and to successfully treat sparse, ambiguous, and noisy data will be critical.
Chemical cross-linking detected by mass spectrometry is emerging as a potentially powerful tool in structure determination. Developments have focused on improvements in instrumentation [4, 64], cross-linking chemistries [65–67], and data analysis [65, 66, 68, 69]. These techniques are extremely sensitive, but the data can be highly ambiguous and both false-positive and false-negative signals are common. Such data have recently been used as restraints to guide Monte Carlo [70], molecular dynamics [71], and integrative modeling [68, 69] approaches. The use of cross-linking restraints for structure prediction was recently assessed during the 11th round of Critical Assessment of Structure Prediction [72, 73], and various shortcomings, both in experiment and modeling, were identified.
X-ray diffuse scattering experiments can produce information about correlated motions in proteins that is complementary to the information obtained from the more typically analyzed Bragg scattering [74, 75]. Wall and co-workers found good agreement between long molecular dynamics simulations and measured diffuse scattering [75], even in the absence of any fitting. The development of suitable ensemble refinement schemes would bring the models into even better agreement with experiment and would provide a powerful new tool for studying correlated motions of proteins.
Recent work has demonstrated the utility of paramagnetic relaxation enhancement measurements in solid-state NMR [76, 77]. These experiments provide less structural information than traditional protein NMR experiments, but, combined with suitable computational modeling, represent an increasingly viable avenue for structure determination [52, 77].
Transition metal ion FRET (tmFRET) measures the distance between small-molecule fluorophores and a non-fluorescent transition metal. Because it provides short range distances, and because different metals have different absorptions, the method is tunable for a range of distances (10–20 Å) [78] and has been used to study membrane proteins [79].
Finally, recent work has demonstrated the possibility of inferring residue-residue contacts from coevolution analysis of homologous sequences [80–82], commonly referred to as evolutionary couplings. Baker and co-workers were recently able to create models for 614 protein families with unknown structures [83], several of which had folds that are not in the Protein Data Bank. Montelione and co-workers combined evolutionary couplings with sparse NMR data, which provide complementary restraints for modeling, to correctly determine structures for proteins up to 41 kDa [3].
Conclusion and future perspectives
Physical insight has always been integral to structural biology, but the dual challenges of modeling ensembles and making sense of sparse, ambiguous, and noisy data mean that quantitative physical models will become an increasingly important part of modern structural biology. Driven by faster computers, advances in theoretical understanding, and better algorithms, detailed physical modeling is enabling new methods in structural biology, which are essential to addressing exciting biological questions.
Important References
Papers of interest have been highlighted as:
* of special interest
** of outstanding interest
**[20] A review on approaches that combine experimental and computational methods to determine structural ensembles of dynamic proteins.
**[19] A concise review on maximum entropy approaches. The authors highlighted three papers that explored an important link between the replica-averaged ensemble refinement principle and the maximum entropy method.
**[18] An important perspective on the relationship between experimental data and computational techniques, and the role of integrative structural biology.
**[32] A key paper on Bayesian inference, defining the commonly applied inferential structural determination methodology and indicating the importance of developing probabilistic methods for structure determination.
**[23] This paper makes use of maximum entropy methods to develop ensemble-averaged restraints for biasing molecular simulations, noting the success of a physics-based approach compared to other refinement schemes.
*[24] This paper demonstrates the statistical equivalence of restrained-ensemble simulations and the maximum entropy approach.
*[25] This paper justifies the use of the maximum entropy approach to define experimental data-driven restraints for simulations.
*[14] This paper introduces a Bayesian inference method to account for different sources of error in experimental data in modeling structural ensembles of complex macromolecular systems.
*[15] This paper introduces the new and developing Integrative Modeling Platform (IMP) software package. The authors highlight its flexible capability to incorporate a variety of experimental data, and to generate and develop new models and representations.
*[13] This paper describes Modeling Employing Limited Data (MELD), highlighting its unique Bayesian methodology for determining protein structure, and demonstrating its ability to incorporate a variety of experimental data.
*[57] This paper introduces the RASREC Rosetta approach, describing its improvements over regular CS-Rosetta in detail, and exhibiting its capability to develop models closer to the native structure.
Acknowledgements
This work is supported by funding from the Natural Sciences and Engineering Research Council of Canada. JLM is a Tier 2 Canada Research Chair.
References
- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].
- [55].
- [56].
- [57].
- [58].
- [59].
- [60].
- [61].
- [62].
- [63].
- [64].
- [65].
- [66].
- [67].
- [68].
- [69].
- [70].
- [71].
- [72].
- [73].
- [74].
- [75].
- [76].
- [77].
- [78].
- [79].
- [80].
- [81].
- [82].
- [83].