Structure and ensemble refinement against SAXS data: combining MD simulations with Bayesian inference or with the maximum entropy principle

Small-angle X-ray scattering (SAXS) is a powerful method for tracking conformational transitions of proteins or soft-matter complexes in solution. However, the interpretation of the experimental data is challenged by the low spatial resolution and the low information content of the data, which lead to a high risk of overinterpretation. Here, we illustrate how SAXS data can be integrated into all-atom molecular dynamics (MD) simulations to derive atomic structures or heterogeneous ensembles that are compatible with the data. Besides providing atomistic insight, the MD simulation adds physico-chemical information, as encoded in the MD force fields, which greatly reduces the risk of overinterpretation. We present an introduction to the theory of SAXS-driven MD simulations as implemented in GROMACS-SWAXS, a modified version of the GROMACS simulation software. We discuss SAXS-driven parallel-replica ensemble refinement with commitment to the maximum entropy principle as well as a Bayesian formulation of SAXS-driven structure refinement. Practical considerations for running and interpreting the simulations are presented. The methods are freely available via GitLab at https://gitlab.com/cbjh/gromacs-swaxs.


Introduction
Understanding the function of biomolecules requires understanding of their conformational dynamics. An increasingly popular method for tracking conformational transitions of biomolecules is small-angle X-ray scattering (SAXS), which provides structural information that is not accessible by other techniques. Unlike NMR spectroscopy, which probes local distances and angles of smaller biomolecules, SAXS provides information on the overall shape and is applicable to both small and large biomolecules. Unlike crystallography or cryo-electron microscopy, SAXS probes molecules at ambient temperatures in solution, enabling experiments that track conformational transitions after application of external stimuli. The accuracy of SAXS data has greatly improved over recent years thanks to sample delivery coupled with size-exclusion chromatography (SEC-SAXS), which reduces sample-aggregation artifacts, to single-photon counting detectors, and to standards for sample preparation (Berthaud et al., 2012; Jeffries et al., 2016). Software and algorithms for data analysis and for SAXS-based structural modeling have greatly developed (Gräwert and Svergun, 2020). These properties and developments establish SAXS as an indispensable tool for integrative structural biology (Rout and Sali, 2019; Brosey and Tainer, 2019).
The interpretation of the SAXS data is challenged by the low information content of the experimental signals. Because the biomolecules are randomly oriented during solution scattering, SAXS curves I(q) represent an orientational average and, consequently, are only a one-dimensional smooth function of momentum transfer q = 4π sin(θ)/λ. Here, λ is the X-ray wavelength and 2θ is the scattering angle. The number of data points in I(q) that provide independent structural information is estimated by the number of Shannon-Nyquist channels (Moore, 1980;Rambo and Tainer, 2013) N Shan = (q max − q min )D/π, where q max and q min denote the maximum and minimum momentum transfer in I(q), respectively, and D is the maximum diameter of the solute. N Shan is in the range of 5 to 30 for many SAXS experiments. For comparison, even a small protein with 100 residues contains approx. 200 flexible backbone angles, demonstrating that SAXS data is by far insufficient for defining all degrees of freedom of a biomolecule. Consequently, structure refinement against SAXS data is highly ambiguous, that is, many different structures fit the data equally well.
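As a quick numerical illustration of the Shannon-channel estimate N_Shan = (q_max − q_min)D/π (the q-range and diameter below are illustrative values, not taken from a specific experiment):

```python
import math

def shannon_channels(q_min, q_max, d_max):
    """Number of Shannon-Nyquist channels, N_Shan = (q_max - q_min) * D / pi.

    q_min, q_max: minimum/maximum momentum transfer (e.g. in 1/Angstrom);
    d_max: maximum diameter D of the solute (in the corresponding length unit).
    """
    return (q_max - q_min) * d_max / math.pi

# Illustrative protein measurement: q from 0.01 to 0.5 1/Angstrom, D = 70 Angstrom
n_shan = shannon_channels(0.01, 0.5, 70.0)   # ~11 independent data points
```

Even generous q-ranges thus yield only a handful of independent data points, far fewer than the number of backbone degrees of freedom of a protein.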
Challenges due to the low information content are enhanced by the presence of heterogeneous ensembles and by uncertainties in the experimental and predicted SAXS curves. The low information content of the data, together with the large number of degrees of freedom, leads to a significant risk of overinterpretation upon fitting structural models against experimental data (Putnam et al., 2007; Hub, 2018).
To mitigate the risk of overinterpretation during structure refinement, two main strategies have been devised. First, nearly all degrees of freedom of the biomolecule have been constrained, leading to methods such as rigid-body or normal mode refinement (Petoukhov and Svergun, 2005;Pelikan et al., 2009;Gorba et al., 2008). Such methods use only simple energy terms to discriminate plausible from prohibited conformations, such as volume exclusion terms between protein domains or a multidimensional harmonic potential along normal modes. Second, physico-chemical information has been added to the SAXS data with the help of atomistic molecular dynamics (MD) simulations (Björling et al., 2015;Chen and Hub, 2015;Kimanius et al., 2015;Paissoni et al., 2020). During MD simulations, all degrees of freedom are kept flexible, but the force field restrains the biomolecule to realistic conformations with acceptable free energy. Simulations with a coupling to experimental SAXS curves have been coined SAXS-driven or SAXS-guided MD simulations.
Several previous studies described SAXS-driven simulations, in which the MD simulations were coupled to experimental SAXS data with an energetic restraint. Björling et al. (2015) presented a method with a focus on the interpretation of SAXS curve differences between two conformational states. This study neglected effects from the hydration layer on the SAXS curve, rationalized by the fact that hydration layer effects may approximately cancel upon taking SAXS curve differences. Our group developed SAXS-driven MD focused on absolute SAXS curves and implemented the method into an extension of the GROMACS software, coined GROMACS-SWAXS (https://gitlab.com/cbjh/gromacs-swaxs). The implementation uses explicit-solvent SAXS curve calculations with atomistic representations for the hydration layer and excluded solvent (Chen and Hub, 2014, 2015). Kimanius et al. (2015) suggested refining protein structures against SAXS data using metadynamics. However, since the SAXS curve calculations neglected the buffer subtraction, the implementation was not yet ready for refinement against experimental data. Paissoni et al. (2020) provided a method for SAXS-driven simulations implemented in PLUMED, primarily motivated by the aim of accelerating SAXS-driven simulations. This method maps the atomistic model to a coarse-grained representation for computing the SAXS curve. The method neglects effects from the hydration layer, and the coarse-grained approximation is limited to smaller scattering angles (Bernetti and Bussi, 2021). However, as discussed in this chapter and previously, the SAXS-driven MD simulations implemented in GROMACS-SWAXS are subject to only a small computational overhead of 5% to 20%, suggesting that SAXS-driven MD simulations based on all-atom SAXS predictions are likewise computationally feasible. Hsu et al. (2020) presented SAXS-driven MD simulations utilizing the Debye equation, similar to Björling et al. (2015); however, these authors included effects from the hydration layer by increasing the atomic form factors of solvent-exposed atoms, similar to the FoXS method (Schneidman-Duhovny et al., 2010).
This chapter describes the refinement of protein and soft-matter complexes by coupling all-atom MD simulations to experimental SAXS data, as implemented in GROMACS-SWAXS. We first describe SAXS-driven MD simulations coupled with a harmonic restraint to the data (Chen and Hub, 2015). The method was extended for simultaneous refinement against SAXS and several small-angle neutron scattering (SANS) data sets collected at different D₂O concentrations (Chen et al., 2019). We discuss how SAXS-guided structure refinement is embedded into rigorous probability theory, as the obtained MD ensemble may be interpreted as the posterior distribution of a Bayesian inference problem (Shevchuk and Hub, 2017). Such a viewpoint enables statements of confidence intervals, which are still underdeveloped in SAXS-guided modeling. Finally, we describe the refinement of heterogeneous ensembles against SAXS data using a minimal bias, that is, with commitment to Jaynes' principle of maximum entropy (Hermann and Hub, 2019; Ivanović et al., 2020).
The methods described here are freely available via an extension of the GROMACS simulation software (Abraham et al., 2015) developed at https://gitlab.com/cbjh/gromacs-swaxs. We present the theoretical basis of SAXS-driven MD simulations and discuss conceptual considerations as well as practical guidelines for running and interpreting the calculations. Additional tutorials and documentation are available at https://cbjh.gitlab.io/gromacs-swaxs-docs. Although we refer mostly to SAXS, the methods described here are likewise applicable for refinement against small-angle neutron scattering (SANS) data or against combined SAXS/SANS data.
SAXS-driven MD simulations require a forward model for predicting SAXS curves from given MD simulations. The methods implemented in GROMACS-SWAXS utilize explicit-solvent calculations, thereby taking explicit representations of the hydration layer and excluded solvent into account (Merzel and Smith, 2002; Park et al., 2009; Chen and Hub, 2014). For more details on explicit-solvent SAXS predictions, we refer to the chapter "Predicting solution scattering patterns with explicit-solvent molecular simulations" in Part A of this monograph.
2 SAXS-driven molecular dynamics simulations

2.1 Experiment-supported energetic bias

Similar to refinement against other experimental data (Jack and Levitt, 1978), SAXS-guided refinement is implemented by augmenting the MD force field energy V_FF(R) with an experiment-derived energy E_exp(R, D) that drives the simulation into conformations R that are compatible with the data D:

V_tot(R) = V_FF(R) + E_exp(R, D).

Here, the data is given by the experimental SAXS intensities I_exp(q_i) and errors σ(q_i), where i = 1, . . . , N_exp and N_exp is the number of experimental data points. The MD simulation is coupled to the data using a harmonic restraint on the SAXS intensities,

E_exp(R, D) = (f_c k_B T / 2) Σ_{i=1}^{N_exp} [I_c(q_i; R) − (f I_exp(q_i) + c)]² / σ(q_i)².     (3)

Here, I_c is the SAXS curve calculated on-the-fly from the MD simulation, f_c is an overall force constant, whereas k_B and T denote the Boltzmann constant and the temperature. The factor 1/2 is introduced to enable a Bayesian interpretation of the refined ensemble in the special case f_c = 1, as described below. The symbols f and c denote fitting parameters that adjust the overall scale and an offset of the experimental curve relative to the calculated curve; f and c are adjusted throughout the simulation by minimizing E_exp. Hence, only those differences between I_c and I_exp that cannot be explained by f and c contribute to E_exp and, thereby, drive the simulation. The offset c absorbs uncertainties due to the buffer subtraction and may be omitted when coupling to high-precision SAXS data.
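A minimal numerical sketch of such a restraint energy (not the GROMACS-SWAXS implementation): for a fixed structure, the optimal scale f and offset c follow from weighted linear least squares, and only the remaining residuals contribute to E_exp. Here f_c = 1 and k_B T = 1 for simplicity:

```python
import numpy as np

def exp_energy(I_calc, I_exp, sigma, fc=1.0, kBT=1.0):
    """E_exp = (fc*kBT/2) * sum_i [I_c(q_i) - (f*I_exp(q_i) + c)]^2 / sigma(q_i)^2,
    with f and c adjusted to minimize E_exp (weighted linear least squares)."""
    w = 1.0 / sigma**2
    A = np.vstack([I_exp, np.ones_like(I_exp)]).T       # model: I_calc ~ f*I_exp + c
    f, c = np.linalg.solve(A.T @ (A * w[:, None]), A.T @ (w * I_calc))
    chi2 = np.sum(w * (I_calc - (f * I_exp + c))**2)
    return 0.5 * fc * kBT * chi2, f, c

# If the calculated curve is an exact rescaling of the experiment, E_exp vanishes:
q = np.linspace(0.01, 0.5, 50)
I_exp = np.exp(-q**2 * 100.0)                           # toy smooth SAXS-like curve
energy, f, c = exp_energy(2.0 * I_exp + 3.0, I_exp, sigma=np.full_like(q, 0.01))
```

Conversely, any mismatch that cannot be absorbed by f and c leaves a positive E_exp and, in the simulation, a corresponding restraining force.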
As an example, Figure 1 presents structure refinement of the nuclear exportin CRM1, which transports protein cargos across the nuclear pore (Chen and Hub, 2015). Free simulations of CRM1 are highly force field-dependent, evident from the fact that simulations with the Amber99sb or with the Charmm22* force field lead to overly open or overly closed conformations, respectively (Fig. 1A, B, D top). Simulations with each force field exhibit poor agreement with experimental SAXS data (Fig. 1C, black and solid yellow or green curves). Upon restraining the simulations to the SAXS data, excellent agreement with the data is obtained (Fig. 1C, dashed yellow or green curves), and both force fields lead to a consensus partly open conformation (Fig. 1D bottom). Hence, the SAXS data overrule imperfections in the two force fields.

Increasing the computational efficiency by smoothing and rebinning the experimental curve
Owing to the large number of pixels on modern X-ray detectors, I_exp(q_i) is heavily oversampled. I_exp(q_i) typically contains 800-2500 noisy estimates of the true underlying smooth SAXS curve, which contains only a few independent data points (N_exp ≫ N_Shan). Consequently, the formulation in Eq. 3 is inefficient, since it requires the calculation of I_c(q_i; R) for a large number N_exp of q-points.

[Figure 1 caption, continued: During SAXS-driven simulations, the different force fields lead to similar conformations (D, bottom), in excellent agreement with the data. Adapted and reused with permission from Chen and Hub (2015).]
The computational cost is reduced by smoothing the raw experimental data, thereby merging neighboring I_exp(q_i) points within the same Shannon bin into a smoothed curve Ī_exp(q). A shell script for smoothing the curve based on the ATSAS module DATGNOM (Manalastas-Cantos et al., 2021) is available at https://cbjh.gitlab.io/gromacs-swaxs-docs. Upon smoothing the curve, N_exp raw experimental data points are merged into a smooth curve with only N_Shan independent features, suggesting that the uncertainties of the smoothed curve follow σ̄²(q) ≈ σ²(q)/n_s, where n_s = N_exp/N_Shan is the number of experimental points per Shannon bin. Let ∆I(q_i) := I_c(q_i; R) − (f I_exp(q_i) + c) denote the residuals between the calculated and experimental curves, and let χ² := Σ_{i=1}^{N_exp} ∆I(q_i)²/σ(q_i)², such that E_exp = f_c k_B T χ²/2. Let us further decompose the residuals ∆I(q_i) = ∆Ī_i + δI_i into contributions (i) relative to the smoothed curve, ∆Ī_i, which can be fitted by adjusting the biomolecular structure R, and (ii) owing to statistical noise in the data, δI_i, which cannot be fitted. Then, χ² can be rewritten as

χ² = Σ_i ∆Ī_i²/σ(q_i)² + Σ_i δI_i²/σ(q_i)² + 2 Σ_i ∆Ī_i δI_i/σ(q_i)².     (5)

The third term in Eq. 5 vanishes approximately because the noise δI(q_i) is symmetrically distributed around zero. The second term adds a constant offset of ≈ 1 per experimental data point, i.e., a total offset of N_exp to χ² that cannot be fitted. The first term quantifies the deviation with respect to the smoothed curve and contains all the relevant structural information. Using that ∆Ī(q_i) and σ(q_i) are approximately constant within each Shannon bin, the first term of Eq. 5 simplifies to a sum over the Shannon bins,

Σ_i ∆Ī_i²/σ(q_i)² ≈ Σ_{j=1}^{N_Shan} n_s ∆Ī(q_j)²/σ(q_j)² = Σ_{j=1}^{N_Shan} ∆Ī(q_j)²/σ̄(q_j)²,

where we used σ̄²(q) ≈ σ²(q)/n_s. Hence, up to a structurally irrelevant constant offset N_exp owing to experimental noise, χ² and E_exp can be evaluated using only the N_Shan intensities and errors of the smoothed curve.
In case more than N_Shan q-points are used to evaluate E_exp, a correcting prefactor is needed, and we arrive at

E_exp(R, D) = (f_c k_B T / 2)(N_Shan/N_used) Σ_{i=1}^{N_used} [I_c(q_i; R) − (f Ī_exp(q_i) + c)]² / σ̄(q_i)²,     (7)

where N_used may be chosen in the range of 1-2 N_Shan. Evidently, using the formulation in Eq. 7 greatly improves the computational efficiency of SAXS-driven MD simulations, as it requires ∼30 to 300 times fewer SAXS intensity calculations as compared to the formulation in Eq. 3.
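A minimal sketch of the rebinning step (not the DATGNOM-based script mentioned above): merging an oversampled curve into Shannon-like bins reduces the statistical error per bin by roughly 1/√n_s:

```python
import numpy as np

def rebin_saxs(q, I, sigma, n_bins):
    """Merge an oversampled SAXS curve into n_bins bins (e.g. n_bins ~ N_Shan).
    Intensities are averaged per bin; the error of the bin mean shrinks as
    sigma_bar ~ sigma / sqrt(n_s) for n_s points per bin."""
    edges = np.linspace(q[0], q[-1], n_bins + 1)
    idx = np.clip(np.searchsorted(edges, q, side="right") - 1, 0, n_bins - 1)
    q_bar = np.array([q[idx == b].mean() for b in range(n_bins)])
    I_bar = np.array([I[idx == b].mean() for b in range(n_bins)])
    # Error of the bin mean: sqrt(sum sigma^2) / n  (= sigma/sqrt(n) for equal errors)
    s_bar = np.array([np.sqrt(np.sum(sigma[idx == b]**2)) / np.sum(idx == b)
                      for b in range(n_bins)])
    return q_bar, I_bar, s_bar

# 1000 raw points merged into 10 bins: errors shrink by ~sqrt(100) = 10
q = np.linspace(0.0, 0.5, 1000)
qb, Ib, sb = rebin_saxs(q, np.exp(-q**2 * 50.0), np.full(1000, 0.02), 10)
```

This simple equal-width binning stands in for the Shannon-bin smoothing described in the text; the actual script additionally uses the regularized GNOM fit.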

Accounting for systematic and calculated errors
SAXS data collected with single-photon counting detectors may be subject to tiny statistical errors σ_exp(q) at small scattering angles. Consequently, the overall uncertainty of the data may be dominated by unknown systematic errors, for instance owing to imprecise buffer subtraction or minor undetected radiation damage. When applying only the tiny statistical errors, the SAXS-derived energies take spuriously large values at small angles due to the 1/σ(q_i)² term, which would further propagate into large SAXS-derived forces. Such problems are avoided by modeling the systematic errors. We previously modeled systematic errors via an uncertainty δρ_buf of 0.1% to 1% of the buffer density (Chen and Hub, 2015). The uncertainty δρ_buf propagates into an uncertainty σ_buf(q) of the SAXS curve, which is large at small angles and small at wide angles. Consequently, σ_buf(q) dominates the overall uncertainty at small angles and avoids spuriously large SAXS-derived forces, whereas σ(q) dominates at wide angles. Apart from the systematic errors, GROMACS-SWAXS takes statistical errors of the calculated curve σ_c(q_i) into account (Fig. 3). This is implemented by replacing the errors σ̄ in Eq. 7 with σ_tot² = f²σ̄² + σ_buf² + σ_c², leading to the final SAXS-derived energy applied by GROMACS-SWAXS:

E_exp^f(R, D) = (f_c k_B T / 2)(N_Shan/N_used) Σ_{i=1}^{N_used} [I_c(q_i; R) − (f Ī_exp(q_i) + c)]² / σ_tot(q_i)².     (8)
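The error model σ_tot² = f²σ̄² + σ_buf² + σ_c² can be illustrated numerically; the q-dependences of σ_buf and σ_c below are hypothetical shapes chosen only to show which term dominates where, not the GROMACS-SWAXS model:

```python
import numpy as np

# Hypothetical error budget on a coarse q-grid (all shapes/values illustrative):
q = np.linspace(0.01, 0.5, 20)
sigma_stat = np.full_like(q, 0.5)       # tiny statistical errors (photon counting)
sigma_buf = 50.0 * np.exp(-q / 0.05)    # buffer-subtraction uncertainty, large at low q
sigma_calc = np.full_like(q, 1.0)       # statistical error of the calculated curve
f = 1.0                                 # fitted scale factor

# Total error entering the SAXS-derived energy:
sigma_tot = np.sqrt(f**2 * sigma_stat**2 + sigma_buf**2 + sigma_calc**2)
```

At small q the buffer term dominates σ_tot and damps the 1/σ² weighting; at wide angles σ_buf is negligible and the statistical errors take over.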

On-the-fly averaging of the calculated SAXS curve
Explicit-solvent SAXS curve predictions require averaging over multiple MD frames. The number of frames required for obtaining a converged SAXS curve depends on the contrast between the biomolecule and the pure buffer; the smaller the electron density contrast, the more frames are required to converge the SAXS curve. During the SAXS-driven MD simulations discussed here, the SAXS curve is therefore computed as a running temporal average with a memory kernel that decays exponentially into the past.
The on-the-fly average of a quantity X at simulation time t is then given by

X̄(t) = N⁻¹ ∫₀ᵗ X(t′) e^{−(t−t′)/τ} dt′,     (9)

where τ is the memory time, typically chosen between 50 and 200 ps, and N is a normalization constant. Likewise, on-the-fly averaging of the scattering amplitudes of the biomolecular and the pure-solvent system using Eq. 9 provides an on-the-fly averaged SAXS curve (Chen and Hub, 2015). Implementing the coupling to the experimental SAXS curve with an on-the-fly averaged calculated curve has two key advantages. First, thermal fluctuations on the time scale of 1-2 τ, such as solvent, side-chain, or loop fluctuations, are taken into account before the calculated curve is compared with the experimental curve in Eq. 8. Such thermal fluctuations may significantly influence the SAXS curve of proteins at moderate scattering angles (q > 0.25 Å⁻¹) (Tiede et al., 2002; Chen and Hub, 2014; Moore, 2014). Second, there is no need to compute SAXS intensities every MD step; instead, one SAXS update every ∼0.5 ps is sufficient. The update interval together with the memory time must be chosen such that the SAXS curves converge within τ. The longer update interval renders SAXS-driven MD simulations computationally efficient, with an overhead of only 5-20% relative to unbiased simulations. However, owing to the on-the-fly average, the dynamics are not conservative because the energy E_exp^f depends not only on the current but also on previous conformations. Hence, a reasonably tight temperature coupling scheme is required to avoid energy drifts.
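In discrete form, an exponentially decaying memory kernel reduces to a simple recursion; the sketch below (with an illustrative update interval and memory time) shows that the average forgets old conformations on the time scale τ:

```python
import math

def ewma_update(x_avg, x_new, dt, tau):
    """One update of an on-the-fly average with memory time tau: the old average
    is damped by exp(-dt/tau), and the new sample contributes the remainder."""
    decay = math.exp(-dt / tau)
    return decay * x_avg + (1.0 - decay) * x_new

# Example: SAXS updates every 0.5 ps with memory time tau = 100 ps.
# For a constant signal, the running average converges to that signal.
x = 0.0
for _ in range(4000):              # 2000 ps of updates with constant signal = 1
    x = ewma_update(x, 1.0, dt=0.5, tau=100.0)
```

After one memory time (here 200 updates), the average has absorbed a fraction 1 − e⁻¹ ≈ 63% of a step change in the signal, which is the sense in which the curve "converges within τ".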

SAXS-derived forces applied during MD
SAXS-derived forces applied in the MD simulations are given by the negative gradients of the SAXS-derived energy,

F_k = −∇_k E_exp^f = −f_c k_B T (N_Shan/N_used) Σ_{i=1}^{N_used} [I_c(q_i; R(t)) − (f Ī_exp(q_i) + c)] ∇_k I_c(q_i; R(t)) / σ_tot(q_i)²,

where ∇_k I_c(q_i; R(t)) denotes the on-the-fly averaged gradients of the SAXS intensities with respect to the coordinates of atom k. For large biomolecular systems, storing these gradients may require several gigabytes of memory (Chen and Hub, 2015). GROMACS-SWAXS computes the forces F_k and the gradients ∇_k I_c(q_i; R(t)) only for the solute atoms but not for the solvent atoms. Therefore, in addition to the derivatives of I_c with respect to the solute coordinates, a correction factor

f_contrast = (ρ_solu − ρ_solv)/ρ_solu

is applied, where ρ_solu and ρ_solv denote the average solute and solvent densities, respectively. Here, density refers to the scattering length density, that is, to the electron density in SAXS or to the neutron scattering length density in SANS. As a numerical example, the factor f_contrast accounts for the fact that, upon moving a protein domain with density ρ_solu = 440 e nm⁻³ inside solvent with density ρ_solv = 334 e nm⁻³, only the contrast of 106 e nm⁻³ is moved. Consequently, the change of I_c(q_i) upon a domain movement in mobile water is reduced by the factor f_contrast = 0.24 as compared to a domain movement at fixed water positions.

When coupling to SANS data measured at a large D₂O concentration, the contrast and the factor f_contrast may even become negative because a D₂O buffer exhibits a larger neutron scattering length density than proteins. Consequently, upon moving a protein domain to the left, the contrast may move to the right; this effect is correctly taken into account by a negative factor f_contrast.
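The contrast factor and its sign can be illustrated numerically (the functional form below is inferred from the numerical example in the text; the GROMACS-SWAXS definition may differ in detail, and the SANS densities are illustrative):

```python
def contrast_factor(rho_solute, rho_solvent):
    """f_contrast = (rho_solute - rho_solvent) / rho_solute: the fraction of the
    solute density that actually moves as contrast when the solute moves
    through mobile solvent (form inferred from the text's numerical example)."""
    return (rho_solute - rho_solvent) / rho_solute

# SAXS: protein domain (440 e/nm^3) in water (334 e/nm^3) -> only ~24% of the
# density moves as contrast.
f_saxs = contrast_factor(440.0, 334.0)

# SANS in a high-D2O buffer: the solvent scattering length density can exceed
# the protein's, so f_contrast becomes negative (illustrative values).
f_sans = contrast_factor(2.0, 6.4)
```

A negative f_contrast reproduces the effect described above: moving the domain one way moves the contrast the other way.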

Protocol A
To carry out SWAXS-driven MD simulations with the GROMACS-SWAXS implementation published at https://gitlab.com/cbjh/gromacs-swaxs, the following protocol is recommended:

1. Download and compile GROMACS-SWAXS, following the installation instructions of official GROMACS. Alternatively, GROMACS-SWAXS can be easily installed using Spack, which is available at many high-performance computing centers.
2. Set up the MD simulation system following freely available GROMACS tutorials. Make sure to use a simulation box that is approximately 1 nm larger than in common simulations, e.g., via the GROMACS module gmx editconf -d.
3. First compute a SAXS curve from an equilibrium simulation using the rerun functionality of the GROMACS-SWAXS gmx mdrun module, following the tutorials at https://cbjh.gitlab.io/gromacs-swaxs-docs and as described in a chapter in Part A of this monograph. Computing a SAXS curve involves the setup of a pure-solvent simulation system, the definition of the atomic form factors with the gmx genscatt module, and the construction of the envelope with gmx genenv, which defines the solvent region contributing to the SAXS curve. From visual inspection of the SAXS curve, is it plausible that a conformational transition of the biomolecule explains the deviation between the calculated and the experimental curve?
Here, pure-solvent.tpr and pure-solvent.xtc are the run-input and trajectory files of the pure-solvent system used for computing the buffer subtraction. These have already been set up for computing the SAXS curve from an equilibrium simulation in step 3.
7. To analyze the simulation, visualize it in a molecular viewer and validate that the conformation is reasonable. Inspect waxs_final.xvg to check whether the simulation was capable of finding a conformation that is compatible with the target SAXS curve. Inspect the SAXS-derived energy E_exp^f, which is stored in the energy (edr) file and available via the gmx energy command. Validate that the energy is in the range of several k_B T after the structure has been refined. The contributions of individual q-points to E_exp^f, available in waxs_pot.xvg, may reveal q-regions that could not be explained.
8. Since waxs_final.xvg contains the final on-the-fly average of the calculated curve, it reflects only the final 1-2 τ (∼200 ps) of the simulation. Use the rerun functionality of mdrun (gmx mdrun -rerun traj.xtc) to compute a SAXS curve that is uniformly averaged over a longer part of the SAXS-driven simulation, for instance over the final 10 ns. A uniform average is enabled with the mdp option waxs-tau = 0. Use more q-points, such as waxs-nq = 100.

3 SAXS-driven MD as a tool for Bayesian inference of molecular structures
3.1 Posterior, likelihood, and prior distributions

Following pioneering work by Rieping, Habeck, and coworkers on refinement against NMR data (Rieping et al., 2005; Habeck et al., 2006), SAXS-driven MD simulations have been reformulated as a tool for Bayesian inference of biomolecular structures from experimental data. Accordingly, the goal of structure refinement is to find the conditional probability P(R, θ|D, K) that quantifies the plausibility of the biomolecular structure R in the light of the experimental data D and prior physical knowledge K. In the context of SAXS-guided modeling, the data D is given by the experimental curve I_exp(q) and its errors, whereas the prior knowledge K represents the MD force field, the starting conformation of the simulation, and other prior experience. The symbol θ summarizes so-called nuisance parameters, such as the unknown fitting parameters or unknown systematic errors, which must be estimated simultaneously with the structure. In Bayesian inference, P(R, θ|D, K) should not be interpreted as a probability of random events, as common in the "frequentist interpretation" of probability theory, but instead as a quantification of our state of knowledge and ignorance. A wide distribution P(R, θ|D, K) implies that many different conformations R are compatible with D and K, implying a high degree of ignorance and large confidence intervals on R. By the same token, a narrow P(R, θ|D, K) implies that only few structures are plausible in the light of D and K, implying precise knowledge of R and small confidence intervals on R. Hence, computing P(R, θ|D, K) provides rigorous confidence intervals for the refined structure, founded in probability theory.
The conditional probability is evaluated using Bayes' theorem, P(R, θ|D, K) ∝ L(D|R, θ, K) π(R|K) π(θ|K), where π(R|K) and π(θ|K) denote the prior distributions of conformations and of the nuisance parameters, respectively. L(D|R, θ, K) is the likelihood that the data D was measured given that the structure R was present in the experiment and given a specific set of nuisance parameters θ. The distribution P(R, θ|D, K) is called the posterior distribution. When running MD simulations, the prior π(R|K) is naturally given by the Boltzmann distribution with the MD force field V_FF(R), i.e., by the ensemble obtained before incorporating the data D. A reasonable choice for π(θ|K) is less obvious; if no prior information on the nuisance parameters is available, a non-informative prior such as a flat or a Jeffreys prior may be used. It is advisable to test multiple priors in order to exclude that the conclusions depend on the choice of the priors. For instance, testing multiple choices of π(R|K) corresponds to running simulations with different force fields V_FF. Assuming Gaussian independent errors, the likelihood function is

L(D|R, θ, K) ∝ Π_{i=1}^{N_exp} exp{−[I_c(q_i; R) − Ĩ_exp(q_i; θ)]² / 2σ(q_i)²}.     (15)

Here, Ĩ_exp(q_i; θ) = f I_exp(q_i) + c is the experimental data adjusted by the scale f and the offset c, which appear as nuisance parameters (cf. Eq. 3). In GROMACS-SWAXS, the likelihood function is modified twofold relative to Eq. 15. First, as described above, the raw experimental intensities and errors (I_exp, σ) are replaced with the smoothed curve (Ī_exp, σ̄), thereby replacing the sum over N_exp values with a sum over N_Shan values. Second, the errors are augmented with the calculated and systematic errors, leading to the final likelihood function

L^f(D|R, θ, K) ∝ Π_{i=1}^{N_Shan} exp{−[I_c(q_i; R) − (f Ī_exp(q_i) + c)]² / 2σ_tot(q_i)²}.     (16)

Hence, L^f contains three nuisance parameters θ = {f, c, δρ_buf}. The posterior distribution P(R, θ|D, K) cannot be computed analytically but is instead obtained by importance sampling.
Here, Newtonian dynamics as implemented by MD simulations are used for sampling the conformations. By taking the negative logarithm of the posterior, we turn the probability functions into energy terms,

−k_B T ln P(R, θ|D, K) = V_FF(R) + E_exp^f(R, D, θ) − k_B T ln π(θ|K) + const,

where we used Eqs. 8, 13, 14, and 16. Several aspects of this result are notable:

• The original SAXS-derived energy (Eq. 8) required the choice of a force constant f_c that weights the experimental data relative to the force field V_FF. Using Bayesian inference, in contrast, no force constant is needed because the weight of the experimental data is fully determined by probability theory. In practice, it may be useful to first drive a conformational transition with a larger f_c and, in a follow-up simulation, sample the posterior with f_c = 1.
• Sampling the posterior P (R, θ|D, K) has been implemented using Gibbs sampling. Accordingly, nuisance parameters θ are sampled with Metropolis Monte-Carlo at fixed conformations R, followed by Newtonian dynamics of R at fixed θ, and so on.
• Sampling the fitting parameters f and c is not required because they can be marginalized out analytically at the level of the likelihood when assuming flat priors for f and c, i.e., π(f|K) = π(c|K) = const. Then, the likelihood in Eq. 16 is replaced with L̄(D|R, δρ_buf, K) = ∫∫ L(D|R, θ, K) df dc, thereby taking "all possible" values of f and c into account. Evaluating this integral shows that L̄ takes the same form as L, except that f and c are replaced with their maximum-likelihood estimates f_ml and c_ml (Shevchuk and Hub, 2017).

Bayesian treatment of systematic errors at small angles
GROMACS-SWAXS models systematic errors at small angles via an uncertainty of the buffer density δρ_buf (see above and Fig. 3). In Bayesian SAXS-driven MD, δρ_buf can be treated as one of the nuisance parameters θ (Shevchuk and Hub, 2017). Accordingly, the relative uncertainty δρ_buf is sampled simultaneously with the structure R to obtain a joint posterior distribution over structures and buffer-density uncertainties P(R, δρ_buf|D, K).

[Figure 4 caption: For LBP, the posterior peaks at δρ_buf = 0; hence, systematic errors are not required to explain the data, but they cannot be excluded either. This finding is expected since the SAXS-driven MD simulations were carried out against a calculated target curve without any uncertainties. In contrast, the posterior for HSP90 peaks at a larger δρ_buf = 4 e nm⁻³, suggesting that significant systematic errors are strictly required to explain the data. This finding reflects that the SAXS-driven MD was carried out against experimental data with substantial systematic errors at small angles. Data taken from Shevchuk and Hub (2017).]

GROMACS-SWAXS applies a Gaussian prior distribution for δρ_buf. Let r_buf := δρ_buf/ρ_buf denote the relative uncertainty; then the prior is π(r_buf|K) ∝ exp(−r_buf²/2ε_buf²), where ρ_buf is the solvent density and ε_buf is given with the mdp parameter waxs-solvdens-uncert, typically set between 0.1% and 1%. This algorithm automatically detects whether the experimental data is biased by systematic errors at small angles. Namely, if the MD force field permits a conformational transition that explains the experimental data I_exp(q), the posterior of δρ_buf peaks at small values, indicating that systematic errors are not required to explain the experimental data (Fig. 4, blue solid). In contrast, if the MD simulations do not find a conformation that explains I_exp at small angles, the posterior of δρ_buf peaks at larger values, indicating that significant systematic errors are plausible in the light of the data and the force field (Fig. 4, orange dashed). Clearly, the treatment of systematic errors in SAXS-based modeling is still underdeveloped. To harvest increasingly finer details in SAXS data, it will be useful to explicitly model other sources of errors, such as a small fraction of aggregated samples. We believe that Bayesian inference will provide a rigorous framework to learn systematic errors simultaneously with the biomolecular structures.
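The Monte Carlo step for the nuisance parameter can be sketched with a toy Metropolis sampler that draws r_buf under its Gaussian prior alone (illustrative only; the actual Gibbs step also includes the likelihood and the other nuisance parameters):

```python
import math
import random

def log_prior_rbuf(r_buf, eps_buf=0.005):
    """Gaussian prior pi(r_buf) ~ exp(-r_buf^2 / (2*eps_buf^2)); eps_buf plays
    the role of waxs-solvdens-uncert (e.g. 0.005 = 0.5%)."""
    return -r_buf**2 / (2.0 * eps_buf**2)

def metropolis_rbuf(n_steps, eps_buf=0.005, step=0.005, seed=1):
    """Toy Metropolis chain for r_buf under the prior alone (a sketch of the
    nuisance-parameter sampling, not the GROMACS-SWAXS implementation)."""
    rng = random.Random(seed)
    r, samples = 0.0, []
    for _ in range(n_steps):
        r_new = r + rng.uniform(-step, step)
        log_alpha = log_prior_rbuf(r_new, eps_buf) - log_prior_rbuf(r, eps_buf)
        if math.log(rng.random() + 1e-300) < log_alpha:   # accept/reject
            r = r_new
        samples.append(r)
    return samples

samples = metropolis_rbuf(50000)
```

With the data term included, the sampled distribution of r_buf would shift away from the prior exactly when systematic errors are needed to explain the data, as described above.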

Protocol B
The Bayesian interpretation of SAXS-driven MD simulations is enabled by using a force constant of unity and, optionally, by sampling δρ_buf. The mdp file for a Bayesian SAXS-driven MD simulation should contain, apart from the options described above:

waxs-fc = 1
waxs-solvdens-uncert = 0.005
waxs-solvdens-uncert-bayesian = yes

The simulation frames in the trajectory of the SAXS-driven MD may be interpreted as samples from the high-dimensional posterior distribution over conformations P(R|D, K). Obtaining posteriors over intuitive properties, such as the center-of-mass distance d_com between two domains, is straightforward: d_com values may be extracted from the trajectory frames and plotted as a histogram. The histogram is the posterior P_com(d_com|D, K) over the center-of-mass distances, suggesting that the peak position and the width of P_com provide the most plausible d_com and its uncertainty in the light of the data and the force field. Mathematically, obtaining P_com(d_com|D, K) from P(R|D, K) would involve a marginalization, that is, the integration over all other degrees of freedom except d_com; however, since the trajectory of the Bayesian SAXS-driven MD contains samples of d_com, there is no need to carry out the marginalization in practice.
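The histogram analysis above can be sketched as follows, using synthetic d_com values in place of values extracted from a trajectory (all numbers illustrative):

```python
import numpy as np

# Synthetic stand-in for d_com values extracted from a SAXS-driven trajectory
rng = np.random.default_rng(0)
d_com = rng.normal(loc=4.2, scale=0.15, size=5000)     # nm, illustrative

# The histogram approximates the marginal posterior P_com(d_com | D, K)
hist, edges = np.histogram(d_com, bins=40, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
d_best = centers[np.argmax(hist)]      # peak position: most plausible distance
d_err = d_com.std()                    # width: uncertainty on d_com
```

In practice, one should verify that the trajectory is long enough that the histogram is converged, e.g. by comparing histograms from the first and second halves of the simulation.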
Samples from the posterior over the relative uncertainty of the buffer density, δρ_buf/ρ_buf, are written into a separate output file, waxs_solvDensUncert.xvg, allowing the calculation of a histogram and, hence, of the posterior over the uncertainty of the buffer density.
4 Maximum-entropy ensemble refinement against SAXS data

Theoretical background
Since SAXS is a solution method, experimental SAXS intensities represent the average over a structural ensemble. For structurally stable proteins, the ensemble may be approximated by a single, most prominent conformation, enabling the use of the structure refinement methods described above. In contrast, for solutes that adopt a heterogeneous ensemble, the ensemble is adequately represented only by a large number of conformations or by conformational distributions. Typical examples are intrinsically disordered proteins (IDPs) and proteins with disordered regions, domains connected by flexible linkers, or dynamic soft-matter complexes. Upon refining such heterogeneous ensembles against experimental SAXS data I exp (q), the SAXS curves computed from the individual conformations may differ from I exp (q), and only the ensemble-averaged computed curve should agree with the data. However, ensemble refinement against SAXS data is an ill-posed problem because many different ensembles would agree with I exp (q), even if the conformational space is restrained with an all-atom force field. Two strategies have been put forward for choosing a justified refined ensemble from all the ensembles that satisfy the data (Ravera et al., 2016). Following the strategy of maximum parsimony, the aim is to explain the data with as few conformations as possible. Such an approach is most justified if the biomolecule adopts a few well-defined conformational states. The second approach is founded in statistical physics and is based on Jaynes' maximum-entropy principle (Jaynes, 1957; Boomsma et al., 2014; Cesari et al., 2018; Hermann and Hub, 2019). According to Jaynes, we should choose the ensemble distribution with the greatest uncertainty (or with the least information) that satisfies a given set of constraints. In the context of structure refinement, the constraints are the requested agreement with the experimental data.
The principle is satisfied by finding an ensemble distribution that maximizes the Shannon entropy under the given constraints.

Figure 5: Illustration of parallel-replica ensemble refinement. N replicas are simulated simultaneously, each providing a calculated curve I 1 , . . . , I N . Coupling the replica-averaged SAXS curve to the experiment with a harmonic restraint leads to the maximum entropy ensemble.
In ensemble refinement, however, we are typically interested in refining a prior ensemble from a free MD simulation against experimental data, suggesting that it is useful to maximize the relative entropy between the unbiased and the refined ensemble (Caticha, 2004). Because the relative entropy is the negative of the Kullback-Leibler divergence D KL (p 1 |p 0 ) (Kullback and Leibler, 1951), maximizing the relative entropy implies that we should find a refined ensemble distribution p 1 (R i ) that minimizes D KL (p 1 |p 0 ) under the given constraints, where p 0 (R i ) is the prior ensemble distribution. Taking the prior from an unbiased MD ensemble, the aim is to find an updated ensemble that is (i) in agreement with the data and (ii) updated as minimally as possible with respect to the unbiased prior ensemble. In other words, the ensemble is only updated as strictly needed to explain the data, while any bias that is not supported by the data is avoided. Specifically, the formulation assures that the ensemble is not updated if the prior ensemble is already in agreement with the data. The minimization problem can be solved with the help of Lagrangian multipliers, where one multiplier is required for each experimental constraint (Pitera and Chodera, 2012). However, the Lagrangian multipliers must be optimized in an iterative manner, which may be tedious in the presence of a larger number of experimental constraints (Boomsma et al., 2014). An alternative for implementing a minimal bias is the parallel-replica approach. Here, several copies of the system (replicas) are simulated in parallel, and only the calculated data averaged over the replicas is restrained to the data with a harmonic restraint (Fig. 5). Roux and Weare (2013) and Cavalli et al. (2013) showed that the replica-averaged harmonic restraint imposes a minimal bias in the limit of a large number of replicas.
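For reference, the minimized quantity and the well-known form of its solution may be written out explicitly; the symbols s j for the constrained observables are our notation, not taken from this chapter:

```latex
D_{\mathrm{KL}}(p_1 \,\|\, p_0) \;=\; \sum_i p_1(R_i)\,\ln\frac{p_1(R_i)}{p_0(R_i)}
\;\longrightarrow\; \min
\qquad\Rightarrow\qquad
p_1(R_i) \;\propto\; p_0(R_i)\,\exp\!\Big(-\sum_j \lambda_j\, s_j(R_i)\Big),
```

where s j (R) are the observables constrained to the experimental values and λ j are the Lagrangian multipliers, one per experimental constraint.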

Parallel-replica ensemble refinement against SAXS data
Parallel-replica refinement against SAXS data is illustrated for an IDP in Fig. 5. First, the SAXS intensity is averaged over the replica intensities I c (q i , R α ), where α is the replica index and N repl is the number of replicas. Then, the systems are coupled to the data with a harmonic restraint similar to Eq. 8, except that the SAXS curve from a single simulation is substituted by the replica-averaged curve (Eq. 21). Here, Ī exp denotes the smoothed experimental curve adjusted by the fitting parameters f and c, as used above. The biasing energy is multiplied by N repl , such that the factor cancels with the 1/N repl in Eq. 21 when taking the derivatives with respect to the atomic coordinates, as done for computing the SAXS-derived forces (Hummer and Köfinger, 2015).
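The replica-averaged restraint can be sketched as follows. The function name and toy curves are illustrative, and the harmonic form is assumed to follow Eq. 8 with the single-simulation curve replaced by the replica average:

```python
import numpy as np

def replica_bias_energy(I_reps, I_exp, sigma, k=1.0):
    """Harmonic bias on the replica-averaged SAXS curve (sketch).

    I_reps : (N_repl, n_q) array of calculated intensities I_c(q_i, R_alpha)
    I_exp  : (n_q,) smoothed experimental curve adjusted by f and c
    sigma  : (n_q,) combined uncertainties
    k      : force constant; the N_repl prefactor cancels the 1/N_repl
             of the replica average when taking derivatives (forces).
    """
    n_repl = I_reps.shape[0]
    I_avg = I_reps.mean(axis=0)  # replica-averaged curve
    return 0.5 * k * n_repl * float(np.sum(((I_avg - I_exp) / sigma) ** 2))

# Toy curves: three replicas whose *average* matches the target exactly,
# although each individual replica deviates from it.
q = np.linspace(0.0, 3.0, 50)
target = np.exp(-q)
reps = np.stack([0.9 * target, 1.0 * target, 1.1 * target])
e_ens = replica_bias_energy(reps, target, sigma=np.full_like(q, 0.01))
e_one = replica_bias_energy(reps[:1], target, sigma=np.full_like(q, 0.01))
# e_ens is numerically zero while e_one is large: only the ensemble
# average is restrained to the data, not the individual replicas.
```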

Choosing the number of replicas
The number of replicas required to follow the maximum entropy principle depends on the system (Boomsma et al., 2014; Hermann and Hub, 2019). A possible strategy for finding a good value for N repl is to investigate distributions h(ξ) of a few important degrees of freedom ξ, such as the distribution of the radius of gyration of an IDP or of the moments of inertia of a soft-matter complex. Accordingly, the Kullback-Leibler divergence D KL (h 1 |h 0 ) between the biased distribution h 1 (ξ) and the unbiased distribution h 0 (ξ) may be plotted versus the number of replicas. A sufficient value of N repl would be indicated by a plateau region of such a plot. A disadvantage of D KL is its numerical instability; namely, since p 0 appears in the denominator of Eq. 20, D KL is unstable if some conformations of the biased ensemble p 1 were hardly sampled in the unbiased distribution p 0 . A numerically more stable alternative is given by the Jensen-Shannon divergence D JS (h 1 |h 0 ) = [D KL (h 1 |h M ) + D KL (h 0 |h M )]/2, where h M = (h 0 + h 1 )/2 is the average of h 0 and h 1 ; the Jensen-Shannon divergence may be considered as a smoothed and symmetrized version of D KL (Hermann and Hub, 2019). Another useful measure is given by the entropy of the biased distributions h 1 . If the number of replicas is too low, the ensemble is still overly biased, which may lead to an overly narrow ensemble and overly narrow distributions h 1 (ξ). Hence, it is useful to plot the entropy (or even simply the width) of h 1 (ξ) versus the number of replicas. For the refinement of an IDP ensemble we found previously that using only 4-8 replicas was sufficient. This finding is likely explained by the fact that, when using explicit-solvent SAXS predictions, I c (q i , R α ) represents an on-the-fly average with a memory time of 50-200 ps. Hence, even the SAXS curves from individual replicas represent some conformational heterogeneity, explaining why only a few replicas are sufficient to represent the heterogeneous overall shape of the IDP ensemble that is encoded by the data.
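These diagnostics are straightforward to compute from histograms; the following sketch uses toy Gaussian histograms in place of distributions measured from actual trajectories:

```python
import numpy as np

def kl_divergence(h1, h0, eps=1e-12):
    """D_KL(h1 || h0) between two histograms; unstable where h0 is ~0."""
    p1 = h1 / h1.sum()
    p0 = h0 / h0.sum()
    mask = p1 > 0
    return float(np.sum(p1[mask] * np.log(p1[mask] / (p0[mask] + eps))))

def js_divergence(h1, h0):
    """Jensen-Shannon divergence: smoothed, symmetrized version of D_KL."""
    p1 = h1 / h1.sum()
    p0 = h0 / h0.sum()
    p_m = 0.5 * (p0 + p1)  # the mixture distribution h_M
    return 0.5 * kl_divergence(p1, p_m) + 0.5 * kl_divergence(p0, p_m)

def entropy(h):
    """Shannon entropy of a histogram; narrow distributions score low."""
    p = h / h.sum()
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(p[mask])))

# Toy histograms of an observable xi (e.g. radius of gyration): h0 from
# the unbiased run, h1 from biased runs with few or many replicas.
x = np.linspace(0.0, 10.0, 200)
h0 = np.exp(-0.5 * ((x - 5.0) / 1.0) ** 2)
h1_narrow = np.exp(-0.5 * ((x - 5.0) / 0.4) ** 2)  # overly restrained
h1_wide = np.exp(-0.5 * ((x - 5.0) / 0.9) ** 2)    # closer to unbiased

js_narrow = js_divergence(h1_narrow, h0)
js_wide = js_divergence(h1_wide, h0)
```

Plotting js_divergence and entropy versus N repl then reveals the plateau that indicates a sufficient number of replicas.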

Protocol C
Multi-replica SAXS-driven MD simulations are set up similarly to the single-replica refinement described above.

1. Compile GROMACS-SWAXS with MPI support using cmake -DGMX_MPI=ON ...

2. When using four replicas, for instance, use the following mdp options:

waxs-ensemble-type = maxent-ensemble
waxs-ensemble-nstates = 4
waxs-scale-i0 = no ; yes with small contrast, e.g. with IDP
gen-vel = yes

With the gen-vel = yes option, the replicas start with different initial velocities, providing independent simulations. For solutes with a small contrast, such as an IDP, the forward scattering intensity I c (q = 0) may not converge within the memory time τ. To greatly accelerate the convergence, I c (q = 0) may be fixed to the forward intensity of the target curve by adding a small constant density to the solvent, turned on with the mdp option waxs-scale-i0 = yes. Prepare one tpr file for each replica, and place the tpr files into different sub-directories 000, 001, etc.:

gmx grompp -f maxent.mdp -o 000/topol.tpr
gmx grompp -f maxent.mdp -o 001/topol.tpr
etc.
3. Run the multi-replica simulation with the -multidir functionality of mdrun:

sw=/path/to/pure-solvent.tpr
fw=/path/to/pure-solvent.xtc
target=$(realpath Itarget_trans.xvg)
export GMX_ENVELOPE_FILE=$(realpath envelope.dat)
export GMX_WAXS_FIT_REFFILE=$(realpath envelope-ref.gro)
mpiexec -np 4 gmx_mpi mdrun \
    -s topol.tpr -sw $sw -fw $fw -is $target -multidir 000 001 002 003

4. Carry out the same analysis as described above for regular SAXS-driven MD. Now, the file waxs_final.xvg contains the final on-the-fly average of the replica-averaged SAXS curve. Carry out a rerun with the trajectories of the four replicas to compute a uniformly averaged SAXS curve for each replica, and subsequently average the calculated curves. Inspect whether the replica-averaged curve agrees with the experimental target curve.
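The averaging in step 4 can be sketched in Python; parsing of the xvg files is omitted, and the toy curves below merely stand in for the per-replica rerun output:

```python
import numpy as np

def average_replica_curves(curves):
    """Uniformly average per-replica SAXS curves from the reruns.

    curves : list of (n_q, 2) arrays with columns (q, I), as one would
             read them from the per-replica output files.
    """
    q = curves[0][:, 0]
    I_avg = np.mean([c[:, 1] for c in curves], axis=0)
    return q, I_avg

def reduced_chi2(I_calc, I_exp, sigma):
    """Reduced chi-square between the replica-averaged and target curve."""
    return float(np.mean(((I_calc - I_exp) / sigma) ** 2))

# Toy stand-ins for four replica reruns: the target curve plus noise.
q = np.linspace(0.0, 3.0, 40)
target = np.exp(-q)
rng = np.random.default_rng(2)
curves = [np.column_stack([q, target + rng.normal(0.0, 0.01, q.size)])
          for _ in range(4)]

q_avg, I_avg = average_replica_curves(curves)
chi2 = reduced_chi2(I_avg, target, sigma=np.full_like(q, 0.01))
# averaging the four noisy curves pushes the reduced chi-square below 1
```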
5. Compute distributions h 1 (ξ) of interesting observables ξ, combined from all replica trajectories, such as the distribution of the radius of gyration or the secondary structure content of an IDP. Such distributions quantify the heterogeneity of the ensemble.
6. With a new simulation system, redo the simulation with increasing numbers of replicas N repl . Re-compute the distributions over observables using the same aggregated simulation time (e.g., 1×400 ns, 2×200 ns, 4×100 ns, etc.). Inspect the convergence of the entropy (or the width) of the distributions h 1 (ξ) as a function of N repl .
4.5 Example: ensemble refinement of a detergent micelle

The distributions from single-copy refinement are overly narrow, indicative of an overly restrained ensemble in violation of the maximum entropy principle (Fig. 6C, black). Using a larger number of replicas, the distributions are wider, reflecting a larger degree of heterogeneity. Critically, all simulations reveal quantitative agreement with the data, even when using only a single replica (Fig. 6B). This demonstrates that agreement with the data by no means guarantees that the ensemble is correct.
5 Discussion: conceptual considerations and recommendations

SAXS-driven MD simulations (should) feel only a weak bias by the SAXS data
Structural data is typically insufficient for defining all degrees of freedom of a biomolecule. This is true not only for SAXS data, but also for data from X-ray crystallography, NMR spectroscopy, or cryo electron microscopy. To avoid overfitting during structure refinement, structural data is complemented with additional physico-chemical information that restrains the biomolecule into realistic conformations. The required amount of additional information critically depends on the information content of the data; the lower the information content of the data, the more predictive additional information is needed. For instance, excluding atomic overlaps and restraining chemical bond geometries is often sufficient for the refinement of atomic models against crystallographic data but would be by far insufficient for refinement against SAXS data. During SAXS-driven MD, the structure is largely imposed by the all-atom force field, which not only restrains chemical geometries but also maintains a proper hydrogen bond network and favorable electrostatic interactions. Overfitting is avoided by using only a small force constant f c for the SAXS-derived restraints, on the order of unity, such that biomolecular dynamics are largely controlled by the force field V FF (R) whereas the energy E exp only mildly pushes the biomolecule into agreement with the data (see Eqs. 8 and 17). Indeed, E exp takes values of a few up to tens of kilojoules per mole, whereas the potential energy contributions from Lennard-Jones or Coulomb interactions are typically in the range of hundreds of thousands to millions of kilojoules per mole.

Accelerating transitions with SAXS data and sampling limitations
SAXS-driven MD simulations have been used to accelerate large-scale conformational transitions of biomolecules, which would require prohibitively long simulation times in unbiased simulations. For instance, we showed that SAXS data may be used to drive large-scale opening transitions of the large proteins Hsp90 or ATCase in MD simulations (Shevchuk and Hub, 2017; Chen and Hub, 2015). However, SAXS-driven MD works as an enhanced sampling technique only if the SAXS-derived forces point into the direction of the sought-after conformational transition. This is typically true in the case of large-scale domain motions, in particular if these motions modulate the radius of gyration. In contrast, achieving complex, nonlinear rearrangements in the simulation, such as the folding of an unstructured tail into an α-helix, is far more challenging. In such cases, the correct final state might be detectable via a low χ 2 in Eq. 5, but it is unlikely that SAXS-derived forces would accelerate the folding transition. Hence, for guiding complex transitions with SAXS data, SAXS-driven MD may be combined with established enhanced sampling techniques such as simulated tempering or Hamiltonian replica exchange.

Summary
SAXS is an increasingly valuable tool of integrative structural biology thanks to the accuracy of data collected at modern SEC-SAXS beamlines and thanks to its structural information content, which is not accessible by other techniques. However, the interpretation of the data is challenged by the low information content of the experimental signals, leading to the risk of overinterpreting the data. This risk is mitigated by using MD simulations that add physico-chemical information to the data.
In this chapter, we presented three approaches for refining structural ensembles against SAXS data by integrating the experimental data on-the-fly into all-atom MD simulations: (i) During SAXS-driven MD, an MD simulation is coupled to experimental SAXS data using harmonic restraints on the data, thereby refining an ensemble that may be approximated by a single, most prominent conformation. The SAXS-driven simulations promote conformational transitions compatible with the data. They are capable of overcoming force field imperfections and sampling limitations of unbiased simulations, provided that the transitions are geometrically simple.
(ii) Using SAXS-driven MD as a tool for Bayesian inference of biomolecular structures, a posterior distribution is sampled by the MD simulation, quantifying the plausibility of a biomolecular structure in the light of the experimental data and prior physico-chemical information, as encoded in the MD force fields. This Bayesian framework may estimate systematic errors at small angles simultaneously with the structure, which enables one to assess whether systematic errors are required to explain the experimental data.
(iii) The parallel-replica approach allows one to refine heterogeneous ensembles with commitment to the maximum-entropy principle. Here, the SAXS curve averaged over several replicas is coupled to the experimental data using harmonic restraints, such that the updated ensemble is compatible with the data but biased as minimally as possible with respect to the unbiased ensemble.
All approaches require a forward model for calculating the SAXS intensities from the atomistic structures. We discussed structure and ensemble refinement based on explicit-solvent SAXS curve calculations, as used in the WAXSiS method, thereby accurately representing the hydration layer and the excluded solvent (Chen and Hub, 2014; Knight and Hub, 2015). An implementation of the methods described here, employing the explicit-solvent SAXS curve calculations, is freely available in GROMACS-SWAXS (https://gitlab.com/cbjh/gromacs-swaxs).