Accurate and efficient constrained molecular dynamics of polymers through Newton’s method and special purpose code

In molecular dynamics simulations we can often increase the time step by imposing constraints on internal degrees of freedom, such as bond lengths and bond angles. This allows us to extend the length of the time interval and therefore the range of physical phenomena that we can afford to simulate. In this article we analyse the impact of the accuracy of the constraint solver. We present ILVES-PC, an algorithm for imposing constraints on proteins accurately and efficiently. ILVES-PC solves the same system of differential algebraic equations as the celebrated SHAKE algorithm, but uses Newton’s method for solving the nonlinear constraint equations. It solves the necessary linear systems of equations using a specialised linear solver that utilises the molecular structure. ILVES-PC can rapidly solve the nonlinear constraint equations to nearly the limit of machine precision. This eliminates the spurious forces introduced to simulations through the very common use of inaccurate approximations. The run-time of ILVES-PC is proportional to the number of constraints. We have integrated ILVES-PC into GROMACS and simulated proteins of different sizes. Compared with SHAKE, we have achieved speedups of up to 4.9× in single-threaded executions and up to 76× in shared-memory multi-threaded executions. Moreover, we find that ILVES-PC is more accurate than the P-LINCS algorithm. Our work is a proof-of-concept of the utility of software designed specifically for the simulation of polymers. Author summary Molecular dynamics simulates the time evolution of molecular systems. It has become a tool of extraordinary importance for e.g. understanding biological processes and designing drugs and catalysts. This article presents an algorithm for computing the forces needed to impose constraints in molecular dynamics, i.e., the constraint forces; moreover, it analyses the effect of the accuracy of the constraint solver. Presently, it is customary to calculate the constraint forces with a relative error that that is not tiny. This is due to the high computational cost associated with the available software. Accurate calculations are possible, but they are very time-consuming. The algorithm that we present solves this problem: it computes the constraint forces accurately and efficiently. Our work will improve the accuracy and reliability of molecular dynamics simulations beyond the present state-of-the-art. The results that we present are also a proof-of-concept that special-purpose code can increase the performance of software for the simulation of polymers. The algorithm is implemented into a popular molecular simulation package, and is now available for the research community.


Introduction
1 Molecular simulation is a powerful research tool for scientific and technological purposes. 2 It is applied to a wide range of problems in chemistry and biology, such as the development 3 of novel materials [1] or biomedicines, e.g., for fighting cancer [2] and infectious diseases, 4 like the SARS-CoV-2 [3,4]. One of the most widely used techniques for molecular 5 simulations is molecular dynamics (MD) [5,6], which calculates the time evolution of 6 molecular systems subject to the Newton's equations, thus enabling the calculation of a 7 variety of quantities whose measurement in laboratories is frequently either difficult or 8 unfeasible. The impact of molecular simulation is expected to increase greatly due to 9 the continuous improvement of available computational capabilities [7] and calculation 10 methods [8]. Among the former, we highlight the successive generations of the Anton 11 supercomputers [9]; among the latter, the solution of the protein folding problem by 12 AlphaFold [10]. The availability of 3D structures of proteins provided by AlphaFold 13 will probably boost their simulations, e.g., for analysing their capabilities as catalysts 14 or medicines or for a more accurate interpretation of the effect of mutations on the 15 phenotype [11]. The availability of accurate and efficient methods for such simulations 16 does hereby acquire a novel boost. 17 2 Background and Motivation 18

19
All vectors are written using bold lowercase letters. All vectors are column vectors by 20 default. When we need a row vector, then we shall explicitly transpose a column vector. 21 The Euclidean norm of a vector x = (x 1 , x 2 , . . . , x n ) T ∈ R n is written as ∥x∥ and is the 22 nonnegative number ∥x∥ given by ∥x∥ 2 = n j=1 x 2 j . All matrices are written using bold 23 uppercase letters. If the function f = (f 1 , f 2 , . . . , f n ) T : R n → R n is differentiable, then 24 the Jacobian F (x) of f at the point x ∈ R n is the matrix A = [a ij ] ∈ R n×n given by 25 We shall now define the notation used to describe a system of m atoms. Let m i > 0 26 denote mass of the ith atom and let m i ∈ R 3 and the diagonal mass matrix M ∈ R 3m×3m 27 be given by 2.2 Fundamentals of constrained molecular dynamics 32 By Newton's second law, the equations of motion of a system of m atoms are where the prime indicates differentiation with respect to the time t. In nature, the 33 motion of atoms is continuous in time; however a computer simulation customarily 34 uses a sequence of discrete time steps. The standard algorithm for this problem is 35 the velocity Verlet-algorithm [12]. It is well-known that certain motions such as bond 36 stretching, bond bending, and torsional vibrations are all periodic with characteristic 37 frequencies that depend on the atoms involved [13]. It is generally accepted that in 38 order to accurately resolve periodic motion one needs at least five time steps per period. 39 Hence the fastest vibration imposes a limitation on the maximum time step that can 40 be used and this limits the length of the time interval one can afford to simulate. In 41 order to simulate phenomena with a longer duration it is customary to constrain the 42 fastest degrees of freedom. Let n denote the number of constraints. Mathematically, the 43 problem consists of solving the following system of differential algebraic equations 44 q ′ (t) = v(t) , g(q(t)) = 0 .

Constrained MD solvers
Their main objective has been the reduction of the time-to-solution of the constraint equations. The most popular algorithms are SHAKE [24] and (P-)LINCS [25]. The SHAKE algorithm solves the system of differential algebraic equations (6) using a pair of staggered uniform grids with fixed time step h > 0. SHAKE's equations take the form: Here q k ≈ q(t k ) and v k+ 1 2 ≈ v(t k+ 1 2 ), where t k = kh and t k+ 1 2 = (k + 1/2)h. Equation 49 (10) is a nonlinear equation for the unknown Lagrange multiplier λ k , namely where ϕ k is the function given by It is known that SHAKE is second order accurate in the time step [12]. The original 52 SHAKE algorithm solved the constraint equations using the nonlinear Gauss-Seidel 53 method, which converges locally and linearly subject to certain mild conditions (see [26] 54 and the references therein). The LINCS and P-LINCS algorithms use a truncated 55 Neumann-series to approximate the solution of the relevant linear systems. The Neumann-56 series converges linearly at best and there are physically relevant cases for which it does 57 not converge at all [25,27,28]. Therefore, solving the constraints to the limit of machine 58 precision is time-consuming. Indeed, as we shall demonstrate, the time spent by SHAKE 59 can easily exceed 50% of the total execution time in realistic simulations; other research 60 works also indicate severe performance drops (23%) when LINCS is accurately solved [78]. 61 Thus, it would be useful to have software that solves the constraint equations accurately 62 and rapidly. To this end, we have developed and implemented an algorithm called 63 ILVES [29,30] that avoids coarse-grain approximations and calculates the constraint 64 forces accurately. We expect that avoiding coarse approximations will also produce 65 a solver that is capable of finding solutions when the atomic displacements are more 66 abrupt, e.g., during simulations run at high temperatures or with large time steps (like 67 those used in Brownian dynamics [31] calculations), or when constraints on bond angles 68 are also imposed. For instance, in Ref. [28] it was shown that 7200 distinct simulations 69 all produced matrices for which the expansion used by LINCS does not converge and 70 SHAKE had to be used instead. However, the SHAKE algorithm is commonly described 71 as inherently serial [32], which makes it inefficient for the -presently ubiquitous-parallel 72 computations. Since the parallel algorithm ILVES-PC solves the exact same equations as 73 the sequential algorithm SHAKE, we expect that ILVES-PC will be able to solve more 74 problems than LINCS. To sum up, ILVES combines the features of efficiency, parallelism, 75 accuracy, and reliability. 76 We emphasize that several authors have already applied Newton's method for solving 77 nonlinear equations in the context of constrained molecular dynamics. M-SHAKE [20] 78 treats the linear systems as dense and solves them using Gaussian elimination. This 79 approach is limited to small molecules because the time complexity for computing an 80 LU factorization of a dense matrix of dimension n is O(n 3 ). MILC-SHAKE [17] utilizes 81 the linear structure of alkanes to achieve a time complexity of O(n) by computing an 82 LU factorization of a tridiagonal matrix rather than a fully dense matrix. The authors 83 of the papers [21,33] all approximate the relevant matrices using a matrices that are 84 symmetric positive semi-definite and apply the conjugate gradient (CG) algorithm to 85 solve these systems. The main advantages of this approach are twofold: the simplicity 86 of the parallelisation of the CG algorithm, and the solution can be accelerated using 87 a preconditioner. The disadvantage of this approach is the difficulty of finding a 88 preconditioner whose quality can be guaranteed mathematically. 89 We shall now describe how Newton's method can be applied in the context of 90 molecular dynamics with constraints. We begin by stating the method in the case of a 91 general nonlinear equation. Let f : R n → R n be a differential function and consider the 92 problem of solving with respect to x ∈ R n . If the Jacobian F of f is nonsingular, then Newton's method is 94 defined and takes the form where the initial value x 0 must be chosen by the user. In general, we expect that 96 Newton's method will converge locally to a zero of f and that the convergence will be 97 quadratic. In practice, we should never explicitly invert the matrix F (x l ), instead we 98 should compute the correction F (x l ) −1 f (x l ) by solving the linear system September 27, 2022 4/24 with respect to z l . We emphasize this point by restating Newton's method as the iteration We now return to the nonlinear constraint equation (10). To this end, we introduce the 100 matrix function A : R 3m × R 3m → R n×n given by Then Newton's method for the Lagrange multiplier λ k is given by where the initial value λ k,0 must be chosen by the user. The simple choice of λ k,0 = 0 102 is the de-facto standard choice. 103 We mention in passing that the matrix A(x, y) is structurally symmetric and that 104 the matrix A(ϕ k (λ k,l ), q k ) is close to the symmetric matrix A(q k , q k ), simply because 105 This is the observation that was utilized by the authors of the papers [21,33]. 106 2.4 Bond constraints 107 We now limit the discussion to general bond-length constraints. Note that constraints 108 on bond angles are commonly enforced by constraining distances between two atoms. 109 ILVES' design is expected to provide accurate solutions when imposing any kind of 110 constraints, either on bond lengths, bond angles or dihedral angles, due to the fact that 111 the coordinate matrix A is banded (regardless of the kind of the constrained degrees 112 of freedom) in biological molecules. Constraining degrees of freedom other than bond 113 lengths using flexible constraints [34] may be an appropriate method to increase the 114 time step of the simulation.

115
Our objective is to present a formula for the entries of the matrix A(x, y). Let the 116 ith bond have length σ i > 0 and let a i and b i denote the indices of the two bonded 117 atoms. Then the ith constraint can be written as A direct calculation establishes that Here δ i is Kronecker's delta, i.e, In particular, we observe that 1. The two bonds have no atoms in common. 124 2. The two bonds share a single atom. 125 3. The two bonds are identical, i.e., i = j.

126
Let bond i link atoms a i and b i and let bond j link atoms a j and b j . The (i, j)th entry 127 of the matrix A(x, y) is given by the weighted inner-product where the weight w ij is given by When the matrix A(x, y) represents bond constraints for a real molecule, then it is 130 necessarily quite sparse. Consider the row corresponding to the bond between a pair of 131 atoms with valence r 1 and r 2 . For this row of the matrix A(x, y), there can be at most 132 r 1 + r 2 − 1 nonzero entries, regardless of the overall size of the molecule.

145
Imposing constraints introduces a source of drift in the energy of the analysed 146 system [40,41], with the size of the drift increasing with the errors in the satisfaction of 147 constraints. One of the principal reasons for performing accurate constraint calculations 148 is to reduce this drift. This is especially important in simulations in the NVE ensemble, 149 where the conservation of the energy is an explicit requisite. Accurate constraints may 150 be also imposed to accurately calculate sought quantities [55,56] or to avoid undesired 151 effects -like spurious phase transitions [39]-; authors have also reported that improving 152 simulation accuracy eliminated the flying ice cube effect [52].

153
When conducting simulations using a thermostat, say, in NVT or NPT ensembles, it 154 is not customary to solve the constraint equations accurately, but this is not necessarily 155 the correct approach. Certainly, the energy is not conserved for canonical or NPT 156 ensembles, but there exists a conserved quantity (sometimes called conserved energy) 157 which is analogous to the conserved energy of the microcanonical (NVE) ensemble. The 158 derivation of the equations of the thermostat (e.g. Nosé-Hoover [57,58], V-rescale [59]) 159 implies that the conserved quantity is indeed conserved; otherwise, the thermostat 160 equations that are assumed to hold do indeed not hold. There is no reason to believe 161 that the conserved quantity will be conserved if the constraint equations are not solved 162 to the limit of machine precision.
A small drift of the energy does not necessarily indicate that the MD simulation 164 is accurate and reliable. Since different sources of inaccuracy can generate drifts with 165 different signs, large errors in the dynamics can be masked by small drifts resulting 166 from contributions that almost cancel each other. Generally speaking, a large drift of 167 the energy is likely indicating that the dynamics is distorted and the simulation is not 168 reliable, but the converse is not true (a small drift does not necessarily imply that the 169 dynamics is not distorted) [60,61]. An accurate MD should try to reduce all the sources 170 of errors in the dynamics, hence solve constraints to the maximum affordable accuracy. 171 Otherwise, the calculated quantities will not be reliable. Moreover, small drifts may be 172 misleading, hinting that the simulation is accurate, when it is not.

173
It is important to appreciate the consequences of solving the constraint equations 174 inaccurately. Letλ k denote the computed value of the Lagrange multiplier λ k ; then the 175 computed value of v k+1/2 cannot be more accurate than in which case the exact value v k+1/2 satisfies From the equations above we conclude that errors in the calculation of the Lagrange 178 multipliers λ n are mathematical equivalent to the actions of an external force, i.e, the 179 term −G(q k ) T (λ k −λ k ). Due to its random nature, this force does not need to be a 180 conservative force. Hence, if the objective is to simulate an isolated system, it is critical 181 that we solve the constraint equations accurately.

182
The distortions of the bond lengths that arise from solving of the constraint equations 183 inaccurately have non-zero average, i.e., the noise is not white, and are highly correlated 184 with their previous values (see results and discussion below and in the S1 File). This 185 is equivalent to using bond lengths which differ from the ones specified by the force 186 fields (whose calibration is highly optimised to accurately describe chemical phenomena), 187 and to using different bond lengths at different times during the simulation. Due to 188 the accumulation of errors throughout many time steps and to the chaotic nature of 189 simulations, the introduced spurious force may distort the dynamics in unpredicted 190 manners, thus reducing the reliability of the calculated quantities. In contrast, if 191 constraints are satisfied allowing a maximum relative error of e.g. 10 −10 instead of 192 the customary 10 −4 , then the scale of the distortions drops by a factor of 10 6 , see 193 equation (30) and the surrounding paragraph. This significantly reduces the expected 194 impact of spurious forces.

195
The points stated above indicate that an accurate satisfaction of the constraints 196 is necessary for an appropriate simulation, where the ensemble is respected and the 197 disruptions of the system's dynamics are reduced. The only difference between ILVES and SHAKE is how we solve the nonlinear constraint 201 equations; ILVES solves the same system of differential algebraic equations as SHAKE. 202 ILVES uses Newton's method to solve the nonlinear equations; the involved linear 203 (linearised) systems (19) are solved using a direct solver (Gauss-Jordan elimination), 204 which has linear (O(n), being n the number of constraints) scaling due to the sparsity 205 of matrix A. Such sparsity arises from the fact that one atom cannot be covalently 206 bonded to many others. This guarantees the sparsity of A for biological molecules, and 207 hence makes the ILVES algorithm suitable for constraining all kinds of internal degrees 208 of freedom, including bond and dihedral angles. The A matrix of relevant bio-polymers 209 can be defined so that it is banded (or nearly banded, with a few nonzero entries outside 210 the band), which enables a special efficiency when solving the system [62]. ILVES-PC 211 relies on a code (compiled code [30]) which is specifically designed to be efficient for 212 known structures, such as the amino acid residues that make up peptides and proteins. 213

214
The implementation of the ILVES algorithm presented in this document, called ILVES-215 PC, is specifically designed to compute the constraint forces for proteins. Developing 216 code for given types of molecules is the extension to software of a concept which has 217 already proven to be very successful with hardware. The Anton supercomputers were 218 conceived to perform MD simulations of proteins and other biological molecules [9]. 219 Their specific design greatly enhances their performance for these systems compared with 220 general-purpose computers. The algorithm we present in this article also utilises specific 221 features of widely simulated systems in order to increase the performance compared with 222 general-purpose algorithms.

223
Peptide chains (peptides and proteins) are polymers of variable length formed by 224 repeating blocks of atoms. They are a subset of biological polymers (also including e.g. 225 nucleic acids and polysaccharides), which are just a subset of chemical polymers (which 226 could benefit from the approach presented in this article). Each of the building blocks 227 of a peptide chain (residues) has a given connectivity pattern, which is found essentially 228 unchanged in biological molecules (occasionally further atoms can be attached to the 229 protein, e.g. in glycoproteins, or the protein structure can get modified, e.g. at the 230 chromophore of the Green Fluorescent Protein; in addition, hydrogen atoms in carbon 231 rings of side chains can lie in alternative positions). However, 20 given -proteinogenic-232 amino acid connectivities largely dominate the structure of peptides and proteins.  The coefficient matrix of the linear system solved by ILVES-PC, i.e., the matrix A 240 of equation (19), has the same structure as the adjacency matrix of the bond-graph 241 of the peptide chain. Since the peptide chain is composed of less than 30 different 242 building blocks, the main features of the matrix A can be described using less than 30 243 different submatrices. These matrices correspond to the proteinogenic amino acids, some 244 of them having slightly different configurations due to different locations of hydrogen 245 atoms. In truth, we need a few more subroutines to account for, say, the beginning and 246 the end of the chain. We have written subroutines for solving the linear subsystems 247 corresponding to each of these submatrices. To solve the entire linear system, we iterate 248 over the subsystems of the linear system, calling the required subroutine. We ensure 249 that submatrices corresponding to identical amino acids have the same structure by 250 always numbering the bonds of each amino acid in the same order. This allows us 251 to generate loop-free subroutines that store the relevant data contiguously in memory 252 and do direct memory access instead of relying on the auxiliary data structures and 253 the indirect memory access patterns that are so typical of direct solvers for sparse 254 linear systems. These ideas are all further adaptations of the compiled code approach 255 utilised in [30]. By choosing the bond numbering of each amino acid we can reduce the 256 fill-in during the factorization of the matrix. Fill-in are entries that are exactly zero 257 in the original matrix, but become non-zero during the factorization process. We can 258 Steps to apply Gauss-Jordan elimination to a banded matrix in parallel using three threads via the Schur complement method. The entries of the matrix are represented with and x and the fill-ins with an f . The magenta, purple and yellow entries are private to threads, while blue entries are shared between threads.
work with peptide chains with any bond numbering. Before simulating a new protein 259 for the very first time, we first explore the topology to identify the individual amino 260 acid residues. Then, we renumber the bonds to match the numbering required by our 261 specific implementation. The structural information can be saved and recycled if further 262 simulations are required.

263
ILVES-PC exploits modern processors' computational resources by assigning a subset 264 of amino acids of a peptide chain to different hardware threads. We ensure that each 265 thread is assigned a similar number of matrix elements. To concurrently apply Gauss-266 Jordan elimination to the submatrices corresponding to different amino acids, we rely 267 on the Schur complement method [63]. 268 Consider the example presented in Fig 1. We want to make the elimination at the 269 banded coordinate matrix of (a) -which exemplifies Ausing three threads (magenta, 270 purple, and yellow). First (b), each of the threads works with its own private submatrix, 271 trying to fill the lower left-hand (subdiagonal) corner with zeros and the diagonal with 272 ones. The threads also update the shared (blue) rows using mutual exclusion (mutex) 273 locks. They repeat this step (c) in the upper-hand corner of their submatrix. These two 274 steps may produce fill-ins. Then, the master thread processes the submatrix comprised of 275 the shared (blue) rows (d), while the other two threads wait until this step is completed. 276 Finally (e), all threads work in parallel to clean the fill-ins produced in steps (b) and (c). 277 In the case of ILVES-PC, the thread private data corresponds to constraints within a 278 given amino acid, while the shared rows correspond to the constraint that connects two 279 amino acids (which corresponds to a peptide bond).

280
While a general-purpose implementation of ILVES would almost certainly rely mainly 281 on coarse-grain synchronization mechanisms such as thread barriers, the precise knowl-282 edge of the structure of the linear system corresponding to proteins allows us to use very 283 fine-grain synchronization mechanisms. We use lightweight mutex locks to protect the 284 shared rows of the linear system from data races, so that each mutex involves only a 285 pair of threads. Similarly, during the update phase at the end of each Newton step, the 286 positions and velocities of each atom are protected by individual mutex locks. 287 4 Materials and methods 288 We have integrated ILVES-PC into the popular GROMACS molecular simulation pack-289 age [36]. Our solver can be used as an alternative to SHAKE and P-LINCS when 290 solving the bond constraints of proteins (only consisting of amino acid residues) without 291 disulfide bonds [64]. In addition we have extended the code of P-LINCS to accept a 292 tolerance τ > 0 for the satisfaction constraints on all bonds (all-bonds), as in SHAKE 293 and ILVES-PC, so that the three solvers can be compared on an equal footing. All the 294 where the ith bond is between atoms a i and b i . It is straightforward to verify that is a good approximation when the ith constraint equation is almost satisfied, i.e., when 297 ∥q ai − q bi ∥ ≈ σ i is a good approximation. It follows that τ is a good approximation of 298 an upper bound for the largest relative error for the bond lengths.

299
As already mentioned, P-LINCS uses a truncated Neumann series to approximate 300 the solution; then P-LINCS applies an iterative correction phase. The accuracy of the 301 expansion is determined by its order (lincs order), and the accuracy of the correction 302 phase is determined by the number of iterations (lincs iter). Both the order of the 303 expansion and the number of iterations of the correction phase are set at GROMACS' 304 startup and do not change throughout a given simulation. With our modification, 305 P-LINCS keeps iterating in the correcting phase until all constraints are solved with 306 the specified tolerance (shake tol). We found that the additional calculations due to 307 this modification (i.e. the calculations to check the degree of constraint satisfaction) 308 typically increase P-LINCS' execution time by 4% to 9%.  Table 1. We have used a GROMACS OpenMP version for multi-thread 315 executions. Four proteins have been simulated in this study, namely, ubiquitin, COVID-19 main 318 protease, human SSU processome and barnase. A summary of their features can be 319 found in Table 2. In order to assess the performance of the protein-specific constraint 320 solver (ILVES-PC) presented in this article, simulations of the first three proteins were 321 performed. They cover a wide range of protein sizes: 76 to 2722 amino acid residues; 322 SSU processome is unusually large, and was mainly chosen for testing parallel scalability. 323 The atomic coordinates of these proteins were taken from the RCSB Protein Data Bank 324 (www.rcsb.org, see the PDB codes in Table 2).

325
Our modified version of GROMACS was used to run and analyse the simulations, 326 which were set up with the CHARMM36 force field including CGenFF version 4.1 327 (last update on March 28, 2019) [69]. The ionizable residues of selected proteins were 328 protonated in all the cases as default (pH 7) in GROMACS, and after adding explicitly 329 TIP3P water molecules [70], chloride or sodium counterions were added to neutralize the 330 systems. A truncated dodecahedral was chosen as the simulation box, and the minimum 331 distance between the protein and the box edge was set to 1 nm. Periodic boundary 332 conditions (PBC) were imposed. A minimization phase of the solvated systems was 333 performed (maximum of 20000 steps) with the steepest descent algorithm [71]. One 334 (for the evaluation of performance with ubiquitin, COVID-19 main protease and human 335 SSU processome) or three (for the evaluation of accuracy with barnase and ubiquitin) 336 simulation replicas were launched under each setting, enabling the averaging of results 337 in the latter case. Systems were gradually heated through a heating ladder that allowed 338 to increase the temperature tier by tier in a number of NVT steps (50 K every 50 339 ns; 1 fs time step) until reaching the target temperature (either 298 or 400 K [72]). 340 Three consecutive steps for system equilibration followed. The first one (NVT) ran 341 for 100 ps with restraints imposed on the heavy atoms (protein and water) and the 342 V-rescale thermostat [59] used to keep the targeted temperature (coupling strength 343 parameter τ T = 0.1 ps). The second equilibration step (NPT) ran for 100 ps (2 fs time 344 step) without any restraint, with the V-rescale thermostat (τ T = 0.1 ps) [59] and the 345 Berendsen barostat used to set the pressure to 1 atm (coupling strength parameter 346 τ p = 2.0 ps) [73]. The third equilibration step (NPT) ran for 200 ps (2 fs time step) 347 using the V-rescale thermostat (τ T = 0.1 ps) and the Parrinello-Rahman barostat [74] (1 348 atm; τ p = 2.0 ps). The configuration reached after these steps was the starting point 349 for different production runs carried out in either the NPT or NVE thermodynamic 350 ensembles.

351
For calculations of performance (NPT) and accuracy (NPT or NVE), the thermostat, 352 barostat and the rest of parameters were set up in the production phase as done in the 353 third equilibration step, except for the NVE simulations (used only for ubiquitin and 354 barnase), where neither thermostat nor barostat were used. Production runs launched 355 for calculation of performance consisted in 50k steps (2 fs time step for ubiquitin and 356 COVID-19 main protease) or 10k steps (2 fs time step for human SSU-processome). On 357 the other hand, production runs for calculation of the simulations accuracy (energy drift 358 analysis of barnase and ubiquitin) consisted of half a million (5 · 10 5 ) steps (2 fs time 359 step, for a total simulation time of 10 ns). A record of the conserved energy values (in 360 NPT simulations) or of the total energy (in NVE simulations) at every 1000 steps (0.2 361 ps) was obtained. These numbers were used to calculate the energy drift.

362
As general parameters of the simulations the Verlet cutoff-scheme algorithm [75,76] 363 was used for van der Waals interactions and the PME method [77] for electrostatic 364 interactions, both with a radius cutoff of 1.2 nm as recommended by the authors of 365 the CHARMM36 force field [69]. A radius cutoff of 1.0 nm was also set in simulations 366 used for performance calculations. Velocities correction due to the thermostat were done 367 every 10 timesteps (default values of GROMACS).

368
For the tests performed to assess simulation performance, tolerance values (τ ) of 10 −4 , 369 10 −6 , 10 −8 , 10 −10 and 10 −12 have been used to constraint all bonds. The constraint 370 algorithms tested include SHAKE, the modified version of P-LINCS and ILVES-PC. 371 In the case of P-LINCS also the lincs order parameter has been tested and values 372 of 4 and 8 are combined with the tolerance values listed above. In simulations carried 373 out to analyse the accuracy the SHAKE algorithm was tested with constraints both on 374 all-bonds and on bonds connecting with hydrogen atoms (H-bonds; results obtained with 375 these constraints appear in Fig 1 of the S1 File). Tolerance (τ ) values of 3.1423 · 10 −4 , 376 10 −4 , 3.1423 · 10 −5 , 10 −5 , 10 −6 , 10 −7 , 10 −8 , and 10 −10 have been tested.

377
The analysis of the energy drift has been done by taking also into account the 378 Verlet buffer size for the pair-list neighboring search. The pair-list neighboring search is 379 considered one of the two most important sources of energy drift in MD simulations (the 380 other one being that related to the constraints algorithm). The GROMACS parameter 381 used to establish the Verlet buffer size (Verlet-buffer-tolerance) was set here to 382 5 · 10 −5 kJ/(mol · ps) to permit a lower level of drift than that allowed for GROMACS' 383 default value (5 · 10 −3 kJ/(mol · ps)). This way, it is possible to display the drift effect 384 due to constraints more clearly. Data presented at the Fig 1 of the S1 File also include 385 results obtained for a Verlet-buffer-tolerance of 5 · 10 −3 kJ/(mol · ps). 386

387
In this section, results of the test calculations are summarised. In the first subsection, 388 an analysis of the reliability of the simulation as a function of the accuracy in satisfying 389 constraints is presented, whereas an analysis of the efficiency (performance) of the 390 calculations is displayed in the second subsection. 391

392
One of our goals is to understand the connection between the energy drift in MD 393 simulations of proteins and the tolerance τ of the constraint algorithm (see equation (31) 394 for the definition of τ ). For this purpose, we have simulated ubiquitin and barnase in 395 the NVE and NPT ensembles (with a V-rescale thermostat in the latter), using SHAKE 396 to impose constraints. We choose SHAKE because the slow and steady convergence of 397 this algorithm ensures that the largest relative error for the bond lengths is essentially 398 equal to the tolerance τ of the constraint algorithm. The time evolution of the energy 399 followed regular straight lines in large enough time scales; hence we calculated the drift 400 as the slope of the energy-vs-time relationship [60] divided by the number of degrees of 401 freedom of the system N df . We defined the latter as: N idf = 3m − n − 6, where m is the 402 number of atoms in the protein, n is the number of imposed constraints, and the −6 403 term accounts for rotations and translations. In the NPT simulations the temperature 404 was set to 298 and to 400 K (this latter only for barnase), whereas the pressure was 405 set to 1 atmosphere. To convert Joule per mol to units of k B T we have divided by 406 2477.7 if T=298 K (it includes the NVE simulations, where no T is imposed during the 407 production stage, but 298 K was initially set), and divided by 3325.8 if T=400 K. The 408 time step used for all the simulations was 2 fs.  [47][48][49] to acceptable levels. Fig 2 suggests that one should choose a constraint 413 tolerance τ which is at least as small at τ = 10 −6 .

414
As a further evidence of the importance of using small values of τ , in Fig 3 it is 415 observed that higher tolerances in NVE simulations led to significant drift of the initial 416 temperature set for the simulations. Conversely, if we use τ ≤ 10 −6 the temperature is 417 approximately preserved over the analysed time range. Other authors have also found 418 that the low accuracy in solving the constraints gives rise to inaccurate temperatures, 419 and that the default parameters of LINCS lead to non-converged results [78].
where d i is the relative error for the ith bond length, i.e., and n is the number of constraints.

435
The maximum and minimum values, d Max and d min (yellow and red lines, respectively), 436 are approximately the established value for the tolerance of SHAKE, which indicates 437 that the solver usually does not provide solutions much more accurate than that required 438 for the user [80]. The averaged(t) (blue line) is positive along the time, which implies 439 that the bond lengths are, on average, longer than expected. Qualitatively, the patterns 440 of the error plots persist if the tolerance of the constraint solver is changed, e.g. to 441 τ = 10 −10 (central panel in Fig 4). However, as expected, the size of the distortions is six 442 orders of magnitude lower than those obtained for τ = 10 −4 (top panel in Fig 4). This 443 element confirms that a tighter tolerance for satisfying the constraints largely reduces 444 the bond length distortions in the simulated system.   [79]. The text labels (UBIQ, COVID and SSU-PR) along the x-482 axis of Fig. 5 represent the three analysed proteins (ubiquitin, COVID-19 main protease 483 and the -atypically large-human SSU processome, respectively). The numerical labels 484 along the x-axis represent the largest acceptable relative error for a constraint in each 485 simulations, i.e., τ = 10 −4 , 10 −6 , . . . , 10 −12 . Top: parallel execution with 24 threads; bottom: ibid. with 48 threads. Note that the y-axis starts at y = 0.4 rather than y = 0. This has been done to emphasize the differences between the constraint solvers.
The value τ = 10 −4 is the GROMACS default value; we omit larger values of τ because, 487 as presented in the previous section, they produce result that do not converge [81]. We 488 found that τ = 10 −12 is near the lowest relative error that can be consistently achieved 489 during our simulations. This should not be perceived as general result as this value 490 depends on the specific equations being solved and the floating point number system 491 used.

492
The results presented in Fig 5 indicate that the performance of SHAKE, whose 493 implementations are most commonly serial -in contrast to P-LINCS and ILVES-PC-, is 494 the worst among the analysed solvers. This implies that SHAKE requires higher ratios of 495 the total execution time, which is aggravated by increasing the number of threads. This 496 is a natural consequence of Amdahl's Law [7], that is, the performance improvement 497 obtained by parallel execution is limited by the sequential fraction of the application. 498 This problem is likely to be exacerbated by the continuous increase in the number of 499 cores in the processors. ILVES-PC performs better than the state-of-the-art in nearly all 500 the analysed cases, and its relative advantage increases as higher accuracy is demanded. 501 Moreover, since the computation time is similar for high accuracy (e.g. τ = 10 −12 ) and 502 low accuracy (e.g. τ = 10 −4 ), accurate calculations become affordable using ILVES-PC. 503  It is entirely possible to view Fig 5 and conclude that it is a waste of time to improve 504 the quality of parallel constraint solvers. After all, the majority of the execution time is 505 spent outside the constraint solver, so why should we bother improving the constraint 506 solver? This line of reasoning ignores two key points: 507 1. It is easy to question the conclusions drawn from inaccurate simulations that show 508 a significant violation of the fundamental principle of conservation of energy. High 509 accuracy is costly unless the constraint solver is parallel. maximum speedups over SHAKE of 4×, 10×, and 34× for the three executed test 524 cases. As we decrease the tolerance, ILVES-PC speedups over SHAKE dramatically 525 increase while P-LINCS shows similar speedups for the whole range of tolerances. At the 526 minimum tolerance (τ = 10 −12 ), ILVES-PC achieves maximum speedups over SHAKE 527 of 13×, 28×, and 76× for the three analyzed proteins. This is not surprising. ILVES-PC 528 is based on Newton's method which normally has quadratic convergence, while there is 529 no reason to expect more than linear convergence from P-LINCS.
530 Fig 8 shows the parallel scalability of P-LINCS and ILVES using different numbers 531 of threads setting the tolerance to τ = 10 −12 [79]. Nearly identical results were obtained 532 for the whole range of tolerances. The scalability of both solvers is similar for all three 533 test cases. ILVES-PC performs slightly better than P-LINCS for the two representative 534 proteins. P-LINCS and ILVES-PC require regular synchronization between threads, and 535 the size of their parallel tasks depends on the number of bonds of the molecule. Hence 536 the size of the test case is important for scalability. If the molecule is sufficiently small 537 and the number of cores is sufficiently large, then the parallel overhead will dominate 538 and no solver can be efficient. Conversely, if the molecule is sufficiently large compared 539 with the number of cores, then good parallel performance is not theoretically impossible. 540 6 Conclusions and future work 541 The constraint solver introduced in this article demonstrates that it is possible to 542 conduct efficient parallel simulations of polymers by utilising the chemical structure. 543 We have also presented arguments supporting that constraint equations must be solved 544 accurately, including the fact that the default configuration of popular MD packages 545 leads to situations where the constraint solver does not lead to converged results.

546
The introduced algorithm is well-suited for accurate calculations. It is faster than 547 state-of-the-art constraint algorithms for most of the analysed cases and equally fast for 548 all other cases. The use of Newton's method ensures that we can reach high accuracy with 549 a small increase in the computational effort compared with low accuracy simulations.

550
Future stages of this project include tackling the case of imposing constraints on 551 H-bonds, the use of inexact Newton methods -such as symmetric approximations of the 552 Jacobian-in order to achieve higher computational efficiency, the extension of ILVES to 553 nucleic acids (ILVES-NA), MPI parallelization, SIMD vectorization as well as a general 554 version of ILVES which can calculate constraints in every molecule.