Accelerated simulations of RNA clustering: a systematic study of repeat sequences

Under certain conditions, RNA repeat sequences phase separate yielding protein-free biomolecular condensates. Importantly, RNA repeat sequences have also been implicated in neurological disorders, such as Huntington’s Disease. Thus, mapping repeat sequences to their phase behavior, functions, and dysfunctions is an active area of research. However, despite several advances, it remains challenging to characterize the RNA phase behavior at submolecular resolution. Here, we have implemented a residue-resolution coarse-grained model in LAMMPS – that incorporates both RNA sequence and structure – to study the clustering propensities of protein-free RNA systems. Importantly, we achieve multifold speedup in the simulation time compared to previous work. Leveraging this efficiency, we study the clustering propensity of all 20 non-redundant trinucleotide repeat sequences. Our results align with findings from experiments, emphasizing that canonical base pairing and G-U wobble pairs play a dominant role in regulating cluster formation of RNA repeat sequences. Strikingly, we find strong entropic contributions to the stability and composition of RNA clusters, which is demonstrated for single-component RNA systems, as well as binary mixtures of trinucleotide repeats. Additionally, we investigate clustering behaviors of trinucleotide (odd) repeats and their quadranucleotide (even) counterparts. We observe that odd repeats exhibit stronger clustering tendencies, attributed to the presence of consecutive base pairs in their sequences that are disrupted in even repeat sequences. Altogether, our work extends the set of computational tools for probing RNA cluster formation at submolecular resolution and uncovers physicochemical principles that govern the stability and composition of resulting clusters.

RNA serves as a ubiquitous component of biomolecular condensates [1][2][3][4][5] -membraneless compartments within living cells formed through phase separation.Moreover, studies have demonstrated that protein-free RNA systems can undergo phase separation [6][7][8].Notably, RNA-driven condensates play crucial roles in both normal cellular function and dysfunction [2,9,10].Particularly intriguing are RNA-based condensates comprised of tandem repeat RNA sequences, which have been associated with various neurological and neuromuscular disorders [6,11].Consequently, there exists an urgent need to decode how RNA sequence, structure, and dynamics modulate the material properties and functions of biomolecular condensates.
Recent advancements in experimental techniques, including smFISH [12], SHAPE [13], and proximity labeling [14], have offered valuable insights into the roles of RNA polymers in biomolecular condensates.However, investigating the effects of RNA at submolecular resolutions remains a formidable challenge.Computer-based approaches, particularly explicit chain molecular dynamics simulations [15][16][17][18][19], have proven to be excellent complements to these experimental methods, striving to bridge the gap in our ability to map RNA sequences to their phase behaviors.
In general, faithfully representing RNA in molecular simulations requires accounting for several intrinsic and extrinsic factors, including RNA charge (electrostatic effects), base stacking, base-pairing interactions, and effects due to solvent and ions (including divalent ions).However, incorporating more details is often associated with greater computational 2 is such an approach for probing phase behavior of RNA? (2) how can we boost the efficiency of molecular simulations of RNA condensation when modeled via many-body potentials at nucleotide resolution?(3) what insights can such simulations reveal about RNA sequences and their phase behavior?Concretely, our work builds upon the existing singleinteraction-site (SIS) RNA model [15], proposed by Nguyen and coworkers.In this RNA model, each nucleotide is modeled via a unique bead.Notably, the model parameters are derived through a top-down coarse-graining approach leveraging experimental structures of RNA repeat sequences [15].
In so doing, the model includes many-body potentials (angular and dihedral terms) to recapitulate the A-form helix of CAG-repeat sequences.It is important to note that the SIS RNA model [15] does not explicitly account for RNA charges -instead, these effects are implicitly included in the model interaction terms.Consequently, this model is most suitable for probing protein-free RNA systems in high salt regimes.
Additionally, nucleotide-nucleotide interactions are scaled to account for canonical and wobble base-pairing propensities.
Interestingly, this model successfully reproduces the lengthdependent clustering behavior observed experimentally in repeat RNA sequences.However, more sophisticated models like the SIS model often come with an increased computational cost.As an illustration, the initial implementation of the SIS RNA model, the authors report requiring approximately 100 days of sampling to acquire statistical data on the clustering propensities of RNA.The current challenge, therefore, lies in finding a balance: how can we effectively model RNA clustering behaviors with both experimental accuracy and computational efficiency, particularly when utilizing many-body potentials?
Here, we broaden the accessibility and boost the efficiency of a nucleotide-resolution many-body potential designed for probing the clustering propensities of RNA.Our starting point is the SIS RNA model [15], initially implemented in OpenMM [25], which inherently employs 'implicit' force fields -eliminating the need for explicit definitions of analytical forces by practitioners.Notably, we extend the approach to LAMMPS [26], a widely utilized scientific molecular simulation software, with exceptional parallelization capabilities.To achieve this, we explicitly derive the analytical forces for the many-body base-pair potential using the formalism outlined in Ref. [27].We seamlessly integrate these forces into LAMMPS, implementing optimizations to reduce redundant force computations and leverage parallel computing.Furthermore, we derive the analytical pressure tensor for the model, facilitating simulations that involve variations in mechanical forces on the system.Thus, practitioners can utilize approaches such as the slab technique [28] to further accelerate equilibration.As a validation of successful implementation, we perform numerical integration using the one-sided Euler         Mixture effects in general) [41,42] should enhance the accuracy of predictions regarding trinucleotide clustering.For instance, G-rich sequences also yield stable RNA clusters, likely sustained by a synergy between non-canonical base-pairing and ionic effects [43,44].Moreover, meticulous incorporation of electrostatic effects paves the way for quantitative analysis of interactions between such RNA sequences and proteins [45][46][47][48][49][50][51].Lastly, expanding these models to encompass a wider array of RNA secondary structures could enhance their utility -both in protein-free solutions and protein-RNA mixtures, where structural recognition could be important.
Overall, the evolving approach in developing efficient RNA phase separation models, which explicitly account for sequence-dependent effects will enhance our ability to correlate RNA sequences with macroscopic condensate features.
When integrated with experimental data, these biophysical models contribute to advancing our understanding of spatiotemporal organization within cells, shedding light on mechanisms involved in abnormal condensation, and elucidating principles for designing soft materials based on RNA compartmentalization.

A. The coarse-grained model of RNA
In the single-interaction-site (SIS) model [15], each RNA nucleotide is represented by a single bead.As the center of coarse-graining is not specified in the original implementation, we utilize the C3' atom on ribose to determine the location of the coarse-grained bead.This choice is deemed reasonable based on the RMSD comparison against the CAG repeat crystal structure, Supplementary Fig. S2, and our analysis of the force field parameters against bioinformatics data [52], Supplementary Fig. S1.
In this model, the total potential energy (U ) is decoupled into three coarse-grained effective potentials: A schematic representation of the potential terms is presented in Fig. 1(a).Here, U BA describes the bonded interactions of RNA nucleotides, consisting of bond and angle restraints between the connected beads.The specific form of the potential is: The excluded volume interaction U EV is given by the Weeks-Chandler-Andersen potential: (3) Θ is the Heavyside step function and if the two beads are on the same chain and involved in bond or angle potentials, then this term is not computed.For the short-ranged interactions r j−1 : Then, the normal vectors of the surfaces that are formed by these vectors are defined as Thus, the angles can be calculated using the cos function The distance and angles can be used to calculate the base-766 pair potential U BP (Eq.4) and thus the force can be derived: A similar derivation procedure applies to calculations of all other 65 force components.For the pressure calculation, LAMMPS uses the pressure tensor that takes into account the contribution of the kinetic energy tensor and the virial tensor.
Thus, for example, the contribution of the 4 forces f j+1 , f j , 772 f i and f i+1 from the dihedral angle ϕ j+1,j,i,i+1 to the xx 773 component of the virial tensor is where, V is the total volume and x is the unit vector of the x direction in the Cartesian coordinate system.Similarly, the contributions to the pressure tensor are calculated for all the forces involved in the base-pair interaction.
Finally we implemented the forces and the tensors.By compiling with the force-field files (instructions are included in Github), the base pair potential can be accessed in LAMMPS by calling base/rna, with the freedom to set the interaction strength U i,j bp , equilibrium distance r 0 , the equilibrium angles θ i,j,j−1 , θ i−1,j,j , θ i,j,j+1 , θ i+1,j,j , and the dihedrals ϕ j−1,j,i,i−1 , ϕ j+1,j,i,i+1 , as well as their coefficients k r , k θ and k ϕ to any values depending on the systems of interest (9 parameters).An example script for defining this RNA model in LAMMPS is included in the Supplementary material (Listing S1).

Simulation
For the microcanonical simulation of the single chain (AG-GCAGCAGAAAAGACGACCCA), a timestep of 10 fs is employed.The potential energy and the kinetic energy are computed every 1 ps.The average drift of the total energy |∆E total | t is calculated by dividing the absolute overall drift from the initial total energy E total (t + ∆t) − E total (t) by the duration of time ∆t: For all the canonical ensemble simulations of RNA repeats, the timestep is set the same as the simulations in the microcanonical ensemble (10 fs).In multichain simulations, each system is composed of 64 chains, except where otherwise specified, and simulated with periodic boundary conditions.The systems are prepared at 200 µM concentration with chains evenly distributed in the initial configuration.For (AUC) 200 , we prepared the system with 27 chains at the same number density.
For all N V T simulations, a Langevin thermostat with low friction is used (damping time = 100,000 timesteps, with the integration step size of 10 fs).First, the chains are melted at in LAMMPS and the manuscript.We also thank Dr. Alina method [29], which reveals consistency with corresponding analytical counterparts.Crucially, we perform benchmarks for the model across both software platforms, investigating the potential advantages and disadvantages of running these simulations.Our findings demonstrate that while using a single GPU in OpenMM can deliver reasonable performance, employing parallel computing in our implementation with LAMMPS can 127 achieve up to approximately 5-fold speedups.128 Leveraging the efficiency gains of the RNA model, we in-129 vestigate all 20 non-redundant trinucleotide RNA repeats (ex-130 cluding homopolymers; Supplementary Fig. S4) that have also 131 been examined in experiments [30].Specifically, we con-132 duct molecular dynamics simulations on RNA trinucleotide 133 repeat sequences, which include investigations of single-134 component trinucleotide repeat systems, comparisons between 135 trinucleotide and quadranucleotide repeats and studies of bi-136 nary mixtures of trinucleotide repeats.In each case, we charac-137 terize the material properties (e.g., cluster size, concentration, 138 dynamics) of the collective systems.In line with experimental 139 findings, our simulations unveil a robust positive correlation 140 between the propensity to form RNA clusters and the ability to 141 establish base pairs, specifically canonical Watson-Crick and 142 G-U wobble base pairs.Strikingly, entropic contributions 143 to clustering propensities of such systems naturally emerge, 144 which have been postulated via an analytical model [31].Ad-145 ditionally, we capture differences in RNA clustering between 146 odd and even repeat sequences, which arise due to variations 147 in base-pairing patterns.Collectively, this work enhances our 148 capability to efficiently scrutinize RNA phase behavior via 149 many-body potentials, opening avenues for closely investigat-150 ing RNA conformational ensembles inside condensates.151 The implementation of the RNA coarse-grained model in 152 LAMMPS is consistent and stable 153 The single-interaction-site (SIS) model is a 1-bead-per-154 nucleotide coarse-grained model designed to capture the prop-155 erties of RNA in a sequence-dependent fashion.Initially im-156 plemented in OpenMM, the authors reported that it takes about 157 3 months to collect statistics for a system composed of 64 158 (CAG) 47 chains.Since our main objective is to probe RNA 159 clusters of similar and larger sizes, we first opted to enhance 160 the efficiency of this model by leveraging the parallelization 161 features in the LAMMPS software [26].OpenMM employs 162 its own toolkit to derive forces from potentials [25].While 163 LAMMPS can perform derivatives numerically (e.g., from 164 tabulated potentials), analytical potentials are much more de-165 sirable due to their boost in computational efficiency.Conse-166 quently, we derive and implement analytical forces for the SIS 167 RNA model in LAMMPS (Methods B). 168 To achieve this, we first assess the functional forms of inter-169 actions governing bonds, excluded volumes, and base-pairing 170 in the model, Fig. 1(a).Bonded interactions, encompassing 171 bonds and angles, are modeled via harmonic potentials that are 172 readily available in LAMMPS [26], facilitating direct imple-173 mentation by assigning corresponding parameters.Similarly, 174 the excluded volume potential in the model utilizes the soft re-175 pulsive Weeks-Chandler-Andersen potential [32], which can 176 be implemented in LAMMPS by shifting and cutting off the 177 standard 12-6 Lennard-Jones potential.However, there is no 178 existing form in LAMMPS for the base-pairing interactions, 179 constituting a 6-body potential dependent on 4 angles, 2 di-180 hedrals, and 1 distance [Fig.1(a), bottom panel].To address

FIG. 1 .
FIG. 1. Characterization of the RNA model implementation in LAMMPS.(a) Schematic of potential energy terms are included in the RNA model.Namely, U bond, angle represents the bond and angle terms along the chains; U excluded volume captures excluded volume interactions; U base pair represents the base-pair potential terms, which include a harmonic base-pair distance term, four angles and two dihedrals.U base pair ensures that the A-form right-handed antiparallel helical structures are energetically favored.(b) Comparison between analytical and numerical integrationof the base-pair potential.Numerical integration is performed via the one-sided Euler method using our LAMMPS code.We perform integration for each base-pairing term (harmonic distance, angles, and dihedrals); the specific scenarios are sketched in the plot.(c) Energy trajectories from the microcanonical simulation of a single chain CAG repeat.The simulation is started from a folded structure to make sure the force calculations are properly implemented.Setting the timestep size of 10 fs, the average total energy drift (3.5 × 10 −3 kcal/mol ns −1 ) is about 4 orders of magnitude smaller than the potential energy fluctuations (∼ 10 kcal/mol) demonstrating the stability of the model.(d) To further evaluate the correctness of the LAMMPS implementation, we simulate a condensed system of 64 (CAG)47 chains with Langevin thermostat at 293 K for 10 5 steps.This compact system requires an extensive evaluation of base-pair potentials at each timestep, thus representing the slowest simulation speed.Even with different simulation softwares, different precision, and different hardware, the system follows the same relaxation dynamics.(e) The scaling of the implementation is evaluated for 64 (CAG)47 and 125 (CAG)47 chains.For up to 32 CPUs, the scaling exponent is 0.88 for 64 (CAG)47 chains.When the system size is increased to 125 (CAG)47 chains, the scaling exponent is 0.86 for up to 64 CPUs.Data for the original OpenMM code on A100 GPUs with both double and mixed precision are also provided for comparison.

FIG. 2 .
FIG. 2. Material properties of non-redundant trinucleotide RNA repeat clusters.Each simulation is prepared by placing 64 chains of 40 trinucleotide repeats evenly in a periodic box.After using an annealing procedure (described in the text), we simulated each system for 2 µs in the N V T -ensemble at 293 K. (a) Depicted snapshots are taken from the end of the molecular dynamics simulations.Corresponding sequences are labeled at the left top of each snapshot.The redundancy of the sequences is explained more explicitly in Supporting Fig. S4.(b) Fraction of chains in the largest cluster versus concentration of the largest clusters, indicating strong positive correlation between clustering propensities and the concentration of the resulting clusters.Here, two chains are assigned to the same cluster if at least two of their nucleotides are within 1.2 r0, where r0 is equilibrium base-pairing distance.Error bars represent standard deviation.(c) Mean squared deviation (MSD) over time τ of chain center of mass.MSD exponents of select trajectories are indicated on the plot.Error bands represent standard deviation.

A U - 3 .FIG. 3 .Fig. 3 (Fig. 3 (FIG. 4 .
FIG. 3. Entropic effects, via base-paring modes, contribute to stability of trinucleotide clusters.All systems are prepared at 200 µM with 64 chains evenly distributed and melted at 373 K. Then the systems are cooled down to 293 K within 40 ns.Finally, the systems are simulated at 293 K for 2 µs.The trajectories from the last 500 ns are used to measure the cluster size distributions.(a)Canonical and wobble base-pairing interaction strengths, as implemented in the coarse-grained model.(b) Comparison of cluster sizes between (ACU)40 and (AUU)40.Each repeat is able to form weaker A-U base pairs, but (AUU)40 has more probabilities to form base pairs.The simulation snapshots corresponding to the final state of each system, as well as the pairing modes for each system, are shown as insets.(c) Comparison of cluster sizes between (CAG)40 and (CGG)40.Each repeat is able to form stronger C-G base pairs, but (CGG)40 has more probabilities to form base pairs.The simulation snapshots corresponding to the final state of each system, as well as the pairing modes for each system, are shown as insets.

475
Here, we find a positive correlation between the fraction 476 of RNA chains in the largest cluster in simulations and the 477 fraction of foci-positive cells in the in vivo experiments [30] 478 (Pearson correlation coefficient of 0.62), Fig. 4(a).Despite 479 the reasonable agreement with experiments, there are, how-480 ever, some disparities between the simulation predictions and 481 experimental results.A notable exception occurs for (ACU)

FIG. 5 .
FIG. 5. Comparison of clustering tendencies for trinucleotide (odd) and quadranucleotide (even) repeats.All systems are prepared at 200 µM with 64 chains evenly distributed and melted at 373 K. Then the systems are cooled down to 293 K within 40 ns.Finally, the systems are simulated at 293 K for 2 µs.The trajectories from the last 500 ns are used to measure the cluster size distributions.The simulation snapshots corresponding to the final state of each system, as well as the pairing modes for each system, are shown as insets.The main differences in pairing modes are highlighted by gray boxes.(a) The clustering propensities of (CAG)40 versus (ACAG)30.While (CAG)40 is able to form two consecutive C-G pairs that are separated by a 'spacer' A, the extra 'spacer' A in (ACAG)30 prevents it.(b) The clustering propensities of (CUG)40 versus (UCUG)30.Even though U is able to pair with G, the ideal pairing modes are dominated by strong C-G pairs.(CUG)40 is able to form two consecutive C-G pairs that are separated by 'U', the addition of extra 'U' in (UCUG)30 prevents it.

FIG. 6 .
FIG. 6. Material properties of RNA clusters in binary mixtures of trinucleotide repeats.Each equimolar mixture is composed of 64 + 64 = 128 chains and simulated with periodic boundary conditions at 200 µM.The chains are evenly distributed and melted at 373 K. Then the systems are cooled down to 293 K within 40 ns.Finally, the systems are simulated at 293 K for 2 µs.For comparison, the single component sysytems are simulated under the same condition with 64 chains.The trajectories from the last 500 ns are used to measure the cluster size distributions.For each subfigure, the snapshot of single-component systems and the mixtures, taken from the end of simulations, are shown on left with consistent coloring of different sequences.The pairing modes are shown in the middle with the minimum energy of pairing indicated by a gray box.The center of mass MSD (mean square displacement) of components in single-component systems and the mixture are show on right.In addition, the largest cluster sizes from simulations are shown as insets in the MSD subfigures.(a) Results for the binary mixture of (AAC)40 and (AUG)40.The dimers of (AAC)40 and (AUG)40 are dominant.(b) Results for the binary mixture of (AAC)40 and (CUG)40.Despite the ability of (AAC)40 and (CUG)40 to bind each other, (CUG)40 self-interactions mainly contribute to clustering, and (AAC)40 chains are almost excluded.(c) Results for the binary mixture of (CAG)40 and (CUG)40.With strong energetic compatibility, (CAG)40 and (CUG)40 are both contributing to clustering and experience relatively slow dynamics compared to other studied mixtures.
Emelianova, Ananya Chakravarti and Prof. Athanassios Panagiotopoulos for helpful feedback on the manuscript.This research was supported by departmental start-up funds allocated to J.A.J. through the Department of Chemical and Biological Engineering and the Omenn-Darling Bioengineering Institute at Princeton University.This research was also partially supported by the National Science Foundation (NSF) through the Princeton University (PCCM) Materials Research Science and Engineering Center DMR-2011750.The authors also acknowledge funding from the Silicon Valley Community Foundation.

and entropic effects of base-pairing regulate material 317 properties of RNA trinucleotide repeat clusters
280In addition, we compare the LAMMPS simulation efficiency 281 with that of OpenMM.Notably, in previous work the OpenMM 282 simulations were performed on a single NVIDIA Quadro RTX 283 5000 graphics card.Since we did have access to this hard-284 ware, we use NVIDIA A100 Tensor Core 80GB graphics 285 cards.We find that for a 64-chain system, OpenMM simu-286 lations with mixed precision running on one GPU completes 287 747 steps/second and going from mixed precision to double 288 precision reduces the simulation speed for OpenMM using 289 A100 GPUs.For comparison, the maximum speed of GPU 290 (mixed precision; OpenMM) is equivalent to 10 CPUs in our 291 LAMMPS implementation (double precision), rendering the 292 OpenMM implementation advantageous in this aspect.Inter-293 estingly, increasing the number of A100 GPUs leads to nega-294 tive scaling for the OpenMM implementation.This behavior 295 could be attributed to several factors regarding the multi-GPU 296 implementation: (1) how the GPUs are connected to each other 297 (NVLink or PCIe); (2) the fact that OpenMM employs force 298 evaluation decomposition instead of domain decomposition 299 [35].In contrast, our LAMMPS implementation achieves a 300 maximum speed of 2370 steps/second for the 64-chain system 301 when utilizing 64 CPUs -i.e., a roughly 3-fold increase in 302 speed compared to the tested OpenMM GPU (mixed preci-303 sion).In context, this result means that a 3 week long simula-304 tion in the original code (run on A100 GPUs) should now take 305 7 days to complete in the LAMMPS implementation on CPUs.306 It is important to note that the A100 GPUs are better suited for 307 deep learning computations [36] and the new RTX4070/80/90 308 cards could yield better performance for molecular dynam-309 ics simulations [37].Thus, the speedup in OpenMM could 310 improve if one has access to these GPUs.Collectively, our 311 comparisons reveal that while the OpenMM implementation 312 holds advantages over multiple CPUs, the maximum speed of 313 the model in OpenMM is capped in our test cases.Conversely, 314 our LAMMPS implementation demonstrates the potential for 315 multifold speedup -given enough resources.316 Energetic 318 Capitalizing on the improved efficiency of the RNA im-319 plementation in LAMMPS, we employ the model to com-326 terface.In our simulations, we focus on RNA clusters, in 327 general, and define these using a chain-to-chain distance met-328 ric; i.e., nucleotides belonging to chains that are within 120% 329 of the ideal base-pairing distance are assigned to the same clus-330 ter.Using this metric, we quantify, under the same conditions, ,j,i,i+1 + ϕ 2 ) sin(ϕ j+1,j,i,i+1 ) ∂ cos(ϕ j+1,j,i,i+1 ) ∂x i = U BP k ϕ sin(ϕ j+1,j,i,i+1 + ϕ 2 )sin(ϕ j+1,j,i,i+1 )