Abstract
Prediction of RNA structure from nucleotide sequence remains an unsolved grand challenge of biochemistry and requires distinct concepts from protein structure prediction. We report a stepwise Monte Carlo method that has enabled the first blind prediction recovering all noncanonical base pairs of a complex RNA structure, a double pseudoknot from the Zika virus genome posed as a community-wide RNA Puzzle. A benchmark of 82 diverse motif structure challenges and prospective experimental tests for three previously unsolved tetraloop/receptors support the method’s general ability to recover noncanonical pairs ab initio, with remaining problems traced to limitations of current macromolecule free energy functions.
Main text
Significant success in protein modeling has been achieved by assuming that the native conformations of a macromolecule have the lowest free energy and that the free energy function can be approximated by a sum of hydrogen bonding, van der Waals, electrostatic, and solvation terms that extend over Angstrom-scale distances. Computational methods that subject large pools of low-resolution protein models to all-atom Monte Carlo minimization guided by these free energy functions have achieved near-atomic-accuracy predictions in the CASP community-wide blind trials (1). When adapted to RNA structure modeling, analogous methods have consistently achieved nucleotide resolution in the RNA-Puzzle blind trials but have not yet reached atomic accuracy, aside from previously solved motifs that happen to recur in new targets (2). A disappointing theme in recent RNA-Puzzle assessments is that the rate of accurate prediction of noncanonical base pairs is typically 20% or lower, even for models with correct global folds (2). Without recovery of such noncanonical pairs, RNA computational modeling will not be able to explain evolutionary data, predict molecular partners, or be prospectively tested by compensatory mutagenesis for the myriad biological RNAs that are being discovered at an accelerating pace.
The lag between the protein and RNA modeling fields is partly explained by differences in how protein and RNA molecules fold. Protein structures are largely defined by how α-helices and β-sheets pack together. As abundant data exist on these regular protein elements and their side-chain interactions, protein models with reasonable accuracy can often be assembled from fragments of previously solved structures. Less regular loops interconnecting α- and β-elements are less critical for defining protein folds. Indeed, those loops are typically not recovered at high accuracy, even in the most exceptional blind predictions (3, 4). In contrast, the predictable and geometrically regular elements of RNA folding are Watson-Crick helices that sequester their side chains and therefore cannot be positioned by direct side-chain interactions. Instead, the RNA loops interconnecting those helices form intricate noncanonical base pairs that define an RNA’s global helix arrangement. The RNA structure prediction problem, more so than the protein problem, depends on discovering these irregular loop conformations and their associated noncanonical base pairs ab initio. Unfortunately, discovering the lowest energy conformations of these loop motifs has not generally been tractable due to the vast number of deep, local minima in the folding free energy landscape of even small noncanonical RNA motifs. Most RNA modeling methods use coarse-grained modeling stages that allow for smoother conformational search but generally return conformations too inaccurate to be refined to high accuracy by Monte Carlo minimization or molecular dynamics refinement.
To address this challenge, we have developed a Rosetta method called stepwise Monte Carlo (SWM). This method removes barriers in conformational search through addition and deletion of residues rather than through low-resolution coarse-graining or conventional high-resolution moves that make smaller perturbations. We previously described how step-by-step buildup of an RNA structure, enforcing low-energy conformations for each added nucleotide, could lead to atomic-accuracy models of single-stranded RNA loops (5). Conceptually, each addition was proposed to simulate the stepwise formation of well-defined structure from “random coil”-like ensembles. The calculation, instantiated in the Rosetta modeling framework, involved a deterministic enumeration over build paths, which guaranteed a unique solution for the final conformational ensemble but necessitated significant expenditure of computational power, so that only small loops could be modeled. We recently discovered that such computationally intensive enumeration is not required for high-accuracy modeling. Instead, an add-and-delete Monte Carlo minimization scheme stochastically produces conformations that give computed free energies as low as those achieved by enumerative stepwise assembly (SI Results). Through its increased speed, SWM allowed us to confirm that recent updates to the Rosetta free energy function (6) and estimation of conformational entropy of unstructured segments improve modeling accuracy for single-stranded loops (SI Results; SI Tables S1-S2; and SI Fig. S1).
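The add-and-delete idea can be illustrated with a deliberately simplified sketch. This is not the Rosetta implementation: the one-angle-per-residue representation, the `toy_energy` function, the move set, and all parameter values are hypothetical stand-ins chosen only to show residues being added, deleted, or resampled under a Metropolis acceptance test. The per-move minimization used in SWM is omitted, and no attempt is made to correct acceptance probabilities for the changing move set.

```python
import math
import random


def toy_energy(conformation):
    """Hypothetical free energy (in kT units); NOT the Rosetta energy function.

    Each built residue is represented by one torsion-like angle and earns a
    bonus of -1 kT, offset by a harmonic penalty for straying from 60 degrees.
    """
    return sum((angle - 60.0) ** 2 / 400.0 - 1.0 for angle in conformation)


def metropolis_accept(delta_e, temperature, rng):
    """Standard Metropolis criterion for an energy change delta_e."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def stepwise_monte_carlo(n_residues, n_cycles=2000, temperature=1.0, seed=0):
    """Toy add/delete/resample Monte Carlo over a growing chain of residues."""
    if n_residues < 1:
        raise ValueError("n_residues must be at least 1")
    rng = random.Random(seed)
    current = []  # start with no loop residues built
    current_e = toy_energy(current)
    best, best_e = list(current), current_e
    for _ in range(n_cycles):
        # Only moves that are valid for the current chain length are offered.
        moves = []
        if len(current) < n_residues:
            moves.append("add")
        if current:
            moves.extend(["delete", "resample"])
        move = rng.choice(moves)
        trial = list(current)
        if move == "add":
            trial.append(rng.uniform(-180.0, 180.0))
        elif move == "delete":
            trial.pop()
        else:  # resample the most recently built residue
            trial[-1] = rng.uniform(-180.0, 180.0)
        trial_e = toy_energy(trial)
        if metropolis_accept(trial_e - current_e, temperature, rng):
            current, current_e = trial, trial_e
            if current_e < best_e:
                best, best_e = list(current), current_e
    return best, best_e


if __name__ == "__main__":
    model, energy = stepwise_monte_carlo(n_residues=6, seed=1)
    print(len(model), "residues built; toy energy =", round(energy, 2))
```

Because deletions are allowed, a chain that was extended into a poor conformation can retreat and rebuild rather than remaining trapped, which is the property the text attributes to SWM relative to moves that only perturb a fixed-length chain.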
In late 2016, the blind modeling challenge RNA-Puzzle 18 provided an opportunity to test SWM rigorously (Fig. 1). On one hand, the 71-nucleotide target sequence was readily identified via PDB-BLAST (7) to be a Zika virus RNA homologous to a molecule with a previously solved X-ray structure, an Xrn1 exonuclease-resistant (xrRNA) fragment of Murray Valley Encephalitis virus (PDB ID: 4PQV) (8). However, the crystallographic environment of the prior structure disrupted a pseudoknot (between L3 and J1/4, Fig. 1) expected from sequence alignments so that nearly half of the prior structure could not be trusted as a template for homology modeling. Further complicating the modeling, two Watson-Crick pairs within stem P3 changed to or from G•U wobble pairs. Moreover, prior literature analysis (8) suggested extension of this helix by two further Watson-Crick pairs (U29-A37; U30-A36), albeit without direct evidence from phylogenetic covariation and in partial conflict with dimethyl sulfate probing (SI Results and SI Fig. S4). Ab initio modeling was therefore necessary for modeling the RNA, and we carried out stepwise Monte Carlo runs (Figs. 1A and 1B; SI Methods). Lowest free energy models converged to a tight ensemble of intricate structures illustrated in Figs. 1D and 1E. The Watson-Crick pairs U29-A37 and U30-A36 predicted in the literature did not occur in the models. Instead, several other features were consistently observed across the SWM models (red, Figs. 1D-E): co-axial stacking of the pseudoknot helix on P3, a noncanonical base pair between A37 and U51 interconnecting the P3 and P1 coaxial helical stacks, a UA-handle (9) formed by U29-A36, and lack of pairing by U30, A35, A52, and A53.
These features were not uniformly present – or not predicted at all – in models created by our prior state-of-the-art method for RNA-Puzzle modeling (Fragment Assembly of RNA with Full Atom Refinement, FARFAR) or, as it later turned out, in models submitted by other RNA-Puzzle participants (see SI Fig. S2 and below).
The subsequent release of the crystal structure (Fig. 1F-G) (10) confirmed all base pairs predicted by SWM modeling (100% non-Watson-Crick recovery, compared to <50% in prior RNA-Puzzles; Fig. 1H). The only structural deviation involved A53, which was predicted in SWM models to be stacked on the pseudoknot helix (Figs. 1D-E). In the crystal, this nucleotide was indeed unpaired but bulged out of the core to form a contact with a crystallographic neighbor, while a 1,6-hexanediol molecule from the crystallization buffer took its place (white sticks, Fig. 1F); this arrangement was noted independently to be a likely crystallographic artifact (10). As illustrated in the overlay (Fig. 1G), there is striking overall fold agreement (3.0 Å RMSD over all 71 residues and 2.3 Å over just the most difficult noncanonical region, nucleotides 5–6, 27–39, and 49–57), much better than the ~10 Å agreement seen in previous RNA-Puzzles of comparable difficulty (Fig. 1H) (2). Furthermore, SWM predicted all noncanonical base pairs accurately (FNWC = 1, Fig. 1H).
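The two accuracy measures quoted above, RMSD and the fraction of non-Watson-Crick pairs recovered (FNWC), can be sketched as follows. This is a minimal illustration under stated assumptions: base pairs are reduced to unordered pairs of residue numbers (real annotations also record the interacting edges and orientation), coordinates are assumed to be already superimposed and listed in matching atom order, and the function names are hypothetical.

```python
import math


def fraction_pairs_recovered(native_pairs, model_pairs):
    """Fraction of native base pairs that also appear in the model.

    Pairs are treated as unordered, so (37, 51) and (51, 37) are the same pair.
    """
    native = {frozenset(pair) for pair in native_pairs}
    model = {frozenset(pair) for pair in model_pairs}
    if not native:
        raise ValueError("native structure has no pairs to recover")
    return len(native & model) / len(native)


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two pre-superimposed coordinate sets."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


if __name__ == "__main__":
    # Residue numbers taken from the pairs named in the text; the model list
    # below is illustrative only and is not data from the study.
    native = [(37, 51), (29, 36)]
    model = [(51, 37), (29, 36)]
    print(fraction_pairs_recovered(native, model))
```

A value of 1 from `fraction_pairs_recovered` corresponds to the FNWC = 1 case described for the Zika xrRNA target.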
To evaluate SWM more broadly, we tested whether it could recover noncanonical base pairs on 82 complex RNA motifs that we encountered in previous RNA-Puzzles and other modeling challenges (SI Table S3 and SI Fig. S3). Due to the relative efficiency of SWM modeling and growing computational power, we could test a benchmark that was nearly three times larger than our most extensive previous efforts (11). Overall, SWM achieved a median RMSD accuracy (over the top five cluster centers) of 1.45 Å (Table 1 and SI Fig. S4) and mean recovery of non-Watson-Crick pairs of 76%. We observed numerous cases in which the SWM model and experimental structure were nearly indistinguishable by eye (Fig. 2). Examples included two-stranded motifs solved at much higher computational expense with the prior enumerative stepwise assembly method (12), such as the most conserved domain of the signal recognition particle (Fig. 2A; 1.26 Å RMSD, 5 of 5 noncanonical pairs recovered) and the first RNA-Puzzle challenge, a human thymidylate synthetase mRNA segment (Fig. 2B; 0.96 Å, 1 noncanonical pair and 1 extrahelical bulge recovered) (2). For several test cases, there was experimental evidence that formation of stereotyped atomic structures required flanking helices to be positioned by the broader tertiary context, so we expected these motifs might be particularly difficult. Nevertheless, if the immediately flanking helix context was provided, the median RMSD accuracy and non-Watson-Crick base pair recovery remained excellent (1.19 Å and 100%; Table 1, SI Fig. S4 and SI Table S3), as illustrated by the J5/5a hinge from the P4-P6 domain of the Tetrahymena group I intron (13) (Fig. 2C; 0.55 Å RMSD, all 4 noncanonical pairs and all 3 extrahelical bulges recovered).
Perhaps the most striking models were recovered for multi-helix junctions and tertiary contacts, which have largely eluded RNA modeling efforts seeking high resolution (11, 14). SWM achieved high-accuracy models for the P2-P3-P6 three-way junction from the Varkud satellite ribozyme, previously missed by all modelers in the RNA-Puzzle 7 challenge (Fig. 2D; 1.13 Å RMSD, 3 of 3 noncanonical pairs recovered); a highly irregular tertiary contact in a hammerhead ribozyme (Fig. 2E; 1.16 Å RMSD, 2 of 3 noncanonical pairs and 1 extrahelical bulge recovered); a complex between a GAAA tetraloop and its 11-nt receptor (Fig. 2F; 0.64 Å RMSD, all 4 noncanonical pairs recovered); and the tRNAPhe T-loop, a loop-loop tertiary contact stabilized by chemical modifications at 5-methyl-uridine, pseudouridine, and N1-methyl adenosine (Fig. 2G; 1.33 Å accuracy). Motifs without any flanking A-form helices offer particularly stringent tests for ab initio modeling but could also be recovered at high accuracy by SWM, as illustrated by the inosine-tetrad-containing quadruplex (Fig. 2H; 2.87 Å RMSD overall, 0.46 Å RMSD if the terminal uracils, which make crystal contacts, are excluded). For comparison, modeling with FARFAR gave worse recovery of noncanonical pairs and worse all-atom energies than SWM for each of these cases and, more broadly, across each motif category (Table 1 and SI Tables S3-S4).
In some cases in the benchmark, SWM did not exhibit near-atomic-accuracy recovery and illuminated challenges remaining for computational RNA modeling (Fig. 2H and SI Fig. S5). While a few discrepancies between SWM models and X-ray structures could be explained by crystallographic interactions (e.g., edge nucleotides making crystal contacts, Fig. 2H), most problems were better explained by errors in the energy function. For 14 of the 19 cases in which the SWM modeling RMSD was worse than 3.0 Å (and thus definitively not achieving atomic accuracy), the energy of the lowest energy SWM model was lower than that of the optimized experimental structure, often by several units of free energy (calibrated here to correspond to kBT (6)). In some cases, the RMSD achieved by FARFAR was better than by SWM, but not the fraction of base pairs recovered (Table 1), suggesting that conformational preferences encoded in database fragments in FARFAR need to be captured by SWM, perhaps in its torsional potential. Results on the hepatitis C virus internal ribosome entry site, the sarcin-ricin loop, and other test cases suggest that a more accurate torsional potential, as well as inclusion of metal ions, may eventually address these residual problems (SI Fig. S5).
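The reasoning used above to attribute failures to the energy function can be written as a small decision rule. The sketch below is an interpretation rather than the authors' analysis code: the function name and labels are hypothetical, the 3.0 Å cutoff is the one quoted in the text, and the "sampling" label for the remaining cases is an assumption that follows the usual convention for this kind of diagnosis.

```python
def diagnose_failure(model_rmsd, model_energy, native_energy, rmsd_cutoff=3.0):
    """Classify one benchmark case from its RMSD and two energies.

    model_energy:  energy of the lowest-energy model found by the search.
    native_energy: energy of the optimized experimental structure.
    Energies are assumed to be in the same units (e.g. kT).
    """
    if model_rmsd <= rmsd_cutoff:
        return "within cutoff"
    if model_energy < native_energy:
        # A wrong model scores better than the native-like structure, so
        # better sampling alone could not have selected the right answer.
        return "energy function"
    # The native-like structure scores at least as well but was not found.
    return "sampling"


if __name__ == "__main__":
    # Illustrative values only; these are not results from the benchmark.
    print(diagnose_failure(model_rmsd=4.2, model_energy=-35.0, native_energy=-31.5))
```

Under this rule, the 14 of 19 high-RMSD cases described in the text fall into the "energy function" category.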
Despite these current limitations, the overall accuracy of SWM in the 82-motif benchmark suggested that it could predict noncanonical base pairs in motifs that have been refractory to NMR and crystallographic analysis and that the resulting models could be stringently validated or falsified by prospective biochemical tests. Success in the 11-nt tetraloop/receptor motif (Fig. 2F), a classic model system and ubiquitous tertiary contact in natural RNAs, encouraged us to model alternative tetraloop/receptor complexes selected for use in RNA engineering but not yet solved experimentally (2, 15). The resulting SWM models for the C7.2, C7.10, and R(1) tetraloop/receptors (Fig. 3) suggested striking structural homologies to the natural 11-nt receptors but also noncanonical features (extrahelical bulges, pairs) different from prior manual modeling efforts (SI Results). We tested these features using prospective experiments. CMCT (N-cyclohexyl-N′-(2-morpholinoethyl) carbodiimide tosylate) mapping on the receptors installed into the P4-P6 domain of the Tetrahymena ribozyme verified extrahelical bulging of single-nucleotide uridines predicted at different positions in the different receptors (Fig. 3C and SI Results). The R(1) receptor model included numerous unexpected noncanonical features, especially a base triple involving a new Watson-Crick singlet base pair G4-C9 and a dinucleotide platform at G4-U5. These features were stringently evaluated via compensatory mutagenesis. Chemical mapping on the P4-P6 domain confirmed the G4-C9 base pair but was not sensitive enough to test other compensatory mutants (SI Fig. S6). We therefore carried out native gel assembly measurements in a different system, the tectoRNA dimer, which allows precise energetic measurements spanning 5 kcal/mol (Fig. 3E). 
Observation of energetic disruption by individual mutations and then rescue by compensatory mutants confirmed the predicted interactions of G4-C9, the base triple G4-U5-C9, and noncanonical pair G6-A7 (p < 0.01 in all cases; SI Fig. 3D) as well as other features of the model (SI Fig. S7). Such mutagenesis-based validation of noncanonical pairs would have been intractable absent a predicted model, due to the large number of possible mutant combinations that would have to be tested.
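The logic of disruption followed by compensatory rescue can be made concrete with a double-mutant cycle calculation. The sketch below is illustrative only and does not reproduce the analysis in the SI: it assumes binding free energies are obtained from dissociation constants via ΔG = RT ln(Kd) with a 1 M standard state, assumes a temperature of 298.15 K (not stated in the text), and uses hypothetical function names and invented Kd values.

```python
import math

GAS_CONSTANT_KCAL = 1.987204e-3  # kcal / (mol K)


def binding_free_energy(kd_molar, temperature_k=298.15):
    """Binding free energy in kcal/mol from a dissociation constant in molar."""
    if kd_molar <= 0.0:
        raise ValueError("Kd must be positive")
    return GAS_CONSTANT_KCAL * temperature_k * math.log(kd_molar)


def double_mutant_cycle(kd_wt, kd_mut_a, kd_mut_b, kd_double, temperature_k=298.15):
    """Return destabilization of each variant and the coupling energy.

    A negative coupling energy means the double mutant is more stable than
    expected if the two single mutations acted independently, i.e. the
    second mutation rescues the first.
    """
    dg_wt = binding_free_energy(kd_wt, temperature_k)
    ddg_a = binding_free_energy(kd_mut_a, temperature_k) - dg_wt
    ddg_b = binding_free_energy(kd_mut_b, temperature_k) - dg_wt
    ddg_double = binding_free_energy(kd_double, temperature_k) - dg_wt
    coupling = ddg_double - ddg_a - ddg_b
    return {"ddg_a": ddg_a, "ddg_b": ddg_b, "ddg_double": ddg_double, "coupling": coupling}


if __name__ == "__main__":
    # Invented Kd values (molar) for illustration; not measurements from the study.
    cycle = double_mutant_cycle(kd_wt=1e-9, kd_mut_a=1e-7, kd_mut_b=1e-7, kd_double=1e-9)
    for name, value in cycle.items():
        print(f"{name}: {value:.2f} kcal/mol")
```

In this framing, two residues predicted to pair should each be destabilizing when mutated alone (positive `ddg_a` and `ddg_b`) and should show a negative `coupling` term when the compensatory double mutant restores the interaction.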
We have reported blind prediction of all noncanonical pairs of a complex RNA-Puzzle, quantitative recovery of noncanonical pairs through the majority of an extensive benchmark that includes prior RNA-Puzzle motifs, and prospective experimental tests of noncanonical features in three previously unsolved tetraloop/receptors. These results support stepwise nucleotide structure formation as a missing algorithmic principle for high-resolution RNA structure modeling. Towards modeling and design of large multi-motif RNA structures, determining whether improved RNA torsional potentials and treatment of ionic effects can correct residual free energy function problems becomes a critical open question.
Materials and Methods
Detailed methods are available in the Supplementary Materials.
Acknowledgments
We thank Stanford Research Computing for expert administration of the BioX3 clusters (supported by NIH 1S10RR02664701) and Sherlock clusters. We acknowledge financial support from the Burroughs-Wellcome Fund (CASI to R. D.), NIH R01 GM102519 and R21 GM102716 (to R. D.), and a RosettaCommons grant.