Tautomerization constrains the accuracy of codon-anticodon decoding

G○U(T) mismatch has the highest contribution to the error rate of base pair recognition in replication, as well as in codon-anticodon decoding in translation. Recently, this effect was unambiguously linked to keto-enol tautomerization, which enables the Watson-Crick (WC) geometry of the base pair. Structural studies of the ribosome revealing G○U in the WC geometry in the closed state of the A-site challenge the canonical induced-fit model of decoding and currently lack a physicochemical explanation. Using computational and theoretical methods, we address effects of the ribosomal A-site on the wobble↔WC tautomerization reaction in G○U (wb-WC reaction), and the consequent implications for the decoding mechanism in translation. The free energy change of the wb-WC reaction in the middle codon-anticodon position was calculated with quantum-mechanical/molecular-mechanical umbrella sampling simulations. The wb-WC reaction was endoergic in the open A-site, but exoergic in the closed state. This effect can be explained in part by the decreased polarity of the closed A-site. We developed a model of initial selection in translation that incorporates the wb-WC reaction parameters in the open and closed states of the A-site. In the new model, the exoergic wb-WC reaction is kinetically restricted by the decoding rates, which explains the observations of the WC geometry at equilibrium conditions. Moreover, the model reveals constraints imposed by the exoergic wb-WC reaction on the decoding accuracy: its equilibration counteracts the favorable contribution from equilibration of the open-closed transition. The similarity of the base-pair recognition mechanism in DNA polymerases allows extending this model to replication as well. Our model can be a step towards a general recognition model for flexible substrates.


INTRODUCTION
The accuracy of base pair recognition during replication and translation has a fundamental role in cell fitness and evolution [1][2][3][4] . Some mismatches, such as G•U (G•T) mismatch, have particularly large contributions to error rate [5][6][7][8][9][10] . The tautomeric hypothesis by Watson and Crick 11 , 12 suggests that a source of base pair recognition errors can be attributed to keto/enol and amino/imino tautomerism of nucleobases. G•U(T) base pair formed with the canonical keto states of the nucleobases is unable to adopt the Watson-Crick (WC) geometry and predominantly exits in the wobble (wb) geometry 13 . Formation of enol tautomeric states (G* or U*(T*)) enables the WC geometry, potentially leading to errors during base pair recognition, but the biologically-relevant mechanisms of this process remained elusive until recently.
Using computational chemistry methods, Brovarets' and Hovorun predicted wobble↔WC tautomerization reaction in G•U(T) base pair 14,15 (wb-WC reaction, Fig. 1A), which was experimentally confirmed by NMR in DNA and RNA duplexes in water solution 13 , where this reaction was endoergic (i.e. the wobble geometry was more populated than the WC) 13,16 .
Recently, by incorporating tautomeric populations measured in DNA duplexes in water solution into a numerical kinetic model of DNA replication, a good agreement between predicted and experimentally observed misincorporation rates was obtained 16 . It might indicate a negligible role of the active site environment on the properties of the wb-WC reaction. However, both experimental 17 and computational 18,19 studies demonstrate stabilization of the WC geometry of G•T in the closed state of active site in several DNA polymerases, implicating a more complicated mechanism of base pair recognition. The WC geometry of G•U was also observed in the closed state of the decoding site (A-site) of the ribosome at the first two codon-anticodon positions: in crystal structures [20][21][22][23] and later in Cryo-EM structures 24 . Figure 1: The wb-WC reaction. A -local minima and the transition state (TS) along the wb-WC reaction. Asterisk denotes enol tautomers. The markers shown at each state are used to denote the corresponding states in B -C. The wb-WC reaction consists of slow tautomerization reaction from G•U wb to G•U* WC, followed by fast double proton transfer reaction between G•U* and G*•U. B -reference potential energy profile of the wb-WC reaction calculated with nudge elastic band method (NEB) in gas phase. C -reference wb-WC reaction path from NEB, projected on the collective variables used in umbrella sampling calculations s (path collective variable) and hb (proton transfer variable). Gray lines and gray circles denote trajectories from the committor analysis, initiated from the TS These observations of the WC geometry of G•U(T) challenge the classical model of substrate recognition in replication and translation -the induced-fit model, which assumes rigid substrates. The open↔closed conformational transition of the ribosomal decoding site, the molecular mechanism of the "induced fit" 24,25 , provides the highest decoding accuracy in the equilibrium limit, but is shifted from equilibrium due to its high forward rates, and the rates of the following irreversible reaction [26][27][28][29] . Theoretical studies based on the rigid substrate approximation have concluded that decoding in translation has been evolutionarily optimized for higher rate rather than for higher accuracy 27,28,30 . Since the rate constants affecting the open↔closed transition are not rate-limiting in translation elongation 27 , a possible advantage of such trade-off remains unclear and additionally challenges the induced-fit model. Hovorun and coworkers have suggested a non-equilibrium exoergic wb-WC reaction in the closed state of a DNA polymerase active site, kinetically restricted by the lifetime of the recognition state 15 . Such mechanism applied to the codon-anticodon decoding might solve the above described problems, but requires formalization and detailed analysis.
Clearly, studying environmental effects of the ribosomal decoding site on the wb-WC reaction is essential for understanding the decoding mechanism. In the absence of experimental methods suitable for addressing this problem, computational methods provide such opportunity. Until now, two computational studies have addressed nucleobase tautomerism in codon-anticodon decoding 31,32 , both of which argue against stabilization of the WC geometry in G•U at the decoding site, thus contradicting the experiments 20-24 . However, these computational studies used force fields for energy calculations, which may not be the most accurate approach to study tautomerization, especially when the rare tautomers were not parameterized, as in ref. 31 .
Here, we overcome the force-field-related biases by performing hybrid quantum-mechanical/ molecular-mechanical (QM/MM) simulations. We use QM/MM umbrella sampling (US) method to calculate free energy profiles of the wb-WC reaction in benchmark systems (DNA duplex in water, and a single base pair in benzene), ribosomal A-site models and active sites of DNA polymerases. Our calculations reveal positive free energy of the WC geometry formation ∆G wc in the model of the open A-site, and negative ∆G wc in the closed model, consistent with the structural studies [20][21][22][23][24] . We also found different effects of DNA polymerases on the wb-WC reaction: ∆G wc in the active site of polymerase β was negative, consistent with the structural studies 17 , while ∆G wc in DNA polymerase T7 was not significantly affected compared to the DNA duplex in water. We analyze dielectric constant of the studied environments to explain the observed effects on the wb-WC reaction.
To study implications of the exoergic wb-WC reaction in the closed A-site on the codonanticodon decoding, we incorporated the wb-WC reaction into the model of initial selection in translation and derived analytical solutions for the new model. In this model, the exoergic wb-WC reaction is kinetically restricted by the residence time of the closed state of the decoding site. The new model is consistent with previously reported WC geometry of G•U in the structural studies at equilibrium conditions [20][21][22][23][24] . In addition to uniting structural and kinetic data, the model reveals constraints on decoding: the equilibration of the wb-WC reaction counteracts the equilibration of the open↔closed transition, thereby constraining the error rate of decoding.
The similarities with DNA polymerases suggest that the proposed model can be also applied to replication. The model derived here might be a step towards a more general model of recognition capable of describing flexible substrates.

METHODS
Details on the methods are provided in Supplementary Methods. QM calculations. We performed quantum chemistry (QM) calculations to obtain the reference dependence of the wb-WC equilibrium on the dielectric constant of a solvent, and to select the optimal QM level of theory for our QM/MM calculations. Using geometries of the wb-WC reaction minima and transition state (TS) optimized at density functional theory (DFT) levels ωB97X-D3/def2-TZVP, B3LYP-D3BJ/def2-TZVP and BLYP-D3BJ/def2-SVP in gas phase and implicit water model, we calculated single-point energies with multiple DFT and semiempirical (SE) methods. The methods were assessed based on the energies (∆E wc ) or enthalpies (∆H wc ) of the WC geometry formation (∆E wc = ∆E(G*•U WC) -∆E(G•U wb)). Our benchmark calculations showed PM7 33 as an optimal QM method for our QM/MM approach, as it was able to accurately reproduce ∆H wc calculated with DFT (Table S2), while allowing vast sampling due to its low computational cost. The activation enthalpies obtained with PM7 were much less accurate (Table S2) pol-β model was prepared from the crystal structure of pol-β with a G•C base pair in the closed active site (PDB ID: 4KLF) 36 , in which we replaced dCTP with dTTP. T7-pol was prepared in a similar way using the crystal structure with PDB ID 1T7P 37 as the initial coordinates.
To prepare the A-site models, we used the crystal structure of T.thermophilus 70S ribosome bound to tRNA T hr in the A-site (PDB ID: 6GSK) with U•G base pair in the second codon (AUC) -anticodon (GGU) position 23 . All residues within 35Å radius of the second codon nucleotide were selected for the model ( Fig. 2A). The outer shell of the selected sphere was restrained in all simulations. In the "closed" (native) A-site model the rRNA residues A1492, A1493 and G540 were in out conformation, surrounding the codon-anticodon helix ( Fig. 2B). In the "abasic" model the nucleobases of A1492, A1493 and G540 were deleted.
In the "open" model, used only in QM/MM US simulations (see below), we used harmonic biases from Zeng et al. 32 to restrain A1492 and A1493 to the in position. All Mg 2+ ions from the ribosome crystal structure were deleted in the models.
Unless specified otherwise, all models were solvated in TIP3P 38 water box with 0.15 M of NaCl. We employed a standard equilibration protocol (see Supplementary Methods), comprising steepest-descent minimization, solvent equilibration in NPT ensemble at standard conditions and gradual heating of the solutes to 298 K. Production MD simulations were performed in NVT ensemble using Langevin thermostat with 2 fs integration step. All MD simulations were performed using periodic boundary conditions. CHARMM36 force field [39][40][41] in NAMD 2.12 42 was used for all classical MD simulations. Dielectric constant calculations. To analyze dielectric properties around the base pairs in the simulated models, we applied Kirkwood-Fröhlich formula (KFF), which relates dielectric constant (ε) to dipole moment fluctuations in a selected volume 43 . Using classical MD trajectories, we calculated ε of all nucleic acid, protein and water residues in spheres of radii ranging from 5 to 12Å centered on the base pair of interest. Since ε calculated in a TIP3P water box using this approach depended on the probe volume radius until its convergence at ∼ 20Å (Fig. S3), the approach only allows a qualitative comparison between the studied systems. and calculated PMF using weighted histogram analysis method (WHAM). We focused only on the relative energies and positions of local minima in the wb-WC reaction in our US simulations Kinetic modeling. Decoding rate constants for a codon-anticodon combination with a G•U mismatch in the first position were obtained earlier by Pape et al. 45 , but only in the low-fidelity buffer conditions, in which the accuracy of the initial selection was extremely low (1.1), contrasting the in vivo measurements [5][6][7][8][9] . Therefore, for our kinetic modeling we selected decoding rate constants measured at high-fidelity conditions, which result in more realistic accuracy 46,47 . In the absence of high-fidelity measurements corresponding to a G•U mismatch, we used the set of rate constants corresponding to the A•C mismatch in the first codon-anticodon position 46,47 Table S3. Rate constants k 2 , q c 3 and q nc 3 were assigned arbitrary. Numerical solutions of the kinetic model were obtained by numerical integration of ordinary differential equations (ODE) in Python. We describe derivations of analytical solutions in Appendix.

Dielectric constant of base pair environments
It was speculated previously that closing of the ribosomal A-site, via conformational change of the rRNA residues A1492, A1493 and G540, desolvates the codon-anticodon helix, increasing base-pairing selectivity by an energy penalty from the lost H-bonds with water in mismatches 25,48 . We hypothesized that desolvation may also bring an opposite contribution to the accuracy of G•U mismatch recognition. Our benchmark calculations (Fig. S1) demonstrate that the wb-WC reaction is exoergic in gas phase and very nonpolar implicit solvents, corroborating previous computational studies 14,15,18,49 . We reasoned that a potentially decreased polarity of the closed ribosomal A-site may be responsible for the stabilization of the WC geometry of G•U observed in the structural studies.
We applied Kirkwood-Fröhlich formula (KFF) to qualitatively compare dielectric constant of the decoding site environments between the closed and abasic models. In the latter model, nucleobases of residues A1492, A1493 and G540 were deleted (Fig. 2B). We observed a consistent ε decrease in the closed model for all codon-anticodon positions (Fig. 3). However, only at some distance cutoffs and positions the difference was statistically significant, which can be explained by the limited length and replica numbers of the MD trajectories ( Fig. S4). From our KFF calculations we conclude that the A-site closing decreases ε of environment of all three codon-anticodon positions.
We also compared the active sites of pol-β and T7-pol using the same approach. The active site environment in pol-β was generally less polar compared to T7-pol, but the difference was statistically significant only at 12Å cutoff (Fig. 3).
In sum, our KFF calculations revealed decreased polarity in the closed compared to the abasic ribosomal A-site model, and in pol-β compared to T7-pol.  ∆G wc dependence on ε observed in implicit solvent models, we applied it to a single G•T base pair in explicit benzene environment (ε = 2.3). ∆G wc in benzene converged to -0.5 ± 0.6 kcal/mol, which is consistent with the QM calculations in implicit solvent model of similar ε (Fig. S1). In both benchmark systems the positions of the minima on the PMF were not qualitatively altered compared to the reference reaction path (Fig. S5, Fig. 1C). From our benchmark calculations we conclude that our setup is able to faithfully estimate environmental effects on the wb-WC reaction, reaching qualitative agreement with experiments and higher levels of QM theory.

Inspired by the previously applied numerical kinetic model of selection in replication by
Kimsey et al. 16 , we made the following assumptions: (i) nc-WC states are characterised with the same decoding rate constants as the cognate states, since the WC geometries are assumed to be indistinguishable for the decoding site; (ii) GTPase activation rate constant (k 4 ) in C4 wb nc state is set to zero, assuming that the major contribution to this reaction rate comes from the nc-WC state, thus the contribution from nc-wb can be ignored; (iii) escape rate constants of the nc-wb states are taken from the classical nc states (q nc i ), assuming that the cumulative escape rate is dominated by the nc-wb states, thus the nc→nc-wb rescaling can be neglected. Given the assumption (ii), the apparent near-cognate k 4 : where P wc is the population of the C4 wc nc state -a function of the equilibrium WC populations in C3 and C4 states, and constraints imposed by the decoding rates on the reaction kinetics in C4. Using the equation for product concentration in a first-order reversible reaction, we derived (see Appendix) the equation for the error rate induced by the wb-WC reaction: where P eq W C is the equilibrium WC population in C4 for a given (k f , k r ), and P C3 W C is the equilibrium WC population in C3. For convenience, the variables P eq W C , P C3 W C and k f are expressed below in terms of free energy differences ∆G C4 wc , ∆G C3 wc and ∆G ‡ (activation free energy), respectively, using Eq. (S.5) and Eq. (S.6) at standard conditions.
First, we analyzed how ∆G C3 wc and ∆G ‡ of the exoergic (∆G C4 wc = −1 kcal/mol) wb-WC reaction affect η predicted from Eq. (2) (Fig. 5B). For the range of ∆G wc and ∆G ‡ in RNA duplexes in solution reported in 16 , η overlaps with the in vitro error range of G•U mismatches (∼ 10 −2 -10 −4 ) 5-9 . However, this result clearly cannot be interpreted as a validation of Eq. (2). In order to test Eq. (2), the decoding rate constants for G•U mismatches should be obtained, and ∆G ‡ in the decoding site accurately estimated.
Next, we addressed the dependence of η on the decoding rate constants k c 4 (GTPase activation) and q c/nc 4 (cognate and near-cognate escape rate constants of the C4 state).
These rate constants determine the deviation of the open↔closed transition in decoding from equilibrium conditions, thereby limiting the accuracy of decoding in the classical model 27,28 .
Eq. (2) suggests two possible kinetic regimes of the wb-WC reaction for the case of positive ∆G C3 wc . In the "fast" regime ((k f + k r ) >> (k c 4 + q c 4 )), kinetics of the wb-WC reaction and P C3 W C are irrelevant, and P wc = P eq W C . For P eq W C = k nc 4 /k c 4 (corresponds to ∆G C4 wc ≈ 3.4 kcal/mol), the fast regime is equivalent to the classical error η 0 (Eq. (A.1)) in both k c 4 and q c/nc 4 dependencies, which is confirmed by numerical calculations (Fig. 5C-D). Therefore, our model satisfies the correspondence principle 50 : it reduces to the classical induced-fit model when the wb-WC reaction is at equilibrium in the pre-chemistry step of decoding.
A more intriguing and relevant given the predicted wb-WC parameters is the "slow" regime, where the wb-WC equilibrium in C4 is shifted to the WC geometry (k f > k r ), but the kinetics of the reaction is restricted by the decoding rates. To visualize the slow regime, we chose wb-WC parameters (∆G C4 wc = −1 kcal/mol, ∆G ‡ = 17.8 kcal/mol), which approximately correspond to the predicted ∆G C4 wc (Table 1)  To explore this further, we considered a cumulative error rate of translation, η c , as a sum of η contributions from mismatches with slow, WC-shifted wb-WC transitions (e.g., G•U mismatches) and with "classical" η 0 -like behavior. In terms of our model (Fig. 5A), the latter can be interpreted as mismatches having fast endoergic transition to the WC geometry in C4, or mismatches whose GTPase activation rate is governed by low WC-independent rate constant (i.e. k wb 4 = 0, and q c/nc 4 (Fig. 6). Both rate constants for a wide range do not significantly contribute to R cog (Fig. 6A, Fig. S9). The exoergic wb-WC reaction constrains η c at a plateau (Fig. 6B): for the ribosome to significantly improve the accuracy, the decoding rate needs to be reduced. In contrast, for a classical mismatch, the ribosome could reduce the error rate at least 1000-fold without significantly affecting the decoding rate (Fig. S9A). In the cumulative approximation, the ribosome (the experimental values of decoding rate constants) is positioned in a region of local minimum of ∇η c (Fig. 6C). It illustrates the possibility of the evolutionary optimization for the minimal gradient, and thus for the minimal noise of error rate. For the case of η of a G•U mismatch only, this position is shifted from the minimum (Fig. S9B). Consideration of the cumulative error of decoding would likely be important to understand the evolutionary optimization of the ribosome, but requires more experimental information and more realistic approximations.
It is also worth noting that our model does not contradict the experiments with varying Mg 2+ concentrations. In their experiments Zhang et al. 51 reported that Mg 2+ concentration affected q 2 , resulting in linear rate-accuracy trade-offs. For all considered wb-WC parameters, our model also predicts linear rate-accuracy trade-offs by varying q 2 (Fig. S10).

CONCLUDING REMARKS
Structural studies revealed stabilization of the WC geometry of the G•U base pairs in the closed ribosomal A-site 20-24 , but the mechanism of this phenomenon has been lacking a physichochemical explanation and its implications for the decoding mechanism remained unexplored. Here we addressed these problems using computational and theoretical ap- were not addressed in this study.
We also investigated the effects of the closed active site environment of DNA polymerases β and T7 on the wb-WC reaction in G•T. The wb-WC reaction on pol-β was exoergic, corroborating previous structural studies 17 . The wb-WC reaction equilibrium in T7-pol was not affected compared to the DNA duplex in water.
We introduced the wb-WC reaction into the classical model of initial selection in trans- It reveals possible constraints on the codon-anticodon decoding imposed by the exoergic wb-WC tautomerization reaction: the equilibration of the reaction counteracts the equilibration of the open↔closed transition. These constraints provide a solution to the long-standing problem of seemingly suboptimal decoding mechanism pictured by the induced-fit model 26 .
We speculate that the model derived here for the codon-anticodon decoding in translation can also be applicable for the base pair recognition in DNA replication. We discuss specific predictions of our model, as well as relations to the previous models and available experimental data in Supplementary Discussion. We hope that our theoretical and computational results might be a next step towards a more complete model of substrate recognition.
Höchstleistungsrechnen (HLRN) (allocation code hhc00026), and by Hummel cluster at the University of Hamburg.

Supporting Information Available
Detailed methods, a detailed discussion of the new model, derivations of the analytical solutions of the model, Tables S1-S3 and Figures S1-S12 are available in the Supplementary Information.

Data availability
Optimized base pair geometries, coordinate frames of the NEB trajectory, initial system topologies and coordinates in the QM/MM US simulations, as well an example setup of the

Code availability
Python and Tcl scripts for the umbrella sampling setup, dielectric permittivity calculations, kinetic modeling and visualizations are available on GitHub: https://github.com/and-kaz/wbwc_paper