DNA opening during transcription initiation by RNA polymerase II in atomic detail

RNA polymerase II (RNAP II) is a macro-molecular complex that synthesizes RNA by reading the DNA code, a process called transcription. During transcription initiation, RNAP II opens the double-stranded DNA to expose the DNA template to the active site. The molecular interactions driving and controlling the DNA opening are not well understood. We used all-atom molecular dynamics (MD) simulations to obtain a continuous atomistic pathway for the DNA opening process in human RNAP II. To achieve such large-scale and highly nonlinear transition, we steered the MD simulations along a combination of collective variables involving a guided DNA rotation and a set of path collective variables. The simulations reveal extensive interactions of the DNA with three protein loops near the active site, namely the rudder, fork loop 1, and fork loop 2. According to the simulations, these DNA–protein interactions support DNA opening by attacking Watson-Crick hydrogen bonds, and they stabilize the open DNA bubble by the formation of a wide set of DNA–protein salt bridges.


Introduction
Transcription of DNA to RNA is catalyzed by RNA polymerases (RNAPs), a cornerstone of the central dogma of molecular biology. 1 In eukaryotes, RNAP II carries out the synthesis of coding RNAs and of many non-coding RNAs. Transcription involves three main steps: initiation, elongation and termination. To trigger initiation, the 12-subunits RNAP II first assembles with general transcription factors to form the pre-initiation complex (PIC). 2 Within the 12 RNAP II subunits, RNA polymerase subunits 1 and 2 (RPB1 and RPB2, respectively, Fig. 1A) form the cleft and the active site. Several loops protrude from the two large subunits (Fig. 1A), which are well conserved among eukaryotes, including the rudder (in RPB1), fork loop 1 (FL1, in RPB2), and fork loop 2 (FL2, in RPB2). 3,4 During initiation, these loops are in proximity with the DNA as the transcription bubble forms. The architecture of RNAP II and the mechanism of transcription initiation have been described in several excellent reviews. 5,6 Structural studies provided snapshots of the two end states of the PIC during transcrip- Previous molecular dynamics (MD) simulations focused on the elongation step of transcription [17][18][19][20][21][22][23] and on the clamp dynamics during initiation in bacterial RNAP. 24 A recent coarse-grained MD study addressed DNA melting by inserting DNA base mismatches. 25 However, DNA opening has not been simulated with atomistic models or without DNA base mismatches.
In this work, we used MD simulations to obtain a continuous opening transition from the (B) Overlay of the DNA in CC and OC, taken from structures 5IY6 and 5IYB respectively. 9 The DNA region involved in the DNA bubble formation is highlighted in green. (C) DNA sequence simulated in this work, corresponding to the DNA sequence found in 5IYB. DNA numbering according to Ref. 9, where +1 refers to the transcription start site in the OC structure.
CC to the OC in atomic detail. Because the CC-to-OC transition involves conformational rearrangements on the scale of several nanometers, obtaining such transition by brute-force MD simulations is computationally prohibitive. Therefore, we used steered MD simulations 26,27 along a set of collective variables (CVs) to drive DNA opening and to enhance the sampling along the DNA opening pathway. Our CC-to-OC simulation provides insight into the spatial rearrangements of the DNA and of the protein loops during initiation, and they reveal extensive polar interactions of the DNA with the rudder, FL1, and FL2. These observed interactions suggest roles of the protein loops in supporting DNA strand separation and in stabilizing the transcription bubble in the OC.

Results
Steering a 55Å-conformational transition with a combination of collective variables Upon forming the transcription bubble, DNA carries out a transition involving a rotation of the DNA double strand by ∼370 • as well as a translation of the DNA strands by up to 55Å relative to the protein. 9 Simulating such large-scale, nonlinear conformational transitions in atomic detail imposes considerable challenges. One possible strategy for favoring these largescale motions is to introduce base mismatches between the two DNA strands, as used for obtaining the OC cryo-EM structure by He et al. 9 or used for favoring DNA melting in coarsegrained MD simulations. 25 In contrast to these previous studies, we simulated DNA opening without base mismatches, according to the biological relevant state of the CC (Fig. 1C). We obtained a relaxed pathway of DNA opening with a combination of two methods. First, we obtained an initial pathway using steered MD simulations along a combination of three collective variables (CVs); second, the initial pathway was relaxed using the path collective variable (PCV) method. 28 To guide the opening pathway, steered MD simulations were carried out along a combination of the following three CVs: (i) a rotational CV applied to the downstream DNA helix, thereby driving the melting of the DNA strand ( Fig. 2A, ξ 1 ); (ii) two CVs given by the root mean-square distance (RMSD) of the sugar-phosphate backbone of the DNA relative to the conformation in the OC, taken from the 5IYB structure (ξ 2 and ξ 3 ). 9 Figure 2A illustrates the evolution of the rotational CV ξ 1 and of the RMSD-based CV ξ 3 . By pulling along these CVs we obtained an initial path of DNA opening (Movies S1 and S2).
A path collective variable (PCV) for steering and relaxing the DNA opening path Because our steered MD simulations were carried out on much shorter time scales compared to experimental time scales, it is reasonable to believe that the initial path is still biased by non-equilibrium effects. To relax the conformations along the opening pathway and, thereby, to mitigate such non-equilibrium effects, we applied the PCV method. 28   base pairs (bp), in reasonable agreement with the length of 13 bp in the reference structure by He et al. 9 We quantified the spatial extension of the bubble with the average distance  Fig. 1. Water molecules, sodium ions, and chloride ions are colored in blue and white, pale pink, and pale green respectively. Most of water molecules and ions have been removed for clarity. (C) Section of the PIC in OC obtained from steered MD simulations. Open DNA from 5IYB structure is shown for reference. The transcription bubble is colored in pale green. d bb between the 12 disrupted base pairs. In the unbiased 200 ns simulation of the OC, we obtained d bb = 2.36 nm (Fig. S1), in reasonable agreement with the value of 2.67 nm in the reference structure. Minor structural differences relative to the reference OC are expected because (i) the DNA bubble is flexible and (ii) RNAP II in the OC accommodates various DNA bubble lengths and widths during initiation and, more generally, during the entire transcription process. 29 Overall, the stability of the DNA transcription bubble in our free simulation implies that the previous pulling simulation represents a complete DNA opening pathway. Figure 2A and Movies S1 and S2 provide an atomic view on the large DNA rearrangements. First, due to the clockwise rotational motion carried out by the downstream DNA, the DNA became underwound in the transcription bubble region. Second, due to the translational motions induced on the transcription bubble towards the active site, the DNA bent at the transcription bubble region. These two topological changes of DNA led ultimately to the disruption of 12 bp. DNA rotational angles, bending angles and RMSD relative to the reference OC are depicted in Fig. 2A for four snapshots of our DNA opening trajectory. This interplay between negative supercoiling (clockwise rotational motion of DNA), bent DNA and base pair disruptions have been reported previously in DNA minicircles. [30][31][32][33] In addition, negative supercoiling has been shown to promote DNA opening during transcription with minimal transcription factors, 34 further corroborating that our simulations reflect experimentally relevant conditions. Together, we obtained a simulation protocol that    To get additional insights into the role of DNA-Protein interactions during DNA opening, and to identify selection pressure on the three key residues mentioned above, we analyzed the residue conservation of FL1 and FL2 among six eukaryotic organisms by means of multiple sequence alignments. Overall, FL1 and FL2 are strongly conserved among eukaryotes demonstrating their critical biological roles (Fig. 4E). However, Gln-461 is not conserved among eukaryotes, suggesting that the DNA-Gln-461 interactions observed in our simulations is either not critical for DNA opening or may be replaced with other interactions. In contrast, Lys-458 is well conserved among the analyzed eukaryotes (Fig. 4E)  To rationalize the energetic driving forces for DNA opening, we monitored the potential energy from DNA-DNA, DNA-Protein, and DNA-Water interactions (Fig. 5A). Here, potential energies were taken as the sum of Lennard-Jones and short-range Coulomb interactions, averaged over 50 ns of simulation and normalized relative to the state of the CC.
The loss of interactions between the DNA strands is primarily compensated by a large gain of DNA-Protein interactions, as evident from the large negative DNA-Protein potential energies (Fig. 5A, orange and red). Although DNA opening leads to an increase of DNAwater interactions, as expected from the formation of H-bonds between water and the WC edge (Fig. 4A, blue), DNA-water interactions (Fig. 5A, blue) play a much smaller role as compared to DNA-Protein interactions (Fig. 5A, red).
To quantify which type of DNA-Protein interactions drive DNA opening, we further analyzed the number of contacts of the DNA bubble region with different groups of amino acids of common physicochemical properties (Fig. 5B). Evidently, the DNA forms ∼50 new contacts with basic protein residues, far more compared to contacts with polar or acidic residues. This finding reflects that RNAP II cleft is highly positively charged which helps to attract the negatively charged DNA backbone deeper into the cleft and, in particular, into the active site. This finding demonstrates, not surprisingly, that electrostatic interactions between the DNA and RNAP II are the key energetic driver for transcription bubble formation.  To corroborate the relevance of the DNA-protein interactions observed in our simulations for stabilizing the transcription bubble, we inspected the conservation of the residues mentioned above with sequence alignments. Accordingly, Lys-494 (Fig. 4E, FL2) and Arg-222 ( Fig. 6E, RPB2) are invariant while Lys-413 (Fig. 6E, RPB2) is well conserved, supporting their biological relevance. Gln-456 (Fig. 4E, FL1) and Arg-416 (Fig. 6E, RPB2) are largely conserved except in Dictyostelium discoideum and in Saccharomyces cerevisiae respectively, underlining their putative role in stabilizing the transcription bubble. In contrast Arg-327 ( Fig. 6E, rudder) is not conserved but may be replaced with Thr or Gln; however, all those residues are capable of forming H-bonds with the DNA backbone and, thereby may all stabilize the open bubble. Finally, Lys-457 (Fig. 4E, FL1) is not conserved suggesting that this residue is less critical for stabilizing the transcription bubble. Together, our simulations show extensive interactions between the PIC and DNA that stabilize the DNA bubble. In the light of the sequence alignments, many of these interactions are critical, whereas some may be replaced by other interaction.

Discussion
We have presented the first all-atom simulation of a continuous DNA opening transition Mutagenesis experiments targeting the fork loops and the rudder have been carried out in archaeal RNAP II and have revealed that the rudder helps in stabilizing the melted DNA in the OC. 39 In addition, these experiments suggested that FL2 and, in particular Arg-451 -the archaeal equivalent of Arg-491 mentioned in this work-, plays a role in unwinding downstream DNA during elongation. However, mutagenesis of FL1 did not impact DNA opening in archaeal RNAP II in this study. Whereas archaeal and human RNAP II exhibit high sequence conservation, the physiological temperature in which DNA opening occurs may strongly differ, which might influence DNA melting. Indeed, a temperature of 70°C was used in the permanganate footprinting experiment by Naji et al. compared to a temperature of 37°C expected for human physiological conditions. 39 The same kind of mutagenesis experiments in eukaryotic RNAP II system, ideally for human RNAP II, would be highly interesting to confirm that FL1, FL2, and the rudder are essential for DNA opening.
Simulating and relaxing such a large-scale roto-translational conformational transition with atomic MD force fields is computationally challenging. A recent coarse-grained MD study introduced base pair mismatches in the transcription bubble region to favor DNA opening. 25 However, if DNA melting occurs without simultaneous rotation of the downstream DNA, high DNA strains in the upstream or downstream DNA region emerge, which is incompatible with the open DNA conformation from experimental structures. 8,9,40 Therefore, in this work, we used steered MD simulations along a combination of three CVs to guide the large-scale displacement of DNA by up to 55Å simultaneously with DNA rotation by ∼ 346°.
Henceforth, we used the PCV framework to relax the initial opening pathway from steered MD. Notably, we tried to compute the potential of mean force along the PCV with umbrella sampling 41 or metadynamics, 42 with the aim to obtain an estimate for the free energy of DNA opening; however, we observed considerable hysteresis problems, suggesting that it is difficult to sample all the degrees of freedom orthogonal to the PCV. This observation further implies that our simulations provided a plausible pathway for DNA opening, but not necessarily the minimum free energy pathway. Future simulations may investigate the use of additional enhanced sampling techniques such as bias-exchange umbrella sampling 43 or the use of extensive compute power. 44 In eukaryotes, the transcription factor TFIIH is involved in, both, translocase activity and DNA opening. [45][46][47][48] However, it has been suggested that translocase activity is not necessary for RNA transcription 49 and that TFIIH-independent DNA opening is possible in yeast RNAP II 12 or in human RNAP II under negative supercoiling conditions. 34 In this study, we used rotational and translational CVs to model the motions induced by negative supercoiling in TFIIH-independent DNA opening. However, since TFIIH also produces torsional stress to the downstream DNA, our current protocol for DNA opening will be useful to study DNA opening in presence of TFIIH. Simulations with TFIIH will be particularly relevant to understand the role of its XBP subunit, the TFIIH subunit containing the translocase activity and the motor for DNA unwinding. 47 The transcription factor TFIIB contains the B-reader and B-linker elements, which also help DNA opening. 10,50 Because refined atomic models of the B-reader and B-linker were not resolved in the CC structure by He et al., 9 we simulated the CC-to-OC transition in the absence of these TFIIB elements. Further atomistic simulation will be needed to investigate whether the B-reader and B-linker support DNA opening using similar interaction motifs as observed here for FL1, FL2, and for the rudder. atomic coordinates for the CC were obtained from the Protein Data Bank (accession code 5IY6 9 ) from which we removed TFIIH and TFIIS. We used YASARA 54 version 20.8.23 to add acetyl and N-methyl amide capping groups at the ends of the missing protein regions and at the C and N termini. We also used YASARA 54 to add missing atoms. The system was solvated with TIP3P water molecules 55 and Na/Cl counter ions were added to neutralize the system with a salt concentration of 100 mM. In total, the system contained 832078 atoms.
The OL15 56 force field was used for the DNA. The ff14sb 57 force field was used for the protein except for the zinc(II)-coordinating Cis and His residues, for which the improved parameters by Macchiagodena et al. 58 were used.
Electrostatic interactions were computed with the particle-mesh Ewald method, 59 using a real-space cutoff at 1 nm and a Fourier spacing of 0.16 nm. Dispersion interactions and shortrange repulsion were described with a Lennard-Jones potential with a cutoff at 1 nm. Bonds and angles of water were constrained with the SETTLE algorithm 60 and bonds involving other hydrogen atoms were constrained with LINCS. 61 To remove atomic clashes, the system was energy minimized with the steepest-descent algorithm. We next equilibrated the system under NVT conditions for 100 ps at 300 K using the velocity-rescale thermostat 62 with one heat bath for the coordinated ions, DNA and protein and another heat bath for water and counter-ions. Then, we equilibrated the system at 1 bar for 10 ns under NPT conditions using Parrinello-Rahman 63 pressure coupling and using the same thermostat as in the NVT equilibration. During both equilibration steps, all heavy atoms were position restrained with a force constant of 1000 kJ mol −1 nm −2 .
To enable the use a 4 fs time-step for further pulling simulations we used hydrogen mass repartitioning 64 (HMR). Accordingly, to increase the oscillation period of the bond angles involving hydrogen atoms, the hydrogen masses were scaled up by a factor of f H and the heavy atom masses connected to hydrogens were scaled down, while keeping the overall masses of chemical moieties constant. To choose a scaling factor f H that yields stable simulations at a 4 fs time step, we tested scaling factors from 2 to 3 in steps of 0.2, where three simulations were carried out for each scaling factor. Each simulation was carried out for 20 ns in NPT conditions. None of the simulations with a scaling factor of 2.8 or 3 were stable, whereas all other simulations were stable. For production simulations, we decided to use f H = 2.5.
To exclude that HMR leads to excessive energy drift, we carried out three NVE simulations with f H = 2.5 using 4 fs time step for 500 ps and, for reference, three NVE simulation without HMR using a 2 fs time step. On average, we obtained an energy drift of 0.06% ns −1 with HMR, which was even smaller than the average value of −0.13% ns −1 without HMR (Table S1). Hence, integrating Newton's equations of motion with HMR models was numerically stable and exhibited only a marginal energy drift.

Simulation of initial DNA opening pathway
We generated an initial path from the CC to the OC with a steered MD simulation of 175 ns using a combination of one rotational CV and two RMSD-based CVs. As the first CV, we used a rotational CV defined as ξ 1 = 1/4 4 n=1 γ i (X; X 0 ), where each dihedral angle γ i was defined as: Here, u axis denotes the helix axis of the DNA in region +3 to +23 (Fig. S2A). X 0 is the configuration at t = 0 ns, and X is the configuration at a later simulation time t. Further, v i (X 0 ) denotes the vector connecting the two centers of mass (COMs) c i,1 (X 0 ) and c i,2 (X 0 ) at t = 0 ns (Fig. S2B), and v i (X) is the instantaneous vector connecting the same COMs at simulation time t (Fig. S1C). The two groups of atoms used to define c i,1 and c i,2 , respectively, were constructed by splitting the helix DNA region +3 to +23 along the axis, as illustrated in Figs. S2B and S2C. The four v i (X 0 ) defining ξ 1 are depicted in Fig. S2D.
As a second CV, we used ξ 2 = ∆(X, X OC1 ). Here, X OC1 denotes the DNA backbone atoms of the OC in the region −17 to −5, taken from the 5IYB structure, 9 and ∆(X, X OC1 ) denotes the RMSD of the instantaneous structure X relative to the reference structure X OC1 .
As a third CV, we used ξ 3 = ∆(X, X OC2 ). Here, X OC2 denotes the backbone atoms of the OC in the region −17 to +2, again taken from the 5IYB structure.
During the 175 ns of steered MD simulation, we applied different forces on the three CVs described above: 1. The rotational CV (ξ 1 ) was pulled from 6.27 rad (close to 2π) to 0.01 rad over the first 100 ns using a force constant of 7000 kJ mol −1 rad −2 . Over the next 4 ns, the force applied on ξ 1 was turned off by linearly decreasing the force constant from 7000 kJ mol −1 rad −2 to 0 kJ mol −1 rad −2 .
2. The RMSD relative to X OC1 (ξ 2 ) was pulled from 2.35 nm to 0.4 nm over the first 50 ns using a force constant of 10,000 kJ mol −1 nm −2 . Over the next 50 ns, ξ 2 was pulled from 0.4 nm to 0 nm using a force constant decreasing linearly from 10,000 kJ mol −1 nm −2 to 0 kJ mol −1 nm −2 .
3. The RMSD relative to X OC2 (ξ 3 ) was pulled from 1.99 nm to 0 nm between simulation times of 50 ns and 100 ns, using a linearly increasing force constant between 20,000 kJ mol −1 nm −2 and 30,000 kJ mol −1 nm −2 . Over the next 75 ns, ξ 3 was restrained at 0 nm using a force constant of 30,000 kJ mol −1 nm −2 .

Relaxation of the initial DNA opening pathway
In order to relax and sample intermediate states along the opening pathway, we first used constant-velocity pulling from the CC to the OC with a Path Collective Variable 28 (PCV).
The two components of a PCV are S path and Z path are defined as: where the unitless S path describes the progression along the path and Z path describes the deviation from the path. N denotes to the number of reference configurations defining the DNA opening path. The distance metric M i (X) = ∆ 2 (X, X i ) is the mean squared deviation (MSD) of the instantaneous configuration X relative to the reference configuration X i . The choice of the N reference configurations was optimized to obtain similar MSDs between neighboring configurations and a good flatness of the surface spanned by the N ×N MSD matrix. 28,65,66 The symbol λ is the smoothing parameter, proportional to the inverse of the MSD between adjacent reference configurations.
To relax the initial path along Z path , we performed two rounds of constant-velocity pulling along S path , while applying a wall potential on Z path acting above Z path (X) = 0.035 nm −2 .
For the first relaxation round, we took N = 72 reference configurations from the initial path, set λ = 21.7 nm −2 , and we carried out 100 ns of constant-velocity pulling along S path from 1.1 to 71.3, corresponding to configurations close to the first or last reference configuration. We

Simulation analysis
Hydrogen bonds were defined with a cutoff distance of 0.35 nm between the hydrogen atom and the H-bond acceptor, and with a cutoff angle hydrogen-donor-acceptor of 30°. The base pairs +2 and +3 exhibited disrupted hydrogen bonds but were mismatched with other bases in the downstream DNA fork; hence, during the analysis, we did not consider these bases as part of the open DNA bubble. Contacts were defined with a cutoff distance of 0.3 nm. The potential energies were computed as the average of the sum of Lennard-Jones and short-range Coulomb interactions with a cutoff at 1 nm. Simulation trajectories were visualized with PyMOL 67 and VMD. 68 Images were generated with PyMOL.

Conformational stability of the OC
To test the stability of the OC obtained from the second relaxation round, we performed a free simulation of the OC, i.e., without any biasing potential. To this end, a 200 ns simulation was started from the final configuration of the second relaxation round. We quantified the stability of the OC by monitoring the distances between the 12 disrupted base pairs of the transcription bubble (Fig. S3). The 12 distances were computed with the center of geometries of the heavy backbone atoms of the two complementary nucleotides. The 12 distances were then averaged in each time frame. For reference, the same protocol was used to compute the average distances between the 13 dissociated base pairs in the reference OC (5IYB 9 ).

Multiple sequence alignments
RPB1 and RPB2 protein sequences were taken from the UniProtKB/Swiss-Prot 69 database.
The organisms were chosen to cover different kingdoms of the eukaryotic domain. The alignments were carried out with CLUSTAL W 35 version 2.1.