Identification of novel RNA Polymerase II CTD interaction sites on the mRNA Capping Enzyme

Recruitment of the mRNA Capping Enzyme (CE/RNGTT) to the site of transcription is essential for the formation of the 5’ mRNA cap, which in turn ensures efficient transcription, splicing, polyadenylation, nuclear export and translation of mRNA in eukaryotic cells. The CE is recruited and activated by the Serine-5 phosphorylated carboxyl-terminal domain (CTD) of RNA polymerase II. Through the use of molecular dynamics simulations and enhanced sampling techniques, we provide a systematic and detailed characterisation of the human CE-CTD interface, describing the effect of the CTD phosphorylation state, length and orientation on this interaction. Our computational analyses identify novel CTD interaction sites on the human CE surface and quantify their relative contributions to CTD binding. We also identify differences in the CTD binding conformation when phosphorylated at either the Serine-2 or Serine-5 positions, thus providing insights into how the CE reads the CTD code. The computational findings are then validated by binding and activity assays. These novel CTD interaction sites are compared with cocrystal structures of the CE-CTD complex in different eukaryotic taxa, leading to the conclusion that this interface is considerably more conserved than previous structures have indicated.

studies identify the differences between the pSer5 and pSer2 CTD interactions with the GTase that could explain why the activation effect is observed with the pSer5 CTD but not the pSer2 CTD (12). Therefore, there are a number of outstanding questions in the field: i) how does the CTD phosphorylation code affect CTD binding to the GTase?; ii) what additional GTase-CTD interactions are required for GTase activation; iii) how does pSer5 CTD binding to the GTase illicit GTase activity stimulation?
Computational techniques such as molecular dynamics (MD) simulations are well posed to answer these open questions and generate a more comprehensive characterisation of the GTase-CTD interaction (33). MD simulations are increasingly used in the characterisation of protein conformational dynamics and protein-peptide interactions, including energetics. Recent studies highlight the application of MD simulations to understand the conformational ensembles of protein systems, protein allostery and protein-peptide interactions (35)(36)(37)(38)(39)(40). It must be mentioned that atomistic MD simulations are computationally expensive, typically limiting simulations to nanosecond-microsecond timescales (41). Since many events in protein systems occur on longer timescales, enhanced sampling techniques have been developed and successfully applied to overcome these limitations and sample longer-timescale processes (42,43). Accelerated molecular dynamics (aMD) is one such technique that increases the conformational sampling of a system by reducing the depth of free-energy minima while maintaining the characteristics of the energy surface (44)(45)(46)(47)(48).
Here we carry out a large-scale computational study by performing both conventional MD (cMD) and accelerated MD simulations to assess the conformational dynamics of the human CE GTase and provide a systematic and detailed characterisation of its interaction with the CTD in different phosphorylation states. We identify two novel CTD interaction sites on the human CE GTase surface. These sites are predominantly conserved throughout animals and yeasts, indicating that the core features of the GTase-CTD interface have undergone considerably higher selection pressure than previously recognised. In addition, we propose that the GTase-CTD interaction is bidirectional and recognise the palindromic nature of the CTD. 4 3 Methods

System preparation
The 3.0Å resolution crystal structure of the human CE GTase (residues 229-565) (10) Table   1 provides a summary of all the simulations presented in this work, with details of each system setup described below. All current crystal structures of the mammalian GTase miss portions of the β 2-αD loop (residues 425-433). This was modelled with ModLoop using the MODELLER loop modelling procedure (50).
To simulate the CE GTase-CTD complex, the mouse GTase-CTD complex structure and human apo-GTase structure were aligned and the 1-heptad CTD fragment was superimposed onto the human CE GTase. To model the 4-heptad systems, the PEP-FOLD3.5 server (51) was used to generate the starting peptide structures of 3 additional heptads (21 residues). These 3 heptads were then fused onto the 1-heptad CTD resolved in the mouse GTase-CTD cocrystal structure, either onto its C-or N-terminus (Supplementary Table 1). Unique starting conformations were used for each replicate of the simulation by selecting different PEP-FOLD3.5-generated structures.
All simulations were prepared within the LEaP module in the AMBER16 suite (52) using the 5 ff14SB force field (53), with phosphoserine modifications described by Homeyer et al. (54). All protein and peptide chains were capped with acetyl (ACE) and amino (NME) groups on the N-and C-ter respectively, and Reduce was used to protonate all residues in their standard protonation state at neutral pH (55). All CTD phosphoserines were modelled in the −2 charge state. Simulations of the 4-heptad, pSer5 CTD system were also performed with the pSer in the −1 charge state and show comparable qualitative behaviour, though with a weaker interaction and reduced stability of sites (data not shown). The protein was then placed in an octahedral box of TIP3P water molecules extending at least 15Å from the protein. The system was neutralised by balancing the charge with the appropriate number of Na + or Cl − counter ions. Finally a combination of steepest descent and conjugate gradient energy minimisation was performed.

Simulation setup and protocols
All standard simulations were performed using the pmemd.cuda module of AMBER16 (52). After energy minimisation, the system was heated from 100 K to 310 K over 25 ps, restraining the solute.
Equilibration was performed for 200 ps with the solute restraints gradually removed. After 200 ps of equilibration, hydrogen mass repartitioning was performed and the step size was increased from 2 to 4 fs for the production runs (56). Berendsen barostat and thermostat were used to keep pressure and temperature constant (1 atmosphere and 310 K) during the simulations (57). The non-bonded interaction cutoff distance was set to 10.0Å and the SHAKE algorithm used to restrain hydrogen bond lengths (58). To reduce neutralising counterion clustering around the phosphate groups of the CTD, a 20Å distance restraint (k = 20.0 kcal/mol · A) was imposed between all sodium counterions and phosphorus atoms of the CTD phosphoserines. Three replicates of each production run were performed by randomly generating the starting velocities.
aMD runs were performed with the AMBER16 implementation using the 'dual-boost' protocol as described previously (46,47,52,59,60). Briefly, this applies a potential energy boost to all atoms and an additional dihedral boost to torsion angles. The mean potential and torsion energies of each system was calculated from the last 50 ns of each 200 ns cMD replicate. These were then 6 used to calculate the aMD parameters (E P , α P , E D , α D ) based upon the guidelines described by Pierce et al. (46).

Data analysis
VMD and PyMOL were used to inspect and visualise the trajectories (49,61). Analysis of the MD trajectories was performed primarily in the CPPTRAJ module of the AMBER16 suite to compute interatomic distances, solvent exposure, root-mean-square deviations (RMSDs) and root-meansquare fluctuations (RMSFs) (52). Interatomic distances between CTD residues and the GTase residues were computed using the closest CTD residue from any of its heptads. All trajectories were analysed by using frames saved every 40 ps. Electrostatic potentials were generated using the Adaptive Poisson-Boltzmann Solver (APBS) implemented in the PyMOL APBS tools (62).
Normal mode analysis was performed using the ElNémo web server using the default settings (63).
Multiple sequence alignments were performed in Jalview using the Clustal Omega algorithm with default settings (64,65), selecting only reviewed protein sequences from the NT domain InterPro family (IPR001339).
The binding free energy analysis was performed using the Molecular Mechanics-Generalised Born Surface Area (MMGBSA) method using the MMPBSA.py package (66) and following the protocol described by Genheden et al. (67). The final snapshots of the aMD simulations were taken from the 3 replicates of each system. These snapshots were used as starting structures for 50 x 200 ps simulations. MMGBSA analysis, including per residue decomposition, was then performed using snapshots from these simulations with an 8 ps time step. In silico mutagenesis was performed on the final aMD snapshots followed by 50 ns of unrestrained equilibration. The MMGBSA protocol was then performed on the mutant structures as described above.
An additional search for potential CTD binding sites on the CE GTase was performed with the PIPER-FlexPepDock global protein-peptide docking server (68). Default settings were chosen for all conditions. The server does not accept non-standard residues, therefore, glutamates were used as phosphomimetics to replace the CTD phosphoserines.
Disorder prediction of the CE sequence was performed using the MetadisorderMD2 server (69).

Expression and purification of recombinant proteins
The DNA sequences of the human CE GTase and each sequence variant were synthesised and subcloned into the PGEX6p1-C-His plasmid vector by Thermo Fisher GeneArt. The PGEX6p1-C-His vector contains an N-terminal HRV 3C cleavable tag and a C-terminal hexahistidine tag.
These plasmids were then transformed into BL21 (DE3) E. coli and were cultured in 200 mL of Power Broth (Molecular Dimensions) at 37°C until A600 was between 0.6-0.8. Protein expression was induced with 1mM IPTG overnight at 16°C. Cells were pelleted and frozen before protein purification at −80°C. The cells were lysed in 5 mL of lysis buffer (50 mM Tris-HCl, pH 7.5, 500 mM NaCl, 30 mM imidazole, 1 mM TCEP, 0.2% Tween and 5 units/ML Benzonase nuclease) and sonicated for 10 minutes with 10 second pulses. The GTase was purified with metal affinity chromatography, through a 1 mL HisTrap HP column (GE Healthcare) and eluting with 350 mM imidazole. The GST was cleaved with GST-tagged HRV 3C protease (PreScission Protease, GE Healthcare). The GST and protease was removed with glutathione sepharose resin. Further purification was performed with size exclusion chromatography on a Superdex 75 10/300 GL column (GE Healthcare), resolving in a buffer of 20 mM Tris HCl (pH 7.5), 200 mM NaCl and 1 mM TCEP. Aliquots were stored with 10% glycerol. Purity was assessed by SDS-PAGE and Coomassie Blue protein staining and all recombinant proteins were tested for basal GTase activity as described below (Supplementary Figure S7).

CTD pull down assays
GTase-CTD peptide binding assays were performed as described by Ho et al. (12). 1 nmol of biotinylated 4-heptad CTD peptides (PeptideSynthetics) were incubated with 0.5 mg of Streptavidin-8 coupled magnetic Dynabeads M-280 (Invitrogen) in 300 µL of buffer A (25mM Tris-HCl, pH 8, 50 mM NaCl, 1 mM TCEP, 5% glycerol and 0.03% Triton X-100) for 45 minutes at 4°C. Next, the beads were magnet concentrated and washed three times with 0.5 mL of buffer A. 4 µg of the purified GTase sample was then incubated with the beads in 50 µL buffer B (Tris-HCl, pH 8, 53mM NaCl, 1 mM TCEP, 5% glycerol and 0.03% Triton X-100) for 45 minutes at 4°C. After incubation, the solution was collected as the unbound fraction, the beads were washed three times with buffer A and the bound fraction was eluted with 50 µL of SDS-PAGE loading buffer at 100°C for 5 minutes. Fractions were concentrated and analysed with SDS-PAGE and Coomassie Blue staining. Bands were quantified in ImageJ and normalised relative to the wild-type CE GTase (residues 211-597) incubated with the Ser5 phosphorylated CTD peptide.

Guanylyltransferase activity assays
Guanylyltransferase activity assays were performed as described by Ghosh et al. run on an SDS-PAGE gel. The gels were fixed with 30% methanol and 5% acetic acid, stained with Coomassie Blue and exposed to a phosphorimaging plate for 1 hour. The plates were scanned using an Amersham Typhoon phosphorimager with the bands quantified in ImageJ and normalised relative to the basal wild-type CE GTase (residues 211-597) activity.

Results and Discussion
The human CE GTase exhibits different conformational dynamics from the viral enzyme To our knowledge the human Capping Enzyme guanylyltransferase domain (CE GTase) has not been simulated before. Therefore, our first aim was to assess the conformational dynamics of the  (Figures 2A and 2B). This confirms that the two enzymes indeed exhibit strikingly different global dynamics.
To further characterise these large-scale conformational changes Normal Mode Analysis (NMA) was performed on the human and PBCV-1 GTase structures ( Figures 2C and 2D). The NMA results provide additional support to the above result, showing that for both structures the lowest frequency modes involve the domain opening and closing motion. However, this mode differs significantly between the two proteins. In the human CE GTase it is a rotation of the OB and NT domains relative to each other ( Figure 2C). In contrast, for the PBCV-1 CE GTase the lowest frequency mode shows a straight open-close motion ( Figure 2D). These differences in the global conformational dynamics are likely a result of the number of salt bridges that are able to form between the NT and OB domains ( Figures 2E and 2F). In the human CE GTase, there is a complex network of salt bridges which hold the domains in the closed state, whereas in the PBCV-1 CE GTase no more than three salt bridges can be observed at any point during the simulations. Interestingly, a number of residues involved in salt bridge formation between the NT and OB domains in the human CE GTase-namely K460, D468, R528, R530, D532 and K533-have also been shown to be important residues for GTP and mRNA binding and mammalian CE GTase catalysis (10,71,72). However, the enzyme kinetics of the CE GTase have only been characterised in PBCV-1 and not the human CE GTase (73). The observed dramatic differences in the domain opening/closing dynamics between the two proteins suggest that the kinetics of the human enzyme will be significantly altered compared to that of the viral enzyme.
The CTD forms an extensive interaction with the CE GTase, including two novel sites The interaction between the CE GTase and the C-terminal domain of RNA Polymerase II (CTD) is essential for GTase activation and CE recruitment to the site of transcription (12,15,17). It has previously been shown that interactions with multiple heptads are required for GTase activation (12). As a starting point for understanding the interaction between the human CE GTase and the CTD, we initially carried out MD simulations of the GTase in the presence of 1 heptad of the CTD.
These simulations were started from the CTD conformation and phosphorylation state observed in the mouse GTase-CTD cocrystal structure, which resolved only 1 heptad, phosphorylated at both the Ser2 and Ser5 positions (32). Our cMD results were consistent with the previous experimental data (Supplementary Figure S2): pSer5 remains bound to the positively charged pocket formed by R330, K331 and R386 (CDS1 site) in the conformation adopted in the crystal structure; in contrast, the pSer2 sidechain remains solvent exposed and does not form stable interactions with the protein. During the aMD simulations the CTD peptide samples much wider conformational space (Supplementary Figure S2). While the pSer5 interaction remains predominantly stable, the pSer2 residue changes conformation allowing it to also occasionally interact with the pSer5 pocket, CDS1. In addition, Tyr1 exhibits a greater extent of conformational flexibility, dissociating and rebinding to the tyrosine binding site (CDS-Y1).
Activation of the mammalian GTase strongly depends on the length of the CTD it interacts with, with the activation effect increasing 3-fold from 2 to 6 heptads (12). Activation is also dependent on the CTD being phosphorylated at the Ser5 position (12,32). This indicates that the CTD forms an extensive interaction with the GTase that requires the binding of multiple CTD heptads. Currently there are no crystal structures of the mammalian CE GTase in complex with multiple CTD heptads. In order to systematically characterise the extensive interaction between the longer CTD fragments and the human CE GTase, we extended the length of the CTD peptide to 4 heptads by modelling 3 additional heptads onto the termini of the 1-heptad CTD fragment, which was resolved in the mouse crystal structure (32), in both directions. To investigate the effect of the CTD phosphorylation code on the GTase-CTD interaction, three phosphorylation states were simulated: unphosphorylated, Ser5 and Ser2 phosphorylated (Supplementary Table S1). In each phosphorylation state the CTD peptide was extended in both the N-and C-ter directions in separate simulation systems, yielding 6 different systems, to identify interaction sites that might occur at different sides of the known CTD interaction sites (CDS1 and CDS-Y1) (Supplementary Figure S3). Three replicates were performed using different CTD starting conformations to ensure that the interactions formed were reproducible and not biased by the initial CTD conformation (Supplementary Figure S3).
Analysis of the 4-heptad pSer5 CTD simulations (Systems 6 and 7) provided valuable insights into the GTase-CTD interaction ( Figure 3). The previously reported CDS1 site remains occupied in all replicates ( Figure 3B). The CDS-Y1 interaction remains stable for the duration of the cMD simulations but becomes destabilised, dissociating and rebinding, during the aMD sim-ulations ( Figure 3D). In addition to CDS1 and CDS-Y1, our simulations identify two novel CDS sites-named CDS2 and CDS-Y2-that were not observed in the mouse crystal structure of the complex (Figure 3 and Movie S1) (32).
The first novel CDS site, CDS2, is a pSer5 interaction site composed of sidechains R358, K403 and R411 that form multiple electrostatic interactions with the phosphate group of pSer5 of the CTD (Figures 3A, 3C and Movie S2). The CDS2 site is located within a positively charged patch on β 7, β 8 and loop αC-β 8. This interaction is very stable, remaining occupied once pSer5 binds to the site, and is observed in all replicates of System 6 ( Figure 3C). In the no-CTD state, the basic residues that constitute the CDS2 site are predominantly solvent exposed and are involved Movie S3). This pocket is partially occupied by Tyr1 in all replicates, however, it represents a transient interaction and was easily destabilised during the aMD simulations ( Figure 3E). The CDS-Y2 residues are located on helix αC and loop β 8-αD. They are semi-buried within the NT-Hinge interface, reducing their solvent exposure, and interact with a number of adjacent hydrophobic residues that form part of a hydrophobic region that includes W293, Y362, I384, F416 and T445. Although there is no large-scale conformational change associated with Tyr1 binding to this site, many of these residues interact directly with the residues involved in GTP binding. Therefore Tyr1 binding to this site might have an effect on GTP binding or coordination.
Due to their electrostatic nature, the pSer5-CDS interactions (CDS1 and CDS2 sites) remain stable once formed ( Figures 3B and 3C). During aMD simulations, some individual CDS1 interactions are occasionally broken, however pSer5 remains bound to this region. Upon CDS2 binding, this interaction remains stable with only minor fluctuations. In contrast, the Tyr1 interactions are considerably less stable ( Figures 3D and 3E). CDS-Y1 remains occupied by Tyr1 for the duration of the cMD, however all replicates show Tyr1 dissociation and rebinding during the aMD stage. This is also observed with the CDS-Y2 pocket, which again represents a transient interaction, despite being occasionally observed in all replicates.
A further inspection of the previous cocrystal structure of the mouse GTase-CTD complex provides a rationale to explain why the newly identified CTD interaction sites, i.e. CDS2 and CDS-Y2, were not observed in that structure (32). The asymmetric unit of the structure forms a homodimer between two GTase domains, which was considered an artefact of crystallisation (Supplementary Figure S4). This homodimer interface forms extensive contacts on the NT domain and the hinge, occluding the CDS2 and CDS-Y2 sites, close to the bound CTD heptad. As a result, the dimer interface obstructs the CDS2 and CDS-Y2 sites, preventing CTD binding to this region.
We expect that future structural studies of the mammalian GTase will confirm CTD binding to these novel sites.
An important feature of our simulations is that although the novel CDS2 and CDS-Y2 interactions are observed reproducibly in all replicates ( Figures 3C and 3E), these interactions can occur on different heptads between the replicates (Figure 4). The CDS1 and CDS2 sites can be occupied either by adjacent CTD heptads or heptads can be looped out, with non-neighbouring heptads occupying CDS1 and CDS2. This provides evidence of the 'looping out' mechanism suggested in previous studies, which showed that the CE must interact with multiple heptads but that these do not need to be adjacent in sequence (31). The simulations also show that the order of the CDS interactions can vary. This can be seen, for example, in replicate 2 where heptad 4 dissociates from CDS-Y1 and is replaced by heptad 3, switching the order of CDS1 and CDS-Y1 ( Figure 4).
Both conformations are stable and this change does not destabilise other CDS interactions. Therefore, CDS sites can be occupied in different heptad orders as well as heptads being looped out.
During GTase recruitment the CTD is not uniformly Ser5 phosphorylated (25), and so the looping out mechanism we observe is consistent with the hypothesis that unphosphorylated CTD heptads 14 are looped out during GTase recruitment to enable the CTD to bind to all the CDS sites (31).

Phosphoserine interaction sites are critical for CTD binding to the GTase
Our simulations revealed two additional CTD interaction sites on the human CE GTase surface.
However, the contribution and importance of each CDS site to the CTD binding to the GTase remained unclear. MMGBSA is a computational technique that can be used to predict the binding free energies between binding partners, including protein-peptide complexes (see Methods for details) (66,67,74,75). In order to obtain a detailed quantitative characterisation of the GTase-CTD interaction, MMGBSA calculations were performed to assess the binding affinities and the contributions of individual residues. The MMGBSA analysis identified the main GTase residues that contribute to CTD binding ( Figure 5). Results for the 4-heptad pSer5 simulations extended in the N-ter direction (System 6) are shown in Figure 5A. As expected, the core residues comprising the pSer5 interaction sites-residues R330 and R386 of CDS1 and R358, K403 and R411 of CDS2-make the largest contributions to the GTase-CTD interaction. Notably, arginines make the most significant contribution to the binding free energy, whereas the flexibility of the CDS lysine sidechains and their position on the loops in the CE GTase make them more likely to dissociate from CTD interactions. This can be seen in CDS1 where R330 and R386 make the largest contributions to the CTD binding affinity, whereas K331 makes a relatively small contribution.
Likewise, in CDS2, R358 and R411 make the largest contributions to the binding affinity, whereas K403 makes a smaller contribution because of its location on a loop. R392 is included as a CDS2 residue, however it forms a strong interaction with pSer5 only in one replicate, whereas in the other two replicates it forms a stable salt bridge with E406 in the NT domain; this explains a large standard deviation for this residue. No other residues on the CE GTase make significant contributions to the binding affinity, confirming the central role of CDS1 and CDS2 sites in the GTase-CTD interaction.
In contrast to the pSer5 sites, the Tyr1 sites make a minor contribution to the binding affinity, with none of the CDS-Y1 or CDS-Y2 residues contributing more than −5 kcal/mol ( Figure 5A). This is consistent with their transient nature seen in the aMD distance analysis (Figures 3D and 3E).
Despite this, mutagenesis of Tyr1 to alanine has been previously shown to significantly decrease GTase binding and activation (76). This suggests that Tyr1 has an essential but more subtle role in GTase recruitment and activation.
As pSer5 interactions were found to dominate the GTase-CTD interaction, in silico mutagenesis was performed to further assess the importance of each site and to guide biochemical experiments ( Figure 5B). We constructed mutant systems in which CDS1 (R330, K331 and R386) and CDS2 (R358, K403 and R411) residues were mutated to alanine in the final frames of the aMD simulations of System 6-yielding Systems 13, 14 and 15 (Supplementary Table 1). Each system was reequilibrated for 50 ns and then MMGBSA analysis was performed ( Figure 5B). An additional system containing the dephosphorylated CTD (System 12) was simulated to provide an important reference. It must be noted that although MMGBSA results are extremely useful to compare relative binding free energy values for different systems, the absolute binding free energy values calculated must be taken with caution (67). The results show that the two pSer5 CDS sites (CDS1 and CDS2) make major contributions to the binding affinity of the CTD. When either pocket is mutated (∆CDS1 or ∆CDS2) the binding free energy is reduced by about a third.
When residues in both CDS1 and CDS2 pockets are mutated to alanine, the binding free energy is reduced further and is approximately equal to that of the unphosphorylated CTD. These results suggest that, although the Tyr1 interactions may have have an auxiliary role in GTase recruitment and activation, pSer5 CDS interactions form the basis of GTase binding to the CTD.
pSer2 CTD can bind to CDS1 and CDS2 adopting different conformations from pSer5 CTD In order to characterise the differences in the GTase-CTD interaction as the CTD code is changed, we simulated the pSer2 CTD (Systems 8 and 9) and compared its conformational dynamics and interactions to that of the pSer5 CTD (Systems 6 and 7). pSer2 is known to bind to the GTase with comparable affinity to pSer5 but does not illicit the GTase activation (12). Previous literature suggests that the Ser2 phosphorylated CTD displays non-competitive binding with Ser5 phospho-rylated CTD, therefore, the two states are expected to bind to different locations on the CE GTase surface (12). Our MD results suggest that the pSer2 CTD also readily binds to the same sites, CDS1 and CDS2 ( Figure 6 and Supplementary Figure S5). During the simulations, the CDS1 pocket is quickly occupied by pSer2 due to its close starting proximity ( Figure 6B and Supplementary Figure S5A). In addition, in one of the three replicates extended in the N-ter direction and in two of the three replicates extended in the C-ter direction the pSer2 sidechain occupies the CDS2 pocket ( Figure 6A, 6C and Supplementary Figure S5B). Once pSer2 is bound, the respective sites remain occupied for the duration of the simulations. These dynamics are similar to the pSer5 CTD.
This indicates that the CDS1 and CDS2 pockets are not specific to pSer5, and that both pSer2 and pSer5 groups can bind to them. Importantly, the conformation the pSer2 CTD adopts when binding to the CDS2 site is different from that of the pSer5 CTD ( Figure 6A). As the pSer2 residue is adjacent to Tyr1, it reduces the Tyr1 interactions with the hydrophobic CDS-Y1 and CDS-Y2 pockets ( Figures 6D and 6E, Supplementary Figures S5C and S5D). Tyr1 interactions have been implicated in CE recruitment and activation by the CTD (76). Therefore, this difference in binding mode may explain why pSer2 CTD can bind to the human CE GTase but does not stimulate GTase activity (12).

Disordered flanking domains contain positively charged regions suitable for phosphorylated CTD binding
Our simulations provide a detailed understanding of the CTD interactions with the GTase within a distance of around 3 heptads from the site reported in the mouse GTase-CTD crystal structure (32). However, they do not account for interactions that could occur in more distant regions of the human CE, such as the OB domain or within the disordered regions at the N-and C-terminal flanks of the GTase, which were not resolved in any of the crystal structures (10,32). A previous crystal structure of the S. pombe CE GTase (Pce1) displayed a Spt5 CTD docking site in the OB-fold domain (30). Given that the full-length CTD is 52 heptads in humans, it is not excluded that some fragments of the CTD also interact with other regions of the human CE, even when bound to the CDS1 and CDS2 sites (19).
The task of exploring potential binding sites of a full-length CTD on the CE GTase is unfeasible for atomistic MD. Therefore, to explore potential binding sites in alternative regions of the GTase, global peptide docking was performed using the PIPER-FlexPepDock server. Although global protein:peptide docking is challenging and often inaccurate, these techniques can give an indication of the regions a peptide can bind to and the possible conformations it can adopt. In particular, the results for the 2-heptad CTD docking to the GTase show that the pSer5 CTD peptides are localised in the NT domain in the region covering CDS1 and CDS2 (Supplementary Figure   S6B). On the other hand, the docking results for the pSer2 CTD offer a less clear picture although they still show models docked to CDS2 (Supplementary Figure S6C). These observations provide additional support to our MD findings.
So far we have focussed on the CTD binding to the CE GTase domain. However, CTD binding to other regions of the CE has not been fully explored. Biochemical assays have previously shown that the phosphorylated CTD does not interact with the TPase domain of the human CE, but the contribution of the two disordered regions that flank the GTase domain has not been examined previously (12). Unfortunately, due to the length and the disordered nature of these regions, they could not be accurately modelled or simulated by MD simulations. Inspection of the human CE sequence reveals that both the disordered TPase-GTase linker and the disordered region at the Cterminus of the GTase contain large numbers of positively charged residues that are found in several clusters, which could form positively charged sites similar to CDS1 and CDS2 (Supplementary Figure S6D). Therefore, we expect that the phosphorylated CTD can interact with these regions in addition to the sites in the GTase domain, enhancing the CE recruitment to the CTD. The Cterminal flanking region has previously been shown to be essential for the recruitment of the CE to Nck1 to enable cytoplasmic capping, further suggesting that this region plays an important role in recruitment of the CE to both the site of cotranscriptional and cytoplasmic capping (77).

Biochemical assays validate the role of the phosphoserine interaction sites for GTase recruitment
Our computational analyses provide a comprehensive picture of the GTase-CTD interaction offering a number of findings that can be tested biochemically. In particular, our results predict that: i) pSer interactions with the CE GTase form the basis of the binding affinity, with CDS1 and CDS2 sites making major contributions to CTD binding affinity, ii) mutating out both CDS1 and CDS2 reduces CTD binding affinity to a level comparable to the unphosphorylated CTD, iii) CDS1 and CDS2 are non-specific and can also bind the Ser2 phosphorylated CTD, and iv) the disordered regions flanking the GTase domain contain positively charged residues that are likely to contribute to CTD binding. In order to validate these predictions, we expressed and purified the recombi- First, we performed pull-down assays on all recombinant proteins, using 4 heptad CTD peptides that were either unphosphorylated, Ser5 or Ser2 phosphorylated on all heptads (Figures 7A-7E). The WT GTase (211-597) binds to the 4 heptad pSer5 CTD with an affinity comparable with previous literature ( Figure 7A) (12). We then compared the WT CE with the core human GTase domain (229-569) and the GTase with the disordered flanking domains (211-597) ( Figures 7A   and 7B). In agreement with our prediction that these disordered regions might be important for GTase recruitment, the protein containing the flanking regions exhibits a significantly increased CTD binding. These interactions are not pSer5 or pSer2 specific, enhancing binding for both the pSer5 and pSer2 CTD peptides. As both the CTD and these flanking regions are disordered, these additional interactions possibly represent the formation of a 'fuzzy' complex where the CTD interacts at well-ordered sites on the GTase surface (CDS sites) in addition to forming interactions with the disordered flanking regions (78,79). In this case the role of the flanking interactions is to increase GTase-CTD binding required for CE recruitment and GTase activation.
We then sought to validate the pSer5 CTD interactions observed on the GTase domain. Comparison of the binding affinity of the CDS mutations shows results consistent with our computational predictions, with the CDS1 and CDS2 sites both contributing significantly to pSer5 CTD binding ( Figures 7C and 7D, Supplementary Figure 8). When both of these interaction sites are removed, GTase binding to pSer5 CTD is at a comparable level with the WT GTase binding to the unphosphorylated CTD. This indicates that there are no other pSer5 interaction sites on the GTase.
In further agreement with our computational results, pSer2 CTD peptide binding also significantly decreases when CDS1 or CDS2 residues are mutated, with the same trend observed for the pSer5 CTD peptide (Figures 7C and 7E). This result contrasts with previous literature that showed that the pSer2 CTD binds to the human CE GTase non-competitively with the pSer5 CTD (12).
Our results also show that pSer2 CTD has a lower binding affinity than pSer5 ( Figures 7A and   7B), in contrast with previous experimental data that showed the pSer2 and pSer5 CTD peptides binding to the CE GTase with comparable affinity (12).
These results are reproducibly observed in the core GTase assays (229-569), however, without the disordered flanking regions the low binding affinity makes it difficult to quantify and distinguish the mutants (Supplementary Figure 6).
The novel phosphoserine pocket, CDS2, is essential for GTase activation Upon binding to the CE GTase, the Ser5 phosphorylated CTD has been shown to stimulate the first step of GTase catalysis (12). All crystal structures of the CE GTase-CTD interaction show that the CTD binds outside of the active site, on the NT domain, therefore this must involve an allosteric mechanism of activation. The nature of such an allosteric activation, its mechanism and importance in the regulation of mRNA capping remain unclear (26,27). To assess whether the novel CDS2 site is involved in the CE GTase activation by the CTD, we performed GTase 20 activity assays on the CDS mutant recombinant GTase proteins (Figures 7F and 7G). The activity assay quantifies the first stage of GTase activity by measuring the level of α 32 P labelled guanosine monophosphate covalently bound to the GTase active site. Incubation of the WT GTase (211-297) with the 4 heptad pSer5 CTD increases the GTase activity by 2.2-fold, consistent with previous literature (12). Removal of the CDS1 site reduces the GTase activation effect, however it still elicits activation to 1.4 fold the basal level. In contrast, mutagenesis of the CDS2 site completely inhibits GTase activity stimulation, suggesting that it has an essential role in the allosteric activation of the GTase. Mutagenesis of both the CDS1 and CDS2 sites replicates the inhibition observed in the ∆CDS2 mutant.
To explore the molecular details of allosteric activation, we compared the conformational dynamics of the GTase in its pSer5 CTD-bound and no-CTD states (Supplementary Figure S9 and S10). Comparison of the aMD simulations of these two systems shows no global changes in the GTase secondary structure (Supplementary Figures S9A and S9B), in agreement with previous crystal structures (31,32). In addition, the aMD simulations show no significant changes in the dynamics of the GTase (Supplementary Figure S10). However, upon binding of the CTD to the CDS2 pocket, some local conformational changes do occur. In particular, the conformational dynamics of loop β 9-αD, which is predominantly unstructured in the no-CTD state, favours an α-helical arrangement upon pSer5 CTD binding (Supplementary Figures S9C-S9F). This loop lies adjacent to the Mg 2+ binding site and the GTP binding pocket and therefore this structural rearrangement might alter Mg 2+ and GTP binding affinity or coordination, increasing the rate of the first step of GTase catalysis. However, a full assessment of the role of CTD binding on GTase activation is beyond the scope of this work, requiring comprehensive analysis of CTD binding at the GTP-bound and intermediate stages of GTase activation. This will be the focus of the subsequent work.

The GTase-CTD interaction sites are predominantly conserved between animals and yeasts
After identification of novel CTD interaction sites in the simulations, we checked these in the available crystal structures of the GTase-CTD complex in other eukaryotic species (S. pombe and C. albicans) (30,31). Previous research has concluded that different taxa have evolved distinct CDS binding sites on the GTase surface to recruit the CE to the site of transcription (30,32): although all current GTase-CTD cocrystal structures show that the CTD interacts with the NT domain of the CE GTase, their conformations and interaction sites differ significantly (30)(31)(32). Despite this, S. pombe and C. albicans GTase-CTD cocrystal structures share a number of conserved features, including a CDS site (CDS1) composed of residues on helix αC and loop β 7-αC and a Tyr1 interaction site in the same location between helix αC and strand β 8. Apart from these similarities, they contain additional pSer5 interaction sites that are not conserved between the two species or seen in the mouse GTase-CTD cocrystal structure. When comparing these yeast GTase-CTD interactions with the mouse GTase-CTD cocrystal structure there are no conserved interactions between them, although the CTD sites are always in nearby regions of the NT subdomain (32).
Surprisingly, when comparing the novel CDS sites observed in our simulations of the human CE GTase with that in the C. albicans cocrystal structure, we find a number of similarities. Both the novel CDS2 and CDS-Y2 sites are also observed at the same positions in the C. albicans cocrystal structure ( Figures 3A and 8A). In addition, CDS2 residues have been shown to be essential for CTD binding to the S. cerevisiae CE GTase (80). Sequence analysis comparing animal and yeast species shows that the core residues of both CDS2 and CDS-Y2 are functionally conserved across animals and yeasts ( Figure 8B). For the CDS2 site this functional conservation is not immediately apparent because, although R358 is highly conserved throughout animals and yeasts, K403 and R411 are not conserved in yeasts. However, in yeasts both are substituted with nearby positively charged residues that are highly conserved: K403 is substituted with a positively charged residue on helix αC (K178 in C. albicans) and R411 is substituted for a lysine two residues away on the same side of strand β 8 (K193 in C. albicans). These residues are not conserved in S. pombe and the CDS2 site is not observed in its GTase-CTD cocrystal structure (30), suggesting divergent 22 evolution in this branch of yeasts.
The CDS-Y2 pocket identified in our simulations is also conserved throughout animals and yeasts. The central leucine residue (L381) is highly conserved between the two. The additional hydrophobic residues that comprise the pocket are highly conserved in animals and yeasts but the specific residues in this pocket vary between the two taxa. In animals, this pocket is composed of F377, P414 and F415, in contrast to F63, F196 and M199 in yeasts.
The CDS1 pocket, although its precise location in loop β 5-β 6 and helix αC is not conserved between animals and yeasts, is in the same region of the GTase across animals and yeasts. The residues in the CDS1 pocket are highly conserved in animals, however, the CDS1 residues are poorly conserved throughout yeasts. Despite this, most species of yeast contain positively charged residues in either loop β 5-β 6, where the mammalian CDS1 residues are located, or on loop β 7-αC and helix αC, the same location as in C. albicans and S. pombe. Interestingly, R386 is highly conserved as a positively charged residue throughout both animals and yeasts. As this residue is only around 7Å from the CTD pSer5 sidechain in the S. pombe cocrystal structure, it is likely that this residue also contributes to CTD binding in yeasts. This lack of conservation of the CDS1 site is unsurprising because the majority of the CDS1 residues are located on flexible loops where the exact position of the positively charged residues is unlikely to affect efficient GTase recruitment by the CTD. The CDS-Y1 hydrophobic pocket, in contrast to the CDS-Y2 pocket, is highly conserved in animals, however, does not appear to be conserved in yeasts and it is not observed in either of the yeast GTase-CTD cocrystal structures (30,31). Therefore, although many of the central features of the GTase-CTD interaction are conserved in both animals and yeasts, there are some features that distinguish them.
The palindromic nature of the CTD code allows bidirectional binding of the CTD One striking difference between the conformations seen in our simulations and those observed in the C. albicans CE GTase-CTD cocrystal structure is that the CTD peptide is oriented in opposite directions ( Figures 3A and 8A). In our simulations CDS1 is occupied by a pSer5 of the C-terminal heptad of the CTD and CDS2 by a pSer5 of an N-terminal heptad. In contrast, the C. albicans cocrystal structure shows CDS1 occupied by the N-terminal heptad and CDS2 by the C-terminal heptad. This raised the question of whether this is a characteristic feature that is distinct between mammals and yeasts or if the CTD can bind in both directions.
To assess the stability and affinity of CTD binding to the human CE GTase in the alternative orientation, the C. albicans CTD conformation was superimposed onto the human CE GTase, extended to 4 heptads and simulations were performed as described above (Systems 10 and 11; Supplementary Figure S11). The CTD conformation remains stable for the duration of the simulations, with the same characteristics as observed in our previous simulations: the pSer5 pockets (CDS1 and CDS2) form strong interactions and the CDS-Y2 site is more transient (Supplementary Figure 7). Thus the pSer5 CTD can bind to the same sites in both orientations, although the CTD with the "mammalian" orientation has a higher relative binding affinity than the CTD bound in the "yeast" orientation (Systems 6 and 11; Figure 8C).
We suggest that such bidirectional CTD binding to the same interaction sites is enabled because the CTD heptad motif is almost completely palindromic ( Figure 8D), and therefore the positions of the CTD residues remain the same in both directions. To our knowledge this feature of the CTD sequence and structure has not been previously discussed. To see this, the canonical heptad motif must be viewed starting with Ser5 and placing Tyr1 at the centre. The GTase-CTD interaction mostly involves the CTD sidechains, such as the pSer and Tyr1 interactions, but not the backbone, and therefore the chirality of the peptide backbone is unlikely to affect the GTase-CTD binding affinity. This also agrees with our finding that the CDS pockets can be occupied in different heptad orders ( Figure 3B). As a result, the C. albicans CTD conformation can be superimposed onto the human GTase with few minor steric clashes, and this conformation remains stable for the duration of the simulation, despite the fact that it is in the "opposite orientation". Bidirectionality of peptide binding has been observed for other protein-peptide interactions, including the WW domain, MHC class II, SH3 domain and O-GlcNAcase (81)(82)(83)(84). In particular, the WW domain in Pin1 binds to CTD phosphopeptides in an opposite direction to other examples of WW domain protein-peptide interactions (81). Bidirectional peptide binding has been suggested to have implications for binding specificity, as changes to the peptide sequence or phosphorylation pattern are likely to introduce steric constraints that may prevent binding in a particular orientation (82).
We hypothesise that the palindromic nature of the CTD contributes to its function in binding to such a wide variety of partners during transcription. It may allow the CTD to have specific interactions, such as interactions with the specific CTD kinases, while also being able to recruit factors such as the CE, where the CTD can bind in a variety of conformations. The palindromic nature of the CTD also has implications for how the code could be read, with the pSer5 and pSer2 being in distinct locations along the palindrome. This could, in part, explain why a major transition in phosphorylation occurs between Ser5 and Ser2 rather than Ser2 and Ser7, which are in the same relative position within the palindrome (20,22,25,85).

Conclusions
Recruitment of the Capping Enzyme (CE) by the carboxyl-terminal domain of RNA Polymerase II (CTD) is an essential stage of mRNA capping, localising the CE to the site of transcription and stimulating the activity of its guanylyltransferase (GTase) domain (15)(16)(17). Despite a number of studies of the GTase-CTD interface, fundamental questions remain about the molecular details of this interaction (12, 27, 30-33, 76, 80). We have carried out an extensive and systematic study of the GTase-CTD interaction using molecular dynamics (MD) simulations, in which we varied the phosphorylation code, length and orientation of the CTD. We have subsequently confirmed the main computational predictions by performing a series of biochemical assays.
Through this approach we have identified several distinct characteristics of the GTase-CTD interaction. Most notably, we have identified two novel interaction sites on the human CE GTase surface (CDS2 and CDS-Y2). In addition to this we have shown that the disordered flanks of the GTase contribute significantly to CTD binding. Structural and sequence analysis between ani-25 mals and yeasts reveals that the novel GTase-CTD interaction sites are highly conserved, leading us to conclude that the GTase-CTD interaction sites have undergone considerably higher selection pressure than previously considered. The binding free energy analysis and binding assays have demonstrated that the phosphoserine interactions are the main contributors to the GTase-CTD interaction. Subsequent activity assays have shown that the novel CDS2 site is essential for GTase activation, revealing a previously missing link for understanding the molecular mechanism of GTase activation by the CTD.
This work has also characterised how the GTase-CTD interaction depends on the CTD phosphorylation code. Our simulations and biochemical assays both show that these interactions are not pSer5 CTD specific and are also essential for pSer2 CTD binding. We conclude that the occupation of the pSer interaction sites does not confer allosteric activation alone, instead the distinct conformations the pSer5 CTD peptides adopt when bound at these sites determines whether the GTase becomes activated. Overall, this work moves forward our current understanding of the GTase-CTD interaction, from one of static interaction sites obtained from crystal structures to a more complex picture of transient interactions and structural ensembles, where the CDS sites are occupied in different orders and directions.
Finally, this work sheds light on the structural features of the CTD. Our simulations clearly demonstrate the CTD looping out mechanism first described by Fabrega et al. (31). Moreover, we show that the order of heptad binding to the CDS sites can change more drastically, leading to the identification of the GTase-CTD interaction bidirectionality. We conclude that this bidirectionality is the result of the CTD motif being palindromic. The palindromic nature of the CTD has not been explored previously but it is likely to have implications for how the CTD is written and read, affecting a number of stages of gene regulation.

Acknowledgements
We are grateful to Professor Tom Owen-Hughes and Dr Subbu Sundaramoorthy for kindly providing the PGEX6p1-C-His plasmid vector and technical advice on protein expression and purification. We would also like to thank Professor Ulrich Zachariae for critical discussions.    The key GTase residues that contribute to CTD binding are labelled and coloured according to the CDS site they belong to (see the plot legend box). The residues making significant contributions (below -2.1 kcal/mol) are all confined to the region between residues 320 and 440. (B) Comparison of the binding free energy between the wild-type GTase and the three mutants, where the positively charged residues of CDS1, CDS2 and both sites were mutated to alanine. The result for the wild-type GTase with unphosphorylated CTD is also shown for comparison. Error bars denote one standard deviation. ANOVA followed by post-hoc Tukey tests were performed to calculate statistical significance between the mutants and the wildtype GTase + pSer5 CTD condition. * indicates that the differences are significant at P<0.05, ** indicates that the differences are significant at P<0.01, and *** indicates that the differences are significant at P<0.001.