Modeling SARS-CoV-2 nsp1–5’-UTR complex via the extended ensemble simulations

Nonstructural protein 1 (nsp1) of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a 180-residue protein that blocks the translation of SARS-CoV-2 infected cells. Although it has been known that SARS-CoV-2’s own RNA evades an nsp1’s host translation shutoff, its molecular mechanism has been poorly understood. We performed an extended ensemble molecular dynamics simulation to investigate the mechanism of viral RNA evasion. Simulation results showed that the stem loop structure of SARS-CoV-2 RNA 5’-untranslated region is recognized by both nsp1’s globular region and intrinsically disordered region. The recognition presumably enables selectively translating the viral RNAs. A cluster analysis of the binding mode and a detailed analysis of the binding poses were performed, and we identified a few important residues involved in the SL1 recognition mechanism. The simulation results implied that nsp1 C-terminal helices are lifted from the 40S ribosome upon the binding of SL1 to the nsp1, reenabling the translation blocked by the C-terminal helices.

Introduction SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) belongs to Betacoronaviridae, and is a causative pathogen of the COVID- 19. Nonstructural protein 1 (nsp1) of SARS-CoV-2 resides at the beginning of SARS-CoV-2's genome, and is the first protein translated upon the SARS-CoV-2 infection. After the self-cleavage of an open reading frame 1a (orf1a) by an orf1a-encoded protease (nsp3; PLpro), nsp1 is released as a 180-residue protein. Before the emergence of the SARS-CoV-2, nsp1 of SARS-CoV-1, the causative pathogen of the SARS, has been extensively studied. Nsp1 of SARS-CoV-1 is homologous to SARS-CoV-2 nsp1, and it shares a high (84 %) sequence identity to that of SARS-CoV-2. Nsp1 suppresses the host gene expression, [1][2][3][4][5][6] and induces the host messenger RNA (mRNA) cleavage. 1,2,7,8 Under the presence of nsp1, translation of mRNAs is effectively blocked. The translation shutoff hinders the host cell's innate immune response icluding interferon-dependent signaling. 1,9 Recently, multiple groups reported cryogenic electron microscopy (cryo-EM) structures of SARS-CoV-2 nsp1-40S ribosome complexes. [10][11][12] The structural analysis has shown that two α-helices are formed at the C-terminal region (153-160, 166-179) of nsp1 and binds to the 40S ribosome. The helices block the host translation by shutting the tunnel in the ribosome for the messenger RNA (mRNA). The blockade leads to the inhibition of the 48S ribosome pre-initiation complex formation that is vital for the translation initiaion. 3,12 While nsp1 shuts the host mRNA translation, it has been known that the viral RNAs are translated even in the presence of the nsp1, and also that they evade the degradation. [2][3][4] These mechanisms force infected cells to produce only viral proteins instead of normal host cell proteins; indeed, transcriptome analysis has shown that 65 % of total RNA reads from SARS-CoV-2 infected Vero cells were mapped to the viral genome. 13 It has been revealed that nsp1 recognizes the 5'-untranslated region (5'-UTR) of the viral RNA 4,6,11 and selectively enables the translation of RNAs that have a specific sequence. The first stem loop in 5'-UTR 4,6,14 has been identified as necessary in the translation initiation under the presence of nsp1; specifically, bases 1-36 of SARS-CoV, 4 1-33 of SARS-CoV-2, 14 or 1-40 of SARS-CoV-2 6 attached to 5'-UTR of the protein re-enable the translation. However, their precise molecular mechanism has been poorly understood, and remained to be uncovered.
In the present research, we aimed to describe the detailed mechanism of the viral RNA evasion of nsp1. We modeled and simulated the complex of SARS-CoV-2 nsp1 and the SARS-CoV-2 5'-UTR's first stem loop using an extended ensemble molecular simulation.
The detailed analysis of the simulation suggests the molecular basis of the 5'-UTR recognition by nsp1, in which interaction of nsp1 and the stem loop prevents nsp1's C-terminal helices from binding the ribosomal tunnel.

Methods
Simulation setup. We constructed a complex of nsp1 and 5'-UTR of SARS-CoV-2 RNA.
The initial structure of the RNA stem was constructed using RNAcomposer. 22,23 Bases numbered as 1-35 from the SARS-CoV-2 reference genome (NCBI reference sequence ID NC 045512.2) 24 were used in the present research. This sequence corresponds to the first stem loop of SARS-CoV-2 RNA 5'-UTR. Hereafter, we call this RNA SL1. SL1 was capped by 7-methyl guanosine triphosphate (m7G-ppp-). First base (A1) after the cap was methylated at 2'-O position to reflect viral capped RNA. Charges and bonded force field parameters for these modified bases were prepared by the restrained electrostatic potential (RESP) method 25 and by the analogy to existing parameters, respectively. For SL1, we used the combination of AMBER99 + bsc0 + χOL3. 18,19,26,27 To maintain the structure of the stem loop to be stable, we employed the distance restraints between G-C bases. Specifically, between residues G7-C33, G8-C32, C15-G24, and C16-G23, distance restraints were applied such that the distances between N1, O6, and N2 atoms of guanosine and N3, N4, and O2 atoms of cytidine, respectively, do not exceed 4.0Å. Between these atoms, flat-bottom potentials were applied, where each potential is zero when the distance between two atoms is less than 4.0Å, and a harmonic restraint is applied when it exceeds 4.0Å with a spring constant of 1 kJmol −1Å−2 . We used acpype 28 to convert AMBER force field files generated by AmberTools 29 into GROMACS.
Nsp1 and SL1 models were then merged and solvated in the 150 mM KCl solution. TIP3P 30 water model and Joung-Cheatham monovalent ion parameters 31 were used (73,468 water molecules, 253 K ions, 209 Cl ions). The initial structure is presented in Fig. 1A. A periodic boundary condition of the rhombic dodecahedron shape was used with the size of ca. 140Å along the X-axis. Note that we started the simulation from the unbound state, i.e., nsp1 and SL1 did not directly contact with each other. The total number of atoms in the system was 224,798.
Although it is possible to perform a molecular dynamics simulation of an nsp1-SL1 complex, due to excessive charges in both molecules, with conventional simulations the model tends to be trapped around the initial configuration of the complex. Previously, it has been shown that the sampling for nucleic acids-protein systems can be effectively solved by extended ensemble simulations. [32][33][34] In this work, we used the replica exchange with solute tempering (REST) version 2 to sample various configurations of SL1 and IDR of nsp1. 35 We set both the disordered region (nsp1 1-11 and 128-180) and the whole SL1 as the "hot" region of the REST2 simulation. Note that, in addition to the charge scaling for nsp1 and SL1, we also scaled the charges of counter-ions to prevent unneutralized system charge in the Ewald summation. The total number of replicas used in the simulation was 192. The replica numbered 0 corresponds to the simulation with the unscaled potential. In the final replica (numbered 191), nonbonded potentials between "hot"-"hot" groups were scaled by 0. 25. Exchange ratios were 53-78 % across all replica. To prevent numerical errors originating from the loss of significant digits, we used a double-precision version of GROMACS as a simulation software. 36 We also modified GROMACS to enable the replica exchange simulation with arbitrary Hamiltonian. 37 The patch representing modifications is supplied in the supporting information.
The simulation was performed for 50 ns (thus, 50 ns×192 = 9.6 µs in total), and the first 25 ns were discarded as an equilibration time. The simulation was performed with NVT and the temperature was set to 300 K. The temperature was controlled by the velocity rescaling method. 38 The timestep was set to 2 fs, and hydrogens attached to heavy atoms were constrained with LINCS. 39 Simulation analysis. To obtain the proper structure ensemble under the unmodified potential function, we used the multistate Bennett acceptance ratio (MBAR) method. 40,41 With the MBAR method, we can obtain a weighted ensemble corresponding to the canonical ensemble, i.e., trajectory with a weight assigned on each frame, from multiple simulations performed with different potentials. Only eight replicas corresponding to the eight lowest replica indices (i.e., the one with the unscaled potential function and seven replicas with the closest to the unscaled potential) were used in the MBAR analysis. The weighted ensemble of the trajectory was used in the subsequent analyses. The ensemble of the structures with the weight information is available online at https://bsma.pdbj.org/entry/26. Visualization was performed with VMD 42 and pymol. 43 The secondary structure of nsp1 including the IDR was analyzed using the secondary structure definition of DSSP 44 using mdtraj. 45 We tested the convergence of the ensemble using the secondary structure distribution and the stability of the hydrogen bonds between nsp1 and SL1 (see the supplementary material for details).
Interactions between nsp1-SL1. We applied three criteria to detect interactions between nsp1 and SL1. (i) Inter-residue contacts were detected with the criterion that interatomic distance between Cα of an amino acid residue and C4' of a nucleotide residue is less than or equal to 12Å. (ii) Hydrogen bonds were detected with the criteria that the hydrogen-acceptor distance is less than 2.5Å and donor-hydrogen-acceptor angle is greater than 120 degrees. (iii) Salt-bridges were detected with the criterion that the distance between a phosphorous atom of RNA backbone and a distal nitrogen atom of Arg or Lys is less than 4.0Å.
Clustering. On the basis of the inter-residue contact information, the binding modes of the nsp1-SL1 complex observed in the ensemble were evaluated by applying the clustering method. The inter-residue contact information of each snapshot was represented as a contact map consisting of a 180 × 36 binary matrix. The distance between two snapshots was then calculated as the Euclidian distance of vectors with 180×36 = 6480 elements. We applied the DBSCAN method 46 to classify the binding modes. We arbitrarily determined two parameters for the DBSCAN method, eps and minPts, to obtain a reasonable amount of clusters each of which has distinct binding modes. Note that the DBSCAN generates the clusters each of which has more than minPts members based on the similarity threshold eps. The clusters with members less than minPts (including singletons) are treated as outliers. We used eps = 6 and minPts = 200 in this research. We also tested another clustering algorithm, OPTICS, 47 and confirmed that two different methods generate qualitatively similar results (data not shown).

Results and discussion
IDR partially forms the secondary structure and binds to SL1. Although we did not restrain the RNA-nsp1 distance in the simulation and started the simulation with two molecules apart, they formed a complex in the canonical ensemble. Figure 1  Next, we investigated the secondary structure of the nsp1 region. Even though we started the simulation from an extended configuration, the C-terminal region at residues 153-179 partially formed two α-helices. The result corroborates with the fact that the C-terminal region forms two helices (153-160, 166-179) and shuts the translation by capping the pore that mRNA goes through in the Cryo-EM structural analysis. The result also indicates that the cap structure is formed before nsp1 binds to the ribosome by the pre-existing equilibrium, although the ratio of the helix-forming structures is only up to 50 %. In addition to these known helices, residues 140-150 also weakly formed a mixture of α-helix and 3-10 helix.
Distance between nsp1 N-terminal domain and C-terminal helices. Recent cryo-EM structures, although ambiguously, suggest that the N-terminal domain of nsp1 reside on the 40S ribosome (Fig. 3). Inspired by the structures, we investigated the geometrical restraints of nsp1 in the presence of the SL1. The distance between the center-of-mass of nsp1 N-terminal domain (defined by residues 14-125) and that of C-terminal helices (153-179) was calculated and its histogram was plotted in Fig. 3(C). The distance distribution had two peaks at 27 and 33Å, which was below 49.8Å estimated from the cryo-EM structure (see supporting information for details). Indeed, 90.7 % of the trajectory had a distance less than the experimentally estimated distance of 49.8Å. The result indicates that the configuration observed in the cryo-EM structure, which does not include SL1, is unlikely to happen when nsp1 is complexed with the SL1.
SL1's hairpin is recognized by nsp1 IDR. Inter-residue contact probabilities between nsp1 and SL1 in the canonical ensemble are summarized in Table 1 and Fig. 4. Based on the distribution of the interactions, we categorized the binding interface on nsp1 into the five regions ( Fig. 1D and Supplementary Table S1): (i) the N-terminus (the 1-18th residues), (ii) the α1 helix (the 31-50th residues), (iii) the disordered loop between β3 and β4 (the 74-90th residues), (iv) the N-terminal side of the IDR (the 121-146th residues), and (v) the C-terminal side of the IDR (the 147-180th residues). These regions primarily interacted with some bases around C20 of the RNA fragments which compose the stem loop. The most major region to recognize SL1 was the region (iv), the N-terminal side of the IDR. The probability for contacts between any residue of this region and SL1 was 97.4 %. In particular, the contact between Asn126 and U18 was observed in 84.1 % of the canonical ensemble. The most frequently observed hydrogen bond in the canonical ensemble was Arg124-U18, the probability of which was 26.0 % ( Table 1). The second major interface region was in the region (ii), α1 helix, which has two basic residues (Arg43 and Lys47), and they frequently (v) did not have a strong tendency to form hydrogen bonds nor salt-bridges but frequently contacted in some residues in these regions; the probability for interactions with the regions (i) and (v) were 72.1 % and 59.2 %, respectively.
As an overall shape, the nsp1 surface consists of positive and negative electrostatic surface patches separated by a neutral region as shown in Fig. 6A. 48 The α1 helix in the region (i) forms the interface of these two patches; one side of the helix has basic residues (Arg43 and Lys47), and the other side consists of some hydrophobic residues (Val38, Leu39, Ala42, and Leu46). The positive side of the α1 helix forms a shape like a hill with a positively charged cliff ( Fig. 6(B) Eventually the IDRs in region (iv) and (v) grab SL1.
Although the binding site on nsp1 for SL1 can be characterized as the interface consisting of the regions (i) through (v), SL1 did not take a stable conformation even when it bounds to these regions. Diverse binding modes were observed in the canonical ensemble. Although the SL1 almost always interacted with some residues of the region (iv), its conformation fluctuated highly and was diverse. In addition, the IDR of nsp1 was also highly flexible.
Clustering analysis of binding poses. The diversity of binding modes were further investigated with the cluster analysis based on the contact map for each snapshot (see the Methods section). We determined the clustering threshold to hold the condition that any cluster has at least one inter-residue contacts with more than 80 % in each cluster. As a result, the binding modes can be categorized into 14 clusters and outliers, which has 34.2 % of statistical weight in the canonical ensemble. Even in the most major cluster, its statistical weight was 15.5 %; that for the second, third, and fourth clusters were 9.9 %, 7.4 %, and 5.0 %, respectively. Each cluster had a unique tendency to use a set of binding regions (Supplementary Table S2, Supplementary Figure S3). We also analyzed the differences in surface areas of interacting interfaces in ordered and disordered regions of nsp1 among the 14 clusters (Supplementary Figure S4). The distribution shows the unique characteristics of each cluster. These results indicate that recognition of SL1 by nsp1 is established by multimodal binding modes.
The representative structure of cluster 1, which had the largest population among all clusters, is presented in Fig. 7 and Supplementary Table S3 Relation to other experimental results. It has been reported that Arg124Ala-Lys125Ala double mutant loses the capability to recognize viral RNA, 3 which can be explained by our simulation results. The simulation showed that Arg124 sidechain strongly interacts with the phosphate backbone of 18U (Table 1 and Figs. 5 and 7). Arg124Ala mutation is thus considered to incite the loss of the ionic interaction between the sidechain and the backbone, and the nsp1 loses the recognition capability.
The circular dichroism (CD) spectrum of SARS-CoV-2 nsp1 C-terminal region of 130-180 16 in solution had only a single peak at 198 nm and did not show ellipticities at 208 nm and 222 nm, which indicated that the nsp1 C-terminal region did not form α-helices nor β-sheets and was disordered. Although in our simulation we found nsp1 partially forms α-helix in the IDR, our simulation also indicated that the percentage of the helix is low and not stable, which may explain the difference to the experimental facts. The difference may also be attributed to the existence of RNA and other solvent conditions, and the force field inaccuracies, but further study is required to conclude.
Whether SARS-CoV-2 nsp1 and SL1 bind without the ribosome is controversial. It has been reported that nsp1 and 7-33 bases of SARS-CoV-2 was reported to bind with a binding constant of 0.18 µM, 49 but it also has been reported that the gel shift did not occur with 5'-UTR of SARS-CoV-2 up to 20 µM when transfer RNA was used to exclude the non-specific binding. 6 Current simulation results indicated that the binding mode observed herein did not have a specific, defined structure. Typically, with such binding modes, the binding is expected to be weak; we thus consider the current simulation results do not contradict to neither experiments.
It has been reported that mutations to SL1 bases 14-25, which disrupt the Watson-Crick pairs of the stem loop, cause the translation to be shut off. 6 The result corroborates with our results that the hairpin structure of bases 18-22 in SL1 is recognized by nsp1. The hydrogenbond interaction analysis showed that the RNA phosphate backbone is mainly recognized within C15-C20 region (Table 1 and Fig. 5). Furthermore, our finding corroborates with the fact that the sequence of hairpin (corresponding to U18-C21 in our simulation) region is not well conserved among SARS-CoV-2 mutational variants while that of the stem is well conserved. 50 Our simulation shows that the interaction between nsp1 and SL1 backbone is stronger than that of nsp1 and SL1 sidechains (Table 1), underlining the importance of the backbone interaction.
Limitations of this study. Our simulation was performed based on several assumptions.
Here, we list the limitations of the current study.
First of all, our simulation was performed without the ribosome. This was mainly because the simulation started before the nsp1-ribosome complex structure was deposited. Furthermore, at the time of submitting this manuscript, the orientation of the nsp1 N-terminal domain attached to the 40S ribosome is still ambiguous in density maps. With the 40S ribosome, the environment around nsp1 may be altered, hence the interaction between the RNA and nsp1. Specifically, ribosome mostly consists of ribosomal RNAs and are thus strongly negatively charged, which may change the interaction environment significantly.
We performed the simulation with restraints to G-C pairs in 5'-UTR to maintain the stability of the hairpin loop structure. The restraints may have hindered RNA forming other structures than the initial hairpin structure. However, in the secondary structure prediction using CentroidFold, 51 these base pairs were predicted to exist in more than 92 % of the ensemble. Furthermore, a recent study 52 showed that, even with a rigorous extended ensemble simulation, the hairpin structure remained intact. From these results, the drawback of structural restraints to SL1 is expected to be minimal.
Finally, as is always the case with the simulation study, the mismatch between the simulation force field and the real world leaves a non-negligible gap. In addition to that, some residues may have alternative protonation states upon binding to RNA (e.g., histidine protonation state).
Future research and conclusions. Current simulation research was performed with nsp1 and SL1 only. Arguably, a simulation of the complex consisting of the 40S ribosome, nsp1 and SL1 will be an important step to understand the detailed mechanism of the viral RNA evasion of nsp1. Current simulation results showed that nsp1-SL1 binding has multimodal binding structures. The addition of the 40S ribosome to the system may confine the possible binding poses to fewer in number, and possibly more tight binding poses may be obtained.
In addition to a simulation study, mutational analysis on nsp1 will be beneficial. In addition to already known mutation at Arg124, current simulation results predict that the following residues are important in the nsp1-SL1 binding: Lys47, Arg43, and Asn126. Mu-tant analysis of these residues will help us to understand the molecular mechanism of nsp1.
Finally, the development of inhibitors for nsp1-stem loop binding, is highly anticipated in the current epidemic. Although our current result implied that the specific binding structure might not exist, important residues in nsp1, as well as bases in SL1, were detected. Blocking these residues/bases, or mimicking the binding of these residues/bases, may effectively nullify the function of nsp1.
In conclusion, using the molecular dynamics simulation, we investigated the binding and

Supporting Information Available
The supporting information is available free of charge: • Assessment of the convergence, detailed procedure of the center-of-mass distance from