Evidence for a long-range RNA-RNA interaction between ORF8 and the downstream region of the Spike polybasic insertion of SARS-CoV-2

SARS-CoV-2 has affected people all over the world as the causative agent of COVID-19. The virus is related to the highly lethal SARS-CoV responsible for the 2002-2003 SARS outbreak in Asia. Intense research is ongoing to understand why both viruses have different spreading capacities and mortality rates. Similar to other betacoronaviruses, long-range RNA-RNA interactions occur between different parts of the viral genomic RNA, resulting in discontinuous transcription and production of various sub-genomic RNAs. These sub-genomic RNAs are then translated into different viral proteins. An important difference between both viruses is a polybasic insertion in the Spike region of SARS-CoV-2, absent in SARS-CoV. Here we show that a 26-base-pair long-range RNA-RNA interaction occurs between the genomic region downstream of the Spike insertion and ORF8 in SARS-CoV-2. Predictions suggest that the corresponding ORF8 region forms the most energetically favorable interaction with that of Spike region from amongst all possible candidate regions within SARS-CoV-2 genomic RNA. We also found signs of sequence covariation in the predicted interaction using a large dataset with 27,592 full-length SARS-CoV-2 genomes. In particular, a synonymous mutation in ORF8 accommodated for base pairing with Spike [G23675 C28045U], and a nonsynonymous mutation in Spike accommodated for base pairing with ORF8 [C23679U G28042] both of which were in close proximity of one another. The predicted interactions can potentially be related to regulation of sub-genomic RNA production rates.


Introduction
Bimolecular RNA structural predictions were performed using bifold (Mathews, Burkard et al. 1999) default parameters. The bifold program takes two sequences as arguments and predicts the most energetically favorable bimolecular conformation. Three-way RNA structural predictions (i.e., simultaneous RNA-RNA interaction amongst three RNA molecules) were predicted by combining information from bimolecular folding predictions. From amongst the three RNA sequences, the longest of the three was identified as the 'main sequence'. Two bimolecular folding predictions were then performed between the main sequence and either of the two remaining RNA sequences. The resulting RNA-RNA pairing regions were then deleted from the main sequence. The remaining sections of the sequence were then concatenated to construct a shortened main sequence. The structure of the shortened main sequence was then predicted using ProbKnot. The final trimolecular RNA structure predictions would then consist of restoring the main sequence by adding RNA-RNA binding regions in their original locations. RNA secondary structure schematics were done using VARNA (Darty, Denise et al. 2009).

Compensatory Mutations Analysis of Long-range RNA-RNA interactions
Compensatory mutations within the multiple sequence alignments were investigated using the R-scape software package (Rivas, Clements et al. 2017, Rivas 2020, Rivas, Clements et al. 2020, which analyzes covariation in nucleotide pairs in the population to infer possible compensatory mutations in an RNA base pair. If the consensus RNA secondary structure is not provided by the user, the software is also capable of predicting the consensus structure from the population of sequences using an implementation of the CaCoFold algorithm. Compensatory (covarying) mutations for long-range RNA-RNA interactions were analyzed by retrieving the two sequence segments that constitute the desired RNA-RNA interaction from all downloaded SARS-CoV-2 sequences. Sequences with runs of ambiguous nucleotides in selected regions were filtered out. Then, the long-range RNA-RNA interacting structure was predicted by finding the consensus secondary structure within the population of sequences in the dataset using R-scape implementation of CaCoFold. The consensus structure was compared to bifold predictions for verification. Nucleotide pairs belonging to the consensus structure were then examined within the dataset for evidence of covariation using the built-in survival function that plots the distribution of base pairs with respect to their corresponding covariation scores (t). For a typical survival function plot, the black depicts the survival function for the null alignment. Blue depicts the survival function for the alignment. The black line fits to the tail of the null distribution. E-value corresponding to a pair is obtained by drawing a vertical line from the point to cross the black line. If a dot representing a base pair crosses the 0.05-threshold vertical blue line, it is considering as significantly covarying (See R-scape manual for more details).

Downstream region of the Spike 12-nt insert contains locally stable RNA structural features
The flanking regions of the 12-nt insert in Spike (23,600-24,107) showed a consistent local RNA structure. The SARS-CoV and SARS-CoV-2 predicted structures differ in this region due to their low sequence similarity (76% similarity; p-distance ~ 0.24). The SARS-CoV-2 predicted structure has a higher stability (-141.5 kcal/mol) that the structure predicted for the homologous region in SARS-CoV (-125.00 kcal/mol; Figures 1A and 1B). The 12-nt insert tends to form a stem with complementary nucleotides approximately 500-nt downstream. The stem named S1 located immediately after the 12-nt insert is common to both SARS-CoV-2 and SARS-CoV (Figures 1A and 1B), and has been found conserved across betacoronaviruses (Rangan, Zheludev et al. 2020). Both MEA and Probknot predicted structures downstream of the insert are similar (See structure in boxes; Figure 1A). Together, these results suggest the existence of stable RNA structural features downstream of the 12-nt insert. . The corresponding sequence of SARS-CoV was found by aligning SARS-CoV-2 and SARS-CoV with 121/508 nucleotides being different between the two sequence segments (sequence similarity 76%; p-distance ~ 0.24). The red asterisk represents the location of the 12nt insert on SARS-CoV-2. Stem S1 is shown in red.

The downstream of Spike 12-nt Insert is predicted to interact with ORF8
The possibility of long-range RNA-RNA interaction between the 12-nt insert region and other genomic regions was investigated using IntaRNA. Figure 2A shows IntaRNA results on the complete SARS-CoV genomes. In almost all cases, results were consistent across different choices of mode parameters. Predictions using query 23,917 -24,118 suggest a binding to Orf1ab. In specific, region 24,084-24,114 (Spike) formed an RNA interaction with region 17,012-17,046 (Orf1ab), containing ~30 base pairs between the two regions. The predicted structure was also found in SARS-CoV, with similar interacting energies (Figures 2A and 2B). On the contrary, other regions yield different results for both viruses. The region 23,663-23,804 (Spike) of SARS-CoV was predicted to bind 29,548-29,696 (3'UTR), while the same region (23,660-23,703) of SARS-CoV-2 was predicted to bind region 28,025-28,060 (ORF8).

Covarying mutations suggestive of Spike-ORF8 long-range RNA-RNA interaction
The predicted interactions between the Spike (23,660-23,703) and ORF8 (28,025-28,060) regions was analyzed for compensatory mutations. Sequence identity on the interacting regions for the 27,488 genomes was 99.98% (certain sequences with gaps were eliminated from the dataset with original size of 27,592). Three significantly covarying nucleotide pairs were observed across the population, none of which belonged to the predicted structure and all of which belonged to ORF8 region. The three covarying pairs refer to ORF8 non-synonymous mutations and are summarized in Supplementary Materials Text S1, Figure S1, and Tables S1 and S2. sequences. The second most frequent mutation (C23679U) was non-synonymous and occurred on the second codon in Spike. The two base pairs with the highest mutation frequency are only a few positions apart ( Figure 3B). The substitution rate within the predicted RNA base pairs, however, was not high enough to pass the 0.05 statistical significance threshold. Figure 3A shows the survival function of scores of all pairs in the Spike-ORF8 interaction region. The black depicts the survival function for the null alignment (i.e., all hypothetical base pairs within the region). Blue depicts the survival function for the alignment (i.e., all predicted base pairs within the structure).

SARS-CoV-2
As we can see, none of the predicted base pairs (shown in blue) cross the line representing 0.05 significance threshold, although base pair [23675 28045] seems to be an outlier within the distribution (See the blue population in Figure 3A).  Figure 4A shows the structure prediction between the ORF8 region and an extension of the Spike region, using bifold program. The 12-nt insert appears to be unpaired in the RNA-RNA duplex structure. Structural prediction of (23,600-24,107 in Spike) and (17,012-17,046 in Orf1ab) is partially shown in Figure 4B, where the corresponding Orf1ab region is shown in green. As we can see, the 12-nt insert appears to partially bind to a region on (23,600-24,107 Spike) different from originally predicted local structure ( Figure 1A). Figure 4C illustrates the approximated threeway RNA structure prediction amongst all three regions (23,600-24,107 in Spike), (28,(25)(26)(27)(28)060 in ORF8),and (17,(12)(13)(14)(15)(16)(17)046 in Orf1ab). In this scenario, the 12-nt insert remains unpaired but a new 3-bp pseudo-knot is stabilized.

Discussion
Betacoronaviruses use long-range genomic RNA-RNA interactions to control sub-genomic transcript production . Mediated by stabilizing proteins, such interactions impact the tertiary structure of the genomic RNA, facilitating binding of the 5' UTR Transcript Regulatory Sequences (TRS) to the TSR upstream of a particular gene, leading to the switching of minus strand template to that of the gene's sub-genomic transcript. Regulation of the N-gene sub-genomic transcript is a fair example of such high-order RNA-RNA interactions . Although some efforts have been made to investigate RNA-RNA interactions of SARS-CoV-2 (Ziv, Price et al. 2020), It is generally very difficult to identify all the genomic RNA regions that are involved in such intricate interactions, presenting computational challenges to finding novel interacting regions within the virus (Rouskin, Lan et al. 2021). The focus of the RNA structure analyses described here was the region corresponding to the 12-nt insert in the Spike region, present in SARS-CoV-2 but absent in SARS-CoV. The insert has a high GC content and is immediately followed by a conserved RNA stem-loop immediately downstream of it (Rangan, Zheludev et al. 2020). We found RNA structural differences between corresponding regions in SARS-CoV-2 and SARS-CoV. The locally stable RNA structure containing the 12insert region was more stable in SARS-CoV-2 than in SARS-CoV with a pseudo-knotted structure not observed in SARS-CoV (Figure 1). In SARS-CoV-2, the downstream region of the 12-nt insert was predicted to base pair with ORF8, while in SARS-CoV, the same region preferred the 3'UTR ( Figure 2). The aligned SARS-CoV-2 sequences were investigated for compensatory mutations that might occur within and between Spike-ORF8 binding location. Although not any significantly covarying mutations were observed, a polymorphism in ORF8, C28045U, was mildly in support of the Insilico prediction about Spike-ORF8 RNA-RNA base pairing (Figure 3). Being a synonymous mutation, C28045U has been previously identified as one of the polymorphic positions of ORF8 (Pereira 2020). In the local RNA secondary structure prediction of ORF8, C28045U is unpaired, while in the predicted long-range RNA-RNA interaction with Spike, it pairs with G23675. Another observed mutation [C23679U G28042], though non-synonymous, accommodates for the base pairing and occurs on Spike. The above discussed predicted pairs are only few base pairs apart. Neither of the two base pairs, however, individually passed the significance threshold (Figure 3). Given the relatively low mutation rate of the virus and given that not any covarying mutations in 5'UTR passed the significance threshold (data not shown) either, it may be possible that the downloaded dataset is not diverse enough to statistically support predicted [G23675 C28045U] and [C23679U G28042] base pairings. The contribution of the 12-nt insert in stabilizing the Spike-ORF8 interaction is still unclear.
Although the 12-nt insert consists of high GC content, it appears as unpaired in most of the longrange RNA-RNA interacting predictions (Figure 4). This occurs while the 12-nt insert secures base pairing in 500-nt downstream in local structure prediction ( Figure 1A). Ironically, Stem S1 remains stabilized in both local and long-range structure predictions ( Figures 1A and 4). It is possible that the insert stabilizes a tertiary interaction or binds to a protein or other regions of the virus not predicted here. In any case, it is unclear if the 12-nt insert contributes to long-range interactions or stabilizes against them. The predicted Spike-ORF8 long-range RNA-RNA interaction can potentially impact template switch during negative strand synthesis. Template switch in betacoronaviruses might occur if the TSR element downstream of the 5'UTR is in proximity of the TSR element immediately upstream of a viral gene. Such complex genomic conformation may involve other RNA-RNA as mediators.
The dE-pE ( Figure 2 of ) act as such mediator RNA binding locations to facilitate a discontinuous negative strand synthesis of the viral genome, leading to N-gene subgenomic RNA. The coronavirus nucleocapsid (N) is known to be a structural protein that forms complexes with genomic RNA, interacts with the viral membrane protein during virion assembly and plays a critical role in enhancing the efficiency of virus transcription and assembly . The predicted Spike-ORF8 binding location here is 200nt upstream of the N-gene TSR . Although high-order RNA-RNA interactions needed for template switch can be more complex and may involve the 5'UTR as well, it might be that the predicted Spike-ORF8 RNA binding is acting as an additional mediator step to bring the TRS elements of 5'UTR and the coronavirus N-gene closer to each other. It could be speculated that the sequence immediately downstream of the 12-nt insert are evolved to bind to ORF8 for the purpose of regulating sub-genomic RNA production. Since the first gene downstream of Spike-ORF8 binding location happens to be the N-gene, the binding location might be affecting the N-gene sub-genomic RNA production. Amongst coronaviruses, ORF8 is a rapidly evolving hypervariable gene that undergoes deletions to possibly adapt to human host (Gong, Tsao et al. 2020, Su, Anderson et al. 2020, Zinzula 2021). It has also been previously observed that patients infected with SARS-CoV-2 variants with a 382-nucleotide deletion (∆382) in ORF8 had milder symptoms (Young, Fong et al. 2020). In addition, ORF8 contains RNA structural features . While this observation may very well be due to impact of absence of the translated protein, ORF8 RNA structural characteristics of the genome may also play a role in the viral life cycle, making long-range RNA-RNA prediction with Spike a less remote possibility. A comprehensive exploration of predicted Spike-ORF8 binding in other SARS-CoV-2 variants and evaluating corresponding sub-genomic RNA production rates of these variants may lead to further clues about the predicted long-range RNA-RNA interaction, which can be rewarding for therapeutic purposes.
covariation of its base pairs across the population of sequences referred to here as msa_0520_23660-23703_28025-28060st. Sequence identity was 99.98 across the population of 27,488 (certain sequences with gaps were eliminated from the dataset with original size of 27,592). Three significantly covarying nucleotides were observed across the population, none of which belonged to the predicted structure and all of which belonged to ORF8 region. All the three significantly covarying mutations had E-value of 0.02975. These significantly covarying mutations referred to non-synonymous mutations for the ORF8 coding region. From amongst the 27,488 viral sequences analyzed here, two sequences from patients from Ecuador and one from Canada had different amino acids on two consecutive locations on ORF8: ORF8 AA (48-49), corresponding to genomic coordinates (28,035-28,040 ORF8). Supplementary Figure S2 shows the location of the two codons Supplementary Table S1 describe the statistical figures associated with the above significantly covarying nucleotides, given by R-scape. Supplementary Table S2 denotes information about individuals carrying the viruses with the above codon variations.