Abstract
Summary Structure probing data has been shown to improve thermodynamics-based RNA structure prediction. However, this type of data has not yet been used to improve the prediction of RNA-RNA interactions. Doing so is particularly promising because the type of information (chemical reactivity as provided by SHAPE) is closely tied to the accessibility of nucleotides, which is an essential component of scoring RNA-RNA interactions. Here we show how such experimental data can be incorporated seamlessly into accessibility-based RNA-RNA interaction prediction approaches, as implemented in IntaRNA. This is possible via the computation and use of unpaired probabilities that incorporate the structure probing information. We show that experimental SHAPE data can significantly improve RNA-RNA interaction prediction. We evaluate our approach by investigating interactions of the spliceosomal U1 RNA with its target splice sites. When SHAPE data is used, known target sites are predicted with increased precision and specificity.
Availability https://github.com/BackofenLab/IntaRNA
Supplementary material https://github.com/BackofenLab/IntaRNA-benchmark-SHAPE
Contact mmann{at}informatik.uni-freiburg.de, backofen{at}informatik.uni-freiburg.de
1 Introduction
The function of many if not most non-coding (nc)RNA molecules is to act as platforms for inter-molecular interaction, which depends on their structure and sequence. A large number of ncRNAs regulate their target RNA molecules via base-pairing. For instance, small (s)RNAs regulate the translation of their target genes by direct RNA-RNA interactions with the respective messenger (m)RNAs (Wright et al., 2013). To predict such interactions, knowledge about potential interaction sites is needed, i.e. regions not involved in intra-molecular base pairing. State-of-the-art RNA-RNA interaction prediction tools like IntaRNA (Busch et al., 2008; Mann et al., 2017; Raden et al., 2018a) compute unpaired probabilities to gain this accessibility information. While correct within their thermodynamic models, such probabilities do not incorporate all cellular constraints and dynamics that define accessible regions and thus the likelihood for interaction.
The accuracy of RNA structure prediction can be improved when experimental structure probing data such as SHAPE1 is incorporated (Hajdin et al., 2013; Sükösd et al., 2013; Lotfi et al., 2015). To this end, SHAPE information2 is converted to pseudo-energy terms (Zarringhalam et al., 2012; Deigan et al., 2009; Washietl et al., 2012) to guide thermodynamic RNA structure prediction methods (Lorenz et al., 2016a,b; Montaseri et al., 2017; Spasic et al., 2018).
As SHAPE reactivity is related to the accessibility of nucleotides, it is even more promising to use such experimental data for improving the accuracy of RNA-RNA interaction prediction. For that reason, we introduce a seamless incorporation of SHAPE data into accessibility-based prediction approaches such as IntaRNA within this manuscript.
We show that SHAPE-guided accessibility prediction improves RNA-RNA interaction prediction. To this end, we study the probabilities of U1 interacting with its pre-mRNA target sites. U1 is involved in pre-mRNA splicing by recognizing the 5’ site of introns via inter-molecular base pairing (Hertel and Graveley, 2005). Due to the dynamics and constraints imposed by the spliceosome, it is generally challenging to avoid false positive interaction predictions, which are either wrong predictions of U1’s recognition site with (random) regions of the mRNA or predicted interactions of other accessible U1 regions with the mRNA. For that reason, we used U1 as an example to show that in vivo probing data effectively reduces false positive predictions in RNA-RNA interaction prediction.
2 Methods
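Given two RNA molecules with nucleotide sequences $S^1, S^2 \in \{A, C, G, U\}^*$, we define an interaction $I$ between $S^1$ and $S^2$ as a set of inter-molecular base pairs (i.e. $I \subseteq [1,|S^1|] \times [1,|S^2|]$) that are complementary (i.e. $\forall (i,j) \in I: \{S^1_i, S^2_j\} \in \{\{A,U\},\{C,G\},\{G,U\}\}$) and non-crossing (i.e. $\forall (i,j),(i',j') \in I: i < i' \rightarrow j > j'$, with both sequences indexed 5’ to 3’). Furthermore, any position forms at most one inter-molecular base pair (i.e. $\forall (i,j) \neq (i',j') \in I: i \neq i' \wedge j \neq j'$). For any interaction $I$, the hybridization energy $E_{hyb}(I)$ can be computed using a standard Nearest-Neighbor energy model (Turner and Mathews, 2010).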
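The three conditions on an interaction can be checked mechanically. The following is a minimal sketch, not taken from IntaRNA: the function name `is_valid_interaction` is hypothetical, positions are 1-based, and both sequences are assumed to be indexed 5’ to 3’, so that a non-crossing interaction pairs increasing positions in the first sequence with decreasing positions in the second.

```python
from typing import Set, Tuple

# Complementary nucleotide pairs: Watson-Crick pairs plus the G-U wobble pair.
COMPLEMENTARY = {
    ("A", "U"), ("U", "A"),
    ("C", "G"), ("G", "C"),
    ("G", "U"), ("U", "G"),
}


def is_valid_interaction(s1: str, s2: str, pairs: Set[Tuple[int, int]]) -> bool:
    """Check a set of 1-based inter-molecular base pairs (i, j).

    Assumption (illustrative): both sequences are indexed 5'->3'.
    """
    ordered = sorted(pairs)  # sort by i, ties broken by j

    # Every pair must lie within both sequences and be complementary.
    for i, j in ordered:
        if not (1 <= i <= len(s1) and 1 <= j <= len(s2)):
            return False
        if (s1[i - 1], s2[j - 1]) not in COMPLEMENTARY:
            return False

    # For neighbouring pairs in sorted order, i must strictly increase and
    # j must strictly decrease. This enforces both the non-crossing rule
    # and the rule that each position forms at most one base pair.
    for (i, j), (i_next, j_next) in zip(ordered, ordered[1:]):
        if not (i < i_next and j > j_next):
            return False
    return True
```

For example, `is_valid_interaction("AA", "UU", {(1, 2), (2, 1)})` accepts the antiparallel duplex, while the pair set `{(1, 1), (2, 2)}` is rejected as crossing.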
The accessibility-based free energy of an interaction $I$ is defined by
$$E(I) = E_{hyb}(I) + ED_1(I) + ED_2(I) \qquad (1)$$
where the $ED_{1,2}$ ($\geq 0$) terms represent the energy (penalty) needed to make the respective interacting subsequences of $S^{1,2}$ unpaired/accessible (Mückstein et al., 2006; Raden et al., 2018b; Wright et al., 2018).
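To compute the $ED$ terms, we need the left-/right-most base pair of $I$, given by $(i_l, j_l) = \arg\min_{(i,j) \in I} i$ and $(i_r, j_r) = \arg\max_{(i,j) \in I} i$, respectively. Both base pairs define the interacting subsequences, i.e. $S^1_{i_l..i_r}$ and $S^2_{j_r..j_l}$. Based on that, the penalty terms are given by
$$ED_1(I) = -RT \log\left(Pr^{ss}(S^1_{i_l..i_r})\right), \qquad ED_2(I) = -RT \log\left(Pr^{ss}(S^2_{j_r..j_l})\right) \qquad (2)$$
where $R$ is the gas constant, $T$ is the temperature, and $Pr^{ss}$ denotes the unpaired probability of a given subsequence, which can be efficiently computed (Bernhart et al., 2006; Mückstein et al., 2006).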
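The relation between unpaired probabilities and the accessibility-based energy can be illustrated with a few lines of Python. This is a sketch under stated assumptions rather than IntaRNA's implementation: the function names are hypothetical, energies are in kcal/mol, the temperature defaults to 37 °C, and the unpaired probabilities and hybridization energy are taken as given inputs.

```python
import math

GAS_CONSTANT = 0.0019872  # kcal / (mol * K)


def ed_penalty(pr_unpaired: float, temperature_k: float = 310.15) -> float:
    """Energy penalty ED = -RT * ln(Pr_ss) for making a subsequence unpaired."""
    if not 0.0 < pr_unpaired <= 1.0:
        raise ValueError("unpaired probability must be in (0, 1]")
    return -GAS_CONSTANT * temperature_k * math.log(pr_unpaired)


def interaction_energy(
    e_hyb: float,
    pr_ss_1: float,
    pr_ss_2: float,
    temperature_k: float = 310.15,
) -> float:
    """Accessibility-based energy E(I) = E_hyb(I) + ED_1(I) + ED_2(I)."""
    return (
        e_hyb
        + ed_penalty(pr_ss_1, temperature_k)
        + ed_penalty(pr_ss_2, temperature_k)
    )
```

A fully accessible site (probability 1) incurs no penalty, and the penalty grows as the unpaired probability decreases, so a strongly structured region needs a correspondingly more favorable hybridization energy to be predicted as an interaction site.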
As discussed above, SHAPE reactivity data can be incorporated into thermodynamic prediction tools via pseudo-energy terms (Lotfi et al., 2015; Deigan et al., 2009), as implemented in the Vienna RNA package (VRNA) (Lorenz et al., 2016b). The latter enables SHAPE-guided computation of unpaired probabilities, i.e. the $Pr^{ss}$ terms from Eq. 2. While SHAPE-guided energy evaluations cannot be compared to unconstrained energy values (due to the pseudo-energy terms), unpaired probabilities are compatible, since they reflect the accessible structure space rather than individual structures. Thus, SHAPE-constrained $Pr^{ss}$ values can be used directly within the $ED$ computation (Eq. 2), which provides a constrained accessibility-based interaction energy (Eq. 1) without further methodological changes. This approach is implemented in the recent version of IntaRNA, which is available e.g. via Bioconda (Grüning et al., 2018).
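To make the notion of a pseudo-energy term concrete, the sketch below implements the per-nucleotide conversion proposed by Deigan et al. (2009), which maps a reactivity to a linear-log energy term. It is an illustration only: the function name is hypothetical, the default slope and intercept are the values commonly quoted for that method and are not stated in this manuscript, and treating negative reactivities as missing data is an assumed convention. The results reported here use the conversion of Zarringhalam et al. (2012) instead, which is not shown.

```python
import math


def deigan_pseudo_energy(
    reactivity: float,
    slope_m: float = 2.6,       # kcal/mol, assumed default (Deigan et al., 2009)
    intercept_b: float = -0.8,  # kcal/mol, assumed default (Deigan et al., 2009)
) -> float:
    """Pseudo-energy m * ln(reactivity + 1) + b for a single nucleotide.

    Assumption (illustrative): a negative reactivity marks missing data and
    contributes no pseudo-energy.
    """
    if reactivity < 0.0:
        return 0.0
    return slope_m * math.log1p(reactivity) + intercept_b
```

With these parameters, unreactive nucleotides receive a small bonus for being paired, whereas highly reactive nucleotides are penalized, which shifts the structure ensemble and thereby the unpaired probabilities derived from it.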
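To assess the effect of SHAPE data, we define the spot probability $Pr^{spot}$ of an interaction site of interest. A spot is defined by a pair of indices $k, l$ for $S^1, S^2$, respectively, and $Pr^{spot}(k, l)$ as the partition function quotient
$$Pr^{spot}(k,l) = \frac{\sum_{I \in \mathcal{I}^{*}} \exp(-E(I)/RT)}{\sum_{I \in \mathcal{I}} \exp(-E(I)/RT)} \qquad (3)$$
where $\mathcal{I}$ denotes the set of all possible interactions and $\mathcal{I}^{*} \subseteq \mathcal{I}$ the subset of interactions that cover the spot, i.e. positions $k$ and $l$ are within the respective interacting subsequences3 $S^1_{i_l..i_r}$ and $S^2_{j_r..j_l}$ (see above).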
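The partition function quotient can be sketched for an explicitly enumerated set of interactions. The code below is a toy illustration and not how IntaRNA computes spot probabilities: the tuple layout, the function name and the default RT value (about 0.616 kcal/mol at 37 °C) are assumptions, and whether the ensemble contains the empty interaction is left to the caller.

```python
import math
from typing import Iterable, Tuple

# Hypothetical summary of one interaction: (i_l, i_r, j_r, j_l, energy), i.e.
# the interacting range in S1, the interacting range in S2 (with j_r <= j_l),
# and the accessibility-based energy E(I) in kcal/mol.
Interaction = Tuple[int, int, int, int, float]


def spot_probability(
    interactions: Iterable[Interaction],
    k: int,
    l: int,
    rt: float = 0.61633,  # assumed RT in kcal/mol at 37 degrees Celsius
) -> float:
    """Boltzmann-weighted fraction of interactions covering the spot (k, l)."""
    z_all = 0.0   # partition function over all interactions
    z_spot = 0.0  # partition function over interactions covering the spot
    for i_l, i_r, j_r, j_l, energy in interactions:
        weight = math.exp(-energy / rt)
        z_all += weight
        # Covering means k and l lie within the interacting subsequences;
        # the base pair (k, l) itself does not have to be formed.
        if i_l <= k <= i_r and j_r <= l <= j_l:
            z_spot += weight
    if z_all == 0.0:
        raise ValueError("no interactions given")
    return z_spot / z_all
```

Because the quotient is taken over Boltzmann weights, removing competing low-energy interactions from the denominator raises the probability of the remaining ones, even if their own energies are unchanged.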
3 Results
SHAPE data for U1 was obtained from in vivo DMS-seq RNA structure probing of Arabidopsis thaliana (Ding et al., 2014). The pre-mRNA sequences of five genes, including ACT1, that have previously been validated to undergo U1-dependent splicing (Yeh et al., 2017) were extracted for the analyses. Figure 1a,b exemplifies the effect of SHAPE-constrained predictions using IntaRNA 2.2.0, VRNA v2.4.7 and pseudo-energies following Zarringhalam et al. (2012). Without SHAPE constraints, the splice site is predicted to interact with various regions of U1 with high probability (i.e. low energy). In contrast, when using SHAPE-corrected accessibility terms, the splice site is predicted to be the dominant target of U1’s recognition site. This interaction, for instance, moves up from rank 9 (standard prediction) to rank 3 (SHAPE-constrained) among all predicted interactions of U1 with the ACT1 mRNA. Figure 1c provides the interaction probabilities of U1’s recognition site with all three 5’ splice sites of ACT1. All splice sites are predicted with increased probability when SHAPE data is used. As shown in Figure 1a,b, this effect results from a decreased number of wrong low-energy interactions, i.e. false positive predictions. Over all mRNAs, the probabilities of correct splice site recognition increased on average by a factor of 3.08 (Figure 1d). The supplementary material provides further details on data extraction, analysis procedure and the evaluation of all studied mRNAs.
4 Conclusion
Most non-coding RNAs perform their function via molecular interactions, for which experimental data is still sparse. Prediction of RNA-RNA interactions has proven useful for detecting targets of sRNAs, especially in prokaryotes (Backofen and Hess, 2010). However, the false positive rate is still high, making RNA-RNA interaction prediction alone too error-prone for eukaryotes.
The only possibility to reduce errors is to combine interaction prediction with other types of data. Here, in vivo structure probing data seems especially suited, as it reflects a multitude of factors that guide RNA structure formation, such as the binding of other molecules or kinetic effects. We have shown that SHAPE data indeed improves RNA-RNA interaction prediction accuracy. To this end, we have extended IntaRNA to incorporate SHAPE data in its accessibility computation and to compute spot probabilities of interaction sites. The predicted interaction probabilities of the spliceosomal U1 RNA with its known target splice sites were significantly improved. This results from a decreased number of false positive (wrong low-energy) predictions.
Recently, structure probing has been complemented by next-generation sequencing to quickly obtain probing data for single transcripts or whole transcriptomes (Kutchko and Laederach, 2017; Choudhary et al., 2017). This produces large data sets that demand fast methods for incorporating the probing data, a demand that is met by our extension of IntaRNA.
Funding
This work was supported by Bundesministerium für Bildung und Forschung [031A538A RBC, 031L0106B] and Deutsche Forschungsgemeinschaft [BA 2168/14-1, BA 2168/16-1].
Acknowledgements
We thank Dr. Ronny Lorenz for discussions on SHAPE integration.
Footnotes
1 Selective 2’-Hydroxyl Acylation analyzed by Primer Extension (SHAPE) (Wilkinson et al., 2006).
2 For simplicity we refer to probing experiments of all reagents (SHAPE, DMS) as SHAPE.
3 Note, interactions I ∈ I∗ covering a spot at k, l do not necessarily contain the base pair (k, l), i.e. k, l or both can be unpaired.