Journal of Molecular Biology
Volume 345, Issue 5, 4 February 2005, Pages 987-1001

Secondary Structure Prediction of Interacting RNA Molecules

https://doi.org/10.1016/j.jmb.2004.10.082

Computational tools for prediction of the secondary structure of two or more interacting nucleic acid molecules are useful for understanding mechanisms for ribozyme function, determining the affinity of an oligonucleotide primer to its target, and designing good antisense oligonucleotides, novel ribozymes, DNA code words, or nanostructures. Here, we introduce new algorithms for prediction of the minimum free energy pseudoknot-free secondary structure of two or more nucleic acid molecules, and for prediction of alternative low-energy (sub-optimal) secondary structures for two nucleic acid molecules. We provide a comprehensive analysis of our predictions against secondary structures of interacting RNA molecules drawn from the literature. Analysis of our tools on 17 sequences of up to 200 nucleotides that do not form pseudoknots shows that they have 79% accuracy, on average, for the minimum free energy predictions. When the best of 100 sub-optimal foldings is taken, the average accuracy increases to 91%. The accuracy decreases as the sequences increase in length and as the number of pseudoknots and tertiary interactions increases. Our algorithms extend the free energy minimization algorithm of Zuker & Stiegler for secondary structure prediction, and the sub-optimal folding algorithm by Wuchty et al. Implementations of our algorithms are freely available in the package MultiRNAFold (http://www.rnasoft.ca/download.html).

Introduction

Computational tools for prediction of the secondary structure (a set of base-pairs, with each base occurring in at most one pair) of a single nucleic acid molecule provide insight into the structure of RNA molecules,1 aid in comparative analysis and alignment of RNA sequences,2 and are used to help design and screen libraries of antisense or primer oligonucleotides.3, 4, 5 Such tools include the Mfold server6 and the Vienna package,7 which, for a given input sequence, calculate the pseudoknot-free secondary structure that has minimum free energy (MFE) according to a standard thermodynamic model. These tools have been significantly enhanced over the years. For example, a list of sub-optimal secondary structures whose energies are close to that of the MFE secondary structure is provided with Mfold; this is useful, since the MFE predictions are not always correct, and some sequences have more than one stable secondary structure.
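The two structural constraints mentioned above (each base in at most one pair, and no pseudoknots) can be checked mechanically. The following is a minimal sketch, assuming a structure is represented as a list of 0-based index pairs; the function names are illustrative and are not taken from Mfold, the Vienna package, or MultiRNAFold.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[int, int]


def is_secondary_structure(pairs: Iterable[Pair]) -> bool:
    """Return True if every base index occurs in at most one pair."""
    seen: Set[int] = set()
    for i, j in pairs:
        if i == j or i in seen or j in seen:
            return False
        seen.update((i, j))
    return True


def is_pseudoknot_free(pairs: Iterable[Pair]) -> bool:
    """Return True if no two pairs (i, j) and (k, l) cross, i.e. i < k < j < l."""
    ordered = sorted((min(i, j), max(i, j)) for i, j in pairs)
    for a, (i, j) in enumerate(ordered):
        for k, l in ordered[a + 1:]:
            if i < k < j < l:
                return False
    return True
```

The quadratic crossing check is chosen for clarity; a stack-based scan would do the same job in linear time after sorting.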

In some applications, it is desirable to predict the secondary structure of two or more interacting nucleic acids. We focus on this problem here. Such predictions aid in understanding mechanisms for ribozyme function, in determining the affinity of an oligonucleotide primer to its target,5 and in designing good antisense oligonucleotides,8 novel ribozymes9 or nanostructures.10

A method for predicting the MFE secondary structure of two or more sequences was first proposed briefly, but not implemented, by Hofacker et al.7 Mathews et al.8 provide an implementation for two sequences; their algorithm calculates the MFE secondary structure formed by a probe and its target, and is used in their OligoWalk software, which selects a good probe for a particular target. However, neither work provides the algorithmic details or handles the task of predicting sub-optimal secondary structures, and the OligoWalk software cannot handle more than two strands. Another related piece of software is the two-state hybridization server described by Zuker,6 a simple extension of the Mfold program, but this tool is less general, in that it considers a limited range of potential secondary structures. Finally, the HyTher software tool† calculates the free energy of stacked pairs or mismatches at the corresponding positions in the two input sequences. It performs no energy minimization, and the input sequences must have the same length.

In this work, we describe two algorithms, PairFold and MultiFold, for prediction of the MFE pseudoknot-free secondary structure of two or more interacting nucleic acids. PairFold is the first tool to predict sub-optimal secondary structures of two interacting strands, and MultiFold is the first to handle multiple strands. Both programs use the standard thermodynamic parameters of the Turner group11, 12 for RNA molecules and of SantaLucia Jr13 for DNA molecules. Our algorithms have been implemented using C++. The package is open source and can be downloaded‡. We provide a detailed analysis of the performance of PairFold and MultiFold on several data sets, in order to benchmark the quality of MFE thermodynamic predictions for complexes of two or more interacting RNA or DNA molecules by our algorithms.

PairFold predicts the MFE secondary structure that can be formed by two interacting nucleic acid molecules. The structure may include inter-molecular pairing (base-pairing between the two molecules) as well as intra-molecular pairing (base-pairing within each molecule); see Figure 1(b). The PairFold algorithm takes as input a pair of RNA molecules S1 and S2, and extends the dynamic programming algorithm by Zuker & Stiegler14 for single molecules, which underlies the Mfold software.12 The idea is straightforward: the two given sequences S1 and S2 are concatenated, and the linkage location is recorded.
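The concatenation idea can be illustrated with a short sketch. It assumes 0-based indexing and records the linkage location as the index of the first base of S2 in the joined sequence; both conventions, and the function names, are assumptions made for illustration rather than details of the PairFold implementation.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[int, int]


def concatenate(s1: str, s2: str) -> Tuple[str, int]:
    """Join two strands and return (joined sequence, index of the first base of s2)."""
    return s1 + s2, len(s1)


def classify_pairs(pairs: Iterable[Pair], linkage: int) -> Tuple[List[Pair], List[Pair]]:
    """Split base pairs into intra-molecular and inter-molecular ones.

    A pair is inter-molecular when its two bases lie on opposite sides
    of the linkage location.
    """
    intra: List[Pair] = []
    inter: List[Pair] = []
    for i, j in pairs:
        lo, hi = min(i, j), max(i, j)
        if lo < linkage <= hi:
            inter.append((lo, hi))
        else:
            intra.append((lo, hi))
    return intra, inter


# Toy example: a hairpin within the first strand plus a short inter-strand helix.
joined, linkage = concatenate("GGAAACCAUC", "GAU")
intra, inter = classify_pairs([(0, 6), (1, 5), (7, 12), (8, 11), (9, 10)], linkage)
```

Here the pairs (0, 6) and (1, 5) stay within the first strand, while (7, 12), (8, 11) and (9, 10) span the linkage and are therefore inter-molecular.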

The MFE secondary structure is then calculated, where the energy of a structure is the sum of the energies of its component elementary structures (see Figure 1). We consider the same elementary structures for a pair of interacting molecules as for a single molecule, except that, in addition, a “special” loop contains the location at which the molecules are linked. A special loop is treated as an external loop, except that a penalty for inter-molecular interaction is added. Handling special loops that form multi-loops, should the linkage location be “sealed”, is the most significant extension to the single-molecule algorithm, and requires two new dynamic programming arrays. The output of the program is independent of the order of concatenation. Roughly, this is because, for a given secondary structure, the set of its elementary structures is the same regardless of the order of concatenation; the only difference is which of the two external loops is treated by the algorithm as a special loop. PairFold also calculates the duplex melting temperature, as a function of the reactants' concentrations and ionic concentration.15
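As an illustration of how a duplex melting temperature depends on strand concentration, the sketch below uses the widely used two-state relation Tm = ΔH / (ΔS + R ln(C_T / x)), with x = 4 for non-self-complementary strands at equal concentration and x = 1 for self-complementary ones. This is an assumption for illustration: the excerpt does not state which formula PairFold uses, and the sketch applies no correction for ionic concentration.

```python
import math

R_CAL = 1.987  # gas constant in cal/(mol*K)


def duplex_tm_celsius(delta_h_cal: float,
                      delta_s_cal: float,
                      total_conc_molar: float,
                      self_complementary: bool = False) -> float:
    """Two-state melting temperature in degrees Celsius.

    delta_h_cal: duplex enthalpy in cal/mol (negative for a stable duplex).
    delta_s_cal: duplex entropy in cal/(mol*K).
    total_conc_molar: total strand concentration in mol/L.
    No salt correction is applied (illustrative simplification).
    """
    x = 1.0 if self_complementary else 4.0
    tm_kelvin = delta_h_cal / (delta_s_cal + R_CAL * math.log(total_conc_molar / x))
    return tm_kelvin - 273.15
```

With negative ΔH and ΔS, raising the total strand concentration raises the computed melting temperature, which is the qualitative dependence on reactant concentration described above.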

To predict sub-optimal structures in PairFold, we extend the algorithm of Wuchty et al.16 for single molecules in two ways: by handling two strands as input; and by outputting a number k of sub-optimal structures whose free energies are closest to that of the MFE secondary structure, where k is specified by the user. In contrast, the algorithm of Wuchty et al. outputs all secondary structures whose free energy values are within a specified distance from the free energy of the optimal structure. Our method gives the user more direct control over the number of sub-optimal structures that are output.
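The difference between the two output policies can be sketched on a list of candidate structures that has already been enumerated. This shows only the selection criteria, not the enumeration algorithm of either PairFold or Wuchty et al.; the structure labels and energies in the example are made up.

```python
import heapq
from typing import List, Sequence, Tuple

Candidate = Tuple[str, float]  # (structure label, free energy in kcal/mol)


def k_best(candidates: Sequence[Candidate], k: int) -> List[Candidate]:
    """Return the k candidates with the lowest free energy (user fixes the count)."""
    return heapq.nsmallest(k, candidates, key=lambda c: c[1])


def within_window(candidates: Sequence[Candidate], delta: float) -> List[Candidate]:
    """Return every candidate within delta of the minimum energy (user fixes the range)."""
    e_min = min(energy for _, energy in candidates)
    kept = [c for c in candidates if c[1] <= e_min + delta]
    return sorted(kept, key=lambda c: c[1])


# Hypothetical candidates for illustration only.
toy = [("s0", -12.4), ("s1", -11.9), ("s2", -10.1), ("s3", -12.0)]
top_two = k_best(toy, 2)
near_optimal = within_window(toy, 1.0)
```

With a fixed k the output size is known in advance, whereas with an energy window the number of structures returned depends on how densely the low-energy structures are packed.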

MultiFold predicts the MFE secondary structure formed by several interacting nucleic acid molecules. The MultiFold algorithm is similar to that of PairFold, with the method for handling special elementary structures generalized to manage the case where more than one linkage location may lie in the structure. Furthermore, the input strands are concatenated in all possible orders, and the one having the lowest MFE is returned.
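The search over concatenation orders can be sketched as a loop over permutations. The energy function is passed in as a parameter because the real MFE computation is outside the scope of this sketch; `toy_energy` below is a made-up stand-in, not a thermodynamic model.

```python
from itertools import permutations
from typing import Callable, Optional, Sequence, Tuple

Order = Tuple[str, ...]


def best_concatenation(strands: Sequence[str],
                       mfe_of: Callable[[Order], float]) -> Tuple[Optional[Order], float]:
    """Try every ordering of the strands and keep the one with the lowest energy."""
    best_order: Optional[Order] = None
    best_energy = float("inf")
    for order in permutations(strands):
        energy = mfe_of(order)
        if energy < best_energy:  # strict comparison keeps the first of any ties
            best_order, best_energy = order, energy
    return best_order, best_energy


def toy_energy(order: Order) -> float:
    """Made-up score: -1.0 per junction whose adjacent strand ends are complementary."""
    complementary = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}
    return -float(sum((a[-1], b[0]) in complementary for a, b in zip(order, order[1:])))


order, energy = best_concatenation(["GGA", "UCC", "GAA"], toy_energy)
```

The number of orderings grows factorially with the number of strands, so exhaustive enumeration of this kind is practical only for a small number of strands.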

In order to assess the quality of predictions by PairFold, we analyzed the accuracy (i.e. sensitivity and specificity) of predicted structures, compared with secondary structures reported in the literature. Our test data include ribozymes and their RNA targets;9, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29 small nuclear RNAs of the U2, U4 and/or U6 snRNPs;30, 31 and small DNA and RNA duplexes used by Peyret et al.32 and by Xia et al.33 to experimentally determine thermodynamic parameters for mismatched bases. Overall, we found that PairFold predictions on relatively short duplexes (lengths ranging from 43 to 170 nucleotides), such as ribozyme–target complexes, are good, with 87% sensitivity and 82% specificity, on average, on pseudoknot-free structures. In cases where the prediction quality of PairFold is poorer, such as for pseudoknotted structures, we found that the best prediction from the top 100 sub-optimal structures was significantly better than the MFE prediction.
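Base-pair sensitivity and specificity can be computed as in the sketch below. It assumes the definitions commonly used in RNA structure prediction, which this excerpt does not spell out: sensitivity is the fraction of reference base pairs that are predicted, and specificity is the fraction of predicted base pairs that occur in the reference. The function name and the example pairs are illustrative.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[int, int]


def _normalize(pairs: Iterable[Pair]) -> Set[Pair]:
    """Store each pair as (smaller index, larger index) so order does not matter."""
    return {(min(i, j), max(i, j)) for i, j in pairs}


def sensitivity_specificity(predicted: Iterable[Pair],
                            reference: Iterable[Pair]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) under the assumed definitions.

    sensitivity = correctly predicted pairs / pairs in the reference
    specificity = correctly predicted pairs / pairs in the prediction
    An empty denominator yields 0.0 (a convention chosen for this sketch).
    """
    pred = _normalize(predicted)
    ref = _normalize(reference)
    correct = len(pred & ref)
    sensitivity = correct / len(ref) if ref else 0.0
    specificity = correct / len(pred) if pred else 0.0
    return sensitivity, specificity


# Made-up example: two of three predicted pairs appear among four reference pairs.
sens, spec = sensitivity_specificity(
    predicted=[(0, 9), (8, 1), (3, 6)],
    reference=[(0, 9), (1, 8), (2, 7), (12, 20)],
)
```

In this example the sensitivity is 2/4 and the specificity is 2/3.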

To test PairFold on longer complexes (length over 1000 nucleotides), we used two data sets from Yu et al.5: a combinatorial library of ribozymes for a target viral mRNA, and a small library of primers for the same target. While the complete secondary structures of the ribozyme–target duplexes are not known, it can be inferred from the experimental data whether or not the secondary structure at the site where the ribozyme is active conforms to the known secondary structure of ribozymes. We tested whether PairFold predicted the active part of the ribozyme–target duplex correctly. We found that PairFold predictions are quite accurate when the target is short, but they are poor for long (e.g. 1.1 kb) targets. We expected this poor performance, as all the existing approaches for RNA folding (including Mfold) perform more poorly on long structures than on short ones.

We tested MultiFold on five complexes that are variations of hairpin and hammerhead ribozyme constructs, each with three or four interacting molecules.20, 34, 35 The sensitivity of predictions of MultiFold on these complexes was also very high, over 93% in every case. Finally, we ran MultiFold on DNA strands designed for a molecular automaton that diagnoses high or low levels of mRNA strands in vitro.10 The strands are designed to form certain secondary structures, which are essential to correct functioning of the automaton. When the sequences are ordered so that the designed secondary structures can form without pseudoknots, the MFE secondary structure reported by MultiFold matches the designed secondary structures with accuracy between 0.91 and 1. With other permutations, the secondary structures reported by MultiFold also have high accuracy, supporting the good quality of the strand design.

First, we give a thorough analysis of our algorithms' accuracy on experimental data found in the literature. Then, we outline the key ideas of the algorithms and analyze their computational complexity. Finally, we discuss the accuracy and limitations of the tools we propose, and present conclusions and directions for future work.

Section snippets

Results

In this section we analyze the accuracy of PairFold and MultiFold predictions on several reference structures from the biological literature. Both programs can also take a single RNA sequence as input, rather than two or more, in which case the result is equivalent to the one returned by Mfold, except that we do not currently incorporate coaxial stacking calculations for multi-loops. Since our goal here is to measure the accuracy of folding for two or more interacting RNA

Algorithms

In this section, we briefly describe our algorithms. Full details, including the recurrences for our algorithms, are given in the Supplementary Materials.

Accuracy and limitations

Our results show that PairFold has an overall accuracy of 79% on sequences of up to 200 nucleotides that do not form pseudoknots. When generating 100 sub-optimal foldings and selecting the best, the accuracy increases to 91%. These statistics are roughly consistent with the data reported by Mathews et al.12 for single sequences, which show that the lowest free energy structure predicted by Mfold contains 73% of known base-pairs, on single sequences of up to 700 nucleotides;

Acknowledgements

We thank Holger Hoos, Dan Tulpan, Sanja Rogic, Jérémy Barbay, Greg Lakatos, and Lloyd Smith for valuable input on this work. This material is based upon work supported by the National Science Foundation under grant numbers 0130108 and 0203892. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

References (42)

  • I.L. Hofacker et al. Fast folding and comparison of RNA secondary structures. Chem. Monthly (1994)
  • D.H. Mathews et al. Predicting oligonucleotide affinity to nucleic acid targets. RNA (1999)
  • A. Barroso-delJesus et al. Selection of targets and the most efficient hairpin ribozymes for inactivation of mRNAs using a self-cleaving RNA library. EMBO Rep. (2001)
  • Y. Benenson et al. An autonomous molecular computer for logical control of gene expression. Nature (2004)
  • M.J. Serra et al. Predicting thermodynamic properties of RNA. Methods Enzymol. (1995)
  • J. SantaLucia. A unified view of polymer, dumbbell, and oligonucleotide DNA nearest neighbour thermodynamics. Proc. Natl Acad. Sci. USA (1998)
  • M. Zuker et al. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucl. Acids Res. (1981)
  • J.G. Wetmur. DNA probes: applications of the principles of nucleic acid hybridization. Crit. Rev. Biochem. Mol. Biol. (1991)
  • S. Wuchty et al. Complete sub-optimal folding of RNA and the stability of secondary structures. Biopolymers (1999)
  • Y. Kasai et al. Measurements of weak interactions between truncated substrates and a hammerhead ribozyme by competitive kinetic analyses: implications for the design of new and efficient ribozymes with high sequence specificity. Nucl. Acids Res. (2002)
  • N.K. Vaish et al. Recent developments in the hammerhead ribozyme field. Nucl. Acids Res. (1998)