Abstract
Predicting the secondary structure of an RNA sequence with speed and accuracy is useful in many applications such as drug design. State-of-the-art predictors have a fundamental limitation: their run time scales with the third power of the length of the input sequence, which is slow for longer RNAs and limits the use of secondary structure prediction in genome-wide applications. To address this bottleneck, we designed the first linear-time algorithm for RNA secondary structure prediction, which can be used with both thermodynamic and machine-learned scoring functions. Our algorithm, like previous work, is based on dynamic programming (DP), but with two crucial differences: (a) we process the sequence incrementally from left to right rather than bottom-up, and (b) because of this left-to-right processing, we can further employ beam search pruning to ensure linear run time in practice (at the cost of exact search). Even though our search is approximate, surprisingly, it results in even higher overall accuracy on a diverse database of sequences with known structures. In particular, it leads to significantly more accurate predictions on the longest sequence families in that database (16S and 23S Ribosomal RNAs), as well as improved accuracies for long-range base pairs (500+ nucleotides apart).
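The two ideas in (a) and (b) can be illustrated with a minimal sketch. It is not the implementation evaluated here. As a simplifying assumption it scores a structure by its number of base pairs (a Nussinov-style toy model with no minimum loop length) rather than with a thermodynamic or machine-learned function. The names `fold_left_to_right`, `PAIRS`, and `beam` are hypothetical. The sketch scans the sequence once from left to right and, at each position, keeps only the `beam` highest-ranked states.

```python
from typing import Dict, List, Optional

# Allowed base pairs: Watson-Crick plus G-U wobble.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def fold_left_to_right(seq: str, beam: Optional[int] = None) -> int:
    """Return the number of base pairs found by a left-to-right DP.

    A state (i, j) is a fully folded span seq[i:j], where i is either 0 or
    the position just after a still-unmatched opening bracket at i - 1.
    best[j][i] is the largest pair count found inside that span.
    With beam=None the search is exact (toy scoring model only).
    """
    if beam is not None and beam < 1:
        raise ValueError("beam must be at least 1")
    n = len(seq)
    best: List[Dict[int, int]] = [dict() for _ in range(n + 1)]
    best[0][0] = 0

    def relax(j: int, i: int, score: int) -> None:
        # Keep the higher score when two derivations reach the same state.
        if score > best[j].get(i, -1):
            best[j][i] = score

    for j in range(n):
        if beam is not None and len(best[j]) > beam:
            # Rank each state by (best score of the prefix before its open
            # bracket) + (score inside the span); ties favor smaller i, so
            # the fully balanced state i = 0 is never pruned.
            def rank(i: int):
                prefix = best[i - 1][0] if i > 0 else 0
                return (-(prefix + best[j][i]), i)

            kept = sorted(best[j], key=rank)[:beam]
            best[j] = {i: best[j][i] for i in kept}

        for i, inside in list(best[j].items()):
            relax(j + 1, i, inside)   # skip: seq[j] stays unpaired
            relax(j + 1, j + 1, 0)    # push: seq[j] opens a new pair
            if i > 0 and (seq[i - 1], seq[j]) in PAIRS:
                # pop: seq[j] closes the bracket opened at i - 1; merge with
                # every surviving state that ended just before that bracket.
                for k, left in best[i - 1].items():
                    relax(j + 1, k, left + inside + 1)

    return best[n][0]
```

In this sketch each position holds at most `beam` states after pruning, and each pop scans at most `beam` predecessor states, so the work per position is bounded by a quantity that depends on `beam` but not on the sequence length. With `beam=None` no pruning occurs and the run time grows cubically, as in the bottom-up formulation.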
Significance Statement
Fast and accurate prediction of RNA secondary structures (the set of canonical base pairs) is an important problem, because RNA structures reveal crucial information about their functions. Existing approaches can reach a reasonable accuracy for relatively short RNAs, but their running time scales almost cubically with sequence length, which is too slow for longer RNAs. We develop the first linear-time algorithm for RNA secondary structure prediction. Surprisingly, our algorithm not only runs much faster, but also leads to higher overall accuracy on a diverse set of RNA sequences with known structures, where the improvement is significant for long RNA families such as 16S and 23S Ribosomal RNAs. More interestingly, it is also more accurate for long-range base pairs.