PT - JOURNAL ARTICLE
AU - Mehdi Saman Booy
AU - Alexander Ilin
AU - Pekka Orponen
TI - RNA secondary structure prediction with Convolutional Neural Networks
AID - 10.1101/2021.05.24.445408
DP - 2021 Jan 01
TA - bioRxiv
PG - 2021.05.24.445408
4099 - http://biorxiv.org/content/early/2021/05/24/2021.05.24.445408.short
4100 - http://biorxiv.org/content/early/2021/05/24/2021.05.24.445408.full
AB - Predicting the secondary, i.e. base-pairing structure of a folded RNA strand is an important problem in synthetic and computational biology. First-principle algorithmic approaches to this task are challenging because existing models of the folding process are inaccurate, and even if a perfect model existed, finding an optimal solution would be in general NP-complete. In this paper, we propose a simple, yet extremely effective data-driven approach. We represent RNA sequences in the form of three-dimensional tensors in which we encode possible relations between all pairs of bases in a given sequence. We then use a convolutional neural network to predict a two-dimensional map which represents the correct pairings between the bases. Our model achieves significant accuracy improvements over existing methods on two standard datasets. Our experiments show excellent performance of the model across a wide range of sequence lengths and RNA families. We also observe considerable improvements in predicting complex pseudoknotted RNA structures, as compared to previous approaches.
Author summary
Structure prediction for RNA sequences is a computationally difficult task that is of increasing importance in applications such as medical diagnostics and drug design; this is because the structure of a folded RNA strand to a large extent defines its function. An open RNA strand can fold to many different structures of varying thermal stability, and the goal of structure prediction is to determine a most stable one among these. There are two main difficulties to this task.
Firstly, a given RNA sequence can fold into an enormous number of alternative structures, and a computational search for a most stable one in this huge space can be very demanding. The search can however be facilitated by using heuristics that take into account some underlying principles of the folding process. Here is where machine learning methods come into play: they are suitable for discovering patterns in data, and can thus predict features of the desired structure based on previously learned patterns. Secondly, there do not yet exist fully satisfactory coarse-grained models for the most popular metric for stability, the free energy of the folded structure. Although in principle a minimum free energy (MFE) structure should be a good candidate for a most stable one, MFE structures determined according to current energy models do not match experimental data on native RNA conformations very well. We show how to use an artificial neural network design to predict the structure for a given RNA sequence with high accuracy only by learning from samples whose native structures have been experimentally characterized, independent of any stability metric or energy model.
Competing Interest Statement
The authors have declared no competing interest.
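Illustrative sketch (not part of the original record): the abstract describes encoding all pairs of bases as a three-dimensional tensor and predicting a two-dimensional pairing map. The Python/NumPy snippet below shows one way such an input tensor and target map could be laid out. The channel layout, the minimum loop length `MIN_LOOP`, the symmetric 0/1 target format, and the function names `encode_pairs` and `pairs_to_map` are all assumptions made for illustration; the record does not specify the authors' actual encoding.

```python
import numpy as np

# Hypothetical channel layout (an assumption, not taken from the paper):
# one channel per ordered base pair that can form a canonical or wobble pair.
PAIR_CHANNELS = {
    ("A", "U"): 0, ("U", "A"): 1,
    ("G", "C"): 2, ("C", "G"): 3,
    ("G", "U"): 4, ("U", "G"): 5,
}
NUM_CHANNELS = len(PAIR_CHANNELS) + 1  # last channel: positions that cannot pair
MIN_LOOP = 3  # assumed minimum number of unpaired bases enclosed by a pair


def encode_pairs(seq: str) -> np.ndarray:
    """Return an (L, L, C) one-hot tensor describing every ordered pair of bases."""
    seq = seq.upper().replace("T", "U")
    n = len(seq)
    tensor = np.zeros((n, n, NUM_CHANNELS), dtype=np.float32)
    for i in range(n):
        for j in range(n):
            channel = PAIR_CHANNELS.get((seq[i], seq[j]))
            # Bases that are incompatible or too close go to the "no pair" channel.
            if channel is None or abs(i - j) <= MIN_LOOP:
                channel = NUM_CHANNELS - 1
            tensor[i, j, channel] = 1.0
    return tensor


def pairs_to_map(pairs, n: int) -> np.ndarray:
    """Build a symmetric (L, L) 0/1 map from a list of 0-based (i, j) base pairs."""
    target = np.zeros((n, n), dtype=np.float32)
    for i, j in pairs:
        target[i, j] = target[j, i] = 1.0
    return target


if __name__ == "__main__":
    x = encode_pairs("GGGAAAUCCC")           # input a convolutional network could consume
    y = pairs_to_map([(0, 9), (1, 8)], 10)   # example target map for a small hairpin
    print(x.shape, y.shape)
```

Laying the sequence out as an L x L grid means a two-dimensional convolution sees local neighbourhoods of candidate pairs, such as stacked pairs along an anti-diagonal, and nothing in this layout rules out crossing (pseudoknotted) pairs in the target map.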