TOWARDS GENERATIVE RNA DESIGN WITH TERTIARY INTERACTIONS



INTRODUCTION
Ribonucleic acid (RNA) is one of the major regulatory molecules inside the cells of living organisms, with key roles during differentiation and development (Morris & Mattick, 2014). RNAs fold hierarchically (Tinoco Jr & Bustamante, 1999), and their structure is key to their function: base interactions via hydrogen bonds result in the fast formation of a secondary structure, with tertiary interactions stabilizing the formation of the final 3D shape (Vicens & Kieft, 2022). A strong structure-to-function relationship is already achieved at the secondary structure level (Hammer et al., 2019). RNA secondary structure prediction has therefore recently come into the focus of the deep learning community, which has achieved state-of-the-art results (Singh et al., 2019; Fu et al., 2022; Chen et al., 2022; Franke et al., 2022; 2023). Compared to more traditional methods, these algorithms predict an L × L adjacency matrix representation of the secondary structure instead of the commonly used but less expressive dot-bracket string notation (Hofacker et al., 1994). As a result, they are not limited to the prediction of specific kinds of base pairs but can predict non-Watson-Crick interactions, pseudoknots (Staple & Butcher, 2005), and base multiplets (nucleotides that pair with more than one other nucleotide) (Singh et al., 2019), all of which play significant roles in RNA structure and function (Vicens & Kieft, 2022; Reyes et al., 2009).
Structure-based RNA design considers the inverse problem: given a target structure, find an RNA primary sequence that folds into the desired structure. It is thus intricately tied to RNA folding. However, there is currently no structure-based RNA design algorithm that can invert state-of-the-art deep learning-based secondary structure prediction algorithms, which could lead to better designs.
In this work, we propose RNAinformer, the first inverse RNA folding algorithm that is capable of designing RNAs while considering all kinds of base interactions. We show that a vanilla transformer architecture, enhanced with axial attention inspired by the RNAformer (Franke et al., 2023), can reliably design RNAs in different settings, including RNA design with non-canonical interactions, pseudoknots, and base multiplets. We see our main contributions as follows:
• We propose RNAinformer, a novel generative transformer model for the inverse RNA folding problem. Using axial attention, our model is the first RNA design algorithm that can design RNAs from secondary structures with all types of base interactions.
• We show that our model outperforms existing algorithms on nested and pseudoknotted structures, while further being capable of designing sequences that form base multiplets.

RELATED WORK
Traditional Methods The problem of computational RNA design was first introduced as the inverse RNA folding problem by Hofacker et al. (1994). Since then, different methods have been proposed for solving the problem using approaches like local search (Hofacker et al., 1994; Andronescu et al., 2004), constraint programming (Garcia-Martin et al., 2013; 2015; Minuesa et al., 2021), or evolutionary methods (Esmaili-Taheri et al., 2014; Esmaili-Taheri & Ganjtabesh, 2015). However, in contrast to our approach, these methods are limited to the design of nested structures, typically considering canonical base pairs only.
Learning Based Approaches More recently, RNA design has also been approached with learning-based methods. One line of research uses human priors to design RNAs based on player strategies obtained from the online gaming platform Eterna (Shi et al., 2018; Koodli et al., 2019). However, these models incorporate human strategies that might not be available for all designs and consider nested structures only. The other, more general approach seeks to learn RNA design purely from data. Eastman et al. (2018) propose to use reinforcement learning (RL) to adjust an initial input sequence by replacing nucleotides based on structural information. In contrast, Runge et al. (2019) and Riley et al. (2023) take a generative approach to the problem. Runge et al. (2019) employ a joint architecture and hyperparameter search approach (Bansal et al., 2022) via automated reinforcement learning (AutoRL) (Parker-Holder et al., 2022) to derive an RL system that is capable of generatively designing RNAs that fold into a desired target structure. Riley et al. (2023) use a GAN (Goodfellow et al., 2020) approach specifically for the design of toehold switches (Green et al., 2014). However, all learning-based approaches so far consider RNA design for nested structures only, ignoring pseudoknots and base multiplets, while often being limited to the design of canonical base interactions.
Overall, none of the existing algorithms can design RNAs that include non-canonical base pairs, pseudoknots, and base multiplets.

METHODS
RNA secondary structures can be represented in different ways, including the common dot-bracket string notation (Hofacker et al., 1994) or adjacency matrices. We show different representations in Figure 1. One advantage of an adjacency matrix representation is that it can model all types of base interactions, especially if a nucleotide interacts with more than one other, a situation prevalent for most experimentally solved structures (Singh et al., 2019). In the following, we detail our generative approach to design RNAs from secondary structures using matrix representations.
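To make the difference between the two representations concrete, the following minimal sketch converts a dot-bracket string into a binary L × L pairing matrix and then adds an extra interaction that the string notation cannot express. The function names `dot_bracket_to_matrix` and `multiplet_positions` are hypothetical and chosen for illustration only. The use of additional bracket types such as `[` and `]` for pseudoknots is a common extension of the notation and is assumed here, not taken from the paper.

```python
def dot_bracket_to_matrix(structure):
    """Convert a dot-bracket string into a symmetric binary L x L pairing matrix."""
    closers = {")": "(", "]": "[", "}": "{", ">": "<"}
    stacks = {opener: [] for opener in closers.values()}
    length = len(structure)
    matrix = [[0] * length for _ in range(length)]
    for j, symbol in enumerate(structure):
        if symbol in stacks:
            # Opening bracket: remember its position until the partner appears.
            stacks[symbol].append(j)
        elif symbol in closers:
            stack = stacks[closers[symbol]]
            if not stack:
                raise ValueError(f"unbalanced bracket at position {j}")
            i = stack.pop()
            matrix[i][j] = matrix[j][i] = 1
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {j}")
    if any(stacks.values()):
        raise ValueError("unbalanced structure: unmatched opening bracket")
    return matrix


def multiplet_positions(matrix):
    """Return the positions that pair with more than one other position."""
    return [i for i, row in enumerate(matrix) if sum(row) > 1]


matrix = dot_bracket_to_matrix("((..))")
# A string assigns at most one partner per position, so this is empty.
print(multiplet_positions(matrix))
# The matrix can additionally store a second partner for position 0.
matrix[0][3] = matrix[3][0] = 1
print(multiplet_positions(matrix))
```

Each position in a dot-bracket string carries exactly one symbol, so it can have at most one partner. The matrix has no such restriction, which is why it can represent base multiplets.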
Loss The problem of RNA design is often addressed by defining a structural loss function $L_{\omega} = d(\omega, F(\phi))$ that quantifies the difference between the target structure $\omega$ and the folding, $F(\cdot)$, of the designed candidate sequence $\phi$ (Runge et al., 2019). However, a folding engine might not be differentiable, which makes it hard to employ this strategy in deep learning-based design approaches. Therefore, we train a model to maximize the sequence recovery by minimizing the mean cross entropy loss $L_{\psi}$ over a nucleotide sequence. For a designed candidate $\phi \in \{A, C, G, U\}^{l}$ of length $l$ and a given target nucleotide sequence $\psi \in \{A, C, G, U\}^{l}$ of the same length, this loss is defined as:
\[ L_{\psi} = \frac{1}{l} \sum_{i=1}^{l} L_{CE}(\psi_i, \phi_i), \]
where $L_{CE}(\psi_i, \phi_i)$ is the cross entropy loss between the target sequence and the designed sequence at position $i$.
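As a small worked illustration of the sequence-recovery objective, the sketch below computes the cross entropy at each position and averages it over the l positions of the sequence. It assumes that the model outputs one probability distribution over {A, C, G, U} per position. The name `mean_cross_entropy` and the input layout are hypothetical, not the authors' implementation.

```python
import math

NUCLEOTIDES = "ACGU"


def mean_cross_entropy(target, probabilities):
    """Average per-position cross entropy between a target sequence and
    predicted distributions over A, C, G, U (one list of 4 values per position)."""
    if len(target) != len(probabilities):
        raise ValueError("target and predictions must have the same length")
    total = 0.0
    for nucleotide, distribution in zip(target, probabilities):
        # Cross entropy against a one-hot target reduces to the negative
        # log-probability assigned to the correct nucleotide.
        total += -math.log(distribution[NUCLEOTIDES.index(nucleotide)])
    return total / len(target)


# A uniform prediction yields log(4) at every position.
uniform = [[0.25, 0.25, 0.25, 0.25] for _ in range(4)]
print(mean_cross_entropy("ACGU", uniform))
```

A model that assigns probability 1 to every target nucleotide reaches a loss of 0, which corresponds to perfect sequence recovery.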
Model Our model is a vanilla auto-regressive encoder-decoder transformer model (Vaswani et al., 2017) with a next token prediction objective. The encoder embeds the structure information, while the decoder auto-regressively generates RNA nucleotide sequences by sampling from the softmax distribution. For RNAinformer, we use axial attention in the first encoder block to process the matrix input, similar to the RNAformer (Franke et al., 2023). However, instead of working on a 2D latent space, we use pooling to reduce the 2D representation to a 1D vector that is then passed through the encoder and the decoder to generate candidate sequences. Figure 3 and Figure 4 in Appendix A give an overview of our model.
Training Details Due to hardware limitations, we set the maximum sequence length to 100 for all our experiments. We train our model with 6 encoder blocks and 6 decoder blocks with an embedding dimension of 256. The model is trained using a cosine annealing learning rate schedule with warm-up and AdamW (Loshchilov & Hutter, 2019). The hyperparameters used for training our model are described in Table 2 in Appendix B.
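The text does not specify which pooling operation reduces the 2D representation. As one plausible reading, and purely as an assumption, the sketch below averages an L × L × d latent over its second axis to obtain one d-dimensional embedding per sequence position. The name `mean_pool_rows` is hypothetical.

```python
def mean_pool_rows(latent):
    """Reduce an L x L x d nested list to L x d by averaging over the second axis.

    Assumption: mean pooling is used here for illustration only. The actual
    pooling operation of the model is not stated in the text.
    """
    pooled = []
    for row in latent:
        width = len(row)
        dim = len(row[0])
        pooled.append([sum(cell[k] for cell in row) / width for k in range(dim)])
    return pooled


# Toy latent with L = 2 positions and d = 2 channels.
latent = [
    [[1.0, 0.0], [3.0, 2.0]],
    [[0.0, 0.0], [0.0, 4.0]],
]
print(mean_pool_rows(latent))
```

The result has one embedding per nucleotide position, which matches the sequence-shaped input that a standard transformer encoder expects.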
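For readers unfamiliar with the schedule, the following sketch shows one common form of cosine annealing with linear warm-up. The step counts and the peak learning rate in the example are placeholders, not the values from Table 2. The function name `learning_rate` and the linear warm-up shape are assumptions.

```python
import math


def learning_rate(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warm-up to peak_lr, followed by cosine decay to min_lr."""
    if step < warmup_steps:
        # Warm-up phase: increase linearly until the peak is reached.
        return peak_lr * (step + 1) / warmup_steps
    # Decay phase: progress runs from 0 (end of warm-up) to 1 (end of training).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Placeholder values for illustration only.
for step in (0, 9, 10, 55, 100):
    print(step, learning_rate(step, total_steps=100, warmup_steps=10, peak_lr=1e-3))
```

In practice, such a function would supply the learning rate to the AdamW optimizer at every training step.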

EXPERIMENTS
We evaluate the RNAinformer in three settings with increasing complexity: We design RNAs for nested structures (Section 4.1), pseudoknotted structures (Section 4.2), and for experimentally validated structures ([…] et al., 2000), including all kinds of base interactions (Section 4.3). We use two different folding algorithms: RNAfold (Lorenz et al., 2011) and RNAformer (Franke et al., 2023). While the former is the most widely used folding algorithm, the latter is the current state-of-the-art deep learning-based approach, capable of predicting RNA structures with all kinds of base pairs. We use RNAfold for our experiments on nested structures and the RNAformer for all other experiments, since RNAfold can provide solutions for nested structures only. During evaluation, we generate between 20 and 100 candidate sequences for each task. The first sequence is generated using a greedy strategy, and the remaining sequences are generated using multinomial sampling. All datasets used for our experiments are detailed in Appendix C.
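The candidate generation scheme described above (one greedy sequence, the remainder drawn by multinomial sampling) can be sketched as follows. The function `toy_distribution` is a hypothetical stand-in for the decoder's softmax output, and all names are assumptions made for illustration.

```python
import random

NUCLEOTIDES = "ACGU"


def generate(next_distribution, length, greedy, rng):
    """Generate one sequence token by token from a next-token distribution."""
    sequence = ""
    for _ in range(length):
        probs = next_distribution(sequence)
        if greedy:
            # Greedy decoding: always take the most probable nucleotide.
            index = max(range(len(NUCLEOTIDES)), key=lambda k: probs[k])
        else:
            # Multinomial sampling: draw a nucleotide according to probs.
            index = rng.choices(range(len(NUCLEOTIDES)), weights=probs, k=1)[0]
        sequence += NUCLEOTIDES[index]
    return sequence


def generate_candidates(next_distribution, length, n_candidates, seed=0):
    """First candidate is greedy, all remaining candidates are sampled."""
    rng = random.Random(seed)
    candidates = [generate(next_distribution, length, True, rng)]
    for _ in range(n_candidates - 1):
        candidates.append(generate(next_distribution, length, False, rng))
    return candidates


def toy_distribution(prefix):
    """Hypothetical stand-in for the decoder softmax over A, C, G, U."""
    if len(prefix) % 2 == 0:
        return [0.1, 0.2, 0.6, 0.1]
    return [0.1, 0.6, 0.2, 0.1]


candidates = generate_candidates(toy_distribution, length=6, n_candidates=20)
print(candidates[0])  # greedy candidate: GCGCGC
```

The greedy candidate is deterministic, while the sampled candidates provide the diversity needed to find several sequences for the same target.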

Metrics
The ultimate goal of structure-based RNA design is to generate sequences that fold back into the target structure. Following the common convention in the field of RNA design, we report the number of solved tasks for a given benchmark dataset. In addition, we provide a more comprehensive analysis of all experiments with different performance measures in Appendix D.

RNA DESIGN FOR NESTED STRUCTURES
We compare the performance of RNAinformer against one of the currently best-performing families of algorithms: LEARNA, Meta-LEARNA, and Meta-LEARNA-Adapt (Runge et al., 2019). For each task, we generate 20 sequences with each algorithm and report the percentage of solved tasks. A task thus counts as solved if at least one of the 20 sequences folds into the desired target structure.
Data We use the Rfam dataset provided by Franke et al. (2023). While the set was originally built for learning a simplified biophysical model of RNA folding, it serves exactly our needs: Homologies between the training and test set have been removed using RNA family annotations from the Rfam database (Griffiths-Jones et al., 2003), it contains a large amount of training data, and all sequences have been folded with RNAfold to obtain secondary structures. Hence, the dataset contains only canonical base pair interactions.
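The solved-task criterion can be written down in a few lines. In this sketch, `toy_fold` is a hypothetical lookup table standing in for a folding engine such as RNAfold, and the sequences and structures are made up for illustration.

```python
def solved_fraction(tasks, fold):
    """Fraction of tasks for which at least one candidate folds into the target.

    tasks: list of (target_structure, candidate_sequences) pairs.
    fold:  callable mapping a sequence to its predicted structure.
    """
    solved = 0
    for target, candidates in tasks:
        if any(fold(sequence) == target for sequence in candidates):
            solved += 1
    return solved / len(tasks)


# Hypothetical stand-in for a folding engine: a fixed lookup table.
TOY_FOLDS = {"GGGAAACCC": "(((...)))", "GCAAAAGC": "((....))"}


def toy_fold(sequence):
    """Return the stored structure, or a fully unpaired one if unknown."""
    return TOY_FOLDS.get(sequence, "." * len(sequence))


tasks = [
    ("(((...)))", ["AAAAAAAAA", "GGGAAACCC"]),  # solved by the second candidate
    ("((....))", ["AAAAAAAA", "UUUUUUUU"]),     # no candidate folds correctly
]
print(solved_fraction(tasks, toy_fold))  # 0.5
```

Because a single matching candidate suffices, the metric rewards generating several diverse candidates per task.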

Results
The results on the Rfam dataset are shown in Table 1 (left). We observe that RNAinformer clearly outperforms the other methods, solving 98.7% of the tasks. Notably, this is roughly 34 percentage points more than the next best competitor, LEARNA (64.8% solved tasks). Furthermore, RNAinformer generates multiple, highly diverse valid sequences for each task, indicated by a high diversity score of 0.713, as reported in Table 6 in Appendix E.

RNA DESIGN WITH PSEUDOKNOTS
In this section, we assess the performance of RNAinformer when designing RNAs for pseudoknotted input structures. We compare against aRNAque (Merleau & Smerlak, 2022), a recently proposed Lévy flight mutation-based design algorithm that supports pseudoknotted structures. However, the evaluation of aRNAque is computationally expensive, with rather high runtimes (an evaluation for 32 pseudoknotted structures took nearly 24 hours on two CPUs using 50 generations). We, therefore,

Results
We report results in Table 1 (right). Remarkably, RNAinformer solves nearly 50% of the tasks (47.2% solved tasks), while outperforming aRNAque on pseudoknotted structures by a margin of more than 10 percentage points (19.8% solved tasks compared to 9.4%). Again, we observe that RNAinformer generates various solutions for most of the tasks, as shown in Table 7 in Appendix E.

RNA DESIGN WITH ALL KINDS OF BASE INTERACTIONS
In this section, we investigate the ability of RNAinformer to design RNAs from structure data that contains all kinds of base pairs. To account for the difficulty of the task, we sample 100 sequences instead of only 20 sequences.
Data For our evaluations, we use the inter-family dataset provided by RnaBench (Runge et al., 2024). This dataset uses the test sets TS1, TS2, TS3, and TS hard provided by Singh et al. (2021). All datasets contain structures with both pseudoknots and base multiplets.

Results
We observe that RNAinformer cannot solve the structures of the different test sets, indicating that designing sequences for structures with all kinds of base pairs is much more challenging than for nested structures or structures with pseudoknots only. However, for all samples with base multiplets, we correctly predict more than two of the multiplets present in the structures on average, as reported in Table 8 in Appendix E. Furthermore, Figure 2 shows examples of the training predictions that solve structures that contain base multiplets as well as pseudoknots. We conclude that RNAinformer is generally capable of designing RNA sequences from structures that contain all kinds of base interactions. Nevertheless, we acknowledge that further improvements in performance might require adjustments to our model, such as scaling the model size or applying a finetuning strategy.

CONCLUSION
In this work, we propose RNAinformer, the first RNA design algorithm capable of designing RNA sequences for structures that contain all kinds of base interactions, including non-canonical base pairs, pseudoknots, and base multiplets. We demonstrate the strong performance of RNAinformer on tasks with nested structures only, tasks that contain pseudoknots, as well as on experimentally derived structures with all kinds of base interactions. We think that RNAinformer is a useful basis for future approaches to RNA design and expect it to be of great value for the RNA design community.
In the future, we plan to further condition our model on different properties of RNA, e.g., to design RNAs with desired G and C nucleotide ratios.
A MODEL DETAILS

Figure 1: Representations of RNA secondary structures. (Left) Common graph representation of the RNA. (Middle) Dot-bracket notation in the graph structure. A pair of nucleotides is indicated by a pair of matching brackets; unpaired nucleotides are indicated by a dot. (Right) Matrix representation of the RNA. The matrix is a binary L × L square matrix, where L is the sequence length of the RNA. Pairing nucleotides are shown in yellow.

Figure 2: Example design predictions of solved structures including base multiplets and pseudoknot interactions.

Figure 3: Overview of matrix input processing in RNAinformer.

Figure 4: Overview of nucleotide sequence generation.

Table 1: Performance on the Rfam and bpRNA datasets for nested and pseudoknotted structures, respectively.

Table 3: Overview of the Rfam dataset.

Table 4: Overview of the bpRNA dataset.

Table 5: Overview of the Inter-family Dataset.

Table 8: Results for RNA design for experimentally validated structures with all kinds of base interactions.

Table 9: Tasks and folding algorithms.