RNAmigos2: Fast and accurate structure-based RNA virtual screening with semi-supervised graph learning and large-scale docking data

RNAs constitute a vast reservoir of mostly untapped drug targets. Structure-based virtual screening (VS) methods screen large compound libraries for identifying promising candidate molecules by conditioning on binding site information. The classical approach relies on molecular docking simulations. However, this strategy does not scale well with the size the small molecule databases and the number of potential RNA targets. Machine learning emerged as a promising technology to resolve this bottleneck. Efficient data-driven VS methods have already been introduced for proteins, but these techniques have not yet been developed for RNAs due to limited dataset sizes and lack of practical use-case evaluation. We propose a data-driven VS pipeline that deals with the unique challenges of RNA molecules through coarse grained modeling of 3D structures and heterogeneous training regimes using synthetic data augmentation and RNA-centric self supervision. We report strong prediction and generalizability of our framework, ranking active compounds among inactives in the top 1% on average on a structurally distinct drug-like test set. Our model results in a thousand-times speedup over docking techniques while obtaining higher performance. Finally, we deploy our model on a recently published in-vitro small molecule microarray experiment with 20,000 compounds and report enrichment factors at 1% of 8.8 to 16.8 on four unseen RNA riboswitches. This is the first experimental evidence of success for structure-based deep learning methods in RNA virtual screening. Our source code and data, as well as a Google Colab notebook for inference, are available on GitHub.1


Main
Only a small fraction of RNA codes for proteins, while ncRNA is now known to play critical roles in a wide range of biological processes [64].For instance, about 2000 genes code for micro-RNAs, which in turn affect the expression of 60% of genes [15].Despite this ubiquity, the first RNA-targeting drug, risdipalm, was only recently approved by the FDA [35], and nearly all commercially available small molecule therapies still target proteins.Being able to target RNA would drastically increase the size of the druggable space, and propose an alternative for overused protein targets in scenarios where they are insufficient.For instance, lncRNAs could represent interesting therapeutic targets in oncology [57], for which protein targets might be too specialized.RNA targets also represent a therapeutic avenue in pathologies where protein targets are absent, such as triple-negative breast cancer [66].In this context, RNA is increasingly recognized a a promising family of targets for the development of novel small molecule therapeutics [14,18,11,2], highlighting the need for efficient tools for RNA drug discovery [14].
Here, we focus on structure-based approaches which rely on the knowledge of the 3D structure of the RNA of interest to search for candidate compounds.A major advantage of this methodology is to allow for the discovery of binding modes beyond those already known and achieve high specificity [3].Recently, many binding affinity prediction methods based on machine learning [4,50,19,44,61,65,30] were introduced, challenging traditional ones and unlocking state-of-the-art performance.Despite the widely recognized potential for RNA small-molecule therapeutics, the development of computational tools for identifying promising RNA ligands has faced substantial roadblocks compared to computational protein-targeting small molecule discovery [32].A major reason for this lag is the substantial gap in available data of RNA structures and small molecule interactions and affinity measurements.Whereas the RCSB-PDB database [45] contains hundreds of thousands of experimental protein structures, and now hundreds of millions predicted structures from AlphaFold [22], only a couple of thousands are available for RNAs.Similarly, databases of protein-ligand interactions such as PDBBind [63] have been widely used for affinity prediction models and contain tens of thousands of experimentally measured ligand interactions for proteins and only around 100 for RNAs.In addition to the paucity of data, the uniqueness of biophysical phenomena that underpin RNA folding make it difficult to translate approaches developed for proteins into the RNA domain [56].
In recent years, several computational methods for RNA drug discovery confronted these challenges.Two major strands have emerged: docking-based and direct scoring approaches.For a given pocketligand pair, docking approaches estimate the binding affinity by searching over possible poses of a ligand, iteratively scoring different poses while minimizing an energy function.Widely used nucleic-acid-specific docking tools include rDock [46], Autodock VINA [58], and RLDOCK [53].To enhance the docking procedure, alternative pose scoring functions have been proposed such as AnnapuRNA [51], LigandRNA [42], and very recently FingeRNAt [55], and RLaffinity [54].These methods offer higher confidence screening results but come at the cost of high computational demands: the time needed to search over poses in a single binding site-ligand pair with a docking software is on the order of minutes.Therefore, instead of undergoing the expensive docking procedure, a few direct scoring methods have been proposed to estimate the binding probability without pose searching, typically using in data-driven strategies.Tools such as InfoRNA [12] and InfoRNA2.0[13] score pocket-ligand pairs directly according to the secondary structure similarity to a fixed database of experimentally determined RNA-ligand complexes.Our previous contribution RNAmigos [37] aimed to eliminate the need of a fixed ligand library and made the first attempt at integrating the tertiary structure of the binding site into deep learning assisted compound screening, showing the potential of graph representations based on the Leontis-Westhof nomenclature of base pairs in 3D [27], and outperforming InfoRNA2.0at native ligand recovery.However, at present, no structure-based RNA virtual screening tool combines the accuracy of docking with the scalability of direct scoring methods.
In this paper, we introduce the first structure-based method for virtual screening on RNA that is competitive with molecular docking in a fraction of the time, opening the door to large-scale target-based RNA drug discovery.Whilst our previous work provided insights on RNA binding sites representation as graphs [37], the paucity of RNA small molecule binding data impeded the performance and generalization of this approach, hindering its use in real drug discovery pipelines.We address this data paucity by generating a large database of docking scores, together with unsupervised data, as data augmentations.Moreover, we improve our RNA binding site graph representations and propose novel model designs that open the door to enhanced neural networks and pretraining techniques.Despite running in seconds instead of hours, the proposed approach retrieves higher enrichment factors to molecular docking (top 1.1% vs 4% of the candidate ligand list).Moreover, using the tool and docking jointly, we are able to enhance docking virtual screening enrichments by cutting computation costs tenfold while improving the predicted rank of actives from 1.1% to 0.6%, This strong performance is found to be maintained over a diverse and unseen set of targets and compounds.Finally, we achieve strong enrichment factors on an independent experimentally derived ligand screen of 20,000 compounds which constitutes the first evidence of success for structure-based deep learning assisted virtual screening for RNA.

RNAmigos2
Figure 1: a: Datasets used in this paper.Beyond the supervised data available in the form of 1740 deposited structures of a ligand and its binding site, we use 1.2M molecular docking values as synthetic data and unlabeled datasets of 1.5M chemical compounds and 200k RNA residues neighborhoods.b: Representation of RNA structure as a 2.5D graph.Each binding site is encoded using Leontis-Westhof classification [27] and converted into an edge-attributed directed graph.c: Models to learn RNA-ligands interaction.We use an RNA and a ligand encoder, pre-trained in an unsupervised way.We then decode these representations to predict binding affinity (Aff) trained using our synthetic data; or a compatibility score (Compat) trained using deposited complexes.d: RNAmigos2 integrate these models to screen input ligands libraries, sorting them by their predicted probability of binding.
RNAmigos2 is designed to enable rapid screening of ligand libraries for binders using a query RNA structure.Our pipeline, illustrated in Figure 1, takes as input a candidate binding site structure (as full 3D or base pairing network) and a list of compounds to screen.The tool then returns a score for each compound reflecting binding likelihood.
RNAmigos2 model uses an encoder-decoder framework, with two encoders and two decoders, which are each trained on separate data sources.The two encoders are respectively mapping the input RNA binding sites and small molecules into embeddings.The RNA 3D structure is represented as a graph, called 2.5D graph, that encodes all canonical (Watson-Crick and Wobble) and non-canonical base pair interactions occurring in the structure [28].This representation allows us to capture key features of the RNA 3D architecture with a discrete mathematical object [27,26] that suits well machine learning frameworks and was shown to be a useful biological prior for RNA cheminformatics applications [37].The RNA encoder takes the 2.5D graph as input and learns to generate an RNA representation using self-supervised training schemes [38] on all available non redundant RNA substructures.Ligands are represented as molecular graphs.The ligand encoder learns a neural representation for ligands, using a variational autoencoder model proposed in [8] and trained on large dataset of chemical compounds [20].
To train decoders, we extract 1,740 RNA-ligand complexes from the PDB [6] and group them in 436 clusters of similar binding sites we identified using RMAlign [70] with a similarity threshold of 0.75.Our first decoder (Compat) is trained as a binary classifier to distinguish between the native ligand of a binding site and decoys.Then, to synthetically augment the limited number and drug-likeness of PDB compounds, we perform a large-scale docking experiment by docking 500 drug-like ChEMBl chemical compounds on our 1,740 binding sites.Finally, our second decoder is trained to predict binding affinity (Aff) using the docking data.Given a binding site and a list of ligands, we encode all objects and use our joint decoders to predict a compatibility score that can be used for virtual screening.In what follows, we measure performance of a model by its ability to assign a high score to active compounds against a pool of inactive (decoy) compounds.

RNAmigos2 is competitive with docking software and generalizes to new targets
In this work, we propose several enhancements to the original RNAmigos encoder model: we expand the dataset and training regimes, represent binding sites with directed graphs, and update the model architecture and pretraining strategy.Following the validation protocol in [37] and our ligand filtering, we show that the proposed enhancements yield a significant performance gain of over 25% AuROC points, achieving a new state-of-the-art on this task.The detailed results are presented in supplementary section E.
We now turn to an evaluation of our results using a larger and drug-like decoy set.The new decoy set is built querying the ChEMBL database [16] such that: (i) at most 100 compounds have a physicochemical similarity ≥ 0.6 with the native ligand for each binding, (ii) the set has a high diversity detrmined by the MinMaxPicker algorithm ( [24]), and (iii) all compounds pass a drug-like filter, keeping compounds with a molecular weight below 400, at least one ring, less than five rotatable bonds and H-bond donors, less than ten H-bond acceptors and logp values below five [49].The active compound is defined as the one found co-crystallized in the PDB with each binding site, also termed the native ligand.On the binding site side, we enforce non-redundancy by computing the 3D alignment-based RMscore [70] over all pairs of binding sites in our dataset and perform a hierarchical agglomerative clustering with a cutoff of 0.75, resulting in 436 clusters of binding sites.The resulting binding site groups represent similar binding sites or different instances of the same binding site with potentially different ligands.Then, we performed a data splitting on the groups, ensuring no structural data leakage exists between our training set of 367 group of binding sites and our test set of 69 binding sites.
On this new dataset, we train models using our two settings: Aff and Compat, comparing performance to rDock on the test set.We find that Aff does indeed correlate well with rDock intermolecular energy terms (Spearman correlation of 0.73) on unseen binding sites, as well as shows a clear separation between native ligands and decoys with respect to both docking and Aff scores (see Fig. 2 a).Applying our models in a VS setting (Fig. 2c) we achieve a mean AuROC score of 0.958 and 0.939 using Compat and Aff models respectively.We emphasize the high performance of the Aff model that was trained solely on simulated data.It is worth noting that such high values are common in VS, as libraries of hundreds of thousands of compounds are screened but only a few thousands of compounds in the top percentiles can be sent to wet-lab assays.On the same task, rDock slightly outperformed our models with a mean value of 0.959.This result is expected given the intensive (and CPU-time costly) optimization process executed by rDock in identifying the best poses.We underline however that the runtime of our method is around ten seconds on a single machine, whereas the docking results were obtained after ≃ 8 CPU hours per binding site.
Interestingly, we find that the errors in VS aligned for each binding site for our different models in Figure 2d.While as expected, rDock and Aff errors align, the Compat trained on an independent data source has complementary errors modes.This suggests a simple strategy for mitigating their individual errors, and we propose to ensemble each model's results to create a model coined as Mixed.The Mixed scores now have a lower correlation of 0.68 with docking values (Figure 2b), but we see that this correction enables to effectively recover active compounds that the docking program was not able to identify alone.We compare the VS performance of this mixing with ensemble of two models trained on the same data source, and report AuROCs over 3 independent seeds for all pairwise combinations in supplementary Table 7.We reach a performance of 0.990 AuROC for the Mixed to be compared to 0.965 and 0.940 for the ensembled versions of our single modality models.This performance gap proves the synergistic effect of our models, and enables us to significantly outperform rDock (0.959).Hence, the performance boost obtained by correcting docking scores with experimental data exceeds the decrease resulting from our imperfect docking surrogate model, which establishes a new state-of-the-art and best performing virtual screening tool.Next, we coin our best performing configuration, Mixed as RNAmigos2 for simplicity and evaluate it in a larger benchmark of RNA virtual screening tools and study the generalizability of our predictions.We compare the resulting model with several other VS tools, namely RLDOCK [53], AutoDock+Vina [58], Annapurna [51], rDock [46] and RNAmigos1 [37].As presented in Figure 3a, our method outperforms all methods with a wide margin for all methods and a smaller margin for rDock as seen in Fig 2c .We report minimal variance in our models' performances by training them over three random seeds and include these values in the first column of supplementary Table 7. Next, we look at the generalization capacity of RNAmigos2 when considering the novelty of test set pockets and ligands with respect to the training data.For each binding site in the testing set we report the AuROC versus the structural similarity (RM-score [70]) to the most similar binding site in the training set (see Fig. 3b.We see that the model performs consistently well at ranges of structural redundancy from 0.3 up to the threshold value of 0.75.

RNAmigos2 predictions consistently boost virtual screening efficiency
Our model outperforms docking by using complementary sources of information.However, the docking surrogate model alone (Aff) achieves a lower AuROC of 93.9% compared to 95.9% for actual docking.Thus, we also investigate mixing the Compat model with docking instead of Aff, and observe it enhances performance from 99.0% to 99.4%.We refer to this more accurate, but more expansive strategy, as RNAmigos++.
In practice, we have a limited time budget, and hence cannot afford to scan large scale databases with docking methods.This introduces a trade-off in virtual screening between quality of the scoring and the amount of computational time expended.When scanning a binding site, tool needs about 30 CPU seconds to perform the virtual screening, while the rDock model needs about 8 hours.Hence, in a drug design setting and under a compute budget of one CPU day, docking enables the screening of about ∼ 1400 compounds while RNAmigos2 enables screening ∼ 1.4M compounds.
To leverage the performance gain obtained with RNAmigos++ while keeping computations tractable, we propose to quickly presort compounds according to our RNAmigos2 score, and then sort the best scoring ones again with RNAmigos++ (See Figure 4a).We expect that presorting the compounds will help the docking tool to review the most promising ligands first and thus use the allocated compute time more efficiently.We compare this hybrid strategy with the sole use of docking and with RNAmigos2 in Figure 4b.Here, we define the efficiency of a VS method, as the area between the AuROC curve through time for RNAmigos2 and rDock (further developped in Methods).The efficiency of our hybrid method is split by binding sites and plotted in Figure 4C.This benchmark highlights that our methods obtain excellent results much faster than with a dockingonly approach.We note that we benefit of a boost of performance with RNAmigos++ over the RNAmigos2 model after screening only the top compounds, which yields a ten-fold speedup.Finally, we show that we improve the performance for almost all binding sites, which suggests that our protocol never results in a waste of time (See Figure 4B).These results suggest that our method could allow the screening of larger libraries, potentially those coming from ligand generative models [7].

RNAmigos2 identifies riboswitch ligands in large-scale in vitro assay
We now aim to assess the performance of RNAmigos2 in a realistic large-scale scenario with in vitro validation.In the previous experiments, we established ground truth using co-crystallized ligands from the PDB as the active (native) compounds.Here, we use a recent large-scale in vitro screen of approximately 20k compounds against nucleic acid targets, known as ROBIN [68].To make predictions using RNAmigos2, we identified all RNA targets in the ROBIN database which had a perfect sequence identity to an RNA chain in the PDB and pass the corresponding 3D structure as input to the model.Importantly, we ensure that all RNAs collected were not present in our training Figure 4: a. Illustration of our hybrid strategy.First, we sort the compounds using RNAmigos2 with a high throughput.Then, rDock is ran only on the best scoring compounds and used to replace our docking surrogate score.b.AuROC of different models, averaged over binding sites, as a function of time budget spent.c.Efficiency metric split by binding sites.
We computed the RNAmigos2 scores of all 24,572 assayed molecules in about four CPU minutes.We plot the chemical space induced by ROBIN compounds in Figure5a.Our top predictions (best 5%) are shown as colored full dots, with greater size for actives.Notably, we see that the retrieved active compounds span a large fraction of the chemical space.Nevertheless, we also note that some regions appear to be missed by our tool, such as the rightmost part of the plot.When plotting sample compounds from this region, we noted that the molecules seem to display sulfur atoms, which also decreases their drug likeness.
We plot the score distributions of actives (colored) and inactive compounds (grey) in Figure5c.This data again suggests that RNAmigos2 effectively differentiates actives from inactive compounds.We see that our tool is able to achieve enrichment factors at 1% ranging from 8.78 to 16.82, for an average of 12.2.Motivated by papers pointing out the low specificity of drug-target interaction predictor in proteins [59], we add a control experiment (RNAmigos2-Control Fig. 5c) where the prediction for each ligand is swapped between two pockets, eliminating the structure-specific signal while preserving the general affinity distribution.We observe positive enrichment factors (EF 1% 2.91-3.53)that illustrate how certain ligands seem to exhibit generic RNA binding properties.However, these enrichments are much weaker in the control setting, supporting the target-specificity of our predictions.Finally, we also include results obtained using rDock alone, and observe lower enrichments than when using RNAmigos2 (EF 1% 1.76-6.17).To our knowledge this is the first experimental evaluation of a deep learning structure-based model for RNA virtual screening.Large, colored compounds correspond to selected actives and smaller ones to selected inactive compounds.Hollowed ones are other actives and small grey points correspond to unselected, inactive compounds.are shown in color, with greater size for actives.c score distributions of actives (curve colored by pocket) and inactive compounds (grey curve) for each binding site.The shaded regions correspond to selecting the 5%, 2% and 1% top scoring compounds respectively.Score distributions obtained by shuffling predictions and using rDock are also provided.

Discussion
In this work, we address the timely question of in-silico RNA drug design.In contrast to proteins for which we already accumulated a lot of structural data, the current sparsity of reliable threedimensional RNA structures prevents the application of machine learning paradigms for RNA drug discovery tasks.The development of dedicated frameworks and methods are therefore essential to address this bottleneck.
Specifically, we complement the available RNA-ligand complexes structures by using RNA docking as a data augmentation strategy and incorporate domain-specific unsupervised pre-training techniques.
Whilst our docking surrogate model slightly under-performs traditional docking, its combination with a model trained on experimental data drastically outperforms three docking tools (0.96 rDock vs 0.99 RNAmigos2 AuROC), in a fraction of runtime, for diverse targets.Moreover, by replacing the surrogate with actual docking scores for the top-scoring compounds, we manage to again half this error while maintaining a tenfold speedup over docking.Finally, we establish our tool's performance on an independent large-scale in vitro binding screen, and show that it provides twelve-fold enrichment factors in four minutes.Together, these results establish RNAmigos2 as the new state of the art in structure-based RNA virtual screening.
By publicly releasing all our datasets, source code and model weights, we hope to spark a community effort in this important direction.Limitations to our approach include the need for pre-defined binding sites for which integrations with binding site predictors will need to be developed [60,52,53], as well as for the modeling of binding site flexibility [34].We envision tools such as RNAmigos2 will play a synergistic role with rapidly emerging RNA-centric technologies for molecular design [21] and newly released AlphaFold3 [1] which supports nucleic acids to pave the way for the next generation of RNA drug discovery.Notably, our methods have the distinct advantage to enable structure-based virtual screening for RNA with only low-resolution structural data at hand such as base pair interactions.In light of the daunting number of potential RNA targets, such feature could prove itself a major asset to mine entire genomes and fully embrace the era of RNA therapeutics [36].

Constructing and representing a Dataset of Binding Sites
To construct our dataset of binding sites, we download all RNA structures from the Protein Data Bank (PDB) [45] containing RNA and at least one ligand.Then, we extract RNA binding sites as RNA residues with an atom at most 10Å from the ligand, and filter out resulting binding sites with less than five RNA residues or involving a majority of protein residues.We additionally remove complexes with ligands that did not follow rules introduced in Hariboss[39] (e.g.removing ions, crystallization artefacts, etc.).The final dataset consists of 1,740 binding sites associated with 264 unique ligands.
We then use RNAGlib [33] to obtain a representation of the structure of our binding sites in the form of 2.5D graphs.The nodes of these graphs represent nucleotides and contain a one-hot encoding of the nucleotide type, and the edges represent interactions between nucleotides, split in interaction types that follow the Leontis-Westhof classification [28].An undirected 2.5D graph representation was shown to give better performance [37] than traditional representations for machine learning.In this work, we additionally use directed edges relevant to disambiguate sequence orientation and asymmetrical non canonical interactions.We expanded our graphs by performing a breadth-first expansion of depth four, to add context nodes around the binding sites.

Unsupervised Datasets of RNA structures and chemical compounds
We use two additional sources of unsupervised data to pretrain our networks.We use all RNA structures in the non redundant dataset proposed in [29], represented as directed 2.5D graphs, and extract the 2-hops neighborhoods around each residue, resulting in over 200,000 small RNA graphs.We additionally use a curated dataset of ∼1.5M ZINC [20] compounds, introduced in [43] as reasonable drug like compounds.These molecular compounds are represented as molecular graphs with nodes containing atom types, charge and aromaticity and edges with different chemical interactions.

Docking Dataset Construction
Our current database only includes co-crystallized ligands, that are thus positive binding examples.We do not have information about the affinity between a binding site and the native ligands of other binding sites, which are thus assumed to be negative pairs.Moreover, we would like to have more affinity data about drug-like compounds, since PDB compounds are not necessarily drugs (cofactors for instance).To generate more data, we will use molecular docking as a source of synthetic data.
Therefore, we curate a list of drug-like compounds from ChEMBL [16].To do so, we use a drug-like filter and keep compounds with a molecular weight below 400, at least one ring, less than five rotatable bonds and H-bond donors, less than ten H-bond acceptors and logp values below five [49].
We complement this procedure with the MaxMin algorithm based on Tanimoto distance between fingerprints, using RDKit [24], for picking a subset of 500 diverse compounds, ensuring that up to 100 compounds have a Tanimoto Coefficient greater than 0.6 with the native ligand for each binding site.
To perform our molecular docking, we compared two candidate tools, rDock [46] and AutoDock Vina [58] which were identified in a recent review of nucleic-acid docking tools [31].In addition, we used AnnapuRNA [51], a knowledge-based scoring function, as an independent referee to compare these candidate software in terms of native pose identification.We first establish our docking procedure and compare our candidate docking programs using self and cross docking experiments.In self-docking experiments, we evaluate programs' ability to predict the binding modes inside the native macromolecule by comparing the generated ligand poses to the crystal structure.In cross-docking experiments, we evaluate the ability of docking tools to retrieve actives among decoys.An in-depth explanation of these experiments is available in supplementary section C. Based on our results, we choose to use rDock as our docking program.We used it to dock the pdb and chembl ligand sets of 264 native binders and 500 ChemBL compounds in our 1740 binding sites, resulting in over 1.3 million values, that can be downloaded at the following link2 .On the test set, the docking scores were used to obtaining virtual screening performance of rDock for benchmarking purpose.The ones obtained for the train set binding sites are used as a data augmentation strategy.

Data splitting
We compute the RMscore [70] of all pairs of binding sites in our dataset, and report their distribution in supplementary Figure 6a.As can be seen, while most of the binding sites are distincts, there exists a proportion of binding sites that are highly similar.Hence, we then perform a hierarchical agglomerative clustering with a cutoff of 0.75.This allows us to group similar binding sites together and results in 436 groups of binding sites.
Those groups can represent similar binding sites, or different instances of the same binding site with different ligands.Data is split based on the groups, ensuring no structural data leakage exists between our training set of 367 group of binding sites and our test set of 69 binding sites.An MDS representation of our train and test groups is available in Figure 6b, showing that our test points are diverse in the space of binding sites.Training and testing on those groups requires caution.To avoid data imbalance, we chose to choose only a representative per group, and to set all ligands of this group as group actives.While training, we chose to dynamically sample one of the group actives.
While testing and to compute AuROCs, all the group actives are used as positives.

Models
Our model adopts an encoder-decoder framework with two encoders and two decoders.Our encoders are pre-trained to serve as a strong initialization for the representation of the input binding sites and the small molecules.For the binding site encoding, we use the directed 2.5D graph representation.Given such a directed graph G = (V, E), we then use a neural network encoder f θ based on Relational Graph Convolutional Network (RGCN) [47] layers to produce node embeddings of dimension d ∈ N and a pooling step reduces those node embeddings to a single graph-level vector representation ϕ G = f θ (G).Our RNA encoder using the strategies proposed in [38] which are based on the metric learning framework [67].Namely, given a similarity metric k, the network is trained to minimize Our directed graphs open the door to using more powerful similarity functions, and hence enhance pre-training.In this work we used a graphlet matching algorithm that matches directed rooted subgraphs.
To encode the structure of small molecules, the ligand is represented as a molecular graph F to serve as input to our model.This graph is then fed into a ligand encoder, denoted as g θ , resulting in ϕ F = g θ (F).We use the architecture and pretraining scheme proposed in OptiMol [8]: we additionally encode the graph as a SELFIES [23] string and try to reconstruct this string from ϕ F following a Variational Auto-Encoder (VAE) framework.
We then concatenate the graphs and ligand embeddings and feed the result to a decoder that outputs real values, so that we can write our overall models as Our first decoder (Compat) uses this architecture and is trained to output one for native ligands and zero for decoys, by minimizing a Binary Cross Entropy (BCE) loss over experimental complexes Our second setting (Aff), keeps the M model described above, but changes the training procedure so that the inter-atomic term of the docking score becomes the prediction target.We normalize docking scores to be between zero and one and train our network to approximate this value using an L 2 loss.The specific model architecture and hyper-parameter settings are described in supplementary Section D.

Inference and validation with virtual screening
Virtual screening (VS) aims to sort a library of compounds such that the active ones are at the top of the list [25].Typically, a VS is used to screen millions of compounds, as only a few thousands can be sent to wet-lab assays.Metrics for VS quantify the gain in active compounds obtained by selecting compounds using a VS strategy instead of a random selection.The main VS metric are the well known AuROC and Enrichment Factors [10], where the enrichment factor at a cutoff X of a VS method is the fraction of actives versus decoys (non-actives) in the top scoring X% relative to the overall active fraction.
Because the number of drug-like compounds available in databases such as Enamine [17] is in the billions, we introduce a novel metric, denoted efficiency, to measure the expected performance of a screening after a certain time budget is spent.While performing a virtual screening over a list of compounds, we keep track of the AuROC as a function of time and define the efficiency of a search as the area under the time vs. AuROC curve.The value of this metric depends on the initial ordering of the list.For instance, if all active compounds are at the end of the list, they are processed last, yielding zero AuROC until these compounds are reached.To get an expected efficiency, we thus repeat the computation over many shuffled initial orderings .In this manner we can capture the trade-off between time expenditure and the quality of the search results in a single value.
To perform virtual screening with our tool, we simply compute the scores for all of our compounds and sort them based on this score.To avoid imbalance when mixing our models, we first normalize the distributions of scores independently for the Compat and the Aff and then take the average of the results as the predicted score.To use our tool in conjunction with docking, we quickly presort the compounds using our mixed score and only dock the prioritized compounds.For those compounds, we use the true docking score instead of their approximation by the Aff.
the ChemBL database and selected for diversity and drug-likeness.We present a few properties of those compounds in Table 2.Moreover, we compute the distribution of inter-molecular energies for the ligands (Figure 7a) and the distribution of pairwise similarities (Figure 7b).We can see that the PDB ligands do not display drug-like features.In particular, the compounds are bigger than drug-like compounds.They also display worse docking scores on average.These results indicate that this second ChEMBL was indeed needed for an assessment of our tool from a drug design perspective.Finally, our results show that the samples in each of the data sets are diverse.

Descriptor
Finally, in Figure 8 we plot a 2D MDS of our ligand space in which we identify test compounds.This illustrates that our actives span a large fraction of the chemical space.

C Docking software benchmark
We assessed two candidate docking tools, rDock [46] and AutoDock Vina [58] which were identified in a recent review of nucleic-acid docking tools [31].In addition, we used AnnapuRNA [51], a knowledge-based scoring function, as an independent referee to compare these candidate software in terms of native pose identification.We first establish our docking procedure and compare our candidate docking programs using self and cross docking experiments.In self-docking experiments, we evaluate programs' ability to predict the binding modes inside the native macromolecule by comparing the generated ligand poses to the crystal structure.In cross-docking experiments, we evaluate the ability of docking tools to retrieve actives among decoys.Based on both experiments, we choose to use rDock as our docking program.An in-depth explanation of these experiments is available in section C.
Let us describe the settings with which we have run our docking experiments, using AutoDock Vina and rDock.For AutoDock Vina, we defined the exhaustiveness, energy_range and num_modes to be 8, 3 and 100, respectively.The search space (i.e., the coordinates for the center of mass and the size of the binding sites) was calculated by AutoGridFR [69].We also executed a different set of experiments by using the parameters presented in [46] (i.e., exhaustiveness=16, num_modes=100, energy_range=30 and the search space calculated by rDock).The greatest affinity difference between (the best pose predicted by) the two sets of parameters was 0.3 kcal/mol.Regarding the RMSD score, the greatest difference occurred for the 1BYJ structure (5.404); however, all of the other structures presented differences lower than 0.183.These results agree with the developers of AutoDock Vina who stated that "With the default (or any given) setting of exhaustiveness, the time spent on the search is already varied heuristically depending on the number of atoms, flexibility, etc. Normally, it does not make sense to spend extra time searching to reduce the probability of not finding the global minimum of the scoring function beyond what is significantly lower than the probability that the minimum is far from the native conformation".
Once a system is chosen, rDock performs a cavity generation first, and then the actual docking.As in [46], the cavity was defined using the crystallographic ligand (found in the PDB data-base) as reference through the "reference ligand method" implemented in rDock.All the parameters were used by default; but the radius, small_sphere, max_cavities, flexibility were set to 4.0, 1.0, 1 and 3, respectively.The models were scored using the "dock_solv.prm"function.
Self and cross docking experiments are performed to have a fair comparison between the capabilities of the selected tools when used on RNA molecules as a target.Spearman correlation matrix between docking scores and RMSD

C.1 Self-docking experiments
During the self-docking experiments, we evaluate the ability of the docking tools to predict the binding modes inside the native macromolecule.Therefore, we compare the generated ligand poses predicted as best to the crystal structure.We use the root mean square deviation (RMSD) as the metric to determine the accuracy of the prediction.
The experiments were run on a set of 9 RNA-ligand complexes (1EI2, 1QD3, 1LVJ, 1BYJ, 1NEM, 1PBR, 1TOB, 2TOB, and 1FMN) obtained from [5] and [69].For AutoDock Vina, the complex structures were downloaded from the PDB data-base.For rDock, the structures were obtained from the validation dataset reported by [46].9B depicts the correlation matrix between the rDock scores, AutoDock Vina affinities, Anna-puRNA scores, and RMSD values.There is a positive high correlation (0.74) between AutoDock Vina and AnnapuRNA scoring functions.Regarding rDock, there are weaker negative correlations with AutoDock Vina (-0.33) and AnnapuRNA (-0.32); however, there is a moderate positive correlation (0.37) with the RMSD values.Considering the RMSD values, its correlations with respect to AutoDock Vina and AnnapuRNA scores are -0.2 and -0.18, respectively.
Annapurna can be used to re-rank the generated poses.If we take the top 5 poses as ranked by AnnapuRNA, the mean RMSD decreases in 6 out of 9 complexes (for both AutoDock Vina and rDock).If only the top pose is considered, 3 complexes and 5 complexes are improved for AutoDock and rDock, respectively.
Based on the self-docking experiment, we can conclude that rDock has better performance than AutoDock Vina in docking ligands against RNA targets.In addition, re-scoring the predicted poses with AnnapuRNA identifies poses with lower RMSD values than those originally ranked by AutoDock Vina and rDock.

C.2 Cross-docking experiments
In cross docking experiments, we evaluate the ability of docking tools to retrieve actives among decoys.We do so by docking each ligand in the dataset to each RNA target.The experiments were carried out on a set of 52 RNA-ligand complexes, having 31 unique ligands.We use the native ligand of each complex as the active ligand and the other 30 ligands in the dataset as inactive ligands.
AutoDock Vina only returns the binding affinity of the bound conformations calculated with its scoring function, and on the other hand, in addition to the total score, rDock details the value of each term of the scoring function (intermolecular, ligand intramolecular, site intramolecular and external restraint).Beside the total score, we also ranked the poses predicted with rDock based on the intermolecular term because it represents the RNA-ligand interaction score [46].We use AnnapuRNA to re-score the poses.The results are presented in Table 4 each docking tool and for the poses.We find rDock with the intermolecular term of its scoring function to have the best performance.Based on the results shown in this section, we chose rDock as our docking tool.The approach rDock uses to parallelize the jobs is independently running each molecule, splitting the ligands file into multiple files, and running each independently.To parallelize the docking experiments, we executed the experiments over each pocket in parallel using an array job in Compute Canada.Therefore, for the cavity generation task we used as many CPU numbers as pockets (1740), taking approximately 2 hours to complete this task.In the same sense, the docking of each pocket with the ChEMBL compounds, using 1740 CPUs, is approximately 4 hours and for the PDB ligands, 7 hours.

Figure 2 :
Figure 2: a,b Relationship between Aff (a) and Mixed predictions against rDock scores for decoys and native ligands in the test set.Horizontal lines indicate the mean of predicted scores for native and decoy ligands.c Distribution of the AuROC performance of different virtual screening models over test binding sites.Diamonds represent the mean performance.We compare models trained under our two settings with rDock (left of the dashed lines), an ensemble of these models (Mixed) and the average of Mixed and rDock.d.Description of the performance of the tools split by binding sites.Each line on this plot corresponds to a point in A. This shows that the successful predictions of different methods are complementary and illustrates their synergy in a mixed model.

Figure 3 :
Figure 3: a Virtual screening benchmark against state of the art docking tools, RNAmigos1 and RNAmigos2.Note, RLDOCK produced errors on 42 binding sites.We only report its performance for 27 binding sites, which we denote with RLDOCK * .b. Depiction of the performance of RNAmigos2 split by similarity of the test binding site to the train set.This similarity is assessed with RMScores for the binding sites, and Tanimoto similarity of chemical fingerprints for ligands.

Figure 5 :
Figure 5: a Visualization of the structure of the riboswitches in the ROBIN dataset used in this experiment.b Chemical compounds selection (best 5%) in the tSNE space of ROBIN compounds.Large, colored compounds correspond to selected actives and smaller ones to selected inactive compounds.Hollowed ones are other actives and small grey points correspond to unselected, inactive compounds.are shown in color, with greater size for actives.c score distributions of actives (curve colored by pocket) and inactive compounds (grey curve) for each binding site.The shaded regions correspond to selecting the 5%, 2% and 1% top scoring compounds respectively.Score distributions obtained by shuffling predictions and using rDock are also provided.
(a) Rdock score distribution for the PDB and ChEMBL datasets.Distribution of pairwise distances for the PDB and ChEMBL datasets.

Figure 9 :
Figure 9: Correlation matrix between docking scores and RMSD to native pose.

Table 2 :
Chemical descriptors of PDB Ligands and ChEMBL compounds

Table 3
summarizes the descriptive statistics for the RMSD of the 20 best poses for each complex and for each tool.The total number of RSMD values for each docking tool is 180 (twenty for each complex).The mean and median RMSD are more favorable for rDock with 4.28A and 3.62A, respectively (7.21A and 7.43 for AutoDock Vina).The distribution of RMSD values is shown in Figure9A.This figure confirms that, in general, the RMSD values for the poses predicted by rDock are lower than those predicted by AutoDock Vina.

Table 3 :
Descriptive statistics of the RMSD obtained with AutoDock Vina and rDockFigure

Table 4 :
Mean and standard deviation of the AuROC scores obtained with different settings.We compare the use of the Rdock score, the use of only its intermolecular term, Rdock with Annapurna and AutoDock Vina with Annapurna.C.2.1 Runtime for docking experimentsTable5reports the execution time it took to perform the docking experiments, including the cavity generation stage and the docking itself.The method selected for generating the cavity was to take the native ligand as the reference ligand.