ASSA: Fast identi¯cation of statistically signi¯cant interactions between long RNAs

The discovery of thousands of long noncoding RNAs (lncRNAs) in mammals raises a question about their functionality. It has been shown that some of them are involved in post-transcriptional regulation of other RNAs and form inter-molecular duplexes with their targets. Sequence alignment tools have been used for transcriptome-wide prediction of RNA – RNA interactions. However, such approaches have poor prediction accuracy since they ignore RNA's secondary structure. Application of the thermodynamics-based algorithms to long transcripts is not computationally feasible on a large scale. Here, we describe a new computational pipeline ASSA that combines sequence alignment and thermodynamics-based tools for e±cient prediction of RNA – RNA interactions between long transcripts. To measure the hybridization strength, the sum energy of all the putative duplexes is computed. The main novelty implemented in ASSA is the ability to quickly estimate the statistical signi¯cance of the observed interaction energies. Most of the functional hybridizations between long RNAs were classi¯ed as statistically signi¯cant. ASSA outperformed 11 other tools in terms of the Area Under the Curve on two out of four test sets. Additionally, our results emphasized a unique property of the Alu


Introduction
Due to the single strand nature of an RNA molecule, its nucleotides are capable of base pairing with the complementary nucleotides. Usually, the hybridization occurs between di®erent regions of the same transcript producing the secondary structure. However, a part of one RNA molecule can bind to a complementary part of another transcript forming inter-molecular duplex. Such RNA-RNA pairing is called antisense interaction and the corresponding RNAs are known as natural antisense transcripts or NATs. 1 Long noncoding RNAs (lncRNAs) are a large and diverse class of transcripts with a length of more than 200 nucleotides that do not encode proteins. The discovery of thousands of lncRNAs expressed in the mammalian cells raises a question about their functionality. 2,3 The fact that the transcription of lncRNAs is regulated, 4 indirectly supports their functionality. Due to the functional diversity, 5 the role and/or the molecular mechanism of only a few hundred lncRNAs have been determined to date. Particularly, it has been shown that some of them function post-transcriptionally via formation of inter-molecular RNA-RNA duplexes. [6][7][8] Several experimental methods have recently been developed to identify intermolecular RNA-RNA duplexes on a large scale (SPLASH, 9 PARIS, 10 LIGR-seq, 11 MARIO, 12 RIA-seq 13 ). Nevertheless, due to the limited availability of the experimental data there is still a need for a computational prediction of antisense interactions. Existing thermodynamics-based tools [14][15][16] compute the free energy of the inter-molecular duplexes (ÁG) to estimate the strength of the RNA-RNA binding. Although these algorithms are e®ective in working with relatively short RNAs (such as bacterial sRNAs) on a small scale, the computational complexity does not allow to directly use them for genome-or transcriptome-wide searches. To overcome this limitation, a number of large-scale computational studies have utilized sequence alignment tools (such as BLASTn 17 or LASTAL 18 ) to predict mammalian NATs. [19][20][21][22] However, these approaches do not account for RNA secondary structure which is crucial for the RNA binding.
Here we present a new computational pipeline called \AntiSense Search Approach" (ASSA) that combines sequence alignment and thermodynamics-based tools for e±cient prediction of the RNA-RNA interactions between long transcripts. It reduces the running time by fast identi¯cation of the putative antisense sites using the sequence alignment tool LASTAL. The detected sites along with the°anking sequences form the putative duplexes. The inter-molecular hybridization energy of every duplex is calculated by the thermodynamics-based tool RNAup and the SumEnergy of all the putative duplexes between two RNAs is computed.
Clearly, the value of the SumEnergy depends on several factors including the transcript lengths (longer transcripts produce more putative duplexes) and GCcontent (G::C base pairing is stronger than A::T). This makes it di±cult to compare RNA-RNA interaction energies computed for transcript pairs with di®erent properties. To tackle this problem, the statistical signi¯cance (\Theoretical P -value") of every SumEnergy value is estimated in ASSA with respect to the lengths and GCcontents of both the interacting transcripts. For this purpose, we developed a mathematical model that predicts the expected background distribution of SumEnergys based on the properties of the input sequences. This model ensured that the interaction energies computed for random sequences were not statistically signi¯cant while most of the functional hybridizations between mammalian RNAs produced strong P -values. Moreover, sorting predictions by P -value instead of SumEnergy improved ASSA accuracy and allowed it to outperform other bioinformatics tools on two test sets containing random sequences.
A similar idea of combining the sequence alignment and thermodynamics-based tools has been used in several recent algorithms. 23,24 However, these approaches do not estimate the statistical signi¯cance of the interaction energies and, therefore, may be biased to predict stronger interactions between longer transcripts or RNAs with higher GC content. First, the lastdb tool is used to index all the queries by executing the command \lastdb DB" and passing all the query sequences via STDIN. Next, a search for antisense sites is performed by submitting the reversed target sequences via STDIN to \lastal -s 0 -m 9999999 -P 1 -p MATRIX.txt -a 12 -b 6 -e 30 DB", where -s 0 is the search strand (0=reverse), -m 9999999the maximum initial matches per query position (the restrictions are removed by using a very large threshold value); -P 1number of parallel threads (the value is updated according to the ASSA launch options); -a 12gap opening penalty; -b 6gap extension cost; -e 30threshold value for alignment scores (can be changed through an ASSA option) and MATRIX.txt is the custom substitution matrix adopted from Szczesniak et al. 22 (Supplementary  Table 1).
In ASSA, we relaxed the gap open/extension penalties suggested by Szczesniak et al. 22 from À20/À8 to À12/À6, respectively. With these settings, complementary Alu-elements located in di®erent transcripts produced a single local LASTAL alignment.
It should be noted that some of the LASTAL local alignments overlap either on one of the sequences (the query or the target) or on both of them. The latter cases are resolved by keeping the alignment with the larger score only.

Comparison of thermodynamics-based tools
We considered the following 12 thermodynamics-based tools for calculation of the hybridization energies of the putative duplexes produced by ASSA: AccessFold, 15 Bifold, 25 DuplexFold, 25 GUUGle, 26 IRIS, 27 IntaRNA-2, 28 LncTar, 16 RactIP, 29 RIsearch, 30 RNAduplex, 31 RNAPlex, 14 RNAup. 32 To identify the best tool, we prepared a test set by generating 360 sequence pairs (simulated duplexes) of two types. These sequences were intended to represent two situations with respect to the local secondary structures that might be present at the antisense sites identi¯ed by LASTAL.
Duplexes of the¯rst type (\true duplexes") did not have strong complementarity to the nearby (°anking) regions on the either side. Consequently, the RNA regions corresponding to an antisense site did not form intra-molecular interactions and were accessible for inter-molecular base pairing. To simulate this situation, we prepared sequence pairs with di®erent lengths and percent complementarity of the antisense sites (all the sequences had GC content = 50%). Each sequence in a pair consisted of a site of length SL in the middle and two random°anking sequences of length FL each (see Fig. 1). Additionally, the complementarity between the two sites in the simulated duplex was INTER CMPL percent. In this construct, the simulated site in the middle corresponded to a LASTAL local alignment. To simulate a variety of possible LASTAL hits, di®erent values for SL, FL and INTER CMPL were used. Sequence pairs with three di®erent combinations of the site/°ank lengths (SL/FL = 10/10, 20/50 or 40/50) were generated (the corresponding total sequence lengths were 30, 120 and 140, respectively). Additionally, three di®erent percents of complementarity were used (INTER CMPL = 100%, 90% or 80%). A total of 20 sequence pairs were simulated for every combination of SL, FL and INTER CMPL making the total number of \true duplexes" in the test set equal to 180 (¼ 20 Â 3 Â 3).
We refer to the second type of the simulated duplexes in the test set as \two hairpins". In addition to some level of inter-molecular complementarity, they also had perfect (100%) intra-molecular complementarity. It should be noted that a loop of length LL separated the two complementary regions located on the same sequence (see Fig. 1). We generated nine di®erent variations of the \two hairpins" duplexes by using three combinations of the SL/LL values (10/5, 20/20 and 40/20the corresponding sequence lengths were 25, 60 and 100) and three di®erent complementarity values (INTER CMPL = 80%, 70% or 60%). In total, 180 \two hairpins" duplexes were simulated for the test set by preparing 20 sequence pairs for every combination of the SL, LL and INTER CMPL values.
On one hand, a sequence alignment tool (such as LASTAL) is expected to identify complementarity between sequences of both types. This is why both classes of the simulated duplexes are likely to be among the putative duplexes analyzed by ASSA. On the other hand, due to the presence of 100% intra-molecular complementarity in the \two hairpins" sequences, a thermodynamics-based tool is expected to fold them into two separate RNA molecules. Therefore, a secondary structure aware algorithm is likely to predict strong inter-molecular interaction for the \true duplexes" only (see Fig. 1). So, our goal was to¯nd the thermodynamics-based tool able to distinguish the \true duplexes" from the \two hairpins" types most accurately.
A \good" tool should predict existence of inter-molecular binding for the \true duplexes" (label=1) and no binding for the \two hairpins" (label=0) sequence pairs. However, thermodynamics-based tools do not produce such a binary classi¯cation. Instead, they compute inter-molecular hybridization energy (ÁG). Thus, the predicted duplex types were determined by computing tool-speci¯c empirical P -values for all the simulated duplexes.
To do this, 100 \random" duplexes were generated from each simulated duplex by mono-nucleotide shu®ling of one of the sequences (the other sequence in the pair did not change). Each tool was applied to the original as well as to 100 \random" duplexes producing 101 interaction energies. The empirical P -value was computed as follows: where x is the inter-molecular hybridization energy (ÁG original ) outputted by the tool T for the original duplex; ÁG rand ¼ fÁG rand 1 ; . . . ; ÁG rand N g is a set of interaction energies predicted by the tool T for all the corresponding random duplexes; NumðÁG rand xjT Þ is the number of random duplexes with the hybridization energy less or equal to x (i.e. the same or stronger interaction is observed for random sequences) and NumðÁG rand Þ is the total number of random sequences (N ¼ 100 in our settings). It should be noted that inter-molecular hybridization energy is a negative value and the smaller it is the stronger the predicted interaction.
Empirical P -value less than 0.05 indicated that a tool predicted the existence of inter-molecular binding (i.e. label = 1). On the other hand, a simulated duplex with empirical P -value greater or equal to 0.05 corresponded to the case where no binding was predicted (i.e. label = 0). By comparing the predicted labels with the known duplex types (\true duplexes" or \two hairpins"), the numbers of True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN) predictions were determined. Finally, the Matthews correlation coe±cient (MCC) was computed for each tool:

Interaction energy calculation by RNAup
The following command is used in ASSA to compute the hybridization energy for each putative duplex: \RNAup -o -b -w L", where -o indicates not to produce an output¯le and report the predicted free energy via STDOUT, -b takes into account the probability of unpaired regions in both RNAs and -w speci¯es the maximum length of the region of interaction. Sequence pairs of each putative duplex are concatenated by the`&' character and passed to RNAup via STDIN. It should be noted that the value of the -w parameter has a great e®ect on the RNAup execution time (the larger the value, the longer the execution time). Our analysis of the 100 human lncRNA-mRNA pairs indicated that the majority of the identi¯ed antisense sites (i.e. LASTAL local alignments) were shorter than 50 bp (see Supplementary  Fig. 1). Therefore, in ASSA, the value of the -w option is de¯ned as minð50; antisense site lengthÞ). To speed up the calculation of the hybridization energies for the long (> 50) antisense sites, the RNAup is applied to the site sequences only (i.e.°anks are not added) with the -w set to 50. The produced energies are then scaled up by multiplying to the ratio \antisense site length=50". The RNAup run is the most time consuming step of the pipeline. Further speed up of this step can be achieved by using the -num threads ASSA option that allows to compute the free energies of all the putative duplexes independently in parallel.
Besides the interaction energies, RNAup also outputs the coordinates of the predicted inter-molecular base pairing. Importantly, ASSA only considers hybridization energies from the sequence pairs where the location of the RNAup duplex overlaps the original antisense site on both sequences. If the LASTAL and RNAup predictions do not overlap, the ÁG of the putative duplex is set to 0.

Estimation of theoretical P-values
To study the dependence between the properties of the random sequences and the distribution of SumEnergys, a training set was prepared. First, groups of random sequences were generated. Each group consisted of 10 sequences with the same length and GC content. Seven di®erent lengths (50, 100, 300, 1000, 2000, 4000 and 8000 nt) and seven di®erent GC contents (30%, 40%, 42%, 50%, 56%, 60% and 70%) were considered. Two groups (\queries" and \targets") were generated for every Length/ GC content combination. In total, 49 query and 49 target groups were prepared. It should be noted that short sequences (length = 50 and 100 nt) were included in the training set in order to obtain a universal model that can be used to estimate the statistical signi¯cance of the interactions between sequences in a wide range of lengths.
Next, ASSA was applied to search queries versus targets. To reduce the number of ASSA runs, the input group pairs were selected so that the queries were never longer than the targets (i.e. L1 L2). In total, 1372 ð¼ 7 Â 7 Â 7 Â ð7 þ 1Þ=2) ASSA runs were performed and each run produced 100 SumEnergy values (10 queries vs 10 targets).
Initially, the RNA-RNA interaction prediction was done with the relatively low LASTAL score threshold equal to 30 (or 36 for long sequences with high GC content). To study the in°uence of this ASSA parameter on the distribution of SumEnergys, the produced¯les were post-processed by increasing the score threshold and re-calculating all the SumEnergy values. In total, this allowed to produce 20,532 sets of SumEnergys that corresponded to di®erent sequence features as well as a wide range of LASTAL score thresholds (from 30 to 105). It should be noted that some random sequence pairs produced SumEnergys equal to 0e.g. when no putative duplexes were identi¯ed because the sequences were too short or the score threshold was too strict.
To select the probability distribution function (PDF) which¯ts the produced data better, we focused on the 3411 sets where all the 100 SumEnergys were negative (i.e. none of them was equal to zero). Three candidate distributions (Gamma, Normal and Log-Normal) were considered. Since the Gamma and Log-Normal PDFs are only de¯ned for values greater than zero, their parameters were identi¯ed from the À1 Â SumEnergy values using the fitdistr() function from the MASS 33 R library. The Kolmogorov-Smirnov test P -values were calculated for all the 3411 sets with respect to each of these three PDFs. The Gamma distribution produced the least number of small (< 0:05) KS P -values (for 0.21% of the sets only) and was selected to model the distribution of the SumEnergys produced by random sequences.
The Gamma distribution can be parameterized in terms of a shape parameter () and a rate parameter ()both are positive numbers. To account for the fraction of random sequence pairs that produced SumEnergy ¼ 0, we utilized the \hurdle model", 34,35 which in our case is a mixture of Bernoulli and Gamma distributions. The basic idea is that a Bernoulli probability (P ¼ P ðSumEnergy 6 ¼ 0Þ) governs the binary outcome of whether or not the SumEnergy equals to zero. If SumEnergy is not zero, the hurdle is crossed, and the Gamma distribution governs the remaining (nonzero) part of the distribution. The 20,532 empirical SumEnergy distributions were used to train three linear regressions (see Supplementary Table 2). The obtained models are used in ASSA to predict the expected background distribution based on the features of the input sequences and the LASTAL score threshold for Theoretical P -value calculation (see Supplementary text for more details).

Transcriptome-wide all-vs-all search and repeat masking
Information about the genes expressed in the K562 cell line was taken from the FANTOM5 2 database (sample id \CNhs12334.10824-111C5"). All the genes with nonzero expression values were considered. For the alternatively spliced genes, the longest isoform was selected only.
The \all-vs-all" BLASTn 17 search was performed in the antisense mode (-strand minus) with the seed length equal to 15 (-word size 15) and without a threshold on the E-value (-evalue 999999). Thus, an antisense interaction was recorded between any two transcripts which had at least one perfectly complementary duplex longer or equal to 15 bp. The number of the antisense partners for each RNA was de¯ned as the number of unique transcripts that produced at least one BLASTn local alignment with the query.
We ran the RepeatMasker in a quick mode (the -qq option) to mask the human-speci¯c repeats (-species human) in all the sequences. Additionally, the -alu option was used to restrict masking to the Alu repeats only.

Test sets to compare the RNA-RNA interaction prediction tools on a large scale
The performance of di®erent RNA-RNA prediction tools was evaluated based on their ability to detect experimentally identi¯ed targets of the lncRNA TINCR 13 and mRNA ACTB. 9 Among thousands of TINCR antisense partners, we randomly selected 100 transcripts with the length between 200 nt and 4000 nt (to reduce execution time of some RNA-RNA prediction tools) and without Alu-repeats (in order to focus on the short-trans interactions). For the ACTB, all the 82 targets identi¯ed in the HeLa cells were taken.
To estimate the ability of the computational tools to identify the true targets among a large set of sequences, two types of test sets were prepared. The TINCR and ACTB test sets of the¯rst type (\mix with human transcripts") consisted of all the selected true targets and a number of randomly selected human transcripts with similar length and GC content. It should be noted that the false targets were selected among the transcripts of the genes expressed in the corresponding cell types. To identify such genes in keratinocytes, (where the TINCR pull down has been performed) the reads from the input RNA-seq sample (run ID SRR539976 from the NCBI GEO entry GSM986009) were mapped to the human genome (hg38) and the 7314 NCBI genes with at least 100 mapped reads were identi¯ed. In case of ACTB, we considered the 1967 highly expressed genes with at least one interaction identi¯ed by Aw et al. 9 in HeLa cells. Nine \false targets" were selected for every \true target" (see Supplementary Fig. 2) to simulate the assumed situation in cell where a long RNA interacts with a limited number of other RNAs. To prepare the test sets of the second type (\mix with shu®led sequences") every selected true target transcript was used to generate nine di-nucleotide shu®led sequences using the uShu®le tool. 36 Therefore, each of the two TINCR test sets consisted of 1000 sequences with 100 of them being the true targets, while the ACTB test sets included 82 true and 738 false targets. Each tool was used to compute the interaction energy between the query transcript (NR 027064 for TINCR or NM 001101 for ACTB) and sequence from a test set. Predictions of the tools were used to rank all the sequences in each test set and to build a ROC curve by using the ROCR 37 R-package.

Types of inter-molecular RNA-RNA interactions
Antisense interactions are usually classi¯ed into two groupscis and trans. The interactions of the¯rst type occur between the products of the overlapping genes that are transcribed in opposite directions. The resulting RNAs have one or several (due to splicing) relatively long sites with perfect complementarity. All other interactions are classi¯ed as trans (\not-cis") since they are formed between transcripts produced from genes located in di®erent genomic regions.
Analysis of the published cases of the biologically active duplexes formed between long RNAs (i.e. mRNA-mRNA, lncRNA-mRNA or lncRNA-lncRNA) in mammals prompted us to expand the classical \cis-trans" classi¯cation of the natural antisense transcripts. It should be noted that some RNAs produced from non-overlapping genes are also able to form long (> 100 bp) highly complementary inter-molecular duplexes. To better discriminate between di®erent classes of trans-interactions, we divided the trans-category into three sub-categoriespseudo-cis, Alu-based and short-trans interactinos.
The \pseudo-cis" sub-type of the trans RNA-RNA binding occurs when one of the overlapping genes has an expressed copy (a paralog) at another genomic locus. The gene copy harbors a sequence highly complementary to a part of the other gene in the original overlap and, thus, can form trans-antisense duplexes with it. This scenario has been observed in the case of expressed pseudogenes. 38 Moreover, it has been shown that such pseudogene related duplexes can be recognized by RNAi machinery and produce functional siRNAs in mouse oocytes. 39 Inversion of a genomic region during gene duplication is another possible scenario for such NATs formation. 40 Sequence repeats of several classes occupy a signi¯cant portion of the human genome. Up to 350 bp long with relatively high percent identity (> 70%) Alu-repeats belong to the short interspersed nuclear elements (SINEs) class. They are frequently present in lncRNAs or in the 3'UTRs of human mRNAs either in the direct or in the reverse-complement orientation. It has been shown that a pair of transcripts with Alu repeats in opposite directions are able to interact with each other and trigger Staufen Mediated Decay (SMD) 41,42 and/or regulate mRNA translation. 43 To evaluate the fraction of the Alu-based interactions on a transcriptome scale, \all-vs-all" BLASTn search was performed for the 10,664 genes expressed in the K562 cell line (see Methods). We observed linear dependance of the number of predicted antisense partners on the query transcript length. Surprisingly, the analyzed RNAs formed three distinct clusters (see Fig. 2(a)). To check whether the origin of these clusters was related to the Alu repeats, we applied the RepeatMasker software 44 to the the 10,664 sequences and identi¯ed 2212 Alu-containing transcripts. Next, all the RNAs were classi¯ed into three categories according to the presence and direction of Alu repeats that matched well with the three clusters (see Fig. 2(a)). Indeed, the transcripts with two or more Alu-repeats in di®erent directions are able to form inter-molecular duplexes with any other Alu-containing transcript. RNAs with Alu(s) in one orientation can only hybridize with the transcripts containing complementary repeat sequence. Clearly, transcripts without Alu's have the lowest antisense potential since they can not participate in the Alubased interactions.
To further con¯rm the observed role of the Alu repeats, we masked them in all the 2212 sequences and repeated the \all-vs-all" search (for the 10,664 transcripts). With this modi¯cation, all the RNAs followed the same trend of increasing the number of possible antisense partners with the sequence length ( Fig. 2(b)). This indicated that the two transcript clusters with the higher number of RNA-RNA interactions (the red and green dots in Fig. 2(a)) appeared due to the presence of Alu repeats in the corresponding sequences. Furthermore, we masked repeats of all types (7.8% of the total sequence length) and performed the \all-vs-all" search once again. No signi¯cant change was observed this time (compare (B) and (C) in Fig. 2) suggesting that the ability to signi¯cantly increase the number of possible antisense targets of a transcript is the unique feature of the Alu elements. Thus, we concluded that Alubased base pairing plays the major role in the repeat-associated interactions in the human transcriptome.
It should be noted that a number of large scale computational studies [19][20][21] have masked repeats of all types in the transcript sequences prior to the antisense search to avoid the prediction of the large number of repeat-based interactions. Our results indicated that the same e®ect can be achieved by masking the Alu repeats only while preserving other potentially informative parts of the sequences. This approach was used in the present study where applicable.
Finally, apparent long sites with high complementarity have not been found in a number of trans-antisense interactions. 8,45 This makes it di±cult to identify the exact regions of inter-molecular hybridization between the corresponding RNAs. The authors of the corresponding papers have hypothesized that in such cases the observed bindings may be based on several relatively short and not-perfectly complementary duplexes. We refer to this sub-type of trans binding as \short-trans-interactions".

The ASSA algorithm
Among the four introduced types of inter-molecular RNA-RNA hybridization, three (cis, pseudo-cis and Alu-based) are based on relatively long, highly complementary duplexes. Thus, even sequence alignment algorithms that do not take secondary structure into account can detect interactions of these types. 19,46,47 However, the short-trans category is di®erent. The presence of multiple small duplexes distributed along the sequences makes it di±cult to localize the exact interacting regions both experimentally and computationally. In addition, a number of spurious short sites with imperfect complementarity can be predicted by the alignment tools.
On the other hand, thermodynamics algorithms discriminate between RNA regions that are parts of stable secondary structures (i.e. involved in intra-molecular interactions) and the unpaired sites accessible for inter-molecular base pairing. Thus, the computed free energies re°ect both the accessibility and the length/complementarity of the speci¯c transcript regions. This is why interaction energy calculation is expected to improve the prediction accuracy of the short-trans interactions. 23,24 It is unpractical to apply the traditional thermodynamics tools to the long transcripts (such as mammalian mRNAs or lncRNAs) because the execution time grows quickly with the lengths of the input sequences. 14,15,28 At the same time, recent experimental data [9][10][11][12][13] have indicated that the short-trans base-pairing may be the most abundant interaction type in the human transcriptome.
Thus, the two main challenges in predicting short-trans hybridizations between long (> 200 nt) RNAs on a large scale are (i) the execution time of existing thermodynamics-based algorithms and (ii) the identi¯cation of the statistically sig-ni¯cant interactions among all the transcript pairs. Here we present a new computational pipeline, called ASSA (\AntiSense Search Approach"), developed in attempt to address both of these problems. It should be noted that ASSA is able to identify interactions of all types, but our main goal was to make a progress in the most challenging direction-prediction of the short-trans binding.
Brie°y, ASSA performs the following main steps (see Fig. 3) to predict interaction between each pair of input transcripts: (i) identify antisense sites by the local sequence alignment algorithm LASTAL 18 ; (ii) extract sequence regions corresponding to the predicted antisense sites and compute hybridization energies of the putative duplexes; (iii) compute the RNA-RNA interaction energy by summing the hybridization energies of all the putative duplexes and (iv) estimate the statistical signi¯cance (\theoretical p-value") of the obtained SumEnergy. These steps are discussed in more details below.

Predicting antisense sites by LASTAL
ASSA takes two sets of nucleotide sequences as input (the query and the target transcripts). On the¯rst step, the local sequence alignment tools from the LAST package 18 are used to identify the regions of local complementarity for each sequence pair by searching all the queries versus all the targets (see Methods). In this work, we refer to each produced local alignment as the \antisense site". This approach is adopted from another large scale study of the RNA-RNA bindings between human transcripts. 22 The advantage of using LASTAL over other aligners (such as BLASTn 17 ) for predicting antisense interactions is that it allows to use a custom substitution matrix with appropriate scores for the RNA-speci¯c hybridization rules (i.e. G:C, A:U and G:U base pairs-see Methods). ASSA antisense sites are selected based on the local alignment scores rather than the alignment length and/or percent complementarity (a similar approach is used in the RIblast algorithm 24 ). The threshold on the LASTAL alignment scores is one of the pipeline parameters.
On the next ASSA step, a thermodynamics-based algorithm is used to compute the hybridization energies of the \putative duplexes" which are de¯ned as RNA regions consisting of the antisense sites together with the°anking sequences on both sides. Several thermodynamics-based algorithms can be used for this purpose. To choose the tool which suits the ASSA pipeline best, we compared their performances on a simulated dataset.

Choosing a thermodynamics-based tool to compute hybridization energies
We considered 12 thermodynamics-based tools as candidates to be used in the ASSA pipeline for calculation of hybridization energies of the putative duplexes. The performances of these algorithms were compared on a test set consisting of 360 short sequence pairs that resembled the putative duplexes in ASSA. Both sequences in a pair had the same length ranging from 25 to 140 nt.
There were two types of simulated duplexes in the test set. The¯rst one (\true duplexes") corresponded to the cases of true inter-molecular hybridizations that were not interfered by the local secondary structures. In these duplexes, the middle part of each sequence represented a putative antisense site (a gapless LASTAL local alignment) of a particular length and percent complementarity°anked by the random sequences on both sides. The second type of the simulated duplexes (\two hairpins") represented the situation where the regions of both RNA molecules corresponding to an antisense site interact with the°anking regions to form local secondary structures. Importantly, the percent of complementarity of intra-molecular hybridization was greater than that of the inter-molecular binding (see Methods for details). The test set consisted of 180 simulated duplexes of each type.
Each tool was used to predict the type of every duplex. The prediction accuracy of an algorithm was determined by comparing the predicted and the true duplex types. According to the obtained Matthews correlation coe±cients, RNAup 32 produced the most accurate labeling of the simulated duplexes (see Supplementary Fig. 3). Thus, this tool was incorporated in the ASSA pipeline.

Preparing putative duplexes and calculating SumEnergies
Execution time of the thermodynamics-based algorithms does not allow to directly use them for analysis of long transcripts on a large scale. The interaction energy is e±ciently computed in ASSA by applying RNAup to the speci¯c sequence chunks (\putative duplexes") rather than the full-length transcripts. A putative duplex is generated by extracting regions of two transcripts that correspond to an antisense site together with the°anking sequences on both sides (see Methods).
Adding°anks to an antisense site allows RNAup to compute the inter-molecular hybridization energy with respect to the local RNA secondary structure. The°ank length in°uences both the accuracy and the execution time of the ASSA pipeline. On one hand, longer°anks allow to take more elements of the RNA secondary structure into account. On the other hand, the RNAup takes more time to process longer sequences. By default, the length of each°ank is equal to the length of the corresponding LASTAL alignment. This approach can be considered as a tradeo® between the time and accuracy.
In ASSA, the interaction strength between two transcripts is measured by the sum of the hybridization energies of all the putative duplexes (SumEnergy). It has been shown that SumEnergy outperforms MinEnergy in predicting binding between human transcripts. 23 It should be noted that both the duplex ÁG and the RNA-RNA SumEnergy are negative values and the smaller they are the stronger the corresponding hybridization will be.
Notably, any RNA-RNA prediction tool (including ASSA) outputs some value of the interaction energy even for random sequences. We observed that the distribution of the SumEnergy values computed by ASSA depended on the features of random sequences (lengths and GC contents) as well as on the LASTAL score threshold (see Fig. 4). We were interested in identifying the mammalian transcript pairs with the SumEnergy values smaller (stronger interactions) than the ones produced by random sequences with the same lengths and GC contents. Thus, on the next ASSA step, the statistical signi¯cance (P -value) of each SumEnergy is estimated by comparing the observed value with the distribution produced by the corresponding random sequences.

Estimating the statistical signi¯cance of the SumEnergy
The ability to quickly estimate the statistical signi¯cance (\Theoretical P -value") of the interaction energies taking into account the lengths and GC contents of the corresponding transcripts is the main novelty of the ASSA pipeline.
Any p-value is computed by comparing the observed value with the distribution of the values generated by the \null" model (the background distribution). In ASSA, the observed value is the SumEnergy calculated for a particular transcript pair. Random   sequences are frequently used in bioinformatics as the \null" model. Thus, a background distribution can be generated by applying ASSA with the same parameters to a number of random sequences generated by shu®ling the nucleotides in the original transcripts. The \Empirical P -value" can then be directly calculated from the obtained empirical background distribution (see Eq. (1) in Methods). The problem with this approach is that it requires a lot of simulations to estimate small p-values. For example, the SumEnergy computed by ASSA for lncRNA-ATB and IL11 48 was À386:59 kcal/mole. To estimate the statistical signi¯cance of this value each sequence was mono-nucleotide shu®led 30 times and all-versus-all ASSA search was performed producing an empirical background distribution consisting of 900 SumEnergy values (Fig. 5(a)). The smallest value observed for random sequences was À220:56 kcal/mole. Thus, with this number of simulations, the Empirical P -value was equal to zero which meant that the actual P -value was less than 1=900 ¼ 0:0011 and a larger background dataset was needed to get a better estimate.
A common approach to obtain estimates for small P -values without generating enormous amount of random sequences is to approximate the empirical background distribution with a function and use it to compute the statistical signi¯cance of the observed value. The SumEnergy distribution produced by ASSA for random sequences can be approximated by a mixture of the Bernoulli and Gamma distributions (the hurdle model 34,35 ) À À À see Methods. We applied this approach to the interaction between lncRNA-ATB and IL11 and obtained the \Empirical hurdle model P -value" ¼ 8:3 Â 10 À7 (see Fig. 5(b)).
As was mentioned above, the distribution of the SumEnergy values depends on several factorsthe transcript lengths (longer transcripts produce more putative duplexes) and GC-contents (G::C base-pairing is stronger than the A::T) as well as the LASTAL search threshold (Fig. 4). Thus, sequence-speci¯c background distributions should be generated for calculation of the Empirical P -values (with or without the use of the hurdle model). It is computationally challenging to use this measure for estimating statistical signi¯cance of the numerous SumEnergy values obtained in large-scale searches.
To tackle this problem, we analyzed the dependence between the distribution parameters and the features (length and GC content) of the random sequences as well as the LASTAL score threshold. In short, ASSA was applied to a number of random sequences with various lengths and GC-contents. Three parameters of the hurdle model (P , mean and variance) were computed for each distribution of the SumEnergy values. The generated dataset was used to train three linear regression models to predict each of the hurdle model parameters based on¯ve features (log (L1), log(L2)), GCaver, GCdi® and score-see Methods for more details). The derived formulas are used to predict the expected background distribution and compute the \Theoretical hurdle model P -value" (or simply \Theoretical P -value") for every transcript pair analyzed by ASSA (Fig. 5(c)).
The goal of the developed mathematical model is to quickly and reliably estimate the statistical signi¯cance of an observed SumEnergy value. This implies that the Theoretical and the Empirical P -values should be similar for the same transcript pairs. To check the correspondence between these two measures 10 lncRNAs and 10 mRNAs without Alu repeats and with the lengths between 200 and 4000 nt were randomly selected from the human transcriptome. Both the Theoretical and the Empirical hurdle model P -values were computed for all the 100 lncRNA-mRNA pairs using two di®erent LASTAL score thresholds. In both cases, the Pearson correlation coe±cient between the log's of the P -values was greater than 90% ( Supplementary Fig. 4) indicating that the Theoretical P -values can be considered as a reasonable approximation of the Empirical hurdle model P -values. It should be noted that ASSA output also includes the FDR-corrected Theoretical P -values to account for the database size and multiple testing.

Choosing the default value for the LASTAL score threshold
The in°uence of the LASTAL score threshold on the ASSA performance was evaluated on four validation sets, not overlapping with the test sets. The score values between 30 and 50 were considered. It should be noted that ASSA execution time also depends on the value of this threshold as more putative duplexes are predicted by LASTAL when a weak threshold (a small score) is used. Our analysis demonstrated that the score threshold 36 produced one of the best average AUC values and reduced the ASSA execution time (see Supplementary Fig. 5). Therefore, this value was selected as the default in ASSA and it is used throughout this work (if not stated otherwise).

Properties of ASSA P-values computed for random sequences
One important property of every P -value is that it is uniformly distributed between 0 and 1 when computed for the objects from the \null-model". 49,50 In case of ASSA, the \null-model" is a pair of random sequences. To check whether this property holds for Theoretical P -values, the 10 lncRNAs and 10 mRNAs were mono-nucleotide shu®led and the Theoretical and the Empirical hurdle model P -values were computed for the 100 random sequence pairs. As anticipated at the level of p-value < 0:05, four random sequences pairs out of 100 had Empirical P -value < 0:05. Moreover, the Kolmogorov-Smirnov test for the uniform distribution applied to the 100 Empirical observations produced P -value 0.065 con¯rming that the distribution of this measure is indeed close to uniform (see Supplementary Fig. 6(a)). On the other hand, the obtained 100 Theoretical P -values were not exactly uniformly distributed (there were 10 sequence pairs with the Theoretical P -value < 0:05 and the corresponding Kolmogorov-Smirnov P -value was 0.032 À À À see Supplementary Fig. 6(b)). Still, the correlation between the Theoretical and the Empirical hurdle model P -values was 83% ( Supplementary Fig. 6(c)). This analysis demonstrated that ASSA have a tendency to predict P -values that are slightly stronger than the corresponding true estimates. This may be explained by the fact that the Maximum Likelihood approach (used for Empirical hurdle models) provides better estimates of the Gamma distribution parameters than the method of moments used in ASSA. Nevertheless, given the high correlation with the Empirical hurdle model P -values (Supplementary Fig. 4 and Supplementary Fig. 6(c)), Theoretical P -values may still be useful to identify the statistically signi¯cant RNA-RNA interactions in a large scale search.
As mentioned above, one of the problems with the interaction energy is that it depends on the lengths and GC contents of the input sequences (Fig. 4). Theoretical P -values computed by ASSA not only estimate the statistical signi¯cance of the observed SumEnergys, but also provide automatic normalization to the features of the input sequences. Indeed, every P -value is computed with respect to the background distribution that takes into account the lengths and the GC contents of both transcripts in a pair. To check whether Theoretical P -values are indeed normalized to the sequence properties, two types of random sequence sets were generated. Each set consisted of 100 random sequence pairs. The four sets of the¯rst type had di®erent lengths (from 1000 to 4000 nt), but the GC content of all the sequences was the same (50%). On the other hand, the sets of the second type had di®erent GC contents (from 35% to 65%), but the length was the same (3000 nt). Theoretical P -values were computed for all sequence pairs. There was almost no dependance between ASSA P -values and the features of the random sequences (see Supplementary Figs. 7(a) and 7(b)). It should be noted that the SumEnergys computed by RIblast 24 for the same sequences demonstrated strong dependence on both the length  and the GC content (see Supplementary Fig. 7(c) and 7(d)). For example, the median RIblast SumEnergys for 3kb sequences with GC contents 35% and 65% were À67 and À694 kcal/mole, respectively (the corresponding ASSA P -values were 0.479 and 0.703). Overall, the results obtained for random sequences demonstrated that the Theoretical P -values are highly correlated with the Empirical P -values and provide automatic normalization to the sequence properties. The main advantage of this measure is the ability to compute it quickly without generating random sequences. Thus, ASSA P -values can be considered as the¯rst approximation of the statistical signi¯cance of the RNA-RNA interaction energy.

ASSA predictions for functional NATs
In order to evaluate ASSA performance on real RNA sequences, we applied it to the 34 mammalian natural antisense transcripts (NATs) collected from the literature -11 cis, 4 pseudo-cis, 10 Alu-based and nine short-trans cases (see Supplementary  Table 3). ASSA P -values computed for cis, pseudo-cis and Alu-based interactions were very strong due to the existence of long (>100 bp) duplexes with high percent of complementarity.
By contrast, only¯ve out of nine short-trans interactions had Theoretical P -values < 0:05. The inability of ASSA to classify some of the short-trans cases as statistically signi¯cant could be an artifact of the method of moments used in ASSA to predict the parameters of the background distribution. To¯nd out whether the inter-molecular hybridizations between the corresponding transcript pairs are statistically signi¯cant, the same distribution parameters were obtained by the Maximum Likelihood Estimation approach and the Empirical hurdle model P -values were obtained from the SumEnergys computed by ASSA or RIblast for the shu®led versions of the same sequences. In all four cases, ASSA Empirical P -values agreed with the Theoretical estimates (the Empirical P -value 0.043 is assumed to be not statistically signi¯cantsee Supplementary Figs. 8(a)-8(d). Once again, this result con¯rmed the good correspondence between the Theoretical and the Empirical ASSA P -values. Interestingly, two out of the four cases were statistically signi¯cant according to the interaction energies computed by RIblast (see Supplementary  Figs. 8(e)-8(h). This indicates that ASSA may not be sensitive enough to detect some of the short-trans interactions. Additional analysis is needed to pinpoint the parts of the pipeline that should be improved to handle such cases.
According to the obtained results, Theoretical P -values computed by ASSA are suitable to predict all types of antisense interactions between long RNAs. However, the identi¯cation of the short-trans interactions remains the most challenging task.

Comparison of ASSA with other tools
We compared the ability of ASSA to identify short-trans RNA-RNA interactions on a large scale with the following tools that were used for this purpose in other studies -BLASTn, 17 DuplexFold, 25 GUUGle, 26 IRIS, 27 LASTAL, 18 LncTar, 16 RIBlast, 24 RIsearch2, 51 RNAduplex, 31 RNAPlex 14 and RRP 23 . To evaluate the ability of the tools to predict the short-trans interactions, we used the experimentally identi¯ed targets of the lncRNA TINCR 13 and mRNA ACTB 9 . It has been shown that at least some of the identi¯ed transcripts are associated with these RNAs through direct short-trans antisense duplexes.
To exclude the possibility of Alu-based interactions with TINCR (that has an Alu repeat), the Alu-free transcripts were used for the TINCR test sets. The ACTB mRNA does not have Alu repeats. In total, four test sets were prepared (see Methods). Each tool was used to rank the sequences from every test set according to the predicted hybridization strength with the query (TINCR or ACTB). The ASSA output sorted by the Theoretical P -values outperformed all the tools in terms of the Area Under the Curve (AUC) and partial AUC (pAUC) on both test sets of the \mix with shu®led sequences" type (see Figs. 6(a) and 6(c), Supplementary Figs. 9 and 10 and Supplementary Table 4). The AUC values produced by ASSA were above 80% which improved the performance of the second best tool by more than 10%.
The performance of all the tools decreased on the two test sets of the \mix with human transcripts" type (see Figs. 6(b) and 6(d). For the TINCR and ACTB interactions, the most accurate algorithms produced AUC values around 54% (BLASTn and RIblast) and 63.6% (RNAplex-a), respectively.
Our analysis demonstrated that ASSA was able to accurately identify true interactions of long RNAs (TINCR or ACTB) in the mix of human transcripts with shu®led sequences. At the same time, other tools performed better on the test sets consisting of human transcripts only but the produced accuracies were relatively low. Thus, there is a room for further improvement of the RNA-RNA prediction tools.

Discussion
The goal of our work was to improve the accuracy and the speed of prediction of inter-molecular interactions between long transcripts (i.e. lncRNAs and mRNAs). For this purpose, we developed a new computational pipeline ASSA. To speed up the time-consuming traditional thermodymanics tools, we obtained a set of local alignments (by the sequence aligner LASTAL) that allowed to restrict the calculation of the interaction energies (by the RNAup algorithm) to a limited number of relatively short parts of the input transcripts. The main novelty implemented in ASSA was a mathematical model that allowed to quickly estimate the background distribution and compute the statistical signi¯cance (Theoretical P -value) of the observed RNA-RNA hybridization energy. Sorting the predicted interactions by the P -value rather than the SumEnergy allowed ASSA to outperform other tools on two out of the four test sets. It should be noted that ASSA is one of the fastest tools that takes RNA secondary structure into account. This makes it a good candidate to perform transcriptome-wide searches.
Roughly speaking, ASSA can be viewed as a three stage algorithm À À À the LASTAL and RNAup runs followed by the calculation of P -values. The prediction accuracy is improved gradually in the pipeline since every step takes additional piece of information into account. First, the interactions are predicted by LASTAL without considering RNA secondary structures and estimating the statistical sig-ni¯cance. This step is similar to the approach used by Szczesniak et al. 22 and the \LASTAL (SumScore)" performance provides an estimate of its prediction accuracy (the average AUC value over the four test sets was 55.7% À À À see Supplementary  Table 4). Next, RNAup takes secondary structure into account by computing intermolecular hybridization energy. In terms of ROC statistics, this improves the accuracy by 3.5% (the average AUC value of the \ASSA (SumScore)" approach was 59.3%). Notably, at this step, ASSA is similar to the RIblast algorithm which produces similar average AUC value (60%). Finally, calculation of P -values with respect to the sequence features allowed to sort all the predictions in the most accurate way providing an additional 8.4% increase in the accuracy which made the average AUC value produced by ASSA equal to 67.7%.
In our study, we also suggested an improvement to the classical cis/trans clas-si¯cation of the antisense interactions. Based on the origin of the regions involved in inter-molecular hybridizations, three subtypes of the trans-interactions were introduced: pseudo-cis, Alu-based and short-trans interactions. Importantly, we demonstrated that among all types of sequence repeats in the human genome, Alu repeats have a striking in°uence on the ability of a transcript to base pair with other RNAs and, therefore, participate in post-transcriptional gene regulation. 41,42 Inter-molecular RNA-RNA hybridizations may form yet another layer in the gene regulatory network. It should be noted that the bioinformatics prediction of a RNA-RNA interaction is not su±cient to make conclusions about its functionality or even realization in the cell. There are other factors that should be taken into account, including the cellular localization of the RNAs as well as the presence of speci¯c RNA binding proteins. Probably, these reasons contributed to the poor performance of all the tools on the \mix with human transcripts" test sets. Thus, the search for new biologically active NATs is a more complex task than prediction of antisense partners. Nevertheless, we believe that further improvement of the RNA-RNA interaction detection methods (both computational and experimental) is a necessary step in this direction.

Conclusions
. A new computational pipeline ASSA was developed for identi¯cation of intermolecular hybridizations between long RNAs. . ASSA provides statistical signi¯cance estimate for every predicted RNA-RNA interaction computed by a custom mathematical model. . A special role of the Alu-based interactions in the human transcriptome was emphasized.