The snoGloBe interaction predictor enables a broader study of box C/D snoRNA functions and mechanisms

250 words max) Box C/D small nucleolar RNAs (snoRNAs) are a conserved class of noncoding RNA known to serve as guides for the site-specific 2’-O-ribose methylation of ribosomal RNAs and the U6 small nuclear RNA, through direct base pairing with the target. In recent years however, several examples of box C/D snoRNAs regulating different levels of gene expression including transcript stability and splicing have been reported. These regulatory interactions typically require direct binding of the target but do not always involve the guide region. Supporting these new box C/D snoRNA functions, highthroughput RNA-RNA interaction datasets detect many interactions between box C/D snoRNAs and messenger RNAs. To facilitate the study of box C/D snoRNA functionality, we created snoGloBe, a box C/D snoRNA machine learning target predictor based on a gradient boosting classifier and considering snoRNA and target sequence and position as well as target type. SnoGloBe convincingly outperforms general RNA duplex predictors and PLEXY, the only box C/D snoRNA-specific target predictor available. The study of snoGloBe human transcriptome-wide predictions identifies enrichment in snoRNA interactions in exons and on exon-intron junctions. Some specific snoRNAs are predicted to target groups of functionally-related transcripts on common regulatory elements and the exact position of the predicted targets strongly overlaps binding sites of RNA-binding proteins involved in relevant molecular functions. SnoGloBe was also applied to predicting interactions between human box C/D snoRNAs and the SARS-CoV-2 transcriptome, identifying known and novel interactions. Overall, snoGloBe is a timely new tool that will accelerate our understanding of C/D snoRNA targets and function. 2 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ; https://doi.org/10.1101/2021.09.14.460265 doi: bioRxiv preprint


INTRODUCTION
Small nucleolar RNAs (snoRNAs) are a conserved class of noncoding RNA present in all eukaryotes. SnoRNAs have been extensively characterized as key players in ribosome biogenesis through the processing and modification of ribosomal RNA (rRNA). They are also implicated in spliceosome biogenesis by guiding the modification of small nuclear RNA (snRNA) (Kiss 2001). To carry out these functions, deemed canonical, they assemble in ribonucleoprotein (snoRNP) complexes which provide stability and catalytic activity to the complex. SnoRNAs are split in two families, the box C/D and the box constraints identified in (Chen et al. 2007). PLEXY thus allows to find both rRNA/snRNA and mRNA targets interacting with the ASE.
In recent years, a wide range of functions have been discovered for box C/D snoRNA, including the regulation of chromatin compaction, of metabolic stress, cholesterol trafficking, alternative splicing and mRNA levels (reviewed in (Dupuis-Sandoval et al. 2015;Falaleeva et al. 2017;Bratkovič et al. 2020)). For some of these novel functions, deemed noncanonical, an interaction was reported between the snoRNA and one or multiple target RNA(s), and they are not limited to the ASE. In fact, as shown in Figure   1B, taken together, the interacting regions enabling these functions cover the whole snoRNA, creating a need for a broader tool to predict snoRNA interactions throughout their whole length. In this paper, we define a canonical interaction as an interaction leading to the methylation of a rRNA or a snRNA, and all others are defined as noncanonical, including interactions leading to the methylation of other types of RNA such as mRNA and transfer RNA (tRNA). In support of the noncanonical interactions of C/D snoRNAs, several have been shown to interact with proteins other than the core C/ D binding partners mentioned above, such as splicing and polyadenylation factors (Soeno et al. 2010;Kishore et al. 2010;Huang et al. 2017;Zhong et al. 2015). In addition, the accumulation of some snoRNAs is not affected by the depletion of core snoRNP proteins (Deschamps-Francoeur et al. 2014), and some snoRNAs were found in nuclear fractions devoid of such factors (Falaleeva et al. 2016). These studies hint at the possibility that snoRNAs can interact with specific proteins distinct from the core box C/D binding proteins to accomplish specific functions (reviewed in (Baldini et al. 2021;Bergeron et al. 2020;Bratkovič et al. 2020)).
The study of the involvement of snoRNAs in diseases also provides evidence of noncanonical functions and interactors. Although so far, in most cases, snoRNAs have only been shown to have a deregulated abundance in many diseases and their implication in pathogenesis is typically poorly understood (reviewed in (Zhang et al. 2019; Deogharia and Majumder 2018;Schaffer 2020;Cavaillé 2017)), the molecular mechanisms and pathways involved have been more extensively studied for a small number of pathologies. One snoRNA group proposed as linked to disease for over two decades is the SNORD115 family involved in the Prader-Willi syndrome (PWS), the leading genetic cause of obesity (Cavaillé et al. 2000). The SNORD115 family is lost in most PWS patients (Hebras et al. 2020). The SNORD115 ASE is complementary to the serotonin receptor 2C gene, HTR2C, near a splice site and multiple A to I editing sites (Kishore and Stamm 2006;Vitali et al. 2005). The absence of SNORD115 is thought to lead to a defective serotonin receptor, with consequences on the disease phenotype, although the molecular mechanism is still under debate (Hebras et al. 2020). Another example is SNORD126, which is deregulated in cancer. Its oncogenic role was studied by knockdown and overexpression. SNORD126 was shown to increase the activation of the PI3K-AKT signaling pathway through the upregulation of FGFR2, but the mechanism causing this is still unknown (Fang et al. 2017).
Another form of evidence supporting the involvement of snoRNAs in noncanonical interactions are high-throughput RNA-RNA interaction identification methodologies, PARIS (Lu et al. 2016), LIGR-seq (Sharma et al. 2016) and SPLASH (Aw et al. 2016), which have been devised to enable the survey of all RNA duplexes in cells, both intraand intermolecular. They have enabled the detection of known and novel snoRNA interactions. For example, LIGR-seq identified functional interactions between orphan snoRNA SNORD83B and three different mRNAs affecting their stability (Sharma et al. 2016). Our de novo analysis of these datasets shows that snoRNAs can interact with diverse RNA biotypes, and mostly with protein coding genes, rRNA and other snoRNAs 5 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ;https://doi.org/10.1101https://doi.org/10. /2021 ( Figure 2A). However, these large-scale methods generally have a low proportion of intermolecular duplex reads (Schönberger et al. 2018) and they are not focused on snoRNAs, suggesting that there are probably plenty of snoRNA-RNA interactions yet to uncover.
To address this question and systematically investigate snoRNA interactors and ultimately the extent of their cellular functions, we developed snoGloBe, a gradient boosting classifier to predict box C/D snoRNA-RNA interactions that involve any region of the snoRNA, not only the ASE. SnoGloBe was trained and tested in human using

Experimentally detected examples of snoRNA-RNA interactions cover a wide range of interactor biotypes
In order to create a predictor of snoRNA-RNA interactions, we began by assembling all experimentally detected such interactions. We thus performed a manual curation of the literature to obtain these interactions, to which we added all known canonical interactions as compiled in snoRNABase (Lestrade and Weber 2006). This resulted in 149 non-redundant interactions involving snoRNAs, 16 from the literature and 133 from snoRNABase ( Figure 2A). To increase and enrich this set, we turned to the highthroughput RNA-RNA interaction (HTRRI) identification methodologies PARIS (Lu et al. 2016), LIGR-seq (Sharma et al. 2016) and SPLASH (Aw et al. 2016). To standardize the analysis of these HTRRI datasets, they were all reanalyzed using the PARIS bioinformatics protocol (Lu et al. 2018) and then filtered to keep only those with a minimum length and without bulges ( Figure S1). The list of all interactions obtained from the literature, snoRNABase and from our de novo analysis of HTRRI datasets is available from Table S1. As shown in Figure 2A, the three different sources of positive snoRNA interactions display very different RNA biotype distributions of the targets. While snoRNABase is the main repository of canonical human snoRNA interactions, our manual curation of the literature revealed articles describing noncanonical snoRNA interactions and thus mainly involves protein_coding targets. In contrast, PARIS, LIGRseq and SPLASH are methodologies detecting RNA-RNA interactions with less user bias. The HTRRI datasets show a much wider distribution of RNA biotypes in terms of distinct interactions, the group with the largest number of distinct targets being protein_coding RNAs, followed by rRNA, snoRNAs, lncRNAs, tRNAs and snRNAs ( Figure 2A). Our extensive compilation of snoRNA-RNA interactions thus shows that snoRNAs can interact with many different RNAs of diverse biotypes, supporting the notion that many snoRNA targets and their biological roles in interacting with them remain to be identified. In order to have a full training/testing set for our predictor, we  Figure 2C). The positive:negative ratio was chosen to be imbalanced (1:20) to reflect the fact that the proportion of transcriptomic sequences bound by C/D snoRNAs is expected to be much lower than the proportion not bound.

7
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ;https://doi.org/10.1101https://doi.org/10. /2021 Feature encoding and predictor training SnoRNA-RNA interactions involving the formation of an RNA duplex, the sequences of the two RNAs needed to be encoded amongst the features presented in input. The duplex length of validated functional snoRNA interactions, considering both canonical and noncanonical interactions (Figure 2A, known canonical and known noncanonical), varies from 10 to 32 base pairs, so we chose to encode windows of 13 nucleotides in both the snoRNA and its interacting RNA as a compromise to represent as many validated snoRNA-RNA interactions as possible while limiting the chance of finding these sequences randomly in the genome. More details of the sequence encoding are available from the Methods. The position of the interacting window in the snoRNA is another feature that is important to consider. As mentioned in the Introduction, canonical interactions between snoRNAs and rRNA or snRNAs involve the region immediately upstream of the boxes D or D'. However, validated noncanonical interactions cover collectively the entire snoRNA ( Figure 1B). The region employed in the interaction could be an important characteristic for a subgroup of targets and was thus included as an input feature. Similarly, very diverse types of RNAs can be bound by snoRNAs ( Figure   2A), with diverse functional outcomes (Bratkovič et al. 2020;Baldini et al. 2021;Bergeron et al. 2020) and we thus chose to include the target biotype as well as the position in the target (either in an exon and/or an intron and whether the exon is a 5' or 3' UTR when appropriate) as input features. Input features are represented schematically in Figure 2D.
Input features were encoded for all positive and negative snoRNA-RNA pairs in the training/test sets. Since snoRNA-RNA pairs are encoded as 13 nucleotide pairs of windows, an interaction can consist of multiple such pairs of windows. In total, the datasets consist of 1838 such positive windows and 38370 negative windows. The overall proportion of positive, random negatives and matched negatives used to train the 8 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ;https://doi.org/10.1101https://doi.org/10. /2021 predictor is shown in Figure 2B. The datasets were split in a non-overlapping manner for hyperparameter tuning, model training and model testing in a 1:7:2 relative proportion ( Figure 2E). Since there are few examples of known noncanonical interactions (Table   S1), they were all kept for the test set. SnoRNA genes are often present in multiple copies in genomes and Homo sapiens is no exception (Bergeron et al. 2021;Deschamps-Francoeur et al. 2020). To ensure redundancy is removed from the datasets, they were built such that members of the same family of snoRNAs (as defined by Rfam (Griffiths-Jones et al. 2003)) do not span more than one set (so for example, no member of the SNORD116 family is included in the test set if other members of the family are present in the training set). More details of the implementation are available in the Methods section.

snoGloBe predicts a wide range of interactions with high accuracy
SnoGloBe is based on a gradient boosting classifier, which is a combination of multiple decision trees. This model differs from a random forest by adding each tree successively to lessen the error of the previous ones instead of creating the trees independently. The predictor was built first by tuning the hyperparameters on 10% of the data, using a random search. The model was then trained on 70% of the data using a 5-fold stratified cross-validation ( Figure 2E). The output of the prediction is a value between 0 and 1 representing the probability of interaction. The performance of the model was then extensively evaluated on the test set and compared to PLEXY (Kehr et al. 2011) and general RNA-RNA interaction predictors: RNAup (Lorenz et al. 2011), RNAplex (Lorenz et al. 2011), RIsearch2 (Alkan et al. 2017 and IntaRNA (Mann et al. 2017) using parameter values summarized in Figure S2. SnoGloBe displays an excellent separation of the positives and negatives in the independent test set, providing scores below 0.1 for 9 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ; https://doi.org/10.1101/2021.09.14.460265 doi: bioRxiv preprint the great majority (96%) of negative examples and scores above 0.9 for the majority (63%) of positive examples ( Figure S3). In addition, it performs considerably better than the other tools on the independent test set, obtaining the highest area under the ROC and precision-recall curves ( Figure 3A,B). PLEXY has the weakest performance, which is expected since it only predicts interactions with the ASE and the test set has interactions with all regions of the snoRNA ( Figure S4). Interestingly, snoGloBe performs better than general RNA-RNA interaction prediction tools, hinting that snoGloBe captures information specific to snoRNA interactions, more than simple base-pairing.
Using thresholds to obtain a 90% precision with every tool, we determined the number of windows predicted as positives and negatives for each tool. SnoGloBe retrieves the highest number of true positive windows, and the highest proportion of known canonical, known noncanonical interactions and HTRRI ( Figure 3C,D, S5). PLEXY retrieves the smallest number of positive windows and most are from known canonical interactions, as expected. Generic RNA-RNA interaction predictors give similar results amongst themselves and retrieve in majority known canonical interactions. Interestingly, although snoGloBe was not trained on any known noncanonical interaction, it performs by far best at predicting them, identifying 81/95 noncanonical windows ( Figure 3D). Taken together, these data show that snoGloBe enables the prediction of more and a higher diversity of snoRNA-RNA interactions.

Transcriptome-wide predictions using snoGloBe show enrichment of snoRNAs targeting regulatory regions
The interactions of every expressed human snoRNA were predicted against all protein coding genes in human. As many snoRNA copies in human are not expressed and likely represent 'dead' copies in the genome, we restricted our study to only those detected as expressed as described in the Methods section. To limit the number of predicted 10 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ; https://doi.org/10.1101/2021.09.14.460265 doi: bioRxiv preprint interactions, we used a stringent cut-off of 3 consecutive windows having a probability (ie snoGloBe output score) greater or equal to 0.98 ( Figure S6). With this threshold, we obtain a median of 1017 predicted interactions per snoRNA ( Figure 4A). Among those with the largest number of predicted interactions are included several snoRNAs already known to interact functionally with mRNAs such as SNORD32A (>1500 interactions), SNORD83B (>6000 interactions), SNORD88C (>10 000 interactions) (Elliott et al. 2019;Sharma et al. 2016;Scott et al. 2012). The global analysis of all snoRNA predicted interactions reveals a preference for the region upstream of the box D, even though interactions were predicted throughout the whole snoRNA length and the training and 7.26% and 0.03% of the coding transcriptome consists of exonic sequences and intronexon junctions respectively (measured in terms of 13 nt windows as described above), these values go up to 15.65% and 1.19% respectively when only regions predicted to 11 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ; https://doi.org/10.1101/2021.09.14.460265 doi: bioRxiv preprint interact with snoRNAs are considered, an increase of >2 and 40 fold respectively. A high proportion of the exons targeted by snoRNAs are 3' UTR ( Figure 4F), which is interesting since some snoRNAs are known to produce microRNAs targeting 3'UTRs Some snoRNAs display even more convincing target sets, including strong enrichment in specific gene elements of target genes enriched in specific biological processes, as well as significant overlap with functionally relevant RNA binding protein (RBP) target sites. For example, SNORD50B, which is known to bind K-Ras and regulate its prenylation, was found to be strongly enriched in 5'UTR binding of its targets, using its box D adjacent guide region ( Figure S9A,B). Gene ontology enrichment analyses show that predicted targets of SNORD50B are involved in neuronal functionality and genes coding for proteins related to cell-cell interactions ( Figure S9C). Many SNORD50B exonic targets bind alternative 5'UTRs, involving alternative transcription start sites, such as those of NDFIP2, COPS3 and SPG21 ( Figure S10). In contrast, SNORD22 shows enriched binding to targets on 3' splice sites and PPTs, involving a non-guide region of the snoRNA overlapping the box C' (Figure S11A,B). Gene ontology terms for these targets are enriched in membrane proteins, cell junctions and GTPases ( Figure S11C).
Many SNORD22 3' splice site targets bind alternatively spliced exons including in the diacylglycerol kinase zeta gene DGKZ, the amyloid beta precursor protein binding family 12 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ; https://doi.org/10.1101/2021.09.14.460265 doi: bioRxiv preprint B member APBB1 and three hits on the same alternatively spliced exon in the focal adhesion protein PXN ( Figure S12). Interestingly, when SNORD50B and SNORD22 transcriptome predicted binding sites are compared to RBP binding sites as measured by ENCODE using eCLIPs (Van Nostrand et al. 2020;Davis et al. 2018), they both show strong overlap enrichment with the binding sites of specific RBPs ( Figure S9D, S11D). In the case of SNORD50B, the strongest enrichments include DDX3X a helicase known amongst others to bind RNA G-quadruplexes in 5'UTRs, NCBP2 a cap-binding protein interacting with pre-mRNA, BUD13 involved in pre-mRNA splicing and FTO an RNAdemethylase involved in the maturation of mRNAs, tRNAs and snRNAs, supporting a role for SNORD50B in the maturation of pre-mRNA and in particular their 5' extremity. In contrast, for SNORD22, the RBPs with strongest enrichment are PCBP2 the poly(rC) binding protein, PTBP1 a PPT binding protein involved in the regulation of alternative splicing as well as BUD13, PRPF8 and AQR all known as involved in pre-mRNA splicing, supporting a role for SNORD22 in the regulation of alternative splicing, in particular through the binding of PPTs.

SnoGloBe predicts SNORD126 interactions that can be validated experimentally
To further investigate snoGloBe's predictions, we focused on SNORD126. SNORD126 is listed as an orphan snoRNA in snoRNABase (Lestrade and Weber 2006), but is predicted to interact with 28S rRNA in snoRNA Atlas (Jorjani et al. 2016) and was shown to methylate this target in rats in a tissue-specific manner (Hebras et al. 2019).
SNORD126 was shown to activate the PI3K-AKT pathway, conferring it an oncogenic role, though the underlying molecular mechanism is still unknown (Fang et al. 2017, 126). The SNORD126 interactions predicted by snoGloBe against protein coding genes led to interesting profiles. The most represented interaction region in SNORD126 is 13 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ; located in the middle of the snoRNA, overlapping the D' box, which doesn't correspond to an ASE ( Figure 5A), suggesting a molecular mechanism of target binding different from the canonical mechanism. The predicted folding of SNORD126 forcing the predicted interacting region to remain single stranded is shown in Figure 5B. SNORD126 interactions are enriched in exons (1.6 fold) and particularly enriched on intron-exon junctions (>50 fold) compared to the protein coding transcriptome composition ( Figure   5C-D). The exons predicted to be targeted by SNORD126 are enriched in 5'UTR ( Figure   5E-F). SNORD126 predicted interactions seem somewhat uniformly distributed in the target exons, but are enriched around 80 nucleotides upstream of the exons ( Figure 5G).
Taken together, these data suggest that SNORD126 could play a role in the regulation of RNA stability and splicing.
To investigate this hypothesis, we studied the effect of SNORD126 knockdown on the predicted target genes by RNA-seq. Indeed, we found that the knockdown of SNORD126 significantly affects the RNA level of 1050 protein coding genes including 65 predicted targets (Fig. 5H), showing a significant enrichment (p-value < 0.0001 by random sampling analysis as described in the Methods). The genes affected by the knockdown are mostly downregulated (Fig. 5I). As for the regulation of alternative splicing, 309 such events are affected by the knockdown of SNORD126, 25 of which overlap a predicted SNORD126 binding site (p-value = 0.0003). Amongst the interesting candidates, the target site of SNORD126 on CPT1B overlaps an alternative 5' splice site detected with a differential splicing pattern and the target site of SNORD126 on MR1 is near a 3' splice site for which the intron has an alternative 5' extremity ( Figure S13).
Interestingly, three genes having an alternative splicing event overlapping a predicted interaction were also differentially expressed upon SNORD126 knockdown: CPT1B, MR1 and DDX11. In contrast to snoGloBe predictions of SNORD126 for which a total of 72 were detected as either stability and/or splicing targets following SNORD126 14 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ; https://doi.org/10.1101/2021.09.14.460265 doi: bioRxiv preprint knockdown, only two SNORD126 interactions were identified in the HTRRI datasets, one with a mRNA and one with the 18S rRNA, neither of which is detected as affected by the SNORD126 knockdown, emphasizing the usefulness of snoGloBe in addition to highthroughput methodologies.

SnoGloBe predicts interactions between human snoRNAs and the SARS-CoV-2 transcriptome
As another example of the utility of snoGloBe, we applied it to the SARS-CoV-2 transcriptome. It has been shown that the SARS-CoV-2 genome is heavily 2'-O-ribose methylated and interacts strongly with snoRNAs using the high-throughput structure probing methodology SPLASH (Aw et al. 2016;Yang et al. 2021). Thus as another proof of concept for the utility and capacity of snoGloBe, we predicted the interactions between human snoRNAs and the SARS-CoV-2 genome using snoGloBe. We detected 8818 interactions between 312 snoRNAs and the SARS-CoV-2 genome, the distribution of which is shown in Figure 6A. One of the strongest interaction partners between SARS-CoV-2 and host transcripts is reported to be with the box C/D snoRNA SNORD27 (Yang et al. 2021). Although, the SPLASH experiments were carried out in Vero-E6 cells from African green monkey kidney, snoGloBe detects the SARS-CoV-2 interaction with human SNORD27 (Figure 6B), suggesting that the interaction is also relevant in human.

SnoGloBe availability and usage
The snoGloBe code is written in Python using the machine learning package scikit-learn (Pedregosa et al. 2011). It is freely available and can be downloaded from gitlabscottgroup.med.usherbrooke.ca/scott-group/snoglobe. Users must provide a file with the sequences of the snoRNAs of interest, the sequences of whole chromosomes,

15
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. an annotation file in gtf format and a file with the potential target identifiers to scan for snoRNA interactions. Detailed instructions are available in the help manual.

DISCUSSION
Motivated by the continually increasing number of examples of snoRNAs interacting with noncanonical targets using diverse regions within but also without the ASE (Figure 1) as well as by the diversity in RNAs targeted by snoRNAs (Figure 2A), we built snoGloBe, a box C/D snoRNA interaction predictor that considers the whole snoRNA and any type of RNA target. SnoGloBe is a gradient boosting classifier that takes into account the sequence of the snoRNA and its potential target as well as the position in the snoRNA and the type and position in the potential target. Compared to general use RNA-RNA interaction predictors that consider only sequence complementarity and the interaction stability of the duplex, snoGloBe performs considerably better, suggesting that considering snoRNA and target features enhances the prediction. SnoGloBe also performs better than the snoRNA-specific predictor PLEXY which was not built to predict interactions outside of the snoRNA ASEs. Many such non-ASE interactions are detected in HTRRI datasets and some have been extensively validated for individual snoRNAs ( Figure 1B), limiting the scope of the PLEXY predictor. Interestingly a subset of positive examples (30%) is not found by snoGloBe and while this proportion is considerably lower than for all other predictors considered (they miss >70% of positives in the test set, Figure 3D), there is still room for improvement. This subset involves mostly interactions displaying bulges in the base pairing in one or both members of the interaction, which is more difficult to accurately identify and will require different approaches and likely larger training datasets for machine learning approaches to accurately predict them.

16
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ; https://doi.org/10.1101/2021.09.14.460265 doi: bioRxiv preprint The study of the snoGloBe predicted interactions in human is very interesting and opens numerous research avenues that will likely lead to important insights into snoRNA function. Dozens of snoRNAs display profiles supporting the non-uniform distribution of predicted targets in pre-mRNA, with hundreds or even thousands of targets enriched in common regulatory elements such as PPTs, 5' or 3' splice sites and 5' or 3' UTRs (Figures 4, 5, S8, S9, S11). Each such snoRNA target profile will require in depth integrative analysis to consider the possible functionality, molecular mechanism and ultimately cellular outcome of the collective regulation of these targets by the snoRNA.
We began such studies for SNORD50B, SNORD22 and SNORD126, all three of which display strong enrichment for binding to specific regulatory elements, respectively 5' UTRs, PPTs/3' splice sites and 3' of introns. Manual review of their predicted targets led to the identification of a subset of such binding events overlapping alternatively regulated events (for example alternative 5' UTRs for SNORD50B and alternatively spliced exons for SNORD22, Figures S10, S12). Gene ontology analysis of the targets show strong enrichment for specific biological processes and provide convincing subsets of targets to focus on. Finally the strong overlap between snoRNA predicted binding sites and ENCODE-detected RBP binding sites is important evidence of the functional relevance of the interactions, particularly as the function of the RBPs with the strongest overlap strongly supports the type of regulation likely carried out by the snoRNA. Several molecular mechanisms could explain the binding overlap of snoRNA and RBP at the same position on the same target pre-mRNA. The snoRNA could be guiding the RBP to its target as snoRNAs do for core snoRNA binding proteins such as FBL, and as has been shown for nuclear exosome components as shown in (Zhong et al. 2015).
However, since RNA binding motifs are known for several of the RBP considered and because snoRNAs have not been found as enriched binding partners of all these RBPs, it is likely that the snoRNAs and some of the RBPs are competing for the same binding 17 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ; https://doi.org/10.1101/2021.09.14.460265 doi: bioRxiv preprint site. Further studies will be required to define the snoRNA-RBP relationship and its effect on the regulation of the targets. These overlapping snoRNA-RBP targets could be revealing novel levels of post-transcriptional regulation, the understanding of which will be important in health and disease.
Overall, while HTRRI datasets have collectively been generated in a handful of cell lines, considerable time and money would be required to explore normal human tissues and diverse conditions using these methodologies. SnoGloBe predictions will be instrumental in filling the gap by providing rapid predictions for snoRNA interactions that can then be further investigated to better understand cellular functionality. Our additional demonstration that snoGloBe can be used to investigate the interactions between snoRNAs and viral transcripts (Figure 6), further widens its scope and utility.

High-throughput RNA-RNA interaction analysis
The high-throughput RNA-RNA interaction datasets from PARIS (SRR2814761, SRR2814762, SRR2814763, SRR2814764 and SRR2814765), LIGR-seq (SRR3361013 18 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ; https://doi.org/10.1101/2021.09.14.460265 doi: bioRxiv preprint All the samples were analyzed using the PARIS pipeline as described in sections 3.7 and 3.8 from (Lu et al. 2018 The RNA duplexes were assigned to genes using the annotation file described in (Boivin et al. 2018)  interactions were split at each bulge to ensure the correct alignment of the snoRNA and target sequences. The interactions shorter than 13 nucleotides were removed to respect the length of the windows (see Input features section). We finally removed interactions that were already known and present in our positive set described in the next section.

Positive set composition
The positive set is composed of the previously detected snoRNA interactions from PARIS, LIGR-seq and SPLASH filtered as described above, as well as interactions obtained from snoRNABase (Lestrade and Weber 2006) and manually curated interactions from the literature (Table S1) (Fig 2B). Interactions from snoRNABase and

19
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ; https://doi.org/10.1101/2021.09.14.460265 doi: bioRxiv preprint from the literature that were shorter than 13 nucleotides were padded by adding their flanking sequence to respect the length threshold.

Negative set composition
The negative set is composed of random negatives and matched negatives (Fig 2B).
The random negative examples are the combination of random sequences from any box C/D snoRNA and any gene, whereas the matched negative examples are random sequences coming from a positive snoRNA-target gene combination (Fig 2C).

Input features
The interactions were split in 13 nucleotides sliding windows, with a step of 1 nucleotide.
The 13 nucleotide window length was chosen to limit the chance of finding this sequence randomly in the genome, and most of the known interactions respect this length. Each interaction window is composed of 13 nucleotides of the snoRNA and the corresponding 13 nucleotides of the target. The input features used are the window sequences in onehot encoding, the relative position in the snoRNA between 0 and 1, the location in the target gene (intron, exon, 3'UTR and/or 5'UTR) and the target biotype ( Fig 2D). The biotypes considered are listed in snoGloBe's manual. Protein coding, pseudogene and long noncoding RNA biotypes were grouped according to http://ensembl.org/Help/Faq? id=468 and http://vega.archive.ensembl.org/info/about/gene_and_transcript_types.html.

Redundancy removal for tuning, training and test sets
The positive and negative examples were split into hyperparameter tuning, training and test sets. First, to remove redundancy from the sets, the snoRNAs were grouped based on their Rfam identifier (Griffiths-Jones et al. 2003). To ensure that the model is not trained and tested on similar snoRNAs, the Rfam families were split in order to assign 20 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made remaining examples were split between the hyperparameter tuning and training set to get respectively 10% and 70% of initial data. (Fig 2E) Building the model  (Lorenz et al. 2011). To compare their performance on a similar basis, we selected a threshold (either a score for snoGloBe, or an energetic cut-off for the other tools) resulting in 90% precision on the test set. The details are available in Figure S2. PLEXY was only used for snoRNAs with non-degenerated boxes D and D' to avoid bias caused by misidentified boxes.
Prediction against protein coding genes 21 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ; https://doi.org/10.1101/2021.09.14.460265 doi: bioRxiv preprint SnoGloBe was used to predict box C/D snoRNA interactions with protein coding transcripts. For this analysis, only box C/D snoRNAs expressed at 1 transcript per million (TPM) or more in at least one of the RNA-seq datasets from 7 different healthy human tissues (3 samples from different individuals for each of the following tissues: brain, breast, liver, ovary, prostate, skeletal muscle and testis) from a previous study (Fafard-Couture et al. 2021) (available from GEO: GSE126797, GSE157846) were considered.
We predicted the interactions of these snoRNAs against all protein coding regions of the genome, split in 13-nucleotide windows with a step of two. We took whole gene sequences to predict interactions with any intron and exon. To narrow the number of predictions obtained, we kept the interactions having at least three consecutive windows with a score >= 0.98 for further analysis. The gene ontology enrichment analysis of the predicted targets was done using g:Profiler (Raudvere et al. 2019).  Table S2. Only the eCLIP regions having a p-value <= 0.01 were kept. Datasets from the same protein were merged using BEDTools merge -s (v2.26.0) (Quinlan and Hall 2010). The number of overlaps between the predicted interactions and the eCLIP regions was computed using BEDTools intersect -s.

SNORD126 knockdown
HepG2 cells were cultured in complete Eagle's Minimum Essential Medium (EMEM from Wisent) and passaged twice a week, according to ATCC guidelines. Trypsinized cells were then seeded at 350000 cells/well in 6 well plates in 1ml EMEM. Cells were 22 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ;https://doi.org/10.1101https://doi.org/10. /2021 transfected 24 hours later with 2 different ASOs targeting SNORD126 (30nM or 40nM) using Lipofectamine 2000 (LIFE technologies) and optiMEM (Wisent). A scrambled ASO was used as a negative control. The sequence of the ASOs are listed in Supplemental Figure S14.
Cells were harvested 48 hours post transfection, washed and pelleted, then resuspended in 1ml Trizol and stored at -80℃ until RNA extraction. This was repeated 3 times to obtain biological triplicates.

RNA extraction
Total RNA extraction from transfected HepG2 cells was performed using RNeasy mini kit (Qiagen) as recommended by the manufacturer including on column DNase digestion with RNase-Free DNase Set (Qiagen). However, 1.5 volumes Ethanol 100% was used instead of the recommended 1 volume ethanol 70% in order to retain smaller RNA. RNA integrity of each sample was assessed with an Agilent 2100 Bioanalyzer.
RNA was reversed transcribed using Transcriptor reverse transcriptase (Roche) and knockdown levels were evaluated by qPCR.

RNA-seq library preparation and sequencing
RNAseq librairies were generated from 1ug DNA-free total RNA/condition using the NEBNext® Ultra™ II Directional RNA Library Prep Kit for Illumina (E7760S) and following the Protocol for use with NEBNext Poly(A) mRNA Magnetic Isolation Module (NEB #E7490). The resulting librairies were submitted to a total of 10 cycles of amplification then purified using 0.9X Ampure XP beads. Quality and size was assessed with an Agilent 2100 Bioanalyser. Librairies were then quantified using a Qubit fluorometer, pooled at equimolar concentration and 1.8pM was sequenced on Illumina's 23 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ;https://doi.org/10.1101https://doi.org/10. /2021 NextSeq 500 using a NextSeq 500/550 High Output Kit v2.5 (150 cycles) paired-end 2x75bp.

Empirical p-value calculation
To evaluate the significance of the overlaps between each snoRNA predicted interactions and eCLIP binding sites, alternative splicing events and differentially expressed genes, we computed an empirical p-value using BEDTools shuffle 10 to 24 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ; 100 000 times for each combination followed by BEDTools intersect -s through pybedtools (Quinlan and Hall 2010;Dale et al. 2011). BEDTools shuffle was used with an appropriate background for each analysis: all protein coding genes for eCLIP binding sites and protein coding genes having an average of 1TPM across all sequencing datasets for differential expression and alternative splicing analyses. For eCLIP binding sites and alternative splicing events, we counted the number of overlaps between the shuffled experiments and the region of the events, whereas for differential expressed genes, we counted the number of overlaps between the shuffled experiments and the entire differentially expressed genes with no regards on the position of the overlap in the gene. P-values were calculated as the proportion of iterations in which the shuffled dataset overlap was at least as extreme as the true dataset overlap.

Prediction of human snoRNA interaction with SARS-CoV-2 transcriptome
We predicted the interaction between the expressed human snoRNAs against SARS-CoV-2 genes, using SARS-CoV-2 ASM985889v3 genome assembly and the annotation file Sars_cov_2.ASM985889v3.101.gtf obtained from Ensembl COVID-19. We used thresholds of minimum 3 consecutive windows having a probability greater or equal to 0.85.    28 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ;  A) The major interaction site in SNORD126 predicted by snoGloBe is located in the middle of the snoRNA and doesn't match the ASE upstream of the boxes D and D'

29
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ;https://doi.org/10.1101https://doi.org/10. /2021 represented in purple. The accumulation profile represents the proportion of SNORD126 predicted interactions overlapping each nucleotide in the snoRNA. B) Predicted folded structure of SNORD126 considering the main region of interaction. Mfold (Zuker 2003) was used to predict the secondary structure of SNORD126, forcing nucleotides 37 to 53 to be single stranded (

30
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ;https://doi.org/10.1101https://doi.org/10. /2021  African green monkey SNORD27 and SARS-CoV-2 is also predicted with human SNORD27. The validated interaction is shown in red, the nucleotide that differs between the African green monkey and human SNORD27 is underlined. The predicted interaction is outlined by the box.

31
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ;https://doi.org/10.1101https://doi.org/10. /2021

37
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ;https://doi.org/10.1101https://doi.org/10. /2021   . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ;https://doi.org/10.1101https://doi.org/10. /2021  . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made  . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ; https://doi.org/10.1101/2021.09.14.460265 doi: bioRxiv preprint A B Figure 6 Deschamps-Francoeur et al. 2021 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ; https://doi.org/10.1101/2021.09.14.460265 doi: bioRxiv preprint Figure S1: Overview of the filtering procedure starting with all distinct interactions detected in at least one HTRRI dataset involving a snoRNA to obtain the final HTRRI formatted datasets. The number of interactions remaining at each step is indicated. 2872 distinct interactions from PARIS, LIGR-seq and SPLASH between a box C/D snoRNA and a known gene 2079 interactions having a paired region >= 13 nucleotides (can contain mismatches and bulges) RNAplex 500 interactions without bulges >= 13 nucleotides (can contain mismatches)

RNAplot 445 non-redundant interactions
Remove already known interactions . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made  . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ; https://doi.org/10.1101/2021.09.14.460265 doi: bioRxiv preprint Figure S3: Distribution of the scores output by snoGloBe for the positive and negative examples in the test set.
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ; https://doi.org/10.1101/2021.09.14.460265 doi: bioRxiv preprint Figure S4: Profiles representing the number of predicted interactions from the hyperparameter tuning and training sets (blue) or the test set (orange) that involve specific positions in the snoRNA, for all snoRNAs considered simultaneously.
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ; Figure S5: Upset plot indicating the number of correctly predicted interactions for all predictors compared, considering all the box C/D snoRNAs. The color legend on the right indicates the different interaction types.
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ; https://doi.org/10.1101/2021.09.14.460265 doi: bioRxiv preprint Figure S6: Boxplots depicting the distributions of the number of predicted interactions as a function of the parameter values used. The parameters considered are the number of predicted interactions per snoRNA (y axis), the score cut-off (x axis) and the minimum number of consecutive windows (different colors, with legend on the right).
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ; https://doi.org/10.1101/2021.09.14.460265 doi: bioRxiv preprint B C D A Figure S7: Profiles measuring the number of predicted interactions involving each position of the snoRNAs SNORD45C (A), SNORD11 (B), SNORD31B (C) and SNORD18A (D). The green and pink highlighting in the profiles represent respectively the positions of the C and D'/D boxes. The predicted folding of each snoRNA, as predicted by RNAplot, is shown using the dot-bracket format. The panel below each profile represents the positional entropy predicted by replot from the ViennaRNA package for each position of the snoRNA. The color legend for the position entropy score is given in E. E . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ; https://doi.org/10.1101/2021.09.14.460265 doi: bioRxiv preprint . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ; https://doi.org/10.1101/2021.09.14.460265 doi: bioRxiv preprint B C D . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ; https://doi.org/10.1101/2021.09.14.460265 doi: bioRxiv preprint A B C Figure S10 Examples of SNORD50B binding sites on alternative 5' UTRs. SNORD50B binding sites are enriched in 5' UTRs many of which are alternative including those in (A) NDIFP2, (B) COPS3 and (C) SPF21, as shown with genome browser screenshots. In each case, the top track displays the predicted binding position. The Human genes track indicates the architecture of the different isoforms encoded for these genes for the positional window chosen and the bottom tracks show the binding sites for the 5 RBPs indicated in Figure S9D according to ENCODE eCLIPs.
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021.  Table showing the number of predicted SNORD22 interactions that overlap binding sites for the indicated RBPs. (All 5 RBPs indicated have a significant overlap). In addition, 3 of the RBPs bind SNORD22 according to eCLIP experiments and SNORD22 is predicted to bind the pre-mRNA of 1 of the RBPs.
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ; https://doi.org/10.1101/2021.09.14.460265 doi: bioRxiv preprint A B C Figure S12 Examples of SNORD22 binding sites on alternative 3'SSs. SNORD22 binding sites are enriched in 3' SSs many of which are alternative including those in (A) DGKZ, (B) APBB1 and (C) PXN. Screenshots are as described in Figure S10.
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ; https://doi.org/10.1101/2021.09.14.460265 doi: bioRxiv preprint Figure S13: Examples of splicing events affected by the knockdown of SNORD126. Both CPT1B (A) and MR1 (B) display differential splicing following the knockdown (KD) of SNORD126 as shown using sashimi plots. The blue arrows represent the predicted interaction region. The colored arcs represent different splice junctions with the number of reads supporting them. Statistics of the splicing event are given on the right. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 15, 2021. ; https://doi.org/10.1101/2021.09.14.460265 doi: bioRxiv preprint Figure S14: List of ASO sequences used for SNORD126 knockdown and negative control (NC5). * means phosphorothioate backbone, m means 2'-O-methoxyethyl.