ABSTRACT
Sequence-specific RNA-binding proteins (RBPs) control many important processes affecting gene expression. They regulate RNA metabolism at multiple levels, by affecting splicing of nascent transcripts, RNA folding, base modification, transport, localization, translation and stability. Despite their central role in most aspects of RNA metabolism and function, most RBP binding specificities remain unknown or incompletely defined. To address this, we have assembled a genome-scale collection of RBPs and their RNA binding domains (RBDs), and assessed their specificities using high throughput RNA-SELEX (HTR-SELEX). Approximately 70% of RBPs for which we obtained a motif bound to short linear sequences, whereas ~30% preferred structured motifs folding into stem-loops. We also found that many RBPs can bind to multiple distinctly different motifs. Analysis of the matches of the motifs on human genomic sequences suggested novel roles for many RBPs in regulation of splicing, and also revealed RBPs that are likely to control specific classes of transcripts. Global analysis of the motifs also revealed an enrichment of G and U nucleotides. Masking of G and U by proteins increases the specificity of RNA folding, as both G and U can pair with two other RNA bases via canonical Watson-Crick or G-U base pairs. The collection, containing 145 high resolution binding specificity models for 86 RBPs, is the largest systematic resource for the analysis of human RBPs, and will greatly facilitate future analysis of the various biological roles of this important class of proteins.
INTRODUCTION
The abundance of protein and RNA molecules in a cell depends on both their rates of production and degradation. These rates are determined directly or indirectly by the sequence of DNA. The transcription rate of RNA and the rate of degradation of proteins are determined by DNA and protein sequences, respectively. However, most regulatory steps that control gene expression are influenced by the sequence of the RNA itself. These steps include RNA splicing, localization, stability, and translation, and can be affected by RNA-binding proteins (RBPs) that specifically recognize short RNA sequence elements (Glisovic et al., 2008).
RBPs can recognize their target sites using two mechanisms: they can form direct contacts to the RNA bases of an unfolded RNA chain, and/or recognise folded RNA-structures (Loughlin et al., 2009). These two recognition modes are not mutually exclusive, and the same RBP can combine both mechanisms in recognition of its target sequence. The RBPs that bind to unfolded target sequences generally bind to each base independently of each other, and their specificity can thus be well explained by a simple position weight matrix (PWM) model. However, recognition of a folded RNA-sequence leads to strong positional interdependencies between different bases due to base pairing. In addition to the canonical Watson-Crick base pairs G:C and A:U, double-stranded RNA commonly contains also G:U base pairs, and can also accommodate other non-canonical base pairing configurations in specific structural contexts (Varani and McClain, 2000).
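As an illustration of the independence assumption behind a PWM, the following minimal sketch builds a frequency matrix from a handful of aligned sites and scores a sequence by summing per-position log-odds. The example sites, the pseudocount, and the uniform background are assumptions made for illustration only; they are not data or parameters from this study.

```python
import math

BASES = "ACGU"

def build_pwm(sites, pseudocount=1.0):
    """Return one base-frequency dict per position from equal-length sites."""
    width = len(sites[0])
    pwm = []
    for i in range(width):
        counts = {b: pseudocount for b in BASES}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in BASES})
    return pwm

def log_odds_score(pwm, seq, background=0.25):
    """Sum of per-position log2 odds; each position contributes independently."""
    return sum(math.log2(pwm[i][b] / background) for i, b in enumerate(seq))

def information_content(pwm):
    """Total information content in bits relative to a uniform background."""
    return sum(
        2.0 + sum(p * math.log2(p) for p in col.values() if p > 0)
        for col in pwm
    )

# Made-up aligned sites, used only to exercise the functions.
sites = ["GUAAG", "GUAAG", "GUGAG", "GUAAG"]
pwm = build_pwm(sites)
```

Because the score is a plain sum over positions, such a model cannot express that a base at one position is preferred only when a complementary base is present at another, which is why paired positions in a stem need a different description.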
It has been estimated that the human genome encodes approximately 1500 proteins that can associate with RNA (Gerstberger et al., 2014). Only some of the RBPs are thought to be sequence specific. Many RNA-binding proteins bind only a single RNA species (e.g. ribosomal proteins), or serve a structural role in ribonucleoprotein complexes or the spliceosome. As RNA can fold to complex three-dimensional structures, defining what constitutes an RBP is not simple. In this work, we have focused on RBPs that are likely to bind to short sequence elements analogously to sequence-specific DNA binding transcription factors. The number of such RBPs can be estimated based on the number of proteins containing one or more canonical RNA-binding protein domains. The total number is likely to be ~400 RBPs (Cook et al., 2011; Ray et al., 2013). The major families of RBPs contain canonical RNA-binding protein domains (RBDs) such as the RNA recognition motif (RRM), CCCH zinc finger, K homology (KH) and cold shock domain (CSD). A smaller number of proteins bind RNA using La, HEXIM, PUF, THUMP, YTH, SAM and TRIM-NHL domains. In addition, many “non-canonical” RBPs that do not contain any of the currently known RBDs have been reported to bind specifically to RNA (see, for example, (Gerstberger et al., 2014)).
Various methods have been developed to determine the binding positions and specificities of RNA binding proteins. Methods that use crosslinking of RNA to proteins followed by immunoprecipitation and then massively parallel sequencing (CLIP-seq or HITS-CLIP, reviewed in (Darnell, 2010), and PAR-CLIP (Hafner et al., 2010)) can determine RNA positions bound by RBPs in vivo, whereas other methods such as SELEX (Tuerk and Gold, 1990), RNA bind-N-seq (Lambert et al., 2015) and RNAcompete (Ray et al., 2009) can determine motifs bound by RBPs in vitro. Most high-resolution models derived to date have been determined using RNAcompete, where microarrays are used to generate a library of RNA-molecules containing all possible 7-base long subsequences in at least 256 oligonucleotides, and the desired RBP is then used to select its target sites followed by detection of the bound sites using a second microarray. RNAcompete has been used to analyze large numbers of RBPs from multiple species including generation of PWMs for 75 human RBPs (Ray et al., 2013).
The CISBP-RNA database (Ray et al., 2013) (Database Build 0.6) currently lists a total of 392 high-confidence RBPs in human, but contains high-resolution specificity models for only 100 of them (Ray et al., 2013). In addition, a literature curation based database RBPDB (Cook et al., 2011) contains experimental data for 133 human RBPs, but mostly contains individual target- or consensus sites, and only has high resolution models for 39 RBPs. Thus, despite the central importance of RBPs in fundamental cellular processes, the precise sequence elements bound by most RBPs remain to be determined. To address this problem, we have in this work developed high-throughput RNA SELEX (HTR-SELEX) and used it to determine binding specificities of human RNA binding proteins. Our analysis suggests that many RBPs prefer to bind structured RNA motifs, and can associate with several distinct sequences. The distribution of motif matches in the genome indicates that many RBPs have central roles in regulation of RNA metabolism and activity in cells.
RESULTS
Identification of RNA-binding motifs using HTR-SELEX
To identify binding specificities of human RBPs, we established a collection of canonical and non-canonical full-length RBPs and RNA binding domains. The full-length constructs representing 819 putative RBPs were picked from the Orfeome 3.1 and 8.1 collections (Lamesch et al., 2007) based on annotations of the CisBP database for conventional RBPs (Ray et al., 2013) and Gerstberger et al. (Gerstberger et al., 2014) to include additional unconventional RBPs. The 293 constructs designed to cover all canonical RBDs within 156 human RBPs were synthesized based on Interpro defined protein domain calls from ENSEMBL v76. Most RBD constructs contained all RBDs of a given protein with 15 amino-acids of flanking sequence (see Table S1 for details). Constructs containing subsets of RBDs were also analyzed for some very large RBPs. Taken together our clone collection covered 942 distinct proteins. The RBPs were expressed in E.coli as fusion proteins with thioredoxin, incorporating an N-terminal hexahistidine and a C-terminal SBP tag (Jolma et al., 2015).
To identify RNA sequences that bind to the proteins, we subjected the proteins to HTR-SELEX (Figure 1A). In HTR-SELEX, a 40 bp random DNA sequence containing a sample index and 5’ and 3’ primer binding sequences is transcribed into RNA using T7 RNA polymerase, and incubated with the individual proteins in the presence of RNase inhibitors, followed by capture of the proteins using metal-affinity resin. After washing and RNA recovery, a DNA primer is annealed to the RNA, followed by amplification of the bound sequences using a reverse-transcription polymerase chain reaction (RT-PCR) using primers that regenerate the T7 RNA polymerase promoter. The entire process is repeated up to a total of four selection cycles. The amplified DNA is then sequenced, followed by identification of motifs using the Autoseed pipeline (Nitta et al., 2015) modified to analyze only the transcribed strand. Compared to previous methods such as RNAcompete, HTR-SELEX uses a selection library with very high sequence complexity, allowing identification of long RNA binding preferences.
The analysis resulted in generation of 145 binding specificity models for 86 RBPs. Most of the results (66 RBPs) were replicated in a second HTR-SELEX experiment. The success rate of our experiments was ~ 22% for the canonical RBPs, whereas the fraction of the successful non-canonical RBPs was much lower (~ 1.3%; Table S1). Comparison of our data with a previous dataset generated using RNAcompete (Ray et al., 2013) and to older data that has been compiled in the RBPDB-database (Cook et al., 2011) revealed that the specificities were generally consistent with the previous findings (Figure S1). HTR-SELEX resulted in generation of a larger number of motifs than the previous systematic studies, and revealed the specificities of 49 RBPs whose high-resolution specificity was not previously known (Figure 1B). Median coverage per RBD family was 24 % (Figure 1C). Compared to the motifs from previous studies, the motifs generated with HTR-SELEX were also wider, and had a higher information content (Figure S2), most likely due to the fact that the sequences are selected from a more complex library in HTR-SELEX (see also (Yin et al., 2017)). The median width and information contents of the models were 10 bases and 10 bits, respectively.
Some RBPs bind to RNA as dimers or multimers
Analysis of enriched sequences revealed that 31% of RBPs could bind to a site containing a direct repeat of the same sequence (Figure S3), suggesting that some RBPs were homodimers, or interacted to form a homodimer when bound to the RNA. In these cases, the gap between the repeats was generally short, with a median gap of 5 nucleotides (Figure S3). To determine whether the HTR-SELEX identified gap length preferences were also observed in sites bound in vivo, we compared our data against existing in vivo data for five RBPs for which high quality PAR-CLIP and HITS-CLIP derived data was available from previous studies (Farazi et al., 2014; Hafner et al., 2010; Weyn-Vanhentenryck et al., 2014), and found that preferred spacing identified in HTR-SELEX was in most cases also observed in the in vivo data (Figure S4).
Recognition of RNA structures by RBPs
Unlike double-stranded DNA, whose structure is relatively independent of sequence, RNA folds into complex, highly sequence-dependent three dimensional structures. To analyze whether RBP binding depends on RNA secondary structure, we identified characteristic patterns of dsRNA formation by identifying correlations between all pairs of base positions either within the motif or in its flanking regions, using a measure described in (Nitta et al., 2015) that determines how much the observed count of each combination of two bases deviates from the count expected if the positions were independent (Figure 2A). The vast majority of the observed deviations from the independence assumption were consistent with the formation of an RNA stem-loop structure (example in Figure 2B). In addition, we identified one RBP, LARP6, that bound to a predicted internal loop embedded in a double-stranded RNA stem (Figure 2C, Figure S5). This binding specificity is consistent with the earlier observation that LARP6 binds to stem-loops with internal loops found in mRNAs encoding the collagen proteins COL1A1, COL1A2 and COL3A1 (Cai et al., 2010) (Figure S5).
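The idea of comparing observed and expected base-pair counts can be sketched as follows. This is a simplified illustration (observed count minus the count expected from the marginal base frequencies), not the exact measure of Nitta et al.; the function names and the toy sequences are assumptions.

```python
from collections import Counter
from itertools import combinations

def pair_deviation(seqs, i, j):
    """Observed minus expected count for each base combination at positions i, j.

    The expected count assumes that the two positions are independent.
    """
    n = len(seqs)
    ci = Counter(s[i] for s in seqs)
    cj = Counter(s[j] for s in seqs)
    cij = Counter((s[i], s[j]) for s in seqs)
    return {(a, b): cij[(a, b)] - ci[a] * cj[b] / n for a in ci for b in cj}

def total_deviation(seqs, i, j):
    """Sum of absolute deviations; zero when the positions look independent."""
    return sum(abs(d) for d in pair_deviation(seqs, i, j).values())

# Toy reads in which the first and last bases covary as base pairs.
seqs = ["GAAAC", "CAAAG", "AAAAU", "UAAAA", "GAAAC", "CAAAG"]
strongest = max(combinations(range(5), 2),
                key=lambda ij: total_deviation(seqs, *ij))
```

In this toy input only the outermost positions covary, so they stand out from all other position pairs; in a real stem, consecutive paired positions would give the diagonal pattern seen in the correlation diagrams.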
In total, 69% (59 of 86) of RBPs recognized linear sequence motifs that did not appear to have a preference for a specific RNA secondary structure. The remaining 31% (27 of 86) of RBPs could bind at least one structured motif (Figure 2D); this group included several known structure-specific RBPs, such as RC3H1, RC3H2 (Leppek et al., 2013), RBMY1E, RBMY1F, RBMY1J (Skrisovska et al., 2007) and HNRNPA1 (Chen et al., 2016; Orenstein et al., 2018). A total of 15 RBPs bound exclusively to structured motifs, whereas 12 RBPs could bind to both structured and unstructured motifs. The median length of the stem region observed in all motifs was 5 bp, and the loops were between 3 and 15 bases long, with a median length of 11 (Figure 2E). Of the RBP families, KH and HEXIM motifs we found were linear, whereas some proteins from RRM, CSD, Zinc finger and LA-domain families could bind to both structured and unstructured sites (Figure S6).
To model RBP binding to stem-loop structures, we developed a simple stem-loop model (SLM; Figure 2B). This model describes the loop as a position independent model (PWM), and the stem by a nucleotide pair model where the frequency of each combination of two bases at the paired positions is recorded. In addition, we developed two different visualizations of the model, a T-shaped motif that describes the mononucleotide distribution for the whole model, and the frequency of each set of bases at the paired positions by thickness of edges between the bases (Figure 3), and a simple shaded PWM where the stem part is indicated by a gray background where the darkness of the background indicates the fraction of bases that pair with each other using Watson-Crick or G:U base pairs (Figure 3). On average, the SLM increased the information content of the motifs by 4.2 bits (Figure S7). As expected from the correlation structure, a more detailed analysis of the number of paired bases within 10 bp from the seed sequence of MKRN1 revealed that >80% of individual sequence reads had more than four paired bases, compared to ~15% for the control RBP (ZRANB2) for which a structured motif was not identified.
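A minimal sketch of how a sequence could be scored against such a model is given below. The layout (a 5' stem arm, a loop, and a 3' stem arm of equal length), the log-odds form of the score, the floor value for unseen combinations, and all frequencies are assumptions for illustration; the scoring method used for genome mapping in this work is the one in the rmap repository cited in the Methods, not this code.

```python
import math

def score_slm(seq, stem_pairs, loop_pwm, floor=1e-3):
    """Log2-odds score of seq under a toy stem-loop model.

    stem_pairs: list (outermost pair first) of dicts mapping
                (5' base, 3' base) to the frequency of that combination.
    loop_pwm:   list of dicts mapping base to frequency per loop position.
    floor:      frequency assigned to combinations missing from the model.
    """
    k, m = len(stem_pairs), len(loop_pwm)
    if len(seq) != 2 * k + m:
        raise ValueError("sequence length does not match the model")
    score = 0.0
    # Stem: each paired position is scored jointly (16 possible combinations).
    for t, pair_freq in enumerate(stem_pairs):
        pair = (seq[t], seq[-1 - t])
        score += math.log2(pair_freq.get(pair, floor) / (1 / 16))
    # Loop: positions are scored independently, as in a PWM.
    for u, col in enumerate(loop_pwm):
        score += math.log2(col.get(seq[k + u], floor) / 0.25)
    return score

# Hypothetical model: two stem pairs favouring Watson-Crick and G:U pairs,
# and a three-base loop preferring U-A-U.
wc = {("G", "C"): 0.3, ("C", "G"): 0.3, ("A", "U"): 0.15,
      ("U", "A"): 0.15, ("G", "U"): 0.05, ("U", "G"): 0.05}
stem = [wc, wc]
loop = [{"U": 0.7, "A": 0.1, "C": 0.1, "G": 0.1},
        {"A": 0.7, "U": 0.1, "C": 0.1, "G": 0.1},
        {"U": 0.7, "A": 0.1, "C": 0.1, "G": 0.1}]
```

With this toy model, `score_slm("GCUAUGC", stem, loop)` is higher than `score_slm("GGUAUAA", stem, loop)`, because only the first sequence can form both stem pairs.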
Classification of RBP motifs
To analyze the motif collection globally, we developed PWM and SLM models for all RBPs. To compare the motifs, we determined their similarity using SSTAT. To simplify the analysis, the PWM models were used for this comparison even for factors that bound to the structured motifs. We then used the dominating set method (Jolma et al., 2013) to identify distinctly different motifs (Figure S8). Comparison of the motifs revealed that in general, the specificities of evolutionarily related RBPs were similar (Figure 5 and Figure S8). For the largest RRM family, the 96 members were represented by 47 specificity classes, whereas the smaller classes such as CCCH, KH, CSD, and HEXIM were represented by 9, 10, 6 and 1 motifs, representing 17, 11, 7 and 2 different specificities, respectively (Figure S8).
Analysis of the dinucleotide content of all motifs revealed unexpected differences in occurrence of distinct dinucleotides within the PWMs. The dinucleotides GG, GU, UG and UU were far more common than other dinucleotides (Figure 4G; fold change 2.75; p < 0.00225; t-test). This suggests that G and U bases are most commonly bound by RBPs. This effect was not due to the presence of stem structures in the motifs, as the unstructured motifs were also enriched in G and U. The masking of G and U bases by protein binding may assist in folding of RNA to defined structures, as G and U bases have lower specificity in pairing compared to C and A, due to the presence of the G:U base pair in RNA.
Most RBPs bound to only one motif. However, 41 RBPs could bind to multiple different sites (Figure 5). In five cases, the differences between the primary and secondary motif could be explained by a difference in spacing between the two half-sites. In 12 cases, one of the motifs was structured, and the other linear. In addition, in eight RBPs the primary and secondary motifs represented two different structured motifs, where the loop length or the loop sequence varied (Figure 5). In addition, for four RBPs, we recovered more than two different motifs. The most complex binding specificity we identified belonged to LARP6 (Figure 5 and S9), which could bind to multiple simple linear motifs, multiple dimeric motifs, and the internal loop-structure described above.
Conservation and occurrence of motif matches
We next analyzed the enrichment of the motif occurrences in different classes of human transcripts. The normalized density of motifs for each factor at both strands of DNA was evaluated for transcription start sites (TSSs), splice donor and acceptor sites, and translational start and stop positions (see Supplementary Data S1 for full data). This analysis revealed that many RBP recognition motifs were enriched at splice junctions. The most enriched linear motif in splice donor sites was ZRANB2, a known regulator of alternative splicing (Figure 6A) (Loughlin et al., 2009). Analysis of matches to structured motifs revealed even stronger enrichment of motifs for ZC3H12A, B and C to splice donor sites (Figure 6A). These results suggest a novel role for ZC3H12 proteins in regulation of splicing. The motifs for both ZRANB2 and ZC3H12 protein factors were similar but not identical to the canonical splice donor consensus sequence (ag | GURagu) that is recognized by the spliceosome, suggesting that these proteins may act by masking a subset of splice donor sites.
Analysis of splice acceptor sites also revealed that motifs for known components of the spliceosome, such as RBM28 (Damianov et al., 2006), were enriched in introns and depleted in exons. Several motifs were also enriched at the splice junction, including the known regulators of splicing IGF2BP1 and ZFR (Supplementary Data S1) (Haque et al., 2018; Huang et al., 2018). In addition, we found several motifs that mapped to the 5’ of the splice junction, including some known splicing factors such as QKI (Hayakawa-Yano et al., 2017) and ELAVL1 (Bakheet et al., 2018), and some factors such as DAZL, CELF1 and BOLL for which a role in splicing has to our knowledge not been reported (Figure 6A and Supplementary Data S1) (Rosario et al., 2017; Xia et al., 2017).
To determine whether the identified binding motifs for RBPs are biologically important, we analyzed the conservation of the motif matches in mammalian genomic sequences close to splice junctions. This analysis revealed strong conservation of several classes of motifs in the transcripts (Figure 6B), indicating that many of the genomic sequences matching the motifs are under purifying selection.
Both matches to ZRANB2 and ZC3H12 motifs were also enriched in 5’ regions of transcripts, but not in anti-sense transcripts originating from promoters (Figure 6C), suggesting that these motifs also have a role in differentiating between sense and anti-sense transcripts of mRNAs.
To identify biological roles of the motifs, we also used Gene Ontology Enrichment analysis to identify motifs that were enriched in specific types of mRNAs. This analysis revealed that many RBP motifs are specifically enriched in particular classes of transcripts. For example, we found that MEX3B motifs were enriched in genes involved in type I interferon-mediated signaling pathway (Figure 6D). Taken together, our analysis indicates that RBP motifs are biologically relevant, as matches to the motifs are conserved, and occur specifically in genomic features and in transcripts having specific biological roles.
DISCUSSION
In this work, we have determined the RNA-binding specificities of a large collection of human RNA-binding proteins. The tested proteins included both proteins with canonical RNA binding domains and putative RBPs identified experimentally (Gerstberger et al., 2014; Ray et al., 2013). The method used for analysis involved selection of RNA ligands from a collection of random 40 nucleotide sequences. Compared to previous analyses of RNA-binding proteins, the HTR-SELEX method allows identification of structured motifs, and motifs that are relatively high in information content. The method can identify simple sequence motifs or structured RNAs, provided that their information content is less than ~40 bits. However, due to the limit on information content, and requirement of relatively high-affinity binding, the method does not generally identify highly structured RNAs that in principle could bind to almost any protein. Consistent with this, most binding models that we could identify were for proteins containing canonical RBDs.
Motifs were identified for a total of 86 RBPs. Interestingly, a large fraction of all RBPs (47%) could bind to multiple distinctly different motifs. The fraction is much higher than that observed for double-stranded DNA binding transcription factors, suggesting that sequence recognition and/or individual binding domain arrangement on single-stranded RNA can be more flexible than on dsDNA. Analysis of the mononucleotide content of all the models also revealed a striking bias towards recognition of G and U over C and A. This may reflect the fact that formation of RNA structures is largely based on base pairing, and that G and U are less specific in their base pairings than C and A. Thus, RBPs that mask G and U bases increase the overall specificity of RNA folding in cells.
Similar to proteins, depending on sequence, single-stranded nucleic acids may fold into complex and stable structures, or remain largely disordered. Most RBPs preferred short linear RNA motifs, suggesting that they recognize RNA motifs found in unstructured or single-stranded regions. However, approximately 31% of all RBPs preferred at least one structured motif. The vast majority of the structures that they recognized were simple stem-loops, with relatively short stems, and loops of 3-15 bases. Most of the base specificity of the motifs was found in the loop region, with only one or few positions in the stem displaying specificity beyond that caused by the paired bases. This is consistent with the structure of fully-paired double-stranded RNA where base pair edge hydrogen-bonding information is largely inaccessible in the deep and narrow major groove. In addition, we identified one RBP that bound to a more complex structure. LARP6 recognized an internal loop structure where two base-paired regions were linked by an uneven number of unpaired bases.
Compared to TFs, which display complex dimerization patterns when bound to DNA, RBPs displayed simpler dimer spacing patterns. This is likely due to the fact that the backbone of a single-stranded nucleic acid has rotatable bonds. Thus, cooperativity between two RBDs requires that they bind to relatively closely spaced motifs.
Analysis of the biological roles of the RBP motif matches indicated that many motif matches were conserved, and specifically located at genomic features such as splice junctions. In particular, our analysis suggested a new role for ZC3H12, BOLL and DAZL proteins in regulating alternative splicing, and MEX3B in binding to type I interferon-regulated genes. As a large number of novel motifs were generated in the study, we expect that many other RBPs will have specific roles in particular biological functions.
Our results represent the largest single systematic study of human RNA-binding proteins to date. This class of proteins is known to have major roles in RNA metabolism, splicing and gene expression. However, the precise roles of RBPs in these biological processes are poorly understood, and in general the field has been severely understudied. The generated resource will greatly facilitate research in this important area.
MATERIALS AND METHODS
Clone collection, cloning and protein production
Clones were either collected from the human Orfeome 3.1 and 8.1 clone libraries (full length clones) or ordered as synthetic genes from Genscript (eRBP constructs). As in our previous work (Jolma et al., 2013), protein-coding synthetic genes or full length ORFs were cloned into pETG20A-SBP to create an E.coli expression vector that allows the RBP or RBD cDNAs to be fused N-terminally to Thioredoxin+6XHis and C-terminally to SBP-tags. Fusion proteins were then expressed in the Rosetta P3 DE LysS E.coli strain (Novagen) using an autoinduction protocol (Jolma et al., 2015). All constructs described in Table S1 were expressed and subjected to HTR-SELEX, regardless of protein level expressed. Protein production was assessed in parallel by 96-well SDS-PAGE (ePage, Invitrogen). The success rate of protein production was dependent on the size of the proteins, with most small RBDs expressing well in E.coli. Significantly lower yield of protein was observed for full-length proteins larger than 50 kDa.
After HIS-tag based IMAC purification, glycerol was added to a final concentration of 10%. Samples were split to single-use aliquots with approximately 200 ng RBP in a 5μl volume and frozen at −80°C.
Selection library generation
To produce a library of RNA sequences for selection (selection ligands), we first constructed dsDNA templates by combining three oligonucleotides together in a three cycle PCR reaction (Phusion, NEB). For information about the barcoded ligand design, see Table S1. The ligand design was similar to that used in our previous work analyzing TF binding specificities in dsDNA (Jolma et al., 2013) except for the addition of a T7 RNA polymerase promoter in the constant flanking regions of the ligand. RNA was expressed from the DNA-templates using T7 in vitro transcription (Ampliscribe T7 High Yield Transcription Kit, Epicentre or Megascript-kit Ambion) according to manufacturer’s instructions, after which the DNA-template was digested using RNAse-free DNAseI (Epicentre) or the TURBO-DNAse supplied with the Megascript-kit. All RNA-production steps included RiboGuard RNAse-inhibitor (Epicentre).
Two different approaches were used to facilitate the folding of RNA molecules. In the protocol used in experiments where the batch identifier starts with letters “EM”, RNA-ligands were heated to +70°C followed by gradual, slow cooling to allow the RNA to fold into minimal energy structures, whereas in batches “AAG” and “AAH” RNA transcription was not followed by such folding protocol. The rationale was that spontaneous co-transcriptional RNA-folding may better reflect folded RNA structures in the in vivo context. In almost all of the cases where the same RBPs were tested with both of the protocols the results were highly similar.
HTR-SELEX assay
Selection reactions were performed as follows: ~200ng of RBP was mixed on ice with ~1μg of the RNA selection ligands to yield an approximate 1:5 molar ratio of protein to ligand in 20μl of Promega buffer (50 mM NaCl, 1 mM MgCl2, 0.5 mM Na2EDTA and 4% glycerol in 50 mM Tris-Cl, pH 7.5). The complexity of the initial DNA library is approximately 10^12 DNA molecules with 40 bp random sequence (~20 molecules of each 20 bp sequence on the top strand). The upper limit of detection of sequence features of HTR-SELEX is thus around 40 bits of information content.
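The arithmetic behind these estimates can be reproduced with a few lines. The sketch reads the library complexity as roughly 10^12 molecules and assumes that the copy number per 20-mer is obtained by counting every 20-base window of the 40-base random region; that reading is our interpretation of the figures quoted above, and the variable names are illustrative.

```python
library_molecules = 10**12   # approximate number of molecules in the library
random_length = 40           # length of the randomized region (bases)
k = 20                       # subsequence length considered (bases)

# Each 40-base random region contains 21 overlapping windows of 20 bases.
windows_per_molecule = random_length - k + 1

# Number of distinct 20-mers: 4^20, about 1.1e12.
distinct_kmers = 4 ** k

# Expected occurrences of any given 20-mer across the whole library.
copies_per_kmer = library_molecules * windows_per_molecule / distinct_kmers

# A fully specified base carries log2(4) = 2 bits, so 20 bases give 40 bits.
bits_per_base = 2
max_information_bits = bits_per_base * k

print(f"{copies_per_kmer:.1f} copies per 20-mer, "
      f"{max_information_bits} bits upper limit")
```

Under these assumptions each 20-mer is expected about 19 times, in line with the ~20 molecules stated above, and a fully specified 20-base feature corresponds to the 40-bit limit.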
The reaction was incubated for 15 minutes at +37°C followed by an additional 15 minutes at room temperature in 96-well microwell plates (4-titude, USA), after which the reaction was combined with 50 μl of 1:50 diluted paramagnetic HIS-tag beads (His Mag Sepharose excel, GE-Healthcare) that had been blocked and equilibrated into the binding buffer supplemented with 0.1% Tween 20 and 0.1μg/μl of BSA (Molecular Biology Grade, NEB). Protein-RNA complexes were then incubated with the magnetic beads on a shaker for a further two hours, after which the unbound ligands were separated from the bound beads through washing with a Biotek 405CW plate washer fitted with a magnetic platform. After the washes, the beads were suspended in heat elution buffer (0.5 μM RT-primer, 1 mM EDTA and 0.1% Tween20 in 10 mM Tris-Cl buffer, pH 7) and heated for 5 minutes at 70°C followed by cooling on ice to denature the proteins and anneal the reverse transcription primer to the recovered RNA library, followed by reverse transcription and PCR amplification of the ligands. The efficiency of the selection process was evaluated by running a qPCR reaction in parallel with the standard PCR reaction.
Sequencing and generation of motifs
PCR products from RNA libraries (indexed by bar-codes) were pooled together, purified using a PCR-purification kit (Qiagen) and sequenced using Illumina HiSeq 2000 (55 bp single reads). Data was de-multiplexed, and initial data analysis performed using the Autoseed algorithm (Nitta et al., 2015) that was further adapted to RNA analysis by taking into account only the transcribed strand and designating uracil rather than thymine. Autoseed identifies gapped and ungapped kmers that represent local maximal counts relative to similar sequences within their Huddinge neighborhood (Nitta et al., 2015). It then generates a draft motif using each such kmer as a seed. This initial set of motifs is then refined manually to identify the final seeds (Table S2), to remove artifacts due to selection bottlenecks and common “aptamer” motifs that are enriched by the HTR-SELEX process itself, and motifs that are very similar to each other. To assess initial data, we compared the deduced motifs to known motifs, to replicate experiments and experiments performed with paralogous proteins. Individual results that were not supported by replicate or prior experimental data were deemed inconclusive and were not included in the final dataset. Draft models were manually curated (AJ, JT, QM, TRH) to identify successful experiments, and final models were generated using the seeds indicated in Table S2.
Autoseed detected more than one seed for many RBPs. Up to four seeds were used to generate a maximum of two unstructured and two structured motifs. Of these, the motif with the largest number of seed matches using the multinomial setting indicated in Table S2 was designated the primary motif. The motif with the second largest number of matches was designated the secondary motif. The counts of the motifs represent the prevalence of the corresponding motifs in the sequence pool (Table S2). Only these primary and secondary motifs were included in further analyses. Additional motifs that were excluded in this way are shown for LARP6 in Fig. S9.
To find RBPs that bind to dimeric motifs, we visually examined the PWMs to find a direct repeat pattern of three or more base positions, with or without a gap between the repeats (see Table S2). The presence of such a repetitive pattern could be due either to dimeric binding, or to the presence of two RBDs that bind to similar sequences in the same protein.
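The repeats in this study were identified by visual inspection. For readers who want a quick automated pre-screen, the following sketch searches a consensus string for exact direct repeats of at least three bases separated by a short gap. It is a substitute technique, not the procedure used here; the function name, the exact-match criterion, and the maximum gap are assumptions.

```python
def find_direct_repeats(consensus, min_len=3, max_gap=10):
    """Return (unit, first_start, second_start, gap) for exact direct repeats."""
    hits = []
    n = len(consensus)
    for length in range(min_len, n // 2 + 1):
        for i in range(n - 2 * length + 1):
            unit = consensus[i:i + length]
            # The second copy starts after the first, at most max_gap bases later.
            last_start = min(i + length + max_gap, n - length)
            for j in range(i + length, last_start + 1):
                if consensus[j:j + length] == unit:
                    hits.append((unit, i, j, j - i - length))
    return hits

# Hypothetical consensus with two UGUA half-sites separated by two bases.
hits = find_direct_repeats("UGUANNUGUA")
```

A real motif would need a tolerant comparison of PWM columns rather than exact string matching, which is one reason visual inspection remains useful.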
To identify structured motifs, we visually investigated the correlation diagrams for each seed to find motifs that displayed the diagonal pattern evident in Figure 2B. The plots display effect size and maximal sampling error, and show the deviation of nucleotide pair distribution from what is expected from the distribution of the individual nucleotides. For each structured motif, SLM models (Table S3) were built from sequences matching the indicated seeds; a multinomial 2 setting was used to prevent the paired bases from influencing each other. Specifically, when the number of occurrence of each pair of bases was counted at the base-paired positions, neither of the paired bases was used to identify the sequences that were analyzed. The SLMs were visualized either as the T-shaped logo (Figure 3) or as a PWM type logo where the bases that constitute the stem were shaded based on the total fraction of A:U, G:C and G:U base pairs.
For analysis of RNA structure in Figures 2 and S5, sequences matching the regular expression NNNNCAGU[17N]AGGCNNN or sequences of the three human collagen gene transcripts (from the 5’ untranslated region and the beginning of the coding sequence; the start codon is marked with bold typeface: COL1A1 - CCACAAAGAGUCUACAUGUCUAGGGUCUAG-ACAUGUUCAGCUUUGUGG; COL1A2 - CACAAGGAGUCUGCAUGUCUAAGUGCUAGA-CAUGCUCAGCUUUGUG and COL3A1 - CCACAAAGAGUCUACAUGGGUCAUGUUCAG-CUUUGUGG) were analysed using “RNAstructure” software (Mathews, 2014) through the web-interface at http://rna.urmc.rochester.edu/RNAstructureWeb/Servers/Fold/Fold.html using default settings. All structures are based on the program’s minimum energy structure prediction. For analysis in Figure 3, we extracted all sequences that matched the binding sequences of MKRN1 and ZRANB2 (GUAAAKUGUAG and NNNGGUAAGGUNN, respectively; N denotes a weakly specified base) flanked with ten bases on both sides from cycle four of HTR-SELEX. Subsequently, we predicted their secondary structures using the program RNAfold (Vienna RNA package; (Lorenz et al., 2011)) followed by counting the predicted secondary structure at each base position in the best reported model for each sequence. For both RBPs, the most common secondary structure for the bases within the defined part of the consensus (GUAAAKUGUAG and GGUAAGGU) was the fully single stranded state (82% and 30% of all predicted structures, respectively). To estimate the secondary structure at the flanks, the number of paired bases formed between the two flanks was identified for each sequence. The fractions of sequences with a specific number of paired bases are shown in Figure 3.
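Counting base pairs formed between the two flanks can be sketched as below, starting from a predicted structure in dot-bracket notation (the format reported by RNAfold). The function names and the example structures are assumptions for illustration; the sketch takes the structure string as given and does not call RNAfold itself.

```python
def pair_table(structure):
    """Map each paired index to its partner for a dot-bracket string."""
    stack, partner = [], {}
    for idx, ch in enumerate(structure):
        if ch == "(":
            stack.append(idx)
        elif ch == ")":
            left = stack.pop()
            partner[left] = idx
            partner[idx] = left
    return partner

def flank_pairs(structure, flank=10):
    """Count base pairs that join the 5' flank to the 3' flank."""
    n = len(structure)
    partner = pair_table(structure)
    # A pair is counted when its 5' base lies in the first `flank` positions
    # and its 3' base lies in the last `flank` positions.
    return sum(1 for i in range(flank) if partner.get(i, -1) >= n - flank)

# Hypothetical 31-base read: 10-base flanks around an 11-base motif,
# with six pairs formed between the flanks.
example = "(" * 6 + "." * 19 + ")" * 6
n_pairs = flank_pairs(example)
```

Pairs formed within a single flank, or between a flank and the motif, are deliberately not counted, since only pairing between the two flanks indicates a stem that closes around the motif.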
Motif mapping
To gain insight into the function of the RBPs, we mapped each motif to the whole human genome (hg38). We applied different strategies for the linear and the stem-loop motifs. For the linear motifs, we identified the motif matches with MOODS (Korhonen et al., 2017) with the following parameter setting: --best-hits 300000 --no-snps. For the stem-loop motifs, we implemented a novel method to score sequences against the SLMs. The source code is available on GitHub: https://github.com/zhjilin/rmap.
We identified the 300,000 best-scoring matches in the genome, and further included any matches that had the same score as the lowest-scoring of these, leading to at least 300,000 matches for each motif. The matches were then intersected with annotated features from the ENSEMBL database (hg38, version 91), including the splice donor (DONOR), the splice acceptor (ACCEPTOR), the translation start codon (STARTcodon), the translation stop codon (STOPcodon) and the transcription start site (TSS). These features were filtered to remove short introns (<50 bp) and features with a non-intact or non-canonical start or stop codon. The filtered features were then extended by 1 kb both upstream and downstream, so that each feature lies at the centre of its interval. The motif matches overlapping the features were counted using BEDTOOLS (version 2.15.0) and normalized by the total number of genomic matches for the corresponding motif.
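The tie-inclusive cutoff and the normalization step can be illustrated with a short Python sketch. The function names and the toy scores are hypothetical; in the actual analysis the scores come from MOODS or the SLM scorer and the overlaps from BEDTOOLS.

```python
def top_hits_with_ties(scores, n):
    """Return the n best scores plus any further scores tied with the n-th best."""
    if n <= 0 or not scores:
        return []
    ranked = sorted(scores, reverse=True)
    if len(ranked) <= n:
        return ranked
    cutoff = ranked[n - 1]  # score of the lowest-ranked match among the top n
    return [s for s in ranked if s >= cutoff]


def normalized_overlap(n_overlapping, n_total_matches):
    """Matches overlapping a feature class, as a fraction of all genomic matches."""
    if n_total_matches <= 0:
        raise ValueError("total number of matches must be positive")
    return n_overlapping / n_total_matches


# Toy example: asking for the 2 best hits returns 4 because of ties at the cutoff.
hits = top_hits_with_ties([5.0, 3.0, 3.0, 3.0, 1.0], 2)
```

Including ties means the number of retained matches can exceed the requested number, which is why the normalization divides by the actual number of matches for each motif rather than by a fixed 300,000.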
Motif comparisons and GO analysis
To assess the similarity between publicly available motifs and our HTR-SELEX data, we aligned the motifs as described in (Jolma et al., 2015) (Figure S1). To determine whether RBPs with similar RBDs recognize and bind to similar targets, we compared the sequences of the RBDs and their motifs. First, the RBPs were classified based on the type and number of RBDs. For each class, we then extracted the amino-acid sequence of the RBPs starting from the first amino acid of the first RBD and ending at the last amino acid of the last RBD. We also confirmed the annotation of the RBDs by querying each amino-acid sequence against the SMART database, and annotated the exact coordinates of the domains through the web tools http://smart.embl-heidelberg.de and http://smart.embl-heidelberg.de/smart/batch.pl. Sequence similarities and trees were built using PRANK (Loytynoja and Goldman, 2005) (parameters: -d, -o, -showtree). The structure of the tree representing the similarity of the domain sequences was visualized using R (version 3.3.1).
For identification of classes of transcripts that are enriched in motif matches for each RBP, we extracted the top 100 transcripts according to the score density of each RBP motif. These 100 transcripts were compared to the whole transcriptome to conduct the GO enrichment analysis for each motif using the R package ClusterProfiler (version 3.0.5).
To analyze conservation of motif matches, sites recognized by each motif were searched for on both strands of 100 bp windows centered at the features of interest (acceptor and donor sites) using the MOODS program (version 1.0.2.1). For each motif and feature type, the 1000 highest-affinity sites were selected for further analysis regardless of the matching strand. Whether the evolutionary conservation of the high-affinity sites was explained by the motifs was tested using the program SiPhy (version 0.5, task 16, seedMinScore 0) and multiz100way multiple alignments of 99 vertebrate species to human (downloaded from the UCSC genome browser, version hg38). A site was marked as conserved according to the motif if its SiPhy score was positive, meaning that the aligned bases at the site were better explained by the motif than by a neutral evolutionary model (hg38.phastCons100way.mod, obtained from the UCSC genome browser). Two motifs were excluded from the analysis because the number of high-affinity sites that could be evaluated by SiPhy was too small. The hypothesis that motif sites in the sense strand were more likely to be conserved than sites in the antisense strand was tested against the null hypothesis of no association between site strand and conservation using Fisher’s exact test (one-sided). The P values given by the tests for individual motifs were corrected for multiple testing using Holm’s method.
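For readers who want to reproduce the statistical step, the following Python sketch shows a one-sided Fisher's exact test on a 2x2 table and Holm's step-down correction, using only the standard library. It is not the authors' code; the table layout (rows: sense and antisense strand; columns: conserved and not conserved) and the function names are assumptions made for illustration.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null with fixed margins.
    Assumed layout: a = sense and conserved, b = sense and not conserved,
    c = antisense and conserved, d = antisense and not conserved.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    total = comb(n, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / total


def holm_adjust(p_values):
    """Holm's step-down adjustment; returns adjusted P values in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted
```

In use, one P value per motif would be computed with `fisher_one_sided` and the whole list passed to `holm_adjust`, so that the correction accounts for the number of motifs tested.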
Mutual information calculation
The global pattern of motifs across the features tested was analyzed by calculating the mutual information (MI) between 3-mer distributions at two non-overlapping positions of the aligned RNA sequences. MI can be used for such analysis because, if a binding event contacts two contiguous or spaced 3-bp wide positions of the sequences at the same time, the 3-mer distributions at these two positions will be correlated. Such a biased joint distribution can then be detected as an increase in MI between the positions.
Specifically, MI between two non-overlapping positions (pos1, pos2) was estimated using the observed frequencies of a 3-mer pair (3+3-mer), and of its constituent 3-mers at both positions:
$$\mathrm{MI}(pos1, pos2) = \sum_{\text{3+3-mer}} P(\text{3+3-mer}) \, \log \frac{P(\text{3+3-mer})}{P_{pos1}(\text{3-mer}) \, P_{pos2}(\text{3-mer})}$$
where P(3+3-mer) is the observed probability of the 3-mer pair (i.e. a gapped or ungapped 6-mer). Ppos1(3-mer) and Ppos2(3-mer), respectively, are the marginal probabilities of the constituent 3-mers at position 1 and position 2. The sum is over all possible 3-mer pairs.
To focus on RBPs that specifically bind to a few closely related sequences, such as RBPs with well-defined motifs, it is possible to filter out most non-specific background binding (e.g., selection on the shape of the RNA backbone) by restricting the MI calculation to the most enriched 3-mer pairs for each pair of non-overlapping positions.
Such enriched 3-mer pair based mutual information (E-MI) is calculated by summing MI over the 10 most enriched 3-mer pairs.
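A minimal Python sketch of the MI and E-MI calculations is given below. It sums, over observed 3-mer pairs, the joint probability times the logarithm of the ratio between the joint probability and the product of the two marginal probabilities. Two points are assumptions rather than statements from the text: the logarithm is taken in base 2 (so MI is in bits), and "most enriched" pairs are ranked by their observed joint count. The function names are hypothetical.

```python
from collections import Counter
from math import log2


def kmer_pair_counts(seqs, pos1, pos2, k=3):
    """Count k-mer pairs starting at two non-overlapping positions."""
    if abs(pos1 - pos2) < k:
        raise ValueError("positions must not overlap")
    pairs = Counter()
    for s in seqs:
        if max(pos1, pos2) + k <= len(s):
            pairs[(s[pos1:pos1 + k], s[pos2:pos2 + k])] += 1
    return pairs


def mi_terms(pairs):
    """Per-pair contributions P(pair) * log2(P(pair) / (P1 * P2))."""
    total = sum(pairs.values())
    left, right = Counter(), Counter()
    for (a, b), count in pairs.items():
        left[a] += count   # marginal counts at position 1
        right[b] += count  # marginal counts at position 2
    terms = {}
    for (a, b), count in pairs.items():
        p_ab = count / total
        terms[(a, b)] = p_ab * log2(p_ab / ((left[a] / total) * (right[b] / total)))
    return terms


def mutual_information(pairs):
    """MI summed over all observed k-mer pairs (unobserved pairs contribute 0)."""
    return sum(mi_terms(pairs).values())


def enriched_mi(pairs, top=10):
    """E-MI: sum of MI terms over the `top` pairs.

    Assumption: pairs are ranked by observed joint count.
    """
    terms = mi_terms(pairs)
    best = sorted(pairs, key=pairs.get, reverse=True)[:top]
    return sum(terms[p] for p in best)
```

For perfectly coupled positions, for example `kmer_pair_counts(["AAACCC", "AAACCC", "GGGUUU", "GGGUUU"], 0, 3)`, each of the two observed pairs contributes 0.5 bits, whereas fully independent positions give an MI of zero.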