Abundant capped RNAs are derived from mRNA cleavage at 3’UTR G-Quadruplexes

The 3’ untranslated region (3’UTR) plays a crucial role in determining mRNA stability, localisation, translation and degradation. Cap analysis gene expression (CAGE), a method for the detection of capped 5’ ends of mRNAs, additionally reveals a large number of apparently 5’ capped RNAs derived from 3’UTRs. Here we provide the first direct evidence that these 3’UTR-derived RNAs are indeed capped and often more abundant than the corresponding full-length mRNAs. By using a combination of AGO2 enhanced individual nucleotide resolution UV crosslinking and immunoprecipitation (eiCLIP) and CAGE following siRNA knockdowns, we find that these 3’UTR-derived RNAs likely originate from AGO2-mediated cleavage, and most often occur at locations with potential to form RNA-G-quadruplexes and are enriched by RNA-binding protein UPF1. High-resolution imaging and long-read sequencing analysis validates several 3’UTR-derived RNAs, demonstrates their abundance and shows that they tend not to co-localise with the parental mRNAs. We also find that production of 3’UTR-derived RNA could explain the previously reported role of a 3’UTR G-quadruplex in regulating the production of APP protein. Taken together, we provide new insights into the origin and abundance of 3’UTR-derived RNAs, show the utility of CAGE-seq for their quantitative detection, and provide a rich dataset for exploring new biology of a poorly understood new class of RNAs.


Introduction
In all eukaryotes, mRNA molecules contain an evolutionarily conserved m7G cap (N7-methylated guanosine), which gets added at the 5' end of nascent transcripts.
Co-transcriptional capping is the first modification made to nascent RNA in the nucleus, which protects it from exonuclease cleavage while promoting cap-related biological functions such as pre-mRNA splicing, polyadenylation and nuclear export 1 . In addition to the co-transcriptional capping of nascent mRNAs, there is evidence for a post-transcriptional capping mechanism, which adds cap to newly exposed 5' ends of RNA fragments created upon endonucleolytic cleavage [2][3][4] . However, little is known about the extent and biological roles of such post-transcriptional capping, and of its relation to other post-transcriptional RNA processing mechanisms.
Cap analysis of gene expression and deep-sequencing (CAGE-seq) was originally designed to precisely determine transcription start site (TSS) positions by capturing and sequencing 5' ends of capped mRNA transcripts, and it can also be used to measure gene expression 5 .
However, several studies detected an unexpected, reproducible, and so-far unexplained enrichment (~10-15%) of CAGE signal, or significant enrichment of RNA-seq reads, in the untranslated region within the 3'UTR, far away from the usual TSS [6][7][8][9][10][11][12][13][14] . Previous studies have shown an absence of active promoter marks (i.e. no enrichment of modified histones or RNAPII) around these 3'UTR signals 6,8,15 , arguing against the possibility that they are unannotated transcription start sites. Moreover, their expression profiles have been reported to be separated from the associated protein-coding sequence in a subcellular specific manner, with expression changes detected in several 3'UTRs during differentiation stages in mouse embryos 15 . In addition, specific isolated 3'UTRs have been implicated in a growing number of physiological and pathological processes 6,16,17 . Some processed and capped 3'UTRs have been reported to play important roles in regulating protein expression in trans, similar to long non-coding RNAs 6,8,9 . It has been suggested that some 3'UTR CAGE signal arises as a consequence of post-transcriptional cleavage followed by capping mechanism [2][3][4]14 rather than through conventional transcription initiation, and that this can lead to 3'UTR-derived RNAs that have been referred to as 3'UTR-associated RNAs (uaRNAs). To avoid potential misunderstandings we refer to these as 3'UTR-derived RNAs, as these newly generated RNAs are not known to be physically associated with 3'UTRs.
Here, we thoroughly examined the presence of these 3'UTR-derived RNAs across the transcriptome, and the molecular basis of their generation and regulation. We perform a genome-wide identification of 3' UTR-derived RNAs based on their capped 5' ends, and proceed to investigate the mechanisms involved in their formation. To this end, we combine CAGE, RNA-seq and cross-linking immunoprecipitation (CLIP)-based techniques from ENCODE and FANTOM consortia to detect 3'UTR-derived RNAs genome-wide. We show that those RNAs have biochemical properties expected of 5' capped RNAs, that may originate by endonucleolytic cleavage of the host mRNA, and that they are often as abundant, or more so, as the protein-coding part of the host transcript. We support this by showing that the apparent cleavage sites near the 5' ends of these 3'UTR-derived RNAs are bound by UPF1 and AGO2, and also have a tendency to form at RNA-G-quadruplexes.
Moreover, some of those abundant 3' UTR-derived RNAs show markedly different subcellular localisation than their protein-coding counterparts. Finally, we show that equivalent 3' UTR capped RNAs can result from siRNA-mediated cleavage of RNA.

CAGE-seq identifies non-promoter associated capped 3'UTR-derived RNAs
Ourselves and others [6][7][8][9][10][11] have previously reported the presence of CAGE-seq signals outside of annotated promoter regions in thousands of protein-coding genes, albeit their origin or biological relevance had not been interrogated. Here we first confirmed the abundance of these signals in human cell lines using CAGE data provided by the ENCODE consortium. As expected, we could detect a similar proportion of CAGE signals per genomic region in two human cell lines, as well as show that the CAGE signal is highly reproducible across replicates. This included in proportion, library size, position and distribution of the CAGE tags ( Figure 1A, S1A,B,C). A similar ratio was also detected by other groups before, using the same protocol 18 .
The relative intensities of CAGE signal detected at different elements of the structural gene depend on the priming method for reverse transcription (oligo-dT, random hexamers, or mixtures thereof in different ratios) 11 . Oligo-dT priming quantitatively favours shorter transcripts, while the reverse is true for random priming. We subsequently verified that 3'UTR CAGE signal is optimally detected when a combination of Oligo-dT and random primers is used, with the optimal inclusion ratio of 1 to 4 ratio of Oligo-dT to random primers 18,19 ( Figure S1D). Notably, the same ratio was used in ENCODE CAGE samples analysed in this study.
The CAGE signal is the strongest at 5'UTRs of known protein-coding genes 18 (Figure 1A, 65% of promoter signal). While low-level non-promoter CAGE signal (sometimes referred to as "exon painting"), can be detected along the entire length of transcripts, the signal at 3' UTRs is consistently present and occurs in localised clusters, like at promoters (see Figure   S1I for examples). We focused on the 3'UTR region, since a substantial amount (~11%) of total CAGE reads map there ( Figure 1A), and the significance of this is unknown. To identify robust CAGE signals with sufficient sensitivity, we used a 20nt window requiring at least two 5' reads overlapping from two different replicates for each cell line separately. This revealed 32,065 unique 3'UTR CAGE clusters across all samples (Table 1). Moreover, these 3'UTR CAGE clusters showed high reproducibility, suggesting biological relevance; there was~0.9 correlation between replicates, and 0.98 between HeLa and K562 samples ( Figure S1E). The latter correlation is higher than that for the 5' UTR CAGE signal between the two cell types (0.79 correlation and Figure S1F). Together these analyses show that the transcripts whose 5' end map to 3'UTR ends of protein-coding genes are abundant and reproducible across cell types, and that CAGE is a robust method for their quantitative detection.

3'UTR-derived RNAs are confirmed by RNA-seq, qPCR and long-read CAGE
We next wanted to investigate whether the 3'UTR CAGE signals originate from post-transcriptionally capped RNA fragments. First, we asked if there is support for the ends of the corresponding cleavage fragments sites in transcriptomic data produced by independent methods. For this we compared the CAGE signal with the RNA-seq signal of two different cell lines. To categorise CAGE peaks we first used the paraclu 20 peak caller to identify clusters of 5' ends of capped RNAs, and within each cluster we selected the highest signal as dominant CAGE peak position. For comparison, we processed paired-end RNA-seq data from the same K562 and HeLa cell lines, then plotted read-starts and read-ends relative to the dominant 3'UTR CAGE peak per transcript ( Figure 1B -in blue, and S1G). Both RNA-seq samples showed highly reproducible enrichments of read ends coinciding with dominant 3'UTR CAGE peaks. This reveals that the 3'UTR CAGE peaks are confirmed by the read-ends from RNA-seq data, which suggests that the signal could be originating from post-transcriptional cleavage sites. Notably, there is also a small enrichment of RNA-seq read-starts downstream from the 3'UTR CAGE peaks, which could represent the same RNA fragments detectable by the CAGE samples ( Figure 1B in yellow). More importantly, these findings demonstrate that 3'UTR capped fragments identified by CAGE can also be detected by other, methodologically independent, high-throughput sequencing methods such as RNA-seq.
We next aimed to confirm the presence of transcripts initiating at the non-promoter 3'UTR CAGE peaks by an alternative experimental approach, not dependent on RNA library creation or high-throughput sequencing. We focussed on two genes, CDKN1B and JPT2, which showed a single strong 3'UTR CAGE peak and highly reproducible read coverage for CAGE and RNA-seq in both K562 and HeLa cells ( Figure S1J). Two separate sets of primers were designed upstream and downstream the 3'UTR CAGE peak (see Methods) to quantify transcripts containing these regions. In agreement with CAGE and RNA-seq data ( Figure S1J), RT-qPCR detected higher levels of these transcripts with the 3' downstream primers ( Figure S1H), suggesting an accumulation of abundant 3'UTR fragments.
Treatment of the samples with TerminatorTM 5'-Phosphate-Dependent Exonuclease (TEX), an enzyme capable of degrading uncapped RNAs, had none or little effect in the amount of JPT2 and CDKN1B transcript detected either side of the 3'UTR CAGE peak within these cells. This was in sharp contrast with the known uncapped 3' fragment of SLC38A2 mRNA, previously described by Malka et al. 7 , which was, as expected, sharply reduced upon TEX treatment ( Figure 1C), thus further demonstrating that our studied 3'UTR fragments are capped.
We further confirmed that 3'UTR-derived RNAs can be detected by long-read Nanopore-sequencing CAGE. We were provided with such data from 10 genes in iPSC, neuron stem cell (NSC) and Cortical Neuron samples by the FANTOM6 consortium that contain HeLa and K562 3'UTR CAGE peaks ( Figure S1K). In all of the 10 examples, the full length read sequencing CAGE identified highly abundant reads spanning from the start of our identified CAGE 3'UTR peaks till the end of the annotated transcripts ( Figure S1K). This further confirms that the 3'UTR-derived RNAs originate from the full length mRNA.
Altogether, these analyses confirm the presence of abundant, capped 3'UTR-derived RNAs that originate from cleavage of the full-length mRNAs.

Capped 3'UTR-derived RNAs are predominantly cytoplasmic
Next, we asked if there is evidence of nuclear Cap Binding Complex (CBC) binding to the capped 5' ends of 3'UTR fragments, as it is known to bind to 5' ends of nascent protein-coding mRNA transcripts in the nucleus. Individual-nucleotide resolution UV crosslinking and immunoprecipitation (iCLIP) is a method that identifies protein-RNA crosslinking interactions with nucleotide resolution in a transcriptome-wide manner. We examined CBC-iCLIP data from HeLa cells, where the authors targeted nuclear cap-binding subunit CBP20 protein 21 . CBP20 is a nuclear component of cap-binding complex (CBC), which binds co-transcriptionally to the 5' cap of pre-mRNAs and interacts directly with the m7-G cap 22,23 . The CBP20 RNA binding data was analysed using standard iCLIP processing pipeline, where the nucleotide preceding cDNA-start position after PCR duplicate removal is reported as the crosslinking position (see Methods). The CBP20 crosslinking positions were then summarised across all dominant 5'UTR and 3'UTR CAGE peaks per transcript. As expected, CBP20 crosslinks were enriched around the dominant 5'UTR CAGE peaks where the TSS of full-length transcripts is positioned. However, the enrichment was very weak at the non-promoter 3'UTR CAGE peaks ( Figure S2A). This strongly suggests that the 3'UTR capped fragments identified by CAGE are not part of nuclear CBC, but are likely a product of an independent post-transcriptional processing pathway.
Additionally, we analysed capCLIP data from HeLa cells. capCLIP is a version of CLIP that targets translation elongation factor eIF4E, a cytoplasmic protein which binds the 7-methyl-GTP moiety of the 5′-cap structure of RNAs for the efficient translation of almost all mRNAs 24,25 . The capCLIP data was analysed in the same way as CBP20-iCLIP. The enrichment of capCLIP signal at the non-promoter 3'UTR CAGE peaks was much stronger than in the CBC-iCLIP ( Figure S2A, S2B), which suggests that the cap of the 3'UTR-derived RNAs is primarily bound by the cytoplasmic eIF4E, rather than the nuclear cap binding protein CBP20, suggesting that these RNAs are predominantly cytoplasmic.

5' ends of 3'UTR-derived RNAs are enriched for G-rich motifs and strong secondary structures
Next, we wished to understand the sequence features that distinguish the CAGE peaks corresponding to co-transcriptional capping of transcription start sites from those originating from post-transcriptional capping of 3'UTR-derived RNAs. We first explored the possibility that 3'UTR fragments might be a side-product of nuclear polyadenylation and associated endonucleolytic cleavage. In this case, the identified 3'UTR CAGE peaks should be preceded by enrichment of the canonical polyA A[A/U]UAAA hexamers, which recruit the nuclear polyadenylation machinery. However, we only found such enrichment at the annotated 3'UTR ends, and not upstream of the 3'UTR CAGE peaks ( Figure S2C). We observed a notable enrichment downstream from the 3'UTR CAGE peaks ( Figure S2C -red line), most likely because some of the 3'UTR-derived RNAs are relatively short and their 5' ends are close to the annotated 3'UTR ends.
Next, we examined whether there was any other distinguishing difference between the two types of CAGE peaks. Consistent with previous studies 8, 12 , we detected a strong G-enrichment around the 5' end of the CAGE reads present in non-promoter regions ( Figure   2A, S2D), distinct from YR dinucleotide which is a feature of initiator signal at 5' ends of genes. More surprisingly, CAGE peaks within the 3'UTR region showed a strong increase in internal pairing probability (see Methods: Secondary structure) in comparison to other regional groups ( Figure 2B, S2E), suggesting structural preference is important for 3'UTR-derived RNAs. Notably, the surrounding region of CAGE peaks in 5'UTRs is more structured (light blue line in Figure 2B, S2E), which could be explained by the higher GC content that is present around all 5'UTRs in vertebrates 26 , with a distinctive drop at -25 bps coinciding with the canonical TATA box position.  28 . To further explore the RNA G-quadruplexes formation profile, we integrated RNA-G-quadruplex sequencing (rG4-seq) data from HeLa cells 29 and ran G4-Hunter predictions 30 around CAGE peaks.

3'UTR CAGE peaks coincide with RNA-G-quadruplexes and heavily structured regions
Indeed, both the rG4-seq data (HeLa) and G4-Hunter predictions (K562) showed the highest G4s enrichment in the 3'UTR region relative to CAGE peak ( Figure 2D, S2G, 2E) with the highest number of G4s present in 3'UTRs ( Figure S2H). Moreover, in 8 out of 10 examples with dominant 3'UTR CAGE peaks across multiple samples we identified rG4-seq clusters coinciding with 3'UTR CAGE peaks ( Figure S1K).
Interestingly, beside the G4 preference, the top 3'UTR CAGE overlapping peak in both HeLa and K562 cell lines overlaps MALAT1-associated small cytoplasmic RNA (mascRNA, Figure   S1E), which is extremely abundant, widely conserved among mammals, and is known to be upregulated in cancer cell lines 31 . Notably it forms a triple helix structure at its 3′ end that makes it more stable from the rest of the ncRNAs 32 . Overall, these results suggest that strong structures around 3'UTR CAGE peaks, including RNA-G4s, could play an important role in stabilising these RNA fragments. One possibility would be by making these fragments exoribonuclease-resistant 33 or by causing XRN1 to stall during 5'-3' degradation 34 .

3'UTR cleavage sites are flanked by enriched UPF1 binding
Based on the evidence outlined, we hypothesised that the capped 3'UTR-derived RNAs are formed post-transcriptionally. On that assumption, we aimed to determine whether specific RNA-binding proteins (RBPs) were involved in the process. To that end, we analysed publicly available enhanced CLIP (eCLIP) data for 80 different RBPs in the K562 cell line, produced by the ENCODE consortium 35 . For each RBP, we calculated normalised cross-linking enrichment compared to other RBPs around maximum CAGE peak per annotated gene region (5'UTR, CDS, intron, 3'UTR). This identified a specific set of RBPs around CAGE peaks, with UPF1 (Regulator of nonsense transcripts 1) protein as the top candidate in 3'UTRs, and DDX3X (DEAD-Box Helicase 3 X-Linked) in 5'UTRs ( Figure 3A, S3A). No specific RBP enrichment was detected in CDS and intronic regions. The DDX3X enrichment around 5'UTR CAGE peaks was no surprise since it is known to be involved in transcriptional process by interacting with transcription factors, in pre-mRNA-splicing by interacting with Spliceosomal B Complexes, and in RNA export by interacting with Cap-Binding-Complex (CBC) 36 . UPF1 is a known factor of the Nonsense-Mediated Decay (NMD) pathway, where stalled UPF1 at CUG and GC-rich motifs activates its mRNA decay 37,38 .
Interestingly, the crosslinking enrichment of UPF1 around 3'UTR CAGE peaks is positioned just upstream from the peaks, followed by depletion downstream ( Figure 3B). Moreover, there was a high correlation between the 3'UTR CAGE signal and UPF1 binding, which was not correlated with gene expression or 3'UTR length ( Figure S3B), indicating that the 3'UTR CAGE signal could result from the post-transcriptional capping of NMD-mediated RNA degradation by-products 8,13 . This correlation could also be related to G-enrichment since 3'UTRs with UPF1 bindings are prone to having higher than average G content 39 . More specifically, the strength of UPF1 binding coincides with the strength of the 3'UTR CAGE peaks and proximity to the peaks ( Figure S3C), which suggest that the precise binding position of UPF1 relative to the cleavage/capping position could be important for the formation of these fragments. All in all, we find that 3'UTR-derived RNAs are not a simple by-product of high mRNA decay, but also find that the NMD factor UPF1 might regulate their generation.

mRNA cleavage by small interfering RNAs generates newly capped RNA fragments
An alternative way in which mRNAs can be cleaved post-transcriptionally is through RNA interference (RNAi). Indeed, a common way to artificially accomplish gene silencing is to utilise small interfering RNAs (siRNAs) to induce endonucleolytic degradation of the target transcripts 40,41 . SiRNAs are usually 21-23 nt long and their sequence corresponds to an antisense mRNA target sequence. Silencing by siRNAs is mediated thanks to the RNAse III catalytic activity of Argonaute 2 (AGO2), a subunit of the RNA-induced gene-silencing complex (RISC) in the cytoplasm.
We hypothesised that siRNA silencing following AGO2 cleavage could lead to cytoplasmic capping of the cleaved RNA fragments instead of degradation. To follow up this hypothesis, we first investigated if CAGE could detect cleaved RNA fragments guided by siRNA. We analysed CAGE siRNA KDs from FANTOM5 dataset 42 , for which we collected 28 samples with siRNA targeting sequences (20 siRNAs designed by ThermoFisher and 8 by the study authors) and 5 control samples. Surprisingly, in 20 of the 28 samples we detected CAGE 5' end signal corresponding to the exact siRNA complementary genomic sequence in at least two replicates ( Figure 3C).
The strongest enrichment of CAGE signal relative to the siRNA target start site was detected in the ISL1-KD sample, supported by all 3 biological replicates, and with no signal detected in control samples ( Figure 3D,E, S3D). More interestingly, the dominant CAGE 5' end signal was present in the middle of the siRNA target sequence ( Figure 3E, S3D), where the AGO2 cleavage is known to take place 43,44 . Also, the TSS CAGE signal in 5'UTR of the corresponding protein-coding gene dropped by~75% compared to the control samples in all 3 replicates ( Figure 3D,E), confirming that the KD of ISL1 transcript was efficient. Together these results indicate that siRNA mediated recruitment of AGO2 can lead to the generation of post-transcriptionally capped RNA fragments following mRNA cleavage.

3'UTR CAGE peaks coincide with AGO2 binding and RNA-G-quadruplexes
Since the endonuclease activity of AGO2 facilitates mRNA cleavage guided by siRNAs, we investigated if AGO2 binds also at the endogenous 3'UTR CAGE peaks. There was no publicly available AGO2 binding data for either HeLa or K562 cells, so it could not be detected in the analysis in Figure 3A. For that reason, we produced 'enhanced individual nucleotide resolution'-CLIP (eiCLIP) 45 Figure 3B).
In animals, endogenous RNAi is mainly mediated by microRNAs (miRNAs). MiRNAs are also~21-23 nucleotide (nt) long RNAs, but, in contrast to siRNAs, miRNAs recruit the miRNA induced silencing complex (miRISC) containing AGO1-4 to mRNAs with partial complementarity which results in translational repression and/or exonucleolytic cleavage 40,41 . To see if there is evidence compatible with AGO2 miRNA-guided cleavage of mRNA targets, which would hence be similar to the siRNA method of action, we first mapped reverse complements of miRNA sequences to the human genome, allowing 2 mismatches, to identify putative miRNA matches in 3'UTRs which could be mediated by miRNA-dependent endonucleolytic cleavage 46,47 . We identified 29 such targets but there was no CAGE signal present around them (data not shown). Accordingly, we instead explored the binding specificity of AGO2-eiCLIP data, and performed a motif analysis using HOMER motif finder. When analysing the 15 bp flanking region around AGO2-crosslinking peaks (see Methods), one of the most prominent motifs was highly enriched in Gs ( Figure   S3I -2nd and 3rd). Notably, this also agrees with one of the first AGO2-CLIP studies performed on mouse embryonic stem cells, where the authors showed that, without the miRNA present, AGO2 binds preferentially to G-rich motifs 48 . This suggests that miRNA-directed recruitment may not be necessary for AGO2 binding at the site of cleavage that generates 3'UTR-derived RNAs.
Our results demonstrate the AGO2 binding near 3'UTR CAGE peaks. Accordingly, given our previous observation that RNA-G-quadruplexes were enriched around 3'UTR-derived RNAs, we investigated whether AGO2 could be attracted by RNA-G-Quadruplexes in general. We first aligned AGO2-eiCLIP-HeLa cross-linking positions relative to 3' end of rG4-seq-HeLa sites in different regions of primary transcripts. Similar to the 3'UTR CAGE peaks enrichment by RNA-G-Quadruplexes, AGO2 crosslink-binding sites are much more highly enriched at rG4-seq sites in the 3'UTRs relative to 5'UTRs, introns and coding sequence ( Figure S3J,K).
The mechanistic implications of the overlap between RNA-G4 structures and AGO2 binding close to the capped 3'UTR-derived RNAs remain to be experimentally interrogated.

3'UTR-derived RNAs could explain the previously reported regulation of APP protein
Additionally, we looked if there are any known RNA-G4s with regulatory features in 3'UTRs.
Notably, there has been a proposed involvement of RNA-G4 in the regulation of amyloid precursor protein (APP) in Alzheimer's disease 49,50 , where G4 motif in the 3'UTR of APP mRNA was found to suppress overproduction of APP protein, but the underlying mechanism remained unclear 51 . Analysis of rG4-seq and CAGE data from HeLa cells in APP mRNA showed that the CAGE peak coinciding precisely with the 3' end of the same G4 motif that was previously found to affect APP protein production ( Figure S1K -APP, Figure 2E).
Moreover, long-read sequencing CAGE from iPSC, neuron stem cell (NSC) and Cortical Neuron samples, confirmed that abundant 3'UTR-derived RNAs are present in all of these samples that start at the position of our identified CAGE peak and span till the end of the annotated APP gene ( Figure S1K -APP). This indicates that the RNA-G4 very likely affects the production of the 3'UTR-derived RNAs, and that this mechanism accounts for its effect on the APP protein production.

Capped 3'UTR fragments of CDKN1B and JPT2 transcripts do not co-localise with the parental mRNAs
Finally, we were interested to explore potential implications of 3'UTR derived RNAs. If a transcript is cleaved, it is possible for the two resultant RNA fragments to localise either together or independently from each other. To test this, we designed smFISH probes to simultaneously image the RNA upstream and downstream of the proposed post-transcriptional cleavage and capping site in CDKN1B and JPT2 using hybridisation chain reaction RNA-fluorescence in situ hybridization (HCR-FISH 3.0) 52 . To account for the technical biases in detection, we also designed probes against the coding sequence (hereafter upstream) and 3'UTR (hereafter downstream) of a control mRNA, PGAM1, which does not contain CAGE peaks in the 3'UTR and contained a similar 3'UTR length to our targets.
We performed HCR-FISH in HeLa cells to determine whether putative 3'UTR-derived RNAs can be found independently of the RNA upstream of the cleavage site ( Figure 4A, B). In the control transcript, PGAM1, we observed that 17.3% of upstream signals did not have a colocalising downstream signal and 21.3% of downstream signals did not have a colocalising upstream signal ( Figure 4C, S4A). Interestingly though, the mRNAs that contain a 3'UTR CAGE signature were significantly more likely to show independent signals from the RNA downstream of the proposed cleavage site (CDKN1B: 53.3%, p adj. < 0.05; JPT2: 52.3%, p adj. < 0.05; Figure 4C). In the case of JPT2, we also observed significantly more independent signals from the upstream probes (29.3%, p adj. < 0.05; Figure 4C). These observations are consistent with the existence of cleaved 3'UTR fragments in the cell, and they reveal that these products may localise differently from their host transcripts.

Discussion
We employed a combination of computational analyses of high throughput sequencing datasets from human cell lines that reveal capped 5' ends of RNAs genome-wide (CAGE) and binding sites for dozens of RNA binding proteins. As the main resource we used large publicly available datasets from consortiums such as ENCODE (Encyclopedia of DNA Elements) and FANTOM together with new computational approaches and experimental validations. We identified several factors that show strong binding enrichment at the sites where 3'UTRs are cleaved to generate the capped 3'UTR-derived RNAs. Specifically, we compared the crosslinking enrichment of several RBPs including UPF1 and AGO2, which are both highly enriched around 3'UTR-derived RNAs beside RNA-G-Quadruplexes and heavily structured regions.
Other studies suggested that these capped RNAs originate as a consequence of incomplete degradation of the mRNA during the standard processes of mRNA decay 6,15,17 , which would agree with the enrichment of RNA-seq read-starts at 3'UTR CAGE peaks ( Figure 1B).
However, it is challenging to use traditional fragmented-based sequencing methods such as RNA-seq and CAGE-seq for discovery and validation of 3'UTR-derived RNAs, because the reads of derived RNAs can not easily be distinguished from the parental mRNAs, and the only information available is the enrichment of starts of RNA-seq reads with the positions of capping detected by CAGE-seq. This could explain why most 3'UTR-derived RNAs have so far remained undetected. We now used multiple lines of evidence to complement CAGE and RNA-seq data, including long-read Nanopore sequencing, analysis of RNA structural features and RBP interactions relative to the cleavage of 3'UTR-derived RNAs, and HCR-FISH imaging. We show that the position and strength of capping is closely linked to RNA structure and RBP binding, and that the 3'UTR-derived RNAs are often abundant and do not co-localise with the parental mRNAs.

Cytoplasmic capping of the siRNA-targeting cleaved fragments
In siRNA-KD CAGE samples we noticed that certain cleaved fragments which are involved in post-transcriptional cleavage processing, such as RNAi targeting, can form capped RNA fragments ( Figure 3C,E). However, with the available data we can not quantify the efficiency of such capping, or identify all the factors that might be involved in the process.
Understanding the cytoplasmic capping of cleaved fragments and their abundance will also give important insights for understanding viral RNA capping. Since the majority of RNA capping happens in the nucleus, viruses evolved to produce efficient capped RNAs in the cytoplasm by encoding their own capping machinery, or by or taking a capped 5' fragment from the host's mRNA, also known as cap snatching 1 . Moreover, many new drugs which are based on RNAi and miRNA targeting are already in use or under active clinical trials for treating neurological or viral diseases, and in cancer treatments. Side products of the targeted mRNAs from these therapeutic drugs could still be subjected to a cytoplasmic capping mechanism and result in unwanted toxic side effects.

The role of UPF1, AGO2 and RNA-G-Quadruplexes in capping 3'UTR-derived RNAs
On average 3'UTRs are shorter in cancer cells to evade miRNA-mediated repression 53 but we could not see any correlation between 3' length and intensity of the CAGE signal ( Figure   S3B). It is known that UPF1 binds to GC-rich motifs in 3'UTRs 37 , but it is still not known what the main trigger of UPF1-mediated mRNA decay is. Another study found evidence that a G content enrichment in 3'UTRs plays a more important role on mRNA destabilisation by inserting UPF1 binding motifs into non-UPF1 targets 37 . This suggests that G enrichment plays a vital role in triggering UPF1-mediated mRNA decay. Meanwhile, significant overlaps in binding between UPF1 and AGO2 have been reported but with an unknown functional relationship 39 . In the same study they also discovered preferential UPF1 binding in structured G-rich regions 39 . We do not know if G enrichment is needed for the post-transcriptional capping process, but it has been shown in a previous study 48 that G enrichment could be important for AGO2 binding in the absence of miRNA guidance; it is also important for the RNA structure to form the hairpin-loop structure, which is necessary for AGO2 cleavage 54 . Another recent study demonstrated 3'UTR cleavage site in rat cervical ganglion neurons, which are also cleaved post-transcriptionally but only expressed in axons and not in cell bodies, with AGO2 and UPF1 as the top two RBP targets 55 . However, we now find that AGO2 binds to RNA-G4s in 3'UTRs ( Figure S3J,K), but their functional relation remains unknown. Those AGO2 binding sites are less likely to be guided by miRNA, since it has been shown that RNA-G4s can also prevent miRNA binding from its target sites 56 .
Moreover, RNA-G4s are known to form stable structures in vitro, but recent studies have suggested that they may be less stable in vivo due to active unwinding by RNA helicases 57,58 . However, Kharel et al. (2022) 59   First, we show that 3'UTR-derived RNAs in CDKN1B and JPT2 are capped and highly expressed in the cell using qPCR primers ( Figure 1C,S1H). Next, we produced additional HCR-FISH-probe experiments for JPT2 and CDKN1B targets, which demonstrate that 3'UTR-derived RNAs can be abundant in the cytoplasm without co-localising with parental mRNAs. In agreement with the CAGE data, the highest ratio of~2-fold of downstream vs. upstream probes is present in CDKN1B, where the 3'UTR CAGE peak is higher than the peak at the TSS ( Figure 4A,B,C, S1I -CDKN1B).
Interestingly, some cells showed a strong perinuclear accumulation of 3'UTR-probes in CDKN1B ( Figure 4B), whilst the signal was spread throughout the cytosol for most cells. An interesting possibility is a cell cycle-dependence, as it has already been observed for other aspects of regulation of p27 gene expression, including mRNA translation 60,61 . The role of these capped 3'UTR clusters in CDKN1B could also be related to cell cycle specific regulation of CDKN1B/p27kip1 (p27) protein expression. Two studies have demonstrated that rescue of splicing deficiency in CDKN1B improves protein production and leads to cell cycle arrest 62,63 . Additionally, another study suggested that high levels of 3'UTRs of NURR1 in proliferating cells could also be linked to cell cycle dynamics 16 . It would be important to investigate further the dynamic nature of these isolated 3'UTRs and their impact on cellular functions in a cell cycle dependent manner.

Methodological implications
Different abundance and localisation of 3'UTR derived RNAs relative to their parent transcript and the 5' cleavage fragment containing protein coding sequence suggests that

Declaration of interests
The authors declare no competing interests.           Tables   Table 1: This table contains genomic locations

HCR-FISH Microscopy
HeLa cells (obtained from Cell Services at the Francis Crick Institute) were grown in DMEM supplemented with 10% FBS and plated into 8-well ibidi chambers. Cells were fixed for 10 minutes at room temperature using 4% paraformaldehyde/0.4% glyoxyl diluted in PBS before permeabilization overnight at -20°C in 70% EtOH. In situ HCR v3.0 with split-initiator probes was performed as described in 52 , except amplification which used 30nM of each fluorescently labelled hairpin; cells were then stained with DAPI 1 µg/mL in PBS before mounting with Fluoromount-G tm (Thermo Fisher). Cells were imaged on a spinning disk confocal microscope (Nikon CSU-W1 Spinning Disk) using 60x oil-immersion objective. 6 non-overlapping field z-stacks of 17 slices with 0.39um z-steps were taken per well. 8 HCR probe pairs per target were designed using the HCR 3.0 Probe Maker 69 . Probes were designed for CDS and 3'UTR to be amplified by the B1 HCR-amplifier with Alexa594 or the B2 HCR-amplifier with Alexa674 (Molecular Technologies), respectively.

HCR-FISH Analysis
We z-projected the images and segmented the nuclei and cytoplasms with Cellpose (v2.0.5, 70 ) using the DAPI signal and thresholded AlexaFluor594 signal. We then detected smFISH signal positions using the Fiji plugin RS-FISH (v2.3.0, 71 ). We excluded signals that fell outside of a cell mask. For each detected signal, the minimum distance to the nearest signal in the other channel was measured. Co-localisation was defined as a minimum distance of 3 or fewer pixels. We then filtered for high confidence signals with a signal intensity in the top 50% of all signals for that channel in that image. The proportion of independent signals was calculated for each replicate, and pairwise t-tests were calculated in R using the compare_means function from the ggpubr package (https://github.com/kassambara/ggpubr/) with Benjamini-Hochberg correction.

Cell culture
K562 and HeLa cells were maintained in RPMI 1640 or DMEM medium, respectively, supplemented with 10% foetal bovine serum (FBS) and 100 U/mL penicillin/streptomycin at 37°C in 5% CO2 in a humidified incubator.

Reverse transcription (RT) and quantitative PCR (qPCR)
K562 cells were lysed in Trizol R (Thermo Fisher Scientific) and total RNA extracted as per  SLC38A2 primers were obtained from 7 .

AGO2-eiCLIP
AGO2-eiCLIP was performed as previously described 45 . In brief, this involved following a previously described non-isotopic iCLIP workflow 72  Sigma-Aldrich/ Merck). Samples of two biological replicates were sequenced with paired-end reads using NextSeq500.

Mapping and processing of AGO2-eiCLIP
Pre-processing, mapping to hg38 gene annotation and removal of PCR duplicates of AGO2-eiCLIP data and peak calling was performed by using iMAPS (https://imaps.goodwright.com/) with default settings. Processed data was downloaded from the iMAPS in BEDgraph format where each count represents crosslinking position and was used for further analysis.

miRNA analyses
For the genomic separations of crosslink positions we used GENCODE (v27 primary assembly) annotation and for the separation of transcripts with high and low miRNA targeting in HeLa cells we used 68 annotation. miRNA seed sequences were downloaded from 'TargetScan' (www.targetscan.org) database. Only miRNAs expressed in HeLa were selected from miRNA expression profile study 73 with the threshold of more than 10 reads in at least 2 replicates. The miRNA seed sequence heatmap was plotted by counting the expressed seed sequence motifs relative to the AGO2-eiCLIP dominant crosslink sites using the 'ggplot2' Bioconductor R package.

CAGE data processing
The BAM files of mapped reads were converted into BED format using the bamtobed function from bedtools package (version v2.30.0). Each 5' read position was then used for further analyses. The CAGE peaks were processed by using the Paraclu clustering tool (https://gitlab.com/mcfrith/paraclu). Default settings of minimum 5 reads filter for merged replicates was used followed by paraclu.cut.sh which removes: 1. Remove single-position clusters.
4. Remove any cluster that is contained in a larger cluster.

Single nucleotide clusters were added additionally
For each cluster the highest peak of 5' CAGE reads was used as the max peak position.

CAGE reproducibility of 3'UTR peaks
Mapped BAM samples from HeLa and K562 cell lines were converted to BED file format by For each cluster a maximum number of 5' read-ends was defined as peak, with a threshold

--outSAMattrRGline ID:foo --alignEndsType EndToEnd
Mapped paired-end reads were then converted from BAM to BED by using 'bedtools bamtobed' (version v2.30.0) function to extract both sides of each read. Read starts and read ends were then plotted as a metaplot relative to the 3'UTR CAGE peaks.

CBP20-iCLIP
The CBP20-iCLIP data was downloaded from GEO (GSE94427) and analysed using standard iCLIP processing pipeline where each read was treated as truncated read to identify crosslinking positions of protein-RNA interactions 74 . For the adapter removal we used the cutadapt tool (version 3.5) with removal of shorter reads than 18 bps.
cutadapt --match-read-wildcards --times 1 -e 0. bamtobed function followed by removal of PCR duplicates by collapsing identical reads with the same random barcode. For each read the read start position was used as the crosslinking position and was used for further analysis.

rG4-seq
The processed RNA-G-quadruplex sequencing (rG4-seq) data from HeLa cells was downloaded from the genomics data repository (GSE77282). The rG4-seq hits were then lifted from the hg19 to hg38 genome using UCSC liftOver webtool. For Figure 2E we used middle positions of each rG4-seq target normalised by the number of CAGE (HeLa) peaks from each transcriptome region.

RNA-maps of iCLIP, eCLIP, eiCLIP and RNA-seq reads start/ends
For the visualisation of all the CLIP based and RNA-seq methods we used previously developed RNA-map approach 66,74 with small addition for RNA-seq read end positions by summarising the read start positions relative to the CAGE peaks, TSSs and G-Quadruplexes.

Secondary structure
For each dominant CAGE peak we extracted a flanking region of 75 bps of the genomic sequence as an input to the RNAfold vienna package (version 2.4.17) with default settings.
Each double stranded position was then plotted as a sum of all pairings in the region.

Predictions of G-quadruplexes
To predict G-quadruplexes in the K562 and HeLa cell line we first selected CAGE peaks with a threshold of minimum 10 reads per peak in the region of 50 bps upstream and downstream from the peak. For the predictions we used sequence based prediction tool G4Hunter

Motif discovery
For AGO2 binding motif discovery we used HOMER software for motif discovery and next-gen sequencing analysis (version 4.9), with default parameters for human genome hg38 and using a 15 bps window around crosslink positions of processed AGO2-eiCLIP-HeLa samples.

Motif enrichment
For