Abstract
Accurate positional information concerning ribosomes and RNA binding proteins with respect to their transcripts is important to understand the global regulatory network underlying protein and RNA fate in living cells. Most footprinting approaches generate RNA fragments bearing a phosphate or cyclic phosphate groups at their 3′ end. Unfortunately, all current protocols for library preparation rely only on the presence of a 3′ hydroxyl group. Here, we developed circAID-p-seq, a PCR-free library preparation for 3′ phospho-RNA sequencing. We applied circAID-p-seq to ribosome profiling, which produces fragments protected by ribosomes after endonuclease digestion. CircAID-p-seq, combined with the dedicated computational pipeline circAidMe, facilitates accurate, fast, highly efficient and low-cost sequencing of phospho-RNA fragments from eukaryotic cells and tissues. While assessing circAID-p-seq to portray ribosomes engaged with transcripts, we provide a versatile tool to unravel any 3′-phospho RNA molecules.
Background
RNA molecules bearing a phosphate or cyclic phosphate group at the 3’ end (3′P/2’-3’cP) are generated by either heat fragmentation [1], ribonucleases (e.g. RNase A superfamily) [2], ribozymes [3, 4] or toxins [5, 6]. Beside endogenous 3′P/2’-3’cP RNA molecules [7–11], several biochemical methodologies designed to obtain genome wide positional information of RNA-protein interaction by RNA sequencing require an enzymatic digestion step that generates phosphorylated 3’ termini prior to library preparation. This is the case for near-nucleotide resolution RNA footprinting protocols used to understand the global regulatory network underlying protein and RNA fate in living cells. These protocols are based on (i) RNA:protein cross-linking and immunoprecipitation (CLIP-seq) [12–14] (ii) selective RNA:protein immunoprecipitation (RIP-seq) [15, 16], (iii) RNA:protein affinity purification (uvCLAP) [17], and (iv) ribosome profiling (Ribo-seq) for sequencing of RNA fragments protected by ribosomes. Commonly used enzymes for RNA footprinting include the RNase A superfamily (e.g. RNase I) [18] RNase T1 [19], and RNase T2 [20], which produce 3′ phosphate or cyclic phosphate RNA molecules. Additionally, phosphate or cyclic phosphate group at the 3’ end are also generated by heat fragmentation [21] to define RNA secondary structures. There are currently no technologies to selectively and directly sequence 3′P/2’-3’cP RNA fragments. In fact, the terminal 3′P/2’-3’cP is usually removed by enzymatic reactions before library preparation. This step is likely to result in the loss of useful data. Methods that provide insights into 3′-phospho RNA species rely on an indirect detection of these fragments by means of a periodate treatment [22] and downstream stringent bioinformatics analysis [23]. However, these approaches introduce potential sequencing biases related to 3′ de-phosphorylation and PCR amplification steps [24, 25], are time-consuming and computationally expensive.
Here, we present a one-day long and PCR-free library preparation strategy, called circAID-p-seq (CIRCular Amplification and IDentification of short 3′ Phosphate RNA SEQuences). CircAID-p-seq is uniquely characterized by the selection of 3′P/2’-3’cP terminated RNA fragments and cDNA synthesis by rolling circular amplification (RT-RCA) [26, 27], which does not require PCR amplification. In fact, the resulting cDNA is a long (> 200 nt), single-stranded molecule bearing multiple copies of a unique RNA fragment. This cDNA is suitable for direct cDNA sequencing with Oxford Nanopore Technologies (ONT) instruments.
Ribo-seq can be considered an effective case study to test circAID-p-seq, since it requires the sequencing of 25-35 nt-long ribosome-protected fragments (RPFs) produced, inter alia, by the RNase I or the Micrococcal nuclease. To evaluate the performance of our new library construction method, we compared our results to standard library preparation and sequencing approaches [28, 29]. To our knowledge, this is an original and unique method for 3′-P/cP RNA-seq library preparation, the first selective 3′-P/cP ribosome profiling, and the first Ribo-seq with the Oxford Nanopore Technologies (ONT) platform.
Results
RNA molecules with 3′P or 2’,3′cP can be accurately analysed only if the phosphate signature is preserved. To sequence 3′ phosphorylated RNAs, we have developed a new approach for library preparation (circAID-p-seq), which proceeds according to the following five steps (Figure 1a): (1) phosphorylation of the RNA 5′ end, (2) ligation of an RNA adaptor molecule to the 3′P or 2’,3′cP RNA ends [30], (3) intramolecular circularization, (4) retro-transcription rolling circular amplification (RT-RCA) [31] to produce a long concatemeric read containing tandem repeats; each repeat comprises one adaptor and one fragment of interest (called insert hereinafter), (5) second strand cDNA synthesis. Once the cDNA is produced, it can be sequenced with the ONT platform, which allows direct sequencing of the long (> 200 nt) cDNA without additional PCR amplification. After sequencing, all copies of the RNA fragment of interest in the concatemeric raw reads can be identified and processed. To this purpose, we developed circAidMe, a computational pipeline implementing identification in the concatemeric read of all RNA fragment copies, multiple sequence alignment of the fragments, and generation of a highly accurate consensus sequence of the RNA fragment for further analyses (Figure 1b). To obtain meaningful biological information, consensus sequences can then be mapped to the target genome or transcriptome for downstream analyses. The circAidMe python code is made freely available on GitHub at: https://github.com/ms-gx/CircAidMe.
To optimize our strategy, we utilized a 30 nt long synthetic RNA molecule bearing a 3′P group, called RNA30-3′P. The phosphorylation of the 5′ terminus was carried out by the T4 polynucleotide kinase (PNK 3′-minus), a step required to block self-ligation and obtain 5′P-RNA30-3′P RNA species (Figure 1a). The ligation of the 3′P terminus to an adaptor (called ADR hereinafter, with 3′- and 5′-hydroxyl groups) was performed using a 3′P ligase (Supplementary Figure S1A), producing an RNA:adaptor molecule. Intramolecular circularization of the RNA:adaptor was performed with a T4 RNA ligase to form circular RNAs. To confirm the effectiveness of the reaction and to remove remaining single stranded RNAs, the reaction mix was treated with RNase R. This exoribonuclease digests all linear RNA molecules, while preserving circular RNAs [32] (Supplementary Figure S1A). To synthesise the first cDNA strand and generate long single-stranded cDNA molecules carrying multiple copies of the insert, we performed a RT-RCA. To confirm that the multimeric cDNAs had been successfully synthesised, the product was amplified by PCR (Supplementary Figure S1B). Next, we used a Taq polymerase with 5′-3′ exonuclease activity and a primer complementary to the adaptor used in the first ligation, to synthesise the second cDNA strand. The efficiency of second strand synthesis was confirmed by the resistance of cDNA to S1 nuclease digestion, an enzyme that acts on single stranded DNA oligonucleotides but not on double stranded cDNA (Supplementary Figure S2). The library was then sequenced with a benchtop Oxford Nanopore sequencer (MinION). After base calling, the output was analysed by CircAidMe to identify the inserts and generate the consensus sequence (see Materials and Methods and Figure 1b).
To ensure a robust consensus sequence, a large number of inserts in the concatemer reads is desirable. Thus, we assessed the effect of the ADR length and sequence on the number of repeats generated. We used three adaptors, 20 nt (ADR20), 60 nt (ADR60) or 110 nt (ADR110) long and sequenced the same 30 nt synthetic RNA fragment (Figure 2a) (adaptor sequences are listed in Additional file 1). The read length distributions showed a major peak at about 550 nt for all samples, suggesting that ADR20 generated more tandem repeats than ADR60 or ADR110 (Figure 2a and 2b). Then, we investigated the impact of the number of repeats on the accuracy (i.e. the percentage of the read not altered by mismatches or indels, see Materials and Methods) of the consensus sequence. As expected, the high number of repeats obtained with ADR20 results in a more accurate consensus (Figure 2c) and in a narrower distribution of fragments length (Figure 2d). In line with these observations, to maximize the accuracy of the consensus, we used short ADRs for all further experiments.
Next, we focused our attention on optimizing the sequence composition of ADRs. We designed twelve 24 nt long ADR oligos (Additional file 1), predicted to have minimal secondary structure. We pooled them at equimolar ratio for capturing and sequencing a 30 nt synthetic fragment (RNA30-3′P). After sequencing, we evaluated the length and the quality of the consensus sequence of the insert, as well as the relative abundance of each ADR. Almost all adaptors achieved a correct consensus length (Figure 2e) and a high accuracy (> 95%) of the insert (Figure 2f) with median of five or more repeats per read (Supplementary Figure 3A). However, reads obtained with some adaptors were more represented than others (Supplementary Figure 3B), suggesting that some of them displayed a higher probability to form RNA:ADR products. Since ADR12 combined good accuracy with a relatively high read abundance, we used this adaptor in all further experiments.
To investigate whether circAID-p-seq can capture quantitative variations in RNA abundance, we sequenced a mixture of three different synthetic RNA fragments (RNA-A, RNA-G and RNA-M) at different molar concentrations. For quantitative analysis we took into consideration only reverse and “hairpin” reads (Supplementary Figure 4A and 4B and methods). Our results showed that there is a good agreement (R2 > 0.98) between the amount of input and the number of consensus sequences obtained for each insert (Supplementary figure 4C and 4D). Overall, our results provide evidence that circAID-p-seq (i) can selectively incorporate a mixture of short synthetic RNA molecules bearing a 3′P signature, (ii) is quantitatively accurate, and (iii) is efficiently applicable to the ONT sequencing platform.
Ribosome profiling with circAID-p-seq
To further confirm the selectivity and efficiency of our library preparation method in complex biological samples, we chose the framework of Ribo-seq, the sequencing of short RNA fragments protected by ribosomes from nuclease digestion. Ribo-seq provides positional information of ribosomes on transcripts as well as an indication of the RNAs engaged in translation. This experimental setup is intrinsically suitable for our purpose because the endonucleases used (RNase A superfamily, e.g. RNase I) are known to generate 3′P ribosome protected fragments (RPF), with a length between 25 nts and 35 nts. Therefore, with circAID-p-seq we can selectively and directly capture only the fragments digested by the nuclease, without the need for additional de-phosphorylation steps and without the risk of sequencing RNAs fragments not generated by the endonuclease.
Since there are currently no available technologies for selecting and sequencing 3′P RNA fragments, we compared our library preparation method with two established protocols for RPF sequencing: 1) the ligation-free sequencing protocol based on a strand switching approach [28] and 2) the classical protocol for ribosome profiling [33]. Both methods are based on the removal of the 3′P prior to library preparation and on Illumina (ILMN) platform. We used HEK293T cells to compare circAID-p-seq to a commercial switching approach (SMARTer smRNA-seq, Clontech) and mouse liver samples for the standard established protocol of ribosome profiling (Figure 3a). To determine whether circAID-p-seq can uncover changes in the localization of ribosomes along transcripts, we compared ribosome footprint distribution in HEK293T cells untreated (H-) and treated (H+) with Harringtonine, a drug known to stall ribosomes at the start codon [34]. We obtained 2 - 5 million raw concatemeric reads per condition by using the CircAID-p-seq libraries, and 47 - 100 million raw reads using Illumina sequencing. The length distribution of the consensus sequences, generated by CircAidMe and representing putative RPFs, peaked at about 33 nts for both HEK293T and liver samples (Figure 3b), in accordance with Illumina data and the length of RNA fragments covered by ribosomes [35]. Even if the two platforms have different sequencing depths, we wanted to better understand the concordances among circAID-p-seq and the ILMN-based methods in term of RPFs coverage, GC content, number and biotype of genes identified. We observed a high correlation in RPFs coverage between circAID-p-seq and the two ILMN sequencing methodologies (Spearman’s R = 0.84 in HEK293T and R = 0.90 in mouse liver samples) (Supplementary Figure S5A and S5B).
A similar correlation was obtained with circAID-p-seq within replicates (Spearman’s R > 0.90) in mouse liver samples (Supplementary Figure S5C and S5D). A current limitation in small RNA sequencing experiments is the over-representation of sequences generated by PCR amplification [36]. As circAID-p-seq is a PCR-free protocol, PCR duplicates are not an issue. The median percentage of GC content in the liver genes was concordant across the sequencing approaches (52% circAID-p-seq/ONT sequencing vs 56% ILMN sequencing, data not shown), suggesting that there were no GC-related biases in the library preparation method. Of note, CircAID-p-seq showed 45% less reads mapping on rRNA than ILMN in mouse liver samples. The lower rRNA contamination suggests that not all rRNA contaminants derive from cleavage by RNAse I nuclease (i.e. they do not have a 3′P). In circAID-p-seq we also observed less tRNA contamination, but a higher percentage of unassigned reads (Figure 3b, 3c) due to a still high ONT sequencing error rate.
To finally confirm that circAID-p-seq can detect ribosome footprints, we investigated the number and percentage of reads mapping to protein coding sequences (cDNA). In circAID-p-seq data, 18% (0.15 million) of reads in HEK293T and 29% (0.6 million) of reads in mouse liver mapped to coding genes. In Illumina data, 20% (22 million) and 9% (5 million) of the reads in HEK293T and mouse liver respectively, aligned to coding genes (Figure 3b, 3c). We identified a total of 9,419 genes in HEK293T/H-(5,002 with > 10 Transcripts Per kilobase Million, TPM) for circAID-p-seq and 16,754 genes (8,820 with > 10 TPM) (Additional file 2) for ILMN. About 55% of ILMN genes were in common with ONT sequencing (Supplementary Figure 6A). In line with this result, in circAID-p-seq mouse liver we identified 61% (4115 with > 10 TPM) of ILMN genes (Supplementary Figure 6B). We determined that ILMN covers 0.27% of coding genes per million reads generated, while circAID-p-seq/ONT covers 2.45% of the coding genes per million reads generated. In other words, more than 4,000 reads are required to detect a gene in ILMN, while only 130 reads are sufficient with circAID-p-seq. Considering only genes with > 10 TPM, ILMN needs more than 8000 reads/gene while circAID-p-seq requires only 185 reads/gene. Comparable performances were obtained in mouse liver. As a result, even if with ONT the sequencing depth is lower, most of the genes detected by circAID-p-seq match ILMN genes with good read coverage (> 10 TMP), drastically reducing the need for deep sequencing. In agreement with this, if we consider all detected genes (> 1 count), we observed that the majority of circAID-p-seq data have a gene coverage higher than 1 TPM (with a median of 10 TPM) for both HEK293T cells and mouse liver. On the contrary, genes identified only by ILMN are less covered (median of 0.5 TPM), meaning that they have a low density of RPFs (Supplementary Figure 6C and 6D). Overall, these results demonstrate that circAID-p-seq is between four and ten times more efficient in term of number of genes per million of reads with respect to the ILMN sequencing.
To further characterize genes identified by the two methodologies, we performed a Gene Ontology (GO) analysis in both HEK293T and mouse liver datasets. We observed similar enriched terms between ILMN and circAID-p-seq/ONT (Supplementary Figure 6D and 6E), confirming that the most representative transcripts in ILMN were also captured by, and were semantically coherent with, circAID-p-seq data.
Sequencing of ribosome footprints provides a precise record of the position of the ribosome along the mRNAs. RPFs are expected to be over-represented within the coding region (CDS) of the mRNAs, where, in contrast to those positioned in 5′UTR and 3′UTR regions, the P-site positions are expected to exhibit trinucleotide periodicity. In line with this, a high percentage of the generated consensus sequences mapped to the coding sequence (CDS) within mRNAs (85.7% in HEK293T cells; and 93.4 % in mouse liver). The percentage of reads mapping to the 5′UTR and 3′UTR was negligible (Figure 3e). The percentage of P-sites in the three possible translation reading frames obtained with circAID-p-seq were in agreement with ILMN data (Figure 4a, 4b and 4c). A comparison between the P-site metaprofiles at single nucleotide resolution showed a clear trinucleotide periodicity in all sequencing approaches and samples (Figure 4a, 4b and 4c). Treatment with Harringtonine in HEK293T cells showed a relative increase in the signal at the start codon and a decrease along the CDS (Figure 4b) in both circAID-p-seq and ILMN, confirming the robustness of circAID-p-seq in detecting positional information of ribosomes. Metaprofiles in mouse liver obtained using only transcripts detected in both library preparation approaches (n = 4115; > 10 TPMs, Supplementary Figure S7) did not show significant differences between circAID-p-seq and the standard method.
Taken together, these results confirm that circAID-p-seq allows selection and sequencing of actual ribosome footprints by including 3′P RNA fragments. Hence, circAID-p-seq retrieves important information to study the positional landscape of ribosomes along transcripts. More specifically, circAID-p-seq, coupled with CircAidMe, generates ribosome profiling data consistent with existing methods; with the added advantages of a higher efficiency and no PCR amplification biases. Remarkably, CircAID-p-seq requires at least 10 times less raw reads than classical ribosome profiling protocols and sequencing, confirming that circAID-p-seq (i) is suitable for ribosome profiling experiments, and (ii) is effective for phospho-RNA-sequencing.
Discussion
Here, we introduced circAID-p-seq, a new approach for library preparation and nanopore sequencing of 2′,3′ cyclic phosphate- and 3′ phospho-terminated RNA fragments. When combined with the bespoke computational pipeline, CircAidMe, we obtained > 99.5% average accuracy in retrieving the sequence of short synthetic RNAs fragments.
We applied circAID-p-seq to ribosome profiling experiments, which are based on the generation of RNA fragments with 3′-phosphates due to endonuclease digestion. We benchmarked our method against commonly used Ribo-seq library preparation protocols that require a dephosphorylation of the 3′ end. The 3′ dephosphorylation could decrease the level of specificity, resulting in noisier sequencing and inclusion of any short RNA molecule endowed with -OH groups at the 3′ terminus. Moreover, existing approaches are based on PCR steps that can introduce amplification related biases and PCR duplicate issues. With circAID-p-seq, no PCR step is required and after RPF purification no gel extraction steps are needed. Coupled to ONT sequencing, circAID-p-seq produces higher-quality information than approaches based on Illumina sequencing in terms of (i) number of genes per million of reads and (ii) localization of positional data of the ribosomes along transcripts. In addition, circAID-p-seq/ONT sequencing requires 10 to 30 times less absolute number of raw reads than classical ribosome profiling coupled with Illumina sequencing to achieve comparable coverage of the translatome. In fact, CircAID-p-seq showed a higher efficiency in the coverage of the coding genome compared with standard library preparation. Furthermore, circAID-p-seq showed a lower rRNA contamination, at least in our experiments. These features make the circAID-p-seq library generation extremely useful for genome-wide analysis.
If required, an increase in the sequencing depth can be achieved with other ONT sequencing platforms, such as PromethION or GridION. In terms of experimental time, circAID-p-seq combined with CircAidMe allows fast ribosome profiling experiments from sequencing to data analysis (in the region of 24-48 hours). Of note, circAID-p-seq sequencing was performed with the portable and low throughput MinION (ONT) sequencer, affording lower instrumental costs compared to ILMN sequencing (here performed on HiSeq2500 and NovaSeq6000). Another advantage of circAID-p-seq/ONT sequencing is that if samples are not performing well during sequencing, the run can be stopped and the flow cell re-used, a relevant feature in expensive ribosome profiling experiments (Table 1). The main constraints identified for circAID-p-seq are when analysing low abundance transcripts or detecting translation events only marginally represented in Ribo-seq data, such as the translation of upstream open reading frames (uORFs) or ribosome readthrough events [37, 38]. Future developments of circAID-p-seq need to address the option of multiplexing the workflows and of reducing the required amount of input material, currently established in more than 3 picomoles of phosphorylated RNA fragments.
In addition to ribosome profiling, many other RNA footprinting techniques may take advantage of this method, like those employing endoribonucleases and generating 3’P termini with the aim of characterizing RNA-protein interactions, large RNA-protein complexes [40], and/or the interaction of small molecules with RNA [41]. More importantly, 3′P phosphorylated RNAs are hallmarks of biological processes and can be generated in living cells by toxins [5], ribozymes [4], endonucleases [2, 18], the tRNA splicing endonuclease [39], the Ire1 [40], the RNase T2 [20], the RNase L [41] and some CRISPR-associated (Cas) proteins [42]. Endogenous 3′P/2’-3’cP terminated RNA fragments are involved in diverse biological processes, such as RNA metabolism [7], rRNA and tRNA biogenesis [8], mRNA splicing [9], unfolding protein response [10] and stress granules production [11]. Phosphorylated RNA fragments are also dysregulated in disease conditions, such as cancer [43], viral infection [44] and Amyotrophic Lateral Sclerosis [45] pointing to 3′P/2’-3’cP RNAs as likely and largely unexplored signatures of disease [46].
In conclusion, the combination of circAID-p-seq with ONT allows single molecule, low-cost, fast and easy detection of biologically relevant 3′-P/cP RNA species with a portable ONT device. CircAID-p-seq is the first phospho-RNA-sequencing library preparation method successfully tested in ribosome profiling experiments and could be used in the near future to better uncover the biological role of 3′-phospho RNA molecules, a hidden transcriptomic layer in many genome-wide profiles.
Materials and Methods
Synthetic oligonucleotides
Custom adaptors, synthetic RNA fragments and DNA primers were synthesized by Integrated DNA Technologies (Coralville, IA). Synthetic RNA fragments consisted of 30-mer oligonucleotides with 5′ OH and 3′P termini. Custom adaptors consisted of RNA molecules with different length and no modification at 5′ and 3′ ends. Adaptors were carefully designed to minimize their secondary structure, using a combination of RNAFold and OligoEvaluator folding tools. All the sequences are listed in Additional file 1.
circAID-p-seq synthetic library preparation
5′ Phosphorilation and adaptor ligation
Synthetic RNA fragments bearing a 3′P were subjected to 5′ phosphorylation with T4 PNK 3′ minus (NEB, cat n° M0236S), according to manufacturer’s instructions. RNA fragments were purified from the reaction using a RNA Clean & Concentrator(tm)-5 column (Zymo Research, Cat. n° R1013). The resulting RNA fragments were ligated to different adaptors (listed in Supplementary Table 1) via 3′P ligase, according to the following conditions: 30 pmol of RNA fragment, 10 pmol of adaptor, 15 pmol 3′P ligase, 1X 3′P ligase buffer, 100 µM GTP, 1 mM MnCl2 in a final volume of 10 µL. The reaction was incubated 2h at 37°C and then loaded on a 15% acrylamide/8M urea precast gel (Invitrogen, cat n° EC6885BOX). The ligated RNA was purified through gel extraction, as described in the gel analysis section below. To evaluate and optimize the circAID-p-seq method, different combinations of adaptors and synthetic RNA fragments were employed. For testing the effect of different adaptor lengths on RT-RCA and number of repetition, ADR-110, ADR-60, ADR-20 were used for ligation with 30RNA-3’P fragment. Subsequently, to identify the best 24 nt long adaptor sequence, an equimolar pool of 12 oligos was created and used in the first ligation step with 30RNA3’P fragment. For quantitative analysis, three different types of RNA fragments (RNA30-G, RNA30-M, RNA30-A) were combined at the ratio of 1:10:100. In this case, for the circAID-p-seq library preparation, ADR-12 was used.
Circularization and RNase R treatment
The circularization of the adaptor-ligated RNA (RNA:adaptor) was carried out at 25 °C for 2h, in a total volume of 20 µL containing 10 U of T4 RNA Ligase 1 (NEB, cat n° M0204L), 1X T4 RNA ligase buffer, 20% PEG8000, 50 µM ATP. The reaction was then incubated at 37 °C for 1 h with 20 U of RNase R (Lucigen, cat n°RNR07250), to remove all the undesired products (i.e linear RNA or concatemer product). Circular RNA was purified through RNA Clean & Concentrator(tm)-5 column (Zymo research, Cat. n° R1013) and quantified using Qubit(tm) RNA HS Assay Kit (Thermo Fisher, Catalog number: Q32852).
Reverse Transcription - Rolling Circle Amplification (RT-RCA) and Second Strand synthesis
RT-RCA was performed in 20 µL with Maxima H Minus Reverse Transcriptase (Thermo Fisher, cat n°EP0752) under the following conditions: 50 ng of circular RNA, 200 U of Reverse Transcriptase, 1X RT buffer, 0.5 mM dNTPs, 50 pmol Primer R, 10 % glycerol. The reaction was carried out at 42°C for 4 hours, then stopped by incubation at 70°C for 10 min. After cDNA synthesis, circular RNA template was hydrolyzed by adding 0.1 N NaOH for 10 min at 70°C. The second strand cDNA was generated by performing one PCR cycle using Platinum II Hot start Taq Polymerase (Thermo Fisher, cat n°14966001). The reaction included 20 μLsingle-strand cDNA, 1x Buffer, 0.2 mM dNTPs, 2 mM MgCl2, 2 U Taq Polymerase, 50 pmol primer F in a total volume of 50 μL and subjected to the following program: initial denaturation at 94°C, one cycle of 94°C for 30 sec, 60°C for 30 sec and 68°C for 2 minutes. Double strand cDNA was purified using AMPure XP beads (Agencourt, cat n°A63881) according to manufacturer’s instructions. Validation of second strand synthesis was performed by Nuclease S1 digestion (Thermo Fisher, cat n°EN0321) according to manufacturer’s instructions.
Nanopore sequencing
Purified double-strand cDNA was prepared for nanopore sequencing. Briefly, cDNA was subjected to end repair and dA-tailing reaction using NEBNext End repair/dA-tailing module (NEB, cat. n°E7546S) following the manufacturer’s instruction and incubated for 5 min at 20°C and then 5 min at 65°C. The reaction mix was purified with AMPure XP beads (Agencourt). ONT Adaptor mix was added according to the direct cDNA sequencing kit protocol (SQK-DCS109, ONT), then loaded on a R9.4 flow cell and sequenced on the MinION sequencing device.
Gel analysis
The samples were mixed 1:1 with gel loading II (Thermo Fischer Scientific, Cat n°AM8547), denatured at 70°C for 90 secs before being loaded into the gel and run at 200V. Gels were then stained with Sybr™ Gold (Invitrogen, cat n°S11494) and scanned using Chemidoc (GE Healthcare, Piscataway, NJ). Gel images were analyzed using ImageLab (Biorad). When required, bands were isolated from the gel, crushed and soaked overnight in Buffer I (Immagina Biotechnology, cat. n°RL001-10) at room temperature with constant rotation. The aqueous phase was filtered with Millipore ultrafree MC tubes and then precipitated with isopropanol (Sigma, cat. n° I9516) at - 80°C for 2 hours or overnight, then centrifuged for 30 min at 1200g, 4°C. The pellet was washed once with 70% ethanol, centrifuged at 12000 g for 5 min at 4°C and air-dried before further processing.
Cell culturing and treatment
HEK293 cells were seeded at 1.5×106 cells/dish and kept in culture until reaching 80% of confluence. Cells were then treated with or without harringtonine (2 µg/mL) for 3 min, followed by CHX (10 μg/mL, SIGMA cat. n° 01810) for 5 min at 37°C. Cell lysates were obtained using a hypotonic lysis buffer (IMMAGINA Biotechnology, cat. n°RL001-1). RNA absorbance at 260 nm was measured by Nanodrop ND-1000 UV-VIS Spectrophotometer and the lysate diluted to a total of 1.7 a.u. (A260 nm) with W-buffer (IMMAGINA Biotechnology, cat. n°RL001-4). RPFs were generated by treating the diluted lysate with 12.7 U of RNase I (Ambion, cat. n°AM2295) at room temperature for 45 min (as described in Clamer et al., 2018). RNase I digestion was stopped by adding 10U of Superase Inhibitor (Thermo Scientific, cat. n° AM2696) for 10 min on ice. Young adult wild-type FVB mice were obtained from breeding stocks at the University of Edinburgh. All procedures were performed under licensed authority from the UK Home Office (PPL P92BB9F93). Liver tissues were dissected immediately following sacrifice and pulverized under liquid nitrogen using a pestle and a mortar and the lysates obtained according to previous protocols [47]. Lysates were treated with RNAse I and the 80S with RPFs were isolated using sucrose gradient separation according to previous protocols [48] (Lauria et al., 2020). The RPF were purified using acidic phenol/chloroform extraction. The RPF (25-35 nt) were obtained after purification in denaturing 15% UREA-PAGE. The RPF were divided into two aliquots. One was used for library preparation [29, 48]. The other aliquot was used for circAID-p-seq. All experiments were performed in biological triplicate.
Sucrose cushioning
After digestion, lysates were loaded on top of 900 µL of a 30 % sucrose cushion (30 mM Tris-HCl pH 7.5, 100 mM NaCl, 10 mM MgCl2, 1M sucrose in nuclease-free water) supplemented with 20 μg/mL of CHX. Samples were ultracentrifuged at 95,000 rpm at 4°C for 2h using a TLA100.2 rotor (Beckman). The pellets were resuspended in 200 µL of W-Buffer and treated with 1% SDS (Sigma cat. n° 05030) and 0.1 mg Proteinase K (Euroclone, cat. n° EMR022001) at 37°C for 75 min. Total RNA was extracted by acid-phenol:chloroform, pH 4.5 (Ambion, cat n° AM9722). RNA was precipitated with isopropanol, air-dried, resuspended in nuclease-free water and analyzed on 15% acrylamide/8M urea precast gel. RPFs were size-selected (corresponding to 25-35 nt bands) and extracted from gel (see above). Before starting with library preparation, isolated and purified RPFs were quantified using the Qubit™ miRNA Assay Kit (Thermo Fisher, Catalog number: Q32881).
Ribo-seq sample and library preparation
Libraries from RPFs isolated from HEK293T cells were prepared using the SMARTer® smRNA-Seq Kit for Illumina (Takara, Cat. No. 635029) and sequenced with 50 cycles single-read on an Illumina NovaSeq 6000 sequencer. HEK293T RPFs were extracted from independent biological replicates for circAID-p-seq/ONT and SMARTer® smRNA-Seq /Illumina sequencing. Lists of the counts per gene are reported in Additional file 2. Mouse liver RPFs were extracted from three biological replicates. For each replicate, the same pool of PAGE purified RPFs of each triplicate were used for parallel circAID-p-seq/ONT and Illumina sequencing. Illumina libraries were prepared according to Ingolia et al., 2012 and were sequenced with 50 cycles single-read on an Illumina HiSeq2500 sequencer. List of the counts per gene reported are in Additional file 3.
For circAID-p-seq, upon 5′ phosphorylation of RPFs, library preparation was performed starting from 10 pmol of ADR12, following the protocol as described above. The final libraries were loaded on a R9.4 flow cell and sequenced for 20 h with a MinION sequencer.
circAID-p-seq data analysis
To identify the RPFs (fragments) contained in each read we developed a custom pipeline (written in Python 3): CircAidMe. The FASTQ files obtained by Guppy 3.6.1 (available from ONT via https://community.nanoporetech.com) base calling are processed to identify from each read the consensus of the repeat sequence from every fragment meeting certain selection criteria. These criteria are user-defined and include filtering by fragment length (for our libraries we selected fragment lengths between 15 and 40 nt) and minimum number of fragment copies required to calculate a consensus sequence. CircAidMe performs the following steps: First, fused reads are detected by 1) searching for remaining ONT sequencing adaptors in a read and 2) detecting orientation patterns of the circAID-p-seq adaptors, which indicate a fused read. If a fused read is encountered, it is split at the appropriate position. Then, for every (split or non-split) read circAID-p-seq adaptors are detected. The fragment inserts flanked by the adaptors are then extracted and a first multiple sequence alignment (MSA) is performed with MUSCLE v3.8.1551 [49]. The MSA is examined using esl-alipid (https://github.com/EddyRivasLab/easel/tree/master/miniapps) and low quality inserts are detected and removed. A second MUSCLE run is performed, resulting in the final consensus sequence of the fragment. The detection of ONT- and circAID-p-seq adaptors is performed using SeqAn v2.4 [50]. Additionally, several statistics are collected while executing CircAidMe to evaluate the quality of the circAID-p-seq library, including: number of fragments per read (> 2 for Ribo-seq), number of adaptors detected per read (≥ 4 for Ribo-seq), read length and consensus length (≥ 20 nt for Ribo-seq). These quantities are recorded for each read in a text file in CSV format. Moreover, for each read that was excluded from the final output the reason is given as a tag in the report file. The final retained consensus sequences are collected in a FASTA file for downstream analyses. Reads discarded by CircAidMe are collected in a separate FASTA file. More details about the function of CircAidMe can be found on its GitHub page: https://github.com/ms-gx/CircAidMe. Read accuracy was determined as reported in Volden R. et al., 2018. Briefly, the accuracy is given as a percentage, representing the portion of the consensus sequence not altered by mismatches or indels when considering the alignment to the reference sequence/coding transcriptome. The process is executed for each aligned consensus sequence.
(i) accuracy calculation
For quantitative analysis, we first reasoned that the two strands of the concatemeric cDNA carry complementary information (Supplementary Figure 4A). If this was the case, only one strand of the double stranded cDNA should to be taken into account for an accurate quantitative analysis and in order to not count two times the same fragment. To evaluate if this had occurred, and to then determine which strand is more reliable, we compared the raw read length of both forward and reverse cDNA concatemeric strands based on the orientation of the ADR-12 sequence (Supplementary Figure 4A). We noticed that forward reads, generated during the second strand synthesis, are generally shorter and less abundant than the reverse reads (Supplementary Figure 4B). This effect most probably derives from multiple annealing sites of the primer for second strand synthesis on the concatemeric cDNA. Therefore, since forward-strand synthesis can generate fragmented copy of the reverse strand, it could introduce biases in quantitative experiments. Then, we noticed that a portion of the reads carry forward and reverse strands fused together. This effect is known to be caused by two different molecular reactions: (i) a second strand entering the pore immediately after the first strand without the sequencing device been able to detect two independent molecules (we called this type of reads “fused reads”) [25], or (ii) a hairpin at the 3’-end of the cDNA strand that functions as a primer during second strand synthesis [26] (we called this type of reads “hairpin reads”, Supplementary Figure 4A). We optimized CircAidMe to split fused reads, while leaving hairpin reads untouched. The latter are useful to produce longer reads (Supplementary Figure 4B), hence useful for a better consensus and for quantitative information.
Ribosome profiling data analysis
To assess representation of different components of the human transcriptome in the libraries, the consensus sequences generated by CircAidMe were iteratively mapped with Bowtie2 [51] to different classes of human transcribed sequences (as annotated In Gencode v33), including: rRNAs, tRNAs, other non-protein-coding transcripts, and the protein-coding transcriptome. The remaining unmapped sequences were aligned to the human Genome (assembly GRCh38.p13). For ribosome profiling down-stream analyses, consensus sequences were aligned directly to the genome after removal of those mapping to rRNA and tRNA sequences. Alignment files were processed with Samtools version 1.9 [52] and analyzed with the RiboWaltz R package [53], to identify the P-site localization within ribosome footprints and to assess three-periodicity and representation of coverage within coding regions of the mRNAs. As an extra QC step, the Read Accuracy index [54] was calculated for reads of selected libraries with a custom R script. Illumina data from human HEK293T cells were processed with the SMARTer® smRNA-Seq Kit for Illumina (Takara, Cat. Nos. 635029) following guidelines. Briefly, reads were trimmed with Cutadapt version 2.10 [55] by removing the first 3 nucleotides of every read and the 3’ terminal adaptor plus poly-A tail and trimmed reads of length under 15 nucleotides were discarded. Illumina data from mouse liver tissue were processed according to Ingolia et al., 2012, removing the Illumina adaptors and retaining only reads where the adaptors were present (parameters used: -O 15 –e 0.15 -n 3 –m 19). Illumina reads from HEK293T and mouse liver samples were analysed following the same procedure detailed above for circAID-p-seq data, aligning them to the respective classes of transcribed sequences and genomes (Gencode v33 annotation and genome assembly GRCh38.p13 for human; Genecode vM34 annotation and genome assembly GRCm38.p6 for mouse). Read coverages for each protein-coding gene from ONT and Illumina libraries were calculated with HTSeq [56] and normalized as TPM values for comparison between libraries. Only coverages by more than one read were retained in comparisons between ONT and Illumina libraries. PCR duplicate reads were identified with Picard MarkDuplicates version 2.23.1 (Picard Toolkit.” 2019. Broad Institute, GitHub Repository. http://broadinstitute.github.io/picard/; Broad Institute) with default parameters, and removed with Samtools.
Ethics approval and consent to participate
All animal tissue was obtained under full licensed approval from the UK Home Office following internal (institutional) and external ethical review.
Availability of data and materials
Source code of CircAidMe along with a pre-compiled package and a description of the software are available on GitHub: https://github.com/ms-gx/CircAidMe.
Competing interests
M.C. is the founder of, director of, and a shareholder in IMMAGINA BioTechnology S.r.l., a company engaged in the development of new technologies for gene expression analysis at the ribosomal level. L.M, A.D.P., C.F., P.B are employees of IMMAGINA BioTechnology S.r.l. G.V is a scientific advisor of IMMAGINA BioTechnology S.r.l. All of the other authors declare no competing interests. CircAID-p-seq technology is covered by the international patent application number PCT/IB2020/058259.
Funding
IMMAGINA BioTechnology S.r.l. internal funding, AFM-Telethon (reference number 22129) to GV and THG; Caritro Foundation (young Post-doc funding grant) to FL.
Authors’ contributions
Conceptualization: M.C., A.D.P. Bioinformatic analysis: R.B., M.S., L.B., F.L. Implementation of CircAidMe: M.S. Wrote first draft of paper: M.C., A.D.P. Tissue/Data collection: A.D.P., C.F., L.M., M.C., I.S., S.T., T.H.G. Data curation: M.C., PB, A.D.P. Supervised research: M.C. Paper revision: G.V., M.S., A.D.P., M.C., F.L. T.H.G., L.B., S.T. Reviewed and approved text: all authors. The authors read and approved the final manuscript.
Acknowledgements
We thank Pervasive Edge Srls for the service (reference number: 7/E).