Abstract
Current transcriptome annotations have largely relied on short read lengths intrinsic to most widely used high-throughput cDNA sequencing technologies. For example, in the annotation of the Caenorhabditis elegans transcriptome, more than half of the transcript isoforms lack full-length support and instead rely on inference from short reads that do not span the full length of the isoform. We applied nanopore-based direct RNA sequencing to characterize the developmental polyadenylated transcriptome of C. elegans. Taking advantage of long reads spanning the full length of mRNA transcripts, we provide support for 20,902 splice isoforms across 14,115 genes, without the need for computational reconstruction of gene models. Of the isoforms identified, 2,188 are novel splice isoforms not present in the Wormbase WS265 annotation. Furthermore, we identified 16,325 3’ untranslated region (3’UTR) isoforms, 2,304 of which are novel and do not fall within 10 bp of existing 3’UTR datasets and annotations. Combining 3’UTRs and splice isoforms we identified 25,944 full-length isoforms. We also determined that poly(A) tail lengths of transcripts vary across development, as do the strengths of previously reported correlations between poly(A) tail length and expression level, and poly(A) tail length and 3’UTR length. Finally, we have formatted this data as a publically accessible track hub, enabling researchers to explore this dataset easily in a genome browser.
Introduction
The nematode Caenorhabditis elegans is an ideal experimental model organism due to its compact, well-annotated genome (The C. elegans Sequencing Consortium 1998; Wilson 1999; Hillier et al. 2005; Gerstein et al. 2010), invariant cell lineage (Sulston et al. 1983), and wide-array of molecular methods. Our current understanding of the C. elegans transcriptome has been determined with EST libraries, cDNA based libraries, and Illumina-based cDNA and RNA sequencing (Walhout et al. 2000; Reboul et al. 2001; Lamesch et al. 2004; Hillier et al. 2009; Gerstein et al. 2010; Spieth et al. 2014; Tourasse et al. 2017). Most coding sequences (CDSs) span more than 600 nucleotides (excluding introns), and the typical C. elegans gene contains 6.4 coding exons on average (Spieth et al. 2014).
3’ untranslated regions (3’UTRs) are critically important features of mRNA transcripts that contain binding sites for RNA-binding proteins and small noncoding RNAs (Cai et al. 2009; Szostak and Gebauer 2013). Regulation of 3’ UTR length can therefore have profound impacts on mRNA expression, stability, and localization (Mayr and Bartel 2009; Andreassi and Riccio 2009; Kuersten and Goodwin 2003). Large-scale sequencing of the C. elegans 3’UTRs revealed median lengths of 130-140 nucleotides (nt) with an average length of ~211 nt (Jan et al. 2011; Mangone et al. 2010). In addition, poly(A) tails in C. elegans have a median length of approximately 57 nt at the L4 stage and short poly(A) tail lengths are a feature of highly expressed genes (Lima et al. 2017).
The average transcript in C. elegans is significantly longer than the maximum possible read length of Illumina sequencing. Therefore, current approaches to annotate the full length structure of the average C. elegans transcript isoform rely on manual curation of gene models based on a variety of data types, while more generally computational approaches to assemble transcript structures from bulk, short-read sequencing data utilize computationally expensive and imperfect inference (Williams et al. 2011; Spieth et al. 2014; Pertea et al. 2015; Trapnell et al. 2012). Calculating poly(A) tail lengths requires a sequencing approach capable of resolving long homopolymers, and determining 3’UTR structures requires an experimental or computational means of determining which reads reflect the 3’ most base included in the transcript before cleavage and polyadenylation. The specialized protocols and analyses used to measure poly(A) tail length and identify 3’UTRs with short read sequencing approaches cannot directly link these measurements to their splice isoform of origin, and in the case of 3’UTR identification instead rely on assigning putative cleavage sites to the nearest overlapping or upstream gene (Subtelny et al. 2014; Chang et al. 2014; Mangone et al. 2010; Jan et al. 2011; Blazie et al. 2017; Diag et al. 2018).
Nanopore sequencing, in contrast, has no theoretical upper limit to read length and is capable of sequencing transcripts from end to end at a single molecule level (Garalde et al. 2018; Jenjaroenpun et al. 2018; Workman et al. 2018). Nanopore based sequencing methods have been used to annotate transcriptome structure in a variety of organisms ranging from the relatively simple Saccharomyces cerevisiae, to complex human cell lines and cancers (Byrne et al. 2017; Bayega et al. 2018; Garalde et al. 2018; Jenjaroenpun et al. 2018; Tang et al. 2018; Volden et al. 2018; Workman et al. 2018; Kadobianskyi et al. 2019; Sessegolo et al. 2019). In nanopore-based direct RNA sequencing (dRNAseq), RNA reads are captured by the 3’ end of their poly(A) tail, and sequenced 3’ to 5’ natively, directly measuring the RNA molecule. The full length of the poly(A) tail is sequenced, and using a trained hidden Markov model, the length of the poly(A) tail for each read can be estimated (Workman et al. 2018). The 3’ most base in the alignment should reflect the true cleavage and polyadenylation site for the full transcript represented by that read, provided that base-calling, trimming of poly(A) and adapter sequences, and alignment had sufficient precision. Despite these advantages, adoption of dRNAseq and other nanopore based sequencing methods is hindered due to the technology’s high error rates, and the relative lack of bioinformatics tools and analysis pipelines designed for long error-rich reads.
In this study, we have generated an atlas of post-embryonic C. elegans transcript structure using dRNAseq to sequence RNA extracted from across its developmental life cycle. We provide full length support for previously annotated splice isoforms, as well as novel splice isoforms. Furthermore, we identify and characterize 3’UTRs and compare these to known datasets. We also estimate poly(A) tail lengths for our reads and examine global properties of these lengths across development. Finally, we have made this data available both in raw formats and as a custom track hub.
Results
Collection and sequencing of developmentally staged C. elegans
To capture the diversity of transcript isoforms expressed across C. elegans development, we created dRNAseq libraries in technical duplicates from larval stages L1 to L4, as well as young and mature hermaphrodite adults (Figure 1A) (Corsi et al. 2015). Because wild-type C. elegans exists largely as hermaphrodites with spontaneous males (<0.5%) emerging in the population through chromosome nondisjunction, we also obtained a male enriched sample using a him-8 mutant that disrupts X chromosome segregation (Hodgkin et al. 1979; Broverman and Meneely 1994; Phillips et al. 2005). We further enriched for the male subpopulation by filtering them through a 35um mesh that allows the males to be collected in the filtrate.
Libraries were generated from RNA isolated by TriReagent (Ambion), poly(A) selected, and prepared for sequencing following the Oxford Nanopore Technologies SQK-RNA001 kit protocol with the exception of using Superscript IV (Thermo Fisher) in the optional reverse transcription step. The libraries were sequenced on an Oxford Nanopore Technologies GridION X5 (model #GRD-X5B002). Basecalling and adapter trimming of the reads was performed using poreplex (running albacore) (https://github.com/hyeshik/poreplex), resulting in over 540,000 reads that passed base calling quality control for each developmental stage sequenced, and 5.54 million total such reads (Supplemental Table 1). Reads had mean per base quality scores above 10 for each developmentally-staged sample, and median per base quality scores ranging between 9 and 10 for each sample. Median read lengths ranged between 573 and 687 for a given sample, while average read lengths were significantly longer, ranging from 739 to 934. Reads were aligned to the ce11 genome using minimap2, which successfully aligned 87.8% of our reads (Supplemental Table 2) (Li 2018).
Identifying reads representing full length transcripts
While the majority of our reads correspond to full length transcripts (Figure 1B, C), a significant fraction of aligned reads failed to span the full length of an annotated transcript isoform; these reads were predominantly truncated relative to annotated isoforms at their 5’ ends, resulting in a 3’ bias in coverage from our total reads (e.g. Figure 1D). Including these reads in our downstream analysis would have artificially inflated the number of isoforms identified. Therefore, to make use of the long read lengths possible through dRNA-seq, reduce this 3’ bias, and eliminate the need to computationally reconstruct gene models, reads were split into ‘full-length’ and ‘non-full-length’ groups using existing coding sequence annotations, and only full-length reads were considered in downstream analyses (see Methods and Supplemental Figure 1 for an outline of the entire analysis). Note that nanopore sequencing reads are currently unable to capture the last 10-15 bases proximal to the 5’ end because of the structure of the pore-motor protein-RNA assembly, as reported previously (Workman et al. 2018)).
To determine the efficacy of this full-length filtering approach, we made an aggregate plot of normalized coverage across the average coding gene (Figure 1B). Supporting the validity of this approach, the non-full-length reads identified have a very extreme 3’ bias, while the full-length reads identified do not have the 3’ bias present in the total reads. Full length reads comprise the majority of reads in each dataset (Figure 1C). Combining all datasets, almost 2.9 million full length reads were obtained (Supplemental Table 2 for a breakdown of reads remaining after each filtering step).
In addition to full length filtering, a number of other filtering and analysis steps were applied, detailed in full in the methods. Briefly, reads were filtered if they 1) contained large insertions or large 3’ softclips (i.e. bases at the end of a read that fail to align); 2) had no detectable poly(A) tail signal; 3) had 5’ ends that weren’t aligned within 100 nt of an annotated transcript start site; 4) had a donor or acceptor splice site that couldn’t be assigned to an annotated donor or acceptor splice site (i.e., a splice site not within 15 bp of an annotated splice site); or 5) had retained introns. Following read filtering, the splice isoforms and 3’UTRs present in each developmental stage and across all stages were identified and reads were assigned to splice isoforms and 3’UTR isoforms as described in the methods.
Identifying the full-length transcriptome
The full-length single-molecule resolution of nanopore sequencing means that, unlike short read sequencing, the full linear sequence of exons comprising a transcript and all of the associated splice junctions (i.e. the splice isoform) and the 3’UTR isoform are captured unambiguously together in a single read. This enables the identification of the “full-length transcriptome”, the set of full-length isoforms (splice isoform + 3’UTR isoform) observed together across all reads (Figure 1E). When considered across all developmental stages and conditions, 25,944 full-length isoforms were identified, comprised of 20,987 unique splice isoforms and 16,325 unique 3’UTRs (Figure 1F, Supplemental Table 3 for exact values). Over 12,000 full-length isoforms were identified in each stage. Because 3’UTRs were only called if there were 3 or more reads supporting the putative cleavage site, not all splice isoforms have an associated 3’UTR called, and therefore, some full-length isoforms have no high confidence 3’UTR call, and are in effect simply splice isoforms. This describes only a small number (3,518) of the full-length isoforms in the dataset.
To determine if these datasets were at or approaching saturation in the number of full-length isoforms identified, reads were randomly subsampled and the number of full-length isoforms that had support from one or more reads in the subsampled set was determined. These values were then plotted, and the relationship between the number of reads considered and the number of full-length isoforms supported was examined. As expected, none of the developmentally staged datasets appears to be saturated (Figure 1G). The number of isoforms identified across all stages appears to be approaching, but not at, saturation (Supplemental Figure 2). This implies that while further sequencing would expand the number of full-length isoforms identified in individual stages, it would likely have a modest effect on the total number of full-length isoforms identified across all stages.
The ability to resolve splice isoforms and 3’UTR isoforms together at single molecule resolution allows for identification of genes where the two features appear to be correlated. Notably, few examples of significant correlations between splice isoform use and 3’UTR isoform use were identified by Fisher exact test after multiple hypothesis testing correction (Supplemental Table 4). This is possibly due to lack of coverage, but more likely reflects an overall lack of coordination between splicing and polyadenylation site choice in C. elegans.
Quantifying genes and splice-isoforms captured with full-length support
Less than half of the 30,133 isoforms with annotated introns in the WormBase WS265 annotation have full-length support (here full-length support means that every annotated intron in the isoform is supported by the same cDNA or EST) (Figure 2A) (Lee et al. 2018; WormBase web site 2018). By comparison, 17,658 full-length supported splice-isoforms across 13,622 genes were identified in our data, well above the 12,613 isoforms and 10,711 genes that have full-length support in the WormBase WS265 annotation. Comparing the genes and isoforms with full-length support in each dataset 4,234 genes and 7,404 isoforms were identified that did not previously have full-length support (Supplemental Figure 3A, Figure 2B). This dataset therefore significantly expands the number of C. elegans genes and isoforms supported by full-length reads.
To examine the changes of splice isoform usage in each developmental stage and across all stages, we plotted the number of previously annotated splice isoforms and genes observed in each stage (Figure 2C, Supplemental Table 3). We found more than 9,900 previously annotated splice isoforms in each stage, with males having the most identified genes and splice isoforms of any individual stage despite having fewer reads after our filtering steps than most other stages (Supplemental Table 2). Combining across all stages, over 18,000 splice-isoforms were observed. Most genes in our transcriptome data have only a single identified splice isoform, and the frequency of genes with a given number of isoforms decreases as the number of isoforms increases (Figure 2D), consistent with the WS265 annotation of the C elegans transcriptome (Supplemental Figure 3B) (Lee et al. 2018; WormBase web site 2018).
In addition to capturing previously annotated splice isoforms, the appeal of long-read single molecule sequencing is the ability to detect novel splice isoforms. To test our ability to identify novel splice isoforms after stringent filtering and splice site correction steps, we searched for isoforms with a set of splice junctions not present in the WormBase WS265 annotation. 2,188 novel splice isoforms were identified across all stages corresponding to 1,349 genes (Figure 2E) (Supplemental Table 3). Of these novel splice isoforms, 1,283 have novel splice junctions between annotated donor and acceptor splice sites and 173 have novel exons. To determine the level of support for these novel isoforms we generated a density plot showing the proportion of novel isoforms with a given number of reads supporting them (Figure 2F). The majority of identified novel splice isoforms were identified with only a single read supporting their structure, however almost 25% of these novel isoforms had 4 or more reads supporting them indicating that these are high confidence novel isoforms.
Characterizing the identified 3’UTRome
Previous analyses of nanopore sequencing have largely centered on splice isoform identification and characterization while largely ignoring the 3’UTR. Because dRNAseq relies on sequencing in the 3’ to 5’ direction of mRNAs isolated by their poly(A) tails, full-length sequences of 3’UTRs are preferentially captured. After adapter trimming, discarding reads with large 3’ softclips, and realigning the 3’ softclipped portions of the remaining reads, we identified putative poly(A) cleavage sites and predicted stop codons to define full-length 3’UTRs. Using this method, 16,325 unique 3’UTR isoforms were identified, with over 10,000 3’UTRs identified in each stage (Figure 3A, Supplemental Table 3).
To determine the accuracy of this 3’UTR calling method, we compared the 3’UTRs identified by this method with those from previously published datasets (including 3P-Seq and 3’RACE data) generated in C. elegans (Mangone et al. 2010; Jan et al. 2011). Of our identified 3’UTRs, 82.9% overlap with one or more of these 3’UTR datasets (Figure 3B). In addition, we identified 2,304 novel 3’UTRs that do not fall within 10 bp of existing 3’UTRs or WormBase 3’UTR annotations (Figure 3C). The 3’UTR length distribution in our data was nearly identical to those observed by Jan et al and Mangone et al (Figure 3D). In agreement with Mangone et al, our 3’UTR length distributions change over developmental stages, progressively decreasing from L1 through L4, and shorter in males than in hermaphroditic adults (Figure 3E). Curiously, the 3’UTR length distributions in adult stages were slightly longer than the length distribution of L4 3’UTRs in our datasets, in contrast to Mangone et al, which showed that adult 3’UTRs had a slightly shorter average 3’UTR length than L4.
Given 3’UTRs were shown to change length over development, we investigated whether PAS usage changed during development. Frequency of canonical and alternative PAS usage was quite consistent between adjacent developmental stages, although by chi squared tests there were significant differences in overall PAS usage between the L4 and young adult stage, as well as between hermaphroditic young adults and males (Figure 3F). Given that distribution of canonical and alternative PAS usage are consistent across the larval stages, where a significant shift in 3’UTR length distributions occurs, this suggests that 3’UTR length changes over development are largely independent of PAS usage.
As a final metric for the accuracy of this 3’UTRome, we plotted nucleotide distributions in windows around identified PAS sites and around putative cleavage sites (Figure 3G). This largely agrees with previously published nucleotide distributions in windows around identified PAS sites (Mangone et al. 2010). These distributions are AT-rich, with a peak in T frequencies just 3’ of the PAS site. It is possible that 3’UTRs identified by our method were inaccurate and broadly distributed around true cleavage sites, and by anchoring nucleotide distributions with putative PAS sites at −19 nucleotides the impact of these errors was eliminated. To test this possibility, we generated a density plot of the offsets of identified PAS sites from putative cleavage sites identified by our method, and found that these offsets were enriched close to the canonical −19 nucleotides from putative cleavage sites, indicating cleavage site calls from this method are accurate within a few base pairs (Supplemental Figure 4A & B).
Notably, at 3’UTR sites without a putative PAS identified, the nucleotide distribution observed lacks the enrichment of As in a window around the cleavage site noted in Mangone et al. Our method may be capturing a different set of 3’ UTRs with no PAS than the Mangone dataset did. Supporting this possibility, only 27% of the no PAS 3’UTRs in our dataset overlap with a Mangone et al 3’UTR, as compared with 73% of canonical PAS 3’UTRs in our data, and 65% of alternative PAS 3’UTRs in our data (Supplemental Figure 4C). In addition, the no PAS 3’UTRs that do overlap with a Mangone 3’UTR have a different nucleotide distribution than the no PAS Mangone 3’UTRs in general (Supplemental Figure 4D) (Mangone et al. 2010).
Properties of poly(A) tail lengths
Poly(A) tails are known regulators of translation and transcript stability. However, profiling of poly(A) tail lengths at the transcriptome-wide level using short read sequencing is a relatively recent advance in the field (Subtelny et al. 2014; Chang et al. 2014; Lim et al. 2016). We have previously shown that, using a trained hidden Markov model, one can estimate the poly(A) tail length of dRNAseq reads using nanopolish (Workman et al. 2018). We performed these estimations on our datasets, providing a developmentally resolved poly(A) profiling dataset.
Global poly(A) tail length distributions are dynamic in the developing Drosophila melanogaster oocyte and embryo (Lim et al. 2016). To determine if there were comparable shifts in our poly(A) tail length distributions, we examined poly(A) tail lengths across the developmental stages in C. elegans. The poly(A) tail length distributions display only modest fluctuations, ranging from median values of 49 nt (L1) to 54 nt (L2) during larval development, although these shifts were considered to be statistically significant by Kolmogorov-Smirnov and Mann-Whitney U tests (Figure 4A). However, length distribution in all adult stages (young and mature hermaphrodites and males) are consistently longer than in the larval stages, with a median length of 58 in adults compared to an aggregate median length of 52 across all larval stages (p < 2.2e-16 by Kolmogorov-Smirnov and Mann-Whitney U tests). These data suggest that the most significant regulation of poly(A) tail lengths occurs between larval and adult stages during development.
As a means of confirming the validity of our poly(A) tail length profiling approach, we compared our poly(A) estimates from the L4 stage with previously published poly(A) measurements from the L4 stage of C. elegans from mTAILseq (Lima et al. 2017). The length scale distributions of our L4 data and this dataset are quite similar, as both have peaks around 30-40 nt and extended spread toward the longer tail length range. (Supplemental Figure 5A). However, we did not identify the shoulder peaks present in the Lima et al dataset (Lima et al. 2017).
An advantage of profiling poly(A) tail lengths with dRNAseq versus short read sequencing is that poly(A) tail lengths are directly coupled to information about the splice isoforms and 3’UTR isoforms of the associated read. This allows comparisons and correlations between poly(A) tail lengths and aspects of transcript structure. One possible driver of differences in poly(A) tail lengths between reads could be that poly(A) tail length distributions may vary depending on whether the associated 3’UTR has a canonical PAS site. To test this possibility, we plotted poly(A) tail length distributions versus PAS type (i.e. canonical AAUAAA, alternative PAS, and no PAS) for reads from the L1 stage (Figure 4B). We find that all PAS types are significantly different from one another by Kolmogorov-Smirnov and Mann-Whitney-U tests (p < 2.2e-16), and 3’UTRs with no PAS have longer poly(A) tail lengths, on average, than poly(A) tails associated with either canonical and alternative PAS, with a median poly(A) tail length of 58 nt for 3’UTRs with no PAS, 46 nt for 3’UTRs with alternative PAS, and 48 nt for 3’UTRs with canonical PAS.
It has been reported that median poly(A) tail length and expression level are anticorrelated, such that highly expressed genes generally have shorter median poly(A) tail lengths (Lima et al. 2017; Legnini et al. 2019). To determine if this relationship holds in our datasets, we plotted the log of the number of reads supporting a given isoform versus the median poly(A) tail length for that isoform for transcripts with 10 or more reads supporting them (Figure 4C, left panel; Supplemental Figure 5B). A similar inverse correlation between median poly(A) tail length and number of reads supporting that isoform was observed in the L1 to L4 stages and when all stages were pooled (Supplemental Figure 5B). For example, the a isoform of the Y37E3.8 gene is expressed much more than the b.1 isoform (18,161 reads versus 38 reads), and has a significantly shorter poly(A) tail length distribution than the b.1 isoform (Figure 4C, right panel). However, this correlation explains only a small fraction of the overall variation in the data, with the maximum R2 value of 0.1297. Interestingly, in the adult stages (both males and hermaphrodites), the slope of the regression lines between median poly(A) tail length and expression level were much more shallow, and the corresponding R2 values were much weaker with R2 values ranging from 0.0103 to 0.0004 (Figure 4C, middle panel; Supplemental Figure 5B). These results suggest that the inverse relationship between poly(A) length and expression level may vary depending on the developmental stage.
A recent study using FLAMseq, a PacBio sequencing method that also captures poly(A) tails and full length transcripts, demonstrated that poly(A) tail length and 3’UTR length were positively correlated (Legnini et al. 2019). Examining poly(A) tail length and 3’UTR lengths across all reads in our data, we also identify this same relationship (Figure 4D left panel). For example, the longer par-5 3’UTR isoform (termed 3’UTR 0; 486 nt) also has a longer poly(A) tail (median length 71) versus the shorter par-5 3’ UTR isoform (3’UTR 2; 51 nt) with a shorter poly(A) tail length distribution (median length 46) (Figure 4D right panel). However, the overall strength of this relationship also varies between developmental stages, and the slopes of the regression lines (and the corresponding R2 values) are smaller in adult stages than in larval stages (Figure 4D middle panel; Supplemental Figure 5C).
Finally, we examined the poly(A) tail length distributions between transcripts that are fully spliced versus those with retained introns. We previously showed in the human cell line GM12878 that intron retention correlates with transcripts with longer poly(A) tails (Workman et al. 2018). In our C. elegans datasets, we also found a positive correlation between intron retention and poly(A) tail length distributions by Kolmogorov-Smirnov and Mann-Whitney U tests, suggesting a conserved mechanism whereby nuclear transcripts possess longer poly(A) tails and supporting a model (Lima et al. 2017) in which poly(A) tails may be subject to post-transcriptional processing by deadenylation once exported into the cytoplasm.
A public resource for full-length isoform information
To make our transcriptome dataset accessible to the research community, we have created a public custom track hub (https://bx.bio.jhu.edu/track-hubs/dRNAseq/hub.txt). This track hub contains the full-length filtered and non-filtered reads from each developmental stage, as well as the full-length isoforms supported across all stages at each locus. To ease access to this track hub, we registered it with the Track Hub Registry (https://www.trackhubregistry.org/). Users can therefore easily load this track hub in Ensembl (Zerbino et al. 2018) based genome browsers by searching public track hubs for “ce11 staged dRNAseq”. As a proof of the utility of this track hub, we loaded the track hub in the Ensembl genome browser and searched for lin-14, a gene with a well-studied 3’UTR that is subject to regulation by the lin-4 microRNA (Wightman et al. 1991, 1993; Lee et al. 1993) but not currently annotated in the WormBase WS265 annotation (Lee et al. 2018; WormBase web site 2018). In our dataset, we identified the lin-14 3’UTR, as well as its splice isoforms, including a novel splice isoform (Figure 5A, “observed isoforms” track). As another example of the utility of this track hub, we searched for the locus mlp-1, a gene with multiple splice and 3’UTR isoforms identified, including multiple novel splice isoforms (isoforms 3, 5, and 7 of the observed isoform track in Figure 5B). These examples highlight possible uses of this resource by the research community to query currently unannotated 3’UTRs and splice isoforms.
Discussion
Despite years of study, our understanding of the C. elegans transcriptome remains incomplete. Although studies have been performed profiling transcription start sites, splicing in both cis and trans, 3’UTR isoforms, poly(A) tail lengths, RNA base modifications, and gene and isoform expression levels, the short read lengths intrinsic to the prevailing technologies have been limited to examining one or two of these features at a time (Saito et al. 2013; Tourasse et al. 2017; Jan et al. 2011; Mangone et al. 2010; Lima et al. 2017; Zhao et al. 2015; Packer et al. 2019; Hillier et al. 2009). Even within these datasets, short read lengths and reliance on PCR amplification eliminate single molecule resolution, and make correlation of distant features within transcripts impossible. Although our study focuses primarily on splice isoforms, 3’UTR isoforms, and poly(A) tail lengths due to current limitations of nanopore sequencing technologies, in principle, modified approaches to dRNAseq would be capable of capturing all of the above features at a single molecule level.
Nanopore sequencing therefore poses both a unique set of opportunities and challenges that must be addressed in any analysis pipeline. The dRNAseq pipeline FLAIR (Full-length alternative isoform analysis of RNA) utilizes a hybrid sequencing approach in which matched short read sequencing is used to correct splice junctions in reads, and reads are clustered together in to splice isoforms if they share a common set of splice junctions (Tang et al. 2018).
We utilized an approach similar to that used by FLAIR, in which reads are corrected, in our case by an existing annotation, and clustered together by splice isoform. Our approach differs from FLAIR in several ways, including a full-length filtering step that reduces the impact of 3’ bias in our reads. A recent publication examining the utility of dRNAseq and cDNA nanopore sequencing to generate transcriptome annotations independently revealed that many nanopore sequencing reads fail to span the full-length of annotated transcript isoforms, highlighting the need for analysis pipelines that take the possibility of 5’ truncations into account in isoform identification (Soneson et al. 2019). Our full-length filtering approach partially addresses this concern, although, as noted by Soneson and colleagues, doing so reduces the number of usable reads, and likely impacts the quantitative nature of our data. A possible experimental approach to solving this problem could involve ligating a set of known nucleotides to the 5’ end of RNA transcripts after a decapping reaction, allowing for selection of full-length transcripts by filtering for reads flanked by signals corresponding to a poly(A) tail and the 5’ ligated product. This approach would incidentally also address the known problem that 10 - 15 nucleotides at the 5’ end of each strand are unable to be read (Workman et al. 2018).
Also distinguishing our approach from FLAIR is a novel means of calling 3’UTRs that has great utility in the generation of transcriptome annotations. Notably, we identify 3’UTR structures with a standard dRNAseq library preparation protocol meaning that, in principle, any dRNAseq experiment can be used to identify 3’UTRs using our method. The implications of this are potentially wide reaching, as experiments once used for comparative analysis of splice isoforms between conditions may now also be used in comparative analysis of 3’UTR isoforms.
By combining our 3’UTR and splice isoform calls, we identified almost 26,000 full-length transcript isoforms. It is likely that increased depth and additional sequencing of other developmental stages such as embryos and the stress-induced dauer stage would further increase the number of genes and isoforms identified, bringing this dataset closer to capturing the theoretical complete C. elegans transcriptome.
The ability to estimate poly(A) tail lengths for each read is another advantage of dRNAseq. Supporting the validity of our poly(A) profiling approach, the length distribution of the poly(A) tail length estimates we obtain in the L4 stage are quite similar to the distribution in the L4 stage reported by Lima et al, a study utilizing mTAILseq (Lima et al. 2017). Coupling of poly(A) tail lengths to aspects of 3’UTR structure and splice isoform allowed us to identify relationships between putative PAS sites and intron retention transcripts to poly(A) tail lengths. The relationship between PAS sites and poly(A) tail lengths is an interesting result that indicates there may be differential deposition or regulation of poly(A) tail length based on the presence or absence of an upstream PAS sequence. Longer poly(A) tails in intron retention transcripts could be indicative of intron retention transcripts being partially processed RNA still retained in the nucleus, as nuclear RNAs would be shielded from cytoplasmic deadenylase complexes. Neither of these relationships could be discovered by short read sequencing of poly(A) tails, demonstrating the efficacy of full-length single molecule sequencing.
A notable discovery of developmentally resolved poly(A) tail length profiling was the difference in features of poly(A) tail lengths between larval and adult stages. Overall poly(A) tail length distributions were longer in adult stages than in larval stages, and the strength of previously reported correlations between poly(A) tail lengths and expression level and poly(A) tail lengths and 3’UTR lengths were weaker in adult stages than larval stages. One possible explanation for these differences is the development of a functional germline in adult stages. In hermaphrodites, the cytoplasmic polyadenylases gld-2 and gld-4 are known to be active in the germline (Nousch et al. 2017; Suh et al. 2006; Schmid et al. 2009; Millonigg et al. 2014). Given the relative size of the C elegans germline, it is possible that activity of such cytoplasmic poly(A) polymerases may influence global poly(A) tail length distributions.
Finally, we have created a custom track hub for exploration of this dataset by independent researchers. By making this data easily accessible, we hope to provide C. elegans researchers with information related to their genes of interest, providing a resource to identify what isoforms have full-length support in any given developmental stage, and across all stages, as well as the structure of any 3’UTRs that we identify. Given that our dataset provides support for over 7000 isoforms previously lacking full-length support and over 20,000 splice isoforms overall, and given that most isoforms have an associated 3’UTR called, this will be an excellent resource for the C elegans research community. Overall, we have demonstrated the utility of nanopore sequencing in providing support for full-length transcripts, annotating putative 3’UTRs, and interrogating poly(A) tail lengths.
Methods
C. elegans strains, maintenance, and collection
C. elegans N2 worms were grown and maintained under standard laboratory conditions on NGM plates seeded with E. coli OP50 (Stiernagle 2006). Samples for RNA analysis were synchronized by hypochlorite treatment and overnight hatching in M9 buffer. They were plated as L1d at 25°C and staged by pharyngeal pumping. L2, L3, L4 and young adult (YA) worms were collected approximately two hours post-lethargus. L1 worms were collected four hours after plating. Mature adults were collected approximately ten hours post-L4/YA transition. CB1489 [him-8(e1489)IV] adult males were enriched by filtering through 35um mesh.
RNA extraction
Total RNA isolation was performed using TriReagent (Ambion) following the vendor’s protocol, with the following alterations: three rounds of freeze/thaw lysis were conducted prior to the addition of BCP; RNA was precipitated in isopropanol supplemented with glycogen for one hour at −80°C; RNA was pelleted by centrifugation at 4°C for 30 min at 20,000 x g; the pellet was washed three times in 70% ethanol; the pellet was resuspended in water.
Library preparation and sequencing
Approximately 20 µg aliquots of total RNA were diluted to a total volume of 100 µl in nuclease free water and poly-A selected using NEXTflex Poly(A) Beads (BIOO Scientific Cat#NOVA-512980). Up to 600 ng of the resulting poly-A RNA was separately aliquoted for library generation. Any excess poly-A selected RNA was stored at −80°C. Biological poly-A RNA and a synthetic control (Lexogen SIRV Set 3, 2.5 ng) were prepared for nanopore direct RNA sequencing generally following the ONT SQK-RNA001 kit protocol, including the optional reverse transcription step recommended by ONT. One difference from the standard ONT protocol was use of Superscript IV (Thermo Fisher) for reverse transcription. RNA sequencing on the GridION platform was performed using ONT R9.4 flow cells and the standard MinKNOW protocol script (NC_48Hr_sequencing_FLO-MIN106_SQK-RNA001).
Preprocessing and alignments
Reads were basecalled and trimmed of adapter sequences using Poreplex version 0.3.1 (running Albacore version 2.3.1) with the following parameters: -p 24 --trim-adapter --basecall (https://github.com/hyeshik/poreplex). For each of our samples, reads were aligned to the WBcel235 ce11 genome using minimap2 version 2.14-r883 (Li 2018). Genomic alignments were run with the following parameters: -ax splice -k14 -uf --secondary=no -G 25000 -t 24. The resulting.sam files were converted to bam format using samtools view with parameters -b -F 2048 (Li et al. 2009).
Read filtering
Our first filtering step involved removing reads aligning to the genome with large insertions (>20bp) and large 3’ softclips (>20bp) that could be the result of not properly aligning internal or 3’ exons respectively. This filtering step ensures that novel isoforms identified in downstream scripts are not false positives resulting from poor alignments.
Following this, reads were filtered based on their QC tags from the polyA estimation module of the program nanopolish (Workman et al. 2018). Reads were removed from consideration if they had QC tags “READ_FAILED_LOAD”, “SUFFCLIP”, or “NOREGION”. This was meant to remove reads without a detectable poly(A) tail signal, to prevent inclusion of reads with truncated 3’ ends.
Next, for the purposes of better identifying 3’UTR isoforms in downstream analysis, 3’ soft-clips were realigned using a semi-global aligner with affine gap penalties anchored at the 3’ end of the original alignment. This resulted in more uniform 3’ ends of alignment. The resulting realigned reads were converted to bed12 format using the bedtools bamtobed function (version 2.27.1) (Quinlan 2014; Quinlan and Hall 2010).
To identify full length reads we made use of the Wormbase (release WS265) gff3 gene annotation file (WormBase web site 2018). We converted the Wormbase coding sequence annotations in this file to bed format using a custom python script, resulting in a cds.bed file. We then intersected the cds.bed file with the bed files describing our genomic alignments using the bedtools intersect function, with the flags -s -F 1.0 -u, which enforces that reads span the full length of an annotated CDS (with the correct strandedness) in order to be considered a full-length read. Following this, we collected the read IDs of the resulting full-length reads, and used these IDs to filter genomic alignment files in to full-length and non full-length reads using custom python scripts utilizing pysam (https://github.com/pysam-developers/pysam) version 0.14.1 (Li et al. 2009).
Reads were then filtered to ensure their 5’ ends were within 100 bp of an annotated transcript start site in the Wormbase WS265 gff3 file. This was meant to further reduce the impact of 5’ truncated reads.
To account for errors in splice junction alignments, we used the Wormbase WS265 gff3 annotation to define canonical donor and acceptor splice sites, and assigned each donor and acceptor splice site in our reads to a canonical splice site. Non-canonical donor and acceptor splice sites in our reads that fell within 15 bp of a canonical site were assigned to that site. Reads that contained non-canonical donor and acceptor splice sites that were not within 15 bp of a canonical site were thrown out, and not considered for the purposes of defining splice isoforms or UTRs. In addition, reads were thrown out if splice junctions in that read corresponded to annotated splice junctions from more than one gene. This allowed us to unambiguously assign each spliced read to a gene based on its correspondence to annotated donor and acceptor splice sites. Reads were assigned to splice isoforms in a similar manner (however some of these assignments were ambiguous when two annotated isoforms were comprised of the same sets of splice junctions). For non spliced reads, we assign gene ids based on overlap with single exon genes present in the annotation.
Finally, we separated reads that had exons that span the full length of any intron in the annotation that is not fully spanned by an exon in the annotation. We do this to remove intron retention transcripts from consideration in defining putative isoforms, as we believe these reads to be nuclear RNA that has not fully been processed, which, if included would artificially inflate the number of identified isoforms. Intron retention reads are considered in analysis of poly(A) tail length distributions, as well as in the comparison of poly(A) tail length distributions in fully spliced versus intron retention transcripts.
Reads were excluded from consideration in 3’UTR calling (but not splice isoform calling) if their original minimap2 alignments had 3’ softclips larger than 10 bases long. This exclusion prevented reads with 3’ ends that failed to align well from being considered, and reduced the variation in considered 3’ alignment ends significantly.
Splice isoform identification
After these stringent filtering steps, we extracted the sequences from the ce11 WBcel235 genome corresponding to each aligned read using getfasta function of the program bedtools with the following flags -s -split -bedOut (Quinlan 2014). We then clustered reads (and their associated sequences) together in to putative isoforms if the reads shared a common set of splice junctions. This resulted in reads clustered by splice isoform. For each of these sets of reads corresponding to splice isoforms, we chose a representative read by selecting the longest read. From this representative read, we extracted information about the isoform including putative coding sequence by identifying the longest open reading frame (with both start and stop codons) present in the read’s associated sequence. This allowed us to define putative start and stop codons.
Splice isoforms were called as novel if they contained a set of splice junctions not previously annotated in the reference. To deal with the possibility of 5’ truncated reads artificially inflating our novel isoform counts, we considered all possible 5’ truncations of previously annotated transcripts in the WormBase WS265 annotation file when defining our reference.
3’ UTR calling
To identify putative 3’UTRs, reads were first grouped by their putative stop codons and any splice junctions that occurred downstream of that stop codon. For each read in each of these groups the 3’ most base in their alignment was extracted. These end positions were then used to generate a gaussian kernel density estimate (using the python package seaborn, version 0.9.0 kdeplot function with a specified kernel width of 10). Local maxima in this kernel density estimate were identified, and reported as a putative 3’UTR cleavage site if there were at least 3 read end positions within 10 bp of that local maxima. Reads were assigned to a given 3’UTR if that UTR’s putative cleavage site was the closest UTR cleavage site to the end position of the read, and if the end position of the read and the putative cleavage site were within 10 bp of each other.
Poly(A) tail length estimation
Poly(A) tail lengths were estimated from raw signal for each read using the polya function of the program nanopolish (version 0.10.2) (Workman et al. 2018). Poly(A) tail length estimates were only considered if the QC tag reported by nanopolish was PASS. Poly(A) tail length estimates were grouped by gene and isoform using the gene and isoform assignments for each read derived from comparison of genomic alignments with the splice junctions in the Wormbase WS265 gff3 reference.
Calculating coverage for the metagene plot
To generate the metagene plot displayed in Figure 1B, we calculated coverage across every gene (as defined by the ce11 WB245 wormbase .gtf annotation file converted to bed format) using the pybedtools coverage function (Dale et al. 2011; Quinlan and Hall 2010; Lee et al. 2018). We then summed these coverage values together, and normalized the resulting values by dividing each value by the sum of all the coverage values. Genes sizes were scaled such that the size of the gene body and the UTRs were always the same.
Determining full length support from WormBase annotations
A WormBase splice isoform was said to have full length support if every one of its introns in the WS265 annotation gff3 was annotated to have support from the same EST or the same cDNA (Lee et al. 2018; WormBase web site 2018). This restricted our analysis to only consider isoforms that were annotated as having introns, and excluded single exon genes and genes without introns annotated in the gff3 annotation file (which includes all non-coding RNAs). To account for this, when comparing the number of genes and isoforms we support to the number of genes and isoforms with full length support in WormBase, we only considered splice isoforms from our dataset that corresponded to an isoform from the restricted WormBase isoform set.
3’UTR comparisons
We compared our 3’UTRs to the 3’UTRs identified in Jan et al and Mangone et al using a custom script that required putative stop codons match identically, but allowed for a 10 bp tolerance in putative 3’UTR end positions (Jan et al. 2011; Mangone et al. 2010). We identified novel 3’UTRs in a similar manner, but also added consideration of WormBase annotated 3’UTRs.
Calling PAS sites
We identified PAS sites in a method similar to that used by Mangone et al, in which we searched the 60 nucleotides upstream of the putative cleavage site for putative PAS hexamers (Mangone et al. 2010). Rather than recalculating the frequency of putative PAS hexamers upstream of our putative cleavage sites, we used the PAS hexamers specified in Table S5 of Mangone et al and searched for these hexamers in the order they appear in that Table. Once a putative PAS site was identified, the UTR was assigned that PAS hexamer. If the 3’UTR had none of the hexamers present in the table in it’s upstream sequence, the UTR was said to have no PAS.
Plotting PAS nucleotide distributions
To plot the nucleotide distribution around a given type of PAS site, we first sorted sequences by their PAS type. For canonical and alternative PAS sites, nucleotide distributions were anchored such that the PAS site began at −19 nucleotides. The percentage of use of each base at each position in a window around the PAS site was then calculated. For UTRs with no PAS identified, the nucleotide distribution was calculated such that the putative cleavage site was at position 0.
Software availability
Code required to replicate the analyses performed in this paper is available on GitHub at https://github.com/NatPRoach/c_elegans_dRNAseq_analysis.
Data Access
Both raw fast5 and basecalled fastq have been deposited at the European Nucleotide Archive (ENA) and can be found under accession number PRJEB31791.
Author contributions statement
AA collected the developmental samples for C. elegans and isolated the RNA to be sequenced and NS performed the sequencing. NR performed all of the sequencing analysis in consultation with J.T.. All authors reviewed the manuscript.
Disclosure Declaration
NPR, NS, & WT were reimbursed for conference fees, travel, and accommodation to speak at events organized by Oxford Nanopore Technologies (ONT). WT has two patents licensed to ONT (8,748,091 and 8,394,584).
Acknowledgements
This work was supported by a grant from the NIH to JKK. (NIH R01GM118875) and by a Johns Hopkins Discovery Award to WT, JT, and JKK. NPR was partly supported by a training grant awarded to the Johns Hopkins Cell, Molecular, Developmental Biology and Biophysics program (NIH T32GM007231). We thank Mindy Clark for the C. elegans life cycle diagram in Figure 1A. We thank Mallory Freeberg for initial computational analyses comparing nanopore based cDNA and dRNA sequencing that led us to utilize dRNAseq in this study.