ABSTRACT
High-throughput RNA sequencing allows genome-wide analyses of pre-mRNA splicing across multiple conditions. However, the increasing number of available datasets represents a major challenge in terms of time and storage required for analyses. Here we describe SUPPA, a computational pipeline to calculate relative inclusion values of alternative splicing events, exploiting fast transcript quantification of a known annotation. SUPPA provides a fast and accurate approach to calculate inclusion levels of alternative splicing events from a large number of samples, thereby facilitating systematic analyses in the context of large-scale projects using limited computational resources. SUPPA is available at https://bitbucket.org/regulatorygenomicsupf/suppa under the MIT license and is implemented in Python 2.7.
INTRODUCTION
Alternative splicing plays an important role in many cellular processes and has been linked to multiple diseases. High-throughput RNA sequencing has facilitated the study of alternative splicing at genome scale and under multiple conditions. However, as more datasets become available, the time and storage space required for analysis become a bottleneck, representing a major obstacle for large-scale projects and for the analysis of publicly available data. We have developed SUPPA, an effective computational pipeline for the rapid calculation of the relative inclusion values of alternative splicing events from a large number of samples.
The description of alternative splicing in terms of events facilitates their experimental validation and their characterization in terms of regulatory mechanisms, and is also motivated by the current limitations in transcript reconstruction from short sequencing reads. On the other hand, recent developments in the quantification of known transcripts have shown that considerable quality can be achieved at very high speed (Li et al. 2011, Patro et al. 2014, Zhang et al. 2014). These methods can thus provide, in a very short time, an accurate census of mRNA molecules in a given condition for a deeply annotated genome such as human (Harrow et al. 2012). The relative inclusion, or PSI, of a splicing event is generally defined as the fraction of mRNA isoforms that include an exon or a specific form of the event (Wang et al. 2008, Brosseau et al. 2010), and is often estimated from the reads falling specifically on either form of the event (Wang et al. 2008). SUPPA instead calculates PSI values directly from transcript isoform abundance values. We argue that this provides a fast and accurate approach to systematic splicing analysis.
METHODS
An alternative splicing event is a binary representation of a local splicing variation in a gene. In this context, the PSI value of an event can be defined as the ratio of the abundance of mRNAs that include one form of the event, F1, over the abundance of mRNAs that contain either form of the event, F1∪F2. Given the abundances for all transcript isoforms, assumed without loss of generality to be given in transcripts per million (TPM) units (Li et al. 2010), which we denote as TPMk, SUPPA calculates the PSI (Ψq) for an event as follows:
\[ \Psi_q = \frac{\sum_{k \in F_1} \mathrm{TPM}_k}{\sum_{j \in F_1 \cup F_2} \mathrm{TPM}_j} \]
SUPPA performs this calculation in two steps. The first one, generateEvents, reads an annotation file in GTF format and produces information about the alternative splicing events in the annotation. SUPPA calculates exon skipping events, alternative 5’ and 3’ splice-sites, mutually exclusive exons, intron retention, and alternative first and last exons (Supp. Fig. 1). The Ψq values for all events are then calculated in a second step with the operation psiPerEvent, which takes as input the output from generateEvents and the abundances for all transcripts in one or more samples, previously obtained with a fast transcript quantification method (Figure 1). Although SUPPA is limited to the splicing events available in the gene annotation, the set of events can be expanded with novel transcript variants obtained by other means. SUPPA also includes a tool to combine multiple input files from the transcript quantification and to obtain PSI values for transcript isoforms. More details are given in the documentation of the software.
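As an illustration of the definition above (the abundance of transcripts in F1 divided by the abundance of transcripts in F1∪F2), the following minimal Python sketch computes Ψq from a table of TPM values. It is not SUPPA's own code: the function name `psi_from_tpm`, the dictionary-based inputs, the transcript and event identifiers, and the choice to return `None` when no transcript of the event is expressed are all assumptions made for this example.

```python
def psi_from_tpm(events, tpm):
    """Compute a PSI value per event from transcript abundances.

    events: dict mapping an event id to a pair (form1, form2), where each
            element is a set of transcript ids (hypothetical input layout).
    tpm:    dict mapping a transcript id to its abundance in TPM.
    Returns a dict mapping each event id to PSI, or None when the summed
    abundance over both forms is zero (an assumption of this sketch).
    """
    psi = {}
    for event_id, (form1, form2) in events.items():
        either_form = set(form1) | set(form2)
        # Numerator: abundance of transcripts carrying form 1 of the event.
        included = sum(tpm.get(t, 0.0) for t in form1)
        # Denominator: abundance of transcripts carrying either form.
        total = sum(tpm.get(t, 0.0) for t in either_form)
        psi[event_id] = included / total if total > 0 else None
    return psi


# Toy data: two transcripts contain form 1, one transcript contains form 2.
events = {
    "event_A": ({"tx1", "tx2"}, {"tx3"}),
    "event_B": ({"tx4"}, {"tx5"}),  # neither transcript is expressed
}
tpm = {"tx1": 6.0, "tx2": 2.0, "tx3": 8.0, "tx4": 0.0, "tx5": 0.0}

psi = psi_from_tpm(events, tpm)
print(psi)
```

For `event_A` the included abundance is 6 + 2 = 8 TPM out of 16 TPM in total, giving a PSI of 0.5; `event_B` has no expressed transcript, so the sketch reports it as undefined rather than dividing by zero.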
RESULTS
To calculate 107,506 alternative splicing events from the Ensembl 75 annotation (37,494 genes, 135,521 transcripts), generateEvents took 10 minutes on a 2.5 GHz Intel Xeon, and 2 mins and 43 secs on a 2.9 GHz Intel Core i7 processor. On the other hand, psiPerEvent took less than a minute on both machines to obtain the Ψq values (output size 26 MB) for these events using 8 RNA-Seq samples previously processed with Sailfish. Similarly, psiPerEvent took 4 mins and 50 secs on a 2.5 GHz Intel Xeon for 929 breast tumor samples from TCGA for 40,411 events (Supp. Material). Considering that the events only need to be computed once per annotation, SUPPA's speed is very competitive.
We performed a benchmarking analysis by comparing SUPPA Ψq values with those obtained from junction reads (Ψj, defined in Supp. Fig. 2) on a set of non-overlapping alternative splicing events (Supp. Fig. 3). First, simulated reads obtained with FluxSimulator (Griebel et al. 2012) were mapped to the genome with STAR (Dobin et al. 2013), and reads in junctions were counted with sjcount (Pervouchine et al. 2013) to calculate Ψj values. The same simulated reads were used to quantify transcript abundances with Sailfish (Patro et al. 2014). SUPPA Ψq shows a high correlation with Ψj (Pearson R=0.94) over 1041 non-overlapping events, including exon skipping, alternative 5’/3’ splice-site and mutually exclusive exon events (Figure 2 and Supp. Fig. 4). This correlation is also high (R=0.92) when comparing SUPPA results to the values obtained from the simulated number of molecules (Supp. Fig. 5).
We also used RNA sequencing from nuclear and cytosolic fractions from MCF7 and MCF10 cells. Correlations between biological replicates were high for SUPPA Ψq values (Pearson R=0.91-0.93) and improved when only genes with TPM>1 (calculated as the sum of TPMs of the transcripts in each gene) were used (Pearson R=0.97-0.99) (Supp. Figs. 6-9). SUPPA correlations between replicates were in all cases superior to those of the junction-based values (Supp. Fig. 10), calculated as above. Moreover, SUPPA systematically recovers more events than the junction-read approach at similar correlation values between replicates (Supp. Material). Comparison between Ψq and Ψj values, using events with more than 20 junction reads in genes with TPM>1, showed a good correspondence in all samples tested (Pearson R=0.73-0.78) (Supp. Figs. 11-14). Moreover, to account for 3’ end sequencing biases (Supp. Fig. 15), the analysis was repeated quantifying protein-coding sequences only and comparing events occurring in CDS regions (see Supp. Material). This showed a considerable increase in the correlation between Ψq and Ψj values (R=0.89-0.93) (Figure 2 and Supp. Figs. 16-19).
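The agreement reported above is summarized as a Pearson correlation between two sets of PSI values for the same events. The sketch below shows one way such a comparison could be computed in plain Python. It is only an illustration under stated assumptions: the function name `pearson_r`, the convention of marking undefined PSI values with `None` and skipping those events, and the toy values are hypothetical and are not taken from the benchmark itself.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally ordered lists of PSI values.

    Events where either value is None (PSI undefined) are skipped, which is
    an assumption of this sketch rather than a documented SUPPA behavior.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two events with both values defined")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    # Sums of squares and cross-products of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant values")
    return sxy / math.sqrt(sxx * syy)


# Toy PSI vectors for the same four events (illustrative values only).
psi_q = [0.10, 0.50, None, 0.90]
psi_j = [0.20, 0.60, 0.30, 1.00]

r = pearson_r(psi_q, psi_j)
print(round(r, 3))
```

In this toy case the third event is dropped because its Ψq is undefined, and the remaining three pairs differ by a constant offset, so the correlation is 1 up to floating-point rounding.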
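The gene-level expression filter used above (gene TPM calculated as the sum of the TPMs of its transcripts, keeping genes with TPM>1) can be sketched as follows. The function name `expressed_genes`, the dictionary inputs and the identifiers are assumptions made for illustration; the original analysis may have been implemented differently.

```python
from collections import defaultdict


def expressed_genes(transcript_to_gene, tpm, min_tpm=1.0):
    """Return the set of genes whose summed transcript TPM exceeds min_tpm.

    transcript_to_gene: dict mapping transcript id to gene id (hypothetical).
    tpm:                dict mapping transcript id to abundance in TPM.
    """
    gene_tpm = defaultdict(float)
    for transcript_id, gene_id in transcript_to_gene.items():
        # Transcripts missing from the quantification count as 0 TPM.
        gene_tpm[gene_id] += tpm.get(transcript_id, 0.0)
    return {gene for gene, total in gene_tpm.items() if total > min_tpm}


# Toy data: geneA passes only because its two transcripts are summed.
transcript_to_gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
tpm = {"tx1": 0.6, "tx2": 0.7, "tx3": 0.4}

kept = expressed_genes(transcript_to_gene, tpm)
print(kept)
```

Events would then be retained only if they belong to a gene in the returned set. Summing over transcripts means a gene can pass the threshold even when each individual isoform is below 1 TPM, as `geneA` does here.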
As an additional benchmark, we calculated TPM values for transcript isoforms using the RNA-Seq data for lung adenocarcinoma from the TCGA project (https://tcga-data.nci.nih.gov/tcga/), and applied SUPPA to 55 tumor and paired normal samples. The Ψ values calculated with SUPPA for two splicing events in the genes NUMB and BIN1, known to be upregulated in lung tumors (Misquitta-Ali et al. 2011, Zong et al. 2014), show a good correlation with the Ψ values calculated with junction reads (Supp. Fig. 20). Further details are provided as supplementary material.
ACKNOWLEDGEMENTS
The authors acknowledge useful discussions with S. Mount, S. Janga, Y. Barash and M. Robinson. Funding: Spanish Government (BIO2011-23920 and Consolider RNAREG CSD2009-00080), Sandra Ibarra Foundation for Cancer, and Spanish National Institute of Bioinformatics (INB).
Footnotes
* eduardo.eyras@upf.edu