SUPPA: a super-fast pipeline for alternative splicing analysis from RNA-Seq

Gael P. Alamancos; Amadís Pagès; Juan L. Trincado; Nicolás Bellora; Eduardo Eyras

doi:10.1101/008763

ABSTRACT

High-throughput RNA sequencing allows genome-wide analyses of pre-mRNA splicing across multiple conditions. However, the increasing number of available datasets represents a major challenge in terms of time and storage required for analyses. Here we describe SUPPA, a computational pipeline to calculate relative inclusion values of alternative splicing events, exploiting fast transcript quantification of a known annotation. SUPPA provides a fast and accurate approach to calculate inclusion levels of alternative splicing events from a large number of samples, thereby facilitating systematic analyses in the context of large-scale projects using limited computational resources. SUPPA is available at https://bitbucket.org/regulatorygenomicsupf/suppa under the MIT license and is implemented in Python 2.7.

INTRODUCTION

Alternative splicing plays an important role in many cellular processes and has been linked to multiple diseases. High-throughput RNA sequencing has facilitated the study of alternative splicing at genome scale and under multiple conditions. However, as more datasets become available, the time and storage space required for analysis become a bottleneck, representing a major obstacle for large-scale projects and the analysis of publicly available data. We have developed SUPPA, an effective computational pipeline for the rapid calculation of the relative inclusion values of alternative splicing events from a large number of samples.

The description of alternative splicing in terms of events facilitates their experimental validation and their characterization in terms of regulatory mechanisms, and is also motivated by the current limitations in transcript reconstruction from short sequencing reads. On the other hand, recent developments in the quantification of known transcripts have shown that considerable quality can be achieved at very high speed (Li et al. 2011, Patro et al. 2014, Zhang et al. 2014). These methods can thus provide, in a very short time, an accurate census of mRNA molecules in a given condition for a deeply annotated genome such as human (Harrow et al. 2012). The relative inclusion, or PSI, of a splicing event is generally defined as the fraction of mRNA isoforms that include an exon or a specific form of the event (Wang et al. 2008, Brosseau et al. 2010), which is often estimated from the reads falling specifically on either form of the event (Wang et al. 2008). SUPPA calculates PSI values directly from transcript isoform abundance values. We argue that this provides a fast and accurate approach to systematic splicing analysis.

METHODS

An alternative splicing event is a binary representation of a local splicing variation in a gene. In this context, the PSI value of an event can be defined as the ratio of the abundance of mRNAs that include one form of the event, F₁, over the abundance of mRNAs that contain either form of the event, F₁∪F₂. Given the abundances for all transcript isoforms, assumed without loss of generality to be given in transcripts per million (TPM) units (Li et al. 2010), which we denote as TPM_k, SUPPA calculates the PSI (Ψ_q) for an event as follows:

$$\Psi_q = \frac{\sum_{k \in F_1} \mathrm{TPM}_k}{\sum_{j \in F_1 \cup F_2} \mathrm{TPM}_j}$$

SUPPA performs this calculation in two steps. The first one, generateEvents, reads an annotation file in GTF format and produces information about alternative splicing events in the annotation. SUPPA calculates exon skipping events, alternative 5’ and 3’ splice-sites, mutually exclusive exons, intron retention, and alternative first and last exons (Supp. Fig. 1). The Ψ_q values for all events are then calculated in a second step with the operation psiPerEvent, which takes as input the output from generateEvents and the abundances for all transcripts in one or more samples, previously obtained with a fast transcript quantification method (Figure 1). Although SUPPA is limited to the splicing events available in the gene annotation, events can be expanded with novel transcript variants obtained by other means. SUPPA also includes a tool to combine multiple input files from the transcript quantification and to obtain PSI values for transcript isoforms. More details are given in the documentation of the software.
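The PSI definition given here (abundance of transcripts containing one form of the event over the abundance of transcripts containing either form) can be illustrated with a minimal Python sketch. This is not SUPPA's actual code: the function name psi_for_event, the transcript identifiers, the TPM values, and the choice to return NaN when the denominator is zero are all assumptions made for illustration.

```python
import math

def psi_for_event(tpm, f1_ids, f1_or_f2_ids):
    """Illustrative PSI: sum of TPM over F1 divided by sum of TPM over F1 ∪ F2.

    tpm          -- dict mapping transcript id -> TPM (hypothetical input layout)
    f1_ids       -- transcripts that contain one form of the event (F1)
    f1_or_f2_ids -- transcripts that contain either form (F1 ∪ F2)
    """
    numerator = sum(tpm.get(t, 0.0) for t in f1_ids)
    denominator = sum(tpm.get(t, 0.0) for t in f1_or_f2_ids)
    if denominator == 0:
        # Assumption: PSI is treated as undefined when no transcript of the
        # event is expressed; NaN is used here as the marker.
        return float("nan")
    return numerator / denominator

# Hypothetical gene with three annotated transcripts.
tpm = {"tx1": 6.0, "tx2": 2.0, "tx3": 0.0}
f1 = {"tx1"}                 # transcripts with the inclusion form
f1_or_f2 = {"tx1", "tx2"}    # transcripts with either form

psi = psi_for_event(tpm, f1, f1_or_f2)
print(psi)
```

Because the calculation only sums precomputed abundances, its cost per sample depends on the number of events and transcripts rather than on the number of sequencing reads.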

Fig. 1.

Schematic description of how SUPPA works to define events from a gene annotation (generateEvents) and calculate event Ψ values from one or more transcript quantification files (psiPerEvent).

RESULTS

To calculate 107,506 alternative splicing events from the Ensembl 75 annotation (37,494 genes, 135,521 transcripts), generateEvents took 10 minutes on a 2.5 GHz Intel Xeon, and 2 mins and 43 secs on a 2.9 GHz Intel Core i7 processor. On the other hand, psiPerEvent took less than a minute on both machines to obtain the Ψ_q values (output size 26 MB) for these events using 8 RNA-Seq samples previously processed with Sailfish. Similarly, psiPerEvent took 4 mins and 50 secs on a 2.5 GHz Intel Xeon for 929 breast tumor samples from TCGA for 40,411 events (Supp. Material). Considering that the events only need to be computed once per annotation, SUPPA's speed is very competitive.

We performed a benchmarking analysis by comparing SUPPA Ψ_q values with those obtained from junction reads (Ψ_j, defined in Supp. Fig. 2) on a set of non-overlapping alternative splicing events (Supp. Fig. 3). First, simulated reads obtained with FluxSimulator (Griebel et al. 2012) were mapped to the genome with STAR (Dobin et al. 2013) and reads in junctions were counted with sjcount (Pervouchine et al. 2013) to calculate Ψ_j values. The same simulated reads were used to quantify transcript abundances with Sailfish (Patro et al. 2014). SUPPA Ψ_q shows a high correlation with Ψ_j (Pearson R=0.94) over 1041 non-overlapping events, including exon skipping, alternative 5’/3’ splice-site and mutually exclusive exon events (Figure 2 and Supp. Fig. 4). This correlation is also high (R=0.92) when comparing SUPPA results to the values obtained from the simulated number of molecules (Supp. Fig. 5).
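The comparison reported in this paragraph rests on the Pearson correlation between two vectors of PSI values for the same events. A minimal sketch of that computation is given below. The function name pearson_r and the example values are hypothetical, and skipping events whose PSI is undefined (NaN) in either vector is an assumption, not a statement of how the benchmark was run.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of paired values, ignoring pairs that contain NaN."""
    pairs = [(x, y) for x, y in zip(xs, ys)
             if not (math.isnan(x) or math.isnan(y))]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical PSI values for four events; the last event has no PSI from
# transcript quantification and is therefore skipped.
psi_q = [0.1, 0.5, 0.9, float("nan")]
psi_j = [0.2, 0.6, 1.0, 0.3]

r = pearson_r(psi_q, psi_j)
print(round(r, 3))
```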

Fig. 2.

Correlation (Pearson R) of the PSI values calculated with SUPPA (x-axis) and from junction reads (y-axis), respectively, for 1041 non-overlapping events in coding and non-coding regions using simulated reads (left) and for 2202 non-overlapping events in coding regions using sequencing reads from cytosolic mRNAs in MCF10 cells (right).

We also used RNA sequencing from nuclear and cytosolic fractions from MCF7 and MCF10 cells. Correlations between biological replicates were high for SUPPA Ψ_q values (Pearson R=0.91-0.93) and improved when only genes with TPM>1 (calculated as the sum of TPMs of the transcripts in each gene) were used (Pearson R=0.97-0.99) (Supp. Figs. 6-9). SUPPA correlations between replicates were in all cases superior to those of the junction-based values (Supp. Fig. 10), calculated as above. Moreover, SUPPA systematically recovers more events than junction reads do at a similar correlation value between replicates (Supp. Material). Comparison between Ψ_q and Ψ_j values, using events with more than 20 junction reads in genes with TPM>1, showed a good correspondence in all samples tested (Pearson R=0.73-0.78) (Supp. Figs. 11-14). Moreover, to account for 3’ end sequencing biases (Supp. Fig. 15), the analysis was repeated quantifying protein-coding sequences only and comparing events occurring in CDS regions (see Supp. Material). This showed a considerable increase in the correlation between Ψ_q and Ψ_j values (R=0.89-0.93) (Figure 2 and Supp. Figs. 16-19).
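The two filters mentioned in this paragraph (gene TPM above 1, with gene TPM taken as the sum of the TPMs of the gene's transcripts, and more than 20 junction reads per event) can be sketched as follows. The data layout, the function names gene_tpm and filter_events, and all identifiers and values are hypothetical and serve only to illustrate the filtering logic.

```python
from collections import defaultdict

def gene_tpm(transcript_tpm, transcript_to_gene):
    """Gene-level TPM as the sum of the TPMs of the gene's transcripts."""
    totals = defaultdict(float)
    for transcript, value in transcript_tpm.items():
        totals[transcript_to_gene[transcript]] += value
    return dict(totals)

def filter_events(event_to_gene, gene_totals, junction_reads,
                  min_gene_tpm=1.0, min_reads=20):
    """Keep events in genes with TPM > min_gene_tpm and reads > min_reads."""
    return [event for event, gene in event_to_gene.items()
            if gene_totals.get(gene, 0.0) > min_gene_tpm
            and junction_reads.get(event, 0) > min_reads]

# Hypothetical quantification for two genes with two transcripts each.
transcript_tpm = {"txA1": 0.75, "txA2": 0.5, "txB1": 0.25, "txB2": 0.5}
transcript_to_gene = {"txA1": "geneA", "txA2": "geneA",
                      "txB1": "geneB", "txB2": "geneB"}
event_to_gene = {"evA": "geneA", "evA2": "geneA", "evB": "geneB"}
junction_reads = {"evA": 35, "evA2": 20, "evB": 50}

totals = gene_tpm(transcript_tpm, transcript_to_gene)
kept = filter_events(event_to_gene, totals, junction_reads)
print(totals, kept)
```

In this toy input only evA passes both filters: evA2 has exactly 20 junction reads (not more than 20), and evB lies in a gene whose summed TPM is below 1.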

As an additional benchmark, we calculated TPM values for transcript isoforms using the RNA-Seq data for lung adenocarcinoma from the TCGA project (https://tcga-data.nci.nih.gov/tcga/), and applied SUPPA to 55 tumor and paired normal samples. The Ψ values calculated with SUPPA for two splicing events in the genes NUMB and BIN1, known to be upregulated in lung tumors (Misquitta-Ali et al. 2011, Zong et al. 2014), show a good correlation with the Ψ values calculated with junction reads (Supp. Fig. 20). Further details are provided as supplementary material.

ACKNOWLEDGEMENTS

The authors acknowledge useful discussions with S. Mount, S. Janga, Y. Barash and M. Robinson. Funding: Spanish Government (BIO2011-23920 and Consolider RNAREG CSD2009-00080), Sandra Ibarra Foundation for Cancer, and Spanish National Institute of Bioinformatics (INB).

Footnotes

* eduardo.eyras{at}upf.edu

REFERENCES

Brosseau JP, et al. (2010) High-throughput quantification of splicing isoforms. RNA 16(2):442–9.
Dobin A, et al. (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29(1):15–21.
Griebel T, et al. (2012) Modelling and simulating generic RNA-Seq experiments with the flux simulator. Nucleic Acids Res. 40(20):10073–83.
Harrow J, et al. (2012) GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 22(9):1760–74.
Li B, et al. (2010) RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics 26(4):493–500.
Li B, et al. (2011) RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12:323.
Misquitta-Ali CM, et al. (2011) Global profiling and molecular characterization of alternative splicing events misregulated in lung cancer. Mol Cell Biol. 31(1):138–50.
Patro R, et al. (2014) Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat Biotechnol. 32(5):462–4.
Pervouchine DD, et al. (2013) Intron-centric estimation of alternative splicing from RNA-seq data. Bioinformatics 29(2):273–4.
Wang ET, et al. (2008) Alternative isoform regulation in human tissue transcriptomes. Nature 456(7221):470–6.
Zhang Z & Wang W. (2014) RNA-Skim: a rapid method for RNA-Seq quantification at transcript level. Bioinformatics 30(12):i283–i292.
Zong FY, et al. (2014) The RNA-binding protein QKI suppresses cancer-associated aberrant splicing. PLoS Genet. 10(4):e1004289.
