## Abstract

We have built a statistical package called *Ballgown* for estimating differential expression of genes, transcripts, or exons from RNA sequencing experiments. *Ballgown* is designed to work with the popular *Cufflinks* transcript assembly software and uses well-motivated statistical methods to provide estimates of changes in expression. It permits statistical analysis at the transcript level for a wide variety of experimental designs, allows adjustment for confounders, and handles studies with continuous covariates. *Ballgown* provides improved statistical significance estimates as compared to the *Cuffdiff2* differential expression tool included with *Cufflinks*. We demonstrate the flexibility of the *Ballgown* package by re-analyzing 667 samples from the GEUVADIS study to identify transcript-level eQTLs and identify non-linear artifacts in transcript data. Our package is freely available from: https://github.com/alyssafrazee/ballgown

A key advantage of RNA sequencing (RNA-seq) over hybridization-based technologies such as microarrays is that RNA-seq makes it possible to reconstruct complete gene structures, including multiple splice variants, from raw RNA-seq reads without relying on previously established annotations [17, 29, 8]. The price for this flexibility is a dramatically larger quantity of raw data [20] and a much greater computational cost for assembly and quantification of transcript expression [25]. The most widely used pipeline for transcript assembly, quantification, and differential expression analysis is the Tuxedo suite, which aligns reads with *Bowtie* and *Tophat2* [10], assembles transcripts with *Cufflinks* [29], and performs differential expression analysis with *Cuffdiff2* [28]. This suite has been used in many influential projects [16, 32, 15], including the ENCODE [5] and modENCODE [9] consortium projects.

We have developed software called *Tablemaker* that takes a GTF file and a set of BAM files and generates a set of linked, tab-delimited text files. These text files contain the structure of assembled transcripts, mappings from exons and splice junctions to transcripts, and expression data measured by FPKM (Fragments Per Kilobase of transcript per Million reads sequenced) and by average per-base coverage (Figure 1a, Supplementary Material: Tablemaker output files). The *Ballgown* package can then be used to read these data into an *R* object that is easy to access and analyze downstream (Figure 1b), and *Ballgown* includes a flexible linear model framework for differential expression analysis (Supplementary Material: Data and Notation, statistical methods for detecting differential expression). It is also possible to link BAM [14] files to the *Ballgown* object and use them to plot the read-level coverage for transcripts of interest. *Ballgown* can work with any assembly tool that outputs assembled transcripts and expression estimates in the same format as *Tablemaker* output (Supplementary Material: Tablemaker output files).
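To make the two expression measurements concrete, the following is a minimal sketch of the textbook FPKM definition and of average per-base coverage. It is not the estimator used by *Cufflinks* or *Tablemaker*, which assign fragments to isoforms with a statistical model; the function names and the toy numbers are hypothetical, and the library-size denominator is whatever total the caller supplies.

```python
def fpkm(fragments: float, transcript_length_bp: int, library_size: int) -> float:
    """Naive FPKM: fragments per kilobase of transcript per million fragments.

    Illustrative only; real tools estimate `fragments` per isoform with a
    likelihood model rather than taking a raw count.
    """
    if transcript_length_bp <= 0 or library_size <= 0:
        raise ValueError("length and library size must be positive")
    kilobases = transcript_length_bp / 1_000
    millions = library_size / 1_000_000
    return fragments / (kilobases * millions)


def average_coverage(per_base_depth: list) -> float:
    """Mean read depth over the bases of a feature (exon, intron, transcript)."""
    if not per_base_depth:
        raise ValueError("feature has no bases")
    return sum(per_base_depth) / len(per_base_depth)


# Hypothetical transcript: 500 fragments, 2,000 bp long, 20 million fragments in the library.
example_fpkm = fpkm(500, 2_000, 20_000_000)          # 500 / (2 kb * 20 M) = 12.5
example_cov = average_coverage([3, 4, 5, 4])         # mean depth over four bases = 4.0
```

Both quantities are per-feature, per-sample numbers, which is why they fit naturally into a feature-by-sample table.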

## Statistical significance comparisons

*Cuffdiff2* is designed for two-class differential expression analysis. It has been observed that *Cuffdiff2* produces conservatively biased statistical results when evaluating differential expression between two groups [7, 28]. To confirm this result, we collected *Cuffdiff2* output from InSilico DB [3] for two experiments with sufficient sample sizes for differential expression analysis (Supplementary Section: Data Analyses, InSilico DB Analysis). The first experiment [11] compared lung adenocarcinoma (*n* = 12) and normal control samples (*n* = 12) in nonsmoking female patients. The second experiment [31] compared cells at five developmental stages. We analyzed the data from two stages: embryonic stem cells (*n* = 34) and pre-implantation blastomeres (*n* = 78). We compared only transcripts with average FPKM greater than one across all samples within a study to avoid test results from transcripts with little or no observed expression.

Comparing transcript expression between tumor and normal samples, or between developmental cell types, should show strong differential expression signals, given the sample sizes and distinct phenotypes. In the cancer versus normal comparison, there were 4454 transcripts with an average FPKM greater than one. *Cuffdiff2* identified 1 transcript as differentially expressed at the FDR 5% level, while *Ballgown*’s F-test identified 2178. When comparing developmental phenotypes, there were 12,469 assembled transcripts with average FPKM greater than one, and *Cuffdiff2* identified 0 differentially expressed transcripts versus *Ballgown*’s 7236. These results on large-scale studies suggest that *Cuffdiff2*’s statistical significance estimates of differential expression at the isoform level show a strong conservative bias (Figure 2a, 2b).
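The two filtering steps used in these comparisons, keeping transcripts with mean FPKM above one and then calling transcripts at a 5% FDR, can be sketched as follows. The text does not say which FDR procedure was applied, so the Benjamini-Hochberg adjustment below is an assumption chosen because it is widely used; the function names and toy values are hypothetical.

```python
def filter_expressed(fpkm_by_transcript: dict, min_mean: float = 1.0) -> list:
    """Return IDs of transcripts whose mean FPKM across samples exceeds min_mean."""
    return [
        tx for tx, values in fpkm_by_transcript.items()
        if sum(values) / len(values) > min_mean
    ]


def benjamini_hochberg(p_values: list) -> list:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy data: three transcripts, two samples each.
table = {"tx1": [5.0, 7.0], "tx2": [0.2, 0.4], "tx3": [1.5, 0.9]}
kept = filter_expressed(table)                 # tx2 is dropped (mean 0.3)

# Toy p-values for four tests; only the first survives a 5% FDR threshold.
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
significant = [i for i, q in enumerate(adj) if q < 0.05]
```

Filtering before testing reduces the number of hypotheses, which in turn makes the multiple-testing adjustment less severe for the transcripts that remain.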

To confirm this result, we created an open-source tool called *polyester* for generating simulated RNA-seq reads from experiments with biological replicates and transcript-level differential expression (Supplementary Material: Simulation studies). We simulated a differential expression experiment with *n* = 10 samples in each of two groups, from *m* = 2,745 annotated transcripts on human chromosome 22 from the Ensembl [6] annotation (GRCh37 build, v74). We set 274 transcripts to be differentially expressed with a fold change of 6 between groups, with an equal number of transcripts differentially expressed in each direction. In the simulated data, *Cuffdiff2* showed the same strong conservative bias, calling 0 transcripts differentially expressed (controlling FDR at the 5% level), compared to 80 using *Ballgown*’s F-test (Supplementary Material: Simulation studies, Model fitting in simulated data). Accordingly, the p-value distributions showed similar patterns to those we observed in the adenocarcinoma and developmental cell datasets (Figure 2c). *Ballgown* also produced a more accurate ranking of transcripts for differential expression than *Cuffdiff2*: 78 of the top 100 transcripts called differentially expressed were truly differentially expressed for *Ballgown* versus 63 for *Cuffdiff2*, a 23% increase in truly differentially expressed transcripts (Figure 2d). We further investigated the source of the conservative bias of *Cuffdiff2* and found that when we sampled reads with equal probability from each transcript, ignoring transcript length, *Cuffdiff2* produced accurate measures of statistical significance (Supplementary Figure 1). This result suggests that the conservative bias may be due to transcript length normalization in the *Cuffdiff2* software.
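The ranking-accuracy comparison amounts to counting how many of the top-ranked transcripts belong to the set that was simulated as differentially expressed. A minimal sketch follows, with hypothetical names and toy data; it assumes transcripts are ranked by ascending p-value.

```python
def true_positives_in_top_k(p_value_by_transcript: dict, truly_de: set, k: int) -> int:
    """Count truly differentially expressed transcripts among the k smallest p-values."""
    ranked = sorted(p_value_by_transcript, key=p_value_by_transcript.get)
    return sum(1 for tx in ranked[:k] if tx in truly_de)


# Toy example: five transcripts, three of them simulated as differentially expressed.
p_values = {"t1": 0.001, "t2": 0.2, "t3": 0.01, "t4": 0.5, "t5": 0.03}
simulated_de = {"t1", "t4", "t5"}

# The three smallest p-values belong to t1, t3, t5; two of those are truly DE.
hits = true_positives_in_top_k(p_values, simulated_de, k=3)
```

Because only the ordering matters, this measure compares methods on ranking quality independently of how well calibrated their p-values are.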

## Flexibility of statistical models

The main advantage of *Ballgown* over *Cuffdiff2* is the added flexibility to compare any nested set of models for differential expression or to apply standard differential expression tools in *Bioconductor*, such as the *limma* package [24] (Supplementary Material: Data and Notation, statistical methods for detecting differential expression). To demonstrate *Ballgown*’s flexibility, we performed two analyses that are not possible with *Cuffdiff2*: modeling continuous covariates and eQTL analysis.
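A nested-model comparison of this kind can be illustrated with the standard F statistic for linear models, computed from the residual sums of squares of a full and a reduced design. The sketch below is a generic ordinary least squares implementation, not *Ballgown*'s code; all names are hypothetical, and any transformation of the expression values before fitting is left to the caller. The p-value would come from an F distribution with (p_full - p_reduced, n - p_full) degrees of freedom, which is omitted here because the Python standard library has no F distribution.

```python
def solve(a: list, b: list) -> list:
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def residual_sum_of_squares(design: list, y: list) -> float:
    """Fit y ~ design by least squares (normal equations) and return the RSS."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * y_k for row, y_k in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum(
        (y_k - sum(b * x for b, x in zip(beta, row))) ** 2
        for row, y_k in zip(design, y)
    )


def nested_f_statistic(full: list, reduced: list, y: list) -> float:
    """F statistic testing whether the extra columns of `full` improve the fit."""
    n, p_full, p_reduced = len(y), len(full[0]), len(reduced[0])
    rss_full = residual_sum_of_squares(full, y)
    rss_reduced = residual_sum_of_squares(reduced, y)
    return ((rss_reduced - rss_full) / (p_full - p_reduced)) / (rss_full / (n - p_full))


# Toy two-group example: intercept-only model versus intercept plus group indicator.
expression = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]
reduced_design = [[1.0] for _ in expression]
full_design = [[1.0, 0.0]] * 3 + [[1.0, 1.0]] * 3
f_stat = nested_f_statistic(full_design, reduced_design, expression)
```

In the toy example the reduced model leaves a residual sum of squares of 28 and the full model leaves 4, giving F = (24 / 1) / (4 / 4) = 24. The same function accepts any pair of designs in which the reduced columns are a subset of the full ones, for instance adding confounders to both or polynomial terms to the full model only.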

### Analysis of quantitative covariates

In the first analysis, we treated RNA Integrity Number (RIN) [22] as a continuous covariate [26] and used *Ballgown*’s modeling framework to discover transcripts in the GEUVADIS dataset [12] whose expression levels were significantly associated with RIN (Supplementary Material: Data Analysis). Of 43,622 assembled transcripts with average FPKM above 0.1, 19,118 showed a significant effect (*q <* 0.05) of RIN on expression, using a natural cubic spline model for RIN and adjusting for population and library size [18]. The populations included in the study were Utah residents with Northern and Western European ancestry (CEU), Yoruba in Ibadan, Nigeria (YRI), Toscani in Italy (TSI), British in England and Scotland (GBR), and Finnish in Finland (FIN).

A previous analysis of the GEUVADIS data modeled variation in RNA quality as a linear effect [1]. We fit this model and, as expected, identified an enrichment of transcripts that showed positive correlation between FPKM values and RNA quality (Supplementary Figure 2). To investigate the impact of using a more flexible statistical model to detect artifacts, we tested whether a third-order polynomial fit for RIN on transcript expression was significantly better than simply including a linear term for RIN, after adjusting for population. We found that the cubic fit was significantly better than the linear fit (*q <* 0.05) for 1,450 transcripts (Figure 3), suggesting that simple linear adjustment for confounding variables such as RNA quality might not be sufficient to capture unwanted sources of variation in transcript data.

### Expression quantitative trait locus analysis

To demonstrate the flexibility of using the post-processed *Ballgown* data for differential expression compared to *Cuffdiff2*, we next performed an eQTL analysis of the 464 non-duplicated GEUVADIS samples across all populations (Supplementary Material: Data Analyses, eQTL analysis). We filtered to transcripts with an average FPKM across samples greater than 0.1 and removed SNPs with a minor allele frequency less than 5%, resulting in 7,072,917 SNPs and 44,140 transcripts. We constrained our analysis to search for *cis*-eQTLs, where the SNP and transcript in each pair were within 1000 kb of each other, resulting in 218,360,149 SNP-transcript pairs. To account for potential confounding factors, we adjusted for the first three principal components of the genotype data [19] and the first three principal components of the observed transcript FPKM data [13]. The analysis was performed in 2 hours and 3 minutes on a standard desktop computer using the MatrixEQTL package [23].
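The SNP filter and the *cis* restriction can be sketched as below. This is an illustration rather than the MatrixEQTL implementation: the function names and data layout are hypothetical, genotypes are assumed to be coded as 0, 1, or 2 copies of one allele, and distance is assumed to be measured from the SNP to the nearest end of the transcript, which the text does not specify. The brute-force pairing is also only suitable for toy inputs.

```python
def minor_allele_frequency(dosages: list) -> float:
    """Minor allele frequency from per-sample allele counts coded 0, 1, or 2."""
    allele_freq = sum(dosages) / (2 * len(dosages))
    return min(allele_freq, 1 - allele_freq)


def cis_pairs(snps: dict, transcripts: dict, window_bp: int = 1_000_000) -> list:
    """List (snp, transcript) pairs on the same chromosome within window_bp.

    snps:        {snp_id: (chromosome, position)}
    transcripts: {transcript_id: (chromosome, start, end)}
    """
    pairs = []
    for snp_id, (snp_chrom, pos) in snps.items():
        for tx_id, (tx_chrom, start, end) in transcripts.items():
            if snp_chrom != tx_chrom:
                continue
            # Zero if the SNP lies inside the transcript, else distance to the nearer end.
            distance = max(start - pos, pos - end, 0)
            if distance <= window_bp:
                pairs.append((snp_id, tx_id))
    return pairs


# Toy genotypes for four samples: allele frequency 5/8, so MAF = 3/8.
maf = minor_allele_frequency([0, 1, 2, 2])

toy_snps = {
    "snpA": ("22", 1_500_000),   # 500 kb upstream of txA: kept
    "snpB": ("22", 5_000_000),   # about 3 Mb downstream: dropped
    "snpC": ("21", 1_500_000),   # different chromosome: dropped
}
toy_transcripts = {"txA": ("22", 2_000_000, 2_010_000)}
pairs = cis_pairs(toy_snps, toy_transcripts)
```

Restricting to *cis* pairs is what keeps the number of tests in the hundreds of millions rather than the product of all SNPs and all transcripts.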

Visual inspection of the distribution of statistically significant results and the corresponding QQ-plot indicated that our confounder adjustment was sufficient to remove major sources of bias (Supplementary Figure 3). We identified significant eQTLs at the FDR 1% level for 17,276 transcripts overlapping 10,624 unique Ensembl-annotated genes. We calculated a global estimate of the number of null hypotheses and estimated that 5.8% of SNP-transcript pairs showed differential expression. Of the transcript-SNP pairs significant at an FDR of 1%, 57% and 78% appeared in the lists of significant transcript eQTLs identified in the original analysis of the EUR and YRI populations individually. Of the eQTL pairs, 14% were identified for transcripts that did not overlap Ensembl-annotated transcripts (Figure 4).
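A global estimate of the proportion of true null hypotheses can be obtained from the p-value distribution alone. The text does not name the estimator used, so the fixed-threshold estimator of Storey below is an assumption shown only to illustrate the idea; the function name, the threshold of 0.5, and the toy p-values are hypothetical.

```python
def estimate_null_proportion(p_values: list, threshold: float = 0.5) -> float:
    """Estimate the fraction of true nulls from the density of large p-values.

    Under the null, p-values are uniform, so the share of p-values above
    `threshold` divided by (1 - threshold) approximates the null proportion.
    """
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    above = sum(1 for p in p_values if p > threshold)
    return min(1.0, above / ((1 - threshold) * len(p_values)))


# Toy p-values: six small ones (likely signal) and four spread over (0.5, 1).
toy_p = [0.001, 0.002, 0.01, 0.02, 0.03, 0.04, 0.6, 0.7, 0.8, 0.9]
pi0 = estimate_null_proportion(toy_p)      # 4 / (0.5 * 10) = 0.8
share_with_signal = 1 - pi0                # estimated fraction of non-null tests
```

One minus the estimated null proportion plays the role of the percentage of tests carrying real signal, without requiring a significance call for any individual test.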

### Computational time comparison

Next we investigated the computational efficiency of our approach compared to the standard *Cufflinks* pipeline. *Tophat* and *Cufflinks* can be run on each sample separately, but *Cuffdiff2* must be run on all samples simultaneously. While *Cuffdiff2* can make use of many cores on a single computer, it is not parallelizable across computers. It has been noted that *Cuffdiff2* can take weeks or longer to run on experiments with a few hundred samples. This issue has led consortia and other groups to rely on unpublished software for transcript abundance estimation [1, 4].

We compared each component of the pipeline in terms of computational time on the simulated dataset with 20 samples and 2,745 transcripts. The *Tophat2*-*Cufflinks*-*Tablemaker*-*Ballgown* pipeline was fastest, taking about 3 minutes per sample for *Tablemaker*, 7 seconds to load transcript data into *R*, and less than 1 second for differential expression analysis. This is faster than the recently published *Tophat2*-*Cufflinks*-*Cuffquant*-*Cuffdiff2* pipeline [27], which required about 4 minutes per sample for *Cuffquant* and 23 minutes for differential expression analysis with *Cuffdiff2*. The *Tablemaker*-*Ballgown* pipeline was also substantially faster than directly running *Cufflinks*-*Cuffdiff2*, where the *Cuffdiff2* step took about 75 minutes. For all these pipelines, *Tophat2* took about 2 hours per sample and *Cufflinks* about 5 minutes per sample. All possible multicore processes (*Tophat2*, *Cufflinks*, *Cuffdiff2*, *Cuffquant*, *Tablemaker*) were run on 4 cores.
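The per-step timings above can be combined into totals for the steps that differ between pipelines. The arithmetic below is a rough sketch that assumes the per-sample steps run one after another on a single machine and treats "less than 1 second" as 1 second; alignment and assembly are excluded because they are common to every pipeline. The function names are hypothetical.

```python
N_SAMPLES = 20  # size of the simulated dataset


def tablemaker_ballgown_minutes(n_samples: int) -> float:
    """Tablemaker at ~3 min/sample, then ~7 s to load and ~1 s to test in R."""
    return 3 * n_samples + 7 / 60 + 1 / 60


def cuffquant_cuffdiff_minutes(n_samples: int) -> float:
    """Cuffquant at ~4 min/sample, then ~23 min for one joint Cuffdiff2 run."""
    return 4 * n_samples + 23


DIRECT_CUFFDIFF_MINUTES = 75  # Cuffdiff2 run directly on the Cufflinks output

serial_tablemaker = tablemaker_ballgown_minutes(N_SAMPLES)   # about 60.1 minutes
serial_cuffquant = cuffquant_cuffdiff_minutes(N_SAMPLES)     # 103 minutes
```

Under the serial assumption the *Tablemaker* route is the shortest of the three, and because its per-sample step is independent across samples, running samples in parallel would shrink its total further, whereas the joint *Cuffdiff2* step cannot be split across machines.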

We also calculated the per-sample distribution of processing times for each step in the *Tophat2*-*Cufflinks*-*Tablemaker* pipeline for all 667 samples in the GEUVADIS study (Figure 5a-c). *Tablemaker* took a median of 0.97 hours per sample (IQR 0.24 hours) on a standard 4-core computer; this calculation can be parallelized across samples. By contrast, *Cuffdiff2* would take months to perform this analysis on a standard 4-core computer. *Ballgown* multiclass differential expression analysis between the CEU (*n* = 162), YRI (*n* = 163), FIN (*n* = 114), GBR (*n* = 115) and TSI (*n* = 93) samples for 334,206 transcripts took 42 minutes on a single-core desktop computer.

## Comparison of average coverage and FPKM for differential expression

There are two major classes of statistical methods for differential expression analysis of RNA-seq: those based on RPKMs or FPKMs, as exemplified by *Cufflinks*, and those based on counting the reads overlapping specific regions, as exemplified by *DESeq* [2] and *edgeR* [21]. *Tablemaker* produces both FPKM estimates from *Cufflinks* and average coverage of each exon, intron, and transcript (Supplementary Materials: Tablemaker output files). We used our simulated dataset to investigate the impact of using average coverage as the transcript expression measurement, compared to using FPKM, as was done in our previous analyses. To do this comparison, we re-ran the same *Ballgown* model as in our simulation study (Figure 2), but used average coverage as the expression measurement. The differential expression rankings were highly correlated when using either FPKM or average coverage (Figure 6a). The p-value distribution using average coverage (Figure 6b) was similar to the p-value distribution using FPKM (Figure 2c), and the accuracy of the transcript rankings was almost the same whether average coverage or FPKM was used (Figure 6c). We also observed correlated ranks between the differential expression results by RIN value in the GEUVADIS dataset (Figure 6d). These results confirm the expected result: in differential expression analyses, count-based and FPKM-based (length-normalized) expression measurements perform similarly. *Ballgown* allows users to perform analyses with whatever expression measurement is available in their dataset, so other expression measurements, such as transcripts per million (TPM) [30], could also be explored within our framework.
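Agreement between two differential expression rankings can be quantified with a rank correlation. The text does not say which statistic underlies the comparison, so Spearman's correlation below is an assumption used for illustration; the function names and toy values are hypothetical.

```python
import math


def average_ranks(values: list) -> list:
    """Ranks starting at 1, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: list, y: list) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x: list, y: list) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy test statistics for four transcripts under two expression measurements.
stats_fpkm = [1.0, 2.0, 3.0, 4.0]
stats_coverage = [1.0, 3.0, 2.0, 4.0]      # two middle transcripts swap places
rho = spearman(stats_fpkm, stats_coverage)  # 1 - 6*2 / (4*15) = 0.8
```

A rank-based measure is a natural fit here because FPKM and average coverage live on different scales, so only the ordering of transcripts is directly comparable.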

The *Ballgown* R package includes functions for interactive exploration of the transcriptome assembly, visualization of transcript structures and feature-specific abundances for each locus, and post-hoc annotation of assembled features to annotated features. Direct availability of feature-by-sample expression tables makes it easy to apply alternative differential expression tests or to evaluate other statistical properties of the assembly, such as dispersion of expression values across replicates or genes. The *Tablemaker* preprocessor writes the tables directly to disk, and they can be loaded into *R* with a single function call. The *Ballgown*, *Tablemaker*, and *polyester* software are available from GitHub (Supplementary Material: Software), and code and data from the analyses presented here are in the process of being uploaded to GitHub (Supplementary Material: Scripts and Data).

## Acknowledgements

The authors would like to thank Peter A.C. ’t Hoen and Tuuli Lappalainen for providing QC data and assistance with contacting ArrayExpress. JTL, GP, and BL were partially supported by 1R01GM105705 and AF is supported by a Hopkins Sommer Scholarship.