Abstract
The first steps in the analysis of RNA sequencing (RNA-seq) data are usually to map the reads to a reference genome and then to count reads by gene, by exon or by exon-exon junction. These two steps are at once the most common and also typically the most expensive computational steps in an RNA-seq analysis. These steps are typically undertaken using Unix command-line or Python software tools, even when downstream analysis is to be undertaken using R.
We present Rsubread, a Bioconductor software package that provides high-performance alignment and counting functions for RNA-seq reads. Rsubread provides the ease-of-use of the R programming environment, creating a matrix of read counts directly as an R object ready for downstream analysis. It has no software dependencies other than R itself. Using SEQC data and simulations, we compare Rsubread to the popular non-R tools TopHat2, STAR and HTSeq. We also compare to counting functions provided in the Bioconductor infrastructure packages. We show that Rsubread is faster, uses less memory and produces read count summaries that more accurately correlate with true values. The results show that users can adopt the R environment for alignment and quantification without suffering any loss of performance.