recount: A large-scale resource of analysis-ready RNA-seq expression data

recount is a resource of processed and summarized expression data spanning nearly 60,000 human RNA-seq samples from the Sequence Read Archive (SRA). The associated recount Bio-conductor package provides a convenient API for querying, downloading, and analyzing the data. Each processed study consists of meta/phenotype data, the expression levels of genes and their underlying exons and splice junctions, and corresponding genomic annotation. We also provide data summarization types for quantifying novel transcribed sequence including base-resolution coverage and potentially unannotated splice junctions. We present workflows illustrating how to use recount to perform differential expression analysis including meta-analysis, annotation-free base-level analysis, and replication of smaller studies using data from larger studies. recount provides a valuable and user-friendly resource of processed RNA-seq datasets to draw additional biological insights from existing public data. The resource is available at https://jhubiostatistics.shinyapps.io/recount/.


Introduction
RNA sequencing (RNA-seq) is a ubiquitous tool for assaying gene expression. Public sequencing data repositories such as the Sequence Read Archive [1] now hold more than 50,000 human RNAseq samples, and the size of the archive doubles approximately every 18 months (http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=announcement). Many studies in public and dbGaP-protected repositories are valuable to biological researchers and methods developers. For example, some studies are derived from individuals with rare disease [2], from hard-toobtain tissues [3] or from rare forms of cancer [4]. Other studies are notable for their size, e.g. the GTEx study [5] consisting of 9,662 samples derived from 551 individuals and 54 body sites.
However, the majority of these archived samples are available only as compressed collections of raw sequencing reads. And while some samples have been summarized into gene counts in repositories like the Gene Expression Omnibus (GEO), these expression level summarizations are heavily dependent on the processing pipelines, which can vary dramatically across study. Typically analyses of public data require re-analysis beginning from the reads. However, processing raw reads into a form suitable for various downstream analyses is technically challenging. Care is required to craft expression summaries, derived from a common comparable pipeline -that are both conciseconvenient for researchers to download and interact with -and useful in a variety of downstream scenarios.
Our first effort addressed this in two ways: (a) by summarizing data into concise gene count tables, and (b) by making processed data available in the form of Bioconductor [6] ExpressionSet objects including associated metadata using a single processing pipeline. That resource, called recount [7], was applied, for example, to development of methods for differential expression and normalization [8,9,10], compilation of co-expression networks [11], and to studying the effect of ribosomal DNA dosage on gene expression [12]. Here we present an updated version of recount consisting of 59,319 uniformly processed human RNA-seq samples across 2,035 projects. These publicly available SRA samples include, for example, the entirety of the GTEx, GEUVADIS [13], SEQC/MACQ-III [14], and ABRF [15] studies. Now recount summarizes the expression data across multiple feature types, including genes, exons, exon-exon splice junctions and base-level coverage. These summarizations enable a wider variety of downstream analyses, including testing for differential expression of potentially unannotated transcribed sequence. Similarly, recount now marshals relevant meta/phenotype data into a searchable interface available from both the website (https://jhubiostatistics.shinyapps.io/recount/) and the Bioconductor package https:// github.com/leekgroup/recount, allowing users to rapidly access relevant data.
We demonstrate three potential workflows using the recount database and corresponding R package. First, we show that the resource can be used to merge multiple data sets studying the same problem to perform rapid genomic meta-analyses. Next, we demonstrate that our processed version of the GTEx gene count data closely matches the official gene counts released by the project itself. This serves to demonstrate (a) that it is easy to compare the processed data from recount to data processed with other pipelines and (b) that our gene counts are consistent with those published from the GTEx project. Lastly we show that the resource can be used to easily perform differential expression analysis at different feature summarizations: exons, genes, junctions, and expressed regions [7]. We also demonstrate the ease of comparing results discovered in one study within recount to other studies in the resource for validation. All of our analyses are reproducible and the results can be compiled using the R markdown files found at: http://leekgroup.github. io/recount-analyses/.

Data description
We analyzed the publicly available SRA and the latest (v6) release of GTEx and samples. The

Use case 1: Meta-analysis
To illustrate the ease of combining data from multiple projects included in recount as part of a cross-study meta-analysis, we carried out a cross-tissue differential expression (DE) analysis comparing gene expression between colon and whole blood. As an initial analysis, colon samples labeled as controls were taken from studies SRP029880 (a study of colorectal cancer [16], n=19) and SRP042228 (a study of Crohn's disease [17], n=41). Whole blood samples labeled as controls were taken from SRP059039 (a study of virus-caused diarrhea, unpublished, n=24), SRP059172 (a study of blood biomarkers for brucellosis, unpublished, n=47) and SRP062966 (a study of lupus, unpublished, n=18). After filtering genes to include only those with an average normalized count of at least 5 across samples, we performed gene-level differential expression analysis using limma [18] and voom [9].
To validate the results, we selected all colon and whole blood samples from the GTEx project (n=376 and 456, respectively) and performed the same analysis, adjusting for batch effects. We then computed rank-based concordance, examining the fraction of the top DE genes that were included in both analyses. Results are shown in orange in Figure 1A. Approximately 20% of the top 100 genes from the two analyses were concordant. As a comparison and to provide context for this result, we performed two additional comparisons. First, we used GTEx lung data (n=374) in place of the colon data and computed DE genes compared to whole blood. In this case, only approximately 5% of the top 100 DE genes were shared in the top 100 genes from our multistudy analysis. Second, to represent concordance results expected for a comparison of unrelated things, we used ranked coefficients for batch instead of for tissue and see very little concordance.
These comparisons support that we can use the resources found in recount to perform a valid tissue-specific meta-analysis.

Use case 2: GTEx comparison
One of the largest collections of RNA sequencing data currently available are data from the GTEx project consisting of 9,662 samples from over 250 individuals [19]. The recount collection includes the RNA-seq data from GTEx processed using the same pipeline as all other samples from SRA.
The exon, gene, and junction counts are available from recount in the form of both tab-delimited files and analysis-ready Bioconductor objects.
We downloaded the official release of the gene counts from the GTEx portal (which were based on read counting) and compared them to our genes counts (which were based on base-level coverage).
The gene expression levels we estimate using the recount pipeline have a median (IQR) correlation of 0.96 (0.93, 0.98) with the V6 release from GTEx ( Figure 1B). We performed a differential expression analysis comparing colon and whole blood samples. Differential expression analysis using the gene expression measurements from recount match the results using the V6 release from the GTEx portal (r 2 = .93 between fold changes for recount and GTEx v6 counts, Figure 1C).
The advantage of using the recount version of the GTEx data is that they are already processed identically to tens of thousands of SRA samples and can be easily integrated to perform more comprehensive analyses as we have shown in previous examples.

Use case 3: Multi-level differential expression analyses
To demonstrate the ease with which differential gene expression analyses can be carried out in recount, we lastly performed differential expression (DE) analysis at the gene, exon, exon-exonjunction, and expressed region levels from data generated to determine the transcriptomic differences between breast cancer subtypes. In this first analysis, HER2-positive and triple negative breast cancer (TNBC) samples were selected from study SRP032789 (TNBC, n=6; HER2-positive, n=5) [20], and feature-level expression at genes, exons, junctions, and expressed regions were ex- Finally, 19,805 exon-exon junctions were differentially expressed (q < 0.05, 18,073 downregulated and 1,732 upregulated). DE analysis identified 35,809 differentially expressed regions (q < 0.05, 17,170 downregulated and 18,639 upregulated). Of these significant DERs, 6,613 do not overlap any annotated exons, demonstrating that 18% of DERs detected would not be reported using annotation-dependent methods of expression estimation. Figure 2C highlights an example region in which differential expression occurs outside of any annotated protein-coding gene on Chromosome 3.
We then summarized junctions and exons at the gene-level using the resulting DE p-values, and 67% of the top 100 genes were shared across the gene and exon-level analyses ( Figure 2A). In comparison, expressed regions and exon-exon junction analyses only shared 18% and 5% of the top 100 genes, respectively. Furthermore, to validate the differential expression findings, we compared the gene level results from study SRP032789 to an independent study (SRP019936; TNBC, n=8; HER2-positive, n=7) [21] (see Supplementary Methods). DE analysis was carried out as described above, identifying 3,197 genes as differentially expressed (q < 0.05, 1,728 downregulated and 1,469 upregulated). Given the low concordance (8% among the top 1000 genes) between these results and those from study SRP032789, we then applied independent hypothesis weighting (IHW) [22] across the two studies, which slightly improved replication rates. While sample size is limited in these two studies and thus likely thwarts our ability to see a huge increase in power using IHW, example code for IHW in recount is provided for application to other data sets.

Discussion
By producing summaries at multiple levels of detail, recount enables a range of downstream analyses. The summaries are concise and easy for users to download and use. Achieving an appropriate balance between conciseness and queryability is an important design challenge for any effort that seeks to make public data more usable for researchers.
All recount summaries are produced with analysis pipelines that are both reproducible and annotation-agnostic. Gene annotations are used only to label summarized data post analysis, and not to align reads or discover splice junctions. Downstream analyses are therefore fully aware of unannotated splicing events.
Other efforts have been made to summarize public gene expression data. The Expression Atlas [23] provides final results queryable only at the gene level, Toil focuses on curated data [24], and other efforts have focused primarily on cancer [25]. recount, by contrast, covers a broad range of projects and produces summarized objects that can be further analyzed in a variety of ways -including directly with Bioconductor using the recount package.

Alignment
GTEx and public SRA samples were selected by searching the SRA website. Samples that could not be downloaded using fastq-dump were eliminated as discussed in Supplementary Methods. Samples were aligned in a spliced fashion to the hg38 assembly of the human genome using Rail-RNA. Alignments were performed in batches on computer clusters rented from the Amazon Web Services Elastic MapReduce service. The alignment pipeline was divided into two phases, where the first phase ("preprocessing") downloads and reformats the data and the second phase performs spliced alignment. Outputs of the pipeline include, for each sample, a junction coverage file (similar to a TopHat "junctions.bed" file) and a BigWig file [26] containing a genomewide coverage vector.
Further details are presented in the Rail-RNA study [27] and Supplementary Methods.

Tabulation
Gene and exon counts were compiled using the BigWig files output by Rail-RNA and the UCSC knownGene annotation. For exon counts, we first obtained a set of non-overlapping "unioned" exons. Gene and exon counts were compiled into per-project tables and RangedSummarizedExperiment objects. We expanded the tables with several metadata columns containing, for example, read count, paired-end status, GEO accession, and the tissue type as predicted by the SHARQ beta resource (http://www.cs.cmu.edu/~ckingsf/sharq/).

Use cases
All R code used for the analyses performing the use cases is available from the website: http: //leekgroup.github.io/recount-analyses/.

Competing interests
The authors declare that they have no competing interests.   Figure 1: Meta-analysis and study comparison facilitated by recount A. A concordance at the top plot showing comparisons between a meta-analysis tissue comparison of whole blood and colorectal tissue in data from the sequence read archive and the GTEx project. When comparing the same tissues there is strong concordance between differential expression results on public data and GTEx, less when different tissues are compared, and almost none when comparing different analyses. B. The distribution of correlations between gene expression estimates for GTEx V6 from the GTEx portal and the counts calculated in recount. The gene expression counts are highly correlated between both quantifications. C. An MA-plot comparing the fold changes for differential expression between colon and whole blood using the quantifications from GTEx and from recount. Most genes have similar fold change between the two analyses.  Spot instances allow users to bid for excess computing capacity. If the fluctuating market price drops below a users's bid, the instances could be lost, halting the computation. So saving money by bidding for spot instances comes with risk, and rather than aligning all samples in one batch, Intel Xeon E5-2680 v2 (Ivy Bridge) processing cores and 60 GB of RAM. Our GTEx alignment runs may be reproduced by following instructions at https://github.com/nellore/runs/blob/ master/gtex/README.md. Our alignment runs may be reproduced by following instructions at https://github.com/nellore/runs/blob/master/sra/v2/README.md. Alignment of GTEx data is described in more detail in the supplementary material of [2].
Incomplete and invalid input data. For the analysis of the public SRA samples, some samples could not be downloaded, due to failures of the fastq-dump software. The issue persisted even when we attempted to restart the fastq-dump process. Samples exhibiting this issue can be excluded from analysis. These samples are listed here: https://github.com/nellore/runs/raw/master/ sra/v2/hg38/NOTES. That file also describes two other preprocessing issues that led us to exclude a few other samples from analysis: (1) sequence input encoded in a manner we did not recognize; (2) miscellaneous errors reported by fastq-dump. exactly twice the number of reads Rail-RNA processed; other times, the sample was listed as singleend but the number of reads Rail-RNA processed was greater than the number of reads listed. In these cases, we made note of our suspicion that the library layout listed on SRA was incorrect.
The For each project, we created a RangedSummarizedExperiment using the sample metadata as in the exon and gene cases while annotating the exon-exon junctions with the: was carried out using limma and voom [9]. For multiple comparison correction we calculated q-values from the observed p-values after estimating the proportion of differentially expressed genes, exons, or exon-exon junctions in the experiment [10]. Features with calculated q-values smaller than 0.05 between different tissue types were declared statistically significant. To compare the results across these levels of data, we used Simes' rule [11] to calculate a gene-based p-value for all exons, exon-exon junctions, and differentially expressed regions that overlap genes. We then computed rank-based concordance between the gene and either the exon, junction, or expressed region level results.
In the replication dataset, count data were filtered as before and PCA was utilized to identify global sample outliers. One TNBC tumor sample was identified as an outlier and removed from analysis. IHW uses a set of independent weights for each gene in one study to weight the hypothesis tests in the second study to increase power to detect differences. Here, for each gene the absolute value of the test statistic from study SRP032789 were used as weights for DE analysis in study