Abstract
recount is a resource of processed and summarized expression data spanning nearly 60,000 human RNA-seq samples from the Sequence Read Archive (SRA). The associated recount Bioconductor package provides a convenient API for querying, downloading, and analyzing the data. Each processed study consists of meta/phenotype data, the expression levels of genes and their underlying exons and splice junctions, and corresponding genomic annotation. We also provide data summarization types for quantifying novel transcribed sequence including base-resolution coverage and potentially unannotated splice junctions. We present workflows illustrating how to use recount to perform differential expression analysis including meta-analysis, annotation-free base-level analysis, and replication of smaller studies using data from larger studies. recount provides a valuable and user-friendly resource of processed RNA-seq datasets to draw additional biological insights from existing public data. The resource is available at https://jhubiostatistics.shinyapps.io/recount/.
1 Introduction
RNA sequencing (RNA-seq) is a ubiquitous tool for assaying gene expression. Public sequencing data repositories such as the Sequence Read Archive [1] now hold more than 50,000 human RNA-seq samples, and the size of the archive doubles approximately every 18 months (http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=announcement). Many studies in public and dbGaP-protected repositories are valuable to biological researchers and methods developers. For example, some studies are derived from individuals with rare disease [2], from hard-to-obtain tissues [3] or from rare forms of cancer [4]. Other studies are notable for their size, e.g. the GTEx study [5] consisting of 9,662 samples derived from 551 individuals and 54 body sites.
However, the majority of these archived samples are available only as compressed collections of raw sequencing reads. While some samples have been summarized into gene counts in repositories like the Gene Expression Omnibus (GEO), these expression-level summaries depend heavily on the processing pipelines, which can vary dramatically across studies. Analyses of public data therefore typically require re-analysis beginning from the reads, yet processing raw reads into a form suitable for various downstream analyses is technically challenging. Care is required to craft expression summaries, derived from a common and comparable pipeline, that are both concise (convenient for researchers to download and interact with) and useful in a variety of downstream scenarios.
Our first effort addressed this in two ways: (a) by summarizing data into concise gene count tables, and (b) by making processed data available in the form of Bioconductor [6] ExpressionSet objects including associated metadata using a single processing pipeline. That resource, called recount [7], was applied, for example, to development of methods for differential expression and normalization [8, 9, 10], compilation of co-expression networks [11], and to studying the effect of ribosomal DNA dosage on gene expression [12]. Here we present an updated version of recount consisting of 59,319 uniformly processed human RNA-seq samples across 2,035 projects. These publicly available SRA samples include, for example, the entirety of the GTEx, GEUVADIS [13], SEQC/MAQC-III [14], and ABRF [15] studies. recount now summarizes the expression data across multiple feature types, including genes, exons, exon-exon splice junctions and base-level coverage. These summarizations enable a wider variety of downstream analyses, including testing for differential expression of potentially unannotated transcribed sequence. Similarly, recount now marshals relevant meta/phenotype data into a searchable interface available from both the website (https://jhubiostatistics.shinyapps.io/recount/) and the Bioconductor package (https://github.com/leekgroup/recount), allowing users to rapidly access relevant data.
We demonstrate three potential workflows using the recount database and corresponding R package. First, we show that the resource can be used to merge multiple data sets studying the same problem to perform rapid genomic meta-analyses. Next, we demonstrate that our processed version of the GTEx gene count data closely matches the official gene counts released by the project itself. This serves to demonstrate (a) that it is easy to compare the processed data from recount to data processed with other pipelines and (b) that our gene counts are consistent with those published from the GTEx project. Lastly we show that the resource can be used to easily perform differential expression analysis at different feature summarizations: exons, genes, junctions, and expressed regions [7]. We also demonstrate the ease of comparing results discovered in one study within recount to other studies in the resource for validation. All of our analyses are reproducible and the results can be compiled using the R markdown files found at: http://leekgroup.github.io/recount-analyses/.
2 Results
2.1 Data description
We analyzed publicly available samples from the SRA and from the latest (v6) release of GTEx. The SRA data consisted of 49,848 publicly available samples spanning over 146 terabases of reads. These reads were downloaded and analyzed, resulting in a final set of 48,558 samples that could be fully downloaded and processed (1,290 samples could be downloaded only partially and were excluded, see Supplementary Methods). The GTEx data consisted of 9,662 samples spanning over 65 terabases of reads, 550 individuals and the K562 cell line, and 32 tissues. These reads were downloaded and analyzed, resulting in a final set of 9,479 samples that could be fully downloaded and processed (183 samples could be downloaded only partially and were excluded, see Supplementary Methods).
2.2 Use case 1: Meta-analysis
To illustrate the ease of combining data from multiple projects included in recount as part of a cross-study meta-analysis, we carried out a cross-tissue differential expression (DE) analysis comparing gene expression between colon and whole blood. As an initial analysis, colon samples labeled as controls were taken from studies SRP029880 (a study of colorectal cancer [16], n = 19) and SRP042228 (a study of Crohn’s disease [17], n = 41). Whole blood samples labeled as controls were taken from SRP059039 (a study of virus-caused diarrhea, unpublished, n = 24), SRP059172 (a study of blood biomarkers for brucellosis, unpublished, n = 47) and SRP062966 (a study of lupus, unpublished, n = 18). After filtering genes to include only those with an average normalized count of at least 5 across samples, we performed gene-level differential expression analysis using limma [18] and voom [9].
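The expression filter described above can be illustrated with a minimal Python sketch. This is not the code used for the analysis (which was carried out in R with limma and voom); the function name `filter_genes`, the dictionary layout and the toy values are assumptions made purely for illustration.

```python
from statistics import mean

def filter_genes(counts, min_mean=5.0):
    """Keep genes whose mean normalized count across samples is >= min_mean.

    counts: dict mapping a gene id to a list of normalized counts,
    one value per sample (hypothetical layout for this sketch).
    """
    return {gene: values for gene, values in counts.items()
            if mean(values) >= min_mean}

# Toy data: three genes measured in three samples.
toy_counts = {
    "geneA": [10.0, 12.0, 8.0],  # mean 10, kept
    "geneB": [1.0, 0.0, 2.0],    # mean 1, dropped
    "geneC": [5.0, 5.0, 5.0],    # mean exactly 5, kept
}
kept = filter_genes(toy_counts)
print(sorted(kept))
```

The threshold is inclusive here because the text specifies "at least 5"; how the counts are normalized before filtering is left to the upstream step.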
To validate the results, we selected all colon and whole blood samples from the GTEx project (n = 376 and 456, respectively) and performed the same analysis, adjusting for batch effects. We then computed rank-based concordance, examining the fraction of the top DE genes that were included in both analyses. Results are shown in orange in Figure 1A. Approximately 20% of the top 100 genes from the two analyses were concordant. To provide context for this result, we performed two additional comparisons. First, we used GTEx lung data (n = 374) in place of the colon data and computed DE genes compared to whole blood. In this case, only approximately 5% of the top 100 DE genes were shared with the top 100 genes from our multi-study analysis. Second, to represent the concordance expected when comparing unrelated effects, we used ranked coefficients for batch instead of for tissue and saw very little concordance. These comparisons support that we can use the resources found in recount to perform a valid tissue-specific meta-analysis.
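As a rough illustration of the rank-based concordance described above, the following Python sketch computes the fraction of the top k genes shared by two ranked lists. The function name `concordance_at_top` and the toy gene identifiers are hypothetical, and the exact concordance definition used for Figure 1A may differ in detail.

```python
def concordance_at_top(ranking_a, ranking_b, k):
    """Fraction of the top-k genes of ranking_a that also appear
    among the top-k genes of ranking_b.

    Both rankings are lists of gene ids ordered from most to least
    significant, each containing at least k entries.
    """
    top_a = set(ranking_a[:k])
    top_b = set(ranking_b[:k])
    return len(top_a & top_b) / k

# Toy rankings from two hypothetical analyses.
multi_study = ["g1", "g2", "g3", "g4", "g5"]
validation = ["g2", "g9", "g1", "g7", "g5"]

# Concordance as a function of list depth k.
curve = [concordance_at_top(multi_study, validation, k) for k in range(1, 6)]
print(curve)
```

Evaluating the function over a range of k gives a concordance curve of the kind typically plotted for such comparisons.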
2.3 Use case 2: GTEx comparison
One of the largest collections of RNA sequencing data currently available is the GTEx project, consisting of 9,662 samples from over 250 individuals [19]. The recount collection includes the RNA-seq data from GTEx processed using the same pipeline as all other samples from SRA. The exon, gene, and junction counts are available from recount in the form of both tab-delimited files and analysis-ready Bioconductor objects.
We downloaded the official release of the gene counts from the GTEx portal (which were based on read counting) and compared them to our gene counts (which were based on base-level coverage). The gene expression levels we estimate using the recount pipeline have a median (IQR) correlation of 0.96 (0.93, 0.98) with the v6 release from GTEx (Figure 1B). We then performed a differential expression analysis comparing colon and whole blood samples. The results obtained using the gene expression measurements from recount match those obtained using the v6 release from the GTEx portal (r² = 0.93 between fold changes for recount and GTEx v6 counts, Figure 1C). The advantage of using the recount version of the GTEx data is that they are already processed identically to tens of thousands of SRA samples and can be easily integrated to perform more comprehensive analyses, as we have shown in previous examples.
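A median (IQR) correlation summary of this kind can be sketched in Python as follows. Several details are assumptions rather than statements about the original analysis: Pearson correlation is used (the text does not say which coefficient was computed), correlations are taken per sample across genes, and the function names and toy values are hypothetical.

```python
from math import sqrt
from statistics import median, quantiles

def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def summarize_correlations(pipeline_a, pipeline_b):
    """Median and (Q1, Q3) of per-sample correlations between two pipelines.

    Each argument maps a sample id to a list of gene-level values,
    with genes in the same order in both inputs.
    """
    rs = [pearson(pipeline_a[s], pipeline_b[s]) for s in pipeline_a]
    q1, _, q3 = quantiles(rs, n=4)
    return median(rs), (q1, q3)

# Toy data: three samples, four genes each.
a = {"s1": [1, 2, 3, 4], "s2": [2, 4, 6, 9], "s3": [5, 1, 3, 2]}
b = {"s1": [2, 4, 6, 8], "s2": [2, 4, 6, 8], "s3": [5, 1, 3, 2]}
med, (q1, q3) = summarize_correlations(a, b)
print(med, q1, q3)
```

In practice counts are often log-transformed before correlating so that a few highly expressed genes do not dominate; whether that was done here is not stated in the text.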
2.4 Use case 3: Multi-level differential expression analyses
To demonstrate the ease with which differential gene expression analyses can be carried out in recount, we lastly performed differential expression (DE) analysis at the gene, exon, exon-exon junction, and expressed region levels using data generated to determine the transcriptomic differences between breast cancer subtypes. In this first analysis, HER2-positive and triple negative breast cancer (TNBC) samples were selected from study SRP032789 (TNBC, n = 6; HER2-positive, n = 5) [20], and feature-level expression at genes, exons, junctions, and expressed regions was extracted (see Supplementary Methods). DE analysis at the gene level identified 1,611 differentially expressed genes (q < 0.05), with 933 genes demonstrating decreased expression and 678 increased expression in TNBC relative to HER2-positive breast cancer. At the exon level, 23,647 exons demonstrated differential expression (q < 0.05; 11,218 downregulated and 12,429 upregulated). At the junction level, 19,805 exon-exon junctions were differentially expressed (q < 0.05; 18,073 downregulated and 1,732 upregulated). Finally, DE analysis at the expressed region level identified 35,809 differentially expressed regions (DERs; q < 0.05; 17,170 downregulated and 18,639 upregulated). Of these significant DERs, 6,613 do not overlap any annotated exons, demonstrating that 18% of DERs detected would not be reported using annotation-dependent methods of expression estimation. Figure 2C highlights an example region in which differential expression occurs outside of any annotated protein-coding gene on Chromosome 3.
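The overlap check behind the 18% figure can be sketched in Python. This is a simplified illustration, not the authors' implementation: it assumes a single chromosome, closed 1-based intervals and a plain linear scan, and the function names and toy coordinates are hypothetical. The last lines only re-derive the reported percentage from the two counts given in the text.

```python
def overlaps_any(region, exons):
    """True if the closed interval `region` overlaps any exon interval."""
    start, end = region
    return any(start <= exon_end and exon_start <= end
               for exon_start, exon_end in exons)

def fraction_unannotated(regions, exons):
    """Fraction of regions that overlap no annotated exon."""
    novel = [r for r in regions if not overlaps_any(r, exons)]
    return len(novel) / len(regions)

# Toy annotation and toy expressed regions on one chromosome.
exons = [(100, 200), (300, 400)]
regions = [(150, 180), (210, 290), (390, 450), (500, 600)]
toy_fraction = fraction_unannotated(regions, exons)
print(toy_fraction)

# Re-deriving the percentage from the counts reported in the text.
reported_percent = round(100 * 6613 / 35809)
print(reported_percent)
```

For genome-scale data an interval tree or a sorted sweep would replace the linear scan; the quadratic version is kept here only for readability.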
We then summarized junctions and exons at the gene level using the resulting DE p-values, and 67% of the top 100 genes were shared across the gene and exon-level analyses (Figure 2A). In comparison, expressed regions and exon-exon junction analyses only shared 18% and 5% of the top 100 genes, respectively. Furthermore, to validate the differential expression findings, we compared the gene-level results from study SRP032789 to an independent study (SRP019936; TNBC, n = 8; HER2-positive, n = 7) [21] (see Supplementary Methods). DE analysis was carried out as described above, identifying 3,197 genes as differentially expressed (q < 0.05; 1,728 downregulated and 1,469 upregulated). Given the low concordance (8% among the top 1,000 genes) between these results and those from study SRP032789, we then applied independent hypothesis weighting (IHW) [22] across the two studies, which slightly improved replication rates. The limited sample size of these two studies likely restricts the gain in power achievable with IHW; nevertheless, example code for IHW in recount is provided for application to other data sets.
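One simple way to summarize feature-level p-values at the gene level is sketched below in Python. The text does not specify the summarization rule, so taking the minimum p-value per gene is an assumption made for illustration only; the function names, feature identifiers and p-values are likewise hypothetical.

```python
def gene_level_pvalues(feature_pvalues, feature_to_gene):
    """Summarize feature-level p-values per gene using the minimum.

    feature_pvalues: dict mapping a feature id (exon or junction) to its p-value.
    feature_to_gene: dict mapping each feature id to the gene it belongs to.
    """
    best = {}
    for feature, p in feature_pvalues.items():
        gene = feature_to_gene[feature]
        if gene not in best or p < best[gene]:
            best[gene] = p
    return best

def rank_genes(gene_pvalues):
    """Gene ids ordered from smallest to largest p-value (ties by gene id)."""
    return sorted(gene_pvalues, key=lambda g: (gene_pvalues[g], g))

# Toy exon-level results.
exon_p = {"e1": 0.20, "e2": 0.001, "e3": 0.03, "e4": 0.5}
exon_gene = {"e1": "G1", "e2": "G1", "e3": "G2", "e4": "G3"}

gene_p = gene_level_pvalues(exon_p, exon_gene)
ranked = rank_genes(gene_p)
print(ranked)
```

The resulting gene ranking can then be compared with the gene-level ranking by counting how many of the top 100 genes the two lists share.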
3 Discussion
By producing summaries at multiple levels of detail, recount enables a range of downstream analyses. The summaries are concise and easy for users to download and use. Achieving an appropriate balance between conciseness and queryability is an important design challenge for any effort that seeks to make public data more usable for researchers.
All recount summaries are produced with analysis pipelines that are both reproducible and annotation-agnostic. Gene annotations are used only to label summarized data post analysis, and not to align reads or discover splice junctions. Downstream analyses are therefore fully aware of unannotated splicing events.
Other efforts have been made to summarize public gene expression data. The Expression Atlas [23] provides final results queryable only at the gene level, Toil focuses on curated data [24], and other efforts have focused primarily on cancer [25].
recount, by contrast, covers a broad range of projects and produces summarized objects that can be further analyzed in a variety of ways, including directly with Bioconductor using the recount package.
The recount website is located at https://jhubiostatistics.shinyapps.io/recount/. The recount Bioconductor package is released under the Artistic 2.0 open source license and is available at https://github.com/leekgroup/recount.
4 Online Methods
4.1 Alignment
GTEx and public SRA samples were selected by searching the SRA website. Samples that could not be downloaded using fastq-dump were eliminated as discussed in Supplementary Methods.
Samples were aligned in a spliced fashion to the hg38 assembly of the human genome using Rail-RNA. Alignments were performed in batches on computer clusters rented from the Amazon Web Services Elastic MapReduce service. The alignment pipeline was divided into two phases, where the first phase (“preprocessing”) downloads and reformats the data and the second phase performs spliced alignment. Outputs of the pipeline include, for each sample, a junction coverage file (similar to a TopHat “junctions.bed” file) and a BigWig file [26] containing a genomewide coverage vector. Further details are presented in the Rail-RNA study [27] and Supplementary Methods.
4.2 Tabulation
Gene and exon counts were compiled using the BigWig files output by Rail-RNA and the UCSC knownGene annotation. For exon counts, we first obtained a set of non-overlapping “unioned” exons. Gene and exon counts were compiled into per-project tables and RangedSummarizedExperiment objects. We expanded the tables with several metadata columns containing, for example, read count, paired-end status, GEO accession, and the tissue type as predicted by the SHARQ beta resource (http://www.cs.cmu.edu/~ckingsf/sharq/).
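The idea of unioning exons and tabulating counts from base-level coverage can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not the recount implementation: it handles one chromosome, uses closed 1-based intervals, merges only overlapping (not merely adjacent) exons, and reads coverage from an in-memory list rather than a BigWig file. All names and values are hypothetical.

```python
def union_exons(exons):
    """Merge overlapping closed intervals into non-overlapping exons."""
    merged = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def coverage_sum(coverage, start, end):
    """Sum base-level coverage over positions start..end (1-based, inclusive).

    coverage[i] holds the coverage at genomic position i + 1.
    """
    return sum(coverage[start - 1:end])

# Toy coverage vector for positions 1..10 and three annotated exons.
coverage = [0, 2, 2, 3, 3, 3, 0, 0, 1, 1]
annotated = [(2, 4), (3, 6), (9, 10)]

unioned = union_exons(annotated)
exon_counts = {exon: coverage_sum(coverage, *exon) for exon in unioned}
gene_count = sum(exon_counts.values())
print(unioned, exon_counts, gene_count)
```

Coverage sums of this kind grow with read length, so they are usually rescaled before being treated as read counts in downstream differential expression tools.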
4.3 Use cases
All R code used for the analyses performing the use cases is available from the website: http://leekgroup.github.io/recount-analyses/.
Use case 1: Meta-analysis http://leekgroup.github.io/recount-analyses/example_meta/meta_analysis.pdf
Use case 2: GTEx comparison http://leekgroup.github.io/recount-analyses/example_meta/compare_with_GTEx_reproducible.pdf
Use case 3: Multi-level differential expression analyses. Gene/exon: http://leekgroup.github.io/recount-analyses/example_de/recount_SRP032789.pdf; annotation-agnostic: http://leekgroup.github.io/recount-analyses/example_de/recount_DER_SRP032789.pdf; validation of results in a second study: http://leekgroup.github.io/recount-analyses/example_de/recount_SRP019936.pdf
5 Acknowledgments
We thank Carl Kingsford and Darya Filippova for their assistance in adding SHARQ metadata to recount. recount data is hosted on SciServer, a collaborative research environment for large-scale data-driven science. It is being developed at, and administered by, the Institute for Data Intensive Engineering and Science at Johns Hopkins University. SciServer is funded by the National Science Foundation Award ACI-1261715. For more information about SciServer, visit http://www.sciserver.org.
The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health. Additional funds were provided by the NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS. Donors were enrolled at Biospecimen Source Sites funded by NCI/SAIC-Frederick, Inc. (SAIC-F) subcontracts to the National Disease Research Interchange (10XS170), Roswell Park Cancer Institute (10XS171), and Science Care, Inc. (X10S172). The Laboratory, Data Analysis, and Coordinating Center (LDACC) was funded through a contract (HHSN268201000029C) to The Broad Institute, Inc. Biorepository operations were funded through an SAIC-F subcontract to Van Andel Institute (10ST1035). Additional data repository and project management were provided by SAIC-F (HHSN261200800001E). The Brain Bank was supported by supplements to University of Miami grants DA006227 & DA033684 and to contract N01MH000028. Statistical Methods development grants were made to the University of Geneva (MH090941 & MH101814), the University of Chicago (MH090951, MH090937, MH101820, MH101825), the University of North Carolina - Chapel Hill (MH090936 & MH101819), Harvard University (MH090948), Stanford University (MH101782), Washington University St Louis (MH101810), and the University of Pennsylvania (MH101822). The data used for the analyses described in this manuscript were obtained from: the GTEx Portal on 11/21/15 and/or dbGaP accession number phs000424.v6.p1 on 11/30/15 - 12/04/15.
6 Competing interests
The authors declare that they have no competing interests.
7 Funding
BL, JTL, LCT, SE, AN, MT, KH and KK were supported by NIH R01 GM105705. LCT was supported by Consejo Nacional de Ciencia y Tecnología México 351535. Amazon Web Services experiments were supported by AWS in Education research grants.
Footnotes
* co-first authors