recount: A large-scale resource of analysis-ready RNA-seq expression data

Leonardo Collado-Torres; Abhinav Nellore; Kai Kammers; Shannon E. Ellis; Margaret A. Taub; Kasper D. Hansen; Andrew E. Jaffe; Ben Langmead; Jeffrey T. Leek

doi:10.1101/068478

Abstract

recount is a resource of processed and summarized expression data spanning nearly 60,000 human RNA-seq samples from the Sequence Read Archive (SRA). The associated recount Bioconductor package provides a convenient API for querying, downloading, and analyzing the data. Each processed study consists of meta/phenotype data, the expression levels of genes and their underlying exons and splice junctions, and corresponding genomic annotation. We also provide data summarization types for quantifying novel transcribed sequence including base-resolution coverage and potentially unannotated splice junctions. We present workflows illustrating how to use recount to perform differential expression analysis including meta-analysis, annotation-free base-level analysis, and replication of smaller studies using data from larger studies. recount provides a valuable and user-friendly resource of processed RNA-seq datasets to draw additional biological insights from existing public data. The resource is available at https://jhubiostatistics.shinyapps.io/recount/.

1 Introduction

RNA sequencing (RNA-seq) is a ubiquitous tool for assaying gene expression. Public sequencing data repositories such as the Sequence Read Archive [1] now hold more than 50,000 human RNA-seq samples, and the size of the archive doubles approximately every 18 months (http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=announcement). Many studies in public and dbGaP-protected repositories are valuable to biological researchers and methods developers. For example, some studies are derived from individuals with rare disease [2], from hard-to-obtain tissues [3] or from rare forms of cancer [4]. Other studies are notable for their size, e.g. the GTEx study [5] consisting of 9,662 samples derived from 551 individuals and 54 body sites.

However, the majority of these archived samples are available only as compressed collections of raw sequencing reads. While some samples have been summarized into gene counts in repositories like the Gene Expression Omnibus (GEO), these expression-level summaries depend heavily on the processing pipelines, which can vary dramatically across studies. Analyses of public data therefore typically require re-analysis beginning from the reads. However, processing raw reads into a form suitable for various downstream analyses is technically challenging. Care is required to craft expression summaries, derived from a common, comparable pipeline, that are both concise (convenient for researchers to download and interact with) and useful in a variety of downstream scenarios.

Our first effort addressed this in two ways: (a) by summarizing data into concise gene count tables, and (b) by making processed data available in the form of Bioconductor [6] ExpressionSet objects including associated metadata using a single processing pipeline. That resource, called recount [7], was applied, for example, to development of methods for differential expression and normalization [8, 9, 10], compilation of co-expression networks [11], and to studying the effect of ribosomal DNA dosage on gene expression [12]. Here we present an updated version of recount consisting of 59,319 uniformly processed human RNA-seq samples across 2,035 projects. These publicly available SRA samples include, for example, the entirety of the GTEx, GEUVADIS [13], SEQC/MAQC-III [14], and ABRF [15] studies. recount now summarizes the expression data across multiple feature types, including genes, exons, exon-exon splice junctions and base-level coverage. These summarizations enable a wider variety of downstream analyses, including testing for differential expression of potentially unannotated transcribed sequence. Similarly, recount now marshals relevant meta/phenotype data into a searchable interface available from both the website (https://jhubiostatistics.shinyapps.io/recount/) and the Bioconductor package (https://github.com/leekgroup/recount), allowing users to rapidly access relevant data.

We demonstrate three potential workflows using the recount database and corresponding R package. First, we show that the resource can be used to merge multiple data sets studying the same problem to perform rapid genomic meta-analyses. Next, we demonstrate that our processed version of the GTEx gene count data closely matches the official gene counts released by the project itself. This serves to demonstrate (a) that it is easy to compare the processed data from recount to data processed with other pipelines and (b) that our gene counts are consistent with those published from the GTEx project. Lastly we show that the resource can be used to easily perform differential expression analysis at different feature summarizations: exons, genes, junctions, and expressed regions [7]. We also demonstrate the ease of comparing results discovered in one study within recount to other studies in the resource for validation. All of our analyses are reproducible and the results can be compiled using the R markdown files found at: http://leekgroup.github.io/recount-analyses/.

2 Results

2.1 Data description

We analyzed publicly available samples from the SRA and from the latest (v6) release of GTEx. The SRA data consisted of 49,848 publicly available samples spanning over 146 terabases of reads. These reads were downloaded and analyzed, resulting in a final set of 48,558 samples that could be fully downloaded and processed (1,290 samples could be downloaded only partially and were excluded, see Supplementary Methods). The GTEx data consisted of 9,662 samples spanning over 65 terabases of reads, 550 individuals and the K562 cell line, and 32 tissues. These reads were downloaded and analyzed, resulting in a final set of 9,479 samples that could be fully downloaded and processed (183 could be downloaded only partially, see Supplementary Methods).

2.2 Use case 1: Meta-analysis

To illustrate the ease of combining data from multiple projects included in recount as part of a cross-study meta-analysis, we carried out a cross-tissue differential expression (DE) analysis comparing gene expression between colon and whole blood. As an initial analysis, colon samples labeled as controls were taken from studies SRP029880 (a study of colorectal cancer [16], n = 19) and SRP042228 (a study of Crohn’s disease [17], n = 41). Whole blood samples labeled as controls were taken from SRP059039 (a study of virus-caused diarrhea, unpublished, n = 24), SRP059172 (a study of blood biomarkers for brucellosis, unpublished, n = 47) and SRP062966 (a study of lupus, unpublished, n = 18). After filtering genes to include only those with an average normalized count of at least 5 across samples, we performed gene-level differential expression analysis using limma [18] and voom [9].

To validate the results, we selected all colon and whole blood samples from the GTEx project (n = 376 and 456, respectively) and performed the same analysis, adjusting for batch effects. We then computed rank-based concordance, examining the fraction of the top DE genes that were included in both analyses. Results are shown in orange in Figure 1A. Approximately 20% of the top 100 genes from the two analyses were concordant. As a comparison and to provide context for this result, we performed two additional comparisons. First, we used GTEx lung data (n = 374) in place of the colon data and computed DE genes compared to whole blood. In this case, only approximately 5% of the top 100 DE genes were shared with the top 100 genes from our multi-study analysis. Second, to represent the concordance expected when comparing unrelated quantities, we used ranked coefficients for batch instead of for tissue and saw very little concordance. These comparisons support the use of the resources found in recount to perform a valid tissue-specific meta-analysis.
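
The rank-based concordance described above can be sketched in a few lines. The following Python is a minimal illustration only, not the code used for the analysis (which is in the R markdown files referenced in this paper); the function name `concordance_at_top` and the toy gene rankings are hypothetical.

```python
def concordance_at_top(ranked_a, ranked_b, k):
    """Fraction of the top-k genes in ranked_a that are also in the top-k of ranked_b."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_a = set(ranked_a[:k])
    top_b = set(ranked_b[:k])
    return len(top_a & top_b) / k


# Toy rankings, most significant gene first; the gene names are placeholders.
meta_analysis = ["g1", "g2", "g3", "g4", "g5", "g6"]
validation = ["g2", "g7", "g1", "g8", "g4", "g3"]

# One concordance value per list depth k, i.e. the curve drawn in a
# concordance-at-the-top plot.
cat_curve = [concordance_at_top(meta_analysis, validation, k) for k in range(1, 7)]
```

With real results, the two inputs would be gene identifiers ordered by the DE statistic from each analysis, and the value at k = 100 corresponds to the "top 100" percentages quoted above.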

Figure 1: Meta-analysis and study comparison facilitated by recount. A. A concordance-at-the-top plot showing comparisons between a meta-analysis tissue comparison of whole blood and colorectal tissue in data from the Sequence Read Archive and the GTEx project. When comparing the same tissues there is strong concordance between differential expression results on public data and GTEx, less when different tissues are compared, and almost none when comparing different analyses. B. The distribution of correlations between gene expression estimates for GTEx V6 from the GTEx portal and the counts calculated in recount. The gene expression counts are highly correlated between both quantifications. C. An MA-plot comparing the fold changes for differential expression between colon and whole blood using the quantifications from GTEx and from recount. Most genes have similar fold change between the two analyses.

2.3 Use case 2: GTEx comparison

One of the largest collections of RNA sequencing data currently available is the data from the GTEx project, consisting of 9,662 samples from over 250 individuals [19]. The recount collection includes the RNA-seq data from GTEx processed using the same pipeline as all other samples from SRA. The exon, gene, and junction counts are available from recount in the form of both tab-delimited files and analysis-ready Bioconductor objects.

We downloaded the official release of the gene counts from the GTEx portal (which were based on read counting) and compared them to our gene counts (which were based on base-level coverage). The gene expression levels we estimate using the recount pipeline have a median (IQR) correlation of 0.96 (0.93, 0.98) with the V6 release from GTEx (Figure 1B). We performed a differential expression analysis comparing colon and whole blood samples. The results of differential expression analysis using the gene expression measurements from recount match the results using the V6 release from the GTEx portal (r² = 0.93 between fold changes for recount and GTEx v6 counts, Figure 1C). The advantage of using the recount version of the GTEx data is that they are already processed identically to tens of thousands of SRA samples and can be easily integrated to perform more comprehensive analyses, as we have shown in previous examples.
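
A comparison of this kind can be sketched as follows. This is an illustrative Python example under stated assumptions, not the analysis code: it assumes one Pearson correlation per gene across samples (the text does not specify whether correlations were computed per gene or per sample), and the names `pearson`, `counts_a`, and `counts_b` as well as the toy counts are hypothetical.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Toy data: counts for the same genes and samples from two pipelines.
counts_a = {"geneA": [10, 20, 30, 40], "geneB": [5, 3, 8, 1]}
counts_b = {"geneA": [12, 19, 33, 41], "geneB": [6, 2, 9, 1]}

# One correlation per gene, then a single summary across genes.
per_gene_r = {gene: pearson(counts_a[gene], counts_b[gene]) for gene in counts_a}
median_r = statistics.median(per_gene_r.values())
```

The same `pearson` function applied to two vectors of log fold changes, then squared, gives an r² of the kind reported for Figure 1C.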

2.4 Use case 3: Multi-level differential expression analyses

To demonstrate the ease with which differential gene expression analyses can be carried out in recount, we lastly performed differential expression (DE) analysis at the gene, exon, exon-exon junction, and expressed region levels using data generated to determine the transcriptomic differences between breast cancer subtypes. In this first analysis, HER2-positive and triple negative breast cancer (TNBC) samples were selected from study SRP032789 (TNBC, n = 6; HER2-positive, n = 5) [20], and feature-level expression at genes, exons, junctions, and expressed regions was extracted (see Supplementary Methods). DE analysis at the gene level identified 1,611 differentially expressed genes (q < 0.05), with 933 genes demonstrating decreased expression and 678 increased expression in TNBC relative to HER2-positive breast cancer. At the exon level, 23,647 exons demonstrated differential expression (q < 0.05; 11,218 downregulated and 12,429 upregulated). At the junction level, 19,805 exon-exon junctions were differentially expressed (q < 0.05; 18,073 downregulated and 1,732 upregulated). Finally, DE analysis at the expressed region level identified 35,809 differentially expressed regions (DERs; q < 0.05; 17,170 downregulated and 18,639 upregulated). Of these significant DERs, 6,613 do not overlap any annotated exons, demonstrating that 18% of DERs detected would not be reported using annotation-dependent methods of expression estimation. Figure 2C highlights an example region in which differential expression occurs outside of any annotated protein-coding gene on Chromosome 3.
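
The tallies in this paragraph can be cross-checked with simple arithmetic. The short Python sketch below uses only the numbers reported above; the variable names are chosen here for illustration.

```python
# (total, downregulated, upregulated) as reported in the text for study SRP032789.
reported = {
    "genes": (1_611, 933, 678),
    "exons": (23_647, 11_218, 12_429),
    "junctions": (19_805, 18_073, 1_732),
    "expressed regions": (35_809, 17_170, 18_639),
}

for feature, (total, down, up) in reported.items():
    # Each total should equal the sum of its two directions.
    assert down + up == total, feature

# Share of differentially expressed regions that overlap no annotated exon.
unannotated_fraction = 6_613 / reported["expressed regions"][0]
print(f"{unannotated_fraction:.1%}")  # prints 18.5%
```

All four totals agree with their up- and downregulated components, and the unannotated share is consistent with the 18% quoted in the text.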

Figure 2: Multi-level differential expression analysis is facilitated by recount. A. A concordance-at-the-top plot showing concordance between a gene-level analysis, an exon-level analysis, a junction-level analysis, and an expressed region-level analysis (DER). All of these analyses are possible and comparable with recount. B. A Venn diagram showing the number of expressed regions detected that overlap exons, intergenic regions, and intronic regions, including expressed regions that overlap multiple annotation types. Differential expression occurs outside of previously annotated protein-coding regions. C. An example of a region showing differential expression between breast tumor subtypes where there is no annotated gene present. The lines show the average coverage in each group across samples.

We then summarized junctions and exons at the gene level using the resulting DE p-values, and 67% of the top 100 genes were shared across the gene- and exon-level analyses (Figure 2A). In comparison, expressed region and exon-exon junction analyses shared only 18% and 5% of the top 100 genes, respectively. Furthermore, to validate the differential expression findings, we compared the gene-level results from study SRP032789 to an independent study (SRP019936; TNBC, n = 8; HER2-positive, n = 7) [21] (see Supplementary Methods). DE analysis was carried out as described above, identifying 3,197 genes as differentially expressed (q < 0.05; 1,728 downregulated and 1,469 upregulated). Given the low concordance (8% among the top 1,000 genes) between these results and those from study SRP032789, we then applied independent hypothesis weighting (IHW) [22] across the two studies, which slightly improved replication rates. The small sample sizes of these two studies likely limit the gain in power achievable with IHW; example code for IHW in recount is provided for application to other data sets.
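
Summarizing feature-level p-values at the gene level requires a rule for combining several p-values into one. The paragraph above does not name the rule used, so the following Python sketch assumes Simes' combination, one common choice, purely for illustration; the function name `simes` and the toy p-values are hypothetical.

```python
def simes(pvalues):
    """Simes combined p-value: the minimum over i of n * p_(i) / i,
    where p_(1) <= ... <= p_(n) are the sorted input p-values."""
    if not pvalues:
        raise ValueError("need at least one p-value")
    n = len(pvalues)
    ordered = sorted(pvalues)
    return min(n * p / rank for rank, p in enumerate(ordered, start=1))


# Toy exon-level p-values grouped by gene; the gene names are placeholders.
exon_pvalues = {
    "geneA": [0.001, 0.20, 0.50],
    "geneB": [0.04, 0.03, 0.05],
}

gene_pvalues = {gene: simes(ps) for gene, ps in exon_pvalues.items()}

# Genes ordered from most to least significant, ready for a rank-based comparison.
ranked_genes = sorted(gene_pvalues, key=gene_pvalues.get)
```

Once every feature type has been reduced to one p-value per gene, the resulting gene rankings can be compared across analyses at any list depth.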

3 Discussion

By producing summaries at multiple levels of detail, recount enables a range of downstream analyses. The summaries are concise and easy for users to download and use. Achieving an appropriate balance between conciseness and queryability is an important design challenge for any effort that seeks to make public data more usable for researchers.

All recount summaries are produced with analysis pipelines that are both reproducible and annotation-agnostic. Gene annotations are used only to label summarized data post analysis, and not to align reads or discover splice junctions. Downstream analyses are therefore fully aware of unannotated splicing events.

Other efforts have been made to summarize public gene expression data. The Expression Atlas [23] provides final results queryable only at the gene level, Toil focuses on curated data [24], and other efforts have focused primarily on cancer [25].

recount, by contrast, covers a broad range of projects and produces summarized objects that can be further analyzed in a variety of ways, including directly with Bioconductor using the recount package.

The recount website is located at https://jhubiostatistics.shinyapps.io/recount/. The recount Bioconductor package is released under the Artistic 2.0 open source license and is available at https://github.com/leekgroup/recount.

4 Online Methods

4.1 Alignment

GTEx and public SRA samples were selected by searching the SRA website. Samples that could not be downloaded using fastq-dump were eliminated as discussed in Supplementary Methods.

Samples were aligned in a spliced fashion to the hg38 assembly of the human genome using Rail-RNA. Alignments were performed in batches on computer clusters rented from the Amazon Web Services Elastic MapReduce service. The alignment pipeline was divided into two phases, where the first phase (“preprocessing”) downloads and reformats the data and the second phase performs spliced alignment. Outputs of the pipeline include, for each sample, a junction coverage file (similar to a TopHat “junctions.bed” file) and a BigWig file [26] containing a genomewide coverage vector. Further details are presented in the Rail-RNA study [27] and Supplementary Methods.

4.2 Tabulation

Gene and exon counts were compiled using the BigWig files output by Rail-RNA and the UCSC knownGene annotation. For exon counts, we first obtained a set of non-overlapping “unioned” exons. Gene and exon counts were compiled into per-project tables and RangedSummarizedExperiment objects. We expanded the tables with several metadata columns containing, for example, read count, paired-end status, GEO accession, and the tissue type as predicted by the SHARQ beta resource (http://www.cs.cmu.edu/~ckingsf/sharq/).
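
The idea of tabulating counts from a coverage vector over non-overlapping "unioned" exons can be illustrated with a small, self-contained Python sketch. It is not the pipeline's implementation: the function names, the 0-based half-open coordinates, and the toy coverage vector are assumptions made here for illustration, and real coverage would be read from BigWig files.

```python
def union_intervals(intervals):
    """Merge overlapping or adjacent half-open intervals [start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def coverage_sum(coverage, intervals):
    """Total base-level coverage over each interval."""
    return [sum(coverage[start:end]) for start, end in intervals]


# Toy per-base coverage for a 12-base region and three annotated exons,
# two of which overlap.
coverage = [0, 2, 2, 3, 3, 3, 0, 0, 1, 1, 4, 0]
exons = [(1, 4), (3, 6), (8, 11)]

unioned = union_intervals(exons)               # [(1, 6), (8, 11)]
exon_counts = coverage_sum(coverage, unioned)  # [13, 6]
gene_count = sum(exon_counts)                  # 19
```

Unioning first ensures that bases shared by overlapping exons are counted once. Note that a sum of base-level coverage grows with read length as well as with the number of reads, so such sums generally need to be scaled before being treated as read counts.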

4.3 Use cases

All R code used for the analyses performing the use cases is available from the website: http://leekgroup.github.io/recount-analyses/.

Use case 1: Meta-analysis: http://leekgroup.github.io/recount-analyses/example_meta/meta_analysis.pdf
Use case 2: GTEx comparison: http://leekgroup.github.io/recount-analyses/example_meta/compare_with_GTEx_reproducible.pdf
Use case 3: Multi-level differential expression analyses: gene/exon: http://leekgroup.github.io/recount-analyses/example_de/recount_SRP032789.pdf; annotation-agnostic: http://leekgroup.github.io/recount-analyses/example_de/recount_DER_SRP032789.pdf; validation of results in a second study: http://leekgroup.github.io/recount-analyses/example_de/recount_SRP019936.pdf

5 Acknowledgments

We thank Carl Kingsford and Darya Filippova for their assistance in adding SHARQ metadata to recount. recount data is hosted on SciServer, a collaborative research environment for large-scale data-driven science. It is being developed at, and administered by, the Institute for Data Intensive Engineering and Science at Johns Hopkins University. SciServer is funded by the National Science Foundation Award ACI-1261715. For more information about SciServer, visit http://www.sciserver.org.

The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health. Additional funds were provided by the NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS. Donors were enrolled at Biospecimen Source Sites funded by NCI/SAIC-Frederick, Inc. (SAIC-F) subcontracts to the National Disease Research Interchange (10XS170), Roswell Park Cancer Institute (10XS171), and Science Care, Inc. (X10S172). The Laboratory, Data Analysis, and Coordinating Center (LDACC) was funded through a contract (HHSN268201000029C) to The Broad Institute, Inc. Biorepository operations were funded through an SAIC-F subcontract to Van Andel Institute (10ST1035). Additional data repository and project management were provided by SAIC-F (HHSN261200800001E). The Brain Bank was supported by supplements to University of Miami grants DA006227 & DA033684 and to contract N01MH000028. Statistical Methods development grants were made to the University of Geneva (MH090941 & MH101814), the University of Chicago (MH090951, MH090937, MH101820, MH101825), the University of North Carolina - Chapel Hill (MH090936 & MH101819), Harvard University (MH090948), Stanford University (MH101782), Washington University St Louis (MH101810), and the University of Pennsylvania (MH101822). The data used for the analyses described in this manuscript were obtained from: the GTEx Portal on 11/21/15 and/or dbGaP accession number phs000424.v6.p1 on 11/30/15 - 12/04/15.

6 Competing interests

The authors declare that they have no competing interests.

7 Funding

BL, JTL, LCT, SE, AN, MT, KH and KK were supported by NIH R01 GM105705. LCT was supported by Consejo Nacional de Ciencia y Tecnología México 351535. Amazon Web Services experiments were supported by AWS in Education research grants.

Footnotes

* co-first authors

References

1. Y. Kodama, M. Shumway, and R. Leinonen. “The Sequence Read Archive: explosive growth of sequencing data”. In: Nucleic acids research 40.D1 (2012), pp. D54–D56.
2. C. A. Albers et al. “Compound inheritance of a low-frequency regulatory SNP and a rare null mutation in exon-junction complex subunit RBM8A causes TAR syndrome”. In: Nature genetics 44.4 (2012), pp. 435–439.
3. R. Kohen et al. “Transcriptome profiling of human hippocampus dentate gyrus granule cells in mental illness”. In: Translational psychiatry 4.3 (2014), e366.
4. G. Goh et al. “Recurrent activating mutation in PRKACA in cortisol-producing adrenal tumors”. In: Nature genetics 46.6 (2014), pp. 613–617.
5. M. Melé et al. “The human transcriptome across tissues and individuals”. In: Science 348.6235 (2015), pp. 660–665.
6. R. C. Gentleman et al. “Bioconductor: open software development for computational biology and bioinformatics”. In: Genome biology 5.10 (2004), R80.
7. A. C. Frazee, B. Langmead, and J. T. Leek. “ReCount: a multi-experiment resource of analysis-ready RNA-seq gene count datasets”. In: BMC bioinformatics 12 (2011), p. 449. issn: 1471-2105. doi:10.1186/1471-2105-12-449.
8. M. I. Love, W. Huber, and S. Anders. “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2”. In: Genome biology 15.12 (2014), pp. 1–21.
9. C. W. Law et al. “Voom: precision weights unlock linear model analysis tools for RNA-seq read counts”. In: Genome Biol 15.2 (2014), R29.
10. J. N. Paulson et al. “Differential abundance analysis for microbial marker-gene surveys”. In: Nature methods 10.12 (2013), pp. 1200–1202.
11. O. D. Iancu et al. “Utilizing RNA-Seq data for de novo coexpression network inference”. In: Bioinformatics 28.12 (2012), pp. 1592–1597.
12. J. G. Gibbons et al. “Ribosomal DNA copy number is coupled with gene expression variation and mitochondrial abundance in humans”. In: Nature communications 5 (2014).
13. T. Lappalainen et al. “Transcriptome and genome sequencing uncovers functional variation in humans”. In: Nature 501.7468 (2013), pp. 506–511.
14. SEQC/MAQC-III Consortium. “A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium”. In: Nature biotechnology 32.9 (2014), pp. 903–914.
15. S. Li et al. “Multi-platform assessment of transcriptome profiling using RNA-seq in the ABRF next-generation sequencing study”. In: Nature biotechnology 32.9 (2014), pp. 915–925.
16. S.-K. Kim et al. “A nineteen gene-based risk score classifier predicts prognosis of colorectal cancer patients”. In: Molecular Oncology 8.8 (2014), pp. 1653–1666. issn: 1574-7891. doi:10.1016/j.molonc.2014.06.016. url: http://www.sciencedirect.com/science/article/pii/S1574789114001525.
17. Y. Haberman et al. “Pediatric Crohn disease patients exhibit specific ileal transcriptome and microbiome signature”. In: The Journal of Clinical Investigation 124.8 (Aug. 2014), pp. 3617–3633. doi:10.1172/JCI75436. url: http://www.jci.org/articles/view/75436.
18. G. K. Smyth. “Limma: linear models for microarray data”. In: Bioinformatics and computational biology solutions using R and Bioconductor. Springer, 2005, pp. 397–420.
19. GTEx Consortium. “The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans”. In: Science 348.6235 (2015), pp. 648–660.
20. J. Eswaran et al. “RNA sequencing of cancer reveals novel splicing alterations”. In: Scientific reports 3 (2013).
21. K. R. Kalari et al. “An integrated model of the transcriptome of HER2-positive breast cancer”. In: PloS one 8.11 (2013), e79298.
22. N. Ignatiadis et al. “Data-driven hypothesis weighting increases detection power in genome-scale multiple testing”. In: Nature methods (2016).
23. R. Petryszak et al. “Expression Atlas update—an integrated database of gene and protein expression in humans, animals and plants”. In: Nucleic acids research 44.D1 (2016), pp. D746–D752.
24. J. Vivian et al. “Rapid and efficient analysis of 20,000 RNA-seq samples with Toil”. In: bioRxiv (2016), p. 062497.
25. P. Tatlow and S. R. Piccolo. “A cloud-based workflow to quantify transcript-expression levels in public cancer compendia”. In: bioRxiv (2016), p. 063552.
26. W. J. Kent et al. “BigWig and BigBed: enabling browsing of large distributed datasets”. In: Bioinformatics (Oxford, England) 26.17 (Sept. 2010), pp. 2204–2207. issn: 1367-4811. doi:10.1093/bioinformatics/btq351.
27. A. Nellore et al. “Rail-RNA: Scalable analysis of RNA-seq splicing and coverage”. In: bioRxiv (2015), p. 019067.
