ABSTRACT
Long non-coding RNAs (lncRNAs) have emerged as key coordinators of biological and cellular processes. Characterizing lncRNA expression across cells and tissues is key to understanding their role in determining phenotypes, including disease. We present here FC-R2, a comprehensive expression atlas across a broadly defined human transcriptome, inclusive of over 100,000 coding and non-coding genes as described by the FANTOM CAGE-Associated Transcriptome (FANTOM-CAT) study. This atlas greatly extends the gene annotation used in the original recount2 resource. We demonstrate the utility of the FC-R2 atlas by reproducing key findings from published large studies and by generating new results across normal and diseased human samples. In particular, we (a) identify tissue-specific transcription profiles for distinct classes of coding and non-coding genes, (b) perform differential expression analysis across thirteen cancer types, providing new insights linking promoter and enhancer lncRNA expression to tumor pathogenesis, and (c) confirm the prognostic value of several enhancers in cancer. Comprising over 70,000 samples, FC-R2 will empower other researchers to investigate the roles of both known genes and recently described lncRNAs. The FC-R2 atlas is available from https://jhubiostatistics.shinyapps.io/recount/, the recount Bioconductor package, and http://marchionnilab.org/fcr2.html.
Introduction
Long non-coding RNAs (lncRNAs) are commonly defined as transcripts longer than 200 nucleotides that lack open reading frames (ORFs) and are often polyadenylated. This definition is not based on their function, since lncRNAs are involved in distinct molecular processes and biological contexts not yet fully characterized1. Over the past few years, the importance of lncRNAs has been clarified, leading to an increasing focus on decoding the consequences of their modulation and studying their involvement in the regulation of key biological mechanisms during development, normal tissue and cellular homeostasis, and in disease1–3.
Given the emerging and previously underestimated importance of non-coding RNAs, the FANTOM consortium has initiated the systematic characterization of their biological function. Through the use of Cap Analysis of Gene Expression sequencing (CAGE-seq), combined with RNA-seq data from the public domain, the FANTOM consortium released a comprehensive atlas of the human transcriptome, encompassing more accurate transcriptional start sites (TSS) for coding and non-coding genes, including numerous novel long non-coding genes: the FANTOM CAGE-Associated Transcriptome (FANTOM-CAT)4. We hypothesized that these lncRNAs can be measured in many RNA-seq datasets from the public domain and that they have so far been missed because of the lack of a comprehensive gene annotation.
Although the systematic analysis of lncRNA function is being addressed by the FANTOM consortium in loss-of-function studies, increasing the detection rate of these transcripts by combining different studies is difficult because of the heterogeneity of the analytic methods employed. Current resources that apply uniform analytic methods to create expression summaries from public data do exist, but they can miss several lncRNAs because they depend on a pre-existing gene annotation to create the gene expression summaries5, 6. We recently created recount27, a collection of uniformly processed human RNA-seq data, wherein we summarized 4.4 trillion reads from over 70,000 human samples from the Sequence Read Archive (SRA), The Cancer Genome Atlas (TCGA)8, and the Genotype-Tissue Expression (GTEx)9 projects7. Importantly, recount2 provides annotation-agnostic coverage files that allow re-quantification using a new annotation without having to re-process the RNA-seq data.
Given the unique opportunity to access the latest results from the most comprehensive human transcriptome (the FANTOM-CAT project) and the annotation-agnostic summaries in recount2, we addressed the previously described challenges by building a comprehensive atlas of coding and non-coding gene expression across the human genome: the FANTOM-CAT/recount2 expression atlas (FC-R2 hereafter). Our resource contains expression profiles for 109,873 putative genes across over 70,000 samples, providing an unparalleled resource for the analysis of the human coding and non-coding transcriptome.
Results
Building the FANTOM-CAT/recount2 resource
The recount2 resource includes a coverage track, in the form of a BigWig file, for each processed sample. We built the FC-R2 expression atlas by extracting expression levels from recount2 coverage tracks in regions that overlapped unambiguous exon coordinates for the permissive set of FANTOM-CAT transcripts, according to the pipeline shown in Figure 1. Since recount2’s coverage tracks do not distinguish between genomic strands, we removed ambiguous segments that presented overlapping exon annotations from both strands (see Methods section). After this disambiguation procedure, the remaining 1,066,515 exonic segments mapped back to 109,869 genes in FANTOM-CAT (out of the 124,047 starting ones included in the permissive set4). Overall, the FC-R2 expression atlas encompasses 2,041 studies with 71,045 RNA-seq samples, providing expression information for 22,116 coding genes and 87,763 non-coding genes, such as enhancers, promoters, and other lncRNAs.
Validating the FANTOM-CAT/recount2 resource
We first assessed how gene expression estimates in FC-R2 compared to previous gene expression estimates from other projects. Specifically, we considered data from the GTEx consortium (v6), spanning 9,662 samples from 551 individuals and 54 tissue types9. First, we correlated gene expression levels between the FC-R2 atlas and quantification based on GENCODE (v25) in recount2 for the GTEx data, observing a median correlation ≥0.986 for the 32,922 genes in common. This result supports the notion that our pre-processing steps to disambiguate overlapping exon regions between strands did not significantly alter gene expression quantification.
Next, we assessed whether gene expression specificity, as measured in FC-R2, was maintained across tissue types. To this end, we selected and compared gene expression for genes with known tissue-specific expression patterns, such as Keratin 1 (KRT1), Estrogen Receptor 1 (ESR1), and Neuronal Differentiation 1 (NEUROD1) (Figure 2). Overall, all analyzed tissue-specific markers presented nearly identical expression profiles across GTEx tissue types between the alternative gene models considered (see Figures 2 and S1), confirming the consistency between gene expression quantification in FC-R2 and those based on GENCODE.
Tissue-specific expression of lncRNAs
It has been shown that enhancers and promoters, although expressed at a lower level, are not ubiquitously expressed and are more specific for different cell types than coding genes4. To verify this finding, we used GTEx data to assess expression levels and specificity profiles across samples from each of the 54 analyzed tissue types, stratified into four distinct gene categories: coding mRNA, intergenic promoter lncRNA (ip-lncRNA), divergent promoter lncRNA (dp-lncRNA), and enhancer lncRNA (e-lncRNA). Overall, we were able to confirm that these RNA classes are expressed at different levels, and that they display distinct specificity patterns across tissues, as shown for primary cell types by Hon et al.4, albeit with more variability, likely due to the increased cellular complexity present in tissues. Specifically, coding mRNAs were expressed at higher levels than lncRNAs (log2 median expression of 6.6 for coding mRNAs, and of 4.1, 3.8, and 3.1 for ip-lncRNA, dp-lncRNA, and e-lncRNA, respectively). In contrast, the expression of enhancers and intergenic promoters was more tissue-specific (median = 0.41 and 0.30) than what was observed for divergent promoters and coding mRNAs (median = 0.13 and 0.09) (Figure 3). Finally, when analyzing the percentage of genes expressed across tissues by category, we observed that coding genes are, in general, ubiquitous, while lncRNAs are more specific, with enhancers showing the lowest percentages of expressed genes (mean ranging from 88.42% to 41.98%, see Figure 3B), in agreement with the notion that enhancer transcription is tissue-specific10.
Differential expression analysis of coding and non-coding genes in cancer
We analyzed coding and non-coding gene expression in cancer using TCGA data. To this end, we compared cancer to normal samples separately for 13 tumor types, using FC-R2 re-quantified data. We further identified the differentially expressed genes (DEG) in common across the distinct cancer types (see Figure 4). Overall, the number of DEG varied across cancer types and by gene class, with a higher number of significant coding than non-coding genes (FDR < 0.01, see Table 1). Importantly, a substantial fraction of these genes was exclusively annotated in the FANTOM-CAT, suggesting that relying on other gene models would result in missing many potentially important genes (see Table 1). We then analyzed the consensus among cancer types. A total of 41 coding mRNAs were differentially expressed across all 13 tumor types after global correction for multiple testing (FDR < 10^-6, see Supplementary Table S1). For lncRNAs, a total of 28 divergent promoters, 4 intergenic promoters, and 3 enhancers were consistently up- or down-regulated across all 13 tumor types after global correction for multiple testing (FDR < 0.1, see Supplementary Tables S2, S3, and S4, respectively).
Next, we reviewed the literature to assess functional correlates for these consensus genes. Most of the consensus up-regulated coding genes (Supplementary Table S1) participate in cell cycle regulation, cell division, DNA replication and repair, chromosome segregation, and mitotic spindle checkpoints. Most of the consensus down-regulated mRNAs (Supplementary Table S1) are associated with metabolism and oxidative stress, transcriptional regulation, cell migration and adhesion, and with modulation of DNA damage repair and apoptosis.
Down-regulated dp-lncRNAs were mostly those associated with immune cells (e.g., natural killer cells, T cells, and mature B cells). Three genes, RP11-276H19, RPL34-AS1, and RAP2C-AS1, were reported to be implicated in cancer (Supplementary Table S2). The first controls epithelial-mesenchymal transition, the second is associated with tumor size increase, and the third is associated with urothelial cancer after kidney cancer transplantation11–13. Among up-regulated dp-lncRNAs, SNHG1 (Supplementary Table S2) was implicated in cellular proliferation, migration, and invasion in different cancer types, and was strongly up-regulated in osteosarcoma, non-small cell lung cancer, and gastric cancer14, 15.
Among the ubiquitously down-regulated ip-lncRNAs (see Supplementary Table S3), LINC00478 has been previously reported in many different tumors, including leukemia, breast, vulvar, prostate, and bladder cancer16–20. In vulvar squamous cell carcinoma, there is a statistical relationship between LINC00478 and MIR31HG expression and tumor differentiation17. Additionally, LINC00478 down-regulation in ER-positive breast tumors was shown to be associated with progression, recurrence, and metastasis18. In contrast, increased expression of SNHG17 (an ip-lncRNA, see Supplementary Table S3) was associated with short-term survival in breast cancer, and with tumor size, stage, and lymph node metastasis in colorectal cancer21, 22. Another ip-lncRNA, AC004463 (Supplementary Table S3), was found to be up-regulated in liver cancer and metastatic prostate cancer23. Regarding the last lncRNA category considered here, we could not find any cancer association for the common e-lncRNAs; however, one of them, RP5-965F6, was previously reported to be up-regulated in late-onset Alzheimer’s disease24. The e-lncRNA category also yielded the lowest number of genes in common among all cancer types, reinforcing the concept that lncRNAs, especially enhancers, are expressed in a specific manner (Supplementary Table S4).
Finally, as a prototypical example, we considered prostate cancer (PCa), and we were able to confirm findings from previous reports for both coding and non-coding genes (see Supplementary Figure S2). For coding genes, we confirmed differential expression for known markers of PCa progression and mortality, like ERG, FOXA1, RNASEL, ARVCF, and SLC43A125, 26. Similarly, we also confirmed differential expression for non-coding genes, like PCA3, the first clinically approved lncRNA marker for PCa27, 28, PCAT1, a prostate-specific lncRNA involved in disease progression29, MALAT1, which is associated with poor PCa prognosis30, CDKN2B-AS1, an anti-sense lncRNA up-regulated in PCa that inhibits tumor suppressor gene activity31, 32, and the MIR135 host gene, which is associated with castration-resistant PCa33.
Enhancer expression levels hold prognostic value
The number of lncRNAs involved in cancer development and progression is rapidly increasing. We therefore analyzed the prognostic value of the lncRNAs we identified in our differential gene expression analysis in TCGA, as well as those previously reported in other studies. Chen and collaborators have recently surveyed enhancer expression in nearly 9,000 patients from the TCGA34, using genomic coordinates from the FANTOM5 project35, identifying 4,803 enhancers with prognostic potential in one or more tumor types in the TCGA. We therefore leveraged the FC-R2 atlas to identify prognostic coding and non-coding genes using univariate Cox proportional hazards models, comparing our results for e-lncRNAs with those reported by Chen and colleagues.
When we considered e-lncRNA expression levels, we identified a total of 5,382 prognostic e-lncRNAs (FDR ≤ 0.05), and no single one was predictive across all cancer types. Overall, the number of significant prognostic e-lncRNAs varied across tumors, ranging from 3 in head and neck cancer to 3,850 in kidney cancers (see Supplementary Table S6). Notably, two (out of three) e-lncRNAs from our differential gene expression consensus list across all tumor types were also prognostic. Specifically, CATG00000107122 was associated with worse prognosis in kidney cancer, while ENSG00000255958 was associated with worse survival in stomach tumors. Overall, despite differences in annotation and quantification (see Supplementary Table S5), we were able to confirm prognostic value for 2,765 e-lncRNAs out of the 4,803 reported by Chen et al.34, including “enhancer 22” (ENSG00000272666), which was highlighted as a promising prognostic marker for kidney cancer (Supplementary Figure S3).
Finally, we analyzed the prognostic value for dp-lncRNAs, ip-lncRNAs, and mRNAs (see Supplementary Tables S7, S8, and S9, respectively), and assessed the survival prognostic potential of our consensus genes across tumor types. Thirty-seven of the 41 coding mRNAs, 22 of the 28 differentially expressed dp-lncRNAs, and two of the four DE ip-lncRNAs were found to be prognostic (see Supplementary Tables S10, S11, S12, and S13). Kaplan-Meier survival curves for one selected DE gene for each RNA subtype evaluated here are shown in Supplementary Figure S4.
Discussion
The importance of lncRNAs in cell biology and disease has clearly emerged in the past few years and different classes of lncRNAs have been shown to play crucial roles in cell regulation and homeostasis36. For instance, enhancers – a major category of gene regulatory elements, which has been shown to be expressed35, 37 – play a prominent role in oncogenic processes38, 39 and other human diseases40, 41. Despite their importance, however, there is a scarcity of large-scale datasets investigating enhancers and other lncRNA classes, in part due to the technical difficulty in applying high-throughput techniques such as ChIP-seq and Hi-C over large cohorts, and to the use of gene models that do not account for them in transcriptomics analyses. Furthermore, the large majority of the lncRNAs that are already known – and that have been shown to be associated with some phenotype – are still lacking functional annotation.
To address these needs, the FANTOM consortium first constructed the FANTOM-CAT meta-transcriptome, a comprehensive atlas of coding and non-coding genes with robust support from CAGE-seq data4, and then undertook a large-scale project to systematically target lncRNAs and characterize their function using a multi-pronged approach (Jordan et al., under review). In a complementary effort, we have leveraged public domain gene expression data from recount27, 42 to create a comprehensive gene expression compendium across human cells and tissues based on the FANTOM-CAT gene model, with the ultimate goal of facilitating lncRNA annotation through association studies.
In order to validate our resource, we have compared the gene expression summaries based on FANTOM-CAT gene models with previous, well-established quantification of gene expression, demonstrating virtually identical profiles across tissue types overall and for specific tissue markers. We have then confirmed that distinct classes of coding and non-coding genes differ in terms of overall expression levels and specificity patterns across cell types and tissues. Furthermore, with this approach, we were also able to identify mRNAs, promoters, enhancers, and other lncRNAs that are differentially expressed in cancer, both confirming previously reported findings, and identifying novel cancer genes exclusively annotated in the FANTOM-CAT gene model, which have been therefore missed in prior analyses with TCGA data. Finally, we also analyzed the prognostic value of the coding and non-coding genes we identified in our analyses, and confirmed the association with overall survival in TCGA for measurable enhancers.
Collectively, by confirming findings reported in previous studies, our results demonstrate that the FC-R2 gene expression atlas is a reliable and powerful resource for exploring both the coding and non-coding transcriptome, providing compelling evidence and robust support to the notion that lncRNA gene classes, including enhancers and promoters, despite not yet being fully understood, portend significant biological functions. Our resource, therefore, constitutes a suitable and promising platform for future large-scale studies in cancer and other human diseases, which in turn hold the potential to reveal important cues to the understanding of their biological, physiological, and pathological roles, potentially leading to improved diagnostic and therapeutic interventions.
Finally, all results and data from the FC-R2 atlas are available as a public tool. With uniformly processed expression data for over 70,000 samples and 109,873 genes ready to analyze, we want to encourage researchers to dive deeper into the study of ncRNAs, their interaction with coding and non-coding genes, and their influence on normal and diseased tissues. We hope this new resource will help pave the way to develop new hypotheses that can be followed to unravel the biological role of the transcriptome as a whole.
Methods
Data and pre-processing
The FANTOM-CAT permissive catalog was obtained from the pre-FANTOM6 consortium. This catalog initially comprised 124,245 genes defined by CAGE peaks published by Hon et al.4. To remove ambiguity, BED files containing the coordinates for each gene/exon were imported into an R session and processed with the GenomicRanges package43 by disjoining the exon coordinates. To avoid losing strand information, we processed the coordinates in a two-step approach, first disjoining overlapping segments on the same strand and then across strands (Figure 5). Genomic ranges (disjoined exon segments) that mapped back to more than one gene were discarded. The expression values for these ranges were then quantified using recount.bwtool44 (code at https://github.com/LieberInstitute/marchionni_projects). The resulting expression quantifications were processed to generate RangedSummarizedExperiment objects compatible with the recount2 framework7, 42 (code at https://github.com/eddieimada/fcr2). Thus, FC-R2 provides expression information for coding mRNAs, enhancers, and promoters (divergent and intergenic) for 9,662 samples from the Genotype-Tissue Expression (GTEx) project, 11,350 samples from The Cancer Genome Atlas (TCGA) consortium, and over 50,000 samples from the Sequence Read Archive (SRA).
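The disambiguation step can be illustrated with a minimal sketch. It is written in Python rather than with the R GenomicRanges package used for the atlas, and it collapses the two-step (same strand, then across strands) procedure into a single strand-agnostic pass, which is a simplifying assumption. The `Exon` record and the function name are hypothetical, and coordinates are assumed to be 1-based and inclusive.

```python
from collections import namedtuple

# Hypothetical record; coordinates are assumed 1-based and inclusive.
Exon = namedtuple("Exon", "chrom start end strand gene")

def disjoin_unambiguous(exons):
    """Split exons into non-overlapping segments and keep those owned by one gene.

    Strand is ignored on purpose: the coverage tracks are unstranded, so a
    segment annotated to different genes on either strand is ambiguous.
    """
    by_chrom = {}
    for ex in exons:
        by_chrom.setdefault(ex.chrom, []).append(ex)

    kept = []
    for chrom, chrom_exons in sorted(by_chrom.items()):
        # Every exon start, and the position right after every exon end,
        # is a breakpoint between disjoint segments.
        breaks = sorted({e.start for e in chrom_exons} |
                        {e.end + 1 for e in chrom_exons})
        for seg_start, next_start in zip(breaks, breaks[1:]):
            seg_end = next_start - 1
            # Genes with an exon that fully contains this segment.
            genes = {e.gene for e in chrom_exons
                     if e.start <= seg_start and seg_end <= e.end}
            if len(genes) == 1:  # drop gaps (0 genes) and ambiguous segments (>1)
                kept.append((chrom, seg_start, seg_end, genes.pop()))
    return kept

# Toy input: geneA and geneB overlap on opposite strands between 150 and 200.
toy_exons = [
    Exon("chr1", 100, 200, "+", "geneA"),
    Exon("chr1", 150, 250, "-", "geneB"),
    Exon("chr1", 300, 400, "+", "geneA"),
]
segments = disjoin_unambiguous(toy_exons)
```

In this toy case the shared interval 150-200 is discarded, while the flanking segments remain assigned to their respective genes. The quadratic scan over exons keeps the sketch short; a genome-scale run would need an interval index.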
Correlation with other studies
To test whether the pre-processing step had a major impact on expression quantification, we compared our count tables to the published GTEx counts from recount2. Version 2 of the gene counts for the GTEx samples was downloaded from the recount website (https://jhubiostatistics.shinyapps.io/recount/). We compared the distribution of tissue-specific genes across tissues and computed the Pearson correlation for each gene in common between the original recount2 gene count estimates and our version.
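As an illustration of the per-gene comparison, the following sketch computes a Pearson correlation for every gene shared by two count tables and summarizes the results with a median. The dictionary layout (gene identifier mapped to a list of per-sample values in the same sample order) and the function names are assumptions made for this example, not the format of the recount2 objects.

```python
import math
from statistics import median

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)

def per_gene_correlation(counts_a, counts_b):
    """Correlate each gene present in both tables across samples."""
    shared = sorted(set(counts_a) & set(counts_b))
    return {gene: pearson(counts_a[gene], counts_b[gene]) for gene in shared}

def median_correlation(correlations):
    """Median over genes, skipping genes whose correlation is undefined."""
    defined = [r for r in correlations.values() if not math.isnan(r)]
    return median(defined)

# Toy tables with hypothetical gene names and four samples each.
table_a = {"g1": [1, 2, 3, 4], "g2": [5, 3, 8, 1], "g3": [2, 2, 2, 2]}
table_b = {"g1": [2, 4, 6, 8], "g2": [5, 3, 8, 1], "g3": [1, 2, 3, 4], "g4": [0, 1, 0, 1]}
toy_correlations = per_gene_correlation(table_a, table_b)
toy_median = median_correlation(toy_correlations)
```

Genes with constant expression in either table have no defined correlation and are excluded from the median here; how such genes were handled in the actual analysis is not stated in the text.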
Expression specificity of tissue facets
We analyzed the expression level and specificity of each gene stratified by RNA class (i.e. mRNA, e-lncRNA, dp-lncRNA, ip-lncRNA). Expression levels for each gene were represented by the maximum transcripts per million (TPM) of all samples within a facet. To compute the gene specificity we followed the same approach used in Hon et al4. The 99.99 percent confidence intervals for the expression of each category by facet were calculated based on TPM values. Genes with a TPM greater than 0.01 were considered expressed.
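The facet-level summary can be sketched as follows. The maximum-TPM rule and the 0.01 TPM threshold come from the text above. The specificity score, in contrast, is an assumption: it uses an entropy-based definition, 1 - H/log2(N) over N facets, which is one common way to score specificity and may differ in detail from the procedure of Hon et al. All function names are hypothetical.

```python
import math

EXPRESSED_TPM = 0.01  # expression threshold stated in the text

def facet_expression(tpm_by_sample, facet_of_sample):
    """For one gene, return {facet: maximum TPM over the samples in that facet}."""
    out = {}
    for sample, tpm in tpm_by_sample.items():
        facet = facet_of_sample[sample]
        out[facet] = max(out.get(facet, 0.0), tpm)
    return out

def specificity(facet_tpm):
    """Assumed entropy-based score: 0 = uniform across facets, 1 = a single facet."""
    values = list(facet_tpm.values())
    total = sum(values)
    n = len(values)
    if total == 0 or n < 2:
        return float("nan")
    entropy = -sum((v / total) * math.log2(v / total) for v in values if v > 0)
    return 1.0 - entropy / math.log2(n)

def expressed_facets(facet_tpm):
    """Facets in which the gene exceeds the expression threshold."""
    return sorted(f for f, v in facet_tpm.items() if v > EXPRESSED_TPM)

# Toy data for one hypothetical gene measured in three samples from two facets.
toy_tpm = {"s1": 0.5, "s2": 2.0, "s3": 0.0}
toy_facets = {"s1": "liver", "s2": "liver", "s3": "brain"}
toy_summary = facet_expression(toy_tpm, toy_facets)
```

With this definition, a gene detected in a single facet scores 1 and a gene with identical expression in every facet scores 0.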
Identification of differentially expressed genes
Differential gene expression was tested in 13 cancer types, comparing primary tumor with normal samples using TCGA FC-R2 gene expression summaries. Summaries for each cancer type were split by RNA class (coding mRNA, intergenic promoter lncRNA, divergent promoter lncRNA and enhancer lncRNA) and analyzed independently. A generalized linear model approach coupled with empirical Bayes standard errors45 was used to identify differentially expressed genes between the samples. The model was adjusted for the three most variable coefficients for data heterogeneity as estimated by surrogate variable analysis (SVA)46. Correction for multiple testing was performed across RNA classes by merging the resulting p-values for each cancer type and applying the Benjamini-Hochberg method47.
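The multiple-testing step described above, in which p-values from the separately analyzed RNA classes are merged before adjustment, can be sketched as follows. The Benjamini-Hochberg procedure itself is standard; the nested-dictionary input layout and the function names are assumptions made for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def merged_fdr(pvalues_by_class):
    """Pool p-values from all RNA classes of one cancer type, then adjust jointly."""
    keys = [(rna_class, gene)
            for rna_class, genes in pvalues_by_class.items()
            for gene in genes]
    adjusted = benjamini_hochberg([pvalues_by_class[c][g] for c, g in keys])
    return dict(zip(keys, adjusted))

# Toy p-values for one cancer type; class and gene names are hypothetical.
toy_pvalues = {
    "mRNA": {"gene_a": 0.01, "gene_b": 0.04},
    "e-lncRNA": {"gene_c": 0.03, "gene_d": 0.005},
}
toy_fdr = merged_fdr(toy_pvalues)
```

Adjusting the pooled p-values, rather than each class separately, means that the number of tests used for the correction is the total across classes.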
Prognostic analysis
To evaluate the prognostic potential of the genes in FC-R2, we applied a univariate Cox proportional hazards regression model to the four RNA classes included in FC-R2 (22,106 mRNAs, 17,404 e-lncRNAs, 6,204 dp-lncRNAs, and 1,948 ip-lncRNAs) across each of the 13 TCGA cancer types with available survival follow-up. Genes with an FDR less than or equal to 0.05, using the Benjamini-Hochberg47 correction within the cancer type and RNA class, were selected as significant prognostic factors. To identify differentially expressed genes with prognostic potential, the DE lists were intersected with the lists of significant prognostic genes. Supplementary data from Chen et al.34 containing enhancer positions and prognostic potential were obtained from the original publication, and a liftover to the hg38 genome assembly was performed to match FC-R2 coordinates in order to compare the results.
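The final intersection step can be sketched in a few lines. The sketch assumes the Cox model FDR values have already been computed for one cancer type and RNA class; the gene names, the dictionary layout, and the function name are hypothetical.

```python
def prognostic_de_genes(de_genes, cox_fdr, fdr_cutoff=0.05):
    """Return differentially expressed genes that are also significant prognostic factors.

    de_genes: iterable of DE gene identifiers for one cancer type and RNA class.
    cox_fdr: {gene: adjusted p-value from the univariate Cox model}.
    """
    prognostic = {gene for gene, fdr in cox_fdr.items() if fdr <= fdr_cutoff}
    return sorted(set(de_genes) & prognostic)

# Toy inputs with hypothetical gene names.
toy_de = ["gene_a", "gene_b", "gene_c"]
toy_cox_fdr = {"gene_a": 0.01, "gene_b": 0.20, "gene_d": 0.001}
toy_overlap = prognostic_de_genes(toy_de, toy_cox_fdr)
```

The cutoff is inclusive, matching the "less than or equal to 0.05" criterion stated above.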
Data Availability
All data are available at http://marchionnilab.org/fcr2.html. Expression data can be directly accessed through https://jhubiostatistics.shinyapps.io/recount/ and the recount Bioconductor package (v1.9.5 or newer) at https://bioconductor.org/packages/recount as RangedSummarizedExperiment objects organized by Sequence Read Archive (SRA) study ID. The data can be loaded using the R programming language and are ready to be analyzed using Bioconductor packages, or they can be exported to other formats for use in another environment.
Code Availability
For reproducibility purposes, all code used in this manuscript is available at https://github.com/eddieimada/fcr2 and https://github.com/LieberInstitute/marchionni_projects.
Author contributions statement
L.M. conceived the idea; L.M., E.I., A.F. and B.L. designed the study; E.L.I., D.F.S., T.M., W.D., A.S., L.C.T., and L.M. performed the analysis; E.L.I., D.F.S., F.P.L., G.R.F. and L.M. interpreted the results; L.C.T., C.W., C.Y., K.Y., N.K., M.I., H.S., T.K., C.C.H., M.H., J.W.S., P.C., A.E.J., J.T.L. and B.L. provided the data and tools; E.L.I., D.F.S., L.C.T., B.L. and L.M. wrote the manuscript; all authors reviewed and approved the manuscript.
Disclosure declaration
All authors declare no conflicts of interest.
Acknowledgements
This publication was made possible through support from the NIH-NCI grants P30CA006973 (L.M. and A.F.) and R01CA200859 (W.D. and L.M.), NIH-NIGMS grant R01GM118568 (C.W. and B.L.), R21MH109956-01 (L.C.T. and A.E.J.), the Department of Defense (DoD) office of the Congressionally Directed Medical Research Programs (CDMRP) award W81XWH-16-1-0739 (E.L.I. and L.M.), RFBR 17-00-00208 (A.F.), Russian Academic project 0112-2019-0001 (A.F.), and Fundação de Amparo à Pesquisa do Estado de Minas Gerais award BDS-00493-16 (E.L.I. and G.R.F.). recount2 and FC-R2 are hosted on SciServer, a collaborative research environment for large-scale data-driven science. It is being developed at, and administered by, the Institute for Data Intensive Engineering and Science (IDIES) at Johns Hopkins University. SciServer is funded by the National Science Foundation Award ACI-1261715. For more information about SciServer, visit http://www.sciserver.org/.