Development of a novel signature of long noncoding RNAs as a prognostic biomarker for esophageal cancer

Objectives This study aims to develop a lncRNA signature based on RNA-Seq data to predict overall survival in esophageal cancer patients. Methods The lncRNA expression profiles and clinical data were downloaded from The Cancer Genome Atlas (TCGA) database on August 30, 2017. Differentially expressed lncRNAs were screened out between tumor tissues and adjacent normal tissues. The univariate and multivariate Cox regression models were used to develop a prognostic signature for all esophageal cancer patients. The receiver operating curve (ROC) was used to test the sensitivity and specificity of lncRNA signature. Survivals were compared via log-rank test. GO and KEGG enrichment analyses were used to explore the potential functions of prognostic lncRNAs. Results We identified two lncRNAs (RPL34-AS1 and GK3P) were significantly associated with the overall survival of the total 150 esophageal cancer patients. A novel two-lncRNA signature was constructed by Cox regression models. Signature low-risk cases showed better overall survival (median 625.560 days vs. 478.000 days, p = 0.002) than high-risk cases. Further analysis suggested that this two-lncRNA signature was independent of clinical characteristics. GO functional and KEGG pathway enrichment analyses revealed potential functional roles of the two prognostic lncRNAs in tumorigenesis. Conclusions Our findings suggest that the two-lncRNA signature may be a useful prognostic biomarker for predicting overall survival in esophageal cancer patients.


INTRODUCTION
Esophageal cancer (EC) ranks the eighth most common cancer worldwide and the sixth leading cause of cancer-related mortality . Unfortunately, EC is often associated with a poor prognosis that having an 18% overall five-year survival (Malhotra et al. 2017). As with most solid tumors, pathological tumor-node-metastasis (TNM) staging is still the primary indicator of survival in patients with EC (Malhotra et al. 2017;Rice et al. 2017). The molecular heterogeneity and complexity of EC make it difficult to predict its clinical outcomes (Hao et al. 2016;Hardie et al. 2005).
Therefore, it is urgent to find specific prognostic factors of EC which would improve treatment efficacies for patients.
In the age of genomics, researchers have discovered tens of thousands of long noncoding RNAs (lncRNAs) by genome-wide techniques, which are defined as RNA transcripts exceeding 200 nucleotides in length without protein-coding capacity (Schmitt & Chang 2016;Young & Ponting 2013). Emerging evidence revealed that lncRNAs play a critical role in various aspects of cellular processes, including proliferation, migration, invasion, and genomic stability (Iyer et al. 2015;Ling et al. 2015). Based on the expression specificity among different tumor types, lncRNAs could serve as the basis for many clinical applications in oncology (Li & Chen 2013;Shi et al. 2013;Wahlestedt 2013). Till now, lncRNA biomarkers for diagnosis of EC have been reported in many studies (Tong et al. 2015;Wang et al. 2017;Wu et al. 2015).
However, the understanding of the prognostic value of lncRNAs in patients with EC from previous studies was limited.
The Cancer Genome Atlas (TCGA) is a database, which provides a collection of DNA copy-number variation, DNA methylation, RNA sequence, and corresponding clinical data for the top 25 tumor types. In this study, we used the RNA-Seq dataset from the TCGA database to identify candidate prognostic lncRNAs for EC. After integrative analysis, we developed a prognostic lncRNA signature for the overall survival (OS) prediction in EC patients.

TCGA data source
185 EC patients' data were downloaded from the TCGA database on August 30, 2017.
The exclusion criteria were listed in the following: i) histologic diagnosis ruled out EC; ii) another malignancy besides EC. After that, 150 EC patients were included in this study. As the data were downloaded from the public database, ethical approval was not applicable here. Data processing procedures followed the guidelines of TCGA data access (http://cancergenome.nih.gov/publications/publicationguidelines).

LncRNA data mining and processing
The level 3 RNA-Seq data were obtained from the TCGA database. Raw data of lncRNA sequencing were processed and normalized by TCGA RNASeqv2 system (Zand et al. 2016). Here, only lncRNAs with the description in NCBI (http://ncbi.nlm.nih.gov/gene) or Ensembl (http://ensembl.org/index.html) were selected for further analysis. To identify differentially expressed lncRNAs, the expression of lncRNAs in tumor tissues were screened and compared with that in adjacent non-tumor tissues. The candidate lncRNAs were gathered for further analysis.

Construction of the prognostic signature
The expression profile of each lncRNA was normalized by log2 transformation for statistical analysis. The univariate Cox regression model was used to evaluate the ECspecific lncRNAs that were associated with OS. The multivariate Cox regression model was used to evaluate the prognostic value of these OS-related lncRNAs. The risk-score model was constructed based on the combination of the expression profiles of each prognostic lncRNA, weighted by the regression coefficients that were calculated by multivariate Cox regression analysis. The formula was listed as follows: = 1 * 1 + 2 * 1 + ⋯ + * ( : expression level; : regression coefficient derived from the multivariate Cox regression model) (Zeng et al. 2017). The median risk score was set to the cutoff point, and EC patients were divided into high or low groups (Zhou et al. 2016). Further univariate and multivariate Cox regression models were used to investigate the effects of risk score and clinical characteristics in EC patients. The hazard ratio (HR) and the 95% confidence interval (CI) were used for quantification. The time-dependent receiver operating curve (ROC) within five years was used to assess the predictive value of risk score for time-dependent outcomes (Heagerty et al. 2000). The Kaplan-Meier (K-M) curve and log-rank test were used to assess the differences in survival distribution among different groups. Here, the p value <0.05 was considered to be significant. IBM SPSS Statistics 22.0 (IBM Corp., Armonk, NY, USA) was used to perform the analyses mentioned above.

Functional enrichment analysis
To investigate the biological features of these two lncRNAs, we selected the genes that were strongly correlated with these two lncRNAs expressions (Pearson |R| > 0.5) from TCGA database. Pathways and biological processes were predicted by functional enrichment analysis of the Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO) in the DAVID database (http://david.ncifcrf.gov/). The p-value <0.05 was considered to be significant. Afterward, the protein-protein interaction (PPI) network was constructed based on the co-expressed genes via the STRING database (http://string-db.org/). in stage III, and 109 lncRNAs in stage IV. We used "Fold change > 2 or < 0.5, p < 0.05, and FDR < 0.05" as the screening method to identify the significant differentially expressed lncRNAs. Then, 74 candidate lncRNAs were identified as differentially expressed in all tumor stages ( Fig. 1 and Fig. 2).

Construction of a prognostic signature for EC patients
Based on the 74 lncRNAs and the clinical information extracted from TCGA database, two candidate lncRNAs were identified to be significantly associated with OS (p < 0.05) in EC patients by univariate Cox regression analysis ( Table 1). The multivariate Cox regression analysis was used to verify the correlation between two candidate lncRNAs and OS (Table 1 and Fig 3A). These two lncRNAs, RPL34-AS1 and GK3P, also showed prognostic value for EC patients (Fig 3B).
A risk-score formula was established based on the combination of the expression profiles of each prognostic lncRNA, weighted by the regression coefficients that were calculated by multivariate Cox regression analysis. The formula was listed as follows: = 34− 1 * (−0.685) + 3 * (0.697). We calculated the risk score of the two-lncRNA expression signature for each patient.
Based on the median risk score as a cutoff value, all EC patients were divided into two groups: the low-risk score group (n = 75) and the high-risk score group (n = 75) (Fig 4). Meanwhile, K-M curves confirmed that the mean survival time (and standard deviation) of patients in the low-risk score group was 625.560 (± 507.079) days, significantly better than patients in the high-risk score group (478.000 ± 427.816 days, p = 0.002) ( Fig 5A). Furthermore, this risk score model could be used to predict 5-year survival of EC patients as the area under ROC curve was 0.647 ( Fig 5B).
The significant expression patterns of these two candidate lncRNAs in the tumor tissues/adjacent normal tissues group (Fig 6A), and in the low-risk score/high-risk score group were presented in Figure 6B.

Correlation between the two-lncRNA signature and clinical characteristics
To verify the correlation between the two-lncRNA signature and clinical characteristics in EC patients, we used univariate Cox regression analysis and multivariate Cox regression analysis.
The univariate Cox regression analysis indicated that gender, TNM stage, N stage, M stage, additional treatment completion success outcome, residual tumor, and neoplasm tumor status could significantly predict poorer survival of EC patients (p < 0.05, Table 2). The multivariate Cox regression analysis showed that neoplasm tumor status (p = 0.036) and risk score (p = 0.001) could serve as independent prognostic factors in EC patients ( Table 2).
The K-M curves of these clinical characteristics indicated that TNM stage (p < 0.001), N stage (p = 0.001), M stage (p < 0.001), and residual tumor (p = 0.005), and neoplasm tumor status (p = 0.001) were significantly associated with OS of EC patients (Fig 7A).
Moreover, we assessed the association between the risk score and clinical characteristics and found that the risk score might have the prognostic significance in predicting some clinical characteristics, including TNM stage (AUC = 0.643, p = 0.040) and N stage (AUC = 0.579, p = 0.032) (Fig 7B).
Functional assessment of the two-lncRNA signature 629 genes (extracted from TCGA database) were co-expressed with these two lncRNAs (Pearson |R| > 0.5), including 622 genes co-expressed with RPL34-AS1 and seven genes co-expressed with GK3P (Supplemental Table 1). We identified 89 enriched GO terms and seven pathways (p <0.05, enrichment score > 1.5, Supplemental Table 2) by gene enrichment and functional annotation analysis. The top GO biological process and KEGG pathway (Table 3 and Fig 8) were detection of chemical stimulus involved in sensory perception of smell (GO: 0050911) and olfactory transduction (hsa: 04740), respectively. Finally, we constructed a PPI network of 283 genes that were considered as hub genes (Fig 9).

DISCUSSION
Esophageal cancer (EC) is among the deadliest malignancies in the world, with high morbidity and mortality (Griffin & Berrisford 2013;Torre et al. 2015). The onset of dysphagia is associated symptom of EC, and a small proportion of patients with EC survive for five years (di Pietro et al. 2017). The novel biomarker for early diagnosis, process monitoring, and prognostic evaluation might increase the survival rate for EC patients. In the past decade, researchers have reported that about 70% of the genome is transcribed in various cell types and contexts (Djebali et al. 2012;Okazaki et al. 2002), nearly 80% of the genome is biochemically active (2012), and many DNAs code for RNA as end products instead of proteins (Pennisi 2012). LncRNAs are engaged in a wide range of cellular mechanisms, such as DNA methylation and histone modifications (O'Leary et al. 2015;Plath et al. 2003). In addition, increasing evidence has proved that lncRNAs function as a critical player in gene regulation and trend to be expressed in a more tissue-and cell-specific manner, which demonstrated the crucial advantages of lncRNAs as diagnostic or prognostic biomarkers in cancers (Bolha et al. 2017;Evans et al. 2016;Yang et al. 2014 suggested that NORAD was related to poor prognosis in EC patients .
Regarding these two lncRNAs, Zhao et al. reported that the decreased expression of RPL34-AS1 was correlated with larger gastric tumors . For GK3P, it has not been reported yet. Besides, the biological functions of these two lncRNAs have not been investigated in EC. Here, we assessed the functional relevance of these two lncRNAs by using DAVID bioinformatics resources. 629 genes were identified to have significant co-expression with these two lncRNAs. The relevant genes were mainly enriched in olfactory transduction, neuroactive ligand-receptor interaction, detection of chemical stimulus involved in sensory perception of smell, and defense response to bacterium. Additionally, 283 genes were identified as hub genes regulated by these two lncRNAs via the PPI network.
The findings of our study may have essential clinical significance. However, some limitations should be taken into consideration as well. Firstly, the patient cohort of TCGA was obtained from multi-institutional sites, and the results still need to be validated in other cohorts in the future study. Secondly, only lncRNAs were involved in our work, and this study could not represent the whole transcription alteration associated with EC. Thirdly, the functional role of RPL34-AS1 and GK3P are still unknown. Thus, we need well-designed experiments to validate the biological functions of these two lncRNAs in EC.

CONCLUSIONS
We identified two lncRNAs associated with the survival of EC patients in a large cohort from the TCGA database and developed a lncRNA signature. Further analysis indicated that the two-lncRNA signature could be a prognostic indicator independent of clinical characteristics. It can serve as a novel biomarker of prognostic prediction for EC patients. Further functional validations are needed to explore the regulatory mechanisms of these lncRNAs in EC.

ACKNOWLEDGMENTS
We thank Mr. Dong-Ling Chen for his help in technical assistance.