Abstract
The Cancer Genome Atlas (TCGA) initiative has been essential for revealing key mechanisms in human cancer, leading to the development of novel therapeutics. The TCGA data repository consists of genomic, transcriptomic, and epigenetic data and clinical metadata from 11,000+ patients across 33+ cancer types. Analysis of such large data sets requires coding skills that are associated with a steep learning curve for most non-specialists. To enable a wider utilization of TCGA gene expression data, we introduce The Cancer Genome Explorer (TCGEx), a web-based visual data analysis interface that can perform a number of sophisticated analyses ranging from survival modeling and gene set enrichment analysis to unsupervised clustering and linear regression-based machine learning. TCGEx offers customizable options to tailor the analysis to different study contexts and helps generate publication-ready plots. Developed using the R/Shiny framework, this open-source tool enables researchers with no programming expertise to analyze TCGA RNA and miRNA sequencing data from multiple angles. Pre-processed data in TCGEx contains cancer subtype information as well as previously reported intratumoral immune cell signatures, making it possible to investigate potential tumor-immune interactions. TCGEx, available at https://tcgex.iyte.edu.tr, provides an interactive solution to extract meaningful insights from TCGA data and guide scientific research.
Introduction
The Cancer Genome Atlas (TCGA) launched in 2006 is among the most ambitious projects to better understand and tackle human cancer (1). As of this writing, more than 11,000 tumor samples and matched healthy tissues have been characterized at the molecular level across 33 different cancer types as part of this initiative. TCGA studies employ a variety of experimental approaches including RNA and microRNA sequencing, whole exome sequencing, and genotyping and methylation arrays amounting to 2.5+ petabytes of publicly available data. Landmark studies from the TCGA network described common genetic aberrations in cancer such as TP53 loss (2) as well as other cancer-specific changes including BRCA1/2 mutations in breast and ovarian tumors (3, 4) and APC mutations in colorectal carcinoma (5). Furthermore, transcriptomic and proteomic analyses helped differentiate disease subsets with different clinical characteristics leading to the development of novel therapeutics. Gene expression profiling has also helped identify immune signatures within the tumor microenvironment (TME). For instance, a recent report characterized at least 6 different immunological states across all tumor types in humans and provided a framework for investigating cancer-immune cell interactions (6). Additionally, in skin cutaneous melanoma, immune-infiltrated and immune-devoid subsets have been defined that were characterized by differential survival outcomes (7). Follow-up studies have revealed that the genomic and transcriptomic features of tumors determine whether they will respond to targeted therapies and immunotherapeutics (8–10), suggesting that a better understanding of the heterogeneity within the TME is essential for improved therapeutics and clinical outcomes.
An efficient analysis of the large-scale genomic data in the TCGA repository requires programming skills and it is often challenging for the non-specialists. Multiple web-based software applications have been developed to facilitate TCGA data analysis including cBioPortal, Web-TCGA, UCSC Xena, and GEPIA2 (11–14). Although these visual analysis interfaces reduced the barriers to utilizing omics data in cancer research and have been used by numerous investigators in the field, there is a need for a centralized platform that integrates multiple methods to perform customized analyses. In particular, no interactive analysis tool to date is equipped with sophisticated functional genomic analysis capabilities that allow examining TCGA transcriptomics data. In this report, we introduce The Cancer Genome Explorer (TCGEx), an open-source web-based analysis platform written in R/Shiny, to address this need and increase the accessibility and reusability of the TCGA data.
The TCGEx program offers powerful features including the ability to perform multivariable Kaplan-Meier and Cox proportional hazards survival modeling, gene set enrichment analysis (GSEA), principal component analysis (PCA), annotated heatmap generation and hierarchical clustering, feature-to-feature correlation analysis, exploratory graphing of gene expression and clinical metadata, receiver operating characteristic (ROC) analysis, and linear regression-based machine learning for investigating predictor variables against an arbitrary response variable. Through interactive interfaces and customizable plotting options, TCGEx allows users to focus on specific data subsets and tailor the analysis to various study contexts. Furthermore, data from multiple TCGA projects can be aggregated to perform integrative analyses and investigate commonalities among cancers. TCGEx’s efficient and comprehensive computational pipeline utilizes pre-processed data that was prepared by combining normalized RNA/miRNA sequencing data with clinical metadata and intratumoral immune signatures described by the leading studies in the field (6). We provide step-by-step in-app tutorials to increase user-friendliness and help researchers implement TCGEx into their workflows and generate publication-ready plots. TCGEx is publicly available without user registration on the web server at https://tcgex.iyte.edu.tr as well as through a Docker image to facilitate local execution and open-source program development (https://hub.docker.com/r/atakanekiz/tcgex). The source code of TCGEx is accessible at https://github.com/atakanekiz/TCGEx.
Materials and Methods
Summary of the pipeline
TCGEx was created using the R programming language and the Shiny framework, and it runs on a publicly accessible GNU/Linux server. TCGEx is designed to work with pre-processed harmonized data, allowing rapid and iterative analyses of all TCGA projects. Data files for this pipeline were prepared by integrating RNA sequencing (RNAseq) data, microRNA sequencing (miRNAseq) data, and clinical metadata. The latter data type contains dozens of features ranging from patient demographics to tumor subtypes and genetic aberrations. We further expanded the sample-level metadata by incorporating intratumoral immune cell signatures described by Thorsson et al (6). TCGEx analysis starts by selecting one or more TCGA projects in the data selection module. After data selection, relevant data objects are loaded into the session and passed to individual analysis modules (Figure-1). TCGEx offers 10 analytical pipelines accessible through user-friendly graphical interfaces: i) Kaplan-Meier survival analysis; ii) Cox proportional hazards survival analysis; iii) feature-to-metadata exploratory graphing; iv) feature-to-feature correlation graphing; v) top correlating feature analysis; vi) hierarchical clustering and heatmap generation; vii) gene set enrichment analysis (GSEA); viii) receiver operating characteristic (ROC) analysis; ix) principal component analysis (PCA); and x) machine learning analysis with lasso, ridge, or elastic net regression. Each module allows users to examine specific data subsets such as metastatic and/or primary tumor biopsies as well as healthy control samples. Furthermore, flexible input options enable users to tailor the pipeline to their specific needs and generate informative publication-ready plots. The TCGEx Shiny app consists of modularized R scripts that generate both the user interface and the server code required for computation. This design facilitates troubleshooting and will help incorporate our code into other analysis pipelines. Thus, TCGEx aims to reduce barriers to high-throughput data analysis in cancer research by providing a scalable and user-friendly solution.
Data Selection Module
This module provides the interface for selecting the TCGA data sets to be analyzed in TCGEx. Upon data selection, pre-processed data objects are loaded, and descriptive statistics are visualized to help users examine the age and gender distribution of the selected patients as well as the tissue origin of the biopsy samples. TCGEx data objects were prepared by sample-level merging of counts-per-million-normalized log-transformed RNAseq and miRNA data and clinical metadata described by the TCGA studies. To facilitate downstream analyses, we removed genes that are not expressed in 25% or more of the samples in each TCGA project. Unlike some of the previously available tools for TCGA data analysis (13, 14), TCGEx allows users to select single or multiple TCGA projects for integrative analyses. If multiple TCGA projects are selected, pre-processed data sets are aggregated keeping only the matched genes and clinical metadata variables. In the end, the data selection module creates a tidy data frame containing normalized RNAseq/miRNAseq data and clinical metadata; and feeds it to the other analysis modules. In these modules, users can apply further filtering to focus the analysis on specific tissue types such as primary or metastatic samples.
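The pre-processing steps described above can be illustrated with a minimal Python sketch. This is not the TCGEx implementation (which is written in R); the function names, the pseudocount of 1, the log base, and the reading of "not expressed" as a raw count of zero are all assumptions made for illustration.

```python
import math

def log_cpm(counts, pseudocount=1.0):
    """Convert one sample's raw counts {gene: count} to log2(CPM + pseudocount).

    The pseudocount and log base are illustrative assumptions.
    """
    library_size = sum(counts.values())
    return {
        gene: math.log2(count / library_size * 1e6 + pseudocount)
        for gene, count in counts.items()
    }

def expressed_genes(samples, max_unexpressed=0.25):
    """Return genes shared by all samples that are unexpressed (assumed: count == 0)
    in fewer than `max_unexpressed` of the samples."""
    shared = set.intersection(*(set(sample) for sample in samples))
    keep = []
    for gene in sorted(shared):
        fraction_zero = sum(sample[gene] == 0 for sample in samples) / len(samples)
        if fraction_zero < max_unexpressed:
            keep.append(gene)
    return keep

# Toy data: four samples, three hypothetical genes.
raw = [
    {"GENE_A": 10, "GENE_B": 0, "GENE_C": 90},
    {"GENE_A": 20, "GENE_B": 0, "GENE_C": 80},
    {"GENE_A": 30, "GENE_B": 5, "GENE_C": 65},
    {"GENE_A": 40, "GENE_B": 0, "GENE_C": 60},
]

kept = expressed_genes(raw)  # GENE_B is zero in 75% of samples and is dropped
normalized = [
    {gene: value for gene, value in log_cpm(sample).items() if gene in kept}
    for sample in raw
]
```

Taking the set intersection of gene names also mirrors, in a simplified way, the idea of keeping only matched genes when several projects are aggregated.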
Survival Analysis Modules
Studying the association between gene expression and survival outcome is important for understanding key mechanisms of cancer progression (15). TCGEx features two distinct survival analysis methods: Kaplan-Meier (KM) and Cox proportional hazards (CoxPH) modeling (16, 17). In both modules, categorical and continuous variables can be examined. Categorical meta-variables can differ between TCGA projects but generally include tumor grade, mutation types, and patient demographic information, among other features. Continuous variables such as gene expression and intratumoral immune cell signatures (6) can be directly modeled in CoxPH analysis, or they can be categorized as “high”, “mid”, and “low” at user-defined thresholds for KM analysis. If more than one cancer type is selected by the user, gene expression can be categorized across all the projects at a common threshold, or within each cancer type separately. In KM analysis involving categorical variables, the user can also choose which data subsets to include in the analysis and further customize the pipeline. Both KM and CoxPH modules allow including covariates in the model to examine the association of multiple variables with the survival outcome simultaneously. Furthermore, the CoxPH module is designed to accept user-defined interaction terms to accommodate arbitrarily complex multivariable study designs. Survival analysis pipelines of TCGEx generate customizable plots and show various statistical details, including the log-rank p-value for KM and CoxPH, and likelihood ratio test and Wald test p-values along with hazard ratios for CoxPH. Additionally, the proportionality assumption of the CoxPH model is checked using Schoenfeld residuals against time, and the user is shown a message regarding the validity of the model. Taken together, TCGEx survival analysis modules provide flexible options for customizing analysis to specific needs and studying the relationship between gene expression and survival outcomes in cancer.
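To make the KM workflow concrete, the following Python sketch categorizes a continuous variable at a user-defined threshold and computes the Kaplan-Meier product-limit estimate for each group. It is an illustration under stated assumptions rather than TCGEx's R code: the function names, the toy data, and the two-group split are hypothetical, and no log-rank test is computed.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times  : follow-up time for each patient
    events : 1 if the event was observed, 0 if the patient was censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        current_time = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == current_time:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((current_time, survival))
        # Both events and censored patients leave the risk set.
        at_risk -= leaving
    return curve

def split_at_threshold(values, threshold):
    """Label each value 'high' or 'low' relative to a user-defined threshold."""
    return ["high" if value > threshold else "low" for value in values]

# Toy cohort: follow-up times, event indicators, and expression of a hypothetical gene.
times = [5, 8, 12, 12, 20, 25]
events = [1, 0, 1, 1, 0, 1]
expression = [2.0, 7.5, 3.1, 9.0, 4.2, 8.8]

curve = kaplan_meier(times, events)

groups = split_at_threshold(expression, threshold=5.0)
curve_high = kaplan_meier(
    [t for t, g in zip(times, groups) if g == "high"],
    [e for e, g in zip(events, groups) if g == "high"],
)
```

Censored patients contribute to the risk set until their last follow-up but never trigger a drop in the curve, which is why the estimate changes only at observed event times.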
Boxplot module
In this module, users can explore the relationship between gene expression and categorical clinical metavariables. Categorical variables vary according to tumor type, and they include disease pathological stage, patient gender and race, tumor molecular subtype (7, 18), genome-wide methylation state (7), oncogenic signaling (OncoSign) cluster type (19), genetic aberrations (2), genomic instability levels (5), intratumoral lymphocyte scores (7), and numerous other metadata features (Supplementary Table-1). This module allows selecting specific categories to generate gene expression plots to emphasize interesting gene expression differences in data subsets. To offer more flexibility in the analyses, we also added an optional faceting variable which can further break data into sub-categories. This way, for instance, it becomes possible to plot the expression of a given gene in various mutation subsets in male and female patients separately. To facilitate publication-ready plotting, the program allows the user to select journal-specific color palettes, toggle individual data points on/off, and define specific statistical comparisons to highlight. Here, users can select the t-test or Wilcoxon test for pairwise comparisons between data sub-groups. Thus, this module helps researchers explore gene expression in clinically meaningful data subsets.
Correlation analysis modules
Examining correlations among genes and continuous metadata features can help study co-regulation mechanisms (8) and miRNA-mRNA interactions (20). TCGEx offers two modules for correlation analysis which can generate sample-level scatter plots or correlation matrices and tables. Sample-level plots are generated by plotting two continuous variables against each other, where each point denotes a patient sample. Continuous variables for this analysis can include transcript levels from RNAseq and miRNAseq data and numerical metadata such as mutation load or intratumoral immune cell scores (6). The correlation scatter plot is designed to be responsive, and it can show patient barcode, gender, and race upon hovering over individual data points. Furthermore, an optional faceting variable can be used to break data points into various subcategories and generate informative plots, as described in the previous modules. Users can visualize the best-fitting line and the linear regression equation on the graph and generate publication-ready figures. In addition to these functionalities, we anticipated that it can be of interest to examine the top positive and negative correlators of a specific gene and plot a gene-to-gene correlation matrix. The second correlation analysis module of TCGEx enables these analyses by calculating pairwise Pearson and Spearman correlation coefficients and creating tabular and graphical outputs. Therefore, these pipelines facilitate examining the linear relationships between variables across the data set and help investigate potential genetic interactions.
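The two coefficients used by the second correlation module can be sketched in a few lines of Python. The code below is illustrative only (TCGEx itself is written in R): the helper names and toy values are hypothetical, and Spearman's coefficient is computed as the Pearson correlation of average ranks.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy expression values for a target gene and three hypothetical features.
target = [1, 2, 3, 4, 5]
features = {
    "GENE_X": [2, 4, 6, 8, 10],   # perfectly linear with the target
    "GENE_Y": [5, 4, 3, 2, 1],    # perfectly anti-correlated
    "GENE_Z": [1, 4, 9, 16, 25],  # monotonic but not linear
}

# Rank features from the top positive to the top negative correlator.
ranked = sorted(features, key=lambda name: pearson(target, features[name]), reverse=True)
```

GENE_Z shows why both coefficients are worth reporting: its relationship with the target is perfectly monotonic (Spearman of 1) but not perfectly linear (Pearson below 1).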
Heatmap and hierarchical clustering module
Heatmaps are commonly used for visualizing high-throughput gene expression data. Especially when combined with hierarchical clustering, heatmaps can reveal interesting patterns in the data and help categorize samples with different characteristics (21). We created an interactive heatmap module equipped with flexible options to facilitate visualization and analysis of TCGA RNAseq and miRNAseq data. Here, users can manually enter genes of interest, or select curated gene sets described in the Molecular Signatures Database (MSigDB) (22). Extensively utilized in both cancer and non-cancer literature, MSigDB contains thousands of annotated gene sets that define numerous biological states and processes. We anticipated that heatmaps created with these pre-defined gene sets can be especially informative when paired with user-selected sample-level annotations. Thus, we added functionality to show custom annotation bars above the heatmaps. In these annotation bars, both categorical and continuous features can be specified to generate complex heatmaps. As described previously, numerous cancer-specific categorical features are available for selection including tumor stage, mutational subtype, and oncogenic copy number alterations (23). When continuous annotation variables are selected, we designed the pipeline to categorize the samples at the median value to create two color-coded groups. The pipeline also allows selecting multiple continuous variables, in which case the categorization is done for each variable separately or after averaging the variables, thus making it possible to create meta-features on-the-fly. In the heatmap module, users can filter out genes with low variance and change the hierarchical clustering parameters to tailor the analysis to specific needs. Therefore, this module can help researchers create informative heatmaps to highlight meaningful patterns in gene expression data.
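The median-based categorization of continuous annotation variables, including the option to average several variables into a single meta-feature, can be sketched as follows. This is a hypothetical Python illustration, not the TCGEx R code; in particular, assigning values equal to the median to the "low" group is an assumption.

```python
from statistics import mean, median

def median_split(values):
    """Label each sample 'high' or 'low' relative to the median.

    Values equal to the median are labeled 'low' (an assumption for this sketch).
    """
    cutoff = median(values)
    return ["high" if value > cutoff else "low" for value in values]

def averaged_meta_feature(*variables):
    """Average several per-sample variables into one meta-feature."""
    return [mean(sample_values) for sample_values in zip(*variables)]

# Toy per-sample scores for two hypothetical continuous annotations.
cd8_score = [0.1, 0.9, 0.4, 0.7]
ifng_level = [0.3, 0.5, 0.2, 0.9]

# Option 1: categorize each variable separately.
cd8_groups = median_split(cd8_score)
ifng_groups = median_split(ifng_level)

# Option 2: average the variables first, then categorize the meta-feature.
meta_groups = median_split(averaged_meta_feature(cd8_score, ifng_level))
```

The two options can disagree for individual samples, so the choice affects how the annotation bar groups the heatmap columns.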
Gene set enrichment analysis module
In transcriptome analysis, expression patterns of genes in specific pathways can be examined to study the biological states of the samples. Gene set enrichment analysis (GSEA) is a powerful approach for comparing two groups of samples by focusing on a list of genes that share common biological characteristics (24). We developed an efficient GSEA pipeline that allows comparisons between user-defined data subsets. Similar to previous modules, users can define these data subsets by selecting specific values for categorical variables (such as mutation subtypes, tumor stage, and patient gender), or by binarizing continuous variables (such as gene expression, nonsilent mutation rate, and intratumoral immune scores) at custom thresholds. After defining two data subsets, the pipeline ranks genes based on their expression levels and variance using the previously defined signal-to-noise approach (24). The GSEA module utilizes thousands of readily available gene sets from the MSigDB repository and facilitates the investigation of various biological pathways in iterative analyses (22). Importantly, to further increase the usability of this module, we gave the user an option to provide custom gene sets and examine their expression patterns in user-defined data subsets. This module can create enrichment plots for individual gene sets and print a table of leading-edge genes that mostly drive the enrichment scores (24), or it can show the most highly enriched gene sets among others provided to the pipeline. Taken together, the GSEA module provides a flexible and customizable platform for examining functional associations in cancer gene expression data.
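The signal-to-noise ranking step can be written compactly. The Python sketch below assumes the standard form of the metric, the difference of group means divided by the sum of group standard deviations, and omits any adjustments a production implementation may apply. The function names and toy data are hypothetical, and the code is not taken from TCGEx.

```python
from statistics import mean, stdev

def signal_to_noise(group_a, group_b):
    """(mean_A - mean_B) / (sd_A + sd_B) for one gene."""
    return (mean(group_a) - mean(group_b)) / (stdev(group_a) + stdev(group_b))

def rank_genes(expression, labels, group_a, group_b):
    """Rank genes from most enriched in group_a to most enriched in group_b.

    expression : {gene: [value per sample]}
    labels     : group label for each sample, in the same order as the values
    """
    scores = []
    for gene, values in expression.items():
        a = [v for v, label in zip(values, labels) if label == group_a]
        b = [v for v, label in zip(values, labels) if label == group_b]
        scores.append((gene, signal_to_noise(a, b)))
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Toy data: six samples split into two user-defined subsets.
labels = ["high", "high", "high", "low", "low", "low"]
expression = {
    "GENE_UP": [8, 9, 10, 2, 3, 4],
    "GENE_DOWN": [1, 2, 3, 7, 8, 9],
    "GENE_FLAT": [5, 6, 4, 5, 4, 6],
}

ranked_genes = rank_genes(expression, labels, "high", "low")
```

The resulting ranked list is the input that an enrichment-score calculation would walk through for each gene set.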
Receiver operating characteristic analysis module
Receiver operating characteristic (ROC) analysis is a method for evaluating the classification performance of variables in a binary classification system (25). ROC curves are generated by plotting the true positive rate (i.e., sensitivity) against the false positive rate (i.e., 1-specificity) across all possible thresholds for the variable of interest. The area under the ROC curve (AUC) can be examined to assess the power and accuracy of the classifier. ROC analysis is commonly used in the field for investigating diagnostic and prognostic biomarkers (26–28). TCGEx features a powerful and flexible module for performing ROC analyses using gene expression data and clinical metadata. Users can binarize features of interest at custom thresholds to create the two groups needed for the ROC analysis and specify custom predictor variables to assess their classifier potential. Using this pipeline one can examine, for instance, whether the expression of a specific gene is associated with certain tumor characteristics such as mutation rate and intratumoral immune signatures. As in the previously described modules, users can work with both categorical and numerical variables through flexible input options and tailor the analysis to specific needs. We also provided an option to add ROC curves generated using MSigDB gene sets to the graph. This design facilitates comparing user-selected custom predictors and other functionally annotated predictors. Thus, we anticipate that the ROC module can be conveniently used in various study contexts to identify novel predictors in cancer data.
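The construction of a ROC curve and its AUC, as described above, can be sketched in plain Python. This is an illustrative example with hypothetical names and toy data, not the TCGEx R implementation; the AUC is computed with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """Return (false_positive_rate, true_positive_rate) points across all thresholds.

    scores : predictor values (higher values predict the positive class)
    labels : 1 for the positive group, 0 for the negative group
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    pairs = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Samples with tied scores cross the threshold together.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Toy data: predictor values and a binarized feature (1 = group of interest).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
auc = area_under_curve(points)
```

In this toy example, 8 of the 9 positive-negative pairs are ordered correctly by the predictor, which equals the AUC of 8/9 obtained from the curve.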
Principal component analysis module
Dimensionality reduction techniques such as principal component analysis (PCA) are commonly employed in transcriptomics (29). PCA involves constructing a new coordinate system for the data using linear combinations of its variables (i.e., genes in RNAseq). The axes of this new coordinate system, called principal components, represent the variation in the experiment, and they help visualize multidimensional data sets on a 2D graph. TCGEx program features a user-friendly PCA module with flexible inputs to be utilized in various study contexts. Similar to other modules, one can specify sample types and select a custom list of genes to prepare the gene expression matrix for PCA. Here, the module conveniently allows selecting all genes in the RNAseq/miRNAseq data or specifying genes from MSigDB pathways. Although PCA accounts for both highly and lowly variable genes, we provided a pre-filtering option to speed up the analysis by removing lowly variable genes. In PCA plots, it is usually helpful to annotate individual points based on specific characteristics. For instance, one may want to see whether samples with certain genetic features form a separate cluster in the PCA space (30). To address this need, this module allows color-coding data points based on categorical or continuous variables, the latter of which is categorized at the median value to form “high” and “low” groups. PCA module generates customizable and interactive graphics and allows exporting publication-ready plots.
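As a rough illustration of how the first principal component is obtained, the Python sketch below centers a small samples-by-genes matrix, builds the covariance matrix, and finds its leading eigenvector by power iteration. Power iteration is used only to keep the example dependency-free; it is an assumption of this sketch and not necessarily how TCGEx computes PCA. All names and values are hypothetical.

```python
import math

def center_columns(matrix):
    """Subtract each gene's (column's) mean from its values."""
    n = len(matrix)
    means = [sum(column) / n for column in zip(*matrix)]
    return [[value - m for value, m in zip(row, means)] for row in matrix]

def covariance_matrix(centered):
    """Gene-by-gene sample covariance of a column-centered matrix."""
    n = len(centered)
    columns = list(zip(*centered))
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in columns]
        for ci in columns
    ]

def leading_component(cov, iterations=500):
    """Leading eigenvector and eigenvalue by power iteration.

    Assumes the starting vector is not orthogonal to the leading eigenvector.
    """
    p = len(cov)
    vec = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        product = [sum(cov[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in product))
        vec = [v / norm for v in product]
    eigenvalue = sum(
        vec[i] * sum(cov[i][j] * vec[j] for j in range(p)) for i in range(p)
    )
    return vec, eigenvalue

# Toy matrix: 4 samples (rows) x 3 genes (columns); genes 1 and 2 co-vary, gene 3 is flat.
matrix = [
    [2.0, 1.9, 5.0],
    [4.0, 4.1, 5.1],
    [6.0, 6.2, 4.9],
    [8.0, 7.8, 5.0],
]

centered = center_columns(matrix)
cov = covariance_matrix(centered)
loadings, eigenvalue = leading_component(cov)

# Fraction of total variance captured by the first principal component.
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))

# Coordinates of each sample along the first principal component.
pc1_scores = [sum(v * w for v, w in zip(row, loadings)) for row in centered]
```

The nearly flat third gene contributes almost nothing to the first component, which illustrates why removing lowly variable genes before PCA mainly saves computation rather than changing the leading components.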
Machine learning module
Machine learning (ML) refers to a range of applications and algorithms that aim to extract relevant and useful information from data. In the context of bioinformatics, ML is utilized to perform classification, prediction, and feature selection from biological data (31, 32). Regularized regression, a type of supervised machine learning technique, is a derivative of linear regression which allows one to simultaneously create a model and perform feature selection in high-dimensional data. It is optimized to minimize the sum of squared residuals while penalizing the estimated model coefficients. The penalty term is applied to the model equation to reduce model complexity while limiting the mean squared error of the predictions. Ridge, lasso, and elastic net regression are all types of regularized regression methods with varying strengths depending on the underlying structure of the data. Ridge regression shrinks the estimated coefficients without making them zero and assigns similar coefficients to correlated predictors (33). Lasso regression, on the other hand, can shrink model coefficients to zero and therefore performs feature selection (34). Elastic net regression represents a middle ground between ridge and lasso regression, and it is well suited for data sets in which the number of predictor variables significantly exceeds the number of samples and/or there is a high number of correlated variables. The optimal degree of penalization for each approach can be selected using cross-validation across the data set. The ML module in TCGEx provides a user-friendly interface for utilizing these regularized linear regression methods on the TCGA transcriptomics data. Users choose response and predictor variables for the model and specify the ML algorithm by setting the penalization parameter alpha. The response variable (i.e., dependent variable) in these analyses can be constructed from a custom list of genes or MSigDB gene sets by taking the average of the expression values. While the predictor variables (i.e., independent variables) can be specified similarly, we also provided an option to select all miRNAs as predictors to make the pipeline convenient for miRNA-focused analyses (35). This way, one can easily examine, for instance, miRNAs that are most closely associated with a biological pathway of interest. The ML module can use the entire data to train a model or split it into training and testing subsets to evaluate the overall model accuracy (32). The findings of the analyses are displayed through interactive graphs which show penalized coefficients and the mean squared error of the cross-validated model across various levels of regularization. Taken together, the ML module provides a customizable platform for sophisticated analysis of the TCGA transcriptomics data.
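The difference between lasso and ridge penalties can be seen in a small coordinate-descent sketch. The Python code below minimizes (1/2n)·Σ(y − Xβ)² + λ·[α·Σ|β| + (1 − α)/2·Σβ²], where alpha = 1 corresponds to lasso and alpha = 0 to ridge. This parameterization is an assumption of the sketch (the text does not state the exact objective), the predictors are assumed to be centered so that no intercept is needed, cross-validation is not shown, and none of the names come from TCGEx.

```python
def soft_threshold(value, cutoff):
    """Shrink a value toward zero by `cutoff`, returning 0 inside [-cutoff, cutoff]."""
    if value > cutoff:
        return value - cutoff
    if value < -cutoff:
        return value + cutoff
    return 0.0

def elastic_net(X, y, lam, alpha, sweeps=100):
    """Coordinate descent for elastic net regression without an intercept.

    X     : list of samples, each a list of centered predictor values
    y     : centered response values
    lam   : overall penalty strength (lambda)
    alpha : 1.0 for lasso, 0.0 for ridge, in between for elastic net
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Correlation of predictor j with the residual that excludes predictor j.
            rho = 0.0
            for i in range(n):
                fit_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
            rho /= n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam * alpha) / (scale + lam * (1 - alpha))
    return beta

# Toy data: predictor 1 drives the response, predictor 2 has only a weak effect.
X = [[-2, 1], [-1, -1], [0, 0], [1, -1], [2, 1]]
y = [-3.9, -2.1, 0.0, 1.9, 4.1]  # equals 2 * x1 + 0.1 * x2

lasso_coefficients = elastic_net(X, y, lam=0.5, alpha=1.0)
ridge_coefficients = elastic_net(X, y, lam=0.5, alpha=0.0)
```

With the same penalty strength, the lasso fit sets the weak predictor's coefficient exactly to zero, whereas the ridge fit only shrinks it, which is the behavior that makes lasso useful for feature selection.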
Results and Discussion
In this section, we provide a use-case scenario to demonstrate how TCGEx can aid hypothesis-driven research. While the analyses herein will demonstrate the basic capabilities of the pipeline on the TCGA Skin Cutaneous Melanoma (SKCM) primary and metastatic tumor data, users can easily adapt this approach to various study contexts and examine other TCGA data sets. Melanoma is an aggressive form of cancer that develops in pigment-producing cells of the skin, although it can rarely be observed in other organs including the eye and gastrointestinal tract. More than 320,000 new cutaneous melanoma cases are reported globally each year and 55,000+ lives are lost because of this disease (36). UV exposure is a significant risk factor for the development of melanoma, and most cases are diagnosed at an advanced stage where the cancer cells have already metastasized to nearby lymph nodes or distant organs. Accordingly, the TCGA-SKCM study is composed of mainly metastatic tumor biopsies, as shown in the TCGEx data selection module (Fig.2a). These samples were collected from comparable numbers of male and female patients in various age groups (Fig.2b). The landmark TCGA paper describing the genetic features of SKCM classified tumors into subtypes based on mutations in BRAF, NF1, and RAS genes (7). These subtypes are characterized by distinct sets of accompanying genomic aberrations, including PTEN loss in BRAF-mutant tumors, KIT amplification in triple wild-type (WT) tumors, and Akt3 amplification in RAS-mutant and wild-type tumors, which are readily observed in transcriptomics data (Fig.2c). When these graphs were faceted by patient gender, some of these genes showed sex-dependent differences in their expression patterns (Fig.S1a, b). Interestingly, BRAF-mutant melanomas expressed higher levels of transferrin (TF), which was recently shown to protect circulating tumor cells from oxidative stress leading to BRAF inhibitor resistance (Fig.S1c) (37). On the other hand, NF1-mutant melanoma was distinctly marked by the high expression of the hepatocyte-specific transcription factor ONECUT1, which was shown to be associated with tumor progression (Fig.S1c) (38). Our analyses also revealed that ZC3H13, a component of the RNA methyltransferase complex, is enriched in RAS-mutant melanoma, and that triple-WT tumors selectively expressed FOXF2, a transcription factor that plays a complex role in regulating oncogenic signaling pathways (Fig.S1c) (39). These findings suggest that distinct molecular mechanisms may be responsible for driving different tumor subtypes in melanoma and indicate possible intervention points.
BRAF-mutant melanoma constitutes about 50% of the cases in the TCGA cohort followed by RAS-mutant (∼30%), NF1-mutant, and triple-WT tumors (∼10% each). Of these 4 groups, BRAF-mutant melanoma has the most favorable prognosis (Fig.2d), although this may be confounded by the younger age of the patients in this group (7). While melanoma incidence is comparable between both sexes, females were reported to have a reduced risk of mortality (40). Examination of the TCGA-SKCM cohort did not show a significant survival difference between the two groups; however, a slightly increased risk was observed for the males (Fig.S2a, b). Notably, when we examined only BRAF and NRAS-mutant tumors, males were associated with poorer survival outcomes, especially in the NRAS-mutant subgroup (Fig.2e). These findings suggest that sex-specific factors may differentially influence the prognosis in different mutation subtypes.
In addition to defining potential genetic drivers in melanoma, the TCGA-SKCM manuscript also identified three distinct tumor subsets based on gene expression profiles as immune, keratin, and MITF-low groups (7). The immune subset was characterized by the higher expression of immune cell-specific genes (Fig.3a), while the keratin and MITF-low subsets were characterized by the high and low expression of epithelial genes, respectively, in the absence of a clear immune infiltration signature. Of these three groups, the immune subset exhibits the most favorable survival outcome (Fig.3b, S3), which is consistent with the positive role of inflammation in SKCM reported previously (6). Interferon-gamma (IFNγ) is a key inflammatory cytokine released from T and NK cells to eliminate tumor cells. IFNγ exerts its multifaceted functions within the TME by signaling through IFNγ-receptors (IFNGR) and regulating dozens of downstream genes in tumor cells and the neighboring immune cells (41). We examined the IFNγ response gene expression signature in SKCM subsets and observed higher levels of most of these transcripts in the immune subset (Fig.3c) (42), and as expected, these trends were paralleled by elevated IFNγ levels and CD8+ T cell signatures within the TME. We next wanted to investigate the relationship between the intratumoral CD8+ T cell score and the T cell receptor (TCR) diversity (6). This is of interest because multiple studies have suggested that the clonally expanded T cells within melanoma are associated with a spectrum of dysfunctional phenotypes (43, 44). Increasing CD8+ T cell signature within TCGA-SKCM tumors positively correlated with the overall TCR diversity, suggesting a polyclonal T cell infiltrate and the presence of T cell clones with different antigenic specificities (Fig.3d) (45). Notably, increasing TCR diversity correlated with higher levels of the cytotoxic effector molecule perforin (PRF1) and decreasing genetic signatures of M2 macrophages that are known to exert pro-tumorigenic functions (Fig.3e) (6). Taken together, TCGEx can be utilized to explore multifaceted data types in TCGA including the tumor-specific subtypes and the intratumoral immune landscape.
MicroRNAs (miRNAs), 20-22 nucleotide noncoding RNAs responsible for post-transcriptional modulation, are key regulators of antitumor immunity and tumor immunoevasion (46). Given the importance of IFNγ signaling in cancer, we next focused on TCGA-SKCM miRNAseq data and utilized lasso regression to select top miRNAs associated with the IFNγ response. The patient cohort was divided into training (70%) and test (30%) subsets, and the expression of miRNAs (529 genes after low-expression filtering) was modeled as the predictor variables against the average expression of IFNγ signature genes as the response variable. A total of 26 positive and 18 negative coefficients remained in the optimally regularized model (log lambda + 1 standard error) (Fig.4a, S4). Five miRNAs (miR-155-5p, -150-3p, -150-5p, -142-3p, and -7702) were the last predictors remaining in the model when the model penalty term lambda (λ) was increased further at the expense of the minimized mean squared error. As expected from prior work, these miRNAs showed a strong positive correlation with T cell effector molecules including IFNγ, PRF1, and granzyme B (GZMB) (Fig.4b). Of note, top positive correlators of miR-155-5p were immune-specific genes such as markers of T cells and costimulatory receptors, and top negative correlators included regulators of nucleotide biosynthesis, metabolism, and signal transduction (Fig.S5). Since a single miRNA can control dozens of targets and miR-155-5p correlates with immunity in SKCM (47), we next examined whether the expression levels of miR-155-5p can distinguish the immune subset of SKCM from the keratin and MITF-low subsets. We performed ROC analysis and found that miR-155-5p predicted the immune-enriched SKCM subset with an AUC of 0.819, although the average expression of IFNγ response genes was a slightly better predictor (Fig.4c). We noted that the averaged expression of the 5 miRNAs that remained in the highly regularized lasso model was a better predictor of the immune subset of SKCM (Fig.4d). To further examine gene expression patterns in miR-155-5p-high melanoma, we performed GSEA using annotated KEGG gene sets from the MSigDB database (42) and noted that the expression of genes associated with cytokine-receptor interaction was enriched in the miR-155-5p-high subset (Fig.4e, f). Lastly, we performed dimensionality reduction using 685 genes belonging to the “adaptive immune response” gene ontology class and color-coded the individual samples based on miR-155-5p expression to see whether miR-155 levels can differentially mark samples in the PCA space. Interestingly, the first principal component distinguished a portion of the samples, suggesting that miR-155 is a marker of the adaptive immunity signature within the TME, which is supported by the work from our group and others (47–50).
Taken together, TCGEx provides a powerful and flexible interface for investigating TCGA gene expression data through sophisticated analyses. The use case scenario provided here for SKCM can be adapted to study many other aspects of cancer, including genetic drivers in microsatellite-stable or unstable subsets of colon cancer and their immune associations (5), prognostic markers in different mutational and histological subtypes of breast cancer (4), and molecular characteristics of previously defined miRNA subtypes in ovarian cancer (3), among others. Furthermore, TCGEx allows integrative analyses of multiple cancer types to study the molecular underpinnings of human cancer. Visual analysis interfaces like TCGEx increase the accessibility and reusability of high-throughput cancer data and facilitate scientific research by serving as a bridge between bench scientists and bioinformaticians. TCGEx, publicly and freely accessible on the web, not only reduces barriers to data analysis for researchers with no programming experience but can also be integrated into larger analysis pipelines to facilitate hypothesis-driven research.
Funding
A.A. received an Abdi Ibrahim Foundation undergraduate scholarship. C.S., E.K., and M.M.O. received Turkish Health Institutes Directorate (TUSEB) undergraduate research project funding (TUSEB-A1-28154). C.S. and E.K. also received Scientific and Technological Council of Turkey (TUBITAK) 2247C-STAR scholarships. M.E.K. was supported by a TUBITAK-2210-A graduate scholarship. H.A.E. is supported by TUBITAK (2232-121C115, 1001-122S337), the Turkish Academy of Sciences (TUBA-GEBIP-2022), and institutional grants (2022IYTE-2-0060, 2023IYTE-1-0053, and 2023IYTE-1-0054).
Author Contributions
M.E.K. prepared the input data, developed the code connecting different modules of TCGEx, participated in developing the ML module, and performed app deployment and maintenance. C.S. developed the survival analysis and heatmap modules. E.K. developed the ROC and heatmap modules. A. Askin developed the PCA, correlation, and GSEA modules. M.M.O. developed the ML module and participated in preparing and cleaning up the input data. G.K. contributed to the correlation modules and website analytics. A. Aksit was instrumental in setting up the Shiny server within the institutional infrastructure. R.O.C. provided critical insight during the development of the TCGEx app. H.A.E. developed the initial prototype for most of the modules, oversaw the TCGEx app development, and set up the web-accessible server. All authors contributed to preparing and reviewing the manuscript.
Data availability
The TCGEx webserver is accessible at https://www.tcgex.iyte.edu.tr. All code for the TCGEx app, data download, and pre-processing is publicly accessible at https://github.com/atakanekiz/TCGEx. To facilitate local execution and code development, the TCGEx Docker image can be accessed at https://hub.docker.com/repository/docker/atakanekiz/tcgex/. The datasets used in TCGEx were derived from the TCGA data repository in the public domain (https://portal.gdc.cancer.gov/). Processed data files and TCGEx source code are accessible on Figshare (doi:10.6084/m9.figshare.23912532) (https://figshare.com/s/22f21b780acd57fb5dfb).
Conflict of interest
The authors declare no conflicts of interest.
Acknowledgments
We thank the Director of the IzTech IT Department, Dr. Ozgur Orun, for helpful discussions on hosting the TCGEx app and for facilitating the establishment of the Shiny server at IzTech. We also extend our gratitude to Emin Bayindirli for his involvement during the initial development of the app. We are also grateful to the patients and their families participating in the TCGA project, and to the TCGA initiative as a whole for making this repository publicly available.