Pathway enrichment analysis of -omics data

Pathway enrichment analysis helps gain mechanistic insight into large gene lists typically resulting from genome-scale (–omics) experiments. It identifies biological pathways that are enriched in the gene list more than expected by chance. We explain pathway enrichment analysis and present a practical step-by-step guide to help interpret gene lists resulting from RNA-seq and genome sequencing experiments. The protocol comprises three major steps: define a gene list from genome-scale data, determine statistically enriched pathways, and visualize and interpret the results. We focus on differentially expressed genes and mutated cancer genes; however, the described principles can be applied to diverse –omics data. The protocol is designed for biologists with no prior bioinformatics training and uses freely available software including g:Profiler, GSEA, Cytoscape and Enrichment Map.


INTRODUCTION
Comprehensive surveys of DNA, RNA and proteins in biological samples are now routine. The resulting data are growing exponentially and their analysis helps discover novel biological functions, genotype-phenotype relationships and disease mechanisms. However, analysis and interpretation of these data is a major challenge for many researchers. Analyses often result in long lists of genes that require an impractically large amount of manual literature searching to interpret. A standard approach to addressing this problem is pathway enrichment analysis, which summarizes the large gene list as a smaller list of more easily interpretable pathways. Pathways are statistically tested for over-representation in the experimental gene list above what is expected by chance. For instance, experimental data containing 40% cell cycle genes is surprisingly enriched given that only 8% of human protein-coding genes are involved in this process.
In a recent example, we used pathway enrichment analysis to help identify histone and DNA methylation by the Polycomb repressive complex (PRC2) as the first rational therapeutic target for ependymoma, one of the most prevalent childhood brain cancers 1 . This pathway is targetable by available drugs, such as 5-azacytidine, which was used on a compassionate basis in a terminally ill patient and stopped rapid metastatic tumour growth. In another example, we analysed rare copy number variants (CNVs) in autism and identified several significant pathways affected by gene deletions, whereas only a few significant hits were identified with case-control association tests of single genes or loci 2,3 . These examples illustrate the useful insights into biological mechanisms that can be achieved using pathway enrichment analysis.
This protocol covers pathway enrichment analysis of large gene lists typically derived from genome-scale ("-omics") technology. The protocol is intended for experimental biologists who are interested in interpreting their -omics data. It requires only an ability to learn and use "point-and-click" computer software, although advanced users can benefit from the automatic analysis scripts we provide. We analyse human gene expression and somatic mutation data as examples; however, our conceptual framework is applicable to analysis of lists of genes or biomolecules from any organism derived from large-scale data, including proteomics, genomics, epigenomics and gene regulation studies. The protocol uses free, easy to use, updated and well documented software.
There are two major ways to define a gene list from -omics data: list or ranked list. Certain -omics data naturally produce a gene list, such as all somatically mutated genes in a tumor from exome sequencing, or all proteins that interact with a bait in a proteomics experiment. Such a list is suitable for direct input into pathway enrichment analysis using g:Profiler protocol 1A. Other -omics data naturally produce ranked lists.
For example, a list of genes can be ranked by differential gene expression score or sensitivity in a genome-wide CRISPR screen. Early pathway enrichment analysis approaches involved applying a threshold to a ranked gene list (e.g. FDR-adjusted p-value below 0.05 and fold-change above 2); however, this is often arbitrary and thus not recommended, especially when meaningful ranks are available for all or most of the genes in the genome. Modern approaches, like GSEA, are designed to analyze ranked lists of all available genes and do not require a threshold. A ranked list is suitable for input into pathway enrichment analysis using GSEA protocol 1B. Alternatively, a partial ranked gene list can be analysed using g:Profiler.
As an example, we describe analysis of raw RNA-seq data to define a ranked gene list. DNA sequence reads are quality filtered (e.g. by trimming to remove low quality bases) and mapped to a genome-wide reference set of transcripts to enable counting reads per transcript. Read counts are aggregated at the gene level (counts per gene). Typically, RNA-seq data for multiple biological replicates (three or more) for each of multiple experimental conditions (two or more, e.g. treatment vs. control) are available (Box 2, experimental design). Read counts per gene are normalized across all samples to remove unwanted technical variation between samples, for example, due to differences in sequencing lane or total read number per sequencing run [13][14][15] . Next, read counts per gene are tested for differential expression across sample groups (e.g. treatment vs. control) (Supplementary Protocol 1). Software packages such as edgeR 16 , DESeq 17 , limma 18 , and Cufflinks 19 implement procedures for RNA-seq data normalization and differential expression analysis. Differential gene expression analysis results include: 1) the p-value of the significance of differential expression; 2) the related q-value (a.k.a. adjusted p-value) that has been corrected for multiple testing across all genes (e.g. using the Benjamini-Hochberg False Discovery Rate (FDR) procedure 20 ); 3) effect size and direction of expression change (expressed as fold-change or log-transformed fold-change) so that up-regulated genes are positive and at the top of the list and down-regulated genes are negative and at the bottom of the list. The list of all genes is then ranked by one or more of these values (e.g. -log10 p-value multiplied by the sign of log-transformed fold-change). This ranked list is provided as input to pathway enrichment analysis (ranked gene list, no threshold needed, pathway enrichment analysis using GSEA protocol 1B).
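The ranking rule mentioned above (-log10 p-value multiplied by the sign of the log-transformed fold-change) can be sketched in a few lines of Python. This is a minimal illustration, not part of the protocol's scripts: the function names, the example gene names and values, and the floor applied to very small p-values are all assumptions made for this sketch. The output layout (gene identifier, tab, score) follows the two-column rank file convention used by GSEA.

```python
import math

def rank_score(p_value, log_fold_change, min_p=1e-300):
    """Signed significance: -log10(p) multiplied by the sign of the log fold-change."""
    p = max(p_value, min_p)  # assumed floor to avoid log10(0)
    sign = (log_fold_change > 0) - (log_fold_change < 0)
    return -math.log10(p) * sign

def build_ranked_list(results):
    """results: iterable of (gene, p_value, log_fold_change) tuples."""
    scored = [(gene, rank_score(p, lfc)) for gene, p, lfc in results]
    # Up-regulated genes end up at the top, down-regulated genes at the bottom.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def write_rnk(ranked, path):
    """Write a two-column, tab-delimited rank file (gene, score)."""
    with open(path, "w") as handle:
        for gene, score in ranked:
            handle.write(f"{gene}\t{score:.6g}\n")

# Hypothetical differential expression results: (gene, p-value, log2 fold-change).
example = [("GENE_A", 1e-8, 2.5), ("GENE_B", 0.5, -0.1), ("GENE_C", 1e-5, -3.0)]
ranked = build_ranked_list(example)
```

Calling `write_rnk(ranked, "my_ranks.rnk")` would then save the list to a file name of the user's choosing.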
Step 2A: Pathway enrichment analysis of a gene list using g:Profiler
The default analysis implemented in g:Profiler and similar web-based tools (e.g., Panther 21 , ToppGene 22 , Enrichr 23 , DAVID 24 ) searches for pathways whose genes are significantly enriched (i.e. over-represented) in the fixed list of genes of interest, compared to all genes in the genome. The p-value of the enrichment is computed using a Fisher's exact test and multiple test correction is applied (Box 3).
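To make the over-representation test concrete, the sketch below computes a one-sided Fisher's exact (hypergeometric) p-value for a single pathway. It is an illustration only and does not reproduce g:Profiler's implementation or its multiple testing correction. The example reuses the 40% versus 8% cell cycle figures from the Introduction; the background size of 20,000 genes and the list size of 50 genes are assumptions chosen for the example.

```python
from math import comb

def enrichment_p_value(n_background, n_pathway, n_list, n_overlap):
    """One-sided Fisher's exact test: probability of observing at least
    n_overlap pathway genes in a list of n_list genes drawn at random
    from n_background genes, of which n_pathway belong to the pathway."""
    max_overlap = min(n_pathway, n_list)
    total = comb(n_background, n_list)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_list - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total

# Assumed numbers: 20,000 background genes, 8% (1,600) in the cell cycle pathway,
# and a list of 50 genes of which 40% (20) are cell cycle genes.
p_example = enrichment_p_value(20000, 1600, 50, 20)
```

The choice of `n_background` matters: the same overlap gives a different p-value when the background is restricted, for example, to the genes measured in the experiment.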
The g:Profiler tool also includes an ordered enrichment test, which is suitable for lists of up to a few thousand genes that are ordered by a score, while the rest of the genes in the genome lack meaningful signal for ranking. For example, significantly mutated genes may be ranked by a score from a cancer driver prediction method 25 . This analysis repeats a modified Fisher's exact test on incrementally larger sub-lists of the input genes and reports the sub-list with the strongest enrichment p-value for every pathway 26 . g:Profiler searches a set of pathway, network, regulatory motif, and phenotype gene sets.
Major gene set categories can be selected to customize the search.
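The idea behind the ordered enrichment test described above can be illustrated with a simplified sketch: the over-representation test is repeated on incrementally larger sub-lists taken from the top of the ordered input, and the sub-list with the smallest p-value is kept. This is not g:Profiler's modified test and omits its correction for the repeated testing; the function names and the toy gene identifiers are assumptions.

```python
from math import comb

def hypergeom_tail(n_background, n_pathway, n_drawn, n_overlap):
    """P(overlap >= n_overlap) when n_drawn genes are sampled at random."""
    upper = min(n_pathway, n_drawn)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_drawn - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_background, n_drawn)

def best_prefix(ordered_genes, pathway, n_background):
    """Return (p_value, sub_list_size) for the most enriched top sub-list."""
    pathway = set(pathway)
    best_p, best_size, hits = 1.0, 0, 0
    for size, gene in enumerate(ordered_genes, start=1):
        if gene in pathway:
            hits += 1
        p = hypergeom_tail(n_background, len(pathway), size, hits)
        if p < best_p:
            best_p, best_size = p, size
    return best_p, best_size

# Toy example: two pathway genes sit at the very top of the ordered list.
ordered = ["A", "B", "C", "D", "E", "F"]
p_best, size_best = best_prefix(ordered, {"A", "B", "X"}, n_background=100)
```

Because the minimum is taken over many sub-lists, the raw value returned here is optimistic and would need an appropriate correction before interpretation.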
Pathway enrichment methods that use the Fisher's exact test, or related over-representation tests, require the definition of a set of background genes for comparison.

Step 2B: Pathway enrichment analysis of a ranked gene list using GSEA
Pathway enrichment analysis of a ranked gene list is implemented in the GSEA algorithm 5 . GSEA is a threshold-free method that analyzes all genes based on their differential expression rank, or other score, without prior gene filtering. GSEA is particularly suitable and recommended when ranks are available for all or most of the genes in the genome (e.g. for RNA-seq data); however, it is limited or inapplicable when only a small portion of genes have ranks available.
The GSEA method searches for pathways whose genes are enriched at the top or bottom of the ranked gene list, more so than expected by chance alone. For instance, if the top most differentially expressed genes are involved in the cell cycle, this suggests that the cell cycle pathway is regulated in the experiment. In contrast, the cell cycle pathway is likely not significantly regulated if the cell cycle genes appear randomly scattered through the whole ranked list. To calculate an enrichment score (ES) for a pathway, GSEA progressively examines genes from the top to the bottom of the ranked list, increasing the enrichment score if a gene is part of the pathway and decreasing the score otherwise. These running sum values are weighted, so that enrichment in the very top- (and bottom-) ranking genes is amplified, whereas enrichment in genes with more moderate ranks is not amplified. The ES is calculated as the maximum value of the running sum and normalized relative to pathway size, resulting in a normalized enrichment score (NES) that reflects the enrichment of the pathway in the list. Positive and negative NES values represent enrichment at the top and bottom of the list, respectively. Finally, a permutation-based p-value is computed and corrected for multiple testing to produce a permutation-based FDR q-value that ranges from zero (highly significant) to one (not significant) (Box 3). The same analysis is performed starting from the bottom of the ranked gene list to identify pathways enriched in the bottom of the list.
Resulting pathways are selected using the FDR q-value threshold (e.g. q<0.05), and ranked using NES. It is also useful to inspect the "leading edge" genes that contribute to the increase of the enrichment score before it peaks.
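The running sum described above can be sketched as follows. This is a simplified illustration of a weighted running-sum statistic, not the GSEA software itself: hits are weighted by the absolute gene score raised to an assumed exponent (`weight`), misses are penalized equally, and the ES is the largest deviation from zero. The function name, the toy gene names and scores are assumptions; normalization to NES is not shown.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """ranked: list of (gene, score) sorted from highest to lowest score.
    Returns (ES, leading_edge_genes)."""
    gene_set = set(gene_set)
    in_set = [gene in gene_set for gene, _ in ranked]
    n_miss = in_set.count(False)
    hit_total = sum(abs(s) ** weight for (_, s), hit in zip(ranked, in_set) if hit)
    if n_miss == 0 or hit_total == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    running, best, peak = 0.0, 0.0, -1
    for i, ((_, score), hit) in enumerate(zip(ranked, in_set)):
        # Increase the running sum at pathway genes, decrease it otherwise.
        running += abs(score) ** weight / hit_total if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best, peak = running, i

    # Leading edge: pathway genes before the peak (positive ES)
    # or after the trough (negative ES).
    span = ranked[: peak + 1] if best >= 0 else ranked[peak + 1 :]
    leading_edge = [gene for gene, _ in span if gene in gene_set]
    return best, leading_edge

toy_ranked = [("G1", 5), ("G2", 4), ("G3", 3), ("G4", -1), ("G5", -2), ("G6", -6)]
es_top, edge_top = enrichment_score(toy_ranked, {"G1", "G2"})
es_bottom, edge_bottom = enrichment_score(toy_ranked, {"G5", "G6"})
```

In this toy list, a set made of the two top genes yields a positive ES and a set made of the two bottom genes yields a negative ES, mirroring the sign convention of the NES.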
GSEA has two methods to determine the statistical significance of the enrichment score and compute a p-value: gene set permutation and phenotype permutation. For gene set permutation, input is a ranked list and GSEA compares the observed pathway enrichment score to a distribution of scores obtained by repeating the analysis with randomly sampled gene sets of matching sizes (e.g. 1,000 times). In the phenotype permutation mode, input is expression data for all samples along with a definition of sample groups (called 'phenotypes', e.g. cases vs. controls, tumor vs. normal) to be compared against each other. The observed pathway enrichment score is compared to a distribution of scores obtained by randomly shuffling the samples among phenotype categories and repeating the analysis (e.g. 1,000 times), including computation of the ranked gene list and resulting pathway enrichment score. The gene set permutation mode is recommended for studies with limited variability and biological replicates (i.e. 2 to 5 per condition). In this case, differential gene expression analysis should be computed using methods that include variance stabilization, outside of GSEA. If more replicates are available (above 6 to 10 per condition), the phenotype permutation should be used, offering as a main advantage that it models gene correlations in the gene expression matrix, unlike the gene set permutation approach. This protocol only covers gene set permutation because it can be accomplished using easy to use GSEA software, whereas phenotype permutation for RNA-seq data requires computing the enrichment score and differential expression statistics on thousands of phenotype randomizations, which currently requires custom programming outside of GSEA.
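The gene set permutation mode described above can be sketched as follows: the observed score is compared with scores of randomly sampled gene sets of the same size. This is a minimal illustration under stated assumptions, not the GSEA implementation: the compact `es` function is a simplified weighted running sum, the p-value is taken as the fraction of same-sign random scores at least as extreme as the observed one, and the NES is approximated by dividing by the mean of the same-sign random scores. All names and the synthetic ranked list are hypothetical.

```python
import random

def es(ranked, gene_set, weight=1.0):
    """Simplified weighted running-sum enrichment score.
    Assumes the set holds some, but not all, ranked genes with non-zero scores."""
    gene_set = set(gene_set)
    hit_total = sum(abs(s) ** weight for g, s in ranked if g in gene_set)
    n_miss = sum(1 for g, _ in ranked if g not in gene_set)
    running, best = 0.0, 0.0
    for g, s in ranked:
        running += abs(s) ** weight / hit_total if g in gene_set else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

def gene_set_permutation_test(ranked, gene_set, n_perm=1000, seed=0):
    """Return (observed ES, approximate NES, permutation p-value)."""
    rng = random.Random(seed)
    genes = [g for g, _ in ranked]
    size = len(set(gene_set) & set(genes))
    observed = es(ranked, gene_set)
    # Null distribution: scores of random gene sets of matching size.
    null = [es(ranked, rng.sample(genes, size)) for _ in range(n_perm)]
    same_sign = [x for x in null if (x >= 0) == (observed >= 0)]
    if not same_sign:
        return observed, float("nan"), float("nan")
    p_value = sum(1 for x in same_sign if abs(x) >= abs(observed)) / len(same_sign)
    nes = observed / (sum(abs(x) for x in same_sign) / len(same_sign))
    return observed, nes, p_value

# Synthetic ranked list of 100 genes; the tested set is the ten top-ranked genes.
synthetic = [(f"g{i}", 50.5 - i) for i in range(100)]
obs, nes, p = gene_set_permutation_test(synthetic, [f"g{i}" for i in range(10)], n_perm=200)
```

Phenotype permutation would instead reshuffle sample labels and recompute the ranked list each time, which is why it needs the full expression matrix rather than a ranked list.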
By default, the GSEA desktop software searches the MSigDB gene set database that includes pathways, published gene signatures, microRNA target genes and other gene set types (Box 4). The user can also provide a custom database as a text-based 'Gene Matrix Transposed' (GMT) file where each line defines a pathway, with its name, identifier and a list of gene identifiers that match the input gene list.
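A GMT file of the kind described above is plain tab-delimited text and is straightforward to read programmatically. The sketch below assumes the usual layout of one gene set per line (name, then a description or identifier field, then the gene identifiers); the function name and the example line are hypothetical.

```python
def parse_gmt(lines):
    """Parse GMT-formatted lines into {name: (description, set_of_genes)}."""
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, description, genes = fields[0], fields[1], fields[2:]
        gene_sets[name] = (description, {g for g in genes if g})
    return gene_sets

# Hypothetical two-line input; the second line is malformed and is skipped.
example_lines = [
    "CELL CYCLE%GOBP%GO:0007049\tcell cycle\tCDK1\tCCNB1\tTP53\n",
    "not a gene set line\n",
]
parsed = parse_gmt(example_lines)
```

To read a file, pass an open file handle, for example `parse_gmt(open("my_gene_sets.gmt"))`, where the file name is a placeholder. Gene identifiers in the GMT file must match those in the input gene list, as noted above.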

General recommendations for pathway enrichment analysis
We recommend searching enrichment only of pathway gene sets at first, as these capture familiar normal cellular processes that are easy to interpret. Gene Ontology (GO) 27 biological process terms and manually curated molecular pathways from Reactome 28 , Panther 21 , HumanCyc 29 , and NetPath 30 are good resources for human pathways (Box 4). GO biological process annotations include a mix of manually curated and electronically inferred sources. We recommend excluding those with the lower quality 'inferred from electronic annotation' (IEA) evidence code, unless no enriched pathways are found. Pathway definitions change rapidly and it is essential to use updated databases of gene annotations as outdated databases can lead to missed discoveries 31 .
Different types of gene sets help answer a variety of questions. For instance, gene sets corresponding to microRNA and transcription factor targets can be used to discover important regulators 32,33 . The use of additional sets must be carefully considered, as simultaneously analyzing all available gene sets increases the number of statistical tests and leads to more conservative p-values following multiple test correction (Box 3).
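The effect of the number of tests on corrected p-values can be seen with a short sketch of the Benjamini-Hochberg FDR procedure. The p-values below are invented for illustration, and the tools used in this protocol apply their own correction methods (Box 3), so this is only a demonstration of the principle.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values

# Four hypothetical pathway p-values tested alone...
few = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
# ...and the same four tested together with 36 uninformative gene sets.
many = benjamini_hochberg([0.01, 0.04, 0.03, 0.005] + [1.0] * 36)
```

The same raw p-value of 0.005 corresponds to a q-value of 0.02 among four tests but 0.2 among forty, which illustrates why adding gene set collections should be a deliberate choice.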
Gene set size is important to consider. Small pathways (e.g. fewer than ten or fifteen genes) should be excluded because these are often numerous, negatively affecting multiple test correction, and redundant with larger pathways. For human gene expression analysis, large pathways (e.g. over 300 genes) should also be excluded as these are overly general (e.g. 'metabolism') and do not contribute to interpretability of results. However, for other gene set types and organisms, larger pathways and gene sets may need to be included (e.g. up to 1000 genes).
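Size filtering of this kind is simple to apply to gene sets held in memory. In the sketch below, the default bounds of 15 and 300 genes follow the examples given above, while the function name and the toy gene sets are assumptions.

```python
def filter_by_size(gene_sets, min_size=15, max_size=300):
    """Keep gene sets whose size lies within [min_size, max_size]."""
    return {
        name: genes
        for name, genes in gene_sets.items()
        if min_size <= len(genes) <= max_size
    }

# Toy gene sets with made-up gene identifiers.
toy_sets = {
    "tiny pathway": {f"gene{i}" for i in range(5)},
    "medium pathway": {f"gene{i}" for i in range(50)},
    "metabolism": {f"gene{i}" for i in range(1200)},
}
kept = filter_by_size(toy_sets)
```

Rerunning the analysis with different bounds, for instance `filter_by_size(toy_sets, max_size=1000)`, is one way to check how sensitive the results are to this parameter.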
A pathway enrichment analysis resulting in few or no enriched pathways may be caused by suboptimal statistical processing used to define the gene list. If the gene list ranks are too noisy (interfering with the signal of having the most important genes at the top of the list), or if all or no genes are highly significant, then enriched pathways are unlikely to be found. If the gene list has been correctly defined, increasing the number of pathways and gene sets searched or setting more liberal filters may improve results.
Finally, pathway enrichment analysis results can change based on the parameters used (e.g. minimum and maximum pathway size or selected pathway databases); thus, the robustness of conclusions should be tested by varying these parameters.

Step 3: Visualising and interpreting pathway enrichment analysis results
Pathway information is inherently redundant, as genes often participate in multiple pathways, and some pathway databases organize pathways hierarchically by including general and specific pathways with many shared genes (e.g. 'cell cycle' and 'M-phase of cell cycle'). Pathway enrichment analysis often highlights several versions of the same pathway as a result. Collapsing redundant pathways into a single biological theme simplifies interpretation. We recommend addressing such redundancy with the Enrichment Map visualization method 7 or similar 34 . An enrichment map is a network representing overlaps among enriched pathways. Pathways are represented as circles (nodes) that are colored by enrichment score and are connected with lines (edges) sized based on the number of genes shared by the connected pathways. Network layout and clustering algorithms are used to automatically display and group similar pathways as major biological themes (Figure 1). The Enrichment Map software takes as input a text file containing pathway enrichment analysis results and another text file containing the pathway gene sets used in the original enrichment analysis. Interactive exploration of pathway enrichment score (filtering nodes) and connections between pathways (filtering edges) is possible (see visualize enrichment results with Enrichment Map, protocol 2). Multiple enrichment analysis results can be simultaneously visualized in a single enrichment map, in which case different colors are used on the nodes for each enrichment. If the gene expression data are optionally loaded, clicking on a pathway node will display a gene expression heat map of all genes in the pathway.
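The network construction described above can be illustrated by computing gene overlaps between all pairs of enriched pathways and keeping the pairs that share enough genes. This sketch uses the Jaccard coefficient with an arbitrary cut-off of 0.25 as an assumed similarity measure; the Enrichment Map software provides its own similarity settings, which are not reproduced here. Pathway contents are toy examples.

```python
from itertools import combinations

def jaccard(a, b):
    """Fraction of genes shared between two sets, relative to their union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def build_edges(gene_sets, threshold=0.25):
    """Return (pathway_1, pathway_2, similarity, shared_gene_count) tuples."""
    edges = []
    for name_1, name_2 in combinations(sorted(gene_sets), 2):
        genes_1, genes_2 = gene_sets[name_1], gene_sets[name_2]
        similarity = jaccard(genes_1, genes_2)
        if similarity >= threshold:
            edges.append((name_1, name_2, similarity, len(genes_1 & genes_2)))
    return edges

enriched = {
    "cell cycle": {"CDK1", "CCNB1", "CDC20", "PLK1"},
    "M-phase of cell cycle": {"CDK1", "CCNB1", "CDC20"},
    "apoptosis": {"CASP3", "BAX"},
}
edges = build_edges(enriched)
```

Here only the two cell cycle pathways are connected, so they would be drawn next to each other and read as a single theme, while 'apoptosis' remains a separate node.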
An enrichment map helps identify interesting pathways and themes. First, expected themes should be identified to help validate the pathway enrichment analysis results (positive controls). For instance, growth related pathways are expected to be identified in cancer samples relative to controls. Second, pathways not previously associated with the experimental context are evaluated more carefully as potential discoveries. Pathways and themes with the strongest enrichment scores should be studied first, followed by progressively weaker signals (see navigating and interpreting the Enrichment Map, protocol 3). [...] 39 or GeneMANIA 40 can be used with Cytoscape 6 to define an interaction network among pathway genes for expression overlay. This helps visually identify pathway components (e.g. branches or single elements) that are most altered (e.g., differentially expressed) in the experiment. Additionally, master regulators for enriched pathways can be searched for by integrating miRNA 32 or transcription factor 33 target gene sets using the Enrichment Map post-analysis tool. Finally, pathway enrichment analysis results can be published to support a scientific conclusion (e.g. functional differences of two cancer subtypes), used for hypothesis generation or planning experiments to support the identification of novel pathways.

Caveats of pathway enrichment analysis
The following caveats are important to consider when interpreting pathway enrichment analysis results.
• Pathway enrichment analysis assumes that a strong experimental signal of pathways reflects the biology addressed by the experiment. For instance, in a transcriptomics experiment, we assume that evolution has optimized a cell to express a pathway only when needed and these can be identified. Pathway activity not controlled by gene expression (e.g. post-translational regulation) will not be observed.
• Unexpected biological themes may indicate problems with experimental design, data generation or analysis. For example, enrichment of the apoptosis pathway may indicate a problem with the experimental protocol that led to increased cell death during sample preparation. In these cases, the experimental design and data generation should be carefully reviewed prior to pathway analysis.
• Pathway databases, and therefore enrichment results, are biased towards well known pathways.
• Multi-functional genes that are highly ranked in the gene list may lead to enrichment of many different pathways, some of which are not relevant to the experiment. Repeating the analysis after excluding such genes may reveal pathways whose enrichment is overly dependent on their presence or confirm the robustness of pathway enrichment.
• Pathway enrichment analysis ignores genes with no pathway annotations, sometimes called "dark matter of the genome", and these genes should be studied separately.
• Most enrichment analysis methods make unrealistic assumptions of statistical independence among genes as well as pathways. Some genes may be always coexpressed (e.g. genes within a protein complex) and some pathways have genes in common. Thus, standard false discovery rates, which assume statistical independence between tests, are often either more or less conservative than ideal. Nonetheless, they should still be used to adjust for multiple testing and rank enriched pathways for exploratory analysis and hypothesis generation. Custom permutation tests may lead to better estimates of false discovery (Box 3).
• By representing pathways as gene sets, many biological details such as protein-protein interactions, biochemical reactions, post-translational modifications, protein complexes, and activation and inhibition relationships are ignored. These issues are addressed by advanced methods that consider mechanistic pathway details; however, this is still an active area of research (Box 5).

Working with diverse -omics data
Pathway enrichment analysis is generally applicable to any experiment that can generate a list of genes, though experiment-specific issues must be considered:
• Genes are associated with many, diverse database identifiers (IDs). We recommend using unambiguous, unique and stable IDs, as some IDs become obsolete over time. For human genes, we recommend using the Entrez Gene database IDs (e.g. 4193 corresponds to MDM2, http://www.ncbi.nlm.nih.gov/gene/4193) or gene symbols (MDM2 is the official symbol recommended by the HUGO Gene Nomenclature Committee). As gene symbols change over time, we recommend maintaining both gene symbols and Entrez Gene IDs. We recommend UniProt accession numbers for proteins (e.g. Q00987 for MDM2, http://www.uniprot.org/uniprot/Q00987) and Human Metabolome Database (HMDB) IDs for metabolites (e.g. ATP is denoted as HMDB00538, http://www.hmdb.ca/metabolites/HMDB00538). g:Profiler and the related g:Convert tool support automatic conversion of multiple ID types to standard IDs.
• Pathway enrichment analysis of short non-coding genomic regions, such as transcription factor binding sites from ChIP-seq experiments, needs additional consideration. Genomic regions must be mapped to protein-coding genes and corrected for biases such as increased signal in longer genes. Tools such as GREAT 41 automatically perform both tasks.
• Large genomic intervals that span multiple genes (e.g. from genome-wide associations, copy number variation and differentially methylated regions) require specialized tests such as the PLINK CNV gene set burden test 42 .
• For rare genetic variants, case-control pathway "burden" tests are the most appropriate pathway enrichment analysis method (Box 3).

Future perspectives
Current pathway enrichment analysis methods provide a useful high-level overview of the pathways active in a genomics experiment. However, these methods consider a simplified pathway view (gene sets). Next generation pathway analysis methods will integrate more biological pathway details, build pathway models based on multiple types of genomics data measured across many samples, and consider positive and negative regulatory relationships in the data (Box 5). For instance, qualitative mathematical modeling parameterized with single cell RNA-seq data may enable accurate predictions of drug combinations capable of treating a given disease under study.

PROTOCOL INTRODUCTION
This step-by-step protocol explains how to complete pathway enrichment analysis using g:Profiler (gene list) and GSEA (ranked gene list), followed by visualization and interpretation using Enrichment Map, as explained above in the text. The example data provided for the g:Profiler analysis is a list of genes with frequent somatic single nucleotide variants (SNVs) identified in The Cancer Genome Atlas (TCGA) exome sequencing data of 3,200 tumors of 12 types 25 . The example data provided for the GSEA analysis is a list of differentially expressed genes in two types of ovarian cancer defined by TCGA.

Equipment
Hardware requirements:
• A recent personal computer with Internet access and at least 8GB of RAM. Note: 1GB of RAM is sufficient to run GSEA analysis but Cytoscape requires at least 8GB.
Software requirements:
• A contemporary web browser (e.g. Chrome) for pathway enrichment analysis with g:Profiler (Protocol 1A).
• Java Standard Edition. Java is required to run GSEA and Cytoscape. It is available at http://java.oracle.com. Version 8 or higher is required.
• GSEA desktop application for pathway enrichment analysis protocol 1B.
Data requirements:
• We provide example files that are listed following the protocol. We recommend saving all files in a personal project folder before starting.

Equipment setup
• Protocol 1A uses web-based software and just requires a web browser.
3 Check the box next to "Ordered query". This option treats the input as an ordered gene list and prioritizes genes with higher mutation enrichment scores at the beginning of the list.
4 Check the box next to "No electronic GO annotations". This option will discard less reliable Gene Ontology (GO) annotations (IEA, inferred from electronic annotation) that are not manually reviewed.
5 Set filters on gene annotation data using the legend on the right. We recommend [...]
8 Set the dropdown "Size of query/term intersection" to 3. The analysis will only consider more reliable pathways that have three or more genes in the input gene list.
9 Click "g:Profile!" to run the analysis. A graphical image will be shown with detected pathways from top to bottom and associated genes of the input list left to right. Resulting pathways are organized hierarchically into related groups.
g:Profiler uses graphical output by default and switches to textual output when a large number of pathways is found. g:Profiler returns only statistically significant pathways with p-values adjusted for multiple testing correction using a custom pathway-focused procedure. By default, results with corrected q-value below 0.05 are reported.
10 Use the dropdown menu "Output type" and select the option "Generic Enrichment Map (TAB)". This file is required for visualizing pathway results with Cytoscape and Enrichment Map.
12 Download the required GMT file by clicking on the link "name" at the bottom of the Advanced Options form. The GMT file is a compressed ZIP archive that contains all gene sets used by g:Profiler (e.g., gprofiler_hsapiens.NAME.gmt.zip). The gene set files are divided by data source. Download and uncompress the ZIP archive to your project folder. All required gene sets for this analysis will be in the file hsapiens.pathways.NAME.gmt (Supplementary_Table5_hsapiens.pathways.NAME.gmt).
TIMING: ~3 minutes to run g:Profiler using Chrome on Windows 7.
i Click on "Load Data" in the top left corner in the "Steps in GSEA Analysis" section.
ii In the "Load Data" tab, click on "Browse for files …"
iii Find your project folder and select the file
iv Also select the pathway gene set definition (GMT) file using a multiple select method such as shift-click (Supplementary_Table3_Human_GOBP_AllPathways_no_GO_iea_July_01_2017_symbol.gmt (TT2, TT3)). Then click the 'Choose' button to continue.
16 Click on "Run GSEAPreranked" in the side bar under "Tools". The tab "Run GSEA on a Pre-Ranked gene list" will appear.
17 Specify the following parameters:
i Gene sets database - click on the button (…) located to the right and wait for the gene set selection window to appear. Go to the "Gene matrix (local GMX/GMT)" tab using the top right arrow. Click on the downloaded local GMT file Supplementary_Table3_Human_GOBP_AllPathways_no_GO_iea_July_01_2017_symbol.gmt and click on OK at the bottom of the window.
ii Number of permutations - number of times that the gene sets will be randomized to create the null distribution to calculate the p-value and FDR q-value (TT4). Use the default value of 1000 permutations.
iii Ranked List - select the ranked gene list by clicking on the right-most arrow and highlighting the rank file (Supplementary_Table2_MesenvsImmuno_RNASeq_ranks.rnk).
iv Click on "Show" button next to "Basic Fields" to display extra options.
v Analysis name - change default "my_analysis" to a specific name, for example "Mesen_vs_Immuno".
vi Save results in this folder - navigate to the folder where GSEA should save the results. By default, GSEA will use gsea_home/output/[date] in your home directory.
vii Max size: exclude larger sets - By default GSEA sets the upper limit to 500. Set this to 200 to remove the larger sets from the analysis.
18 Run GSEA - click on the "Run" button located at the bottom right corner of the window. Expand the window if the button is not visible. The "GSEA reports" panel at the bottom left of the window will show the status "Running". It will be updated to "Success" upon completion (TT5, TT6).
19 Examine GSEA results - once the GSEA analysis is complete, a green notification "Success" will appear in the bottom left section of the screen. All [...]
iii In the right hand panel g:Profiler output files will be auto populated into their specified fields. (Alternately, users can click on the "+" to specify each of the required files manually).
• If desired, modify the 'dataset' name. By default, EM will use the name of the g:Profiler enrichment results file name (e.g. gprofiler_cancer_drivers).
• Verify the Analysis type is set to "Generic/gProfiler".
• Verify the Enrichments results file is the results file downloaded in Protocol 1A, step 11 (or alternately manually specify Supplementary_Table4_gprofiler_results.txt).
• Verify the GMT specified is the file retrieved from the g:Profiler website in Protocol 1A, step 12. Use the file hsapiens.pathways.NAME.gmt (or alternately manually specify Supplementary_Table5_hsapiens.pathways.NAME.gmt) that contains gene sets corresponding to GO biological processes and Reactome pathways.
iv Specify additional files:
• Expression - (Optional) Upload an expression matrix for the genes analyzed in g:Profiler or alternatively an expression data set of all genes. If the expression data set contains additional genes not used for the g:Profiler search, their expression values will still appear in the heat map of the enrichment map (for example file see Supplementary_Table6_TCGA_OV_RNAseq_expression.txt).
• Ranks - (Optional) Ranks for gene list or for the expression data can also be specified (for example file see [...]).
• Classes - (Optional) GSEA cls file defining the phenotype (i.e. biological conditions) of each sample in the expression file (for example file see Supplementary_Table9_TCGA_OV_RNAseq_classes.cls). Generally, this file is only required when performing phenotype randomization in GSEA, but if it is supplied to Enrichment Map it is used to identify and label the columns of the expression file in the Enrichment Map heat map by phenotype.
• Phenotypes - (Optional) If there are two different phenotypes in the expression data, update the phenotype labels so that 'positive' represents the phenotype associated with positive values (Mesenchymal in this example) and 'negative' with negative values (Immunoreactive in this example) (TT11).
v Tune parameters in the "Parameters" box:
iii In the right-hand panel, GSEA output files will be auto populated into their specified fields. Alternately the "+" can be clicked to specify each of the required files manually. Equivalent supplementary files that users can specify manually are indicated in brackets.
• If desired, modify the 'dataset' name. By default, EM will use the name of the GSEA results folder prior to the first '.' as the 'dataset' name.
• Verify the Analysis type is set to "GSEA".
• GMT - Verify that the file is set to [

Supplementary_Table2_MesenvsImmuno_RNASeq_ranks.rnk
iv Specify additional files:
Supplementary_Table9_TCGA_OV_RNAseq_classes.cls
• Phenotypes - (Optional) In the text boxes replace 'na_pos' with "Mesenchymal" and 'na_neg' with "Immunoreactive". Mesenchymal will be associated with red nodes as it corresponds to the positive phenotype, while Immunoreactive will be labeled blue (TT16,
• Keep the connectivity slider in the center. For networks with fewer edges, a sparser network, move the slider to the left. Alternatively, for networks with more edges, a denser network, move the slider to the right (TT12).
vi Click the "Build" button at the bottom of the Enrichment Map Input panel (TT6).
27 Figure 6 shows the resulting enrichment maps from the above g:Profiler and GSEA protocols.
TIMING: ~5 minutes to create an Enrichment Map in Cytoscape using Windows 7 with 8 GB of RAM and Java 8.

Protocol 3 -Navigating and interpreting the Enrichment Map
An enrichment map must be interpreted to discover novel information about a set of data and must be refined to create a publication quality figure.
28 To explore the enrichment map, select the network of interest in the control panel located at the left side of the Cytoscape window and navigate it (zoom and pan) using Cytoscape controls (Figure 7A). Pathways with many common genes often represent similar biological processes and group together as 'themes' in the network. Click on a node to display the corresponding genes in the table below the network view (Figure 7B).
29 To find a gene or pathway of interest, type its name in the search bar located in the top right corner (Figure 7C). All pathways containing that gene will be highlighted. For example, TP53 and BGN are the top genes in g:Profiler and GSEA analyses, respectively (TT19).
30 To find the most enriched pathways, find the column named "EM1_fdr_qvalue" (for g:Profiler) or "EM1_NES" (for GSEA) in the 'Node' tab in the table panel (Figure 7C and 7D). For GSEA, we specifically recommend using the NES (normalized enrichment score) to sort pathways by enrichment strength, whereas we recommend using the enrichment p-value for other enrichment methods (TT21). Click on the column name to sort the table according to that attribute.
Click the greatest value to show the pathway most enriched in the data. To
ii Define genes you wish to include in the heat map (Figure 8B) -data can be viewed for all genes contained in the selected nodes or just for the genes common to all selected nodes. By default all genes are shown.
iii Change expression value visualization depending on your data type (Figure 8C) -data can be viewed as it was loaded ("Values"), or row normalized where the row mean is subtracted from every value and then divided by the row's standard deviation ("Row Norm"), or log transformed ("log").
vi Additional fine tuning of the heat map can be done through the settings panel that includes functionality to add new rank files, export the heat map data as a tab delimited text file or PDF image, change the distance metric for hierarchical clustering, or turn on heat map autofocus (Figure 8F).
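The "Row Norm" transformation described for the heat map can also be reproduced outside Cytoscape, for example to check exported values. The following is a minimal sketch in base R, assuming a numeric matrix with genes in rows and samples in columns; the matrix values and the function name `row_norm` are made up for illustration.

```r
# Toy expression matrix: genes in rows, samples in columns (made-up values)
expr <- matrix(c(1, 2, 3,
                 10, 20, 60),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))

# Row normalization: subtract the row mean from every value,
# then divide by the row's standard deviation
row_norm <- function(m) {
  (m - rowMeans(m)) / apply(m, 1, sd)
}

expr_rownorm <- row_norm(expr)
```

After this transformation every row has mean 0 and standard deviation 1, so genes with very different absolute expression levels can be compared on one colour scale.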
The resulting heat map can be seen in Figure 8. Column headings are colored according to sample phenotype. Red refers to the first phenotype (Mesenchymal), and blue to the second phenotype (Immunoreactive) (TT24).
The heat map can be exported to a text file for further analysis.
• Click on "Export to txt" in heat map settings (Figure 8F)
• Specify the name and location of the saved file
• If only an individual node is selected, a dialog will offer to save the leading edge only. If "Yes" is chosen, only the highlighted genes will be exported; otherwise the entire set is exported (TT25)
32 Organize and de-clutter the network
i If the network has too many nodes, making the Node cutoff more stringent (a lower q-value threshold) will remove less significant nodes (Figure 7E).
ii If the network is too interconnected, increasing the edge cutoff (similarity) threshold will remove less pronounced edges between nodes ( Figure 7F).
iii The network layout may be applied again after adjusting the cutoffs (see the Layout menu in Cytoscape). The default layout algorithm is the unweighted prefuse force-directed layout. We also recommend the yFiles organic layout or weighted prefuse force-directed layouts (TT26).
iv To restore nodes or edges, adjust threshold sliders to their original positions.
v It can be helpful to separate the two different phenotypes (i.e. place all the red nodes to one side and all blue nodes to the other). To do this:
• Click on the select tab in the control panel (Figure 7A)
• Click on the "+" and select "Column filter"
• Click on "Choose column…" and select "EM1_NES"
• Click on the box next to "between" and change the value to zero. Click
i Launch AutoAnnotate by selecting Apps → AutoAnnotate → New Annotation Set… in the Cytoscape menu bar. The "AutoAnnotate" tab will appear in the Cytoscape control panel.
ii Click on "+" in the AutoAnnotate panel.
iii The "AutoAnnotate: create Annotation Set" panel will appear.
iv In the "Quick Start" tab click on "Create Annotations" (TT29).
v Each cluster in the network will have a circle annotation drawn around it and will be associated with a set of words (by default three) that appear most in the node description fields. Moving individual nodes within a cluster will automatically resize the surrounding circle annotation and moving an entire cluster will redraw the annotations in the new cluster location (TT30).
vi Manually arrange clusters to clean up the figure. Move nodes to reduce node and label overlap. Figure 9 shows the results of this process.
34 Create a simplified network view (Figure 10). This creates a single group node for every cluster with a summarized name and provides an overview of the enrichment result themes that is useful for enrichment maps containing many nodes.
1. In the Cytoscape Control Panel select the "AutoAnnotate" Tab.
2. Click on the Menu icon in the upper right hand corner.

Scale collapsed network for better viewing:
i. In the Cytoscape menu bar, select: View → Show Tool Panel.
ii. Go to Tool Panel located at the bottom of the Control Panel.
iii. Click on the "Scale" Tab.
iv. Move slider left to tighten the node spacing.
35 Manually arranging the network nodes and custom labeling the major themes is required for the clearest network view and for a publication quality figure.
i For instance, it is useful to bring together similar themes, such as signaling or metabolic pathways, even if they are not connected in the map.
ii If the focus of the figure is only on a subset of the network, it can be easier to work with just the subset. To create this, select the nodes of interest, then in the Cytoscape menu bar select File → New → Network → From selected nodes, all edges.
iii When the purpose of the figure is to show a large network and highlight only the main themes, clicking on "Publication ready" in the enrichment map panel will remove node labels. To revert to the original network, click on the "Publication ready" button again.
36 Create a sub network that highlights a specific theme or data -often enrichment maps generated from platforms that measure signals from a large percentage of the genome are large and complicated. When generating a figure, it is important to highlight specific themes or pathways relevant to the analysis in question. For example, we will select the top mesenchymal and immunoreactive pathways and create a sub network containing them.
i Click on the select tab in the control panel ( Figure 7A).
ii Click on the "+" and select "Column filter".
iii Click on "Choose column…" and select "EM1_NES".
iv Click on the box next to "between" and change the value to 2.5. Click "Enter".
v Click on the "+" and select "Column filter".
vi Click on "Choose column…" and select "EM1_NES".
vii Click on the box next to "inclusive" and change the value to -2.5. Click "Enter".
viii Above the two column filters you just added, change the drop down from "Match all (AND)" to "Match any (OR)".
ix Click on Apply. Under the apply button, it should say "Selected 32 nodes and 0 edges in Xms". The exact time reported will depend on your computer speed.
x Select File → New → Network → From selected nodes, all edges.
xi A new smaller network should appear. Manually move nodes around for optimal layout.
xii Annotate network as described in step 6 (Figure 11).
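The two column filters combined with "Match any (OR)" amount to a simple threshold on the absolute NES. The sketch below illustrates the same selection in base R on a toy node table; the pathway names and scores are invented, and the data frame only mimics the "EM1_NES" column of the Enrichment Map node table rather than reading it from Cytoscape.

```r
# Toy node table mimicking the "EM1_NES" column (invented values)
nodes <- data.frame(
  name    = c("pathway_A", "pathway_B", "pathway_C", "pathway_D"),
  EM1_NES = c(2.8, 1.4, -2.6, -0.9),
  stringsAsFactors = FALSE
)

nes_cutoff <- 2.5

# "Match any (OR)": keep nodes with NES >= 2.5 OR NES <= -2.5
keep      <- nodes$EM1_NES >= nes_cutoff | nodes$EM1_NES <= -nes_cutoff
top_nodes <- nodes[keep, ]

# Sort by enrichment strength, strongest first
top_nodes <- top_nodes[order(abs(top_nodes$EM1_NES), decreasing = TRUE), ]
```

With "Match all (AND)" no node could satisfy both conditions at once, which is why the drop down has to be switched to OR.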
37 Export the image (TT34)
i In the Cytoscape menu bar, select File → Export as Image…
ii Set "Select the export file format" to PDF (TT35).
iii Click on "Browse…" to specify file name and location.
iv Click on "Save" to close the browser window and then on "OK".
v A window "Export Network" will appear, click on the "OK" button.
38 Get network creation parameters. In the previous step we exported the network as an image, but some information needs to be included either in the text legend or as a pictograph within the image itself so the network can be easily interpreted. It is important to include the thresholds used when creating the map. Although Cytoscape can export a legend of the current style, it is not easily transferable as a legend for the resulting figure. Figure 12 shows the basic legend components (available as SVG and PDF images at http://baderlab.org/Software/EnrichmentMap#Legends) that can be used for an enrichment map figure. Only include components relevant to the given analysis.
See bottom of Figure 9 for components used for current analysis.

can't be opened because it is from an unidentified developer". Click on "Ok". Instead of double clicking on the gsea.jnlp icon/file, right click and select "open". The same error "'gsea.jnlp' can't be opened because it is from an unidentified developer" will appear but this time it will give you the option to "Open" or "Cancel". Click on "Open". After this initial opening, subsequent double clicks on gsea.jnlp will launch GSEA without any

TT4. The higher the number of permutations, the longer the analysis will take. To calculate the FDR q-value for each gene set, the data is randomized by permuting the genes in each gene set and recalculating the p-values for the randomized set. This parameter specifies how many times this randomization is done. The more randomizations are performed, the more precise the FDR q-value estimation will be (to a limit, as eventually the FDR q-value will stabilize at the actual value). On a Windows machine with 16 GB of RAM and an i7 3.4 GHz processor, an analysis with 10, 100, 500, or 1000 randomizations on our example set with the parameters defined above takes 155, 224, 544, and 1012 seconds, respectively.

You can download a webstart application that launches GSEA with 1, 2, 4, or 8 GB.
Upgrade to a webstart that launches with more memory. If you are already using the webstart that launches with 8 GB, then you will need the GSEA Java jar file, which can be executed from the command line with increased memory (see TT1 for details).
TT7. If the GSEA software is closed, you can still see the results by opening the working folder and opening the 'index.html' file. Alternatively, you can re-launch GSEA, and click on "Analysis history", then "History" and then navigate to date of your analysis.
Although all analyses, regardless of where the results files were saved, are listed under history, it is organized by date the analysis was run. If you can't remember when you ran a specific analysis, then you may have to manually search through a few directories to find the desired analysis.
TT8. When running GSEA with expression data as input (instead of a pre-calculated rank file), a phenotype label (i.e. biological condition or sample class) is provided as input for each sample and specified in the GSEA 'cls' file. When running GSEA, the two phenotypes to compare for differential gene expression analysis are specified and these phenotypes are used in the pathway enrichment result files. In contrast, in a GSEA preranked analysis (i.e. when a ranked gene list is provided by the user), GSEA automatically labels one phenotype "na_pos" (corresponding to enrichment in the genes at the top of the ranked list, where 'na' means the phenotype label 'not available') and the other "na_neg" (corresponding to enrichment in the genes at the bottom of the ranked list). This convention is also used by the Enrichment Map software, designating the first phenotype as "positive" and the second phenotype as "negative".
TT9. Check the number of gene sets that were analyzed. If the number is low (e.g. low hundreds), it could indicate gene ID mapping problems.
TT13. If you specify a directory that contains multiple GSEA results rather than an individual GSEA results folder, EM will treat every GSEA results folder as its own data set. This enables easy multi-data set analyses. If you only wanted one data set but inadvertently selected the directory containing multiple GSEA results instead of selecting an individual folder, simply select the data sets you do not want to use and click on the trash can at the top of the EM input panel.
TT14. Every GSEA analysis generates a random number that is appended to the names of the files and directories. The number will be different for every new analysis.
TT15. If Enrichment Map cannot find the original GMT file used in the GSEA analysis, it will use a filtered GMT file found in the GSEA 'edb' results directory. Enrichment Map will not be able to find your original GMT file if you have moved it since running the GSEA analysis. Although the filtered file is a GMT file, it contains only genes found in the expression file. If you use this filtered file, you will get different pathway connectivity depending on the expression data being used. We recommend using the original GMT file used for the GSEA analysis and not the filtered one in the results directory.

TT16. To annotate the phenotypes in the Enrichment Map heat map, the specified phenotype labels need to exactly match the GSEA CLS file.
TT17. If you load the CLS file prior to specifying the phenotypes, EM will automatically guess the phenotypes from the class file. If your class file specifies more than two phenotypes, EM will choose the first two phenotypes defined in the file.

TT18. To set the threshold to a small number, select 'Scientific Notation' and set a q-value cutoff such as 1E-04.

TT19. Multiple genes separated by spaces can be entered into the search bar. Any pathway that contains at least one of the genes will be selected and highlighted in the network. Adding the keyword "AND" between genes in the search bar will show only pathways that contain all genes in the search query. If the analysis was not done using gene symbols then you will not be able to search by gene symbols. Instead use the identifier the analysis was based on, for example Entrez gene ID or Ensembl gene ID.

TT20. If there are very few records in the node table, make sure that no nodes are selected in the network. Click on the gear icon and change the setting from "Auto" to "Show all".

TT21. If no expression file is given to Enrichment Map, it will automatically create a dummy expression file where any gene found in the enrichment file will be given a placeholder expression value of 0.25, and any gene found in a pathway but not found in the enrichment results file will be assigned a placeholder expression value of NA. Therefore clicking on any node in the enrichment map will show the genes used for the analysis as well as genes in the pathway but not part of the query set.
TT22. The leading edge can be displayed only if the rank file is provided when the network is built. The rank file supplied needs to be identical to the one used for the GSEA analysis for the leading edge calculation to be accurate.

TT23. In case of multiple conditions or conditions with variable expression profiles (e.g. cancer patient samples), hierarchical clustering tends to generate a more informative visualization.
TT24. If the heat map columns are not colored for a GSEA analysis, make sure the phenotype names specified in the Enrichment Map input panel match the class names specified in the class file (MesenchymalvsImmunoreactive_RNA-Seq_classes.cls).

TT25. Leading edge is only available for GSEA analyses. The option will only appear if the Enrichment Map was built with GSEA results and a rank file was specified.

TT26. There are many different layout algorithms available in Cytoscape that can be used for Enrichment Map. We recommend using an edge weighted layout, which considers the overlap score between pathways. Most layouts offer the ability to organize just the selected nodes (except for yFiles layouts). Experiment with different layouts to see which works best with your data. If you do not like the layout results, simply press command-Z on macOS or ctrl-Z on Windows, or click on Edit → Undo, to revert to the previous view.

TT27. If particular non-informative words keep appearing in the labels generated by AutoAnnotate, try adjusting the WordCloud normalization factor. The significance of each word is calculated based on the number of occurrences in the given cluster of pathways. This causes frequent words such as "pathway" or "regulation" to be prominent. By increasing the normalization factor, we reduce the priority of such recurrent words in cluster labels. If that doesn't help, you can add the non-informative words to the WordCloud word exclusion list.
TT28. If a specific character is used to separate words besides space (for example "-" or "|"), it should be added as a delimiter in the WordCloud app. Annotation labels can all be set to the same size by unchecking the option "Scale font by cluster size" in the AutoAnnotate results panel.

TT31. Once you click on "Collapse All", a pop-up window will show the message 'Before collapsing clusters please go to the menu Edit->Preferences->Group preferences and select "Enable attribute aggregation"'. There is no need to adjust this parameter repeatedly. Click on "Don't ask me again" and "OK" if you have set this parameter previously.

TT32. For large networks, collapsing and expanding may take time. For a quick view of the collapsed network you can create a summary network by selecting "Create summary Network…". There are two options for the summary network: "clusters only", which creates a summary network with just the circled clusters, or "clusters and unclustered nodes", which creates a summary network that also includes the singleton nodes not part of any cluster.
TT33. If the nodes in the resulting collapsed network are grey, then attribute aggregation was not enabled. Expand clusters and refer to TT31 before collapsing again.

TT34. In image export, only the visible part of the map will be exported. Make sure that the entire network is visible on your screen before exporting.

TT35. Vector-based PDF and SVG formats are recommended for publication quality figures because they can be zoomed without losing quality. Either file type can be edited using software packages such as Adobe Illustrator or Inkscape. The PNG file format is recommended for high-quality online images, while the JPG format is not recommended because it may lead to visual artefacts due to compression.

TT36. The creation parameters panel only shows the parameters that were used at network creation. If you modified the network using filters or the EM slider bars you will have to update the changed thresholds accordingly.
TT37. If a session is saved that contains a collapsed Enrichment Map, it will automatically be expanded before it is saved. Depending on the size of the network this might take a few minutes. Enrichment Map will not automatically collapse nodes that were previously collapsed when reopening the session. To keep them collapsed, recollapse them using the AutoAnnotate app as done previously.

localization (e.g., nuclear genes) or enzymatic function (e.g., protein kinases). Details such as protein interactions are not included.

Gene list of interest -the list of genes derived from an -omics experiment that is input to pathway enrichment analysis.
Ranked gene list -In many -omics data (e.g. RNA-seq for gene expression) genes can be ranked according to some score (e.g. level of differential expression) to provide more information for pathway enrichment analysis. Pathways enriched in genes clustered at the top of a ranked list would score higher than if the pathway genes are randomly scattered across the ranked list.
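To make the idea of scoring a ranked list concrete, the sketch below computes a simplified, unweighted running-sum score in base R. This is a Kolmogorov-Smirnov-like statistic chosen for illustration only; it is not the weighted statistic implemented in GSEA, and the function name `running_sum_es`, the gene names and the gene sets are all hypothetical.

```r
# Simplified, unweighted running-sum enrichment score (a KS-like statistic).
# NOT the weighted statistic implemented in GSEA; it only illustrates why
# pathway genes clustered at the top of a ranked list score higher.
running_sum_es <- function(ranked_genes, gene_set) {
  in_set <- ranked_genes %in% gene_set
  n_hit  <- sum(in_set)
  n_miss <- length(ranked_genes) - n_hit
  stopifnot(n_hit > 0, n_miss > 0)
  # step up at each pathway gene, step down at every other gene
  steps   <- ifelse(in_set, 1 / n_hit, -1 / n_miss)
  running <- cumsum(steps)
  running[which.max(abs(running))]   # maximum deviation from zero
}

ranked    <- paste0("g", 1:10)       # hypothetical ranked list, top-scoring gene first
clustered <- c("g1", "g2", "g3")     # pathway genes at the top of the list
scattered <- c("g2", "g6", "g9")     # pathway genes spread across the list

es_clustered <- running_sum_es(ranked, clustered)
es_scattered <- running_sum_es(ranked, scattered)
```

In this toy case the clustered gene set reaches a larger absolute score than the scattered one, which is the behaviour the definition above describes.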

BOX 2 -Experimental design and data quality
Pathway enrichment analysis benefits greatly from careful experimental design. Otherwise the analysis may reveal apparently meaningful results caused by experimental biases or other confounders. Systematic removal of outliers may be justified to reduce variability in the experiment.
Experimental sensitivity. Some experimental methods can be tuned to be more or less sensitive. For instance, the number of reads in RNA-seq experiments influences downstream analysis. For quantifying gene expression in a biological system with modest variability and testing differential expression with variance stabilization, at least three to five replicates and ten million mapped reads are required 49 . Substantially greater sequencing depth, such as 50-100 million mapped reads, is required to investigate splice isoforms, to detect low-expressed genes or for samples with complex cellular mixtures such as surgical resection specimens.

Importantly, both Bonferroni and BH-FDR assume tests are independent, while pathways are typically not independent because of overlapping genes and cross-talk. Therefore, BH-FDR estimates for pathway analysis can be inaccurate, although practically they are still useful for filtering and hypothesis generation and thus are routinely used.
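The two multiple-testing corrections mentioned above are both available in base R through `p.adjust`. The following minimal sketch applies them to made-up p-values for five hypothetical pathways; the pathway names, the p-values and the 0.05 cut-off are illustrative assumptions.

```r
# Made-up p-values for five hypothetical pathways
p_values <- c(path1 = 0.001, path2 = 0.008, path3 = 0.02,
              path4 = 0.03,  path5 = 0.60)

q_bh   <- p.adjust(p_values, method = "BH")          # Benjamini-Hochberg FDR
p_bonf <- p.adjust(p_values, method = "bonferroni")  # Bonferroni correction

# Pathways passing an illustrative cut-off of 0.05 under each correction
significant_bh   <- names(q_bh)[q_bh < 0.05]
significant_bonf <- names(p_bonf)[p_bonf < 0.05]
```

On these values BH-FDR retains four pathways while Bonferroni retains two, which illustrates why the less conservative FDR is commonly preferred for filtering and hypothesis generation.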

Pathway databases
We describe a selection of large, open-access and conveniently accessible pathway databases that offer the maximal value for pathway enrichment analysis. Hundreds of pathway databases are available for many purposes 54 .

Databases of gene sets
• Gene Ontology (GO) 27 -GO provides a hierarchically organized set of thousands of standardized terms for biological processes, molecular functions and cellular components as well as curated and predicted gene annotations based on these terms for multiple species. Biological process annotations are the most commonly used resource for pathway enrichment analysis.
• KEGG 55 -most useful for its intuitive pathway diagrams. Contains multiple types of pathways, some of which are not normal pathways, but are rather disease associated gene sets, such as "pathways in cancer" (http://www.genome.jp/kegg/).
Pathway meta-databases. These databases collect detailed pathway descriptions from multiple originating pathway databases.
• Pathway Commons 35 -collects information from other pathway databases and provides it in a standardized format (http://www.pathwaycommons.org).

Pathway Enrichment Analysis Tools
Hundreds of pathway enrichment analysis tools exist, although many of them rely on out-of-date pathway databases or do not present any unique feature compared to the most commonly used tools. The following are free pathway enrichment analysis software tools that we recommend based on their ease of use or unique features:
• g:Profiler 4,26 -analyzes gene lists using Fisher's exact test and ordered gene lists using a modified Fisher's test. It provides a graphical web interface and access via the R and Python programming languages. The software is frequently updated and the gene set database can be downloaded as a GMT file (http://biit.cs.ut.ee/gprofiler).
• Genomic Regions Enrichment of Annotations Tool (GREAT) 41
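As an illustration of the Fisher's exact test that underlies gene-list tools such as g:Profiler, the sketch below tests a single pathway in base R. It is a plain one-sided Fisher's exact test, not g:Profiler's own implementation or its multiple-testing correction. The counts are hypothetical and echo the introduction's example of a gene list with 40% cell cycle genes against a background of 8%.

```r
# Hypothetical counts: 100 genes in the list, 40 of them in the pathway;
# 20,000 background genes, 1,600 of them (8%) in the pathway
counts <- matrix(c(40, 1560,     # in pathway:     in list, not in list
                   60, 18340),   # not in pathway: in list, not in list
                 nrow = 2,
                 dimnames = list(c("in_list", "not_in_list"),
                                 c("in_pathway", "not_in_pathway")))

# One-sided test for over-representation of pathway genes in the list
ft <- fisher.test(counts, alternative = "greater")
p_enrichment <- ft$p.value

# Equivalent hypergeometric tail probability: P(X >= 40)
p_hyper <- phyper(40 - 1, 1600, 18400, 100, lower.tail = FALSE)
```

In a real analysis this test is repeated for every pathway in the database, and the resulting p-values are then corrected for multiple testing.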

Visualisation tools
• Enrichment Map 7 -this Cytoscape 6 app visualizes the results from pathway enrichment analysis and eases interpretation by displaying pathways as a network where overlapping pathways are clustered together to identify major biological themes in the results (http://apps.cytoscape.org/apps/enrichmentmap).
• ClueGO 34 -This Cytoscape app is conceptually similar to Enrichment Map. It includes a GO-based pathway enrichment analysis feature using Fisher's exact test.

BOX 5 -Topology-Aware Pathway Enrichment Analysis Methods
Most pathway enrichment analysis methods treat all genes in a pathway uniformly and ignore gene interactions. In contrast, topology-aware methods explicitly model the interactions between genes. CePa 58 , GANPA 59 , and THINK-Back 60 use physical gene interactions or co-expression networks to assign a weight to each gene in each pathway. Weights can be derived from measures of the gene importance in the network such as degree, the number of gene connections, and betweenness centrality, and can be integrated into a traditional pathway enrichment analysis method such as GSEA. Methods like SPIA 61 , Pathway-Express 62 , and EnrichNet 63 generate an enrichment score for the entire pathway that considers pathway regulatory interactions, such as activation and inhibition. While useful and potentially more accurate, regulatory gene interactions are available for fewer genes compared to physical interaction networks and co-expression.
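As a small illustration of a degree-based gene weight, the sketch below counts the interactions of each gene in a toy pathway using base R. The edge list is invented, and the normalization used here (weights that sum to 1 within the pathway) is an assumption for illustration, not the scheme of any of the cited tools.

```r
# Hypothetical interaction edge list for one pathway (invented gene names)
edges <- data.frame(from = c("G1", "G1", "G1", "G2"),
                    to   = c("G2", "G3", "G4", "G3"),
                    stringsAsFactors = FALSE)

# Degree = number of interactions each gene participates in
degree <- table(c(edges$from, edges$to))

# One possible weighting (an assumption): scale degrees so that
# the weights sum to 1 within the pathway
weights <- as.numeric(degree) / sum(degree)
names(weights) <- names(degree)
```

Here G1, which has the most connections, receives the largest weight and would therefore contribute most to a weighted enrichment score.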

Figure 9 -Resulting Publication Ready Enrichment Map
Publication-ready annotated enrichment map (created with parameters FDR q-value < 0.01, and combined coefficient >0.375 with combined constant = 0.5). Red and blue nodes represent mesenchymal and immunoreactive phenotype pathways, respectively, and were manually separated to form a clearer picture. Clusters of nodes were labelled using the AutoAnnotate Cytoscape app. Individual node labels were removed for clarity using the publication ready button in Enrichment Map and exported to PNG and PDF files. The figure was resized using illustration software but no additional modifications were made to the network.
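The edge threshold quoted in this legend can be illustrated with a short base R sketch. It assumes that the combined coefficient is a weighted average of the overlap and Jaccard coefficients, with the combined constant k as the weight on the overlap coefficient; this form is an assumption to be checked against the Enrichment Map documentation. The two gene sets are hypothetical.

```r
# Two hypothetical pathway gene sets
set_a <- c("TP53", "MDM2", "CDKN1A", "ATM")
set_b <- c("TP53", "ATM", "CHEK2", "BRCA1", "BRCA2")

shared  <- length(intersect(set_a, set_b))
jaccard <- shared / length(union(set_a, set_b))          # shared / union
overlap <- shared / min(length(set_a), length(set_b))    # shared / smaller set

# Assumed form of the combined coefficient, weighted by the combined constant k
k        <- 0.5
combined <- k * overlap + (1 - k) * jaccard

# An edge is drawn when the coefficient exceeds the threshold used for Figure 9
edge_drawn <- combined > 0.375
```

Raising the threshold removes edges between pathways that share proportionally fewer genes, which yields a sparser map.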

Figure 10 -Collapsed Enrichment Map
The network was further summarized by collapsing node clusters using the AutoAnnotate app. The network was scaled for better node distribution and manually adjusted to reduce node and label overlap. The enrichment map was exported to PNG and PDF. No additional modifications were made to the network in any illustration software tools.

Supplementary Protocols
This protocol processes RNA-seq data using the R programming environment and specialized packages from Bioconductor to create gene lists. The scripts are available for download and novice users can copy and paste commands into the R console.
Maximization) data can be used with the edgeR method.

Equipment
Hardware requirements:
• A recent personal computer with at least 8 gigabytes of memory (RAM).
Software requirements:
• This information is used to extract two subgroups of interest, mesenchymal and immunoreactive.

Equipment Setup
• Download and install R from http://cran.r-project.org/
• Download and install RStudio from https://www.rstudio.com/ (optional, but recommended)
• Launch R or RStudio
install.packages(c("pheatmap", "RColorBrewer", "gProfileR", "RJSONIO", "httr"))
• If the required packages are already installed you may receive a prompt to update these. The prompt window will ask you about updating the packages:
The first step of your R script will change the working directory of R to this folder.
• As text editors sometimes add invisible characters to text copied from PDF files, copying and pasting from this document is not recommended. The R scripts or the R notebook available in the above URLs should be used instead.
• Setting the current directory and loading packages (libraries) are the required first and second steps of each protocol. These are needed each time a new session is opened in R.

Supplementary Protocol 1 -create a gene list by analyzing gene expression data from RNA-seq using edgeR
This part of the supplementary protocol demonstrates filtering and scoring RNA-seq data using normalized RNA-seq count data with the edgeR R package. The protocol can be used to produce input data for pathway enrichment methods like g:Profiler, GSEA and others. This RNA-seq analysis protocol follows conceptually similar steps to microarray analysis shown above.
5. Data normalization and dispersion analysis are performed on the entire data.
# create data structure to hold counts and subtype information for each sample.
8b. Create a two-column rank (.RNK) file of all gene IDs and corresponding scores for input to GSEA pre-ranked analysis. One option is to rank genes by the t-statistic of differential gene expression. GSEA will look for enrichment in the set of most differentially expressed genes at the top of the list as well as those at the bottom of the list. Genes at the top of the list are more highly expressed in class A of samples (e.g., mesenchymal) while genes at the bottom are more highly expressed in class B (e.g., immunoreactive). An alternative score that we use here is computed by multiplying the direction (sign) of the fold change by the negative logarithm (-log10) of the p-value for each gene.
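The signed score described in step 8b can be sketched in a few lines of base R. The data frame below holds made-up differential expression results; in practice these values would come from the edgeR analysis. The column names `gene`, `logFC` and `PValue` and the use of -log10 are assumptions made for this illustration.

```r
# Made-up differential expression results (stand-in for the edgeR output)
de <- data.frame(
  gene   = c("GENE1", "GENE2", "GENE3", "GENE4"),
  logFC  = c(2.1, -1.7, 0.3, -0.2),
  PValue = c(1e-8, 1e-5, 0.20, 0.70),
  stringsAsFactors = FALSE
)

# Signed score: direction of change times -log10(p-value), so strongly
# up-regulated genes get large positive scores and strongly
# down-regulated genes get large negative scores
de$rank <- sign(de$logFC) * -log10(de$PValue)

# Order from most up-regulated to most down-regulated; keep two columns
rnk <- de[order(de$rank, decreasing = TRUE), c("gene", "rank")]

# Write a tab-separated file with a header line
rnk_path <- tempfile(fileext = ".rnk")
write.table(rnk, rnk_path, sep = "\t", quote = FALSE, row.names = FALSE)
```

Genes with weak evidence end up in the middle of the list regardless of fold change, which is the intended effect of weighting the direction by significance.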

Supplementary Protocol 2 -create a gene list by analyzing gene expression data from Affymetrix microarrays with Limma
This protocol demonstrates the generation of gene lists from RMA-normalized gene expression data from Affymetrix microarrays for downstream pathway enrichment analysis with g:Profiler, GSEA and other similar tools. g:Profiler requires a ranked list of differentially expressed genes that are filtered according to a significance cut-off. GSEA requires a two-column tab-separated RNK file with a ranked list of all genes in the genome. In the RNK file, the first column specifies the gene name and the second column specifies a numeric score representing the level of differential expression. For both methods, the first step involves calculating a statistic for each gene that represents the difference in its expression levels between the two groups. This step is performed using the limma R package.
20b. Create a rank file for GSEA. To run GSEA in pre-ranked mode, you need a two column RNK file with gene/protein/probe name (column 1) and the associated score

Supplementary Protocol 3 -Pathway Enrichment Analysis in R using Roast and Camera
This protocol will demonstrate the use of R packages Roast and Camera to automate pathway enrichment analysis. Each method requires an expressionSet that minimally contains a matrix of expression values for a set of genes and conditions. The expression matrix generated in supplementary protocol part 1 or 2 is suitable for the analysis.
27. Filter the pathway gene sets according to their size, following the previous step of filtering by availability of expression data. Here we only include sets with more than or equal to 15 and less than 200 genes.
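The two filtering steps described in step 27 can be sketched in base R as follows. The gene sets and the vector of expressed genes are invented placeholders; in the protocol they would come from the GMT file and the expression matrix, respectively.

```r
# Hypothetical gene sets (as would be read from a GMT file) and the
# genes that have expression data
gene_sets <- list(
  small_set  = paste0("g", 1:10),
  medium_set = paste0("g", 1:40),
  large_set  = paste0("g", 1:400)
)
expressed_genes <- paste0("g", 1:300)

# First keep only the genes with expression data in each set
gene_sets_expr <- lapply(gene_sets, intersect, expressed_genes)

# Then keep sets with >= 15 and < 200 genes
set_sizes     <- lengths(gene_sets_expr)
gene_sets_use <- gene_sets_expr[set_sizes >= 15 & set_sizes < 200]
```

The order matters: a set that looks large enough in the database may fall below the lower limit once genes without expression data are removed.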
# sample classes (phenotypes) taken from the edgeR data structure
classes <- data_for_gs_analysis$samples$group
# design matrix with one column per class and no intercept
design <- model.matrix(~ 0 + classes)
# contrast comparing the two classes (makeContrasts() is provided by the limma package)
contrast_mesenvsimmuno <- makeContrasts(mesenvsimmuno = "classesImmunoreactive-classesMesenchymal", levels = design)
29. Run enrichment analysis and format the results to the 'generic' file format of Enrichment Map. This is a tab-delimited file that includes a pathway gene set name, pathway description, p-value, FDR q-value, phenotype and a comma-separated list of associated genes for every detected pathway. Depending on your data size and computer speed, this command could take from a few minutes to an hour to run. If you receive the warning "In dnbinom(q, size = size, mu = mu, log = TRUE) : non-integer x", the software has encountered unexpected non-integer values of gene expression, often indicating problems with upstream analysis such as suboptimal pre-processing or normalization procedures. Simply rounding the gene expression values may silence the warning; however, the cause should be investigated further to rule out errors in the workflow.
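The 'generic' Enrichment Map results file described in step 29 can be assembled with base R. The sketch below uses made-up results for two hypothetical gene sets; the column names, the +1/-1 encoding of the phenotype and the output path are assumptions for illustration, and the column order follows the description in the text.

```r
# Made-up enrichment results for two hypothetical gene sets
results <- data.frame(
  name        = c("SET_A", "SET_B"),
  description = c("Hypothetical pathway A", "Hypothetical pathway B"),
  pvalue      = c(0.0004, 0.03),
  stringsAsFactors = FALSE
)
results$fdr <- p.adjust(results$pvalue, method = "BH")

# Phenotype: +1 / -1 encodes the direction of enrichment (assumed convention)
results$phenotype <- c(1, -1)

# Comma-separated list of the genes associated with each set
gene_lists <- list(SET_A = c("g1", "g2", "g3"), SET_B = c("g4", "g5"))
results$genes <- unname(vapply(gene_lists[results$name], paste,
                               character(1), collapse = ","))

# Tab-delimited output: name, description, p-value, FDR, phenotype, genes
out_path <- tempfile(fileext = ".txt")
write.table(results, out_path, sep = "\t", quote = FALSE, row.names = FALSE)
```

The resulting file has one header line and one line per pathway, and can be inspected in a text editor or spreadsheet before loading it into Enrichment Map.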