GSA-Genie: a web application for gene set analysis

Gene set analysis is often used to interpret results from upstream analyses through predefined gene sets that are linked to biological features such as cell cycle or tumorigenesis. Gene sets have been defined in the literature via various criteria and are archived by numerous databases. We compiled over 2.3 million gene sets from 17 sources and made them accessible through a web application, GSA-Genie. Selected gene sets can be analyzed online using one of 16 statistical methods. These methods can be grouped into two strategies: testing for over-representation of a gene set within a gene list, or comparing a gene-level statistic between the gene set and the background. GSA-Genie runs on a Shiny web server hosted in a cloud instance within Amazon Web Services. Compared with existing tools, GSA-Genie offers a broad selection of gene sets and statistical methods. GSA-Genie is freely available at http://gsagenie.awsomics.org.


Introduction
Gene sets are groups of genes with one or more shared biological features, such as those involved in the same signaling pathway or activated together by a genetic perturbation. For example, the Gene Ontology defines gene sets based on collective knowledge about genes [1]; KEGG maps genes to canonical metabolic pathways [2]; and iProClass classifies genes according to protein sequence, structure, and function [3]. Furthermore, integrative databases like MSigDB [4] collect gene sets from primary databases and store them in a consistent format.
Gene set analysis (GSA) is often used by researchers to link their own results to a predefined gene set, and hence to the biological features associated with that gene set. There are two common strategies for running GSA. The first tests whether a gene set is over-represented in a project-specific gene list provided by the analyst. The gene list could be selected based on differential gene expression, a burden test of mutation frequency, or any other analysis. DAVID is a popular online GSA tool using this strategy [5]. The second strategy requires a gene-level statistic, such as the fold change of differential expression or the p value of a burden test. The statistic is used for a two-group comparison of means or distributions between each gene set and the background. GSEA [6] is a standalone program that applies this strategy. A variety of statistical tests have been developed to analyze different types of gene-level statistics (t-like, p-like, etc.), and the R/Bioconductor package piano [7] implements 11 such methods.
GSA-Genie is a web application and an alternative to the tools and databases mentioned above. It provides a one-stop, comprehensive solution for running GSA by allowing users to choose from 16 statistical methods and over 2.3 million gene sets sourced from 17 original databases.

Functionality
GSA-Genie sets up an analysis in three steps: select gene sets, upload user input, and choose a method (Figure 1A).

2.1 Step 1: select gene sets
Gene sets were collected from 17 sources (Figure 1B). All gene sets are consistently annotated and structured into collections that can be browsed by specifying species-source-collection, such as mouse-KEGG-pathway. Three types of gene identifiers are available for all gene sets: NCBI ID, official symbol, and Ensembl ID. Gene sets can be downloaded for offline analysis in several formats, including the .gmt format used by GSEA. Instead of using whole collections of gene sets, users can perform hypothesis testing on one or a few selected gene sets of interest to save runtime and reduce the multiple-testing burden.
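For offline use of downloaded gene sets, the .gmt layout is simple to handle: each line holds one gene set as tab-separated fields, namely the set name, a description, and then the member genes. The following is a minimal sketch of writing and reading that layout. The function names and the example gene set are hypothetical and are not taken from GSA-Genie.

```python
import io


def write_gmt(gene_sets, handle):
    """Write {name: (description, [genes])} as tab-delimited GMT lines."""
    for name, (description, genes) in gene_sets.items():
        handle.write("\t".join([name, description, *genes]) + "\n")


def read_gmt(handle):
    """Parse GMT lines into {name: (description, [genes])}."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank lines and sets without any member gene
        name, description, *genes = fields
        gene_sets[name] = (description, [g for g in genes if g])
    return gene_sets


# Hypothetical gene set used only to demonstrate a round trip.
example = {
    "mouse_KEGG_pathway_demo": ("hypothetical demo set", ["Trp53", "Cdkn1a", "Mdm2"]),
}
buffer = io.StringIO()
write_gmt(example, buffer)
buffer.seek(0)
round_trip = read_gmt(buffer)
```

In practice the handle would be a file opened from the downloaded .gmt file rather than an in-memory buffer.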

2.2 Step 2: upload input
GSA-Genie accepts two types of user input: an ID-only gene list, or a table with one row per gene and at least one column of gene-level statistics. Both must use one of the three recognized gene identifiers. When the input is a gene list, the analysis must be an over-representation test. Users also have the option to upload the testing background to which the gene list will be compared. When gene-level statistics are uploaded, users can either use the statistics to select top genes for an over-representation test, or use all genes in the table for a two-group comparison between each gene set and all other genes in the uploaded table.

2.3 Step 3: choose method
Three methods are available for over-representation tests: Fisher's exact (hypergeometric), chi-squared, and proportion tests. If users did not upload their own background, they can choose either all known genes or all genes in the selected gene set collections as the background.
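As an illustration of the over-representation strategy, the one-sided hypergeometric test asks how likely it is to see at least the observed overlap when the gene list is drawn at random from the background. The sketch below uses only the Python standard library. The function names and the counts in the example are hypothetical, and this is not GSA-Genie's own code.

```python
from math import comb


def hypergeom_enrichment_p(overlap, list_size, set_size, background_size):
    """P(X >= overlap) when list_size genes are drawn at random from a
    background that contains set_size members of the gene set."""
    total = comb(background_size, list_size)
    tail = sum(
        comb(set_size, k) * comb(background_size - set_size, list_size - k)
        for k in range(overlap, min(list_size, set_size) + 1)
    )
    return tail / total


def odds_ratio(overlap, list_size, set_size, background_size):
    """Odds ratio of the 2x2 table (in list or not) x (in set or not)."""
    a = overlap                                   # in list, in set
    b = list_size - overlap                       # in list, not in set
    c = set_size - overlap                        # not in list, in set
    d = background_size - list_size - set_size + overlap
    return float("inf") if b * c == 0 else (a * d) / (b * c)


# Hypothetical counts: 20 background genes, a gene set of 5,
# a gene list of 4, and 3 genes shared between list and set.
p_value = hypergeom_enrichment_p(3, 4, 5, 20)
ratio = odds_ratio(3, 4, 5, 20)
```

With these counts the tail probability is 155/4845, roughly 0.032, and the odds ratio is 21. The choice of background changes `background_size` and therefore the p value, which is why the background option matters.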
The R/Bioconductor piano package implements 11 statistical methods for two-group comparison of gene-level statistics, such as PAGE [22] and tail strength [23]. GSA-Genie adds two more methods, Student's t and Kolmogorov-Smirnov tests, for comparing means and distributions respectively. These methods offer analytical flexibility for different types of gene-level statistics. For example, Fisher's combined test is for p-like statistics only, while GSEA requires t-like statistics. Non-parametric methods, such as the Wilcoxon rank sum test, are more robust and applicable to all types of statistics. GSA-Genie also provides options to transform one type of statistic into another. For example, re-scaling plus adding directionality information can convert a p-like statistic to a t-like one. In addition, GSA-Genie applies a configurable limit on the number of gene sets that can be tested with each method, to avoid running slower methods on a large number of gene sets. For example, GSEA, which is much slower than the other methods, is only available for simultaneously testing five or fewer gene sets. This limit can be adjusted when running GSA-Genie on a local instance.
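One possible form of the p-like to t-like conversion is sketched below: the p value is re-scaled with a negative log10 transform, and the sign of a direction value (for example a log fold change) is attached to the result. This particular transform, the function name, and the example rows are assumptions for illustration; the text does not specify which re-scaling GSA-Genie uses.

```python
import math


def p_to_t_like(p_value, direction, floor=1e-300):
    """Convert a p value plus a direction into a signed, t-like score.

    Assumed transform: sign(direction) * -log10(p). The floor guards
    against log10(0) when a p value underflows to zero.
    """
    magnitude = -math.log10(max(p_value, floor))
    if direction == 0:
        return 0.0  # no directionality available
    return math.copysign(magnitude, direction)


# Hypothetical rows of (gene, p value, log fold change).
rows = [("GeneA", 0.001, 2.1), ("GeneB", 0.5, -0.3), ("GeneC", 0.01, -1.8)]
t_like = {gene: p_to_t_like(p, lfc) for gene, p, lfc in rows}
```

Here a strongly significant up-regulated gene receives a large positive score and a strongly significant down-regulated gene a large negative one, which is the property that methods requiring t-like statistics rely on.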

Results
GSA-Genie presents the results online immediately after an analysis is completed. Regardless of the method used, each tested gene set has a p value and a corresponding false discovery rate. For the over-representation test, the odds ratio and enrichment percentage are also included in the results. The overlap between a selected gene set and the user's gene list can be plotted as a Venn diagram (Figure 1C). For two-group comparison, the results often include a method-specific test statistic, such as the t statistic of Student's t test or the enrichment score of GSEA. Methods using the directionality information of t-like statistics also report extra directional p values. Users can also select a gene set to visualize its density distribution versus the background distribution (Figure 1D).

[Figure 1 panel text: Step 1, select gene sets; Step 2, upload user input; Step 3, choose method. Method availability is limited by type of statistics and number of gene sets.]
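To show how a false discovery rate can be derived from the per-gene-set p values, the sketch below applies the Benjamini-Hochberg adjustment. The choice of Benjamini-Hochberg is an assumption made for illustration, since the text does not state which FDR procedure GSA-Genie uses. The function name and the p values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so that adjusted values
    # never increase as the raw p value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p values for four tested gene sets.
raw_p = [0.01, 0.04, 0.03, 0.20]
fdr = benjamini_hochberg(raw_p)
```

Testing only a few selected gene sets keeps `m` small, which is one way to see why restricting the analysis to gene sets of interest reduces the multiple-testing burden.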