Fast gene set enrichment analysis

Gennady Korotkevich, Vladimir Sukhov, Alexey Sergushichev
doi: https://doi.org/10.1101/060012
Gennady Korotkevich
1Computer Technologies Laboratory, ITMO University, Saint Petersburg, 197101, Russia
Vladimir Sukhov
1Computer Technologies Laboratory, ITMO University, Saint Petersburg, 197101, Russia
2JetBrains Research, Saint Petersburg, Russia
Alexey Sergushichev
1Computer Technologies Laboratory, ITMO University, Saint Petersburg, 197101, Russia
  • For correspondence: alserg@itmo.ru

Abstract

Preranked gene set enrichment analysis (GSEA) is a widely used method for interpreting gene expression data in terms of biological processes. Here we present the FGSEA method, which estimates arbitrarily low GSEA P-values more accurately and much faster than other implementations. We also present a polynomial algorithm that calculates GSEA P-values exactly, which we use to confirm the accuracy of the method in practice.

1 Main

Preranked gene set enrichment analysis [1] is a widely used method for analyzing gene expression data. From an a priori defined list of gene sets, it selects those that behave non-randomly in the experiment under consideration. The method uses an enrichment score (ES) statistic, which is calculated from a vector of gene-level signed statistics, such as the t-statistic from a differential expression test. Unlike the similar approach of calculating Fisher P-values from an overlap statistic, it does not require arbitrary thresholding. This also allows the method to identify pathways that contain many co-regulated genes even when the individual effects are small.

A major drawback of the method is that its implementations are slow. As the analytical form of the null distribution of the ES statistic is not known, an empirical null distribution has to be calculated. This can be done in a straightforward manner by sampling random gene sets, as was done in the reference implementation [1] and reimplementations [2, 3]. In this case, an ES value is calculated for each of the input pathways. Next, a number of random gene sets of the same size are generated, and an ES value is calculated for each of them. Then a P-value is estimated as the number of random gene sets with the same or more extreme ES value divided by the total number of generated gene sets (a formal definition is given in section 2.1). However, a large number of gene set samples is required for the test to have good statistical power, in particular due to correction for multiple hypothesis testing.
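The naive sampling procedure described above can be sketched as follows. This is a minimal illustration and not the reference implementation: the function names, the toy ranking and the use of 0-based gene indices are assumptions made for the example, the ES is assumed to be the usual GSEA running sum, and the P-value is normalized by the total number of samples exactly as stated in the paragraph above (the formal definition in section 2.1 may differ in details).

```python
import random

def enrichment_score(stats, gene_set):
    """Signed ES of gene_set (0-based indices) for stats sorted in decreasing order."""
    n, k = len(stats), len(gene_set)
    members = set(gene_set)
    ns = sum(abs(stats[i]) for i in members)  # normalizing constant NS
    running, best = 0.0, 0.0
    for i in range(n):
        if i in members:
            running += abs(stats[i]) / ns      # step up for a gene in the set
        else:
            running -= 1.0 / (n - k)           # step down otherwise
        if abs(running) > abs(best):           # keep the most extreme deviation
            best = running
    return best

def naive_pvalue(stats, gene_set, n_perm=1000, seed=0):
    """Fraction of random gene sets of the same size with the same or more extreme ES."""
    rng = random.Random(seed)
    observed = enrichment_score(stats, gene_set)
    hits = 0
    for _ in range(n_perm):
        random_set = rng.sample(range(len(stats)), len(gene_set))
        es = enrichment_score(stats, random_set)
        if (es >= observed) if observed >= 0 else (es <= observed):
            hits += 1
    return hits / n_perm

# Hypothetical toy ranking: 100 genes with strictly decreasing statistics.
stats = [50.5 - i for i in range(100)]
top_p = naive_pvalue(stats, [0, 1, 2, 3, 4])        # strongly enriched set
mixed_p = naive_pvalue(stats, [10, 30, 50, 70, 90])  # evenly spread set
```

Each call recomputes the ES of every random set from scratch, which is the cost that the sample-sharing approach introduced next is designed to avoid.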

Here we present a fast gene set enrichment analysis (FGSEA) method for efficient estimation of GSEA P-values for a collection of pathways. The method consists of two main procedures: FGSEA-simple and FGSEA-multilevel. The FGSEA-simple procedure efficiently estimates P-values with limited accuracy but simultaneously for the whole collection of gene sets, while the FGSEA-multilevel procedure accurately estimates arbitrarily low P-values but for individual gene sets.

The FGSEA-simple procedure is based on the idea that generated random gene set samples can be shared between different input pathways. Indeed, consider M gene sets of sizes $K_1 \le K_2 \le \dots \le K_M = K$ and a collection of n independent samples $g_i$ of size K (Fig 1a). As in the naive approach, because the $g_i$ are independent samples of size K, the P-value for pathway M can be estimated as the proportion of samples $g_i$ having the same or more extreme ES value as pathway M. However, for any other pathway j we can construct a set of n independent samples of size $K_j$ by considering the prefixes $g_{i,1..K_j}$. Again, given a set of independent samples, the P-value can be estimated as the proportion of the samples having the same or more extreme ES value.

Figure 1:

Preranked gene set enrichment analysis can be sped up by sharing sampling information between different gene set sizes. a, It is sufficient to generate n independent samples of size $K_M$ to calculate the empirical distribution for any size $K_j \le K_M$ by considering only the prefix of size $K_j$ of each sample. b, For a given gene sample, enrichment scores for all the prefixes can be efficiently calculated by employing a square root heuristic. The heuristic recalculates the enrichment score after adding one gene to the gene set in time proportional to the square root of the gene set size. The enrichment curve is split into $\sqrt{K}$ blocks such that one block (“redo”) containing the added gene takes $O(\sqrt{K})$ time to update and the other blocks (“keep” and “shift”) take O(1) time. c, The P-values calculated with the FGSEA-simple method are consistent with the reference implementation, but the results are obtained hundreds of times faster.

The next important idea is that, given a gene set sample $g_i$ of size K, the ES values for all the prefixes $g_{i,1..j}$ can be calculated efficiently using a square root heuristic (Fig 1b). Briefly, a variant of the enrichment curve is considered: the genes are enumerated from the most up-regulated to the most down-regulated, and the curve goes to the right if the gene is not present in the pathway and upward if it is. It can be shown that the enrichment score is easily calculated once the curve point most distant from the diagonal is known. Let us split the K genes of the gene set into $\sqrt{K}$ consecutive blocks of size $\sqrt{K}$ and consider what happens to the curve when we change the prefix from $g_{i,1..j-1}$ to $g_{i,1..j}$ by adding gene $g_{i,j}$. The parts of the curve in the blocks to the left of $g_{i,j}$ do not change at all, while the blocks to the right of $g_{i,j}$ are uniformly shifted. This observation allows us to consider the prefixes in increasing order and update the position of the most distant point in $O(\sqrt{K})$ time. Briefly, for each block that is either unchanged or shifted the update takes O(1) time, while for the changed block the update is proportional to its size and takes $O(\sqrt{K})$ time. Finally, aggregating the blocks takes an additional $O(\sqrt{K})$ time. Overall this results in a time complexity of $O(K\sqrt{K})$ to calculate ES values for all the prefixes. In total, the time complexity of calculating P-values for the set of M pathways is $O(n(K\sqrt{K} + M))$, which gives around Embedded Image speed up K compared to a naive approach. The full description of the algorithm is given in section 2.3.

As an example, we ran FGSEA-simple and the reference implementation on the same example dataset of genes differentially regulated on Th1 activation [4] against a set of 700 Reactome [5] pathways (see section 2.2) and compared the resulting nominal P-values (Fig 1c). Both methods were run with n = 10000, and the results are indistinguishable from each other up to the random noise inherent to both methods. However, on this example the reference implementation (version 4.0.1) took about 420 seconds, while FGSEA-simple finished in about 4 seconds. The two orders of magnitude speed-up is consistent with the theoretical one implied by the time complexity of the algorithm. Given a highly parallel implementation of FGSEA-simple, its performance makes it possible to routinely achieve nominal P-values on the order of $10^{-5}$ and to use standard procedures to correct for multiple hypothesis testing, such as the Benjamini-Hochberg procedure, for thousands of gene sets.

However, accurately estimating P-values lower than $10^{-6}$ with FGSEA-simple can be impractical or even infeasible. To estimate such low P-values we developed the FGSEA-multilevel method, which is based on an adaptive multilevel split Monte Carlo scheme [6]. The method takes as input an ES value $\gamma > 0$ and a gene set size K, and calculates the probability $P_K(ES \ge \gamma)$ that a random gene set of size K has an enrichment score no less than $\gamma$. The method sequentially finds ES levels $l_i$ for which the probability $P_K(ES \ge l_i)$ is approximately equal to $2^{-i}$ (see Fig 2a for a toy example). The method stops when $l_i$ becomes greater than $\gamma$, and the P-value can then be crudely approximated as $2^{-i}$.

Figure 2:

Adaptive multilevel Monte Carlo sampling scheme can be used to calculate arbitrarily low P-values. a, A toy illustration of the multilevel split Monte Carlo scheme for a sample size of Z = 5. First, five uniformly random gene sets are generated and the ES level $l_1$ is selected that corresponds to a P-value of $\frac{1}{2}$. Then the five samples are iteratively modified with Metropolis algorithm steps to obtain a uniform sample of gene sets with ES value greater than $l_1$. Based on these samples, a threshold $l_2$ is selected that corresponds to a P-value of $\frac{1}{4}$, and so on. b, Comparison of GSEA P-values as calculated by the FGSEA-simple method run on $10^8$ samples and by FGSEA-multilevel with a sample size of Z = 101. c, Comparison of P-values as calculated with the exact method and with the FGSEA-multilevel method. Both methods were run on gene-level statistic values rounded to integers. d, Comparison between the estimated and the observed error of $\log_2$ P-values for different P-values (from $10^{-4}$ to $10^{-100}$), gene set sizes (from 15 to 250) and sample sizes (from 101 to 1001).

The intermediate thresholds $l_i$ are calculated as follows. First, a set of Z (an odd number, a parameter of the method) random gene sets of size K is generated uniformly, and ES values are calculated for them. The median of the ES values is calculated and assigned to $l_1$. By construction, the probability $P_K(ES \ge l_1)$ that a random gene set has an ES value no less than $l_1$ can be approximated as $\frac{1}{2}$. Next, the $\frac{Z-1}{2}$ generated gene sets with ES values less than $l_1$ are discarded, while the $\frac{Z-1}{2}$ gene sets with ES values greater than $l_1$ are duplicated. This results in a sample of Z gene sets with ES values no less than $l_1$, but the distribution is non-uniform. However, it can be made into a uniform sample with a Metropolis algorithm. On each step of the Metropolis algorithm, a modification of each gene set sample is attempted by swapping a random gene from the set with a gene outside of the set. The change is accepted if the enrichment score of the new set is no less than the current threshold $l_1$; otherwise the change is rejected. The Metropolis algorithm guarantees that after enough steps the sample becomes close to uniformly distributed. Thus, the median of the enrichment scores ($l_2$) corresponds to a probability of $\frac{1}{2}$ for a gene set to have an enrichment score no less than $l_2$ given that it has an enrichment score no less than $l_1$: $$P_K(ES \ge l_2 \mid ES \ge l_1) \approx \frac{1}{2},$$ which means $$P_K(ES \ge l_2) \approx \frac{1}{4}.$$

The same procedure is applied to calculate the next li values.

The iterations stop when li becomes greater than γ. On this iteration the probability of a random gene set to have a ES value no less than γ can be approximated as: Embedded Image
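The level-finding loop can be illustrated with a deliberately simplified sketch. Several things here are assumptions made for the example and not details taken from the method: the positive-mode enrichment score is used as ES, a fixed number of Metropolis steps is run per level, the function returns only the crude $2^{-i}$ estimate (not a refined final-iteration estimate), and a `max_levels` cap guards against thresholds that cannot be reached. All names are hypothetical.

```python
import random

def es_plus(stats, members):
    """Positive-mode enrichment score: the maximum of the running sum."""
    n, k = len(stats), len(members)
    ns = sum(abs(stats[i]) for i in members)
    running, best = 0.0, 0.0
    for i in range(n):
        running += abs(stats[i]) / ns if i in members else -1.0 / (n - k)
        best = max(best, running)
    return best

def multilevel_crude(stats, k, gamma, z=11, mcmc_steps=20, max_levels=60, seed=0):
    """Crude estimate 2**-i of P(ES >= gamma) for random gene sets of size k (z odd)."""
    rng = random.Random(seed)
    n = len(stats)
    samples = [set(rng.sample(range(n), k)) for _ in range(z)]
    for level in range(1, max_levels + 1):
        samples.sort(key=lambda s: es_plus(stats, s))
        threshold = es_plus(stats, samples[z // 2])   # median ES is the level l_i
        if threshold > gamma:
            return 2.0 ** -level
        # Discard the lower (z-1)/2 sets, keep the median, duplicate the upper (z-1)/2.
        upper = samples[z // 2 + 1:]
        samples = [samples[z // 2]] + upper + [set(s) for s in upper]
        # Metropolis steps: swap one gene in for one gene out, keep ES >= threshold.
        for s in samples:
            for _ in range(mcmc_steps):
                out_gene = rng.choice(sorted(s))
                in_gene = rng.randrange(n)
                if in_gene in s:
                    continue
                s.remove(out_gene)
                s.add(in_gene)
                if es_plus(stats, s) < threshold:     # reject: undo the swap
                    s.remove(in_gene)
                    s.add(out_gene)
    return 2.0 ** -max_levels  # cap reached; gamma may be unattainable

# Hypothetical toy ranking without zero statistics.
stats = [49.5 - i for i in range(100)]
estimate = multilevel_crude(stats, k=5, gamma=0.8)
```

The sketch recomputes every enrichment score from scratch and uses a very small sample size, so it is only suitable for seeing how the thresholds move upward level by level.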

When estimating small P-values it becomes practical to carry out the estimation in log-scale. In particular, the values become practically unbiased both in median and mean sense and it becomes simple to estimate the error (see section 2.5.4).

The full formal description of the algorithm is available in the section 2.5.

For the example dataset we show that P-values are as low as $10^{-26}$ for some of the pathways and that the results are consistent with FGSEA-simple P-values computed on $10^8$ permutations (Fig 2b). Note that the FGSEA-multilevel calculation with a sample size of Z = 101 took only 10 seconds on a single thread, while $10^8$ permutations with FGSEA-simple took 40 minutes on 32 threads.

To further assess the approximation quality of the FGSEA-multilevel algorithm, we developed an exact method for calculating GSEA P-values, limited to integer weights. The method is based on dynamic programming; the full description is given in section 2.4. The complexity of the algorithm is $O(NKT^2)$, where N is the number of genes, K is the size of the gene sets and T is the sum of the top K absolute values of gene-level statistics. With a number of optimizations, this method calculates P-values for rounded weights in the example dataset in a couple of hours.

When run on the same integer weights, FGSEA-multilevel and the exact method give highly concordant results (Fig 2c). Additionally, using the exact P-values as a control, the real errors can be compared with the estimated ones. We show that the FGSEA-multilevel error estimates are highly concordant with the real errors (Fig 2d) for a wide range of P-values (from $10^{-4}$ to $10^{-100}$), gene set sizes (from 15 to 250) and sample sizes (from 101 to 1001).

In practice FGSEA-multilevel method is combined with FGSEA-simple. First, for all the input pathways FGSEA-simple method can be run with a limited sample size. Next, for the pathways that have high relative error after FGSEA-simple (i.e. pathways with low p-values) FGSEA-multilevel method is executed. As many of the pathways in an input collection usually are not enriched, they have a relatively high P-value and will be batch-processed with a highly efficient FGSEA-simple algorithm with deterministic time boundaries. The more interesting pathways with lower P-values will then be processed with FGSEA-multilevel algorithm individually and the amount of processing time will depend on their P-values.

Finally, as FGSEA makes it practical to estimate P-values for large collections of gene sets, it can lead to a large number of statistically significant hits with high overlaps. To deal with this issue and make the representation of FGSEA results more concise, we developed a procedure to filter the redundant gene sets. The procedure is similar to the GO Trimming method [7] but is based on Bayesian network construction approaches. It considers the significant pathways one by one and tries to remove gene sets that do not provide new information given some other pathway already present in the output. Here, we consider a pathway P1 to give new information given a pathway P2 if the P-value of pathway P1 in the universe of genes from P2, or in the universe of genes outside of P2, is less than some threshold. This procedure filters redundant pathways without requiring any explicit hierarchy of pathways. The full description of the procedure is given in section 2.6. The table resulting from running FGSEA on the example dataset with filtering of redundant hits is shown in Fig 3.

Figure 3:

An example of FGSEA results obtained with the FGSEA-multilevel method for the Th0 vs Th1 comparison and Reactome pathways. The analysis was run with a sample size of 101. Redundant pathways were filtered.

To conclude, we present FGSEA, a method for fast preranked gene set enrichment analysis. The method makes it possible to routinely estimate even very low P-values and can be used in conjunction with standard multiple hypothesis testing correction methods, such as the Benjamini-Hochberg procedure. This, in turn, makes it possible to analyze even large collections of pathways, which require a very low nominal P-value for a pathway to remain significant after multiple hypothesis testing correction. The FGSEA method is freely available as an R package at Bioconductor (http://bioconductor.org/packages/fgsea) and on GitHub (https://github.com/ctlab/fgsea).

2 Methods

2.1 Formal definitions

The preranked gene set enrichment analysis takes as input two objects: an array of gene-level statistic values S for the genes U = {1, 2, …, N} and a list of query gene sets (pathways) P. The goal of the analysis is to determine which of the gene sets from P has a non-random behavior.

The statistic array S of the size |S| = N for each gene i ∈ U contains a value Si ∈ ℝ that characterizes the gene behavior in a considered biological process. Commonly, if Si > 0 the expression of gene i goes up on treatment compared to control and Si < 0 means that the expression goes down. Absolute values |Si| represent magnitude of the change. Array S is sorted in a decreasing order: Si > Sj for i < j. The value of N in practice is about 10000–20000.

The list of gene sets P = {P1, P2, …, PM} of length M usually contains groups of genes that are commonly regulated in some biological process. We assume that the gene sets Pi are ordered by their size (denoted as Ki): K1 ≤ K2 ≤ … ≤ KM = K. Usually only relatively small gene sets are considered with K ≈ 500 genes.

To quantify a co-regulation of genes in a gene set p Subramanian et al.[1] introduced a gene set enrichment score function sr(p) that uses gene rankings (values of S). The more positive is the value of sr(p) the more enriched the gene set is in the positively-regulated genes (with Si > 0). Accordingly, negative sr(p) corresponds to enrichment in the negatively regulated genes.

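The value of $s_r(p)$ can be calculated as follows. Let $k = |p|$ and $NS = \sum_{i \in p} |S_i|$. Let also ES be an array specified by the following formula:

$$ES_i = \begin{cases} 0 & \text{if } i = 0, \\ ES_{i-1} + \frac{1}{NS}|S_i| & \text{if } 1 \le i \le N \text{ and } i \in p, \\ ES_{i-1} - \frac{1}{N-k} & \text{if } 1 \le i \le N \text{ and } i \notin p. \end{cases}$$

The value of $s_r(p)$ corresponds to the entry of ES that is largest in absolute value:

$$s_r(p) = ES_{i^*}, \quad \text{where } i^* = \operatorname{arg\,max}_{i} |ES_i|.$$

For convenience, we also introduce the following notation:

$$s_r^+(p) = \max_{i} ES_i, \qquad s_r^-(p) = \min_{i} ES_i.$$

From these two values it is easy to find the value of $s_r(p)$, which is equal to $s_r^+(p)$ if $|s_r^+(p)| > |s_r^-(p)|$ and to $s_r^-(p)$ otherwise.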
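As a small worked example (the numbers are made up for illustration, and the running-sum form of ES with steps $+|S_i|/NS$ for genes in p and $-1/(N-k)$ for all other genes is assumed), take N = 5, S = (3, 2, 1, −1, −2) and p = {1, 4}:

```latex
\begin{align*}
NS &= |S_1| + |S_4| = 3 + 1 = 4, \qquad N - k = 5 - 2 = 3, \\
ES_0 &= 0, \\
ES_1 &= ES_0 + \tfrac{3}{4} = \tfrac{3}{4}, \\
ES_2 &= ES_1 - \tfrac{1}{3} = \tfrac{5}{12}, \\
ES_3 &= ES_2 - \tfrac{1}{3} = \tfrac{1}{12}, \\
ES_4 &= ES_3 + \tfrac{1}{4} = \tfrac{1}{3}, \\
ES_5 &= ES_4 - \tfrac{1}{3} = 0.
\end{align*}
```

The largest entry is $ES_1 = \frac{3}{4}$ and the smallest is 0, so in this example the positive-mode score is $\frac{3}{4}$, the negative-mode score is 0, and the signed enrichment score is $\frac{3}{4}$. The running sum always returns to zero at i = N because the upward steps sum to 1 and the downward steps sum to −1.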

Often we will consider only the positive values of the gene set enrichment score function since: Embedded Image where Embedded Image and Embedded Image corresponds to the gene set enrichment score function for array S′ such that Embedded Image.

Next, following Subramanian et al for a pathway p we define GSEA P-value as: Embedded Image where q is a random gene set of size k.

2.2 The example data

As the example ranking we used Th0 vs Th1 comparison from dataset GSE14308 [4]. The differential expression was calculated using limma [8]. Only top 12000 genes by mean expression were used. Limma t-statistic was used as gene-level statistic. The script to generate rankings is available on GitHub: https://github.com/ctlab/fgsea/blob/master/inst/gen_gene_ranks.R.

Reactome [5] database was used as an example collection via reactome.db R package. For the analysis only the pathways of the size from 15 to 500 were used. The script to generate pathway collection is available on GitHub: https://github.com/ctlab/fgsea/blob/master/inst/gene_reactome_pathways.R

2.3 FGSEA-simple: an algorithm for fast calculation of GSEA P-values simultaneously for many pathways

In this section we describe an algorithm for fast estimation of GSEA P-values simultaneously for a collection of pathways P. There, for each pathway p a set of n uniformly random gene sets qi are considered. Then P-value is estimated as: Embedded Image for positively enriched pathway p and as: Embedded Image for negatively enriched pathway. These two formulas follow Subramanian et al. implementation, except of +1 terms, which are recommended by Phipson and Smyth [9]. Otherwise, the nominal P-values from FGSEA-simple and reference implementation are indistinguishable, however FGSEA-simple works orders of magnitude faster.

2.3.1 Cumulative statistic calculation for the mean statistic

Let us first describe the idea of the proposed algorithm using a simple mean statistic $s_m$: $$s_m(p) = \frac{1}{|p|}\sum_{i \in p} S_i.$$

The main idea of the algorithm is to reuse sampling for different query gene sets. This can be done due to the fact that for an estimation of null distributions samples have to be independent only for a specific gene set size, while they can be dependent between different sizes.

Instead of generating nM independent random gene sets (n for each of the M input gene sets), we generate only n random gene sets of size K. Let $\pi_i$ be the i-th random gene set of size K. From that gene set we can generate gene sets for all the query pathways $P_j$ by using its prefixes: $\pi_{i,j} = \pi_i[1..K_j]$.

The next step is to calculate the enrichment scores for all gene sets πi,j. Instead of calculating enrichment scores separately for each gene set we will calculate simultaneously scores for all πi,j for a fixed i. Using a simple procedure it can be done in Θ(K) time.

Let us find enrichment scores for all prefixes of $\pi_i$. This can be done by dividing the array of cumulative sums element-wise by the length of the corresponding prefix: $$s_m(\pi_i[1..k]) = \frac{1}{k}\sum_{j=1}^{k} S_{\pi_i[j]}, \quad 1 \le k \le K.$$

Selecting only the required prefixes takes an additional Θ(M) time.

The described procedure finds P-values for all query gene sets in Θ(n(K + M)) time. This is about min(K, M) times faster than the straightforward procedure.
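The sample-sharing idea for the mean statistic can be sketched as below. This is an illustration under assumptions, not the package code: the names are hypothetical, genes are 0-based indices into the statistic array, and a simple one-sided upper-tail proportion is used as the P-value estimate.

```python
import random
from itertools import accumulate

def prefix_mean_scores(stats, sample, sizes):
    """Mean statistic of the prefix of `sample` for every requested size."""
    cums = list(accumulate(stats[g] for g in sample))  # cumulative sums, O(K)
    return [cums[k - 1] / k for k in sizes]            # one lookup per query size

def shared_sample_pvalues(stats, pathways, n_samples=1000, seed=0):
    """Upper-tail P-value estimates for all pathways from n shared samples."""
    rng = random.Random(seed)
    sizes = [len(p) for p in pathways]
    observed = [sum(stats[g] for g in p) / len(p) for p in pathways]
    k_max = max(sizes)
    hits = [0] * len(pathways)
    for _ in range(n_samples):
        # One random ordered sample of the largest size serves every pathway:
        # its prefix of length K_j is a uniformly random gene set of size K_j.
        sample = rng.sample(range(len(stats)), k_max)
        for j, score in enumerate(prefix_mean_scores(stats, sample, sizes)):
            if score >= observed[j]:
                hits[j] += 1
    return [h / n_samples for h in hits]

# Hypothetical toy ranking and two query gene sets of different sizes.
stats = [50.5 - i for i in range(100)]
pvals = shared_sample_pvalues(stats, [[0, 1, 2], [40, 50, 60, 70]])
```

Samples are dependent across pathway sizes but independent within each size, which is all that the null distribution estimate for a single pathway requires.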

2.3.2 Cumulative statistic calculation for enrichment score

For the enrichment score $s_r$ we use a similar idea as above: we again sample only gene sets of size K and from each sample calculate statistic values for all the other sizes. However, calculating the cumulative statistic values for the subsamples is more complex in this case. In this section we consider only the positive mode of the enrichment statistic, $s_r^+$.

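It is helpful to look at the enrichment score from a geometric point of view. For a pathway p of size |p| = k, let us consider a graph of N + 1 points (Fig. 4) with coordinates $(x_i, y_i)$ for $0 \le i \le N$ such that: $$x_0 = y_0 = 0, \tag{1}$$ $$x_i = x_{i-1} + [i \notin p], \tag{2}$$ $$y_i = y_{i-1} + [i \in p] \cdot |S_i|, \tag{3}$$ where $[\cdot]$ equals 1 if the condition holds and 0 otherwise.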

Figure 4:

A graph that corresponds to the calculation of the enrichment score. Each breakpoint on the graph corresponds to a gene present in the pathway. Dotted lines cross at the point that is the farthest up from the diagonal (dashed line). This point corresponds to gene i+, where the maximal value of $ES_i$ is reached.

The calculation of $s_r^+(p)$ corresponds to finding the point farthest up from the diagonal $((x_0, y_0), (x_N, y_N))$. Indeed, it is easy to see that $x_N = N - |p| = N - k$ and $y_N = \sum_{j \in p}|S_j| = NS$, while the individual enrichment scores $ES_i$ can be calculated as $ES_i = \frac{y_i}{NS} - \frac{x_i}{N-k}$. The value of $ES_i$ is proportional to the directed distance from the line going through $(x_0, y_0)$ and $(x_N, y_N)$ to the point $(x_i, y_i)$.
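The geometric view can be checked numerically with a short sketch. It assumes that the curve moves one unit to the right for every gene outside the pathway and up by $|S_i|$ for every gene inside it, and that ES follows the usual running-sum recurrence; the function names and the toy data are hypothetical.

```python
def curve_points(stats, pathway):
    """Coordinates (x_i, y_i), i = 0..N, of the enrichment curve (0-based genes)."""
    members = set(pathway)
    xs, ys = [0], [0.0]
    for i, s in enumerate(stats):
        xs.append(xs[-1] + (i not in members))                 # right: gene not in p
        ys.append(ys[-1] + (abs(s) if i in members else 0.0))  # up: gene in p
    return xs, ys

def es_from_points(xs, ys):
    """ES_i = y_i / NS - x_i / (N - k), using x_N = N - k and y_N = NS."""
    x_n, y_n = xs[-1], ys[-1]
    return [y / y_n - x / x_n for x, y in zip(xs, ys)]

def es_by_recurrence(stats, pathway):
    """Running-sum ES values ES_0..ES_N computed step by step."""
    members = set(pathway)
    n, k = len(stats), len(members)
    ns = sum(abs(stats[i]) for i in members)
    es = [0.0]
    for i, s in enumerate(stats):
        step = abs(s) / ns if i in members else -1.0 / (n - k)
        es.append(es[-1] + step)
    return es

# Hypothetical toy ranking (sorted in decreasing order) and pathway.
stats = [5, 4, 3, 2, 1, -1, -2, -3]
pathway = [0, 2, 5]
xs, ys = curve_points(stats, pathway)
geometric = es_from_points(xs, ys)
recurrent = es_by_recurrence(stats, pathway)
```

Both routes give the same ES values up to rounding, so the maximum of the running sum is attained at the curve point that lies farthest above the diagonal.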

Let us fix a sample π of size K. To efficiently calculate the cumulative values $s_r^+(\pi[1..k])$ for all $k \le K$ we need a fast method of updating the farthest point when a new gene is added. In that case we can add genes from π one by one and calculate the values $s_r^+(\pi[1..k])$ from the corresponding maximal distances.

Because we are calculating values for π[1‥k] for k ≤ K we know in advance which K genes will be added. This allows us to consider K + 1 points instead of N + 1 for each iteration k. Let array o of size K contain the sorted order of genes in π: that is, Embedded Image is the minimal among π, Embedded Image is the second minimal and so on. The coordinates can be calculated as follows: Embedded Image Embedded Image Embedded Image where we set Embedded Image to be zero.

It can be shown that finding the farthest up point among (4)–(6) is equivalent to finding the farthest up point among (1)–(3) with Embedded Image being equal to Embedded Image calculated for p = π[1‥k]. Consider Embedded Image. By the definition of x it is equal to: Embedded Image

By the definition of o, in the interval Embedded Image there are no genes from π and, thus, from π[1‥k]. Thus we can replace the sum with its last member: Embedded Image

We got the same difference as in (5).

Now consider Embedded Image. By the definition of y it is equal to: Embedded Image

Again, in the interval Embedded Image there are no genes from π[1‥k]. Thus we can replace the sum with only the last member: Embedded Image

We got the same difference as in (6).

We do not need to consider other points, because the points from $o_{i-1}$ to $o_i - 1$ have the same y coordinate and $o_{i-1}$ is the leftmost of them. Thus, when at least one gene is added, the diagonal $((x_0, y_0), (x_N, y_N))$ is not horizontal and $o_{i-1}$ is the farthest point among $o_{i-1}, \dots, o_i - 1$.

Now let us consider what happens to the enrichment score graph when gene $\pi_k$ is added to the query set $\pi[1..k-1]$ (Fig. 5). Let $r_k$ be the rank of gene $\pi_k$ among the genes of π; then the coordinates of the points $(x_i, y_i)$ for $i < r_k$ do not change, while all $(x_i, y_i)$ for $i \ge r_k$ are shifted by $(-1, |S_{\pi_k}|)$.

Figure 5:

Update of an enrichment score graph when gene πi ≈ 800 is added. Only a fragment is shown. Black graph corresponds to a graph for gene set π[1‥k − 1], gray graph corresponds to π[1‥k]. A part of the graph to the left of Embedded Image does not change and the other part is shifted to the top-left corner. The diagonal ((x0, y0),(xN, yN)) is rotated counterclockwise.

To make fast incremental updates we decompose the problem into multiple smaller ones. For simplicity we assume that K + 1 is an exact square of an integer b. Let us split the K + 1 points into b consecutive blocks of size b: $\{0, 1, \dots, b-1\}$, $\{b, b+1, \dots, 2b-1\}$, and so on.

For each of b blocks we will store and update the farthest up point from the diagonal. When we know for each block its farthest point we can find the globally farthest point by a simple pass in O(b) time.

Next, we show how to update the farthest points in blocks in amortized time O(b). This taken together with one O(b) pass will get us an algorithm to update the globally farthest point in amortized O(b) time.

Below we use c = ⌊rk/b⌋ as an index of a block where gene πk belongs, where rk is the ranking of the genes from π, i.e. Embedded Image.

First, we describe the procedure to update point coordinates. We store the $x_i$ coordinates using two vectors, B of size b and D of size K + 1, such that $x_i = B_{\lfloor i/b \rfloor} + D_i$. When gene $\pi_k$ is added, all $x_i$ for $i \ge r_k$ are decremented by one. To reflect this we decrement all $B_j$ for $j > c$ and all $D_i$ for $r_k \le i < (c+1)b$. The update takes O(b) time. After this update we can get the value $x_i$ in O(1) time. The same procedure is applied to the y coordinates.
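The two-vector storage can be sketched as a small class. The class and method names are hypothetical, indices are 0-based, and the sketch handles a generic additive shift so that the same structure could serve both coordinates.

```python
class BlockedOffsets:
    """Stores values v_i as block[i // b] + local[i].

    Adding a constant to a whole suffix costs O(b + number_of_blocks),
    and reading a single value costs O(1).
    """

    def __init__(self, values, b):
        self.b = b
        self.block = [0] * ((len(values) + b - 1) // b)  # per-block shared offset
        self.local = list(values)                        # per-point remainder

    def add_suffix(self, r, delta):
        """Add delta to every value with index >= r."""
        c = r // self.b                                  # block containing r
        # Points of block c at or after r are updated individually ...
        for i in range(r, min((c + 1) * self.b, len(self.local))):
            self.local[i] += delta
        # ... while every later block is shifted through its shared offset.
        for j in range(c + 1, len(self.block)):
            self.block[j] += delta

    def get(self, i):
        return self.block[i // self.b] + self.local[i]

# Nine points in three blocks of size three; decrement all x_i with i >= 4.
xs = BlockedOffsets(list(range(9)), 3)
xs.add_suffix(4, -1)
after_first = [xs.get(i) for i in range(9)]
xs.add_suffix(0, -1)
after_second = [xs.get(i) for i in range(9)]
```

With b chosen close to the square root of the number of points, both loops in `add_suffix` run O(b) times, which matches the update cost stated above.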

Second, for each block we maintain the upper part of its convex hull. Having the convex hull is useful because the farthest point in a block always lies on its convex hull. In all blocks except c the points are either unchanged or shifted simultaneously by the same vector. That means that the lists of points on the convex hulls of these blocks remain unchanged. For block c we can reconstruct the convex hull from scratch using the Graham scan algorithm [10]. Because the points are already sorted by x coordinate, this reconstruction takes O(b) time. In total, it takes O(b) time to update the convex hulls.
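A linear-time upper hull for points that are already sorted by x can be sketched as follows. This is the stack-based scan that Graham scan reduces to on pre-sorted input, written here as a generic illustration with hypothetical names rather than as the code used in the package; points with equal x may leave a few extra points on the returned chain, which does not affect the search for the farthest point.

```python
def cross(a, b, p):
    """Positive when p lies to the left of (above) the directed line a -> b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def upper_hull(points):
    """Upper convex hull of points sorted by x (then y), in O(len(points))."""
    hull = []
    for p in points:
        # Pop the last hull point while it lies on or below the segment
        # from its predecessor to the new point.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull

def farthest_above(points, start, end):
    """Point with the largest directed distance above the line start -> end."""
    return max(points, key=lambda p: cross(start, end, p))

# Hypothetical staircase resembling one block of an enrichment curve.
block = [(0, 0), (0, 2), (1, 2), (2, 2), (2, 5), (3, 5), (4, 5), (4, 6), (5, 6)]
hull = upper_hull(block)
best_on_hull = farthest_above(hull, block[0], block[-1])
best_overall = farthest_above(block, block[0], block[-1])
```

Searching only the hull gives the same farthest point as searching the whole block, which is why each block needs to keep only its hull points.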

Third, the farthest points in blocks can be updated using the stored convex hulls. Consider a block where the convex hull was not changed (every block except, possibly, block c). Because diagonal always rotates in the same counterclockwise direction, the farthest point in block on iteration k either stays the same or moves on the convex hull to the left of the farthest point on the (k − 1)-th iteration. Thus, for each such block we can compare current farthest point with its left neighbor on the convex hull and update the point if necessary. It is repeated until the next neighbor is closer to the diagonal than the current farthest point. In the block c we just find the farthest point in a single pass by the points on the convex hull.

To show that updating the farthest points takes O(b) amortized time we use the potential method. Let the potential $\Phi_k$ after adding the k-th gene be the sum of the relative indexes of the farthest points over all the blocks. As there are b blocks of size b, the sum of relative indexes lies between 0 and $b^2$. Thus, $\Phi_k = O(b^2)$. To update all b − 1 blocks except c we need to make $t_k = b - 1 + z$ comparisons of two points, where z is the number of times the farthest points were updated. This can take up to $\Theta(b^2)$ time in the worst case. However, the potential change $\Phi_k - \Phi_{k-1}$ is equal to $-z + O(b)$: the sum of indexes decreases by the number of times the farthest points were updated, plus O(b) for block c, where the index can go from 0 to b − 1. This gives the amortized cost of the k-th iteration $a_k = t_k + \Phi_k - \Phi_{k-1} = b - 1 + z - z + O(b) = O(b)$. The total real cost of K iterations is $\sum_{k=1}^{K} t_k = \sum_{k=1}^{K} a_k + \Phi_0 - \Phi_K = O(Kb + b^2) = O(Kb)$, which means that the amortized cost of one iteration is O(b).

Taken together, the algorithm finds all cumulative enrichment scores $s_r(\pi[1..k])$ in $O(K\sqrt{K})$ time. A straightforward implementation that calculates the cumulative values from scratch would take $O(K^2 \log K)$ time. Thus, we have improved the performance by a factor of $O(\sqrt{K} \log K)$.

2.3.3 Implementation details

We also implemented an optimization so that the algorithm does not build convex hull from scratch for a changed block c, but only updates the changed points. This does not influence the asymptotic performance, but decreases the constant factor.

First, we start updating the convex hull from position $r_k$ and not from the start. To be able to do this, we keep an array prev that for each gene g ∈ π stores the previous point on the convex hull if g were the last gene in the block. This is the same as the top of the stack in the Graham algorithm and represents the state of the algorithm for any given point. As all points h to the left of g are unchanged, $prev_h$ also remains unchanged and need not be recalculated.

Second, we stop updating the hull, when we reach the point on the previous iteration convex hull. We can do this because every point to the left of g is rotated counterclockwise of any point to the right of g, which means that the first point on the convex hull right of g on (k − 1)-th iteration remains being a convex hull point at k-th iteration.

2.4 An algorithm for exact calculation of GSEA P-values for integer gene-level statistics

In this section we describe a polynomial algorithm to calculate the GSEA P-value exactly, but only for the case when gene-level statistics are integer numbers: $S_i \in \mathbb{Z}$. For simplicity we consider the problem of calculating the following probability: $$P(s_r^+(q) \ge \gamma),$$ where q is a random gene set of size k. We also assume γ > 0.

Let us denote the sum of the k largest absolute values of gene ranks by T. The algorithm will be polynomial in terms of N, k and T.

2.4.1 The basic algorithm

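Let us consider a gene set $q = \{q_1, q_2, \dots, q_k\}$. Recall the formula for $s_r^+(q)$: $s_r^+(q) = \max_{1 \le i \le N} ES_i$, where $ES_i = ES_{i-1} + \frac{|S_i|}{NS}$ if $i \in q$ and $ES_i = ES_{i-1} - \frac{1}{N-k}$ otherwise, with $ES_0 = 0$ and $NS = \sum_{j \in q}|S_j|$.

First, let us rewrite the formula for $ES_i$ in an equivalent fashion, grouping positive and negative summands: $$ES_i = \frac{1}{NS}\sum_{j \le i,\ j \in q} |S_j| - \frac{1}{N-k}\,\bigl|\{j \le i : j \notin q\}\bigr|.$$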

Then for calculating ESi the following values are sufficient:

  • i: the index of the current gene;

  • $c = |\{j \in q : j \le i\}|$: the number of genes included in the set q among genes 1‥i;

  • $s = \sum_{j \in q,\ j \le i} |S_j|$: the sum of the absolute values of gene-level statistics for genes included in the set among genes 1‥i;

  • $NS = \sum_{j \in q} |S_j|$: the sum of the absolute values of gene-level statistics for all genes in the set.

Knowing the values above, $ES_i$ can be calculated as $\frac{s}{NS} - \frac{i - c}{N - k}$.

Notice that NS can take only integer values from 0 to T (for a set of genes with the largest absolute values of gene-level statistics). Let us split the desired probability to a sum of independent probabilities based on the value of NS: Embedded Image

Our algorithm will be based on dynamic programming. For each possible value of NS we will process the genes one by one in increasing order of index and calculate an array fNS(i, c, s). The value fNS(i, c, s) will contain the probability for a uniformly random gene set q′ of c genes selected from genes 1‥i to simultaneously have the following two properties:

  1. the sum of the absolute values of gene-level statistics of genes from q′ is equal to s;

  2. ESj < γ holds for all j ≤ i, where the values of ES are calculated for the gene set q′ but using the selected values of NS and k, not the ones calculated for the set q′.

Suppose that we have calculated all values of fNS(i, c, s), then Embedded Image and Embedded Image

Finally, the sought probability is equal to: Embedded Image

Let us find a formula for fNS(i, c, s). The base case of dynamic programming is i = 0 for all NS: Embedded Image

Suppose we want to calculate fNS(i, c, s) for some i > 0. First, calculate Embedded Image and compare it to γ. If ESi ≥ γ, then fNS(i, c, s) = 0 by definition.

Otherwise, condition “ESj < γ holds for all j ≤ i” can be simplified to “ESj < γ holds for all j ≤ i − 1”. This observation allows us to use values of f that have already been calculated. Consider two cases:

  1. Gene i does not belong to the set q′. As q′ is a set of c genes chosen uniformly at random from i genes, this case happens with the probability Embedded Image. The conditional probability that such set satisfies the two necessary properties is fNS(i − 1, c, s). Indeed, any set of size c with the sum of absolute values of gene-level statistics values equal to s, chosen among genes 1‥i − 1 and satisfying the conditions on ES, is a valid set chosen among genes 1‥i. Similarly, if a set does not satisfy the condition on ESj for some j ≤ i − 1, this set should not be counted towards fNS(i, c, s) since obviously j ≤ i.

  2. Gene i belongs to the set. This case happens with the probability Embedded Image. The probability that this set satisfies the necessary conditions is fNS(i − 1, c − 1, s − Si). Indeed, any set of size c − 1 with the sum of absolute values of gene-level statistics equal to s − Si, chosen among genes 1‥i − 1 and satisfying the conditions on ES, can be extended with gene i, thus forming a set of size c satisfying both necessary properties. Similarly, if a set does not satisfy the condition on ESj for some j ≤ i − 1, adding gene i will not fix the situation.

Then we can calculate fNS(i, c, s) using the law of total probability: Embedded Image in the case when i > 0 and ESi < γ.

Putting all the cases together, we arrive at the final formula for fNS(i, c, s): Embedded Image

The overall complexity of the algorithm is O(NkT²). The values of f can be evaluated sequentially in increasing order of i. It is enough to evaluate fNS(i, c, s) for 0 ≤ i ≤ N, 0 ≤ c ≤ k, and 0 ≤ s ≤ NS ≤ T, which gives O(NkT) values for each of the O(T) values of NS. Each value of f can be evaluated in constant time.
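The recurrence above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. Since the formulas in this section are not reproduced here, the sketch assumes the standard preranked GSEA running sum with weight 1, that is ESi = s/NS − (i − c)/(N − k), and s+(q) as the maximum of ESi over i. It also assumes nonzero integer statistics and 0 < k < N. The function names are hypothetical, and exact rational arithmetic is used so that the result can be compared with a brute-force enumeration.

```python
from fractions import Fraction
from itertools import combinations


def exact_tail_probability(stats, k, gamma):
    """P(s+(q) >= gamma) for a uniformly random gene set q of size k.

    Assumes nonzero integer statistics, 0 < k < len(stats) and gamma > 0.
    """
    n = len(stats)
    a = [abs(x) for x in stats]
    t = sum(sorted(a, reverse=True)[:k])  # T: the largest possible NS
    gamma = Fraction(gamma)
    below = Fraction(0)  # accumulates P(s+(q) < gamma and NS(q) = ns)
    for ns in range(1, t + 1):
        # f[c][s] stores f_NS(i, c, s) for the current i; base case i = 0.
        f = [[Fraction(0)] * (ns + 1) for _ in range(k + 1)]
        f[0][0] = Fraction(1)
        for i in range(1, n + 1):
            w = a[i - 1]  # |S_i|
            nxt = [[Fraction(0)] * (ns + 1) for _ in range(k + 1)]
            for c in range(min(i, k) + 1):
                for s in range(ns + 1):
                    es = Fraction(s, ns) - Fraction(i - c, n - k)
                    if es >= gamma:
                        continue  # f is zero by definition
                    value = Fraction(i - c, i) * f[c][s]  # gene i not in q'
                    if c > 0 and s >= w:
                        value += Fraction(c, i) * f[c - 1][s - w]  # gene i in q'
                    nxt[c][s] = value
            f = nxt
        below += f[k][ns]
    return 1 - below


def brute_force_tail_probability(stats, k, gamma):
    """Reference value obtained by enumerating all gene sets of size k."""
    n = len(stats)
    a = [abs(x) for x in stats]
    gamma = Fraction(gamma)
    exceeded = total = 0
    for q in combinations(range(n), k):
        members = set(q)
        ns = sum(a[j] for j in q)
        s = c = 0
        hit = False
        for i in range(1, n + 1):
            if i - 1 in members:
                s += a[i - 1]
                c += 1
            if Fraction(s, ns) - Fraction(i - c, n - k) >= gamma:
                hit = True
                break
        exceeded += hit
        total += 1
    return Fraction(exceeded, total)
```

The sketch keeps only two consecutive layers of f in memory, which is enough because the recurrence for index i refers only to index i − 1.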

2.4.2 Optimizations and implementation details

While the algorithm described above is polynomial, a number of further optimizations are required to make execution on real size inputs feasible.

First, let us note that the following property holds: Embedded Image as long as NS2 ≥ NS1. Indeed, ES values calculated using different values of NS decrease when NS is increased. That means all gene sets counted towards Embedded Image should also be counted towards Embedded Image if NS2 ≥ NS1.

Following the observation above, instead of calculating values of fNS(i, c, s) we will consider the values g(i, c, s, b) = fb+1(i, c, s) − fb(i, c, s). These values will contain the probability of a random gene set q of size k selected uniformly from genes 1‥N to satisfy simultaneously the following four properties:

  1. set q contains exactly c genes from the genes 1‥i;

  2. the sum of the absolute values of gene-level statistics of the first c genes from q is equal to s;

  3. ESj < γ holds for all j ≤ i, where the values of ES are calculated for the gene set q using NS = b + 1 (and for all higher values of NS);

  4. ESj ≥ γ holds for at least one j ≤ i, where the values of ES are calculated for the gene set q using NS = b (and for all lower values of NS).

The sought probability can be calculated from values of g as follows: Embedded Image

To calculate the values of g we will use the forward dynamic programming algorithm. In this algorithm we expand a tree of reachable dynamic programming states, starting from g(0, 0, 0, 0) which is equal to 1.

The states will be considered by “levels” in increasing order of i. The values g(i + 1, c, s, b) from the (i + 1)-th level are calculated based on level i. Note that the sum of the values on the i-th level is always equal to 1.

To calculate all values from the (i + 1)-th level, all non-zero values from the i-th level are considered sequentially. Let us consider a state (i, c, s, b) and define p = (k − c)/(N − i), the probability that gene i + 1 will be added to the set. The corresponding set G(i, c, s, b) of gene sets can be divided into two groups.

  1. The gene sets from G(i, c, s, b) that do not include gene i + 1. These gene sets are included into gene sets G(i + 1, c, s, b) on the level i + 1. Thus the corresponding probability g(i, c, s, b) · (1 − p) is added to the value of g(i + 1, c, s, b).

  2. The gene sets from G(i, c, s, b) that do include gene i + 1. These gene sets are included into G(i + 1, c + 1, s′ = s + |Si+1|, b′) where b′ is an updated bound. To calculate b′, note that ESj will be greater than or equal to γ iff Embedded Image, which is equivalent to Embedded Image Embedded Image. Thus Embedded Image. The probability that is added to g(i + 1, c + 1, s′, b′) is equal to g(i, c, s, b) · p.

While the asymptotic number of states remains O(NkT²), forward dynamic programming allows us to consider only “reachable” states with g(i, c, s, b) > 0. In practice the number of reachable states can be several orders of magnitude smaller than the total number of states.

Furthermore, we can treat only states with g(i, c, s, b) > ε as reachable, for some small value of ε. If we skip the remaining states, we are no longer able to calculate the desired probability exactly. However, if we calculate the value δ as the sum of the values of all skipped states, the desired probability is calculated with an absolute error of no more than δ.

The algorithm implementation with a few other optimizations is available at: https://github.com/ctlab/fgsea/blob/master/inst/exact/exact.cpp.
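The forward dynamic programming scheme can be illustrated with the following Python sketch. It is not a port of the C++ implementation linked above. It assumes the standard weight-1 running sum, so that immediately after adding gene i + 1 the score is s′/NS − (i − c)/(N − k), and it takes the updated bound b′ to be the largest integer NS for which that score is still at least γ. The function name, the dictionary-based state storage and the use of exact fractions are choices made only for this illustration. Statistics are assumed to be nonzero integers with 0 < k < N.

```python
from collections import defaultdict
from fractions import Fraction


def forward_exact(stats, k, gamma, eps=Fraction(0)):
    """Return (P(s+(q) >= gamma), delta) using forward DP over states (c, s, b).

    delta is the total probability of states skipped because they were <= eps.
    """
    n = len(stats)
    a = [abs(x) for x in stats]
    gamma = Fraction(gamma)
    level = {(0, 0, 0): Fraction(1)}  # level i = 0: the single state g(0, 0, 0, 0)
    delta = Fraction(0)
    for i in range(n):  # expand level i into level i + 1 using gene i + 1
        nxt = defaultdict(Fraction)
        for (c, s, b), prob in level.items():
            if prob <= eps:
                delta += prob  # treated as unreachable
                continue
            p = Fraction(k - c, n - i)  # probability that gene i + 1 is in q
            if p < 1:
                nxt[(c, s, b)] += prob * (1 - p)
            if p > 0:
                s2 = s + a[i]
                # Largest NS for which ES right after this gene is >= gamma.
                bound = s2 // (gamma + Fraction(i - c, n - k))
                nxt[(c + 1, s2, max(b, bound))] += prob * p
        level = nxt
    # On the last level c = k and s is the true NS of the set;
    # the set reaches gamma exactly when its NS does not exceed the bound b.
    tail = sum(prob for (c, s, b), prob in level.items() if s <= b)
    return tail, delta
```

Only the scores right after an included gene are checked, because the running sum decreases on every gene that is not in the set. With a positive `eps`, the returned `delta` bounds the absolute error of the returned probability, as described above.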

2.5 FGSEA-multilevel: an algorithm for calculation of arbitrarily low P-values using adaptive multilevel split Monte Carlo scheme

In this section we describe the FGSEA-multilevel algorithm, which can accurately estimate the GSEA P-value for a pathway p of size k even when the true P-value is very small.

Let γ = sr(p) > 0 be the enrichment score of the query pathway p for which we want to calculate the following value: Embedded Image where q is a random gene set of size k. This probability can be rewritten as follows: Embedded Image

First, we focus on determining the probability Embedded Image. This probability can be extremely small, so naive sampling gives a poor estimate. We use the adaptive multilevel split Monte Carlo method [6] to solve this problem.

To estimate the probability Embedded Image we split the enrichment scores into levels 0 = l0 < l1 < … < lt = γ. Then we can define the following probabilities: Embedded Image

Now the probability Embedded Image can be rewritten as Embedded Image.

To estimate αi we can draw a sample Embedded Image of size Z from a conditional distribution Embedded Image. Then Embedded Image where Zi is the number of elements in the set Embedded Image.

Below we show how levels li can be chosen and how to sample from the corresponding conditional distributions.

2.5.1 Choosing the enrichment score levels

We propose to choose the value of the level li as the median of the enrichment scores for the Embedded Image sample. For simplicity, Z is required to be an odd number.

Then the procedure for estimating probability Embedded Image consists of repetition of the following steps:

  1. On iteration i ≥ 1 sample Z gene sets Embedded Image of size k from the distribution Embedded Image.

  2. Set the level Embedded Image to be equal to the median of the values Embedded Image.

  3. If Embedded Image then stop the iterations and set li = γ and t = i, otherwise set Embedded Image.

As a result, by construction, αi ≈ 1/2 for 1 ≤ i ≤ t − 1. The value of αt can be approximated as Zt/Z (which is always ≥ 1/2). Together we get the following expression for estimating the desired probability: Embedded Image

2.5.2 The conditional sampling implementation

To generate a uniform sample Embedded Image from the conditional distribution Embedded Image we use the Metropolis algorithm.

First, we generate a sample Embedded Image of size Z from the distribution Embedded Image. Since l0 = 0 and the values of Embedded Image are always non-negative, this can be done by generating a uniformly random subset of size k from the genes {1, 2, …, N}.

Now let us consider a sample Embedded Image at a step i > 1. The sample can be sorted in increasing order of enrichment score values: Embedded Image. Let d = ⌈Z/2⌉. The level li−1 is the median of the values Embedded Image and, thus, is equal to Embedded Image.

Let us first populate Embedded Image in the following way: Embedded Image

This gives us a sample from the conditional distribution Embedded Image; however, it is not uniform.

To make the sample uniform we apply a number of the Metropolis algorithm iterations. On each iteration for each gene set Embedded Image we apply the following steps:

  1. Choose a random gene Embedded Image.

  2. Choose a random gene Embedded Image.

  3. Consider Embedded Image. If Embedded Image then we replace Embedded Image with Embedded Image.

The iterations are repeated until the total number of successful replacements becomes greater than or equal to k · Z. In practice, this number of steps is enough to get a sufficiently uniform sample to obtain a good estimate of the probability, without a significant increase in the running time of the algorithm.
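The level selection and conditional sampling steps can be put together in a short Python sketch. It is an illustration under several assumptions, not the FGSEA implementation: the enrichment score is taken to be the maximum of the standard weight-1 running sum; the new sample is populated by duplicating the sets at or above the median; the outgoing gene is drawn from the set and the incoming gene from outside it; and the caps `max_levels` and `max_sweeps` are safeguards added here so that the sketch always terminates. All names are hypothetical. Statistics are assumed to be nonzero, Z odd, and 0 < k < N.

```python
import random
from statistics import median


def positive_es(stats, members):
    """Maximum of the weight-1 running sum for the gene set `members`."""
    n, k = len(stats), len(members)
    ns = sum(abs(stats[j]) for j in members)
    best = running = 0.0
    for j in range(n):
        if j in members:
            running += abs(stats[j]) / ns
            best = max(best, running)
        else:
            running -= 1.0 / (n - k)
    return best


def multilevel_tail(stats, k, gamma, z=101, seed=0, max_levels=200, max_sweeps=1000):
    """Estimate P(s+(q) >= gamma) by adaptive multilevel splitting."""
    rng = random.Random(seed)
    n = len(stats)
    # Level l_0 = 0: uniformly random gene sets of size k.
    sample = [frozenset(rng.sample(range(n), k)) for _ in range(z)]
    prob = 1.0
    for _level in range(max_levels):
        scores = [positive_es(stats, q) for q in sample]
        level = median(scores)
        if level >= gamma:
            z_t = sum(score >= gamma for score in scores)
            return prob * z_t / z  # last factor alpha_t = Z_t / Z
        prob *= 0.5  # alpha_i is approximately 1/2 by construction
        # Keep the sets at or above the median and duplicate them.
        order = sorted(range(z), key=scores.__getitem__)
        top = [sample[j] for j in order[z // 2:]]
        sample = (top + top)[:z]
        # Metropolis sweeps: swap one gene, accept if the score stays >= level.
        successes = 0
        for _sweep in range(max_sweeps):
            if successes >= k * z:
                break
            for idx, q in enumerate(sample):
                gene_out = rng.choice(sorted(q))
                gene_in = rng.choice([g for g in range(n) if g not in q])
                candidate = (q - {gene_out}) | {gene_in}
                if positive_es(stats, candidate) >= level:
                    sample[idx] = candidate
                    successes += 1
    return 0.0  # safeguard: gamma was not reached within max_levels
```

Because enrichment scores of small inputs are discrete, ties at the median can make the true αi differ noticeably from 1/2, so the sketch is meaningful mainly for realistic values of N.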

2.5.3 Estimating the P-value

In order to estimate the desired P-value we also need to calculate the probabilities P (sr (q) ≥ 0) and Embedded Image.

To calculate the probability P (sr (q) ≥ 0) we generate gene sets q1, q2, …, qZ′, where each sample qi is selected uniformly at random from all the subsets of size k from the set {1, 2, …, N}. The samples are generated until the number of samples qi with sr(qi) ≥ 0 becomes equal to Z. Then the probability P (sr (q) ≥ 0) is estimated as follows: Embedded Image

To determine the remaining probability Embedded Image we count the gene sets in Embedded Image whose enrichment score sr is greater than zero. After that, the probability can be estimated as follows: Embedded Image

2.5.4 Estimating log-probability

To properly estimate the logarithm of the desired probability, let us note that the j-th order statistic of a standard uniform sample of size Z is a random variable from the beta distribution Beta (j, Z + 1 − j). Therefore, we can use the properties of the beta distribution to make a correct transition to the logarithm of the probability. So for the median value of a sample of odd size Z we have: Embedded Image where ψ is the digamma function. In the same way, we can calculate the expectation of the logarithm of αt: Embedded Image

Then the logarithm of probability Embedded Image is estimated as Embedded Image

Similarly, we can estimate the variance of the estimates Embedded Image Embedded Image, where ψ1 is the trigamma function. From this we can approximate the standard error of our estimator as: Embedded Image

The same approach with digamma functions is used to calculate the logarithm of the probabilities Embedded Image and P (sr (q) ≥ 0).
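The beta-distribution moments used in this section can be checked numerically with a small Python sketch. It relies on the known identities E[ln X] = ψ(a) − ψ(a + b) and Var[ln X] = ψ1(a) − ψ1(a + b) for X ~ Beta(a, b). For integer arguments these differences reduce to finite sums, so no special-function library is needed. The function name is hypothetical, and the sketch covers only a single level, not the combined estimator from the formulas above.

```python
import math


def log_order_statistic_moments(j, z):
    """Mean and variance of ln(X) for X ~ Beta(j, z + 1 - j), integers 1 <= j <= z.

    Uses psi(z + 1) - psi(j) = sum_{m=j}^{z} 1/m and
    psi1(j) - psi1(z + 1) = sum_{m=j}^{z} 1/m^2.
    """
    mean = -sum(1.0 / m for m in range(j, z + 1))
    variance = sum(1.0 / (m * m) for m in range(j, z + 1))
    return mean, variance


# Median of a sample of odd size Z = 101 is the order statistic j = (Z + 1) / 2.
z = 101
mean, variance = log_order_statistic_moments((z + 1) // 2, z)
print("E[ln(median)] =", mean, "compared with ln(1/2) =", math.log(0.5))
print("standard deviation of ln(median) =", math.sqrt(variance))
```

The expected logarithm is slightly below ln(1/2), which is the bias that would be introduced by simply taking the logarithm of 1/2 at every level.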

2.5.5 Comparison with the exact method

To compare FGSEA-multilevel and the exact method on the same dataset we used rounded values of the gene-level statistics from the example data (section 2.2) as input data for both algorithms. Both algorithms calculated the probability Embedded Image.

The results of the algorithms for the pathways from the example data are shown in Fig 2c. The exact algorithm was run with ε = 10⁻⁴⁰, and all the probabilities were obtained with an accuracy of at least six significant digits. For FGSEA-multilevel, Z = 101 was used.

We also calculated empirical estimation errors and compared them to the theoretical ones (Fig 2d). For this we generated 100 independent estimates for a range of ES values (corresponding to P-values from 10⁻⁴ to 10⁻¹⁰⁰), gene set sizes (from 15 to 250) and sample sizes (from 101 to 1001). The raw values are available in the Supplementary Table.

2.6 Filtering redundant pathways

In this section we describe an algorithm to filter redundant pathways from the results of FGSEA.

Let us consider two pathways p1 and p2 that both have a significant GSEA P-value. There are two situations in which we will consider p2 to be non-redundant given p1:

  1. If pathway p2 is enriched even if we do not consider the genes from p1 at all. Formally, we calculate GSEA P-value for gene set p2 \ p1 and gene-level statistics vector S[U \ p1] for all the genes except p1. If the P-value is less than a pre-defined threshold, then pathway p2 is considered as non-redundant given p1.

  2. If pathway p2 is enriched even if we consider only genes from p1. Formally, we calculate GSEA P-value for gene set p2 ∩ p1 and gene-level statistics vector S[p1] for the genes from p1. Again, if the P-value is less than a pre-defined threshold, then pathway p2 is considered as non-redundant given p1.

Otherwise pathway p2 is considered to be redundant.

The filtering procedure starts with a set of significantly enriched pathways Psig selected by the user: for example the pathways with GSEA P-values less than 0.01 after Benjamini-Hochberg correction, sorted by P-value. The output of the procedure is a list Pmain ⊂ Psig of pathways that are pairwise non-redundant. At the same time, all the other pathways Pred = Psig \ Pmain are redundant given some pathway from Psig.

The procedure itself is similar to the Sieve of Eratosthenes algorithm. The pathways are considered one by one and some of them are marked as redundant. For a pathway p we first check whether it is already marked as redundant; if so, we go to the next pathway. Otherwise, we first run the FGSEA-simple algorithm on the vector of statistics S[U \ p] and all the pathways currently not marked as redundant (including the ones that have already been considered, but excluding pathway p). Then, similarly, we run the FGSEA-simple algorithm on the vector of statistics S[p]. Pathways that fail to reach the non-redundancy P-value threshold in both tests are marked as redundant.
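The sieve-like procedure can be sketched in Python as follows. The enrichment test is passed in as a callable `pval_fn(gene_set, stats)`, which is a hypothetical stand-in for a run of FGSEA-simple. The toy test used in the example is deliberately crude and is not a GSEA P-value. The function names and the default threshold are assumptions made for this illustration.

```python
def filter_redundant(pathways, stats, pval_fn, threshold=0.05):
    """Return names of pathways that are not marked as redundant.

    pathways: dict name -> set of genes, in increasing order of P-value.
    stats: dict gene -> gene-level statistic.
    pval_fn: callable(gene_set, stats_subset) -> P-value (hypothetical).
    """
    universe = set(stats)
    redundant = set()
    for name, genes in pathways.items():
        if name in redundant:
            continue
        outside = {g: stats[g] for g in universe - genes}  # S[U \ p]
        inside = {g: stats[g] for g in universe & genes}   # S[p]
        for other, other_genes in pathways.items():
            if other == name or other in redundant:
                continue
            p_out = pval_fn(other_genes - genes, outside)
            p_in = pval_fn(other_genes & genes, inside)
            if p_out >= threshold and p_in >= threshold:
                redundant.add(other)  # fails both tests
    return [name for name in pathways if name not in redundant]


def toy_pval(gene_set, stats):
    """Crude placeholder test: 'significant' if the set mean exceeds the overall mean by 1."""
    if not gene_set or not stats:
        return 1.0
    set_mean = sum(stats[g] for g in gene_set) / len(gene_set)
    all_mean = sum(stats.values()) / len(stats)
    return 0.001 if set_mean - all_mean >= 1.0 else 0.5


stats = {"a": 3, "b": 3, "c": 3, "d": 0, "e": 0, "f": 0, "g": 3, "h": 3}
pathways = {"P1": {"a", "b", "c"}, "P2": {"a", "b"}, "P3": {"g", "h"}}
main = filter_redundant(pathways, stats, toy_pval)
```

In this toy example P2 is a subset of P1 and shows no signal within P1, so it is marked as redundant, while P3 is disjoint from P1 and is kept.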

Footnotes

  • FGSEA-multilevel procedure has been added to estimate arbitrarily low P-values.

References

  1. Aravind Subramanian, Pablo Tamayo, Vamsi K Mootha, Sayan Mukherjee, Benjamin L Ebert, Michael A Gillette, Amanda Paulovich, Scott L Pomeroy, Todd R Golub, Eric S Lander, and Jill P Mesirov. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43):15545–50, 2005.
  2. G. Yu, L. G. Wang, G. R. Yan, and Q. Y. He. DOSE: an R/Bioconductor package for disease ontology semantic and enrichment analysis. Bioinformatics, 31(4):608–609, Feb 2015.
  3. L. Varemo, J. Nielsen, and I. Nookaew. Enriching the gene set analysis of genome-wide data by incorporating directionality of gene expression and combining statistical hypotheses and methods. Nucleic Acids Res., 41(8):4378–4391, Apr 2013.
  4. Gang Wei, Lai Wei, Jinfang Zhu, Chongzhi Zang, Jane Hu-Li, Zhengju Yao, Kairong Cui, Yuka Kanno, Tae-Young Roh, Wendy T Watford, Dustin E Schones, Weiqun Peng, Hong-Wei Sun, William E Paul, John J O’Shea, and Keji Zhao. Global mapping of H3K4me3 and H3K27me3 reveals specificity and plasticity in lineage fate determination of differentiating CD4+ T cells. Immunity, 30(1):155–67, 2009.
  5. G Joshi-Tope, M Gillespie, I Vastrik, P D’Eustachio, E Schmidt, B de Bono, B Jassal, G R Gopinath, G R Wu, L Matthews, S Lewis, E Birney, and L Stein. Reactome: a knowledgebase of biological pathways. Nucleic acids research, 33(Database issue):D428–32, 2005.
  6. Zdravko I Botev and Dirk P Kroese. An efficient algorithm for rare-event probability estimation, combinatorial optimization, and counting. Methodology and Computing in Applied Probability, 10(4):471–505, 2008.
  7. S. G. Jantzen, B. J. Sutherland, D. R. Minkley, and B. F. Koop. GO Trimming: Systematically reducing redundancy in large Gene Ontology datasets. BMC Res Notes, 4:267, Jul 2011.
  8. M. E. Ritchie, B. Phipson, D. Wu, Y. Hu, C. W. Law, W. Shi, and G. K. Smyth. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, pages gkv007–, 2015.
  9. Belinda Phipson and Gordon K Smyth. Permutation p-values should never be zero: calculating exact p-values when permutations are randomly drawn. Statistical applications in genetics and molecular biology, 9(1), 2010.
  10. Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition, volume 7. 2001.
Posted October 22, 2019.