## Abstract

Preranked gene set enrichment analysis (GSEA) is a widely used method for interpreting gene expression data in terms of biological processes. Here we present the FGSEA method, which estimates arbitrarily low GSEA P-values more accurately and much faster than other implementations. We also present a polynomial algorithm that calculates GSEA P-values exactly, which we use to confirm the accuracy of the method in practice.

## 1 Main

Preranked gene set enrichment analysis [1] is a widely used method for analyzing gene expression data. It selects, from an *a priori* defined list of gene sets, those that behave non-randomly in the experiment under consideration. The method uses an enrichment score (ES) statistic, which is calculated from a vector of signed gene-level statistics, such as the *t*-statistic from a differential expression test. Unlike the similar approach of calculating Fisher P-values from an overlap statistic, it does not require arbitrary thresholding. This also allows the method to identify pathways that contain many co-regulated genes even when the individual effects are small.

A major drawback of the method is that its implementations are slow. As the analytical form of the null distribution of the ES statistic is not known, an empirical null distribution has to be calculated. This can be done in a straightforward manner by sampling random gene sets, as in the reference implementation [1] and reimplementations [2, 3]. In this case, an ES value is calculated for each of the input pathways. Next, a number of random gene sets of the same size are generated, and an ES value is calculated for each of them. The P-value is then estimated as the number of random gene sets with the same or a more extreme ES value divided by the total number of generated gene sets (a formal definition is given in section 2.1). However, a large number of gene set samples is required for the test to have good statistical power, in particular because of the correction for multiple hypothesis testing.

Here we present a fast gene set enrichment analysis (FGSEA) method for efficient estimation of GSEA P-values for a collection of pathways. The method consists of two main procedures: *FGSEA-simple* and *FGSEA-multilevel*. The FGSEA-simple procedure efficiently estimates P-values with limited accuracy, but simultaneously for the whole *collection* of gene sets, while the FGSEA-multilevel procedure accurately estimates arbitrarily low P-values, but for *individual* gene sets.

The FGSEA-simple procedure is based on the idea that generated random gene set samples can be shared between different input pathways. Indeed, consider *M* gene sets of sizes *K*_{1} ≤ *K*_{2} ≤ … ≤ *K*_{M} = *K* and a collection of *n* independent samples *g*_{i} of size *K* (Fig 1a). As in the naive approach, because the *g*_{i} are independent samples of size *K*, the P-value for pathway *M* can be estimated as the proportion of samples *g*_{i} that have the same or a more extreme ES value than pathway *M*. However, for any other pathway *j* we can construct a set of *n* independent samples of size *K*_{j} by considering the prefixes *g*_{i,1‥K_{j}}. Again, given a set of independent samples, the P-value can be estimated as the proportion of the samples that have the same or a more extreme ES value.

The next important idea is that, given a gene set sample *g*_{i} of size *K*, the ES values for *all* the prefixes *g*_{i,1‥j} can be calculated efficiently using a square root heuristic (Fig 1b). Briefly, a variant of an enrichment curve is considered: the genes are enumerated from the most up-regulated to the most down-regulated, and the curve goes to the right if the gene is not present in the pathway and upward if it is. It can be shown that the enrichment score is easily calculated once the curve point most distant from the diagonal is known. Let us split the *K* genes from the gene set into consecutive blocks of size √*K* and consider what happens to the curve when we change the prefix from *g*_{i,1‥j−1} to *g*_{i,1‥j} by adding gene *g*_{i,j}. The parts of the curve in the blocks to the left of *g*_{i,j} do not change at all, while the blocks to the right of *g*_{i,j} are uniformly shifted. This observation allows us to consider the prefixes in increasing order and update the position of the most distant point in *O*(√*K*) time. Briefly, for each block that is either unchanged or shifted the update takes *O*(1) time, while for the changed block the update is proportional to its size and takes *O*(√*K*) time. Finally, aggregating the blocks takes an additional *O*(√*K*) time. Overall this gives a time complexity of *O*(*K*√*K*) for calculating the ES values of all the prefixes. In total, the time complexity of calculating P-values for the set of *M* pathways is *O*(*n*(*K*√*K* + *M*)), which is substantially faster than a naive approach that processes every pathway separately. The full description of the algorithm is given in section 2.3.

As an example, we ran FGSEA-simple and the reference implementation on the same example dataset of genes differentially regulated upon Th1 activation [4] against a set of 700 Reactome [5] pathways (see section 2.2) and compared the resulting nominal P-values (Fig 1c). Both methods were run with *n* = 10000, and the results are indistinguishable up to the random noise inherent to both methods. However, on this example the reference implementation (version 4.0.1) took about 420 seconds, while FGSEA-simple finished in about 4 seconds. This speed-up of two orders of magnitude is consistent with the theoretical one that follows from the time complexity of the algorithm. Given a highly parallel implementation of FGSEA-simple, its performance makes it possible to routinely achieve nominal P-values on the order of 10^{−5} and to use standard procedures for multiple hypothesis testing correction, such as the Benjamini-Hochberg procedure, for thousands of gene sets.

However, accurately estimating P-values lower than 10^{−6} with FGSEA-simple can be impractical or even infeasible. To estimate such low P-values we developed the *FGSEA-multilevel* method, which is based on an adaptive multilevel split Monte Carlo scheme [6]. The method takes as input an ES value *γ* > 0 and a gene set size *K*, and calculates the probability *P*_{K}(ES ≥ *γ*) that a random gene set of size *K* has an enrichment score no less than *γ*. The method sequentially finds ES levels *l*_{i} for which the probability *P*_{K}(ES ≥ *l*_{i}) is approximately equal to 2^{−i} (see Fig 2a for a toy example). The method stops when *l*_{i} becomes greater than *γ*, and the P-value can be crudely approximated as 2^{−i}.

The intermediate *l*_{i} thresholds are calculated as follows. First, a set of *Z* (an odd number, a parameter of the method) random gene sets of size *K* is generated uniformly, and the ES values for them are calculated. The median of the ES values is calculated and assigned to *l*_{1}. By construction, the probability *P*_{K}(ES ≥ *l*_{1}) that a random gene set has an ES value no less than *l*_{1} can be approximated as 1/2. Next, the generated gene sets with ES values less than *l*_{1} are discarded, while the gene sets with ES values greater than *l*_{1} are duplicated. This results in a sample of *Z* gene sets with ES values no less than *l*_{1}, but the distribution is non-uniform. However, it can be made into a uniform sample with a Metropolis algorithm. On each step of the Metropolis algorithm, a modification is attempted for each gene set in the sample by swapping a random gene from the set with a gene outside of the set. The change is accepted if the enrichment score of the new set is no less than the current threshold *l*_{1}; otherwise the change is rejected. The Metropolis algorithm guarantees that after enough steps the sample becomes close to uniformly distributed. Thus, the median of the enrichment scores (*l*_{2}) corresponds to a probability of 1/2 for a gene set to have an enrichment score no less than *l*_{2} given that it has an enrichment score no less than *l*_{1}:

$$P_K(\mathrm{ES} \ge l_2 \mid \mathrm{ES} \ge l_1) \approx \frac{1}{2}.$$

Which means

$$P_K(\mathrm{ES} \ge l_2) \approx \frac{1}{4}.$$

The same procedure is applied to calculate the next *l*_{i} values.

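The iterations stop when *l*_{i} becomes greater than *γ*. On this iteration, the probability that a random gene set has an ES value no less than *γ* can be approximated as:

$$P_K(\mathrm{ES} \ge \gamma) \approx 2^{-(i-1)} \cdot \frac{\#\{\text{gene sets in the current sample with } \mathrm{ES} \ge \gamma\}}{Z}.$$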

When estimating small P-values it becomes practical to carry out the estimation in log-scale. In particular, the values become practically unbiased both in median and mean sense and it becomes simple to estimate the error (see section 2.5.4).

The full formal description of the algorithm is available in the section 2.5.
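The scheme described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the FGSEA implementation: the names `es_plus`, `multilevel_pvalue`, `sweeps` and `max_levels` are hypothetical, the estimate is kept in linear rather than log scale, the number of Metropolis steps is fixed, and the loop stops as soon as the median reaches *γ* (a slight relaxation of "greater than" that avoids stalling on tied scores).

```python
import random


def es_plus(weights, members):
    """Positive-mode enrichment score max_i ES_i; weights are |S_i| in ranking order."""
    n, k = len(weights), len(members)
    ns = sum(weights[i] for i in members)
    es = best = 0.0
    for i in range(n):
        es += weights[i] / ns if i in members else -1.0 / (n - k)
        best = max(best, es)
    return best


def multilevel_pvalue(weights, k, gamma, z=101, sweeps=5, max_levels=200, seed=0):
    """Crude estimate of P_K(ES >= gamma) by multilevel splitting (z must be odd)."""
    rng = random.Random(seed)
    n = len(weights)
    sets = [set(rng.sample(range(n), k)) for _ in range(z)]
    scores = [es_plus(weights, s) for s in sets]
    for level_index in range(max_levels):
        order = sorted(range(z), key=scores.__getitem__)
        level = scores[order[z // 2]]          # median ES is the next threshold
        if level >= gamma:
            hits = sum(score >= gamma for score in scores)
            return 2.0 ** (-level_index) * hits / z
        # keep the median set once and every set above it twice: z sets again
        keep = [order[z // 2]] + 2 * order[z // 2 + 1:]
        sets = [set(sets[j]) for j in keep]
        scores = [scores[j] for j in keep]
        # Metropolis steps: swap one gene in and one out, accept if ES >= level
        for _ in range(sweeps * k):
            for j in range(z):
                gene_out = rng.choice(sorted(sets[j]))
                gene_in = rng.randrange(n)
                if gene_in in sets[j]:
                    continue                   # invalid proposal, keep the set
                candidate = (sets[j] - {gene_out}) | {gene_in}
                score = es_plus(weights, candidate)
                if score >= level:
                    sets[j], scores[j] = candidate, score
    raise RuntimeError("gamma was not reached within max_levels")


weights = [abs(30.5 - i) / 10 for i in range(60)]   # toy |S_i| values
gamma = es_plus(weights, {0, 3, 7, 12, 20})
print(multilevel_pvalue(weights, k=5, gamma=gamma))
```

Because the proposal (remove one member, add one non-member) is symmetric, accepting every move that keeps the score above the threshold leaves the uniform distribution over the admissible gene sets invariant, which is what the halving argument relies on.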

For the example dataset we show that the P-values are as low as 10^{−26} for some of the pathways and that the results are consistent with FGSEA-simple P-values obtained with 10^{8} permutations (Fig 2b). Note that the FGSEA-multilevel calculation with a sample size of *Z* = 101 took only 10 seconds on a single thread, while 10^{8} permutations with FGSEA-simple took 40 minutes on 32 threads.

To further verify the approximation quality of the FGSEA-multilevel algorithm, we developed an exact method for calculating GSEA P-values, which is limited to integer weights. The method is based on dynamic programming; the full description is given in section 2.4. The complexity of the algorithm is *O*(*NKT*^{2}), where *N* is the number of genes, *K* is the size of the gene set and *T* is the sum of the top *K* absolute values of gene-level statistics. With a number of optimizations, this method calculates P-values for rounded weights in the example dataset in a couple of hours.

When run on the same integer weights, FGSEA-multilevel and the exact method give highly concordant results (Fig 2c). Additionally, using the exact P-values as a control, the real errors can be compared with the estimated ones. We show that the FGSEA-multilevel error estimates are highly concordant with the real errors (Fig 2d) for a wide range of P-values (from 10^{−4} to 10^{−100}), gene set sizes (from 15 to 250) and sample sizes (from 101 to 1001).

In practice, the FGSEA-multilevel method is combined with FGSEA-simple. First, FGSEA-simple is run on all the input pathways with a limited sample size. Next, FGSEA-multilevel is executed for the pathways that have a high relative error after FGSEA-simple (i.e. pathways with low P-values). As many of the pathways in an input collection are usually not enriched, they have relatively high P-values and are batch-processed by the highly efficient FGSEA-simple algorithm with deterministic time bounds. The more interesting pathways with lower P-values are then processed individually by the FGSEA-multilevel algorithm, and the processing time depends on their P-values.

Finally, as FGSEA makes it practical to estimate P-values for large collections of gene sets, it can produce a large number of statistically significant hits with high overlaps. To deal with this issue and make the representation of FGSEA results more concise, we developed a procedure to filter out redundant gene sets. The procedure is similar to the GO Trimming method [7] but is based on Bayesian network construction approaches. It considers the significant pathways one by one and tries to remove gene sets that do not provide new information given some other pathway already present in the output. Here, we consider a pathway *P*_{1} to provide new information given a pathway *P*_{2} if the P-value of pathway *P*_{1} in the universe of genes from *P*_{2} or of genes outside *P*_{2} is less than some threshold. This procedure filters redundant pathways without requiring any explicit hierarchy of pathways. The full description of the procedure is given in section 2.6. The table obtained by running FGSEA on the example dataset with filtering of redundant hits is shown in Fig 3.

To conclude, here we present FGSEA, a method for fast preranked gene set enrichment analysis. The method makes it possible to routinely estimate even very low P-values and can be used in conjunction with standard multiple hypothesis testing correction methods, such as the Benjamini-Hochberg procedure. This, in turn, makes it possible to analyze even large collections of pathways, which require a very low nominal P-value for a pathway to remain significant after multiple hypothesis testing correction. The FGSEA method is freely available as an R package at Bioconductor (http://bioconductor.org/packages/fgsea) and on GitHub (https://github.com/ctlab/fgsea).

## 2 Methods

### 2.1 Formal definitions

The preranked gene set enrichment analysis takes as input two objects: an array of gene-level statistic values *S* for the genes *U* = {1, 2, …, *N*} and a list of query gene sets (pathways) *P*. The goal of the analysis is to determine which of the gene sets from *P* has a non-random behavior.

The statistic array *S* of the size |*S*| = *N* for each gene *i* ∈ *U* contains a value *S*_{i} ∈ ℝ that characterizes the gene behavior in a considered biological process. Commonly, if *S*_{i} > 0 the expression of gene *i* goes up on treatment compared to control and *S*_{i} < 0 means that the expression goes down. Absolute values |*S*_{i}| represent magnitude of the change. Array *S* is sorted in a decreasing order: *S*_{i} > *S*_{j} for *i* < *j*. The value of *N* in practice is about 10000–20000.

The list of gene sets *P* = {*P*_{1}, *P*_{2}, …, *P*_{M}} of length *M* usually contains groups of genes that are commonly regulated in some biological process. We assume that the gene sets *P*_{i} are ordered by their size (denoted as *K*_{i}): *K*_{1} ≤ *K*_{2} ≤ … ≤ *K*_{M} = *K*. Usually only relatively small gene sets are considered with *K* ≈ 500 genes.

To quantify the co-regulation of genes in a gene set *p*, Subramanian *et al.* [1] introduced a gene set enrichment score function *s*_{r}(*p*) that uses gene rankings (values of *S*). The more positive the value of *s*_{r}(*p*), the more enriched the gene set is in positively regulated genes (with *S*_{i} > 0). Accordingly, a negative *s*_{r}(*p*) corresponds to enrichment in negatively regulated genes.

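The value of *s*_{r}(*p*) can be calculated as follows. Let *k* = |*p*|, NS = Σ_{i∈p} |*S*_{i}|. Let also ES be an array specified by the following formula:

$$\mathrm{ES}_i = \begin{cases} 0 & \text{if } i = 0, \\ \mathrm{ES}_{i-1} + \dfrac{|S_i|}{\mathrm{NS}} & \text{if } 1 \le i \le N \text{ and } i \in p, \\ \mathrm{ES}_{i-1} - \dfrac{1}{N - k} & \text{if } 1 \le i \le N \text{ and } i \notin p. \end{cases}$$

The value of *s*_{r}(*p*) corresponds to the entry of ES that is largest in absolute value:

$$s_r(p) = \mathrm{ES}_{i^*}, \quad i^* = \operatorname*{arg\,max}_{0 \le i \le N} |\mathrm{ES}_i|.$$

For convenience, we also introduce the following notation:

$$s_r^+(p) = \max_{0 \le i \le N} \mathrm{ES}_i, \qquad s_r^-(p) = \min_{0 \le i \le N} \mathrm{ES}_i.$$

From these two values it is easy to find the value of *s*_{r}(*p*), which is equal to *s*_{r}^{+}(*p*) if |*s*_{r}^{+}(*p*)| > |*s*_{r}^{−}(*p*)| or to *s*_{r}^{−}(*p*) otherwise.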
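The definition can be illustrated with a minimal Python sketch. The function name `enrichment_scores` and the use of 0-based positions in the ranking are illustrative choices, not part of the FGSEA package.

```python
def enrichment_scores(stats, gene_set):
    """Return (s_plus, s_minus, s_r) for `gene_set`, given as 0-based positions.

    `stats` plays the role of the array S and must be sorted in decreasing order.
    """
    members = set(gene_set)
    n, k = len(stats), len(members)
    ns = sum(abs(stats[i]) for i in members)  # NS: total weight of the set
    es = s_plus = s_minus = 0.0               # ES_0 = 0
    for i in range(n):
        if i in members:
            es += abs(stats[i]) / ns          # step up on a gene from the set
        else:
            es -= 1.0 / (n - k)               # step down on any other gene
        s_plus = max(s_plus, es)
        s_minus = min(s_minus, es)
    s_r = s_plus if abs(s_plus) > abs(s_minus) else s_minus
    return s_plus, s_minus, s_r


stats = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
print(enrichment_scores(stats, [0, 1]))  # gene set at the top of the ranking
print(enrichment_scores(stats, [4, 5]))  # gene set at the bottom of the ranking
```

A set concentrated at the top of the ranking yields a positive score close to 1, and a set concentrated at the bottom yields a negative score close to −1.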

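Often we will consider only the positive values of the gene set enrichment score function since:

$$s_r^-(p) = -s_r'^{+}(p'),$$

where *p*′ = {*N* − *i* + 1 : *i* ∈ *p*} and *s*′_{r}^{+} corresponds to the gene set enrichment score function for the array *S*′ such that *S*′_{i} = −*S*_{N−i+1}.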

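Next, following Subramanian *et al.*, for a pathway *p* we define the GSEA P-value as:

$$\mathrm{Pval}(p) = \begin{cases} P\left(s_r(q) \ge s_r(p) \mid s_r(q) \ge 0\right) & \text{if } s_r(p) > 0, \\ P\left(s_r(q) \le s_r(p) \mid s_r(q) \le 0\right) & \text{if } s_r(p) < 0, \end{cases}$$

where *q* is a random gene set of size *k*.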

### 2.2 The example data

As the example ranking we used Th0 vs Th1 comparison from dataset GSE14308 [4]. The differential expression was calculated using limma [8]. Only top 12000 genes by mean expression were used. Limma t-statistic was used as gene-level statistic. The script to generate rankings is available on GitHub: https://github.com/ctlab/fgsea/blob/master/inst/gen_gene_ranks.R.

Reactome [5] database was used as an example collection via reactome.db R package. For the analysis only the pathways of the size from 15 to 500 were used. The script to generate pathway collection is available on GitHub: https://github.com/ctlab/fgsea/blob/master/inst/gene_reactome_pathways.R

### 2.3 FGSEA-simple: an algorithm for fast calculation of GSEA P-values simultaneously for many pathways

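In this section we describe an algorithm for fast estimation of GSEA P-values simultaneously for a collection of pathways *P*. For each pathway *p*, a set of *n* uniformly random gene sets *q*_{i} of the same size is considered. The P-value is then estimated as:

$$\mathrm{Pval}(p) \approx \frac{\sum_{i=1}^{n} [s_r(q_i) \ge s_r(p)] + 1}{\sum_{i=1}^{n} [s_r(q_i) \ge 0] + 1}$$

for a positively enriched pathway *p* and as:

$$\mathrm{Pval}(p) \approx \frac{\sum_{i=1}^{n} [s_r(q_i) \le s_r(p)] + 1}{\sum_{i=1}^{n} [s_r(q_i) \le 0] + 1}$$

for a negatively enriched pathway, where [·] is equal to 1 if the condition holds and to 0 otherwise. These two formulas follow the implementation of Subramanian *et al.*, except for the +1 terms, which are recommended by Phipson and Smyth [9]. Otherwise, the nominal P-values from FGSEA-simple and the reference implementation are indistinguishable; however, FGSEA-simple works orders of magnitude faster.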
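For reference, the naive per-pathway estimator can be sketched as follows. The sketch assumes that an observed score is compared only with null scores of the same sign, in the spirit of the implementation of Subramanian *et al.*, and that the +1 terms are added to both the numerator and the denominator; the function names `signed_es` and `naive_pvalue` are hypothetical.

```python
import random


def signed_es(stats, gene_set):
    """Enrichment score s_r: the ES entry with the largest absolute value."""
    members = set(gene_set)
    n, k = len(stats), len(members)
    ns = sum(abs(stats[i]) for i in members)
    es = best = 0.0
    for i in range(n):
        es += abs(stats[i]) / ns if i in members else -1.0 / (n - k)
        if abs(es) > abs(best):
            best = es
    return best


def naive_pvalue(stats, gene_set, n_perm=1000, seed=0):
    """Estimate the GSEA P-value of one pathway from n_perm random gene sets."""
    rng = random.Random(seed)
    observed = signed_es(stats, gene_set)
    k = len(set(gene_set))
    n_extreme = n_same_sign = 0
    for _ in range(n_perm):
        null = signed_es(stats, rng.sample(range(len(stats)), k))
        if observed > 0:
            n_same_sign += null >= 0
            n_extreme += null >= observed
        else:
            n_same_sign += null <= 0
            n_extreme += null <= observed
    # +1 terms as recommended by Phipson and Smyth
    return (n_extreme + 1) / (n_same_sign + 1)


# toy ranking: 100 non-zero statistics in decreasing order
stats = [x / 10 for x in range(50, -51, -1) if x != 0]
print(naive_pvalue(stats, [0, 1, 2, 3, 4], n_perm=200))
```

Each pathway costs `n_perm` full passes over the ranking here, which is the cost that FGSEA-simple avoids by sharing samples between pathways.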

##### 2.3.1 Cumulative statistic calculation for the mean statistic

Let us first describe the idea of the proposed algorithm using a simple mean statistic *s*_{m}:

$$s_m(p) = \frac{1}{|p|} \sum_{i \in p} S_i.$$

The main idea of the algorithm is to reuse sampling for different query gene sets. This can be done due to the fact that for an estimation of null distributions samples have to be independent only for a specific gene set size, while they can be dependent between different sizes.

Instead of generating *nM* independent random gene sets, *n* for each of the *M* input gene sets, we generate only *n* random gene sets of size *K*. Let *π*_{i} be the *i*-th random gene set of size *K*. From that gene set we can generate gene sets for all the query pathways *P*_{j} by using its prefix: *π*_{i,j} = *π*_{i}[1‥*K*_{j}].

The next step is to calculate the enrichment scores for all gene sets *π*_{i,j}. Instead of calculating enrichment scores separately for each gene set we will calculate simultaneously scores for all *π*_{i,j} for a fixed *i*. Using a simple procedure it can be done in Θ(*K*) time.

Let us find the enrichment scores for all prefixes of *π*_{i}. This can be done by element-wise division of the array of cumulative sums by the length of the corresponding prefix:

$$s_m(\pi_i[1..k]) = \frac{1}{k} \sum_{t=1}^{k} S_{\pi_i[t]}, \quad 1 \le k \le K.$$

Selecting only the required prefixes takes an additional Θ(*M*) time.

The described procedure finds P-values for all query gene sets in Θ(*n*(*K* + *M*)) time. This is about min(*K*, *M*) times faster than the straightforward procedure.
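The sample-sharing idea for the mean statistic can be sketched in Python as follows. The function names are hypothetical, and for brevity the sketch computes a one-sided (upper-tail) P-value with +1 terms rather than the sign-conditional GSEA P-value.

```python
import random
from itertools import accumulate


def prefix_mean_stats(stats, sample, sizes):
    """Mean statistic of sample[:k] for every k in sizes, in O(K + M) time."""
    cumulative = list(accumulate(stats[g] for g in sample))
    return [cumulative[k - 1] / k for k in sizes]


def mean_stat_pvalues(stats, pathways, n_perm=1000, seed=0):
    """Upper-tail P-values for all pathways from n_perm shared samples."""
    rng = random.Random(seed)
    sizes = [len(p) for p in pathways]
    observed = [sum(stats[g] for g in p) / len(p) for p in pathways]
    max_size = max(sizes)                      # K: the largest pathway size
    counts = [0] * len(pathways)
    for _ in range(n_perm):
        # one random gene set of size K; each prefix is a valid smaller sample
        sample = rng.sample(range(len(stats)), max_size)
        null = prefix_mean_stats(stats, sample, sizes)
        for j, (obs, value) in enumerate(zip(observed, null)):
            counts[j] += value >= obs
    return [(count + 1) / (n_perm + 1) for count in counts]


stats = list(range(20, 0, -1))
print(mean_stat_pvalues(stats, [[0, 1], [0, 1, 2, 3]], n_perm=200))
```

The samples used for different pathway sizes are dependent on each other, but for any fixed size the `n_perm` prefixes are independent, which is all the estimator needs.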

##### 2.3.2 Cumulative statistic calculation for enrichment score

For the enrichment score *s*_{r} we use a similar idea: we again sample only gene sets of size *K* and calculate the statistic values for all the other sizes from that sample. However, calculating the cumulative statistic values for the subsamples is more complex in this case. In this section we consider only the positive mode of the enrichment statistic, *s*_{r}^{+}.

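It is helpful to look at the enrichment score from a geometric point of view. For a pathway *p* of size |*p*| = *k*, let us consider a graph of *N* + 1 points (Fig. 4) with the coordinates (*x*_{i}, *y*_{i}) for 0 ≤ *i* ≤ *N* such that:

$$(x_0, y_0) = (0, 0), \tag{1}$$

$$x_i = x_{i-1} + [i \notin p], \tag{2}$$

$$y_i = y_{i-1} + [i \in p] \cdot |S_i|, \tag{3}$$

where [·] is equal to 1 if the condition holds and to 0 otherwise.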

The calculation of *s*_{r}^{+}(*p*) corresponds to finding the point farthest up from the diagonal ((*x*_{0}, *y*_{0}), (*x*_{N}, *y*_{N})). Indeed, it is easy to see that *x*_{N} = *N* − |*p*| = *N* − *k* and *y*_{N} = Σ_{j∈p}|*S*_{j}| = NS, while the individual enrichment scores ES_{i} can be calculated as ES_{i} = *y*_{i}/NS − *x*_{i}/(*N* − *k*). The value of ES_{i} is proportional to the directed distance from the line going through (*x*_{0}, *y*_{0}) and (*x*_{N}, *y*_{N}) to the point (*x*_{i}, *y*_{i}).
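A short Python sketch of this geometric view is given here. It assumes the reading in which the curve moves one unit to the right for a gene outside the pathway and |*S*_{i}| units up for a gene inside it; the function names are illustrative.

```python
def curve_points(stats, gene_set):
    """Coordinates (x_i, y_i), i = 0..N, of the enrichment curve."""
    members = set(gene_set)
    xs, ys = [0], [0.0]
    for i, s in enumerate(stats):
        inside = i in members
        xs.append(xs[-1] + (0 if inside else 1))       # right: gene not in set
        ys.append(ys[-1] + (abs(s) if inside else 0))  # up: gene in set
    return xs, ys


def es_from_curve(xs, ys):
    """ES_i = y_i / NS - x_i / (N - k), using x_N = N - k and y_N = NS."""
    x_n, y_n = xs[-1], ys[-1]
    return [y / y_n - x / x_n for x, y in zip(xs, ys)]


stats = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
xs, ys = curve_points(stats, [0, 1])
print(max(es_from_curve(xs, ys)))  # positive-mode enrichment score
```

Maximizing ES_{i} over *i* is the same as maximizing *y*_{i}·*x*_{N} − *x*_{i}·*y*_{N}, that is, finding the point that lies farthest above the diagonal.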

Let us fix a sample *π* of size *K*. To efficiently calculate the cumulative values of *s*_{r}^{+} for all *k* ≤ *K* we need a fast method of updating the farthest point when a new gene is added. In that case we can add the genes from *π* one by one and calculate the values of *s*_{r}^{+} from the corresponding maximal distances.

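Because we are calculating the values for *π*[1‥*k*] for *k* ≤ *K*, we know in advance which *K* genes will be added. This allows us to consider *K* + 1 points instead of *N* + 1 on each iteration *k*. Let the array *o* of size *K* contain the sorted order of genes in *π*: that is, *π*_{o_{1}} is the minimal among *π*, *π*_{o_{2}} is the second minimal and so on. The coordinates can be calculated as follows:

$$(x^k_0, y^k_0) = (0, 0), \tag{4}$$

$$x^k_i = x^k_{i-1} + \pi_{o_i} - \pi_{o_{i-1}} - [o_i \le k], \tag{5}$$

$$y^k_i = y^k_{i-1} + [o_i \le k] \cdot |S_{\pi_{o_i}}|, \tag{6}$$

where we set *π*_{o_{0}} to be zero.

It can be shown that finding the farthest up point among (4)–(6) is equivalent to finding the farthest up point among (1)–(3), with (*x*^{k}_{i}, *y*^{k}_{i}) being equal to (*x*_{π_{o_{i}}}, *y*_{π_{o_{i}}}) calculated for *p* = *π*[1‥*k*]. Consider *x*_{π_{o_{i}}} − *x*_{π_{o_{i−1}}}. By the definition of *x* it is equal to:

$$x_{\pi_{o_i}} - x_{\pi_{o_{i-1}}} = \sum_{j = \pi_{o_{i-1}} + 1}^{\pi_{o_i}} [j \notin \pi[1..k]] = \pi_{o_i} - \pi_{o_{i-1}} - \sum_{j = \pi_{o_{i-1}} + 1}^{\pi_{o_i}} [j \in \pi[1..k]].$$

By the definition of *o*, in the interval from *π*_{o_{i−1}} + 1 to *π*_{o_{i}} − 1 there are no genes from *π* and, thus, from *π*[1‥*k*]. Thus we can replace the sum with its last member:

$$x_{\pi_{o_i}} - x_{\pi_{o_{i-1}}} = \pi_{o_i} - \pi_{o_{i-1}} - [o_i \le k].$$

We got the same difference as in (5).

Now consider *y*_{π_{o_{i}}} − *y*_{π_{o_{i−1}}}. By the definition of *y* it is equal to:

$$y_{\pi_{o_i}} - y_{\pi_{o_{i-1}}} = \sum_{j = \pi_{o_{i-1}} + 1}^{\pi_{o_i}} [j \in \pi[1..k]] \cdot |S_j|.$$

Again, in the interval from *π*_{o_{i−1}} + 1 to *π*_{o_{i}} − 1 there are no genes from *π*[1‥*k*]. Thus we can replace the sum with only the last member:

$$y_{\pi_{o_i}} - y_{\pi_{o_{i-1}}} = [o_i \le k] \cdot |S_{\pi_{o_i}}|.$$

We got the same difference as in (6).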

We do not need to consider other points, because points from *o*_{i−1} to *o*_{i}−1 have the same *y* coordinate and *o*_{i−1} is the leftmost of them. Thus, when at least one gene is added the diagonal ((*x*_{0}, *y*_{0}), (*x*_{N}, *y*_{N})) is not horizontal and *o*_{i−1} is the farthest point among *o*_{i−1}, *…, o*_{i} − 1.

Now let us consider what happens to the enrichment score graph when gene *π*_{k} is added to the query set *π*[1‥*k* − 1] (Fig. 5). Let *r*_{k} be the rank of gene *π*_{k} among the genes of *π*; then the coordinates of the points (*x*_{i}, *y*_{i}) for *i* < *r*_{k} do not change, while all (*x*_{i}, *y*_{i}) for *i* ≥ *r*_{k} are shifted by (−1, |*S*_{π_{k}}|).

To make fast incremental updates we decompose the problem into multiple smaller ones. For simplicity we assume that *K* + 1 is an exact square of an integer *b*. Let us split the *K* + 1 points into *b* consecutive blocks of size *b*: {(*x*_{0}, *y*_{0}), …, (*x*_{b−1}, *y*_{b−1})}, {(*x*_{b}, *y*_{b}), …, (*x*_{2b−1}, *y*_{2b−1})} and so on.

For each of *b* blocks we will store and update the farthest up point from the diagonal. When we know for each block its farthest point we can find the globally farthest point by a simple pass in *O*(*b*) time.

Next, we show how to update the farthest points in blocks in amortized time *O*(*b*). This taken together with one *O*(*b*) pass will get us an algorithm to update the globally farthest point in amortized *O*(*b*) time.

Below we use *c* = ⌊*r*_{k}/*b*⌋ as the index of the block to which gene *π*_{k} belongs, where *r*_{k} is the rank of gene *π*_{k} among the genes from *π*, i.e. *o*_{r_{k}} = *k*.

First, we describe the procedure for updating the point coordinates. We store the *x*_{i} coordinates using two vectors: *B* of size *b* and *D* of size *K* + 1, such that *x*_{i} = *B*_{⌊i/b⌋} + *D*_{i}. When gene *π*_{k} is added, all *x*_{i} for *i* ≥ *r*_{k} are decremented by one. To reflect this, we decrement all *B*_{j} for *j* > *c* and all *D*_{i} for *r*_{k} ≤ *i* < (*c* + 1)*b*, that is, for the remaining points of block *c*. The update takes *O*(*b*) time. After this update we can get the value of *x*_{i} in *O*(1) time. The same procedure is applied to the *y* coordinates.
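The two-vector representation can be sketched as a small Python class. The class name `BlockOffsets` and its methods are hypothetical, and the sketch assumes that within block *c* the update covers the points from *r*_{k} to the end of that block.

```python
class BlockOffsets:
    """Store x_i = B[i // b] + D[i] and support 'add delta to all x_i, i >= r'."""

    def __init__(self, values, b):
        self.b = b
        self.D = list(values)                       # per-point part
        self.B = [0] * ((len(values) + b - 1) // b)  # per-block shared offset

    def add_suffix(self, r, delta):
        c = r // self.b
        # points of block c at or after r: update D directly, O(b)
        for i in range(r, min((c + 1) * self.b, len(self.D))):
            self.D[i] += delta
        # all later blocks: shift the shared block offset, O(b)
        for j in range(c + 1, len(self.B)):
            self.B[j] += delta

    def get(self, i):
        return self.B[i // self.b] + self.D[i]      # O(1) lookup


xs = BlockOffsets(range(9), b=3)
xs.add_suffix(4, -1)                 # a gene of rank 4 was added to the set
print([xs.get(i) for i in range(9)])
```

Each update touches at most *b* entries of `D` and at most *b* entries of `B`, which gives the *O*(*b*) bound, while a lookup reads exactly one entry of each.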

Second, for each block we maintain the upper part of its convex hull. Having the convex hull is useful because the farthest point in a block always lies on its convex hull. In all blocks except *c*, the points are either unchanged or shifted simultaneously by the same value. This means that the lists of points on the convex hulls of these blocks remain unchanged. For block *c* we can reconstruct the convex hull from scratch using the Graham scan algorithm [10]. Because the points are already sorted by *x* coordinate, this reconstruction takes *O*(*b*) time. In total, it takes *O*(*b*) time to update the convex hulls.

Third, the farthest points in blocks can be updated using the stored convex hulls. Consider a block where the convex hull was not changed (every block except, possibly, block *c*). Because diagonal always rotates in the same counterclockwise direction, the farthest point in block on iteration *k* either stays the same or moves on the convex hull to the left of the farthest point on the (*k* − 1)-th iteration. Thus, for each such block we can compare current farthest point with its left neighbor on the convex hull and update the point if necessary. It is repeated until the next neighbor is closer to the diagonal than the current farthest point. In the block *c* we just find the farthest point in a single pass by the points on the convex hull.
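The from-scratch reconstruction for the changed block can be sketched in Python as follows. This is a generic upper-hull scan over points that are already ordered along the curve, followed by a single pass over the hull; it does not include the incremental pointer updates used for the unchanged blocks, and the function names are illustrative.

```python
def cross(o, a, b):
    """Cross product of vectors o->a and o->b (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def upper_hull(points):
    """Upper convex hull of points given in curve order (non-decreasing x)."""
    hull = []
    for p in points:
        # drop the last hull point while it lies on or below the new edge
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


def farthest_above_diagonal(hull, x_n, y_n):
    """Hull point maximizing the directed distance from the line (0,0)-(x_n,y_n)."""
    return max(hull, key=lambda p: p[1] * x_n - p[0] * y_n)


block = [(0, 0), (0, 3), (0, 5), (1, 5), (2, 5), (3, 5), (4, 5)]
hull = upper_hull(block)
print(hull, farthest_above_diagonal(hull, x_n=4, y_n=5))
```

Since every point is pushed and popped at most once, the scan is linear in the block size, which matches the *O*(*b*) cost stated in the text.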

To show that updating the farthest points takes *O*(*b*) amortized time we use the potential method. Let the potential Φ_{k} after adding the *k*-th gene be the sum of the relative indexes of the farthest points over all the blocks. As there are *b* blocks of size *b*, the sum of relative indexes lies between 0 and *b*^{2}. Thus, Φ_{k} = *O*(*b*^{2}). To update all *b* − 1 blocks except *c* we need to make *t*_{k} = *b* − 1 + *z* comparisons of two points, where *z* is the number of times the farthest points were updated. This can take up to Θ(*b*^{2}) time in the worst case. However, note that the potential change Φ_{k} − Φ_{k−1} is equal to −*z* + *O*(*b*): the sum of indexes is decreased by the number of times the farthest points were updated, plus *O*(*b*) for block *c*, where the index can go from 0 to *b* − 1. This gives an amortized cost of the *k*-th iteration of *a*_{k} = *t*_{k} + Φ_{k} − Φ_{k−1} = *b* − 1 + *z* − *z* + *O*(*b*) = *O*(*b*). The total real cost of *K* iterations is Σ_{k=1}^{K} *t*_{k} = Σ_{k=1}^{K} *a*_{k} + Φ_{0} − Φ_{K} = *O*(*Kb*) + *O*(*b*^{2}) = *O*(*Kb*), which means that the amortized cost of one iteration is *O*(*b*).

Taken together, the algorithm finds all cumulative enrichment scores *s*_{r}(*π*[1‥*k*]) in *O*(*K*√*K*) time. The straightforward implementation that calculates the cumulative values from scratch would take *O*(*K*^{2} log *K*) time. Thus, we have improved the performance *O*(√*K* log *K*) times.

##### 2.3.3 Implementation details

We also implemented an optimization so that the algorithm does not build convex hull from scratch for a changed block *c*, but only updates the changed points. This does not influence the asymptotic performance, but decreases the constant factor.

First, we start updating the convex hull from position *r*_{k} and not from the start. To be able to do this, we keep an array `prev` that for each gene *g* ∈ *π* stores the previous point on the convex hull if *g* were the last gene in the block. This is the same as the top of the stack in the Graham algorithm and represents the state of the algorithm at any given point. As all points *h* to the left of *g* are not changed, `prev`_{h} also remains unchanged and need not be recalculated.

Second, we stop updating the hull when we reach a point that was on the convex hull on the previous iteration. We can do this because every point to the left of *g* is rotated counterclockwise relative to any point to the right of *g*, which means that the first point on the convex hull to the right of *g* on the (*k* − 1)-th iteration remains a convex hull point on the *k*-th iteration.

### 2.4 An algorithm for exact calculation of GSEA P-values for integer gene-level statistics

In this section we describe a polynomial algorithm that calculates the GSEA P-value exactly, but only for the case when the gene-level statistics are integer numbers: *S*_{i} ∈ ℤ. For simplicity we consider the problem of calculating the following probability:

$$P\left(s_r^+(q) \ge \gamma\right),$$

where *q* is a random gene set of size *k*. We also assume *γ* > 0.

Let us denote the sum of the *k* largest absolute values of gene-level statistics by *T*. The algorithm is polynomial in terms of *N*, *k* and *T*.

##### 2.4.1 The basic algorithm

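Let us consider a gene set *q* = {*q*_{1}, *q*_{2}, …, *q*_{k}}. Recall the formula for *s*_{r}^{+}(*q*): *s*_{r}^{+}(*q*) = max_{i} ES_{i}, where ES_{0} = 0, ES_{i} = ES_{i−1} + |*S*_{i}|/NS if *i* ∈ *q*, and ES_{i} = ES_{i−1} − 1/(*N* − *k*) otherwise.

First, let us rewrite the formula for ES_{i} in an equivalent fashion, grouping positive and negative summands:

$$\mathrm{ES}_i = \frac{1}{\mathrm{NS}} \sum_{j \le i,\ j \in q} |S_j| - \frac{i - |\{j \le i : j \in q\}|}{N - k}.$$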

Then for calculating ES_{i} the following values are sufficient:

- *i*: the index of the current gene;
- *c*: the number of genes included into the set *q* among genes 1‥*i*;
- *s*: the sum of the absolute values of gene-level statistics for genes included in the set among genes 1‥*i*;
- NS: the sum of the absolute values of gene-level statistics for *all* genes in the set.

Knowing the values above, ES_{i} can be calculated as ES_{i} = *s*/NS − (*i* − *c*)/(*N* − *k*).

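Notice that NS can take only integer values from 0 to *T* (the latter for the set of genes with the largest absolute values of gene-level statistics). Let us split the desired probability into a sum of probabilities of mutually exclusive events based on the value of NS:

$$P\left(s_r^+(q) \ge \gamma\right) = \sum_{\mathrm{NS} = 0}^{T} P\left(s_r^+(q) \ge \gamma \ \wedge \ \sum_{j \in q} |S_j| = \mathrm{NS}\right).$$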

Our algorithm is based on *dynamic programming*. For each possible value of NS we process the genes one by one in increasing order of index and calculate an array *f*_{NS}(*i*, *c*, *s*). The value *f*_{NS}(*i*, *c*, *s*) contains the probability that a uniformly random gene set *q*′ of *c* genes selected from genes 1‥*i* simultaneously has the following two properties:

- the sum of the absolute values of gene-level statistics of genes from *q*′ is equal to *s*;
- ES_{j} < *γ* holds for all *j* ≤ *i*, where the values of ES are calculated for the gene set *q*′ but using the selected values of NS and *k*, not the ones calculated for the set *q*′.

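Suppose that we have calculated all values of *f*_{NS}(*i*, *c*, *s*), then

$$P\left(s_r^+(q) < \gamma \ \wedge \ \sum_{j \in q} |S_j| = \mathrm{NS}\right) = f_{\mathrm{NS}}(N, k, \mathrm{NS})$$

and

$$P\left(s_r^+(q) \ge \gamma \ \wedge \ \sum_{j \in q} |S_j| = \mathrm{NS}\right) = P\left(\sum_{j \in q} |S_j| = \mathrm{NS}\right) - f_{\mathrm{NS}}(N, k, \mathrm{NS}).$$

Finally, the sought probability is equal to:

$$P\left(s_r^+(q) \ge \gamma\right) = 1 - \sum_{\mathrm{NS} = 0}^{T} f_{\mathrm{NS}}(N, k, \mathrm{NS}).$$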

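Let us find a formula for *f*_{NS}(*i*, *c*, *s*). The base case of dynamic programming is *i* = 0 for all NS:

$$f_{\mathrm{NS}}(0, c, s) = \begin{cases} 1 & \text{if } c = 0 \text{ and } s = 0, \\ 0 & \text{otherwise.} \end{cases}$$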

Suppose we want to calculate *f*_{NS}(*i*, *c*, *s*) for some *i* > 0. First, calculate

$$\mathrm{ES}_i = \frac{s}{\mathrm{NS}} - \frac{i - c}{N - k}$$

and compare it to *γ*. If ES_{i} ≥ *γ*, then *f*_{NS}(*i*, *c*, *s*) = 0 by definition.

Otherwise, condition “ES_{j} < *γ* holds for all *j* ≤ *i*” can be simplified to “ES_{j} < *γ* holds for all *j* ≤ *i* − 1”. This observation allows us to use values of *f* that have already been calculated. Consider two cases:

- Gene *i* does not belong to the set *q*′. As *q*′ is a set of *c* genes chosen uniformly at random from *i* genes, this case happens with the probability (*i* − *c*)/*i*. The conditional probability that such a set satisfies the two necessary properties is *f*_{NS}(*i* − 1, *c*, *s*). Indeed, any set of size *c* with the sum of absolute values of gene-level statistics equal to *s*, chosen among genes 1‥*i* − 1 and satisfying the conditions on ES, is a valid set chosen among genes 1‥*i*. Similarly, if a set does not satisfy the condition on ES_{j} for some *j* ≤ *i* − 1, this set should not be counted towards *f*_{NS}(*i*, *c*, *s*), since obviously *j* ≤ *i*.
- Gene *i* belongs to the set. This case happens with the probability *c*/*i*. The probability that this set satisfies the necessary conditions is *f*_{NS}(*i* − 1, *c* − 1, *s* − |*S*_{i}|). Indeed, any set of size *c* − 1 with the sum of absolute values of gene-level statistics equal to *s* − |*S*_{i}|, chosen among genes 1‥*i* − 1 and satisfying the conditions on ES, can be extended with gene *i*, thus forming a set of size *c* satisfying both necessary properties. Similarly, if a set does not satisfy the condition on ES_{j} for some *j* ≤ *i* − 1, adding gene *i* will not fix the situation.

Then we can calculate *f*_{NS}(*i*, *c*, *s*) using the law of total probability:

$$f_{\mathrm{NS}}(i, c, s) = \frac{i - c}{i} \cdot f_{\mathrm{NS}}(i - 1, c, s) + \frac{c}{i} \cdot f_{\mathrm{NS}}(i - 1, c - 1, s - |S_i|)$$

in the case when *i* > 0 and ES_{i} < *γ*.

Putting all the cases together, we arrive at the final formula for *f*_{NS}(*i*, *c*, *s*):

$$f_{\mathrm{NS}}(i, c, s) = \begin{cases} 1 & \text{if } i = 0,\ c = 0 \text{ and } s = 0, \\ 0 & \text{if } i = 0 \text{ and } (c, s) \ne (0, 0), \\ 0 & \text{if } i > 0 \text{ and } \mathrm{ES}_i \ge \gamma, \\ \dfrac{i - c}{i} \cdot f_{\mathrm{NS}}(i - 1, c, s) + \dfrac{c}{i} \cdot f_{\mathrm{NS}}(i - 1, c - 1, s - |S_i|) & \text{if } i > 0 \text{ and } \mathrm{ES}_i < \gamma. \end{cases}$$

The overall complexity of the algorithm is *O*(*NkT*^{2}). The values of *f* can be evaluated sequentially in increasing order of *i*. It is enough to evaluate *f*_{NS}(*i*, *c*, *s*) for 0 ≤ *i* ≤ *N*, 0 ≤ *c* ≤ *k*, and 0 ≤ *s* ≤ NS ≤ *T*. Each value of *f* can be evaluated in constant time.
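The basic dynamic programming scheme can be sketched in Python with exact rational arithmetic. This is an unoptimized illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the weights |*S*_{i}| are assumed to be strictly positive integers (so NS ≥ 1), and the sought probability is obtained as one minus the total probability of never reaching *γ*. A brute-force enumeration is included only to check the result on tiny inputs.

```python
from fractions import Fraction
from itertools import combinations


def exact_pvalue(weights, k, gamma):
    """P(s^+(q) >= gamma) for a uniformly random k-subset q of the genes.

    `weights` are positive integers |S_i| in ranking order; gamma > 0.
    """
    n, gamma = len(weights), Fraction(gamma)
    t = sum(sorted(weights, reverse=True)[:k])    # T: largest possible NS
    p_below = Fraction(0)                         # P(ES_i < gamma for all i)
    for ns in range(1, t + 1):
        f = {(0, 0): Fraction(1)}                 # base case i = 0
        for i in range(1, n + 1):
            w, new = weights[i - 1], {}
            for c in range(min(i, k) + 1):
                for s in range(ns + 1):
                    if Fraction(s, ns) - Fraction(i - c, n - k) >= gamma:
                        continue                  # ES_i >= gamma: f = 0
                    value = (Fraction(i - c, i) * f.get((c, s), 0)
                             + Fraction(c, i) * f.get((c - 1, s - w), 0))
                    if value:
                        new[(c, s)] = value
            f = new
        p_below += f.get((k, ns), 0)
    return 1 - p_below


def brute_force_pvalue(weights, k, gamma):
    """Same probability by enumerating all k-subsets (tiny inputs only)."""
    n, hits, total = len(weights), 0, 0
    for q in combinations(range(n), k):
        ns, es, best = sum(weights[j] for j in q), Fraction(0), Fraction(0)
        for i in range(n):
            es += Fraction(weights[i], ns) if i in q else -Fraction(1, n - k)
            best = max(best, es)
        hits += best >= gamma
        total += 1
    return Fraction(hits, total)


weights = [3, 2, 1, 1, 2, 3]
print(exact_pvalue(weights, 2, Fraction(1, 2)),
      brute_force_pvalue(weights, 2, Fraction(1, 2)))
```

Only the states of the previous gene index are kept in memory, since the recurrence for index *i* refers to index *i* − 1 alone.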

##### 2.4.2 Optimizations and implementation details

While the algorithm described above is polynomial, a number of further optimizations are required to make execution on real size inputs feasible.

First, let us note that the following property holds: *f*_{NS_{1}}(*i*, *c*, *s*) ≤ *f*_{NS_{2}}(*i*, *c*, *s*) as long as NS_{2} ≥ NS_{1}. Indeed, the ES values calculated using different values of NS decrease as NS increases. This means that all gene sets counted towards *f*_{NS_{1}}(*i*, *c*, *s*) should also be counted towards *f*_{NS_{2}}(*i*, *c*, *s*) if NS_{2} ≥ NS_{1}.

Following the observation above, instead of calculating the values of *f*_{NS}(*i*, *c*, *s*) we consider the values *g*(*i*, *c*, *s*, *b*) = *f*_{b+1}(*i*, *c*, *s*) − *f*_{b}(*i*, *c*, *s*). These values contain the probability that a random gene set *q* of size *k* selected uniformly from genes 1‥*N* simultaneously satisfies the following four properties:

- the set *q* contains exactly *c* genes from the genes 1‥*i*;
- the sum of the absolute values of gene-level statistics of the first *c* genes from *q* is equal to *s*;
- ES_{j} < *γ* holds for all *j* ≤ *i*, where the values of ES are calculated for the gene set *q* using NS = *b* + 1 (and for all higher values of NS);
- ES_{j} ≥ *γ* holds for at least one *j* ≤ *i*, where the values of ES are calculated for the gene set *q* using NS = *b* (and for all lower values of NS).

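The sought probability can be calculated from the values of *g* as follows:

$$P\left(s_r^+(q) \ge \gamma\right) = \sum_{s = 0}^{T} \sum_{b \ge s} g(N, k, s, b),$$

that is, as the total probability of the final states in which the actual value NS = *s* does not exceed the bound *b*.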

To calculate the values of *g* we will use the forward dynamic programming algorithm. In this algorithm we expand a tree of reachable dynamic programming states, starting from *g*(0, 0, 0, 0) which is equal to 1.

The states will be considered by “levels” in increasing order of *i*. The values *g*(*i* + 1, *c*, *s*, *b*) from the (*i* + 1)-th level are calculated based on level *i*. Note that the sum of values on the *i*-th level is always equal to 1.

To calculate all values from the (*i* + 1)-th level, all non-zero values from the *i*-th level are considered sequentially. Consider a state (*i*, *c*, *s*, *b*) and define *p* = (*k* − *c*)/(*N* − *i*), the probability that gene *i* + 1 will be added to the set. The corresponding set of gene sets *G*(*i*, *c*, *s*, *b*) can be divided into two groups.

1. The gene sets from *G*(*i*, *c*, *s*, *b*) that do not include gene *i* + 1. These gene sets are included into *G*(*i* + 1, *c*, *s*, *b*) on level *i* + 1. Thus the corresponding probability *g*(*i*, *c*, *s*, *b*) · (1 − *p*) is added to the value of *g*(*i* + 1, *c*, *s*, *b*).
2. The gene sets from *G*(*i*, *c*, *s*, *b*) that do include gene *i* + 1. These gene sets are included into *G*(*i* + 1, *c* + 1, *s*′, *b*′), where *s*′ = *s* + |*S*_{i+1}| and *b*′ is an updated bound. To calculate *b*′, note that ES_{i+1} will be greater than or equal to *γ* iff *s*′/NS − (*i* − *c*)/(*N* − *k*) ≥ *γ*, which is equivalent to NS ≤ *s*′/(*γ* + (*i* − *c*)/(*N* − *k*)). Thus *b*′ = max(*b*, ⌊*s*′/(*γ* + (*i* − *c*)/(*N* − *k*))⌋). The probability that is added to *g*(*i* + 1, *c* + 1, *s*′, *b*′) is equal to *g*(*i*, *c*, *s*, *b*) · *p*.

While the asymptotic number of states remains *O*(*NkT*^{2}), the forward dynamic programming allows us to consider only “reachable” states with *g*(*i*, *c*, *s*, *b*) > 0. In practice the number of reachable states can be several orders of magnitude smaller than the total number of states.

Furthermore, we can treat only states with *g*(*i*, *c*, *s*, *b*) > *ε* as reachable, for some small value of *ε*. If we skip the remaining states, we can no longer calculate the desired probability exactly. However, if we calculate *δ* as the sum of the values of all skipped states, the desired probability is calculated with an absolute error of no more than *δ*.
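The forward pass and the pruning rule can be sketched in a few lines of Python. This is a minimal illustration, not the code from `exact.cpp`. It assumes the usual running-sum form of ES (a gene from the set adds |*S*_{j}|/NS, any other gene subtracts 1/(*N* − *k*)), positive integer statistics, and a final summation over the states with *b* ≥ *s*. The names `exact_tail_probability` and `brute_force_tail_probability` are hypothetical, and the brute-force helper exists only to check the sketch on tiny inputs.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import combinations
from math import comb, floor


def exact_tail_probability(weights, k, gamma, eps=0.0):
    """Forward DP over states (c, s, b); returns (probability, delta)."""
    n = len(weights)
    gamma = Fraction(gamma)
    level = {(0, 0, 0): 1.0}           # g(0, 0, 0, 0) = 1
    delta = 0.0                         # total mass of the skipped states
    for i, w in enumerate(weights):     # i genes have been processed so far
        nxt = defaultdict(float)
        for (c, s, b), g in level.items():
            if g <= eps:                # treated as unreachable
                delta += g
                continue
            p = (k - c) / (n - i)       # chance that gene i + 1 is in the set
            if p < 1.0:
                nxt[(c, s, b)] += g * (1.0 - p)
            if p > 0.0:
                s2 = s + w
                # largest NS for which the running sum reaches gamma here
                bound = floor(s2 / (gamma + Fraction(i - c, n - k)))
                nxt[(c + 1, s2, max(b, bound))] += g * p
        level = nxt
    # On the last level c == k and s is the true NS of the gene set.
    prob = sum(g for (c, s, b), g in level.items() if b >= s)
    return prob, delta


def brute_force_tail_probability(weights, k, gamma):
    """Enumerate every gene set of size k (feasible for tiny inputs only)."""
    n = len(weights)
    gamma = Fraction(gamma)
    hits = 0
    for q in combinations(range(n), k):
        members = set(q)
        ns = sum(weights[j] for j in q)
        run = best = Fraction(0)
        for j in range(n):
            if j in members:
                run += Fraction(weights[j], ns)
            else:
                run -= Fraction(1, n - k)
            best = max(best, run)
        hits += best >= gamma
    return hits / comb(n, k)
```

With `eps = 0` nothing is skipped and `delta` is zero. With a positive `eps` the returned probability is a lower bound that differs from the exact value by at most `delta`.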

The algorithm implementation with a few other optimizations is available at: https://github.com/ctlab/fgsea/blob/master/inst/exact/exact.cpp.

#### 2.5 FGSEA-multilevel: an algorithm for calculation of arbitrarily low P-values using adaptive multilevel split Monte Carlo scheme

In this section we describe the FGSEA-multilevel algorithm, which can accurately estimate the GSEA P-value for a pathway *p* of size *k* even when the true P-value is very small.

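Let *γ* = *s*_{r}(*p*) > 0 be the enrichment score of the query pathway *p* for which we want to calculate the following value:

P (*s*_{r}(*q*) ≥ *γ* | *s*_{r}(*q*) ≥ 0),

where *q* is a random gene set of size *k*. This probability can be rewritten as follows:

P (*s*_{r}(*q*) ≥ *γ* | *s*_{r}(*q*) ≥ 0) = P (*s*_{r}(*q*) ≥ *γ*) / P (*s*_{r}(*q*) ≥ 0) = P (*s*^{+}_{r}(*q*) ≥ *γ*) · P (*s*_{r}(*q*) ≥ 0 | *s*^{+}_{r}(*q*) ≥ *γ*) / P (*s*_{r}(*q*) ≥ 0).

First, we focus on determining the probability P (*s*^{+}_{r}(*q*) ≥ *γ*). This probability can be extremely small, so naive sampling gives a poor estimate. We use the adaptive multilevel split Monte Carlo method [6] to solve this problem.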

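To estimate the probability P (*s*^{+}_{r}(*q*) ≥ *γ*) we split the enrichment scores into levels 0 = *l*_{0} < *l*_{1} < … < *l*_{t} = *γ*. Then we can define the following probabilities:

*α*_{i} = P (*s*^{+}_{r}(*q*) ≥ *l*_{i} | *s*^{+}_{r}(*q*) ≥ *l*_{i−1}), 1 ≤ *i* ≤ *t*.

Now the probability P (*s*^{+}_{r}(*q*) ≥ *γ*) can be rewritten as ∏_{i=1}^{t} *α*_{i}.

To estimate *α*_{i} we can draw a sample of size *Z* from the conditional distribution P (*q* | *s*^{+}_{r}(*q*) ≥ *l*_{i−1}). Then *α*_{i} ≈ *Z*_{i}/*Z*, where *Z*_{i} is the number of gene sets in the sample with *s*^{+}_{r}(*q*) ≥ *l*_{i}.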

Below we show how levels *l*_{i} can be chosen and how to sample from the corresponding conditional distributions.

##### 2.5.1 Choosing the enrichment score levels

We propose to choose the value of a level *l*_{i} as the median of the enrichment scores of the sample. For simplicity *Z* is required to be an odd number.

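Then the procedure for estimating the probability consists of repeating the following steps:

1. On iteration *i* ≥ 1, sample *Z* gene sets of size *k* from the distribution P (*q* | *s*^{+}_{r}(*q*) ≥ *l*_{i−1}).
2. Set the level *l*_{i} to be equal to the median of the enrichment scores of the sampled gene sets.
3. If *l*_{i} ≥ *γ*, then stop the iterations and set *l*_{i} = *γ* and *t* = *i*; otherwise proceed to iteration *i* + 1.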

As a result, by construction, *α*_{i} ≈ 1/2 for 1 ≤ *i* ≤ *t* − 1. The value of *α*_{t} can be approximated as *Z*_{t}/*Z* (which is always ≥ 1/2). Together we get the following expression for estimating the desired probability:

∏_{i=1}^{t} *α*_{i} ≈ (1/2)^{t−1} · *Z*_{t}/*Z*.
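The level-selection loop can be illustrated independently of gene sets. The sketch below is an assumption-laden toy: the conditional sampler is passed in as a function, and the demonstration replaces the gene-set sampler with a score that follows an exponential distribution, chosen only because its conditional distribution above any level is known exactly (memorylessness) and its tail probability is e^{−*γ*}. The names `multilevel_tail_estimate` and `exp_sampler` are hypothetical.

```python
import random
import statistics


def multilevel_tail_estimate(sample_above, gamma, z, rng):
    """Estimate P(score >= gamma) with median levels; returns (estimate, t).

    sample_above(level, z, rng) must return z scores drawn from the
    distribution conditioned on score >= level.
    """
    if z % 2 == 0:
        raise ValueError("the sample size Z must be odd")
    level = 0.0   # l_0
    t = 0
    while True:
        t += 1
        scores = sample_above(level, z, rng)
        median = statistics.median(scores)
        if median >= gamma:
            # Last level: l_t = gamma and alpha_t is approximated by Z_t / Z.
            z_t = sum(score >= gamma for score in scores)
            return 0.5 ** (t - 1) * z_t / z, t
        level = median            # alpha_i is approximately 1/2 by construction


def exp_sampler(level, z, rng):
    """Toy stand-in: an Exp(1) score conditioned on being >= level."""
    return [level + rng.expovariate(1.0) for _ in range(z)]
```

For example, `multilevel_tail_estimate(exp_sampler, 20.0, 1001, random.Random(1))` targets a probability of about 2 · 10^{−9} using roughly 29 levels of 1001 draws each, which would be out of reach for naive sampling with the same budget.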

##### 2.5.2 The conditional sampling implementation

To generate a uniform sample from the conditional distribution we use the Metropolis algorithm.

First, we generate a sample of size *Z* from the distribution P (*q* | *s*^{+}_{r}(*q*) ≥ *l*_{0}). Since *l*_{0} = 0 and the values of *s*^{+}_{r} are always non-negative, this can be done by generating uniformly random subsets of size *k* from the genes {1, 2, …, *N*}.

Now consider a sample at a step *i* > 1. The sample can be sorted in increasing order of enrichment score values. Let *d* = ⌈*Z*/2⌉. The level *l*_{i−1} is the median of these values and, thus, is equal to the *d*-th smallest of them.

Let first populate in the following way:

This gives us a sample from the conditional distribution , however it is not uniform.

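To make the sample uniform we apply a number of iterations of the Metropolis algorithm. On each iteration, for each gene set *q* in the sample we apply the following steps:

1. Choose a random gene from *q*.
2. Choose a random gene that is not in *q*.
3. Consider the gene set *q*′ obtained from *q* by replacing the first gene with the second. If *s*^{+}_{r}(*q*′) ≥ *l*_{i−1}, then we replace *q* with *q*′.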

The iterations are repeated until the total number of successful replacements becomes greater than or equal to *k* · *Z*. In practice, this number of steps is enough to get a sufficiently uniform sample and a good estimate of the probability, without a significant increase in the running time of the algorithm.
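A minimal sketch of this refresh step is given below. It makes several assumptions that are not stated in the text: the score used for the levels is taken to be the maximum of the GSEA running sum, the incoming gene is drawn from the genes outside the set, and a `max_sweeps` guard is added so that the loop always terminates. The function names `positive_es` and `metropolis_refresh` are hypothetical.

```python
import random


def positive_es(stats, gene_set):
    """Maximum of the running sum (assumed form of the level score)."""
    n, k = len(stats), len(gene_set)
    ns = sum(abs(stats[g]) for g in gene_set)
    run = best = 0.0
    for g in range(n):
        if g in gene_set:
            run += abs(stats[g]) / ns
        else:
            run -= 1.0 / (n - k)
        best = max(best, run)
    return best


def metropolis_refresh(stats, sample, level, rng, max_sweeps=1000):
    """Swap genes in and out, keeping every gene set at or above the level."""
    n, k = len(stats), len(sample[0])
    accepted = 0
    for _ in range(max_sweeps):
        if accepted >= k * len(sample):      # stopping rule: k * Z successes
            break
        for idx, q in enumerate(sample):
            out_gene = rng.choice(sorted(q))
            in_gene = rng.choice([g for g in range(n) if g not in q])
            candidate = (q - {out_gene}) | {in_gene}
            # Symmetric proposal and uniform target: accept iff still valid.
            if positive_es(stats, candidate) >= level:
                sample[idx] = candidate
                accepted += 1
    return sample
```

Because the proposal is symmetric (a swap and its reverse are equally likely), accepting every candidate that stays above the level leaves the uniform distribution over the admissible gene sets invariant.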

##### 2.5.3 Estimating the P-value

In order to estimate the desired P-value we also need to calculate the probabilities P (*s*_{r} (*q*) ≥ 0) and P (*s*_{r}(*q*) ≥ 0 | *s*^{+}_{r}(*q*) ≥ *γ*).

To calculate the probability P (*s*_{r} (*q*) ≥ 0) we generate gene sets *q*_{1}, *q*_{2}, …, *q*_{Z′}, where each sample *q*_{i} is selected uniformly at random from all the subsets of size *k* of the set {1, 2, …, *N*}. The samples are generated until the number of samples *q*_{i} with *s*_{r}(*q*_{i}) ≥ 0 becomes equal to *Z*. Then the probability P (*s*_{r} (*q*) ≥ 0) is estimated as follows:

To determine the remaining probability we count the gene sets in the last-level sample for which the value of the enrichment score function *s*_{r} is greater than zero. After that, the probability can be estimated as follows:

##### 2.5.4 Estimating log-probability

To properly estimate the logarithm of the desired probability, note that the *j*-th order statistic of a standard uniform sample of size *Z* is a random variable from the beta distribution Beta (*j*, *Z* + 1 − *j*). Therefore, we can use the properties of the beta distribution to make a correct transition to the logarithm of the probability. For the median of a sample of odd size *Z* we have:

E[log *α*_{i}] = *ψ*((*Z* + 1)/2) − *ψ*(*Z* + 1),

where *ψ* is the digamma function. In the same way, we can calculate the expectation of the logarithm of *α*_{t}:

E[log *α*_{t}] = *ψ*(*Z*_{t}) − *ψ*(*Z* + 1).

Then the logarithm of the probability is estimated as

log P (*s*^{+}_{r}(*q*) ≥ *γ*) ≈ (*t* − 1) · (*ψ*((*Z* + 1)/2) − *ψ*(*Z* + 1)) + *ψ*(*Z*_{t}) − *ψ*(*Z* + 1).

Similarly, we can estimate the variance of the estimates , where *ψ*_{1} is trigamma function. From this we can approximate a standard error of our estimator as:

The same approach with digamma functions is used to calculate the logarithms of the probabilities P (*s*_{r}(*q*) ≥ 0 | *s*^{+}_{r}(*q*) ≥ *γ*) and P (*s*_{r} (*q*) ≥ 0).
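The beta-distribution identities behind this estimate are E[ln *X*] = *ψ*(*a*) − *ψ*(*a* + *b*) and Var[ln *X*] = *ψ*_{1}(*a*) − *ψ*_{1}(*a* + *b*) for *X* ~ Beta (*a*, *b*). For integer arguments both differences reduce to finite sums, so a sketch needs only the standard library. The code below is an illustration under two assumptions: each of the first *t* − 1 levels contributes a Beta ((*Z* + 1)/2, (*Z* + 1)/2) factor, and the last level is treated as Beta (*Z*_{t}, *Z* + 1 − *Z*_{t}). The function names are hypothetical.

```python
import math


def mean_log_beta(a, b):
    """E[ln X] for X ~ Beta(a, b) with integers a, b >= 1."""
    # psi(a) - psi(a + b) = -(1/a + 1/(a + 1) + ... + 1/(a + b - 1))
    return -sum(1.0 / m for m in range(a, a + b))


def var_log_beta(a, b):
    """Var[ln X] for X ~ Beta(a, b) with integers a, b >= 1."""
    # psi_1(a) - psi_1(a + b) = 1/a^2 + ... + 1/(a + b - 1)^2
    return sum(1.0 / (m * m) for m in range(a, a + b))


def log_probability_estimate(t, z, z_t):
    """Return (estimated natural-log probability, approximate standard error)."""
    half = (z + 1) // 2
    mean = (t - 1) * mean_log_beta(half, half) + mean_log_beta(z_t, z + 1 - z_t)
    var = (t - 1) * var_log_beta(half, half) + var_log_beta(z_t, z + 1 - z_t)
    return mean, math.sqrt(var)
```

For a standard uniform variable, `mean_log_beta(1, 1)` is −1 and `var_log_beta(1, 1)` is 1, which matches the known mean and variance of the logarithm of a uniform random variable.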

##### 2.5.5 Comparison with the exact method

To compare FGSEA-multilevel and the exact method on the same dataset we used rounded values of the gene-level statistics from the example data (section 2.2) as input data for both algorithms. Both algorithms calculated the probability P (*s*^{+}_{r}(*q*) ≥ *γ*).

The results of the algorithms for the pathways from the example data are shown in Fig 2c. The exact algorithm was run with *ε* = 10^{−40}, and all the probabilities were obtained with an accuracy of at least six significant digits. For FGSEA-multilevel *Z* = 101 was used.

We also calculated empirical estimation errors and compared them to the theoretical ones (Fig 2d). For this we generated 100 independent estimates for a range of ES values (corresponding to P-values of 10^{−4} to 10^{−100}), gene set sizes (from 15 to 250) and sample sizes (from 101 to 1001). The raw values are available in the Supplementary Table.

#### 2.6 Filtering redundant pathways

In this section we describe an algorithm to filter redundant pathways from the results of FGSEA.

Consider two pathways *p*_{1} and *p*_{2} that both have a significant GSEA P-value. There are two situations in which we consider *p*_{2} to be non-redundant given *p*_{1}:

1. Pathway *p*_{2} is enriched even if we do not consider the genes from *p*_{1} at all. Formally, we calculate the GSEA P-value for gene set *p*_{2} \ *p*_{1} and gene-level statistics vector *S*[*U* \ *p*_{1}] for all the genes except those in *p*_{1}. If the P-value is less than a pre-defined threshold, then pathway *p*_{2} is considered non-redundant given *p*_{1}.
2. Pathway *p*_{2} is enriched even if we consider only genes from *p*_{1}. Formally, we calculate the GSEA P-value for gene set *p*_{2} ∩ *p*_{1} and gene-level statistics vector *S*[*p*_{1}] for the genes from *p*_{1}. Again, if the P-value is less than a pre-defined threshold, then pathway *p*_{2} is considered non-redundant given *p*_{1}.

Otherwise pathway *p*_{2} is considered to be redundant.

The filtering procedure starts with a set of significantly enriched pathways *P*_{sig} selected by the user: for example the pathways with GSEA P-values less than 0.01 after Benjamini-Hochberg correction, sorted by P-value. The output of the procedure is a list *P*_{main} ⊂ *P*_{sig} of pathways that are pairwise non-redundant. At the same time, all the other pathways *P*_{red} = *P*_{sig} \ *P*_{main} are redundant given some pathway from *P*_{sig}.

The procedure itself is similar to the Sieve of Eratosthenes algorithm. The pathways are considered one by one and some of them are marked as redundant. For a pathway *p* we first check whether it is already marked as redundant; if so, we go to the next pathway. Otherwise, we first run the FGSEA-simple algorithm on the vector of statistics *S*[*U* \ *p*] and all the pathways currently not marked as redundant (including the ones that have already been considered, but excluding pathway *p*). Then, similarly, we run the FGSEA-simple algorithm on the vector of statistics *S*[*p*]. Pathways that do not reach the non-redundancy P-value threshold in either of the two tests are marked as redundant.
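The sieve can be sketched as follows. This is an illustration of the control flow only: the enrichment test is passed in as a callable `pvalue(gene_set, universe)` standing in for FGSEA-simple, and both `collapse_pathways` and the deliberately crude `make_toy_pvalue` helper are hypothetical names introduced here so that the example runs on its own.

```python
def collapse_pathways(pathways, universe, pvalue, threshold=0.05):
    """pathways: list of (name, frozenset of genes), sorted by P-value.

    Returns (main pathway names, redundant pathway names).
    """
    redundant = set()
    for name, genes in pathways:
        if name in redundant:
            continue
        rest = universe - genes
        for other, other_genes in pathways:
            if other == name or other in redundant:
                continue
            # Test 1: is `other` enriched when the genes of `name` are removed?
            outside = pvalue(other_genes - genes, rest)
            # Test 2: is `other` enriched among the genes of `name` alone?
            inside = pvalue(other_genes & genes, genes)
            if outside >= threshold and inside >= threshold:
                redundant.add(other)
    main = [name for name, _ in pathways if name not in redundant]
    return main, sorted(redundant)


def make_toy_pvalue(stats, margin=1.0):
    """Toy test: 'significant' if the set mean exceeds the universe mean."""
    def pvalue(gene_set, universe):
        if not gene_set or not universe:
            return 1.0
        mean_set = sum(stats[g] for g in gene_set) / len(gene_set)
        mean_all = sum(stats[g] for g in universe) / len(universe)
        return 0.001 if mean_set - mean_all > margin else 1.0
    return pvalue
```

In a small example with pathways {a, b, c, d}, {a, b} and {e, f}, where genes a to f carry a high statistic, the second pathway is marked redundant given the first, while the third survives because it remains enriched after the genes of the first pathway are removed.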

## Footnotes

FGSEA-multilevel procedure has been added to estimate arbitrarily low P-values.