Abstract
Preranked gene set enrichment analysis (GSEA) is a widely used method for interpreting gene expression data in terms of biological processes. Here we present FGSEA, a method that estimates arbitrarily low GSEA P-values more accurately and much faster than other implementations. We also present a polynomial-time algorithm that calculates GSEA P-values exactly, which we use to confirm the accuracy of the method in practice.
1 Main
Preranked gene set enrichment analysis [1] is a widely used method for analyzing gene expression data. From an a priori defined list of gene sets, it selects those that behave non-randomly in the experiment under consideration. The method uses an enrichment score (ES) statistic, which is calculated from a vector of gene-level signed statistics, such as the t-statistic from a differential expression test. Unlike the similar approach of calculating Fisher P-values from an overlap statistic, it does not require arbitrary thresholding. This also allows the method to identify pathways that contain many co-regulated genes even when the individual effects are small.
A major drawback of the method is that its implementations are slow. As the analytical form of the null distribution of the ES statistic is not known, an empirical null distribution has to be calculated. That can be done in a straightforward manner by sampling random gene sets, as was done in the reference implementation [1] and reimplementations [2, 3]. In this case, an ES value is calculated for each of the input pathways. Next, a number of random gene sets of the same size are generated, and an ES value is calculated for each of them. Then a P-value is estimated as the number of random gene sets with the same or a more extreme ES value divided by the total number of generated gene sets (a formal definition is given in section 2.1). However, a large number of gene set samples is required for the test to have good statistical power, in particular due to the correction for multiple hypothesis testing.
Here we present a fast gene set enrichment analysis (FGSEA) method for efficient estimation of GSEA P-values for a collection of pathways. The method consists of two main procedures: FGSEA-simple and FGSEA-multilevel. The FGSEA-simple procedure efficiently estimates P-values with limited accuracy, but simultaneously for the whole collection of gene sets, while the FGSEA-multilevel procedure accurately estimates arbitrarily low P-values, but for individual gene sets.
The FGSEA-simple procedure is based on the idea that generated random gene set samples can be shared between different input pathways. Indeed, consider $M$ gene sets of sizes $K_1 \le K_2 \le \ldots \le K_M = K$ and a collection of $n$ independent samples $g_i$ of size $K$ (Fig 1a). As in the naive approach, because the $g_i$ are independent samples of size $K$, the P-value for pathway $M$ can be estimated as the proportion of samples $g_i$ having the same or a more extreme ES value than pathway $M$. However, for any other pathway $j$ we can construct a set of $n$ independent samples of size $K_j$ by considering the prefixes $g_{i,1..K_j}$. Again, given a set of independent samples, the P-value can be estimated as the proportion of the samples having the same or a more extreme ES value.
Figure 1: Preranked gene set enrichment analysis can be sped up by sharing sampling information between different gene set sizes. a, It is sufficient to generate $n$ independent samples of size $K_M$ to calculate the empirical distribution for any size $K_j \le K_M$ by considering only the prefix of size $K_j$ of each sample. b, For a given gene sample, enrichment scores for all the prefixes can be calculated efficiently by employing a square root heuristic. The heuristic recalculates the enrichment score after one gene is added to the gene set in time proportional to the square root of the gene set size. The enrichment curve is split into blocks such that one block ("redo") containing the added gene takes $O(\sqrt{K})$ time to update and the other blocks ("keep" and "shift") take $O(1)$ time. c, The P-values calculated with the FGSEA-simple method are consistent with the reference implementation, but the results are obtained hundreds of times faster.
The next important idea is that, given a gene set sample $g_i$ of size $K$, the ES values for all the prefixes $g_{i,1..j}$ can be calculated efficiently using a square root heuristic (Fig 1b). Briefly, a variant of an enrichment curve is considered: the genes are enumerated from the most up-regulated to the most down-regulated, and the curve goes to the right if the gene is not present in the pathway and upward if it is. It can be shown that the enrichment score can be easily calculated if the curve point most distant from the diagonal is known. Let us split the $K$ genes from the gene set into consecutive blocks of size $\sqrt{K}$ and consider what happens to the curve when we change the prefix from $g_{i,1..j-1}$ to $g_{i,1..j}$ by adding gene $g_{i,j}$. The curve in the blocks to the left of $g_{i,j}$ is not changed at all, while the blocks to the right of $g_{i,j}$ are uniformly shifted. This observation allows us to consider the prefixes in increasing order and update the position of the most distant point in $O(\sqrt{K})$ time. Briefly, for each block that is either unchanged or shifted the update procedure takes $O(1)$ time, while for the changed block the update time is proportional to its size, that is $O(\sqrt{K})$. Finally, aggregating the blocks takes an additional $O(\sqrt{K})$ time. Overall this results in a time complexity of $O(K\sqrt{K})$ to calculate ES values for all the prefixes. In total, the time complexity of calculating P-values for the set of $M$ pathways is $O(n(K\sqrt{K} + M))$, which gives a speed-up of around $\min(K, M/\sqrt{K})$ times compared to a naive approach. The full description of the algorithm is given in section 2.3.
As an example, we ran FGSEA-simple and the reference implementation on the same example dataset of genes differentially regulated upon Th1 activation [4] against a set of 700 Reactome [5] pathways (see section 2.2) and compared the resulting nominal P-values (Fig 1c). Both methods were run with n = 10000, and the results are indistinguishable up to the random noise inherent to both methods. However, on this example the reference implementation (version 4.0.1) took about 420 seconds, while FGSEA-simple finished in about 4 seconds. This two-orders-of-magnitude speed-up is consistent with the theoretical one that follows from the time complexity of the algorithm. Given a highly parallel implementation of FGSEA-simple, its performance makes it routine to reach nominal P-values on the order of $10^{-5}$ and to use standard procedures for multiple hypothesis testing correction, such as the Benjamini-Hochberg procedure, for thousands of gene sets.
However, accurately estimating P-values lower than $10^{-6}$ with FGSEA-simple can be impractical or even infeasible. To estimate such low P-values we developed the FGSEA-multilevel method, which is based on an adaptive multilevel split Monte Carlo scheme [6]. The method takes as input an ES value $\gamma > 0$ and a gene set size $K$, and calculates the probability $P_K(ES \ge \gamma)$ that a random gene set of size $K$ has an enrichment score no less than $\gamma$. The method sequentially finds ES levels $l_i$ for which the probability $P_K(ES \ge l_i)$ is approximately equal to $2^{-i}$ (see Fig 2a for a toy example). The method stops when $l_i$ becomes greater than $\gamma$, and the P-value can be crudely approximated as $2^{-i}$.
Figure 2: Adaptive multilevel Monte Carlo sampling scheme can be used to calculate arbitrarily low P-values. a, A toy illustration of the multilevel split Monte Carlo scheme for a sample size of Z = 5. First, five uniformly random gene sets are generated and the ES level $l_1$ is selected that corresponds to a P-value of $\frac{1}{2}$. Then the five samples are iteratively modified with Metropolis algorithm steps to obtain a uniform sample of gene sets with ES value greater than $l_1$. Based on these samples, a threshold $l_2$ is selected that corresponds to a P-value of $\frac{1}{4}$, and so on. b, Comparison of GSEA P-values as calculated by the FGSEA-simple method run on $10^8$ samples and by FGSEA-multilevel with a sample size of Z = 101. c, Comparison of P-values as calculated with the exact method and with the FGSEA-multilevel method. Both methods were run on gene-level statistic values rounded to integers. d, Comparison between the estimated and the observed error of $\log_2$ P-values for different P-values (from $10^{-4}$ to $10^{-100}$), gene set sizes (from 15 to 250) and sample sizes (from 101 to 1001).
The intermediate $l_i$ thresholds are calculated as follows. First, a set of $Z$ (an odd number, a parameter of the method) random gene sets of size $K$ is generated uniformly and their ES values are calculated. The median of the ES values is assigned to $l_1$. By construction, the probability $P_K(ES \ge l_1)$ that a random gene set has an ES value no less than $l_1$ can be approximated as $\frac{1}{2}$. Next, the $\frac{Z-1}{2}$ generated gene sets with ES values less than $l_1$ are discarded, while the $\frac{Z-1}{2}$ gene sets with ES values greater than $l_1$ are duplicated. This results in a sample of $Z$ gene sets with ES values no less than $l_1$, but the distribution is non-uniform. However, it can be made into a uniform sample with a Metropolis algorithm. On each step of the Metropolis algorithm, a modification of each gene set sample is attempted by swapping a random gene from the set with a gene outside of the set. The change is accepted if the enrichment score of the new set is no less than the current threshold $l_1$; otherwise the change is rejected. The Metropolis algorithm guarantees that after enough steps the sample becomes close to uniformly distributed. Thus, the median of the enrichment scores ($l_2$) corresponds to a probability of $\frac{1}{2}$ for a gene set to have an enrichment score no less than $l_2$ given that it has an enrichment score no less than $l_1$:
$$P_K(ES \ge l_2 \mid ES \ge l_1) \approx \frac{1}{2},$$
which means
$$P_K(ES \ge l_2) = P_K(ES \ge l_2 \mid ES \ge l_1) \cdot P_K(ES \ge l_1) \approx \frac{1}{4}.$$
The same procedure is applied to calculate the next $l_i$ values.
The iterations stop when $l_i$ becomes greater than $\gamma$. On this iteration the current sample is approximately uniform among gene sets with ES values no less than $l_{i-1}$, so the probability that a random gene set has an ES value no less than $\gamma$ can be approximated as:
$$P_K(ES \ge \gamma) \approx 2^{-(i-1)} \cdot \frac{\#\{\text{gene sets in the current sample with } ES \ge \gamma\}}{Z}.$$
When estimating small P-values it becomes practical to carry out the estimation in log-scale. In particular, the values become practically unbiased both in median and mean sense and it becomes simple to estimate the error (see section 2.5.4).
The full formal description of the algorithm is available in the section 2.5.
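The multilevel scheme described above can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions, not the implementation from the FGSEA package: the function names (`es_plus`, `multilevel_pvalue`), the fixed number of Metropolis proposals per level (`steps`), the `max_levels` guard and the 0-based gene indices are all choices made for this example, and every level is assigned the crude factor 1/2 rather than the refined estimate of section 2.5.

```python
import random

def es_plus(stats, gene_set):
    """Positive-mode enrichment score max_i ES_i (stats sorted decreasingly,
    gene_set holds 0-based gene indices)."""
    n, k = len(stats), len(gene_set)
    ns = sum(abs(stats[g]) for g in gene_set)
    best = cum = 0.0
    for j, g in enumerate(sorted(gene_set)):
        cum += abs(stats[g])
        # g - j genes ranked above gene g are not members of the set
        best = max(best, cum / ns - (g - j) / (n - k))
    return best

def multilevel_pvalue(stats, k, gamma, z=101, steps=None, seed=0, max_levels=200):
    """Crude estimate of P_K(ES >= gamma) by multilevel splitting (z must be odd)."""
    assert z % 2 == 1
    rng = random.Random(seed)
    n = len(stats)
    steps = steps or 5 * k  # assumption: fixed number of proposals per set and level
    sets = [rng.sample(range(n), k) for _ in range(z)]
    scores = [es_plus(stats, s) for s in sets]
    for passed in range(max_levels):
        order = sorted(range(z), key=scores.__getitem__)
        level = scores[order[z // 2]]  # median ES of the current sample: l_i
        if level > gamma:
            hits = sum(sc >= gamma for sc in scores)
            return 2.0 ** (-passed) * hits / z
        # keep the median and everything above it, duplicate the strictly upper half
        survivors = order[z // 2:] + order[z // 2 + 1:]
        sets = [list(sets[j]) for j in survivors]
        scores = [scores[j] for j in survivors]
        for _ in range(steps):
            for j in range(z):
                pos, new_gene = rng.randrange(k), rng.randrange(n)
                if new_gene in sets[j]:
                    continue  # proposal leaves the set unchanged
                old_gene = sets[j][pos]
                sets[j][pos] = new_gene
                new_score = es_plus(stats, sets[j])
                if new_score >= level:
                    scores[j] = new_score  # accept: still above the threshold
                else:
                    sets[j][pos] = old_gene  # reject: restore the previous set
    raise RuntimeError("gamma was not reached within max_levels levels")
```

For moderately small P-values the result can be compared against plain sampling; for very small ones only the multilevel estimate remains practical, since the number of levels grows with $\log_2$ of the P-value rather than with its inverse.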
For the example dataset we show that P-values are as low as $10^{-26}$ for some of the pathways and that the results are consistent with FGSEA-simple P-values obtained with $10^8$ permutations (Fig 2b). Note that the FGSEA-multilevel calculation with a sample size of Z = 101 took only 10 seconds on a single thread, while $10^8$ permutations with FGSEA-simple took 40 minutes on 32 threads.
To further confirm the approximation quality of the FGSEA-multilevel algorithm we developed an exact method for calculating GSEA P-values, which is limited to integer weights. The method is based on dynamic programming; the full description is given in section 2.4. The complexity of the algorithm is $O(NKT^2)$, where N is the number of genes, K is the size of the gene sets and T is the sum of the top K absolute values of gene-level statistics. With a number of optimizations this method calculates P-values for rounded weights in the example dataset in a couple of hours.
When run on the same integer weights, FGSEA-multilevel and the exact method give highly concordant results (Fig 2c). Additionally, using the exact P-values as a control, the real errors can be compared with the estimated ones. We show that the FGSEA-multilevel error estimates are highly concordant with the real errors (Fig 2d) for a wide range of P-values (from $10^{-4}$ to $10^{-100}$), gene set sizes (from 15 to 250) and sample sizes (from 101 to 1001).
In practice FGSEA-multilevel method is combined with FGSEA-simple. First, for all the input pathways FGSEA-simple method can be run with a limited sample size. Next, for the pathways that have high relative error after FGSEA-simple (i.e. pathways with low p-values) FGSEA-multilevel method is executed. As many of the pathways in an input collection usually are not enriched, they have a relatively high P-value and will be batch-processed with a highly efficient FGSEA-simple algorithm with deterministic time boundaries. The more interesting pathways with lower P-values will then be processed with FGSEA-multilevel algorithm individually and the amount of processing time will depend on their P-values.
Finally, as FGSEA makes it practical to estimate P-values for large collections of gene sets, it can lead to a large number of statistically significant hits with high overlaps. To deal with this issue and make the representation of FGSEA results more concise, we developed a procedure to filter the redundant gene sets. The procedure is similar to the GO Trimming method [7] but is based on Bayesian network construction approaches. It considers the significant pathways one by one and tries to remove gene sets that do not provide new information given some other pathway already present in the output. In this case, we consider a pathway P1 to give new information given a pathway P2 if the P-value of pathway P1 in the universe of genes from P2 or in the universe of genes outside of P2 is less than some threshold. This procedure filters redundant pathways without requiring any explicit hierarchy of pathways. The full description of the procedure is given in section 2.6. The table resulting from running FGSEA on the example dataset with filtering of redundant hits is shown in Fig 3.
Figure 3: An example of FGSEA results obtained with the FGSEA-multilevel method for the Th0 vs Th1 comparison and Reactome pathways. The analysis was run with a sample size of 101. Redundant pathways were filtered.
To conclude, here we present FGSEA, a method for fast preranked gene set enrichment analysis. The method makes it routine to estimate even very low P-values and can be used in conjunction with standard multiple hypothesis testing correction methods, such as the Benjamini-Hochberg procedure. This, in turn, makes it possible to analyze even large collections of pathways, which require a very low nominal P-value for a pathway to remain significant after multiple hypothesis testing correction. The FGSEA method is freely available as an R package at Bioconductor (http://bioconductor.org/packages/fgsea) and on GitHub (https://github.com/ctlab/fgsea).
2 Methods
2.1 Formal definitions
The preranked gene set enrichment analysis takes as input two objects: an array of gene-level statistic values S for the genes U = {1, 2, …, N} and a list of query gene sets (pathways) P. The goal of the analysis is to determine which of the gene sets from P has a non-random behavior.
The statistic array S of the size |S| = N for each gene i ∈ U contains a value Si ∈ ℝ that characterizes the gene behavior in a considered biological process. Commonly, if Si > 0 the expression of gene i goes up on treatment compared to control and Si < 0 means that the expression goes down. Absolute values |Si| represent magnitude of the change. Array S is sorted in a decreasing order: Si > Sj for i < j. The value of N in practice is about 10000–20000.
The list of gene sets P = {P1, P2, …, PM} of length M usually contains groups of genes that are commonly regulated in some biological process. We assume that the gene sets Pi are ordered by their size (denoted as Ki): K1 ≤ K2 ≤ … ≤ KM = K. Usually only relatively small gene sets are considered with K ≈ 500 genes.
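To quantify the co-regulation of genes in a gene set $p$, Subramanian et al. [1] introduced a gene set enrichment score function $s_r(p)$ that uses gene rankings (values of $S$). The more positive the value of $s_r(p)$, the more enriched the gene set is in positively regulated genes (with $S_i > 0$). Accordingly, a negative $s_r(p)$ corresponds to enrichment in negatively regulated genes.
The value of $s_r(p)$ can be calculated as follows. Let $k = |p|$ and $NS = \sum_{i \in p} |S_i|$. Let also $ES$ be an array specified by the following formula:
$$ES_i = \begin{cases} 0 & \text{if } i = 0, \\ ES_{i-1} + \frac{1}{NS}|S_i| & \text{if } 1 \le i \le N \text{ and } i \in p, \\ ES_{i-1} - \frac{1}{N-k} & \text{if } 1 \le i \le N \text{ and } i \notin p. \end{cases}$$
The value of $s_r(p)$ corresponds to the entry of $ES$ that is the largest in absolute value:
$$s_r(p) = ES_{i^*}, \quad \text{where } i^* = \arg\max_{0 \le i \le N} |ES_i|.$$
For convenience, we also introduce the following notation:
$$s_r^+(p) = \max_{0 \le i \le N} ES_i, \qquad s_r^-(p) = \min_{0 \le i \le N} ES_i.$$
From these two values it is easy to find the value of $s_r(p)$, which is equal to $s_r^+(p)$ if $|s_r^+(p)| > |s_r^-(p)|$, or to $s_r^-(p)$ otherwise.
Often we will consider only the positive values of the gene set enrichment score function, since:
$$s_r^-(p) = -s_{r'}^+(p'),$$
where $p' = \{N - i + 1 \mid i \in p\}$ and $s_{r'}^+$ corresponds to the gene set enrichment score function for the array $S'$ such that $S'_i = -S_{N-i+1}$.
Next, following Subramanian et al., for a pathway $p$ we define the GSEA P-value as:
$$\text{P-value}(p) = \begin{cases} P\left(s_r(q) \ge s_r(p) \mid s_r(q) \ge 0\right) & \text{if } s_r(p) > 0, \\ P\left(s_r(q) \le s_r(p) \mid s_r(q) \le 0\right) & \text{otherwise}, \end{cases}$$
where $q$ is a random gene set of size $k$.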
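As an illustration of the definitions above, the following Python sketch computes the running ES array implicitly and returns $s_r(p)$ together with $s_r^+(p)$ and $s_r^-(p)$. The function name `enrichment_scores` and the use of 0-based gene indices are assumptions made for this example only.

```python
def enrichment_scores(stats, pathway):
    """Return (s_r, s_r_plus, s_r_minus) for one pathway.

    stats   -- gene-level statistics S, sorted in decreasing order
    pathway -- iterable of 0-based gene indices into stats
    """
    members = set(pathway)
    n, k = len(stats), len(members)
    ns = sum(abs(stats[i]) for i in members)  # NS: total absolute weight of the set
    es = 0.0
    es_max = es_min = 0.0  # ES_0 = 0 takes part in both extremes
    for i in range(n):
        if i in members:
            es += abs(stats[i]) / ns   # step up, weighted by |S_i|
        else:
            es -= 1.0 / (n - k)        # uniform step down
        es_max = max(es_max, es)
        es_min = min(es_min, es)
    s_r = es_max if abs(es_max) > abs(es_min) else es_min
    return s_r, es_max, es_min
```

For example, with `stats = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]` the pathway `[0, 1]` made of the two top genes reaches the maximal possible positive score, while the pathway `[4, 5]` made of the two bottom genes reaches the most negative one, which mirrors the symmetry between $s_r^-$ and $s_r^+$ on the reversed array.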
2.2 The example data
As the example ranking we used Th0 vs Th1 comparison from dataset GSE14308 [4]. The differential expression was calculated using limma [8]. Only top 12000 genes by mean expression were used. Limma t-statistic was used as gene-level statistic. The script to generate rankings is available on GitHub: https://github.com/ctlab/fgsea/blob/master/inst/gen_gene_ranks.R.
Reactome [5] database was used as an example collection via reactome.db R package. For the analysis only the pathways of the size from 15 to 500 were used. The script to generate pathway collection is available on GitHub: https://github.com/ctlab/fgsea/blob/master/inst/gene_reactome_pathways.R
2.3 FGSEA-simple: an algorithm for fast calculation of GSEA P-values simultaneously for many pathways
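In this section we describe an algorithm for fast estimation of GSEA P-values simultaneously for a collection of pathways P. For each pathway $p$, a set of $n$ uniformly random gene sets $q_i$ of the same size is considered. Then the P-value is estimated as:
$$\text{P-value}(p) \approx \frac{\sum_{i=1}^{n} [s_r(q_i) \ge s_r(p)] + 1}{\sum_{i=1}^{n} [s_r(q_i) \ge 0] + 1}$$
for a positively enriched pathway $p$ and as:
$$\text{P-value}(p) \approx \frac{\sum_{i=1}^{n} [s_r(q_i) \le s_r(p)] + 1}{\sum_{i=1}^{n} [s_r(q_i) \le 0] + 1}$$
for a negatively enriched pathway, where $[\cdot]$ equals 1 if the condition holds and 0 otherwise. These two formulas follow the implementation of Subramanian et al., except for the +1 terms, which are recommended by Phipson and Smyth [9]. Otherwise, the nominal P-values from FGSEA-simple and the reference implementation are indistinguishable; however, FGSEA-simple works orders of magnitude faster.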
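A naive version of this estimate, without any of the speed-ups described below, might look as follows in Python. The names `enrichment_score` and `simple_pvalue`, the `seed` parameter and the 0-based gene indices are assumptions of this sketch; it draws a fresh set of random gene sets for a single pathway and applies the ratio with the +1 terms.

```python
import random

def enrichment_score(stats, members):
    """s_r(p): the ES_i value with the largest absolute value (0-based gene indices)."""
    members = set(members)
    n, k = len(stats), len(members)
    ns = sum(abs(stats[i]) for i in members)
    es = es_max = es_min = 0.0
    for i in range(n):
        es += abs(stats[i]) / ns if i in members else -1.0 / (n - k)
        es_max, es_min = max(es_max, es), min(es_min, es)
    return es_max if abs(es_max) > abs(es_min) else es_min

def simple_pvalue(stats, pathway, n_samples=1000, seed=0):
    """Empirical GSEA P-value with the +1 terms, from n_samples random gene sets."""
    rng = random.Random(seed)
    k = len(set(pathway))
    observed = enrichment_score(stats, pathway)
    null = [enrichment_score(stats, rng.sample(range(len(stats)), k))
            for _ in range(n_samples)]
    if observed >= 0:
        extreme = sum(v >= observed for v in null)    # as or more extreme, positive side
        same_side = sum(v >= 0 for v in null)
    else:
        extreme = sum(v <= observed for v in null)    # as or more extreme, negative side
        same_side = sum(v <= 0 for v in null)
    return (extreme + 1) / (same_side + 1)
```

Because of the +1 terms the estimate can never be exactly zero: with `n_samples` random gene sets the smallest attainable value is roughly 2 / `n_samples`, which is why very low P-values require either many samples or a different scheme.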
2.3.1 Cumulative statistic calculation for the mean statistic
Let us first describe the idea of the proposed algorithm using a simple mean statistic $s_m$:
$$s_m(p) = \frac{1}{|p|} \sum_{i \in p} S_i.$$
The main idea of the algorithm is to reuse sampling for different query gene sets. This can be done because, for the estimation of null distributions, samples have to be independent only within a specific gene set size, while they can be dependent between different sizes.
Instead of generating $nM$ independent random gene sets ($n$ for each of the $M$ input gene sets), we will generate only $n$ random gene sets of size $K$. Let $\pi_i$ be the $i$-th random gene set of size $K$. From that gene set we can generate gene sets for all the query pathways $P_j$ by using its prefixes: $\pi_{i,j} = \pi_i[1..K_j]$.
The next step is to calculate the enrichment scores for all gene sets $\pi_{i,j}$. Instead of calculating enrichment scores separately for each gene set, we will simultaneously calculate the scores of all $\pi_{i,j}$ for a fixed $i$. Using a simple procedure this can be done in $\Theta(K)$ time.
Let us find enrichment scores for all prefixes of $\pi_i$. This can be done by element-wise division of the cumulative sums array by the length of the corresponding prefix:
$$s_m(\pi_i[1..k]) = \frac{1}{k} \sum_{j=1}^{k} S_{\pi_i[j]}, \quad 1 \le k \le K.$$
Selecting only the required prefixes takes an additional $\Theta(M)$ time.
The described procedure finds P-values for all query gene sets in $\Theta(n(K + M))$ time. This is about $\min(K, M)$ times faster than the straightforward procedure.
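The prefix-sharing idea for the mean statistic can be sketched as follows. The function names, the `seed` parameter and the one-sided upper-tail P-value with a +1 correction are assumptions made for this illustration; the essential part is that one sample of size $K$ yields the scores of all smaller sizes through a single cumulative sum.

```python
import random
from itertools import accumulate

def prefix_mean_scores(stats, sample, sizes):
    """Mean statistic of every requested prefix of one random sample."""
    cumulative = list(accumulate(stats[g] for g in sample))  # Theta(K) cumulative sums
    return [cumulative[k - 1] / k for k in sizes]            # Theta(M) prefix lookups

def mean_stat_pvalues(stats, pathways, n_samples=1000, seed=0):
    """Upper-tail empirical P-values of the mean statistic for all pathways at once."""
    rng = random.Random(seed)
    sizes = [len(p) for p in pathways]
    k_max = max(sizes)
    observed = [sum(stats[g] for g in p) / len(p) for p in pathways]
    hits = [0] * len(pathways)
    for _ in range(n_samples):
        # one sample of size K; each of its prefixes is itself a uniform random gene set
        sample = rng.sample(range(len(stats)), k_max)
        for j, score in enumerate(prefix_mean_scores(stats, sample, sizes)):
            hits[j] += score >= observed[j]
    return [(h + 1) / (n_samples + 1) for h in hits]
```

The sketch relies on `random.sample` returning the elements in selection order, so that every prefix of the returned list is a valid random sample of the smaller size.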
2.3.2 Cumulative statistic calculation for enrichment score
For the enrichment score $s_r$ we use a similar idea: we will again sample only gene sets of size $K$ and from that sample calculate statistic values for all the other sizes. However, the calculation of the cumulative statistic values for the subsamples is more complex in this case. In this section we will only consider the positive mode of the enrichment statistic, $s_r^+$.
It is helpful to look at the enrichment score from a geometric point of view. Let us consider, for a pathway $p$ of size $|p| = k$, a graph of $N + 1$ points (Fig. 4) with coordinates $(x_i, y_i)$ for $0 \le i \le N$ such that:
$$x_0 = 0, \quad y_0 = 0, \tag{1}$$
$$x_i = x_{i-1} + [i \notin p], \tag{2}$$
$$y_i = y_{i-1} + [i \in p] \cdot |S_i|. \tag{3}$$
Figure 4: A graph that corresponds to the calculation of the enrichment score. Each breakpoint on the graph corresponds to a gene present in the pathway. Dotted lines cross at the point that is the farthest up from the diagonal (dashed line). This point corresponds to gene $i^+$, where the maximal value of $ES_i$ is reached.
The calculation of $s_r^+(p)$ corresponds to finding the point farthest up from the diagonal $((x_0, y_0), (x_N, y_N))$. Indeed, it is easy to see that $x_N = N - |p| = N - k$ and $y_N = \sum_{j \in p} |S_j| = NS$, while the individual enrichment scores $ES_i$ can be calculated as $ES_i = \frac{y_i}{y_N} - \frac{x_i}{x_N}$. The value of $ES_i$ is proportional to the directed distance from the line going through $(x_0, y_0)$ and $(x_N, y_N)$ to the point $(x_i, y_i)$.
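Let us fix a sample $\pi$ of size $K$. To efficiently calculate the cumulative values $s_r^+(\pi[1..k])$ for all $k \le K$ we need a fast method of updating the farthest point when a new gene is added. In that case we can add genes from $\pi$ one by one and calculate the values $s_r^+(\pi[1..k])$ from the corresponding maximal distances.
Because we are calculating values for $\pi[1..k]$ for $k \le K$, we know in advance which $K$ genes will be added. This allows us to consider $K + 1$ points instead of $N + 1$ for each iteration $k$. Let the array $o$ of size $K$ contain the sorted order of genes in $\pi$: that is, $\pi_{o_1}$ is the minimal among $\pi$, $\pi_{o_2}$ is the second minimal and so on. The coordinates can be calculated as follows:
$$x^k_0 = 0, \quad y^k_0 = 0, \tag{4}$$
$$x^k_i = x^k_{i-1} + \pi_{o_i} - \pi_{o_{i-1}} - [o_i \le k], \tag{5}$$
$$y^k_i = y^k_{i-1} + [o_i \le k] \cdot |S_{\pi_{o_i}}|, \tag{6}$$
where we set $\pi_{o_0}$ to be zero.
It can be shown that finding the farthest up point among (4)–(6) is equivalent to finding the farthest up point among (1)–(3), with $(x^k_i, y^k_i)$ being equal to $(x_{\pi_{o_i}}, y_{\pi_{o_i}})$ calculated for $p = \pi[1..k]$. Consider $x_{\pi_{o_i}} - x_{\pi_{o_{i-1}}}$. By the definition of $x$ it is equal to:
$$\sum_{j=\pi_{o_{i-1}}+1}^{\pi_{o_i}} [j \notin \pi[1..k]] = \pi_{o_i} - \pi_{o_{i-1}} - \sum_{j=\pi_{o_{i-1}}+1}^{\pi_{o_i}} [j \in \pi[1..k]].$$
By the definition of $o$, in the interval $[\pi_{o_{i-1}} + 1, \pi_{o_i} - 1]$ there are no genes from $\pi$ and, thus, from $\pi[1..k]$. Thus we can replace the sum with its last member:
$$x_{\pi_{o_i}} - x_{\pi_{o_{i-1}}} = \pi_{o_i} - \pi_{o_{i-1}} - [\pi_{o_i} \in \pi[1..k]] = \pi_{o_i} - \pi_{o_{i-1}} - [o_i \le k].$$
We got the same difference as in (5).
Now consider $y_{\pi_{o_i}} - y_{\pi_{o_{i-1}}}$. By the definition of $y$ it is equal to:
$$\sum_{j=\pi_{o_{i-1}}+1}^{\pi_{o_i}} [j \in \pi[1..k]] \cdot |S_j|.$$
Again, in the interval $[\pi_{o_{i-1}} + 1, \pi_{o_i} - 1]$ there are no genes from $\pi[1..k]$. Thus we can replace the sum with only the last member:
$$y_{\pi_{o_i}} - y_{\pi_{o_{i-1}}} = [\pi_{o_i} \in \pi[1..k]] \cdot |S_{\pi_{o_i}}| = [o_i \le k] \cdot |S_{\pi_{o_i}}|.$$
We got the same difference as in (6).
We do not need to consider other points, because the points from $\pi_{o_{i-1}}$ to $\pi_{o_i} - 1$ have the same $y$ coordinate and $\pi_{o_{i-1}}$ is the leftmost of them. Thus, when at least one gene is added, the diagonal $((x_0, y_0), (x_N, y_N))$ is not horizontal and $\pi_{o_{i-1}}$ is the farthest point among $\pi_{o_{i-1}}, \ldots, \pi_{o_i} - 1$.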
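The equivalence between the full graph of $N + 1$ points and the reduced graph of $K + 1$ points can be checked numerically. The sketch below is an illustration only: the function names are hypothetical, genes are numbered from 1 as in the text, and both functions recompute everything from scratch instead of updating incrementally.

```python
def es_plus_full(stats, members):
    """max_i ES_i from all N + 1 points of the full graph; genes are 1-based."""
    n = len(stats)
    x = y = 0
    points = [(0, 0)]
    for gene in range(1, n + 1):
        if gene in members:
            y += abs(stats[gene - 1])   # step up for a pathway gene
        else:
            x += 1                      # step right for any other gene
        points.append((x, y))
    x_n, y_n = points[-1]
    return max(py / y_n - px / x_n for px, py in points)

def es_plus_compressed(stats, pi, k):
    """Same value from the K + 1 points of the reduced graph for p = pi[1..k]."""
    n = len(stats)
    # order[i-1] is o_i: the 1-based position in pi of the i-th smallest gene
    order = sorted(range(1, len(pi) + 1), key=lambda j: pi[j - 1])
    x = y = 0
    prev_gene = 0                       # pi_{o_0} is set to zero
    points = [(0, 0)]
    for o_i in order:
        gene = pi[o_i - 1]
        in_prefix = o_i <= k
        x += gene - prev_gene - (1 if in_prefix else 0)
        y += abs(stats[gene - 1]) if in_prefix else 0
        points.append((x, y))
        prev_gene = gene
    x_n = n - k
    y_n = sum(abs(stats[g - 1]) for g in pi[:k])
    return max(py / y_n - px / x_n for px, py in points)
```

With `stats = [5, 4, 3, 2, 1, -1, -2, -3, -4, -5]` and `pi = [7, 2, 9, 4]`, the two functions agree for every prefix length `k` from 1 to 4, although the second one only ever looks at five points.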
Now let us consider what happens to the enrichment score graph when gene $\pi_k$ is added to the query set $\pi[1..k-1]$ (Fig. 5). Let $r_k$ be the rank of gene $\pi_k$ among the genes of $\pi$; then the coordinates of points $(x_i, y_i)$ for $i < r_k$ do not change, while all $(x_i, y_i)$ for $i \ge r_k$ are shifted by $(-1, |S_{\pi_k}|)$.
Figure 5: Update of an enrichment score graph when gene $\pi_k \approx 800$ is added. Only a fragment is shown. The black graph corresponds to the gene set $\pi[1..k-1]$, the gray graph corresponds to $\pi[1..k]$. The part of the graph to the left of $\pi_k$ does not change and the other part is shifted towards the top-left corner. The diagonal $((x_0, y_0), (x_N, y_N))$ is rotated counterclockwise.
To make fast incremental updates we will decompose the problem into multiple smaller ones. For simplicity we assume that $K + 1$ is an exact square of an integer $b$. Let us split the $K + 1$ points into $b$ consecutive blocks of size $b$: $\{(x_0, y_0), \ldots, (x_{b-1}, y_{b-1})\}$, $\{(x_b, y_b), \ldots, (x_{2b-1}, y_{2b-1})\}$ and so on.
For each of b blocks we will store and update the farthest up point from the diagonal. When we know for each block its farthest point we can find the globally farthest point by a simple pass in O(b) time.
Next, we show how to update the farthest points in blocks in amortized time O(b). This taken together with one O(b) pass will get us an algorithm to update the globally farthest point in amortized O(b) time.
Below we use $c = \lfloor r_k / b \rfloor$ as the index of the block that gene $\pi_k$ belongs to, where $r_k$ is the rank of gene $\pi_k$ among the genes of $\pi$, i.e. $o_{r_k} = k$.
First, we describe the procedure to update point coordinates. We will store the $x_i$ coordinates using two vectors: $B$ of size $b$ and $D$ of size $K + 1$, such that $x_i = B_{\lfloor i/b \rfloor} + D_i$. When gene $\pi_k$ is added, all $x_i$ for $i \ge r_k$ are decremented by one. To reflect this we will decrement all $B_j$ for $j > c$ and decrement all $D_i$ for $r_k \le i < (c + 1)b$. The update takes $O(b)$ time. After this update procedure we can get the value $x_i$ in $O(1)$ time. The same procedure is applied to the $y$ coordinates.
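The two-vector storage can be illustrated with a small Python class. The class and method names are hypothetical; the sketch only shows the coordinate bookkeeping (a suffix update in $O(b)$ and a lookup in $O(1)$), not the rest of the algorithm.

```python
class BlockedOffsets:
    """Values v_i = B[i // b] + D[i] with 'add delta to every v_i, i >= r' in O(b)."""

    def __init__(self, values, b):
        self.b = b
        self.block = [0] * ((len(values) + b - 1) // b)  # B: one offset per block
        self.local = list(values)                        # D: per-point remainder

    def get(self, i):
        # O(1): block offset plus the local value
        return self.block[i // self.b] + self.local[i]

    def add_to_suffix(self, r, delta):
        c = r // self.b
        # the block containing r is only partly affected: touch its points directly
        for i in range(r, min((c + 1) * self.b, len(self.local))):
            self.local[i] += delta
        # every later block is shifted as a whole: touch only its offset
        for j in range(c + 1, len(self.block)):
            self.block[j] += delta
```

For nine points and `b = 3`, calling `add_to_suffix(4, -1)` changes two entries of `D` and one entry of `B`, and afterwards `get(i)` returns the decremented value for every `i >= 4`.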
Second, for each block we will maintain the upper part of its convex hull. Having the convex hull is useful because the farthest point in a block always lies on its convex hull. In all blocks except $c$ the points are either not changed or shifted simultaneously by the same vector. That means that the lists of points on the convex hulls of these blocks remain unchanged. For the block $c$ we can reconstruct the convex hull from scratch using the Graham scan algorithm [10]. Because the points are already sorted by the $x$ coordinate, this reconstruction takes $O(b)$ time. In total, it takes $O(b)$ time to update the convex hulls.
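A minimal sketch of the hull construction for one block is given below. It uses a Graham-scan-style stack pass over points that are already sorted by $x$ (ties broken by $y$), which is the situation described above; the function names are hypothetical, and `height_above_diagonal` is merely a quantity proportional to $ES_i$ that avoids divisions.

```python
def cross(o, a, b):
    """Cross product of vectors o->a and o->b; negative means a clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def upper_hull(points):
    """Upper convex hull of points sorted by x (ties by y), in one O(b) pass."""
    hull = []
    for p in points:
        # drop the last hull point while it lies on or below the segment hull[-2] -> p
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull

def height_above_diagonal(point, end):
    """y * x_N - x * y_N: proportional to ES for the diagonal (0, 0) -> (x_N, y_N)."""
    return point[1] * end[0] - point[0] * end[1]
```

Since the height above the diagonal is a linear function of the coordinates with a positive coefficient at $y$, its maximum over a block is always attained at one of the points returned by `upper_hull`, so only the hull points need to be inspected.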
Third, the farthest points in blocks can be updated using the stored convex hulls. Consider a block where the convex hull was not changed (every block except, possibly, block c). Because diagonal always rotates in the same counterclockwise direction, the farthest point in block on iteration k either stays the same or moves on the convex hull to the left of the farthest point on the (k − 1)-th iteration. Thus, for each such block we can compare current farthest point with its left neighbor on the convex hull and update the point if necessary. It is repeated until the next neighbor is closer to the diagonal than the current farthest point. In the block c we just find the farthest point in a single pass by the points on the convex hull.
To show that updating the farthest points takes $O(b)$ amortized time we will use the potential method. Let the potential $\Phi_k$ after adding the $k$-th gene be the sum of the relative indexes of the farthest points over all the blocks. As there are $b$ blocks of size $b$, the sum of relative indexes lies between 0 and $b^2$. Thus, $\Phi_k = O(b^2)$. To update all $b - 1$ blocks except $c$ we need to make $t_k = b - 1 + z$ comparisons of two points, where $z$ is the number of times the farthest points were updated. This can take up to $\Theta(b^2)$ time in the worst case. However, the potential change $\Phi_k - \Phi_{k-1}$ is equal to $-z + O(b)$: the sum of indexes decreases by the number of times the farthest points were updated, plus $O(b)$ for the block $c$, where the index can go from 0 to $b - 1$. This gives an amortized cost of the $k$-th iteration of $a_k = t_k + \Phi_k - \Phi_{k-1} = b - 1 + z - z + O(b) = O(b)$. The total real cost of $K$ iterations is $\sum_{k=1}^{K} t_k = \sum_{k=1}^{K} a_k + \Phi_0 - \Phi_K = O(Kb + b^2) = O(Kb)$, which means that the amortized cost of one iteration is $O(b)$.
Taken together, the algorithm finds all cumulative enrichment scores $s_r(\pi[1..k])$ in $O(K\sqrt{K})$ time. The straightforward implementation that calculates the cumulative values from scratch would take $O(K^2 \log K)$ time. Thus, we have improved the performance by a factor of $O(\sqrt{K} \log K)$.
2.3.3 Implementation details
We also implemented an optimization so that the algorithm does not build convex hull from scratch for a changed block c, but only updates the changed points. This does not influence the asymptotic performance, but decreases the constant factor.
First, we start updating the convex hull from position $r_k$ and not from the start. To be able to do this, we keep an array prev that for each gene $g \in \pi$ stores the previous point on the convex hull if $g$ were the last gene in the block. This is the same as the top of the stack in the Graham algorithm and represents the algorithm's state for any given point. As all points $h$ to the left of $g$ are not changed, $prev_h$ also remains unchanged and does not need to be recalculated.
Second, we stop updating the hull, when we reach the point on the previous iteration convex hull. We can do this because every point to the left of g is rotated counterclockwise of any point to the right of g, which means that the first point on the convex hull right of g on (k − 1)-th iteration remains being a convex hull point at k-th iteration.
2.4 An algorithm for exact calculation of GSEA P-values for integer gene-level statistics
In this section we describe a polynomial algorithm to calculate the GSEA P-value exactly, but only for the case when gene-level statistics are integer numbers: $S_i \in \mathbb{Z}$. For simplicity we will consider the problem of calculating the following probability:
$$P\left(s_r^+(q) \ge \gamma\right),$$
where $q$ is a random gene set of size $k$. We also assume $\gamma > 0$.
Let us denote the sum of the $k$ largest absolute values of gene-level statistics by $T$. The algorithm will be polynomial in terms of $N$, $k$ and $T$.
2.4.1 The basic algorithm
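Let us consider a gene set $q = \{q_1, q_2, \ldots, q_k\}$. Recall the formula for $s_r^+(q)$: $s_r^+(q) = \max_{0 \le i \le N} ES_i$, where
$$ES_i = \begin{cases} 0 & \text{if } i = 0, \\ ES_{i-1} + \frac{1}{NS}|S_i| & \text{if } 1 \le i \le N \text{ and } i \in q, \\ ES_{i-1} - \frac{1}{N-k} & \text{if } 1 \le i \le N \text{ and } i \notin q, \end{cases} \qquad NS = \sum_{j \in q} |S_j|.$$
First, let us rewrite the formula for $ES_i$ in an equivalent fashion, grouping positive and negative summands:
$$ES_i = \frac{1}{NS} \sum_{j \le i,\ j \in q} |S_j| - \frac{1}{N-k} \sum_{j \le i,\ j \notin q} 1.$$
Then for calculating $ES_i$ the following values are sufficient:
$i$: the index of the current gene;
$c = |\{j \in q : j \le i\}|$: the number of genes included into the set $q$ among genes $1..i$;
$s = \sum_{j \in q,\ j \le i} |S_j|$: the sum of the absolute values of gene-level statistics for genes included in the set among genes $1..i$;
$NS = \sum_{j \in q} |S_j|$: the sum of the absolute values of gene-level statistics for all genes in the set.
Knowing the values above, $ES_i$ can be calculated as $ES_i = \frac{s}{NS} - \frac{i - c}{N - k}$.
Notice that $NS$ can take only integer values from 0 to $T$ (the latter for a set of genes with the largest absolute values of gene-level statistics). Let us split the desired probability into a sum of probabilities of mutually exclusive events based on the value of $NS$:
$$P\left(s_r^+(q) \ge \gamma\right) = \sum_{NS=0}^{T} P\Big(s_r^+(q) \ge \gamma \ \wedge\ \sum_{j \in q} |S_j| = NS\Big).$$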
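Our algorithm will be based on dynamic programming. For each possible value of $NS$ we will process the genes one by one in increasing order of index and calculate an array $f_{NS}(i, c, s)$. The value $f_{NS}(i, c, s)$ will contain the probability for a uniformly random gene set $q'$ of $c$ genes selected from genes $1..i$ to simultaneously have the following two properties:
the sum of the absolute values of gene-level statistics of genes from $q'$ is equal to $s$;
$ES_j < \gamma$ holds for all $j \le i$, where the values of $ES$ are calculated for the gene set $q'$ but using the selected values of $NS$ and $k$, not the ones calculated for the set $q'$.
Suppose that we have calculated all values of $f_{NS}(i, c, s)$, then
$$P\Big(s_r^+(q) < \gamma \ \wedge\ \sum_{j \in q} |S_j| = NS\Big) = f_{NS}(N, k, NS)$$
and
$$P\Big(s_r^+(q) \ge \gamma \ \wedge\ \sum_{j \in q} |S_j| = NS\Big) = P\Big(\sum_{j \in q} |S_j| = NS\Big) - f_{NS}(N, k, NS).$$
Finally, the sought probability is equal to:
$$P\left(s_r^+(q) \ge \gamma\right) = 1 - \sum_{NS=0}^{T} f_{NS}(N, k, NS).$$
Let us find a formula for $f_{NS}(i, c, s)$. The base case of dynamic programming is $i = 0$ for all $NS$:
$$f_{NS}(0, c, s) = \begin{cases} 1 & \text{if } c = 0 \text{ and } s = 0, \\ 0 & \text{otherwise}. \end{cases}$$
Suppose we want to calculate $f_{NS}(i, c, s)$ for some $i > 0$. First, calculate
$$ES_i = \frac{s}{NS} - \frac{i - c}{N - k}$$
and compare it to $\gamma$. If $ES_i \ge \gamma$, then $f_{NS}(i, c, s) = 0$ by definition.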
Otherwise, condition “ESj < γ holds for all j ≤ i” can be simplified to “ESj < γ holds for all j ≤ i − 1”. This observation allows us to use values of f that have already been calculated. Consider two cases:
Gene $i$ does not belong to the set $q'$. As $q'$ is a set of $c$ genes chosen uniformly at random from $i$ genes, this case happens with the probability $\frac{i - c}{i}$. The conditional probability that such a set satisfies the two necessary properties is $f_{NS}(i - 1, c, s)$. Indeed, any set of size $c$ with the sum of absolute values of gene-level statistics equal to $s$, chosen among genes $1..i - 1$ and satisfying the conditions on $ES$, is a valid set chosen among genes $1..i$. Similarly, if a set does not satisfy the condition on $ES_j$ for some $j \le i - 1$, this set should not be counted towards $f_{NS}(i, c, s)$ since obviously $j \le i$.
Gene $i$ belongs to the set. This case happens with the probability $\frac{c}{i}$. The probability that this set satisfies the necessary conditions is $f_{NS}(i - 1, c - 1, s - |S_i|)$. Indeed, any set of size $c - 1$ with the sum of absolute values of gene-level statistics equal to $s - |S_i|$, chosen among genes $1..i - 1$ and satisfying the conditions on $ES$, can be extended with gene $i$, thus forming a set of size $c$ satisfying both necessary properties. Similarly, if a set does not satisfy the condition on $ES_j$ for some $j \le i - 1$, adding gene $i$ will not fix the situation.
Then we can calculate $f_{NS}(i, c, s)$ using the law of total probability:
$$f_{NS}(i, c, s) = \frac{i - c}{i} \cdot f_{NS}(i - 1, c, s) + \frac{c}{i} \cdot f_{NS}(i - 1, c - 1, s - |S_i|)$$
in the case when $i > 0$ and $ES_i < \gamma$.
Putting all the cases together, we arrive at the final formula for $f_{NS}(i, c, s)$, where values with negative $c$ or $s$ are taken to be zero:
$$f_{NS}(i, c, s) = \begin{cases} 1 & \text{if } i = 0,\ c = 0 \text{ and } s = 0, \\ 0 & \text{if } i = 0 \text{ and } (c, s) \ne (0, 0), \\ 0 & \text{if } i > 0 \text{ and } ES_i \ge \gamma, \\ \frac{i - c}{i} \cdot f_{NS}(i - 1, c, s) + \frac{c}{i} \cdot f_{NS}(i - 1, c - 1, s - |S_i|) & \text{if } i > 0 \text{ and } ES_i < \gamma. \end{cases}$$
The overall complexity of the algorithm is $O(NkT^2)$. The values of $f$ can be evaluated sequentially in increasing order of $i$. It is enough to evaluate $f_{NS}(i, c, s)$ for $0 \le i \le N$, $0 \le c \le k$, and $0 \le s \le NS \le T$. Each value of $f$ can be evaluated in constant time.
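The basic dynamic programming scheme can be written down almost literally. The Python sketch below is an unoptimized illustration under additional assumptions: the function names are hypothetical, all gene-level statistics are assumed to be non-zero integers (so that $NS \ge 1$), and exact rational arithmetic from the standard `fractions` module is used instead of floating point. A brute-force enumeration is included only as a cross-check for tiny inputs.

```python
from fractions import Fraction
from itertools import combinations

def exact_tail_probability(stats, k, gamma):
    """P(s_r^+(q) >= gamma) for a uniformly random k-subset q (non-zero integer stats)."""
    n = len(stats)
    w = [abs(v) for v in stats]
    t = sum(sorted(w, reverse=True)[:k])  # T: largest possible NS
    gamma = Fraction(gamma)
    p_below = Fraction(0)
    for ns in range(1, t + 1):
        # f[c][s] holds f_NS(i, c, s) for the current i; start with the base case i = 0
        f = [[Fraction(0)] * (ns + 1) for _ in range(k + 1)]
        f[0][0] = Fraction(1)
        for i in range(1, n + 1):
            nxt = [[Fraction(0)] * (ns + 1) for _ in range(k + 1)]
            for c in range(min(i, k) + 1):
                for s in range(ns + 1):
                    es = Fraction(s, ns) - Fraction(i - c, n - k)
                    if es >= gamma:
                        continue  # the curve reached gamma: state contributes nothing
                    value = Fraction(i - c, i) * f[c][s]          # gene i not in the set
                    if c >= 1 and s >= w[i - 1]:
                        value += Fraction(c, i) * f[c - 1][s - w[i - 1]]  # gene i in the set
                    nxt[c][s] = value
            f = nxt
        p_below += f[k][ns]
    return 1 - p_below

def brute_force_tail_probability(stats, k, gamma):
    """Same probability by enumerating every k-subset; feasible for tiny inputs only."""
    n = len(stats)
    hits = total = 0
    for q in combinations(range(n), k):
        ns = sum(abs(stats[j]) for j in q)
        es = best = Fraction(0)
        for i in range(n):
            es += Fraction(abs(stats[i]), ns) if i in q else -Fraction(1, n - k)
            best = max(best, es)
        hits += best >= gamma
        total += 1
    return Fraction(hits, total)
```

For instance, for the statistics `[2, 1, -1]`, gene sets of size 1 and $\gamma = 1/2$, two of the three possible gene sets reach the threshold, and both functions return `Fraction(2, 3)`.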
2.4.2 Optimizations and implementation details
While the algorithm described above is polynomial, a number of further optimizations are required to make execution on real size inputs feasible.
First, note that the following property holds: fNS2(i, c, s) ≥ fNS1(i, c, s) as long as NS2 ≥ NS1. Indeed, ES values calculated using different values of NS decrease as NS increases. That means that all gene sets counted towards fNS1(i, c, s) should also be counted towards fNS2(i, c, s) if NS2 ≥ NS1.
Following the observation above, instead of calculating values of fNS(i, c, s) we will consider the values g(i, c, s, b) = fb+1(i, c, s) − fb(i, c, s). These values contain the probability that a random gene set q of size k selected uniformly from genes 1‥N simultaneously satisfies the following properties:
set q contains exactly c genes from the genes 1‥i;
the sum of the absolute values of gene-level statistics of the first c genes from q is equal to s;
ESj < γ holds for all j ≤ i, where the values of ES are calculated for the gene set q using NS = b + 1 (and for all higher values of NS);
ESj ≥ γ holds for at least one j ≤ i, where the values of ES are calculated for the gene set q using NS = b (and for all lower values of NS).
The sought probability can be calculated from values of g as follows:
To calculate the values of g we will use the forward dynamic programming algorithm. In this algorithm we expand a tree of reachable dynamic programming states, starting from g(0, 0, 0, 0) which is equal to 1.
The states will be considered by “levels” in increasing order of i. The values g(i + 1, c, s, b) of the (i + 1)-th level are calculated based on level i. Note that the sum of the values on the i-th level is always equal to 1.
To calculate all values of the (i + 1)-th level, all non-zero values of the i-th level are considered sequentially. Let us consider state (i, c, s, b) and define p = (k − c)/(N − i), the probability that gene i + 1 will be added to the set. The corresponding set G(i, c, s, b) can be divided into two groups.
The gene sets from G(i, c, s, b) that do not include gene i + 1. These gene sets are included in the gene sets G(i + 1, c, s, b) on level i + 1. Thus the corresponding probability g(i, c, s, b) · (1 − p) is added to the value of g(i + 1, c, s, b).
The gene sets from G(i, c, s, b) that do include gene i + 1. These gene sets are included in G(i + 1, c + 1, s′ = s + |Si+1|, b′), where b′ is an updated bound. To calculate b′, note that ESj will be greater than or equal to γ iff
which is equivalent to
. Thus
The probability that is added to g(i+1, c+1, s′, b′) is equal to g(i, c, s, b) · p.
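The update of the bound can be written out explicitly. The derivation below is a sketch under an assumption: it takes ES at position i + 1 to be the usual GSEA running sum, that is, the weighted fraction of hits s′/NS minus the fraction of misses (i − c)/(N − k). It should be checked against the definition of ES used earlier in the text. Since γ > 0 and NS is an integer, the largest NS for which the threshold is reached at position i + 1 is the floor of the right-hand side.

```latex
\mathrm{ES}_{i+1} \ge \gamma
\iff \frac{s'}{NS} - \frac{i - c}{N - k} \ge \gamma
\iff NS \le \frac{s'}{\gamma + \frac{i - c}{N - k}},
\qquad
b' = \max\left(b,\ \left\lfloor \frac{s'}{\gamma + \frac{i - c}{N - k}} \right\rfloor\right).
```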
While the asymptotic number of states remains O(NkT²), forward dynamic programming allows us to consider only “reachable” states with g(i, c, s, b) > 0. In practice the number of reachable states can be several orders of magnitude smaller than the total number of states.
Furthermore, we can treat only states with g(i, c, s, b) > ε as reachable, for some small value of ε. If we skip the remaining states, we can no longer calculate the desired probability exactly. However, if we calculate δ as the sum of the values of all skipped states, the desired probability is obtained with an absolute error of no more than δ.
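The forward scheme with pruning can be sketched as follows. This is an illustration under stated assumptions, not the linked C++ implementation: weights are positive integers, ES at a hit is taken to be s′/NS − (i − c)/(N − k), the bound update is b′ = max(b, ⌊s′/(γ + (i − c)/(N − k))⌋), and a final state is counted as reaching γ when b ≥ s, since a complete set is evaluated with NS = s. The names `exact_tail_probability` and `brute_force_tail` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def exact_tail_probability(weights, k, gamma, eps=Fraction(0)):
    """Return (P(max_j ES_j >= gamma), delta), delta being the pruned mass."""
    n = len(weights)
    gamma = Fraction(gamma)
    level = {(0, 0, 0): Fraction(1)}        # (c, s, b) -> g(i, c, s, b)
    delta = Fraction(0)
    for i in range(n):                      # decide whether to take gene i + 1
        nxt = {}
        for (c, s, b), g in level.items():
            if g <= eps:                    # treat as unreachable, track the loss
                delta += g
                continue
            p = Fraction(k - c, n - i)      # probability that gene i + 1 is taken
            if p < 1:
                nxt[(c, s, b)] = nxt.get((c, s, b), 0) + g * (1 - p)
            if p > 0:
                s2 = s + weights[i]
                # Largest integer NS for which ES at this hit is >= gamma.
                b2 = max(b, s2 // (gamma + Fraction(i - c, n - k)))
                key = (c + 1, s2, b2)
                nxt[key] = nxt.get(key, 0) + g * p
        level = nxt
    prob = sum(g for (c, s, b), g in level.items() if b >= s)
    return prob, delta


def brute_force_tail(weights, k, gamma):
    """Same probability by enumerating all gene sets (tiny inputs only)."""
    n = len(weights)
    hits = total = 0
    for q in combinations(range(n), k):     # q is sorted by rank
        ns = sum(weights[j] for j in q)
        es = max(Fraction(sum(weights[j] for j in q[:c]), ns)
                 - Fraction(q[c - 1] + 1 - c, n - k) for c in range(1, k + 1))
        hits += es >= gamma
        total += 1
    return Fraction(hits, total)
```

With `eps` left at zero no state is skipped and the two functions agree exactly. With a positive `eps` the returned probability can fall short of the exact value by at most the returned `delta`.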
The algorithm implementation, with a few other optimizations, is available at: https://github.com/ctlab/fgsea/blob/master/inst/exact/exact.cpp.
2.5 FGSEA-multilevel: an algorithm for calculation of arbitrarily low P-values using adaptive multilevel split Monte Carlo scheme
In this section we describe the FGSEA-multilevel algorithm, which can accurately estimate the GSEA P-value for a pathway p of size k even when the true P-value is very small.
Let γ = sr(p) > 0 be the enrichment score of the query pathway p for which we want to calculate the following value:
where q is a random gene set of size k. This probability can be rewritten as follows:
First, we focus on determining the probability . This probability can be extremely small, so using a naive sampling gives a bad estimation. We use the adaptive multilevel split Monte Carlo method [6] to solve this problem.
To estimate the probability we split the enrichment scores into levels 0 = l0 < l1 < … < lt = γ. Then we can define the following probabilities:
Now the probability can be rewritten as
.
To estimate αi we can draw a sample of size Z from a conditional distribution
. Then
where Zi is the number of elements in the set
.
Below we show how levels li can be chosen and how to sample from the corresponding conditional distributions.
2.5.1 Choosing the enrichment score levels
We propose to choose the value of a level li as the median of the enrichment scores in the sample. For simplicity, Z is required to be an odd number.
Then the procedure for estimating the probability consists of repeating the following steps:
On iteration i ≥ 1 sample Z gene sets
of size k from the distribution
.
Set the level
to be equal to the median of value
.
If
then stop the iterations and set li = γ and t = i, otherwise set
.
As a result, by construction, αi ≈ 1/2 for 1 ≤ i ≤ t − 1. The value of αt can be approximated as Zt/Z (which is always ≥ 1/2). Together we get the following expression for estimating the desired probability:
$$\prod_{i=1}^{t} \alpha_i \approx 2^{-(t-1)} \cdot \frac{Z_t}{Z}$$
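As a purely illustrative calculation, suppose the procedure stops at level t = 10 with Z = 101 and Zt = 60 (these numbers are made up for the example). Combining nine factors of approximately 1/2 with the last factor Zt/Z gives:

```latex
2^{-(t-1)} \cdot \frac{Z_t}{Z}
= 2^{-9} \cdot \frac{60}{101}
= \frac{1}{512} \cdot 0.5941
\approx 1.16 \times 10^{-3}.
```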
2.5.2 The conditional sampling implementation
To generate a uniform sample from the conditional distribution
we use the Metropolis algorithm.
First, we generate a sample of size Z from the distribution
Since l0 = 0 and values of
are always non-negative it can be done by generating a uniformly random subset of size k from the genes {1, 2, …, N}.
Now let us consider a sample at step i > 1. The sample can be sorted in increasing order of enrichment score values:
. Let d = ⌈Z/2⌉. The level li−1 is the median of the values
and, thus, is equal to
.
Let first populate in the following way:
This gives us a sample from the conditional distribution , however it is not uniform.
To make the sample uniform, we apply a number of iterations of the Metropolis algorithm. On each iteration, we apply the following steps to each gene set:
Choose a random gene
.
Choose a random gene
.
Consider
. If
then we replace
with
.
The iterations are repeated until the total number of successful replacements is greater than or equal to k · Z. In practice, this number of steps is enough to get a sufficiently uniform sample and a good estimate of the probability, without a significant increase in the running time of the algorithm.
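The whole loop (median levels, repopulation, Metropolis swaps) can be sketched in Python as follows. This is a simplified illustration, not the FGSEA implementation, and several details are assumptions made for the sketch: the score `es_plus` is taken to be the largest positive deviation of the running sum, the sample is repopulated by keeping the upper half and copying part of it, a proposed gene is drawn from all N genes and skipped if it is already in the set, and `max_levels` and `max_sweeps` are safety caps that are not part of the description above.

```python
import random


def es_plus(w, q):
    """Largest positive deviation of the running sum for gene set q.

    w: positive weights |S_j| in ranked order; q: 0-based gene indices.
    """
    n, k = len(w), len(q)
    ns = sum(w[j] for j in q)
    best = hit = 0.0
    for c, j in enumerate(sorted(q), start=1):
        hit += w[j]
        # weighted hits so far / NS minus misses so far / (N - k)
        best = max(best, hit / ns - (j + 1 - c) / (n - k))
    return best


def multilevel_estimate(w, k, gamma, Z=101, seed=0,
                        max_levels=1000, max_sweeps=10_000):
    """Estimate P(es_plus(q) >= gamma) for a random gene set q of size k (Z odd)."""
    rng = random.Random(seed)
    n, d = len(w), (Z + 1) // 2
    sample = [rng.sample(range(n), k) for _ in range(Z)]   # unconditional sample
    for t in range(1, max_levels + 1):
        sample.sort(key=lambda q: es_plus(w, q))
        level = es_plus(w, sample[d - 1])                  # sample median
        if level >= gamma:                                 # last level reached
            z_t = sum(es_plus(w, q) >= gamma for q in sample)
            return 0.5 ** (t - 1) * z_t / Z
        # Assumed repopulation: keep the d sets at or above the median,
        # then copy d - 1 of them to restore the sample size Z.
        upper = sample[d - 1:]
        sample = [list(q) for q in upper + upper[:Z - d]]
        successes = 0
        for _ in range(max_sweeps):
            for q in sample:
                pos, new = rng.randrange(k), rng.randrange(n)
                if new in q:
                    continue
                old, q[pos] = q[pos], new                  # propose a swap
                if es_plus(w, q) >= level:
                    successes += 1                         # accept
                else:
                    q[pos] = old                           # reject, restore
            if successes >= k * Z:
                break
    raise RuntimeError("gamma was not reached within max_levels")
```

Because the swap proposal is symmetric and a move is accepted only when the new set stays at or above the current level, the chain targets the uniform distribution over sets above that level. On inputs small enough to enumerate, the estimate can be compared with the exact fraction of sets whose score is at least γ.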
2.5.3 Estimating the P-value
In order to estimate the desired P-value we also need to calculate the probabilities P (sr (q) ≥ 0) and .
To calculate the probability P (sr (q) ≥ 0) we generate gene sets q1, q2, …, qZ′, where each sample qi is selected uniformly at random from all the subsets of size k from the set {1, 2, …, N}. The samples are generated until the number of samples qi with sr(qi) ≥ 0 becomes equal to Z. Then the probability P (sr (q) ≥ 0) is estimated as follows
To determine the remaining probability we calculate the number of gene sets in
whose value of the enrichment score function sr is greater than zero. After that, the probability can be estimated as follows:
2.5.4 Estimating log-probability
To properly estimate the logarithm of the desired probability, note that the j-th order statistic of a standard uniform sample of size Z is a random variable from the beta distribution Beta(j, Z + 1 − j). Therefore, we can use the properties of the beta distribution to pass correctly to the logarithm of the probability. So for the median of a sample of odd size Z we have:
where ψ is the digamma function. In the same way, we can calculate the expectation of the logarithm of αt:
Then the logarithm of probability is estimated as
Similarly, we can estimate the variance of the estimates
, where ψ1 is the trigamma function. From this we can approximate the standard error of our estimator as:
The same approach with digamma functions is used to calculate the logarithm of the probabilities and P (sr (q) ≥ 0).
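The beta-distribution moments for the median levels can be evaluated without a special-function library. The sketch below is an illustration in natural logarithms and uses two standard facts: for X ~ Beta(a, b), E[ln X] = ψ(a) − ψ(a + b) and Var[ln X] = ψ1(a) − ψ1(a + b), and for integer arguments these differences reduce to finite sums. The function names are hypothetical, the contribution of the last level αt is deliberately left out, and adding the variances across levels assumes the levels are independent.

```python
import math


def median_log_alpha_moments(Z):
    """Mean and variance of ln(alpha) for alpha ~ Beta((Z+1)/2, (Z+1)/2)."""
    if Z % 2 == 0:
        raise ValueError("Z must be odd")
    d = (Z + 1) // 2
    # psi(d) - psi(Z + 1) = -sum_{j=d}^{Z} 1/j for integer arguments
    mean = -sum(1.0 / j for j in range(d, Z + 1))
    # psi1(d) - psi1(Z + 1) = sum_{j=d}^{Z} 1/j^2 for integer arguments
    var = sum(1.0 / j ** 2 for j in range(d, Z + 1))
    return mean, var


def log_probability_of_median_levels(n_levels, Z):
    """Estimated ln-probability and standard error from n_levels median levels."""
    mean, var = median_log_alpha_moments(Z)
    return n_levels * mean, math.sqrt(n_levels * var)
```

For Z = 101 the mean is close to ln(1/2) and the variance is close to 1/Z, which matches the intuition that each median level halves the probability with a small, quantifiable uncertainty.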
2.5.5 Comparison with the exact method
To compare FGSEA-multilevel and the exact method on the same dataset we used rounded values of the gene-level statistics from the example data (section 2.2) as input data for both algorithms. Both algorithms calculated the probability .
The results of the algorithms for the pathways from the example data are shown in Fig 2c. The exact algorithm was run with ε = 10⁻⁴⁰; all the probabilities were obtained with an accuracy of at least six significant digits. For FGSEA-multilevel, Z = 101 was used.
We also calculated empirical estimation errors and compared them to the theoretical ones (Fig 2d). For this we generated 100 independent estimates for a range of ES values (corresponding to P-values of 10⁻⁴ to 10⁻¹⁰⁰), gene set sizes (from 15 to 250), and sample sizes (from 101 to 1001). The raw values are available in the Supplementary Table.
2.6 Filtering redundant pathways
In this section we describe an algorithm to filter redundant pathways from the results of FGSEA.
Let us consider two pathways p1 and p2 that both have a significant GSEA P-value. There are two situations in which we will consider p2 to be non-redundant given p1:
If pathway p2 is enriched even if we do not consider the genes from p1 at all. Formally, we calculate GSEA P-value for gene set p2 \ p1 and gene-level statistics vector S[U \ p1] for all the genes except p1. If the P-value is less than a pre-defined threshold, then pathway p2 is considered as non-redundant given p1.
If pathway p2 is enriched even if we consider only genes from p1. Formally, we calculate GSEA P-value for gene set p2 ∩ p1 and gene-level statistics vector S[p1] for the genes from p1. Again, if the P-value is less than a pre-defined threshold, then pathway p2 is considered as non-redundant given p1.
Otherwise pathway p2 is considered to be redundant.
The filtering procedure starts with a set of significantly enriched pathways Psig selected by the user: for example the pathways with GSEA P-values less than 0.01 after Benjamini-Hochberg correction, sorted by P-value. The output of the procedure is a list Pmain ⊂ Psig of pathways that are pairwise non-redundant. At the same time, all the other pathways Pred = Psig \ Pmain are redundant given some pathway from Psig.
The procedure itself is similar to the Sieve of Eratosthenes algorithm. The pathways are considered one by one and some of them are marked as redundant. For a pathway p we first check whether it is already marked as redundant; if so, we go to the next pathway. Otherwise, we first run the FGSEA-simple algorithm on the vector of statistics S[U \ p] and all the pathways currently not marked as redundant (including the ones that have already been considered, but excluding pathway p). Then, similarly, we run the FGSEA-simple algorithm on the vector of statistics S[p]. Pathways that do not achieve the non-redundant P-value threshold in both tests are marked as redundant.
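The sieve-like pass can be sketched as follows. This is a minimal illustration rather than the FGSEA code: `pvalue(gene_set, background)` is a hypothetical user-supplied callable standing in for a run of FGSEA-simple restricted to the given background genes, and the function name, the default threshold, and the returned structures are assumptions made for the example.

```python
def filter_redundant(pathways, universe, pvalue, threshold=0.05):
    """Split significant pathways into main and redundant ones.

    pathways: dict name -> set of genes, ordered by increasing P-value.
    universe: set of all genes U.
    pvalue:   hypothetical callable (gene_set, background) -> P-value.
    Returns (main, redundant), where redundant maps a pathway name to the
    pathway that made it redundant.
    """
    redundant = {}
    for name, genes in pathways.items():
        if name in redundant:                 # already sieved out
            continue
        outside = universe - genes            # background for S[U \ p]
        for other, other_genes in pathways.items():
            if other == name or other in redundant:
                continue
            # Test 1: is `other` enriched when genes of `name` are ignored?
            p_out = pvalue(other_genes - genes, outside)
            # Test 2: is `other` enriched among the genes of `name` only?
            p_in = pvalue(other_genes & genes, genes)
            if p_out >= threshold and p_in >= threshold:
                redundant[other] = name       # fails both tests
    main = [n for n in pathways if n not in redundant]
    return main, redundant
```

Recording which pathway caused each removal is a design choice of this sketch: it makes the output easier to inspect, but it is not required by the procedure itself.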
Footnotes
FGSEA-multilevel procedure has been added to estimate arbitrarily low P-values.