## Abstract

Gene set enrichment analysis is a widely used tool for analyzing gene expression data. However, current implementations are slow because the analysis requires a large number of samples to achieve good statistical power. In this paper we present a novel algorithm that efficiently reuses one sample multiple times and thus speeds up the analysis. We show that it is possible to make hundreds of thousands of permutations in a few minutes, which leads to very accurate p-values. This, in turn, allows applying standard FDR correction procedures, which are more accurate than the ones currently used. The method is implemented as an R package and is freely available at https://github.com/ctlab/fgsea.

## 1 Introduction

Gene set enrichment analysis is a very widely used method for analyzing gene expression data. From an *a priori* defined list of gene sets, it selects those that show non-random behavior in the considered experiment. Compared to the similar approach of calculating Fisher p-values for an overlap statistic, it does not require arbitrary thresholding. This also allows the method to identify pathways that contain many co-regulated genes with small individual effects.

The method has the major drawback of being relatively slow. Because the analytical form of the null distribution of the gene set enrichment statistic is not known, an empirical null distribution has to be calculated. That can be done in a straightforward manner by sampling random gene sets. However, a large number of gene sets are usually tested simultaneously. As a result, a large number of samples is required for the test to have good statistical power after correction for multiple testing.

In the original paper [3] Subramanian et al. developed an ad-hoc method for multiple testing correction. However, the developed method is approximate for the commonly used parameters and it is unclear how accurate it is.

Here we present a fast gene set enrichment analysis (FGSEA) method, which finds nominal p-values much faster than the original method [3]. The method is based on an algorithm for calculating cumulative gene set enrichment statistic values, which allows rapidly calculating multiple sample statistic values from a single sample. The ability to obtain accurate nominal p-values in a reasonable time makes it possible to use well-developed general methods for multiple testing correction, such as Bonferroni or Benjamini-Hochberg.

The rest of the paper is structured as follows. First, in section 2 we formally define the gene set enrichment statistic and introduce related definitions. In section 3 we explain the idea of the algorithm on a simple mean statistic. Then, in section 4 we show how the GSEA statistic can be interpreted geometrically and present the algorithm for fast computation of cumulative values that follows from this interpretation. Finally, in section 5 we show how the algorithm works in practice and how it compares to the reference implementation.

## 2 Definitions

The preranked gene set enrichment analysis takes as input two objects: an array of gene statistic values *S* and a list of query gene sets *P*. The goal of the analysis is to determine which of the gene sets from *P* has a non-random behavior.

The gene statistic array $S$ of size $|S| = N$ contains, for each gene $i$, $1 \le i \le N$, a value $S_i \in \mathbb{R}$ that characterizes the gene behavior in the considered process. Commonly, $S_i > 0$ means that the expression of gene $i$ goes up in treatment compared to control, and $S_i < 0$ means that the expression goes down. The absolute value $|S_i|$ represents the magnitude of the change. Array $S$ is sorted in decreasing order: $S_i > S_j$ for $i < j$. The value of $N$ in practice is about 10000–20000.

The list of gene sets *P* of length |*P*| = *m* usually contains groups of genes that are commonly regulated in some biological process. In this paper we assume that all gene sets *p* ∈ *P* have a size upper bound of *K* ≈ 500 genes: |*p*| ≤ *K*, ∀*p* ∈ *P*.

To quantify the co-regulation of genes in a gene set $p$, Subramanian et al. introduced a gene set enrichment score function $s_r(p)$ that uses gene rankings (values of $S$). The more positive the value of $s_r(p)$, the more enriched the gene set is in positively regulated genes $g$ with $S_g > 0$. Accordingly, a negative $s_r(p)$ corresponds to enrichment in negatively regulated genes.

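The value of $s_r(p)$ can be calculated as follows. Let $k = |p|$ and $\mathrm{NS} = \sum_{i \in p} |S_i|$. Let also ES be an array specified by the following formula:

$$
\mathrm{ES}_i =
\begin{cases}
0 & \text{if } i = 0, \\
\mathrm{ES}_{i-1} + \dfrac{1}{\mathrm{NS}} |S_i| & \text{if } 1 \le i \le N \text{ and } i \in p, \\
\mathrm{ES}_{i-1} - \dfrac{1}{N - k} & \text{if } 1 \le i \le N \text{ and } i \notin p.
\end{cases}
$$

The value of $s_r(p)$ corresponds to the entry of ES that is largest by absolute value:

$$
s_r(p) = \mathrm{ES}_{i^*}, \quad \text{where } i^* = \arg\max_i |\mathrm{ES}_i|.
$$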
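The score can be sketched in a few lines of Python. This is only an illustration, assuming the weighted running-sum form of ES (a gene in $p$ adds $|S_i|/\mathrm{NS}$, a gene outside $p$ subtracts $1/(N-k)$), a non-empty $p$ smaller than $N$, and $\mathrm{NS} > 0$. The function names `enrichment_scores` and `gsea_statistic` are hypothetical and are not part of the FGSEA package.

```python
from typing import List, Sequence, Set


def enrichment_scores(S: Sequence[float], p: Set[int]) -> List[float]:
    """Return the running sum ES_0..ES_N for gene set p (1-based gene indices)."""
    N, k = len(S), len(p)
    NS = sum(abs(S[i - 1]) for i in p)  # total weight of the genes in p
    ES = [0.0]
    for i in range(1, N + 1):
        if i in p:
            ES.append(ES[-1] + abs(S[i - 1]) / NS)  # hit: step up by |S_i| / NS
        else:
            ES.append(ES[-1] - 1.0 / (N - k))  # miss: step down by 1 / (N - k)
    return ES


def gsea_statistic(S: Sequence[float], p: Set[int]) -> float:
    """s_r(p): the entry of ES with the largest absolute value."""
    return max(enrichment_scores(S, p), key=abs)


# Toy ranking, sorted in decreasing order.
S = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
top = gsea_statistic(S, {1, 2})     # gene set at the top of the ranking
bottom = gsea_statistic(S, {5, 6})  # gene set at the bottom of the ranking
print(top, bottom)
```

In this toy example the set at the top of the ranking gets a positive score and the set at the bottom gets a negative one.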

For each *p* ∈ *P* we need to find the enrichment statistic value and to calculate a p-value that quantifies how unlikely such a value is for a random gene set. To calculate a p-value for gene set *p* we can obtain an empirical null distribution by sampling *n* random gene sets of the same size as *p*.

Such a straightforward implementation takes *Θ*(*mnK log K*) time: for each of the *m* gene sets we need to perform *n* permutations, and for each permutation we need to calculate the enrichment statistic value, which can be done in *Θ*(*K log K*) time.
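The straightforward sampling procedure can be sketched as follows. The text does not fix how the p-value is derived from the null sample, so this sketch assumes one common convention: count random sets whose statistic is at least as large in absolute value as the observed one, and add one to both numerator and denominator. The names `gsea_statistic` and `empirical_pvalue` are illustrative only.

```python
import random
from typing import Sequence, Set


def gsea_statistic(S: Sequence[float], p: Set[int]) -> float:
    """Entry of the ES running sum with the largest absolute value."""
    N, k = len(S), len(p)
    NS = sum(abs(S[i - 1]) for i in p)
    es = best = 0.0
    for i in range(1, N + 1):
        es += abs(S[i - 1]) / NS if i in p else -1.0 / (N - k)
        if abs(es) > abs(best):
            best = es
    return best


def empirical_pvalue(S: Sequence[float], p: Set[int], n: int,
                     rng: random.Random) -> float:
    """Estimate a p-value for p from n random gene sets of the same size."""
    observed = abs(gsea_statistic(S, p))
    N, k = len(S), len(p)
    hits = 0
    for _ in range(n):
        q = set(rng.sample(range(1, N + 1), k))  # random set of size k
        if abs(gsea_statistic(S, q)) >= observed:
            hits += 1
    # Assumed convention: +1 correction so that the estimate is never zero.
    return (hits + 1) / (n + 1)


S = [25.5 - i for i in range(50)]  # toy decreasing ranking without zeros
pv = empirical_pvalue(S, {1, 2, 3, 4, 5}, 200, random.Random(0))
print(pv)
```

Every query gene set triggers its own *n* samples here, which is exactly the cost that the following sections avoid.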

## 3 Cumulative statistic calculation for mean statistic

Let us first describe the idea of the proposed algorithm on a simple statistic of mean rank $s_m$:

$$
s_m(p) = \frac{1}{k} \sum_{i \in p} S_i, \quad \text{where } k = |p|.
$$

The idea of the algorithm is to reuse sampling for different query gene sets. This can be done due to the fact that for an estimation of null distributions samples have to be independent only for a specific gene set size. Samples can be dependent between different sizes.

Instead of generating $nm$ independent random gene sets, one for each permutation and each query gene set, we will generate only $n$ random gene sets of size $K$. Let $\pi_i$ be the $i$-th random gene set of size $K$. From that gene set we can generate gene sets for all the query pathways $P_j$ by using its prefix: $\pi_{i,j} = \pi_i[1..|P_j|]$.

The next step is to calculate enrichment scores for all generated gene sets $\pi_{i,j}$. Instead of calculating enrichment scores separately for each gene set, we will calculate scores simultaneously for all $\pi_{i,j}$ for a fixed $i$. Using a simple procedure this can be done in $\Theta(K)$ time.

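Let us find enrichment scores for all prefixes of $\pi_i$. This can be done by dividing the array of cumulative sums element-wise by the length of the corresponding prefix:

$$
s_m(\pi_i[1..k]) = \frac{1}{k} \sum_{j=1}^{k} S_{\pi_i[j]}, \quad 1 \le k \le K,
$$

where $\pi_i[j]$ denotes the $j$-th gene of $\pi_i$.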

Selecting only the required prefixes takes an additional *Θ(m)* time.
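A minimal sketch of this step, assuming the mean statistic is the average of the $S$ values of the genes in the set and that a sample is given as a list of 1-based gene indices. The names `prefix_means` and `sizes` are illustrative.

```python
from itertools import accumulate
from typing import List, Sequence


def prefix_means(S: Sequence[float], pi: Sequence[int]) -> List[float]:
    """Mean statistic of every prefix pi[1..k], k = 1..K, in Theta(K) time."""
    sums = accumulate(S[g - 1] for g in pi)  # cumulative sums over the sample
    return [total / k for k, total in enumerate(sums, start=1)]


S = [4.0, 3.0, 2.0, 1.0]  # toy ranking
pi = [2, 4, 1]            # one random sample of size K = 3
means = prefix_means(S, pi)

# Selecting the statistic for each query gene set size takes Theta(m) time.
sizes = [1, 3]
selected = [means[size - 1] for size in sizes]
print(means, selected)
```

One sample of size *K* thus yields one null value for every query gene set size at once.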

The described procedure finds p-values for all query gene sets in *Θ(n(K+m))* time. This is about min(*K, m*) times faster than the straightforward procedure.

## 4 Cumulative statistic calculation for GSEA statistic

For the GSEA statistic we use a similar idea: we again sample only gene sets of size *K* and from each sample calculate statistic values for all the other sizes. However, calculation of the cumulative statistic values for the subsamples is more complex in this case.

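In this section we consider only the positive mode of the enrichment statistic, $s_r^+(p)$. It can be defined as follows:

$$
s_r^+(p) = \mathrm{ES}_{i_+}, \quad \text{where } i_+ = \arg\max_i \mathrm{ES}_i.
$$

The calculation of $s_r^-(p) = \mathrm{ES}_{i_-}$ for $i_- = \arg\min_i \mathrm{ES}_i$ is very similar. From these two values it is easy to find the value of $s_r(p)$, which is equal to $s_r^+(p)$ if $|s_r^+(p)| > |s_r^-(p)|$ and to $s_r^-(p)$ otherwise.

### 4.1 Geometric interpretation of GSEA statistic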

It is helpful to look at the GSEA statistic from a geometric point of view. Let us consider $N + 1$ points (Fig. 1) with coordinates $(x_i, y_i)$ for $0 \le i \le N$ such that:

$$x_0 = 0, \quad y_0 = 0, \tag{1}$$

$$x_i = x_{i-1} + [i \notin p], \tag{2}$$

$$y_i = y_{i-1} + [i \in p] \cdot |S_i|, \tag{3}$$

where $[\cdot]$ is equal to 1 if the condition inside holds and to 0 otherwise.

The calculation of $s_r^+(p)$ corresponds to finding the point farthest up from the diagonal $((x_0, y_0), (x_N, y_N))$. Indeed, it is easy to see that $x_N = N - |p| = N - k$ and $y_N = \sum_{j \in p} |S_j| = \mathrm{NS}$, while the individual enrichment scores $\mathrm{ES}_i$ can be calculated as

$$
\mathrm{ES}_i = \frac{y_i}{\mathrm{NS}} - \frac{x_i}{N - k}.
$$

The value of $\mathrm{ES}_i$ is proportional to the directed distance from the line going through $(x_0, y_0)$ and $(x_N, y_N)$ to the point $(x_i, y_i)$.

Let us fix a sample $\pi$ of length $K$. To efficiently calculate the cumulative values $s_r^+(\pi[1..k])$ for $k \le K$ we need a fast method of updating the farthest point when a new gene is added. In that case we can add genes from $\pi$ one by one and calculate the values of $s_r^+$ from the corresponding maximal distances.
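The geometric interpretation of ES can be checked numerically. The sketch below assumes that $x_i$ counts the genes among the first $i$ that are not in $p$ and that $y_i$ sums $|S_j|$ over the genes among the first $i$ that are in $p$, which agrees with $x_N = N - k$ and $y_N = \mathrm{NS}$. The function name `gsea_points` is illustrative.

```python
from typing import List, Sequence, Set, Tuple


def gsea_points(S: Sequence[float], p: Set[int]) -> Tuple[List[int], List[float]]:
    """Coordinates (x_i, y_i), 0 <= i <= N, for gene set p (1-based indices)."""
    xs, ys = [0], [0.0]
    for i in range(1, len(S) + 1):
        hit = i in p
        xs.append(xs[-1] + (0 if hit else 1))              # step right on a miss
        ys.append(ys[-1] + (abs(S[i - 1]) if hit else 0.0))  # step up on a hit
    return xs, ys


S = [5.0, 4.0, 3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
p = {2, 3, 6}
xs, ys = gsea_points(S, p)
N, k, NS = len(S), len(p), ys[-1]

# ES_i recovered from the coordinates; its maximum is the positive-mode score.
es_from_points = [y / NS - x / (N - k) for x, y in zip(xs, ys)]
print(max(es_from_points))
```

Each value in `es_from_points` is the height of a point above the diagonal, scaled so that it coincides with the running sum $\mathrm{ES}_i$.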

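Because we are calculating values for $\pi[1..k]$ for $k \le K$, we know in advance which $K$ genes will be added. This allows us to consider $K + 1$ points instead of $N + 1$ for each iteration $k$. Let the array $o$ contain the order of genes in $\pi$, that is, $\pi_{o_1}$ is the minimal among $\pi$, $\pi_{o_2}$ is the second minimal, and so on. The coordinates can be calculated as follows:

$$x^k_0 = 0, \quad y^k_0 = 0, \tag{4}$$

$$x^k_i = x^k_{i-1} + \pi_{o_i} - \pi_{o_{i-1}} - [o_i \le k], \tag{5}$$

$$y^k_i = y^k_{i-1} + [o_i \le k] \cdot |S_{\pi_{o_i}}|, \tag{6}$$

where we set $\pi_{o_0}$ to be zero.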

It can be shown that finding the farthest up point among (4)–(6) is equivalent to finding the farthest up point among (1)–(3), with $(x^k_i, y^k_i)$ being equal to $(x_{\pi_{o_i}}, y_{\pi_{o_i}})$ calculated for $p = \pi[1..k]$.
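This equivalence can be checked on a small example. The sketch below assumes the compressed coordinates are built gene by gene in increasing order of position: $x$ grows by the gap to the previous gene of $\pi$, minus one if the current gene is among the first $k$ genes of $\pi$, and $y$ grows by $|S|$ of the current gene in the same case. All function names are illustrative.

```python
from typing import List, Sequence, Set, Tuple


def compressed_points(S: Sequence[float], pi: Sequence[int],
                      k: int) -> Tuple[List[int], List[float]]:
    """K + 1 points for the prefix pi[1..k]; pi holds 1-based gene positions."""
    # order[i - 1] = o_i: 1-based index in pi of the i-th smallest gene
    order = sorted(range(1, len(pi) + 1), key=lambda j: pi[j - 1])
    xs, ys = [0], [0.0]
    prev_gene = 0  # pi_{o_0} is set to zero
    for o_i in order:
        gene = pi[o_i - 1]
        in_prefix = o_i <= k
        xs.append(xs[-1] + gene - prev_gene - (1 if in_prefix else 0))
        ys.append(ys[-1] + (abs(S[gene - 1]) if in_prefix else 0.0))
        prev_gene = gene
    return xs, ys


def s_plus_compressed(S: Sequence[float], pi: Sequence[int], k: int) -> float:
    """Positive-mode score of pi[1..k] from the K + 1 compressed points."""
    xs, ys = compressed_points(S, pi, k)
    N, NS = len(S), ys[-1]
    return max(y / NS - x / (N - k) for x, y in zip(xs, ys))


def s_plus_direct(S: Sequence[float], p: Set[int]) -> float:
    """Positive-mode score from the full running sum over all N genes."""
    N, k = len(S), len(p)
    NS = sum(abs(S[i - 1]) for i in p)
    es = best = 0.0
    for i in range(1, N + 1):
        es += abs(S[i - 1]) / NS if i in p else -1.0 / (N - k)
        best = max(best, es)
    return best


S = [5.0, 4.0, 3.0, 2.0, 1.0, -1.0, -2.0, -3.0, -4.0, -5.0]
pi = [7, 2, 9, 4]
for k in range(1, len(pi) + 1):
    print(k, s_plus_compressed(S, pi, k), s_plus_direct(S, set(pi[:k])))
```

The two columns agree up to floating-point rounding, although the first one looks only at $K + 1$ points.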

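Consider $x_{\pi_{o_i}} - x_{\pi_{o_{i-1}}}$. By the definition of $x$ it is equal to:

$$
x_{\pi_{o_i}} - x_{\pi_{o_{i-1}}} = \sum_{j = \pi_{o_{i-1}} + 1}^{\pi_{o_i}} [j \notin \pi[1..k]].
$$

By the definition of $o$, in the interval from $\pi_{o_{i-1}} + 1$ to $\pi_{o_i} - 1$ there are no genes from $\pi$ and, thus, from $\pi[1..k]$. Thus every member of the sum except the last one is equal to one, and only the last member depends on $k$:

$$
x_{\pi_{o_i}} - x_{\pi_{o_{i-1}}} = (\pi_{o_i} - \pi_{o_{i-1}} - 1) + [\pi_{o_i} \notin \pi[1..k]] = \pi_{o_i} - \pi_{o_{i-1}} - [o_i \le k].
$$

We got the same difference as in (5).

Now consider $y_{\pi_{o_i}} - y_{\pi_{o_{i-1}}}$. By the definition of $y$ it is equal to:

$$
y_{\pi_{o_i}} - y_{\pi_{o_{i-1}}} = \sum_{j = \pi_{o_{i-1}} + 1}^{\pi_{o_i}} [j \in \pi[1..k]] \cdot |S_j|.
$$

Again, in the interval from $\pi_{o_{i-1}} + 1$ to $\pi_{o_i} - 1$ there are no genes from $\pi[1..k]$. Thus we can replace the sum with only the last member:

$$
y_{\pi_{o_i}} - y_{\pi_{o_{i-1}}} = [\pi_{o_i} \in \pi[1..k]] \cdot |S_{\pi_{o_i}}| = [o_i \le k] \cdot |S_{\pi_{o_i}}|.
$$

We got the same difference as in (6).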

We do not need to consider other points, because the points with indices $\pi_{o_{i-1}}, \dots, \pi_{o_i} - 1$ have the same $y$ coordinate and $\pi_{o_{i-1}}$ is the leftmost of them. Thus, when at least one gene is added, the diagonal $((x_0, y_0), (x_N, y_N))$ is not horizontal and $\pi_{o_{i-1}}$ is the farthest point among $\pi_{o_{i-1}}, \dots, \pi_{o_i} - 1$.

### 4.2 Square root decomposition

Let us consider what happens when gene $\pi_k$ is added to the query set $\pi[1..k-1]$ (Fig. 2). Let $r_k$ be the rank of gene $\pi_k$ among the genes of $\pi$. Then the coordinates of points $(x_i, y_i)$ for $i < r_k$ do not change, while all $(x_i, y_i)$ for $i \ge r_k$ are shifted by $(\Delta_x, \Delta_y) = (-1, |S_{\pi_k}|)$.

To make fast incremental updates we will decompose the problem into multiple smaller ones. For simplicity we assume that $K + 1$ is an exact square of an integer $b$. Let us split the $K + 1$ points into $b$ consecutive blocks of size $b$: $\{(x_0, y_0), \dots, (x_{b-1}, y_{b-1})\}$, $\{(x_b, y_b), \dots, (x_{2b-1}, y_{2b-1})\}$ and so on.

For each of the *b* blocks we will store and update the point farthest up from the diagonal. When we know the farthest point of each block, we can find the globally farthest point by a simple pass in *O(b)* time.

Next, we show how to update the farthest points in blocks in amortized time *O(b)*. Taken together with one *O(b)* pass, this gives an algorithm that updates the globally farthest point in amortized *O(b)* time.

Below we use $c = \lfloor r_k / b \rfloor$ as the index of the block to which gene $\pi_k$ belongs.

First, we describe the procedure for updating point coordinates. We will store the $x_i$ coordinates using two vectors, $B$ of size $b$ and $D$ of size $K + 1$, such that $x_i = B_{\lfloor i/b \rfloor} + D_i$. When gene $\pi_k$ is added, all $x_i$ for $i \ge r_k$ are decremented by one. To reflect this we decrement all $B_j$ for $j > c$ and all $D_i$ for $r_k \le i < (c + 1)b$. It is easy to see that this takes $O(b)$ time. After this update procedure we can get the value of $x_i$ in $O(1)$ time. The same procedure is applied to the $y$ coordinates.
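The two-vector storage can be sketched as a small class. The sketch assumes 0-based point indices and 0-based block indices, so that block $c = \lfloor r/b \rfloor$ covers indices $cb, \dots, (c+1)b - 1$. The class name `BlockOffsets` and its methods are illustrative.

```python
from typing import List, Sequence


class BlockOffsets:
    """Stores values v_i = B[i // b] + D[i] with O(b) suffix updates."""

    def __init__(self, values: Sequence[float], b: int) -> None:
        self.b = b
        self.D: List[float] = list(values)                      # per-point part
        self.B: List[float] = [0] * ((len(values) + b - 1) // b)  # per-block part

    def add_suffix(self, r: int, delta: float) -> None:
        """Add delta to every v_i with i >= r."""
        c = r // self.b
        # Points from r to the end of block c are updated one by one ...
        for i in range(r, min((c + 1) * self.b, len(self.D))):
            self.D[i] += delta
        # ... and every later block is updated through its shared offset.
        for j in range(c + 1, len(self.B)):
            self.B[j] += delta

    def get(self, i: int) -> float:
        """Current value of v_i in O(1) time."""
        return self.B[i // self.b] + self.D[i]


xs = BlockOffsets(range(9), b=3)  # K + 1 = 9 points, b = 3
xs.add_suffix(4, -1)              # a gene of rank 4 is added: x_i -= 1 for i >= 4
print([xs.get(i) for i in range(9)])
```

Both loops in `add_suffix` run at most *b* times, and the same structure with `delta` set to the gene weight serves for the *y* coordinates.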

Second, for each block we will maintain the upper part of its convex hull. Having the convex hull is useful because the farthest point in a block always lies on its convex hull. In all blocks except *c* the points are either unchanged or shifted simultaneously by the same value. That means that the lists of points on the convex hulls of these blocks remain unchanged. For the block *c* we can reconstruct the convex hull from scratch using the Graham scan algorithm. Because the points are already sorted by *x* coordinate, this reconstruction takes *O(b)* time. In total, it takes *O(b)* time to update the convex hulls.
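Building the upper hull of presorted points is a single stack pass. The sketch below is a generic Graham-style scan and not the package's own code. It assumes the points are sorted by *x*, with ties broken by *y*, which holds for the points of a block because both coordinates are non-decreasing in the point index.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Cross product of o->a and o->b; positive for a counterclockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def upper_hull(points: Sequence[Point]) -> List[Point]:
    """Upper convex hull of points sorted by x (ties by y), in linear time."""
    hull: List[Point] = []
    for p in points:
        # Drop the stack top while it lies on or below the segment hull[-2]..p.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


block = [(0, 0), (1, 2), (2, 1), (3, 3), (4, 0)]
print(upper_hull(block))
```

Each point is pushed once and popped at most once, which gives the linear running time.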

Third, the farthest points in blocks can be updated using the stored convex hulls. Consider a block where the convex hull was not changed (every block except, possibly, block *c*). Because the diagonal always rotates in the same counterclockwise direction, the farthest point in the block on iteration *k* either stays the same or moves along the convex hull to the left of the farthest point on the (*k* − 1)-th iteration. Thus, for each such block we can compare the current farthest point with its left neighbor on the convex hull and update the point if necessary. This is repeated until the next neighbor is closer to the diagonal than the current farthest point. In the block *c* we just find the farthest point in a single pass over the points on the convex hull.
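The left-moving pointer can be illustrated on a single fixed hull. In this sketch the diagonal is represented by its direction vector `(dx, dy)`, which would be $(N - k, \mathrm{NS})$ in the notation of the text. The hull and the sequence of directions are made-up toy data, and the function names are illustrative.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def height(point: Point, dx: float, dy: float) -> float:
    """Height of point above the line through the origin with direction (dx, dy),
    up to a positive factor."""
    x, y = point
    return y * dx - x * dy


def walk_left(hull: Sequence[Point], pos: int, dx: float, dy: float) -> int:
    """Move the farthest-point index left while the left neighbor is farther up."""
    while pos > 0 and height(hull[pos - 1], dx, dy) > height(hull[pos], dx, dy):
        pos -= 1
    return pos


hull = [(0, 0), (1, 3), (3, 5), (6, 6)]           # upper hull of one block
diagonals = [(4, 1), (2, 1), (1, 2), (1, 4)]      # slopes 0.25, 0.5, 2, 4

pos = len(hull) - 1  # start from the rightmost hull point
positions: List[int] = []
for dx, dy in diagonals:
    pos = walk_left(hull, pos, dx, dy)  # the index never moves to the right
    positions.append(pos)
print(positions)  # → [3, 2, 1, 0]
```

As the diagonal becomes steeper, the index only decreases, which is what the amortized analysis below relies on.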

Using the potential method we can show that updating the farthest points takes $O(b)$ amortized time. Let the potential $\Phi_k$ after adding the $k$-th gene be the sum of relative indexes of the farthest points over all the blocks. As there are $b$ blocks of size $b$, the sum of relative indexes lies between 0 and $b^2$. Thus, $\Phi_k = O(b^2)$. For an update of all $b - 1$ blocks except $c$ we need to make $t_k = b - 1 + z$ operations of comparing two points, where $z$ is the number of times the farthest points were updated. This can take up to $\Theta(b^2)$ time in the worst case. However, note that the potential change is equal to $-z + O(b)$: the sum of indexes is decreased by the number of times the farthest points were updated, plus $O(b)$ for the block $c$, where the index can go from 0 to $b - 1$. This gives an amortized cost of the $k$-th iteration of $a_k = t_k + \Phi_k - \Phi_{k-1} = b - 1 + z - z + O(b) = O(b)$. The total real cost of $K$ iterations is $\sum_k t_k = \sum_k a_k + \Phi_0 - \Phi_K = O(Kb) + O(b^2) = O(Kb)$, which means that the amortized cost of one iteration is $O(b)$.

Taken together, the algorithm finds all cumulative enrichment scores $s_r(\pi[1..k])$ in $O(K\sqrt{K})$ time. The straightforward implementation of calculating cumulative values from scratch would take $O(K \cdot K \log K)$ time. Thus, we have improved the performance $O(\sqrt{K} \log K)$ times.

### 4.3 Implementation details

We also implemented an optimization that does not build the convex hull from scratch for the changed block *c*, but only updates the changed points. This does not affect asymptotic performance, but decreases the constant factor.

First, we start updating the convex hull from position $r_k$ and not from 1. To be able to do this, we keep an array prev that for each gene $g \in \pi$ stores the previous point on the convex hull if $g$ were the last gene in the block. This is the same as the top of the stack in the Graham algorithm and represents the algorithm's state for any given point. As all points $h$ to the left of $g$ are not changed, $\mathrm{prev}_h$ also remains unchanged and need not be recalculated.

Second, we stop updating the hull when we reach a point that was on the convex hull at the previous iteration. We can do this because every point to the left of *g* is rotated counterclockwise relative to any point to the right of *g*, which means that the first point on the convex hull to the right of *g* at the (*k* − 1)-th iteration remains a convex hull point at the *k*-th iteration.

## 5 Experimental results

To assess the algorithm's performance we ran it on a T-cell differentiation dataset [4]. The ranking was obtained from differential gene expression analysis for Naive vs. Th1 states using *limma* [2]. From those results we selected the 12000 genes with the highest mean expression levels.

As the pathway database we used Reactome [1]. There were 586 gene sets whose overlap with the selected genes had a size of 15 to 500 (common gene set size limits for preranked gene set enrichment analysis).

We compared the algorithm to the reference GSEA implementation [3] version 2.1.

Experiments were run on an Intel Core i3 2.10GHz processor. Both methods were run in one thread.

We ran reference GSEA with default parameters. The permutation number was set to 1000, which means that for each input gene set 1000 independent samples were generated. The run took 100 seconds and resulted in 79 gene sets with a GSEA-adjusted FDR q-value of less than $10^{-2}$. All significant gene sets were in the positive mode.

First, to get a similar accuracy of nominal p-values we ran the FGSEA algorithm with 1000 permutations. This took 2 seconds, but resulted in no significant hits after multiple testing correction (with FDR ≤ 1%). The same effect required the GSEA authors to develop a custom method of approximate FDR correction.

Second, to get a similar number of significant hits FGSEA was run with 10000 permutations. It took 9 seconds and resulted in 78 gene sets with a BH-adjusted p-value of less than $10^{-2}$. It is important to note that, unlike for GSEA, 3 of these 78 gene sets were in the negative mode.

Last, we ran FGSEA with a similar total running time. Within 90 seconds FGSEA was able to do 100000 permutations. This resulted in 77 gene sets with a BH-adjusted p-value of less than $10^{-2}$ (3 sets were in the negative mode). The minimal nominal p-value was $1.23 \cdot 10^{-5}$.
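For reference, the Benjamini-Hochberg adjustment applied to nominal p-values can be sketched as follows. This is the textbook step-up procedure written out in Python for illustration, not code from the FGSEA package, and the input p-values are made-up numbers.

```python
from typing import List, Sequence


def bh_adjust(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Visit p-values from the largest to the smallest.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for offset, i in enumerate(order):
        rank = m - offset  # 1-based rank of pvalues[i] in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min  # enforce monotonicity of adjusted values
    return adjusted


nominal = [0.01, 0.04, 0.03, 0.005]
print(bh_adjust(nominal))
```

Because each nominal p-value is multiplied by the number of tests divided by its rank, the smallest adjusted value is bounded below by the smallest attainable nominal p-value, which in turn depends on the number of permutations.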

## 6 Conclusion

Preranked gene set enrichment analysis is a widely used tool in the analysis of gene expression data. However, current implementations are slow because of the large amount of sampling. Here we present an algorithm that decreases the number of required samples while keeping the same accuracy of nominal p-values. This makes it possible to obtain more accurate p-values in less time. Consequently, gene sets can be ranked more precisely in the results and, which is even more important, standard multiple testing correction methods can be applied instead of approximate ones as in [3].

## Availability

The FGSEA method is implemented as a package for R and is available at https://github.com/ctlab/fgsea along with the example data and corresponding results.

## Acknowledgements

This work was supported by Government of Russian Federation Grant 074-U01. The author thanks Gennady Korotkevich for the idea of the algorithm.