
An algorithm for fast preranked gene set enrichment analysis using cumulative statistic calculation

Alexey A. Sergushichev
doi: https://doi.org/10.1101/060012
Computer Technologies Department, ITMO University, Saint Petersburg, 197101, Russia
  • For correspondence: alserg@rain.ifmo.ru

Abstract

Gene set enrichment analysis is a widely used tool for analyzing gene expression data. However, current implementations are slow because a large number of samples is required for the analysis to have good statistical power. In this paper we present a novel algorithm that efficiently reuses one sample multiple times and thus speeds up the analysis. We show that it is possible to make hundreds of thousands of permutations in a few minutes, which leads to very accurate p-values. This, in turn, allows applying standard FDR correction procedures, which are more accurate than the ones currently used. The method is implemented as an R package and is freely available at https://github.com/ctlab/fgsea.

1 Introduction

Gene set enrichment analysis is a very widely used method for analyzing gene expression data. It allows one to select, from an a priori defined list of gene sets, those which behave non-randomly in the considered experiment. Compared to the similar method of calculating Fisher p-values for an overlap statistic, it does not require arbitrary thresholding. This also allows the method to identify pathways that contain many co-regulated genes with small individual effects.

The method has a major drawback of being relatively slow. Because the analytical form of the null distribution for the gene set enrichment statistic is not known, an empirical null distribution has to be calculated. That can be done in a straightforward manner by sampling random gene sets. However, a large number of gene sets are usually tested simultaneously. This requires a large number of samples for the test to have good statistical power after correction for multiple testing.

In the original paper [3] Subramanian et al. developed an ad-hoc method for multiple testing correction. However, the developed method is approximate for the commonly used parameters and it is unclear how accurate it is.

Here we present a fast gene set enrichment analysis (FGSEA) method which is much faster than the original method [3] in finding nominal p-values. The method is based on an algorithm for calculating cumulative gene set enrichment statistic values, which makes it possible to rapidly calculate multiple sample statistic values from a single sample. The ability to get accurate nominal p-values in a reasonable time makes it possible to use well-developed general methods for multiple testing correction such as Bonferroni or Benjamini-Hochberg.

The rest of the paper is structured as follows. First, in section 2 we formally define the gene set enrichment statistic and introduce related definitions. In section 3 we explain the idea of the algorithm on a simple mean statistic. Then, in section 4 we show how the GSEA statistic can be interpreted geometrically and present the algorithm for fast computation of cumulative values that follows from this interpretation. Finally, in section 5 we show how the algorithm works in practice and how it compares to the reference implementation.

2 Definitions

The preranked gene set enrichment analysis takes as input two objects: an array of gene statistic values S and a list of query gene sets P. The goal of the analysis is to determine which of the gene sets from P behave non-randomly.

The gene statistic array S of size |S| = N contains, for each gene i, 1 ≤ i ≤ N, a value Si ∈ ℝ that characterises the behavior of the gene in the considered process. Commonly, Si > 0 means that the expression of gene i goes up in treatment compared to control, and Si < 0 means that the expression goes down. The absolute value |Si| represents the magnitude of the change. Array S is sorted in decreasing order: Si > Sj for i < j. In practice N is about 10000–20000.

The list of gene sets P of length |P| = m usually contains groups of genes that are commonly regulated in some biological process. In this paper we assume that all gene sets p ∈ P have a size upper bound of K ≈ 500 genes: |p| ≤ K, ∀p ∈ P.

To quantify the co-regulation of genes in a gene set p, Subramanian et al. introduced a gene set enrichment score function sr(p) that uses gene rankings (values of S). The more positive the value of sr(p), the more enriched the gene set is in positively regulated genes g with Sg > 0. Accordingly, a negative sr(p) corresponds to enrichment in negatively regulated genes.

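The value of sr(p) can be calculated as follows. Let k = |p|, NS = Σi∈p|Si|. Let also ES be an array specified by the following formula:

$$ES_i = \begin{cases} 0 & \text{if } i = 0, \\ ES_{i-1} + \frac{1}{NS}|S_i| & \text{if } 1 \le i \le N \text{ and } i \in p, \\ ES_{i-1} - \frac{1}{N-k} & \text{if } 1 \le i \le N \text{ and } i \notin p. \end{cases}$$

The value of sr(p) corresponds to the largest by absolute value entry of ES:

$$s_r(p) = ES_{i^*}, \quad \text{where } i^* = \arg\max_i |ES_i|.$$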
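The running-sum definition can be sketched in Python as follows. This is a minimal illustration rather than the fgsea implementation: the function name `enrichment_score` is hypothetical, gene sets are assumed to be given as 0-based positions in S, and it is assumed that 0 < k < N and NS > 0.

```python
from typing import Iterable, Sequence


def enrichment_score(S: Sequence[float], p: Iterable[int]) -> float:
    """Return s_r(p): the entry of the running sum ES with the largest absolute value.

    S is assumed to be sorted in decreasing order; p holds 0-based positions in S.
    """
    members = set(p)
    N, k = len(S), len(members)
    NS = sum(abs(S[i]) for i in members)
    es = 0.0      # current value ES_i
    best = 0.0    # entry with the largest absolute value seen so far
    for i in range(N):
        if i in members:
            es += abs(S[i]) / NS      # a gene from the set: step up
        else:
            es -= 1.0 / (N - k)       # a gene outside the set: step down
        if abs(es) > abs(best):
            best = es
    return best
```

For example, with S = [4, 3, 2, 1] the set of the two top genes scores 1 and the set of the two bottom genes scores -1.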

For each p ∈ P we need to find the enrichment statistic value and to calculate a p-value, that is, how likely such a value is to arise at random. To calculate a p-value for gene set p we can obtain an empirical null distribution by sampling n random gene sets of the same size as p.

Such a straightforward implementation takes Θ(mnK log K) time: for each of the m gene sets we need to generate n random gene sets, and for each of them the enrichment statistic value has to be calculated, which can be done in Θ(K log K) time.
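A sketch of this straightforward procedure for a single gene set is given below. The names `es_sorted` and `naive_pvalue` are hypothetical. The score is evaluated only just before and just after each gene of the set, which gives the Θ(K log K) bound after sorting. The p-value convention used here (one-sided in the direction of the observed score, with a +1 pseudo-count) is an assumption made for illustration; the text above does not fix one.

```python
import random
from typing import Sequence


def es_sorted(S: Sequence[float], positions: Sequence[int]) -> float:
    """s_r for a gene set given as 0-based positions in S, in O(k log k) time."""
    q = sorted(positions)
    N, k = len(S), len(q)
    NS = sum(abs(S[i]) for i in q)
    hits = 0.0
    top, bottom = 0.0, 0.0
    for j, pos in enumerate(q):       # j genes of the set precede position pos
        misses = pos - j              # genes outside the set among S[0..pos-1]
        bottom = min(bottom, hits / NS - misses / (N - k))   # just before the hit
        hits += abs(S[pos])
        top = max(top, hits / NS - misses / (N - k))         # just after the hit
    return top if abs(top) > abs(bottom) else bottom


def naive_pvalue(S: Sequence[float], gene_set: Sequence[int], n: int,
                 rng: random.Random) -> float:
    """Empirical p-value from n independent random gene sets of the same size."""
    k = len(gene_set)
    observed = es_sorted(S, gene_set)
    null = [es_sorted(S, rng.sample(range(len(S)), k)) for _ in range(n)]
    if observed >= 0:
        extreme = sum(1 for s in null if s >= observed)
    else:
        extreme = sum(1 for s in null if s <= observed)
    return (extreme + 1) / (n + 1)    # assumed convention, see the note above
```

Repeating `naive_pvalue` for each of the m gene sets reproduces the Θ(mnK log K) cost discussed above.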

3 Cumulative statistic calculation for mean statistic

Let us first describe the idea of the proposed algorithm on a simple statistic of mean rank sm:

$$s_m(p) = \frac{1}{k}\sum_{i \in p} S_i.$$

The idea of the algorithm is to reuse sampling for different query gene sets. This can be done due to the fact that for an estimation of null distributions samples have to be independent only for a specific gene set size. Samples can be dependent between different sizes.

Instead of generating nm independent random gene sets, one for each permutation and each gene set, we will generate only n random gene sets of size K. Let πi be the i-th random gene set of size K. From that gene set we can generate gene sets for all the query pathways Pj by using its prefix: πi,j = πi[1‥|Pj|].

The next step is to calculate enrichment scores for all generated gene sets πi,j. Instead of calculating enrichment scores separately for each gene set we will calculate simultaneously scores for all πi,j for a fixed i. Using a simple procedure it can be done in Θ(K) time.

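Let us find enrichment scores for all prefixes of πi. This can be done by element-wise dividing of the cumulative sums array by the length of the corresponding prefix:

$$s_m(\pi_i[1..k]) = \frac{1}{k}\sum_{j=1}^{k} S_{\pi_i[j]}, \quad 1 \le k \le K,$$

where πi[j] denotes the j-th gene of πi.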

Selecting only the required prefixes takes an additional Θ(m) time.

The described procedure makes it possible to find p-values for all query gene sets in Θ(n(K+m)) time. This is about min(K, m) times faster than the straightforward procedure.
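The prefix trick for the mean statistic can be sketched as follows. The function names are hypothetical, genes are 0-based positions in S, and the null values are collected per distinct gene set size, under the assumption stated above that samples only need to be independent within one size.

```python
import random
from itertools import accumulate
from typing import Dict, List, Sequence


def cumulative_means(S: Sequence[float], sample: Sequence[int]) -> List[float]:
    """Mean statistic of every prefix sample[:k], k = 1..K, in O(K) total time."""
    prefix_sums = accumulate(S[g] for g in sample)
    return [total / k for k, total in enumerate(prefix_sums, start=1)]


def null_means_by_size(S: Sequence[float], sizes: Sequence[int], n: int,
                       rng: random.Random) -> Dict[int, List[float]]:
    """Draw n random gene sets of size K = max(sizes) and reuse each for all sizes."""
    K = max(sizes)
    null: Dict[int, List[float]] = {k: [] for k in set(sizes)}
    for _ in range(n):
        sample = rng.sample(range(len(S)), K)   # one random gene set of size K
        means = cumulative_means(S, sample)     # scores of all K prefixes at once
        for k in null:
            null[k].append(means[k - 1])        # pick the prefix of the needed size
    return null
```

Each of the n iterations costs O(K) for the cumulative sums plus the time to read off the required sizes, which matches the Θ(n(K+m)) bound.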

4 Cumulative statistic calculation for GSEA statistic

For the GSEA statistic we use a similar idea: we again sample only gene sets of size K and from each sample calculate statistic values for all the other sizes. However, calculation of the cumulative statistic values for the subsamples is more complex in this case.

In this section we consider only the positive mode of the enrichment statistic, $s_r^+(p)$. It can be defined as follows:

$$s_r^+(p) = ES_{i^+}, \quad \text{where } i^+ = \arg\max_i ES_i.$$

Calculation of $s_r^-(p) = ES_{i^-}$ for $i^- = \arg\min_i ES_i$ is very similar. From these two values it is easy to find the value of $s_r(p)$, which is equal to $s_r^+(p)$ if $|s_r^+(p)| > |s_r^-(p)|$, or $s_r^-(p)$ otherwise.

4.1 Geometric interpretation of GSEA statistic

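It is helpful to look at the GSEA statistic from a geometric point of view. Let us consider N + 1 points (Fig. 1) with coordinates (xi, yi) for 0 ≤ i ≤ N such that:

$$(x_0, y_0) = (0, 0), \qquad (1)$$

$$x_i - x_{i-1} = [i \notin p], \qquad (2)$$

$$y_i - y_{i-1} = [i \in p] \cdot |S_i|, \qquad (3)$$

where 1 ≤ i ≤ N in (2) and (3), and [·] equals 1 if the condition in brackets holds and 0 otherwise.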

Fig. 1.

A graph that corresponds to a calculation of the GSEA statistic. Each breakpoint on the graph corresponds to a gene present in the pathway. The dotted lines cross at the point which is the farthest up from the diagonal (dashed line). This point corresponds to gene i+, where the maximal value of ESi is reached.

The calculation of $s_r^+(p)$ corresponds to finding the point farthest up from the diagonal ((x0, y0), (xN, yN)). Indeed, it is easy to see that xN = N − |p| = N − k and yN = Σj∈p|Sj| = NS, while the individual enrichment scores ESi can be calculated as

$$ES_i = \frac{1}{NS}\, y_i - \frac{1}{N-k}\, x_i.$$

The value of ESi is proportional to the directed distance from the line going through (x0, y0) and (xN, yN) to the point (xi, yi).
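The geometric view can be checked numerically with a short sketch. The names `gsea_points` and `es_from_points` are hypothetical, gene sets are 0-based positions in S, and the set is assumed to be non-empty, smaller than N and with non-zero total weight.

```python
from typing import Iterable, List, Sequence, Tuple


def gsea_points(S: Sequence[float], p: Iterable[int]) -> Tuple[List[int], List[float]]:
    """Coordinates (x_i, y_i), 0 <= i <= N: x counts misses, y accumulates |S| of hits."""
    members = set(p)
    xs, ys = [0], [0.0]
    for i, s in enumerate(S):
        hit = i in members
        xs.append(xs[-1] + (0 if hit else 1))
        ys.append(ys[-1] + (abs(s) if hit else 0.0))
    return xs, ys


def es_from_points(xs: Sequence[int], ys: Sequence[float]) -> List[float]:
    """ES_i = y_i / NS - x_i / (N - k), using x_N = N - k and y_N = NS."""
    x_end, y_end = xs[-1], ys[-1]
    return [y / y_end - x / x_end for x, y in zip(xs, ys)]
```

The directed distance from the diagonal to (xi, yi) equals (xN·yi − yN·xi) divided by the length of the diagonal, which is ESi multiplied by a positive constant, so the point with the largest ESi is also the point farthest up from the diagonal.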

Let us fix a sample π of length K. To efficiently calculate cumulative values $s_r^+(\pi[1..k])$ for k ≤ K we need a fast method of updating the farthest point when a new gene is added. In that case we can add genes from π one by one and calculate values $s_r^+(\pi[1..k])$ from the corresponding maximal distances.

Because we are calculating values for π[1‥k] for k ≤ K we know in advance which K genes will be added. This allows us to consider K + 1 points instead of N + 1 for each iteration k. Let array o contain the order of genes in π, that is, $o_1$ is the minimal among π, $o_2$ is the second minimal and so on. The coordinates can be calculated as follows:

$$(x_0^k, y_0^k) = (0, 0), \qquad (4)$$

$$x_i^k - x_{i-1}^k = o_i - o_{i-1} - [o_i \in \pi[1..k]], \qquad (5)$$

$$y_i^k - y_{i-1}^k = [o_i \in \pi[1..k]] \cdot |S_{o_i}|, \qquad (6)$$

where we set $o_0$ to be zero.

It can be shown that finding the farthest up point among (4)–(6) is equivalent to finding the farthest up point among (1)–(3) with $(x_i^k, y_i^k)$ being equal to $(x_{o_i}, y_{o_i})$ calculated for p = π[1‥k].

Consider $x_{o_i} - x_{o_{i-1}}$. By the definition of x it is equal to:

$$x_{o_i} - x_{o_{i-1}} = \sum_{j=o_{i-1}+1}^{o_i} [j \notin \pi[1..k]] = o_i - o_{i-1} - \sum_{j=o_{i-1}+1}^{o_i} [j \in \pi[1..k]].$$

By the definition of o, in the interval $[o_{i-1}+1,\ o_i-1]$ there are no genes from π and, thus, from π[1‥k]. Thus we can replace the sum with its last member:

$$x_{o_i} - x_{o_{i-1}} = o_i - o_{i-1} - [o_i \in \pi[1..k]].$$

We got the same difference as in (5).

Now consider $y_{o_i} - y_{o_{i-1}}$. By the definition of y it is equal to:

$$y_{o_i} - y_{o_{i-1}} = \sum_{j=o_{i-1}+1}^{o_i} [j \in \pi[1..k]] \cdot |S_j|.$$

Again, in the interval $[o_{i-1}+1,\ o_i-1]$ there are no genes from π[1‥k]. Thus we can replace the sum with only the last member:

$$y_{o_i} - y_{o_{i-1}} = [o_i \in \pi[1..k]] \cdot |S_{o_i}|.$$

We got the same difference as in (6).

We do not need to consider other points, because the points $o_{i-1}, \dots, o_i - 1$ have the same y coordinate and $o_{i-1}$ is the leftmost of them. Thus, when at least one gene is added the diagonal ((x0, y0), (xN, yN)) is not horizontal and $o_{i-1}$ is the farthest up point among $o_{i-1}, \dots, o_i - 1$.
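The reduction to K + 1 points can be sketched as follows. The names `compressed_points` and `positive_es` are hypothetical, and genes are 0-based positions in S, so the value −1 plays the role that o0 = 0 plays in the 1-based notation of the text. The sketch recomputes everything for a given k and is meant only to illustrate the equivalence, not the fast update.

```python
from typing import List, Sequence, Tuple


def compressed_points(S: Sequence[float], sample: Sequence[int],
                      k: int) -> Tuple[List[int], List[float]]:
    """K + 1 points for the prefix sample[:k], one per gene of the whole sample."""
    active = set(sample[:k])
    order = sorted(sample)        # o_1 < o_2 < ... < o_K as 0-based positions
    xs, ys = [0], [0.0]
    prev = -1                     # counterpart of o_0 = 0 in 1-based notation
    for g in order:
        hit = g in active
        xs.append(xs[-1] + (g - prev) - (1 if hit else 0))    # difference (5)
        ys.append(ys[-1] + (abs(S[g]) if hit else 0.0))       # difference (6)
        prev = g
    return xs, ys


def positive_es(S: Sequence[float], sample: Sequence[int], k: int) -> float:
    """Positive-mode score of sample[:k], evaluated only at the K + 1 points."""
    xs, ys = compressed_points(S, sample, k)
    x_end = len(S) - k            # x_N = N - k
    y_end = ys[-1]                # y_N = NS: every hit is among the K points
    return max(y / y_end - x / x_end for x, y in zip(xs, ys))
```

For S = [5, 4, 3, 2, 1, −1, −2, −3] and the sample [6, 1, 4], the prefix of length 2 gives the points (0, 0), (1, 4), (4, 4), (5, 6) and a positive-mode score of 0.5, the same value a full walk over all N genes produces.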

4.2 Square root decomposition

Let us consider what happens when gene πk is added to the query set π[1‥k−1] (Fig. 2). Let rk be the rank of gene πk among the genes of π. Then the coordinates of points (xi, yi) for i < rk do not change, while all (xi, yi) for i ≥ rk are shifted by $(\Delta x, \Delta y) = (-1, |S_{\pi_k}|)$.

Fig. 2.

Update of a GSEA graph when gene πk ≈ 800 is added. Only a fragment is shown. The black graph corresponds to the GSEA graph for gene set π[1‥k−1], the gray graph corresponds to π[1‥k]. The part of the graph to the left of the added gene $\pi_k$ does not change and the other part is shifted towards the top-left corner. The diagonal ((x0, y0), (xN, yN)) is rotated counterclockwise.

To make fast incremental updates we decompose the problem into multiple smaller ones. For simplicity we assume that K + 1 is an exact square of an integer b, that is, K + 1 = b². Let us split the K + 1 points into b consecutive blocks of size b: $\{(x_0^k, y_0^k), \dots, (x_{b-1}^k, y_{b-1}^k)\}$, $\{(x_b^k, y_b^k), \dots, (x_{2b-1}^k, y_{2b-1}^k)\}$ and so on.

For each of the b blocks we will store and update the farthest up point from the diagonal. When we know the farthest point of each block we can find the globally farthest point by a simple pass in O(b) time.

Next, we show how to update the farthest points in blocks in amortized time O(b). Taken together with one O(b) pass, this gives an algorithm that updates the globally farthest point in amortized O(b) time.

Below we use $c = \lfloor r_k / b \rfloor$ as the index of the block to which gene πk belongs.

First, we describe the procedure for updating point coordinates. We store the xi coordinates using two vectors: B of size b and D of size K + 1, such that $x_i = B_{\lfloor i/b \rfloor} + D_i$. When gene πk is added all xi for i ≥ rk are decremented by one. To reflect this we decrement all Bj for j > c and decrement all Di for rk ≤ i < (c + 1)b, that is, for the remaining points of block c. It is easy to see that this takes O(b) time. After this update procedure we can get the value xi in O(1) time. The same procedure is applied to the y coordinates.
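The two-vector representation can be sketched as a small class. The name `BlockedOffsets` is hypothetical, points are indexed from 0, and block j is assumed to cover indices jb to (j + 1)b − 1, so the partially affected block c ends just before index (c + 1)b.

```python
import math
from typing import List, Sequence


class BlockedOffsets:
    """Values v_i = B[i // b] + D[i] with an O(b) update 'add delta to all v_i, i >= r'."""

    def __init__(self, values: Sequence[float], b: int) -> None:
        self.b = b
        self.B: List[float] = [0] * math.ceil(len(values) / b)   # per-block offsets
        self.D: List[float] = list(values)                        # per-point values

    def add_suffix(self, r: int, delta: float) -> None:
        c = r // self.b                                  # block containing index r
        for j in range(c + 1, len(self.B)):              # whole blocks right of c
            self.B[j] += delta
        for i in range(r, min((c + 1) * self.b, len(self.D))):   # tail of block c
            self.D[i] += delta

    def get(self, i: int) -> float:
        return self.B[i // self.b] + self.D[i]           # O(1) lookup
```

With nine values 0..8 and b = 3, calling `add_suffix(4, -1)` touches one block offset and two point values, after which the stored values read 0, 1, 2, 3, 3, 4, 5, 6, 7.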

Second, for each block we maintain the upper part of its convex hull. Having the convex hull is useful because the farthest point in a block always lies on its convex hull. In all blocks except c the points are either unchanged or all shifted by the same vector. That means that the lists of points on the convex hulls of these blocks remain unchanged. For block c we can reconstruct the convex hull from scratch using the Graham scan algorithm. Because the points are already sorted by x coordinate, this reconstruction takes O(b) time. In total, it takes O(b) time to update the convex hulls.
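For points already sorted by x coordinate, the upper hull can be built with a single stack pass. The sketch below uses the monotone-chain variant of the Graham scan; the function names are hypothetical and the points are plain (x, y) tuples.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Cross product of vectors o->a and o->b; negative for a clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def upper_hull(points: Sequence[Point]) -> List[Point]:
    """Upper convex hull of points sorted by x, left to right, in O(len(points))."""
    hull: List[Point] = []
    for p in points:
        # Drop the last hull point while it lies on or below the segment hull[-2] -> p.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull
```

Each point is pushed once and popped at most once, which gives the linear bound for rebuilding the hull of one block.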

Third, the farthest points in blocks can be updated using the stored convex hulls. Consider a block where the convex hull was not changed (every block except, possibly, block c). Because the diagonal always rotates in the same counterclockwise direction, the farthest point in a block on iteration k either stays the same or moves along the convex hull to the left of the farthest point on the (k − 1)-th iteration. Thus, for each such block we can compare the current farthest point with its left neighbor on the convex hull and update the point if necessary. This is repeated until the next neighbor is closer to the diagonal than the current farthest point. In block c we just find the farthest point in a single pass over the points on the convex hull.
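The pointer walk on an unchanged hull can be sketched as follows. The names `height` and `advance_farthest` are hypothetical, and for simplicity the diagonal is represented only by its direction vector (dx, dy) with dx > 0, which is enough to compare points of one block because a common shift changes all heights in the block by the same amount.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def height(point: Point, diag: Point) -> float:
    """Directed distance of point above a line with direction diag, up to a positive factor."""
    (x, y), (dx, dy) = point, diag
    return dx * y - dy * x


def advance_farthest(hull: Sequence[Point], idx: int, diag: Point) -> int:
    """Move the farthest-point index to the left while the left neighbor is farther up."""
    while idx > 0 and height(hull[idx - 1], diag) > height(hull[idx], diag):
        idx -= 1
    return idx
```

On the hull (0, 0), (1, 3), (3, 4), (4, 0), a diagonal with direction (4, 1) has its farthest point at index 2; after the diagonal rotates counterclockwise to direction (4, 3), one comparison step moves the index to 1 and the next comparison stops the walk.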

Using the potential method we can show that updating the farthest points takes O(b) amortized time. Let the potential after adding the k-th gene, $\Phi_k$, be the sum of relative indexes of the farthest points over all the blocks. As there are b blocks of size b, the sum of relative indexes lies between 0 and b². Thus, $0 \le \Phi_k \le b^2$. For an update of all b − 1 blocks except c we need to make $t_k = b - 1 + z$ operations of comparing two points, where z is the number of times the farthest points were updated. This can take up to Θ(b²) time in the worst case. However, it can be noticed that the potential change $\Delta\Phi_k = \Phi_k - \Phi_{k-1}$ is equal to −z + O(b): the sum of indexes is decreased by the number of times the farthest points were updated, plus O(b) for the block c, where the index can go from 0 to b − 1. This gives an amortized cost of the k-th iteration of $c_k = t_k + \Delta\Phi_k = b - 1 + z - z + O(b) = O(b)$. The total real cost of K iterations is $\sum_{k=1}^{K} t_k = \sum_{k=1}^{K} c_k - (\Phi_K - \Phi_0) = O(Kb) + O(b^2) = O(Kb)$, which means that the amortized cost of one iteration is O(b).

Taken together, the algorithm finds all cumulative enrichment scores sr(π[1‥k]) in $O(K\sqrt{K})$ time. The straightforward implementation of calculating cumulative values from scratch would take $O(K \cdot K \log K) = O(K^2 \log K)$ time. Thus, we have improved the performance $O(\sqrt{K} \log K)$ times.

4.3 Implementation details

We also implemented an optimization that does not build the convex hull from scratch for the changed block c, but only updates the changed points. This does not affect the asymptotic performance, but decreases the constant factor.

First, we start updating the convex hull from position rk and not from 1. To be able to do this, we keep an array prev that for each gene g ∈ π stores the previous point on the convex hull that would be obtained if g were the last gene in the block. This is the same as the top of the stack in the Graham algorithm and represents the algorithm's state at any given point. As all points h to the left of g are not changed, prevh also remains unchanged and does not need to be recalculated.

Second, we stop updating the hull when we reach a point that was on the convex hull at the previous iteration. We can do this because every point to the left of g is rotated counterclockwise relative to any point to the right of g, which means that the first point on the convex hull to the right of g on the (k − 1)-th iteration remains a convex hull point at the k-th iteration.

5 Experimental results

To assess the algorithm performance we ran it on a T-cell differentiation dataset [4]. The ranking was obtained from differential gene expression analysis for Naive vs. Th1 states using limma [2]. From these results we selected the 12000 genes with the highest mean expression levels.

As a pathway database we used the Reactome database [1]. There were 586 gene sets whose overlaps with the selected genes had a size of 15 to 500 (common gene set size limits for preranked gene set enrichment analysis).

We compared the algorithm to the reference GSEA implementation [3] version 2.1.

Experiments were run on an Intel Core i3 2.10GHz processor. Both methods were run in one thread.

We ran the reference GSEA with default parameters. The permutation number was set to 1000, which means that for each input gene set 1000 independent samples were generated. The run took 100 seconds and resulted in 79 gene sets with a GSEA-adjusted FDR q-value of less than 10⁻². All significant gene sets were in the positive mode.

First, to get a similar accuracy of nominal p-values we ran the FGSEA algorithm with 1000 permutations. This took 2 seconds, but resulted in no significant hits after multiple testing correction (with FDR ≤ 1%). The same effect required the GSEA authors to develop a custom approximate method for FDR correction.

Second, to get a similar number of significant hits FGSEA was run with 10000 permutations. It took 9 seconds and resulted in 78 gene sets with a BH-adjusted p-value of less than 10⁻². It is important to note that, unlike for GSEA, 3 of these 78 gene sets were in the negative mode.

Last, we ran FGSEA with a similar total running time. Within 90 seconds FGSEA was able to do 100000 permutations. This resulted in 77 gene sets with a BH-adjusted p-value of less than 10⁻² (3 sets were in the negative mode). The minimal nominal p-value was 1.23 · 10⁻⁵.
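The BH adjustment used in these comparisons can be sketched in a few lines. This is a generic textbook version of the Benjamini-Hochberg procedure with a hypothetical function name, not code taken from the fgsea package or from the R function it may rely on.

```python
from typing import List, Sequence


def bh_adjust(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank_from_top, i in enumerate(order):
        rank = m - rank_from_top                  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min                 # enforce monotonicity from the top
    return adjusted
```

The smallest adjusted value is never below the smallest nominal p-value, and it approaches that bound only when many gene sets share comparably small p-values. Assuming the smallest attainable nominal p-value is on the order of one over the number of permutations, this is consistent with the observation above that 1000 permutations were not enough to pass a 1% threshold for 586 gene sets.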

6 Conclusion

Preranked gene set enrichment analysis is a widely used tool in the analysis of gene expression data. However, current implementations are slow due to a lot of sampling. Here we present an algorithm that decreases the number of required samples while keeping the same accuracy of nominal p-values. This makes it possible to achieve more accurate p-values in less time. Consequently, gene sets can be ranked more precisely in the results and, which is even more important, standard multiple testing correction methods can be applied instead of approximate ones as in [3].

Availability

The FGSEA method is implemented as a package for R and is available at https://github.com/ctlab/fgsea along with the example data and corresponding results.

Acknowledgements

This work was supported by Government of Russian Federation Grant 074-U01. The author thanks Gennady Korotkevich for the idea of the algorithm.

References

  1. Joshi-Tope, G., Gillespie, M., Vastrik, I., D’Eustachio, P., Schmidt, E., de Bono, B., Jassal, B., Gopinath, G.R., Wu, G.R., Matthews, L., Lewis, S., Birney, E., Stein, L.: Reactome: a knowledgebase of biological pathways. Nucleic Acids Research 33(Database issue), D428–32 (2005)
  2. Ritchie, M.E., Phipson, B., Wu, D., Hu, Y., Law, C.W., Shi, W., Smyth, G.K.: limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research pp. gkv007– (2015)
  3. Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S., Mesirov, J.P.: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America 102(43), 15545–50 (2005)
  4. Wei, G., Wei, L., Zhu, J., Zang, C., Hu-Li, J., Yao, Z., Cui, K., Kanno, Y., Roh, T.Y., Watford, W.T., Schones, D.E., Peng, W., Sun, H.W., Paul, W.E., O’Shea, J.J., Zhao, K.: Global mapping of H3K4me3 and H3K27me3 reveals specificity and plasticity in lineage fate determination of differentiating CD4+ T cells. Immunity 30(1), 155–67 (2009)
Posted June 20, 2016.