LargeQvalue: A Program for Calculating FDR Estimates with Large Datasets

Andrew Anand Brown
doi: https://doi.org/10.1101/010074
*Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
†NORMENT, KG Jebsen Centre for Psychosis Research, Institute of Clinical Medicine, University of Oslo, Norway

Abstract

This is an implementation of the R statistical software qvalue package [Dabney et al., 2014], designed for use with large datasets where memory or computation time is limiting. In addition to estimating p values adjusted for multiple testing, the software outputs a script which can be pasted into R to produce diagnostic plots and parameter estimates. This program runs almost 30 times faster and requires substantially less memory than the qvalue package when analysing 10 million p values on a high performance cluster. The software has been used to control for multiple testing across the 390 million tests performed when analysing a full cis scan of RNA-seq exon level gene expression from the Eurobats project [Brown et al., 2014]. The source code and links to executable files for Linux and Mac OS X can be found here: https://github.com/abrown25/qvalue. Help for the package can be found by running ./largeQvalue --help.

1. Introduction

Recent developments have allowed us to investigate the action of DNA in a new, hypothesis free framework. The ability to collect information on DNA at millions of loci has led to a reappraisal of statistical methods to correct for multiple testing: it has been argued that in many contexts traditional measures of significance such as the Bonferroni correction are too conservative, leading to an unacceptably high rate of falsely accepted null hypotheses. The false discovery rate (FDR) is an alternative way of accounting for multiple testing, which tolerates a higher false positive rate in order not to discard potentially interesting findings [Benjamini and Hochberg, 1995]. It can, simplistically, be viewed as the expected proportion of falsely rejected null hypotheses among all rejections; the concept was further developed in Storey and Tibshirani [2003] and Storey et al. [2004], and implemented in the R package qvalue [Dabney et al., 2014].

Advances in sequencing have accelerated our ability to perform experiments of even greater scope. Whereas genotyping arrays ascertain DNA at around a million SNPs, the 1000 Genomes project [1000 Genomes Project Consortium, 2012] has uncovered around 100 million variants in the human genome. Application of sequencing to produce molecular phenotypes has also revolutionised how we can interrogate cellular function with few preconceived notions; for example, sequencing of RNA allows us to construct hundreds of thousands of phenotypes, each capturing the expression of a given exon [Lappalainen et al., 2013]. This permits a full, hypothesis free investigation of the genetics of gene regulation. Together, these developments imply that a full genome-wide scan for all possible variants affecting exon expression would involve ∼10^13 hypothesis tests.

The qvalue package [Dabney et al., 2014], implemented in R [R Core Team, 2013], allows estimation of false discovery rates, as well as the quantity π0, an estimate of the proportion of tests for which the null hypothesis is true. However, R programs usually run slower and consume more memory than programs written in lower level languages, rendering the qvalue package impractical when the number of tests is of the order of 10^13. Here I present an implementation of qvalue in a compiled language, which runs faster, uses less memory, and is capable of scaling up to the experiments produced by these new sequencing technologies.

2. Methods

2.1. Brief description of qvalue algorithms

The qvalue package implements the algorithms described in Storey [2002], Storey and Tibshirani [2003] and Storey et al. [2004]. It takes as input a set of p values and maps these onto q values, which can be interpreted as follows: a decision rule which rejects all null hypotheses with corresponding q values less than α has an estimated false discovery rate of α. Taking an ordered set of p values p_1 ≤ p_2 ≤ … ≤ p_n, this mapping is defined as:

\hat{q}(p_i) = \min_{i \le j \le n} \frac{\hat{\pi}_0 \, n \, p_j}{j} \qquad (1)
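To make the mapping concrete, here is a minimal R sketch of Equation 1 (an illustration of the definition, not the program's internal code); p is a vector of p values and pi0 a given estimate of π0:

qvalues <- function(p, pi0 = 1) {
  n <- length(p)
  o <- order(p)
  r <- pi0 * n * p[o] / seq_len(n)  # pi0 * n * p_j / j for the ordered p values
  q <- rev(cummin(rev(r)))          # take the minimum over all j >= i
  q[order(o)]                       # return q values in the original order
}

qvalues(c(0.001, 0.5, 0.02), pi0 = 0.9)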

The quantity π0 is the proportion of null hypotheses amongst all tests, and is a central difference between the methods used to control the FDR in Benjamini and Hochberg [1995] and the qvalue approach. Its estimation is described in Sections 2.4, 2.5 and 2.6.

2.2. Implementation

largeQvalue is written mainly in the D programming language [Alexandrescu, 2010]. D is a compiled language which promises runtime speeds of the same order as C or C++ while guaranteeing memory safety, and it integrates easily with code written in C. Leveraging this, parts of the program are written in C; in particular we use the GNU Scientific Library (http://www.gnu.org/software/gsl/) functions for fitting cubic splines and for sampling from the binomial distribution. The program is released as standalone binaries for Linux and Mac OS X, as well as source code which can be downloaded from a GitHub repository (https://github.com/abrown25/qvalue) and compiled on Windows. The source code is released under the GNU General Public Licence v3 (http://www.gnu.org/copyleft/gpl.html). The program is called from the command line with options set using parameter flags; a description of all available options can be found by running ./largeQvalue --help.

2.3. Input and output file formats and commands

The input format for this program is a whitespace-delimited (i.e. either tab or space separated) text file in which the p values to be adjusted for multiple testing lie within one column. Consecutive runs of whitespace are collapsed to one whitespace character. The file can either be specified using the --input flag (i.e. --input PValueFile) or, following UNIX convention, taken as the last argument on the command line. The column containing the p values can be specified with the --col flag; by default it is assumed to be the first column. If the --header flag is not present, it is assumed the file does not have a header line. Any data that cannot be converted to numerical form will be treated as missing (the corresponding q value estimate will be written as NA). If any numerical values lie outside the interval [0, 1], the program will terminate with the message: Some p values are outside [0, 1] interval.
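As an illustration, a hypothetical input file QTLresults.txt with p values in the fourth column might look like this:

SNP    gene   beta   pvalue
rs123  GENE1  0.42   0.0013
rs456  GENE1  -0.11  0.47
rs789  GENE2  0.03   NA

It would be analysed with ./largeQvalue --header --col 4 QTLresults.txt; the NA in the last row cannot be converted to numerical form, so its q value would be written as NA.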

The standard output file is the input file with an additional column appended, giving the q value [Storey, 2002] corresponding to the p value in that row. If a filename for the output is not specified using the --out flag, output is written to stdout.

An optional parameter file is written if specified using the --param flag. This file contains the estimate of π0 used to calculate q values, as well as other intermediate parameters. The file can also be run in R, either by pasting it at the R prompt or by running the command source("ParameterFileName") from the R terminal. This script will produce diagnostic plots, as seen in Figure 1 and described in Sections 2.5 and 2.6. For this functionality, the graphing package ggplot2 [Wickham, 2009] is required.

Figure 1.

(a) Example of the diagnostic plot produced with the standard options: the dots represent estimates of the proportion of null hypotheses, π̂0(λ), for various values of λ. These values should converge as λ approaches 1; the point of convergence is estimated using a spline function (black line), and the resulting estimate of π0 is marked by the red line. (b) Example of the diagnostic plot with the --boot option. Again, estimates π̂0(λ) are shown (blue dots). The uncertainty of these estimates is assessed by taking 100 bootstrap samples of the data; the distribution of the resulting parameter estimates is shown by the box plots. The estimate of π0 used to calculate q values (red line) is the value that minimises the mean squared error with respect to the lowest estimate of π̂0(λ) calculated on the original data (blue line).

2.4. Estimation of π0

One method to estimate the proportion of null hypotheses is to choose a value λ ∈ [0, 1) and assume that every p > λ comes from the null hypothesis (an assumption which becomes more plausible as λ tends to 1). Then, assuming p values generated by the null hypothesis are uniformly distributed, this proportion can be reflected back across the whole interval. This corresponds to estimates of π0 as a function of λ:

\hat{\pi}_0(\lambda) = \frac{\#\{p_i > \lambda\}}{n(1 - \lambda)} \qquad (2)
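A minimal R sketch of this estimator (Equation 2); the p values here are simulated purely for illustration:

pi0_lambda <- function(p, lambda) {
  # proportion of p values above lambda, reflected back over the unit interval
  mean(p > lambda) / (1 - lambda)
}

p <- runif(1e5)                   # placeholder p values, for illustration only
lambda <- seq(0, 0.9, by = 0.05)  # the default --lambda sequence
pi0_hat <- sapply(lambda, function(l) pi0_lambda(p, l))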

Using the qvalue package in R, the user can specify a value of λ to be used to estimate π0 according to Equation 2. The equivalent command in the largeQvalue package uses the flag --lambda to specify this value. More sophisticated methods, also implemented in the R package, use a sequence of λ values to produce a range of estimates of π0. Information from all of these values is combined either by fitting a cubic spline to these estimates [Storey and Tibshirani, 2003] or by taking bootstrap samples of the p values [Storey et al., 2004]. The implementation of both of these options in largeQvalue is described in the following two sections.

2.5. Estimation of π0 with cubic splines

The methodology to estimate π0 using cubic splines was described in Storey and Tibshirani [2003], and is used when the specified values of λ form a sequence rather than a single number (this is true by default) and the --boot flag is not given.

This option takes a sequence of λ values, specified by --lambda START,STOP,STEP (by default 0 to 0.9 in steps of 0.05). Estimates π̂0(λ) are calculated at every value of λ, and a cubic spline is fitted to these points to estimate the asymptote as λ → 1. The estimate of π0 used is the value of this cubic spline at the maximum value of λ.

The amount of smoothing supplied by the spline can be adjusted using the --coef flag (default 5, must be greater than 3). If the --log flag is specified, the spline is fitted to the log π̂0(λ) values instead of π̂0(λ).
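Continuing the R sketch from Section 2.4, the idea can be illustrated as follows (a sketch only: largeQvalue itself uses the GSL spline routines, and the degrees of freedom here are an assumption chosen to mimic the R qvalue package):

fit <- smooth.spline(lambda, pi0_hat, df = 3)   # smooth the pi0 estimates
pi0 <- min(predict(fit, x = max(lambda))$y, 1)  # spline value at lambda = 0.9, capped at 1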

Figure 1a shows the output of the parameter file processed by R when this methodology is used. The dots show the estimates of π0 at various values of λ; the black line is a cubic spline fitted to these points, used to approximate the function π0(λ). The estimate π̂0 used to calculate q values is shown by the red line.

largeQvalue uses the spline fitting algorithm implemented in the GSL, which differs slightly from the implementation used by the qvalue package. Exact replication of qvalue results can be achieved by running the program twice. First, run the program with the --param flag specified and paste the parameter output into R. A line in the output, Estimate of pi0 from qvalue package is:, gives the qvalue estimate of π0. The analysis can then be run again, specifying this value with the --pi0 flag.
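For example, with a hypothetical input file and π0 estimate:

./largeQvalue --param ParameterFile QTLresults.txt
(paste ParameterFile into R and note the reported estimate, say 0.953)
./largeQvalue --pi0 0.953 QTLresults.txt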

2.6. Estimation of π0 from bootstrap samples

Storey et al. [2004] proposed a method of estimating π0 based on bootstrap samples of the p values. This is used when largeQvalue is run with the flag --boot. The seed for the generation of the samples can be set with the flag --seed (default 0).

Again, a sequence of λ values is given by --lambda START,STOP,STEP, and used to calculate estimates π̂0(λ). Then, we take B = 100 bootstrap samples B_j of the p values with replacement and calculate estimates π̂0^{B_j}(λ) on these new datasets. The implementation of the bootstrap differs from that used in the qvalue package: to save memory and computation time, instead of resampling individual p values we sample counts of p values lying within the intervals defined by the sequence of λ values from a multinomial distribution. This is statistically equivalent to resampling the p values themselves.

The estimate π̂0(λ*) is used to calculate q values, where λ* is chosen to minimise the estimated mean squared error given by Equation 3:

\widehat{MSE}(\lambda) = \frac{1}{B} \sum_{j=1}^{B} \left[ \hat{\pi}_0^{B_j}(\lambda) - \min_{\lambda'} \hat{\pi}_0(\lambda') \right]^2 \qquad (3)
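Continuing the R sketch from Section 2.4, the multinomial bootstrap and the choice of λ* can be illustrated as follows (a sketch of the idea, not the program's GSL-based code):

B <- 100
n <- length(p)
# counts of p values in the intervals defined by the lambda sequence
counts <- hist(p, breaks = c(lambda, 1), plot = FALSE)$counts
# resample interval counts instead of individual p values: one column per bootstrap sample
boot <- rmultinom(B, size = n, prob = counts / n)
min_pi0 <- min(pi0_hat)  # lowest estimate of pi0 on the original data
mse <- sapply(seq_along(lambda), function(k) {
  above <- colSums(boot[k:length(lambda), , drop = FALSE])  # bootstrap counts with p > lambda[k]
  mean((above / (n * (1 - lambda[k])) - min_pi0)^2)
})
pi0 <- pi0_hat[which.min(mse)]  # the estimate at lambda*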

Figure 1b shows the diagnostic plot when this method is applied. Estimates π̂0(λ) are shown by the blue dots, with the minimum of these estimates shown by the blue line. The bootstrap estimates π̂0^{B_j}(λ) are shown in the box plots, and the mean squared error by the dashed line. The chosen estimate of π0, which minimises this mean squared error, is shown by the red line.

2.7. Other parameter flags

The flags --help and --version will bring up the help page and print the version number of the software respectively. The flag --issorted informs the program that the p values are already in ascending order with no missing values: especially with large datasets this can speed up the program considerably.

The flag --param allows the user to specify an output file for estimates of parameters. When only a single value of λ is specified, this file contains just the point estimate of π0. When λ is specified as a sequence, these parameters include the estimate of π0 used to calculate q values and the estimates π̂0(λ) for every value in this sequence. If the cubic spline method is employed, the smoothed cubic spline values are reported at every value of λ; for the bootstrap method, the file contains all bootstrap estimates π̂0^{B_j}(λ) and estimates of the mean squared error. Examples of these files can be found in Supplementary Materials S2 and S3. These files can be pasted into R to produce diagnostic plots, examples of which can be seen in Figures 1a and 1b.

The --pi0 flag allows the estimate of π0 to be specified directly. This can be useful to exactly recreate q values produced using the cubic spline routine from the qvalue package.

The --robust flag produces q values which are more conservative for very small values of p. Instead of the q values defined by Equation 1, we use the formula:

\hat{q}(p_i) = \min_{i \le j \le n} \frac{\hat{\pi}_0 \, n \, p_j}{j \left( 1 - (1 - p_j)^n \right)} \qquad (4)
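In the R sketch of Equation 1 given earlier, this only changes the denominator of the quantity over which the running minimum is taken:

qvalues_robust <- function(p, pi0 = 1) {
  n <- length(p)
  o <- order(p)
  r <- pi0 * n * p[o] / (seq_len(n) * (1 - (1 - p[o])^n))  # Equation 4
  q <- rev(cummin(rev(r)))  # minimum over all j >= i
  q[order(o)]               # back to the original order
}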

2.8. An example command

./largeQvalue --header --col 4 --out OutputFile --param ParameterFile --lambda 0,0.9,0.05 --robust QTLresults.txt

This command reads p values from the 4th column of QTLresults.txt, writes the results to OutputFile and the estimated parameter values to ParameterFile. Values of λ from 0 to 0.9 in steps of 0.05 are used to estimate the proportion of null hypotheses (the standard settings in qvalue), and the q value estimates are made robust for small p values.

2.9. Comparison of speed and memory usage of largeQvalue and qvalue package

We compared the CPU time and memory used by largeQvalue with those of the qvalue implementation in R. Each program was run five times against four datasets; these consisted of text files with a single column of either 5 × 10^5, 10^6, 5 × 10^6, or 10^7 numbers drawn uniformly from the interval [0, 1]. The following R code was used to call the qvalue algorithm with the cubic spline method:

library(qvalue)
Pvalues <- scan(file = "InputFile")
write(qvalue(Pvalues)$qvalues, file = "Qvalues", ncolumns = 1)

For the bootstrap method the following code was used:

library(qvalue)
Pvalues <- scan(file = "InputFile")
write(qvalue(Pvalues, pi0.method = "bootstrap")$qvalues, file = "Qvalues", ncolumns = 1)

The corresponding commands for the largeQvalue package were:

./largeQvalue --out Qvalues InputFile

and

./largeQvalue --boot --out Qvalues InputFile

2.10. List of options

Usage: largeQvalue [options]

Options:

  • --help: Print help and quit

  • --version: Print version and quit

  • --header: Input has header line (default = FALSE)

  • --boot: Apply bootstrap method to find π0 (default = FALSE)

  • --seed: Set seed for generating bootstrap samples (default = 0, equivalent to gsl default)

  • --coef: Set the number of coefficients used to fit cubic spline (default = 5, must be greater than 3)

  • --log: Smoothing spline applied to log π0 values (default = FALSE)

  • --robust: Produce q values which are more conservative for small p values (default = FALSE)

  • --pi0: Use value of π0 given (useful for recreating qvalue package results)

  • --lambda: Either a fixed value or a sequence given as START,STOP,STEP (default = 0,0.9,0.05)

  • --param: Print out parameter list to given file

  • --out: File to write results to (default = stdout)

  • --input: File to read p values from (if not specified explicitly, the last argument after all options have been parsed is used)

  • --issorted: File has already been sorted with no missing values (default = FALSE)

  • --col: Column with p values (default = 1)

2.11. Binary executables

The binary for 64-bit Linux can be found here: ftp://ftp.sanger.ac.uk/pub/resources/software/largeqvalue/largeQvalue.tar.gz. The binary for Mac OS X can be found here: ftp://ftp.sanger.ac.uk/pub/resources/software/largeqvalue/largeQvalue_mac.tar.gz.

2.12. Building from source

To compile an executable binary from source (for example, if you are using Windows), it is necessary to have both a D compiler (we have used the ldc compiler, as it produces faster programs) and the GSL library installed on your system. Binaries for the ldc compiler can be downloaded from https://github.com/ldc-developers/ldc/releases, and the GSL software library can be found at http://www.gnu.org/software/gsl/. Once both are installed, you can clone the GitHub repository by running git clone https://github.com/abrown25/qvalue.git. Then change to the qvalue directory and run make. If you see an error stating /usr/lib/libgsl.a: No such file or directory, the GSL is installed in a different folder on your system; edit the first line of the makefile to reflect where the library can be found, then rerun make. If installation has succeeded, running ./largeQvalue should bring up the help information. If there are any issues with this procedure, please contact andrew.brown{at}unige.ch.

You may instead wish to compile with the reference D compiler, DMD (especially if you are working on a Windows system). This can be downloaded from http://dlang.org/download.html. The only difference from the previous procedure is that you must run make dmd instead of make.

3. Results

3.1. CPU time and memory usage of largeQvalue and the qvalue package

To compare the performance of the two implementations of the qvalue algorithm, we analysed four different datasets five times with each of the programs on the High Performance Cluster at the Wellcome Trust Sanger Institute (further details in Section 2.9). Table S1 lists the CPU time and maximum memory reported by the cluster. The CPU time is shown in Figure 2. The largeQvalue package requires less time to analyse the data, and the difference increases with the number of tests: when there are 10^7 p values, the largeQvalue program is 27 times faster than the qvalue package when comparing median running times.

Figure 2.

Graph showing the CPU time needed to run largeQvalue compared with the qvalue package in R.

We do not believe that the cluster reports reliable measures of memory usage, especially as the largeQvalue program runs for a relatively brief time and is allocating memory for much of it. However, when comparing maximum reported memory use across all runs with 10^7 p values, this program uses less than a quarter of the memory of the qvalue package.

3.2. Application to find tissue specific eQTL

The program was used to look for evidence of cell type specific eQTL in the skin expression data from the Eurobats project [Brown et al., 2014]; we are currently preparing a manuscript based on this work. In brief, a measure of cell type composition was constructed, quantifying the proportions of dermis and epidermis in each sample. Then, for every gene and every SNP within 1 Mbp of that gene, we tested for interactions between SNP and cell type measure affecting the expression of that gene (as quantified by RNA-seq), a total of 390,537,631 tests. We ran largeQvalue on these p values. Although the high multiple testing burden meant that we found no significant hits, we estimated π0 as 0.961, suggesting that 3.9% of cis SNPs have different effects on expression in the dermis compared with the epidermis. The program took 1 hour, 25 minutes and 55 seconds of CPU time, and used 3.99 GB of memory.

4. Conclusion

The qvalue package has been widely adopted to control for multiple testing; the two papers proposing its methodology [Storey and Tibshirani, 2003, Storey et al., 2004] have been cited over 5,600 times, and the procedure and its implementation have become widely accepted. Its integration within the R framework makes it easy for results to be processed and plotted, and it is capable of analysing the results of genome-wide association studies or studies of differential gene expression (one context in which the methodology was originally proposed). However, it is doubtful that it scales up to cope with the results of modern cellular sequencing experiments, which can test hundreds of thousands of phenotypes for association with tens of thousands of SNPs (when concentrating on the cis window). For this we propose our implementation, which has already been used to analyse such data.

5. Acknowledgements

I would like to thank Leo Parts and Ana Viñuela for their comments and suggestions which have greatly changed this article for the better. Andrew Brown has been supported by a grant from the South-Eastern Norway Health Authority (No. 2011060).

References

  1. 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature, 491(7422):56–65, 2012.
  2. Andrei Alexandrescu. The D Programming Language. Addison-Wesley Professional, 1st edition, 2010. ISBN 9780321635365.
  3. Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological), pages 289–300, 1995.
  4. Andrew Anand Brown, Alfonso Buil, Ana Viñuela, Tuuli Lappalainen, Hou-Feng Zheng, J Brent Richards, Kerrin S Small, Timothy D Spector, Emmanouil T Dermitzakis, and Richard Durbin. Genetic interactions affecting human gene expression identified by variance association mapping. eLife, 3, 2014.
  5. Alan Dabney and John D. Storey, with assistance from Gregory R. Warnes. qvalue: Q-value estimation for false discovery rate control, 2014. R package version 1.34.0.
  6. Tuuli Lappalainen, Michael Sammeth, Marc R Friedländer, Peter A.C. 't Hoen, Jean Monlong, Manuel A Rivas, Mar Gonzàlez-Porta, Natalja Kurbatova, Thasso Griebel, Pedro G Ferreira, et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature, 2013.
  7. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2013. URL http://www.R-project.org/.
  8. John D Storey. A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3):479–498, 2002.
  9. John D Storey and Robert Tibshirani. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences, 100(16):9440–9445, 2003.
  10. John D Storey, Jonathan E Taylor, and David Siegmund. Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(1):187–205, 2004.
  11. Hadley Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009. ISBN 978-0-387-98140-6. URL http://had.co.nz/ggplot2/book.
Posted October 06, 2014.
Subject Area

  • Bioinformatics