LargeQvalue: A Program for Calculating FDR Estimates with Large Datasets

This is an implementation of the R statistical software qvalue package [Dabney et al., 2014], designed for use with large datasets where memory or computation time is limiting. In addition to estimating p values adjusted for multiple testing, the software outputs a script which can be pasted into R to produce diagnostic plots and report parameter estimates. This program runs almost 30 times faster and requests substantially less memory than the qvalue package when analysing 10 million p values on a high performance cluster. The software has been used to control for the multiple testing of 390 million tests when analysing a full cis scan of RNA-seq exon level gene expression from the Eurobats project [Brown et al., 2014]. The source code and links to executable files for linux and Mac OSX can be found here: https://github.com/abrown25/qvalue. Help for the package can be found by running ./largeQvalue --help.


Introduction
Recent developments have allowed us to investigate the action of DNA in a new, hypothesis free framework.The ability to collect information on DNA at millions of loci has led to a reappraisal of statistical methods to correct for multiple testing: it has been argued that in many contexts traditional measures of significance such as Bonferroni are too conservative, leading to an unacceptably high rate of falsely accepting null hypotheses.
The false discovery rate is an alternative way of accounting for multiple testing, which 2 LARGEQVALUE tolerates a higher false positive rate in order not to discard potentially interesting findings [Benjamini and Hochberg, 1995].It can simplistically be viewed as the proportion of falsely rejected null hypotheses, and the concept was further developed in Storey and Tibshirani [2003] and Storey et al. [2004], and implemented in the R package qvalue [Dabney et al., 2014].
Advances in sequencing have accelerated our ability to perform experiments of even greater scope.Whereas genotyping arrays ascertain DNA at around a million SNPs, the 1000 Genomes project [1000Genomes Project Consortium, 2012] has uncovered around 100 million variants in the human genome.Application of sequencing to produce molecular phenotypes has also revolutionised how we can interrogate cellular function with few preconceived notions; for example sequencing of RNA allows us to construct hundreds of thousands of phenotypes, each capturing the expression of a given exon [Lappalainen et al., 2013].This permits a full, hypothesis free investigation of the genetics of gene regulation.
Together, these developments imply that a full genome-wide scan for all possible variants affecting exon expression would involve ∼ 10 13 hypothesis tests.
The qvalue package [Dabney et al., 2014], implemented in R [R Core Team, 2013], allows estimation of false discovery rates, as well as the quantity π0, which is an estimate of the proportion of the tests for which the null hypothesis is true.However, R programs usually run slower and consume more memory than programs written in lower level languages, rendering the qvalue package impractical when the number of tests is of the order 10 13 .
Here I present an implementation of qvalue in a compiled language, which runs faster, uses less memory, and is capable of scaling up to the experiments produced by these new sequencing technologies.

2.1.
Brief description of qvalue algorithms.The qvalue package implements the algorithms described in Storey [2002], Storey and Tibshirani [2003] and Storey et al. [2004].It takes as input a set of p values and maps these onto q values, which can be interpreted as follows: a decision rule which rejects all null hypotheses with corresponding q values less than α has an estimated false discovery rate of α.Taking an increasing set of p values p1..pn this mapping is defined as: (1) qi = min(min i ≥i The quantity π0 is the proportion of null hypotheses amongst all tests, and is a central difference between the methods used to control the FDR in Benjamini and Hochberg [1995] and the qvalue approach.Its estimation is described in Sections 2.4, 2.5 and 2.6. 2.2.Implementation.largeQvalue is written mainly in the D programming language [Alexandrescu, 2010].D is a compiled language which promises runtime speeds of the same order as C or C++ while guaranteeing memory safety, and also allows easy integration with code written in C. convention the filename is taken as the last argument on the command line.The column containing the p values can be specified with the --col flag; by default it is assumed to be the first column.If the --header flag is not present, then it is assumed the file does not have a header line.Any data that cannot be converted to numerical form will be treated as missing (the corresponding q value estimate will be written as NA).If any numerical values lie outside the interval [0, 1], then the program will terminate with the message Some p values are outside [0, 1] interval.
The standard output file is the input file with an additional column on the end which specifies the q value [Storey, 2002] corresponding to the p value in that row.If a filename for the output is not specified using the --out flag, then output is written to stdout.
. An optional parameter file is written, if specified using the --param flag.This file contains the estimate of π0 used to calculate q values as well as other intermediate parameters.This file can also be run in R, either by pasting it at the R prompt or by running the command source("ParameterFileName") from the R terminal.This script will produce diagnostic plots, as seen in Figure 1 and described in Sections 2.5 and 2.6.
For this functionality, the graphing package ggplot2 [Wickham, 2009] is required.
2.4.Estimation of π0.One method to estimate the proportion of null hypotheses is to choose a value λ ∈ [0, 1) and assume that every p > λ is from the null hypothesis (obviously this assumption is more valid as λ tends to 1).Then, assuming uniform p values generated by the null hypothesis, this proportion can be reflected back across the interval.This corresponds to estimates of π0 as a function of λ: ( Using the qvalue package in R, the user can specify a value of λ to be used to estimate π0 according to Equation 2. The equivalent command in the largeQvalue package uses the flag --lambda to specify this value.More sophisticated methods, also implemented in the R package, use a sequence of λ values to produce a range of estimates of π0.Information from all of these values is combined either by fitting a cubic spline to these estimates [Storey and Tibshirani, 2003] or by taking bootstrap samples of the p values [Storey et al., 2004].The implementation of both of these options in largeQvalue is described in the following two sections. 2.5.Estimation of π0 with cubic splines.The methodology to estimate π0 using cubic splines was described in Storey and Tibshirani [2003], and is used when the specified values of λ form a sequence rather than a single number (this is true by default) and the --boot flag is not given.This option takes a sequence of λ values, specified by --lambda START,STOP,STEP (default value is 0 to 0.9 in steps of 0.05).Estimates of π0(λ) are calculated at every value of λ and a cubic spline is fitted to these points to estimate the asymptote as λ → 1.The estimate of π0 used is the value of this cubic spline at the maximum value of λ.
The amount of smoothing supplied by the spline can be adjusted using the --coef flag (default 5, must be greater than 3).If the --log flag is specified then the spline is fitted to the log(π0(λ)) values instead of π0(λ).
. Figure 1a shows the output of the parameter file processed by R when this methodology is used.The dots show the estimates of π0 at various values of λ, the black line is a cubic spline fitted to these points which is used to approximate the function π0(λ).The estimate π0 used to estimate q values is shown in red line.largeQvalue uses the spline fitting algorithm implemented in the gsl, which differs slightly to the implementation used by the qvalue package.Exact replication of qvalue results can be produced by running the program twice.First, run the code with the --param flag specified.Paste the parameter output into R.A line in the output Estimate of pi0 from qvalue package is: specifies the qvalue estimate of π0.The analysis can then be run again, specifying this value with the --pi0 flag.
2.6.Estimation of π0 from bootstrap samples.Storey et al. [2004] proposed a method of estimating π0 based on bootstrap samples of the p values.This is used when largeQvalue is run with the flag --boot.The seed for the generation of the samples can be set with the flag --seed (default 0).Again, a sequence of λ values is given --lambda START,STOP,STEP which is used to calculate estimates π0(λ).Then, we take 100 bootstrap samples Bj of the p values with replacement and calculate estimates of πB j 0 (λ) on these new datasets.The implementation of the bootstrap differs from that used in the qvalue package.To save memory and computation time, instead of resampling individual p values we sample counts of p values lying within intervals defined by the sequence of λ from a multinomial distribution.This is effectively the same procedure.
The estimate π0(λ * ) is used to estimate q values, where λ * is chosen to minimise Equation 3.
(3) Mean squared error = j (π The flag --param allows the user to specify an output file to write estimates of parameters.When only a single value of λ is specified, this file only contains the point estimate of π0.When λ is specified as a sequence, these parameters include the estimate of π0 used to calculate q values and the estimates π0(λ) for every value in this sequence.If the cubic spline method is employed, then smoothed cubic splines values are reported at every value of λ.For the bootstrap method the file contains all bootstrap estimates πB j 0 (λ) and estimates of the mean square error.Examples of these files can be found in Supplementary Materials S2 and S3.These files can be pasted into R to produce diagnostic plots, examples of which can be seen in Figures 1a and 1b.
The --pi0 flag allows the estimate of π0 to be directly specified.This can be useful to exactly recreate q values produced using the cubic spline routine from the q value package.
The --robust flag produces q values which are more conservative for very small values of p. Instead of q values defined by Equation 1, we use the formula: 2.8.An example command../largeQvalue--header --col 4 --out OutputFile --param ParameterFile --lambda 0,0.9,0.05--robust QTLresults.txt The p values can be found in the 4th column of QTLresults.txt, write the results to OutputFile and the estimated parameter values to ParameterFile.Use values of λ from 0 to 0.9 in steps of 0.05 to estimate proportion of null hypotheses (standard settings in qvalue) and produce estimates of q values robust for small p values.
2.9.Comparison of speed and memory usage of largeQvalue and qvalue package.We compared the CPU time and memory used by largeQvalue with that of the qvalue implementation in R. Each program was run five times against four datasets; these consisted of text files with a single column of either 5 × 10 5 , 10 6 , 5 × 10 6 , or 10 7 uniformly distributed numbers between 0 and 1.The following R code was used to call the qvalue algorithm with the cubic spline method: library(qvalue) Pvalues <-scan(file = "InputFile") write(qvalue(Pvalues)$qvalues, file = "Qvalues", ncolumns = 1) .q q q q q q q q q q q q q q q q q q q 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 λ π 0 (a) π0 estimated using cubic spline.q q q q q q q q q q q q q q q q q q q q q q q q q q q q 0.00  For the bootstrap method the following code was used: library(qvalue) Pvalues <-scan(file = "InputFile") write(qvalue(Pvalues, pi0.method == "bootstrap")$qvalues, file = "Qvalues", ncolumns = 1) The corresponding commands for the largeQvalue package were: --lambda : Either a fixed number or a sequence given 0,0.9,0.05(default = 0,0.9,0.05)--param : Print out parameter list to given file --out : File to write results to (default = stdout) --input : File to take results from (must be specified, if not explicitly, the last parameter after all options have been parsed is used) Edit the first line of the makefile to reflect where the library can be found, then rerun the make command.Once this is done, if installation has worked then running ./largeQvaluebring up the help information.If there are any issues with this procedure, please contact andrew.brown@unige.ch.
It is possible that you wish to compile with the reference D compiler DMD (this may especially be true if you are working on a Windows system).This can be downloaded from http://dlang.org/download.html.Then, the only difference with the previous installation is you must run make dmd instead of make.

3.1.
CPU time and memory usage of largeQvalue and the qvalue package.To compare performance of the two implementations of the qvalue algorithm, we analysed four different datasets five times with both of the programs on the High Performance Cluster at the Wellcome Trust Sanger Institute (further details in Section 2.9).Table S1 lists the CPU time and maximum memory reported by the cluster.The CPU time is shown in Figure 2. The largeQvalue package requires less time to analyse the data, and the difference increases as the number of tests increases.When there are 10 7 p values, the largeQvalue program is 27 times faster than the qvalue package when comparing median running times.
We do not believe that the cluster reports reliable measures of memory usage, especially as the largeQvalue package runs for a relatively brief time, and allocates memory for much  of it.However, when comparing maximum reported memory use across all runs with 10 7 p values, this program uses less than a quarter of the memory of the qvalue package.
3.2.Application to find tissue specific eQTL.The program was used to look for evidence of cell type specific eQTL in the skin expression data from the Eurobats project [Brown et al., 2014].We are currently preparing a manuscript based on this work.In brief, a measure of cell type composition was constructed, quantifying the proportions of dermis and epidermis in each individual sample.Then, for every gene and every SNP within 1Mbp of that gene, we tested for interactions between SNP and cell-type measure which affected the expression of that gene (as quantified by RNA-seq), a total of 390,537,631 tests.We .ran largeQvalue on these p values.Although the high multiple testing burden meant that we found no significant hits, we estimated a value of π0 of 0.961, suggesting that 3.9% of cis-SNPs had different effects on dermis expression in comparison to expression within the epidermis layer.The program took 1 hour 25 mins 55 seconds of CPU time to run, and used 3.99GB of memory.

Conclusion
The qvalue package has been widely adopted to control for multiple testing; the two papers proposing its methodology [Storey andTibshirani, 2003, Storey et al., 2004] have been cited over 5,600 times.This means that the procedure and its implementation have become widely accepted.The integration within the R framework makes it easy for results to be processed and plotted.The implementation is capable of analysing the results of genome-wide association studies or studies of differential expression of genes (one context where the methodology was originally proposed).However, it is doubtful that it scales up to cope with the results coming from modern cellular sequencing experiments, which can test hundreds of thousands of phenotypes for association with tens of thousands of SNPs (if concentrating on the cis window).For this we propose our implementation, which has already been used to analyse such data.

Figure
Figure1bshows the diagnostic plot when this method is applied.Estimates π0(λ) are shown by the blue dots, with min λ π0(λ ) shown by the blue line.The bootstrap estimates πB j 0 (λ) are shown in the box plots and the mean square error by the dashed line.The chosen estimate of π0, which minimises this mean squared error, is shown by the red line.
π0 estimated using bootstrap methods.

Figure 1 .
Figure 1.(a) Example of the diagnostic plot produced by the standard options, the dots represent the estimate of the proportion of null hypotheses (π 0 ) for various values of λ.These values should converge as λ approaches 1; the point of convergence is estimated using a spline function (black line).Estimate of convergence is at the red line.(b) Example of the diagnostic plot with the --boot option.Again, estimates of π0 (λ) are shown (blue dots).The uncertainty of these estimates is calculated by taking 100 bootstrap samples of the data, the distribution of these parameter estimates is shown by the box plots.The estimate of π 0 used to calculate q values (the red line) is the value that minimises the mean squared error with the lowest estimate of π 0 calculated on the original data (blue line).

.
: Input has header line (default = FALSE) --boot : Apply bootstrap method to find π0 (default = FALSE) --seed : Set seed for generating bootstrap samples (default = 0, equivalent to gsl default) --coef : Set the number of coefficients used to fit cubic spline (default = 5, must be greater than 3) --log : Smoothing spline applied to log π0 values (default = FALSE) --robust : More robust values for small p values (default = FALSE) --pi0 : Use value of π0 given (useful for recreating qvalue package results)

Figure 2 .
Figure 2. Graph showing amount of CPU time needed to run largeQvalue compared to the qvalue package in R.

Table S1 -
Continued from previous page

Table S1 -
Continued from previous page Parameter output file using default settings.This is an example of the standard output using the --param flag and the standard, spline estimation of π0.It has been edited to solely to ensure lines fit comfortably on the page and to replace unicode characters.