Abstract
Power analysis is essential to optimize the design of RNA-seq experiments and to assess and compare the power to detect differentially expressed genes in RNA-seq data. powsimR is a flexible tool to simulate and evaluate differential expression from bulk and especially single-cell RNA-seq data, making it suitable for a priori and posterior power analyses.
Introduction
RNA-sequencing (RNA-seq) is an established method to quantify levels of gene expression genome-wide [17]. Furthermore, the recent development of very sensitive RNA-seq protocols, such as Smart-seq2 and CEL-seq [7, 18], allows transcriptional profiling at single-cell resolution, and droplet devices make single-cell transcriptomics high-throughput, allowing the characterization of thousands or even millions of single cells [9, 15, 30].
Even though technical possibilities are vast, scarcity of sample material and financial considerations are still limiting factors [31], so that a rigorous assessment of experimental design remains a necessity [1, 3]. The number of replicates required to achieve the desired statistical power is mainly determined by technical noise and biological variability [3], and both are considerably larger if the biological replicates are single cells. Crucially, genes are commonly detected in only a subset of cells, and such dropout events are thought to be rooted in the stochasticity of single-cell library preparation [8]. Thus, dropouts in single-cell RNA-seq are not a pure sampling problem that can be solved by deeper sequencing [2]. In order to model dropout rates, it is necessary to model the mean-variance relationship inherent in RNA-seq data. Even though current power assessment tools use the negative binomial or similar models that have an inherent mean-variance relationship, they do not explicitly estimate and model the observed relationship, but rather draw mean and variance separately (reviewed in [19]).
In powsimR, we have implemented a flexible tool to assess power and sample size requirements for differential expression (DE) analysis of single-cell and bulk RNA-seq experiments. For our read count simulations, we (1) reliably model the mean, dispersion and dropout distributions, as well as the relationship between those factors, from the data; (2) simulate read counts from the empirical mean-variance and mean-dropout relations, while offering flexible choices of the number of differentially expressed genes, effect sizes and DE testing method; and (3) evaluate the power over various sample sizes. We use the embryonic stem cell data from [10] to illustrate powsimR’s utility to plan and evaluate RNA-seq experiments.
powsimR
Estimation of RNA-seq Characteristics
An important step in the simulation framework is the reliable representation of the characteristics of the observed data. In agreement with others [5, 14, 16], we find that the read distribution for most genes is sufficiently captured by the negative binomial. We analyzed 18 single-cell datasets using unique molecular identifiers (UMIs) to control for amplification duplicates and 20 without duplicate control. The negative binomial provides an adequate fit for 54% of the genes for the non-UMI-methods and 39% of the genes for UMI-methods, while the zero-inflated negative binomial was adequate for only 2.8% of the genes for the non-UMI-methods. In contrast, for the UMI-methods a simple Poisson distribution fits well for some studies [25, 31] (Supplementary File S2). Furthermore, when comparing the fit of other commonly used distributions, the negative binomial was most often the best-fitting one for both non-UMI (57%) and UMI-methods (66%), while the zero-inflated negative binomial improves the fit for only 19% and 1.6% of the genes, respectively (Supplementary Figure S4). Therefore, the default sampling distribution in powsimR is the negative binomial (Figure 1); however, the user also has the option to choose the zero-inflated negative binomial.
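To make the estimated quantities concrete, the following Python sketch computes per-gene mean, dispersion and dropout rate with simple method-of-moments estimators under the negative binomial. It is an illustration only and not the estimation procedure used in powsimR: the function name and the choice of estimator are assumptions made here for clarity.

```python
from statistics import mean, variance


def estimate_nb_parameters(counts):
    """Method-of-moments estimates for one gene across cells or samples.

    Assumes the negative binomial parameterization
    variance = mean + dispersion * mean**2 and at least two observations.
    """
    m = mean(counts)
    v = variance(counts)  # unbiased sample variance
    if m > 0:
        # Solve variance = m + dispersion * m**2 for the dispersion;
        # truncate at zero when the data are not overdispersed.
        dispersion = max((v - m) / m**2, 0.0)
    else:
        dispersion = 0.0
    # Observed dropout rate: fraction of cells with a zero count.
    dropout = sum(1 for c in counts if c == 0) / len(counts)
    return {"mean": m, "dispersion": dispersion, "dropout": dropout}
```

Applied to every gene of a count table, such estimates give the mean-dispersion and mean-dropout point clouds to which a smooth curve can then be fitted.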
Figure 1. (A) The mean-dispersion relationship is estimated from RNA-seq data, which can be either single-cell or bulk data. Users can provide their own count tables or use one of our five example data sets. The plot shows the mean-dispersion relationship estimated for the Kolodziejczyk data assuming a negative binomial; the red line is the loess fit that we later use for the simulations. (B) These distribution parameters are then used to set up the simulations. For better comparability, the parameters for the simulation of differential expression are set separately. (C) Finally, the TPR and FDR are calculated. Both can be returned either as marginal estimates per sample configuration (top) or stratified according to the estimates of mean expression, dispersion or dropout rate (bottom).
Simulation of Read Counts and Differential Expression
Simulations in powsimR can be based on provided data or on user-specified parameters. We first draw the mean expression for each gene. The expected dispersion given the mean is then determined using a locally weighted polynomial regression fit of the observed mean-dispersion relationship. To capture the variability of the observed dispersion estimates, a local variability prediction band (σ = 1.96) is applied to the fit (Figure 1A). Note that using the fitted mean-dispersion relationship is the feature that critically distinguishes powsimR from other simulation tools, which draw the dispersion estimate for a gene independently of the mean. Our explicit model of mean and dispersion across genes allows us to reproduce the observed mean-variance as well as mean-dropout relationships (Supplementary Figure S2, Supplementary File S2).
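The sampling scheme described above can be sketched in a few lines of Python. This is not powsimR code. The linear stand-in for the loess fit, its coefficients, the residual spread and all function names are hypothetical, and clamping the residual to ±1.96 standard deviations is only one possible reading of the prediction band. Negative binomial counts are drawn as a gamma-Poisson mixture, and the expected dropout rate follows directly from the mean and dispersion.

```python
import math
import random


def predicted_log2_dispersion(log2_mean, intercept=1.0, slope=-0.3):
    """Hypothetical stand-in for a fitted mean-dispersion curve (log2 scale)."""
    return intercept + slope * log2_mean


def sample_poisson(rng, lam):
    """Poisson draw: Knuth's method for small lam, normal approximation otherwise."""
    if lam > 30.0:
        return max(0, int(round(rng.gauss(lam, math.sqrt(lam)))))
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_negative_binomial(rng, mu, phi):
    """NB draw with mean mu and dispersion phi (variance = mu + phi * mu**2)."""
    if mu <= 0.0:
        return 0
    if phi <= 0.0:
        return sample_poisson(rng, mu)
    lam = rng.gammavariate(1.0 / phi, mu * phi)  # gamma rate with mean mu
    return sample_poisson(rng, lam)


def nb_zero_probability(mu, phi):
    """Expected dropout rate P(count = 0) under the negative binomial."""
    if phi <= 0.0:
        return math.exp(-mu)
    return (1.0 + phi * mu) ** (-1.0 / phi)


def simulate_gene_counts(rng, mean_expression, n_cells, residual_sd=0.5):
    """Draw counts for one gene (mean_expression > 0), dispersion tied to the mean."""
    residual = rng.gauss(0.0, residual_sd)
    bound = 1.96 * residual_sd
    residual = max(-bound, min(bound, residual))  # assumption: stay inside the band
    log2_phi = predicted_log2_dispersion(math.log2(mean_expression)) + residual
    phi = 2.0 ** log2_phi
    return [sample_negative_binomial(rng, mean_expression, phi) for _ in range(n_cells)]
```

Because the dispersion is a function of the mean, the zero probability returned by `nb_zero_probability` also changes with the mean, which is why a mean-dropout relationship emerges in this sketch without a separate dropout model.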
To simulate DE genes, the user can specify the number of genes, the fraction of DE genes, and the effect sizes as log2 fold changes (LFC). For the Kolodziejczyk data, we found that a narrow gamma distribution mimicked the observed LFC distribution well (Supplementary Figure S3). The set-up for the expression levels and differential expression can be re-used for different simulation instances, allowing an easier comparison of experimental designs.
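A DE set-up of this kind might look as follows in Python. The sketch is an assumption-laden illustration and not the powsimR interface: the function names, the gamma shape and scale values, and the random assignment of fold-change direction are all hypothetical choices. Fixing the seed makes the set-up reproducible, so the same DE genes and effect sizes can be re-used across different sample size scenarios.

```python
import random


def draw_log_fold_changes(n_genes, de_fraction, shape=1.0, scale=0.5, seed=None):
    """Return one log2 fold change per gene; non-DE genes get 0.0.

    Magnitudes of DE genes are drawn from a gamma distribution
    (hypothetical shape and scale), with a random sign per gene.
    """
    rng = random.Random(seed)
    n_de = round(n_genes * de_fraction)
    de_genes = set(rng.sample(range(n_genes), n_de))
    lfc = []
    for gene in range(n_genes):
        if gene in de_genes:
            sign = rng.choice((-1.0, 1.0))
            lfc.append(sign * rng.gammavariate(shape, scale))
        else:
            lfc.append(0.0)
    return lfc


def apply_fold_change(mean_expression, lfc):
    """Mean expression in the second group, given a log2 fold change."""
    return mean_expression * 2.0 ** lfc
```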
Finally, the user can specify the number of samples per group as well as their relative sequencing depth and the number of simulations. The simulated count tables are then directly used for DE analysis. In powsimR, we have integrated eight R packages for DE analysis of bulk and single-cell data (limma [21], edgeR [22], DESeq2 [13], ROTS [24], baySeq [6], DSS [29], NOISeq [26], EBSeq [12]) and five packages that were specifically developed for single-cell RNA-seq (MAST [4], scde [8], BPSC [27], scDD [11], monocle [20]). For a review on choosing an appropriate method for bulk data, we refer to the work of others, e.g. [23]. Based on our analysis of the single-cell data from [10], using standard settings for each tool and the same simulations for all tools, we found that MAST performed best for this dataset.
Evaluating Statistical Power
Finally, powsimR integrates estimated and simulated expression differences to calculate marginal and conditional error matrices. To calculate these matrices, the user can specify nominal significance levels, methods for multiple testing correction and gene filtering schemes. Amongst the error matrix statistics, the power (True Positive Rate; TPR) and the False Discovery Rate (FDR) are the most informative for questions of experimental design. For easy comparison, powsimR plots power and FDR for a list of sample size choices either conditional on the mean expression [28] or simply as marginal values (Figure 1). For example, for the Kolodziejczyk data, 384 single cells per condition would be sufficient to detect > 80% of the DE genes with a well-controlled FDR of 5%. Given the lower sample sizes actually used in [10], our power analysis suggests that only 60% of all DE genes could be detected.
In summary, powsimR can not only estimate the sample sizes necessary to achieve a certain power, but also inform about the power to detect DE in a data set at hand. We believe that this type of posterior analysis will become more and more important when results from different studies are compared. Often enough, researchers are left to wonder why there is a lack of overlap in DE genes when comparing similar experiments. powsimR will allow the researcher to distinguish between actual discrepancies and incongruities due to lack of power.
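The marginal and conditional error rates described above can be illustrated with a short Python sketch. It assumes Benjamini-Hochberg correction and stratification by mean expression. Both are choices made here for illustration, and the function names are hypothetical rather than part of powsimR. Note that p-values are adjusted across all genes first and only then split into strata.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest to the smallest p-value to enforce monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def tpr_and_fdr(called, is_de):
    """TPR and FDR from DE calls and the simulated truth."""
    tp = sum(1 for c, t in zip(called, is_de) if c and t)
    fp = sum(1 for c, t in zip(called, is_de) if c and not t)
    fn = sum(1 for c, t in zip(called, is_de) if not c and t)
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    fdr = fp / (tp + fp) if (tp + fp) else 0.0
    return tpr, fdr


def marginal_and_conditional_rates(pvalues, is_de, mean_expression, breaks, alpha=0.05):
    """Marginal rates over all genes and rates per mean-expression stratum."""
    adjusted = benjamini_hochberg(pvalues)
    called = [q < alpha for q in adjusted]
    marginal = tpr_and_fdr(called, is_de)
    conditional = {}
    for lower, upper in zip(breaks[:-1], breaks[1:]):
        idx = [i for i, m in enumerate(mean_expression) if lower <= m < upper]
        conditional[(lower, upper)] = tpr_and_fdr(
            [called[i] for i in idx], [is_de[i] for i in idx]
        )
    return marginal, conditional
```

Averaging these rates over many simulated count tables, separately for each sample size, gives the kind of power and FDR curves that the text describes.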
Availability
The R package and associated tutorial are freely available at https://github.com/bvieth/powsimR.