Abstract
Cells from the same individual share a common genetic and environmental background and are not independent, therefore they are subsamples or pseudoreplicates. Empirically, we show this dependence across a range of cell types. Thus, single-cell data have a hierarchical structure that current single-cell methods do not address and subsequently the application of such tools leads to biased inference and reduced robustness and reproducibility. When properly simulating the hierarchical structure of single-cell data, commonly applied single-cell differential expression analysis tools exhibit highly inflated type I error rates, particularly when applied together with a batch effect correction for individual as a means of accounting for within sample correlation. As single-cell experiments increase in size and frequency, we propose applying generalized linear mixed models that include random effects for differences among persons to properly account for the correlation structure that exists among measures from cells within an individual.
The rapid evolution of single-cell technologies will enable novel interrogation of fundamental questions in biology, dramatically accelerating discoveries across many biological disciplines. Thus, researchers are developing methods that leverage or account for the unique properties of single-cell RNA sequencing (scRNA-seq) data, particularly its increased sparseness and heterogeneity compared to its bulk sequencing counterpart1–3. An important characteristic of single-cell experiments is that they result in many cells from the same individual, and therefore the same genetic and environmental background. Here we empirically document the correlation among measures from cells within an individual and demonstrate how differential expression analysis of scRNA-seq data without considering this correlation, the current common practice, violates fundamental assumptions and leads to false conclusions. Proper identification of the experimental unit (i.e., the smallest observation for which independence can be assumed) for the hypothesis is critical for proper inference. Observations nested within an experimental unit are referred to as subsamples, technical replicates, or pseudoreplicates. Pseudoreplication, or subsampling, is formally defined as “the use of inferential statistics where replicates are not statistically independent”4. There are two types of pseudoreplication commonly occurring in single-cell experiments: simple and sacrificial. Simple pseudoreplication occurs when “samples from a single experimental unit are treated as replicates representing multiple experimental units” 4,5. Sacrificial pseudoreplication occurs when “the samples taken from each experimental unit are treated as independent replicates”4,5. Pseudoreplication has been addressed repeatedly in the fields of ecology, agriculture, psychology, and neuroscience4–8 and has been acknowledged as one of the most common statistical mistakes in scientific literature9. New technologies are particular prone to this error. Thus, it is not surprising that pseudoreplication is ubiquitous in the single-cell literature. Properly identifying the right experimental unit in single-cell studies will greatly increase both robustness and reproducibility, thereby leveraging the very features that make single-cell methods powerful.
Measures from cells from the same individual should be more (positively) correlated with each other than cells from unrelated individuals. Empirically, this appears true across a range of cell types (Fig. 1). Thus, single-cell data have a hierarchical structure in which the single-cells may not be mutually independent and have a study-specific correlation (e.g., exchangeable correlation within an individual). We note, that, within a cell type, cells appear to also exhibit some correlation across individuals (Fig. 1). We hypothesize this is due to zero-inflation and the stability in functional gene expression that is needed for a cell to classify as a specific cell type (e.g., T-cells need to have some consistent signals of gene expression related to their function as T-cells). As the denominator of most statistical tests (e.g., Wald test) is a function of the variance, not accounting for the positive correlation among sampling units underestimates the true standard error and leads to false positives10,11. In addition, treating each cell as independent inflates the test degrees of freedom, making it easier to falsely reject the null hypothesis (type 1 error). Too many false positives can mask true associations, especially when multiple comparison procedures such as false discovery rate are applied. In combination, this will adversely affect downstream analyses (pathway analysis), robustness, and reproducibility – increasing the cost of science.
Single-cell studies designed to identify differentially expressed genes rarely note or address the correlation among cells from the same individual or experimental unit. Excellent reviews of the field and methodological work have largely focused on challenges presented by properly classifying cell types, multimodality, dropout, and higher noise derived from biological and technical factors. However, they fail to highlight the effect of pseudoreplication and, furthermore, publications evaluating the performance single-cell specific tools all compute the simulations as if cells were independent12–18. The result is reduced reproducibility with real data, leading to the conclusion that tools built specifically to handle single-cell data do not appear to perform better than tools created for bulk data analysis19–21. We completed a simulation study that reproduces both the inter- and intra-individual variance structures estimated from real data and documented the effect of intra-individual correlation on the type 1 error rates of the most frequently used single-cell analysis tools (Fig. 2, fig. S1). Our simulation compared methods that do and do not account for the repeated observations within an experimental unit (see Methods). We varied the number of individuals and cells within an individual. All methods considered use asymptotic approximations and admit covariates.
We observed that the generalized linear mixed model (GLMM), either employing a tweedie distribution or a two-part hurdle model with a random effect (RE) for individual, outperformed other methods across a variety of conditions (Table 1, tables S1-S4)22–28.
Specifically, among the methods that explicitly model the correlation structure, GLMM consistently had more appropriate type 1 error rate control than both generalized estimating equations (GEE1) models and nested fixed-effects models, where the latter two perform poorly for any number of subsampling until the number of independent experimental units approached 2529,30. When the number of experimental units was small, the GEE1 sandwich estimator of the variance provided standard errors that were too small and therefore inflated the type 1 error rate. Similarly, for nested fixed-effects models, the standard errors were also underestimated with standard estimation techniques (i.e., REML). The models that explicitly model the correlation structure all outperformed the methods that do not account for the lack of independence among experimental units (Table 1, tables S1-S4). All methods that treat observations as independent perform increasingly worse as the number of correlated cells increases. We note that DESeq2 regularly failed to compute in scenarios where the numbers of cells and samples were large because the geometric mean normalization method implemented requiring at least one transcript to consist completely of all non-zero values (Tables S1-S4). A particularly noteworthy approach that has been suggested to account for the within-individual correlation is applying a batch effect correction method, for which the batches are individuals. This approach had markedly increased type 1 error rates (Table 1, tables S1-S4). This is primarily because regressing out the person-specific effect as a batch effect and subsequently analyzing each cell as an independent observation will underestimate the overall variance by removing inter-individual differences while maintaining an inappropriately large number of degrees of freedom when treating cells as if they are independent.
One of the most heavily cited single-cell analysis tools, Model-based Analysis of Single-cell Transcriptomics (MAST), is a two-part hurdle model built to handle sparse and bi-modally distributed single-cell data22. Although, to our knowledge, there are no publications that employ MAST to account for pseudoreplication as discussed here, Finak et al. note that MAST “can easily be extended to accommodate random effects”22. Here, we emphasize this tool as an already well-established tool in the field and demonstrate that MAST performs exceptionally well when adjusting for individual as a random effect (i.e., MAST with RE), but no different than other tools when not doing so. This specific evaluation of MAST’s performance with and without a random effect for individual serves as a perfect example of why accounting for pseudoreplication is so important. While we do recommend computing differential expression analysis using MAST with RE, alternative methods include the tweedie GLMM or permutation testing. In order not to violate the exchangeability assumption, permutation methods must randomize at the independent experimental unit (e.g., individual) and properly account for covariates (i.e., conditional permutation). The tweedie GLMM method could be implemented using the ‘glmmTMB’ R-package23, but neither of these alternative approaches explicitly incorporate some of the single-cell specific concepts implemented in MAST (e.g., cellular detection rate). As detailed in their original manuscript, MAST models a log2(TPM + 1) gene expression matrix as a two-part generalized regression model22. Using their same notation, the addition of random effects for differences among persons is as follows: where Yig is the expression level for gene g and cell i, Zig is an indicator for whether gene g is expressed in cell i, Xi contains the predictor variables for each cell i, and Wi is the design matrix for the random effects of each cell i belonging to each individual j (i.e., the random complement to the fixed Xi). βg represents the vector of fixed-effects regression coefficients and γj represents the vector of random effects (i.e., the random complement to the fixed βg). γj is distributed normally with a mean of zero and variance . To obtain a single result for each gene, likelihood ratio or Wald test results from each of the two components are summed and the corresponding degrees of freedom for each component are added22. These tests have asymptotic χ2 null distributions and they can be summed and remain asymptotically χ2 because Zg and Yg are defined conditionally independent for each gene22.
We computed an extensive simulation-based power analysis to provide researchers estimates across a wide range of experimental conditions. This was computed using a two-part hurdle model with random effects for individuals as implemented in MAST. Increasing the number of independent experimental units (e.g., individuals) in a study is the best way to increase power to detect true differences (Fig. 3, fig. S2-S31).
Empirically, there are only marginal gains in power when more than twenty-five cells per individual are sampled. Increasing the number of cells per individual provides more precision in the estimate for an individual. However, it does not directly affect power for detecting differences across individuals such as treatment effects applied at the individual level (i.e., cases/control studies). We estimated negligible improvement power when increasing the expected number of cells per individual beyond 100 in a handful of situations (Fig. 3). We note that estimating power with more than 100 cells per individual was exceedingly slow and computationally expensive. Because 1000s of cells per individual is not atypical for single-cell experiments, tools that account for the correlation structure when analyzing these data need to be further developed to increase computational efficiency.
Most papers compare cells across very few individuals, sometimes even a single case and control (simple pseudoreplication); in the former case the estimate of the variance is possible but has wide bounds on parameter confidence intervals, and in the latter case the variance is not estimable. These power simulations indicate that the majority of published studies are underpowered (Fig. 3, fig. S2-S31). The majority of single-cell papers show a deep understanding of the underlying biology and conduct otherwise very informative experiments, appropriately landing in very high visibility journals. However, our type 1 error and power simulations document that many published studies are missing important true effects while reporting too many false positives generated via pseudoreplication. As single-cell technology continues to evolve and costs decrease, reviewers need to be aware of this issue to avoid proliferation of irreproducible results. We encourage the use of mixed models, such as the tweedie GLMM or the two-part hurdle model with a random effect (e.g., as implemented in MAST with RE), as ways of accounting for the repeated observations from an individual while being able to adjust for covariates at the individual level and, if appropriate, at the individual cell level. Finally, we note that although our focus here is on hypothesis testing for finding differentially expressed genes, the concept is applicable to all single-cell sequencing technologies such as proteomics, metabolomics, and epigenetics.
Materials and Methods
Literature Review
A PubMed search for the keywords “single-cell differential expression” returned 251 articles published in the last 3 years which were subsequently sorted and filtered by each of their abstracts. Many of the returned articles were associated with bulk RNA sequencing or completely irrelevant to differential expression analysis in single-cell and were therefore eliminated. Of the 251 original hits, 76 of them were deemed appropriate for further consideration. Of those, 10 of them were reviews, 36 of them were methods papers, and 30 of them were implementation papers. This method is not meant to be a perfect capture of all of the literature, but provides a clear snapshot of the current state of single-cell differential expression analyses. Each of the methods and implementation articles was thoroughly reviewed and tabled along with its number of citations, date of publication, and any other pertinent information such as number of independent samples, tools used, or number of cells captured.
Intra and Inter-correlation analyses
Pairwise comparisons between all cells of interest were made to compute intra- and inter-individual correlations. Genes were filtered if the average transcript-per-million (TPM) value was not greater than five. For intra-individual correlation, spearman’s correlation was computed for all possible pairs of cells within an individual. For inter-individual correlation, spearman’s correlation was computed for all possible pairs of cells from a random draw of one cell from each individual. 1,000 draws were computed. Correlations and their means were tested for differences (Fig. 1). The measures were compared in six different cell types across three different single-cell studies. These studies are publically available under the accession numbers GSE81861, GSE72056, and E-MTAB-5061. The cell type designations that were used were given by the authors of these studies.
Simulation
Means and variances were computed empirically from the transcript per million read count values previously reported in six different cell types across three different single-cell studies. Once consistent patterns were identified across cell types, alpha cells from the pancreas dataset, were used as the model data for our simulation. After removing the top percent of the most variant genes, we fit a gamma distribution to the global mean transcript per million read count values of each gene to obtain a grand mean, μi. The variance of the individual-specific means (inter-individual variance) was modeled as a quadratic function of the grand mean, f1(μi) and the within-sample variance (intra-individual variance) was simulated using a uniform distribution, U(a, b). (Fig. 2). The objective for obtaining the inter-individual variance as function of the grand mean was to simply capture the average associations between the variance and the grand mean - the uncertainty in this relationship was not of particular interest for this simulation. Using a normal distribution with an expected value of zero and a variance computed by the first quadratic relationship, f1(μi), a difference in means was drawn for each individual in the simulation. This difference was summed with the grand mean to obtain an individual mean, μij. A Poisson distribution with a λ equal to the expected number of cells desired for each individual was then used to obtain the count of cells per individual. For each cell assigned to an individual, a transcript per million read count value, Yijk, was drawn from a normal distribution with an expected value equal to the individual’s assigned mean transcript per million read count value, μij, and a variance, , drawn from a uniform distribution. The correlation structure between genes was not taken into account in this simulation and this simple process was replicated a selected number of times to obtain the desired number of genes. Due to their widespread use in the field, tSNE plots were made of the simulated data to assess how realistic the simulated data appeared and to assess the effects of altering intra-individual variance in these data (Fig S1).
Type I error
250,000 iterations of our simulation were computed for varying numbers of cells and varying numbers of individuals. The number of individuals per group was fixed at either 2, 5, 10, or 25. The number of cells per individual was drawn from a Poisson distribution with either a λ of 10, 25, 50, or 100. For each of the 250,000 iterations, the number of results that met our significance threshold were counted and the type I error was computed as the percentage of significant results. After primary analysis of the type I error using a tweedie mixed-effects model, type I error was computed with the following tools: MAST, MAST with random effects, MAST with a batch effect correction, DESeq2, Monocle, ROTS, Tweedie GLM, and a GEE1 with a Gaussian link and exchangeable correlation. All of these tools were selected because they could handle the transcript per million read count values being simulated – with the exception of DESeq2. DESeq2 requires integers and at least one gene without a zero value to compute its normalization, so as the number of samples and cells increased, the likelihood of if computing greatly decreased. We acknowledge DESeq2 is not appropriate for analyzing these data, but felt that where we could complete simulations, the tool must be addressed because of its frequent use in the field. MAST was implemented with and without the use of a random effect for individual and the remaining single-cell tools were implemented exactly as their vignettes instruct. GEE1 with exchangeable correlation was implemented to compare its performance to the mixed-effects model, particularly where the numbers of donors are low. Type I errors were computed using significance thresholds of 0.05, 0.01, 0.001, and 0.0001 (Table 1, tables S1-S4).
Power calculations
Using MAST with a random effect for individual we computed power curves to estimate how well this tool functions with varying numbers and ratios of cells and individuals. Computations were identical to the type I error analyses with exception of multiplying a constant, hereafter labeled fold change, with the global mean gene expression value of a gene to spike the expression values in one group. Power was computed at small increments between a fold change of 1 and 5, or until MAST with RE was unable to compute because of complete separation. For lowly expressed genes with high amounts of zero inflation, inference remained difficult, causing MAST with RE to asymptote out before reaching maximum power. This is just the nature of sparse single-cell data, and it cannot be avoided.
Author contributions
C.D.L. and K.D.Z conceived the study together. K.D.Z implemented simulations and analyses with guidance from C.D.L. K.D.Z wrote the original draft and reviewed and edited it with M.A.E. and C.D.L.
Competing interests
Authors declare no competing interests.
Data and materials availability
All data are publically available under the accession numbers GSE81861, GSE72056, and E-MTAB-5061. All base code is available in the supplementary materials.
Supplementary Materials
Materials and Methods
Figures S1-S31 (tSNE and power curves)
Tables S1-S4
R code – Correlation
R code – Simulation
Acknowledgments
We thank T. D. Howard, L. D. Miller, and M. Olivier (All WFU School of Medicine) for critical review of this content. This work was supported by The Center for Public Health Genomics and grant U01 NS036695 (Co-PI Langefeld) from NIH.