Pseudoreplication bias in single-cell studies; a practical solution

Cells from the same individual share a common genetic and environmental background and are not independent, therefore they are subsamples or pseudoreplicates. Thus, single-cell data have a hierarchical structure that many current single-cell methods do not address, leading to biased inference, highly inflated type 1 error rates, and reduced robustness and reproducibility. This includes methods that use a batch effect correction for individual as a means of accounting for within sample correlation. Here, we document this dependence across a range of cell types and show that ‘pseudo-bulk’ aggregation methods are overly conservative and underpowered relative to mixed models. We propose applying two-part hurdle generalized linear mixed models with a random effect for individual to properly account for both zero inflation and the correlation structure among measures from cells within an individual. Finally, we provide power estimates across a range of experimental conditions to assist researchers in designing appropriately powered studies.


Introduction
cell studies before a lack of reproducibility tarnishes single-cell's reputation and single-cell 45 technology itself is potentially thought of as unreliable.
In this study we simulate hierarchical single-cell expression data and evaluate the type 1 error rates and power of mixed models relative to some of the most frequently applied differential expression methods. We assess a number of commonly applied differential expression methods, but we primarily focus on the computation of a two-part hurdle model. This 50 model explicitly accounts for the common problem of zero-inflation in scRNA-seq data by simultaneously modeling the rate of expression and the positive expression mean 10 . Using the two-part hurdle model, we compute differential expression as it is most typically applied in the literature: without a random effect for individual. We then re-evaluate the two-part hurdle model's performance when computing differential expression with a random effect for 55 individual, and after applying a batch effect correction for individual. Additionally, we examine the type 1 error control and power of aggregation (i.e., "pseudo-bulk") methods, where gene expression values are averaged across all of the cells within an individual and the test statistic is computed on the individual means [11][12][13] . Aggregation methods are implemented to control for both zero-inflation and inter-individual heterogeneity, but represent the classic definition of 60 sacrificial pseudoreplication. In total, these simulations inform how properly accounting for the correlation structure among measures from cells within an individual will greatly increase both robustness and reproducibility, thereby leveraging the very features that make single-cell methods powerful.

Intra-individual Correlation
Measures from cells from the same individual should be more (positively) correlated with each other than cells from 70 unrelated individuals. Empirically, this appears true across a range of cell types (Fig.   1). Thus, single-cell data have a hierarchical structure in which the single-cells may not be mutually independent and have a study-75 specific correlation (e.g., exchangeable correlation within an individual). We note, that, across a cell type, cells appear to also exhibit some correlation across individuals ( Fig. 1). We hypothesize this is primarily due 80 to both zero-inflation and the stability in functional gene expression that is needed for a cell to classify as a specific cell type (e.g., T-cells need to have some consistent signals of gene expression related to their function as 85 T-cells).

Fig. 1 | Intra-individual correlation.
Box plot of the intra-and inter-individual Spearman's correlations for gene expression values across six different cell types as well as across multiple cell types in the pancreas. Cell types, along with their respective numbers of cells (Ncells) and individuals (Nindividuals), are labeled on the x axis. Mean correlation among a donor's own cells (intra-individual) is always greater than the mean correlation across individuals (interindividual). Some cell types may be more correlated than others. Cell types were designated by previous authors. The center line represents the median. The lower and upper box limits represent the 25% and 75% quantiles, respectively. The whiskers extend to the largest observation within the box limit plus or minus one and a half multiplied by the interquartile range.

Simulation
We completed a simulation study that reproduces both the inter-and intra-individual variance structures 90 estimated from real data and documented the effect of intraindividual correlation on the type 1 error rates of the most frequently used single-cell analysis tools (Fig. 2, fig. S1-4).
Our simulation captures some of the most important aspects of single-cell data ( Fig. S1-4) and was used to compare 95 methods that do and do not account for the repeated observations within an experimental unit (see Methods).
We varied the number of individuals and cells within an individual and all methods considered use asymptotic approximations and admit covariates.

Type 1 Error Evaluations
We observed that the generalized linear mixed model (GLMM), either employing a tweedie distribution or a two-part hurdle model with a random effect (RE) for individual, outperformed other methods across a variety of 105 conditions (Table 1, tables S1-S4).

Fig. 2 | Simulation workflow.
A gamma [ ( , )] distribution was fit to the global mean normalized read counts of each gene and used to obtain a grand mean, . The variance of the individual-specific means (inter-individual variance) was modeled as a linear function of the grand mean, ( ). Using a normal ( , ) distribution with an expected value of zero and a variance computed by the linear relationship, ( ), a difference in means was drawn for each individual in the simulation. This difference was summed with the grand mean to obtain an individual mean, . Within-sample dispersion was simulated as a logarithmic function of the inter-individual mean, ( ). A Poisson ( ) distribution with a λ equal to the expected number of cells desired for each individual was then used to obtain the count of cells per individual. The probability of dropout was estimated as a gamma distribution. For each cell assigned to an individual, a count, , was drawn from a negative binomial distribution.
Among the methods that explicitly model the correlation structure, GLMM consistently had better type 1 error rate control than generalized estimating equations (GEE1) models, where the latter performed poorly for all numbers of subsamples until the number of independent experimental units approached 30. However, all models that explicitly model the correlation 110 structure have more appropriate type 1 error rates than the methods that do not account for the lack of independence among experimental units (Table 1, tables S1-S4). All methods that treat observations as independent perform increasingly worse as the number of correlated cells increases (Table 1, tables S1-S4). One of the most heavily cited single-cell analysis tools, Model-  Table 1 | Type I error rates of some of the currently applied tools in single-cell analysis. Type I error rates of ten different methods under twenty different conditions and a significance threshold of p<0.05. 250,000 iterations were computed to obtain an error rate for each method. The inflated type I error rates computed with mixed models at the lower numbers of individuals per group are a consequence of the two-part hurdle model simultaneously testing two hypotheses and an overabundance of sub-sampling with small sample sizes. Type I error rates are well controlled for with mixed models and pseudo-bulk methods, while type I error rates inflate with other methods as additional independent samples or more cells are added. Pseudo-bulk methods are overly conservative.

*Default denotes MAST was implemented without random-effects, RE denotes random-effects, Corrected denotes data was batch-corrected for individual prior to analysis without using individual as a random-effect, GLM denotes generalized linear model, and GLMM denotes generalized linear mixed-effects model. **Two-part Hurdle model as implemented in MAST, Tweedie distribution as implemented in 'glmmTMB', GEE1 as implemented in 'geepack', Pseudobulk averaged or summed across cells within an individual and was implemented in DESeq2, Modified t as implemented in ROTS, and Tobit as implemented in Monocle
based Analysis of Single-cell Transcriptomics (MAST), is a two-part hurdle model built to 115 handle sparse and bi-modally distributed single-cell data 10 . Although, to our knowledge, there are no publications that employ MAST to account for pseudoreplication as discussed here, Finak et al. note that MAST "can easily be extended to accommodate random effects" 10 . When implementing MAST with a random effect for individual (i.e., MAST with RE) the type 1 error rate is well-controlled, but its type 1 error rate is just as inflated as other tools when it is not

Power Analysis
We  Single-cell studies designed to identify differentially expressed genes rarely note or address the correlation among cells from the same individual or experimental unit. Excellent reviews of the field and methodological work have largely focused on challenges presented by properly classifying cell types, multimodality, dropout, and higher noise derived from biological and technical factors. However, they fail to highlight the effect of pseudoreplication and, 160 furthermore, publications evaluating the performance single-cell specific tools all compute the simulations as if cells were independent [14][15][16][17][18][19][20] . The result is reduced reproducibility with real data, leading to the conclusion that tools built specifically to handle single-cell data do not appear to perform better than tools created for bulk data analysis [21][22][23] .
Here, we have empirically documented the correlation among measures from cells within 165 an individual for a few independent datasets and different cell types ( Figure 1). These findings imply that current practice of testing for differential expression analysis across conditions in scRNA-seq data within a cell type without considering this correlation leads to pseudoreplication. Pseudoreplication, formally defined as "the use of inferential statistics where replicates are not statistically independent", has been addressed repeatedly in both new and well- In our results, the models that explicitly parameterized the correlation structure all showed improved type 1 error control over the methods that did not account for the lack of independence among experimental units (Table 1, tables S1-S4). Furthermore, all methods that treated observations independently performed increasingly worse as the number of correlated 180 cells increased (Table 1,  false discovery rate are applied. In combination, this will adversely affect downstream analyses (pathway analysis), robustness, and reproducibilityincreasing the cost of science.
One potential method to remove inter-individual differences prior to analysis would be to apply batch effect correction prior to differential expression analysis, for which the batches are individuals. We note that applying batch effect correction techniques to correct for intra- 195 individual correlation should, in fact, be used more often in this way prior to cell-type clustering or prior to finding marker genes between cell-type clusters. However, when using these methods prior to differential expression analysis within a cell type, they show markedly increased type 1 error rates (Table 1, tables S1-S4). This is primarily because regressing out the person-specific effect as a batch effect and subsequently analyzing each cell as an independent observation will 200 underestimate the overall variance by removing inter-individual differences while maintaining an inappropriately large number of degrees of freedom when treating cells as if they are independent.
Another such method that has been recommended prior to analysis is to aggregate individual gene expression values within an individual by summing or averaging them 11-13 . Such 205 approaches, labeled "pseudo-bulk" techniques, fall under the classical definition of sacrificial pseudoreplication because they lose information. This is particularly true with respect to the within-sample variance, where heterogeneity is very likely to exist 4 . Without such information there is not a proper way to assess the significance of the difference between treatments 4 .
Additionally, in imbalanced situations, "pseudo-bulk" methods cause cells from individuals with 210 fewer cells to be more heavily weighted, where mixed models have consistent estimators and do not require balanced data 28 . The problem of imbalance is even more apparent when summing.
Overall, it has been demonstrated that mixed-effects models lead to the most accurate results when analyzing data with a hierarchical structure 6,24,25 . As we demonstrate here, aggregating values across cells from the same experimental unit will actually lead to an increased type 2 error 215 rate and decreased power ( Table 1, figure 3). This is due to an overestimation of the mean-square error relative to mixed-effects models, particularly when the intra-individual variance is larger than the inter-individual variance, as appears to be the case with scRNA-seq data 24,25 . 'Pseudobulk' techniques may control the type 1 error rates and may help remove zero-inflation, however, we strongly recommend mixed-effects models based on long-standing statistical justifications for 220 the analysis of subsamples, including increased power.
Among the methods that explicitly model the correlation structure, generalized linear mixed models (GLMM) consistently had better type 1 error rate control than generalized estimating equations (GEE1) models, where the latter performed poorly for any number of subsampling until the number of independent experimental units approached 30. When the 225 number of experimental units was small, the GEE1 sandwich estimator of the variance provided standard errors that were too small and therefore inflated the type 1 error rate 29,30 . Here, we emphasize the two-part hurdle mixed model, MAST, as an already well-established tool in the field and we demonstrate that MAST performs exceptionally well when adjusting for individual as a random effect 10 . We note, that MAST with RE is testing a two-part hypothesis that the other 230 tools are not directly testing. The discrete and continuous components being tested fit together in the sense that higher mean expression will generally correlate with a higher proportion of expressing cells, but making the assumption that the two will always relate is incorrect. There will be specific instances in the simulated data when the inter-individual means are not significantly different, but the proportion of cell dropout is significantly different (even though 235 the probability of dropout for any one gene across cells is held constant) and is driving a significant result. This will be particularly true with smaller sample sizes, and may contribute to the slightly elevated type 1 error rates with smaller sample sizes and cell counts. While we do recommend computing differential expression analysis using MAST with RE, alternative methods include the tweedie GLMM or permutation testing. In order not to violate the 240 exchangeability assumption, permutation methods must randomize at the independent experimental unit (e.g., individual) and properly account for covariates (i.e., conditional permutation). The tweedie GLMM method could be implemented using the 'glmmTMB' R-package 31 , but neither of these alternative approaches explicitly incorporate some of the singlecell specific concepts implemented in MAST (e.g., cellular detection rate). 245 We computed power analyses for a variety of sample sizes and cell amounts. Empirically, we demonstrated that increasing the amount of cells captured per sample returns very little gain in power after 100 cells per individual in most scenarios, particularly after sample sizes increase ( Figure 3, fig. S5-S7). Instead, we suggest that increasing sample size is the most efficient way to improve power ( Figure 3A). Increasing the number of cells per individual does provide more 250 precision in the estimate for an individual. However, it has limited effects on the power for detecting differences across individuals, such as differences among treatments applied to individuals (i.e., cases/control studies). We note that estimating power with more than 1,000 cells per individual is computationally expensive. Because 1,000s of cells per individual is not atypical for single-cell experiments, tools that account for the correlation structure when 255 analyzing these data need to be further developed to increase computational efficiency.
Most papers compare cells across very few individuals, sometimes even a single case and control (simple pseudoreplication). In the former case the estimate of the inter-individual variance is possible but has wide bounds on parameter confidence intervals, and in the latter case the variance is not estimable from the data. Simulations indicate that the majority of published 260 studies are underpowered (Fig. 3, fig. S5-S7). The majority of single-cell papers show a deep understanding of the underlying biology and conduct otherwise very informative experiments, appropriately landing in very high visibility journals. However, our type 1 error and power simulations document that many published studies are missing important true effects while reporting too many false positives generated via pseudoreplication.
As single-cell technology continues to evolve and costs decrease, scientists need to be aware of this issue to improve study design and avoid proliferation of irreproducible results. Our results encourage the use of mixed models, such as the two-part hurdle model with a random effect (e.g., as implemented in MAST with RE), as ways of accounting for the repeated observations from an individual while being able to adjust for covariates at the individual level 270 and, if appropriate, at the individual cell level. Additional random effects, such as sampling time, may also be included 32 . Our extensive simulation study provides valuable information for understanding the power of specific designs and can be used in grant reviews as one justification of the design and analyses employed. Finally, we note that although our focus here is on hypothesis testing for finding differentially expressed genes within a cell type across conditions, 275 the concept is applicable when comparing expression patterns between cell-types and is broadly appropriate for all single-cell sequencing technologies such as proteomics, metabolomics, and epigenetics. The cell type designations that were used were given by the authors of these studies and more 315 detailed information is provided in their respective manuscripts.

Simulation
To control for the correlation structure between genes, genes were sampled one at a time and any genes with a Spearman's correlation coefficient > 0.25 relative to the gene that was drawn were subsequently trimmed from the dataset. This was repeated until either no more 320 uncorrelated genes remained or a total of 500 uncorrelated genes were obtained, whichever  (Table 1, tables S1-S4). 360 MAST models a log(x + 1) transformed gene expression matrix as a two-part generalized regression model 10 . As in this cited manuscript, the addition of random effects for differences among persons is:

Power calculations
Using MAST with a random effect for individual, we computed power curves to estimate how well this tool functions with varying numbers and ratios of cells and individuals.
Computations were identical to the type I error analyses with exception of multiplying a constant, hereafter labeled fold change, with the global mean gene expression value of a gene to spike the expression values in one group. Power was computed at small increments between a fold change of 1 and 5 and each value was used fit a sigmoidal curve to approximate power ( Fig.   S4-S7).
Code availability 385 Data were simulated in R-3.5.1. All of the code for the simulations and the evaluation of intra-and inter-individual correlation structure is available on GitHub at kdzimmer/PseudoreplicationPaper. This base code was modified to run type I error and power analyses in parallel on Wake Forest's High Performance Computing Cluster, DEMON.