Single-cell RNA-seq differential expression tests within a sample should use pseudo-bulk data of pseudo-replicates

Single-cell RNA sequencing (scRNA-seq) has become a standard approach to investigate molecular differences between cell states. Comparisons of bioinformatics methods for the count matrix transformation (normalization) and differential expression (DE) analysis of these data have already highlighted recommendations for effective between-sample comparisons and visualization. Here, we examine two remaining open questions: (i) What are the best combinations of data transformations and statistical test methods, and (ii) how do pseudo-bulk approaches perform in single-sample designs? We evaluated the performance of 343 DE pipelines (combinations of eight types of count matrix transformations and ten statistical tests) on simulated and real-world data, in terms of precision, sensitivity, and false discovery rate. We confirm superior performance of pseudo-bulk approaches without prior transformation. For within-sample comparisons, we advise the use of three pseudo-replicates, and provide a simple R package DElegate to facilitate application of this approach.


Introduction
To learn about the molecular processes that characterize biological conditions, high-throughput profiling of the transcriptome by RNA sequencing (RNA-seq) followed by differential expression (DE) analysis has become a standard workflow in genomics. With the introduction of single-cell RNA-seq (scRNA-seq) technologies, this workflow has been expanded to the level of cell types and cell states. As these technologies have seen widespread adoption, the number of methods to perform data normalization and DE analysis has also grown. However, previous systematic evaluations of DE approaches did not cover the combinations of transformations and DE methods, leaving a knowledge gap we aim to fill here.
The main input to a scRNA-seq analysis is a gene expression count matrix. The elements of this matrix correspond to the number of molecules assigned to a particular gene and cell. For homogeneous cell populations, the vast majority of genes have very low counts (< 1 on average) and follow a Poisson distribution, while highly expressed genes show a variance larger than the mean and can be modeled as negative binomial [Grün et al., 2014, Chen et al., 2018, Hafemeister and Satija, 2019]. However, cell-to-cell variability in gene expression is obscured by technical effects like sequencing depth that may distort the observed counts by several orders of magnitude. Therefore, a considerable amount of effort has gone into developing and examining count matrix transformations ("normalization") that address the technical variability and mean-variance relationship in the data, with the aim of facilitating downstream analyses like dimensionality reduction and clustering [Hafemeister and Satija, 2019, Lause et al., 2021, Ahlmann-Eltze and Huber, 2023, Booeshaghi et al., 2022, Choudhary and Satija, 2022].
Some commonly used tools for bulk RNA-seq DE analysis have been adapted to work with the sparser input and higher number of observations in scRNA-seq (e.g., single-cell parameterizations of DESeq2 [Love et al., 2014] and edgeR [McCarthy et al., 2012]), and single-cell-specific methods have been developed (e.g., MAST [Finak et al., 2015]). On the other hand, especially with higher cell numbers, general-purpose statistical tests applied independently per gene have been used instead (e.g., randomization, Wilcoxon rank-sum test, t-test, logistic regression). As an alternative to treating each cell as an observation in the DE test, bulk RNA-seq DE methods may be used after aggregating single-cell counts (e.g., the original use case of DESeq2, edgeR, and limma [Law et al., 2014]); however, this normally requires multiple samples.
Initial comparisons of tools for single-cell DE analysis highlighted inconsistencies between methods and found that tools designed specifically for single-cell data did not necessarily outperform general-purpose methods [Soneson and Robinson, 2018]. Later benchmarks included methods that aggregate single cells, or use multi-sample designs [Pullin and McCarthy, 2022, Crowell et al., 2020, Squair et al., 2021]. Jointly, their results highlighted the importance of biological replicates and methods that account for them, and demonstrated the effectiveness of aggregating data (by creating pseudo-bulk) for DE analysis.
Here, we aimed to expand previous benchmarks of single-cell DE methods by focusing on single-sample designs and considering recently developed data transformations. We wanted to address two questions: 1. What are the best combinations of data transformations and statistical tests for single-cell DE analysis? 2. How do pseudo-bulk approaches perform in the absence of real replicates?

Figure 1: Evaluation of an exhaustive collection of DE pipelines. A) In our terminology, each pipeline is a combination of one particular data transformation method and one particular statistical test. All pipelines were evaluated jointly on simulated data and an immune data set. B) Transformations were paired with DE methods. We use a naming scheme with unique prefixes to indicate the methods. NB: negative binomial, GLM: generalized linear model, LM: linear model.
We approached the task of finding DE genes as a sequence of a data transformation step and a statistical test. In total, we combined parameterizations of eight types of count matrix transformations and ten test methods for a total of 343 DE "pipelines" (Fig. 1, Supp. Tables 1-4). We focused on comparing two groups of cells in the absence of replicates, as is commonly done when working within an individual sample (e.g., comparing cell types or cell states from one individual), and evaluated each pipeline on simulated data using precision, sensitivity, Matthews correlation coefficient, and false discovery rate. Additionally, we performed an evaluation using real-world data with independent validation data.

Results
Overall we tested 343 DE pipelines (combinations of commonly used count-data transformations and statistical tests) on four types of simulation experiments, and on one immune cell dataset.

Simulated data: Pseudo-bulk methods with pseudo-replicates have lower FDR and outperform single-cell methods
First, we sought to evaluate the DE pipelines in well-controlled settings, in which we knew beforehand which genes were truly differentially expressed and which were not. Therefore, we simulated artificial datasets of four distinct experiment types using the muscat package [Crowell et al., 2020]. For every experiment, replicate, and pipeline we determined the predicted DE genes (at an expected FDR cutoff of 0.05 based on the method's own calculated P-values) and calculated precision, sensitivity, and Matthews correlation coefficient (MCC, re-scaled per experiment, see Methods section) across all genes. Additionally, we evaluated highly and lowly expressed genes separately and calculated the true FDR, averaging across replicates. The decision to measure FDR separately based on mean gene expression was motivated by prior reports of high FDR for highly expressed genes for some methods [Squair et al., 2021] and our own observations of the mean vs. precision dependency (examples shown in Fig. 2B). We aimed to compile a summary that captured overall performance in all experiments and that highlighted methods that did well across all expression levels. Therefore, we combined the MCC, a measure of the overall agreement of the predicted DE genes with the ground truth, with the true FDR for lowly and highly expressed genes. Especially for highly expressed genes, FDR can be a problem that would not be captured by a genome-wide metric due to the low number of genes affected. The final aggregate score is then MCC + (1 − FDR_low) + (1 − FDR_high), where the MCC is the average of the rescaled MCC values for the four experiments. The FDR values are taken from the experiment with the highest number of cells, as that experiment tended to yield more false positives. See Fig. 2C and Additional Files 1&2 for complete results. The highest-ranking pipeline for every DE method is shown in Fig. 2D.
We found that the top-ranking pipelines all employed test methods designed to work on raw counts from multiple replicates of bulk data. In our single-sample setting, these tests had been run using pseudo-bulk data from random splits of the dataset ("pseudo-replicates", see Methods section) without any further count data transformations applied. These were also the methods that most effectively controlled the true FDR among the highly expressed genes. Specifically, the highest-ranking methods were DESeq2, qlrt, edgeR, and limma trend. While all four methods had high sensitivity when many cells were compared, we found that DESeq2 had the highest precision at the cost of slightly lower sensitivity. This conservative DE calling made DESeq2 particularly good at controlling the FDR for lowly expressed genes (mean ≤ 0.1). Methods directly using the single-cell data ranked in the middle, with tests specifically modeling count data (qlrt, and single-cell-adapted DESeq2 and edgeR) performing comparably to generic statistical tests. The latter were paired with prior scaling transformations (e.g., logistic regression with log-transformed and median-centered counts), and reported fewer false positives among the highly expressed genes. The bottom of the ranking was made up of methods that clearly failed to control the overall FDR, especially with an increasing number of cells tested. These methods have in common that they were designed to take read counts as input, but did not perform well with the tested parameter choices. In particular, an inappropriate choice of size factor models led to low precision (e.g., qlrt, which worked well with raw counts, did not perform well with some size factors). Notably, a closer look at the full results table revealed that the popular combination of log-counts-per-10k transformation and Wilcoxon rank-sum test was among those low-ranking pipelines (position 198/343, Add. File 1).
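The pseudo-replicate construction referred to above (random splits of each cell group, aggregated to pseudo-bulk) is simple to sketch. The following Python snippet is our own minimal illustration, not the benchmark code (which was run in R); function and variable names are ours:

```python
import numpy as np

def make_pseudo_replicates(counts, cell_groups, n_splits=3, seed=0):
    """Aggregate a genes x cells count matrix into pseudo-bulk
    pseudo-replicates: cells of each group are randomly partitioned
    into n_splits parts, and counts are summed per gene within each
    part, yielding n_splits pseudo-bulk profiles per group."""
    rng = np.random.default_rng(seed)
    profiles, labels = [], []
    for group in np.unique(cell_groups):
        idx = np.flatnonzero(cell_groups == group)
        rng.shuffle(idx)
        for split in np.array_split(idx, n_splits):
            profiles.append(counts[:, split].sum(axis=1))
            labels.append(group)
    return np.column_stack(profiles), labels

# toy example: 5 genes, 12 cells in two groups
counts = np.arange(60).reshape(5, 12)
groups = np.array(["A"] * 6 + ["B"] * 6)
pb, labels = make_pseudo_replicates(counts, groups, n_splits=3)
```

The resulting matrix can then be handed to a bulk DE method (e.g., DESeq2 or edgeR in R) as if the splits were replicates; note that, as the Discussion stresses, these are not a substitute for true biological replicates.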

Immune cell data: Pseudo-bulk methods have highest agreement with independent bulk data results
To investigate the performance of the DE pipelines on real data, we contrasted the three main immune cell compartments (monocytes, B cells, T cells) from a commonly used dataset of peripheral blood mononuclear cells from a single healthy donor [10x Genomics]. We used bulk RNA-seq from sorted cell populations to determine a high-confidence set of DE genes that we considered a ground truth reference. In short, we queried the BLUEPRINT consortium data (http://dcc.blueprint-epigenome.eu) for RNA-seq experiments of venous blood samples for cell types that match our three main cell types. This resulted in 9 B-cell, 10 monocyte, and 17 T-cell samples. Bulk data DE genes were then determined by running edgeR, limma, and DESeq2, and combining the results by intersection (Fig. 3A). All DE pipelines were evaluated and ranked using the same metrics as for the simulated data, with the exception that FDR values shown are averages across all three comparisons (Fig. 3B). Again, among the high-ranking methods we saw mostly bulk data methods with raw counts as input (DESeq2, qlrt, edgeR, limma trend), and a t-test pipeline as a notable exception (Fig. 3 and Add. Files 3&4). The FDR values observed in these experiments are generally much higher than in the simulation experiments, with the smallest value being around 0.3 for the lowly expressed genes and 0.4 for the highly expressed genes. This means that all pipelines returned a high proportion of DE genes that are not in the multi-sample derived ground truth.

Well-established bulk methods using untransformed pseudo-bulk counts consistently outperform other pipelines
To establish which pipelines consistently performed well in our tests, we ordered the pipelines by their worst rank across the two evaluation scenarios (simulated data, immune cell data) and focused on the pipelines that ranked in the top 50 in both cases (Fig. 4). Among these 23 top performers, 15 use the pseudo-replicate pseudo-bulk approach with untransformed counts, and nine of these make up the top of the list (DESeq2, qlrt, edgeR, and limma trend are all represented). The single-cell methods in this list are the t-test (5 times), logistic regression (2 times), and a single-cell-optimized variant of edgeR. Regarding the count data transformations, we observed that in six out of the eight cases the single-cell pipelines used a simple shifted-log with cell size factors defined by deconvolution (scran) or the geometric mean (see Supplementary Table 2 for size factor details). Given that pipelines using pseudo-replicate (PR) data consistently ranked highly in our benchmark, we sought to examine the influence of the choice of the number of PRs on the result. In all experiments, we had included runs with 3, 5, or 7 PRs to evaluate the impact of that parameter. We averaged the simulation and immune cell evaluation scores and assessed performance as a function of the number of PRs (Supp. Fig. 1). We found that three PRs mostly led to better performance than a higher number of PRs (true for qlrt, limma voom, edgeR QLF, DESeq2). In a few cases, three PRs resulted in only slightly lower or very similar performance compared to more PRs (limma trend, edgeR LRT, edgeR exact).

Pipeline runtimes
While we did not focus on pipeline runtimes for evaluation, we kept track of the time it took to perform transformation and DE testing (Supp. Fig. 2&3). Transformations that required model parameter estimation were among the slowest ones, as were those that used scran's deconvolution-based cell size factor. Among the DE tests applied to single-cell data, highly efficient implementations of the t-test, logistic regression, and Wilcoxon rank-sum test stood out as particularly fast, while all model-based approaches were significantly slower (at least 50x). Methods applied to aggregated counts were also fast, with limma and edgeR being the fastest.

Discussion
In this study, we set out to evaluate the performance of differential expression (DE) pipelines on simulated and real single-cell RNA-seq data. We included many combinations of data transformations and statistical tests, as well as test methods originally designed for bulk RNA-seq. Since previous studies have already convincingly shown that bulk methods are the methods of choice when replicates are given [Crowell et al., 2020, Squair et al., 2021], we focused on the scenario where no replicates are available. For example, this would be the case when trying to find markers of different clusters in one sample.

Figure 3: Immune cell experiments. A) A single-cell dataset of healthy PBMC was used for testing the pipelines. For evaluation, a dataset of cell-type-sorted bulk RNA-seq was used. B) All pipelines were evaluated and ranked according to relative performance. C) Top-performing pipelines for every test method (as in Fig. 2).
The results of our simulation experiments showed that bulk methods that use pseudo-bulk raw count data from pseudo-replicates ranked highest and were most effective in controlling the false discovery rate (FDR) for highly expressed genes. This is in line with previous observations [Squair et al., 2021]. For real scRNA-seq data, the top-performing pipelines were also dominated by the same kind of pipelines, but the differences between single-cell and pseudo-replicate methods were less clear. This is in part due to the different nature of the evaluation: biological replicates were used to generate the set of true DE genes, while the evaluation was done using a single sample, resulting in all pipelines suffering from more false positives. Another difference is the smaller variability in sequencing depth between the cell groups being compared: in the simulation experiments one group was downsampled to 85%, while the differences in sequencing depth within the single-cell immune data were very small. This decreased the effect of cell size factors and library size normalization on the overall outcome.
Among the top-performing single-cell pipelines, we found mostly shifted-log transformed count data in combination with a t-test or logistic regression. This is an interesting result, showing that rather simple transformations, when using the right size factors (here the geometric mean, or scran's size factors), can be appropriate for more generalized statistical analyses. This finding matches the good performance Ahlmann-Eltze and Huber [2023] recently reported for the shifted-log transformation in the context of dimensionality reduction, and the highlighting of the t-test, logistic regression, and Wilcoxon rank-sum test by Pullin and McCarthy [2022] for marker gene identification.
Overall, our results indicate that DE pipelines designed for bulk sequencing data are the method of choice also in the single-cell single-replicate scenario. We recommend the use of pseudo-bulk data of three pseudo-replicates, and DESeq2 (lowest FDR) with Wald test, or edgeR (slightly higher sensitivity) with QLF test. Besides being more accurate, these pipelines are also among the fastest that we tested.
In a more general context of single-cell RNA-seq data analysis, our results support the idea that there is not just one data transformation that is suited for all kinds of tasks. Rather, one should use different transformations for different purposes. For example, shifted-log or Pearson residuals would be better suited for dimensionality reduction [Ahlmann-Eltze and Huber, 2023], mapping to manifolds, and nearest-neighbor network creation, while pseudo-bulk of untransformed counts and pseudo-replicates would be better for DE testing. The overall favorable results of pseudo-replicate approaches can also be seen as an endorsement of meta-cell methods, as pioneered by Baran et al. [2019], where homogeneous single-cell profiles are grouped for robust statistics (see also Bilous et al. [2022], Persad et al. [2023]).
Our benchmarking study has limitations. For one, we focused on just one particular sample type, namely PBMCs. Using additional samples from other tissues, especially for real-data experiments, could change the pipeline rankings. However, the overall agreement between simulation data results and real data results is an indicator that we captured true performance differences. Also, the single-cell data used in this pipeline benchmark was generated with 10x Chromium technology, and we did not evaluate how the pipelines performed on data generated using other single-cell technologies. Furthermore, for evaluation, we have not considered true or predicted effect size, only statistical significance. This was a deliberate decision, as effect size (e.g., fold change) depends on both the transformation and the DE method, and is therefore not comparable between pipelines. Thus, by focusing on the binary DE call and its effect direction, we simplified the evaluation and could avoid the use of arbitrary thresholds. Finally, we would like to note that pseudo-replicates are not a substitute for true biological replicates. Only true biological replicates can supply the estimates of inter-individual gene expression variability needed to make robust inferences of DE.

Conclusion
In the benchmark presented here, we see clear evidence for the good performance of bulk DE methods in the single-cell scenario, even in the absence of true replicates. After testing many data transformations and DE tests, we do not see any advantages of true single-cell DE pipelines, leading us to suggest the use of pseudo-bulk data of pseudo-replicates with bulk DE methods (DESeq2, edgeR) instead. To facilitate the use of this recommended methodology, we have implemented these methods together with the generation of pseudo-replicates from count matrices, SingleCellExperiment, and Seurat objects in a simple R package DElegate.

Simulated data
To simulate single-cell data with known DE genes, we used muscat [Crowell et al., 2020]. While simulated data may not accurately reflect real data (see Crowell et al. [2022] for a recent discussion), we expect the relative ranking of pipelines to still be informative in this simple two-group experimental design. As input to muscat, we used two samples: 1) the monocyte cluster from the single-cell immune data described below, and 2) the same cells, but with each cell downsampled to 85% of its total counts. We downsampled one sample to make the task of DE testing more challenging and more similar to real data situations, where total counts may vary between cell types or differ due to sequencing depth. For every simulation experiment, we generated 13 replicates of two samples with 10% of genes DE (using the DE muscat model only).
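The text does not specify the exact downsampling procedure; a common way to implement per-cell downsampling to a fixed fraction of counts is binomial thinning, where each observed molecule is kept independently with the target probability. A sketch under that assumption (Python, our own naming):

```python
import numpy as np

def downsample_counts(counts, fraction=0.85, seed=0):
    """Binomial thinning: each observed molecule is kept independently
    with probability `fraction`, so each cell retains roughly that
    fraction of its total counts in expectation."""
    rng = np.random.default_rng(seed)
    return rng.binomial(counts, fraction)

cell = np.array([100, 40, 0, 10, 850])  # gene counts for one cell
thinned = downsample_counts(cell, fraction=0.85)
```

Thinning preserves the count nature of the data (unlike scaling), which is why it is often preferred for simulating sequencing-depth differences.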

Immune cell data
As "real" single-cell RNA-seq test data, we used a publicly available dataset of approximately 10k healthy human PBMC [10x Genomics]. We then performed filtering to keep only high-quality cells: 1. We excluded cells with more than 15% mitochondrial reads. 2. We z-scored the log10-transformed number of genes detected and excluded cells with values outside the range [-3, 3]. 3. We z-scored the log10 of the number of reads and excluded cells outside of [-3, 3]. We then used Seurat v4.1.0 [Hao et al., 2021] to cluster the cells (we used the following non-default parameters: PCs = 15, distance metric = manhattan, k parameter = 20, n neighbors = 40, cluster resolution = 0.8). We then used celltypist [Domínguez Conde et al., 2022] with the Immune_All_Low model and the best_match option to annotate the cells. To focus on the three main immune cell types, we merged the subtypes of T cells, monocytes, and B cells, respectively. Finally, we kept all clusters with at most 50% "other" cells (e.g., NK cells) and at least 100 cells, and removed all "other" cells. This resulted in a data set of 4,493 T cells, 4,453 monocytes, and 1,028 B cells.
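The three cell-quality filters can be expressed compactly. This Python sketch is our own re-statement of the rules (the original filtering was done in R); thresholds follow the text above:

```python
import numpy as np

def qc_pass(pct_mito, n_genes, n_reads, mito_max=15.0, z_bound=3.0):
    """Boolean mask of cells passing all three QC filters:
    1) mitochondrial read percentage <= mito_max,
    2) z-scored log10 number of genes detected within [-z_bound, z_bound],
    3) z-scored log10 number of reads within [-z_bound, z_bound]."""
    def z(x):
        lx = np.log10(x)
        return (lx - lx.mean()) / lx.std()
    keep = pct_mito <= mito_max
    keep &= np.abs(z(n_genes)) <= z_bound
    keep &= np.abs(z(n_reads)) <= z_bound
    return keep

# toy example: the second cell fails the mitochondrial filter
pct_mito = np.array([5.0, 20.0, 3.0, 8.0])
n_genes = np.array([2000, 1800, 2200, 2100])
n_reads = np.array([10000, 9000, 12000, 11000])
keep = qc_pass(pct_mito, n_genes, n_reads)
```

Note that the z-scores are computed across all cells of the sample, so the gene- and read-count filters are relative rather than fixed cutoffs.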
The immune cell validation data was obtained from the BLUEPRINT consortium data site at http://dcc.blueprint-epigenome.eu. Using the provided metadata, we selected RNA-seq experiments of venous blood samples for cell types that match our three main immune cell types described above. This resulted in 9 B-cell samples from 6 individuals, 10 monocyte samples from 8 individuals, and 17 T-cell samples from 12 individuals. Bulk data DE genes were then determined by running three bulk DE methods (edgeR QLF, limma trend, and DESeq2 Wald) with default parameters, and combining the results by intersection (FDR ≤ 0.05 per method).

Evaluation metrics and pipeline ranking

Simulation experiments
In all simulation experiments, "positives" were defined as the genes identified as DE by a given pipeline (at FDR cutoff of 0.05). True positives (TP) are therefore those positives that were simulated with a non-zero fold change, while false positives (FP) were simulated with a fold change of 0. True negatives (TN) are genes not identified as DE and simulated with a fold change of 0, while false negatives (FN) were simulated with a non-zero fold change. In the event of positives with wrong effect direction (e.g., simulated with positive fold change, but identified as DE with negative fold change), the gene is counted as 1/2 FP and 1/2 FN.
For every experiment we calculated the following metrics:

Precision = TP / (TP + FP)

Sensitivity = TP / (TP + FN)

Matthews correlation coefficient: MCC = (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

We clipped the MCC to a lower bound of 0 to have a wider dynamic range for the better-than-random results (MCC below 0 indicates a bias towards misclassification).

FDR_low = FP_low / (FP_low + TP_low): the FDR among all genes with a geometric mean < 0.1 across the simulation group means. As lowly and highly expressed genes follow different distributions [Grün et al., 2014, Chen et al., 2018, Hafemeister and Satija, 2019], it was important to consider both groups of genes separately to evaluate DE pipelines.

FDR_high = FP_high / (FP_high + TP_high): the FDR among all genes with a geometric mean ≥ 1 across the simulation group means.
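The confusion counts and metrics defined above, including the half-count rule for positives with the wrong effect direction, can be made concrete in code. A minimal Python sketch with our own function name:

```python
import math

def de_metrics(called, direction_ok, true_de):
    """Precision, sensitivity and clipped MCC for one pipeline run.
    called[i]: gene i predicted DE; true_de[i]: gene i simulated DE;
    direction_ok[i]: predicted effect direction matches the simulation.
    A positive with the wrong direction counts as 1/2 FP and 1/2 FN."""
    tp = fp = tn = fn = 0.0
    for c, d, t in zip(called, direction_ok, true_de):
        if c and t and d:
            tp += 1
        elif c and t and not d:       # DE, but wrong direction
            fp += 0.5
            fn += 0.5
        elif c and not t:
            fp += 1
        elif not c and t:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return precision, sensitivity, max(mcc, 0.0)  # MCC clipped at 0

p, s, m = de_metrics(
    called=[True, True, True, False, False],
    direction_ok=[True, False, True, True, True],
    true_de=[True, True, False, True, False])
```

The sketch omits guards for degenerate cases (e.g., zero denominators), which the full benchmark would need to handle.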
Final ranking. To rank the pipelines, we calculated a score for each pipeline: MCC_rescaled + (1 − FDR_low) + (1 − FDR_high). With this score we aimed to capture the overall agreement of the predicted set of DE genes with the ground truth, but also to penalize pipelines biased towards false positives among lowly and highly expressed genes. In detail, we rescaled the MCC to range from 0 to 1 across all pipelines per experimental replicate, then averaged (mean) the replicates per experiment and pipeline, and then averaged (mean) the experiments to obtain one MCC_rescaled score per pipeline. For the FDR, we used the 2k vs. 2k cell experiment and averaged the values across replicates.
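For a single experiment, the rescaling and scoring step can be sketched as follows; the min-max rescaling of MCC across pipelines per replicate is our reading of the description above (Python, our own naming):

```python
import numpy as np

def final_scores(mcc, fdr_low, fdr_high):
    """mcc: pipelines x replicates matrix of clipped MCC values for one
    experiment; fdr_low / fdr_high: per-pipeline FDR vectors.
    Returns MCC_rescaled + (1 - FDR_low) + (1 - FDR_high)."""
    lo, hi = mcc.min(axis=0), mcc.max(axis=0)
    rescaled = (mcc - lo) / (hi - lo)     # 0..1 per replicate column
    mcc_rescaled = rescaled.mean(axis=1)  # average over replicates
    return mcc_rescaled + (1 - fdr_low) + (1 - fdr_high)

# toy example: 3 pipelines, 2 replicates
mcc = np.array([[0.9, 0.8],
                [0.5, 0.4],
                [0.1, 0.0]])
scores = final_scores(mcc, np.array([0.05, 0.2, 0.5]),
                      np.array([0.1, 0.3, 0.6]))
```

In the full benchmark, the MCC_rescaled values from the four experiments would additionally be averaged before adding the FDR terms.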

Immune cell data experiments
For the evaluation of the DE pipelines on real data, we performed DE analysis on bulk RNA-seq data (see Data section above), and treated the results as a proxy for a real "ground truth" (high-confidence set of DE genes). Specifically, for every cell type comparison, we used the intersection of the DE genes called by 1) edgeR with QLF test, 2) DESeq2 with Wald test, and 3) limma-trend. For every method we used an estimated FDR ≤ 0.05, and we required all three DE methods to also agree on the effect direction to add a gene to the DE set.
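The consensus rule, intersecting the per-method DE calls and requiring agreement on effect direction, can be sketched as follows (Python, our own naming; the original analysis was done in R):

```python
def consensus_de(results):
    """results: list of dicts mapping gene -> (passes_fdr, direction),
    one dict per DE method, where direction is +1 or -1.
    A gene enters the consensus set only if every method reports it,
    calls it DE at the FDR cutoff, and all directions agree."""
    genes = set.intersection(*(set(r) for r in results))
    consensus = set()
    for g in genes:
        calls = [r[g] for r in results]
        if all(p for p, _ in calls) and len({d for _, d in calls}) == 1:
            consensus.add(g)
    return consensus

# toy example with three methods; gene "B" fails direction agreement
edger  = {"A": (True, 1), "B": (True, -1), "C": (True, 1)}
limma  = {"A": (True, 1), "B": (True, 1),  "C": (False, 1)}
deseq2 = {"A": (True, 1), "B": (True, -1), "D": (True, 1)}
result = consensus_de([edger, limma, deseq2])
```

Requiring direction agreement makes the ground truth conservative, which is consistent with treating it as a high-confidence core set.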
The definition of the pipeline evaluation metrics was then analogous to the simulation experiments. Note that we expected lower precision and higher FDR than in the simulation experiments. Our ground truth represented a core set of DE genes derived from bulk data with replicates from multiple individuals, while the test data consisted of a single sample from one individual, assayed with a different technology. However, all pipelines should be affected by these differences, and the metrics still allowed us to rank the pipelines according to their overall agreement with the bulk results.
Final ranking. Similarly to the simulation experiments, we calculated a score for each pipeline: MCC_rescaled + (1 − FDR_low) + (1 − FDR_high). We rescaled the MCC to range from 0 to 1 across all pipelines per experiment, and then averaged the experiments to obtain one MCC_rescaled score per pipeline. For the FDR, we averaged the values across experiments.

Additional files
Additional File 1: CSV file with details of simulation experiment results
Additional File 2: PDF file with heatmap of simulation experiment results
Additional File 3: CSV file with details of immune cell experiment results
Additional File 4: PDF file with heatmap of immune cell experiment results

Data and software availability
We have uploaded the data used in our experiments to Zenodo, where it is publicly available at https://doi.org/10.5281/zenodo.7751830.
All code used to execute the benchmark, and a project Dockerfile, are available on GitHub at https://github.com/cancerbits/scDE_benchmark.
The DElegate R package facilitating the use of recommended pipelines is available at https://github.com/cancerbits/DElegate.

Supplementary table excerpt (DE method description): Fit a generalized linear model for negative binomial count data that is aware of group membership and optionally cell size factors. A quasi-likelihood ratio test with the group contrast is used to test for DE. Implemented in glmGamPoi::glm_gp() and glmGamPoi::test_de(). Only used with raw or corrected counts.

Supp. Figure 3: Timing for the different DE methods. For each simulation experiment, the average across replicates is shown. For DE methods that had a runtime dependency on the transformation used, we selected the transformation that resulted in the fastest average runtime.