Benchmarking of a Bayesian single cell RNAseq differential gene expression test for dose-response study designs

The application of single-cell RNA sequencing (scRNAseq) for the evaluation of chemicals, drugs, and food contaminants presents the opportunity to consider cellular heterogeneity in pharmacological and toxicological responses. Current differential gene expression analysis (DGEA) methods focus primarily on two group comparisons, not multi-group dose-response study designs used in safety assessments. To benchmark DGEA methods for dose-response scRNAseq experiments, we proposed a multiplicity corrected Bayesian testing approach and compare it against 8 other methods including two frequentist fit-for-purpose tests using simulated and experimental data. Our Bayesian test method outperformed all other tests for a broad range of accuracy metrics including control of false positive error rates. Most notable, the fit-for-purpose and standard multiple group DGEA methods were superior to the two group scRNAseq methods for dose-response study designs. Collectively, our benchmarking of DGEA methods demonstrates the importance in considering study design when determining the most appropriate test methods.


Introduction
Single-cell transcriptomics enables researchers to investigate homeostasis, development, and disease at unprecedented cellular resolution [1][2][3][4][5] . As with any new innovative technology, diverse tools soon follow to address specific applications and unique challenges. Currently, there are dozens of differential gene expression analysis (DGEA) approaches for single-cell RNAseq (scRNAseq) data; developed based on differences in assumptions, statistical methodologies, and study designs [6][7][8][9][10][11] . A recent comparison of 36 approaches demonstrated acceptable performance for common bulk RNAseq tools such as edgeR and limma-trend, and MAST for snRNAseq, as well as common statistical tests such as the Wilcoxon Rank Sum (WRS) and the t-test 9 .
However, most methods have been developed primarily for two group comparisons whereas experiments that include multiple groups, such as when assessing risk in pharmacology and toxicology studies where doseresponse designs are required. The use of two sample tests for multiple group study designs elevate the type I error rate warranting further investigation of these methods for multiple group dose-response study designs 12 .
Dose-response studies are used to derive the efficacy and/or safety margins such as effective dose and the point of departure (POD). Significant efforts by the toxicology and regulatory communities have suggested that acute (<14 days) and sub-acute (14 -28 days) transcriptomic studies as viable alternative to the current standard 2-year rodent bioassay that significantly reduces the time and resources needed to assess risk 13-15 .
Gene expression profiling at single-cell resolution could further support such evaluations by identifying cellspecific dose-dependent responses indicative of an adverse event. The U.S. National Toxicology Program (NTP) recently reported a robust DGEA approach is essential to deriving biologically relevant PODs 15 . However, concerns regarding the inclusion of false positives that produce less conservative POD estimates potentially leads to incorrect classification of mode-of-action, thus highlighting the importance of controlling type I error rates 16,17 .
Unlike microarray and bulk RNAseq datasets, single-cell RNAseq (scRNAseq) data is zero inflated due to the low per cell RNA input, biases in capture and amplification, transcriptional bursts, and other technical factors 18 . Consequently, scRNAseq test methods usually consider the gene expression distribution as a mixture of a zero and a positively (non-zero) expressed population [19][20][21] . For example, the Seurat Bimod approach tests for differential gene expression using a likelihood ratio test designed for the said mixture population. MAST extends the Seurat Bimod test to a two-part generalized linear model structure capable of incorporating covariates 19,20 . Given the improved performance of MAST 9, 19, 20 , we hypothesized that multiple group tests developed assuming the same distributional framework would be most favorable for dose-response study designs. Furthermore, a Bayesian approach which considers prior knowledge is anticipated to minimize type I error rates 22, 23 .
We propose a novel, multiplicity corrected, Bayesian multiple group test (scBT) designed exclusively for DGEA in zero inflated continuous data populations, characteristic of dose-response scRNAseq data. Two other fit-for-purpose frequentist multiple group tests are also examined: (1) a multiple group extension of the Seurat Bimod test and (2) a simple extension of test (1) to a generalized linear model framework. The proposed methods are benchmarked against commonly used approaches for DGEA on simulated and real experimental doseresponse datasets.

Dose-response single-cell data simulations
For benchmarking of DGEA methods, a ground truth is required. Existing simulation tools such as PowSimR, SymSim, SPsimSeq, and Splatter are commonly used for power analyses, evaluating DE analysis methods, and testing cell clustering strategies [24][25][26][27] . Tools such as SymSim and Splatter are also capable of simulating cell trajectories and model differentiation processes. Trajectories which exhibit non-linear changes over time or across different developmental stages are not unlike dose-response effects which change over a continuum of doses. However, dose-responsive changes commonly follow defined trajectories such as Hill, exponential, power, and linear models 28 . To simulate dose-response scRNAseq data we developed a wrapper for the Splatter scRNAseq data simulation tool named SplattDR. SplattDR modified the Splatter grouped data simulation strategy by adjusting counts from means defined by one of the dose-response functions outlined in the Materials and Methods.
To demonstrate the modeling capability of SplattDR, 10,000 gene expression responses were simulated with a 10% probability of being differentially expressed, equally distributed across the dose-response models.
Parameters used in Splatter were initially estimated from our experimental single nuclei RNAseq (snRNAseq) dose-response dataset. The simulated data compared to the experimental data showed the relationship between the mean expression, percentage of zeroes, and mean variance were consistent (Figs. 1a-b). Estimation of the normalized root mean square deviation (NRMSD) from a curve fit to the experimental data indicated excellent concordance. This strong concordance was also maintained within distinct dose groups (Figs. S1-2). The distribution of log(fold-changes) between vehicle (dose 0) and the highest simulated dose (dose 9; 30 µg/kg) showed a more even distribution within a similar range compared to experimental data which was skewed towards induction (Fig. 1c). However, the gene induction skew was captured by modulating the parameters affecting the probability of differential expression and the proportion of differentially repressed genes (Fig. S3).
Principal components analysis (PCA) of the simulated data clearly showed the dose-dependent characteristics of scRNAseq data with distinct clusters increasing in separation with increasing dose (Fig. 1d) which was also resolved by PCA within the experimental data (Fig. S4).
To our knowledge, no other published in-vivo dose-response scRNAseq datasets are available limiting the number of datasets to estimate initial parameters for simulation to date. To investigate whether existing datasets generated using a different study design (e.g., whole cells or different tissue source) could be used to derive initial parameters, we also simulated 10,000 genes starting with parameters estimated from (i) a two-dose liver snRNAseq (GSE148339), (ii) whole cell liver scRNAseq (GSE129516), and (iii) peripheral blood mononuclear cells (PBMC; GSE108313) datasets. When compared to a model fit for experimental data to determine the relation between mean expression and percent zeroes or mean variance, the NRMSD for data simulated from these datasets were between 1 -10% with data simulated from whole cell data differing the most from the model fit (Fig. 1e). We then explored whether parameters estimated from distinct cell types could replicate the characteristics of that same cell type (Fig. 1f). Not surprisingly, lower NRMSD values were observed for simulated cell-specific data based on experimental dose-response data estimated starting parameters with whole cell data performing the worst. Notably, when data derived from a lower abundant cell subtype was used to estimate starting parameters, the dose-response characteristics for that cell subtype was also poorly modeled (Figs. 1e-f, S1-2).

Performance Accuracy of DE test methods
We evaluated the performance of several differential gene expression analysis methods on simulated datasets consisting of 9 dose groups of 500 cells each (4,500 total) and 5,000 genes with a 10% probability of being differentially expressed (500 differentially expressed genes  34 . With ground truth from simulated data, the sensitivity, specificity, and precision for each test method was computed. Area under the receiver-operating characteristic curve (AUROC) was used to measure test performance for correctly classified differentially expressed genes. In unfiltered data, AUROC scores showed similar performance for most tests except scBT which had the largest AUROC among all test methods (Fig. 2a). To account for the inherent class imbalance between differentially expressed and non-differentially expressed classes the area under the precision-recall curves (AUPRC) was also calculated. Similar to AUROCs, AUPRCs identified scBT as the best performing test ( Fig. 2c). In most standard differential expression testing pipelines genes expressed at low levels are removed to minimize false detection rates. Following filtering of genes expressed in ≤5% of cells in any dose group, scBT was consistently ranked as the best test based on AUROC and AUPRC scores. The performance of LRT linear test also improved, with comparable AUROC and AUPRC scores relative to scBT, suggesting LRT linear is poorly suited for genes expressed at low levels ( Fig. 2b,d).
AUROC and AUPRC reflect the performance of each test method with varying significance (i.e., p-value) thresholds. In the standard pipeline a fixed threshold is used, typically a p-value ≤ 0.05 after adjustment for multiple hypothesis testing (i.e., Bonferroni correction). For each method except scBT, the performance at an adjusted p-value ≤ 0.05 significance criteria was evaluated. In scBT analysis, a gene was considered differentially expressed when the estimated posterior probabilities of the null hypothesis, , | , was less than , where the value was chosen to achieve a target FDR of 0.05. scBT significantly outperformed all other tests in precision rates irrespective of low expression filtering (Figs. 2e, S5). However, scBT was less effective in identifying true positives (Figs. 2f, S5). Applying the filtering criteria improved the recall rates, but the precision rates remain largely unchanged (Figs. 2e,f). Test method classification performance scores were estimated as the Matthews Correlation Coefficient (MCC) which is well suited for unbalanced data 35 . We see that the scBT and LRT linear tests performed best for this metric on both unfiltered and filtered data (Fig. 2g).

Type I error control and power
To investigate test performance in controlling type I errors (false positives), DGEA methods on simulated datasets were examined with 0% DE genes (i.e., negative control). Using the ζ threshold for the computed posterior null probabilities, scBT identified only 1 false positive gene in 2 of 10 simulations (Fig. 3a). ANOVA, scBT, KW, limma-trend, and LRT linear had false positive rates (FPRs) below 3% indicating better performance compared to two group tests. After filtering for genes with low expression levels, scBT still correctly identified all the non-differentially expressed genes and was the best performing test. These are the same tests that had a better FPR control in initial simulations (Fig. 2). To explore whether mean expression or percentage of zeroes influenced type I error rates, a logistic regression model was fit to negative control data. We predicted the probability for each gene to be identified as differentially expressed in the negative control data. While the curve for scBT is missing since few false positives were identified, the predicted FPR for all the other tests except LRT linear were also high for highly expressed genes with few zeroes (Fig. 3b,c). Next, a positive control dataset with 100% differentially expressed genes was simulated to evaluate test performance for detecting true positives.
All tests except scBT exhibited a false negative rate (FNR) ≤ 40% (Fig. 3d). The best performing tests for FNR also had high FPR. Logistic model regression fitting for false negative classification of genes shows that the false negative rates were highest when the mean expression was either too high or too low for all tests (Fig. 3e,f).

Parameter sensitivity analyses
Experimental scRNAseq datasets will vary between cell types, cell composition, and responses depending on the target tissue, treatment, number of cells sequenced, and more. For example, some distinct cell types are very abundant (e.g., hepatocytes), with others present at lower levels (e.g., portal fibroblasts) in hepatic scRNAseq datasets. Moreover, treatments such as exposure to a xenobiotic, can elicit dose-dependent changes in relative proportions of cell types such as the infiltration of immune cells 36 . We investigated the impact by changing cell abundance from 25 to 2,000 cells per dose group and observed an increase in the false positive rate (FPR) when increasing the number of cells (Fig. S6). The scBT and LRT linear tests were less sensitive to an increase in the FPR as cell abundance increased while the total positive rates (TPR + FPR) increased with cell abundance for all methods. Although all tests exhibited comparable performance at low cell numbers (≤ 500), as cell numbers increased scBT outperformed all other tests in both precision and MCC score (Fig. 4a, S6).
Comparison of AUROCs and AUPRCs across cell numbers showed that ANOVA, KW, limma-trend, and LRT linear tests performed best for a small number of cells, but the increase in AUROC was steeper for scBT (Fig.

S7-8).
It was also evident from the experimental snRNAseq dataset that the number of cells per dose group was not fixed. We evaluated the performance of the test methods when the number of cells dose-dependently increased or decreased, and when the number of cells per dose group were taken from experimental data.
Notably, while scBT had the best MCC for increasing number of cells per dose, LRT linear performed better than scBT when the number of cells decreased before and after filtering for genes expressed at low levels (Fig. 4b).
The shift in MCC between increasing and decreasing cell numbers for scBT appears to be driven by a concomitant decrease in FTPR and increase in FNR (Fig. S9).
Unique chemical, drug, environmental contaminant, and natural product classes elicit distinct differential gene expression profiles defined by the mode-of-action (MoA) as well as by their metabolism, potency (sensitivity) and efficacy (maximal response). Differences between compound classes are reflected in the gene expression profile in (a) the proportion of differential expressed genes, (b) the number of induced/repressed genes, (c) the mean fold-change for differentially expressed genes, and (d) the distribution of fold-change for differentially expressed genes. These 4 parameters were modulated in simulated data to determine the effect of the percentage of differentially expressed genes, the mean fold-change (aka location), and the fold-change distribution (aka scale) on test performance. Among these scenarios, changing the proportion of repressed genes had little to no impact on test method performance ( Fig. 4c-f, S14).
Increasing the proportion of differentially expressed genes led to an improvement in MCC except for scBT and LRT linear, though these tests maintained the top MCC scores as well as AUROC and AUPRC (Fig. 4c, S11-13). As the magnitude of the effect increased, LRT linear performed best at the low end while scBT exhibited the greatest improvement in MCC (Fig. 4d). Conversely, while the MCC decreased for most tests when modulating the fold-change scale of differentially expressed genes, scBT improved and was more stable (Fig.   4e). As zero inflation increased, the FPR increased and the precision decreased for all tests (Fig. S23). However, scBT was least affected, and maintained the highest MCC among all tests (Fig. 4f). AUROC and AUPRC values also indicated that scBT consistently outperformed other test methods (Fig. S24-25).

Test method agreement
To assess agreement between tests, the area under the concordance curve (AUCC) for each pair of tests for the top 100 genes ranked by adjusted p-value was calculated as previously described 9, 37 . All methods showed excellent concordance (AUCC ≥ 0.77) with LRT linear showing the poorest consistency compared to all other tests while the limma-trend and ANOVA tests showed perfect agreement with an AUCC of 1 (Fig. S5). Pairwise differential gene expression comparisons between DE, Seurat Bimod, MAST and WRS had AUCCs >0.95 AUCCs while the multiple group tests ANOVA, LRT multiple, KW, and scBT clustered together with AUCC ranging between 0.9 -1. In the absence of nuisance covariates, MAST and Seurat Bimod provided similar results, as expected given their similar mixture normal model structure. Likewise for ANOVA and limma-trend, both of which rely on normality assumptions for testing differential gene expression.

Real dose-response dataset DE analysis
Without ground truth for experimental data, the performance of the differential expression test methods was examined by first evaluating the agreement for each identified cell type (Fig. 5, S26). Genes in the experimental dataset were considered differentially expressed when expressed in ≥5% of cells in at least one dose group and had a |fold-change| ≥ 1.5. In hepatocytes, the most abundant cell type, fewer than 5 genes were not detected in all test methods, with the majority missed by the WRS test (Fig. 5a). Upon closer examination, those genes were not expressed in control hepatocytes. Not surprisingly, for all cell types, the largest intersection was between all tests indicating strong agreement within all test methods. Only a few tests identified a subset of unique genes as differentially expressed, which accounted for a very small fraction. For example, LRT linear identified 12 unique differentially expressed genes in portal fibroblasts one of the least abundant cell types (Fig.   5b). LRT linear was the best performing test for low cell numbers indicating that the 12 unique differentially expressed genes may in fact be true positives. Consistent with simulations of varying cell numbers (Fig. 4a), 24 genes were not identified as differentially expressed by the scBT method for stellate cells which exhibit a dosedependent decrease in numbers (Fig. 5c,d). Although scBT outperformed other tests in most scenarios, it under performed in this scenario. Nevertheless, when ranking genes by significance level (i.e., p-values), AUCC were high for all pairwise comparisons.

Discussion
The goal of this study was to compare the performance of newly developed DGEA test methods for doseresponse experiments to existing analysis methods. Using simulated data to generate ground truth, we evaluated the performance of 9 differential expression testing methods which were broadly classified as either fit-forpurpose, multiple group, or two group tests. Criteria for test method selection was based on previous benchmarking efforts for two group study designs identifying MAST, limma-trend, WRS, and t-test as the best performers 9, 38 . ANOVA and KW tests were also included for evaluating multiple group comparisons, and Seurat Bimod, for having the same modelling framework as scBT, LRT multiple, and LRT linear tests. The test methods were ranked from best to worse (1 -9) based on type I error rate, type II error rate, MCC, AUROC, and AUPRC ( Fig. 6, Table S1).
While several scRNAseq tools have been developed 24-27 , none are developed to simulate dose-response models commonly identified in toxicological and pharmacological datasets 28, 39 . Our SplattDR wrapper for the Splatter package 27 was able to show that simulated data can effectively emulate key experimental scRNAseq data characteristics when simulation parameters were estimated from various Unique Molecular Identifier (UMI)based datasets. In agreement with a previous report, technical and biological factors, such as cell type, does appear to influence gene dropout rates 18 . We primarily focused on 10X Genomics UMI data given the unavailability of real experimental dose-response data generated using other platforms.
Overall, test method performance was consistent with their intended application. For example, fit-forpurpose tests scBT and LRT linear consistently ranked higher followed by multiple groups tests such as KW and LRT multiple. scBT exhibited the best overall performance with excellent FPR control and top ranked MCC while LRT linear struck a balance between type I and type II error rates. The scBT results are not surprising as Bayes factor-based tests have proven to be conservative and consequently more appropriate when false positives are of concern 22, 23 . In the context of investigating chemical or drug MoAs, false positives have the potential to lead to wasted effort and resources in attempts to validation and support findings 40 . Moreover, when assessing a large number of genes, a 5% FP rate (P-value ≤ 0.05) can result in hundreds of FPs that skew MoA classifications 17 .
A single test method was not expected to outperform all other tests under all conditions as previously demonstrated when comparing pairwise testing 6,9,38 . Therefore, we assessed the strengths and limitations of each test method by varying parameters likely to change within and across various experimental datasets. The number and relative abundance of cell types is known to be affected by disease or treatment, and the distribution of differential expression influenced by the chemical, drug, or food contaminant being evaluated 5,36 . scBT consistently ranked at the top under most scenarios, particularly when the mean and standard deviation of the fold-change for differentially expressed genes varied. However, scBT under performed in MCC when the number of cells decrease in a dose-dependent manner which would be expected in treatments which alter cell population sizes (e.g., inflammation). Under these circumstances LRT linear outperformed all other tests with scBT performing similar to the other test methods as evident when 24 differentially expressed genes were not identified by scBT within experimental data for stellate cells which experienced a dose-dependent decrease in relative abundance following TCDD treatment. Although excluding genes expressed at low levels generally improved the performance of all test methods, the comparative performance of test methods did not significantly change in most cases.
Collectively, our findings suggest that scBT and LRT linear fit-for-purpose tests are better suited for the differential expression analysis of dose-response studies and when false positives are of greater concern than false negatives. Moreover, consistent with previous benchmarking efforts, we show that common non-parametric tests such as KW outperform test methods developed for scRNAseq data when the study involves comparisons between multiple groups. Ultimately, each test method performs optimally under diverse scenarios. While the importance of controlling type I error rates is acknowledged, a balance must be struck with type II error rates.
The tradeoff should be determined based on the individual research question being investigated. It may even become reasonable to apply disparate test methods for distinct cell types based on dropout rates, cell abundance, and changes in relative cell proportions given the strengths and weaknesses of each test method.
, ω k,j ∼ Beta(a k,ω , b k,ω ), where m k,0 , τ k,µ , a σ , b σ , a k,ω , b k,ω are the hyperparameters. Now we calculate the marginal likelihood under the null hypothesis and alternative hypothesis. Under the null hypothesis the marginal likelihood is Under the alternative hypothesis we compute the marginal likelihood without any restriction on the K means µ k,j and the zero inflation parameter ω k,j ; k = 1, 2, . . . K . Particularly, we assume that The ratio of the marginal likelihood from H 0 to H a is .
The Bayes factor can be thus be defined as where π(H a ) and π(H 0 ) are the prior probabilities for the alternative and null model, respectively.
To control for multiplicity we adopt the FDR correction approach discussed in 1 . The rejection threshold is estimated in terms of the posterior probabilities of the null hypothesis, p(H 0,j |D j ). For a target FDR α, the procedure rejects all hypotheses with p(H 0,j |D j ) < ζ , where p(H 0,j |D j ) = (1+ 1/BF 01,j ) −1 and ζ is the largest value such that C(ζ)/J(ζ) ≤ α where, J(ζ) = {j : p(H 0,j |D j ) ≤ ζ} and C(ζ) = j∈J(ζ) p(H o,j |D j ).

Derivation of the combined Likelihood Ratio Test Statistic (LRTmultiple)
In this section, we extend the two-sample test proposed by 2 to a test for k-samples. Consider the composite K-sample test H 0 : ω 1 = ω 2 = · · · = ω K = ω and µ 1 = µ 2 = · · · = µ K = µ versus the alternative H a : ω k is different for at least one k and µ k is different for at least one k, k ∈ 1, . . . , K.
Similar to ANOVA, we assume homogeneity for variance parameter σ 2 j . Now, fixing the gene index j, the likelihood ratio test can be defined as; Λ(Y, R) = sup θ∈H 0 L(θ|Y, R) sup θ∈Ha L(θ|Y, R) where the likelihood can be written as; L(θ|Y, R) = k ω e k k (1 − ω k ) n k −e k i∈C k f (Y i,k |µ k , σ 2 ) Y and R represent the gene observation vector and the gene indicator vector across K dose groups and θ = {µ k , σ 2 , π k , k = 1, . . . , K} is the vector of unknown parameters. We define C k to be the set of cells expressing the gene in group k (i.e.C k = {i : R ik = 1}) and e k = i R ik is the cardinality of set C k . Here, f denotes the density function of the normal distribution with parameters µ k and σ 2 . Therefore, it follows that the likelihood ratio test can be written as Λ(Y, R) = sup θ∈H 0 L(θ|Y, R) sup θ∈Ha L(θ|Y, R) sup {ω k ,k=1,...,K} k ω e k k (1 − ω k ) (n k −e k ) × sup {µ,σ 2 } k i∈C k N (Y i,k |µ, σ 2 ) sup {µ k ,σ 2 ;k=1,...,K} k i∈C k N (Y i,k |µ k , σ 2 ) = k k e k k n k e k n k where N (·|µ, σ 2 ) denotes the normal density with mean and variance µ and σ 2 , Λ b is a binomial LRT, Λ n is a normal LRT ,Y + is the set of positive Y values, Y + k = (1/e k ) e k i=1 Y + ik and Y + = (1/ k e k ) k e k i=1 Y + ik . Thus our combined LRT can be computed as the product of a binomial and a normal LRT statistic, both of which can easily be derived using classical statistical theory.

Derivation of the combined Likelihood Ratio Test Statistic for the linear model setup (LRT-linear)
In this section we extend the combined Likelihood Ratio Test Statistic (LRT-multiple) to a linear model setup. Treating dose (d) as a continuous covariate we write µ ij = m 0j + d i m 1j and logit(ω ij ) = ψ 0j +d i ψ 1j . Under the null hypothesis the model can be reformulated as H 0 : µ ij = m 0j and logit(ω ij ) = ψ 0j . Therefore the likelihood function for gene j under the full model can be written as: where R j = I(Y j = 0) denotes the gene expression indicator vector of size n = K k=1 n k , Y + j denotes the positively expressed gene observation vector of size n e = k i R ijk and θ j = {m 0j , m 1j , σ 2 j , ψ 0j , ψ 1j } Using the likelihood function described above, the likelihood ratio test can be derived following the same approach detailed in Section 2. Since R j and Y j are conditionally independent for each gene j, the individuals LRT statistics derived from the logistic and linear regression parts can be summed to obtain an asymptotically χ 2 distribution with the degrees of freedom of the component tests added.