On the discovery of subpopulation-specific state transitions from multi-sample multi-condition single-cell RNA sequencing data

Single-cell RNA sequencing (scRNA-seq) has quickly become an empowering technology to profile the transcriptomes of individual cells on a large scale. Many early analyses of differential expression have aimed at identifying differences between subpopulations, and thus are focused on finding subpopulation markers either in a single sample or across multiple samples. More generally, such methods can compare expression levels in multiple sets of cells, thus leading to cross-condition analyses. However, given the emergence of replicated multi-condition scRNA-seq datasets, an area of increasing focus is making sample-level inferences, termed here as differential state analysis. For example, one could investigate the condition-specific responses of cell subpopulations measured from patients from each condition; however, it is not clear which statistical framework best handles this situation. In this work, we surveyed the methods available to perform cross-condition differential state analyses, including cell-level mixed models and methods based on aggregated “pseudobulk” data. We developed a flexible simulation platform that mimics both single and multi-sample scRNA-seq data and provide robust tools for multi-condition analysis within the muscat R package.

: Schematic overview of muscat's simulation framework. (a) Given a count matrix of features by cells and, for each cell, pre-determined cluster (subpopulation) identifiers as well as sample labels (0), dispersion and sample-wise means are estimated from a negative binomial distribution for each gene (for each subpopulation) (1.1); and library sizes are recorded (1.2). From this set of parameters (dispersions, means, library sizes), gene expression is sampled from a negative binomial distribution. Here, genes are selected to be "type" (subpopulation-specifically expressed; e.g., via marker genes), "state" (change in expression in a condition-specific manner) or equally expressed (relatively) across all samples (2). The result is a matrix of synthetic gene expression data (3); (b) Differential distributions are simulated from a NB distribution or mixtures thereof, according to the definitions of random variables X, Y and Z. (c) t-SNE plots for a set of simulation scenarios with varying percentage of "type" genes (top), DS genes (middle), and difference in the magnitude (logFC) of DS between subpopulations (bottom). (d) Schematic overview of cell-and sample-level approaches for DS analysis. Top panels show a schematic of the data distributions or aggregates across samples (each violin is a group or sample; each dot is a sample) and conditions (blue or orange). The bottom panels highlight the data organization into sub-matrix slices of the original count table.
Aggregation versus non-aggregation methods. The starting point for a differential state analysis is a (sparse) matrix of gene expression, either as counts (with library or size factors) or normal-147 ized data (log-transformed expression values, residuals [38,39] ), where each row is a gene and each 148 column a cell. Each cell additionally has a subpopulation (cluster) label as well as a sample label; 149 metadata should be linked to samples, such that they can be organized into comparable groups 150 with sample-level replicates (e.g., via a design matrix). The data processing aspect, depending 151 on whether to aggregate data to the subpopulation-sample level, is described in the schematic 152 in Figure 1d. The methods presented here are modular and thus the subpopulation label could 153 originate from an earlier step in the analysis, such as clustering [40,41,42] after integration [43,9] or 154 after inference of cell-type labels at the subpopulation- [10] or cell-level [11] . The specific details 155 and suitability of these various preprocessing steps is an active area of current research and a full 156 evaluation of them is beyond the scope of the current work; a comprehensive review was recently 157 made available [44] .

158
For aggregation-based methods, we considered various combinations of input data (log- 159 transformed expression values, residuals, counts), summary statistics (mean, sum), and methods 160 for differential testing (limma-voom, limma-trend , edgeR) that are sensible from a methodolog-161 ical perspective. For example, limma-voom and edgeR operate naturally on pseudobulk counts, 162 while we have also used limma-trend on the mean of log-transformed library-size-normalized 163 counts (logcounts). MAST [34] was run on logcounts; Anderson-Darling (AD) tests [32] and 164 scDD [33] on both logcounts and standardized residuals (vstresiduals) [38] . For the AD tests, we con-165 sidered two distinct approaches to test for equal distributions, with alternative hypotheses having 166 samples different either sample-wise or group-wise (see Supplementary Table 1 and Methods). 167 Performance of differential state detection. First, we generated null simulations where no 168 genes are truly differential (across conditions), to evaluate the ability of methods to control error 169 rates (3 replicates in each of 2 conditions, K = 2 subpopulations). While various methods show 170 mild departures from uniform ( Supplementary Fig. 2a), the Anderson-Darling tests, regardless 171 of whether they were run comparing groups or samples, deviated the furthest from uniform and 172 were the most unstable across replicates. 173 To compare the ability of methods to detect DS genes, we simulated S 1 = S 2 = 3 samples 174 across 2 conditions. To retain the empirical distribution of library sizes, we simulated the same 175 number of genes as in the reference dataset, and selected a random subset of G = 4,000 genes 176 for further analysis to reduce runtimes. We simulated K = 3 subpopulations and introduced 177 10% of genes with DS, with equal magnitude of differential expression across subpopulations 178 (E[logFC] = 2) and randomly assigned to genes across the range of expression strength.

179
To ensure that method performances are comparable and do not suffer from low cell numbers, 180 we simulated an average of 200 cells per subpopulation-sample instance, amounting to a total of 181 ∼ 200 × (S 1 + S 2 ) × K ≈ 3,600 cells per simulation. Each simulation and method was repeated 182 5 times per scenario, and performances were averaged across replicates.

183
In the context of DS analysis, each of the G genes is tested independently in each of K 184 subpopulations, resulting in a total of ∼ G × K differential tests (occasionally, a small number of 185 genes are filtered out due to low expression). Multiple testing correction could thus, in principle, be 186 performed globally, i.e., across all tests (n = G × K), or locally, i.e., on each of the subpopulation-187 level tests (n = G). We compared overall False Discovery Rate (FDR) and True Positive Rate 188 (TPR) estimates computed from both locally and globally adjusted p-values. Global p-value 189 adjustment led to a systematic reduction of both FDRs and TPRs ( Fig. 2a; stratified also by the 190 type of DS) and is therefore very conservative.

191
Moreover, detection performance is related to expression level, with differences in lowly ex-192 pressed genes especially difficult to detect ( Supplementary Fig. 4). On the basis of these 193 observations, for the remainder of this study, all method performances were evaluated using lo-194 cally adjusted p-values, after exclusion of genes with a simulated expression mean below 0. 1. 195 In general, all methods performed best for genes of the DE category, followed by DM, DP, 196 and DB (Fig. 2a). This level of difficulty by DS type is to be expected, given that genes span  Fig. 5a). Although the differential detection performance does not seem to be compromised, 211 applying the logarithm transformation (with an offset to avoid zero) to the rather low counts 212 of cell-level data attenuates the scale and thus the magnitude of the estimated logFCs.   Figure 2: DS method performance across p-value adjustment types, differential distribution categories, and subpopulation-sample cell counts. All panels show observed overall true positive rate (TPR) and false discovery rate (FDR) values at FDR cutoffs of 1%, 5%, and 10%; dashed lines indicate desired FDRs (i.e., methods that control FDR at their desired level should be left of the corresponding dashed lines). For each panel, performances were averaged across 5 simulation replicates, each containing 10% of DS genes (of the type specified in the panel labels of (a), and 10% of DE genes for (b); see Figure 1b  remainder of methods, simulated and estimated logFC showed high correspondence across all gene categories.

215
To investigate the effect of subpopulation size on DS detection, we ran methods on simulations 216 containing 10% of DE genes using subsets of 20 to 400 cells per subpopulation-sample (Fig. 2b).

226
To investigate overall method concordance, we intersected the top ranked DS detections 227 (FDR < 0.05) returned by each method across 5 simulation replicates per DS category (Fig. 3). 228 We observed overall high concordance between methods, with the majority of common hits being 229 truly differential. In contrast, most isolated intersections, i.e., hits unique to a certain method, 230 were genes that had been simulated to be EE and thus false discoveries. Methods with vstresiduals 231 as input yielded a noticeably high proportion of false discoveries.

232
Using a different anchor dataset as input to our simulation framework yielded highly consistent alternatively, can act outside the BBB by stimulating afferent nerves, acting at circumventricular 245 organs, and altering BBB permeabilities and functions [45,46,47,48] . 246 We sought to investigate the effects of peripheral LPS administration on all major cell types   Figure 3: Between-method concordance. Upset plot obtained from intersecting the top-n ranked gene-subpopulation combinations (lowest p-value) across methods and simulation replicates. Here, n = min(n 1 ,n 2 ), where n 1 = number of genes simulated to be differential, and n 2 = number of genes called differential at FDR < 0.05. Shown are the 40 most frequent interactions; coloring corresponds to (true) simulated gene categories. Bottom right annotation indicates method types (PB = pseudobulk (aggregation-based) methods, MM = mixed models, AD = Anderson-Darling tests).
in mouse frontal cortex using single-nuclei RNA-seq (snRNA-seq). The goal was to identify genes  We identified 915 genes with differential states (FDR < 0.05, |logFC| > 1) in at least one 260 subpopulation, 751 of which were detected in only a single subpopulation ( Supplementary Fig.   261 13). Since relying on thresholds alone is prone to bias, we next clustered the (per-subpopulation) 262 fold-changes across the union of all differentially expressed genes (Fig. 4d). We observed a dis-  Fig. 14). 270 We next sought to estimate how homogeneous the effects observed at the pseudobulk-level Data is split horizontally by subpopulation and vertically by consensus clustering ID (of genes); top and bottom 1% logFC quantiles were truncated for visualization. Bottom-row violin plots represent cell-level effect coefficients computed across all differential genes, and scaled to a maximum absolute value of 1 (each violin is a sample; coloring corresponds to group ID); effect coefficients summarize the extent to which each cell reflects the population-level fold-changes (see Methods).

Discussion
We have compared what can be considered as in silico sorting approaches for multi-subpopulation 286 multi-sample multi-condition scRNA-seq datasets, where the interest is to follow each cell sub-287 population along the axis of samples and conditions; we refer to these generally as differential subpopulation-specific changes in brain tissue from mice exposed to peripheral LPS treatment.

293
Aggregating data from a subpopulation to a single observation (per sample) is a natural 294 approach to the DS problem [20,21] , but it still remained to be demonstrated how effective it DS detection sensitivity (TPR) and specificity (FDR) for each type of differential distribution, uniformity of p-value distributions under the null (null simulation), concordance between simulated and estimated logFCs (logFC estimation), ability to accommodate complex experimental designs (complex design), and runtimes (speed). Top annotation indicates method types (PB = pseudobulk (aggregation-based) methods, MM = mixed models, AD = Anderson-Darling tests). Null simulation, logFC estimation, complex design and runtimes received equal weights of 0.5; TPR and FDR were weighted according to the frequencies of modalities in scRNA-seq data reported by Korthauer et al. [33] : ∼ 75% unimodal, ∼ 5% trimodal and ∼ 25% bimodal, giving weights of 0.75 for DE, 0.125 for DP and DM, and 0.05 for DB.
is. Based on our simulation results, the tested aggregation-based DS methods were extremely fast and showed overall a stable high performance, although depending on the scale of the data 297 analyzed, logFCs were attenuated for some combinations. While mixed model methods performed 298 similarly well, their computational cost may not be worth the flexibility they provide (Fig. 5 and 299 Supplementary Fig. 11). Methods developed specifically for scRNA-seq differential analysis were 300 outperformed by aggregation and mixed models, but it should be mentioned that these methods 301 focus on comparing sets of cells and were not specifically designed for the multi-group multi-sample 302 problem. Furthermore, methods that compared full distributions did not perform well overall (Fig.   303   5). This latter class of methods was used here as a reference point, but could also be improved to into subpopulations ("types"), is itself an active and debated area of research [2,3] and one that 316 already applies a computational analysis on a given dataset, whether that be clustering or manual or 317 computational assignment; in fact, combining computational and manual assignment was recently 318 listed as best practice [44] .

319
Although not discussed here, researchers would generally first apply a differential abundance of cells (Fig. 2). Going forward, it would be of interest to further explore the origins of this 342 instability, in order to better maximize sensitivity while still controlling for errors.

343
Another aspect to consider in this context is the resolution of subpopulations subjected to 344 differential testing; for example, there is an analogous tradeoff between sensitivity (e.g., larger 345 subpopulations) and specificity (e.g., effects that target particular subpopulations). Here, methods 346 that integrate the relationship between subpopulations (e.g., treeclimbR [50] ) could be applied as 347 an additional layer to improve signal detection.

348
In the process of this study, we created a flexible simulation framework to facilitate method signal, the percentage of differential genes, as well as the set of affected subpopulations could be 360 varied; or, ii) implementing type genes such that they are not specific to a single subpopulation, 361 perhaps even in a hierarchical structure to represent markers of both broad and specific cell types.

362
Taken together, we expect our simulation framework to be useful to investigate various scRNA-seq 363 data analyses, such as batch correction frameworks, clustering, reference-based cell-type inference 364 methods, marker gene selection methods as well as further developments in DS analysis.

365
Although we set out with the goal of discovering subpopulation-specific responses across 366 experimental conditions, one needs to be careful in how strongly these claims are made. Absence 367 of evidence is not evidence of absence. In particular, there is a potentially strong bias in statistical 368 power to detect changes in larger cell populations, with decreased power for rarer populations.

369
Statistical power to detect changes in cell states also relates to the depth of sequencing per cell; 370 for example, it has been speculated that cell states are a secondary regulatory module [3] and it of gene g in sample s(c), λ c is the library size (total number of counts), and φ g is the dispersion.

397
In order to introduce a multi-subpopulation, multi-sample data structure, we sample a set 398 of K clusters as reference, as well as S reference samples for each of two groups, resulting in , while a small set (p · 100 percent) of type-specific genes 403 are sampled separately for each subpopulation (G k ∩ G k = Ø ∀ k = k ), giving rise to distinct 404 subpopulations. Secondly, for each sample s and subpopulation k, we draw a set of cells C * sk ⊂ C sk 405 (and their corresponding λ c , β s(c) g and φ g ) to simulate (negative binomial random variables) from, 406 where cells C sk belong to the corresponding reference cluster-sample drawn previously.

407
Lastly, differential expression of a variety of types is added for a subset of genes. For each 408 subpopulation, we randomly assign each gene to a given differential distribution category accord-

431
Depending on the specific method, which includes both a type of data to operate on (e.g., counts, 432 logcounts) and summary function (e.g., mean, sum), the varying number of cells between samples 433 and subpopulations is accounted for prior to or following aggregation. For logcounts methods, 434 we apply a library size normalization to the input raw counts. vstresiduals are computed using R 435 package sctransform's vst function [38] . For scalecpm, we calculate the total library size of each 436 subpopulation k and sample s as where G represents the number of genes, C sk is the total number of cells in sample s that have 438 been assigned to subpopulation k, and y gc denotes the counts observed for gene g in cell c. We 439 then multiply the CPM of a given sample and subpopulation with the respective total library size 440 in millions to scale the CPM values back to the count scale: edgeR-based methods were run using glmQLFit and glmQLFTtest [51] ; methods based on 442 limma-voom and limma-trend were run using default parameters. testing and moderation were applied subpopulation-wise.

452
For the first approach (MM -dream), we relied on the variancePartition [52] package's implemen-453 tation for repeated measurement bulk RNA-seq, using voom's [25] precision weights as originally 454 described but without empirical Bayes moderation and the duplicateCorrelation [53] step, as this 455 was computationally intensive and had a negligible impact on the significance (as also observed pre-456 viously for batch effects [21] ). Method MM -dream2 uses an updated alternative to this approach 457 using variancePartition's new weighting scheme [Hoffman2020-variancePartition] instead of voom.

458
For the second approach (MM -vst), we first applied the variance-stabilizing transformation glob-459 ally before splitting cells into subpopulations, and then fitted the model using the lme4 package [54] 460 directly on transformed data (and without observational weights). We then applied eBayes mod-461 eration as in the first approach. We tested both the variance-stabilizing transformation from 462 the DESeq2 package [23] , and that from the sctransform package [38] , the latter of which was 463 specifically designed for Unique Molecular Identifier (UMI) based scRNA-seq; since the latter out-464 performed the former (data not shown), it was retained for the main results shown here.

465
For the GLMM-based approach (MM -nbinom), we supplemented the model with an offset equal to the library size factors, and fitted it directly on counts using both Poisson and negative binomial anesthetizing the animals with isoflurane followed by decapitation. Brains were quickly frozen and stored at -80 • C. the hybrid method of the scds package [56] , removing the expected 1% per thousand cells captured 529 with the highest doublet score. Quality control and filtering were performed using the scater [57] 530 R package. Upon removal of genes that were undetected across all cells, we removed cells whose Next, we used Seurat [43,9] v3.0 for integration, clustering, and dimension reduction. Integration well as dimension reductions (t-SNE [58] and UMAP [59] ) were computed using the first 20 principal 539 components. For clustering, we considered a range of resolution parameters (0.1-2); downstream 540 analyses were performed on cluster assignments obtained from resolution 0.2 (22 subpopulations).

541
Cluster merging and cell-type annotation were performed manually on the basis of a set of 542 known marker genes in conjunction with marker genes identified programmatically with scran's 543 findMarkers function [60] , and additional exploration with iSEE [61] . We identified 8 subpopula-

547
DS analysis was run using edgeR [22] on pseudobulk (sum of counts), requiring at least 10 cells 548 in at least 2 samples per group for a subpopulation to be considered for differential testing; the 549 CPE cells subpopulation did not pass this filtering criterion and were excluded from differential 550 analysis. Genes with FDR < 0.05 and |logFC| > 1 were retained from the output. To distinguish 551 subpopulation-specific and shared signatures, we assembled a matrix of logFCs (calculated for 552 each cell subpopulation) of the union of all differential genes (FDR < 0.05 and |logFC| > 1), and 553 performed consensus clustering of the genes using the M3C package [62] (penalty term method), 554 choosing the number of clusters with the highest stability.

555
To estimate per-cell effect coefficients, we calculated dot products of each cell's normalized log-556 expression and the group-level logFCs using only the DS genes detected for the corresponding 557 subpopulation.

558
Performance summary criteria. For each of the metrics in Figure 5, method performances are 559 considered to be 'good', 'intermediate', 'poor' or 'NA' (not available). Method assessments were 560 made as follows:

561
• TPR/FDR: For each type of DD category, we consider TPRs and FDR at FDR 5% averaged 562 across two references, 5 simulation replicates and 3 clusters (Fig. 2a) i.e., multiple replicates across two conditions.

578
• speed summarizes the runtimes recorded for increasing numbers of cells and genes, respec-579 tively ( Supplementary Fig. 11). Scores are given according to the three major groups  Software specifications and code availability. All analyses were run in R v3.6.2 [63] , with 591 Bioconductor v3.10 [64] . Performance measures were calculated using iCOBRA [65] , and results

592
were visualized with ggplot2 [66] , ComplexHeatmap [67] , and UpSetR [68] . All package versions used 593 throughout this study are captured in Supplementary File 5. Data preprocessing, simulation and a browseable workflowr [69] website for the LPS dataset analysis (Supplementary File 3). All 596 aggregation and DS analysis methods are provided in the muscat R package, which is available 597 at https://www.bioconductor.org/packages/muscat through the open-source Bioconductor 598 project.

599
Data availability. The original droplet scRNA-seq data from Kang et al. [20] is deposited un-